Monday, October 31, 2016

Linked Data "Magic" and "Big" data

Start-of-Term Feast at Hogwarts from http://harrypotter.wikia.com/wiki/Great_Hall CC BY-SA


"Looking up" and "discovering things"


In his famous 2006 post on Linked Data, Tim Berners-Lee stated four principles of Linked Data:

  1. Use URIs as names for things
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL)
  4. Include links to other URIs. so that they can discover more things.

This whole idea sounds so cool that it seems almost magical.  If we use HTTP URIs [1], we can "look up" things and "discover more things".  But as a practical matter, how do we "discover more things"?
From the standpoint of machine-aided learning on the Web of Data, "discovery" could fall into several categories.  "Discovery" could mean adding information to what a machine client has already accumulated.  In the language of RDF, this sort of learning happens when additional triples are added to the graph, either by merging newly discovered triples into the existing graph, or by materializing triples that are entailed by the triples already in the graph. [2]

From the human perspective, "discovery" can happen when a person queries the data in the graph and finds connections that were not previously recognized - either because the newly added triples have previously unknown connections to existing triples, or because the links among the data are so complex that connections between resources of interest were not apparent.

On a small scale, "looking things up" seems rather trivial: a client discovers a link to a new HTTP URI in an existing triple, dereferences the URI and receives some RDF, and then has additional triples connected to the existing one.  Voilà!  However, when one starts operating on the scale of hundreds, thousands, or millions of URIs that need to be "looked up", the situation gets more complicated.

As I was pondering this, Harry Potter came to mind.  In the first book, Harry and his friends are sitting in the Great Hall of Hogwarts at the welcoming feast and food magically appears on their plates, one course after another.  However, in later books, Hermione discovers that the food is actually prepared by house elves toiling away in the basement of Hogwarts.  Magic is required to make the food appear on the students' plates, but the actual creation of the food is labor-intensive, low-tech, and depends on virtual slavery.  I then imagined myself as the person at Hogwarts whose job it was to manage the house elves and figure out how to get all of that food from the kitchen onto the plates.  The process didn't really seem magical at all - actually more like tedious and uninteresting.

The "Hogwarts kitchen chore" here is figuring out how to connect my million triples to a million triples somewhere else so that I can query across the entire graph.

Approach 1: Retrieve data about individual resources by dereferencing their URIs

We do have the "magic" of HTTP at our disposal, but does my client really make a million HTTP calls to a server somewhere and then deal with the results one-by-one as they are received?  That doesn't seem very practical.  One alternative would be to save the triples as I've retrieved them.  It would still require a lot of HTTP calls, but I would only have to do it once if I stored the results in my local triplestore.  I would essentially be a little "Google bot" scraping the Linked Data Web.  In a previous post [3], I described playing around with a little Python script to do a miniaturized version of this, and I hope to play around with this approach more in the future.
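For a single resource, this kind of retrieval can even be done without leaving SPARQL: the SPARQL 1.1 Update LOAD operation dereferences a document IRI and adds whatever RDF comes back to the store.  Here is a minimal sketch, using the GeoNames document URI pattern that shows up in the dump excerpt later in this post (the target graph IRI is just a placeholder, and whether a given triplestore handles redirects and content negotiation for you is implementation-dependent):

# retrieve the RDF document describing one GeoNames feature
# and add its triples to a local named graph
LOAD <http://sws.geonames.org/3/about.rdf> INTO GRAPH <http://example.org/geonamesCache>

Doing that a million times is, of course, exactly the problem.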

Approach 2: Federated SPARQL query

If the million other triples that you want to connect your graph to are located in one place, and if that place provides access to those triples through a SPARQL endpoint, you could just leave those triples there and carry out your SPARQL query explorations using a federated SPARQL query.  This seems like a simple solution, since there isn't any need to move all of those triples from there to here.  Last fall, our Semantic Web Working Group at Vanderbilt studied Bob DuCharme's book Learning SPARQL, and as part of our experimentation, we tried running some federated queries.  You can see two examples here.

What I discovered from that exercise is that I can't just throw together the typical sloppy SPARQL query that I'm inclined to write.  The bindings coming from the remote server have to be transferred to the local server running the query before the federated query can be completed.  If there are few bindings (as in example 10), the query executes with no noticeable delay.  On the other hand, if there are many (as in example 11), the query cannot be completed without transferring a massive amount of data from the remote server.  Either the query times out, or you wait forever for it to complete.  The same query could be completed in a short amount of time if all of the data were in a single triplestore.  So federated queries are a potentially powerful way of looking for connections between two large blobs of data, but they have to be constructed carefully, with thought toward keeping the number of bindings that must be transmitted from the remote server to a reasonable size.
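For reference, a federated query uses the SERVICE keyword to send part of the query to a remote endpoint, and the bindings for the shared variables are what get shipped between the two servers.  Here is a minimal sketch; the local triple pattern, the linking property, and the choice of the Getty endpoint are purely illustrative:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?localThing ?remoteLabel
WHERE {
  # pattern matched against the local triplestore
  ?localThing rdfs:seeAlso ?remoteUri .
  # pattern evaluated by the remote SPARQL endpoint
  SERVICE <http://vocab.getty.edu/sparql> {
    ?remoteUri skos:prefLabel ?remoteLabel .
  }
}
LIMIT 10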

Approach 3: Retrieve the entire remote dataset in one blob

If the million triples that you want to combine with your million triples come from a single source, and if that source provides an RDF dump, you can just download the whole dump and load it into your local triplestore.  This may seem like overkill, but if the primary limitation is transfer of data across the Internet (as it is in Approaches 1 and 2), this approach may make the most sense.  Because of the large amount of redundancy in the literals and URIs, a large RDF graph serialized as N-Triples (a typical serialization choice for an RDF dump) will often compress to a size that is an order of magnitude smaller than the uncompressed size.  For example, the entire contents of the Getty Thesaurus of Geographic Names (containing pretty much the name and location of every major place on earth) as RDF/N-Triples is 13.8 Gb uncompressed, but only 661 Mb zip compressed (a factor of about 20).  With a good high-speed network, you can retrieve the entire dataset in a time measured in minutes.  However, getting that dataset loaded into a triplestore is an adventure whose description will take up most of the rest of this post.

Approach 4: Just build the whole giant graph in one place

When I was first pondering this post, I was only considering three possible solutions to the problem.  However, today I attended Cliff Anderson's excellent Getting Started with Wikidata workshop at the Vanderbilt Heard Library, and a fourth possibility occurred to me.  Eventually, maybe everyone who cares about linking data will just put it all in one place.  Clearly that is not a short-term solution, and it seems a little silly to expect that it could happen.  But ten years ago I considered Wikipedia a joke, and now it's probably one of the best things on the web, which shows what can be accomplished when a bunch of passionate, dedicated people work together to build something great.  If a similar amount of effort were expended on Wikidata, people who are serious about linking quality data might just put it all in one place, and the whole problem of moving triples and bindings from one place to another would become moot.

First experiment: GeoNames RDF dump

In Linked Data and Semantic Web propaganda pieces, the examples used as illustrations often involve "toy" datasets that contain between 10 and 100 triples.  Last February, I decided to up the ante a little by conducting some experiments using the Bioimages RDF graph, which contains about 1.5 million triples and is downloadable as a compressed RDF/XML serialization.  You can read about the results of those experiments as part of another blog post [4].  In a nutshell, a local installation of Callimachus took about 8 minutes to load 1.2 million triples from a 109 Mb XML file, while Stardog took 14 seconds.  Needless to say, in my current set of experiments I did not bother testing Callimachus.

In the past, I've linked Bioimages locations to GeoNames URIs for the geographic features that contain those locations.  It would be cool to leverage the hierarchies of political subdivisions described in the GeoNames database (but not fully available in the Bioimages dataset).  That would allow me (among other things) to access the multilingual labels for features at any level in the hierarchy.  This was the driving motivation behind my little Python experiment [3] to grab triples for the features to which I had linked.  However, that method introduces management issues.  When I add links to new features, do I re-scrape all of the linked URIs to generate an updated file containing all of the relevant GeoNames triples, or do I keep a list of only the new geographic features to which I've linked and keep adding one small GeoNames subgraph after another to my Bioimages graph?  Mostly, I just forget to do anything at all, which breaks queries that traverse links involving new data added to Bioimages.

The whole GeoNames triple management issue would go away if I could just load the entire GeoNames RDF dump into the same triplestore as my Bioimages graph.  GeoNames makes an RDF dump available via their ontology documentation page.  The dump contains data about 10 951 423 geographic features described by 162 million triples (as of February 2016).  The compressed zip file is 616 Mb, which is a reasonable download, but when uncompressed, it expands to a 14.7 Gb file called "all-geonames-rdf.txt".  What the heck does that mean?  ".txt" is not a typical file extension for any RDF serialization of which I'm aware.

When you are my age, you've picked up a number of bad habits, and one that I picked up from my father was to read the instruction book only as a last resort.  So my first effort was to just try to load the "all-geonames-rdf.txt" file into Stardog.  No dice.  Stardog immediately complained that the file wasn't in a valid RDF format.  So how do you find out what is actually inside a single text file that's 15 Gb?  Not by opening it in Notepad (yes, I did try!).  It's also way too big for rdfEditor (no surprise, although rdfEditor can open and validate the 1 239 236-triple images.rdf file from Bioimages with no problem).  I considered OpenRefine, but I'd already tried and failed to open a 2.7 Gb Getty Thesaurus N-Triples file with it, so I didn't bother. [5]   So it became apparent that I wasn't going to open the whole file in any text editor.

With some help from StackOverflow, I next wrote a little batch file that would display one line at a time from a text file.  I was quickly able to determine that, for some reason, the GeoNames people had chosen to do the dump as 11 million separate RDF/XML-serialized graphs, with each graph preceded by a plain-text line containing the URI of that feature, all concatenated into a single file, like this:

http://sws.geonames.org/3/
<?xml version="1.0" encoding="UTF-8" standalone="no"?><rdf:RDF xmlns:cc="http://creativecommons.org/ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:gn="http://www.geonames.org/ontology#" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:wgs84_pos="http://www.w3.org/2003/01/geo/wgs84_pos#"><gn:Feature rdf:about="http://sws.geonames.org/3/"><rdfs:isDefinedBy rdf:resource="http://sws.geonames.org/3/about.rdf"/>...<gn:locationMap rdf:resource="http://www.geonames.org/4/rudkhaneh-ye-zakali.html"/></gn:Feature></rdf:RDF>
http://sws.geonames.org/5/
<?xml version="1.0" encoding="UTF-8" standalone="no"?><rdf:RDF xmlns:cc="http://creativecommons.org/ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:gn="http://www.geonames.org/ontology#" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:wgs84_pos="http://www.w3.org/2003/01/geo/wgs84_pos#"><gn:Feature rdf:about="http://sws.geonames.org/5/"><rdfs:isDefinedBy rdf:resource="http://sws.geonames.org/5/about.rdf"/>...<gn:locationMap rdf:resource="http://www.geonames.org/5/yekahi.html"/></gn:Feature></rdf:RDF>
http://sws.geonames.org/6/
<?xml version="1.0" encoding="UTF-8" standalone="no"?><rdf:RDF xmlns:cc="http://creativecommons.org/ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:gn="http://www.geonames.org/ontology#" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:wgs84_pos="http://www.w3.org/2003/01/geo/wgs84_pos#"><gn:Feature rdf:about="http://sws.geonames.org/6/"><rdfs:isDefinedBy rdf:resource="http://sws.geonames.org/6/about.rdf"/>...<gn:locationMap rdf:resource="http://www.geonames.org/6/ab-e-yasi.html"/></gn:Feature></rdf:RDF>

etc.  Duh.  I suppose you could call that an RDF dump of some sort, but not one that was very easy to use.

Given my success with earlier attempts to use the cool rdflib Python library to manipulate RDF data, I decided to write a Python script to import the stupidly formatted GeoNames graphs and output them as a single N3-formatted file.  You can see my attempt in this Gist.  The idea was to read in a junk text line and do nothing with it, then read in the next line containing the whole XML serialization for a feature, parse it, and add it to a growing accumulation graph.  Repeat 11 million times, then write the whole thing out in N3 serialization.  Although rdflib has a simple function to parse an RDF/XML file when it's opened, I couldn't figure out a way to get it to do the parsing on a line of text assigned to a variable.  Finally, I gave up and just wrote the line to a file and then opened the file with the parse function.

It was a great idea, but of course it hung when I tried to run it on the whole file.  At first, I thought that the 11 million file writes and reads were the problem.  But when I hacked the program to print a number every time it added 10 graphs to the accumulation graph, I could see that the real problem was that as the accumulation graph got bigger, adding each additional graph took longer.  By the time it got to the 1000th graph, it was clear that it was going to take forever to do the remaining 10 999 000 graphs.

That was a disappointment, but like all good programmers, I was undaunted and searched the Internet for a program that somebody else had written to do the job.  I found a Python 2 script written by Rakebul Hasan, which I hacked to work in Python 3 (Gist here).  I also added a counter to put a number on the screen after every 10 000th graph was added.  I let the script run for something like 6 hours, and in the end, I had a single file full of N-Triples!  Hooray!

I compressed the output file from 18.3 Gb down to 1.2 Gb and Dropboxed it to my work computer, where I had planned to load it into Stardog.  However, my attempt failed due to my aforementioned issue of not reading the instructions before starting in.  I had forgotten that the free Community version of Stardog is limited to 25 million triples per database (I was way over with the 162 million GeoNames triples).  I think our Semantic Web Working Group will eventually try the free 30-day Stardog trial (Developer license), which has no data limits, after we figure out all of the tests that we want to run on it.  When we do that, we can try loading the GeoNames triples into Stardog again.

Second experiment: Loading 132 million Getty TGN triples into Blazegraph

After that debacle, I decided to try loading the entire Getty Thesaurus of Geographic Names (TGN) RDF dump into Blazegraph.  Unlike Stardog, the open-source Blazegraph has no restrictions on the number of triples that you can have in a store.  I decided to load the TGN first (instead of GeoNames) because its dump is broken into a number of smaller files, allowing me to look more closely at the effect of file size on load time.

I should note here that I'm using the term "Blazegraph" interchangeably with "Bigdata".  Technically, I believe that the actual data store itself is referred to as "Bigdata" whereas the software that interacts with it is "Blazegraph".  You can read the Getting Started guide yourself and try to figure out the distinction.  For convenience, henceforth I'm going to just use Blazegraph to refer to the whole system.

I had only recently set up Blazegraph on my work computer and hadn't yet had time to try it out.  Being the black sheep of the Vanderbilt computing community, I have a frumpy Windows machine in my office, unlike the cool Linux people and the stylish Mac people.  So of course, after I installed Blazegraph on my machine, it threw a Windows-specific error instead of running SPARQL queries.  I can't remember what the error was (I wrote it down and lost the paper), but 5 minutes of Googling found me the (Java?) settings change I had to make to fix the problem.  So I was ready to roll.

The Getty TGN Explicit Exports RDF dump is a zip archive that contains 17 N-Triples-serialized files of varying sizes that describe various categories of data in the thesaurus.  The file names give some hint of what the files contain, and I was able to determine more exactly what kinds of properties were included in each file by taking a screenshot of the Blazegraph "Explore" screen for a particular resource before and after loading each file.  For example, after adding the first file (1subjects.nt) to the triplestore, I saw this:

After adding the second file (2terms.nt), I saw this:

The difference between the two screens showed that the "2terms.nt" file contained the triples linking to the preferred label data for the resources.

When I tried loading the first file, I was a bit alarmed because nothing seemed to be happening for a long time, but I was patient, and after about a half hour Blazegraph reported that the file had loaded!  Because of their size, some files took far longer than others to load.  Here is a table showing approximately how long files of various sizes took to load:


I decided against loading TGNOut_RevisionHistory.nt because, although the revision history is important, it was not critical for the use cases I cared about, and given that it was 2-3 times as large as the other big files I'd been loading, I didn't really want to invest the hours that it would probably take to load.  The loading times are approximate because I was working on this project while other things were going on and I wasn't always paying close attention.  At the end of a load, Blazegraph gives an elapsed time in ms, but I didn't notice this until after I'd already loaded several files, so I wasn't diligent about recording the value.

One interesting feature of the data shown above is that although there is a definite relationship between file size and loading time, it isn't linear.  Loading the first 2.8 Gb file took a surprisingly short time.  I wasn't actually timing it, but I don't think it was more than a half hour.  So I was surprised when the next file (4.5 Gb) took about 7.5 hours to load.  One thing to note is that file size is not necessarily proportional to the number of triples, since triples with long literal values require more characters than triples with short literals or IRI values.  However, it seems unlikely that this was the entire explanation.  Having already confessed to being bad about reading instructions, I'll concede that the answer may be somewhere in the Blazegraph documentation, but I haven't read it carefully enough yet to have figured it out.

Clues about the reason for the long load times can be found by looking at the system performance while Blazegraph was loading triples:


You can see that immediately after I gave the command to load another file, the disk active time jumped from near zero to near 100%.  Part of this may just be disk writing related to saving the ingested triples, but I suspect that the major issue is that Blazegraph has to write to the disk continually because there isn't enough memory available to do what it needs to do.  I allocated 4 of the 8 Gb of memory on the computer to Blazegraph, and while it was loading triples, memory usage would go over 90%.  I was also using the computer to do work at the same time and had other applications open that were competing for memory.  The whole triple-loading process was a real drag on the system, and anything simple that involved disk use, like a file-open dialog, took forever.

There is a lot that I don't know about what goes on while Blazegraph is loading triples, but it is clear that for a task of this magnitude, it would be better to use a machine with more than 8 Gb of memory.  A solid-state drive might also help speed up whatever reading and writing can't be avoided in the process.

Where are the labels?!

Here is what I ended up with when I used the Explore tab of Blazegraph after loading the files:
If you compare this with what you see in a browser when you dereference http://vocab.getty.edu/tgn/1014952, everything is there except for several triples that were in the revision history file that I didn't load.  However, if you dereference the same URI requesting Content-type: text/turtle (or click on the N3/Turtle link from the search results page), you get this (headers omitted):

tgn:1014952 a gvp:Subject , skos:Concept , gvp:AdminPlaceConcept ;
    rdfs:label "Nashville" ;
    rdfs:seeAlso <http://www.getty.edu/vow/TGNFullDisplay?find=&place=&nation=&subjectid=1014952> ;
    dct:created "1991-09-13T00:41:00"^^xsd:dateTime ;
    skos:changeNote tgn_rev:5000847244 ;
    gvp:broader tgn:7013154 ;
    gvp:broaderPartitiveExtended tgn:7005685 , tgn:1000001 , tgn:7013154 , tgn:7012149 , tgn:7029392 ;
    gvp:broaderExtended tgn:7005685 , tgn:1000001 , tgn:7013154 , tgn:7012149 , tgn:7029392 ;
    gvp:broaderPreferredExtended tgn:7005685 , tgn:1000001 , tgn:7013154 , tgn:7029392 ;
    gvp:parentString "Ontario, Canada, North and Central America, World" ;
    skos:note tgn_rev:5000847244 ;
    gvp:parentStringAbbrev "Ontario, Canada, ... World" ;
    gvp:displayOrder "998"^^xsd:positiveInteger ;
    gvp:placeType aat:300008347 ;
    skosxl:prefLabel tgn_term:15232 ;
    skos:prefLabel "Nashville" ;
    gvp:broaderPartitive tgn:7013154 ;
    gvp:broaderPreferred tgn:7013154 ;
    skos:broader tgn:7013154 ;
    iso:broaderPartitive tgn:7013154 ;
    gvp:prefLabelGVP tgn_term:15232 ;
    skos:inScheme <http://vocab.getty.edu/tgn/> ;
    dct:contributor tgn_contrib:10000000 ;
    dct:source tgn_source:9006541-subject-1014952 ;
    gvp:placeTypePreferred aat:300008347 ;
    dc:identifier "1014952" ;
    skos:broaderTransitive tgn:7005685 , tgn:1000001 , tgn:7013154 , tgn:7012149 ;
    cc:license <http://opendatacommons.org/licenses/by/1.0/> ;
    void:inDataset <http://vocab.getty.edu/dataset/tgn> ;
    dct:license <http://opendatacommons.org/licenses/by/1.0/> ;
    prov:wasGeneratedBy tgn_rev:5000847244 ;
    foaf:focus tgn:1014952-place .


plus some triples about related things (like the place, sources, etc.).  If you compare this with what Blazegraph reports, you'll notice that there are two major categories of triples missing.  One is a set of triples related to broader categories, such as gvp:broader, gvp:broaderExtended, skos:broader, skos:broaderTransitive, etc.  The other is the set of labels, such as rdfs:label, skos:prefLabel, etc.

The difference is caused by the fact that these additional properties are not actually stored in the Getty TGN database.  Rather, they are properties that are entailed but not materialized in the Explicit Exports RDF dump.  The circumstances are described in section 6.6 of the Getty Vocabularies: Linked Open Data Semantic Representation document.  When the Getty Vocabularies SPARQL endpoint is loaded with fresh data every two weeks, the 17 files I listed above are loaded along with the Getty Ontology and other external ontologies like SKOS.  Getty's graph database has inference features that allow it to derive entailed triples.  For example, gvp:broaderPreferred is a subproperty of gvp:broader, so including the triple:

tgn:1014952 gvp:broaderPreferred tgn:7013154.

in the RDF dump entails the triple:

tgn:1014952 gvp:broader tgn:7013154.

even though it's not in the dump.  Getty has defined a set of rules that limits the entailed triples to the categories they have deemed important; the triples generated by those rules are then inserted back into the triplestore.  So entailed triples aren't reasoned every time somebody runs a query - they are materialized once in each update cycle, and queries are run over the graph that is the sum of the explicit and materialized entailed triples.  Getty does offer "Total Exports" RDF dumps, but those files are a lot larger.  At 19.8 Gb uncompressed for the Thesaurus of Geographic Names, the Total Export is roughly 40% larger than the 13.8 Gb Explicit Exports files I used.

So I've got a problem.  I really wanted the skos:prefLabel and skos:altLabel triples, but they are entailed, not explicit triples.  The explicit triple relevant to labels that is shown in the Explore tab for the example is:

tgn:1014952 skosxl:prefLabel tgn_term:15232.

The relationship between SKOS labels and SKOS-XL (SKOS eXtension for Labels) labels is described in Appendix B of the SKOS Reference.  SKOS uses a complex OWL feature called Property Chains to support the "dumbing-down" of SKOS-XL label entities (which can be identified by IRIs and be assigned properties) to vanilla SKOS lexical labels (i.e. literal value properties like skos:prefLabel, which cannot be assigned properties).  Here's the SKOS-XL label entity from the example:

 tgn_term:15232 a skosxl:Label ;
    gvp:term "Nashville" ;
    gvp:displayOrder "1"^^xsd:positiveInteger ;
    skosxl:literalForm "Nashville" ;
    gvp:termFlag <http://vocab.getty.edu/term/flag/Vernacular> ;
    gvp:termPOS <http://vocab.getty.edu/term/POS/Noun> ;
    gvp:contributorPreferred tgn_contrib:10000000 ;
    dct:contributor tgn_contrib:10000000 ;
    dct:source tgn_source:9006541-term-15232 ;
    dc:identifier "15232" .


Getty wants to include this entity in its explicit data because provenance data and display order information can be assigned to it - you can't do that with a literal.  If the Property Chain axioms in section B.3.2 are applied to the skosxl:prefLabel triple and the triples in the label entity description, the triple

tgn:1014952 skos:prefLabel "Nashville".

is entailed.

To solve my problem, I've got three options:
1. Put my giant blob of triples somewhere that can do Property Chain OWL reasoning, and let the skos:prefLabel triples be generated on the fly.  It's possible that Blazegraph can do that, but at this point I have no idea how you would make that happen.
2. Use a SPARQL CONSTRUCT query to generate all of the entailed skos:prefLabel and skos:altLabel triples that I want, then insert them into my Blazegraph graph (see the sketch after this list).  This is probably the best option, but I haven't tried it yet.  The query would be pretty simple, but I have no idea how long it would take to run - it would generate about 4 million triples (based on the number of skosxl:Label instances in the VoID description of TGN).  I've never run a CONSTRUCT query that big before.
3. Use a more complicated SPARQL SELECT query using the SKOS-XL properties instead of the vanilla SKOS label properties.  That's annoying, but simpler in the short run.
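For what it's worth, here is a sketch of what the option 2 query might look like, written as a SPARQL 1.1 Update INSERT so that the entailed triples go straight back into the store.  It's untested, and an analogous query with skosxl:altLabel and skos:altLabel would be needed for the alternate labels:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX skosxl: <http://www.w3.org/2008/05/skos-xl#>

INSERT {
  # materialize the "dumbed-down" plain SKOS preferred label
  ?concept skos:prefLabel ?literal .
}
WHERE {
  ?concept skosxl:prefLabel ?labelEntity .
  ?labelEntity skosxl:literalForm ?literal .
}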

With option 3, wherever I wanted to use the triple pattern:

?location skos:prefLabel ?label.

I would instead need to use:

?location skosxl:prefLabel ?labelEntity.
?labelEntity skosxl:literalForm ?label.

Not really that bad, but annoying.

Take-homes from the School of Hard Knocks

At this point, I'm going to insert some comments about important take-homes that I've picked up in the course of these experiments.

1. Capitalization and language tags.  Despite what Section 2.1.1 of RFC 5646 says about case sensitivity and conventions about capitalization, the Blazegraph SPARQL interface distinguishes between "zh-Hans" and "zh-hans".  (Maybe all SPARQL processors do?)  RFC 5646 says "The ABNF syntax also does not distinguish between upper- and lowercase: the uppercase US-ASCII letters in the range 'A' through 'Z' are always considered equivalent and mapped directly to their US-ASCII lowercase equivalents in the range 'a' through 'z'."  When Blazegraph reports lang(?label) for a literal ?label, it performs the ABNF mapping (to all lowercase) on the language tag.  Thus for

?label = "广德寺"@zh-Hans

lang(?label) = "zh-hans"

So the filter

FILTER (lang(?label)="zh-Hans")

will never produce results.  I noticed that Getty TGN uses all lowercase tags, so to avoid problems in the future, I'm going to switch to using all lowercase in my language tags.
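Another way around the problem, at least according to the SPARQL spec, is the langMatches() function, which does case-insensitive language range matching.  I haven't tested how Blazegraph handles it on these data, so treat this as a sketch:

FILTER (langMatches(lang(?label), "zh-Hans"))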

2. Named graphs.  In most SPARQL examples geared toward beginners, there is no graph specified: all of the triples go into one unspecified pot (the default graph).  That's fine for playing around, but in production, one will need to remove sets of triples (i.e. graphs), then either replace them with a new graph (if it's an update) or not (if they simply aren't needed any more).  In SPARQL 1.1 Update, it's possible to manipulate specific triples using INSERT DATA and DELETE DATA, although that is unwieldy when dealing with large graphs.  I also found in my experimentation with Stardog that under certain circumstances triples don't get removed. [6]

It's more straightforward and reliable to designate triples as belonging to some particular IRI-identified graph as they are loaded.  (They are then known as "quads" instead of "triples" - the fourth component that's added to the triple is the IRI of the graph that contains the triple.)  Then it is possible to drop the entire graph without affecting triples that were designated as belonging to other graphs.

The problem comes when you mix triples that were designated as belonging to a named graph with other triples that were not so designated.  The issue is summed up in the GraphDB documentation: "The SPARQL specification does not define what happens when no FROM or FROM NAMED clauses are present in a query, i.e., it does not define how a SPARQL processor should behave when no dataset is defined. In this situation, implementations are free to construct the default dataset as necessary."  In the case of the GraphDB implementation, it constructs the default dataset by including any triples that were not designated as part of any named graph as belonging to every named graph.

This was the behavior that I expected when I loaded the Getty dump into Blazegraph.  When loading triples using the UPDATE tab, I just gave commands like:

load <file:///c:/test/output/getty/TGNOut_1Subjects.nt>

and loaded the millions of triples serialized in the TGNOut_1Subjects.nt file into the Blazegraph unnamed default graph pot.  I then planned to load other sets of triples like this:

load <file:///c:/test/rdf/output/tang-song.ttl> into graph <http://lod.vanderbilt.edu/historyart/tang-song>

so that I could remove or replace those triples as a whole graph as experimentation progressed.

Including the keyword FROM in a SPARQL query should add the specified graph to the default graph used in the query, for example:

SELECT DISTINCT ?site 
FROM <http://lod.vanderbilt.edu/historyart/tang-song>
WHERE {
  ?site a geo:SpatialThing.
  }

However, in this case, Blazegraph does not add the triples in the http://lod.vanderbilt.edu/historyart/tang-song graph to those in the unspecified pot.  Instead, it defines the default graph to be composed of only the triples in the http://lod.vanderbilt.edu/historyart/tang-song graph.

The FROM NAMED keywords make it possible to specify that some triple patterns apply only to a particular named graph, like this:

SELECT DISTINCT ?site
FROM NAMED <http://lod.vanderbilt.edu/historyart/tang-song>
WHERE {
  GRAPH <http://lod.vanderbilt.edu/historyart/tang-song> {
    ?site a geo:SpatialThing.
    }
}

or

SELECT DISTINCT ?site
FROM NAMED <http://lod.vanderbilt.edu/historyart/tang-song>
WHERE {
  GRAPH ?g {
    ?site a geo:SpatialThing.
    }
}

to say that the triple patterns could apply to any named graph.  So it seems like an alternative approach would be to explicitly specify that some triple patterns should apply to all named graphs and other triple patterns should apply to triples in the unspecified pot, something like this [7]:

SELECT DISTINCT ?site
FROM NAMED <http://lod.vanderbilt.edu/historyart/tang-song>
WHERE {
  ?location skosxl:prefLabel ?labelEntity.
  ?labelEntity skosxl:literalForm ?label. 
  {GRAPH ?g {
    ?site a geo:SpatialThing.
    ?site rdfs:label ?label.
    }
  }
}

However, as far as I can tell, whenever either the FROM or FROM NAMED keywords are used in a Blazegraph SPARQL query, the triples in the unspecified default graph pot are just forgotten.  There probably is some kind of workaround involving cloning the default graph triples to another "namespace" (see the Blazegraph Quick Start guide for more on what Blazegraph calls "namespaces") and then running a federated query across the new namespace and named graphs in the old namespace.  I haven't gotten desperate enough to try this yet.
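If I ever get desperate enough, I imagine the workaround would look something like the sketch below: the Getty triples cloned into a second namespace (here called "getty") and queried through its own SPARQL endpoint with the SERVICE keyword.  The endpoint URL follows the pattern I understand Blazegraph to use for namespaces, but that is an assumption I haven't tested:

SELECT DISTINCT ?site
FROM <http://lod.vanderbilt.edu/historyart/tang-song>
WHERE {
  # triples from the named graph in the current namespace
  ?site a geo:SpatialThing.
  ?site rdfs:label ?label.
  # Getty triples cloned into the "getty" namespace (assumed endpoint URL)
  SERVICE <http://localhost:9999/blazegraph/namespace/getty/sparql> {
    ?place skosxl:prefLabel ?labelEntity.
    ?labelEntity skosxl:literalForm ?label.
  }
}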

If one were playing around with tiny little graphs, one could simply give the DROP ALL command from the UPDATE tab and start over by reloading the triples into appropriately named graphs.  However, given the two-day loading time for the Getty RDF dump, I'm not eager to do that.  I'll probably just upload my test graphs into the unnamed default graph and then try to delete their specific triples using a DELETE DATA command.

My take-home from this experience is that for production purposes, it's probably a good idea to always load triples as part of some named graph.  This dooms you to specifying FROM clauses in every query you write, but that's probably better than having to thrash around as I have with triples in two unlinkable pots.
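To make that take-home concrete, here is the replace-a-graph cycle that loading into named graphs buys you, using the graph IRI and file path from the examples above (a sketch; I haven't run it on the big graphs):

# get rid of the old copy of the graph (SILENT prevents an error if it doesn't exist yet)...
DROP SILENT GRAPH <http://lod.vanderbilt.edu/historyart/tang-song> ;
# ...then load the updated file into a graph with the same IRI
LOAD <file:///c:/test/rdf/output/tang-song.ttl> INTO GRAPH <http://lod.vanderbilt.edu/historyart/tang-song>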

3. Loading alternatives to be explored.  It is probably worthwhile to note that there are other ways in Blazegraph to load triples besides using the LOAD command via SPARQL 1.1 Update.  The Blazegraph UPDATE tab has other options in the dropdown besides SPARQL Update: "RDF Data" and "File Path or URL".  I'm not clear on how these behave differently or if they are likely to have different loading speeds on very large files.  The Blazegraph documentation is pretty sparse, which I guess is OK for a free product.  If you want hand-holding, you can pay for support.  It does make things a bit difficult for a beginner, however.


Querying the Getty TGN triples

Well, as usual, the technical details of being a Hogwarts kitchen manager have caused this blog post to expand to be much longer than I'd planned.  It's time for the Start-of-Term Feast!

The first query that I've shown below can actually be run by pasting it straight into the Getty Vocabularies SPARQL endpoint online: http://vocab.getty.edu/sparql
The prefix list contains more prefixes than are actually necessary for the query that follows, but I wanted it to contain all of the prefixes that were likely to be needed for related experimentation. 

We have been working on a dataset provided by Tracy Miller of Vanderbilt's Department of History of Art: the tang-song temple dataset [8].  In it, we have names of temple sites in Chinese characters and Latin transliterations.  We would like to associate each site with the Getty TGN identifier that corresponds to the temple site or the city/village from which the temple site gets its name.  We could search for them using the Getty Vocabularies search facility, but those searches usually return many hits for sites that have the same name but are in the wrong province.  It is very labor-intensive to sort them out.

In our data, we have the province name in Chinese characters (e.g. 山西 for Shanxi).  To eliminate all of the matches to sites with the correct name but in the wrong province, we can use the query below.

PREFIX tgn: <http://vocab.getty.edu/tgn/>
PREFIX gvp: <http://vocab.getty.edu/ontology#>
PREFIX skosxl: <http://www.w3.org/2008/05/skos-xl#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX gvp_lang: <http://vocab.getty.edu/language/>
PREFIX att: <http://vocab.getty.edu/aat/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>

SELECT ?place ?strippedLiteral ?lat ?long 
WHERE {
  ?broaderLabel skosxl:literalForm "山西"@zh.
  ?broader skosxl:prefLabel|skosxl:altLabel ?broaderLabel.
  ?place gvp:broaderPreferred* ?broader.
  ?place skosxl:altLabel|skosxl:prefLabel ?label.
  ?label skosxl:literalForm ?literal.
  ?label skosxl:literalForm "Qinxian"@zh-latn.
  #  FILTER ( lang(?literal) = "zh-latn")
  ?place foaf:focus ?geoPlace.
  ?geoPlace geo:lat ?lat.
  ?geoPlace geo:long ?long.
  BIND (str(?literal) AS $strippedLiteral)
  }

In the query, I bind the label entity whose literal form is the Chinese province name to the variable ?broaderLabel.  That label entity can be attached to the province as either a preferred label or an alternate label.  I use the SPARQL property path feature ("*") to allow the province to be any number of broaderPreferred links above the site (?place).  I then try to match the literal form of either the preferred or alternate labels for the site name.  In this example, for testing purposes I've hard-coded the province label to "山西"@zh and the site name label to "Qinxian"@zh-latn, values that I know from manual searches will work.  Once the variable ?place is constrained by the two labels, I have the query report the place IRI, name, and latitude and longitude.  Tracy has GPS geocoordinates for buildings at each temple site, so the lat/long data from the query can be compared with her geocoordinates to distinguish among the resulting locations if more than one comes up within a particular province.

Here's what you get when you run the query:


The building geocoordinates at the Qinxian temple site are 36.7477, 112.5745, which is relatively close to both of these hits.  The hits will have to be examined manually to determine the one to which the link should be made.
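In principle, that manual check could also be pushed into the query by filtering on the distance from the known building coordinates, something like the line below added to the WHERE clause.  This is just a sketch: it assumes the geo:lat and geo:long values bind as numeric literals (if not, they'd need casting with xsd:decimal()), and the 0.1-degree tolerance is an arbitrary choice.

FILTER (ABS(?lat - 36.7477) < 0.1 && ABS(?long - 112.5745) < 0.1)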

If this were all that the query were going to do, there wouldn't be much point in the whole exercise of downloading the Getty TGN dump and loading it into Blazegraph.  The query could just be run on the Getty online endpoint.  But I don't want to have to manually enter the province and site names for all 122 sites.  Before uploading the RDF for Tracy's data, I went back and changed the tang-song RDF data dump so that all language tags were lowercase only.  I then loaded it into the default (unnamed) graph of Blazegraph along with the Getty RDF dump so that I could query the tang-song graph and the Getty TGN graph together.

Here's the query as I ran it finally:

SELECT ?strippedSiteLabel ?lat ?long
WHERE {
  ?site a geo:SpatialThing.
  ?site rdfs:label ?siteLabel.
  FILTER (lang(?siteLabel)="zh-latn-pinyin")
  BIND (str(?siteLabel) AS ?strippedSiteLabel)
  ?site dwc:stateProvince ?province.
  
  ?broaderLabel skosxl:literalForm ?province.
  ?broader skosxl:prefLabel|skosxl:altLabel ?broaderLabel.
  ?place gvp:broaderPreferred* ?broader.
  ?place skosxl:altLabel|skosxl:prefLabel ?label.
  ?label skosxl:literalForm ?literal.
  FILTER ( lang(?literal) = "zh-latn")
  BIND (str(?literal) AS ?strippedLiteral)
  FILTER (?strippedLiteral = ?strippedSiteLabel)
  ?place foaf:focus ?geoPlace.
  ?geoPlace geo:lat ?lat.
  ?geoPlace geo:long ?long.
  }
#  Limit 10

The first block of triple patterns locates the transliterated site label and the province name in simplified Chinese characters.  I had to strip the language tag from the site labels because Getty tags theirs as zh-latn, while mine are tagged zh-latn-pinyin.  I purposely tagged the province names as zh so that they would match the tags used by Getty.  Clearly, having to deal with mismatched language tags is a pain.

The second block of triple patterns is a hack of the previous query.  Instead of hard-coding the province and site names as I did in the first query, they are now the variables bound in the first block of triple patterns.  I ran the query multiple times, adding one line at a time, and it was interesting to see how the run time changed.  Some changes, like filtering on language tags, increased the time a lot (I'm not sure why).  When all of the lines of the query had been added, it took about 30 seconds to run.  Here is what the results looked like:


There are a total of 122 temple sites, so clearly there are many that did not have any match.  The query could probably work better if REGEX were used to search for parts of strings, to catch cases where a site name of ours like "Lingyansi" (= Lingyan Temple) does not match a Getty variant like "Lingyan".  Still more work to do, but the general approach of leveraging the big Getty dump to reduce manual searching seems to work.
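A minimal sketch of that idea would be to replace the exact-match filter FILTER (?strippedLiteral = ?strippedSiteLabel) in the query above with a substring test, for example:

FILTER (CONTAINS(?strippedSiteLabel, ?strippedLiteral))

or an equivalent REGEX() test.  I haven't tried this, and it would need checking for false positives on very short name strings.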

I still intend to load the GeoNames triples.  I started loading the one giant file overnight, then woke up the next morning to discover that Windows had decided that was the night to restart my computer (I hate you, Microsoft).  So I started it again over the weekend, and after about 36 hours it still hadn't finished loading.  I need to use my work computer for "real" work, so I killed the load this morning so that it would stop slowing everything down.  I think I may write a little Python script to break the GeoNames N-Triples file into multiple files (maybe 10?) and load them one at a time.  At least it would be easier to know when progress was being made!

I'm feeling more and more sympathy for the Hogwarts kitchen manager...

-----------------------------------------------------------------------------------------------
[1] Strictly speaking, I should probably say "IRI" here, but as a practical matter, I'm considering "URI" and "IRI" to be interchangeable.
[2] Discussed in my blog post "Confessions of an RDF agnostic, part 3: How a client learns"
[3] "Shiny new toys 3: Playing with Linked Data"
[4] "Stress testing the Stardog reasoner"
[5] OpenRefine was able to open the Getty file, but it hung when I tried to parse it as RDF/N3 and crashed with an OutOfMemoryError when I tried to save the project in text format.  I bumped the memory allocation from 1 Gb to 4 Gb (the maximum recommended for an 8 Gb system), but that didn't help, and it still produced an "Unknown error, No technical details" error.
[6] It's worth noting that Callimachus manages sets of triples by keeping track of the names of the files that were uploaded into the triplestore, rather than by a named graph URI.  There's a typical upload dialog where you select the file you want to upload, and it will ask whether you want to replace the file if it already exists.  You can also delete a set of triples by telling it to delete that file.
[7] See Bob DuCharme's Learning SPARQL p. 82-83 for similar examples.
[8] Described in a recent blog post Guid-O-Matic goes to China

Sunday, October 23, 2016

Guid-O-Matic meets Darwin Core Archives

GOM2 part 2: Guid-O-Matic meets Darwin Core Archives

Note: This is the second post in a series of two, and it assumes that you've already read the first one.
left image from Darwin Core Text Guide http://rs.tdwg.org/dwc/terms/guides/text/


In my previous post, I described the origin of Guid-O-Matic 1.0 and its association with a presentation that I gave at my first annual meeting of Biodiversity Information Standards (TDWG) at Woods Hole, Massachusetts in 2010.  Up to that time, I had been following the tdwg-content email list and trying to make sense of how the various parts of the TDWG Technical Architecture "three-legged stool" (the TAPIR exchange protocol, the TDWG Ontology, and LSID globally unique identifiers) were supposed to work.

TDWG Technical Architecture c. 2007
The impression that I had gotten from the email list was that the architecture was too complicated to implement, too expensive to maintain, and too slow for effective data transfer.  There seemed to be sentiment in some parts of the TDWG community that the entire GUID/Linked Data thing should just be chucked out the window.  I had recently implemented HTTP URIs as persistent unique identifiers at Bioimages, with content negotiation to provide RDF/XML when requested by the client, and it didn't seem too hard to me.  I said that at the talk, which was politely received, and no one (at least to my face) criticized me for my naïveté on the subject.  Thus I was sucked into the RDF vortex, eventually resulting in my becoming co-convener of the TDWG RDF/OWL Task Group and in the adoption of the Darwin Core RDF Guide (http://dx.doi.org/10.3233/SW-150199; open access at http://bit.ly/2e7i3Sj).


Image from Darwin Core Text Guide http://rs.tdwg.org/dwc/terms/guides/text/

Darwin Core Archives

At the same 2010 TDWG meeting where I introduced Guid-O-Matic 1.0, David Remsen and Markus Döring presented a talk called "A Darwin-Core Archive solution to publishing and indexing taxonomic data within the Global Biodiversity Information Facility (GBIF) network".  This was my first exposure to Darwin Core Archives (DwC-A).  From the rubble of the collapsed TDWG technical architecture, the (then) new Darwin Core Vocabulary Standard provided a relatively simple way to transmit data in simple fielded text files (e.g. CSV files).  An XML metafile provides the mappings between the columns of the text files and the Darwin Core properties that those columns represent.  The CSV file, the XML metafile, and a third file containing metadata about the data in the CSV file are zipped up into a compressed archive that is the actual "Darwin Core Archive".  Because the CSV files themselves are not verbose and compress very effectively, a large amount of data can be transmitted over the Internet efficiently.  DwC-A is now the primary means of transmitting data to GBIF, the multinational aggregator of biodiversity information from around the world.

The information provided in the XML metafile is very similar to the information in the mapping table site-column-mappings.csv in the example I described in my previous blog post.  This is not a coincidence, since I was thinking about Darwin Core Archives when I was writing Guid-O-Matic 2.0 (GOM2).  In fact, there is an additional XQuery script in the Guid-O-Matic GitHub repo that is designed to extract information from a Darwin Core Archive metafile and generate the files that GOM2 needs to run.  I won't go into the details of the DwC-A translator here because there are directions here.

As an aside, I should make note of an earlier application that performs many of the same functions as GOM2: the BiSciCol Triplifier.  I won't go into details about it here because you can read about it on the BiSciCol blog.  You can find the GitHub repo here, access the application here, and view sample output here.  To summarize briefly, Triplifier is open-source software (a graphical web-based application or command line tool) that can read in a DwC-A and output serialized RDF triples.  It assumes a particular graph model that I'll come back to later.  There may also be other tools that convert DwC-As to RDF that I don't know about.

The Darwin Core Archive How-to Guide provides a link to a sample DwC-A for Molluscs of Andorra.  After downloading the sample archive and unzipping it, here's what the meta.xml file looks like:

<?xml version="1.0"?>
<archive xmlns="http://rs.tdwg.org/dwc/text/"  metadata="eml.xml">
<core encoding="UTF-8" linesTerminatedBy="\n" fieldsTerminatedBy="," fieldsEnclosedBy="&quot;" ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
<files>
<location>darwincore.txt</location>
</files>
<id index="0"/>
<field   default="HumanObservation "  vocabulary="http://rs.tdwg.org/dwc/terms/type-vocabulary/" term="http://rs.tdwg.org/dwc/terms/basisOfRecord"/>
<field    default="2010-11-25T12:12:12 " term="http://purl.org/dc/terms/modified"/>
<field    default="SIBA " term="http://rs.tdwg.org/dwc/terms/institutionCode"/>
<field    default="Molluscs" term="http://rs.tdwg.org/dwc/terms/collectionCode"/>
<field  index="1" term="http://rs.tdwg.org/dwc/terms/catalogNumber"/>
<field  index="2" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
<field  default="Animalia" term="http://rs.tdwg.org/dwc/terms/kingdom"/>
<field  index="3" term="http://rs.tdwg.org/dwc/terms/genus"/>
<field  index="4" term="http://rs.tdwg.org/dwc/terms/specificEpithet"/>
<field  index="5" term="http://rs.tdwg.org/dwc/terms/infraspecificEpithet"/>
<field  index="6" term="http://rs.tdwg.org/dwc/terms/taxonRank"/>
<field  index="7" term="http://rs.tdwg.org/dwc/terms/scientificNameAuthorship"/>
<field  index="8" term="http://rs.tdwg.org/dwc/terms/locality"/>
<field  index="9" term="http://rs.tdwg.org/dwc/terms/minimumElevationInMeters"/>
<field  index="10" term="http://rs.tdwg.org/dwc/terms/maximumElevationInMeters"/>
<field  index="11" term="http://rs.tdwg.org/dwc/terms/recordedBy"/>
<field  index="12" term="http://rs.tdwg.org/dwc/terms/decimalLongitude"/>
<field  index="13" term="http://rs.tdwg.org/dwc/terms/decimalLatitude"/>
<field  index="14" term="http://rs.tdwg.org/dwc/terms/dateIdentified"/>
<field  default="10000 " term="http://rs.tdwg.org/dwc/terms/coordinateUncertaintyInMeters"/>
<field  default="EUROPE" term="http://rs.tdwg.org/dwc/terms/continent"/>
<field  default="Andorra" term="http://rs.tdwg.org/dwc/terms/country"/>
</core>
</archive>

Here's what the column mapping file looks like after I ran the DwC-A translator on it:


Here's what a sample graph (serialized as Turtle; omitting the prefix list and document triples) looks like for a particular specimen (SIBA:Molluscs:134) described in the mollusc dataset:

<SIBA:Molluscs:134>
     dwc:basisOfRecord "HumanObservation ";
     dcterms:modified "2010-11-25T12:12:12 ";
     dwc:institutionCode "SIBA ";
     dwc:collectionCode "Molluscs";
     dwc:kingdom "Animalia";
     dwc:coordinateUncertaintyInMeters "10000 ";
     dwc:continent "EUROPE";
     dwc:country "Andorra";
     dwc:catalogNumber "134";
     dwc:scientificName "Euomphalia strigella (Draparnaud, 1801)  ";
     dwc:genus "Euomphalia";
     dwc:specificEpithet "strigella";
     dwc:taxonRank "species";
     dwc:scientificNameAuthorship "(Draparnaud, 1801)";
     dwc:minimumElevationInMeters "         ";
     dwc:maximumElevationInMeters "         ";
     dwc:recordedBy "Borredà, V.";
     dwc:decimalLongitude "1.503830288051";
     dwc:decimalLatitude "42.472209680738";
     dwc:dateIdentified "2007";
     a dwc:Occurrence.

There are several things to notice about the RDF graph that GOM2 generated:
  1. The subject is not a valid IRI.  I'll fix that by prepending a default prefix of http://example.org/.  
  2. Several properties have only whitespace as literal values.  That's a data-cleaning problem that should be fixed in the source CSV file, although GOM2 could be hacked to take care of that.  (If a cell has no value, GOM2 does not generate a triple for it.)  Similarly, several of the literals have trailing spaces - also a data cleaning problem.
  3. A number of the literals should be datatyped literals rather than plain literals.  That can be handled by editing the column mapping file after the DwC-A translator initially creates it.

Here's the improved mapping file specifying datatyped literals:


and here's the improved graph including datatypes:

<http://example.org/SIBA:Molluscs:134>
     dwc:basisOfRecord "HumanObservation";
     dcterms:modified "2010-11-25T12:12:12"^^xsd:dateTime;
     dwc:institutionCode "SIBA";
     dwc:collectionCode "Molluscs";
     dwc:kingdom "Animalia";
     dwc:coordinateUncertaintyInMeters "10000"^^xsd:int;
     dwc:continent "EUROPE";
     dwc:country "Andorra";
     dwc:catalogNumber "134";
     dwc:scientificName "Euomphalia strigella (Draparnaud, 1801)";
     dwc:genus "Euomphalia";
     dwc:specificEpithet "strigella";
     dwc:taxonRank "species";
     dwc:scientificNameAuthorship "(Draparnaud, 1801)";
     dwc:recordedBy "Borredà, V.";
     dwc:decimalLongitude "1.503830288051"^^xsd:decimal;
     dwc:decimalLatitude "42.472209680738"^^xsd:decimal;
     dwc:dateIdentified "2007"^^xsd:gYear;
     a dwc:Occurrence.

Linking classes that have a one-to-one relationship with dwc:Occurrence in the data


There is still (in my opinion) a serious problem with this graph.  It's too "flat" and ascribes every Darwin Core property to an occurrence instance.  For example, the Darwin Core quick reference guide suggests that dwc:locality should be a property of a dcterms:Location and that dwc:dateIdentified should be a property of dwc:Identification.  dcterms:modified is also a property of the data record; it's not the date when the occurrence or even the specimen that documents the occurrence was last modified. 

It would be easy enough to solve this problem, since each instance of these other classes has a one-to-one relationship with the root occurrence instance.  I could add:

to the list of classes, then change the column mapping to:


and the locality triple would have a blank node subject representing a dcterms:Location instance.

But this does not solve the problem of linking the root occurrence instance to the instances of the other classes.  The Darwin Core RDF Guide acknowledges this problem in section 1.4.4.  It suggests that decisions about assigning Darwin Core properties to classes and connecting those classes with object properties should be made by community consensus.  There have been several suggestions about ways to link the core Darwin Core classes.  In this example, I will use Darwin-SW object properties to link the class instances according to the Darwin-SW graph model.  Here it is:


The DwC-A translator by default assumes only the class given in the archive metadata (dwc:Occurrence).  So I've added the dwc:Identification, dwc:Organism, dwc:Event, dcterms:Location, and dwc:PreservedSpecimen classes from the diagram above (with dwc:PreservedSpecimen in the place of dsw:Token) to the class list shown below.  Instead of designating the class instances as blank nodes, this time I've chosen to assign them IRIs formed from the root occurrence IRI with appended fragment identifiers.  In the mapping table, GOM2 uses the assigned fragment identifier as a local ID for the class.


To link the class instances, I will add these rows to the mapping file, generating the blue object properties shown in the Darwin-SW graph diagram above:


Now I will assign the properties to the classes in which I think they belong.  I'm assuming that the catalog number, institution code, and collection code are properties of the preserved specimen, not the occurrence (this is at odds with people who consider that preserved specimens ARE occurrences).  There may be some people who would be surprised that I'm assigning the various taxon-related properties (dwc:genus, dwc:taxonRank, etc.) to a dwc:Identification instance rather than a dwc:Taxon instance.  Rather than explain the reason for this, I'll say that they are "convenience terms" and refer you to Section 2.7.4 of the Darwin Core RDF Guide.  Here is what I came up with for a "final" set of column mappings:


There is still one problem with the mappings that can't be fixed.  In its current state, GOM2 looks for a dcterms:modified property column in the metadata table, and if it finds one, it assigns it to the document that describes the occurrence rather than to the occurrence itself.  However, in this example, dcterms:modified is a constant and GOM2 isn't (yet) programmed to deal with that.  So it's going to show up as a property of the occurrence unless I just delete the row in the mapping table (which I did).

Here is what the graph looks like now (serialized as Turtle) with the properties sorted into the right classes and the classes linked together with object properties:

<http://example.org/SIBA:Molluscs:134>
     dwc:basisOfRecord "HumanObservation";
     dwc:recordedBy "BorredĂ , V.";
     dsw:occurrenceOf <http://example.org/SIBA:Molluscs:134#org>;
     dsw:atEvent <http://example.org/SIBA:Molluscs:134#eve>;
     a dwc:Occurrence.

<http://example.org/SIBA:Molluscs:134#id>
     dwc:kingdom "Animalia";
     dwc:scientificName "Euomphalia strigella (Draparnaud, 1801)";
     dwc:genus "Euomphalia";
     dwc:specificEpithet "strigella";
     dwc:taxonRank "species";
     dwc:scientificNameAuthorship "(Draparnaud, 1801)";
     dwc:dateIdentified "2007"^^xsd:gYear;
     dsw:identifies <http://example.org/SIBA:Molluscs:134#org>;
     a dwc:Identification.

<http://example.org/SIBA:Molluscs:134#org>
     a dwc:Organism.

<http://example.org/SIBA:Molluscs:134#eve>
     dsw:locatedAt <http://example.org/SIBA:Molluscs:134#loc>;
     a dwc:Event.

<http://example.org/SIBA:Molluscs:134#loc>
     dwc:coordinateUncertaintyInMeters "10000"^^xsd:int;
     dwc:continent "EUROPE";
     dwc:country "Andorra";
     dwc:decimalLongitude "1.503830288051"^^xsd:decimal;
     dwc:decimalLatitude "42.472209680738"^^xsd:decimal;
     a dcterms:Location.

<http://example.org/SIBA:Molluscs:134#sp>
     dwc:institutionCode "SIBA";
     dwc:collectionCode "Molluscs";
     dwc:catalogNumber "134";
     dsw:evidenceFor <http://example.org/SIBA:Molluscs:134>;
     dsw:derivedFrom <http://example.org/SIBA:Molluscs:134#org>;
     a dwc:PreservedSpecimen.

It is rather odd that no dwc:eventDate was provided for the date of collection of the specimen.  Also, I'm puzzled as to why the dwc:basisOfRecord is given as "HumanObservation" rather than "PreservedSpecimen".  Does that mean that there was no actual specimen collected?  (But then why is there a catalog number?)  But that's beside the point of this demonstration.

You might also wonder why the organism and event instances are described when they don't actually have any properties (other than their type).  In this flat record, there isn't any good reason for it.  However, other data providers, whose data may be aggregated with these data, may link many identifications and many occurrences to a single organism, so the organism instance is needed as a node to connect those many links.  Similarly, there may be many occurrences at a single event, so again that instance serves as a node to join many links.
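
To make that concrete, here is a minimal SPARQL sketch of the kind of query the organism node enables once data from many providers have been aggregated: find every occurrence of any organism that has been identified as this snail, no matter who recorded the occurrence or who made the identification.  (The prefix bindings are my assumptions; in particular, I'm assuming dsw: is bound to the Darwin-SW namespace.)

PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>
PREFIX dsw: <http://purl.org/dsw/>   # assumed Darwin-SW namespace binding

SELECT DISTINCT ?occurrence
WHERE {
  # identifications are attached to the organism, not to any one occurrence...
  ?identification dsw:identifies ?organism ;
                  dwc:scientificName "Euomphalia strigella (Draparnaud, 1801)" .
  # ...so every occurrence of that organism can be reached through the organism node
  ?occurrence dsw:occurrenceOf ?organism .
}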

A different graph model: BiSciCol

BiSciCol graph model from http://biscicol.blogspot.com/2013_03_01_archive.html

If you don't like the Darwin-SW graph model, you can use GOM2 to map the columns using a different model.  The diagram above shows the model that the BiSciCol project created for use with the Triplifier application that I mentioned previously.  The BiSciCol model is simpler than Darwin-SW.  It does not include the organism class (and so cannot deal with linking multiple occurrences of the same organism).  It does not differentiate between occurrences and the specimens that document them.  It also assigns taxon-related properties to the dwc:Taxon class instance rather than treating them as convenience properties and assigning them to the dwc:Identification class instance (I don't think the RDF Guide was finished at the time Triplifier was developed, so the concept of "convenience properties" had not yet been established).

The BiSciCol model also reuses the object properties bsc:related_to and bsc:depends_on (with bsc: = http://biscicol.org/terms/index.html#) to link all of the classes, rather than minting a separate object property for each kind of link.  Based on the local names, I've never quite understood why the arrows point in the directions that they do (a dcterms:Location is also related_to a dwc:Event, right?  And why is it the dwc:Occurrence that depends_on the dwc:Event and not the other way round?), but that's the way they are.  The arrow directions may be related to the direction in which many-to-one relationships are expected to occur.  The BiSciCol model also allows object properties to "jump over" a class if a particular dataset doesn't include it (e.g. the direct link from Occurrence to Taxon).  That allows for simpler RDF, but requires more complicated SPARQL queries (as the sketch below illustrates).  This differs from the Darwin-SW approach, which requires the insertion of a placeholder node if a class isn't represented in the data.  (See sections 3.1 and 3.2 of http://bit.ly/2dG85b5 for a detailed explanation.)  That results in more complicated RDF, but simplifies SPARQL querying.
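
Here is a rough sketch of what that tradeoff looks like in practice.  To find the kingdom associated with each occurrence in aggregated BiSciCol-style data, a query has to allow for both the direct Occurrence-to-Taxon link and the path through an Identification, since different providers may or may not have included identifications.  (The prefix bindings other than bsc: are my assumptions.)

PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>
PREFIX bsc: <http://biscicol.org/terms/index.html#>

SELECT ?occurrence ?kingdom
WHERE {
  ?occurrence a dwc:Occurrence .
  # either the taxon hangs directly off the occurrence (identification "jumped over")...
  { ?occurrence bsc:related_to ?taxon . }
  UNION
  # ...or it is reached through an identification that depends on both
  { ?identification bsc:depends_on ?occurrence , ?taxon . }
  ?taxon a dwc:Taxon ;
         dwc:kingdom ?kingdom .
}

Under the Darwin-SW model, the same information is always reached by the single fixed path shown in the earlier sketch (occurrence to organism to identification), even when the intervening nodes are only placeholders.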

Here is the class definition table I used to generate RDF based on the BiSciCol graph model:

Here is the column mapping table that I used:


GOM2 doesn't have any way to decide whether to leave out classes that should be "jumped over" because they are missing from the data.  So these class list and mapping tables always generate every class in the model, and provide both the links through the intermediate classes and the links that jump around them.  Here's what the BiSciCol graph looks like in Turtle serialization:

<http://example.org/SIBA:Molluscs:134>
     dwc:basisOfRecord "HumanObservation";
     dcterms:modified "2010-11-25T12:12:12";
     dwc:institutionCode "SIBA";
     dwc:collectionCode "Molluscs";
     dwc:catalogNumber "134";
     dwc:recordedBy "BorredĂ , V.";
     bsc:depends_on <http://example.org/SIBA:Molluscs:134#eve>;
     bsc:depends_on <http://example.org/SIBA:Molluscs:134#loc>;
     bsc:related_to <http://example.org/SIBA:Molluscs:134#tax>;
     a dwc:Occurrence.

<http://example.org/SIBA:Molluscs:134#id>
     dwc:dateIdentified "2007"^^xsd:gYear;
     bsc:depends_on <http://example.org/SIBA:Molluscs:134>;
     bsc:depends_on <http://example.org/SIBA:Molluscs:134#tax>;
     a dwc:Identification.

<http://example.org/SIBA:Molluscs:134#eve>
     bsc:depends_on <http://example.org/SIBA:Molluscs:134#loc>;
     a dwc:Event.

<http://example.org/SIBA:Molluscs:134#loc>
     dwc:coordinateUncertaintyInMeters "10000"^^xsd:int;
     dwc:continent "EUROPE";
     dwc:country "Andorra";
     dwc:decimalLongitude "1.503830288051"^^xsd:decimal;
     dwc:decimalLatitude "42.472209680738"^^xsd:decimal;
     a dcterms:Location.

<http://example.org/SIBA:Molluscs:134#tax>
     dwc:kingdom "Animalia";
     dwc:scientificName "Euomphalia strigella (Draparnaud, 1801)";
     dwc:genus "Euomphalia";
     dwc:specificEpithet "strigella";
     dwc:taxonRank "species";
     dwc:scientificNameAuthorship "(Draparnaud, 1801)";
     a dwc:Taxon.

Which is better, the Darwin-SW graph model or the BiSciCol graph model?

This question cannot be answered without first defining what use cases we want to satisfy.  GOM2 is easily hackable by changing values in the mapping CSVs (rather than hard-coding a particular graph model), so it is possible to try different approaches and see how well they work.  GOM2 also has an option to dump a serialization of a graph containing all of the triples generated from the entire metadata spreadsheet into a file that can then be loaded into a triplestore and tested via SPARQL queries.  So to answer this question, one would need to collect use cases from the TDWG community, try to satisfy them using the two models, then decide which one works better.  It may turn out that neither one works satisfactorily, in which case a new or modified graph model would need to be constructed and tested until all of the use cases that are deemed important are satisfied.
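
As a hedged sketch of what such a test might look like, here is one plausible use case ("list the catalog numbers of specimens that document occurrences recorded in Andorra") written against each of the two example graphs above.  The prefix bindings, other than bsc:, are my assumptions.

# Darwin-SW model: follow the chain specimen -> occurrence -> event -> location
PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>
PREFIX dsw: <http://purl.org/dsw/>   # assumed Darwin-SW namespace binding
SELECT ?catalogNumber
WHERE {
  ?specimen dwc:catalogNumber ?catalogNumber ;
            dsw:evidenceFor ?occurrence .
  ?occurrence dsw:atEvent ?event .
  ?event dsw:locatedAt ?location .
  ?location dwc:country "Andorra" .
}

# BiSciCol model: the catalog number is a property of the occurrence itself,
# which depends_on the location directly
PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>
PREFIX bsc: <http://biscicol.org/terms/index.html#>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?catalogNumber
WHERE {
  ?occurrence dwc:catalogNumber ?catalogNumber ;
              bsc:depends_on ?location .
  ?location a dcterms:Location ;
            dwc:country "Andorra" .
}

Whether one of these is "better" depends on which use cases matter: the Darwin-SW query is longer, but the same pattern works for any type of evidence, while the BiSciCol query is shorter but assumes that specimens and occurrences are the same thing.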

Darwin Core Archive extensions: a "star schema"

The Darwin Core Text Guide also provides a mechanism to link additional fielded text (CSV) extension tables to the core fielded text table.  For example, an occurrence can be documented by multiple images.  In that case, multiple records in the image (extension) table could be linked to a single record in an occurrence (core) table.  The XML metafile would specify which column in the extension table contains a reference to the unique identifier of a row in the core table.  In database terms, the extension table records have a foreign key to the primary key of the core table.  


The diagram above illustrates a situation where five images document a single occurrence.  The relationship between the rows in the extension table and the row in the core table that they are linked to by their foreign key can be represented as a simple RDF graph, where the resource described by the row in the core table is a node that connects many instances of resources described by the extension table.  This organization has been called a "star schema" because of the shape of the graph.  The Darwin Core Archive system allows multiple extension tables, so the bubbles linked to the central core record can be instances of more than one class.  But there cannot be more than a single resource at the center of the star, so the DwC-A system has some severe limitations as a way of linking CSV tables into complicated graphs.

From an RDF perspective, a simple solution to this problem would be for IRI-identified resources that have a many-to-one relationship to some other kind of resource to simply have a column that contains an IRI reference to the other resource.  The triples about each kind of resource could just be serialized separately and the IRIs would allow the triples to be connected into a graph of any complexity once they were aggregated in a triple store.  However, the problem is more complicated if the resources having the many-to-one relationship do not have assigned identifiers (i.e. are blank nodes).  In that case, links between the resources would need to be made within a single document.  (Technically, a document could describe a resource without an IRI identifier that had a link to another resource with an identifier. But it would not be possible for a Linked Data client to ask for that document, since there would be no dereferenceable subject IRI to send to the server.)  This situation is illustrated in the diagram above, where the images do not have IRIs and are represented by blank nodes.  A client could not request information about them individually, but a client requesting information about the occurrence could receive information about the images along with the information about the occurrence if they were serialized in the same document.  Since the Darwin Core Text Guide does not require the extension records to have unique identifiers, the situation just described could apply to such archives.  

GOM2 can handle multiple tables that are linked in the "star schema" pattern.  A CSV table, linked-classes.csv, lists the extension classes.  For each one, it specifies the column that contains the foreign key, the property that should be used to make the link to the core class instance, the name of the file containing the extension table, and optionally other columns that may be used to construct a fragment identifier if the resource in the extension table is assigned a URI based on the core resource rather than being a blank node.  Here is an example for the diagram above:


The "_:1" string indicates that the extension class is represented by a blank node; the numeral "1" has no particular significance.  For examples where fragment identifiers are used to construct an IRI for resources described in the extension files, see the  Guid-O-Matic detailed explanation page.  

When the GOM2 DwC-A translator script processes a Darwin Core Archive meta.xml file, it creates a linked-classes.csv table.  If the archive contains only a single core file (as in the real Andorran mollusc archive), the linked-classes.csv table will have only column headers with no data rows.  If the archive contains any extension files, GOM2 will create a row for each extension file.  

A real example of converting a Darwin Core Archive with extension tables into RDF

Since 2014, GBIF has registered a dedicated Darwin Core multimedia extension that allows many media items per core occurrence record (described in this blog post).  Bioimages has submitted its high-quality occurrence metadata, along with links to images, using this method.  The Guid-O-Matic GitHub repo includes an old DwC-A of the Bioimages data for you to play with.

The steps for converting this DwC-A to RDF are the same as before.  Unzip the gbif-bioimages.zip file into the directory with the translate-meta.xq script. Here's an abbreviated view of the meta.xml file:

<?xml version="1.0"?>
<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">
<core encoding="UTF-8" linesTerminatedBy="\r\n" fieldsTerminatedBy="," fieldsEnclosedBy="&quot;" ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
<files>
<location>occurrences.csv</location>
</files>
<id index="0" />
<field index="0" term="http://rs.tdwg.org/dwc/terms/occurrenceID" />
<field index="1" term="http://rs.tdwg.org/dwc/terms/basisOfRecord" />
<field index="2" term="http://purl.org/dc/terms/modified" />
...
<field index="33" term="http://rs.tdwg.org/dwc/terms/scientificNameAuthorship" />
<field index="34" term="http://rs.tdwg.org/dwc/terms/scientificName" />
<field index="35" term="http://rs.tdwg.org/dwc/terms/previousIdentifications" />
<field default="English" term="http://purl.org/dc/terms/language" />
</core>
<extension encoding="UTF-8" linesTerminatedBy="\r\n" fieldsTerminatedBy="," fieldsEnclosedBy="&quot;" ignoreHeaderLines="1" rowType="http://rs.gbif.org/terms/1.0/Multimedia">
<files>
<location>images.csv</location>
</files>
<coreid index="0" />
<field index="1" term="http://purl.org/dc/terms/type" />
<field index="2" term="http://purl.org/dc/terms/format" />
<field index="3" term="http://purl.org/dc/terms/identifier" />
...
<field index="9" term="http://purl.org/dc/terms/publisher" />
<field index="10" term="http://purl.org/dc/terms/license" />
<field index="11" term="http://purl.org/dc/terms/rightsHolder" />
<field default="English" term="http://purl.org/dc/terms/language" />
</extension>
</archive>

You can see that the metafile describes core records, located in the occurrences.csv file, and extension records, located in the images.csv file.  When the translation script is run, class list and column mapping CSV files are generated for both the occurrence and the image metadata files.  The linked-classes.csv file looks like this:


Because the DwC-A metafile was not designed to facilitate RDF, it gives no indication of which predicate should be used in the triple that links the extension resource to the core resource.  So the translator defaults to the generic dcterms:relation property, which simply indicates that there is some kind of relationship between the two resources.  In this case, I intend for the images to serve as evidence for the occurrence records, so I'm going to replace dcterms:relation with dsw:evidenceFor in the table.

I'll now run GOM2 as I did before, this time entering http://bioimages.vanderbilt.edu/thomas/0424-01#2010-07-25 as the occurrence to be converted to RDF.  (In the Bioimages occurrence file, the primary key identifier for the occurrence is a full HTTP IRI, so it isn't necessary to specify a default domain to prepend as was the case with the mollusc file.)  Here's the graph I get (serialized as Turtle):

<http://bioimages.vanderbilt.edu/thomas/0424-01#2010-07-25>
     dcterms:language "English";
     dwc:occurrenceID "http://bioimages.vanderbilt.edu/thomas/0424-01#2010-07-25";
     dwc:basisOfRecord "HumanObservation";
     dcterms:modified "2014-07-15T14:44:35-05:00";
     dcterms:references "http://bioimages.vanderbilt.edu/thomas/0424-01#2010-07-25";
     dwc:individualID "http://bioimages.vanderbilt.edu/thomas/0424-01";
     dwc:establishmentMeans "native";
     dwc:recordedBy "Ron Thomas";
     dwc:eventDate "2010-07-25";
     dwc:continent "North America";
     dwc:countryCode "US";
     dwc:stateProvince "Arkansas";
     dwc:county "Searcy";
     dwc:locality "Stone Cemetery Rd.";
     dwc:decimalLatitude " 36.0393";
     dwc:decimalLongitude "-92.7125";
     dwc:geodeticDatum "EPSG:4326";
     dwc:coordinateUncertaintyInMeters "500";
     dwc:georeferenceRemarks "Location of individual determined by an independent GPS measurement.";
     dwc:identifiedBy "Ron Thomas";
     dwc:dateIdentified "2010-07-25";
     dwc:kingdom "Plantae";
     dwc:order "Sapindales";
     dwc:family "Hippocastanaceae";
     dwc:genus "Aesculus";
     dwc:specificEpithet "pavia";
     dwc:taxonRank "species";
     dwc:scientificNameAuthorship "L.";
     dwc:scientificName "Aesculus pavia L.";
     a dwc:Occurrence.

_:cdd65a04-62c0-42ef-93c1-de26c61e0b17
     dcterms:type "StillImage";
     dcterms:format "image/jpeg";
     dcterms:identifier "http://bioimages.vanderbilt.edu/gq/thomas/g0424-01-01.jpg";
     dcterms:references "http://bioimages.vanderbilt.edu/thomas/0424-01-01.htm";
     dcterms:title "Aesculus pavia (Hippocastanaceae) - fruit - as borne on the plant";
     dcterms:description "Image of Aesculus pavia (Hippocastanaceae) - fruit - as borne on the plant";
     dcterms:created "2010-07-25T09:47:03-05:00";
     dcterms:creator "Ron Thomas";
     dcterms:publisher "Bioimages http://bioimages.vanderbilt.edu/";
     dcterms:license "http://creativecommons.org/licenses/by-nc-sa/3.0/";
     dsw:evidenceFor <http://bioimages.vanderbilt.edu/thomas/0424-01#2010-07-25>;
     a gbifterms:Multimedia.

_:9010f4a0-175e-4b59-8e7c-3368228df518
     dcterms:type "StillImage";
     dcterms:format "image/jpeg";
     dcterms:identifier "http://bioimages.vanderbilt.edu/gq/thomas/g0424-01-03.jpg";
     dcterms:references "http://bioimages.vanderbilt.edu/thomas/0424-01-03.htm";
     dcterms:title "Aesculus pavia (Hippocastanaceae) - leaf - margin of upper + lower surface";
     dcterms:description "Image of Aesculus pavia (Hippocastanaceae) - leaf - margin of upper + lower surface";
     dcterms:created "2010-07-25T09:49:47-05:00";
     dcterms:creator "Ron Thomas";
     dcterms:publisher "Bioimages http://bioimages.vanderbilt.edu/";
     dcterms:license "http://creativecommons.org/licenses/by-nc-sa/3.0/";
     dsw:evidenceFor <http://bioimages.vanderbilt.edu/thomas/0424-01#2010-07-25>;
     a gbifterms:Multimedia.

_:2c1405b0-e450-41e8-880e-7c1e0e247770
     dcterms:type "StillImage";
     dcterms:format "image/jpeg";
     dcterms:identifier "http://bioimages.vanderbilt.edu/gq/thomas/g0424-01-02.jpg";
     dcterms:references "http://bioimages.vanderbilt.edu/thomas/0424-01-02.htm";
     dcterms:title "Aesculus pavia (Hippocastanaceae) - fruit - unspecified";
     dcterms:description "Image of Aesculus pavia (Hippocastanaceae) - fruit - unspecified";
     dcterms:created "2010-07-25T09:56:34-05:00";
     dcterms:creator "Ron Thomas";
     dcterms:publisher "Bioimages http://bioimages.vanderbilt.edu/";
     dcterms:license "http://creativecommons.org/licenses/by-nc-sa/3.0/";
     dsw:evidenceFor <http://bioimages.vanderbilt.edu/thomas/0424-01#2010-07-25>;
     a gbifterms:Multimedia.

You can see that without any editing of the column mapping file, I already have a more complicated graph than I did with the mollusc data: there is one occurrence instance in the middle of the "star", with three image instances around the edge of the graph linked to the occurrence by dsw:evidenceFor.  I will still need to mess with the column mapping file for occurrences.csv in order to get it to conform to the Darwin-SW graph model.  I'll also fix a bunch of language tag and datatyping issues, along with other specific problems I'll discuss later.
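
Even though the three image instances are blank nodes that a Linked Data client could never ask a server about individually, once this document is loaded into a triplestore they can still be found by pattern matching.  Here is a minimal SPARQL sketch against the graph above (the dsw: prefix binding is my assumption):

PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dsw: <http://purl.org/dsw/>   # assumed Darwin-SW namespace binding

SELECT ?imageFile ?title
WHERE {
  # find every media item that serves as evidence for this occurrence,
  # whether or not the media item has an IRI of its own
  ?image dsw:evidenceFor <http://bioimages.vanderbilt.edu/thomas/0424-01#2010-07-25> ;
         dcterms:identifier ?imageFile ;
         dcterms:title ?title .
}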

Here's how I changed the class and mapping files for the core occurrence metadata file:


Here's how I changed the class and mapping files for the extension media metadata file:


Note: Audubon Core sort of implies that there is an ac:ServiceAccessPoint class, but never actually defines one.  (A service access point describes a file containing a version of a media item having a particular size and media type, whereas the media item itself may be considered an abstract entity that is distinct from the files that contain representations of it.)  It would probably be better not to assert a type for the service access point at all, but GOM2 requires that there be some type IRI for each class, so I "minted" the fake class IRI ac:ServiceAccessPoint.

Here's the resulting graph in Turtle serialization:

<http://bioimages.vanderbilt.edu/thomas/0424-01#2010-07-25>
     dcterms:identifier "http://bioimages.vanderbilt.edu/thomas/0424-01#2010-07-25";
     dwc:basisOfRecord "HumanObservation";
     dsw:occurrenceOf <http://bioimages.vanderbilt.edu/thomas/0424-01>;
     dwc:establishmentMeans "native";
     dwc:recordedBy "Ron Thomas";
     dsw:occurrenceOf _:2d368366-a71f-4e21-a008-6defe3b06f18;
     dsw:atEvent _:2858251b-ec69-45e5-a131-4d65427e5f53;
     a dwc:Occurrence.

_:8a059be6-7954-463b-81a0-197b11b234aa
     dwc:identifiedBy "Ron Thomas";
     dwc:dateIdentified "2010-07-25"^^xsd:date;
     dwc:kingdom "Plantae";
     dwc:order "Sapindales";
     dwc:family "Hippocastanaceae";
     dwc:genus "Aesculus";
     dwc:specificEpithet "pavia";
     dwc:taxonRank "species";
     dwc:scientificNameAuthorship "L.";
     dwc:scientificName "Aesculus pavia L.";
     dsw:identifies _:2d368366-a71f-4e21-a008-6defe3b06f18;
     a dwc:Identification.

_:2d368366-a71f-4e21-a008-6defe3b06f18
     owl:sameAs <http://bioimages.vanderbilt.edu/thomas/0424-01>;
     a dwc:Organism.

_:2858251b-ec69-45e5-a131-4d65427e5f53
     dwc:eventDate "2010-07-25"^^xsd:date;
     dsw:locatedAt _:d929d47f-6d34-4ee0-ae9d-59915bd8285d;
     a dwc:Event.

_:d929d47f-6d34-4ee0-ae9d-59915bd8285d
     dwc:continent "North America";
     dwc:countryCode "US";
     dwc:stateProvince "Arkansas";
     dwc:county "Searcy";
     dwc:locality "Stone Cemetery Rd.";
     dwc:decimalLatitude " 36.0393"^^xsd:decimal;
     dwc:decimalLongitude "-92.7125"^^xsd:decimal;
     dwc:geodeticDatum "EPSG:4326";
     dwc:coordinateUncertaintyInMeters "500"^^xsd:int;
     dwc:georeferenceRemarks "Location of individual determined by an independent GPS measurement."@en;
     a dcterms:Location.

_:ae320157-e94e-4c1a-ad51-bd43d79396cf
     rdf:type dcmitype:StillImage;
     dc:type "StillImage";
     ac:hasServiceAccessPoint <http://bioimages.vanderbilt.edu/gq/thomas/g0424-01-01.jpg>;
     foaf:page <http://bioimages.vanderbilt.edu/thomas/0424-01-01.htm>;
     dcterms:title "Aesculus pavia (Hippocastanaceae) - fruit - as borne on the plant"@en;
     dcterms:description "Image of Aesculus pavia (Hippocastanaceae) - fruit - as borne on the plant"@en;
     dcterms:created "2010-07-25T09:47:03-05:00"^^xsd:dateTime;
     dc:creator "Ron Thomas";
     dc:publisher "Bioimages http://bioimages.vanderbilt.edu/";
     dcterms:license <http://creativecommons.org/licenses/by-nc-sa/3.0/>;
     ac:hasServiceAccessPoint _:20c439a5-1ca6-42b6-8062-52bc873bb3a4;
     dsw:evidenceFor <http://bioimages.vanderbilt.edu/thomas/0424-01#2010-07-25>;
     a gbifterms:Multimedia.

_:20c439a5-1ca6-42b6-8062-52bc873bb3a4
     ac:variant ac:GoodQuality;
     dc:format "image/jpeg";
     owl:sameAs <http://bioimages.vanderbilt.edu/gq/thomas/g0424-01-01.jpg>;
     a ac:ServiceAccessPoint.

_:26c7497f-23ed-4a61-808d-40958d4f893e
     rdf:type dcmitype:StillImage;
     dc:type "StillImage";
     ac:hasServiceAccessPoint <http://bioimages.vanderbilt.edu/gq/thomas/g0424-01-03.jpg>;
     foaf:page <http://bioimages.vanderbilt.edu/thomas/0424-01-03.htm>;
     dcterms:title "Aesculus pavia (Hippocastanaceae) - leaf - margin of upper + lower surface"@en;
     dcterms:description "Image of Aesculus pavia (Hippocastanaceae) - leaf - margin of upper + lower surface"@en;
     dcterms:created "2010-07-25T09:49:47-05:00"^^xsd:dateTime;
     dc:creator "Ron Thomas";
     dc:publisher "Bioimages http://bioimages.vanderbilt.edu/";
     dcterms:license <http://creativecommons.org/licenses/by-nc-sa/3.0/>;
     ac:hasServiceAccessPoint _:d53b092f-cdcc-4be0-a498-d0778a259795;
     dsw:evidenceFor <http://bioimages.vanderbilt.edu/thomas/0424-01#2010-07-25>;
     a gbifterms:Multimedia.

_:d53b092f-cdcc-4be0-a498-d0778a259795
     ac:variant ac:GoodQuality;
     dc:format "image/jpeg";
     owl:sameAs <http://bioimages.vanderbilt.edu/gq/thomas/g0424-01-03.jpg>;
     a ac:ServiceAccessPoint.

_:8d8c96e9-fea9-441b-99a5-555f8816685c
     rdf:type dcmitype:StillImage;
     dc:type "StillImage";
     ac:hasServiceAccessPoint <http://bioimages.vanderbilt.edu/gq/thomas/g0424-01-02.jpg>;
     foaf:page <http://bioimages.vanderbilt.edu/thomas/0424-01-02.htm>;
     dcterms:title "Aesculus pavia (Hippocastanaceae) - fruit - unspecified"@en;
     dcterms:description "Image of Aesculus pavia (Hippocastanaceae) - fruit - unspecified"@en;
     dcterms:created "2010-07-25T09:56:34-05:00"^^xsd:dateTime;
     dc:creator "Ron Thomas";
     dc:publisher "Bioimages http://bioimages.vanderbilt.edu/";
     dcterms:license <http://creativecommons.org/licenses/by-nc-sa/3.0/>;
     ac:hasServiceAccessPoint _:6e66b906-3702-4567-94cb-a32eb95925da;
     dsw:evidenceFor <http://bioimages.vanderbilt.edu/thomas/0424-01#2010-07-25>;
     a gbifterms:Multimedia.

_:6e66b906-3702-4567-94cb-a32eb95925da
     ac:variant ac:GoodQuality;
     dc:format "image/jpeg";
     owl:sameAs <http://bioimages.vanderbilt.edu/gq/thomas/g0424-01-02.jpg>;
     a ac:ServiceAccessPoint.

Notes:
  1. There are a number of places where I substituted the legacy dc: namespace Dublin Core terms for dcterms: namespace terms.  (Some people have muddied the waters by suggesting that now dc: should be used as an abbreviation for what has traditionally been abbreviated as dcterms:, i.e. http://purl.org/dc/terms/.  Bad, bad, bad!  In this blog, I still use dc: to refer to the legacy namespace http://purl.org/dc/elements/1.1/ .) The blog post that describes how to use the GBIF DwC-A multimedia extension specifies that the dcterms: terms should be provided.  However, for a number of terms (e.g. dcterms:publisher), Dublin Core provides a range declaration for the property that implies it should be used with a non-literal object.  There is no range declaration for the dc: analog, so it is considered best practice to use the dc: analog with a literal object.
  2. In accordance with the Darwin Core RDF Guide section 2.6, the Darwin Core "ID" terms should not be used in RDF.  Since dwc:occurrenceID referred to the identifier of the subject resource, its predicate was changed to dcterms:identifier.  
  3. The dwc:individualID (now dwc:organismID) value is a tougher nut to crack. In the Darwin Core Archive, it was serving as a foreign key to another IRI-identified resource, the organism.  So a dsw:occurrenceOf object property was substituted for dwc:individualID.  However, GOM2 does not have a mechanism for supplying an independently defined subject IRI for resources having one-to-one relationships with the root resource (in this case the occurrence).  GOM2 only allows for fragment-identifier-appended IRIs or (as in this case) blank nodes.  So I did a little trick: I asserted that the blank node (_:2d368366-a71f-4e21-a008-6defe3b06f18) was owl:sameAs the organism IRI.  A client that supports sameAs inferencing would then substitute the organism IRI for the blank node in every triple that was explicitly asserted about the organism; a client that doesn't can still follow the owl:sameAs link explicitly, as in the query sketch after these notes.
  4. I applied datatypes to several date properties; those datatypes fit most, but not all, of the dates in the Bioimages database.  For example, most images have creation dates that conform to xsd:dateTime, since those dates were automatically recorded by a digital camera.  However, for a few old scanned images, there are dates in the database whose format only conforms to xsd:date or even xsd:gYear.  Applying the xsd:dateTime datatype to those dates would cause RDF database software that's picky about datatyping (like Stardog) to refuse to load the dataset.  This is a data cleaning problem that isn't so important in a demonstration like this, but it would need to be considered if DwC-A files were routinely used as a data source for aggregating biodiversity data as RDF.
  5. The GBIF multimedia extension instructions specify that dcterms:identifier should be used for "the public URL that identifies and locates the media file directly, not the html page it might be shown on".  That is exactly the meaning of ac:hasServiceAccessPoint, and I'm not sure why this choice was made.  I suppose it is because in this "flat" representation of a multimedia item, no distinction is made between a URI for the media item itself and the URL that retrieves a representation of it.  That isn't so important if there is only one version of the media item, but in the case of Bioimages, each image is available in four sizes (thumbnail, lower quality, good quality, and best quality = the original JPEG from the camera).  So in Bioimages there are four service access points for each image; the one listed here is only the "good quality" one.  In order to generate and link to the service access point (which has an independently defined subject IRI), I used the same owl:sameAs trick that I used with the organism instance.  I also made dc:format a property of the service access point rather than the media item, since the same image could (in theory) be served as BMP, GIF, PNG, or other image formats, and each of these would be represented by a separate service access point.  Because there is no RDF implementation guide for Audubon Core, it's the Wild West, and providers currently do whatever seems right to them.
  6. I substituted foaf:page for the dcterms:references property required in the GBIF DwC-A multimedia extension.  There isn't really anything wrong with dcterms:references; it is just a very generic link, whereas foaf:page is another well-known term that implies that the object is a document.

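To illustrate the owl:sameAs trick from notes 3 and 5: a client with sameAs reasoning handles the substitution automatically, but even a plain SPARQL query can follow the link explicitly.  This minimal sketch (the dwc: and dsw: prefix bindings are my assumptions) finds the scientific name applied to the organism identified by its Bioimages IRI, even though the identification triples are asserted about the blank node:

PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>
PREFIX dsw: <http://purl.org/dsw/>   # assumed Darwin-SW namespace binding

SELECT ?scientificName
WHERE {
  # the blank node standing in for the organism is declared sameAs the real organism IRI...
  ?organismNode owl:sameAs <http://bioimages.vanderbilt.edu/thomas/0424-01> .
  # ...so the identification can be reached by following that link explicitly
  ?identification dsw:identifies ?organismNode ;
                  dwc:scientificName ?scientificName .
}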

Why is it important to play with this kind of thing?

Darwin Core Archives and the "star schema" approach have been highly successful at getting occurrence data from providers to GBIF.  However, the star schema is a very limited graph model, and currently GBIF only really supports two classes as core file types: occurrence and taxon.  There are people (like me and others who track organisms over time) who might prefer to have an option for dwc:Organism as the core file type, with dwc:Identification and dwc:Occurrence as extensions (one organism with many identifications and many occurrences).  Some may want dwc:MaterialSample as the core resource.  Others might like an option for dwc:Event as the core file type, with dwc:Occurrence as an extension (one event with many occurrences).  Then there is the use case where a dwc:Occurrence is documented by multiple forms of evidence (specimens, photographs, DNA samples, etc.).  A star schema and corresponding DwC-A type could be designed for each of these use cases, but it is difficult to see how they could all be easily merged into a single database unless it were graph-based.

A graph-based system could also link to IRI-identified resources outside of the biodiversity informatics domain, such as DOIs and ISBNs for publications, ORCID and VIAF identifiers for people, and GeoNames and Getty TGN identifiers for places.  Such a system could easily suck in data about these kinds of resources and make them available for querying.

Although it would be possible to use a graph database system that doesn't necessarily depend on IRIs (such as Neo4j), linking together resources from diverse sources would be easier if those resources had globally unique identifiers.  RDF-related technology is probably the most well-developed way to do that kind of linking.  Even if you are a hard-core Linked Data and HTTP IRI skeptic and a fervent believer in UUIDs, you could easily turn your UUIDs into IRIs by prepending them with "urn:uuid:", and they would play well in an RDF triplestore.  Minting and maintaining stable GUIDs is a critical problem that must be solved before large-scale aggregation of data about diverse kinds of resources can be accomplished.  But that shouldn't stop us from experimenting with the other problem: deciding on the object properties that connect important classes of resources.  GOM2 is intended to facilitate the kind of experimenting with graph models that is necessary to decide which model will best satisfy identified use cases.