Sunday, October 23, 2016

Guid-O-Matic meets Darwin Core Archives

GOM2 part 2: Guid-O-Matic meets Darwin Core Archives

Note: This is the second post in a series of two, and it assumes that you've already read the first one.


In my previous post, I described the origin of Guid-O-Matic 1.0 and its association with a presentation that I did at my first annual meeting of Biodiversity Information Standards (TDWG) at Woods Hole, Massachusetts in 2010.  Up to that time, I had been following the tdwg-content email list and had been trying to make sense of how the various parts of the TDWG Technical Architecture "three-legged stool" (the TAPIR exchange protocol, TDWG Ontology, and LSID globally unique identifiers) were supposed to work.

TDWG Technical Architecture c. 2007
The impression that I had gotten from the email list was that the architecture was too complicated to implement, too expensive to maintain, and too slow for effective data transfer.  There seemed to be sentiment among some parts of the TDWG community that the entire GUID/Linked Data thing should just be chucked out the window.  I had recently implemented HTTP URIs as persistent unique identifiers at Bioimages, with content negotiation to provide RDF/XML when requested by the client, and it didn't seem too hard to me.  I said that at the talk, which was politely received, and no one (at least to my face) criticized me for my naïveté on the subject.  Thus I was sucked into the RDF vortex, eventually resulting in my becoming co-convener of the TDWG RDF/OWL Task Group and in the adoption of the Darwin Core RDF Guide (http://dx.doi.org/10.3233/SW-150199; open access at http://bit.ly/2e7i3Sj).


Image from Darwin Core Text Guide http://rs.tdwg.org/dwc/terms/guides/text/

Darwin Core Archives

At the same 2010 TDWG meeting where I introduced Guid-O-Matic 1.0, David Remsen and Markus Döring presented a talk called "A Darwin-Core Archive solution to publishing and indexing taxonomic data within the Global Biodiversity Information Facility (GBIF) network".  This was my first exposure to Darwin Core Archives (DwC-A).  From the rubble of the collapsed TDWG technical architecture, the (then) new Darwin Core Vocabulary Standard provided a relatively simple way to transmit data in simple fielded text files (e.g. CSV files).  An XML metafile provided the mappings between columns of the text files and the Darwin Core properties that those columns represented.  The CSV file, XML metafile, and a third file containing metadata about the data in the CSV file were zipped up into a compressed archive that was the actual "Darwin Core Archive".  Because the CSV files themselves are not verbose and compress well, a large amount of data can be transmitted over the Internet very efficiently.  DwC-A is now the primary means of transmission of data to GBIF, the multinational aggregator of biodiversity information from around the world.

The information provided in the XML metafile is very similar to the information in the mapping table site-column-mappings.csv in the example I described in my previous blog post.  This is not a coincidence, since I was thinking about Darwin Core Archives when I was writing Guid-O-Matic 2.0 (GOM2).  In fact, there is an additional XQuery script in the Guid-O-Matic GitHub repo that is designed to extract information from a Darwin Core Archive metafile and generate the files that GOM2 needs to run.  I won't go into the details of the DwC-A translator here because there are directions here.

As an aside, I should make note of an earlier application that performs many of the same functions as GOM2: the BiSciCol Triplifier.  I won't go into details about it here because you can read about it on the BiSciCol blog.  You can find the GitHub repo here, access the application here, and view sample output here.  To summarize briefly, Triplifier is open source software (graphical web-based application or command line) that can read in a DwC-A and output serialized RDF triples.  It assumes a particular graph model that I'll come back to later.  There may also be other tools that convert DwC-As to RDF that I don't know about.

The Darwin Core Archive How-to Guide provides a link to a sample DwC-A for Molluscs of Andorra.  After downloading the sample archive and unzipping it, here's what the meta.xml file looks like:

<?xml version="1.0"?>
<archive xmlns="http://rs.tdwg.org/dwc/text/"  metadata="eml.xml">
<core encoding="UTF-8" linesTerminatedBy="\n" fieldsTerminatedBy="," fieldsEnclosedBy="&quot;" ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
<files>
<location>darwincore.txt</location>
</files>
<id index="0"/>
<field   default="HumanObservation "  vocabulary="http://rs.tdwg.org/dwc/terms/type-vocabulary/" term="http://rs.tdwg.org/dwc/terms/basisOfRecord"/>
<field    default="2010-11-25T12:12:12 " term="http://purl.org/dc/terms/modified"/>
<field    default="SIBA " term="http://rs.tdwg.org/dwc/terms/institutionCode"/>
<field    default="Molluscs" term="http://rs.tdwg.org/dwc/terms/collectionCode"/>
<field  index="1" term="http://rs.tdwg.org/dwc/terms/catalogNumber"/>
<field  index="2" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
<field  default="Animalia" term="http://rs.tdwg.org/dwc/terms/kingdom"/>
<field  index="3" term="http://rs.tdwg.org/dwc/terms/genus"/>
<field  index="4" term="http://rs.tdwg.org/dwc/terms/specificEpithet"/>
<field  index="5" term="http://rs.tdwg.org/dwc/terms/infraspecificEpithet"/>
<field  index="6" term="http://rs.tdwg.org/dwc/terms/taxonRank"/>
<field  index="7" term="http://rs.tdwg.org/dwc/terms/scientificNameAuthorship"/>
<field  index="8" term="http://rs.tdwg.org/dwc/terms/locality"/>
<field  index="9" term="http://rs.tdwg.org/dwc/terms/minimumElevationInMeters"/>
<field  index="10" term="http://rs.tdwg.org/dwc/terms/maximumElevationInMeters"/>
<field  index="11" term="http://rs.tdwg.org/dwc/terms/recordedBy"/>
<field  index="12" term="http://rs.tdwg.org/dwc/terms/decimalLongitude"/>
<field  index="13" term="http://rs.tdwg.org/dwc/terms/decimalLatitude"/>
<field  index="14" term="http://rs.tdwg.org/dwc/terms/dateIdentified"/>
<field  default="10000 " term="http://rs.tdwg.org/dwc/terms/coordinateUncertaintyInMeters"/>
<field  default="EUROPE" term="http://rs.tdwg.org/dwc/terms/continent"/>
<field  default="Andorra" term="http://rs.tdwg.org/dwc/terms/country"/>
</core>
</archive>
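
To make the structure of the metafile concrete, here is a minimal Python sketch (not the XQuery translator from the Guid-O-Matic repo) that reads an abbreviated copy of the core section above and reports, for each Darwin Core term, whether its value comes from a column or from a constant.  The function name read_core_mappings and the dictionary keys are my own assumptions for illustration; only the element names, attributes, and namespace come from the metafile itself.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the <archive> element of a DwC-A metafile
NS = {"dwca": "http://rs.tdwg.org/dwc/text/"}

# Abbreviated copy of the mollusc metafile (most fields and attributes omitted)
META_XML = """<?xml version="1.0"?>
<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">
<core encoding="UTF-8" ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
<files>
<location>darwincore.txt</location>
</files>
<id index="0"/>
<field default="Molluscs" term="http://rs.tdwg.org/dwc/terms/collectionCode"/>
<field index="1" term="http://rs.tdwg.org/dwc/terms/catalogNumber"/>
<field index="2" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
</core>
</archive>"""


def read_core_mappings(meta_xml):
    """Return the core rowType and a list of term mappings (hypothetical helper)."""
    root = ET.fromstring(meta_xml)
    core = root.find("dwca:core", NS)
    mappings = []
    for field in core.findall("dwca:field", NS):
        index = field.get("index")
        mappings.append({
            "term": field.get("term"),
            # A field has either a column index or a constant default value
            "column": int(index) if index is not None else None,
            "constant": field.get("default"),
        })
    return core.get("rowType"), mappings


row_type, mappings = read_core_mappings(META_XML)
print("class:", row_type)
for m in mappings:
    if m["column"] is not None:
        source = f"column {m['column']}"
    else:
        source = f"constant {m['constant']!r}"
    print(m["term"], "<-", source)
```

The point of the sketch is only that each field element carries everything a column mapping needs: a term IRI plus either an index or a default.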

Here's what the column mapping file looks like after I ran the DwC-A translator on it:


Here's what a sample graph (serialized as Turtle; omitting the prefix list and document triples) looks like for a particular specimen (SIBA:Molluscs:134) described in the mollusc dataset:

<SIBA:Molluscs:134>
     dwc:basisOfRecord "HumanObservation ";
     dcterms:modified "2010-11-25T12:12:12 ";
     dwc:institutionCode "SIBA ";
     dwc:collectionCode "Molluscs";
     dwc:kingdom "Animalia";
     dwc:coordinateUncertaintyInMeters "10000 ";
     dwc:continent "EUROPE";
     dwc:country "Andorra";
     dwc:catalogNumber "134";
     dwc:scientificName "Euomphalia strigella (Draparnaud, 1801)  ";
     dwc:genus "Euomphalia";
     dwc:specificEpithet "strigella";
     dwc:taxonRank "species";
     dwc:scientificNameAuthorship "(Draparnaud, 1801)";
     dwc:minimumElevationInMeters "         ";
     dwc:maximumElevationInMeters "         ";
     dwc:recordedBy "Borredà, V.";
     dwc:decimalLongitude "1.503830288051";
     dwc:decimalLatitude "42.472209680738";
     dwc:dateIdentified "2007";
     a dwc:Occurrence.

There are several things to notice about the RDF graph that GOM2 generated:
  1. The subject is not a valid IRI.  I'll fix that by prepending a default prefix of http://example.org/.
  2. Several properties have only whitespace as literal values.  That's a data-cleaning problem that should be fixed in the source CSV file, although GOM2 could be hacked to take care of that.  (If a cell has no value, GOM2 does not generate a triple for it.)  Similarly, several of the literals have trailing spaces - also a data-cleaning problem.
  3. A number of the literals should be datatyped literals rather than plain literals.  That can be handled by editing the column mapping file after the DwC-A translator initially creates it.
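
The cleanup described in items 2 and 3 can be sketched in a few lines of Python.  This is an illustration of the idea, not what GOM2 actually does: the turtle_literal function is hypothetical, and it assumes that the datatype for each column is supplied as an abbreviated IRI such as xsd:int.

```python
def turtle_literal(value, datatype=None):
    """Return a Turtle literal for a cell value, or None if the cell is blank.

    Hypothetical helper: strips surrounding whitespace, skips cells that
    contain only whitespace, and appends a datatype when one is given.
    """
    cleaned = value.strip()
    if not cleaned:
        return None
    # Escape backslashes and double quotes so the literal stays valid Turtle
    escaped = cleaned.replace("\\", "\\\\").replace('"', '\\"')
    if datatype:
        return f'"{escaped}"^^{datatype}'
    return f'"{escaped}"'


# (predicate, raw cell value, datatype) taken from the mollusc example
rows = [
    ("dwc:institutionCode", "SIBA ", None),
    ("dwc:minimumElevationInMeters", "         ", "xsd:int"),
    ("dwc:coordinateUncertaintyInMeters", "10000 ", "xsd:int"),
    ("dwc:dateIdentified", "2007", "xsd:gYear"),
]

triples = []
for predicate, raw, datatype in rows:
    literal = turtle_literal(raw, datatype)
    if literal is not None:  # no triple for whitespace-only cells
        triples.append(f"     {predicate} {literal};")

print("\n".join(triples))
```

Fixing the source CSV is still the better option, since a sketch like this silently hides the problem from whoever maintains the data.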

Here's the improved mapping file specifying datatyped literals:


and here's the improved graph including datatypes:

<http://example.org/SIBA:Molluscs:134>
     dwc:basisOfRecord "HumanObservation";
     dcterms:modified "2010-11-25T12:12:12"^^xsd:dateTime;
     dwc:institutionCode "SIBA";
     dwc:collectionCode "Molluscs";
     dwc:kingdom "Animalia";
     dwc:coordinateUncertaintyInMeters "10000"^^xsd:int;
     dwc:continent "EUROPE";
     dwc:country "Andorra";
     dwc:catalogNumber "134";
     dwc:scientificName "Euomphalia strigella (Draparnaud, 1801)";
     dwc:genus "Euomphalia";
     dwc:specificEpithet "strigella";
     dwc:taxonRank "species";
     dwc:scientificNameAuthorship "(Draparnaud, 1801)";
     dwc:recordedBy "Borredà, V.";
     dwc:decimalLongitude "1.503830288051"^^xsd:decimal;
     dwc:decimalLatitude "42.472209680738"^^xsd:decimal;
     dwc:dateIdentified "2007"^^xsd:gYear;
     a dwc:Occurrence.

Linking classes that have a one-to-one relationship with dwc:Occurrence in the data


There is still (in my opinion) a serious problem with this graph.  It's too "flat" and ascribes every Darwin Core property to an occurrence instance.  For example, the Darwin Core quick reference guide suggests that dwc:locality should be a property of a dcterms:Location and that dwc:dateIdentified should be a property of dwc:Identification.  dcterms:modified is also a property of the data record; it's not the date when the occurrence or even the specimen that documents the occurrence was last modified. 

It would be easy enough to solve this problem, since each instance of these other classes has a one-to-one relationship with the root occurrence instance.  I could add:

to the list of classes, then change the column mapping to:


and the locality triple would have a blank node subject representing a dcterms:Location instance.

But this does not solve the problem of linking the root occurrence class to the other classes.  The Darwin Core RDF Guide recognizes the existence of this problem in section 1.4.4.  It suggests that decisions about the assignment of Darwin Core properties to classes and connecting those classes by object properties should be made by community consensus.  There have been several suggestions about ways to link the core Darwin Core classes.  In this example, I will use Darwin-SW object properties to link the class instances according to the Darwin-SW graph model.  Here it is:


The DwC-A translator by default assumes only the class given in the archive metadata (dwc:Occurrence).  So I've added the dwc:Identification, dwc:Organism, dwc:Event, dcterms:Location, and dwc:PreservedSpecimen classes from the diagram above (with dwc:PreservedSpecimen in the place of dsw:Token) to the class list shown below.  Instead of designating the class instances as blank nodes, this time I've chosen to assign them IRIs that are formed from the root class IRI with appended fragment identifiers.   In the mapping table, GOM2 uses the assigned fragment identifier as a local ID for the class.


To generate the links between the class instances, I will add these rows to the mapping file to generate the blue object properties shown in the Darwin-SW graph diagram above:


Now I will assign the properties to the classes in which I think they belong.  I'm assuming that the catalog number, institution code, and collection code are properties of the preserved specimen, not the occurrence (this is at odds with people who consider that preserved specimens ARE occurrences).  There may be some people who would be surprised that I'm assigning the various taxon-related properties (dwc:genus, dwc:taxonRank, etc.) to a dwc:Identification instance rather than a dwc:Taxon instance.  Rather than explain the reason for this, I'll say that they are "convenience terms" and refer you to Section 2.7.4 of the Darwin Core RDF Guide.  Here is what I came up with for a "final" set of column mappings:


There is still one problem with the mappings that can't be fixed.  In its current state, GOM2 looks for a dcterms:modified property column in the metadata table, and if it finds one, it assigns it to the document that describes the occurrence rather than to the occurrence itself.  However, in this example, dcterms:modified is a constant and GOM2 isn't (yet) programmed to deal with that.  So it's going to show up as a property of the occurrence unless I just delete the row in the mapping table (which I did).

Here is what the graph looks like now (serialized as Turtle) with the properties sorted into the right classes and the classes linked together with object properties:

<http://example.org/SIBA:Molluscs:134>
     dwc:basisOfRecord "HumanObservation";
     dwc:recordedBy "Borredà, V.";
     dsw:occurrenceOf <http://example.org/SIBA:Molluscs:134#org>;
     dsw:atEvent <http://example.org/SIBA:Molluscs:134#eve>;
     a dwc:Occurrence.

<http://example.org/SIBA:Molluscs:134#id>
     dwc:kingdom "Animalia";
     dwc:scientificName "Euomphalia strigella (Draparnaud, 1801)";
     dwc:genus "Euomphalia";
     dwc:specificEpithet "strigella";
     dwc:taxonRank "species";
     dwc:scientificNameAuthorship "(Draparnaud, 1801)";
     dwc:dateIdentified "2007"^^xsd:gYear;
     dsw:identifies <http://example.org/SIBA:Molluscs:134#org>;
     a dwc:Identification.

<http://example.org/SIBA:Molluscs:134#org>
     a dwc:Organism.

<http://example.org/SIBA:Molluscs:134#eve>
     dsw:locatedAt <http://example.org/SIBA:Molluscs:134#loc>;
     a dwc:Event.

<http://example.org/SIBA:Molluscs:134#loc>
     dwc:coordinateUncertaintyInMeters "10000"^^xsd:int;
     dwc:continent "EUROPE";
     dwc:country "Andorra";
     dwc:decimalLongitude "1.503830288051"^^xsd:decimal;
     dwc:decimalLatitude "42.472209680738"^^xsd:decimal;
     a dcterms:Location.

<http://example.org/SIBA:Molluscs:134#sp>
     dwc:institutionCode "SIBA";
     dwc:collectionCode "Molluscs";
     dwc:catalogNumber "134";
     dsw:evidenceFor <http://example.org/SIBA:Molluscs:134>;
     dsw:derivedFrom <http://example.org/SIBA:Molluscs:134#org>;
     a dwc:PreservedSpecimen.
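
The pattern in the graph above (class instance IRIs formed from the root IRI plus a fragment identifier, then joined by Darwin-SW object properties) can be sketched in Python as follows.  This is only an illustration of the pattern; GOM2 itself reads the classes and links from its CSV tables, and the CLASSES and LINKS structures and the function names here are my own assumptions.

```python
ROOT = "http://example.org/SIBA:Molluscs:134"

# Local IDs (fragment identifiers) and types, following the graph above.
# The empty string stands for the root occurrence, which has no fragment.
CLASSES = {
    "": "dwc:Occurrence",
    "id": "dwc:Identification",
    "org": "dwc:Organism",
    "eve": "dwc:Event",
    "loc": "dcterms:Location",
    "sp": "dwc:PreservedSpecimen",
}

# (subject local ID, object property, object local ID)
LINKS = [
    ("", "dsw:occurrenceOf", "org"),
    ("", "dsw:atEvent", "eve"),
    ("id", "dsw:identifies", "org"),
    ("eve", "dsw:locatedAt", "loc"),
    ("sp", "dsw:evidenceFor", ""),
    ("sp", "dsw:derivedFrom", "org"),
]


def iri(root, local_id):
    """Build the IRI of a class instance from the root IRI and a local ID."""
    return f"<{root}#{local_id}>" if local_id else f"<{root}>"


def link_triples(root):
    """Return one Turtle statement per link and per type declaration."""
    lines = []
    for local_id, rdf_type in CLASSES.items():
        subject = iri(root, local_id)
        for subj_id, prop, obj_id in LINKS:
            if subj_id == local_id:
                lines.append(f"{subject} {prop} {iri(root, obj_id)}.")
        lines.append(f"{subject} a {rdf_type}.")
    return lines


print("\n".join(link_triples(ROOT)))
```

Because every instance IRI is derived from the root IRI, the same two tables can be reused for every row in the occurrence file.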

It is rather odd that dwc:eventDate was not provided for the date of collection of the specimen.  Also, I'm puzzled as to why the dwc:basisOfRecord is given as "HumanObservation" rather than "PreservedSpecimen".  Does that mean that there was no actual specimen collected?  (But then why is there a catalog number?)  But that's beside the point of this as a demonstration.  

You might also wonder why the organism and event instances are described when they don't actually have any properties (other than their type).  In this flat record, there isn't any good reason for it.  However, other data providers, whose data may be aggregated with these data, may link many identifications and many occurrences to a single organism, so the organism instance is needed to serve as a node to connect those many links.  Similarly there may be many occurrences at a single event, so again that class serves as a node to join many links.  

A different graph model: BiSciCol

BiSciCol graph model from http://biscicol.blogspot.com/2013_03_01_archive.html

If you don't like the Darwin-SW graph model, you can use GOM2 to map the columns using a different model.  The diagram above shows the model that the BiSciCol project created to use with the Triplifier application that I mentioned previously.  The BiSciCol model is simpler than Darwin-SW.  It does not include the organism class (and so cannot deal with linking multiple occurrences of the same organism).  It does not differentiate between occurrences and the specimens that document them.  It also assigns taxon-related properties to the dwc:Taxon class instance rather than considering them convenience properties and assigning them to the dwc:Identification class instance.  (I don't think the RDF guide was finished at the time Triplifier was developed, so the concept of "convenience properties" had not yet been established.)

The BiSciCol model also reuses the object properties bsc:related_to and bsc:depends_on (with bsc: = http://biscicol.org/terms/index.html#) to link all of the classes, rather than minting separate object properties for each kind of link.  Based on the local names, I've never quite understood why all of the arrows point the directions that they do (a dcterms:Location is also related_to a dwc:Event, right?  Why does a dwc:Occurrence depends_on a dwc:Event and not the other way round?), but that's the way they are.  The arrow directions may be related to the direction that many-to-one relationships are expected to occur.  The BiSciCol model also allows using object properties to "jump over" a class if a particular dataset doesn't include it (e.g. the direct link from Occurrence to Taxon).  That allows for simpler RDF, but requires more complicated SPARQL queries.  This differs from the Darwin-SW approach, which requires the insertion of a placeholder node if a class isn't represented in the data.  (See sections 3.1 and 3.2 of http://bit.ly/2dG85b5 for a detailed explanation.)  That results in more complicated RDF, but simplifies SPARQL querying.

Here is the class definition table I used to generate RDF based on the BiSciCol graph model:

Here is the column mapping table that I used:


GOM2 doesn't have any way to decide whether to leave out classes that have been "jumped over" because they are missing in the data.  So these class list and mapping tables always generate every class in the model, and provide links both through any missing classes and around them.  Here's what the BiSciCol graph looks like in Turtle serialization:

<http://example.org/SIBA:Molluscs:134>
     dwc:basisOfRecord "HumanObservation";
     dcterms:modified "2010-11-25T12:12:12";
     dwc:institutionCode "SIBA";
     dwc:collectionCode "Molluscs";
     dwc:catalogNumber "134";
     dwc:recordedBy "Borredà, V.";
     bsc:depends_on <http://example.org/SIBA:Molluscs:134#eve>;
     bsc:depends_on <http://example.org/SIBA:Molluscs:134#loc>;
     bsc:related_to <http://example.org/SIBA:Molluscs:134#tax>;
     a dwc:Occurrence.

<http://example.org/SIBA:Molluscs:134#id>
     dwc:dateIdentified "2007"^^xsd:gYear;
     bsc:depends_on <http://example.org/SIBA:Molluscs:134>;
     bsc:depends_on <http://example.org/SIBA:Molluscs:134#tax>;
     a dwc:Identification.

<http://example.org/SIBA:Molluscs:134#eve>
     bsc:depends_on <http://example.org/SIBA:Molluscs:134#loc>;
     a dwc:Event.

<http://example.org/SIBA:Molluscs:134#loc>
     dwc:coordinateUncertaintyInMeters "10000"^^xsd:int;
     dwc:continent "EUROPE";
     dwc:country "Andorra";
     dwc:decimalLongitude "1.503830288051"^^xsd:decimal;
     dwc:decimalLatitude "42.472209680738"^^xsd:decimal;
     a dcterms:Location.

<http://example.org/SIBA:Molluscs:134#tax>
     dwc:kingdom "Animalia";
     dwc:scientificName "Euomphalia strigella (Draparnaud, 1801)";
     dwc:genus "Euomphalia";
     dwc:specificEpithet "strigella";
     dwc:taxonRank "species";
     dwc:scientificNameAuthorship "(Draparnaud, 1801)";
     a dwc:Taxon.

Which is better, the Darwin-SW graph model or the BiSciCol graph model?

This question cannot be answered without first defining what use cases we want to satisfy.  GOM2 is easily hackable by changing values in the mapping CSV (vs. hard coding a particular graph model), so it is possible to try different approaches and see how well they work.  GOM2 has an option to dump a serialization of a graph containing all of the triples generated from the entire metadata spreadsheet into a file that can then be loaded into a triplestore and tested via SPARQL queries.  So to answer this question, one would need to collect use cases from the TDWG community, try to satisfy them using the two models, then decide which one works better.  It may turn out that neither one works satisfactorily, in which case a new or modified graph model would need to be constructed and tested until all of the use cases that are deemed to be important are satisfied.

Darwin Core Archive extensions: a "star schema"

The Darwin Core Text Guide also provides a mechanism to link additional fielded text (CSV) extension tables to the core fielded text table.  For example, an occurrence can be documented by multiple images.  In that case, multiple records in the image (extension) table could be linked to a single record in an occurrence (core) table.  The XML metafile would specify which column in the extension table contains a reference to the unique identifier of a row in the core table.  In database terms, the extension table records have a foreign key to the primary key of the core table.  


The diagram above illustrates a situation where five images document a single occurrence.  The relationship between rows in the extension table and the row in the core table that they are linked to by their foreign key can be represented as a simple RDF graph where the resource described by a row in the core table is a node that connects many instances of resources described by the extension table.  This organization has been called a "star schema" because of the shape of the graph.  The Darwin Core Archive system allows multiple extension tables, so the bubbles linked to the central core record can be instances of more than one class.  But there cannot be more than a single resource at the center of the star.  So there are some severe limitations of the DwC-A system for linking CSV tables into complicated graphs.
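
In database terms, building the "star" is just a grouping of extension rows by their foreign key.  Here is a minimal Python sketch of that step.  The two tiny tables, their column names (id, coreid), and the identifiers in them are made up for illustration; in a real archive the key columns are whatever the metafile says they are.

```python
import csv
import io

# Hypothetical core and extension tables (placeholder identifiers)
CORE_CSV = """id,scientificName
occ1,Aesculus pavia L.
occ2,Euomphalia strigella
"""

EXT_CSV = """coreid,identifier
occ1,img-a.jpg
occ1,img-b.jpg
occ2,img-c.jpg
"""


def group_extension_rows(core_text, ext_text, core_key="id", foreign_key="coreid"):
    """Attach each extension row to the core record its foreign key points to."""
    core_ids = [row[core_key] for row in csv.DictReader(io.StringIO(core_text))]
    star = {core_id: [] for core_id in core_ids}
    orphans = []  # extension rows whose foreign key matches no core record
    for row in csv.DictReader(io.StringIO(ext_text)):
        if row[foreign_key] in star:
            star[row[foreign_key]].append(row)
        else:
            orphans.append(row)
    return star, orphans


star, orphans = group_extension_rows(CORE_CSV, EXT_CSV)
for core_id, images in star.items():
    print(core_id, "<-", [image["identifier"] for image in images])
```

Each key of the resulting dictionary is the center of one star, and there is no way in this structure for an extension row to point at anything other than a core record, which is exactly the limitation described above.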

From an RDF perspective, a simple solution to this problem would be for IRI-identified resources that have a many-to-one relationship to some other kind of resource to simply have a column that contains an IRI reference to the other resource.  The triples about each kind of resource could just be serialized separately and the IRIs would allow the triples to be connected into a graph of any complexity once they were aggregated in a triple store.  However, the problem is more complicated if the resources having the many-to-one relationship do not have assigned identifiers (i.e. are blank nodes).  In that case, links between the resources would need to be made within a single document.  (Technically, a document could describe a resource without an IRI identifier that had a link to another resource with an identifier. But it would not be possible for a Linked Data client to ask for that document, since there would be no dereferenceable subject IRI to send to the server.)  This situation is illustrated in the diagram above, where the images do not have IRIs and are represented by blank nodes.  A client could not request information about them individually, but a client requesting information about the occurrence could receive information about the images along with the information about the occurrence if they were serialized in the same document.  Since the Darwin Core Text Guide does not require the extension records to have unique identifiers, the situation just described could apply to such archives.  

GOM2 can handle multiple tables that are linked in the "star schema" pattern.  A CSV table, linked-classes.csv, lists the extension classes.  For each one, it specifies the column that contains the foreign key, the property that should be used to make the link to the core class instance, the name of the file containing the extension table, and optionally other columns that may be used to construct a fragment identifier if the resource in the extension table is assigned a URI based on the core resource rather than being a blank node.  Here is an example for the diagram above:


The "_:1" string indicates that the extension class is represented by a blank node; the numeral "1" has no particular significance.  For examples where fragment identifiers are used to construct an IRI for resources described in the extension files, see the  Guid-O-Matic detailed explanation page.  

When the GOM2 DwC-A translator script processes a Darwin Core Archive meta.xml file, it creates a linked-classes.csv table.  If the archive contains only a single core file (as in the real Andorran mollusc archive), the linked-classes.csv table will have only column headers with no data rows.  If the archive contains any extension files, GOM2 will create a row for each extension file.  

A real example of converting a Darwin Core Archive with extension tables into RDF

Since 2014, GBIF has registered a dedicated Darwin Core multimedia extension where there are many media items per core occurrence record (described in this blog post).  Bioimages has submitted its high quality occurrence metadata along with links to images using this method.  The Guid-O-Matic GitHub repo includes an old DwC-A for the Bioimages data for you to play with.  

The steps for converting this DwC-A to RDF are the same as before.  Unzip the gbif-bioimages.zip file into the directory with the translate-meta.xq script. Here's an abbreviated view of the meta.xml file:

<?xml version="1.0"?>
<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">
<core encoding="UTF-8" linesTerminatedBy="\r\n" fieldsTerminatedBy="," fieldsEnclosedBy="&quot;" ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
<files>
<location>occurrences.csv</location>
</files>
<id index="0" />
<field index="0" term="http://rs.tdwg.org/dwc/terms/occurrenceID" />
<field index="1" term="http://rs.tdwg.org/dwc/terms/basisOfRecord" />
<field index="2" term="http://purl.org/dc/terms/modified" />
...
<field index="33" term="http://rs.tdwg.org/dwc/terms/scientificNameAuthorship" />
<field index="34" term="http://rs.tdwg.org/dwc/terms/scientificName" />
<field index="35" term="http://rs.tdwg.org/dwc/terms/previousIdentifications" />
<field default="English" term="http://purl.org/dc/terms/language" />
</core>
<extension encoding="UTF-8" linesTerminatedBy="\r\n" fieldsTerminatedBy="," fieldsEnclosedBy="&quot;" ignoreHeaderLines="1" rowType="http://rs.gbif.org/terms/1.0/Multimedia">
<files>
<location>images.csv</location>
</files>
<coreid index="0" />
<field index="1" term="http://purl.org/dc/terms/type" />
<field index="2" term="http://purl.org/dc/terms/format" />
<field index="3" term="http://purl.org/dc/terms/identifier" />
...
<field index="9" term="http://purl.org/dc/terms/publisher" />
<field index="10" term="http://purl.org/dc/terms/license" />
<field index="11" term="http://purl.org/dc/terms/rightsHolder" />
<field default="English" term="http://purl.org/dc/terms/language" />
</extension>
</archive>

You can see that the metafile describes core records, located in the occurrences.csv file, and extension records, located in the images.csv file.  When the translation script is run, class list and column mapping CSV files are generated for both the occurrence and the image metadata files.  The linked-classes.csv file looks like this:
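
As a rough illustration of the information the translator has to pull out of a metafile with extensions, here is a Python sketch that lists each table in a cut-down copy of the metafile above, along with its class and the column that holds its key.  It is not the translate-meta.xq script; the describe_tables function and the dictionary keys are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"dwca": "http://rs.tdwg.org/dwc/text/"}

# Cut-down copy of the Bioimages metafile (field elements omitted)
META_XML = """<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">
<core rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
<files>
<location>occurrences.csv</location>
</files>
<id index="0" />
</core>
<extension rowType="http://rs.gbif.org/terms/1.0/Multimedia">
<files>
<location>images.csv</location>
</files>
<coreid index="0" />
</extension>
</archive>"""


def describe_tables(meta_xml):
    """List the core and extension tables with their key columns."""
    root = ET.fromstring(meta_xml)
    tables = []
    for element in root:
        # Strip the "{namespace}" part of the tag to get "core" or "extension"
        role = element.tag.split("}")[-1]
        if role not in ("core", "extension"):
            continue
        # The core table has a primary key (id); extensions have a foreign key (coreid)
        key_tag = "dwca:id" if role == "core" else "dwca:coreid"
        key = element.find(key_tag, NS)
        tables.append({
            "role": role,
            "file": element.findtext("dwca:files/dwca:location", namespaces=NS),
            "row_type": element.get("rowType"),
            "key_column": int(key.get("index")),
        })
    return tables


tables = describe_tables(META_XML)
for table in tables:
    print(table["role"], table["file"], table["row_type"], table["key_column"])
```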


Because the DwC-A metafile was not designed to facilitate RDF, it gives no indication what predicate should be used in the triple that links the extension resource to the core resource.  So the translator defaults to the generic dcterms:relation property, which simply indicates that there is some kind of relationship between the two resources.  In this case, I intend that the images serve as evidence for the occurrence records.  So I'm going to replace dcterms:relation with dsw:evidenceFor in the table.  
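
To show what the choice of linking predicate changes, here is a small Python sketch that writes extension rows as blank nodes linked to a core resource.  It is an assumption-laden illustration rather than GOM2's code: the extension_turtle function is hypothetical, the rows are reduced to two properties, and the default of dcterms:relation simply mirrors the translator's default described above.

```python
import uuid


def extension_turtle(core_iri, rows, rdf_type, link_property="dcterms:relation"):
    """Serialize extension rows as blank nodes linked to the core resource."""
    blocks = []
    for row in rows:
        # A random UUID serves as the blank node label
        lines = [f"_:{uuid.uuid4()}"]
        for predicate, value in row.items():
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            lines.append(f'     {predicate} "{escaped}";')
        lines.append(f"     {link_property} <{core_iri}>;")
        lines.append(f"     a {rdf_type}.")
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


OCCURRENCE = "http://bioimages.vanderbilt.edu/thomas/0424-01#2010-07-25"
image_rows = [
    {"dcterms:type": "StillImage", "dcterms:format": "image/jpeg"},
    {"dcterms:type": "StillImage", "dcterms:format": "image/jpeg"},
]

# With no link_property argument, the generic dcterms:relation is used
generic = extension_turtle(OCCURRENCE, image_rows, "gbifterms:Multimedia")

# Overriding the predicate expresses the intended meaning of the link
specific = extension_turtle(
    OCCURRENCE, image_rows, "gbifterms:Multimedia", link_property="dsw:evidenceFor"
)
print(specific)
```

Only the one predicate differs between the two outputs, but dsw:evidenceFor tells a consumer what the relationship actually is.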

I'll now run GOM2 as I did before, this time entering http://bioimages.vanderbilt.edu/thomas/0424-01#2010-07-25 as the occurrence to be converted to RDF.  (In the Bioimages occurrence file, the primary key identifier for the occurrence is a full HTTP IRI, so it isn't necessary to specify a default domain to prepend as was the case with the mollusc file.)  Here's the graph I get (serialized as Turtle):

<http://bioimages.vanderbilt.edu/thomas/0424-01#2010-07-25>
     dcterms:language "English";
     dwc:occurrenceID "http://bioimages.vanderbilt.edu/thomas/0424-01#2010-07-25";
     dwc:basisOfRecord "HumanObservation";
     dcterms:modified "2014-07-15T14:44:35-05:00";
     dcterms:references "http://bioimages.vanderbilt.edu/thomas/0424-01#2010-07-25";
     dwc:individualID "http://bioimages.vanderbilt.edu/thomas/0424-01";
     dwc:establishmentMeans "native";
     dwc:recordedBy "Ron Thomas";
     dwc:eventDate "2010-07-25";
     dwc:continent "North America";
     dwc:countryCode "US";
     dwc:stateProvince "Arkansas";
     dwc:county "Searcy";
     dwc:locality "Stone Cemetery Rd.";
     dwc:decimalLatitude " 36.0393";
     dwc:decimalLongitude "-92.7125";
     dwc:geodeticDatum "EPSG:4326";
     dwc:coordinateUncertaintyInMeters "500";
     dwc:georeferenceRemarks "Location of individual determined by an independent GPS measurement.";
     dwc:identifiedBy "Ron Thomas";
     dwc:dateIdentified "2010-07-25";
     dwc:kingdom "Plantae";
     dwc:order "Sapindales";
     dwc:family "Hippocastanaceae";
     dwc:genus "Aesculus";
     dwc:specificEpithet "pavia";
     dwc:taxonRank "species";
     dwc:scientificNameAuthorship "L.";
     dwc:scientificName "Aesculus pavia L.";
     a dwc:Occurrence.

_:cdd65a04-62c0-42ef-93c1-de26c61e0b17
     dcterms:type "StillImage";
     dcterms:format "image/jpeg";
     dcterms:identifier "http://bioimages.vanderbilt.edu/gq/thomas/g0424-01-01.jpg";
     dcterms:references "http://bioimages.vanderbilt.edu/thomas/0424-01-01.htm";
     dcterms:title "Aesculus pavia (Hippocastanaceae) - fruit - as borne on the plant";
     dcterms:description "Image of Aesculus pavia (Hippocastanaceae) - fruit - as borne on the plant";
     dcterms:created "2010-07-25T09:47:03-05:00";
     dcterms:creator "Ron Thomas";
     dcterms:publisher "Bioimages http://bioimages.vanderbilt.edu/";
     dcterms:license "http://creativecommons.org/licenses/by-nc-sa/3.0/";
     dsw:evidenceFor <http://bioimages.vanderbilt.edu/thomas/0424-01#2010-07-25>;
     a gbifterms:Multimedia.

_:9010f4a0-175e-4b59-8e7c-3368228df518
     dcterms:type "StillImage";
     dcterms:format "image/jpeg";
     dcterms:identifier "http://bioimages.vanderbilt.edu/gq/thomas/g0424-01-03.jpg";
     dcterms:references "http://bioimages.vanderbilt.edu/thomas/0424-01-03.htm";
     dcterms:title "Aesculus pavia (Hippocastanaceae) - leaf - margin of upper + lower surface";
     dcterms:description "Image of Aesculus pavia (Hippocastanaceae) - leaf - margin of upper + lower surface";
     dcterms:created "2010-07-25T09:49:47-05:00";
     dcterms:creator "Ron Thomas";
     dcterms:publisher "Bioimages http://bioimages.vanderbilt.edu/";
     dcterms:license "http://creativecommons.org/licenses/by-nc-sa/3.0/";
     dsw:evidenceFor <http://bioimages.vanderbilt.edu/thomas/0424-01#2010-07-25>;
     a gbifterms:Multimedia.

_:2c1405b0-e450-41e8-880e-7c1e0e247770
     dcterms:type "StillImage";
     dcterms:format "image/jpeg";
     dcterms:identifier "http://bioimages.vanderbilt.edu/gq/thomas/g0424-01-02.jpg";
     dcterms:references "http://bioimages.vanderbilt.edu/thomas/0424-01-02.htm";
     dcterms:title "Aesculus pavia (Hippocastanaceae) - fruit - unspecified";
     dcterms:description "Image of Aesculus pavia (Hippocastanaceae) - fruit - unspecified";
     dcterms:created "2010-07-25T09:56:34-05:00";
     dcterms:creator "Ron Thomas";
     dcterms:publisher "Bioimages http://bioimages.vanderbilt.edu/";
     dcterms:license "http://creativecommons.org/licenses/by-nc-sa/3.0/";
     dsw:evidenceFor <http://bioimages.vanderbilt.edu/thomas/0424-01#2010-07-25>;
     a gbifterms:Multimedia.

You can see that without any editing of the column mapping file, I already have a more complicated graph than I did with the mollusc data: there is one occurrence instance in the middle of the "star" with three image instances around the edge of the graph, linked to the occurrence by dsw:evidenceFor.  I will need to mess with the column mapping file for occurrences.csv in order to get it to conform to the Darwin-SW graph model.  I'll fix a bunch of language tag and datatyping issues, along with other specific problems I'll discuss later.

Here's how I changed the class and mapping files for the core occurrence metadata file:


Here's how I changed the class and mapping files for the extension media metadata file:


Note: Audubon Core sort of implies that there is an ac:ServiceAccessPoint class, but never actually defines one.  (A service access point describes a file containing a version of a media item having a particular size and media type, whereas the actual media item may be considered an abstract entity that is distinct from the files that contain representations of it.)  Probably it would be better just to not assert a type for the service access point, but GOM2 requires that there be some type IRI for the class, so I "minted" the fake class IRI ac:ServiceAccessPoint.

Here's the resulting graph in Turtle serialization:

<http://bioimages.vanderbilt.edu/thomas/0424-01#2010-07-25>
     dcterms:identifier "http://bioimages.vanderbilt.edu/thomas/0424-01#2010-07-25";
     dwc:basisOfRecord "HumanObservation";
     dsw:occurrenceOf <http://bioimages.vanderbilt.edu/thomas/0424-01>;
     dwc:establishmentMeans "native";
     dwc:recordedBy "Ron Thomas";
     dsw:occurrenceOf _:2d368366-a71f-4e21-a008-6defe3b06f18;
     dsw:atEvent _:2858251b-ec69-45e5-a131-4d65427e5f53;
     a dwc:Occurrence.

_:8a059be6-7954-463b-81a0-197b11b234aa
     dwc:identifiedBy "Ron Thomas";
     dwc:dateIdentified "2010-07-25"^^xsd:date;
     dwc:kingdom "Plantae";
     dwc:order "Sapindales";
     dwc:family "Hippocastanaceae";
     dwc:genus "Aesculus";
     dwc:specificEpithet "pavia";
     dwc:taxonRank "species";
     dwc:scientificNameAuthorship "L.";
     dwc:scientificName "Aesculus pavia L.";
     dsw:identifies _:2d368366-a71f-4e21-a008-6defe3b06f18;
     a dwc:Identification.

_:2d368366-a71f-4e21-a008-6defe3b06f18
     owl:sameAs <http://bioimages.vanderbilt.edu/thomas/0424-01>;
     a dwc:Organism.

_:2858251b-ec69-45e5-a131-4d65427e5f53
     dwc:eventDate "2010-07-25"^^xsd:date;
     dsw:locatedAt _:d929d47f-6d34-4ee0-ae9d-59915bd8285d;
     a dwc:Event.

_:d929d47f-6d34-4ee0-ae9d-59915bd8285d
     dwc:continent "North America";
     dwc:countryCode "US";
     dwc:stateProvince "Arkansas";
     dwc:county "Searcy";
     dwc:locality "Stone Cemetery Rd.";
     dwc:decimalLatitude " 36.0393"^^xsd:decimal;
     dwc:decimalLongitude "-92.7125"^^xsd:decimal;
     dwc:geodeticDatum "EPSG:4326";
     dwc:coordinateUncertaintyInMeters "500"^^xsd:int;
     dwc:georeferenceRemarks "Location of individual determined by an independent GPS measurement."@en;
     a dcterms:Location.

_:ae320157-e94e-4c1a-ad51-bd43d79396cf
     rdf:type dcmitype:StillImage;
     dc:type "StillImage";
     ac:hasServiceAccessPoint <http://bioimages.vanderbilt.edu/gq/thomas/g0424-01-01.jpg>;
     foaf:page <http://bioimages.vanderbilt.edu/thomas/0424-01-01.htm>;
     dcterms:title "Aesculus pavia (Hippocastanaceae) - fruit - as borne on the plant"@en;
     dcterms:description "Image of Aesculus pavia (Hippocastanaceae) - fruit - as borne on the plant"@en;
     dcterms:created "2010-07-25T09:47:03-05:00"^^xsd:dateTime;
     dc:creator "Ron Thomas";
     dc:publisher "Bioimages http://bioimages.vanderbilt.edu/";
     dcterms:license <http://creativecommons.org/licenses/by-nc-sa/3.0/>;
     ac:hasServiceAccessPoint _:20c439a5-1ca6-42b6-8062-52bc873bb3a4;
     dsw:evidenceFor <http://bioimages.vanderbilt.edu/thomas/0424-01#2010-07-25>;
     a gbifterms:Multimedia.

_:20c439a5-1ca6-42b6-8062-52bc873bb3a4
     ac:variant ac:GoodQuality;
     dc:format "image/jpeg";
     owl:sameAs <http://bioimages.vanderbilt.edu/gq/thomas/g0424-01-01.jpg>;
     a ac:ServiceAccessPoint.

_:26c7497f-23ed-4a61-808d-40958d4f893e
     rdf:type dcmitype:StillImage;
     dc:type "StillImage";
     ac:hasServiceAccessPoint <http://bioimages.vanderbilt.edu/gq/thomas/g0424-01-03.jpg>;
     foaf:page <http://bioimages.vanderbilt.edu/thomas/0424-01-03.htm>;
     dcterms:title "Aesculus pavia (Hippocastanaceae) - leaf - margin of upper + lower surface"@en;
     dcterms:description "Image of Aesculus pavia (Hippocastanaceae) - leaf - margin of upper + lower surface"@en;
     dcterms:created "2010-07-25T09:49:47-05:00"^^xsd:dateTime;
     dc:creator "Ron Thomas";
     dc:publisher "Bioimages http://bioimages.vanderbilt.edu/";
     dcterms:license <http://creativecommons.org/licenses/by-nc-sa/3.0/>;
     ac:hasServiceAccessPoint _:d53b092f-cdcc-4be0-a498-d0778a259795;
     dsw:evidenceFor <http://bioimages.vanderbilt.edu/thomas/0424-01#2010-07-25>;
     a gbifterms:Multimedia.

_:d53b092f-cdcc-4be0-a498-d0778a259795
     ac:variant ac:GoodQuality;
     dc:format "image/jpeg";
     owl:sameAs <http://bioimages.vanderbilt.edu/gq/thomas/g0424-01-03.jpg>;
     a ac:ServiceAccessPoint.

_:8d8c96e9-fea9-441b-99a5-555f8816685c
     rdf:type dcmitype:StillImage;
     dc:type "StillImage";
     ac:hasServiceAccessPoint <http://bioimages.vanderbilt.edu/gq/thomas/g0424-01-02.jpg>;
     foaf:page <http://bioimages.vanderbilt.edu/thomas/0424-01-02.htm>;
     dcterms:title "Aesculus pavia (Hippocastanaceae) - fruit - unspecified"@en;
     dcterms:description "Image of Aesculus pavia (Hippocastanaceae) - fruit - unspecified"@en;
     dcterms:created "2010-07-25T09:56:34-05:00"^^xsd:dateTime;
     dc:creator "Ron Thomas";
     dc:publisher "Bioimages http://bioimages.vanderbilt.edu/";
     dcterms:license <http://creativecommons.org/licenses/by-nc-sa/3.0/>;
     ac:hasServiceAccessPoint _:6e66b906-3702-4567-94cb-a32eb95925da;
     dsw:evidenceFor <http://bioimages.vanderbilt.edu/thomas/0424-01#2010-07-25>;
     a gbifterms:Multimedia.

_:6e66b906-3702-4567-94cb-a32eb95925da
     ac:variant ac:GoodQuality;
     dc:format "image/jpeg";
     owl:sameAs <http://bioimages.vanderbilt.edu/gq/thomas/g0424-01-02.jpg>;
     a ac:ServiceAccessPoint.
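
The owl:sameAs triples in this graph tie blank nodes to independently defined IRIs (notes 3 and 5 below explain why).  Here is a minimal Python sketch of what a sameAs-aware client might do with them.  It is only an illustration: the function name smush is my own invention, triples are plain tuples rather than objects from a real RDF library, and real owl:sameAs inferencing is symmetric and adds triples rather than rewriting them.

```python
SAME_AS = "owl:sameAs"

def smush(triples):
    """Rewrite blank nodes to the IRI they are declared owl:sameAs (simplified)."""
    # Map each blank node label to its IRI, e.g. (_:x, owl:sameAs, <iri>)
    alias = {s: o for s, p, o in triples if p == SAME_AS and s.startswith("_:")}
    merged = []
    for s, p, o in triples:
        if p == SAME_AS and s in alias:
            continue  # the link itself is no longer needed once applied
        triple = (alias.get(s, s), p, alias.get(o, o))
        if triple not in merged:  # collapse triples that become identical
            merged.append(triple)
    return merged

# A few triples copied from the graph above
occ = "<http://bioimages.vanderbilt.edu/thomas/0424-01#2010-07-25>"
org_iri = "<http://bioimages.vanderbilt.edu/thomas/0424-01>"
org_bnode = "_:2d368366-a71f-4e21-a008-6defe3b06f18"
ident = "_:8a059be6-7954-463b-81a0-197b11b234aa"

triples = [
    (occ, "dsw:occurrenceOf", org_iri),
    (occ, "dsw:occurrenceOf", org_bnode),
    (ident, "dsw:identifies", org_bnode),
    (org_bnode, SAME_AS, org_iri),
    (org_bnode, "rdf:type", "dwc:Organism"),
]

merged = smush(triples)
for triple in merged:
    print(" ".join(triple), ".")
```

After the rewrite, the two dsw:occurrenceOf triples for the occurrence collapse into one, and the identification and the type assertion both point at the organism IRI instead of the blank node.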

Notes:
  1. There are a number of places where I substituted the legacy dc: namespace Dublin Core terms for dcterms: namespace terms.  (Some people have muddied the waters by suggesting that now dc: should be used as an abbreviation for what has traditionally been abbreviated as dcterms:, i.e. http://purl.org/dc/terms/.  Bad, bad, bad!  In this blog, I still use dc: to refer to the legacy namespace http://purl.org/dc/elements/1.1/ .) The blog post that describes how to use the GBIF DwC-A multimedia extension specifies that the dcterms: terms should be provided.  However, for a number of terms (e.g. dcterms:publisher), Dublin Core provides a range declaration for the property that implies it should be used with a non-literal object.  There is no range declaration for the dc: analog, so it is considered best practice to use the dc: analog with a literal object.
  2. In accordance with the Darwin Core RDF Guide section 2.6, the Darwin Core "ID" terms should not be used in RDF.  Since dwc:occurrenceID referred to the identifier of the subject resource, its predicate was changed to dcterms:identifier.  
  3. The dwc:individualID (now dwc:organismID) value is a tougher nut to crack. In the Darwin Core Archive, it was serving as a foreign key to another IRI-identified resource, the organism.  So a dsw:occurrenceOf object property was substituted for dwc:individualID.  However, GOM2 does not have a mechanism for supplying an independently defined subject IRI for resources having one-to-one relationships with the root resource (in this case the occurrence).  GOM2 only allows for fragment identifier-appended IRIs or (as in this case) blank nodes.  So I did a little trick of asserting that the organism IRI was owl:sameAs the blank node identifier (_:2d368366-a71f-4e21-a008-6defe3b06f18).  A client that supports sameAs inferencing would then substitute the organism IRI in every triple that was explicitly asserted about the blank node.  
  4. I applied datatypes to several date columns; each datatype fits most, but not all, of the dates in the Bioimages database.  For example, most images have creation dates that conform to xsd:dateTime, since those dates were automatically recorded by a digital camera.  However, for a few old scanned images, there are some dates in the database whose format only conforms to xsd:date or even xsd:gYear.  Applying the xsd:dateTime datatype to those dates would cause RDF database software that's picky about datatyping (like Stardog) to refuse to load the dataset.  This is a data cleaning problem that isn't so important in a demonstration like this, but which would need to be considered if DwC-A files were routinely used as a data source for really aggregating biodiversity data as RDF.
  5. The GBIF multimedia extension instructions specify that dcterms:identifier should be used for "the public URL that identifies and locates the media file directly, not the html page it might be shown on".  That is exactly the meaning of ac:hasServiceAccessPoint.  I'm not sure why this choice was made.  I suppose it is because in this "flat" representation of a multimedia item, no distinction is made between a URI for the media item itself and the URL that retrieves a representation of the multimedia item.  That isn't so important if there is only one version of the media item, but in the case of Bioimages, the image is available in four sizes (thumbnail, lower quality, good quality, and best quality=the original JPEG from the camera).  So in Bioimages there are four service access points for each image; the one listed here is only the "good quality" one.  In order to generate and link to the service access point (which has an independently defined subject IRI), I used the same owl:sameAs trick that I used with the organism instance.  I also made dc:format be a property of the service access point rather than the media item, since the same image could (in theory) be served as BMP, GIF, PNG, or other image formats.  Each of these would be represented by separate service access points. Because there is no RDF implementation guide for Audubon Core, it's the Wild West, and providers currently do whatever seems right to them. 
  6. I substituted foaf:page for the dcterms:references property required in the GBIF DwC-A multimedia extension.  There isn't really anything wrong with dcterms:references, it is just a very generic link, whereas foaf:page is another well-known term that implies that the object is a document. 
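
The datatyping problem in note 4 could be handled in a data cleaning step before the archive is converted.  Here is a rough Python sketch of one way to do that: pick the datatype from the form of each date string instead of applying one datatype to the whole column.  This is not something GOM2 does; the function name date_literal is hypothetical, the regular expressions are simplified versions of the XSD lexical forms (no negative years or fractional seconds), and the year "1987" is a made-up value for illustration.

```python
import re

# Simplified lexical patterns, checked from most to least specific
PATTERNS = [
    ("xsd:dateTime",
     re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(Z|[+-]\d{2}:\d{2})?")),
    ("xsd:date", re.compile(r"\d{4}-\d{2}-\d{2}")),
    ("xsd:gYear", re.compile(r"\d{4}")),
]

def date_literal(value):
    """Return a Turtle literal typed according to the shape of the date string."""
    value = value.strip()
    for datatype, pattern in PATTERNS:
        if pattern.fullmatch(value):
            return f'"{value}"^^{datatype}'
    # Unrecognized form: emit a plain literal rather than a wrong datatype
    return f'"{value}"'

print(date_literal("2010-07-25T09:47:03-05:00"))  # camera-recorded timestamp
print(date_literal("2010-07-25"))                 # date only
print(date_literal("1987"))                       # hypothetical scanned-image year
```

Falling back to an untyped literal keeps a picky triplestore from rejecting the whole dataset over a handful of odd values, at the cost of those values not being comparable as dates.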


Why is it important to play with this kind of thing?

Darwin Core Archives and the "star schema" approach have been highly successful for getting occurrence data from providers to GBIF.  However, the star schema is a very limited graph model and currently GBIF only really supports two classes as core file types: occurrence and taxon.  There are people (like me and others who track organisms over time) who might prefer to have an option for dwc:Organism as the core file type with dwc:Identification and dwc:Occurrence as extensions (one organism with many identifications and many occurrences).  Some may want dwc:MaterialSample as the core resource.  Others might like to have an option for dwc:Event as the core file type, with dwc:Occurrence as an extension (one event with many occurrences).  Then there is the use case where dwc:Occurrence is documented by multiple forms of evidence (specimens, photographs, DNA samples, etc.).  A star schema and corresponding DwC-A type could be designed for all of these use cases, but it is difficult to see how they could all be easily merged into a single database unless it were graph-based.  

A graph-based system could also link to IRI-identified resources outside of the biodiversity informatics domain, such as DOIs and ISBNs for publications, ORCID and VIAF identifiers for people, and GeoNames and Getty TGN identifiers for places.  Again, a graph-based system could easily suck in data on these types of resources and make them available for querying.  

Although it would be possible to use a graph database system that doesn't necessarily depend on IRIs (such as Neo4j), linking together resources from diverse sources would be easier if those resources had globally unique identifiers.  RDF-related technology is probably the most well-developed way to do that kind of linking.  Even if you are a hard-core Linked Data and HTTP IRI skeptic, and a fervent believer in UUIDs, you could easily turn your UUIDs into IRIs by prepending "urn:uuid:" to them and they could play well in an RDF triplestore.  Minting and maintaining stable GUIDs is a critical problem to be solved before large-scale aggregation of data on diverse kinds of resources can be accomplished.  But that shouldn't stop us from experimenting with solving the other problem of deciding on object properties to connect important classes of resources.  GOM2 is intended to help facilitate the kind of experimenting with graph models that is necessary to decide what model will best satisfy identified use cases.
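
To show how little work the UUID-to-IRI conversion is, here is a short Python sketch using the standard library's uuid module.  The function name uuid_to_iri is my own, and the UUID is just borrowed from one of the blank node labels in the graph above for illustration.

```python
import uuid

def uuid_to_iri(value):
    """Validate a UUID string and return it as a urn:uuid: IRI."""
    # uuid.UUID raises ValueError for malformed input and normalizes case;
    # its .urn attribute supplies the "urn:uuid:" prefix.
    return uuid.UUID(value).urn

iri = uuid_to_iri("2D368366-A71F-4E21-A008-6DEFE3B06F18")
print(f"<{iri}> a dwc:Organism.")  # the IRI used as a Turtle subject
```

Going through uuid.UUID rather than simple string concatenation means that malformed identifiers get caught before they end up in a triplestore.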
