Sunday, November 6, 2016

Guid-O-Matic meets the DwC-A RDF octopuses

Real Darwin Core data as RDF

Recently, I undertook a project to look at real data expressed as Darwin Core, with the goal of examining how the providers structured the data.  In particular, I wanted to examine data that was packaged in a variety of Darwin Core archives, each having a different core class at the center. My ultimate goal was to see if I could impose a graph model on the various datasets that would allow them to be aggregated as RDF triples that could be searched simultaneously via SPARQL.  

Darwin Core RDF in the wild

As a preliminary exercise, I took a look at several sources of Darwin Core-described data that have been exposed as RDF.  One that has been around as RDF for a number of years is the Arctos database at http://arctos.database.museum/.  If you put http://arctos.database.museum/guid/MVZ:Mamm:165861 into a browser, you'll get a specimen web page.  However, if you use a client such as cURL, Postman, or Advanced REST Client and send the request header Accept: application/rdf+xml, you will get the representation at this Gist.  I had to add a missing geo:location property to the geo:Point blank node and escape the left angle brackets in the dwc:TypeStatus HTML.  Then I pasted it into RDF Translator and turned it into the more human-readable Turtle serialization:

<http://arctos.database.museum/guid/MVZ:Mamm:165861> 
    dcterms:created "1999-02-04" ;
    dcterms:creator "unknown" ;
    dcterms:description "Mammal specimens 165861 Ctenomys sociabilis" ;
    dcterms:hasVersion <http://arctos.database.museum/guid/MVZ:Mamm:165861> ;
    dcterms:modified "2016-01-19" ;
    dcterms:title "MVZ:Mamm:165861 - Mammal specimens 165861 Ctenomys sociabilis" ;
    dwc:AgeClass "" ;
    dwc:BasisOfRecord "PreservedSpecimen" ;
    dwc:CatalogNumber "165861" ;
    dwc:Class "Mammalia" ;
    dwc:CollectionCode "Mamm" ;
    dwc:Collector "Oliver P. Pearson" ;
    dwc:ContinentOcean "South America" ;
    dwc:CoordinateUncertaintyInMeters "30" ;
    dwc:Country "Argentina" ;
    dwc:County "" ;
    dwc:DecimalLatitude "-71.1918056" ;
    dwc:DecimalLongitude "-40.9684722" ;
    dwc:EarliestDateCollected "1983-11-16" ;
    dwc:Family "Ctenomyidae" ;
    dwc:GenBankNum "" ;
    dwc:Genus "Ctenomys" ;
    dwc:GeoreferenceProtocol "MaNIS georeferencing guidelines" ;
    dwc:GeoreferenceSource "Google Earth (assumed v.6)" ;
    dwc:HigherGeography "South America, Argentina, Neuquen" ;
    dwc:HigherTaxon "Animalia; Chordata; Mammalia; Rodentia; Ctenomyidae; Ctenomyini; Ctenomys sociabilis Pearson and Christie, 1985" ;
    dwc:HorizontalDatum "World Geodetic System 1984" ;
    dwc:IdentifiedBy "Oliver P. Pearson" ;
    dwc:ImageURL "http://arctos.database.museum/MediaSearch.cfm?action=search&media_id=10473504,10472904,10472905,10472906,10472907" ;
    dwc:IndividualCount "1" ;
    dwc:InstitutionCode "MVZ" ;
    dwc:Island "" ;
    dwc:IslandGroup "" ;
    dwc:Kingdom "Animalia" ;
    dwc:LatestDateCollected "1983-11-16" ;
    dwc:Locality "Estancia Fortin Chacabuco, 3 km S and 2 km W Cerro Puntudo, Depto. Los Lagos" ;
    dwc:MaximumDepthInMeters "" ;
    dwc:MaximumElevationInMeters "1075" ;
    dwc:MinimumDepthInMeters "" ;
    dwc:MinimumElevationInMeters "1075" ;
    dwc:Order "Rodentia" ;
    dwc:OriginalCoordinateSystem "decimal degrees" ;
    dwc:OtherCatalogNumbers "collector number=7101" ;
    dwc:Phylum "Chordata" ;
    dwc:Preparations "skull; study skin; skeleton" ;
    dwc:RelatedCatalogedItems "" ;
    dwc:Remarks "" ;
    dwc:SampleID "http://arctos.database.museum/guid/MVZ:Mamm:165861" ;
    dwc:ScientificName "Ctenomys sociabilis" ;
    dwc:Sex "female" ;
    dwc:Species "Ctenomys sociabilis" ;
    dwc:StateProvince "Neuquen" ;
    dwc:Subspecies "" ;
    dwc:TypeStatus "holotype of <a href=\"http://arctos.database.museum/name/Ctenomys sociabilis\">Ctenomys sociabilis</a>, page 338 in <a href=\"http://arctos.database.museum/publication/10000266\">Pearson and Christie 1985</a>" ;
    dwc:VerbatimCollectingDate "16 Nov 1983" ;
    dwc:VerbatimCoordinates "-40.9684722/-71.1918056" ;
    dwc:VerbatimElevation "1075-1075 m" ;
    geo:location [ a geo:Point ;
            geo:lat "-40.9684722" ;
            geo:long "-71.1918056" ] .

Notes:
1. This is a relatively old effort which at least predates the Darwin Core RDF Guide, and which is clearly labeled (in XML comments) as experimental.  Some of the DwC properties used are not in the current standard (e.g. dwc:EarliestDateCollected) and all are capitalized (not the case for any current DwC properties). 
2. There is no explicitly specified class for the subject URI-identified resource (i.e. no rdf:type property), although the XML comments note that the metadata are about a specimen.  This is confirmed by the dwc:BasisOfRecord value.
3. The script that generates this creates empty string values (objects) for properties (predicates) that don't have values.  It would probably be better just to not generate those triples.
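Here's a minimal sketch of what note 3 suggests: emit a triple only when the property actually has a value.  This is not the Arctos script (which I haven't seen); the function name turtle_lines and the record dictionary are made up for illustration, and values are assumed to be single-line strings.

```python
def turtle_lines(subject_iri, record):
    """Return Turtle lines for one flat record, leaving out empty values."""
    # Keep only the properties that actually have a value.
    pairs = [(prop, value) for prop, value in record.items() if value.strip()]
    if not pairs:
        return []  # nothing to say about this subject
    lines = [f"<{subject_iri}>"]
    for position, (prop, value) in enumerate(pairs):
        # Escape backslashes and double quotes so the literal stays valid Turtle.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        terminator = " ." if position == len(pairs) - 1 else " ;"
        lines.append(f'    {prop} "{escaped}"{terminator}')
    return lines


# Hypothetical record: a few of the Arctos properties, two of them empty.
record = {
    "dwc:CatalogNumber": "165861",
    "dwc:County": "",
    "dwc:Country": "Argentina",
    "dwc:Island": "",
}

for line in turtle_lines("http://arctos.database.museum/guid/MVZ:Mamm:165861", record):
    print(line)
```

With this approach the dwc:County and dwc:Island triples simply never appear, so a consumer doesn't have to tell "empty string" apart from "no value".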

Another place I looked was at the Implementers list at http://cetaf.org/cetaf-stable-identifiers.  The Consortium of European Taxonomic Facilities (CETAF) Stable Identifiers Initiative has the laudable goal of encouraging natural history collections to implement persistent, semantic web-compatible URIs for their specimens.  Participation as an implementer doesn't require that the URIs be dereferenceable as machine-readable representations, but a number of the providers do that.  It was very interesting exploring the data provided by various providers, but I would be getting distracted if I commented too much about it in this post (maybe I will in a different post in the future).  

There is a website associated with the CETAF Stable Identifiers Initiative at http://herbal.rbge.info/ which provides a URI tester and documentation.  The documentation describes three levels of implementation, with Level 3 specifying that data are encoded in RDF.  The documentation mentions the CETAF Specimen Preview Profile (CSPP), a minimum set of agreed RDF metadata elements implemented consistently across CETAF organizations.  I don't have any criticisms of the CSPP, although I would quibble with their use of dcterms:creator, dcterms:type, and dcterms:publisher in the example of a CSPP-compliant document. [1]  
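Whether a given specimen URI returns RDF can be checked with the same kind of request I used for the Arctos record earlier: a GET with an Accept header asking for application/rdf+xml.  Here's a sketch using only Python's standard library.  The function names build_rdf_request and fetch_rdf are my own, and what a particular server actually sends back is not something this sketch can promise.

```python
import urllib.request


def build_rdf_request(uri, media_type="application/rdf+xml"):
    """Build (but do not send) a GET request that asks for an RDF representation."""
    return urllib.request.Request(uri, headers={"Accept": media_type})


def fetch_rdf(uri, media_type="application/rdf+xml", timeout=30):
    """Send the request and return (content type, body).  Needs network access."""
    request = build_rdf_request(uri, media_type)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read().decode("utf-8", errors="replace")
        return response.headers.get("Content-Type"), body


# Inspect the request without touching the network.
request = build_rdf_request("http://arctos.database.museum/guid/MVZ:Mamm:165861")
print(request.get_method(), request.full_url)
print("Accept:", request.get_header("Accept"))
```

Calling fetch_rdf on a specimen URI and looking at the returned content type is a quick way to spot the kind of content-negotiation problems that turn up with some providers.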

I looked through the RDF for the example resources at http://herbal.rbge.info/md.php?q=implementers, which was a little difficult because there were content-negotiation problems with some of the sites.  I picked a nice example record from the Muséum national d’Histoire naturelle, Paris for https://science.mnhn.fr/catalognumber/mnhn/p/p00084058 to examine.  Here it is converted to Turtle:

<http://coldb.mnhn.fr/catalognumber/mnhn/p/p00084058> 
    a dwc:Occurrence ;
    dcterms:bibliographicCitation """Muséum national d’Histoire naturelle,  Paris (France)
    Specimen MNHN-P-P00084058
    http://coldb.mnhn.fr/catalognumber/mnhn/p/p00084058""";
    dcterms:identifier <http://coldb.mnhn.fr/catalognumber/mnhn/p/p00084058> ;
    dcterms:publisher <https://science.mnhn.fr/institution/mnhn/collection/p/item/search> ;
    dcterms:title "Specimen - holotype Calendula arvensis Boiss."@en,
        "Spécimen - holotype Calendula arvensis Boiss."@fr ;
    dcterms:type <http://purl.org/dc/dcmitype/PhysicalObject> ;
    dwc:associatedMedia <http://imager.mnhn.fr/imager3/w400/media/1442335066494u157WCqMqGrWsML5> ;
    dwc:basisOfRecord "PreservedSpecimen" ;
    dwc:catalogNumber "P00084058" ;
    dwc:collectionCode "P" ;
    dwc:country "Algérie" ;
    dwc:countryCode "dz" ;
    dwc:family "Asteraceae" ;
    dwc:genus "Calendula" ;
    dwc:identifiedBy "R. Maire" ;
    dwc:institutionCode "MNHN" ;
    dwc:occurrenceID "http://coldb.mnhn.fr/catalognumber/mnhn/p/p00084058" ;
    dwc:recordedBy "Pomel, A.N." ;
    dwc:scientificName "Calendula arvensis Boiss." ;
    dwc:scientificNameAuthorship "Boiss." ;
    dwc:specificEpithet "arvensis" ;
    dwc:typeStatus "holotype" ;
    dwc:verbatimLocality "O. [Oranais] St. Louis." ;
    rdfs1:comment "Describing the specimen (data)." .

I only listed the specimen part of the document - there were other descriptions of images and documents.  

Note:
1. Although the dwc:basisOfRecord is "PreservedSpecimen" and the dcterms:type is <http://purl.org/dc/dcmitype/PhysicalObject>, the actual rdf:type of the resource is dwc:Occurrence.  So this description is following the practice of considering specimens to be occurrences, rather than considering specimens to be evidence for occurrences.  More on this later.  

Here's a more minimalist example from the Harvard Herbarium for http://mczbase.mcz.harvard.edu/guid/MCZ:Mamm:53924, converted to Turtle:

<http://purl.oclc.org/net/edu.harvard.huh/guid/uuid/413bd724-7a92-41fe-befa-c7e1678b54a7>
    a dwc:Occurrence ;
    dcterms:modified "2010-08-16 16:18:42" ;
    dcterms:references <http://kiki.huh.harvard.edu/databases/specimen_search.php?mode=details&id=109482> ;
    dwciri:recordedBy <http://purl.oclc.org/net/edu.harvard.huh/guid/uuid/327b54e3-0053-49e9-8ecd-b988d256f48a> ;
    dwc:catalogNumber "barcode-00125895" ;
    dwc:collectionCode "A" ;
    dwc:country "China" ;
    dwc:recordedBy "H. T. Tsai" ;
    dwc:scientificName "Calendula lappa Linnaeus" ;
    dwc:scientificNameAuthorship "Linnaeus" ;
    dwc:stateProvince "Yunnan" .

Notes:  
1. The record has a dwciri:recordedBy link to a dereferenceable URI for the collector, which is super-cool.  
2. This one is typed as an occurrence, but it has a catalog number, so I'm pretty sure that the subject is also supposed to be a specimen. 

Sea urchin image by David Monniaux (Own work) CC-BY-SA-3.0 via Wikimedia Commons


The RDF occurrence "sea urchin" model

All three of these examples have several features in common:

1. Properties that are listed under different classes in the DwC Quick Reference guide such as Taxon, Identification, Event, or Location are all assigned to an instance of a single class (specimen or occurrence).  In database terms, we could describe the model as very "flat" (as opposed to normalized).  In RDF terms, we could describe the model as being oriented around a single IRI-identified node.

2. The objects (i.e. values) of the triples are generally literals (represented as boxes in the diagram) rather than URIs.  This is the easiest way to generate the RDF if the values are already present in a database table as strings.

3. The source data could also fairly easily be stored in a single core occurrence table in a DwC Archive, since all of the properties and values have a one-to-one relationship with the subject resource. 

4. All three of these example graphs [2] could easily be aggregated into the same triple store and be queried for basic information about the specimens.   

With respect to RDF, I've dubbed this model the "sea urchin", because there is a single resource in the center of the graph with many dead-end "spines" (literal values) surrounding it.  I call the literal values "dead ends" because they can't serve as the subjects of other RDF triples.  

Overall, this is a relatively sensible approach if your primary use case is to integrate your specimen data with other institutions' specimen data and if your goal is to just get your data out as RDF in the easiest way possible.  However, because of the lack of links to IRI objects, this is a relatively depauperate form of Linked Data.  It also would be difficult to integrate with other sorts of biodiversity data whose core resources were not specimens (core resources such as taxa, events, or organisms).  
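To make the "spines" idea concrete, here's a small sketch that treats a graph as a list of (subject, predicate, object) tuples and counts, for each central resource, how many of its outgoing edges end in a literal and how many lead to another IRI.  The Literal wrapper and the spines function are invented for this illustration; only a handful of triples from the examples above are included.

```python
from collections import namedtuple

# A literal object is wrapped; a plain string stands for an IRI.
Literal = namedtuple("Literal", "value")

ARCTOS = "http://arctos.database.museum/guid/MVZ:Mamm:165861"
MNHN = "http://coldb.mnhn.fr/catalognumber/mnhn/p/p00084058"

triples = [
    (ARCTOS, "dwc:Country", Literal("Argentina")),
    (ARCTOS, "dwc:Genus", Literal("Ctenomys")),
    (MNHN, "dwc:country", Literal("Algérie")),
    (MNHN, "dwc:genus", Literal("Calendula")),
    (MNHN, "dcterms:type", "http://purl.org/dc/dcmitype/PhysicalObject"),
]


def spines(graph, subject):
    """Split a subject's outgoing edges into literal 'spines' and IRI links."""
    literal_edges = [(p, o.value) for s, p, o in graph
                     if s == subject and isinstance(o, Literal)]
    iri_edges = [(p, o) for s, p, o in graph
                 if s == subject and not isinstance(o, Literal)]
    return literal_edges, iri_edges


for center in (ARCTOS, MNHN):
    literal_edges, iri_edges = spines(triples, center)
    print(center)
    print("  literal spines (dead ends):", len(literal_edges))
    print("  links to other IRIs:", len(iri_edges))
```

Only the objects counted in the second group could ever become the subject of further triples, which is why a graph made almost entirely of spines has so little to link to.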

"Starfish" schemas

In a recent post, I discussed how linking Darwin Core (DwC) extension tables to a core table in a Darwin Core Archive (DwC-A) represented a simple data organization pattern that has been called a "star schema".  The "star" part of the name comes from the fact that the related resource tables are linked to a single focal resource table that sits at the center.

In the diagram below, the center is the core table, which contains metadata on occurrences. The arms of the starfish represent other classes that could theoretically be linked to the class of the core table via extension tables.  (It's questionable whether some of the classes shown are really appropriate to link to occurrences, but I wanted to have enough classes for all five arms.)

Image by Paul Shaffner CC-BY via Wikimedia Commons

In the starfish diagram above, each link to the center is from a class represented by an entire table, but the graph pattern applies on the instance level as well.  We would typically choose to use an extension table if there were a many-to-one relationship between the instances in the extension table and instances in the core table.  We can diagram that situation like this:


In this graph, a single occurrence instance is documented by five images.  In the image extension table, five rows are linked through a foreign key to the same one row in the core table.  The diagram below the tables shows how we could represent this same structure as an RDF graph.  The bubbles on the periphery represent the five image instances that are linked to a single occurrence instance in the center.  

The diagram shows the situation in which there is only a single extension table. If there were several tables, then the many bubbles in the periphery would represent one to many instances of each of several classes.  In the spirit of the marine metaphor, I'm going to dub this graph pattern "starfish schema".   

The starfish schema is similar to the sea urchin model in that there is a single focal resource in the center, but the surrounding linked resources are IRI-identified (or blank) nodes, not literals.  This model is more versatile than the sea urchin model because each of the surrounding resources can be the subjects of additional triples, and because there can be many-to-one relationships with the surrounding resources rather than exclusively one-to-one relationships (as is the case with the literals in the sea urchin model).  
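Here's a rough sketch of the five-images-to-one-occurrence situation, showing how foreign keys in an extension table turn into links between IRI-identified nodes.  Everything specific in it is hypothetical: the http://example.org/ root, the row contents, and the ex:documents predicate are placeholders, not terms from any standard.

```python
from collections import defaultdict

BASE = "http://example.org/"  # hypothetical root IRI

# One core row and five extension rows that point at it through "coreid".
core_rows = [{"id": "occ1", "basisOfRecord": "HumanObservation"}]
image_rows = [{"coreid": "occ1", "identifier": f"img{n}"} for n in range(1, 6)]


def starfish_triples(core, extension, link_predicate="ex:documents"):
    """Link each extension instance to the core instance its foreign key names."""
    core_ids = {row["id"] for row in core}
    triples = []
    for row in extension:
        if row["coreid"] not in core_ids:
            continue  # orphaned extension row: no core row to attach to
        triples.append((BASE + row["identifier"], link_predicate, BASE + row["coreid"]))
    return triples


triples = starfish_triples(core_rows, image_rows)

# Group the peripheral nodes by the central node they are linked to.
arms = defaultdict(list)
for subject, _, obj in triples:
    arms[obj].append(subject)

for center, linked in arms.items():
    print(center, "is documented by", len(linked), "images")
```

Because each image has its own IRI, it can go on to be the subject of more triples (a format, a creator, a license), which the literal spines of the sea urchin cannot.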

It is possible for the instances of the surrounding resources to be linked to resources elsewhere if the surrounding resources have properties with URI-values.  However, information about those resources linked outside of the immediate circle of "bubbles" cannot be stored in a single DwC-A, since a DwC-A can only have a single core table connected by a single foreign key in each of the extension tables.  Thus a provider must choose one core class that is the focus of their dataset and figure out a way to flatten all the rest of their data to fit into the extension tables.  Because different providers prefer different core classes based on the focus of their dataset, it becomes difficult to directly merge the data stored in the DwC-A.  

In the examples that follow, we'll look at DwC-A-packaged datasets having different core resources, and see how they can be converted into RDF using the Darwin-SW model in a way that will allow them to be easily aggregated.  I recently made a request on the tdwg-content email list for suggestions of a variety of such datasets and got some really nice examples.  Thanks to all who helped me out on this quest!


Event core with Occurrence extension: Vascular plants of South Northumberland and Durham, UK

The first example was provided by Quentin Groom of Botanic Garden Meise, Belgium, based on Groom Q, Durkin J, O'Reilly J, Mclay A, Richards A, Angel J, Horsley A, Rogers M, Young G (2015) A benchmark survey of the common plants of South Northumberland and Durham, United Kingdom. Biodiversity Data Journal 3: e7318. http://dx.doi.org/10.3897/BDJ.3.e7318.  Its DwC-A is downloadable from GBIF and the dataset has the citation "Botanical Garden Meise: A common plants survey of vascular plants in South Northumberland and Durham, United Kingdom. http://dx.doi.org/10.3897/bdj.3.e7318 Accessed via http://www.gbif.org/dataset/5d784d06-fa1d-4f00-8cdc-663d04d26061 on 2016-10-26".  The dataset contains information about 42 517 occurrences.

Here's a diagram of what the simplified graph for the dataset looks like:


The core file contains rows representing Event instances, each assigned a UUID identifier.  The extension file contains rows representing Occurrence instances, each assigned a locally unique opaque identifier.  Here's what the meta.xml file of the DwC-A looks like:

<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">
  <core encoding="UTF-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n" fieldsEnclosedBy="" ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Event">
    <files>
      <location>event.txt</location>
    </files>
    <id index="0" />
    <field index="1" term="http://purl.org/dc/terms/type"/>
    <field index="2" term="http://purl.org/dc/terms/language"/>
    <field index="3" term="http://purl.org/dc/terms/rightsHolder"/>
    <field index="4" term="http://purl.org/dc/terms/bibliographicCitation"/>
    <field index="5" term="http://rs.tdwg.org/dwc/terms/datasetID"/>
    <field index="6" term="http://rs.tdwg.org/dwc/terms/datasetName"/>
    <field index="7" term="http://rs.tdwg.org/dwc/terms/eventID"/>
    <field index="8" term="http://rs.tdwg.org/dwc/terms/samplingProtocol"/>
    <field index="9" term="http://rs.tdwg.org/dwc/terms/sampleSizeValue"/>
    <field index="10" term="http://rs.tdwg.org/dwc/terms/sampleSizeUnit"/>
    <field index="11" term="http://rs.tdwg.org/dwc/terms/samplingEffort"/>
    <field index="12" term="http://rs.tdwg.org/dwc/terms/eventDate"/>
    <field index="13" term="http://rs.tdwg.org/dwc/terms/year"/>
    <field index="14" term="http://rs.tdwg.org/dwc/terms/habitat"/>
    <field index="15" term="http://rs.tdwg.org/dwc/terms/continent"/>
    <field index="16" term="http://rs.tdwg.org/dwc/terms/country"/>
    <field index="17" term="http://rs.tdwg.org/dwc/terms/countryCode"/>
    <field index="18" term="http://rs.tdwg.org/dwc/terms/county"/>
    <field index="19" term="http://rs.tdwg.org/dwc/terms/locality"/>
    <field index="20" term="http://rs.tdwg.org/dwc/terms/verbatimCoordinates"/>
    <field index="21" term="http://rs.tdwg.org/dwc/terms/verbatimCoordinateSystem"/>
    <field index="22" term="http://rs.tdwg.org/dwc/terms/verbatimSRS"/>
    <field index="23" term="http://rs.tdwg.org/dwc/terms/decimalLatitude"/>
    <field index="24" term="http://rs.tdwg.org/dwc/terms/decimalLongitude"/>
    <field index="25" term="http://rs.tdwg.org/dwc/terms/geodeticDatum"/>
    <field index="26" term="http://rs.tdwg.org/dwc/terms/coordinateUncertaintyInMeters"/>
    <field index="27" term="http://rs.tdwg.org/dwc/terms/pointRadiusSpatialFit"/>
    <field index="28" term="http://rs.tdwg.org/dwc/terms/footprintWKT"/>
    <field index="29" term="http://rs.tdwg.org/dwc/terms/footprintSRS"/>
  </core>
  <extension encoding="UTF-8" fieldsTerminatedBy="\t" linesTerminatedBy="\n" fieldsEnclosedBy="" ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files>
      <location>occurrence.txt</location>
    </files>
    <coreid index="0" />
    <field index="1" term="http://purl.org/dc/terms/type"/>
    <field index="2" term="http://purl.org/dc/terms/modified"/>
    <field index="3" term="http://purl.org/dc/terms/language"/>
    <field index="4" term="http://purl.org/dc/terms/license"/>
    <field index="5" term="http://purl.org/dc/terms/rightsHolder"/>
    <field index="6" term="http://purl.org/dc/terms/accessRights"/>
    <field index="7" term="http://purl.org/dc/terms/references"/>
    <field index="8" term="http://rs.tdwg.org/dwc/terms/collectionID"/>
    <field index="9" term="http://rs.tdwg.org/dwc/terms/ownerInstitutionCode"/>
    <field index="10" term="http://rs.tdwg.org/dwc/terms/basisOfRecord"/>
    <field index="11" term="http://rs.tdwg.org/dwc/terms/occurrenceID"/>
    <field index="12" term="http://rs.tdwg.org/dwc/terms/catalogNumber"/>
    <field index="13" term="http://rs.tdwg.org/dwc/terms/occurrenceRemarks"/>
    <field index="14" term="http://rs.tdwg.org/dwc/terms/recordedBy"/>
    <field index="15" term="http://rs.tdwg.org/dwc/terms/individualCount"/>
    <field index="16" term="http://rs.tdwg.org/dwc/terms/organismQuantity"/>
    <field index="17" term="http://rs.tdwg.org/dwc/terms/organismQuantityType"/>
    <field index="18" term="http://rs.tdwg.org/dwc/terms/lifeStage"/>
    <field index="19" term="http://rs.tdwg.org/dwc/terms/establishmentMeans"/>
    <field index="20" term="http://rs.tdwg.org/dwc/terms/identifiedBy"/>
    <field index="21" term="http://rs.tdwg.org/dwc/terms/taxonID"/>
    <field index="22" term="http://rs.tdwg.org/dwc/terms/scientificNameID"/>
    <field index="23" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="24" term="http://rs.tdwg.org/dwc/terms/acceptedNameUsage"/>
    <field index="25" term="http://rs.tdwg.org/dwc/terms/kingdom"/>
    <field index="26" term="http://rs.tdwg.org/dwc/terms/phylum"/>
    <field index="27" term="http://rs.tdwg.org/dwc/terms/order"/>
    <field index="28" term="http://rs.tdwg.org/dwc/terms/family"/>
    <field index="29" term="http://rs.tdwg.org/dwc/terms/genus"/>
    <field index="30" term="http://rs.tdwg.org/dwc/terms/specificEpithet"/>
    <field index="31" term="http://rs.tdwg.org/dwc/terms/infraspecificEpithet"/>
    <field index="32" term="http://rs.tdwg.org/dwc/terms/taxonRank"/>
    <field index="33" term="http://rs.tdwg.org/dwc/terms/scientificNameAuthorship"/>
    <field index="34" term="http://rs.tdwg.org/dwc/terms/vernacularName"/>
    <field index="35" term="http://rs.tdwg.org/dwc/terms/nomenclaturalCode"/>
  </extension>
</archive>

The meta.xml file designates the mapping of columns in the table to properties.  You can see the core table is specified as a dwc:Event and the single extension table is specified as a dwc:Occurrence.  A large number of Dublin Core and Darwin Core properties have been assigned to each of the two classes.  Many of the properties that were assigned to the Occurrence instance in the "sea urchin" model are now assigned to the Event instance.  They are NOT just properties that are listed in the dwc:Event category on the DwC Quick Reference Guide.  Rather, any properties that were shared by Occurrences that were recorded at the same Event will be listed under the Event instance so that they don't need to be repeated in every Occurrence record.  This normalization phenomenon should be familiar to anyone in the relational database world.  
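If you'd rather read the column mappings programmatically than by eye, here's a sketch that pulls them out of meta.xml with Python's standard XML parser.  The embedded document is a trimmed copy of the file shown above (only a few field elements are kept), and the describe_tables function and its dictionary keys are my own naming.

```python
import xml.etree.ElementTree as ET

# Namespace of the Darwin Core text guide, as declared on the archive element.
NS = {"dwca": "http://rs.tdwg.org/dwc/text/"}

META_XML = """<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">
  <core rowType="http://rs.tdwg.org/dwc/terms/Event">
    <files><location>event.txt</location></files>
    <id index="0" />
    <field index="7" term="http://rs.tdwg.org/dwc/terms/eventID"/>
    <field index="12" term="http://rs.tdwg.org/dwc/terms/eventDate"/>
  </core>
  <extension rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrence.txt</location></files>
    <coreid index="0" />
    <field index="11" term="http://rs.tdwg.org/dwc/terms/occurrenceID"/>
    <field index="23" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
  </extension>
</archive>"""


def describe_tables(meta_xml):
    """Return one dictionary per table: role, file, row type, key column, columns."""
    root = ET.fromstring(meta_xml)
    tables = []
    for table in root.findall("dwca:core", NS) + root.findall("dwca:extension", NS):
        # The core table uses <id>, extension tables use <coreid>.
        key = table.find("dwca:id", NS)
        if key is None:
            key = table.find("dwca:coreid", NS)
        tables.append({
            "role": table.tag.split("}")[1],
            "file": table.find("dwca:files/dwca:location", NS).text,
            "row_type": table.get("rowType"),
            "key_column": int(key.get("index")),
            "columns": {int(f.get("index")): f.get("term")
                        for f in table.findall("dwca:field", NS)},
        })
    return tables


for table in describe_tables(META_XML):
    print(table["role"], table["file"], table["row_type"])
    for index, term in sorted(table["columns"].items()):
        print("  column", index, "->", term)
```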

This kind of normalization works well for any data provider whose data includes many Occurrences at one Event.  It does not work well for data providers who may have many Occurrences for one Organism, or who have many Occurrences representing one Taxon.  In order to be able to accommodate the normalization desires of every provider, I'm going to use a model that is normalized enough to handle many-to-one (or many-to-many) relationships among all of the major Darwin Core classes: Darwin-SW (see http://dx.doi.org/10.3233/SW-150203 for details) .  

In order to express these data as RDF using the Darwin-SW model, I need to create instances of most of the major Darwin Core classes, then appropriately assign each of the properties listed in the meta.xml file to those classes.  To decide what is "appropriate" most of the time I will be assigning them to the classes under which they are listed in the DwC Quick Reference Guide.  I will also follow the recommendations of the Darwin Core RDF Guide.  The guide is particularly relevant when properties involve identifiers, or when the values of properties are URIs rather than strings.  

When I examine the properties assigned to columns in the event.txt file by the meta.xml mappings, I see that they fall into two Darwin Core classes, dwc:Event and dcterms:Location, plus another class that isn't explicitly included in Darwin Core: dcmitype:Dataset.  I'm going to use the XQuery script that I've dubbed Guid-O-Matic to generate the RDF, so I need to add these classes to the event-classes.csv file that the script requires.


The "_:1" indicates that the dcterms:Location instance will be a blank node, and the same goes for dcmitype:Dataset.  Here's how I mapped the properties listed under the core file in meta.xml to these three classes:



The subject_id column shows the class to which I assigned each property.  If you think about the language, rightsHolder, bibliographicCitation, datasetID, and datasetName properties, it should be clear that these are not properties of the Event, but rather properties of the dataset that contains the metadata about those Events; hence, the need to generate a dcmitype:Dataset instance.  In cases where it seemed apparent that the literal values were expressed in English (vs. being controlled value terms that resemble English), I applied language tags to the literals.  In cases where numeric values were of a consistent form, I applied XML datatypes.  In a number of instances where Dublin Core terms with non-literal ranges were mapped to columns containing literals in meta.xml, I substituted the literal value analogs recommended by the RDF Guide.  

The two $link rows were added to create the triples that link the Event to the other two classes.  The Event was linked to the Location by the predicate dsw:locatedAt.  It is not clear to me what predicate should link a thing to the Dataset about that thing, so I used the generic predicate dcterms:relation.   
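To show the general idea of the mapping (not Guid-O-Matic itself, which is written in XQuery and driven by its own CSV files), here's a Python sketch that spreads the columns of one event.txt row across three instances, applies language tags and datatypes, and then adds the two linking triples.  The MAPPING, LINKS, and CLASSES structures and the row_to_triples function are simplified stand-ins that I made up for this illustration; the predicates and sample values come from this dataset.

```python
import uuid

# Hypothetical mapping: (column, subject key, predicate, literal kind, tag or datatype).
MAPPING = [
    ("eventDate", "event", "dwc:eventDate", "plain", None),
    ("year", "event", "dwc:year", "datatype", "xsd:gYear"),
    ("habitat", "event", "dwc:habitat", "language", "en"),
    ("country", "location", "dwc:country", "plain", None),
    ("locality", "location", "dwc:locality", "language", "en"),
    ("decimalLatitude", "location", "dwc:decimalLatitude", "datatype", "xsd:decimal"),
    ("datasetName", "dataset", "dwc:datasetName", "language", "en"),
]
# Links from the root Event to the two blank-node instances.
LINKS = [("event", "dsw:locatedAt", "location"), ("event", "dcterms:relation", "dataset")]
CLASSES = {"event": "dwc:Event", "location": "dcterms:Location", "dataset": "dcmitype:Dataset"}


def literal(value, kind, extra):
    """Serialize a value as a Turtle literal, optionally tagged or typed."""
    text = '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'
    if kind == "language":
        return f"{text}@{extra}"
    if kind == "datatype":
        return f"{text}^^{extra}"
    return text


def row_to_triples(row, root_iri):
    """Turn one flat row into triples about an Event, a Location, and a Dataset."""
    nodes = {
        "event": f"<{root_iri}>",
        "location": f"_:{uuid.uuid4()}",  # blank node
        "dataset": f"_:{uuid.uuid4()}",   # blank node
    }
    triples = []
    for column, subject, predicate, kind, extra in MAPPING:
        value = row.get(column, "")
        if value:  # skip empty cells rather than emit empty literals
            triples.append((nodes[subject], predicate, literal(value, kind, extra)))
    for source, predicate, target in LINKS:
        triples.append((nodes[source], predicate, nodes[target]))
    for key, cls in CLASSES.items():
        triples.append((nodes[key], "a", cls))
    return triples


row = {
    "eventDate": "2005-08-12",
    "year": "2005",
    "habitat": "Golf course",
    "country": "United Kingdom",
    "locality": "Boldon Golf Course Site 2",
    "decimalLatitude": "54.93812584",
    "datasetName": "A common plant survey of South Northumberland and Durham, United Kingdom",
}
root = "http://www.br.fgov.be/id/d9dccf50-f80b-47c2-a852-5ba6b6e6242b"
for subject, predicate, obj in row_to_triples(row, root):
    print(subject, predicate, obj, ".")
```

The point of keeping the links in their own list is that the flat row says nothing about how an Event relates to a Location or a Dataset; that knowledge has to be supplied by whoever does the mapping.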

Here's how the Turtle looks after Guid-O-Matic generates the triples:

<http://www.br.fgov.be/id/d9dccf50-f80b-47c2-a852-5ba6b6e6242b>
     dc:type "event";
     dcterms:identifier "d9dccf50-f80b-47c2-a852-5ba6b6e6242b";
     dwc:samplingProtocol "ad hoc observation"@en;
     dwc:eventDate "2005-08-12";
     dwc:year "2005"^^xsd:gYear;
     dwc:habitat "Golf course"@en;
     dsw:locatedAt _:bd44c07b-2b76-4f18-98bd-5d7edfc53a04;
     dcterms:relation _:74e35202-da07-4a25-8f05-303dceb48874;
     a dwc:Event.

_:bd44c07b-2b76-4f18-98bd-5d7edfc53a04
     dwc:continent "Europe";
     dwc:country "United Kingdom";
     dwc:countryCode "GB";
     dwc:county "Durham";
     dwc:locality "Boldon Golf Course Site 2"@en;
     dwc:verbatimCoordinates "NZ35666052";
     dwc:verbatimCoordinateSystem "OSGB 1936";
     dwc:verbatimSRS "PROJCS[\"OSGB 1936 / British National Grid\",GEOGCS[\"OSGB 1936\",DATUM[\"OSGB_1936\",SPHEROID[\"Airy 1830\",6377563.396,299.3249646,AUTHORITY[\"EPSG\",\"7001\"]],AUTHORITY[\"EPSG\",\"6277\"]],PRIMEM[\"Greenwich\",0,AUTHORITY[\"EPSG\",\"8901\"]],UNIT[\"degree\",0.017453292519943";
     dwc:decimalLatitude "54.93812584"^^xsd:decimal;
     dwc:decimalLongitude "-1.44487595"^^xsd:decimal;
     dwc:geodeticDatum "EPSG:4326";
     dwc:coordinateUncertaintyInMeters "7.07"^^xsd:decimal;
     dwc:pointRadiusSpatialFit "1.57";
     dwc:footprintWKT "POLYGON((435659 560520, 435669 560520, 435669 560530, 435659 560530))";
     dwc:footprintSRS "GEOGCS[\"GCS_WGS_1984\",DATUM[\"D_WGS_1984\",SPHEROID[\"WGS_1984\",6378137,298.257223563]],PRIMEM[\"Greenwich\",0],UNIT[\"Degree\",0.0174532925199433]]";
     a dcterms:Location.

_:74e35202-da07-4a25-8f05-303dceb48874
     dc:language "en-GB";
     xmpRights:Owner "Quentin Groom";
     dcterms:bibliographicCitation "Groom QJ, Durkin JL, O'Reilly J, Mclay A, Richards AJ, Angel J, Horsley A, Rogers M, Young G (2016) A benchmark survey of the common plants of South Northumberland and Durham, United Kingdom. Biodiversity Data Journal";
     dcterms:identifier "efad5b4f-2d4d-4566-9821-bc802d94dcfa";
     dwc:datasetName "A common plant survey of South Northumberland and Durham, United Kingdom"@en;
     a dcmitype:Dataset.

I made up a root URI of http://www.br.fgov.be/id/ to prepend to the UUID in order to turn the UUID into a URI.  So the subject URIs are all fake and won't actually dereference to anything.

Next I'll look at the properties assigned to columns in the occurrence.txt file by the meta.xml mappings.  In this case there are five classes that I want to add to the class list:


Here's how I did the mappings of the columns in occurrence.txt:

As in the event.txt mapping, there are several rows that create the links between the classes I've added.  There are a number of intellectual property relationships asserted, but it isn't clear to me what they apply to.  When an Occurrence is based on a specimen, that specimen can be owned by an institution, but an Occurrence (i.e. an organism at a time and place) can't be owned by anyone.  I presume the ownership, licensing, access rights, etc. are about the Dataset that records the occurrence, so that's why I created a dcmitype:Dataset instance again, as I did for the Event.  I had to do a tricky little thing with creating a blank node to be the object of the dcterms:accessRights property, since the range of that term is dcterms:RightsStatement, not a literal. (Refer to [1] again.)

The other tricky thing was dealing with the taxonomic information.  I'm not going to go into the details of the reason why the properties like dwc:genus, dwc:family, etc. are properties of the dwc:Identification instance rather than of dwc:Taxon.  They are convenience terms as explained in Section 2.7 of the RDF Guide.  Normally, in a situation like this, I wouldn't generate a dwc:Taxon instance.  What should happen is that there should be a dwciri:toTaxon link to an IRI minted by a consensus Taxon (or Taxon Name Use) repository, and they should handle describing all of the hierarchical relationships between the object taxon and its parent and child taxa, linking to the secundum reference, linking to the name entity, etc.  (Unfortunately, such a thing doesn't really exist and probably won't until there is a TCS 2.0.)  However, in some cases, this dataset provides links to bona fide IPNI URIs for name entities (albeit non-dereferenceable URNs) that have associated TDWG Taxon Concept Ontology RDF.  So I went ahead and made the effort to create a blank node for the Taxon instance, then link to the IPNI name entity using tc:hasName, and to the name string using tc:nameString as suggested in a TDWG RDF Task Group page with no official standing.

Here's an example of the Turtle generated by Guid-O-Matic for one of the occurrences associated with the Event instance that was shown above:

<http://www.br.fgov.be/id/d9dccf50-f80b-47c2-a852-5ba6b6e6242b#2cd4p9h.25xd83>
     dwc:collectionID "MapMateCentre:2d6";
     dwc:basisOfRecord "HumanObservation";
     dcterms:identifier "2cd4p9h.25xd83";
     dwc:catalogNumber "MAPMATE:record:io8wp2d6";
     dwc:recordedBy "HS/FC for E3 Ecology";
     dsw:occurrenceOf _:e045ca10-e054-419f-8128-184bdbc68a55;
     dcterms:relation _:0f2b2b17-6ae3-464d-88a8-f9f23f2a2f2a;
     dsw:atEvent <http://www.br.fgov.be/id/d9dccf50-f80b-47c2-a852-5ba6b6e6242b>;
     a dwc:Occurrence.

_:e045ca10-e054-419f-8128-184bdbc68a55
     a dwc:Organism.

_:df29875e-15be-44e7-b019-5245215e1d60
     dwc:identifiedBy "HS/FC for E3 Ecology";
     dwc:scientificName "Prunella vulgaris";
     dwc:acceptedNameUsage "Prunella vulgaris L.";
     dwc:kingdom "Plantae";
     dwc:phylum "Anthophyta";
     dwc:order "Lamiales";
     dwc:family "lamiaceae";
     dwc:genus "Prunella";
     dwc:specificEpithet "vulgaris";
     dwc:taxonRank "species";
     dwc:scientificNameAuthorship "L.";
     dwc:vernacularName "Selfheal"@en;
     dwc:nomenclaturalCode "ICBN";
     dsw:identifies _:e045ca10-e054-419f-8128-184bdbc68a55;
     dwciri:toTaxon _:837e07bb-b350-4389-9cdf-89e43f16bc2f;
     a dwc:Identification.

_:fb9c57d5-00b9-43c6-8e1f-9b8051266230
     rdfs:label "CC-0";
     a dcterms:RightsStatement.

_:837e07bb-b350-4389-9cdf-89e43f16bc2f
     dcterms:identifier "1610";
     tc:hasName <urn:lsid:ipni.org:names:210696-2>;
     tc:nameString "Prunella vulgaris L.";
     a dwc:Taxon.

_:0f2b2b17-6ae3-464d-88a8-f9f23f2a2f2a
     dcterms:modified "2015-11-18T19:44+0100";
     dc:language "EN-GB";
     dcterms:license <http://creativecommons.org/publicdomain/zero/1.0/legalcode>;
     xmpRights:owner "Quentin Groom";
     dcterms:references <http://bsbidb.org.uk/record/2cd4p9h.25xd83>;
     dwc:ownerInstitutionCode "BR";
     dcterms:accessRights _:fb9c57d5-00b9-43c6-8e1f-9b8051266230;
     a dcmitype:Dataset.

You can see that I generated an IRI for the occurrence by appending the identifier for the Occurrence (2cd4p9h.25xd83) as a hash to the event's IRI (http://www.br.fgov.be/id/d9dccf50-f80b-47c2-a852-5ba6b6e6242b).  With this design, a client dereferencing the occurrence IRI would also retrieve the data about the Event and all other Occurrences associated with that Event.  The link from the Occurrence instance to the Event instance is made via the dsw:atEvent predicate.  Use of that predicate is specified in Guid-O-Matic's linked-classes.csv file, which specifies the extension tables, the linking property, and the mechanism for generating IRIs for the extension root resource (either using a hash as in this case, or by representing the root resource as a blank node).  It looks like this:


You are probably wondering why I bothered to create the dwc:Organism instance when there were no properties assigned to it.  I am going to defer on answering this question until after I talk about octopuses and give several more examples.  All of the files needed to use Guid-O-Matic to convert this dataset to RDF are in this GitHub folder.
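As a rough illustration of the hash IRI scheme described above (this is a Python sketch, not Guid-O-Matic's actual XQuery code, and the function names are hypothetical), the occurrence IRI is just the Event IRI plus a fragment identifier.  The second function shows why a client dereferencing the occurrence IRI ends up retrieving the Event document: the fragment is dropped before the request is made.  The sketch assumes the local identifier needs no percent-encoding.

```python
from urllib.parse import urldefrag


def occurrence_iri(event_iri: str, local_id: str) -> str:
    """Append the extension record's local identifier to the core IRI as a fragment."""
    return f"{event_iri}#{local_id}"


def document_iri(iri: str) -> str:
    """Return the IRI without its fragment, i.e. what is actually requested from the server."""
    return urldefrag(iri).url


event = "http://www.br.fgov.be/id/d9dccf50-f80b-47c2-a852-5ba6b6e6242b"
occurrence = occurrence_iri(event, "2cd4p9h.25xd83")
```

Here document_iri(occurrence) gives back the Event IRI, so every Occurrence hanging off the same Event resolves to the same document.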


The Darwin Core Archive RDF Octopus

In each of the previous models, I used a marine animal to describe the graph model illustrated by the RDF data that I presented.  I shall continue that trend by describing the Darwin Core Archive RDF Octopus Model, which I will refer to as the DwC-A octopus model for short.

The DwC-A octopus model consists of one or more octopuses, each one of which represents a file in a Darwin Core Archive.  The body of the octopus represents the root class of the file, specified as the row type in the archive's meta.xml file.  Since the Vascular plants of South Northumberland and Durham dataset has two files, the model has two octopuses, one core Event octopus and one extension Occurrence octopus.

The arms of the octopus are the properties of classes that the octopus metadata describes.  There are some arms connected to the body that represent properties of the root class (e.g. the Event class).  But there are also special distorted arms that represent links to other classes that have a one-to-one relationship with the root class.  For example, the Event octopus has an extra-long arm that links to the Location class.  The linking part of that arm is the object property dsw:locatedAt.  The bulbous blob on the end of the extra-long arm represents the instance of the Location class, which itself has properties extending from it.  Overall, every property represented in the table comes off the same octopus somewhere.  There's only one (distorted) location arm on the Event octopus, so the properties bristling from its bulbous end won't be repeated.  Thus, all of the data about an instance of the Event/Location octopus can be represented in a single row of the table.

In this diagram of the model, I show a single Occurrence octopus with a deformed tentacle attached to the Event octopus' head.  That deformed tentacle is the object property that links an Occurrence instance to an Event instance: dsw:atEvent.  The Occurrence octopus has an even weirder arm - one that connects to three blobs that represent the other classes that have a one-to-one relationship with the Occurrence class.  All of the data about a single instance of Occurrence and the single instances of linked classes on the deformed arm can be stored in a single row of the Occurrence table.  However, if I wanted to show the graph of instances rather than a diagram of the model generically, I would show many Occurrence octopuses with their tentacles stuck to the head of the single Event octopus.  That's similar to the situation with the many peripheral bubbles around the one central bubble in the starfish schema, but it's different because the extension octopuses are not limited to representing just a single class.

I'm going to leave my description of the DwC-A octopus model there for now, but will come back to it with another example after I introduce a different dataset.

Taxon core with Occurrence, TypesAndSpecimen, Distribution, Reference, and Description extensions: Catalogue of Afrotropical Bees

The second example was provided by Willem Coetzer of the National Research Foundation/South African Institute for Aquatic Biodiversity, and includes the dataset with citation: Agricultural Research Council: Catalogue of Afrotropical Bees. http://doi.org/10.15468/u9ezbh  Accessed via http://www.gbif.org/dataset/da38f103-4410-43d1-b716-ea6b1b92bbac on 2016-10-26.  It includes data on 7,118 occurrences and 8,031 taxa.

The Afrotropical bee DwC-A is significantly more complex than the previous dataset.  In this archive, the core file describes taxon instances.  There are five extension files: reference, distribution, description, occurrence, and typesAndSpecimen.  In the diagram below, you can see that each of the five extension octopuses (Description, Reference, Distribution, Occurrence, and Preserved Specimen=typesAndSpecimen) has linked itself to the core Taxon octopus by some kind of tentacle (short and direct, or long and deformed).


As was the case in the previous octopus graph, this diagram represents the connections between classes generically.  It could also represent a graph of instances of the classes, as long as we keep in mind that there are many-to-one relationships between the extension instances and the core taxon instance.  So if you were imagining the graph of instance octopuses, there would be many Occurrence octopuses attached to the head of each Taxon octopus.

I am not going to insert in this post all of the mapping files needed to convert this archive to RDF, but you can find them in this GitHub folder.

It wasn't clear to me what properties should be used to link Description, Reference, and Distribution.  I used dcterms:references to link the Reference instance to the Taxon instance, which I think is OK.  I used dwc:taxonRemarks to link the Taxon instance to the Description instance, which is NOT really OK, since according to the RDF Guide, dwc:taxonRemarks should have a literal value and there is no dwciri:taxonRemarks analog.  Oh well.  For Distribution, I just gave up and used the very generic dcterms:relation to make the link to Taxon.

The links and intervening classes from the Occurrence octopus to the Taxon octopus are similar to those of the previous diagram.  One difference is that this time, the Occurrence octopus has one tentacle reaching over to Preserved Specimen (the dsw:evidenceFor object property).  The Darwin-SW model (on which these diagrams are based) distinguishes between Occurrences and the evidence that documents them.  In the Vascular plants of South Northumberland and Durham dataset, the Occurrences were based on human observations, rather than on specimen records.  As far as I know, there was no distinct record of evidence that can be pointed to as a voucher for the record of the occurrence in that dataset.  One might argue that there is a data file somewhere with a record of the occurrence, but that doesn't constitute any more evidence than any occurrence record that is available online would have.  If there were a field notebook sitting somewhere, one could point to that as a form of evidence, but there isn't anything like that in the vascular plants dataset.  In this example, however, there are specimens associated with the occurrences, so the Occurrence octopus has a deformed arm reaching over to the Preserved Specimen spot, indicating that the Preserved Specimen is evidence for the Occurrence.  The tentacle also reaches around and touches the Organism class (dsw:derivedFrom object property), because the specimen is derived from the organism.  If one would like to say that the Preserved Specimen is the organism rather than saying it's derived from the organism, that's fine as long as the whole organism is collected (as opposed to collecting only a branch from a tree).  There will be a later example where I chose to model the organism and specimen as a single resource.

In the typesandspecimen extension file it seems clear that the primary class represented is preserved specimen.  The meta.xml file says the row type is gbifterms:TypesAndSpecimen, and I don't have a problem with that.  However, I also asserted the standard Darwin Core type dwc:PreservedSpecimen, and that's how that octopus is labeled in the diagram.  The relationship between these (extension) type specimens and the (core) taxon is that the taxon is circumscribed by the specimen.  Darwin Core doesn't define object properties for this, so I resorted to using the property tc:circumscribedBy from the old TDWG Taxon Concept ontology to make the link directly to Taxon.[3]

Here's a breakdown of the columns in the table showing how I distributed the columns to classes:



One column, locality, needed to be in the dwc:Location class, so to make the link from the PreservedSpecimen octopus to Location, I had to extend its tentacle through blank, propertyless Event and Occurrence nodes.  The PreservedSpecimen was already linked directly to the core taxon instance by tc:circumscribedBy, but I also wanted to make links to Taxon in the same way that other specimens were linked to their corresponding taxa.


The type specimens were linked to Organism and Occurrence in the same way the generic specimens described in the occurrence tables were linked (via dsw:derivedFrom and dsw:evidenceFor).  However, in this case, we can also be pretty sure that the type specimens were used as the basis for the Identification that happened when the taxon was created (or modified).  So unlike the specimens in the occurrence table, these type specimens also have a dsw:isBasisForId link to the Identification instance.  The taxon-related properties in the table that can be considered convenience properties were assigned to that Identification instance, as was done with the generic occurrence-related specimens.

The end result is that the octopus graph is really messy, since the tentacles of the Occurrence octopus and the PreservedSpecimen (i.e. TypesAndSpecimen) octopus lie on top of each other and form multiple triangles in the center of the graph.  But that's fine - in reality, people aren't going to be looking at this graph and a SPARQL query can traverse it with no problem.  Since this octopus graph and the one for the Vascular plants of South Northumberland and Durham dataset are both based on the same graph model (Darwin-SW), the two datasets can easily be merged into one graph.

You can view an example record for Melitturga capensis, which has several occurrences and a linked type specimen.

Organism core with StillImage and Identification extensions: Bioimages

We are getting to the point where we are pushing the limit of graph complexity that can be handled by a Darwin Core Archive.  The Bioimages dataset can (mostly) be encapsulated as a DwC-A.  In fact, most of the data have been submitted to GBIF as an occurrence-core DwC-A, and are in the dataset index (http://doi.org/10.15468/jib4rt).  However, the data don't actually "live" in tables with an occurrence core.  The data can be encapsulated (and are downloadable) as several CSV files.  The core CSV file is organisms.csv and the extension CSV files images.csv and determinations.csv are linked in many-to-one relationships to organisms.csv by a field that serves as a foreign key.  (There are some other linked CSV files for taxon names, secundum references, agents, etc. but they aren't important to this example.)  The CSV files and a zip file containing the entire Bioimages RDF graph serialized as RDF/XML are on GitHub here.  The particular release used in this experiment was http://dx.doi.org/10.5281/zenodo.51121.

The diagram below shows part of the graph related to a particular organism (the URIs are mostly real, with bio: abbreviating http://bioimages.vanderbilt.edu/).



In the Bioimages database, occurrences are not actually tracked as independent entities.  One could consider each still image to document a separate occurrence.  In the example above, there would be five occurrences documented by images.  However, as a practical matter, it isn't that useful to document an occurrence of the same tree three times in a five-minute period.  So occurrences are constructed artificially by grouping all of the images that were taken of the organism in one calendar day, as diagrammed below:

This makes sense, but the graph is now too complicated to encode as a simple DwC-A. Something more complicated would be required.
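To make the grouping rule concrete, here is a minimal Python sketch of "one occurrence per organism per calendar day".  It is only an illustration: the identifiers and timestamps are made up, the function name is hypothetical, and I'm assuming that each image record carries an ISO 8601 creation timestamp whose date part is the calendar day we care about.

```python
from collections import defaultdict
from datetime import datetime


def group_into_occurrences(images):
    """Group image IDs by (organism ID, calendar day).

    images: iterable of (image_id, organism_id, ISO 8601 timestamp string).
    Each key of the returned dict stands for one artificially constructed occurrence.
    """
    occurrences = defaultdict(list)
    for image_id, organism_id, created in images:
        day = datetime.fromisoformat(created).date().isoformat()
        occurrences[(organism_id, day)].append(image_id)
    return dict(occurrences)


# Hypothetical records: three images of one tree, two of them minutes apart.
images = [
    ("img1", "org1", "2010-06-18T10:02:00"),
    ("img2", "org1", "2010-06-18T10:05:30"),
    ("img3", "org1", "2010-07-01T09:00:00"),
]
occurrences = group_into_occurrences(images)
```

With these made-up records, the two images taken on the same day collapse into a single occurrence and the later image gets an occurrence of its own.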

Bioimages assigns a small number of properties to the organism instance, as opposed to no properties in the previous examples.  If one tends to think of class instances as things off which we hang properties, then it would seem like there is little purpose to having the organism class, since its instances usually have few or no properties.  In database terms, it would be like saying that there was little purpose to having a database table with no columns.  However, the primary purpose of the organism class is actually to serve as a node to which multiple identification instances and multiple occurrence instances can be linked.  In database terms, its purpose is similar to that of an associative table.  In the previous examples, there didn't seem to be any need for an organism instance in the graph, because one occurrence instance was linked to one organism and one identification.  But in any case where an organism is documented repeatedly, such as mark-recapture, bird banding, radio-tracking, camera trapping, etc. the organism node serves the important purpose of linking multiple occurrences.

Here's what the Bioimages octopus diagram would look like:


There are three octopuses, one for each of the main data tables (Organism, StillImage, Identification).  This diagram doesn't include the grouping of images from the same day into one occurrence.  Bioimages does not mint Taxon identifiers, but does provide both a name entity URI (from uBio) and a literal secundum reference in the hope of some glorious future day when they could be matched with permanent consensus URIs for taxa.

Click for an XML serialized graph for an organism (a tree) and an image of the tree.

Occurrence core with Amplification, Loan, MaterialSample, Permit, Preparation, and ResourceRelationship extensions: a Global Genome Biodiversity Network archive

As far as I know, the next dataset is not a dataset uploaded to GBIF.  Gabriele Dröge of Botanischer Garten und Botanisches Museum Berlin-Dahlem sent me the link to it: http://collections.nmnh.si.edu/ipt/archive.do?r=nmnh_materialsample_test so that I could play with the data.  The archive is in the DwC-A format, so Guid-O-Matic can process it, but the extensions are generally not typed using Darwin Core classes.  Rather, the extension tables represent classes of the Global Genome Biodiversity Network (GGBN) Data Standard.  I'm not familiar with this standard, although through my teaching I'm familiar enough with the process of acquiring DNA sequences to make a stab at trying to map out the data as an RDF graph.

All of the mapping files needed to convert this dataset to RDF using Guid-O-Matic are in this GitHub folder.

Since the dataset is in the DwC-A format, it is stuck with using the starfish schema design.  Not surprisingly, Occurrence is at the center as the core, and every other extension class radiates off of it.  However, that design does not fit very well with the process of acquiring and processing samples to generate DNA sequences.  That process is more linear, with a series of resources being generated from sequential steps in the processing protocol.  In the case of sequences being generated from a pinned insect specimen, I would envision the process something like this:

insect specimen --> tissue sample ----> extracted DNA --> amplified DNA ----> nucleotide sequence
            sample prep        DNA extraction           PCR            sequencing

Each resource in the processing protocol is derived from the previous one.  There's a graphic similar to this in the "Mandatory and recommended fields for sharing data with GGBN" page of the GGBN Wiki.  If I am understanding the dataset correctly, I think that the metadata for insect specimens are included in the occurrence.txt file, the metadata for tissue samples are in the preparation.txt file and typed as ggbn:Preparation, the metadata for extracted DNA are in the materialsample.txt file and typed as ggbn:MaterialSample, and the metadata for amplified DNA are in the amplification.txt file and typed as ggbn:Amplification.  There isn't a file or class for the nucleotide sequence, but the amplification.txt file includes the properties ggbn:geneticAccessionNumber, which I would consider a sort of convenience term used to aid searching, and ggbn:geneticAccessionURI, which I would consider an object property that would link to an external sequence URI.  It is quite possible that I'm misinterpreting something about the dataset.  Nevertheless, let us charge on fearlessly!

Darwin-SW has a transitive property, dsw:derivedFrom, that can be used to link a series of resources in a chain such as I illustrated above.  Optimally, I would model the graph for the GGBN archive dataset like this:


However, it is not possible for Guid-O-Matic to generate a graph like this from a DwC-A, since each of the three extension classes would have to be linked directly to the core class in the starfish schema shape:


Despite its non-linear form, there isn't actually anything untrue about this graph.  Since dsw:derivedFrom is transitive, the asserted dsw:derivedFrom links in the second diagram would also be entailed from the first diagram.  But the full set of relationships would not be available in the second graph - in the first diagram I could use SPARQL property paths to traverse the series of dsw:derivedFrom links, but that couldn't be done in the second diagram.  If the three extension class instances had identifiers instead of being blank nodes, I could make an explicit link to them in the column mappings, and set up the first hierarchy manually, but they don't have identifiers, so I can't. 
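The asymmetry between the two diagrams can be checked with a few lines of Python.  This is just a sketch using short made-up labels for the four resources (not real identifiers from the archive): it computes everything entailed by treating a set of links as a transitive property, then tests whether each set of asserted links is contained in the closure of the other.

```python
def transitive_closure(edges):
    """Return all (subject, object) pairs entailed when the edges are read as a transitive property."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


# First diagram: the linear chain of derivedFrom links.
chain = {
    ("amplification", "material_sample"),
    ("material_sample", "preparation"),
    ("preparation", "specimen"),
}

# Second diagram: the starfish shape, every extension linked straight to the core.
star = {
    ("amplification", "specimen"),
    ("material_sample", "specimen"),
    ("preparation", "specimen"),
}

star_entailed_by_chain = star <= transitive_closure(chain)
chain_entailed_by_star = chain <= transitive_closure(star)
```

The first test comes out True and the second False: the star-shaped links follow from the chain, but the intermediate steps of the chain cannot be recovered from the star.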

Also, Guid-O-Matic can (currently) only specify triples that link to resources outside the table when the outside resource is the object of a triple.  That means I can't use it to generate the triple:

http://www.ncbi.nlm.nih.gov/nuccore/JQ841293 dsw:derivedFrom _:3.

I can only state the relationship using the inverse Darwin-SW property, like this:

_:3 dsw:hasDerivative http://www.ncbi.nlm.nih.gov/nuccore/JQ841293.

The second triple entails the first, but unless the triplestore performs reasoning to materialize the inverse triple, one couldn't query using the preferred term of the pair: dsw:derivedFrom.  

As I said, we are pushing the limits of what can be done with Guid-O-Matic and DwC-A.  Once the data were in a triplestore, SPARQL CONSTRUCT queries could probably be carried out to correct the deficiencies I've just outlined.  
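I haven't written those CONSTRUCT queries, but the kind of repair I have in mind for the inverse-property problem is simple enough to sketch in plain Python as a stand-in.  The triples here are just tuples of strings and the function name is hypothetical; in a real triplestore the same thing would be done with a SPARQL CONSTRUCT or update.

```python
DSW = "http://purl.org/dsw/"


def add_inverse_triples(triples, prop=DSW + "hasDerivative", inverse=DSW + "derivedFrom"):
    """For every (s, prop, o) triple, also assert (o, inverse, s)."""
    result = set(triples)
    for s, p, o in triples:
        if p == prop:
            result.add((o, inverse, s))
    return result


# The one triple that Guid-O-Matic is able to generate for the sequence link.
asserted = {
    ("_:3", DSW + "hasDerivative", "http://www.ncbi.nlm.nih.gov/nuccore/JQ841293"),
}
repaired = add_inverse_triples(asserted)
```

After the repair, the graph contains both directions of the link, so a query written with dsw:derivedFrom would find the sequence without any reasoning being turned on.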

Of the other extension classes, the Permit class seems fairly straightforward - it specifies the license document for the collection of the specimen.  The Loan class seems to be some kind of a rights statement for loaning out the specimen.  The Preservation class table includes only a single property: ggbn:preservationType, with values like "Ethanol - 95%" and "DMSO-EDTA".  I'm guessing that this refers to either the preserved specimen, the tissue sample, or the DNA extraction, but I'm not sure which.  Because of the limitations of the starfish schema graph, it's going to end up being linked to the preserved specimen whether we like it or not.  It looks like the Resource relationship class is being used to assert sameAs relationships for alternate identifiers for the core resource.  It will be difficult to model this directly with Guid-O-Matic, but again, this is a kind of thing that could be fixed later using SPARQL CONSTRUCT.

 OK, here is what the octopus diagram is going to look like for this archive:




In this model, I chose to type the specimen as both dwc:PreservedSpecimen and dwc:Organism, whereas in the Catalogue of Afrotropical Bees example, I modeled the dwc:PreservedSpecimen as being derived from the dwc:Organism.  I probably could have done that here if I had wanted to, but it seemed likely that most, if not all, of the preserved specimens were whole preserved animals.  If the preserved specimens were herbarium specimens where a branch was removed from a whole tree, it would make more sense to say that the specimen was derived from the organism rather than saying that the specimen was an organism.  The choice between these two options is not that important.  The critical thing is that the deformed arm of the organism/specimen that bulges out to form the occurrence blob has a dsw:hasEvidence arm that reaches back to the specimen.  That allows for consistent querying for both kinds of models (where the specimen is derived from the organism, and where the specimen is the same as the organism).  

As I discussed earlier, I would prefer to have the Amplification derived from the Material Sample, and the Material Sample derived from the Preparation.  I've shown this by dotted-line tentacles that should exist, but don't.  Instead each of those three octopuses reaches around to the Preserved Specimen via a dsw:derivedFrom link.  There is also a link to a "ghost" Nucleotide Sequence instance that represents what http://www.ncbi.nlm.nih.gov/nuccore/JQ841293 denotes.  As far as I know, NCBI does not provide any kind of RDF metadata about their sequences, but if they did, it would be in the place of the ghost.


"Occurrence core" with Measurement, Taxon, and Reference extensions: Microbial eukaryotes for Encyclopedia of Life TraitBank

Anne Thessen provided me with an archive that isn't publicly available as far as I know.  I believe that the data are included in the public TraitBank record, but I don't know that there is any way to access these specific data in isolation from the rest of TraitBank.  All of the mapping files I used are in appropriately named subfolders of this GitHub folder.

I saved this dataset for last, because it conforms least to the starfish schema model of DwC-A.  In fact, the meta.xml file in the archive that Anne sent me isn't actually valid for a DwC-A because it doesn't designate one of the text files as the core and others as the extensions.  This actually makes sense because the entities represented in the files are related in ways that are too complicated to express as a starfish schema.  

The text files are structured a lot like the tables of a relational database.  They have an ID column that serves as a primary key for the rows, and some have columns containing foreign keys that reference rows in other tables.  Although the rows don't have globally unique identifiers, they do have local identifiers that are unique to the dataset, e.g. T003 as the primary key for a taxon and C009 as the primary key for an occurrence.  Since these files weren't organized as a typical DwC-A, Guid-O-Matic wasn't able to process them without some manual curation.  What I did was to split up the meta.xml file so that I could run the script separately on each class.  That meant that I couldn't use blank nodes for the resources in the tables, but that was fine because I just generated a fake URI for each record by prepending "http://eol.org/traitbank/" to the primary keys to create URIs like "http://eol.org/traitbank/T003" and "http://eol.org/traitbank/C009".

Here's what the table model looks like, using "crow's feet" notation to indicate many-to-one relationships:



At first glance, it may not be apparent why these tables can't be a DwC-A, because there is a core resource in the center and the others are in the periphery.  However, the starfish schema of DwC-A requires that the peripheral (extension) table records have a many-to-one relationship with records in the core file.  In this case, two of the three "extension" files (Reference and Taxon) have one-to-many relationships with the core file, which is backwards from what can be handled by DwC-A.  
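Here is a small Python sketch of the test I'm applying informally.  It assumes that an extension file must carry a column pointing at exactly one core record, which is how I understand the DwC-A extension mechanism to work.  The column names and the rows other than T003 and C009 are hypothetical, chosen only to show the direction of the foreign keys.

```python
def can_be_extension(extension_rows, core_ids, coreid_field):
    """True if every row of the would-be extension points at one existing core record."""
    return all(row.get(coreid_field) in core_ids for row in extension_rows)


# Core "occurrence" rows hold the foreign key to the taxon, not the other way around.
occurrences = [
    {"id": "C009", "taxonID": "T003"},
    {"id": "C010", "taxonID": "T003"},
]
taxa = [{"id": "T003"}]
measurements = [{"id": "M1", "occurrenceID": "C009"}]

core_ids = {row["id"] for row in occurrences}

measurement_ok = can_be_extension(measurements, core_ids, "occurrenceID")
taxon_ok = can_be_extension(taxa, core_ids, "occurrenceID")
```

The measurement rows pass because each one names its occurrence, but the taxon row has nothing to put in a core ID column: one taxon is shared by many occurrences, so the key lives on the wrong side for a starfish.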

Despite having to convert each table to RDF separately, it wasn't really that hard to map the columns of the tables to RDF modeled on Darwin-SW.  Here's what the octopus diagram looks like:



Notice that this database does not link to any evidence (preserved specimen, image, etc.) that documents the occurrence record.  Some might consider the reference as a sort of documentation, but there is no "voucher" that could be examined to verify that the particular taxon occurred at that time and place.  

As with the other examples where explicit dwc:Taxon instances were created (vs. using taxon-related convenience properties of a dwc:Identification instance), I generated RDF for the taxon using the unofficial TDWG RDF Task Group page as a model.  



Merging all of the datasets

One of the primary goals of this exercise was to show that data from a wide variety of Darwin Core Archives having different core and extension files can be easily aggregated and queried effectively if they are exposed as RDF using a consistent graph model.  When one looks at any particular octopus diagram, it seems to be unnecessarily complex.  There doesn't seem to be any point in giving the octopus weird, blobby tentacles when it could just have normal short tentacles (i.e. literal value properties linked directly to the core or extension class as in the sea urchin or starfish schema graph models).

However, if you place all of the diagrams side by side as I have done in the diagram above, you can see it would not be possible to consistently overlay all of the diagrams if they were any simpler.  A provider of any particular model would be prone to suggest denormalizing the model to eliminate one of the nodes that seemed unnecessary to them, but if you examine all of the models, you will see that someone's octopus is sitting on every one of the nodes shown in the generic Darwin-SW diagram on the lower right except for Location.  (However, I could easily imagine that there are providers who repeatedly hold sampling events at the same locations who would like to also put an octopus at that node.)  We designed Darwin-SW to be the least complicated model that can accommodate the range of variation of data models of datasets that we thought would be likely to be merged into a single graph.  In fact, it might not be complicated enough to accommodate fossil specimens for which time can't be represented as a simple ISO 8601 datatyped literal.  That's why I included the Time blob on all of the diagrams - a time:TemporalEntity class could be linked there to enable more complicated representations of time periods (such as the geological time scale) using the draft W3C Time ontology (in the Editor's Draft stage as of 3 November 2016).

Note that I said that a goal of the exercise was to show that the merged graphs can be queried effectively.  Any RDF graphs can be merged to create a larger graph and that larger graph can always be queried.  The question is whether those queries will find what we are looking for in all of the datasets, not just some of the datasets for which specific queries have been designed.  As they say, the proof is in the pudding, so I will now describe what I did to test for effective querying.

The Guid-O-Matic XQuery script was designed just to show that RDF could be generated from fielded text files.  So I was not sure how it would work on large datasets.  Several of the datasets were too large to run the scripts from the BaseX GUI, but they ran OK from the command line with 4 GB of memory allocated to the program.  Nevertheless, the script was not particularly fast.  I was trying to keep track of how long they took to run, and the smaller, less complicated ones were generated relatively quickly.  However, the largest one with the most complicated file structure (the Global Genome Biodiversity Network archive) took more than an hour to generate.  I'm not sure exactly how long it took, because I went out to mow the yard and when I came back inside it was done.  Clearly, for large datasets it would be good to have better, faster software than Guid-O-Matic.



The total number of triples in the dataset is a little over 8 million.  

I loaded the newly created files into the Vanderbilt Heard Library triple store (the Bioimages triples were already there).  As I noted in an earlier post, the Callimachus instance we are running there is very slow at loading graphs.  So it took an annoyingly long time to load everything.  I also had to hand-edit the droege.ttl file to fix two of the http://www.ncbi.nlm.nih.gov/nuccore/xxxx URIs that were malformed, and fix some datatyping problems.  But eventually I was able to load all of the files in the table above.

If you want to try loading the triples into your own choice of triplestore/SPARQL endpoint, the graphs generated from the Darwin Core Archives are available in Turtle serialization by downloading this compressed file (34 Mb).  The Bioimages graph can be downloaded separately here as RDF/XML.  

Querying the data

The Heard Library SPARQL endpoint is publicly available at http://rdf.library.vanderbilt.edu/sparql?view.  So you can try the queries that follow for yourself.  The Callimachus endpoint "remembers" namespace abbreviations that have been used before, so you should be able to run the queries without declaring the namespace abbreviations.  On other SPARQL endpoint systems, you may need to declare them.  Here's a list of namespaces I'm going to use in the queries that follow:

prefix dc: <http://purl.org/dc/elements/1.1/>
prefix foaf: <http://xmlns.com/foaf/0.1/>
prefix dcterms: <http://purl.org/dc/terms/>
prefix dwc: <http://rs.tdwg.org/dwc/terms/>
prefix dwciri: <http://rs.tdwg.org/dwc/iri/>
prefix dsw: <http://purl.org/dsw/>

The first query I ran was to find all of the data providers.  Their names are linked as the creators of the documents that serialize the RDF.  (There probably would be a better way to indicate the publisher of a dataset using the VoID vocabulary or something.  I haven't worked that out yet.)  So one could discover them using:

SELECT DISTINCT ?provider WHERE {
?doc a foaf:Document.
?doc dc:creator ?provider.
}

But there are other document creators in the triple store.  To be specific, I need to specify the kinds of things that the documents reference.  Unfortunately, because the DwC-A's have different kinds of cores, there isn't just one core resource to which the documents refer.  This situation requires a more complex query with a UNION of the graph patterns that specify all of the different kinds of core resources.  In addition, I would like to know how many organisms each dataset describes.  I could count the number of occurrences rather than organisms, but the Bioimages dataset records multiple occurrences for some organisms, so I'd rather not count the organisms more than once.  Here's the query I used:

SELECT DISTINCT ?provider (COUNT(?org) AS ?count) WHERE {

?doc a foaf:Document.
?doc dc:creator ?provider.
?doc dcterms:references ?thing.
{
?thing a dwc:Taxon.
?id dwciri:toTaxon ?thing.
?id dsw:identifies ?org.
}
   UNION
{
?thing a dwc:Occurrence.
?thing dsw:occurrenceOf ?org.
}
   UNION
{
?thing a dwc:Organism.
BIND (?thing as ?org)
}
   UNION
{
?thing a dwc:Event.
?occ dsw:atEvent ?thing.
?occ dsw:occurrenceOf ?org.
}

}
GROUP BY ?provider
ORDER BY DESC(?count)
LIMIT 20

The query produced these results:

provider                                        count
Botanic Garden Meise, Belgium                   42517^^xsd:integer
Global Genome Biodiversity Network              40048^^xsd:integer
Encyclopedia of Life                            28736^^xsd:integer
Agricultural Research Council (South Africa)    17117^^xsd:integer
bioimages.vanderbilt.edu                        3425^^xsd:integer

That seemed to work correctly, since the results are consistent with the number of occurrences reported in the GBIF dataset summaries when they were given.  

Core graph pattern

In queries that follow, you'll see this basic graph pattern:

{
?eve dsw:locatedAt ?loc.
?occ dsw:atEvent ?eve.
?occ dsw:occurrenceOf ?org.
?id dsw:identifies ?org.
}

This pattern links (via Darwin-SW object properties) the primary Darwin Core classes that are represented in the RDF graphs for all of the datasets by virtue of the fact that the datasets were forced to conform to the Darwin-SW graph model:



The graph pattern can be further constrained by adding triple patterns that tie the class instances to particular literal-value properties of those classes.  There are also peripheral classes that may be connected to the primary classes (e.g. dwc:Taxon, dwc:PreservedSpecimen, etc.) as part of the graph pattern, but since some datasets do not include instances of those peripheral classes, the query would fail to include any results from those datasets unless the triple patterns linking to the peripheral classes were made OPTIONAL.


Find geographic locations of occurrences by high-level taxonomy

Here is an example of a query that reports the continents on which occurrences were recorded, broken down by kingdom and phylum.  Notice that it contains the core graph pattern I listed above.

SELECT DISTINCT ?king ?phylum  ?cont WHERE {
?id dwc:kingdom ?king.
OPTIONAL {?id dwc:phylum ?phylum.}
?loc dwc:continent ?cont.
?eve dsw:locatedAt ?loc.
?occ dsw:atEvent ?eve.
?occ dsw:occurrenceOf ?org.
?id dsw:identifies ?org.
}
ORDER BY ?king
LIMIT 100

Here are the results:

king     | phylum        | cont
Animalia |               | NA
Animalia |               | OC
Animalia |               | SA
Animalia | Arthropoda    | Africa
Animalia | Arthropoda    | Europe
Animalia | Arthropoda    | South America
Animalia | Arthropoda    | North America
Animalia | Arthropoda    | Asia
Animalia | Chordata      | North America
Animalia | Chordata      | South America
Animalia | Chordata      | Asia
Animalia | Chordata      | Locality Unknown
Animalia | Chordata      | Africa
Animalia | Chordata      | Europe
Animalia | Chordata      | Australia
Animalia | Chordata      | Asia ?
Bacteria |               | NA
Fungi    |               | North America
Plantae  |               | NA
Plantae  |               | OC
Plantae  | Angiosperma   | Southern America
Plantae  | Embryophyta   | Oceania
Plantae  | Anthophyta    | Europe
Plantae  | Pterophyta    | Europe
Plantae  | Coniferophyta | Europe
Plantae  | Charophyta    | Europe
Plantae  | Lycophyta     | Europe
Plantae  |               | North America
Plantae  |               | Central America - Neotropics
Plantae  |               | South America - Neotropics
Plantae  |               | North America - Neotropics
Plantae  |               | Asia-Tropical
Plantae  |               | Asia-Temperate
Plantae  |               | Africa
Plantae  |               | West Indies - Neotropics

Notice why it was important that I made the phylum optional.  Virtually all identifications include kingdom metadata because I think GBIF requires it.  However, not all records include phyla.  If the phylum triple pattern had not been optional, the matches that did not include phylum information would have been missed.  

Another thing to note is the inconsistency in the way that continent is reported.  There are some abbreviations, some errors like "Southern America" instead of "South America", and non-standard values like "Asia-Tropical".  Clearly there are problems caused by a lack of controlled values.  It would be great to do some data enhancement by determining the GeoNames or Getty TGN identifiers for lowest-level geographic subdivisions, then link them to the Location instances using dwciri:inDescribedPlace.  If that were done, then one could consistently search for controlled value place names by traversing the geographic hierarchy using SPARQL property paths.  I may try to do this in a future project if I have time.
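
Short of linking to GeoNames or TGN identifiers, even a simple lookup table would remove some of the inconsistency.  The following Python sketch is only an illustration: the variant strings on the left come from the results above, but the readings assigned to them (for example, that "OC" means Oceania) are assumptions, and the names CONTINENT_LABELS and normalize_continent are hypothetical.

```python
# Hypothetical lookup table.  Keys are lower-cased variants seen in the
# query results; the labels assigned to them are assumptions.
CONTINENT_LABELS = {
    "na": "North America",
    "sa": "South America",
    "oc": "Oceania",
    "southern america": "South America",
}


def normalize_continent(raw: str) -> str:
    """Return a controlled label for a raw dwc:continent string.

    Values with no known mapping are returned with whitespace tidied
    but otherwise unchanged, so nothing is silently discarded.
    """
    cleaned = " ".join(raw.split())
    return CONTINENT_LABELS.get(cleaned.lower(), cleaned)


for value in ["NA", "Southern America", "Asia-Tropical", "Europe"]:
    print(value, "->", normalize_continent(value))
```

Values such as "Asia-Tropical" pass through untouched, which is deliberate: deciding what they should map to is exactly the kind of judgment that would need a real controlled vocabulary.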

The other problem with the query as it was given is that it misses all of the marine records, since they don't have a continent value.  The query could be improved by changing the triple pattern:

?loc dwc:continent ?cont.

to allow either predicate: 

?loc dwc:continent|dwc:waterBody ?cont.

in order to pick up the marine records.  Try it and see what happens!
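
For anyone who wants the revised query written out in full, here is a small Python sketch that assembles the query text with the property-path alternative in place.  The function name continent_query is hypothetical, and the namespace abbreviations are assumed to be known to the endpoint (or declared separately).

```python
def continent_query(place_predicates=("dwc:continent", "dwc:waterBody")):
    """Return the kingdom/phylum/place query text.

    The given predicates are joined with "|", the SPARQL 1.1
    property-path syntax for "either of these predicates".
    """
    place_path = "|".join(place_predicates)
    return f"""SELECT DISTINCT ?king ?phylum ?cont WHERE {{
?id dwc:kingdom ?king.
OPTIONAL {{?id dwc:phylum ?phylum.}}
?loc {place_path} ?cont.
?eve dsw:locatedAt ?loc.
?occ dsw:atEvent ?eve.
?occ dsw:occurrenceOf ?org.
?id dsw:identifies ?org.
}}
ORDER BY ?king
LIMIT 100"""


print(continent_query())
```

Calling continent_query(("dwc:continent",)) reproduces the original continent-only query, which makes it easy to compare the two result sets.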

Find species that occur in different datasets


Next, I wanted to see if I could find records for an organism that I knew occurred in two different datasets.  I had noticed that Prunella species occurred in the Vascular plants of South Northumberland and Durham dataset and knew that they also occurred in Bioimages.  So I ran this query to find occurrences along with their geocoordinates:

SELECT DISTINCT ?occ ?name ?cont ?lat ?long  WHERE {
?id dwc:genus "Prunella".
OPTIONAL {?id dwc:scientificName ?name.}
?loc dwc:continent ?cont.
?eve dsw:locatedAt ?loc.
?occ dsw:atEvent ?eve.
?occ dsw:occurrenceOf ?org.
?id dsw:identifies ?org.
OPTIONAL {
   ?loc dwc:decimalLatitude ?datatypeLat.
   ?loc dwc:decimalLongitude ?datatypeLong.
   BIND (str(?datatypeLat) AS ?lat)
   BIND (str(?datatypeLong) AS ?long)
   }
}
LIMIT 1000

I made the scientific name optional, although it's given pretty consistently in most of the records.  Here is an abbreviated set of results:

cont          | lat         | long        | name                         | occ
NA            | 35.85749    | -86.29678   | Prunella vulgaris            | //bioimages.vanderbilt.edu/ind-baskauf/25197#2003-05-20
NA            | 35.5578     | -83.49631   | Prunella vulgaris            | //bioimages.vanderbilt.edu/ind-baskauf/28758#2003-07-25
NA            | 36.2674     | -86.9001    | Prunella vulgaris            | //bioimages.vanderbilt.edu/ind-baskauf/10001#2004-07-09
Europe        | 54.94777693 | -1.69709708 | Prunella vulgaris            | //www.br.fgov.be/id/9ac1c8ad-ff30-4316-8a20-c1a30ad1fffd#2cd4p9h.256xrd
Europe        | 54.7411305  | -1.71417203 | Prunella vulgaris            | //www.br.fgov.be/id/178b9d5e-b80b-4d69-8680-5c89cf84f644#2cd4p9h.amdpvy
Europe        | 54.75045286 | -2.00928986 | Prunella vulgaris            | //www.br.fgov.be/id/d8beaf0d-6735-41ba-ad9f-9ce09d73ad7d#2cd4p9h.bcnsn1
...
Asia          | 47.5333     | 98.5292     | Prunella himalayana          | _:47502c13-95d0-4436-9b18-bb470dff03f6
Asia          | 28.0442     | 97.5700     | Prunella immaculata          | _:1e280a31-1bf9-4a94-a126-b5ab3092a77f
Asia          | 44.9042     | 100.5667    | Prunella fulvescens dahurica | _:e51d64ec-b227-41b0-9d49-f5b3e17b3eb3
Asia          | 47.5333     | 98.5292     | Prunella fulvescens dahurica | _:6f7e961a-1ced-4bdd-ad24-8980256d01e6
Europe        | 41.4808     | 24.3250     | Prunella modularis mabbotti  | _:51f2b0f3-85f1-40f2-9e19-028dee3bd757
North America | 37.513246   | -84.416034  | Prunella vulgaris            | _:ceeda4be-2f84-4d33-8410-2c3f2e1eb7d8

Based on the identifier format, it looks like the query picked up records from at least three of the datasets.

Kinds of evidence for occurrences

Here is one last query to discover the classes of things that are used as evidence to vouch for occurrences (here referred to as "tokens"), and to count how many occurrences were documented by those classes.

SELECT DISTINCT ?class (COUNT(?occ) AS ?count) WHERE {
?occ dsw:occurrenceOf ?org.
OPTIONAL {
   ?token dsw:evidenceFor ?occ.
   ?token a ?class.
   }
}
GROUP BY ?class
LIMIT 20

Here are the results:

class                        | count
dcmitype:StillImage          | 15727
dwc:PreservedSpecimen        | 62101
dsw:Token                    | 4
schema:ImageObject           | 1
                             | 43415
gbifterms:TypesAndSpecimen   | 4907
dwc:MaterialSample           | 40048
dwc:Organism                 | 40048

The results are a little deceptive, because in some of the datasets, tokens had multiple type declarations (e.g. typed as all of the following: a material sample, an organism, and a preserved specimen).  However, the big picture is that there are a lot of occurrences that are documented by images, a lot documented by specimens, but also a lot that aren't documented by anything at all (the row with no class).  This last category would be flagged as having basisOfRecord="HumanObservation", but as I discussed earlier, unless there is some kind of scan of a field notebook, these records basically don't have any kind of voucher for the occurrence record other than the database record itself.  

This example illustrates why Darwin-SW does not consider occurrences to be the same thing as specimens.  Occurrences can be documented by anything or by nothing.  Another possibility that can't be distinguished by this query is that an occurrence is documented by more than one distinct kind of evidence (e.g. by both an image of the organism when it was alive and a preserved specimen collected at the same time).  Darwin-SW allows for zero to many forms of documentary evidence. [4]
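
To make the counting issue concrete, here is a toy Python sketch.  The data are entirely hypothetical (three made-up tokens and three made-up occurrences, not drawn from the real datasets); the point is only to show how a per-class tally inflates when a token carries several type declarations, and what the tally cannot reveal.

```python
from collections import Counter

# Hypothetical toy data, not taken from the real datasets:
# each token maps to (the occurrence it is evidence for, its rdf:type classes).
tokens = {
    "token1": ("occ1", {"dwc:PreservedSpecimen", "dwc:MaterialSample", "dwc:Organism"}),
    "token2": ("occ2", {"dcmitype:StillImage"}),
    "token3": ("occ2", {"dwc:PreservedSpecimen"}),
}
occurrences = {"occ1", "occ2", "occ3"}  # occ3 has no token at all

# Mirrors COUNT(?occ) ... GROUP BY ?class: one row per (occurrence, token, class).
per_class = Counter(cls for _, classes in tokens.values() for cls in classes)

# Occurrences with no evidence correspond to the row with an unbound ?class.
documented = {occ for occ, _ in tokens.values()}
undocumented = occurrences - documented

# Occurrences backed by more than one token, which the class tally cannot reveal.
tokens_per_occurrence = Counter(occ for occ, _ in tokens.values())
multiple_evidence = sorted(o for o, n in tokens_per_occurrence.items() if n > 1)

print(dict(per_class))
print(sorted(undocumented), multiple_evidence)
```

In this toy case the per-class counts add up to five even though only two occurrences have any evidence, because token1 is counted once for each of its three classes and occ2 is counted under two different classes.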

Take-home messages

There are several points that I hope you have gotten from reading this post:

1. It is totally possible to merge biodiversity datasets into a single, effectively queryable graph database using conventional Linked Data technology, as long as a consistent graph model is used.

2. The graph model must be "normalized" to the extent that it can handle every kind of many-to-one relationship that exists in the datasets to be merged.  

3. Lack of URIs minted by providers is a problem, but if SPARQL querying is the only objective, fake URIs generated from locally unique identifiers or blank nodes make the task possible without provider-minted URIs.  It would be better if those fake URIs were real and dereferenceable, but the queries work just fine if they are not.

4. It is critical that literal-value properties are consistently assigned to particular classes, even if those classes don't exist in the source datasets.  

5. Lack of consistency in using controlled vocabulary, or failure to provide some levels of convenience properties, will result in missed matches.  So creating a truly usable merged graph would require generating links to standardized URIs for agents, geographical features, and taxa or taxon name entities.  This requires some string-matching (possibly with human intervention), but once it is done, searching can be done very consistently.

6. Experimentation would be needed to see if this scales for graphs containing orders of magnitude greater numbers of triples than the approximately 10 million triples tested here.  However, triple stores that can handle billions of triples exist, so the real question would be what kind of loading and search times would be required on that scale.

7. It doesn't really make sense to generate Darwin Core Archives from databases using something like IPT and then use something pathetic like Guid-O-Matic to convert the fielded text files into RDF serializations.  It would be more sensible to program something analogous to IPT that would convert data from databases or spreadsheets directly to some RDF serialization.  This software could not be fully developed until TDWG settles on a consensus graph model and assigns terms from Darwin Core (and other key vocabularies) to the classes in this consensus model.  But that is totally doable in a relatively short timeframe if there were a will to do it.

I would like to do a lot more with this dataset to improve it and to do more interesting queries.  But this is all that I have time for now.  So stay tuned.

------------------------------------------------------------------------------
[1] The issue stems from the distinction between the legacy Dublin Core terms in the http://purl.org/dc/elements/1.1/ namespace (which most people historically have abbreviated dc:)  and the "terms" namespace http://purl.org/dc/terms/ (traditionally referred to as dcterms:).  If you are using spreadsheets you probably couldn't care less about the distinction. But if you are setting guidelines for institutions to convert their data to RDF, you should care.  Best practices dictate that terms like dcterms:creator, dcterms:type, and dcterms:publisher should be used with non-literal values.  If a provider wishes to use literal values, it is best practice to use the legacy terms dc:creator and dc:publisher, which have no declared ranges.  There has been a disturbing (in my opinion) practice recently of using dc: to refer to the "terms" namespace rather than the legacy namespace.  It is not clear whether the example compliant document is following this (disturbing) practice or intending the traditional meaning of dc:, since there are no namespace declarations in the example.  If they intend for dc: to mean http://purl.org/dc/elements/1.1/, then their example is not compliant because it does not use the terms specified in the CSPP Elements table above.  If they intend for dc: to mean http://purl.org/dc/terms/, then their example does not follow best practices because dc:creator, dc:type, and dc:publisher all have literal values.  

[2] Assuming that the Arctos RDF were changed to use current DwC terms.

[3] As of 2016-11-06, Guid-O-Matic can't make links in the one-to-many direction; only in the many-to-one direction.  So temporarily I'm making the connection in the inverse direction using the fake property tc:INVcircumscribedBy.  Eventually I'll fix it and make it go the other direction.

[4] Note on 2016-11-07. Quentin Groom pointed out that Prunella is a homonym for both a bird and a plant.  That's a great illustration why a search for a single convenience term like dwc:genus is deficient and why linking to an identifier for the name entity would be a lot better.  
