Wednesday, November 9, 2016

Fixing the octopus


In a recent post, I described how I used a rather hacked-together bit of software called Guid-O-Matic to convert a diverse set of Darwin Core Archives into RDF graphs serialized with a choice of formats.  In that post, I confessed that the resulting RDF had several deficiencies, which I claimed could probably be fixed pretty easily using SPARQL CONSTRUCT queries.  In this post, I'll describe how I made those fixes, talk about a problem with using blank nodes, and discuss possible reasons for bothering to create a more complex graph structure (e.g. using RDF) rather than just settling for the simple structure of Darwin Core Archives.

Image by Paul Shaffner CC-BY via Wikimedia Commons

Fixing the "made-up" predicates and the problem with blank nodes

As I discussed in the earlier post, by its nature a Darwin Core Archive is limited to linking fielded text files in a very simple structure that has been called a "star schema".  The limitations of this format allow only for a very simple structure for the RDF graph generated directly from data in the archive - I've dubbed such a graph model a "starfish schema".  


The diagram above shows an example of a "starfish" schema.  One to many resources of a particular type can be linked to the central resource via an object property, and there can be more than one type of linked resource.  In the example above, there are multiple images linked to the central resource by the property dsw:derivedFrom and multiple determinations linked to the central resource by the property dsw:identifies.  Why did I choose to make the arrows point from the periphery to the center rather than the other way around?  It is a matter of convenience - I can just have a column in the image table that is a foreign key pointing to the organism primary key, then generate a triple linking the subject image to the object organism when I write out the graph for the image.  Here's what it would look like:

<http://bioimages.vanderbilt.edu/baskauf/79649>
   ... other stuff ...
   a dcmitype:StillImage;
   dsw:derivedFrom <http://bioimages.vanderbilt.edu/vanderbilt/7-314>.

<http://bioimages.vanderbilt.edu/baskauf/79651>
   ... other stuff ...
   a dcmitype:StillImage;
   dsw:derivedFrom <http://bioimages.vanderbilt.edu/vanderbilt/7-314>.

etc.

Darwin-SW defines a lot of inverse properties, and there is an inverse for dsw:derivedFrom that is called dsw:hasDerivative.  So the following graph:

<http://bioimages.vanderbilt.edu/vanderbilt/7-314>
  a dwc:Organism;
  dsw:hasDerivative <http://bioimages.vanderbilt.edu/baskauf/79649>,
                    <http://bioimages.vanderbilt.edu/baskauf/79651>.

would entail the same links as in the previous example if a client reasoned the inverse relationships.  However, I am loath to assume that people using generic triplestores and SPARQL endpoints will be doing that kind of reasoning.  That's why in Darwin-SW we picked one of the two properties in each inverse pair to be "preferred" (shown in blue in the following diagram):  



When making the choice of which inverse to prefer, we chose the one that was most likely to point in a many-to-one direction, so that the value (i.e. object) of the property could be stored as a single foreign key in a column of a table of metadata about the subject.  If we went the other way (the gray arrows), you would have the problem of storing an indefinite number of repeated values in a row in a table about the subject resource.  
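To make the convenience of the many-to-one direction concrete, here is a minimal Python sketch. It is not Guid-O-Matic's actual code: the row layout, the column names "iri" and "organism", and both function names are assumptions made up for illustration. Because each image row holds exactly one foreign key, exactly one linking triple can be written per row.

```python
# Hypothetical image table: one row per image, with a single foreign-key
# column pointing at the organism (many images -> one organism).
IMAGE_ROWS = [
    {"iri": "http://bioimages.vanderbilt.edu/baskauf/79649",
     "organism": "http://bioimages.vanderbilt.edu/vanderbilt/7-314"},
    {"iri": "http://bioimages.vanderbilt.edu/baskauf/79651",
     "organism": "http://bioimages.vanderbilt.edu/vanderbilt/7-314"},
]

def rows_to_triples(rows, predicate="dsw:derivedFrom"):
    """Emit one (subject, predicate, object) triple per table row."""
    return [(row["iri"], predicate, row["organism"]) for row in rows]

def to_turtle(triple):
    """Format a triple as a one-line Turtle statement (prefixed predicate)."""
    s, p, o = triple
    return f"<{s}> {p} <{o}>."

for triple in rows_to_triples(IMAGE_ROWS):
    print(to_turtle(triple))
```

Going the other way (dsw:hasDerivative) would require the organism row to hold a list of image identifiers of unknown length, which is exactly the storage problem described above.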

I confess that Guid-O-Matic is a hack - I wrote it in the easiest possible way.  So it assumes that, when it translates an extension table of a Darwin Core Archive into an RDF graph, the property used to connect a row of the extension table to a row in the core table is one that is appropriate when the extension table row is the subject and the core table row is the object.  It can't handle making the link in the other direction.  This problem could be fixed by adding some code that allowed the user to designate that the connecting property operates in the inverse direction from what is expected.  Then the program would be able to generate records like this:

<http://bioimages.vanderbilt.edu/baskauf/79649>
   ... other stuff ...
   a dcmitype:StillImage.
<http://bioimages.vanderbilt.edu/vanderbilt/7-314>
  dsw:hasDerivative <http://bioimages.vanderbilt.edu/baskauf/79649>.

<http://bioimages.vanderbilt.edu/baskauf/79651>
   ... other stuff ...
   a dcmitype:StillImage.
<http://bioimages.vanderbilt.edu/vanderbilt/7-314>
  dsw:hasDerivative <http://bioimages.vanderbilt.edu/baskauf/79651>.

Being a lazy person, I have not bothered to write said code.  Instead, I just made up fake inverse properties and used them.  So for the Catalogue of Afrotropical Bees, where the octopus diagram looked like this:


and where tc:circumscribedBy and dwc:taxonRemarks pointed in the wrong direction (in the one-to-many direction), I made up the fake properties tc:INVcircumscribedBy and dwc:INVtaxonRemarks and just used them to make the arrows point in the opposite direction.

I was feeling guilty about taking this shortcut, but resolved to fix the problem later using a SPARQL CONSTRUCT query.  That query would be super-simple and look like this:

CONSTRUCT {?taxon tc:circumscribedBy ?specimen.} WHERE 
          {?specimen tc:INVcircumscribedBy ?taxon.}

All I had to do was run the query and then dump the triples back into the triplestore.  Problem solved!

Unfortunately, when I ran the CONSTRUCT query, here was what I got:

<http://www.arc.agric.za/taxon/1491> 
     tc:circumscribedBy _:genid322ffff2592d8cf32d4a5d2d87712dd8ad99c22f8b .
<http://www.arc.agric.za/taxon/1530> 
     tc:circumscribedBy _:genid3338abf8272dc2102d45272d92172d1ccf2265dd1d .
<http://www.arc.agric.za/taxon/3152> 
     tc:circumscribedBy _:genid3117ade37c2dbc132d4f762d83f02d89fc8a5ac58c .

I forgot that I had let the preserved specimens (i.e. type specimens) be blank nodes.  That's fine, as long as I'm only talking about them in the same document where I define their properties along with the properties of the taxa, but now I need to refer to them in a second document and I can't because the blank node identifier doesn't have any meaning outside the document in which it is used.  So although I'm not really interested in minting identifiers for resources whose records I don't maintain, I need to do it anyway.  

Here was the easy fix that I chose.  I created a new column in the type specimen table called "frag" and filled it with consecutive numbers from 1 to N so that there would be a locally unique identifier for the type specimen.  Then I changed the links table for Guid-O-Matic to this:


Instead of indicating that the linked class was a blank node (by putting "_:" in the suffix1 column), I indicated the value in the "frag" column should be appended to the root URI of the core resource (the taxon) as a fragment identifier, followed by "type".  That resulted in generating triples like this:

<http://www.arc.agric.za/taxon/1491#145type>
     tc:INVcircumscribedBy <http://www.arc.agric.za/taxon/1491>.

Now when I run the CONSTRUCT query from before, I get output like this:

<http://www.arc.agric.za/taxon/1491> 
     tc:circumscribedBy <http://www.arc.agric.za/taxon/1491#145type>.

which was what I wanted.  I was able to load the file containing those triples into the triplestore and have the links in the direction that was actually shown on the octopus diagram.  
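The two steps above (minting a hash URI from the "frag" column, then reversing the fake inverse property) can be sketched in plain Python. This is only an analogue of what Guid-O-Matic and the CONSTRUCT query do, not their implementation; the function names mint_type_iri and invert are hypothetical.

```python
def mint_type_iri(taxon_iri, frag):
    """Append the frag value plus 'type' to the taxon IRI as a fragment identifier."""
    return f"{taxon_iri}#{frag}type"

def invert(triples, fake_predicate, real_predicate):
    """Analogue of the CONSTRUCT query: for every triple that uses the fake
    inverse predicate, swap subject and object and use the real predicate."""
    return [(o, real_predicate, s)
            for (s, p, o) in triples
            if p == fake_predicate]

taxon = "http://www.arc.agric.za/taxon/1491"
specimen = mint_type_iri(taxon, 145)

# Triple as generated from the extension table (specimen is the subject).
original = [(specimen, "tc:INVcircumscribedBy", taxon)]

# Triple pointing in the direction shown on the octopus diagram.
fixed = invert(original, "tc:INVcircumscribedBy", "tc:circumscribedBy")
print(fixed)
```

Because the specimen now has an IRI rather than a blank node identifier, the inverted triple still refers to the same resource when it is loaded from a separate file.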

When I fixed the problem with the fake dwc:INVtaxonRemarks link, I decided not to actually use dwc:taxonRemarks, which the Darwin Core RDF Guide indicates should be used with literal values.  I decided instead to make the link using dcterms:description, which has no stated range of rdfs:Literal (as do some other Dublin Core terms).  




dsw:derivedFrom relationships

When I described the Global Genome Biodiversity Network archive (octopus diagram above), I noted that I would prefer to describe the sequential relationship

fish specimen --> tissue sample ----> extracted DNA --> amplified DNA ----> nucleotide sequence
          sample prep        DNA extraction           PCR            sequencing

 among the resources derived from the organism/preserved specimen like this:


(using the GGBN class names given in the dataset).  The dsw:derivedFrom object property is transitive, so a NucleotideSequence is derivedFrom an Amplification, but it also has an entailed derivedFrom relationship with a MaterialSample, a Preparation, and the original PreservedSpecimen as well.  

The problem is that due to the starfish schema limitations of Darwin Core archives, the graph has to (at least initially) be this:


where all of the resources in the extension files are connected directly to the core resource.  (The link to the NucleotideSequence is to a resource external to the dataset.)  In my original post, I suggested that SPARQL CONSTRUCT queries could be used to change the second graph into the first.  So I decided to see if that was true.  This situation has the same problem as I described earlier: I can't use blank nodes to represent the sequentially-derived resources and then try to talk about them later in a second document.  So I recreated the graph using generated hash URIs for them in the same manner that I described in the first section of this post.  

One issue that I needed to figure out was whether there were 1:1 relationships between the PreservedSpecimens and resources downstream in the derivation chain.  If there were, then it would be easy to generate the sequential derivedFrom links.  First I just counted the number of each kind of resource using a query like this:

SELECT (COUNT(DISTINCT ?resource) AS ?count) WHERE {
?resource a ggbn:Preparation.
}

I determined that there were 40048 Preparations, 40048 MaterialSamples, and 6923 Amplifications.  This led me to believe that there was a 1:1 relationship between material sample and preparation.  I needed to make sure that there were no organisms that had more than one preparation, and no preparations that had more than one MaterialSample.  I used a query like this to make sure:

SELECT DISTINCT  ?organism ?prep1 ?prep2 WHERE {
?organism a dwc:Organism.
?prep1 dsw:derivedFrom ?organism.
?prep1 a ggbn:Preparation.
?prep2 dsw:derivedFrom ?organism.
?prep2 a ggbn:Preparation.
FILTER (?prep1 != ?prep2)
}
Limit 10

Basically, it checks to see if there are any cases where two different preparations are derived from the same organism.  By running this and similar queries, I was able to discover that no organism/specimen had more than one Preparation and that no Preparation had more than one MaterialSample.  There were cases where there was more than one Amplification per MaterialSample.  That was fine because that was the last link and there was no ambiguity about which MaterialSample it should link to.  
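The same uniqueness check can be expressed outside SPARQL. The sketch below is a Python analogue under stated assumptions: the triples and type table are made-up toy data, and the function name parents_with_many is hypothetical. It groups resources of one class by the resource they are derived from and reports any parent that has more than one.

```python
from collections import defaultdict

# Toy data (made up): every Preparation links directly to its organism,
# as in the starfish-shaped graph.
TYPES = {
    "prep1": "ggbn:Preparation",
    "prep2a": "ggbn:Preparation",
    "prep2b": "ggbn:Preparation",
}
TRIPLES = [
    ("prep1", "dsw:derivedFrom", "org1"),
    ("prep2a", "dsw:derivedFrom", "org2"),
    ("prep2b", "dsw:derivedFrom", "org2"),
]

def parents_with_many(triples, types, child_class, predicate="dsw:derivedFrom"):
    """Return {parent: children} for parents with more than one child of child_class."""
    children = defaultdict(set)
    for s, p, o in triples:
        if p == predicate and types.get(s) == child_class:
            children[o].add(s)
    return {parent: kids for parent, kids in children.items() if len(kids) > 1}

violations = parents_with_many(TRIPLES, TYPES, "ggbn:Preparation")
print(violations)
```

An empty result corresponds to the SPARQL query returning no rows, which is what indicates that the relationship is at most one-to-one.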

Here's the query I used to construct the MaterialSample link to the Preparation one hop above it (the Preparation was already linked to the Organism):

CONSTRUCT {?ms dsw:derivedFrom ?prep.} WHERE {
  ?prep a ggbn:Preparation.
  ?prep dsw:derivedFrom ?org.
  ?ms a ggbn:MaterialSample.
  ?ms dsw:derivedFrom ?org.
  }

Here's the query I used to link the Amplification to the MaterialSample one hop above it:

CONSTRUCT {?amp dsw:derivedFrom ?ms.} WHERE {
  ?amp a ggbn:Amplification.
  ?amp dsw:derivedFrom ?org.
  ?ms a ggbn:MaterialSample.
  ?ms dsw:derivedFrom ?org.
  }
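Both CONSTRUCT queries are joins on the shared organism. A Python analogue, using made-up toy identifiers and a hypothetical function name construct_hop, might look like the sketch below. Note that it only gives an unambiguous answer because each organism has a single resource of the parent class, as established by the checks above.

```python
from collections import defaultdict

# Toy starfish graph (made up): everything is derived directly from org1.
TYPES = {
    "prep1": "ggbn:Preparation",
    "ms1": "ggbn:MaterialSample",
    "amp1": "ggbn:Amplification",
    "amp2": "ggbn:Amplification",
}
TRIPLES = [
    ("prep1", "dsw:derivedFrom", "org1"),
    ("ms1", "dsw:derivedFrom", "org1"),
    ("amp1", "dsw:derivedFrom", "org1"),
    ("amp2", "dsw:derivedFrom", "org1"),
]

def construct_hop(triples, types, child_class, parent_class,
                  predicate="dsw:derivedFrom"):
    """Link each child_class resource to every parent_class resource
    that is derived from the same core resource."""
    parents_by_core = defaultdict(list)
    for s, p, o in triples:
        if p == predicate and types.get(s) == parent_class:
            parents_by_core[o].append(s)
    new_triples = []
    for s, p, o in triples:
        if p == predicate and types.get(s) == child_class:
            for parent in parents_by_core.get(o, []):
                new_triples.append((s, predicate, parent))
    return new_triples

ms_links = construct_hop(TRIPLES, TYPES, "ggbn:MaterialSample", "ggbn:Preparation")
amp_links = construct_hop(TRIPLES, TYPES, "ggbn:Amplification", "ggbn:MaterialSample")
print(ms_links)
print(amp_links)
```

If an organism had two MaterialSamples, this join (like the SPARQL version) would link every Amplification to both of them, which is why the 1:1 check had to come first.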

Originally, I'd made the link from the Amplification to the NucleotideSequence in the opposite direction using dsw:hasDerivative.  I'd rather make the link in the preferred direction using dsw:derivedFrom so that I could use SPARQL property paths to traverse the chain of dsw:derivedFrom links at will.  Constructing that inverse relationship was easy, and at the same time I generated a type declaration for the NucleotideSequence instance.  As far as I know, NCBI doesn't provide any RDF metadata when one tries to dereference the sequence URIs, so I was on my own as far as generating metadata about sequences was concerned.  I don't know of a consensus class for sequences, so for convenience, I made one up for the time being: http://example.org/NucleotideSequence (ex:NucleotideSequence).   Here's the query I used:

prefix ex: <http://example.org/>
prefix dsw: <http://purl.org/dsw/>
prefix ggbn: <http://data.ggbn.org/schemas/ggbn/terms/>
CONSTRUCT {
?nuc dsw:derivedFrom ?amp.
?nuc a ex:NucleotideSequence.
} WHERE {
  ?amp a ggbn:Amplification.
  ?amp dsw:hasDerivative ?nuc.
  }

I actually did the CONSTRUCT queries using Stardog as a localhost rather than the Vanderbilt Heard Library SPARQL endpoint (currently Callimachus-based) because Stardog allows the output to be saved in a file in any serialization (I saved it as Turtle).  I loaded all of the constructed triples that I manufactured in this exercise back into the Heard Library endpoint (and also replaced the graphs that were using blank nodes with graphs using the hash URIs).  Then I could test the new graph.  Note: the complete updated set of graphs is available at https://github.com/baskaufs/guid-o-matic/tree/master/dwc-a/octopus in the octopus.zip file.


Using the transitive dsw:derivedFrom links to find stuff

The point of making dsw:derivedFrom transitive was to allow a user to find things that were any number of dsw:derivedFrom links apart in a chain of derivation.  One way to make use of the transitive properties would be to use a SPARQL endpoint with transitive reasoning enabled.  However, another method that can be used with any endpoint that supports SPARQL 1.1 is Property Paths.  Here's an example of a query that allowed me to discover something that I couldn't easily know by just examining the large data files visually:

prefix ex: <http://example.org/>
prefix dsw: <http://purl.org/dsw/>
prefix dwc: <http://rs.tdwg.org/dwc/terms/>

SELECT DISTINCT   ?organism  ?nuc1 ?nuc2 WHERE {
?organism a dwc:Organism.
?nuc1 a ex:NucleotideSequence.
?nuc1 dsw:derivedFrom+ ?organism.
?nuc2 a ex:NucleotideSequence.
?nuc2 dsw:derivedFrom+ ?organism.
FILTER (?nuc1 != ?nuc2)
}
Limit 10

In this query, I wanted to find out if there were any specimens/organisms that had more than one nucleotide sequence as an end product.  I already knew that there were multiple Amplifications per organism, and I knew that there were a total of 4269 sequences (vs. 6923 Amplifications), but I did not know the relationship between organisms and sequences.  The query uses triple patterns containing dsw:derivedFrom+.  The extra "+" at the end of the predicate indicates one or more hops apart.  In this case, there is a known and consistent sequence of derived resources, so it wouldn't be necessary to use property paths, but you could imagine situations that included subsampling where the number of links might vary, or where there were other derived resources like ProteinSequences that one wanted to discover.  In those cases, just being able to link generically to any derived resource would be good.  
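For readers less familiar with property paths, the "+" operator behaves like a graph traversal that follows one or more links. The following Python sketch illustrates the idea with a breadth-first search; it is an illustration only, not how any particular SPARQL engine evaluates paths, and the toy identifiers and the function name derived_from_plus are made up.

```python
from collections import defaultdict, deque

# Toy derivation chain (made up), already rewired into sequential links.
CHAIN = [
    ("prep1", "dsw:derivedFrom", "org1"),
    ("ms1", "dsw:derivedFrom", "prep1"),
    ("amp1", "dsw:derivedFrom", "ms1"),
    ("nuc1", "dsw:derivedFrom", "amp1"),
]

def derived_from_plus(triples, target, predicate="dsw:derivedFrom"):
    """Return every resource linked to target by one or more predicate hops."""
    # Index: object -> set of subjects that point directly at it.
    children = defaultdict(set)
    for s, p, o in triples:
        if p == predicate:
            children[o].add(s)
    found = set()
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in found:
                found.add(child)
                queue.append(child)
    return found

print(sorted(derived_from_plus(CHAIN, "org1")))
```

The traversal does not care how many hops separate a sequence from the organism, which is what makes the approach tolerant of subsampling or other variable-length derivation chains.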

The results showed that there were some specimens like http://n2t.net/ark:/65665/3b0ee4eda-0274-4527-a4d8-069504a9f066 that did have multiple NucleotideSequences.  Here's a query that you could do to see what all of the derived resources were, and what classes they were instances of:

SELECT DISTINCT   ?derivative ?class WHERE {
?derivative dsw:derivedFrom+ <http://n2t.net/ark:/65665/3b0ee4eda-0274-4527-a4d8-069504a9f066>.
?derivative a ?class.
}
Limit 100

The results show that there is one Preparation, one MaterialSample, 23 Amplifications, and 23 NucleotideSequences that are derived from the specimen.

What's this good for?

Since the previous post, I had one off-list email conversation where the question came up of whether there were use cases that the graph-based approach (e.g. RDF) can satisfy that might be difficult to satisfy using more conventional methods.  This is an important question, given that GBIF already has a lot of web services that can allow users to do amazing things.  I don't have a really good answer to this question.  I think that before embarking on a program of creating and maintaining a giant triplestore of RDF that duplicates what GBIF already has, we need to lay out the additional use cases that such a triplestore could satisfy beyond what can already be done conventionally. 

I think that the most interesting cases would be those where two providers submit information about related resources, each without the knowledge of the other.  Here's an example.  Let's say that someone collects a bird specimen like http://n2t.net/ark:/65665/3b0ee4eda-0274-4527-a4d8-069504a9f066. It is sampled and eventually results in the sequence http://www.ncbi.nlm.nih.gov/nuccore/KP790702.  We have an interest in that sequence for some reason - maybe we are doing a molecular phylogeny.  Unbeknownst to us, some expert has looked at that bird specimen and decided that it was not actually Locustella certhiola ssp. certhiola - perhaps the expert has applied another determination to that specimen and disagreed with the previous determination about the specific epithet.  That might be important to our project, but how would we know that it happened? 

SELECT DISTINCT   ?date ?determiner WHERE {
<http://www.ncbi.nlm.nih.gov/nuccore/KP790702> dsw:derivedFrom+ ?organism.
?determination dsw:identifies ?organism.
?determination dwc:dateIdentified ?date.
?determination dwc:identifiedBy ?determiner.
}

This query would give us a list of all determination dates and determiners (since the Darwin-SW model allows for many determinations for one organism).  Unfortunately, this query won't actually work, since the dataset does not provide the identification dates and name of determiner.  But it would work if the data were there.  

One could imagine even broader queries like this:

SELECT DISTINCT   ?sequence ?date WHERE {
?specimen a dwc:PreservedSpecimen.
?sequence a ex:NucleotideSequence.
?specimen dwc:institutionCode "USNM".
?specimen dsw:derivedFrom* ?organism.
?sequence dsw:derivedFrom+ ?organism.
?sequence dcterms:created ?date.
FILTER (?date > "2016-11-09"^^xsd:date)
}

This query would ask whether there were any new sequences that were created after 9 November 2016 that were derived from any specimens that were in the US National Museum collection.  Note that I used a "*" after dsw:derivedFrom instead of a "+" to allow for zero to many links - a necessity if the specimen was considered to be the organism rather than be derived from the organism.  This query won't actually work, since there are no RDF metadata about sequences provided by NCBI that we can put into the triplestore.  But if we could get those data, USNM could know that their collection was being used to generate new sequence data even if the sequencing were done on a downstream sample that was in the possession of another institution.  

There is another potential use case involving inferring duplicates that is described in the Darwin-SW paper at http://dx.doi.org/10.3233/SW-150203 (open access at http://semantic-web-journal.net/content/lessons-learned-adapting-darwin-core-vocabulary-standard-use-rdf).  It would be very interesting to accumulate more such use cases.
