Sunday, November 13, 2016

SPARQL-based web app to find Chinese temple buildings

(temporary) URL of the web app:


This fall, the Semantic Web Working Group at Vanderbilt University has been learning about applying Linked Data technologies to practical problems, and as an exercise, we have been working with Tracy Miller's dataset on the architecture of buildings at Chinese temple sites (referred to in the rest of this post as the tang-song dataset).  In an earlier post, I talked a little bit about how we turned spreadsheets containing her data into RDF.  This required first deciding on a graph model that linked one to many buildings to the temple sites at which they were located.  The model uses the PeriodO ontology to describe the dynasties over which the temple sites were constructed, and generally uses the W3C Time Ontology to model time relationships.  Wherever possible, sites were linked to the Getty Thesaurus of Geographic Names (TGN) records for the places where the sites were located.  The model also uses generic Dublin Core and WGS84 Basic Geo vocabularies plus some terms as necessary to link everything together.

The tang-song temple/building graph (in Turtle serialization) is now "done" in the sense that we are now using it to "do" things.  It will continue to evolve as we add triples for the architectural features and links to the images of the buildings, which have not yet been included.  But it's done enough that we can now play with using the data to do useful things by loading them into a triplestore and running SPARQL queries on them.    

In this post, I'm going to talk about a web application that I've put together that generates SPARQL queries based on user input, then takes the results and generates some useful output in the browser about tang-song temple buildings.[1]  For those of you who do not have enough patience to read the gory details and just want to play with the application, it is online at a temporary location.  I won't promise that it will stay there - eventually it will probably be linked to  

The application has two components: a Bootstrap-based HTML page and a jQuery-based Javascript script that's called by the HTML page.  The HTML page does not have to be at any particular location - it can be run as a local file as long as the Javascript script file is in the same directory (and assuming that the Vanderbilt Heard Library SPARQL endpoint is functioning and has the tang-song graph loaded).  So even if you can't find the page online, you can just download the HTML and Javascript files to a folder somewhere and open the HTML page in a browser.  If you do that, you can also hack the Javascript "live" and see the effects by just reloading the HTML page.  One other note: if you intend to hack the Javascript, make sure that your text editor loads it as UTF-8 character encoding, or you might lose the Chinese characters.

By the way, the Chinese that you see on the web pages was generated by Google Translate, so my apologies to native speakers.  Hopefully the Chinese-speaking members of our working group can help me clean it up.

An aside about data munging

I wanted to mention as an aside that I did make use of the giant mass of Getty TGN triples that I laboriously loaded into Blazegraph (a painful process described in a recent blog post).  I used a SPARQL query that used the existing Chinese names for counties/cities and provinces in the tang-song dataset to find the Getty TGN URIs for the lowest political subdivisions in which the temple site was located.  Those URIs were added back into the tang-song graph and I subsequently ran another query to find the Latin Pinyin transliterations for the counties/cities from the TGN graph (something that we did not originally have in the tang-song dataset and didn't want to have to look up manually).

Originally I was going to blog about the experience, but the query that I ran was just a hack of the last one that I described in the recent post.  I decided it wasn't interesting enough to blog about.  However, the experience did drive home the very serious problem related to named graphs that I wrote about in the "Take-homes..." section of that earlier post.  I had originally loaded the tang-song graph into the Blazegraph triplestore (i.e. "BigData") without specifying a named graph for it (necessary because I had stupidly not designated the Getty TGN triples as part of a named graph, either).  Now we had an updated tang-song graph in which we had changed many of the URIs.  When I loaded it into the triplestore (also not as part of a named graph), those triples were mixed in with the triples from the outdated graph and I started getting hits for the bad URIs as well as the good.

Because the bad triples weren't part of a named graph, there wasn't any easy way (as far as I know) in Blazegraph to remove them without dropping the whole default graph, including the Getty TGN triples that took me three days to load.  I managed to use an inelegant workaround to get the Pinyin transliterations that I needed, but it became clear to me that eventually I was going to have to drop the whole default graph, reload the Getty TGN graph, and make sure that I was careful to only add triples to named graphs from that point forward.

This is not a problem that we have with the Callimachus installation that we are currently using at because it automatically keeps track of the triples that were included in a particular file that was uploaded.  If you upload a new version of that file, it will automatically replace the triples that were in the old version of the file.  This is kind of clunky, because it treats each file as a named graph rather than letting you specify the URI for the graph, but at least you can manage sets of triples in some manner.  If we replace the Callimachus endpoint with Blazegraph in the future (which is likely), we will have to be careful about specifying each upload as part of a named graph.

General description of how the web app works

Here's a simplified version of how the web app works.

1. When the page loads, the Javascript fires a series of queries to the Heard Library SPARQL endpoint to find out what Chinese provinces and what temple sites are represented in the graph.  A critical piece of these queries is specifying which geo:SpatialThing instances in the triplestore are actually temple sites.  That's done with this triple pattern:

?site <> <>.

The feature code gn:S.ANS is described in the GeoNames ontology as "a place where archeological remains, old structures, or cultural artifacts are located", which pretty much describes the temple sites and differentiates them from other geo:SpatialThings in the triplestore.

2. The endpoint sends the query results back in XML format.

3. The script uses jQuery functions to parse the XML and pull out the names of the provinces and sites.  It then inserts them into the dropdown selection lists on the web page.  One very fun aspect of this is that the queries specify whether the results should include names in Pinyin transliteration or in simplified Chinese characters.  The choice of English (i.e. Pinyin when there isn't an actual English name) vs. Chinese is one of the options offered on the web page. (In contrast to the dynamically populated province and site dropdown lists, the dynasty dropdown list is hard-coded in the Javascript, since it is fixed for the dataset.)

4. The temple site dropdown is more complicated than the other three, whose options don't change interactively (other than to be displayed in alternate languages).  When one of the first three dropdowns (Language, Province, or Dynasty) is changed, a new temple site query is sent and the temple site dropdown options are screened based on the choices in the boxes above.  So for example, if "Jin" is selected as the dynasty, the options displayed on the temple site dropdown are reduced to only sites whose period of construction included the Jin dynasty.

5. Once the options are selected to the user's satisfaction, clicking on the Search button fires another SPARQL query to the endpoint asking for names of buildings (in Chinese characters, Pinyin, and English as available) for sites that meet the screening criteria. The query also requests the geocoordinates of the buildings if they are available.

6. The results XML is parsed and the returned values are used to create HTML that is then inserted into the page.  If the geocoordinates exist, they are inserted into strings that load two sorts of Google maps onto the page.  One shows the site location on a political map at lower magnification and the other shows the Google Earth view at a magnification where the orientation of the buildings can be seen.  (See the screenshot at the top of this page for examples.)

One thing that is clear about this method is that things don't happen instantaneously.  Depending on how long it takes the server to execute the query, there may be a noticeable delay before the results of the query are injected into the page.  For that reason, there is a "spinner" at the bottom of the page next to the "Search" button that indicates that the user must wait for the results to come back.

Details of generating the SPARQL query

I won't go into the details of the HTML, since it's pretty standard - it loads the Bootstrap script to create the responsive design of the page, and also sets up the buttons and dropdown lists.  The tang-song.js Javascript is more interesting.

There are a number of places in the Javascript where SPARQL queries are generated.  Here is an example: the query that asks what provinces are referenced in the dataset:

var string = 'SELECT DISTINCT ?province WHERE {'
   +'?site <> <>.'
   +'?site <> ?province.'
   +"FILTER (lang(?province)='" + languageTag + "')"
   +'}'
   +'ORDER BY ASC(?province)';
var encodedQuery = encodeURIComponent(string);

The query is constructed by concatenating each of the lines of the query into one long string - they are concatenated on separate lines just to make it easier for a programmer to see the individual triple patterns of the query.  The variable languageTag contains the IETF language tag for the desired version of the page (zh-latn-pinyin for English and zh-hans for simplified Chinese characters), and is inserted in the appropriate place to filter the results by language.  The results are ordered alphabetically so that when the list comes back from the server it won't need to be sorted.  

The actual query is made by an HTTP GET to the endpoint with the query appended to the URL like this:

'' + encodedQuery

  The query has to be URL-encoded so that spaces and other characters that are "bad" for URLs are escaped.  
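The two steps described above (assembling the query string, then URL-encoding it onto the endpoint address) could be wrapped in small helper functions.  This is only a sketch under assumptions: the function names buildProvinceQuery and buildRequestUrl are made up for illustration, the IRIs and endpoint address are passed in as parameters rather than being the ones used in tang-song.js, and the "query" parameter name is the one defined by the SPARQL Protocol for GET requests.

```javascript
// Sketch only: function names are hypothetical, and all IRIs are supplied
// by the caller rather than copied from tang-song.js.
function buildProvinceQuery(languageTag, featureCodeIri, templeSiteCodeIri, provinceIri) {
  // Pieces are joined with spaces so adjacent tokens never run together.
  return [
    'SELECT DISTINCT ?province WHERE {',
    '?site <' + featureCodeIri + '> <' + templeSiteCodeIri + '>.',
    '?site <' + provinceIri + '> ?province.',
    "FILTER (lang(?province)='" + languageTag + "')",
    '}',
    'ORDER BY ASC(?province)'
  ].join(' ');
}

function buildRequestUrl(endpoint, query) {
  // encodeURIComponent escapes spaces, braces, '?', '#', '<' and '>'.
  return endpoint + '?query=' + encodeURIComponent(query);
}
```

Keeping the closing brace as its own array element makes it harder to lose when triple patterns are added or removed.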

One of the problems with debugging a program like this is knowing why it has failed when nothing happens.  I leaned very heavily on Chrome's developer tools that can be accessed from the Customize menu in the upper right corner of the browser.  Choose "More tools", then "Developer tools" from the menu.  Click on the Network tab to see what is happening when the page runs.  Here's what it looks like when the page loads:

The last two items on the list at the right are the two SPARQL queries that load the province and temple site dropdown lists.  If you click on the second to last item, you can see a breakdown of the request URL, headers, and the query string in unencoded form:

At the top you can see the ugly long URL that the program created by concatenating all of the query pieces and URLencoding it.  At the bottom is a decoded view.  You can actually copy and paste the decoded view into the Heard Library SPARQL endpoint "sandbox" at in order to debug.  When I'm debugging, after pasting I go ahead and add the hard returns after each period to make the query less confusing.  Sometimes that in itself is enough to let me know what was wrong with it.  

The actual sending and receiving of the query is done by jQuery using the .ajax function:

        $.ajax({
            type: 'GET',
            url: '' + encodedQuery,
            headers: {
                Accept: 'application/sparql-results+xml'
            },
            success: parseProvinceXml
        });

The Callimachus endpoint only supports XML results, but other SPARQL applications support JSON results, which would probably be easier for the Javascript to ingest.  The returned XML gets passed into an XML parsing function.
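To illustrate why JSON results might be easier to ingest, here is a hedged sketch of what handling them could look like.  It assumes an endpoint that honors an Accept header of application/sparql-results+json and returns the standard SPARQL 1.1 JSON results layout (a head.vars list plus results.bindings).  The function name flattenBindings and the sample values are invented for the example; nothing like this is in tang-song.js.

```javascript
// Flatten a SPARQL 1.1 JSON results document into one plain object per row.
// Variables that are unbound in a row are simply absent from that row's object.
function flattenBindings(resultsJson) {
  return resultsJson.results.bindings.map(function (binding) {
    var row = {};
    resultsJson.head.vars.forEach(function (name) {
      if (binding[name] !== undefined) {
        row[name] = binding[name].value;
      }
    });
    return row;
  });
}

// Hand-written sample in the standard results layout (values are made up).
var sampleResults = {
  head: { vars: ['province', 'note'] },
  results: { bindings: [
    { province: { type: 'literal', 'xml:lang': 'zh-latn-pinyin', value: 'Hebei' } },
    { province: { type: 'literal', 'xml:lang': 'zh-latn-pinyin', value: 'Shanxi' } }
  ] }
};
console.log(flattenBindings(sampleResults));
```

With results in this shape, no .find calls are needed; each row is already an ordinary object keyed by variable name.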

Details of handling the response

Of the various SPARQL queries generated by the script, the most complicated response that comes back from the endpoint is the one that returns the data from clicking on the Search button.  The query requests a number of variables:

SELECT DISTINCT ?siteName ?buildingNameEn ?buildingNameZh ?buildingNameLatn ?lat ?long  WHERE {...}

that are needed to generate the desired web page components.  

If you run the query directly in the Heard Library's sandbox, the results get formatted into tabular form with each of the requested variables as a column header:

This is a bit misleading, because that's really nothing like the way the results come back from the endpoint when the query is made via HTTP.  Rather, the XML that comes back after the HTTP request looks like this:

<?xml version='1.0' encoding='UTF-8'?>
<sparql xmlns=''>
<head>
<variable name='siteName'/>
<variable name='buildingNameEn'/>
<variable name='buildingNameZh'/>
<variable name='buildingNameLatn'/>
<variable name='lat'/>
<variable name='long'/>
</head>
<results>
<result>
<binding name='buildingNameLatn'>
<literal xml:lang='zh-latn-pinyin'>Guanyinge</literal>
</binding>
<binding name='buildingNameZh'>
<literal xml:lang='zh-hans'>山門</literal>
</binding>
<binding name='long'>
</binding>
<binding name='lat'>
</binding>
<binding name='siteName'>
<literal xml:lang='zh-latn-pinyin'>Dulesi</literal>
</binding>
</result>
<result>
<binding name='buildingNameLatn'>
<literal xml:lang='zh-latn-pinyin'>Shanmen</literal>
</binding>
<binding name='buildingNameZh'>
<literal xml:lang='zh-hans'>觀音閣</literal>
</binding>
<binding name='long'>
</binding>
<binding name='lat'>
</binding>
<binding name='siteName'>
<literal xml:lang='zh-latn-pinyin'>Dulesi</literal>
</binding>
</result>
</results>
</sparql>

In order to make use of the results, they have to be pulled out of the appropriate place in the XML, which would be a real pain if it weren't for some helpful jQuery functions.  The parseXml function at line 330 of the script uses a .find method to get the values.  Here's an example for the ?lat variable:

            $(this).find("binding[name='lat']").each(function() {

that assigns the result to the variable latitude.  Once the necessary strings are pulled from the XML, they are joined together into HTML as necessary to create the desired page content.  For example, here's how I made the cool little Google Earth view of the temples:

html='<img src="'+latitude+','+longitude+'&amp;maptype=hybrid&amp;zoom=18&amp;size=300x300&amp;markers=color:green%7C'+latitude+','+longitude+'&amp;sensor=false"/>'

with the latitude and longitude variables inserted into the appropriate places.  The zoom can be controlled to get the desired magnification. zoom=18 was good to show the buildings and zoom=11 was nice to show where the temple site was on a map of an appropriate scale to show the city in which the temple was located.  
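The string concatenation above can also be expressed as a small function, which makes it easier to produce both the zoom=18 and zoom=11 images from the same code.  This is a sketch, not the code in tang-song.js: the function name mapImageHtml is made up, and mapBaseUrl is a placeholder parameter standing for the part of the image address that precedes the coordinates.

```javascript
// Sketch: build the <img> tag for a static map image.
// mapBaseUrl is a hypothetical parameter holding everything that comes
// before the "latitude,longitude" pair in the src attribute.
function mapImageHtml(mapBaseUrl, latitude, longitude, zoom, mapType) {
  var coordinates = latitude + ',' + longitude;
  var params = [
    'maptype=' + mapType,
    'zoom=' + zoom,
    'size=300x300',
    'markers=color:green%7C' + coordinates,
    'sensor=false'
  ];
  // '&amp;' is the HTML-escaped form of '&' inside an attribute value.
  return '<img src="' + mapBaseUrl + coordinates + '&amp;' + params.join('&amp;') + '"/>';
}
```

Calling it once with a zoom of 18 and the hybrid map type, and once with a zoom of 11 and whatever map type is wanted for the political map, would give the two images for a site.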

If you have been reading about the web interface and haven't tried it yet, click here to open it.

Why bother with SPARQL and RDF?

One question that you may be wondering about is why one should bother creating an RDF graph database and then generating web pages by populating them with content retrieved from the database through a SPARQL endpoint.  Why not just acquire JSON from "traditional" web services?  I am not really the right person to be answering this question since I'm not a programmer or web designer.  However, I think that the answer lies in the fact that there is really no limit to the complexity of SPARQL queries that can be sent to the endpoint.  A provider of web services will probably describe their API and tell users what search parameters to use to acquire specific kinds of information.  In contrast, any kind of information can be retrieved from a SPARQL endpoint if the programmer of the client understands the structure of the graph stored in the endpoint.

There isn't really any advantage to using SPARQL for retrieving information like the list of provinces in the earlier example.  One could just provide a search parameter for that in an API.  However, the dynasty search is considerably more complicated, as I will now describe.

"Reasoning" about ranges of dynasties

Tracy's original data had a description of the range of dynasties over which the temple was built or modified.  Here's an example of some of the triples associated with the Cixiang Monastery:

     gn:featureCode gn:S.ANS;
     rdfs:label "慈相寺"@zh-hans;
     rdfs:label "Cixiangsi"@zh-latn-pinyin;
     rdfs:label "Cixiang Monastery"@en;
     dcterms:temporal _:3f1c0a54-58d7-4664-bcf5-abe0ca73bbda;
     a geo:SpatialThing.

     time:intervalStartedBy <>;
     time:intervalFinishedBy <>;
     rdfs:label "北宋至清"@zh-hans;
     rdfs:label "Northern Song to Qing"@en;
     a dcterms:PeriodOfTime, time:ProperInterval.

Two resources are described here: the Cixiang Monastery site itself, and the time interval of its construction.  The time interval is defined by links to URIs in the PeriodO gazetteer for the starting and ending dynasties.  Those URIs dereference, and their associated RDF metadata describe the dynasty periods using the W3C Time Ontology.  So in theory, a client that was programmed to "understand" the Time Ontology could "figure out" whether a particular dynasty selected by the user was within the range spanned by a temple site.  However, I don't have such a client and I want to do the job with a generic SPARQL query.  

The Geological timescale example provided in the Time Ontology specification suggests a useful strategy.  In that example, each of the sequential Periods in the timescale is related to the period before and the period after by the predicate time:intervalMetBy, which has this definition: "If a proper interval T1 is intervalMetBy another proper interval T2, then the beginning of T1 is the end of T2."

Unfortunately, the Chinese dynasties are more complicated than the geological time scale because there were dynasties that occurred at the same time in different geographic areas.  I looked at all of the different dynasty ranges specified in Tracy's data (using a SPARQL SELECT DISTINCT query - are you surprised?), then tried to diagram out the dynasties in a way that would be amenable to describing their relationships using time:intervalMetBy.  To simplify things, if the data said that the site interval started with the generic dynasty "Song", I used the IRI for Northern Song as the starting IRI, and if the interval ended with "Song", I used the IRI for Southern Song as the ending IRI.  

For the later dynasties that are completely sequential, it's straightforward - I can just say something like:

<> #Qing
     time:intervalMetBy <>. #Ming

However, in the middle of the diagram where there were different dynasties in the northern and southern parts of China, it gets more complicated.  Some places that were under the control of the Northern Song dynasty eventually came under the Jin Dynasty, while other areas in the south remained under the (Southern) Song Dynasty.  (My apologies to Chinese historians if I don't have this exactly right - I read up on this on the Internet.)  Also, the starting time of the Liao Dynasty predates the end of the Five Dynasties period.  So this diagram may be somewhat of an oversimplification.  Nevertheless, I went with it.  

Here's how I modeled the relationships:

<> #Qing
     time:intervalMetBy <>. #Ming

<> #Ming
     time:intervalMetBy <>. #Yuan

<> #Yuan
     time:intervalMetBy <>. #Jin

<> #Yuan
     time:intervalMetBy <>. #Southern Song

<> #Southern Song
     time:intervalMetBy <>. #Northern Song

<> #Jin
     time:intervalMetBy <>. #Northern Song

<> #Jin
     time:intervalMetBy <>. #Liao

<> #Northern Song
     time:intervalMetBy <>. #Five Dynasties

<> #Five Dynasties
     time:intervalMetBy <>. #Tang

This graph is available in Turtle serialization as a file from the working group GitHub site.  

This set of relationships now provides a way to describe "before" and "after" relationships using SPARQL property paths.  If I want to query for all dynasties including and after the Jin Dynasty, I can use this triple pattern:

?dynasty time:intervalMetBy* <>.

where the star after time:intervalMetBy is the "zero to many links" SPARQL property path operator.  I can use two of these types of statements to indicate that a dynasty must fall within or between two dynasties.  If a site has a time interval ?interval ranging from ?startDynasty to ?endDynasty, I can determine whether that interval includes a particular dynasty (e.g. the Yuan dynasty) using this query fragment.  

?interval time:intervalStartedBy ?startDynasty.
?interval time:intervalFinishedBy ?endDynasty.

# target dynasty must be earlier than ?endDynasty
?endDynasty time:intervalMetBy* <>. #test with Yuan

#target dynasty must be later than ?startDynasty
<> time:intervalMetBy* ?startDynasty. #test with Yuan

In the program, the URI for the particular dynasty of interest is selected using the dynasty dropdown list.  Then that URI is inserted in the query from the Javascript variable (set by the dropdown) into the place where <> is shown in the example.  A particular ?interval value is bound if both the "earlier" and "later" triple patterns are satisfied for the selected dynasty.
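To make the logic of the two property path patterns concrete, here is a plain JavaScript sketch that follows the same time:intervalMetBy links in memory.  It is only an illustration of what the endpoint is being asked to do: the dynasties are keyed by name rather than by their PeriodO URIs, and the names metBy, reaches, and intervalIncludes are invented for the example.

```javascript
// intervalMetBy links from the post, keyed by dynasty name instead of PeriodO URI.
var metBy = {
  'Qing': ['Ming'],
  'Ming': ['Yuan'],
  'Yuan': ['Jin', 'Southern Song'],
  'Southern Song': ['Northern Song'],
  'Jin': ['Northern Song', 'Liao'],
  'Northern Song': ['Five Dynasties'],
  'Five Dynasties': ['Tang']
};

// Emulates "from time:intervalMetBy* to": zero or more hops along metBy links.
function reaches(from, to) {
  var stack = [from];
  var seen = {};
  while (stack.length > 0) {
    var current = stack.pop();
    if (current === to) return true;
    if (seen[current]) continue;
    seen[current] = true;
    (metBy[current] || []).forEach(function (earlier) { stack.push(earlier); });
  }
  return false;
}

// True when target lies on a chain of links between endDynasty and startDynasty.
function intervalIncludes(startDynasty, endDynasty, target) {
  return reaches(endDynasty, target) && reaches(target, startDynasty);
}

console.log(intervalIncludes('Northern Song', 'Qing', 'Yuan')); // true
console.log(intervalIncludes('Northern Song', 'Qing', 'Tang')); // false
```

For the Cixiang Monastery interval (Northern Song to Qing), Yuan passes both tests, while Tang fails the second one because no chain of links leads from Tang back to Northern Song.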

You can see an example on line 286 of the tang-song.js file.  That's the section of the code that constructs the query to be sent when the Search button is clicked.  The same pattern is used at line 115 in the function that screens the site options.

The main point

This example illustrates the generic nature by which data can be acquired from the server via a SPARQL query.  If I want my client application to screen sites by dynasty, I don't have to ask the server administrator to create server-side code that would enable me to get that information via an API.  I decide about the kind of information that I'm going to get by the way that I design the SPARQL query that I send to the endpoint.  Essentially, every user can design a personalized API that does exactly what they want, rather than relying on the server administrator to create an API that will satisfy all of the kinds of things that the users might want.

Because the nature of the data that is retrieved from the endpoint is not fixed, the actual code that constructs the query can be modified on the fly by the user based on what happens during the interaction with the endpoint.  I've thought about programming a generic data exploration page where the properties screened by the dropdowns were not fixed.  The user could ask the endpoint what properties were used with a certain class of resources, then select which properties to use for each of several dropdowns.  The web application would then ask the endpoint to return the possible values for each property chosen by the user and then set the values in those dropdowns to those values that were retrieved.  The application would look similar to the Temples search page, but instead of the dropdowns being fixed as "Dynasty", "Province", etc., they could be any kind of information available in the triplestore about a class of resources.  

Sometime when I have more time, I'm going to try programming that!
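A possible starting point for such a page, sketched under assumptions: two query builders, one that asks which properties are used with a class and one that asks for the distinct values of a chosen property.  The function names are hypothetical, and the class and property IRIs would come from the user's choices rather than being fixed in the script.

```javascript
// Sketch of the two query builders a generic exploration page might need.
// classIri and propertyIri are supplied by the caller; nothing here is from tang-song.js.
function propertiesOfClassQuery(classIri) {
  return 'SELECT DISTINCT ?property WHERE { '
    + '?resource a <' + classIri + '>. '
    + '?resource ?property ?value. '
    + '} ORDER BY ASC(?property)';
}

function valuesOfPropertyQuery(classIri, propertyIri) {
  return 'SELECT DISTINCT ?value WHERE { '
    + '?resource a <' + classIri + '>. '
    + '?resource <' + propertyIri + '> ?value. '
    + '} ORDER BY ASC(?value)';
}
```

The first query would populate a list of candidate properties, and the second would fill a dropdown with values once a property had been picked.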

[1] It's hacked from an earlier page that we created as part of Sean King's 2015 Dean's Fellow project, for which Suellen Stringer-Hye and I served as co-mentors.  Jodie Gambill made major contributions towards styling the HTML and making the page work better.  

Wednesday, November 9, 2016

Fixing the octopus

In a recent post, I described how I used a rather hacked-together bit of software called Guid-O-Matic to convert a diverse set of Darwin Core Archives into RDF graphs serialized with a choice of formats.  In that post, I confessed that the resulting RDF had several deficiencies, which I claimed could probably be fixed pretty easily using SPARQL CONSTRUCT queries.  In this post, I'll describe how I made those fixes, talk about a problem with using blank nodes, and discuss possible reasons for bothering to create a more complex graph structure (e.g. using RDF) rather than just settling for the simple structure of Darwin Core Archives.

Image by Paul Shaffner CC-BY via Wikimedia Commons

Fixing the "made-up" predicates and the problem with blank nodes

As I discussed in the earlier post, by its nature a Darwin Core Archive is limited to linking fielded text files in a very simple structure that has been called a "star schema".  The limitations of this format allow only for a very simple structure for the RDF graph generated directly from data in the archive - I've dubbed such a graph model a "starfish schema".  

The diagram above shows an example of a "starfish" schema.  One to many resources of a particular type can be linked to the central resource via an object property, and there can be more than one type of linked resource.  In the example above, there are multiple images linked to the central resource by the property dsw:derivedFrom and multiple determinations linked to the central resource by the property dsw:identifies.  Why did I choose to make the arrows point from the periphery to the center rather than the other way around?  It is a matter of convenience - I can just have a column in the image table that is a foreign key pointing to the organism primary key, then generate a triple linking the subject image to the object organism when I write out the graph for the image.  Here's what it would look like:

   ... other stuff ...
   a dcmitype:StillImage;
   dsw:derivedFrom <>.

   ... other stuff ...
   a dcmitype:StillImage;
   dsw:derivedFrom <>.


Darwin-SW defines a lot of inverse properties, and there is an inverse for dsw:derivedFrom that is called dsw:hasDerivative.  So the following graph:

  a dwc:Organism;
  dsw:hasDerivative <>,

would entail the same links as in the previous example if a client reasoned the inverse relationships.  However, I am loath to assume that people using generic triplestores and SPARQL endpoints will be doing that kind of reasoning.  That's why in Darwin-SW we picked one of the two properties in each inverse pair to be "preferred" (shown in blue in the following diagram):

When making the choice of which inverse to prefer, we chose the one that was most likely to point in a many-to-one direction, so that the value (i.e. object) of the property could be stored as a single foreign key in a column of a table of metadata about the subject.  If we went the other way (the gray arrows), you would have the problem of storing an indefinite number of repeated values in a row in a table about the subject resource.  

I confess that Guid-O-Matic is a hack - I wrote it in the easiest possible way.  So it assumes that when it's translating an extension table of a Darwin Core Archive into an RDF graph that the property in the triple used to connect the row of the extension table to a row in the core table is a property that is appropriate when the extension table row is the subject and the core table row is the object.  It can't handle making the link in the other direction.  This problem could be fixed by adding some code that allowed the user to designate that the connecting property operates in the inverse direction from what is expected.  Then the program would be able to generate records like this:

   ... other stuff ...
   a dcmitype:StillImage.
  dsw:hasDerivative <>.

   ... other stuff ...
   a dcmitype:StillImage.
  dsw:hasDerivative <>.

Being a lazy person, I have not bothered to write said code.  Instead, I just made up fake inverse properties and used them.  So for the Catalogue of Afrotropical Bees, where the octopus diagram looked like this:

and where tc:circumscribedBy and dwc:taxonRemarks pointed in the wrong direction (in the one-to-many direction), I made up the fake properties tc:INVcircumscribedBy and dwc:INVtaxonRemarks and just used them to make the arrows point in the opposite direction.

I was feeling guilty about taking this shortcut, but resolved to fix the problem later using a SPARQL CONSTRUCT query.  That query would be super-simple and look like this:

CONSTRUCT {?taxon tc:circumscribedBy ?specimen.} WHERE 
          {?specimen tc:INVcircumscribedBy ?taxon.}

All I had to do was run the query and then dump the triples back into the triplestore.  Problem solved!
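For readers who think more easily in code than in graph patterns, the effect of that CONSTRUCT query can be mimicked on an in-memory list of triples.  This is only an analogy, with an invented function name and made-up identifiers; the real work was done by the triplestore.

```javascript
// In-memory analogue of the CONSTRUCT query: for every triple that uses the
// made-up inverse predicate, emit a triple with subject and object swapped.
function invertTriples(triples, inversePredicate, realPredicate) {
  return triples
    .filter(function (t) { return t.predicate === inversePredicate; })
    .map(function (t) {
      return { subject: t.object, predicate: realPredicate, object: t.subject };
    });
}

// Identifiers below are invented for the example.
var exampleTriples = [
  { subject: 'ex:specimen1', predicate: 'tc:INVcircumscribedBy', object: 'ex:taxon1' },
  { subject: 'ex:taxon1', predicate: 'rdfs:label', object: '"an example label"' }
];
console.log(invertTriples(exampleTriples, 'tc:INVcircumscribedBy', 'tc:circumscribedBy'));
```

Only triples that match the WHERE pattern contribute to the output, so the label triple is ignored.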

Unfortunately, when I ran the CONSTRUCT query, here was what I got:

     tc:circumscribedBy _:genid322ffff2592d8cf32d4a5d2d87712dd8ad99c22f8b .
     tc:circumscribedBy _:genid3338abf8272dc2102d45272d92172d1ccf2265dd1d .
     tc:circumscribedBy _:genid3117ade37c2dbc132d4f762d83f02d89fc8a5ac58c .

I forgot that I had let the preserved specimens (i.e. type specimens) be blank nodes.  That's fine, as long as I'm only talking about them in the same document where I define their properties along with the properties of the taxa, but now I need to refer to them in a second document and I can't because the blank node identifier doesn't have any meaning outside the document in which it is used.  So although I'm not really interested in minting identifiers for resources whose records I don't maintain, I need to do it anyway.  

Here was the easy fix that I chose.  I created a new column in the type specimen table called "frag" and filled it with consecutive numbers from 1 to N so that there would be a locally unique identifier for the type specimen.  Then I changed the links table for Guid-O-Matic to this:

Instead of indicating that the linked class was a blank node (by putting "_:" in the suffix1 column), I indicated the value in the "frag" column should be appended to the root URI of the core resource (the taxon) as a fragment identifier, followed by "type".  That resulted in generating triples like this:

     tc:INVcircumscribedBy <>.

Now when I run the CONSTRUCT query from before, I get output like this:

     tc:circumscribedBy <>.

which was what I wanted.  I was able to load the file containing those triples into the triplestore and have the links in the direction that was actually shown on the octopus diagram.  

When I fixed the problem with the fake dwc:INVtaxonRemarks link, I decided not to actually use dwc:taxonRemarks, which the Darwin Core RDF Guide indicates should be used with literal values.  I decided instead to make the link using dcterms:description, which has no stated range of rdfs:Literal (as do some other Dublin Core terms).  

dsw:derivedFrom relationships

When I described the Global Genome Biodiversity Network archive (octopus diagram above), I noted that I would prefer to describe the sequential relationship

fish specimen --> tissue sample ----> extracted DNA --> amplified DNA ----> nucleotide sequence
          sample prep        DNA extraction           PCR            sequencing

 among the resources derived from the organism/preserved specimen like this:

(using the GGBN class names given in the dataset).  The dsw:derivedFrom object property is transitive, so a NucleotideSequence is derivedFrom an Amplification, but it also has an entailed derivedFrom relationship with a MaterialSample, a Preparation, and the original PreservedSpecimen as well.  

The problem is that due to the starfish schema limitations of Darwin Core archives, the graph has to (at least initially) be this:

where all of the resources in the extension files are connected directly to the core resource.  (The link to the NucleotideSequence is to a resource external to the dataset.)  In my original post, I suggested that SPARQL CONSTRUCT queries could be used to change the second graph into the first.  So I decided to see if that was true.  This situation has the same problem as I described earlier: I can't use blank nodes to represent the sequentially-derived resources and then try to talk about them later in a second document.  So I recreated the graph using generated hash URIs for them in the same manner that I described in the first section of this post.  

One issue that I needed to figure out was whether there were 1:1 relationships between the PreservedSpecimens and the resources downstream in the derivation chain.  If there were, then it would be easy to generate the sequential derivedFrom links.  First I just counted the number of each kind of resource using a query like this:

SELECT DISTINCT (COUNT(?resource) AS ?count) WHERE {
?resource a ggbn:Preparation.
}

I determined that there were 40048 Preparations, 40048 MaterialSamples, and 6923 Amplifications.  This led me to believe that there was a 1:1 relationship between MaterialSample and Preparation.  I needed to make sure that there were no organisms that had more than one Preparation, and no Preparations that had more than one MaterialSample.  I used a query like this to make sure:

SELECT DISTINCT  ?organism ?prep1 ?prep2 WHERE {
?organism a dwc:Organism.
?prep1 dsw:derivedFrom ?organism.
?prep1 a ggbn:Preparation.
?prep2 dsw:derivedFrom ?organism.
?prep2 a ggbn:Preparation.
FILTER (?prep1 != ?prep2)
}
LIMIT 10

Basically, it checks whether there are any cases where two different Preparations are derived from the same organism.  By running this and similar queries, I was able to discover that no organism/specimen had more than one Preparation and that no Preparation had more than one MaterialSample.  There were cases where there was more than one Amplification per MaterialSample.  That was fine because that was the last link and there was no ambiguity about which MaterialSample each Amplification should link to.  
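
The same uniqueness check can be sketched outside the triplestore.  The snippet below is only a minimal illustration in plain Python, assuming a hypothetical toy list of (subject, predicate, object) tuples with invented identifiers.  It is not the actual dataset and not how a SPARQL engine evaluates the query.  It groups Preparations by the organism they are derived from and reports any organism that has more than one, which is the condition the FILTER query looks for.

```python
from collections import defaultdict

# Hypothetical toy triples; the identifiers are invented for illustration.
triples = [
    ("org1", "rdf:type", "dwc:Organism"),
    ("org2", "rdf:type", "dwc:Organism"),
    ("prepA", "rdf:type", "ggbn:Preparation"),
    ("prepB", "rdf:type", "ggbn:Preparation"),
    ("prepC", "rdf:type", "ggbn:Preparation"),
    ("prepA", "dsw:derivedFrom", "org1"),
    ("prepB", "dsw:derivedFrom", "org2"),
    ("prepC", "dsw:derivedFrom", "org2"),
]


def organisms_with_multiple(triples, derived_class):
    """Return {organism: sorted resources} for every organism that has more
    than one resource of derived_class directly derived from it."""
    typed = {s for s, p, o in triples if p == "rdf:type" and o == derived_class}
    organisms = {s for s, p, o in triples
                 if p == "rdf:type" and o == "dwc:Organism"}
    by_organism = defaultdict(set)
    for s, p, o in triples:
        if p == "dsw:derivedFrom" and s in typed and o in organisms:
            by_organism[o].add(s)
    return {org: sorted(res) for org, res in by_organism.items() if len(res) > 1}


violations = organisms_with_multiple(triples, "ggbn:Preparation")
print(violations)
```

An empty dictionary would correspond to the SPARQL query returning no rows, meaning that there is at most one Preparation per organism.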

Here's the query I used to construct the MaterialSample link to the Preparation one hop above it (the Preparation was already linked to the Organism):

CONSTRUCT {?ms dsw:derivedFrom ?prep.} WHERE {
  ?prep a ggbn:Preparation.
  ?prep dsw:derivedFrom ?org.
  ?ms a ggbn:MaterialSample.
  ?ms dsw:derivedFrom ?org.
}

Here's the query I used to link the Amplification to the MaterialSample one hop above it:

CONSTRUCT {?amp dsw:derivedFrom ?ms.} WHERE {
  ?amp a ggbn:Amplification.
  ?amp dsw:derivedFrom ?org.
  ?ms a ggbn:MaterialSample.
  ?ms dsw:derivedFrom ?org.
}
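
To make the effect of these two CONSTRUCT queries concrete, here is a rough sketch in plain Python.  It assumes a hypothetical toy graph in the initial star shape, with invented identifiers, and it emulates the join in the WHERE clause rather than running any SPARQL.  The function name construct_links is made up for this illustration.

```python
# Hypothetical toy triples in the initial star shape: every extension
# resource points directly at the organism. Identifiers are invented.
star = {
    ("prep1", "rdf:type", "ggbn:Preparation"),
    ("ms1", "rdf:type", "ggbn:MaterialSample"),
    ("amp1", "rdf:type", "ggbn:Amplification"),
    ("amp2", "rdf:type", "ggbn:Amplification"),
    ("prep1", "dsw:derivedFrom", "org1"),
    ("ms1", "dsw:derivedFrom", "org1"),
    ("amp1", "dsw:derivedFrom", "org1"),
    ("amp2", "dsw:derivedFrom", "org1"),
}


def construct_links(triples, child_class, parent_class):
    """Emulate the CONSTRUCT pattern: link each child_class resource to each
    parent_class resource that is derived from the same organism."""
    def derived(cls):
        # (resource, organism) pairs for resources typed as cls
        typed = {s for s, p, o in triples if p == "rdf:type" and o == cls}
        return {(s, o) for s, p, o in triples
                if p == "dsw:derivedFrom" and s in typed}

    return {
        (child, "dsw:derivedFrom", parent)
        for child, child_org in derived(child_class)
        for parent, parent_org in derived(parent_class)
        if child_org == parent_org
    }


new_links = (
    construct_links(star, "ggbn:MaterialSample", "ggbn:Preparation")
    | construct_links(star, "ggbn:Amplification", "ggbn:MaterialSample")
)
for triple in sorted(new_links):
    print(triple)
```

If an organism had more than one Preparation, the same join would link each MaterialSample to every one of them, which is why the 1:1 check had to come first.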

Originally, I'd made the link from the Amplification to the NucleotideSequence in the opposite direction using dsw:hasDerivative.  I'd rather make the link in the preferred direction using dsw:derivedFrom so that I could use SPARQL property paths to traverse the chain of dsw:derivedFrom links at will.  Constructing that inverse relationship was easy, and at the same time I generated a type declaration for the NucleotideSequence instance.  As far as I know, NCBI doesn't provide any RDF metadata when one tries to dereference the sequence URIs, so I was on my own as far as generating metadata about sequences was concerned.  I don't know of a consensus class for sequences, so for convenience, I made one up for the time being: (ex:NucleotideSequence).   Here's the query I used:

prefix ex: <>
prefix dsw: <>
prefix ggbn: <>
CONSTRUCT {
?nuc dsw:derivedFrom ?amp.
?nuc a ex:NucleotideSequence.
} WHERE {
  ?amp a ggbn:Amplification.
  ?amp dsw:hasDerivative ?nuc.
}

I actually ran the CONSTRUCT queries on a local (localhost) Stardog installation rather than on the Vanderbilt Heard Library SPARQL endpoint (currently Callimachus-based) because Stardog allows the output to be saved in a file in any serialization (I saved it as Turtle).  I loaded all of the triples that I constructed in this exercise back into the Heard Library endpoint (and also replaced the graphs that were using blank nodes with graphs using the hash URIs).  Then I could test the new graph.  Note: the complete updated set of graphs is available at in the file.

Using the transitive dsw:derivedFrom links to find stuff

The point of making dsw:derivedFrom transitive was to allow a user to find things that were any number of dsw:derivedFrom links apart in a chain of derivation.  One way to make use of the transitive properties would be to use a SPARQL endpoint with transitive reasoning enabled.  However, another method that can be used with any endpoint that supports SPARQL 1.1 is Property Paths.  Here's an example of a query that allowed me to discover something that I couldn't easily know by just examining the large data files visually:

prefix ex: <>
prefix dsw: <>
prefix dwc: <>

SELECT DISTINCT   ?organism  ?nuc1 ?nuc2 WHERE {
?organism a dwc:Organism.
?nuc1 a ex:NucleotideSequence.
?nuc1 dsw:derivedFrom+ ?organism.
?nuc2 a ex:NucleotideSequence.
?nuc2 dsw:derivedFrom+ ?organism.
FILTER (?nuc1 != ?nuc2)
}
LIMIT 10

In this query, I wanted to find out if there were any specimens/organisms that had more than one nucleotide sequence as an end product.  I already knew that there were multiple Amplifications per organism, and I knew that there were a total of 4269 sequences (vs. 6923 Amplifications), but I did not know the relationship between organisms and sequences.  The query uses triple patterns containing dsw:derivedFrom+.  The extra "+" at the end of the predicate is a property path operator indicating that the subject and object may be one or more dsw:derivedFrom hops apart.  In this case, there is a known and consistent sequence of derived resources, so it wouldn't be necessary to use property paths, but you could imagine situations that included subsampling where the number of links might vary, or where there were other derived resources like ProteinSequences that one wanted to discover.  In those cases, just being able to link generically to any derived resource would be good.  
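
For readers who have not used property paths, the "one or more hops" behavior can be sketched as a simple graph walk.  The following plain Python is only an illustration, assuming a hypothetical toy chain with invented identifiers.  The function name derived_from_plus is made up, and a real SPARQL engine may evaluate the path quite differently.

```python
def derived_from_plus(triples, start):
    """Return every resource reachable from start by following one or more
    dsw:derivedFrom links (what dsw:derivedFrom+ would match)."""
    parents = {}
    for s, p, o in triples:
        if p == "dsw:derivedFrom":
            parents.setdefault(s, set()).add(o)
    reached = set()
    frontier = list(parents.get(start, ()))
    while frontier:
        node = frontier.pop()
        if node not in reached:
            reached.add(node)
            frontier.extend(parents.get(node, ()))
    return reached


# Hypothetical toy chain: two sequences that trace back to one organism.
chain = [
    ("nuc1", "dsw:derivedFrom", "amp1"),
    ("nuc2", "dsw:derivedFrom", "amp2"),
    ("amp1", "dsw:derivedFrom", "ms1"),
    ("amp2", "dsw:derivedFrom", "ms1"),
    ("ms1", "dsw:derivedFrom", "prep1"),
    ("prep1", "dsw:derivedFrom", "org1"),
]

ancestors_of_nuc1 = sorted(derived_from_plus(chain, "nuc1"))
print(ancestors_of_nuc1)

# Sequences that share org1 as an ultimate source, however many hops away
shared = [n for n in ("nuc1", "nuc2") if "org1" in derived_from_plus(chain, n)]
print(shared)
```

The walk does not care how many intermediate resources there are, which is the property that would matter if subsampling made the chains vary in length.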

The results showed that there were some specimens like that did have multiple NucleotideSequences.  Here's a query that you could do to see what all of the derived resources were, and what classes they were instances of:

SELECT DISTINCT   ?derivative ?class WHERE {
?derivative dsw:derivedFrom+ <>.
?derivative a ?class.
}
LIMIT 100

The results show that there is one Preparation, one MaterialSample, 23 Amplifications, and 23 NucleotideSequences that are derived from the specimen.

What's this good for?

Since the previous post, I had one off-list email conversation where the question came up of whether there were use cases that the graph-based approach (e.g. RDF) can satisfy that might be difficult to satisfy using more conventional methods.  This is an important question, given that GBIF already has a lot of web services that can allow users to do amazing things.  I don't have a really good answer to this question.  I think that before embarking on a program of creating and maintaining a giant triplestore of RDF that duplicates what GBIF already has, we need to lay out the additional use cases that such a triplestore could satisfy beyond what can already be done conventionally. 

I think that the most interesting cases would be those where two providers submit information about related resources, each without the knowledge of the other.  Here's an example.  Let's say that someone collects a bird specimen like It is sampled and eventually results in the sequence  We have an interest in that sequence for some reason - maybe we are doing a molecular phylogeny.  Unbeknownst to us, some expert has looked at that bird specimen and decided that it was not actually Locustella certhiola ssp. certhiola - perhaps the expert has applied another determination to that specimen and disagreed with the previous determination about the specific epithet.  That might be important to our project, but how would we know that it happened? 

SELECT DISTINCT   ?date ?determiner WHERE {
<> dsw:derivedFrom+ ?organism.
?determination dsw:identifies ?organism.
?determination dwc:dateIdentified ?date.
?determination dwc:identifiedBy ?determiner.
}

This query would give us a list of all determination dates and determiners (since the Darwin-SW model allows for many determinations for one organism).  Unfortunately, this query won't actually work, since the dataset does not provide the identification dates and the names of the determiners.  But it would work if the data were there.  

One could imagine even broader queries like this:

SELECT DISTINCT   ?sequence ?date WHERE {
?specimen a dwc:PreservedSpecimen.
?sequence a ex:NucleotideSequence.
?specimen dwc:institutionCode "USNM".
?specimen dsw:derivedFrom* ?organism.
?sequence dsw:derivedFrom+ ?organism.
?sequence dcterms:created ?date.
FILTER (?date > "2016-11-09"^^xsd:date)
}

This query would ask whether there were any new sequences that were created after 9 November 2016 that were derived from any specimens that were in the US National Museum collection.  Note that I used a "*" after dsw:derivedFrom instead of a "+" to allow for zero to many links - a necessity if the specimen was considered to be the organism rather than be derived from the organism.  This query won't actually work, since there are no RDF metadata about sequences provided by NCBI that we can put into the triplestore.  But if we could get those data, USNM could know that their collection was being used to generate new sequence data even if the sequencing were done on a downstream sample that was in the possession of another institution.  
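
The difference between "*" and "+" on the specimen side can be shown with a small sketch.  This is plain Python over hypothetical toy data (all identifiers and dates are invented), not a run of the query above.  One specimen is modeled as the organism itself and another is derived from a separate organism.

```python
from datetime import date

# Hypothetical toy data. "spec1" is modeled as the organism itself, while
# "spec2" is a specimen derived from a separate organism "org2".
derived_from = {
    "spec2": {"org2"},
    "seq1": {"spec1"},
    "seq2": {"org2"},
}
created = {"seq1": date(2016, 11, 20), "seq2": date(2016, 10, 1)}
usnm_specimens = ["spec1", "spec2"]


def reachable(start, include_self):
    """Follow derived_from links from start. include_self=True mimics the
    "*" path (zero or more hops); False mimics "+" (one or more hops)."""
    found = {start} if include_self else set()
    frontier = list(derived_from.get(start, ()))
    while frontier:
        node = frontier.pop()
        if node not in found:
            found.add(node)
            frontier.extend(derived_from.get(node, ()))
    return found


def new_sequences(cutoff, specimen_self):
    """Sequences created after cutoff that share a source with a specimen."""
    hits = set()
    for specimen in usnm_specimens:
        organisms = reachable(specimen, include_self=specimen_self)
        for seq, when in created.items():
            if when > cutoff and organisms & reachable(seq, include_self=False):
                hits.add(seq)
    return sorted(hits)


with_star = new_sequences(date(2016, 11, 9), specimen_self=True)
with_plus = new_sequences(date(2016, 11, 9), specimen_self=False)
print(with_star, with_plus)
```

In this toy case only the zero-or-more version finds seq1, because spec1 has no outgoing dsw:derivedFrom link of its own to follow.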

There is another potential use case involving inferring duplicates that is described in the Darwin-SW paper at (open access at  It would be very interesting to accumulate more such use cases.