Thursday, September 17, 2015

Why are handwashing studies and their reporting so broken?

For around ten years, I've had my introductory biology students perform experiments to attempt to determine the effectiveness of soaps and hand cleansing agents.  This is really a great exercise to get students thinking about the importance of good experimental design, because it is very difficult to do an experiment that is good enough to show differences caused by their experimental treatments.  The bacterial counts they measure are very variable and it's difficult to control the conditions of the experiment.  Since there is no predetermined outcome, the students have to grapple with drawing appropriate conclusions from the statistical tests they conduct - they don't know what the "right" answer is for their experiment.

We are just about to start in on the project in my class again this year, so I was excited to discover that a new paper had just come out that purports to show that triclosan, the most common antibacterial agent in soap, has no effect under conditions similar to normal hand washing:

Kim, S.A., H. Moon, and M.S. Ree. 2015. Bactericidal effects of triclosan in soap both in vitro and in vivo. Journal of Antimicrobial Chemotherapy. (the DOI doesn't currently dereference, but the paper is at

The authors exposed 20 recommended bacterial strains to soap with and without triclosan at two different temperatures.  They also exposed bacteria to regular and antibacterial soap for varying lengths of time.  In a second experiment, the authors artificially contaminated the hands of volunteers who then washed with one of two kinds of soap.  The bacteria remaining on the hands were then sampled.

The authors stated that there was no difference in the effect of soap with and without triclosan.  They concluded that this was because the bacteria were not in contact with the triclosan long enough for it to have an effect.  Based on what I've read and on the various experiments my students have run over the years, I think this conclusion is correct.  So what's my problem with the paper?

Why do we like to show that things are different and not that they are the same?

My beginner students often wonder why experimental scientists are so intent on showing that things are significantly different.  Why not show that they are the same?  Sometimes that's what we actually want to know anyway.

When analyzing the results of an experiment statistically, we evaluate the results by calculating "P".  P is the probability that we would get results that are this different by chance, if the things we were comparing are actually the same.  If P is high, then differences this large would turn up often from random variation alone.  If P is low, it's unlikely that the differences are due to chance variation, and more likely that they are caused by a real effect of the thing we are measuring.  The typical cutoff for statistical significance is P<0.05.  If P<0.05, then we say that we have shown that the results are significantly different.

The problem lies in our conclusion when P>0.05 .  A common (and wrong) conclusion is that when P>0.05 we have shown that the results are not different (i.e. the same).  Actually, what has happened is that we have failed to show that the results are different.  Isn't that the same thing?

Absolutely not.  In simple terms, I put the answer this way: if P<0.05, that is probably because the things we are measuring are different.  If P>0.05, that is either because the things we are measuring are the same OR it's because our experiment stinks!  When differences are small, it may be very difficult to perform a good experiment and show that P<0.05.  On the other hand, any bumbling idiot can do an experiment that produces P>0.05 by any number of poor practices: not enough samples, poorly controlled experimental conditions, or doing the wrong kind of statistical test.

So there is a special burden placed on a scientist who wants to show that two things are the same.  It is not good enough to run a statistical test and get P>0.05.  The scientist must also show that the experiment and analysis were capable of detecting differences of a certain size if they existed.  This is called a "power analysis".  A power analysis shows that the test has enough statistical power to uncover differences when they are actually there.  Before claiming that there is no effect of the treatment (no significant difference), the scientist has to show that his or her experiment doesn't stink.
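To make the idea of power concrete, here is a minimal sketch of a power calculation for comparing two group means.  It assumes a simple two-sample design, a two-sided test at the 0.05 level, and the normal approximation; the function name and the effect size of one standard deviation are illustrative choices, not anything taken from the papers discussed here.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample comparison of means.

    effect_sd is the true difference between the means, expressed in units
    of the within-group standard deviation (a value you have to assume).
    The normal approximation is used, so the answer is somewhat optimistic
    when n_per_group is very small.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected size of the test statistic when the difference is real.
    shift = effect_sd / sqrt(2 / n_per_group)
    # Probability that the statistic lands beyond either critical value.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Power to detect a one-standard-deviation difference at several sample sizes.
for n in (3, 16, 100):
    print(n, round(approx_power(1.0, n), 2))
```

Under these assumptions an experiment with three replicates per group would miss a real one-standard-deviation difference most of the time, which is why a P>0.05 result from such an experiment says very little by itself.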

So what's wrong with the Kim et al. 2015 paper???

The problem with the paper is that it doesn't actually provide evidence that supports its conclusions.

If we look at the Kim et al. paper, we can find the problem buried on the third page.  Normally in a study, one reports "N", the sample size, a.k.a. the number of times you repeated the experiment.  Repeating the experiment is the only way you can find out whether the differences you see are due to real differences or to bad luck in sampling.  In the Kim et al. paper, with regard to the in vitro part of the study, all that is said is "All treatments were performed in triplicate."  Are you joking?????!!!  Three replicates is a terrible sample size for this kind of experiment where results tend to be very variable.  I guess N=2 would have been worse, but this is pretty bad.

My next gripe with the paper is in the graphs.  It is a fairly typical practice in reporting results to show a bar graph where the height represents the mean value of the experimental treatment and the error bars show some kind of measure of how well that mean value is known.  The amount of overlap (if any) provides a visual way of assessing how different the means are.

Typically, 95% confidence intervals or standard errors of the mean are used to set the size of the error bars.  But Kim et al. used standard deviation.  Standard deviation measures the variability of the data, but it does NOT provide an assessment of how well the mean value is known.  Both 95% confidence intervals and standard error of the mean are influenced by the sample size as well as the variability of the data.  They take into consideration all of the factors that affect how well we know our mean value.  So the error bars on these graphs based on standard deviation really don't provide any useful information about how different the mean values are.*

The in vivo experiment was better.  It had 16 volunteers, so the sample size is better than 3.  But there are other problems.

First of all, it appears that all 16 volunteers washed their hands using all three treatments.  There is nothing wrong with that, but apparently the data were analyzed using a one-factor ANOVA.  In this case, the statistical test would have been much more powerful if it had been blocked by participant, since there may be variability that was caused by the participants themselves and not by the applied treatment.  
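A small sketch shows why blocking by participant matters.  The numbers below are made up for illustration (they are not data from the paper), and the example uses only two treatments, where blocking reduces to a paired t statistic; with three treatments the analogous analysis would be a randomized block or repeated measures ANOVA.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Hypothetical log10 bacterial counts for six volunteers (not real data).
# Volunteers differ a lot from each other, but each one drops by roughly
# 0.5 with soap B relative to soap A.
soap_a = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
drop = [0.4, 0.6, 0.5, 0.3, 0.7, 0.5]
soap_b = [a - d for a, d in zip(soap_a, drop)]


def unpaired_t(x, y):
    """Welch t statistic: ignores which volunteer produced which value."""
    se = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se


def paired_t(x, y):
    """Paired t statistic: each volunteer serves as his or her own block."""
    diffs = [a - b for a, b in zip(x, y)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


print("unpaired t:", round(unpaired_t(soap_a, soap_b), 2))
print("paired t:  ", round(paired_t(soap_a, soap_b), 2))
```

In this contrived case the same measurements give a test statistic many times larger once the volunteer-to-volunteer variability is removed, because the treatment effect is no longer buried in the differences among people.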

Secondly, the researchers applied an a posteriori Tukey's multiple range test to determine which pairwise comparisons were significantly different.  Tukey's test is appropriate in cases where there is no a priori rationale for comparing particular pairs of treatments.  However, in this case, it is perfectly clear which pair of treatments the researchers are interested in: the comparison of regular and antibacterial soap!  Just look at the title of the paper!  The comparison of the soap treatments with the baseline is irrelevant to the hypothesis that is being tested, so its presence does not create a requirement for a test for unplanned comparisons. Tukey's test adjusts the experiment-wise error rate to account for multiple unplanned comparisons, effectively raising the bar and making it harder to show that P<0.05; in this case jury-rigging the test to make it more likely that the effects will NOT be different.  
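The arithmetic behind "raising the bar" can be sketched as follows.  This uses a Bonferroni-style correction as a simpler stand-in for Tukey's test (which is based on the studentized range and is not computed here), and it assumes the comparisons are independent.

```python
def familywise_error(alpha, comparisons):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** comparisons


alpha = 0.05
pairs = 3  # three treatments give three possible pairwise comparisons

# Error rate for the whole family if every pair is tested at alpha.
print(round(familywise_error(alpha, pairs), 3))
# Stricter per-comparison threshold that keeps the family near alpha.
print(round(alpha / pairs, 4))
```

A single planned comparison of regular versus antibacterial soap could be tested at the ordinary 0.05 level, while a correction for all three pairs demands a noticeably smaller P from that same comparison.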

Both the failure to block by participant and the use of an inappropriate a posteriori test make the statistical analysis weaker, not stronger, and a stronger test is what you need if you want to show that the reason you failed to show differences is that they weren't there.  

The graph is also misleading for the reasons I mentioned about the first graph.  The error bars here apparently bracket the range within which the middle 80% of the data fall.  Again, this is a measure of the dispersion of the data, not a measure of how well the mean values are known.  We can draw no conclusions from the degree of overlap of the error bars, because the error bars represent the wrong thing.  They should have been 95% confidence intervals if the authors wanted to have meaning in the amount of overlap.  

Is N=16 an adequate sample size?  We have no idea, because no power test was reported.  This kind of sloppy experimental design and analysis seems to be par for the course in experiments involving hand cleansing.  I usually suggest that my students read the scathing rebuke by Paulson (2005) [1] of the Sickbert-Bennett et al. (2005) [2] paper that bears some similarities to the Kim et al. paper.  Sickbert-Bennett et al. claimed that it made little difference what kind of hand cleansing agent one used or if one used any agent at all.  However, Paulson pointed out that the sample size used by Sickbert-Bennett (N=5) would have needed to have been as much as 20 times larger (i.e. N=100) to have made their results conclusive.  Their experiment was way too weak to draw the conclusion that the factors had the same effect.  This is probably also true for Kim et al., although to know for sure, somebody needs to run a power test on their data.

What is wrong here???

There are so many things wrong here, I hardly know where to start.

1. Scientists who plan to engage in experimental science need to have a basic understanding of experimental design and statistical analysis.  Something is really wrong with our training of future scientists if we don't teach them to avoid basic mistakes like this.

2. Something is seriously wrong with the scientific review process if papers like this get published with really fundamental problems in their analyses and in the conclusions that are drawn from those analyses.  The fact that this paper got published means not just that four co-authors don't know basic experimental design and analysis, but two or more peer reviewers and an editor can't recognize problems with experimental design and analysis.

3. Something is seriously wrong with science reporting.  This paper has been picked up and reported online by WebMD and a number of other news outlets.  Did any of these news outlets read the paper?  Did any of them consult with somebody who knows how to assess the quality of research and get a second opinion on this paper?  SHAME ON YOU, news media!!!!!

* Error bars that represent standard deviation will always span a larger range than those that represent standard error of the mean, since standard error of the mean is estimated by s/sqroot(N).  The 95% confidence interval is + or - approximately two times the standard error of the mean.  So when N=3, the 95% confidence interval will be approximately +/- 2s/sqroot(3) or +/-1.15s.  So in the case where N=3, the standard deviation error bars span a range that is slightly smaller than the 95% confidence interval error bars would span.  (The factor of two is a large-sample approximation; with only three replicates the exact multiplier from the t distribution is about 4.3, so the true interval is wider still.)  This makes it slightly easier to have error bars that don't overlap than it would be if they represented 95% confidence intervals.  When N=16, the 95% confidence interval would be approximately +/- 2s/sqroot(16) or +/-s/2.  In this case, the standard deviation error bars are twice the size of the 95% confidence intervals, making it much easier to have error bars that overlap than if the bars represented 95% confidence intervals.  In a case like this where we are trying to show that things are the same, making the error bars twice as big as they should be makes the sample means look like they are more similar than they actually are, which is misleading.  The point here is that using standard deviations for error bars is the wrong thing to do when comparing means.
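The footnote's arithmetic can be checked with a few lines of Python.  This is a minimal sketch: the function name is made up for illustration, s is set to 1 so that the results read as multiples of the standard deviation, and the default multiplier of 2 is the rough value used in the footnote.

```python
from math import sqrt


def error_bar_half_widths(s, n, multiplier=2.0):
    """Half-widths of three common error bars for sample SD s and size n.

    multiplier=2.0 is the rough large-sample value used in the footnote;
    pass a value from the t distribution for an exact small-sample interval.
    """
    sem = s / sqrt(n)
    return {"sd": s, "sem": sem, "ci95": multiplier * sem}


for n in (3, 16):
    bars = error_bar_half_widths(s=1.0, n=n)
    print(n, {name: round(width, 2) for name, width in bars.items()})
```

For an exact interval with three replicates, one could call error_bar_half_widths(1.0, 3, multiplier=4.30), since 4.30 is the two-sided 95% t value for 2 degrees of freedom.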

[1] Paulson, D.S. 2005. Response: comparative efficacy of hand hygiene agents. American Journal of Infection Control 33:431-434.
[2] Sickbert-Bennett, E.E., D.J. Weber, M.F. Gergen-Teague, M.D. Sobsey, G.P. Samsa, and W.A. Rutala. 2005. Comparative efficacy of hand hygiene agents in the reduction of bacteria and viruses. American Journal of Infection Control 33:67-77.

Saturday, September 5, 2015

Date of Toilet Paper Apocalypse Now Known


Fig. 1. Charmin toilet paper roll in 2015

Have you noticed how rolls of toilet paper fit into their holders more loosely than they did in the past?  Take a look the next time you are doing your Charmin paperwork, and you will see the first signs of the impending Toilet Paper Apocalypse.

Methods and data

Like the proverbial frog in boiling water, the first signs of the Toilet Paper Apocalypse were not obvious.  I first became aware that it was coming in 2009, when I happened to buy two 8-packs of Charmin Giant rolls and noticed that one of them was noticeably shorter than the other:

Fig. 2. Roll size change observed in 2009.
Careful examination of the quantity details showed that the width of each roll had been decreased from the standard width of 11.4 cm to 10.8 cm.  Although this decrease of 0.6 cm wasn't that apparent when comparing single rolls, it was pretty obvious when the rolls were stacked four high.  When I noticed this, I had a sinking feeling that I was seeing the beginning of a nefarious plot being carried out by faceless bureaucrats at Procter and Gamble.  But without additional data, it was hard to know whether this 5% decrease in the size of rolls was a fluke or part of a pattern.

Fast-forward to 2015 and another trip to the store.  This time it was a purchase of two 6-packs of Charmin Mega rolls:
Fig. 3.a. Charmin roll size change in 2015.

Close examination of the details tells the story:
Fig. 3.b. Charmin roll size change in 2015 (detail).

Sometime between 2009 and 2015, P&G decreased the width of their rolls again, from 10.8 to 9.9 cm, another 8% decrease in the size per sheet.  Now in 2015 the number of sheets per Mega roll was reduced from 330 to 308, a further 7% decline in the amount of toilet paper per roll and a corresponding increase in P&G's profits (since the price of the product has stayed the same with the size decreases).  

What first seemed an idle conspiracy theory is now a known fact.  Being an analysis nerd, I had to search for additional data from the years between 2009 and 2015.  Google did not disappoint.  I was able to find this 2013 post from the Consumerist, which documented another decline in roll width, from 10.8 cm to 10.0 cm (a  7% decrease in sheet size):
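For anyone who wants to check the shrinkage percentages quoted so far, here is a minimal sketch that recomputes them from the widths and sheet counts given in the text.  The labels are just descriptive strings for this example.

```python
def percent_decrease(before, after):
    """Percent reduction going from `before` to `after`."""
    return 100 * (before - after) / before


# (before, after) pairs taken from the measurements quoted in the post.
changes = {
    "2009 width, 11.4 -> 10.8 cm": (11.4, 10.8),
    "2013 width, 10.8 -> 10.0 cm": (10.8, 10.0),
    "by 2015 width, 10.8 -> 9.9 cm": (10.8, 9.9),
    "2015 sheets per Mega roll, 330 -> 308": (330, 308),
}

for label, (before, after) in changes.items():
    print(f"{label}: {percent_decrease(before, after):.1f}%")
```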

Fig. 4. Charmin roll size change in 2013.  Image from


Unfortunately, not all of the changes involved rolls of the same size.  I also did not have exact dates for each change.  However, by estimating the dates and doing some size conversions, I was able to assemble the following data:
Fig. 5. Data showing relationship between year and roll size.

 I conducted a regression analysis on the data and the trend was clear:
Fig. 6. Regression analysis predicting the relationship between year and roll size.
The regression had P=0.0203, so there is no question about the significance of the relationship.  Note also that the regression line has an R² value of 0.96, which means we can safely extrapolate into the future to predict the course of events leading up to the apocalypse.  Rearranging the regression line and solving for the year when the toilet paper surface area goes to zero gives a year value of 2046.  So we can now confidently predict the date when the apocalypse will come to pass.


This analysis has three important implications. 

1. Charmin toilet paper will disappear on or about the year 2046.  For those of us who are devoted Charmin users, that means shortages, hoarding, and chaos in the grocery stores during the 2040's.  Remember the horror that was Y2K?  We have that to look forward to all over again.

2. Since each Charmin roll size decrease was accomplished without a corresponding decrease in price, the profits for Procter and Gamble will increase as an inverse function of the area of paper per roll.  A little elementary math will prove that a linear decrease in the roll area to zero results in Procter and Gamble's profits increasing to infinity in the year 2046.  The implications are clear: by taking advantage of their customers' addiction to their product, P&G intends to dominate the world economy by mid-century. 

3. As the amount of toilet paper per roll decreases, other effects will begin to appear.  For example, if we assume a constant number of sheets per roll and a length per sheet of 10.1 cm, we can project based on the regression line from Fig. 6 that the width of sheets will reach 2.0 cm in approximately 2040 (Fig. 7).

Fig. 7. Simulated Charmin roll size in 2040.
When the sheet size reaches that shown in Fig. 7, it becomes questionable as to whether the toilet paper can actually perform its intended function. In that case, we can expect cascades of related effects caused by the increase in transmission of fecal-borne pathogens, such as pandemics of cholera, hepatitis A, and typhoid fever.  It is possible that the human population may collapse due to these pandemics several years before the actual disappearance of toilet paper altogether.


This paper should be considered a clarion call for action.  The most likely solution to the problem is probably some form of government regulation of roll and sheet size.  However, given the widespread gridlock in facing lesser issues, such as climate change and refugee crises, it is likely that inaction by governments on this issue will continue.  In that case, we should expect widespread hoarding, followed by rationing as 2046 approaches.


Thursday, July 30, 2015

Shiny new toys 3: Playing with Linked Data

In the second post in this series, I recapitulated my conclusions about the question "What can we learn using RDF?" that I had explored in an earlier series of blog posts:

1. In order for the use of RDF to be beneficial, it needs to facilitate processing by semantic clients ("machines") that results in interesting inferences and information that would not otherwise be obvious to human users.  We could call this the "non-triviality" problem.
2. Retrieving information that isn't obvious is more likely to happen if the RDF is rich in object properties [1] that link to other IRI-identified resources and if those linked IRIs are from Internet domains controlled by different providers.  We could call this the "Linked Data" problem: we need to break down the silos that separate the datasets of different providers.  We could reframe by saying that the problem to be solved is the lack of persistent, unique identifiers, lack of consensus object properties, and lack of the will for people to use and reuse those identifiers and properties.
3. RDF-enabled machine reasoning will be beneficial if the entailed triples allow the construction of more clever or meaningful queries, or if they state relationships that would not be obvious to humans.  We could call this the "Semantic Web" problem.

In that second post, I used the "shiny new toys" (Darwin-SW 0.4, the Bioimages RDF dataset, and the Heard Library Public SPARQL endpoint) to play around with the Semantic Web issue by using SPARQL to materialize entailed inverse relationships.  In this post, I'm going to play around with the Linked Data issue by trying to add an RDF dataset from a different provider to the Heard Library triplestore and see what we can do with it.

 Image from CC BY-SA

Linked Data

I'm not going to attempt to review the philosophy and intricacies of Linked Data.  The W3C, among others, has an introduction to Linked Data.  To summarize, here's what I think of when I hear the term "Linked Data":

1. Things (known as "resources" in RDF lingo) are identified by HTTP IRIs.
2. Those things are described by some kind of machine-processable data.
3. Those data link the thing to other things that are also identified by HTTP IRIs.
4. If a machine doesn't know about some other linked thing, it can dereference that thing's HTTP IRI in order to retrieve the machine-processable data about the other thing via the Internet.

This may be an oversimplification that is short on technical details, but I think it catches the main points.

Some notes:
1. IRIs are a superset of URIs.  In this post, you could pretty much substitute "URI" anytime I use "IRI" if you wanted to.
2. "Machine-processable data" might mean that the data are described using RDF, but that isn't a requirement - there are other ways that data can be described and encoded, including Microdata and JSON-LD that are similar to, but not synonymous with RDF.
3. It is possible to link to other things without using RDF triples.  For example, in its head element, an HTML web page can have a link tag with a rel="meta" attribute that links to some other document.  It is possible that a machine would be able to "follow its nose" to find other information using that link.
4. RDF does not require that the things be identified by IRIs that begin with "HTTP://" (i.e. HTTP IRIs).  But if they aren't, it becomes difficult to use HTTP and the Internet to retrieve the information about a linked thing.

In this blog post, I'm going to narrow the scope of Linked Data as it is applied in my new toys:
1. All IRIs are HTTP IRIs.
2. Resources are described as RDF.
3. Links are made via RDF triples.
4. The linked RDF is serialized within some document that can be retrieved via HTTP using the linked IRI.

What "external" links are included in the Bioimages RDF dataset?

In the previous two blog posts in this series, all of the SPARQL queries investigated resources whose descriptions were contained in the Bioimages dataset itself.  However, the Bioimages dataset contains links to some resources that are described outside of that dataset:
  • Places that are described by GeoNames.
  • Taxonomic names that are described by uBio.
  • Agents that are described by ORCID.
  • Literature identified by DOIs that link to publisher descriptions.

In principle, a Linked Data client ("machine", a program designed to find, load, and interpret machine-processable data) could start with the Bioimages VoID description and follow links from there to discover and add to the triplestore all of the linked information, including information about the four kinds of resources described outside of Bioimages.  Although that would be cool, it probably isn't practical for us to take that approach at the present.  Instead, I would like to find a way to acquire relevant triples from a particular external provider, then add them to the triplestore manually.

The least complicated of the external data sources [1] to experiment with is probably GeoNames.  GeoNames actually provides a data dump download service that can be used to download its entire dataset.  Unfortunately, the form of that dump isn't RDF, so it would have to be converted to RDF.  Its dataset also includes a very large number of records (10 million geographical names, although subsets can be downloaded).  So I decided it would probably be more practical to just retrieve the RDF about particular geographic features that are linked from Bioimages.

Building a homemade Linked Data client

In the do-it-yourself spirit of this blog series, I decided to program my own primitive Linked Data client.  As a summer father/daughter project, we've been teaching ourselves Python, so I decided this would be a good Python programming exercise.  There are two reasons why I like using Python vs. other programming languages I've used in the past.  One is that it's very easy to test your code interactively as you build it.  The other is that somebody has probably already written a library that contains most of the functions that you need.  After a minuscule amount of time Googling, I found the RDFLib library that provides functions for most common RDF-related tasks.  I wanted to leverage the new Heard Library SPARQL endpoint, so I planned to use XML search results from the endpoint as a way to find the IRIs that I want to pull from GeoNames.  Fortunately, Python has built-in functions for handling XML as well.

To scope out the size of the task, I created a SPARQL query that would retrieve the GeoNames IRIs to which Bioimages links.  The Darwin Core RDF Guide defines the term dwciri:inDescribedPlace that relates Location instances to geographic places.  In Bioimages, each of the Location instances associated with Organism Occurrences is linked to a GeoNames feature using dwciri:inDescribedPlace.  As in the earlier blog posts of this series, I'm providing a template SPARQL query that can be pasted into the blank box of the Heard Library endpoint:

PREFIX foaf: <>
PREFIX ac: <>
PREFIX dwc: <>
PREFIX dwciri: <>
PREFIX dsw: <>
PREFIX Iptc4xmpExt: <>
PREFIX rdfs: <>
PREFIX dc: <>
PREFIX dcterms: <>
PREFIX dcmitype: <>

SELECT DISTINCT ?place

WHERE {
 ?location a dcterms:Location.
 ?location dwciri:inDescribedPlace ?place.
}

You can substitute other query forms in place of the SELECT DISTINCT line and other graph patterns for the WHERE clause in this template.

The graph pattern in the example above binds Location instances to a variable and then finds GeoNames features that are linked using dwciri:inDescribedPlace.  Running this query finds the 185 unique references to GeoNames IRIs that fit the graph pattern.  If you run the query using the Heard Library endpoint web interface, you will see a list of the IRIs.  If you use a utility like cURL to run the query, you can save the results in an XML file (which I've called "results.xml" in the Python script).

Now we have the raw materials to build the application.  Here's the code required to parse the results XML from the file into a Python object:

import xml.etree.ElementTree as etree
tree = etree.parse('results.xml')

A particular result XML element in the file looks like this:

    <binding name='place'>

To pull out all of the <uri> nodes, I used:


and to turn a particular <uri> node into a string, I used:


In theory, this string is the GeoNames feature IRI that I would need to dereference to acquire the RDF about that feature.  But content negotiation redirects the IRI that identifies the abstract feature to a similar IRI that ends in "about.rdf" and identifies the actual RDF/XML document file that is about the GeoNames feature.  If I were programming a real semantic client, I'd have to be able to handle this kind of redirection and recognize HTTP response codes like 303 (SeeOther).  But I'm a Python newbie and I don't have a lot of time to spend on the program, so I hacked it to generate the file IRI like this:
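A minimal sketch of these three steps (finding the <uri> nodes, reading one as a string, and tacking "about.rdf" onto it) might look like the following.  It assumes the results are in the standard SPARQL XML results format, uses a tiny made-up results document with a placeholder feature ID in place of the real results.xml, and the function names are hypothetical rather than those of the original script.

```python
import xml.etree.ElementTree as etree

# Namespace used by the SPARQL Query Results XML Format.
SPARQL_NS = "{http://www.w3.org/2005/sparql-results#}"

# Made-up stand-in for results.xml; the feature ID is a placeholder.
SAMPLE = """<?xml version="1.0"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head><variable name="place"/></head>
  <results>
    <result>
      <binding name="place"><uri>http://sws.geonames.org/0000000/</uri></binding>
    </result>
  </results>
</sparql>"""


def feature_iris(root):
    """Return the text of every <uri> node anywhere under root."""
    return [node.text for node in root.iter(SPARQL_NS + "uri")]


def document_iri(feature_iri):
    """Shortcut from the post: append 'about.rdf' instead of following the 303."""
    return feature_iri + "about.rdf"


root = etree.fromstring(SAMPLE)
for iri in feature_iris(root):
    print(document_iri(iri))
```

With a real file, etree.parse('results.xml').getroot() would take the place of etree.fromstring(SAMPLE).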


RDFLib is so awesome - it has a single function that will make the HTTP request, receive the file from the server, and parse it:

result = addedGraph.parse(getUri)

It turns out that the Bioimages data has a bad link to a non-existent IRI:

so I put the RDF parse code inside a try:/except:/else: error trap so that the program wouldn't crash if the server HTTP response was something other than 200.

In RDFLib it is also super-easy to do a UNION merge of two graphs.  I merge the graph I just retrieved from GeoNames into the graph where I'm accumulating triples by:

builtGraph = builtGraph + addedGraph

When I'm done merging all of the data that I've retrieved from GeoNames, I serialize the graph I've built into RDF/XML and save it in the file "geonames.rdf":

s = builtGraph.serialize(destination='geonames.rdf', format='xml')
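The retrieve, error-trap, and merge steps can be sketched together without touching the network.  This is an illustration under stated assumptions, not the original script: a plain Python set of triples stands in for an RDFLib graph, the function names are hypothetical, and the fetcher is passed in so that a fake one can be used for testing.  The rdflib_fetch function shows how RDFLib's Graph.parse could be plugged in, assuming the third-party rdflib package is installed.

```python
def harvest(document_iris, fetch):
    """Merge the triples from every document that can be retrieved.

    fetch(iri) must return an iterable of triples or raise on failure.
    Returns the merged set of triples and a list of (iri, error) pairs.
    """
    built = set()
    failed = []
    for iri in document_iris:
        try:
            added = set(fetch(iri))
        except Exception as err:  # e.g. the server answers 404 for a bad link
            failed.append((iri, err))
        else:
            built |= added  # set union plays the role of the graph merge
    return built, failed


def rdflib_fetch(iri):
    """Assumed RDFLib-based fetcher; needs the third-party rdflib package."""
    import rdflib
    return rdflib.Graph().parse(iri)


def fake_fetch(iri):
    """Offline stand-in that fails for one deliberately bad link."""
    if "missing" in iri:
        raise IOError("HTTP 404")
    return [(iri, "gn:name", "Example")]


triples, failed = harvest(
    ["http://example.org/a/about.rdf", "http://example.org/missing/about.rdf"],
    fake_fetch,
)
print(len(triples), "triples;", len(failed), "failed")
```

Keeping the merge in the else: branch means a failed retrieval never contributes partial data, and the list of failures makes bad links in the source data easy to report afterwards.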

Here's what the whole script looks like:

import rdflib
import xml.etree.ElementTree as etree
tree = etree.parse('results.xml')

while fileIndex<len(resultsArray):
        result = addedGraph.parse(getUri)
        builtGraph = builtGraph + addedGraph

s = builtGraph.serialize(destination='geonames.rdf', format='xml')

Voilà! a Linked Data client in 19 lines of code!  You can get the raw code with extra annotations at

The results in the "geonames.rdf" file are serialized in the typical, painful RDF/XML syntax:

<?xml version="1.0" encoding="UTF-8"?>
  <rdf:Description rdf:about="">
    <gn:alternateName xml:lang="bpy">কলোরাডো</gn:alternateName>
    <gn:alternateName xml:lang="eo">Kolorado</gn:alternateName>
    <gn:alternateName xml:lang="mk">Колорадо</gn:alternateName>
    <gn:alternateName xml:lang="he">קולורדו</gn:alternateName>
    <gn:parentCountry rdf:resource=""/>

but that's OK, because only a machine will be reading it.  You can see that the data I've pulled from GeoNames provides me with information that didn't already exist in the Bioimages dataset, such as how to write "Colorado" in Cyrillic and Hebrew.

Ursus americanus occurrence in Parque Nacional Yellowstone in 1971:  Image (c) 2005 by James H. Bassett CC BY-NC-SA

Doing something fun with the new data

As exciting as it is to have built my own Linked Data client, it would be even more fun to use the scraped data to run some more interesting queries.  The geonames.rdf triples have been added to the Heard Library triplestore and can be included in queries by adding the


clause to the template set I listed earlier and the gn: namespace abbreviation to the list:

PREFIX gn: <>

OK, here is the first fun query.  Find 20 GeoNames places that are linked to Locations in the Bioimages dataset, show their English names and name translations in Japanese.  By looking at the RDF snippet above, you can see that the property gn:name is used to link the feature to its preferred (English?) name.  The property gn:alternateName is used to link the feature to alternative names in other languages.  So the triple patterns

 ?place gn:name ?placeName.
 ?place gn:alternateName ?langTagName.

can be added to the template query to find the linked names.  However, we don't want ALL of the alternative names, just the ones in Japanese.  For that, we need to add the filter

FILTER ( lang(?langTagName) = "ja" )

The whole query would be

SELECT DISTINCT ?placeName ?langTagName

WHERE {
 ?location a dcterms:Location.
 ?location dwciri:inDescribedPlace ?place.
 ?place gn:name ?placeName.
 ?place gn:alternateName ?langTagName.
FILTER ( lang(?langTagName) = "ja" )
}

Here are sample results:

placeName  langTagName
Great Smoky Mountains National Park  グレート・スモーキー山脈国立公園@ja
Yellowstone National Park  イエローストーン国立公園@ja

Other fun language tags to try include "ka", "bpy", and "ar".
Here is the second fun query.  List species that are represented in the Bioimages database that are found in "Parque Nacional Yellowstone" (the Spanish name for Yellowstone National Park).  Here's the query:

SELECT DISTINCT ?genus ?species

WHERE {
 ?determination dwc:genus ?genus.
 ?determination dwc:specificEpithet ?species.
 ?organism dsw:hasIdentification ?determination.
 ?organism dsw:hasOccurrence ?occurrence.
 ?occurrence dsw:atEvent ?event.
 ?event dsw:locatedAt ?location.
 ?location dwciri:inDescribedPlace ?place.
 ?place gn:alternateName "Parque Nacional Yellowstone"@es.
}

Most of this query could have been done using only the Bioimages dataset without our Linked Data effort, but there is nothing in the Bioimages data that provides any information about place names in Spanish.  Querying on that basis required triples from GeoNames.  The results are:

genus  species
Pinus albicaulis
Ursus americanus

Adding the GeoNames data to Bioimages enables more than alternative language representations for place names.  Each feature is linked to its parent administrative feature (counties to states, states to countries, etc.), whose data could also be retrieved from GeoNames to build a geographic taxonomy that could be used to write smarter queries.  Many of the geographic features are also linked to Wikipedia articles, so queries could be used to build web pages that showed images of organisms found in certain counties, along with a link to the Wikipedia article about the county.

Possible improvements to the homemade client

  1. Let the software make the query to the SPARQL endpoint itself instead of providing the downloaded file to the script.  This is easily possible with the RDFLib Python library.
  2. Facilitate content negotiation by requesting an RDF content-type, then handle 303 redirects to the file containing the metadata. This would allow the actual resource IRI to be dereferenced without jury-rigging based on an ad hoc addition of "about.rdf" to the resource IRI.  
  3. Search recursively for parent geographical features.  If the retrieved file links to a parent resource that hasn't already been retrieved, retrieve the file about it, too.  Keep doing that until there aren't any higher levels to be retrieved.
  4. Check with the SPARQL endpoint to find Location records that have changed (or are new) since some previous time, and retrieve only the features linked in that record.  Check the existing GeoNames data to make sure the record hasn't already been retrieved.
This would be a great project for a freshman computer programming class.
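
As a starting point for improvement 3, here is a minimal sketch of the "keep climbing until there are no more parents" logic.  Everything in it is hypothetical: the PARENT_OF dictionary and fetch_parent() stand in for actually dereferencing a GeoNames IRI and reading the link to the parent feature out of the returned RDF, and the "ex:" identifiers are placeholders rather than real GeoNames IRIs.

```python
# Hypothetical stand-in for dereferencing a feature IRI: maps each feature
# to its parent feature (None at the top of the hierarchy).
PARENT_OF = {
    "ex:county": "ex:state",
    "ex:state": "ex:country",
    "ex:country": None,
}


def fetch_parent(iri):
    """Pretend to retrieve the RDF about iri and return its parent IRI."""
    return PARENT_OF.get(iri)


def retrieve_with_ancestors(start_iris, already_retrieved=()):
    """Retrieve each feature plus every ancestor not retrieved before."""
    retrieved = set(already_retrieved)
    to_visit = list(start_iris)
    while to_visit:
        iri = to_visit.pop()
        if iri in retrieved:
            continue  # this feature was already fetched, so skip it
        retrieved.add(iri)
        parent = fetch_parent(iri)
        if parent is not None:
            to_visit.append(parent)  # climb one level up
    return retrieved


features = retrieve_with_ancestors(["ex:county"])
```

Passing the set of previously retrieved features as already_retrieved also covers the check described in improvement 4.
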

So can we "learn" something using RDF?

Returning to the question posed at the beginning of this post, does retrieving data from GeoNames and pooling it with the Bioimages data address the "non-triviality problem"?  In this example, it does provide useful information that wouldn't be obvious to human users (names in multiple languages).  Does this qualify as "interesting"?  Maybe not, since the translation could be obtained by pasting names into Google translate.  But it is more convenient to let a computer do it. 

To some extent, the "Linked Data problem" is solved in this case since there is now a standard, well-known property to do the linking (dwciri:inDescribedPlace) and a stable set of external IRI identifiers (the GeoNames IRIs) to link to.  The will to do the linking was also there on the part of Bioimages and its image providers - the Bioimages image ingestion software we are currently testing makes it more convenient for users to make that link.

So on the basis of this example, I am going to give a definite "maybe" to the question "can we learn something useful using RDF".

I may write another post in this series if I can pull RDF data from some of the other RDF data sources (ORCID, uBio) and do something interesting with them.

[1] uBio doesn't actually use HTTP URIs, I can't get ORCID URIs to return RDF, and DOIs use redirection to a number of data providers.

Saturday, July 25, 2015

Shiny new toys 2: Inverse properties, reasoning, and SPARQL

In an earlier series of blog posts, I explored the question "What can we learn using RDF?".  I ended that series by concluding several things:

1. In order for the use of RDF to be beneficial, it needs to facilitate processing by semantic clients ("machines") that results in interesting inferences and information that would not otherwise be obvious to human users.  We could call this the "non-triviality" problem.
2. Retrieving information that isn't obvious is more likely to happen if the RDF is rich in object properties [1] that link to other IRI-identified resources and if those linked IRIs are from Internet domains controlled by different providers.  We could call this the "Linked Data" problem: we need to break down the silos that separate the datasets of different providers.  We could reframe by saying that the problem to be solved is the lack of persistent, unique identifiers, lack of consensus object properties, and lack of the will for people to use and reuse those identifiers and properties.
3. RDF-enabled machine reasoning will be beneficial if the entailed triples allow the construction of more clever or meaningful queries, or if they state relationships that would not be obvious to humans.  We could call this the "Semantic Web" problem.

RDF has now been around for about fifteen years, yet it has failed to gain the kind of traction that the HTML-facilitated, human-oriented Web attained in its first fifteen years.  I don't think the problem is primarily technological.  The necessary standards (RDF, SPARQL, OWL) and network resources are in place, and computing and storage capabilities are better than ever.  The real problem seems to be a social one.  In the community within which I function (biodiversity informatics), with respect to the Linked Data problem, we have failed to adequately scope what the requirements are for "good" object properties, and failed to settle on a usable system for facilitating discovery and reuse of identifiers.  With respect to the Semantic Web problem, there has been relatively little progress on discussing as a community [2] the nature of the use cases we care about, and how the adoption of particular semantic technologies would satisfy them.  What's more, with the development of and JSON-LD, there are those who question whether the Semantic Web problem is even relevant.

This blog post ties together the old "What can we learn using RDF?" theme with the theme of this new blog series ("Shiny new toys") by examining a design choice about object properties that Cam Webb and I made when we created Darwin-SW.  I show how we can play with the "shiny new toys" (Darwin-SW 0.4, the Bioimages RDF dataset, and the Heard Library Public SPARQL endpoint) to investigate the practicality of that choice.  I end by pondering how that design choice is related to the RDF problems I listed above.

Why is it called "Darwin-SW"?

In my previous blog post, I mentioned that instances of Darwin Core classes in Bioimages RDF are linked using a graph model based mostly on the Darwin-SW ontology (DSW).  A primary function of Darwin-SW is to provide the object properties that are missing from Darwin Core.  For example, Darwin Core itself does not provide any way to link an Organism with its Identifications or its Occurrences.

If linking were the only purpose of Darwin-SW, then why isn't it called "Darwin-LD" (for "Darwin Linked Data") instead of Darwin-SW (for "Darwin Semantic Web")?  When Cam Webb and I developed it, we wanted to find out what kinds of properties we could assign to the DSW terms that would satisfy some use cases that were interesting to us.  We were interested in leveraging at least some of the capabilities of the Semantic Web in addition to facilitating Linked Data.

Compared to OBO Foundry ontologies, DSW is a fairly light-weight ontology.  It doesn't generate a lot of entailments because it uses OWL properties somewhat sparingly, but there are specific term properties that generate important entailments that help us satisfy our use cases. [4]  Darwin-SW declares some classes to be disjoint, establishes the key transitive properties dsw:derivedFrom and dsw:hasDerivative, declares ranges and domains for many properties, and defines pairs of object properties that are declared to be inverses.  It is the last item that I'm going to talk about specifically in this blog post.  (I may talk about the others in later posts.)

What are inverse properties?

Sometimes a vocabulary defines a single object property to link instances of classes of subject and object resources.  For example, Dublin Core defines the term dcterms:publisher to link a resource to the Agent that published it: 

On the other hand, Dublin Core also defines the terms dcterms:hasPart and dcterms:isPartOf, which can be used to link a resource to some other resource that is included within it. 
In the first case of dcterms:publisher, there is only one way to write a triple that links the two resources, e.g.

             dcterms:publisher <>.

But in the second case, there are two ways to create the linkage, e.g.

             dcterms:hasPart <>.


             dcterms:isPartOf <>.

Because the terms dcterms:hasPart and dcterms:isPartOf link the two resources in opposite directions, they could be considered inverse properties.  You have the choice of using either one to perform the linking function.

If a vocabulary creator like DCMI provides two apparent inverse properties that can be used to link a pair of resources in either direction, that brings up an important question.  If a user asserts that

             dcterms:hasPart <>.

is it safe to assume that

             dcterms:isPartOf <>.

is also true?  It would make sense to think so, but DCMI actually never explicitly states that asserting the first relationship entails that the second relationship is also true.  Compare that to the situation in the FOAF vocabulary, which defines the terms foaf:depicts and foaf:depiction

In contrast to the Dublin Core dcterms:hasPart and dcterms:isPartOf properties, FOAF explicitly declares

  foaf:depicts owl:inverseOf foaf:depiction.

in the RDF definition of the vocabulary.  So stating the triple 


regardless of whether the second triple is explicitly stated or not.

Why does this matter?

When the creator of a vocabulary offers only one property that can be used to link a certain kind of resource to another kind of resource, data consumers can know that they will always find all of the linkages between the two kinds of resources when they search for them.  The query

     ?book dcterms:publisher ?agent.

will retrieve every case in a dataset where a book has been linked to its publisher. 

However, when the creator of a vocabulary offers two choices of properties that can be used to link a certain kind of resource to another kind of resource, there is no way for a data consumer to know whether a particular dataset producer has expressed the relationship in one way or the other.  If a dataset producer asserts only that

             dcterms:hasPart <>.

but a data consumer performs the query

     ?term dcterms:isPartOf ?vocabulary.

the link between vocabulary and term asserted by the producer will not be discovered by the consumer.  In the absence of a pre-established agreement between the producer and consumer about which property term will always be used to make the link, the producer must express the relationship both ways, e.g.

# producer's data
             dcterms:hasPart <>.
             dcterms:isPartOf <>.

if the producer wants to ensure that consumers will always discover the relationship.  In order for consumers to ensure that they find every case where terms are related to vocabularies, they must perform a more complicated query like:

     {?term dcterms:isPartOf ?vocabulary.}
     UNION
     {?vocabulary dcterms:hasPart ?term.}

This is really annoying. If a lot of inverse properties are involved, either the producer must assert a lot of unnecessary triples or the consumer must write really complicated queries.  For the pair of Dublin Core terms, this is the only alternative in the absence of a convention on which of the two object properties should be used.
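
The consumer's dilemma can be shown with a minimal Python sketch in which a graph is just a set of (subject, predicate, object) tuples.  The "ex:" identifiers and the function names are hypothetical; the point is only that a search on one property misses a link that the producer expressed with the other one.

```python
# Toy producer graph: the link is asserted in only one direction.
producer_graph = {
    ("ex:vocabulary", "dcterms:hasPart", "ex:term"),
}


def parts_one_way(graph):
    """Find (term, vocabulary) pairs using dcterms:isPartOf only."""
    return {(s, o) for s, p, o in graph if p == "dcterms:isPartOf"}


def parts_both_ways(graph):
    """Find (term, vocabulary) pairs expressed with either property."""
    found = {(s, o) for s, p, o in graph if p == "dcterms:isPartOf"}
    # A hasPart triple points the other way, so swap subject and object.
    found |= {(o, s) for s, p, o in graph if p == "dcterms:hasPart"}
    return found


missed = parts_one_way(producer_graph)       # empty: the link is not found
found = parts_both_ways(producer_graph)      # contains the (term, vocabulary) pair
```
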

However, for the pair of FOAF terms, there is another alternative.  Because the relationship

  foaf:depicts owl:inverseOf foaf:depiction.

was asserted in the FOAF vocabulary definition, a data producer can safely assert the relationship in only one way IF the producer is confident that the consumer will be aware of the owl:inverseOf assertion AND if the consumer performs reasoning on the producer's graph (i.e. a collection of RDF triples).  By "performs reasoning", I mean that the consumer uses some kind of semantic client to materialize the missing, entailed triples and add them to the producer's graph.  If the producer asserts 

and the consumer conducts the query

     ?plant foaf:depiction ?image.

the consumer will discover the relationship if pre-query reasoning materialized the entailed triple 

and included it in the dataset that was being searched.

Why did Darwin-SW include so many inverse property pairs?

Examination of the graph model above shows that DSW is riddled with pairs of inverse properties.  Like foaf:depicts and foaf:depiction, each property in a pair is declared to be owl:inverseOf its partner.  Why did we do that when we created Darwin-SW?

Partly we did it as a matter of convenience.  If you are describing an organism, it is slightly more convenient to say 
             a dwc:Organism;
             dwc:organismName "Bicentennial Oak"@en;

than to have to say 
             a dwc:Organism;
             dwc:organismName "Bicentennial Oak"@en.

The other reason for making it possible to establish links in either direction is somewhat philosophical.  Why did DCMI create a property that links documents and their publishers in the direction where the document was the subject and not the other way around?  Probably because Dublin Core is all about describing the properties of documents, and not so much about describing agents (that's FOAF's thing).  Why did TDWG create the Darwin Core property dwciri:recordedBy that links Occurrences to the Agents that record them in the direction where Occurrence is the subject and not the other way around?  Because TDWG is more interested in describing Occurrences than it is in describing agents.  In general, vocabularies define object properties in a particular direction so that the kind of resource their creators care about the most is in the subject position of the triple.  TDWG has traditionally been specimen- and Occurrence-centric, whereas Cam and I have the attitude that in a graph-based RDF world, there isn't any center of the universe.  It should be possible for any resource in a graph to be considered the subject of a triple linking that resource to another node in the graph.  So we created DSW so that users can always choose an object property that places their favorite Darwin Core class in the subject position of the triple.

Is creating a bunch of inverse property pairs a stupid idea?

In creating DSW, Cam and I didn't claim to have a corner on the market of wisdom.  What we wanted to accomplish was to lay out a possible, testable solution for the problem of linking biodiversity resources, and then see if it worked.  If one assumes that it isn't reasonable to expect data producers to express relationships in both directions, nor to expect consumers to always write complex queries, the feasibility of minting inverse property pairs really depends on whether one "believes in" the Semantic Web.  If one believes that carrying out machine reasoning is feasible on the scale in which the biodiversity informatics community operates, and if one believes that data consumers (or aggregators who provide those data to consumers) will routinely carry out reasoning on graphs provided by producers, then it is reasonable to provide pairs of properties having owl:inverseOf relationships.  On the other hand, if data consumers (or aggregators who provide data to consumers) can't or won't routinely carry out the necessary reasoning, then it would be better to just deprecate one of the properties of each pair so that there will never be uncertainty about which one data producers will use.

So part of the answer to this question involves discovering how feasible it is to conduct reasoning of the sort that is required on a graph that is of a size that might be typical for biodiversity data producers.

Introducing the reasoner

"Machine reasoning" can mean several things.  One type of reasoning is to determine unstated triples that are entailed by a graph (see this blog post for thoughts on that).   Another type of reasoning is to determine whether a graph is consistent (see this blog post for thoughts on that).  Reasoning of the first sort is carried out on graphs by software that has been programmed to apply entailment rules.  It would be tempting to think that it would be good to have a reasoner that would determine all possible entailments.  However, that is undesirable for two reasons.  The first is that applying some entailment rules may have undesired consequences (see this blog post for examples; see Hogan et al. 2009 for examples of categories of entailments that they excluded).  The second is that it may take too long to compute all of the entailments.  For example, some uses of OWL Full may be undecidable (i.e. all entailments are not computable in a finite amount of time).  So in actuality, we have to decide what sort of entailment rules are important to us, and find or write software that will apply those rules to our graph.

A number of reasoners have been developed to efficiently compute entailments and determine consistency.  One example is Apache Jena.  There are various strategies that have been employed to optimize the process of reasoning.  If a query is made on a graph, a reasoner can compute entailments "on the fly" as it encounters triples that entail them.  However, this process would slow down the execution of every query.  Alternatively, a reasoner could compute the set of entailed triples using the entire data graph and save those "materialized" triples in a separate graph.  This might take more time than computing the entailments at query time, but it would only need to be done once.  After the entailed triples are stored, queries could be run on the union of the data graph and the graph containing the materialized entailed triples with little effect on the time of execution of the query.  This strategy is called "forward chaining".

Creating a reasoner using a SPARQL endpoint

One of the "shiny new toys" I've been playing with is the public Vanderbilt Heard Library SPARQL endpoint that we set up as part of a Dean's Fellow project.  The endpoint is set up on a Dell PowerEdge 2950 with 32GB RAM and uses Callimachus 1.4.2 with an Apache frontend.  In the previous blog post, I talked about the user-friendly web page that we made to demonstrate how SPARQL could be used to query the Bioimages RDF (another of my "shiny new toys").  In the Javascript that makes the web page work, we used the SPARQL SELECT query form to find resources referenced in triples that matched a graph pattern (i.e. set of triple patterns).  The main query that we used screened resources by requiring them to be linked to other resources that had particular property values.  This is essentially a Linked Data application, since it solves a problem by making use of links between resources (although it would be a much more interesting one if links extended beyond the Bioimages RDF dataset).

The SPARQL CONSTRUCT query form asks the endpoint to return a graph containing triples constrained by the graph pattern specified in the WHERE clause of the query.  Unfortunately, CONSTRUCT queries can't be carried out by pasting them into the blank query box of the Heard Library endpoint web page.  The endpoint will carry out CONSTRUCT queries if they are submitted using cURL or some other application that allows you to send HTTP GET requests directly.  I used the Advanced Rest Client Chrome extension, which displays the request and response headers and also measures the response time of the server.  If you are trying out the examples I've listed here and don't want to bother figuring out how to configure an application to send and receive the data related to CONSTRUCT queries, you can generally replace

CONSTRUCT {?subject ex:property ?object.}

with

SELECT ?subject ?object

and paste the modified query in the blank query box of the endpoint web page.  The subjects and objects of the triples that would have been constructed will then be shown on the screen.  It probably would also be advisable to add a LIMIT clause at the end of the query in case it generates thousands of results.
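
If you would rather script the request than use a browser extension, here is a minimal sketch of building the HTTP GET request with the Python standard library.  The endpoint URL is a placeholder (an assumption, not the real address of the Heard Library endpoint), and the function name is made up.  The sketch relies on the SPARQL protocol convention of passing the query text in a URL-encoded "query" parameter, and it asks for RDF/XML in the Accept header.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint URL; substitute the address of the real endpoint.
ENDPOINT = "https://example.org/sparql"


def build_construct_request(query, endpoint=ENDPOINT):
    """Build (but do not send) an HTTP GET request carrying a SPARQL query."""
    # The query text travels in a URL-encoded "query" parameter.
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    # Ask for the constructed graph serialized as RDF/XML.
    return urllib.request.Request(url, headers={"Accept": "application/rdf+xml"})


request = build_construct_request(
    "CONSTRUCT {?s ?p ?o.} WHERE {?s ?p ?o.} LIMIT 10"
)
# urllib.request.urlopen(request) would send it; that call is left out so
# that the sketch runs without a network connection.
```
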

In the previous section, I described a forward chaining reasoner as performing the following functions:
1. Apply some entailment rules to a data graph to generate entailed triples.
2. Store the entailed triples in a new graph.
3. Conduct subsequent queries on the union of the data graph and the entailed triple graph.

These functions can be duplicated by a SPARQL endpoint responding to an appropriate CONSTRUCT query if the WHERE clause specifies the entailment rules.  The endpoint returns the graph of entailed triples in response to the query, and that graph can be loaded into the triple store to be included in subsequent queries.
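
The three steps can be sketched in a few lines of Python, again treating a graph as a set of (subject, predicate, object) tuples.  This is only an illustration of the strategy under simplifying assumptions, not what a real reasoner or the endpoint does internally; the function names and the "ex:" identifiers are hypothetical, and the single hard-coded rule is the foaf:depicts / foaf:depiction inverse discussed earlier.

```python
def depiction_rule(graph):
    """Toy entailment rule: foaf:depicts entails the reversed foaf:depiction."""
    for s, p, o in graph:
        if p == "foaf:depicts":
            yield (o, "foaf:depiction", s)


def forward_chain(data_graph, rule):
    """Steps 1 and 2: apply the rule and keep the entailed triples as a new graph."""
    return set(rule(data_graph))


def query(graphs, predicate):
    """Step 3: match a predicate over the union of several graphs."""
    union = set().union(*graphs)
    return {(s, o) for s, p, o in union if p == predicate}


data = {("ex:image", "foaf:depicts", "ex:plant")}
entailed = forward_chain(data, depiction_rule)    # done once, then stored
hits = query([data, entailed], "foaf:depiction")  # later queries use the union
```
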

Here is the start of a query that would materialize triples that are entailed by using properties that are declared to have an owl:inverseOf relationship to another property:

CONSTRUCT {?Resource1 ?Property2 ?Resource2.}
WHERE {
 ?Property1 owl:inverseOf ?Property2.
 ?Resource2 ?Property1 ?Resource1.
}

To use this query, the RDF that defines the properties (e.g. the FOAF vocabulary RDF document) must be included in the triplestore with the data. The first triple pattern binds instances of declared inverse properties (found in the vocabulary definitions RDF) to the variables ?Property1 and ?Property2.  The second triple pattern binds subjects and objects of triples containing ?Property1 as their predicate to the variables ?Resource2 and ?Resource1 respectively.  The endpoint then constructs new triples where the subject and object positions are reversed and inserts the inverse property as the predicate.

There are two problems with the query as it stands.  The first problem is that the query generates ALL possible triples that are entailed by triples in the dataset that contain ?Property1.  Actually, we only want to materialize triples if they don't already exist in the dataset.  In SPARQL 1.0, fixing this problem was complicated.  However, in SPARQL 1.1, it's easy.  The MINUS operator removes solutions that also match a second graph pattern.  Here's the modified query:

CONSTRUCT {?Resource1 ?Property2 ?Resource2.}
WHERE {
 ?Property1 owl:inverseOf ?Property2.
 ?Resource2 ?Property1 ?Resource1.
  MINUS {?Resource1 ?Property2 ?Resource2.}
}

The first two triple patterns bind resources to the same variables as above, but the MINUS operator eliminates bindings to the variables in cases where the triple to be constructed already exists in the data. [5]

The other problem is that owl:inverseOf is symmetric.  In other words, declaring:

<property1> owl:inverseOf <property2>.

entails that

<property2> owl:inverseOf <property1>.

In a vocabulary definition, it should not be necessary to make declarations in both directions, so vocabularies generally only declare one of them.  Since the SPARQL-based reasoner we are creating is ignorant of the semantics of owl:inverseOf, we need to take additional steps to make sure that we catch entailments based on properties that are either the subject or the object of an owl:inverseOf declaration.  There are several ways to do that.

One possibility would be to simply materialize all of the unstated triples that are entailed by the symmetry of owl:inverseOf.  That could be easily done with this query:

CONSTRUCT {?Property1 owl:inverseOf ?Property2.}
WHERE {
 ?Property2 owl:inverseOf ?Property1.
  MINUS {?Property1 owl:inverseOf ?Property2.}
}

followed by running the modified query we created above.

An alternative to materializing the triples entailed by the symmetry of owl:inverseOf would be to run a second query:

CONSTRUCT {?Resource1 ?Property2 ?Resource2.}
WHERE {
 ?Property2 owl:inverseOf ?Property1.
 ?Resource2 ?Property1 ?Resource1.
  MINUS {?Resource1 ?Property2 ?Resource2.}
}

similar to the earlier one we created, but with the subject and object positions of ?Property1 and ?Property2 reversed.  It might be even better to combine both forms into a single query:

CONSTRUCT {?Resource1 ?Property2 ?Resource2.}
WHERE {
   {?Property1 owl:inverseOf ?Property2.}
   UNION
   {?Property2 owl:inverseOf ?Property1.}
 ?Resource2 ?Property1 ?Resource1.
  MINUS {?Resource1 ?Property2 ?Resource2.}
}
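
For readers who think more easily in a programming language than in graph patterns, here is a minimal Python sketch of the same logic applied to an in-memory set of (subject, predicate, object) tuples.  It is an analogy under simplifying assumptions, not a description of how the endpoint evaluates the query: the function name and the "ex:" identifiers are hypothetical, and the direction in which the toy data declares the dsw:hasOccurrence / dsw:occurrenceOf pair is arbitrary.

```python
INVERSE_OF = "owl:inverseOf"


def materialize_inverses(graph):
    """Return inverse-entailed triples that are not already in graph."""
    # Read each declaration in both directions, since owl:inverseOf is symmetric.
    inverse = {}
    for s, p, o in graph:
        if p == INVERSE_OF:
            inverse.setdefault(s, set()).add(o)
            inverse.setdefault(o, set()).add(s)
    # Swap subject and object and substitute the inverse property.
    entailed = {
        (o, p2, s)
        for s, p1, o in graph
        for p2 in inverse.get(p1, ())
    }
    # The set difference plays the role of MINUS.
    return entailed - graph


toy_graph = {
    ("dsw:hasOccurrence", INVERSE_OF, "dsw:occurrenceOf"),
    ("ex:organism1", "dsw:hasOccurrence", "ex:occurrence1"),
    ("ex:organism2", "dsw:hasOccurrence", "ex:occurrence2"),
    ("ex:occurrence2", "dsw:occurrenceOf", "ex:organism2"),
}
new_triples = materialize_inverses(toy_graph)
```

Only the link that was stated in one direction produces a new triple; the pair that was already stated both ways is filtered out.
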

Hooray! We have created a simple reasoner to materialize triples entailed by owl:inverseOf properties! Flushed with excitement, I rush to send the query to the Heard Library SPARQL endpoint to try it on the Bioimages RDF ...

... only to sit there waiting until the server times out.

What happened? Although this kind of query works on small "practice" datasets, there are too many triples in the Bioimages dataset for the endpoint to complete the task in the time allowed in the system settings (which I think is limited to 60 seconds).

A more modest SPARQL reasoner

After this initial failure, perhaps we should scale back our expectations and figure out what was making the query take too long.

First, it would be good to have a way to determine the scope of what we were asking the endpoint to do. SPARQL provides a way to find out how many times an expression has bound: the COUNT function. It would also be good to make sure that the query is being done only on the triples we care about (the Bioimages data).  Here is a query that will determine the number of triples contained in the graph that is bundled in the file on the Bioimages GitHub repository (release 2014-07-22,

PREFIX foaf: <>
PREFIX ac: <>
PREFIX dwc: <>
PREFIX dwciri: <>
PREFIX dsw: <>
PREFIX Iptc4xmpExt: <>
PREFIX rdfs: <>
PREFIX dc: <>
PREFIX dcterms: <>
PREFIX dcmitype: <>

SELECT (COUNT(?subject) AS ?count)


WHERE {
 ?subject ?predicate ?object.
}

A few notes about this query:
1. The PREFIX declarations include more namespaces than are necessary for any particular query included in this blog post.  I've included all of the ones that I typically use when querying the Bioimages RDF dataset so that it would be easier for you to make up your own queries.  Typically, a PREFIX declaration must be included in any SPARQL query in which that namespace is abbreviated.  However, the Callimachus Web interface for the Heard Library endpoint (the blank box page) will "remember" prefixes once they have been declared in a query, so if you are using the Web interface, you only need to declare them once (per session?).  To be safe, you can just paste this set of PREFIX declarations above into any of the queries I list below.
2. The FROM clause includes a particular graph in the dataset to be queried.  Although I won't show these FROM clauses in the example queries that follow, you should place that set of FROM clauses between the SELECT or DESCRIBE clause and the WHERE clause in the example queries.  The queries may work without doing this, but without the FROM clauses, the queries might pick up random triples that Callimachus has left lying around in the triplestore.
3. The expression (COUNT(?subject) AS ?count) calculates the number of expressions bound to the ?subject variable and then binds that integer value to the ?count variable.

Running this query against the Bioimages dataset produces an integer value of 1488578 and takes 9793 ms to execute.[6]  So performing a query that binds about 1.5 million triples takes Callimachus about 10 seconds.  The time to execute seems to depend primarily on the number of sets of bound variables, as can be seen if you rerun the query without the


clause.  This time it only takes 1721 ms because the images.rdf file contains the lion's share of the triples in the dataset (1211642, to be exact).

Just for fun, you can play with the COUNT function by running a more complicated query designed to list all of the photographers whose work has been contributed to Bioimages, ordered by the number of images contributed:

SELECT ?photographerName (COUNT(?image) AS ?count)
WHERE {
 ?image a dcmitype:StillImage.
 ?image dc:creator ?photographerName.
}
GROUP BY ?photographerName
ORDER BY DESC(?count)

This query also illustrates the use of two more useful SPARQL clauses: GROUP BY and ORDER BY.
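
If grouping and counting in SPARQL is unfamiliar, the same operation on an in-memory list looks like this in Python.  The image identifiers and photographer names are invented for illustration, and sorting from the largest count to the smallest is my assumption about the intended order.

```python
from collections import Counter

# Toy data: (image, photographer name) pairs standing in for dc:creator triples.
IMAGE_CREATORS = [
    ("ex:image1", "Photographer A"),
    ("ex:image2", "Photographer B"),
    ("ex:image3", "Photographer A"),
]


def count_by_photographer(pairs):
    """Group images by photographer and order the groups by image count."""
    counts = Counter(name for _image, name in pairs)
    # most_common() returns (name, count) pairs, largest count first.
    return counts.most_common()


ranking = count_by_photographer(IMAGE_CREATORS)
```
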

Now back to business.  To limit this investigation to a smaller number of triples, I'm going to investigate a set of inverse properties that involves only Organisms and not images.  This query

SELECT (COUNT(?subject) AS ?count)
WHERE {
 ?subject a dwc:Organism.
}

tells us how many Organisms are described in the Bioimages dataset (3316) and changing "a dwc:Organism" to "a dcmitype:StillImage" tells us how many images are described (13597).  There are an average of about 4 images per Organism, with Organisms and images having approximately equal numbers of properties.  So limiting the investigation to Organism properties and eliminating image properties reduces the scope of the query considerably.

I'm going to investigate this part of the Bioimages RDF graph model (which comes from the DSW graph model):
Organism Occurrences are documented by three different classes of things in the Bioimages dataset.  You can discover them using this query. 

SELECT DISTINCT ?class
WHERE {
 ?occurrence a dwc:Occurrence.
 ?occurrence dsw:hasEvidence ?thing.
 ?thing a ?class.
}

In RDF/Turtle (on which SPARQL triple patterns are based), "a" is syntactic sugar for "rdf:type", so this query tells us what types (classes) of things provide evidence for Occurrences of Organisms (called "Tokens" in DSW).  There are three distinct types of things: Organisms that are living specimens (mostly trees in the Vanderbilt arboretum), still images (most of the records), and preserved specimens (a few records of trees documented by specimens in the University of North Carolina [NCU] Herbarium).

In an earlier version of the Bioimages RDF data, I was careful to always link Organisms and Occurrences with both of the relevant inverse properties: dsw:hasOccurrence and dsw:occurrenceOf.  However, in the current version I decided to use only the one property that was the most convenient for the way I generated the RDF: dsw:hasOccurrence.  dsw:occurrenceOf is grayed out in the Bioimages RDF graph model diagram because although triples involving it are entailed, they are not usually materialized in the dataset. [7]  The exception to this is the Occurrences documented by the NCU preserved specimens - the file containing the triples describing them (ncu-all.rdf) was basically hand-constructed as a test and includes dsw:occurrenceOf properties.

We can find out how many times dsw:hasOccurrence was used in the dataset with this query:

SELECT (COUNT(?occurrence) AS ?count)
WHERE {
 ?organism dsw:hasOccurrence ?occurrence.
}

There are 4096 instances. As expected, substituting dsw:occurrenceOf for dsw:hasOccurrence finds the 27 instances used in the descriptions of the Occurrences documented by the 27 NCU specimens.  To test the method of constructing entailed triples using SPARQL CONSTRUCT, we can use the query:

CONSTRUCT {?organism dsw:hasOccurrence ?occurrence.}
WHERE {
 ?occurrence dsw:occurrenceOf ?organism.
}

This query should construct the triples entailed by the 27 triples in the NCU specimen subgraph having dsw:occurrenceOf as their predicate, and it does - in only 603 ms.  The time is short whether the images.rdf triples are added to the queried dataset using


or not.  Even though including the images.rdf triples increases the size of the queried dataset by a factor of about 5, it does not increase the number of triples bound by the triple pattern

?occurrence dsw:occurrenceOf ?organism.

so there is little effect on the speed of execution.  In order to make this query do what we actually want, we need to remove triples that are already included in the dataset: 

CONSTRUCT {?organism dsw:hasOccurrence ?occurrence.}
WHERE {
  ?occurrence dsw:occurrenceOf ?organism.
  MINUS {?organism dsw:hasOccurrence ?occurrence.}
}

Since every link between Organisms and Occurrences should already be documented by a triple containing dsw:hasOccurrence, this query should produce no results.  Surprisingly, it does produce a  single result in 732 ms:

<rdf:Description rdf:about="">
<dsw:hasOccurrence rdf:resource="" />
</rdf:Description>

Upon further investigation, I found that this unexpected triple was caused by my mistyping of "ncu592831" instead of "ncu592813".  So the query correctly materialized the one entailed triple that wasn't already included in the dataset.

Now let's try the more useful query that should materialize the many missing entailed triples that contain dsw:occurrenceOf:

CONSTRUCT {?occurrence dsw:occurrenceOf ?organism.}
WHERE {
  ?organism dsw:hasOccurrence ?occurrence.
  MINUS {?occurrence dsw:occurrenceOf ?organism.}
}

It takes a lot longer for the server to run this query (5477 ms) because it has to construct a lot more triples than in the previous query.  The query does correctly remove the pre-existing 26 "dsw:occurrenceOf" triples from the set of 4096 triples that are entailed by triples containing dsw:hasOccurrence as their predicate. Again, it makes little difference in the execution time whether the many images.rdf triples are included in the whole dataset or not, since none of them are bound in the query.

OK, the endpoint was able to handle reasoning on this scale within the timeout limits.  Let's try something more demanding.

Playing harder with the toys

Earlier, we used the dsw:hasEvidence linking predicate to find out what kinds of things were used to document occurrences.  There are more dsw:hasEvidence links than there are dsw:hasOccurrence links because there are more documentary images in the dataset than there are Occurrences.  We can find out how many dsw:hasEvidence links there are by this query:

SELECT (COUNT(?token) AS ?count)
WHERE {
  ?occurrence dsw:hasEvidence ?token.
}

In 351 ms we find out that there are 13827.  We can check on the number of links that were made in the opposite direction by:

SELECT (COUNT(?token) AS ?count)
WHERE {
  ?token dsw:evidenceFor ?occurrence.
}

and in 350 ms we get the answer: 13624.  The difference between these two results tells us that there are at least 203 unmaterialized, entailed triples having dsw:evidenceFor as their predicate.  Here's the reason: although in the case of images the links between Occurrences and the Tokens that serve as evidence for them are made in both directions, in the case of living specimens in arboreta I didn't explicitly make the link in the direction that used dsw:evidenceFor.  There is no particularly good reason for that choice - as a data producer I was just using the prerogative that Darwin-SW gave me.  So this would be an excellent opportunity to fix that problem using forward-chaining reasoning.  I should be able to construct the missing 203 "dsw:evidenceFor" triples with this query:

CONSTRUCT {?token dsw:evidenceFor ?occurrence.}
WHERE {
  ?occurrence dsw:hasEvidence ?token.
  MINUS {?token dsw:evidenceFor ?occurrence.}
}

Alas, we have pushed our "shiny new toy" (the Heard Library SPARQL endpoint) too far.  After 60442 ms, the endpoint times out and we get nothing.  There were just too many bindings for the endpoint to complete the processing in the allowed time.
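A note on the "at least" in the count comparison above: subtracting one count from the other only gives a lower bound on the number of missing triples, because a link stated only in the dsw:evidenceFor direction would cancel out one stated only in the dsw:hasEvidence direction.  The following Python sketch uses invented identifiers (occ1, tok1, and so on) purely to illustrate that point.

```python
# Counts reported by the two SELECT queries in the text.
has_evidence_count = 13827
evidence_for_count = 13624
lower_bound = has_evidence_count - evidence_for_count
print(lower_bound)  # 203

# Hypothetical toy data: (occurrence, token) pairs linked by dsw:hasEvidence ...
forward = {("occ1", "tok1"), ("occ2", "tok2"), ("occ3", "tok3")}
# ... and (token, occurrence) pairs linked by dsw:evidenceFor.
backward = {("tok1", "occ1"), ("tok9", "occ9")}

count_difference = len(forward) - len(backward)
# dsw:evidenceFor links that are entailed by the forward links but not stated.
missing_backward = {(tok, occ) for (occ, tok) in forward} - backward
print(count_difference, len(missing_backward))
```

Here the counts differ by only one, yet two dsw:evidenceFor links are actually missing, so the count difference understates the true number whenever some links exist in only the opposite direction.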

Some thoughts about time needed to reason...

Using my homemade SPARQL reasoner to materialize missing triples entailed by use of inverse properties worked fine when the number of links documented by the inverse properties was not too big.  It was able to generate about 4000 entailed triples while checking them against about another 4000 existing triples that expressed the link using the inverse property, and it did so in about 5.5 seconds.  However, when that number was increased by a factor of about 3.5 (to about 14000 triples), the endpoint timed out after 60 seconds.  Clearly the relationship between the number of triples involved and the time it takes to execute is not linear.  This makes sense if one considers what the MINUS operator does.  The MINUS operator compares the set of variable bindings that satisfy one graph pattern with the set that satisfies a second graph pattern.  It then removes from the first set any bindings that also occur in the second.  If each binding from the first pattern has to be checked against each binding from the second, and each comparison takes the same amount of time, the total time should be proportional to the square of the number of bindings (assuming that the two patterns produce similar numbers of bindings).  Based on this seat-of-the-pants guesstimate, increasing the number of triples from 4000 to 14000 would increase the time required from about 5.5 seconds to 5.5*3.5^2 seconds, or about 67 seconds, which would push it just beyond the 60 second timeout of the endpoint.  I haven't attempted to test this by having the timeout setting lengthened.
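The guesstimate above can be written out as a small Python sketch.  It assumes, as the text does, that MINUS behaves like a nested loop that compares every binding with every other; how Callimachus actually evaluates MINUS is not known here, and the function names are invented for the illustration.  A hash-based version is included only to show that the quadratic cost belongs to the naive strategy rather than to set subtraction itself.

```python
def quadratic_estimate(base_seconds, scale_factor):
    """Scale a measured time by the square of the growth in bindings."""
    return base_seconds * scale_factor ** 2


def minus_nested_loop(left, right):
    """Naive MINUS: test every left binding against every right binding."""
    comparisons = 0
    kept = []
    for binding in left:
        found = False
        for other in right:
            comparisons += 1
            if binding == other:
                found = True
        if not found:
            kept.append(binding)
    return kept, comparisons


def minus_hashed(left, right):
    """Same result using a hash set, with roughly one lookup per left binding."""
    right_set = set(right)
    return [binding for binding in left if binding not in right_set]


estimate = quadratic_estimate(5.5, 3.5)
print(round(estimate, 1))  # 67.4

left = list(range(400))        # stand-ins for bindings of the first pattern
right = list(range(10, 400))   # stand-ins for bindings of the second pattern
kept, comparisons = minus_nested_loop(left, right)
print(kept == minus_hashed(left, right), comparisons)
```

The nested loop performs len(left) * len(right) comparisons, so multiplying both inputs by 3.5 multiplies the work by about 12, which is the factor behind the 67 second estimate.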

The point is that the time required to carry out reasoning on a graph that contains 1.5 million triples is not negligible.  Reasoner performance could be improved by using a better computer or a more efficient algorithm than one based on Callimachus SPARQL queries; that is, use software actually designed to conduct reasoning of this sort.  But even with these kinds of improvements, reasoning on a dataset with a more realistic size could take a long time.  I spent some time Googling to find some examples where "real" reasoners were used on "real" datasets to try to get a feel for the time that would be required to reason over large assertional datasets (i.e. instance data) as opposed to large terminological datasets (complex ontologies like the OBO biomedical ontologies).  After wasting a lot of time, I ended up falling back on the Hogan et al. (2009) paper [8], which I've brought up in previous blog posts.  In their study, they collected about a billion (i.e. 10^9) triples from "the wild" by scraping the Web.  This is about 3 orders of magnitude larger than the Bioimages dataset.  They applied a restricted set of entailment rules (see their section 3.3 and following sections) to the data (see their section 3.5) and let their algorithm rip.  When reasoning was done on the billion triples using a single computer (see their section 3.6), they materialized about a billion new entailed triples in 19.5 hours.  Using 1 master and 8 slave computers, they were able to compute the same entailed triples in 3.5 hours.

The Hogan et al. triples were fairly "dirty" since they were harvested from all over the web and the reasoning process they applied was much more complex than reasoning only the triples entailed by inverse property declarations.  So simpler reasoning on cleaner data might take less time. Nevertheless, it clearly takes a while when reasoning has to be applied to many triples.  I tried to imagine the size of the dataset that would result if all of the Occurrence records in GBIF were described using RDF.  As of today (2015-07-26), GBIF has 566 589 120 Occurrence records.  It takes about 100 triples to describe each instance of a resource (image or Organism) in the Bioimages RDF dataset.  At that rate, it would require about 50 000 000 000 (5*10^10 or 50 billion) triples to express the GBIF Occurrence dataset as RDF.  That's about 50 times larger than the Hogan et al. dataset.  If the processing time were linearly related to number of triples and the nature of the reasoning were similar to that performed by Hogan et al., it could take about 1000 hours (about 6 weeks) for a single computer to reason over the GBIF dataset, or about 175 hours (about a week) for the Hogan et al. master/slave reasoner configuration.  If the time required were worse than linear (as it appears with my homemade SPARQL reasoner), then the time could be much longer.  If the reasoning process were much simpler because fewer entailment rules were used, then the time could be a lot shorter.
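The extrapolation can be worked through explicitly.  This Python sketch assumes linear scaling and uses only the figures quoted above (the GBIF record count, roughly 100 triples per record, and the 19.5 hour and 3.5 hour timings reported for about a billion triples); the variable names are made up for the illustration.

```python
gbif_occurrences = 566_589_120   # GBIF Occurrence records on 2015-07-26 (from the text)
triples_per_record = 100         # rough Bioimages figure (from the text)
hogan_triples = 1e9              # "about a billion" triples (from the text)

gbif_triples = gbif_occurrences * triples_per_record
scale = gbif_triples / hogan_triples

# Linear extrapolation of the two timings quoted from Hogan et al.
single_machine_hours = 19.5 * scale
distributed_hours = 3.5 * scale

print(f"{gbif_triples:.2e} triples, {scale:.0f} times the Hogan et al. dataset")
print(f"single computer: {single_machine_hours:.0f} h, about {single_machine_hours / 168:.1f} weeks")
print(f"master/slave:    {distributed_hours:.0f} h, about {distributed_hours / 168:.1f} weeks")
```

Without rounding, the scale factor is closer to 57 than 50, which gives roughly 1100 and 200 hours.  Rounding the factor down to 50 gives 975 and 175 hours.  Either way the answer is on the order of weeks for one computer and about a week for the distributed setup, provided the linear assumption holds.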

Another major consideration is whether the reasoning needs to be applied repeatedly to the entire dataset, or only once to new data when it is added to the dataset.  When forward chaining is used by a reasoner, the entailed materialized triples only have to be computed once for a fixed dataset.  However, if triples are added to the original dataset, the entailed triples may need to be recomputed over the enlarged dataset.  This isn't necessarily the case with the inverse property reasoning I have been playing with.  The unexpressed, entailed triples could be calculated one time using only the graph containing the new data, and the new data and the new reasoned triples could be added to the existing dataset.  This method would introduce a few duplicate triples if a new resource described by the incoming dataset were linked to a resource in the existing dataset.  But that might be a small cost compared to the cost of re-running the reasoner over the entire existing dataset if the dataset were large.
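The incremental approach described above can be sketched in Python.  This is a toy model, assuming triples are (subject, predicate, object) tuples held in sets and that only one inverse pair matters; the names INVERSE, inverse_closure and incremental_update and the "ex:" identifiers are invented for the example.

```python
# Assumed inverse-property pair, recorded in both directions.
INVERSE = {
    "dsw:hasOccurrence": "dsw:occurrenceOf",
    "dsw:occurrenceOf": "dsw:hasOccurrence",
}


def inverse_closure(triples):
    """Return the triples plus every triple entailed by the inverse pairs."""
    closed = set(triples)
    for s, p, o in triples:
        if p in INVERSE:
            closed.add((o, INVERSE[p], s))
    return closed


def incremental_update(existing_closed, new_triples):
    """Reason over the new graph only, then merge it into the existing data."""
    return existing_closed | inverse_closure(new_triples)


old = {("ex:org1", "dsw:hasOccurrence", "ex:occ1")}
new = {
    ("ex:occ2", "dsw:occurrenceOf", "ex:org1"),   # links to an existing resource
    ("ex:org1", "dsw:hasOccurrence", "ex:occ1"),  # repeats a triple already held
}

existing = inverse_closure(old)
merged = incremental_update(existing, new)
print(merged == inverse_closure(old | new))
```

For this kind of rule, where each entailed triple depends on exactly one asserted triple, reasoning over the new graph alone gives the same result as recomputing over everything.  A Python set silently absorbs the repeated triple; how a particular triple store handles such repeats is outside what this sketch shows.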

What does this mean about Darwin-SW?

Although defining all object properties as inverse pairs allows a dataset producer to make either linked resource the subject of a connecting triple, that choice comes at a cost.  If we assume that it is unrealistic to expect users who are conducting queries to make their queries complicated enough to catch all of the possible ways that producers may express the linkages between resources using the inverse property pairs, then we must either expect data producers to express the links in both directions, or expect data aggregators to perform reasoning on the datasets they aggregate.  Since it's probably futile to believe that data producers can be forced to do anything in the Open World of RDF, assuming that data aggregators will always perform reasoning is essentially placing a bet on the Semantic Web.  If the Semantic Web materializes and reasoning on very large (10^10 triple) datasets becomes practical, then defining pairs of inverse object properties might make sense.  On the other hand, in a world where Linked Data prevails but the Semantic Web is rejected as too complicated, defining pairs of inverse object properties is probably a bad idea and Darwin-SW (in its current form) is probably doomed to the dustbin of history.  What a happy thought!

With respect to the "What can we learn using RDF?" problems introduced in the beginning of this post, reasoning entailed inverse relationships doesn't really result in "interesting" inferences nor does it enable the construction of more clever or meaningful queries.  Another strike against pairs of inverse properties. :-(

So should Cam and I dump one of each of the pairs of inverse properties from Darwin-SW?  How fast can the entailed inverse relationships in the Bioimages dataset be materialized on a real reasoner?  Please comment if you think you have answers.


[1] In this post, I use the term "object property" to mean a property that is used as a predicate to link a resource to an IRI-identified or blank node object, regardless of whether that property term has been declared to be an owl:ObjectProperty.  This is to distinguish such properties from properties whose object is intended to be a literal.
[2] I don't intend to discount the work of groups who have developed interesting systems that make use of semantic technologies (Filtered Push comes to mind.)  But I don't feel like a rigorous discussion of use cases and how we might satisfy them has happened in the TDWG-wide arena.  (TDWG=Biodiversity Information Standards).  It certainly hasn't happened in the context of the TDWG RDF/OWL Task Group.
[3] See "JSON-LD and Why I Hate the Semantic Web" for an interesting pro-Linked Data, anti-Semantic Web diatribe.
[4] See section 3 of our submitted paper "Darwin-SW: Darwin Core-based terms for expressing biodiversity data as RDF" for examples.
[5] SPARQL 1.1 provides an alternative method of negation besides MINUS: the NOT EXISTS filter.  The differences between MINUS and FILTER NOT EXISTS are subtle - refer to for more details.  In the examples given here, I think the two produce equivalent results, and the response time is approximately the same.  So the examples I give will all use MINUS.
[6] In these examples, I just pasted in the time I got for one particular trial.  If repeated, the response times can vary by hundreds of milliseconds, but I'm too lazy to repeat the trials and report an average.
[7] To see why this matters, try these two queries:

SELECT ?locality
WHERE {
  ?tree dwc:organismName "Bicentennial Oak".
  ?tree dsw:hasOccurrence ?occurrence.
  ?occurrence dsw:atEvent ?event.
  ?event dsw:locatedAt ?location.
  ?location dwc:locality ?locality.
}

SELECT ?locality
WHERE {
  ?tree dwc:organismName "Bicentennial Oak".
  ?occurrence dsw:occurrenceOf ?tree.
  ?occurrence dsw:atEvent ?event.
  ?event dsw:locatedAt ?location.
  ?location dwc:locality ?locality.
}

The first query produces a result. The second query produces nothing, even though

?occurrence dsw:occurrenceOf ?tree.

is a triple pattern consistent with Darwin-SW (see the Bioimages or Darwin-SW RDF data model diagram).  The problem is that the triple that would satisfy that triple pattern is entailed, but not materialized in the current release of the Bioimages RDF dataset.

[8] Aidan Hogan, Andreas Harth and Axel Polleres.  Scalable Authoritative OWL Reasoning for the Web.  International Journal on Semantic Web and Information Systems, 5(2), pages 49-90, April-June 2009.