Thursday, February 25, 2016

Stress testing the Stardog reasoner



If you've been following my recent blog posts, you will know that I've been enjoying testing out the Stardog graph database software.  I've been using it to construct SPARQL queries on Linked Data from the wild and to try out Stardog's built-in reasoner to generate triples that are entailed based on the semantics of the FOAF vocabulary, but not asserted in my test dataset (found here and here).  This has been fun, but these tests have been done on what's really a "toy" dataset that contains only about 700 triples.  

In an earlier blog post, I reported on my experimentation using SPARQL CONSTRUCT queries with the Callimachus graph database to generate entailed but unasserted inverse relationships present in the Bioimages RDF dataset.  That dataset contains approximately 1.5M triples of real data associated with about 14k images of about 3.3k organisms representing 1.2k named taxa. The current version of the dataset can be downloaded as a zip file from the Bioimages GitHub repo. (The experiments here used the 2016-02-25 release: http://dx.doi.org/10.5281/zenodo.46574.) In those experiments, I found that materializing the unasserted triples using SPARQL was possible, but that in certain cases, the queries took so long that they timed out after the limit of one minute.  The problem seemed to be associated with use of the MINUS keyword, which I was using to prevent the query from returning triples that were already present in the dataset.  This was a bit disappointing because although 1.5 million triples sounds like a lot, it is not uncommon for triplestores to contain several orders of magnitude more triples than that.  So it was a failure on what was not a particularly rigorous test.
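
To give the flavor of those queries, here is a sketch of the kind of CONSTRUCT I used (prefixes assumed, as in the other queries in this post; dsw:locatedAt and dsw:locates are an inverse pair from the Darwin-SW ontology discussed below).  The MINUS clause is what excluded the already-asserted triples, and it seemed to be the expensive part:

CONSTRUCT { ?location dsw:locates ?event. }
WHERE {
  ?event dsw:locatedAt ?location.
  MINUS { ?location dsw:locates ?event. }
}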

In this blog post, I'm going to report on my efforts to expand on some of the tests that I conducted earlier using the Bioimages dataset.  This time, rather than using SPARQL CONSTRUCT to materialize the entailed triples and then add them back into the database ahead of querying, I'm going to let Stardog reason the necessary triples on the fly at the time of querying.  See this for Stardog's explanation of their reasoning approach. 

Speed: Stardog vs. Callimachus

Loading triples

The first thing that I was curious to know was how long it would take to load the 1.5M triples into the Stardog database.  Currently, I re-upload the whole dataset to the Vanderbilt Heard Library triplestore each time I create a new Bioimages release.  That triplestore is a Callimachus installation and although I don't think I've ever timed how long it takes to do the upload, it's long enough that I leave my computer and go to make coffee - many minutes.  So I was surprised at how fast Stardog was able to parse the RDF/XML from the source files and load them into the triplestore.  There are several files that are very small and they loaded almost instantaneously.  The biggest file (images.rdf) is 109 MB uncompressed and contains the lion's share of the triples: 1 243 969.  Stardog loaded it in 14 seconds.  I decided to do a local install of Callimachus on the same computer so that I could compare.  The load time for Callimachus: 7 minutes and 54 seconds!  The second largest file (organisms.rdf) is 25 MB uncompressed and contains 279 429 triples.  Stardog loaded it in 6 seconds; Callimachus in 1 minute and 33 seconds.  Wow.

Counting asserted triples

The second test I ran was to see how long it took each endpoint to report a count of the number of triples in the graph.  To find this using Stardog, I used the query:

SELECT (COUNT(?subject) AS ?count)
WHERE {
 ?subject ?predicate ?object.
}


which will report the number of triples in the default graph (when I loaded the triples, I did not specify any named graph, so they all got thrown into the default graph pot).  I got the result, 1 533 245 triples, in less than a second.  My reaction time was too slow to time it with the stopwatch on my phone.  When I ran the same query in Callimachus, I got the result, 1 822 884 triples, in 22 seconds.  The reason this number is nearly 290 000 triples bigger than the Stardog number is that the default Callimachus graph includes a bunch of Callimachus-specific triples that I didn't put there.  You can see this using the query

SELECT DISTINCT ?class
WHERE {
 ?subject a ?class.
}


which produces a whole bunch of classes I don't recognize, like "callimachus/1.4/types/Schematron".  Running the same query on Stardog returns only classes that I recognize because they all came from RDF that I uploaded.  If I want to restrict the Callimachus triple count to triples that I uploaded, I have to use named graphs.
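
(If I just wanted a quick-and-dirty way to hide those classes without resorting to named graphs, I could filter on the class URI, something like this sketch.  It simply string-matches the URI, so it would also hide any legitimate class whose URI happened to contain "callimachus":

SELECT DISTINCT ?class
WHERE {
 ?subject a ?class.
 FILTER(!CONTAINS(STR(?class), "callimachus"))
}
)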

There is a fundamental difference in the way that named graphs are specified between the two platforms (at least the way I understand it).  In Callimachus, uploaded triples are always assigned to a named graph that includes the name of the file that held those triples.  For example, the set of triples about images are assigned to the named graph:

<http://localhost:8080/bioimages/images.rdf>

If you run a SPARQL query on Callimachus and don't specify a graph using the FROM keyword, Callimachus runs the query on all of the triples it has, including those in named graphs.

Stardog assigns a URI to its default graph:

tag:stardog:api:context:default

and if no other graph name is specified at the time the triples are uploaded, they are assigned to this graph.  This is the graph on which SPARQL queries are run in the absence of the FROM keyword.  In Stardog, you can assign any URI as the graph name when a file is loaded - there need be no relationship between that URI and the file name.  However, if a graph name is specified, Stardog will NOT include those triples in SPARQL queries unless you explicitly say that they should be included using the FROM keyword.  This is very different behavior from Callimachus.
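
So, for example, if I had loaded images.rdf into Stardog under a made-up graph name, I would have to name that graph explicitly to count its triples:

SELECT (COUNT(?subject) AS ?count)
FROM <http://example.org/bioimages/images>
WHERE {
 ?subject ?predicate ?object.
}

(The graph URI here is hypothetical - in Stardog it could be any URI I chose at load time.)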

Given these differences, in order to count the triples that I uploaded and exclude the Callimachus-generated triples, I need to use this query:

SELECT (COUNT(?subject) AS ?count)

FROM <http://localhost:8080/bioimages/baskauf.rdf>
FROM <http://localhost:8080/bioimages/geonames.rdf>
FROM <http://localhost:8080/bioimages/images.rdf>
FROM <http://localhost:8080/bioimages/kirchoff.rdf>
FROM <http://localhost:8080/bioimages/local.rdf>
FROM <http://localhost:8080/bioimages/ncu-all.rdf>
FROM <http://localhost:8080/bioimages/organisms.rdf>
FROM <http://localhost:8080/bioimages/stdview.rdf>
FROM <http://localhost:8080/bioimages/thomas.rdf>
FROM <http://localhost:8080/bioimages/uri.rdf>

WHERE {
 ?subject ?predicate ?object.
}


in which I specify the graph that corresponds to each of the files I uploaded.  If I run this query, I get the answer, 1 533 272 triples, in about 6 seconds.  This is quite a bit faster than the time it took when I didn't specify the graphs to be included, but it is still much slower than the corresponding Stardog query.  I'm also not sure why there are 27 more triples in the Callimachus count than in Stardog's count, and I'm too lazy to figure it out.

Counting asserted and entailed triples

As I said in the beginning of this post, one of the things I'm really interested in knowing is how much Stardog gets slowed down when I ask it to reason.  In the experiments in this post, I'm going to reason based on the Darwin-SW ontology (DSW; version 0.4).  To accomplish this, I uploaded the Darwin-SW triples (from the file dsw.owl, which is included in the bioimages-rdf.zip file) to Stardog as a named graph, then specified that graph as the Tbox to be used for reasoning.  See the "Learning from the FOAF vocabulary" section of my last blog post for more details on Tbox reasoning.

The Bioimages graph model is based on DSW and heavily uses its predicates to link resources.  Although DSW is a fairly lightweight ontology, it does declare a number of property inverses, and ranges and domains for many of its object properties.  So we can expect that the Bioimages dataset entails triples that are not asserted.  To attempt to determine how many of these triples there are, I flipped the Reasoning switch to "ON", then re-ran the first counting query above.  In contrast to the nearly instantaneous counting that happened when the Reasoning switch was off, it took about 4.5 seconds to compute the results with reasoning.  The total number of triples counted was 1 765 911, an increase of 232 666 triples, or about 15% over the asserted count.  It's not possible to compare this to Callimachus because it doesn't have reasoning built in. I'm not sure exactly what all of these extra triples are, but read on for some examples of types of triples Stardog can reason based on DSW.

Some queries that make use of reasoned triples

Inverse properties

Partly out of convenience and partly out of obstinance, in the Bioimages RDF I did not link resources of different classes using inverse predicates in both possible directions.  For example, the dataset includes the triple

<http://bioimages.vanderbilt.edu/ind-andersonwb/wa310#2003eve> dsw:locatedAt 
              <http://bioimages.vanderbilt.edu/ind-andersonwb/wa310#2003loc>.

but not the inverse

<http://bioimages.vanderbilt.edu/ind-andersonwb/wa310#2003loc> dsw:locates 
              <http://bioimages.vanderbilt.edu/ind-andersonwb/wa310#2003eve>.

The diagram above shows the predicates that I usually used in yellow, and the ones I did not typically use in gray.  I could have fairly easily generated the inverse triples, but it was more convenient not to, and I wanted to see how annoying it would be for me to use the RDF without them.

If I wanted to know in which states Darel Hess has recorded occurrences, I could run the query:

SELECT DISTINCT ?state
WHERE {
  ?occurrence dwc:recordedBy "Darel Hess".
  ?occurrence dsw:atEvent ?event.
  ?event dsw:locatedAt ?location.
  ?location dwc:stateProvince ?state.
}


and it would produce the results:

California
Tennessee
Virginia
Kentucky
North Carolina


However, if I ran this query using the inverse linking properties with reasoning off:

SELECT DISTINCT ?state
WHERE {
  ?occurrence dwc:recordedBy "Darel Hess".
  ?event dsw:eventOf ?occurrence.
  ?location dsw:locates ?event.
  ?location dwc:stateProvince ?state.
}


I would get no results because even though those inverse links are entailed by these statements in the DSW ontology:

dsw:locatedAt owl:inverseOf dsw:locates.
dsw:atEvent owl:inverseOf dsw:eventOf. 

they weren't actually asserted in the RDF I wrote.  If I turn on the Reasoning switch and run the query again, I get the same five states as I did above.  The amount of additional time required to run this query is too short for me to measure, which is nice.  So in this case, my decision to omit the inverse triples when I exposed the RDF didn't result in any annoying problems for me at all, as long as I used Stardog to run the query and flipped the Reasoning switch to ON.

Types entailed by range declarations

What kind of thing is http://bioimages.vanderbilt.edu/baskauf/33423 ?  I can find out by running this query:

SELECT DISTINCT ?class
WHERE {
 <http://bioimages.vanderbilt.edu/baskauf/33423> a ?class.
}


(where "a" is semantic sugar for rdf:type).  Running the query with reasoning turned off produces one result, the class dcmitype:StillImage .  Running the query with reasoning turned on produces three results:

dcmitype:StillImage
owl:Thing
dsw:Token


The first additional class is no surprise; every resource is reasoned to be an instance of owl:Thing.  dsw:Token is less obvious.

In DSW, a Token is any kind of thing that serves as evidence.  The DSW ontology declares:

dsw:hasEvidence rdfs:range dsw:Token.

RDFS includes an entailment rule for rdfs:range: when

<p> rdfs:range <class>.

and

<s> <p> <o>.

then

<o> rdf:type <class>.
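
Expressed as a generic SPARQL CONSTRUCT, the same rule would look something like this (a sketch; it assumes the ontology triples are queryable alongside the data):

CONSTRUCT { ?o a ?class. }
WHERE {
  ?p rdfs:range ?class.
  ?s ?p ?o.
}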

The Bioimages RDF includes this triple:

<http://bioimages.vanderbilt.edu/ind-baskauf/33419#2004-04-29> dsw:hasEvidence <http://bioimages.vanderbilt.edu/baskauf/33423>.

so by the rule,

<http://bioimages.vanderbilt.edu/baskauf/33423> rdf:type dsw:Token.

In this example, the additional time required to reason the other two classes was negligible.
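
With the Reasoning switch on, the entailment can also be checked directly with an ASK query, which should return true:

ASK {
 <http://bioimages.vanderbilt.edu/baskauf/33423> a dsw:Token.
}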

Breaking Stardog

In the previous two examples, I let Stardog off pretty easy.  It didn't have to work with very many triples, because the first triple pattern specified a particular resource in the subject or object position and therefore created few bindings to the pattern's variables.  So I decided to try a more demanding query.

Here is a query that asks what kinds of things serve as evidence for occurrences that take place in the United States:

SELECT DISTINCT ?class
WHERE {
  ?occurrence dsw:hasEvidence ?token.
  ?token a ?class.

  ?occurrence dsw:atEvent ?event.
  ?event dsw:locatedAt ?location.
  ?location dwc:countryCode "US".
}


In this query, there are many more bindings to variables in the first triple pattern.  Each of the 13 958 images in the database serves as evidence for some occurrence.  Since most of the images in the database are from the United States, there are many values of ?token that satisfy the graph pattern.  But most of them are instances of the same class (dcmitype:StillImage), with only a few instances of specimens.  With no reasoning, Stardog comes up with results in an instant:

dcmitype:StillImage
dwc:PreservedSpecimen


However, with reasoning turned on, the query takes a very long time.  In fact, the first time I ran it, nothing happened for so long that I decided to go do something else.  When I came back, I had gotten a timeout error.

Later, I tried running the query again and got results after 12 minutes.  I'm not sure why it didn't time out this time - perhaps some of the earlier work had been cached in some way?  However, the results were clearly wrong, because Stardog reported zero results.  The results should have included the two classes from the previous run, plus at least owl:Thing and dsw:Token, as we saw in the earlier example based on a single resource.  

So the question on my mind was: why does it take so long for Stardog to perform reasoning in this case?  I don't really understand the mechanism by which Stardog performs reasoning, so I'm just guessing here.  The number of bindings to ?token will be very high, since all 13 958 images in the database will fall into that category.  Images are the most numerous kind of resource in the database, and each image has many triples describing it.  In fact, 645 350 of the triples in the database (42% of all triples) have an image as their subject, while 73 174 have images as their object.  It might take a lot of effort on Stardog's part to reason whether predicate use in those triples entails unasserted types based on range and domain declarations for the predicates.  In contrast, the reasoning task in the "Inverse properties" example was much less intensive, since it had to determine entailed inverse dsw:eventOf triples based on only 3 972 asserted dsw:atEvent triples.
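
For the record, a count like the "images as subject" number above can be gotten with a query along these lines (the object count is analogous, with ?image moved to the object position):

SELECT (COUNT(?predicate) AS ?count)
WHERE {
  ?image a dcmitype:StillImage.
  ?image ?predicate ?object.
}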

One possible course of action that I haven't investigated is trying to improve performance by changing settings in the Admin Console: increasing index sizes, lengthening the timeout, enabling approximate reasoning, or making other changes that I don't currently understand.

(added after initial publication)

2016-02-25:
Twitter thread with Kendall Clark (of Stardog, thanks for taking the time to tweet!):

KC: the ?token a ?class isn't legal under SPARQL OWL reasoning rules. But we try to do it anyway and it's an area we need to improve.

me: Thanks, I suppose I'm misunderstanding the process. I assumed: reason entailed classes -> query for matches to triple pattern.

KC: There are restrictions to insure things like decidability and reasonable performance, etc. We try to work around them if possible.

me: OK, need to learn more, I suppose about OWL profiles. Was assuming reasoning on domain, range, & equivalent class were all "normal"

KC: I suspect we could help you work around this if we understood better what you were trying to achieve. :>

me: At this point, entirely exploratory & no agenda. But "what kinds of things..." (?s a ?o) questions seem likely to be of interest

2016-02-26: From the thread of the error report I submitted:
We looked at the performance problems with this query and the problem
seems to be the query optimizer not handling {?token a ?class} pattern
efficiently. We created a ticket (#2835) for this issue.

For the patterns that retrieve types of an individual, Stardog will
execute additional queries behind the scenes. If these subqueries time
out a warning is printed in the log but the end user would receive
incomplete results and no warnings. This behavior is not ideal and
we'll make improvements here.

Investigation of reasoning time and type inferencing

In an attempt to investigate what affects the time required to reason entailed types, I created this query:

SELECT DISTINCT ?class
WHERE {
  ?token a dwc:PreservedSpecimen.
#  ?token a dwc:LivingSpecimen.
#  ?token a dcmitype:StillImage.
  ?token a ?class.
}

There are three classes of resources in the Bioimages dataset that serve as evidence for occurrences: images, preserved specimens, and living specimens.  In the query, the first triple pattern requires that a Token be an instance of one of the three classes; I commented out the patterns for the other two classes to make it easy to switch the query between classes.  The last triple pattern asks what other classes the Tokens are instances of.  With reasoning turned off, the query produces an answer faster than I can time it.  With reasoning turned on, the time to complete the query varies greatly.

Here is the scope of each type of resource (number of instances, number of triples in which an instance is the subject, and number of triples in which an instance is the object) and the approximate amount of time that it takes to execute the corresponding query with reasoning [1]:

class                  instances   subject   object  seconds
--------------------------------------------------------------
dwc:PreservedSpecimen      27          297       27      0.4
dwc:LivingSpecimen        227        6 675    3 279      2
dcmitype:StillImage    13 958      645 350   73 174    137

The times aren't very accurate, but they show that the reasoning time is related in some way to the scope of the class.  The time per instance is fairly consistent at roughly 0.01 s/instance (0.4/27 ≈ 0.015, 2/227 ≈ 0.009, 137/13 958 ≈ 0.010), so the time may just be directly related to the number of instances whose types Stardog must reason.  One thing that is clear is that carrying out this kind of reasoning would not be practical in datasets that describe millions or billions of resources.

Reconsidering defining pairs of inverse properties

In a blog post last year, I questioned our decision to define many pairs of inverse properties in DSW.  To recap briefly, the arguments for that decision were:
  • when asserting triples in serialized form, it's more convenient and less verbose to be able to choose the direction of the link.
  • being able to express the link in either direction avoids making particular types of resources the "center of the universe" when vocabulary users are required to place those resources in the subject position of triples.  
The argument against was:
  • if we assume that reasoning will be uncommon (i.e. we believe in Linked Data, but not necessarily in the Semantic Web), either data providers will have to include triples expressing the link in both directions, or data consumers will have to construct more complex queries (e.g. with UNION, as sketched below) to catch links expressed by producers in either direction.
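
For the Darel Hess example above, that reasoning-free consumer-side query would look something like this sketch, using UNION to catch links asserted in either direction:

SELECT DISTINCT ?state
WHERE {
  ?occurrence dwc:recordedBy "Darel Hess".
  { ?occurrence dsw:atEvent ?event. } UNION { ?event dsw:eventOf ?occurrence. }
  { ?event dsw:locatedAt ?location. } UNION { ?location dsw:locates ?event. }
  ?location dwc:stateProvince ?state.
}
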
So the decision about whether defining pairs of inverse properties was a stupid idea or not hinges primarily on how easily and reliably entailed but unasserted inverse triples can be reasoned.  When I started this exercise I was leaning towards the "it's a good idea" side.  All I had to do was load the DSW ontology into the Tbox, flip the Reasoning switch to "ON", and I could construct queries whose graph patterns needed the unasserted inverse triples.  However, after Stardog choked on a not-very-complicated query, I started thinking (again) that maybe it was a bad idea to depend on the SPARQL endpoint to carry out reasoning in order for queriers to be able to set up triple patterns in either direction.

Part of the problem is that I can't tell Stardog to "carry out all of the reasoning that doesn't cause you to time out".  For example, maybe I only care about knowing the explicitly asserted types of things that serve as evidence for occurrences in the US, and don't care about the entailed, unasserted types.  I could run the query from above with reasoning turned off:

SELECT DISTINCT ?class
WHERE {
  ?occurrence dsw:hasEvidence ?token.
  ?token a ?class.
  ?occurrence dsw:atEvent ?event.
  ?event dsw:locatedAt ?location.
  ?location dwc:countryCode "US".
}


But I couldn't run the query using the inverse properties in the graph pattern:

SELECT DISTINCT ?class
WHERE {
  ?occurrence dsw:hasEvidence ?token.
  ?token a ?class.
  ?event dsw:eventOf ?occurrence.
  ?location dsw:locates ?event.

  ?location dwc:countryCode "US".
}


with reasoning turned off because the dsw:eventOf and dsw:locates triples have to be reasoned.  Stardog won't just reason the entailed inverse triples required to satisfy the graph pattern without also reasoning the problematic entailed rdf:type relationships that were causing the query to take too long. 

It's also disturbing to me that when Stardog succeeded in finishing the very long query, it (incorrectly) reported no results.  In my opinion, that's worse than timing out and failing to report a result, because unknowingly getting a wrong answer is worse than getting no answer.  After testing only a handful of queries, this is the second "wrong" answer I've gotten from Stardog (see this blog post for a description of the other).  If I'm really going to commit to depending on reasoning in order for my queries to work, then I have to have some confidence that I'm actually going to get the answers that would be correct if the expected types of reasoning were applied correctly and completely.  That's making me think that it's not such a good idea to define pairs of inverse properties in DSW without specifying which property of the pair is preferred.

How PROV-O handles inverse properties

Since I wrote the earlier blog post where I discussed inverse properties, I had occasion to read the
overview document for the W3C PROV Ontology (PROV-O).  It included a section discussing property inverses.  This section notes that
When all inverses are defined for all properties, modelers may choose from two logically equivalent properties when making each assertion. Although the two options may be logically equivalent, developers consuming the assertions may need to exert extra effort to handle both (e.g., by either adding an OWL reasoner or writing code and queries to handle both cases). This extra effort can be reduced by preferring one inverse over another. 
PROV-O then goes on to specify which property of each inverse pair is preferred, but it also reserves names for the non-preferred inverses so that term usage will be consistent among modelers who choose to use them.  The preferred inverse is specified in the following manner.

The definition of the preferred property includes a prov:inverse annotation with the local name of its inverse as a value.  Example:

prov:wasDerivedFrom
   a owl:ObjectProperty;
   rdfs:isDefinedBy <http://www.w3.org/ns/prov#>;
   prov:inverse "hadDerivation".

The definition does not include the inverse property declaration, so a machine that discovered only that definition would not necessarily "know" that there was an inverse property.  The definition of the non-preferred inverse includes the owl:inverseOf declaration, which would ensure that a machine that discovered that definition would "know" that the preferred inverse existed.  For example:

prov:hadDerivation
   rdfs:isDefinedBy <http://www.w3.org/ns/prov#>;
   owl:inverseOf prov:wasDerivedFrom.

This seems like a logical course of action for us to follow with DSW.

How should we decide which inverse is preferred?

I've thought about this question for a while and I think I have an answer.  In the DSW model, classes are sometimes included largely to facilitate one-to-many relationships.  (See what we wrote here for more on that subject.)  Although it would be nice to believe that data linked using DSW would "live" as RDF in a triplestore, realistically it's likely to have its original home in a relational database, or even in CSV files, and then be exported as some serialization of RDF.  Given that reality, it makes sense to pick the preferred inverse based on what would make linking the database tables easiest.  For example, the Event node is included in DSW so that one can link many occurrences to a single observing or collecting event.  So there is a one-to-many relationship between events and occurrences.  If I have an Event table and an Occurrence table, it's easiest to structure them like this:

Event table:

id        dwc:eventDate  rdfs:label
---------------------------------------------------------
<event1>  "2015-06-14"   "14 June 2015 Bioblitz"
<event2>  "1838-08-23"   "Schmidt 1838 collecting event"

Occurrence table:

id             dwc:recordedBy     dsw:atEvent
---------------------------------------------------------
<occurrence1>  "José Shmoe"       <event1>
<occurrence2>  "Courtney Khan"    <event1>
<occurrence3>  "Peleg Smith"      <event2>
<occurrence4>  "Josiah Jones"     <event2>
<occurrence5>  "Hepsabeth Arnold" <event2>

Using a value of dsw:atEvent as a foreign key in the Occurrence table is easier to implement than using the inverse dsw:eventOf as a property of the Event records, since using dsw:eventOf would require one either to figure out how to store several key values in one row or to create a separate join table.  Consequently, in the RDF, one would link like this:

<occurrence4> dsw:atEvent <event2>.

rather than

<event2> dsw:eventOf <occurrence4>.

The rule therefore would be to choose the inverse property such that the subject resource is on the "many" side of the one-to-many relationship.

Conclusions

From this set of experiments, I've reached several conclusions:
  • The speed of Stardog in both loading triples and running queries seems to be far superior to Callimachus.
  • Although it is very cool and simple to have Stardog perform reasoning, the reasoning does not appear to be reliable enough (and in some cases fast enough) for me to commit to depending on it.
  • I'm thinking that, on balance, the benefits of defining pairs of inverse linking properties do not outweigh the costs, and that it would be better to declare a single preferred linking property.
A couple years ago, I wrote a series of blog posts called "Confessions of an RDF agnostic".  In that series, I tried to separate the hype that we commonly hear about RDF and the Semantic Web from what we might legitimately gain from using RDF.  One of the points that I made in that series is that it is not really sufficient to argue that the Semantic Web is awesome because it allows us to do reasoning, unless that reasoning allows us to do something that we couldn't easily do by some other means.

Let's apply that lens to the experiments that I've carried out in this blog post and the previous two (here and here).  It's really cool that I could use owl:sameAs reasoning to link resources that have both ORCID and VIAF identifiers.  But the only reason I have to do that is because we have two competing systems instead of one consensus system.  It's great that I could reason that a resource labeled using a foaf:name property also has an entailed rdfs:label property with the same value.  But I only had to do that because there isn't a consensus that all resources should typically have an rdfs:label property to provide a human-readable label.  It's great that reasoning based on an owl:equivalentClass declaration allowed me to know that a foaf:Person is also a schema:Person.  But I only had to reason that because Schema.org couldn't discipline themselves to reuse a well-known term and felt compelled to mint a new term.  It's nice that owl:inverseOf declarations let me query using a triple pattern based on either of a pair of inverse properties, but the reason I need to do that reasoning is (possibly) poor ontology design that doesn't make it clear to users which property they should use.

In fact, I don't think there is a single reasoning task that I've done so far that hasn't just made up for a lack of standardization and community consensus-building, failure to reuse existing identifiers, or possible ontology design flaws.  I'm not saying that it's impossible to do awesome things with reasoning, but so far I haven't managed to actually do any of them.[2]  Despite my excitement at having some success playing around with reasoning in Stardog, I haven't yet converted myself from RDF Agnostic to Semantic Web True Believer.


--------------------------------------------------------

Notes

[1] If you are interested in knowing the results of the queries:
Preserved specimens resulted in:

 owl:Thing
 dwc:PreservedSpecimen
 dsw:Specimen
 dsw:Token

with owl:Thing and dsw:Token entailed for reasons given previously.  dsw:Specimen is entailed because dwc:PreservedSpecimen rdfs:subClassOf dsw:Specimen.

Living specimens resulted in:

 dwc:LivingSpecimen
 dwc:Organism
 dsw:Specimen
 dcterms:PhysicalResource
 owl:Thing
 dsw:IndividualOrganism
 dsw:Token
 dsw:LivingSpecimen

with dwc:Organism and dcterms:PhysicalResource asserted and dsw:LivingSpecimen and dsw:IndividualOrganism entailed because they are deprecated classes that are equivalent to currently used classes.

Images resulted in:

 owl:Thing
 dsw:Token
 dcmitype:StillImage

with owl:Thing and dsw:Token entailed for reasons given previously. One thing that this makes clear is that the number of entailed classes has little to do with the reasoning time, since the image query took the longest, but had the fewest entailed classes.

[2] Shameless self-promotion: see Baskauf and Webb (in press) for some use cases that RDF might satisfy that might not be easy to accomplish using other technologies. 



Sunday, February 21, 2016

Reasoning on real Linked Data using Stardog

Notes: 
1. To derive any benefit from reading this post, you really should download and install Stardog, load the example files, and try the queries yourself.  See this page for help on getting Stardog to run on a Windows system.  

2. The SPARQL examples in this post use the generally recognized namespace abbreviations for well-known vocabularies.  I assume that you are running the queries on Stardog and have selected those prefixes in the box at the top of the Query Panel so that you don't have to actually type them as part of the query.

3. In the post, when I talk about what happens in interactions between my imaginary client and a server, I'm reporting the responses I got when I dereferenced URIs using the Advanced REST Client plugin for Chrome.


In my previous blog post, I pretended to be a Linked Data client (i.e. software that made use of Linked Data principles), and tried to "discover" information about people and publications by dereferencing ORCID IDs and DOIs while requesting RDF/XML.  I was hampered by two basic problems:

  • although ORCID "knows" the DOIs of the publications made by people and shows them on the person's web page, it does not link to those DOIs in the RDF it exposes.
  • the RDF/XML exposed by CrossRef when DOIs were dereferenced had malformed datatyped date literals.

In a really amazing turnaround, CrossRef fixed the malformed date issue less than an hour after I tweeted about it.  Wow.  There has been no response from ORCID.

In my communication with CrossRef, they said that if a publisher provides an ORCID ID for the author of a publication, they link to it using a dcterms:creator property.  However, none of the articles I looked up had this information, and in the absence of DOI information from ORCID, I was forced to create my own graph of triples linking the authors to their publications.  The file that I created is here.  You can also see most of it in the diagram above.

My graph contains the following information: information about our Semantic Web working group (including its name and some of the members), links to the DOIs of members' publications, and owl:sameAs assertions linking ORCID IDs to VIAF URIs when they exist.  I purposely restricted my use of non-W3C-standard vocabularies to the FOAF vocabulary, specifically the terms foaf:made, foaf:name, foaf:homepage, foaf:member, and foaf:primaryTopic, for two reasons: because FOAF is widely used (and used in the ORCID RDF that I scraped), and because the term definitions in the FOAF vocabulary include triples that generate entailments that would be interesting for reasoning play.


The True Believer's client

I am now going to pretend that I have written a client based on the principles of a Semantic Web "true believer".  By that, I mean that I'm pretending that I've written a computer program that is able to start from ground zero and discover the properties of subject or object resources and "meaning" of predicates using nothing more than the information provided when the URIs of those resources and predicates are dereferenced.  The client does not exactly have a tabula rasa because it has been programmed to "know" about entailments resulting from RDFS and OWL (W3C Recommendations), but it has not been programmed to do processing based on the idiosyncrasies of particular vocabularies or servers.  My imaginary client is also going to expect that servers that it communicates with follow generally recognized Semantic Web best practices for HTTP server/client interactions.  My Semantic Web True Believer's client will do more than a dim-witted Linked Data client because it will conduct reasoning based on the triples that it discovers in its exploration of the Semantic Web.

I will start by pretending that the client has discovered a URI that denotes our Semantic Web working group:

<https://gist.githubusercontent.com/baskaufs/beeaa94606113b970002/raw/df6ec9cbe57290cc2289d2cc37c221e9f494d153/assertions#group>

This URI is based on the first "Cool URI" strategy: hash URIs (without content negotiation).  When my client tries to dereference the working group's URI, the server strips off the part of the URI after the "#" and returns a text document.  Regardless of the Content-Type my client requests from the GitHub server in its HTTP Accept header, it always gets text designated as Content-Type: text/plain because the GitHub server is only set up to return plain text when a raw file is requested.  So my client already has a problem if it expects servers to always correctly tell it the content type of the returned document.  To deal with this document, I'd have to do some programming to allow my client to recognize that the document is actually Content-Type: text/turtle.

OK, let's pretend that I've done that and my client has ingested the 28 triples in the file.  It now needs to do two jobs to "learn" more:

  • dereference the subject and object URIs to discover more triples about the resources described in the 28 triples
  • dereference the predicate URIs to discover what they "mean"

The first job was the subject of my last blog post.  Both the DOI and ORCID servers "play nicely" with my client and return RDF/XML when my client asks for it in the Accept header.  The 683 triples that would result from dereferencing all of the subject and object URIs are in this RDF/XML file.

The second job involves discovering the meaning of the FOAF predicates used in the 28 triples.  The FOAF predicates use the namespace http://xmlns.com/foaf/0.1/, so an abbreviated term URI like foaf:made would be http://xmlns.com/foaf/0.1/made in unabbreviated form.  The FOAF terms follow the second recipe for "cool URIs": "303 URIs".  303 URIs are a result of the resolution of the httpRange-14 controversy, where it was determined that it was OK for non-information resources (like people or ideas) to have URIs that didn't end in hash fragment identifiers.

Here is the essence of how 303 URIs are supposed to work.  A client attempts to dereference a URI.  If the URI is a URL for an information resource (a document like a web page), the server responds to the GET command with an HTTP 200 code ("OK") and sends the resource itself.  However, if the URI identifies a non-information resource that can't be sent via the Internet (like a person or an idea), the server responds with an HTTP 303 code ("See Other") and sends the URI of a document about the resource (a.k.a. a "representation") of the sort preferred by the client (HTML if the client is a web browser, or some flavor of RDF for semantic clients like mine).  The client then dereferences the new URI and gets information about the non-information resource in the preferred document type.  To the True Believer, in accordance with the httpRange-14 resolution, the HTTP status code is really important, because it communicates important information about what kind of thing the URI represents. A response code of 200 means the resource is an Internet-deliverable information resource (i.e. document), while a response code of 303 means the resource is a physical or abstract thing that can't be delivered through the Internet.  Unfortunately, in the real world some administrators of servers that provide RDF either don't know how to set up the server to respond with the "correct" response codes, or they don't care enough to bother.  So the creator of a real semantic client would probably have to program contingencies for inappropriate responses.


"Discovering" the FOAF vocabulary

So what happens if my imaginary client tries to dereference the URI foaf:made with a request header of Accept: application/rdf+xml?  The first thing that happens is that it gets a 303 See Other redirect to http://xmlns.com/foaf/spec/ .  So far, so good; foaf:made is not an information resource - it represents the concept of "making", so the 303 code is appropriate.  However, if my client requests the server to send typical flavors of RDF (application/rdf+xml or text/turtle), it does not get them.  It gets text/html instead.  So if my client only understands RDF/XML or RDF/turtle, it's out of luck with the document sent by the server.

The reason the server returned an HTML document to my client is that the document includes RDFa. I'm not very good at reading RDFa from a raw HTML document, so I ran it through the W3C RDFa Validator.  It validated as RDFa 1.1 with HTML5+RDFa as host language.  Just to see what would happen, I tried adding the RDFa-serialized triples to Stardog by loading the HTML file.  No luck - "The file is invalid."  The RDF editor I use (rdfEditor) was also unable to parse the RDFa and threw an error.  So my imaginary client will have to be more up-to-date than these programs to ingest the RDFa.

There is one additional "out" to my client.  The HTML contains a header link element:

<link href="http://xmlns.com/foaf/spec/index.rdf" rel="alternate"  type="application/rdf+xml" />

This is a preferred way to link a generic HTML document to an RDF representation.  So if my client can't handle RDFa, it still has an out if it can follow the link element to the RDF/XML representation.

So does it matter whether my client "learns" about the FOAF vocabulary from the RDFa directly or by following the link to the RDF/XML?  I did a triple count on the RDFa and got 345.  When I did a triple count on the RDF/XML, I got 635 triples.  So some triples are clearly missing from the RDFa.  The most obvious thing I noticed by comparing the two versions is that the RDFa is missing 75 rdfs:comment and 78 rdfs:label properties.  That would have little effect on machine reasoning, but it would affect one's ability to generate human-readable descriptions of the FOAF terms.  I haven't done an exhaustive comparison, but there are some differences that are important from a machine perspective.  For example, there are five owl:equivalentClass declarations in the RDF/XML that seem to be missing in the RDFa.  The RDF/XML also declares properties to be either owl:ObjectProperty or owl:DatatypeProperty.  That accounts for about 50 of the missing triples and could be significant for machine reasoning.

Since this client is imaginary, I will imagine that it discovers the RDF/XML.  This will be more convenient since Stardog can read it, and because it contains the more extensive set of FOAF triples.


"Learning" from the FOAF vocabulary

Thus far, my client's "learning" has consisted entirely of adding to its knowledge by retrieving triples via dereferencing URIs.  The other way that "learning" can happen is by reasoning triples that are entailed by the semantics of the vocabularies used in the retrieved triples.

There are two different approaches that you can take on reasoning.  One is to reason entailed triples from explicitly asserted triples before querying the graph, adding the entailed triples to the graph, then carrying out the query. An advantage of this method is that once the entailed triples are added to the graph, the reasoning does not need to be carried out with every query.  A disadvantage is that all entailed triples must be materialized, since one does not know which ones might be relevant to some future query. Also, if some of the asserted triples are removed from the graph, it is difficult to know which triples in the graph were reasoned from the asserted triples and should therefore also be removed from the graph.
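
As a concrete example of the first approach, an inverse-property entailment like the foaf:made/foaf:maker pair discussed below could be materialized ahead of time with a SPARQL Update along these lines (a sketch; prefixes assumed):

INSERT { ?work foaf:maker ?person. }
WHERE { ?person foaf:made ?work. }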

A second approach is to reason the entailed triples at the time that the graph is queried.  An advantage of this approach is that reasoning only needs to be carried out when entailed triples would be relevant to the query.  So potentially this would be much faster than the first approach, but the reasoning would have to be repeated with each new query.  With this approach, removing triples from the graph causes no problems, since the entailed triples are reasoned on the fly and aren't stored as a permanent part of the graph.

I'm going to imagine that my client uses the second method for reasoning, since I'm currently playing with Stardog, and it uses that approach.  So I can simulate my client's behavior by loading into Stardog the triples that my client would have found from dereferencing URIs, then flip the big blue reasoning "on" switch and see what happens.

Before I flip the switch, I have to make a decision.  By default, Stardog carries out reasoning based on all of the triples that are in its default graph.  I'm not sure that I feel comfortable with that.  If my client has been snooping around in the RDF wild, sucking in whatever triples it finds by following links and dereferencing URIs, that could potentially result in reasoning based on silly or even nefarious triples.  At this point, I feel more comfortable restricting reasoning to that which is entailed by more authoritative triples asserted as part of well-known vocabularies (such as FOAF).  Restricting reasoning in this way is accomplished by separating triples into two categories.  The first is called the "Tbox" (for terminological triples) or the "schema".  The second is called the "Abox" (for assertional triples); these triples are essentially the "data".  If a triple in the Tbox asserts that some property is equivalent to some other property, then Stardog reasons new triples that are entailed by that assertion.  However, if that same assertion of property equivalence is asserted by a triple in the Abox, Stardog ignores it.

The Admin Console of Stardog allows you to specify a named graph to be used as the Tbox.  For this test, I said that the named graph http://xmlns.com/foaf/0.1/ should be used as the Tbox.  (To edit the settings, the database must be turned "off", then turned back on after the change has been made.)  In the Query Panel, I selected "Add" from the Data dropdown, chose the FOAF RDF/XML file that I downloaded as the file, and entered http://xmlns.com/foaf/0.1/ as the graph URI on the line below.  I also added to the default graph the 28 triples from the "assertions" file and the 683 triples that I scraped from ORCID and the DOIs.

My imaginary client is now ready to "learn" by reasoning on the acquired triples when I query it.

Experiments

What is Clifford Anderson? (subclass and equivalent class reasoning)

To discover what classes Cliff Anderson is an instance of, I can use the following query:

SELECT DISTINCT ?class
WHERE {
  <http://orcid.org/0000-0003-0328-0792> a ?class.
  }

where the URI is Cliff's ORCID URI.  If I run the query with reasoning turned off, I get two results:

foaf:Person
prov:Person

Both of these classes are asserted explicitly in the ORCID RDF that I obtained by dereferencing Cliff's ORCID URI.  If I switch reasoning to "ON" and re-run the query, I get:

owl:Thing
foaf:Person
prov:Person
foaf:Agent
geo:SpatialThing
http://www.w3.org/2000/10/swap/pim/contact#Person
schema:Person
dcterms:Agent

The first result is trivial.  Any time you turn reasoning on, it reasons that any resource is an owl:Thing.

The second and third results were asserted explicitly in the ORCID RDF.

foaf:Agent and geo:SpatialThing are entailed because the FOAF vocabulary declares:
foaf:Person rdfs:subClassOf foaf:Agent, geo:SpatialThing.

http://www.w3.org/2000/10/swap/pim/contact#Person is a term from a sort of W3C test environment. It and schema:Person are entailed because
foaf:Person owl:equivalentClass schema:Person, 
                http://www.w3.org/2000/10/swap/pim/contact#Person.

dcterms:Agent is entailed because
foaf:Agent owl:equivalentClass dcterms:Agent
and foaf:Agent was already entailed based on a subClassOf relationship (above).  This last example is a case where two steps of reasoning were used to materialize a triple.
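
That two-step chain can be checked directly.  With reasoning on, this ASK query should return true:

ASK {
  <http://orcid.org/0000-0003-0328-0792> a dcterms:Agent.
}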

That was pretty easy!  My client reasoned that Cliff was an instance of six additional classes.  I suppose that could be useful under some circumstances, since those classes include just about every possibility that you could use for a person.


Who wrote "Competencies Required for Digital Curation: An Analysis of Job Advertisements"? (inverse and equivalent property reasoning)

The metadata from CrossRef provides the following information about doi:10.2218/ijdc.v8i1.242 :

<http://dx.doi.org/10.2218/ijdc.v8i1.242> dcterms:creator 
                <http://id.crossref.org/contributor/edward-warga-15jdtaq0utve>,
                <http://id.crossref.org/contributor/jeonghyun-kim-15jdtaq0utve>,
                <http://id.crossref.org/contributor/william-moen-15jdtaq0utve>.

We can see that CrossRef explicitly links publications to its ad hoc URIs for authors via the Dublin Core term dcterms:creator.  If we execute the query

SELECT ?author
WHERE {
  <http://dx.doi.org/10.2218/ijdc.v8i1.242> dcterms:creator ?author.
  }

with reasoning turned off, it is no surprise that this query finds the three CrossRef URIs linked in the Turtle above.  When we turn reasoning on, we get the same three URIs, plus http://orcid.org/0000-0003-2445-1511.

This new dcterms:creator link is entailed because:

1. Asserted triple:
<http://orcid.org/0000-0003-2445-1511> 
                       foaf:made <http://dx.doi.org/10.2218/ijdc.v8i1.242>.

2. foaf:made owl:inverseOf foaf:maker.
which entails
<http://dx.doi.org/10.2218/ijdc.v8i1.242> 
                       foaf:maker <http://orcid.org/0000-0003-2445-1511>.

3. foaf:maker owl:equivalentProperty dcterms:creator.
which entails
<http://dx.doi.org/10.2218/ijdc.v8i1.242> 
                       dcterms:creator <http://orcid.org/0000-0003-2445-1511>.

Thus http://orcid.org/0000-0003-2445-1511 satisfies the graph pattern and shows up as the fourth solution.  So in human-readable terms, who are the four creators?  If I keep reasoning turned on and modify the query to:

SELECT ?author ?name
WHERE {
  <http://dx.doi.org/10.2218/ijdc.v8i1.242> dcterms:creator ?author.
  ?author foaf:name ?name.
  }

I only get the names of the three contributors from the CrossRef metadata:

 http://id.crossref.org/contributor/edward-warga-15jdtaq0utve Edward Warga
 http://id.crossref.org/contributor/jeonghyun-kim-15jdtaq0utve Jeonghyun Kim
 http://id.crossref.org/contributor/william-moen-15jdtaq0utve William Moen

and I'm missing the author's name from the ORCID metadata.  That's because ORCID used rdfs:label instead of foaf:name for the person's name.  But since the FOAF vocabulary asserts that

foaf:name rdfs:subPropertyOf rdfs:label.

I can get all of the names if I leave reasoning turned on and change the query to:

SELECT ?author ?name
WHERE {
  <http://dx.doi.org/10.2218/ijdc.v8i1.242> dcterms:creator ?author.
  ?author rdfs:label ?name.
  }

The results show that I now get all of the names:

 http://orcid.org/0000-0003-2445-1511                         Edward Warga
 http://id.crossref.org/contributor/edward-warga-15jdtaq0utve Edward Warga
 http://id.crossref.org/contributor/jeonghyun-kim-15jdtaq0utve Jeonghyun Kim
 http://id.crossref.org/contributor/william-moen-15jdtaq0utve William Moen

Is this a good or a bad thing?  I guess it depends.  Reasoning has allowed me to infer the dcterms:creator relationship with Ed's ORCID URI, which is probably good, since his ORCID URI is linked to other things and his ad hoc CrossRef URI isn't.

If I were trying to find out how many unique co-authors there were for a publication, in this particular case it would be relatively easy for a client to conclude that Ed is listed twice as an author, since the two name strings are identical.  However, if I run the same kind of query on "Herman Bavinck, Reformed Dogmatics, vol. 3: Sin and Salvation in Christ":

SELECT ?author ?name
WHERE {
  <http://dx.doi.org/10.1017/s003693060700364x> dcterms:creator ?author.
  ?author rdfs:label ?name.
  }

I get

http://orcid.org/0000-0003-0328-0792                            Clifford B. Anderson
http://id.crossref.org/contributor/clifford-anderson-7gu43tj0rli3  Clifford Anderson

which is more problematic because my client would have to disambiguate the two forms of Cliff's name to know that there was one author rather than two.
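
The practical consequence shows up if I try to count unique authors.  With reasoning on, a query like this should report two bindings for ?author even though there is really only one person:

SELECT (COUNT(DISTINCT ?author) AS ?authorCount)
WHERE {
  <http://dx.doi.org/10.1017/s003693060700364x> dcterms:creator ?author.
}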

What is the preferred label for the author of "On Teaching XQuery to Digital Humanists"? (sameAs reasoning)

The Semantic Web working group Turtle triples link group members to their publications, and the CrossRef DOI triples provide the titles of the publications.  I could use this query to determine the preferred label for the author of one of the publications:

SELECT ?label
WHERE {
  ?pub dcterms:title "On Teaching XQuery to Digital Humanists".
  ?person foaf:made ?pub.
  ?person skos:prefLabel ?label.
  }

However, if I run the query, I get nothing.  It doesn't matter whether I turn reasoning on or not.  This is because the link between group members and their publications is made via the ORCID URIs that denote the person.  The ORCID metadata doesn't provide skos:prefLabel for people; that was asserted in the VIAF metadata.  Here are the relevant triples:

<http://dx.doi.org/10.4242/balisagevol13.anderson01> 
                 dcterms:title "On Teaching XQuery to Digital Humanists".
<http://orcid.org/0000-0003-0328-0792> 
                foaf:made <http://dx.doi.org/10.4242/balisagevol13.anderson01>.
<http://viaf.org/viaf/168432349> skos:prefLabel "Clifford B. Anderson"@en-us,
                                                "Clifford Anderson"@nl-nl.

However, working group triples also assert that:

<http://orcid.org/0000-0003-0328-0792> 
                 owl:sameAs <http://viaf.org/viaf/168432349>.

The semantics of owl:sameAs entail that either of the two URIs linked by it can be substituted for the other in any triple.  So if reasoning based on owl:sameAs were carried out, it would entail that

<http://orcid.org/0000-0003-0328-0792> 
                                 skos:prefLabel "Clifford B. Anderson"@en-us,
                                                "Clifford Anderson"@nl-nl.

and the query should find the preferred label.  

Stardog does not carry out sameAs reasoning by default.  sameAs reasoning is carried out in a different manner than other reasoning - see the Stardog 4 manual for details.  One obvious reason for the difference is that owl:sameAs assertions relate instances (or "individuals" in OWL terminology) rather than properties or classes, so that kind of assertion is likely to be found in the Abox rather than the Tbox on which Stardog bases its reasoning.  It's probably just as well that the decision to turn on sameAs reasoning is separate from the decision to turn on schema-based reasoning, since sameAs reasoning can have rather nasty unintended consequences (see this paper for some interesting reading on the subject).  The unintended consequences can be even more insidious if they result from unintentional sameAs assertions caused by sloppy use of functional and inverse functional properties.  Perhaps for this reason, Stardog allows a user to choose to reason based on explicit owl:sameAs assertions without enabling sameAs reasoning based on functional/inverse functional property use.

To get my example query to work, once again I have to go to the Admin Console of Stardog for my database, turn the database off, click edit, then select the level of sameAs reasoning that I want to permit (OFF, ON [owl:sameAs only], or FULL [all types of sameAs reasoning]), click Save, then turn the database back on.   In this experiment, I used "ON".

Now if I turn the Reasoning switch to ON in the Query Panel, owl:sameAs reasoning will be included along with other reasoning entailed by triples in the Tbox.  When I run the query, I get the result

"Clifford B. Anderson"@en-us

Cool!  I can now use either the ORCID or the VIAF URI to refer to Cliff in triples, and I get the same result! [1]

Oddly enough, "Clifford Anderson"@nl-nl is NOT included in the results.  I haven't yet figured out why, because it should be in the results.  This problem only seems to happen for queries that depend on entailed triples.  If I change the query to

SELECT ?label
WHERE {
  <http://viaf.org/viaf/168432349> skos:prefLabel ?label.
  }

which requires only explicitly asserted triples, I get both results.  Is this a bug?

Note added 2016-03-02: I submitted a bug report on this to Stardog and got a response:
> The sameAs URI that
> Stardog picked to use was http://orcid.org/0000-0003-0328-0792, which was
> different from the one that was linked to the skos:prefLabel triples:
> http://viaf.org/viaf/168432349.
> I don't know if that matters or not.

Turns out it matters and the bug occurs only when the triples are
asserted for the URI that is not being returned. We'll fix this for
the next release.


[Image credits: Wikimedia Commons. Left: Luigizanasi CC BY-SA; right: Øyvind Holmstad CC BY-SA.]

Could I actually build my True Believer's client?  

In an earlier blog post, I described how I used the RDFLib Python library to grab GeoNames RDF by dereferencing GeoNames URIs, then put them into a graph that I saved on my hard drive.  I then manually loaded the graph into the Heard Library triplestore so that I could play with it using the public SPARQL endpoint.  The Heard Library triplestore is currently running Callimachus, which doesn't allow graphs to be loaded via HTTP.  Stardog does allow this, so in principle one could write a Python program to scrape metadata by dereferencing the ORCID, VIAF, and DOI URIs, then dump the scraped triples into a Stardog triplestore via HTTP using the SPARQL protocol.  I haven't read the Stardog manual carefully enough yet to know whether there is a way to specify via HTTP that a SPARQL query should be done with reasoning enabled or not.  It certainly can be done using the command line interface, so at a minimum the Python program should be able to interact with a local implementation of Stardog via command line.  Ooooh!!  That might be a good summer project...

Conclusions

Although this was really just an exercise to see if I could get Stardog to reason on real data (mostly) from the wild, I'm super-excited about how easy it was to get it to work.  Both the Tbox (635 triples from the FOAF vocabulary) and the Abox (about 700 triples scraped from ORCID, VIAF, and DOIs, plus the linking triples I asserted) were relatively small, so the reasoning and queries executed almost instantly.  Aside from the one problem with not getting all of the language-tagged literals, the results were consistent with what I expected.  I'm planning next to "stress test" the system by bumping the number of triples in the Abox up by 3 or 4 orders of magnitude when I load the 1 million+ Bioimages triples.  I want to investigate the questions that I raised in an earlier blog post where I tried to conduct reasoning using generic SPARQL queries and ran into performance issues.  Stay tuned...

------------------------------------------------------------------------------

Footnote:

[1] The Stardog manual notes that when it performs owl:sameAs reasoning, it does not generate all of the possible alternative triples.  This prevents superfluous triple "bloating", but Stardog randomly chooses only one of the alternative URIs to track the resource.  As far as I can tell, there is no way to specify which one is preferred.  So for example, if a person's VIAF and ORCID URIs are linked with owl:sameAs, there is apparently no way to control which one Stardog would report in a SPARQL query result if the person's node were bound to a variable.