If you've been following my recent blog posts, you'll know that I've been enjoying testing out the Stardog graph database software. I've been using it to construct SPARQL queries on Linked Data from the wild and to try out Stardog's built-in reasoner to generate triples that are entailed based on the semantics of the FOAF vocabulary, but not asserted in my test dataset (found here and here). This has been fun, but these tests have been done on what's really a "toy" dataset that contains only about 700 triples.
In an earlier blog post, I reported on my experimentation using SPARQL CONSTRUCT queries with the Callimachus graph database to generate entailed but unasserted inverse relationships present in the Bioimages RDF dataset. That dataset contains approximately 1.5M triples of real data associated with about 14k images of about 3.3k organisms representing 1.2k named taxa. The current version of the dataset can be downloaded as a zip file from the Bioimages GitHub repo. (The experiments here used the 2016-02-25 release: http://dx.doi.org/10.5281/zenodo.46574.) In those experiments, I found that materializing the unasserted triples using SPARQL was possible, but that in certain cases, the queries took so long that they timed out after the limit of one minute. The problem seemed to be associated with use of the MINUS keyword, which I was using to prevent the query from returning triples that were already present in the dataset. This was a bit disappointing because although 1.5 million triples sounds like a lot, it is not uncommon for triplestores to contain several orders of magnitude more triples than that. So it was a failure on what was not a particularly rigorous test.
In this blog post, I'm going to report on my efforts to expand on some of the tests that I conducted earlier using the Bioimages dataset. This time, rather than using SPARQL CONSTRUCT to materialize the entailed triples and then add them back into the database ahead of querying, I'm going to let Stardog reason the necessary triples on the fly at the time of querying. See this for Stardog's explanation of their reasoning approach.
Speed: Stardog vs. Callimachus
Loading triples
The first thing that I was curious to know was how long it would take to load the 1.5M triples into the Stardog database. Currently, I re-upload the whole dataset to the Vanderbilt Heard Library triplestore each time I create a new Bioimages release. That triplestore is a Callimachus installation and although I don't think I've ever timed how long it takes to do the upload, it's long enough that I leave my computer and go to make coffee - many minutes. So I was surprised at how fast Stardog was able to parse the RDF/XML from the source files and load them into the triplestore. There are several files that are very small and they loaded almost instantaneously. The biggest file (images.rdf) is 109 MB uncompressed and contains the lion's share of the triples: 1 243 969. Stardog loaded it in 14 seconds. I decided to do a local install of Callimachus on the same computer so that I could compare. The load time for Callimachus: 7 minutes and 54 seconds! The second largest file (organisms.rdf) is 25 MB uncompressed and contains 279 429 triples. Stardog loaded it in 6 seconds; Callimachus in 1 minute and 33 seconds. Wow.
Counting asserted triples
The second test I ran was to see how long it took each endpoint to report a count of the number of triples in the graph. To find this using Stardog, I used the query:
SELECT (COUNT(?subject) AS ?count)
WHERE {
?subject ?predicate ?object.
}
which will report the number of triples in the default graph (when I loaded the triples, I did not specify any named graph, so they all got thrown into the default graph pot). I got the result, 1 533 245 triples, in less than a second. My reaction time was too slow to time it with the stopwatch on my phone. When I ran the same query in Callimachus, I got the result, 1 822 884 triples, in 22 seconds. The reason this number is about 300 000 triples bigger than the Stardog number is that the default Callimachus graph includes a bunch of Callimachus-specific triples that I didn't put there. You can see this using the query
SELECT DISTINCT ?class
WHERE {
?subject a ?class.
}
which produces a whole bunch of classes I don't recognize, like "callimachus/1.4/types/Schematron". Running the same query on Stardog returns only classes that I recognize because they all came from RDF that I uploaded. If I want to restrict the Callimachus triple count to triples that I uploaded, I have to use named graphs.
There is a fundamental difference in the way that named graphs are specified between the two platforms (at least the way I understand it). In Callimachus, uploaded triples are always assigned to a named graph that includes the name of the file that held those triples. For example, the set of triples about images are assigned to the named graph:
<http://localhost:8080/bioimages/images.rdf>
If you run a SPARQL query on Callimachus and don't specify a graph using the FROM keyword, Callimachus runs the query on all of the triples it has, including those in named graphs.
Stardog assigns a URI to its default graph:
tag:stardog:api:context:default
and if no other graph name is specified at the time the triples are uploaded, they are assigned to this graph. This is the graph on which SPARQL queries are run in the absence of the FROM keyword. In Stardog, you can assign any URI as the graph name when a file is loaded - there need be no relationship between that URI and the file name. However, if a graph name is specified, Stardog will NOT include those triples in SPARQL queries unless you explicitly say that they should be included using the FROM keyword. This is very different behavior from Callimachus.
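The difference in default-graph scoping can be modeled with a toy in-memory store. This is a minimal Python sketch, not either platform's actual implementation; the example graph name and triples are made up for illustration:

```python
# Toy model of the default-graph scoping difference described above.
# Triples live in named graphs; a query with no FROM clause sees
# different sets of triples on the two platforms.

DEFAULT = "tag:stardog:api:context:default"

store = {
    DEFAULT: {("s1", "p", "o1"), ("s2", "p", "o2")},        # uploaded with no graph name
    "http://example.org/images.rdf": {("s3", "p", "o3")},   # uploaded to a named graph
}

def stardog_scope(store, from_graphs=None):
    """Stardog: no FROM clause means only the default graph is queried."""
    graphs = from_graphs if from_graphs else [DEFAULT]
    return set().union(*(store[g] for g in graphs))

def callimachus_scope(store, from_graphs=None):
    """Callimachus: no FROM clause means the union of all graphs is queried."""
    graphs = from_graphs if from_graphs else list(store.keys())
    return set().union(*(store[g] for g in graphs))

print(len(stardog_scope(store)))      # 2: named-graph triples excluded
print(len(callimachus_scope(store)))  # 3: all triples included
```

Passing an explicit list of graph names to either function plays the role of the FROM clauses in the query below.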
Given these differences, in order to count the triples that I uploaded and exclude the Callimachus-generated triples, I need to use this query:
SELECT (COUNT(?subject) AS ?count)
FROM <http://localhost:8080/bioimages/baskauf.rdf>
FROM <http://localhost:8080/bioimages/geonames.rdf>
FROM <http://localhost:8080/bioimages/images.rdf>
FROM <http://localhost:8080/bioimages/kirchoff.rdf>
FROM <http://localhost:8080/bioimages/local.rdf>
FROM <http://localhost:8080/bioimages/ncu-all.rdf>
FROM <http://localhost:8080/bioimages/organisms.rdf>
FROM <http://localhost:8080/bioimages/stdview.rdf>
FROM <http://localhost:8080/bioimages/thomas.rdf>
FROM <http://localhost:8080/bioimages/uri.rdf>
WHERE {
?subject ?predicate ?object.
}
in which I specify the graph that corresponds to each of the files I uploaded. If I run this query, I get the answer, 1 533 272 triples, in about 6 seconds. This is quite a bit faster than the time it took when I didn't specify the graphs to be included, but is still much slower than the corresponding Stardog query. I'm also not sure why there are 27 more triples in the Callimachus count than in Stardog's count, and I'm too lazy to figure it out.
Counting asserted and entailed triples
As I said in the beginning of this post, one of the things I'm really interested in knowing is how much Stardog gets slowed down when I ask it to reason. In the experiments in this post, I'm going to reason based on the Darwin-SW ontology (DSW; version 0.4). To accomplish this, I uploaded the Darwin-SW triples (from the file dsw.owl, which is included in the bioimages-rdf.zip file) to Stardog as a named graph, then specified that graph as the Tbox to be used for reasoning. See the "Learning from the FOAF vocabulary" section of my last blog post for more details on Tbox reasoning.
The Bioimages graph model is based on DSW and heavily uses its predicates to link resources. Although DSW is a fairly lightweight ontology, it does declare a number of property inverses, and ranges and domains for many of its object properties. So we can expect that the Bioimages dataset entails triples that are not asserted. To attempt to determine how many of these triples there are, I flipped the Reasoning switch to "ON", then re-ran the first counting query above. In contrast to the nearly instantaneous counting that happened when the Reasoning switch was off, it took about 4.5 seconds to compute the results with reasoning. The total number of triples counted was 1 765 911: 232 666 entailed triples, or about 13% of the total. It's not possible to compare this to Callimachus because it doesn't have reasoning built in. I'm not sure exactly what all of these extra triples are, but read on for some examples of the types of triples Stardog can reason based on DSW.
Some queries that make use of reasoned triples
Inverse properties
Partly out of convenience and partly out of obstinance, in the Bioimages RDF I did not link resources of different classes using inverse predicates in both possible directions. For example, the dataset includes the triple
<http://bioimages.vanderbilt.edu/ind-andersonwb/wa310#2003eve> dsw:locatedAt
<http://bioimages.vanderbilt.edu/ind-andersonwb/wa310#2003loc>.
but not the inverse
<http://bioimages.vanderbilt.edu/ind-andersonwb/wa310#2003loc> dsw:locates
<http://bioimages.vanderbilt.edu/ind-andersonwb/wa310#2003eve>.
The diagram above shows the predicates that I usually used in yellow, and the ones I did not typically use in gray. I could have fairly easily generated the inverse triples, but it was more convenient not to, and I wanted to see how annoying it would be for me to use the RDF without them.
If I wanted to know in which states Darel Hess has recorded occurrences, I could run the query:
SELECT DISTINCT ?state
WHERE {
?occurrence dwc:recordedBy "Darel Hess".
?occurrence dsw:atEvent ?event.
?event dsw:locatedAt ?location.
?location dwc:stateProvince ?state.
}
and it would produce the results:
California
Tennessee
Virginia
Kentucky
North Carolina
However, if I ran this query using the inverse linking properties with reasoning off:
SELECT DISTINCT ?state
WHERE {
?occurrence dwc:recordedBy "Darel Hess".
?event dsw:eventOf ?occurrence.
?location dsw:locates ?event.
?location dwc:stateProvince ?state.
}
I would get no results because even though those inverse links are entailed by these statements in the DSW ontology:
dsw:locatedAt owl:inverseOf dsw:locates.
dsw:atEvent owl:inverseOf dsw:eventOf.
they weren't actually asserted in the RDF I wrote. If I turn on the Reasoning switch and run the query again, I get the same five states as I did above. The amount of additional time required to run this query is too short for me to measure, which is nice. So in this case, my decision to omit the inverse triples when I exposed the RDF didn't result in any annoying problems for me at all, as long as I used Stardog to run the query and flipped the Reasoning switch to ON.
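What the reasoner does with those two owl:inverseOf declarations amounts to a single forward-chaining pass over the asserted triples. Here is a minimal Python sketch of that rule, with shortened, hypothetical identifiers standing in for the full Bioimages URIs:

```python
# Forward-chain the owl:inverseOf rule: for each asserted (s, p, o)
# where p has a declared inverse q, the triple (o, q, s) is entailed.

inverses = {
    "dsw:locatedAt": "dsw:locates",
    "dsw:atEvent": "dsw:eventOf",
}
# Make the map symmetric so either direction of assertion triggers the rule.
inverses.update({q: p for p, q in inverses.items()})

asserted = {
    ("occ1", "dsw:atEvent", "eve1"),
    ("eve1", "dsw:locatedAt", "loc1"),
}

# Entail the reversed triples, dropping any that were already asserted.
entailed = {
    (o, inverses[p], s)
    for (s, p, o) in asserted
    if p in inverses
} - asserted

print(sorted(entailed))
# [('eve1', 'dsw:eventOf', 'occ1'), ('loc1', 'dsw:locates', 'eve1')]
```

With the entailed triples available, the second query's graph pattern (which uses dsw:eventOf and dsw:locates) matches even though I never asserted those links.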
Types entailed by range declarations
What kind of thing is http://bioimages.vanderbilt.edu/baskauf/33423 ? I can find out by running this query:
SELECT DISTINCT ?class
WHERE {
<http://bioimages.vanderbilt.edu/baskauf/33423> a ?class.
}
(where "a" is syntactic sugar for rdf:type). Running the query with reasoning turned off produces one result, the class dcmitype:StillImage. Running the query with reasoning turned on produces three results:
dcmitype:StillImage
owl:Thing
dsw:Token
The first additional class is no surprise; every resource is reasoned to be an instance of owl:Thing. dsw:Token is less obvious.
In DSW, a Token is any kind of thing that serves as evidence. The DSW ontology declares:
dsw:hasEvidence rdfs:range dsw:Token.
RDFS includes a rule for rdfs:range that entails when
<p> rdfs:range <class>.
and
<s> <p> <o>.
then
<o> rdf:type <class>.
The Bioimages RDF includes this triple:
<http://bioimages.vanderbilt.edu/ind-baskauf/33419#2004-04-29> dsw:hasEvidence <http://bioimages.vanderbilt.edu/baskauf/33423>.
so by the rule,
<http://bioimages.vanderbilt.edu/baskauf/33423> rdf:type dsw:Token.
In this example, the additional time required to reason the other two classes was negligible.
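The range rule itself is mechanical enough to sketch in a few lines of Python (identifiers shortened from the full Bioimages URIs for readability):

```python
# Apply the rdfs:range entailment rule: if p has range C and (s, p, o)
# is asserted, then (o, rdf:type, C) is entailed.

ranges = {"dsw:hasEvidence": "dsw:Token"}  # range declaration from the DSW ontology

asserted = {
    ("ind-baskauf/33419#2004-04-29", "dsw:hasEvidence", "baskauf/33423"),
}

entailed_types = {
    (o, "rdf:type", ranges[p])
    for (s, p, o) in asserted
    if p in ranges
}

print(entailed_types)
# {('baskauf/33423', 'rdf:type', 'dsw:Token')}
```

An rdfs:domain declaration works the same way, except that the entailed type attaches to the subject rather than the object.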
Breaking Stardog
In the previous two examples, I let Stardog off pretty easy. It didn't have to work with very many triples, because the first triple pattern in each query named specific resources in the subject or object position and so created few bindings for its variable. So I decided to try a more demanding query. Here is a query that asks what kinds of things serve as evidence for occurrences that take place in the United States:
SELECT DISTINCT ?class
WHERE {
?occurrence dsw:hasEvidence ?token.
?token a ?class.
?occurrence dsw:atEvent ?event.
?event dsw:locatedAt ?location.
?location dwc:countryCode "US".
}
In this query, there are a lot more bindings to variables in the first triple pattern. Each of the 13958 images in the database serves as evidence for some occurrence. Since most of the images in the database are from the United States, there are many values of ?token that satisfy the graph pattern. But most of them are instances of the same class (dcmitype:StillImage), with only a few instances of specimens. With no reasoning, Stardog comes up with results in an instant:
dcmitype:StillImage
dwc:PreservedSpecimen
However, with reasoning turned on, the query takes a very long time. In fact, the first time I ran it, nothing happened for so long that I decided to go do something else. When I came back, I had gotten a timeout error.
Later, I tried running the query again and got results after 12 minutes. I'm not sure why it didn't time out this time - perhaps some of the earlier work was cached in some way? However, the results were clearly wrong because Stardog reported zero results. The results should have included the two classes from the previous run, plus at least owl:Thing and dsw:Token, as we saw in the earlier example based on a single resource.
So the question on my mind was: why does it take so long for Stardog to perform reasoning in this case? I don't really understand the mechanism by which Stardog performs reasoning, so I'm just guessing here. The number of bindings to ?token will be very high, since all 13958 images in the database will fall into that category. Images are the most numerous kind of resource in the database and each image has many triples describing it. In fact, 645 350 of the triples in the database (42% of all triples) have an image as their subject while 73174 have images as their object. It might take a lot of effort on Stardog's part to reason whether predicate use in those triples entails unasserted types based on range and domain declarations for the predicates. In contrast, the reasoning task in the "Inverse properties" example was much less intensive since it had to determine entailed inverse "dsw:eventOf" triples based on only 3972 asserted "dsw:atEvent" triples.
One possible course of action that I haven't investigated is trying to improve performance by changing settings in the Admin Console: increasing index sizes, lengthening the timeout, enabling approximate reasoning, or other kinds of changes that I don't currently understand.
(added after initial publication)
2016-02-25: Twitter thread with Kendall Clark (of Stardog, thanks for taking the time to tweet!):
KC: the ?token a ?class isn't legal under SPARQL OWL reasoning rules. But we try to do it anyway and it's an area we need to improve.
me: Thanks, I suppose I'm misunderstanding the process. I assumed: reason entailed classes -> query for matches to triple pattern.
KC: There are restrictions to insure things like decidability and reasonable performance, etc. We try to work around them if possible.
me: OK, need to learn more, I suppose about OWL profiles. Was assuming reasoning on domain, range, & equivalent class were all "normal"
KC: I suspect we could help you work around this if we understood better what you were trying to achieve. :>
me: At this point, entirely exploratory & no agenda. But "what kinds of things..." (?s a ?o) questions seem likely to be of interest
2016-02-26: From the thread of the error report I submitted:
We looked at the performance problems with this query and the problem
seems to be the query optimizer not handling {?token a ?class} pattern
efficiently. We created a ticket (#2835) for this issue.
For the patterns that retrieve types of an individual, Stardog will
execute additional queries behind the scenes. If these subqueries time
out a warning is printed in the log but the end user would receive
incomplete results and no warnings. This behavior is not ideal and
we'll make improvements here.
Investigation of reasoning time and type inferencing
In an attempt to investigate what affects the time required to reason entailed types, I created this query:
SELECT DISTINCT ?class
WHERE {
?token a dwc:PreservedSpecimen.
# ?token a dwc:LivingSpecimen.
# ?token a dcmitype:StillImage.
?token a ?class.
}
There are three classes of resources in the Bioimages dataset that serve as evidence for occurrences: images, preserved specimens, and living specimens. In the query, I specify in the first triple pattern that a Token must be an instance of one of the three classes, and have commented out the patterns involving the other two classes to make it easy to switch to performing the query on them. The last triple pattern asks what other classes the Tokens are instances of. With reasoning turned off, the query produces an answer faster than I can time it. With reasoning turned on, the time to complete the query varies greatly.
Here is the scope of each type of resource (number of instances, number of triples in which an instance is the subject, and number of triples in which an instance is the object) and the approximate amount of time that it takes to execute the corresponding query with reasoning [1]:
class instances subject object seconds
-------------------------------------------------------------
dwc:PreservedSpecimen 27 297 27 0.4
dwc:LivingSpecimen 227 6675 3279 2
dcmitype:StillImage 13958 645350 73174 137
The times aren't very accurate, but they show that the reasoning time is related in some way to the scope of the class. The amount of time per instance is fairly consistently about 0.01 s/instance, so the time may just be directly related to the number of instances whose types Stardog must reason. One thing that is clear is that carrying out this kind of reasoning would not be practical in datasets that describe millions or billions of resources.
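The rough per-instance rate claimed above can be checked directly from the numbers in the table:

```python
# Seconds per instance for each class in the timing table above.
timings = {
    "dwc:PreservedSpecimen": (27, 0.4),     # (instances, seconds with reasoning)
    "dwc:LivingSpecimen": (227, 2),
    "dcmitype:StillImage": (13958, 137),
}

for cls, (instances, seconds) in timings.items():
    # All three rates come out in the neighborhood of 0.01 s/instance.
    print(f"{cls}: {seconds / instances:.4f} s/instance")
```

The three rates all fall roughly between 0.009 and 0.015 s/instance, which is consistent with reasoning time scaling approximately linearly with the number of instances.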
Reconsidering defining pairs of inverse properties
In a blog post last year, I questioned our decision to define many pairs of inverse properties in DSW. To recap briefly, the arguments for that decision were:
- when asserting triples in serialized form, it's more convenient and less verbose to be able to choose the direction of the link.
- being able to express the link in either direction avoids making particular types of resources the "center of the universe" when vocabulary users are required to place those resources in the subject position of triples.
- if we assume that reasoning will be uncommon (i.e. we believe in Linked Data, but not necessarily in the Semantic Web), either data providers will have to include triples expressing the link in both directions, or data consumers will have to construct more complex queries to catch links expressed by the producers in either direction.
Part of the problem is that I can't tell Stardog to "carry out all of the reasoning that doesn't cause you to time out". For example, maybe I only care about knowing the explicitly asserted types of things that serve as evidence for occurrences in the US, and don't care about the entailed, unasserted types. I could run the query from above with reasoning turned off:
SELECT DISTINCT ?class
WHERE {
?occurrence dsw:hasEvidence ?token.
?token a ?class.
?occurrence dsw:atEvent ?event.
?event dsw:locatedAt ?location.
?location dwc:countryCode "US".
}
But I couldn't run the query using the inverse properties in the graph pattern:
SELECT DISTINCT ?class
WHERE {
?occurrence dsw:hasEvidence ?token.
?token a ?class.
?event dsw:eventOf ?occurrence.
?location dsw:locates ?event.
?location dwc:countryCode "US".
}
with reasoning turned off because the dsw:eventOf and dsw:locates triples have to be reasoned. Stardog won't just reason the entailed inverse triples required to satisfy the graph pattern without also reasoning the problematic entailed rdf:type relationships that were causing the query to take too long.
It's also disturbing to me that when Stardog succeeded in finishing the very long query, it (incorrectly) reported no results. In my opinion, that's worse than timing out: unknowingly getting a wrong answer is worse than getting no answer at all. After only testing a handful of queries, this is the second "wrong" answer I've gotten from Stardog (see this blog post for a description of the other). If I'm really going to commit to depending on reasoning in order for my queries to work, then I have to have some confidence that I'm actually going to get the answers that would be correct if the expected types of reasoning were applied correctly and completely. That's making me think that it's not such a good idea to define pairs of inverse properties in DSW without specifying which property of the pair is preferred.
How PROV-O handles inverse properties
Since I wrote the earlier blog post where I discussed inverse properties, I had occasion to read the overview document for the W3C PROV Ontology (PROV-O). It includes a section discussing property inverses, which notes that
When all inverses are defined for all properties, modelers may choose from two logically equivalent properties when making each assertion. Although the two options may be logically equivalent, developers consuming the assertions may need to exert extra effort to handle both (e.g., by either adding an OWL reasoner or writing code and queries to handle both cases). This extra effort can be reduced by preferring one inverse over another.
PROV-O then goes on to specify which inverse is preferred, but also reserves names for the property inverses so that term usage would be consistent among modelers who choose to use the inverse properties. The specification of the preferred inverse is done in the following manner.
The definition of the preferred property includes a prov:inverse annotation with the local name of its inverse as a value. Example:
prov:wasDerivedFrom
a owl:ObjectProperty;
rdfs:isDefinedBy <http://www.w3.org/ns/prov#>;
prov:inverse "hadDerivation".
The definition does not include the inverse property declaration, so a machine that solely discovered that definition would not necessarily "know" that there was an inverse property. The definition of the non-preferred inverse includes the owl:inverseOf declaration, which would ensure that a machine that discovered that definition would "know" that the preferred inverse existed. For example:
prov:hadDerivation
rdfs:isDefinedBy <http://www.w3.org/ns/prov#>;
owl:inverseOf prov:wasDerivedFrom.
How should we decide which inverse is preferred?
I've thought about this question for a while and I think I have an answer. In the DSW model, classes are sometimes included largely to facilitate one-to-many relationships. (See what we wrote here for more on that subject.) Although it would be nice to believe that data linked using DSW would "live" as RDF in a triplestore, realistically it's likely to have its original home in a relational database, or even as CSV files, then be exported as some serialization of RDF. Given that reality, it makes sense to pick the preferred inverse based on what would make linking the database tables the easiest. For example, the event node is included in DSW so that one can link many occurrences to a single observing or collecting event. So there is a one-event-to-many-occurrences relationship. If I have an event table and an occurrence table, it's easiest to structure them like this:
Event table:
id dwc:eventDate rdfs:label
---------------------------------------------------------
<event1> "2015-06-14" "14 June 2015 Bioblitz"
<event2> "1838-08-23" "Schmidt 1838 collecting event"
Occurrence table:
id dwc:recordedBy dsw:atEvent
---------------------------------------------------------
<occurrence1> "José Shmoe" <event1>
<occurrence2> "Courtney Khan" <event1>
<occurrence3> "Peleg Smith" <event2>
<occurrence4> "Josiah Jones" <event2>
<occurrence5> "Hepsabeth Arnold" <event2>
With tables structured this way, each occurrence row generates a triple like
<occurrence4> dsw:atEvent <event2>.
rather than
<event2> dsw:eventOf <occurrence4>.
The rule therefore would be to choose the inverse property such that the subject resource is on the "many" side of the one-to-many relationship.
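Under that rule, an exporter walking the occurrence table always emits the link with the occurrence (the "many" side) as subject. Here is a hypothetical Python sketch of such an export; the row layout and function name are mine, chosen to mirror the example tables above:

```python
# Export occurrence-table rows as triples, using the preferred link
# direction: the "many" side (the occurrence) is always the subject
# of dsw:atEvent, so no dsw:eventOf triples are ever generated.

occurrence_rows = [
    # (id, dwc:recordedBy, dsw:atEvent)
    ("occurrence1", "José Shmoe", "event1"),
    ("occurrence2", "Courtney Khan", "event1"),
    ("occurrence3", "Peleg Smith", "event2"),
]

def rows_to_triples(rows):
    """Turn each table row into two triples, keyed on the row's id column."""
    triples = []
    for occ_id, recorded_by, event_id in rows:
        triples.append((occ_id, "dwc:recordedBy", recorded_by))
        triples.append((occ_id, "dsw:atEvent", event_id))  # preferred inverse
    return triples

for s, p, o in rows_to_triples(occurrence_rows):
    print(s, p, o)
```

Because the foreign-key column of the "many" table maps directly onto the preferred property, the export never needs to invert anything.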
Conclusions
From this set of experiments, I've reached several conclusions:
- The speed of Stardog in both loading triples and running queries seems to be far superior to that of Callimachus.
- Although it is very cool and simple to have Stardog perform reasoning, the reasoning does not appear to be reliable enough (and in some cases fast enough) for me to commit to depending on it.
- I'm thinking that on balance, the benefits of defining pairs of inverse linking properties do not outweigh the costs, and that it would be better to declare a single preferred linking property.
Let's apply that lens to the experiments that I've carried out in this blog post and the previous two (here and here). It's really cool that I could use owl:sameAs reasoning to link resources that have both ORCID and VIAF identifiers. But the only reason I have to do that is because we have two competing systems instead of one consensus system. It's great that I could reason that a resource labeled using a foaf:name property also has an entailed rdfs:label property with the same value. But I only had to do that because there isn't a consensus that all resources should typically have an rdfs:label property to provide a human-readable label. It's great that reasoning based on an owl:equivalentClass declaration allowed me to know that a foaf:Person is also a schema:Person. But I only had to reason that because Schema.org couldn't discipline themselves to reuse a well-known term and felt compelled to mint a new term. It's nice that owl:inverseOf declarations let me query using a triple pattern based on either of a pair of inverse properties, but the reason I need to do that reasoning is because of (possibly) poor ontology design that doesn't make it clear to users which property they should use.
In fact, I don't think there is a single reasoning task that I've done so far that hasn't just made up for a lack of standardization and community consensus-building, failure to reuse existing identifiers, or possible ontology design flaws. I'm not saying that it's impossible to do awesome things with reasoning, but so far I haven't managed to actually do any of them yet.[2] Despite my excitement at having some success playing around with reasoning with Stardog, I haven't yet converted myself from RDF Agnostic to Semantic Web True Believer.
--------------------------------------------------------
Notes
[1] If you are interested in knowing the results of the queries:
Preserved specimens resulted in:
owl:Thing
dwc:PreservedSpecimen
dsw:Specimen
dsw:Token
with owl:Thing and dsw:Token entailed for reasons given previously. dsw:Specimen is entailed because dwc:PreservedSpecimen rdfs:subClassOf dsw:Specimen.
Living specimens resulted in:
dwc:LivingSpecimen
dwc:Organism
dsw:Specimen
dcterms:PhysicalResource
owl:Thing
dsw:IndividualOrganism
dsw:Token
dsw:LivingSpecimen
with dwc:Organism and dcterms:PhysicalResource asserted and dsw:LivingSpecimen and dsw:IndividualOrganism entailed because they are deprecated classes that are equivalent to currently used classes.
Images resulted in:
owl:Thing
dsw:Token
dcmitype:StillImage
with owl:Thing and dsw:Token entailed for reasons given previously. One thing that this makes clear is that the number of entailed classes has little to do with the reasoning time, since the image query took the longest, but had the fewest entailed classes.
[2] Shameless self-promotion: see Baskauf and Webb (in press) for some use cases that RDF might satisfy that might not be easy to accomplish using other technologies.