Sunday, February 21, 2016

Reasoning on real Linked Data using Stardog

Notes: 
1. To derive any benefit from reading this post, you really should download and install Stardog, load the example files, and try the queries yourself.  See this page for help on getting Stardog to run on a Windows system.  

2. The SPARQL examples in this post use the generally recognized namespace abbreviations for well-known vocabularies.  I assume that you are running the queries on Stardog and have selected those prefixes in the box at the top of the Query Panel so that you don't have to actually type them as part of the query.

3. In the post, when I talk about what happens in interactions between my imaginary client and a server, I'm reporting the responses I got when I dereferenced URIs using the Advanced REST Client plugin for Chrome.


In my previous blog post, I pretended to be a Linked Data client (i.e. software that made use of Linked Data principles), and tried to "discover" information about people and publications by dereferencing ORCID IDs and DOIs while requesting RDF/XML.  I was hampered by two basic problems:

  • although ORCID "knows" the DOIs of the publications made by people and shows them on the person's web page, it does not link to those DOIs in the RDF it exposes.
  • the RDF/XML exposed by CrossRef when DOIs were dereferenced had malformed datatyped date literals.

In a really amazing turnaround, CrossRef fixed the malformed date issue less than an hour after I tweeted about it.  Wow.  There has been no response from ORCID.

In my communication with CrossRef, they said that if a publisher provides an ORCID ID for the author of a publication, they link to it using a dcterms:creator property.  However, none of the articles I looked up had this information, so in the absence of DOI links from ORCID, I was forced to create my own graph of triples linking the authors to their publications.  The file that I created is here.  You can also see most of it in the diagram above.

My graph contains the following information: information about our Semantic Web working group (including its name and some of the members), links to the DOIs of members' publications, and owl:sameAs assertions linking ORCID IDs to VIAF URIs when they exist.  I purposely restricted my use of non-W3C-standard vocabularies to the FOAF vocabulary, specifically the terms foaf:made, foaf:name, foaf:homepage, foaf:member, and foaf:primaryTopic, for two reasons: FOAF is widely used (and is used in the ORCID RDF that I scraped), and the term definitions in the FOAF vocabulary include triples that generate entailments that are interesting for reasoning play.


The True Believer's client

I am now going to pretend that I have written a client based on the principles of a Semantic Web "true believer".  By that, I mean that I'm pretending that I've written a computer program that is able to start from ground zero and discover the properties of subject and object resources and the "meaning" of predicates using nothing more than the information provided when the URIs of those resources and predicates are dereferenced.  The client does not exactly have a tabula rasa, because it has been programmed to "know" about entailments resulting from RDFS and OWL (W3C Recommendations), but it has not been programmed to do processing based on the idiosyncrasies of particular vocabularies or servers.  My imaginary client is also going to expect that the servers it communicates with follow generally recognized Semantic Web best practices for HTTP server/client interactions.  My Semantic Web True Believer's client will do more than a dim-witted Linked Data client because it will conduct reasoning based on the triples that it discovers in its exploration of the Semantic Web.

I will start by pretending that the client has discovered a URI that denotes our Semantic Web working group:

<https://gist.githubusercontent.com/baskaufs/beeaa94606113b970002/raw/df6ec9cbe57290cc2289d2cc37c221e9f494d153/assertions#group>

This URI is based on the first "Cool URI" strategy: hash URIs (without content negotiation).  When my client tries to dereference the working group's URI, it strips off the part of the URI after the "#" and the server returns a text document.  Regardless of the Content-Type my client requests from the GitHub server in its HTTP Accept header, it always gets text designated as Content-Type: text/plain, because the GitHub server is only set up to return plain text when a raw file is requested.  So my client already has a problem if it expects servers to always correctly tell it the content type of the returned document.  To deal with this document, I'd have to program the client to recognize that the document is actually Turtle (Content-Type: text/turtle).
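
Here's a minimal sketch of what such a contingency might look like, using Python with the requests and rdflib packages (my own illustration, not the actual client): ignore the reported Content-Type and try to parse the body as Turtle anyway.

# Sketch: fetch the raw GitHub file and parse it as Turtle even though the
# server labels it text/plain.
import requests
from rdflib import Graph

uri = ("https://gist.githubusercontent.com/baskaufs/beeaa94606113b970002/raw/"
       "df6ec9cbe57290cc2289d2cc37c221e9f494d153/assertions#group")
doc_uri = uri.split("#")[0]    # strip the fragment before dereferencing

response = requests.get(doc_uri, headers={"Accept": "text/turtle"})
print(response.headers.get("Content-Type"))    # reports text/plain, not text/turtle

# Ignore the declared Content-Type and attempt to parse the body as Turtle.
graph = Graph()
graph.parse(data=response.text, format="turtle")
print(len(graph), "triples parsed")            # expect 28 if the guess was right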

OK, let's pretend that I've done that and my client has ingested the 28 triples in the file.  It now needs to do two jobs to "learn" more:

  • dereference the subject and object URIs to discover more triples about the resources described in the 28 triples
  • dereference the predicate URIs to discover what they "mean"
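
In rough terms, the combined crawl might look something like the sketch below (my own illustration using rdflib; the local file name is a placeholder, and details like 303 handling and content-type fallbacks are glossed over here):

from rdflib import Graph, URIRef

knowledge = Graph()
# a local copy of the 28 assertion triples (placeholder file name)
knowledge.parse("assertions.ttl", format="turtle")

uris_to_explore = set()
for s, p, o in knowledge:
    for term in (s, p, o):
        if isinstance(term, URIRef):
            uris_to_explore.add(str(term).split("#")[0])   # strip hash fragments

for uri in uris_to_explore:
    try:
        # rdflib fetches the URI and tries to parse whatever RDF comes back
        knowledge.parse(uri)
    except Exception as problem:
        print("could not learn anything from", uri, "-", problem)

print(len(knowledge), "triples known so far")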

The first job was the subject of my last blog post.  Both the DOI and ORCID servers "play nicely" with my client and return RDF/XML when my client asks for it in the Accept header.  The 683 triples that would result from dereferencing all of the subject and object URIs are in this RDF/XML file.

The second job involves discovering the meaning of the FOAF predicates used in the 28 triples.  The FOAF predicates use the namespace http://xmlns.com/foaf/0.1/, so an abbreviated term URI like foaf:made would be http://xmlns.com/foaf/0.1/made in unabbreviated form.  The FOAF terms follow the second recipe for "cool URIs": "303 URIs".  303 URIs are a result of the resolution of the httpRange-14 controversy, in which it was determined that it was OK for non-information resources (like people or ideas) to have URIs that don't end in hash fragment identifiers.

Here is the essence of how 303 URIs are supposed to work.  A client attempts to dereference a URI.  If the URI is a URL for an information resource (a document like a web page), the server responds to the GET command with an HTTP 200 code ("OK") and sends the resource itself.  However, if the URI identifies a non-information resource that can't be sent via the Internet (like a person or an idea), the server responds with an HTTP 303 code ("See Other") and sends the URI of a document about the resource (a.k.a. a "representation") of the sort preferred by the client (HTML if the client is a web browser, or some flavor of RDF for semantic clients like mine).  The client then dereferences the new URI and gets information about the non-information resource in the preferred document type.  To the True Believer, in accordance with the httpRange-14 resolution, the HTTP status code is really important, because it communicates what kind of thing the URI denotes: a response code of 200 means the resource is an Internet-deliverable information resource (i.e. a document), while a response code of 303 means the resource is a physical or abstract thing that can't be delivered through the Internet.  Unfortunately, in the real world some administrators of servers that provide RDF either don't know how to set up the server to respond with the "correct" response codes, or they don't care enough to bother.  So the creator of a real semantic client would probably have to program contingencies for inappropriate responses.
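
A client can see which case it is dealing with by suppressing automatic redirect handling and inspecting the status code itself.  Here's a small sketch of that idea using the Python requests package (my own illustration, not part of any standard recipe):

# Sketch: inspect the HTTP status code before following a 303 redirect.
import requests

def dereference(uri, accept="application/rdf+xml"):
    response = requests.get(uri, headers={"Accept": accept},
                            allow_redirects=False)
    if response.status_code == 200:
        # An information resource: the document itself came back.
        return ("document", uri, response)
    elif response.status_code == 303:
        # A non-information resource: fetch the representation we were pointed to.
        see_other = response.headers["Location"]
        return ("non-information resource", see_other,
                requests.get(see_other, headers={"Accept": accept}))
    else:
        return ("unexpected", uri, response)

kind, where, resp = dereference("http://xmlns.com/foaf/0.1/made")
print(kind, where, resp.headers.get("Content-Type"))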


"Discovering" the FOAF vocabulary

So what happens if my imaginary client tries to dereference the URI foaf:made with a request header of Accept: application/rdf+xml?  The first thing that happens is that it gets a 303 See Other redirect to http://xmlns.com/foaf/spec/ .  So far, so good; foaf:made is not an information resource - it represents the concept of "making", so the 303 code is appropriate.  However, if my client asks the server for typical flavors of RDF (application/rdf+xml or text/turtle), it does not get them.  It gets text/html instead.  So if my client only understands RDF/XML or Turtle, it's out of luck with the document sent by the server.

The reason the server returned an HTML document to my client is that the document includes RDFa.  I'm not very good at reading RDFa from a raw HTML document, so I ran it through the W3C RDFa Validator.  It validated as RDFa 1.1 with HTML5+RDFa as the host language.  Just to see what would happen, I tried adding the RDFa-serialized triples to Stardog by loading the HTML file.  No luck - "The file is invalid."  The RDF editor I use (rdfEditor) was also unable to parse the RDFa and threw an error.  So my imaginary client would have to be more up-to-date than these programs to ingest the RDFa.

There is one additional "out" for my client.  The HTML contains a header link element:

<link href="http://xmlns.com/foaf/spec/index.rdf" rel="alternate"  type="application/rdf+xml" />

This is a preferred way to link a generic HTML document to an RDF representation.  So if my client can't handle RDFa, it still has an out if it can follow the link element to the RDF/XML representation.
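
Here's a sketch of that fallback, using only requests and the Python standard library's HTMLParser (again my own illustration rather than the actual client):

# Sketch: if the returned document is HTML, look for a <link rel="alternate"
# type="application/rdf+xml"> element and follow it to the RDF/XML representation.
import requests
from html.parser import HTMLParser

class AlternateLinkFinder(HTMLParser):
    # collects the href of a <link rel="alternate" type="application/rdf+xml"> element
    def __init__(self):
        super().__init__()
        self.rdf_url = None
    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if (tag == "link" and a.get("rel") == "alternate"
                and a.get("type") == "application/rdf+xml"):
            self.rdf_url = a.get("href")

html_doc = requests.get("http://xmlns.com/foaf/spec/").text
finder = AlternateLinkFinder()
finder.feed(html_doc)
print(finder.rdf_url)    # should be http://xmlns.com/foaf/spec/index.rdf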

So does it matter whether my client "learns" about the FOAF vocabulary from the RDFa directly or by following the link to the RDF/XML?  I did a triple count on the RDFa and got 345.  When I did a triple count on the RDF/XML, I got 635 triples.  So some triples are clearly missing from the RDFa.  The most obvious thing I noticed by comparing the two versions is that the RDFa is missing 75 rdfs:comment and 78 rdfs:label properties.  That would have little effect on machine reasoning, but it would affect one's ability to generate human-readable descriptions of the FOAF terms.  I haven't done an exhaustive comparison, but there are some differences that are important from a machine perspective.  For example, there are five owl:equivalentClass declarations in the RDF/XML that seem to be missing in the RDFa.  The RDF/XML also declares properties to be either owl:ObjectProperty or owl:DatatypeProperty.  That accounts for about 50 of the missing triples and could be significant for machine reasoning.
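Counts like these could also be scripted.  Here is a rough sketch using rdflib that tallies the triples and predicates in the RDF/XML representation (just an illustration of how one might reproduce the comparison, not necessarily how the counting was actually done; the RDFa side would still need a separate tool such as the validator output):

from collections import Counter
from rdflib import Graph

g = Graph()
g.parse("http://xmlns.com/foaf/spec/index.rdf", format="xml")
print(len(g), "triples in the RDF/XML")        # expect 635

predicate_counts = Counter(p for s, p, o in g)
for predicate, n in predicate_counts.most_common(10):
    print(n, predicate)
# e.g. compare the rdfs:comment and rdfs:label counts here against the
# triples extracted from the RDFa by the W3C validator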

Since this client is imaginary, I will imagine that it discovers the RDF/XML.  This will be more convenient since Stardog can read it, and because it contains the more extensive set of FOAF triples.


"Learning" from the FOAF vocabulary

Thus far, my client's "learning" has consisted entirely of adding to its knowledge by retrieving triples via dereferencing URIs.  The other way that "learning" can happen is by reasoning triples that are entailed by the semantics of the vocabularies used in the retrieved triples.

There are two different approaches that you can take to reasoning.  One is to reason the entailed triples from the explicitly asserted triples ahead of time, add the entailed triples to the graph, and then carry out queries against the expanded graph.  An advantage of this method is that once the entailed triples are added to the graph, the reasoning does not need to be carried out with every query.  A disadvantage is that all entailed triples must be materialized, since one does not know which ones might be relevant to some future query.  Also, if some of the asserted triples are removed from the graph, it is difficult to know which triples in the graph were reasoned from the removed assertions and should therefore also be removed.

A second approach is to reason the entailed triples at the time that the graph is queried.  An advantage of this approach is that reasoning only needs to be carried out when entailed triples would be relevant to the query.  So potentially this would be much faster than the first approach, but the reasoning would have to be repeated with each new query.  With this approach, removing triples from the graph causes no problems, since the entailed triples are reasoned on the fly and aren't stored as a permanent part of the graph.
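
Just to make the first approach concrete before setting it aside: materialization can be approximated outside of Stardog with rdflib and the owlrl package, something like the sketch below (an illustration of forward chaining in general and an assumption on my part, not anything Stardog itself does; the file names are placeholders):

# Sketch of the first approach: materialize entailed triples ahead of time by
# forward chaining, then query the expanded graph with a plain SPARQL engine.
from rdflib import Graph
import owlrl

g = Graph()
g.parse("foaf.rdf", format="xml")           # the Tbox: FOAF term definitions
g.parse("assertions.ttl", format="turtle")  # the Abox: my linking triples
print("asserted triples:", len(g))

# compute the OWL-RL deductive closure and add the entailed triples to the graph
owlrl.DeductiveClosure(owlrl.OWLRL_Semantics).expand(g)
print("after materialization:", len(g))

# from here on, ordinary SPARQL queries against g will also see entailed triples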

I'm going to imagine that my client uses the second method for reasoning, since I'm currently playing with Stardog, and it uses that approach.  So I can simulate my client's behavior by loading into Stardog the triples that my client would have found by dereferencing URIs, then flipping the big blue reasoning "on" switch and seeing what happens.

Before I flip the switch, I have to make a decision.  By default, Stardog carries out reasoning based on all of the triples that are in its default graph.  I'm not sure that I feel comfortable with that.  If my client has been snooping around in the RDF wild, sucking in whatever triples it finds by following links and dereferencing URIs, that could potentially result in reasoning based on silly or even nefarious triples.  At this point, I feel more comfortable restricting reasoning to that which is entailed by more authoritative triples asserted as part of well-known vocabularies (such as FOAF).  Restricting reasoning in this way is accomplished by separating triples into two categories.  The first is called the "Tbox" (for terminological triples) or the "schema".  The second is called the "Abox" (for assertional triples); these triples are essentially the "data".  If a triple in the Tbox asserts that some property is equivalent to some other property, then Stardog reasons new triples that are entailed by that assertion.  However, if that same assertion of property equivalence is asserted by a triple in the Abox, Stardog ignores it.

The Admin Console of Stardog allows you to specify a named graph to be used as the Tbox.  For this test, I said that the named graph http://xmlns.com/foaf/0.1/ should be used as the Tbox.  (To edit the settings, the database must be turned "off", then turned back on after the change has been made.)  In the Query Panel, I selected "Add" from the Data dropdown, chose the FOAF RDF/XML file that I downloaded, and entered http://xmlns.com/foaf/0.1/ as the graph URI on the line below.  I also added to the default graph the 28 triples from the "assertions" file and the 683 triples that I scraped from ORCID and the DOIs.

My imaginary client is now ready to "learn" by reasoning on the acquired triples when I query it.

Experiments

What is Clifford Anderson? (subclass and equivalent class reasoning)

To discover what classes Cliff Anderson is an instance of, I can use the following query:

SELECT DISTINCT ?class
WHERE {
  <http://orcid.org/0000-0003-0328-0792> a ?class.
  }

where the URI is Cliff's ORCID URI.  If I run the query with reasoning turned off, I get two results:

foaf:Person
prov:Person

Both of these classes are asserted explicitly in the ORCID RDF that I obtained by dereferencing Cliff's ORCID URI.  If I switch reasoning to "ON" and re-run the query, I get:

owl:Thing
foaf:Person
prov:Person
foaf:Agent
geo:SpatialThing
http://www.w3.org/2000/10/swap/pim/contact#Person
schema:Person
dcterms:Agent

The first result is trivial.  Any time you turn reasoning on, it reasons that any resource is an owl:Thing.

The second and third results were asserted explicitly in the ORCID RDF.

foaf:Agent and geo:SpatialThing are entailed because the FOAF vocabulary declares:
foaf:Person rdfs:subClassOf foaf:Agent, geo:SpatialThing.

http://www.w3.org/2000/10/swap/pim/contact#Person is a term from a sort of W3C test environment. It and schema:Person are entailed because
foaf:Person owl:equivalentClass schema:Person, 
                http://www.w3.org/2000/10/swap/pim/contact#Person.

dcterms:Agent is entailed because
foaf:Agent owl:equivalentClass dcterms:Agent
and foaf:Agent was already entailed based on a subClassOf relationship (above).  This last example is a case where two steps of reasoning were used to materialize a triple.

That was pretty easy!  My client reasoned that Cliff was an instance of six additional classes.  I suppose that could be useful under some circumstances, since those classes include just about every possibility that you could use for a person.


Who wrote "Competencies Required for Digital Curation: An Analysis of Job Advertisements"? (inverse and equivalent property reasoning)

The metadata from CrossRef provides the following information about doi:10.2218/ijdc.v8i1.242 :

<http://dx.doi.org/10.2218/ijdc.v8i1.242> dcterms:creator 
                <http://id.crossref.org/contributor/edward-warga-15jdtaq0utve>,
                <http://id.crossref.org/contributor/jeonghyun-kim-15jdtaq0utve>,
                <http://id.crossref.org/contributor/william-moen-15jdtaq0utve>.

We can see that CrossRef explicitly links publications to its ad hoc URIs for authors via the Dublin Core term dcterms:creator.  If we execute the query

SELECT ?author
WHERE {
  <http://dx.doi.org/10.2218/ijdc.v8i1.242> dcterms:creator ?author.
  }

with reasoning turned off, it is no surprise that this query finds the three CrossRef URIs linked in the Turtle above.  When we turn reasoning on, we get the same three URIs, plus http://orcid.org/0000-0003-2445-1511.

This new dcterms:creator link is entailed because:

1. Asserted triple:
<http://orcid.org/0000-0003-2445-1511> 
                       foaf:made <http://dx.doi.org/10.2218/ijdc.v8i1.242>.

2. foaf:made owl:inverseOf foaf:maker.
which entails
<http://dx.doi.org/10.2218/ijdc.v8i1.242> 
                       foaf:maker <http://orcid.org/0000-0003-2445-1511>.

3. foaf:maker owl:equivalentProperty dcterms:creator.
which entails
<http://dx.doi.org/10.2218/ijdc.v8i1.242> 
                       dcterms:creator <http://orcid.org/0000-0003-2445-1511>.

Thus http://orcid.org/0000-0003-2445-1511 satisfies the graph pattern and shows up as the fourth solution.  So in human-readable terms, who are the four creators?  If I keep reasoning turned on and modify the query to:

SELECT ?author ?name
WHERE {
  <http://dx.doi.org/10.2218/ijdc.v8i1.242> dcterms:creator ?author.
  ?author foaf:name ?name.
  }

I only get the names of the three contributors from the CrossRef metadata:

 http://id.crossref.org/contributor/edward-warga-15jdtaq0utve Edward Warga
 http://id.crossref.org/contributor/jeonghyun-kim-15jdtaq0utve Jeonghyun Kim
 http://id.crossref.org/contributor/william-moen-15jdtaq0utve William Moen

and I'm missing the author's name from the ORCID metadata.  That's because ORCID used rdfs:label instead of foaf:name for the person's name.  But since the FOAF vocabulary asserts that

foaf:name rdfs:subPropertyOf rdfs:label.

I can get all of the names if I leave reasoning turned on and change the query to:

SELECT ?author ?name
WHERE {
  <http://dx.doi.org/10.2218/ijdc.v8i1.242> dcterms:creator ?author.
  ?author rdfs:label ?name.
  }

The results show that I now get all of the names:

 http://orcid.org/0000-0003-2445-1511                         Edward Warga
 http://id.crossref.org/contributor/edward-warga-15jdtaq0utve Edward Warga
 http://id.crossref.org/contributor/jeonghyun-kim-15jdtaq0utve Jeonghyun Kim
 http://id.crossref.org/contributor/william-moen-15jdtaq0utve William Moen

Is this a good or a bad thing?  I guess it depends.  Reasoning has allowed me to infer the dcterms:creator relationship with Ed's ORCID URI, which is probably good since his ORCID URI is linked to other things and his ad hoc CrossRef URI isn't.

However, suppose I were trying to find out how many unique co-authors there were for a publication.  In this particular case, it would be relatively easy for a client to conclude that Ed is listed twice as an author, since the two name strings are identical.  But if I run the same kind of query on "Herman Bavinck, Reformed Dogmatics, vol. 3: Sin and Salvation in Christ":

SELECT ?author ?name
WHERE {
  <http://dx.doi.org/10.1017/s003693060700364x> dcterms:creator ?author.
  ?author rdfs:label ?name.
  }

I get

http://orcid.org/0000-0003-0328-0792                            Clifford B. Anderson
http://id.crossref.org/contributor/clifford-anderson-7gu43tj0rli3  Clifford Anderson

which is more problematic because my client would have to disambiguate the two forms of Cliff's name to know that there was one author rather than two.

What is the preferred label for the author of  "On Teaching XQuery to Digital Humanists"? (sameAs reasoning)

The Semantic Web working group Turtle triples link group members to their publications, and the CrossRef DOI triples provide the titles of the publications.  I could use this query to determine the preferred label for the author of one of the publications:

SELECT ?label
WHERE {
  ?pub dcterms:title "On Teaching XQuery to Digital Humanists".
  ?person foaf:made ?pub.
  ?person skos:prefLabel ?label.
  }

However, if I run the query, I get nothing.  It doesn't matter whether I turn reasoning on or not.  This is because the link between group members and their publications is made via the ORCID URIs that denote the person.  The ORCID metadata doesn't provide skos:prefLabel for people; that was asserted in the VIAF metadata.  Here are the relevant triples:

<http://dx.doi.org/10.4242/balisagevol13.anderson01> 
                 dcterms:title "On Teaching XQuery to Digital Humanists".
<http://orcid.org/0000-0003-0328-0792> 
                foaf:made <http://dx.doi.org/10.4242/balisagevol13.anderson01>.
<http://viaf.org/viaf/168432349> skos:prefLabel "Clifford B. Anderson"@en-us,
                                                "Clifford Anderson"@nl-nl.

However, working group triples also assert that:

<http://orcid.org/0000-0003-0328-0792> 
                 owl:sameAs <http://viaf.org/viaf/168432349>.

The semantics of owl:sameAs entail that either of the two URIs linked by it can be substituted for the other in any triple.  So if reasoning based on owl:sameAs were carried out, it would entail that

<http://orcid.org/0000-0003-0328-0792> 
                                 skos:prefLabel "Clifford B. Anderson"@en-us,
                                                "Clifford Anderson"@nl-nl.

and the query should find the preferred label.  

Stardog does not carry out sameAs reasoning by default.  sameAs reasoning is carried out in a different manner than other reasoning - see the Stardog 4 manual for details.  One obvious reason for the difference is that owl:sameAs assertions relate instances (or "individuals" in OWL terminology) rather than properties or classes, so that kind of assertion is likely to be found in the Abox rather than the Tbox on which Stardog bases its reasoning.  It's probably just as well that the decision to turn on sameAs reasoning is separate from the decision to turn on schema-based reasoning, since sameAs reasoning can have rather nasty unintended consequences (see this paper for some interesting reading on the subject).  The unintended consequences can be even more insidious if they result from unintentional sameAs assertions caused by sloppy use of functional and inverse functional properties.  Perhaps for this reason, Stardog allows a user to choose to reason based on explicit owl:sameAs assertions without enabling sameAs reasoning based on functional/inverse functional property use.

To get my example query to work, once again I have to go to the Admin Console of Stardog for my database, turn the database off, click edit, then select the level of sameAs reasoning that I want to permit (OFF, ON [owl:sameAs only], or FULL [all types of sameAs reasoning]), click Save, then turn the database back on.   In this experiment, I used "ON".

Now if I turn the Reasoning switch to ON in the Query Panel, owl:sameAs reasoning will be included along with other reasoning entailed by triples in the Tbox.  When I run the query, I get the result

"Clifford B. Anderson"@en-us

Cool!  I can now use either the ORCID or the VIAF URI to refer to Cliff in triples, and I get the same result! [1]

Oddly enough, "Clifford Anderson"@nl-nl is NOT included in the results.  I haven't yet figured out why, because it should be in the results.  This problem only seems to happen for queries that depend on entailed triples.  If I change the query to

SELECT ?label
WHERE {
  <http://viaf.org/viaf/168432349> skos:prefLabel ?label.
  }

which requires only explicitly asserted triples, I get both results.  Is this a bug?

Note added 2016-03-02: I submitted a bug report on this to Stardog and got a response.  In the report I had noted that the sameAs URI Stardog picked to use was http://orcid.org/0000-0003-0328-0792, which was different from the one that was linked to the skos:prefLabel triples (http://viaf.org/viaf/168432349), and said that I didn't know whether that mattered.  The reply:

"Turns out it matters and the bug occurs only when the triples are asserted for the URI that is not being returned. We'll fix this for the next release."


[Images from Wikimedia Commons. Left: Luigizanasi, CC BY-SA; right: Øyvind Holmstad, CC BY-SA]

Could I actually build my True Believer's client?  

In an earlier blog post, I described how I used the RDFLib Python library to grab GeoNames RDF by dereferencing GeoNames URIs, then put the triples into a graph that I saved on my hard drive.  I then manually loaded the graph into the Heard Library triplestore so that I could play with it using the public SPARQL endpoint.  The Heard Library triplestore is currently running Callimachus, which doesn't allow graphs to be loaded via HTTP.  Stardog does allow this, so in principle one could write a Python program to scrape metadata by dereferencing the ORCID, VIAF, and DOI URIs, then dump the scraped triples into a Stardog triplestore via HTTP using the SPARQL protocol.  I haven't read the Stardog manual carefully enough yet to know whether there is a way to specify via HTTP that a SPARQL query should be done with reasoning enabled or not.  It certainly can be done using the command line interface, so at a minimum the Python program should be able to interact with a local implementation of Stardog via the command line.  Ooooh!!  That might be a good summer project...
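
To give a flavor of what I have in mind, here is a very rough sketch.  The endpoint paths, the database name, the credentials, and the "reasoning" query parameter are all assumptions on my part (the comment at the end of this post suggests that such a parameter exists), so treat this as a thought experiment rather than working code:

# Sketch of the hypothetical Python client: scrape RDF by dereferencing URIs
# with rdflib, push the triples into Stardog over HTTP, then query with
# reasoning requested.  Paths, database name, and credentials are placeholders.
import requests
from rdflib import Graph

STARDOG = "http://localhost:5820/linkeddata"
AUTH = ("admin", "admin")                      # placeholder credentials

# 1. Scrape: dereference a few URIs and accumulate whatever RDF comes back.
scraped = Graph()
for uri in ["http://orcid.org/0000-0003-0328-0792",
            "http://dx.doi.org/10.4242/balisagevol13.anderson01"]:
    scraped.parse(uri)

# 2. Load: push the scraped triples as a SPARQL Update (assumed /update endpoint).
ntriples = scraped.serialize(format="nt")      # a str in recent rdflib versions
requests.post(STARDOG + "/update", auth=AUTH,
              data="INSERT DATA { %s }" % ntriples,
              headers={"Content-Type": "application/sparql-update"})

# 3. Query with reasoning: the standard SPARQL protocol "query" parameter plus
#    an assumed "reasoning" parameter.
query = """SELECT DISTINCT ?class
           WHERE { <http://orcid.org/0000-0003-0328-0792> a ?class }"""
result = requests.get(STARDOG + "/query", auth=AUTH,
                      params={"query": query, "reasoning": "true"},
                      headers={"Accept": "application/sparql-results+json"})
for binding in result.json()["results"]["bindings"]:
    print(binding["class"]["value"])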

Conclusions

Although this was really just an exercise to see whether I could get Stardog to reason on real data (mostly) from the wild, I'm super-excited about how easy it was to get it to work.  Both the Tbox (635 triples from the FOAF vocabulary) and the Abox (about 700 triples scraped from ORCID, VIAF, and DOIs, plus the linking triples I asserted) were relatively small, so the reasoning and queries executed almost instantly.  Aside from the one problem with not getting all of the language-tagged literals, the results were consistent with what I expected.  I'm planning next to "stress test" the system by bumping the number of triples in the Abox up by 3 or 4 orders of magnitude when I load the 1 million+ Bioimages triples.  I want to investigate the questions that I raised in an earlier blog post where I tried to conduct reasoning using generic SPARQL queries and ran into performance issues.  Stay tuned...

------------------------------------------------------------------------------

Footnote:

[1] The Stardog manual notes that when it performs owl:sameAs reasoning, it does not generate all of the possible alternative triples.  This prevents superfluous triple "bloating", but Stardog randomly chooses only one of the alternative URIs to track the resource.  As far as I can tell, there is no way to specify which one is preferred.  So for example, if a person's VIAF and ORCID URIs are linked with owl:sameAs, there is apparently no way to control which one Stardog would report in a SPARQL query result if the person's node were bound to a variable.

1 comment:

  1. A tweet from Michael Grove (@mikegrovesoft): [on Stardog] you can disable reasoning on a per-query basis; `reasoning=false` as a URL query parameter
