Monday, May 12, 2014

Confessions of an RDF agnostic, part 5: queries and owl:sameAs

Image John Berkey, 1950 (c) Fawcett Publications, Inc.*

A team of mathematicians works several years calculating a positronic brain equipped to do certain similar acts of calculation.  Using this brain they make further calculations to create one still more complicated and so on.  …  What we call the Machines are the results of ten such steps.

The Machine is only a tool after all, which can help humanity progress faster by taking some of the burdens of calculations and interpretations off his back.  The task of the human brain remains what it has always been; that of discovering new data to be analyzed, and of devising new concepts to be tested."

I, Robot, Isaac Asimov, 1950.




My past several blog posts have dealt with the gory details of how we can say that an RDF semantic client "knows" things and how a client can "learn" in the sense of increasing the volume of what it "knows".  For one who grew up with the idea that intelligent agents (a la robots) should easily self-assemble from electronic circuits or spongy globes of plantinumiridium, the answer is a bit of a let down: it learns by accumulating additional RDF triples in its graph.  It's actually a little better than that, because triples are more than just marked up text - they represent knowledge about actual resources (things, characteristics, concepts, etc.).  So a client can conduct reasoning on triples and can determine whether a set of statements (triples in a graph) are consistent or not.  A client can be "misled" by feeding it inaccurate or misleading triples, and I mentioned an entertaining parlor trick that can cause a semantic client to "explode" with irrelevant triples.  But the end result is still that the enlightened client just accumulates a graph full of triples that don't actually "do" anything on their own.  It's time to bring humans into the game.

Modifying the Rod Page challenge


I'd like to step back for a moment and return to where I started this series: the Rod Page Challenge (http://iphylo.blogspot.com/2011/10/tdwg-challenge-what-is-rdf-good-for.html) which he summarized as "What new things have we learnt about biodiversity by converting biodiversity data into RDF?" In Rod's post, he specifically asked "what new inferences have we made using RDF?"   As I noted in my previous posts, any time a client materializes an entailed triple into its graph, it has made a new inference.  So if we wanted to be cheeky about it, we could point out any entailed triple that we've added to a graph and say we've met the challenge.  But to be fair, in a lot of cases those new triples might not be particularly interesting. They may simply make concrete some fact that is obvious but which was not stated explicitly by the provider (for example, that the rdf:type every resource is owl:Thing and rdfs:Resource).  So I'm going to modify Rod's challenge slightly by adding the word "interesting": "what interesting new inferences have we made using RDF?"  The other modification that I'm going to make is to add some clarifications about the role that humans play in the process of "learning" using RDF.  I expect that the semantic clients (machines) will be making the new inferences and that humans will be learning the new things about biodiversity.  So I'll state the modified Rod Page challenge like this:
What new things have humans learnt about biodiversity by converting biodiversity data into RDF?  What interesting new inferences have semantic clients made using RDF?
 The modified Rod Page challenge extends the arena of "learning" beyond the triple stores of semantic clients and includes the interaction between humans and those semantic clients. 

modified from Tim Berners-Lee http://www.w3.org/DesignIssues/diagrams/sweb-stack/2006a.png

SPARQL


There are a number of ways that humans could "learn" by interacting with a semantic client.  Probably the most obvious way would be through conducting a query of the graph of accumulated triples that the client has assembled by direct discovery or through reasoning.  The W3C has developed SPARQL, a query language designed specifically for use with RDF.   SPARQL works by sorting through a graph and winnowing out triples that match patterns that are specified in the query.  The components of the pattern can be literals, IRIs, or variables.  The variables are bound to RDF terms in such a way that the pattern matches the data in the graph.  I'm interested in images, so here is an example of a very simple query that would help me discover images that I've created that are licensed CC BY-NC-SA:

PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?image
WHERE {
     ?image dcterms:creator <http://orcid.org/0000-0003-4365-3135>.
     ?image dcterms:license <http://creativecommons.org/licenses/by-nc-sa/3.0/>.
}

where http://orcid.org/0000-0003-4365-3135 is an IRI identifying me. The query would return the set of all IRIs for ?image that conform to the pattern I specified in the WHERE clause. 

Unfortunately, there is a potential problem with this query.  There are two well-known terms for describing who created a resource.  In addition to dcterms:creator, there is also foaf:maker.  Fortunately, SPARQL provides a simple solution to this problem.  The UNION keyword allows me to specify pattern alternatives:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?image
WHERE {
     {?image dcterms:creator <http://orcid.org/0000-0003-4365-3135>.}
     UNION
     {?image foaf:maker <http://orcid.org/0000-0003-4365-3135>.}

     ?image dcterms:license <http://creativecommons.org/licenses/by-nc-sa/3.0/>.
}

OK, that's great - problem solved.  But there's another problem.  There is another well-known identifier for me: http://viaf.org/viaf/63557389 .  We can fix the query by the same approach using UNION:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?image
WHERE {
     {
          {?image dcterms:creator <http://orcid.org/0000-0003-4365-3135>.}
          UNION
          {?image dcterms:creator <http://viaf.org/viaf/63557389>.}
     }
     UNION
     {
          {?image foaf:maker <http://orcid.org/0000-0003-4365-3135>.}
          UNION
          {?image foaf:maker <http://viaf.org/viaf/63557389>.}
     }
     ?image dcterms:license <http://creativecommons.org/licenses/by-nc-sa/3.0/>.
}

Ta-da!  Oops.  I forgot one thing.  There is actually a third IRI for identifying me that I had already been using for years before the VIAF and OCID identifiers were created: http://bioimages.vanderbilt.edu/contact/baskauf .  We could continue with the same approach and make the query more complicated, but I am tiring of writing the permutations.  It would be better to follow an approach that takes advantage of my investment in using RDF.

http://bioimages.vanderbilt.edu/baskauf/32279

owl:sameAs and owl:equivalentProperty


Let's assume that we have some way of dealing with the possible nasty implications (discussed in my earlier blog posts) of turning our semantic client loose on the Internet to discover useful triples and then infer other triples that are entailed by the discovered ones.  Dereferencing http://bioimages.vanderbilt.edu/contact/baskauf will uncover the following triples:

<http://bioimages.vanderbilt.edu/contact/baskauf> 
    owl:sameAs <http://viaf.org/viaf/63557389>.
   
<http://bioimages.vanderbilt.edu/contact/baskauf> 
    owl:sameAs <http://orcid.org/0000-0003-4365-3135>.

If my client discovered a triple like:

<http://bioimages.vanderbilt.edu/baskauf/32279> 
     dcterms:creator <http://bioimages.vanderbilt.edu/contact/baskauf>.

 it could use the owl:sameAs relationships to infer two more entailed triples:

<http://bioimages.vanderbilt.edu/baskauf/32279> 
     dcterms:creator <http://viaf.org/viaf/63557389>.

<http://bioimages.vanderbilt.edu/baskauf/32279> 
     dcterms:creator <http://orcid.org/0000-0003-4365-3135>.

In keeping with what I've said previously about how clients "learn", discovering the owl:sameAs relationships does NOT somehow mysteriously allow the semantic client to"know" that the three IRIs identify the same thing (i.e. me).  Rather, the client "learns" that they are the same by generating every triple that can be made by replacing (in the same position) the one equivalent IRI with the other. [1]

The term owl:equivalentProperty fills a similar role for properties as the role that owl:sameAs fills for instances.  Examination of the defining RDF for either the dcterms: or foaf: vocabularies reveals that

dcterms:creator owl:equivalentProperty foaf:maker.

A client that is aware of the owl-interpretation of RDF can use this relationship to materialize entailed triples in which either dcterms:creator or foaf:maker is replaced with the other property.  So a client that is aware of the equivalence of dcterms:creator and foaf:maker can add this triple to its graph:

<http://bioimages.vanderbilt.edu/baskauf/32279> 
     foaf:maker <http://bioimages.vanderbilt.edu/contact/baskauf>.

 Letting the client do the heavy lifting

 

If we allow the client to materialize all triples that are entailed by all of the owl:sameAs and owl:equivalentProperty declarations, discovery of the one triple

<http://bioimages.vanderbilt.edu/baskauf/32279> 
     dcterms:creator <http://bioimages.vanderbilt.edu/contact/baskauf>.

results in the creation of five additional triples that represent all of the permutations that can be formed by replacements of the equivalent IRIs:

<http://bioimages.vanderbilt.edu/baskauf/32279> 
     dcterms:creator <http://viaf.org/viaf/63557389>.

<http://bioimages.vanderbilt.edu/baskauf/32279> 
     dcterms:creator <http://orcid.org/0000-0003-4365-3135>. 

<http://bioimages.vanderbilt.edu/baskauf/32279> 
     foaf:maker <http://bioimages.vanderbilt.edu/contact/baskauf>.

<http://bioimages.vanderbilt.edu/baskauf/32279> 
     foaf:maker <http://viaf.org/viaf/63557389>.

<http://bioimages.vanderbilt.edu/baskauf/32279> 
     foaf:maker <http://orcid.org/0000-0003-4365-3135>.

Now let's return to the problem that we faced earlier with the SPARQL query.  We can now get around the problem of needing to include every permutation in the query by having the semantic client generate the permutations of equivalent triples its graph.  Now it doesn't matter if we use the earlier query or a query like

SELECT ?image
WHERE {
     ?image foaf:maker <viaf.org/viaf/63557389>.
     ?image dcterms:license <http://creativecommons.org/licenses/by-nc-sa/3.0/>.
}

which will produce exactly the same result.  This approach to querying has been called forward-chaining materialisation by Hogan et al. (2009) [2].

Brett Gustafson CC BY Wikimedia Commons

 Show me your license!

 

 The image http://bioimages.vanderbilt.edu/baskauf/32279 has a license that is very explicitly expressed in RDF/XML.  However, I would like my client to be able to obtain information in a variety of ways.  My client may scrape the web and discover images whose metadata are expressed in RDFa.  Creative Commons explains how providers can use snippets of XHTML code provided by their license chooser to express a license in RDFa.  Here's an example:


<a rel="license" href="http://creativecommons.org/licenses/by/2.5/">
<img alt="Creative Commons License" style="border-width:0"
src="http://i.creativecommons.org/l/by/2.5/88x31.png" /></a><br />


The part of this code that is important to this discussion is the rel="license" attribute.  Using that attribute in RDFa asserts an xhv:license property for the subject resource such as:

@prefix xhv: <http://www.w3.org/1999/xhtml/vocab#>.
<http://commons.wikimedia.org/wiki/File%3AChicago_police_officer_on_segway.jpg>
     xhv:license  <http://creativecommons.org/licenses/by/2.5/>.

OK, now we have a problem again.  My original query only included a query pattern that included dcterms:license .  It's not going to catch xhv:license .  Following my previous strategy doesn't work either because neither the definition of the XHTML vocabulary (expressed as RDFa at http://www.w3.org/1999/xhtml/vocab) nor the definition of the Dublin Core vocabulary provide any relationship between dcterms:license and xhv:license so I can't just relate them using owl:equivalentProperty .  

However, rummaging around in the Creative Commons Rights Expression Language (view source of http://creativecommons.org/ns to see the defining RDFa) produces the term cc:license which has these relevant relationships:

cc:license owl:sameAs xhv:license.  [3]
cc:license rdfs:subPropertyOf dcterms:license.

If a client accepts the Creative Commons Rights Expression Language view of the world, then any license property expressed in RDFa using the rel="license" attribute also entails a cc:license and dcterms:license properties linking to the same license IRI (although a dcterms:license property does not entail cc:license and xhv:license properties because rdfs:subPropertyOf only entails property substitution in one direction).  

So if I want to ensure that my original SPARQL query detects images whose license is expressed using RDFa, I should have my client materialize the two additional entailed triples; in the example above, that would be:

@prefix dcterms: <http://purl.org/dc/terms/>.
@prefix cc: <http://creativecommons.org/ns#>.
<http://commons.wikimedia.org/wiki/File%3AChicago_police_officer_on_segway.jpg>
     dcterms:license  <http://creativecommons.org/licenses/by/2.5/>.
<http://commons.wikimedia.org/wiki/File%3AChicago_police_officer_on_segway.jpg>
     cc:license  <http://creativecommons.org/licenses/by/2.5/>.

http://www.w3.org/RDF/




"I can't believe I ate that whole thing."

1972 American television commercial














Modified Rod Page Challenge, revisited


So if my client discovered these two triples:

<http://bioimages.vanderbilt.edu/baskauf/32279> 
     dcterms:creator <http://bioimages.vanderbilt.edu/contact/baskauf>;
     xhv:license  <http://creativecommons.org/licenses/by-nc-sa/3.0/>.

then based on the known equivalence and subproperty relationships discussed above, it could "know" a total of nine triples:

<http://bioimages.vanderbilt.edu/baskauf/32279> 
     dcterms:creator <http://bioimages.vanderbilt.edu/contact/baskauf>,
     <http://viaf.org/viaf/63557389>,<http://orcid.org/0000-0003-4365-3135>;
     foaf:maker <http://bioimages.vanderbilt.edu/contact/baskauf>,
     <http://viaf.org/viaf/63557389>,<http://orcid.org/0000-0003-4365-3135>;
     xhv:license  <http://creativecommons.org/licenses/by-nc-sa/3.0/>;
     cc:license  <http://creativecommons.org/licenses/by-nc-sa/3.0/>;
     dcterms:license  <http://creativecommons.org/licenses/by-nc-sa/3.0/>.

That's a 450% increase in knowledge!  My client has learned a lot!

These triples would be useful to me because they would allow me to conduct a simple query such as

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?image
WHERE {
     ?image foaf:maker <http://viaf.org/viaf/63557389>.
     ?image dcterms:license <http://creativecommons.org/licenses/by-nc-sa/3.0/>.
}

that would find the image http://bioimages.vanderbilt.edu/baskauf/32279 even though the triple patterns in the query contained none of the predicates used in the originally discovered triples.  That's useful! Hooray!

Unfortunately, this example totally fails to meet the modified Rod Page Challenge, because nothing that I discovered was actually interesting.  All that I have done is to bloat my client with superfluous triples that would not have been necessary if there had been a consensus provider of IRIs for people, and if creators of vocabularies had reused existing terms.  In other words, I've used RDF to help make up for a failure of the community to create consensus institutions and standards, not to discover something interesting.

Oh, I forgot to mention that http://www.morphbank.net/?id=808585 , http://eol.org/data_objects/21888332 , and http://www.discoverlife.org/mp/20p?see=I_SB32279 are other identifiers for the same image.  Fortunately, all three of these aggregators (none of which are currently exposing RDF) have tracked and backlinked to the original IRI for the image.  But if they used their own IRIs for the image to describe the license and creator of the image, then my client would have to materialize 36 entailed triples to make sure that my queries don't miss information that actually only requires two triples to express. 

Does "bloating" matter?


OK, so I've been hard on the Linked Data community for a failure to reuse terms and to establish mechanisms for developing repositories of well-known identifiers for community use.  So what if we make it necessary to materialize 36 triples to express what should be expressible in 2?  Does it matter? I don't have a precise answer for this because I don't have the resources or the skills to find out.  However, let's do a few back-of-the-envelope calculations. 

Hogan et al. (2009) performed tests [4] on a set of fairly clean 1.27 million assertional (instance) triples and 295 terminological (referring to classes and properties) triples.  Various sorts of reasoning on a graph of that size took times on the order of 100+ seconds (their Table 3).  They also applied somewhat more constrained reasoning on a set of about 1.1 billion "dirty" triples crawled from the Web.  Reasoning of various sorts on a graph of that size took times on the order of 15 to 100 hours (their Table 5).  If we assume that there have been some improvements in performance of hard drives and processors since 2009, these times would be shorter, but still not trivial.

Graph sizes of 1 million to 1 billion triples sound impressive!  However, to put that in perspective, the metadata describing the rather small Bioimages database (which includes about 10 000 images) requires about a million triples to express in RDF.  That comes out to about 100 triples per image.  Now imagine a natural history museum that has digitized 10 million specimens.  If we assume 100 triples to describe each image, that easily gets us to a billion triples, and I didn't even mention the metadata about the specimens themselves or the data about taxa that are represented in the collection. The message I'm trying to deliver here is that it would be very easy for a single natural history museum to generate a number of triples that could require tens or hundreds of hours of computer time to reason over.  Now imagine GBIF with its 438 million (as of 2014-05-12) occurrence records.  If we are thinking about carrying out RDF-based reasoning on data of this magnitude, I would assert that unnecessarily increasing the number of triples by a factor of 18 is not trivial. 

In the recent tdwg-content email discussion (starting with http://lists.tdwg.org/pipermail/tdwg-content/2014-May/003225.html and continuing on for about 40 responses), I got the impression that a lot of people felt that there was no way to avoid the repeated minting of new identifiers for the same resource.  If that is really true, and the biodiversity informatics community can't discipline itself to use consensus identifiers, then it may be pointless to expend more effort on a serious attempt to enable the use of RDF in the community.  Just saying "we'll link the additional identifiers using owl:sameAs" is not a solution.

Summary


"Knowing" and "learning" by clients based on RDF means little if human's can't search the accumulated triples by some efficient mechanism, such as SPARQL querying.  Additionally, simply accumulating a greater number of triples is useless unless those additional triples actually encode new and useful information.  The effectiveness of querying can be increased by letting a client materialize alternative ways of expressing facts.  But lack of consensus IRIs and properties results in an explosion of unnecessary triples that has the potential of significantly impeding the ability to reason over large graphs.

Darn!  Blog post #5 on this topic and I still haven't gotten to talking about doing something useful with RDF.   Maybe next time when I talk about ways that we link resources...

----------
* Image served from http://www.computerhistory.org/ object ID 500004889

[1] Refer to Halpin, H., I. Herman, and P. J. Hayes. 2009. When owl:sameAs isn't the Same: An Analysis of Identity Links on the Semantic Web.  http://www.w3.org/2009/12/rdf-ws/papers/ws21 for good reading on owl:sameAs.  Section 3 is particularly relevant here.

[2]  Refer to sections 2.1 and 3.2 of Aidan Hogan, Andreas Harth and Axel Polleres.  Scalable Authoritative OWL Reasoning for the Web.  International Journal on Semantic Web and Information Systems, 5(2), pages 49-90, April-June 2009.
http://www.deri.ie/fileadmin/documents/DERI-TR-2009-04-21.pdf

[3] Using owl:sameAs to relate two properties apparently implies an owl:equivalentProperty relationship.  However, this is only allowed in OWL Full.  (see http://www.w3.org/TR/owl-ref/#equivalentProperty-def) I'm not sure what all the implications are of this.  

[4] Hogan et al. 2009. section 3.4.

1 comment:

  1. Re [3], see http://answers.semanticweb.com/questions/1210/owl-full-and-reasoning and probably many other such discussions

    ReplyDelete