Sunday, May 4, 2014

Confessions of an RDF agnostic, part 4: An inconsistent world

In its most common usage, the word "blivet" refers to an indecipherable figure, illustrated above. ... It appeared on the March 1965 cover of Mad magazine bearing the caption "Introducing 'The Mad Poiuyt' ", and has appeared numerous times since then. ...
In traditional U.S. Army slang dating back to the Second World War, a blivet was defined as "ten pounds of manure in a five pound bag".

Wikipedia article on "blivet".  (Image from Wikimedia Commons)

In my last post I discussed what I thought it meant for a semantic client to "know" and "learn" things.  I said that a client "knows" the facts that are encoded by the triples in its graph, and that a client "learns" when it adds more triples to its graph, either through discovering them directly or through inferring triples that are entailed by interpretations of flavors of RDF that it supports.  There is another sort of "knowing" that a semantic client can be programmed to do: determining whether a graph is consistent.  Here is a simple example of an inconsistent RDF triple:

<http://orcid.org/0000-0003-4365-3135> foaf:age "156.56"^^xsd:integer.

 The problem with this triple is NOT related to whether it is true or false (it's false).  The problem is NOT related to whether it's valid RDF (it is perfectly valid RDF).  The problem is that the literal "156.56" is not in the lexical space allowed for the xsd:integer datatype.  In RDF, Anyone can say Anything about Anything, so nothing would prohibit a provider from asserting this triple.  But the triple could be determined by a client to be inconsistent.  What does that mean?
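To make the datatype clash concrete, here is a minimal sketch in Python (not real RDF tooling; the regular expression is my own paraphrase of the XML Schema rule) that tests whether a literal's lexical form belongs to the lexical space of xsd:integer:

```python
import re

# The lexical space of xsd:integer is an optional sign followed by digits
# (per XML Schema Datatypes).  "156.56" contains a decimal point, so it
# falls outside that space and the literal is ill-typed.
XSD_INTEGER_LEXICAL = re.compile(r"^[+-]?[0-9]+$")

def in_integer_lexical_space(lexical_form):
    """Return True if the string is a legal lexical form for xsd:integer."""
    return bool(XSD_INTEGER_LEXICAL.match(lexical_form))

print(in_integer_lexical_space("156"))     # a legal integer form
print(in_integer_lexical_space("156.56"))  # decimal point: ill-typed
```

A client that supports datatype entailment can flag any graph containing such an ill-typed literal, no matter what the triple is "about".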

Linking Open (LOD) Data Project Cloud Diagram from http://linkeddata.org/ CC BY-SA

Describing a world

The RDF (1.0) Semantics document explains the situation this way:
An assertion amounts to stating a constraint on the possible ways the world might be. Notice that there is no presumption here that any assertion contains enough information to specify a single unique interpretation. It is usually impossible to assert enough in any language to completely constrain the interpretations to a single possible world, so there is no such thing as 'the' unique interpretation of an RDF graph. In general, the larger an RDF graph is - the more it says about the world - then the smaller the set of interpretations that an assertion of the graph allows to be true - the fewer the ways the world could be, while making the asserted graph true of it.
The OWL 2 Primer elaborates: "a set of statements may be consistent (that is, there is a possible state of affairs in which all the statements in the set are jointly true) or inconsistent (there is no such state of affairs)."

So we can imagine the process of "learning" by a semantic client to be a process of discovering what a world is like by narrowing down its possible conditions with the addition of each triple.  If worlds can exist that are in accord with the state of affairs described by the triples in the client's graph, then the client can reason that the graph is consistent.  If there is NO world possible given the triples in the graph, then the graph is inconsistent.  In the case of the example above, the single triple is enough to render a graph inconsistent, because there is no RDF world in which an integer can have a lexical representation containing a decimal point. [1] 

Notice that in the discussion above I have said "a world" not "the world".  Because in RDF Anyone can say Anything about Anything, RDF could be used to describe an imaginary world, such as Middle Earth, the world of Harry Potter, or a world in which I'm 156 years old.  It is not required that a world resemble the "real world", although that's usually what we would prefer.

The approach taken by RDF can seem "backwards" to people who are just starting to learn about RDF.  Those people might mistakenly think that declaring the range of foaf:maker to be foaf:Agent restricts foaf:maker to being used as a predicate only in triples with objects that are Agents.  It does not.  Rather, the range declaration entails that the object of the triple IS an Agent.  In RDF, we don't say "you can't have that kind of triple in my world".  Rather we say, "what kind of world would I have with your triple in it?"
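The "entails, doesn't restrict" behavior of a range declaration can be sketched in a few lines of Python. This is a toy model with triples as tuples of prefixed-name strings; the ex: resources are purely illustrative:

```python
# A range declaration does not reject triples; it licenses the inference
# that the object of any triple using the predicate is an instance of the
# declared range class.
RANGES = {"foaf:maker": "foaf:Agent"}  # rdfs:range declarations

def entail_by_range(triples, ranges):
    """Yield the rdf:type triples entailed by rdfs:range declarations."""
    for s, p, o in triples:
        if p in ranges:
            yield (o, "rdf:type", ranges[p])

graph = [("ex:MobyDick", "foaf:maker", "ex:HermanMelville")]
# Whatever else we know about the object, it is now entailed to be an Agent:
print(list(entail_by_range(graph, RANGES)))
```

Nothing here ever rejects a triple; the only thing a range declaration can do is add to what the graph says about the world.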


CC-BY 2.0 "Randy Son of Robert" via Wikimedia Commons

Doing something about the "sad" example


In my last post, I described a "sad" example where using the term foaf:depiction in a triple implied that a book was an image.  I said that there was nothing wrong with that - perhaps the creator of the triple intended to extend the concept of "image" to include books.  If I don't like that, I can actually do something about it.  Web Ontology Language (OWL) provides the term owl:disjointWith as a means to state that it is inconsistent for a particular resource to be an instance of two particular classes.  I can program my semantic client to accept the rdfs-interpretation (which introduces the notion of classes) and the owl-interpretation (which considers the notion of disjointness) of RDF, then feed it these two triples:

foaf:Image owl:disjointWith bibo:Book.
<http://dbpedia.org/resource/Moby-Dick> rdf:type bibo:Book.

With them, I state that in my world, images aren't books, and Moby Dick is a book.  Now if my semantic client encounters the problematic triple:

<urn:lsid:ubio.org:namebank:111731>
foaf:depiction <http://dbpedia.org/resource/Moby-Dick>.


which entails (due to the range declaration of foaf:depiction) that

<http://dbpedia.org/resource/Moby-Dick> rdf:type foaf:Image.

my semantic client will detect an inconsistency and take some kind of action (inform me, spit out the offending triple, etc.).  Pretty cool, eh?
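Here is a toy Python sketch of the disjointness check such a client could perform. Triples are again tuples of prefixed-name strings (the dbpedia: prefix stands in for the full DBpedia IRI); a real reasoner would of course do far more:

```python
def find_disjointness_clashes(triples):
    """Report resources typed as instances of two classes declared disjoint."""
    disjoint = set()   # unordered pairs of disjoint classes
    types = {}         # resource -> set of asserted/entailed classes
    for s, p, o in triples:
        if p == "owl:disjointWith":
            disjoint.add(frozenset((s, o)))
        elif p == "rdf:type":
            types.setdefault(s, set()).add(o)
    clashes = []
    for resource, classes in types.items():
        for pair in disjoint:
            if pair <= classes:  # resource is in both disjoint classes
                clashes.append((resource, sorted(pair)))
    return clashes

graph = [
    ("foaf:Image", "owl:disjointWith", "bibo:Book"),
    ("dbpedia:Moby-Dick", "rdf:type", "bibo:Book"),
    # entailed by the range declaration of foaf:depiction:
    ("dbpedia:Moby-Dick", "rdf:type", "foaf:Image"),
]
print(find_disjointness_clashes(graph))
```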

http://www.tdwg.org/

More unhappiness, unfortunately


Using declarations of disjointness seems like a great way to make it possible to detect when someone makes an assertion that doesn't make sense with my view of the world.  Unfortunately it is also pretty easy to unknowingly make assertions that introduce inconsistencies if one carelessly uses terms that are loaded with semantics.  The well-known FOAF vocabulary is popular for expressing relationships involving agents (such as people and organizations).  I could use it to say some things about Biodiversity Information Standards (TDWG).  For example, if I would like to let people know that TDWG is an organization, I could assert:

<http://www.tdwg.org/> rdf:type foaf:Organization.

Unfortunately, it is also likely that somebody else might assert

<http://www.tdwg.org/> rdf:type foaf:Document.

because http://www.tdwg.org/ is the web address for the TDWG homepage.  This assertion could be made directly, or could be entailed by using http://www.tdwg.org/ as the object in a triple with a predicate like foaf:homepage that has a range of foaf:Document.   The reason I would be likely to get into trouble with my assertion is that the FOAF specification declares:

foaf:Document owl:disjointWith foaf:Organization.

If I program my semantic client to throw a fit whenever it encounters an inconsistency, I can trigger such a fit through one moment of carelessness.  I probably should have used an IRI different from the TDWG homepage URL to identify TDWG, but as far as I know a consensus IRI identifying TDWG doesn't exist.  I probably also should have been more careful to make sure that I knew what I was doing when I used the FOAF vocabulary. 


Tim B-L photo by Paul Clarke CC-BY 2.0 via Wikimedia Commons

Insidious unhappiness

Hogan et al. (2009) provides another example [2] of how easy it is to unintentionally introduce inconsistencies using the FOAF vocabulary. Their client performed a Web crawl and discovered the following triples in the wild:

w3:timbl rdf:type foaf:Person.
w3:w3c rdf:type foaf:Organization.


That's perfectly reasonable, because Tim Berners-Lee is a person and the W3C is an organization.  They also discovered these triples:

w3:timbl foaf:homepage  <http://w3.org/>.
w3:w3c foaf:homepage <http://w3.org/>.

These triples say that the web page http://w3.org/ is the homepage of both Tim Berners-Lee and the W3C.  That sounds innocent enough.  However, the FOAF vocabulary asserts the following property for foaf:homepage:

foaf:homepage rdf:type owl:InverseFunctionalProperty.

The W3C Wiki provides this description of an inverse functional property: "If the predicate has the 'InverseFunctionalProperty', than that means that wherever you see the (subject) linked to an (object) by this particular (predicate), then the (subject) is the one and only (subject) with that (object) connected by the (predicate.)"  I suppose it makes sense to declare that the property foaf:homepage is inverse functional, since a homepage could be considered to always be about one particular thing.  However, making an owl:InverseFunctionalProperty declaration for foaf:homepage entails that if two things have the same homepage, they ARE the same thing.  In other words, saying that http://w3.org/ is the homepage of both Tim Berners-Lee and the W3C entails that they are equivalent.  This is the same thing that would be accomplished using the owl:sameAs property:

w3:timbl owl:sameAs w3:w3c.
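The inverse functional inference can be sketched like this in Python (again a toy model with triples as tuples; a real OWL reasoner handles many more cases than shared objects of a single predicate):

```python
def entail_same_as(triples):
    """Infer owl:sameAs pairs from shared objects of inverse functional properties."""
    # Collect the predicates declared inverse functional:
    ifps = {s for s, p, o in triples
            if p == "rdf:type" and o == "owl:InverseFunctionalProperty"}
    seen = {}       # (predicate, object) -> first subject observed
    inferred = []
    for s, p, o in triples:
        if p in ifps:
            key = (p, o)
            if key in seen and seen[key] != s:
                # Two distinct subjects share an object of an IFP:
                # they are entailed to be the same resource.
                inferred.append((seen[key], "owl:sameAs", s))
            else:
                seen[key] = s
    return inferred

graph = [
    ("foaf:homepage", "rdf:type", "owl:InverseFunctionalProperty"),
    ("w3:timbl", "foaf:homepage", "http://w3.org/"),
    ("w3:w3c", "foaf:homepage", "http://w3.org/"),
]
print(entail_same_as(graph))
```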

Equating Tim Berners-Lee with the W3C through use of an inverse functional property has two somewhat nasty consequences.  The first one is similar to the problem we had in the previous example.  The FOAF vocabulary asserts that:

foaf:Organization owl:disjointWith foaf:Person.

i.e., it is inconsistent for a person also to be an organization.  Yet using foaf:homepage in the manner above has entailed that w3:timbl is the same as w3:w3c.  That in turn entails that Tim Berners-Lee is an organization and that the W3C is a person.  So a semantic client could detect an inconsistency in a graph containing the six triples mentioned above in this example.

The second nasty consequence is that once w3:timbl is reasoned to be equivalent to w3:w3c, any properties discovered or reasoned for Tim Berners-Lee would apply to the W3C and vice-versa.  For instance, if it were expressed in RDF that Tim Berners-Lee had a toothache, that would entail that the W3C had a toothache.  If it were expressed in RDF that the W3C had gone bankrupt and ceased to exist, that would entail that Tim Berners-Lee had gone bankrupt and ceased to exist.  These entailed statements would be nonsensical in any world that we hoped would resemble "the real world". 
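Propagating properties across an owl:sameAs link can be sketched like this; the ex: properties are hypothetical stand-ins for "has a toothache" and "went bankrupt":

```python
def propagate(triples, a, b):
    """Copy every property of a to b and vice versa, as owl:sameAs entails."""
    extra = []
    for s, p, o in triples:
        if s == a:
            extra.append((b, p, o))
        elif s == b:
            extra.append((a, p, o))
    return extra

graph = [("w3:timbl", "ex:hasToothache", "true"),
         ("w3:w3c", "ex:wentBankrupt", "true")]
# Once w3:timbl owl:sameAs w3:w3c has been inferred, each resource
# inherits the other's properties - toothaches, bankruptcies, and all:
print(propagate(graph, "w3:timbl", "w3:w3c"))
```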

What's a client to do?


The last example showed that inconsistent use of foaf:homepage by a couple of data providers can have some bad consequences for a client that does unrestricted reasoning on any triples it discovers.  This calls to mind the warning given in the RDF Concepts document that I quoted in the last blog post:
RDF does not prevent anyone from making assertions that are nonsensical or inconsistent with other statements, or the world as people see it. Designers of applications that use RDF should be aware of this and may design their applications to tolerate incomplete or inconsistent sources of information.
So if I were programming a semantic client, I would have several options to avoid the unpleasant effects of unrestricted reasoning of the sort described in the examples above:
  • avoid the owl-interpretation of RDF.  That would protect my client from scary effects of OWL terms like owl:InverseFunctionalProperty, but would also prevent my client from doing almost any kind of useful or interesting reasoning. 
  • prohibit my client from discovering any triples generated by a provider outside my own organization.  That would reduce the probability of inconsistent use of semantically loaded terms, but would also eliminate most possibilities of using RDF to discover interesting things from other sources of information. 
  • place limits on the types of entailed triples that I allow my client to add to its graph.  These limitations could be put into place by limiting inferencing based on certain categories of terms, or by assessing the reliability or authoritativeness of triples based on their origin.  This is the approach that Hogan et al. (2009) took in designing their SAOR client.  For one thing, they placed limits on circumstances where owl:sameAs inferencing was allowed.  For another, to prevent "ontology hijacking", they disallowed inferences based on unauthoritative statements made about classes and properties that would affect reasoning on those classes and properties.  
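One crude way to picture the "authoritativeness" idea behind the third option: accept an axiom about a term only if it is served from the namespace that coined the term. This Python fragment is a deliberately naive caricature of that rule, not the actual SAOR algorithm of Hogan et al. (2009), which is considerably more nuanced:

```python
def is_authoritative(axiom_subject, source_namespace):
    """Accept an axiom about a term only from the term's own namespace.

    A gross simplification: real systems compare the dereferenced document's
    IRI against the term's IRI, handle redirects, etc.
    """
    return axiom_subject.startswith(source_namespace)

# An inverse declaration about dcterms:hasPart is honored when it comes
# from the DCMI namespace, but rejected as potential "ontology hijacking"
# when served by some unrelated source:
print(is_authoritative("dcterms:hasPart", "dcterms:"))
print(is_authoritative("dcterms:hasPart", "ex:"))
```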

The third option is probably the most useful way to prevent a client from doing "harmful" reasoning.  But at the same time, it might also prevent the client from doing reasoning that might enable discovering useful things.  For example, asserting that

dcterms:hasPart owl:inverseOf dcterms:isPartOf.

would fall under Hogan et al.'s (2009) definition of ontology hijacking because that assertion isn't in the defining DCMI document.  But under certain circumstances that assertion could permit the discovery of interesting information that would otherwise be missed in a query.  Since Hogan et al.'s client was designed to operate on any triples scraped from the wild all over the Internet, the types of inferences it was allowed to make were more restricted than the inferences that a client might be allowed to make on a more controlled set of data sources.
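What owl:inverseOf inferencing would buy can be sketched like this; ex:anthology and ex:poem are hypothetical resources used only for illustration:

```python
def entail_inverses(triples):
    """Materialize the inverse triples licensed by owl:inverseOf declarations."""
    inverses = {}
    for s, p, o in triples:
        if p == "owl:inverseOf":
            inverses[s] = o
            inverses[o] = s
    # For every triple whose predicate has a declared inverse, entail the
    # reversed triple using the inverse predicate:
    return [(o, inverses[p], s) for s, p, o in triples if p in inverses]

graph = [
    ("dcterms:hasPart", "owl:inverseOf", "dcterms:isPartOf"),
    ("ex:anthology", "dcterms:hasPart", "ex:poem"),
]
print(entail_inverses(graph))
```

A query asking which things ex:poem is part of would miss the answer without this inference, since the provider only asserted the hasPart direction.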

Good grief! Don't you have anything to say that's not sad??!!

CC BY Steve Jurvetson at http://www.flickr.com/photos/jurvetson



"Don't worry, be happy." Bobby McFerrin

From the tone of this blog post, one might get the impression that I have nothing positive to say about terms onto which semantics have been imposed.  Not so.  Happily, I can shamelessly promote Darwin-SW, the vocabulary/ontology that Cam Webb and I developed to try to advance the use of RDF in the biodiversity informatics community.  Darwin-SW contains a boatload of owl:disjointWith statements where we declare things like "a Location isn't a TaxonConcept", "an Event isn't an Identification", etc. [3]  It also contains numerous object properties with range and domain declarations that would entail that resources linked by them would be instances of particular Darwin Core classes (regardless of whether a provider making use of Darwin-SW declared those types explicitly or not).  In essence, a client that ingested triples from Darwin-SW would be accepting a "world view" that is based on a particular model of biodiversity-related classes and the connections between them.  (See this submitted paper for more on that model.)  A client could "know" when a provider of triples describing biodiversity resources used Darwin-SW properties in a manner that was inconsistent with the model on which Darwin-SW was built, because the provided triples would entail that resources were instances of multiple disjoint classes, and thus render the resulting graph inconsistent.

This approach (using range declarations to generate inconsistencies when object properties are used in a manner for which they were not intended) has been criticized as too limiting.  But in the absence of another mechanism to determine whether data conform to established patterns (refer to the work on RDF validation and "shapes"), this is one way that a semantic client can know that "something is rotten in Denmark" (sorry about that one, GBIF!).  If a provider doesn't like the model on which Darwin-SW was based, that provider should go looking for another ontology that describes the world in a way that they like better.

Summary

In the previous blog post, I said that what a semantic client "knows" consists of the triples in its graph, and that a client "learns" by adding triples to its graph directly, or by inferring other triples that are entailed by existing triples.  This post discusses another thing that semantic clients can "know": whether a graph is consistent (there are possible worlds in which all the statements of the graph are jointly true) or inconsistent (there is no world that can be described by all the triples in the graph).  A graph can be rendered inconsistent by:
  • simple errors (e.g. bad datatyping of literals), 
  • by careless use of terms having semantics that aren't understood by the provider of the triples, or 
  • by combining triples that were created by providers who have conflicting views of the world.  
Programmers of a semantic client should carefully consider what sorts of entailed triples they are willing to allow their client to infer.  This probably requires careful examination of the sources and quality of the triples that the client is likely to ingest, and the likelihood that various providers will have a consistent view of the world. 

Next up: querying and "knowing", plus some unhappiness involving owl:sameAs and other properties of equivalence.

Endnotes

[1] In RDF 1.0, clashes involving XML datatyped literals arose under the rdfs-interpretation.  (See http://www.w3.org/TR/2004/REC-rdf-mt-20040210/#dtype_interp).  However, in RDF 1.1, datatype D-entailment is a direct extension to basic RDF, so clashes can occur under any interpretation of RDF.  (See http://www.w3.org/TR/rdf11-mt/#literals-and-datatypes)

[2] Example from section 3.1 of Aidan Hogan, Andreas Harth and Axel Polleres.  Scalable Authoritative OWL Reasoning for the Web.  International Journal on Semantic Web and Information Systems, 5(2), pages 49-90, April-June 2009.
http://www.deri.ie/fileadmin/documents/DERI-TR-2009-04-21.pdf

[3] Darwin-SW is somewhat in limbo at the moment (2014-05-04) because it is built upon classes in the TDWG Darwin Core standard and there is a stalled proposal to clarify the definitions of Darwin Core classes.  Until that proposal is resolved, Darwin-SW is necessarily somewhat unstable. 
