Steve Baskauf's blog: May 2014

Dunce cap by Rept0n1x, Wikimedia Commons BY-SA

The term duns or dunce became ... a synonym for one incapable of scholarship.

Wikipedia article on "Dunce"

Meaning and "understanding"

The aspect of RDF that makes it more than just a markup system is that triples are intended to "mean" something. In the most raw sense of RDF, one can think of the subject and object of a triple as entities that we would like to describe in some way, while the predicate represents some kind of relationship that connects them. If we adopt the perspective of the rdfs-interpretation of RDF (see the second post in this series for more on interpretations of RDF), we can consider the predicate to represent a property that is used to describe the subject resource (which also implies that we consider the subject to be a thing that we wish to describe). The object in the triple can be considered to be a value of the property. RDFS is a convenient semantic extension of RDF, and one that is likely to be accepted by many RDF users since it is consistent with the outlook of commonly used vocabularies such as Dublin Core (see the DCMI Abstract Model, which was designed to be compatible with RDFS) and, within our own biodiversity informatics community, Darwin Core (see the Darwin Core Abstract Model, which is roughly based on Dublin Core).

So buying into the rdfs-interpretation of RDF gives us properties. It also defines the notion of classes and instances of classes. With RDFS, we can not only talk about "things" (resources), but also about "kinds of things" (classes). RDFS defines a special property called rdf:type that is used to connect a resource to the class of which it is an instance [1]. This property is so fundamental, it has a special abbreviation in Turtle and SPARQL: "a". So if we say

<http://bioimages.vanderbilt.edu/baskauf/11143#loc>
     dwc:decimalLatitude "36.14592"^^xsd:decimal;
     a dcterms:Location.

a semantic client can "know" what kind of thing http://bioimages.vanderbilt.edu/baskauf/11143#loc is. That seems very good, because we have increased a machine's ability to "know".

It is tempting to think that if we program our client to make use of the rdfs-interpretation, we are on the way to achieving Tim Berners-Lee's "intelligent agents" (see second post in this series for the quote) because with the notions introduced by RDFS clients can "know" what kind of thing something is and some stuff about the thing's properties. A human looking at the two triples above might "know" that the thing identified by the subject IRI is a location, and that its latitude is N 36.14592 degrees. However, a semantic client's ability to "know" things is actually very limited. The introduction to the RDF (1.0) Semantics document puts the ability of a client to "know" into perspective, and I recommend reading it for further insight on this topic. Particularly relevant are these quotes:

Exactly what is considered to be the 'meaning' of an assertion in RDF or RDFS in some broad sense may depend on many factors, including social conventions, comments in natural language or links to other content-bearing documents. Much of this meaning will be inaccessible to machine processing...

The chief utility of a formal semantic theory is not to provide any deep analysis of the nature of the things being described by the language or to suggest any particular processing model, but rather to provide a technical way to determine when inference processes are valid, i.e. when they preserve truth.

To paraphrase this bluntly, clients are stupid. When we make them "aware" of the rdfs-interpretation of RDF, we don't somehow magically endow them with the ability to "understand" what dwc:decimalLatitude "means" or to "understand" what a dcterms:Location is. We can delude ourselves into thinking that our predicates and classes are meaningful by assigning them clever local names like "decimalLatitude" and "Location" that mean something to humans. But the triples above don't "mean" anything more to a client than

<http://bioimages.vanderbilt.edu/baskauf/11143#loc>
     xq:p2-glwsopgn_2q4as "36.14592"^^xsd:decimal;
     a vr:e33t5pp-98.

Since both terms are HTTP URIs, there may be some hope that "links to other content-bearing documents" might make the terms "mean" something more to clients. Alas, in its defining RDF, the meaning of the term dwc:decimalLatitude is almost entirely imparted by a human language comment. There is virtually nothing there that is meaningful to a semantic client aside from the term's designation as an rdf:Property. Similarly, the defining RDF of dcterms:Location contains little more than a brief, human-readable comment ("A spatial region or named place."), designation as an rdfs:Class, and a declaration that it is a subclass of dcterms:LocationPeriodOrJurisdiction (which itself is defined by a human-readable comment).

So even though RDF triples have the potential to "mean" something, in many cases that "meaning" depends critically on social conventions about what human-readable comments mean to humans.

Making a client smarter

In my last post in this series, I presented a rather dreary picture of the way that a human might have interactions with a semantic client through the mediation of SPARQL queries. In the examples in that depressing post, the work of the client was mostly to make up for failure of humans to reach consensus on identifiers. I'm going to attempt to be more upbeat in this post by talking about how a client might actually "learn" something useful that will help its human partner discover interesting things.

The quote from the RDF Semantics document mentioned three ways that "meaning" might be imparted to RDF:

1. social conventions
2. comments in natural language
3. links to other content-bearing documents

Numbers 1 and 2 are human things. Number 3 has some potential for machines. The dream of the semantic web (see the second blog post in this series) was to enable semantic clients to traverse the web of data and accumulate data that would allow humans to use their intuition and inspiration to do wonderful things with those data. So we might be able to make our clients a little less stupid if we enable them to go out and discover more information that is linked to what they already know. In my third blog post in this series, I asserted that clients "learned" when they added triples to their graph. What kinds of properties would enable a client to assemble triples that would add "meaning" that could be leveraged by humans?

To think about this question, I'm going to use the following example [2] :

@prefix po: <http://owlfiles.plantontology.org/PO_> .
@prefix dsw: <http://purl.org/dsw/> .
@prefix dwc: <http://rs.tdwg.org/dwc/terms/> .
<http://bioimages.vanderbilt.edu/uncg/14>
     dwc:establishmentMeans "cultivated";
     a po:0000003;
     dsw:hasDerivative <http://herbarium.unc.edu/592810>.

The properties in this example fall into three categories that will be described in the following sections.

Datatype properties

The first category of property includes those that have literals as objects. Web Ontology Language (OWL) introduces the notion of a class of properties called datatype properties that link instances (also called "individuals" in OWL) to data values in the form of datatyped literals. For convenience, I'm going to refer to properties that link instances to literals as "datatype properties" even if their type hasn't been explicitly declared as owl:DatatypeProperty.

Although the object of a datatype property is a literal, that literal still denotes some entity. If the literal is accompanied by an explicit datatype IRI, then a client can "understand" the "meaning" of that literal by applying a lexical-to-value mapping (assuming the client's programming includes the capability to do such mapping). So in the case of a typed literal such as "5"^^xsd:integer, the client can "know" that the literal denotes the abstract number five, rather than the character string "5". If the literal isn't accompanied by an explicit datatype IRI (i.e., is an untyped literal), it is by default a member of the class xsd:string. Therefore the entity that the literal denotes is the string itself. In other words, in the example, the resource that the untyped literal "cultivated" denotes is not a conceptual entity for "cultivatedness" but rather is a ten character string composed of the characters "c", "u", "l", "t", "i", "v", "a", "t", "e", and "d".
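The lexical-to-value mapping described above can be sketched in a few lines of Python. This is only an illustration, not how any particular RDF library works: the `literal_value` function and the `LEXICAL_TO_VALUE` table are hypothetical names, and only three XSD datatypes are covered.

```python
from decimal import Decimal

XSD = "http://www.w3.org/2001/XMLSchema#"

# Illustrative subset of lexical-to-value mappings, keyed by datatype IRI.
LEXICAL_TO_VALUE = {
    XSD + "integer": int,
    XSD + "decimal": Decimal,
    XSD + "string": str,
}

def literal_value(lexical_form, datatype_iri=None):
    """Return the value that a literal denotes.

    An untyped literal is treated as an xsd:string, so it denotes itself.
    """
    if datatype_iri is None:
        datatype_iri = XSD + "string"
    return LEXICAL_TO_VALUE[datatype_iri](lexical_form)

five = literal_value("5", XSD + "integer")             # the number five
latitude = literal_value("36.14592", XSD + "decimal")  # a decimal number
means = literal_value("cultivated")                    # just a character string
```

In this sketch `five` can take part in arithmetic, while `means` is nothing more than a string of ten characters; nothing in the mapping gives the client any notion of "cultivatedness".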

From the standpoint of discovery, datatype properties are a sort of "dead end". This is true because it is not permissible in RDF to use a literal as the subject of a triple. That means that one can't make additional statements in RDF describing the entity that is described by the literal. In the case of a typed literal such as "5"^^xsd:integer, that probably isn't particularly important since the properties of the entity "the number five" are pretty well known. But in the case of a literal like "cultivated", the client is really in the dark. When we provide the property/value pair dwc:establishmentMeans "cultivated", we intend it to mean that the subject came to exist through the actions of humans. But there is no way for a client to know that. It would be nice if the client at least knew that "cultivated" meant the same thing as "managed" (GBIF's preferred value). When we had this kind of problem with multiple IRIs in the examples of my previous post, we used owl:sameAs to declare equivalence. But since literals can't be the subject of triples, we can't say:

"cultivated" owl:sameAs "managed".

Even if we were allowed to make a statement like that, it would be a bad idea because the statement would be incorrect; the two strings that the literals denote are not equivalent.

This isn't to say that having a client discover datatype properties of a resource is useless for a human that would like to query the graph assembled by the client. Strings can be very convenient for simple searches. But unless a community adopts and adheres strictly to a controlled vocabulary, there may be many literals that are intended to represent the same thing (see this blog post by John Wieczorek if you want to know how bad things can get). The bottom line is that datatype properties don't do much for enabling interesting new inferences or discovery of new information.

cul-de-sac

rdf:type and classes

The second property category contains the single property: rdf:type. As I noted earlier, rdf:type and the notion of class membership is an important aspect of the rdfs-interpretation of RDF since (at least in theory) it allows a semantic client to know more about what a resource "is". In the case of the triple

<http://bioimages.vanderbilt.edu/baskauf/11143#loc> a dcterms:Location.

I noted that there was very little useful information that a semantic client could "learn" by dereferencing the IRI dcterms:Location (i.e. http://purl.org/dc/terms/Location), although a human could guess something about the class because the IRI contains the local name "Location". The situation is a bit different with this triple:

<http://bioimages.vanderbilt.edu/uncg/14> a po:0000003.

Since the IRI po:0000003 (i.e. http://owlfiles.plantontology.org/PO_0000003) is opaque, a human looking at the triple will have no idea what kind of thing http://bioimages.vanderbilt.edu/uncg/14 is without dereferencing the po:0000003 IRI. The human dereferencer can discover an rdfs:label property with a value of "whole plant" and know that it's a plant. But a semantic client can know more.

OBO Ontologies

The term po:0000003 is part of the Plant Ontology, which is one of the Open Biological and Biomedical Ontologies ("OBO" ontologies). The Plant Ontology is typical for an OBO ontology in that it is very focused on classes. As of 2014-05-19, the OWL version of the Plant Ontology contained descriptions of 1691 classes and 10 object properties. The object properties are designed to relate the classes in ways such as "part of", "located in", "adjacent to", etc. It also makes heavy use of what it refers to as the "is a" property. In the raw RDF, that translates into rdfs:subClassOf. The Plant Ontology does not define terms that are particularly useful for relating instances [3]. So the Plant Ontology is very useful for talking generically about plants and their features (classes) but is much less useful for talking about particular individual plants (instances).

Here are some things that a semantic client can "learn" (i.e., here are triples that it can add to its graph) about the class of whole plants by exploring the Plant Ontology:

po:0000003 rdfs:subClassOf po:0009011.    # plant structure
po:0009011 rdfs:subClassOf po:0025131.    # plant anatomical entity
po:0000003 rdfs:subClassOf <http://owlfiles.plantontology.org/PO>.    # the superclass of all Plant Ontology classes

There are six owl:disjointWith declarations which entail that it is not consistent for a resource to be both a whole plant and also a plant organ, collective organ part structure, portion of plant tissue, collective plant structure, cardinal organ part, or vascular system. There are also some properties that provide alternative names for the class, but which wouldn't be much benefit to a machine.

If I let my client ingest the subclass properties, what do I get in terms of "learning" on the part of my client? Putting it another way, what kinds of entailed triples could my client materialize, based on the new triples it discovered by dereferencing po:0000003, that would be relevant to my understanding of http://bioimages.vanderbilt.edu/uncg/14? Here are three:

<http://bioimages.vanderbilt.edu/uncg/14> a po:0009011.
<http://bioimages.vanderbilt.edu/uncg/14> a po:0025131.
<http://bioimages.vanderbilt.edu/uncg/14>
a <http://owlfiles.plantontology.org/PO>.

As a human, these triples might be useful for me in a SPARQL query if I wanted to restrict my search to things that fall into those classes. For example, if I wanted to search only for things that were plant anatomical entities, I could do so and http://bioimages.vanderbilt.edu/uncg/14 would come up.

If I let my client ingest the disjointness properties, what do I get in terms of "learning" on the part of my client? My client becomes able to detect that its graph is inconsistent if http://bioimages.vanderbilt.edu/uncg/14 is ever discovered or inferred to be a vascular system (or any of the other 5 disjoint classes).
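A minimal Python sketch of the consistency check described above. The Plant Ontology IRIs of the disjoint classes are not listed in this post, so the class labels are used here as stand-in identifiers; the function name `inconsistent_resources` is hypothetical, and a real OWL reasoner does far more than this.

```python
RDF_TYPE = "rdf:type"
PLANT = "http://bioimages.vanderbilt.edu/uncg/14"

# Labels stand in for class IRIs (an assumption made for readability).
WHOLE_PLANT = "whole plant"
DISJOINT_WITH_WHOLE_PLANT = {
    "plant organ",
    "collective organ part structure",
    "portion of plant tissue",
    "collective plant structure",
    "cardinal organ part",
    "vascular system",
}

def inconsistent_resources(triples):
    """Return resources typed both as a whole plant and as a disjoint class."""
    types = {}
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            types.setdefault(subject, set()).add(obj)
    return sorted(
        subject
        for subject, classes in types.items()
        if WHOLE_PLANT in classes and classes & DISJOINT_WITH_WHOLE_PLANT
    )

graph = {(PLANT, RDF_TYPE, WHOLE_PLANT)}
before = inconsistent_resources(graph)   # nothing is flagged yet

# Suppose the client later discovers or infers this triple.
graph.add((PLANT, RDF_TYPE, "vascular system"))
after = inconsistent_resources(graph)    # the plant is now flagged
```

Note that the check can only say that something is wrong with the graph; it cannot say which of the two type assertions is the mistaken one.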

So in summary, an rdf:type declaration to a class from an OBO ontology can enable a client to assemble some triples that would add "meaning" that I could leverage. However, I would need to carefully consider whether those added triples actually help me enough to make up for the added danger of possible undesired effects (such as unintentionally rendering the graph inconsistent).

Object properties

The third category of property includes those that have IRIs as objects. OWL introduces the notion of a class of properties called object properties that link subject instances (a.k.a. "individuals" sensu OWL) to object instances. (Note: these instances do not have to be identified using IRIs - they can be anonymous or "blank" nodes. But without being identified by an IRI, one cannot establish a link to the resource from outside the graph in which it is initially described. Since this discussion is about discovery of useful information from elsewhere, I'm limiting this category to IRI-identified instances.) For convenience, I'm going to refer to properties that link subject instances to IRI-identified object instances as "object properties" even if their type hasn't been explicitly declared as owl:ObjectProperty.

The example contained this triple:

<http://bioimages.vanderbilt.edu/uncg/14>
     dsw:hasDerivative <http://herbarium.unc.edu/592810>.

A human could do some guessing about the meaning of this triple by reading the hasDerivative local name and from the subdomain "herbarium...", but would have a clearer understanding of the intended use of dsw:hasDerivative by reading the Darwin-SW documentation (full disclosure: Cam Webb and I wrote Darwin-SW).

A semantic client that dereferenced the property's IRI would learn that it had an inverse property dsw:derivedFrom and that it was transitive. Both of these properties of dsw:hasDerivative could allow the client to materialize potentially useful triples. Even if the client and its human handler were completely clueless about what dsw:derivedFrom actually "meant", they would at least know that http://bioimages.vanderbilt.edu/uncg/14 was linked in some way to http://herbarium.unc.edu/592810. If the client were able to dereference the object IRI [2], it could discover triples like these:

<http://herbarium.unc.edu/592810>
     a dwctype:PreservedSpecimen;
     a dcmitype:PhysicalObject;
     dcterms:created "2010-10-18"^^xsd:date;
     dwcuri:inCollection <http://biocol.org/urn:lsid:biocol.org:col:15495>;
     dsw:evidenceFor <http://herbarium.unc.edu/592810#occ>.

These types are relatively well-known, so that is of some use. There are also object properties that lead to other resources, whose RDF description could be obtained, e.g.

<http://herbarium.unc.edu/592810#occ>
     a dwctype:Occurrence;
     dwcuri:recordedBy <http://bioimages.vanderbilt.edu/contact/kirchoff#masone>;
     dwc:recordNumber "6";
     dsw:atEvent <http://bioimages.vanderbilt.edu/specimen/ncu592810#eve>.

which also has a relatively well-known type and an object property that links to the person who recorded it, which could also be dereferenced, etc., etc.

The point here is that it is object properties that enable the kind of links that permit a semantic client to discover interesting triples that may not have been previously known to the client's human buddy. The subdomain of the starting subject resource's IRI was bioimages.vanderbilt.edu, but the object properties connect that resource to other resources whose IRIs are managed by herbarium.unc.edu, and biocol.org. I'm likely to be aware of things happening with resources in the bioimages.vanderbilt.edu subdomain because I manage it. But I could be surprised by metadata served by other providers (unpleasantly surprised in the case of biocol.org since most of the triples that were once there aren't being served anymore!).

Image by Kowloonese from Wikimedia Commons. Public domain.

Road analogy

We can think about our client as a car driving around looking for information. The client starts somewhere (the subject resource) and starts driving to look around. If it drives down a "datatype property road", it hits a dead end. It might find something useful there, but the road doesn't take it anywhere else. If it drives down an "rdf:type road", it comes to a cul-de-sac. The size of the cul-de-sac could range from a dead end to a large loop, with varying amounts of information to be discovered. If the client drives down an "object property road", there is no way to know the number and length of side-streets that it might encounter, so there is no particular limit to what it might find by driving down those streets.

Returning to the original question I raised, "What kinds of properties would enable a client to assemble triples that would add "meaning" that could be leveraged by humans?", I would assert that object properties that link instances are the most productive sort if our goal is to discover things about biodiversity data that are interesting and novel.
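The road analogy can be made concrete with a small Python sketch of a link-following client. Everything here is a simplification under stated assumptions: `MOCK_WEB` is a hypothetical in-memory stand-in for HTTP dereferencing (no real requests are made, and hash IRIs are treated as if they were served separately), literals are marked with a hypothetical `Literal` class, and class IRIs reached through rdf:type are simply not followed.

```python
from collections import deque

class Literal(str):
    """Marks an object as a literal, i.e. a dead end for link-following."""

PLANT = "http://bioimages.vanderbilt.edu/uncg/14"
SPECIMEN = "http://herbarium.unc.edu/592810"
OCCURRENCE = "http://herbarium.unc.edu/592810#occ"
EVENT = "http://bioimages.vanderbilt.edu/specimen/ncu592810#eve"

# Hypothetical stand-in for the web: IRI -> triples served for that IRI.
MOCK_WEB = {
    PLANT: [
        (PLANT, "rdf:type", "po:0000003"),
        (PLANT, "dwc:establishmentMeans", Literal("cultivated")),
        (PLANT, "dsw:hasDerivative", SPECIMEN),
    ],
    SPECIMEN: [
        (SPECIMEN, "rdf:type", "dwctype:PreservedSpecimen"),
        (SPECIMEN, "dsw:evidenceFor", OCCURRENCE),
    ],
    OCCURRENCE: [
        (OCCURRENCE, "dwc:recordNumber", Literal("6")),
        (OCCURRENCE, "dsw:atEvent", EVENT),
    ],
}

def crawl(start):
    """Follow object-property links breadth-first, accumulating triples."""
    graph, visited, queue = set(), {start}, deque([start])
    while queue:
        iri = queue.popleft()
        for triple in MOCK_WEB.get(iri, []):
            graph.add(triple)
            _, predicate, obj = triple
            # Literals lead nowhere; class IRIs are not explored in this sketch.
            if isinstance(obj, Literal) or predicate == "rdf:type":
                continue
            if obj not in visited:
                visited.add(obj)
                queue.append(obj)
    return graph, visited

graph, visited = crawl(PLANT)
```

Starting from a single subject IRI, the sketch accumulates triples from three mock documents, and every step to a new document is made through an object property. The two literals are recorded but never lead anywhere.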

Back to SPARQL

In my last post, I said that the most obvious way that a human could "learn" by interacting with a semantic client was by conducting a query using SPARQL (the query language developed by the W3C specifically for use with RDF). Based on what I've said so far in this post, I'm going to flesh this out a bit more.

I have tried to establish that clients are fundamentally stupid and that humans are fundamentally smart. So let's let the division of labor reflect this. The job of the client is to amass a giant blob of triples into its graph and then materialize other triples that the amassed triples entail (subject to the restrictions that its human master puts upon it; see cautionary tales in the third post in this series). The job of the human is to figure out queries that would leverage the meaning placed upon the triples in order to discover useful information about the "world" described by the graph assembled by the client. In the case of humans that are biodiversity informaticians, that world will hopefully be the "real" world in which we live. The SPARQL endpoint mediates the interaction between the human querier and the client.

At its core, SPARQL is really just asking the client to do pattern matching, and by the time we get to the query stage in the process, the client is essentially "done" thinking (i.e. it should have materialized entailed triples before the query is made). So it's up to the human to come up with triple patterns that can represent a restriction based on a statement that has "real world" meaning. The human is only going to be successful in doing that if the predicates in the triple patterns and the classes used for typing have a clear, consensus meaning for both the querier and the data providers who generated the triples. No Semantic Web "magic" is going to fix the problem if there isn't a common understanding of the meaning of the terms.
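To show how little "thinking" is involved in the pattern matching described above, here is a minimal Python sketch of matching a list of triple patterns against a graph. It is not a SPARQL implementation: the `match` function is hypothetical, terms are plain strings, and any term beginning with "?" is treated as a variable.

```python
PLANT = "http://bioimages.vanderbilt.edu/uncg/14"
SPECIMEN = "http://herbarium.unc.edu/592810"

GRAPH = [
    (PLANT, "rdf:type", "po:0000003"),
    (PLANT, "dwc:establishmentMeans", "cultivated"),
    (PLANT, "dsw:hasDerivative", SPECIMEN),
    (SPECIMEN, "rdf:type", "dwctype:PreservedSpecimen"),
]

def match(patterns, graph, binding=None):
    """Yield every variable binding that satisfies all triple patterns."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in graph:
        candidate = dict(binding)
        matched = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against its earlier binding.
                if candidate.setdefault(term, value) != value:
                    matched = False
                    break
            elif term != value:
                matched = False
                break
        if matched:
            yield from match(rest, graph, candidate)

def query(establishment_means):
    """Build the triple patterns for plants with derived specimens."""
    return [
        ("?resource", "rdf:type", "po:0000003"),
        ("?resource", "dwc:establishmentMeans", establishment_means),
        ("?resource", "dsw:hasDerivative", "?specimen"),
        ("?specimen", "rdf:type", "dwctype:PreservedSpecimen"),
    ]

cultivated_hits = list(match(query("cultivated"), GRAPH))
managed_hits = list(match(query("managed"), GRAPH))
```

With this toy graph, the pattern using "cultivated" binds the plant and its specimen, while the same pattern using "managed" binds nothing. The matcher has no way of knowing that the two strings were intended to mean the same thing.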

Here's an example query that would find cultivated plants from which specimens were derived:

PREFIX po: <http://owlfiles.plantontology.org/PO_>
PREFIX dsw: <http://purl.org/dsw/>
PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>
PREFIX dwctype: <http://rs.tdwg.org/dwc/dwctype/>

SELECT ?resource
WHERE {
     ?resource a po:0000003.
     ?resource dwc:establishmentMeans "managed".
     ?resource dsw:hasDerivative ?specimen.
     ?specimen a dwctype:PreservedSpecimen.
}

I like this query because it's nice and simple. I don't like this query for several reasons.

TDWG doesn't actually recognize organisms as a class (yet), nor is it clear what kind of thing has dwc:establishmentMeans as a property (organism? occurrence? ??? There was a long and painful tdwg-content thread about this; I'm too lazy to look it up).
Darwin-SW isn't any kind of ratified standard, but then there aren't any object properties to connect Darwin Core classes because TDWG hasn't worked out anything like a domain model. So at the present we have to use something non-standard.
There is a proposal (stalled, as usual) before TDWG to clarify the Darwin Core classes and their namespaces, so I don't actually know if the class will be dwctype:PreservedSpecimen or dwc:PreservedSpecimen.

Darn. Things were looking so hopeful. OK, let's pretend that either TDWG gets its act together, or everybody starts thinking about things just like me (neither one is likely). If we were to work things out, we could make this query more cool by stringing together more object properties:

PREFIX po: <http://owlfiles.plantontology.org/PO_>
PREFIX dsw: <http://purl.org/dsw/>
PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>
PREFIX dwctype: <http://rs.tdwg.org/dwc/dwctype/>
PREFIX dcmitype: <http://purl.org/dc/dcmitype/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
# dwcuri: also needs a PREFIX declaration; its namespace IRI is not given in this post

SELECT ?resource
WHERE {
     ?resource a po:0000003.
     ?resource dwc:establishmentMeans "managed".
     ?resource dsw:hasDerivative ?specimen.
     ?specimen a dwctype:PreservedSpecimen.
     ?specimen foaf:depiction ?image.
     ?image a dcmitype:StillImage.
     ?specimen dsw:evidenceFor ?occurrence.
     ?occurrence dwcuri:recordedBy ?collector.
     ?collector foaf:familyName "Smith".
     ?resource dsw:hasIdentification ?id.
     ?id dwc:genus "Quercus".
}

Now the query asks for cultivated plants identified as oaks from which imaged specimens were collected by a person named "Smith". Notice that in the query, each datatype property is a dead end because the literal can't be put in the subject position of another triple pattern. Each type declaration ("a") is also a dead end because we will probably want to put a fixed IRI in the object position. It is the object properties that allow our query to search through the "web of data".

The ability to conduct interesting queries of this sort is fundamentally going to depend on humans to designate the questions that they want to explore. The semantic client will assemble the triples, but it isn't smart enough to create the queries.

Linking and the Biocollections Ontology (BCO)

As I noted above, a serious problem is that there are currently no object properties in any TDWG standard that could be used to provide the kinds of linkages that I've argued are so important to enable the discovery of interesting things using RDF. The stalled Darwin Core RDF Guide provides a few (e.g. dwcuri:recordedBy) but does not link the core classes (e.g. Occurrence, Identification, Event, etc.). It has been suggested repeatedly that developing the Biological Collections Ontology (BCO; an OBO-like ontology) would be a way forward for solving this problem. So I'd like to examine the ontology's potential for facilitating linking.

A paper on the BCO [4] (of which I was a co-author) implied that development of that ontology provided a way to link diverse instance data. During the discussion of the manuscript, I questioned the necessity of creating a complex ontology to link instance data. The manuscript claimed that combining BCO with datasets would answer important questions, and in the supporting information section provided several use-cases that required linking information. It was not clear to me how terms from the ontology would facilitate queries that would address these use cases. I recommended that we include sample SPARQL queries to show how this could be done, but in the end none were ever put into the paper. I was not interested in holding up what I felt was otherwise a good paper, so I dropped my objections. Since that time, it was again suggested in a talk at the 2013 TDWG meeting that development of the BCO was an important step towards linking biodiversity data. Again, no actual queries were provided to show how this might be accomplished.

Let's take a look at Figure 3 from the paper (also discussed at about 10 minutes into the video of the talk). It suggests linking an insect instance to a taxon instance by asserting that an "identification using key" instance has a specified input that was the insect, and a specified output that was the taxon. From the latest stable release of the BCO on 2014-05-21 a client could "learn" the following about "has specified input" and "has specified output":

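@prefix bfo: <http://purl.obolibrary.org/obo/BFO_> .
@prefix obi: <http://purl.obolibrary.org/obo/OBI_> .
@prefix ro: <http://purl.obolibrary.org/obo/RO_> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix my: <http://example.org/> .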

obi:0000293 a owl:ObjectProperty, owl:AnnotationProperty;
     rdfs:label "has_specified_input"@en;
     rdfs:domain obi:0000011;
     owl:inverseOf obi:0000295;
     rdfs:subPropertyOf ro:0002233.

obi:0000295 a owl:ObjectProperty, owl:AnnotationProperty;
     rdfs:label "is_specified_input_of"@en;
     rdfs:range obi:0000011;
     rdfs:subPropertyOf ro:0002352.

obi:0000299 a owl:ObjectProperty, owl:AnnotationProperty;
     rdfs:label "has_specified_output"@en;
     rdfs:domain obi:0000011;
     owl:inverseOf obi:0000312;
     rdfs:subPropertyOf ro:0002234.

obi:0000312 a owl:ObjectProperty, owl:AnnotationProperty;
     rdfs:label "is_specified_output_of"@en;
     rdfs:range obi:0000011;
     rdfs:subPropertyOf ro:0002353.

There are other properties, but they would mostly be useful only for humans. Here are some more things a client could discover by exploring the various ontologies related to OBI:

obi:0000011 (a planned process) has superclasses bfo:0000007 (process), bfo:0000003 (occurrent), and bfo:0000001 (entity).
ro:0002233 (has input) is subproperty of ro:0000057(has participant).
ro:0002352 (input of) is subproperty of ro:0000056 (participates in).
ro:0002352 (input of) is subproperty of ro:0002328 (functionally related to).
ro:0002234 (has output) is subproperty of ro:0000057 (has participant).
ro:0002353 (output of) is subproperty of ro:0000056 (participates in).
ro:0002353 (output of) is subproperty of ro:0002328 (functionally related to).

If we link the insect specimen to the taxon using the "has specified output" and "has specified input" properties as was suggested, we have these two triples:

my:identification031 obi:0000299 my:insectTaxon01;
     obi:0000293 my:insect03.

A client that reasoned on the triples from the various ontologies could materialize these 16 entailed triples:

my:identification031 ro:0002234 my:insectTaxon01;
     ro:0000057 my:insectTaxon01.
my:insectTaxon01 obi:0000312 my:identification031;
     ro:0002353 my:identification031;
     ro:0000056 my:identification031;
     ro:0002328 my:identification031.
my:identification031 ro:0002233 my:insect03;
     ro:0000057 my:insect03.
my:insect03 obi:0000295 my:identification031;
     ro:0002352 my:identification031;
     ro:0000056 my:identification031;
     ro:0002328 my:identification031.
my:identification031 a obi:0000011, bfo:0000007, bfo:0000003, bfo:0000001.
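The count of entailed triples can be checked mechanically. The following Python sketch is not a real reasoner: it hard-codes the subproperty, inverse, domain, range, and superclass statements listed above as plain tuples, assumes that the three BFO superclasses form a chain (process, occurrent, entity), and applies a handful of RDFS/OWL-style rules until nothing new appears. The names `step` and `materialize` are hypothetical.

```python
RDF_TYPE = "rdf:type"

# Axioms transcribed from the ontology excerpts above, as (sub, super) pairs.
SUBPROPERTY_OF = [
    ("obi:0000293", "ro:0002233"), ("obi:0000295", "ro:0002352"),
    ("obi:0000299", "ro:0002234"), ("obi:0000312", "ro:0002353"),
    ("ro:0002233", "ro:0000057"), ("ro:0002234", "ro:0000057"),
    ("ro:0002352", "ro:0000056"), ("ro:0002352", "ro:0002328"),
    ("ro:0002353", "ro:0000056"), ("ro:0002353", "ro:0002328"),
]
INVERSE_OF = [("obi:0000293", "obi:0000295"), ("obi:0000299", "obi:0000312")]
DOMAIN = {"obi:0000293": "obi:0000011", "obi:0000299": "obi:0000011"}
RANGE = {"obi:0000295": "obi:0000011", "obi:0000312": "obi:0000011"}
# Assumption: the three BFO superclasses form a chain.
SUBCLASS_OF = [
    ("obi:0000011", "bfo:0000007"),
    ("bfo:0000007", "bfo:0000003"),
    ("bfo:0000003", "bfo:0000001"),
]

def step(graph):
    """Apply every rule once and return the triples the rules produce."""
    produced = set()
    for s, p, o in graph:
        produced.update((s, sup, o) for sub, sup in SUBPROPERTY_OF if sub == p)
        for a, b in INVERSE_OF:
            if p == a:
                produced.add((o, b, s))
            elif p == b:
                produced.add((o, a, s))
        if p in DOMAIN:
            produced.add((s, RDF_TYPE, DOMAIN[p]))
        if p in RANGE:
            produced.add((o, RDF_TYPE, RANGE[p]))
        if p == RDF_TYPE:
            produced.update(
                (s, RDF_TYPE, sup) for sub, sup in SUBCLASS_OF if sub == o
            )
    return produced

def materialize(asserted):
    """Return the entailed triples that were not explicitly asserted."""
    graph = set(asserted)
    while True:
        new = step(graph) - graph
        if not new:
            return graph - set(asserted)
        graph |= new

asserted = {
    ("my:identification031", "obi:0000299", "my:insectTaxon01"),
    ("my:identification031", "obi:0000293", "my:insect03"),
}
entailed = materialize(asserted)
print(len(entailed))  # 16 under the axioms hard-coded above
```

Under these assumptions the two asserted triples entail twelve property triples (six per link) and four type triples for the identification instance.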

Here are some observations based on these results:
1. Although linking as suggested does connect the insect with the taxon, a single explicit linkage entails sixteen triples that don't provide any information that would improve a human's ability to discover the connection using a query. A single triple containing a generic object property:

my:insect03 my:identifiedToTaxon my:insectTaxon01.

would be equally effective in making the link.
2. I am not sure why obi:0000293 and the other properties were declared to be both owl:ObjectProperty and owl:AnnotationProperty. This causes an inconsistency because annotation and object properties are disjoint. I suspect that this is just an error in coding, but having a complex ontology makes this kind of error more likely to happen, and less likely to be noticed.
3. Figure 3 shows eleven links of various sorts that are made using the two terms obi:0000299 and obi:0000293. It seems to me that this would complicate querying because a triple pattern like:

?resource1 obi:0000299 ?resource2.

would bind triples that linked various kinds of resources related by various processes. The human would probably have to complicate the query to sort them out and find the desired kinds of relationships. One could probably fix that by specifying types for the resources. But simply making it clear in documentation which sorts of resources should be used with a property like my:identifiedToTaxon would be a less complicated approach.
4. None of the properties used in the example above were actually minted in the process of the development of BCO. I understand and approve of the strategy of re-using terms from existing ontologies, but my point here is that the BCO-building effort did not produce any new capabilities for linking resources in the manner described above. I could have generated the triples in the example even if the various workshops and meetings devoted to the development of BCO had never happened, because the BCO per se doesn't really increase capabilities for linking by creating new object properties to do the linking.

Provenance and the Biocollections Ontology

Figure 3 also suggests tracking the provenance of samples using ro:0001000 ("derives from") and the caption says that a chain of inputs and outputs can be used to infer that an instance of DNA molecules is derived from an instance of an insect specimen. Here's what the BCO ontology says about "derives from":

ro:0001000 a owl:ObjectProperty;
rdfs:label "derives from"@en;
rdfs:subPropertyOf owl:topObjectProperty.

Using this property, one could describe the links between sampled objects like this:

my:tissueSample01 ro:0001000 my:insectTaxon01.
my:dnaSample01 ro:0001000 my:tissueSample01.

In contrast to the previous properties that had way more semantics than they probably needed to link effectively, this property has virtually no semantics that would be of any use to a machine. If "derives from" were transitive, then a semantic client could easily materialize the triple

my:dnaSample01 ro:0001000 my:insectTaxon01.

allowing a human to discover all samples that were derived from the insect with a simple query like

SELECT ?derivedSample WHERE {
?derivedSample ro:0001000 my:insectTaxon01.
}

But since "derives from" isn't transitive, a "stupid" semantic client can't do that kind of dirty work for the human. The human could write more complex queries, rules, or specialized software to make the connections. But those kinds of actions wouldn't make use of the reasoning capabilities built into RDF - the statements made in RDF can't directly entail the more distant "derives from" relationship. [5]
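To make the consequence concrete, here is a minimal plain-Python sketch (not real RDF tooling, and not part of BCO) of what a client could materialize if "derives from" were declared transitive. The triples are the two from the example above; the helper name transitive_closure is made up for illustration.

```python
# The two asserted "derives from" triples from the example above.
DERIVES_FROM = "ro:0001000"

asserted = {
    ("my:tissueSample01", DERIVES_FROM, "my:insectTaxon01"),
    ("my:dnaSample01", DERIVES_FROM, "my:tissueSample01"),
}


def transitive_closure(triples, predicate):
    """Return the triples plus everything entailed if predicate were transitive.

    Hypothetical helper: repeatedly joins (a p b) and (b p c) to add (a p c)
    until no new triples appear.
    """
    closed = set(triples)
    while True:
        new = {
            (s1, predicate, o2)
            for (s1, p1, o1) in closed
            for (s2, p2, o2) in closed
            if p1 == predicate and p2 == predicate and o1 == s2
        }
        if new <= closed:
            return closed
        closed |= new


materialized = transitive_closure(asserted, DERIVES_FROM)

# Equivalent of the simple SELECT query: everything derived from the insect.
derived_samples = sorted(
    s for (s, p, o) in materialized
    if p == DERIVES_FROM and o == "my:insectTaxon01"
)
print(derived_samples)
```

Run against only the asserted triples, the same filter would find my:tissueSample01 but miss my:dnaSample01, which is exactly the gap that a transitivity declaration would close.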

The figure caption suggests tracking the provenance using "inputs and outputs", which might imply that linking should instead be achieved using "has specified input" and "has specified output" like this:

my:tissueSamplingProcess01 obi:0000293 my:insectTaxon01.
my:tissueSamplingProcess01 obi:0000299 my:tissueSample01.
my:dnaExtractionProcess01 obi:0000293 my:tissueSample01.
my:dnaExtractionProcess01 obi:0000299 my:dnaSample01.

I don't have the patience at this point to list all of the triples that this would entail - based on the earlier example, there should be about 32, or perhaps more. But it is not at all apparent to me how having a client materialize all those triples would allow a client to infer that the instance of DNA molecules was derived from the instance of an insect specimen, or that doing so would make the job of a human querier easier.
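One way to see what is missing is to write out by hand the rule that a client would need. The plain-Python sketch below assumes a rule that the four triples above do not themselves state: whatever a process has as a specified output is derived from whatever it has as a specified input. That rule, and the helper names derive_links and ancestors, are hypothetical and are not taken from BCO or OBI.

```python
HAS_INPUT = "obi:0000293"    # "has specified input"
HAS_OUTPUT = "obi:0000299"   # "has specified output"

triples = [
    ("my:tissueSamplingProcess01", HAS_INPUT, "my:insectTaxon01"),
    ("my:tissueSamplingProcess01", HAS_OUTPUT, "my:tissueSample01"),
    ("my:dnaExtractionProcess01", HAS_INPUT, "my:tissueSample01"),
    ("my:dnaExtractionProcess01", HAS_OUTPUT, "my:dnaSample01"),
]


def derive_links(triples):
    """Map each output to the set of inputs of the process that produced it.

    Assumed rule (hypothetical, not from the ontology):
    (proc has_input x) and (proc has_output y)  =>  y derives from x.
    """
    inputs, outputs = {}, {}
    for s, p, o in triples:
        if p == HAS_INPUT:
            inputs.setdefault(s, set()).add(o)
        elif p == HAS_OUTPUT:
            outputs.setdefault(s, set()).add(o)
    links = {}
    for proc, outs in outputs.items():
        for out in outs:
            links.setdefault(out, set()).update(inputs.get(proc, set()))
    return links


def ancestors(resource, links):
    """Follow the derived links back to every upstream resource."""
    found, stack = set(), [resource]
    while stack:
        for parent in links.get(stack.pop(), set()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


links = derive_links(triples)
print(sorted(ancestors("my:dnaSample01", links)))
```

This is the "rules or specialized software" route: the connection between the DNA and the insect is recovered by custom code that encodes the rule, not by anything the triples entail on their own.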

OBO ontologies and the lack of clear definitions for terms

There is no question that there is currently a lack of clarity about the meaning of important terms and classes in Darwin Core. There are several ways of dealing with this problem. One is to make it easier for humans to understand what the terms mean by improving the human-readable definitions of the terms. This is one of the primary purposes of the (stalled) proposal to clarify the definitions of all Darwin Core classes. Another would be to introduce clarity by tying the RDF definitions of Darwin Core classes to formal ontologies. This has already been done by linking the new class dwctype:MaterialSample to the Ontology for Biomedical Investigations (OBI) by making it rdfs:subClassOf http://purl.obolibrary.org/obo/OBI_0100051. I supported the proposal for defining the material sample class in that way because I think the precise language of the ontologies may be clearer than words alone. However, I also agree with the point made by Joel Sachs in his talk at TDWG 2013, where he cautioned that tying Darwin Core terms directly to external ontologies might result in unintended inferences. So if TDWG goes down the road of tying class definitions to formal ontologies, it should be done with full knowledge of the implications for machine reasoning.
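The mechanism behind such inferences can be sketched in a few lines of plain Python. The rule applied is the standard RDFS entailment rule rdfs9 (an instance of a class is also an instance of its superclasses). The obo: prefix is assumed to abbreviate http://purl.obolibrary.org/obo/, and the upstream class ex:SomeUpstreamClass and the instance my:tissueSample01 are hypothetical placeholders, not statements about what OBI actually declares.

```python
SUBCLASS_OF = "rdfs:subClassOf"
TYPE = "rdf:type"

triples = {
    # From the text: the Darwin Core class is tied to the OBI class.
    ("dwctype:MaterialSample", SUBCLASS_OF, "obo:OBI_0100051"),
    # Hypothetical superclass, standing in for whatever OBI declares upstream.
    ("obo:OBI_0100051", SUBCLASS_OF, "ex:SomeUpstreamClass"),
    # Hypothetical instance data from a provider.
    ("my:tissueSample01", TYPE, "dwctype:MaterialSample"),
}


def entailed_types(triples):
    """Apply RDFS rule rdfs9 until no new rdf:type triples appear.

    rdfs9: (C rdfs:subClassOf D) and (x rdf:type C)  =>  (x rdf:type D).
    """
    closed = set(triples)
    changed = True
    while changed:
        changed = False
        for (c, p1, d) in list(closed):
            if p1 != SUBCLASS_OF:
                continue
            for (x, p2, cls) in list(closed):
                if p2 == TYPE and cls == c and (x, TYPE, d) not in closed:
                    closed.add((x, TYPE, d))
                    changed = True
    return closed


types = sorted(
    o for (s, p, o) in entailed_types(triples)
    if s == "my:tissueSample01" and p == TYPE
)
print(types)
```

A provider who asserts only the Darwin Core type ends up, after reasoning, with every type that the external ontology places upstream, whether or not the provider ever looked at those class definitions.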

Wikimedia Commons. left: Luigizanasi CC BY-SA, right: Øyvind Holmstad CC BY-SA

Hammers and saws

I once was asked a question, which I'll paraphrase as "Don't you think that the BCO approach is better than the Darwin-SW approach?". That is like asking a carpenter "Don't you think that a hammer is better than a saw?" A hammer is better than a saw if you want to drive nails. A saw is better than a hammer if you want to cut wood. The BCO approach is better than Darwin-SW if you want to describe in a clear and semantically precise way how various biodiversity-related classes are related to each other, but it's not very useful for linking things. The Darwin-SW approach is better than the BCO approach if you want to link things, but it is totally useless as far as describing the nature of biodiversity classes is concerned. BCO is a hammer; Darwin-SW is a saw.

It has been suggested repeatedly that to move forward, TDWG needs to engage in ontology building, in particular by creating OBO-like ontologies. I am not opposed to building ontologies, as long as we clearly articulate our reasons for doing so and can show what we will accomplish from the effort. But I do not believe that those reasons have yet been clearly articulated. I certainly do not believe that putting more effort into building the BCO is going to solve TDWG's problem of lack of object properties to link instances of diverse kinds of biodiversity resources.

Summary

1. Despite the illusion generated by local names used in property IRIs, semantic clients have little or no understanding of the actual meaning of properties they encounter.

2. Clients "learn" useful things when they discover novel triples having resource IRIs from Internet domains controlled by other providers.

3. Clients "learn" useful things when the entailed triples that they materialize help their human partners conduct more clever or meaningful queries. Simply materializing triples that restate the same linkages, or that declare uninteresting types (like an rdf:type of "super class of all Plant Ontology classes"), isn't particularly useful.

4. Sources of RDF triples that are rich in object properties are most likely to help a human querier discover novel information.

5. If effort is expended toward ontology development, there should be a clear statement (preferably with functioning examples) that shows how that development will help semantic clients construct or evaluate graphs in a way that will assist humans in discovering information that isn't already obvious, or that can't be discovered with more conventional methods.

In my next post, I plan to revisit (for the last time) the Rod Page Challenge, and talk about what must happen to turn me from an RDF agnostic into an RDF believer.

Endnotes

[1] Although rdf:type is in the general rdf: namespace, in RDF 1.1 its meaning is now fleshed out in the RDFS specification http://www.w3.org/TR/2014/REC-rdf-schema-20140225/

[2] Unfortunately the IRI http://herbarium.unc.edu/592810 is a fake. There are so few real examples of specimens that are meaningfully linked to other things using RDF that I had to make it up. The other IRIs are real. There is actual RDF for this specimen at http://bioimages.vanderbilt.edu/specimen/ncu592810 . View the page source to see the raw RDF.

[3] Technically, I think this isn't true because object properties relate individuals to individuals (i.e., instances to instances). So I believe that if a Plant Ontology object property is used to relate two classes, the classes would be both classes and instances. If I am remembering correctly, that is only allowed in OWL Full, so that might be problematic for users who might be disturbed by the undecidability that this would introduce. I spent about 20 minutes rummaging around the OWL documents trying to find the domain and range declarations for owl:ObjectProperty and to confirm that a resource can only be both a class and an instance in OWL Full, but couldn't find it and didn't want to spend any more time on it. So there is a good chance I'm wrong about this.

[4] Walls RL, Deck J, Guralnick R, Baskauf S, Beaman R, et al. (2014) Semantics in Support of Biodiversity Knowledge Discovery: An Introduction to the Biological Collections Ontology and Related Ontologies. PLoS ONE 9(3): e89606. http://dx.doi.org/10.1371/journal.pone.0089606

[5] Shameless plug for Darwin-SW, which has the transitive property dsw:derivedFrom. See https://code.google.com/p/darwin-sw/wiki/TokenIssues for more details on how it is used. See http://www.semantic-web-journal.net/content/darwin-sw-darwin-core-based-terms-expressing-biodiversity-data-rdf for SPARQL query examples.

Friday, May 23, 2014

Confessions of an RDF agnostic, part 6: properties, ontologies, and linking