Monday, April 28, 2014

Confessions of an RDF agnostic, part 3: How a client learns

"It is not easy to build a robot, and only very clever boys should try it."

...

"Grandpa Clayton said heartily, "Why, that's excellent for a start.  This is your first one, isn't it?"
"Yes," Andy said uncomfortably.  "I meant him to be a man, but he turned out like this."
"Well, that's all right," Grandpa Clayton said. "it takes all sorts of people to make a world, and I expect it's the same with robots."

Carol Ryrie Brink (1966)  Andy Buckram's Tin Men.




 

What does a semantic client "know"?

In the first post of this series, I raised the question "What does it actually mean to say that we can 'learn' something using RDF?"  This is a rather vague question for two reasons: what did I mean by "learn" and what did I mean by "RDF"?  In this post I want to flesh out this question.

Last October I forced myself to read the RDF Semantics document *. It was somewhat painful, but much of what I found written in technical language supported the ideas about RDF that I'd absorbed through osmosis over the past several years.  Section 6 very succinctly addresses the question of what a semantic client can know:

Given a set of RDF graphs, there are various ways in which one can 'add' information to it. Any of the graphs may have some triples added to it; the set of graphs may be extended by extra graphs; or the vocabulary of the graph may be interpreted relative to a stronger notion of vocabulary entailment, i.e. with a larger set of semantic conditions understood to be imposed on the interpretations. All of these can be thought of as an addition of information, and may make more entailments hold than held before the change. All of these additions are monotonic, in the sense that entailments which hold before the addition of information, also hold after it.
In the previous post, I noted that the basic "fact" unit in RDF is a triple.  A graph is a set of triples, so a graph essentially represents a set of known facts.  So the technical answer to the question "what does a semantic client know?" is: "the triples that comprise the graph it has assembled".  This is a little anticlimactic if we were hoping that sentience was somehow going to arise spontaneously in our "intelligent agent", but I'm afraid that is about all that there is.  If we accept that the RDF graph assembled by a client is what it "knows", then Section 6 answers the question "how does a semantic client learn?"  If learning is the acquisition of knowledge, then "learning" for a semantic client is the addition of triples to its graph.



"Learning" by the addition of triples












Linking Open (LOD) Data Project Cloud Diagram from http://linkeddata.org/ CC BY-SA

The most straightforward way for a semantic client to "learn" is to simply add triples to those already present in its graph.  There are a number of ways this could happen.  The human managing the client might feed triples directly into it from an in-house institutional database.  A graph might be loaded in bulk from another data provider.  Or the client might "follow its nose" and discover triples on its own by traversing the Semantic Web as envisioned by Tim Berners-Lee.  In the Linked Data model, resources are identified by HTTP URIs which can be dereferenced to acquire an RDF document describing the resource.
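
To make this concrete, here is a minimal sketch (in Python, using the rdflib library, which is one common choice but certainly not the only one) of a client "learning" by dereferencing an HTTP IRI and adding the retrieved triples to its graph.  The IRI is the VIAF identifier that appears later in this post; whether any triples actually arrive depends on the server being willing to return RDF.

# A minimal "follow your nose" sketch using rdflib.
from rdflib import Graph

client_graph = Graph()            # everything the client currently "knows"
print(len(client_graph))          # 0 -- the client starts out knowing nothing

# Dereference a Linked Data IRI; rdflib fetches the document and guesses
# the serialization (RDF/XML, Turtle, etc.) of whatever the server returns.
client_graph.parse("http://viaf.org/viaf/9854560")

# "Learning" is simply the growth of the graph.
print(len(client_graph))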

"Learning through Discovery" is an exciting prospect for our precocious intelligent agent, and it would be tempting to turn it loose on the Internet to collect as many triples as possible.  However, there is also a dark side to this kind of learning.  RDF assumes that "Anyone can say Anything about Anything".  In the language of the RDF Concepts document:
To facilitate operation at Internet scale, RDF is an open-world framework that allows anyone to make statements about any resource.
In general, it is not assumed that complete information about any resource is available. RDF does not prevent anyone from making assertions that are nonsensical or inconsistent with other statements, or the world as people see it. Designers of applications that use RDF should be aware of this and may design their applications to tolerate incomplete or inconsistent sources of information.
 So it is possible that our client will discover triples that are correct and useful (awesome!). However, it is also possible that it will discover triples that are incorrect due to carelessness, ignorance, or outright nefariousness.  Another possibility that should be considered is that the client will encounter triples that are correct, but useless (a possibility that I'd like to explore further in a future post).  One would like to believe that "bad triples" wouldn't be introduced into the Semantic Web out of malice, but given the existence of computer viruses and spam, it would probably be naive to think that.  The likelihood that our client may discover "bad" triples introduces a social dimension to the problem of how it "learns": how do we program the client to know which data sources to trust?



"Learning" through entailment rules














The quote from Section 6 of the RDF Semantics document describes the following way of adding information to an RDF graph: "...the vocabulary of the graph may be interpreted relative to a stronger notion of vocabulary entailment, i.e. with a larger set of semantic conditions understood to be imposed on the interpretations."  For example generic RDF can be extended by the rdfs-interpretation which satisfies additional semantic conditions, such as:

If <x,y> is in IEXT(I(rdfs:range)) and <u,v> is in IEXT(x) then v is in ICEXT(y)

The semantic conditions then establish various entailment rules.  The OWL 2 Primer defines entailment as follows: "a set of statements A entails a statement a if in any state of affairs wherein all statements from A are true, also a is true."


If your head is spinning by this point, an illustration may help.  The semantic condition above leads to the following entailment rule:

Rule rdfs3:
If graph E contains {aaa rdfs:range XXX. uuu aaa vvv.} then add {vvv rdf:type XXX.}

A client that is programmed to extend RDF to the rdfs-interpretation can use this rule to generate an inferred triple that has never been explicitly stated.
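
To make this less abstract, here is a hand-rolled sketch of Rule rdfs3 in Python using the rdflib library (my assumption; a real client would more likely use an off-the-shelf reasoner than code rules by hand):

from rdflib import Graph, Literal
from rdflib.namespace import RDF, RDFS

def apply_rdfs3(graph):
    """For every {aaa rdfs:range XXX. uuu aaa vvv.} in the graph,
    add the entailed triple {vvv rdf:type XXX.}."""
    inferred = set()
    for aaa, _, xxx in graph.triples((None, RDFS.range, None)):
        for _, _, vvv in graph.triples((None, aaa, None)):
            if isinstance(vvv, Literal):
                continue   # a literal cannot be the subject of a standard RDF triple
            inferred.add((vvv, RDF.type, xxx))
    new_triples = [t for t in inferred if t not in graph]
    for t in new_triples:
        graph.add(t)
    return len(new_triples)   # how many triples the client just "learned"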



http://commons.wikimedia.org/wiki/File:Van_Gogh_Age_19.jpg

Happy example!!!

My busybody semantic client has surfed the Semantic Web and discovered the FOAF vocabulary.  My imaginary programming prowess has enabled the client to sort through the RDFa there and "learn" (add to its graph) the triple:

foaf:depiction rdfs:range foaf:Image.

(Turtle serialization, with foaf: and rdfs: representing their conventional namespaces).  After some additional surfing, my client also discovers the following triple:

<http://viaf.org/viaf/9854560> 
foaf:depiction <http://commons.wikimedia.org/wiki/File:Van_Gogh_Age_19.jpg>.

It can use Rule rdfs3 to infer that the two triples it has discovered entail a third triple:

<http://commons.wikimedia.org/wiki/File:Van_Gogh_Age_19.jpg> rdf:type foaf:Image.

My semantic client is so smart!  It has "learned" (i.e. added a triple to its graph=added information=learned) that the resource identified by the IRI

http://commons.wikimedia.org/wiki/File:Van_Gogh_Age_19.jpg

is an image!!!  This is just what I was hoping for and I am beginning to feel like I'm on the way to creating an "intelligent agent". 
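
For the record, here is what that "learning" looks like if I feed the two discovered triples to the apply_rdfs3() sketch from earlier in this post (again using rdflib; the Turtle is exactly what the client discovered):

from rdflib import Graph, Namespace
from rdflib.namespace import RDF

FOAF = Namespace("http://xmlns.com/foaf/0.1/")

g = Graph()
g.parse(data="""
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    foaf:depiction rdfs:range foaf:Image .
    <http://viaf.org/viaf/9854560>
        foaf:depiction <http://commons.wikimedia.org/wiki/File:Van_Gogh_Age_19.jpg> .
    """, format="turtle")

apply_rdfs3(g)   # the sketch defined above

# The client now "knows" the entailed triple:
# <http://commons.wikimedia.org/wiki/File:Van_Gogh_Age_19.jpg> rdf:type foaf:Image .
for image in g.subjects(RDF.type, FOAF.Image):
    print(image)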

Image by Randy Son of Robert, Wikimedia Commons, CC BY 2.0


Sad example  :-(  :-(  :-(

Flushed with enthusiasm after its first victory, my client discovers the following triple:

<urn:lsid:ubio.org:namebank:111731> 
foaf:depiction <http://dbpedia.org/resource/Moby-Dick>.

I'm a bit unsure about this one.  I know that the subject IRI identifies the scientific name for sperm whale.  I also know that the object IRI identifies the book Moby Dick.  The triple seems reasonable because the book Moby Dick does depict a sperm whale (sort of).  So I let my client have a go at learning again.  It applies entailment Rule rdfs3 and infers this triple:

<http://dbpedia.org/resource/Moby-Dick> rdf:type foaf:Image.

I am no longer feeling so good about how things are going with my intelligent agent.  It has just "learned" that the book Moby Dick is an image.  What has gone wrong?

What went wrong in the sad example?


The short answer to this question is: Nothing.  My semantic client has made no mistakes - it correctly inferred a triple that was entailed once I accepted the RDFS interpretation of RDF.  It now "knows" that the book Moby Dick is an image. 

 One could claim that the problem was with the triple:

<urn:lsid:ubio.org:namebank:111731> 
foaf:depiction <http://dbpedia.org/resource/Moby-Dick>.

However, in RDF Anyone can say Anything about Anything.  It is possible that the creator of that triple was careless and did not realize that use of foaf:depiction entailed that the object of the triple was an image.  However, it is also possible that the creator of that triple DID understand the implications of using foaf:depiction and intended to extend the concept of "image" to include things that could be figuratively considered "images" (like books).  Without knowing more about the provider of the triple and what the provider understood the FOAF vocabulary to mean, we cannot know if the usage was intentional or a mistake.  As in the case of "learning through addition of triples", the dimension of trust is important here as well.  We must be able to trust that the providers of triples we ingest are both honest in the information they expose and competent to use vocabulary terms in a manner consistent with their intended meaning. 




Horrifying example

If we wanted to create the most powerful "intelligent agent" possible, then we might consider allowing our client to conduct reasoning based on the strongest possible vocabulary entailments.  Returning one more time to Section 6 of the RDF Semantics document, this would result in a larger set of semantic conditions understood to be imposed on the interpretations, which can be thought of as an addition of information.  Our client can "learn" even more.  The Web Ontology Language (OWL) provides terms loaded with powerful semantics (owl-interpretation) that can allow clients to infer triples in many useful ways.  However, whenever an interpretation has stronger entailment, there are also more ways to create inconsistencies and unintended consequences. 

If you have gotten this far in the post, let me suggest some reading that was very helpful to my thinking about these issues and which was sometimes entertaining (if you have a warped sense of entertainment, as I do):

Aidan Hogan, Andreas Harth and Axel Polleres.  Scalable Authoritative OWL Reasoning for the Web.  International Journal on Semantic Web and Information Systems, 5(2), pages 49-90, April-June 2009.   http://www.deri.ie/fileadmin/documents/DERI-TR-2009-04-21.pdf

Hogan et al. (2009) provide an interesting example of four triples containing terms from the RDFS and OWL vocabularies:

rdfs:subClassOf rdfs:subPropertyOf rdfs:Resource.
rdfs:subClassOf rdfs:subPropertyOf rdfs:subPropertyOf.
rdf:type rdfs:subPropertyOf rdfs:subClassOf.
rdfs:subClassOf rdf:type owl:SymmetricProperty.


If we allow our client to ingest these triples and then conduct naive reasoning based on rdfs- and owl-interpretations of RDF, the client will infer a triple for every possible combination of the unique IRIs in its graph.  A relatively small graph containing a thousand triples would result in the client inferring 1.6x10^8 meaningless triples.  These four triples would effectively serve as a loaded bomb for a client that had no discretion about the source of its triples and the kinds of reasoning it conducted.

Hogan et al. (2009) also discuss the idea of "ontology hijacking", where statements are made about classes or properties in such a way that reasoning on those classes or properties is affected in a harmful or inflationary way.  In their SAOR reasoner, they introduced sets of rules that determined the types of reasoning that their client would be allowed to conduct and the types of triples on which the reasoning could be conducted.
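
As a rough illustration of the idea (this is my own toy filter, not SAOR itself, which is far more sophisticated), a client could simply refuse to reason over triples whose subjects are core RDF, RDFS, or OWL terms, since such triples redefine the machinery of the language itself:

from rdflib import Graph

# Namespaces whose terms a cautious client might refuse to accept as subjects.
CORE_NAMESPACES = (
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "http://www.w3.org/2000/01/rdf-schema#",
    "http://www.w3.org/2002/07/owl#",
)

def drop_hijacking_triples(graph):
    """Return a copy of the graph without statements about core vocabulary terms."""
    safe = Graph()
    for s, p, o in graph:
        if any(str(s).startswith(ns) for ns in CORE_NAMESPACES):
            continue   # e.g. rdfs:subClassOf rdf:type owl:SymmetricProperty .
        safe.add((s, p, o))
    return safe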

In this section, I am not saying that it is bad to use interpretations with stronger entailment.  Rather, I hope I have made the point that the programmer of a client must make careful decisions about the conditions under which the client will be allowed to generate entailed triples and incorporate them with the other "facts" that it "knows" (i.e. into its graph). 

Summary


"RDF" can be interpreted under different vocabulary interpretations, ranging from fewer semantic conditions and weaker entailment, to more semantic conditions and stronger entailment. Information can be gained by a client if it ingests more triples, or if it conducts inferencing based on stronger entailment. 
  • Entailment rules do NOT enforce conditions.
  • Entailment rules imply that other unstated triples exist.
  • Inferred triples are true to the extent that the statements which entail them are also true. This introduces a requirement for an element of trust.
  • A client is not required to apply all possible entailment rules.
  • A client is not required to apply rules to any particular set of triples.
This discussion of "knowing" and "learning" dodged the issue of querying the graph.  That is a likely mechanism that the human user of the client will use to "discover" what the client has "learned".  But that is a subject for another post...


---------
* In February, the RDF 1.1 Semantics document superseded the RDF (1.0) Semantics document.  It broadens the earlier document in a number of ways such as supporting IRIs and clarifying the types of literals, but otherwise most of what is stated in the 1.0 document remains true.

Sunday, April 27, 2014

Confessions of an RDF agnostic, part 2: I have a dream…

Tim Berners-Lee in thought

"I have a dream for the Web… Machines become capable of analyzing all the data on the Web - the content, links, and transactions between people and computers. A 'Semantic Web,' which should make this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy, and our daily lives will be handled by machines talking to machines, leaving humans to provide the inspiration and intuition. The intelligent 'agents' people have touted for ages will finally materialize."

Tim Berners-Lee (1999) "Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web By Its Inventor" p.157-8.

When I was a kid, I really enjoyed reading books about robots and computers. In books like "Andy Buckram's Tin Men" and "I, Robot", one constructed robots out of tin cans or whatever one had at hand, and then through the magic of positronic material or computer programming, the robot became a sentient being, capable of thinking and reasoning. I hoped that someday I would actually get to see a real computer. You can imagine my disappointment when I actually saw my first computer in college and discovered that computers only "knew" how to accomplish the things that one programmed them to do. The prospect of the emergence of an "intelligent agent" that can "discover" new information without the intervention of a human programmer is very appealing, and if Tim Berners-Lee says it can be done, it certainly should be possible, right?

The prospect of using RDF and its variants RDFS and OWL to enable machines to do semantic reasoning is very alluring and it is easy to jump on the bandwagon and advocate for adopting it without carefully considering its limitations.  So, I'd like to take a moment to step back and summarize a few important facts about RDF. [The rest of this post presupposes some knowledge of the rudiments of RDF at the level of understanding triples and graphs.  For more background, I recommend the W3C's RDF Primer.  For background in the context of biodiversity informatics, I recommend the TDWG RDF Task Group's Beginner's Guide to RDF.  I also shamelessly promote this video upon which I spent/wasted many hours in advance of the TDWG 2013 Semantics of Biodiversity symposium.]


1. RDF is not a programming language. A set of statements in RDF don't "do" anything. Rather, RDF is a way of stating "facts" about things, known as "resources". A single "fact" in RDF is called a triple. A triple can describe a property of a resource. A triple can also describe how a resource is related, or linked to other resources.
2. A set of triples is called an RDF graph. The triples in a graph describe a certain state of affairs. One cannot assume that everything is known about that state of affairs - there is always the potential to acquire additional information about the state of affairs.
3. RDF triples are not just a format for information exchange. Although they are serialized in different formats (XML, Turtle, JSON, etc.) they represent abstract relationships that are independent of the serialization.
4. Actually "doing" something with an RDF graph requires a "semantic client". A semantic client is a computer program that is designed to consume information in the form of triples. The client software is constructed to work according to rules laid out by the standards that define the various flavors of RDF. The semantic client produces some useful result based on rule-based processing of the triples it has consumed.


What does a semantic client "understand" about a triple?











Suppose I state the following:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dcterms="http://purl.org/dc/terms/">
   <rdf:Description rdf:about="http://bioimages.vanderbilt.edu/baskauf/26828">
     <dcterms:created rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2003-06-06T08:47:15-05:00</dcterms:created>
   </rdf:Description>
</rdf:RDF>


If this were simply processed as raw XML, it could be interpreted to mean that "2003-06-06T08:47:15-05:00" was some bit of string data that could be understood based on the tags in the markup and through a pre-established understanding between the sender and receiver. 

However, since this XML is valid RDF, a semantic client could understand it to mean that there is a relationship between some thing (i.e. resource) identified by the IRI* http://bioimages.vanderbilt.edu/baskauf/26828, and the instant of time 8:47:15 AM central daylight time on 6 June 2003.  Note that the relationship is NOT between the string "http://bioimages.vanderbilt.edu/baskauf/26828" and the string "2003-06-06T08:47:15-05:00", but rather between the entity identified by the IRI and the time instant encoded by the datatyped string.  The XML is just a means of serializing the abstract relationship described by the RDF triple.  The triple would "mean" exactly the same thing if it were serialized in Turtle syntax as:

@prefix dcterms: <http://purl.org/dc/terms/>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
<http://bioimages.vanderbilt.edu/baskauf/26828>
     dcterms:created "2003-06-06T08:47:15-05:00"^^xsd:dateTime.


Notice that I said that the client "could" understand the object of the triple to refer to an instant in time.  A client may (but is not required to) recognize XML Schema datatypes.  Similarly, a client might "understand" that the relationship between the resource and the time is one of creation (i.e. that the time is when the resource was created).  Such an "understanding" could occur because the Dublin Core vocabulary (of which the predicate dcterms:created is part) is well-known and commonly used. 
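
For what it's worth, rdflib is one example of a client library that does recognize XML Schema datatypes: when asked, it will map the literal above onto a native Python datetime object (a small sketch, again assuming rdflib):

from rdflib import Literal
from rdflib.namespace import XSD

created = Literal("2003-06-06T08:47:15-05:00", datatype=XSD.dateTime)
value = created.toPython()      # a datetime.datetime carrying the -05:00 offset
print(value, type(value))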

I could also say something like:

@prefix my: <http://my.xyz/>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
<http://bioimages.vanderbilt.edu/baskauf/26828>
     my:x4m5dd2 "2003-06-06T08:47:15-05:00"^^xsd:dateTime.


A client that "understood" XML Schema datatypes could know that there was a relationship between the resource identified by the IRI http://bioimages.vanderbilt.edu/baskauf/26828 and the instant of time 8:47:15 AM central daylight time on 6 June 2003, but would have no idea about the nature of that relationship without further knowledge of the predicate my:x4m5dd2 .  (It is possible for a semantic client to "learn" more about what a predicate means - possibly by dereferencing the IRI, but that's a story for another blog post.)

The point here is that the ability of a client to "understand" a triple depends in part on decisions about the parts of RDF/RDFS/OWL that the client's programmer decides to implement, and in part on a significant social component: both the human responsible for producing the triple and the programmer of the client need to have a common understanding of what the predicate of the triple "means".






What does a semantic client "do"?










If I create a graph of RDF triples and expose it through the Internet, what should I expect a client to do with it?  There is no requirement that any client do anything in particular with triples.  A client encountering a foaf:mbox property in a triple might under some circumstance send an email to the object email address.  A client encountering GEO namespace properties might place a point on a map visible to its user.  Presented with particular combinations of triples, a client might turn on a switch.  A client may facilitate a query or infer additional triples based on existing triples and a set of rules.  But these actions are dependent on the programmer of the client and are not controlled by the creator of the triples, who is simply creating a set of facts about the world according to the creator's perspective. 

Summary:

The idea of "intelligent agents" analyzing data in the form of RDF and taking action based on those data is very exciting and appealing.  However, making that happens depends critically on several factors:
- the availability of useful information in the form of RDF triples.
- decisions made by the programmers of clients about which rules the clients will use to process the triples they encounter.
- a common understanding of the meaning of predicates.
- programming decisions about the actions that will be taken by clients based upon the triples the clients encounter.

All four of these must be in place in order for RDF to become useful.  There is also a fifth factor that is primarily economic.  It is not enough to demonstrate that RDF can actually do something useful in a particular context.  One must also demonstrate that using RDF allows us to do things in that context that are impossible or ineffective with existing implemented technologies.  I believe that this may be the most important reason why little progress has been made in moving toward wider use of RDF within the TDWG community.  There is a cost associated with learning about and adopting a new technology, and that cost must be exceeded by the benefits to be gained through use of that technology.  Just being exciting isn't enough, and it isn't yet clear to me that we have demonstrated compelling things that RDF can do for us that other technologies can't.  How's that for agnosticism?

In subsequent blog posts, I plan to talk in more detail about the factors outlined above.  Next up: What does it mean to "discover new information" in an RDF context?

* "IRI" now used in preference to "URI", see http://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/#dfn-iri

Friday, April 25, 2014

Some thoughts after encountering Digital Humanists



As I have been thrashing around with RDF in the context of TDWG over the past couple years, I have wondered if there was anyone at Vanderbilt besides me who was working on anything remotely related to RDF, Linked Data, or the Semantic Web.  I never searched systematically, but when I brought the issue up with science and computer people, I usually encountered blank stares and the question "What?" 

Recently, I started following Clifford Anderson (@andersoncliffb) and David Michelson (@davidamichelson) on Twitter and got interested in the "Topics In Digital Humanities" course  they were teaching this semester.  I decided to get out of my biodiversity informatics silo and attend part of their final student presentations this past Monday.  I couldn't stay for the whole thing, but I was fascinated by the part I saw. 

The three talks I saw were related to digitizing metadata and images related to early Christian artifacts - particularly in the context of the syriaca.org project.  Although it seems like there would be little relationship between those projects and biodiversity informatics, one thing that really struck me as I watched the presentations was how similar the problems they faced were to those involved in digitizing natural history museum specimens and recording species occurrence metadata.  They struggled to find terms in controlled vocabularies to describe their artifacts.  They dealt with issues of demarcating segments of an image that documented several features of interest.  They were working out how to collaborate on common data sources. 

At the same time, I was struck how the tools they used were different from those that are used or talked about in TDWG. 

First and foremost, all of their work involved using XML.  I've heard almost nothing positive about XML in the context of TDWG: it's too verbose and takes too long to transmit, it's confusing and not readable, etc.  So I was surprised to see that it was central to what they were doing.  There seems to be a simple reason for this: it enables them to mark up text using very simple tags (which looked to me at a glance like XHTML) and then use existing technology (primarily XQuery, I think) to search the marked-up text.  In other words, they are immediately accomplishing useful things using off-the-shelf technology.  This is in marked contrast to the biodiversity informatics community, where years have been spent arguing about whether GUIDs and RDF are going to solve our problems, or whether they are a useless waste of time, and then having nothing functional to show for all of the arguing and effort. 

The second thing that struck me was how little emphasis there was on URIs or any sort of GUID, including DOIs.  I was a bit surprised by that.  I asked a question about URIs and it seemed to go right past the speaker.  I suppose this is a function of the fact that the documents on which they are working exist in a local database and there isn't a requirement at this point for them to link to records elsewhere.  But it seems that they will have to face that issue at some point.

The final thing that seemed really odd to me was the whole identification as "digital humanists".   I have to say that I don't exactly understand what that means, but after looking at things like https://www.hastac.org/ and https://my.vanderbilt.edu/digitalhumanities/ I'm getting a better idea.  I think that one reason why this puzzles me is that the Linked Data world (with which I'm more familiar) is fixated on connecting all information of all kinds and therefore Linked Data advocates in the biodiversity informatics community aren't interested in calling themselves "Digital Museum Curators", "Digital Scientists", or something like that because they consider their interests to include agents, literature references, and geography in addition to collections.  

I think that some of the differences I've seen here in approach are related to a difference in scale: biodiversity informatics involves assembling many small individual records that are scattered in many places vs. digital humanists marking up larger works that are localized in a few places.  In any case, I'm impressed with what the Digital Humanists at Vanderbilt have accomplished and I'm looking forward to learning more from them.

Confessions of an RDF agnostic, part 1: What can we learn using RDF?

http://www.w3.org/RDF/
 Disclaimer: the opinions expressed here are mine alone and do not reflect policies or recommendations of TDWG, the TDWG RDF/OWL Task Group, or W3C.

I've now spent several years as part of the TDWG community working toward making it possible to apply RDF technology to biodiversity data (for the last two years as co-convener of the TDWG RDF/OWL Task Group).  Over the last nine months I've spent some time reflecting about that effort - most intensively while preparing for the Primer session (part 1 part 2) at the Semantics of Biodiversity session at the TDWG meeting in October.  Trying to figure out how to teach something to beginners is a great way to cut through the fuzziness in your thinking on a subject and I was helped in the effort by forcing myself to slog through some of the rather dry W3C technical material on RDF such as the RDF Semantics (http://www.w3.org/TR/rdf-mt/ ) and RDF Concepts and Abstract Syntax (http://www.w3.org/TR/rdf-concepts/ ) documents. 

If one is going to start off an educational session for beginners, it's probably good to try to address questions like "why should I bother to learn about this?" and "what good is all this for?"  So in addition to thrashing through the W3C documents, I prepared for the Primer session by thinking about the Rod Page Challenge (first expressed in an email, then fleshed out at http://iphylo.blogspot.com/2011/10/tdwg-challenge-what-is-rdf-good-for.html), which, stated succinctly, was "What new things have we learnt about biodiversity by converting biodiversity data into RDF?"  The quick and dirty answer to Rod's question is "nothing much", at least at the present time.  I'm basing that answer on the fact that nobody has been trumpeting on tdwg-content about great things that they have done with the RDF data that Rod linked to in his blog post. 

In thinking about Rod's challenge, I tried to dig a little deeper and ask myself: what does it actually mean to say that we can "learn" something using RDF?  RDF and its vocabulary interpretations RDFS and OWL are supposed to support "reasoning", which implies that machines that use them can help us figure out things with them.  What is the nature of those things and are they really things that we couldn't figure out using a more conventional technology?  In the end, I ended up with way more slides than I could show in the time allocated to my part of the Primer session, but hey, I could just blog about the parts I had to cut, right? 

Some people may be surprised that the convener of an RDF Task Group would claim to be an "RDF agnostic".  I truly don't know whether RDF will actually help us "learn" anything useful that we couldn't more easily find out using other technologies.  That's not necessarily because I think it's not possible, but rather because I think success will depend on a large extent to our ability to have more clear expectations about what we as a community want to accomplish and whether we can work together effectively to accomplish those things.  In some number of blog posts greater than zero and less than infinity, I want to explore the issues of what we can reasonably expect to "learn" using RDF and what barriers exist that inhibit progress towards accomplishing that "learning".