Saturday, March 11, 2017

Controlled values (again)

The connection between The Darwin Core Hour and the TDWG Standards Documentation Specification

I've included the word "again" in the title of this blog post because I wrote a series of blog posts [1] about a year ago exploring issues related to thesauri, ontologies, controlled vocabularies, and SKOS.  Those posts were of a somewhat technical nature since I was exploring possible ways to represent controlled vocabularies as RDF.  However, there has been a confluence of two events that have inspired me to write a less technical blog post on the subject of controlled vocabularies for the general TDWG audience.

One is the genesis of the excellent Darwin Core Hour webinar series.  I encourage you to participate in them if you want to learn more about Darwin Core. The previous webinars have been recorded and can be viewed online.  The most recent webinar, "Even Simple is Hard", presented by John Wieczorek on March 7, provided a nice introduction to issues related to controlled vocabularies, and the next one on April 4, "Thousands of shades for 'Controlled' Vocabularies", presented by Paula Zermoglio, will deal with the specifics of controlled vocabularies.

The other thing that's going on is that we are in the midst of the public comment period for the draft TDWG Standards Documentation Specification (SDS), of which I'm the lead author.  At the TDWG Annual Meeting in December, I led a session to inform people about the SDS and its sister standard, the TDWG Vocabulary Management Specification.  At that session, the topic of controlled vocabularies came up.  I made some statements explaining the way that the SDS specifies that controlled vocabularies will be described in machine-readable form.  What I said seemed to take some people by surprise, and although I provided a brief explanation, there wasn't enough time to have an in-depth discussion.  I hoped that the topic would come up during the SDS public comment period, but so far it has not.  Given the current interest in constructing controlled vocabularies, I hope that this blog post will either generate some discussion, or satisfy people's curiosity about how the SDS deals with machine-readable controlled vocabularies.

Definitions

It is probably best to start off by providing some definitions.  It turns out that there is actually an international standard that deals with controlled vocabularies.  It is ISO 25964: "Thesauri and interoperability with other vocabularies".  Unfortunately, that standard is hidden behind a paywall and is ridiculously expensive to buy.  As part of my work on the SDS, I obtained a copy of ISO 25964 by Interlibrary Loan.  I had to return that copy, but I took some notes that are on the VOCAB Task Group's GitHub site.  I encourage you to refer to those notes for more details about what I'm only going to briefly describe here.

Controlled vocabularies and thesauri

ISO 25964 defines a controlled vocabulary as a
prescribed list of terms, headings or codes, each representing a concept. NOTE: Controlled vocabularies are designed for applications in which it is useful to identify each concept with one consistent label, for example when classifying documents, indexing them and/or searching them. Thesauri, subject heading schemes and name authority lists are examples of controlled vocabularies.
It also defines a form of controlled vocabulary, which is the major subject of the standard: a thesaurus.  A thesaurus is a
controlled and structured vocabulary in which concepts are represented by terms, organized so that relationships between concepts are made explicit, and preferred terms are accompanied by lead-in entries for synonyms or quasi-synonyms. NOTE: The purpose of a thesaurus is to guide both the indexer and the searcher to select the same preferred term or combination of preferred terms to represent a given subject. For this reason a thesaurus is optimized for human navigability and terminological coverage of a domain.  [my emphasis]
If you participated in or listened to the Darwin Core Hour "Even Simple is Hard", you can see the close relationship between the way "controlled vocabulary" was used in that seminar and the definition of thesaurus given here.  When submitting metadata about an occurrence to an aggregator, we want to use the same controlled value term in our metadata as will be used by those who may be searching for our metadata in the future.  Referring to an example given in the webinar, if we (the "indexers") provide "PreservedSpecimen" as a value in metadata in our spreadsheet, others in the future who are searching (the "searchers") for occurrences documented by preserved specimens can search for "PreservedSpecimen", and find our record.  That won't happen if we use a value of "herbarium sheet".  Figuring out how to get indexers and searchers to use the same terms is the job of a thesaurus.

A thesaurus is also designed to capture relationships between controlled value terms, such as "broader" and "narrower".  A searcher who knows about preserved specimens but wants records documented by any kind of physical thing (including living specimens, fossils, material samples as well) can be directed to a broader term that encompasses all kinds of physical things, e.g. "PhysicalObject".

So although in Darwin Core (and in TDWG in general) we tend to talk about "controlled vocabularies", I would assert that we are, in fact, talking about thesauri as defined by ISO 25964.

Strings and URIs

If you have spent any time pondering TDWG vocabularies, you probably have noticed that all kinds of TDWG vocabulary terms (classes and properties) are named using Uniform Resource Identifiers, or URIs.  Because the URIs used in TDWG term names begin with "http://", people get the mistaken impression that URIs always make something "happen".  We are used to seeing Web URLs that start with "http://", and have come to believe that if we put a URI that starts with "http://" in a browser, we will get a web page.  However, there are some terms in TDWG vocabularies that will probably never "do" anything in a browser.  For example, Audubon Core co-opts the term http://ns.adobe.com/exif/1.0/PixelXDimension as a property whose value gives the number of pixels in an image in the X dimension.  You can try putting that term URI in a browser and prove to yourself that nothing useful happens.  So we need to get over the idea that URIs must "do" something (they might or might not), and get used to the idea that their primary purpose is to serve as a globally unique name that conforms to some particular structural rules [2].

You can see the value of using URIs over plain text strings if you consider the Darwin Core term "class".  When we use "class" in the context of Darwin Core, we intend its Darwin Core definition: "The full scientific name of the class in which the taxon is classified."  However, "Class" has a different meaning in a different international standard, the W3C's RDF Schema 1.1 (RDFS).  In that context, "Class" means "The class of resources that are RDF classes."  There may be many other meanings of "class" in other fields.  It can mean different things in education, in computer programming, and in sociology.  We can tell people exactly what we intend by the use of a term if we identify it with a URI rather than a plain text string.  So for example, if we use the term http://rs.tdwg.org/dwc/terms/class, people will know that we mean "class" in the Darwin Core sense, but if we use the term http://www.w3.org/2000/01/rdf-schema#Class, people will know that we mean "class" in the RDFS sense.

Clearly, it is a pain in the neck to write out a long and messy URI.  For convenience, there is an abbreviated form for URIs called "compact URIs" or CURIEs.  When we use a CURIE, we define an abbreviation for part of the URI (commonly called the "namespace" abbreviation).  So for example, we could declare the namespace abbreviations:

dwc: = http://rs.tdwg.org/dwc/terms/
rdfs: = http://www.w3.org/2000/01/rdf-schema#

and then abbreviate the URI by replacing the namespace with the abbreviation to form a CURIE.  With the defined abbreviations above, we could say dwc:class when we intend "class" in the Darwin Core sense and rdfs:Class when we intend "class" in the RDFS sense.  This is much shorter than writing out the full URI, and if the last part of the CURIE after the namespace (known as the "local name") is formed from a natural language string (possibly in camelCase), it's easy for a native speaker of that natural language to "read" the CURIE as part of a sentence.
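For example, in the Turtle serialization of RDF, the two abbreviations above would be declared with @prefix statements.  Here is a minimal sketch (the label triple at the end is just illustrative):

@prefix dwc: <http://rs.tdwg.org/dwc/terms/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# dwc:class now expands to http://rs.tdwg.org/dwc/terms/class,
# while rdfs:Class expands to http://www.w3.org/2000/01/rdf-schema#Class
dwc:class rdfs:label "Class"@en .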

It is not a requirement that the local name be a natural language string.  Some vocabularies prefer to have opaque identifiers, particularly when there is no assumed primary language.  So for example, the URI http://vocab.getty.edu/tgn/1000111, which is commonly abbreviated by the CURIE tgn:1000111, denotes the country China, which may have many names in natural language strings of various languages.

What's the dwciri: namespace for?

Those who are familiar with using Darwin Core in spreadsheets and relational databases are familiar with the "regular" Darwin Core terms in the "dwc:" namespace (http://rs.tdwg.org/dwc/terms/).  However, most are NOT familiar with the namespace http://rs.tdwg.org/dwc/iri/, commonly abbreviated dwciri: .  This namespace was created as a result of the adoption of the Darwin Core RDF Guide, and most people who don't care about RDF have probably ignored it.  However, I'd like to bring it up in this context because it can play an important role in disambiguation.

Here is a typical task.  You have a spreadsheet record whose dwc:county value says "Robertson".  You know that it's a third-level political subdivision because that's what dwc:county records.  However, there are several third-level political subdivisions named "Robertson" in the United States alone, and there probably are some in other countries as well.  So there is work to be done in disambiguating this value.  You'll probably need to check the dwc:stateProvince and dwc:country or dwc:countryCode values, too.  Of course, there may also be other records whose dwc:county values are "Robertson County" or "Comté de Robertson" or "Comte de Robertson" that are probably from the same third-level political subdivision as your record.  Once you've gone to the trouble of figuring out that the record is in Robertson County, Tennessee, USA, you (or other data aggregators) really should never have to go through that effort again.

There are two standardized controlled vocabularies that have created URI identifiers for geographic places: GeoNames and the Getty Thesaurus of Geographic Names (TGN).  There are reasons (beyond the scope of this blog post) why one might prefer one of these vocabularies over the other, but either one provides an unambiguous, globally unique URI for Robertson County, Tennessee, USA: http://sws.geonames.org/4653638/ from GeoNames and http://vocab.getty.edu/tgn/2001910 from the TGN.  The RDF Guide makes it clear that the value of every dwciri: term must be a URI, while the values of many dwc: terms may be a variety of text strings, including human-readable names, URIs, abbreviations, etc.  With a dwc: term, a user probably will not know whether disambiguation needs to be done, while with a dwciri: term, a user knows exactly what the value denotes, since a URI is a globally unique name.

In RDF, we could say that a particular location was in Robertson County, Tennessee, USA like this:

my:location dwciri:inDescribedPlace tgn:2001910.

However, there is also no rule that says you couldn't have a spreadsheet with a column header of "dwciri:inDescribedPlace", as long as the cells below it contain URI values.  So dwciri: terms could be used in non-RDF data representations as well as in RDF.

If you look at the table in Section 3.7 of the Darwin Core RDF Guide, you will see that there are dwciri: analogs for every dwc: term where we thought it made sense to use a URI as a value.[3]  In many cases, those were terms where Darwin Core recommended use of a controlled vocabulary.  Thus, once providers or aggregators went to the trouble to clean their data and determine the correct controlled values for a dwc: property, they and everybody else in the future could be done with that job forever if they recorded the results as a URI value from a controlled vocabulary for a dwciri: Darwin Core property.
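As a minimal Turtle sketch of what that might look like for the Robertson County example, a provider could record both the verbatim string and the cleaned-up URI value (my:location is a made-up identifier for illustration; the other namespaces are as described above):

@prefix dwc: <http://rs.tdwg.org/dwc/terms/> .
@prefix dwciri: <http://rs.tdwg.org/dwc/iri/> .
@prefix tgn: <http://vocab.getty.edu/tgn/> .
@prefix my: <http://example.org/> .

my:location dwc:county "Robertson" ;                # the verbatim string, still ambiguous
            dwciri:inDescribedPlace tgn:2001910 .   # unambiguous: Robertson County, Tennessee, USA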


The crux of the issue

OK, after a considerable amount of background, I need to get to the main point of the post.  John's "Even Simple is Hard" talk was directed to the vast majority of Darwin Core users: those who generate or consume data as Simple Darwin Core (spreadsheets), or who output or consume Darwin Core Archives (fielded text tables linked in a "star schema").  In both of those cases, the tables or spreadsheets will most likely be populated with plain text strings.  There may be a few users who have made the effort to find unambiguous URI values instead of plain strings, and hopefully that number will go up in the future as more URIs are minted for controlled vocabulary terms.  However, in the current context, when John talks about populating the values of Darwin Core properties with "controlled values", I think that he probably means using a single, consensus Unicode string that denotes the concept underlying that controlled value.

Preferred Unicode strings

I am reminded of a blog post that John wrote in 2013 called "Data Diversity of the Week: Sex" in which he describes some of the 189 distinct values used in VertNet to denote the concept of "maleness".  We could all agree that the appropriate Unicode string to denote maleness in a spreadsheet should be the four characters "male".  Anyone who cleans data and encounters values for dwc:sex like "M", "m", "Male", "MALE", "macho", "masculino", etc. etc. could substitute the string "male" in that field.  There would, of course, be the problem of losing the verbatim value if a substitution were made.

I suspect that most TDWG'ers would consider the task of developing a controlled vocabulary for dwc:sex to involve sitting a bunch of users and aggregators down at a big controlled vocabulary conference, and coming to some consensus about the particular strings that we should all use to denote maleness, femaleness, and all other flavors of gender that we find in the natural world.

I don't want to burst anybody's bubble, but as it currently stands, that's not how the draft Standards Documentation Specification would work with respect to controlled vocabularies.  A TDWG controlled vocabulary would be more than a list of acceptable strings.  It would have all of the same features that other TDWG vocabularies have.

SDS: URIs

For one thing, each term in a controlled vocabulary would be identified by a URI.  That is already current practice in TDWG vocabularies and in Dublin Core.  The SDS does not specify whether the URIs should use "English-friendly" local names or opaque numbers for local names.  Either would be fine.  For illustration purposes, I'll pick opaque numbers.  Let's use "12345" as the local name for "maleness".  The SDS is also silent about namespace construction.  One could do something like

dwcsex: = "http://rs.tdwg.org/dwc/cv/sex/"

for the namespace.  Then the URI for maleness would be

http://rs.tdwg.org/dwc/cv/sex/12345

as a full URI or

dwcsex:12345

as a CURIE.  Anybody who wants to unambiguously refer to maleness can use the URI dwcsex:12345 regardless of whether they are using a spreadsheet, relational database, or RDF.
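For instance, here is a minimal Turtle sketch of using that URI as the value of the dwciri: analog of dwc:sex (my:organism is a made-up identifier, and dwcsex: is the hypothetical namespace declared above):

@prefix dwciri: <http://rs.tdwg.org/dwc/iri/> .
@prefix dwcsex: <http://rs.tdwg.org/dwc/cv/sex/> .
@prefix my: <http://example.org/> .

my:organism dwciri:sex dwcsex:12345 .   # "male", unambiguously and in no particular language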

SDS: Machine-readable stuff

In John's talk, he mentioned the promise of using "semantics" to help automate the process of data cleaning.  A critical feature of the SDS is that in addition to specifying how human-readable documents should be laid out, it also specifies how metadata should be expressed in order to make those data machine readable.  The examples are given as RDF/Turtle, but the SDS makes it clear that it is agnostic about how machine-readability should be achieved.  RDF-haters are welcome to use JSON-LD.  HTML-lovers are welcome to use RDFa embedded in web page markup.  Or better yet, provide the machine-readable data in all of the above formats and let users choose.  The main requirement is that regardless of the chosen serialization, every machine-readable representation must "say" the same thing, i.e. must use the same properties specified in the SDS to describe the metadata.  So the SDS is clear about what properties should be used to describe each aspect of the metadata.

In the case of controlled vocabulary terms, several designated properties are the same as those used in other TDWG vocabularies.  For example, rdfs:comment is used to provide the English term definition and rdfs:label is used to indicate a human-readable label for the term in English.  The specification does a special thing to accommodate our community's idiosyncrasy of relying on a particular Unicode string to denote a controlled vocabulary term.  That Unicode string is designated as the value of rdf:value, a property that is well-known, but doesn't have a very specific meaning and could be used in this way.  It's possible that the particular consensus string might be the same as the label, but it wouldn't have to be.  For example, we could say this:

dwcsex:12345 rdfs:label "Male"@en;
             rdfs:comment "the male gender"@en;
             rdf:value "male".

In this setup, the label to be presented to humans starts with a capital letter, while the consensus string value denoting the term doesn't.  In other cases, the human readable label might contain several words with spaces between, while the consensus string value might be in camelCase with no spaces.  The label probably should be language-tagged, while the consensus string value is a plain literal.

Dealing with multiple labels for a controlled value: SKOS

As it currently stands, the SDS says that the normative definition and label for terms should be in English.  Thus the controlled value vocabulary document (both human- and machine-readable) will contain English only.  Given the international nature of TDWG, it would be desirable to also make documents, term labels, and definitions available in as many languages as possible.  However, it should not require invoking the change process outlined in the Vocabulary Management Specification every time a new translation is added, or if the wording in a non-English language version changes.  So the SDS assumes that there will be ancillary documents (human- and machine-readable) in which the content is translated to other languages, and that those ancillary documents will be associated with the actual standards documents.

This is where the Simple Knowledge Organization System (SKOS) comes in.  SKOS was developed as a W3C Recommendation in parallel with the development of ISO 25964.  SKOS is a vocabulary that was specifically designed to facilitate the development of thesauri.  Given that I've made the case that what TDWG calls "controlled vocabularies" are actually thesauri, SKOS has many of the terms we need to describe our controlled value terms.

An important SKOS term is skos:prefLabel (preferred label).  skos:prefLabel is actually defined as a subproperty of rdfs:label, so any value that is a skos:prefLabel is also a generic label.  However, the "rules" of SKOS say that you should never have more than one skos:prefLabel value for a given resource, in a given language.  Thus, there is only one skos:prefLabel value for English, but there can be other skos:prefLabel values for Spanish, German, Chinese, etc.

SKOS also provides the term skos:altLabel.  skos:altLabel is used to specify other labels that people might use, but that aren't really the "best" one.  There can be an unlimited number of skos:altLabel values in a given language for a given controlled vocabulary term.  There is also a property skos:hiddenLabel.  The values of skos:hiddenLabel are "bad" values that you know people use, but you really wouldn't want to suggest as possibilities (for example, misspellings).

SKOS has a particular term that indicates that a value is a definition: skos:definition. That has a more specific meaning than rdfs:comment, which could really be any kind of comment.  So using it in addition to rdfs:comment is a good idea.

So here is how the description of our "maleness" term would look in machine-readable form (serialized as human-friendly RDF/Turtle):

Within the standards document itself:

dwcsex:12345 rdfs:label "Male"@en;
             skos:prefLabel "Male"@en;
             rdfs:comment "the male gender"@en;
             skos:definition "the male gender"@en;
             rdf:value "male".


In an ancillary document that is outside the standard:

dwcsex:12345 skos:prefLabel "Masculino"@es;
             skos:altLabel "Macho"@es;
             skos:altLabel "macho"@es;
             skos:altLabel "masculino"@es;
             skos:altLabel "male"@en;
             skos:prefLabel "男"@zh-hans;
             skos:prefLabel "男"@zh-hant;
             skos:prefLabel "männlich"@de;
             skos:altLabel "M";
             skos:altLabel "M.";
             skos:hiddenLabel "M(ale)";
etc. etc.

In the ancillary document, one would attempt to include as many as possible of the 189 values for "male" that John mentioned in his blog post.  Having this diversity of labels available makes two things possible.  One is to automatically generate pick lists in any language.  If the user selects German as the preferred language, the pick list presents the German preferred label "männlich" to the user, but the value selected is actually recorded by the application as the language-independent URI dwcsex:12345.  Although I didn't show it, the ancillary document could also contain definitions in multiple languages to clarify things for international users in the event that viewing the label itself is not enough.
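Here is a sketch of the sort of SPARQL query an application could use to build that German pick list, assuming the controlled vocabulary terms are typed as skos:Concept as the SDS specifies:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?term ?label
WHERE {
  ?term a skos:Concept ;          # each controlled vocabulary term
        skos:prefLabel ?label .   # its one preferred label per language
  FILTER(lang(?label) = "de")     # keep only the German labels
}

The application would display ?label in the dropdown, but record ?term as the selected value.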

The many additional labels in the ancillary document also facilitate data cleaning.  For example, if GBIF has a million horrible spreadsheets to try to clean up, they could simply do string matching against the various label values without regard to the language tags and type of label (pref vs. alt vs. hidden).  Because the ancillary document is not part of the standard itself, the laundry list of possible labels can be extended at will every time a new possible value is discovered.
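A minimal sketch of that kind of matching as a SPARQL query, where the literal "MALE" stands in for whatever messy value is being cleaned:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?term
WHERE {
  # match against preferred, alternative, and hidden labels alike
  ?term skos:prefLabel|skos:altLabel|skos:hiddenLabel ?label .
  # str() drops the language tag (if any); lcase() makes the match case-insensitive
  FILTER(lcase(str(?label)) = lcase("MALE"))
}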

Making the data available

It is TOTALLY within the capabilities of TDWG to provide machine-readable data of this sort, and if the SDS is ratified, that's what we will be doing.  Setting up a SPARQL endpoint to deliver the machine-readable metadata is not hard.  For those who are RDF-phobic, a machine-readable version of the controlled vocabulary can be available through the API as JSON-LD, which provides exactly the same information as the RDF/Turtle above and would look like this:

{
  "@context": {
    "dwcsex": "http://rs.tdwg.org/dwc/cv/sex/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "xsd": "http://www.w3.org/2001/XMLSchema#"
  },
  "@id": "dwcsex:12345",
  "rdf:value": "male",
  "rdfs:comment": {
    "@language": "en",
    "@value": "the male gender"
  },
  "rdfs:label": {
    "@language": "en",
    "@value": "Male"
  },
  "skos:altLabel": [
    {
      "@language": "es",
      "@value": "macho"
    },
    "M",
    {
      "@language": "en",
      "@value": "male"
    },
    "M.",
    {
      "@language": "es",
      "@value": "masculino"
    },
    {
      "@language": "es",
      "@value": "Macho"
    }
  ],
  "skos:definition": {
    "@language": "en",
    "@value": "the male gender"
  },
  "skos:hiddenLabel": "M(ale)",
  "skos:prefLabel": [
    {
      "@language": "es",
      "@value": "Masculino"
    },
    {
      "@language": "zh-hant",
      "@value": "男"
    },
    {
      "@language": "en",
      "@value": "Male"
    },
    {
      "@language": "zh-hans",
      "@value": "男"
    },
    {
      "@language": "de",
      "@value": "männlich"
    }
  ]
}

People could write their own data-cleaning apps to consume this JSON description of the controlled vocabulary and never even have to think about RDF.  

"Semantics", SKOS concept schemes, and Ontologies

Up to this point, I've been dodging an issue that will concern some readers and which other readers won't care about one whit.  If you are a casual reader and don't care about the fine points of machine-readable data and ontologies, you can just stop reading here.  

The SDS says that controlled vocabulary terms should be typed as skos:Concept, as opposed to rdfs:Class.  That prescription has the implication that the controlled vocabulary will be a SKOS concept scheme rather than an ontology.  This was what freaked people out at the TDWG meeting, because there is a significant constituency of TDWG whose first inclination when faced with a machine-data problem is to construct an OWL ontology.  At the meeting, I made the statement that a SKOS concept scheme is a screwdriver and an OWL ontology is a hammer.  Neither a screwdriver nor a hammer is intrinsically a better tool.  You use a screwdriver when you want to put in screws and you use a hammer when you want to pound nails.  So in order to decide which tool is right, we need to be clear about what we are trying to accomplish with the controlled vocabulary.  

ISO 25964 provides the following commentary about thesauri and ontologies in section 21.2:
Whereas the role of most of the vocabularies ... is to guide the selection of search/indexing terms, or the browsing of organized document collections, the purpose of ontologies in the context of retrieval is different. Ontologies are not designed for information retrieval by index terms or class notation, but for making assertions about individuals, e.g. about real persons or abstract things such as a process. 
and in section 22.3:
One key difference is that, unlike thesauri, ontologies necessarily distinguish between classes and individuals, in order to enable reasoning and inferencing. ... The concepts of a thesaurus and the classes of an ontology represent meaning in two fundamentally different ways. Thesauri express the meaning of a concept through terms, supported by adjuncts such as a hierarchy, associated concepts, qualifiers, scope notes and/or a precise definition, all directed mainly to human users. Ontologies, in contrast, convey the meaning of classes through machine-readable membership conditions. ... The instance relationship used in some thesauri approximates to the class assertion used in ontologies. Likewise, the generic hierarchical relationship ... corresponds to the subclass axiom in ontologies. However, in practice few thesauri make the distinction between generic, whole-part and instance relationships. The undifferentiated hierarchical relationship most commonly found in thesauri is inadequate for the reasoning functions of ontologies. Similarly the associative relationship is unsuited to an ontology, because it is used in a multitude of different situations and therefore is not semantically precise enough to enable inferencing.
In layman's terms, there are two key differences between thesauri and ontologies.  The primary purpose of a thesaurus is to guide a human user to pick the right term for categorizing a resource.  The primary purpose of an ontology is to allow a machine to do automated reasoning about classes and instances of things.  The second difference is that reasoning and inferencing are, in a sense, "automatic" for an ontology, whereas the choice to make use of hierarchical relationships in a thesaurus is optional and controlled by a user.  

Let's apply these ideas to our dwc:sex example.  We could take the ontology approach and say that 

dwcsex:12345 rdf:type rdfs:Class.

We could then define another class in our gender ontology:

dwcsex:12347 rdf:type rdfs:Class;
             rdfs:label "Animal gender".

and assert

dwcsex:12345 rdfs:subClassOf dwcsex:12347.

This assertion is the sort that John described in his webinar: an "is_a" hierarchical relationship.  We could represent it in words as:

"Male" is_a "Animal gender".

As data providers, we don't have to "do anything" to assert this fact, or decide in particular cases whether we like the fact or not.  Anything that has a dwc:sex value of "male" will automatically have a dwc:sex value of "Animal gender" because that fact is entailed by the ontology.  

Alternatively, we could take the thesaurus approach and say that 

dwcsex:12345 rdf:type skos:Concept.

We could then define another concept in our gender thesaurus:

dwcsex:12347 rdf:type skos:Concept;
             rdfs:label "genders that animals can have".

and assert

dwcsex:12345 skos:broader dwcsex:12347.

As in the ontology example, this assertion also describes a hierarchical relationship ("has_broader_category").  We could represent it in words as:

"Male" has_the_broader_category "genders that animals can have".

In this case, nothing is entailed automatically.  If we assert that some thing has a dwc:sex value of "male", that's all we know.  However, if a human user is using a SKOS-aware application, the application could interact with the user and say "Hey, not finding what you want?  I could show you some other genders that animals can have." and then find other controlled vocabulary terms that have dwcsex:12347 as a broader concept.  It would also be no problem to assert this:

dwcsex:12348 rdf:type skos:Concept;
             rdfs:label "genders that parts of plants can have".
dwcsex:12345 skos:broader dwcsex:12348.

We aren't doing anything "bad" by somehow entailing that males are both plants and animals.  We are just saying that "male" is a gender that can fall into several broader categories as part of a concept scheme: "genders that animals can have" and "genders that parts of plants can have".  This is what was meant by "the undifferentiated hierarchical relationship most commonly found in thesauri is inadequate for the reasoning functions..." in the text of ISO 25964.  The hierarchical relationships of thesauri can guide human categorizers and searchers, but they don't automatically entail additional facts.
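To make the "guiding" role concrete, here is a sketch of the query a SKOS-aware application might run to suggest the other genders that animals can have (dwcsex: is the hypothetical namespace used throughout this post):

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dwcsex: <http://rs.tdwg.org/dwc/cv/sex/>

SELECT ?otherTerm ?label
WHERE {
  ?otherTerm skos:broader dwcsex:12347 ;   # concepts under "genders that animals can have"
             skos:prefLabel ?label .
  FILTER(lang(?label) = "en")
}

Nothing is asserted about the things being described; the query simply offers the user alternative concepts to consider.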

Screwdriver or hammer?

Now that I've explained a little about the differences between thesauri and ontologies, which one is the right tool for the controlled vocabularies?  There is no definite answer to this, but the common practice for most controlled vocabularies seems to be the thesaurus approach.  That's the approach used by all of the Getty Thesauri, and also the approach used by the Library of Congress for the controlled vocabularies it defines.  In the case of both of these providers, the machine-readable forms of their controlled vocabularies are expressed as SKOS concept schemes, not ontologies.  

That is not to say that the controlled vocabularies for all Darwin Core terms that currently say "best practice is to use a controlled vocabulary" should be defined as SKOS concept schemes.  In particular, the two vocabularies that John spoke about at length in his webcast (those for dwc:basisOfRecord and dcterms:type) should probably be defined as ontologies.  That's because both are ways of describing the kind of thing something is, and that's precisely the purpose of an rdfs:Class.  One could create an ontology that asserts:

dwc:PreservedSpecimen rdfs:subClassOf dctype:PhysicalObject.

and all preserved specimens could automatically be reasoned to be physical objects whether a data provider said so or not [4].  In other words, machines that are "aware" of that ontology would "know" that

"preserved specimen" is_a "physical object".

without any effort on the part of data providers or aggregators.
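For example, even without a full reasoner, a SPARQL 1.1 property path can exploit that subclass assertion.  This sketch would return anything typed as dctype:PhysicalObject or as any of its subclasses, including preserved specimens:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dctype: <http://purl.org/dc/dcmitype/>

SELECT ?thing
WHERE {
  # follow rdf:type, then zero or more rdfs:subClassOf links up the class hierarchy
  ?thing rdf:type/rdfs:subClassOf* dctype:PhysicalObject .
}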

But both dcterms:type and dwc:basisOfRecord are really ambiguous terms that are a concession to the fact that we try to cram information about all kinds of resources into a single row of a spreadsheet.  The ambiguity about what dwc:basisOfRecord actually means is the reason why the RDF guide says to use rdf:type instead [5].  There is no prohibition against having as many values of rdf:type as is useful.  You can assert:

my:specimen rdf:type dwc:PreservedSpecimen;
            rdf:type dctype:PhysicalObject;
            rdf:type dcterms:PhysicalResource.

with no problem, other than it won't fit in a single cell in a spreadsheet!

So what if we decide that "controlled values" for a term should be ontology classes?

The draft Standards Documentation Specification says that controlled value terms will be typed as skos:Concepts.  What if there is a case where it would be better for the "controlled values" to be classes from an ontology?  There is a simple answer to that.  Change the Darwin Core term definition to say "Best practice is to use a class from a well-known ontology." instead of saying "Best practice is to use a controlled vocabulary."  That language would be in keeping with the definitions and descriptions of controlled vocabularies and ontologies given in ISO 25964.  Problem solved.

I should note that it is quite possible to use all of the SKOS label-related properties (skos:prefLabel, skos:altLabel, skos:hiddenLabel) with any kind of resource, not just SKOS concepts.  So if it were decided in a particular case that it would be better for "controlled values" to be defined as classes in an ontology rather than as concepts in a thesaurus, one could still use the multi-lingual strategy described earlier in the post.
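For example, one could attach labels to an ontology class like dwc:PreservedSpecimen in exactly the way described earlier.  A sketch (the particular label strings are just illustrative):

@prefix dwc: <http://rs.tdwg.org/dwc/terms/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

# dwc:PreservedSpecimen is a class rather than a skos:Concept, but the labels still apply
dwc:PreservedSpecimen skos:prefLabel "Preserved specimen"@en ;
                      skos:altLabel "herbarium sheet"@en .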

Also, there is no particular type for the values of dwciri: terms.  The only requirement is that the value must be a URI rather than a string literal.  It would be fine for that URI to denote either a class or a concept.

So one of the tasks of a group creating a "controlled vocabulary" would be to define the use cases to be satisfied, and then decide whether those use cases would be best satisfied by a thesaurus or an ontology.

Feedback!  Feedback! Feedback!

If something in this post has pushed your button, then respond by making a comment about the Standards Documentation Specification before the 30 day public comment period ends on or around March 27, 2017.  There are directions from the Review Manager, Dag Endresen, on how to comment at http://lists.tdwg.org/pipermail/tdwg-content/2017-February/003690.html .  You can email anonymous comments directly to him, but I don't think any members of the Task Group will get their feelings hurt by criticisms, so an open dialog on the issue tracker or tdwg-content would be even better.

Footnotes

[1] Blog entries from my blog http://baskauf.blogspot.com/
March 14, 2016 "Ontologies, thesauri, and SKOS"
March 21, 2016 "Controlled values for Subject Category from Audubon Core"
April 1, 2016 "Controlled values for Establishment Means from Darwin Core"
April 4, 2016 "Controlled values for Country from Darwin Core"

[2] The IETF RFC 3986 defines the syntax of URIs.  A superset of URIs, known as Internationalized Resource Identifiers or IRIs is now commonly used in place of URIs in many standards documents.  For the purpose of this blog post, I'll consider them interchangeable.

[3] There are also terms, like dwciri:inDescribedPlace that are related to an entire set of Darwin Core terms ("convenience terms").  Talking about those terms is beyond the scope of this blog post, but for further reading, see the explanation in Section 2.7 of the RDF Guide.

[4] Cautionary quibble: in John's diagram, he asserted that dwc:LivingSpecimen is_a dctype:PhysicalObject.  However, the definition of dctype:PhysicalObject is "An inanimate, three-dimensional object or substance.", which would not apply to many living specimens.  A better alternative would be the class dcterms:PhysicalResource, which is defined as "a material thing".  However, dcterms:PhysicalResource is not included in the DCMI type vocabulary - it's in the generic Dublin Core vocabulary.  That's a problem if the type vocabulary is designated as the only valid controlled vocabulary to be used for dcterms:type.

[5] See Section 2.3.1.4 of the RDF Guide for details.  DCMI also recommends using rdf:type over dcterms:type in RDF applications.  A key problem that our community has not dealt with is clarifying whether we are talking about a resource, or the database record about that resource.  We continually confuse these two things in spreadsheets, and most of the time we don't care.  However, the difference becomes critical if we are talking about modification dates or licenses.  A spreadsheet can have a column for "license", but is the value describing a license for the database record, or for the image described in that record?  A spreadsheet can have a value for "modified", but does that mean the date that the record was last modified or the date that the herbarium sheet described by the record was last modified?  With respect to dwc:basisOfRecord, its local name ("basis of record") implies that the value is the type of the resource upon which a record is based, which implies that the metadata in the spreadsheet row is about something else.  That "something else" is probably an occurrence.  So we conflate "occurrence" with "preserved specimen" by talking about both of them in the same spreadsheet row.  According to its definition, dwc:Occurrence is "An existence of an Organism (sensu http://rs.tdwg.org/dwc/terms/Organism) at a particular place at a particular time." while a preserved specimen is a form of evidence for an occurrence, not a subclass of occurrence.  That distinction doesn't seem important to most people who use spreadsheets, but it is important if an occurrence is documented by multiple forms of evidence (specimens, images, DNA samples) - something that is becoming increasingly common.  What we should have is an rdf:type value for the occurrence, and a separate rdf:type value for the item of evidence (possibly more than one).  But spreadsheets aren't complex enough to handle that, so we limp along with dwc:basisOfRecord.

Monday, February 13, 2017

SPARQL: the weirdness of unnamed graphs

This post is of a rather technical nature and is directed at people who are serious about setting up and experimenting with the management of SPARQL endpoints.  If you want to get an abbreviated view, you could probably read down through the "Graphs: named and otherwise" section, then skip down to "Names of unnamed graphs (review)" and read through the end.

Background

This semester, the focus of our Linked Data and Semantic Web working group [1] at Vanderbilt has been to try to move from Linked Data in theory to Linked Data in reality.  Our main project has been to document the process of moving Veronica Ikeshoji-Orlati's dataset on ancient vases from a spreadsheet to a Linked Data dataset.  We have continued to make progress on moving Tracy Miller's data on images of Chinese temple architecture from a test RDF graph to production.

One of the remaining problems we have been facing is setting up a final production triplestore and SPARQL endpoint to host the RDF data that we are producing.  The Vanderbilt Heard Library has maintained a Callimachus-based SPARQL endpoint at http://rdf.library.vanderbilt.edu/ since the completion of Sean King's Dean's Fellow project in 2015.  But Callimachus has some serious deficiencies related to speed (see this post for details) and we had been considering replacing it with a Stardog-based endpoint.  However, the community version of Stardog has a limit of 25 million triples per database, and we could easily go over that by loading either the Getty Thesaurus of Geographic Names or the GeoNames dataset, both of which contain well over 100 million triples.  So we have been considering using Blazegraph (the graph database used by Wikidata), which has no triple limit.  I have already reported on my experience with loading 100+ million triples into Blazegraph in an earlier post.  One issue that became apparent to me through that exercise was that appropriate use of named graphs would be critical to effective maintenance of a production triplestore and SPARQL endpoint. It also became apparent that my understanding of RDF graphs in the context of SPARQL was too deficient to make further progress on this front.  This post is a record of my attempt to remedy that problem.

Assumptions

This post assumes that you have a basic understanding of RDF, the Turtle serialization of RDF, and the SPARQL query language.  There are many good sources of information about RDF in general - the RDF 1.1 Primer is a good place to start.  For a good introduction to SPARQL, I recommend Bob DuCharme's Learning SPARQL.


Graphs: named and otherwise

One of the impediments to understanding how graphs interact with SPARQL is understanding the terminology used in the SPARQL specification.  Please bear with me as I define some of the important terms needed to talk about graphs in the context of SPARQL.  The technical documentation defining these terms is the SPARQL 1.1 Query Language W3C Recommendation, Section 13.

In the abstract sense, a graph defines the connections between entities, with the entities represented as nodes and the connections between them represented as arcs (also known as edges).  The RDF data model is graph-based, and a graph in RDF is described by triples.  Each triple describes the relationship between two nodes connected by an arc.  Thus, in RDF a graph can be defined as a set of triples.

I used three graphs for the tests I'll be describing in this post.  The first graph contains 12 triples and describes the Chinese temple site Anchansi and some other things related to that site.  The full graph in Turtle serialization can be obtained at this gist, but two of the triples are shown in the diagram above.  As with any other resource in RDF, a graph can be named by assigning it a URI.  In this first graph, I've chosen not to assign it an identifying URI.  I will refer to this graph as the "unnamed graph".

The second graph contains 18 triples and describes the temple site Baitaisi.  The graph in Turtle serialization is at this gist, and two triples from the graph are shown in the diagram above.  I have chosen to name the second graph by assigning it the URI <http://tang-song/baitaisi>.  You should note that although the URI denotes the graph, it isn't a URL that "does" something.  There is no web page that will be loaded if you put the URI in a browser.  That is totally fine - the URI is really just a name.  I'll refer to this graph by its URI - it is an example of a named graph.
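To give a flavor of the data, here is a sketch of the kind of triples found in these graphs (the building fragment identifier is made up for illustration; see the gists for the real triples):

@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix schema: <http://schema.org/> .

# the Baitaisi temple site is typed as a spatial thing...
<http://lod.vanderbilt.edu/historyart/site/Baitaisi> a geo:SpatialThing .

# ...and its buildings are typed as landmarks or historical buildings
<http://lod.vanderbilt.edu/historyart/site/Baitaisi#SomeBuilding> a schema:LandmarksOrHistoricalBuildings .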

A third graph about the Chinese temple site Baiyugong is here.  I'll refer to it from time to time by its URI <http://tang-song/baiyugong>.

In the context of SPARQL, an RDF dataset is a collection of graphs.  This collection of graphs will be loaded into some kind of data store (which I will refer to as a "triplestore"), where it can be queried using SPARQL.  There may be many graphs in a triple store and SPARQL can query any or all of them.

In a SPARQL query, the default graph is the set of triples that is queried by default when graph patterns in the query are not restricted to a particular named graph.  There is always a default graph in an RDF dataset.  However, that graph may include an unnamed graph, the merge of one or more named graphs, or it may be an empty graph (a graph containing no triples).

A dataset may also include named graphs whose triples are searched exclusively when that graph is specified by name.



Aside on the three tested endpoints: setup and querying

This section of the post is geared towards those who want to try any of these experiments themselves, who want to work towards setting up one of the three systems as a functioning triple store/SPARQL endpoint, or who just want to have a better understanding how the query interface works.  If you don't care about any of those things, you can skip to the next section.

Each of the three systems can be downloaded and set up for free.  I believe that all three can be set up on Windows, Mac, and Linux, although I have only set them up on Windows.

Callimachus can be downloaded from here as a .zip bundle.  After downloading, the Getting Started Guide has straightforward installation instructions.  After the setup script is complete, you will need to set up a local administrator account using a one-time URL.  If the process fails, if you can't login, or if you destroy the installation (which I will tell you how to do later), you can delete the entire directory into which you unzipped the archive, unzip it again, and repeat the installation steps.  You can't re-use the account setup URL a second time.

To download Stardog, go to http://stardog.com/ and click on the Download button.  Unless you know you want to use the Enterprise version, select the Stardog Community version.  Unfortunately, it has been a while since I installed Stardog, so I can't remember the details.  However, I don't remember having any problems when I followed the Quick Start Guide.  In order to avoid having to set the STARDOG_HOME environmental variable every time I wanted to use Stardog, I made the following batch file in my user directory:

set STARDOG_HOME=C:\stardog-home
C:\stardog-4.0.3\bin\stardog-admin.bat server start

where the stardog-4.0.3\bin is the directory where the binaries were installed.  To start the server, I just run this batch file from a command prompt.  Stardog ships with a default superuser account "admin" with the password "admin", which is fine for testing on your local machine.

To download Blazegraph, go to https://www.blazegraph.com/ and click the download button. The executable is a Java .jar file that is invoked from the command line, so there is basically no installation required.  Blazegraph has a Quick Start guide as a part of its wiki, although the wiki in general is somewhat minimal and does not have much in the way of examples.  For convenience, I put the .jar file in my user directory and put that single command line into a batch file so that I can easily start Blazegraph by invoking the batch file.  There isn't any user login required to use Blazegraph - read-only access is set up by settings in the installation.  I've read about this on the developer's email list, but not really absorbed it.

So what exactly is happening when you start up each of these applications from the command line?  You'll get some kind of message saying that the software has started, but you won't get any kind of GUI interface to operate the software.  That's because what you are actually doing is starting up a web server that is running on your local computer ("localhost"), and is not actually connected to any outside network.  By default, each of the three applications allows you to access the local server endpoint through a different port (Callimachus = port 8080, Stardog = port 5820, and Blazegraph = port 9999), so you can run all three at once if you want.  If you wanted to operate one of the applications as an external server, you would change the port to something else (probably port 80).

So what does this mean?  As with most other Web interactions, the communication with each of these localhost servers can take place through HTTP-mediated communication.  SPARQL stands for "SPARQL Protocol and RDF Query Language" - the "Protocol" part means that a part of the SPARQL Recommendation describes the language by which communication with the server takes place.  The user sends a command via HTTP to the address of the server endpoint, coded using the SPARQL protocol, and the server sends a response back to the user in the format (XML, JSON) that the user requests.  If you enjoy such gory details, you can use cURL, Postman, or Advanced Rest Client to send raw queries to the localhost endpoint and then dissect the response to figure out what it means.  Most people are going to be way too lazy to do this for testing purposes.
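For the curious, a raw request with cURL would look roughly like this.  The endpoint path is an assumption on my part (Blazegraph's SPARQL endpoint for the default namespace is typically at /blazegraph/sparql; check your own installation):

curl -G "http://localhost:9999/blazegraph/sparql" --data-urlencode "query=SELECT DISTINCT ?g WHERE { GRAPH ?g { ?s ?p ?o } }" -H "Accept: application/sparql-results+json"

The query is sent as a URL-encoded "query" parameter, and the Accept header asks the server to return the results as JSON.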

Because it's a pain to send and receive raw HTTP, each of the three platforms provides a web interface that mediates between the human and the endpoint.  The web interface is a web form that allows the human user to type the query into a box, then click a button to send the query to the endpoint.  The code in the web page properly encodes the query, sends it to the localhost endpoint, receives the response, then decodes the response into a tabular form that is easier for a human to visualize than XML or JSON.  The web form makes it easy to interact with the endpoint for the purpose of developing and testing queries.

However, when the endpoint is ultimately put into production, the sending of queries and visualization of the response would be handled not by a web form, but by Javascript in web pages that make it possible for the end users to interact with the endpoint using typical web controls like dropdowns and buttons without having to have any knowledge of writing queries.  To see how this kind of interaction works, open the test Chinese Temple website at http://bioimages.vanderbilt.edu/tang-song.html using Chrome.  Click on the options button in the upper right corner of the browser and select "More tools" then "Developer tools".  Click on the Network tab and you can watch how the web page interacts with the endpoint.  Clicking on any of the "sparql?query=..." items, then the "header" tab on the right shows the queries that are being sent to the endpoint.  Clicking on "response" tab on the right shows the response of the endpoint.  This response is used by the Javascript in the web page to build the dropdown lists and the output at the bottom of the page.

In the rest of this post, I will describe interactions with the localhost endpoint through the web form interface, but keep in mind that the same queries and commands that we type into the box could be sent directly to the endpoint from any other kind of application (Javascript in a web page, desktop application, smartphone app) that is capable of communicating using HTTP.

Opening and using the web form interfaces

Each of the three applications has a similar web form interface, although the exact behavior of each interface varies.  There are actually two ways to interact with the server: through a SPARQL query (a read operation) and through a SPARQL Update command (a write operation).  The details of these two kinds of interactions are given for each of the applications.

Callimachus

To load the Callimachus web form interface after starting the server, paste the URL

http://localhost:8080/sparql?view

into the browser address box.  If everything is working, the Callimachus interface will look something like this:


Both queries and update commands are pasted into the same box.  However, to make a query, you must click the "Evaluate Query" button. To give an update command, you must click the "Execute Update" button.  After evaluating a query, you will be taken to another page where the response is displayed.  Hitting the back button will take you back to the query page with the query still intact in the box.  After executing an update, the orange button will "gray out" while the command is being executed and turn orange again when it is finished.  No other indication is given that the command was executed.

Namespace prefixes must be explicitly defined in the text of the box.  However, once prefixes are used, Callimachus "remembers" them, so it isn't necessary to re-define them with every query.

Stardog

To load the Stardog web form interface after starting the server, paste the URL

http://localhost:5820/myDB#!/query

into the browser address box.  If everything is working, the Stardog interface will look something like this:

Stardog does not differentiate between queries and update commands.  Both are typed into the same box and the "Execute" button is used to initiate both.  Query results will be given in the "Results" area at the bottom of the screen.  Successful update commands will display "True" in the Results area.

Commonly used prefixes that appear in the Prefixes: box don't have to be explicitly typed in the text box.  Additional pre-populated prefixes can be added in the Admin Console.


Blazegraph

To load the Blazegraph web form interface after starting the server, paste the URL

http://localhost:9999/blazegraph/#query

into the browser address box.  If everything is working, the Blazegraph query interface will look something like this:


Only queries can be pasted into this box.  Well-known namespace abbreviations can be inserted into the box using the "Namespace shortcuts" dropdowns above the box.  If the query executes successfully, the results will show up in the space below the Execute button.  The page also maintains a record of the past queries that have been run.  They are hyperlinked, and clicking on them reloads the query in the box.

To perform a SPARQL Update, the UPDATE tab must be selected.  That generates a different web form that looks like this:


There are several ways to interact with this page.  For now, the "Type:" dropdown should be set for "SPARQL Update".  A successful update will show a COMMIT message at the bottom of the screen.  The "mutationCount" gives an indication of the number of changes made; in this example 10 triples were added to the triplestore, so the mutationCount=10.



The SPARQL Update "nuclear option": DROP ALL

One important question in any kind of experimentation is: "What do I do if I've totally screwed things up and I just want to start over?"  In Stardog and Blazegraph, the answer is the SPARQL Update command "DROP ALL".  Executing DROP ALL causes all of the triples in all of the graphs in the database to be deleted.  You have a clean slate and an empty triplestore ready to start afresh.  Obviously, you don't want to do this if you've spent hours loading hundreds of millions of triples into your production triplestore.  But in the type of experiments I'm running here, it's a convenient way to clear things out for a new experiment.

However, you NEVER, NEVER, NEVER want to issue this command in Callimachus.  You will understand why later in this post, but for now I'll just say that the best case scenario is that you will be starting over with a clean install of Callimachus if you do it.  Instead of DROP ALL, you should drop each graph individually.  We will see how to do that below.


Putting a graph into the triplestore (finally)

All of these preliminaries were to get us to the point of being ready to load a graph into the triplestore.  In each of the three applications, there are multiple ways to accomplish this task, and many of those ways differ among the applications.  However, loading a graph using SPARQL Update works the same on all three (the beauty of W3C standards!), so that's how we will start.

If you want to try to achieve the same results as are shown in the examples here, save the three example files from my Gists: test-baitaisi.ttl, test-baiyugong.ttl, and test-unnamed.ttl.  Put them somewhere on your hard drive where they will have a short and simple file path.

Using the SPARQL Update web form of the application of your choice, type a command of this form:

LOAD <file:///c:/Dropbox/tang-song/test-baitaisi.ttl> INTO GRAPH <http://tang-song/baitaisi>

This command contains two URIs within angle brackets.  The second URI is the name that I want to use to denote the uploaded graph.  I'll use that URI any time I want to refer to the graph in a query.   Recall that this URI is just a name and doesn't have to actually represent any real URL on the Web.  The first URI in the LOAD command is a URL - it provides the means to retrieve a file.  It contains the path to the test-baitaisi.ttl file that you downloaded (or some other file that contains serialized RDF).  The triple slash thing after "file:" is kind of weird.  The "host name" would typically go between the second and third slashes, but on your local computer it can be omitted - resulting in three slashes in a row.  (I think you can actually use "file://localhost/c:/..." but I haven't tried it.)  The path can't be relative, so in Windows, the full path starting with the drive letter must be given.  I have not tried Mac and Linux, but see the answer to this stackoverflow question for the probable path forms.  If the path is wrong or the file doesn't exist, an error will be generated.

Execute the update by clicking on the button.  How do we know that the graph is actually there?  Here is a SPARQL query that can answer the question:

select distinct ?g where { 
  graph ?g {?s ?p ?o}
}

If you are using Blazegraph, you'll have to switch from the Update tab to the Query tab before pasting the query into the box.  Execute the query, and the results in Stardog and Blazegraph should show the URI that you used to name the graph that you just uploaded: http://tang-song/baitaisi .

The results in Callimachus are strange.  You should see //tang-song/baitaisi in the list, but there are a bunch of other graphs in the triplestore that you never put there.  These are graphs that are needed to make Callimachus operate.  Now you can understand why using the DROP ALL command has such a devastating effect in Callimachus.  The command DROP ALL is faithfully executed by Callimachus and wipes out every graph in the triplestore, including the ones that Callimachus needs to function.  The web server continues to operate, but it doesn't actually "know" how to do anything and fails to display any web page of the interface.  Why Callimachus allows users to execute this "self-destruct" command is beyond me!

The graceful way to get rid of your graph in Callimachus is to drop the specific graph rather than all graphs, using this SPARQL Update command:

DROP GRAPH <http://tang-song/baitaisi> 

This will leave intact the other graphs that are necessary for the operation of Callimachus.

Specifying the role of a named graph

For the examples in this section, you should load the test-baitaisi.ttl and test-baiyugong.ttl files from your hard drive into the triplestore(s) using the SPARQL Update LOAD command as shown in the previous section, naming them with the URIs http://tang-song/baitaisi and http://tang-song/baiyugong respectively.

The FROM clause is used to specify that triples from particular named graphs should be used as the default graph.  There can be more than one named graph specified - the default graph is the merge of triples from all of the specified named graphs [2].  For example, the query

PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

SELECT DISTINCT ?site
FROM <http://tang-song/baitaisi>
FROM <http://tang-song/baiyugong>
WHERE {
  ?site a geo:SpatialThing.
  }

designates that the default graph should be composed of the merge of the two graphs we loaded.  The graph pattern in the WHERE clause is applied to all of the triples in both of the graphs (i.e. the default graph).  Running the query returns the URIs of both sites represented in the graphs:

<http://lod.vanderbilt.edu/historyart/site/Baitaisi>
<http://lod.vanderbilt.edu/historyart/site/Baiyugong>

The FROM NAMED clause is used to say that a named graph is part of the RDF dataset, but that a graph pattern will be applied to that named graph only if it is specified explicitly using the GRAPH keyword.  If we wrote the query like this:

PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

SELECT DISTINCT ?site
FROM <http://tang-song/baitaisi>
FROM NAMED <http://tang-song/baiyugong>
WHERE {
  ?site a geo:SpatialThing.
  }

we only get one result:

<http://lod.vanderbilt.edu/historyart/site/Baitaisi>

because we didn't specify that the graph pattern should apply to the http://tang-song/baiyugong named graph.  In this query:

PREFIX schema: <http://schema.org/>

SELECT DISTINCT ?building
FROM <http://tang-song/baitaisi>
FROM NAMED <http://tang-song/baiyugong>
WHERE {
  GRAPH <http://tang-song/baiyugong> {
    ?building a schema:LandmarksOrHistoricalBuildings.
  }
  }

only buildings described in the <http://tang-song/baiyugong> graph are returned:

<http://lod.vanderbilt.edu/historyart/site/Baiyugong#Houdian>
<http://lod.vanderbilt.edu/historyart/site/Baiyugong#Sanxiandian>
<http://lod.vanderbilt.edu/historyart/site/Baiyugong#Shanmen>
<http://lod.vanderbilt.edu/historyart/site/Baiyugong#Zhengdian>

More complicated queries can be constructed, like this:

PREFIX schema: <http://schema.org/>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

SELECT DISTINCT ?site ?building
FROM <http://tang-song/baitaisi>
FROM NAMED <http://tang-song/baiyugong>
WHERE {
  ?site a geo:SpatialThing.
  GRAPH <http://tang-song/baiyugong> {
        ?building a schema:LandmarksOrHistoricalBuildings.
  }
  }

where ?site binds to sites in the default graph (composed of <http://tang-song/baitaisi>), while ?building binds only to buildings described in the explicitly specified <http://tang-song/baiyugong> named graph.

Using FROM and FROM NAMED clauses in a query makes it very clear which graphs should be considered for matching with the graph patterns in the WHERE clause.
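A useful variation (a sketch in the same vein as the queries above, which I haven't run as part of these tests) is to list both graphs in FROM NAMED clauses, put a variable in the graph position, and include that variable in the SELECT list.  The results then report which named graph each binding came from:

PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

SELECT DISTINCT ?g ?site
FROM NAMED <http://tang-song/baitaisi>
FROM NAMED <http://tang-song/baiyugong>
WHERE {
  GRAPH ?g {
    ?site a geo:SpatialThing.
  }
}

Each ?site should come back paired with the URI of the graph that contains its geo:SpatialThing triple.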

What happens if we load a graph without a name?

It is possible to load a graph into a triplestore without giving it a name, as in this SPARQL Update command:

LOAD <file:///c:/Dropbox/tang-song/test-unnamed.ttl>

Assume that this file has been loaded along with the previous two named graphs.  What would happen if we ran this query:

PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

SELECT DISTINCT ?site
WHERE {
  ?site a geo:SpatialThing.
  }

When no named graph is specified using a FROM clause, the endpoint applies the query to "the default graph".  The problem is that the SPARQL specification is not clear how the default graph should be constructed.  Section 13 says "An RDF Dataset comprises one graph, the default graph, which does not have a name, and zero or more named graphs...", which implies that triples loaded into the store without specifying a graph URI will become part of the default graph.  This is also implied in Example 1 in Section 13.1, which shows the "Default graph" as being the one without a name.  However, it is also clear that "default graph" cannot be synonymous with "unnamed graph", since the FROM clause allows named graphs to be specified as the default graph.  So what happens when we run this query?

On Stardog, the graph pattern binds only a single URI for ?site:

<http://lod.vanderbilt.edu/historyart/site/Anchansi>

This is the site described by the unnamed graph I loaded.  However, running the query on Blazegraph and Callimachus produces this result:

<http://lod.vanderbilt.edu/historyart/site/Baitaisi>
<http://lod.vanderbilt.edu/historyart/site/Baiyugong>
<http://lod.vanderbilt.edu/historyart/site/Anchansi>

which are the URIs for the sites described by the unnamed graph and both of the named graphs!

This behavior is somewhat disturbing, because it means that the same query, performed on the same graphs loaded into the triplestores using the same LOAD commands, does NOT produce the same results.  The results are implementation-specific.

Construction of the dataset in the absence of FROM and FROM NAMED

The GraphDB documentation sums up the situation like this:
The SPARQL specification does not define what happens when no FROM or FROM NAMED clauses are present in a query, i.e., it does not define how a SPARQL processor should behave when no dataset is defined. In this situation, implementations are free to construct the default dataset as necessary.
In the absence of FROM and FROM NAMED clauses, GraphDB constructs the dataset's default graph in the same way as Callimachus and Blazegraph: by merging the database's unnamed graph and all named graphs in the database.

In the absence of FROM and FROM NAMED clauses, all of the applications include all named graphs in the dataset, allowing graph patterns to be applied specifically to them using the GRAPH keyword.  So in the case of this query:

PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

SELECT DISTINCT ?site
WHERE {
  GRAPH ?g {?site a geo:SpatialThing.}
  }

we would expect the results to include ?site URI bindings from the two named graphs:

<http://lod.vanderbilt.edu/historyart/site/Baitaisi>
<http://lod.vanderbilt.edu/historyart/site/Baiyugong>

and indeed we do.  However, Callimachus and Blazegraph also include:

<http://lod.vanderbilt.edu/historyart/site/Anchansi>

in the results, indicating that they consider the unnamed graph to also bind to ?g (Stardog does not).

Construction of the dataset when FROM and FROM NAMED clauses are present

If we run this query:

PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

SELECT DISTINCT ?site
FROM <http://tang-song/baitaisi>
WHERE {
  ?site a geo:SpatialThing.
  }

we get the same result on all three platforms - only the site URI <http://lod.vanderbilt.edu/historyart/site/Baitaisi> from the named graph that was specified in the FROM clause.  This should be expected, since Section 13.2 of the SPARQL 1.1 specification says
A SPARQL query may specify the dataset to be used for matching by using the FROM clause and the FROM NAMED clause to describe the RDF dataset. If a query provides such a dataset description, then it is used in place of any dataset that the query service would use if no dataset description is provided in a query.
The phrase "any dataset that the query service would use if no dataset description is provided" in that passage is suitably vague about what would be included in the dataset in the absence of FROM and FROM NAMED clauses (i.e. the default graph).  Section 13.2 also says
If there is no FROM clause, but there is one or more FROM NAMED clauses, then the dataset includes an empty graph for the default graph.
that is, if only FROM NAMED clauses are included in the query, unnamed graph(s) will NOT be used as the default graph, since the default graph is required to be empty.
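In other words, a query like this sketch should return no results at all, because its bare graph pattern is matched only against the (empty) default graph - assuming the endpoint follows the specification on this point:

PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

SELECT DISTINCT ?site
FROM NAMED <http://tang-song/baiyugong>
WHERE {
  ?site a geo:SpatialThing.
  }

To actually get the Baiyugong site out of that dataset, the graph pattern would have to be wrapped in GRAPH <http://tang-song/baiyugong> { ... }.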

Querying the entire triplestore using Stardog

The way that Stardog constructs datasets is problematic since there is no straightforward way to include all unnamed and named graphs (i.e. all triples in the store) in the same query.  The following query is possible:

PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

SELECT DISTINCT ?site
WHERE {
   {?site a geo:SpatialThing.}
       UNION
  {GRAPH ?g {?site a geo:SpatialThing.}}
  }

but it is awkward, since the desired graph pattern has to be stated twice in the query.  The first graph pattern binds matching triples in the unnamed graph, and the second graph pattern binds matching triples in all of the named graphs.

What is the name of an unnamed graph?

Previously, we saw that the query

SELECT DISTINCT ?g WHERE { 
  GRAPH ?g {?s ?p ?o}
}

could be used to ask the names of graphs that were present in the triplestore.  Let's find out what happens when we run this query on the three triplestores with the unnamed graph and the two named graphs loaded.  Stardog gives this result:

http://tang-song/baitaisi
http://tang-song/baiyugong

which is not surprising, since we saw that an earlier query using GRAPH ?g bound ?g only to the two named graphs.  Running the query on Blazegraph produces this somewhat surprising result:

<http://tang-song/baitaisi>
<http://tang-song/baiyugong>
<file:/c:/Dropbox/tang-song/test-unnamed.ttl>

We see the two named graphs, but we also see a URI for the unnamed graph.  In the absence of a graph URI in the LOAD command, Blazegraph has assigned the graph a URI that is almost the file URI (only one slash after "file:" instead of three).  This might explain why all three graphs (including the unnamed one) bound to ?g in the earlier Blazegraph query.  However, it does not explain the same behavior in Callimachus, since in Callimachus the current query lists only the two named graphs (besides the many Callimachus utility graphs that make the thing run).

Loading graphs using the GUI

Each of the three platforms I've been testing also provides a means to load files into the store using a graphical user interface instead of the SPARQL Update LOAD command.


Callimachus

To get to the file manager in Callimachus, in the browser URL box enter:

http://localhost:8080/?view

You'll see a screen like this:

You can create subfolders and upload files using the red buttons at the top of the screen.  After uploading the unnamed graph file, I ran the query to show all of the named graphs. The one called test-unnamed.ttl showed up on the list.  So although loading files using the GUI does not provide an opportunity to specify a name for the graph, Callimachus assigns a name to the graph anyway (the file name).


Stardog

Before uploading using the GUI, I executed DROP ALL to make sure that all graphs (named and unnamed) were removed from the store.  To load a file, select Add from the Data dropdown list at the top of the page.  A popup will appear and you will have an opportunity to select the file using a dialog. It looks like this:


Stardog gives you an opportunity to specify a name for the graph in the box.  If you don't put in a name, the URI tag:stardog:api:context:default shows up in the box.  I loaded the unnamed graph file and left the graph name box empty, assuming that tag:stardog:api:context:default would be assigned as the name of the graph.  However, running the query to list all graphs produced only the URIs for the two explicitly named graphs.

When I performed the query

PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

SELECT DISTINCT ?site 
FROM <http://tang-song/baitaisi>
WHERE {
  ?site a geo:SpatialThing.
}

I only got one result:

http://lod.vanderbilt.edu/historyart/site/Baitaisi

But when I included the <tag:stardog:api:context:default> graph in a FROM clause:

PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

SELECT DISTINCT ?site 
FROM <http://tang-song/baitaisi>
FROM <tag:stardog:api:context:default>
WHERE {
  ?site a geo:SpatialThing.
}

I got two results:

http://lod.vanderbilt.edu/historyart/site/Anchansi
http://lod.vanderbilt.edu/historyart/site/Baitaisi

So in some circumstances, referring to the "name" of the unnamed graph <tag:stardog:api:context:default> causes it to behave as a named graph, even though it didn't show up when I ran the query to list the names of all graphs.

Blazegraph

I also ran DROP ALL on the Update page before loading files using the Blazegraph GUI.  At the bottom of the Update page, I changed the dropdown from "SPARQL Update" to "File Path or URL". I then used the Choose File button to initiate a file selection dialog.  After selecting the test-unnamed.ttl file, the dropdown switched on its own to "RDF Data" with Format: Turtle, and displayed the file in the box, like this:


There was no opportunity to specify a URI for the graph.  I clicked on the update button, then switched to the query page so that I could run the query that asks for the names of the graphs.  The graph name bd:nullGraph was given.  The namespace shortcuts indicate that bd: is the abbreviation for <http://www.bigdata.com/rdf#>.  So loading an unnamed graph through the GUI again results in Blazegraph assigning it a name, this time <http://www.bigdata.com/rdf#nullGraph> instead of a file name-based URI.

As with Stardog, including the IRI "name" of the unnamed graph in a FROM clause causes it to be added to the default graph.
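For example, a query analogous to the Stardog one above (a sketch of the idea, not a transcript of a test I ran) should pull the GUI-loaded triples into the default graph and bind the Anchansi site URI:

PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

SELECT DISTINCT ?site
FROM <http://www.bigdata.com/rdf#nullGraph>
WHERE {
  ?site a geo:SpatialThing.
}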

Names of unnamed graphs (review)

Here's what I've discovered so far about how the three SPARQL endpoints/triplestores name graphs when no name is assigned to them by the user upon loading.  The examples assume that test-unnamed.ttl is the name of the uploaded file.

Callimachus

When loaded using the GUI: the graph name is the file name, e.g. <test-unnamed.ttl>

When loaded using the SPARQL Update command: LOAD <file:///c:/Dropbox/tang-song/test-unnamed.ttl>: the triples do not appear to be assigned to any named graph.

Stardog

When loaded using the GUI, or when loaded using the SPARQL Update command: LOAD <file:///c:/Dropbox/tang-song/test-unnamed.ttl>:  the triples are added to the graph named <tag:stardog:api:context:default> (although that graph doesn't seem to bind to patterns where there is a variable in the graph position of a graph pattern).

Blazegraph

When loaded using the GUI: the triples are added to the graph named <http://www.bigdata.com/rdf#nullGraph>

When loaded using the SPARQL Update command: LOAD <file:///c:/Dropbox/tang-song/test-unnamed.ttl>: the graph name is a modification of the file name, e.g. <file:/c:/Dropbox/tang-song/test-unnamed.ttl>

For all practical purposes, Blazegraph does not have unnamed graphs - triples always load into some named graph that binds to variables in the graph position of a graph pattern.

Deleting unnamed graphs

As noted earlier, deleting a named graph is easy using the

DROP GRAPH <graphURI>

command of SPARQL Update.  What about deleting the "unnamed" graphs of the flavors I've just described?  In all of the cases of named "unnamed" graphs listed above, inserting the graph's URI into a DROP GRAPH command results in the deletion of the graph.  That's not surprising, since those graphs aren't really "unnamed" after all.  The problematic situation is where triples are loaded into Callimachus using SPARQL Update with no graph name.  Those triples can't be deleted using DROP GRAPH because there is no way to refer to their truly unnamed graph.
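For example, based on the graph names listed in the review above, commands like these (run one at a time in the appropriate application; treat the URIs as templates, since the exact URI depends on how and from where the graph was loaded) should do the job:

# Stardog: the graph that holds triples loaded without a graph name
DROP GRAPH <tag:stardog:api:context:default>

# Blazegraph: a graph named after the file it was loaded from via SPARQL Update LOAD
DROP GRAPH <file:/c:/Dropbox/tang-song/test-unnamed.ttl>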

DROP DEFAULT

The SPARQL 1.1 Update specification, Section 3.2.2, gives an option for the DROP command called DROP DEFAULT.  Given the uncertainty about what is actually considered the "default" graph in the three platforms, I decided to run some tests.

In Callimachus, DROP DEFAULT doesn't do anything as far as I can tell.  That's unfortunate, because it's the only platform that uploads triples into a graph that truly has no name.  As far as I can tell, there is no way to use the DROP command to clear out triples that are loaded using the SPARQL Update LOAD command with no graph URI provided.  (Well, actually DROP ALL will work if you want to self-destruct the whole system!)

In Stardog, every loaded graph with an unspecified name goes into the graph <tag:stardog:api:context:default>.  DROP DEFAULT does delete everything in that graph.

In Blazegraph, DROP DEFAULT deletes the graph <http://www.bigdata.com/rdf#nullGraph>, which is where triples uploaded via the GUI go.  However, DROP DEFAULT does not delete any of the "file name URI"-graphs that result when graphs are uploaded using the SPARQL Update LOAD command without a provided graph IRI.

Summary

Named graphs are likely to be an important part of managing a complex and changing triplestore/SPARQL endpoint, since they are the primary way to run queries over a specific part of the database and the primary way to remove a specified subset of the triples from the store without disturbing the rest of the loaded triples.

Although it is less complicated to load triples into the store without specifying them as part of a named graph, the handling of "unnamed" graphs by the various platforms is very idiosyncratic.  Unnamed graphs introduce complications in querying and management of triples in the store.  Specifically:

  • In Callimachus, in some cases there appears to be no simple way to get rid of triples that aren't associated with some flavor of named graph.  
  • In Stardog, there is no simple way to use a single graph pattern to query triples in both named and unnamed graphs.  
  • Blazegraph seems to be the most trouble-free, since omitting any FROM or FROM NAMED clauses allows triples in both named and "unnamed" graphs to be queried using a single graph pattern.  I put "unnamed" in quotes because Blazegraph always loads triples into a graph identified with a URI even when one isn't specified by the user.  However, knowing what those URIs are is a bit confusing, since the URI that is assigned depends on the method used to load the graph.

My take-home from these experiments is that we are probably best-off continuing with our plan to use Blazegraph as our triplestore, and that we should probably load triples from various sources and projects into named graphs.

What I've left out

There are a number of features of the SPARQL endpoints/triplestores that I have not discussed in this post.  One is the service description of the endpoint.  The SPARQL 1.1 suite of Recommendations includes the SPARQL 1.1 Service Description specification.  This specification describes how a SPARQL endpoint should provide information about the triplestore to machine clients that discover it.   This information includes number of triples, graphs present in the store, namespaces used, etc.  A client can request the service description by sending an HTTP GET request to the endpoint without any query string.  For example, with Blazegraph, sending a GET to

http://localhost:9999/blazegraph/sparql

returns the service description.

The various applications also enable partitioning data in the store on a level higher than graphs.  For example, Blazegraph supports named "datasets".  Querying across different datasets requires doing a federated query, even if the datasets are in the same triplestore.  Stardog has a similar feature where its triples are partitioned into "databases" that can be managed independently.
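For example, from a query submitted to one Blazegraph dataset you can reach another one with a SERVICE clause.  Something like this sketch should work, assuming a second dataset exists at the endpoint URL shown (the name "other" and the endpoint path are illustrative - check the URL that the Blazegraph workbench reports for the dataset you want to query):

PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

SELECT DISTINCT ?site
WHERE {
  SERVICE <http://localhost:9999/blazegraph/namespace/other/sparql> {
    ?site a geo:SpatialThing.
  }
}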

There are also alternate methods for loading graphs that I haven't mentioned or explored.  Because SPARQL Update commands can be made via HTTP, they can be automated by a desktop application that can issue HTTP requests (as opposed to typing the commands into a web form).  So with appropriate software, graph maintenance can be automated or carried out at specified intervals.  Blazegraph also has a "bulk loader" facility that I have not explored.  Clearly there are a lot more details to be learned!


[1] GitHub repo at https://github.com/HeardLibrary/semantic-web
[2] In the merge, blank nodes are kept distinct within the source graphs.  If the same blank node identifier is used in two of the merged graphs, the blank node identifier of one of the graphs will be changed to ensure that it denotes a different resource from the resource identified by the blank node identifier in the other graph.