Tuesday, May 2, 2017

Using the TDWG Standards Documentation Specification with a Controlled Vocabulary

I just recently got the news that the TDWG Executive Committee has ratified the Standards Documentation Specification (SDS).  We now have a way to describe vocabularies not only in a consistent human-readable form, but also in a machine-readable form.  This includes not only TDWG's current core vocabularies (Darwin Core, DwC and Audubon Core, AC), but also vocabularies of controlled values that will be developed in the future.

What are the implications of saying that the vocabularies will be "machine-readable"?  We have had about ten years now of promises that using "semantic technologies" will magically revolutionize biodiversity informatics, but despite repeated meetings, reports, and publications, our actual core technologies are built around conventional database technologies and simple file formats like CSVs.  Many TDWG old-timers have reached the point of "semantic fatigue" resulting from broken promises about what RDF, Linked Data, and the Semantic Web is going to do for them.  So the purpose of this blog post is NOT to sing the praises of RDF and try to change peoples minds about it.  Rather, it is to show how describing vocabularies using the SDS can make management of controlled vocabularies be practical, and to show how the machine-readable representations of those controlled vocabularies can be used to build applications that can mediate the generation and cleaning of data without human intervention.

I've been working recently with Quentin Groom to flesh out how a test controlled vocabulary for dwc:occurrenceStatus would be serialized in accordance with the SDS.  That test vocabulary is used in the examples that follow.  I'm really excited about this and hopefully there will be more progress to report on this front in the near future.

What is a controlled vocabulary term?

There are several misconceptions about the terminology for describing controlled vocabularies that need to be cleared up before I get into the details about how the SDS will facilitate the management and use of controlled vocabularies.  The first misconception is about the meaning of "controlled vocabulary term".  In the TDWG community, there is a tendency for people to think that a "controlled vocabulary term" is a certain string that we should all use to represent a particular value of a property.  For example, we could say that in a Darwin Core Archive, we would like for everyone to use the string "extant" as the value for the property dwc:occurrenceStatus when we intend to convey the meaning that an organism was present in a certain geographical location at a certain period of time.  However, the controlled vocabulary term is actually the concept of what we would describe in English as "an organism was present in a certain geographical location at a certain period of time" and not any particular string that we might use as a label for that concept.

This idea that an controlled value term is a concept rather than a language-dependent label lies at the heart of the Simple Knowledge Organization System (SKOS), a W3C Recommendation used to describe thesauri and controlled vocabularies. In fact, the core entity in SKOS is skos:Concept, the class of ideas or notions.  Those ideas can be "be identified using URIs, labeled with lexical strings in one or more natural languages" [1], but neither the URIs nor the strings "are" the concepts. The SDS recognizes this distinction when it specifies (Section 4.5.4) that controlled vocabulary terms should be typed as skos:Concept.

What is a term IRI?

Another common misconception is that a IRI must "do something" when you paste it into a web browser.  (In current W3C standards, "IRI", Internationalized Resource Identifier, has replaced "URI", Uniform Resource Identifier, but in the context of this post you can consider them to be interchangeable.)  Although it is nice if an IRI dereferences when you put it in a browser, there is no requirement that it do so.  At it's core, an IRI is simply a globally unique identifier that conforms to a particular IETF specification [2].

For example, the IRI http://rs.tdwg.org/dwc/iri/occurrenceStatus is a valid IRI, because it conforms to the IRI specification.  However, it does not currently dereference because no one has (yet) set up the TDWG server to handle it.  It is, however, a valid Darwin Core term, because it is defined in Section 3.7 of the Darwin Core RDF Guide.  The SDS specifies in Section 2.1.1 that IRIs are the type of identifiers that are used in TDWG standards to uniquely identify resources, including vocabulary terms.  Some other kind of globally unique identifier (e.g. UUIDs) could have been used, but using IRIs codified the practice already used by TDWG for other vocabularies.

The SDS does not specify the exact form of IRIs.  That is a matter of design choice, probably to be determined by the TDWG Technical Architecture Group (TAG).  Existing terms in DwC and AC use the pattern where a term IRI is composed of a namespace part and a local name that is a string composed of some form of an English label for the term.  For example, http://rs.tdwg.org/dwc/iri/occurrenceStatus is constructed from the namespace "http://rs.tdwg.org/dwc/iri/" (abbreviated by the compact URI or CURIE dwciri:) and the camel case local name "occurrenceStatus".  There is no requirement in the SDS for the form of the local name part of a term IRI - it could also be an opaque identifier such as a number.  Again, this is a design choice.  So it would be fine for the local name part of the IRI to be something like "os12345".

What is a label?

A label is a natural language string that is used by humans to recognize a resource.  In SKOS, labels are strings of Unicode characters in a given language.  The rules of SKOS declare that for each concept there is at most one preferred label per language, indicated by the property skos:prefLabel.  There may be any number of additional labels, such as "hidden labels" (skos:hiddenLabel) that are known to be associated with a concept, but that should not be suggested for use.  In SKOS, labels may have a language tag, although that is not required.

In SKOS, the intent is to create a mechanism that leads human users to discover the preferred label for a concept in the user's own language, while also specifying other non-preferred labels that users might be inclined to use on their own.

Based on TDWG precedent, the SDS specifies that English language labels must be included in the standards documents that describe vocabularies.  Labels in other languages are encouraged, but do not fall within the standard itself.  That makes adding those labels less cumbersome from the vocabulary maintenance standpoint.

What is a "value"?

The prevalent view in TDWG that there is one particular string that should serve as the "controlled value" for a term is alien to SKOS.  In SKOS, unique identification of concepts is always accomplished by IRIs. As a concession to current practice, in Section 4.5.4 the SDS declares that each controlled vocabulary term should be associated with a text string that is unique with that vocabulary.  The utility property rdf:value is used to associate that string with the term.  If people want to provide a string in a CSV file to represent a controlled vocabulary term, they can use this string as a value of a Darwin Core property such as dwc:occurrenceStatus.  However, if they want to be completely unambiguous, they can use the term IRI as a value of dwciri:occurrenceStatus. Using dwciri:occurrenceStatus instead of dwc:occurrenceStatus is basically a signal that the value is "clean" and that no disambiguation is necessary.

The pieces of the controlled vocabulary

The Standards Documentation Specification breaks apart machine-readable controlled vocabulary metadata into several pieces.  One piece is the metadata that actually comprise the standard itself.  Those metadata are described in Sections 4.2.2, 4.4.2, 4.5, and 4.5.4 .  In the case of the terms themselves, the critical metadata properties are rdfs:label (to indicate the label in English), rdfs:comment (to indicate the definition in English), and rdf:value (to indicate the unique text string associated with the term).  Because these values are part of the normative description of the vocabulary standard, their creation and modification are strictly controlled by processes described in the newly adopted Vocabulary Maintenance Specification.

In contrast, assignment of labels in languages other than English and translations of definitions into other languages falls outside the standards process.  Lists of multilingual labels and definitions are therefore kept in documents that are separate from the standards documents.  This makes it possible to easily add to these lists, or make corrections without invoking any kind of standards process.  The properties skos:prefLabel and skos:definition can be used to indicate the multilingual translations of labels and definitions respectively.

In addition to the preferred labels, it is also possible to maintain lists of non-preferred labels that have been used by some data providers, but which do not conform to the unique text string assigned to each term.  GBIF, VertNet, and other aggregators have compiled such lists from actual data in the wild.  The term skos:hiddenLabel can be used to associate these strings with the controlled value terms to which they have been mapped.

Controlled vocabulary metadata sources

For convenience, the machine-readable metadata in this post will be shown in RDF/Turtle, which is generally considered to be the easiest serialization for humans to read.  However, it may be serialized in any equivalent form -  developers may prefer a different serialization such as XML or JSON.  Here is an example of the metadata associated with a term from a controlled vocabulary designed to provide values for the Darwin Core term occurrenceStatus:

<http://rs.tdwg.org/cv/status/extant> a skos:Concept;
     skos:inScheme <http://rs.tdwg.org/cv/status/>;
     rdfs:isDefinedBy <http://rs.tdwg.org/cv/status/>;
     dcterms:isPartOf <http://rs.tdwg.org/cv/status/>;
     rdf:value "extant";
     rdfs:label "extant"@en;
     rdfs:comment "The species is known or thought very likely to occur presently in the area, which encompasses localities with current or recent (last 20-30 years) records where suitable habitat at appropriate altitudes remains."@en.

These metadata would be included in the machine-readable form of the vocabulary standard document.  Here are metadata associated with the same term, but included in an ancillary document that is not part of the standard:

     skos:prefLabel "presente"@pt;
     skos:definition "Sabe-se que a espécie ocorre na área ou a sua ocorrência é tida como bastante provável, o que inclui localidades com registos atuais ou recentes (últimos 20-30 anos) nas quais se mantêm habitats adequados às altitudes apropriadas."@pt;
     skos:prefLabel "extant"@en;
     skos:definition "The species is known or thought very likely to occur presently in the area, which encompasses localities with current or recent (last 20-30 years) records where suitable habitat at appropriate altitudes remains."@en;
     skos:prefLabel "vorhanden "@de;
     skos:definition "Von der Art ist bekannt oder wird mit hoher Wahrscheinlichkeit angenommen, dass sie derzeit im Gebiet anwesend ist, und für die Art existieren aktuelle oder in den letzten 20 bis 30 Jahren erstellte Aufzeichnungen, in Lagen mit geeigneten Lebensräumen. "@de.

These data provide the non-normative translations of the preferred term label and definition.  Here are some metadata that might be in a third document:

     skos:hiddenLabel "Reported";
     skos:hiddenLabel "Outbreak";
     skos:hiddenLabel "Infested";
     skos:hiddenLabel "present";
     skos:hiddenLabel "probable breeding";
     skos:hiddenLabel "Frecuente";
     skos:hiddenLabel "Raro";
     skos:hiddenLabel "confirmed breeding";
     skos:hiddenLabel "Present";
     skos:hiddenLabel "Présent ";
     skos:hiddenLabel "presence";
     skos:hiddenLabel "presente";
     skos:hiddenLabel "frecuente";
...   .

These are all of the known variants of strings that have been mapped to the term http://rs.tdwg.org/cv/status/extant.

For management purposes, these three documents will probably be managed separately.  The first list from the standards document will be changed rarely, if ever.  The second list will (hopefully) be added to frequently by human curators as the controlled vocabulary is translated into new languages.  The third list may be massive, and maintained by data-cleaning software as human operators of the software discover new variants in submitted data and assign those variants to particular terms in the controlled vocabulary.

Periodically, as the three lists are updated, they can be merged.  Given that the the SDS is agnostic about the form of the machine-readable metadata, they could be ingested as JSON-LD and processed using purpose-built applications.  However, in the following examples, I'll load the metadata into an RDF triplestore and expose the merged graph via a SPARQL endpoint.  That is convenient because the merging can be accomplished without any additional processing of the data on my part.

Accessing the merged graph

I've loaded the metadata shown above into the Vanderbilt library's SPARQL endpoint, where it can be queried at http://rdf.library.vanderbilt.edu/sparql?view.  The following query can be pasted into the box to see what properties and values exist for http://rs.tdwg.org/cv/status/extant in the merged graph:

SELECT DISTINCT ?property ?value WHERE {
   <http://rs.tdwg.org/cv/status/extant> ?property ?value.

Unfortunately, the results include some garbage that Callimachus inserts as it manages the graph, but you can see that the metadata included in the standards document, translations document, and hidden label document all come up.

Clearly, nobody is actually going to want to paste queries into a box to use this information.  However, the data can be accessed by an HTTP GET call using CURL, Python, Javascript, jQuery, XQuery, or whatever flavor of software you like.  Here's what the query above looks like when URL encoded and attached to the endpoint IRI as a query string:


The query can be sent using HTTP GET by your favorite application to retrieve the same metadata as one sees in the paste-in box.  Currently, the Vanderbilt SPARQL endpoint only supports XML query results, but we are working on moving to using Blazegraph, which also supports JSON results.

Many people seem to be somewhat mystified about the purpose of a SPARQL endpoint and assume that it is some kind of weird Semantic Web thing.  If you fall into this category, you should think of a SPARQL endpoint as a kind of "programmable" web API.  Unlike a "normal" API where you must select from a fixed set of requests, you can request any imaginable result that can possibly be retrieved from a dataset.  That means that the request IRIs are probably going to be more complex, but once they have been conceived, the requests are going to be made by a software application, so who cares how complex they are?

Multilingual pick list for occurrenceStatus

I'm going to demonstrate how the multilingual data could be used to create a dropdown where a user selects the appropriate controlled value for the Darwin Core term occurrenceStatus when presented with a list of labels in the user's native language.  Here's the SPARQL query that lies at the heart of generating the pick list:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?label ?def ?term WHERE {
?term <http://www.w3.org/2000/01/rdf-schema#isDefinedBy><http://rs.tdwg.org/cv/status/>.
?term skos:prefLabel ?label.
?term skos:definition ?def.
FILTER (lang(?label)='en')
FILTER (lang(?def)='en')
ORDER BY ASC(?label)

Here's what it does.  The triple pattern:
?term <http://www.w3.org/2000/01/rdf-schema#isDefinedBy><http://rs.tdwg.org/cv/status/>.
restricts the results to terms that are part of the occurrenceStatus controlled vocabulary.  The triple patterns:
?term skos:prefLabel ?label.
?term skos:definition ?def.
bind preferred labels and definitions to the variables ?label and ?def.  The triple patterns:
FILTER (lang(?label)='en')
FILTER (lang(?def)='en')
restrict the labels and definitions to those that are language-tagged as English.  To change the requested language, a different language tag, such as 'pt' or 'de' can be substituted for 'en' by the software.  The last line tells the endpoint to return the results in alphabetical order by label.  The query is URL encoded and appended as a query string to the IRI of the SPARQL endpoint:


A page that makes use of this query is online at http://bioimages.vanderbilt.edu/pick-list.html?en.  The URL of the page ends in a query string that specifies the starting language for the page.  Currently en, pt, and de are available, although I'm hoping to add ko, zh-hans, zh-hant, and es soon.  The "guts" of the program are the javascript code at http://bioimages.vanderbilt.edu/pick-list.js.  Lines 58 through 67 generate the query above and line 68 URL-encodes it.  Lines 71 through 78 perform the HTTP GET call to the endpoint, and lines 69 through 102 process the XML results whey they come back and add them to the options of the pick list.  If you are viewing the page in a Chrome browser, you can see what's going on behind the scenes using the Developer tools that you can access from the menu in the upper right of the Chrome window ("More tools" --> "Developer tools").  Here's what the request looks like:

Here's what the response looks like:

You can see that the results are in XML, which makes the Javascript uglier than it would have to be.  When we get our new endpoint set up to return JSON, the Javascript will be simpler.  In line 101 of the Javascript code, the language-specific label gets inserted as the label of the option, but the actual value of the option is set as the IRI that is returned from the endpoint for that particular term.  Thus, the labels inserted into the option list vary depending on the selected language, but the IRI is language-independent.  In this demo page, the IRI is simply displayed on the screen, but in a real application, the IRI would be assigned as the value of a Darwin Core property.  In my opinion, the appropriate property would be dwciri:occurrenceStatus, regardless of whether the property is part of an RDF representation or a CSV file.  Using a dwciri: term implies that the value is a clean and unambiguous IRI.  Using dwc:occurrenceStatus would imply that the value could be any kind of string, with no implication that it was "cleaned" or even appropriate for the term.

You may have noticed that the query also returns the term definition in the target language.  Originally, my intention was that it should appear as a popup when the user moused over the natural language label on the dropdown, but my knowledge of HTML is too weak for me to know how to accomplish that without some digging.  I might add that in the future.

"Data cleaning" application demonstration

I created a second demo page to show how data from the merged graph could be used in data cleaning.  That page is at http://bioimages.vanderbilt.edu/clean.html.  The basic query that it uses is:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX cvstatus: <http://rs.tdwg.org/cv/status/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?term where {
?term rdfs:isDefinedBy cvstatus:.
 {?term skos:prefLabel ?langLabel.FILTER (str(?langLabel) = 'Común')}
 {?term skos:hiddenLabel 'Común'. }
 {?term rdf:value 'Común'. }

This query is a little more complicated than the last one.  The triple pattern
?term rdfs:isDefinedBy cvstatus:.
limits terms to the appropriate controlled vocabulary.  The rest of the query is composed of the UNION of three graph patterns.  The first pattern:
?term skos:prefLabel ?langLabel.
FILTER (str(?langLabel) = 'Común')
screens the string to be cleaned against all of the preferred labels in any language.  The second pattern:
?term skos:hiddenLabel 'Común'.
checks whether the string to be cleaned is included in the list of non-preferred labels that have been accumulated from real data.  The third pattern:
?term rdf:value 'Común'.
checks if the string to be cleaned is actually one of the preferred, unique text strings associated with any term.  In the Javascript that makes the page run (see http://bioimages.vanderbilt.edu/clean.js for details), the string to be cleaned is inserted into the query from a variable (i.e. a variable substituted in place of 'Común' in the query above.)

In this particular case, the string 'Común' was mapped to the concept identified by http://rs.tdwg.org/cv/status/extant, so a match is made by the second of the three graph patterns (the hidden label one).  Here's what the page looks like when it is running with Developer tools turned on:

You can see that the response is a single value wrapped up in a bunch of XML.  Again, things would be simpler if the endpoint were set up to return JSON (soon, really!).  So in essence, the data cleaning function could be accessed by this "API call":


where the string to be cleaned is substituted for "Com%C3%BAn" (urlencoded).

As a practical matter, it would probably not be smart to actually build an application that relied on screening every record by making a call like this to the SPARQL endpoint.  Our endpoint just isn't up to handling that kind of traffic.  It would be more realistic to build an application that made one call at the start of each session that retrieved the whole array that mapped known strings to controlled value IRIs.  A query to accomplish that would be:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX cvstatus: <http://rs.tdwg.org/cv/status/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT DISTINCT ?term ?value where {?term rdfs:isDefinedBy cvstatus:. 
{?term skos:prefLabel ?langLabel.FILTER (str(?langLabel) = ?value)}
{?term skos:hiddenLabel ?value. }
{?term rdf:value ?value. }

Notice that it is basically the same as the previous query, except that the string to be cleaned is represented by the variable ?value instead of being a literal.  Here's what the HTTP GET IRI would look like:


If you sent the request header:
Accept: application/sparql-results+json
you would get JSON back instead of XML.  As I've said repeatedly, this won't currently work on the Vanderbilt library SPARQL endpoint, which does not yet support JSON results.  However, I ran the query on an installation of Blazegraph so that I could get JSON results.  You can see the results at this Gist:  https://gist.github.com/baskaufs/0b1193990bc7182e440ff238cac6e528
 These results could be ingested by a data-cleaning application, which could then keep track of newly encountered strings.  A human would have to map each new string to one of the controlled value IRIs, but if those new mappings were added to the list of known variants, they would become available as soon as the graph on the SPARQL endpoint were updated.

Where from here?

The applications that I've shown here are just demos and developers who are better programmers than I can incorporate similar code into their own applications to make use of the controlled vocabulary data that TDWG working groups will generate and expose in accordance with the SDS.  Clearly, work flows will need to be established, but once those are set up, there is the potential to automate most of what I've demonstrated here.  The raw data will live on the TDWG Github site, probably in the form of CSVs and the transformation to a machine-readable form will be automated.  There could be one to many SPARQL endpoints exposing those data - one feature of the SDS is that it will be possible for machines to discover the IRIs of vocabulary distributions, including SPARQL endpoints.  So of one endpoint goes down or is replaced, a machine will be able to automatically switch over to a different one.

[1] SKOS Simple Knowledge Organization System Reference.  W3C Recommendation. https://www.w3.org/TR/skos-reference/
[2] Internationalized Resource Identifiers (IRIs). RFC 2987. The Internet Engineering Task Force (IETF). https://tools.ietf.org/html/rfc3987

No comments:

Post a Comment