Sunday, March 10, 2019

Understanding the TDWG Standards Documentation Specification, Part 2: Hierarchy Model and Implementation of IRIs



This is the second in a series of posts about the TDWG Standards Documentation Specification (SDS).  For background on the SDS, see the first post.

Note: this post was revised on 2020-03-04 when IRI dereferencing of the http://rs.tdwg.org/ subdomain went from testing into production.

Implementation plan?

The SDS was ratified and issued in April of 2017.  It did not, however, include any plan for its implementation.  It wasn't actually clear whose responsibility it was to make implementation of the SDS happen.  The Technical Architecture Group (TAG) might have been a logical group to take charge, but in 2017 it had not yet been reconstituted in its current form.  As the architect of the SDS, I had a vested interest in seeing it become functional, so I decided to take the initiative to figure out how it could be implemented. As I worked on this project, I got feedback from the Darwin Core Maintenance Group, key people working on the TDWG website and other infrastructure, and later from some TAG members.

Although the SDS provided a general framework, it left a lot of the details to implementers.  In particular, the SDS had relatively little to say about the form of URIs used as identifiers for documents whose form was specified by the SDS.  For guidance, I looked to precedents set by Darwin Core, general practices in the Linked Data world, and practicalities of URI dereferencing.

The SDS model

The SDS describes a hierarchical model for resources within its scope.  That hierarchy is relatively simple for documents within a standard: there is simply a hasPart/isPartOf relationship between the standard and its documents. 

For vocabularies, the situation is more complicated.  The SDS describes four levels in the hierarchy that applies to vocabularies: standard, vocabulary, term list, and term.  There was some discussion in the run-up to ratification of the SDS as to whether the model needed to be this complicated.  At that time, I asserted that this was the least complicated model that could accomplish all of the things that people said they wanted to do with vocabularies in TDWG. 

It would be tempting to say that a much simpler model might be possible.  For example, we could consider the Audubon Core Standard to be synonymous with the Audubon Core vocabulary.  We could say that Audubon Core terms were a direct part of it -- a simple two-level hierarchy. 

However, the Audubon Core Standard is more than just a set of terms.  The Audubon Core vocabulary is distinct from the documents that describe how Audubon Core should be used (the structure document, term list document, etc.), which are also part of the standard.  Although we might lump the standard, vocabulary, and documents together in our human minds, if we really aspire to have machine-readable descriptions of components of TDWG standards, we have to distinguish between things that are not the same -- things that have different authors, creation dates, and version histories. 

Example of first (standards) and second (vocabularies and documents) levels of the TDWG Standards Documentation Specification hierarchy

As I described in the previous post, the community also expressed a desire to be able to have more than one "Darwin Core vocabulary".  Some people might want only the basic vocabulary (a "bag of terms" with definitions). Others might want a more complicated vocabulary where some terms might be declared to be equivalent to terms outside of Darwin Core, or classes might be declared to be subclasses of classes in an outside ontology.  Still others might want to create a Darwin Core vocabulary that restricts the values that can be used for certain terms, or entails class membership through range and domain declarations.  So although we don't currently have more than one Darwin Core vocabulary, we want to allow for that possibility in the future. That's another reason to have a model that separates the standard from the vocabulary or vocabularies that it defines.


Example of second (vocabularies) and third (term lists) levels of the TDWG Standards Documentation Specification hierarchy

Within a vocabulary, the SDS describes an entity called "term list" (Sections 3.3.3 and 4.4.2). 

Example of third (term list) and fourth (term) levels of the TDWG Standards Documentation Specification hierarchy.  This is an example of a list of terms defined by TDWG and only includes a few of the terms on the list.

For terms defined by a TDWG vocabulary, there is an authoritative term list for each namespace.  For example, there is an authoritative term list for the dwc: namespace and another for the dwciri: namespace.  These lists are considered authoritative because they define the terms they contain.  Dereferencing a term list IRI should return the term list document. 

Example of third (term list) and fourth (term) levels of the TDWG Standards Documentation Specification hierarchy.  This is an example of a list of terms borrowed by TDWG and only includes a few of the terms on the list.

A term list can also contain terms that are borrowed from another vocabulary and included in the TDWG vocabulary.  The SDS does not prescribe how borrowed terms should be organized in term lists -- for example, whether all borrowed terms should be included in a single list or whether there should be a separate term list for each namespace from which terms are borrowed.  As a practical matter, it made sense to create a separate term list for each namespace.
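The four-level hierarchy described above can be sketched as a small set of nested data structures.  This is only an illustration of the model, not code from the SDS or from any TDWG system: the class names, field names, and the all_terms helper are assumptions made for the example, the rs.tdwg.org IRIs are ones that appear elsewhere in this post, and the standard's IRI is left as the literal "nnn" pattern rather than a real identifier.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Term:
    iri: str


@dataclass
class TermList:
    iri: str
    terms: List[Term] = field(default_factory=list)


@dataclass
class Vocabulary:
    iri: str
    term_lists: List[TermList] = field(default_factory=list)


@dataclass
class Document:
    iri: str


@dataclass
class Standard:
    # A standard has parts of two kinds: vocabularies and documents.
    iri: str
    vocabularies: List[Vocabulary] = field(default_factory=list)
    documents: List[Document] = field(default_factory=list)


def all_terms(standard: Standard) -> List[str]:
    """Walk standard -> vocabulary -> term list -> term and collect term IRIs."""
    return [
        term.iri
        for vocabulary in standard.vocabularies
        for term_list in vocabulary.term_lists
        for term in term_list.terms
    ]


# Hypothetical, heavily abbreviated instance; "nnn" is a placeholder, not a real number.
dwc_standard = Standard(
    iri="http://www.tdwg.org/standards/nnn",
    vocabularies=[
        Vocabulary(
            iri="http://rs.tdwg.org/dwc/",
            term_lists=[
                TermList(
                    iri="http://rs.tdwg.org/dwc/terms/",
                    terms=[Term("http://rs.tdwg.org/dwc/terms/dateIdentified")],
                ),
                TermList(iri="http://rs.tdwg.org/dwc/iri/"),
            ],
        )
    ],
)

print(all_terms(dwc_standard))
```

Keeping the vocabulary as its own object between the standard and its term lists is what leaves room for a standard to define more than one vocabulary later.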

Some notes about IRIs

According to the SDS, each resource in the hierarchy should be assigned an IRI as an identifier (Section 2.1.1).  IRIs are a superset of URIs that also allow non-ASCII characters to be used.  For the purposes of this post, you can consider URIs and IRIs to be synonymous.

There has always been confusion between the use of IRIs/URIs as identifiers and URLs as resource locators.  Fundamentally, an IRI is an identifier that may or may not actually dereference in a web browser to retrieve a web page about the resource.  In the Linked Data community, it is considered a best practice for IRIs to dereference, but it isn't a requirement.  In fact, there are a number of "borrowed" term IRIs in Audubon Core that don't dereference and probably never will.  So although it isn't a requirement of the SDS that TDWG IRIs dereference, one goal of implementation is to eventually make that happen. 

The origin of the subdomain rs.tdwg.org has always been a little mysterious to me.  I believe that the "rs" part stands for "schema repository" and that it was originally intended to be a place from which XML and other schemas could be retrieved.  Although I don't think there is any official policy that requires use of the rs.tdwg.org subdomain for TDWG-minted IRIs, that has become the convention with Darwin Core and Audubon Core, and I've taken that as the precedent to be followed when creating other IRIs that denote resources associated with TDWG standards.  The exception to this pattern is the IRIs for the standards themselves.  The precedent there is that TDWG standards have IRIs in the form http://www.tdwg.org/standards/nnn, where "nnn" is a number assigned to a particular standard. 

IRI patterns for vocabulary standards

I used the precedents established by the Darwin Core and Audubon Core standards, together with the URI specification (RFC 3986) itself, to establish IRI patterns that are consistent with the hierarchy established by the SDS.  Section 1.2.3 of RFC 3986 notes that a forward slash is used to "delimit components that are significant to the generic parser's hierarchical interpretation of an identifier" and the IRIs of components of vocabularies can be interpreted this way.

Here are the patterns I established or continued based on past practice:

Standards IRI:

http://www.tdwg.org/standards/nnn
where "nnn" consists of numeric characters assigned to the standard.  Dereferencing these IRIs should lead the user to the landing page of the standard.  Example of the Darwin Core standard:

Note that since these IRIs aren't within the rs.tdwg.org subdomain, the test system I've implemented does not handle their dereferencing.  Standards IRI dereferencing is handled by a separate system and I don’t know how fully functional it is for all prior TDWG standards.

Vocabulary IRI:

http://rs.tdwg.org/vvv/
where "vvv" is a sequence of alphabetic characters assigned to the vocabulary.  Example of the Darwin Core basic vocabulary:
http://rs.tdwg.org/dwc/

Term list IRI:

http://rs.tdwg.org/vvv/ttt/
where "vvv" is a sequence of alphabetic characters assigned to the vocabulary and "ttt" is a sequence of alphabetic characters assigned to the term list within that vocabulary.  Example of the Darwin Core IRI-valued terms:
http://rs.tdwg.org/dwc/iri/

Term IRI:

http://rs.tdwg.org/vvv/ttt/nnn
where "vvv" is a sequence of alphabetic characters assigned to the vocabulary, "ttt" is a sequence of alphabetic characters assigned to the term list within that vocabulary, and "nnn" is the local name of the term.  Example of the "in described place" term:

The term pattern described above is backward compatible with all current Darwin Core and Audubon Core term IRIs.  Existing Darwin Core RDF/XML asserts relationships between terms and the resource that defines them like this:

http://rs.tdwg.org/dwc/terms/dateIdentified rdfs:isDefinedBy http://rs.tdwg.org/dwc/terms/ .

So the IRI pattern for term lists is also backwards compatible with this previous use, with the name "term list" now explicitly given to the resource that defines terms. 

The IRI pattern for vocabularies is new, but is consistent with the hierarchy and is necessary to distinguish between vocabularies and the standards that create them. 
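To show how these slash-delimited patterns lend themselves to pattern matching, here is a minimal sketch that classifies an IRI by its level in the hierarchy.  It is not the code that runs on rs.tdwg.org; the regular expressions, the function names classify and term_list_of, and the "unknown" fallback are all assumptions made for illustration.  The term list segment is allowed to contain digits because the borrowed-term list http://rs.tdwg.org/ac/Iptc4xmpExt/ listed at the end of this post does.

```python
import re

RS = r"^http://rs\.tdwg\.org/"

# Ordered from the top of the hierarchy down; each regex mirrors one pattern above.
PATTERNS = [
    ("standard", re.compile(r"^http://www\.tdwg\.org/standards/[0-9]+$")),
    ("vocabulary", re.compile(RS + r"[A-Za-z]+/$")),
    ("term list", re.compile(RS + r"[A-Za-z]+/[A-Za-z0-9]+/$")),
    ("term", re.compile(RS + r"[A-Za-z]+/[A-Za-z0-9]+/[^/]+$")),
]


def classify(iri: str) -> str:
    """Return the hierarchy level of a current-resource IRI, or 'unknown'."""
    # Version IRIs contain a "version" path segment; this sketch does not handle them.
    if "/version/" in iri:
        return "unknown"
    for level, pattern in PATTERNS:
        if pattern.match(iri):
            return level
    return "unknown"


def term_list_of(term_iri: str) -> str:
    """Drop the local name to get the IRI of the term list that defines the term."""
    return term_iri.rsplit("/", 1)[0] + "/"


for example in [
    "http://rs.tdwg.org/dwc/",
    "http://rs.tdwg.org/dwc/iri/",
    "http://rs.tdwg.org/dwc/terms/dateIdentified",
]:
    print(example, "->", classify(example))

print(term_list_of("http://rs.tdwg.org/dwc/terms/dateIdentified"))
```

The term_list_of helper reproduces the rdfs:isDefinedBy relationship shown above: removing the local name from a term IRI yields the IRI of its term list.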


IRI pattern for documents


Previously, there had been no consistent pattern for IRIs assigned to documents associated with standards.  Here are some examples of IRIs for Darwin Core documents:

The Darwin Core XML guide: http://rs.tdwg.org/dwc/terms/guides/xml/
The Darwin Core simple text guide: http://rs.tdwg.org/dwc/terms/simple/

To maintain backwards compatibility, these pre-existing IRIs were left unchanged.  However, the IRI patterns used for Darwin Core documents make it difficult to distinguish programmatically between term and document IRIs using pattern matching.  So for all documents from standards other than Darwin Core, I used this pattern:

http://rs.tdwg.org/sss/doc/docname/

where "sss" is a sequence of alphabetic characters representing the standard and "docname" is a short series of alphabetic characters representing the document.  For example:


http://rs.tdwg.org/ac/doc/structure/

is the IRI for the Audubon Core Structure document. 
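The fixed "doc" segment is what makes document IRIs easy to build and to recognize.  The following sketch shows one way to do both; the names DOC_PATTERN, document_iri, and parse_document_iri are hypothetical and chosen only for this example.

```python
import re
from typing import Optional, Tuple

# "sss" and "docname" are both treated as purely alphabetic, as described above.
DOC_PATTERN = re.compile(r"^http://rs\.tdwg\.org/([A-Za-z]+)/doc/([A-Za-z]+)/$")


def document_iri(standard: str, docname: str) -> str:
    """Build a document IRI from the standard abbreviation and the document name."""
    return f"http://rs.tdwg.org/{standard}/doc/{docname}/"


def parse_document_iri(iri: str) -> Optional[Tuple[str, str]]:
    """Return (standard, docname) if the IRI follows the document pattern, else None."""
    match = DOC_PATTERN.match(iri)
    return (match.group(1), match.group(2)) if match else None


print(document_iri("ac", "structure"))
print(parse_document_iri("http://rs.tdwg.org/sds/doc/specification/"))
print(parse_document_iri("http://rs.tdwg.org/dwc/terms/simple/"))
```

The last call returns None because the pre-existing Darwin Core document IRIs do not follow the pattern, so a server would have to recognize them some other way, such as an explicit list.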

Redirection

One thing that should be made clear is the distinction between the IRI that identifies a resource and the URL that actually can be used to retrieve a document or metadata about some other resource.  Because the SDS considers the resources it describes as abstract entities, those entities can have multiple formats or serializations that are distinct from the abstract resources themselves.  For example, the Audubon Core Structure document is an abstract thing identified by http://rs.tdwg.org/ac/doc/structure/ .  However, the HTML serialization of that document can currently be retrieved from the URL https://tdwg.github.io/ac/structure/ and in the future that document might be made available at different URLs in other formats such as PDF.  It is required that the IRI of the abstract resource be stable and unchanged, but there is no requirement that the retrieval URL for a serialization stay the same over time.  Thus it's important that citations and bookmarks be set to the permanent IRI of the resource, and that redirection from the permanent IRI to the retrieval URL be maintained so that people can actually acquire a copy of the resource using a browser. 

In the past, obscure, deprecated Darwin Core terms simply didn't dereference.  In the test system, they redirect programmatically to a URL that is the term IRI plus ".htm".  Here's an example:


redirects to


The document that is retrieved is an HTML, human-readable description of the term. 

Historically, current Darwin Core terms redirected to the Darwin Core Quick Reference page and that behavior has been maintained in the test system.  Here's an example:


redirects to


The same is true with Audubon Core terms, whose IRIs redirect to an appropriate place on the Audubon Core Term List document.  The URLs of both the Audubon Core Term List page and Darwin Core Quick Reference page have changed recently, reinforcing the importance of citing the actual term IRIs rather than the redirected URLs.
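The separation between a permanent IRI and its retrieval URL can be sketched as a lookup followed by a fallback rule.  This is an assumption-laden illustration, not the server's actual logic: the table name MANUAL_REDIRECTS and the function retrieval_url are hypothetical, the only table entry is the Audubon Core Structure document mentioned above, and the fallback applies the ".htm" rule described for deprecated terms to any IRI without a trailing slash.  Current Darwin Core and Audubon Core terms would need their own table entries, which are omitted here because their target URLs are not given in this post.

```python
from typing import Optional

# Hand-maintained redirects for resources whose human-readable form lives elsewhere.
MANUAL_REDIRECTS = {
    "http://rs.tdwg.org/ac/doc/structure/": "https://tdwg.github.io/ac/structure/",
}


def retrieval_url(iri: str) -> Optional[str]:
    """Map a permanent IRI to a URL from which a human-readable page could be fetched."""
    if iri in MANUAL_REDIRECTS:
        return MANUAL_REDIRECTS[iri]
    if not iri.endswith("/"):
        # Rule described for deprecated terms: the term IRI plus ".htm".
        return iri + ".htm"
    # No rule is known in this sketch for other slash-terminated IRIs.
    return None


print(retrieval_url("http://rs.tdwg.org/ac/doc/structure/"))
print(retrieval_url("http://rs.tdwg.org/dwc/dwctype/MachineObservation"))
```

Because only the table and the fallback rule know about retrieval URLs, a page can move without any change to the IRI that people cite.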

TDWG Standards Documentation Specification version model (from Section 2.3)

Versions

Taking cues from Dublin Core and the W3C, the SDS describes a version model that can be used to track versions of resources associated with TDWG standards.  For example, dereferencing the Darwin Core vocabulary IRI http://rs.tdwg.org/dwc/ shows that there are 19 versions: 18 previous versions and a most recent version that corresponds to the current Darwin Core vocabulary. 

For vocabularies and term lists, the version IRIs are constructed by appending an ISO 8601 date after the final slash and inserting "version/" before the terminal string.  For example, the current Darwin Core vocabulary IRI is http://rs.tdwg.org/dwc/ and a version of the Darwin Core vocabulary is http://rs.tdwg.org/version/dwc/2015-03-27 .  The current Darwin Core IRI-valued term list IRI is http://rs.tdwg.org/dwc/iri/ and a version of it is http://rs.tdwg.org/dwc/version/iri/2015-03-27 .  (Although it wouldn't be necessary to include the characters "version/" in the version IRI, doing so makes pattern recognition for those IRIs much simpler.) 

Following the precedent already set for Darwin Core, term version IRIs are formed by appending an ISO 8601 date with a dash.  Again "version/" is inserted ahead of the local name to make IRI pattern recognition easier.  For example, the term IRI http://rs.tdwg.org/dwc/terms/establishmentMeans has a version http://rs.tdwg.org/dwc/terms/version/establishmentMeans-2009-04-24 .

For documents, the version IRI is formed by simply appending the ISO 8601 date after the trailing slash.  (In the case of documents, IRI pattern recognition is less critical since there aren't hierarchical levels below the level of the document. So "version/" isn't inserted in the version IRI.)  For example, the document http://rs.tdwg.org/sds/doc/specification/ has a version http://rs.tdwg.org/sds/doc/specification/2007-11-05 . 
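The three version IRI rules above are simple enough to express as string operations.  The sketch below is one possible rendering, checked against the examples in this section; the function names are hypothetical and the code is not taken from the system that serves rs.tdwg.org.

```python
from datetime import date


def _insert_version(iri: str) -> str:
    """Insert 'version/' ahead of the last path segment, keeping any trailing slash."""
    trailing = "/" if iri.endswith("/") else ""
    head, last = iri.rstrip("/").rsplit("/", 1)
    return f"{head}/version/{last}{trailing}"


def vocabulary_or_term_list_version(iri: str, issued: date) -> str:
    """Vocabularies and term lists: insert 'version/', then append the date after the slash."""
    return _insert_version(iri) + issued.isoformat()


def term_version(iri: str, issued: date) -> str:
    """Terms: insert 'version/' ahead of the local name, then append the date with a dash."""
    return f"{_insert_version(iri)}-{issued.isoformat()}"


def document_version(iri: str, issued: date) -> str:
    """Documents: append the date after the trailing slash, with no 'version/' segment."""
    return iri + issued.isoformat()


print(vocabulary_or_term_list_version("http://rs.tdwg.org/dwc/", date(2015, 3, 27)))
print(vocabulary_or_term_list_version("http://rs.tdwg.org/dwc/iri/", date(2015, 3, 27)))
print(term_version("http://rs.tdwg.org/dwc/terms/establishmentMeans", date(2009, 4, 24)))
print(document_version("http://rs.tdwg.org/sds/doc/specification/", date(2007, 11, 5)))
```

Using date.isoformat() keeps the appended date in the ISO 8601 YYYY-MM-DD form that the version IRIs use.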

In the case of non-document resources, resolution of version IRIs is fully implemented, since human-readable pages can be constructed programmatically for those resources using data from the metadata database.  However, since the human-readable versions of standards documents are generally created manually and have idiosyncratic redirection IRIs, version IRI resolution is currently only partially implemented.  In the case of many standards documents, the location of previous versions is not known or they are not yet available online.  So for now, one can't explore older versions of standards documents in the same way one can explore older versions of vocabularies, term lists, and terms.

Summary

I've implemented a system of IRIs that are consistent with the SDS and past practice of Darwin and Audubon Cores.  Although the patterns I established aren't the only possible ones, they work well for facilitating pattern matching by a server that generates many of the documents programmatically, so I feel that the pattern system is sound.

Here are some starting points for exploration:

Audubon Core basic vocabulary:

Darwin Core basic vocabulary:

From these two vocabulary pages you can surf to term lists, terms, and older versions of all of the resources.

Terms borrowed by Audubon Core from the IPTC Photo Metadata Extension:
http://rs.tdwg.org/ac/Iptc4xmpExt/

The October 16, 2011 version of the Darwin Core vocabulary:
http://rs.tdwg.org/version/dwc/2011-10-16

The April 24, 2009 version of the list of core Darwin Core terms:
http://rs.tdwg.org/dwc/version/terms/2009-04-24

The September 11, 2009 version of Basis of Record:
http://rs.tdwg.org/dwc/terms/version/basisOfRecord-2009-09-11

A deprecated Darwin Core term list:
http://rs.tdwg.org/dwc/curatorial/

A deprecated Darwin Core term:
http://rs.tdwg.org/dwc/dwctype/MachineObservation

Here are some examples of document IRIs that redirect:




In the next post, I'll describe how the system I've implemented allows retrieval of machine-readable metadata.
