Steve Baskauf's blog: March 2019

Sunday, March 10, 2019

Understanding the TDWG Standards Documentation Specification, Part 2: Hierarchy Model and Implementation of IRIs

This is the second in a series of posts about the TDWG Standards Documentation Specification (SDS). For background on the SDS, see the first post.

Note: this post was revised on 2020-03-04 when IRI dereferencing of the http://rs.tdwg.org/ subdomain went from testing into production.

Implementation plan?

The SDS was ratified and issued in April of 2017. It did not, however, include any plan for its implementation. It wasn't actually clear whose responsibility it was to make implementation of the SDS happen. The Technical Architecture Group (TAG) might have been a logical group to take charge, but in 2017 it had not yet been reconstituted in its current form. As the architect of the SDS, I had a vested interest in seeing it become functional, so I decided to take the initiative to figure out how it could be implemented. As I worked on this project, I got feedback from the Darwin Core Maintenance Group, key people working on the TDWG website and other infrastructure, and later from some TAG members.

Although the SDS provided a general framework, it left a lot of the details to implementers. In particular, the SDS had relatively little to say about the form of URIs used as identifiers for documents whose form was specified by the SDS. For guidance, I looked to precedents set by Darwin Core, general practices in the Linked Data world, and practicalities of URI dereferencing.

The SDS model

The SDS describes a hierarchical model for resources within its scope. That hierarchy is relatively simple for documents within a standard: there is simply a hasPart/isPartOf relationship between the standard and its documents.

For vocabularies, the situation is more complicated. The SDS describes four levels in the hierarchy that applies to vocabularies: standard, vocabulary, term list, and term. There was some discussion in the run-up to ratification of the SDS as to whether the model needed to be this complicated. At that time, I asserted that this was the least complicated model that could accomplish all of the things that people said they wanted to do with vocabularies in TDWG.

It would be tempting to say that a much simpler model might be possible. For example, we could consider the Audubon Core Standard to be synonymous with the Audubon Core vocabulary. We could say that Audubon Core terms were a direct part of it -- a simple two-level hierarchy.

However, the Audubon Core Standard is more than just a set of terms. The Audubon Core vocabulary is distinct from the documents that describe how Audubon Core should be used (the structure document, term list document, etc.), which are also part of the standard. Although we might lump the standard, vocabulary, and documents together in our human minds, if we really aspire to have machine-readable descriptions of components of TDWG standards, we have to distinguish between things that are not the same -- things that have different authors, creation dates, and version histories.

Example of first (standards) and second (vocabularies and documents) levels of the TDWG Standards Documentation Specification hierarchy

As I described in the previous post, there was also a desire expressed in the community for the capability to have more than one "Darwin Core vocabulary". Some people might want only the basic vocabulary (a "bag of terms" with definitions). Others might want a more complicated vocabulary where some terms might be declared to be equivalent to terms outside of Darwin Core, or classes might be declared to be subclasses of classes in an outside ontology. Still others might want to create a Darwin Core vocabulary that restricts the values that can be used for certain terms, or entails class membership through range and domain declarations. So although we don't currently have more than one Darwin Core vocabulary, we want to allow for that possibility in the future. That's another reason to have a model that separates the standard from the vocabulary or vocabularies that it defines.

Example of second (vocabularies) and third (term lists) levels of the TDWG Standards Documentation Specification hierarchy

Within a vocabulary, the SDS describes an entity called a "term list" (Sections 3.3.3 and 4.4.2).

Example of third (term list) and fourth (term) levels of the TDWG Standards Documentation Specification hierarchy. This is an example of a list of terms defined by TDWG and only includes a few of the terms on the list.

For terms defined by a TDWG vocabulary, there is an authoritative term list for each namespace. For example, there is an authoritative term list for the dwc: namespace and another for the dwciri: namespace. These lists are considered authoritative because they define the terms they contain. Dereferencing a term list IRI should return the term list document.

Example of third (term list) and fourth (term) levels of the TDWG Standards Documentation Specification hierarchy. This is an example of a list of terms borrowed by TDWG and only includes a few of the terms on the list.

A term list can also contain terms that are borrowed from another vocabulary and included in the TDWG vocabulary. The SDS does not prescribe how borrowed terms should be organized in term lists -- for example, whether all borrowed terms should be included in a single list or whether there should be a separate term list for each namespace from which terms are borrowed. As a practical matter, it made sense to create a separate term list for each namespace.
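The four-level hierarchy described above can be sketched as a small data structure. This is only an illustration: the class names and the `part_of_pairs` helper are hypothetical and are not defined by the SDS, and treating every level as "part of" the level above it is a simplification. The IRIs are ones that appear elsewhere in this post.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Term:
    iri: str


@dataclass
class TermList:
    iri: str
    terms: List[Term] = field(default_factory=list)


@dataclass
class Vocabulary:
    iri: str
    term_lists: List[TermList] = field(default_factory=list)


@dataclass
class Document:
    iri: str


@dataclass
class Standard:
    iri: str
    vocabularies: List[Vocabulary] = field(default_factory=list)
    documents: List[Document] = field(default_factory=list)


def part_of_pairs(standard):
    """Return (part IRI, whole IRI) pairs for every level of the hierarchy."""
    pairs = []
    for doc in standard.documents:
        pairs.append((doc.iri, standard.iri))
    for vocab in standard.vocabularies:
        pairs.append((vocab.iri, standard.iri))
        for term_list in vocab.term_lists:
            pairs.append((term_list.iri, vocab.iri))
            for term in term_list.terms:
                pairs.append((term.iri, term_list.iri))
    return pairs


# A tiny slice of Darwin Core: one document, one vocabulary, two term lists.
darwin_core = Standard(
    iri="http://www.tdwg.org/standards/450",
    documents=[Document("http://rs.tdwg.org/dwc/terms/guides/xml/")],
    vocabularies=[
        Vocabulary(
            iri="http://rs.tdwg.org/dwc/",
            term_lists=[
                TermList(
                    iri="http://rs.tdwg.org/dwc/terms/",
                    terms=[Term("http://rs.tdwg.org/dwc/terms/institutionCode")],
                ),
                TermList(
                    iri="http://rs.tdwg.org/dwc/iri/",
                    terms=[Term("http://rs.tdwg.org/dwc/iri/inDescribedPlace")],
                ),
            ],
        )
    ],
)

for part, whole in part_of_pairs(darwin_core):
    print(part, "is part of", whole)
```

Keeping the standard, vocabulary, term list, and term as separate objects is what lets each one carry its own metadata, such as authors, creation dates, and version history.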

Some notes about IRIs

According to the SDS, each resource in the hierarchy should be assigned an IRI as an identifier (Section 2.1.1). IRIs are a superset of URIs that also allow characters outside the ASCII range, such as those of non-Latin scripts. For the purposes of this post, you can consider URIs and IRIs to be synonymous.

There has always been confusion between the use of IRIs/URIs as identifiers and URLs as resource locators. Fundamentally, an IRI is an identifier that may or may not actually dereference in a web browser to retrieve a web page about the resource. In the Linked Data community, it is considered a best practice for IRIs to dereference, but it isn't a requirement. In fact, there are a number of "borrowed" term IRIs in Audubon Core that don't dereference and probably never will. So although it isn't a requirement of the SDS that TDWG IRIs dereference, one goal of implementation is to eventually make that happen.

The origin of the subdomain rs.tdwg.org has always been a little mysterious to me. I believe that the "rs" part stands for "schema repository" and that it was originally intended to be a place from which XML and other schemas could be retrieved. Although I don't think there is any official policy that requires use of the rs.tdwg.org subdomain for TDWG-minted IRIs, that has become the convention with Darwin Core and Audubon Core and I've taken that as the precedent to be followed when creating other IRIs that denote resources associated with TDWG standards. The exceptions to this pattern are the IRIs for the standards themselves. The precedent there is that TDWG standards have IRIs in the form http://www.tdwg.org/standards/nnn, where "nnn" is a number assigned to a particular standard.

IRI patterns for vocabulary standards

I used the precedents established by the Darwin and Audubon Core standards, together with the URI specification (RFC 3986) itself to establish IRI patterns that are consistent with the hierarchy established by the SDS. Section 1.2.3 of RFC 3986 notes that a forward slash is used to "delimit components that are significant to the generic parser's hierarchical interpretation of an identifier" and the IRIs of components of vocabularies can be interpreted this way.

Here are the patterns I established or continued based on past practice:

Standards IRI:

http://www.tdwg.org/standards/nnn

where "nnn" consists of numeric characters assigned to the standard. Dereferencing these IRIs should lead the user to the landing page of the standard. Example of the Darwin Core standard:

http://www.tdwg.org/standards/450

Note that since these IRIs aren't within the rs.tdwg.org subdomain, the test system I've implemented does not handle their dereferencing. Standards IRI dereferencing is handled by a separate system and I don’t know how fully functional it is for all prior TDWG standards.

Vocabulary IRI:

http://rs.tdwg.org/vvv/

where "vvv" is a sequence of alphabetic characters assigned to the vocabulary. Example of the Darwin Core basic vocabulary:

http://rs.tdwg.org/dwc/

Term list IRI:

http://rs.tdwg.org/vvv/ttt/

where "vvv" is a sequence of alphabetic characters assigned to the vocabulary and "ttt" is a sequence of alphabetic characters assigned to the term list within that vocabulary. Example of the Darwin Core IRI-valued terms:

http://rs.tdwg.org/dwc/iri/

Term IRI:

http://rs.tdwg.org/vvv/ttt/nnn

where "vvv" is a sequence of alphabetic characters assigned to the vocabulary, "ttt" is a sequence of alphabetic characters assigned to the term list within that vocabulary, and "nnn" is the local name of the term. Example of the "in described place" term:

http://rs.tdwg.org/dwc/iri/inDescribedPlace

The term pattern described above is backward compatible with all current Darwin Core and Audubon Core term IRIs. Existing Darwin Core RDF/XML asserts relationships between terms and the resource that defines them like this:

http://rs.tdwg.org/dwc/terms/dateIdentified rdfs:isDefinedBy http://rs.tdwg.org/dwc/terms/ .

So the IRI pattern for term lists is also backwards compatible with this previous use, with the name "term list" now explicitly given to the resource that defines terms.

The IRI pattern for vocabularies is new, but is consistent with the hierarchy and is necessary to distinguish between vocabularies and the standards that create them.
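The patterns above lend themselves to simple pattern matching. Here is a minimal sketch, assuming regular expressions written directly from the patterns in this post; it is not the code that runs on the server. The function names `classify_iri` and `defining_term_list` are hypothetical. Note that the segment pattern allows digits after the first letter, because some borrowed term list names (such as "Iptc4xmpExt") contain them, and that the sketch ignores document and version IRIs.

```python
import re

# One path segment: letters, optionally followed by letters or digits.
_SEGMENT = r"[A-Za-z][A-Za-z0-9]*"

# Checked in order, from the top of the hierarchy to the bottom.
_PATTERNS = [
    ("standard", re.compile(r"http://www\.tdwg\.org/standards/[0-9]+")),
    ("vocabulary", re.compile(rf"http://rs\.tdwg\.org/{_SEGMENT}/")),
    ("term list", re.compile(rf"http://rs\.tdwg\.org/{_SEGMENT}/{_SEGMENT}/")),
    ("term", re.compile(rf"http://rs\.tdwg\.org/{_SEGMENT}/{_SEGMENT}/[A-Za-z0-9_]+")),
]


def classify_iri(iri):
    """Return the hierarchy level an IRI appears to identify, or None."""
    for level, pattern in _PATTERNS:
        if pattern.fullmatch(iri):
            return level
    return None


def defining_term_list(term_iri):
    """Drop the local name from a term IRI to get its term list IRI."""
    head, _, _local_name = term_iri.rpartition("/")
    return head + "/"


print(classify_iri("http://rs.tdwg.org/dwc/iri/inDescribedPlace"))
print(defining_term_list("http://rs.tdwg.org/dwc/terms/dateIdentified"))
```

The second function mirrors the rdfs:isDefinedBy relationship shown above: the term list IRI is just the term IRI with the local name removed.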

IRI pattern for documents

Previously, there had been no consistent pattern for IRIs assigned to documents associated with standards. Here are some examples of IRIs for Darwin Core documents:

The Darwin Core XML guide: http://rs.tdwg.org/dwc/terms/guides/xml/

The Darwin Core simple text guide: http://rs.tdwg.org/dwc/terms/simple/

To maintain backwards compatibility, these pre-existing IRIs were left unchanged. However, the IRI patterns used for Darwin Core documents make it difficult to distinguish programmatically between term and document IRIs using pattern matching. So for all documents from standards other than Darwin Core, I used this pattern:

http://rs.tdwg.org/sss/doc/docname/

where "sss" is a sequence of alphabetic characters representing the standard and "docname" is a short series of alphabetic characters representing the document. For example:

http://rs.tdwg.org/ac/doc/structure/

is the IRI for the Audubon Core Structure document.
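To illustrate why the "doc" pattern makes programmatic recognition easier, here is a minimal sketch of a document IRI check. It assumes that the legacy Darwin Core document IRIs have to be listed explicitly because they don't follow the pattern; the set below contains only the two examples given in this post and is not a complete list. The names `DOCUMENT_PATTERN`, `LEGACY_DWC_DOCUMENTS`, and `is_document_iri` are hypothetical.

```python
import re

# New-style document IRIs: http://rs.tdwg.org/sss/doc/docname/
DOCUMENT_PATTERN = re.compile(r"http://rs\.tdwg\.org/([A-Za-z]+)/doc/([A-Za-z]+)/")

# Pre-existing Darwin Core document IRIs that don't fit the pattern.
# Only the two examples from this post are included here.
LEGACY_DWC_DOCUMENTS = {
    "http://rs.tdwg.org/dwc/terms/guides/xml/",
    "http://rs.tdwg.org/dwc/terms/simple/",
}


def is_document_iri(iri):
    """Return True if the IRI identifies a (current) standards document."""
    if iri in LEGACY_DWC_DOCUMENTS:
        return True
    return DOCUMENT_PATTERN.fullmatch(iri) is not None


print(is_document_iri("http://rs.tdwg.org/ac/doc/structure/"))
print(is_document_iri("http://rs.tdwg.org/dwc/terms/institutionCode"))
```

With the new pattern a single expression is enough, while every legacy Darwin Core document needs its own entry in a lookup table.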

Redirection

One thing that should be made clear is the distinction between the IRI that identifies a resource and the URL that actually can be used to retrieve a document or metadata about some other resource. Because the SDS considers the resources it describes as abstract entities, those entities can have multiple formats or serializations that are distinct from the abstract resources themselves. For example, the Audubon Core Structure document is an abstract thing identified by http://rs.tdwg.org/ac/doc/structure/ . However, the HTML serialization of that document can currently be retrieved from the URL https://tdwg.github.io/ac/structure/ and in the future that document might be made available at different URLs in other formats such as PDF. It is required that the IRI of the abstract resource be stable and unchanged, but there is no requirement that the retrieval URL for a serialization stay the same over time. Thus it's important that citations and bookmarks be set to the permanent IRI of the resource, and that redirection from the permanent IRI to the retrieval URL be maintained so that people can actually acquire a copy of the resource using a browser.

In the past, obscure, deprecated Darwin Core terms simply didn't dereference. In the test system, they redirect programmatically to a URL that is the term IRI plus ".htm". Here's an example:

http://rs.tdwg.org/dwc/curatorial/CollectorNumber

redirects to

http://rs.tdwg.org/dwc/curatorial/CollectorNumber.htm

The document that is retrieved is a human-readable HTML description of the term.

Historically, current Darwin Core terms redirected to the Darwin Core Quick Reference page and that behavior has been maintained in the test system. Here's an example:

http://rs.tdwg.org/dwc/terms/institutionCode

redirects to

https://dwc.tdwg.org/terms/#dwc:institutionCode

The same is true with Audubon Core terms, whose IRIs redirect to an appropriate place on the Audubon Core Term List document. The URLs of both the Audubon Core Term List page and Darwin Core Quick Reference page have changed recently, reinforcing the importance of citing the actual term IRIs rather than the redirected URLs.
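The two Darwin Core redirection rules described above can be sketched as simple string manipulations. This is an illustration generalized from the two examples in this post, not the server's actual logic. The function names are hypothetical, the sketch does not decide whether a term is deprecated (the caller has to know that), and the Quick Reference URL is the one in use at the time of writing, which may change again.

```python
DWC_TERMS_NAMESPACE = "http://rs.tdwg.org/dwc/terms/"
# Retrieval URL of the Quick Reference page at the time of writing.
QUICK_REFERENCE_URL = "https://dwc.tdwg.org/terms/"


def deprecated_term_url(term_iri):
    """Deprecated terms redirect to the term IRI plus '.htm'."""
    return term_iri + ".htm"


def quick_reference_url(term_iri):
    """Current dwc: terms redirect to an anchor on the Quick Reference page."""
    if not term_iri.startswith(DWC_TERMS_NAMESPACE):
        raise ValueError("not a term in the dwc: namespace: " + term_iri)
    local_name = term_iri[len(DWC_TERMS_NAMESPACE):]
    return f"{QUICK_REFERENCE_URL}#dwc:{local_name}"


print(deprecated_term_url("http://rs.tdwg.org/dwc/curatorial/CollectorNumber"))
print(quick_reference_url("http://rs.tdwg.org/dwc/terms/institutionCode"))
```

Only the constants at the top would need to change if the Quick Reference page moved, which is the point of citing the stable term IRI rather than the retrieval URL.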

TDWG Standards Documentation Specification version model (from Section 2.3)

Versions

Taking cues from Dublin Core and the W3C, the SDS describes a version model that can be used to track versions of resources associated with TDWG standards. For example, dereferencing the Darwin Core vocabulary IRI http://rs.tdwg.org/dwc/ shows that there are 19 versions: 18 previous versions and a most recent version that corresponds to the current Darwin Core vocabulary.

For vocabularies and term lists, the version IRIs are constructed by appending an ISO 8601 date after the final slash and inserting "version/" before the last path segment. For example, the current Darwin Core vocabulary IRI is http://rs.tdwg.org/dwc/ and a version of the Darwin Core vocabulary is http://rs.tdwg.org/version/dwc/2015-03-27 . The current Darwin Core IRI-valued term list IRI is http://rs.tdwg.org/dwc/iri/ and a version of it is http://rs.tdwg.org/dwc/version/iri/2015-03-27 . (Although it wouldn't be necessary to include the characters "version/" in the version IRI, doing so makes pattern recognition for those IRIs much simpler.)

Following the precedent already set for Darwin Core, term version IRIs are formed by appending an ISO 8601 date with a dash. Again, "version/" is inserted ahead of the local name to make IRI pattern recognition easier. For example, the term IRI http://rs.tdwg.org/dwc/terms/establishmentMeans has a version http://rs.tdwg.org/dwc/terms/version/establishmentMeans-2009-04-24 .

For documents, the version IRI is formed by simply appending the ISO 8601 date after the trailing slash. (In the case of documents, IRI pattern recognition is less critical since there aren't hierarchical levels below the level of the document. So "version/" isn't inserted in the version IRI.) For example, the document http://rs.tdwg.org/sds/doc/specification/ has a version http://rs.tdwg.org/sds/doc/specification/2007-11-05 .
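The three version IRI rules above can be sketched as follows. This is a minimal illustration written from the examples in this post, not the code used by the server, and the function names are hypothetical.

```python
from datetime import date


def _insert_version(iri):
    """Insert 'version/' ahead of the last path segment of an IRI."""
    head, _, last = iri.rstrip("/").rpartition("/")
    return f"{head}/version/{last}"


def list_version_iri(iri, issued):
    """Version IRI for a vocabulary or term list (IRI ends with a slash)."""
    return f"{_insert_version(iri)}/{issued.isoformat()}"


def term_version_iri(iri, issued):
    """Version IRI for a term: the date is joined to the local name by a dash."""
    return f"{_insert_version(iri)}-{issued.isoformat()}"


def document_version_iri(iri, issued):
    """Version IRI for a document: the date simply follows the trailing slash."""
    return f"{iri}{issued.isoformat()}"


print(list_version_iri("http://rs.tdwg.org/dwc/", date(2015, 3, 27)))
print(list_version_iri("http://rs.tdwg.org/dwc/iri/", date(2015, 3, 27)))
print(term_version_iri("http://rs.tdwg.org/dwc/terms/establishmentMeans", date(2009, 4, 24)))
print(document_version_iri("http://rs.tdwg.org/sds/doc/specification/", date(2007, 11, 5)))
```

The four calls reproduce the four example version IRIs given in this section.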

In the case of non-document resources, resolution of version IRIs is fully implemented, since human-readable pages can be constructed programmatically for those resources using data from the metadata database. However, since the human-readable versions of standards documents are generally created manually and have idiosyncratic redirection IRIs, version IRI resolution is currently only partially implemented. In the case of many standards documents, the location of previous versions is not known or they are not yet available online. So for now, one can't explore older versions of standards documents in the same way one can explore older versions of vocabularies, term lists, and terms.

Summary

I've implemented a system of IRIs that are consistent with the SDS and past practice of Darwin and Audubon Cores. Although the patterns I established aren't the only possible ones, they work well for facilitating pattern matching by a server that generates many of the documents programmatically, so I feel that the pattern system is sound.

Here are some starting points for exploration:

Audubon Core basic vocabulary:

http://rs.tdwg.org/ac/

Darwin Core basic vocabulary:

http://rs.tdwg.org/dwc/

From these two vocabulary pages you can surf to term lists, terms, and older versions of all of the resources.

Terms borrowed by Audubon Core from the IPTC Photo Metadata Extension:
http://rs.tdwg.org/ac/Iptc4xmpExt/

The October 16, 2011 version of the Darwin Core vocabulary:
http://rs.tdwg.org/version/dwc/2011-10-16

The April 24, 2009 version of the list of core Darwin Core terms:
http://rs.tdwg.org/dwc/version/terms/2009-04-24

The September 11, 2009 version of Basis of Record:
http://rs.tdwg.org/dwc/terms/version/basisOfRecord-2009-09-11

A deprecated Darwin Core term list:
http://rs.tdwg.org/dwc/curatorial/

A deprecated Darwin Core term:
http://rs.tdwg.org/dwc/dwctype/MachineObservation

Here are some examples of document IRIs that redirect:

http://rs.tdwg.org/ac/doc/introduction/

http://rs.tdwg.org/tapir/doc/xmlschema/

http://rs.tdwg.org/apn/doc/data/

In the next post, I'll describe how the system I've implemented allows retrieval of machine-readable metadata.

Sunday, March 3, 2019

Understanding the TDWG Standards Documentation Specification Part 1: Background

This is the first in a series of posts about the TDWG Standards Documentation Specification (SDS), with special reference to how its implementation enables machine access to information about TDWG standards. In particular, the SDS makes it possible to acquire all available information about TDWG vocabularies, including all historical versions of terms. In this post, I'm going to describe the genesis of the SDS and how the practical experience of the TDWG community influenced the ultimate state of the specification.

Historical background

The original draft of the SDS was written in 2007 by Roger Hyam as part of the effort to modernize the TDWG standards development process. The original draft was focused on how human-readable documents should be formatted. The SDS remained in draft form for several years and during that time, new standards documents generally reflected the directives of that draft.

In 2013, a Vocabulary Management Task Group examined the status of the old TDWG Ontology and the experience of the community with the term change section of the Darwin Core Namespace Policy. The task group recommended that a new SDS be written with guidelines for the formatting of both human- and computer-readable documents, and that the Darwin Core Namespace Policy be used as the starting point for writing a specification describing how vocabularies should be maintained.

In 2014, I was asked to lead a task group to revise the SDS and to move it forward to the status of ratified standard. One advantage of returning to work on the specification after seven years had elapsed was that we had the benefit of experience from work with the Darwin Core standard and had learned several important lessons from that. Some of those lessons were about weaknesses in policies related to standards documents and some were process-oriented. Because of the interrelation between the documentation of standards and the processes of their development and maintenance, the parallel development of both the SDS and the Vocabulary Maintenance Specification (VMS) by the task group allowed the two specifications to be developed in a complementary fashion.

One of the key problems with the state of TDWG standards documents was that it was difficult to know which documents associated with a complex standard like Darwin Core were actually part of the standard, and which documents were ancillary documents that provided useful information about the standard, but that were not actually part of the standard. That distinction was important because changes to documents within a standard should be subject to a potentially rigorous process of review, while documents outside the standard could be changed at will. There was a similar problem with the idea expressed in the original SDS draft that certain documents that were part of a standard should be considered normative, while other documents that were part of the standard were not normative. If the status of "normative" were bestowed on an entire document, what did that mean for parts of that document such as examples, or mutable URLs? Did changing an example or URL require invoking a standards review process or could they just be changed or corrected at will?

To make matters worse, the final designation of documents that were considered authoritatively to be part of the standard was determined by which documents were included in a .zip file that was uploaded to the OJS instance that was managing the standards adoption process at the time. That made it virtually impossible for any layman to actually know whether a particular document was part of a standard or not.

In the Darwin Core Standard at that time, the RDF/XML representation of the vocabulary was designated as the normative document. That presented several problems. One problem was that the XML document was by its nature a machine-readable document, making it difficult for people to read and understand it. Another problem involved the text, XML, and RDF guides that specified how the standard was to be implemented, but that were not considered "the normative document". Clearly, following those guides was required in order to comply with the standard, so shouldn't they be considered in some way normative? The problem was made worse by the fact that the RDF guide document minted an entire category of Darwin Core terms (the dwciri: terms having IRI values), but those terms weren't actually found in the normative RDF/XML document.

Defining the Darwin Core vocabulary as RDF/XML also anchored it in a serialization that was becoming less commonly used. With the ratification of RDF/Turtle and JSON-LD by the W3C as alternative machine-readable serializations, it made less sense to define Darwin Core specifically in RDF/XML.

During the time between the writing of the draft SDS in 2007 and the convening of the task group, there was also significant discussion in the community as to whether Darwin Core should be developed as a full ontology, or whether it should remain a simple "bag of terms" having minimal human-readable definitions. The first option would allow for greater expressiveness, but the second option would allow for the broadest possible use of the vocabulary.

Strategies of the Standards Documentation Specification

The problems outlined above led the task group to create a specification with several key features that addressed those problems.

The ratified SDS threw out the idea that inclusion of a document as part of a standard was determined by presence in a .zip file. Instead, a document is considered part of a standard if it is designated as such. That designation takes place in two ways. First, a human-readable document should state clearly in its header section that it is part of a standard (Section 3.2.3.1), and a machine-readable document should have a dcterms:isPartOf property that links it to a standard (Section 4.2.2). Second, each standard has an official "landing page" that states clearly which documents are parts of the standard (Section 3.1).

The SDS also got rid of the idea that particular documents were normative. Any document that is part of a standard can contain parts that are normative and parts that are not. Each human-readable document will contain a statement in its introduction outlining what parts are normative and what parts are not (Section 3.2.1). This designation can be made by labeling certain parts as normative, or by rules such as "all parts are normative except sections labeled as 'example' in their subtitles".

The problem of serializations for standards and their parts was addressed by considering standards components to be abstract entities that can have multiple equivalent serializations (Section 2.1). For human-readable documents, it is irrelevant whether a document is in HTML, PDF, or Markdown format. It is desirable to make documents available in as many formats as possible as long as they contain substantially the same content. For machine-readable documents, there is no preferred serialization. It is required that a machine consuming any of the serializations should receive exactly the same information (Section 2.2.4). Again, the more available serializations the better as long as the abstract meaning of their content is the same.

The issue of enhancing vocabularies through added semantics was addressed by a "layered" approach that had been suggested in online discussion prior to the formation of the task group. All TDWG vocabularies will consist of a set of terms with basic properties that delineate their definition, label, and housekeeping metadata. This "basic" vocabulary can be used in a broad range of applications. Additional vocabularies could be constructed by adding components to the basic vocabulary, such as constraints and properties generating entailments (Section 4.4.2.2). Thus, there could eventually be several Darwin Core vocabularies, one consisting of only the basic components, and zero to many additional vocabularies consisting of the basic vocabulary plus "enhancement" components layered on top of the basic vocabulary. Because the nature of such enhancements could not be known in advance, the VMS contains a process for the development of vocabulary enhancements that includes use-case collection and implementation experience reports (Section 4). At the present time, there aren't any additional enhanced vocabularies, but they could be created in the future if members of the community can show that those enhancements are needed to accomplish some useful purpose.

In the next post of this series, I'll discuss how these strategies resulted in the model for machine-readable metadata embodied in the final standards documentation specification.