Sunday, March 3, 2019

Understanding the TDWG Standards Documentation Specification Part 1: Background

This is the first in a series of posts about the TDWG Standards Documentation Specification (SDS), with special reference to how its implementation enables machine access to information about TDWG standards. In particular, the SDS makes it possible to acquire all available information about TDWG vocabularies, including all historical versions of terms. In this post, I'm going to describe the genesis of the SDS and how the practical experience of the TDWG community influenced the ultimate state of the specification.

Historical background 


 The original draft of the SDS was written in 2007 by Roger Hyam as part of the effort to modernize the TDWG standards development process. The original draft was focused on how human-readable documents should be formatted. The SDS remained in draft form for several years and during that time, new standards documents generally reflected the directives of that draft.

 In 2013, a Vocabulary Management Task Group examined the status of the old TDWG Ontology and the experience of the community with the term change section of the Darwin Core Namespace Policy. The task group recommended that a new SDS be written with guidelines for the formatting of both human- and computer-readable documents, and that the Darwin Core Namespace Policy be used as the starting point for writing a specification describing how vocabularies should be maintained.

 In 2014, I was asked to lead a task group to revise the SDS and to move it forward to the status of ratified standard. One advantage of returning to work on the specification after seven years had elapsed was that we had the benefit of experience from work with the Darwin Core standard and had learned several important lessons from that. Some of those lessons were about weaknesses in policies related to standards documents and some were process-oriented. Because of the interrelation between the documentation of standards and the processes of their development and maintenance, the parallel development of both the SDS and the Vocabulary Maintenance Specification (VMS) by the task group allowed the two specifications to be developed in a complementary fashion.

One of the key problems with the state of TDWG standards documents was that it was difficult to know which documents associated with a complex standard like Darwin Core were actually part of the standard, and which documents were ancillary documents that provided useful information about the standard, but that were not actually part of the standard. That distinction was important because changes to documents within a standard should be subject to a potentially rigorous process of review, while documents outside the standard could be changed at will. There was a similar problem with the idea expressed in the original SDS draft that certain documents that were part of a standard should be considered normative, while other documents that were part of the standard were not normative. If the status of "normative" were bestowed on an entire document, what did that mean for parts of that document such as examples, or mutable URLs? Did changing an example or URL require invoking a standards review process or could they just be changed or corrected at will?

To make matters worse, the final designation of documents that were considered authoritatively to be part of the standard was determined by which documents were included in a .zip file that was uploaded to the OJS instance that was managing the standards adoption process at the time. That made it virtually impossible for any layman to actually know whether a particular document was part of a standard or not.

In the Darwin Core Standard at that time, the RDF/XML representation of the vocabulary was designated as the normative document. That presented several problems. One problem was that the XML document was by its nature machine-readable, which made it difficult for people to read and understand. Another question involved the text, XML, and RDF guides that specified how the standard was to be implemented, but that were not considered "the normative document". Clearly, following those documents was required in order to comply with the standard, so shouldn't they be considered in some way normative? The problem was made worse by the fact that the RDF guide document minted an entire category of Darwin Core terms (the dwciri: terms having IRI values), but those terms weren't actually found in the normative RDF/XML document.

Defining the Darwin Core vocabulary as RDF/XML also anchored it in a serialization that was becoming less commonly used. With the ratification of RDF/Turtle and JSON-LD by the W3C as alternative machine-readable serializations, it made less sense to define Darwin Core specifically in RDF/XML. 

During the time between the writing of the draft SDS in 2007 and the convening of the task group, there was also significant discussion in the community as to whether Darwin Core should be developed as a full ontology, or whether it should remain a simple "bag of terms" having minimal human-readable definitions. The first option would allow for greater expressiveness, but the second option would allow for the broadest possible use of the vocabulary.

Strategies of the Standards Documentation Specification 


The problems outlined above led the task group to create a specification with several key features that addressed those problems.

The ratified SDS threw out the idea that inclusion of a document as part of a standard was determined by presence in a .zip file. Instead, a document is considered part of a standard if it is designated as such. That designation takes place in two ways. First, each document declares its own membership: a human-readable document should state clearly in its header section that it is part of a standard (Section 3.2.3.1), and a machine-readable document should have a dcterms:isPartOf property that links it to a standard (Section 4.2.2). Second, each standard will have an official "landing page" that states clearly which documents are parts of the standard (Section 3.1).
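As a rough illustration of the machine-readable side of that designation, the sketch below treats document metadata as plain (subject, predicate, object) tuples and selects the documents that declare themselves part of a standard. The dcterms:isPartOf IRI is the real Dublin Core one; every example.org IRI and the function name are made up for illustration and are not TDWG identifiers.

```python
# Real Dublin Core property IRI; everything under example.org is hypothetical.
DCTERMS_IS_PART_OF = "http://purl.org/dc/terms/isPartOf"
DCTERMS_TITLE = "http://purl.org/dc/terms/title"

STANDARD = "http://example.org/standard/some-standard"

# Metadata about three documents, as (subject, predicate, object) triples.
triples = [
    ("http://example.org/doc/terms-list", DCTERMS_IS_PART_OF, STANDARD),
    ("http://example.org/doc/rdf-guide", DCTERMS_IS_PART_OF, STANDARD),
    # An ancillary document: it has metadata but no isPartOf link.
    ("http://example.org/doc/faq", DCTERMS_TITLE, "Frequently asked questions"),
]


def documents_in_standard(triples, standard_iri):
    """Return the sorted IRIs of documents linked to the standard by isPartOf."""
    return sorted(
        s for s, p, o in triples
        if p == DCTERMS_IS_PART_OF and o == standard_iri
    )


members = documents_in_standard(triples, STANDARD)
print(members)
```

The point of the sketch is only that membership becomes an explicit, queryable statement rather than something inferred from the contents of an archive file.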

The SDS also got rid of the idea that particular documents were normative. Any document that is part of a standard can contain parts that are normative and parts that are not. Each human-readable document will contain a statement in its introduction outlining what parts are normative and what parts are not (Section 3.2.1). This designation can be made by labeling certain parts as normative, or by rules such as "all parts are normative except sections labeled as 'example' in their subtitles".

The problem of serializations for standards and their parts was addressed by considering standards components to be abstract entities that can have multiple equivalent serializations (Section 2.1). For human-readable documents, it is irrelevant whether a document is in HTML, PDF, or Markdown format. It is desirable to make documents available in as many formats as possible as long as they contain substantially the same content. For machine-readable documents, there is no preferred serialization. It is required that a machine consuming any of the serializations should receive exactly the same information (Section 2.2.4). Again, the more available serializations the better as long as the abstract meaning of their content is the same.
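The requirement that every machine-readable serialization carry the same information can be pictured as follows. This is a minimal sketch using only the Python standard library: the first input is a toy subset of N-Triples (one triple per line, no escapes, datatypes, or blank nodes), and the second is an invented JSON layout that is not JSON-LD. The example.org IRIs and both parser functions are assumptions for illustration; a real consumer would use a proper RDF library.

```python
import json
import re

# Toy N-Triples-style text: IRIs in angle brackets, plain literals in quotes.
NTRIPLES_TEXT = '''
<http://example.org/term/a> <http://www.w3.org/2000/01/rdf-schema#label> "Term A" .
<http://example.org/term/a> <http://purl.org/dc/terms/isPartOf> <http://example.org/vocab> .
'''

# The same two facts in a made-up JSON layout, listed in a different order.
JSON_TEXT = '''
[
  {"s": "http://example.org/term/a",
   "p": "http://purl.org/dc/terms/isPartOf",
   "o": {"iri": "http://example.org/vocab"}},
  {"s": "http://example.org/term/a",
   "p": "http://www.w3.org/2000/01/rdf-schema#label",
   "o": {"literal": "Term A"}}
]
'''

LINE = re.compile(r'^<([^>]+)>\s+<([^>]+)>\s+(?:<([^>]+)>|"([^"]*)")\s*\.$')


def parse_ntriples(text):
    """Parse the toy N-Triples subset into a set of triples."""
    triples = set()
    for line in text.strip().splitlines():
        m = LINE.match(line.strip())
        if m is None:
            raise ValueError(f"unsupported line: {line!r}")
        s, p, iri, literal = m.groups()
        obj = ("iri", iri) if iri is not None else ("literal", literal)
        triples.add((s, p, obj))
    return triples


def parse_json(text):
    """Parse the invented JSON layout into the same set-of-triples form."""
    triples = set()
    for item in json.loads(text):
        (kind, value), = item["o"].items()  # exactly one key: "iri" or "literal"
        triples.add((item["s"], item["p"], (kind, value)))
    return triples


# Equivalent serializations must yield the same abstract set of statements.
same_information = parse_ntriples(NTRIPLES_TEXT) == parse_json(JSON_TEXT)
print(same_information)
```

Comparing sets rather than text is the design choice that matters here: statement order and syntax differ between the two inputs, but the abstract content does not.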

The issue of enhancing vocabularies through added semantics was addressed by a "layered" approach that had been suggested in online discussion prior to the formation of the task group. All TDWG vocabularies will consist of a set of terms with basic properties that delineate their definition, label, and housekeeping metadata. This "basic" vocabulary can be used in a broad range of applications. Additional vocabularies could be constructed by adding components to the basic vocabulary, such as constraints and properties generating entailments (Section 4.4.2.2). Thus, there could eventually be several Darwin Core vocabularies: one consisting of only the basic components, and zero to many additional vocabularies consisting of the basic vocabulary plus "enhancement" components layered on top of it. Because the nature of such enhancements could not be known in advance, the VMS contains a process for the development of vocabulary enhancements that includes use-case collection and implementation experience reports (Section 4). At the present time, there aren't any additional enhanced vocabularies, but they could be created in the future if members of the community can show that those enhancements are needed to accomplish some useful purpose.
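One way to picture the layering is as sets of statements, where an enhanced vocabulary is the basic vocabulary plus an enhancement layer and never removes anything from the basic one. The sketch below is only an illustration of that idea: the term IRIs under example.org are hypothetical (they are not Darwin Core terms), and rdfs:domain is used simply as one familiar example of a property that generates entailments.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
RDFS_COMMENT = "http://www.w3.org/2000/01/rdf-schema#comment"
RDFS_DOMAIN = "http://www.w3.org/2000/01/rdf-schema#domain"

# Hypothetical term and class IRIs, used only for illustration.
TERM = "http://example.org/term/collector"
CLASS = "http://example.org/term/CollectingEvent"

# Basic layer: label and definition only, usable by any application.
basic = frozenset({
    (TERM, RDFS_LABEL, "Collector"),
    (TERM, RDFS_COMMENT, "The agent who collected the specimen."),
})

# Enhancement layer: an added constraint that generates entailments.
enhancement = frozenset({
    (TERM, RDFS_DOMAIN, CLASS),
})

# An enhanced vocabulary is the basic layer with the enhancement on top.
enhanced = basic | enhancement


def is_layered_on(candidate, base):
    """True if the candidate vocabulary contains every basic statement."""
    return base <= candidate


print(is_layered_on(enhanced, basic))
print(len(enhanced))
```

Applications that need only labels and definitions can keep consuming the basic set unchanged, while others opt in to the additional layer.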

In the next post of this series, I'll discuss how these strategies resulted in the model for machine-readable metadata embodied in the final standards documentation specification.  
