Wednesday, October 23, 2019

Understanding the Standards Documentation Specification, Part 6: The rs.tdwg.org repository


This is the sixth and final post in a series on the TDWG Standards Documentation Specification (SDS).  The five earlier posts explain the history and model of the SDS, and how to retrieve the machine-readable metadata about TDWG standards.

Note: this post was revised on 2020-03-04 when IRI dereferencing of the http://rs.tdwg.org/ subdomain went from testing into production.

Where do standards data live?

 In earlier posts in this series, I said that after the SDS was adopted, there wasn't any particular plan for actually putting it into practice. Since I had a vested interest in its success, I took it upon myself to work on the details of its implementation, particularly with respect to making standards metadata available in machine-readable form.

The SDS is silent about where data should live and how it should be turned into the various serializations that should be available when clients dereference resource IRIs.  My thinking on this subject was shaped by my observations of how TDWG standards data had been managed previously.  In the past, the following things have happened:

  • the standards documents for TAPIR were accidentally overwritten and lost.
  • the authoritative Darwin Core (DwC) documents were locked up on a proprietary publishing system where only a few people could look at them or even know what was there.
  • the normative Darwin Core document was written in RDF/XML, which hardly anyone could read and which had to be edited by hand.


Given that background, I was pretty convinced that the standards data should live in a public GitHub repository.  I was able to have a repository called rs.tdwg.org set up on the TDWG GitHub site for the purpose of storing the standards metadata.

Form of the standards metadata

 Given past problems with formats that have become obsolete or that were difficult to read and edit, I was convinced that the standards metadata should be in a simple format.  To me the obvious format was CSV. 

At the time I started working on this project, I had been working on an application to transform CSV spreadsheets into various forms of RDF, so I had already been thinking about how the CSV spreadsheets should be set up to do that.  I liked the model used for DwC Archives (DwC-A) and defined in the DwC text guide.
Example metadata CSV file for terms defined by Audubon Core: audubon.csv

In the DwC-A model, each table is "about" some class of thing.  Each row in a data table represents an instance of that class, and each column represents some property of those instances.  The contents of each cell represent the value of the property for that instance. 

Darwin Core Archive model (from the Darwin Core Text Guide)

To associate the columns with their property terms, DwC Archives use an XML file (meta.xml) that maps each column of the spreadsheet to its intended property.  Since a flat spreadsheet can't handle one-to-many relationships very well, the model connects the instances in the core spreadsheet to extension tables that allow properties to have multiple values.
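To make that pattern concrete, here is a minimal sketch in Python of what the core/extension relationship amounts to: a one-to-many join keyed on the core record's identifier.  The file and column names are invented for illustration and don't correspond to any real archive.

```python
import csv

# Invented file and column names, just to show the shape of the core/extension
# pattern: one core row per term, many extension rows (versions) per term.
with open('terms.csv', newline='') as f:
    core_rows = list(csv.DictReader(f))
with open('term-versions.csv', newline='') as f:
    version_rows = list(csv.DictReader(f))

for term in core_rows:
    # Collect the extension rows that point back at this core row.
    versions = [v for v in version_rows if v['term_iri'] == term['term_iri']]
    print(term['term_iri'], 'has', len(versions), 'versions')
```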

For the purposes of generating RDF, the form of the meta.xml file is not adequate.  One problem is that meta.xml does not indicate whether the value (known in RDF as the object) recorded in a cell is supposed to be a literal (string) or an IRI.  A second problem is that in RDF, values of properties can also have language tags or datatypes if they are not plain literals.  Finally, a DwC Archive assumes that each row describes a single type of thing, but a row may actually contain information about several types of things.
 
Example CSV mapping file: audubon-column-mappings.csv

For those reasons I ended up creating my own form of mapping file -- another CSV file rather than a file in XML format.  I won't go into more detail here, since I've already described the system of files in another blog post.  But you can see from the example above that the file relates the column headers to properties, indicates the type of object (IRI, plain literal, datatyped literal, or language-tagged literal), and provides the value of the language tag or datatype.  The final column indicates whether the column applies to the main subject of the table or to an instance of another class that has a one-to-one relationship with the subject resource.
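As a rough sketch of how such a mapping file can drive RDF generation, the following Python fragment (using the rdflib library) reads a data CSV and a column-mapping CSV and emits triples.  The mapping-file headers used here (column, predicate, type, value) and the subject column name (term_iri) are invented for illustration; the real files in rs.tdwg.org use their own headers.

```python
import csv
from rdflib import Graph, URIRef, Literal

# Sketch only: column names are hypothetical, not the actual rs.tdwg.org headers.
g = Graph()

with open('audubon-column-mappings.csv', newline='') as f:
    mappings = list(csv.DictReader(f))

with open('audubon.csv', newline='') as f:
    for row in csv.DictReader(f):
        subject = URIRef(row['term_iri'])      # the row's main resource
        for m in mappings:
            cell = row.get(m['column'], '')
            if not cell:
                continue
            predicate = URIRef(m['predicate'])
            if m['type'] == 'iri':
                obj = URIRef(cell)
            elif m['type'] == 'datatype':
                obj = Literal(cell, datatype=URIRef(m['value']))
            elif m['type'] == 'language':
                obj = Literal(cell, lang=m['value'])
            else:                               # plain literal
                obj = Literal(cell)
            g.add((subject, predicate, obj))

g.serialize(destination='audubon.ttl', format='turtle')
```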

Example extension links file: linked-classes.csv


The links between the core file and the extensions are described in a separate links file (e.g. linked-classes.csv).  In this example, extension files are required because each term can have many versions and a term can also replace more than one term.  Because in RDF the links can be described by properties in either direction, the links file lists the property linking from the extension to the core file (e.g. dcterms:isVersionOf) and from the core file to the extension (e.g. dcterms:hasVersion). 

This system differs a bit from the DwC-A system, in which the fields of the linked extension files are described within the same meta.xml file.  I opted to have a separate mapping file for each extension.  The filenames listed in the linked-classes.csv file point to the extension data files, and the mapping files associated with those extension data files use the same naming pattern as the mapping file for the core file.
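A sketch of how the links file could be used, again with invented column names (filename, link_to_core, link_from_core, and the identifier columns in the extension files), might look like this; check the actual linked-classes.csv for the real headers.

```python
import csv
from rdflib import Graph, URIRef

# Sketch only: column names are hypothetical.
g = Graph()

with open('linked-classes.csv', newline='') as f:
    links = list(csv.DictReader(f))

for link in links:
    with open(link['filename'], newline='') as f:
        for ext_row in csv.DictReader(f):
            core_iri = URIRef(ext_row['term_iri'])     # resource in the core file
            ext_iri = URIRef(ext_row['version_iri'])   # resource in the extension file
            # State the link in both directions, e.g. version dcterms:isVersionOf term
            # and term dcterms:hasVersion version.
            g.add((ext_iri, URIRef(link['link_to_core']), core_iri))
            g.add((core_iri, URIRef(link['link_from_core']), ext_iri))
```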

The description of file types above explains most of the many files that you'll find if you look in a particular directory in the rs.tdwg.org repo.

Organization of directories in rs.tdwg.org

The set of files detailed above describes a single category of resources.  Most of the directories in the rs.tdwg.org repository contain such a set, associated with a particular namespace in use within a TDWG vocabulary (in the language of the SDS, a "term list").  For example, the directory "audubon" (containing the example files above) describes the current terms minted by Audubon Core, and the directory "terms" describes terms minted by Darwin Core.  There are also directories that describe terms borrowed by Audubon Core or Darwin Core.  Those directories have names that end with "-for-ac" or "-for-dwc".

For each of the directories that describe terms in a particular namespace, there is another directory that describes the versions of those terms.  Those directory names have "-versions" appended to the directory name for their corresponding current terms. 

Finally, there are some special directories that describe resources in the TDWG standards hierarchy at levels higher than individual terms: "term-lists", "vocabularies", and "standards".  There is also a special directory ("docs") that describes all of the documents associated with TDWG standards.  Taken together, these directories contain the metadata necessary to completely characterize all of the components of TDWG standards.

Using rs.tdwg.org metadata

In theory, you could pick through all of the CSV files that I just described and learn anything you wanted to know about any part of any TDWG standard.  However, that would be a lot to ask of a human.  The real purpose of the repository is to provide source data for software that can generate the human- and machine-readable serializations that the SDS specifies.  By building all of the serializations from the same CSV tables, we can reduce errors caused by manual entry and guarantee that a consumer always receives exactly the same metadata regardless of the chosen format.

One option for creating the serializations is to run a build script that generates each serialization as a static file.  I used this approach to generate the Audubon Core Term List document: a Python script generates Markdown from the appropriate CSV files, and the generated file is pushed to GitHub, where it is rendered as a web page via GitHub Pages.
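A stripped-down sketch of that kind of build step might look something like this.  The column names (label, term_iri, definition) are invented; the real build script reads several related CSV files and produces a much richer document.

```python
import csv

# Sketch only: invented column names, minimal Markdown output.
with open('audubon.csv', newline='') as f:
    terms = list(csv.DictReader(f))

lines = ['# Audubon Core Term List', '']
for t in terms:
    lines += ['## ' + t['label'], '',
              '- IRI: ' + t['term_iri'],
              '- Definition: ' + t['definition'], '']

with open('termlist.md', 'w') as f:
    f.write('\n'.join(lines))
```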

Another option is to generate the serializations on the fly based on the CSV tables.  In another blog post I describe my efforts to set up a web service that uses CSV files of the form described above to generate RDF/Turtle, RDF/XML, or JSON-LD serializations of the data.  That system has now been implemented for TDWG standards components. 

The SDS specifies that when an IRI is dereferenced, the server should perform content negotiation (via a 303 redirect) to direct the client to the URL for the serialization indicated by the Accept: header.  For example, when a client that is a browser (with an Accept: header of text/html) dereferences the Darwin Core term IRI http://rs.tdwg.org/dwc/terms/recordedBy, it will be redirected to the Darwin Core Quick Reference Guide bookmark for that term.  However, if an Accept: header of text/turtle is used, the client will be redirected to http://rs.tdwg.org/dwc/terms/recordedBy.ttl.  Similarly, application/rdf+xml redirects to a URL ending in .rdf, and application/json or application/ld+json redirects to a URL ending in .json.  Those URLs for specific serializations can also be requested directly without content negotiation.
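If you want to see the redirects for yourself, a few lines of Python (using the requests library) will show the 303 responses and the Location headers they point to:

```python
import requests

# Inspect the redirects without following them; the Location header shows the
# serialization-specific URL the server chooses for each Accept header.
iri = 'http://rs.tdwg.org/dwc/terms/recordedBy'
for accept in ['text/html', 'text/turtle', 'application/rdf+xml', 'application/ld+json']:
    r = requests.get(iri, headers={'Accept': accept}, allow_redirects=False)
    print(accept, r.status_code, r.headers.get('Location'))

# Or just follow the redirect and read the machine-readable metadata directly.
print(requests.get(iri, headers={'Accept': 'text/turtle'}).text[:300])
```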

The system also generates HTML web pages for obsolete Darwin Core terms that otherwise wouldn't be available via the Darwin Core website.  For example, http://rs.tdwg.org/dwc/curatorial/Preparations redirects to http://rs.tdwg.org/dwc/curatorial/Preparations.htm, a web page describing an obsolete Darwin Core term from 2007.

Providing term dereferencing of this sort is considered a best practice in the Linked Data community.  But for developers interested in obtaining the machine-readable metadata, as a practical matter it's probably easier to just get a machine-readable dump of the whole dataset by one of the methods described in my earlier posts.  However, having the data available in CSV form on GitHub also provides a primitive "machine-readable" form that doesn't really have anything to do with Linked Data.  Anyone can write a script to retrieve the raw CSV files from the GitHub repo and process them by conventional means, as long as they understand how the various CSV files within a directory are related to each other.  Because of the simplicity of the format, it is highly likely that the files will remain usable long into the future (or at least as long as GitHub is viable) even if Linked Data falls by the wayside.
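For example, a script along these lines would read one of the term CSV files directly from GitHub.  The branch and file path shown are a guess at the repository layout; check https://github.com/tdwg/rs.tdwg.org for the actual structure.

```python
import csv
import io
import requests

# The URL below assumes a particular branch and file path; adjust to match the repo.
url = 'https://raw.githubusercontent.com/tdwg/rs.tdwg.org/master/terms/terms.csv'
text = requests.get(url).text

for row in csv.DictReader(io.StringIO(text)):
    print(row)   # each row is a dict keyed by the CSV's column headers
    break        # just show the first row
```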

Maintaining the CSV files in rs.tdwg.org

The files in rs.tdwg.org were originally assembled manually (and laboriously) by me from a variety of sources.  All of the current and obsolete Darwin Core data were pulled from the "complete history" RDF/XML file that was formerly maintained as the "normative document" for Darwin Core.  Audubon Core terms data were assembled from the somewhat obsolete terms.tdwg.org website.  Data on ancient TDWG standards documents and their authors were assembled through a lot of detective work on my part.

However, maintaining the CSV files manually is not really a viable option.  Whenever a new version of a term is generated, that should spawn a series of new versions up the standards hierarchy.  The new term version should result in a new modified date for its corresponding current term, spawn a new version of its containing term list, result in an addition to the list of terms contained in the term list, generate a new version of the whole vocabulary, etc.

It would be unreliable to trust that a human could make all of the necessary modifications to all of the CSV files without errors.  It is also unreasonable to expect standards maintainers to have to suffer through editing a bunch of CSV files every time they need to change a term.  They should only have to make minimal changes to a single CSV file and the rest of the work should be done by a script. 
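To make the idea concrete, here is a rough sketch of the kind of cascading edits a single term change implies.  It is not the actual maintenance script, and the file and column names are invented:

```python
import csv
from datetime import date

# Sketch only: invented file and column names; the real rs.tdwg.org CSVs are
# laid out differently and the actual script does much more.
today = date.today().isoformat()

# 1. Bump the modified date of the changed current term.
with open('terms.csv', newline='') as f:
    terms = list(csv.DictReader(f))
for t in terms:
    if t['term_localName'] == 'recordedBy':
        t['definition'] = 'A revised definition...'
        t['modified'] = today
with open('terms.csv', 'w', newline='') as f:
    w = csv.DictWriter(f, fieldnames=terms[0].keys())
    w.writeheader()
    w.writerows(terms)

# 2. Add a row to the versions table for the new term version.
with open('term-versions.csv', 'a', newline='') as f:
    csv.writer(f).writerow(['http://example.org/recordedBy-' + today,
                            'http://example.org/recordedBy', today])

# 3. A real script continues up the hierarchy: a new version of the term list,
#    an updated membership list, a new version of the vocabulary, and so on.
```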

I've written a Python script within a Jupyter notebook to do that work.  Currently the script will make changes to the necessary CSV files for term changes and additions within a single term list (a.k.a. "namespace") of a vocabulary.  It currently does not handle term deprecations and replacements -- presumably those will be uncommon enough that they could be done by manual editing.  It also doesn't handle changes to the metadata about documents.  I haven't really implemented document versioning on rs.tdwg.org, mostly because that information is either lost or unknown for the older standards.  That should change in the future, but it just isn't something I've had the time to work on yet.

Some final notes

Some might take issue with the fact that I've somewhat unilaterally made these implementation decisions (although I did discuss them with a number of key TDWG people while I was setting up the rs.tdwg.org repo).  The problem is that TDWG doesn't really have a very formal mechanism for handling this kind of work.  There is the TAG and an Infrastructure interest group, but neither of them currently has operational procedures for this kind of implementation.  Fortunately, TDWG generally has given a fairly free hand to people who are willing to do the work necessary for standards development, and I've received encouragement on this work, for which I'm grateful.

I feel relatively confident about the approach of archiving the standards data as CSV files.  With respect to the method of mapping the columns to properties and my ad hoc system for linking tables, I think it would actually be better to use the JSON metadata description files specified in the W3C "CSV on the Web" (CSVW) recommendations for generating RDF from CSV files.  I wasn't aware of that standard when I started working on the project, but it would probably be a better way to clarify the relationships between CSV tables and to impart meaning to their columns.

So far the system that I created for dereferencing the rs.tdwg.org IRIs seems to be adequate.  In the long run, it might be better to use an alternative system.  One is to simply have a build script that generates all of the possible serializations as static files.  There would be a lot of them, but who cares?  They could then be served by a much simpler script that just carried out the content negotiation but did not actually have to generate the pages.  Another alternative would be to pay a professional to create a better system.  That would involve a commitment of funds on the part of TDWG.  But in either case the alternative systems could draw their data from the CSV files in rs.tdwg.org as they currently exist. 

When we were near the adoption of the SDS, someone asked whether the model we developed was too complicated.  My answer was that it was just complicated enough to do all of the things that people said that they wanted.  One of my goals in this implementation project was to show that it actually was possible to fully implement the SDS as we wrote it.  Although the mechanism for managing and delivering the data may change in the future, the system that I've developed shows that it's reasonable to expect that TDWG can dereference (with content negotiation) the IRIs for all of the terms that it mints, and to provide a full version history for every term, vocabulary, and document that we've published in the past.

Note: although this is the last post in this series, some people have asked about how one would actually build a new vocabulary using this system.  I'll try to write a follow-up showing how it can be done.