This is the sixth and final post in a series on the TDWG
Standards Documentation Specification (SDS).
The five earlier posts explain the history and model of the SDS, and how
to retrieve the machine-readable metadata about TDWG standards.
Note: this post was revised on 2020-03-04 when IRI dereferencing of the http://rs.tdwg.org/ subdomain went from testing into production.
Where do standards data live?
The SDS is silent about where data should live and how it
should be turned into the various serializations that should be available when clients
dereference resource IRIs. My thinking
on this subject was influenced by my observations about the previous management of TDWG
standards data. In the past, the
following things have happened:
- the standards documents for TAPIR were accidentally overwritten and lost.
- the authoritative Darwin Core (DwC) documents were locked up on a proprietary publishing system where only a few people could look at them or even know what was there.
- the normative Darwin Core document was written in RDF/XML, which no one could read and which had to be edited by hand.
Given that background, I was pretty convinced that the place
for the standards data to live was in a public GitHub repository. I was able to have a repository called rs.tdwg.org set up in the TDWG GitHub site for the purpose of storing the standards
metadata.
Form of the standards metadata
At the time I started working on this project, I had been
working on an application to transform CSV spreadsheets into various forms of
RDF, so I had already been thinking about how the CSV spreadsheets should be
set up to do that. I liked the model used
for DwC Archives (DwC-A) and defined in the DwC text guide.
Example metadata CSV file for terms defined by Audubon Core:
audubon.csv
In the DwC-A model, each table is "about" some class of thing. Each row in a data table represents an instance of that class, and each column represents some property of those instances. The contents of each cell represent the value of the property for that instance.
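As a rough illustration of that pattern (the rows and column headers below are invented placeholders, not the actual contents of audubon.csv), a table whose rows are vocabulary terms might look like this:

```
term_localName,label,status,date_modified
recordedBy,Recorded By,recommended,2020-01-01
basisOfRecord,Basis of Record,recommended,2020-01-01
```

Each row describes one term (an instance of the class of vocabulary terms), and each column corresponds to one property of those terms.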
Darwin Core Archive model (from the Darwin Core Text Guide)
In order to associate the columns with their property terms, DwC Archives use an XML file (meta.xml) that associates the intended properties with the columns of the spreadsheet. Since a flat spreadsheet can't handle one-to-many relationships very well, the model connects the instances in the core spreadsheet with extension tables that allow properties to have multiple values.
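For reference, a meta.xml file in a DwC Archive looks roughly like this (a trimmed sketch rather than a complete, validated example):

```xml
<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Occurrence" ignoreHeaderLines="1">
    <files><location>occurrence.csv</location></files>
    <id index="0"/>
    <!-- each field element maps a column index to the property term it represents -->
    <field index="1" term="http://rs.tdwg.org/dwc/terms/recordedBy"/>
  </core>
  <extension rowType="http://rs.tdwg.org/ac/terms/Multimedia" ignoreHeaderLines="1">
    <files><location>multimedia.csv</location></files>
    <!-- coreid links each extension row back to a row in the core table -->
    <coreid index="0"/>
    <field index="1" term="http://purl.org/dc/terms/format"/>
  </extension>
</archive>
```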
For the purposes of generating RDF, the form of the meta.xml
file is not adequate. One problem is
that the meta.xml file does not indicate whether the value (known in RDF as the
object) recorded in the cell is supposed to be a literal (string) or an IRI. A second problem is that in RDF values of
properties can also have language tags or datatypes if they are not plain
literals. Finally, a DwC Archive assumes that a row represents a
single type of thing, but in practice a row may contain information about
several types of things.
Example CSV mapping file: audubon-column-mappings.csv
For those reasons I ended up creating my own form of mapping file -- another CSV file rather than a file in XML format. I won't go into more details here, since I've already described the system of files in another blog post. But you can see from the example above that the file relates the column headers to properties, indicates the type of object (IRI, plain literal, datatyped literal, or language tagged literal), and provides the value of the language tag or datatype. The final column indicates whether that column applies to the main subject of the table or an instance of another class that has a one-to-one relationship with the subject resource.
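The details of the mapping-file columns are in that earlier post; the sketch below just shows the general idea of how such a mapping could drive RDF generation, using assumed header names ("column", "predicate", "type", "value") and the rdflib library rather than my actual code:

```python
# A minimal sketch (not the actual rs.tdwg.org code) of how a column-mapping CSV
# could drive RDF generation. The mapping-file headers used here
# ("column", "predicate", "type", "value") are assumptions for illustration.
import csv
from rdflib import Graph, URIRef, Literal

def build_graph(data_csv, mapping_csv, subject_base):
    g = Graph()
    with open(mapping_csv, newline='') as f:
        mappings = list(csv.DictReader(f))
    with open(data_csv, newline='') as f:
        for row in csv.DictReader(f):
            # assume the first mapped column supplies the subject's local name
            subject = URIRef(subject_base + row[mappings[0]['column']])
            for m in mappings[1:]:
                value = row[m['column']]
                if m['type'] == 'iri':
                    obj = URIRef(value)
                elif m['type'] == 'datatype':
                    obj = Literal(value, datatype=URIRef(m['value']))
                elif m['type'] == 'language':
                    obj = Literal(value, lang=m['value'])
                else:  # plain literal
                    obj = Literal(value)
                g.add((subject, URIRef(m['predicate']), obj))
    return g

# Example use (hypothetical file names):
# g = build_graph('audubon.csv', 'audubon-column-mappings.csv',
#                 'http://rs.tdwg.org/ac/terms/')
# print(g.serialize(format='turtle'))
```

The point is that the mapping table, not the script, carries the knowledge of which column becomes which predicate and what kind of RDF object its values should become.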
Example extension links file: linked-classes.csv
The links between the core file and the extensions are
described in a separate links file (e.g. linked-classes.csv). In this example, extension files are required
because each term can have many versions and a term can also replace more than
one term. Because in RDF the links can
be described by properties in either direction, the links file lists the
property linking from the extension to the core file (e.g. dcterms:isVersionOf)
and from the core file to the extension (e.g. dcterms:hasVersion).
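Expressed in RDF (using a hypothetical version IRI just to show the shape of the links), the two directions look like this:

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .

# from the extension (version) record to the core (current term) record
<http://rs.tdwg.org/ac/terms/version/caption-2020-01-27>
    dcterms:isVersionOf <http://rs.tdwg.org/ac/terms/caption> .

# from the core record to the extension (version) record
<http://rs.tdwg.org/ac/terms/caption>
    dcterms:hasVersion <http://rs.tdwg.org/ac/terms/version/caption-2020-01-27> .
```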
This system differs a bit from the DwC-A system where the
fields in the linked extension files are described within the same meta.xml
file. I opted to have a separate mapping file for each extension. The filenames listed in the linked-classes.csv file point to the extension data files, and the mapping files associated with those extension data files follow the same naming pattern as the mapping files for the core file.
The description of file types above explains most of the
many files that you'll find if you look in a particular directory in the
rs.tdwg.org repo.
Organization of directories in rs.tdwg.org
For each of the directories that describe terms in a
particular namespace, there is another directory that describes the versions of
those terms. Those directories are named by appending
"-versions" to the name of the corresponding
current-terms directory.
Finally, there are some special directories that describe
resources in the TDWG standards hierarchy at levels higher than individual
terms: "term-lists", "vocabularies", and
"standards". There is also a
special directory ("docs") that describes all of the
documents that are associated with TDWG standards. Taken together, all of these directories
contain the metadata necessary to completely characterize all of the components
of TDWG standards.
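Schematically (using the Audubon Core directory as an example and abbreviating the file lists), the repository layout looks something like this:

```
rs.tdwg.org/
├── audubon/                 metadata about current Audubon Core terms
│   ├── audubon.csv
│   ├── audubon-column-mappings.csv
│   └── linked-classes.csv
├── audubon-versions/        metadata about versions of those terms
├── term-lists/              metadata about the term lists themselves
├── vocabularies/            metadata about whole vocabularies
├── standards/               metadata about the standards
└── docs/                    metadata about documents associated with standards
```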
Using rs.tdwg.org metadata
One option for creating the serializations is to run a build
script that generates the serialization as a static file. I used this approach to generate the Audubon
Core Term List document. A Python script generates Markdown from the appropriate CSV files. The generated file is pushed to GitHub where
it is rendered as a web page via GitHub Pages.
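A stripped-down sketch of that kind of build step (with assumed column names, not the actual Audubon Core build code) might look like this:

```python
# Sketch: generate a Markdown term-list page from a term metadata CSV.
# The column names used here ("label", "term_localName", "definition")
# are assumptions for illustration.
import csv

def csv_to_markdown(csv_path, md_path, namespace_iri):
    lines = ['# Term List', '']
    with open(csv_path, newline='') as infile:
        for row in csv.DictReader(infile):
            lines.append('## ' + row['label'])
            lines.append('')
            lines.append('IRI: ' + namespace_iri + row['term_localName'])
            lines.append('')
            lines.append(row['definition'])
            lines.append('')
    with open(md_path, 'w') as outfile:
        outfile.write('\n'.join(lines))

# e.g. csv_to_markdown('audubon.csv', 'termlist.md', 'http://rs.tdwg.org/ac/terms/')
```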
Another option is to generate the serializations on the fly based
on the CSV tables. In another blog post I describe my efforts to set up a web service that uses CSV files of the form
described above to generate RDF/Turtle, RDF/XML, or JSON-LD serializations of
the data. That system has now been implemented for TDWG standards components.
The SDS specifies that if an IRI is dereferenced with an
Accept: header for one of the RDF serializations, the server should perform
content negotiation (303 redirect) to direct the client to the URL for the
serialization they want. For example, when a client that is a browser (with an Accept header of text/html) dereferences the Darwin Core term IRI http://rs.tdwg.org/dwc/terms/recordedBy, it will be redirected to the Darwin Core Quick Reference
Guide bookmark for that term. However,
if an Accept: header of text/turtle is used, the client will be redirected to http://rs.tdwg.org/dwc/terms/recordedBy.ttl. Similarly, application/rdf+xml
redirects to a URL ending in .rdf, and application/json or application/ld+json
redirects to a URL ending in .json.
Those URLs for specific serializations can also be requested directly
without requiring content negotiation.
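For example, using the Python requests library (which follows the 303 redirect automatically), a client could ask for the Turtle serialization like this:

```python
# Request the Turtle serialization of a term via content negotiation.
import requests

r = requests.get('http://rs.tdwg.org/dwc/terms/recordedBy',
                 headers={'Accept': 'text/turtle'})
print(r.url)                         # final URL after the 303 redirect (ends in .ttl)
print(r.headers.get('Content-Type'))
print(r.text[:300])                  # start of the Turtle document
```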
The system also generates HTML web pages for obsolete
Darwin Core terms that otherwise wouldn't be available via the Darwin Core
website. For example: http://rs.tdwg.org/dwc/curatorial/Preparations
redirects to http://rs.tdwg.org/dwc/curatorial/Preparations.htm, a web page describing an
obsolete Darwin Core term from 2007.
Providing term dereferencing of this sort is considered a
best practice in the Linked Data community.
But for developers interested in obtaining the machine-readable
metadata, as a practical matter it's probably easier to just get a
dump of the whole dataset by one of the methods
described in my earlier posts. However,
having the data in CSV form on GitHub also makes it available in a primitive
"machine-readable" form that doesn't really have anything to do with
Linked Data. Anyone can write a script
to retrieve the raw CSV files from the GitHub repo and process them using
conventional means as long as they understand how the various CSV files within
a directory are related to each other.
Because of the simplicity of the data format, it is highly likely
that these files will remain usable long into the future (or at least as long as GitHub is
viable), even if Linked Data falls by the wayside.
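For example, a minimal sketch of such a script (assuming the repository lives at github.com/tdwg/rs.tdwg.org and that its default branch is named master) could be:

```python
# Retrieve a raw CSV file from the rs.tdwg.org GitHub repository and parse it.
# The repository path and branch name are assumptions for illustration.
import csv
import io
import requests

RAW_BASE = 'https://raw.githubusercontent.com/tdwg/rs.tdwg.org/master/'

def fetch_table(relative_path):
    response = requests.get(RAW_BASE + relative_path)
    response.raise_for_status()
    return list(csv.DictReader(io.StringIO(response.text)))

terms = fetch_table('audubon/audubon.csv')
print(len(terms), 'rows')
print(terms[0])   # the first term's metadata as an ordinary dictionary
```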
Maintaining the CSV files in rs.tdwg.org
It would be unreliable to trust that a human could make all
of the necessary modifications to all of the CSV files without errors. It is also unreasonable to expect standards
maintainers to have to suffer through editing a bunch of CSV files every time
they need to change a term. They should
only have to make minimal changes to a single CSV file and the rest of the work
should be done by a script.
I've written a Python script within a Jupyter notebook to do
that work. Currently the script will make changes to the necessary CSV files for term
changes and additions within a single term list (a.k.a. "namespace")
of a vocabulary. It currently does not
handle term deprecations and replacements -- presumably those will be uncommon
enough that they could be done by manual editing. It also doesn't handle changes to the documents
metadata. I haven't really implemented
document versioning on rs.tdwg.org, mostly because that's either lost or
unknown information for all of the older standards. That should change in the future, but it just
isn't something I've had the time to work on yet.
Some final notes
I feel relatively confident about the approach of archiving
the standards data as CSV files. With
respect to the method of mapping the columns to properties and my ad hoc system
for linking tables, I think it would actually be better to use the JSON
metadata description files specified in the W3C standard for generating RDF from CSV files. I wasn't aware of that standard when
I started working on the project, but it would probably be a better way to
clarify the relationships between CSV tables and to impart meaning to their
columns.
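That standard, the W3C "CSV on the Web" (CSVW) recommendations, uses a JSON description like the minimal sketch below (with assumed column names) to state what each column means and how rows map to RDF; the aboutUrl, propertyUrl, lang, and datatype fields cover roughly the same ground as the ad hoc mapping files described above:

```json
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "audubon.csv",
  "tableSchema": {
    "aboutUrl": "http://rs.tdwg.org/ac/terms/{term_localName}",
    "columns": [
      {"name": "term_localName", "titles": "term_localName", "suppressOutput": true},
      {"name": "label", "titles": "label",
       "propertyUrl": "rdfs:label", "lang": "en"},
      {"name": "term_modified", "titles": "term_modified",
       "propertyUrl": "dcterms:modified", "datatype": "date"}
    ]
  }
}
```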
So far the system that I created for dereferencing the rs.tdwg.org IRIs seems to be adequate. In the long run, it might be better to use an alternative system. One is to simply have a build script that generates all of the possible serializations as static files. There would be a lot of them, but who cares? They could then be served by a much simpler script that just carried out the content negotiation but did not actually have to generate the pages. Another alternative would be to pay a professional to create a better system. That would involve a commitment of funds on the part of TDWG. But in either case the alternative systems could draw their data from the CSV files in rs.tdwg.org as they currently exist.
When we were near the adoption of the SDS, someone asked
whether the model we developed was too complicated. My answer was that it was just complicated
enough to do all of the things that people said that they wanted. One of my goals in this implementation
project was to show that it actually was possible to fully implement the SDS as
we wrote it. Although the mechanism for
managing and delivering the data may change in the future, the system that I've
developed shows that it's reasonable to expect that TDWG can dereference (with content
negotiation) the IRIs for all of the terms that it mints, and to provide a full
version history for every term, vocabulary, and document that we've published
in the past.
Note: although this is the last post in this series, some people have asked about how one would actually build a new vocabulary using this system. I'll try to write a follow-up showing how it can be done.