This is the second in a series of posts about the TDWG
Standards Documentation Specification (SDS).
For background on the SDS, see the first post.
Note: this post was revised on 2020-03-04 when IRI dereferencing of the http://rs.tdwg.org/ subdomain went from testing into production.
Implementation plan?
The SDS was ratified and issued in April of 2017. It did not, however, include any plan for its
implementation. It wasn't actually clear
whose responsibility it was to make implementation of the SDS happen. The Technical Architecture Group (TAG) might have
been a logical group to take charge, but in 2017 it had not yet been
reconstituted in its current form. As
the architect of the SDS, I had a vested interest in seeing it become
functional, so I decided to take the initiative to figure out how it could be
implemented. As I worked on this project, I got feedback from the Darwin Core Maintenance Group, key people working on the TDWG website and other infrastructure, and later from some TAG members.
Although the SDS provided a general framework, it left a lot
of the details to implementers. In
particular, the SDS had relatively little to say about the form of URIs used as
identifiers for documents whose form was specified by the SDS. For guidance, I looked to precedents set by
Darwin Core, general practices in the Linked Data world, and practicalities of
URI dereferencing.
The SDS model
The SDS describes a hierarchical model for resources within its
scope. That hierarchy is relatively
simple for documents within a standard: there is simply a hasPart/isPartOf
relationship between the standard and its documents.
For vocabularies, the situation is more complicated. The SDS describes four levels in the hierarchy
that applies to vocabularies: standard, vocabulary, term list, and term. There was some discussion in the run-up to ratification
of the SDS as to whether the model needed to be this complicated. At that time, I asserted that this was the
least complicated model that could accomplish all of the things that people
said they wanted to do with vocabularies in TDWG.
It would be tempting to say that a much simpler model might
be possible. For example, we could
consider the Audubon Core Standard to be synonymous with the Audubon Core
vocabulary. We could say that Audubon Core terms were a direct part of it -- a simple two-level hierarchy.
However, the Audubon Core Standard is more than just a set of
terms. The Audubon Core vocabulary is
distinct from the documents that describe how Audubon Core should be used (the
structure document, term list document, etc.), which are also part of the standard. Although we might lump the standard, vocabulary,
and documents together in our human minds, if we really aspire to have machine-readable
descriptions of components of TDWG standards, we have to distinguish between
things that are not the same -- things that have different authors, creation dates,
and version histories.
Example of first (standards) and second (vocabularies and documents) levels of the TDWG Standards Documentation Specification hierarchy
As I described in the previous post, there was also a desire expressed in the community for the capability to have more than one "Darwin Core vocabulary". Some people might want only the basic vocabulary (a "bag of terms" with definitions). Others might want a more complicated vocabulary where some terms might be declared to be equivalent to terms outside of Darwin Core, or classes might be declared to be subclasses of classes in an outside ontology. Still others might want to create a Darwin Core vocabulary that restricts the values that can be used for certain terms, or entails class membership through range and domain declarations. So although we don't currently have more than one Darwin Core vocabulary, we want to allow for that possibility in the future. That's another reason to have a model that separates the standard from the vocabulary or vocabularies that it defines.
Example of second (vocabularies) and third (term lists) levels of the TDWG Standards Documentation Specification hierarchy
For terms defined by a TDWG vocabulary, there is an authoritative
term list for each namespace. For
example, there is an authoritative term list for the dwc: namespace and another
for the dwciri: namespace. These lists
are considered authoritative because they define the terms they contain. Dereferencing a term list IRI should return
the term list document.
Some notes about IRIs
According to the SDS, each resource in the hierarchy should
be assigned an IRI as an identifier (Section 2.1.1). IRIs
are a superset of URIs that allows non-Latin characters to be used. For the purposes of this post, you can consider
URIs and IRIs to be synonymous.
There has always been confusion between the use of IRIs/URIs as identifiers and URLs as resource locators.
Fundamentally, an IRI is an identifier that may or may not actually
dereference in a web browser to retrieve a web page about the resource. In the Linked Data community, it is
considered a best practice for IRIs to dereference, but it isn't a
requirement. In fact, there are a number
of "borrowed" term IRIs in Audubon Core that don't dereference and
probably never will. So although it
isn't a requirement of the SDS that TDWG IRIs dereference, one goal of
implementation is to eventually make that happen.
The origin of the subdomain rs.tdwg.org has always been a
little mysterious to me. I believe that
the "rs" part stands for "schema repository" and that it
was originally intended to be a place from which XML and other schemas could be
retrieved. Although I don't think there
is any official policy that requires use of the rs.tdwg.org subdomain for
TDWG-minted IRIs, that has become the convention with Darwin Core and Audubon Core
and I've taken that as the precedent to be followed when creating other IRIs that
denote resources associated with TDWG standards. The exceptions to this pattern are the IRIs
for the standards themselves. The
precedent there is that TDWG standards have IRIs in the form
http://www.tdwg.org/standards/nnn, where "nnn" is a number assigned to a
particular standard.
IRI patterns for vocabulary standards
Here
are the patterns I established or continued based on past practice:
Standards IRI:
http://www.tdwg.org/standards/nnn
where "nnn" consists of numeric characters assigned
to the standard. Dereferencing these
IRIs should lead the user to the landing page of the standard. Example of the Darwin Core standard:
Note that since these IRIs aren't within the rs.tdwg.org
subdomain, the test system I've implemented does not handle their
dereferencing. Standards IRI
dereferencing is handled by a separate system and I don’t know how fully functional
it is for all prior TDWG standards.
Vocabulary IRI:
http://rs.tdwg.org/vvv/
where "vvv" is a sequence of alphabetic characters
assigned to the vocabulary. Example of
the Darwin Core basic vocabulary:
Term list IRI:
http://rs.tdwg.org/vvv/ttt/
where "vvv" is a sequence of alphabetic characters
assigned to the vocabulary and "ttt" is a sequence of alphabetic characters
assigned to the term list within that vocabulary. Example of the Darwin Core IRI-valued terms:
Term IRI:
http://rs.tdwg.org/vvv/ttt/nnn
where "vvv" is a sequence of alphabetic characters
assigned to the vocabulary, "ttt" is a sequence of alphabetic
characters assigned to the term list within that vocabulary, and "nnn"
is the local name of the term. Example
of the "in described place" term:
The term pattern described above is backward compatible with
all current Darwin Core and Audubon Core term IRIs. Existing Darwin Core RDF/XML asserts relationships
between terms and the resource that defines them like this:
http://rs.tdwg.org/dwc/terms/dateIdentified rdfs:isDefinedBy
http://rs.tdwg.org/dwc/terms/ .
So the IRI pattern for term lists is also backwards compatible
with this previous use, with the name "term list" now explicitly given to
the resource that defines terms.
The IRI pattern for vocabularies is new, but is consistent
with the hierarchy and is necessary to distinguish between vocabularies and the
standards that create them.
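The three patterns above lend themselves to simple pattern matching. Here is a minimal sketch in Python, not the code that actually runs on the server. The names `classify` and `defining_term_list` are hypothetical, and I've assumed that path segments may contain digits as well as letters (as in the Iptc4xmpExt term list). Version and document IRIs are deliberately not handled and return None.

```python
import re

# Assumption: segments start with a letter and may contain digits.
BASE = r"http://rs\.tdwg\.org/"
SEGMENT = r"[A-Za-z][A-Za-z0-9]*"

# One pattern per hierarchy level; only the term IRI lacks a trailing slash.
PATTERNS = [
    ("vocabulary", re.compile(rf"{BASE}(?P<vocab>{SEGMENT})/")),
    ("term list", re.compile(rf"{BASE}(?P<vocab>{SEGMENT})/(?P<term_list>{SEGMENT})/")),
    ("term", re.compile(
        rf"{BASE}(?P<vocab>{SEGMENT})/(?P<term_list>{SEGMENT})/(?P<local_name>{SEGMENT})"
    )),
]


def classify(iri):
    """Return (level, parts) for a vocabulary, term list, or term IRI, else None."""
    for level, pattern in PATTERNS:
        match = pattern.fullmatch(iri)
        if match:
            return level, match.groupdict()
    return None


def defining_term_list(term_iri):
    """Return the term list IRI for a term IRI by dropping the local name."""
    result = classify(term_iri)
    if result is None or result[0] != "term":
        raise ValueError(f"not a term IRI: {term_iri}")
    return term_iri.rsplit("/", 1)[0] + "/"


print(classify("http://rs.tdwg.org/dwc/"))
print(classify("http://rs.tdwg.org/dwc/iri/"))
print(defining_term_list("http://rs.tdwg.org/dwc/terms/dateIdentified"))
```

The last call mirrors the rdfs:isDefinedBy relationship shown earlier: stripping the local name from a term IRI gives the IRI of the term list that defines it.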
IRI pattern for documents
Previously, there had been no consistent pattern for IRIs
assigned to documents associated with standards. Here are some examples of IRIs for Darwin Core documents:
The Darwin Core XML guide: http://rs.tdwg.org/dwc/terms/guides/xml/
The Darwin Core simple text guide: http://rs.tdwg.org/dwc/terms/simple/
To maintain backwards compatibility, these pre-existing IRIs
were left unchanged. However, the IRI patterns
used for Darwin Core documents make it difficult to distinguish programmatically
between term and document IRIs using pattern matching. So for all documents from standards other than Darwin Core, I
used this pattern:
http://rs.tdwg.org/sss/doc/docname/
where "sss" is a sequence of alphabetic characters
representing the standard and "docname" is a short series of alphabetic
characters representing the document.
For example, http://rs.tdwg.org/ac/doc/structure/
is the IRI for the Audubon Core Structure document.
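To illustrate why the fixed "doc" segment helps, here is a short Python sketch of how a document IRI could be recognized. The function name `parse_document_iri` and the exact character classes are my assumptions for illustration, not part of the SDS or of the actual server code.

```python
import re

# Assumption: "sss" is lowercase letters; "docname" may contain letters, digits, "-" or "_".
DOCUMENT_IRI = re.compile(
    r"http://rs\.tdwg\.org/(?P<standard>[a-z]+)/doc/(?P<docname>[A-Za-z0-9_-]+)/"
)


def parse_document_iri(iri):
    """Return the standard and document name if the IRI follows the /doc/ pattern."""
    match = DOCUMENT_IRI.fullmatch(iri)
    return match.groupdict() if match else None


print(parse_document_iri("http://rs.tdwg.org/ac/doc/structure/"))
# A legacy Darwin Core document IRI does not follow the pattern, so it is not recognized.
print(parse_document_iri("http://rs.tdwg.org/dwc/terms/simple/"))
```

The legacy Darwin Core document IRIs would need to be handled as special cases, for example through an explicit lookup table.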
Redirection
One thing that should be made clear is the distinction between
the IRI that identifies a resource and the URL that actually can be used to
retrieve a document or metadata about some other resource. Because the SDS treats the resources it describes as abstract entities, those entities can have multiple formats or serializations
that are distinct from the abstract resources themselves. For example, the Audubon Core Structure
document is an abstract thing identified by http://rs.tdwg.org/ac/doc/structure/
. However, the HTML serialization of
that document can currently be retrieved from the URL https://tdwg.github.io/ac/structure/
and in the future that document might be made available at different URLs in
other formats such as PDF. It is
required that the IRI of the abstract resource be stable and unchanged, but
there is no requirement that the retrieval URL for a serialization stay the
same over time. Thus it's important that
citations and bookmarks be set to the permanent IRI of the resource, and that redirection
from the permanent IRI to the retrieval URL be maintained so that people can
actually acquire a copy of the resource using a browser.
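One way to picture this separation is as a lookup table that maps each permanent IRI to its current retrieval URL. The Python sketch below is only an illustration under that assumption. The names `REDIRECTS` and `retrieval_url` are hypothetical, and the table holds only the one pairing mentioned above.

```python
# Permanent IRI -> current retrieval URL. Only the right-hand side may change over time.
REDIRECTS = {
    "http://rs.tdwg.org/ac/doc/structure/": "https://tdwg.github.io/ac/structure/",
}


def retrieval_url(iri):
    """Return the current retrieval URL for a permanent IRI, or None if unknown."""
    return REDIRECTS.get(iri)


print(retrieval_url("http://rs.tdwg.org/ac/doc/structure/"))
```

If the HTML page moves, only the value in the table needs to be updated. Citations and bookmarks that use the permanent IRI keep working.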
In the past, obscure, deprecated Darwin Core terms simply
didn't dereference. In the test system,
they redirect programmatically to a URL that is the term IRI plus
".htm". Here's an example:
redirects to
The document that is retrieved is an HTML, human-readable
description of the term.
Historically, current Darwin Core terms redirected to the Darwin
Core Quick Reference page and that behavior has been maintained in the test
system. Here's an example:
redirects to
The same is true with Audubon Core terms, whose IRIs
redirect to an appropriate place on the Audubon Core Term List document. The URLs of both the Audubon Core Term List
page and Darwin Core Quick Reference page have changed recently, reinforcing
the importance of citing the actual term IRIs rather than the redirected URLs.
TDWG Standards Documentation Specification version model (from Section 2.3)
Versions
Taking cues from Dublin Core and the W3C, the SDS describes
a version model that can be used to track versions of resources associated with
TDWG standards. For example,
dereferencing the Darwin Core vocabulary IRI http://rs.tdwg.org/dwc/ shows that
there are 19 versions: 18 previous versions and a most recent version that
corresponds to the current Darwin Core vocabulary.
For vocabularies and term lists, the version IRIs are constructed by appending an ISO 8601 date after the
final slash and inserting "version/" before the terminal string. For example, the current Darwin Core vocabulary IRI is http://rs.tdwg.org/dwc/
and a version of the Darwin Core vocabulary is http://rs.tdwg.org/version/dwc/2015-03-27
. The current Darwin Core IRI-valued term
list IRI is http://rs.tdwg.org/dwc/iri/ and a version of it is http://rs.tdwg.org/dwc/version/iri/2015-03-27
. (Although it wouldn't be necessary to
include the characters "version/" in the version IRI, doing so makes pattern
recognition for those IRIs much simpler.)
Following the precedent already set for Darwin Core, term
version IRIs are formed by appending an ISO 8601 date with a dash. Again "version/" is inserted ahead
of the local name to make IRI pattern recognition easier. For example, the term IRI http://rs.tdwg.org/dwc/terms/establishmentMeans
has a version http://rs.tdwg.org/dwc/terms/version/establishmentMeans-2009-04-24 .
For documents, the version IRI is formed by simply appending
the ISO 8601 date after the trailing slash.
(In the case of documents, IRI pattern recognition is less critical
since there aren't hierarchical levels below the level of the document. So
"version/" isn't inserted in the version IRI.) For example, the document http://rs.tdwg.org/sds/doc/specification/
has a version http://rs.tdwg.org/sds/doc/specification/2007-11-05 .
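The three version IRI rules described above can be summarized in a few lines of Python. This is a sketch for illustration only, with hypothetical function names, and it assumes the current IRI already follows the patterns described earlier in this post.

```python
from datetime import date


def vocabulary_or_term_list_version(iri, issued):
    """Insert 'version/' before the terminal segment, then append the date."""
    head, terminal = iri.rstrip("/").rsplit("/", 1)
    return f"{head}/version/{terminal}/{issued.isoformat()}"


def term_version(iri, issued):
    """Insert 'version/' before the local name, then append '-' and the date."""
    head, local_name = iri.rsplit("/", 1)
    return f"{head}/version/{local_name}-{issued.isoformat()}"


def document_version(iri, issued):
    """Append the date directly after the trailing slash of the document IRI."""
    if not iri.endswith("/"):
        raise ValueError("document IRIs are expected to end with a slash")
    return f"{iri}{issued.isoformat()}"


print(vocabulary_or_term_list_version("http://rs.tdwg.org/dwc/", date(2015, 3, 27)))
print(vocabulary_or_term_list_version("http://rs.tdwg.org/dwc/iri/", date(2015, 3, 27)))
print(term_version("http://rs.tdwg.org/dwc/terms/establishmentMeans", date(2009, 4, 24)))
print(document_version("http://rs.tdwg.org/sds/doc/specification/", date(2007, 11, 5)))
```

The four calls reproduce the example version IRIs given in this section.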
In the case of non-document resources, resolution of version
IRIs is fully implemented, since human-readable pages can be constructed programmatically for those
resources using data from the metadata database. However, since the human-readable versions of
standards documents are generally created manually and have idiosyncratic redirection
IRIs, version IRI resolution is currently only partially implemented. In the case of many standards documents, the location
of previous versions is not known or they are not yet available online. So for now, one can't explore older versions
of standards documents in the same way one can explore older versions of
vocabularies, term lists, and terms.
Summary
I've implemented a system of IRIs that are consistent with
the SDS and past practice of Darwin and Audubon Cores. Although the patterns I established aren't the only possible ones, they work well for facilitating pattern matching by a server that generates many of the documents programmatically, so I feel that the pattern system is sound.
Here are some starting points for exploration:
Audubon Core basic vocabulary:
Darwin Core basic vocabulary:
From these two vocabulary pages you can surf to term lists,
terms, and older versions of all of the resources.
Terms borrowed by Audubon Core from the IPTC Photo Metadata Extension:
http://rs.tdwg.org/ac/Iptc4xmpExt/
The October 16, 2011 version of the Darwin Core vocabulary:
http://rs.tdwg.org/version/dwc/2011-10-16
The April 24, 2009 version of the list of core Darwin Core terms:
http://rs.tdwg.org/dwc/version/terms/2009-04-24
The September 11, 2009 version of Basis of Record:
http://rs.tdwg.org/dwc/terms/version/basisOfRecord-2009-09-11
A deprecated Darwin Core term list:
http://rs.tdwg.org/dwc/curatorial/
Here are some examples of document IRIs that redirect:
In the next post, I'll describe how the system I've
implemented allows retrieval of machine-readable metadata.