Thursday, March 5, 2020

TDWG gets 5 Stars!

Photo from W3C https://www.w3.org/DesignIssues/LinkedData.html


TDWG IRIs are dereferenceable with content negotiation!


Yesterday was a happy day for me because after several years of work, the switch was flipped and all of the IRIs minted by TDWG under the rs.tdwg.org subdomain became dereferenceable with content negotiation in most cases.  For those readers who aren't hard-core Linked Open Data (LOD) buffs, I'll explain what that means.

An internationalized resource identifier (IRI; a superset of uniform resource identifiers, URIs) is a globally unique identifier that generally looks like the well-known URL. It usually starts with http:// or https://, which implies that something will happen if you put it in a web browser. That "something" is dereferencing: the browser uses the IRI to try to retrieve a document from a remote server and, if successful, a web page shows up in the browser. Because a browser's job is to retrieve web pages, when it dereferences an IRI, it asks for a particular "content type" (text/html) indicating that it wants an HTML web page.

But there are other kinds of software designed to retrieve documents that are readable by machines rather than by humans. When those applications dereference an IRI, they ask for other content types (like text/turtle or application/rdf+xml) that can be interpreted as structured data and be integrated with data from other sources. The same IRI can be used to retrieve different documents that provide the same information in different formats depending on the content type that is requested. The process of determining what kind of document to return to the requesting application is called content negotiation.
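To make the idea concrete, here is a minimal Python sketch of the client side of content negotiation. It only builds two requests for the same IRI with different Accept headers and does not send them. The helper name build_request is my own invention for illustration, and the standard-library urllib is just one of many HTTP clients that could be used.

```python
from urllib.request import Request


def build_request(iri: str, content_type: str) -> Request:
    """Build (but do not send) a GET request asking for one content type."""
    # The Accept header is how the client states which format it wants.
    return Request(iri, headers={"Accept": content_type})


iri = "http://rs.tdwg.org/dwc/terms/recordedBy"

# What a browser would ask for: a human-readable web page.
html_request = build_request(iri, "text/html")

# What a Linked Data client might ask for: machine-readable RDF/Turtle.
turtle_request = build_request(iri, "text/turtle")

# Same IRI in both cases; only the requested content type differs.
print(html_request.full_url == turtle_request.full_url)
print(html_request.get_header("Accept"))
print(turtle_request.get_header("Accept"))
```

Passing either object to urllib.request.urlopen would actually send the request. What comes back depends on what the server supports for that IRI.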

In the past, the behavior of TDWG IRIs was inconsistent. Some IRIs like those of Darwin Core terms would retrieve a web page in a browser and provide machine-readable RDF/XML when requested. Other IRIs like those of Audubon Core terms would retrieve a web page, but no machine-readable formats. Obsolete IRIs like those of old versions of Darwin Core and the defunct TDWG ontology did nothing at all. Then there were many TDWG resources, such as old standards documents, that didn't even have IRIs.

In an earlier blog post, I described the IRI patterns that I established in order to be able to denote all of the kinds of TDWG standards components that were described in the TDWG Standards Documentation Specification. Those patterns made it possible to use IRIs to refer to things like vocabularies, term lists, and documents in a consistent way. Just creating the IRI patterns and using them to assign IRIs to vocabularies and documents provided a way to uniquely identify those resources, but did not create the "magic" of actually making it possible to use those IRIs to retrieve information. That's what happened yesterday.


What happens when the IRIs are dereferenced?

The action that takes place when an rs.tdwg.org IRI is dereferenced depends on the category of the resource and the content type that's requested. There are four categories of behavior that differ primarily in how they deliver human-readable content.

1. "Living" TDWG vocabulary terms. When a term from one of the actively maintained TDWG vocabularies (currently Darwin Core and Audubon Core) is dereferenced, the browser is redirected to the most helpful reference document for that vocabulary (the Quick Reference Guide for Darwin Core and the Term List document for Audubon Core). You can try this with dwc:recordedBy, http://rs.tdwg.org/dwc/terms/recordedBy and ac:caption, http://rs.tdwg.org/ac/terms/caption.

2. Obsolete TDWG vocabulary terms, vocabularies, term lists, and special categories of resources. When terms in these categories are dereferenced, a generic web page is generated by a script that provides vanilla information about the term. The same is true for some special categories like Executive Committee decisions.  Try it with an obsolete term http://rs.tdwg.org/dwc/curatorial/Disposition, a decision http://rs.tdwg.org/decisions/decision-2011-10-16_6 and a term list http://rs.tdwg.org/ac/xmp/.

3. TDWG-maintained standards documents. The maintenance of TDWG standards documents is idiosyncratic and their location depends on where their maintainers happened to have stashed them. The URLs used to retrieve the documents might change if they are put into different places or if their format changes (e.g. changed from PDF to Markdown). To provide a stable way to denote those documents, the IRIs minted in the rs.tdwg.org subdomain redirect to whatever current URL delivers that particular document. If the document moves or the access URL changes for some reason, the stable IRI will redirect to the new access URL. Try it with the TDWG Vocabulary Maintenance Specification http://rs.tdwg.org/vms/doc/specification/, the Audubon Core Structure document http://rs.tdwg.org/ac/doc/structure/, and the TAPIR Protocol Specification http://rs.tdwg.org/tapir/doc/specification/.

4. Non-TDWG-maintained standards documents. A lot of the old TDWG standards were not actually published by TDWG, and their maintenance is carried out by organizations whose websites are not under TDWG control. So we will just try to keep the TDWG-issued document IRIs pointing at whatever the access URL is currently for the document. Examples: Economic Botany Data Collection Standard specification http://rs.tdwg.org/ebdc/doc/specification/, Taxonomic Literature : A Selective Guide to Botanical Publications and Collections with Dates, Commentaries and Types (Second edition, vol. 1) http://rs.tdwg.org/tl/doc/v1/, and Index Herbariorum http://rs.tdwg.org/ih/doc/book/.
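The stable-IRI idea behind categories 3 and 4 can be pictured as a simple lookup table. The Python sketch below is only an illustration of the principle, not the actual server code. The example.org access URLs and the redirect_target function are hypothetical placeholders that I made up; only the rs.tdwg.org IRIs are real.

```python
# Hypothetical redirect table: stable rs.tdwg.org IRI -> current access URL.
# The example.org targets are placeholders, not the documents' real locations.
access_urls = {
    "http://rs.tdwg.org/vms/doc/specification/": "https://example.org/vms/specification.pdf",
    "http://rs.tdwg.org/ih/doc/book/": "https://example.org/index-herbariorum/",
}


def redirect_target(iri, table):
    """Return the current access URL for a stable IRI, or None if unknown."""
    return table.get(iri)


stable_iri = "http://rs.tdwg.org/vms/doc/specification/"
before = redirect_target(stable_iri, access_urls)

# The document moves (say, a PDF is replaced by Markdown somewhere else).
# Only the table entry changes; the IRI that people cite stays the same.
access_urls[stable_iri] = "https://example.org/vms/specification.md"
after = redirect_target(stable_iri, access_urls)

print(before)
print(after)
```

The point of the design is that anyone who recorded the stable IRI never has to update it, no matter how many times the document itself is moved.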

Machine-readable metadata
For all of these categories, the machine-readable metadata are delivered in the same way: generated by a script from the data in the rs.tdwg.org GitHub repository. To access the content through content negotiation, you can dereference any of the IRIs above using software like Postman that will allow you to specify an Accept header for the machine-readable content type that you want (text/turtle or application/rdf+xml). To access the machine-readable documents directly, drop any trailing slashes and append .ttl or .rdf to access RDF/Turtle or RDF/XML respectively. Examples: http://rs.tdwg.org/dwc/terms/recordedBy.ttl, http://rs.tdwg.org/dwc/terms/recordedBy.rdf, and http://rs.tdwg.org/tl/doc/v1.ttl.
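The rule for building the direct-access URLs (drop any trailing slashes, then append the extension) is simple enough to sketch in a few lines of Python. The function name direct_url and the SUFFIXES table are my own names for illustration; the two extensions and the example IRIs come from the description above.

```python
# File extension for each machine-readable content type.
SUFFIXES = {
    "text/turtle": ".ttl",
    "application/rdf+xml": ".rdf",
}


def direct_url(iri: str, content_type: str) -> str:
    """Turn an rs.tdwg.org IRI into the URL of one specific serialization."""
    # Drop any trailing slashes, then append the extension for the format.
    return iri.rstrip("/") + SUFFIXES[content_type]


print(direct_url("http://rs.tdwg.org/dwc/terms/recordedBy", "text/turtle"))
print(direct_url("http://rs.tdwg.org/dwc/terms/recordedBy", "application/rdf+xml"))
print(direct_url("http://rs.tdwg.org/tl/doc/v1/", "text/turtle"))
```

These URLs can be pasted into a browser, which is handy when you just want to look at the RDF without fiddling with request headers.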

There are also a number of legacy XML schemas that are still being retrieved by some applications and they are made available by just redirecting from the rs.tdwg.org IRI to wherever the schema lives. Example: http://rs.tdwg.org/dwc/text/tdwg_dwc_text.xsd .

How this happens
The script that handles all of these many variations of IRIs is written in XQuery (a functional programming language designed to process XML) and runs on a BaseX server instance. A second XQuery script generates the vanilla HTML web pages from the same data as the machine-readable metadata. I've written more extensively about this approach in an earlier post, so I won't say more about it here.

There was a lot of concern about maintaining a server that is based on a programming language that is not well known among IT professionals. So it's likely that in the future the XQuery-based system will be replaced by something else. I'd like to use something based on the W3C Generating RDF from Tabular Data on the Web Recommendation, since the source data live as CSV files on GitHub. But for now, this is what we have.

5 Stars???

The title of this post says that TDWG now gets 5 stars. What does that mean? In 2010, Tim Berners-Lee promoted a 5-star system to rate the extent to which data sources are freely available in machine-readable form. The TDWG standards metadata have been available online in structured form under an open license (stars 1 through 3), but failed to achieve 5 stars since standards-based machine-readable data (RDF) couldn't be acquired by dereferencing the IRIs (star 4) and the resources weren't linked to others in the machine-readable metadata (star 5). As of yesterday, we can tick off stars 4 and 5, so the TDWG standards metadata are now fully compliant with Linked Open Data best practices. Congratulations TDWG!

Special thanks to Matt Blissett of GBIF for working out the technical details of setting up the server and production protocol and to Tim Robertson of GBIF for his support in getting this done. Thanks also to Cliff Anderson and the XQuery Working Group of the Vanderbilt University Heard Library for introducing me to BaseX server.




3 comments:

  1. Congratulations on getting this accomplished, and for explaining it so well!

  2. What about actual RDF data (occurrences, taxonomy)?
    Are there browsable graphs and/or SPARQL databases?
    Are there VOID or DCAT RDF descriptions for them?
    Are there tools to convert from GBIF.org CSV to TDWG RDF?

  3. Hi Jean-Marc. At this point, I don't think there are any systematic efforts to represent occurrences and taxonomy as RDF. There are particular places that have datasets accessible via SPARQL or by dereferencing IRIs. Some CETAF (https://www.cetaf.org/) institutions have dereferenceable IRIs with RDF. Plazi (http://plazi.org/) has a lot of taxonomic treatments with Linked Data that are accessible via SPARQL and I think the Pensoft (https://pensoft.net/) journals also produce linked data. The main problem is that the existing TDWG vocabularies do not have a consensus graph model. There is an effort underway to try to develop one, but there hasn't been a lot of progress on it yet.

    I have done some experimenting with converting GBIF CSVs to RDF. You can read about it in some of my earlier blog posts like this one: http://baskauf.blogspot.com/2016/11/guid-o-matic-meets-dwc-rdf-octopus.html
