Tuesday, April 2, 2019

Understanding the TDWG Standards Documentation Specification, Part 3: Machine-readable Metadata Via Content Negotiation

This is the third in a series of posts about the TDWG Standards Documentation Specification (SDS).  For background on the SDS, see the first post.  For information on its hierarchical model and how it relates to IRI design, see the second post.

Note: this post was revised on 2020-03-04 when IRI dereferencing of the http://rs.tdwg.org/ subdomain went from testing into production.


Human- vs. Machine-readable metadata

In the previous posts, I made the point that the SDS considers standards-related resources such as standards, vocabularies, term lists, terms, and documents to be abstract entities (section 2.1).  As such, the IRI assigned to a resource denotes that resource in its abstract form.  That abstract resource does not have one particular representation -- rather it can have multiple representation syntaxes which differ in format, but which in most cases provide equivalent information.

For example, consider the deprecated Darwin Core term dwccuratorial:Disposition.  It is denoted by the IRI http://rs.tdwg.org/dwc/curatorial/Disposition.  The metadata for this term in human-readable form looks like this:

Term Name: dwccuratorial:Disposition
Label: Disposition
Term IRI: http://rs.tdwg.org/dwc/curatorial/Disposition
Term version IRI: http://rs.tdwg.org/dwc/curatorial/version/Disposition-2007-04-17
Modified: 2009-04-24
Definition: The current disposition of the cataloged item. Examples: "in collection", "missing", "voucher elsewhere", "duplicates elsewhere".
Type: Property
Note: This term is no longer recommended for use.
Is replaced by: http://rs.tdwg.org/dwc/terms/disposition

In the RDF/Turtle machine-readable serialization, the metadata looks like this (namespace abbreviations omitted):

<http://rs.tdwg.org/dwc/curatorial/Disposition>
     rdfs:isDefinedBy <http://rs.tdwg.org/dwc/curatorial/>;
     dcterms:isPartOf <http://rs.tdwg.org/dwc/curatorial/>;
     dcterms:created "2007-04-17"^^xsd:date;
     dcterms:modified "2009-04-24"^^xsd:date;
     owl:deprecated "true"^^xsd:boolean;
     rdfs:label "Disposition"@en;
     skos:prefLabel "Disposition"@en;
     rdfs:comment "The current disposition of the cataloged item. Examples: \"in collection\", \"missing\", \"voucher elsewhere\", \"duplicates elsewhere\"."@en;
     skos:definition "The current disposition of the cataloged item. Examples: \"in collection\", \"missing\", \"voucher elsewhere\", \"duplicates elsewhere\"."@en;
     rdf:type <http://www.w3.org/1999/02/22-rdf-syntax-ns#Property>;
     tdwgutility:abcdEquivalence "DataSets/DataSet/Units/Unit/SpecimenUnit/Disposition";
     dcterms:hasVersion <http://rs.tdwg.org/dwc/curatorial/version/Disposition-2007-04-17>;
     dcterms:isReplacedBy <http://rs.tdwg.org/dwc/terms/disposition>.

In RDF/XML machine-readable form, the metadata looks like this (namespace abbreviations omitted):

<rdf:Description rdf:about="http://rs.tdwg.org/dwc/curatorial/Disposition">
     <rdfs:isDefinedBy rdf:resource="http://rs.tdwg.org/dwc/curatorial/"/>
     <dcterms:isPartOf rdf:resource="http://rs.tdwg.org/dwc/curatorial/"/>
     <dcterms:created rdf:datatype="http://www.w3.org/2001/XMLSchema#date">2007-04-17</dcterms:created>
     <dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#date">2009-04-24</dcterms:modified>
     <owl:deprecated rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean">true</owl:deprecated>
     <rdfs:label xml:lang="en">Disposition</rdfs:label>
     <skos:prefLabel xml:lang="en">Disposition</skos:prefLabel>
     <rdfs:comment xml:lang="en">The current disposition of the cataloged item. Examples: "in collection", "missing", "voucher elsewhere", "duplicates elsewhere".</rdfs:comment>
     <skos:definition xml:lang="en">The current disposition of the cataloged item. Examples: "in collection", "missing", "voucher elsewhere", "duplicates elsewhere".</skos:definition>
     <rdf:type rdf:resource="http://www.w3.org/1999/02/22-rdf-syntax-ns#Property"/>
     <tdwgutility:abcdEquivalence>DataSets/DataSet/Units/Unit/SpecimenUnit/Disposition</tdwgutility:abcdEquivalence>
     <dcterms:hasVersion rdf:resource="http://rs.tdwg.org/dwc/curatorial/version/Disposition-2007-04-17"/>
     <dcterms:isReplacedBy rdf:resource="http://rs.tdwg.org/dwc/terms/disposition"/>
</rdf:Description>

For brevity, I'll omit the JSON-LD serialization.  If you make a careful comparison of the two machine-readable serializations shown here, you'll see that they contain exactly the same information.  

The SDS requires that a machine consuming any one machine-readable serialization acquire the same information that it would acquire from any other serialization (section 4).  For most resources (terms, vocabularies, etc.), the human-readable representation contains the same information as the machine-readable serializations for all of the key properties required by the SDS, although some properties that aren't required, such as tdwgutility:abcdEquivalence, are omitted.  The exception to this is standards-related documents -- the human-readable representation is the document itself, while the machine-readable representations are metadata about the document.  (In contrast, machine-readable metadata about vocabularies, term lists, and terms contain virtually complete data about the resource.)

Distinguishing between resources and the documents that describe them

Section 4.1 of the SDS requires that machine-readable documents have IRIs that are different from the IRIs of the abstract resources that they describe.  Although at first it may not be apparent why this is important, we can see why if we consider the case of some of the older TDWG standards documents.  For instance, the document Floristic Regions of the World (denoted by the IRI http://rs.tdwg.org/frw/doc/book/) by A. L. Takhtahan was adopted as part of TDWG standard http://www.tdwg.org/standards/104. It is copyrighted by the University of California Press and is not available under an open license.  However, the metadata about the book in RDF/Turtle serialization (denoted by the IRI http://rs.tdwg.org/frw/doc/book.ttl) is freely available.  So we could make the statement

http://rs.tdwg.org/frw/doc/book.ttl dcterms:license https://creativecommons.org/publicdomain/zero/1.0/ .

but it would NOT be accurate to make the statement 

http://rs.tdwg.org/frw/doc/book/ dcterms:license https://creativecommons.org/publicdomain/zero/1.0/ .

because the book isn't licensed as CC0. Similarly, it would be correct to say:

http://rs.tdwg.org/frw/doc/book/ dc:creator "A. L. Takhtahan" .

but not:

http://rs.tdwg.org/frw/doc/book/ dc:creator "Biodiversity Information Standards (TDWG)" .

because TDWG did not create the book.  On the other hand, saying:

http://rs.tdwg.org/frw/doc/book.ttl dc:creator "Biodiversity Information Standards (TDWG)" .

would be correct, since TDWG did create the RDF/Turtle metadata document that describes the book.

Although in human-readable documents we tend to be fuzzy about the distinction between resources and the metadata about those resources, when we create machine-readable metadata representations we need to be careful to distinguish between the two.

The SDS prescribes a way to link metadata documents and the resources they are about: dcterms:references and dcterms:isReferencedBy (section 4.1).  In the example above, we can say:

http://rs.tdwg.org/frw/doc/book.ttl dcterms:references http://rs.tdwg.org/frw/doc/book/ .

and

http://rs.tdwg.org/frw/doc/book/ dcterms:isReferencedBy http://rs.tdwg.org/frw/doc/book.ttl .
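
To see why the separate IRIs matter to software, here is a minimal Python sketch that stores the statements above as plain (subject, predicate, object) tuples.  It is an illustration only: the variable names and the objects() helper are made up for this example, and a real application would use an RDF library rather than tuples of strings.

```python
# Illustrative only: statements held as (subject, predicate, object) tuples.
BOOK = "http://rs.tdwg.org/frw/doc/book/"         # the abstract resource (the book)
BOOK_TTL = "http://rs.tdwg.org/frw/doc/book.ttl"  # the Turtle document about the book
CC0 = "https://creativecommons.org/publicdomain/zero/1.0/"

triples = {
    (BOOK, "dc:creator", "A. L. Takhtahan"),
    (BOOK_TTL, "dc:creator", "Biodiversity Information Standards (TDWG)"),
    (BOOK_TTL, "dcterms:license", CC0),
    (BOOK_TTL, "dcterms:references", BOOK),
    (BOOK, "dcterms:isReferencedBy", BOOK_TTL),
}

def objects(subject, predicate):
    """Return every object asserted for the given subject and predicate."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)

print(objects(BOOK_TTL, "dcterms:license"))  # the metadata document is CC0
print(objects(BOOK, "dcterms:license"))      # no license is asserted for the book
```

Because the book and the metadata document have different subject IRIs, a query about the book's license cannot accidentally pick up the CC0 statement that applies only to the metadata.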

Content negotiation

As I explained in the second post of this series, IRIs are fundamentally identifiers.  There is no requirement that an IRI actually dereference to retrieve a web page or any other kind of document, although if it did, that would be nice, since that's the kind of behavior that people expect, particularly if the IRI begins with "http://" or "https://".  If you think about it, defining TDWG IRIs to denote an abstract conceptual thing is a bit of a problem, because only non-abstract files can actually be returned to a user from a server through the Internet.  You can't retrieve an abstract thing like the emotion "love" or the concept "justice" through the Internet, although you could certainly mint IRIs to denote those kinds of things.

The standard practice when an IRI denotes a resource that is a physical object or abstract idea is to redirect the user to a document that is about the object or idea.  Such a document containing descriptive metadata about the resource is called a representation of the resource.  Users can specify what kind of document (human- or machine-readable) they want, and more specifically, the serialization that they want if they are asking for a machine-readable document.  This process is called content negotiation.

Recommendation 7 of the TDWG Globally Unique Identifier (GUID) Applicability Statement standard calls for permanent identifiers to remain resolvable indefinitely, although it does not go into the details of how that resolution should happen.  Sections 2.1.1 and 2.1.2 of the SDS expand on the GUID AS by saying that the abstract IRI should be stable and generic, and that content negotiation should redirect the user to a content-type-specific IRI that serves as a URL from which the document can be retrieved in the content type the user wanted.  That requirement is based on the widespread practice in the Linked Data community as expressed in the 2008 W3C Note "Cool URIs for the Semantic Web".

The SDS does not specify a particular way that this redirection should be accomplished, but given that it's desirable to support as many different serializations as possible, I chose to implement the "303 URIs forwarding to Different Documents" recipe described in the Cool URIs document.  Here are the specific details:

1. Client software performs an HTTP GET request for the abstract IRI of the resource and includes an Accept header that specifies the content type that it wants.

2. The server responds with an HTTP status code of 303 and includes the URL for the specific content type requested.  To construct the redirect URL, any abstract IRIs with trailing slashes first have the trailing slash removed. If text/html is requested (i.e. human-readable web page), .htm is appended to the IRI to form the redirect URL.  If text/turtle is requested, .ttl is appended.  If application/rdf+xml is requested, .rdf is appended.  If application/ld+json is requested, .json is appended.

3. The client then requests the specific redirect URL and the server returns the appropriate document in the serialization requested.  In this stage, the Accept header is ignored by the server.  In the case of standards documents and current terms in Darwin and Audubon Cores, there typically will be an additional redirect to a web page that isn't generated programmatically by the rs.tdwg.org server and might be located anywhere.
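
The URL construction rule in step 2 can be sketched as a small Python function.  This only illustrates the rule as described above and is not the code that the rs.tdwg.org server runs; the names EXTENSIONS and redirect_url are hypothetical.  It also assumes that a single media type has already been picked out of the Accept header, which in real requests may list several types with weights.

```python
# Illustrative sketch of the redirect rule in step 2; not the server's actual code.
# The names EXTENSIONS and redirect_url are hypothetical.
EXTENSIONS = {
    "text/html": ".htm",
    "text/turtle": ".ttl",
    "application/rdf+xml": ".rdf",
    "application/ld+json": ".json",
}

def redirect_url(abstract_iri, media_type):
    """Return the content-specific URL for an abstract IRI and one media type."""
    # Remove a trailing slash, if present, before appending the extension.
    base = abstract_iri[:-1] if abstract_iri.endswith("/") else abstract_iri
    # An unsupported media type raises KeyError in this sketch.
    return base + EXTENSIONS[media_type]

print(redirect_url("http://rs.tdwg.org/dwc/", "text/turtle"))
# prints http://rs.tdwg.org/dwc.ttl
print(redirect_url("http://rs.tdwg.org/dwc/curatorial/Disposition", "application/rdf+xml"))
# prints http://rs.tdwg.org/dwc/curatorial/Disposition.rdf
```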

We can test the behavior using curl or a graphical HTTP client like Postman.  Here is an example using Postman (with automatic following of redirects turned off):

1. Client requests metadata about the basic Darwin Core vocabulary by HTTP GET to the generic IRI: http://rs.tdwg.org/dwc/ with an Accept header of text/turtle.



2. The server responds with a 303 (see other) code and redirects to http://rs.tdwg.org/dwc.ttl

3. The client sends another GET request to http://rs.tdwg.org/dwc.ttl, this time without any Accept header.



4. The server responds with a 200 (success) code and a Content-Type response header of text/turtle.  The response body is the document serialized as RDF/Turtle.


This illustration was done "manually" using Postman, but it is relatively simple to use any typical programming language (such as Javascript or Python) to perform HTTP calls with appropriate Accept headers.[1]  So enabling IRI dereferencing with content negotiation really starts to open up TDWG standards to machine readability.
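
The four Postman steps can also be reproduced one request at a time in Python.  The sketch below uses only the standard library; http.client never follows redirects on its own, so the 303 response stays visible instead of being handled silently.  The helper name get_once is made up for this example, and the sketch assumes that the Location header contains an absolute URL and that the IRI has no query string.

```python
# Sketch only: one HTTP GET at a time, with no automatic redirect following.
# The helper name get_once is hypothetical.
import http.client
from urllib.parse import urlsplit

def get_once(url, accept=None):
    """Send a single GET and return (status, Location, Content-Type, body)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=30)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=30)
    try:
        # Send an Accept header only when the caller supplies one.
        headers = {"Accept": accept} if accept else {}
        conn.request("GET", parts.path or "/", headers=headers)
        response = conn.getresponse()
        body = response.read().decode("utf-8", errors="replace")
        return (response.status, response.getheader("Location"),
                response.getheader("Content-Type"), body)
    finally:
        conn.close()

# Usage (needs network access, so it is left commented out):
# Steps 1-2: ask the generic IRI for Turtle; expect a 303 and a Location header.
# status, location, _, _ = get_once("http://rs.tdwg.org/dwc/", "text/turtle")
# Steps 3-4: request the redirect URL with no Accept header; expect a 200.
# status, _, content_type, body = get_once(location)
```

For everyday use it is simpler to let the HTTP library follow the redirect for you; splitting the exchange in two is mainly useful for checking that a server implements the 303 recipe as intended.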

One feature of this implementation method is that it allows a human user to examine a representation in any serialization using a browser, just by hacking the abstract IRI using the rules in step 2.  Thus, if you want to see what the RDF/XML serialization looks like for the basic Darwin Core vocabulary, you can put the URL http://rs.tdwg.org/dwc.rdf into a browser.  The browser will send an Accept header of text/html, but since the URL contains an extension for a specific file type, the server will ignore the Accept header and send RDF/XML anyway.  (Depending on how the browser is set up to handle file types, it may display the retrieved file in the browser window, or may initiate a download of the file into the user's Downloads directory.)

Important note: currently (as of April 2019), there is an error in the algorithm that generates the JSON-LD that causes repeated properties to be serialized incorrectly.  The JSON that is returned validates as JSON-LD, but when the document is interpreted, some instances of the repeated properties are ignored.  So application designers should at this point plan to consume either RDF/XML or RDF/Turtle until this error is corrected.

Why does this matter?

There are three reasons why implementation of dereferencing TDWG standards-related IRIs through content negotiation is important.

1. The least important reason is probably the one that is given as a core rationale in the Linked Data world: when someone "looks up" a URI, they get useful information and can discover more things through the links in the metadata.  In theory, one could "discover" any resource related to TDWG standards, scrape the machine-readable metadata about that resource, dereference other resources that are linked to the first one, scrape those resources' metadata and follow their links, etc. until everything that there is to be known about TDWG standards has been discovered.  Essentially, we could have an analog of the Google web scraper that scrapes machine-readable documents instead of web pages.  This could be done, but it would require many HTTP calls and would be a very inefficient way to keep up-to-date on TDWG standards.  There is a much better way, and I'll discuss it in the next post.

2. Probably the most important reason is that implementing real permanent IRIs for TDWG vocabularies and documents puts a stop to the continual breaking of links and browser bookmarks that happens every time documents get moved to a new website, get changed from HTML to markdown, etc.  If we stress that the permanent IRIs are what should be bookmarked and cited, we can always set up the server to redirect to the URL of the day where the document or information actually lives.  Since the permanent IRIs are "cool" and don't include implementation-specific aspects like ".php" or "?pid=123&lan=en", we can change the way we actually generate and serve the data at will without ever "breaking" any links.  This is really critical if we want people to be able to cite IRIs for TDWG standards components in journal articles with those IRIs continuing to dereference indefinitely.

3. The third reason is more philosophical.  By having IRIs that dereference to human- and machine-readable metadata, we demonstrate that these are "real" IRIs that exhibit the behavior expected from "grown-up" organizations in the Linked Data world in particular, and the web in general.  We show that TDWG is not some fly-by-night organization that creates identifiers one day and abandons them the next.  The Internet is littered with the wreckage of vocabularies and ontologies from organizations that minted terms but stopped paying for their domain name, or couldn't keep their servers running.  Having properly dereferencing, permanent IRIs marks TDWG as a real standards organization that can run with the big dogs like Dublin Core and the W3C.  (We also get 5 stars!)

In my next post I'll talk about retrieving SDS-specified machine readable standards metadata en masse.

[1] Sample Python 3 code for dereferencing a term IRI

Note: you may need to use pip to install the requests module if you don't already have it.

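import requests

iri = 'http://rs.tdwg.org/ac/terms/caption'
accept = 'text/turtle'  # media type of the serialization to request
# requests follows the 303 redirect automatically
r = requests.get(iri, headers={'Accept': accept})
r.raise_for_status()  # raise an exception on a 4xx or 5xx response
print(r.text)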

