Wednesday, April 24, 2019

Understanding the TDWG Standards Documentation Specification, Part 5: Acquiring Machine-readable Metadata using DCAT

This is the fifth in a series of posts about the TDWG Standards Documentation Specification (SDS).  For background on the SDS, see the first post.  For information on the SDS hierarchical model and how it relates to IRI design, see the second post.  For information about how TDWG standards metadata can be retrieved via IRI dereferencing, see the third post.  For information about accessing TDWG standards metadata via a SPARQL API, see the fourth post.

Note: this post was revised on 2020-03-04 when IRI dereferencing of the http://rs.tdwg.org/ subdomain went from testing into production.

Acquiring the machine-readable TDWG standards metadata using a data dump organized according to the W3C Data Catalog (DCAT) Vocabulary Recommendation.


Not-so-great methods of getting a dump of all of the machine-readable metadata

In the last two posts of this series, I showed two different ways that you could acquire machine-readable metadata about TDWG Standards and their components.

In the third post, I explained how the implementation of the Standards Documentation Specification (SDS) could allow a machine (i.e. computer software) to use the classic Linked Open Data (LOD) method of "following its nose": essentially scraping the standards metadata by discovering linked IRIs, then following those links to retrieve metadata about the linked components.  There are two problems with this approach.  One is that it's very inefficient: multiple HTTP calls are required to acquire the metadata about a single resource, and there are thousands of resources that would need to be scraped.  A more serious problem is that some current or past terms of Darwin Core and Audubon Core are not dereferenceable.  For example, the International Press Telecommunications Council (IPTC) terms borrowed by Audubon Core are defined in a PDF document and don't dereference.  There are many ancient Darwin Core terms in namespaces outside the rs.tdwg.org subdomain that don't even bring up a web page, let alone machine-readable metadata.  And the "permanent URLs" of the standards themselves (e.g. http://www.tdwg.org/standards/116) do not use content negotiation to return machine-readable metadata (although they might at some future point).  So there are many items of interest whose machine-readable metadata simply cannot be discovered by this means, since the linked IRIs can't be dereferenced with a request for machine-readable metadata.
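To make the inefficiency concrete, here is a minimal Python sketch (assuming the requests library) of what a single "follow your nose" retrieval looks like.  The particular term IRI is just an illustrative example; a scraper would have to repeat this kind of call for every linked resource it discovered:

import requests

# An illustrative Darwin Core term IRI; any dereferenceable term IRI would work the same way.
term_iri = "http://rs.tdwg.org/dwc/terms/basisOfRecord"

# Ask for machine-readable metadata via content negotiation.
response = requests.get(term_iri, headers={"Accept": "text/turtle"})

print(response.url)   # the representation-specific URL reached after redirection
print(response.text)  # Turtle metadata about this one term only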

In the fourth post, I described how the SPARQL query language could be used to get all of the triples in the TDWG Standards dataset.  The query to do so was really simple:

CONSTRUCT {?s ?p ?o}
FROM <http://rs.tdwg.org/>
WHERE {?s ?p ?o}

and by requesting the appropriate content type (XML, Turtle, or JSON-LD) via an Accept header, a single HTTP call would retrieve all of the metadata at once.  If all goes well, this is a simple and effective method.  However, it depends critically on two things: there has to be a SPARQL endpoint that is functioning and publicly accessible, and the metadata in the underlying triplestore must be up to date.  At the moment, both of those things are true of the Vanderbilt Library SPARQL endpoint (https://sparql.vanderbilt.edu/sparql), but there is no guarantee that they will remain true indefinitely.  There is no reason why there cannot be multiple SPARQL endpoints where the data are available, and TDWG itself could run its own, but currently there are no plans for that to happen, so for now we are stuck with depending on the Vanderbilt endpoint.
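For example, here is a minimal Python sketch of making that single call, assuming the requests library and assuming the endpoint follows the standard SPARQL protocol in which the query is passed as a "query" parameter:

import requests

endpoint = "https://sparql.vanderbilt.edu/sparql"
query = """CONSTRUCT {?s ?p ?o}
FROM <http://rs.tdwg.org/>
WHERE {?s ?p ?o}"""

# Request the Turtle serialization of the constructed graph.
response = requests.get(endpoint,
                        params={"query": query},
                        headers={"Accept": "text/turtle"})

# Save the entire dump to a local file.
with open("tdwg-dump.ttl", "w", encoding="utf-8") as file:
    file.write(response.text)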

Getting a machine-readable data dump from TDWG itself


I'm now going to tell you about the best way to acquire authoritative machine-readable metadata from the rs.tdwg.org implementation itself.  But first we need to talk about the W3C Data Catalog (DCAT) recommendation, which is used to organize the data dump.  The SDS does not mention the DCAT recommendation, but since DCAT is an international standard, it is the logical choice to be used for describing the TDWG standards datasets.


Data Catalog Vocabulary (DCAT)

In 2014, the W3C ratified the DCAT vocabulary as a Recommendation (the W3C term for its ratified standards).  DCAT is a vocabulary for describing datasets of any form.  The described datasets can be machine-readable, but do not have to be, and could include non-machine-readable forms like spreadsheets.  The description of the datasets is in RDF, although the Recommendation is agnostic about the serialization.  

There are three classes of resources described by the DCAT vocabulary.  A data catalog is the resource that describes datasets; its type is dcat:Catalog (http://www.w3.org/ns/dcat#Catalog).  The datasets described in the catalog are assigned the type dcat:Dataset, which is a subclass of dctype:Dataset (http://purl.org/dc/dcmitype/Dataset).  The third class of resources, distributions, is described as "an accessible form of a dataset" and can include downloadable files or web services.  Distributions are assigned the type dcat:Distribution (http://www.w3.org/ns/dcat#Distribution).  The hierarchical relationship among these classes of resources is shown in the following diagram.


An important thing to notice is that the DCAT vocabulary defines several terms whose IRIs are very similar: dcat:dataset and dcat:Dataset, and dcat:distribution and dcat:Distribution.  The only thing that differs between the pairs of terms is whether the local name is capitalized or not.  Those with capitalized local names denote classes and those that begin with lower case denote object properties.

Organization of TDWG data according to the DCAT data model

I assigned the IRI http://rs.tdwg.org/index to denote the TDWG standards metadata catalog.  The local name "index" is descriptive of a catalog, and the IRI has the added benefit of supporting a typical web behavior: when a base subdomain like http://rs.tdwg.org/ is dereferenced, it is typical for that form of IRI to dereference to a "homepage" having the IRI http://rs.tdwg.org/index.htm, and http://rs.tdwg.org/index.htm does indeed redirect to a "homepage" of sorts: the README.md page for the rs.tdwg.org GitHub repo where the authoritative metadata tables live.  You can try this yourself by putting either http://rs.tdwg.org/ or http://rs.tdwg.org/index.htm into a browser URL bar and seeing what happens.  However, making an HTTP call to either of these IRIs with an Accept header for machine-readable RDF (text/turtle or application/rdf+xml) will redirect to a representation-specific IRI like http://rs.tdwg.org/index.ttl or http://rs.tdwg.org/index.rdf, as you'd expect in the Linked Data world.
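Here is a minimal Python sketch (again assuming the requests library) of that machine-oriented dereferencing:

import requests

# Dereference the catalog IRI, asking for Turtle via content negotiation.
response = requests.get("http://rs.tdwg.org/index",
                        headers={"Accept": "text/turtle"})

print(response.url)         # the representation-specific URL, e.g. ending in index.ttl
print(response.text[:500])  # the beginning of the catalog metadata in Turtle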

The data catalog denoted by http://rs.tdwg.org/index describes the data located in the GitHub repository https://github.com/tdwg/rs.tdwg.org.  Those data are organized into a number of directories, with each directory containing all of the information required to map metadata-containing CSV files to machine-readable RDF.  From the standpoint of DCAT, we can consider the information in each directory as a dataset.  There is no philosophical reason why we should organize the datasets that way.  Rather, it is based on practicality, since the server that dereferences TDWG IRIs can generate a data dump for each directory via a dump URL.  See this file for a complete list of the datasets.

Each of the abstract datasets can be accessed through one of several distributions.  Currently, the RDF metadata about the TDWG data says that there are three distributions for each of the datasets: one in RDF/XML, one in RDF/Turtle, and one in JSON-LD (with the JSON-LD having a problem I mentioned in the third post).  The IANA media type for each distribution is given as the value of a dcat:mediaType property (see the diagram above for an example).

One thing that is a bit different from what one might consider the traditional Linked Data approach is that the distributions are not really considered representations of the datasets.  That is, under the DCAT model, one does not necessarily expect to be redirected to the distribution IRI from dereferencing of the dataset IRI through content negotiation.  That's because content negotiation generally results in direct retrieval of some human- or machine-readable serialization, but in the DCAT model, the distribution itself is a separate, abstract entity apart from the serialization.  The serialization itself is connected via a dcat:downloadURL property of the distribution (see the diagram above).  I'm not sure why the DCAT model adds this extra layer, but I think it is probably so that a permanent IRI can be assigned to the distribution, while the download URL can be a mutable thing that can change over time, yet still be discovered through its link to the distribution.

At the moment, the dataset IRIs don't dereference, although that could be changed in the future if need be.  Despite that, their metadata are exposed when the data catalog IRI itself is dereferenced, so a machine could learn all it needed to know about them with a single HTTP call to the catalog IRI.

In the case of the TDWG data, I didn't actually mint IRIs for the distributions, since it's not that likely that anyone would ever need to address them directly and I wasn't interested in maintaining another set of identifiers.  So they are represented by blank (anonymous) nodes in the dataset.  The download URLs can be determined from the dataset IRI by rules, so there's no need to maintain a record of them, either.
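As a sketch of what such a rule looks like (based on the pattern visible in the example below and the dump URLs described later in this post), the download URL can be computed from the dataset IRI like this:

def download_url(dataset_iri, extension="ttl"):
    """Construct a dump download URL from a dataset IRI, e.g.
    http://rs.tdwg.org/index/audubon -> http://rs.tdwg.org/dump/audubon.ttl"""
    local_name = dataset_iri.rsplit("/", 1)[-1]
    return "http://rs.tdwg.org/dump/" + local_name + "." + extension

print(download_url("http://rs.tdwg.org/index/audubon"))         # Turtle
print(download_url("http://rs.tdwg.org/index/audubon", "rdf"))  # RDF/XML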

Here is an abbreviated bit of the Turtle that you get if you dereference the catalog IRI http://rs.tdwg.org/index and request text/turtle (or just retrieve http://rs.tdwg.org/index.ttl):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
@prefix dc: <http://purl.org/dc/elements/1.1/>.
@prefix dcterms: <http://purl.org/dc/terms/>.
@prefix dcat: <http://www.w3.org/ns/dcat#>.
@prefix dcmitype: <http://purl.org/dc/dcmitype/>.

<http://rs.tdwg.org/index>
     dc:publisher "Biodiversity Information Standards (TDWG)"@en;
     dcterms:publisher <https://www.grid.ac/institutes/grid.480498.9>;
     dcterms:license <http://creativecommons.org/licenses/by/4.0/>;
     dcterms:modified "2018-10-09"^^xsd:date;
     rdfs:label "TDWG dataset catalog"@en;
     rdfs:comment "This dataset contains the data that underlies TDWG standards and standards documents"@en;
     dcat:dataset <http://rs.tdwg.org/index/audubon>;
     a dcat:Catalog.

<http://rs.tdwg.org/index/audubon>
     dcterms:modified "2018-10-09"^^xsd:date;
     rdfs:label "Audubon Core-defined terms"@en;
     dcat:distribution _:53c07f45-4561-448b-9bb9-396e47d3ad1d;
     a dcmitype:Dataset.

_:53c07f45-4561-448b-9bb9-396e47d3ad1d
     dcat:mediaType <https://www.iana.org/assignments/media-types/application/rdf+xml>;
     dcterms:license <https://creativecommons.org/publicdomain/zero/1.0/>;
     dcat:downloadURL <http://rs.tdwg.org/dump/audubon.rdf>;
     a dcat:Distribution.

In this Turtle, you can see the DCAT-based structure as described above.

Returning to a comment that I made earlier, DCAT can describe data in any form and it's not restricted to RDF.  So in theory, one could consider each dataset to have a distribution that is in CSV format, and use the GitHub raw URL for the CSV file as the download URL of that distribution.  I haven't done that because complete information about the dataset requires the combination of the raw CSV file with a property mapping table and I don't know how to represent that complexity in DCAT.  But at least in theory it could be done.  One can also indicate that a distribution of the dataset is available from an API such as a SPARQL endpoint, which I also have not done because the datasets aren't compartmentalized into named graphs and therefore can't really be distinguished from each other.  But again, in theory it could be done.

Getting a dump of all of the data

At the start of this post, I complained that there were potential issues with the first two methods that I described for retrieving all of the TDWG standards metadata.  I promised a better way, so here it is!

In theory, a client could start with the catalog IRI (http://rs.tdwg.org/index), dereference it requesting the machine-readable serialization flavor of its choice, and follow the links to the download URLs of all 50 of the datasets currently in the catalog.  That would be in the LOD style and would require far fewer HTTP calls than the thousands that would be needed to scrape all of the machine-readable data one standards-related resource at a time.
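Here is a minimal sketch of that LOD-style traversal, assuming the Python rdflib library:

from rdflib import Graph, URIRef, Namespace

DCAT = Namespace("http://www.w3.org/ns/dcat#")
catalog_iri = URIRef("http://rs.tdwg.org/index")

# Parse the Turtle representation of the catalog.
catalog = Graph()
catalog.parse("http://rs.tdwg.org/index.ttl", format="turtle")

# Follow the catalog -> dataset -> distribution -> downloadURL links.
for dataset in catalog.objects(catalog_iri, DCAT.dataset):
    for distribution in catalog.objects(dataset, DCAT.distribution):
        for url in catalog.objects(distribution, DCAT.downloadURL):
            print(dataset, url)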

However, here is a quick and dirty way that doesn't require using any Linked Data technology (a Python sketch of these steps follows the list):
  • use a script of your favorite programming language to load the raw file for the datasets CSV table on GitHub
  • get the dataset name from the second ("term_localName") column (e.g. audubon)
  • prepend http://rs.tdwg.org/dump/ to the name (e.g. http://rs.tdwg.org/dump/audubon)
  • append the appropriate file extension for the serialization you want (.ttl for Turtle, .rdf for XML) to the URL from the previous step (e.g. http://rs.tdwg.org/dump/audubon.ttl)
  • make an HTTP GET call to that URL to acquire the machine-readable serialization for that dataset.  
  • Repeat for the other 49 data rows in the table.
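Here is a minimal Python sketch of those steps, assuming the requests library.  The raw GitHub URL for the datasets CSV table is my guess at its location, so substitute the actual raw URL of that file:

import csv
import io
import requests

# Assumed raw URL of the datasets CSV table in the rs.tdwg.org repo; adjust as needed.
csv_url = "https://raw.githubusercontent.com/tdwg/rs.tdwg.org/master/datasets/datasets.csv"
rows = csv.DictReader(io.StringIO(requests.get(csv_url).text))

for row in rows:
    name = row["term_localName"]                           # the dataset name, e.g. "audubon"
    dump_url = "http://rs.tdwg.org/dump/" + name + ".ttl"  # Turtle serialization
    turtle = requests.get(dump_url).text
    with open(name + ".ttl", "w", encoding="utf-8") as file:
        file.write(turtle)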

I've actually done something like this in lines 55 to 63 of a Python script on GitHub.  Rather than making a GET request, the script actually uses the constructed URL to create a SPARQL Update command that loads the data directly from the TDWG server into a graph database triplestore (lines 133 and 127) via an HTTP POST request.  But you could use GET to load the data directly into your own software using a library like Python's RDFLib if you preferred to work with it directly rather than through a SPARQL endpoint.
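For illustration, here is a sketch of what such a SPARQL Update LOAD command looks like when sent from Python; the update endpoint URL and graph name below are placeholders, not the ones used in the actual script:

import requests

# Placeholder update endpoint; a real triplestore's SPARQL Update URL would go here.
update_endpoint = "https://example.org/sparql/update"

# Ask the triplestore to pull one dataset dump directly from the TDWG server.
command = "LOAD <http://rs.tdwg.org/dump/audubon.ttl> INTO GRAPH <http://rs.tdwg.org/>"

response = requests.post(update_endpoint,
                         data=command,
                         headers={"Content-Type": "application/sparql-update"})
print(response.status_code)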

The advantage of getting the dump in this way is that it would be coming directly from the authoritative TDWG server (which gets its data from the CSVs in the rs.tdwg.org repo of the TDWG GitHub site).  You would then be guaranteed to have the most up-to-date version of the data, something that would not necessarily happen if you got the data from somebody else's SPARQL endpoint.

In the future, this method will be important because it is the best way to build reliable applications that make use of standards metadata.  For many standards and for the "regular" TDWG vocabularies that conform to the SDS (Darwin Core and Audubon Core), retrieving up-to-date metadata probably isn't that critical, because those standards don't change very quickly.  However, in the case of controlled vocabularies, access to up-to-date data may be more important.
