Wednesday, April 24, 2019

Understanding the TDWG Standards Documentation Specification, Part 5: Acquiring Machine-readable Metadata using DCAT

This is the fifth in a series of posts about the TDWG Standards Documentation Specification (SDS).  For background on the SDS, see the first post.  For information on the SDS hierarchical model and how it relates to IRI design, see the second post.  For information about how TDWG standards metadata can be retrieved via IRI dereferencing, see the third post.  For information about accessing TDWG standards metadata via a SPARQL API, see the fourth post.

Note: this post was revised on 2020-03-04 when IRI dereferencing of the http://rs.tdwg.org/ subdomain went from testing into production.

Acquiring machine-readable TDWG standards metadata, organized according to the W3C Data Catalog (DCAT) Vocabulary Recommendation.


Not-so-great methods of getting a dump of all of the machine-readable metadata

In the last two posts of this series, I showed two different ways that you could acquire machine-readable metadata about TDWG Standards and their components.

In the third post, I explained how the implementation of the Standards Documentation Specification (SDS) could allow a machine (i.e. computer software) to use the classic Linked Open Data (LOD) method of "following its nose" and essentially scraping the standards metadata by discovering linked IRIs, then following those links to retrieve metadata about the linked components.  There are two problems with this approach.  One is that it's very inefficient: multiple HTTP calls are required to acquire the metadata about a single resource, and there are thousands of resources that would need to be scraped.  A more serious problem is that some current or past terms of Darwin and Audubon Cores are not dereferenceable.  For example, the International Press Telecommunications Council (IPTC) terms that are borrowed by Audubon Core are defined in a PDF document and don't dereference.  There are many ancient Darwin Core terms in namespaces other than the rs.tdwg.org subdomain that don't even bring up a web page, let alone machine-readable metadata.  And the "permanent URLs" of the standards themselves (e.g. http://www.tdwg.org/standards/116) do not use content negotiation to return machine-readable metadata (although they might at some future point).  So there are many items of interest whose machine-readable metadata simply cannot be discovered by this means, because the linked IRIs can't be dereferenced with a request for machine-readable metadata.

In the fourth post, I described how the SPARQL query language could be used to get all of the triples in the TDWG Standards dataset.  The query to do so was really simple:

CONSTRUCT {?s ?p ?o}
FROM <http://rs.tdwg.org/>
WHERE {?s ?p ?o}

and by requesting the appropriate content type (XML, Turtle, or JSON-LD) via an Accept header, a single HTTP call would retrieve all of the metadata at once.  If all goes well, this is a simple and effective method.  However, it depends critically on two things: there has to be a SPARQL endpoint that is functioning and publicly accessible, and the metadata in the endpoint's underlying triplestore must be kept up to date.  At the moment, both of those things are true of the Vanderbilt Library SPARQL endpoint (https://sparql.vanderbilt.edu/sparql), but there is no guarantee that they will continue to be true indefinitely.  There is no reason why there cannot be multiple SPARQL endpoints where the data are available, and TDWG itself could run its own, but currently there are no plans for that to happen, so we are stuck with depending on the Vanderbilt endpoint.
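
To make this concrete, here is a minimal Python 3 sketch (using the requests module, as in the scripts later in this series) of how a client might retrieve the whole dump with that query in a single call.  It assumes nothing beyond the endpoint URL and graph name given above, and error handling is omitted.

import requests

endpointUrl = 'https://sparql.vanderbilt.edu/sparql'
query = '''CONSTRUCT {?s ?p ?o}
FROM <http://rs.tdwg.org/>
WHERE {?s ?p ?o}'''

# a single GET request retrieves the entire graph as RDF/Turtle;
# requests URL-encodes the query parameter automatically
r = requests.get(endpointUrl, headers={'Accept': 'text/turtle'}, params={'query': query})
print(r.status_code, len(r.text), 'characters of Turtle retrieved')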

Getting a machine-readable data dump from TDWG itself


I'm now going to tell you about the best way to acquire authoritative machine-readable metadata from the rs.tdwg.org implementation itself.  But first we need to talk about the W3C Data Catalog (DCAT) recommendation, which is used to organize the data dump.  The SDS does not mention the DCAT recommendation, but since DCAT is an international standard, it is the logical choice to be used for describing the TDWG standards datasets.


Data Catalog Vocabulary (DCAT)

In 2014, the W3C ratified the DCAT vocabulary as a Recommendation (the W3C term for its ratified standards).  DCAT is a vocabulary for describing datasets of any form.  The described datasets can be machine-readable, but do not have to be, and could include non-machine-readable forms like spreadsheets.  The description of the datasets is in RDF, although the Recommendation is agnostic about the serialization.  

There are three classes of resources that are described by the DCAT vocabulary.  A data catalog is the resource that describes datasets.  Its type is dcat:Catalog (http://www.w3.org/ns/dcat#Catalog).  The datasets described in the catalog are assigned the type dcat:Dataset, which is a subclass of dctype:Dataset (http://purl.org/dc/dcmitype/Dataset).  The third class of resources, distributions, are described as "an accessible form of a dataset" and can include downloadable files or web services.  Distributions are assigned the type dcat:Distribution (http://www.w3.org/ns/dcat#Distribution).  The hierarchical relationship among these classes of resources is shown in the following diagram.


An important thing to notice is that the DCAT vocabulary defines several terms whose IRIs are very similar: dcat:dataset and dcat:Dataset, and dcat:distribution and dcat:Distribution.  The only thing that differs between the pairs of terms is whether the local name is capitalized or not.  Those with capitalized local names denote classes and those that begin with lower case denote object properties.

Organization of TDWG data according to the DCAT data model

I assigned the IRI http://rs.tdwg.org/index to denote the TDWG standards metadata catalog.  The local name "index" is descriptive of a catalog, and the IRI has the added benefit of supporting typical web behavior: when a base subdomain like http://rs.tdwg.org/ is dereferenced, it is typical for that form of IRI to resolve to a "homepage" having the IRI http://rs.tdwg.org/index.htm, and http://rs.tdwg.org/index.htm does indeed redirect to a "homepage" of sorts: the README.md page for the rs.tdwg.org GitHub repo where the authoritative metadata tables live.  You can try this yourself by putting either http://rs.tdwg.org/ or http://rs.tdwg.org/index.htm into a browser URL bar and seeing what happens.  However, making an HTTP call to either of these IRIs with an Accept header for machine-readable RDF (text/turtle or application/rdf+xml) will redirect to a representation-specific IRI like http://rs.tdwg.org/index.ttl or http://rs.tdwg.org/index.rdf, as you'd expect in the Linked Data world.
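
If you would rather test the machine-readable behavior programmatically than in a browser, a short Python sketch like the following (using the requests module) should show what happens; requests follows the redirect for you, so the final URL reveals the representation-specific IRI.

import requests

# ask for RDF/Turtle from the generic catalog IRI
r = requests.get('http://rs.tdwg.org/index', headers={'Accept': 'text/turtle'})

print('final URL:', r.url)                         # the representation-specific IRI we were redirected to
print('Content-Type:', r.headers['Content-Type'])
print(r.text[:300])                                # the first few lines of the Turtle document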

The data catalog denoted by http://rs.tdwg.org/index describes the data located in the GitHub repository https://github.com/tdwg/rs.tdwg.org.  Those data are organized into a number of directories, with each directory containing all of the information required to map metadata-containing CSV files to machine-readable RDF.  From the standpoint of DCAT, we can consider the information in each directory as a dataset.  There is no philosophical reason why we should organize the datasets that way.  Rather, it is based on practicality, since the server that dereferences TDWG IRIs can generate a data dump for each directory via a dump URL.  See this file for a complete list of the datasets.

Each of the abstract datasets can be accessed through one of several distributions.  Currently, the RDF metadata about the TDWG data says that there are three distributions for each of the datasets: one in RDF/XML, one in RDF/Turtle, and one in JSON-LD (with the JSON-LD having a problem I mentioned in the third post).  The IANA media type for each distribution is given as the value of a dcat:mediaType property (see the diagram above for an example).

One thing that is a bit different from what one might consider the traditional Linked Data approach is that the distributions are not really considered representations of the datasets.  That is, under the DCAT model, one does not necessarily expect to be redirected to the distribution IRI from dereferencing of the dataset IRI through content negotiation.  That's because content negotiation generally results in direct retrieval of some human- or machine-readable serialization, but in the DCAT model, the distribution itself is a separate, abstract entity apart from the serialization.  The serialization itself is connected via a dcat:downloadURL property of the distribution (see the diagram above).  I'm not sure why the DCAT model adds this extra layer, but I think it is probably so that a permanent IRI can be assigned to the distribution, while the download URL can be a mutable thing that can change over time, yet still be discovered through its link to the distribution.

At the moment, the dataset IRIs don't dereference, although that could be changed in the future if need be.  Despite that, their metadata are exposed when the data catalog IRI itself is dereferenced, so a machine could learn all it needed to know about them with a single HTTP call to the catalog IRI.

In the case of the TDWG data, I didn't actually mint IRIs for the distributions, since it's not that likely that anyone would ever need to address them directly and I wasn't interested in maintaining another set of identifiers.  So they are represented by blank (anonymous) nodes in the dataset.  The download URLs can be determined from the dataset URI by rules, so there's no need to maintain a record of them, either.

Here is an abbreviated bit of the Turtle that you get if you dereference the catalog IRI http://rs.tdwg.org/index and request text/turtle (or just retrieve http://rs.tdwg.org/index.ttl):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
@prefix dc: <http://purl.org/dc/elements/1.1/>.
@prefix dcterms: <http://purl.org/dc/terms/>.
@prefix dcat: <http://www.w3.org/ns/dcat#>.
@prefix dcmitype: <http://purl.org/dc/dcmitype/>.

<http://rs.tdwg.org/index>
     dc:publisher "Biodiversity Information Standards (TDWG)"@en;
     dcterms:publisher <https://www.grid.ac/institutes/grid.480498.9>;
     dcterms:license <http://creativecommons.org/licenses/by/4.0/>;
     dcterms:modified "2018-10-09"^^xsd:date;
     rdfs:label "TDWG dataset catalog"@en;
     rdfs:comment "This dataset contains the data that underlies TDWG standards and standards documents"@en;
     dcat:dataset <http://rs.tdwg.org/index/audubon>;
     a dcat:Catalog.

<http://rs.tdwg.org/index/audubon>
     dcterms:modified "2018-10-09"^^xsd:date;
     rdfs:label "Audubon Core-defined terms"@en;
     dcat:distribution _:53c07f45-4561-448b-9bb9-396e47d3ad1d;
     a dcmitype:Dataset.

_:53c07f45-4561-448b-9bb9-396e47d3ad1d
     dcat:mediaType <https://www.iana.org/assignments/media-types/application/rdf+xml>;
     dcterms:license <https://creativecommons.org/publicdomain/zero/1.0/>;
     dcat:downloadURL <http://rs.tdwg.org/dump/audubon.rdf>;
     a dcat:Distribution.

In this Turtle, you can see the DCAT-based structure as described above.
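
As a rough sketch of how a client might use this structure, the following Python snippet uses the rdflib library (not part of the standard library, so it must be installed separately) to load the catalog and list the media type and download URL of each distribution.  The property IRIs are the ones shown in the Turtle above; everything else is just illustrative.

from rdflib import Graph, Namespace, URIRef

DCAT = Namespace('http://www.w3.org/ns/dcat#')

g = Graph()
g.parse('http://rs.tdwg.org/index.ttl', format='turtle')  # the Turtle representation of the catalog

catalog = URIRef('http://rs.tdwg.org/index')
for dataset in g.objects(catalog, DCAT.dataset):
    for distribution in g.objects(dataset, DCAT.distribution):
        mediaType = g.value(distribution, DCAT.mediaType)
        downloadUrl = g.value(distribution, DCAT.downloadURL)
        print(dataset, mediaType, downloadUrl)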

Returning to a comment that I made earlier, DCAT can describe data in any form and it's not restricted to RDF.  So in theory, one could consider each dataset to have a distribution that is in CSV format, and use the GitHub raw URL for the CSV file as the download URL of that distribution.  I haven't done that because complete information about the dataset requires the combination of the raw CSV file with a property mapping table and I don't know how to represent that complexity in DCAT.  But at least in theory it could be done.  One can also indicate that a distribution of the dataset is available from an API such as a SPARQL endpoint, which I also have not done because the datasets aren't compartmentalized into named graphs and therefore can't really be distinguished from each other.  But again, in theory it could be done.

Getting a dump of all of the data

At the start of this post, I complained that there were potential issues with the first two methods that I described for retrieving all of the TDWG standards metadata.  I promised a better way, so here it is!

In theory, a client could start with the catalog IRI (http://rs.tdwg.org/index), dereference it while requesting the machine-readable serialization of its choice, and follow the links to the download URLs of all 50 of the datasets currently in the catalog.  That would be in the LOD style and would require far fewer HTTP calls than the thousands needed to scrape all of the machine-readable data one standards-related resource at a time.

However, here is a quick and dirty way that doesn't require using any Linked Data technology:
  • use a script of your favorite programming language to load the raw file for the datasets CSV table on GitHub
  • get the dataset name from the second ("term_localName") column (e.g. audubon)
  • prepend http://rs.tdwg.org/dump/ to the name (e.g. http://rs.tdwg.org/dump/audubon)
  • append the appropriate file extension for the serialization you want (.ttl for Turtle, .rdf for XML) to the URL from the previous step (e.g. http://rs.tdwg.org/dump/audubon.ttl)
  • make an HTTP GET call to that URL to acquire the machine-readable serialization for that dataset
  • repeat for the other 49 data rows in the table

I've actually done something like this in lines 55 to 63 of a Python script on GitHub.  Rather than making a GET request, the script actually uses the constructed URL to create a SPARQL Update command that loads the data directly from the TDWG server into a graph database triplestore (lines 133 and 127) via an HTTP POST request.  But you could use GET to load the data directly into your own software using a library like Python's RDFLib if you preferred to work with it directly rather than through a SPARQL endpoint.
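
If you want a GET-based version of that recipe rather than SPARQL Update, a minimal sketch might look like the following.  Note that the raw-file URL for the datasets CSV table is my assumption about the repository layout, so check the rs.tdwg.org repo for the actual path; the column name term_localName comes from the table itself.

import csv
import requests

# ASSUMPTION: adjust this raw-file URL to the actual location of the datasets CSV table in the repo
datasetsCsvUrl = 'https://raw.githubusercontent.com/tdwg/rs.tdwg.org/master/datasets/datasets.csv'

rows = csv.DictReader(requests.get(datasetsCsvUrl).text.splitlines())
for row in rows:
    name = row['term_localName']                          # e.g. 'audubon'
    dumpUrl = 'http://rs.tdwg.org/dump/' + name + '.ttl'  # Turtle serialization of that dataset
    turtle = requests.get(dumpUrl).text
    print(dumpUrl, len(turtle), 'characters retrieved')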

The advantage of getting the dump in this way is that it would be coming directly from the authoritative TDWG server (which gets its data from the CSVs in the rs.tdwg.org repo of the TDWG GitHub site).  You would then be guaranteed to have the most up-to-date version of the data, something that would not necessarily happen if you got the data from somebody else's SPARQL endpoint.

In the future, this method will be important because it is the best way to build reliable applications that make use of standards metadata.  For many standards and the "regular" TDWG vocabularies that conform to the SDS (Darwin and Audubon Cores), retrieving up-to-date metadata probably isn't that critical because those standards don't change very quickly.  However, in the case of controlled vocabularies, access to up-to-date data may be more important.

Sunday, April 7, 2019

Understanding the TDWG Standards Documentation Specification, Part 4: Machine-readable Metadata Via an API

This is the fourth in a series of posts about the TDWG Standards Documentation Specification (SDS).  For background on the SDS, see the first post.  For information on its hierarchical model and how it relates to IRI design, see the second post.  For information about how metadata is retrieved via IRI dereferencing, see the third post.

Note: this post was revised on 2020-03-04 when IRI dereferencing of the http://rs.tdwg.org/ subdomain went from testing into production.

Retrieving metadata about TDWG standards using a web API


If you have persevered through the first three posts in this series, congratulations!  The main reason for those earlier posts was to provide the background for this post, which is on the topic that will probably be most interesting to readers: how to effectively retrieve machine-readable metadata about TDWG standards using a web API.

Let's start with retrieving an example resource: the term IRIs and definitions of terms of a TDWG vocabulary (Darwin Core=dwc or Audubon Core=ac).

Here is what we need for the API call:

Resource URL: https://sparql.vanderbilt.edu/sparql
Method: GET
Authentication required: No
Request header key: Accept
Request header value: application/json, text/csv or application/xml
Parameter key: query
Parameter value: insert "dwc" or "ac" in place of {vocabularyAbbreviation} in the following string:
"prefix%20skos%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23%3E%0Aprefix%20dcterms%3A%20%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2F%3E%0ASELECT%20DISTINCT%20%3Firi%20%3Fdefinition%0AWHERE%20%7B%0A%20%20GRAPH%20%3Chttp%3A%2F%2Frs.tdwg.org%2F%3E%20%7B%0A%20%20%20%20%3Chttp%3A%2F%2Frs.tdwg.org%2F{vocabularyAbbreviation}%2F%3E%20dcterms%3AhasPart%20%3FtermList.%0A%20%20%20%20%3FtermList%20dcterms%3AhasPart%20%3Firi.%0A%20%20%20%20%3Firi%20skos%3AprefLabel%20%3Flabel.%0A%20%20%20%20%3Firi%20skos%3Adefinition%20%3Fdefinition.%0A%20%20%20%20FILTER(lang(%3Flabel)%3D%22en%22)%0A%20%20%20%20FILTER(lang(%3Fdefinition)%3D%22en%22)%0A%20%20%20%20%7D%0A%7D%0AORDER%20BY%20%3Firi"

Note: the Accept header is required to receive JSON -- omitting it returns XML.

Here's an example response that shows the structure of the JSON that is returned:

{
    "head": {
        "vars": [
            "iri",
            "definition"
        ]
    },
    "results": {
        "bindings": [
            {
                "iri": {
                    "type": "uri",
                    "value": "http://ns.adobe.com/exif/1.0/PixelXDimension"
                },
                "definition": {
                    "xml:lang": "en",
                    "type": "literal",
                    "value": "Information specific to compressed data. When a compressed file is recorded, the valid width of the meaningful image shall be recorded in this tag, whether or not there is padding data or a restart marker.  This tag shall not exist in an uncompressed file."
                }
            },
(... many more array values here ...)
            {
                "iri": {
                    "type": "uri",
                    "value": "http://rs.tdwg.org/dwc/terms/waterBody"
                },
                "definition": {
                    "xml:lang": "en",
                    "type": "literal",
                    "value": "The name of the water body in which the Location occurs. Recommended best practice is to use a controlled vocabulary such as the Getty Thesaurus of Geographic Names."
                }
            }
        ]
    }
}

Here is an example script to use the API via Python 3.  (You can convert to your own favorite programming language or see this page if you need to set up Python 3 on your computer.)  Note: the requests module is not included in the Python standard library and must be installed using PIP or another package manager.

Although the API can return CSV and XML, we will only be using JSON in this example.
------

import requests

vocab = input('Enter the vocabulary abbreviation (dwc for Darwin Core or ac for Audubon Core): ')

# values required for the HTTP request
resourceUrl = 'https://sparql.vanderbilt.edu/sparql'
requestHeaderKey = 'Accept'
requestHeaderValue = 'application/json'
parameterKey = 'query'
parameterValue ='prefix%20skos%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23%3E%0Aprefix%20dcterms%3A%20%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2F%3E%0ASELECT%20DISTINCT%20%3Firi%20%3Fdefinition%0AWHERE%20%7B%0A%20%20GRAPH%20%3Chttp%3A%2F%2Frs.tdwg.org%2F%3E%20%7B%0A%20%20%20%20%3Chttp%3A%2F%2Frs.tdwg.org%2F'
parameterValue += vocab
parameterValue += '%2F%3E%20dcterms%3AhasPart%20%3FtermList.%0A%20%20%20%20%3FtermList%20dcterms%3AhasPart%20%3Firi.%0A%20%20%20%20%3Firi%20skos%3AprefLabel%20%3Flabel.%0A%20%20%20%20%3Firi%20skos%3Adefinition%20%3Fdefinition.%0A%20%20%20%20FILTER(lang(%3Flabel)%3D%22en%22)%0A%20%20%20%20FILTER(lang(%3Fdefinition)%3D%22en%22)%0A%20%20%20%20%7D%0A%7D%0AORDER%20BY%20%3Firi'
url = resourceUrl + '?' + parameterKey + '=' + parameterValue

# make the HTTP request and store the terms data in a list
r = requests.get(url, headers={requestHeaderKey: requestHeaderValue})
responseBody = r.json()
items = responseBody['results']['bindings']

# iterate through the list and print what we wanted
for item in items:
    print(item['iri']['value'])
    print(item['definition']['value'])
    print()

------
For anyone who has programmed an application to retrieve data from an API, this is pretty standard stuff and because the requests module is so simple to use, the part of the code that actually retrieves the data from the API (lines 16-18) is only three lines long.  So the coding required to retrieve the data is not complicated.  For the output I just had the values for the IRI and definition printed to the console, but obviously you could do whatever you wanted with them in your own programming.
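
For example, if you wanted a spreadsheet-friendly file instead of console output, you could replace the printing loop with something like the following (using Python's built-in csv module); it assumes the items list from the script above, and the output file name is arbitrary:
------

import csv

# write the IRI/definition pairs to a CSV file
with open('term_definitions.csv', 'w', newline='', encoding='utf-8') as outFile:
    writer = csv.writer(outFile)
    writer.writerow(['iri', 'definition'])  # header row
    for item in items:
        writer.writerow([item['iri']['value'], item['definition']['value']])

------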

If you are familiar with using web APIs and if you examined the details of the code, you will probably have several questions:

- Why is the parameter value so much longer and weirder than what is typical for web APIs?
- What is this sparql.vanderbilt.edu API?
- What other kinds of resources can be obtained from the API?

About the API

The reason that the parameter value is so long and weird looking is because the required parameter value is a SPARQL query in URL-encoded form.  I purposefully obfuscated the parameter value by URL-encoding it in the script because I wanted to emphasize how a SPARQL endpoint is fundamentally just like any other web API, except with a more complicated query parameter.  

I feel like in the past Linked Data, RDF, and SPARQL have been talked about in the TDWG community as if they were some kind of religion with secrets that only initiated members of the priesthood can know.  (For a short introduction to this topic, see this video.)  It is true that if you want to design an RDF data model or build the infrastructure to transform tabular data to RDF, you need to know a lot of technical details, but those are not tasks that most people need to do.  You actually don't need to know anything about RDF, how it's structured, or how to create it in order to use a SPARQL endpoint, as I just demonstrated above.

The endpoint https://sparql.vanderbilt.edu/sparql provides public access to datasets that have been made available by the Vanderbilt Libraries.  It is our intention to keep this API up and the datasets stable for as long as possible.  (For more about the API, see this page.)  However, there is nothing special about the API: it's just an installation of Blazegraph, which is freely available as a Docker image (see this page for instructions if you want to try installing it on your own computer).  The TDWG dataset that is loaded into the Vanderbilt API is also freely available and can be installed in any Blazegraph instance.  So although the Vanderbilt API provides a convenient way to access the TDWG data, there is nothing special about it: no custom programming was done to get it online, and no processing was done to the data that was loaded into it.  Any number of other APIs could be set up to provide exactly the same services using exactly the same API calls.  For those who are interested, later in this post I will provide more details about how anyone can obtain the data, but those are details that most users can happily ignore.

The interesting thing about SPARQL endpoints is that there is an unlimited number of resources that can be obtained from the API.  Conventional APIs, such as the GBIF or Twitter APIs, provide web pages that list the available resources and the parameter key/value pairs required to obtain them.  If potential users want to obtain a resource that is not currently available, they have to ask the API developers to create the code required to allow them to access that resource.  A SPARQL endpoint is much simpler.  It has exactly one resource URL (the URL of the endpoint) and for read operations has only one parameter key (query).  The value of that single parameter is the SPARQL query.  

In a manner analogous to traditional API documentation, we can (and should) provide a list of queries that would retrieve the types of information that users typically might want to obtain.  Developers who are satisfied with that list can simply follow those recipes and make API calls as they would for any other API.  But the great thing about a SPARQL endpoint is that you are NOT limited to any provided list of queries.  If you are willing to study the TDWG standards data model that I described in the second post of this series and spend a minimal amount of time learning to construct SPARQL queries (see this beginner's page to get started), you can retrieve any kind of data that you can imagine without needing to beg some developers to add that functionality to their API.

In the next section, I'm going to simplify the Python 3 script that I listed above, then provide several additional API call examples.

A generic Python script for making other API calls

Here is the previous script in a more straightforward and hackable form:
------

import requests

vocab = input('Enter the vocabulary abbreviation (dwc for Darwin Core or ac for Audubon Core): ')

parameterValue ='''prefix skos: <http://www.w3.org/2004/02/skos/core#>
prefix dcterms: <http://purl.org/dc/terms/>
SELECT DISTINCT ?iri ?definition
WHERE {
  GRAPH <http://rs.tdwg.org/> {
    <http://rs.tdwg.org/'''

parameterValue += vocab
parameterValue += '''/> dcterms:hasPart ?termList.
    ?termList dcterms:hasPart ?iri.
    ?iri skos:prefLabel ?label.
    ?iri skos:definition ?definition.
    FILTER(lang(?label)="en")
    FILTER(lang(?definition)="en")
    }
}
ORDER BY ?iri'''

endpointUrl = 'https://sparql.vanderbilt.edu/sparql'
requestHeaderValue = 'application/json'

# make the HTTP request and store the terms data in a list
r = requests.get(endpointUrl, headers={'Accept': requestHeaderValue}, params={'query': parameterValue})
responseBody = r.json()
items = responseBody['results']['bindings']

# iterate through the list and print what we wanted
for item in items:
    print(item['iri']['value'])
    print(item['definition']['value'])
    print()

------

The awesome Python requests module allows you to pass the parameters to the .get() method as a dict, getting rid of the necessity of constructing the entire URL yourself.  The values you pass are automatically URL-encoded, so that eliminates the necessity of doing the encoding yourself.  As a result, I was able to create the parameter value by assigning multi-line strings that are formatted in a much more readable way.  Since the only header we should ever need to send is Accept and the only parameter key we should need is query, I just hard-coded them into the corresponding dicts of the .get() method.  I left the value for the Accept request header as a variable in line 24 in case anybody wants to play with requesting XML or a CSV table.

We can now request different kinds of data from the API by changing the parameter value that is assigned in lines 3 through 21. 

Multilingual labels and definitions 

To retrieve the label and definition for a Darwin Core term in a particular language, substitute these lines for lines 3-21:
------

localName = input('Enter the local name of a Darwin Core term: ')
language = input('Enter the two-letter code for the language you want (en, es, zh-hans): ')

parameterValue ='''prefix skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?label ?definition
WHERE {
  GRAPH <http://rs.tdwg.org/> {
    BIND(IRI("http://rs.tdwg.org/dwc/terms/'''
parameterValue += localName
parameterValue += '''") as ?iri)
    BIND("'''
parameterValue += language
parameterValue += '''" as ?language)
    ?iri skos:prefLabel ?label.
    ?iri skos:definition ?definition.
    FILTER(lang(?label)=?language)
    FILTER(lang(?definition)=?language)
    }
}'''

------
The printout section needs to be changed, since we asked for a label instead of an IRI:
------

for item in items:
    print()
    print(item['label']['value'])
    print(item['definition']['value'])

------
The "local name" asked for by the script is the last part of a Darwin Core IRI.  For example, the local name for dwc:recordedBy (that is, http://rs.tdwg.org/dwc/terms/recordedBy) would be recordedBy.  (You can find more local names to try here.)

Other than English, we currently only have translations of term labels and definitions in Spanish and simplified Chinese.  We also only have translations of dwc: namespace terms from Darwin Core and not dwciri:, dc:, or dcterms: terms.  So this resource is currently somewhat limited, but could get better in the future with the addition of other languages to the dataset.

Track the history of any TDWG term to the beginning of the universe

The user sends the full IRI of any term ever created by TDWG and the API will return the term name, version date of issue, definition and status of every version that was a precursor of that term.  Again, replace lines 3-21 with this:
------

iri = input('Enter the unabbreviated IRI of a TDWG vocabulary term: ')

parameterValue ='''prefix dcterms: <http://purl.org/dc/terms/>
prefix skos: <http://www.w3.org/2004/02/skos/core#>
prefix tdwgutility: <http://rs.tdwg.org/dwc/terms/attributes/>
SELECT DISTINCT ?term ?date ?definition ?status
WHERE {
  GRAPH <http://rs.tdwg.org/> {
    <'''
parameterValue += iri
parameterValue += '''> dcterms:hasVersion ?directVersion.
    ?directVersion dcterms:replaces* ?version.
    ?version dcterms:issued ?date.
    ?version tdwgutility:status ?status.
    ?version dcterms:isVersionOf ?term.
    ?version skos:definition ?definition.
    FILTER(lang(?definition)="en")
    }
}
ORDER BY DESC(?date)'''

------
and replace the printout section with this:
------

for item in items:
    print()
    print(item['date']['value'])
    print(item['term']['value'])
    print(item['definition']['value'])
    print(item['status']['value'])

------

The results of this query allow you to see every possible previous term that might have been used in the past to refer to this concept, and to see how the definition of those earlier terms differed from the target term.  You should try it with everyone's favorite confusing term, dwc:basisOfRecord, which has the unabbreviated IRI http://rs.tdwg.org/dwc/terms/basisOfRecord .  

You can make a simple modification to the script to have the call return every term that has ever been used to replace an obsolete term, and the definitions of every version of those terms.  Just replace dcterms:replaces* with dcterms:isReplacedBy* in the second parameterValue string.  If you want them to be ordered from oldest to newest, you can replace DESC(?date) with ASC(?date).  Try it with this refugee from the past: http://digir.net/schema/conceptual/darwin/2003/1.0/YearCollected.

What are all of the TDWG Standards documents?

This version of the script lets you enter any part of a TDWG standard's name and it will retrieve all of the documents that are part of that standard, tell you the date it was last modified, and give the URL where you might be able to find it (some are print only and at least one -- XDF -- seems to be lost entirely).  Press enter without any text and you will get all of them.  Here's the code to generate the parameter value:
------

searchString = input('Enter part of the standard name, or press Enter for all: ')

parameterValue ='''PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT DISTINCT ?docLabel ?date ?stdLabel ?url
FROM <http://rs.tdwg.org/>
WHERE {
  ?standard a dcterms:Standard.
  ?standard rdfs:label ?stdLabel.
  ?standard dcterms:hasPart ?document.
  ?document a foaf:Document.
  ?document rdfs:label ?docLabel.
  ?document rdfs:seeAlso ?url.
  ?document dcterms:modified ?date.
  FILTER(lang(?stdLabel)="en")
  FILTER(lang(?docLabel)="en")
  FILTER(contains(?stdLabel, "'''
parameterValue += searchString
parameterValue += '''")) 
}
ORDER BY ?stdLabel'''

------
and here's the printout section:
------

for item in items:
    print()
    print(item['docLabel']['value'])
    print(item['date']['value'])
    print(item['stdLabel']['value'])
    print(item['url']['value'])

------

Note: the URLs that are returned are access URLs, NOT the IRI identifiers for the documents!

The following is a variation of the API call above.  In this variation, you enter the name of a standard (or press Enter for all), and you can retrieve the names of all of the contributors (whose roles might have included author, editor, translator, reviewer, or review manager).  Parameter value code:

------

parameterValue ='''PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT DISTINCT ?contributor ?stdLabel
FROM <http://rs.tdwg.org/>
WHERE {
  ?standard a dcterms:Standard.
  ?standard rdfs:label ?stdLabel.
  ?standard dcterms:hasPart ?document.
  ?document a foaf:Document.
  ?document dcterms:contributor ?contribUri.
  ?contribUri rdfs:label ?contributor.
  FILTER(contains(?stdLabel, "'''
parameterValue += searchString
parameterValue += '''")) 
}
ORDER BY ?contributor'''

------
Printout code:
------

for item in items:
    print()
    print(item['contributor']['value'])
    print(item['stdLabel']['value'])

------

Note: assembling this list of documents was my best shot at determining what documents should be considered to be part of the standards themselves, as opposed to ancillary documents not part of the standards.  It's possible that I might have missed some, or included some that aren't considered key to the standards.  This is more of a problem with older documents whose status was not clearly designated.

The SDS isn't very explicit about how to assign all of the properties that should probably be assigned to documents, so some information that might be important is missing, such as documentation of contributor roles.  Also, I could not determine who all of the review managers were or where the authoritative locations were for all documents, nor could I find prior versions of some documents.  So this part of the TDWG standards metadata still needs some work.

An actual Linked Data application

In the previous examples, the data involved were limited to metadata about TDWG standards.  However, we can make an API call that is a bona fide application of Linked Data.  Data from the Bioimages project are available as RDF/XML.  You can examine the human-readable web page of an image at http://bioimages.vanderbilt.edu/thomas/0488-01-01 and the corresponding RDF/XML here.  Both the human- and machine-readable versions of the image metadata use either Darwin Core or Audubon Core terms as most of their properties.  However, the Bioimages metadata do not provide an explanation of what those TDWG vocabulary terms mean.

Both the Bioimages and TDWG metadata datasets have been loaded into the Vanderbilt Libraries SPARQL API, and we can include both datasets in the query's dataset using the FROM keyword.  That allows us to make use of information from the TDWG dataset in the query of the Bioimages data because the two datasets are linked by use of common term IRIs.  In the query, we can ask for the metadata values for the image (from the Bioimages dataset), but include the definition of the properties (from the TDWG dataset; not present in the Bioimages dataset).  
------

iri = input('Enter the unabbreviated IRI of an image from Bioimages: ')

parameterValue ='''PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?label ?value ?definition
FROM <http://rs.tdwg.org/>
FROM <http://bioimages.vanderbilt.edu/images>
WHERE {
    <'''
parameterValue += iri
parameterValue += '''> ?property ?value.
    ?property skos:prefLabel ?label.
    ?property skos:definition ?definition.
    FILTER(lang(?label)="en")
    FILTER(lang(?definition)="en")
}'''

------
Printout code:
------

for item in items:
    print()
    print(item['label']['value'])
    print(item['value']['value'])
    print(item['definition']['value'])

------

You can try this script out on the example IRI I gave above (http://bioimages.vanderbilt.edu/thomas/0488-01-01) or on any other image identifier in the collection (listed under "Refer to this permanent identifier for the image:" on any of the image metadata pages that you get to by clicking on an image thumbnail).  

Conclusion

Hopefully, these examples have given you a taste of the kinds of metadata about TDWG standards that can be retrieved using an API.  There are several final issues that I should discuss before I wrap up this post.  I'm going to present them in a Q&A format.

Q: Can I build an application to use this API?
A: Yes, you could.  We intend for the Vanderbilt SPARQL API to remain up indefinitely at the endpoint URL given in the examples.  However, we can't make a hard promise about that, and the API is not set up to handle large amounts of traffic.  There aren't any usage limits, and consequently it has already been crashed once by someone who hit it really hard.  So if you need a robust service, you should probably set up your own installation of Blazegraph and populate it with the TDWG dataset.

Q: How can I get a dump of the TDWG data to populate my own version of the API?
A: The simplest way is to send the following query to the Vanderbilt SPARQL API, as above, with an Accept header of text/turtle:

CONSTRUCT {?s ?p ?o}
FROM <http://rs.tdwg.org/>
WHERE {?s ?p ?o}

URL-encoded, the query is:

CONSTRUCT%20%7B%3Fs%20%3Fp%20%3Fo%7D%0AFROM%20%3Chttp%3A%2F%2Frs.tdwg.org%2F%3E%0AWHERE%20%7B%3Fs%20%3Fp%20%3Fo%7D

If you use Postman, you can drop down the Send button to Send and Download and save the data in a file, which you can upload into your own instance of Blazegraph or some other SPARQL endpoint/triplestore.  (There are approximately 43000 statements (triples) in the dataset, so copy and paste is not a great method of putting them into a file.)  If your triplestore doesn't support RDF/Turtle, you can get RDF/XML instead by using an Accept header of application/rdf+xml.
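
If you prefer not to use Postman, a few lines of Python (again using the requests module) should accomplish the same thing; the output file name is arbitrary:
------

import requests

query = '''CONSTRUCT {?s ?p ?o}
FROM <http://rs.tdwg.org/>
WHERE {?s ?p ?o}'''

r = requests.get('https://sparql.vanderbilt.edu/sparql', headers={'Accept': 'text/turtle'}, params={'query': query})

# save the Turtle dump so it can be loaded into your own triplestore
with open('tdwg-dump.ttl', 'w', encoding='utf-8') as dumpFile:
    dumpFile.write(r.text)

------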

There is a better method of acquiring the data that uses the authoritative source data, but I'll have to describe that in a subsequent post.

Q: How accurate are the data?
A: I've spent many, many hours over the last several years curating the source data in GitHub.  Nevertheless, I still discover errors almost every time I try new queries on the data.  If you discover errors, put them in the issues tracker and I'll try to fix them.

Q: How would this work for future controlled vocabularies?
A: This is a really important question.  It's so important that I'm going to address it in a subsequent post in the series.

Q: How can I retrieve information from the API about resources that weren't described in the examples?
A: Since a SPARQL endpoint is essentially a program-it-yourself API, all you need is to have the right SPARQL query to retrieve the information you want.  First you need to have a clear idea of the question you want to answer.  Then you've got two options: find someone who knows how to write SPARQL queries and get them to write the query for you, or teach yourself how to write SPARQL queries and do it yourself.  You can test your queries by pasting them in the box at https://sparql.vanderbilt.edu/ as you build them.  It is not possible to create the queries without understanding the underlying data model (the graph model) and the machine-readable properties assigned to each kind of resource.  That's why I wrote the first (boring) parts of this series and why we wrote the specification itself.

Q: Where did the data in the dataset come from and how is it managed?
A: That is an excellent question.  Actually it is several questions:

- where does the data come from? (answer: the source csv tables in GitHub)
- how does the source data get turned into machine-readable data?
- how does the machine-readable data get into the API?

One of the beauties of REST is that when you request a URI from a server, you should be able to get a useful response from the server without having to worry about how the server generates that response.  What that means in this context is that the intermediate steps that lie between the source data and what comes out of the API (the answers to the second and third questions above) can change and the client should never notice the difference since it would still be able to get exactly the same response.  That's because the processing essentially involves implementing a mapping between what's in the tables on GitHub and what the SDS says the standardized machine-readable metadata should look like.  There is no one particular way that mapping must happen, as long as the end result is the same.  I will discuss this point in what will probably be the last post of the series.



Tuesday, April 2, 2019

Understanding the TDWG Standards Documentation Specification, Part 3: Machine-readable Metadata Via Content Negotiation

This is the third in a series of posts about the TDWG Standards Documentation Specification (SDS).  For background on the SDS, see the first post.  For information on its hierarchical model and how it relates to IRI design, see the second post.

Note: this post was revised on 2020-03-04 when IRI dereferencing of the http://rs.tdwg.org/ subdomain went from testing into production.


Human- vs. Machine-readable metadata

In the previous posts, I made the point that the SDS considers standards-related resources such as standards, vocabularies, term lists, terms, and documents to be abstract entities (section 2.1).  As such, the IRI assigned to a resource denotes that resource in its abstract form.  That abstract resource does not have one particular representation -- rather it can have multiple representation syntaxes which differ in format, but which in most cases provide equivalent information.

For example, consider the deprecated Darwin Core term dwccuratorial:Disposition.  It is denoted by the IRI http://rs.tdwg.org/dwc/curatorial/Disposition.  The metadata for this term in human-readable form looks like this:

Term Name: dwccuratorial:Disposition
Label: Disposition
Term IRI: http://rs.tdwg.org/dwc/curatorial/Disposition
Term version IRI: http://rs.tdwg.org/dwc/curatorial/version/Disposition-2007-04-17
Modified: 2009-04-24
Definition: The current disposition of the cataloged item. Examples: "in collection", "missing", "voucher elsewhere", "duplicates elsewhere".
Type: Property
Note: This term is no longer recommended for use.
Is replaced by: http://rs.tdwg.org/dwc/terms/disposition

In the RDF/Turtle machine-readable serialization, the metadata looks like this (namespace abbreviations omitted):

<http://rs.tdwg.org/dwc/curatorial/Disposition>
     rdfs:isDefinedBy <http://rs.tdwg.org/dwc/curatorial/>;
     dcterms:isPartOf <http://rs.tdwg.org/dwc/curatorial/>;
     dcterms:created "2007-04-17"^^xsd:date;
     dcterms:modified "2009-04-24"^^xsd:date;
     owl:deprecated "true"^^xsd:boolean;
     rdfs:label "Disposition"@en;
     skos:prefLabel "Disposition"@en;
     rdfs:comment "The current disposition of the cataloged item. Examples: \"in collection\", \"missing\", \"voucher elsewhere\", \"duplicates elsewhere\"."@en;
     skos:definition "The current disposition of the cataloged item. Examples: \"in collection\", \"missing\", \"voucher elsewhere\", \"duplicates elsewhere\"."@en;
     rdf:type <http://www.w3.org/1999/02/22-rdf-syntax-ns#Property>;
     tdwgutility:abcdEquivalence "DataSets/DataSet/Units/Unit/SpecimenUnit/Disposition";
     dcterms:hasVersion <http://rs.tdwg.org/dwc/curatorial/version/Disposition-2007-04-17>;
     dcterms:isReplacedBy <http://rs.tdwg.org/dwc/terms/disposition>.

In RDF/XML machine-readable form, the metadata looks like this (namespace abbreviations omitted):

<rdf:Description rdf:about="http://rs.tdwg.org/dwc/curatorial/Disposition">
     <rdfs:isDefinedBy rdf:resource="http://rs.tdwg.org/dwc/curatorial/"/>
     <dcterms:isPartOf rdf:resource="http://rs.tdwg.org/dwc/curatorial/"/>
     <dcterms:created rdf:datatype="http://www.w3.org/2001/XMLSchema#date">2007-04-17</dcterms:created>
     <dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#date">2009-04-24</dcterms:modified>
     <owl:deprecated rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean">true</owl:deprecated>
     <rdfs:label xml:lang="en">Disposition</rdfs:label>
     <skos:prefLabel xml:lang="en">Disposition</skos:prefLabel>
     <rdfs:comment xml:lang="en">The current disposition of the cataloged item. Examples: "in collection", "missing", "voucher elsewhere", "duplicates elsewhere".</rdfs:comment>
     <skos:definition xml:lang="en">The current disposition of the cataloged item. Examples: "in collection", "missing", "voucher elsewhere", "duplicates elsewhere".</skos:definition>
     <rdf:type rdf:resource="http://www.w3.org/1999/02/22-rdf-syntax-ns#Property"/>
     <tdwgutility:abcdEquivalence>DataSets/DataSet/Units/Unit/SpecimenUnit/Disposition</tdwgutility:abcdEquivalence>
     <dcterms:hasVersion rdf:resource="http://rs.tdwg.org/dwc/curatorial/version/Disposition-2007-04-17"/>
     <dcterms:isReplacedBy rdf:resource="http://rs.tdwg.org/dwc/terms/disposition"/>
</rdf:Description>

For brevity, I'll omit the JSON-LD serialization.  If you make a careful comparison of the two machine-readable serializations shown here, you'll see that they contain exactly the same information.  

The SDS requires that when a machine consumes any machine-readable serialization, it acquire information identical to what it would acquire from any other serialization (section 4).  For most resources (terms, vocabularies, etc.), the human-readable representation generally contains the same information as the machine-readable serializations for all of the key properties required by the SDS, although some that aren't required, such as the abcdEquivalence, are omitted.  The exception to this is standards-related documents -- the human-readable representation is the document itself, while the machine-readable representations are metadata about the document.  (In contrast, machine-readable metadata about vocabularies, term lists, and terms contain virtually complete data about the resource.)

Distinguishing between resources and the documents that describe them

Section 4.1 of the SDS requires that machine-readable documents have IRIs that are different from the IRIs of the abstract resources that they describe.  Although at first it may not be apparent why this is important, we can see why if we consider the case of some of the older TDWG standards documents.  For instance, the document Floristic Regions of the World (denoted by the IRI http://rs.tdwg.org/frw/doc/book/) by A. L. Takhtahan was adopted as part of TDWG standard http://www.tdwg.org/standards/104. It is copyrighted by the University of California Press and is not available under an open license.  However, the metadata about the book in RDF/Turtle serialization (denoted by the IRI http://rs.tdwg.org/frw/doc/book.ttl) is freely available.  So we could make the statement

http://rs.tdwg.org/frw/doc/book.ttl dcterms:license https://creativecommons.org/publicdomain/zero/1.0/ .

but it would NOT be accurate to make the statement 

http://rs.tdwg.org/frw/doc/book/ dcterms:license https://creativecommons.org/publicdomain/zero/1.0/ .

because the book isn't licensed as CC0. Similarly, it would be correct to say:

http://rs.tdwg.org/frw/doc/book/ dc:creator "A. L. Takhtahan" .

but not:

http://rs.tdwg.org/frw/doc/book/ dc:creator "Biodiversity Information Standards (TDWG)" .

because TDWG did not create the book.  On the other hand, saying:

http://rs.tdwg.org/frw/doc/book.ttl dc:creator "Biodiversity Information Standards (TDWG)" .

would be correct, since TDWG did create the RDF/Turtle metadata document that describes the book.

Although in human-readable documents we tend to be fuzzy about the distinction between resources and the metadata about those resources, when we create machine-readable metadata representations we need to be careful to distinguish between the two.

The SDS prescribes a way to link metadata documents and the resources they are about: dcterms:references and dcterms:isReferencedBy (section 4.1).  In the example above, we can say:

http://rs.tdwg.org/frw/doc/book.ttl dcterms:references http://rs.tdwg.org/frw/doc/book/ .

and

http://rs.tdwg.org/frw/doc/book/ dcterms:isReferencedBy http://rs.tdwg.org/frw/doc/book.ttl .

Content negotiation

As I explained in the second post of this series, IRIs are fundamentally identifiers.  There is no requirement that an IRI actually dereference to retrieve a web page or any other kind of document, although if it did, that would be nice, since that's the kind of behavior that people expect, particularly if the IRI begins with "http://" or "https://".  If you think about it, defining TDWG IRIs to denote an abstract conceptual thing is a bit of a problem, because only non-abstract files can actually be returned to a user from a server through the Internet.  You can't retrieve an abstract thing like the emotion "love" or the concept "justice" through the Internet, although you could certainly mint IRIs to denote those kinds of things.

The standard practice when an IRI denotes a resource that is a physical object or abstract idea is to redirect the user to a document that is about the object or idea.  Such a document containing descriptive metadata about the resource is called a representation of the resource.  Users can specify what kind of document (human- or machine-readable) they want, and more specifically, the serialization that they want if they are asking for a machine-readable document.  This process is called content negotiation.

Indefinite resolution of permanent identifiers is specified by Recommendation 7 of the TDWG Globally Unique Identifier (GUID) Applicability Statement standard, although that standard does not go into the details of how resolution should happen.  Sections 2.1.1 and 2.1.2 of the SDS expand on the GUID AS by saying that the abstract IRI should be stable and generic, and that content negotiation should redirect the user to an IRI for a particular content type, which serves as a URL that can be used to retrieve a document of the content type the user wanted.  That requirement is based on widespread practice in the Linked Data community as expressed in the 2008 W3C Note "Cool URIs for the Semantic Web".

The SDS does not specify a particular way that this redirection should be accomplished, but given that it's desirable to support as many different serializations as possible, I chose to implement the "303 URIs forwarding to Different Documents" recipe described in the Cool URIs document.  Here are the specific details:

1. Client software performs an HTTP GET request for the abstract IRI of the resource and includes an Accept header that specifies the content type that it wants.

2. The server responds with an HTTP status code of 303 and includes the URL for the specific content type requested.  To construct the redirect URL, any abstract IRIs with trailing slashes first have the trailing slash removed. If text/html is requested (i.e. human-readable web page), .htm is appended to the IRI to form the redirect URL.  If text/turtle is requested, .ttl is appended.  If application/rdf+xml is requested, .rdf is appended.  If application/ld+json is requested, .json is appended.

3. The client then requests the specific redirect URL and the server returns the appropriate document in the serialization requested.  In this stage, the Accept header is ignored by the server.  In the case of standards documents and current terms in Darwin and Audubon Cores, there typically will be an additional redirect to a web page that isn't generated programmatically by the rs.tdwg.org server and might be located anywhere.

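The same three steps can also be followed from a script.  Here is a minimal Python sketch (using the requests module, with redirect-following turned off so that the 303 response is visible); the expected values in the comments are what the recipe above predicts.

import requests

# Step 1: request the abstract IRI with an Accept header, but don't follow redirects
r = requests.get('http://rs.tdwg.org/dwc/', headers={'Accept': 'text/turtle'}, allow_redirects=False)

# Step 2: the server should answer 303 See Other with the representation-specific URL
print(r.status_code)               # expect 303
print(r.headers['Location'])       # expect http://rs.tdwg.org/dwc.ttl

# Step 3: retrieve the redirect URL; the Accept header is no longer needed
r2 = requests.get(r.headers['Location'])
print(r2.headers['Content-Type'])  # expect text/turtle
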
We can test the behavior using curl or a graphical HTTP client like Postman.  Here is an example using Postman (with automatic following of redirects turned off):

1. Client requests metadata about the basic Darwin Core vocabulary by HTTP GET to the generic IRI: http://rs.tdwg.org/dwc/ with an Accept header of text/turtle.



2. The server responds with a 303 (see other) code and redirects to http://rs.tdwg.org/dwc.ttl

3. The client sends another GET request to http://rs.tdwg.org/dwc.ttl, this time without any Accept header.



4. The server responds with a 200 (success) code and a Content-Type response header of text/turtle.  The response body is the document serialized as RDF/Turtle.


This illustration was done "manually" using Postman, but it is relatively simple to use any typical programming language (such as JavaScript or Python) to perform HTTP calls with appropriate Accept headers.[1]  So enabling IRI dereferencing with content negotiation really starts to open up TDWG standards to machine readability.

One feature of this implementation method is that it allows a human user to examine a representation in any serialization using a browser just by hacking the abstract IRI using the rules in step 2.  Thus, if you want to see what the RDF/XML serialization looks like for the basic Darwin Core vocabulary, you can put the URL http://rs.tdwg.org/dwc.rdf into a browser.  The browser will send an Accept header of text/html, but since the URL contains an extension for a specific file type, the server will ignore the Accept header and send RDF/XML anyway.  (Depending on how the browser is set up to handle file types, it may display the retrieved file in the browser window, or may initiate a download of the file into the user's Downloads directory.)

Important note: currently (as of April 2019), there is an error in the algorithm that generates the JSON-LD that causes repeated properties to be serialized incorrectly.  The JSON that is returned validates as JSON-LD, but when the document is interpreted, some instances of the repeated properties are ignored.  So application designers should at this point plan to consume either RDF/XML or RDF/Turtle until this error is corrected.

Why does this matter?

There are three reasons why implementation of dereferencing TDWG standards-related IRIs through content negotiation is important.

1. The least important reason is probably the one that is given as a core rationale in the Linked Data world: when someone "looks up" a URI, they get useful information and can discover more things through the links in the metadata.  In theory, one could "discover" any resource related to TDWG standards, scrape the machine-readable metadata about that resource, dereference other resources that are linked to the first one, scrape those resources' metadata and follow their links, etc. until everything there is to know about TDWG standards has been discovered.  Essentially, we could have an analog of the Google web scraper that scrapes machine-readable documents instead of web pages.  In theory, this could be done, but it would result in many HTTP calls and would be a very inefficient way to keep up-to-date on TDWG standards.  There is a much better way, and I'll discuss it in the next post.

2. Probably the most important reason is that implementing real permanent IRIs for TDWG vocabularies and documents puts a stop to the continual breaking of links and browser bookmarks that happens every time documents get moved to a new website, get changed from HTML to markdown, etc.  If we stress that the permanent IRIs are what should be bookmarked and cited, we can always set up the server to redirect to the URL of the day where the document or information actually lives.  Since the permanent IRIs are "cool" and don't include implementation-specific aspects like ".php" or "?pid=123&lan=en", we can change the way we actually generate and serve the data at will without ever "breaking" any links.  This is really critical if we want people to be able to cite IRIs for TDWG standards components in journal articles with those IRIs continuing to dereference indefinitely.

3. The third reason is more philosophical.  By having IRIs that dereference to human- and machine-readable metadata, we demonstrate that these are "real" IRIs that exhibit the behavior expected from "grown-up" organizations in the Linked Data world specifically, and on the web in general.  We show that TDWG is not some fly-by-night organization that creates identifiers one day and abandons them the next.  The Internet is littered with the wreckage of vocabularies and ontologies from organizations that minted terms but stopped paying for their domain name, or couldn't keep their servers running.  Having properly dereferencing, permanent IRIs marks TDWG as a real standards organization that can run with the big dogs like Dublin Core and the W3C.  (We also get 5 stars!)

In my next post I'll talk about retrieving SDS-specified machine readable standards metadata en masse.

[1] Sample Python 3 code for dereferencing a term IRI

Note: you may need to use PIP to install the requests module if you don't already have it.

import requests

# dereference the term IRI, asking for RDF/Turtle via the Accept header;
# requests automatically follows the 303 redirect to the Turtle representation
iri = 'http://rs.tdwg.org/ac/terms/caption'
accept = 'text/turtle'
r = requests.get(iri, headers={'Accept': accept})
print(r.text)