Sunday, April 7, 2019

Understanding the TDWG Standards Documentation Specification, Part 4: Machine-readable Metadata Via an API

This is the fourth in a series of posts about the TDWG Standards Documentation Specification (SDS).  For background on the SDS, see the first post.  For information on its hierarchical model and how it relates to IRI design, see the second post.  For information about how metadata is retrieved via IRI dereferencing, see the third post.

Note: this post was revised on 2020-03-04 when IRI dereferencing of the http://rs.tdwg.org/ subdomain went from testing into production.

Retrieving metadata about TDWG standards using a web API


If you have persevered through the first three posts in this series, congratulations!  The main reason for those earlier posts was to provide the background for this post, which is on the topic that will probably be most interesting to readers: how to effectively retrieve machine-readable metadata about TDWG standards using a web API.

Let's start by retrieving an example resource: the term IRIs and definitions of the terms in a TDWG vocabulary (Darwin Core=dwc or Audubon Core=ac).

Here is what we need for the API call:

Resource URL: https://sparql.vanderbilt.edu/sparql
Method: GET
Authentication required: No
Request header key: Accept
Request header value: application/json, text/csv, or application/xml
Parameter key: query
Parameter value: insert "dwc" or "ac" in place of {vocabularyAbbreviation} in the following string:
"prefix%20skos%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23%3E%0Aprefix%20dcterms%3A%20%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2F%3E%0ASELECT%20DISTINCT%20%3Firi%20%3Fdefinition%0AWHERE%20%7B%0A%20%20GRAPH%20%3Chttp%3A%2F%2Frs.tdwg.org%2F%3E%20%7B%0A%20%20%20%20%3Chttp%3A%2F%2Frs.tdwg.org%2F{vocabularyAbbreviation}%2F%3E%20dcterms%3AhasPart%20%3FtermList.%0A%20%20%20%20%3FtermList%20dcterms%3AhasPart%20%3Firi.%0A%20%20%20%20%3Firi%20skos%3AprefLabel%20%3Flabel.%0A%20%20%20%20%3Firi%20skos%3Adefinition%20%3Fdefinition.%0A%20%20%20%20FILTER(lang(%3Flabel)%3D%22en%22)%0A%20%20%20%20FILTER(lang(%3Fdefinition)%3D%22en%22)%0A%20%20%20%20%7D%0A%7D%0AORDER%20BY%20%3Firi"

Note: the Accept header is required to receive JSON -- omitting it returns XML.

Here's an example response that shows the structure of the JSON that is returned:

{
    "head": {
        "vars": [
            "iri",
            "definition"
        ]
    },
    "results": {
        "bindings": [
            {
                "iri": {
                    "type": "uri",
                    "value": "http://ns.adobe.com/exif/1.0/PixelXDimension"
                },
                "definition": {
                    "xml:lang": "en",
                    "type": "literal",
                    "value": "Information specific to compressed data. When a compressed file is recorded, the valid width of the meaningful image shall be recorded in this tag, whether or not there is padding data or a restart marker.  This tag shall not exist in an uncompressed file."
                }
            },
(... many more array values here ...)
            {
                "iri": {
                    "type": "uri",
                    "value": "http://rs.tdwg.org/dwc/terms/waterBody"
                },
                "definition": {
                    "xml:lang": "en",
                    "type": "literal",
                    "value": "The name of the water body in which the Location occurs. Recommended best practice is to use a controlled vocabulary such as the Getty Thesaurus of Geographic Names."
                }
            }
        ]
    }
}

Here is an example script that uses the API via Python 3.  (You can translate it into your own favorite programming language, or see this page if you need to set up Python 3 on your computer.)  Note: the requests module is not included in the Python standard library and must be installed using pip or another package manager.

Although the API can return CSV and XML, we will only be using JSON in this example.
------

import requests

vocab = input('Enter the vocabulary abbreviation (dwc for Darwin Core or ac for Audubon Core): ')

# values required for the HTTP request
resourceUrl = 'https://sparql.vanderbilt.edu/sparql'
requestHeaderKey = 'Accept'
requestHeaderValue = 'application/json'
parameterKey = 'query'
parameterValue ='prefix%20skos%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23%3E%0Aprefix%20dcterms%3A%20%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2F%3E%0ASELECT%20DISTINCT%20%3Firi%20%3Fdefinition%0AWHERE%20%7B%0A%20%20GRAPH%20%3Chttp%3A%2F%2Frs.tdwg.org%2F%3E%20%7B%0A%20%20%20%20%3Chttp%3A%2F%2Frs.tdwg.org%2F'
parameterValue += vocab
parameterValue += '%2F%3E%20dcterms%3AhasPart%20%3FtermList.%0A%20%20%20%20%3FtermList%20dcterms%3AhasPart%20%3Firi.%0A%20%20%20%20%3Firi%20skos%3AprefLabel%20%3Flabel.%0A%20%20%20%20%3Firi%20skos%3Adefinition%20%3Fdefinition.%0A%20%20%20%20FILTER(lang(%3Flabel)%3D%22en%22)%0A%20%20%20%20FILTER(lang(%3Fdefinition)%3D%22en%22)%0A%20%20%20%20%7D%0A%7D%0AORDER%20BY%20%3Firi'
url = resourceUrl + '?' + parameterKey + '=' + parameterValue

# make the HTTP request and store the terms data in a list
r = requests.get(url, headers={requestHeaderKey: requestHeaderValue})
responseBody = r.json()
items = responseBody['results']['bindings']

# iterate through the list and print what we wanted
for item in items:
    print(item['iri']['value'])
    print(item['definition']['value'])
    print()

------
For anyone who has programmed an application to retrieve data from an API, this is pretty standard stuff.  Because the requests module is so simple to use, the part of the code that actually retrieves the data from the API (the requests.get() call and the two lines that follow it) is only three lines long, so the coding required to retrieve the data is not complicated.  For the output I just had the values for the IRI and definition printed to the console, but obviously you could do whatever you wanted with them in your own program.

If you are familiar with using web APIs and if you examined the details of the code, you will probably have several questions:

- Why is the parameter value so much longer and weirder than what is typical for web APIs?
- What is this sparql.vanderbilt.edu API?
- What other kinds of resources can be obtained from the API?

About the API

The parameter value is so long and weird-looking because it is a SPARQL query in URL-encoded form.  I purposefully obfuscated the parameter value by URL-encoding it in the script because I wanted to emphasize that a SPARQL endpoint is fundamentally just like any other web API, except with a more complicated query parameter.

I feel that in the past, Linked Data, RDF, and SPARQL have been talked about in the TDWG community as if they were some kind of religion with secrets that only initiated members of the priesthood can know.  (For a short introduction to this topic, see this video.)  It is true that if you want to design an RDF data model or build the infrastructure to transform tabular data into RDF, you need to know a lot of technical details, but those are not tasks that most people need to do.  You actually don't need to know anything about RDF, how it's structured, or how to create it in order to use a SPARQL endpoint, as I just demonstrated above.

The endpoint http://sparql.vanderbilt.edu/sparql provides public access to datasets that have been made available by the Vanderbilt Libraries.  It is our intention to keep this API up and the datasets stable for as long as possible.  (For more about the API, see this page.)  However, there is nothing special about the API - it's just an installation of Blazegraph, which is freely available as a Docker image (see this page for instructions if you want to try installing it on your own computer).  The TDWG dataset that is loaded into the Vanderbilt API is also freely available and can be installed in any Blazegraph instance.  So although the Vanderbilt API provides a convenient way to access the TDWG data, there is nothing special about it: no custom programming was needed to get it online, and the data loaded into it were not processed in any special way.  Any number of other APIs could be set up to provide exactly the same services using exactly the same API calls.  For those who are interested, later in this post I will provide more details about how anyone can obtain the data, but those are details that most users can happily ignore.

The interesting thing about SPARQL endpoints is that there is an unlimited number of resources that can be obtained from the API.  Conventional APIs, such as the GBIF or Twitter APIs, provide web pages that list the available resources and the parameter key/value pairs required to obtain them.  If potential users want to obtain a resource that is not currently available, they have to ask the API developers to create the code required to allow them to access that resource.  A SPARQL endpoint is much simpler.  It has exactly one resource URL (the URL of the endpoint) and for read operations has only one parameter key (query).  The value of that single parameter is the SPARQL query.  
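
To make that concrete, here is a minimal sketch (the helper function name is my own, not part of any published script) showing that every read operation against the endpoint boils down to the same request pattern, with only the query text changing:
------

import requests

ENDPOINT_URL = 'https://sparql.vanderbilt.edu/sparql'

def run_select_query(query_text):
    """Send any SPARQL SELECT query to the endpoint and return the list of result bindings."""
    response = requests.get(ENDPOINT_URL,
                            headers={'Accept': 'application/json'},
                            params={'query': query_text})
    return response.json()['results']['bindings']

------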

In a manner analogous to traditional API documentation, we can (and should) provide a list of queries that retrieve the types of information that users typically want to obtain.  Developers who are satisfied with that list can simply follow those recipes as they would for any other API.  But the great thing about a SPARQL endpoint is that you are NOT limited to any provided list of queries.  If you are willing to study the TDWG standards data model that I described in the second post of this series and spend a minimal amount of time learning to construct SPARQL queries (see this beginner's page to get started), you can retrieve any kind of data that you can imagine without needing to beg some developers to add that functionality to their API.

In the next section, I'm going to simplify the Python 3 script that I listed above, then provide several additional API call examples.

A generic Python script for making other API calls

Here is the previous script in a more straightforward and hackable form:
------

import requests

vocab = input('Enter the vocabulary abbreviation (dwc for Darwin Core or ac for Audubon Core): ')

parameterValue ='''prefix skos: <http://www.w3.org/2004/02/skos/core#>
prefix dcterms: <http://purl.org/dc/terms/>
SELECT DISTINCT ?iri ?definition
WHERE {
  GRAPH <http://rs.tdwg.org/> {
    <http://rs.tdwg.org/'''

parameterValue += vocab
parameterValue += '''/> dcterms:hasPart ?termList.
    ?termList dcterms:hasPart ?iri.
    ?iri skos:prefLabel ?label.
    ?iri skos:definition ?definition.
    FILTER(lang(?label)="en")
    FILTER(lang(?definition)="en")
    }
}
ORDER BY ?iri'''

endpointUrl = 'https://sparql.vanderbilt.edu/sparql'
requestHeaderValue = 'application/json'

# make the HTTP request and store the terms data in a list
r = requests.get(endpointUrl, headers={'Accept': requestHeaderValue}, params={'query': parameterValue})
responseBody = r.json()
items = responseBody['results']['bindings']

# iterate through the list and print what we wanted
for item in items:
    print(item['iri']['value'])
    print(item['definition']['value'])
    print()

------

The awesome Python requests module allows you to pass the parameters to the .get() method as a dict, so you don't have to construct the entire URL yourself.  The values you pass are automatically URL-encoded, which also eliminates the need to do the encoding yourself.  As a result, I was able to create the parameter value from multi-line strings that are formatted in a much more readable way.  Since the only header we should ever need to send is Accept and the only parameter key we should need is query, I just hard-coded them into the corresponding dicts passed to the .get() method.  I left the value of the Accept request header as a variable in case anybody wants to play with requesting XML or a CSV table.
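
For example, requesting the same results as a CSV table might look something like the following sketch (my own variation on the script above, assuming the endpoint honors an Accept header of text/csv and returns a header row of variable names, as SPARQL CSV results normally do):
------

import csv
import io
import requests

# assumes parameterValue holds the same (un-encoded) SPARQL query string built in the script above
r = requests.get('https://sparql.vanderbilt.edu/sparql',
                 headers={'Accept': 'text/csv'},
                 params={'query': parameterValue})

# parse the CSV response body into dictionaries keyed by the SELECT variable names
reader = csv.DictReader(io.StringIO(r.text))
for row in reader:
    print(row['iri'])
    print(row['definition'])
    print()

------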

We can now request different kinds of data from the API by changing the code that builds the parameter value (everything from the input() prompt through the end of the multi-line query string).

Multilingual labels and definitions 

To retrieve the label and definition for a Darwin Core term in a particular language, substitute these lines for the parameter value-building code in the script above:
------

localName = input('Enter the local name of a Darwin Core term: ')
language = input('Enter the two-letter code for the language you want (en, es, zh-hans): ')

parameterValue ='''prefix skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?label ?definition
WHERE {
  GRAPH <http://rs.tdwg.org/> {
    BIND(IRI("http://rs.tdwg.org/dwc/terms/'''
parameterValue += localName
parameterValue += '''") as ?iri)
    BIND("'''
parameterValue += language
parameterValue += '''" as ?language)
    ?iri skos:prefLabel ?label.
    ?iri skos:definition ?definition.
    FILTER(lang(?label)=?language)
    FILTER(lang(?definition)=?language)
    }
}'''

------
The printout section needs to be changed, since we asked for a label instead of an IRI:
------

for item in items:
    print()
    print(item['label']['value'])
    print(item['definition']['value'])

------
The "local name" asked for by the script is the last part of a Darwin Core IRI.  For example, the local name for dwc:recordedBy (that is, http://rs.tdwg.org/dwc/terms/recordedBy) would be recordedBy.  (You can find more local names to try here.)

Other than English, we currently have translations of term labels and definitions only in Spanish and simplified Chinese.  We also have translations only for terms in the Darwin Core dwc: namespace, not for dwciri:, dc:, or dcterms: terms.  So this resource is currently somewhat limited, but it could get better in the future with the addition of other languages to the dataset.

Track the history of any TDWG term to the beginning of the universe

The user sends the full IRI of any term ever created by TDWG, and the API will return the term name, version date of issue, definition, and status of every version that was a precursor of that term.  Again, replace the parameter value-building code with this:
------

iri = input('Enter the unabbreviated IRI of a TDWG vocabulary term: ')

parameterValue ='''prefix dcterms: <http://purl.org/dc/terms/>
prefix skos: <http://www.w3.org/2004/02/skos/core#>
prefix tdwgutility: <http://rs.tdwg.org/dwc/terms/attributes/>
SELECT DISTINCT ?term ?date ?definition ?status
WHERE {
  GRAPH <http://rs.tdwg.org/> {
    <'''
parameterValue += iri
parameterValue += '''> dcterms:hasVersion ?directVersion.
    ?directVersion dcterms:replaces* ?version.
    ?version dcterms:issued ?date.
    ?version tdwgutility:status ?status.
    ?version dcterms:isVersionOf ?term.
    ?version skos:definition ?definition.
    FILTER(lang(?definition)="en")
    }
}
ORDER BY DESC(?date)'''

------
and replace the printout section with this:
------

for item in items:
    print()
    print(item['date']['value'])
    print(item['term']['value'])
    print(item['definition']['value'])
    print(item['status']['value'])

------

The results of this query let you see every previous term that might have been used to refer to this concept, and how the definitions of those earlier terms differed from that of the target term.  You should try it with everyone's favorite confusing term, dwc:basisOfRecord, which has the unabbreviated IRI http://rs.tdwg.org/dwc/terms/basisOfRecord.

You can make a simple modification to the script to have the call return every term that has ever been used to replace an obsolete term, along with the definitions of every version of those terms.  Just replace dcterms:replaces* with dcterms:isReplacedBy* in the second parameterValue string.  If you want the results ordered from oldest to newest, you can also replace DESC(?date) with ASC(?date); both substitutions are shown in the sketch below.  Try it with this refugee from the past: http://digir.net/schema/conceptual/darwin/2003/1.0/YearCollected.
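
For clarity, here is what the second parameterValue string looks like with both of those substitutions made (everything else in that example stays the same):
------

# forward-tracking variant: follow dcterms:isReplacedBy links instead of dcterms:replaces,
# and sort the versions from oldest to newest
parameterValue += '''> dcterms:hasVersion ?directVersion.
    ?directVersion dcterms:isReplacedBy* ?version.
    ?version dcterms:issued ?date.
    ?version tdwgutility:status ?status.
    ?version dcterms:isVersionOf ?term.
    ?version skos:definition ?definition.
    FILTER(lang(?definition)="en")
    }
}
ORDER BY ASC(?date)'''

------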

What are all of the TDWG Standards documents?

This version of the script lets you enter any part of a TDWG standard's name, and it will retrieve all of the documents that are part of that standard, along with the date each document was last modified and the URL where you might be able to find it (some are print only, and at least one -- XDF -- seems to be lost entirely).  Press Enter without any text and you will get the documents for all of the standards.  Here's the code to generate the parameter value:
------

searchString = input('Enter part of the standard name, or press Enter for all: ')

parameterValue ='''PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT DISTINCT ?docLabel ?date ?stdLabel ?url
FROM <http://rs.tdwg.org/>
WHERE {
  ?standard a dcterms:Standard.
  ?standard rdfs:label ?stdLabel.
  ?standard dcterms:hasPart ?document.
  ?document a foaf:Document.
  ?document rdfs:label ?docLabel.
  ?document rdfs:seeAlso ?url.
  ?document dcterms:modified ?date.
  FILTER(lang(?stdLabel)="en")
  FILTER(lang(?docLabel)="en")
  FILTER(contains(?stdLabel, "'''
parameterValue += searchString
parameterValue += '''")) 
}
ORDER BY ?stdLabel'''

------
and here's the printout section:
------

for item in items:
    print()
    print(item['docLabel']['value'])
    print(item['date']['value'])
    print(item['stdLabel']['value'])
    print(item['url']['value'])

------

Note: the URLs that are returned are access URLs, NOT the IRI identifiers for the documents!

The following is a variation of the API call above.  In this variation, you again enter part of the name of a standard (or press Enter for all), and the call retrieves the names of all of the contributors (whose roles might have included author, editor, translator, reviewer, or review manager).  Keep the searchString input line from the previous example and replace the parameter value code with this:

------

parameterValue ='''PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT DISTINCT ?contributor ?stdLabel
FROM <http://rs.tdwg.org/>
WHERE {
  ?standard a dcterms:Standard.
  ?standard rdfs:label ?stdLabel.
  ?standard dcterms:hasPart ?document.
  ?document a foaf:Document.
  ?document dcterms:contributor ?contribUri.
  ?contribUri rdfs:label ?contributor.
  FILTER(contains(?stdLabel, "'''
parameterValue += searchString
parameterValue += '''")) 
}
ORDER BY ?contributor'''

------
Printout code:
------

for item in items:
    print()
    print(item['contributor']['value'])
    print(item['stdLabel']['value'])

------

Note: assembling this list of documents was my best shot at determining which documents should be considered part of the standards themselves, as opposed to ancillary documents that are not part of the standards.  I may have missed some, or included some that aren't considered key to the standards.  This is more of a problem with older documents whose status was not clearly designated.

The SDS isn't very explicit about how to assign all of the properties that should probably be assigned to documents, so some information that might be important is missing, such as documentation of contributor roles.  Also, I could not determine who all of the review managers were or where the authoritative locations for all of the documents were, nor could I find prior versions of some documents.  So this part of the TDWG standards metadata still needs some work.

An actual Linked Data application

In the previous examples, the data involved were limited to metadata about TDWG standards.  However, we can make an API call that is an actual, bona fide application of Linked Data.  Data from the Bioimages project are available as RDF/XML.  You can examine the human-readable web page of an image at http://bioimages.vanderbilt.edu/thomas/0488-01-01 and the corresponding RDF/XML here.  Both the human- and machine-readable versions of the image metadata use Darwin Core or Audubon Core terms for most of their properties.  However, the Bioimages metadata do not provide an explanation of what those TDWG vocabulary terms mean.

Both the Bioimages and TDWG metadata datasets have been loaded into the Vanderbilt Libraries SPARQL API, and we can include both of them in the query's dataset using the FROM keyword.  That allows us to make use of information from the TDWG dataset in the query of the Bioimages data, because the two datasets are linked by their use of common term IRIs.  In the query, we can ask for the metadata values for the image (from the Bioimages dataset) but include the definitions of the properties (from the TDWG dataset; not present in the Bioimages dataset).
------

iri = input('Enter the unabbreviated IRI of an image from Bioimages: ')

parameterValue ='''PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?label ?value ?definition
FROM <http://rs.tdwg.org/>
FROM <http://bioimages.vanderbilt.edu/images>
WHERE {
    <'''
parameterValue += iri
parameterValue += '''> ?property ?value.
    ?property skos:prefLabel ?label.
    ?property skos:definition ?definition.
    FILTER(lang(?label)="en")
    FILTER(lang(?definition)="en")
}'''

------
Printout code:
------

for item in items:
    print()
    print(item['label']['value'])
    print(item['value']['value'])
    print(item['definition']['value'])

------

You can try this script out on the example IRI I gave above (http://bioimages.vanderbilt.edu/thomas/0488-01-01) or on any other image identifier in the collection (listed under "Refer to this permanent identifier for the image:" on any of the image metadata pages that you get to by clicking on an image thumbnail).  

Conclusion

Hopefully, these examples have given you a taste of the kinds of metadata about TDWG standards that can be retrieved using an API.  There are several final issues that I should discuss before I wrap up this post.  I'm going to present them in a Q&A format.

Q: Can I build an application to use this API?
A: Yes, you could.  We intend for the Vanderbilt SPARQL API to remain up indefinitely at the endpoint URL given in the examples.  However, we can't make a hard promise about that, and the API is not set up to handle large amounts of traffic.  There aren't any usage limits, and consequently it has already been crashed once by someone who hit it really hard.  So if you need a robust service, you should probably set up your own installation of Blazegraph and populate it with the TDWG dataset.

Q: How can I get a dump of the TDWG data to populate my own version of the API?
A: The simplest way is to send this query to the Vanderbilt SPARQL API as above, with an Accept header of text/turtle:

CONSTRUCT {?s ?p ?o}
FROM <http://rs.tdwg.org/>
WHERE {?s ?p ?o}

URL-encoded, the query is:

CONSTRUCT%20%7B%3Fs%20%3Fp%20%3Fo%7D%0AFROM%20%3Chttp%3A%2F%2Frs.tdwg.org%2F%3E%0AWHERE%20%7B%3Fs%20%3Fp%20%3Fo%7D

If you use Postman, you can use the dropdown on the Send button to choose Send and Download, then save the data in a file that you can load into your own instance of Blazegraph or some other SPARQL endpoint/triplestore.  (There are approximately 43,000 statements (triples) in the dataset, so copy and paste is not a great method of putting them into a file.)  If your triplestore doesn't support RDF/Turtle, you can get RDF/XML instead by using an Accept header of application/rdf+xml.
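
If you would rather script the download than use Postman, a minimal sketch along the lines of the earlier examples (the output file name is just a placeholder of my choosing) might look like this:
------

import requests

constructQuery = '''CONSTRUCT {?s ?p ?o}
FROM <http://rs.tdwg.org/>
WHERE {?s ?p ?o}'''

r = requests.get('https://sparql.vanderbilt.edu/sparql',
                 headers={'Accept': 'text/turtle'},
                 params={'query': constructQuery})

# write the raw response bytes to a Turtle file that can be loaded into your own triplestore
with open('tdwg-standards-metadata.ttl', 'wb') as fileObject:
    fileObject.write(r.content)

------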

There is a better method of acquiring the data that uses the authoritative source data, but I'll have to describe that in a subsequent post.

Q: How accurate are the data?
A: I've spent many, many hours over the last several years curating the source data in GitHub.  Nevertheless, I still discover errors almost every time I try new queries on the data.  If you discover errors, put them in the issues tracker and I'll try to fix them.

Q: How would this work for future controlled vocabularies?
A: This is a really important question.  It's so important that I'm going to address it in a subsequent post in the series.

Q: How can I retrieve information from the API about resources that weren't described in the examples?
A: Since a SPARQL endpoint is essentially a program-it-yourself API, all you need is the right SPARQL query to retrieve the information you want.  First you need a clear idea of the question you want to answer.  Then you've got two options: find someone who knows how to write SPARQL queries and get them to write the query for you, or teach yourself how to write SPARQL queries and do it yourself.  You can test your queries by pasting them into the box at https://sparql.vanderbilt.edu/ as you build them.  It is not possible to create the queries without understanding the underlying data model (the graph model) and the machine-readable properties assigned to each kind of resource.  That's why I wrote the first (boring) parts of this series and why we wrote the specification itself.

Q: Where did the data in the dataset come from and how is it managed?
A: That is an excellent question.  Actually it is several questions:

- where does the data come from? (answer: the source csv tables in GitHub)
- how does the source data get turned into machine-readable data?
- how does the machine-readable data get into the API?

One of the beauties of REST is that when you request a URI from a server, you should be able to get a useful response from the server without having to worry about how the server generates that response.  What that means in this context is that the intermediate steps that lie between the source data and what comes out of the API (the answers to the second and third questions above) can change and the client should never notice the difference since it would still be able to get exactly the same response.  That's because the processing essentially involves implementing a mapping between what's in the tables on GitHub and what the SDS says the standardized machine-readable metadata should look like.  There is no one particular way that mapping must happen, as long as the end result is the same.  I will discuss this point in what will probably be the last post of the series.


