Monday, April 4, 2016

Controlled values for Country from Darwin Core

This is the last post in a rather technical series of posts where I've examined what might be appropriate ways to construct controlled vocabularies that could be used to provide values for properties in Biodiversity Information Standards (TDWG) vocabularies.  In that context, I've been looking at appropriate ways to use terms from the W3C Simple Knowledge Organization System (SKOS) Recommendation to write those vocabularies.

To recap, in the first post, I looked at the differences between two kinds of vocabularies, ontologies and thesauri, and concluded that SKOS played well with thesauri but not so well with ontologies.  In the second post, I tried my hand at making a controlled vocabulary for Subject Category (Iptc4xmpExt:CVterm) from Audubon Core.  That was a fairly standard exercise in constructing a hierarchical SKOS concept scheme, and it was relatively uncomplicated since I was able to mint my own terms without the complications of having to conform to the demands of any standards process.  In the third post, I constructed a SKOS concept scheme by co-opting in its entirety a vocabulary created by GBIF to be the source of values for Establishment Means from Darwin Core.  That effort was more complicated since I was using borrowed terms, and also because there was a need to delineate not only URIs to be used as values for dwciri:establishmentMeans, but also literal values to be used with dwc:establishmentMeans.

In this post, I'm going to take on a different kind of challenge.  The definition of the Darwin Core term dwc:country is "the name of the country or major administrative unit in which the Location occurs. Recommended best practice is to use a controlled vocabulary such as the Getty Thesaurus of Geographic Names."  Although the definition offers some latitude in the choice of controlled vocabularies, it very pointedly suggests that the Getty Thesaurus of Geographic Names would be a good option.  In contrast to my previous two experiments where I had to make up the RDF to define the controlled vocabularies, in this case the controlled vocabulary is already expressed as RDF.  The challenge this time is that there is too much RDF available and my task is to sort out the parts of it that are actually needed for the purpose of providing controlled values for dwc:country.

The Getty Thesaurus of Geographic Names

The Thesaurus of Geographic Names (TGN) is only one of the vocabularies published by the Getty Research Institute as Linked Open Data.  It's pretty massive.  A compressed file containing the explicit triples in N-Triples format is about 1 GB in size, and uncompressed the full dataset of explicit triples is about 25 GB.  Fortunately, the data can be explored either by a SPARQL endpoint or by searching for particular names using a web interface.  The data are freely available under an Open Data Commons Attribution License (ODC-By 1.0) which allows "Extraction and Re-utilisation of the whole or a Substantial part of the Contents" and "Creation of Derivative Databases".  Cool!  That's just what I want to do.  Since I'm going to be using data from the database in the examples, I should provide attribution.  Here goes:
This report contains information from Thesaurus of Geographic Names (TGN)® which is made available under the ODC Attribution License.
Now that I've gotten that over with, let's explore!

Country names in the TGN

Using the web search form, I searched for "Spain", then picked "Spain .......(nation)".  There was a lot of stuff in the results.  From the standpoint of this project, I'm interested in the names for Spain.  There are 30 names or abbreviations listed in different languages and using several types of characters including Latin, Cyrillic, Chinese, and Arabic.  Since Darwin Core recommends TGN as a controlled vocabulary, I need to know which of these many names is the preferred one.  Fortunately, the first one listed ("España") is designated as "preferred".  I do not want to have to find every country name myself by looking it up, so I clicked on the Semantic View N3/Turtle link, which downloaded a 1233 triple Turtle file so that I could see how the RDF was structured.

Right away, I see that there is a problem.  The skos:Concept for Spain (http://vocab.getty.edu/tgn/1000095, abbreviated tgn:1000095) has ten skos:prefLabel values.  So we are left with the problem that I discussed in my previous post: how does one indicate in RDF which of the several preferred labels is the one text string that should be used to refer to the concept as a literal?  Answering that question required some digging in the RDF.  In my previous post I mentioned the problem caused by the fact that RDF does not allow a literal to be the subject of a triple.  If I could use a preferred label literal as the subject of a triple, I could give "España" some property that indicated it was the single "preferred" value to be used in preference to the others.  But I can't.  SKOS gets around this problem in a manner similar to the DCAM model that I talked about in my previous post.  The SKOS W3C Recommendation includes an extension called the SKOS eXtension for Labels (SKOS-XL), described in Appendix B of the SKOS Reference.  It defines the skosxl:Label (http://www.w3.org/2008/05/skos-xl#Label) class.  Each instance of that class can be assigned a URI and has a single plain literal form that is linked to the instance by the skosxl:literalForm property.  The semantics of SKOS-XL are such that skosxl:Label instances can be "dumbed-down" to be "vanilla" SKOS labels.[1]  Here's an example:

tgn:1000095 skosxl:prefLabel tgn_term:131-es.
tgn_term:131-es a skosxl:Label;
                skosxl:literalForm "España"@es.

entails

tgn:1000095 skos:prefLabel "España"@es.
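
The dumbing-down rule can also be sketched in a few lines of Python.  This is only an illustration of the entailment, not a reasoner: triples are plain tuples of abbreviated strings, and the helper name dumb_down is made up for this example.

```python
# Minimal sketch of the SKOS-XL "dumbing-down" entailment.
# Triples are (subject, predicate, object) tuples of strings.
# dumb_down is a hypothetical helper name, not part of any library.

XL_TO_SKOS = {
    "skosxl:prefLabel": "skos:prefLabel",
    "skosxl:altLabel": "skos:altLabel",
    "skosxl:hiddenLabel": "skos:hiddenLabel",
}

def dumb_down(triples):
    """Return the plain SKOS label triples entailed by the SKOS-XL triples."""
    # Map each skosxl:Label instance to its literal form.
    literal_forms = {s: o for s, p, o in triples if p == "skosxl:literalForm"}
    entailed = set()
    for s, p, o in triples:
        if p in XL_TO_SKOS and o in literal_forms:
            entailed.add((s, XL_TO_SKOS[p], literal_forms[o]))
    return entailed

graph = [
    ("tgn:1000095", "skosxl:prefLabel", "tgn_term:131-es"),
    ("tgn_term:131-es", "rdf:type", "skosxl:Label"),
    ("tgn_term:131-es", "skosxl:literalForm", '"España"@es'),
]

print(dumb_down(graph))
```

The single triple it prints corresponds to the entailed skos:prefLabel triple shown above.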

Creating skosxl:Label instances allows one to say things about those instances, such as tracking provenance.  I was curious to know how TGN indicated that a particular skosxl:Label instance was the actual "preferred" one of the many skosxl:prefLabel values.  It appears that the "preferred" status is conferred on the skosxl:prefLabel value that is listed first.  The order of listing is indicated using the gvp:displayOrder (http://vocab.getty.edu/ontology#displayOrder) property; for example, the skosxl:Label instance for "España"@es has this triple:

tgn_term:131-es gvp:displayOrder "1"^^xsd:positiveInteger.

That seems to be a somewhat tenuous method for demarcating the "preferred" preferred label, but that's what they do.  
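
To make the selection rule concrete, here is a small Python sketch of it, assuming the label data have already been pulled into a list of dictionaries.  The field names, the second term identifier, and its display order of 2 are invented for illustration; only tgn_term:131-es and its display order of 1 come from the TGN data above.

```python
# Sketch: pick the "preferred" preferred label(s) by gvp:displayOrder.
# Field names and the second record are assumptions made for this example.

labels = [
    {"term": "tgn_term:131-es", "literal": "España", "lang": "es", "display_order": 1},
    {"term": "tgn_term:PLACEHOLDER-en", "literal": "Spain", "lang": "en", "display_order": 2},
]

def first_displayed(labels):
    """Return the labels whose display order is 1."""
    return [label for label in labels if label["display_order"] == 1]

for label in first_displayed(labels):
    print(label["literal"], label["lang"])
```

The function returns a list rather than a single label on purpose, since nothing in the data model guarantees that only one label has a display order of 1.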

Searching for the "preferred" skosxl:prefLabel

My next step was to check out the Getty Vocabularies SPARQL endpoint to see if I could grab "preferred" labels.  I started with this query to pull the "preferred" label for Spain:

SELECT ?label ?language WHERE {
  <http://vocab.getty.edu/tgn/1000095> skosxl:prefLabel ?xlLabel.
  ?xlLabel gvp:displayOrder "1"^^xsd:positiveInteger.
  ?xlLabel skosxl:literalForm ?prefLabel.
  BIND (str(?prefLabel) AS ?label)
  BIND (lang(?prefLabel) AS ?language)
  }

Notes:
1. The second triple pattern matches only the skosxl:Label instances that are displayed first.
2. The two BIND expressions separate the literal from the language tag.  Eventually, I'm going to discard the language tag, but for now I want to be able to see what it is.
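
If a query like this is run from a script rather than the web form, the same separation of string and language tag can be done on the client side.  The sketch below assumes the endpoint returns the W3C SPARQL 1.1 Query Results JSON Format and that ?prefLabel itself was selected; the response text is hand-written for the example, not captured from the Getty endpoint.

```python
import json

# A hand-written response in the SPARQL 1.1 Query Results JSON Format,
# shaped like what the Spain query might return if ?prefLabel were selected.
response_text = """
{
  "head": {"vars": ["prefLabel"]},
  "results": {"bindings": [
    {"prefLabel": {"type": "literal", "xml:lang": "es", "value": "España"}}
  ]}
}
"""

def split_literals(text, var):
    """Yield (lexical form, language tag) pairs for one result variable."""
    data = json.loads(text)
    for binding in data["results"]["bindings"]:
        term = binding[var]
        # Like lang() in SPARQL, fall back to "" when there is no language tag.
        yield term["value"], term.get("xml:lang", "")

rows = list(split_literals(response_text, "prefLabel"))
print(rows)
```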

As expected, running the query produces "España" as the label and "es" as the language.  The next step is to figure out how to screen out the concepts that represent countries.  I rummaged around in the RDF for Spain and noticed that there is a gvp:placeType property that looked promising. Spain has a whole bunch of values for that property, so I ran this query to find out what they were:

SELECT DISTINCT ?uri ?label where {
  tgn:1000095 gvp:placeType ?uri.
  ?uri skos:prefLabel ?label.
  FILTER (lang(?label)="en")
  }  

It looks like the one I want is aat:300128207, which has the label "nations".  Now I'm in a position to get the "preferred" labels for all countries:

SELECT ?label ?language ?stripEngLabel WHERE {
  ?country gvp:placeType aat:300128207.
  ?country skosxl:prefLabel ?xlLabel.
  OPTIONAL {
            ?country skos:prefLabel ?engLabel.
            FILTER (lang(?engLabel)="en")
            BIND (str(?engLabel) AS ?stripEngLabel)
            }
  ?xlLabel gvp:displayOrder "1"^^xsd:positiveInteger.
  ?xlLabel skosxl:literalForm ?prefLabel.
  BIND (str(?prefLabel) AS ?label)
  BIND (lang(?prefLabel) AS ?language)
  }
ORDER BY ?stripEngLabel

Notes:
1. Triple patterns 1 and 2 extend the earlier Spain-specific query to all nations.
2. Because some of the "preferred" prefLabels are in languages that weren't familiar to me, I also had the query display the preferred English labels for the countries (triple patterns 3 to 5).  These are optional since a few countries are missing them.
3. Triple patterns 6 to 9 are the same as in the earlier query.

The results tell me several interesting things about the "preferred" prefLabels.  There can actually be several skosxl:prefLabel instances that are assigned a gvp:displayOrder value of 1 for each country.  However, although they have different language tags, in nearly all cases they have identical literal values for a given country.  That string generally represents the name of that country in a language that predominates in that country.  So for example, Brazil is spelled "Brasil" because that's its Portuguese spelling, and Finland is "Suomi" because that's what "Finland" is in Finnish.  In cases where the name is normally rendered in another script, the preferred name is transliterated in Latin characters. So Bhutan (འབྲུག་ཡུལ) is rendered as "Druk Yul" and Russia (Росси́я) is rendered as "Rossija".  Another thing to note is that the ASCII character set is not sufficient to represent all of the "preferred" preferred labels.  Mozambique is "Moçambique", Nepal is "Nepāl", Albania is "Shqipëria", Azerbaijan is "Azǝrbaycan", Viet Nam is "Việt Nam", and Cambodia is "Kâmpŭchéa".  The list of countries as I've screened it also includes "Confederate States of America" and "Navajo Nation", which aren't typically found on most lists of countries.  Surprisingly, China is missing from the results.  It comes up via the web search, with the transliterated "Zhongguo" as the "preferred" option.  When I downloaded the RDF/Turtle, the skosxl:Label instance that is assigned gvp:displayOrder of 1 (tgn_term:159-zh-Latn) is not one of the skosxl:prefLabel values.  Rather, it's a value for the skosxl:altLabel property.  I think this may just be an error rather than intentional.  So although my query is a pretty efficient way to grab the "preferred" preferred labels, the results still probably will need some manual curation.

To collapse the list down to non-redundant values, I changed the form to SELECT DISTINCT and removed the variable for the language tag:

SELECT DISTINCT ?label ?stripEngLabel WHERE {
  ?country gvp:placeType aat:300128207.
  ?country skosxl:prefLabel ?xlLabel.
  OPTIONAL {
            ?country skos:prefLabel ?engLabel.
            FILTER (lang(?engLabel)="en")
            BIND (str(?engLabel) AS ?stripEngLabel)
            }
  ?xlLabel gvp:displayOrder "1"^^xsd:positiveInteger.
  ?xlLabel skosxl:literalForm ?prefLabel.
  BIND (str(?prefLabel) AS ?label)
  }
ORDER BY ?stripEngLabel

Now each country appears in the list only once (mostly).  Armenia and Congo (Republic of) each appear twice: Armenia with "preferred" prefLabels of "Armenia" and "Hayastan", and Congo (Republic of) as "Congo" and "Congo (Republic of)".  Armenia comes up twice because there are two records for it: tgn:7004538 and tgn:7006651.   The latter URI is the one associated with the record that comes up in the web search.  I'm not sure what makes tgn:7004538 different.  For Congo, the web search shows "Congo (Republic of)" as the "preferred" preferred label.  It comes up twice in the search because there are three skosxl:Label instances with gvp:displayOrder of 1.  Two of them (the French and Spanish versions) have skosxl:literalForm values with "Congo" literals and the English version has a skosxl:literalForm value with a "Congo (Republic of)" literal.  I suppose this is a mistake - perhaps the English form was changed from "Congo" to "Congo (Republic of)" without getting downgraded to a gvp:displayOrder of greater than 1 (French is the official language there).
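
Cases like Congo can be flagged automatically instead of being spotted by eye.  Here is a rough Python sketch, assuming the query results have been saved as rows of concept, literal, and language tag.  The rows are illustrative, and "tgn:CONGO" is a placeholder rather than the real TGN identifier.

```python
from collections import defaultdict

# Illustrative rows of (concept, displayOrder-1 literal, language tag).
# "tgn:CONGO" is a placeholder, not a real TGN identifier.
rows = [
    ("tgn:1000095", "España", "es"),
    ("tgn:CONGO", "Congo", "fr"),
    ("tgn:CONGO", "Congo", "es"),
    ("tgn:CONGO", "Congo (Republic of)", "en"),
]

def conflicting_values(rows):
    """Map each concept to its distinct literals, keeping only concepts with more than one."""
    values = defaultdict(set)
    for concept, literal, _lang in rows:
        values[concept].add(literal)
    return {c: sorted(v) for c, v in values.items() if len(v) > 1}

print(conflicting_values(rows))
```

Note that this only catches disagreement within a single concept URI.  A country that is represented by two separate records, as Armenia is, would need a different check, such as grouping by the English label.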

Considerations for creating a TDWG controlled vocabulary

In my previous post, I described some considerations for creating a controlled vocabulary for values of the Darwin Core property dwc:establishmentMeans and its URI analogue, dwciri:establishmentMeans.  The distinction between these two terms is that the first one is intended to be used with a literal value that represents the category (e.g. "native", "invasive", "managed", etc.), while the second term is intended to be used with a URI value.  Historically, most datasets provided strings as values for dwc:establishmentMeans, which would be fine in searches or for categorizing records.  However, using a URI value for dwciri:establishmentMeans would allow a client to discover other useful information about the category, such as whether there was a broader category that would also be applicable, and preferred labels for the category in other languages.

The situation with a controlled vocabulary for the values of dwc:country is different.  If you examine Section 3.7 of the Darwin Core RDF Guide ("dwc: namespace terms that have analogues in the dwciri: namespace"), you'll find dwc:establishmentMeans, but not dwc:country.  Instead, dwc:country is listed in Section 3.5: "convenience terms" that are expected  to be used only with literal values.  Convenience terms are described in Section 2.7 of the guide.  When identification is based solely on literals, it is necessary to specify literals for several levels in a geographic hierarchy, since "Allen County" could be located in the state of Indiana, Kansas, Ohio, or Kentucky.  However, if a provider can provide an unambiguous identifier for the lowest level in a geographic hierarchy (such as the URI http://sws.geonames.org/5145576/), it should not be necessary for the provider to state the higher levels in the hierarchy (state, country, continent) in every database record.  That information could be derived from a single set of relationships associated with the URI-identified resource.  

The practical implication for this exercise is that in our controlled vocabulary we don't need to provide information about the hierarchy associated with the controlled vocabulary term - if a provider cares about that, it should provide a value for a dwciri:inDescribedPlace property associated with the location and let users make use of information linked to its URI value (e.g. a GeoNames URI) to discover the hierarchy.  In fact we really don't care that much about the URI associated with the country literal, since it won't be used to populate database records.  However, supplying the TGN URI for the country concept would make it possible to track the provenance of the literals and provide a way to link to alternative labels listed in a different document.  

Use cases

1. Find the controlled literal value given the English name of the country.
2. Find the controlled literal value given the name of the country in various languages.
3. Aid in data cleaning by finding alternate labels for the country.
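
Use cases 1 and 2 amount to a lookup from any known name to the one controlled literal.  A minimal Python sketch of that lookup follows, using a tiny hand-made sample of values mentioned in this post.  The dictionary layout and the function names are assumptions for illustration only.

```python
# Sketch of use cases 1 and 2: find the controlled value from any known name.
# The sample data and helper names are made up for this example.

controlled = {
    "tgn:1000095": "España",
    "tgn:7002435": "Rossija",
}
known_names = {
    "tgn:1000095": ["Spain", "España"],
    "tgn:7002435": ["Russia", "Rossija"],
}

def build_index(known_names):
    """Map each case-folded name to the concept it labels."""
    index = {}
    for concept, names in known_names.items():
        for name in names:
            index[name.casefold()] = concept
    return index

def controlled_value(name, index, controlled):
    """Return the controlled literal for a name, or None if the name is unknown."""
    concept = index.get(name.strip().casefold())
    return controlled[concept] if concept else None

index = build_index(known_names)
print(controlled_value("spain", index, controlled))
print(controlled_value("Russia", index, controlled))
```

The same index supports use case 3 in reverse: the names listed for a concept are the alternate labels one would try when cleaning messy country strings.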

The Controlled Vocabulary

(A document containing the complete graph for the examples below is here.)  The description of the TDWG controlled vocabulary is very similar to the example in my previous post, so the notes I made there apply here as well.

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
@prefix dwc: <http://rs.tdwg.org/dwc/terms/>.
@prefix dc: <http://purl.org/dc/elements/1.1/>.
@prefix dcterms: <http://purl.org/dc/terms/>.
@prefix dcmitype: <http://purl.org/dc/dcmitype/>.
@prefix dcam: <http://purl.org/dc/dcam/>.
@prefix dcat: <http://www.w3.org/ns/dcat#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
@prefix foaf: <http://xmlns.com/foaf/0.1/>.
@prefix tgn: <http://vocab.getty.edu/tgn/>.
@prefix vann: <http://purl.org/vocab/vann/>.

<http://rs.tdwg.org/cvterms/country> a dcmitype:Dataset;
     rdfs:label "TDWG Country Controlled Vocabulary"@en;
     dcterms:title "TDWG Country Controlled Vocabulary"@en;
     rdfs:comment "This is a controlled vocabulary intended to be used as values for dwc:country."@en;
     dcterms:description "This is a controlled vocabulary intended to be used as values for dwc:country."@en;
     skos:note "This is a controlled vocabulary intended to be used as values for dwc:country."@en;
     vann:termGroup <http://rs.tdwg.org/cvterms/country/>;
     dc:publisher "Biodiversity Information Standards"@en.

Note:
1. I used the property vann:termGroup to link the vocabulary to the term list.  It's defined in the VANN vocabulary as "A group of related terms in a vocabulary".  The property has no formal semantics and no declared range, so I think this usage is OK, although there aren't a lot of examples to follow.  (See my previous post for more on the idea of including terms in a term list rather than directly in the vocabulary.)

Here's how I described the term list.  Again, the style is similar to what I used in my previous post when describing the term list for the dwc:establishmentMeans controlled vocabulary:

<http://rs.tdwg.org/cvterms/country/> a dcat:Dataset, skos:Collection;
     rdfs:label "Country Term List"@en;
     dcterms:title "TDWG Country Term List"@en;
     rdfs:comment "This document contains a list of controlled vocabulary terms whose rdf:value should be used as literal values for dwc:country."@en;
     dcterms:description "This document contains a list of controlled vocabulary terms whose rdf:value should be used as literal values for dwc:country."@en;
     skos:note "This document contains a list of controlled vocabulary terms whose rdf:value should be used as literal values for dwc:country."@en;
     dcterms:modified "2015-03-17"^^xsd:date;
     rdfs:seeAlso <http://vocab.getty.edu/tgn/>; # here is the link to the Getty Thesaurus of Geographic Names
     skos:member <http://vocab.getty.edu/tgn/1000155>;
     skos:member <http://vocab.getty.edu/tgn/1000180>;
     skos:member <http://vocab.getty.edu/tgn/7001725>;
     skos:member <http://vocab.getty.edu/tgn/7002435>;
     skos:member <http://vocab.getty.edu/tgn/1000111>;
     skos:member <http://vocab.getty.edu/tgn/1000095>;
     dcterms:isPartOf <http://rs.tdwg.org/cvterms/country>;
     dcterms:license <http://opendatacommons.org/licenses/by/1.0/>;
     <http://ns.adobe.com/xap/1.0/rights/UsageTerms> "Contains information from Getty Thesaurus of Geographic Names (TGN) which is made available under the ODC Attribution License.";
     dcterms:source <http://vocab.getty.edu/tgn/>;
     dc:publisher "Biodiversity Information Standards"@en.

Notes:
1. A key difference here from the dwc:establishmentMeans term list is that I modeled this term list as a skos:Collection rather than a skos:ConceptScheme.  Concept collections are described in Section 9 of the SKOS Reference. They are described simply as "labeled and/or ordered groups of SKOS concepts" with the comment "Collections are useful where a group of concepts shares something in common, and it is convenient to group them under a common label...".  That's pretty much the situation with this term list - it's a set of concepts we want to group under the URI of our term list.  In contrast, a skos:ConceptScheme "corresponds roughly to the notion of an individual thesaurus...", which in the present situation would be the whole Getty Thesaurus of Geographic Names.  
2. I did a tricky little thing here with the term list URI.  That URI is the same as the URI for the vocabulary except for a trailing "/" character.  If the server is set up appropriately, both of those URIs will dereference to the same document (the one that contains the graph that defines the vocabulary, term list, and terms), but because the URIs differ by the "/" character, they identify different entities.
3. The triples with dcterms:license, xmpRights:UsageTerms, and dcterms:source properties are an attempt to comply with the requirements of TGN's ODC Attribution license.  There might be a better way to do it.  xmpRights:UsageTerms is recommended by Audubon Core, but annoyingly, it is not dereferenceable.
4. I linked the term list to the concepts for the country names using the property skos:member.  Because that property has the domain skos:Collection, using it would entail that the term list was a skos:Collection even if I hadn't asserted it explicitly.  In the example above, I've only linked to six country concepts - in the real list there would be over 100 triples using skos:member as their predicate.

If the description of the country concepts were adequate for our purposes, and if we assumed that a client would dereference the country concept URIs (or have access to the TGN triples loaded into a triple store), we would be done.  However, as I discussed in the section about "preferred" preferred labels, getting the one preferred literal for a given country is not straightforward and we should be doing the work of finding it rather than making the users of our controlled vocabulary do it for themselves.  I'll describe how to do that work next.

Here's a snippet of RDF that describes the six country name concepts that I linked to using skos:member in the term list:

<http://vocab.getty.edu/tgn/1000155> rdf:value "République centrafricaine";
           rdfs:label "Central African Republic"@en;
           dcam:memberOf tgn:;
           skos:inScheme tgn:;
           dcterms:isPartOf <http://rs.tdwg.org/cvterms/country/>;
           foaf:focus <http://sws.geonames.org/239880/>;
           a skos:Concept.

<http://vocab.getty.edu/tgn/1000180> rdf:value "Moçambique";
           rdfs:label "Mozambique"@en;
           dcam:memberOf tgn:;
           skos:inScheme tgn:;
           dcterms:isPartOf <http://rs.tdwg.org/cvterms/country/>;
           foaf:focus <http://sws.geonames.org/1036973/>;
           a skos:Concept.

<http://vocab.getty.edu/tgn/7001725> rdf:value "Swaziland";
           rdfs:label "Swaziland"@en;
           dcam:memberOf tgn:;
           skos:inScheme tgn:;
           dcterms:isPartOf <http://rs.tdwg.org/cvterms/country/>;
           foaf:focus <http://sws.geonames.org/934841/>;
           a skos:Concept.

<http://vocab.getty.edu/tgn/7002435> rdf:value "Rossija";
           rdfs:label "Russia"@en;
           dcam:memberOf tgn:;
           skos:inScheme tgn:;
           dcterms:isPartOf <http://rs.tdwg.org/cvterms/country/>;
           foaf:focus <http://sws.geonames.org/2017370/>;
           a skos:Concept.

<http://vocab.getty.edu/tgn/1000111> rdf:value "Zhongguo";
           rdfs:label "China"@en;
           dcam:memberOf tgn:;
           skos:inScheme tgn:;
           dcterms:isPartOf <http://rs.tdwg.org/cvterms/country/>;
           foaf:focus <http://sws.geonames.org/1814991/>;
           a skos:Concept.

<http://vocab.getty.edu/tgn/1000095> rdf:value "España";
           rdfs:label "Spain"@en;
           dcam:memberOf tgn:;
           skos:inScheme tgn:;
           dcterms:isPartOf <http://rs.tdwg.org/cvterms/country/>;
           foaf:focus <http://sws.geonames.org/2510769/>;
           a skos:Concept.

Notes:
1. In accordance with normal TDWG practice, I provided an English label for each term.
2. Each term is linked to the TGN by skos:inScheme (in accordance with the SKOS model) and dcam:memberOf (in accordance with the DCMI Abstract Model; DCAM).
3. In accordance with the DCMI Recommendation for expressing the DCAM as RDF, I've used rdf:value to link to the plain literal value string that is the preferred literal for the country name. (I discussed this at length in my previous post.)
4. Each term is linked to the term list using dcterms:isPartOf.  This conforms to the hierarchical pattern that I discussed in my previous post:

term isPartOf termList isPartOf vocabulary isPartOf standard

Consistently following that pattern allows a client to discover the containing term list, vocabulary, and standard for any term defined or borrowed by TDWG for one of its vocabularies.  
5. The property foaf:focus is taken from the FOAF Vocabulary. It's used to link to the "underlying or 'focal' entity associated with some SKOS-described concept." In this example, I'm linking the concept for a country name to the GeoNames instance of that country as a GeoNames gn:Feature (defined in the GeoNames Ontology).  I'm not 100% sure that this is correct, since I think I would say that the SKOS concept represents the name of the country, and that's not the same as the country itself (which is what I think the GeoNames feature represents).  However, I'm asserting it anyway, since TGN makes that kind of link to a place.  In the TGN RDF, that place is typed as a schema:Place and wgs:SpatialThing (which I don't think necessarily correspond to a gn:Feature either).  However, the TGN-minted URI is ad hoc and not well-known, whereas the GeoNames URI is well-known.  So from the standpoint of breaking down data silos, I think there is more value in linking to the GeoNames feature. The appropriateness of what I did on this point is probably something that should be discussed further.
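
Since every concept description follows the same pattern, the Turtle can be generated from a few values per country.  The Python sketch below is one way that might look.  The function names are made up for this example, and it assumes the prefixes declared earlier in the document are in effect.

```python
# Sketch: generate the Turtle description of one country concept.
# concept_turtle and turtle_literal are hypothetical helper names.

def turtle_literal(text):
    """Quote a string as a Turtle literal, escaping backslashes and double quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'

def concept_turtle(tgn_id, value, eng_label, geonames_id):
    """Return Turtle for a country concept, following the pattern used above."""
    lines = [
        f"<http://vocab.getty.edu/tgn/{tgn_id}> rdf:value {turtle_literal(value)};",
        f"           rdfs:label {turtle_literal(eng_label)}@en;",
        "           dcam:memberOf tgn:;",
        "           skos:inScheme tgn:;",
        "           dcterms:isPartOf <http://rs.tdwg.org/cvterms/country/>;",
        f"           foaf:focus <http://sws.geonames.org/{geonames_id}/>;",
        "           a skos:Concept.",
    ]
    return "\n".join(lines)

print(concept_turtle("1000095", "España", "Spain", "2510769"))
```

Called with the values for Spain, it reproduces the last snippet in the list above.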

Getting the country metadata

There is no possible way that I'm going to want to manually write the triples for the concepts for over 100 countries.  Fortunately, I can use the Getty Vocabularies SPARQL endpoint to get most of what I need.  

Constructing a graph from the Getty Vocabularies endpoint

I've already laid out a query above that will find the country concepts (mostly) and pull the "preferred" preferred label for that country.  What I need to do at this point is hack that query to construct the triples I need.  Unfortunately, I couldn't figure out how to get the Getty Vocabularies endpoint to do a CONSTRUCT query.  Fortunately, they do support remote queries.  So I set up a CONSTRUCT query on my local Stardog instance using the SERVICE keyword to pull the metadata that I need from the Getty endpoint.  Stardog will then let me save the triples of the constructed graph and I can do whatever I want with them.  Here's the query I used:

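PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX gvp: <http://vocab.getty.edu/ontology#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX skosxl: <http://www.w3.org/2008/05/skos-xl#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dcam: <http://purl.org/dc/dcam/>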

CONSTRUCT {
          ?country rdf:value ?value.
          ?country rdfs:label ?engLabel.
          ?country dcam:memberOf <http://vocab.getty.edu/tgn/>.
          ?country skos:inScheme <http://vocab.getty.edu/tgn/>.
          ?country dcterms:isPartOf <http://rs.tdwg.org/cvterms/country/>.
          ?country a skos:Concept.
          }
WHERE {
  SERVICE <http://vocab.getty.edu/sparql>
         {
         SELECT DISTINCT ?country ?value ?engLabel WHERE 
            {
            ?country gvp:placeType <http://vocab.getty.edu/aat/300128207>.
            ?country skosxl:prefLabel ?xlLabel.
            OPTIONAL {
                     ?country skos:prefLabel ?engLabel.
                     FILTER (lang(?engLabel)="en")
                     }
            ?xlLabel gvp:displayOrder "1"^^xsd:positiveInteger.
            ?xlLabel skosxl:literalForm ?prefLabel.
            BIND (str(?prefLabel) AS ?value)
            }
         ORDER BY ?engLabel
         }
     }

The output of the query in RDF/Turtle can be seen here.

Notes:
1. I got rid of the BIND statement from the earlier query that stripped the language tag from the English preferred label, since I want the preferred label to be language-tagged.
2. The results have all of the triples that were in country name concept snippets except for the foaf:focus link to the GeoNames feature.  I'll take care of that next.

Finding the GeoNames URIs

Unfortunately, GeoNames does not provide SPARQL access to their RDF data.  However, with a little searching, I found that FactForge has loaded the GeoNames triples into their endpoint.  I ran this little query to see if I could pull up one of the records:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT * WHERE {
  <http://sws.geonames.org/239880/> rdfs:label ?label.
  FILTER (lang(?label)="en")
  }

Voilà!  I got "Central African Republic"@en as a response!  After a lot of thrashing around, I figured out how to construct a query that would allow me to pull the GeoNames URIs and preferred English labels from the FactForge triplestore.  One problem I encountered was the result of their owl:sameAs declarations that asserted equivalence between the GeoNames URIs and DBpedia URIs for the countries.  When I searched for URIs of countries, I got the DBpedia URIs and not the GeoNames URIs.  I had to check the "Include inferred" and "Expand results over equivalent URIs" checkboxes before I could get my query to work.  Although their endpoint provides a function to save the query, I couldn't see a way to save the results, so again, I ended up running the query as a remote query via my local Stardog instance:

PREFIX dbp-ont: <http://dbpedia.org/ontology/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX gn: <http://www.geonames.org/ontology#>

CONSTRUCT {
          ?country dc:relation ?label.
          }
WHERE
  {
  SERVICE <http://factforge.net/sparql>
      {
      SELECT DISTINCT ?country ?label WHERE 
          {
          ?country gn:featureCode <http://www.geonames.org/ontology#A.PCLI>.
          ?country skos:prefLabel ?label.
          FILTER (lang(?label)="en")
          FILTER (strStarts(str(?country), "http://sws.geonames.org/"))
          }
          ORDER BY ?label
      }
  }

Notes:
1. I constructed a sort of "dummy" triple using dc:relation as a way to link the GeoNames URI with the preferred English label.  That's because the next step is to dump these triples in with the triples that I got from the Getty Vocabularies and I wanted to make sure that I used a predicate that wasn't used in any of the Getty triples.  Another option would have been to try to just run both this query and the earlier remote Getty query at the same time as a federated query.  But one has to be careful about the amount of data that gets sent via the remote queries - it's easy to make the queries hang when you're asking too much of the remote endpoint. I already wasted a lot of time getting this query to work, so I decided it was easier to just save the triples from both queries and load them into my local triple store to do the final merge of data.
2. I limited the query by requiring ?country to have a gn:featureCode of <http://www.geonames.org/ontology#A.PCLI>, the URI used by GeoNames to indicate independent political entities.
3. Each country is identified by a number of owl:sameAs URIs, so to make sure that the query only returned the GeoNames URIs, I used the second filter in the query.  Fortunately, the remote query must default to the settings equivalent to checking the "Include inferred" and "Expand results over equivalent URIs" checkboxes on the web form interface because I got the results I wanted.
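
The local merge mentioned in note 1 boils down to a join on the English label.  Here is a rough Python sketch of that join, which also sets aside any label that does not match exactly one GeoNames URI so that it can be reviewed by hand.  The sample rows use URIs that appear in this post, except the TGN URI for Andorra, which is a placeholder, and the function name is made up.

```python
from collections import defaultdict

# TGN side: English label -> concept URI.
tgn_by_label = {
    "Spain": "http://vocab.getty.edu/tgn/1000095",
    "Central African Republic": "http://vocab.getty.edu/tgn/1000155",
    "Andorra": "http://vocab.getty.edu/tgn/PLACEHOLDER",  # placeholder, not a real TGN URI
}

# GeoNames side: (feature URI, English label) pairs.
geonames_rows = [
    ("http://sws.geonames.org/2510769/", "Spain"),
    ("http://sws.geonames.org/239880/", "Central African Republic"),
    ("http://sws.geonames.org/3041565/", "Andorra"),
    ("http://sws.geonames.org/3041563/", "Andorra"),
]

def link_by_label(tgn_by_label, geonames_rows):
    """Join the two sides on label; return (links, labels needing manual review)."""
    by_label = defaultdict(list)
    for uri, label in geonames_rows:
        by_label[label].append(uri)
    links, needs_review = {}, {}
    for label, concept in tgn_by_label.items():
        uris = by_label.get(label, [])
        if len(uris) == 1:
            links[concept] = uris[0]
        else:
            # Zero matches or several matches: leave it for a human to sort out.
            needs_review[label] = uris
    return links, needs_review

links, needs_review = link_by_label(tgn_by_label, geonames_rows)
print(links)
print(needs_review)
```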

You can see the results of the query here. It is significant that there are several cases where there are two GeoNames URIs that were returned by the query.  For example, "Andorra"@en was associated with http://sws.geonames.org/3041565/ and http://sws.geonames.org/3041563/. When I investigated, I found that the first URI was for the Principality of Andorra (the independent political entity), but the second was for Andorra the capital.  Similarly, "Barbados"@en was associated with both http://sws.geonames.org/3374084/ (the independent political entity) and http://sws.geonames.org/3374085/ (the island).  I suspected that this was happening because of owl:sameAs declarations, which I verified with this query:

PREFIX dbp-ont: <http://dbpedia.org/ontology/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX gn: <http://www.geonames.org/ontology#>
      SELECT DISTINCT ?country1 ?label1 ?country2 ?label2 WHERE 
          {
          ?country1 gn:featureCode <http://www.geonames.org/ontology#A.PCLI>.
          ?country1 skos:prefLabel ?label1.
          FILTER (lang(?label1)="en")
          FILTER (strStarts(str(?country1), "http://sws.geonames.org/"))
          ?country2 gn:featureCode <http://www.geonames.org/ontology#A.PCLI>.
          ?country2 skos:prefLabel ?label2.
          FILTER (lang(?label2)="en")
          FILTER (strStarts(str(?country2), "http://sws.geonames.org/"))
          ?country1 owl:sameAs ?country2.
          FILTER (str(?country1) != str(?country2))
          }

I got results for Andorra, Barbados, Monaco, and Nauru, which were all of the countries that showed up twice in my list.  What this query tells me is that something was generating entailed owl:sameAs relationships for things that aren't really the same.  A likely cause would be that the two different entities were each declared owl:sameAs some third entity (perhaps a resource identified by a DBpedia URI or some other URI), and since owl:sameAs is transitive and symmetric, the two different entities would be entailed to be the same.  It's hard to figure out exactly what is going on since these queries only work when reasoning is turned on at the endpoint, so one can't easily parse out the assertions that are causing the trouble.  In any case, some manual curation is necessary to cull out the incorrect URIs for Andorra, Barbados, Monaco, and Nauru.
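
To make the suspected mechanism concrete, here is a minimal Python sketch (not part of the workflow described in this post). It treats owl:sameAs as an equivalence relation and groups URIs with a small union-find structure. The two GeoNames URIs are the Andorra ones mentioned above, but the DBpedia URI and both asserted links are hypothetical; the actual assertions held by FactForge are unknown.

```python
# Sketch: how a symmetric, transitive owl:sameAs can merge two distinct entities.
# The asserted links below are hypothetical and only illustrate the mechanism.

def same_as_clusters(assertions):
    """Group URIs into the equivalence classes implied by owl:sameAs pairs."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for a, b in assertions:
        parent[find(a)] = find(b)  # union the two classes

    clusters = {}
    for uri in list(parent):
        clusters.setdefault(find(uri), set()).add(uri)
    return list(clusters.values())


ANDORRA_COUNTRY = "http://sws.geonames.org/3041565/"
ANDORRA_OTHER = "http://sws.geonames.org/3041563/"
THIRD_PARTY = "http://dbpedia.org/resource/Andorra"  # hypothetical shared target

# Neither GeoNames URI is asserted to be the same as the other one...
asserted = [
    (ANDORRA_COUNTRY, THIRD_PARTY),
    (ANDORRA_OTHER, THIRD_PARTY),
]

# ...but both end up in one equivalence class via the third URI.
clusters = same_as_clusters(asserted)
for cluster in clusters:
    print(sorted(cluster))
```

If something like this is what happened, removing a single bad link to the shared third URI would separate the two entities again, which is why finding the asserted (rather than entailed) triples matters.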

Linking the GeoNames URIs to the TGN concept URIs

Before moving on, I did a little manual cleanup.  I deleted the data for Caria, Confederate States of America, Friesland, Gaul, Navajo Nation, Prussia, Samnium, Scythia, and Soviet Union,[2] and added labels for North and South Yemen in the TGN data.  I deleted Andorra the city, Barbados the island, Monaco the city, and Nauru the island from the FactForge triples. Then I ran this query:

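PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dcam: <http://purl.org/dc/dcam/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

CONSTRUCT {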
  ?countryConcept rdf:value ?value.
  ?countryConcept rdfs:label ?label.
  ?countryConcept dcam:memberOf <http://vocab.getty.edu/tgn/>.
  ?countryConcept skos:inScheme <http://vocab.getty.edu/tgn/>.
  ?countryConcept dcterms:isPartOf <http://rs.tdwg.org/cvterms/country/>.
  ?countryConcept foaf:focus ?countryFeature.
  ?countryConcept a skos:Concept.
  }
WHERE
  {
    ?countryConcept a skos:Concept.
    ?countryConcept rdf:value ?value.
    ?countryConcept rdfs:label ?label.
    OPTIONAL
       {
       ?countryFeature dc:relation ?label.
       }
    }

The resulting file is here.  The approach was pretty successful!  There were only 26 countries that failed to match labels.  Some had obvious problems like the parentheses in "Congo (Republic of)", "Vietnam" not matching with "Viet Nam", or "Holy See" vs. "Vatican City".  Some were just missing in the FactForge triples, like "Côte d'Ivoire" and "Palestine".  So there is a bit more manual curation required to finish this completely.  But I'm done messing with it for now.
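
Some of these misses come purely from exact string comparison. As a rough illustration (not something done in this post), the Python sketch below compares labels on a normalized key instead. The normalization rules are assumptions chosen for the example, and which dataset uses which spelling is also assumed.

```python
import re
import unicodedata


def normalize(label):
    """Reduce a country label to a crude comparison key (illustrative rules)."""
    text = unicodedata.normalize("NFKD", label)
    # Drop combining marks so accented and unaccented spellings compare equal.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Drop parenthetical qualifiers such as "(Republic of)".
    text = re.sub(r"\([^)]*\)", "", text)
    # Keep only lowercase letters and digits (removes spaces and punctuation).
    return re.sub(r"[^a-z0-9]", "", text.casefold())


def match_labels(tgn_labels, geonames_labels):
    """Pair labels whose normalized keys agree; report the rest as unmatched."""
    index = {normalize(g): g for g in geonames_labels}
    matched, unmatched = {}, []
    for label in tgn_labels:
        key = normalize(label)
        if key in index:
            matched[label] = index[key]
        else:
            unmatched.append(label)
    return matched, unmatched


matched, unmatched = match_labels(
    ["Viet Nam", "Holy See"],        # assumed to be one dataset's labels
    ["Vietnam", "Vatican City"],     # assumed to be the other's
)
print(matched)
print(unmatched)
```

Normalization recovers the spacing difference but not a genuinely different name such as "Holy See" vs. "Vatican City", and dropping parenthetical qualifiers could wrongly merge two different Congos, so human review is still needed.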

Putting it together

To put the whole graph together in one file, I added the triples about the vocabulary and the term list to the triples I made in the last step.  I had to run a small query to construct the

<http://rs.tdwg.org/cvterms/country/>
        skos:member <http://vocab.getty.edu/tgn/1000155>,
                    <http://vocab.getty.edu/tgn/1000180>, ...

triples.  When it was done, I uploaded it to the Heard Library triple store so that it could be played with using the SPARQL endpoint there.

Satisfying the use cases

The SPARQL queries (below) that satisfy the use cases can be run on the Vanderbilt Heard Library SPARQL endpoint at http://rdf.library.vanderbilt.edu/sparql?view.  The endpoint "knows" about the namespace abbreviations used in the examples, so the example queries can be run without the need for namespace declarations.

1. Find the controlled literal value given the English name of the country.

After all that work, the query for this use case is easy.  Here's a query that finds the controlled literal value for the English name "China":

 SELECT ?value WHERE
     {
     ?term dcterms:isPartOf <http://rs.tdwg.org/cvterms/country/>.
     ?term rdf:value ?value.
     ?term rdfs:label "China"@en.
     }

To list controlled literal values for all countries:

 SELECT ?value ?label WHERE
     {
     ?term dcterms:isPartOf <http://rs.tdwg.org/cvterms/country/>.
     ?term rdf:value ?value.
     ?term rdfs:label ?label.
     }
ORDER BY ?label

2. Find the controlled literal value given the name of the country in various languages.

This one is going to take a little extra work.  Here's a hack of an earlier query that will grab from TGN all of the preferred labels in any language for countries:

PREFIX gvp: <http://vocab.getty.edu/ontology#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

CONSTRUCT {
          ?country skos:prefLabel ?prefLabel.
          }
WHERE {
  SERVICE <http://vocab.getty.edu/sparql>
         {
         SELECT DISTINCT ?country ?prefLabel WHERE 
            {
            ?country gvp:placeType <http://vocab.getty.edu/aat/300128207>.
            ?country skos:prefLabel ?prefLabel.
            }
         }
     }

I saved the resulting triples in a file and uploaded it to the endpoint.  As I discussed in my last post, it is probably best to not include translations in the standards documents, but rather to include them in other documents that are not part of the standard.  One could then make modifications and additions to the translations without invoking the standards process.  Here is a query that would retrieve all of the controlled literal values in a particular language (Russian in this case):

SELECT ?value ?label WHERE
     {
     ?term dcterms:isPartOf <http://rs.tdwg.org/cvterms/country/>.
     ?term rdf:value ?value.
     OPTIONAL
        {
        ?term skos:prefLabel ?label.
        FILTER (langMatches(lang(?label),"ru"))
        }
     }
ORDER BY ?label

Note that I made the skos:prefLabel triple pattern and FILTER optional.  That's so the list will display the literal value for every country even if there is no translation in the target language.  The FILTER that I used returns every form of the language.  So if you try the code "zh" for Chinese, you will also get zh-latn tagged labels.  To filter for only the main language code, replace the FILTER line with:

FILTER (lang(?label)="ru")
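
The difference between the two filters can be mimicked outside SPARQL. The Python sketch below approximates langMatches() with the basic filtering scheme of RFC 4647 (a range matches a tag that equals it or that starts with the range followed by a hyphen) and contrasts it with plain equality. The function name and the sample tags are made up for the illustration.

```python
def lang_matches(tag, lang_range):
    """Approximate SPARQL langMatches() using RFC 4647 basic filtering."""
    tag, lang_range = tag.lower(), lang_range.lower()
    if lang_range == "*":
        return tag != ""  # "*" matches any non-empty language tag
    return tag == lang_range or tag.startswith(lang_range + "-")


tags = ["zh", "zh-latn", "ru", "en"]  # sample tags, for illustration only

# Like FILTER (langMatches(lang(?label), "zh")): subtags are included.
broad = [t for t in tags if lang_matches(t, "zh")]

# Like FILTER (lang(?label) = "zh"): only the bare tag survives.
exact = [t for t in tags if t == "zh"]

print(broad)
print(exact)
```

Note that the equality form is a plain string comparison, so it only matches tags written exactly as given in the filter.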

3. Aid in data cleaning by finding alternate labels for the country.

To meet this use case, we first need to grab all of the alternate labels for countries from TGN using a hack of the query from use case #2:

PREFIX gvp: <http://vocab.getty.edu/ontology#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

CONSTRUCT {
          ?country skos:altLabel ?altLabel.
          }
WHERE {
  SERVICE <http://vocab.getty.edu/sparql>
         {
         SELECT DISTINCT ?country ?altLabel WHERE 
            {
            ?country gvp:placeType <http://vocab.getty.edu/aat/300128207>.
            ?country skos:altLabel ?altLabel.
            }
         }
     }

As before, I saved the triples in a file and uploaded them into the Heard Library triplestore.

Now, if I want to find possible country literal values that people might have used for the country Swaziland (i.e. alternate versions of the country name, or the name in different languages), I can find all of the preferred and alternate labels that are in the TGN using this query:

SELECT DISTINCT ?stripLabel WHERE
     {
     ?term dcterms:isPartOf <http://rs.tdwg.org/cvterms/country/>.
     ?term rdf:value "Swaziland".
        {?term skos:prefLabel ?label.}
          UNION
       {?term skos:altLabel ?label.}
     BIND (str(?label) AS ?stripLabel)
     }
ORDER BY ?stripLabel

I used SELECT DISTINCT, since some language-tagged literals for country labels have the same string part.  The UNION allows either skos:prefLabel or skos:altLabel triples to match.  In the BIND statement, I remove the language tag, since that tag would not be likely to be found in the data that we are trying to clean.
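
Once such a list of stripped labels has been retrieved, it could be turned into a lookup table for cleaning. The following Python sketch shows one way that might work; the rows stand in for query results and are hypothetical (the actual TGN labels may differ), as are the function names.

```python
def build_lookup(rows):
    """Map each label string (language tag already stripped) to its controlled value."""
    lookup = {}
    for value, label in rows:
        # Compare case-insensitively and ignore surrounding whitespace.
        lookup.setdefault(label.strip().casefold(), value)
    return lookup


def clean(raw, lookup):
    """Return the controlled value for a raw country string, or None if unknown."""
    return lookup.get(raw.strip().casefold())


# Hypothetical (controlled value, label) pairs standing in for query results.
rows = [
    ("Swaziland", "Swaziland"),
    ("Swaziland", "Kingdom of Swaziland"),
    ("Swaziland", "Swasiland"),
]

lookup = build_lookup(rows)
print(clean("  kingdom of swaziland ", lookup))
print(clean("Narnia", lookup))
```

Values that return None would be the ones to flag for manual review rather than to correct automatically.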

Summary

1. Because dwc:country is a "convenience term" whose primary purpose is to make text-based database searches simpler, our primary concern in creating a controlled vocabulary for its values is to establish a single string literal value to represent the country.  Designating a standard URI is not the primary concern, although a URI is a useful, unambiguous way to link the standard literal value to other labels for the country.

2. With some digging, it's possible to use the Getty Vocabularies SPARQL endpoint to retrieve the "preferred" skos:prefLabel for the country.  By means of a CONSTRUCT query, the Getty TGN URI can be linked to that preferred label by an rdf:value predicate.

3. Using the FactForge SPARQL endpoint, with some effort one can extract the preferred English labels associated with GeoNames feature URIs.  Those labels can then be matched with the preferred English labels for TGN concepts in a CONSTRUCT query that generates a foaf:focus triple to link the TGN concept to its corresponding GeoNames feature.

4. Although SPARQL queries automate this whole process to some extent, there are some obvious errors, so some human curation would be necessary before the putative controlled vocabulary standard would be ready for use.

5. Triples linking to the multi-lingual preferred labels and alternative labels that are made available through TGN can be combined with the triples from the controlled vocabulary standard to facilitate translations and data cleaning.  Additional triples (possibly linked with a skos:hiddenLabel property) could be added for common misspellings.

6. The vocabulary that I constructed does not address the problem of changing country names.  For example, specimens that were collected in the Soviet Union would, if collected now, be listed under literals for Russia, Kazakhstan, Latvia, etc. So do we go back and change the dwc:country value for those earlier records?  This problem is beyond the scope of this post, but I would suggest that it would be more productive to work on assigning stable URI values (e.g. from GeoNames) to dwciri:inDescribedPlace properties for records.  Then if a particular place becomes part of a different country because of political changes, that problem could be fixed once for all records by edits to the GeoNames database, rather than by changes to thousands or millions of individual occurrence records.

Notes

[1] The semantics of this arise from the class and property definitions found in section B.3.2. of the SKOS Reference Recommendation, which makes use of advanced OWL property chains (I had to look them up).  I don't know how likely it is that any client is going to do the machine reasoning necessary to materialize things like skos:prefLabel relationships from skosxl:prefLabel relationships.  Nevertheless, those relationships are entailed if they aren't explicitly expressed.
[2] I missed Yugoslavia, but I got rid of it later.

2 comments:

  1. Hi! In this case, reading the documentation may be a good idea :-)

    - don't use gvp:displayOrder to find "preferred preferred". Use gvp:prefLabelGVP ("GVP preferred"). See http://vocab.getty.edu/queries#Places_with_English_or_GVP_Label for an example how to pick English or the GVP preferred

    - tgn:7004538 comes up because that's a "historical region" (of the Roman empire)

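    - Of course we support CONSTRUCT, eg try this query:
    PREFIX gvp: <http://vocab.getty.edu/ontology#>
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    PREFIX skosxl: <http://www.w3.org/2008/05/skos-xl#>
    PREFIX dcterms: <http://purl.org/dc/terms/>
    PREFIX dcam: <http://purl.org/dc/dcam/>

    CONSTRUCT {
    ?country rdf:value ?value.
    ?country rdfs:label ?engLabel.
    ?country dcam:memberOf <http://vocab.getty.edu/tgn/>.
    ?country skos:inScheme <http://vocab.getty.edu/tgn/>.
    ?country dcterms:isPartOf <http://rs.tdwg.org/cvterms/country/>.
    ?country a skos:Concept.
    }
    WHERE {
    ?country gvp:placeType <http://vocab.getty.edu/aat/300128207>.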
    ?country skosxl:prefLabel ?xlLabel.
    OPTIONAL {
    ?country skos:prefLabel ?engLabel.
    FILTER (lang(?engLabel)="en")
    }
    ?xlLabel gvp:displayOrder "1"^^xsd:positiveInteger.
    ?xlLabel skosxl:literalForm ?prefLabel.
    BIND (str(?prefLabel) AS ?value)
    }

    - "don't know how likely it is that any client is going to do the machine reasoning necessary to materialize things like skos:prefLabel relationships from skosxl:prefLabel relationships": We do that, see http://vocab.getty.edu/doc/#SKOS-XL_Inference

    - by looking only for "nations", you'll miss some countries, eg tgn:7024573 Greenland that has direct type "island nations" without its super-type "nations". see http://vocab.getty.edu/doc/queries/#Places_by_Type and search for direct or hierarchical type of "nations" (255) or "sovereign states" (320)

    - "there are some obvious errors": after the explanations above, if you find real errors, please post them in the support group (listed on the home page)

  2. Vladimir,
    Thanks for pointing out these things I missed. gvp:prefLabelGVP is a simpler way to accomplish what I wanted to do. Using property paths to catch all nations is a great idea - I missed the "island nations" thing.

    When I said "Unfortunately, I couldn't figure out how to get the Getty Vocabularies endpoint to do a CONSTRUCT query.", I wasn't clear. I meant that I couldn't do them by pasting them in the query box at http://vocab.getty.edu/sparql . I didn't have any problem with the CONSTRUCT queries when I sent them to the endpoint through HTTP.

    When I said, "there are some obvious errors", I didn't necessarily mean errors in the TGN data. Rather, I meant that there were errors in the results of the whole automated process, including trying to link the TGN data to GeoNames data by string matching. Following your suggestions would have eliminated some of the problems I had, but not all of them because string matching is a "dirty" process and needs some human intervention, e.g. in the cases where parentheses and spaces in one dataset prevented strings from matching in the other dataset.

    Thanks for taking the time to respond!
    Steve
