Steve Baskauf's blog: Controlled values (again)

The connection between The Darwin Core Hour and the TDWG Standards Documentation Specification

I've included the word "again" in the title of this blog post because I wrote a series of blog posts [1] about a year ago exploring issues related to thesauri, ontologies, controlled vocabularies, and SKOS. Those posts were of a somewhat technical nature since I was exploring possible ways to represent controlled vocabularies as RDF. However, there has been a confluence of two events that have inspired me to write a less technical blog post on the subject of controlled vocabularies for the general TDWG audience.

One is the genesis of the excellent Darwin Core Hour webinar series. I encourage you to participate in the webinars if you want to learn more about Darwin Core. The previous webinars have been recorded and can be viewed online. The most recent webinar, "Even Simple is Hard", presented by John Wieczorek on March 7, provided a nice introduction to issues related to controlled vocabularies, and the next one on April 4, "Thousands of shades for 'Controlled' Vocabularies", to be presented by Paula Zermoglio, will deal with the specifics of controlled vocabularies.

The other thing that's going on is that we are in the midst of the public comment period for the draft TDWG Standards Documentation Specification (SDS), of which I'm the lead author. At the TDWG Annual Meeting in December, I led a session to inform people about the SDS and its sister standard, the TDWG Vocabulary Management Specification. At that session, the topic of controlled vocabularies came up. I made some statements explaining the way that the SDS specifies that controlled vocabularies will be described in machine-readable form. What I said seemed to take some people by surprise, and although I provided a brief explanation, there wasn't enough time to have an in-depth discussion. I hoped that the topic would come up during the SDS public comment period, but so far it has not. Given the current interest in constructing controlled vocabularies, I hope that this blog post will either generate some discussion, or satisfy people's curiosity about how the SDS deals with machine-readable controlled vocabularies.

Definitions

It is probably best to start off by providing some definitions. It turns out that there is actually an international standard that deals with controlled vocabularies. It is ISO 25964: "Thesauri and interoperability with other vocabularies". Unfortunately, that standard is hidden behind a paywall and is ridiculously expensive to buy. As part of my work on the SDS, I obtained a copy of ISO 25964 by Interlibrary Loan. I had to return that copy, but I took some notes that are on the VOCAB Task Group's GitHub site. I encourage you to refer to those notes for more details about what I'm only going to briefly describe here.

Controlled vocabularies and thesauri

ISO 25964 defines a controlled vocabulary as a

prescribed list of terms, headings or codes, each representing a concept. NOTE: Controlled vocabularies are designed for applications in which it is useful to identify each concept with one consistent label, for example when classifying documents, indexing them and/or searching them. Thesauri, subject heading schemes and name authority lists are examples of controlled vocabularies.

It also defines a form of controlled vocabulary, which is the major subject of the standard: a thesaurus. A thesaurus is a

controlled and structured vocabulary in which concepts are represented by terms, organized so that relationships between concepts are made explicit, and preferred terms are accompanied by lead-in entries for synonyms or quasi-synonyms. NOTE: The purpose of a thesaurus is to guide both the indexer and the searcher to select the same preferred term or combination of preferred terms to represent a given subject. For this reason a thesaurus is optimized for human navigability and terminological coverage of a domain. [my emphasis]

If you participated in or listened to the Darwin Core Hour "Even Simple is Hard", you can see the close relationship between the way "controlled vocabulary" was used in that seminar and the definition of thesaurus given here. When submitting metadata about an occurrence to an aggregator, we want to use the same controlled value term in our metadata as will be used by those who may be searching for our metadata in the future. Referring to an example given in the webinar, if we (the "indexers") provide "PreservedSpecimen" as a value in metadata in our spreadsheet, others in the future who are searching (the "searchers") for occurrences documented by preserved specimens can search for "PreservedSpecimen", and find our record. That won't happen if we use a value of "herbarium sheet". Figuring out how to get indexers and searchers to use the same terms is the job of a thesaurus.

A thesaurus is also designed to capture relationships between controlled value terms, such as "broader" and "narrower". A searcher who knows about preserved specimens but wants records documented by any kind of physical thing (including living specimens, fossils, and material samples) can be directed to a broader term that encompasses all kinds of physical things, e.g. "PhysicalObject".

So although in Darwin Core (and in TDWG in general) we tend to talk about "controlled vocabularies", I would assert that we are, in fact, talking about thesauri as defined by ISO 25964.

Strings and URIs

If you have spent any time pondering TDWG vocabularies, you probably have noticed that all kinds of TDWG vocabulary terms (classes and properties) are named using Uniform Resource Identifiers, or URIs. Because the kind of URIs used in TDWG term names begin with "http://" people get the mistaken impression that URIs always make something "happen". This is because we are used to seeing Web URLs that start with "http://", and have come to believe that if we put a URI that starts with "http://" in a browser, we will get a web page. However, there are some terms in TDWG vocabularies that will probably never "do" anything in a browser. For example, Audubon Core co-opts the term http://ns.adobe.com/exif/1.0/PixelXDimension as a property whose value gives the number of pixels in an image in the X dimension. You can try putting that term URI in a browser and prove to yourself that nothing useful happens. So we need to get over the idea that URIs must "do" something (they might or might not), and get used to the idea that their primary purpose is to serve as a globally unique name that conforms to some particular structural rules [2].

You can see the value of using URIs over plain text strings if you consider the Darwin Core term "class". When we use "class" in the context of Darwin Core, we intend its Darwin Core definition: "The full scientific name of the class in which the taxon is classified." However, "Class" has a different meaning in a different international standard, the W3C's RDF Schema 1.1 (RDFS). In that context, "Class" means "The class of resources that are RDF classes." There may be many other meanings of "class" in other fields. It can mean different things in education, in computer programming, and in sociology. We can tell people exactly what we intend by the use of a term if we identify it with a URI rather than a plain text string. So for example, if we use the term http://rs.tdwg.org/dwc/terms/class, people will know that we mean "class" in the Darwin Core sense, but if we use the term http://www.w3.org/2000/01/rdf-schema#Class, people will know that we mean "class" in the RDFS sense.

Clearly, it is a pain in the neck to write out a long and messy URI. For convenience, there is an abbreviated form for URIs called "compact URIs" or CURIEs. When we use a CURIE, we define an abbreviation for part of the URI (commonly called the "namespace" abbreviation). So for example, we could declare the namespace abbreviations:

dwc: = http://rs.tdwg.org/dwc/terms/
rdfs: = http://www.w3.org/2000/01/rdf-schema#

and then abbreviate the URI by replacing the namespace with the abbreviation to form a CURIE. With the defined abbreviations above, we could say dwc:class when we intend "class" in the Darwin Core sense and rdfs:Class when we intend "class" in the RDFS sense. This is much shorter than writing out the full URI, and if the last part of the CURIE after the namespace (known as the "local name") is formed from a natural language string (possibly in camelCase), it's easy for a native speaker of that natural language to "read" the CURIE as part of a sentence.
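The substitution described above is mechanical, so it can be sketched in a few lines of Python. This is only an illustration: the function names `expand_curie` and `compact_uri` are made up for this example, and the namespace table holds just the two abbreviations declared above.

```python
# Namespace abbreviations declared in the text above.
NAMESPACES = {
    "dwc": "http://rs.tdwg.org/dwc/terms/",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}


def expand_curie(curie, namespaces=NAMESPACES):
    """Replace the namespace abbreviation of a CURIE with the full namespace."""
    prefix, separator, local_name = curie.partition(":")
    if not separator or prefix not in namespaces:
        raise ValueError(f"unknown or missing namespace abbreviation in {curie!r}")
    return namespaces[prefix] + local_name


def compact_uri(uri, namespaces=NAMESPACES):
    """Abbreviate a full URI to a CURIE if one of the declared namespaces matches."""
    for prefix, namespace in namespaces.items():
        if uri.startswith(namespace):
            return f"{prefix}:{uri[len(namespace):]}"
    return uri  # no declared namespace applies, so leave the URI as it is


print(expand_curie("dwc:class"))
print(compact_uri("http://www.w3.org/2000/01/rdf-schema#Class"))
```

The two calls at the end print the full Darwin Core URI for "class" and the CURIE rdfs:Class, respectively.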

It is not a requirement that the local name be a natural language string. Some vocabularies prefer to have opaque identifiers, particularly when there is no assumed primary language. So for example, the URI http://vocab.getty.edu/tgn/1000111, which is commonly abbreviated by the CURIE tgn:1000111, denotes the country China, which may have many names in natural language strings of various languages.

What's the dwciri: namespace for?

Those who are familiar with using Darwin Core in spreadsheets and relational databases are familiar with the "regular" Darwin Core terms in the "dwc:" namespace (http://rs.tdwg.org/dwc/terms/). However, most are NOT familiar with the namespace http://rs.tdwg.org/dwc/iri/, commonly abbreviated dwciri: . This namespace was created as a result of the adoption of the Darwin Core RDF Guide, and most people who don't care about RDF have probably ignored it. However, I'd like to bring it up in this context because it can play an important role in disambiguation.

Here is a typical task. You have a spreadsheet record whose dwc:county value says "Robertson". You know that it's a third-level political subdivision because that's what dwc:county records. However, there are several third-level political subdivisions named "Robertson" in the United States alone, and there are probably some in other countries as well. So there is work to be done in disambiguating this value. You'll probably need to check the dwc:stateProvince and dwc:country or dwc:countryCode values, too. Of course, there may also be other records whose dwc:county values are "Robertson County" or "Comté de Robertson" or "Comte de Robertson" that are probably from the same third-level political subdivision as your record. Once you've gone to the trouble of figuring out that the record is in Robertson County, Tennessee, USA, you (or other data aggregators) really should never have to go through that effort again.

There are two standardized controlled vocabularies that have created URI identifiers for geographic places: GeoNames and the Getty Thesaurus of Geographic Names (TGN). There are reasons (beyond the scope of this blog post) why one might prefer one of these vocabularies over another, but either one provides an unambiguous, globally unique URI for Robertson County, Tennessee, USA: http://sws.geonames.org/4653638/ from GeoNames and http://vocab.getty.edu/tgn/2001910 from the TGN. The RDF Guide makes it clear that the value of every dwciri: term must be a URI, while the values of many dwc: terms may be a variety of text strings, including human-readable names, URIs, abbreviations, etc. With a dwc: term, a user probably will not know whether disambiguation needs to be done, while with a dwciri: term, a user knows exactly what the value denotes, since a URI is a globally unique name.

In RDF, we could say that a particular location was in Robertson County, Tennessee, USA like this:

my:location dwciri:inDescribedPlace tgn:2001910.

However, there is also no rule that says you couldn't have a spreadsheet with a column header of "dwciri:inDescribedPlace", as long as the cells below it contained URI values. So dwciri: terms could be used in non-RDF data representations as well as in RDF.
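To make the spreadsheet case concrete, here is a rough Python sketch that flags cells in a dwciri: column that are not URIs. The column headers come from the text, but the rows are invented for illustration, and `is_absolute_uri` is a deliberately loose check (it only looks for a scheme followed by something), not a full URI validator.

```python
import csv
import io
from urllib.parse import urlparse

# Hypothetical spreadsheet content: the last row wrongly puts a name
# string in the dwciri: column.
SHEET = """dwc:county,dwciri:inDescribedPlace
Robertson,http://vocab.getty.edu/tgn/2001910
Robertson County,http://sws.geonames.org/4653638/
Comté de Robertson,Robertson County
"""


def is_absolute_uri(value):
    """Loose check: the value has a scheme and something after it."""
    parts = urlparse(value.strip())
    return bool(parts.scheme) and bool(parts.netloc or parts.path)


def non_uri_cells(sheet_text, column):
    """Return (row number, value) pairs for cells in `column` that are not URIs."""
    reader = csv.DictReader(io.StringIO(sheet_text))
    return [
        (row_number, row[column])
        for row_number, row in enumerate(reader, start=2)  # row 1 is the header
        if not is_absolute_uri(row[column])
    ]


problems = non_uri_cells(SHEET, "dwciri:inDescribedPlace")
print(problems)
```

With the sample rows above, only the cell in spreadsheet row 4 is reported, because its value is a name string rather than a URI.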

If you look at the table in Section 3.7 of the Darwin Core RDF Guide, you will see that there are dwciri: analogs for every dwc: term where we thought it made sense to use a URI as a value.[3] In many cases, those were terms where Darwin Core recommended use of a controlled vocabulary. Thus, once providers or aggregators went to the trouble to clean their data and determine the correct controlled values for a dwc: property, they and everybody else in the future could be done with that job forever if they recorded the results as a URI value from a controlled vocabulary for a dwciri: Darwin Core property.

The crux of the issue

OK, after a considerable amount of background, I need to get to the main point of the post. John's "Even Simple is Hard" talk was directed to the vast majority of Darwin Core users: those who generate or consume data as Simple Darwin Core (spreadsheets), or who output or consume Darwin Core Archives (fielded text tables linked in a "star schema"). In both of those cases, the tables or spreadsheets will most likely be populated with plain text strings. There may be a few users who have made the effort to find unambiguous URI values instead of plain strings, and hopefully that number will go up in the future as more URIs are minted for controlled vocabulary terms. However, in the current context, when John talks about populating the values of Darwin Core properties with "controlled values", I think that he probably means to use a single, consensus unicode string that denotes the concept underlying that controlled value.

Preferred unicode strings

I am reminded of a blog post that John wrote in 2013 called "Data Diversity of the Week: Sex" in which he describes some of the 189 distinct values used in VertNet to denote the concept of "maleness". We could all agree that the appropriate unicode string to denote maleness in a spreadsheet should be the four characters "male". Anyone who cleans data and encounters values for dwc:sex like "M", "m", "Male", "MALE", "macho", "masculino", etc. etc. could substitute the string "male" in that field. There would, of course, be the problem of losing the verbatim value if a substitution were made.
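As a small illustration of that substitution, the Python sketch below replaces a recognized variant with the consensus string while keeping the original value. The variant list contains only the strings quoted above, and "verbatimSex" is a hypothetical column name invented here. It is not a Darwin Core term.

```python
# Variants quoted in the text, stored case-folded. A real table would be much longer.
MALE_VARIANTS = {"m", "male", "macho", "masculino"}


def clean_sex(record):
    """Return a copy of the record with a consensus dwc:sex value.

    The original string is preserved in a hypothetical "verbatimSex" column
    so that the substitution does not destroy information.
    """
    verbatim = record.get("dwc:sex", "")
    cleaned = dict(record)
    if verbatim.strip().casefold() in MALE_VARIANTS:
        cleaned["dwc:sex"] = "male"
        cleaned["verbatimSex"] = verbatim
    return cleaned


print(clean_sex({"dwc:sex": "MALE"}))
```

Values that are not in the table are left untouched, so an unrecognized string is never silently overwritten.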

I suspect that most TDWG'ers would consider the task of developing a controlled vocabulary for dwc:sex to involve sitting down a bunch of users and aggregators at a big controlled vocabulary conference, and coming to some consensus about the particular strings that we should all use to denote maleness, femaleness, and all other flavors of gender that we find in the natural world.

I don't want to burst anybody's bubble, but as it currently stands, that's not how the draft Standards Documentation Specification would work with respect to controlled vocabularies. A TDWG controlled vocabulary would be more than a list of acceptable strings. It would have all of the same features that other TDWG vocabularies have.

SDS: URIs

For one thing, each term in a controlled vocabulary would be identified by a URI. That is already current practice in TDWG vocabularies and in Dublin Core. The SDS does not specify whether the URIs should use "English-friendly" local names or opaque numbers for local names. Either would be fine. For illustration purposes, I'll pick opaque numbers. Let's use "12345" as the local name for "maleness". The SDS is also silent about namespace construction. One could do something like

dwcsex: = "http://rs.tdwg.org/dwc/cv/sex/"

for the namespace. Then the URI for maleness would be

http://rs.tdwg.org/dwc/cv/sex/12345

as a full URI or

dwcsex:12345

as a CURIE. Anybody who wants to unambiguously refer to maleness can use the URI dwcsex:12345 regardless of whether they are using a spreadsheet, relational database, or RDF.

SDS: Machine-readable stuff

In John's talk, he mentioned the promise of using "semantics" to help automate the process of data cleaning. A critical feature of the SDS is that in addition to specifying how human-readable documents should be laid out, it also specifies how metadata should be expressed in order to make them machine-readable. The examples are given as RDF/Turtle, but the SDS makes it clear that it is agnostic about how machine-readability should be achieved. RDF-haters are welcome to use JSON-LD. HTML-lovers are welcome to use RDFa embedded in web page markup. Or better, provide the machine-readable data in all of the above formats and let users choose. The main requirement is that regardless of the chosen serialization, every machine-readable representation must "say" the same thing, i.e. must use the same properties specified in the SDS to describe the metadata. So the SDS is clear about what properties should be used to describe each aspect of the metadata.

In the case of controlled vocabulary terms, several designated properties are the same as those used in other TDWG vocabularies. For example, rdfs:comment is used to provide the English term definition and rdfs:label is used to indicate a human readable label for the term in English. The specification does a special thing to accommodate our community's idiosyncrasy of relying on a particular unicode string to denote a controlled vocabulary term. That unicode string is designated as the value of rdf:value, a property that is well-known, but doesn't have a very specific meaning and could be used in this way. It's possible that the particular consensus string might be the same as the label, but it wouldn't have to be. For example we could say this:

dwcsex:12345 rdfs:label "Male"@en;
rdfs:comment "the male gender"@en;
rdf:value "male".

In this setup, the label to be presented to humans starts with a capital letter, while the consensus string value denoting the term doesn't. In other cases, the human readable label might contain several words with spaces between, while the consensus string value might be in camelCase with no spaces. The label probably should be language-tagged, while the consensus string value is a plain literal.

Dealing with multiple labels for a controlled value: SKOS

As it currently stands, the SDS says that the normative definition and label for terms should be in English. Thus the controlled value vocabulary document (both human- and machine-readable) will contain English only. Given the international nature of TDWG, it would be desirable to also make documents, term labels, and definitions available in as many languages as possible. However, it should not require invoking the change process outlined in the Vocabulary Management Specification every time a new translation is added, or if the wording in a non-English language version changes. So the SDS assumes that there will be ancillary documents (human- and machine-readable) in which the content is translated to other languages, and that those ancillary documents will be associated with the actual standards documents.

This is where the Simple Knowledge Organization System (SKOS) comes in. SKOS was developed as a W3C Recommendation in parallel with the development of ISO 25964. SKOS is a vocabulary that was specifically designed to facilitate the development of thesauri. Given that I've made the case that what TDWG calls "controlled vocabularies" are actually thesauri, SKOS has many of the terms we need to describe our controlled value terms.

An important SKOS term is skos:prefLabel (preferred label). skos:prefLabel is actually defined as a subproperty of rdfs:label, so any value that is a skos:prefLabel is also a generic label. However, the "rules" of SKOS say that you should never have more than one skos:prefLabel value for a given resource, in a given language. Thus, there is only one skos:prefLabel value for English, but there can be other skos:prefLabel values for Spanish, German, Chinese, etc.

SKOS also provides the term skos:altLabel. skos:altLabel is used to specify other labels that people might use, but that aren't really the "best" one. There can be an unlimited number of skos:altLabel values in a given language for a given controlled vocabulary term. There is also a property skos:hiddenLabel. The values of skos:hiddenLabel are "bad" values that you know people use, but you really wouldn't want to suggest as possibilities (for example, misspellings).

SKOS has a particular term that indicates that a value is a definition: skos:definition. That has a more specific meaning than rdfs:comment, which could really be any kind of comment. So using it in addition to rdfs:comment is a good idea.

So here is how the description of our "maleness" term would look in machine-readable form (serialized as human-friendly RDF/Turtle):

Within the standards document itself:

dwcsex:12345 rdfs:label "Male"@en;
skos:prefLabel "Male"@en;
rdfs:comment "the male gender"@en;
skos:definition "the male gender"@en;
rdf:value "male".

In an ancillary document that is outside the standard:

dwcsex:12345 skos:prefLabel "Masculino"@es;
skos:altLabel "Macho"@es;
skos:altLabel "macho"@es;
skos:altLabel "masculino"@es;
skos:altLabel "male"@en;
skos:prefLabel "男"@zh-hans;
skos:prefLabel "男"@zh-hant;
skos:prefLabel "männlich"@de;
skos:altLabel "M";
skos:altLabel "M.";
skos:hiddenLabel "M(ale)";
etc. etc.

In the ancillary document, one would attempt to include as many as possible of the 189 values for "male" that John mentioned in his blog post. Having this diversity of labels available makes two things possible. One is to automatically generate pick lists in any language. If the user selects German as the preferred language, the pick list presents the German preferred label "männlich" to the user, but the value selected is actually recorded by the application as the language-independent URI dwcsex:12345. Although I didn't show it, the ancillary document could also contain definitions in multiple languages to clarify things to international users in the event that viewing the label itself is not enough.
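Here is a minimal Python sketch of that pick list idea. It assumes the preferred labels have already been loaded into a dictionary. The labels are copied from the example above, the URI is the illustrative one used in this post rather than a real TDWG identifier, and the fallback to English is a choice made for this example.

```python
# Preferred labels from the example above, keyed by URI and then by language.
# A plain dictionary works because SKOS allows at most one preferred label
# per language for a given resource.
PREF_LABELS = {
    "http://rs.tdwg.org/dwc/cv/sex/12345": {
        "en": "Male",
        "es": "Masculino",
        "de": "männlich",
        "zh-hans": "男",
        "zh-hant": "男",
    },
}


def pick_list(pref_labels, language, fallback="en"):
    """Return (label shown to the user, URI recorded by the application) pairs."""
    choices = []
    for uri, labels in pref_labels.items():
        label = labels.get(language, labels.get(fallback))
        if label is not None:
            choices.append((label, uri))
    return sorted(choices)


for label, uri in pick_list(PREF_LABELS, "de"):
    print(label, "->", uri)
```

The user sees "männlich", but what gets stored is the URI, so the record means the same thing regardless of the language the pick list was shown in.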

The many additional labels in the ancillary document also facilitate data cleaning. For example, if GBIF has a million horrible spreadsheets to try to clean up, they could simply do string matching against the various label values without regard to the language tags and type of label (pref vs. alt vs. hidden). Because the ancillary document is not part of the standard itself, the laundry list of possible labels can be extended at will every time a new possible value is discovered.

Making the data available

It is TOTALLY within the capabilities of TDWG to provide machine-readable data of this sort, and if the SDS is ratified, that's what we will be doing. Setting up a SPARQL endpoint to deliver the machine-readable metadata is not hard. For those who are RDF-phobic, a machine-readable version of the controlled vocabulary can be available through the API as JSON-LD, which provides exactly the same information as the RDF/Turtle above and would look like this:

{
  "@context": {
    "dwcsex": "http://rs.tdwg.org/dwc/cv/sex/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "xsd": "http://www.w3.org/2001/XMLSchema#"
  },
  "@id": "dwcsex:12345",
  "rdf:value": "male",
  "rdfs:comment": {
    "@language": "en",
    "@value": "the male gender"
  },
  "rdfs:label": {
    "@language": "en",
    "@value": "Male"
  },
  "skos:altLabel": [
    {
      "@language": "es",
      "@value": "macho"
    },
    "M",
    {
      "@language": "en",
      "@value": "male"
    },
    "M.",
    {
      "@language": "es",
      "@value": "masculino"
    },
    {
      "@language": "es",
      "@value": "Macho"
    }
  ],
  "skos:definition": {
    "@language": "en",
    "@value": "the male gender"
  },
  "skos:hiddenLabel": "M(ale)",
  "skos:prefLabel": [
    {
      "@language": "es",
      "@value": "Masculino"
    },
    {
      "@language": "zh-hant",
      "@value": "男"
    },
    {
      "@language": "en",
      "@value": "Male"
    },
    {
      "@language": "zh-hans",
      "@value": "男"
    },
    {
      "@language": "de",
      "@value": "männlich"
    }
  ]
}

People could write their own data-cleaning apps to consume this JSON description of the controlled vocabulary and never even have to think about RDF.
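As a rough sketch of what such an app might do, the Python below reads an abbreviated copy of the JSON above as ordinary JSON and builds a lookup from every label string to the term identifier, ignoring language tags and label types. The function names are invented for this example, the case-insensitive matching is my own choice, and the code relies on the particular JSON layout shown above rather than doing general JSON-LD processing.

```python
import json

# Abbreviated copy of the JSON-LD shown above.
DOCUMENT = """
{
  "@id": "dwcsex:12345",
  "rdf:value": "male",
  "rdfs:label": {"@language": "en", "@value": "Male"},
  "skos:altLabel": [
    {"@language": "es", "@value": "macho"},
    "M",
    "M."
  ],
  "skos:hiddenLabel": "M(ale)",
  "skos:prefLabel": [
    {"@language": "es", "@value": "Masculino"},
    {"@language": "de", "@value": "männlich"}
  ]
}
"""

LABEL_KEYS = ("rdf:value", "rdfs:label", "skos:prefLabel",
              "skos:altLabel", "skos:hiddenLabel")


def label_strings(value):
    """Yield plain strings from a value that may be a string, an object, or a list."""
    if isinstance(value, list):
        for item in value:
            yield from label_strings(item)
    elif isinstance(value, dict):
        yield value["@value"]  # language-tagged literal: keep only the text
    else:
        yield value  # plain literal


def build_lookup(term):
    """Map every known label (case-folded) to the identifier of the term."""
    lookup = {}
    for key in LABEL_KEYS:
        for text in label_strings(term.get(key, [])):
            lookup[text.casefold()] = term["@id"]
    return lookup


LOOKUP = build_lookup(json.loads(DOCUMENT))


def clean(verbatim):
    """Return the term identifier for a verbatim string, or None if it is unknown."""
    return LOOKUP.get(verbatim.strip().casefold())


print(clean(" MACHO "), clean("M(ale)"), clean("herbarium sheet"))
```

The one wrinkle worth noticing is that a value in this JSON can be a bare string, an object with "@value", or a list mixing both, so the helper has to handle all three shapes.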

"Semantics", SKOS concept schemes, and Ontologies

Up to this point, I've been dodging an issue that will concern some readers and which other readers won't care about one whit. If you are a casual reader and don't care about the fine points of machine-readable data and ontologies, you can just stop reading here.

The SDS says that controlled vocabulary terms should be typed as skos:Concept, as opposed to rdfs:Class. That prescription has the implication that the controlled vocabulary will be a SKOS concept scheme rather than an ontology. This was what freaked people out at the TDWG meeting, because there is a significant constituency of TDWG whose first inclination when faced with a machine-data problem is to construct an OWL ontology. At the meeting, I made the statement that a SKOS concept scheme is a screwdriver and an OWL ontology is a hammer. Neither a screwdriver nor a hammer is intrinsically a better tool. You use a screwdriver when you want to put in screws and you use a hammer when you want to pound nails. So in order to decide which tool is right, we need to be clear about what we are trying to accomplish with the controlled vocabulary.

ISO 25964 provides the following commentary about thesauri and ontologies in section 21.2:

Whereas the role of most of the vocabularies ... is to guide the selection of search/indexing terms, or the browsing of organized document collections, the purpose of ontologies in the context of retrieval is different. Ontologies are not designed for information retrieval by index terms or class notation, but for making assertions about individuals, e.g. about real persons or abstract things such as a process.

and in section 22.3:

One key difference is that, unlike thesauri, ontologies necessarily distinguish between classes and individuals, in order to enable reasoning and inferencing. ... The concepts of a thesaurus and the classes of an ontology represent meaning in two fundamentally different ways. Thesauri express the meaning of a concept through terms, supported by adjuncts such as a hierarchy, associated concepts, qualifiers, scope notes and/or a precise definition, all directed mainly to human users. Ontologies, in contrast, convey the meaning of classes through machine-readable membership conditions. ... The instance relationship used in some thesauri approximates to the class assertion used in ontologies. Likewise, the generic hierarchical relationship ... corresponds to the subclass axiom in ontologies. However, in practice few thesauri make the distinction between generic, whole-part and instance relationships. The undifferentiated hierarchical relationship most commonly found in thesauri is inadequate for the reasoning functions of ontologies. Similarly the associative relationship is unsuited to an ontology, because it is used in a multitude of different situations and therefore is not semantically precise enough to enable inferencing.

In layman's terms, there are two key differences between thesauri and ontologies. The primary purpose of a thesaurus is to guide a human user to pick the right term for categorizing a resource. The primary purpose of an ontology is to allow a machine to do automated reasoning about classes and instances of things. The second difference is that reasoning and inferencing is, in a sense, "automatic" for an ontology, whereas the choice to make use of hierarchical relationships in a thesaurus is optional and controlled by a user.

Let's apply these ideas to our dwc:sex example. We could take the ontology approach and say that

dwcsex:12345 rdf:type rdfs:Class.

We could then define another class in our gender ontology:

dwcsex:12347 rdf:type rdfs:Class;
rdfs:label "Animal gender".

and assert

dwcsex:12345 rdfs:subClassOf dwcsex:12347.

This assertion is the sort that John described in his webinar: an "is_a" hierarchical relationship. We could represent it in words as:

"Male" is_a "Animal gender".

As data providers, we don't have to "do anything" to assert this fact, or decide in particular cases whether we like the fact or not. Anything that has a dwc:sex value of "male" will automatically have a dwc:sex value of "Animal gender" because that fact is entailed by the ontology.

Alternatively, we could take the thesaurus approach and say that

dwcsex:12345 rdf:type skos:Concept.

We could then define another concept in our gender thesaurus:

dwcsex:12347 rdf:type skos:Concept;
rdfs:label "genders that animals can have".

and assert

dwcsex:12345 skos:broader dwcsex:12347.

As in the ontology example, this assertion also describes a hierarchical relationship ("has_broader_category"). We could represent it in words as:

"Male" has_the_broader_category "genders that animals can have".

In this case, nothing is entailed automatically. If we assert that some thing has a dwc:sex value of "male", that's all we know. However, if a human user is using a SKOS-aware application, the application could interact with the user and say "Hey, not finding what you want? I could show you some other genders that animals can have." and then find other controlled vocabulary terms that have dwcsex:12347 as a broader concept. It would also be no problem to assert this:

dwcsex:12348 rdf:type skos:Concept;
rdfs:label "genders that parts of plants can have".

dwcsex:12345 skos:broader dwcsex:12348.

We aren't doing anything "bad" by somehow entailing that males are both plants and animals. We are just saying that "male" is a gender that can fall into several broader categories as part of a concept scheme: "genders that animals can have" and "genders that parts of plants can have". This is what was meant by "the undifferentiated hierarchical relationship most commonly found in thesauri is inadequate for the reasoning functions..." in the text of ISO 25964. The hierarchical relationships of thesauri can guide human categorizers and searchers, but they don't automatically entail additional facts.
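To show how a SKOS-aware application might use these relationships, here is a small Python sketch. The skos:broader statements are the ones from the examples above, plus one hypothetical extra concept, dwcsex:12346, added so that there is something to suggest. The function names are invented for illustration.

```python
# skos:broader statements as (narrower concept, broader concept) pairs.
# dwcsex:12346 is a hypothetical second concept, not from the examples above.
BROADER = [
    ("dwcsex:12345", "dwcsex:12347"),
    ("dwcsex:12345", "dwcsex:12348"),
    ("dwcsex:12346", "dwcsex:12347"),
]


def broader_concepts(concept):
    """Broader concepts asserted directly for `concept`."""
    return sorted(b for n, b in BROADER if n == concept)


def suggestions(concept):
    """Other concepts that share at least one broader concept with `concept`."""
    parents = set(broader_concepts(concept))
    return sorted({n for n, b in BROADER if b in parents and n != concept})


print(broader_concepts("dwcsex:12345"))
print(suggestions("dwcsex:12345"))
```

Nothing new is asserted about any record here. The application only looks up relationships when a user asks for help, which is the sense in which the hierarchy guides people rather than entailing facts.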

Screwdriver or hammer?

Now that I've explained a little about the differences between thesauri and ontologies, which one is the right tool for the controlled vocabularies? There is no definite answer to this, but the common practice for most controlled vocabularies seems to be the thesaurus approach. That's the approach used by all of the Getty Thesauri, and also the approach used by the Library of Congress for the controlled vocabularies it defines. In the case of both of these providers, the machine-readable forms of their controlled vocabularies are expressed as SKOS concept schemes, not ontologies.

That is not to say that all Darwin Core terms that currently say "best practice is to use a controlled vocabulary" should be defined as SKOS concept schemes. In particular, the two vocabularies that John spoke about at length in his webcast (dwc:basisOfRecord and dcterms:type) should probably be defined as ontologies. That's because they both are ways of describing the kind of thing something is, and that's precisely the purpose of an rdfs:Class. One could create an ontology that asserts:

dwc:PreservedSpecimen rdfs:subClassOf dctype:PhysicalObject.

and all preserved specimens could automatically be reasoned to be physical objects whether a data provider said so or not [4]. In other words, machines that are "aware" of that ontology would "know" that

"preserved specimen" is_a "physical object".

without any effort on the part of data providers or aggregators.
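The kind of reasoning involved can be imitated with a few lines of Python. This sketch is not a real reasoner: it only follows rdfs:subClassOf links to work out the additional rdf:type values that are entailed for a resource. The single subclass statement is the one given above, and the function names are made up for this example.

```python
# rdfs:subClassOf statements: class -> set of direct superclasses.
SUBCLASS_OF = {
    "dwc:PreservedSpecimen": {"dctype:PhysicalObject"},
}


def superclasses(cls, subclass_of):
    """All classes reachable from `cls` by following rdfs:subClassOf links."""
    found = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def entailed_types(asserted_types, subclass_of):
    """Asserted rdf:type values plus those entailed by the subclass hierarchy."""
    types = set(asserted_types)
    for cls in asserted_types:
        types |= superclasses(cls, subclass_of)
    return types


# The provider only asserted that the specimen is a dwc:PreservedSpecimen.
print(sorted(entailed_types({"dwc:PreservedSpecimen"}, SUBCLASS_OF)))
```

The provider asserted only one type, but the result also contains dctype:PhysicalObject, which is the "automatic" behavior that distinguishes an ontology from a thesaurus.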

But both dcterms:type and dwc:basisOfRecord are really ambiguous terms that are a concession to the fact that we try to cram information about all kinds of resources into a single row of a spreadsheet. The ambiguity about what dwc:basisOfRecord actually means is the reason why the RDF guide says to use rdf:type instead [5]. There is no prohibition against having as many values of rdf:type as is useful. You can assert:

my:specimen rdf:type dwc:PreservedSpecimen;
    rdf:type dctype:PhysicalObject;
    rdf:type dcterms:PhysicalResource.

with no problem, other than it won't fit in a single cell in a spreadsheet!

So what if we decide that "controlled values" for a term should be ontology classes?

The draft Standards Documentation Specification says that controlled value terms will be typed as skos:Concepts. What if there is a case where it would be better for the "controlled values" to be classes from an ontology? There is a simple answer to that. Change the Darwin Core term definition to say "Best practice is to use a class from a well-known ontology." instead of saying "Best practice is to use a controlled vocabulary." That language would be in keeping with the definitions and descriptions of controlled vocabularies and ontologies given in ISO 25964. Problem solved.

I should note that it is quite possible to use all of the SKOS label-related properties (skos:prefLabel, skos:altLabel, skos:hiddenLabel) with any kind of resource, not just SKOS concepts. So if it were decided in a particular case that it would be better for the "controlled values" to be defined as classes in an ontology rather than as concepts in a thesaurus, one could still use the multi-lingual strategy described earlier in the post.
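
Here is a small sketch of that idea in Python, with label statements held as plain tuples. The label strings, including the Spanish one, are invented for illustration and are not taken from any published vocabulary. The subject is a class, not a skos:Concept, yet the SKOS label properties still let software display a label in the user's language and map a typed-in string back to the IRI.

```python
# Label statements: (subject IRI, label property, text, language tag).
# All label strings below are hypothetical examples.
LABELS = [
    ("dwc:PreservedSpecimen", "skos:prefLabel", "preserved specimen", "en"),
    ("dwc:PreservedSpecimen", "skos:prefLabel", "espécimen preservado", "es"),
    ("dwc:PreservedSpecimen", "skos:altLabel", "specimen", "en"),
    ("dwc:PreservedSpecimen", "skos:hiddenLabel", "PreservedSpecimen", "en"),
]


def preferred_label(iri, lang):
    """Return the skos:prefLabel of `iri` in language `lang`, or None if absent."""
    for subject, prop, text, tag in LABELS:
        if subject == iri and prop == "skos:prefLabel" and tag == lang:
            return text
    return None


def resolve(text):
    """Map a label string (any label property, case-insensitive) to matching IRIs."""
    wanted = text.strip().casefold()
    return sorted({s for s, _, t, _ in LABELS if t.casefold() == wanted})


spanish_label = preferred_label("dwc:PreservedSpecimen", "es")
matches = resolve("Preservedspecimen")
```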

Also, there is no particular type for the values of dwciri: terms. The only requirement is that the value must be a URI rather than a string literal. It would be fine for that URI to denote either a class or a concept.

So one of the tasks of a group creating a "controlled vocabulary" would be to define the use cases to be satisfied, and then decide whether those use cases would be best satisfied by a thesaurus or an ontology.

Feedback! Feedback! Feedback!

If something in this post has pushed your button, then respond by making a comment about the Standards Documentation Specification before the 30-day public comment period ends on or around March 27, 2017. There are directions from the Review Manager, Dag Endresen, on how to comment at http://lists.tdwg.org/pipermail/tdwg-content/2017-February/003690.html . You can email anonymous comments directly to him, but I don't think any members of the Task Group will get their feelings hurt by criticisms, so an open dialog on the issue tracker or tdwg-content would be even better.

Footnotes

[1] Blog entries from my blog http://baskauf.blogspot.com/
March 14, 2016 "Ontologies, thesauri, and SKOS"
March 21, 2016 "Controlled values for Subject Category from Audubon Core"
April 1, 2016 "Controlled values for Establishment Means from Darwin Core"
April 4, 2016 "Controlled values for Country from Darwin Core"

[2] The IETF RFC 3986 defines the syntax of URIs. A superset of URIs, known as Internationalized Resource Identifiers or IRIs, is now commonly used in place of URIs in many standards documents. For the purpose of this blog post, I'll consider them interchangeable.

[3] There are also terms, like dwciri:inDescribedPlace, that are related to an entire set of Darwin Core terms ("convenience terms"). Talking about those terms is beyond the scope of this blog post, but for further reading, see the explanation in Section 2.7 of the RDF Guide.

[4] Cautionary quibble: in John's diagram, he asserted that dwc:LivingSpecimen is_a dctype:PhysicalObject. However, the definition of dctype:PhysicalObject is "An inanimate, three-dimensional object or substance.", which would not apply to many living specimens. A better alternative would be the class dcterms:PhysicalResource, which is defined as "a material thing". However, dcterms:PhysicalResource is not included in the DCMI type vocabulary - it's in the generic Dublin Core vocabulary. That's a problem if the type vocabulary is designated as the only valid controlled vocabulary to be used for dcterms:type.

[5] See Section 2.3.1.4 of the RDF Guide for details. DCMI also recommends using rdf:type over dcterms:type in RDF applications. A key problem that our community has not dealt with is clarifying whether we are talking about a resource, or the database record about that resource. We continually confuse these two things in spreadsheets and most of the time we don't care. However, the difference becomes critical if we are talking about modification dates or licenses. A spreadsheet can have a column for "license", but is the value describing a license for the database record or for the image described in that record? A spreadsheet can have a value for "modified", but does that mean the date that the record was last modified or the date that the herbarium sheet described by the record was last modified? With respect to dwc:basisOfRecord, its local name ("basis of record") implies that the value is the type of the resource upon which a record is based, which implies that the metadata in the spreadsheet row is about something else. That "something else" is probably an occurrence. So we conflate "occurrence" with "preserved specimen" by talking about both of them in the same spreadsheet row. According to its definition, dwc:Occurrence is "An existence of an Organism (sensu http://rs.tdwg.org/dwc/terms/Organism) at a particular place at a particular time." while a preserved specimen is a form of evidence for an occurrence, not a subclass of occurrence. That distinction doesn't seem important to most people who use spreadsheets, but it is important if an occurrence is documented by multiple forms of evidence (specimens, images, DNA samples) - something that is becoming increasingly common. What we should have is an rdf:type value for the occurrence, and a separate rdf:type value for the item of evidence (possibly more than one). But spreadsheets aren't complex enough to handle that, so we limp along with dwc:basisOfRecord.

6 comments:

Viktor Senderov, March 13, 2017 at 4:15 AM
Hi, I also think that ontologies _and_ thesauri made up of SKOS concepts have their uses and it is wrong to assume that we always have to create ontologies. Having said that, I think this post assumes (and I am not implying it is incorrect, I am just pointing out this implicit assumption) that creating an ontology is the same as creating a hierarchy of classes (in OWL) for our conceptualization. In my view, creating a formal hierarchy of classes for a conceptualization is surely an ontology, but any explicit formal shared specification of a conceptualization can also be considered an ontology in a broader sense (https://www.obitko.com/tutorials/ontologies-semantic-web/what-is-ontology.html). That's why a SKOS vocabulary described in Turtle is an ontology in a very broad sense even though the concepts are defined as individuals rather than classes. Even if you don't agree that the SKOS concept scheme is an ontology (and I think there is no right or wrong here, as it is just a matter of definition), you can surely agree that we can express it in RDF/Turtle, as you point out yourself. For the OBKMS data model (which I call an ontology), I have created several controlled vocabularies for things such as keywords, taxonomic statuses, etc., and encoded them as RDF. My encoding is slightly more complex than what is given here, which is why I want to paste it. For example, here is a vocabulary (don't copy paste this into a reasoner, it's almost valid Turtle) for some taxonomic statuses:
```
:StatusVocabulary rdf:type owl:Class ;
    rdfs:subClassOf <http://www.w3.org/2004/02/skos/core#ConceptScheme> ,
        [ rdf:type owl:Restriction ;
          owl:onProperty skos:inScheme ;
          owl:allValuesFrom :SubjectTerm
        ] ;

    rdfs:label "OpenBiodiv Vocabulary of Taxonomic Statuses"@en ;
    rdfs:comment """The status following a taxonomic name usage in a taxonomic
manuscript, i.e. 'n. sp.',
'comb. new',
'sec. Franz (2017)', etc"""@en .

:TaxonomicStatus rdf:type owl:Class ;
    rdfs:subClassOf [ rdf:type owl:Restriction ;
                      owl:onProperty <http://www.w3.org/2004/02/skos/core#inScheme> ;
                      owl:someValuesFrom :StatusVocabulary ] .

:TaxonomicUncertaitanty a :TaxonomicStatus ;
    rdfs:label "Taxonomic Uncertainty"@en ;
    rdfs:comment """This term indicates when applied to a taxonomic name
that there is some uncertainty about the name:
either in the placement of the name in the hierarchy
(e.g. incertae sedis), or in the description of the name
(e.g. nomen dubium)."""@en .

:TaxonomicDiscovery a :TaxonomicStatus ;
    rdfs:label "Taxon Discovery"@en ;
    rdfs:comment """This term when applied to a taxonomic name indicates
that this name denotes a taxon that is being described in
the present context. E.g.:
n. sp., gen. nov., n. trib., etc."""@en .

:ReplacementName a :TaxonomicStatus ;
    rdfs:label "Updated Name"@en ;
    rdfs:comment """This term when applied to a taxonomic name indicates
that the name it is being applied to is an updated version
of a different name. This update may come about through
changes in rank (stat. n.) when the endings change (e.g.
-ini -> -idae), through changes in genus placement
(new comb.), through updates needed purely for nomenclatural
reasons such as to avoid homonymy or correct grammatical
or spelling mistakes (nomen nov.), or anything else."""@en .

:Synonym a :TaxonomicStatus .

:AcceptedName a :TaxonomicStatus .

:ConservedName a :TaxonomicStatus .

:TypeSpeciesDesignation a :TaxonomicStatus .

:Record a :TaxonomicStatus .

:TaxonConceptLabel a :TaxonomicStatus .

```
Viktor Senderov, March 13, 2017 at 4:18 AM
Correction: "owl:allValuesFrom :SubjectTerm" should read "owl:allValuesFrom :TaxonomicStatus".
Viktor Senderov, March 13, 2017 at 4:26 AM
... and :TaxonomicConcept is also a subclass of skos:Concept! Just realized it's impossibru to edit comments here :)
Steve Baskauf, March 13, 2017 at 5:58 AM
Hi Viktor,
Thanks for your comments! In your followup comment, you mentioned :TaxonomicConcept, but I didn't see you talk about it earlier. Did you mean :TaxonomicStatus?

I didn't mention this in the post, but the SKOS Primer has some comments about defining resources as both skos:Concept and owl:Class. See https://www.w3.org/TR/skos-primer/#secskosowl . So if one did the typical thing that's done in ontology building (define entities as a hierarchy of classes) but also defined them as a hierarchy of SKOS concepts, then the issues raised in this section of the SKOS Primer would apply.

It's true that creating an ontology doesn't always mean defining entities as classes, but in my experience, that's what people typically do. And people who care about formal ontologies also probably care about OWL reasoning, which is why they should care about whether their actions throw the situation into OWL Full rather than OWL DL.

This is venturing into areas beyond what I'm familiar with, but I think the point is that when one ventures into mixing SKOS concept schemes and ontologies, they need to do it with care.

Viktor Senderov, March 13, 2017 at 7:02 AM
Yes, of course I needed to write :TaxonomicStatus. And I notice another error: inScheme should actually be its inverse "isSchemeOf" in the first restriction. As I said, this was "almost valid" RDF.

I absolutely agree with your comments that sometimes it is better to use vocabs instead of class hierarchies. For example, I had considered creating subclasses for said statuses (all would have been subclasses of a broader TaxonomicNameUsage class) but opted for the simplicity of the vocabulary.

The point I am raising is that you can (and probably should) encode these vocabularies as RDF, load them into the knowledge graph together with your ontology and use them as part of your queries even if they don't entail any additional triples.

I guess in a sense the most important thing is to specify the data model in a machine-readable format (I don't mind whether we call it an ontology or not).

Saturday, March 11, 2017

Controlled values (again)