Steve Baskauf's blog: March 2016

Monday, March 21, 2016

Controlled values for Subject Category from Audubon Core

Disclaimer: this post contains opinions that are entirely my own and do not represent any official position related to any of my work as part of TDWG.

A while ago, I got drafted to serve as the convener of a task group of Biodiversity Information Standards (TDWG) charged with (among other things) drafting a Standards Documentation Specification. This specification should lay out how TDWG standards would be structured to make them clear and understandable for humans and machines. With respect to human-readable documents, there were precedents to follow, and there is one precedent (Darwin Core) to follow for machine-readable representations. So we've made some progress on a draft specification.

However, what has really caused me to stall out on this work is trying to figure out how TDWG should specify controlled vocabularies. There are several places in TDWG vocabulary standards where it is stated that best practice is to use a "controlled vocabulary" for values associated with a property. Here are some examples:

Iptc4xmpExt:CVterm (from Audubon Core)
dwc:establishmentMeans (from Darwin Core)
dwc:country (from Darwin Core)

The questions spinning around in my head were:

What exactly is a controlled value?
What is the purpose of a controlled value?
How should one describe controlled values using RDF?

These questions set me off on an exploration of ISO 25964 and SKOS, which I wrote about in my previous blog post. In that post, I asserted that perhaps the most important question to be answered before constructing a vocabulary of any sort is to carefully lay out the use cases to be satisfied by that vocabulary. That seems obvious, but the TDWG email discussion list is littered with long arguments that would have been more productive if use cases had been clearly laid out at the start (mea culpa!). So I've spent some time recently thinking about what the use cases are for controlled vocabularies in the TDWG context. I've concluded that "controlled vocabulary" is a pretty vague term and that there really are several intended purposes for controlled vocabularies in the context of TDWG properties. The differences in these intended purposes should inform the design of the RDF that specifies the controlled vocabulary terms. In this and subsequent posts, I'm going to talk about my attempts to specify three experimental controlled vocabularies to be used as values for TDWG terms, and to lay out the use cases to be satisfied in each situation.

Subject Category from Audubon Core (Iptc4xmpExt:CVterm)

Audubon Core is "a set of vocabularies designed to represent metadata for biodiversity multimedia resources and collections".[1] Although Audubon Core mints some new terms, wherever possible it reuses existing terms. One of these reused terms is http://iptc.org/std/Iptc4xmpExt/2008-02-29/CVterm (abbreviated as Iptc4xmpExt:CVterm), which comes from the IPTC Standard Photo Metadata terms and is labeled by Audubon Core as "Subject Category". The property Iptc4xmpExt:CVterm supports classification of media items by linking to controlled vocabulary terms for subjects of the items.

Audubon Core is considered to be a data model that does not prefer any particular implementation. As such, it can be used in structured text, such as CSV, or as RDF (although a final RDF implementation has not been completed at this point). Because Audubon Core may be used in implementations that are predominately text-based, unqualified literals are permitted as values instead of URIs if the term is either from one of the Audubon Core recommended sets, or if the source vocabulary is specified using the ac:subjectCategoryVocabulary (i.e. http://rs.tdwg.org/ac/terms/subjectCategoryVocabulary) property. However, in the context of RDF, it is probably better for the object of Iptc4xmpExt:CVterm to be a URI whenever possible, since that would guarantee uniqueness, and permit discovery of other properties of the controlled value, such as preferred human-readable labels.

SERNEC Live Plant Image Group standardized image views

Because I manage the Bioimages website, I'm very interested in categorizing plant images in a systematic way. In 2008, Bruce Kirchoff and I created a system of standardized views that would allow plant images to be organized by plant part and by the viewing angle and orientation of the part. These views were published in Vulpia 7:16-30 and were subsequently vetted by a group of live plant photographers under the auspices of the Southeast Regional Network of Expertise and Collections (SERNEC). I made an attempt at an RDF representation of the standardized views in the form of an ontology, but at the time I was struggling with the task because I was pretty sure that what I was doing was not "right". So the effort to continue development of the standardized views and extend them to new plant groups and animals stalled out.

After educating myself recently about SKOS and thesauri, it was clear to me why formalizing the standardized views as an ontology back then was the wrong approach. The views form a hierarchy; for example, a view of the perianth of a flower was categorized in the group of views associated with inflorescences, and inflorescence views were one of the categories of views that applied to herbaceous angiosperms. My initial attempt modeled the views as classes, and the hierarchical relationships were expressed using rdfs:subClassOf. This allowed a reasoner to infer membership in classes at higher levels in the hierarchy. There were two problems with this approach. One was that I really wanted to use the views as values with the Iptc4xmpExt:CVterm property. But as classes, it would be more appropriate to make the link using rdf:type. The second problem was that the subclassing didn't make any sense. If I said

perianth subClassOf inflorescence subClassOf herbaceous angiosperm

do I really mean that something that is a perianth is also an inflorescence and is also an herbaceous angiosperm? Not really. I suppose that I could get around this problem by saying that what I was defining was a "view" of perianth and a "view" of an inflorescence and a "view" of an herbaceous angiosperm, and in that context the subclassing might make some sense. But if I make the link using rdf:type and say something like

<image> a <perianth>.

then am I saying that an image is a perianth? Am I saying that an image is a view of a perianth? What exactly am I saying?

Designing a SLPIG standardized view thesaurus

I could get around this confusion if I consider a view to be a skos:Concept rather than a class. The purpose of a skos:Concept is to allow humans to categorize things in a knowledge organization system. That's exactly what the standardized views are for: to categorize images, not really to describe the nature of images, or of perianth, or of herbaceous angiosperms.

Once I had that epiphany, then I knew that what I wanted to do was to design a thesaurus, not an ontology. A major design consideration is that the thesaurus needed to reflect the structure of the existing view hierarchy, since the hierarchy was already in use. It should also satisfy these use cases:

Allow a user to select a view by presenting labels for concepts that are appropriate for a particular category of organisms (e.g. gymnosperms, herbaceous angiosperms, etc.)
Allow images to be grouped by major categories (leaf, bark, fruit, etc.) and within those categories by views that were appropriate for each major category.
Group images that fall into the same major category regardless of whether they are in the same category of organism (e.g. show bark of any trees whether they are woody angiosperms or gymnosperms).
Support labels in multiple languages.

There were several ways I could have achieved these design goals. I decided to create a skos:ConceptScheme for each of the organism groups: woody angiosperms, herbaceous angiosperms, and gymnosperms. Within each concept scheme, the top concepts were the major view categories for that group (e.g. entire organism, stem, leaf, inflorescence, fruit, and seed for herbaceous angiosperms). The views within the major categories were linked to their category using skos:broader. Views that were the same as views in another major category were linked using skos:exactMatch. skos:prefLabel was used to specify the preferred label for a language. Here is an incomplete snippet of the part of the thesaurus that organizes the views that apply to herbaceous angiosperms:

<http://bioimages.vanderbilt.edu/rdf/stdview#02>
a skos:ConceptScheme;
rdfs:isDefinedBy <http://bioimages.vanderbilt.edu/rdf/stdview>;
rdfs:seeAlso <http://www.cals.ncsu.edu/plantbiology/ncsc/vulpia/pdf/Baskauf_&_Kirchoff_Digital_Plant_Images.pdf>;
skos:hasTopConcept <http://bioimages.vanderbilt.edu/rdf/stdview#0200>,
<http://bioimages.vanderbilt.edu/rdf/stdview#0201>,
<http://bioimages.vanderbilt.edu/rdf/stdview#0202>,
<http://bioimages.vanderbilt.edu/rdf/stdview#0203>,
<http://bioimages.vanderbilt.edu/rdf/stdview#0204>,
<http://bioimages.vanderbilt.edu/rdf/stdview#0205>,
<http://bioimages.vanderbilt.edu/rdf/stdview#0206>;
skos:note "II. Herbaceous angiosperm views"@en;
skos:prefLabel "herbaceous angiosperms"@en.

<http://bioimages.vanderbilt.edu/rdf/stdview#0203>
a skos:Concept;
skos:definition "II.C. Leaf"@en;
skos:exactMatch <http://bioimages.vanderbilt.edu/rdf/stdview#0104>,
<http://bioimages.vanderbilt.edu/rdf/stdview#0304>;
skos:inScheme <http://bioimages.vanderbilt.edu/rdf/stdview#02>;
skos:prefLabel "leaf"@en.

<http://bioimages.vanderbilt.edu/rdf/stdview#020302>
a skos:Concept;
skos:broader <http://bioimages.vanderbilt.edu/rdf/stdview#0203>;
skos:closeMatch <http://bioimages.vanderbilt.edu/rdf/stdview#010401>,
<http://bioimages.vanderbilt.edu/rdf/stdview#020301>,
<http://bioimages.vanderbilt.edu/rdf/stdview#030401>;
skos:definition "II.C.2. leaf on the upper stem, with the apex up"@en;
skos:inScheme <http://bioimages.vanderbilt.edu/rdf/stdview#02>;
skos:prefLabel "upper stem leaves"@en.

<http://bioimages.vanderbilt.edu/rdf/stdview#020303>
a skos:Concept;
skos:broader <http://bioimages.vanderbilt.edu/rdf/stdview#0203>;
skos:definition "II.C.3. margin of upper surface of leaf; part of the lower surface of another leaf with major veins visible should be shown behind the upper surface"@en;
skos:exactMatch <http://bioimages.vanderbilt.edu/rdf/stdview#010402>;
skos:inScheme <http://bioimages.vanderbilt.edu/rdf/stdview#02>;
skos:prefLabel "margin of upper and lower leaf surface"@en.

The entire thesaurus can be retrieved from http://bioimages.vanderbilt.edu/rdf/stdview.rdf as RDF/XML or http://bioimages.vanderbilt.edu/rdf/stdview.ttl as RDF/Turtle. One annoying thing is that the RDF editor I use (rdfEditor) balks if I use the namespace abbreviation stdview: for http://bioimages.vanderbilt.edu/rdf/stdview# because that causes the local name string to begin with a numeric character. I can't remember if that's just a problem in XML or if it really applies to Turtle as well. In any case, that's why the full URIs are listed in the example above instead of abbreviating them as something like stdview:020302. The SPARQL endpoints I've experimented with don't seem to mind the abbreviations, however.

In the example, the specific view of the margin of the upper and lower leaf surface is linked to the general category of leaf views using skos:broader. It's linked to the herbaceous angiosperm organism group using skos:inScheme. It's linked to the view of the margin of the upper and lower leaf surface in the woody angiosperm concept scheme using skos:exactMatch. The view of an upper stem leaf isn't exactly the same thing as a view of a lower stem leaf (stdview:020301), nor of a view of a whole leaf in woody angiosperms (stdview:010401), nor of a needle in gymnosperms (stdview:030401). But it's similar to those views, so it's linked to them using skos:closeMatch.

Using the SLPIG standardized view thesaurus

The 13958 images in the Bioimages database are all categorized using standard view URIs as values of the Iptc4xmpExt:CVterm property. (See http://bioimages.vanderbilt.edu/tsn/19312 as an example of how the SLPIG views are used to sort out images.) The thesaurus has been loaded in the Vanderbilt Heard Library triplestore and is queryable at its SPARQL endpoint. So we can test out the thesaurus there by pasting the queries that follow into the endpoint's query box.

Use case 1 (Allow a user to select a view by presenting labels for concepts that are appropriate for a particular category of organisms, e.g. gymnosperms, herbaceous angiosperms, etc.):

This query shows all of the categories and views for the scheme labeled "woody angiosperms":

PREFIX Iptc4xmpExt: <http://iptc.org/std/Iptc4xmpExt/2008-02-29/>
PREFIX stdview: <http://bioimages.vanderbilt.edu/rdf/stdview#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?categoryLabel ?viewLabel
WHERE {
?scheme skos:prefLabel "woody angiosperms"@en.
?scheme skos:hasTopConcept ?viewCategory.
?view skos:broader ?viewCategory.
?viewCategory skos:prefLabel ?categoryLabel.
?view skos:prefLabel ?viewLabel.
}

You can replace "woody angiosperms" with "herbaceous angiosperms" or "gymnosperms" to display the categories and views for other groups. An application could present a user with a pick list of categories, then views after a particular scheme is chosen. The pick list could be used to categorize images as their metadata were recorded.
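The pick-list idea can be sketched in JavaScript. This is only a minimal sketch, not code from the Bioimages site: the function names buildSchemeQuery and groupViewsByCategory are hypothetical, the scheme label is assumed to be a trusted string, and the result rows are assumed to have already been pulled out of the endpoint's response into plain objects.

```javascript
// Hypothetical helper: builds the use case 1 query for a given scheme label.
// Assumes schemeLabel is a trusted string such as "woody angiosperms".
function buildSchemeQuery(schemeLabel) {
  return [
    'PREFIX skos: <http://www.w3.org/2004/02/skos/core#>',
    'SELECT ?categoryLabel ?viewLabel',
    'WHERE {',
    '?scheme skos:prefLabel "' + schemeLabel + '"@en.',
    '?scheme skos:hasTopConcept ?viewCategory.',
    '?view skos:broader ?viewCategory.',
    '?viewCategory skos:prefLabel ?categoryLabel.',
    '?view skos:prefLabel ?viewLabel.',
    '}'
  ].join('\n');
}

// Hypothetical helper: turns flat result rows into { category: [views] },
// which is the shape a two-level pick list needs.
function groupViewsByCategory(rows) {
  var grouped = {};
  rows.forEach(function (row) {
    if (!grouped[row.categoryLabel]) {
      grouped[row.categoryLabel] = [];
    }
    grouped[row.categoryLabel].push(row.viewLabel);
  });
  return grouped;
}

// Illustrative rows assembled from labels quoted in this post;
// they are not actual endpoint output.
var sampleRows = [
  { categoryLabel: 'leaf', viewLabel: 'upper stem leaves' },
  { categoryLabel: 'leaf', viewLabel: 'margin of upper and lower leaf surface' },
  { categoryLabel: 'bark', viewLabel: 'bark of a large tree' }
];

console.log(buildSchemeQuery('gymnosperms'));
console.log(groupViewsByCategory(sampleRows));
```

The grouped object could then drive a category dropdown (its keys) and a view dropdown (the array under the chosen key).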

Use case 2 (Allow images to be grouped by major categories (leaf, bark, fruit, etc.) and within those categories by views that were appropriate for each major category):

I've built a test application at http://bioimages.vanderbilt.edu/sparql-search.htm that uses SPARQL queries to narrow the search categories. To see how the dropdown controls were created, view the page source. The guts of the SPARQL queries and the dialogue with the Heard Library endpoint can be seen in the http://bioimages.vanderbilt.edu/sparql-search.js file. Each of the dropdown pick lists is populated by querying the endpoint to find out what values are available for each of the categories. Here's the function that generates the query that requests the data needed to populate the category dropdown (using some jQuery calls in addition to generic JavaScript):

function setCategoryOptions(passedGenus) {

    // create URI-encoded query string
    var string = "PREFIX Iptc4xmpExt: <http://iptc.org/std/Iptc4xmpExt/2008-02-29/>"+
                 "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>"+
                 "PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>"+
                 "PREFIX foaf: <http://xmlns.com/foaf/0.1/>"+
                 "PREFIX dsw: <http://purl.org/dsw/>"+
                 'SELECT DISTINCT ?category WHERE {' +
                 "?identification dwc:genus " + passedGenus + "." +
                 "?organism dsw:hasIdentification ?identification." +
                 "?organism foaf:depiction ?image." +
                 "?image Iptc4xmpExt:CVterm ?view." +
                 "?view skos:broader ?featureCategory." +
                 "?featureCategory skos:prefLabel ?category." +
                 '}'
                 +'ORDER BY ASC(?category)';
    var encodedQuery = encodeURIComponent(string);

    // send query to endpoint
    $.ajax({
        type: 'GET',
        url: 'http://rdf.library.vanderbilt.edu/sparql?query=' + encodedQuery,
        headers: {
            Accept: 'application/sparql-results+xml'
        },
        success: parseCategoryXml
    });

}

After the function runs, it passes the results XML to a function that pulls the values from the elements and uses them to populate the dropdown lists. The function shown above runs when the page loads, and in that case the value of passedGenus is ?genus, which places no restrictions on the genus in the query. However, the function also fires when there is a change in the genus drop-down. In that case, the selected value of the genus is inserted into the query as a literal (e.g. "Acer"). The query then finds out what categories are actually present in the data for that genus (as opposed to what categories "should" be appropriate for that genus). So for example, if the genus Bradburia is selected, only the categories "entire organism", "inflorescence", and "leaf" are loaded into the pick list because there aren't any stem, fruit, or seed images for that genus in the database. One downside of this is that the query takes long enough to run that the dropdown sometimes doesn't get populated with the appropriate values before the user makes a selection. As I noted in an earlier post, Callimachus (which the Heard Library endpoint is currently using) runs queries much more slowly than Stardog, so it's possible that performance here would be much better with a faster endpoint.
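The switch between the unrestricted variable and a quoted literal can be sketched as a small helper. This is a sketch under assumptions rather than the code in sparql-search.js: the names toSparqlLiteral and genusTerm are hypothetical, and an empty string is assumed to mean "no genus selected". Escaping the selected value is an extra precaution that the function above does not take.

```javascript
// Hypothetical helper: wraps a value in double quotes as a SPARQL string
// literal, escaping backslashes, double quotes, and line breaks.
function toSparqlLiteral(value) {
  var escaped = String(value)
    .replace(/\\/g, '\\\\')
    .replace(/"/g, '\\"')
    .replace(/\n/g, '\\n')
    .replace(/\r/g, '\\r');
  return '"' + escaped + '"';
}

// Hypothetical helper: returns the term to splice into the dwc:genus
// triple pattern. An empty selection (an assumption of this sketch)
// yields the variable ?genus, which leaves the genus unrestricted.
function genusTerm(selectedGenus) {
  return selectedGenus ? toSparqlLiteral(selectedGenus) : '?genus';
}

console.log('?identification dwc:genus ' + genusTerm('') + '.');
console.log('?identification dwc:genus ' + genusTerm('Acer') + '.');
```

The returned string could be passed as passedGenus to a function like the one shown above.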

Once the user selects a category, that fires the function to query the endpoint to find the views that fall into that category:

function setViewOptions(passedCategory) {
    // create URI-encoded query string
    var string = "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>"
                 +'SELECT DISTINCT ?viewLabel WHERE {'
                 +'?featureCategory skos:prefLabel '+passedCategory+'.'
                 +'?view skos:broader ?featureCategory.'
                 +'?view skos:prefLabel ?viewLabel.'
                 +'}'
                 +'ORDER BY ASC(?viewLabel)';
    var encodedQuery = encodeURIComponent(string);

...

As before, the XML results eventually end up on the drop-down pick list for selecting the view. Unlike the previous query, it doesn't (at this point) display only the labels that are used for images in the database that meet the other search criteria - it displays all possible views that fall into the selected category. It would be nice to restrict the views to those used in images that meet the other search criteria, but I haven't spent the time necessary to make the code that complex.
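One possible way to restrict the views would be to combine the triple patterns from the two queries already shown. The sketch below is an assumption about how that might look, not code from the test application: buildRestrictedViewQuery is a hypothetical name, it has not been run against the Heard Library endpoint, and the extra joins would probably make the query slower.

```javascript
// Hypothetical sketch: returns a query for the labels of views that are
// actually used by images of the given genus within the given category.
// genusTerm is either "?genus" or a quoted literal such as '"Acer"';
// categoryTerm is a language-tagged literal such as '"leaf"@en'.
function buildRestrictedViewQuery(genusTerm, categoryTerm) {
  return "PREFIX Iptc4xmpExt: <http://iptc.org/std/Iptc4xmpExt/2008-02-29/>" +
    "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>" +
    "PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>" +
    "PREFIX foaf: <http://xmlns.com/foaf/0.1/>" +
    "PREFIX dsw: <http://purl.org/dsw/>" +
    "SELECT DISTINCT ?viewLabel WHERE {" +
    "?identification dwc:genus " + genusTerm + "." +
    "?organism dsw:hasIdentification ?identification." +
    "?organism foaf:depiction ?image." +
    "?image Iptc4xmpExt:CVterm ?view." +
    "?view skos:broader ?featureCategory." +
    "?featureCategory skos:prefLabel " + categoryTerm + "." +
    "?view skos:prefLabel ?viewLabel." +
    "}" +
    "ORDER BY ASC(?viewLabel)";
}

console.log(buildRestrictedViewQuery('"Acer"', '"leaf"@en'));
```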

Use case 3 (Group images that fall into the same major category regardless of whether they are in the same category of organism, e.g. show bark of any trees whether they are woody angiosperms or gymnosperms):

The test SPARQL web search JavaScript cheats on this one by including a triple pattern that requires a match to the label string:

             +'?featureCategory skos:prefLabel '+passedCategory+'.'

If the category passed to the function were "bark"@en, then the triple pattern would be

?featureCategory skos:prefLabel "bark"@en.

causing the views to be returned for any category that has the preferred label "bark"@en. That's fine as long as the major categories have the same preferred label, but if I'd used the labels "angiosperm bark"@en and "gymnosperm bark"@en, that trick wouldn't work.

A better approach would be to make use of this information:

stdview:0102 skos:exactMatch stdview:0302.

where stdview:0102 is the bark category from the woody angiosperm concept scheme and stdview:0302 is the bark category from the gymnosperm concept scheme.

Let's say that I want to find all of the views that are in any category that is equivalent to the category of the view of http://bioimages.vanderbilt.edu/baskauf/41954 (a photo of redwood bark). Maybe I'm in the redwood forest and I want to see all of the kinds of bark that might be there, and I don't care whether the tree is an angiosperm or a gymnosperm. I could use this query to discover the other bark views:

SELECT DISTINCT ?otherViews ?label
WHERE {
<http://bioimages.vanderbilt.edu/baskauf/41954> Iptc4xmpExt:CVterm ?view.
?view skos:broader ?viewCategory.
?viewCategory skos:exactMatch* ?equivCategory.
?otherViews skos:broader ?equivCategory.
?otherViews skos:prefLabel ?label.
}

which produces these results:

otherViews label
stdview:030202 bark of a medium tree or large branch@en
stdview:030201 bark of a large tree@en
stdview:030203 bark of a small tree or small branch@en
stdview:030200 unspecified bark view@en
stdview:010201 bark of a large tree@en
stdview:010203 bark of a small tree or small branch@en
stdview:010202 bark of a medium tree or large branch@en
stdview:010200 unspecified bark view@en

Although some of the labels are the same, all of the view URIs are different because they fall into two different concept schemes. The key triple pattern in this graph pattern is:

?viewCategory skos:exactMatch* ?equivCategory.

In that triple pattern, I used the "*" arbitrary length path matching operator, which matches with paths zero to many properties long. In theory, I could have used the "?" operator (paths zero or one property long), except that for whatever reason, doing that hangs the Callimachus endpoint. I used "*" rather than "+" (paths one to many properties long) because I also want the query to pick up the views that are in the same category as the redwood bark picture (stdview:0302).
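The difference between the three path operators can be illustrated with a toy model in JavaScript. This is only a sketch of the semantics, not how an endpoint evaluates paths. The function names are hypothetical, and the data is just the single triple quoted earlier (stdview:0102 skos:exactMatch stdview:0302); the real thesaurus states more links, so results from the endpoint will differ.

```javascript
// Toy data: [subject, object] pairs standing for skos:exactMatch triples.
// Only the one triple quoted in this post is included.
var exactMatchPairs = [
  ['stdview:0102', 'stdview:0302']
];

// Objects reachable from node in exactly one step.
function oneStep(node, pairs) {
  return pairs
    .filter(function (pair) { return pair[0] === node; })
    .map(function (pair) { return pair[1]; });
}

// "+" : nodes reachable by paths one or more properties long.
function pathPlus(start, pairs) {
  var seen = [];
  var frontier = oneStep(start, pairs);
  while (frontier.length > 0) {
    var node = frontier.shift();
    if (seen.indexOf(node) === -1) {
      seen.push(node);
      frontier = frontier.concat(oneStep(node, pairs));
    }
  }
  return seen;
}

// "*" : like "+", but the zero-length path also matches the start node.
function pathStar(start, pairs) {
  var plus = pathPlus(start, pairs);
  return plus.indexOf(start) === -1 ? [start].concat(plus) : plus;
}

// "?" : the start node plus anything exactly one step away.
function pathOptional(start, pairs) {
  var result = [start];
  oneStep(start, pairs).forEach(function (node) {
    if (result.indexOf(node) === -1) {
      result.push(node);
    }
  });
  return result;
}

console.log(pathPlus('stdview:0302', exactMatchPairs));
console.log(pathStar('stdview:0302', exactMatchPairs));
console.log(pathOptional('stdview:0302', exactMatchPairs));
```

With this toy data, a category that has no outgoing exactMatch link is still returned by "*" and "?" through the zero-length path, but not by "+".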

When I was writing the thesaurus, one design decision I had to make was whether to assume that the endpoint would support reasoning using SKOS as a schema (Tbox). According to the SKOS spec, skos:exactMatch is transitive and symmetric, so if I knew for sure that the endpoint were going to infer the entailed relationships, I could have related multiple equivalent categories for entire organisms in my thesaurus like this:

stdview:0101 skos:exactMatch stdview:0201.
stdview:0201 skos:exactMatch stdview:0301.
stdview:0101 skos:exactMatch stdview:0401.

or in any other way that connected the equivalent categories via at least one link. However, since I wasn't sure that endpoints would have that capability, I stated the relationships like this:

stdview:0101 skos:exactMatch stdview:0201.
stdview:0101 skos:exactMatch stdview:0301.
stdview:0101 skos:exactMatch stdview:0401.

stdview:0201 skos:exactMatch stdview:0101.
stdview:0201 skos:exactMatch stdview:0301.
stdview:0201 skos:exactMatch stdview:0401.
stdview:0301 skos:exactMatch stdview:0101.
stdview:0301 skos:exactMatch stdview:0201.
stdview:0301 skos:exactMatch stdview:0401.
stdview:0401 skos:exactMatch stdview:0101.
stdview:0401 skos:exactMatch stdview:0201.
stdview:0401 skos:exactMatch stdview:0301.

which causes every equivalent category to be explicitly linked to every other equivalent category via skos:exactMatch. That was annoying, but safe. Another possibility would have been to have just defined a single category concept like stdview:entireOrganism and reused it in all of the concept schemes. There is nothing in the SKOS guidelines that says that a concept cannot be used in several concept schemes. However, since the view categories had already been assigned category URIs that were in use, it seemed best to keep using those and to link them with skos:exactMatch.
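Writing out every pair by hand is tedious, so the explicit links could be generated. The following is a minimal sketch with a hypothetical function name (allPairsExactMatch); it simply emits one Turtle-style line for every ordered pair of distinct concepts in a list.

```javascript
// Hypothetical helper: for a list of equivalent concepts, returns one
// "subject skos:exactMatch object." line for every ordered pair of
// distinct concepts, so no reasoning is needed to find the links.
function allPairsExactMatch(concepts) {
  var triples = [];
  concepts.forEach(function (subject) {
    concepts.forEach(function (object) {
      if (subject !== object) {
        triples.push(subject + ' skos:exactMatch ' + object + '.');
      }
    });
  });
  return triples;
}

// The four "entire organism" categories used in the example above.
var entireOrganism = ['stdview:0101', 'stdview:0201', 'stdview:0301', 'stdview:0401'];
var generated = allPairsExactMatch(entireOrganism);

console.log(generated.join('\n'));
console.log(generated.length); // 12 triples for 4 concepts
```

For n equivalent concepts this produces n × (n − 1) triples, which is why the explicit approach grows quickly as more organism groups are added.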

Use case 4 (Support labels in multiple languages):
At present, the preferred labels are only given in English. But they are language-tagged literals, so at some point in the future when preferred labels are provided for other languages, preferred labels for one language could be distinguished from preferred labels in other languages by using a filter. For example, one could add

FILTER(langMatches(lang(?label), "en"))
BIND (str(?label) AS ?strippedLabel)

The FILTER statement requires the labels to be some variety of English (en, en-US, en-GB, etc.) and the second statement binds the string part of the label (minus the language tag) to a new variable ?strippedLabel, which can be displayed to users. You can try adding this to the query above [2], although it will only work for the English language tag "en" at present.
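What the FILTER and BIND lines do can be mimicked on the client side. The sketch below is an illustration of the matching rule, not a replacement for the SPARQL functions. It assumes labels have been retrieved as objects with value and lang fields; the function names are hypothetical, and the German and en-US labels are invented for illustration, since the thesaurus currently has only "en" labels.

```javascript
// Sketch of basic language-range matching in the spirit of langMatches:
// the tag matches if it equals the range or starts with the range
// followed by "-"; the range "*" matches any non-empty tag.
function langMatchesBasic(tag, range) {
  var t = tag.toLowerCase();
  var r = range.toLowerCase();
  if (r === '*') {
    return t.length > 0;
  }
  return t === r || t.indexOf(r + '-') === 0;
}

// Keeps labels in the requested language and returns only their string
// part, which corresponds to binding str(?label) to ?strippedLabel.
function labelsForLanguage(labels, range) {
  return labels
    .filter(function (label) { return langMatchesBasic(label.lang, range); })
    .map(function (label) { return label.value; });
}

// Hypothetical labels: only the first one reflects the current thesaurus.
var leafLabels = [
  { value: 'leaf', lang: 'en' },
  { value: 'leaf (US)', lang: 'en-US' },
  { value: 'Blatt', lang: 'de' }
];

console.log(labelsForLanguage(leafLabels, 'en'));
console.log(labelsForLanguage(leafLabels, 'de'));
```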

Conclusions

1. I was really very pleased with how this thesaurus has worked out. I was able to keep it relatively simple, with only two levels in its concept hierarchy. The semantics of SKOS seem to be right for the task and thus far I haven't thought of any tasks that I haven't been able to easily write SPARQL queries to complete. I'm getting an immediate bang for my buck by being able to search for images after loading the thesaurus triples and accessing them through the Heard Library SPARQL endpoint.

2. I think that if I were starting from scratch, I'd still define a concept scheme for each organism group (herbaceous angiosperms, gymnosperms, etc.) but would designate a single concept for each category and view to avoid having to declare many concepts as equivalent.

3. I should note that I'm not considering that this thesaurus would be the controlled vocabulary for Iptc4xmpExt:CVterm in Audubon Core. It would be a controlled vocabulary that could be used with Iptc4xmpExt:CVterm. It would have value to the extent that it were widely used. The point I'm trying to make in this post is that it would be advantageous for the value of Iptc4xmpExt:CVterm to be populated with URIs that dereference to SKOS concepts rather than populated with strings that would have to be cleaned up and reconciled with some list of standardized strings.

4. This thesaurus uses SKOS in a fairly conventional way. It assumes that the controlled values will be specified completely and sufficiently by associating the image with only a single view URI in the metadata, rather than using literals and requiring aggregators to perform string matching. As you'll see in upcoming posts, this won't be the case for other controlled vocabularies with which I'm experimenting. In this case, I make no attempt to associate strings with the image record because I assume that the labels may change or be added at any time, and that producers and consumers can access the labels at will from the thesaurus.

Endnotes

[1] http://terms.tdwg.org/wiki/Audubon_Core
[2] Like this:

PREFIX Iptc4xmpExt: <http://iptc.org/std/Iptc4xmpExt/2008-02-29/>
PREFIX stdview: <http://bioimages.vanderbilt.edu/rdf/stdview#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT DISTINCT ?otherViews ?strippedLabel
WHERE {
<http://bioimages.vanderbilt.edu/baskauf/41954> Iptc4xmpExt:CVterm ?view.
?view skos:broader ?viewCategory.
?viewCategory skos:exactMatch* ?equivCategory.
?otherViews skos:broader ?equivCategory.
?otherViews skos:prefLabel ?label.
FILTER(langMatches(lang(?label), "en"))
BIND (str(?label) AS ?strippedLabel)
}

Monday, March 14, 2016

Ontologies, thesauri, and SKOS

Disclaimer: this post contains opinions that are entirely my own and do not represent any official position related to any of my work as part of TDWG.

In my last post, I noted that RDF exported from http://terms.tdwg.org made the assertion:

dwc:recordedBy a skos:Concept.

The definitive RDF for this Darwin Core term asserts:

dwc:recordedBy a rdf:Property.

Similarly, the terms.tdwg.org RDF asserts:

dwc:Organism a skos:Concept.

whereas the Darwin Core RDF asserts:

dwc:Organism a rdfs:Class.

Can you really do that? Can a concept also be a class? Can a concept also be a property?

"Can you do that?"

In the process of learning more about how to use RDF, I learned that it was not productive to ask the question "Can you do that?" The problem is, that question is not specific enough for the RDF world. When I ask the question "Can you do that?", which of the following do I really mean?

Is it possible to assert that in RDF?
Is that assertion consistent with a data model that I'm following?
Is it generally inconsistent to assert that in RDF?
Does it violate conventions to assert that?
Does making that assertion facilitate meeting one of your use cases?

This list gives you an idea of the range of considerations that one should contemplate when deciding about the "right" way to assert RDF. In this post, I'm going to use these questions to frame a discussion of the appropriate use of SKOS (Simple Knowledge Organization System) in creating RDF definitions for terms in ontologies and in thesauri (or controlled vocabularies).

First, a word about SKOS. The SKOS data model claims to be "simple". One might not think that it is simple after looking at the SKOS reference Recommendation, which is lengthy and detailed. Rather, it is simple in that it aspires to facilitate data-sharing and integration of thesauri, but NOT to serve as a formal knowledge representation language. [1]

The graph

Assume that I assert the following graph:

dwc:Organism a rdfs:Class.
dwc:Organism a skos:Concept.

The questions that follow relate to this graph.

A simple question

Question: "Is it possible to assert that in RDF?"
Answer: I just did, so clearly it's possible. Anyone can say Anything about Anything in RDF. But just because it's possible doesn't mean it makes sense or is a good idea.

Consistency with the SKOS model

Question: "Is that assertion consistent with a data model that I'm following?"
Answer: Section 3.5.1. of the SKOS Reference says that "this specification does not make any additional statement about the formal relationship between the class of SKOS concepts and the class of OWL classes. The decision not to make any such statement has been made to allow applications the freedom to explore different design patterns for working with SKOS in combination with OWL."

Since

owl:Class rdfs:subClassOf rdfs:Class.

anything that is an owl:Class will also be an rdfs:Class. So if it's consistent with the SKOS model for something to be both a skos:Concept and an owl:Class, then it's also consistent for it to be both a skos:Concept and an rdfs:Class.

The same section of the SKOS Reference also says that it's consistent for a thing to be both a skos:Concept and an OWL property. Since owl:ObjectProperty and owl:DatatypeProperty are subclasses of rdf:Property, something that was an OWL property would also be an rdf:Property, and therefore it would be consistent for that thing to be both a skos:Concept and an rdf:Property. So saying

dwc:recordedBy a skos:Concept, rdf:Property.

is consistent with the SKOS model. However, consistency with the SKOS model doesn't necessarily mean it's a good idea.
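The subclass reasoning used in this section can be sketched as a tiny inference loop in JavaScript. This is a toy illustration of one RDFS entailment pattern (an instance of a class is also an instance of its superclasses), not a real reasoner. The function name inferTypes and the ex: resources are hypothetical.

```javascript
// Toy triples as [subject, predicate, object]. The two subclass axioms
// are the ones relied on in the text; the ex: resources are hypothetical.
var assertedTriples = [
  ['owl:Class', 'rdfs:subClassOf', 'rdfs:Class'],
  ['owl:ObjectProperty', 'rdfs:subClassOf', 'rdf:Property'],
  ['ex:SomeClass', 'rdf:type', 'owl:Class'],
  ['ex:someProperty', 'rdf:type', 'owl:ObjectProperty']
];

// Repeatedly applies: if X rdf:type C and C rdfs:subClassOf D,
// then X rdf:type D, until no new triples appear.
function inferTypes(triples) {
  var result = triples.slice();
  var known = {};
  result.forEach(function (t) { known[t.join(' ')] = true; });
  var changed = true;
  while (changed) {
    changed = false;
    result.slice().forEach(function (typeTriple) {
      if (typeTriple[1] !== 'rdf:type') { return; }
      result.slice().forEach(function (subTriple) {
        if (subTriple[1] === 'rdfs:subClassOf' && subTriple[0] === typeTriple[2]) {
          var inferred = [typeTriple[0], 'rdf:type', subTriple[2]];
          var key = inferred.join(' ');
          if (!known[key]) {
            known[key] = true;
            result.push(inferred);
            changed = true;
          }
        }
      });
    });
  }
  return result;
}

inferTypes(assertedTriples).forEach(function (t) {
  console.log(t.join(' ') + '.');
});
```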

More general consistency

Question: "Is it generally inconsistent to assert that in RDF?"
Answer: Section 3.3 of the SKOS Reference asserts

skos:Concept a owl:Class.

This means that if we assert

dwc:Organism a skos:Concept.

then dwc:Organism is an instance of the skos:Concept class (or an "individual" in OWL parlance). But it is also a class because I declared that in my graph. Is it a problem for something to be both an instance and a class? The answer is: "it depends".

In OWL Full, it is fine for a resource to be both an individual and a class, so my graph would be consistent under OWL Full reasoning. However, OWL-DL imposes a disjointness condition between classes and individuals so in OWL-DL, my graph would be inconsistent. (See section 5.2. of the SKOS Primer for more on this.) So whether or not my graph is inconsistent depends on the type of reasoning that I expect users to conduct.

In the case where

dwc:recordedBy a skos:Concept, rdf:Property.

I'm not aware of a similar situation where it is inconsistent for a resource to be both an instance (individual) and a property.

What is SKOS for?

Question: "Does it violate conventions to assert that?"
Answer: We could start to approach the question of whether it's a bad idea to type properties and classes as skos:Concept by asking, what exactly is meant by skos:Concept?

Section 2.1 of the SKOS Primer says "The fundamental element of the SKOS vocabulary is the concept. Concepts are the units of thought—ideas, meanings, or (categories of) objects and events—which underlie many knowledge organization systems. As such, concepts exist in the mind as abstract entities which are independent of the terms used to label them." Section 3.1 of the normative SKOS Reference says "A SKOS concept can be viewed as an idea or notion; a unit of thought. However, what constitutes a unit of thought is subjective, and this definition is meant to be suggestive, rather than restrictive." Both of these definitions suggest that instances of skos:Concept are things that we have in our head, and not real-world things. However, I suppose that one could argue that properties and classes exist in our heads, even if their instances are real-world objects.

I think we can get a better understanding of what skos:Concepts are by looking at the reason why SKOS itself was developed. SKOS was developed alongside ISO 25964-1 (Information and documentation -- Thesauri and interoperability with other vocabularies -- Part 1: Thesauri for information retrieval) and ISO 25964-2 (Information and documentation -- Thesauri and interoperability with other vocabularies -- Part 2: Interoperability with other vocabularies). The NISO summary explains the relationship between ISO 25964 and SKOS like this: "Especially close neighbors in the Semantic Web jigsaw are ISO 25964 and SKOS. ISO 25964-1 essentially advises on the selection and fitting together of concepts, terms and relationships to make a good thesaurus. SKOS addresses the next step, with recommendations on porting the resultant thesauri (or other ‘simple Knowledge Organization Systems’) to the Web. ISO 25964-2 recommends the sort of mappings that can be established between one KOS and another; SKOS presents a way of expressing these when published to the Web." SKOS also provides for kinds of Knowledge Organization Systems other than thesauri, but it's clear that facilitating thesauri on the Web was an organizing principle for the development of SKOS.

It is a mystery to me why standards like ISO 25964-1 and ISO 25964-2 are hidden behind a huge paywall. My library didn't have them, and at a cost of about $400 for the two parts, I certainly wasn't going to pay for them myself. I managed to borrow them via Interlibrary Loan and was able to peer into their secrets.

Section 4.1 of Part 1 describes thesauri like this: "The traditional aim of a thesaurus is to guide the indexer and the searcher to choose the same term for the same concept. In order to achieve this, a thesaurus should first list all of the concepts that might be useful for retrieval purposes in a given domain. The concepts are represented by terms ... . Secondly, a thesaurus should present the preferred terms in such a way that people will easily identify the one(s) they need." Section 3.86 of Part 2 includes the note "... a thesaurus is optimized for human navigability and terminological coverage of a domain". From this, I conclude that thesauri are all about human searching based on human-readable terms or strings.

In contrast, section 21.1 of Part 2 describes ontologies like this (section 21.1.1): "... 'ontology' is often interpreted as the use of a formal language to set out a formalized representation of a domain of knowledge. Among other tasks, this enables the consistency of knowledge assertions (facts) to be checked against the ontology, and possibly new ones to be inferred. An ontology and a set of facts (assertions about individuals) together form a knowledge base. One of the fundamental purposes of an ontology is reasoning, including generic tasks such as: inferring class membership for individuals, inferring relationships between classes and properties, and checking the consistency of a knowledge base." It goes on to say (section 21.1.2) "Whereas the role of most of the vocabularies described in this part of ISO 25964 is to guide the selection of search/indexing terms, or the browsing of organized document collections, the purpose of ontologies in the context of retrieval is different. Ontologies are not designed for information retrieval by index terms or class notation, but for making assertions about individuals, e.g. about real persons or abstract things such as a process." In section 21.1.3: "More recently, the term 'lightweight ontology' has been employed in some Semantic Web literature to cover all sorts of structured vocabularies and knowledge organization systems, including thesauri, classification schemes, etc. This terminology is not employed in this part of ISO 25964, since the blurring of distinctions entailed in the loose use of the term is considered unhelpful." Finally (section 21.3) "the concepts of a thesaurus and the classes of an ontology represent meaning in two fundamentally different ways. Thesauri express the meaning of a concept through terms, supported by adjuncts such as a hierarchy, associated concepts, qualifiers, scope notes and/or a precise definition, all directed mainly to human users. 
Ontologies, in contrast, convey the meaning of classes through machine-readable membership conditions."

From my reading of ISO 25964, I conclude that thesauri are focused on information retrieval by humans using terms, and ontologies are focused on reasoning related to properties and classes by machines about real or abstract things. They are two distinct kinds of vocabularies with different purposes, and I don't think that it is helpful to mix up the two by declaring the same resource to be both a class and a concept.

Although the SKOS Recommendation is a little less explicit about the definition of thesauri and ontologies, this distinction is implicit in section 1.3 of the SKOS Recommendation. That section says

The elements of the SKOS data model are classes and properties, and the structure and integrity of the data model is defined by the logical characteristics of, and interdependencies between, those classes and properties. This is perhaps one of the most powerful and yet potentially confusing aspects of SKOS, because SKOS can, in more advanced applications, also be used side-by-side with OWL to express and exchange knowledge about a domain. However, SKOS is not a formal knowledge representation language.

To understand this distinction, consider that the "knowledge" made explicit in a formal ontology is expressed as sets of axioms and facts. A thesaurus or classification scheme is of a completely different nature, and does not assert any axioms or facts. Rather, a thesaurus or classification scheme identifies and describes, through natural language and other informal means, a set of distinct ideas or meanings, which are sometimes conveniently referred to as "concepts". These "concepts" may also be arranged and organized into various structures, most commonly hierarchies and association networks. These structures, however, do not have any formal semantics, and cannot be reliably interpreted as either formal axioms or facts about the world. Indeed they were never intended to be so, for they serve only to provide a convenient and intuitive map of some subject domain, which can then be used as an aid to organizing and finding objects, such as documents, which are relevant to that domain.

I encourage you to read that section in its entirety. The take-home message that I get from section 1.3 is similar to what I got from ISO 25964: concepts are intended for humans to use to organize and find things like records and documents. Unlike classes, they are not intended to be used to express axioms and facts about the world.

Now that I've written all of that, I'd like to suggest a simple answer to the question I posed at the beginning of this section: does it violate conventions to assert

dwc:Organism a rdfs:Class,skos:Concept.

? My conclusion is "yes". The statement

dwc:Organism a rdfs:Class.

comes from the Darwin Core vocabulary, which is all about defining classes and properties, so it could be classified as an ontology by the definition of ISO 25964. Mixing in

dwc:Organism a skos:Concept.

does not follow the convention, expressed in both ISO 25964 and the SKOS Recommendation, that concepts aren't intended to be used to represent knowledge.

Thesauri vs. ontologies

There is one "can you do that?" question left from my list:

Question: Does making that assertion facilitate meeting one of your use cases?
Answer: This is one of the most important questions, but also one of the most difficult to answer because we can't know the answer without knowing the use cases. We can't know the use cases unless we've examined the context of why we are constructing the vocabulary that contains the assertions.

In the case of our little graph

dwc:Organism a rdfs:Class.
dwc:Organism a skos:Concept.

that originated from the TDWG terms browser (http://terms.tdwg.org), I already confessed in my previous post that I have no idea of the reason for the creation of most of the triples there. So in that context, I can't give an answer to this question.

To flesh out the circumstances under which one might choose to model something as a class vs. a concept, I'm going to illustrate with an example from my area of interest, biodiversity informatics.

In biodiversity informatics, we track occurrences of organisms, and one important aspect of that is knowing about the evidence that a particular kind of organism occurred at some place during some time. A traditional form of evidence is a specimen, which is either the organism itself (dead or alive), or some piece of the organism, or some remnant of the organism (like a fossil or droppings). That specimen is a sort of voucher that any doubter can check to make sure that the recorder of the occurrence didn't make a mistake. So one important bit of information to keep track of about the occurrence is the kind of specimen that was collected.

Here is a diagram showing how we could model the relationships between various categories of specimens, with the broadest category at the top and narrowest categories at the bottom.

There are two ways that we could construct this model in RDF: build an ontology, or build a thesaurus. Let's start with building an ontology.

My ontology

Here's one way that I could describe the relationships in the diagram:

@prefix my: <http://example.org/my/>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix owl: <http://www.w3.org/2002/07/owl#>.

my:Ontology a owl:Ontology.
my:Specimen a rdfs:Class;
rdfs:label "Specimen"@en;
rdfs:isDefinedBy my:Ontology.
my:PreservedSpecimen a rdfs:Class;
rdfs:label "Preserved Specimen"@en;
rdfs:subClassOf my:Specimen;
rdfs:isDefinedBy my:Ontology.
my:FossilSpecimen a rdfs:Class;
rdfs:label "Fossil Specimen"@en;
rdfs:subClassOf my:Specimen;
rdfs:isDefinedBy my:Ontology.
my:LivingSpecimen a rdfs:Class;
rdfs:label "Living Specimen"@en;
rdfs:subClassOf my:Specimen;
rdfs:isDefinedBy my:Ontology.
my:SpecimenInAlcohol a rdfs:Class;
rdfs:label "Specimen In Alcohol"@en;
rdfs:subClassOf my:PreservedSpecimen;
rdfs:isDefinedBy my:Ontology.
my:PressedSpecimen a rdfs:Class;
rdfs:label "Pressed Specimen"@en;
rdfs:subClassOf my:PreservedSpecimen;
rdfs:isDefinedBy my:Ontology.
my:PinnedSpecimen a rdfs:Class;
rdfs:label "Pinned Specimen"@en;
rdfs:subClassOf my:PreservedSpecimen;
rdfs:isDefinedBy my:Ontology.

Now if I wanted to describe the evidence for an occurrence in some data, I could assert this:

:occurrence <http://purl.obolibrary.org/obo/RO_0002558> :voucher.

:voucher a my:PressedSpecimen.

where <http://purl.obolibrary.org/obo/RO_0002558> is the "has evidence" relationship from the Relations Ontology.[2] My ontology would then entail that the following triples are also true:

:voucher a my:PreservedSpecimen.

:voucher a my:Specimen.

because of the rdfs:subClassOf statements in the ontology.
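As a rough illustration of what an RDFS-aware client does with those rdfs:subClassOf statements, here is a minimal Python sketch. It assumes the hierarchy has been copied by hand into a dictionary (a hypothetical stand-in for a real triple store and reasoner) and that each class has at most one direct superclass, as in my example.

```python
# Hand-copied rdfs:subClassOf links from the ontology above (child -> parent).
SUBCLASS_OF = {
    "my:PreservedSpecimen": "my:Specimen",
    "my:FossilSpecimen": "my:Specimen",
    "my:LivingSpecimen": "my:Specimen",
    "my:SpecimenInAlcohol": "my:PreservedSpecimen",
    "my:PressedSpecimen": "my:PreservedSpecimen",
    "my:PinnedSpecimen": "my:PreservedSpecimen",
}

def entailed_types(asserted_class, subclass_of=SUBCLASS_OF):
    """Return the superclasses that an instance of asserted_class
    also belongs to, nearest superclass first."""
    entailed = []
    current = asserted_class
    # Walk up the hierarchy until a class with no superclass is reached.
    while current in subclass_of:
        current = subclass_of[current]
        entailed.append(current)
    return entailed

# :voucher a my:PressedSpecimen.
print(entailed_types("my:PressedSpecimen"))
```

For :voucher, this yields my:PreservedSpecimen and then my:Specimen, matching the two entailed triples.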

If I were a fan of SKOS, I might be tempted to assert this kind of thing in my ontology:

my:PreservedSpecimen a rdfs:Class;
skos:prefLabel "Preserved Specimen"@en;
skos:broader my:Specimen.

rather than using rdfs:subClassOf. However, using the SKOS vocabulary entails some things that I might not intend. Using skos:prefLabel is fine because it has no domain declaration; anything can have a preferred label. But using skos:broader entails that my:PreservedSpecimen and my:Specimen are instances of skos:Concept because skos:broader is a subproperty of skos:semanticRelation, which has both range and domain of skos:Concept.[3] So using the skos:broader property creates all of the issues that I discussed above related to declaring something to be both a class and a concept. If I'm interested in using SKOS, I'd be better off constructing a thesaurus.
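The unintended entailment can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the triples are plain tuples of prefixed names, and the DOMAIN and RANGE tables (hypothetical names) record just the one fact that matters here, namely that skos:broader inherits a domain and range of skos:Concept from skos:semanticRelation. skos:prefLabel is left out of the tables because it adds no type to its subject.

```python
SKOS_CONCEPT = "skos:Concept"

# Domain and range that skos:broader inherits from skos:semanticRelation.
DOMAIN = {"skos:broader": SKOS_CONCEPT}
RANGE = {"skos:broader": SKOS_CONCEPT}

def entailed_type_triples(triples):
    """Apply the RDFS domain and range rules and return the rdf:type
    triples that follow from them."""
    entailed = set()
    for s, p, o in triples:
        if p in DOMAIN:
            entailed.add((s, "rdf:type", DOMAIN[p]))
        if p in RANGE:
            entailed.add((o, "rdf:type", RANGE[p]))
    return entailed

triples = [
    ("my:PreservedSpecimen", "rdf:type", "rdfs:Class"),
    ("my:PreservedSpecimen", "skos:prefLabel", '"Preserved Specimen"@en'),
    ("my:PreservedSpecimen", "skos:broader", "my:Specimen"),
]
for triple in sorted(entailed_type_triples(triples)):
    print(triple)
```

Both my:PreservedSpecimen and my:Specimen come out typed as skos:Concept, even though I only ever declared them to be classes.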

My thesaurus

Here's how I could describe the same relationships as a thesaurus:

@prefix my: <http://example.org/my/>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.

my:Scheme a skos:ConceptScheme.
my:Specimen a skos:Concept;
skos:prefLabel "Specimen"@en;
skos:inScheme my:Scheme.
my:PreservedSpecimen a skos:Concept;
skos:prefLabel "Preserved Specimen"@en;
skos:broader my:Specimen;
skos:inScheme my:Scheme.
my:FossilSpecimen a skos:Concept;
skos:prefLabel "Fossil Specimen"@en;
skos:broader my:Specimen;
skos:inScheme my:Scheme.
my:LivingSpecimen a skos:Concept;
skos:prefLabel "Living Specimen"@en;
skos:broader my:Specimen;
skos:inScheme my:Scheme.
my:SpecimenInAlcohol a skos:Concept;
skos:prefLabel "Specimen In Alcohol"@en;
skos:broader my:PreservedSpecimen;
skos:inScheme my:Scheme.
my:PressedSpecimen a skos:Concept;
skos:prefLabel "Pressed Specimen"@en;
skos:broader my:PreservedSpecimen;
skos:inScheme my:Scheme.
my:PinnedSpecimen a skos:Concept;
skos:prefLabel "Pinned Specimen"@en;
skos:broader my:PreservedSpecimen;
skos:inScheme my:Scheme.

Now if I wanted to describe the evidence for an occurrence, I could assert this:

:occurrence <http://purl.obolibrary.org/obo/RO_0002558> my:PressedSpecimen.

where as before <http://purl.obolibrary.org/obo/RO_0002558> is the "has evidence" relationship from the Relations Ontology.[2]

Superficially, this looks a lot like the ontology that I made before. Each of the concepts that I defined is part of a collection (my:Scheme) in the same way that each of the classes that I defined was part of an ontology (my:Ontology). By using skos:broader, my thesaurus encodes information about the hierarchical relationships among the concepts, just as the ontology encodes the same relationships using rdfs:subClassOf. However, unlike the ontology, the thesaurus does not automatically entail additional relationships like

:occurrence <http://purl.obolibrary.org/obo/RO_0002558> my:PreservedSpecimen.

:occurrence <http://purl.obolibrary.org/obo/RO_0002558> my:Specimen.

from the expressed hierarchy.

There are several ways I could use SPARQL to "discover" those kinds of additional relationships. If my endpoint had reasoning capabilities like Stardog, I could load the SKOS vocabulary as the schema to use for reasoning (Tbox), then flip the reasoning switch on. I could then discover other broader concepts by executing this query:

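PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
# Hypothetical namespace for the example data; any namespace would do
PREFIX : <http://example.org/data/>
SELECT ?entailedConcept WHERE {
:occurrence <http://purl.obolibrary.org/obo/RO_0002558> ?assertedConcept.
?assertedConcept skos:broaderTransitive ?entailedConcept.
}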

skos:broader is a subproperty of skos:broaderTransitive, and skos:broaderTransitive is a transitive property, so with reasoning enabled, the broader concepts my:PreservedSpecimen and my:Specimen would be returned by the query. If the thesaurus were consistently written [4], one could find the same information using property paths with no reasoning turned on, as in this query:

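PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
# Hypothetical namespace for the example data; any namespace would do
PREFIX : <http://example.org/data/>
SELECT ?entailedConcept WHERE {
:occurrence <http://purl.obolibrary.org/obo/RO_0002558> ?assertedConcept.
?assertedConcept skos:broader+ ?entailedConcept.
}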
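For anyone who would rather see the idea outside of SPARQL, here is a minimal Python sketch of the same upward walk. It is my own illustration, not what any particular endpoint does: the triples are plain tuples, the function name is hypothetical, and I have deliberately mixed one skos:narrower link into the sample data to show that treating it as the inverse of skos:broader still recovers the hierarchy.

```python
from collections import deque

def broader_closure(concept, triples):
    """Return every concept reachable upward from concept, treating
    skos:narrower as the inverse of skos:broader."""
    parents = {}
    for s, p, o in triples:
        if p == "skos:broader":
            parents.setdefault(s, set()).add(o)
        elif p == "skos:narrower":
            # "A narrower B" means that B has A as a broader concept.
            parents.setdefault(o, set()).add(s)
    seen = set()
    queue = deque([concept])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

triples = [
    ("my:PressedSpecimen", "skos:broader", "my:PreservedSpecimen"),
    # Hypothetical link written in the opposite direction.
    ("my:Specimen", "skos:narrower", "my:PreservedSpecimen"),
]
print(sorted(broader_closure("my:PressedSpecimen", triples)))
```

A query that follows only skos:broader would stop at my:PreservedSpecimen with this data, because the link up to my:Specimen is expressed only as skos:narrower.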

Which is better, an ontology or a thesaurus?

Clearly, it would be possible to encode the desired relationships using either an ontology or a thesaurus. However, each approach has pros and cons depending on what I want to accomplish.

Semantic constraints:
ontology - The classes could be defined using many more semantic constraints and relationships than I included in my examples. That's good if one wishes to be more expressive.
thesaurus - There are few semantics that can be imposed beyond the broader/narrower relationships. That's good if one wishes to avoid unnecessary or unforeseen complications.

Entailments:
ontology - The semantics expressed in the ontology entails triples automatically for any client that is programmed to "understand" RDFS or OWL. That's good if you want to force those kinds of entailments on users.
thesaurus - The semantics expressed in the thesaurus will entail triples to the extent that the client "understands" RDFS and OWL and is programmed to make use of components of the SKOS vocabulary. That's good if you want to allow developers to choose which entailments they care about.

Instantiation:
ontology - Using the ontology method as I illustrated it required instantiating voucher individuals. This could be a good thing if I intend to make other statements about the properties of the vouchers, but could be a bad thing if I wasn't interested in keeping a separate database table about vouchers.
thesaurus - Using the thesaurus method as I illustrated it only required linking to a controlled vocabulary term. I was neither required nor able to say more about the evidence that documented the occurrence other than the category into which the evidence fell.

Cost:
ontology - I chose an example where the ontology and thesaurus versions were artificially similar. People who are into building ontologies generally make them a lot more complicated than this. In the words of section 1.3 of the SKOS Recommendation, "... some person has to do the work of transforming the structure and intellectual content of a thesaurus or classification scheme into a set of formal axioms and facts. This work of transformation is both intellectually demanding and time consuming, and therefore costly."
thesaurus - If the terms of a controlled vocabulary are already established, it takes relatively little work to express them as a SKOS concept scheme. In the words of section 1.3: "Much can be gained from using thesauri, etc., as-is, as informal, convenient structures for navigation within a subject domain. Using them as-is does not require any re-engineering and is therefore much less costly."

Clearly, there is no simple answer to the question of whether an ontology is better than a thesaurus. One would need to do some careful thinking about the considerations listed above and the use cases that one wishes to satisfy.

Conclusions

"Can you do that?"

If you've read this far, you have hopefully reached the same conclusion as me: mixing ontologies and thesauri, and making statements like:

dwc:Organism a rdfs:Class,skos:Concept.
dwc:recordedBy a rdf:Property, skos:Concept.

is not a good idea. You can do it, but probably should not.

How about mixing thesauri and data?

It is probably also not a good idea to mix thesauri and data. There is an interesting comparison of the difference between modeling the creation of the concept of King Henry VIII and the creation of King Henry VIII here. The FOAF vocabulary provides a property that can be used to link things to concepts about things: foaf:focus.

SKOS and TDWG controlled vocabularies

In the Darwin Core there are a number of terms whose definition includes "recommended best practice is to use a controlled vocabulary" (as a source of values), but for which no standard controlled vocabulary has been established. Examples include: dwc:sex, dwc:lifeStage, dwc:reproductiveCondition, dwc:behavior, dwc:establishmentMeans, dwc:occurrenceStatus, dwc:disposition, dwc:organismScope, dwc:taxonRank, dwc:taxonomicStatus, and dwc:measurementType. Other than providing some suggestions for literal values that can be used with these properties, the standard has little to say about the nature of the controlled vocabularies that should be used with the terms. TDWG also defines a standard category called "Data Standard" defined as "Specifies valid values in controlled vocabularies." However, up to this point there have been no adopted standards falling into this category. There are probably a number of reasons why, but I believe that one important reason is that TDWG hasn't figured out how to write such a standard in a robust and machine-readable way.

One of the tasks of the Vocabulary Maintenance Specification Task Group (of which I'm the convener) is to complete a standards documentation specification that details how standards should be written so that they can be easily understood by both humans and machines. Given what I've learned about SKOS and thesauri, I now believe that SKOS is probably the right model for describing machine-readable versions of controlled vocabularies, which are essentially a form of thesaurus. I also now believe that SKOS is probably NOT the right model for defining other vocabularies that are essentially ontologies (sensu ISO 25964), although certain terms from SKOS (such as skos:prefLabel) may be fine for use in those vocabularies.

In my next post, I may write about my experimentation with writing controlled vocabularies using the SKOS model.

[1] SKOS Reference section 1.3.
[2] Whether using the "has evidence" relation as I have in these examples is a good idea or not is beyond the scope of this post.
[3] https://www.w3.org/TR/skos-reference/#L2251
[4] The query based on property paths would not work if the thesaurus used a combination of skos:broader and skos:narrower to define the hierarchy, whereas the query based on reasoning using the SKOS vocabulary as the schema would work.