Monday, March 14, 2016

Ontologies, thesauri, and SKOS

Disclaimer: this post contains opinions that are entirely my own and do not represent any official position related to any of my work as part of TDWG.

In my last post, I noted that RDF exported from http://terms.tdwg.org made the assertion:

dwc:recordedBy a skos:Concept.

The definitive RDF for this Darwin Core term asserts:

dwc:recordedBy a rdf:Property.

Similarly, the terms.tdwg.org RDF asserts:

dwc:Organism a skos:Concept.

whereas the Darwin Core RDF asserts:

dwc:Organism a rdfs:Class.

Can you really do that?  Can a concept also be a class?  Can a concept also be a property?

"Can you do that?"

In the process of learning more about how to use RDF, I learned that it was not productive to ask the question "Can you do that?" The problem is, that question is not specific enough for the RDF world.  When I ask the question "Can you do that?", which of the following do I really mean?

  • Is it possible to assert that in RDF?
  • Is that assertion consistent with a data model that I'm following? 
  • Is it generally inconsistent to assert that in RDF?
  • Does it violate conventions to assert that?
  • Does making that assertion facilitate meeting one of your use cases?

This list gives you an idea of the range of considerations that one should contemplate when deciding about the "right" way to assert RDF.  In this post, I'm going to use these questions to frame a discussion of the appropriate use of SKOS (Simple Knowledge Organization System) in creating RDF definitions for terms in ontologies and in thesauri (or controlled vocabularies).

First, a word about SKOS.  The SKOS data model claims to be "simple".  One might not think that it is simple after looking at the SKOS reference Recommendation, which is lengthy and detailed.  Rather, it is simple in that it aspires to facilitate data-sharing and integration of thesauri, but NOT to serve as a formal knowledge representation language. [1]

The graph

Assume that I assert the following graph:

dwc:Organism a rdfs:Class.
dwc:Organism a skos:Concept.

The questions that follow relate to this graph.

A simple question

Question: "Is it possible to assert that in RDF?"
Answer: I just did, so clearly it's possible.  Anyone can say Anything about Anything in RDF.  But just because it's possible doesn't mean it makes sense or is a good idea.

Consistency with the SKOS model

Question: "Is that assertion consistent with a data model that I'm following?"
Answer: Section 3.5.1. of the SKOS Reference says that "this specification does not make any additional statement about the formal relationship between the class of SKOS concepts and the class of OWL classes. The decision not to make any such statement has been made to allow applications the freedom to explore different design patterns for working with SKOS in combination with OWL."

Since

owl:Class rdfs:subClassOf rdfs:Class.

anything that is an owl:Class will also be an rdfs:Class.  So if it's consistent with the SKOS model for something to be both a skos:Concept and an owl:Class, then it's also consistent for it to be both a skos:Concept and an rdfs:Class.

The same section of the SKOS Reference also says that it's consistent for a thing to be both a skos:Concept and an OWL property.  Since owl:ObjectProperty and owl:DatatypeProperty are subclasses of rdf:Property, something that is an OWL property is also an rdf:Property, and therefore it is consistent for that thing to be both a skos:Concept and an rdf:Property.  So saying

dwc:recordedBy a skos:Concept, rdf:Property.

is consistent with the SKOS model.  However, consistency with the SKOS model doesn't necessarily mean it's a good idea.

More general consistency

Question: "Is it generally inconsistent to assert that in RDF?"
Answer: Section 3.3 of the SKOS Reference asserts

skos:Concept a owl:Class.

This means that if we assert

dwc:Organism a skos:Concept.

then dwc:Organism is an instance of the skos:Concept class (or an "individual" in OWL parlance).  But it is also a class because I declared that in my graph.  Is it a problem for something to be both an instance and a class?  The answer is: "it depends".

In OWL Full, it is fine for a resource to be both an individual and a class, so my graph would be consistent under OWL Full reasoning.  However, OWL-DL imposes a disjointness condition between classes and individuals so in OWL-DL, my graph would be inconsistent.  (See section 5.2. of the SKOS Primer for more on this.)  So whether or not my graph is inconsistent depends on the type of reasoning that I expect users to conduct.

In the case where

dwc:recordedBy a skos:Concept, rdf:Property.

I'm not aware of a similar situation where it is inconsistent for a resource to be both an instance (individual) and a property.
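
To make the class-versus-individual issue concrete, here is a minimal Python sketch that flags resources typed both as a class and as a skos:Concept. It is only an illustration: triples are modeled as plain tuples of prefixed-name strings, and the function name find_class_concept_overlap is hypothetical rather than part of any RDF library. It does not perform OWL reasoning; it just locates the resources whose typing would matter under a regime that keeps classes and individuals disjoint.

```python
# Triples are modeled as (subject, predicate, object) tuples of prefixed
# names. This representation is an assumption made for illustration only.
CLASS_TYPES = {"rdfs:Class", "owl:Class"}


def find_class_concept_overlap(triples):
    """Return resources typed both as a class and as a skos:Concept."""
    types = {}
    for s, p, o in triples:
        if p == "rdf:type":
            types.setdefault(s, set()).add(o)
    return sorted(
        s for s, ts in types.items()
        if "skos:Concept" in ts and ts & CLASS_TYPES
    )


graph = [
    ("dwc:Organism", "rdf:type", "rdfs:Class"),
    ("dwc:Organism", "rdf:type", "skos:Concept"),
    ("dwc:recordedBy", "rdf:type", "rdf:Property"),
    ("dwc:recordedBy", "rdf:type", "skos:Concept"),
]

overlap = find_class_concept_overlap(graph)
print(overlap)
```

Only dwc:Organism is flagged. dwc:recordedBy is typed as a property rather than a class, so it falls outside the class/individual disjointness discussed above.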

What is SKOS for?

Question: "Does it violate conventions to assert that?"
Answer: We could start to approach the question of whether it's a bad idea to type properties and classes as skos:Concept by asking, what exactly is meant by skos:Concept?

Section 2.1 of the SKOS Primer says "The fundamental element of the SKOS vocabulary is the concept. Concepts are the units of thought—ideas, meanings, or (categories of) objects and events—which underlie many knowledge organization systems. As such, concepts exist in the mind as abstract entities which are independent of the terms used to label them."  Section 3.1 of the normative SKOS Reference says "A SKOS concept can be viewed as an idea or notion; a unit of thought. However, what constitutes a unit of thought is subjective, and this definition is meant to be suggestive, rather than restrictive."  Both of these definitions suggest that instances of skos:Concept are things that we have in our head, and not real-world things.  However, I suppose that one could argue that properties and classes exist in our heads, even if their instances are real-world objects.

I think we can get a better understanding of what skos:Concepts are by looking at the reason why SKOS itself was developed.  SKOS was developed alongside ISO 25964-1 (Information and documentation -- Thesauri and interoperability with other vocabularies -- Part 1: Thesauri for information retrieval) and ISO 25964-2 (Information and documentation -- Thesauri and interoperability with other vocabularies -- Part 2: Interoperability with other vocabularies).  The NISO summary explains the relationship between ISO 25964 and SKOS like this: "Especially close neighbors in the Semantic Web jigsaw are ISO 25964 and SKOS. ISO 25964-1 essentially advises on the selection and fitting together of concepts, terms and relationships to make a good thesaurus. SKOS addresses the next step, with recommendations on porting the resultant thesauri (or other ‘simple Knowledge Organization Systems’) to the Web. ISO 25964-2 recommends the sort of mappings that can be established between one KOS and another; SKOS presents a way of expressing these when published to the Web."  SKOS also provides for kinds of Knowledge Organization Systems other than thesauri, but it's clear that facilitating thesauri on the Web was an organizing principle for the development of SKOS.

It is a mystery to me why standards like ISO 25964-1 and ISO 25964-2 are hidden behind a huge paywall.  My library didn't have them, and at a cost of about $400 for the two parts, I certainly wasn't going to pay for them myself.  I managed to borrow them via Interlibrary Loan and was able to peer into their secrets.

Section 4.1 of Part 1 describes thesauri like this: "The traditional aim of a thesaurus is to guide the indexer and the searcher to choose the same term for the same concept.   In order to achieve this, a thesaurus should first list all of the concepts that might be useful for retrieval purposes in a given domain.  The concepts are represented by terms ... . Secondly, a thesaurus should present the preferred terms in such a way that people will easily identify the one(s) they need."  Section 3.86 of Part 2 includes the note "... a thesaurus is optimized for human navigability and terminological coverage of a domain".  From this, I conclude that thesauri are all about human searching based on human-readable terms or strings.

In contrast, section 21.1 of Part 2 describes ontologies like this (section 21.1.1): "... 'ontology' is often interpreted as the use of a formal language to set out a formalized representation of a domain of knowledge.  Among other tasks, this enables the consistency of knowledge assertions (facts) to be checked against the ontology, and possibly new ones to be inferred.  An ontology and a set of facts (assertions about individuals) together form a knowledge base.  One of the fundamental purposes of an ontology is reasoning, including generic tasks such as: inferring class membership for individuals, inferring relationships between classes and properties, and checking the consistency of a knowledge base."   It goes on to say (section 21.1.2) "Whereas the role of most of the vocabularies described in this part of ISO 25964 is to guide the selection of search/indexing terms, or the browsing of organized document collections, the purpose of ontologies in the context of retrieval is different.  Ontologies are not designed for information retrieval by index terms or class notation, but for making assertions about individuals, e.g. about real persons or abstract things such as a process."  In section 21.1.3: "More recently, the term 'lightweight ontology' has been employed in some Semantic Web literature to cover all sorts of structured vocabularies and knowledge organization systems, including thesauri, classification schemes, etc.  This terminology is not employed in this part of ISO 25964, since the blurring of distinctions entailed in the loose use of the term is considered unhelpful."  Finally (section 21.3) "the concepts of a thesaurus and the classes of an ontology represent meaning in two fundamentally different ways.  Thesauri express the meaning of a concept through terms, supported by adjuncts such as a hierarchy, associated concepts, qualifiers, scope notes and/or a precise definition, all directed mainly to human users.  Ontologies, in contrast, convey the meaning of classes through machine-readable membership conditions."

From my reading of ISO 25964, I conclude that thesauri are focused on information retrieval by humans using terms, and ontologies are focused on reasoning related to properties and classes by machines about real or abstract things.  They are two distinct kinds of vocabularies with different purposes, and I don't think that it is helpful to mix up the two by declaring the same resource to be both a class and a concept.

Although the SKOS Recommendation is a little less explicit about the definition of thesauri and ontologies, this distinction is implicit in section 1.3 of the SKOS Recommendation.  That section says:

"The elements of the SKOS data model are classes and properties, and the structure and integrity of the data model is defined by the logical characteristics of, and interdependencies between, those classes and properties. This is perhaps one of the most powerful and yet potentially confusing aspects of SKOS, because SKOS can, in more advanced applications, also be used side-by-side with OWL to express and exchange knowledge about a domain. However, SKOS is not a formal knowledge representation language.

"To understand this distinction, consider that the 'knowledge' made explicit in a formal ontology is expressed as sets of axioms and facts. A thesaurus or classification scheme is of a completely different nature, and does not assert any axioms or facts. Rather, a thesaurus or classification scheme identifies and describes, through natural language and other informal means, a set of distinct ideas or meanings, which are sometimes conveniently referred to as 'concepts'. These 'concepts' may also be arranged and organized into various structures, most commonly hierarchies and association networks. These structures, however, do not have any formal semantics, and cannot be reliably interpreted as either formal axioms or facts about the world. Indeed they were never intended to be so, for they serve only to provide a convenient and intuitive map of some subject domain, which can then be used as an aid to organizing and finding objects, such as documents, which are relevant to that domain."

I encourage you to read the section in its entirety.  The take-home message that I get from section 1.3 is similar to what I got from ISO 25964:  concepts are intended for humans to use to organize and find things like records and documents.  They are not intended to be used for expression of axioms and facts about the world as are classes.

Now that I've written all of that, I'd like to suggest a simple answer to the question I posed at the beginning of this section: does it violate conventions to assert

dwc:Organism a rdfs:Class,skos:Concept.

?  My conclusion is "yes".  The statement

dwc:Organism a rdfs:Class.

comes from the Darwin Core vocabulary, which is all about defining classes and properties, and as such it could be classified as an ontology by the definition of ISO 25964.  As such, mixing in

dwc:Organism a skos:Concept.

does not follow the conventions of ISO 25964 and the SKOS Recommendation that concepts aren't intended to be used to represent knowledge.

Thesauri vs. ontologies

There is one "can you do that?" question left from my list:

Question: Does making that assertion facilitate meeting one of your use cases?
Answer: This is one of the most important questions, but also one of the most difficult to answer because we can't know the answer without knowing the use cases.  We can't know the use cases unless we've examined the context of why we are constructing the vocabulary that contains the assertions.

In the case of our little graph

dwc:Organism a rdfs:Class.
dwc:Organism a skos:Concept.

that originated from the TDWG terms browser (http://terms.tdwg.org), I already confessed in my previous post that I have no idea of the reason for the creation of most of the triples there.  So in that context, I can't give an answer to this question.

To flesh out the circumstances under which one might choose to model something as a class vs. a concept, I'm going to illustrate with an example from my area of interest, biodiversity informatics.

In biodiversity informatics, we track occurrences of organisms, and one important aspect of that is knowing about the evidence that a particular kind of organism occurred at some place during some time.  A traditional form of evidence is a specimen, which is either the organism itself (dead or alive), or some piece of the organism, or some remnant of the organism (like a fossil or droppings).  That specimen is a sort of voucher that any doubter can check to make sure that the recorder of the occurrence didn't make a mistake.  So one important bit of information to keep track of about the occurrence is the kind of specimen that was collected.

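Here is a diagram showing how we could model the relationships between various categories of specimens, with the broadest category at the top and narrowest categories at the bottom.

[Diagram: Specimen at the top; Preserved Specimen, Fossil Specimen, and Living Specimen below it; Specimen In Alcohol, Pressed Specimen, and Pinned Specimen below Preserved Specimen.]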

There are two ways that we could construct this model in RDF: build an ontology, or build a thesaurus.  Let's start with building an ontology.

My ontology

Here's one way that I could describe the relationships in the diagram:

@prefix my: <http://example.org/my/>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix owl: <http://www.w3.org/2002/07/owl#>.

my:Ontology a owl:Ontology.
my:Specimen a rdfs:Class;
            rdfs:label "Specimen"@en;
            rdfs:isDefinedBy my:Ontology.
my:PreservedSpecimen a rdfs:Class;
                     rdfs:label "Preserved Specimen"@en;
                     rdfs:subClassOf my:Specimen;
                     rdfs:isDefinedBy my:Ontology.
my:FossilSpecimen a rdfs:Class;
                     rdfs:label "Fossil Specimen"@en;
                     rdfs:subClassOf my:Specimen;
                     rdfs:isDefinedBy my:Ontology.
my:LivingSpecimen a rdfs:Class;
                     rdfs:label "Living Specimen"@en;
                     rdfs:subClassOf my:Specimen;
                     rdfs:isDefinedBy my:Ontology.
my:SpecimenInAlcohol a rdfs:Class;
                     rdfs:label "Specimen In Alcohol"@en;
                     rdfs:subClassOf my:PreservedSpecimen;
                     rdfs:isDefinedBy my:Ontology.
my:PressedSpecimen a rdfs:Class;
                     rdfs:label "Pressed Specimen"@en;
                     rdfs:subClassOf my:PreservedSpecimen;
                     rdfs:isDefinedBy my:Ontology.
my:PinnedSpecimen a rdfs:Class;
                     rdfs:label "Pinned Specimen"@en;
                     rdfs:subClassOf my:PreservedSpecimen;
                     rdfs:isDefinedBy my:Ontology.

Now if I wanted to describe the evidence for an occurrence in some data, I could assert this:

:occurrence <http://purl.obolibrary.org/obo/RO_0002558> :voucher.
:voucher a my:PressedSpecimen.

where <http://purl.obolibrary.org/obo/RO_0002558> is the "has evidence" relationship from the Relations Ontology.[2]  My ontology would then entail that the following triples are also true:

:voucher a my:PreservedSpecimen.
:voucher a my:Specimen.

because of the rdfs:subClassOf statements in the ontology.
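
As a rough illustration of how a client might derive those entailed triples, the following Python sketch follows rdfs:subClassOf links upward from the asserted type. It is a simplified stand-in for RDFS subclass entailment, not a full reasoner; the tuple representation of triples and the function name entailed_types are assumptions made for this example.

```python
# Triples are (subject, predicate, object) tuples of prefixed names;
# this representation is an assumption made for illustration only.
def entailed_types(triples, resource):
    """Return every class of `resource`, following rdfs:subClassOf upward."""
    superclasses = {}
    for s, p, o in triples:
        if p == "rdfs:subClassOf":
            superclasses.setdefault(s, set()).add(o)

    found = set()
    # Start from the asserted types, then walk up the class hierarchy.
    todo = [o for s, p, o in triples if s == resource and p == "rdf:type"]
    while todo:
        cls = todo.pop()
        if cls not in found:
            found.add(cls)
            todo.extend(superclasses.get(cls, ()))
    return found


ontology = [
    ("my:PreservedSpecimen", "rdfs:subClassOf", "my:Specimen"),
    ("my:PressedSpecimen", "rdfs:subClassOf", "my:PreservedSpecimen"),
]
data = [(":voucher", "rdf:type", "my:PressedSpecimen")]

voucher_types = entailed_types(ontology + data, ":voucher")
print(sorted(voucher_types))
```

The result contains the asserted class my:PressedSpecimen plus the two entailed classes, my:PreservedSpecimen and my:Specimen.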

If I were a fan of SKOS, I might be tempted to assert this kind of thing in my ontology: 

my:PreservedSpecimen a rdfs:Class;
                     skos:prefLabel "Preserved Specimen"@en;
                     skos:broader my:Specimen.

rather than using rdfs:subClassOf.  However, using the SKOS vocabulary entails some things that I might not intend.  Using skos:prefLabel is fine because it has no domain declaration; anything can have a preferred label. But using skos:broader entails that my:PreservedSpecimen and my:Specimen are instances of skos:Concept because skos:broader is a subproperty of skos:semanticRelation, which has both range and domain of skos:Concept.[3]  So using the skos:broader property creates all of the issues that I discussed above related to declaring something to be both a class and a concept.  If I'm interested in using SKOS, I'd be better off constructing a thesaurus.
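
The entailment described above can be sketched in a few lines of Python. This is a simplified illustration of the RDFS subproperty, domain, and range rules, not a complete reasoner. The schema list reproduces only the handful of SKOS statements relevant here (in the SKOS Reference, skos:broader reaches skos:semanticRelation by way of skos:broaderTransitive); the tuple representation and the function name infer_types_from_domain_range are assumptions made for this example.

```python
# The few SKOS schema statements needed for this illustration.
SKOS_SCHEMA = [
    ("skos:broader", "rdfs:subPropertyOf", "skos:broaderTransitive"),
    ("skos:broaderTransitive", "rdfs:subPropertyOf", "skos:semanticRelation"),
    ("skos:semanticRelation", "rdfs:domain", "skos:Concept"),
    ("skos:semanticRelation", "rdfs:range", "skos:Concept"),
]


def infer_types_from_domain_range(schema, data):
    """Return rdf:type triples entailed by domain and range declarations."""
    super_props, domains, ranges = {}, {}, {}
    for s, p, o in schema:
        if p == "rdfs:subPropertyOf":
            super_props.setdefault(s, set()).add(o)
        elif p == "rdfs:domain":
            domains.setdefault(s, set()).add(o)
        elif p == "rdfs:range":
            ranges.setdefault(s, set()).add(o)

    inferred = set()
    for s, p, o in data:
        # Collect the predicate together with all of its superproperties.
        props, todo = set(), [p]
        while todo:
            q = todo.pop()
            if q not in props:
                props.add(q)
                todo.extend(super_props.get(q, ()))
        for q in props:
            for cls in domains.get(q, ()):
                inferred.add((s, "rdf:type", cls))
            for cls in ranges.get(q, ()):
                inferred.add((o, "rdf:type", cls))
    return inferred


data = [
    ("my:PreservedSpecimen", "rdf:type", "rdfs:Class"),
    ("my:PreservedSpecimen", "skos:broader", "my:Specimen"),
]

inferred = infer_types_from_domain_range(SKOS_SCHEMA, data)
print(sorted(inferred))
```

Both my:PreservedSpecimen and my:Specimen come out typed as skos:Concept, even though the data never says so explicitly.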

My thesaurus

Here's how I could describe the same relationships as a thesaurus:

@prefix my: <http://example.org/my/>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.

my:Scheme a skos:ConceptScheme.
my:Specimen a skos:Concept;
            skos:prefLabel "Specimen"@en;
            skos:inScheme my:Scheme.
my:PreservedSpecimen a skos:Concept;
                     skos:prefLabel "Preserved Specimen"@en;
                     skos:broader my:Specimen;
                     skos:inScheme my:Scheme.
my:FossilSpecimen a skos:Concept;
                     skos:prefLabel "Fossil Specimen"@en;
                     skos:broader my:Specimen;
                     skos:inScheme my:Scheme.
my:LivingSpecimen a skos:Concept;
                     skos:prefLabel "Living Specimen"@en;
                     skos:broader my:Specimen;
                     skos:inScheme my:Scheme.
my:SpecimenInAlcohol a skos:Concept;
                     skos:prefLabel "Specimen In Alcohol"@en;
                     skos:broader my:PreservedSpecimen;
                     skos:inScheme my:Scheme.
my:PressedSpecimen a skos:Concept;
                     skos:prefLabel "Pressed Specimen"@en;
                     skos:broader my:PreservedSpecimen;
                     skos:inScheme my:Scheme.
my:PinnedSpecimen a skos:Concept;
                     skos:prefLabel "Pinned Specimen"@en;
                     skos:broader my:PreservedSpecimen;
                     skos:inScheme my:Scheme.

Now if I wanted to describe the evidence for an occurrence, I could assert this:

:occurrence <http://purl.obolibrary.org/obo/RO_0002558> my:PressedSpecimen.

where, as before, <http://purl.obolibrary.org/obo/RO_0002558> is the "has evidence" relationship from the Relations Ontology.[2]

Superficially, this looks a lot like the ontology that I made before.  Each of the concepts that I defined is part of a concept scheme (my:Scheme) in the same way that each of the classes that I defined was part of an ontology (my:Ontology).  By using skos:broader, my thesaurus encodes information about the hierarchical relationships among the concepts, just as the ontology encodes the same relationships using rdfs:subClassOf.  However, unlike the ontology, the thesaurus does not automatically entail additional relationships like

:occurrence <http://purl.obolibrary.org/obo/RO_0002558> my:PreservedSpecimen.
:occurrence <http://purl.obolibrary.org/obo/RO_0002558> my:Specimen.

from the expressed hierarchy.

There are several ways I could use SPARQL to "discover" those kinds of additional relationships.  If my endpoint had reasoning capabilities like Stardog, I could load the SKOS vocabulary as the schema to use for reasoning (Tbox), then flip the reasoning switch on.  I could then discover other broader concepts by executing this query:

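# The default prefix is a hypothetical namespace for the example data.
PREFIX : <http://example.org/data/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?entailedConcept WHERE {
    :occurrence <http://purl.obolibrary.org/obo/RO_0002558> ?assertedConcept.
    ?assertedConcept skos:broaderTransitive ?entailedConcept.
}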

skos:broader is a subproperty of skos:broaderTransitive, and skos:broaderTransitive is a transitive property, so with reasoning enabled, the broader concepts my:PreservedSpecimen and my:Specimen would be returned by the query.  If the thesaurus were consistently written [4], one could find the same information using property paths with no reasoning turned on, as in this query:

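# The default prefix is a hypothetical namespace for the example data.
PREFIX : <http://example.org/data/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?entailedConcept WHERE {
    :occurrence <http://purl.obolibrary.org/obo/RO_0002558> ?assertedConcept.
    ?assertedConcept skos:broader+ ?entailedConcept.
}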
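
To show what the skos:broader+ property path does, and why footnote [4] warns about thesauri that mix skos:broader and skos:narrower, here is a small Python sketch that mimics the path by following only skos:broader links. The tuple representation of triples and the function name broader_plus are assumptions made for this example; a real SPARQL endpoint evaluates the path itself.

```python
# Triples are (subject, predicate, object) tuples of prefixed names;
# this representation is an assumption made for illustration only.
def broader_plus(triples, concept):
    """Mimic skos:broader+ by following one or more skos:broader links."""
    broader = {}
    for s, p, o in triples:
        if p == "skos:broader":
            broader.setdefault(s, set()).add(o)

    reached = set()
    todo = list(broader.get(concept, ()))
    while todo:
        c = todo.pop()
        if c not in reached:
            reached.add(c)
            todo.extend(broader.get(c, ()))
    return reached


# A thesaurus that expresses the hierarchy only with skos:broader.
consistent = [
    ("my:PressedSpecimen", "skos:broader", "my:PreservedSpecimen"),
    ("my:PreservedSpecimen", "skos:broader", "my:Specimen"),
]
# The same hierarchy, but with one link stated as skos:narrower instead.
mixed = [
    ("my:PressedSpecimen", "skos:broader", "my:PreservedSpecimen"),
    ("my:Specimen", "skos:narrower", "my:PreservedSpecimen"),
]

from_consistent = broader_plus(consistent, "my:PressedSpecimen")
from_mixed = broader_plus(mixed, "my:PressedSpecimen")
print(sorted(from_consistent))
print(sorted(from_mixed))
```

With the consistently written thesaurus, both my:PreservedSpecimen and my:Specimen are reached. With the mixed version, the walk stops at my:PreservedSpecimen because the link to my:Specimen is only stated in the skos:narrower direction, which a plain skos:broader+ path does not follow.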

Which is better, an ontology or a thesaurus?

Clearly, it would be possible to encode the desired relationships using either an ontology or a thesaurus.  However, each option has pros and cons depending on what I want to accomplish.

Semantic constraints:
ontology - The classes could be defined using many more semantic constraints and relationships than I included in my examples.  That's good if one wishes to be more expressive.
thesaurus - There are few semantics that can be imposed beyond the broader/narrower relationships.  That's good if one wishes to avoid unnecessary or unforeseen complications.

Entailments:
ontology - The semantics expressed in the ontology entails triples automatically for any client that is programmed to "understand" RDFS or OWL.  That's good if you want to force those kinds of entailments on users.
thesaurus - The semantics expressed in the thesaurus will entail triples to the extent that the client "understands" RDFS and OWL and is programmed to make use of components of the SKOS vocabulary.  That's good if you want to allow developers to choose which entailments they care about.

Instantiation:
ontology - Using the ontology method as I illustrated it required instantiating voucher individuals.  This could be a good thing if I intend to make other statements about the properties of the vouchers, but could be a bad thing if I wasn't interested in keeping a separate database table about vouchers.
thesaurus - Using the thesaurus method as I illustrated it only required linking to a controlled vocabulary term.  I was neither required nor able to say more about the evidence that documented the occurrence other than the category into which the evidence fell.

Cost:
ontology - I chose an example where the ontology and thesaurus versions were artificially similar.  People who are into building ontologies generally make them a lot more complicated than this.  In the words of section 1.3 of the SKOS Recommendation, "... some person has to do the work of transforming the structure and intellectual content of a thesaurus or classification scheme into a set of formal axioms and facts. This work of transformation is both intellectually demanding and time consuming, and therefore costly."
thesaurus - If the terms of a controlled vocabulary are already established, it takes relatively little work to express them as a SKOS concept scheme.  In the words of section 1.3: "Much can be gained from using thesauri, etc., as-is, as informal, convenient structures for navigation within a subject domain. Using them as-is does not require any re-engineering and is therefore much less costly."

Clearly, there is no simple answer to the question of whether an ontology is better than a thesaurus.  One would need to do some careful thinking about the considerations listed above and the use cases that one wishes to satisfy.

Conclusions

"Can you do that?"

If you've read this far, you have hopefully reached the same conclusion as me: mixing ontologies and thesauri, and making statements like:

dwc:Organism a rdfs:Class,skos:Concept.
dwc:recordedBy a rdf:Property, skos:Concept.

is not a good idea.  You can do it, but probably should not.

How about mixing thesauri and data?

It is probably also not a good idea to mix thesauri and data.  There is an interesting comparison of the difference between modeling the creation of the concept of King Henry VIII and the creation of King Henry VIII here. The FOAF vocabulary provides a property that can be used to link things to concepts about things: foaf:focus.


SKOS and TDWG controlled vocabularies

In the Darwin Core there are a number of terms whose definition includes "recommended best practice is to use a controlled vocabulary" (as a source of values), but for which no standard controlled vocabulary has been established.  Examples include: dwc:sex, dwc:lifeStage, dwc:reproductiveCondition, dwc:behavior, dwc:establishmentMeans, dwc:occurrenceStatus, dwc:disposition, dwc:organismScope, dwc:taxonRank, dwc:taxonomicStatus, and dwc:measurementType.  Other than providing some suggestions for literal values that can be used with these properties, the standard has little to say about the nature of the controlled vocabularies that should be used with the terms.  TDWG also defines a standard category called "Data Standard" defined as "Specifies valid values in controlled vocabularies."  However, up to this point there have been no adopted standards falling into this category.  There are probably a number of reasons why, but I believe that one important reason is that TDWG hasn't figured out how to write such a standard in a robust and machine readable way.

One of the tasks of the Vocabulary Maintenance Specification Task Group (of which I'm the convener) is to complete a standards documentation specification that details how standards should be written so that they can be easily understood by both humans and machines.  Given what I've learned about SKOS and thesauri, I now believe that SKOS is probably the right model for describing machine-readable versions of controlled vocabularies, which are essentially a form of thesaurus.  I also now believe that SKOS is probably NOT the right model for defining other vocabularies that are essentially ontologies (sensu ISO 25964), although certain terms from SKOS (such as skos:prefLabel) may be fine for use in those vocabularies.

In my next post, I may write about my experimentation with writing controlled vocabularies using the SKOS model.

[1] SKOS Reference section 1.3.
[2] Whether using the "has evidence" relation as I have in these examples is a good idea or not is beyond the scope of this post.
[3] https://www.w3.org/TR/skos-reference/#L2251
[4] The query based on property paths would not work if the thesaurus used a combination of skos:broader and skos:narrower to define the hierarchy, whereas the query based on reasoning using the SKOS vocabulary as the schema would work.
