In this series of posts, I've been enjoying exploring what it's possible to "learn" using RDF data that are exposed "in the wild". Thus far, I've been pleased to find that it's relatively easy to dereference CrossRef DOIs, and ORCID and VIAF identifiers for people to get usable RDF. The current lack of links between publications and people is a bit disappointing, but that could get better over time. However, the lack of consistency in vocabulary use is counterproductive, and the way that CrossRef generates ad hoc identifiers for people rather than finding a way to link to existing well-known identifiers is also annoying.
One possible source of RDF in the wild that I've been interested in investigating is Semantic MediaWiki. In 2013, I was part of a Vocabulary Management Task Group of Biodiversity Information Standards (TDWG) that was investigating possible solutions to some long-term problems associated with managing TDWG vocabularies. I won't go into the details, since you can read the report here. One part of the report involved looking at the use of the Semantic MediaWiki (SMW) platform for developing and maintaining vocabularies. I wasn't involved much in writing that part of the report, and although I had experimented briefly with using an instance of SMW to create test pages for vocabulary terms, I really only saw the result in terms of the human-readable web page that it generated. One of the selling points of SMW is that when one generates a page, SMW generates an RDF representation of the information on the page. So in theory, SMW provides an easy-to-use system that generates RDF about vocabulary terms, and that would eliminate the need to build a separate system to generate the machine-readable component of the vocabulary. One selling point for TDWG and GBIF (which had an interest in the outcome of the report) was that SMW could provide support for multiple languages - an important feature for international informatics organizations. So my goal in the investigation that led to this post was to take a look at the RDF that SMW generates. What might that RDF be good for and what is the potential for integrating the data SMW provides with data from other sources?
Semantic MediaWiki ... can turn a wiki into a powerful and flexible knowledge management system. All data created within SMW can easily be published via the Semantic Web, allowing other systems to use this data seamlessly. https://www.semantic-mediawiki.org/
Getting RDF metadata from Semantic MediaWiki
The first thing I had to figure out was how to actually get RDF out of an SMW instance. I started at the main SMW page. It says "All data created within SMW can easily be published via the Semantic Web, allowing other systems to use this data seamlessly." Cool! Just what I wanted. Eventually I found my way to the Help:RDF export page. It said that I could get generated machine-readable documents in OWL/RDF via the "Special:ExportRDF" page. On that page, there is a box where you can paste in a page name to get the RDF. I tried pasting in the page name "Sites_using_Semantic_MediaWiki", which produced this RDF/XML document. So what did I get? First, I converted the 101 triples from XML to Turtle so that it would be easy to read.
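The conversion is easy to script; here is a minimal sketch using Python's rdflib (my choice of library; any RDF library would do):

from rdflib import Graph

# The Special:ExportRDF URL for the page examined in this post
url = ("https://www.semantic-mediawiki.org/wiki/"
       "Special:ExportRDF/Sites_using_Semantic_MediaWiki")

g = Graph()
g.parse(url, format="xml")           # the export is served as RDF/XML
print(len(g), "triples")             # 101 when I retrieved it
print(g.serialize(format="turtle"))

One important bit of information is this: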
<https://www.semantic-mediawiki.org/wiki/Special:ExportRDF/Sites_using_Semantic_MediaWiki>
swivt:creationDate "2016-03-04T11:30:48+01:00"^^xsd:dateTime;
a owl:Ontology;
owl:imports <http://semantic-mediawiki.org/swivt/1.0>.
This tells me something very interesting: SMW generates an OWL ontology on the fly from the page data. I can see this from the creation date given for the ontology, which was the time that I downloaded the file. This seems a bit odd to me, given the typical outlook on the use of ontologies. One can divide triples into two categories: schema (a.k.a. Tbox) and data (a.k.a. Abox). The schema contains the definitions of properties and classes, while the data consists of the information about instances of those classes and how they are linked by the defined properties. In the SMW triples, we see some typical stuff found in ontologies, schema-like stuff like:
swivt:page a owl:ObjectProperty.
In fact, about a third of the triples (31) are class or property declarations of this sort. But the document also contains descriptions of seven instances of the swivt:Subject class, data-like stuff. It isn't clear to me why the returned triples include the schema information along with the instance data, rather than the usual practice of letting the client discover (if necessary) the schema information by dereferencing the predicate and class URIs. For good measure, the generated ontology imports the ontology that defines the SWiVT (Semantic Wiki Vocabulary and Terminology) vocabulary, which is what a client would get by dereferencing any of the swivt: namespace terms.
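If you want to see the Tbox/Abox split for yourself, the schema-like triples are easy to partition out programmatically. A rough sketch, assuming the export has been saved locally as export.rdf (my file name) and counting any rdf:type triple whose object is a class or property type as schema:

from rdflib import Graph
from rdflib.namespace import OWL, RDF, RDFS

g = Graph()
g.parse("export.rdf", format="xml")  # the saved ExportRDF output

# Count type declarations of classes and properties as "schema"
SCHEMA_TYPES = {OWL.Class, RDFS.Class, RDF.Property, OWL.ObjectProperty,
                OWL.DatatypeProperty, OWL.AnnotationProperty}

schema = [(s, p, o) for s, p, o in g
          if p == RDF.type and o in SCHEMA_TYPES]
print(len(schema), "schema triples;", len(g) - len(schema), "others")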
I did a quick comparison of some of the properties defined in the returned document with properties defined in the SWiVT ontology. swivt:creationDate was defined in both places, although the returned document said it was an owl:DatatypeProperty while the SWiVT ontology said it was an owl:AnnotationProperty. The properties swivt:page, swivt:type, swivt:wikiNamespace, and swivt:wikiPageModificationDate were defined in both places, but the other four swivt: namespace properties were not found in the ontology and were defined in the document with no properties other than their type. There were also 16 classes or properties in the "wiki:" namespace that did not seem to be defined anywhere that I could find. I suppose that I was supposed to determine what they meant by interpreting their local names, such as "Property-3ANumber_of_talk_page_revisions", but that is not a particularly good practice.
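This kind of cross-check is also easy to automate: load both documents and list the rdf:type given for each swivt: term in each graph. A sketch, assuming the owl:imports URL above resolves to an RDF/XML version of the ontology:

from rdflib import Graph
from rdflib.namespace import RDF

export = Graph().parse("export.rdf", format="xml")
swivt = Graph().parse("http://semantic-mediawiki.org/swivt/1.0",
                      format="xml")

SWIVT_NS = "http://semantic-mediawiki.org/swivt/1.0#"
terms = {s for s, o in export.subject_objects(RDF.type)
         if str(s).startswith(SWIVT_NS)}
for term in sorted(terms):
    print(term)
    print("  export:  ", list(export.objects(term, RDF.type)))
    print("  ontology:", list(swivt.objects(term, RDF.type)))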
OK, enough of the gory details - let's cut to the chase. From what I can tell, there are two main functions of the data on this page. One is to record metadata about the master page, such as number of revisions, page creator, etc. This is done in a somewhat backwards manner:
wiki:Sites_using_Semantic_MediaWiki
swivt:page <https://www.semantic-mediawiki.org/wiki/Sites_using_Semantic_MediaWiki>;
swivt:wikiPageModificationDate "2015-10-20T07:20:36Z"^^xsd:dateTime;
wiki:Property-3ALanguage_code "en"^^xsd:string;
wiki:Property-3ANumber_of_revisions "38"^^xsd:double;
wiki:Property-3APage_creator wiki:User-3AKevin;
a swivt:Subject;
rdfs:label "Sites using Semantic MediaWiki".
(some properties omitted for brevity). The subject of the triples is a swivt:Subject, whereas the properties seem to actually be about the web page. Why not this:
<https://www.semantic-mediawiki.org/wiki/Sites_using_Semantic_MediaWiki>
dcterms:modified "2015-10-20T07:20:36Z"^^xsd:dateTime;
dc:language "en"^^xsd:string;
ex:revisions "38"^^xsd:double;
dcterms:creator wiki:User-3AKevin;
a foaf:Document;
rdfs:label "Sites using Semantic MediaWiki".
using (wherever possible) well-known Dublin Core terms?
The other function is to generate URIs for swivt:Subject instances and associate them with web pages and label strings, generally in different languages. For example:
wiki:Websites_die_Semantic_MediaWiki_einsetzen
wiki:Property-3AMaster_page wiki:Sites_using_Semantic_MediaWiki;
a swivt:Subject;
swivt:page <https://www.semantic-mediawiki.org/wiki/Websites_die_Semantic_MediaWiki_einsetzen>;
rdfs:label "Websites die Semantic MediaWiki einsetzen".
(again some properties omitted). The subject and wiki page URIs and the labels seem to be algorithmically generated from the same string, so there isn't a lot of new information that we are gaining about the "subjects" - we aren't even learning the language of the label. It seems like the same information could be expressed in a more straightforward manner by describing the multi-lingual pages using well-known properties:
<https://www.semantic-mediawiki.org/wiki/Websites_die_Semantic_MediaWiki_einsetzen>
dcterms:isVersionOf wiki:Sites_using_Semantic_MediaWiki;
a foaf:Document;
rdfs:label "Websites die Semantic MediaWiki einsetzen"@de.
If describing the subject of the page were important, the property dcterms:subject could be used with an object drawn from some standard subject thesaurus, rather than an ad hoc subject generated from a page title string.
Conclusions
- Generating an ontology on the fly when the actual goal is to expose assertional data seems odd.
- If the goal is to allow "other systems to use this data seamlessly", why does Semantic MediaWiki use a purpose-built vocabulary instead of well-known vocabularies such as Dublin Core, FOAF, or schema.org?
- Turning data about 7 web pages into an ontology and exposing a lot of what appear to be local "housekeeping" triples bloats the output to 101 triples. I would guess that the information that outside users cared about could probably be asserted in about 30 Dublin Core-based triples.
Getting RDF metadata from terms.tdwg.org
It was an informative exercise to get data from the Semantic MediaWiki site, but my real interest was in looking at the terms.tdwg.org instance of Semantic MediaWiki. It was basically set up to evaluate the recommendations of the Vocabulary Management Task Group that Semantic MediaWiki be used to help with the management of terms in TDWG vocabularies. The approach described on the Main Page is that class or property terms are described as concepts on separate pages. Those concepts can then be grouped into term vocabularies. The goal was to allow for collaborative development of vocabularies and to aid in the discovery and linking of biodiversity concepts.
So far an impressive amount of work has been put into this project. There are 9249 concepts defined so far, and some of them have translations into multiple languages. From a human-readable standpoint, it looks like the instance is a success since it appears that TDWG has created a system that is easy to navigate and can be maintained by the community (although since I haven't participated in helping to curate the pages I don't know how labor-intensive the editing and maintenance process is). However, what I really want to investigate here is what I would get if I were a machine trying to pull information from the site.
I decided to look at the page for dwc:identifiedBy, a fairly typical Darwin Core term that had translations into several languages. Conveniently, the human-readable page had a link I could click on to retrieve RDF/XML from an RDF feed. After I got the file, the first problem that I ran into was that the document wasn't valid RDF. The error was "The URI scheme is not valid." The offending lines were 21, 24, 172 and 190:
21 xmlns:sioc="discussion http://rdfs.org/sioc/ns#"
24 xmlns:vs="status http://www.w3.org/2003/06/sw-vocab-status/ns#">
172 <owl:ObjectProperty rdf:about="discussion http://rdfs.org/sioc/ns#has" />
190 <owl:DatatypeProperty rdf:about="status http://www.w3.org/2003/06/sw-vocab-status/ns#term" />
where the URIs had extra text in front of the "http://". I fixed this manually by deleting the offending characters, which left me with a valid document containing 162 triples.
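Hand-editing doesn't scale, so a client that scraped the feed routinely would want to repair the text before parsing. A minimal sketch (the regular expression is my guess at the damage pattern, based on the four lines above; identifiedBy.rdf is my name for the saved feed):

import re
from rdflib import Graph

with open("identifiedBy.rdf", encoding="utf-8") as f:
    xml = f.read()

# Strip the stray word before "http://" inside quoted attribute values,
# e.g. "discussion http://rdfs.org/sioc/ns#" -> "http://rdfs.org/sioc/ns#"
fixed = re.sub(r'"\w+ (http://)', r'"\1', xml)

g = Graph()
g.parse(data=fixed, format="xml")
print(len(g), "triples")  # 162 after the fix

Let's sort out what I got: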
123 triples that are more or less useless:
- 3 triples defining an ad hoc owl:Ontology for the page as in the SMW example.
- 41 triples declaring various properties and classes to be properties and classes.
- 58 swivt:specialProperty_ASK triples like
dwc:recordedBy swivt:specialProperty_ASK wiki:dwc-3ArecordedBy-23_QUERY07fe03a640c0221e3af48390f65e3b1a.
I can't figure out what the purpose of those is. When the object URI is dereferenced, it just leads to the same metadata that I already had for dwc:recordedBy.
- 20 triples devoted to swivt: and wiki: housekeeping properties that are mostly duplicated by other triples containing well-known Dublin Core and SKOS properties.
- 1 rather odd triple:
dwc:recordedBy owl:sameAs dwc:recordedBy.
9 relatively straightforward triples:
- 4 triples expressing vann:, sioc:, and vs: properties.
- 5 triples expressing rdfs:label and rdfs:isDefinedBy properties.
The remaining triples fall into three categories that I'm going to look at separately: Dublin Core properties, SKOS properties, and rdf:type declarations.
Dublin Core
These triples:
dwc:recordedBy dc:language "en"^^xsd:string,
"es"^^xsd:string,
"fr"^^xsd:string,
"ja"^^xsd:string,
"zh-Hans"^^xsd:string.
assert that the term dwc:recordedBy is expressed in five languages. I'm pretty sure that this is erroneous. The web page contains five languages, the preferred labels are expressed in five languages, and the definitions are expressed in five languages, but the term dwc:recordedBy itself is an abstract thing that doesn't really have a language.
The triples:
dwc:recordedBy dcterms:issued "2008/11/19"^^xsd:string;
dcterms:modified "2014/10/23"^^xsd:string.
would be better expressed as:
dwc:recordedBy dcterms:issued "2008-11-19"^^xsd:date;
dcterms:modified "2014-10-23"^^xsd:date.
if we wanted a consuming client to "understand" that they were dates.
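A consuming client could make that repair on ingest by rewriting the slashes and retyping the literals. A sketch, assuming the repaired feed has been saved as identifiedBy-fixed.rdf (my file name):

from rdflib import Graph, Literal
from rdflib.namespace import DCTERMS, XSD

g = Graph()
g.parse("identifiedBy-fixed.rdf", format="xml")

for prop in (DCTERMS.issued, DCTERMS.modified):
    for s, o in list(g.subject_objects(prop)):
        g.remove((s, prop, o))
        # "2008/11/19"^^xsd:string -> "2008-11-19"^^xsd:date
        g.add((s, prop, Literal(str(o).replace("/", "-"),
                                datatype=XSD.date)))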
SKOS
The triples containing skos:definition, skos:example, and skos:prefLabel properties provide values in English, Spanish, French, Japanese, and simplified Chinese. However, the triples are expressed in this form:
dwc:recordedBy skos:prefLabel "es:Registrado por"^^xsd:string.
This is non-standard usage - rather than expressing the object as a datatyped literal, it should be expressed as a language tagged literal, like this:
dwc:recordedBy skos:prefLabel "Registrado por"@es.
For the skos:definition and skos:example triples, this just means that a generic client is going to be clueless about how to interpret the multilingual values. Ad hoc programming would be required to parse out the language tag from the literal value.
The skos:prefLabel triples are more problematic. Integrity condition S14 of the SKOS specification says that "A resource has no more than one value of skos:prefLabel per [optional] language tag." Since the provided skos:prefLabel values don't have language tags, the RDF as delivered is not consistent with the SKOS data model, since more than one untagged value is present.
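For what it's worth, the ad hoc parsing isn't hard: split off the leading language code and re-assert the value as a language-tagged literal, which would also cure the S14 problem. A sketch (the tag pattern is my assumption about the literal format):

import re
from rdflib import Graph, Literal
from rdflib.namespace import SKOS

g = Graph()
g.parse("identifiedBy-fixed.rdf", format="xml")

# Matches literals like "es:Registrado por" or "zh-Hans:..."
TAG = re.compile(r"^([a-z]{2,3}(?:-[A-Za-z]+)?):(.*)$", re.DOTALL)

for prop in (SKOS.prefLabel, SKOS.definition, SKOS.example):
    for s, o in list(g.subject_objects(prop)):
        m = TAG.match(str(o))
        if m:
            g.remove((s, prop, o))
            g.add((s, prop, Literal(m.group(2).strip(),
                                    lang=m.group(1))))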
rdf:type
There are four triples that specify the class of which dwc:recordedBy is an instance:
dwc:recordedBy a swivt:Subject,
wiki:Class-3ADarwin_Core,
rdf:Property,
skos:Concept.
The first two types assert membership in classes that are not widely known, so I consider them mostly harmless. The third declaration is the conventional one for a Darwin Core term, so it seems unproblematic:
dwc:recordedBy a rdf:Property.
I'm not sure about the last declaration:
dwc:recordedBy a skos:Concept.
I have always been a bit fuzzy on what sorts of things should be described by SKOS. I'm going to save talking about this issue until a later post.
Score Card
By my reckoning, I would break the 162 triples down like this:
- Triples that are more or less useless: about 125/162 or 77%
- Triples that have some kind of serious problem: about 20/162 or 12%
- Triples that could be used by a machine as is: about 17/162 or 10%
Conclusions
The relatively small number of useful triples and the fact that the feed supplies invalid RDF mean that, with the current state of terms.tdwg.org, a generic semantic client could not scrape useful metadata from the site. This is a similar situation to what I discovered when trying to retrieve RDF about DOIs from CrossRef and got malformed RDF.[1] It tells me the same thing: so far nobody is actually using the machine-readable data that terms.tdwg.org puts out. On the positive side, the 12% of triples that weren't datatyped correctly, that lacked proper language tags, etc. could be fixed (and probably should be).
However, I have to say that I'm somewhat mystified about Semantic MediaWiki's approach of generating a new owl:Ontology every time somebody retrieves data from a page. I realize that in RDF Anyone can say Anything about Anything, but why would you want to say that? I guess that whether this makes sense depends on whether one intends for the generated RDF to be the authoritative source of machine-readable information about a term or not. I think of an ontology as a means for expressing terminological (schema; Tbox) information, and not for expressing mutable assertional data. At least in the case of Darwin Core, the authoritative RDF is at http://rs.tdwg.org/dwc/terms/, where it is stable and relatively immutable. terms.tdwg.org is a place for discussion and the creation of non-normative translations. In that context, generating an ontology for every wiki page seems unproductive.
If I were developing a client that routinely scraped the site for new translations, I would also be annoyed by the 3/4 of triples that are of little or no value to me. Of course, I could screen them out and only store the ones that I cared about. But that moves me to the position of having to create a purpose-built client. A Semantic Web "true believer's client" that traversed the web of data looking for new knowledge would get bloated with useless triples.
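The screening itself is simple enough; a purpose-built client might keep only predicates from namespaces it recognizes, something like this (the namespace list is my choice):

from rdflib import Graph

# Namespaces whose predicates I actually care about
KEEP = ("http://purl.org/dc/",
        "http://www.w3.org/2004/02/skos/core#",
        "http://www.w3.org/2000/01/rdf-schema#")

g = Graph()
g.parse("identifiedBy-fixed.rdf", format="xml")

useful = Graph()
for s, p, o in g:
    if str(p).startswith(KEEP):
        useful.add((s, p, o))
print("kept", len(useful), "of", len(g), "triples")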
"I can't believe I ate that whole thing."
1972 American television commercial
[1] They fixed that problem very quickly - impressive!
For the terms.tdwg.org Semantic MediaWiki instance, an XSLT was developed to take the RDF output of all pages/terms within a scheme (e.g., Darwin Core) and output "clean" RDF and SKOS for the scheme as a whole. See, e.g., the Darwin Core page at http://terms.tdwg.org/wiki/Darwin_Core and its "Export SKOS" and "Export RDF" links at the top right. Caveat: the Darwin Core presented there is not the latest version.
Good to know about that export feature. A few comments about what you get from the export.
1. The output doesn't seem to be valid RDF. There were a few errors in the XML, which I found by pasting it into the W3C RDF validator.
2. After getting the RDF to validate by manually fixing the errors, I realized that none of the triples actually have subject IRIs. The subjects are actually all blank nodes. The problem may be that the XML attribute used for the rdf:Description elements is "rdf:value" instead of "rdf:about". Changing that would give the terms actual subject IRIs.
3. However, if that were done, the subject IRIs for all of the terms aren't the actual term IRIs. For example, the subject IRI for dwc:genus is given as "http://terminology-sandbox.biowikifarm.net/wiki/Special:URIResolver/dwc-3Agenus", but it should be "http://rs.tdwg.org/dwc/terms/genus".
4. Assuming that all of this could be worked out to be valid RDF with the correct subject IRIs, the exported RDF still doesn't give me the thing I'd most like to get from the Semantic MediaWiki interface: all of the translations into languages other than English. I can get the basic term information in RDF by just dereferencing the terms and obtaining the dwcterms.rdf file. But I can't get the translations, and that is what would be extremely valuable!
In theory, a client dereferencing a Darwin Core term could get the dwcterms.rdf document and from the term description there, follow its nose to the terms wiki (the links are in the DwC Quick Reference guide, but not in the dwcterms.rdf RDF, at least yet). From the terms wiki, the client could discover all of the translations, if the Semantic MediaWiki instance were set up to allow that. That would be super cool.