Tuesday, February 16, 2016

Linked Data (for real)

Linked Data Principles

In 2006, Tim Berners-Lee published four basic principles of Linked Data [1]:
  1. Use URIs as names for things
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL)
  4. Include links to other URIs, so that they can discover more things.
Although these principles refer to "people" and "someone", Linked Data is really about making it possible for machines to do the "looking up"; otherwise Linked Data wouldn't be any better than the regular Web.  This is such a cool idea - very simple and powerful, and it seems like it should have gotten immediate traction.  However, it's been ten years now since Tim B-L wrote these principles, and although there have been many ontologies written and demos created, we still aren't at the point where Linked Data has really caught on in a big way.  The lack of consensus URIs for naming things, the failure to expose metadata as RDF when a client dereferences a URI and asks for it, and the failure to link to data that's in somebody else's silo have greatly impeded progress towards implementing the Linked Data dream.  It's still way easier to just Google what you are looking for, and depend on Google's great ability to interpret text [2] to find the information for you, than to use some Linked Data service to find out what you want to know.

Nevertheless, now that ORCID IDs are coming into wide use in identifying people, and DOIs are widely used to identify publications, that's pretty significant progress towards #1 in the list.  Both ORCID and CrossRef (a major servicer of DOIs) will provide RDF/XML if you ask for it when you dereference their URIs, so there's #3.  And people who create ORCID records for themselves usually link to their publications by DOI if they can.  To a lesser (but growing) extent, authors of publications are linking the other way as well, by including their ORCID ID along with their names in the publication metadata.  So there's potential for #4, at least for a limited number of types of resources (people and publications).  

Ever since I learned that ORCID and CrossRef were providing RDF/XML, I wanted to see if I could do Linked Data "for real", i.e. start with the URI of something, find useful information in the form of RDF, and follow links to other URIs to discover more.  Being able to do this "in the wild", rather than in a single silo or with "toy" datasets would be really cool.

Using HTTP URIs to look up things

In my last blog post, I mentioned that our Semantic Web Working Group here at Vanderbilt has been working through the book Semantic Web for the Working Ontologist. As an exercise for that activity, I decided to look up information about people in our group who had ORCID IDs by dereferencing the ORCID URIs while requesting RDF/XML.  I've described how I did it elsewhere, so I won't repeat that here.  
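For readers who just want the gist of "dereferencing while requesting RDF/XML", here is a minimal Python sketch of HTTP content negotiation. It is not the code I used; the function names are made up for illustration, and it assumes the server honors an Accept header of application/rdf+xml, as ORCID and CrossRef did when I tried them.

```python
from urllib.request import Request, urlopen

# Media type for RDF/XML, sent in the Accept header.
RDF_XML = "application/rdf+xml"


def build_rdf_request(uri: str) -> Request:
    """Build a GET request that asks the server for RDF/XML instead of HTML."""
    return Request(uri, headers={"Accept": RDF_XML})


def fetch_rdf(uri: str, timeout: float = 30.0) -> bytes:
    """Dereference the URI and return the raw response body.

    Needs network access; urlopen follows HTTP redirects by default.
    """
    with urlopen(build_rdf_request(uri), timeout=timeout) as response:
        return response.read()
```

Calling fetch_rdf("http://orcid.org/0000-0003-0328-0792") would then return whatever serialization the server chooses to send back for that Accept header.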

<http://orcid.org/0000-0003-0328-0792>
        a                  foaf:Person ;
        foaf:name          "Clifford B. Anderson" ;
        foaf:page          <http://www.library.vanderbilt.edu/scholarly/> ;
        foaf:publications  <http://orcid.org/0000-0003-0328-0792#workspace-works> .

As I mentioned in my last post, the ORCID RDF uses mostly FOAF properties to describe people.  Those properties include basic ones like foaf:name and foaf:page, and I got excited to see that they also used the property foaf:publications, defined as "A link to the publications of this person".  Cool!  That's just what I wanted.  However, in the RDF, the link was to a URI that consisted of the ORCID ID with a "#workspace-works" fragment identifier.  The description of that object resource consisted of a single triple that asserted it was a foaf:Document (something that is already entailed by the range of foaf:publications).  So even though the human-readable web page that you get when you dereference an ORCID ID tells you the DOIs of publications created by the person, the RDF tells you nothing.  Upon further reading about foaf:publications, it isn't really the term we want anyway - it is used to link to "a Document listing (primarily in human-readable form) some publications associated with the Person", i.e. a human-readable Web page about publications, not to the dereferenceable URI of the publication itself.  For that, they should be using a predicate like foaf:made .

<http://dx.doi.org/10.1017/s003693060700364x> dcterms:creator <http://id.crossref.org/contributor/clifford-anderson-7gu43tj0rli3>;
                                              dcterms:date "2010-7-1"^^xsd:date;
                                              dcterms:isPartOf <http://id.crossref.org/issn/0036-9306>;
                                              dcterms:publisher "Cambridge University Press (CUP)";
                                              dcterms:title "Herman Bavinck, Reformed Dogmatics, vol. 3: Sin and Salvation in Christ, ed. John Bolt, trans. John Vriend (Grand Rapids: Baker Academic, 2006), pp. 685. $49.99.".

Perhaps I would have better luck if I started from the DOI side??  The RDF about the publication does have a link to the author, using dcterms:creator.  That's nice.  Alas, it's not the author's ORCID ID!  It is a CrossRef-minted ID.  Perhaps CrossRef has a unique identifier that it uses in preference to the ORCID ID?


Aaaaaack!!!!  Each of Cliff's seven publications uses a different URI for Cliff!  Do they dereference?

Nothing comes back using either a browser or when requesting RDF/XML.  These identifiers are completely ad hoc and useless - they might as well be blank nodes!  Well, this pretty much shoots Tim B-L's Linked Data Principle #4 out of the water with respect to linking authors and their publications in either direction.

<http://orcid.org/0000-0003-0328-0792> foaf:made <http://dx.doi.org/10.11630/1550-4891.10.02.118>;
                                       owl:sameAs <http://viaf.org/viaf/168432349>.

Useful information?

Alright, if ORCID isn't going to assert the most basic links between people and DOI-identified publications, I will.  Anyone can say Anything about Anything, right?  In an effort to salvage this project, I created a small graph of triples that asserted that a person in our S.W.W.O. group foaf:made his or her publications.  For good measure, I asserted owl:sameAs VIAF identifiers when that relationship was true.  Now all I needed to do was load the RDF/XML files that I got from dereferencing the various URIs into Stardog, the graph database that we are playing around with in the group.  My original plan was to eventually build a little RDF scraper application that would retrieve the data for me using HTTP, and possibly load the triples directly into Stardog for me.  I described how I built a "toy" application like this in an earlier post.  But first, I tried loading the triples manually.  
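As a rough illustration of what those hand-made assertions look like, here is a small Python sketch that writes one Turtle statement per person. The helper name made_assertions is hypothetical (I actually wrote my triples by hand), and the output assumes the foaf: and owl: prefixes are declared elsewhere in the file. Note that Turtle uses a comma between several objects of the same predicate and a semicolon before a new predicate.

```python
def made_assertions(person, publications, same_as=None):
    """Return a Turtle statement asserting foaf:made for each publication.

    person       -- URI of the person (e.g. an ORCID URI)
    publications -- list of publication URIs (e.g. DOI URIs)
    same_as      -- optional URI for the same person (e.g. a VIAF URI)
    """
    if not publications:
        raise ValueError("need at least one publication URI")
    # Several objects of one predicate are separated by commas.
    made = ", ".join(f"<{uri}>" for uri in publications)
    predicates = [f"foaf:made {made}"]
    if same_as is not None:
        predicates.append(f"owl:sameAs <{same_as}>")
    # Different predicates for one subject are separated by semicolons.
    return f"<{person}> " + " ;\n    ".join(predicates) + " ."


print(made_assertions(
    "http://orcid.org/0000-0003-0328-0792",
    ["http://dx.doi.org/10.11630/1550-4891.10.02.118"],
    same_as="http://viaf.org/viaf/168432349",
))
```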

Ok, great.  Every time I tried to load a file containing triples retrieved by dereferencing the CrossRef DOIs, Stardog gave me an error message.  I tried running the triples through the W3C RDF validator, but the RDF/XML came back as valid.  So I had to resort to looking at the RDF serialization with my naked eyes.  There it was:

<http://dx.doi.org/10.1017/s003693060700364x> dcterms:date "2010-7-1"^^xsd:date.

CrossRef was serving malformed ISO 8601 dates that were datatyped as xsd:date, and Stardog was rightfully barking about that - single digit months and days don't work in the lexical space for xsd:date.  To get the triples to load, I had to manually type in the missing zeros in the months and days.  Grrr.  So there is no way for me to write a homemade RDF scraper that will automatically "look up" the DOIs and retrieve CrossRef's RDF and load it into Stardog without some ad hoc processing code to fix this error.  CrossRef, you flunk Linked Data principle #3!
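The kind of ad hoc processing code this would take might look something like the following Python sketch. It is not what I actually did (I typed the zeros in by hand), the function names are invented for illustration, and it assumes the dates show up as Turtle literals of the form "..."^^xsd:date with a four-digit year, a month, and a day.

```python
import re

# A date with a 4-digit year and a 1- or 2-digit month and day.
_LOOSE_DATE = re.compile(r"^(\d{4})-(\d{1,2})-(\d{1,2})$")

# A Turtle literal datatyped as xsd:date, e.g. "2010-7-1"^^xsd:date
_TYPED_DATE = re.compile(r'"([^"]*)"\^\^xsd:date')


def pad_xsd_date(lexical: str) -> str:
    """Zero-pad month and day so the string fits the xsd:date lexical space.

    Strings that are not year-month-day are returned unchanged.
    """
    match = _LOOSE_DATE.match(lexical.strip())
    if match is None:
        return lexical
    year, month, day = match.groups()
    return f"{year}-{int(month):02d}-{int(day):02d}"


def fix_turtle_dates(turtle: str) -> str:
    """Pad every xsd:date literal found in a chunk of Turtle text."""
    return _TYPED_DATE.sub(
        lambda m: f'"{pad_xsd_date(m.group(1))}"^^xsd:date', turtle
    )
```

This only repairs the padding; it does not check that the result is a real calendar date.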

Trying it out

Needless to say, this experience was less than satisfying and didn't boost my enthusiasm for doing Linked Data with RDF.  In order to make myself feel better, I loaded the cleaned-up scraped data and the assertions that I made into Stardog to play with.  Since I am only running Stardog on localhost on my desktop, I also loaded the graphs into the Vanderbilt Heard Library's triple store so that you could try the queries for yourself via the public SPARQL endpoint.  Here's a fun little SPARQL query that retrieves the names of the coauthors of every member of our Semantic Web working group who has an ORCID ID:

PREFIX  foaf: <http://xmlns.com/foaf/0.1/>
PREFIX  dcterms: <http://purl.org/dc/terms/>

SELECT ?name
FROM <http://rdf.library.vanderbilt.edu/swwg/assertions.ttl>
FROM <http://rdf.library.vanderbilt.edu/swwg/sww-group.rdf>
WHERE {
      ?s a foaf:Group.
      ?s foaf:member ?person.
      ?person foaf:made ?publication.
      ?publication dcterms:creator ?coauthors.
      ?coauthors foaf:name ?name.
}

You can paste it into the query box at the endpoint if you want to try it out.  As you can see, this query also finds the group members themselves in addition to their co-authors, and it lists some of the authors several times due to CrossRef's ad hoc minting of dcterms:creator objects, each of which has a foaf:name that isn't standardized in any way (look at how many redundant Cliff Anderson and Suellen Stringer-Hye names come up).
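If you would rather send a query from a script than paste it into a form, the SPARQL 1.1 Protocol lets you pass the query text as a "query" parameter on a GET request. Here is a minimal Python sketch of that. The endpoint URL is a placeholder rather than the library's real address, the function name is made up, and EXAMPLE_QUERY is just a stand-in query.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder only: substitute the address of a real SPARQL endpoint.
HYPOTHETICAL_ENDPOINT = "https://example.org/sparql"

# Stand-in query; any SELECT query string can be used instead.
EXAMPLE_QUERY = """\
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name WHERE { ?person foaf:name ?name } LIMIT 5
"""


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build a SPARQL Protocol GET request asking for JSON results."""
    # urlencode percent-encodes the query text for use in the URL.
    url = f"{endpoint}?{urlencode({'query': query})}"
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(HYPOTHETICAL_ENDPOINT, EXAMPLE_QUERY)
print(request.full_url)
```

Passing the request to urllib.request.urlopen would actually run the query; that step is left out here because the endpoint above is not real.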


My primary conclusion from this little exercise is that neither ORCID nor CrossRef is really serious about contributing to the RDF Linked Data effort.  The fact that I get an error when I try to load every CrossRef RDF/XML file into a triplestore tells me that nobody at CrossRef has ever actually tried to load one of their RDF/XML files into a triplestore (it's not just Stardog that balks at the malformed xsd:date datatyped literals, Callimachus does as well).  And since ORCID records DOIs for publications as part of their structured data, why in the world wouldn't they expose that information as RDF if they were really serious about supporting Linked Data as RDF?  There is little point in me trying to do "real" Linked Data with these metadata using a generic software client if I have to manually fix bad RDF or assert my own triples to make it work.  Sigh.

In my next blog post, I plan to write about using Stardog's built-in reasoning tools to make up for the fact that providers don't use consensus vocabularies in the metadata that they provide.

[2] Yeah, I know that Google is in on Schema.org and is using structured data to make their searches smarter.  But they are still getting a lot (most?) of their information from parsing text.


  1. Steve,

    Thanks for trying out Crossref's RDF. Shame you ran into problems. That sucks. Unfortunately, none of us at Crossref are really experts on RDF or linked data, and (shame) I have to say that while we are using a standard library to output RDF (JENA) it is true we haven't thoroughly tested our RDF against linked data tools.

    All that said though, a small number of people are using our RDF formats day to day.

    I'd like to address your two pain points for the Crossref sourced RDF (as far as I can tell):

    1) Malformed xsd:date

    That's pretty rubbish on our part. I'll get a fix out for that as soon as I can.

    2) Contributor IDs and Name Disambiguation

    Crossref does not and cannot disambiguate contributor names. Where publishers do not provide ORCIDs to Crossref, we leave our contributor names in an ambiguous state, each with their own contributor URI. Where publishers provide Crossref with ORCIDs you will instead see ORCID URIs as the resource referred to by our contributor predicates.

    We are seeing increasing numbers of ORCIDs deposited into Crossref from publishers - the linkage between Crossref DOIs and ORCIDs is increasing every day.

    I'd be happy to hear about any other issues you find with Crossref's RDF. Again, thanks for trying out RDF and highlighting issues with it. It is great to get some feedback.

    Karl Ward

    1. I've deployed a fix for the malformed xsd:date issue. Again, let me know if you find any other problems.

      Karl Ward

    2. That would be great if you would fix the malformed dates. That's really the main problem.

      It's also great that CrossRef uses ORCID IDs to identify the authors when they are available. I suppose in the examples I tried, the articles were too old, or the publishers don't make use of ORCID IDs. Hopefully exposing ORCID IDs will become a more common practice in the future for publishers.

      My personal opinion would be that it would probably be preferable to just leave the authors as blank nodes. You could still provide the various FOAF properties about the author, and people could query for them using string matching on the literal values. I feel like when one mints a URI, one takes on some kind of responsibility for maintaining metadata on that resource. ORCID is doing that for people, CrossRef isn't. So I'd just leave the nodes as anonymous. Just my opinion - others might disagree.