Tuesday, February 16, 2016

Linked Data (for real)

Linked Data Principles

In 2006, Tim Berners-Lee published four basic principles of Linked Data [1]:
  1. Use URIs as names for things.
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL).
  4. Include links to other URIs, so that they can discover more things.
Although these principles refer to "people" and "someone", Linked Data is really about making it possible for machines to do the "looking up"; otherwise Linked Data wouldn't be any better than the regular Web.  This is such a cool idea - very simple and powerful - and it seems like it should have gotten immediate traction.  However, it has been ten years since Tim B-L wrote these principles, and although many ontologies have been written and demos created, Linked Data still hasn't really caught on in a big way.  Progress toward the Linked Data dream has been greatly impeded by the lack of consensus URIs for naming things, by the failure to expose metadata as RDF when a URI is dereferenced with a request for it, and by the failure to link to data that's in somebody else's silo.  It's still way easier to just Google what you are looking for and depend on Google's great ability to interpret text [2], rather than trying to use some Linked Data service to find out what you want to know.

Nevertheless, now that ORCID IDs are coming into wide use in identifying people, and DOIs are widely used to identify publications, that's pretty significant progress towards #1 in the list.  Both ORCID and CrossRef (a major servicer of DOIs) will provide RDF/XML if you ask for it when you dereference their URIs, so there's #3.  And people who create ORCID records for themselves usually link to their publications by DOI if they can.  To a lesser (but growing) extent, authors of publications are linking the other way as well, by including their ORCID ID along with their names in the publication metadata.  So there's potential for #4, at least for a limited number of types of resources (people and publications).  

Ever since I learned that ORCID and CrossRef were providing RDF/XML, I wanted to see if I could do Linked Data "for real", i.e. start with the URI of something, find useful information in the form of RDF, and follow links to other URIs to discover more.  Being able to do this "in the wild", rather than in a single silo or with "toy" datasets would be really cool.


Using HTTP URIs to look up things

In my last blog post, I mentioned that our Semantic Web Working Group here at Vanderbilt has been working through the book Semantic Web for the Working Ontologist. As an exercise for that activity, I decided to look up information about people in our group who had ORCID IDs by dereferencing the ORCID URIs while requesting RDF/XML.  I've described how I did it elsewhere, so I won't repeat that here.  
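In a nutshell, though, it amounts to HTTP content negotiation: dereference the ORCID URI while sending an Accept header that requests RDF/XML.  Here is a minimal sketch in Python (standard library only; that ORCID honors the application/rdf+xml media type is what I found at the time, not a guarantee):

```python
from urllib import request

def rdf_request(uri):
    """Build an HTTP request asking for RDF/XML instead of the
    default human-readable HTML page."""
    return request.Request(uri, headers={"Accept": "application/rdf+xml"})

req = rdf_request("http://orcid.org/0000-0003-0328-0792")
# To actually dereference (requires network access):
# with request.urlopen(req) as response:
#     rdf_xml = response.read().decode("utf-8")
```

Here, for example, is the core of what came back for one group member, converted to Turtle: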

<http://orcid.org/0000-0003-0328-0792>
        a                  foaf:Person ;
        foaf:name          "Clifford B. Anderson" ;
        foaf:page          <http://www.library.vanderbilt.edu/scholarly/> ;
        foaf:publications  <http://orcid.org/0000-0003-0328-0792#workspace-works> .

As I mentioned in my last post, the ORCID RDF uses mostly FOAF properties to describe people.  Those properties include basic ones like foaf:name and foaf:page, and I got excited to see that they also used the property foaf:publications, defined as "A link to the publications of this person".  Cool!  That's just what I wanted.  However, in the RDF, the link was to a URI that consisted of the ORCID ID with a "#workspace-works" fragment identifier.  The description of that resource consisted of a single triple asserting that it was a foaf:Document (something that is already entailed by the range of foaf:publications).  So even though the human-readable web page that you get when you dereference an ORCID ID tells you the DOIs of publications created by the person, the RDF tells you nothing.  Upon further reading about foaf:publications, it isn't really the term we want anyway - it is used to link to "a Document listing (primarily in human-readable form) some publications associated with the Person", i.e. a human-readable Web page about publications, not the dereferenceable URI of a publication itself.  For that, they should be using a predicate like foaf:made.

<http://dx.doi.org/10.1017/s003693060700364x> dcterms:creator <http://id.crossref.org/contributor/clifford-anderson-7gu43tj0rli3>;
                                              dcterms:date "2010-7-1"^^xsd:date;
                                              dcterms:isPartOf <http://id.crossref.org/issn/0036-9306>;
                                              dcterms:publisher "Cambridge University Press (CUP)";
                                              dcterms:title "Herman Bavinck, Reformed Dogmatics, vol. 3: Sin and Salvation in Christ, ed. John Bolt, trans. John Vriend (Grand Rapids: Baker Academic, 2006), pp. 685. $49.99.".

Perhaps I would have better luck if I started from the DOI side?  The RDF about the publication does have a link to the author, using dcterms:creator.  That's nice.  Alas, it's not the author's ORCID ID!  It's a CrossRef-minted ID.  Perhaps CrossRef has a single unique identifier that it uses in preference to the ORCID ID?

<http://id.crossref.org/contributor/c-b-anderson-j4njogcbvxbq>
<http://id.crossref.org/contributor/clifford-anderson-4jtvqwphdh7v>
<http://id.crossref.org/contributor/clifford-anderson-7gu43tj0rli3>
<http://id.crossref.org/contributor/clifford-b-anderson-21b9xhi0bciaf>
<http://id.crossref.org/contributor/clifford-b-anderson-2l6i8wstlb2xy>
<http://id.crossref.org/contributor/clifford-b-anderson-e27ih81bavcc>
<http://id.crossref.org/contributor/clifford-b-anderson-zey8n8gvil0>

Aaaaaack!!!!  Each of Cliff's seven publications uses a different URI for Cliff!  Do they dereference?


Nothing comes back, either in a browser or when requesting RDF/XML.  These identifiers are completely ad hoc and useless - they might as well be blank nodes!  Well, this pretty much shoots Tim B-L's Linked Data Principle #4 out of the water with respect to linking authors and their publications in either direction.

<http://orcid.org/0000-0003-0328-0792> foaf:made <http://dx.doi.org/10.11630/1550-4891.10.02.118>,
                                                 <http://dx.doi.org/10.4242/balisagevol13.anderson01>,
                                                 <http://dx.doi.org/10.11630/1550-4891.09.02.130>,
                                                 <http://dx.doi.org/10.1163/ej.9789004203365.i-284.36>,
                                                 <http://dx.doi.org/10.1017/s003693060700364x>,
                                                 <http://dx.doi.org/10.1163/156973208x316234>,
                                                 <http://dx.doi.org/10.1177/004057369905500421>;
                                      owl:sameAs <http://viaf.org/viaf/168432349>.

Useful information?

Alright, if ORCID isn't going to assert the most basic links between people and DOI-identified publications, I will.  Anyone can say Anything about Anything, right?  In an effort to salvage this project, I created a small graph of triples asserting that each person in our Semantic Web Working Group foaf:made his or her publications.  For good measure, I asserted owl:sameAs links to VIAF identifiers when that relationship was true.  Now all I needed to do was load the RDF/XML files that I got from dereferencing the various URIs into Stardog, the graph database that we are playing around with in the group.  My original plan was to eventually build a little RDF scraper application that would retrieve the data using HTTP and possibly load the triples directly into Stardog for me.  I described how I built a "toy" application like this in an earlier post.  But first, I tried loading the triples manually.
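Generating those assertions is trivial.  Here is a hypothetical sketch of the kind of thing I asserted (the function and its Turtle output are my own illustration, not anything ORCID or CrossRef provides):

```python
def made_triples(orcid_uri, doi_uris, same_as=None):
    """Serialize foaf:made (and optionally owl:sameAs) assertions
    about a person as Turtle."""
    prefixes = ("@prefix foaf: <http://xmlns.com/foaf/0.1/> .\n"
                "@prefix owl:  <http://www.w3.org/2002/07/owl#> .\n\n")
    objects = ",\n    ".join("<{}>".format(d) for d in doi_uris)
    triples = "<{}> foaf:made {}".format(orcid_uri, objects)
    if same_as:
        triples += ";\n  owl:sameAs <{}>".format(same_as)
    return prefixes + triples + " .\n"

# The ORCID ID, DOI, and VIAF URI here are from the examples above.
print(made_triples("http://orcid.org/0000-0003-0328-0792",
                   ["http://dx.doi.org/10.1017/s003693060700364x"],
                   same_as="http://viaf.org/viaf/168432349"))
```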

Ok, great.  Every time I tried to load a file containing triples retrieved by dereferencing the CrossRef DOIs, Stardog gave me an error message.  I tried running the triples through the W3C RDF validator, but the RDF/XML came back as valid.  So I had to resort to looking at the RDF serialization with my naked eyes.  There it was:

<http://dx.doi.org/10.1017/s003693060700364x> dcterms:date "2010-7-1"^^xsd:date.

CrossRef was serving malformed ISO 8601 dates that were datatyped as xsd:date, and Stardog was rightfully barking about that - single digit months and days don't work in the lexical space for xsd:date.  To get the triples to load, I had to manually type in the missing zeros in the months and days.  Grrr.  So there is no way for me to write a homemade RDF scraper that will automatically "look up" the DOIs and retrieve CrossRef's RDF and load it into Stardog without some ad hoc processing code to fix this error.  CrossRef, you flunk Linked Data principle #3!
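The workaround was simple but shouldn't have been necessary.  Here is a sketch of the zero-padding fix I had to apply (the regex is my own; it assumes the only defect is single-digit month or day fields):

```python
import re

def fix_xsd_date(literal):
    """Zero-pad single-digit months and days so the literal fits
    the lexical space of xsd:date, e.g. '2010-7-1' -> '2010-07-01'."""
    m = re.fullmatch(r"(\d{4})-(\d{1,2})-(\d{1,2})", literal)
    if m is None:
        return literal  # leave anything unexpected alone
    year, month, day = m.groups()
    return "{}-{:0>2}-{:0>2}".format(year, month, day)

print(fix_xsd_date("2010-7-1"))  # -> 2010-07-01
```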

Trying it out


Needless to say, this experience was less than satisfying and didn't boost my enthusiasm for doing Linked Data with RDF.  To make myself feel better, I loaded the cleaned-up scraped data and the assertions that I made into Stardog to play with.  Since I am only running Stardog on localhost on my desktop, I also loaded the graphs into the Vanderbilt Heard Library's triple store so that you can try the queries for yourself via the public SPARQL endpoint.  Here's a fun little SPARQL query that retrieves the names of the coauthors of all members of our Semantic Web Working Group who have ORCID IDs:

PREFIX  foaf: <http://xmlns.com/foaf/0.1/>
PREFIX  dcterms: <http://purl.org/dc/terms/>

SELECT DISTINCT ?name

FROM <http://rdf.library.vanderbilt.edu/swwg/assertions.ttl>
FROM <http://rdf.library.vanderbilt.edu/swwg/sww-group.rdf>

WHERE
      {
      ?s a foaf:Group.
      ?s foaf:member ?person.
      ?person foaf:made ?publication.
      ?publication dcterms:creator ?coauthors.
      ?coauthors foaf:name ?name.
      }

You can paste it into the query box at the endpoint if you want to try it out.  As you can see, this query also finds the group members themselves in addition to their co-authors, and it lists some of the authors several times due to CrossRef's ad hoc minting of dcterms:creator objects, each of which has a foaf:name that isn't standardized in any way (look at how many redundant Cliff Anderson and Suellen Stringer-Hye names come up).
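Short of CrossRef fixing its identifiers, about the best a client can do is collapse the name strings heuristically.  A crude, illustrative sketch (the normalization rule is my own invention, and it's obviously lossy - two different people whose names normalize identically would be merged too):

```python
def collapse_names(names):
    """Collapse near-duplicate name strings by normalizing case,
    punctuation, and whitespace, keeping the first spelling seen."""
    seen = {}
    for name in names:
        key = " ".join(name.lower().replace(".", " ")
                                   .replace(",", " ").split())
        seen.setdefault(key, name)
    return list(seen.values())

print(collapse_names(["Clifford B. Anderson", "Clifford B Anderson",
                      "Suellen Stringer-Hye"]))
```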

Summary

My primary conclusion from this little exercise is that neither ORCID nor CrossRef is really serious about contributing to the RDF Linked Data effort.  The fact that every CrossRef RDF/XML file I try to load into a triplestore produces an error tells me that nobody at CrossRef has ever actually tried loading one of their RDF/XML files into a triplestore (it's not just Stardog that balks at the malformed xsd:date-typed literals; Callimachus does as well).  And since ORCID records DOIs for publications as part of its structured data, why in the world wouldn't it expose that information as RDF if it were really serious about supporting Linked Data?  There is little point in my trying to do "real" Linked Data with these metadata using a generic software client if I have to manually fix bad RDF or assert my own triples to make it work.  Sigh.

In my next blog post, I plan to write about using Stardog's built-in reasoning tools to make up for the fact that providers don't use consensus vocabularies in the metadata that they provide.


[2] Yeah, I know that Google is in on Schema.org and is using structured data to make their searches smarter.  But they are still getting a lot (most?) of their information from parsing text.

Monday, February 1, 2016

RDF for talking about people

History

RDF has now been around for over 15 years, and over that time there have been many efforts to use it to describe all kinds of things (or rather, should I say rdf:Resource's or owl:Thing's?).  One of the most fundamental types of thing that one might want to describe is a person.  One of the early efforts to describe people was the Friend-of-a-Friend (FOAF) project.  If you want some historical perspective, you can read the original project description from 2000 here.  To make it easy to describe a person, sometime before 2003 Leigh Dodds created FOAF-a-Matic, a little JavaScript-based web page that generated an RDF/XML description of a person using FOAF properties.  But what good is a description of a person that is denoted by a blank node?

In 2006, Tim Berners-Lee, the creator of the web and promoter of RDF, suggested that everyone "Give yourself a URI".  His suggested method: create a FOAF page, add a hash fragment identifier to the page URL so that the URI for you will be different from the URI for the FOAF document, and you're done!  You can see his FOAF document at http://www.w3.org/People/Berners-Lee/card.rdf, which you should get when you dereference his URI: http://www.w3.org/People/Berners-Lee/card#i, but currently don't because the content negotiation seems to be broken.

This strategy is pretty good if you are TimBL and are the dictator for life of the W3C, which would pretty much allow you to guarantee that your URI in the http://www.w3.org/ domain would be stable.  For the rest of us mere mortals, it is a bit harder.  When I made my first FOAF profile, I used http://people.vanderbilt.edu/~steve.baskauf/foaf.rdf#me as my URI.  This was not a very good URI for several reasons:
1. Including the .rdf was bad form for a cool URI.  I was stuck with it because I didn't have any control over content negotiation on the server.
2. Who put the stupid tilde in there???
3. One day Vanderbilt suddenly decided that they didn't like the people.vanderbilt.edu subdomain, but rather liked the my.vanderbilt.edu subdomain better.  So they shut down people.vanderbilt.edu.  ... URI permanently broken.

After that, I used (and still use) http://bioimages.vanderbilt.edu/contact/baskauf as a URI for myself, which is probably more stable since I control the bioimages.vanderbilt.edu subdomain (at least until Vanderbilt decides that I don't control it any more).

VIAF and ORCID

To my surprise, I one day discovered that I had a VIAF URI (http://viaf.org/viaf/63557389)!  I'm not sure exactly how that happened, but somebody at OCLC decided that I should have one.  This was really great, because VIAF URIs are well known and probably pretty persistent.  The downside was that I had no control over what RDF was delivered when a client dereferenced that URI.  A couple of years later, I learned about ORCID and was able to set up an ORCID ID for myself (http://orcid.org/0000-0003-4365-3135).  To some extent, one can control the RDF content that is delivered from ORCID by adding things like links to a website and publications to one's profile.  However, I can't indicate everything I want, including important things like the fact that there are two other URIs that refer to me!  I've put that in the text of my ORCID biography, but that won't generate a machine-readable link.  In the RDF that I provide when http://bioimages.vanderbilt.edu/contact/baskauf is dereferenced, I assert owl:sameAs relationships to the other two URIs.  But anyone dereferencing either the VIAF or ORCID URI won't discover this relationship.

What RDF do you get from ORCID and VIAF?

I was curious to know what kind of properties VIAF and ORCID used in the RDF they provide and to what extent you get the same information from both places.  Our Semantic Web Working Group has been working through the book Semantic Web for the Working Ontologist, and as an exercise for chapter 3, we were looking at a prototype RDF description of our colleague Cliff Anderson.  Since Cliff was agreeable to being our guinea pig, I decided to dereference his VIAF and ORCID URIs to see what kind of RDF I got.  I won't go into the details of how I did it, since the methods and the actual RDF are already posted in the SWWO Chapter 3 notes.

ORCID

Here are a few things that I noticed about the record that was delivered when Cliff's ORCID was dereferenced with a request for RDF:

  1. The actual ORCID ID http://orcid.org/0000-0003-0328-0792 definitely refers to Cliff, a foaf:Person.  Other things related to Cliff are distinguished from Cliff by adding hash fragment identifiers to Cliff's URI in order to make their URIs different.  This is the opposite approach to what TimBL suggested, but is now perfectly kosher under the resolution to the HTTP Range 14 issue. (see this email for the original description of the issue). ORCID differentiates between the ORCID ID for Cliff (http://orcid.org/0000-0003-0328-0792#orcid-id), a document about Cliff's publications (http://orcid.org/0000-0003-0328-0792#workspace-works), Cliff's personal profile document (http://pub.orcid.org/orcid-pub-web/experimental_rdf_v1/0000-0003-0328-0792), and Cliff himself.  
  2. ORCID primarily uses FOAF vocabulary terms to describe Cliff.  This has been standard practice for many years, despite the fact that FOAF isn't any kind of standard.  ORCID also uses rdfs:label to provide Cliff's name, which is nice since this is a fairly universal way to label resources.  
  3. ORCID uses PROV and PAV ontology terms to describe the provenance information about the record.  The personal profile document is the subject of the provenance triples.  PROV is a W3C Recommendation.  PAV is not, see this article for its rationale.  
  4. The foaf:primaryTopic property is used to link the personal profile document (and hence provenance information) to Cliff.
  5. A GeoNames URI is used to refer to the United States.

VIAF

Here are things I noticed about the record from VIAF:

  1. The actual VIAF ID http://viaf.org/viaf/168432349 definitely refers to Cliff, a schema:Person.  VIAF differentiates between Cliff and the document about Cliff by including a trailing slash after the VIAF ID to generate a URI for the document: http://viaf.org/viaf/168432349/ .  This is a clever trick to cause redirection to the document when Cliff's URI is dereferenced.  However, it is probably harder for humans to distinguish than the hash URI trick used by ORCID.  
  2. VIAF primarily uses Schema.org terms to describe Cliff, although it still uses FOAF vocabulary terms to describe other relationships and classes.  It uses skos:prefLabel rather than rdfs:label to indicate the preferred label for Cliff in two languages.  However, since skos:prefLabel is a subproperty of rdfs:label, the more well-known relationship would be entailed if a client performed reasoning.  
  3. VIAF could provide provenance information about the record as properties of the document.  But it doesn't.
  4. The foaf:primaryTopic property is used to link the document to Cliff.
  5. The record includes definitions of skos:Concept's that have Cliff as their foaf:focus.  The implications of these semantics are not clear to me.

FOAF vs. Schema.org

The current version of FOAF (Paddington Edition - if you read this far, you now know why I put his picture at the top of the post!) is 0.99, issued in January 2014.  There must be something psychological about working on a vocabulary for 14 years and still holding off on calling it version 1.0!  Will the next version be 1.0, or will there be no more versions after this one?  The change log of version 0.99 notes that it declares equivalence between foaf:Person and schema:Person, between foaf:Image and schema:ImageObject, and between foaf:Document and schema:CreativeWork (click here to see the FOAF RDF).  The declaration seems to be reciprocal for schema:Person in the Schema.org RDF, but not for schema:ImageObject and schema:CreativeWork.  However, schema:ImageObject is an equivalent class to dcmitype:Image.  I'm a bit uncertain about all this because I'm not 100% sure where the authoritative RDF for Schema.org resides (see this for some Turtle RDF of Schema.org term definitions).  

The question that I'm wondering about here is whether the Schema.org terms are destined to replace the FOAF terms.  Dan Brickley, one of the FOAF authors, now runs the daily operations of Schema.org and chairs the Schema.org Community Group.  So his efforts now are clearly focused on Schema.org.  Schema.org also has buy-in from Google, Microsoft, and Yahoo, vs. FOAF, which has no particular organizational support.  Schema.org terms can be used not only with RDF, but with various other Linked Data technologies including Microdata and JSON-LD - while FOAF is only for RDF. So it looks like Schema.org may rule the future.  But FOAF has been so widely used for so long that it probably isn't going away soon.  We seem doomed to having two competing RDF vocabularies to describe people for a long time to come.  

VIAF vs. ORCID

Are we also doomed to having two competing systems of URI identifiers for people?  In 2013, ORCID and ISNI (which has a core relationship with the VIAF database) issued a Joint Statement on Interoperation and committed to investigate the feasibility of a shared identifier scheme for a single number to represent an individual in both databases.  The ORCID Registry assigns IDs from a block of numbers that ISNI has set aside to avoid having the same number assigned to different people in the two systems.  However, as far as I can tell, there has been no progress since 2013 in dealing with the opposite problem: avoiding having the same person being assigned different numbers in the two systems.  As with the two vocabulary schemes, we seem to be doomed to having two competing systems for assigning identifiers to people. 

owl:sameAs or SPARQL as a solution to dealing with duplicate information?

There doesn't seem to be any way to force either of the ID systems (VIAF or ORCID) to link to the other's IDs.  However, anyone can declare two resources to be equivalent by linking their URIs with owl:sameAs in triples that they assert.  I hope that our Semantic Web Working Group can play around with merging information from VIAF and ORCID using the Stardog reasoner to materialize triples entailed by the use of owl:sameAs and other terms of equivalence such as owl:equivalentProperty.  Alternatively, we could work out some SPARQL queries that merge information from both types of records and make it possible to retrieve properties from both sources using either FOAF or Schema.org terms.  I will plan to report in a future blog post if we come up with anything interesting.


Thursday, September 17, 2015

Why are handwashing studies and their reporting so broken?



For around ten years, I've had my introductory biology students perform experiments to attempt to determine the effectiveness of soaps and hand cleansing agents.  This is really a great exercise to get students thinking about the importance of good experimental design, because it is very difficult to do an experiment that is good enough to show differences caused by their experimental treatments.  The bacterial counts they measure are very variable and it's difficult to control the conditions of the experiment.  Since there is no predetermined outcome, the students have to grapple with drawing appropriate conclusions from the statistical tests they conduct - they don't know what the "right" answer is for their experiment.

We are just about to start in on the project in my class again this year, so I was excited to discover that a new paper had just come out that purports to show that triclosan, the most common antibacterial agent in soap, has no effect under conditions similar to normal hand washing:

Kim, S.A., H. Moon, and M.S. Ree. 2015. Bactericidal effects of triclosan in soap both in vitro and in vivo. Journal of Antimicrobial Chemotherapy. http://dx.doi.org/10.1093/jac/dkv275 (the DOI doesn't currently dereference, but the paper is at http://m.jac.oxfordjournals.org/content/early/2015/09/14/jac.dkv275.short?rss=1)

The authors exposed 20 recommended bacterial strains to soap with and without triclosan at two different temperatures.  They also exposed bacteria to regular and antibacterial soap for varying lengths of time.  In a second experiment, the authors artificially contaminated the hands of volunteers who then washed with one of two kinds of soap.  The bacteria remaining on the hands were then sampled.

The authors stated that there was no difference in the effect of soap with and without triclosan.  They concluded that this was because the bacteria were not in contact with the triclosan long enough for it to have an effect.  Based on what I've read and on the various experiments my students have run over the years, I think this conclusion is correct.  So what's my problem with the paper?

Why do we like to show that things are different and not that they are the same?


My beginner students often wonder why experimental scientists are so intent on showing that things are significantly different.  Why not show that they are the same?  Sometimes that's what we actually want to know anyway.

When analyzing the results of an experiment statistically, we evaluate the results by calculating "P".  P is the probability that we would get results this different by chance if the things we were comparing were actually the same.  If P is high, then it's likely that the differences are due to random variation.  If P is low, it's unlikely that the differences are due to chance variation; more likely, they are caused by a real effect of the thing we are testing.  The typical cutoff for statistical significance is P<0.05.  If P<0.05, we say that we have shown that the results are significantly different.

The problem lies in our conclusion when P>0.05.  A common (and wrong) conclusion is that when P>0.05 we have shown that the results are not different (i.e. the same).  Actually, what has happened is that we have failed to show that the results are different.  Isn't that the same thing?

Absolutely not.  In simple terms, I put the answer this way: if P<0.05, that is probably because the things we are measuring are different.  If P>0.05, that is either because the things we are measuring are the same OR because our experiment stinks!  When differences are small, it may be very difficult to perform an experiment good enough to show that P<0.05.  On the other hand, any bumbling idiot can do an experiment that produces P>0.05 through any number of poor practices: not enough samples, poorly controlled experimental conditions, or doing the wrong kind of statistical test.

So there is a special burden placed on a scientist who wants to show that two things are the same.  It is not good enough to run a statistical test and get P>0.05.  The scientist must also show that the experiment and analysis were capable of detecting differences of a given size if they existed.  This is called a "power analysis".  A power analysis shows that the test has enough statistical power to uncover differences when they are actually there.  Before claiming that there is no effect of the treatment (no significant difference), the scientist has to show that his or her experiment doesn't stink.
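To give a feel for what a power analysis involves, here is a back-of-the-envelope calculation of the approximate power of a two-sample comparison using the normal approximation (the effect size is a made-up input; a real analysis would use proper statistical software and the noncentral t distribution):

```python
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def approx_power(n_per_group, effect_size, z_crit=1.96):
    """Approximate power of a two-sided, two-sample comparison to
    detect a standardized mean difference (Cohen's d) with
    n_per_group observations per group, using the normal
    approximation (ignores the tiny far-tail contribution)."""
    noncentrality = effect_size * sqrt(n_per_group / 2)
    return 1 - normal_cdf(z_crit - noncentrality)

# Even a large effect (d = 1.0) is hard to detect with 3 replicates:
print(round(approx_power(3, 1.0), 2))
print(round(approx_power(20, 1.0), 2))
```

With only three replicates per treatment, the chance of detecting even a large effect is well under 50% - exactly the "experiment stinks" scenario described above.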


So what's wrong with the Kim et al. 2015 paper???


The problem with the paper is that it doesn't actually provide evidence that supports its conclusions.

If we look at the Kim et al. paper, we can find the problem buried on the third page.  Normally in a study, one reports "N", the sample size, a.k.a. the number of times the experiment was repeated.  Repeating the experiment is the only way to find out whether the differences you see are due to real effects or to bad luck in sampling.  In the Kim et al. paper, all that is said about the in vitro part of the study is "All treatments were performed in triplicate."  Are you joking?????!!!  Three replicates is a terrible sample size for this kind of experiment, where results tend to be very variable.  I guess N=2 would have been worse, but this is pretty bad.


My next gripe with the paper is in the graphs.  It is a fairly typical practice in reporting results to show a bar graph where the height represents the mean value of the experimental treatment and the error bars show some kind of measure of how well that mean value is known.  The amount of overlap (if any) provides a visual way of assessing how different the means are.

Typically, 95% confidence intervals or standard errors of the mean are used to set the size of the error bars.  But Kim et al. used standard deviation.  Standard deviation measures the variability of the data, but it does NOT provide an assessment of how well the mean value is known.  Both 95% confidence intervals and standard errors of the mean are influenced by the sample size as well as the variability of the data; they take into consideration all of the factors that affect how well we know our mean value.  So the error bars on these graphs, being based on standard deviation, really don't provide any useful information about how different the mean values are.*
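To see how much this matters, compare the half-widths of the three kinds of error bars computed from the same data, using the usual formulas (SEM = s/sqrt(N); 95% CI taken as roughly 2 x SEM):

```python
from math import sqrt

def bar_halfwidths(s, n):
    """Half-widths of three kinds of error bars for a sample with
    standard deviation s and sample size n."""
    sem = s / sqrt(n)
    return {"sd": s, "sem": sem, "ci95": 2 * sem}  # 2 ~ the z/t factor

for n in (3, 16):
    print(n, {k: round(v, 2) for k, v in bar_halfwidths(1.0, n).items()})
```

At N=3 the standard deviation bars are slightly narrower than the 95% CI bars, but at N=16 they are twice as wide - which is why overlap of SD bars tells you nothing about whether the means differ.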

The in vivo experiment was better.  There, 16 volunteers participated, so the sample size is better than 3.  But there are other problems.


First of all, it appears that all 16 volunteers washed their hands using all three treatments.  There is nothing wrong with that, but apparently the data were analyzed using a one-factor ANOVA.  In this case, the statistical test would have been much more powerful if it had been blocked by participant, since there may be variability that was caused by the participants themselves and not by the applied treatment.  
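To illustrate why blocking matters, here is a toy calculation with invented numbers: subtracting each participant's own mean removes the participant-to-participant variability, leaving a much smaller residual spread against which the treatment difference is judged:

```python
from statistics import mean, stdev

# Hypothetical log bacterial counts for 4 volunteers washing with
# two soaps.  The numbers are invented purely to illustrate blocking.
data = {"plain":     [5.1, 3.2, 6.0, 4.4],
        "triclosan": [4.9, 3.1, 5.7, 4.3]}

# Unblocked: within-treatment spread is dominated by the volunteers.
unblocked_sds = {t: stdev(v) for t, v in data.items()}

# Blocked: subtract each volunteer's own mean across treatments first.
volunteer_means = [mean(pair) for pair in zip(*data.values())]
blocked = {t: [x - m for x, m in zip(vals, volunteer_means)]
           for t, vals in data.items()}
blocked_sds = {t: stdev(v) for t, v in blocked.items()}

print({t: round(s, 2) for t, s in unblocked_sds.items()})
print({t: round(s, 2) for t, s in blocked_sds.items()})
```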

Secondly, the researchers applied an a posteriori Tukey's multiple range test to determine which pairwise comparisons were significantly different.  Tukey's test is appropriate in cases where there is no a priori rationale for comparing particular pairs of treatments.  However, in this case, it is perfectly clear which pair of treatments the researchers are interested in: the comparison of regular and antibacterial soap!  Just look at the title of the paper!  The comparison of the soap treatments with the baseline is irrelevant to the hypothesis being tested, so its presence does not create a requirement for a test for unplanned comparisons.  Tukey's test adjusts the experiment-wise error rate for multiple unplanned comparisons, effectively raising the bar and making it harder to show that P<0.05 - in this case, jury-rigging the test to make it more likely that the effects will NOT appear different.  

Both the failure to block by participant and the use of an inappropriate a posteriori test make the statistical analysis weaker, not stronger - and a stronger test is exactly what you need if you want to argue that you failed to find differences because they weren't there.  

The graph is also misleading for the reasons I mentioned about the first graph.  The error bars here apparently bracket the range within which the middle 80% of the data fall.  Again, this is a measure of the dispersion of the data, not a measure of how well the mean values are known.  We can draw no conclusions from the degree of overlap of the error bars, because the error bars represent the wrong thing.  They should have been 95% confidence intervals if the authors wanted to have meaning in the amount of overlap.  

Is N=16 an adequate sample size?  We have no idea, because no power analysis was reported.  This kind of sloppy experimental design and analysis seems to be par for the course in experiments involving hand cleansing.  I usually suggest that my students read the scathing rebuke by Paulson (2005) [1] of the Sickbert-Bennett et al. (2005) [2] paper, which bears some similarities to the Kim et al. paper.  Sickbert-Bennett et al. claimed that it made little difference what kind of hand cleansing agent one used, or whether one used any agent at all.  However, Paulson pointed out that the sample size used by Sickbert-Bennett (N=5) would have needed to be as much as 20 times larger (i.e. N=100) to make their results conclusive.  Their experiment was way too weak to support the conclusion that the treatments had the same effect.  This is probably also true for Kim et al., although to know for sure, somebody would need to run a power analysis on their data.

What is wrong here???

There are so many things wrong here, I hardly know where to start.

1. Scientists who plan to engage in experimental science need to have a basic understanding of experimental design and statistical analysis.  Something is really wrong with our training of future scientists if we don't teach them to avoid basic mistakes like this.

2. Something is seriously wrong with the scientific review process if papers like this get published with really fundamental problems in their analyses and in the conclusions that are drawn from those analyses.  The fact that this paper got published means not just that four co-authors don't know basic experimental design and analysis, but also that two or more peer reviewers and an editor couldn't recognize problems with experimental design and analysis.

3. Something is seriously wrong with science reporting.  This paper has been picked up and reported online by newsweek.com, cbsnews.com, webmd, time.com, huffingtonpost.com, theguardian.com, and probably more.  Did any of these news outlets read the paper?  Did any of them consult with somebody who knows how to assess the quality of research and get a second opinion on this paper?  SHAME ON YOU, news media!!!!!

--------------
Steve Baskauf is a Senior Lecturer in the Biological Sciences Department at Vanderbilt University, where he introduces students to elementary statistical analysis in the context of the biology curriculum. 

* Error bars that represent standard deviation will always span a larger range than those that represent standard error of the mean, since the standard error of the mean is estimated by s/sqrt(N).  The 95% confidence interval is approximately plus or minus two times the standard error of the mean.  So when N=3, the 95% confidence interval will be approximately +/-2s/sqrt(3), or +/-1.15s.  In that case, the standard deviation error bars span a range that is slightly smaller than the 95% confidence interval error bars would span, making it slightly easier to have error bars that don't overlap than it would be if they represented 95% confidence intervals.  When N=16, the 95% confidence interval would be approximately +/-2s/sqrt(16), or +/-s/2.  In this case, the standard deviation error bars are twice the size of the 95% confidence intervals, making it much easier to have error bars that overlap than if the bars represented 95% confidence intervals.  In a case like this, where we are trying to show that things are the same, making the error bars twice as big as they should be makes the sample means look more similar than they actually are, which is misleading.  The point here is that using standard deviations for error bars is the wrong thing to do when comparing means.
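The arithmetic in this footnote is easy to check with a few lines of Python (a sketch; the factor of 2 is the usual rough stand-in for the exact critical value):

```python
import math

def ci95_half_width(s, n):
    # Approximate 95% confidence interval half-width: about two
    # standard errors of the mean, where SEM = s / sqrt(n)
    return 2 * s / math.sqrt(n)

s = 1.0  # work in units of the standard deviation
print(round(ci95_half_width(s, 3), 2))   # 1.15: slightly wider than +/- 1 SD bars
print(round(ci95_half_width(s, 16), 2))  # 0.5: half as wide as +/- 1 SD bars
```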

[1] Paulson, D.S. 2005. Response: comparative efficacy of hand hygiene agents. American Journal of Infection Control 33:431-434. http://dx.doi.org/10.1016/j.ajic.2005.04.248
[2] Sickbert-Bennett E.E., D.J. Weber, M.F. Gergen-Teague, M.D. Sobsey, G.P. Samsa, W.A. Rutala. 2005. Comparative efficacy of hand hygiene agents in the reduction of bacteria and viruses.  American Journal of Infection Control 33:67-77.  http://dx.doi.org/10.1016/j.ajic.2004.08.005

Saturday, September 5, 2015

Date of Toilet Paper Apocalypse Now Known


Introduction


Fig. 1. Charmin toilet paper roll in 2015

Have you noticed how rolls of toilet paper fit into their holders more loosely than they did in the past?  Take a look the next time you are doing your Charmin paperwork, and you will see the first signs of the impending Toilet Paper Apocalypse.

Methods and data


Like the proverbial frog in boiling water, the first signs of the Toilet Paper Apocalypse were not obvious.  I first became aware that it was coming in 2009, when I happened to buy two 8-packs of Charmin Giant rolls and noticed that one of them was noticeably shorter than the other:

Fig. 2. Roll size change observed in 2009.
Careful examination of the quantity details showed that the width of each roll had been decreased from the standard width of 11.4 cm to 10.8 cm.  Although this decrease of 0.6 cm wasn't that apparent when comparing single rolls, it was pretty obvious when the rolls were stacked four high.  When I noticed this, I had a sinking feeling that I was seeing the beginning of a nefarious plot being carried out by faceless bureaucrats at Procter and Gamble.  But without additional data, it was hard to know whether this 5% decrease in the size of rolls was a fluke or part of a pattern.

Fast-forward to 2015 and another trip to the store.  This time it was a purchase of two 6-packs of Charmin Mega rolls:
Fig. 3.a. Charmin roll size change in 2015.

Close examination of the details tells the story:
Fig. 3.b. Charmin roll size change in 2015 (detail).

Sometime between 2009 and 2015, P&G decreased the width of their rolls again, from 10.8 to 9.9 cm, another 8% decrease in the size per sheet.  Then in 2015, the number of sheets per Mega roll was reduced from 330 to 308, a further 7% decline in the amount of toilet paper per roll, and a corresponding increase in P&G's profits (since the price of the product has stayed the same with the size decreases).  

What first seemed an idle conspiracy theory is now a known fact.  Being an analysis nerd, I had to search for additional data from the years between 2009 and 2015.  Google did not disappoint.  I was able to find this 2013 post from the Consumerist, which documented another decline in roll width, from 10.8 cm to 10.0 cm (a  7% decrease in sheet size):

Fig. 4. Charmin roll size change in 2013.  Image from http://consumerist.com/2013/10/31/charmin-deploys-toilet-paper-sheet-shrink-ray-slims-rolls-down/

Analysis

Unfortunately, not all of the changes involved rolls of the same size.  I also did not have exact dates for each change.  However, by estimating the dates and doing some size conversions, I was able to assemble the following data:
Fig. 5. Data showing relationship between year and roll size.

 I conducted a regression analysis on the data and the trend was clear:
Fig. 6. Regression analysis predicting the relationship between year and roll size.
The regression had P=0.0203, so there is no question about the significance of the relationship.  Note also that the regression line has an R2 value of 0.96, which means we can safely extrapolate into the future to predict the course of events leading up to the apocalypse.  Rearranging the regression line and solving for the year when the toilet paper surface area goes to zero gives a year value of 2046.  So we can now confidently predict the date when the apocalypse will come to pass.
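If you'd like to reproduce this sort of doomsday extrapolation, a least-squares fit takes only a few lines of Python.  The (year, width) points below are the roll widths quoted in this post, not the sheet-area data behind Figs. 5 and 6, so this width-only fit hits zero later than 2046 (around 2077):

```python
# Illustrative (year, roll width in cm) points taken from this post;
# NOT the sheet-area dataset behind Figs. 5-6, so the zero year differs
data = [(2009, 10.8), (2013, 10.0), (2015, 9.9)]

n = len(data)
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in data)
         / sum((x - mean_x) ** 2 for x, _ in data))
intercept = mean_y - slope * mean_x

# Solve width = 0 for the year of the Toilet Paper Apocalypse
zero_year = -intercept / slope
print(round(slope, 3), round(zero_year))
```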

Discussion

This analysis has three important implications. 

1. Charmin toilet paper will disappear on or about the year 2046.  For those of us who are devoted Charmin users, that means shortages, hoarding, and chaos in the grocery stores during the 2040's.  Remember the horror that was Y2K?  We have that to look forward to all over again.

2. Since each Charmin roll size decrease was accomplished without a corresponding decrease in price, the profits for Procter and Gamble will increase as an inverse function of the area of paper per roll.  A little elementary math will prove that a linear decrease in the roll area to zero results in Procter and Gamble's profits increasing to infinity in the year 2046.  The implications are clear: by taking advantage of their customers' addiction to their product, P&G intends to dominate the world economy by mid-century. 

3. As the amount of toilet paper per roll decreases, other effects will begin to appear.  For example, if we assume a constant number of sheets per roll and a length per sheet of 10.1 cm, we can project based on the regression line from Fig. 6 that the width of sheets will reach 2.0 cm in approximately 2040 (Fig. 7).

Fig. 7. Simulated Charmin roll size in 2040.
When the sheet size reaches that shown in Fig. 7, it becomes questionable as to whether the toilet paper can actually perform its intended function. In that case, we can expect cascades of related effects caused by the increase in transmission of fecal-borne pathogens, such as pandemics of cholera, hepatitis A, and typhoid fever.  It is possible that the human population may collapse due to these pandemics several years before the actual disappearance of toilet paper altogether.

Conclusions

This paper should be considered a clarion call for action.  The most likely solution to the problem is probably some form of government regulation of roll and sheet size.  However, given the widespread gridlock in facing lesser issues, such as climate change and refugee crises, it is likely that inaction by governments on this issue will continue.  In that case, we should expect widespread hoarding, followed by rationing as 2046 approaches.

--------------------------------------------------------------------------------------------------------------

Steve Baskauf is a Senior Lecturer in the Biological Sciences Department at Vanderbilt University, where he introduces students to elementary statistical analysis in the context of the biology curriculum.

Thursday, July 30, 2015

Shiny new toys 3: Playing with Linked Data

In the second post in this series, I recapitulated my conclusions about the question "What can we learn using RDF?" that I had explored in an earlier series of blog posts:

1. In order for the use of RDF to be beneficial, it needs to facilitate processing by semantic clients ("machines") that results in interesting inferences and information that would not otherwise be obvious to human users.  We could call this the "non-triviality" problem.
2. Retrieving information that isn't obvious is more likely to happen if the RDF is rich in object properties [1] that link to other IRI-identified resources and if those linked IRIs are from Internet domains controlled by different providers.  We could call this the "Linked Data" problem: we need to break down the silos that separate the datasets of different providers.  We could reframe by saying that the problem to be solved is the lack of persistent, unique identifiers, lack of consensus object properties, and lack of the will for people to use and reuse those identifiers and properties.
3. RDF-enabled machine reasoning will be beneficial if the entailed triples allow the construction of more clever or meaningful queries, or if they state relationships that would not be obvious to humans.  We could call this the "Semantic Web" problem.

In that second post, I used the "shiny new toys" (Darwin-SW 0.4, the Bioimages RDF dataset, and the Heard Library Public SPARQL endpoint) to play around with the Semantic Web issue by using SPARQL to materialize entailed inverse relationships.  In this post, I'm going to play around with the Linked Data issue by trying to add an RDF dataset from a different provider to the Heard Library triplestore and see what we can do with it.

 
 Image from http://linkeddata.org/ CC BY-SA

Linked Data


I'm not going to attempt to review the philosophy and intricacies of Linked Data.  Both http://linkeddata.org/ and the W3C have introductions to Linked Data.  To summarize, here's what I think of when I hear the term "Linked Data":

1. Things (known as "resources" in RDF lingo) are identified by HTTP IRIs.
2. Those things are described by some kind of machine-processable data.
3. Those data link the thing to other things that are also identified by HTTP IRIs.
4. If a machine doesn't know about some other linked thing, it can dereference that thing's HTTP IRI in order to retrieve the machine-processable data about the other thing via the Internet.

This may be an oversimplification that is short on technical details, but I think it catches the main points.

Some notes:
1. IRIs are a superset of URIs.  See https://www.ietf.org/rfc/rfc3987.txt for details.  In this post, you could pretty much substitute "URI" anytime I use "IRI" if you wanted to.
2. "Machine-processable data" might mean that the data are described using RDF, but that isn't a requirement - there are other ways that data can be described and encoded, such as Microdata and JSON-LD, which are similar to, but not synonymous with, RDF.
3. It is possible to link to other things without using RDF triples.  For example, in its head element, an HTML web page can have a link tag with a rel="meta" attribute that links to some other document.  It is possible that a machine would be able to "follow its nose" to find other information using that link.
4. RDF does not require that the things be identified by IRIs that begin with "http://" (i.e. HTTP IRIs).  But if they aren't, it becomes difficult to use HTTP and the Internet to retrieve the information about a linked thing.

In this blog post, I'm going to narrow the scope of Linked Data as it is applied in my new toys:
1. All IRIs are HTTP IRIs.
2. Resources are described as RDF.
3. Links are made via RDF triples.
4. The linked RDF is serialized within some document that can be retrieved via HTTP using the linked IRI.

What "external" links are included in the Bioimages RDF dataset?


In the previous two blog posts in this series, all of the SPARQL queries investigated resources whose descriptions were contained in the Bioimages dataset itself.  However, the Bioimages dataset contains links to some resources that are described outside of that dataset:
  • Places that are described by GeoNames.
  • Taxonomic names that are described by uBio.
  • Agents that are described by ORCID.
  • Literature identified by DOIs that link to publisher descriptions.

In principle, a Linked Data client ("machine": a program designed to find, load, and interpret machine-processable data) could start with the Bioimages VoID description and follow links from there to discover and add to the triplestore all of the linked information, including information about the four kinds of resources described outside of Bioimages.  Although that would be cool, it probably isn't practical for us to take that approach at present.  Instead, I would like to find a way to acquire relevant triples from a particular external provider, then add them to the triplestore manually.

The least complicated of the external data sources [1] to experiment with is probably GeoNames.  GeoNames actually provides a data dump download service that can be used to download its entire dataset.  Unfortunately, the form of that dump isn't RDF, so it would have to be converted to RDF.  Its dataset also includes a very large number of records (10 million geographical names, although subsets can be downloaded).  So I decided it would probably be more practical to just retrieve the RDF about particular geographic features that are linked from Bioimages.

Building a homemade Linked Data client


In the do-it-yourself spirit of this blog series, I decided to program my own primitive Linked Data client.  As a summer father/daughter project, we've been teaching ourselves Python, so I decided this would be a good Python programming exercise.  There are two reasons why I like using Python vs. other programming languages I've used in the past.  One is that it's very easy to test your code interactively as you build it.  The other is that somebody has probably already written a library that contains most of the functions that you need.  After a miniscule amount of time Googling, I found the RDFLib library that provides functions for most common RDF-related tasks.  I wanted to leverage the new Heard Library SPARQL endpoint, so I planned to use XML search results from the endpoint as a way to find the IRIs that I want to pull from GeoNames.  Fortunately, Python has built-in functions for handling XML as well.

To scope out the size of the task, I created a SPARQL query that would retrieve the GeoNames IRIs to which Bioimages links.  The Darwin Core RDF Guide defines the term dwciri:inDescribedPlace that relates Location instances to geographic places.  In Bioimages, each of the Location instances associated with Organism Occurrences is linked to a GeoNames feature using dwciri:inDescribedPlace.  As in the earlier blog posts of this series, I'm providing a template SPARQL query that can be pasted into the blank box of the Heard Library endpoint:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX ac: <http://rs.tdwg.org/ac/terms/>
PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>
PREFIX dwciri: <http://rs.tdwg.org/dwc/iri/>
PREFIX dsw: <http://purl.org/dsw/>
PREFIX Iptc4xmpExt: <http://iptc.org/std/Iptc4xmpExt/2008-02-29/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dcmitype: <http://purl.org/dc/dcmitype/>

SELECT DISTINCT ?place

FROM <http://rdf.library.vanderbilt.edu/bioimages/images.rdf>
FROM <http://rdf.library.vanderbilt.edu/bioimages/organisms.rdf>
FROM <http://rdf.library.vanderbilt.edu/bioimages/baskauf.rdf>
FROM <http://rdf.library.vanderbilt.edu/bioimages/thomas.rdf>
FROM <http://rdf.library.vanderbilt.edu/bioimages/uri.rdf>
FROM <http://rdf.library.vanderbilt.edu/bioimages/kirchoff.rdf>
FROM <http://rdf.library.vanderbilt.edu/bioimages/ncu-all.rdf>
FROM <http://rdf.library.vanderbilt.edu/bioimages/dsw.owl>
FROM <http://rdf.library.vanderbilt.edu/bioimages/dwcterms.rdf>
FROM <http://rdf.library.vanderbilt.edu/bioimages/local.rdf>
FROM <http://rdf.library.vanderbilt.edu/bioimages/stdview.rdf>

WHERE {
 ?location a dcterms:Location.
 ?location dwciri:inDescribedPlace ?place.
}


You can substitute other query forms in place of the SELECT DISTINCT line and other graph patterns for the WHERE clause in this template.

The graph pattern in the example above binds Location instances to a variable and then finds GeoNames features that are linked using dwciri:inDescribedPlace.  Running this query finds the 185 unique references to GeoNames IRIs that fit the graph pattern.  If you run the query using the Heard Library endpoint web interface, you will see a list of the IRIs.  If you use a utility like cURL to run the query, you can save the results in an XML file like https://github.com/jbaskauf/python/blob/master/results.xml (which I've called "results.xml" in the Python script).
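If you'd rather skip cURL, the same request can be built with the Python standard library.  This is just a sketch; the endpoint URL is a placeholder, so substitute the real Heard Library endpoint address before using it:

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def sparql_request(endpoint, query):
    # POST the query as a form body and ask for SPARQL XML results
    body = urlencode({"query": query}).encode("utf-8")
    return Request(endpoint, data=body, headers={
        "Accept": "application/sparql-results+xml",
        "Content-Type": "application/x-www-form-urlencoded",
    })

# Hypothetical usage (replace the placeholder endpoint URL):
# with urlopen(sparql_request("http://example.org/sparql", "SELECT ...")) as response:
#     with open("results.xml", "wb") as f:
#         f.write(response.read())
```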

Now we have the raw materials to build the application.  Here's the code required to parse the results XML from the file into a Python object:

import xml.etree.ElementTree as etree
tree = etree.parse('results.xml')


A particular result XML element in the file looks like this:

<result>
    <binding name='place'>
        <uri>http://sws.geonames.org/4617305/</uri>
    </binding>
</result>


To pull out all of the <uri> nodes, I used:

resultsArray=tree.findall('.//{http://www.w3.org/2005/sparql-results#}uri')

and to turn a particular <uri> node into a string, I used:

baseUri=resultsArray[fileIndex].text

In theory, this string is the GeoNames feature IRI that I would need to dereference to acquire the RDF about that feature.  But content negotiation redirects the IRI that identifies the abstract feature to a similar IRI that ends in "about.rdf" and identifies the actual RDF/XML document file that is about the GeoNames feature.  If I were programming a real semantic client, I'd have to be able to handle this kind of redirection and recognize HTTP response codes like 303 (SeeOther).  But I'm a Python newbie and I don't have a lot of time to spend on the program, so I hacked it to generate the file IRI like this:

getUri=baseUri+"about.rdf"

RDFLib is so awesome - it has a single function that will make the HTTP request, receive the file from the server, and parse it:

result = addedGraph.parse(getUri)

It turns out that the Bioimages data has a bad link to a non-existent IRI:

http://sws.geonames.org/8504608/about.rdf

so I put the RDF parse code inside a try:/except:/else: error trap so that the program wouldn't crash if the server HTTP response was something other than 200.

In RDFLib it is also super-easy to do a UNION merge of two graphs.  I merge the graph I just retrieved from GeoNames into the graph where I'm accumulating triples by:

builtGraph = builtGraph + addedGraph

When I'm done merging all of the data that I've retrieved from GeoNames, I serialize the graph I've built into RDF/XML and save it in the file "geonames.rdf":

s = builtGraph.serialize(destination='geonames.rdf', format='xml')

Here's what the whole script looks like:

import rdflib
import xml.etree.ElementTree as etree

tree = etree.parse('results.xml')
resultsArray = tree.findall('.//{http://www.w3.org/2005/sparql-results#}uri')
builtGraph = rdflib.Graph()

fileIndex = 0
while fileIndex < len(resultsArray):
    print(fileIndex)
    baseUri = resultsArray[fileIndex].text
    getUri = baseUri + "about.rdf"
    # use a fresh graph each time so a failed parse can't leave stray triples
    addedGraph = rdflib.Graph()
    try:
        result = addedGraph.parse(getUri)
    except:
        print(getUri)
    else:
        builtGraph = builtGraph + addedGraph
    fileIndex = fileIndex + 1

s = builtGraph.serialize(destination='geonames.rdf', format='xml')


Voilà! a Linked Data client in 19 lines of code!  You can get the raw code with extra annotations at
https://raw.githubusercontent.com/jbaskauf/python/master/geonamesTest.py

The results in the "geonames.rdf" file are serialized in the typical, painful RDF/XML syntax:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
   xmlns:cc="http://creativecommons.org/ns#"
   xmlns:dcterms="http://purl.org/dc/terms/"
   xmlns:foaf="http://xmlns.com/foaf/0.1/"
   xmlns:gn="http://www.geonames.org/ontology#"
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
   xmlns:wgs84_pos="http://www.w3.org/2003/01/geo/wgs84_pos#"
>
  <rdf:Description rdf:about="http://sws.geonames.org/5417618/">
    <gn:alternateName xml:lang="bpy">কলোরাডো</gn:alternateName>
    <gn:alternateName xml:lang="eo">Kolorado</gn:alternateName>
    <gn:countryCode>US</gn:countryCode>
    <gn:alternateName xml:lang="mk">Колорадо</gn:alternateName>
    <gn:alternateName xml:lang="he">קולורדו</gn:alternateName>
    <gn:parentCountry rdf:resource="http://sws.geonames.org/6252001/"/>
    <gn:name>Colorado</gn:name>
...
  </rdf:Description>
...
</rdf:RDF>

but that's OK, because only a machine will be reading it.  You can see that the data I've pulled from GeoNames provides me with information that didn't already exist in the Bioimages dataset, such as how to write "Colorado" in Cyrillic and Hebrew.




Ursus americanus occurrence in Parque Nacional Yellowstone in 1971: http://www.gbif.org/occurrence/930741907  Image (c) 2005 by James H. Bassett CC BY-NC-SA http://bioimages.vanderbilt.edu/bassettjh/jb261

Doing something fun with the new data


As exciting as it is to have built my own Linked Data client, it would be even more fun to use the scraped data to run some more interesting queries.  The geonames.rdf triples have been added to the Heard Library triplestore and can be included in queries by adding the

FROM <http://rdf.library.vanderbilt.edu/bioimages/geonames.rdf>

clause to the template set I listed earlier and the gn: namespace abbreviation to the list:

PREFIX gn: <http://www.geonames.org/ontology#>

-----------------------------------------
OK, here is the first fun query: find 20 GeoNames places that are linked to Locations in the Bioimages dataset, and show their English names alongside their Japanese translations.  By looking at the RDF snippet above, you can see that the property gn:name is used to link the feature to its preferred (English?) name.  The property gn:alternateName is used to link the feature to alternative names in other languages.  So the triple patterns

 ?place gn:name ?placeName.
 ?place gn:alternateName ?langTagName.


can be added to the template query to find the linked names.  However, we don't want ALL of the alternative names, just the ones in Japanese.  For that, we need to add the filter

FILTER ( lang(?langTagName) = "ja" )

The whole query would be

SELECT DISTINCT ?placeName ?langTagName

WHERE {
 ?location a dcterms:Location.
 ?location dwciri:inDescribedPlace ?place.
 ?place gn:name ?placeName.
 ?place gn:alternateName ?langTagName.
FILTER ( lang(?langTagName) = "ja" )
}
LIMIT 20


Here are sample results:

placeName                              langTagName
Hawaii                                 ハワイ州@ja
Great Smoky Mountains National Park    グレート・スモーキー山脈国立公園@ja
Yellowstone National Park              イエローストーン国立公園@ja

Other fun language tags to try include "ka", "bpy", and "ar".
---------------------------------------
Here is the second fun query.  List species that are represented in the Bioimages database that are found in "Parque Nacional Yellowstone" (the Spanish name for Yellowstone National Park).  Here's the query:

SELECT DISTINCT ?genus ?species

WHERE {
 ?determination dwc:genus ?genus.
 ?determination dwc:specificEpithet ?species.
 ?organism dsw:hasIdentification ?determination.
 ?organism dsw:hasOccurrence ?occurrence.
 ?occurrence dsw:atEvent ?event.
 ?event dsw:locatedAt ?location.
 ?location dwciri:inDescribedPlace ?place.
 ?place gn:alternateName "Parque Nacional Yellowstone"@es.
}


Most of this query could have been done using only the Bioimages dataset without our Linked Data effort, but there is nothing in the Bioimages data that provides any information about place names in Spanish.  Querying on that basis required triples from GeoNames.  The results are:

genus  species
Pinus albicaulis
Ursus americanus

--------------------------------
Adding the GeoNames data to Bioimages enables more than alternative language representations for place names.  Each feature is linked to its parent administrative feature (counties to states, states to countries, etc.), whose data could also be retrieved from GeoNames to build a geographic taxonomy that could be used to write smarter queries.  Many of the geographic features are also linked to Wikipedia articles, so queries could be used to build web pages that showed images of organisms found in certain counties, along with a link to the Wikipedia article about the county.

Possible improvements to the homemade client

  1. Let the software make the query to the SPARQL endpoint itself instead of providing the downloaded file to the script.  This is easily possible with the RDFLib Python library.
  2. Facilitate content negotiation by requesting an RDF content-type, then handle 303 redirects to the file containing the metadata. This would allow the actual resource IRI to be dereferenced without jury-rigging based on an ad hoc addition of "about.rdf" to the resource IRI.  
  3. Search recursively for parent geographical features.  If the retrieved file links to a parent resource that hasn't already been retrieved, retrieve the file about it, too.  Keep doing that until there aren't any higher levels to be retrieved.
  4. Check with the SPARQL endpoint to find Location records that have changed (or are new) since some previous time, and retrieve only the features linked in that record.  Check the existing GeoNames data to make sure the record hasn't already been retrieved.
This would be a great project for a freshman computer programming class.
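Improvement #2 could be sketched in a few lines of standard-library Python.  Whether GeoNames honors this particular Accept header is something you'd want to verify, but urllib's default opener does follow 303 redirects automatically, so the "about.rdf" hack would go away:

```python
from urllib.request import Request, urlopen

def rdf_request(iri):
    # Ask the server for RDF explicitly instead of appending "about.rdf"
    return Request(iri, headers={"Accept": "application/rdf+xml"})

def fetch_rdf(iri):
    # urllib's default opener follows 303 (See Other) redirects for us;
    # wrap in try/except in real use, as with the script above
    return urlopen(rdf_request(iri))

# Hypothetical usage:
# rdf_xml = fetch_rdf("http://sws.geonames.org/4617305/").read()
```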

So can we "learn" something using RDF?


Returning to the question posed at the beginning of this post, does retrieving data from GeoNames and pooling it with the Bioimages data address the "non-triviality problem"?  In this example, it does provide useful information that wouldn't be obvious to human users (names in multiple languages).  Does this qualify as "interesting"?  Maybe not, since the translation could be obtained by pasting names into Google translate.  But it is more convenient to let a computer do it. 

To some extent, the "Linked Data problem" is solved in this case, since there is now a standard, well-known property to do the linking (dwciri:inDescribedPlace) and a stable set of external IRI identifiers (the GeoNames IRIs) to link to.  The will to do the linking was also there on the part of Bioimages and its image providers - the Bioimages image ingestion software we are currently testing makes it convenient for users to make that link.

So on the basis of this example, I am going to give a definite "maybe" to the question "can we learn something useful using RDF?"

I may write another post in this series if I can pull RDF data from some of the other RDF data sources (ORCID, uBio) and do something interesting with them.

--------------------------------------
[1] uBio doesn't actually use HTTP URIs, I can't get ORCID URIs to return RDF, and DOIs use redirection to a number of data providers.