Tuesday, May 2, 2017

Using the TDWG Standards Documentation Specification with a Controlled Vocabulary

I just recently got the news that the TDWG Executive Committee has ratified the Standards Documentation Specification (SDS).  We now have a way to describe vocabularies not only in a consistent human-readable form, but also in a machine-readable form.  This includes not only TDWG's current core vocabularies (Darwin Core, DwC and Audubon Core, AC), but also vocabularies of controlled values that will be developed in the future.

What are the implications of saying that the vocabularies will be "machine-readable"?  We have had about ten years now of promises that using "semantic technologies" will magically revolutionize biodiversity informatics, but despite repeated meetings, reports, and publications, our actual core technologies are built around conventional database technologies and simple file formats like CSVs.  Many TDWG old-timers have reached the point of "semantic fatigue" resulting from broken promises about what RDF, Linked Data, and the Semantic Web are going to do for them.  So the purpose of this blog post is NOT to sing the praises of RDF and try to change people's minds about it.  Rather, it is to show how describing vocabularies using the SDS can make management of controlled vocabularies practical, and to show how the machine-readable representations of those controlled vocabularies can be used to build applications that can mediate the generation and cleaning of data without human intervention.

I've been working recently with Quentin Groom to flesh out how a test controlled vocabulary for dwc:occurrenceStatus would be serialized in accordance with the SDS.  That test vocabulary is used in the examples that follow.  I'm really excited about this and hopefully there will be more progress to report on this front in the near future.

What is a controlled vocabulary term?

There are several misconceptions about the terminology for describing controlled vocabularies that need to be cleared up before I get into the details about how the SDS will facilitate the management and use of controlled vocabularies.  The first misconception is about the meaning of "controlled vocabulary term".  In the TDWG community, there is a tendency for people to think that a "controlled vocabulary term" is a certain string that we should all use to represent a particular value of a property.  For example, we could say that in a Darwin Core Archive, we would like for everyone to use the string "extant" as the value for the property dwc:occurrenceStatus when we intend to convey the meaning that an organism was present in a certain geographical location at a certain period of time.  However, the controlled vocabulary term is actually the concept of what we would describe in English as "an organism was present in a certain geographical location at a certain period of time" and not any particular string that we might use as a label for that concept.

This idea that a controlled vocabulary term is a concept rather than a language-dependent label lies at the heart of the Simple Knowledge Organization System (SKOS), a W3C Recommendation used to describe thesauri and controlled vocabularies. In fact, the core entity in SKOS is skos:Concept, the class of ideas or notions.  Those ideas can "be identified using URIs, labeled with lexical strings in one or more natural languages" [1], but neither the URIs nor the strings "are" the concepts. The SDS recognizes this distinction when it specifies (Section 4.5.4) that controlled vocabulary terms should be typed as skos:Concept.

What is a term IRI?

Another common misconception is that an IRI must "do something" when you paste it into a web browser.  (In current W3C standards, "IRI", Internationalized Resource Identifier, has replaced "URI", Uniform Resource Identifier, but in the context of this post you can consider them to be interchangeable.)  Although it is nice if an IRI dereferences when you put it in a browser, there is no requirement that it do so.  At its core, an IRI is simply a globally unique identifier that conforms to a particular IETF specification [2].

For example, the IRI http://rs.tdwg.org/dwc/iri/occurrenceStatus is a valid IRI, because it conforms to the IRI specification.  However, it does not currently dereference because no one has (yet) set up the TDWG server to handle it.  It is, however, a valid Darwin Core term, because it is defined in Section 3.7 of the Darwin Core RDF Guide.  The SDS specifies in Section 2.1.1 that IRIs are the type of identifiers that are used in TDWG standards to uniquely identify resources, including vocabulary terms.  Some other kind of globally unique identifier (e.g. UUIDs) could have been used, but using IRIs codified the practice already used by TDWG for other vocabularies.

The SDS does not specify the exact form of IRIs.  That is a matter of design choice, probably to be determined by the TDWG Technical Architecture Group (TAG).  Existing terms in DwC and AC use the pattern where a term IRI is composed of a namespace part and a local name that is a string composed of some form of an English label for the term.  For example, http://rs.tdwg.org/dwc/iri/occurrenceStatus is constructed from the namespace "http://rs.tdwg.org/dwc/iri/" (abbreviated by the compact URI or CURIE dwciri:) and the camel case local name "occurrenceStatus".  There is no requirement in the SDS for the form of the local name part of a term IRI - it could also be an opaque identifier such as a number.  Again, this is a design choice.  So it would be fine for the local name part of the IRI to be something like "os12345".
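To make the namespace-plus-local-name pattern concrete, here is a minimal sketch of CURIE expansion in Python (the prefix table contains just the one namespace discussed here):

```python
# Minimal sketch: a CURIE is an abbreviated IRI, namespace prefix + local name.
PREFIXES = {"dwciri": "http://rs.tdwg.org/dwc/iri/"}

def expand_curie(curie):
    """Expand an abbreviated term like 'dwciri:occurrenceStatus' to a full IRI."""
    prefix, local_name = curie.split(":", 1)
    return PREFIXES[prefix] + local_name

print(expand_curie("dwciri:occurrenceStatus"))
# -> http://rs.tdwg.org/dwc/iri/occurrenceStatus
```

Nothing in the SDS requires the local name to be human-readable; expand_curie("dwciri:os12345") would work just as well with an opaque local name.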

What is a label?

A label is a natural language string that is used by humans to recognize a resource.  In SKOS, labels are strings of Unicode characters in a given language.  The rules of SKOS declare that for each concept there is at most one preferred label per language, indicated by the property skos:prefLabel.  There may be any number of additional labels, such as "hidden labels" (skos:hiddenLabel) that are known to be associated with a concept, but that should not be suggested for use.  In SKOS, labels may have a language tag, although that is not required.
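The at-most-one-preferred-label-per-language rule is simple enough to check mechanically. A minimal sketch, with invented label data:

```python
from collections import Counter

def pref_label_rule_ok(pref_labels):
    """SKOS allows at most one skos:prefLabel per language tag for a concept.
    pref_labels: list of (label, language_tag) pairs for a single concept."""
    counts = Counter(lang for _, lang in pref_labels)
    return all(n <= 1 for n in counts.values())

print(pref_label_rule_ok([("extant", "en"), ("presente", "pt")]))   # True
print(pref_label_rule_ok([("extant", "en"), ("present", "en")]))    # False
```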

In SKOS, the intent is to create a mechanism that leads human users to discover the preferred label for a concept in the user's own language, while also specifying other non-preferred labels that users might be inclined to use on their own.

Based on TDWG precedent, the SDS specifies that English language labels must be included in the standards documents that describe vocabularies.  Labels in other languages are encouraged, but do not fall within the standard itself.  That makes adding those labels less cumbersome from the vocabulary maintenance standpoint.

What is a "value"?

The prevalent view in TDWG that there is one particular string that should serve as the "controlled value" for a term is alien to SKOS.  In SKOS, unique identification of concepts is always accomplished by IRIs. As a concession to current practice, in Section 4.5.4 the SDS declares that each controlled vocabulary term should be associated with a text string that is unique within that vocabulary.  The utility property rdf:value is used to associate that string with the term.  If people want to provide a string in a CSV file to represent a controlled vocabulary term, they can use this string as a value of a Darwin Core property such as dwc:occurrenceStatus.  However, if they want to be completely unambiguous, they can use the term IRI as a value of dwciri:occurrenceStatus. Using dwciri:occurrenceStatus instead of dwc:occurrenceStatus is basically a signal that the value is "clean" and that no disambiguation is necessary.
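To picture the difference between the two properties, here are two hypothetical data rows (the property names are real Darwin Core terms; the rows themselves are invented for illustration):

```python
# Two ways to record the same fact in a data file (rows invented for illustration).

# Option 1: dwc:occurrenceStatus carries the controlled text string from rdf:value.
# The consumer cannot be sure the string is clean without checking it.
verbatim = {"dwc:occurrenceStatus": "extant"}

# Option 2: dwciri:occurrenceStatus carries the term IRI, which is unambiguous.
unambiguous = {"dwciri:occurrenceStatus": "http://rs.tdwg.org/cv/status/extant"}

print(unambiguous["dwciri:occurrenceStatus"])
```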

The pieces of the controlled vocabulary

The Standards Documentation Specification breaks apart machine-readable controlled vocabulary metadata into several pieces.  One piece is the metadata that actually comprise the standard itself.  Those metadata are described in Sections 4.2.2, 4.4.2, 4.5, and 4.5.4.  In the case of the terms themselves, the critical metadata properties are rdfs:label (to indicate the label in English), rdfs:comment (to indicate the definition in English), and rdf:value (to indicate the unique text string associated with the term).  Because these values are part of the normative description of the vocabulary standard, their creation and modification are strictly controlled by processes described in the newly adopted Vocabulary Maintenance Specification.

In contrast, assignment of labels in languages other than English and translations of definitions into other languages falls outside the standards process.  Lists of multilingual labels and definitions are therefore kept in documents that are separate from the standards documents.  This makes it possible to easily add to these lists, or make corrections without invoking any kind of standards process.  The properties skos:prefLabel and skos:definition can be used to indicate the multilingual translations of labels and definitions respectively.

In addition to the preferred labels, it is also possible to maintain lists of non-preferred labels that have been used by some data providers, but which do not conform to the unique text string assigned to each term.  GBIF, VertNet, and other aggregators have compiled such lists from actual data in the wild.  The term skos:hiddenLabel can be used to associate these strings with the controlled value terms to which they have been mapped.

Controlled vocabulary metadata sources

For convenience, the machine-readable metadata in this post will be shown in RDF/Turtle, which is generally considered to be the easiest serialization for humans to read.  However, it may be serialized in any equivalent form -  developers may prefer a different serialization such as XML or JSON.  Here is an example of the metadata associated with a term from a controlled vocabulary designed to provide values for the Darwin Core term occurrenceStatus:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix skos: <http://www.w3.org/2004/02/skos/core#>.
@prefix dcterms: <http://purl.org/dc/terms/>.

<http://rs.tdwg.org/cv/status/extant> a skos:Concept;
     skos:inScheme <http://rs.tdwg.org/cv/status/>;
     rdfs:isDefinedBy <http://rs.tdwg.org/cv/status/>;
     dcterms:isPartOf <http://rs.tdwg.org/cv/status/>;
     rdf:value "extant";
     rdfs:label "extant"@en;
     rdfs:comment "The species is known or thought very likely to occur presently in the area, which encompasses localities with current or recent (last 20-30 years) records where suitable habitat at appropriate altitudes remains."@en.

These metadata would be included in the machine-readable form of the vocabulary standard document.  Here are metadata associated with the same term, but included in an ancillary document that is not part of the standard:

<http://rs.tdwg.org/cv/status/extant>
     skos:prefLabel "presente"@pt;
     skos:definition "Sabe-se que a espécie ocorre na área ou a sua ocorrência é tida como bastante provável, o que inclui localidades com registos atuais ou recentes (últimos 20-30 anos) nas quais se mantêm habitats adequados às altitudes apropriadas."@pt;
     skos:prefLabel "extant"@en;
     skos:definition "The species is known or thought very likely to occur presently in the area, which encompasses localities with current or recent (last 20-30 years) records where suitable habitat at appropriate altitudes remains."@en;
     skos:prefLabel "vorhanden"@de;
     skos:definition "Von der Art ist bekannt oder wird mit hoher Wahrscheinlichkeit angenommen, dass sie derzeit im Gebiet anwesend ist, und für die Art existieren aktuelle oder in den letzten 20 bis 30 Jahren erstellte Aufzeichnungen, in Lagen mit geeigneten Lebensräumen."@de.

These data provide the non-normative translations of the preferred term label and definition.  Here are some metadata that might be in a third document:

<http://rs.tdwg.org/cv/status/extant>
     skos:hiddenLabel "Reported";
     skos:hiddenLabel "Outbreak";
     skos:hiddenLabel "Infested";
     skos:hiddenLabel "present";
     skos:hiddenLabel "probable breeding";
     skos:hiddenLabel "Frecuente";
     skos:hiddenLabel "Raro";
     skos:hiddenLabel "confirmed breeding";
     skos:hiddenLabel "Present";
     skos:hiddenLabel "Présent ";
     skos:hiddenLabel "presence";
     skos:hiddenLabel "presente";
     skos:hiddenLabel "frecuente";
...   .

These are all of the known variants of strings that have been mapped to the term http://rs.tdwg.org/cv/status/extant.

For management purposes, these three documents will probably be managed separately.  The first list from the standards document will be changed rarely, if ever.  The second list will (hopefully) be added to frequently by human curators as the controlled vocabulary is translated into new languages.  The third list may be massive, and maintained by data-cleaning software as human operators of the software discover new variants in submitted data and assign those variants to particular terms in the controlled vocabulary.

Periodically, as the three lists are updated, they can be merged.  Given that the SDS is agnostic about the form of the machine-readable metadata, they could be ingested as JSON-LD and processed using purpose-built applications.  However, in the following examples, I'll load the metadata into an RDF triplestore and expose the merged graph via a SPARQL endpoint.  That is convenient because the merging can be accomplished without any additional processing of the data on my part.
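The reason merging requires no extra processing is that merging RDF graphs is just set union of triples, with the shared subject IRI tying the documents' statements together. A toy illustration, with triples reduced to tuples and abbreviated names:

```python
# Each document contributes triples about the same subject IRI.
standard = {("cv:extant", "rdf:value", "extant"),
            ("cv:extant", "rdfs:label", "extant@en")}
translations = {("cv:extant", "skos:prefLabel", "presente@pt")}
hidden_labels = {("cv:extant", "skos:hiddenLabel", "Présent ")}

# Loading all three into one triplestore amounts to taking the union of the sets:
merged_graph = standard | translations | hidden_labels
print(len(merged_graph))   # 4
```

A query against the merged graph then sees the normative, translated, and hidden-label statements as a single description of the term.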

Accessing the merged graph

I've loaded the metadata shown above into the Vanderbilt library's SPARQL endpoint, where it can be queried at http://rdf.library.vanderbilt.edu/sparql?view.  The following query can be pasted into the box to see what properties and values exist for http://rs.tdwg.org/cv/status/extant in the merged graph:

SELECT DISTINCT ?property ?value WHERE {
   <http://rs.tdwg.org/cv/status/extant> ?property ?value.
}

Unfortunately, the results include some garbage that Callimachus inserts as it manages the graph, but you can see that the metadata included in the standards document, translations document, and hidden label document all come up.

Clearly, nobody is actually going to want to paste queries into a box to use this information.  However, the data can be accessed by an HTTP GET call using CURL, Python, Javascript, jQuery, XQuery, or whatever flavor of software you like.  Here's what the query above looks like when URL encoded and attached to the endpoint IRI as a query string:


The query can be sent using HTTP GET by your favorite application to retrieve the same metadata as one sees in the paste-in box.  Currently, the Vanderbilt SPARQL endpoint only supports XML query results, but we are working on moving to using Blazegraph, which also supports JSON results.
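The request-building step can be sketched in Python using only the standard library (the endpoint address is the one given above; the code only assembles the request URL, which can then be fetched with any HTTP client):

```python
from urllib.parse import quote

ENDPOINT = "http://rdf.library.vanderbilt.edu/sparql"

query = """SELECT DISTINCT ?property ?value WHERE {
  <http://rs.tdwg.org/cv/status/extant> ?property ?value.
}"""

# URL-encode the query and attach it to the endpoint IRI as a query string.
request_url = ENDPOINT + "?query=" + quote(query, safe="")
print(request_url)
# The URL can then be fetched with urllib.request.urlopen(request_url), curl,
# or any other HTTP client; an Accept header selects the result serialization.
```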

Many people seem to be somewhat mystified about the purpose of a SPARQL endpoint and assume that it is some kind of weird Semantic Web thing.  If you fall into this category, you should think of a SPARQL endpoint as a kind of "programmable" web API.  Unlike a "normal" API where you must select from a fixed set of requests, you can request any imaginable result that can possibly be retrieved from a dataset.  That means that the request IRIs are probably going to be more complex, but once they have been conceived, the requests are going to be made by a software application, so who cares how complex they are?

Multilingual pick list for occurrenceStatus

I'm going to demonstrate how the multilingual data could be used to create a dropdown where a user selects the appropriate controlled value for the Darwin Core term occurrenceStatus when presented with a list of labels in the user's native language.  Here's the SPARQL query that lies at the heart of generating the pick list:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?label ?def ?term WHERE {
?term <http://www.w3.org/2000/01/rdf-schema#isDefinedBy> <http://rs.tdwg.org/cv/status/>.
?term skos:prefLabel ?label.
?term skos:definition ?def.
FILTER (lang(?label)='en')
FILTER (lang(?def)='en')
}
ORDER BY ASC(?label)

Here's what it does.  The triple pattern:
?term <http://www.w3.org/2000/01/rdf-schema#isDefinedBy> <http://rs.tdwg.org/cv/status/>.
restricts the results to terms that are part of the occurrenceStatus controlled vocabulary.  The triple patterns:
?term skos:prefLabel ?label.
?term skos:definition ?def.
bind preferred labels and definitions to the variables ?label and ?def.  The FILTER clauses:
FILTER (lang(?label)='en')
FILTER (lang(?def)='en')
restrict the labels and definitions to those that are language-tagged as English.  To change the requested language, a different language tag, such as 'pt' or 'de' can be substituted for 'en' by the software.  The last line tells the endpoint to return the results in alphabetical order by label.  The query is URL encoded and appended as a query string to the IRI of the SPARQL endpoint:
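A sketch of how software can substitute the language tag before sending the query (the template is the query above with a placeholder; the function name is my own invention):

```python
# Template: the pick-list query, with LANGTAG standing in for the language tag.
PICKLIST_QUERY = """PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?label ?def ?term WHERE {
?term <http://www.w3.org/2000/01/rdf-schema#isDefinedBy> <http://rs.tdwg.org/cv/status/>.
?term skos:prefLabel ?label.
?term skos:definition ?def.
FILTER (lang(?label)='LANGTAG')
FILTER (lang(?def)='LANGTAG')
}
ORDER BY ASC(?label)"""

def picklist_query(language_tag):
    """Return the pick-list query for one language tag, e.g. 'en', 'pt', or 'de'."""
    return PICKLIST_QUERY.replace("LANGTAG", language_tag)

print("lang(?label)='pt'" in picklist_query("pt"))   # True
```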


A page that makes use of this query is online at http://bioimages.vanderbilt.edu/pick-list.html?en.  The URL of the page ends in a query string that specifies the starting language for the page.  Currently en, pt, and de are available, although I'm hoping to add ko, zh-hans, zh-hant, and es soon.  The "guts" of the program are the Javascript code at http://bioimages.vanderbilt.edu/pick-list.js.  Lines 58 through 67 generate the query above and line 68 URL-encodes it.  Lines 71 through 78 perform the HTTP GET call to the endpoint, and lines 69 through 102 process the XML results when they come back and add them to the options of the pick list.  If you are viewing the page in a Chrome browser, you can see what's going on behind the scenes using the Developer tools that you can access from the menu in the upper right of the Chrome window ("More tools" --> "Developer tools").  Here's what the request looks like:

Here's what the response looks like:

You can see that the results are in XML, which makes the Javascript uglier than it would have to be.  When we get our new endpoint set up to return JSON, the Javascript will be simpler.  In line 101 of the Javascript code, the language-specific label gets inserted as the label of the option, but the actual value of the option is set as the IRI that is returned from the endpoint for that particular term.  Thus, the labels inserted into the option list vary depending on the selected language, but the IRI is language-independent.  In this demo page, the IRI is simply displayed on the screen, but in a real application, the IRI would be assigned as the value of a Darwin Core property.  In my opinion, the appropriate property would be dwciri:occurrenceStatus, regardless of whether the property is part of an RDF representation or a CSV file.  Using a dwciri: term implies that the value is a clean and unambiguous IRI.  Using dwc:occurrenceStatus would imply that the value could be any kind of string, with no implication that it was "cleaned" or even appropriate for the term.
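For anyone writing a client in something other than Javascript, here is a sketch of unpacking the XML results into (label, IRI) pairs using only the Python standard library. The sample XML is a hand-made minimal example of the SPARQL Query Results XML Format, not actual endpoint output:

```python
import xml.etree.ElementTree as ET

SRX = "{http://www.w3.org/2005/sparql-results#}"  # SPARQL results XML namespace

# Hand-made minimal example of the result format (values illustrative):
sample = """<?xml version="1.0"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <results>
    <result>
      <binding name="label"><literal xml:lang="en">extant</literal></binding>
      <binding name="term"><uri>http://rs.tdwg.org/cv/status/extant</uri></binding>
    </result>
  </results>
</sparql>"""

def picklist_options(xml_text):
    """Return (label, term IRI) pairs, one per result row, for building options."""
    options = []
    for result in ET.fromstring(xml_text).iter(SRX + "result"):
        bindings = {b.get("name"): b[0].text for b in result.iter(SRX + "binding")}
        options.append((bindings["label"], bindings["term"]))
    return options

print(picklist_options(sample))
# [('extant', 'http://rs.tdwg.org/cv/status/extant')]
```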

You may have noticed that the query also returns the term definition in the target language.  Originally, my intention was that it should appear as a popup when the user moused over the natural language label on the dropdown, but my knowledge of HTML is too weak for me to know how to accomplish that without some digging.  I might add that in the future.

"Data cleaning" application demonstration

I created a second demo page to show how data from the merged graph could be used in data cleaning.  That page is at http://bioimages.vanderbilt.edu/clean.html.  The basic query that it uses is:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX cvstatus: <http://rs.tdwg.org/cv/status/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?term WHERE {
?term rdfs:isDefinedBy cvstatus:.
 {?term skos:prefLabel ?langLabel. FILTER (str(?langLabel) = 'Común')}
 UNION {?term skos:hiddenLabel 'Común'. }
 UNION {?term rdf:value 'Común'. }
}

This query is a little more complicated than the last one.  The triple pattern
?term rdfs:isDefinedBy cvstatus:.
limits terms to the appropriate controlled vocabulary.  The rest of the query is composed of the UNION of three graph patterns.  The first pattern:
?term skos:prefLabel ?langLabel.
FILTER (str(?langLabel) = 'Común')
screens the string to be cleaned against all of the preferred labels in any language.  The second pattern:
?term skos:hiddenLabel 'Común'.
checks whether the string to be cleaned is included in the list of non-preferred labels that have been accumulated from real data.  The third pattern:
?term rdf:value 'Común'.
checks if the string to be cleaned is actually one of the preferred, unique text strings associated with any term.  In the Javascript that makes the page run (see http://bioimages.vanderbilt.edu/clean.js for details), the string to be cleaned is inserted into the query from a variable (i.e. a variable substituted in place of 'Común' in the query above.)
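A sketch of that substitution in Python (the function and placeholder names are mine, not from clean.js; quotes are escaped so the inserted string cannot break the SPARQL literal):

```python
# The cleaning query, with DIRTY standing in for the string to be cleaned.
CLEAN_QUERY = """PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX cvstatus: <http://rs.tdwg.org/cv/status/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?term WHERE {
?term rdfs:isDefinedBy cvstatus:.
{?term skos:prefLabel ?langLabel. FILTER (str(?langLabel) = 'DIRTY')}
UNION {?term skos:hiddenLabel 'DIRTY'. }
UNION {?term rdf:value 'DIRTY'. }
}"""

def clean_query(dirty_string):
    """Insert the string to be cleaned, escaping backslashes and quotes first."""
    escaped = dirty_string.replace("\\", "\\\\").replace("'", "\\'")
    return CLEAN_QUERY.replace("DIRTY", escaped)

print("'Común'" in clean_query("Común"))   # True
```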

In this particular case, the string 'Común' was mapped to the concept identified by http://rs.tdwg.org/cv/status/extant, so a match is made by the second of the three graph patterns (the hidden label one).  Here's what the page looks like when it is running with Developer tools turned on:

You can see that the response is a single value wrapped up in a bunch of XML.  Again, things would be simpler if the endpoint were set up to return JSON (soon, really!).  So in essence, the data cleaning function could be accessed by this "API call":


where the string to be cleaned is substituted for "Com%C3%BAn" (urlencoded).

As a practical matter, it would probably not be smart to actually build an application that relied on screening every record by making a call like this to the SPARQL endpoint.  Our endpoint just isn't up to handling that kind of traffic.  It would be more realistic to build an application that made one call at the start of each session that retrieved the whole array that mapped known strings to controlled value IRIs.  A query to accomplish that would be:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX cvstatus: <http://rs.tdwg.org/cv/status/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT DISTINCT ?term ?value WHERE {?term rdfs:isDefinedBy cvstatus:.
{?term skos:prefLabel ?langLabel. FILTER (str(?langLabel) = ?value)}
UNION {?term skos:hiddenLabel ?value. }
UNION {?term rdf:value ?value. }
}

Notice that it is basically the same as the previous query, except that the string to be cleaned is represented by the variable ?value instead of being a literal.  Here's what the HTTP GET IRI would look like:


If you sent the request header:
Accept: application/sparql-results+json
you would get JSON back instead of XML.  As I've said repeatedly, this won't currently work on the Vanderbilt library SPARQL endpoint, which does not yet support JSON results.  However, I ran the query on an installation of Blazegraph so that I could get JSON results.  You can see the results at this Gist:  https://gist.github.com/baskaufs/0b1193990bc7182e440ff238cac6e528
These results could be ingested by a data-cleaning application, which could then keep track of newly encountered strings.  A human would have to map each new string to one of the controlled value IRIs, but if those new mappings were added to the list of known variants, they would become available as soon as the graph on the SPARQL endpoint were updated.
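A sketch of what the application side of that looks like, with invented result rows in the shape the ?term/?value query returns:

```python
# Invented rows in the shape the query returns: (term IRI, known string) pairs.
rows = [
    ("http://rs.tdwg.org/cv/status/extant", "extant"),
    ("http://rs.tdwg.org/cv/status/extant", "Común"),
    ("http://rs.tdwg.org/cv/status/extant", "Présent "),
]

# One dictionary, built once per session, replaces per-record endpoint calls:
lookup = {value: term for term, value in rows}

def clean(raw_string):
    """Return the controlled-value IRI, or None if the string needs human mapping."""
    return lookup.get(raw_string)

print(clean("Común"))   # http://rs.tdwg.org/cv/status/extant
print(clean("comun"))   # None -> a newly encountered variant to map by hand
```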

Where from here?

The applications that I've shown here are just demos; developers who are better programmers than I am can incorporate similar code into their own applications to make use of the controlled vocabulary data that TDWG working groups will generate and expose in accordance with the SDS.  Clearly, work flows will need to be established, but once those are set up, there is the potential to automate most of what I've demonstrated here.  The raw data will live on the TDWG Github site, probably in the form of CSVs, and the transformation to a machine-readable form will be automated.  There could be anywhere from one to many SPARQL endpoints exposing those data - one feature of the SDS is that it will be possible for machines to discover the IRIs of vocabulary distributions, including SPARQL endpoints.  So if one endpoint goes down or is replaced, a machine will be able to automatically switch over to a different one.

[1] SKOS Simple Knowledge Organization System Reference.  W3C Recommendation. https://www.w3.org/TR/skos-reference/
[2] Internationalized Resource Identifiers (IRIs). RFC 3987. The Internet Engineering Task Force (IETF). https://tools.ietf.org/html/rfc3987

Thursday, March 30, 2017

Why I decided to vote for the union

This is my 32nd blog post and it's the first time I've written about my personal life.  Despite the technical nature of the previous 31 posts (well, maybe the Toilet Paper Apocalypse one doesn't count in that category), this is the hardest one for me to write.

For the past several weeks, my employer, Vanderbilt University, has been at battle with the Service Employees International Union (SEIU) and a group of non-tenure track faculty who are trying to organize a union.  I have had very mixed emotions about this.  On the one hand, after teaching at Vanderbilt for almost eighteen years, I'm relatively secure in my job and it wasn't clear to me that there would be any particular advantage for me in being part of a union (I'm a Senior Lecturer, one of the ranks of non-tenure track faculty included in the unionization proposal).  I've spent a considerable amount of time during those weeks trying to inform myself about what it would mean to be part of a faculty union.  I've been asked to be part of a Faculty Senate panel discussing the unionization proposal this afternoon, and spent time last night trying to decide what I would say during the three minutes that I've been allocated to explain my position on the issue.  As part of my deliberations last night, I spent a couple of hours reading old emails from my first two years teaching at Vanderbilt.  I'm an obsessive email filer.  I have most of the emails I've received since 1995 filed in topical and chronological folders, so it didn't take me long to find the relevant emails.

It's hard for me to describe what the experience of reading those emails was like.  Although I've kept all of those emails for years, I have avoided ever reading them again because I knew the experience would be disturbing to me.  It was sort of like ripping a scab off of a mostly healed wound, but that doesn't capture the intensity of the emotions that it raised.

General science class in 1983, Ridgeway, Ohio


I grew up in a rural part of Ohio in a conservative Republican family that was always very anti-union.  So that has predisposed me to have a negative outlook on unions.  After I graduated with my undergraduate degree in 1982, I spent the next ten years teaching high school.  I taught in a variety of schools: a rural school in Ohio for one year, a public school in Swaziland (Africa) for three years, and a school in rural/suburban Tennessee for six years.  The classes I taught included chemistry, physics, biology, physical science, general science, math, and computer programming.
Physical science class in 1985, Mzimpofu, Swaziland
Despite the variety of locations, the schools actually had a lot in common.  When I arrived at each of those schools, they had little or no science equipment and I spent years trying to figure out how to get enough equipment to teach my lab classes in a way that was engaging to the students.  At those schools, I served in a variety of roles, including department chair, chair of the faculty advisory committee, student teacher supervisor, choir director, and adviser of research and environmental clubs.
Physics class in 1991, Kingston Springs, Tennessee
By the end of my time teaching high school, I had amassed a number of teaching credentials and awards, including scoring in the 99th percentile for professional knowledge on the National Teacher Exam, achieving the highest level (Role Model Teacher) on the grueling Tennessee Career Ladder certification program, and being named Teacher of the Year at the school level in 1990 and on the county level in 1992.

In 1993, I decided that I wanted to take on a different challenge: entering a Ph.D. program in the biology department at Vanderbilt.  Over the next six years, I took graduate classes, carried out research, served as a teaching assistant in the biology labs for ten semesters, and had a major role in managing the life of our family while my wife worked towards tenure at a nearby university.  By August 1999, I had defended my dissertation and was on the market looking for a teaching job on the college level.

Being a part-time Lecturer at Vanderbilt

In the fall of 1999, I was writing papers from my dissertation and trying to figure out how to get a job within commuting distance of my wife's university.  By that time, she had tenure, which complicated the process.  At the last minute, there was an opening for a half-time lecturer position, teaching BIOL 101, a non-majors biology service course for education majors in Vanderbilt's Peabody College of Education.  It seemed like this was the ideal class for me with my background teaching high school for many years.  It was rather daunting because I got the job a few days before the semester started.  I had to scramble to put together a syllabus and try to keep up with planning class sessions, developing labs, and writing lectures.  But I'd done this three times before when I had started at my three different high schools, so I knew I could do it if I threw myself into the job.

I had naively assumed that my job was to teach these students biology, uphold the academic standards that I had always cared about, and enforce the College rules about things like class attendance.  It is striking to me as I look through the emails from that semester how many of them involved missed classes, and complaints about grades and the workload.  Here's an example:
Professor Baskauf,
I am not able to be in class tomorrow because my
flight leaves a 2:30 pm.  Both of my friday classes were
cancelled, so I am going home tomorrow.  However, the later
flights were too late for my parents in that my flight is a
long one and the airport is 2hrs from my house.  I
apologize for missing class.
Within a month, it was clear that the students were unhappy about my expectations for them.  I had a conversation with my chair about the expectations for the class, which I was beginning to think must differ radically from what I had anticipated.  Here's an email I got on October 25 from my department chair:
I had a meeting with the Peabody folks with regard to BSCI 101 and what the
purpose of the course is.  It appears that they have had little or no
contact with any of the instructors for  the course for several years and
really have no idea of the current course content and organization.  At some
point, I'd like to set up a meeting with you, the relevant Peabody folks,
and myself to make sure we all understand why we offer BSCI 101 at all; and,
since it is a service course, to make sure we are all on the same page with
regard to content and structure.  From our discussion last Friday, I think
what they expect is not much different from what we would do, but is perhaps
a little different from the course as it evolved in the Molecular Biology
Department.  I'd like to send them a syllabus to look over, then we'll try
to set up a time for discussion.  Could you get me a copy of the syllabus
you're using this year?
I tried to adjust the content and format of the course to make it more relevant to education majors and pushed through to the end of the semester.  I had a number of discussions with my TA about how we could work to make the labs more engaging and made plans on how I was going to improve the course in the spring semester, which I had already been scheduled to teach.

On January 3, 2000, I found out from my chair that Dean Infante had examined my student evaluations and decided to fire me.  He had never been in my class (actually, no administrator or other faculty member had ever set foot in my class) and as far as I know, he had no idea what I had actually done in the class.  He just decided that my student evaluation numbers were too low.  I was a "bad" teacher and Vanderbilt wasn't going to let me teach that class again.

With my past record of 15 years of excellent teaching, this was a crushing blow to me emotionally.  I'm normally a really optimistic person, but on that day I had a glimmering of what it must feel like to be clinically depressed.  I could hardly make myself get out of bed.  In addition to the emotional toll, I now had two little kids to help support - we had been planning on the income from my teaching and we were also looking at losing our day care at Vanderbilt if I were no longer employed.

Fortunately, my department chair went to bat for me.  Ironically, the appeal that he made to the dean was NOT that I was hard working, or innovative, or that I had high standards for my students.  He took my student evaluation numbers from when I was a TA in the bio department to the Dean's office and convinced them that those numbers showed that the new student evaluation numbers were an outlier.  Although I didn't know it at the time, I was apparently on some kind of probation - the department was supposed to be monitoring me to make sure that I wasn't still being a "bad" teacher.

In the second semester that I taught the 101 class, I took extreme precautions to be very accessible to students.  I emailed all of the students who didn't do well on tests and asked them if they wanted to meet with me.  We did a self-designed project to investigate what it took to build a microcosm ecosystem.  We went on an ecology field trip to a local park and an on-campus field trip to visit research labs that were using zebrafish and fruit flies as model organisms to study genetics and development.  I think the students were still unhappy with their grades and my expectations for workload, but apparently their evaluations were good enough for me to be hired as a full-time Lecturer in the fall.

Being a full-time Lecturer at Vanderbilt

The faculty member who had previously been the lab coordinator for the Intro to Biological Sciences labs for majors was leaving that position to take a different teaching job in the department.  The chair of the new Biological Sciences Department (formed by the merger of my biology department and the molecular biology department) contacted me about "going into the breach" as he phrased it, and taking over as lab coordinator.  I had actually been a TA five times for the semester of that course dealing with ecology and evolution (my specialty).  So I was well acquainted with that teaching lab.  Having had no success in getting a tenure-track job at any college within commuting distance, I took the offer of a one year appointment, assuming that I could do the job until I got a better position somewhere else.

When I started the job, I really had very little idea what my job responsibilities were supposed to be.  I was supposed to "coordinate labs".  The job expectations were never communicated to me beyond that.  Unfortunately, the focus of the course during my first semester was the half of the course that dealt with molecular biology, which I had never studied and for which I had never served as a TA.  Things did not go well.  For starters, the long-term lab manager discovered that she had cancer and missed long stretches of work for her treatments.  Fortunately, I was allowed to hire a temporary staff person with a master's degree related to molecular biology.  I spent much of the semester in the prep room with her trying to figure out why our cloning and other experiments weren't working as they should.  A major part of my job responsibilities was to supervise the TAs and manage the grades and website for both my class and the lecture part of the course.  I spent almost no time in the classroom with the students - I wasn't aware that that was actually supposed to be a part of the job.

At the end of the first semester, I was relieved to have managed to pull off the whole series of labs with some degree of success and was looking forward to the ecology and evolution semester, with which I was very familiar.  However, I was shocked to discover that I was actually going to be subject to student evaluations again.  Apparently, there was some college rule that everyone who is in a faculty position has to be evaluated by students.  In January, I ran into my chair and he commented that we would have to get my student evaluations up in the coming semester.  Oh, and by the way, the grades were also too high for the course.  I was going to have to increase the rigor of the course to bring them down to what was considered a reasonable range for lab courses in the department.

At that point in time, the lab grades were included with the lecture grades to form a single grade for the course.  The tenure track faculty involved in the lecture part of the course decided that a range of B to B- was a reasonable target for the lab portion of the course, so it fell on me to structure the grading in the course in a way that the grades would fall into that range.  At that time, the largest component of the grade was lab reports, which I found to be graded by the TAs in a very arbitrary and capricious manner.  In the spring semester, I replaced lab reports with weekly problem sets, and replaced lightly-weighted occasional quizzes with regular lab tests that formed half of the grade of the course.  I made the tests difficult enough to lower the grade to the target range, but it was clear to the students that I was to blame for creating the tests that were killing their GPAs.

In the second semester, I made it a point to be in the lab during every section to ask students if they had questions or needed help.  That did a lot to improve the students' impressions of me as compared to the fall.  But in late March, I was blindsided by another unanticipated event: fraternity formals.  Students had previously asked me to excuse them from lab on Fridays to leave early for spring break or to go out of town for family visits.  I had been consistently enforcing the College's policy on excused absences, which said unequivocally "conflicts arising from personal travel plans or social obligations are not regarded as occasions that qualify as an excused absence" and made them take zeros for their work on days when they missed class for those reasons.  Obviously students were not happy about this, but the situation came to a head when students started asking to reschedule class to go out of town for fraternity formals.  I had gone to a school that didn't have fraternities, and I'd never heard of a fraternity formal.  When I found out that fraternity formals involved skipping school to go out of town to go to a party, I told them that I couldn't consider that an excused absence under the college's policy on attendance.  The students were furious.  They had spent money on plane tickets and tuxedos and now I was forcing them to choose between class and going to their party.  An exchange with two of the students ended with us walking over to Dean Eickmeier's office, where he confirmed that my decision was consistent with college policy.  In some cases, students opted to come to class.  One student brought me an apparently fake medical excuse.  Others took the zero and went to the party.  One student said that I "did not have any compassion that a normal human would have" and threatened that he was going to write a scathing article about me in the Hustler (the student newspaper).  Another student said that he was going to "get me" on the evaluations.  
Alarmed, and given my previous bad experience with student evaluations, I documented the incidents in an email to my chair.

Despite these bumps in the road, my evaluations were better in the spring semester, and I was anticipating being reappointed again for 2001-02.  I did request that my chair include in the request for my reappointment a copy of my email detailing the incidents involving the unhappy students with excused absences.  On May 23rd, I received this ominous email from my chair:
 I have received a response from Dean Venable to my recommendation for your reappointment.  He has agreed to reappoint you, but he has placed some conditions on the reappointment that we need to discuss.  I would like to do that today, if possible.  I have a flexible schedule until mid-afternoon.
I went to meet with the chair, and he gave me a copy of the letter from Dean Venable.  You can read it yourself:
Once again, a Dean sitting in his office poring over computer printouts had decided that I was a bad teacher based solely on student evaluations.  No personal discussion with me about the class, no classroom observations ever by any administrator, no examination of the course materials or goals.  Worse yet, he chose to cherry-pick the written comments to emphasize the negative ones.  By my count, 11% of students made negative comments about my teaching style, while 24% made positive comments about it.  Here are some of the comments from the spring semester Dean Venable chose to ignore:

Dr. Baskauf is very good at instructing the class.  He is easy to understand and teaches the material well so that we understand what he is saying.  I think that Dr. Baskauf would be a better help to the lecture however, since the lecture class is more important to students and worth 3 hours.  I wish I could have had him for a professor in lecture as well as lab.
Dr. Baskauf really puts forth an aire of knowledge.  He was always willing to help with any problems that we were having with our labs, in and out of class, while not just telling us the answers, but nudging us along while we figured it out for ourselves.  Whats more important is that he seems to really love the material and teaching the class which makes the experience that much better and makes it much easer to learn.
Dr. Baskauf was always well prepared for the lab.  This was very helpful because he could always give a concise overview that I could understand.  The powerpoint presentations were a great idea as they really helped me to follow his instructions better.  He is very friendly and always willing to help when I had questions.
Bascauf is very good at explaining and communicating with the class.  he is very helpful as well.
Dr. Baskauf made this lab one of the most enjoyable and challenging classes I have yet taken at Vanderbilt.  He was especially willing to help students to better understand the value of what they were learning.
Baskauf was always very well prepared for lab.  He obviously put a lot of work into setting everything up.  He always had very clear tutorials and lectures.
Dr. Baskauf created a challenging and stimulating environment for learning about biological experiments.  Although many aspects of the lab were tough, Dr. Baskauf was able to understand how difficult is was for the class as a whole.  He opened himself up to adapting to our needs.  His approach to teaching is something I have yet to experience elsewhere at Vanderbilt.  I hope to have him as an instructor at some further point in my career here at Vanderbilt.
The sentence for my crime was:
  • to be mentored by a senior faculty member
  • to work with the Center for Teaching to improve my lecturing style and interpersonal skills, and
  • to be subjected to an extra mid-semester student evaluation

Oh, yes - and no pay raise.  These were all necessary to bring me "up to the teaching standards required by the College", with the threat that I would be fired if I didn't improve.  

So, for a second time, I had been flagged with the scarlet letter of "bad teacher" based solely on student evaluations.  Again, I was angry at the injustice and incredibly demoralized.  I really wanted to just quit at Vanderbilt, but I really needed the job.  So I swallowed my pride and completed my sentence.  I was "mentored" the next year by Bubba Singleton (later my highly supportive department chair), who was extremely helpful in helping me figure out ways to structure the class so that I could maintain my academic standards while also keeping students happy enough that I didn't get fired again.  

Life as a Senior Lecturer

Ever since that time, I've maintained an Excel spreadsheet with a graph of my student evaluations, which I check each year to ensure that I'm not heading into the danger zone.  Despite the fact that I don't really "teach" in the usual sense (most of my work involves curriculum development, wrangling online course management systems, recruiting and mentoring TAs, supervising staff, and handling administrative problems), I've managed to keep the student evaluation numbers at an acceptable level.  I've incorporated open-ended research projects into the course (which, by the way, caused my evaluation numbers to plunge the year they were introduced), continued to introduce the latest educational technology into the course, and continued to revise and update labs as biology evolves.  In 2002, I was promoted to Senior Lecturer (which comes with a three-year appointment) and in 2010 I received the Harriet S. Gilliam Award for Excellence in Teaching by a Senior Lecturer. 

The Harriet S. Gilliam Award silver cup, with the letter from Dean Venable that I store inside it

So I think that most people at Vanderbilt now think I'm a good teacher.  But student evaluations and the threat of being fired based on student evaluations hangs over me like a Sword of Damocles every time I'm up for re-appointment.  

After I wrote this, I seriously considered deleting the whole thing.  Even after all of these years, for a veteran teacher, being fired and being sentenced to remedial help for "bad teaching" is an embarrassment.  I feel embarrassed, even though I know that I was just as good a teacher at that time as I was before and after.  But I think it's important for people to know what it feels like to be treated unjustly in a situation where there is a huge power imbalance - where a powerful person you've never met passes judgment on your teaching based on student evaluations rather than by observing your classroom.

Do non-tenure track faculty at Vanderbilt need a union?

When the unionization question came up, I have to say that I was pretty skeptical about it.  When I taught high school, I was a member of the National Education Association, which functioned something like a union, but I had mostly thought of it as a professional association.  My upbringing predisposed me to thinking negatively about unions.  Given my current relatively stable position, it wasn't clear to me that it was in my interest to be part of a union.

However, as I started investigating where non-tenure track faculty stand in the College of Arts and Sciences at Vanderbilt, it was clear to me that we are actually just as powerless as I had always considered us to be.  Although non-tenure track faculty constitute 38% of the faculty of A&S, they are banned from election to Faculty Senate and have been improperly disenfranchised from voting for Faculty Senate for at least ten years. (I have never been given the opportunity to vote.) See Article IV, Section 2 of the CAS Constitution for details. Non-tenure track faculty are not eligible for election to the A&S Faculty Council, nor are they allowed to vote for Faculty Council representatives (Article II, Section 1, Part B).  At College-wide faculty meetings, full-time Senior Lecturers have a vote, but all other non-tenure track faculty are allowed to vote on an issue only when the Dean decides that the matter is related to their assigned duties (Article I, Section I, Part C).  The Provost's Office insists that non-tenure track faculty participate in University governance through participation in University Committees, but my analysis shows that appointment to University Committees is greatly skewed towards tenure-track faculty, with only three non-tenure track faculty actually sitting on those committees (one each on Religious Affairs, Athletics, and Chemical Safety).  The Shared Governance Committee, charged last fall with suggesting changes and improvements in the future, does not include a single non-tenure track member of A&S (only one non-tenure track member at all - from the Blair School of Music).  We really have virtually no voice in the governance of the College or the University.  

We also have no influence over the process of our re-appointment, or how much we are paid.  Prior to our reappointment, we submit a dossier.  Then months later, we either do or don't get a reappointment letter with a salary number and a place to sign.  If we don't like the number, we can quit.  In a previous reappointment cycle, I suggested to my chair that it would be fair for me to be paid what I would receive if I were to leave Vanderbilt and teach in the Metro Nashville public school system.  At that time, with a Ph.D. and the number of years of teaching experience that I had, my salary in the public schools would have been about $10k more than what I was getting at Vanderbilt.  I think that at that time I actually still had a valid Tennessee high school teaching license, so it would have been a real possibility for me.  They did give me something like an additional 1% pay increase over the base increase for that year, but I've never gotten parity with what I would receive as a public high school teacher (let alone what I would earn teaching at a private school).  That's particularly ironic, given that the number of students and teaching assistants I supervise has gone up by about 60% since I started in the job (with no corresponding increase in pay), to over 400 students and 12 to 20 TAs per semester, plus three full-time staff.  This is a much greater responsibility than I would have if I were teaching high school.  The reason that I was given for not being granted parity in pay with the public schools was that the college couldn't afford that much.  I like my job and I enjoy working with my students and TAs, so I probably won't quit to go back to teaching high school.  But it seems really unfair to me and I'm powerless to change the situation.

Currently, I'm up for re-appointment with a promotion to Principal Senior Lecturer.  That might result in a pay increase, but there is no transparency about the decision-making process in the Dean's office.  Some day later this year, I'll probably get a letter offering me an appointment with some salary number on it.  Or not.  

The Provost's Office website has a list of frequently asked questions whose answers insinuate that the union will probably lie to us, and may negotiate away our benefits without consulting with us.  I will admit that I was a bit concerned about the negative effects of unionization when the issue first came up.  However, I contacted some senior lecturers at Duke and University of Chicago to ask them how the contract negotiating process was going at their schools.  It was clear to me that the negotiating teams at those schools (composed of  non-tenure track faculty from the schools themselves) were very attuned to the concerns of the colleagues they were representing, and that they had no intention of negotiating away important benefits that they already had.  Mostly, it just looked like a huge amount of time and work on their part.  But it definitely was not the apocalypse - for most people at those schools, life goes on as normal.

Now that I'm a relatively high-ranking non-tenure track faculty member with reasonable job security, it seems unlikely that I will personally derive a large benefit from unionization.  But given that I have virtually no influence or negotiating power within the university, it is very hard for me to see what I have to lose by being part of the union.  More importantly, as I re-read the emails from my first painful years of teaching at Vanderbilt, it was evident to me that part-time faculty and faculty with one year appointments are particularly vulnerable to the whims of upper-level administration.  I have always been fortunate to have department chairs who supported me vigorously and went to bat for me when I needed it.  But there is no guarantee that will happen in the future, or in other departments.  For whatever faults there may be in having a union, it will provide a degree of protection and transparency that has been completely lacking for non-tenure track faculty at Vanderbilt.  And that's the primary reason why, if offered the chance, I'm planning to vote "yes" on unionization.

Sunday, March 26, 2017

A Web Service with Content Negotiation for Linked Data using BaseX


Last October, I wrote a post called Guid-O-Matic Goes to China.  That post described an application I wrote in Xquery to generate RDF in various serializations from simple CSV files.  Those of you who know me from TDWG are probably shaking your heads and wondering "Why in the world is he using Xquery to do this?  Who uses Xquery?"

The answer to the second question is "Digital Humanists".  There is an active Digital Humanities effort at Vanderbilt, and recently Vanderbilt received funding from the Andrew W. Mellon Foundation to open a Center for Digital Humanities.  I've enjoyed hanging out with the digital humanists and they form a significant component of our Semantic Web Working Group.  Digital Humanists also form a significant component of the Xquery Working Group at Vanderbilt.  Last year, I attended that group for most of the year, and that was how I learned enough Xquery to write the application.

That brings me to the first question (Why is he using Xquery?).  In my first post on Guid-O-Matic, I mentioned that one reason why I wanted to write the application was because BaseX (a freely available XML database and Xquery processor) included a web application component that allows Xquery modules to support a BaseX RESTXQ web application service.  After I wrote Guid-O-Matic, I played around with BaseX RESTXQ in an attempt to build a web service that would support content negotiation as required for Linked Data best practices.  However, the BaseX RESTXQ module had a bug that prevented using it to perform content negotiation as described in its documentation.  For a while I hoped that the bug would get fixed, but it became clear that content negotiation was not a feature that was used frequently enough for the developers to take the time to fix the bug.  In December, I sat down with Cliff Anderson, Vanderbilt's Xquery guru, and he helped me come up with a strategy for a workaround for the bug.  Until recently, I was too busy to pick up the project again, but last week I was finally able to finish writing the functions in the module to run the web server.

How does it work?

Here is the big picture of how the Guid-O-Matic web service works:
A web-based client (browser or Linked Data client) uses HTTP to communicate with the BaseX web service.  The web service is an Xquery module whose functions process the URIs sent from the client via HTTP GET requests.  It uses the Accept: header to decide what kind of serialization the client wants, then uses a 303 redirect to tell the client which specific URI to use to request a specific representation in that serialization.  The client then sends a GET request for the specific representation it wants.  The web service calls Guid-O-Matic Xquery functions that use data from the XML database to build the requested documents. Depending on the representation-specific URI, it serializes the RDF as either XML, Turtle, or JSON-LD.  (Currently, there is only a stub for generating HTML, since the human-readable representation would be idiosyncratic depending on the installation.)  In the previously described versions of Guid-O-Matic, the data were retrieved from CSV files.  In this version, CSV files are still used to generate the XML files using a separate script.  But those XML files are then loaded into BaseX's built-in XML database, which is the actual data source used by the scripts called by the web service.  In theory, one could build and maintain the XML files independently without constructing them from CSVs.  One could also generate the CSV files from some other source as long as they were in the form that Guid-O-Matic understands.
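The two-step exchange described above can be sketched in Python, with a toy `server` function standing in for the BaseX web service.  The names, the sample data, and the simple Turtle-or-HTML choice are illustrative only, not Guid-O-Matic's actual code:

```python
# Toy sketch of the redirect-then-fetch exchange; server() stands in
# for the BaseX web service.  Illustrative names and data only.

def server(local_name, accept):
    if "." not in local_name:
        # Extensionless URI denotes a non-information resource:
        # answer with a 303 redirect to a representation-specific URI.
        ext = "ttl" if "text/turtle" in accept else "htm"
        return 303, {"Location": local_name + "." + ext}
    # URI with an extension denotes a document: serve it directly.
    return 200, {"body": "<serialization of " + local_name + ">"}

# A Linked Data client asks for Turtle about the temple:
status, headers = server("Lingyansi", "text/turtle")       # 303 redirect
status2, doc = server(headers["Location"], "text/turtle")  # fetch Lingyansi.ttl
```

The client makes two requests: the first gets the 303 with a `Location` header, and the second dereferences that representation-specific URI.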

Trying it out

You can try the system out for yourself to see how it works by following these steps.

  1. Download and install BaseX. BaseX is available for download at http://basex.org/.  It's free and platform independent.  I won't go into the installation details because it's a pretty easy install.
  2. Clone the Guid-O-Matic GitHub repo.  It's available at https://github.com/baskaufs/guid-o-matic.  
  3. Load the XML files into a BaseX database.  The easiest way to do this is probably to run the BaseX GUI.  On Windows, just double-click on the icon on the desktop.  From the Database menu, select "New..." Browse to the place on your hard drive where you cloned the Guid-O-Matic repo, then Open the "xml-for-database" folder.  Name the database "tang-song" (it includes the data described in the Guid-O-Matic Goes to China post).  Select the "Parse files in archives" option.  I think the rest of the options can be left at their defaults.  Click OK.  You can close the BaseX GUI.  
  4. Copy the restxq module into the webapp directory of BaseX.  This step requires you to know where BaseX was installed on your hard drive.  Within the BaseX installation folder, there should be a subfolder called "webapp".  Within this folder, there should be a file with the extension ".xqm", probably named something like "restxq.xqm".  In order for the web app to work, you either need to delete this file, or change its extension from ".xqm" to something else like ".bak" if you think there is a possibility that you will want to look at it in the future.  Within the cloned Guid-O-Matic repo find the file "restxq-db.xqm" and copy it to the webapp folder.  This file contains the script that runs the server.  You can open it within the BaseX GUI or any text editor if you want to try hacking it.  
  5. Start the server. Open a command prompt/command window.  On my Windows computer, I can just type basexhttp.bat to launch the batch file that starts the server.  (I don't think that I had to add the BaseX/bin/ folder to my path statement, but if you get a File Not Found error, you might have to navigate to that directory first to get the batch file to run.)  For non-Windows computers there should be another script named basexhttp that you can run by an appropriate method for your OS.  See http://docs.basex.org/wiki/Startup for details.  When you are ready to shut down the server, you can do it gracefully from the command prompt by pressing Ctrl-C.  By default, the server runs on port 8984 and that's what we will use in the test examples.  If you actually want to run this as a real web server, you'll probably have to change it to a different port (like port 80).  See the BaseX documentation for more on this.
  6. Send an HTTP GET request to the server. There are a number of ways to cause client software to interact with the server (lots more on this later). The easiest way is to open any web browser and enter http://localhost:8984/Lingyansi in the URL box.  If the server is working, it should redirect to the URL http://localhost:8984/Lingyansi.htm and display a placeholder web page.
If you have successfully gotten the placeholder web page to display, you can carry out the additional tests that I'll describe in the following sections.

Image from the W3C Interest Group Note https://www.w3.org/TR/cooluris/ © 2008 W3C 

What's going on?

The goal of the Guid-O-Matic web service is to implement content negotiation in a manner consistent with the Linked Data practice described in section 4.3 of the W3C Cool URIs for the Semantic Web document.  The purpose of this best practice is to allow users to discover information about things that are denoted by URIs, but that are not documents that can be delivered via the Web.  For example, we could use the URI http://lod.vanderbilt.edu/historyart/site/Lingyansi to denote the Lingyan Temple in China.  If we put that URI into a web browser, it is not realistic to expect the Internet to deliver the Lingyan Temple to our desktop. According to the convention established in the resolution to the httpRange-14 question, when a client makes an HTTP GET request to dereference the URI of a non-information resource (like a temple), an appropriate response from the server is to provide an HTTP 303 (See Other) response code that redirects the client to another URI that denotes an information resource (i.e. a document) that is about the non-information resource.  The user can indicate the desired kind of document by providing an HTTP Accept: header that provides the media type they would prefer.  So when a client makes a GET request for http://lod.vanderbilt.edu/historyart/site/Lingyansi, along with a request header of Accept: text/html it is appropriate for the server to respond with a 303 redirect to the URI http://lod.vanderbilt.edu/historyart/site/Lingyansi.htm, which denotes a document (web page) about the Lingyansi Temple.  

There is no particular convention about the form of the URIs used to represent the non-information and information resources, although it is considered to be a poor practice to include file extensions in the URIs of non-information resources.  You can see one pattern in the diagram above.  Guid-O-Matic uses the following convention: if a URI is extensionless, it is assumed to represent a non-information resource.  Each non-information resource included in the database can be described by a number of representations, i.e., documents having differing media types.  The URIs denoting those documents are formed by appending a file extension to the base URI of the non-information resource.  The extension used is one that is standard for that media type.  Many other patterns are possible, but using a pattern other than this one would require different programming than what is shown in this post.

The following media types are currently supported by Guid-O-Matic and can be requested from the web service:

Extension  Media Type
---------  -----------
.ttl       text/turtle
.rdf       application/rdf+xml
.json      application/ld+json or application/json
.htm       text/html

The first three extensions correspond to serializations of RDF and would be requested by Linked Data clients (machines), while the fourth is a human-readable representation that would typically be requested by a web browser.  As the web service is currently programmed, requesting any media type other than the five listed above results in redirection to the URI for the HTML file.  
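The table amounts to a simple lookup with an HTML fallback.  Here is a Python sketch of that mapping; the names are mine, not from the Guid-O-Matic code:

```python
# The extension table above as a lookup, with the HTML fallback for
# unrecognized media types.  Illustrative names only.

EXTENSION_FOR = {
    "text/turtle": "ttl",
    "application/rdf+xml": "rdf",
    "application/ld+json": "json",
    "application/json": "json",
    "text/html": "htm",
}

def extension_for(media_type):
    # Any media type other than the five supported ones falls back to .htm
    return EXTENSION_FOR.get(media_type, "htm")
```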

There are currently two hacks in the web service code that recognize two special URIs.  If the part of the URI after the domain name ends in "/header", the web server simply echoes back to the client a report of the Accept: header that was sent to it as part of the GET request.  You can try this by putting http://localhost:8984/header in the URL box of your browser.  For Chrome, here's the response I got:

text/html application/xhtml+xml application/xml;q=0.9 image/webp */*;q=0.8

As you can see, the value of the Accept: request header generated by the browser is more complicated than the simple "text/html" that you might expect [1].  
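For illustration, here is a minimal Python sketch of how a server might parse such a header (comma-separated in the raw HTTP request, though the echo above shows it space-separated) and pick the best supported media type.  This is standard HTTP q-value syntax, not the actual Guid-O-Matic parsing code:

```python
# Minimal Accept-header parsing with q-values.  Illustrative sketch only.

def preferred_type(accept, supported):
    """Return the supported media type with the highest q-value, or None."""
    prefs = []
    for part in accept.split(","):
        fields = part.strip().split(";")
        mtype = fields[0].strip()
        q = 1.0  # q defaults to 1 when no parameter is given
        for param in fields[1:]:
            name, _, value = param.strip().partition("=")
            if name == "q":
                q = float(value)
        prefs.append((q, mtype))
    for q, mtype in sorted(prefs, key=lambda p: -p[0]):  # highest q first
        if mtype == "*/*":
            return supported[0]  # wildcard: server's preferred type
        if mtype in supported:
            return mtype
    return None

# Chrome's header, comma-separated as sent on the wire:
chrome = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"
```

With a browser header like Chrome's, a server that only supports Turtle would still match via the `*/*` wildcard.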

In production, one would probably want to delete the handler for this URI, since it's possible that one might have a resource in the database whose local name is "header" (although probably not for Chinese temples!).  Alternatively, one could change the URI pattern to something like /utility/header that wouldn't collide with the pattern used for resources in the database.

The other hack is to allow users to request an RDF dump of the entire dataset.  A dump is requested using a URI ending in /dump along with a request header for one of the RDF media types.  If the header contains "text/html" (a browser), Turtle is returned.  Otherwise, the dump is in the requested media type.  The Chinese temple dataset is small enough that it is reasonable to request a dump of the entire dataset, but for larger datasets where a dump might tie up the server, it might be desirable to delete or comment out the code for this URI pattern.

Web server code

Here is the code for the main handler function:

declare
  %rest:path("/{$full-local-id}")
  %rest:header-param("Accept","{$acceptHeader}")
  function page:content-negotiation($acceptHeader,$full-local-id)
  {
  if (contains($full-local-id,"."))
  then page:handle-repesentation($acceptHeader,$full-local-id)
  else page:see-also($acceptHeader,$full-local-id)
  };

The %rest:path annotation performs the pattern matching on the requested URI.  It matches any local name that follows a single slash, and assigns that local name to the variable $full-local-id.  The %rest:header-param annotation assigns the value of the Accept: request header to the variable $acceptHeader.  These two variables are passed into the page:content-negotiation function.

The function then chooses between two actions depending on whether the local name of the URI contains a period (".") or not.  If it does, then the server knows that the client wants a document about the resource in a particular serialization (a representation) and it calls the page:handle-repesentation function to generate the document.  If the local name doesn't contain a period, then the function calls the page:see-also function to generate the 303 redirect.  
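The routing decision reduces to a single test on the local name.  Here is an illustrative Python stand-in for the XQuery dispatch (the return strings just name the branch taken):

```python
def dispatch(full_local_id: str) -> str:
    """Decide which handler a requested local name should go to."""
    if "." in full_local_id:
        # The client asked for a specific representation, e.g. "Lingyansi.ttl"
        return "handle-representation"
    # A bare resource URI, e.g. "Lingyansi", gets a 303 redirect
    return "see-also"

print(dispatch("Lingyansi.ttl"))  # handle-representation
print(dispatch("Lingyansi"))      # see-also
```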

Here's the function that generates the redirect:

declare function page:see-also($acceptHeader,$full-local-id)
{
  if(serialize:find-db($full-local-id))  (: check whether the resource is in the database :)
  then let $extension := page:determine-extension($acceptHeader)
       return
           <rest:response>
             <http:response status="303">
               <http:header name="location" value="{ concat($full-local-id,".",$extension) }"/>
             </http:response>
           </rest:response>
  else page:not-found()  (: respond with 404 if not in database :)
};

The page:see-also function first makes sure that metadata about the requested resource actually exist in the database by calling the serialize:find-db function that is part of the Guid-O-Matic module.  The serialize:find-db function returns boolean true if metadata about the identified resource exist.  If the value returned is not true, the page:see-also function calls a function that generates a 404 "Not found" response code.  Otherwise, it uses the requested media type to determine the file extension to append to the requested URI (the URI of the non-information resource).  The function then generates an XML blob that signals to the server that it should send back to the client a 303 redirect to the new URI that it constructed (the URI of the document about the requested resource).  
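The same logic can be mimicked in Python.  In this sketch the IN_DATABASE set stands in for serialize:find-db, the media-type table is abbreviated, and the return value is a (status, headers) pair rather than a real HTTP response:

```python
IN_DATABASE = {"Lingyansi"}  # stand-in for the serialize:find-db lookup

def see_also(accept_header: str, full_local_id: str):
    """Return a 404 if the resource is unknown; otherwise a 303 redirect
    to the document about the resource, in the requested serialization."""
    if full_local_id not in IN_DATABASE:
        return 404, {}
    # abbreviated extension lookup for the sketch
    extension = {"text/turtle": "ttl", "text/html": "htm"}.get(
        accept_header.split(";")[0].strip(), "htm")
    return 303, {"Location": f"{full_local_id}.{extension}"}

print(see_also("text/turtle", "Lingyansi"))    # (303, {'Location': 'Lingyansi.ttl'})
print(see_also("text/turtle", "Nonexistent"))  # (404, {})
```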

Here's the function that initiates the generation of the document in a particular serialization about the resource:

declare function page:handle-repesentation($acceptHeader,$full-local-id)
{
  let $local-id := substring-before($full-local-id,".")
  return
      if(serialize:find-db($local-id))  (: check whether the resource is in the database :)
      then
          let $extension := substring-after($full-local-id,".")
          (: When a specific file extension is requested, override the requested content type. :)
          let $response-media-type := page:determine-media-type($extension)
          let $flag := page:determine-type-flag($extension)
          return page:return-representation($response-media-type,$local-id,$flag)
      else page:not-found()  (: respond with 404 if not in database :)
};

The function begins by parsing out the identifier part from the local name.  It checks to make sure that metadata about the identified resource exist in the database - if not, it generates a 404.  (It's necessary to do the check again in this function, because clients might request the document directly without going through the 303 redirect process.)  If the metadata exist, the function parses out the extension part from the local name.  The extension is used to determine the media type of the representation, which determines both the Content-Type: response header and a flag used to signal to Guid-O-Matic the desired serialization.  In this function, the server ignores the media type value of the Accept: request header.   Because the requested document has a particular media type, that type will be reported accurately regardless of what the client requests.  This behavior is useful in the case where a human wants to use a browser to look at a document that's a serialization of RDF.  If the Accept: header were respected, the human user would see only the web page about the resource rather than the desired RDF document. Finally, the necessary variables are passed to the page:return-representation function that handles the generation of the document.
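In Python, the extension-driven parsing might look like this (illustrative names only; the real logic lives in the XQuery functions above).  Note that the Accept: header plays no role here, exactly as described in the preceding paragraph:

```python
# The file extension, not the Accept: header, decides the Content-Type.
MEDIA_TYPE_FOR_EXTENSION = {
    "ttl": "text/turtle",
    "rdf": "application/rdf+xml",
    "json": "application/ld+json",
    "htm": "text/html",
}

def handle_representation(full_local_id: str):
    """Split 'Lingyansi.ttl' into the identifier and its media type."""
    local_id, _, extension = full_local_id.partition(".")
    media_type = MEDIA_TYPE_FOR_EXTENSION.get(extension, "text/html")
    return local_id, media_type

print(handle_representation("Lingyansi.ttl"))  # ('Lingyansi', 'text/turtle')
```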

Here is the code for the page:return-representation function:

declare function page:return-representation($response-media-type,$local-id,$flag)
{
  ( <rest:response>
      <output:serialization-parameters>
        <output:media-type value='{$response-media-type}'/>
      </output:serialization-parameters>
    </rest:response>,
    if ($flag = "html")
    then page:handle-html($local-id)
    else serialize:main-db($local-id,$flag,"single","false") )
};

The function generates a sequence of two items.  The first is an XML blob that signals to the server that it should generate a Content-Type: response header with a media type appropriate for the document that is being delivered.  The second is the response body, which is generated by one of two functions.  The page:handle-html function for generating the web page is a placeholder function, and in production there would be a call to a real function in a different module that used data from the XML database to generate appropriate content for the described resource.  The serialize:main-db function is the core function of Guid-O-Matic that builds a document from the database in the serialization indicated by the $flag variable.  The purpose of Guid-O-Matic was previously described in Guid-O-Matic Goes to China, so at this point the serialize:main-db function can be considered a black box.  For those interested in the gory details of generating the serializations, look at the code in the serialize.xqm module in the Guid-O-Matic repo.  

The entire restxq-db.xqm web service module can be viewed here.

Trying out the server

To try out the web server, you need to have a client installed on your computer that is capable of making HTTP requests with particular Accept: request headers.  An application commonly used for this purpose is curl. I have to confess that I'm not enough of a computer geek to enjoy figuring out the proper command line options to make it work for me.  Nevertheless, it's simple and free.  If you have installed curl on your computer, you can use it to test the server.  The basic curl command I'll use is

curl -v -H "Accept:text/turtle" http://localhost:8984/Lingyansi

The -v option makes curl verbose, i.e. it shows the request and response headers as they are exchanged.  The -H option is used to send a request header - in the example, the media type for Turtle is requested.  The last part of the command is the URI to be used in the HTTP request, which is a GET request by default.  In this example, the HTTP GET is made to a URI for the local web server running on port 8984 (i.e. where the BaseX server runs by default).  

Here's what happens when the curl command is given to make a request to the web service application:

The lines starting with ">" show the communication to the server and the lines starting with "<" show the communication coming from the server.  You can see that the server has responded in the desired manner: the client requested the file /Lingyansi in Turtle serialization, and the server responded with a 303 redirect to Lingyansi.ttl.  Following the server's redirection suggestion, I can issue the command 

curl -v -H "Accept:text/turtle" http://localhost:8984/Lingyansi.ttl

and I get this response:

This time I get a response code of 200 (OK) and the document is sent as the body of the response.  When GETting the Turtle file, the Accept: header is ignored by the web server and can be anything.  Because of the .ttl file extension, the Content-Type: response header will always be text/turtle.

 If the -L option is added to the curl command, curl will automatically re-issue the command to the new URI specified by the redirect:

curl -v -L -H "Accept:text/turtle" http://localhost:8984/Lingyansi

Here's what the interaction of curl with the server looks like when the -L option is used:

Notice that for the second GET, curl reuses the connection with the server that it left open after the first GET.  One of the criticisms of the 303 redirect solution to the httpRange-14 controversy is that it is inefficient - two GET calls to the server are required to retrieve metadata about a single resource.

If you aren't into painful command line applications, there are several GUI options for sending HTTP requests.  One application that I use is Advanced Rest Client (ARC), a Chrome plugin (unfortunately available for Windows only).  Here's what the ARC GUI looks like:

The URI goes in the box at the top and GET is selected using the appropriate radio button.  If you select the Headers form, a dropdown list of possible request headers appears when you start typing, and you can select Accept.  In this example I've given a value of text/turtle, but you can also test the other values recognized by the server script: text/html, application/rdf+xml, application/json, and application/ld+json.  

When you click SEND, the response is somewhat inscrutably "404 Not found".  I'm not sure exactly what ARC is doing here - clearly something more than a single HTTP GET.  However, if you click the DETAILS dropdown, you have the option of selecting "Redirects".  Then you see that the server issued a 303 See Other redirect to Lingyansi.ttl.

If you change the URI to http://localhost:8984/Lingyansi.ttl, you'll get this response:

This time no redirect, a HTTP response code of 200 (OK), a Content-Type: text/turtle response header, and the Turtle document as the body.  

There is a third option for a client to send HTTP requests: Postman.  It is also free, and available for other platforms besides Windows.  It has a GUI interface that is similar to Advanced Rest Client.  However, for whatever reason, Postman always behaves like curl with the -L option.  That is, it always automatically responds by sending you the ultimate representation without showing you the intervening 303 redirect.  There might be some way to make it show you the complete process that is going on, but I haven't figured out how to do that yet.  

If you are using Chrome as your browser, you can go to the dot, dot, dot dropdown in the upper right corner of the browser and select "More tools", then "Developer tools".  That will open a pane on the right side of the browser to show you what's going on "under the hood".  Make sure that the Network tab is selected, then load http://localhost:8984/Lingyansi .   This is what you'll see:

The Network pane on the right shows that the browser first tried to retrieve Lingyansi, but received a 303 redirect.  It then successfully retrieved Lingyansi.htm, which it then rendered as a web page in the pane on the left.  Notice that after the redirect, the browser replaced the URI that was typed in with the new URI of the page that it actually loaded.

Who cares?

If after all of this long explanation and technical gobbledygook you are left wondering why you should care about this, you are in good company.  Most people couldn't care less about 303 redirects.

As someone who is trying to believe in Linked Data, I'm trying to care about 303 redirects.  According to the core principles of Linked Data elaborated by Tim Berners-Lee in 2006, a machine client should be able to "follow its nose" so that it can "look up" information about resources that it learns about through links from somewhere else.  303 redirects facilitate this kind of discovery by providing a mechanism for a machine client to tell the server what kind of machine-readable metadata it wants (and that it wants machine-readable metadata and not a human-readable web page!).  

Despite the ten+ years that the 303 redirect solution has existed, there are relatively few actual datasets that properly implement the solution.  Why?

I don't control the server that hosts my Bioimages website, and I spent several years trying to get anybody from IT Services at Vanderbilt to pay attention long enough for me to explain what kind of behavior I wanted from the server, and why.  In the end, I did get some sort of content negotiation.  If you perform an HTTP GET request for a URI like http://bioimages.vanderbilt.edu/ind-baskauf/40477 and include an Accept: application/rdf+xml header, the server responds by sending you http://bioimages.vanderbilt.edu/ind-baskauf/40477.rdf (an RDF/XML representation).  However, it just sends the file with a 200 OK response code and doesn't do any kind of redirection (although the correct Content-Type is reported in the response).  The behavior is similar in a browser.  Sending a GET request for http://bioimages.vanderbilt.edu/ind-baskauf/40477 results in the server sending http://bioimages.vanderbilt.edu/ind-baskauf/40477.htm, but since the response code is 200, the browser doesn't replace the URI entered in the box with the URI of the file that it actually delivers.  It seems like this solution should be OK, even though it doesn't involve a 303 redirect.

Unfortunately, from a Linked Data point of view, at the Bioimages server there are always two URIs that denote the same information resource, and neither of them can be inferred to be a non-information resource by virtue of a 303 response, as suggested by the httpRange-14 resolution. On a more practical level, users end up bookmarking two different URIs for the same page (since when content negotiation takes place, the browser doesn't change the URI to the one ending with .htm) and search engines index the same page under two different URIs, resulting in duplicate search results and potentially lower page rankings.  

Another circumstance where failing to follow the Cool URIs recommendation caused a problem is when I tried to use the CETAF Specimen URI Tester on Bioimages URIs.  The tester was created as part of an initiative by the Information Science and Technology Commission (ISTC) of the Consortium of European Taxonomic Facilities (CETAF).  When their URI tester is run on a "cool" URI like http://data.rbge.org.uk/herb/E00421509, the URI is considered to pass the CETAF tests for Level 3 implementation (redirect and return of RDF).  However, a Bioimages URI like http://bioimages.vanderbilt.edu/ind-baskauf/40477 fails the second test of the suite because there is no 303 redirect, even though the URI returns RDF when the media type application/rdf+xml is requested. Bummer. Given the number of cases where RDF can actually be retrieved from URIs that don't use 303 redirects (including all RDF serialized as RDFa), it probably would be best not to build a tester that relied solely on 303 redirects.  But until the W3C changes its mind about the httpRange-14 decision, a 303 redirect is the kosher way to find out from a server that a URI represents a non-information resource.  

So I guess the answer to the question "Who cares?" is "people who care about Linked Data and the Semantic Web".  The problem is that there just aren't that many people in that category, and even fewer who also care enough to implement the 303 redirect solution.  Then there are the people who believe in Linked Data, but were unhappy about the httpRange-14 resolution and don't follow it out of spite.  And there are also the people who believe in Linked Data, but don't believe in RDF (i.e. provide JSON-LD or microformat metadata directly as part of HTML).  

A potentially important thing

Now that I've spent time haranguing about the hassles associated with getting 303 redirects to work, I'll mention a reason why I still think the effort might be worth it.  

The RDF produced by Guid-O-Matic pretends that eventually the application will be deployed on a server that uses a base URI of http://lod.vanderbilt.edu/historyart/site/ . (Eventually there will be some real URIs minted for the Chinese temples, but they won't be the ones used in these examples).  So if we pretend that at some point in the future the Guid-O-Matic web service were deployed on the web (via port 80 instead of port 8984), an HTTP GET request could be made to http://lod.vanderbilt.edu/historyart/site/Lingyansi instead of http://localhost:8984/Lingyansi and the server script would respond with the documents shown in the examples. 

If you look carefully at the Turtle that the Guid-O-Matic server script produces for the Lingyan Temple, you'll see these RDF triples (among others):

<http://lod.vanderbilt.edu/historyart/site/Lingyansi>
     rdf:type schema:Place;
     rdfs:label "Lingyan Temple"@en;
     a geo:SpatialThing.

<http://lod.vanderbilt.edu/historyart/site/Lingyansi.ttl>
     dc:format "text/turtle";
     dc:creator "Vanderbilt Department of History of Art";
     dcterms:references <http://lod.vanderbilt.edu/historyart/site/Lingyansi>;
     dcterms:modified "2016-10-19T13:46:00-05:00"^^xsd:dateTime;
     a foaf:Document.

 You can see that maintaining the distinction between the version of the URI with the .ttl extension and the URI without the extension is important.  The URI without the extension denotes a place and a spatial thing labeled "Lingyan Temple".  It was not created by the Vanderbilt Department of History of Art, it is not in RDF/Turtle format, nor was it last modified on October 19, 2016.  Those latter properties belong to the document that describes the temple.  The document about the temple is linked to the temple itself by the property dcterms:references.  

Because maintaining the distinction between a URI denoting a non-information resource (a temple) and a URI that denotes an information resource (a document about a temple) is important, it is a good thing if you don't get the same response from the server when you try to dereference the two different URIs.  A 303 redirect is a way to clearly maintain the distinction.

Being clear about the distinction between resources and metadata about resources has very practical implications.  I recently had a conversation with somebody at the Global Biodiversity Information Facility (GBIF) about the licensing for Bioimages.  (Bioimages is a GBIF contributor.)  I asked whether he meant the licensing for images in the Bioimages website, or the licensing for the metadata about images in the Bioimages dataset.  The images are available with a variety of licenses ranging from CC0 to CC BY-NC-SA, but the metadata are all available under a CC0 license.  The images and the metadata about images are two different things, but the current GBIF system (based on CSV files, not RDF) doesn't allow for making this distinction on the provider level.  In the case of museum specimens that can't be delivered via the Internet or organism observations that aren't associated with a particular form of deliverable physical or electronic evidence, the distinction doesn't matter much because we can assume that a specified license applies to the metadata.  But for occurrences documented by images, the distinction is very important.

What I've left out

This post dwells on the gory details of the operation of the Guid-O-Matic server script, and tells you how to load outdated XML data files about Chinese temples, but doesn't talk about how you could actually make the web service work with your own data.  I may write about that in the future, but for now you can go to this page of instructions for details of how to set up the CSV files that are the ultimate source of the generated RDF.  The value in the baseIriColumn in the constants.csv file needs to be changed to the path of the directory where you want the database XML files to end up.  After the CSV files are created and have replaced the Chinese temple CSV files in the Guid-O-Matic repo, you need to load the file load-database.xq from the Guid-O-Matic repo into the BaseX GUI.  When you click on the run button (green triangle) of the BaseX GUI, the necessary XML files will be generated in the folder you specified.  The likelihood of success in generating the XML files is higher on a Windows computer because I haven't tested the script on Mac or Linux, and there may be file path issues that I still haven't figured out on those operating systems.  

The other thing that you should know if you want to hack the code is that the restxq-db.xqm imports the Guid-O-Matic modules that are necessary to generate the response body RDF serializations from my GitHub site on the web.  That means that if you want to hack the functions that actually generate the RDF (located in the module serialize.xqm), you'll need to change the module references in the prologue of the restxq-db.xqm module (line 6) so that they refer to files on your computer.  Instead of 

import module namespace serialize = 'http://bioimages.vanderbilt.edu/xqm/serialize' at 'https://raw.githubusercontent.com/baskaufs/guid-o-matic/master/serialize.xqm';

you'll need to use 

import module namespace serialize = 'http://bioimages.vanderbilt.edu/xqm/serialize' at '[file path]';

where [file path] is the path to the file in the Guid-O-Matic repo on your local computer.  On my computer, it's in the download location I specified for GitHub repos, c:/github/guid-o-matic/serialize.xqm .  It will probably be somewhere else on your computer.  Once you've made the change and saved the new version of restxq-db.xqm, you can hack the functions in serialize.xqm and the changes will take effect in the documents sent from the server.


[1] The actual request header contains commas that are not shown here.  BaseX reads the comma-separated values as items in a sequence, and when it reports the sequence back, it omits the commas.