Sunday, July 2, 2017

How (and why) we set up a SPARQL endpoint



Recently I've been kept busy trying to finish up the task of getting the Vanderbilt Semantic Web Working Group's SPARQL endpoint up and running.  I had intended to write a post about how we did it so that others could replicate the effort, but as I was pondering what to write, I realized that it is probably important for me to begin by explaining why we wanted to set one up in the first place.

Why we wanted to set up a SPARQL endpoint

It is tempting to give a simple rationale for our interest in a SPARQL endpoint: Linked Data and the Semantic Web are cool and awesome, and if you are going to dive into them, you need to have a SPARQL endpoint to expose your RDF data.  If you have read my previous blog posts, you will realize that I've only been taking little sips of the Semantic Web Kool-Aid, and have been reluctant to drink the whole glass.  There have been some serious criticisms of RDF and the Semantic Web, with claims that it's too complicated and too hard to implement.  I think that many of those criticisms are well-founded and that the burden is on advocates to show that they can do useful things that can't be accomplished with simpler technologies.  The same test should be applied to SPARQL: what can you do with a triple store and SPARQL endpoint that you can't do with a conventional server, or a different non-RDF graph database system like Neo4J?

It seems to me that there are two useful things that you can get from a triplestore/SPARQL endpoint that you won't necessarily get elsewhere.  The first is the ability to dump data from two different sources into the same graph database and immediately be able to query the combined graph without any further work or data munging.  I don't think that is straightforward with alternatives like Neo4J.  This is the proverbial "breaking down of silos" that Linked Data is supposed to accomplish.  Successfully merging the graphs of two different providers and deriving some useful benefit from doing so depends critically on those two graphs sharing URI identifiers that allow useful links to be made between the two graphs.  Providers have been notoriously bad about re-using other providers' URIs, but given the emergence of useful, stable, and authoritative sources (like the Library of Congress, GeoNames, the Getty Thesauri, ORCID, and others) of RDF data about URI-identified resources, this problem has been getting better.

The second useful thing that you can get from a triplestore/SPARQL endpoint is essentially a programmable API.  Conventional APIs abound, but generally they support only particular search methods with a fixed set of parameters that can be used to restrict the search.  SPARQL queries are generic.  If you can think of a way that you want to search the data and know how to construct a SPARQL query to do it, you can do any kind of search that you can imagine.  In a conventional API, if the method or parameter that you need to screen the results doesn't exist, you have to get a programmer to create it for you on the API.  With SPARQL, you do your own "programming".

These two functions have influenced the approach that we've taken with our SPARQL endpoint.  We wanted to create a single Vanderbilt triplestore because we wanted to be able to merge in it shared information resources to which diverse users around the campus might want to link.  It would not make sense to create multiple SPARQL endpoints on campus, which could result in duplicated data aggregation efforts.  We also want the endpoint to have a stable URI that is independent of department or unit within the institution.  If people build applications to use that endpoint as an API, we don't want to break those applications by changing the subdomain or subpath of the URI, or to have the endpoint disappear once users finished a project or got bored with their own endpoint.


Step 1: Figuring out where to host the server

For several years, we've had an implementation of the Callimachus triplestore/SPARQL endpoint set up on a server operated by the Jean and Alexander Heard Library at Vanderbilt (http://rdf.library.vanderbilt.edu).  As I've noted in earlier posts, Callimachus has some serious deficiencies, but since that SPARQL endpoint was working and the technical support we've gotten from the library has been great, we weren't in too big a hurry to move it somewhere else.  Based on a number of factors, we decided that we would rather have an installation of Blazegraph, and since we wanted this to be a campus-wide resource, we began discussion with Vanderbilt IT services about how to get Blazegraph installed on a server that they supported.  There was no real opposition to doing that, but it became apparent that the greatest barrier to making it happen was to get people at ITS to understand what a triplestore and a SPARQL endpoint were, why we wanted one, and how we planned to use it.  Vanderbilt's ITS operates on a model where support is provided to administrative entities who bear the cost of that support, and where individuals are accountable to defined administrators and units.  Our nebulous campus-wide resource didn't really fit well in that model and it wasn't clear who exactly should be responsible for helping us.  Emails went unanswered and actions weren't taken.  Eventually, it became apparent that if we wanted to get the resource up in a finite amount of time, our working group would have to make it happen ourselves.

Fortunately, we had a grant from the Vanderbilt Institute for Digital Learning (VIDL) that provided us with some money that we could use to set our system up on a commercial cloud server.  We received permission to use the subdomain sparql.vanderbilt.edu, so we started the project with the assumption that we could get a redirect from that subdomain to our cloud server when the time was right.  We decided to go with a Jelastic-based cloud service, since that system provided a GUI for managing the server.  There were cheaper systems that required doing all of the server configuration via the command line, but since we were novices, we felt that paying a little more for the GUI was worth it.  We ended up using ServInt as a provider for no particularly compelling reason other than that group members had used it before.  There are many other options available.

Step 2: Setting up the cloud server

Since ServInt offered a 14-day free trial account, there was little risk in us playing around with the system at the start.  Eventually, we set up a paid account.  Since the cost was being covered by an institutional account, we did not want the credit card to be automatically charged when our initial $50 ran out.  Unfortunately, the account portal was a bit confusing - it turns out that turning off AutoPay does not actually prevent an automatic charge to the credit card.  So when our account got to an alert level of $15, we got hit with another $50 charge.  It turns out that the real way to turn off automatic charges is to set the autofunding level to $0.  Lesson learned - read the details.  Fortunately we were going to buy more server time anyway.

The server charges are based on units called "cloudlets".  Cloudlet usage is based on a combination of the amount of space taken up by the installation and by the amount of traffic handled by the server.  One cloudlet is 128 MiB of RAM and 400MHz of CPU.  The minimum number of reserved cloudlets per application server is one, and you can increase the upper scaling limits to whatever you want.  A higher limit means you could pay at a greater rate if there is heavy usage.  Here's what the Jelastic Environmental Topology GUI looks like:


The system that we eventually settled on uses an Nginx front-end server to handle authentication and load balancing, and a Tomcat back-end server to actually run the Blazegraph application.  The primary means of adjusting for heavy usage is to increase the "vertical" scaling limit (maximum resources allocated to a particular Tomcat instance).  If usage were really high, I think you could increase the horizontal scaling by creating more than one Tomcat instance.  I believe that in that case the Nginx server would balance the load among the multiple Tomcat servers.  However, since our traffic is generally nearly zero, we haven't really had to mess with that.  The only time that system resources have gotten tight was when we were loading files with over a million triples.  But that was only a short-term issue that lasted a few minutes.  At our current usage, the cost to operate the two-server combination is about $2 per day.
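To make the cloudlet arithmetic concrete, here is a small sketch.  The per-cloudlet figures and the roughly $2 per day rate come from the text above; the scaling limit of 8 cloudlets and the 30-day period are made-up values used only for illustration.

```javascript
// Resources represented by one cloudlet, as described above.
const RAM_MIB_PER_CLOUDLET = 128;
const CPU_MHZ_PER_CLOUDLET = 400;

// Convert a cloudlet count into the RAM and CPU it represents.
function cloudletResources(cloudlets) {
    return {
        ramMiB: cloudlets * RAM_MIB_PER_CLOUDLET,
        cpuMHz: cloudlets * CPU_MHZ_PER_CLOUDLET
    };
}

// Project a cost from an observed daily rate (a simple linear estimate).
function projectedCost(dollarsPerDay, days) {
    return dollarsPerDay * days;
}

// Hypothetical scaling limit of 8 cloudlets.
console.log(cloudletResources(8));
// Observed rate of about $2 per day, projected over a hypothetical 30 days.
console.log(projectedCost(2, 30));
```

The real bill depends on how many cloudlets are actually consumed, so a linear projection like this is only a rough guide.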


The initial setup of Blazegraph was really easy.  In the GUI control panel (see above), you just click "Create environment", then click the Create button to create the Tomcat server instance.  When you create the environment, you have to select a name for it that is unique within the jelastic.servint.net subdomain.  The name you choose will be the subdomain of your server.  We chose "vuswwg", so the whole domain name of our server was vuswwg.jelastic.servint.net .  What we chose wasn't really that important, since we were planning eventually to redirect to the server from sparql.vanderbilt.edu .

To load Blazegraph, go to the Blazegraph releases page and copy the WAR Application download link, e.g. https://github.com/blazegraph/database/releases/download/BLAZEGRAPH_RELEASE_2_1_4/blazegraph.war .  On the control panel, click on Upload and paste in the link you copied.  On the Deployment manager list, we selected "vuswwg" from the Deploy to... dropdown next to the blazegraph.war name.  On the subsequent popup, you will create a "context name".  The context name serves as the subpath that will refer to the Blazegraph application.  We chose "bg", so the complete path for the Blazegraph web application was: http://vuswwg.jelastic.servint.net/bg/ .  Once the context was created, Blazegraph was live and online.  Putting the application URL into a browser brought up the generic Blazegraph web GUI.  Hooray!

The process seemed too easy, and unfortunately, it was.  We had Blazegraph online, but there was no security and the GUI web interface as installed would have allowed anyone with Internet access to load their data into the triplestore, or to issue a "DROP ALL" SPARQL Update command to delete all of the triples in the store.  If one were only interested in testing, that would be fine, but for our purposes it was not.

Step 3: Securing the server

It became clear to us that we did not have the technical skills necessary to bring the cloud server up to the necessary level of security.  Fortunately, we were able to acquire the help of developer Ken Polzin, with whom I'd worked on a previous Bioimages project.  Here is an outline of how Ken set up our system (more detailed instructions are here).

We installed an Nginx server in a manner similar to what was described above for the Tomcat installation.  Since the Nginx server was going to be the outward-facing server, Public IPv4 needed to be turned on for it, and we turned it off on the Tomcat server.

There were several aspects of the default Nginx server configuration that needed to be changed in order for the server to work in the way that we wanted.  The details are on this page.  One change redirected from the root of the URI to the /bg/ subpath where Blazegraph lives.  That allows users to enter https://sparql.vanderbilt.edu/ and be redirected to https://sparql.vanderbilt.edu/bg/ or https://sparql.vanderbilt.edu/sparql and be redirected to https://sparql.vanderbilt.edu/bg/sparql .  We wanted this behavior so that we would have a "cool URI" that was simple and did not include implementation-specific information (i.e. "bg" for Blazegraph).  Another change in the configuration facilitated remote calls to the endpoint by enabling cross-origin resource sharing (CORS).

The other major change to the Nginx configuration was related to controlling write access to Blazegraph.  We accomplished that by restricting unauthenticated users to HTTP GET access.  Methods of writing to the server, such as SPARQL Update commands, require HTTP POST.  Changes we made to the configuration file required authentication for any non-GET calls.  Unfortunately, that also meant that regular SPARQL queries could not be requested by unauthenticated users using POST, but that is only an issue if the queries are very long and exceed the character limit for GET URIs.

Authentication for administrative users was accomplished using passwords encrypted using OpenSSL.  Directions for generating and storing passwords are here.  The authentication requirements were set in the Nginx configuration file as discussed above.  Once authentication was in place, usernames and passwords could be sent as part of the POST dialog.  Programming languages and HTTP GUIs such as Postman have built-in mechanisms to support Basic Authentication.  Here is an example using Postman:



Related to the issue of restricting write access was modification of the default Blazegraph web GUI.  Out of the box, the GUI had a tab for Updates (which we had disabled for unauthenticated users) and for some other advanced features that we didn't want the public to see.  Those tabs can be hidden by modifying the /opt/tomcat/webapps/bg/html/index.html file using the server control panel (details here).  We also were able to style the query GUI page to comply with Vanderbilt's branding standards.  You can see the final appearance of the GUI at https://sparql.vanderbilt.edu .

The final step in securing access to the server was to set up HTTPS.  The Jelastic system provides a simple method to set up HTTPS using a free Let's Encrypt SSL certificate.  Before we could enable HTTPS, we had to get the redirect set up from the sparql.vanderbilt.edu subdomain to the vuswwg.jelastic.servint.net subdomain.  This was accomplished by creating a DNS "A" record pointing to the public IP address of the Nginx instance.  To the best of my knowledge, in the Jelastic system, the Nginx IP address is stable as long as the Nginx server is not turned off.  (Restarting the server is fine.)  If the server were turned off and then back on, the IP address would change, and a new A record would have to be set up for sparql.vanderbilt.edu .  Getting the A record set up required several email exchanges with Vanderbilt ITS before everything worked correctly.  Once the record propagated throughout the DNS system, we could follow the directions to install Let's Encrypt and make the final changes to the Nginx configuration file (see this page for details).

Step 4: Loading data into the triplestore

One consequence of the method that we chose for restricting write access to the server was that it was no longer possible to use the Blazegraph web GUI to load RDF files directly from a hard drive into the triplestore.  Fortunately, files could be loaded using the SPARQL Update protocol or the more generic data loading commands that are part of the Blazegraph application, both via HTTP (see examples here).

One issue with using the generic loading commands is that I'm not aware of any way to specify that the triples in the RDF file be added to a particular named graph in the triple store.  If one's plan for managing the triple store involved deleting the entire store and replacing it, then that wouldn't be important.  However, we plan to compartmentalize the data of various users by associating those data with particular named graphs that can be dropped or replaced independently.  So our options were limited to what we could do with SPARQL Update.

The two most relevant Update commands were LOAD and DROP, for loading data into a graph and deleting graphs, respectively.  Both commands must be executed through HTTP POST.

There are actually two ways to accomplish a LOAD command: by sending the request as URL-encoded text or by sending it as plain text.  I couldn't see any advantage to the URL-encoded method, so I used the plain text method.  In that method, the body of the POST request is simply the Update command.  However, the server will not understand the command unless it is accompanied by a Content-Type request header of application/sparql-update.  Since the server is set up to require authorization for POST requests, an Authorization header is also required, although Postman handles that for us automatically when the Basic Auth option is chosen.
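For anyone scripting this rather than using Postman, the pieces of the plain text request can be sketched as follows.  This is a minimal illustration, not the setup we actually used: the function names, the credentials, and the graph URI are hypothetical, and ASCII-only credentials are assumed.

```javascript
// Build the value of the Authorization header for HTTP Basic Authentication.
function basicAuthHeader(username, password) {
    const credentials = username + ':' + password;
    // btoa exists in browsers and recent Node.js; fall back to Buffer otherwise.
    const encoded = (typeof btoa === 'function')
        ? btoa(credentials)
        : Buffer.from(credentials, 'utf8').toString('base64');
    return 'Basic ' + encoded;
}

// Assemble a plain text SPARQL Update request: the body is just the command,
// and the Content-Type header tells the server how to interpret it.
function buildUpdateRequest(updateCommand, username, password) {
    return {
        method: 'POST',
        headers: {
            'Content-Type': 'application/sparql-update',
            'Authorization': basicAuthHeader(username, password)
        },
        body: updateCommand
    };
}

// Hypothetical credentials and graph URI, for illustration only.
const request = buildUpdateRequest(
    'DROP GRAPH <http://example.org/graph/demo>', 'user', 'pass');
console.log(request.headers);
```

The resulting object has the shape expected by the second argument of fetch(), so it could be passed along with the endpoint URL to send the command.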

The basic format of the SPARQL Update LOAD command is:

LOAD <sourceFileURL> INTO GRAPH <namedGraphUri>

where sourceFileURL is a dereferenceable URL from which the file can be retrieved and namedGraphUri is the URI of the named graph in which the loaded triples should be included.  The named graph URI is simply an identifier and does not have to represent any real location on the web.
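If the command is being generated by a script, a small helper can fill in the two URIs.  The sketch below is only an illustration: the function names and the example graph URI are hypothetical, and the check simply rejects characters that SPARQL does not allow between the angle brackets.

```javascript
// Wrap an IRI in angle brackets, rejecting characters that are not allowed
// inside a SPARQL IRI reference (whitespace, <, >, ", {, }, |, ^, ` and \).
function iriRef(iri) {
    if (/[<>"{}|^`\\\s]/.test(iri)) {
        throw new Error('Not usable as an IRI reference: ' + iri);
    }
    return '<' + iri + '>';
}

// Build a SPARQL Update LOAD command for a source file and a named graph.
function loadCommand(sourceFileUrl, namedGraphUri) {
    return 'LOAD ' + iriRef(sourceFileUrl) + ' INTO GRAPH ' + iriRef(namedGraphUri);
}

// The graph URI here is a made-up identifier; it need not resolve to anything.
console.log(loadCommand(
    'file:///opt/tomcat/temp/upload/data.rdf',
    'http://example.org/graph/demo'));
```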

The sourceFileURL can be a web location (such as GitHub) or a local file on the server if the file: URI type is used (e.g. file:///opt/tomcat/temp/upload/data.rdf).  Unfortunately, the file cannot be loaded directly from your local drive.  Rather, it first must be uploaded to the server, then loaded from its new location on the server using the LOAD command.  To upload a file, click on the Tomcat Config icon (wrench/spanner) that appears when you mouse over the area to the right of the Tomcat entry.  A directory tree will appear in the pane below.  You can then navigate to the place where you want to upload the file.  For the URL I listed above, here's what the GUI looks like:


Select the Upload option and a popup will allow you to browse to the location of the file on your local drive.  Once the file is in place on the server, you can use your favorite HTTP tool to POST the SPARQL Update command and load the file into the triplestore.

This particular method is a bit of a hassle, and is not amenable to automation.  If you are managing files using GitHub, it's a lot easier to load the file directly from there using a variation on the Github raw file URL.  For example, if I want to load the file https://github.com/baskaufs/cv/blob/master/occurrenceStatus/occurrenceStatus.ttl into the triplestore, I would need to load the raw version at https://raw.githubusercontent.com/baskaufs/cv/master/occurrenceStatus/occurrenceStatus.ttl .  However, it is not possible to successfully load that file into Blazegraph directly from Github using the LOAD command.  The reason is that when a file is loaded from a remote URL using the SPARQL update command, Blazegraph apparently depends on the Content-Type header from the source to know that the file is some serialization of RDF.  Github and Github Gist always report the media type of raw files as text/plain regardless of the file extension, and Blazegraph takes that to mean that the file does not contain RDF triples.  If one uses the raw file URL in a SPARQL Update LOAD command, Blazegraph will issue a 200 (OK) HTTP code, but won't actually load any triples.

The solution to this problem is to use a redirect that specifies the correct media type.  The caching proxy service RawGit (https://rawgit.com/) interprets the file extension of a Github raw file and relays the requested file with the correct Content-Type header.  The example file above would be retrieved using the RawGit development URL https://rawgit.com/baskaufs/cv/master/occurrenceStatus/occurrenceStatus.ttl .  RawGit will add the Content-Type header text/turtle as it redirects the file.  (Read the RawGit home page at https://rawgit.com/ for an explanation of the distinction between RawGit development URLs and production URLs.)
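The relationship among the three URL forms is mechanical, so it can be captured in a few lines.  This is just a sketch with a hypothetical function name; it handles only the URL pattern shown in the example above.

```javascript
// Convert a Github "blob" page URL into the raw file URL and the
// RawGit development URL by dropping the "/blob" path segment.
function githubFileUrls(blobUrl) {
    const match = /^https:\/\/github\.com\/([^\/]+)\/([^\/]+)\/blob\/(.+)$/.exec(blobUrl);
    if (match === null) {
        throw new Error('Not a Github blob URL: ' + blobUrl);
    }
    // user/repository/branch/path-to-file
    const path = match[1] + '/' + match[2] + '/' + match[3];
    return {
        raw: 'https://raw.githubusercontent.com/' + path,
        rawgit: 'https://rawgit.com/' + path
    };
}

console.log(githubFileUrls(
    'https://github.com/baskaufs/cv/blob/master/occurrenceStatus/occurrenceStatus.ttl'));
```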

The SPARQL Update DROP command has this format:

DROP GRAPH <namedGraphUri>

Executing the command removes from the store all of the triples that had been assigned to the graph identified by the URI in the position of namedGraphUri.

If the graphs are small (a few hundred or thousand triples), both loading and dropping them require a trivial amount of time.  However, when the graph size is significant (i.e. a million triples or more), then a non-trivial amount of time is required either to load or drop the graph.  I think that the reason is the indexing that Blazegraph does as it loads the triples.  That indexing is what enables Blazegraph to conduct efficient querying.  The transfer of the file itself can be sped up by compressing it.  Blazegraph supports gzip (.gz) file compression.  However, compressing the file doesn't seem to speed up the actual time required to load the triples into the store.  I haven't done a lot of experimenting with this, but I have one anecdotal experience loading a gzip compressed file containing about 1.7 million triples.  I uploaded the file to the server, then used the file: URI version of SPARQL Update to load it into the store.  Normally, the server sends an HTTP 200 code and a response body indicating the number of "mutations" (triples modified) after the load command is successfully completed.  However, in the case of the 1.7 million triple file, the server timed out and sent an error code.  But when I checked the status of the graph a little later on, all of the triples seemed to have successfully loaded.  So the timeout seems to have been a limit on communication between the server and client, but not necessarily a limit on the time necessary to carry out actions that are happening internally in the server.

I was a bit surprised to discover that dropping a large graph took about as long as loading it.  In retrospect, I probably shouldn't have been surprised.  Removing a bunch of triples involves removing them from the indexing system, not just deleting some file location entry as would be the case for deleting a file on a hard drive.  So it makes sense that the removal activity should take about as long as the adding activity.

These speed issues suggest some considerations for graph management in a production environment.  If one wanted to replace a large graph (i.e. over a million triples), dropping the graph, then reloading it probably would not be the best option, since both actions would be time consuming and the data would probably be essentially "off line" during the process.  It might work better to load the new data into a differently named graph, then use the SPARQL Update COPY or MOVE functions to replace the existing graph that needs to be replaced.  I haven't actually tried this yet, so it may not work any better than dropping and reloading.
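The load-then-swap idea could be sketched as a pair of SPARQL Update commands generated like this.  As noted, this has not been tried against our Blazegraph installation, so treat it as an outline only; the function name and all of the URIs are hypothetical.

```javascript
// Generate the commands for replacing a graph without dropping it first:
// 1. LOAD the new data into a separate staging graph;
// 2. MOVE the staging graph onto the target graph.  Per the SPARQL 1.1
//    Update specification, MOVE replaces the target's existing triples
//    and removes the staging graph.
function replaceGraphCommands(sourceFileUrl, targetGraphUri, stagingGraphUri) {
    return [
        'LOAD <' + sourceFileUrl + '> INTO GRAPH <' + stagingGraphUri + '>',
        'MOVE GRAPH <' + stagingGraphUri + '> TO GRAPH <' + targetGraphUri + '>'
    ];
}

// Made-up source file and graph identifiers, for illustration only.
const commands = replaceGraphCommands(
    'https://example.org/data/new-version.ttl',
    'http://example.org/graph/demo',
    'http://example.org/graph/demo-staging');
commands.forEach(function (command) {
    console.log(command);
});
```

Each command would be sent as its own authenticated POST, so the old graph remains queryable while the LOAD into the staging graph is running.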

Step 5: Documenting the graphs in the triplestore

One problem with the concept of a SPARQL endpoint as a programmable API is that users need to understand the graphs in the triplestore in order to know how to "program" their queries.  So our SPARQL endpoint wasn't really "usable" until we provided a description of the graphs included in the triple store.  On the working group Github site, we have created a "user guide" with some general instructions about using the endpoint and a page for each project whose data are included in the triplestore.  The project pages describe the named graphs associated with the project, including a graphical representation of the graph model and sample queries (an example is here).  With a little experimentation, users should be able to construct their own queries to retrieve data associated with the project.

Step 6: Using the SPARQL endpoint

I've written some previous posts about using our old Callimachus endpoint as a source of XML data to run web applications.  Since Blazegraph supports JSON query results, I was keen to try writing some new Javascript to take advantage of that.  I have a new little demo page at http://bioimages.vanderbilt.edu/lang-labels.html that consumes JSON from our new endpoint.  The underlying Javascript that makes the page work is at http://bioimages.vanderbilt.edu/lang-labels.js .  The script sends a very simple query to the endpoint, e.g.:

SELECT DISTINCT ?label WHERE {
<http://rs.tdwg.org/cv/status/extant> <http://www.w3.org/2004/02/skos/core#prefLabel> ?label.
}

when the query is generated using the page's default values.  Here's what that query looks like when it's URL encoded and ready to be sent by HTTP GET:

https://sparql.vanderbilt.edu/sparql?query=SELECT%20DISTINCT%20%3Flabel%20WHERE%20%7B%3Chttp%3A%2F%2Frs.tdwg.org%2Fcv%2Fstatus%2Fextant%3E%20%3Chttp%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23prefLabel%3E%20%3Flabel.%7D
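The encoding step itself needs nothing more than the built-in encodeURIComponent function.  The sketch below is not the code from lang-labels.js (the function name is hypothetical); it simply shows that encoding the query text reproduces the URL shown above.

```javascript
// Append a URL-encoded SPARQL query to the endpoint as the "query" parameter.
function buildQueryUrl(endpoint, query) {
    return endpoint + '?query=' + encodeURIComponent(query);
}

// The same query as above, written on one line with minimal whitespace.
const query = 'SELECT DISTINCT ?label WHERE {' +
    '<http://rs.tdwg.org/cv/status/extant> ' +
    '<http://www.w3.org/2004/02/skos/core#prefLabel> ?label.}';

const url = buildQueryUrl('https://sparql.vanderbilt.edu/sparql', query);
console.log(url);
```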

On Chrome, you can use the Developer Tools to track the interactions between the browser and the server as you load the page and click the button.

The Jquery code that does the AJAX looks like this:

// encoded holds the URL-encoded SPARQL query string
$.ajax({
    type: 'GET',
    url: 'https://sparql.vanderbilt.edu/sparql?query=' + encoded,
    headers: {
        Accept: 'application/sparql-results+json'
    },
    success: function(returnedJson) {
        // [handler function goes here]
    }
});

Since the HTTP GET request includes an Accept: header of application/sparql-results+json, the server response (the Javascript object returnedJson) looks like this:

{
  "head" : {
    "vars" : [ "label" ]
  },
  "results" : {
    "bindings" : [ {
      "label" : {
        "xml:lang" : "en",
        "type" : "literal",
        "value" : "extant"
      }
    }, {
      "label" : {
        "xml:lang" : "de",
        "type" : "literal",
        "value" : "vorhanden "
      }
    }, {
      "label" : {
        "xml:lang" : "es",
        "type" : "literal",
        "value" : "existente"
      }
    }, {
      "label" : {
        "xml:lang" : "pt",
        "type" : "literal",
        "value" : "presente"
      }
    }, {
      "label" : {
        "xml:lang" : "zh-hans",
        "type" : "literal",
        "value" : "现存"
      }
    }, {
      "label" : {
        "xml:lang" : "zh-hant",
        "type" : "literal",
        "value" : "現存"
      }
    } ]
  }
}

It then becomes a simple matter to pull the label values from the JSON array using this Javascript loop found in the handler function:

var value = "";
for (var i = 0; i < returnedJson.results.bindings.length; i++) {
    // "xml:lang" contains a colon, so it requires bracket notation
    value = value + "<p>"
        + returnedJson.results.bindings[i].label["xml:lang"] + " "
        + returnedJson.results.bindings[i].label.value + "</p>";
}

then display value on the web page.  One issue with the translation from the JSON array to the Javascript array reference can be seen in the code snippet above.  The JSON key "xml:lang" is not a valid Javascript name due to the presence of the colon.  So "bracket notation" must be used in the Javascript array reference instead of "dot notation" to refer to it.
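If the labels are needed for something other than display, the same traversal can collect them into an object keyed by language tag.  This is a sketch with a hypothetical function name, using a shortened copy of the response shown above as sample input.

```javascript
// Collect the values bound to one query variable, keyed by language tag.
function labelsByLanguage(returnedJson, variableName) {
    const labels = {};
    returnedJson.results.bindings.forEach(function (binding) {
        const term = binding[variableName];
        if (term === undefined) {
            return; // this result row has no binding for the variable
        }
        // Bracket notation is required because of the colon in "xml:lang".
        const lang = term['xml:lang'] || '';
        labels[lang] = term.value;
    });
    return labels;
}

// Shortened sample of the server response shown above.
const sample = {
    head: { vars: ['label'] },
    results: {
        bindings: [
            { label: { 'xml:lang': 'en', type: 'literal', value: 'extant' } },
            { label: { 'xml:lang': 'zh-hant', type: 'literal', value: '現存' } }
        ]
    }
};

const labels = labelsByLanguage(sample, 'label');
console.log(labels['zh-hant']);
```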

Conclusion

I am quite excited that our endpoint is now fully operational and that we can build applications around it.  One disappointing discovery that I made recently is that as currently configured, our Blazegraph instance is not correctly handling URL encoded literals that are included in query strings.  It works fine with literals that contain only ASCII strings, but including a string like "現存" (URL encoded as "%E7%8F%BE%E5%AD%98") in a query fails to produce any result.  This problem doesn't happen when the same query is made to the Callimachus endpoint.  That is a major blow, since several datasets that we have loaded or intend to load into the triplestore include UTF-8 encoded strings representing literals in a number of languages.  I sent a post about this problem to the Bigdata developer's email list, but have not yet gotten any response.  If anyone has any ideas about why we are having this problem, or how to solve it, I'd be keen to hear from you.
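For anyone who wants to reproduce the problem, the encoded form quoted above is simply what encodeURIComponent produces: the UTF-8 bytes of the string, each written as a percent-escaped hexadecimal pair.  The snippet only illustrates the encoding; it says nothing about where the failure occurs on the server side.

```javascript
// Percent-encode a non-ASCII literal the way it appears in a GET query string.
const literal = '現存';
const encodedLiteral = encodeURIComponent(literal);
console.log(encodedLiteral);

// Decoding the encoded form recovers the original string.
console.log(decodeURIComponent(encodedLiteral) === literal);
```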

Aside from that little snafu, we have achieved one of the "useful things" that SPARQL endpoints allow: making it possible for users to "program the API" to get any kind of information they want from the datasets included in the triplestore.  It remains for us to explore the second "useful thing" that I mentioned at the start of this post: merging RDF datasets from multiple sources and accomplishing something useful by doing so.  Stay tuned as we try to learn effective ways to do that in the future.

We also hope that at some point in the future we will have demonstrated that there is some value to having a campus-wide SPARQL endpoint, and that once we can clearly show how we set it up and what it can be used for, we will be able to move it from the commercial cloud server to a server maintained by Vanderbilt ITS.


Tuesday, May 2, 2017

Using the TDWG Standards Documentation Specification with a Controlled Vocabulary

[Note: this post was updated on 2017-06-29 to switch the endpoint URI from the temporary one at http://rdf.library.vanderbilt.edu/sparql to the permanent one at https://sparql.vanderbilt.edu/sparql .]

I just recently got the news that the TDWG Executive Committee has ratified the Standards Documentation Specification (SDS).  We now have a way to describe vocabularies not only in a consistent human-readable form, but also in a machine-readable form.  This includes not only TDWG's current core vocabularies (Darwin Core, DwC and Audubon Core, AC), but also vocabularies of controlled values that will be developed in the future.

What are the implications of saying that the vocabularies will be "machine-readable"?  We have had about ten years now of promises that using "semantic technologies" will magically revolutionize biodiversity informatics, but despite repeated meetings, reports, and publications, our actual core technologies are built around conventional database technologies and simple file formats like CSVs.  Many TDWG old-timers have reached the point of "semantic fatigue" resulting from broken promises about what RDF, Linked Data, and the Semantic Web are going to do for them.  So the purpose of this blog post is NOT to sing the praises of RDF and try to change people's minds about it.  Rather, it is to show how describing vocabularies using the SDS can make management of controlled vocabularies practical, and to show how the machine-readable representations of those controlled vocabularies can be used to build applications that can mediate the generation and cleaning of data without human intervention.

I've been working recently with Quentin Groom to flesh out how a test controlled vocabulary for dwc:occurrenceStatus would be serialized in accordance with the SDS.  That test vocabulary is used in the examples that follow.  I'm really excited about this and hopefully there will be more progress to report on this front in the near future.

What is a controlled vocabulary term?

There are several misconceptions about the terminology for describing controlled vocabularies that need to be cleared up before I get into the details about how the SDS will facilitate the management and use of controlled vocabularies.  The first misconception is about the meaning of "controlled vocabulary term".  In the TDWG community, there is a tendency for people to think that a "controlled vocabulary term" is a certain string that we should all use to represent a particular value of a property.  For example, we could say that in a Darwin Core Archive, we would like for everyone to use the string "extant" as the value for the property dwc:occurrenceStatus when we intend to convey the meaning that an organism was present in a certain geographical location at a certain period of time.  However, the controlled vocabulary term is actually the concept of what we would describe in English as "an organism was present in a certain geographical location at a certain period of time" and not any particular string that we might use as a label for that concept.

This idea that a controlled value term is a concept rather than a language-dependent label lies at the heart of the Simple Knowledge Organization System (SKOS), a W3C Recommendation used to describe thesauri and controlled vocabularies.  In fact, the core entity in SKOS is skos:Concept, the class of ideas or notions.  Those ideas can "be identified using URIs, labeled with lexical strings in one or more natural languages" [1], but neither the URIs nor the strings "are" the concepts.  The SDS recognizes this distinction when it specifies (Section 4.5.4) that controlled vocabulary terms should be typed as skos:Concept.

What is a term IRI?

Another common misconception is that an IRI must "do something" when you paste it into a web browser.  (In current W3C standards, "IRI", Internationalized Resource Identifier, has replaced "URI", Uniform Resource Identifier, but in the context of this post you can consider them to be interchangeable.)  Although it is nice if an IRI dereferences when you put it in a browser, there is no requirement that it do so.  At its core, an IRI is simply a globally unique identifier that conforms to a particular IETF specification [2].

For example, the IRI http://rs.tdwg.org/dwc/iri/occurrenceStatus is a valid IRI, because it conforms to the IRI specification.  However, it does not currently dereference because no one has (yet) set up the TDWG server to handle it.  It is, however, a valid Darwin Core term, because it is defined in Section 3.7 of the Darwin Core RDF Guide.  The SDS specifies in Section 2.1.1 that IRIs are the type of identifiers that are used in TDWG standards to uniquely identify resources, including vocabulary terms.  Some other kind of globally unique identifier (e.g. UUIDs) could have been used, but using IRIs codified the practice already used by TDWG for other vocabularies.

The SDS does not specify the exact form of IRIs.  That is a matter of design choice, probably to be determined by the TDWG Technical Architecture Group (TAG).  Existing terms in DwC and AC use the pattern where a term IRI is composed of a namespace part and a local name that is a string composed of some form of an English label for the term.  For example, http://rs.tdwg.org/dwc/iri/occurrenceStatus is constructed from the namespace "http://rs.tdwg.org/dwc/iri/" (abbreviated by the compact URI or CURIE dwciri:) and the camel case local name "occurrenceStatus".  There is no requirement in the SDS for the form of the local name part of a term IRI - it could also be an opaque identifier such as a number.  Again, this is a design choice.  So it would be fine for the local name part of the IRI to be something like "os12345".

What is a label?

A label is a natural language string that is used by humans to recognize a resource.  In SKOS, labels are strings of Unicode characters in a given language.  The rules of SKOS declare that for each concept there is at most one preferred label per language, indicated by the property skos:prefLabel.  There may be any number of additional labels, such as "hidden labels" (skos:hiddenLabel) that are known to be associated with a concept, but that should not be suggested for use.  In SKOS, labels may have a language tag, although that is not required.
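
The "at most one preferred label per language" rule is easy to check mechanically.  Here is a minimal Python sketch of such a check, assuming the labels have already been pulled out of the metadata as (IRI, label, language) rows.  The function name and the row layout are my own invention for illustration, and the extra English label at the end is made up to trigger the check.

```python
from collections import defaultdict

EXTANT = "http://rs.tdwg.org/cv/status/extant"

# (concept IRI, preferred label, language tag) rows, using the labels
# from the translation example later in this post
pref_labels = [
    (EXTANT, "presente", "pt"),
    (EXTANT, "extant", "en"),
    (EXTANT, "vorhanden", "de"),
]

def pref_label_conflicts(rows):
    """Return {(iri, lang): [labels]} for concepts that have more than
    one distinct preferred label in the same language."""
    seen = defaultdict(set)
    for iri, label, lang in rows:
        seen[(iri, lang)].add(label)
    return {key: sorted(labels) for key, labels in seen.items() if len(labels) > 1}

# The labels above satisfy the rule, so nothing is reported
print(pref_label_conflicts(pref_labels))

# A second English preferred label (hypothetical) would violate it
print(pref_label_conflicts(pref_labels + [(EXTANT, "present", "en")]))
```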

In SKOS, the intent is to create a mechanism that leads human users to discover the preferred label for a concept in the user's own language, while also specifying other non-preferred labels that users might be inclined to use on their own.

Based on TDWG precedent, the SDS specifies that English language labels must be included in the standards documents that describe vocabularies.  Labels in other languages are encouraged, but do not fall within the standard itself.  That makes adding those labels less cumbersome from the vocabulary maintenance standpoint.

What is a "value"?

The prevalent view in TDWG that there is one particular string that should serve as the "controlled value" for a term is alien to SKOS.  In SKOS, unique identification of concepts is always accomplished by IRIs. As a concession to current practice, in Section 4.5.4 the SDS declares that each controlled vocabulary term should be associated with a text string that is unique within that vocabulary.  The utility property rdf:value is used to associate that string with the term.  If people want to provide a string in a CSV file to represent a controlled vocabulary term, they can use this string as a value of a Darwin Core property such as dwc:occurrenceStatus.  However, if they want to be completely unambiguous, they can use the term IRI as a value of dwciri:occurrenceStatus. Using dwciri:occurrenceStatus instead of dwc:occurrenceStatus is basically a signal that the value is "clean" and that no disambiguation is necessary.

The pieces of the controlled vocabulary

The Standards Documentation Specification breaks apart machine-readable controlled vocabulary metadata into several pieces.  One piece is the metadata that actually comprise the standard itself.  Those metadata are described in Sections 4.2.2, 4.4.2, 4.5, and 4.5.4.  In the case of the terms themselves, the critical metadata properties are rdfs:label (to indicate the label in English), rdfs:comment (to indicate the definition in English), and rdf:value (to indicate the unique text string associated with the term).  Because these values are part of the normative description of the vocabulary standard, their creation and modification are strictly controlled by processes described in the newly adopted Vocabulary Maintenance Specification.

In contrast, assignment of labels in languages other than English and translations of definitions into other languages falls outside the standards process.  Lists of multilingual labels and definitions are therefore kept in documents that are separate from the standards documents.  This makes it possible to easily add to these lists, or make corrections without invoking any kind of standards process.  The properties skos:prefLabel and skos:definition can be used to indicate the multilingual translations of labels and definitions respectively.

In addition to the preferred labels, it is also possible to maintain lists of non-preferred labels that have been used by some data providers, but which do not conform to the unique text string assigned to each term.  GBIF, VertNet, and other aggregators have compiled such lists from actual data in the wild.  The term skos:hiddenLabel can be used to associate these strings with the controlled value terms to which they have been mapped.

Controlled vocabulary metadata sources

For convenience, the machine-readable metadata in this post will be shown in RDF/Turtle, which is generally considered to be the easiest serialization for humans to read.  However, it may be serialized in any equivalent form -  developers may prefer a different serialization such as XML or JSON.  Here is an example of the metadata associated with a term from a controlled vocabulary designed to provide values for the Darwin Core term occurrenceStatus:

<http://rs.tdwg.org/cv/status/extant> a skos:Concept;
     skos:inScheme <http://rs.tdwg.org/cv/status/>;
     rdfs:isDefinedBy <http://rs.tdwg.org/cv/status/>;
     dcterms:isPartOf <http://rs.tdwg.org/cv/status/>;
     rdf:value "extant";
     rdfs:label "extant"@en;
     rdfs:comment "The species is known or thought very likely to occur presently in the area, which encompasses localities with current or recent (last 20-30 years) records where suitable habitat at appropriate altitudes remains."@en.

These metadata would be included in the machine-readable form of the vocabulary standard document.  Here are metadata associated with the same term, but included in an ancillary document that is not part of the standard:

<http://rs.tdwg.org/cv/status/extant>
     skos:prefLabel "presente"@pt;
     skos:definition "Sabe-se que a espécie ocorre na área ou a sua ocorrência é tida como bastante provável, o que inclui localidades com registos atuais ou recentes (últimos 20-30 anos) nas quais se mantêm habitats adequados às altitudes apropriadas."@pt;
     skos:prefLabel "extant"@en;
     skos:definition "The species is known or thought very likely to occur presently in the area, which encompasses localities with current or recent (last 20-30 years) records where suitable habitat at appropriate altitudes remains."@en;
     skos:prefLabel "vorhanden "@de;
     skos:definition "Von der Art ist bekannt oder wird mit hoher Wahrscheinlichkeit angenommen, dass sie derzeit im Gebiet anwesend ist, und für die Art existieren aktuelle oder in den letzten 20 bis 30 Jahren erstellte Aufzeichnungen, in Lagen mit geeigneten Lebensräumen. "@de.

These data provide the non-normative translations of the preferred term label and definition.  Here are some metadata that might be in a third document:

<http://rs.tdwg.org/cv/status/extant>
     skos:hiddenLabel "Reported";
     skos:hiddenLabel "Outbreak";
     skos:hiddenLabel "Infested";
     skos:hiddenLabel "present";
     skos:hiddenLabel "probable breeding";
     skos:hiddenLabel "Frecuente";
     skos:hiddenLabel "Raro";
     skos:hiddenLabel "confirmed breeding";
     skos:hiddenLabel "Present";
     skos:hiddenLabel "Présent ";
     skos:hiddenLabel "presence";
     skos:hiddenLabel "presente";
     skos:hiddenLabel "frecuente";
...   .

These are all of the known variants of strings that have been mapped to the term http://rs.tdwg.org/cv/status/extant.

For management purposes, these three documents will probably be managed separately.  The first list from the standards document will be changed rarely, if ever.  The second list will (hopefully) be added to frequently by human curators as the controlled vocabulary is translated into new languages.  The third list may be massive, and maintained by data-cleaning software as human operators of the software discover new variants in submitted data and assign those variants to particular terms in the controlled vocabulary.

Periodically, as the three lists are updated, they can be merged.  Given that the SDS is agnostic about the form of the machine-readable metadata, they could be ingested as JSON-LD and processed using purpose-built applications.  However, in the following examples, I'll load the metadata into an RDF triplestore and expose the merged graph via a SPARQL endpoint.  That is convenient because the merging can be accomplished without any additional processing of the data on my part.
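
The reason no extra processing is needed is that an RDF graph is just a set of triples, so merging graphs amounts to a set union.  Here is a minimal Python sketch of that idea.  It uses plain tuples rather than a real RDF library, leaves the prefixed names unexpanded, writes language-tagged literals as simple strings, and includes only a few of the triples shown above, so treat it as an illustration of the principle and not as how a triplestore is implemented.

```python
EXTANT = "http://rs.tdwg.org/cv/status/extant"

# Each document is modeled as a set of (subject, predicate, object) triples.
standard_doc = {
    (EXTANT, "rdf:type", "skos:Concept"),
    (EXTANT, "rdf:value", "extant"),
    (EXTANT, "rdfs:label", "extant@en"),
}
translations_doc = {
    (EXTANT, "skos:prefLabel", "presente@pt"),
    (EXTANT, "skos:prefLabel", "extant@en"),
}
hidden_labels_doc = {
    (EXTANT, "skos:hiddenLabel", "Reported"),
    (EXTANT, "skos:hiddenLabel", "present"),
}

# Merging is a set union: triples that share a subject IRI simply end up
# side by side, and any exact duplicates collapse into one.
merged = standard_doc | translations_doc | hidden_labels_doc

def describe(graph, subject):
    """Return the sorted (predicate, object) pairs for one subject."""
    return sorted((p, o) for s, p, o in graph if s == subject)

for prop, value in describe(merged, EXTANT):
    print(prop, value)
```

The shared IRI is what links the three documents.  If the translators or the data cleaners had used a different identifier for the term, the union would still work but the triples would not describe the same resource.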

Accessing the merged graph

I've loaded the metadata shown above into the Vanderbilt SPARQL endpoint, where it can be queried at https://sparql.vanderbilt.edu.  The following query can be pasted into the box to see what properties and values exist for http://rs.tdwg.org/cv/status/extant in the merged graph:

SELECT DISTINCT ?property ?value WHERE {
   <http://rs.tdwg.org/cv/status/extant> ?property ?value.
   }

You can see that the metadata included in the standards document, translations document, and hidden label document all come up.

Clearly, nobody is actually going to want to paste queries into a box to use this information.  However, the data can be accessed by an HTTP GET call using CURL, Python, Javascript, jQuery, XQuery, or whatever flavor of software you like.  Here's what the query above looks like when URL encoded and attached to the endpoint IRI as a query string:

https://sparql.vanderbilt.edu/sparql?query=SELECT%20DISTINCT%20%3Fproperty%20%3Fvalue%20WHERE%20%7B%0A%20%20%20%3Chttp%3A%2F%2Frs.tdwg.org%2Fcv%2Fstatus%2Fextant%3E%20%3Fproperty%20%3Fvalue.%0A%20%20%20%7D

The query can be sent using HTTP GET by your favorite application to retrieve the same metadata as one sees in the paste-in box.  The new Blazegraph SPARQL endpoint supports both XML and JSON query results.  It returns XML by default, but if an Accept: header of application/sparql-results+json is sent with the request, the results will be returned in JSON.
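
As an illustration, here is a minimal Python sketch that builds the same request using only the standard library.  The function name build_request is my own, and the part that actually sends the request is left commented out because it needs network access and assumes that the endpoint is still online.

```python
import urllib.parse
import urllib.request

ENDPOINT = "https://sparql.vanderbilt.edu/sparql"  # endpoint IRI from this post

QUERY = """SELECT DISTINCT ?property ?value WHERE {
   <http://rs.tdwg.org/cv/status/extant> ?property ?value.
   }"""

def build_request(query, endpoint=ENDPOINT):
    """Build (but do not send) a GET request that asks for JSON results."""
    # safe="" makes quote() percent-encode "/" and ":" as well
    url = endpoint + "?query=" + urllib.parse.quote(query, safe="")
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )

request = build_request(QUERY)
print(request.full_url)

# To actually send it (requires network access and a live endpoint):
# import json
# with urllib.request.urlopen(request) as response:
#     results = json.load(response)
```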

Many people seem to be somewhat mystified about the purpose of a SPARQL endpoint and assume that it is some kind of weird Semantic Web thing.  If you fall into this category, you should think of a SPARQL endpoint as a kind of "programmable" web API.  Unlike a "normal" API where you must select from a fixed set of requests, you can request any imaginable result that can possibly be retrieved from a dataset.  That means that the request IRIs are probably going to be more complex, but once they have been conceived, the requests are going to be made by a software application, so who cares how complex they are?

Multilingual pick list for occurrenceStatus

I'm going to demonstrate how the multilingual data could be used to create a dropdown where a user selects the appropriate controlled value for the Darwin Core term occurrenceStatus when presented with a list of labels in the user's native language.  Here's the SPARQL query that lies at the heart of generating the pick list:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?label ?def ?term WHERE {
?term <http://www.w3.org/2000/01/rdf-schema#isDefinedBy><http://rs.tdwg.org/cv/status/>.
?term skos:prefLabel ?label.
?term skos:definition ?def.
FILTER (lang(?label)='en')
FILTER (lang(?def)='en')
}
ORDER BY ASC(?label)

Here's what it does.  The triple pattern:
?term <http://www.w3.org/2000/01/rdf-schema#isDefinedBy><http://rs.tdwg.org/cv/status/>.
restricts the results to terms that are part of the occurrenceStatus controlled vocabulary.  The triple patterns:
?term skos:prefLabel ?label.
?term skos:definition ?def.
bind preferred labels and definitions to the variables ?label and ?def.  The filters:
FILTER (lang(?label)='en')
FILTER (lang(?def)='en')
restrict the labels and definitions to those that are language-tagged as English.  To change the requested language, a different language tag, such as 'pt' or 'de' can be substituted for 'en' by the software.  The last line tells the endpoint to return the results in alphabetical order by label.  The query is URL encoded and appended as a query string to the IRI of the SPARQL endpoint:

https://sparql.vanderbilt.edu/sparql?query=PREFIX%20skos%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23%3ESELECT%20DISTINCT%20%3Flabel%20%3Fdef%20%3Fterm%20WHERE%20%7B%3Fterm%20%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23isDefinedBy%3E%3Chttp%3A%2F%2Frs.tdwg.org%2Fcv%2Fstatus%2F%3E.%3Fterm%20skos%3AprefLabel%20%3Flabel.%3Fterm%20skos%3Adefinition%20%3Fdef.FILTER%20(lang(%3Flabel)%3D%27en%27)FILTER%20(lang(%3Fdef)%3D%27en%27)%7DORDER%20BY%20ASC(%3Flabel)
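
The language substitution described above can be sketched in a few lines of Python.  This is not the code that runs the demo page (that is written in Javascript).  The names QUERY_TEMPLATE, AVAILABLE_LANGUAGES, and pick_list_url are made up for this example, and the list of languages is simply the set that is currently loaded.

```python
import urllib.parse

ENDPOINT = "https://sparql.vanderbilt.edu/sparql"

# Doubled braces are literal braces; {lang} is the only substitution point.
QUERY_TEMPLATE = """PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?label ?def ?term WHERE {{
?term <http://www.w3.org/2000/01/rdf-schema#isDefinedBy><http://rs.tdwg.org/cv/status/>.
?term skos:prefLabel ?label.
?term skos:definition ?def.
FILTER (lang(?label)='{lang}')
FILTER (lang(?def)='{lang}')
}}
ORDER BY ASC(?label)"""

AVAILABLE_LANGUAGES = ("en", "pt", "de", "zh-hans", "zh-hant", "es")

def pick_list_url(lang):
    """Return the HTTP GET IRI for the pick list in the given language."""
    # Accept only known tags so arbitrary text never reaches the query
    if lang not in AVAILABLE_LANGUAGES:
        raise ValueError("no labels loaded for language tag: " + lang)
    query = QUERY_TEMPLATE.format(lang=lang)
    return ENDPOINT + "?query=" + urllib.parse.quote(query, safe="")

print(pick_list_url("pt"))
```

Python's quote() also percent-encodes the parentheses and keeps the line breaks, so the IRI it prints is not character-for-character the same as the one above, but it decodes to an equivalent query.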

A page that makes use of this query is online at http://bioimages.vanderbilt.edu/pick-list.html?en.  The URL of the page ends in a query string that specifies the starting language for the page.  Currently en, pt, de, zh-hans, zh-hant, and es are available, although I'm hoping to add ko soon.  The "guts" of the program are the javascript code at http://bioimages.vanderbilt.edu/pick-list.js.  Lines 58 through 67 generate the query above and line 68 URL-encodes it.  Lines 71 through 78 perform the HTTP GET call to the endpoint, and lines 69 through 102 process the XML results when they come back and add them to the options of the pick list.  If you are viewing the page in a Chrome browser, you can see what's going on behind the scenes using the Developer tools that you can access from the menu in the upper right of the Chrome window ("More tools" --> "Developer tools").  Here's what the request looks like:


Here's what the response looks like:


You can see that the results are in XML, which makes the Javascript uglier than it would have to be.  The Javascript would be simpler if the results were retrieved as JSON, but I haven't rewritten the script since our new endpoint was set up.  In line 101 of the Javascript code, the language-specific label gets inserted as the label of the option, but the actual value of the option is set as the IRI that is returned from the endpoint for that particular term.  Thus, the labels inserted into the option list vary depending on the selected language, but the IRI is language-independent.  In this demo page, the IRI is simply displayed on the screen, but in a real application, the IRI would be assigned as the value of a Darwin Core property.  In my opinion, the appropriate property would be dwciri:occurrenceStatus, regardless of whether the property is part of an RDF representation or a CSV file.  Using a dwciri: term implies that the value is a clean and unambiguous IRI.  Using dwc:occurrenceStatus would imply that the value could be any kind of string, with no implication that it was "cleaned" or even appropriate for the term.
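
To give an idea of how little processing the JSON form needs, here is a minimal Python sketch that turns a JSON response into option elements whose label is language-specific and whose value is the term IRI.  The sample response is hand-written in the standard SPARQL JSON results layout and trimmed to one term (a real response would have one binding per term, each with a "def" entry as well).  The function names are hypothetical.

```python
import json
from html import escape

# Hand-written sample, not an actual response from the endpoint
SAMPLE_RESPONSE = """
{
  "head": {"vars": ["label", "def", "term"]},
  "results": {"bindings": [
    {
      "label": {"type": "literal", "xml:lang": "pt", "value": "presente"},
      "term": {"type": "uri", "value": "http://rs.tdwg.org/cv/status/extant"}
    }
  ]}
}
"""

def options_from_results(results):
    """Return (term IRI, label) pairs in the order the endpoint sent them."""
    return [
        (binding["term"]["value"], binding["label"]["value"])
        for binding in results["results"]["bindings"]
    ]

def render_options(options):
    """Render HTML option elements: IRI as the value, label as the text."""
    return "\n".join(
        '<option value="%s">%s</option>' % (escape(iri, quote=True), escape(label))
        for iri, label in options
    )

options = options_from_results(json.loads(SAMPLE_RESPONSE))
print(render_options(options))
```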

You may have noticed that the query also returns the term definition in the target language.  Originally, my intention was that it should appear as a popup when the user moused over the natural language label on the dropdown, but my knowledge of HTML is too weak for me to know how to accomplish that without some digging.  I might add that in the future.

"Data cleaning" application demonstration

I created a second demo page to show how data from the merged graph could be used in data cleaning.  That page is at http://bioimages.vanderbilt.edu/clean.html.  The basic query that it uses is:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX cvstatus: <http://rs.tdwg.org/cv/status/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?term where {
?term rdfs:isDefinedBy cvstatus:.
 {?term skos:prefLabel ?langLabel.FILTER (str(?langLabel) = 'Común')}
UNION
 {?term skos:hiddenLabel 'Común'. }
UNION
 {?term rdf:value 'Común'. }
}

This query is a little more complicated than the last one.  The triple pattern
?term rdfs:isDefinedBy cvstatus:.
limits terms to the appropriate controlled vocabulary.  The rest of the query is composed of the UNION of three graph patterns.  The first pattern:
?term skos:prefLabel ?langLabel.
FILTER (str(?langLabel) = 'Común')
screens the string to be cleaned against all of the preferred labels in any language.  The second pattern:
?term skos:hiddenLabel 'Común'.
checks whether the string to be cleaned is included in the list of non-preferred labels that have been accumulated from real data.  The third pattern:
?term rdf:value 'Común'.
checks if the string to be cleaned is actually one of the preferred, unique text strings associated with any term.  In the Javascript that makes the page run (see http://bioimages.vanderbilt.edu/clean.js for details), the string to be cleaned is inserted into the query from a variable (i.e. a variable substituted in place of 'Común' in the query above.)
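
One thing to watch when inserting a user-supplied string into a query is that quotes or backslashes in the string can break the query or change its meaning.  Here is a minimal Python sketch of the substitution with that escaping added.  It is not a translation of clean.js.  The names sparql_literal and build_clean_query are made up for illustration, and the rdf: prefix is declared explicitly in case the endpoint does not supply it by default.

```python
# Doubled braces are literal braces; {lit} is replaced by the quoted string.
CLEAN_TEMPLATE = """PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX cvstatus: <http://rs.tdwg.org/cv/status/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?term WHERE {{
?term rdfs:isDefinedBy cvstatus:.
 {{?term skos:prefLabel ?langLabel.FILTER (str(?langLabel) = {lit})}}
UNION
 {{?term skos:hiddenLabel {lit}. }}
UNION
 {{?term rdf:value {lit}. }}
}}"""

def sparql_literal(text):
    """Quote text as a single-quoted SPARQL string literal."""
    escaped = (
        text.replace("\\", "\\\\")  # backslashes first, so they are not doubled twice
        .replace("'", "\\'")
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return "'" + escaped + "'"

def build_clean_query(string_to_clean):
    """Return the data-cleaning query for one string."""
    return CLEAN_TEMPLATE.format(lit=sparql_literal(string_to_clean))

print(build_clean_query("Común"))
```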

In this particular case, the string 'Común' was mapped to the concept identified by http://rs.tdwg.org/cv/status/extant, so a match is made by the second of the three graph patterns (the hidden label one).  Here's what the page looks like when it is running with Developer tools turned on:


You can see that the response is a single value wrapped up in a bunch of XML.  Again, things would be simpler if the code were changed to receive JSON.  So in essence, the data cleaning function could be accessed by this "API call":

http://rdf.library.vanderbilt.edu/sparql?query=PREFIX%20rdfs%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3EPREFIX%20cvstatus%3A%20%3Chttp%3A%2F%2Frs.tdwg.org%2Fcv%2Fstatus%2F%3EPREFIX%20skos%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23%3ESELECT%20DISTINCT%20%3Fterm%20where%20%7B%3Fterm%20rdfs%3AisDefinedBy%20cvstatus%3A.%20%7B%3Fterm%20skos%3AprefLabel%20%3FlangLabel.FILTER%20(str(%3FlangLabel)%20%3D%20%27Com%C3%BAn%27)%7DUNION%20%7B%3Fterm%20skos%3AhiddenLabel%20%27Com%C3%BAn%27.%20%7DUNION%20%7B%3Fterm%20rdf%3Avalue%20%27Com%C3%BAn%27.%20%7D%7D

where the string to be cleaned is substituted for "Com%C3%BAn" (urlencoded).

As a practical matter, it would probably not be smart to actually build an application that relied on screening every record by making a call like this to the SPARQL endpoint.  Our endpoint just isn't up to handling that kind of traffic.  It would be more realistic to build an application that made one call at the start of each session that retrieved the whole array that mapped known strings to controlled value IRIs.  A query to accomplish that would be:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX cvstatus: <http://rs.tdwg.org/cv/status/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT DISTINCT ?term ?value where {?term rdfs:isDefinedBy cvstatus:. 
{?term skos:prefLabel ?langLabel.BIND (str(?langLabel) AS ?value)}
UNION 
{?term skos:hiddenLabel ?value. }
UNION 
{?term rdf:value ?value. }
}

Notice that it is basically the same as the previous query, except that the string to be cleaned is represented by the variable ?value instead of being a literal.  In the first pattern, the label string is assigned to ?value with BIND, because a FILTER that compares against a variable that is not yet bound within its group would match nothing.  Here's what the HTTP GET IRI would look like:

https://sparql.vanderbilt.edu/sparql?query=PREFIX%20rdf%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F1999%2F02%2F22-rdf-syntax-ns%23%3E%0APREFIX%20rdfs%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0APREFIX%20cvstatus%3A%20%3Chttp%3A%2F%2Frs.tdwg.org%2Fcv%2Fstatus%2F%3E%0APREFIX%20skos%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23%3E%0A%0ASELECT%20DISTINCT%20%3Fterm%20%3Fvalue%20where%20%7B%3Fterm%20rdfs%3AisDefinedBy%20cvstatus%3A.%20%0A%7B%3Fterm%20skos%3AprefLabel%20%3FlangLabel.BIND%20(str(%3FlangLabel)%20AS%20%3Fvalue)%7D%0AUNION%20%0A%7B%3Fterm%20skos%3AhiddenLabel%20%3Fvalue.%20%7D%0AUNION%20%0A%7B%3Fterm%20rdf%3Avalue%20%3Fvalue.%20%7D%0A%7D

If you send the request header:
Accept: application/sparql-results+json
you will get JSON back instead of XML.  You can see the results at this Gist:  https://gist.github.com/baskaufs/0b1193990bc7182e440ff238cac6e528
These results could be ingested by a data-cleaning application, which could then keep track of newly encountered strings.  A human would have to map each new string to one of the controlled value IRIs, but if those new mappings were added to the list of known variants, they would become available as soon as the graph on the SPARQL endpoint was updated.
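
Here is a minimal Python sketch of what such an application might do with the JSON results: build a lookup table once per session, clean incoming strings against it, and set aside the strings it doesn't recognize for a human to map.  The sample results are hand-written in the SPARQL JSON results layout and include only three of the known strings.  The function names and the unknown string "sighted" are made up for the example.

```python
EXTANT = "http://rs.tdwg.org/cv/status/extant"

# Hand-written stand-in for the JSON returned by the query above
sample_results = {
    "head": {"vars": ["term", "value"]},
    "results": {"bindings": [
        {"term": {"type": "uri", "value": EXTANT},
         "value": {"type": "literal", "value": "extant"}},
        {"term": {"type": "uri", "value": EXTANT},
         "value": {"type": "literal", "value": "Present"}},
        {"term": {"type": "uri", "value": EXTANT},
         "value": {"type": "literal", "value": "Común"}},
    ]},
}

def build_lookup(results):
    """Map each known string to its controlled value IRI."""
    return {
        binding["value"]["value"]: binding["term"]["value"]
        for binding in results["results"]["bindings"]
    }

def clean(strings, lookup):
    """Return (mapping of recognized strings to IRIs, set of unknown strings)."""
    cleaned = {}
    unknown = set()
    for text in strings:
        if text in lookup:
            cleaned[text] = lookup[text]
        else:
            unknown.add(text)  # needs a human to assign it to a term
    return cleaned, unknown

lookup = build_lookup(sample_results)
cleaned, unknown = clean(["Present", "Común", "sighted"], lookup)
print(cleaned)
print(unknown)
```

The matching here is exact, which mirrors the query: variants that differ only in case or in trailing spaces have to be listed as separate hidden labels.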

Where from here?

The applications that I've shown here are just demos.  Developers who are better programmers than I am can incorporate similar code into their own applications to make use of the controlled vocabulary data that TDWG working groups will generate and expose in accordance with the SDS.  Clearly, work flows will need to be established, but once those are set up, there is the potential to automate most of what I've demonstrated here.  The raw data will live on the TDWG GitHub site, probably in the form of CSVs, and the transformation to a machine-readable form will be automated.  There could be one to many SPARQL endpoints exposing those data - one feature of the SDS is that it will be possible for machines to discover the IRIs of vocabulary distributions, including SPARQL endpoints.  So if one endpoint goes down or is replaced, a machine will be able to automatically switch over to a different one.

[1] SKOS Simple Knowledge Organization System Reference.  W3C Recommendation. https://www.w3.org/TR/skos-reference/
[2] Internationalized Resource Identifiers (IRIs). RFC 3987. The Internet Engineering Task Force (IETF). https://tools.ietf.org/html/rfc3987

Thursday, March 30, 2017

Why I decided to vote for the union

This is my 32nd blog post and it's the first time I've written about my personal life.  Despite the technical nature of the previous 31 posts (well, maybe the Toilet Paper Apocalypse one doesn't count in that category), this is the hardest one for me to write.

For the past several weeks, my employer, Vanderbilt University, has been at battle with the Service Employees International Union (SEIU) and a group of non-tenure track faculty who are trying to organize a union.  I have had very mixed emotions about how I felt about this.  On the one hand, after teaching at Vanderbilt for almost eighteen years, I'm relatively secure in my job and it wasn't clear to me that there would be any particular advantage to me to be part of a union (I'm a Senior Lecturer, one of the ranks of non-tenure track faculty included in the unionization proposal).  I've spent a considerable amount of time during those weeks trying to inform myself about what it would mean to be part of a faculty union.  I've been asked to be part of a Faculty Senate panel discussing the unionization proposal this afternoon, and spent time last night trying to decide what I would say during the three minutes that I've been allocated to explain my position on the issue.  As part of my deliberations last night, I spent a couple of hours reading old emails from my first two years teaching at Vanderbilt.  I'm an obsessive email filer.  I have most of the emails I've received since 1995 filed in topical and chronological folders, so it didn't take me long to find the relevant emails.

It's hard for me to describe what the experience of reading those emails was like.  Although I've kept all of those emails for years, I have avoided ever reading them again because I knew the experience would be disturbing to me.  It was sort of like ripping a scab off of a mostly healed wound, but that doesn't capture the intensity of the emotions that it raised.

General science class in 1983, Ridgeway, Ohio

Background

I grew up in a rural part of Ohio in a conservative Republican family that was always very anti-union.  So that has predisposed me to have a negative outlook on unions.  After I graduated with my undergraduate degree in 1982, I spent the next ten years teaching high school.  I taught in a variety of schools: a rural school in Ohio for one year, a public school in Swaziland (Africa) for three years, and a school in rural/suburban Tennessee for six years.  The classes I taught included chemistry, physics, biology, physical science, general science, math, and computer programming.

Physical science class in 1985, Mzimpofu, Swaziland

Despite the variety of locations, the schools actually had a lot in common.  When I arrived at each of those schools, they had little or no science equipment and I spent years trying to figure out how to get enough equipment to teach my lab classes in a way that was engaging to the students.  At those schools, I served in a variety of roles, including department chair, chair of the faculty advisory committee, student teacher supervisor, choir director, and adviser of research and environmental clubs.

Physics class in 1991, Kingston Springs, Tennessee

By the end of my time teaching high school, I had amassed a number of teaching credentials and awards, including scoring in the 99th percentile for professional knowledge on the National Teacher Exam, achieving the highest level (Role Model Teacher) on the grueling Tennessee Career Ladder certification program, and being named Teacher of the Year at the school level in 1990 and on the county level in 1992.

In 1993, I decided that I wanted to take on a different challenge: entering a Ph.D. program in the biology department at Vanderbilt.  Over the next six years, I took graduate classes, carried out research, served as a teaching assistant in the biology labs for ten semesters, and had a major role in managing the life of our family while my wife worked towards tenure at a nearby university.  By August 1999, I had defended my dissertation and was on the market looking for a teaching job on the college level.


Being a part-time Lecturer at Vanderbilt

In the fall of 1999, I was writing papers from my dissertation and trying to figure out how to get a job within commuting distance of my wife's university.  By that time, she had tenure, which complicated the process.  At the last minute, there was an opening for a half-time lecturer position, teaching BIOL 101, a non-majors biology service course for education majors in Vanderbilt's Peabody College of Education.  It seemed like this was the ideal class for me with my background teaching high school for many years.  It was rather daunting because I got the job a few days before the semester started.  I had to scramble to put together a syllabus and try to keep up with planning class sessions, developing labs, and writing lectures.  But I'd done this three times before when I had started at my three different high schools, so I knew I could do it if I threw myself into the job.

I had naively assumed that my job was to teach these students biology, uphold the academic standards that I had always cared about, and enforce the College rules about things like class attendance.  It is striking to me as I look through the emails from that semester how many of them involved missed classes, and complaints about grades and the workload.  Here's an example:

Professor Baskauf,
I am not able to be in class tomorrow because my flight leaves a 2:30 pm.  Both of my friday classes were cancelled, so I am going home tomorrow.  However, the later flights were too late for my parents in that my flight is a long one and the airport is 2hrs from my house.  I apologize for missing class.

Within a month, it was clear that the students were unhappy about my expectations for them.  I had a conversation with my chair about the expectations for the class, which I was beginning to think must differ radically from what I had anticipated.  Here's an email I got on October 25 from my department chair:

I had a meeting with the Peabody folks with regard to BSCI 101 and what the purpose of the course is.  It appears that they have had little or no contact with any of the instructors for the course for several years and really have no idea of the current course content and organization.  At some point, I'd like to set up a meeting with you, the relevant Peabody folks, and myself to make sure we all understand why we offer BSCI 101 at all; and, since it is a service course, to make sure we are all on the same page with regard to content and structure.  From our discussion last Friday, I think what they expect is not much different from what we would do, but is perhaps a little different from the course as it evolved in the Molecular Biology Department.  I'd like to send them a syllabus to look over, then we'll try to set up a time for discussion.  Could you get me a copy of the syllabus you're using this year?

I tried to adjust the content and format of the course to make it more relevant to education majors and pushed through to the end of the semester.  I had a number of discussions with my TA about how we could work to make the labs more engaging and made plans on how I was going to improve the course in the spring semester, which I had already been scheduled to teach.

On January 3, 2000, I found out from my chair that Dean Infante had examined my student evaluations and decided to fire me.  He had never been in my class (actually, no administrator or other faculty member had ever set foot in my class) and as far as I know, he had no idea what I had actually done in the class.  He just decided that my student evaluation numbers were too low.  I was a "bad" teacher and Vanderbilt wasn't going to let me teach that class again.

With my past record of 15 years of excellent teaching, this was a crushing blow to me emotionally.  I'm normally a really optimistic person, but on that day I had a glimmering of what it must feel like to be clinically depressed.  I could hardly make myself get out of bed.  In addition to the emotional toll, I now had two little kids to help support - we had been planning on the income from my teaching and we were also looking at losing our day care at Vanderbilt if I were no longer employed.

Fortunately, my department chair went to bat for me.  Ironically, the appeal that he made to the dean was NOT that I was hard working, or innovative, or that I had high standards for my students.  He took my student evaluation numbers from when I was a TA in the bio department to the Dean's office and convinced them that those numbers showed that the new student evaluation numbers were an outlier.  Although I didn't know it at the time, I was apparently on some kind of probation - the department was supposed to be monitoring me to make sure that I wasn't still being a "bad" teacher.

In the second semester that I taught the 101 class, I took extreme precautions to be very accessible to students.  I emailed all of the students who didn't do well on tests and asked them if they wanted to meet with me.  We did a self-designed project to investigate what it took to build a microcosm ecosystem.  We went on an ecology field trip to a local park and an on-campus field trip to visit research labs that were using zebrafish and fruit flies as model organisms to study genetics and development.  I think the students were still unhappy with their grades and my expectations for workload, but apparently their evaluations were good enough for me to be hired as a full-time Lecturer in the fall.


Being a full-time Lecturer at Vanderbilt

The faculty member who had previously been the lab coordinator for the Intro to Biological Sciences labs for majors was leaving that position to take a different teaching job in the department.  The chair of the new Biological Sciences Department (formed by the merger of my biology department and the molecular biology department) contacted me about "going into the breach", as he phrased it, and taking over as lab coordinator.  I had actually been a TA five times for the semester of that course dealing with ecology and evolution (my specialty), so I was well acquainted with that teaching lab.  Having had no success in getting a tenure-track job at any college within commuting distance, I took the offer of a one year appointment, assuming that I could do the job until I got a better position somewhere else.

When I started the job, I really had very little idea what my job responsibilities were supposed to be.  I was supposed to "coordinate labs".  The job expectations were never communicated to me beyond that.  Unfortunately, the focus of the course during my first semester was the half of the course that dealt with molecular biology, which I had never studied and for which I had never served as a TA.  Things did not go well.  For starters, the long-term lab manager discovered that she had cancer and missed long stretches of work for her treatments.  Fortunately, I was allowed to hire a temporary staff person with a masters related to molecular biology.  I spent much of the semester in the prep room with her trying to figure out why our cloning and other experiments weren't working as they should.  A major part of my job responsibilities was to supervise the TAs and manage the grades and website for both my class and the lecture part of the course.  I spent almost no time in the classroom with the students - I wasn't aware that that was actually supposed to be a part of the job.

At the end of the first semester, I was relieved to have managed to pull off the whole series of labs with some degree of success and was looking forward to the ecology and evolution semester, with which I was very familiar.  However, I was shocked to discover that I was actually going to be subject to student evaluations again.  Apparently, there was some college rule that everyone who is in a faculty position has to be evaluated by students.  In January, I ran into my chair and he commented that we would have to get my student evaluations up in the coming semester.  Oh, and by the way, the grades were also too high for the course.  I was going to have to increase the rigor of the course to bring them down to what was considered a reasonable range for lab courses in the department.

At that point in time, the lab grades were included with the lecture grades to form a single grade for the course.  The tenure track faculty involved in the lecture part of the course decided that a range of B to B- was a reasonable target for the lab portion of the course, so it fell on me to structure the grading in the course in a way that the grades would fall into that range.  At that time, the largest component of the grade was lab reports, which I found to be graded by the TAs in a very arbitrary and capricious manner.  In the spring semester, I replaced lab reports with weekly problem sets, and replaced lightly-weighted occasional quizzes with regular lab tests that formed half of the grade of the course.  I made the tests difficult enough to lower the grade to the target range, but it was clear to the students that I was to blame for creating the tests that were killing their GPAs.

In the second semester, I made it a point to be in the lab during every section to ask students if they had questions or needed help.  That did a lot to improve the students' impressions of me as compared to the fall.  But in late March, I was blindsided by another unanticipated event: fraternity formals.  Students had previously asked me to excuse them from lab on Fridays to leave early for spring break or to go out of town for family visits.  I had been consistently enforcing the College's policy on excused absences, which said unequivocally "conflicts arising from personal travel plans or social obligations are not regarded as occasions that qualify as an excused absence" and made them take zeros for their work on days when they missed class for those reasons.  Obviously students were not happy about this, but the situation came to a head when students started asking to reschedule class to go out of town for fraternity formals.  I had gone to a school that didn't have fraternities, and I'd never heard of a fraternity formal.  When I found out that fraternity formals involved skipping school to go out of town to go to a party, I told them that I couldn't consider that an excused absence under the college's policy on attendance.  The students were furious.  They had spent money on plane tickets and tuxedos and now I was forcing them to choose between class and going to their party.  An exchange with two of the students ended up in us walking over to Dean Eickmeier's office where he confirmed that my decision was consistent with college policy.  In some cases, students opted to come to class.  One student brought me an apparently fake medical excuse.  Others took the zero and went to the party.  One student said that I "did not have any compassion that a normal human would have" and threatened that he was going to write a scathing article about me in the Hustler (the student newspaper).  Another student said that he was going to "get me" on the evaluations.  
Alarmed, and given my previous bad experience with student evaluations, I documented the incidents in an email to my chair.

Despite these bumps in the road, my evaluations were better in the spring semester, and I was anticipating being reappointed again for 2001-02.  I did request that my chair include in the request for my reappointment a copy of my email detailing the incidents involving the unhappy students with excused absences.  On May 23rd, I received this ominous email from my chair:
 I have received a response from Dean Venable to my recommendation for your reappointment.  He has agreed to reappoint you, but he has placed some conditions on the reappointment that we need to discuss.  I would like to do that today, if possible.  I have a flexible schedule until mid-afternoon.
I went to meet with the chair, and he gave me a copy of the letter from Dean Venable.  You can read it yourself:
Once again, a Dean sitting in his office poring over computer printouts had decided that I was a bad teacher based solely on student evaluations.  No personal discussion with me about the class, no classroom observations ever by any administrator, no examination of the course materials or goals.  Worse yet, he chose to cherry-pick the written comments to emphasize the negative ones.  By my count, 11% of students made negative comments about my teaching style, while 24% made positive comments about it.  Here are some of the comments from the spring semester Dean Venable chose to ignore:

Dr. Baskauf is very good at instructing the class.  He is easy to understand and teaches the material well so that we understand what he is saying.  I think that Dr. Baskauf would be a better help to the lecture however, since the lecture class is more important to students and worth 3 hours.  I wish I could have had him for a professor in lecture as well as lab.
Dr. Baskauf really puts forth an aire of knowledge.  He was always willing to help with any problems that we were having with our labs, in and out of class, while not just telling us the answers, but nudging us along while we figured it out for ourselves.  Whats more important is that he seems to really love the material and teaching the class which makes the experience that much better and makes it much easer to learn.
Dr. Baskauf was always well prepared for the lab.  This was very helpful because he could always give a concise overview that I could understand.  The powerpoint presentations were a great idea as they really helped me to follow his instructions better.  He is very friendly and always willing to help when I had questions.
Bascauf is very good at explaining and communicating with the class.  he is very helpful as well.
Dr. Baskauf made this lab one of the most enjoyable and challenging classes I have yet taken at Vanderbilt.  He was especially willing to help students to better understand the value of what they were learning.
Baskauf was always very well prepared for lab.  He obviously put a lot of work into setting everything up.  He always had very clear tutorials and lectures.
Dr. Baskauf created a challenging and stimulating environment for learning about biological experiments.  Although many aspects of the lab were tough, Dr. Baskauf was able to understand how difficult is was for the class as a whole.  He opened himself up to adapting to our needs.  His approach to teaching is something I have yet to experience elsewhere at Vanderbilt.  I hope to have him as an instructor at some further point in my career here at Vanderbilt.
The sentence for my crime was:
  • to be mentored by a senior faculty member
  • to work with the Center for Teaching to improve my lecturing style and interpersonal skills, and
  • to be subjected to an extra mid-semester student evaluation

Oh, yes - and no pay raise.  These were all necessary to bring me "up to the teaching standards required by the College", with the threat that I would be fired if I didn't improve.  

So, for a second time, I had been flagged with the scarlet letter of "bad teacher" based solely on student evaluations.  Again, I was angry at the injustice and incredibly demoralized.  I really wanted to just quit at Vanderbilt, but I really needed the job.  So I swallowed my pride and completed my sentence.  I was "mentored" the next year by Bubba Singleton (later my highly supportive department chair), who was extremely helpful in helping me figure out ways to structure the class so that I could maintain my academic standards while also keeping students happy enough that I didn't get fired again.  

Life as a Senior Lecturer

Ever since that time, I've maintained an Excel spreadsheet with a graph of my student evaluations, which I check each year to ensure that I'm not heading into the danger zone.  Despite the fact that I don't really "teach" in the usual sense (most of my work involves curriculum development, wrangling online course management systems, recruiting and mentoring TAs, supervising staff, and handling administrative problems), I've managed to keep the student evaluation numbers at an acceptable level.  I've instituted open-ended research projects in the course (which, by the way, caused my evaluation numbers to plunge the year they were introduced), continued to introduce the latest educational technology into the course, and continued to revise and update labs as biology evolves.  In 2002, I was promoted to Senior Lecturer (which comes with a three-year appointment) and in 2010 I received the Harriet S. Gilliam Award for Excellence in Teaching by a Senior Lecturer. 

The Harriet S. Gilliam Award silver cup, with the letter from Dean Venable that I store inside it

So I think that most people at Vanderbilt now think I'm a good teacher.  But student evaluations and the threat of being fired based on student evaluations hang over me like a Sword of Damocles every time I'm up for re-appointment.  

After I wrote this, I seriously considered deleting the whole thing.  Even after all of these years, for a veteran teacher, being fired and being sentenced to remedial help for "bad teaching" is an embarrassment.  I feel embarrassed, even though I know that I was just as good a teacher at that time as I was before and after.  But I think it's important for people to know what it feels like to be treated unjustly in a situation where there is a huge power imbalance - where a powerful person you've never met passes judgment on your teaching based on student evaluations rather than by observing your classroom.

Do non-tenure track faculty at Vanderbilt need a union?

When the unionization question came up, I have to say that I was pretty skeptical about it.  When I taught high school, I was a member of the National Education Association, which functioned something like a union, but I had mostly thought of it as a professional association.  My upbringing predisposed me to thinking negatively about unions.  Given my current relatively stable position, it wasn't clear to me that it was in my interest to be part of a union.

However, as I started investigating where non-tenure track faculty stand in the College of Arts and Sciences at Vanderbilt, it was clear to me that we are actually just as powerless as I had always considered us to be.  Although non-tenure track faculty constitute 38% of the faculty of A&S, they are banned from election to Faculty Senate and have been improperly disenfranchised from voting for Faculty Senate for at least ten years. (I have never been given the opportunity to vote.) See Article IV, Section 2 of the CAS Constitution for details. Non-tenure track faculty are not eligible for election to the A&S Faculty Council, nor are they allowed to vote for Faculty Council representatives (Article II, Section 1, Part B).  At College-wide faculty meetings, full-time Senior Lecturers have a vote, but all non-tenure track faculty are allowed to vote on an issue only when the Dean decides that the matter is related to their assigned duties (Article I, Section I, Part C).  The Provost's Office insists that non-tenure track faculty participate in University governance through participation in University Committees, but my analysis shows that appointment to University Committees is greatly skewed towards tenure-track faculty, with only three non-tenure track faculty actually sitting on those committees (one each on Religious Affairs, Athletics, and Chemical Safety).  The Shared Governance Committee, charged last fall with suggesting changes and improvements in the future, does not include a single non-tenure track member of A&S (and only one non-tenure track member at all - from the Blair School of Music).  We really have virtually no voice in the governance of the College or the University.  

We also have no influence over the process of our re-appointment, or how much we are paid.  Prior to our reappointment, we submit a dossier.  Then months later, we either do or don't get a reappointment letter with a salary number and a place to sign.  If we don't like the number, we can quit.  In a previous reappointment cycle, I suggested to my chair that it would be fair for me to be paid what I would receive if I were to leave Vanderbilt and teach in the Metro Nashville public school system.  At that time, with a Ph.D. and the number of years of teaching experience that I had, my salary in the public schools would have been about $10k more than what I was getting at Vanderbilt.  I think that at that time I actually still had a valid Tennessee high school teaching license, so it would have been a real possibility for me.  They did give me something like an additional 1% pay increase over the base increase for that year, but I've never gotten parity with what I would receive as a public high school teacher (let alone what I would earn teaching at a private school).  That's particularly ironic, given that the number of students and teaching assistants I supervise has gone up by about 60% since I started in the job (with no corresponding increase in pay), to over 400 students and 12 to 20 TAs per semester, plus three full-time staff.  This is a much greater responsibility than I would have if I were teaching high school.  The reason that I was given for not being granted parity in pay with the public schools was that the college couldn't afford that much.  I like my job and I enjoy working with my students and TAs, so I probably won't quit to go back to teaching high school.  But it seems really unfair to me and I'm powerless to change the situation.

Currently, I'm up for re-appointment with a promotion to Principal Senior Lecturer.  That might result in a pay increase, but there is no transparency about the decision-making process in the Dean's office.  Some day later this year, I'll probably get a letter offering me an appointment with some salary number on it.  Or not.  

The Provost's Office website has a list of frequently asked questions whose answers insinuate that the union will probably lie to us, and may negotiate away our benefits without consulting with us.  I will admit that I was a bit concerned about the negative effects of unionization when the issue first came up.  However, I contacted some senior lecturers at Duke and University of Chicago to ask them how the contract negotiating process was going at their schools.  It was clear to me that the negotiating teams at those schools (composed of  non-tenure track faculty from the schools themselves) were very attuned to the concerns of the colleagues they were representing, and that they had no intention of negotiating away important benefits that they already had.  Mostly, it just looked like a huge amount of time and work on their part.  But it definitely was not the apocalypse - for most people at those schools, life goes on as normal.

Now that I'm a relatively high ranking non-tenure track faculty member with reasonable job security, it seems unlikely that I will personally derive a large benefit from unionization.  But given that I have virtually no influence or negotiating power within the university, it is very hard for me to see what I have to lose by being part of the union.  More importantly, as I re-read the emails from my first painful years of teaching at Vanderbilt, it was evident to me that part-time faculty and faculty with one year appointments are particularly vulnerable to the whims of upper-level administration.  I have always been fortunate to have department chairs who supported me vigorously and went to bat for me when I needed it.  But there is no guarantee that will happen in the future, or in other departments.  For whatever faults there may be in having a union, it will provide a degree of protection and transparency that has been completely lacking for non-tenure track faculty at Vanderbilt.  And that's the primary reason why, if offered the chance, I'm planning to vote "yes" on unionization.