Thursday, March 30, 2017

Why I decided to vote for the union

This is my 32nd blog post and it's the first time I've written about my personal life.  Despite the technical nature of the previous 31 posts (well, maybe the Toilet Paper Apocalypse one doesn't count in that category), this is the hardest one for me to write.

For the past several weeks, my employer, Vanderbilt University, has been at battle with the Service Employees International Union (SEIU) and a group of non-tenure track faculty who are trying to organize a union.  I have had very mixed feelings about it.  On the one hand, after teaching at Vanderbilt for almost eighteen years, I'm relatively secure in my job and it wasn't clear that there would be any particular advantage to me in being part of a union (I'm a Senior Lecturer, one of the ranks of non-tenure track faculty included in the unionization proposal).  I've spent a considerable amount of time during those weeks trying to inform myself about what it would mean to be part of a faculty union.  I've been asked to be part of a Faculty Senate panel discussing the unionization proposal this afternoon, and spent time last night trying to decide what I would say during the three minutes that I've been allocated to explain my position on the issue.  As part of my deliberations last night, I spent a couple of hours reading old emails from my first two years teaching at Vanderbilt.  I'm an obsessive email filer.  I have most of the emails I've received since 1995 filed in topical and chronological folders, so it didn't take me long to find the relevant emails.

It's hard for me to describe what the experience of reading those emails was like.  Although I've kept all of those emails for years, I have avoided ever reading them again because I knew the experience would be disturbing to me.  It was sort of like ripping a scab off of a mostly healed wound, but that doesn't capture the intensity of the emotions that it raised.

General science class in 1983, Ridgeway, Ohio

Background

I grew up in a rural part of Ohio in a conservative Republican family that was always very anti-union.  So that has predisposed me to have a negative outlook on unions.  After I graduated with my undergraduate degree in 1982, I spent the next ten years teaching high school.  I taught in a variety of schools: a rural school in Ohio for one year, a public school in Swaziland (Africa) for three years, and a school in rural/suburban Tennessee for six years.  The classes I taught included chemistry, physics, biology, physical science, general science, math, and computer programming.
Physical science class in 1985, Mzimpofu, Swaziland
Despite the variety of locations, the schools actually had a lot in common.  When I arrived at each of those schools, they had little or no science equipment and I spent years trying to figure out how to get enough equipment to teach my lab classes in a way that was engaging to the students.  At those schools, I served in a variety of roles, including department chair, chair of the faculty advisory committee, student teacher supervisor, choir director, and adviser of research and environmental clubs.
Physics class in 1991, Kingston Springs, Tennessee
By the end of my time teaching high school, I had amassed a number of teaching credentials and awards, including scoring in the 99th percentile for professional knowledge on the National Teacher Exam, achieving the highest level (Role Model Teacher) on the grueling Tennessee Career Ladder certification program, and being named Teacher of the Year at the school level in 1990 and on the county level in 1992.

In 1993, I decided that I wanted to take on a different challenge: entering a Ph.D. program in the biology department at Vanderbilt.  Over the next six years, I took graduate classes, carried out research, served as a teaching assistant in the biology labs for ten semesters, and had a major role in managing the life of our family while my wife worked towards tenure at a nearby university.  By August 1999, I had defended my dissertation and was on the market looking for a teaching job on the college level.


Being a part-time Lecturer at Vanderbilt

In the fall of 1999, I was writing papers from my dissertation and trying to figure out how to get a job within commuting distance of my wife's university.  By that time, she had tenure, which complicated the process.  At the last minute, there was an opening for a half-time lecturer position, teaching BIOL 101, a non-majors biology service course for education majors in Vanderbilt's Peabody College of Education.  It seemed like this was the ideal class for me with my background teaching high school for many years.  It was rather daunting because I got the job a few days before the semester started.  I had to scramble to put together a syllabus and try to keep up with planning class sessions, developing labs, and writing lectures.  But I'd done this three times before when I had started at my three different high schools, so I knew I could do it if I threw myself into the job.

I had naively assumed that my job was to teach these students biology, uphold the academic standards that I had always cared about, and enforce the College rules about things like class attendance.  It is striking to me as I look through the emails from that semester how many of them involved missed classes, and complaints about grades and the workload.  Here's an example:
Professor Baskauf,
I am not able to be in class tomorrow because my
flight leaves a 2:30 pm.  Both of my friday classes were
cancelled, so I am going home tomorrow.  However, the later
flights were too late for my parents in that my flight is a
long one and the airport is 2hrs from my house.  I
apologize for missing class.
Within a month, it was clear that the students were unhappy about my expectations for them.  I had a conversation with my chair about the expectations for the class, which I was beginning to think must differ radically from what I had anticipated.  Here's an email I got on October 25 from my department chair:
I had a meeting with the Peabody folks with regard to BSCI 101 and what the
purpose of the course is.  It appears that they have had little or no
contact with any of the instructors for  the course for several years and
really have no idea of the current course content and organization.  At some
point, I'd like to set up a meeting with you, the relevant Peabody folks,
and myself to make sure we all understand why we offer BSCI 101 at all; and,
since it is a service course, to make sure we are all on the same page with
regard to content and structure.  From our discussion last Friday, I think
what they expect is not much different from what we would do, but is perhaps
a little different from the course as it evolved in the Molecular Biology
Department.  I'd like to send them a syllabus to look over, then we'll try
to set up a time for discussion.  Could you get me a copy of the syllabus
you're using this year?
I tried to adjust the content and format of the course to make it more relevant to education majors and pushed through to the end of the semester.  I had a number of discussions with my TA about how we could work to make the labs more engaging and made plans on how I was going to improve the course in the spring semester, which I had already been scheduled to teach.

On January 3, 2000, I found out from my chair that Dean Infante had examined my student evaluations and decided to fire me.  He had never been in my class (actually, no administrator or other faculty member had ever set foot in my class) and as far as I know, he had no idea what I had actually done in the class.  He just decided that my student evaluation numbers were too low.  I was a "bad" teacher and Vanderbilt wasn't going to let me teach that class again.

With my past record of 15 years of excellent teaching, this was a crushing blow to me emotionally.  I'm normally a really optimistic person, but on that day I had a glimmer of what it must feel like to be clinically depressed.  I could hardly make myself get out of bed.  In addition to the emotional toll, I now had two little kids to help support - we had been planning on the income from my teaching and we were also looking at losing our day care at Vanderbilt if I were no longer employed.

Fortunately, my department chair went to bat for me.  Ironically, the appeal that he made to the dean was NOT that I was hard working, or innovative, or that I had high standards for my students.  He took my student evaluation numbers from when I was a TA in the bio department to the Dean's office and convinced them that those numbers showed that the new student evaluation numbers were an outlier.  Although I didn't know it at the time, I was apparently on some kind of probation - the department was supposed to be monitoring me to make sure that I wasn't still being a "bad" teacher.

In the second semester that I taught the 101 class, I took extreme precautions to be very accessible to students.  I emailed all of the students who didn't do well on tests and asked them if they wanted to meet with me.  We did a self-designed project to investigate what it took to build a microcosm ecosystem.  We went on an ecology field trip to a local park and an on-campus field trip to visit research labs that were using zebrafish and fruit flies as model organisms to study genetics and development.  I think the students were still unhappy with their grades and my expectations for workload, but apparently their evaluations were good enough for me to be hired as a full-time Lecturer in the fall.


Being a full-time Lecturer at Vanderbilt

The faculty member who had previously been the lab coordinator for the Intro to Biological Sciences labs for majors was leaving that position to take a different teaching job in the department.  The chair of the new Biological Sciences Department (formed by the merger of my biology department and the molecular biology department) contacted me about "going into the breach" as he phrased it, and taking over as lab coordinator.  I had actually been a TA five times for the semester of that course dealing with ecology and evolution (my specialty).  So I was well acquainted with that teaching lab.  Having had no success in getting a tenure-track job at any college within commuting distance, I took the offer of a one-year appointment, assuming that I could do the job until I got a better position somewhere else.

When I started the job, I really had very little idea what my job responsibilities were supposed to be.  I was supposed to "coordinate labs".  The job expectations were never communicated to me beyond that.  Unfortunately, the focus of the course during my first semester was the half of the course that dealt with molecular biology, which I had never studied and for which I had never served as a TA.  Things did not go well.  For starters, the long-term lab manager discovered that she had cancer and missed long stretches of work for her treatments.  Fortunately, I was allowed to hire a temporary staff person with a master's degree related to molecular biology.  I spent much of the semester in the prep room with her trying to figure out why our cloning and other experiments weren't working as they should.  A major part of my job responsibilities was to supervise the TAs and manage the grades and website for both my class and the lecture part of the course.  I spent almost no time in the classroom with the students - I wasn't aware that that was actually supposed to be a part of the job.

At the end of the first semester, I was relieved to have managed to pull off the whole series of labs with some degree of success and was looking forward to the ecology and evolution semester, with which I was very familiar.  However, I was shocked to discover that I was actually going to be subject to student evaluations again.  Apparently, there was some college rule that everyone who is in a faculty position has to be evaluated by students.  In January, I ran into my chair and he commented that we would have to get my student evaluations up in the coming semester.  Oh, and by the way, the grades were also too high for the course.  I was going to have to increase the rigor of the course to bring them down to what was considered a reasonable range for lab courses in the department.

At that point in time, the lab grades were included with the lecture grades to form a single grade for the course.  The tenure track faculty involved in the lecture part of the course decided that a range of B to B- was a reasonable target for the lab portion of the course, so it fell on me to structure the grading in the course in a way that the grades would fall into that range.  At that time, the largest component of the grade was lab reports, which I found to be graded by the TAs in a very arbitrary and capricious manner.  In the spring semester, I replaced lab reports with weekly problem sets, and replaced lightly-weighted occasional quizzes with regular lab tests that formed half of the grade of the course.  I made the tests difficult enough to lower the grade to the target range, but it was clear to the students that I was to blame for creating the tests that were killing their GPAs.

In the second semester, I made it a point to be in the lab during every section to ask students if they had questions or needed help.  That did a lot to improve the students' impressions of me as compared to the fall.  But in late March, I was blindsided by another unanticipated event: fraternity formals.  Students had previously asked me to excuse them from lab on Fridays to leave early for spring break or to go out of town for family visits.  I had been consistently enforcing the College's policy on excused absences, which said unequivocally "conflicts arising from personal travel plans or social obligations are not regarded as occasions that qualify as an excused absence" and made them take zeros for their work on days when they missed class for those reasons.  Obviously students were not happy about this, but the situation came to a head when students started asking to reschedule class to go out of town for fraternity formals.  I had gone to a school that didn't have fraternities, and I'd never heard of a fraternity formal.  When I found out that fraternity formals involved skipping school to go out of town to go to a party, I told them that I couldn't consider that an excused absence under the college's policy on attendance.  The students were furious.  They had spent money on plane tickets and tuxedos and now I was forcing them to choose between class and going to their party.  An exchange with two of the students ended with us walking over to Dean Eickmeier's office, where he confirmed that my decision was consistent with college policy.  In some cases, students opted to come to class.  One student brought me an apparently fake medical excuse.  Others took the zero and went to the party.  One student said that I "did not have any compassion that a normal human would have" and threatened that he was going to write a scathing article about me in the Hustler (the student newspaper).  Another student said that he was going to "get me" on the evaluations.  Alarmed, and given my previous bad experience with student evaluations, I documented the incidents in an email to my chair.

Despite these bumps in the road, my evaluations were better in the spring semester, and I was anticipating being reappointed again for 2001-02.  I did request that my chair include in the request for my reappointment a copy of my email detailing the incidents involving the unhappy students with excused absences.  On May 23rd, I received this ominous email from my chair:
 I have received a response from Dean Venable to my recommendation for your reappointment.  He has agreed to reappoint you, but he has placed some conditions on the reappointment that we need to discuss.  I would like to do that today, if possible.  I have a flexible schedule until mid-afternoon.
I went to meet with the chair, and he gave me a copy of the letter from Dean Venable.  You can read it yourself:
Once again, a Dean sitting in his office poring over computer printouts had decided that I was a bad teacher based solely on student evaluations.  No personal discussion with me about the class, no classroom observations ever by any administrator, no examination of the course materials or goals.  Worse yet, he chose to cherry-pick the written comments to emphasize the negative ones.  By my count, 11% of students made negative comments about my teaching style, while 24% made positive comments about it.  Here are some of the comments from the spring semester that Dean Venable chose to ignore:

Dr. Baskauf is very good at instructing the class.  He is easy to understand and teaches the material well so that we understand what he is saying.  I think that Dr. Baskauf would be a better help to the lecture however, since the lecture class is more important to students and worth 3 hours.  I wish I could have had him for a professor in lecture as well as lab.
Dr. Baskauf really puts forth an aire of knowledge.  He was always willing to help with any problems that we were having with our labs, in and out of class, while not just telling us the answers, but nudging us along while we figured it out for ourselves.  Whats more important is that he seems to really love the material and teaching the class which makes the experience that much better and makes it much easer to learn.
Dr. Baskauf was always well prepared for the lab.  This was very helpful because he could always give a concise overview that I could understand.  The powerpoint presentations were a great idea as they really helped me to follow his instructions better.  He is very friendly and always willing to help when I had questions.
Bascauf is very good at explaining and communicating with the class.  he is very helpful as well.
Dr. Baskauf made this lab one of the most enjoyable and challenging classes I have yet taken at Vanderbilt.  He was especially willing to help students to better understand the value of what they were learning.
Baskauf was always very well prepared for lab.  He obviously put a lot of work into setting everything up.  He always had very clear tutorials and lectures.
Dr. Baskauf created a challenging and stimulating environment for learning about biological experiments.  Although many aspects of the lab were tough, Dr. Baskauf was able to understand how difficult is was for the class as a whole.  He opened himself up to adapting to our needs.  His approach to teaching is something I have yet to experience elsewhere at Vanderbilt.  I hope to have him as an instructor at some further point in my career here at Vanderbilt.
The sentence for my crime was:
  • to be mentored by a senior faculty member
  • to work with the Center for Teaching to improve my lecturing style and interpersonal skills, and
  • to be subjected to an extra mid-semester student evaluation

Oh, yes - and no pay raise.  These were all necessary to bring me "up to the teaching standards required by the College", with the threat that I would be fired if I didn't improve.  

So, for a second time, I had been flagged with the scarlet letter of "bad teacher" based solely on student evaluations.  Again, I was angry at the injustice and incredibly demoralized.  I really wanted to just quit at Vanderbilt, but I really needed the job.  So I swallowed my pride and completed my sentence.  I was "mentored" the next year by Bubba Singleton (later my highly supportive department chair), who was extremely helpful in helping me figure out ways to structure the class so that I could maintain my academic standards while also keeping students happy enough that I didn't get fired again.  

Life as a Senior Lecturer

Ever since that time, I've maintained an Excel spreadsheet with a graph of my student evaluations, which I check each year to ensure that I'm not heading into the danger zone.  Despite the fact that I don't really "teach" in the usual sense (most of my work involves curriculum development, wrangling online course management systems, recruiting and mentoring TAs, supervising staff, and handling administrative problems), I've managed to keep the student evaluation numbers at an acceptable level.  I've incorporated open-ended research projects into the course (which, by the way, caused my evaluation numbers to plunge the year they were introduced), continued to introduce the latest educational technology into the course, and continued to revise and update labs as biology evolves.  In 2002, I was promoted to Senior Lecturer (which comes with a three-year appointment) and in 2010 I received the Harriet S. Gilliam Award for Excellence in Teaching by a Senior Lecturer.

The Harriet S. Gilliam Award silver cup, with the letter from Dean Venable that I store inside it

So I think that most people at Vanderbilt now think I'm a good teacher.  But student evaluations and the threat of being fired based on them hang over me like a Sword of Damocles every time I'm up for re-appointment.

After I wrote this, I seriously considered deleting the whole thing.  Even after all of these years, for a veteran teacher, being fired and being sentenced to remedial help for "bad teaching" is an embarrassment.  I feel embarrassed, even though I know that I was just as good a teacher at that time as I was before and after.  But I think it's important for people to know what it feels like to be treated unjustly in a situation where there is a huge power imbalance - where a powerful person you've never met passes judgment on your teaching based on student evaluations rather than by observing your classroom.

Do non-tenure track faculty at Vanderbilt need a union?

When the unionization question came up, I have to say that I was pretty skeptical about it.  When I taught high school, I was a member of the National Education Association, which functioned something like a union, but I had mostly thought of it as a professional association.  My upbringing predisposed me to thinking negatively about unions.  Given my current relatively stable position, it wasn't clear to me that it was in my interest to be part of a union.

However, as I started investigating where non-tenure track faculty stand in the College of Arts and Sciences at Vanderbilt, it was clear to me that we are actually just as powerless as I had always considered us to be.  Although non-tenure track faculty constitute 38% of the faculty of A&S, they are banned from election to Faculty Senate and have been improperly disenfranchised from voting for Faculty Senate for at least ten years. (I have never been given the opportunity to vote.) See Article IV, Section 2 of the CAS Constitution for details. Non-tenure track faculty are not eligible for election to the A&S Faculty Council, nor are they allowed to vote for Faculty Council representatives (Article II, Section 1, Part B).  At College-wide faculty meetings, full-time Senior Lecturers have a vote, but all non-tenure track faculty are allowed to vote on an issue only when the Dean decides that the matter is related to their assigned duties (Article I, Section I, Part C).  The Provost's Office insists that non-tenure track faculty participate in University governance through participation in University Committees, but my analysis shows that appointment to University Committees is greatly skewed towards tenure-track faculty, with only three non-tenure track faculty actually sitting on those committees (one each on Religious Affairs, Athletics, and Chemical Safety).  The Shared Governance Committee, charged last fall with suggesting changes and improvements in the future, does not include a single non-tenure track member of A&S (only one non-tenure track member at all - from the Blair School of Music).  We really have virtually no voice in the governance of the College or the University.

We also have no influence over the process of our re-appointment, or how much we are paid.  Prior to our reappointment, we submit a dossier.  Then months later, we either do or don't get a reappointment letter with a salary number and a place to sign.  If we don't like the number, we can quit.  In a previous reappointment cycle, I suggested to my chair that it would be fair for me to be paid what I would receive if I were to leave Vanderbilt and teach in the Metro Nashville public school system.  At that time, with a Ph.D. and the number of years of teaching experience that I had, my salary in the public schools would have been about $10k more than what I was getting at Vanderbilt.  I think that at that time I actually still had a valid Tennessee high school teaching license, so it would have been a real possibility for me.  They did give me something like an additional 1% pay increase over the base increase for that year, but I've never gotten parity with what I would receive as a public high school teacher (let alone what I would earn teaching at a private school).  That's particularly ironic, given that the number of students and teaching assistants I supervise has gone up by about 60% since I started in the job (with no corresponding increase in pay), to over 400 students and 12 to 20 TAs per semester, plus three full-time staff.  This is a much greater responsibility than I would have if I were teaching high school.  The reason that I was given for not being granted parity in pay with the public schools was that the college couldn't afford that much.  I like my job and I enjoy working with my students and TAs, so I probably won't quit to go back to teaching high school.  But it seems really unfair to me and I'm powerless to change the situation.

Currently, I'm up for re-appointment with a promotion to Principal Senior Lecturer.  That might result in a pay increase, but there is no transparency about the decision-making process in the Dean's office.  Some day later this year, I'll probably get a letter offering me an appointment with some salary number on it.  Or not.  

The Provost's Office website has a list of frequently asked questions whose answers insinuate that the union will probably lie to us, and may negotiate away our benefits without consulting with us.  I will admit that I was a bit concerned about the negative effects of unionization when the issue first came up.  However, I contacted some senior lecturers at Duke and University of Chicago to ask them how the contract negotiating process was going at their schools.  It was clear to me that the negotiating teams at those schools (composed of  non-tenure track faculty from the schools themselves) were very attuned to the concerns of the colleagues they were representing, and that they had no intention of negotiating away important benefits that they already had.  Mostly, it just looked like a huge amount of time and work on their part.  But it definitely was not the apocalypse - for most people at those schools, life goes on as normal.

Now that I'm a relatively high-ranking non-tenure track faculty member with reasonable job security, it seems unlikely that I personally will derive a large benefit from unionization.  But given that I have virtually no influence or negotiating power within the university, it is very hard for me to see what I have to lose by being part of the union.  More importantly, as I re-read the emails from my first painful years of teaching at Vanderbilt, it was evident to me that part-time faculty and faculty with one-year appointments are particularly vulnerable to the whims of upper-level administration.  I have always been fortunate to have department chairs who supported me vigorously and went to bat for me when I needed it.  But there is no guarantee that will happen in the future, or in other departments.  For whatever faults there may be in having a union, it will provide a degree of protection and transparency that has been completely lacking for non-tenure track faculty at Vanderbilt.  And that's the primary reason why, if offered the chance, I'm planning to vote "yes" on unionization.

Sunday, March 26, 2017

A Web Service with Content Negotiation for Linked Data using BaseX

Background

Last October, I wrote a post called Guid-O-Matic Goes to China.  That post described an application I wrote in Xquery to generate RDF in various serializations from simple CSV files.  Those of you who know me from TDWG are probably shaking your heads and wondering "Why in the world is he using Xquery to do this?  Who uses Xquery?"

The answer to the second question is "Digital Humanists".  There is an active Digital Humanities effort at Vanderbilt, and recently Vanderbilt received funding from the Andrew W. Mellon Foundation to open a Center for Digital Humanities.  I've enjoyed hanging out with the digital humanists and they form a significant component of our Semantic Web Working Group.  Digital Humanists also form a significant component of the Xquery Working Group at Vanderbilt.  Last year, I attended that group for most of the year, and that was how I learned enough Xquery to write the application.

That brings me to the first question (Why is he using Xquery?).  In my first post on Guid-O-Matic, I mentioned that one reason why I wanted to write the application was that BaseX (a freely available XML database and Xquery processor) included a web application component that allows Xquery modules to support a BaseX RESTXQ web application service.  After I wrote Guid-O-Matic, I played around with BaseX RESTXQ in an attempt to build a web service that would support content negotiation as required for Linked Data best practices.  However, the BaseX RESTXQ module had a bug that prevented using it to perform content negotiation as described in its documentation.  For a while I hoped that the bug would get fixed, but it became clear that content negotiation was not a feature that was used frequently enough for the developers to take the time to fix the bug.  In December, I sat down with Cliff Anderson, Vanderbilt's Xquery guru, and he helped me come up with a strategy for a workaround for the bug.  Until recently, I was too busy to pick up the project again, but last week I was finally able to finish writing the functions in the module to run the web server.

How does it work?

Here is the big picture of how the Guid-O-Matic web service works:
A web-based client (browser or Linked Data client) uses HTTP to communicate with the BaseX web service.  The web service is an Xquery module whose functions process the URIs sent from the client via HTTP GET requests.  It uses the Accept: header to decide what kind of serialization the client wants, then uses a 303 redirect to tell the client which specific URI to use to request a specific representation in that serialization.  The client then sends a GET request for the specific representation it wants.  The web service calls Guid-O-Matic Xquery functions that use data from the XML database to build the requested documents.  Depending on the representation-specific URI, it serializes the RDF as either XML, Turtle, or JSON-LD.  (Currently, there is only a stub for generating HTML, since the human-readable representation would be idiosyncratic depending on the installation.)

In the previously described versions of Guid-O-Matic, the data were retrieved from CSV files.  In this version, CSV files are still used to generate the XML files using a separate script.  But those XML files are then loaded into BaseX's built-in XML database, which is the actual data source used by the scripts called by the web service.  In theory, one could build and maintain the XML files independently without constructing them from CSVs.  One could also generate the CSV files from some other source as long as they were in the form that Guid-O-Matic understands.

Trying it out

You can try the system out for yourself to see how it works by following these steps.

  1. Download and install BaseX. BaseX is available for download at http://basex.org/.  It's free and platform independent.  I won't go into the installation details because it's a pretty easy install.
  2. Clone the Guid-O-Matic GitHub repo.  It's available at https://github.com/baskaufs/guid-o-matic.  
  3. Load the XML files into a BaseX database.  The easiest way to do this is probably to run the BaseX GUI.  On Windows, just double-click on the icon on the desktop.  From the Database menu, select "New..." Browse to the place on your hard drive where you cloned the Guid-O-Matic repo, then Open the "xml-for-database" folder.  Name the database "tang-song" (it includes the data described in the Guid-O-Matic Goes to China post).  Select the "Parse files in archives" option.  I think the rest of the options can be left at their defaults.  Click OK.  You can close the BaseX GUI.  
  4. Copy the restxq module into the webapp directory of BaseX.  This step requires you to know where BaseX was installed on your hard drive.  Within the BaseX installation folder, there should be a subfolder called "webapp".  Within this folder, there should be a file with the extension ".xqm", probably named something like "restxq.xqm".  In order for the web app to work, you either need to delete this file, or change its extension from ".xqm" to something else like ".bak" if you think there is a possibility that you will want to look at it in the future.  Within the cloned Guid-O-Matic repo, find the file "restxq-db.xqm" and copy it to the webapp folder.  This file contains the script that runs the server.  You can open it within the BaseX GUI or any text editor if you want to try hacking it.
  5. Start the server. Open a command prompt/command window.  On my Windows computer, I can just type basexhttp.bat to launch the batch file that starts the server.  (I don't think that I had to add the BaseX/bin/ folder to my path statement, but if you get a File Not Found error, you might have to navigate to that directory first to get the batch file to run.)  For non-Windows computers there should be another script named basexhttp that you can run by an appropriate method for your OS.  See http://docs.basex.org/wiki/Startup for details.  When you are ready to shut down the server, you can do it gracefully from the command prompt by pressing Ctrl-C.  By default, the server runs on port 8984 and that's what we will use in the test examples.  If you actually want to run this as a real web server, you'll probably have to change it to a different port (like port 80).  See the BaseX documentation for more on this.
  6. Send an HTTP GET request to the server. There are a number of ways to cause client software to interact with the server (lots more on this later). The easiest way is to open any web browser and enter http://localhost:8984/Lingyansi in the URL box.  If the server is working, it should redirect to the URL http://localhost:8984/Lingyansi.htm and display a placeholder web page.
If you have successfully gotten the placeholder web page to display, you can carry out the additional tests that I'll describe in the following sections.


Image from the W3C Interest Group Note https://www.w3.org/TR/cooluris/ © 2008 W3C 

What's going on?

The goal of the Guid-O-Matic web service is to implement content negotiation in a manner consistent with the Linked Data practice described in section 4.3 of the W3C Cool URIs for the Semantic Web document.  The purpose of this best practice is to allow users to discover information about things that are denoted by URIs, but that are not documents that can be delivered via the Web.  For example, we could use the URI http://lod.vanderbilt.edu/historyart/site/Lingyansi to denote the Lingyan Temple in China.  If we put that URI into a web browser, it is not realistic to expect the Internet to deliver the Lingyan Temple to our desktop. According to the convention established in the resolution to the httpRange-14 question, when a client makes an HTTP GET request to dereference the URI of a non-information resource (like a temple), an appropriate response from the server is to provide an HTTP 303 (See Other) response code that redirects the client to another URI that denotes an information resource (i.e. a document) that is about the non-information resource.  The user can indicate the desired kind of document by providing an HTTP Accept: header that provides the media type they would prefer.  So when a client makes a GET request for http://lod.vanderbilt.edu/historyart/site/Lingyansi, along with a request header of Accept: text/html, it is appropriate for the server to respond with a 303 redirect to the URI http://lod.vanderbilt.edu/historyart/site/Lingyansi.htm, which denotes a document (web page) about the Lingyan Temple.

There is no particular convention about the form of the URIs used to represent the non-information and information resources, although it is considered to be a poor practice to include file extensions in the URIs of non-information resources.  You can see one pattern in the diagram above.  Guid-O-Matic follows this convention: if a URI is extensionless, it is assumed to represent a non-information resource.  Each non-information resource included in the database can be described by a number of representations, i.e., documents having differing media types.  The URIs denoting those documents are formed by appending a file extension to the base URI of the non-information resource.  The extension used is one that is standard for that media type.  Many other patterns are possible, but using a pattern other than this one would require different programming than what is shown in this post.

The following media types are currently supported by Guid-O-Matic and can be requested from the web service:

Extension  Media Type
---------  -----------
.ttl       text/turtle
.rdf       application/rdf+xml
.json      application/ld+json or application/json
.htm       text/html


The first three media types are serializations of RDF and would be requested by Linked Data clients (machines), while the fourth media type is a human-readable representation that would typically be requested by a web browser.  As the web service is currently programmed, any media type requested other than the five listed above results in redirection to the URI for the HTML file.  
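To make that fall-through behavior concrete, here is a rough sketch of what a mapping from the Accept: header value to a file extension (the job of the page:determine-extension function that appears in the code later in this post) might look like.  This is only an illustration that assumes the namespace declarations already present in restxq-db.xqm; the actual function in the module may differ in its details.

declare function page:determine-extension($acceptHeader)
{
  (: join in case the header arrives as a sequence of values, then pick the first recognized media type;
     anything unrecognized falls through to the web page :)
  let $header := string-join($acceptHeader," ")
  return
      if (contains($header,"text/turtle")) then "ttl"
      else if (contains($header,"application/rdf+xml")) then "rdf"
      else if (contains($header,"application/ld+json")) then "json"
      else if (contains($header,"application/json")) then "json"
      else "htm"
};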

There are currently two hacks in the web service code that recognize two special URIs.  If the part of the URI after the domain name ends in "/header", the web server simply echoes back to the client a report of the Accept: header that was sent to it as part of the GET request.  You can try this by putting http://localhost:8984/header in the URL box of your browser.  For Chrome, here's the response I got:

text/html application/xhtml+xml application/xml;q=0.9 image/webp */*;q=0.8

As you can see, the value of the Accept: request header generated by the browser is more complicated than the simple "text/html" that you might expect [1].  
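Echoing the header takes very little code.  Here is a hypothetical sketch of such a handler (the function name page:echo-header is made up for this example, and it assumes the same rest annotations and namespace declarations used by the rest of the module); the actual hack in restxq-db.xqm may be written differently.

declare
  %rest:path("/header")
  %rest:header-param("Accept","{$acceptHeader}")
  function page:echo-header($acceptHeader)
  {
  (: simply return the Accept: header value(s) so the client can see what it sent :)
  string-join($acceptHeader," ")
  };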

In production, one would probably want to delete the handler for this URI, since it's possible that one might have a resource in the database whose local name is "header" (although probably not for Chinese temples!).  Alternatively, one could change the URI pattern to something like /utility/header that wouldn't collide with the pattern used for resources in the database.

The other hack is to allow users to request an RDF dump of the entire dataset.  A dump is requested using a URI ending in /dump along with a request header for one of the RDF media types.  If the header contains "text/html" (a browser), Turtle is returned.  Otherwise, the dump is in the requested media type.  The Chinese temple dataset is small enough that it is reasonable to request a dump of the entire dataset, but for larger datasets where a dump might tie up the server, it might be desirable to delete or comment out the code for this URI pattern.

Web server code

Here is the code for the main handler function:


declare
  %rest:path("/{$full-local-id}")
  %rest:header-param("Accept","{$acceptHeader}")
  function page:content-negotiation($acceptHeader,$full-local-id)
  {
  if (contains($full-local-id,"."))
  then page:handle-repesentation($acceptHeader,$full-local-id)
  else page:see-also($acceptHeader,$full-local-id)
  };

The %rest:path annotation performs the pattern matching on the requested URI.  It matches any local name that follows a single slash, and assigns that local name to the variable $full-local-id.  The %rest:header-param annotation assigns the value of the Accept: request header to the variable $acceptHeader.  These two variables are passed into the page:content-negotiation function. 
The function then chooses between two actions depending on whether the local name of the URI contains a period (".") or not.  If it does, then the server knows that the client wants a document about the resource in a particular serialization (a representation) and it calls the page:handle-repesentation function to generate the document.  If the local name doesn't contain a period, then the function calls the page:see-also function to generate the 303 redirect.  

Here's the function that generates the redirect:


declare function page:see-also($acceptHeader,$full-local-id)
{
  if(serialize:find-db($full-local-id))  (: check whether the resource is in the database :)
  then
      let $extension := page:determine-extension($acceptHeader)
      return
          <rest:response>
            <http:response status="303">
              <http:header name="location" value="{ concat($full-local-id,".",$extension) }"/>
            </http:response>
          </rest:response> 
  else
      page:not-found()  (: respond with 404 if not in database :)
};


The page:see-also function first makes sure that metadata about the requested resource actually exists in the database by calling the serialize:find-db function that is part of the Guid-O-Matic module. The serialize:find-db function returns a value of boolean true if metadata about the identified resource exist.  If the value generated is not true, the page:see-also function calls a function that generates a 404 "Not found" response code.  Otherwise, it uses the requested media type to determine the file extension to append to the requested URI (the URI of the non-information resource).  The function then generates an XML blob that signals to the server that it should send back to the client a 303 redirect to the new URI that it constructed (the URI of the document about the requested resource).
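Since several functions need to fall back to a 404, here is a minimal sketch of what a page:not-found function could look like, using the same rest:response pattern as the redirect above; the real function in the module might also set headers or return a message body.

declare function page:not-found()
{
  (: tell the server to respond with a 404 Not Found status code :)
  <rest:response>
    <http:response status="404"/>
  </rest:response>
};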

Here's the function that initiates the generation of the document in a particular serialization about the resource:


declare function page:handle-repesentation($acceptHeader,$full-local-id)
{
  let $local-id := substring-before($full-local-id,".")
  return
      if(serialize:find-db($local-id))  (: check whether the resource is in the database :)
      then
          let $extension := substring-after($full-local-id,".")
          (: When a specific file extension is requested, override the requested content type. :)
          let $response-media-type := page:determine-media-type($extension)
          let $flag := page:determine-type-flag($extension)
          return page:return-representation($response-media-type,$local-id,$flag)
      else
          page:not-found()  (: respond with 404 if not in database :)
};



The function begins by parsing out the identifier part from the local name.  It checks to make sure that metadata about the identified resource exist in the database - if not, it generates a 404.  (It's necessary to do the check again in this function, because clients might request the document directly without going through the 303 redirect process.)  If the metadata exist, the function parses out the extension part from the local name.  The extension is used to determine the media type of the representation, which determines both the Content-Type: response header and a flag used to signal to Guid-O-Matic the desired serialization.  In this function, the server ignores the media type value of the Accept: request header.   Because the requested document has a particular media type, that type will be reported accurately regardless of what the client requests.  This behavior is useful in the case where a human wants to use a browser to look at a document that's a serialization of RDF.  If the Accept: header were respected, the human user would see only the web page about the resource rather than the desired RDF document. Finally, the necessary variables are passed to the page:return-representation function that handles the generation of the document.
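For illustration, here are hypothetical sketches of the two helper functions mentioned above.  The extension-to-media-type mapping follows the table given earlier; the flag strings are placeholders that may not match the values that serialize:main-db actually expects, so check the module itself before copying them.

declare function page:determine-media-type($extension)
{
  (: map the file extension to the media type reported in the Content-Type: response header :)
  switch ($extension)
    case "ttl" return "text/turtle"
    case "rdf" return "application/rdf+xml"
    case "json" return "application/ld+json"
    default return "text/html"
};

declare function page:determine-type-flag($extension)
{
  (: map the file extension to a serialization flag for Guid-O-Matic; placeholder values :)
  switch ($extension)
    case "ttl" return "turtle"
    case "rdf" return "xml"
    case "json" return "json"
    default return "html"
};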

Here is the code for the page:return-representation function:


declare function page:return-representation($response-media-type,$local-id,$flag)
{
if(serialize:find-db($local-id))
then 
  (
  <rest:response>
    <output:serialization-parameters>
      <output:media-type value='{$response-media-type}'/>
    </output:serialization-parameters>
  </rest:response>,
  if ($flag = "html")
  then page:handle-html($local-id)
  else serialize:main-db($local-id,$flag,"single","false")
  )
};


The function generates a sequence of two items.  The first is an XML blob that signals to the server that it should generate a Content-Type: response header with a media type appropriate for the document that is being delivered.  The second is the response body, which is generated by one of two functions.  The page:handle-html function for generating the web page is a placeholder function, and in production there would be a call to a real function in a different module that uses data from the XML database to generate appropriate content for the described resource.  The serialize:main-db function is the core function of Guid-O-Matic that builds a document from the database in the serialization indicated by the $flag variable.  The purpose of Guid-O-Matic was previously described in Guid-O-Matic Goes to China, so at this point the serialize:main-db function can be considered a black box.  For those interested in the gory details of generating the serializations, look at the code in the serialize.xqm module in the Guid-O-Matic repo.
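As an example, the page:handle-html placeholder could be as simple as the sketch below; a production version would query the XML database and build a real page describing the resource.  (This is only a sketch; the stub in the repo may look different.)

declare function page:handle-html($local-id)
{
  (: placeholder web page; replace with real HTML generated from the database :)
  <html>
    <body>
      <h1>{ $local-id }</h1>
      <p>This is a placeholder page for the resource { $local-id }.</p>
    </body>
  </html>
};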

The entire restxq-db.xqm web service module can be viewed here.

Trying out the server

To try out the web server, you need to have a client installed on your computer that is capable of making HTTP requests with particular Accept: request headers.  An application commonly used for this purpose is curl. I have to confess that I'm not enough of a computer geek to enjoy figuring out the proper command line options to make it work for me.  Nevertheless, it's simple and free.  If you have installed curl on your computer, you can use it to test the server.  The basic curl command I'll use is

curl -v -H "Accept:text/turtle" http://localhost:8984/Lingyansi

The -v option makes curl verbose, i.e., it shows the request and response headers as curl communicates with the server.  The -H option is used to send a request header - in the example, the media type for Turtle is requested.  The last part of the command is the URI to be used in the HTTP request, which is a GET request by default.  In this example, the HTTP GET is made to a URI for the local webserver running on port 8984 (i.e. where the BaseX server runs by default).

Here's what happens when the curl command is given to make a request to the web service application:


The lines starting with ">" show the communication to the server and the lines starting with "<" show the communication coming from the server.  You can see that the server has responded in the desired manner: the client requested the file /Lingyansi in Turtle serialization, and the server responded with a 303 redirect to Lingyansi.ttl.  Following the server's redirection suggestion, I can issue the command 

curl -v -H "Accept:text/turtle" http://localhost:8984/Lingyansi.ttl

and I get this response:


This time I get a response code of 200 (OK) and the document is sent as the body of the response.  When GETting the Turtle file, the Accept: header is ignored by the web server, and can be anything.  Because of the .ttl file extension, the Content-Type: response header will always be text/turtle.

 If the -L option is added to the curl command, curl will automatically re-issue the command to the new URI specified by the redirect:

curl -v -L -H "Accept:text/turtle" http://localhost:8984/Lingyansi

Here's what the interaction of curl with the server looks like when the -L option is used:


Notice that for the second GET, curl reuses the connection with the server that it left open after the first GET.  One of the criticisms of the 303 redirect solution to the httpRange-14 controversy is that it is inefficient - two GET calls to the server are required to retrieve metadata about a single resource.

If you aren't into painful command line applications, there are several GUI options for sending HTTP requests.  One application that I use is Advanced Rest Client (ARC), a Chrome plugin (unfortunately available for Windows only).  Here's what the ARC GUI looks like:


The URI goes in the box at the top and GET is selected using the appropriate radio button.  If you select Headers form, a dropdown list of possible request headers appears when you start typing and you can select Accept.  In this example I've given a value of text/turtle, but you can also test the other values recognized by the server script: text/html, application/rdf+xml, application/json, and application/ld+json.  

When you click SEND, the response is somewhat inscrutably "404 Not found".  I'm not sure exactly what ARC is doing here - clearly something more than a single HTTP GET.  However, if you click the DETAILS dropdown, you have the option of selecting "Redirects".  Then you see that the server issued a 303 See Other redirect to Lingyansi.ttl


If you change the URI to http://localhost:8984/Lingyansi.ttl, you'll get this response:



This time no redirect, a HTTP response code of 200 (OK), a Content-Type: text/turtle response header, and the Turtle document as the body.  

There is a third option for a client to send HTTP requests: Postman.  It is also free, and available for other platforms besides Windows.  It has a GUI interface that is similar to Advanced Rest Client.  However, for whatever reason, Postman always behaves like curl with the -L option.  That is, it always automatically responds by sending you the ultimate representation without showing you the intervening 303 redirect.  There might be some way to make it show you the complete process that is going on, but I haven't figured out how to do that yet.  

If you are using Chrome as your browser, you can go to the dot, dot, dot dropdown in the upper right corner of the browser and select "More tools", then "Developer tools".  That will open a pane on the right side of the browser to show you what's going on "under the hood".  Make sure that the Network tab is selected, then load http://localhost:8984/Lingyansi .   This is what you'll see:

The Network pane on the right shows that the browser first tried to retrieve Lingyansi, but received a 303 redirect.  It then successfully retrieved Lingyansi.htm, which it then rendered as a web page in the pane on the left.  Notice that after the redirect, the browser replaced the URI that was typed in with the new URI of the page that it actually loaded.

Who cares?

If after all of this long explanation and technical gobbledygook you are left wondering why you should care about this, you are in good company.  Most people couldn't care less about 303 redirects.

As someone who is trying to believe in Linked Data, I'm trying to care about 303 redirects.  According to the core principles of Linked Data elaborated by Tim Berners-Lee in 2006, a machine client should be able to "follow its nose" so that it can "look up" information about resources that it learns about through links from somewhere else.  303 redirects facilitate this kind of discovery by providing a mechanism for a machine client to tell the server what kind of machine-readable metadata it wants (and that it wants machine-readable metadata and not a human-readable web page!).  

Despite the ten+ years that the 303 redirect solution has existed, there are relatively few actual datasets that properly implement the solution.  Why?

I don't control the server that hosts my Bioimages website and I spent several years trying to get anybody from IT Services at Vanderbilt to pay attention long enough for me to explain what kind of behavior I wanted from the server, and why.  In the end, I did get some sort of content negotiation.  If you perform an HTTP GET request for a URI like http://bioimages.vanderbilt.edu/ind-baskauf/40477 and include an Accept: application/rdf+xml header, the server responds by sending you http://bioimages.vanderbilt.edu/ind-baskauf/40477.rdf (an RDF/XML representation).  However, it just sends the file with a 200 OK response code and doesn't do any kind of redirection (although the correct Content-Type is reported in the response).  The behavior is similar in a browser.  Sending a GET request for http://bioimages.vanderbilt.edu/ind-baskauf/40477 results in the server sending http://bioimages.vanderbilt.edu/ind-baskauf/40477.htm, but since the response code is 200, the browser doesn't replace the URI entered in the box with the URI of the file that it actually delivers.  It seems like this solution should be OK, even though it doesn't involve a 303 redirect.

Unfortunately, from a Linked Data point of view, at the Bioimages server there are always two URIs that denote the same information resource, and neither of them can be inferred to be a non-information resource by virtue of a 303 response, as suggested by the httpRange-14 resolution. On a more practical level, users end up bookmarking two different URIs for the same page (since when content negotiation takes place, the browser doesn't change the URI to the one ending with .htm) and search engines index the same page under two different URIs, resulting in duplicate search results and potentially lower page rankings.  

Another circumstance where failing to follow the Cool URIs recommendation caused a problem is when I tried to use the CETAF Specimen URI Tester on Bioimages URIs.  The tester was created as part of an initiative by the Information Science and Technology Commission (ISTC) of the Consortium of European Taxonomic Facilities (CETAF).  When their URI tester is run on a "cool" URI like http://data.rbge.org.uk/herb/E00421509, the URI is considered to pass the CETAF tests for Level 3 implementation (redirect and return of RDF).  However, a Bioimages URI like http://bioimages.vanderbilt.edu/ind-baskauf/40477 fails the second test of the suite because there is no 303 redirect, even though the URI returns RDF when the media type application/rdf+xml is requested. Bummer. Given the number of cases where RDF can actually be retrieved from URIs that don't use 303 redirects (including all RDF serialized as RDFa), it probably would be best not to build a tester that relied solely on 303 redirects.  But until the W3C changes its mind about the httpRange-14 decision, a 303 redirect is the kosher way to find out from a server that a URI represents a non-information resource.  

So I guess the answer to the question "Who cares?" is "people who care about Linked Data and the Semantic Web".  The problem is that there just aren't that many people in that category, and even fewer who also care enough to implement the 303 redirect solution.  Then there are the people who believe in Linked Data, but were unhappy about the httpRange-14 resolution and don't follow it out of spite.  And there are also the people who believe in Linked Data, but don't believe in RDF (i.e. provide JSON-LD or microformat metadata directly as part of HTML).  

A potentially important thing

Now that I've spent time complaining about the hassles associated with getting 303 redirects to work, I'll mention a reason why I still think the effort might be worth it.

The RDF produced by Guid-O-Matic pretends that eventually the application will be deployed on a server that uses a base URI of http://lod.vanderbilt.edu/historyart/site/ . (Eventually there will be some real URIs minted for the Chinese temples, but they won't be the ones used in these examples).  So if we pretend that at some point in the future the Guid-O-Matic web service were deployed on the web (via port 80 instead of port 8984), an HTTP GET request could be made to http://lod.vanderbilt.edu/historyart/site/Lingyansi instead of http://localhost:8984/Lingyansi and the server script would respond with the documents shown in the examples. 

If you look carefully at the Turtle that the Guid-O-Matic server script produces for the Lingyan Temple, you'll see these RDF triples (among others):

<http://lod.vanderbilt.edu/historyart/site/Lingyansi>
     rdf:type schema:Place;
     rdfs:label "Lingyan Temple"@en;
     a geo:SpatialThing.

<http://lod.vanderbilt.edu/historyart/site/Lingyansi.ttl>
     dc:format "text/turtle";
     dc:creator "Vanderbilt Department of History of Art";
     dcterms:references <http://lod.vanderbilt.edu/historyart/site/Lingyansi>;
     dcterms:modified "2016-10-19T13:46:00-05:00"^^xsd:dateTime;
     a foaf:Document.

You can see that maintaining the distinction between the version of the URI with the .ttl extension and the URI without the extension is important.  The URI without the extension denotes a place and a spatial thing labeled "Lingyan Temple".  That resource was not created by the Vanderbilt Department of History of Art, it is not in RDF/Turtle format, and it was not last modified on October 19, 2016.  Those latter properties belong to the document that describes the temple.  The document about the temple is linked to the temple itself by the property dcterms:references.
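If you want to convince yourself that the two URIs really are described as two different things, you can load the Turtle into a script and list what is said about each one.  Here's a minimal sketch using Python's rdflib (my choice of tool; the prefix declarations are my assumption of the usual namespaces, since the snippet above doesn't show them):

from rdflib import Graph, URIRef

turtle = """
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dc:      <http://purl.org/dc/elements/1.1/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix schema:  <http://schema.org/> .
@prefix geo:     <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .

<http://lod.vanderbilt.edu/historyart/site/Lingyansi>
     rdf:type schema:Place;
     rdfs:label "Lingyan Temple"@en;
     a geo:SpatialThing.

<http://lod.vanderbilt.edu/historyart/site/Lingyansi.ttl>
     dc:format "text/turtle";
     dc:creator "Vanderbilt Department of History of Art";
     dcterms:references <http://lod.vanderbilt.edu/historyart/site/Lingyansi>;
     dcterms:modified "2016-10-19T13:46:00-05:00"^^xsd:dateTime;
     a foaf:Document.
"""

g = Graph()
g.parse(data=turtle, format="turtle")

temple = URIRef("http://lod.vanderbilt.edu/historyart/site/Lingyansi")
document = URIRef("http://lod.vanderbilt.edu/historyart/site/Lingyansi.ttl")

# Print the properties asserted about each resource; the two lists don't overlap.
for subject in (temple, document):
    print(subject)
    for predicate, obj in g.predicate_objects(subject):
        print("   ", predicate, obj)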

Because maintaining the distinction between a URI denoting a non-information resource (a temple) and a URI that denotes an information resource (a document about a temple) is important, it is a good thing if you don't get the same response from the server when you try to dereference the two different URIs.  A 303 redirect is a way to clearly maintain the distinction.

Being clear about the distinction between resources and metadata about resources has very practical implications.  I recently had a conversation with somebody at the Global Biodiversity Information Facility (GBIF) about the licensing for Bioimages.  (Bioimages is a GBIF contributor.)  I asked whether he meant the licensing for the images in the Bioimages website, or the licensing for the metadata about images in the Bioimages dataset.  The images are available under a variety of licenses ranging from CC0 to CC BY-NC-SA, but the metadata are all available under a CC0 license.  The images and the metadata about images are two different things, but the current GBIF system (based on CSV files, not RDF) doesn't allow for making this distinction at the provider level.  In the case of museum specimens that can't be delivered via the Internet, or organism observations that aren't associated with a particular form of deliverable physical or electronic evidence, the distinction doesn't matter much because we can assume that a specified license applies to the metadata.  But for occurrences documented by images, the distinction is very important.

What I've left out

This post dwells on the gory details of the operation of the Guid-O-Matic server script, and tells you how to load outdated XML data files about Chinese temples, but doesn't talk about how you could actually make the web service work with your own data.  I may write about that in the future, but for now you can go to this page of instructions for details of how to set up the CSV files that are the ultimate source of the generated RDF.  The value in the baseIriColumn in the constants.csv file needs to be changed to the path of the directory where you want the database XML files to end up.  After the CSV files are created and have replaced the Chinese temple CSV files in the Guid-O-Matic repo, you need to load the file load-database.xq from the Guid-O-Matic repo into the BaseX GUI.  When you click on the run button (green triangle) of the BaseX GUI, the necessary XML files will be generated in the folder you specified.  The likelihood of success in generating the XML files is higher on a Windows computer because I haven't tested the script on Mac or Linux, and there may be file path issues that I still haven't figured out on those operating systems.  

The other thing that you should know if you want to hack the code is that the restxq-db.xqm module imports the Guid-O-Matic modules that are necessary to generate the response body RDF serializations from my GitHub site on the web.  That means that if you want to hack the functions that actually generate the RDF (located in the module serialize.xqm), you'll need to change the module references in the prologue of the restxq-db.xqm module (line 6) so that they refer to files on your computer.  Instead of

import module namespace serialize = 'http://bioimages.vanderbilt.edu/xqm/serialize' at 'https://raw.githubusercontent.com/baskaufs/guid-o-matic/master/serialize.xqm';

you'll need to use 

import module namespace serialize = 'http://bioimages.vanderbilt.edu/xqm/serialize' at '[file path]';

where [file path] is the path to the file in the Guid-O-Matic repo on your local computer.  On my computer, it's in the download location I specified for GitHub repos: c:/github/guid-o-matic/serialize.xqm .  It will probably be somewhere else on your computer.  Once you've made the change and saved the new version of restxq-db.xqm, you can hack the functions in serialize.xqm and the changes will take effect in the documents sent from the server.

Note

[1] The actual request header contains commas that are not shown here.  BaseX reads the comma-separated values as items in a sequence, and when it reports the sequence back, it omits the commas.


Saturday, March 11, 2017

Controlled values (again)

The connection between The Darwin Core Hour and the TDWG Standards Documentation Specification

I've included the word "again" in the title of this blog post because I wrote a series of blog posts [1] about a year ago exploring issues related to thesauri, ontologies, controlled vocabularies, and SKOS.  Those posts were of a somewhat technical nature since I was exploring possible ways to represent controlled vocabularies as RDF.  However, there has been a confluence of two events that have inspired me to write a less technical blog post on the subject of controlled vocabularies for the general TDWG audience.

One is the genesis of the excellent Darwin Core Hour webinar series.  I encourage you to participate in them if you want to learn more about Darwin Core.  The previous webinars have been recorded and can be viewed online.  The most recent webinar, "Even Simple is Hard", presented by John Wieczorek on March 7, provided a nice introduction to issues related to controlled vocabularies, and the next one on April 4, "Thousands of shades for 'Controlled' Vocabularies", will be presented by Paula Zermoglio and will deal with the specifics of controlled vocabularies.

The other thing that's going on is that we are in the midst of the public comment period for the draft TDWG Standards Documentation Specification (SDS), of which I'm the lead author.  At the TDWG Annual Meeting in December, I led a session to inform people about the SDS and its sister standard, the TDWG Vocabulary Management Specification.  At that session, the topic of controlled vocabularies came up.  I made some statements explaining the way that the SDS specifies that controlled vocabularies will be described in machine-readable form.  What I said seemed to take some people by surprise, and although I provided a brief explanation, there wasn't enough time to have an in-depth discussion.  I hoped that the topic would come up during the SDS public comment period, but so far it has not.  Given the current interest in constructing controlled vocabularies, I hope that this blog post will either generate some discussion or satisfy people's curiosity about how the SDS deals with machine-readable controlled vocabularies.

Definitions

It is probably best to start off by providing some definitions.  It turns out that there is actually an international standard that deals with controlled vocabularies: ISO 25964, "Thesauri and interoperability with other vocabularies".  Unfortunately, that standard is hidden behind a paywall and is ridiculously expensive to buy.  As part of my work on the SDS, I obtained a copy of ISO 25964 by Interlibrary Loan.  I had to return that copy, but I took some notes that are on the VOCAB Task Group's GitHub site.  I encourage you to refer to those notes for more details about what I'm only going to briefly describe here.

Controlled vocabularies and thesauri

ISO 25964 defines a controlled vocabulary as a
prescribed list of terms, headings or codes, each representing a concept. NOTE: Controlled vocabularies are designed for applications in which it is useful to identify each concept with one consistent label, for example when classifying documents, indexing them and/or searching them. Thesauri, subject heading schemes and name authority lists are examples of controlled vocabularies.
It also defines a form of controlled vocabulary, which is the major subject of the standard: a thesaurus.  A thesaurus is a
controlled and structured vocabulary in which concepts are represented by terms, organized so that relationships between concepts are made explicit, and preferred terms are accompanied by lead-in entries for synonyms or quasi-synonyms. NOTE: The purpose of a thesaurus is to guide both the indexer and the searcher to select the same preferred term or combination of preferred terms to represent a given subject. For this reason a thesaurus is optimized for human navigability and terminological coverage of a domain.  [my emphasis]
If you participated in or listened to the Darwin Core Hour "Even Simple is Hard", you can see the close relationship between the way "controlled vocabulary" was used in that webinar and the definition of thesaurus given here.  When submitting metadata about an occurrence to an aggregator, we want to use the same controlled value term in our metadata as will be used by those who may be searching for our metadata in the future.  Referring to an example given in the webinar, if we (the "indexers") provide "PreservedSpecimen" as a value in metadata in our spreadsheet, others in the future who are searching (the "searchers") for occurrences documented by preserved specimens can search for "PreservedSpecimen" and find our record.  That won't happen if we use a value of "herbarium sheet".  Figuring out how to get indexers and searchers to use the same terms is the job of a thesaurus.

A thesaurus is also designed to capture relationships between controlled value terms, such as "broader" and "narrower".  A searcher who knows about preserved specimens but wants records documented by any kind of physical thing (including living specimens, fossils, and material samples) can be directed to a broader term that encompasses all kinds of physical things, e.g. "PhysicalObject".

So although in Darwin Core (and in TDWG in general) we tend to talk about "controlled vocabularies", I would assert that we are, in fact, talking about thesauri as defined by ISO 25964.

Strings and URIs

If you have spent any time pondering TDWG vocabularies, you probably have noticed that all kinds of TDWG vocabulary terms (classes and properties) are named using Uniform Resource Identifiers, or URIs.  Because the kinds of URIs used in TDWG term names begin with "http://", people get the mistaken impression that URIs always make something "happen".  This is because we are used to seeing Web URLs that start with "http://", and have come to believe that if we put a URI that starts with "http://" in a browser, we will get a web page.  However, there are some terms in TDWG vocabularies that will probably never "do" anything in a browser.  For example, Audubon Core co-opts the term http://ns.adobe.com/exif/1.0/PixelXDimension as a property whose value gives the number of pixels in an image in the X dimension.  You can try putting that term URI in a browser and prove to yourself that nothing useful happens.  So we need to get over the idea that URIs must "do" something (they might or might not), and get used to the idea that their primary purpose is to serve as a globally unique name that conforms to some particular structural rules [2].

You can see the value of using URIs over plain text strings if you consider the Darwin Core term "class".  When we use "class" in the context of Darwin Core, we intend its Darwin Core definition: "The full scientific name of the class in which the taxon is classified."  However, "Class" has a different meaning in a different international standard, the W3C's RDF Schema 1.1 (RDFS).  In that context, "Class" means "The class of resources that are RDF classes."  There may be many other meanings of "class" in other fields.  It can mean different things in education, in computer programming, and in sociology.  We can tell people exactly what we intend by the use of a term if we identify it with a URI rather than a plain text string.  So for example, if we use the term http://rs.tdwg.org/dwc/terms/class, people will know that we mean "class" in the Darwin Core sense, but if we use the term http://www.w3.org/2000/01/rdf-schema#Class, people will know that we mean "class" in the RDFS sense.

Clearly, it is a pain in the neck to write out a long and messy URI.  For convenience, there is an abbreviated form for URIs called "compact URIs" or CURIEs.  When we use a CURIE, we define an abbreviation for part of the URI (commonly called the "namespace" abbreviation).  So for example, we could declare the namespace abbreviations:

dwc: = http://rs.tdwg.org/dwc/terms/
rdfs: = http://www.w3.org/2000/01/rdf-schema#

and then abbreviate the URI by replacing the namespace with the abbreviation to form a CURIE.  With the defined abbreviations above, we could say dwc:class when we intend "class" in the Darwin Core sense and rdfs:Class when we intend "class" in the RDFS sense.  This is much shorter than writing out the full URI, and if the last part of the CURIE after the namespace (known as the "local name") is formed from a natural language string (possibly in camelCase), it's easy for a native speaker of that natural language to "read" the CURIE as part of a sentence.
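Expanding a CURIE back into a full URI is nothing more than string substitution, as this toy sketch illustrates (the function and dictionary names are made up for illustration):

# A toy CURIE expander: replace the namespace abbreviation with the full namespace.
namespaces = {
    "dwc": "http://rs.tdwg.org/dwc/terms/",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}

def expand_curie(curie):
    prefix, local_name = curie.split(":", 1)
    return namespaces[prefix] + local_name

print(expand_curie("dwc:class"))   # http://rs.tdwg.org/dwc/terms/class
print(expand_curie("rdfs:Class"))  # http://www.w3.org/2000/01/rdf-schema#Class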

It is not a requirement that the local name be a natural language string.  Some vocabularies prefer to have opaque identifiers, particularly when there is no assumed primary language.  So for example, the URI http://vocab.getty.edu/tgn/1000111, which is commonly abbreviated by the CURIE tgn:1000111, denotes the country China, which may have many names in natural language strings of various languages.

What's the dwciri: namespace for?

Those who are familiar with using Darwin Core in spreadsheets and relational databases are familiar with the "regular" Darwin Core terms in the "dwc:" namespace (http://rs.tdwg.org/dwc/terms/).  However, most are NOT familiar with the namespace http://rs.tdwg.org/dwc/iri/, commonly abbreviated dwciri: .  This namespace was created as a result of the adoption of the Darwin Core RDF Guide, and most people who don't care about RDF have probably ignored it.  However, I'd like to bring it up in this context because it can play an important role in disambiguation.

Here is a typical task.  You have a spreadsheet record whose dwc:county value says "Robertson".  You know that it's a third-level political subdivision because that's what dwc:county records.  However, there are several third-level political subdivisions named "Robertson" in the United States alone, and there probably are some in other countries as well.  So there is work to be done in disambiguating this value.  You'll probably need to check the dwc:stateProvince and dwc:country or dwc:countryCode values, too.  Of course, there may also be other records whose dwc:county values are "Robertson County" or "Comté de Robertson" or "Comte de Robertson" that are probably from the same third-level political subdivision as your record.  Once you've gone to the trouble of figuring out that the record is in Robertson County, Tennessee, USA, you (or other data aggregators) really should never have to go through that effort again.

There are two standardized controlled vocabularies that have created URI identifiers for geographic places: GeoNames and the Getty Thesaurus of Geographic Names (TGN).  There are reasons (beyond the scope of this blog post) why one might prefer one of these vocabularies over the other, but either one provides an unambiguous, globally unique URI for Robertson County, Tennessee, USA: http://sws.geonames.org/4653638/ from GeoNames and http://vocab.getty.edu/tgn/2001910 from the TGN.  The RDF Guide makes it clear that the value of every dwciri: term must be a URI, while the values of many dwc: terms may be a variety of text strings, including human-readable names, URIs, abbreviations, etc.  With a dwc: term, a user probably will not know whether disambiguation needs to be done, while with a dwciri: term, a user knows exactly what the value denotes, since a URI is a globally unique name.

In RDF, we could say that a particular location was in Robertson County, Tennessee, USA like this:

my:location dwciri:inDescribedPlace tgn:2001910.

However, there is also no rule that says you couldn't have a spreadsheet with a column header of "dwciri:inDescribedPlace", as long as the cells below it contain URI values.  So dwciri: terms could be used in non-RDF data representations as well as in RDF.
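To make the payoff concrete, here is a sketch of how a value for a dwciri:inDescribedPlace column might be generated once the messy dwc:county strings have been disambiguated.  The hard-coded lookup table is a hypothetical stand-in for the real disambiguation work, not an actual gazetteer service:

# Once a messy dwc:county string has been disambiguated, record the result
# as a URI so that nobody ever has to repeat the work.
robertson_tn = "http://vocab.getty.edu/tgn/2001910"

verbatim_to_uri = {
    ("Robertson", "Tennessee", "United States"): robertson_tn,
    ("Robertson County", "Tennessee", "United States"): robertson_tn,
    ("Comté de Robertson", "Tennessee", "United States"): robertson_tn,
    ("Comte de Robertson", "Tennessee", "United States"): robertson_tn,
}

def in_described_place(county, state_province, country):
    """Return a value suitable for dwciri:inDescribedPlace, or None if the
    verbatim strings haven't been disambiguated yet."""
    return verbatim_to_uri.get((county, state_province, country))

print(in_described_place("Comté de Robertson", "Tennessee", "United States"))
# http://vocab.getty.edu/tgn/2001910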

If you look at the table in Section 3.7 of the Darwin Core RDF Guide, you will see that there are dwciri: analogs for every dwc: term where we thought it made sense to use a URI as a value.[3]  In many cases, those were terms where Darwin Core recommended use of a controlled vocabulary.  Thus, once providers or aggregators went to the trouble to clean their data and determine the correct controlled values for a dwc: property, they and everybody else in the future could be done with that job forever if they recorded the results as a URI value from a controlled vocabulary for a dwciri: Darwin Core property.


The crux of the issue

OK, after a considerable amount of background, I need to get to the main point of the post.  John's "Even Simple is Hard" talk was directed to the vast majority of Darwin Core users: those who generate or consume data as Simple Darwin Core (spreadsheets), or who output or consume Darwin Core Archives (fielded text tables linked in a "star schema").  In both of those cases, the tables or spreadsheets will most likely be populated with plain text strings.  There may be a few users who have made the effort to find unambiguous URI values instead of plain strings, and hopefully that number will go up in the future as more URIs are minted for controlled vocabulary terms.  However, in the current context, when John talks about populating the values of Darwin Core properties with "controlled values", I think that he probably means to use a single, consensus Unicode string that denotes the concept underlying that controlled value.

Preferred Unicode strings

I am reminded of a blog post that John wrote in 2013 called "Data Diversity of the Week: Sex" in which he describes some of the 189 distinct values used in VertNet to denote the concept of "maleness".  We could all agree that the appropriate Unicode string to denote maleness in a spreadsheet should be the four characters "male".  Anyone who cleans data and encounters values for dwc:sex like "M", "m", "Male", "MALE", "macho", "masculino", etc. etc. could substitute the string "male" in that field.  There would, of course, be the problem of losing the verbatim value if a substitution were made.

I suspect that most TDWG'ers would consider the task of developing a controlled vocabulary for dwc:sex to involve sitting a bunch of users and aggregators down at a big controlled vocabulary conference and coming to some consensus about the particular strings that we should all use to denote maleness, femaleness, and all other flavors of gender that we find in the natural world.

I don't want to burst anybody's bubble, but as it currently stands, that's not how the draft Standards Documentation Specification would work with respect to controlled vocabularies.  A TDWG controlled vocabulary would be more than a list of acceptable strings.  It would have all of the same features that other TDWG vocabularies have.

SDS: URIs

For one thing, each term in a controlled vocabulary would be identified by a URI.  That is already current practice in TDWG vocabularies and in Dublin Core.  The SDS does not specify whether the URIs should use "English-friendly" local names or opaque numbers for local names.  Either would be fine.  For illustration purposes, I'll pick opaque numbers.  Let's use "12345" as the local name for "maleness".  The SDS is also silent about namespace construction.  One could do something like

dwcsex: = "http://rs.tdwg.org/dwc/cv/sex/"

for the namespace.  Then the URI for maleness would be

http://rs.tdwg.org/dwc/cv/sex/12345

as a full URI or

dwcsex:12345

as a CURIE.  Anybody who wants to unambiguously refer to maleness can use the URI dwcsex:12345 regardless of whether they are using a spreadsheet, relational database, or RDF.

SDS: Machine-readable stuff

In John's talk, he mentioned the promise of using "semantics" to help automate the process of data cleaning.  A critical feature of the SDS is that in addition to specifying how human-readable documents should be laid out, it also specifies how metadata should be expressed in order to make those data machine readable.  The examples are given as RDF/Turtle, but the SDS makes it clear that it is agnostic about how machine-readability should be achieved.  RDF-haters are welcome to use JSON-LD.  HTML-lovers are welcome to use RDFa embedded in web page markup.  Or better yet, provide the machine-readable data in all of the above formats and let users choose.  The main requirement is that regardless of the chosen serialization, every machine-readable representation must "say" the same thing, i.e. must use the same properties specified in the SDS to describe the metadata.  So the SDS is clear about what properties should be used to describe each aspect of the metadata.

In the case of controlled vocabulary terms, several designated properties are the same as those used in other TDWG vocabularies.  For example, rdfs:comment is used to provide the English term definition and rdfs:label is used to indicate a human-readable label for the term in English.  The specification does a special thing to accommodate our community's idiosyncrasy of relying on a particular Unicode string to denote a controlled vocabulary term.  That Unicode string is designated as the value of rdf:value, a property that is well-known but doesn't have a very specific meaning and could be used in this way.  It's possible that the particular consensus string might be the same as the label, but it wouldn't have to be.  For example we could say this:

dwcsex:12345 rdfs:label "Male"@en;
             rdfs:comment "the male gender"@en;
             rdf:value "male".

In this setup, the label to be presented to humans starts with a capital letter, while the consensus string value denoting the term doesn't.  In other cases, the human readable label might contain several words with spaces between, while the consensus string value might be in camelCase with no spaces.  The label probably should be language-tagged, while the consensus string value is a plain literal.
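To illustrate the point that different serializations can "say" the same thing, here is a sketch that loads the Turtle above and re-expresses it in other serializations using Python's rdflib (my choice of tool; the JSON-LD output assumes rdflib 6 or later, or the rdflib-jsonld plugin on older versions):

from rdflib import Graph

turtle = """
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dwcsex: <http://rs.tdwg.org/dwc/cv/sex/> .

dwcsex:12345 rdfs:label "Male"@en;
             rdfs:comment "the male gender"@en;
             rdf:value "male".
"""

g = Graph()
g.parse(data=turtle, format="turtle")

# The same three statements, re-expressed in two other serializations.
print(g.serialize(format="json-ld"))
print(g.serialize(format="xml"))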

Dealing with multiple labels for a controlled value: SKOS

As it currently stands, the SDS says that the normative definition and label for terms should be in English.  Thus the controlled value vocabulary document (both human- and machine-readable) will contain English only.  Given the international nature of TDWG, it would be desirable to also make documents, term labels, and definitions available in as many languages as possible.  However, it should not require invoking the change process outlined in the Vocabulary Management Specification every time a new translation is added, or if the wording in a non-English language version changes.  So the SDS assumes that there will be ancillary documents (human- and machine-readable) in which the content is translated to other languages, and that those ancillary documents will be associated with the actual standards documents.

This is where the Simple Knowledge Organization System (SKOS) comes in.  SKOS was developed as a W3C Recommendation in parallel with the development of ISO 25964.  SKOS is a vocabulary that was specifically designed to facilitate the development of thesauri.  Given that I've made the case that what TDWG calls "controlled vocabularies" are actually thesauri, SKOS has many of the terms we need to describe our controlled value terms.

An important SKOS term is skos:prefLabel (preferred label).  skos:prefLabel is actually defined as a subproperty of rdfs:label, so any value that is a skos:prefLabel is also a generic label.  However, the "rules" of SKOS say that you should never have more than one skos:prefLabel value for a given resource, in a given language.  Thus, there is only one skos:prefLabel value for English, but there can be other skos:prefLabel values for Spanish, German, Chinese, etc.

SKOS also provides the term skos:altLabel.  skos:altLabel is used to specify other labels that people might use, but that aren't really the "best" one.  There can be an unlimited number of skos:altLabel values in a given language for a given controlled vocabulary term.  There is also a property skos:hiddenLabel.  The values of skos:hiddenLabel are "bad" values that you know people use, but you really wouldn't want to suggest as possibilities (for example, misspellings).

SKOS has a particular term that indicates that a value is a definition: skos:definition. That has a more specific meaning than rdfs:comment, which could really be any kind of comment.  So using it in addition to rdfs:comment is a good idea.

So here is how the description of our "maleness" term would look in machine-readable form (serialized as human-friendly RDF/Turtle):

Within the standards document itself:

dwcsex:12345 rdfs:label "Male"@en;
             skos:prefLabel "Male"@en;
             rdfs:comment "the male gender"@en;
             skos:definition "the male gender"@en;
             rdf:value "male".


In an ancillary document that is outside the standard:

dwcsex:12345 skos:prefLabel "Masculino"@es;
             skos:altLabel "Macho"@es;
             skos:altLabel "macho"@es;
             skos:altLabel "masculino"@es;
             skos:altLabel "male"@en;
             skos:prefLabel "男"@zh-hans;
             skos:prefLabel "男"@zh-hant;
             skos:prefLabel "männlich"@de;
             skos:altLabel "M";
             skos:altLabel "M.";
             skos:hiddenLabel "M(ale)";
etc. etc.

In the ancillary document, one would attempt to include as many as possible of the 189 values for "male" that John mentioned in his blog post.  Having this diversity of labels available makes two things possible.  One is to automatically generate pick lists in any language.  If the user selects German as the preferred language, the pick list presents the German preferred label "männlich" to the user, but the value selected is actually recorded by the application as the language-independent URI dwcsex:12345.  Although I didn't show it, the ancillary document could also contain definitions in multiple languages to clarify things for international users in the event that viewing the label itself is not enough.
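Here is a sketch of how such a pick-list generator might work, using Python's rdflib and a tiny stand-in graph (a real application would load the standard document plus the ancillary translations from wherever TDWG serves them):

from rdflib import Graph
from rdflib.namespace import SKOS

# A tiny stand-in for the merged standard + ancillary documents.
cv_graph = Graph()
cv_graph.parse(data="""
@prefix skos:   <http://www.w3.org/2004/02/skos/core#> .
@prefix dwcsex: <http://rs.tdwg.org/dwc/cv/sex/> .

dwcsex:12345 skos:prefLabel "Male"@en, "Masculino"@es, "männlich"@de .
""", format="turtle")

def pick_list(graph, language):
    """Return (label, term URI) pairs for a pick list in the requested language.
    The application shows the label but records the URI."""
    return sorted((str(label), str(term))
                  for term, label in graph.subject_objects(SKOS.prefLabel)
                  if label.language == language)

print(pick_list(cv_graph, "de"))
# [('männlich', 'http://rs.tdwg.org/dwc/cv/sex/12345')]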

The many additional labels in the ancillary document also facilitate data cleaning.  For example, if GBIF has a million horrible spreadsheets to try to clean up, they could simply do string matching against the various label values without regard to the language tags and type of label (pref vs. alt vs. hidden).  Because the ancillary document is not part of the standard itself, the laundry list of possible labels can be extended at will every time a new possible value is discovered.

Making the data available

It is TOTALLY within the capabilities of TDWG to provide machine-readable data of this sort, and if the SDS is ratified, that's what we will be doing.  Setting up a SPARQL endpoint to deliver the machine-readable metadata is not hard.  For those who are RDF-phobic, a machine-readable version of the controlled vocabulary can be made available through an API as JSON-LD, which provides exactly the same information as the RDF/Turtle above and would look like this:

{
  "@context": {
    "dwcsex": "http://rs.tdwg.org/dwc/cv/sex/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "xsd": "http://www.w3.org/2001/XMLSchema#"
  },
  "@id": "dwcsex:12345",
  "rdf:value": "male",
  "rdfs:comment": {
    "@language": "en",
    "@value": "the male gender"
  },
  "rdfs:label": {
    "@language": "en",
    "@value": "Male"
  },
  "skos:altLabel": [
    {
      "@language": "es",
      "@value": "macho"
    },
    "M",
    {
      "@language": "en",
      "@value": "male"
    },
    "M.",
    {
      "@language": "es",
      "@value": "masculino"
    },
    {
      "@language": "es",
      "@value": "Macho"
    }
  ],
  "skos:definition": {
    "@language": "en",
    "@value": "the male gender"
  },
  "skos:hiddenLabel": "M(ale)",
  "skos:prefLabel": [
    {
      "@language": "es",
      "@value": "Masculino"
    },
    {
      "@language": "zh-hant",
      "@value": "男"
    },
    {
      "@language": "en",
      "@value": "Male"
    },
    {
      "@language": "zh-hans",
      "@value": "男"
    },
    {
      "@language": "de",
      "@value": "männlich"
    }
  ]
}

People could write their own data-cleaning apps to consume this JSON description of the controlled vocabulary and never even have to think about RDF.  
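For example, a data-cleaning script could build a lookup table from every known label string to the term identifier without any RDF machinery at all.  Here's a sketch that consumes the JSON-LD above with nothing but Python's standard json module (the local file name is made up for the example):

import json

def label_lookup(term_json):
    """Map every known label string (pref, alt, hidden, and the consensus
    rdf:value) to the term identifier given in "@id", ignoring language tags."""
    lookup = {}
    for key in ("skos:prefLabel", "skos:altLabel", "skos:hiddenLabel", "rdf:value"):
        values = term_json.get(key, [])
        if not isinstance(values, list):
            values = [values]
        for value in values:
            label = value["@value"] if isinstance(value, dict) else value
            lookup[label.strip().lower()] = term_json["@id"]
    return lookup

# Hypothetical local copy of the JSON-LD shown above.
with open("dwcsex-12345.json") as f:
    lookup = label_lookup(json.load(f))

for messy in ("MALE", "macho", "M.", "M(ale)"):
    print(messy, "->", lookup.get(messy.lower()))   # all resolve to dwcsex:12345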

"Semantics", SKOS concept schemes, and Ontologies

Up to this point, I've been dodging an issue that will concern some readers and which other readers won't care about one whit.  If you are a casual reader and don't care about the fine points of machine-readable data and ontologies, you can just stop reading here.  

The SDS says that controlled vocabulary terms should be typed as skos:Concept, as opposed to rdfs:Class.  That prescription has the implication that the controlled vocabulary will be a SKOS concept scheme rather than an ontology.  This was what freaked people out at the TDWG meeting, because there is a significant constituency of TDWG whose first inclination when faced with a machine-readable data problem is to construct an OWL ontology.  At the meeting, I made the statement that a SKOS concept scheme is a screwdriver and an OWL ontology is a hammer.  Neither a screwdriver nor a hammer is intrinsically a better tool.  You use a screwdriver when you want to put in screws and you use a hammer when you want to pound nails.  So in order to decide which tool is right, we need to be clear about what we are trying to accomplish with the controlled vocabulary.

ISO 25964 provides the following commentary about thesauri and ontologies in section 21.2:
Whereas the role of most of the vocabularies ... is to guide the selection of search/indexing terms, or the browsing of organized document collections, the purpose of ontologies in the context of retrieval is different. Ontologies are not designed for information retrieval by index terms or class notation, but for making assertions about individuals, e.g. about real persons or abstract things such as a process. 
and in section 22.3:
One key difference is that, unlike thesauri, ontologies necessarily distinguish between classes and individuals, in order to enable reasoning and inferencing. ... The concepts of a thesaurus and the classes of an ontology represent meaning in two fundamentally different ways. Thesauri express the meaning of a concept through terms, supported by adjuncts such as a hierarchy, associated concepts, qualifiers, scope notes and/or a precise definition, all directed mainly to human users. Ontologies, in contrast, convey the meaning of classes through machine-readable membership conditions. ... The instance relationship used in some thesauri approximates to the class assertion used in ontologies. Likewise, the generic hierarchical relationship ... corresponds to the subclass axiom in ontologies. However, in practice few thesauri make the distinction between generic, whole-part and instance relationships. The undifferentiated hierarchical relationship most commonly found in thesauri is inadequate for the reasoning functions of ontologies. Similarly the associative relationship is unsuited to an ontology, because it is used in a multitude of different situations and therefore is not semantically precise enough to enable inferencing.
In layman's terms, there are two key differences between thesauri and ontologies.  The primary purpose of a thesaurus is to guide a human user to pick the right term for categorizing a resource.  The primary purpose of an ontology is to allow a machine to do automated reasoning about classes and instances of things.  The second difference is that reasoning and inferencing are, in a sense, "automatic" for an ontology, whereas the choice to make use of hierarchical relationships in a thesaurus is optional and controlled by a user.

Let's apply these ideas to our dwc:sex example.  We could take the ontology approach and say that 

dwcsex:12345 rdf:type rdfs:Class.

We could then define another class in our gender ontology:

dwcsex:12347 rdf:type rdfs:Class;
             rdfs:label "Animal gender".

and assert

dwcsex:12345 rdfs:subClassOf dwcsex:12347.

This assertion is the sort that John described in his webinar: an "is_a" hierarchical relationship.  We could represent it in words as:

"Male" is_a "Animal gender".

As data providers, we don't have to "do anything" to assert this fact, or decide in particular cases whether we like the fact or not.  Anything that has a dwc:sex value of "male" will automatically have a dwc:sex value of "Animal gender" because that fact is entailed by the ontology.  
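If you want to see that entailment happen, you can run a small reasoner over the ontology.  Here's a sketch using Python's rdflib together with the owlrl package (my choice of tools, not anything mandated by the SDS; for illustration I model the value as the rdf:type of a made-up my:organism resource):

from rdflib import Graph, URIRef
from rdflib.namespace import RDF
import owlrl  # a small RDFS/OWL RL reasoner that works on rdflib graphs

g = Graph()
g.parse(data="""
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dwcsex: <http://rs.tdwg.org/dwc/cv/sex/> .
@prefix my:     <http://example.org/my/> .

dwcsex:12345 rdf:type rdfs:Class ;
             rdfs:label "Male"@en ;
             rdfs:subClassOf dwcsex:12347 .
dwcsex:12347 rdf:type rdfs:Class ;
             rdfs:label "Animal gender"@en .

# The only thing the data provider asserts:
my:organism rdf:type dwcsex:12345 .
""", format="turtle")

# Materialize the RDFS entailments.
owlrl.DeductiveClosure(owlrl.RDFS_Semantics).expand(g)

organism = URIRef("http://example.org/my/organism")
animal_gender = URIRef("http://rs.tdwg.org/dwc/cv/sex/12347")

# True: the broader classification is entailed without the provider asserting it.
print((organism, RDF.type, animal_gender) in g)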

Alternatively, we could take the thesaurus approach and say that 

dwcsex:12345 rdf:type skos:Concept.

We could then define another concept in our gender thesaurus:

dwcsex:12347 rdf:type skos:Concept;
             rdfs:label "genders that animals can have".

and assert

dwcsex:12345 skos:broader dwcsex:12347.

As in the ontology example, this assertion also describes a hierarchical relationship ("has_broader_category").  We could represent it in words as:

"Male" has_the_broader_category "genders that animals can have".

In this case, nothing is entailed automatically.  If we assert that some thing has a dwc:sex value of "male", that's all we know.  However, if a human user is using a SKOS-aware application, the application could interact with the user and say "Hey, not finding what you want?  I could show you some other genders that animals can have." and then find other controlled vocabulary terms that have dwcsex:12347 as a broader concept.  It would also be no problem to assert this:

dwcsex:12348 rdf:type skos:Concept;
             rdfs:label "genders that parts of plants can have".
dwcsex:12345 skos:broader dwcsex:12348.

We aren't doing anything "bad" here; we haven't somehow entailed that males are both plants and animals.  We are just saying that "male" is a gender that can fall into several broader categories as part of a concept scheme: "genders that animals can have" and "genders that parts of plants can have".  This is what was meant by "the undifferentiated hierarchical relationship most commonly found in thesauri is inadequate for the reasoning functions..." in the text of ISO 25964.  The hierarchical relationships of thesauri can guide human categorizers and searchers, but they don't automatically entail additional facts.
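Here is a sketch of how a SKOS-aware application might generate those suggestions, using Python's rdflib and a SPARQL query over a toy concept scheme (the "Female" concept dwcsex:12346 is made up for the example):

from rdflib import Graph

g = Graph()
g.parse(data="""
@prefix skos:   <http://www.w3.org/2004/02/skos/core#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dwcsex: <http://rs.tdwg.org/dwc/cv/sex/> .

dwcsex:12345 a skos:Concept ; rdfs:label "Male"@en ;
             skos:broader dwcsex:12347 , dwcsex:12348 .
dwcsex:12346 a skos:Concept ; rdfs:label "Female"@en ;
             skos:broader dwcsex:12347 .
dwcsex:12347 a skos:Concept ; rdfs:label "genders that animals can have"@en .
dwcsex:12348 a skos:Concept ; rdfs:label "genders that parts of plants can have"@en .
""", format="turtle")

# "Not finding what you want?  Here are other genders that animals can have."
other_options = g.query("""
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?label WHERE {
        ?concept skos:broader <http://rs.tdwg.org/dwc/cv/sex/12347> ;
                 rdfs:label ?label .
    }""")

for row in other_options:
    print(row.label)    # Male, Female -- suggestions only; nothing is entailed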

Screwdriver or hammer?

Now that I've explained a little about the differences between thesauri and ontologies, which one is the right tool for the controlled vocabularies?  There is no definite answer to this, but the common practice for most controlled vocabularies seems to be the thesaurus approach.  That's the approach used by all of the Getty Thesauri, and also the approach used by the Library of Congress for the controlled vocabularies it defines.  In the case of both of these providers, the machine-readable forms of their controlled vocabularies are expressed as SKOS concept schemes, not ontologies.  

That is not to say that all Darwin Core terms that currently say "best practice is to use a controlled vocabulary" should be defined as SKOS concept schemes.  In particular, the two vocabularies that John spoke about at length in his webcast (dwc:basisOfRecord and dcterms:type) should probably be defined as ontologies.  That's because they both are ways of describing the kind of thing something is, and that's precisely the purpose of an rdfs:Class.  One could create an ontology that asserts:

dwc:PreservedSpecimen rdfs:subClassOf dctype:PhysicalObject.

and all preserved specimens could automatically be reasoned to be physical objects whether a data provider said so or not [4].  In other words, machines that are "aware" of that ontology would "know" that

"preserved specimen" is_a "physical object".

without any effort on the part of data providers or aggregators.

But both dcterms:type and dwc:basisOfRecord are really ambiguous terms that are a concession to the fact that we try to cram information about all kinds of resources into a single row of a spreadsheet.  The ambiguity about what dwc:basisOfRecord actually means is the reason why the RDF guide says to use rdf:type instead [5].  There is no prohibition against having as many values of rdf:type as is useful.  You can assert:

my:specimen rdf:type dwc:PreservedSpecimen;
            rdf:type dctype:PhysicalObject;
            rdf:type dcterms:PhysicalResource.

with no problem, other than it won't fit in a single cell in a spreadsheet!

So what if we decide that "controlled values" for a term should be ontology classes?

The draft Standards Documentation Specification says that controlled value terms will be typed as skos:Concepts.  What if there is a case where it would be better for the "controlled values" to be classes from an ontology?  There is a simple answer to that.  Change the Darwin Core term definition to say "Best practice is to use a class from a well-known ontology." instead of saying "Best practice is to use a controlled vocabulary."  That language would be in keeping with the definitions and descriptions of controlled vocabularies and ontologies given in ISO 25964.  Problem solved.

I should note that it is quite possible to use all of the SKOS label-related properties (skos:prefLabel, skos:altLabel, skos:hiddenLabel) with any kind of resource, not just SKOS concepts.  So if it were decided in a particular case that it would be better for "controlled values" to be defined as classes in an ontology rather than as concepts in a thesaurus, one could still use the multi-lingual strategy described earlier in the post.

Also, there is no particular type for the values of dwciri: terms.  The only requirement is that the value must be a URI rather than a string literal.  It would be fine for that URI to denote either a class or a concept.

So one of the tasks of a group creating a "controlled vocabulary" would be to define the use cases to be satisfied, and then decide whether those use cases would be best satisfied by a thesaurus or an ontology.

Feedback!  Feedback! Feedback!

If something in this post has pushed your button, then respond by making a comment about the Standards Documentation Specification before the 30 day public comment period ends on or around March 27, 2017.  There are directions from the Review Manager, Dag Endresen, on how to comment at http://lists.tdwg.org/pipermail/tdwg-content/2017-February/003690.html .  You can email anonymous comments directly to him, but I don't think any members of the Task Group will get their feelings hurt by criticisms, so an open dialog on the issue tracker or tdwg-content would be even better.

Footnotes

[1] Blog entries from my blog http://baskauf.blogspot.com/
March 14, 2016 "Ontologies, thesauri, and SKOS"
March 21, 2016 "Controlled values for Subject Category from Audubon Core"
April 1, 2016 "Controlled values for Establishment Means from Darwin Core"
April 4, 2016 "Controlled values for Country from Darwin Core"

[2] The IETF RFC 3986 defines the syntax of URIs.  A superset of URIs, known as Internationalized Resource Identifiers or IRIs is now commonly used in place of URIs in many standards documents.  For the purpose of this blog post, I'll consider them interchangeable.

[3] There are also terms, like dwciri:inDescribedPlace that are related to an entire set of Darwin Core terms ("convenience terms").  Talking about those terms is beyond the scope of this blog post, but for further reading, see the explanation in Section 2.7 of the RDF Guide.

[4] Cautionary quibble: in John's diagram, he asserted that dwc:LivingSpecimen is_a dctype:PhysicalObject.  However, the definition of dctype:PhysicalObject is "An inanimate, three-dimensional object or substance.", which would not apply to many living specimens.  A better alternative would be the class dcterms:PhysicalResource, which is defined as "a material thing".  However, dcterms:PhysicalResource is not included in the DCMI type vocabulary - it's in the generic Dublin Core vocabulary.  That's a problem if the type vocabulary is designated as the only valid controlled vocabulary to be used for dcterms:type.

[5] See Section 2.3.1.4 of the RDF Guide for details.  DCMI also recommends using rdf:type over dcterms:type in RDF applications.  A key problem that our community has not dealt with is clarifying whether we are talking about a resource, or the database record about that resource.  We continually confuse these two things in spreadsheets, and most of the time we don't care.  However, the difference becomes critical if we are talking about modification dates or licenses.  A spreadsheet can have a column for "license", but is the value describing a license for the database record, or for the image described in that record?  A spreadsheet can have a value for "modified", but does that mean the date that the record was last modified or the date that the herbarium sheet described by the record was last modified?  With respect to dwc:basisOfRecord, the local name of dwc:basisOfRecord ("basis of record") implies that the value is the type of the resource upon which a record is based, which implies that the metadata in the spreadsheet row is about something else.  That "something else" is probably an occurrence.  So we conflate "occurrence" with "preserved specimen" by talking about both of them in the same spreadsheet row.  According to its definition, dwc:Occurrence is "An existence of an Organism (sensu http://rs.tdwg.org/dwc/terms/Organism) at a particular place at a particular time.", while a preserved specimen is a form of evidence for an occurrence, not a subclass of occurrence.  That distinction doesn't seem important to most people who use spreadsheets, but it is important if an occurrence is documented by multiple forms of evidence (specimens, images, DNA samples) - something that is becoming increasingly common.  What we should have is an rdf:type value for the occurrence, and a separate rdf:type value for the item of evidence (possibly more than one).  But spreadsheets aren't complex enough to handle that, so we limp along with dwc:basisOfRecord.