Sunday, October 23, 2016

Guid-O-Matic goes to China


Note: This is the first post in a series of two.  The second post is here.
(left) Thousand Buddha Hall of Lingyan Temple. By G41rn8 - Own work, CC BY-SA 4.0.

For a little over a year, I've been participating in a Linked Data and Semantic Web Working Group at Vanderbilt University.  It's a rather diverse group that includes mostly librarians and digital humanists (I'm the token natural scientist).  It is a fun group with a positive, can-do attitude, and most members are relatively new to RDF, Linked Data, etc.  There are two reasons why working with this group is great.  One is that they have not (yet) become jaded and cynical about RDF and its associated stack of promises, as seems to be typical after people beat their heads against it for several years.  The other is that digital humanists generally seem to have a very practical view of adopting technology: they try stuff out to see if it works, then try to build something useful out of what works for them.  Granted, my experience with digital humanists is rather narrow, but at least this seems to be the case at Vanderbilt.

This fall, the group got support from the Vanderbilt Institute for Digital Learning (VIDL) in the form of a faculty working group grant (thanks VIDL!).  This has brought some additional faculty into the group from departments across campus.  As a "learning by doing" project, we have been working with some data on Chinese religious sites and buildings provided by Tracy Miller of the Department of History of Art.  It is a great dataset to work with because it has a structure that is complex enough to make it worth trying to express it as a graph, but simple enough to not be too confusing for Linked Data beginners.  It also contains multilingual data, with some data expressed in simplified Chinese characters.  Eventually, we hope to link these data to images cataloged in the DIMLI system used by the History of Art faculty in classroom instruction and publications.  Tracy, William Sealy of DIMLI, and Yuh-Fen Benda, Asian Studies Librarian, have been working to whip the data into shape as we play with turning it into Linked Data.

This project has inspired me to work on a project that is the subject of the rest of this blog post: creating a new version of Guid-O-Matic.

What the heck is Guid-O-Matic?

In 2010, I attended my first annual meeting of Biodiversity Information Standards (TDWG) at Woods Hole, Massachusetts.  I had recently implemented HTTP URIs as persistent unique identifiers at Bioimages, with content negotiation to provide RDF/XML when requested by the client (an arrangement that has been stable and functional for 6 years now, I might add).  As a programming amateur and informatics neophyte, I had managed to do it, so it didn't seem to me that implementing HTTP URIs as persistent identifiers was really that hard.  I gave a short talk at the meeting basically saying that, and one feature of the talk was a presentation of a cute little program that I had written in Visual Basic called "Guid-O-Matic", which took input from a CSV spreadsheet and turned it into a static RDF/XML file suitable for a server to provide when a client requested Content-Type: application/rdf+xml.

The purpose of Guid-O-Matic was just a demonstration, so it never really went anywhere.  But I loved the little squid icon (GUID rhymes with squid, get it?), so when I started a new effort to turn CSV spreadsheet data into RDF this fall, I decided to brand the software as Guid-O-Matic 2.0 (GOM2).  GOM2 is written in XQuery, a functional programming language that seems to be popular among digital humanists (at least it's popular with the digital humanists at Vanderbilt).  I learned it at Cliff Anderson's XQuery working group last year and was intrigued by it because the BaseX implementation has a web application component that could be used to serve the output of XQuery scripts.  (Technically, I probably should be calling XQuery programs "queries", but I'll refer to them as scripts.)  The main GOM2 script is actually an XQuery function embedded in an XQuery module that could eventually be called by a BaseX RESTXQ web application service.  At the moment, I'm just calling the GOM2 function from a stub XQuery script that does nothing more than allow the user to substitute appropriate arguments, then call the function.  I will not go into the details here since I've attempted to document them elsewhere.  If you want to try the activities I'm going to describe in the rest of this post, I encourage you to install BaseX, clone the Guid-O-Matic GitHub repo, and hack away.

What does Guid-O-Matic do?

The basic function of GOM2 is to map the components of a metadata table to RDF triples.  Each row of the table represents the subject of a triple, each column represents the predicate, and the value of the cell at the intersection of a row and column is the object.  Thus, every cell in the table that has a value will generate a triple.  The entire table will generate a graph containing triples that describe a class of resources whose instances are represented by the individual rows.
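The row/column/cell mapping can be sketched in a few lines of Python.  This is not GOM2 itself (which is written in XQuery); it just illustrates the idea, with a hypothetical one-row table in the style of the site.csv example:

```python
# Python sketch of the core GOM2 idea (the real tool is written in
# XQuery): each row is a subject, each column a predicate, and each
# non-empty cell supplies the object of one triple.
import csv
import io

# Hypothetical one-row metadata table in the style of site.csv
site_csv = """siteID,name_zh-Hans,dynasties
Lingyansi,灵岩寺,Tang to Qing
"""

def table_to_triples(text, subject_prefix=""):
    """Yield (subject, column, value) for every non-empty cell."""
    for row in csv.DictReader(io.StringIO(text)):
        key = row.pop("siteID")           # primary-key column
        subject = subject_prefix + key    # prepend the default IRI prefix
        for column, value in row.items():
            if value:                     # empty cells generate no triple
                yield (subject, column, value)

triples = list(table_to_triples(site_csv))
# two non-key cells in the row, so two triples
```

Here the column header stands in for the predicate; a separate mapping table (described below) translates headers into real predicate IRIs.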

In the example above, the rows represent instances of a "site" class, stored in a file called site.csv in this example.  The first row represents the particular site called "Lingyansi".  The column name_zh-Hans represents a "label" property (i.e. predicate; I will use the terms "predicate" and "property" interchangeably).  灵岩寺 is the value of the label property - in RDF terms, it is a literal object of the triple.  The first column of the table provides a unique string for each site, derived from the Chinese name of the site transliterated into Roman characters.  Since each row has a unique value in that column, we use it as a primary key for the row.  In RDF, identified subjects of triples must be denoted by an IRI (internationalized resource identifier; functionally the same as a URI and the current preferred term, so I'll use it).  It's possible that a table may already contain a unique IRI in each row.  In that case, we're good to go.  However, if it doesn't, GOM2 allows you to designate a default IRI prefix that can be prepended to the primary key to form a valid IRI.  In this example, I used "" as the default prefix - Tracy's group will change it later when a decision is made about the final host for the data.

In Turtle serialization, I would like to state the triple that I've described as:

<> rdfs:label "灵岩寺"@zh-Hans.

There are a couple of problems.  One is that GOM2 needs to know that the actual predicate for the column name_zh-Hans is rdfs:label.  The other is that GOM2 needs to know that I intend for the value "灵岩寺" to be a literal with the language tag zh-Hans.  I accomplish this through a separate CSV file that contains column mappings, called site-column-mappings.csv in this example.  Here's what it looks like:

The first column in the mapping table gives the name of the header for the column in the metadata table site.csv.  The second column specifies the predicate IRI to be used for triples generated using values from that column.  The type column indicates that the object of the triple is a language-tagged literal, and the value in the attribute column gives the IETF language tag (zh-Hans for simplified Chinese characters) to be used with the literal value in the column that is being mapped.
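In Python pseudocode (not GOM2's actual XQuery, and with made-up codes for the type column), applying one row of the mapping table looks something like this:

```python
# Sketch of applying one row of site-column-mappings.csv.  The type
# values "language" and "iri" are illustrative stand-ins for whatever
# codes GOM2 actually uses in the mapping file.
def serialize_object(value, obj_type, attribute):
    """Render the object of a triple as a Turtle token."""
    if obj_type == "language":            # language-tagged literal
        return '"%s"@%s' % (value, attribute)
    if obj_type == "iri":                 # object is an IRI reference
        return "<%s>" % value
    return '"%s"' % value                 # plain literal

mapping = {"header": "name_zh-Hans", "predicate": "rdfs:label",
           "type": "language", "attribute": "zh-Hans"}
triple = "<Lingyansi> %s %s." % (
    mapping["predicate"],
    serialize_object("灵岩寺", mapping["type"], mapping["attribute"]))
# → <Lingyansi> rdfs:label "灵岩寺"@zh-Hans.
```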

The site.csv table also includes a column that provides a GeoNames IRI for the village that is near the temple site.  Using that information, I'd like to assert the triple:

<> foaf:based_near <>.

I set up the mapping for that column in a similar way to the label, except that I indicate that the type of object for the triple is an IRI rather than a literal.  In reality, we don't store the whole IRI in the site.csv table.  Rather, we store only the GeoNames ID number for Lingyansi: "1803429".  For an IRI-object triple, if the value column of the mapping table has a value, GOM2 will prepend that value ("") before the value in the column being mapped ("1803429"), and append the value in the attribute column ("/") after the value in the column being mapped.  There are more details here about how to map plain literals and datatyped literals, and how to insert properties that have a constant value for every record.
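The prepend/append rule is simple string concatenation.  A sketch follows; note that the GeoNames prefix shown is my assumption about what the value column contains, since the source table stores only the bare ID:

```python
# Sketch of the IRI-object rule: the mapping table's value column is
# prepended and its attribute column appended around the cell value.
# The http://sws.geonames.org/ prefix is an assumption; the metadata
# table stores only the bare ID "1803429".
def build_iri(cell, prefix, suffix):
    return "<" + prefix + cell + suffix + ">"

iri = build_iri("1803429", "http://sws.geonames.org/", "/")
# → <http://sws.geonames.org/1803429/>
```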

One-to-one relationships in a table row

In the example, the range of dynasties over which the buildings at the site were built was described using the literal "Tang to Qing".  The simplest way to express this as an RDF triple would be to make up a bespoke predicate whose intended value would be a literal that describes the time period, e.g. ex:siteBeginningAndEndingDynasties.  That approach would be very straightforward, but it's unlikely that any generic client would ever be able to interpret what the literal means.  It would be better to use a well-known predicate like dcterms:temporal, whose definition seems just right: "Temporal characteristics of the resource."  However, the range of dcterms:temporal is not rdfs:Literal; it's dcterms:PeriodOfTime.  So the appropriate way to express the relationship would probably be:

<> dcterms:temporal _:1.
_:1 rdfs:label "Tang to Qing";
    a dcterms:PeriodOfTime.

In talking about the meaning of the rows in the metadata CSV file, I oversimplified by saying that the data in a row represent values of properties for an instance of a single class (the "site" class in this example).  In actuality, the row may contain values for properties of the root class (site) as well as values for properties of other classes whose instances have a one-to-one relationship with the root class (e.g. dcterms:PeriodOfTime).  In line 5 of the table from the site-column-mappings.csv file, a link is specified that connects the instance of the root class (site) to the instance of a related class that has a one-to-one relationship with it (the period of time).  In this case, I don't really have any interest in minting an identifier for the time period, so I've chosen to let the time period that's associated with the site be a blank node (indicated by _:1).
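A sketch of the triples generated for such a one-to-one link (the function and names are illustrative, not GOM2's internal API):

```python
# One row yields triples about the site itself AND about an anonymous
# dcterms:PeriodOfTime node linked to it by dcterms:temporal.
def period_triples(site, label, bnode="_:1"):
    return [
        (site, "dcterms:temporal", bnode),            # link to the period
        (bnode, "rdfs:label", '"%s"@en' % label),     # literal description
        (bnode, "rdf:type", "dcterms:PeriodOfTime"),  # class of the bnode
    ]

triples = period_triples("<Lingyansi>", "Tang to Qing")
```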

GOM2 requires another CSV file that lists the classes that are represented in the metadata table.  In this example, it's called site-classes.csv, and it looks like this:

The first column contains the identifier used for the class in the mapping table and the second column contains the IRI for the class.  Note that "_:1" doesn't mean that particular string will be used as the node identifier for the blank node.  GOM2 will actually generate a random node identifier when it serializes the RDF.  Rather, "_:1" just means that it's the first class in the list represented by a blank node.  
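Minting a fresh node identifier at serialization time can be sketched with a random UUID, which resembles the identifiers that appear in the serialized output further down:

```python
# "_:1" in the class table is only a placeholder; the serializer mints
# a new blank node identifier each time, e.g. from a random UUID.
import uuid

def fresh_bnode():
    return "_:" + str(uuid.uuid4())

a = fresh_bnode()
b = fresh_bnode()
# a and b are distinct identifiers like _:93b9e6a0-...
```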

One thing that I didn't mention is that GOM2 has to know the namespaces behind the prefixes used in the abbreviated IRIs (compact IRIs, a.k.a. CURIEs or QNames). There is one more table that's needed by the software, namespace.csv, which looks like this:

Rows 1 through 6 contain namespace abbreviations that are always needed by GOM2, and rows 7 through 9 were added because those namespaces were used in the site.csv file.  GOM2 now has all of the information it needs to create the graph.  (There are a few details that I've left out, particularly related to generating the metadata about the output document - you can read more elsewhere if you are interested.)
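Turning the namespace rows into Turtle @prefix declarations is a one-liner.  A sketch with three of the standard namespaces (the two-column structure is my reading of namespace.csv):

```python
# Sketch: each (abbreviation, namespace IRI) row of namespace.csv
# becomes one @prefix declaration in the Turtle output.
namespaces = [
    ("rdf", "http://www.w3.org/1999/02/22-rdf-syntax-ns#"),
    ("rdfs", "http://www.w3.org/2000/01/rdf-schema#"),
    ("dcterms", "http://purl.org/dc/terms/"),
]

def prefix_block(rows):
    return "\n".join("@prefix %s: <%s>." % (abbr, iri) for abbr, iri in rows)

turtle_header = prefix_block(namespaces)
```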

If I select Turtle as the output serialization, here is what I get:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
@prefix dc: <http://purl.org/dc/elements/1.1/>.
@prefix foaf: <http://xmlns.com/foaf/0.1/>.
@prefix dcterms: <http://purl.org/dc/terms/>.
@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>.
@prefix dwc: <http://rs.tdwg.org/dwc/terms/>.

<> rdfs:label "灵岩寺"@zh-Hans;
     foaf:based_near <>;
     dcterms:temporal _:93b9e6a0-03bb-4a3f-bf62-81ac86e97d26;
     a geo:SpatialThing.

_:93b9e6a0-03bb-4a3f-bf62-81ac86e97d26
     rdfs:label "Tang to Qing"@en;
     a dcterms:PeriodOfTime.

<>
     dc:format "text/turtle";
     dc:creator "Vanderbilt Department of History of Art";
     rdfs:comment "Generated by Guid-O-Matic 2.0";
     dcterms:references <>;
     dcterms:modified "2016-10-19T13:46:00-05:00"^^xsd:dateTime;
     a foaf:Document.

Here's the JSON-LD serialization:

"@context": {
"rdf": "",
"rdfs": "",
"xsd": "",
"dc": "",
"foaf": "",
"dcterms": "",
"geo": "",
"dwc": ""
"@graph": [
"@id": "",
"rdfs:label": {"@language": "zh-Hans","@value": "灵岩寺"},
"foaf:based_near": {"@id": ""},
"dcterms:temporal": {"@id": "_:ec49b21f-7204-46e7-84ad-0ff1daa13f6b"},
"@type": "geo:SpatialThing"
"@id": "_:ec49b21f-7204-46e7-84ad-0ff1daa13f6b",
"rdfs:label": {"@language": "en","@value": "Tang to Qing"},
"@type": "dcterms:PeriodOfTime"
"@id": "",
"dc:format": "application/json",
"dc:creator": "Vanderbilt Department of History of Art",
"rdfs:comment": "Generated by Guid-O-Matic 2.0",
"dcterms:references": {"@id": ""},
"dcterms:modified": {"@type": "xsd:dateTime","@value": "2016-10-19T13:46:00-05:00"},
"@type": "foaf:Document"

Here's the XML serialization:

<rdf:Description rdf:about="">
     <rdfs:label xml:lang="zh-Hans">灵岩寺</rdfs:label>
     <foaf:based_near rdf:resource=""/>
     <dcterms:temporal rdf:nodeID="Uf95cc1f6-03c9-43b4-9074-786ff33c493e"/>
     <rdf:type rdf:resource="http://www.w3.org/2003/01/geo/wgs84_pos#SpatialThing"/>
</rdf:Description>

<rdf:Description rdf:nodeID="Uf95cc1f6-03c9-43b4-9074-786ff33c493e">
     <rdfs:label xml:lang="en">Tang to Qing</rdfs:label>
     <rdf:type rdf:resource="http://purl.org/dc/terms/PeriodOfTime"/>
</rdf:Description>

<rdf:Description rdf:about="">
     <dc:creator>Vanderbilt Department of History of Art</dc:creator>
     <rdfs:comment>Generated by Guid-O-Matic 2.0</rdfs:comment>
     <dcterms:references rdf:resource=""/>
     <dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2016-10-19T13:46:00-05:00</dcterms:modified>
     <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Document"/>
</rdf:Description>


All of these serializations contain the same 6 triples shown in the example "bubble" graph diagram shown earlier (plus some extra triples about the RDF document itself).  You can generate your own "bubble" graph diagram by copying the XML serialization, pasting it into the W3C RDF Validator, selecting "Graph Only", then clicking "Parse RDF".

In this example, I chose to use a blank node for the dcterms:PeriodOfTime instance.  Another option would be to assign it an IRI formed by appending a fragment identifier to the root IRI for the geo:SpatialThing instance.  I'll illustrate that approach in an example in the next blog post.

If you are a digital humanist with marginal interest in generating RDF, but are not interested in the more subtle issues involving choosing an RDF graph model, you will probably want to stop here.  If you are a TDWG stalwart who cares about Darwin Core, GBIF, etc., or if you are anyone interested in approaches for dealing with generating more complex RDF graphs from linked CSV tables, continue on to the next blog post.  In either case, I'd encourage you to download BaseX (if you don't already have it), download the Guid-O-Matic repo, try running the script on the example data, then hack the various CSV tables to see what happens.  Instructions for getting started are here.  
