Monday, October 31, 2016

Linked Data "Magic" and "Big" data

Start-of-Term Feast at Hogwarts from http://harrypotter.wikia.com/wiki/Great_Hall CC BY-SA


"Looking up" and "discovering things"


In his famous 2006 post on Linked Data, Tim Berners-Lee stated four principles of Linked Data:

  1. Use URIs as names for things
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL)
  4. Include links to other URIs. so that they can discover more things.

This whole idea sounds so cool, it seems almost magical.  If we use HTTP URIs [1], we can "look up" things and "discover more things".  But as a practical matter, how do we "discover more things"?
From the standpoint of machine-aided learning on the Web of Data, "discovery" could fall into several categories.  "Discovery" could comprise the addition of information to what a machine client has already accumulated.  In the language of RDF, we could say that this sort of learning happens when additional triples are added to the graph, either by merging newly discovered triples into the existing graph, or by materializing triples that are entailed by the triples already in the graph. [2]

From the human perspective, "discovery" can happen when a human performs queries on the data in the graph and finds connections that were not previously recognized - either because the newly added triples have previously unknown connections to existing triples, or because the links among the data are so complex that connections between resources of interest are not apparent.

On a small scale, "looking things up" seems rather trivial: a client discovers a link to a new HTTP URI in an existing triple, dereferences the URI and receives some RDF, and then has additional triples connected to the existing one.  Voilà!  However, when one starts operating on the scale of hundreds, thousands, or millions of URIs that need to be "looked up", the situation gets more complicated.

As I was pondering this, Harry Potter came to mind.  In the first book, Harry and his friends are sitting in the Great Hall of Hogwarts at the welcoming feast and food magically appears on their plates, one course after another.  However, in later books, Hermione discovers that the food is actually prepared by house elves toiling away in the basement of Hogwarts.  Magic is required to make the food appear on the students' plates, but the actual creation of the food is labor-intensive, low-tech, and depends on virtual slavery.  I then imagined myself as the person at Hogwarts whose job it was to manage the house elves and figure out how to get all of that food from the kitchen onto the plates.  The process didn't really seem magical at all - actually more like tedious and uninteresting.

The "Hogwarts kitchen chore" here is figuring out how to connect my million triples to a million triples somewhere else so that I can query across the entire graph.

Approach 1: Retrieve data about individual resources by dereferencing their URIs

We do have the "magic" of HTTP at our disposal, but does my client really make a million HTTP calls to a server somewhere and then deal with the results one-by-one as they are received?  That doesn't seem very practical.  One alternative would be to save the triples as I've retrieved them.  It would still require a lot of HTTP calls, but I would only have to do it once if I stored the results in my local triplestore.  I would essentially be a little "Google bot" scraping the Linked Data Web.  In a previous post [3], I described playing around with a little Python script to do a miniaturized version of this, and I hope to play around with this approach more in the future.
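Just to make Approach 1 concrete, here is a minimal sketch of that kind of scraping using rdflib, which can dereference an HTTP URI (with content negotiation) and parse whatever RDF the server returns.  The URI list just reuses a couple of GeoNames feature URIs that appear later in this post; whether a particular server cooperates depends on its content negotiation, and the real script linked above [3] differs in its details.

import rdflib

# URIs discovered as objects of triples in my local graph (illustrative examples)
uris_to_look_up = [
    "http://sws.geonames.org/3/",
    "http://sws.geonames.org/5/"
]

accumulated = rdflib.Graph()
for uri in uris_to_look_up:
    try:
        # rdflib performs the HTTP GET (asking for RDF) and parses what it receives
        accumulated.parse(uri)
    except Exception as error:
        print("could not retrieve", uri, ":", error)

# Save the accumulated triples locally so the HTTP calls only have to be made once
accumulated.serialize(destination="retrieved-triples.ttl", format="turtle")
print(len(accumulated), "triples retrieved")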

Approach 2: Federated SPARQL query

If the million other triples that you want to connect your graph to are located in one place, and if that place provides access to those triples through a SPARQL endpoint, you could just leave those triples there and carry out your SPARQL query explorations using a federated SPARQL query.  This seems like a simple solution, since there isn't any need to move all of those triples from there to here.  Last fall, our Semantic Web Working Group at Vanderbilt studied Bob DuCharme's book Learning SPARQL, and as part of our experimentation, we tried running some federated queries.  You can see two examples here.  What I discovered from that exercise is that I can't just throw together the typical sloppy SPARQL query that I'm inclined to write.  The bindings coming from the remote server have to be transferred to the local server running the query before the federated query can be completed.  If there are few bindings (as in example 10), the query executes with no noticeable delay.  On the other hand, if there are many (as in example 11), the query cannot be completed without the transfer of a massive amount of data from the remote server.  Either the query times out, or you wait forever for it to complete.  The same query could be completed easily in a short amount of time if all of the data were in a single triple store.  So federated queries are a potentially powerful way of looking for connections between two large blobs of data, but they have to be constructed carefully, with thought toward keeping the number of bindings that must be transferred from the remote server to a reasonable size.
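To make the binding-transfer issue concrete, here is a sketch of the general shape of such a query, wrapped in a short Python script that sends it to a local SPARQL endpoint.  The endpoint URL and the owl:sameAs linking pattern are assumptions for illustration, not the actual queries from the examples linked above; the point is that the local patterns should bind only a modest number of values before the SERVICE block involves the remote server.

import requests

# Local triplestore endpoint (an assumption - adjust to your own installation)
local_endpoint = "http://localhost:9999/blazegraph/sparql"

# The local pattern binds a small number of ?dbpediaUri values, so only those
# bindings (not millions of them) have to travel between the two servers.
query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>

SELECT ?localThing ?remoteLabel
WHERE {
  ?localThing owl:sameAs ?dbpediaUri .    # local triples linking out to DBpedia (assumed)
  SERVICE <http://dbpedia.org/sparql> {
    ?dbpediaUri rdfs:label ?remoteLabel .
    FILTER (lang(?remoteLabel) = "en")
  }
}
LIMIT 100
"""

response = requests.get(local_endpoint,
                        params={"query": query},
                        headers={"Accept": "application/sparql-results+json"})
for row in response.json()["results"]["bindings"]:
    print(row["localThing"]["value"], row["remoteLabel"]["value"])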

Approach 3: Retrieve the entire remote dataset in one blob

If the million triples that you want to combine with your million triples are from a single source, and if that source provides an RDF dump, you can just download the whole dump and load it into your local triple store.  This may seem like overkill, but if the primary limitation is transfer of data across the Internet (as it is in Approaches 1 and 2), this approach may make the most sense.  Because of the large amount of redundancy in the literals and URIs, in many cases a large RDF graph serialized as Ntriples (a typical serialization choice for an RDF dump) will compress to a size that is an order of magnitude smaller than the uncompressed size.  For example, the entire contents of the Getty Thesaurus of Geographic Names (containing pretty much the name and location of every major place on earth) as RDF/Ntriples is 13.8 Gb uncompressed, but only 661 Mb zip compressed (a factor of about 20).  With a good high-speed network, you can retrieve the entire dataset in a time measured in minutes.  However, getting that dataset loaded into a triple store is an adventure whose description will take up most of the rest of this post.
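Purely for illustration, the retrieval step might look something like the sketch below.  The download URL is a placeholder (substitute the link from the data provider's download page), and for a file this size a download manager or curl would work just as well.

import zipfile
import requests

# Placeholder URL - substitute the actual dump location
dump_url = "http://example.org/exports/explicit.zip"

# Stream the download to disk so the multi-gigabyte file is never held in memory
with requests.get(dump_url, stream=True) as response:
    response.raise_for_status()
    with open("dump.zip", "wb") as zip_file:
        for chunk in response.iter_content(chunk_size=1024 * 1024):
            zip_file.write(chunk)

# Unzip the serialized RDF files so they can be loaded into the triple store
with zipfile.ZipFile("dump.zip") as archive:
    print(archive.namelist())
    archive.extractall("dump")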

Approach 4: Just build the whole giant graph in one place

When I was first pondering this post, I was only considering three possible solutions to the problem.  However, today I attended Cliff Anderson's excellent workshop on Getting Started with Wikidata at the Vanderbilt Heard Library and a fourth possibility occurred to me.  Eventually, maybe everyone who cares about linking data will just put it all in one place.  Clearly that is not a solution in the short term, and it seems a little bit silly to expect that it could happen.  But ten years ago I considered Wikipedia to be a joke and now it's probably one of the best things on the web, which shows what can be accomplished when a bunch of passionate, dedicated people work together to build something great.  If a similar amount of effort were expended on Wikidata, people who are serious about linking quality data might just put it all in one place, and the whole issue of moving triples and bindings from one place to another would become moot.

First experiment: GeoNames RDF dump

In the Linked Data and Semantic Web propaganda pieces, the examples used as illustrations often involve "toy" datasets that contain between 10 and 100 triples.  Last February, I decided to up the ante a little bit by conducting some experiments using the Bioimages RDF graph, which contains about 1.5 million triples and is downloadable as a compressed RDF/XML serialization.  You can read about the results of those experiments as part of another blog post [4].  In a nutshell, a local installation of Callimachus took about 8 minutes to load 1.2 million triples from a 109 Mb XML file, while Stardog took 14 seconds.  Needless to say, in my current set of experiments I did not bother testing Callimachus.

In the past, I've linked Bioimages locations to GeoNames URIs for geographic features that contained those locations.  It would be cool to leverage the hierarchies of political subdivisions described in the GeoNames database (but not fully available in the Bioimages dataset).  That would allow me (among other things) to access the multilingual labels for features at any level in the hierarchy.  This was the driving motivation behind my little Python experiment [3] to grab triples for the features to which I had linked.  However, that method introduces management issues.  When I add links to new features, do I re-scrape all of the linked URIs to generate an updated file containing all of the relevant GeoNames triples, or do I keep a list of only the new geographic features to which I'd linked and keep adding one small GeoNames subgraph after another to my Bioimages graph?  Mostly, I just forget to do anything at all, which breaks queries that traverse links involving new data added to Bioimages.

The whole GeoNames triple management issue would go away if I could just load the entire GeoNames RDF dump into the same triplestore with my Bioimages graph.  GeoNames makes an RDF dump available via their ontology documentation page.  The dump contains data about 10 951 423 geographic features described by 162 million triples (as of 2016 February).  The compressed zip file is 616 Mb, which is a reasonable download, and when uncompressed, it expands to a 14.7 Gb file called "all-geonames-rdf.txt".  What the heck does that mean?  ".txt" is not a typical file extension for any RDF serialization of which I'm aware.

When you are my age, you've picked up a number of bad habits, and one that I picked up from my father was to only read the instruction book as a last resort.  So my first effort was to just try to load the "all-geonames-rdf.txt" file into Stardog.  No dice.  Stardog immediately complained that the file wasn't in a valid RDF format.  So how do you find out what is actually inside a single text file that's 15 Gb?  Not by opening it in Notepad (yes, I did try!).  It's also way too big for rdfEditor (no surprise, although rdfEditor can open and validate the 1 239 236-triple images.rdf file from Bioimages with no problem).  I considered OpenRefine, but I'd tried and failed to open a 2.7 Gb Getty Thesaurus Ntriples file with it, so I didn't try that.[5]  So it became apparent that I wasn't going to open the whole file in any text editor.

With some help from StackOverflow, I next wrote a little batch file that would display one line at a time from a text file.  I was quickly able to determine that for some reason, the GeoNames people had chosen to do the dump as 11 million separate RDF/XML-serialized graphs, each one preceded by a line of plain text containing the URI of that feature, all concatenated into a single file, like this:

http://sws.geonames.org/3/
<?xml version="1.0" encoding="UTF-8" standalone="no"?><rdf:RDF xmlns:cc="http://creativecommons.org/ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:gn="http://www.geonames.org/ontology#" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:wgs84_pos="http://www.w3.org/2003/01/geo/wgs84_pos#"><gn:Feature rdf:about="http://sws.geonames.org/3/"><rdfs:isDefinedBy rdf:resource="http://sws.geonames.org/3/about.rdf"/>...<gn:locationMap rdf:resource="http://www.geonames.org/4/rudkhaneh-ye-zakali.html"/></gn:Feature></rdf:RDF>
http://sws.geonames.org/5/
<?xml version="1.0" encoding="UTF-8" standalone="no"?><rdf:RDF xmlns:cc="http://creativecommons.org/ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:gn="http://www.geonames.org/ontology#" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:wgs84_pos="http://www.w3.org/2003/01/geo/wgs84_pos#"><gn:Feature rdf:about="http://sws.geonames.org/5/"><rdfs:isDefinedBy rdf:resource="http://sws.geonames.org/5/about.rdf"/>...<gn:locationMap rdf:resource="http://www.geonames.org/5/yekahi.html"/></gn:Feature></rdf:RDF>
http://sws.geonames.org/6/
<?xml version="1.0" encoding="UTF-8" standalone="no"?><rdf:RDF xmlns:cc="http://creativecommons.org/ns#" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:gn="http://www.geonames.org/ontology#" xmlns:owl="http://www.w3.org/2002/07/owl#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:wgs84_pos="http://www.w3.org/2003/01/geo/wgs84_pos#"><gn:Feature rdf:about="http://sws.geonames.org/6/"><rdfs:isDefinedBy rdf:resource="http://sws.geonames.org/6/about.rdf"/>...<gn:locationMap rdf:resource="http://www.geonames.org/6/ab-e-yasi.html"/></gn:Feature></rdf:RDF>

etc.  Duh.  I suppose you could call that an RDF dump of some sort, but not one that was very easy to use.

Given my success with my earlier attempts to use the cool rdflib Python library to manipulate RDF data, I decided to try to write a Python script to import the stupidly formatted GeoNames graphs and output them as a single n3 formatted file.  You can see my attempt in this Gist.  The idea was to read in a junk text line and do nothing with it, then to read in the next line that contained the whole XML serialization for a feature, parse it, and add it to the growing accumulation graph.  Repeat 11 million times, then write the whole file in n3 serialization.  Although rdflib has a simple function to parse an RDF/XML file when it's opened, I couldn't figure out a way to get it to do the parsing on a line of text assigned to a variable.  Finally, I gave up and just wrote the line to a file and then opened the file with the parse function.  It was a great idea, but of course it hung when I tried to run it on the whole file.  At first, I thought that the 11 million file writes and reads were the problem.  But when I hacked the program to print a number every time it added 10 graphs to the accumulation graph, I could see that the problem was that as the accumulation graph got bigger, adding each additional graph took longer.  By the time it got to adding the 1000th graph, it was clear that it was going to take forever to do the remaining 10 999 000 graphs.
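In hindsight, the slowdown comes from accumulating an ever-growing graph in memory; a streaming approach that parses each feature's little graph on its own and immediately appends its triples to the output file should avoid it.  Here is a rough sketch of that idea.  It assumes the two-line-per-feature layout shown above, and it uses the data keyword of rdflib's parse() to parse the XML directly from the string rather than from a temporary file.

import rdflib

# Convert the GeoNames dump (alternating URI line / RDF/XML line) to one N-Triples file,
# writing each feature's graph out as soon as it is parsed instead of accumulating it.
with open("all-geonames-rdf.txt", "r", encoding="utf-8") as dump, \
     open("all-geonames.nt", "w", encoding="utf-8") as output:
    count = 0
    for line in dump:
        if line.startswith("<?xml"):
            # this line holds one feature's complete RDF/XML graph
            feature_graph = rdflib.Graph()
            feature_graph.parse(data=line, format="xml")
            nt = feature_graph.serialize(format="nt")
            if isinstance(nt, bytes):          # older rdflib versions return bytes
                nt = nt.decode("utf-8")
            output.write(nt)
            count += 1
            if count % 10000 == 0:
                print(count, "features converted")
        # lines that contain only the feature URI are skipped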

That was a disappointment, but like all good programmers, I was undaunted and searched the internet for a program that somebody else had written to do the job.  I found a Python 2 script written by Rakebul Hasan, which I hacked to work in Python 3 (Gist here).  I also added a counter to put a number on the screen after every 10 000th graph was added.  I let the script run for something like 6 hours, and in the end, I had a single file full of N-Triples!  Hooray!

I compressed the output file from 18.3 Gb down to 1.2 Gb and Dropboxed it to my work computer, where I had planned to load it into Stardog.  However, my attempt failed due to my aforementioned issue of not reading the instructions before starting in.  I had forgotten that the free Community version of Stardog is limited to 25 million triples per database (I was way over with the 162 million GeoNames triples).  Once our Semantic Web working group figures out all of the tests that we want to run, I think we will eventually try the free 30-day Stardog trial (Developer license) that has no data limits.  When we do that, we can try loading the GeoNames triples into Stardog again.

Second experiment: Loading 132 million Getty TGN triples into Blazegraph

After that debacle, I decided to try loading the entire Getty Thesaurus of Geographic Names (TGN) RDF dump into Blazegraph.  Unlike Stardog, open source Blazegraph has no restrictions on the number of triples that you can have in a store.  I decided to switch to loading the TGN first (instead of GeoNames) because the entire dump was broken into a number of smaller files, allowing me to look more closely at the effect of file size on load time.

I should note here that I'm using the term "Blazegraph" interchangeably with "Bigdata".  Technically, I believe that the actual data store itself is referred to as "Bigdata" whereas the software that interacts with it is "Blazegraph".  You can read the Getting Started guide yourself and try to figure out the distinction.  For convenience, henceforth I'm going to just use Blazegraph to refer to the whole system.

I had only recently set up Blazegraph on my work computer and hadn't yet had time to try it out.  Being the black sheep of the Vanderbilt computing community, I have a frumpy Windows machine in my office, unlike the cool Linux people and the stylish Mac people.  So of course, after I installed Blazegraph on my machine, it would not actually run SPARQL queries without throwing a Windows-specific error.  I can't remember what it was (I wrote it down and lost the paper), but 5 minutes of Googling found me the (Java?) settings change I had to make to fix the problem.  So I was ready to roll.

The Getty TGN Explicit Exports RDF dump is a zip archive that contains 17 nTriple-serialized files of varying sizes that describe various categories of data in the thesaurus.  The file names give some hint of what the files contain, and I was able to determine more exactly what kinds of properties were included in each file by taking a screenshot of the Blazegraph "Explore" screen for a particular resource before and after loading each file.  For example, after adding the first file (1subjects.nt) to the triplestore, I saw this:

After adding the second file (2terms.nt), I saw this:

The difference between the two screens showed that the "2terms.nt" file contained the triples linking to the preferred label data for the resources.

When I tried loading the first file, I was a bit alarmed because nothing seemed to be happening for a long time, but I was patient, and after about a half hour Blazegraph reported that the file had loaded!  Because of their size, some files took way longer than others to load.  Here is a table that shows approximately how long files of various sizes took to load:


I decided against loading TGNOut_RevisionHistory.nt because, although the revision history is important, it was not critical for the use cases I cared about, and given that it was 2-3 times as large as the other big files I'd been loading, I didn't really want to invest the hours that it would probably take to load.  The loading times are approximate because I was working on this project while other things were going on and I wasn't always paying attention.  At the end of a load, Blazegraph gives an elapsed time in ms, but I didn't notice this until after I'd already loaded several files, so I wasn't diligent about recording the value.

One interesting feature of the data shown above is that although there is a definite relationship between file size and loading time, it's not linear.  Loading the first 2.8 Gb file took a surprisingly short time.  I wasn't actually timing it, but I don't think it was more than a half hour.  So I was surprised when the next file (4.5 Gb) took about 7.5 hours to load.  One thing to note is that file size is not necessarily proportional to the number of triples, since triples with long literal values require more characters than triples with short literals or IRI values.  However, it seems unlikely that this is the entire explanation.  Having already confessed to being bad about reading instructions, I suspect the answer may be somewhere in the Blazegraph documentation, but I haven't read it carefully enough yet to have figured it out.

Clues about the reason for the long load times can be found by looking at the system performance while Blazegraph was loading triples:


You can see that immediately after I gave the command to load another file, the disk active time jumped from near zero to near 100%.  Part of this may just be disk writing related to saving the ingested triples, but I suspect that the major issue is that Blazegraph has to write to the disk continually because there isn't enough memory available to do what it needs to do.  I allocated 4 of the 8 Gb of memory on the computer to Blazegraph, and while it was loading triples, the memory usage would go over 90%.  I was also using the computer to do work at the same time and had other applications open that were competing for the memory.  The whole triple loading process was a real drag on the system, and any simple operation that involved disk use, like a file-open dialog, took forever.

There is a lot that I don't know about what is going on while Blazegraph is loading triples, but it is clear that for a task of this magnitude, it would be better to use a machine with more than 8 Gb of memory.  A solid-state drive might also help speed up whatever reading and writing can't be avoided in the process.

Where are the labels?!

Here is what I ended up with when I used the Explore tab of Blazegraph after loading the files:
If you compare this with what you see in a browser when you dereference http://vocab.getty.edu/tgn/1014952, everything is there except for several triples that were in the revision history file that I didn't load.  However, if you dereference the same URI requesting Content-type: text/turtle (or click on the N3/Turtle link from the search results page), you get this (headers omitted):

tgn:1014952 a gvp:Subject , skos:Concept , gvp:AdminPlaceConcept ;
    rdfs:label "Nashville" ;
    rdfs:seeAlso <http://www.getty.edu/vow/TGNFullDisplay?find=&place=&nation=&subjectid=1014952> ;
    dct:created "1991-09-13T00:41:00"^^xsd:dateTime ;
    skos:changeNote tgn_rev:5000847244 ;
    gvp:broader tgn:7013154 ;
    gvp:broaderPartitiveExtended tgn:7005685 , tgn:1000001 , tgn:7013154 , tgn:7012149 , tgn:7029392 ;
    gvp:broaderExtended tgn:7005685 , tgn:1000001 , tgn:7013154 , tgn:7012149 , tgn:7029392 ;
    gvp:broaderPreferredExtended tgn:7005685 , tgn:1000001 , tgn:7013154 , tgn:7029392 ;
    gvp:parentString "Ontario, Canada, North and Central America, World" ;
    skos:note tgn_rev:5000847244 ;
    gvp:parentStringAbbrev "Ontario, Canada, ... World" ;
    gvp:displayOrder "998"^^xsd:positiveInteger ;
    gvp:placeType aat:300008347 ;
    skosxl:prefLabel tgn_term:15232 ;
    skos:prefLabel "Nashville" ;
    gvp:broaderPartitive tgn:7013154 ;
    gvp:broaderPreferred tgn:7013154 ;
    skos:broader tgn:7013154 ;
    iso:broaderPartitive tgn:7013154 ;
    gvp:prefLabelGVP tgn_term:15232 ;
    skos:inScheme <http://vocab.getty.edu/tgn/> ;
    dct:contributor tgn_contrib:10000000 ;
    dct:source tgn_source:9006541-subject-1014952 ;
    gvp:placeTypePreferred aat:300008347 ;
    dc:identifier "1014952" ;
    skos:broaderTransitive tgn:7005685 , tgn:1000001 , tgn:7013154 , tgn:7012149 ;
    cc:license <http://opendatacommons.org/licenses/by/1.0/> ;
    void:inDataset <http://vocab.getty.edu/dataset/tgn> ;
    dct:license <http://opendatacommons.org/licenses/by/1.0/> ;
    prov:wasGeneratedBy tgn_rev:5000847244 ;
    foaf:focus tgn:1014952-place .


plus some triples about related things (like the place, sources, etc.).  If you compare this with what Blazegraph reports, you'll notice that there are two major categories of triples missing.  One is a set of triples related to broader categories, such as gvp:broader, gvp:broaderExtended, skos:broader, skos:broaderTransitive, etc.  The other is the set of labels, such as rdfs:label, skos:prefLabel, etc.

The difference is caused by the fact that these additional properties are not actually stored in the Getty TGN database.  Rather, they are properties that are entailed but not materialized in the Explicit Exports RDF dump.  The circumstances are described in section 6.6 of the Getty Vocabularies: Linked Open Data Semantic Representation document.  When the Getty Vocabularies SPARQL endpoint is loaded with fresh data every two weeks, the 17 files I listed above are loaded along with the Getty Ontology and other external ontologies like SKOS.  Getty's graph database has inference features that allow it to infer the entailed triples.  For example, gvp:broaderPreferred is a subproperty of gvp:broader, so including the triple:

tgn:1014952 gvp:broaderPreferred tgn:7013154.

in the RDF dump entails the triple:

tgn:1014952 gvp:broader tgn:7013154.

even though it's not in the dump.  Getty has defined a set of rules that limits the entailed triples to the categories they have deemed important; those triples are generated and then inserted back into their triplestore.  So entailed triples aren't inferred every time somebody runs a query - they are materialized once in each update cycle, and the queries are run over the graph that is the sum of the explicit and materialized entailed triples.  Getty does offer "Total Exports" RDF dumps, but those files are a lot larger.  At 19.8 Gb uncompressed for the Thesaurus of Geographic Names, the Total Export is about 50% larger than the 13.8 Gb Explicit Exports files I used.

So I've got a problem.  I really wanted the skos:prefLabel and skos:altLabel triples, but they are entailed, not explicit triples.  The explicit triple relevant to labels that is shown in the Explore tab for the example is:

tgn:1014952 skosxl:prefLabel tgn_term:15232.

The relationship between SKOS labels and SKOS-XL (SKOS eXtension for Labels) labels is described in Appendix B of the SKOS Reference.  SKOS uses a complex OWL feature called Property Chains to support the "dumbing-down" of SKOS-XL label entities (which can be identified by IRIs and be assigned properties) to vanilla SKOS lexical labels (i.e. literal value properties like skos:prefLabel, which cannot be assigned properties).  Here's the SKOS-XL label entity from the example:

 tgn_term:15232 a skosxl:Label ;
    gvp:term "Nashville" ;
    gvp:displayOrder "1"^^xsd:positiveInteger ;
    skosxl:literalForm "Nashville" ;
    gvp:termFlag <http://vocab.getty.edu/term/flag/Vernacular> ;
    gvp:termPOS <http://vocab.getty.edu/term/POS/Noun> ;
    gvp:contributorPreferred tgn_contrib:10000000 ;
    dct:contributor tgn_contrib:10000000 ;
    dct:source tgn_source:9006541-term-15232 ;
    dc:identifier "15232" .


Getty wants to include this entity in its explicit data because provenance data and display order information can be assigned to it - you can't do that with a literal.  If the Property Chain axioms in section B.3.2. are applied to the skosxl:prefLabel triple and the triples in the label entity description, the triple

tgn:1014952 skos:prefLabel "Nashville".

is entailed.

To solve my problem, I've got three options:
1. Put my giant blob of triples somewhere that can do Property Chain OWL reasoning, and let the skos:prefLabel triples be generated on the fly.  It's possible that Blazegraph can do that, but at this point I have no idea how you would make that happen.
2. Use a SPARQL construct query to construct all of the entailed skos:prefLabel and skos:altLabel triples that I want, then insert them into my Blazegraph graph.  This is probably the best option, but I haven't tried it yet (a rough sketch appears below, after the discussion of option 3).  The query would be pretty simple, but I have no idea how long it would run - it would generate about 4 million triples (based on the number of skosxl:Label instances in the VoID description of TGN).  I've never run a construct query that big before.
3. Use a more complicated SPARQL SELECT query using the SKOS-XL properties instead of the vanilla SKOS label properties.  That's annoying, but simpler in the short run.

If I wanted to use the triple pattern:

?location skos:prefLabel ?label.

I would instead need to use:

?location skosxl:prefLabel ?labelEntity.
?labelEntity skosxl:literalForm ?label.

Not really that bad, but annoying.
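Going back to option 2, here is a rough sketch of what the materialization might look like.  Instead of running a CONSTRUCT query and re-loading its output, a single SPARQL Update INSERT ... WHERE sent to the local endpoint could build the plain SKOS label triples in place from the SKOS-XL label entities.  This is only a sketch: the endpoint URL is an assumption (whatever your local installation uses), I haven't timed it against 4 million labels, and it ignores everything about the labels except their literal forms.

import requests

# Local Blazegraph SPARQL endpoint (an assumption - adjust to your installation)
endpoint = "http://localhost:9999/blazegraph/sparql"

# Materialize plain SKOS label triples from the explicit SKOS-XL label entities
update = """
PREFIX skos:   <http://www.w3.org/2004/02/skos/core#>
PREFIX skosxl: <http://www.w3.org/2008/05/skos-xl#>

INSERT {
  ?concept skos:prefLabel ?prefLiteral .
  ?concept skos:altLabel  ?altLiteral .
}
WHERE {
  { ?concept skosxl:prefLabel ?prefEntity .
    ?prefEntity skosxl:literalForm ?prefLiteral . }
  UNION
  { ?concept skosxl:altLabel ?altEntity .
    ?altEntity skosxl:literalForm ?altLiteral . }
}
"""

response = requests.post(endpoint,
                         data=update.encode("utf-8"),
                         headers={"Content-Type": "application/sparql-update"})
print(response.status_code, response.text)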

Take-homes from the School of Hard Knocks

At this point, I'm going to insert some comments about important take-homes that I've picked up in the process of this experimenting.

1. Capitalization and language tags.  Despite what Section 2.1.1 of RFC 5646 says about case sensitivity and conventions about capitalization, the Blazegraph SPARQL interface distinguishes between "zh-Hans" and "zh-hans".  (Maybe all SPARQL processors do?)  RFC 5646 says "The ABNF syntax also does not distinguish between upper- and lowercase: the uppercase US-ASCII letters in the range 'A' through 'Z' are always considered equivalent and mapped directly to their US-ASCII lowercase equivalents in the range 'a' through 'z'."  When Blazegraph reports lang(?label) for a literal ?label, it performs the ABNF mapping (to all lowercase) on the language tag.  Thus for

?label = "广德寺"@zh-Hans

lang(?label) = "zh-hans"

So the filter

FILTER (lang(?label)="zh-Hans")

will never produce results.  (Using langMatches(lang(?label), "zh-Hans"), which matches language tags case-insensitively, would be another way around this.)  I noticed that Getty TGN uses all lowercase, so to avoid problems in the future, I'm going to switch to using all lowercase in my language tags.

2. Named graphs.  In most SPARQL examples geared towards beginners, there is no graph specified.  All of the triples go into one unspecified pot (the default graph).  That's fine for playing around, but in production, one will need to remove sets of triples (i.e. graphs), then either replace them with a new graph (if it's an update) or not (if they simply aren't needed any more).  In SPARQL 1.1 Update, it's possible to manipulate specific triples using INSERT DATA and DELETE DATA, although that is unwieldy when dealing with large graphs.  I also found in my experimentation with Stardog that there are circumstances under which triples don't get removed. [6]

It's more straightforward and reliable to designate triples as belonging to some particular IRI-identified graph as they are loaded.  (They are then known as "quads" instead of "triples" - the fourth component that's added to the triple is the IRI of the graph that contains the triple.)  Then it is possible to drop the entire graph without affecting triples that were designated as belonging to other graphs.

The problem comes when you mix triples that were designated as belonging to a named graph with other triples that were not so designated.  The issue is summed up in the GraphDB documentation: "The SPARQL specification does not define what happens when no FROM or FROM NAMED clauses are present in a query, i.e., it does not define how a SPARQL processor should behave when no dataset is defined. In this situation, implementations are free to construct the default dataset as necessary."  In the case of the GraphDB implementation, it constructs the default dataset by including any triples that were not designated as part of any named graph as belonging to every named graph.

This was the behavior that I expected when I loaded the Getty dump into Blazegraph.  When loading triples using the UPDATE tab, I just gave commands like:

load <file:///c:/test/output/getty/TGNOut_1Subjects.nt>

and loaded the millions of triples serialized in the TGNOut_1Subjects.nt file into the Blazegraph unnamed default graph pot.  I then planned to load other sets of triples like this:

load <file:///c:/test/rdf/output/tang-song.ttl> into graph <http://lod.vanderbilt.edu/historyart/tang-song>

so that I could remove or replace those triples as a whole graph as experimentation progressed.

I expected that including the keyword FROM in a SPARQL query would add the specified graph to the default graph used in the query, for example:

SELECT DISTINCT ?site 
FROM <http://lod.vanderbilt.edu/historyart/tang-song>
WHERE {
  ?site a geo:SpatialThing.
  }

However, in this case, Blazegraph does not add the triples in the http://lod.vanderbilt.edu/historyart/tang-song graph to those in the unspecified pot.  Instead, it defines the default graph to be composed of only the triples in the http://lod.vanderbilt.edu/historyart/tang-song graph.

The FROM NAMED keywords make it possible to specify that some triple patterns only apply to a particular named graph, like this:

SELECT DISTINCT ?site
FROM NAMED <http://lod.vanderbilt.edu/historyart/tang-song>
WHERE {
  GRAPH <http://lod.vanderbilt.edu/historyart/tang-song> {
    ?site a geo:SpatialThing.
    }
}

or

SELECT DISTINCT ?site
FROM NAMED <http://lod.vanderbilt.edu/historyart/tang-song>
WHERE {
  GRAPH ?g {
    ?site a geo:SpatialThing.
    }
}

to say that the triple patterns could apply to any named graph.  So it seems like an alternative approach would be to explicitly specify that some triple patterns should apply to all named graphs and other triple patterns should apply to triples in the unspecified pot, something like this [7]:

SELECT DISTINCT ?site
FROM NAMED <http://lod.vanderbilt.edu/historyart/tang-song>
WHERE {
  ?location skosxl:prefLabel ?labelEntity.
  ?labelEntity skosxl:literalForm ?label. 
  {GRAPH ?g {
    ?site a geo:SpatialThing.
    ?site rdfs:label ?label.
    }
  }
}

However, as far as I can tell, whenever either the FROM or FROM NAMED keywords are used in a Blazegraph SPARQL query, the triples in the unspecified default graph pot are just forgotten.  There probably is some kind of workaround involving cloning the default graph triples to another "namespace" (see the Blazegraph Quick Start guide for more on what Blazegraph calls "namespaces") and then running a federated query across the new namespace and named graphs in the old namespace.  I haven't gotten desperate enough to try this yet.

If one were playing around with tiny little graphs, one could simply give the DROP ALL command from the UPDATE tab and start over by reloading the triples into appropriately named graphs.  However, given the two-day loading time for the Getty RDF dump, I'm not eager to do that.  I'll probably just upload my test graphs into the unnamed default graph and then try to delete their specific triples using a DELETE DATA command.

My take-home from this experience is that for production purposes, it's probably a good idea to always load triples as part of some named graph.  This dooms you to specifying FROM clauses in every query you write, but that's probably better than having to thrash around as I have with triples in two unlinkable pots.
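As a concrete illustration of that take-home, here is a small sketch of what managing a replaceable named graph could look like, using the standard SPARQL 1.1 Update operations DROP GRAPH and LOAD ... INTO GRAPH.  The endpoint URL is an assumption, and the file path and graph IRI are just the ones from the examples above.

import requests

# Local Blazegraph SPARQL endpoint (an assumption - adjust to your installation)
endpoint = "http://localhost:9999/blazegraph/sparql"

def sparql_update(command):
    """POST a SPARQL 1.1 Update command to the endpoint."""
    response = requests.post(endpoint,
                             data=command.encode("utf-8"),
                             headers={"Content-Type": "application/sparql-update"})
    response.raise_for_status()

graph_iri = "http://lod.vanderbilt.edu/historyart/tang-song"

# Replace the graph as a unit: drop whatever version is there, then load the new file into it
sparql_update("DROP SILENT GRAPH <%s>" % graph_iri)
sparql_update("LOAD <file:///c:/test/rdf/output/tang-song.ttl> INTO GRAPH <%s>" % graph_iri)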

3. Loading alternatives to be explored.  It is probably worthwhile to note that there are other ways in Blazegraph to load triples besides using the LOAD command via SPARQL 1.1 Update.  The Blazegraph UPDATE tab has other options in the dropdown besides SPARQL Update: "RDF Data" and "File Path or URL".  I'm not clear on how these behave differently or if they are likely to have different loading speeds on very large files.  The Blazegraph documentation is pretty sparse, which I guess is OK for a free product.  If you want hand-holding, you can pay for support.  It does make things a bit difficult for a beginner, however.


Querying the Getty TGN triples

Well, as usual, the technical details of being a Hogwarts kitchen manager have caused this blog post to expand to be much longer than I'd planned.  It's time for the Start-of-Term Feast!

The first query that I've shown below can actually be run by pasting it straight into the Getty Vocabularies SPARQL endpoint online: http://vocab.getty.edu/sparql
The prefix list contains more prefixes than are actually necessary for the query that follows, but I wanted it to contain all of the prefixes that were likely to be needed for related experimentation. 

We have been working on a dataset provided by Tracy Miller of Vanderbilt's Department of History of Art: the tang-song temple dataset [8].  In it, we have names of temple sites in Chinese characters and Latin transliterations.  We would like to associate each site with the Getty TGN identifier that corresponds to the temple site or to the city/village from which the temple site gets its name.  We could search for them using the Getty Vocabularies search facility, but those searches usually return many hits for sites that have the same name but are in the wrong province.  It is very labor-intensive to sort them out.

In our data, we have the province name in Chinese characters (e.g. 山西 for Shanxi).  In order to eliminate all of the matches to sites with the correct name but which are in the wrong province, we can use the query below.

PREFIX tgn: <http://vocab.getty.edu/tgn/>
PREFIX gvp: <http://vocab.getty.edu/ontology#>
PREFIX skosxl: <http://www.w3.org/2008/05/skos-xl#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX gvp_lang: <http://vocab.getty.edu/language/>
PREFIX att: <http://vocab.getty.edu/aat/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>

SELECT ?place ?strippedLiteral ?lat ?long 
WHERE {
  ?broaderLabel skosxl:literalForm "山西"@zh.
  ?broader skosxl:prefLabel|skosxl:altLabel ?broaderLabel.
  ?place gvp:broaderPreferred* ?broader.
  ?place skosxl:altLabel|skosxl:prefLabel ?label.
  ?label skosxl:literalForm ?literal.
  ?label skosxl:literalForm "Qinxian"@zh-latn.
  #  FILTER ( lang(?literal) = "zh-latn")
  ?place foaf:focus ?geoPlace.
  ?geoPlace geo:lat ?lat.
  ?geoPlace geo:long ?long.
  BIND (str(?literal) AS ?strippedLiteral)
  }

In the query, I match the Chinese province name to the literal form of the label entity bound to ?broaderLabel.  That label entity can be either a preferred label or an alternate label of the province.  I use the SPARQL property path feature ("*") to allow the province to be any number of broaderPreferred links above the site (?place).  I then try to match the literal form of either preferred or alternate labels for the site name.  In this example, for testing purposes I've hard-coded the province label to "山西"@zh and the site name label to "Qinxian"@zh-latn, which are values that I know from manual searches will work.  Once the variable ?place is constrained by the two labels, I have the query report the place IRI, name, and the latitude and longitude.  Tracy has GPS geocoordinates for buildings at the temple site, so the lat/long data from the query can be compared with her geocoordinates to distinguish among the resulting locations if more than one comes up within a particular province.

Here's what you get when you run the query:


The building geocoordinates at the Qinxian temple site were 36.7477, 112.5745, which is relatively close to both of these hits.  They will have to be examined manually to determine the one to which the link should be made.

If this were all that the query were going to do, there wouldn't be much point in the whole exercise of downloading the Getty TGN dump and loading it into Blazegraph.  The query could just be run on the Getty online endpoint.  But I don't want to have to manually enter the province and site names for all 122 sites.  Before uploading the RDF for Tracy's data, I went back and changed the tang-song RDF data dump so that all language tags were lowercase only.  I then loaded it into the default (unnamed) graph of Blazegraph along with the Getty RDF dump so that I could query the tang-song graph and the Getty TGN graph together.

Here's the query as I ran it finally:

SELECT ?strippedSiteLabel ?lat ?long
WHERE {
  ?site a geo:SpatialThing.
  ?site rdfs:label ?siteLabel.
  FILTER (lang(?siteLabel)="zh-latn-pinyin")
  BIND (str(?siteLabel) AS ?strippedSiteLabel)
  ?site dwc:stateProvince ?province.
  
  ?broaderLabel skosxl:literalForm ?province.
  ?broader skosxl:prefLabel|skosxl:altLabel ?broaderLabel.
  ?place gvp:broaderPreferred* ?broader.
  ?place skosxl:altLabel|skosxl:prefLabel ?label.
  ?label skosxl:literalForm ?literal.
  FILTER ( lang(?literal) = "zh-latn")
  BIND (str(?literal) AS ?strippedLiteral)
  FILTER (?strippedLiteral = ?strippedSiteLabel)
  ?place foaf:focus ?geoPlace.
  ?geoPlace geo:lat ?lat.
  ?geoPlace geo:long ?long.
  }
#  Limit 10

The first block of triple patterns locates the transliterated site label and the province name in simplified Chinese characters.  I had to strip the language tag from the site labels because Getty tags theirs as zh-latn, while mine are tagged zh-latn-pinyin.  I purposefully tagged the province names as zh so that they would match the tags used by Getty.  Clearly, having to deal with mismatched language tags is a pain.

The second block of triple patterns is a hack of the previous query.  Instead of hard-coding the province and site names as I did in the first query, they are now the variables bound in the first block of triple patterns.  I ran the query multiple times, adding one line at a time, and it was interesting to see how the running time changed.  Some changes, like adding filtering on language tags, increased the time a lot (I'm not sure why).  When all of the lines of the query had been added, it took about 30 seconds to run.  Here is what the results looked like:


There are a total of 122 temple sites, so clearly there are many that did not have any match.  The query could probably work better if REGEX were used to search for parts of strings to catch cases where a site name of ours like "Lingyansi" (=Lingyan Temple) would not match with variants like "Lingyan".  Still more work to do, but the general approach of leveraging the big Getty dump to reduce manual searching seems to work.

I still intend to load the GeoNames triples.  I started loading the one giant file overnight, then woke up the next morning to discover that Windows had decided that was the night to restart my computer (I hate you, Microsoft).  So I started it again over this weekend and after about 36 hours it still hadn't finished loading.  I need to use my work computer for "real" work, so I killed the load this morning so that it would stop slowing everything down.  I think I may write a little Python script to break the Geonames Ntriples file into multiple files (maybe 10?) and load them one at a time.  At least it would be easier to know when progress was being made!
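In case anyone wants to beat me to it, here is a sketch of what that splitting script might look like.  It assumes the converted N-Triples file from the earlier step; because N-Triples puts one complete triple on each line, the file can be split safely at any line boundary, and the chunk count is arbitrary.

# Split a big N-Triples file into smaller files that can be loaded one at a time
number_of_chunks = 10
input_path = "all-geonames.nt"

# First pass: count the lines so we know how big each chunk should be
with open(input_path, "r", encoding="utf-8") as infile:
    total_lines = sum(1 for _ in infile)
lines_per_chunk = total_lines // number_of_chunks + 1

# Second pass: write the chunks
with open(input_path, "r", encoding="utf-8") as infile:
    for chunk_number in range(number_of_chunks):
        chunk_path = "geonames-part-%02d.nt" % chunk_number
        with open(chunk_path, "w", encoding="utf-8") as outfile:
            for _ in range(lines_per_chunk):
                line = infile.readline()
                if not line:
                    break
                outfile.write(line)
        print("wrote", chunk_path)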

I'm feeling more and more sympathy for the Hogwarts kitchen manager...

-----------------------------------------------------------------------------------------------
[1] Strictly speaking, I should probably say "IRI" here, but as a practical matter, I'm considering "URI" and "IRI" to be interchangeable.
[2] Discussed in my blog post "Confessions of an RDF agnostic, part 3: How a client learns"
[3] "Shiny new toys 3: Playing with Linked Data"
[4] "Stress testing the Stardog reasoner"
[5] OpenRefine was able to open the Getty file, but it hung when I tried to parse it as RDF/N3 and crashed with an OutOfMemoryError when I tried to save the project in text format.  I bumped the memory allocation from 1 Gb to 4 Gb - the maximum recommended for an 8 Gb system, but that didn't help and it still produced an "Unknown error, No technical details" error.
[6] It's worth noting that Callimachus manages sets of triples by keeping track of the names of the files that were uploaded into the triplestore, rather than by a named graph URI.  There's a typical upload dialog where you select the file you want to upload, and it will ask whether you want to replace the file if one with that name already exists.  You can also delete a set of triples by telling it to delete that file.
[7] See Bob DuCharme's Learning SPARQL p. 82-83 for similar examples.
[8] Described in a recent blog post Guid-O-Matic goes to China

Comments:

  1. I have now deleted the dataset to start over. It took 11.5 hours to delete the 134 million triples using the SPARQL Update command DROP ALL.
