Steve Baskauf's blog: February 2017

This post is of a rather technical nature and is directed at people who are serious about setting up and experimenting with the management of SPARQL endpoints. If you want to get an abbreviated view, you could probably read down through the "Graphs: named and otherwise" section, then skip down to "Names of unnamed graphs (review)" and read through the end.

Background

This semester, the focus of our Linked Data and Semantic Web working group [1] at Vanderbilt has been to try to move from Linked Data in theory, to Linked Data in reality. Our main project has been to document the process of moving Veronica Ikeshoji-Orlati's dataset on ancient vases from spreadsheet to Linked Data dataset. We have continued to make progress on moving Tracy Miller's data on images of Chinese temple architecture from a test RDF graph to production.

One of the remaining problems we have been facing is setting up a final production triplestore and SPARQL endpoing to host the RDF data that we are producing. The Vanderbilt Heard Library has maintained a Callimachus-based SPARQL endpoint at http://rdf.library.vanderbilt.edu/ since the completion of Sean King's Dean's Fellow project in 2015. But Callimachus has some serious deficiencies related to speed (see this post for details) and we had been considering replacing it with a Stardog-based endpoint. However, the community version of Stardog has a limit of 25 million triples per database, and we could easily go over that by loading either the Getty Thesaurus of Geographic Names or the GeoNames dataset, both of which contain well over 100 million triples. So we have been considering using Blazegraph (the graph database used by Wikidata), which has no triple limit. I have already reported on my experience with loading 100+ million triples into Blazegraph in an earlier post. One issue that became apparent to me through that exercise was that appropriate use of named graphs would be critical to effective maintenance of a production triplestore and SPARQL endpoint. It also became apparent that my understanding of RDF graphs in the context of SPARQL was too deficient to make further progress on this front. This post is a record of my attempt to remedy that problem.

Assumptions

This post assumes that you have a basic understanding of RDF, the Turtle serialization of RDF, and the SPARQL query language. There are many good sources of information about RDF in general - the RDF 1.1 Primer is a good place to start. For a good introduction to SPARQL, I recommend Bob DuCharme's Learning SPARQL.

Graphs: named and otherwise

One of the impediments to understanding how graphs interact with SPARQL is understanding the terminology used in the SPARQL specification. Please bear with me as I define some of the important terms needed to talk about graphs in the context of SPARQL. The technical documentation defining these terms is the SPARQL 1.1 Query Language W3C Recommendation, Section 13.

In the abstract sense, a graph defines the connections between entities, with the entities represented as nodes and the connections between them represented as arcs (also known as edges). The RDF data model is graph-based, and a graph in RDF is described by triples. Each triple describes the relationship between two nodes connected by an arc. Thus, in RDF a graph can be defined as a set of triples.

I used three graphs for the tests I'll be describing in this post. The first graph contains 12 triples and describes the Chinese temple site Anchansi and some other things related to that site. The full graph in Turtle serialization can be obtained at this gist, but two of the triples are shown in the diagram above. As with any other resource in RDF, a graph can be named by assigning it a URI. In this first graph, I've chosen not to assign it an identifying URI. I will refer to this graph as the "unnamed graph".

The second graph contains 18 triples and describes the temple site Baitaisi. The graph in Turtle serialization is at this gist, and two triples from the graph are shown in the diagram above. I have chosen to name the second graph by assigning it the URI <http://tang-song/baitaisi>. You should note that although the URI denotes the graph, it isn't a URL that "does" something. There is no web page that will be loaded if you put the URI in a browser. That is totally fine - the URI is really just a name. I'll refer to this graph by its URI - it is an example of a named graph.

A third graph about the Chinese temple site Baiyugong is here. I'll refer to it from time to time by its URI <http://tang-song/baiyugong>.

In the context of SPARQL, an RDF dataset is a collection of graphs. This collection of graphs will be loaded into some kind of data store (which I will refer to as a "triplestore"), where it can be queried using SPARQL. There may be many graphs in a triple store and SPARQL can query any or all of them.

In a SPARQL query, the default graph is the set of triples that is queried by default when graph patterns in the query are not restricted to a particular named graph. There is always a default graph in an RDF dataset. However, that graph may include an unnamed graph, the merge of one or more named graphs, or it may be an empty graph (a graph containing no triples).

A dataset may also include named graphs whose triples are searched exclusively when that graph is specified by name.

Aside on the three tested endpoints: setup and querying

This section of the post is geared towards those who want to try any of these experiments themselves, who want to work towards setting up one of the three systems as a functioning triple store/SPARQL endpoint, or who just want to have a better understanding how the query interface works. If you don't care about any of those things, you can skip to the next section.

Each of the three systems can be downloaded and set up for free. I believe that all three can be set up on Windows, Mac, and Linux, although I have only set them up on Windows.

Callimachus can be downloaded from here as a .zip bundle. After downloading, the Getting Started Guide has straightforward installation instructions. After the setup script is complete, you will need to set up a local administrator account using a one-time URL. If the process fails, if you can't login, or if you destroy the installation (which I will tell you how to do later), you can delete the entire directory into which you unzipped the archive, unzip it again, and repeat the installation steps. You can't re-use the account setup URL a second time.

To download Stardog, go to http://stardog.com/ and click on the Download button. Unless you know you want to use the Enterprise version, select the Stardog Community version. Unfortunately, it has been a while since I installed Stardog, so I can't remember the details. However, I don't remember having any problems when I followed the Quick Start Guide. In order to avoid having to set the STARDOG_HOME environmental variable every time I wanted to use Stardog, I made the following batch file in my user directory:

set STARDOG_HOME=C:\stardog-home
C:\stardog-4.0.3\bin\stardog-admin.bat server start

where the stardog-4.0.3\bin is the directory where the binaries were installed. To start the server, I just run this batch file from a command prompt. Stardog ships with a default superuser account "admin" with the password "admin", which is fine for testing on your local machine.

To download Blazegraph, go to https://www.blazegraph.com/ and click the download button. The executable is a Java .jar file that is invoked from the command line, so there is basically no installation required. Blazegraph has a Quick Start guide as a part of its wiki, although the wiki in general is somewhat minimal and does not have much in the way of examples. For convenience, I put the .jar file in my user directory and put that single command line into a batch file so that I can easily start Blazegraph by invoking the batch file. There isn't any user login required to use Blazegraph - read-only access is set up by settings in the installation. I've read about this on the developer's email list, but not really absorbed it.

[Note added 2017-09-19: I had occasion to reinstall Blazegraph on another computer and was reminded of an issue that has to be resolved before Blazegraph will work on Windows (at least Win10). The issue is described on the developer's discussion list. After running the first query or update, all subsequent queries or updates fail with a java.io.IOException error. The fix involves opening the blazegraph.jar file with an application that can open zip files (like 7zip). Extract the RWStore.properties configuration file and open it in a text editor. Add the line:

com.bigdata.rwstore.RWStore.readBlobsAsync=false

and save the file. Replace the old RWStore.properties file in the .jar file with the modified one and Blazegraph will work properly. Why this simple problem hasn't been fixed by the developers in the year since I last installed Blazegraph is beyond me. I guess most users don't install on Windows.]

So what exactly is happening when you start up each of these applications from the command line? You'll get some kind of message saying that the software has started, but you won't get any kind of GUI interface to operate the software. That's because what you are actually doing is starting up a web server that is running on your local computer ("localhost"), and is not actually connected to any outside network. By default, each of the three applications allows you to access the local server endpoint through a different port (Callimachus = port 8080, Stardog = port 5820, and Blazegraph = port 9999), so you can run all three at once if you want. If you wanted to operate one of the applications as an external server, you would change the port to something else (probably port 80).

So what does this mean? As with most other Web interactions, the communication with each of these localhost servers can take place through HTTP-mediated communication. SPARQL stands for "SPARQL Protocol and RDF Query Langage" - the "Protocol" part means that a part of the SPARQL Recommendation describes the language by which communication with the server takes place. The user sends a command via HTTP to the address of the server endpoint, coded using the SPARQL protocol and the server sends a response back to the user in the format (XML, JSON) that the user requests. If you enjoy such gory details, you can use cURL, Postman, or Advanced Rest Client to send raw queries to the localhost endpoint and then dissect the response to figure out what it means. Most people are are going to be way to lazy to do this for testing purposes.

Because it's a pain to send and receive raw HTTP, each of the three platforms provides a web interface that mediates between the human and the endpoint. The web interface is a web form that allows the human user to type the query into a box, then click a button to send the query to the endpoint. The code in the web page properly encodes the query, sends it to the localhost endpoint, receives the response, then decodes the response into a tabular form that is easier for a human to visualize than XML or JSON. The web form makes it easy to interact with the endpoint for the purpose of developing and testing queries.

However, when the endpoint is ultimately put into production, the sending of queries and visualization of the response would be handled not by a web form, but by Javascript in web pages that make it possible for the end users to interact with the endpoint using typical web controls like dropdowns and buttons without having to have any knowledge of writing queries. To see how this kind of interaction works, open the test Chinese Temple website at http://bioimages.vanderbilt.edu/tang-song.html using Chrome. Click on the options button in the upper right corner of the browser and select "More tools" then "Developer tools". Click on the Network tab and you can watch how the web page interacts with the endpoint. Clicking on any of the "sparql?query=..." items, then the "header" tab on the right shows the queries that are being sent to the endpoint. Clicking on "response" tab on the right shows the response of the endpoint. This response is used by the Javascript in the web page to build the dropdown lists and the output at the bottom of the page.

In the rest of this post, I will describe interactions with the localhost endpoint through the web form interface, but keep in mind that the same queries and commands that we type into the box could be sent directly to the endpoint from any other kind of application (Javascript in a web page, desktop application, smartphone app) that is capable of communicating using HTTP.

Opening and using the web form interfaces

Each of the three applications has a similar web form interface, although the exact behavior of each interface varies. There are actually two ways to interact with the server: through a SPARQL query (a read operation) and through a SPARQL Update command (a write operation). The details of these two kinds of interactions are given for each of the applications.

Callimachus

To load the Callimachus web form interface after starting the server, paste the URL

http://localhost:8080/sparql?view

into the browser address box. If everything is working, the Callimachus interface will look something like this:

Both queries and update commands are pasted into the same box. However, to make a query, you must click the "Evaluate Query" button. To give an update command, you must click the "Execute Update" button. After evaluating a query, you will be taken to another page where the response is displayed. Hitting the back button will take you back to the query page with the query still intact in the box. After executing an update, the orange button will "gray out" while the command is being executed and turn orange again when it is finished. No other indication is given that the command was executed.

Namespace prefixes must be explicitly defined in the text of the box. However, once prefixes are used, Callimachus "remembers" them, so it isn't necessary to re-define them with every query.

Stardog

To load the Stardog web form interface after starting the server, paste the URL

http://localhost:5820/myDB#!/query

into the browser address box. If everything is working, the Stardog interface will look something like this:

Stardog does not differentiate between queries and update commands. Both are typed into the same box and the "Execute" button is used to initiate both. Query results will be given in the "Results" area at the bottom of the screen. Successful update commands will display "True" in the Results area.

Commonly used prefixes that appear in the Prefixes: box don't have to be explicitly typed in the text box. Additional pre-populated prefixes can be added in the Admin Console.

Blazegraph

To load the Blazegraph web form interface after starting the server, paste the URL

http://localhost:9999/blazegraph/#query

into the browser address box. If everything is working, the Blazegraph query interface will look something like this:

Only queries can be pasted into this box. Well-known namespace abbreviations can be inserted into the box using the "Namespace shortcuts" dropdowns above the box. If the query executes successfully, the results will show up in the space below the Execute button. The page also maintains a record of the past queries that have been run. They are hyperlinked, and clicking on them reloads the query in the box.

To perform a SPARQL Update, the UPDATE tab must be selected. That generates a different web form that looks like this:

There are several ways to interact with this page. For now, the "Type:" dropdown should be set for "SPARQL Update". A successful update will show a COMMIT message at the bottom of the screen. The "mutationCount" gives an indication of the number of changes made; in this example 10 triples were added to the triplestore, so the mutationCount=10.

The SPARQL Update "nuclear option": DROP ALL

One important question in any kind of experimentation is: "What do I do if I've totally screwed things up and I just want to start over?" In Stardog and Blazegraph, the answer is the SPARQL Update command "DROP ALL". Executing DROP ALL causes all of the triples in all of the graphs in the database to be deleted. You have a clean slate and an empty triplestore ready to start afresh. Obviously, you don't want to do this if you've spent hours loading hundreds of millions of triples into your production triplestore. But in the type of experiments I'm running here, it's a convenient way to clear things out for a new experiment.

However, you NEVER, NEVER, NEVER want to issue this command in Callimachus. You will understand why later in this post, but for now I'll just say that the best case scenario is that you will be starting over with a clean install of Callimachus if you do it. Instead of DROP ALL, you should drop each graph individually. We will see how to do that below.

Putting a graph into the triplestore (finally)

All of these preliminaries were to get us to the point of being ready to load a graph into the triplestore. In each of the three applications, there are multiple ways to accomplish this task, and many of those ways differ among the applications. However, loading a graph using SPARQL Update works the same on all three (the beauty of W3C standards!), so that's how we will start.

If you want to get try to achieve the same results as are shown in the examples here, save the three example files from my Gists: test-baitaisi.ttl, test-baiyugong.ttl, and test-unnamed.ttl. Put them somewhere on your hard drive where they will have a short and simple file path.

Using the SPARQL Update web form of the application of your choice, type a command of this form:

LOAD <file:///c:/Dropbox/tang-song/test-baitaisi.ttl> INTO GRAPH <http://tang-song/baitaisi>

This command contains two URIs within angle brackets. The second URI is the name that I want to use to denote the uploaded graph. I'll use that URI any time I want to refer to the graph in a query. Recall that this URI is just a name and doesn't have to actually represent any real URL on the Web. The first URI in the LOAD command is a URL - it provides the means to retrieve a file. It contains the path to the test-baitaisi.ttl file that you downloaded (or some other file that contains serialized RDF). The triple slash thing after "file:" is kind of weird. The "host name" would typically go between the second and third slashes, but on your local computer it can be omitted - resulting in three slashes in a row. (I think you can actually use "file://localhost/c:/..." but I haven't tried it.) The path can't be relative, so in Windows, the full path starting with the drive letter must be given. I have not tried Mac and Linux, but see the answer to this stackoverflow question for probably path forms. If the path is wrong or the file doesn't exist, an error will be generated.

Execute the update by clicking on the button. How do we know that the graph is actually there? Here is a SPARQL query that can answer the question:

select distinct ?g where {
graph ?g {?s ?p ?o}
}

If you are using Blazegraph, you'll have to switch from the Update tab to the Query tab before pasting the query into the box. Execute the query, and the results in Stardog and Blazegraph should show the URI that you used to name the graph that you just uploaded: http://tang-song/baitaisi .

The results in Callimachus are strange. You should see //tang-song/baitaisi in the list, but there are a bunch of other graphs in the triplestore that you never put there. These are graphs that are needed to make Callimachus operate. Now you can understand why using the DROP ALL command has such a devastating effect in Callimachus. The command DROP ALL is faithfully executed by Callimachus and wipes out every graph in the triplestore, including the ones that Callimachus needs to function. The web server continues to operate, but it doesn't actually "know" how to do anything and fails to display any web page of the interface. Why Callimachus allows users to execute this "self-destruct" command is beyond me!

The graceful way to get rid of your graph in Callimachus is to drop the specific graph rather than all graphs, using this SPARQL Update command:

DROP GRAPH <http://tang-song/baitaisi>

This will leave intact the other graphs that are necessary for the operation of Callimachus.

Specifying the role of a named graph

For the examples in this section, you should load the test-baitaisi.ttl and test-baiyugong.ttl files from your hard drive into the triplestore(s) using the SPARQL Update LOAD command as shown in the previous section, naming them with the URIs http://tang-song/baitaisi and http://tang-song/baiyugong respectively.

The FROM clause is used to specify that triples from particular named graphs should be used as the default graph. There can be more than one named graph specified - the default graph is the merge of triples from all of the specified named graphs [2]. For example, the query

PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

SELECT DISTINCT ?site
FROM <http://tang-song/baitaisi>
FROM <http://tang-song/baiyugong>
WHERE {
?site a geo:SpatialThing.
}

designates that the default graph should be composed of the merge of the two graphs we loaded. The graph pattern in the WHERE clause is applied to all of the triples in both of the graphs (i.e. the default graph). Running the query returns the URIs of both sites represented in the graphs:

<http://lod.vanderbilt.edu/historyart/site/Baitaisi>
<http://lod.vanderbilt.edu/historyart/site/Baiyugong>

The FROM NAMED clause is used to say that a named graph is part of the RDF dataset, but that a graph pattern will be applied to that named graph only if it is specified explicitly using the GRAPH keyword. If we wrote the query like this:

PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

SELECT DISTINCT ?site
FROM <http://tang-song/baitaisi>
FROM NAMED <http://tang-song/baiyugong>
WHERE {
?site a geo:SpatialThing.
}

we only get one result:

<http://lod.vanderbilt.edu/historyart/site/Baitaisi>

because we didn't specify that the graph pattern should apply to the http://tang-song/baiyugong named graph. In this query:

PREFIX schema: <http://schema.org/>

SELECT DISTINCT ?building
FROM <http://tang-song/baitaisi>
FROM NAMED <http://tang-song/baiyugong>
WHERE {
GRAPH <http://tang-song/baiyugong> {
?building a schema:LandmarksOrHistoricalBuildings.
}
}

only buildings described in the <http://tang-song/baiyugong> graph are returned:

<http://lod.vanderbilt.edu/historyart/site/Baiyugong#Houdian>
<http://lod.vanderbilt.edu/historyart/site/Baiyugong#Sanxiandian>
<http://lod.vanderbilt.edu/historyart/site/Baiyugong#Shanmen>
<http://lod.vanderbilt.edu/historyart/site/Baiyugong#Zhengdian>

More complicated queries can be constructed, like this:

PREFIX schema: <http://schema.org/>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

SELECT DISTINCT ?site ?building
FROM <http://tang-song/baitaisi>
FROM NAMED <http://tang-song/baiyugong>
WHERE {
?site a geo:SpatialThing.
GRAPH <http://tang-song/baiyugong> {
?building a schema:LandmarksOrHistoricalBuildings.
}
}

where sites in the default <http://tang-song/baitaisi> graph are bound, but buildings in the specified <http://tang-song/baiyugong> graph are bound.

Using FROM and FROM NAMED clauses in a query make it very clear what graphs should be considered for matching with graph patterns in the WHERE clause.

What happens if we load a graph without a name?

It is possible to load a graph into a triplestore without giving it a name, as in this SPARQL Update command:

LOAD <file:///c:/Dropbox/tang-song/test-unnamed.ttl>

Assume that this file has been loaded along with the previous two named graphs. What would happen if we ran this query:

PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

SELECT DISTINCT ?site
WHERE {
?site a geo:SpatialThing.
}

When no named graph is specified using a FROM clause, the endpoint applies the query to "the default graph". The problem is that the SPARQL specification is not clear how the default graph should be constructed. Section 13 says "An RDF Dataset comprises one graph, the default graph, which does not have a name, and zero or more named graphs...", which implies that triples loaded into the store without specifying a graph URI will become part of the default graph. This is also implied in Example 1 in Section 13.1, which shows the "Default graph" as being the one without a name. However, it is also clear that "default graph" cannot be synonymous with "unnamed graph", since the FROM clause allows named graphs to be specified as the default graph. So what happens when we run this query?

On Stardog, the graph pattern binds only a single URI for ?site:

<http://lod.vanderbilt.edu/historyart/site/Anchansi>

This is the site described by the unnamed graph I loaded. However, running the query on Blazegraph and Callimachus produces this result:

<http://lod.vanderbilt.edu/historyart/site/Baitaisi>
<http://lod.vanderbilt.edu/historyart/site/Baiyugong>
<http://lod.vanderbilt.edu/historyart/site/Anchansi>

which are the URIs for the sites described by the unnamed graph and both of the named graphs!

This behavior is somewhat disturbing, because it means that the same query, performed on the same graphs loaded into triplestores using the same LOAD commands do NOT produce the same results. The results are implementation-specific.

Construction of the dataset in the absence of FROM and FROM NAMED

The GraphDB documentation sums up the situation like this:

The SPARQL specification does not define what happens when no FROM or FROM NAMED clauses are present in a query, i.e., it does not define how a SPARQL processor should behave when no dataset is defined. In this situation, implementations are free to construct the default dataset as necessary.

In the absence of FROM and FROM NAMED clauses, GraphDB constructs the dataset's default graph in the same way as Callimachus and Blazegraph: by merging the database's unnamed graph and all named graphs in the database.

In the absence of FROM and FROM NAMED clauses, all of the applications include all named graphs in the dataset, allowing graph patterns to be applied specifically to them using the GRAPH keyword. So in the case of this query:

PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

SELECT DISTINCT ?site
WHERE {
GRAPH ?g {?site a geo:SpatialThing.}
}

we would expect the results to include ?site URI bindings from the two named graphs:

<http://lod.vanderbilt.edu/historyart/site/Baitaisi>
<http://lod.vanderbilt.edu/historyart/site/Baiyugong>

which we do. However, Callimachus and Blazegraph also include:

<http://lod.vanderbilt.edu/historyart/site/Anchansi>

in the results, indicating that they consider the empty graph to also bind to ?g (Stardog does not).

Construction of the dataset when FROM and FROM NAMED clauses are present

If we run this query:

PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

SELECT DISTINCT ?site
FROM <http://tang-song/baitaisi>
WHERE {
?site a geo:SpatialThing.
}

we get the same result on all three platforms - only the site URI <http://lod.vanderbilt.edu/historyart/site/Baitaisi> from the named graph that was specified in the FROM clause. This should be expected, since Section 13.2 of the SPARQL 1.1 specification says

A SPARQL query may specify the dataset to be used for matching by using the FROM clause and the FROM NAMED clause to describe the RDF dataset. If a query provides such a dataset description, then it is used in place of any dataset that the query service would use if no dataset description is provided in a query.

The bolded text (my emphasis) is suitably vague about what would be included in the dataset in the absence of FROM and FROM NAMED clauses (i.e. the default graph). Section 13.2 also says

If there is no FROM clause, but there is one or more FROM NAMED clauses, then the dataset includes an empty graph for the default graph.

that is, if only FROM NAMED clauses are included in the query, unnamed graph(s) will NOT be used as the default graph, since the default graph is required to be empty.

Querying the entire triplestore using Stardog

The way that Stardog constructs datasets is problematic since there is no straightforward way to include all unnamed and named graphs (i.e. all triples in the store) in the same query. The following query is possible:

PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

SELECT DISTINCT ?site
WHERE {
{?site a geo:SpatialThing.}
UNION
{GRAPH ?g {?site a geo:SpatialThing.}}
}

but complicated, since the desired graph pattern has to be repeated twice in the query. The first graph pattern binds matching triples in the unnamed graph, and the second graph pattern binds matching triples in all of the named graphs.

What is the name of an unnamed graph?

Previously, we saw that that the query

SELECT DISTINCT ?g WHERE {
GRAPH ?g {?s ?p ?o}
}

could be used to ask the names of graphs that were present in the triplestore. Let's find out what happens when we run this query on the three triplestores with the unnamed graph and the two named graphs loaded. Stardog gives this result:

http://tang-song/baitaisi
http://tang-song/baiyugong

which is not surprising since we saw that an early query using GRAPH ?g bound only to the two named graphs. Running the query on Blazegraph produces this somewhat surprising result:

<http://tang-song/baitaisi>
<http://tang-song/baiyugong>
<file:/c:/Dropbox/tang-song/test-unnamed.ttl>

We see the two named graphs, but we also see a URI for the unnamed graph. In the absence of an assigned URI from the LOAD command, Blazegraph has assigned the graph a URI that is almost the file URI (only one slash after file: instead of three). This might explain why all three graphs (including the unnamed graph) bound to the ?g in an earlier Blazegraph query. However, it does not explain that same behavior in Callimachus, since in Callimachus the current query only lists the two named graphs (besides the many Callimachus utility graphs that make the thing run).

Loading graphs using the GUI

Each of the three platforms I've been testing also provide a means to load files into the store using a graphical user interface instead of the SPARQL Update LOAD command.