Sunday, July 2, 2017

How (and why) we set up a SPARQL endpoint



Recently I've been kept busy trying to finish up the task of getting the Vanderbilt Semantic Web Working Group's SPARQL endpoint up and running.  I had intended to write a post about how we did it so that others could replicate the effort, but as I was pondering what to write, I realized that it is probably important for me to begin by explaining why we wanted to set one up in the first place.

Why we wanted to set up a SPARQL endpoint

It is tempting to give a simple rationale for our interest in a SPARQL endpoint: Linked Data and the Semantic Web are cool and awesome, and if you are going to dive into them, you need to have a SPARQL endpoint to expose your RDF data.  If you have read my previous blog posts, you will realize that I've only been taking little sips of the Semantic Web Kool-Aid, and have been reluctant to drink the whole glass.  There have been some serious criticisms of RDF and the Semantic Web, with claims that it's too complicated and too hard to implement.  I think that many of those criticisms are well-founded and that the burden is on advocates to show that they can do useful things that can't be accomplished with simpler technologies.  The same test should be applied to SPARQL: what can you do with a triple store and SPARQL endpoint that you can't do with a conventional server, or a different non-RDF graph database system like Neo4J?

It seems to me that there are two useful things you can get from a triplestore/SPARQL endpoint that you won't necessarily get elsewhere.  The first is the ability to dump data from two different sources into the same graph database and immediately be able to query the combined graph without any further work or data munging.  I don't think that is straightforward with alternatives like Neo4J.  This is the proverbial "breaking down of silos" that Linked Data is supposed to accomplish.  Successfully merging the graphs of two different providers and deriving some useful benefit from doing so depends critically on those two graphs sharing URI identifiers that allow useful links to be made between them.  Providers have been notoriously bad about re-using other providers' URIs, but given the emergence of useful, stable, and authoritative sources of RDF data about URI-identified resources (like the Library of Congress, GeoNames, the Getty Thesauri, ORCID, and others), this problem has been getting better.
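
To make that first benefit concrete, here is a hypothetical query (the data, predicates, and ORCID iD are invented for illustration) that joins bibliographic data from one provider with specimen data from another through a shared ORCID URI for a person:

SELECT ?articleTitle ?specimenLabel WHERE {
  ?article <http://purl.org/dc/terms/creator> <https://orcid.org/0000-0002-1825-0097> .
  ?article <http://purl.org/dc/terms/title> ?articleTitle .
  ?specimen <http://purl.org/dc/terms/contributor> <https://orcid.org/0000-0002-1825-0097> .
  ?specimen <http://www.w3.org/2000/01/rdf-schema#label> ?specimenLabel .
}

A join like this works as soon as both datasets are loaded, with no advance coordination between the providers beyond their use of the same URI for the same person.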

The second useful thing that you can get from a triplestore/SPARQL endpoint is essentially a programmable API.  Conventional APIs abound, but generally they support only particular search methods with a fixed set of parameters that can be used to restrict the search.  SPARQL queries are generic: if you can think of a way that you want to search the data and know how to construct a SPARQL query to do it, you can carry out any kind of search that you can imagine.  With a conventional API, if the method or parameter that you need to screen the results doesn't exist, you have to get a programmer to add it to the API.  With SPARQL, you do your own "programming".

These two functions have influenced the approach that we've taken with our SPARQL endpoint.  We wanted to create a single Vanderbilt triplestore because we wanted to be able to merge into it shared information resources to which diverse users around campus might want to link.  It would not make sense to create multiple SPARQL endpoints on campus, which could result in duplicated data aggregation efforts.  We also want the endpoint to have a stable URI that is independent of any department or unit within the institution.  If people build applications that use the endpoint as an API, we don't want to break those applications by changing the subdomain or subpath of the URI, or to have the endpoint disappear the way a user's own endpoint might once they finished a project or got bored with it.


Step 1: Figuring out where to host the server

For several years, we've had an implementation of the Callimachus triplestore/SPARQL endpoint set up on a server operated by the Jean and Alexander Heard Library at Vanderbilt (http://rdf.library.vanderbilt.edu).  As I've noted in earlier posts, Callimachus has some serious deficiencies, but since that SPARQL endpoint was working and the technical support we've gotten from the library has been great, we weren't in too big of a hurry to move it somewhere else.  Based on a number of factors, we decided that we would rather have an installation of Blazegraph, and since we wanted this to be a campus-wide resource, we began discussions with Vanderbilt IT services (ITS) about how to get Blazegraph installed on a server that they supported.  There was no real opposition to doing that, but it became apparent that the greatest barrier to making it happen was getting people at ITS to understand what a triplestore and SPARQL endpoint were, why we wanted one, and how we planned to use it.  Vanderbilt's ITS operates on a model where support is provided to administrative entities who bear the cost of that support, and where individuals are accountable to defined administrators and units.  Our nebulous campus-wide resource didn't really fit well in that model, and it wasn't clear who exactly should be responsible for helping us.  Emails went unanswered and actions weren't taken.  Eventually, it became apparent that if we wanted to get the resource up in a finite amount of time, our working group would have to make it happen ourselves.

Fortunately, we had a grant from the Vanderbilt Institute for Digital Learning (VIDL) that provided us with some money that we could use to set our system up on a commercial cloud server.  We received permission to use the subdomain sparql.vanderbilt.edu, so we started the project with the assumption that we could get a redirect from that subdomain to our cloud server when the time was right.  We decided to go with a Jelastic-based cloud service, since that system provided a GUI interface for managing the server.  There were cheaper systems that required doing all of the server configuration via the command line, but since we were novices, we felt that paying a little more for the GUI was worth it.  We ended up using ServInt as a provider for no particularly compelling reason other than group members had used it before.  There are many other options available.

Step 2: Setting up the cloud server

Since ServInt offered a 14-day free trial account, there was little risk in playing around with the system at the start.  Eventually, we set up a paid account.  Since the cost was being covered by an institutional account, we did not want the credit card to be charged automatically when our initial $50 ran out.  Unfortunately, the account portal was a bit confusing - it turns out that turning off AutoPay does not actually prevent an automatic charge to the credit card.  So when our account got down to an alert level of $15, we got hit with another $50 charge.  It turns out that the real way to turn off automatic charges is to set the autofunding level to $0.  Lesson learned - read the details.  Fortunately, we were going to buy more server time anyway.

The server charges are based on units called "cloudlets".  Cloudlet usage is based on a combination of the amount of space taken up by the installation and the amount of traffic handled by the server.  One cloudlet is 128 MiB of RAM and 400 MHz of CPU.  The minimum number of reserved cloudlets per application server is one, and you can increase the upper scaling limit to whatever you want.  A higher limit means you could pay at a greater rate if there is heavy usage.  Here's what the Jelastic Environment Topology GUI looks like:


The system that we eventually settled on uses an Nginx front-end server to handle authentication and load balancing, and a Tomcat back-end server to actually run the Blazegraph application.  The primary means of adjusting for heavy usage is to increase the "vertical" scaling limit (maximum resources allocated to a particular Tomcat instance).  If usage were really high, I think you could increase the horizontal scaling by creating more than one Tomcat instance.  I believe that in that case the Nginx server would balance the load among the multiple Tomcat servers.  However, since our traffic is generally nearly zero, we haven't really had to mess with that.  The only time that system resources have gotten tight was when we were loading files with over a million triples.  But that was only a short-term issue that lasted a few minutes.  At our current usage, the cost to operate the two-server combination is about $2 per day.


The initial setup of Blazegraph was really easy.  In the GUI control panel (see above), you just click "Create environment", then click the Create button to create the Tomcat server instance.  When you create the environment, you have to select a name for it that is unique within the jelastic.servint.net subdomain.  The name you choose will be the subdomain of your server.  We chose "vuswwg", so the whole domain name of our server was vuswwg.jelastic.servint.net .  What we chose wasn't really that important, since we were planning eventually to redirect to the server from sparql.vanderbilt.edu .

To load Blazegraph, go to the Blazegraph releases page and copy the WAR Application download link, e.g. https://github.com/blazegraph/database/releases/download/BLAZEGRAPH_RELEASE_2_1_4/blazegraph.war .  On the control panel, click on Upload and paste in the link you copied.  On the Deployment manager list, we selected "vuswwg" from the Deploy to... dropdown next to the blazegraph.war name.  On the subsequent popup, you will create a "context name".  The context name serves as the subpath that will refer to the Blazegraph application.  We chose "bg", so the complete path for the Blazegraph web application was: http://vuswwg.jelastic.servint.net/bg/ .  Once the context was created, Blazegraph was live and online.  Putting the application URL into a browser brought up the generic Blazegraph web GUI.  Hooray!

The process seemed too easy, and unfortunately, it was.  We had Blazegraph online, but there was no security and the GUI web interface as installed would have allowed anyone with Internet access to load their data into the triplestore, or to issue a "DROP ALL" SPARQL Update command to delete all of the triples in the store.  If one were only interested in testing, that would be fine, but for our purposes it was not.

Step 3: Securing the server

It became clear to us that we did not have the technical skills to bring the cloud server up to the necessary level of security.  Fortunately, we were able to enlist the help of developer Ken Polzin, with whom I'd worked on a previous Bioimages project.  Here is an outline of how Ken set up our system (more detailed instructions are here).

We installed an Nginx server in a manner similar to what was described above for the Tomcat installation.  Since the Nginx server was going to be the outward-facing server, Public IPv4 needed to be turned on for it, and we turned it off on the Tomcat server.

There were several aspects of the default Nginx server configuration that needed to be changed in order for the server to work in the way that we wanted.  The details are on this page.  One change redirected from the root of the URI to the /bg/ subpath where Blazegraph lives.  That allows users to enter https://sparql.vanderbilt.edu/ and be redirected to https://sparql.vanderbilt.edu/bg/ , or https://sparql.vanderbilt.edu/sparql and be redirected to https://sparql.vanderbilt.edu/bg/sparql .  We wanted this behavior so that we would have a "cool URI" that was simple and did not include implementation-specific information (i.e. "bg" for Blazegraph).  Another change to the configuration facilitated remote calls to the endpoint by enabling cross-origin resource sharing (CORS).

The other major change to the Nginx configuration was related to controlling write access to Blazegraph.  We accomplished that by restricting unauthenticated users to HTTP GET access.  Methods of writing to the server, such as SPARQL Update commands, require HTTP POST.  Changes we made to the configuration file required authentication for any non-GET calls.  Unfortunately, that also meant that regular SPARQL queries could not be requested by unauthenticated users using POST, but that is only an issue if the queries are very long and exceed the character limit for GET URIs.

Authentication for administrative users was accomplished using passwords encrypted using OpenSSL.  Directions for generating and storing the passwords are here.  The authentication requirements were specified in the Nginx configuration file as discussed above.  Once authentication was in place, usernames and passwords could be sent as part of the POST request.  Programming languages and HTTP GUIs such as Postman have built-in mechanisms to support Basic Authentication.  Here is an example using Postman:



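For scripted access rather than Postman, a jQuery call along the following lines should work.  This is a minimal sketch; the update command, username, and password are just placeholders:

// a hypothetical authenticated POST of a SPARQL Update command
var update = 'DROP GRAPH <http://example.org/scratch>';
$.ajax({
    type: 'POST',
    url: 'https://sparql.vanderbilt.edu/sparql',
    data: update,
    contentType: 'application/sparql-update',
    headers: {
        // Basic Authentication: base64-encoded "username:password"
        Authorization: 'Basic ' + btoa('username:password')
    },
    success: function(response) {
        console.log(response);
    }
});
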
Related to the issue of restricting write access was modification of the default Blazegraph web GUI.  Out of the box, the GUI had a tab for Updates (which we had disabled for unauthenticated users) and for some other advanced features that we didn't want the public to see.  Those tabs can be hidden by modifying the /opt/tomcat/webapps/bg/html/index.html file using the server control panel (details here).  We also were able to style the query GUI page to comply with Vanderbilt's branding standards.  You can see the final appearance of the GUI at https://sparql.vanderbilt.edu .

The final step in securing access to the server was to set up HTTPS.  The Jelastic system provides a simple method to set up HTTPS using a free Let's Encrypt SSL certificate.  Before we could enable HTTPS, we had to get the redirect set up from the sparql.vanderbilt.edu subdomain to the vuswwg.jelastic.servint.net subdomain.  This was accomplished by creating a DNS "A" record pointing to the public IP address of the Nginx instance.  To the best of my knowledge, in the Jelastic system the Nginx IP address is stable as long as the Nginx server is not turned off.  (Restarting the server is fine.)  If the server were turned off and then back on, the IP address would change, and a new A record would have to be set up for sparql.vanderbilt.edu .  Getting the A record set up required several email exchanges with Vanderbilt ITS before everything worked correctly.  Once the record propagated through the DNS system, we could follow the directions to install Let's Encrypt and make the final changes to the Nginx configuration file (see this page for details).

Step 4: Loading data into the triplestore

One consequence of the method that we chose for restricting write access to the server was that it was no longer possible to use the Blazegraph web GUI to load RDF files directly from a hard drive into the triplestore.  Fortunately, files could be loaded using the SPARQL Update protocol or the more generic data loading commands that are part of the Blazegraph application, both via HTTP (see examples here).

One issue with using the generic loading commands is that I'm not aware of any way to specify that the triples in the RDF file be added to a particular named graph in the triple store.  If one's plan for managing the triple store involved deleting the entire store and replacing it, then that wouldn't be important.  However, we plan to compartmentalize the data of various users by associating those data with particular named graphs that can be dropped or replaced independently.  So our options were limited to what we could do with SPARQL Update.

The two most relevant Update commands were LOAD and DROP, for loading data into a graph and deleting graphs, respectively.  Both commands must be executed through HTTP POST.

There are actually two ways to accomplish a LOAD command: by sending the request as URL-encoded text or by sending it as plain text.  I couldn't see any advantage to the URL-encoded method, so I used the plain text method.  In that method, the body of the POST request is simply the Update command.  However, the server will not understand the command unless it is accompanied by a Content-Type request header of application/sparql-update.  Since the server is set up to require authorization for POST requests, an Authorization header is also required, although Postman handles that for us automatically when the Basic Auth option is chosen.

The basic format of the SPARQL Update LOAD command is:

LOAD <sourceFileURL> INTO GRAPH <namedGraphUri>

where sourceFileURL is a dereferenceable URL from which the file can be retrieved and namedGraphUri is the URI of the named graph in which the loaded triples should be included.  The named graph URI is simply an identifier and does not have to represent any real location on the web.

The sourceFileURL can be a web location (such as GitHub) or a local file on the server if the file: URI type is used (e.g. file:///opt/tomcat/temp/upload/data.rdf).  Unfortunately, the file cannot be loaded directly from your local drive.  Rather, it first must be uploaded to the server, then loaded from its new location on the server using the LOAD command.  To upload a file, click on the Tomcat Config icon (wrench/spanner) that appears when you mouse over the area to the right of the Tomcat entry.  A directory tree will appear in the pane below.  You can then navigate to the place where you want to upload the file.  For the URL I listed above, here's what the GUI looks like:


Select the Upload option and a popup will allow you to browse to the location of the file on your local drive.  Once the file is in place on the server, you can use your favorite HTTP tool to POST the SPARQL Update command and load the file into the triplestore.
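
For example, assuming the data.rdf file mentioned above had been uploaded to /opt/tomcat/temp/upload/ and using a made-up graph URI, the body of the POST request (sent with the application/sparql-update Content-Type header described earlier) would be a single command like:

LOAD <file:///opt/tomcat/temp/upload/data.rdf> INTO GRAPH <http://example.org/myGraph>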

This particular method is a bit of a hassle, and is not amenable to automation.  If you are managing files using GitHub, it's a lot easier to load the file directly from there using a variation on the Github raw file URL.  For example, if I want to load the file https://github.com/baskaufs/cv/blob/master/occurrenceStatus/occurrenceStatus.ttl into the triplestore, I would need to load the raw version at https://raw.githubusercontent.com/baskaufs/cv/master/occurrenceStatus/occurrenceStatus.ttl .  However, it is not possible to successfully load that file into Blazegraph directly from Github using the LOAD command.  The reason is that when a file is loaded from a remote URL using the SPARQL update command, Blazegraph apparently depends on the Content-Type header from the source to know that the file is some serialization of RDF.  Github and Github Gist always report the media type of raw files as text/plain regardless of the file extension, and Blazegraph takes that to mean that the file does not contain RDF triples.  If one uses the raw file URL in a SPARQL Update LOAD command, Blazegraph will issue a 200 (OK) HTTP code, but won't actually load any triples.

The solution to this problem is to use a redirect that specifies the correct media type.  The caching proxy service RawGit (https://rawgit.com/) interprets the file extension of a Github raw file and relays the requested file with the correct Content-Type header.  The example file above would be retrieved using the RawGit development URL https://rawgit.com/baskaufs/cv/master/occurrenceStatus/occurrenceStatus.ttl .  RawGit will add the Content-Type header text/turtle as it relays the file.  (Read the RawGit home page at https://rawgit.com/ for an explanation of the distinction between RawGit development URLs and production URLs.)
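
Putting that together, a LOAD command for the example file might look like this (the named graph URI is again just an illustration):

LOAD <https://rawgit.com/baskaufs/cv/master/occurrenceStatus/occurrenceStatus.ttl> INTO GRAPH <http://example.org/occurrenceStatus>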

The SPARQL Update DROP command has this format:

DROP GRAPH <namedGraphUri>

Executing the command removes from the store all of the triples that had been assigned to the graph identified by the URI in the position of  namedGraphUri.

If the graphs are small (a few hundred or thousand triples), both loading and dropping them takes a trivial amount of time.  However, when the graph size is significant (a million triples or more), a non-trivial amount of time is required either to load or to drop the graph.  I think the reason is the indexing that Blazegraph does as it loads the triples; that indexing is what allows Blazegraph to query efficiently.  The transfer of the file itself can be sped up by compressing it.  Blazegraph supports gzip (.gz) file compression.  However, compressing the file doesn't seem to speed up the actual time required to load the triples into the store.  I haven't done a lot of experimenting with this, but I have one anecdotal experience loading a gzip-compressed file containing about 1.7 million triples.  I uploaded the file to the server, then used the file: URI version of SPARQL Update to load it into the store.  Normally, the server sends an HTTP 200 code and a response body indicating the number of "mutations" (triples modified) after the load command is successfully completed.  However, in the case of the 1.7 million triple file, the server timed out and sent an error code.  But when I checked the status of the graph a little later on, all of the triples seemed to have loaded successfully.  So the timeout seems to have been a limit on communication between the server and client, but not necessarily a limit on the time necessary to carry out actions happening internally in the server.

I was a bit surprised to discover that dropping a large graph took about as long as loading it.  In retrospect, I probably shouldn't have been surprised.  Removing a bunch of triples involves removing them from the indexing system, not just deleting some file location entry as would be the case for deleting a file on a hard drive.  So it makes sense that the removal activity should take about as long as the adding activity.

These speed issues suggest some considerations for graph management in a production environment.  If one wanted to replace a large graph (over a million triples, say), dropping the graph and then reloading it probably would not be the best option, since both actions would be time-consuming and the data would probably be essentially "off line" during the process.  It might work better to load the new data into a differently named graph, then use the SPARQL Update COPY or MOVE functions to replace the existing graph, as sketched below.  I haven't actually tried this yet, so it may not work any better than dropping and reloading.
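
In SPARQL Update terms, the untested idea would look something like the following, sent as a single request so the two operations run in sequence (both graph URIs and the file path are hypothetical):

# load the replacement data into a temporary staging graph
LOAD <file:///opt/tomcat/temp/upload/newData.ttl> INTO GRAPH <http://example.org/staging> ;

# then replace the production graph with the contents of the staging graph
MOVE GRAPH <http://example.org/staging> TO GRAPH <http://example.org/production>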

Step 5: Documenting the graphs in the triplestore

One problem with the concept of a SPARQL endpoint as a programmable API is that users need to understand the graphs in the triplestore in order to know how to "program" their queries.  So our SPARQL endpoint wasn't really "usable" until we provided a description of the graphs included in the triple store.  On the working group Github site, we have created a "user guide" with some general instructions about using the endpoint and a page for each project whose data are included in the triplestore.  The project pages describe the named graphs associated with the project, including a graphical representation of the graph model and sample queries (an example is here).  With a little experimentation, users should be able to construct their own queries to retrieve data associated with the project.

Step 6: Using the SPARQL endpoint

I've written some previous posts about using our old Callimachus endpoint as a source of XML data to run web applications.  Since Blazegraph supports JSON query results, I was keen to try writing some new Javascript to take advantage of that.  I have a new little demo page at http://bioimages.vanderbilt.edu/lang-labels.html that consumes JSON from our new endpoint.  The underlying Javascript that makes the page work is at http://bioimages.vanderbilt.edu/lang-labels.js .  The script sends a very simple query to the endpoint, e.g.:

SELECT DISTINCT ?label WHERE {
<http://rs.tdwg.org/cv/status/extant> <http://www.w3.org/2004/02/skos/core#prefLabel> ?label.
}

when the query is generated using the page's default values.  Here's what that query looks like when it's URL encoded and ready to be sent by HTTP GET:

https://sparql.vanderbilt.edu/sparql?query=SELECT%20DISTINCT%20%3Flabel%20WHERE%20%7B%3Chttp%3A%2F%2Frs.tdwg.org%2Fcv%2Fstatus%2Fextant%3E%20%3Chttp%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23prefLabel%3E%20%3Flabel.%7D

On Chrome, you can use the Developer Tools to track the interactions between the browser and the server as you load the page and click the button.
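
In the snippet below, the encoded variable holds the URL-encoded query string; my assumption (not the exact code from the page) is that it is produced with JavaScript's built-in encodeURIComponent function, something like:

// build the query string and URL-encode it for use in the GET request
var query = 'SELECT DISTINCT ?label WHERE {'
    + '<http://rs.tdwg.org/cv/status/extant> '
    + '<http://www.w3.org/2004/02/skos/core#prefLabel> ?label.'
    + '}';
var encoded = encodeURIComponent(query);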

The jQuery code that does the AJAX call looks like this:

$.ajax({
    type: 'GET',
    url: 'https://sparql.vanderbilt.edu/sparql?query=' + encoded,
    headers: {
        Accept: 'application/sparql-results+json'
    },
    success: function(returnedJson) {
        [handler function goes here]
    }
});

Since the HTTP GET request includes an Accept: header of application/sparql-results+json, the server response (the Javascript object returnedJson) looks like this:

{
  "head" : {
    "vars" : [ "label" ]
  },
  "results" : {
    "bindings" : [ {
      "label" : {
        "xml:lang" : "en",
        "type" : "literal",
        "value" : "extant"
      }
    }, {
      "label" : {
        "xml:lang" : "de",
        "type" : "literal",
        "value" : "vorhanden "
      }
    }, {
      "label" : {
        "xml:lang" : "es",
        "type" : "literal",
        "value" : "existente"
      }
    }, {
      "label" : {
        "xml:lang" : "pt",
        "type" : "literal",
        "value" : "presente"
      }
    }, {
      "label" : {
        "xml:lang" : "zh-hans",
        "type" : "literal",
        "value" : "现存"
      }
    }, {
      "label" : {
        "xml:lang" : "zh-hant",
        "type" : "literal",
        "value" : "現存"
      }
    } ]
  }
}

It then becomes a simple matter to pull the label values from the JSON array using this Javascript loop found in the handler function:

var value = "";
for (var i = 0; i < returnedJson.results.bindings.length; i++) {
    value = value + "<p>"
        + returnedJson.results.bindings[i].label["xml:lang"] + " "
        + returnedJson.results.bindings[i].label.value + "</p>";
}

then display value on the web page.  One issue with the translation from the JSON array to the Javascript array reference can be seen in the code snippet above.  The JSON key "xml:lang" is not a valid Javascript name due to the presence of the colon.  So "bracket notation" must be used in the Javascript array reference instead of "dot notation" to refer to it.

Conclusion

I am quite excited that our endpoint is now fully operational and that we can build applications around it.  One disappointing discovery that I made recently is that, as currently configured, our Blazegraph instance is not correctly handling URL-encoded literals that are included in query strings.  It works fine with literals that contain only ASCII characters, but including a string like "現存" (URL encoded as "%E7%8F%BE%E5%AD%98") in a query fails to produce any result.  This problem doesn't happen when the same query is made to the Callimachus endpoint.  That is a major blow, since several datasets that we have loaded or intend to load into the triplestore include UTF-8 encoded strings representing literals in a number of languages.  I sent a post about this problem to the Bigdata developers' email list, but have not yet gotten any response.  If anyone has any ideas about why we are having this problem, or how to solve it, I'd be keen to hear from you.

Aside from that little snafu, we have achieved one of the "useful things" that SPARQL endpoints allow: making it possible for users to "program the API" to get any kind of information they want from the datasets included in the triplestore.  It remains for us to explore the second "useful thing" that I mentioned at the start of this post: merging RDF datasets from multiple sources and accomplishing something useful by doing so.  Stay tuned as we try to learn effective ways to do that in the future.

We also hope that at some point in the future we will have demonstrated that there is some value to having a campus-wide SPARQL endpoint, and that once we can clearly show how we set it up and what it can be used for, we will be able to move it from the commercial cloud server to a server maintained by Vanderbilt ITS.