Tuesday, May 28, 2019

Getting Data Out of Wikidata using Software

For those of you who have been struggling through my series of posts on the TDWG Standards Documentation Specification, you may be happy to read a more "fun" post on everybody's current darling: Wikidata.  This post has a lot of "try it yourself" activities, so if you have an installation of Python on your computer, you can try running the scripts yourself.



Because our group at the Vanderbilt Libraries is very interested in leveraging Wikidata and Wikibase for possible use in projects, I recently began participating in the LD4 Wikidata Affinity Group, an interest group of the Linked Data for Production project.  I also attended the 2019 LD4 Conference on Linked Data in Libraries in Boston earlier this month.  In both instances, most of the participants were librarians who were knowledgeable about Linked Data. So it's been a pleasure to participate in events where I don't have to explain what RDF is, or why one might be able to do cool things with Linked Open Data (LOD).

However, I have been surprised to hear people complain a couple of times at those events that Wikidata doesn't have a good API that people can use to acquire data to use in applications such as those that generate web pages.  When I mentioned that Wikidata's query service effectively serves as a powerful API, I got blank looks from most people present.

I suspect that one reason why people don't think of the Wikidata Query Service as an API is because it has such an awesome graphical interface that people can interact with.  Who wouldn't get into using a simple dropdown to create cool visualizations that include maps, timelines, and of course, pictures of cats?  But underneath it all, the query service is a SPARQL endpoint, and as I have pontificated in a previous post, a SPARQL endpoint is just a glorified, program-it-yourself API.

In this post, I will demonstrate how you can use SPARQL to acquire both generic data and RDF triples from the Wikidata query service.

What is SPARQL for?

The recursive acronym SPARQL (pronounced like "sparkle") stands for "SPARQL Protocol and RDF Query Language".  Most users of SPARQL know that it is a query language.  It is less commonly known that SPARQL has a protocol associated with it that allows client software to communicate with the endpoint server using hypertext transfer protocol (HTTP).  That protocol establishes things like how queries should be sent to the server using GET or POST, how the client indicates the desired form of the response, and how the server indicates the media type of the response.  These are all standard kinds of things that are required for a software client to interact with an API, and we'll see the necessary details when we get to the examples.

The query parts of SPARQL demarcate the kinds of tasks we can accomplish using it.  There are three main things we can do with SPARQL:

  • get generic data from the underlying graph database (triplestore) using SELECT
  • get RDF triples based on data in the graph database using CONSTRUCT
  • load RDF data into the triplestore using UPDATE

In the Wikidata system, data enters the Blazegraph triplestore directly from a separate database, so the third of these methods (UPDATE) is not enabled.  That leaves the SELECT and CONSTRUCT query forms and we will examine each of them separately.

Getting generic data using SPARQL SELECT

The SELECT query form is probably most familiar to users of the Wikidata Query Service.  It's the form used when you do all of those cool visualizations using the dropdown examples.  The example queries do some magical things using commands that are not part of standard SPARQL, such as view settings that are in comments and the AUTO_LANGUAGE feature.  In my examples, I will use only standard SPARQL for dealing with languages and ignore the view settings since we aren't using a graphical interface anyway.

We are going to develop an application that will allow us to discover what Wikidata knows about superheroes.  The query that we are going to start off with is this one:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT DISTINCT ?name ?iri WHERE {
?iri wdt:P106 wd:Q188784.
?iri wdt:P1080 wd:Q931597.
?iri rdfs:label ?name.
FILTER(lang(?name)="en")
}
ORDER BY ASC(?name)

For reference purposes, in the query, wdt:P106 wd:Q188784 is "occupation superhero" and wdt:P1080 wd:Q931597 is "from fictional universe Marvel Universe".  This is what restricts the results to Marvel superheroes.  (You can leave this restriction out, but then the list gets unmanageably long.)  The language filter restricts the labels to the English ones.  The name of the superhero and its Wikidata identifier are what is returned by the query.

If you want to try the query, you can go the the graphical query interface (GUI), paste it into the box, and click the blue "run" button.  I should note that Wikidata will allow you to get away with leaving off the PREFIX declarations, but that bugs me, so I'm going to include them since I think it's a good practice to be in the habit of including them.

When you run the query, you will see the result in the form of a table:


The table shows all of the bindings to the ?name variable in one column and the bindings for the ?iri variable in a second column.  However, when you make the query programmatically rather than through the GUI, there are a number of possible non-tabular forms (i.e. serializations) in which you can receive the results.

To understand what is going on under the hood here, you need to issue the query to the SPARQL endpoint using client software that will allow you specify all of the settings required by the SPARQL HTTP protocol.  Most programmers at this point would tell you to use CURL to issue the commands, but personally, I find CURL difficult to use and confusing for beginners.  I always use Postman, which is free, and easy to use and understand.

The SPARQL protocol describes several ways to make a query.  We will talk about two of them here.

Query via GET is pretty straightforward if you are used to interacting with APIs.  The SPARQL query is sent to the endpoint (https://query.wikidata.org/sparql) as a query string with a key of query and value that is the URL-encoded query.  The details of how to do that using Postman are here.  The end result is that you are creating a long, ugly URL that looks like this:

https://query.wikidata.org/sparql?query=%0APREFIX+rdfs%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0APREFIX+wd%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2F%3E%0APREFIX+wdt%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fdirect%2F%3E%0ASELECT+DISTINCT+%3Fname+%3Firi+WHERE+%7B%0A%3Firi+wdt%3AP106+wd%3AQ188784.%0A%3Firi+wdt%3AP1080+wd%3AQ931597.%0A%3Firi+rdfs%3Alabel+%3Fname.%0AFILTER%28lang%28%3Fname%29%3D%22en%22%29%0A%7D%0AORDER+BY+ASC%28%3Fname%29%0A

If you are OK with getting your results in the default XML serialization, you just need to request that URL and the file that comes back will have your results.  You can even do that by just pasting the ugly URL into the URL box of a web browser if you don't want to bother with Postman.

However, since we are planning to use the results in a program, it is much easier to use the results if they are in JSON.  Getting the results in JSON requires sending an Accept request header of application/sparql-results+json along with the GET request.  You can't do that in a web browser, but in Postman you can set request headers by filling in the appropriate boxes on the header tab as shown here.  SPARQL endpoints may also accept the more generic JSON Accept request header application/json, but the previous header is the preferred one for SPARQL requests.

Query via POST is in some ways simpler than using GET.  The SPARQL query is sent to the endpoint using only the base URL without any query string.  The query itself is sent to the endpoint in unencoded form as the message body.  A critical requirement is that the request must be sent with a Content-Type header of application/sparql-query.  If you want the response to be in JSON, you must also include an Accept request header of application/sparql-results+json as was the case with query via GET.

There is no particular advantage of using POST instead of GET, except in cases where using GET would result in a URL that exceeds the allowed length for the endpoint server.  I'm not sure what that limit is for Wikidata's server, but typically the maximum is between 5000 and 15000 characters.  So if the query you are sending ends up being very long, it is safer to send it using POST.

Response.  The JSON response that we get back look like this:

{
    "head": {
        "vars": [
            "name",
            "iri"
        ]
    },
    "results": {
        "bindings": [
            {
                "name": {
                    "xml:lang": "en",
                    "type": "literal",
                    "value": "Amanda Sefton"
                },
                "iri": {
                    "type": "uri",
                    "value": "http://www.wikidata.org/entity/Q3613591"
                }
            },
            {
                "name": {
                    "xml:lang": "en",
                    "type": "literal",
                    "value": "Andreas von Strucker"
                },
                "iri": {
                    "type": "uri",
                    "value": "http://www.wikidata.org/entity/Q4755702"
                }
            },
...
            {
                "name": {
                    "xml:lang": "en",
                    "type": "literal",
                    "value": "Zeitgeist"
                },
                "iri": {
                    "type": "uri",
                    "value": "http://www.wikidata.org/entity/Q8068621"
                }
            }
        ]
    }
}

The part of the results that we care about is the value of the bindings key.  It's a array of objects that include a key for each of the variables that we used in the query (e.g. name and iri).  The value for each variable key is another object representing a bound result. The bound result object contains key:value pairs that tell you more than we could learn from the Query Service GUI table, notably the language tag for strings and whether the result was a literal or URI.  But basically, the information for each object in the array corresponds to the information that was in a row in the GUI table.  In our program, we can step through each object in the array and pull out the bound results that we want.

The reason for going into these gory details is to point out that the generic HTTP operations that we just carried out can be done for any programming language that has libraries to perform HTTP calls.  We will see how this is done in practice for two languages.

Using Python to get generic data using SPARQL SELECT

A Python 3 script that performs the query above can be downloaded from this page.  The query itself is assigned to a variable as a multi-line string in lines 10-19.  In line 3, the script allows the user to choose a language for the query and the code for that language is inserted as the variable isoLanguage in line 17.

The script uses the popular and easy-to-use requests library to make the HTTP call.  It's not part of the standard library, so if you haven't used it before, you'll need to install it using PIP before you run the script.  The actual HTTP GET call is made in line 26.  The requests module is really smart and will automatically URL-encode the query when it's sent into the .get() method as a value of params.  So you don't have to worry about that yourself.

If you uncomment line 27 and comment line 26, you can make the request using the .post() method instead of GET.  For a query of this size, there is no particular advantage of one method over the other.  The syntax varies slightly (specifying the query as a data payload rather than a query parameter) and the POST request includes the required Content-Type header in addition to the Accept header to receive JSON.

There are print statements in lines 21 and 29 so that you can see what the query looks like after the insertion of the language code, and after it's been URL-encoded and appended to the base endpoint URL.  You can delete them later if they annoy you.  If you uncomment line 31, you can see the raw JSON results as the have been received from the query service.  They should look like what was shown above.

Line 33 converts the received JSON string into a Python data structure, and also pulls out the array value of the bindings key (now a Python "list" data structure).  Lines 34 to 37 step through each result in the list, extract the values bound to the ?name and ?iri variables, then prints them on the screen.  The result looks like this for English:

http://www.wikidata.org/entity/Q3613591 :  Amanda Sefton
http://www.wikidata.org/entity/Q4755702 :  Andreas von Strucker
http://www.wikidata.org/entity/Q14475812 :  Anne Weying
http://www.wikidata.org/entity/Q2604744 :  Anya Corazon
http://www.wikidata.org/entity/Q2299363 :  Armor
http://www.wikidata.org/entity/Q2663986 :  Aurora
http://www.wikidata.org/entity/Q647105 :  Banshee
http://www.wikidata.org/entity/Q302186 :  Beast
http://www.wikidata.org/entity/Q2893582 :  Bedlam
http://www.wikidata.org/entity/Q2343504 :  Ben Reilly
http://www.wikidata.org/entity/Q616633 :  Betty Ross
...

If we run the program and enter the language code for Russian (ru), the results look like this:

http://www.wikidata.org/entity/Q49262738 :  Ultimate Ангел
http://www.wikidata.org/entity/Q48891562 :  Ultimate Джин Грей
http://www.wikidata.org/entity/Q39052195 :  Ultimate Женщина-паук
http://www.wikidata.org/entity/Q4003146 :  Ultimate Зверь
http://www.wikidata.org/entity/Q48958279 :  Ultimate Китти Прайд
http://www.wikidata.org/entity/Q4003156 :  Ultimate Колосс
http://www.wikidata.org/entity/Q16619139 :  Ultimate Рик Джонс
http://www.wikidata.org/entity/Q7880273 :  Ultimate Росомаха
http://www.wikidata.org/entity/Q48946153 :  Ultimate Роуг
http://www.wikidata.org/entity/Q4003147 :  Ultimate Циклоп
http://www.wikidata.org/entity/Q48947511 :  Ultimate Человек-лёд
http://www.wikidata.org/entity/Q4003183 :  Ultimate Шторм
http://www.wikidata.org/entity/Q2663986 :  Аврора
http://www.wikidata.org/entity/Q3613591 :  Аманда Сефтон
http://www.wikidata.org/entity/Q4755702 :  Андреас фон Штрукер
http://www.wikidata.org/entity/Q2604744 :  Аня Коразон
http://www.wikidata.org/entity/Q770064 :  Архангел
http://www.wikidata.org/entity/Q28006858 :  Баки Барнс
...

So did our script just use an API?  I would argue that it did.  But it's programmable: if we wanted it to retrieve superheroes from the DC universe, all we would need to do is to replace Q931597 with Q1152150 in line 15 of the script.

Now that we have the labels and ID numbers for the superheroes, we could let the user pick one and we could carry out a second query to find out more.  I'll demonstrate that in the next example.

Using Javascript/JQuery to get generic data using SPARQL SELECT

Because the protocol to acquire the data is generic, we can go through the same steps in any programming language.  Here is an example using Javascript with some JQuery functions.  The accompanying web page sets up two dropdown lists, with the second one being populated by the Javascript using the superhero names retrieved using SPARQL.  You can try the page directly from this web page.  To have the page start off using a language other than English, append a question mark, followed by the language code, like this.  If you want to try hacking the Javascript yourself, you can download both documents into the same local directory, then double click on the HTML file to open it in a browser.  You can then edit the Javascript and reload the page to see the effect.

There are basically two things that happen in this page.  

Use SPARQL to find superhero names in the selected language.  The initial page load (line 157) and making a selection on the Language dropdown (lines 106-114) fire the setStatusOptions() function (lines 58-98).  That function inserts the selected language into the SPARQL query string (lines 69-78), URL-encodes the query (line 79), then performs the GET to the constructed URL (lines 82-87).  The script then steps through each result in the bindings array (line 89) and pulls out the bound name and iri value for the result (lines 90-91).  Up to this point, the script is doing exactly the same things as the Python script.  In line 93, the name is assigned to the dropdown option label and the IRI is assigned to the dropdown option value.  The page then waits for something else to happen.


In this screenshot, I've turned on Chrome's developer tools so I can watch what the page is doing as it runs.  This is what the screen looks like after the page loads.  I've clicked on the Network tab  and selected the sparql?query=PREFIC%20rdfs... item.  I can see the result of the query that was executed by the setStatusOptions() function.

Use SPARQL to find properties and values associated with a selected superhero. When a selection is made in the Character dropdown, the $("#box1").change(function() (lines 117-154) is executed.  It operates in a manner similar to the setStatusOptions() function, except that it uses a different query that finds properties associated with the superhero and the values of those properties.  Lines 121 through 133 insert the IRI from the selected dropdown into line 125 and the code of the selected language in lines 130 and 131, resulting in a query like this (for Black Panther, Q998220):

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT DISTINCT ?property ?value WHERE {
<http://www.wikidata.org/entity/Q998220> ?propertyUri ?valueUri.
?valueUri rdfs:label ?value.
?genProp <http://wikiba.se/ontology#directClaim> ?propertyUri.
?genProp rdfs:label ?property.
FILTER(substr(str(?propertyUri),1,36)="http://www.wikidata.org/prop/direct/")
FILTER(LANG(?property) = "en")
FILTER(LANG(?value) = "en")
}
ORDER BY ASC(?property)

The triple patterns in lines 127 and 128 and filter in 129 give access to the label of a property used in a statement about the superhero.  The model for property labels in Wikidata are a bit complex - see this page for more details.  This query only finds statements whose values are items (not those with string values) because the triple pattern in line 126 requires the value to have a label (and strings don't have labels).  A more complex graph pattern than this one would be necessary to get values of statements with both literal (string) and non-literal (item) values.

Lines 144-149 differ from the earlier function in that they build a text string (called text) containing the property and value strings from the returned query results.  The completed string is inserted as HTML into the div1 element of the web page (line 149). 



The screenshot above shows what happens in Developer Tools when the Character dropdown is used to select "Black Panther".  You can see on the Network tab on the right that another network action has occurred - the second SPARQL query.  Clicking on the response tab shows the response JSON, which is very similar in form to the previous query results.  On the left side of the screen, you can see where the statements about the Black Panther have been inserted into the web page.

It is worth noting that the results that are acquired vary a lot depending on the language that is chosen.  The first query that builds the character dropdown requires that the superhero have a label in the chosen language.  If there isn't a label in that language for that character, then the superhero isn't listed.  That's why the list of English superhero names is so long and the simplified Chinese list only has a few options.  Similarly, properties for a character are only listed if they have labels in that language and if those labels also have a value that has a label in that language.  So we miss a lot of superheros and properties that exist if no one has bothered to create labels for them in a given language.  

This page is also very generic.  Except for the page titles and headers in different languages, which are hard-coded, minor changes to the triple patterns in lines 73 and 74 would make it possible to retrieve information about almost any kind of thing described in Wikidata.

Getting RDF triples using SPARQL CONSTRUCT

In SPARQL SELECT, we specify any number of variables that we want the endpoint to send information about.  The values that are bound to those variables can be any kind of string (datatyped or language tagged) or IRI.  In contrast, SPARQL CONSTRUCT always returns the same kind of information: RDF triples.  So the response to CONSTRUCT is always an RDF graph.

As with the SELECT query, you can issue a CONSTRUCT query at the Wikidata Query Service GUI by pasting it into the box.  You can try it with this query:

PREFIX wd: <http://www.wikidata.org/entity/>
CONSTRUCT {
  wd:Q42 ?p1 ?o.
  ?s ?p2 wd:Q42.
}
WHERE {
  {wd:Q42 ?p1 ?o.}
  UNION
  {?s ?p2 wd:Q42.}
}

The WHERE clause of the query requires that triples match one of two patterns: triples where the subject is item Q42, and triples where the object is item Q42.  The graph to be constructed and returned consists of all of the triples that conform to one of those two patterns.  In other words, the graph that is returned to the client is all triples in Wikidata that are directly related to Douglas Adams (Q42).  



When we compare the results to what we got when we pasted the SELECT query into the box, we see that there is also a table at the bottom.  However, in a CONSTRUCT query there will always be three columns for the three parts of the triple (subject, predicate, and object), plus a column for the "context", which we won't worry about here.  The triples that are shown here mostly look the same, but that's only because the table in the GUI doesn't tell us the language tags of the labels and the names in most languages written in Latin characters are the same. 

If we use Postman to make the query, we have the option to specify the serialization that we want for the response graph.  Blazegraph (the system that runs Wikidata's SPARQL endpoint) will support any of the common RDF serializations, so we just need to send an Accept header with the appropriate media type (text/turtle for Turtle, application/rdf+xml for XML, or application/ld+json for JSON-LD).  Otherwise, the query is sent via GET or POST just as in the SELECT example.


These results (in Turtle serialization) show us the language tags of all of the labels, and we can see that the string part of many of them are the same when the language uses Latin characters.  

We can also run this query using this Python script.  As with the previous Python script, we assign the query to a variable in lines 9 to 18.  This time, we substitute the value of the item variable, which is set to be Q42, but could be changed to retrieve triples about any other item.  After we perform the GET request, the result is written to a text file, requestsOutput.ttl .  We could then load that file into our own triplestore if we wanted.

Using rdflib in Python to manipulate RDF data from SPARQL CONSTRUCT

Since the result of our CONSTRUCT query is an RDF graph, there aren't the same kind of direct uses for the data in generic Python (or Javascript, for that matter) as the JSON results of the SELECT query.  However, Python has an awesome library for working with RDF data called rdflib.  Let's take a look at another Python script that uses rdflib to mess around with RDF graphs acquired from the Wikidata SPARQL endpoint.  (Don't forget to use PIP to install rdflib if you haven't used it before.

Lines 11 through 24 do the same thing as the previous script: create variables containing the base endpoint URL and the query string.  

The rdflib package allows you to create an instance of a graph (line 8) and has a .parse() method that will retrieve a file containing serialized RDF from a URL, parse it, and load it into the graph instance (line 29).  In typical use, the URL of a specific file is passed into the method, but since all of the information necessary to initiate a SPARQL CONSTRUCT query via GET is encoded in a URL, and since the result of the query is a file containing serialized RDF, we can just pass the complex endpoint URL with the encoded query into the method and the graph returned from the query will go directly into the itemGraph graph instance.  

There are two issues with using the method in this way.  One is that unlike the requests .get() method, the rdflib .parse() method does not allow you to include request headers in your GET call.  Fortunately, if no Accept header is sent to the Wikidata SPARQL endpoint, it defaults to returning RDF/XML, and the .parse() method is fine with that.  The other issue is that unlike the requests .get() method, the rdflib .parse() method does not automatically URL-encode the query string and associate it with the query key.  That is why line 25 builds the URL manually and uses the urllib.parse.quote() function to URL-encode the query string before appending it to the rest of the URL.  

Upon completion of line 29, we now have the triples constructed by our query loaded into the itemGraph graph instance.  What can we do with them?  The rdflib documentation provides some ideas.  If I am understanding it correctly, graph is an iterable object consisting of tuples, each of which represents a triple.  So in line 33, we can simply use the len() function to determine how many triples were loaded into the graph instance.  In lines 35 and 37, I used the .preferredLabel() method to search through the graph to find the labels for Q42 in two languages.  

rdflib has a number of other powerful features that are worth exploring.  One is its embedded SPARQL feature, which perhaps isn't that useful here since we just got the graph using a SPARQL query.  Nevertheless, it's a cool function.  The other capability that could be very powerful is rdflib's nearly effortless ability to merge RDF graphs.  In the example script, the value of the item variable is hard-coded in line 5.  However, the value of item could be determined by a for loop and the triples associated with many items could be accumulated into a single graph before saving the merged graph as a file (line 41) to be used elsewhere (e.g. loaded into a triplestore).  You can imagine how a SPARQL SELECT query could be made to generate a list of items (as was done in the "Using Python to get generic data using SPARQL SELECT" section of this post), then that list could be passed into the code discussed here to create a graph containing all of the information about every item meeting some criteria set out in the SELECT query.  That's pretty powerful stuff!

Alternate methods to get data from Wikidata

Although I've made the case here that SPARQL SELECT and CONSTRUCT queries are probably the best way to get data from Wikidata, there are other options.  I'll describe three.

MediaWiki API

Since Wikidata is built on the MediaWiki system, the MediaWiki API is another mechanism to acquire generic data (not RDF triples) about items in Wikidata.  I have written a Python script that uses the wbgetclaims action to get data about the claims (i.e. statements) made about a Wikidata item.  I won't go into detail about the script, since it just uses the requests module's .get() method to get and parse JSON as was done in the first Python example of this post.  The main tricky thing about this method is that you need to understand about "snaks", an idiosyncratic feature of the Wikibase data model.  The structure of the JSON for the value of a claim varies depending on the type of the snak - thus the series of try...except... statements in lines 20 through 29.  

If you intend to use the MediaWiki API, you will need to put in a significant amount of time studying the API documentation.  A list of possible actions are on this page -  actions whose names begin with "wb" are relevant to Wikidata.  I will be talking a lot more about using the WikiMedia API in the next blog post, so stay tuned.

Dereferencing Wikidata item IRIs

Wikidata plays nicely in the Linked Data world in that they support content negotiation for dereferencing of their IRIs.  That means that you can just do an HTTP GET for any item IRI with an Accept request header of one of the RDF media types, and you'll get a very complete description of the item in RDF.  

For example, if I use Postman to dereference the IRI http://www.wikidata.org/entity/Q42 with an Accept header of text/turtle and allow Postman to automatically follow redirects, I eventually get redirected to the URL https://www.wikidata.org/wiki/Special:EntityData/Q42.ttl .  The result is a pretty massive file that contains 66499 triples (as of 2019-05-28).  In contrast, the SPARQL CONSTRUCT query to find all of the statements where Q42 was either the subject or object of the triple returned 884 triples.  Why are there 75 times as many triples when we dereference the URI?  If we scroll through the 66499 triples, we can see that not only do we have all of the triples that contain Q42, but also all of the triples about every part of every triple that contains Q42 (a complete description of the properties and a complete description of the values of statements about Q42).  So this is a possible method to acquire information about an item in the form of RDF triples, but you get way more than you may be interested in knowing.  

Using SPARQL DESCRIBE

One of the SPARQL query forms that I didn't mention earlier is DESCRIBE.  The SPARQL 1.1 Query Language specification is a bit vague about what is supposed to happen in a DESCRIBE query.  It says "The DESCRIBE form returns a single result RDF graph containing RDF data about resources. This data is not prescribed by a SPARQL query, where the query client would need to know the structure of the RDF in the data source, but, instead, is determined by the SPARQL query processor."  In other words, it's up to the particular SPARQL query processor implementation to decide what information to send the client about the resource.  It may opt to send triples that are indirectly related to the described resource, particularly if the connection is made by blank nodes (a situation that would make it more difficult for the client to "follow its nose" to find the other triples).  So basically, the way to find out what a SPARQL endpoint will send as a response to a DESCRIBE query is to do one and see what you get.  

When I issue the query

DESCRIBE <http://www.wikidata.org/entity/Q42>

to the Wikidata SPARQL endpoint with an Accept request header of text/turtle,  I get 884 triples, all of which have Q42 as either the subject or object.  So at least for the Wikidata query service SPARQL endpoint, the DESCRIBE query provides a simpler way to express the CONSTRUCT query that I described in the "Getting RDF triples using SPARQL CONSTRUCT" section above.  

The power of SPARQL CONSTRUCT

In the simple example above, DESCRIBE was a more efficient way than CONSTRUCT to get all of the triples where Q42 was the subject or object.  However, the advantage of using CONSTRUCT is that you can tailor the triples to be returned in more specific ways.  For example, you could easily obtain only the triples where Q42 is the subject by just leaving out the 

?s ?p2 wd:Q42.

part of the query.

In the CONSTRUCT examples I've discussed so far, the triples in the constructed graph all existed in the Wikidata dataset - we just plucked them out of there for our own use.  However, there is no requirement that the constructed triples actually exist in the data source.  We can actually "make up" triples to say anything we want.  I'll illustrate this with an example involving references.



The Wikibase graph model (upon which the Wikidata model is based) is somewhat complex with respect to the references that support claims about items.  (See this page for more information.)  When a statement is made, a statement resource is instantiated and it's linked to the subject item by a "property" (p: namespace) analog of the "truthy direct property" (wdt: namespace) that is used to link the subject item to the object of the claim.  The statement instance is then linked to zero to many reference instances by a prov:wasDerivedFrom (http://www.w3.org/ns/prov#wasDerivedFrom) predicate.  The reference instances can then be linked to a variety of source resources by reference properties.  These reference properties are not intrinsic to the Wikibase graph model and are created as needed by the community just as is the case with other properties in Wikidata.  

We can explore the references that support claims about Q42 by going to the Wikidata Query Service GUI and pasting in this query

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX pr: <http://www.wikidata.org/prop/reference/>
PREFIX prov: <http://www.w3.org/ns/prov#>
SELECT DISTINCT ?propertyUri ?referenceProperty ?source
WHERE {
  wd:Q42 ?propertyUri ?statement.
  ?statement prov:wasDerivedFrom ?reference.
  ?reference ?referenceProperty ?source.
}
ORDER BY ?referenceProperty

After clicking on the blue "run" button, the results table shows us three things:
  • in the first column we see the property used in the claim
  • in the second column we see the kind of reference property that was used to support the claim
  • in the third column we see the value of the reference property, which is the cited source
Since the table is sorted by the reference properties, we can click on them to see what they are.  One of the useful ones is P248, "stated in".  It links to an item that is an information document or database that supports a claim.  This is very reminiscent of dcterms:source, "a related resource from which the described resource is derived".  If I wanted to capture this information in my own triplestore, but use the more standard Dublin Core term, I could construct a graph that contained the statement instance, but then connected the statement directly to the source using dcterms:source.  Here's how I would write the CONSTRUCT query:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX pr: <http://www.wikidata.org/prop/reference/>
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX dcterms: <http://purl.org/dc/terms/>
CONSTRUCT {
  wd:Q42 ?propertyUri ?statement.
  ?statement dcterms:source ?source.
  }
WHERE {
  wd:Q42 ?propertyUri ?statement.
  ?statement prov:wasDerivedFrom ?reference.
  ?reference pr:P248 ?source.
}

You can test out the query at the Wikidata Query Service GUI.  You could simplify the situation even more if you made up your own predicate for "has a source about it", which we could call ex:source (http://example.org/source).  In that case, the constructed graph would be defined as

CONSTRUCT {
  wd:Q42 ex:source ?source.
  }

This construct query could be incorporated into a Python script (or a script in any other language that supports HTTP calls) using the requests or rdflib modules as described earlier in this post.

Conclusion

I have hopefully made the case that the ability to perform SPARQL SELECT and CONSTRUCT querys at the Wikidata Query Service eliminates the need for the Wikimedia Foundation to create an additional API to provide data from Wikidata.  Using SPARQL queries provides data retrieval capabilities that are only limited by your imagination.  It's true that using a SPARQL endpoint as an API requires some knowledge about constructing SPARQL queries, but I would assert that this is a skill that must be acquired by anyone who is really serious about using LOD.  I'm a relative novice at constructing SPARQL queries, but even so can think of a mind-boggling array of possibilities for using them to get data from Wikidata.

In my next post, I'm going to talk about the reverse situation: getting data into Wikidata using software.





No comments:

Post a Comment