Saturday, June 11, 2022

Making SPARQL queries to Wikidata using Python


"Welding sparkles" by Dhivya dhivi DJ, CC BY-SA 4.0, via Wikimedia Commons

Background

This is actually a sort of followup to my most popular blog post: "Getting Data Out of Wikidata using Software", which has had about 6.5K views since 2019. That post focused on the variety of query forms you could use and talked a lot about using JavaScript to build web pages that acquire data from Wikidata dynamically. However, I did provide a link to some Python code, which included the line

r = requests.get(endpointUrl, params={'query': query}, headers={'Accept': 'application/sparql-results+json'})

for making the actual query to the Wikidata Query Service via HTTP GET. 

Since that time, I've used some variation on that code in dozens of Python scripts that I've written to grab data from Wikidata. In the process, I experienced some frustration when things did not behave as I had expected and when I got unexpected errors from the API. My goal for this post is to describe some of those problems and how I solved them. I'll also provide a link to the "Sparqler" Python class that I wrote to make querying simpler and more reliable, along with some examples of how to use it to do several types of queries.

 Note: SPARQL keywords are case insensitive. Although you often see them written in ALL CAPS in examples, I'm generally too lazy to do that and tend to use lower case, as you'll see in most of the examples below.

The Sparqler class

For those of you who don't care about the technical details, I'll cut right to the chase and tell you how to make queries to Wikidata using the code. You can access the code in GitHub here.  I should note that the code is general-purpose and can be used with any SPARQL 1.1 compliant endpoint, not just the Wikidata Query Service (WDQS). This includes Wikibase instances and installations of Blazegraph, Fuseki, Neptune, etc. The code also supports SPARQL Update for loading data into a triplestore, but that's the topic of another post.

To use the code, you need to import three modules: datetime, time, and requests. The requests module isn't included in the standard Python distribution, so you may need to install it with pip if you haven't already. If you are using Jupyter notebooks through Anaconda or Colab notebooks, requests will probably already be installed. Copy the code from "class Sparqler:" through just before the "Body of script" comment near the bottom of the file, and paste it near the top of your script.
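
If you're starting a script from scratch, the top of the file would look something like this:

import datetime
import time
import requests  # not in the standard library; install with pip if necessary

# paste the Sparqler class definition here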

To test the code, you can run the entire script, which includes code at the end with an example of how to use the script. If you only run it once or twice, you can use the code as-is. However, if you make more than a few queries, you'll need to change the user_agent string from the example I gave to your own. You can read about that in the next section. 


The body of the script has four main parts. Lines 238 through 256 create a value for the query_string text that gets sent to the WDQS endpoint. Lines 259 and 260 instantiate a Sparqler object called wdqs. Line 261 sends the query string that you created to the endpoint and returns the SELECT query results as a list of dictionaries called data. The remaining lines check for errors and display the results as pretty JSON (the reason for importing the json module at the top of the script). If you want to see the query_string as constructed, or the raw response text from the endpoint, you can uncomment lines 257 and 266.

Here's what the response looks like:

[
  {
    "item": {
      "type": "uri",
      "value": "http://www.wikidata.org/entity/Q102949359"
    },
    "label": {
      "xml:lang": "en",
      "type": "literal",
      "value": "\"I Hate You For Hitting My Mother,\" Minneapolis"
    }
  },
  {
    "item": {
      "type": "uri",
      "value": "http://www.wikidata.org/entity/Q102961315"
    },
    "label": {
      "xml:lang": "en",
      "type": "literal",
      "value": "A Picture from an Outline of Women's Manners - The Wedding Ceremony"
    }
  },
  {
    "item": {
      "type": "uri",
      "value": "http://www.wikidata.org/entity/Q1399"
    },
    "label": {
      "xml:lang": "en",
      "type": "literal",
      "value": "Niccol\u00f2 Machiavelli"
    }
  }
]

It's in the standard SPARQL 1.1 JSON results format, so if you write code to extract the results from the data list of dictionaries, you can use it with the results of any query.
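
For example, here's how you could loop through the data list shown above and pull out each Q ID and label:

for result in data:
    # the Q ID is the last part of the item IRI
    qid = result['item']['value'].split('/')[-1]
    label = result['label']['value']
    print(qid, label)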

Features of the code

For those of you who are interested in knowing more about the code and the rationale behind it, read the following sections. If you just want to try it out, skip to the "Options for querying" section.

The User-Agent string

Often applications that request data from an API are asked to identify themselves, both as an indication that they aren't bad actors and to allow the API maintainers to contact the developers if the application is doing something the API maintainers don't like. In the case of the Wikimedia Foundation, they have adopted a User-Agent policy that requires an HTTP User-Agent header to be sent with all requests to their servers. This policy is not universally enforced, and I'm not sure whether it's enforced at all for the WDQS, but if you are writing a script that makes repeated queries in rapid succession, you should definitely supply a User-Agent header that identifies your application (and you) in case it is suspected of being a denial of service attack.

The details of what they would like developers to include in the string are given on the policy page, but the TLDR is that you should have a name for the "application" (your script) and either your email address or the URL of a page that describes your project. The value given in lines 259 and 260 of the body of the script for the user_agent variable can be used as a template. When instantiating the Sparqler object, the string MUST be passed in as the value of the useragent argument if the endpoint URL given as the value of the endpoint argument is https://query.wikidata.org/sparql (the default if no endpoint argument is given). If you don't provide one, the script will exit.
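
Putting that together, instantiating a Sparqler object for the WDQS looks something like this (the application name, email address, and project URL below are placeholders; substitute your own):

user_agent = 'MyWikidataScript/0.1 (mailto:someone@example.com; https://example.com/myproject)'
wdqs = Sparqler(useragent=user_agent)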

The sleep argument

When you create a Sparqler object, you can choose to supply a value (in seconds) for the sleep argument. If none is supplied, it defaults to 0.1 s. Each time a query is made, the script pauses execution for the length of time specified. The rationale for the default of 0.1 s for the WDQS is similar to the previous section -- you don't want the WDQS operators to think you are a bad actor if you are hitting the endpoint repeatedly without delay. If you are reading from a localhost endpoint, you can set the value of sleep to zero. 
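
For example, here's how you might instantiate a Sparqler object for a local Blazegraph instance with no delay between queries (the localhost URL is hypothetical; use whatever your local endpoint actually is):

local_store = Sparqler(endpoint='http://localhost:8889/bigdata/sparql', sleep=0)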

While I'm on the topic of being a courteous WDQS user, I would like to point out that repetitive querying can often be avoided by using a "smarter" query. In the example code, I wanted to discover the Q IDs corresponding to three labels. I could have inserted each label value into the query as a literal in the position of ?value, e.g.

?item rdfs:label|skos:altLabel "尼可罗·马基亚维利"@zh.

then put the .query() method inside a loop that runs three times. Instead, in the script I used a loop to create a VALUES clause to enumerate the possible values of ?value. I still get the same information, but the VALUES method requires only one interaction with the Query Service instead of three. For a small number of values like this, it's not that important, but I've sent queries with hundreds or thousands of values, and there the difference is significant.
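
Here's a sketch of how a VALUES clause like the one in the script can be built with a loop (the label list is just illustrative):

# pairs of label and language tag to look up
labels = [('尼可罗·马基亚维利', 'zh'), ('Niccolò Machiavelli', 'en')]

values_clause = 'values ?value {'
for label, language in labels:
    values_clause += " '''" + label + "'''@" + language
values_clause += ' }'

query_string = '''select distinct ?item ?value where {
''' + values_clause + '''
?item rdfs:label|skos:altLabel ?value.
}'''
data = wdqs.query(query_string)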

GET vs. POST

This brings me to another important thing that I learned the hard way about interacting with SPARQL endpoints programmatically. If you drill down into the SPARQL 1.1 Protocol specification (which I doubt anyone but me typically does!), you'll see that there are three options for sending queries via HTTP: one using GET and two using POST. When I first started running queries from scripts, I tended to use the GET method because it seemed simpler -- after URL-encoding, the query just gets attached to the end of the URL as the value of a query parameter. However, what I discovered once I started making really long queries (like the one I previously described with thousands of VALUES) was that you can fairly easily exceed the URL length limit allowed by the server (something in the neighborhood of 5K to 15K characters). Once I discovered that, I switched to using POST, since the query is passed as the message body and therefore has no particular length limit.

So why would you ever need to use GET? In some cases, a SPARQL endpoint will only support GET requests because the endpoint is read-only. In cases where a SPARQL service supports both Query and Update, a quick-and-dirty way to prevent writing to the triplestore with Update (which must be done using POST) is to disallow any unauthenticated POST requests. Another case is services like AWS Neptune that have separate read-only endpoints whose access is controlled separately from the endpoint that supports writing. A read-only endpoint would only support GET requests.

For these reasons, you can specify that the Sparqler object use GET by providing a value of "get" for the method argument. Otherwise it defaults to POST.
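
For example, to query a read-only endpoint (the URL below is hypothetical) using GET:

readonly = Sparqler(endpoint='https://example.org/sparql', method='get')
data = readonly.query(query_string)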

UTF-8 support

If the literals that you are using contain only Latin characters, it doesn't really matter much how you do the querying. However, a lot of projects I work on either involve languages with non-Latin character sets or include characters with diacritics that aren't in the ASCII character set. Despite my best efforts to enforce UTF-8 encoding everywhere, I still had queries that failed to match labels in Wikidata that I knew should match. After wasting a bunch of time troubleshooting, I finally figured out the fix.

As I mentioned earlier, the SPARQL 1.1 Protocol Recommendation provides two ways to send queries using POST. The simpler one is to send the query as unencoded text in the message body. That's awesome for testing, because you can just paste a query into the message text box of Postman, and if you use the right Content-Type header, you can send the query with the click of a button. I assumed that as long as the text was all UTF-8, I would be fine. However, using this option was actually the cause of the problems I was having with match failures. When I switched to the other POST method (which URL-encodes the query string), my matching problems disappeared. For that reason, my script only uses the "query via URL-encoded POST" option.
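
In terms of the requests module, the difference between the two POST options boils down to something like this (a sketch, not the actual code from the class):

endpoint_url = 'https://query.wikidata.org/sparql'

# direct POST: the query itself is the message body (this caused my matching problems)
r = requests.post(endpoint_url, data=query_string.encode('utf-8'),
                  headers={'Content-Type': 'application/sparql-query'})

# URL-encoded POST: the query is sent as a form-encoded "query" parameter
# (this is the option the Sparqler class uses)
r = requests.post(endpoint_url, data={'query': query_string},
                  headers={'Accept': 'application/sparql-results+json'})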

", ', and """ quoting for literals

I learned early on that in SPARQL you can use either double or single quotes for literals in queries. That's nice, because if you have a string containing a single quote, like "don't", you can enclose it in double quotes, and if you have a string containing double quotes, like 'say "hi" for me', you can enclose it in single quotes. But what if you have 'Mother said "don't forget to brush your teeth" to me.', which contains both double and single quotes? Also, when inserting strings into the query string using variables, you can't know in advance what kind or kinds of quotes a string might contain.

This problem frustrated me for quite some time. I experimented with checking strings for both kinds of quotes, replacing double quotes with single ones, and escaping quotes in various ways, but none of these approaches worked, and my scripts kept crashing because of quote mismatches.

Finally, I resorted to (you guessed it) reading the SPARQL 1.1 Query specification, and there was the obvious (to Python users) answer in section 4.1.2: enclose the literals in sets of three quotes. I don't know why I didn't think of trying that. Note that in line 246 of the script, triple single-quotes are used to enclose the literals. Thus the script can handle both of the example English strings: the one with double quotes around the first part of the label and the label that includes "Women's" with an apostrophe.
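
For example, here's how the problem string from above can be inserted into a query without any escaping (the backslash is only Python escaping for the apostrophe, not SPARQL escaping):

# this string contains both double quotes and an apostrophe
label_string = 'Mother said "don\'t forget to brush your teeth" to me.'

# enclosing the literal in triple single-quotes makes both kinds of quotes harmless
query_string = 'ask where { ?item rdfs:label ' + "'''" + label_string + "'''" + '@en. }'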

After solving the quote and UTF-8 problems, my scripts now reliably handle literals that contain any UTF-8 characters.

Options for querying

The query in the code example uses the SELECT query form. This is probably the most common type of SPARQL query, but others are possible and Sparqler objects support any query form option. Depending on the chosen form of the query, there are also several possible response formats. Since we are talking about Python here, the most convenient response format is JSON, since it can easily be converted into a complex Python data structure. But in some situations, another format may be more convenient.

Query form

The query form is specified using the form keyword argument of the .query() method. It may seem a bit strange to specify the query form as an argument of the method when the query form is determined by the text of the query itself, but doing so allows the script to control the default format of the response and whether the raw response is processed prior to being returned from the method. For SELECT and ASK, the default response serialization is set to JSON. For the DESCRIBE and CONSTRUCT query forms that return graphs, the default serialization is Turtle. 

SELECT

The default query form is SELECT, so it isn't necessary to provide a form argument to use it. That's convenient, since it's probably the most commonly used form. The raw JSON response from the endpoint (which you can view as the value of the .response attribute of the Sparqler object, e.g. wdqs.response) is structured in a more complicated way than is required to get just the results of the query. The results list is actually the value of a bindings key within an object that's the value of a results key, like this:

{
  "head" : {
    "vars" : [ "item", "label" ]
  },
  "results" : {
    "bindings" : [ {
      "item" : {
        "type" : "uri",
        "value" : "http://www.wikidata.org/entity/Q102949359"
      },
...
      }
    } ]
  }
}

For convenience, when handling SELECT queries with the default JSON serialization, the script converts the raw JSON to a complex Python data object, then extracts the results list that's nested as the value of the bindings key and returns it as the value of the .query() method. That produces the result shown in the example earlier in the post.

Here's an example that prints Douglas Adams' (Q42) name in all available languages:

query_string = '''select distinct ?label ?language where {
wd:Q42 rdfs:label ?label.
bind ( lang(?label) AS ?language )
}
order by ?language'''
names = wdqs.query(query_string)
for name in names:
    print(name['language']['value'], name['label']['value'])

The loop iterates through all of the items in the results list and pulls the value for each variable.  This structure: item['variableName']['value'] is consistent for all SELECT queries where variableName is the string you used for that variable in the query (e.g. ?variableName).

ASK

When the ASK query form is chosen, the result is true or false, so the raw response is processed to return a Python boolean as the response value. That allows you to directly control program flow based on whether a particular graph pattern has any solutions, like this:

label_string = '尼可罗·马基亚维利'
language = 'zh'

query_string = '''ask where {
  ?entity rdfs:label """'''+ label_string + '"""@' + language + '''.
      }'''

if wdqs.query(query_string, form='ask'):
    print(label_string, 'is in Wikidata')
else:
    print('Could not find', label_string, 'in Wikidata')

I use this kind of query to check whether label/description combinations that I plan to use for new Wikidata items have already been used. If you try to create a new item that has the same label and description as an existing item, the Wikidata API will return an error message and refuse to create the item. So it's better to query ahead of time so that you can change either the label or description to make it unique. Here's some code that will perform that check for you:

label_string = 'Italian Lake Scene'
description_string = 'painting by Artist Unknown'

query_string = '''ask where {
  ?item rdfs:label """'''+ label_string + '''"""@en.
  ?item schema:description """'''+ description_string + '''"""@en.
      }'''

if wdqs.query(query_string, form='ask'):
    print('There is already an item in Wikidata with')
    print('label:', label_string)
    print('description:', description_string)
    print('The label or description must be changed before uploading.')

DESCRIBE

The DESCRIBE query form is probably the least commonly used SPARQL query form. Its behavior is somewhat dependent on the implementation. Blazegraph, which is the application that underlies the WDQS, returns all of the triples that include the resource that is the solution to the query. The simplest kind of DESCRIBE query just specifies the IRI of the resource to be described. Here's an example that will return all of the triples that provide some kind of information about Douglas Adams (Q42):

query_string = 'describe wd:Q42'
description = wdqs.query(query_string, form='describe')

description is a string containing the triples in Turtle serialization. That string could be saved as a file and loaded into an application that knows how to parse Turtle.
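
For example, writing the Turtle to a file takes only a couple of lines (the file name is arbitrary):

with open('q42_description.ttl', 'wt', encoding='utf-8') as file_object:
    file_object.write(description)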

CONSTRUCT

 CONSTRUCT queries are similar to DESCRIBE in that they produce triples. The triples are those that conform to a graph pattern that you specify. For example, this query will produce all of the triples (serialized as Turtle) that are direct claims about Douglas Adams.

query_string = '''construct {wd:Q42 ?p ?o.} where {
wd:Q42 ?p ?o.
?prop wikibase:directClaim ?p.
}'''
triples = wdqs.query(query_string, form='construct')
print(triples)

This might be useful to you if you want to load just those triples into a triplestore.

Response formats

Because of the ease with which JSON can be converted directly to an analogously structured complex Python data object, Sparqler objects default to JSON as the response format for SELECT queries. For the two query forms that return triples (DESCRIBE and CONSTRUCT), the default is Turtle. ASK defaults to JSON, from which a Python boolean is extracted. However, these response formats can be overridden using the mediatype keyword argument in the .query() method if desired.

Values of the mediatype argument for some other possible response formats for SELECT are:

application/sparql-results+xml for XML

text/csv for CSV tabular data

For non-JSON response serializations, the return value of the .query() method is the raw text from the endpoint. That may be useful if you want to save the XML for use with some XML processing language like XQuery. It also makes it super simple to save the output as a CSV file with a few lines of code, like this:

data = wdqs.query(query_string, mediatype='text/csv')
with open('graph_dump.csv', 'wt', encoding='utf-8') as file_object:
    file_object.write(data)

Triple output from DESCRIBE and CONSTRUCT can be serialized in other formats using these values of  the mediatype argument:

application/rdf+xml for XML

application/n-triples for N-Triples
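
For example, to rerun the CONSTRUCT query above and get N-Triples instead of Turtle:

triples = wdqs.query(query_string, form='construct', mediatype='application/n-triples')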

Monitoring the status of the query

The verbose keyword argument can be used to control whether you get printed feedback to monitor the status of the query. A False value (the default) suppresses printing. Supplying a True value prints a notification that the query has been requested and another when a response has been received, including the time to complete the query. This may be helpful during debugging or if the queries take a long time to execute. For small, routine queries, you'll probably want to leave it off. Note: the second notification takes place after the sleep delay, so the reported response time includes that delay.
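
For example:

data = wdqs.query(query_string, verbose=True)
# prints a notification when the query is sent and another when the response arrives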

 FROM and FROM NAMED

The SPARQL 1.1 Protocol specification provides a mechanism for specifying the graphs to be included in the default graph using a request parameter rather than the FROM and FROM NAMED keywords in the text of the query itself. Sparqler supports this mechanism through the default and named arguments. Given that this is an advanced feature and that the WDQS triplestore does not have named graphs, I won't say more about it here. However, I'm planning to talk about this feature in a future post about the Vanderbilt Libraries' new Neptune triplestore. For more details, see the doc strings in the code.

Detecting errors

Detecting errors depends on how errors are reported by the SPARQL query service. In the case of Blazegraph (the service on which the WDQS is based), errors are reported as unformatted text in the response body. This is not the case with every SPARQL service -- they may report errors by some different mechanism, such as a log that must be checked. 

Because the main use cases of the Sparqler class are SELECT and ASK queries to the WDQS, errors can be detected by checking whether the results are JSON or not (assuming the default JSON response format is used). The code tries to convert the response text from JSON to a Python object, and if that fails, the .query() method returns None. You can then detect a failed query by checking whether the returned value is None, and if it is, you can try to parse out the error message string (provided as the value of the .response attribute of the Sparqler object, e.g. wdqs.response), or just print it for the user to see. Here is an example:

query_string = '''select distinct ?p ? where {
wd:Q42 ?p ?o.
}
limit 3'''
data = wdqs.query(query_string)
if data is None:
    print(wdqs.response)
else:
    print(data)

The example intentionally omits the name of the second variable (?o) to cause the query to be malformed. If you run this query, None will be returned as the value of data, and the error message will be printed. If you add the missing "o" after the question mark and re-run the query, you should get the query results. 

Note that this mechanism detects actual errors, not negative query results. For example, a SELECT query with no matches will return an empty list ([]), which is a negative result, not an error. The same is true for ASK queries that evaluate to False when there are no matches. That's why the code is written "if data is None:" rather than "if data:", which would evaluate as True if there were matches (a non-empty list or a True value) but as False for either an error (a value of None) or no matches (an empty list or a False value). A "no matches" result should be handled differently in your code than an error.

For other query forms (DESCRIBE and CONSTRUCT) and response formats other than JSON, the .query() method simply returns the response text. So I leave it to you to figure out how to differentiate between errors and valid responses (maybe search for "ExecutionException" in the response string?).
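
For example, a crude, Blazegraph-specific check along those lines might look like this (just a sketch; test it against your own error responses):

triples = wdqs.query(query_string, form='construct')
if triples is not None and 'ExecutionException' in triples:
    print('The query failed:')
    print(triples)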

SPARQL Update support

The Sparqler class supports changing graphs in the triplestore using SPARQL Update if the SPARQL service supports that. This is done using the .update() method and two more specific types of Update operations: .load() and .drop(). However, since changes to the data available on the WDQS triplestore must be made through the Wikidata API and not through SPARQL Update, I won't discuss these features in this post. I'm planning to describe them in more detail in an upcoming post about our Neptune triplestore. Until then, you can look at the doc strings in the code for details.


