Friday, February 7, 2020

VanderBot part 3: Writing data from a CSV file to Wikidata


In the previous post of this series, I described how my investigation of the Wikibase data model led me to settle on a relatively simple spreadsheet layout for tracking what items, statements, and references needed to be created or edited in Wikidata. Since column headers in a CSV spreadsheet don't really have any meaning other than to a human, it's necessary to map columns to features of the Wikibase model so that a script would know how to write the data in those columns to appropriate data items in Wikidata. 

Developing a schema to map spreadsheet columns to the Wikibase model

In a blog post from 2016, I wrote about a similar problem that I faced when creating an application that would translate tabular CSV data to RDF triples. In that case, I created a mapping CSV table that mapped table headers to particular RDF predicates, and that also indicated the kind of object represented in the table (language-tagged literal, IRI, etc.). That approach worked fine and had the advantage of simplicity, but it had the disadvantage that it was an entirely ad hoc solution that I made up for my own use.

When I learned about the "Generating RDF from Tabular Data on the Web" W3C Recommendation, I recognized that this was a more standardized way to accomplish a mapping from a CSV table to RDF. When I started working on the VanderBot project I realized that since the Wikibase model can be expressed as an RDF graph, I could construct a schema using this W3C standard to document how my CSV data should be mapped to Wikidata items, properties, references, labels, etc. The most relevant part of the standard is section 7.3, "Example with single table and using virtual columns to produce multiple subjects per row".

An example schema that maps the sample table from the last post is here. The schema is written in JSON, and if it were ingested by an application that can transform CSV files in accordance with the W3C specification, it should produce RDF triples identical to the triples about the subject items that are stored in the Wikidata Query Service triplestore (not all of those triples, but many of the ones that would be generated if the CSV data were loaded through the Wikidata API). I haven't actually tried this, since I haven't acquired such an application, but the point is that the JSON schema applied to the CSV data will generate part of the graph that will eventually be present in Wikidata when the data are loaded.

I will not go into every detail of the example schema, but I will show several examples of how parts of it map particular columns.

Column for the item identifier

Each column in the table has a corresponding JSON object in the schema. The first column, with the column header title "wikidataId", is mapped with:

{
  "titles": "wikidataId",
  "name": "wikidataId",
  "datatype": "string",
  "suppressOutput": true
}

This JSON simply associates a variable name (wikidataId) with the Wikidata Q ID for the item that's the subject of each row. (For simplicity, I've chosen to make the variable names the same as the column titles, but that isn't required.)  The "true" value for suppressOutput means that no statement is directly generated from this column.

Column for the label

The "labelEn" column is mapped with this JSON object:

{
  "titles": "labelEn",
  "name": "labelEn",
  "datatype": "string",
  "aboutUrl": "http://www.wikidata.org/entity/{wikidataId}",
  "propertyUrl": "rdfs:label",
  "lang": "en"
}

The value of aboutUrl indicates the subject of the triple generated by this column. The curly brackets indicate that the wikidataId variable should be substituted in that place to generate the URI for the subject.  The value of propertyUrl is rdfs:label, the RDF predicate that Wikibase uses for its label field. The object of the triple by default is the value present in that column for the row.  The lang value provides the language tag for the literal.


So when this mapping is applied to the labelEn column of the first row, the triple

<http://www.wikidata.org/entity/Q84268104> rdfs:label "Vanderbilt Department of Biomedical Engineering"@en.

would be generated.
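Here's a rough sketch in Python of how that substitution works (this is just an illustration, not VanderBot's actual code; the row dictionary is a hypothetical representation of one CSV row):

# Minimal sketch: substitute a row's wikidataId into the aboutUrl template.
row = {"wikidataId": "Q84268104",
       "labelEn": "Vanderbilt Department of Biomedical Engineering"}
subject_uri = "http://www.wikidata.org/entity/{wikidataId}".format(**row)
# The column value becomes the English-tagged literal object of the triple:
print('<' + subject_uri + '> rdfs:label "' + row["labelEn"] + '"@en.')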

Column for a property whose value is an item (P749)

Here is the JSON object that maps the "parentUnit" column.

{
  "titles": "parentUnit",
  "name": "parentUnit",
  "datatype": "string",
  "aboutUrl": "http://www.wikidata.org/entity/{wikidataId}",
  "propertyUrl": "http://www.wikidata.org/prop/direct/P749",
  "valueUrl": "http://www.wikidata.org/entity/{parentUnit}"
}

As before, the subject URI is established by substituting the wikidataId variable into the URI template for aboutUrl. Instead of directly mapping the column value as the object of the triple, the column value is inserted into a valueUrl URI template in the same manner as the aboutUrl.  


Applying this column mapping to the parentUnit column generates the triple:

<http://www.wikidata.org/entity/Q84268104> <http://www.wikidata.org/prop/direct/P749> <http://www.wikidata.org/entity/Q7914459>.

which can be abbreviated

wd:Q84268104 wdt:P749 wd:Q7914459.

The other columns in the CSV table are mapped similarly. If there is no valueUrl key:value pair, the value in the column is used as a literal object; if there is a valueUrl, the value in the column is used to generate a URI denoting a non-literal object.

The value of datatype is important since it determines the xsd:datatype of literal values in the generated triples.

Not every column generates a triple whose subject is the item that's the subject of the row; the subject may be drawn from any other column. This allows the data in a row to form a more complicated graph structure. For example, the triples describing a reference have the statement node, rather than the item, as their subject.

How the VanderBot script writes the CSV data to the Wikidata API

The script that does the actual writing to the Wikidata API is here.  The authentication process (line 338) is described in detail elsewhere.   

The actual script begins (line 374) by loading the schema JSON into a Python data structure and loading the CSV table into a list of dictionaries. 

The next section of the code (lines 402 to 554) uses the schema JSON to sort the columns of the table into categories (labels, aliases, descriptions, statements with entity values, and statements with literal values).
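Here's a simplified sketch of what those loading and sorting steps might look like. The file names are made up, the schema is assumed to be laid out as a table group as in the W3C recommendation, and the predicate tests reflect the RDF predicates that Wikibase uses for labels, aliases, and descriptions; this is not the script's actual code:

import csv
import json

# Load the mapping schema and the CSV table (file names are assumptions).
with open('csv-metadata.json', 'rt', encoding='utf-8') as schema_file:
    schema = json.load(schema_file)
with open('departments.csv', 'rt', newline='', encoding='utf-8') as csv_file:
    table = list(csv.DictReader(csv_file))

# Sort columns into categories based on each column's propertyUrl.
labels, aliases, descriptions, entity_valued, literal_valued = [], [], [], [], []
for column in schema['tables'][0]['tableSchema']['columns']:
    if column.get('suppressOutput'):
        continue                       # e.g. the wikidataId column
    prop = column.get('propertyUrl', '')
    if prop == 'rdfs:label':
        labels.append(column)          # labels use the rdfs:label predicate
    elif prop == 'skos:altLabel':
        aliases.append(column)         # aliases use skos:altLabel
    elif prop == 'schema:description':
        descriptions.append(column)    # descriptions use schema:description
    elif 'valueUrl' in column:
        entity_valued.append(column)   # statement whose value is another item
    else:
        literal_valued.append(column)  # statement whose value is a literal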

From lines 556 to 756, the script steps through each row of the table and generates the data that need to be passed to the API to upload new information. In each row, the script goes through each category of data (labels, aliases, etc.) and turns the value in a column into the specific JSON required by the API for uploading that kind of data. I call this "snak JSON" because the units in the JSON represent "snaks" (small, discrete statements) as defined by the Wikibase data model.

Originally, I had written the script in a simpler way, where each piece of information about the item was written in a separate API call. This seemed intuitive, since there are individual API methods for uploading each category (label, description, property, reference, etc.; see the API documentation). However, because of the rate limitations that I'll talk about later, the most reasonable way to write the data was to determine which categories needed to be written for an item and then generate the JSON for all of those categories at once.

I then used the "all in one" method wbeditentity to make all possible edits in a single API call. This resulted in much more complicated code that constructed deeply nested JSON that's difficult to read. The API help page didn't give any examples that were nearly this complicated, so getting this strategy to work required delving deeply into the Wikibase model. One lifesaver was that when a successful API call was made, the API's response included JSON structured according to the Wikibase model that was very similar to the JSON I needed to write to the API. Being able to look at this response JSON was really useful for figuring out what subtle mistakes I was making when constructing the JSON to send to the API.
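To give a sense of what that nested JSON looks like, here is a heavily abbreviated sketch of the kind of data structure that gets serialized and passed to wbeditentity. It's written as a Python dictionary rather than VanderBot's actual output, the description and reference URL are placeholders, and P854 (reference URL) is used only as an example reference property:

# Simplified sketch of the nested data sent to the wbeditentity API method.
# Only one claim with one reference is shown; some values are placeholders.
data = {
    'labels': {'en': {'language': 'en',
                      'value': 'Vanderbilt Department of Biomedical Engineering'}},
    'descriptions': {'en': {'language': 'en',
                            'value': 'a placeholder description'}},
    'claims': [{
        'type': 'statement',
        'rank': 'normal',
        'mainsnak': {                   # the snak for the parentUnit property (P749)
            'snaktype': 'value',
            'property': 'P749',
            'datavalue': {'type': 'wikibase-entityid',
                          'value': {'entity-type': 'item', 'numeric-id': 7914459}}
        },
        'references': [{                # one reference, made of its own snaks
            'snaks': {
                'P854': [{              # P854 = reference URL (example property)
                    'snaktype': 'value',
                    'property': 'P854',
                    'datavalue': {'type': 'string',
                                  'value': 'https://example.org/source'}
                }]
            }
        }]
    }]
}

This dictionary gets serialized with json.dumps and sent as the data parameter of the wbeditentity call, along with the item's Q ID (or a flag indicating a new item), a CSRF token, and format=json.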

Simply creating labels, descriptions, and claims would not have been too hard, but I was determined to also have the capability to support references and qualifiers for claims. Here's how I hacked that task: for each statement column, I went through the columns and looked for other columns that the schema indicated were references or qualifiers of that statement. Currently, the script only handles one reference and one qualifier per statement, but when I get around to it, I'll improve the script to remove that limitation. 

In line 759, the script checks whether it found any information about the item that wasn't already written to Wikidata. If there was at least one thing to write, the script attempts to post a parameter dictionary (including the complex, constructed snak JSON) to the API (lines 305 to 335). If the attempt was unsuccessful because the API was too busy, it retries several times. If the attempt was unsuccessful for other reasons, the script displays the server's response for debugging.

If the attempt was successful, the script extracts identifiers of newly-created data records (item Q IDs, statement UUIDs, and reference hashes - see the previous post for more on this) and adds them to the CSV table so that the script will know in the future that those data are already in Wikidata.  The script rewrites the CSV table after every line so that if the script crashes or the API throws an error during a write attempt, one can simply re-start the script after fixing the problem and the script will know not to create duplicate data on the second go-around (since the identifiers for the already-written data have already been added to the CSV).  
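Continuing the sketches above (the api_response, row, and table variables are assumed from earlier steps, and the _uuid and _ref_hash column names are hypothetical), that bookkeeping might look roughly like this:

import csv

# Sketch: pull new identifiers out of the wbeditentity response JSON and
# rewrite the whole CSV so a crash can't cause duplicate writes later.
response_data = api_response.json()      # api_response: the POST response from the API
entity = response_data['entity']
row['wikidataId'] = entity['id']         # Q ID (important for newly created items)

for claim in entity['claims'].get('P749', []):
    row['parentUnit_uuid'] = claim['id']                 # statement identifier (UUID)
    for reference in claim.get('references', []):
        row['parentUnit_ref_hash'] = reference['hash']   # reference hash

with open('departments.csv', 'wt', newline='', encoding='utf-8') as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=table[0].keys())
    writer.writeheader()
    writer.writerows(table)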

I mentioned near the end of my previous post that I don't have any way to record whether labels, descriptions, and qualifiers have already been written, since URI identifiers aren't generated for them. The lack of URI identifiers means that one can't refer to those particular assertions directly by URIs in a SPARQL query. Instead, one must query explicitly for the value of the label, description, or qualifier and then determine whether it's the same as the value in the CSV table. The way the script currently works, prior to creating JSON to send to the API, the script sends a SPARQL query asking for the values of the labels and descriptions of all of the entities in the table (lines 465 and 515). Then as the script processes each line of the table, it checks whether the value in the CSV is the same as what's already in Wikidata (in which case it does nothing) or different. If the value is different, it writes the new value from the CSV, overwriting the value in Wikidata.
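A minimal version of such a query might look like this sketch (not the script's actual query; the Q IDs are just examples):

import requests

# Sketch: ask the Query Service for current English labels and descriptions.
qids = ['Q84268104', 'Q7914459']   # Q IDs collected from the wikidataId column
query = '''SELECT ?item ?label ?description WHERE {
  VALUES ?item { ''' + ' '.join('wd:' + qid for qid in qids) + ''' }
  OPTIONAL { ?item rdfs:label ?label . FILTER(lang(?label) = "en") }
  OPTIONAL { ?item schema:description ?description . FILTER(lang(?description) = "en") }
}'''
response = requests.get('https://query.wikidata.org/sparql',
                        params={'query': query, 'format': 'json'},
                        headers={'User-Agent': 'VanderBot example sketch'})
for binding in response.json()['results']['bindings']:
    qid = binding['item']['value'].rpartition('/')[2]
    label = binding.get('label', {}).get('value')
    description = binding.get('description', {}).get('value')
    print(qid, label, description)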

It is important to understand this behavior, because if the CSV table is "stale" and has not been updated for a long time, other users may have improved the labels or descriptions. Running the script with the stale values will effectively revert their improvements. So it's important to update the CSV file with current values before running the script that writes to the API. After updating, you can manually change any labels or descriptions that are unsatisfactory.

In the future, I plan to write additional scripts for managing labels and aliases, so this crude management system will hopefully be improved.

Cleaning up missing references

In some cases, other Wikidata contributors have already made statements about pre-existing Vanderbilt employee items. For example, someone may have already asserted that the Vanderbilt employee's employer was Vanderbilt University. In such cases, the primary API writing script will do nothing with those statements because it is not possible to write a reference as part of the wbeditentity API method without also writing its parent statement. So I had to create a separate script that is a hack of the primary script in order to write the missing references. I won't describe that script here because its operation is very similar to the main script. The main difference is that it uses the wbsetreference API method that is able to directly write a reference given a statement identifier. After running the main script, I run the cleanup script until all of the missing references have been added.
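For a sense of what that looks like, here is a hedged sketch of a wbsetreference call (the session and CSRF token are assumed to come from the authentication step described elsewhere, and the statement GUID, property, and URL shown are placeholders):

import json

# Sketch: add a missing reference to an existing statement with wbsetreference.
# 'session' is an authenticated requests.Session and 'csrf_token' comes from the
# login/token steps; the statement GUID, property, and URL below are placeholders.
statement_guid = 'Q84268104$00000000-0000-0000-0000-000000000000'
reference_snaks = {
    'P854': [{                 # P854 = reference URL (example property)
        'snaktype': 'value',
        'property': 'P854',
        'datavalue': {'type': 'string', 'value': 'https://example.org/source'}
    }]
}
parameters = {
    'action': 'wbsetreference',
    'statement': statement_guid,
    'snaks': json.dumps(reference_snaks),
    'token': csrf_token,
    'format': 'json'
}
api_response = session.post('https://www.wikidata.org/w/api.php', data=parameters)
print(api_response.json())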

Timing issues

Maxlag

One of the things that I mentioned in my original post on writing data to Wikidata was that when writing to the "real" Wikidata API (vs. the test API or your own Wikibase instance) it's important to respect the maxlag parameter.

You can set the value of the maxlag parameter in line 381. The recommended value is 5 seconds.  A higher maxlag value is more aggressive and a lower maxlag value is "nicer" but means that you are willing to be told more often by the API to wait. The value of maxlag you have chosen is added to the parameters sent to the API in line 764 just before the POST operation.  

The API lag is the average amount of time between when a user requests an operation and when the API is able to honor that request. At times of low usage (e.g. nighttime in the US and Europe), the lag may be small, but at times of high usage the lag can be over 8 seconds (I've seen it go as high as 12 seconds). If you set maxlag to 5 seconds, you are basically telling the server that if the lag gets longer than 5 seconds, it should ignore your request and you'll try again later. The server tells you to wait by responding to your POST request with a response that contains a maxlag error code and the amount of time the server is lagged. This error is handled in line 315 of the script. When a lag error is detected, the recommended practice is to wait at least 5 seconds before retrying.
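Here's a simplified sketch of that pattern (not the script's exact code):

import time

# Sketch of a maxlag-aware POST. 'session' is an authenticated requests.Session
# and 'parameters' is the dictionary of API parameters being sent.
def post_with_maxlag(session, parameters, maxlag=5, max_retries=10):
    parameters['maxlag'] = maxlag
    for attempt in range(max_retries):
        response = session.post('https://www.wikidata.org/w/api.php', data=parameters)
        data = response.json()
        error = data.get('error', {})
        if error.get('code') == 'maxlag':
            # The error reports the server's current lag; wait at least 5 seconds.
            wait = max(5.0, float(error.get('lag', 5)))
            print('Server lagged; waiting', wait, 'seconds before retrying')
            time.sleep(wait)
        else:
            return data
    raise RuntimeError('Gave up after repeated maxlag errors')

# e.g. response_data = post_with_maxlag(session, parameters)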

Bot flags

I naïvely believed that if I respected maxlag errors, I'd be able to write to the API as fast as conditions allowed. However, the very first time I used the VanderBot script to write more than 25 records in a row, I was blocked by the API as a potential spammer with the message "As an anti-abuse measure, you are limited from performing this action too many times in a short space of time, and you have exceeded this limit. Please try again in a few minutes." Clearly my assumption was wrong. Through trial and error, I determined that a rate of one second per write was too fast and would result in being temporarily blocked, but a rate of two seconds per write was acceptable. So to handle cases when maxlag was not invoked, I put a delay of 2 seconds on the script (line 822).

I had several hypotheses about the cause of the blocking. One possible reason was that I didn't have a bot flag. (More on that later.) Another was that I was running the script from my local computer rather than from PAWS. PAWS is a web-based interactive programming and publishing environment based on Jupyter notebooks. At Wikicon North America, I had an interesting and helpful conversation with Dominic Byrd-McDevitt of the National Archives, who showed me how he used PAWS to publish NARA metadata to Wikidata via a PAWS-based system using Pywikibot. I don't think he had a bot flag, and I think his publication rate was faster than one write per second. But I really didn't want to take the time to test this hypothesis by converting my script to PAWS (which would require more experimentation with authentication). So I decided to make a post to Wikitech-l to see if I could get an answer.

I quickly got a helpful answer that confirmed that neither using PAWS nor Pywikibot should have any effect on the rate limit. If I had a bot flag, I might gain the "noratelimit" right, which might bypass rate limiting in many cases. 

Bot flags are discussed here. In order to get a bot flag, one must detail the task that the bot will perform, then demonstrate with a test run of between 50 and 250 edits that the bot is working correctly. When I was at Wikicon NA, I asked some of the Powers That Be whether it was important to get a bot flag if I was not running an autonomous bot. They said that it wasn't so important if I was monitoring the writing process. It would also be difficult to "detail the task" that VanderBot will perform, since it's just a general-purpose API writing script, and what it writes will depend on the CSV file and the JSON mapping schema.

In the end, I decided to just forget about getting a bot flag for now and keep the rate at 2 seconds per write. I usually don't write more than 50-100 edits in a session, and often the server will be lagged anyway, requiring me to wait much longer than 2 seconds. If VanderBot's task becomes more well-defined and autonomous, I might request a bot flag at some point in the future.

Query Service Updater lag

One of the principles upon which VanderBot is built is that data are written to Wikidata by POSTing to the API, but that the status of data in Wikidata is determined by SPARQL queries of the Query Service. That is a sound idea, but it has one serious limitation. Data that are added through either the API or the human GUI do not immediately appear in the graph database that supports the Query Service. There is a delay, known as the Updater lag, between the time of upload and the time of availability at the Query Service. We can gain a better understanding by looking at the Query Service dashboard.

Here's a view of the lag time on the day I wrote this post (2020-02-03):


The first thing to notice is that there isn't just one query service. There are actually seven servers running replicas of the Query Service that handle the queries. They are all being updated constantly with data from the relational database connected to the API, but since the updating process has to compete with queries that are being run, some servers cannot keep up with the updates and lag by as much as 10 hours. Other servers have lag times of less than one minute. So depending on the luck of the draw of which server handles your query, data that you wrote to the API may be visible via SPARQL in a few seconds or in half a day.


A practical implication of this is that if VanderBot updates its CSV record using SPARQL, the data could be as much as half a day out of date. Normally that isn't a problem, since the data I'm working with don't change much, and once I write new data, I usually don't mess with them for days. However, since the script depends on a SPARQL query to determine whether the labels and descriptions in the CSV differ from what's already in Wikidata, there can be problems if the script crashes halfway through the rows of the CSV. If I fix the problem and immediately re-run the script, a lagged Query Service will respond to the query as if the labels and descriptions that I successfully wrote a few moments earlier were still in their previous state. That will cause VanderBot to attempt to re-write those labels and descriptions. Fortunately, if the API detects that a write operation is trying to set the value of a label or description to the value it already has, it will do nothing. So generally, no harm is done.

This lag is why I use the response JSON sent from the API after a write to update the CSV rather than depending on a separate SPARQL query to make the update. Because the data in the response JSON comes directly from the API and not the Query Service, it is not subject to any lag.

Summary


The API writing script part of VanderBot does the following:
  1. Reads the JSON mapping schema to determine the meaning of the CSV table columns.
  2. Reads in the data from the CSV table.
  3. Sorts out the columns by type of data (label, alias, description, property).
  4. Constructs snak JSON for any new data items that need to be written.
  5. Checks new statements for references and qualifiers by looking at columns associated with the statement properties, then creates snak JSON for references or qualifiers as needed.
  6. Inserts the constructed JSON object into the required parameter dictionary for the wbeditentity API method.
  7. POSTs to the Wikidata API via HTTP.
  8. Parses the response JSON from the API to discover the identifiers of newly created data items.
  9. Inserts the new identifiers into the table and writes the CSV file.
In the final post of this series, I'll describe how the data harvesting script part of VanderBot works.
