Thursday, February 6, 2020

VanderBot: A Python Script for Writing to Wikidata (part 1)


Note added 2021-03-13: Although this post is still relevant for understanding the conceptual ideas behind my project to write Vanderbilt researcher/scholar records to Wikidata, I have written another series of blog posts showing (with lots of screenshots and handholding) how you can safely write your own data to the Wikidata API using data that is stored in simple CSV spreadsheets. See this post for details.

If you follow my blog, you will notice that I haven't written much in the last six months. That is at least partly because I've spent a lot of time working out the practical details of creating a "bot" that I can use to upload data about Vanderbilt researchers and scholars into Wikidata. In an earlier post from June last year, I described in general terms some background about writing to Wikibase, the platform on which Wikidata is built. (You probably should review that post for background before starting in on this one.) However, there were a lot of practical details that needed to be worked out to write to the "real" Wikidata.  Those details are what I'll talk about in this post.

One question I'll dispense with at the start of the post is "Why didn't you just use Pywikibot?" There are two reasons. One is that when I experimented with using Pywikibot and our Wikibase instance, I encountered an approximately 10-second delay between write operations. I'm sure that there is some way to defeat that delay, but I was not able to figure it out by looking through the Pywikibot code and documentation. This brings me to the second reason. I really don't like to use other people's code that I don't understand. When I looked through the Pywikibot code, there were layers of objects and functions calling other objects and functions in different files. After a short period of sorting through the code, I realized that there was no way that I was going to understand what was going on with Pywikibot at my current level of skill with Python.

After that experience, I decided to build my bot from the ground up. Obviously that took more time, but in the end I actually understood everything that I was doing and also had a much better idea of how the Wikibase API works.  The code that I've written is relatively linear and is liberally annotated with comments. So I hope that people with a moderate level of experience with Python can understand what I did and be able to hack the code to meet their own needs.

Where I last left off

In the previous post about writing to Wikidata, I described a simple script that took data from a CSV file and wrote it to a Wikibase instance (the test Wikidata instance, an independent Wikibase installation, or the real Wikidata).

That script was very limited. It was only able to write statements and could not associate references with those statements nor add qualifiers to the statements. It only created new items and had no way to know if the described entities already existed in the Wikibase instance.  It also had no way to track data about items once they had been written. Finally, it simply wrote the data as fast as it could and did not consider whether it should slow its rate due to high load on the Wikibase API.

Where I wanted to be

A major deficiency of the previous script was that its communication with the Wikibase instance was only one-way. It wrote to the API, but made little use of the API's response, and it made no use of Wikibase's ability to answer SPARQL queries. The workflow that I wanted to facilitate was more complicated.

I wanted the script to first send a SPARQL query to the Query Service to determine which of the data (including references and qualifiers) that I wanted to write already existed in Wikidata. (From this point forward, I'm going to refer to the "real" Wikidata instance of Wikibase, so I will stop talking about Wikibase generically.) That information would then be used to determine, for each record, whether the script needed to: create a new item, add or change labels and descriptions, add statements to an existing item, add references and qualifiers to existing statements, or do nothing because all of the desired information was already there.
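The existence check can be sketched as a query to the Query Service endpoint. This is an illustrative example, not the actual VanderBot code: it looks items up by ORCID (property P496), and the function names are my own invention. (Note that the Query Service expects a User-Agent header, as discussed later in this post.)

```python
# Sketch of checking whether an item already exists in Wikidata before
# writing, here by matching an ORCID identifier (property P496).
# The endpoint is the real Wikidata Query Service; the query and the
# function names are hypothetical examples.
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

def build_orcid_query(orcid):
    """Return a SPARQL query for items whose ORCID (P496) matches."""
    return f'SELECT ?item WHERE {{ ?item wdt:P496 "{orcid}" . }}'

def run_query(query, user_agent):
    """Send the query to the Query Service and return the parsed JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        ENDPOINT + "?" + params,
        headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

A record whose identifier already matched an item would then be updated rather than created as a new item.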

Once it was determined what needed to be written, the script would compose the appropriate JSON (based on the form of "snaks" in the Wikibase model) for an item and send it to the API. Using the response from the API, the script would update the records to indicate that the data were now present in Wikidata. Based on feedback from the API, the script would also limit its request rate to avoid hitting the API too fast at times of high usage.
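The shape of that statement JSON, and the rate-limiting idea, can be sketched as follows. These are illustrative helpers, not VanderBot's actual code: P2093 ("author name string") is just an example property, and the backoff function assumes the MediaWiki convention that a write made with the maxlag parameter returns an error with code "maxlag" and a "lag" value (in seconds) when the servers are overloaded.

```python
# Sketch of composing one string-valued statement in the Wikibase
# "snak" JSON form, plus a retry loop that slows down when the API
# reports server lag. Function names are hypothetical.
import time

def build_string_statement(property_id, value):
    """Return the JSON structure for one string-valued statement."""
    return {
        "mainsnak": {
            "snaktype": "value",
            "property": property_id,
            "datavalue": {"value": value, "type": "string"},
        },
        "type": "statement",
        "rank": "normal",
    }

def post_with_backoff(post_once, max_retries=5, default_wait=5):
    """Call post_once() (one API write returning parsed JSON) and retry
    after a pause whenever the API reports that the servers are lagged."""
    for _ in range(max_retries):
        response = post_once()
        error = response.get("error", {})
        if error.get("code") != "maxlag":
            return response
        # The maxlag error reports the current lag in seconds; wait at
        # least that long (or a default) before trying again.
        time.sleep(max(error.get("lag", default_wait), default_wait))
    raise RuntimeError("API still lagged after %d retries" % max_retries)
```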

Eventually, the data uploaded to the API would become available via the Query Service, making it possible to track in the future whether the data were still present in Wikidata.

What is VanderBot?

The simple answer to this question is that VanderBot is the set of Python scripts that I created to write data to Wikidata. The code is freely available on GitHub. However, the question is a little more complicated than that.

When an application communicates with a server over the internet, it is technically known as a "User-Agent". It is considered polite and good practice for a User-Agent to identify itself to the server via an HTTP request header. When I use the scripts I've written, I send the header

VanderBot/0.8 (https://github.com/HeardLibrary/linked-data/tree/master/publications; mailto:steve.baskauf@vanderbilt.edu)

So VanderBot is also the name of a User-Agent. Technically, if you used my script without editing it, you would be using the VanderBot User-Agent, but it would be better not to send the header above, since I don't want server administrators to email me if you do bad things to their server. So you should change the User-Agent header values if you use or modify the VanderBot code. (Similarly, you should also change the tool name and email address sent to the NCBI API in that part of the code - please do not use mine!)
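Setting your own header might look something like this. The tool name, URL, and email address below are placeholders for you to replace, and this standard-library sketch is not VanderBot's actual request code:

```python
# Sketch of attaching a User-Agent header to API requests using only
# the standard library. Replace the placeholder values with your own
# tool name, URL, and email -- not VanderBot's.
import urllib.request

USER_AGENT = "MyBot/0.1 (https://example.org/mybot; mailto:you@example.org)"

def make_request(url):
    """Build a request that identifies this User-Agent to the server."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
```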

When you write to the Wikidata API, you need to be logged in as a Wikidata user. I have created a Wikidata user account called VanderBot, so if I make edits using that account, they are credited to VanderBot in the page history. So VanderBot is also a registered bot in Wikidata. But since you don't have my VanderBot access credentials, you can't make edits to Wikidata as VanderBot even if you use the VanderBot scripts.

So the complicated answer is that you are welcome to use the VanderBot code, you probably shouldn't be using "VanderBot" in a User-Agent header (and definitely not my email address), and you can't use the VanderBot Wikidata bot account.

Upcoming posts

In part 2 of this series, I will talk about the Wikibase data model and identifiers used for entities in the Wikidata graph. The model and identifier system influenced my choices about how to write the code.

In part 3, I will describe the API writing script that maps tabular data to the Wikibase model, then writes those data to the Wikidata API.

In the final part 4, I will describe the data harvesting script that is used to assemble the data to be written to Wikidata and that ensures that duplicate data are not added.

