Note added 2021-03-13: Although this post is still relevant for understanding the conceptual ideas behind my project to write Vanderbilt researcher/scholar records to Wikidata, I have written another series of blog posts showing (with lots of screenshots and handholding) how you can safely write your own data to the Wikidata API using data that is stored in simple CSV spreadsheets. See this post for details.
If you follow my blog, you will notice that I haven't
written much in the last six months. That is at least partly because I've spent
a lot of time working out the practical details of creating a "bot"
that I can use to upload data about Vanderbilt researchers and scholars into
Wikidata. In an earlier post from June last year,
I described in general terms some background about writing to Wikibase, the
platform on which Wikidata is built. (You probably should review that post for
background before starting in on this one.) However, there were a lot of
practical details that needed to be worked out to write to the "real"
Wikidata. Those details are what I'll
talk about in this post.
One question I'll dispense with at the start of the post is
"Why didn't you just use Pywikibot?" There are two reasons. One is
that when I experimented with using Pywikibot and our Wikibase instance,
I encountered an approximately 10 second delay between write operations. I'm
sure that there is some way to defeat that delay, but I was not able to figure
it out by looking through the Pywikibot code and documentation. This brings me
to the second reason. I really don't like to use other people's code that I
don't understand. When I looked through the Pywikibot code, there were layers
of objects and functions calling other objects and functions in different
files. After a short period of sorting through the code, I realized that there
was no way that I was going to understand what was going on with Pywikibot at my
current level of skill with Python.
After that experience, I decided to build my bot from the
ground up. Obviously that took more time, but in the end I actually understood
everything that I was doing and also had a much better idea of how the Wikibase
API works. The code that I've written is
relatively linear and is liberally annotated with comments. So I hope that people
with a moderate level of experience with Python can understand what I did and
be able to hack the code to meet their own needs.
Where I last left off
In the previous post about writing to Wikidata, I described
a simple script that took data from a CSV file and wrote it to a Wikibase
instance (the test Wikidata instance, an independent Wikibase installation, or
the real Wikidata).
That script was very limited. It was only able to write
statements and could not associate references with those statements nor add
qualifiers to the statements. It only created new items and had no way to know
if the described entities already existed in the Wikibase instance. It also had no way to track data about items
once they had been written. Finally, it simply wrote the data as fast as it could
and did not consider whether it should slow its rate due to high load on the
Wikibase API.
Where I wanted to be
A major deficiency of the previous script was that its
communication with the Wikibase instance was only one-way. It wrote to the API,
but made little use of the API's response and it made no use of Wikibase's
capabilities to respond to SPARQL queries.
The workflow that I wanted to facilitate was more complicated.
I wanted the script to first send a SPARQL query to the Query
Service to determine which of the data (including references and qualifiers) I
wanted to write already existed in Wikidata. (From this point forward, I'm
going to refer to the "real" Wikidata instance of Wikibase, so I will
stop talking about Wikibase generically.) That information would then be used
to determine for each record whether the script needed to: create a new item,
add or change labels and descriptions, add statements to an existing item,
add references and qualifiers to existing statements, or do nothing because all
of the desired information was already there.
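To make this concrete, here is a rough sketch (not my actual VanderBot code) of the kind of query the script sends to the Wikidata Query Service to check what already exists. The property and item identifiers in the query are placeholders for illustration, not values the script necessarily uses.

```python
import requests

# Sketch of querying the Wikidata Query Service for existing statements.
# P1416 (affiliation) and Q29052 are placeholder identifiers for illustration.
query = '''
SELECT DISTINCT ?item ?statement WHERE {
  ?item p:P1416 ?statement.        # statement node for the placeholder property
  ?statement ps:P1416 wd:Q29052.   # placeholder value item
}
'''

response = requests.post(
    'https://query.wikidata.org/sparql',
    data={'query': query},
    headers={
        'Accept': 'application/sparql-results+json',
        # Replace with your own User-Agent string (see below).
        'User-Agent': 'MyBot/0.1 (mailto:someone@example.com)'
    }
)
for row in response.json()['results']['bindings']:
    print(row['item']['value'], row['statement']['value'])
```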
Once it was determined what needed to be written, the script
would then compose the appropriate JSON (based on the form of "snaks"
in the Wikibase model) for an item and send it to the API. Using the response
from the API, the script would update the records to indicate that the data
were now present in Wikidata. Based on feedback from the API, the script would also
limit its request rate to avoid hitting it too fast at times of high usage.
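To give a flavor of what that looks like, here is a simplified sketch (not my production code) of posting item JSON to the wbeditentity action with the maxlag parameter and backing off when the API reports that it is overloaded. The session, CSRF token, and item JSON are assumed to already exist; the actual JSON VanderBot builds is much more elaborate.

```python
import json
import time

import requests

session = requests.Session()  # assumed to be already logged in (see below)
csrf_token = '...'            # assumed to have been retrieved after login
item_json = {'labels': {'en': {'language': 'en', 'value': 'Example person'}}}

parameters = {
    'action': 'wbeditentity',
    'new': 'item',            # or 'id': 'Q...' to edit an existing item
    'data': json.dumps(item_json),
    'token': csrf_token,
    'format': 'json',
    'maxlag': 5               # ask the API to refuse writes when servers lag
}

while True:
    response = session.post('https://www.wikidata.org/w/api.php', data=parameters)
    result = response.json()
    if 'error' in result and result['error'].get('code') == 'maxlag':
        # The servers are lagged; wait the suggested time and retry.
        time.sleep(int(response.headers.get('Retry-After', 5)))
        continue
    break

# The response includes the item's Q ID, which can be recorded locally.
print(result['entity']['id'])
```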
Eventually, the data uploaded to the API would become
available via the Query Service, making it possible to track in the future
whether the data were still present in Wikidata.
What is VanderBot?
The simple answer to this question is that VanderBot is the
set of Python scripts that I created to write data to Wikidata. The code is
freely available in GitHub. However, the question is a little more
complicated than that.
When an application communicates with a server over the
internet, it is technically known as a "User-Agent". It is considered
a polite and good practice for a User-Agent to identify itself to the server via
an HTTP request header. When I use the scripts I've written, I send the header
VanderBot/0.8 (https://github.com/HeardLibrary/linked-data/tree/master/publications; mailto:steve.baskauf@vanderbilt.edu)
So VanderBot is also the name of a
User-Agent. Technically, if you used my script without editing it, you would be
using the VanderBot User-Agent, but it probably would be better to not send the
header above, since I don't want server administrators to email me if you do
bad things to their server. So you should
change the User-Agent header values if you use or modify the VanderBot code.
(Similarly, you should also change the tool name and email address sent to the
NCBI API in that part of the code - please do not use mine!)
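For example, with the requests library you can set your own User-Agent once on a session. The bot name, repository URL, and email address below are placeholders that you would replace with your own, not values from my code.

```python
import requests

session = requests.Session()
# Identify your own script, code repository, and contact address here,
# not VanderBot's.
session.headers.update({
    'User-Agent': 'MyWikidataBot/0.1 (https://github.com/yourname/yourrepo; mailto:you@example.com)'
})
```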
When you write to the Wikidata API, you need to be logged in
as a Wikidata user. I have created a Wikidata user account called VanderBot,
so if I make edits using that account, they are credited to VanderBot in the
page history. So VanderBot is also a registered bot in Wikidata. But since you
don't have my VanderBot access credentials, you can't make edits to Wikidata as
VanderBot even if you use the VanderBot scripts.
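For readers who haven't done this before, here is a minimal sketch of the standard MediaWiki login flow using a bot password (this is not my exact VanderBot login code); the username and password values are placeholders for your own credentials.

```python
import requests

api_url = 'https://www.wikidata.org/w/api.php'
session = requests.Session()

# Step 1: get a login token.
token_response = session.get(api_url, params={
    'action': 'query', 'meta': 'tokens', 'type': 'login', 'format': 'json'
})
login_token = token_response.json()['query']['tokens']['logintoken']

# Step 2: log in with a bot username and bot password (placeholders here).
session.post(api_url, data={
    'action': 'login',
    'lgname': 'YourUserName@YourBotName',
    'lgpassword': 'your-bot-password',
    'lgtoken': login_token,
    'format': 'json'
})

# Step 3: get a CSRF token to use when writing with wbeditentity.
csrf_response = session.get(api_url, params={
    'action': 'query', 'meta': 'tokens', 'format': 'json'
})
csrf_token = csrf_response.json()['query']['tokens']['csrftoken']
```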
So the complicated answer is that you are welcome to use the
VanderBot code, you probably shouldn't be using "VanderBot" in a
User-Agent header (and definitely not my email address), and you can't use the
VanderBot Wikidata bot account.
Upcoming posts
In part 2 of this series, I will talk about the Wikibase data model and identifiers used for entities in the Wikidata graph. The model and identifier system influenced my choices about how to write the code.
In part 3, I will describe the API writing script that maps tabular data to the Wikibase model, then writes those data to the Wikidata API.
In the final part 4, I will describe the data harvesting script that is used to assemble the data to be written to Wikidata and that ensures that duplicate data are not added.