Saturday, February 8, 2020

VanderBot part 4: Preparing data to send to Wikidata



In the previous blog post, I described how I used a Python script to upload data stored in a CSV spreadsheet to Wikidata via the Wikidata API.  I noted that the spreadsheet contained information about whether data were already in Wikidata and if they needed to be written to the API, but I did not say how I acquired those data, nor how I determined whether they needed to be uploaded or not. That data acquisition and processing is the topic of this post.

The overall goal of the VanderBot project is to enter data about Vanderbilt employees (scholars and researchers) and their academic publications into Wikidata. Thus far in the project, I have focused primarily on acquiring and uploading data about the employees. The data acquisition process has three stages:

1. Acquiring the names of research employees (faculty, postdocs, and research staff) in departments of the university.

2. Determining whether those employees were already present in Wikidata or if items needed to be created for them.

3. Generating data required to make key statements about the employees and determining whether those statements (and associated references) had already been asserted in Wikidata.

The data harvesting script (coded in Python) that carries out these processes is available as a Jupyter notebook on GitHub.



Acquire names of research employees at Vanderbilt


Scrape departmental website

I've linked employees to Vanderbilt through their departmental affiliations. Therefore, the first task was to create items for departments in the various schools and colleges of Vanderbilt University. I won't go into detail about that process other than to say that the hacky code I used to do it is on GitHub.

The actual names of the employees were acquired by scraping departmental faculty and staff web pages. I developed the scraping script based on the web page of my old department, biological sciences. Fortunately, the same page template was used by many other departments in both the College of Arts and Sciences and the Peabody College of Education, so I was able to scrape about 2/3 of the departments in those schools without modifying the script I developed for the biological sciences department.

Because the departments had differing numbers of researcher pages covering different categories of researchers, I created a JSON configuration file where I recorded the base departmental URLs and the strings appended to each base to generate the individual researcher pages. The configuration file also included some other data needed by the script, such as the department's Wikidata Q ID, a generic description to use for researchers in the department (if they didn't already have a description), and some strings that I used for fuzzy matching with other records (described later). Some sample JSON is included in the comments near the top of the script.
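Purely as an illustration of the idea (the key names and values below are hypothetical; the script's comments document the real structure), a department entry might look something like this when loaded in Python:

import json

# Hypothetical illustration of a department entry in the configuration file.
# The real key names and structure are documented in the comments near the
# top of the harvesting script on GitHub; this is only a sketch.
config_json = '''
{
  "departments": [
    {
      "qId": "Q12345678",
      "baseUrl": "https://example.edu/biosci/",
      "pageStrings": ["faculty/", "postdoctoral-fellows/", "research-staff/"],
      "defaultDescription": "biological science researcher",
      "matchStrings": ["Biological Sciences", "Biology"]
    }
  ]
}
'''

config = json.loads(config_json)
for department in config['departments']:
    # Each researcher page URL is the department's base URL plus one of the
    # appended strings.
    for page in department['pageStrings']:
        print(department['baseUrl'] + page)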

The result at the end of the "Scrape departmental website" section of the code was a CSV file with the researcher names and some other data that I made a feeble attempt to scrape, such as their title and affiliation.


Search ORCID for Vanderbilt employees

ORCID (Open Researcher and Contributor ID) plays an important part in disambiguating employees. Because ORCIDs are globally unique, associating an employee name with an ORCID allows one to know that the employee is different from someone with the same name who has a different ORCID.

For that reason, I began the disambiguation process by performing a search for "Vanderbilt University" using the ORCID API. The search produced several thousand results. I then dereferenced each of the resulting ORCID URIs to capture the full data about the researcher. That required an API call for each record and I used a quarter second delay per call to avoid hitting the API too fast. As a result, this stage of the process took hours to run.
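In outline, that stage looked something like the following sketch, which assumes ORCID's public v3.0 REST API and its JSON response layout; the Jupyter notebook on GitHub has the authoritative code.

import time
import requests

HEADERS = {'Accept': 'application/json'}

# Search the ORCID registry for records mentioning "Vanderbilt University".
# (The real script pages through all of the several thousand results.)
search_response = requests.get('https://pub.orcid.org/v3.0/search/',
                               params={'q': '"Vanderbilt University"', 'rows': 200},
                               headers=HEADERS)
hits = search_response.json().get('result', [])

records = []
for hit in hits:
    orcid_id = hit['orcid-identifier']['path']
    # Dereference each ORCID URI to capture the full record for screening.
    record_response = requests.get(f'https://pub.orcid.org/v3.0/{orcid_id}/record',
                                   headers=HEADERS)
    records.append(record_response.json())
    time.sleep(0.25)  # quarter-second delay so the API isn't hit too fast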

I screened the results by recording only those that listed "Vanderbilt University" as part of the organization string of an employment affiliation. That excluded people who were only students and never employees, and included people whose affiliation was "Vanderbilt University Medical Center", "Vanderbilt University School of Nursing", etc. As part of the data recorded, I included their stated departmental affiliations (some had multiple affiliations because they moved from one department to another during their career). After this stage, I had 2240 name/department records.


Fuzzy matching of departmental and ORCID records

The next stage of the process was to try to match employees from the department I was processing with the downloaded ORCID records. I used a Python fuzzy string matching function called fuzz.token_set_ratio() from the fuzzywuzzy package. I tested this function along with others in the package, and it was highly effective at matching names (of both people and departments) with minor variations. Because this function is insensitive to word order, it matched names like "Department of Microbiology" and "Microbiology Department". However, it also scored name-order reversals ("John James" and "James John", for example) as strong matches, so I had an extra check for that.

If the person's name had a match score greater than 90 (out of 100), I then performed a match check against the listed department. If the department also had a match score greater than 90, I assigned that ORCID to the person. If no listed department had a score over 90, I still assigned the ORCID, but flagged that match for manual checking later.
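The matching rule boils down to something like the following sketch (not the script's exact code; the record structure and the reversal guard are simplifications):

from fuzzywuzzy import fuzz

def assign_orcid(employee_name, employee_dept, orcid_records, threshold=90):
    """Simplified sketch of the matching rule. Each record in orcid_records is
    assumed to be a dict like {'name': ..., 'orcid': ..., 'departments': [...]}.
    Returns (orcid, flag_for_manual_check) or (None, False)."""
    for record in orcid_records:
        if fuzz.token_set_ratio(employee_name, record['name']) <= threshold:
            continue
        # token_set_ratio ignores word order, so reversals like "John James"
        # vs. "James John" score highly; an order-sensitive check guards
        # against that (the real script's extra check differs in detail).
        if fuzz.ratio(employee_name, record['name']) < 50:
            continue
        department_scores = [fuzz.token_set_ratio(employee_dept, dept)
                             for dept in record['departments']]
        if department_scores and max(department_scores) > threshold:
            return record['orcid'], False   # name and department both matched
        return record['orcid'], True        # name matched; flag for manual check
    return None, False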



Determine whether employees were already in Wikidata


Attempt automated matching with people in Wikidata known to work at Vanderbilt

I was then ready to start trying to match people with existing Wikidata records. The low-hanging fruit was people whose records already stated that their employer was Vanderbilt University (Q29052). I ran a SPARQL query for that using the Wikidata Query Service.  For each match, I also recorded the employee's description, ORCID, start date, and end date (where available).
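The query itself is straightforward; the sketch below shows roughly how it can be run from Python against the Query Service endpoint (the variable names are illustrative):

import requests

# Everyone whose employer (P108) is Vanderbilt University (Q29052), with
# description, ORCID, and start/end dates where available.
query = '''
SELECT DISTINCT ?person ?description ?orcid ?startDate ?endDate WHERE {
  ?person p:P108 ?statement.
  ?statement ps:P108 wd:Q29052.
  OPTIONAL {?person schema:description ?description. FILTER(lang(?description) = "en")}
  OPTIONAL {?person wdt:P496 ?orcid.}
  OPTIONAL {?statement pq:P580 ?startDate.}
  OPTIONAL {?statement pq:P582 ?endDate.}
}
'''

response = requests.get('https://query.wikidata.org/sparql',
                        params={'query': query, 'format': 'json'})
for binding in response.json()['results']['bindings']:
    print(binding['person']['value'], binding.get('orcid', {}).get('value', ''))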

Once I had those data, I checked each departmental employee's record against the query results. If both the departmental employee and the potential match from Wikidata had the same ORCID, then I knew that they were the same person and I assigned the Wikidata Q ID to that employee. If the employee had an ORCID, I could exclude any Wikidata records with non-matching ORCIDs and only check for name matches with Wikidata records that didn't have ORCIDs. Getting a name match alone was not a guarantee that the person in Wikidata was the same as the departmental employee, but given that the pool of possible Wikidata matches only included people employed at Vanderbilt, a good name match meant that it was probably the same person. If the person had a description in Wikidata, I printed the two names and the description and visually inspected the matches. For example, if there was a member of the Biological Sciences department named Jacob Reynolds and someone in Wikidata named Jacob C. Reynolds who was a microbiologist, the match was probably good. On the other hand, if Jacob C. Reynolds was a historian, then some manual checking was in order. I did a few other tricks that you can see in the code.

This "smart matching" with minimal human intervention was usually able to match a small fraction of people in the department. But there were plenty of departmental employees who were already in Wikidata without any indication that they worked at Vanderbilt. The obvious way to look for them would be to just do a SPARQL query for their name. There are some features built in to SPARQL that allow for REGEX checks, but those features are impossibly slow for a triplestore the size of Wikidata's. The strategy that I settled for was to generate as many possible variations of the person's name and query for all of them at once. You can see what I did in the generateNameAlternatives() function in the code. I searched labels and aliases for: the full name, names with middle initials with and without periods, first and middle initials with and without periods, etc. This approach was pretty good at matching with the right people, but it also matched with a lot of wrong people. For example, for Jacob C. Reynolds, I would also search for J. C. Reynolds. If John C. Reynolds had J. C. Reynolds as an alias, he would come up as a hit. I could have tried to automate the processing of the returned names more, but there usually weren't a lot of matches and with the other screening criteria I applied, it was pretty easy for me to just look at the results and bypass the false positives.

When I did the query for the name alternatives, I downloaded the values for several properties that were useful for eliminating hits. One important screen was to eliminate any matching items that were instances of classes (P31) other than human (Q5). I also screened out people who were listed as having died prior to some set date (2000 worked well - some departments still listed recently deceased emeriti and I didn't want to eliminate those). If both the employee and the name match in Wikidata had ORCIDs and they were different, I also eliminated the hit. For all matches that passed these screens, I printed the description, occupation, and employer if they were given in Wikidata.


Clues from publications in PubMed and Crossref

The other powerful tool I used for disambiguation was to look up any articles linked to the putative Wikidata match. For each Wikidata person item that made it this far through the screening, I did a SPARQL query to find works authored by that person. For up to 10 works, I did the following. If the article had a PubMed ID, I retrieved the article metadata from the PubMed API and tried to match against the author names. When I got a match with an author, I checked for an ORCID match (or excluded the hit on an ORCID mismatch) and also did a fuzzy match against any affiliation that was given. If either the ORCID or the affiliation matched, I concluded that the departmental employee was the same as the Wikidata match and stopped looking.

If there was no match in PubMed and the article had a DOI, I then retrieved the metadata about the article from the Crossref API and did the same kind of screening that I did with PubMed.
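Here is a sketch of that Crossref screening step, assuming the public Crossref REST API (api.crossref.org) and simplified record structures; the PubMed check works the same way in spirit:

import requests
from fuzzywuzzy import fuzz

def crossref_confirms_match(doi, employee_name, employee_orcid, affiliation_strings):
    """Return True if an author of the DOI matches the employee by ORCID or
    by a fuzzy affiliation match; False otherwise. Simplified sketch."""
    work = requests.get(f'https://api.crossref.org/works/{doi}').json()['message']
    for author in work.get('author', []):
        author_name = f"{author.get('given', '')} {author.get('family', '')}"
        if fuzz.token_set_ratio(author_name, employee_name) <= 90:
            continue
        if employee_orcid and author.get('ORCID'):
            # An ORCID match settles it; a mismatch excludes this author.
            return employee_orcid in author['ORCID']
        for affiliation in author.get('affiliation', []):
            for target in affiliation_strings:
                if fuzz.token_set_ratio(affiliation.get('name', ''), target) > 90:
                    return True
    return False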


Human intervention

If there was no automatic match via the article searches, I printed out the full set of information (description, employer, articles, etc.) for every name match, along with the name from the department and the name from Wikidata in order for a human to check whether any of the matches seemed plausible. In a lot of cases, it was easy to eliminate matches that had descriptions like "Ming Dynasty person" or occupation = "golfer".   If there was uncertainty, the script printed hyperlinked Wikidata URLs and I could just click on them to examine the Wikidata record manually.

Here's some typical output:

--------------------------
No Wikidata name match:  Justine Bruyère

--------------------------
No Wikidata name match:  Nicole Chaput Guizani

--------------------------
SPARQL name search:  Caroline Christopher
(no ORCID)


0 Wikidata ID:  Q83552019  Name variant:  Caroline Christopher   https://www.wikidata.org/wiki/Q83552019
No death date given.
description:  human and organizational development educator
employer:  Vanderbilt University
No articles authored by that person
Employee:  Caroline Christopher  vs. name variant:  Caroline Christopher

Enter the number of the matched entity, or press Enter/return if none match: 0

--------------------------
SPARQL name search:  Paul Cobb
(no ORCID)


0 Wikidata ID:  Q28936750  Name variant:  Paul Cobb   https://www.wikidata.org/wiki/Q28936750
No death date given.
description:  association football player
occupation:  association football player
No articles authored by that person
Employee:  Paul Cobb  vs. name variant:  Paul Cobb


1 Wikidata ID:  Q55746009  Name variant:  Paul Cobb   https://www.wikidata.org/wiki/Q55746009
No death date given.
description:  American newspaper publisher
occupation:  newspaper proprietor
No articles authored by that person
Employee:  Paul Cobb  vs. name variant:  Paul Cobb

Enter the number of the matched entity, or press Enter/return if none match: 

--------------------------
No Wikidata name match:  Molly Collins

--------------------------
No Wikidata name match:  Ana Christina da Silva [Iddings]


Although this step did require human intervention, because of the large amount of information that the script collected about the Wikidata matches, it usually only took a few minutes to disambiguate a department with 30 to 50 employees.


Generate statements and references and determine which were already in Wikidata


Generating data for a minimal set of properties

The next to last step was to assign values to a minimal set of properties that I felt each employee should have in a Wikidata record. Here's what I settled on for that minimal set:

P31 Q5 (instance of human). This was automatically assigned to all records.

P108 Q29052 (employer Vanderbilt University). This applies to all employees in our project - the employer value can be set at the top of the script.

P1416 [Q ID of department] (affiliation with focal department). After searching through many possible properties, I decided that P1416 (affiliation) was the best property to use to assert the employee's connection to the department I was processing. P108 was also possible, but there were a lot of people with dual departmental appointments and I generally didn't know which department was the actual "employer". Affiliation seemed to be an appropriate connection for regular faculty, postdocs, visiting faculty, research staff, and other kinds of statuses where the person would have some kind of research or scholarly output.

P496 [ORCID identifier]. ORCIDs that I'd acquired for the employees were hard-won and an excellent means for anyone else to carry out disambiguation, so I definitely wanted to include that assertion if I could.

P21 [sex or gender]. I was really uncomfortable assigning a value of this property, but this is a property often flagged by Recoin as a top missing property and I didn't want some overzealous editor deleting my new items because their metadata were too skimpy. Generally, the departmental web pages had photos to go with the names, so I made a call and manually assigned a value for this property (options: m=male, f=female, i=intersex, tf=transgender female, tm=transgender male). Any time the sex or gender seemed uncertain, I did not provide a value.

The description.  I made up a default description for each department, such as "biological science researcher", "historian", or "American Studies scholar" for the Biological Sciences, History, and American Studies departments respectively. I did not overwrite any existing descriptions by default, although as a last step I looked at the table to replace stupid ones like "researcher, ORCID: 0000-0002-1234-5678". These defaults were generally specific enough to keep the label/description combination I was creating from colliding with the label/description combination of an existing record, which would kill the record write.

When it made sense, I added references to the statements I was making. Generally, a reference is not expected for instance of human and I really couldn't give a reference for sex or gender.  For the employer and affiliation references, I used the web page that I scraped to get their name as the reference URL and provided the current date as the value for P813 (retrieved).  For ORCID, I created a reference that had a P813 (retrieved) property if I was able to successfully dereference the ORCID URI.

Because each of these properties had different criteria for assigning values and references, there was no standard code for assigning them. The code for each property is annotated, so if you are interested you can look at it to see how I made the assignments.


Check for existing data in Wikidata

In the earlier posts, I said that I did not want VanderBot to create duplicate items, statements, and references when they already existed in Wikidata. So a critical last step was to check for existing data using SPARQL. One important thing to keep in mind is the Query Service Updater lag that I talked about in the last post. Because of that lag, changes made within the last 8 or 10 hours might not be included in this download. However, given that the Wikidata researcher item records I'm dealing with do not change frequently, the lag generally isn't a problem. I should note that it would be possible to get these data directly from the Wikidata API, but the convenience of getting exactly the information I wanted using SPARQL outweighed my motivation to develop code to do that.

At this point in the workflow, I've already determined with a fairly high degree of confidence which of the departmental employees were already in Wikidata. That takes care of the potential problem of creating duplicate item records, and it also means that I do not need to check for the presence of statements or references for any of the new items either.

One interesting feature of SPARQL that I learned about from this project was the VALUES clause. Despite having used SPARQL for years and skimming through the SPARQL specification several times, I had missed it. The VALUES clause allows you to specify which values the query should use for a particular variable in its pattern matching. That makes querying a large triplestore like Wikidata much faster than without it, and it also reduces the number of results that the code has to sort through when they come back from the query service. Here's an example of a query using the VALUES clause that you can test at the Wikidata Query Service:

SELECT DISTINCT ?id ?statement WHERE {
  VALUES ?id {
    wd:Q4958
    wd:Q39993
    wd:Q234
  }
  ?id p:P31 ?statement.
}


So the first part of the last step in the workflow is to generate a list of all of the existing item Q IDs for employees in the department. That list is passed to the searchStatementAtWikidata() function as its first argument. searchStatementAtWikidata() is a general-purpose function that searches Wikidata for a particular property of the items in the list. It can be used either to search for a particular property and value (like P108 Q29052, employer Vanderbilt University) and retrieve the references for that statement, or for only the property (like P496, ORCID) and retrieve both the values and references associated with those statements. This behavior is controlled by whether an empty string is sent as the value argument. For each of the minimal set of properties that I'm tracking for departmental employees, the searchStatementAtWikidata() function is used to retrieve any available data for the listed employees. Those data are then matched with the appropriate employee records and recorded in the CSV file along with the previously generated property values.
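In spirit, the query construction looks like this sketch (the real searchStatementAtWikidata() function also asks for the references attached to each statement):

def build_statement_query(qids, prop, value=''):
    # Sketch only; the real searchStatementAtWikidata() also retrieves the
    # references associated with each statement.
    values_clause = '\n'.join(f'    wd:{qid}' for qid in qids)
    if value:   # e.g. prop='P108', value='Q29052' (employer Vanderbilt University)
        select = 'SELECT DISTINCT ?id ?statement'
        pattern = f'  ?id p:{prop} ?statement.\n  ?statement ps:{prop} wd:{value}.'
    else:       # e.g. prop='P496' (ORCID): retrieve the values as well
        select = 'SELECT DISTINCT ?id ?statement ?value'
        pattern = f'  ?id p:{prop} ?statement.\n  ?statement ps:{prop} ?value.'
    return (f'{select} WHERE {{\n  VALUES ?id {{\n{values_clause}\n  }}\n'
            f'{pattern}\n}}')

print(build_statement_query(['Q4958', 'Q39993', 'Q234'], 'P31'))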

In addition to the property checks, labels, descriptions, and aliases for the list of employees are retrieved via SPARQL queries. In the cases of labels and descriptions, if there is an existing label or description in Wikidata, it is written to the CSV file. If there is no existing label, the name scraped from the departmental website is written to the CSV as the label. If there is no existing description, the default description for the department is written to the CSV. Whatever alias lists are retrieved from Wikidata (including empty ones) are written to the CSV.

Final manual curation prior to writing to the Wikidata API

In theory, the CSV file resulting from the previous step should contain all of the information needed by the API-writing script that was discussed in the last post. However, I always manually examine the CSV to look for problems or things that are stupid such as bad descriptions.

If a description or label is changed, the API-writing script will detect that it's different from the current value being provided by the SPARQL endpoint and the new description or label will overwrite the existing one. The API-writing script is currently not very sophisticated about how it handles aliases. If there are more aliases in the CSV than are currently in Wikidata, the script will overwrite existing aliases in Wikidata with those in the spreadsheet. The assumption is that alias lists are only added to, rather than aliases being changed or deleted.  At some point in the future, I intend to write a separate script that will handle labels and aliases in a more robust way, so I really didn't want to waste time now on making the alias-handling better than it is. 

A typical situation is to discover a more specific label for the person than the one that already exists in Wikidata. In that case, I usually add the existing label to the alias list and replace the label value in the CSV with the better new one. WARNING! If you edit the alias list, make sure that your editor uses plain straight quotes (ASCII 34/Unicode U+0022) and not "smart quotes". Smart quotes have different Unicode values and will break the script. Open Office/Libre Office (the best applications for editing CSVs in my opinion) default to smart quotes, so this setting must be turned off manually.

I also just look over the rest of the spreadsheet to convince myself that nothing weird is going on. Usually the script does an effective job of downloading the correct reference properties and values, but I've discovered some odd situations that have caused problems.


At this point, steps 1 and 2 in the VanderBot diagram have been completed by the data harvesting script, and the API-writing script described in the last post is ready to take over in step 3.  When step 4 is complete, the blank cells in the CSV for missing item, statement, and reference identifiers should all be filled in and the CSV can be filed for future reference.

Final thoughts


I tried to make the API writing script generic and adaptable for writing statements and references about any kind of entity. That's achievable simply by editing the JSON schema file that maps the columns in the source CSV. However, getting the values for that CSV is the tricky part. If one were confident that only new items were being written, then the table could be filled with only the data to be written, without any item, statement, or reference identifiers. That would be the case if you were using the script to load your own Wikibase instance. However, for adding data to Wikidata about most kinds of items, like people or references, one can't know in advance whether the data need to be written or not, and that's why a complex and somewhat idiosyncratic script like the data harvesting script is necessary. So there's no "magic bullet" that will make it possible to automatically know whether you can write data to Wikidata without creating duplicate assertions.

To find records that VanderBot has put into Wikidata, try this query at the Wikidata Query Service:

select distinct ?employee where {
  ?employee wdt:P1416/wdt:P749+ wd:Q29052.
  }
limit 50

The triple pattern requires that the employee first have an affiliation (P1416) with some item, and that the item be linked by one or more parent organization (P749) links to Vanderbilt University (Q29052). I linked the department items to their parent school or college using P749 and made sure that the University's schools and colleges were all linked to the University by P749 as well. However, some schools, like the Blair School of Music, do not really have departments, so their employees were affiliated directly with the school or college rather than with a department. So the search has to pick up administrative entity items that are either one or two P749 links from the university (hence the "+" property path operator after P749). Since there are a lot of employees, I limited the results to 50. If you click on any of the results, it will take you to the item page, and you can view the page history to confirm that VanderBot made edits to the page. (At some point, there may be people who were linked in this way by an account other than VanderBot, but thus far VanderBot is probably the only editor linking Vanderbilt employee items to departments by P1416, given that I recently created all of the department items from scratch.)

A variation of that query will tell you the number of records meeting the criteria of the previous query:

select (count(?employee) as ?count) where {
  ?employee wdt:P1416/wdt:P749+ wd:Q29052.
  }

As of 2020-02-08, there are 1221 results. That number should grow as I use VanderBot to process other departments.


Friday, February 7, 2020

VanderBot part 3: Writing data from a CSV file to Wikidata


In the previous post of this series, I described how my investigation of the Wikibase data model led me to settle on a relatively simple spreadsheet layout for tracking what items, statements, and references needed to be created or edited in Wikidata. Since column headers in a CSV spreadsheet don't really have any meaning other than to a human, it's necessary to map columns to features of the Wikibase model so that a script would know how to write the data in those columns to appropriate data items in Wikidata. 

Developing a schema to map spreadsheet columns to the Wikibase model

In a blog post from 2016, I wrote about a similar problem that I faced when creating an application that would translate tabular CSV data to RDF triples. In that case, I created a mapping CSV table that mapped table headers to particular RDF predicates, and that also indicated the kind of object represented in the table (language-tagged literal, IRI, etc.). That approach worked fine and had the advantage of simplicity, but it had the disadvantage that it was an entirely ad hoc solution that I made up for my own use.

When I learned about the "Generating RDF from Tabular Data on the Web" W3C Recommendation, I recognized that this was a more standardized way to accomplish a mapping from a CSV table to RDF. When I started working on the VanderBot project I realized that since the Wikibase model can be expressed as an RDF graph, I could construct a schema using this W3C standard to document how my CSV data should be mapped to Wikidata items, properties, references, labels, etc. The most relevant part of the standard is section 7.3, "Example with single table and using virtual columns to produce multiple subjects per row".

An example schema that maps the sample table from the last post is here. The schema is written in JSON, and if it were ingested by an application that can transform CSV files in accordance with the W3C specification, it should produce RDF triples identical to triples about the subject items that are stored in the Wikidata Query Service triplestore (not all of the triples, but many of the ones that would be generated if the CSV data were loaded into the Wikidata API).  I haven't actually tried this since I haven't acquired such an application, but the point is that the JSON schema applied to the CSV data will generate part of the graph that will eventually be present in Wikidata when the data are loaded.

I will not go into every detail of the example schema, but show several examples of how parts of it map particular columns.

Column for the item identifier

Each column in the table has a corresponding JSON object in the schema. The first column, with the column header title "wikidataId" is mapped with:

{
"titles": "wikidataId",
"name": "wikidataId",
"datatype": "string", 
"suppressOutput": true
}

This JSON simply associates a variable name (wikidataId) with the Wikidata Q ID for the item that's the subject of each row. (For simplicity, I've chosen to make the variable names the same as the column titles, but that isn't required.)  The "true" value for suppressOutput means that no statement is directly generated from this column.

Column for the label

The "labelEn" column is mapped with this JSON object:

{
"titles": "labelEn",
"name": "labelEn",
"datatype": "string",
"aboutUrl": "http://www.wikidata.org/entity/{wikidataId}",
"propertyUrl": "rdfs:label",
"lang": "en"
}

The value of aboutUrl indicates the subject of the triple generated by this column. The curly brackets indicate that the wikidataId variable should be substituted in that place to generate the URI for the subject.  The value of propertyUrl is rdfs:label, the RDF predicate that Wikibase uses for its label field. The object of the triple by default is the value present in that column for the row.  The lang value provides the language tag for the literal.


So when this mapping is applied to the labelEn column of the first row, the triple

<http://www.wikidata.org/entity/Q84268104> rdfs:label "Vanderbilt Department of Biomedical Engineering"@en.

would be generated.

Column for a property whose value is an item (P749)

Here is the JSON object that maps the "parentUnit" column.

{
"titles": "parentUnit",
"name": "parentUnit",
"datatype": "string",
"aboutUrl": "http://www.wikidata.org/entity/{wikidataId}",
"propertyUrl": "http://www.wikidata.org/prop/direct/P749",
"valueUrl": "http://www.wikidata.org/entity/{parentUnit}"
}

As before, the subject URI is established by substituting the wikidataId variable into the URI template for aboutUrl. Instead of directly mapping the column value as the object of the triple, the column value is inserted into a valueUrl URI template in the same manner as the aboutUrl.  


Applying this column mapping to the parentUnit column generates the triple:

<http://www.wikidata.org/entity/Q84268104> <http://www.wikidata.org/prop/direct/P749> <http://www.wikidata.org/entity/Q7914459>.

which can be abbreviated

wd:Q84268104 wdt:P749 wd:Q7914459.

The other columns in the CSV table are mapped similarly. If there is no valueUrl key/value pair, the value in the column is a literal object; if there is a valueUrl, the value in the column is used to generate a URI denoting a non-literal object.

The value of datatype is important since it determines the xsd:datatype of literal values in the generated triples.

Not every column generates a triple with a subject that's the subject of the row. The subject may be the value of any other column. This allows the data in the row to form a more complicated graph structure.
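Conceptually, applying the schema to a row is just template substitution. The sketch below (which ignores datatypes and assumes, as in the example schema, that the variable names match the column titles) shows the idea:

import json

def row_to_triples(row, schema_columns):
    """Emit one triple per mapped column of one CSV row. Sketch only:
    datatype handling and prefixed-name expansion are omitted."""
    triples = []
    for column in schema_columns:
        if column.get('suppressOutput'):
            continue
        subject = '<' + column['aboutUrl'].format(**row) + '>'
        predicate = column['propertyUrl']
        if predicate.startswith('http'):
            predicate = f'<{predicate}>'  # full URI; prefixed names (rdfs:label) left as-is
        if 'valueUrl' in column:
            obj = '<' + column['valueUrl'].format(**row) + '>'
        else:
            obj = json.dumps(row[column['titles']])  # quoted literal
            if 'lang' in column:
                obj += '@' + column['lang']
        triples.append(f'{subject} {predicate} {obj}.')
    return triples

row = {'wikidataId': 'Q84268104',
       'labelEn': 'Vanderbilt Department of Biomedical Engineering',
       'parentUnit': 'Q7914459'}
columns = [
    {'titles': 'wikidataId', 'suppressOutput': True},
    {'titles': 'labelEn', 'aboutUrl': 'http://www.wikidata.org/entity/{wikidataId}',
     'propertyUrl': 'rdfs:label', 'lang': 'en'},
    {'titles': 'parentUnit', 'aboutUrl': 'http://www.wikidata.org/entity/{wikidataId}',
     'propertyUrl': 'http://www.wikidata.org/prop/direct/P749',
     'valueUrl': 'http://www.wikidata.org/entity/{parentUnit}'},
]
for triple in row_to_triples(row, columns):
    print(triple)

Running this prints the same two triples shown above for the label and parentUnit columns.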

How the VanderBot script writes the CSV data to the Wikidata API

The script that does the actual writing to the Wikidata API is here.  The authentication process (line 338) is described in detail elsewhere.   

The actual script begins (line 374) by loading the schema JSON into a Python data structure and loading the CSV table into a list of dictionaries. 

The next section of the code (lines 402 to 554) uses the schema JSON to sort the columns of the tables into categories (labels, aliases, descriptions, statements with entity values, and statements with literal values).  

From lines 556 to 756, the script steps through each row of the table to generate the data that needs to be passed to the API to upload new data.  In each row, the script goes through each category of data (labels, aliases, etc.) and turns the value in a column into the specific JSON required by the API for uploading that kind of data. I call this "snak JSON" because the units in the JSON represent "snaks" (small, discrete statements) as defined by the Wikibase data model.

Originally, I had written the script in a simpler way, where each piece of information about the item was written in a separate API call. This seemed intuitive since there are individual API methods for uploading every category (label, description, property, reference, etc., see the API documentation). However, because of rate limitations that I'll talk about later, the most reasonable way to write the data was to determine which categories needed to be written for an item and then generate the JSON for all categories at once. I then used the "all in one" method wbeditentity to make all possible edits in a single API call. This resulted in much more complicated code that constructed deeply nested JSON that's difficult to read. The API help page didn't give any examples that were nearly this complicated, so getting this strategy to work required delving deeply into the Wikibase model. One lifesaver was that when a successful API call was made, the API's response included JSON structured according to the Wikibase model that was very similar to the JSON that was necessary to write to the API. Being able to look at this response JSON was really useful to help me figure out what subtle mistakes I was making when constructing the JSON to send to the API.
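To give a sense of the nesting involved, here is a simplified sketch (as a Python dictionary, with made-up values) of the sort of structure that gets serialized into the data parameter of wbeditentity: an English label plus one employer statement carrying a reference URL. It is not the script's literal output, and the full format is spelled out in the Wikibase data model and API documentation.

import json

# Simplified sketch: an English label plus a P108 (employer) Q29052
# (Vanderbilt University) claim with a P854 (reference URL) reference.
# The person's name and the URL are made up.
edit_data = {
    'labels': {'en': {'language': 'en', 'value': 'Jacob C. Reynolds'}},
    'claims': [{
        'type': 'statement',
        'rank': 'normal',
        'mainsnak': {
            'snaktype': 'value',
            'property': 'P108',
            'datavalue': {
                'type': 'wikibase-entityid',
                'value': {'entity-type': 'item', 'numeric-id': 29052}
            }
        },
        'references': [{
            'snaks': {
                'P854': [{
                    'snaktype': 'value',
                    'property': 'P854',
                    'datavalue': {'type': 'string',
                                  'value': 'https://example.edu/biosci/faculty/'}
                }]
            }
        }]
    }]
}

# The dictionary is serialized and sent as the 'data' parameter of wbeditentity.
print(json.dumps(edit_data, indent=2))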

Simply creating labels, descriptions, and claims would not have been too hard, but I was determined to also have the capability to support references and qualifiers for claims. Here's how I hacked that task: for each statement column, I went through the columns and looked for other columns that the schema indicated were references or qualifiers of that statement. Currently, the script only handles one reference and one qualifier per statement, but when I get around to it, I'll improve the script to remove that limitation. 

In line 759, the script checks whether it found any information about the item that wasn't already written to Wikidata. If there was at least one thing to write, the script attempts to post a parameter dictionary (including the complex, constructed snak JSON) to the API (lines 305 to 335). If the attempt was unsuccessful because the API was too busy, it retries several times. If the attempt was unsuccessful for other reasons, the script displays the server's response for debugging.

If the attempt was successful, the script extracts identifiers of newly-created data records (item Q IDs, statement UUIDs, and reference hashes - see the previous post for more on this) and adds them to the CSV table so that the script will know in the future that those data are already in Wikidata.  The script rewrites the CSV table after every line so that if the script crashes or the API throws an error during a write attempt, one can simply re-start the script after fixing the problem and the script will know not to create duplicate data on the second go-around (since the identifiers for the already-written data have already been added to the CSV).  

I mentioned near the end of my previous post that I don't have any way to record whether labels, descriptions, and qualifiers have already been written, since URI identifiers aren't generated for them. The lack of URI identifiers means that one can't refer to those particular assertions directly by URI in a SPARQL query. Instead, one must make a query asking explicitly for the value of the label, description, or qualifier and then determine whether it's the same as the value in the CSV table. The way the script currently works, prior to creating JSON to send to the API, the script sends a SPARQL query asking for the values of labels and descriptions of all of the entities in the table (lines 465 and 515). Then, as the script processes each line of the table, it checks whether the value in the CSV is the same as what's already in Wikidata (in which case it does nothing) or different (in which case it writes the new value from the CSV, overwriting the value in Wikidata).

It is important to understand this behavior, because if the CSV table is "stale" and has not been updated for a long time, other users may have improved the labels or descriptions.  Running the script with the stale values will effectively revert their improvements. So it's important to update the CSV file with current values before running this script that writes to the API.  After updating, then you can manually change any labels or descriptions that are unsatisfactory.  

In the future, I plan to write additional scripts for managing labels and aliases, so this crude management system will hopefully be improved.

Cleaning up missing references

In some cases, other Wikidata contributors have already made statements about pre-existing Vanderbilt employee items. For example, someone may have already asserted that the Vanderbilt employee's employer was Vanderbilt University. In such cases, the primary API writing script will do nothing with those statements because it is not possible to write a reference as part of the wbeditentity API method without also writing its parent statement. So I had to create a separate script that is a hack of the primary script in order to write the missing references. I won't describe that script here because its operation is very similar to the main script. The main difference is that it uses the wbsetreference API method that is able to directly write a reference given a statement identifier. After running the main script, I run the cleanup script until all of the missing references have been added.

Timing issues

Maxlag

One of the things that I mentioned in my original post on writing data to Wikidata was that when writing to the "real" Wikidata API (vs. the test API or your own Wikibase instance) it's important to respect the maxlag parameter.

You can set the value of the maxlag parameter in line 381. The recommended value is 5 seconds.  A higher maxlag value is more aggressive and a lower maxlag value is "nicer" but means that you are willing to be told more often by the API to wait. The value of maxlag you have chosen is added to the parameters sent to the API in line 764 just before the POST operation.  

The API lag is the average amount of time between when a user requests an operation and the API is able to honor that request.  At times of low usage (e.g. nighttime in the US and Europe), the lag may be small, but at times of high usage, the lag can be over 8 seconds (I've seen it go as high as 12 seconds). If you set maxlag to 5 seconds, you are basically telling the server that if the lag gets longer than 5 seconds, ignore your request and you'll try again later.  The server tells you to wait by responding to your POST request with a response that contains a maxlag error code and the amount of time the server is lagged.  This error is handled in line 315 of the script.  When a lag error is detected, the recommended practice is to wait at least 5 seconds before retrying.
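The handling amounts to a retry loop along these lines (a sketch only; the script's actual code and error handling differ in their details):

import time
import requests

def post_to_api(session, api_url, parameters, maxlag=5, max_retries=10):
    # Sketch of the maxlag retry pattern; not the script's exact code.
    parameters['maxlag'] = maxlag
    for _ in range(max_retries):
        response = session.post(api_url, data=parameters)
        data = response.json()
        if data.get('error', {}).get('code') == 'maxlag':
            # The server is too lagged; wait at least the recommended
            # 5 seconds (or the Retry-After value, if given) and try again.
            time.sleep(max(int(response.headers.get('Retry-After', 5)), 5))
            continue
        return data
    return data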

Bot flags

I naïvely believed that if I respected maxlag errors that I'd be able to write to the API as fast as conditions allowed. However, the very first time I used the VanderBot script to write more than 25 records in a row, I was blocked by the API as a potential spammer with the message "As an anti-abuse measure, you are limited from performing this action too many times in a short space of time, and you have exceeded this limit. Please try again in a few minutes." Clearly my assumption was wrong.  Through trial and error, I determined that a write rate of one second per write was too fast and would result in being temporarily blocked, but a rate of two seconds per write was acceptable. So to handle cases when maxlag was not invoked, I put a delay of 2 seconds on the script (line 822).

I had several hypotheses about the cause of the blocking. One possible reason was because I didn't have a bot flag. (More on that later.) Another reason might be because I was running the script from my local computer rather than from PAWS. PAWS is a web-based interactive programming and publishing environment based on Jupyter notebooks. At Wikicon North America, I had an interesting and helpful conversation with Dominic Byrd-McDevitt of the National Archives who showed me how he used PAWS to publish NARA metadata to Wikidata via a PAWS-based system using Pywikibot. I don't think he had a bot flag and I think his publication rate was faster than one write per second.  But I really didn't want to take the time to test this hypothesis by converting my script over to PAWS (which would require more experimentation with authentication). So I decided to make a post to Wikitech-l and see if I could get an answer. 

I quickly got a helpful answer that confirmed that neither using PAWS nor Pywikibot should have any effect on the rate limit. If I had a bot flag, I might gain the "noratelimit" right, which might bypass rate limiting in many cases. 

Bot flags are discussed here. In order to get a bot flag, one must detail the task that the bot will perform, then demonstrate by a test run of 50 to 250 edits that the bot is working correctly. When I was at Wikicon NA, I asked some of the Powers That Be whether it was important to get a bot flag if I was not running an autonomous bot. They said that it wasn't so important if I was monitoring the writing process. It would be difficult to "detail the task" that VanderBot will perform since it's just a general-purpose API writing script, and what it writes will depend on the CSV file and the JSON mapping schema. 

In the end, I decided to just forget about getting a bot flag for now and keep the rate at 2 seconds per write. I usually don't write more than 50-100 edits in a session and often the server will be lagged anyway requiring me to wait much longer than 2 seconds. If VanderBot's task becomes more well-defined and autonomous, I might request a bot flag at some point in the future.

Query Service Updater lag

One of the principles upon which VanderBot is built is that data are written to Wikidata by POSTing to the API, but that the status of data in Wikidata is determined by SPARQL queries of the Query Service. That is a sound idea, but it has one serious limitation. Data that are added through either the API or the human GUI do not immediately appear in the graph database that supports the Query Service. There is a delay, known as the Updater lag, between the time of upload and the time of availability at the Query Service. We can gain a better understanding by looking at the Query Service dashboard.

Here's a view of the lag time on the day I wrote this post (2020-02-03):


The first thing to notice is that there isn't just one query service. There are actually seven servers running replicas of the Query Service that handle the queries. They are all being updated constantly with data from the relational database connected to the API, but since the updating process has to compete with the queries being run, some servers cannot keep up with the updates and lag by as much as 10 hours. Other servers have lag times of less than one minute. So depending on the luck of the draw of which server handles your query, data that you wrote to the API may be visible via SPARQL in a few seconds or in half a day.


A practical implication of this is that if VanderBot updates its CSV record using SPARQL, the data could be as much as half a day out of date. Normally that isn't a problem, since the data I'm working with don't change much, and once I write new data, I usually don't mess with them for days. However, since the script depends on a SPARQL query to determine whether the labels and descriptions in the CSV differ from what's already in Wikidata, there can be problems if the script crashes halfway through the rows of the CSV. If I fix the problem and immediately re-run the script, a lagged Query Service will respond to the query as if the labels and descriptions that I successfully wrote a few moments earlier were still in their previous state. That will cause VanderBot to attempt to re-write those labels and descriptions. Fortunately, if the API detects that a write operation is trying to set the value of a label or description to the value it already has, it does nothing. So generally, no harm is done.

This lag is why I use the response JSON sent from the API after a write to update the CSV rather than depending on a separate SPARQL query to make the update. Because the data in the response JSON comes directly from the API and not the Query Service, it is not subject to any lag.

Summary


The API writing script part of VanderBot does the following:
  1. Reads the JSON mapping schema to determine the meaning of the CSV table columns.
  2. Reads in the data from the CSV table.
  3. Sorts out the columns by type of data (label, alias, description, property).
  4. Constructs snak JSON for any new data items that need to be written.
  5. Checks new statements for references and qualifiers by looking at columns associated with the statement properties, then creates snak JSON for references or qualifiers as needed.
  6. Inserts the constructed JSON object into the required parameter dictionary for the wbeditentity API method.
  7. POSTs to the Wikidata API via HTTP.
  8. Parses the response JSON from the API to discover the identifiers of newly created data items.
  9. Inserts the new identifiers into the table and writes the CSV file.
In the final post of this series, I'll describe how the data harvesting script part of VanderBot works.

VanderBot part 2: The Wikibase data model and Wikidata identifiers


The Wikidata GUI and the Wikibase model

To read part 1 of this series, see this page.

If you've edited Wikidata using the human-friendly graphical user interface (GUI), you know that items can have multiple properties, each property can have multiple values, each property/value statement can be qualified in multiple ways, each property/value statement can have multiple references, and each reference can have multiple statements about that reference. The GUI keeps this tree-like proliferation of data tidy by collapsing the references and organizing the statements by property.


This organization of information arises from the Wikibase data model (summarized here, in detail here). For those unfamiliar with Wikibase, it is the underlying software system that Wikidata is built upon. Wikidata is just one instance of Wikibase and there are databases other than Wikidata that are built on the Wikibase system. All of those databases built on Wikibase will have a GUI that is similar to Wikidata, although the specific items and properties in those databases will be different from Wikidata.

To be honest, I found working through the Wikibase model documentation a real slog. (I was particularly mystified by the obscure term for basic assertions: "snak". Originally, I thought it was an acronym, but later realized it was an inside joke: a snak is "small, but more than a byte".) But understanding the Wikibase model is critical for anyone who wants to either write to the Wikidata API or query the Wikidata Query Service, and I wanted to do both. So I dug in.

The Wikibase model is an abstract model, but it is possible to represent it as a graph model. That's important because that is why the Wikidata dataset can be exported as RDF and made queryable by SPARQL in the Wikidata Query Service. After some exploration of Wikidata using SPARQL and puzzling over the data model documentation, I was able to draw out the major parts of the Wikibase model as a graph model. It's a bit too much to put in a single diagram, so I made one that showed references and another that showed qualifiers (inserted later in the post). Here's the diagram for references:


Note about namespace prefixes: the exact URI for a particular namespace abbreviation will depend on the Wikibase installation. The URIs shown in the diagrams are for Wikidata. A generic Wikibase instance will contain wikibase.svc as its domain name in place of www.wikidata.org, and other instances will use other domain names. However, the namespace abbreviations shown above are used consistently among installations, and when querying via the human-accessible Query Service or via HTTP, the standard abbreviations can be used without declaring the underlying namespaces. That's convenient because it allows code based on the namespace abbreviations to be generic enough to be used for any Wikibase installation. 

In the next several sections, I'm going to describe the Wikibase model and how Wikidata assigns identifiers to different parts of it. This will be important in deciding how to track data locally. Following that, I'll briefly describe my strategy for storing those data.

Item identifiers

The subject item of a statement is identified by a unique "Q" identifier. For example, Vanderbilt University is identified by Q29052 and the researcher Antonis Rokas is identified by Q42352198. We can make statements by connecting subject and object items with a defined Wikidata property. For example, the property P108 ("employer") can be used to state that Antonis Rokas' employer is Vanderbilt University: Q42352198 P108 Q29052. When the data are transferred from the Wikidata relational database backend fed by the API to the Blazegraph graph database backend of the Query Service, the "Q" item identifiers and "P" property identifiers are turned into URIs by appending the appropriate namespace (wd:Q42352198 wdt:P108 wd:Q29052.)

We can check this out by running the following query at the Wikidata Query Service:

SELECT DISTINCT ?predicate ?object WHERE {
  wd:Q42352198 ?predicate ?object.
  }

This query returns all of the statements made about Antonis Rokas in Wikidata.

Statement identifiers

In order to be able to record further information about a statement itself, each statement is assigned a unique identifier in the form of a UUID. The UUID is generated at the time the statement is first made. For example, the particular statement above (Q42352198 P108 Q29052) has been assigned the UUID FB9EABCA-69C0-4CFC-BDC3-44CCA9782450. In the transfer from the relational database to Blazegraph, the namespace "wds:" is prepended and, for some reason, the subject Q ID plus a dash is also prepended to the UUID. So our example statement would be identified with the URI wds:Q42352198-FB9EABCA-69C0-4CFC-BDC3-44CCA9782450. If you look at the results from the query above, you'll see

p:P108 wds:Q42352198-FB9EABCA-69C0-4CFC-BDC3-44CCA9782450

as one of the results.

We can ask what statements have been made about the statement itself by using a similar query, but with the statement URI as the subject:

SELECT DISTINCT ?predicate ?object WHERE {
  wds:Q42352198-FB9EABCA-69C0-4CFC-BDC3-44CCA9782450 ?predicate ?object.
  }

One important detail relates to case insensitivity. UUIDs are supposed to be output as lowercase, but they are supposed to be case-insensitive on input. So in theory, a UUID should represent the same value regardless of the case. However, in the Wikidata system the generated identifier is just a string and that string would be different depending on the case. So the URI

wds:Q42352198-FB9EABCA-69C0-4CFC-BDC3-44CCA9782450

is not the same as the URI

wds:Q42352198-fb9eabca-69c0-4cfc-bdc3-44cca9782450

(Try running the query with the lower case version to convince yourself that this is true.) Typically, the UUIDs generated in Wikidata are upper case, but there are some that are lower case. For example, try

wds:Q57756352-4a25cee4-45bc-63e8-74be-820454a8b7ad

in the query. Generally it is safe to assume that the "Q" in the Q ID is upper case, but I've discovered at least one case where the Q is lower case.

Reference identifiers

If a statement has a reference, that reference will be assigned an identifier based on a hash algorithm. Here's an example: f9c309a55265fcddd2cb0be62a530a1787c3783e. The reference hash is turned into a URL by prepending the "wdref:" namespace. Statements are linked to references by the property prov:wasDerivedFrom. We can see an example in the results of the previous query:

prov:wasDerivedFrom wdref:8cfae665e8b64efffe44128acee5eaf584eda3a3

which shows the connection of the statement wds:Q42352198-FB9EABCA-69C0-4CFC-BDC3-44CCA9782450 (which states wd:Q42352198 wdt:P108 wd:Q29052.) to the reference wdref:8cfae665e8b64efffe44128acee5eaf584eda3a3 (which states "reference URL http://orcid.org/0000-0002-7248-6551 and retrieved 12 January 2019"). We can see this if we run a version of the previous query asking about the reference statement:

SELECT DISTINCT ?predicate ?object WHERE {
  wdref:8cfae665e8b64efffe44128acee5eaf584eda3a3 ?predicate ?object.
  }

As far as I know reference hashes seem to be consistently recorded in all lower case.

Reference identifiers are different from statement identifiers in that they denote the reference itself, and not a particular assertion of the reference. That is, they do not denote "statement prov:wasDerivedFrom reference", only the reference.  (In contrast, statement identifiers denote the whole statement "subject property value".) That means that any statement whose reference has exactly the same asserted statements will have the same reference hash (and URI). 

We can see that reference URIs are shared by multiple statements using this query:

SELECT DISTINCT ?statement WHERE {
  ?statement prov:wasDerivedFrom wdref:f9c309a55265fcddd2cb0be62a530a1787c3783e.
  }

Identifier examples

The following part of a table that I generated for Vanderbilt researchers shows examples of the identifiers I've described above.


We see that each item (researcher) has a unique Q ID, that each statement that the researcher is employed at Vanderbilt University (Q29052) has a unique UUID (some upper case, some lower case), and that more than one statement can share the same reference (having the same reference hash).

Statement qualifiers

In addition to linking references to a statement, the statements can also be qualified. For example, Brandt Eichman has worked at Vanderbilt since 2004.


Here's a diagram showing how the qualifier "start time 2004" is represented in Wikidata's graph database:


We can see that qualifiers are handled a little differently from references. If the qualifier property (in this case P580, "since") has a simple value (literal or item), the value is linked to the statement instance using the pq: namespace version of the property. 

If the qualifier has a complex value (e.g. a date), that value is assigned a hash and is linked to the statement instance using the pqv: version of the property. When the data are transferred to the graph database, the wdv: namespace is prepended to the hash.

Because dates are complex, the qualifier "since" requires a non-literal value in addition to a literal value linked by the pq: version of the property (see this page for more on the Wikibase date model). We can use this query:

SELECT DISTINCT ?property ?value WHERE {
  wdv:849f00455434dc418fb4287a4f2b7638 ?property ?value.
  }

to explore the non-literal date instance.  In Wikidata, all dates are represented as full XML Schema dateTime values (year, month, day, hour, minute, second, timezone). In order to differentiate between the year "2004" and the date 1 January 2004 (both can be represented in Wikidata by the same dateTime value), the year 2004 is assigned a timePrecision of 9 and the date 1 January 2004 is assigned a timePrecision of 11.
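For reference, the two values differ only in their precision field. This sketch of the Wikibase time value structure (based on the date model documentation linked above) shows the idea:

# Both the year 2004 and the date 1 January 2004 carry the same dateTime
# string; only the precision distinguishes them (9 = year, 11 = day).
year_2004 = {
    'time': '+2004-01-01T00:00:00Z',
    'timezone': 0,
    'before': 0,
    'after': 0,
    'precision': 9,
    'calendarmodel': 'http://www.wikidata.org/entity/Q1985727'  # proleptic Gregorian
}
date_2004_01_01 = dict(year_2004, precision=11)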

Not every qualifier has a complex value. For example, the property "series ordinal" (P1545; used to indicate things like the order in which authors are listed) has only literal values (integer numbers). So there are values associated with pq:P1545, but not pqv:P1545. The same is true for "language of work or name" (P407; used to describe websites, songs, books, etc.), whose values are items like Q1860 (English).

Labels, aliases, and descriptions

Labels, aliases, and descriptions are properties of items that are handled differently from other properties in Wikidata. Labels and descriptions are handled in a similar manner, so I will discuss them together.

Each item in Wikidata can have only one label and one description in any particular language. Therefore adding or changing a label or description requires specifying the appropriate ISO 639-1 code for the intended language.  When a label or description is changed in Wikidata, the previous version is replaced.

One important restriction is that the label/description combination in a particular language must be unique. For example, the person with the English label "John Jones" and English description "academic" can currently only be Q16089943. Because labels and descriptions can change, this label/description combination won't necessarily be permanently associated with Q16089943, because someone might give that John Jones a more detailed description, or make his name less generic by adding a middle name or initial. So at some point in the future, it might be possible for some other John Jones to be described as "academic".  An implication of the prohibition against two items sharing the same label/description pair is that it's better to create labels and descriptions that are as specific as possible, to avoid collisions with pre-existing entities. As more entities get added to Wikidata, the probability of such collisions increases.

There is no limit to the number of aliases that an item can have per language. Aliases can be changed by either changing the value of a pre-existing alias or adding a new alias. As far as I know, there is no prohibition about aliases of one item matching aliases of another item.

When these statements are transferred to the Wikidata graph database, labels are values of rdfs:label, descriptions are values of schema:description, and aliases are values of skos:altLabel. All of the values are language-tagged.

What am I skipping?

Another component of the Wikibase model that I have not discussed is ranks. I also haven't talked about statements that don’t have values (PropertyNoValueSnak and PropertySomeValueSnak), and sitelinks. These are features that may be important to some users, but have not yet been important enough to me to incorporate handling them in my code. 

Local data storage

If one wanted to make and track changes to Wikidata items, there are many ways to accomplish that with varying degrees of human intervention.  Last year, I spent some time pondering all of the options and came up with this diagram:


Tracking every statement, reference, and qualifier for items would be complicated because each item could have an indefinite number and kind of properties, values, references, and qualifiers.  To track all of those things would require a storage system as complicated as Wikidata itself (such as a separate relational database or a Wikibase instance, as shown at the bottom of the diagram). That's way beyond what I'm interested in doing now. But what I learned about the Wikibase model and how data items are identified suggested to me a way to track all of the data that I care about in a single, flat spreadsheet. That workflow can be represented by this subset of the diagram above:


I decided on the following structure for the spreadsheet (a CSV file; an example is here). The Wikidata Q ID serves as the key for an item, and the data in a row are about a particular item. A value in the Wikidata ID column indicates that the item already exists in Wikidata. If the Wikidata ID column does not have a value, the item needs to be created.

Each statement has a column representing the property with the value of that property for an item recorded in the cell for that item's row.  For each property column, there is an associated column for the UUID identifying the statement consisting of the item, property, and value. If there is no value for a property, no information is available to make that statement. If there is a value and no UUID, then the statement needs to be asserted. If there is a value and a UUID, the statement already exists in Wikidata.  
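That bookkeeping rule can be expressed in a few lines. In this sketch, the column names ("P108" for the value, "P108_uuid" for the statement identifier) and the file name are hypothetical; the real CSV's column names are whatever the mapping schema says they are.

import csv

def statement_status(row, prop):
    # Hypothetical column naming: the value column is named after the property
    # and the statement UUID column adds a '_uuid' suffix.
    value = row.get(prop, '')
    uuid = row.get(prop + '_uuid', '')
    if not value:
        return 'no information available to make the statement'
    if not uuid:
        return 'statement needs to be asserted in Wikidata'
    return 'statement already exists in Wikidata'

with open('researchers.csv', newline='', encoding='utf-8') as csv_file:
    for row in csv.DictReader(csv_file):
        item_status = 'exists' if row.get('wikidataId') else 'needs to be created'
        print(row.get('labelEn', ''), '| item', item_status,
              '|', statement_status(row, 'P108'))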

References consist of one or more columns representing the properties that describe the reference. References have a single column to record the hash identifier for the reference.  As with statements, if the identifier is absent, that indicates that the reference needs to be added to Wikidata. If the identifier is present, the reference has already been asserted.  

Because labels, descriptions, and many qualifiers do not have URIs assigned as their identifiers, their values are listed in columns of the table without corresponding identifier columns.  Knowing whether those labels, descriptions, and qualifiers already exist in Wikidata requires making a SPARQL query to find out. That process is described in the fourth blog post.

Where does VanderBot come in?

In the first post of this series, I showed a version of the following diagram to illustrate how I wanted VanderBot (my Python script for loading Vanderbilt researcher data into Wikidata) to work. That diagram is basically an elaboration of the simpler previous diagram.


The part of the workflow circled in green is the API writing script that I will describe in the third post of this series (the next one). The part of the workflow circled in orange is the data harvesting script that I will describe in the fourth post. Together these two scripts form VanderBot in its current incarnation.

Discussing the scripts in that order may seem a bit backwards because when VanderBot operates, the data harvesting script works before the API writing script. But in developing the two scripts, I needed to think about how I was going to write to the API before I thought about how to harvest the data. So it's probably more sensible for you to learn about the API writing script first as well. Also, the design of the API writing script is intimately related to the Wikidata data model, so that's another reason to talk about it next after this post.