In the previous blog post, I described how I used a Python
script to upload data stored in a CSV spreadsheet to Wikidata via the Wikidata API. I noted that the spreadsheet
contained information about whether data were already in Wikidata and whether they needed to be written to the API, but I did not say how I acquired those data, nor how I
determined whether they needed to be uploaded or not. That data acquisition and processing is the topic of this post.
The overall goal of the VanderBot project is to enter data
about Vanderbilt employees (scholars and researchers) and their academic publications into Wikidata.
Thus far in the project, I have focused primarily on acquiring and uploading data about the
employees. The data acquisition process has three stages:
1. Acquiring the names of research employees (faculty,
postdocs, and research staff) in departments of the university.
2. Determining whether those employees were already present
in Wikidata or if items needed to be created for them.
3. Generating data required to make key statements about the
employees and determining whether those statements (and associated references) had
already been asserted in Wikidata.
The data harvesting script (coded in Python) that carries out these processes is available as a Jupyter notebook on GitHub.
Acquire names of research employees at Vanderbilt
Scrape departmental website
I've linked employees to Vanderbilt through
their departmental affiliations. Therefore, the first task was to create items
for departments in the various schools and colleges of Vanderbilt University. I
won't go into detail about that process other than to say that the hacky code I
used to do it is on GitHub.
The actual names of the employees were acquired by scraping departmental
faculty and staff web pages. I developed the scraping script based on the web
page of my old department, biological sciences. Fortunately, the same page template was used
by many other departments in both the College of Arts and Sciences and the
Peabody College of Education, so I was able to scrape about 2/3 of the
departments in those schools without modifying the script I developed for the biological sciences department.
Because the departments had differing numbers of researcher
pages covering different categories of researchers, I created a JSON
configuration file where I recorded the base departmental URLs and the strings
appended to that base to generate each of the researcher pages. The
configuration file also included some other data needed by the script, such as
the department's Wikidata Q ID, a generic description to use for researchers
in the department (if they didn’t already have a description), and some strings
that I used for fuzzy matching with other records (described later).
Some sample JSON is included in the comments near the top of the script.
The result at the end of the "Scrape departmental
website" section of the code was a CSV file with the researcher names and
some other data that I made a feeble attempt to scrape, such as their title and
affiliation.
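To give a sense of what that scraping looks like, here is a minimal sketch using requests and BeautifulSoup. The URL, the CSS class, and the output columns are placeholders of my own, not the actual structure of any Vanderbilt departmental page template:

import csv
import requests
from bs4 import BeautifulSoup

# Hypothetical example page; the real selectors depend on the departmental page template.
page_url = 'https://as.vanderbilt.edu/exampledepartment/people/'
soup = BeautifulSoup(requests.get(page_url).text, 'html.parser')

rows = []
# Assume each person's name appears in an element with a (made-up) "people-name" class.
for element in soup.find_all(class_='people-name'):
    rows.append({'name': element.get_text(strip=True), 'title': '', 'affiliation': ''})

with open('employees.csv', 'w', newline='', encoding='utf-8') as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=['name', 'title', 'affiliation'])
    writer.writeheader()
    writer.writerows(rows)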
Search ORCID for Vanderbilt employees
ORCID (Open Researcher and Contributor ID) plays an important part in disambiguating employees. Because ORCIDs are globally unique, associating an employee name with an ORCID allows one to know that the employee is different from someone with the same name who has a different ORCID.
For that reason, I began the disambiguation process by
performing a search for "Vanderbilt University" using the ORCID API. The search produced several thousand results. I then dereferenced each of the
resulting ORCID URIs to capture the full data about the researcher. That required an API call for each record and I used a quarter second delay per call
to avoid hitting the API too fast. As a result, this stage of the process took
hours to run.
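The general pattern of that stage looks something like this sketch, which assumes the ORCID public API (v3.0) and its affiliation-org-name search field; the real script also has to page through the several thousand results:

import time
import requests

headers = {'Accept': 'application/json'}
params = {'q': 'affiliation-org-name:"Vanderbilt University"', 'rows': 200}
search_results = requests.get('https://pub.orcid.org/v3.0/search/',
                              headers=headers, params=params).json()

records = []
for result in search_results['result']:
    orcid_id = result['orcid-identifier']['path']
    # Dereference each ORCID URI to get the full record for that researcher.
    record = requests.get('https://pub.orcid.org/v3.0/' + orcid_id + '/record',
                          headers=headers).json()
    records.append(record)
    time.sleep(0.25)  # quarter-second delay to avoid hitting the API too fast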
I screened the results by recording only those that listed
"Vanderbilt University" as part of the employments affiliation
organization string. That excluded people who were only students and never
employees, and included people whose affiliation was "Vanderbilt
University Medical Center", "Vanderbilt University School of
Nursing", etc. As part of the data recorded, I included their stated
departmental affiliations (some had multiple affiliations if they moved from
one department to another during their career). After this stage, I had 2240 name/department
records.
Fuzzy matching of departmental and ORCID records
The next stage of the process was to try to match employees from the department that I was processing with the downloaded ORCID records. I used a Python fuzzy string matching function called fuzz.token_set_ratio() from the fuzzywuzzy package. I tested this function along with others in the package and it was highly effective at matching names with minor variations (both people and departmental names). Because this function was insensitive to word order, it matched names like "Department of Microbiology" and "Microbiology Department". However, it also made major errors for name order reversals ("John James" and "James Johns", for example) so I had an extra check for that.
If the person's name had a match score of greater than 90
(out of 100), I then performed a match check against the listed department. If it
also had a match score of greater than 90, I assigned that ORCID to the person.
If no listed department match had a score over 90, I assigned the ORCID, but flagged
that match for manual checking later.
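Stripped of the bookkeeping, that matching rule looks roughly like the sketch below. The record structure (a dict with name, departments, and orcid keys) is my own simplification, not the script's actual data structure:

from fuzzywuzzy import fuzz

def match_orcid_record(employee_name, employee_dept, orcid_records):
    """Return (orcid, needs_manual_check) for the first acceptable match."""
    for record in orcid_records:
        # token_set_ratio ignores word order, so "Department of Microbiology"
        # and "Microbiology Department" score 100.
        if fuzz.token_set_ratio(employee_name, record['name']) <= 90:
            continue
        dept_scores = [fuzz.token_set_ratio(employee_dept, dept)
                       for dept in record['departments']]
        if dept_scores and max(dept_scores) > 90:
            return record['orcid'], False   # confident match
        return record['orcid'], True        # name matched; flag for manual checking
    return None, False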
Determine whether employees were already in Wikidata
Attempt automated matching with people in Wikidata known to work at Vanderbilt
I was then ready to start trying to match people with existing
Wikidata records. The low-hanging fruit was people whose records already stated
that their employer was Vanderbilt University (Q29052). I ran a SPARQL query for that using the Wikidata Query Service. For each match, I also recorded the employee's
description, ORCID, start date, and end date (where available).
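A query along the following lines (not necessarily the exact one in the script) pulls down that pool of candidates, and it can be run from Python against the Query Service endpoint:

import requests

endpoint = 'https://query.wikidata.org/sparql'
query = '''
SELECT DISTINCT ?person ?personLabel ?description ?orcid ?startDate ?endDate WHERE {
  ?person p:P108 ?statement .                    # employer statement
  ?statement ps:P108 wd:Q29052 .                 # employer: Vanderbilt University
  OPTIONAL { ?statement pq:P580 ?startDate . }   # start time qualifier
  OPTIONAL { ?statement pq:P582 ?endDate . }     # end time qualifier
  OPTIONAL { ?person wdt:P496 ?orcid . }         # ORCID iD
  OPTIONAL { ?person schema:description ?description .
             FILTER(lang(?description) = "en") }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
'''
response = requests.get(endpoint, params={'query': query, 'format': 'json'},
                        headers={'User-Agent': 'VanderBotExample/0.1'})
candidates = response.json()['results']['bindings']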
Once I had those data, I checked each departmental employee's
record against the query results. If both the departmental employee and the
potential match from Wikidata had the same ORCID, then I knew that they were
the same person and I assigned the Wikidata Q ID to that employee. If the
employee had an ORCID, I could exclude any Wikidata records with non-matching
ORCIDs and only check for name matches with Wikidata records that didn't have
ORCIDs. Getting a name match alone was
not a guarantee that the person in Wikidata was the same as the departmental
employee, but given that the pool of possible Wikidata matches only included
people employed at Vanderbilt, a good name match meant that it was probably the same person. If the person had
a description in Wikidata, I printed the two names and the description and
visually inspected the matches. For example, if there was a member of the
Biological Sciences department named Jacob Reynolds and someone in Wikidata
named Jacob C. Reynolds who was a microbiologist, the match was probably good.
On the other hand, if Jacob C. Reynolds was a historian, then some manual
checking was in order. I did a few
other tricks that you can see in the code.
This "smart matching" with minimal human intervention was usually able to match a
small fraction of people in the department. But
there were plenty of departmental employees who were already in Wikidata
without any indication that they worked at Vanderbilt. The obvious way to look
for them would be to just do a SPARQL query for their name. There are some
features built in to SPARQL that allow for REGEX checks, but those features are
impossibly slow for a triplestore the size of Wikidata's. The strategy that I
settled on was to generate as many variations of the person's name as possible
and query for all of them at once. You can see what I did in the generateNameAlternatives()
function in the code. I searched labels and aliases for: the full name, names with middle
initials with and without periods, first and middle initials with and without
periods, etc. This approach was pretty good at matching with the right people,
but it also matched with a lot of wrong people. For example, for Jacob C.
Reynolds, I would also search for J. C. Reynolds. If John C. Reynolds had J. C.
Reynolds as an alias, he would come up as a hit. I could have tried to automate the processing of the returned names more, but there usually weren't a lot of matches and with
the other screening criteria I applied, it was pretty easy for me to just look at the results and
bypass the false positives.
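The generateNameAlternatives() function in the notebook is more thorough than this, but the idea is roughly the following (the signature and the particular variants here are my own simplification):

def generate_name_alternatives(given, middle, family):
    """Simplified sketch: generate name variants to search as labels and aliases."""
    # Example comments assume given='Jacob', middle='Charles', family='Reynolds'.
    alternatives = {given + ' ' + family}                              # Jacob Reynolds
    alternatives.add(given[0] + '. ' + family)                         # J. Reynolds
    if middle:
        alternatives.add(given + ' ' + middle + ' ' + family)          # Jacob Charles Reynolds
        alternatives.add(given + ' ' + middle[0] + '. ' + family)      # Jacob C. Reynolds
        alternatives.add(given + ' ' + middle[0] + ' ' + family)       # Jacob C Reynolds
        alternatives.add(given[0] + '. ' + middle[0] + '. ' + family)  # J. C. Reynolds
        alternatives.add(given[0] + middle[0] + ' ' + family)          # JC Reynolds
    return sorted(alternatives)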
When I did the query for the name alternatives, I downloaded the values for several
properties that were useful for eliminating hits. One important screen was to
eliminate any matching items that were instances of classes (P31) other than
human (Q5). I also screened out people who were listed as having died prior to
some set date (2000 worked well - some departments still listed recently
deceased emeriti and I didn't want to eliminate those). If both the employee and the name match in
Wikidata had ORCIDs that were different, I also eliminated the hit. For all matches that passed these screens, I
printed the description, occupation, and employer if they were given in
Wikidata.
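Taken together, those screens amount to something like this sketch (the dictionary keys are placeholders of mine, standing in for whatever the query returns):

def passes_screens(employee, candidate):
    """Return True if a Wikidata name-match candidate survives the screens."""
    # Keep only items that are instances of human (Q5).
    if candidate.get('instance_of') and candidate['instance_of'] != 'Q5':
        return False
    # Screen out people who died before 2000 (recently deceased emeriti are kept).
    # Assumes an ISO-style date string, possibly with a leading "+".
    death_date = candidate.get('death_date')
    if death_date and int(death_date.lstrip('+')[:4]) < 2000:
        return False
    # If both records have ORCIDs and they differ, it is not the same person.
    if employee.get('orcid') and candidate.get('orcid') \
            and employee['orcid'] != candidate['orcid']:
        return False
    return True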
Clues from publications in PubMed and Crossref
The other powerful tool I used for disambiguation was to look up any articles linked to the putative Wikidata match. For each Wikidata person item who made it this far through the screen, I did a SPARQL query to find works authored by that person. For up to 10 works, I did the following. If the article had a PubMed ID, I retrieved the article metadata from the PubMed API and tried to match against the author names. When I got a match with an author, I checked for an ORCID match (or excluded if an ORCID mismatch) and also for a fuzzy match against any affiliation that was given. If either an ORCID or affiliation matched, I concluded that the departmental employee was the same as the Wikidata match and stopped looking.
If there was no match in PubMed and the article had a DOI, I
then retrieved the metadata about the article from the CrossRef API and did the same
kind of screening that I did in PubMed.
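For the Crossref part, the check looks roughly like this sketch against the public api.crossref.org works endpoint (the real script also compares author ORCIDs when they are present, and the PubMed check follows the same pattern using the NCBI E-utilities):

import requests
from fuzzywuzzy import fuzz

def affiliation_supports_match(doi, employee_name, employee_affiliation):
    """Return True if a Crossref author record supports the putative Wikidata match."""
    response = requests.get('https://api.crossref.org/works/' + doi)
    if response.status_code != 200:
        return False
    for author in response.json()['message'].get('author', []):
        author_name = author.get('given', '') + ' ' + author.get('family', '')
        if fuzz.token_set_ratio(employee_name, author_name) <= 90:
            continue
        # Fuzzy-match the stated affiliation, if Crossref provides one.
        for affiliation in author.get('affiliation', []):
            if fuzz.token_set_ratio(employee_affiliation, affiliation.get('name', '')) > 90:
                return True
    return False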
Human intervention
If there was no automatic match via the article searches, I printed out the full set of information (description, employer, articles, etc.) for every name match, along with the name from the department and the name from Wikidata, so that a human could check whether any of the matches seemed plausible. In a lot of cases, it was easy to eliminate matches that had descriptions like "Ming Dynasty person" or occupation = "golfer". If there was uncertainty, the script printed hyperlinked Wikidata URLs and I could just click on them to examine the Wikidata record manually. Here's some typical output:
--------------------------
No Wikidata name match: Justine Bruyère
--------------------------
No Wikidata name match: Nicole Chaput Guizani
--------------------------
SPARQL name search: Caroline Christopher
(no ORCID)
0 Wikidata ID: Q83552019 Name variant: Caroline Christopher https://www.wikidata.org/wiki/Q83552019
No death date given.
description: human and organizational development educator
employer: Vanderbilt University
No articles authored by that person
Employee: Caroline Christopher vs. name variant: Caroline Christopher
Enter the number of the matched entity, or press Enter/return if none match: 0
--------------------------
SPARQL name search: Paul Cobb
(no ORCID)
0 Wikidata ID: Q28936750 Name variant: Paul Cobb https://www.wikidata.org/wiki/Q28936750
No death date given.
description: association football player
occupation: association football player
No articles authored by that person
Employee: Paul Cobb vs. name variant: Paul Cobb
1 Wikidata ID: Q55746009 Name variant: Paul Cobb https://www.wikidata.org/wiki/Q55746009
No death date given.
description: American newspaper publisher
occupation: newspaper proprietor
No articles authored by that person
Employee: Paul Cobb vs. name variant: Paul Cobb
Enter the number of the matched entity, or press Enter/return if none match:
--------------------------
No Wikidata name match: Molly Collins
--------------------------
No Wikidata name match: Ana Christina da Silva [Iddings]
Although this step did require human intervention, because of the large amount of information that the script collected about the Wikidata matches, it usually only took a few minutes to disambiguate a department with 30 to 50 employees.
Generate statements and references and determine which were already in Wikidata
Generating data for a minimal set of properties
The next to last step was to assign values to a minimal set
of properties that I felt each employee should have in a Wikidata record.
Here's what I settled on for that minimal set:
P31 Q5 (instance of human). This was automatically assigned
to all records.
P108 Q29052 (employer Vanderbilt University). This applies
to all employees in our project - the employer value can be set at the top of
the script.
P1416 [Q ID of department] (affiliation with focal
department). After searching through many possible properties, I decided that
P1416 (affiliation) was the best property to use to assert the employee's
connection to the department I was processing. P108 was also possible, but
there were a lot of people with dual departmental appointments and I generally
didn't know which department was the actual "employer". Affiliation seemed
to be an appropriate connection for regular faculty, postdocs, visiting
faculty, research staff, and other kinds of statuses where the person would
have some kind of research or scholarly output.
P496 [ORCID identifier]. ORCIDs that I'd acquired for the
employees were hard-won and an excellent means for anyone else to carry out
disambiguation, so I definitely wanted to include that assertion if I could.
P21 [sex or gender]. I was really uncomfortable assigning a
value of this property, but this is a property often flagged by Recoin as a top
missing property and I didn't want some overzealous editor deleting my new items because their metadata were too skimpy. Generally, the departmental web pages had photos
to go with the names, so I made a call and manually assigned a value for this
property (options: m=male, f=female, i=intersex, tf=transgender female,
tm=transgender male). Any time the sex or gender seemed uncertain, I did not
provide a value.
The description. I
made up a default description for the department, such as "biological
science researcher", "historian", or "American Studies
scholar" for the Biological Sciences, History, and American Studies
departments respectively. I did not overwrite any existing descriptions by
default, although as a last step I looked at the table to replace stupid ones
like "researcher, ORCID: 0000-0002-1234-5678". These defaults were
generally specific enough to prevent collisions where the label/description
combination I was creating would collide with the label/description combination
for an existing record and kill the record write.
When it made sense, I added references to the statements I
was making. Generally, a reference is not expected for instance of human and I
really couldn't give a reference for sex or gender. For the employer and affiliation references,
I used the web page that I scraped to get their name as the reference URL and
provided the current date as the value for P813 (retrieved). For ORCID, I created a reference that had a
P813 (retrieved) property if I was able to successfully dereference the ORCID
URI.
Because each of these properties had different criteria for
assigning values and references, there was no standard code for assigning them.
The code for each property is annotated, so if you are interested you can look
at it to see how I made the assignments.
Check for existing data in Wikidata
In the earlier posts, I said that I did not want VanderBot to create duplicate items, statements, and references when they already existed in Wikidata.
So a critical last step was to check for existing data using SPARQL. One
important thing to keep in mind is the Query Service Updater lag that I talked
about in the last post. That lag means that changes made up to 8 or 10 hours
ago would not be included in this download. However, given that the Wikidata researcher item records I'm dealing with do not change frequently, the lag generally wasn't a
problem. I should note that it would be possible to get these data directly
from the Wikidata API, but the convenience of getting exactly the information I wanted
using SPARQL outweighed my motivation to develop code to do that.
At this point in the workflow, I've already determined with
a fairly high degree of confidence which of the departmental employees were
already in Wikidata. That takes care of the potential problem of creating duplicate item
records, and it also means that I do not need to check for the presence of
statements or references for any of the new items either.
One interesting feature of SPARQL that I learned from this
project was using the VALUES clause. Despite having used SPARQL for years and
skimming through the SPARQL specification several times, I missed it. The
VALUES clause allows you to specify which values the query should use for a particular
variable in its pattern matching. That
makes querying a large triplestore like Wikidata much faster than without it
and it also reduces the number of results that the code has to sort through
when results come back from the query service. Here's an example of a query
using the VALUES clause that you can test at the Wikidata Query Service:
SELECT DISTINCT ?id ?statement WHERE {
VALUES ?id {
wd:Q4958
wd:Q39993
wd:Q234
}
?id p:P31 ?statement.
}
So the first part of the last step in the workflow is to
generate a list of all of the existing item Q IDs for employees in the
department. That list is passed to the searchStatementAtWikidata() function as
its first argument. searchStatementAtWikidata() is a general purpose function
that will search Wikidata for a particular property of items in the generated list. It can be used either to search for a particular property and value (like
P108 Q29052, employer Vanderbilt University) and retrieve the references for
that statement, or for only the property (like P496, ORCID) and retrieve both
the values and references associated with those statements. This behavior is controlled by whether an
empty string is sent for the value argument or not. For each of the minimal set of properties
that I'm tracking for departmental employees, the searchStatementAtWikidata() function
is used to retrieve any available data for the listed employees. Those data are
then matched with the appropriate employee records and recorded in the CSV file
along with the previously generated property values.
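Although searchStatementAtWikidata() is more involved than this, its core is just assembling a VALUES clause from the Q IDs and asking for the statements and their references. The following sketch (my own simplification, not the function's actual code) illustrates the pattern:

import requests

def search_statement_sketch(qids, prop, value=''):
    """Sketch: find existing statements and reference nodes for a list of items.
    If value is an empty string, the statement values are retrieved as well."""
    values_clause = ' '.join('wd:' + qid for qid in qids)
    if value:
        select_vars = '?id ?statement ?reference'
        object_line = '?statement ps:' + prop + ' wd:' + value + ' .'
    else:
        select_vars = '?id ?statement ?value ?reference'
        object_line = '?statement ps:' + prop + ' ?value .'
    query = ('SELECT DISTINCT ' + select_vars + ' WHERE {\n'
             '  VALUES ?id { ' + values_clause + ' }\n'
             '  ?id p:' + prop + ' ?statement .\n'
             '  ' + object_line + '\n'
             '  OPTIONAL { ?statement prov:wasDerivedFrom ?reference . }\n'
             '}')
    response = requests.get('https://query.wikidata.org/sparql',
                            params={'query': query, 'format': 'json'},
                            headers={'User-Agent': 'VanderBotExample/0.1'})
    return response.json()['results']['bindings']

Called as search_statement_sketch(qids, 'P108', 'Q29052'), it checks for existing employer statements and their references; called as search_statement_sketch(qids, 'P496'), it also retrieves any existing ORCID values.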
In addition to the property checks, labels, descriptions,
and aliases for the list of employees are retrieved via SPARQL queries. In the
cases of labels and descriptions, if there is an existing label or description
in Wikidata, it is written to the CSV file. If there is no existing label, the
name scraped from the departmental website is written to the CSV as the label.
If there is no existing description, the default description for the department
is written to the CSV. Whatever alias lists are retrieved from Wikidata
(including empty ones) are written to the CSV.
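The fill-in rule for the label and description columns boils down to something like this sketch:

def choose_label_and_description(existing_label, existing_description,
                                 scraped_name, default_description):
    """Sketch of the fill-in rule for the label and description CSV columns."""
    label = existing_label if existing_label else scraped_name
    description = existing_description if existing_description else default_description
    return label, description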
Final manual curation prior to writing to the Wikidata API
In theory, the CSV file resulting from the previous step should
contain all of the information needed by the API-writing script that was
discussed in the last post. However, I always manually examine the CSV to look
for problems or things that are stupid such as bad descriptions.
If a description or label is changed, the API-writing script
will detect that it's different from the current value being provided by the
SPARQL endpoint and the new description or label will overwrite the existing
one. The API-writing script is currently not very sophisticated about how it
handles aliases. If there are more aliases in the CSV than are currently in
Wikidata, the script will overwrite existing aliases in Wikidata with those in the
spreadsheet. The assumption is that alias lists are only added to, rather than
aliases being changed or deleted. At
some point in the future, I intend to write a separate script that will handle
labels and aliases in a more robust way, so I really didn't want to waste time
now on making the alias-handling better than it is.
A typical situation is to
discover a more specific label for the person than already exists in Wikidata.
In that case, I usually add the existing label to the alias list, and replace
the label value in the CSV with the better new one. WARNING! If you edit the
alias list, make sure that your editor uses generic straight quotes (ASCII 34/Unicode
U+0022) and not "smart quotes". Smart quotes have a different Unicode value
and will break the script. Open Office/Libre Office (the best applications for editing CSVs in my opinion)
default to smart quotes, so this setting must be turned off manually.
I also just look over the rest of the spreadsheet to
convince myself that nothing weird is going on. Usually the script does an
effective job of downloading the correct reference properties and values, but
I've discovered some odd situations that have caused problems.
At this point, steps 1 and 2 in the VanderBot diagram have
been completed by the data harvesting script, and the API-writing script
described in the last post is ready to take over in step 3. When step 4 is complete, the blank cells in
the CSV for missing item, statement, and reference identifiers should all
be filled in and the CSV can be filed for future reference.
Final thoughts
I tried to make the API writing script generic and adaptable for writing statements and references about any kind of entity. That's achievable simply by editing the JSON schema file that maps the columns in the source CSV. However, getting the values for that CSV is the tricky part. If one were confident that only new items were being written, then the table could be filled with only the data to be written and without any item, statement, or reference identifiers. That would be the case if you were using the script to load your own Wikibase instance. However, for adding data to Wikidata about most items like people or references, one can't know if the data needs to be written or not, and that's why a complex and somewhat idiosyncratic script like the data harvesting script is necessary. So there's no "magic bullet" that will make it possible to automatically know whether you can write data to Wikidata without creating duplicate assertions.
To find records that VanderBot has put into Wikidata, try this query at the Wikidata Query Service:
select distinct ?employee where {
?employee wdt:P1416/wdt:P749+ wd:Q29052.
}
limit 50
The triple pattern requires that the employee first have an affiliation (P1416) to some item, and that item be linked by one or more parent organization (P749) links to Vanderbilt University (Q29052). I linked the department items to their parent school or college using P749 and made sure that the University's schools and colleges were all linked to the University by P749 as well. However, some schools like the Blair School of Music do not really have departments, so their employees were affiliated directly to the school or college rather than a department. So the search has to pick up administrative entity items that were either one or two P749 links from the university (hence the "+" property path operator after P749). Since there are a lot of employees, I limited the results to 50. If you click on any of the results, it will take you to the item page and you can view the page history to confirm that VanderBot has made edits to the page. (At some point, there may be people who were linked in this way by an account other than VanderBot, but thus far, VanderBot is probably the only editor of Vanderbilt employee items that's linking them to departments by P1416, given that I recently created all of the department items from scratch.)
A variation of that query will tell you the number of records meeting the criteria of the previous query:
select (count(?employee) as ?count) where {
?employee wdt:P1416/wdt:P749+ wd:Q29052.
}
As of 2020-02-08, there are 1221 results. That number should grow as I use VanderBot to process other departments.