Saturday, February 8, 2020

VanderBot part 4: Preparing data to send to Wikidata



In the previous blog post, I described how I used a Python script to upload data stored in a CSV spreadsheet to Wikidata via the Wikidata API. I noted that the spreadsheet recorded whether data were already in Wikidata and whether they needed to be written to the API, but I did not say how I acquired those data or how I determined whether they needed to be uploaded. That data acquisition and processing is the topic of this post.

The overall goal of the VanderBot project is to enter data about Vanderbilt employees (scholars and researchers) and their academic publications into Wikidata. Thus far in the project, I have focused primarily on acquiring and uploading data about the employees. The data acquisition process has three stages:

1. Acquiring the names of research employees (faculty, postdocs, and research staff) in departments of the university.

2. Determining whether those employees were already present in Wikidata or if items needed to be created for them.

3. Generating data required to make key statements about the employees and determining whether those statements (and associated references) had already been asserted in Wikidata.

The data harvesting script (written in Python) that carries out these processes is available as a Jupyter notebook on GitHub.



Acquire names of research employees at Vanderbilt


Scrape departmental website

I've linked employees to Vanderbilt through their departmental affiliations. Therefore, the first task was to create items for departments in the various schools and colleges of Vanderbilt University. I won't go into detail about that process other than to say that the hacky code I used to do it is on GitHub.

The actual names of the employees were acquired by scraping departmental faculty and staff web pages. I developed the scraping script based on the web page of my old department, biological sciences. Fortunately, the same page template was used by many other departments in both the College of Arts and Sciences and the Peabody College of Education, so I was able to scrape about 2/3 of the departments in those schools without modifying the script I developed for the biological sciences department.

Because the departments had differing numbers of researcher pages covering different categories of researchers, I created a JSON configuration file in which I recorded the base departmental URLs and the strings appended to each base to generate the researcher pages. The configuration file also included some other data needed by the script, such as the department's Wikidata Q ID, a generic description to use for researchers in the department (if they didn't already have a description), and some strings that I used for fuzzy matching with other records (described later). Some sample JSON is included in the comments near the top of the script.
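As an illustration, a department's entry in such a configuration file might look something like this (the key names and values here are hypothetical; the real format is documented in the comments at the top of the script):

{
  "deptShortName": "bsci",
  "baseUrl": "https://as.vanderbilt.edu/biosci/",
  "pagePaths": ["people/faculty/", "people/postdocs/", "people/research-staff/"],
  "departmentQId": "Q98765432",
  "defaultDescription": "biological science researcher",
  "fuzzyMatchStrings": ["Biological Sciences", "Biology"]
}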

The result at the end of the "Scrape departmental website" section of the code was a CSV file with the researcher names and some other data that I made a feeble attempt to scrape, such as their title and affiliation.


Search ORCID for Vanderbilt employees

ORCID (Open Researcher and Contributor ID) plays an important part in disambiguating employees. Because ORCIDs are globally unique, associating an employee name with an ORCID allows one to know that the employee is different from someone with the same name who has a different ORCID.

For that reason, I began the disambiguation process by performing a search for "Vanderbilt University" using the ORCID API. The search produced several thousand results. I then dereferenced each of the resulting ORCID URIs to capture the full data about the researcher. That required an API call for each record, and I used a quarter-second delay per call to avoid hitting the API too fast. As a result, this stage of the process took hours to run.
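Here is a minimal sketch of that harvesting pattern (not the actual script; it assumes ORCID's v3.0 public API and the requests package):

import requests
import time

HEADERS = {'Accept': 'application/json'}

# Search for records that mention Vanderbilt University, 200 at a time.
search_results = []
start = 0
while True:
    response = requests.get('https://pub.orcid.org/v3.0/search',
                            headers=HEADERS,
                            params={'q': 'affiliation-org-name:"Vanderbilt University"',
                                    'start': start, 'rows': 200})
    data = response.json()
    search_results += data['result']
    start += 200
    if start >= data['num-found']:
        break

# Dereference each ORCID URI to get the full record.
for result in search_results:
    orcid_id = result['orcid-identifier']['path']
    record = requests.get('https://pub.orcid.org/v3.0/' + orcid_id + '/record',
                          headers=HEADERS).json()
    # ... extract the name and employment affiliations from the record here ...
    time.sleep(0.25)  # quarter-second delay to avoid hitting the API too fast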

I screened the results by recording only those that listed "Vanderbilt University" as part of the employment affiliation's organization string. That excluded people who were only students and never employees, and included people whose affiliation was "Vanderbilt University Medical Center", "Vanderbilt University School of Nursing", etc. As part of the data recorded, I included their stated departmental affiliations (some had multiple affiliations if they had moved from one department to another during their career). After this stage, I had 2240 name/department records.


Fuzzy matching of departmental and ORCID records

The next stage of the process was to try to match employees from the department that I was processing with the downloaded ORCID records. I used a Python fuzzy string matching function, fuzz.token_set_ratio(), from the fuzzywuzzy package. I tested this function along with others in the package, and it was highly effective at matching names with minor variations (both personal and departmental names). Because this function was insensitive to word order, it matched names like "Department of Microbiology" and "Microbiology Department". However, it also made major errors on name-order reversals ("John James" and "James Johns", for example), so I had an extra check for that.

If the person's name had a match score greater than 90 (out of 100), I then performed a match check against the listed department. If it also had a match score greater than 90, I assigned that ORCID to the person. If no listed department match had a score over 90, I still assigned the ORCID, but flagged that match for manual checking later.
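A minimal sketch of that matching logic (the fuzz.token_set_ratio() function is real; the helper that wraps it is hypothetical):

from fuzzywuzzy import fuzz

# token_set_ratio() ignores word order, so these two strings score 100:
fuzz.token_set_ratio('Department of Microbiology', 'Microbiology Department')

def check_orcid_match(employee_name, employee_dept, orcid_name, orcid_depts):
    """Condensed version of the matching rules described above."""
    if fuzz.token_set_ratio(employee_name, orcid_name) <= 90:
        return 'no match'
    for dept in orcid_depts:
        if fuzz.token_set_ratio(employee_dept, dept) > 90:
            return 'assign ORCID'
    return 'assign ORCID, flag for manual checking'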



Determine whether employees were already in Wikidata


Attempt automated matching with people in Wikidata known to work at Vanderbilt

I was then ready to start trying to match people with existing Wikidata records. The low-hanging fruit was people whose records already stated that their employer was Vanderbilt University (Q29052). I ran a SPARQL query for that using the Wikidata Query Service.  For each match, I also recorded the employee's description, ORCID, start date, and end date (where available).
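The query was something like this (simplified; the real one also retrieved the description and the start/end date qualifiers):

SELECT DISTINCT ?person ?personLabel ?orcid WHERE {
  ?person wdt:P108 wd:Q29052.
  OPTIONAL { ?person wdt:P496 ?orcid. }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}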

Once I had those data, I checked each departmental employee's record against the query results. If both the departmental employee and the potential match from Wikidata had the same ORCID, then I knew that they were the same person and I assigned the Wikidata Q ID to that employee. If the employee had an ORCID, I could exclude any Wikidata records with non-matching ORCIDs and only check for name matches with Wikidata records that didn't have ORCIDs. Getting a name match alone was not a guarantee that the person in Wikidata was the same as the departmental employee, but given that the pool of possible Wikidata matches only included people employed at Vanderbilt, a good name match meant that it was probably the same person. If the person had a description in Wikidata, I printed the two names and the description and visually inspected the matches. For example, if there was a member of the Biological Sciences department named Jacob Reynolds and someone in Wikidata named Jacob C. Reynolds who was a microbiologist, the match was probably good. On the other hand, if Jacob C. Reynolds was a historian, then some manual checking was in order. I did a few other tricks that you can see in the code.

This "smart matching" with minimal human intervention was usually able to match a small fraction of people in the department. But there were plenty of departmental employees who were already in Wikidata without any indication that they worked at Vanderbilt. The obvious way to look for them would be to just do a SPARQL query for their name. There are some features built in to SPARQL that allow for REGEX checks, but those features are impossibly slow for a triplestore the size of Wikidata's. The strategy that I settled for was to generate as many possible variations of the person's name and query for all of them at once. You can see what I did in the generateNameAlternatives() function in the code. I searched labels and aliases for: the full name, names with middle initials with and without periods, first and middle initials with and without periods, etc. This approach was pretty good at matching with the right people, but it also matched with a lot of wrong people. For example, for Jacob C. Reynolds, I would also search for J. C. Reynolds. If John C. Reynolds had J. C. Reynolds as an alias, he would come up as a hit. I could have tried to automate the processing of the returned names more, but there usually weren't a lot of matches and with the other screening criteria I applied, it was pretty easy for me to just look at the results and bypass the false positives.

When I did the query for the name alternatives, I downloaded the values of several properties that were useful for eliminating hits. One important screen was to eliminate any matching items that were instances of classes (P31) other than human (Q5). I also screened out people who were listed as having died prior to some set date (2000 worked well - some departments still listed recently deceased emeriti and I didn't want to eliminate those). If both the employee and the name match in Wikidata had ORCIDs and they were different, I also eliminated the hit. For all matches that passed these screens, I printed the description, occupation, and employer if they were given in Wikidata.
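Combined with the generated name variants, the screening query looked conceptually like this (simplified, and the names here are hypothetical; the real query also pulled the description, occupation, and employer):

SELECT DISTINCT ?item ?class ?deathDate ?orcid WHERE {
  VALUES ?nameVariant { "Jacob C. Reynolds"@en "J. C. Reynolds"@en }
  ?item rdfs:label|skos:altLabel ?nameVariant.
  ?item wdt:P31 ?class.
  OPTIONAL { ?item wdt:P570 ?deathDate. }
  OPTIONAL { ?item wdt:P496 ?orcid. }
}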


Clues from publications in PubMed and Crossref

The other powerful tool I used for disambiguation was to look up any articles linked to the putative Wikidata match. For each Wikidata person item that made it this far through the screen, I did a SPARQL query to find works authored by that person. For up to 10 works, I did the following. If the article had a PubMed ID, I retrieved the article metadata from the PubMed API and tried to match against the author names. When I got a match with an author, I checked for an ORCID match (or excluded the hit on an ORCID mismatch) and also for a fuzzy match against any affiliation that was given. If either an ORCID or an affiliation matched, I concluded that the departmental employee was the same as the Wikidata match and stopped looking.

If there was no match in PubMed and the article had a DOI, I then retrieved the metadata about the article from the Crossref API and did the same kind of screening that I did with PubMed.
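Here is a sketch of the Crossref half of that check (a hypothetical helper, reusing the fuzzy matching function from earlier; the 90-point thresholds are illustrative):

import requests
from fuzzywuzzy import fuzz

def crossref_author_match(doi, employee_name, employee_orcid, dept_strings):
    """Check one article's Crossref author list against an employee."""
    data = requests.get('https://api.crossref.org/works/' + doi).json()
    for author in data['message'].get('author', []):
        name = author.get('given', '') + ' ' + author.get('family', '')
        if fuzz.token_set_ratio(name, employee_name) <= 90:
            continue  # this author isn't a name match; try the next one
        orcid = author.get('ORCID', '').rsplit('/', 1)[-1]
        if employee_orcid and orcid:
            return orcid == employee_orcid  # an ORCID settles it either way
        for affiliation in author.get('affiliation', []):
            for dept_string in dept_strings:
                if fuzz.token_set_ratio(affiliation['name'], dept_string) > 90:
                    return True  # the affiliation matched
    return False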


Human intervention

If there was no automatic match via the article searches, I printed out the full set of information (description, employer, articles, etc.) for every name match, along with the name from the department and the name from Wikidata in order for a human to check whether any of the matches seemed plausible. In a lot of cases, it was easy to eliminate matches that had descriptions like "Ming Dynasty person" or occupation = "golfer".   If there was uncertainty, the script printed hyperlinked Wikidata URLs and I could just click on them to examine the Wikidata record manually.

Here's some typical output:

--------------------------
No Wikidata name match:  Justine Bruyère

--------------------------
No Wikidata name match:  Nicole Chaput Guizani

--------------------------
SPARQL name search:  Caroline Christopher
(no ORCID)


0 Wikidata ID:  Q83552019  Name variant:  Caroline Christopher   https://www.wikidata.org/wiki/Q83552019
No death date given.
description:  human and organizational development educator
employer:  Vanderbilt University
No articles authored by that person
Employee:  Caroline Christopher  vs. name variant:  Caroline Christopher

Enter the number of the matched entity, or press Enter/return if none match: 0

--------------------------
SPARQL name search:  Paul Cobb
(no ORCID)


0 Wikidata ID:  Q28936750  Name variant:  Paul Cobb   https://www.wikidata.org/wiki/Q28936750
No death date given.
description:  association football player
occupation:  association football player
No articles authored by that person
Employee:  Paul Cobb  vs. name variant:  Paul Cobb


1 Wikidata ID:  Q55746009  Name variant:  Paul Cobb   https://www.wikidata.org/wiki/Q55746009
No death date given.
description:  American newspaper publisher
occupation:  newspaper proprietor
No articles authored by that person
Employee:  Paul Cobb  vs. name variant:  Paul Cobb

Enter the number of the matched entity, or press Enter/return if none match: 

--------------------------
No Wikidata name match:  Molly Collins

--------------------------
No Wikidata name match:  Ana Christina da Silva [Iddings]


Although this step did require human intervention, because of the large amount of information that the script collected about the Wikidata matches, it usually only took a few minutes to disambiguate a department with 30 to 50 employees.


Generate statements and references and determine which were already in Wikidata


Generating data for a minimal set of properties

The next-to-last step was to assign values for a minimal set of properties that I felt each employee should have in a Wikidata record. Here's what I settled on for that minimal set:

P31 Q5 (instance of human). This was automatically assigned to all records.

P108 Q29052 (employer Vanderbilt University). This applies to all employees in our project - the employer value can be set at the top of the script.

P1416 [Q ID of department] (affiliation with focal department). After searching through many possible properties, I decided that P1416 (affiliation) was the best property to use to assert the employee's connection to the department I was processing. P108 was also possible, but there were a lot of people with dual departmental appointments and I generally didn't know which department was the actual "employer". Affiliation seemed to be an appropriate connection for regular faculty, postdocs, visiting faculty, research staff, and other kinds of statuses where the person would have some kind of research or scholarly output.

P496 [ORCID identifier]. ORCIDs that I'd acquired for the employees were hard-won and an excellent means for anyone else to carry out disambiguation, so I definitely wanted to include that assertion if I could.

P21 [sex or gender]. I was really uncomfortable assigning a value for this property, but it is a property often flagged by Recoin as a top missing property, and I didn't want some overzealous editor deleting my new items because their metadata were too skimpy. Generally, the departmental web pages had photos to go with the names, so I made a call and manually assigned a value for this property (options: m=male, f=female, i=intersex, tf=transgender female, tm=transgender male). Any time the sex or gender seemed uncertain, I did not provide a value.

The description. I made up a default description for each department, such as "biological science researcher", "historian", or "American Studies scholar" for the Biological Sciences, History, and American Studies departments respectively. I did not overwrite any existing descriptions by default, although as a last step I looked at the table to replace stupid ones like "researcher, ORCID: 0000-0002-1234-5678". These defaults were generally specific enough to keep the label/description combination I was creating from colliding with that of an existing record, which would kill the record write.

When it made sense, I added references to the statements I was making. Generally, a reference is not expected for instance of human, and I really couldn't give a reference for sex or gender. For the employer and affiliation references, I used the web page that I scraped to get the person's name as the reference URL and provided the current date as the value for P813 (retrieved). For ORCID, I created a reference that had a P813 (retrieved) property if I was able to successfully dereference the ORCID URI.
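A sketch of that last check (a hypothetical helper, using content negotiation on the ORCID URI):

import datetime
import requests

def orcid_retrieved_date(orcid_id):
    """Return today's date for P813 if the ORCID URI dereferences, else ''."""
    response = requests.get('https://orcid.org/' + orcid_id,
                            headers={'Accept': 'application/json'})
    if response.status_code == 200:
        return datetime.date.today().isoformat()  # e.g. '2020-02-08'
    return ''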

Because each of these properties had different criteria for assigning values and references, there was no standard code for assigning them. The code for each property is annotated, so if you are interested you can look at it to see how I made the assignments.


Check for existing data in Wikidata

In the earlier posts, I said that I did not want VanderBot to create duplicate items, statements, and references when they already existed in Wikidata. So a critical last step was to check for existing data using SPARQL. One important thing to keep in mind is the Query Service Updater lag that I talked about in the last post. That lag means that changes made up to 8 or 10 hours ago would not be included in this download. However, given that the Wikidata researcher item records I'm dealing with do not change frequently, the lag generally wasn't a problem. I should note that it would be possible to get these data directly from the Wikidata API, but the convenience of getting exactly the information I wanted using SPARQL outweighed my motivation to develop code to do that.

At this point in the workflow, I've already determined with a fairly high degree of confidence which of the departmental employees were already in Wikidata. That takes care of the potential problem of creating duplicate item records, and it also means that I do not need to check for the presence of statements or references for any of the new items either.

One interesting feature of SPARQL that I learned from this project was the VALUES clause. Despite having used SPARQL for years and skimming through the SPARQL specification several times, I had missed it. The VALUES clause allows you to specify which values the query should use for a particular variable in its pattern matching. That makes querying a large triplestore like Wikidata much faster than without it, and it also reduces the number of results that the code has to sort through when they come back from the query service. Here's an example of a query using the VALUES clause that you can test at the Wikidata Query Service:

SELECT DISTINCT ?id ?statement WHERE {
  VALUES ?id {
    wd:Q4958
    wd:Q39993
    wd:Q234
  }
  ?id p:P31 ?statement.
}


So the first part of the last step in the workflow is to generate a list of all of the existing item Q IDs for employees in the department. That list is passed to the searchStatementAtWikidata() function as its first argument. searchStatementAtWikidata() is a general-purpose function that searches Wikidata for a particular property of the items in the list. It can be used either to search for a particular property and value (like P108 Q29052, employer Vanderbilt University) and retrieve the references for that statement, or for only the property (like P496, ORCID) and retrieve both the values and the references associated with those statements. This behavior is controlled by whether an empty string is sent as the value argument. For each of the minimal set of properties that I'm tracking for departmental employees, the searchStatementAtWikidata() function is used to retrieve any available data for the listed employees. Those data are then matched with the appropriate employee records and recorded in the CSV file along with the previously generated property values.
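Here's a condensed sketch of how such a function can work (the real searchStatementAtWikidata() also retrieves the references attached to each statement, which this sketch omits):

import requests

ENDPOINT = 'https://query.wikidata.org/sparql'

def search_statement_at_wikidata(qids, prop, value=''):
    """Find existing statements for a property (and optional item value)
    on a list of items, using a VALUES clause to restrict the search."""
    values_clause = ' '.join('wd:' + qid for qid in qids)
    if value == '':
        # Property only: also retrieve the value of each statement.
        select = 'SELECT DISTINCT ?id ?statement ?value WHERE { '
        pattern = '?id p:' + prop + ' ?statement. ?statement ps:' + prop + ' ?value. '
    else:
        # Property and value (e.g. P108 and Q29052): find the statement nodes.
        select = 'SELECT DISTINCT ?id ?statement WHERE { '
        pattern = '?id p:' + prop + ' ?statement. ?statement ps:' + prop + ' wd:' + value + '. '
    query = select + 'VALUES ?id { ' + values_clause + ' } ' + pattern + '}'
    response = requests.get(ENDPOINT, params={'query': query, 'format': 'json'})
    return response.json()['results']['bindings']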

In addition to the property checks, labels, descriptions, and aliases for the list of employees are retrieved via SPARQL queries. In the cases of labels and descriptions, if there is an existing label or description in Wikidata, it is written to the CSV file. If there is no existing label, the name scraped from the departmental website is written to the CSV as the label. If there is no existing description, the default description for the department is written to the CSV. Whatever alias lists are retrieved from Wikidata (including empty ones) are written to the CSV.

Final manual curation prior to writing to the Wikidata API

In theory, the CSV file resulting from the previous step should contain all of the information needed by the API-writing script that was discussed in the last post. However, I always manually examine the CSV to look for problems or things that are stupid, such as bad descriptions.

If a description or label is changed, the API-writing script will detect that it's different from the current value being provided by the SPARQL endpoint and the new description or label will overwrite the existing one. The API-writing script is currently not very sophisticated about how it handles aliases. If there are more aliases in the CSV than are currently in Wikidata, the script will overwrite existing aliases in Wikidata with those in the spreadsheet. The assumption is that alias lists are only added to, rather than aliases being changed or deleted.  At some point in the future, I intend to write a separate script that will handle labels and aliases in a more robust way, so I really didn't want to waste time now on making the alias-handling better than it is. 

A typical situation is to discover a more specific label for the person than already exists in Wikidata. In that case, I usually add the existing label to the alias list and replace the label value in the CSV with the better new one. WARNING! If you edit the alias list, make sure that your editor uses straight quotes (ASCII 34/Unicode U+0022) and not "smart quotes". Smart quotes have a different Unicode value and will break the script. Open Office/Libre Office (the best applications for editing CSVs, in my opinion) default to smart quotes, so this setting must be turned off manually.

I also just look over the rest of the spreadsheet to convince myself that nothing weird is going on. Usually the script does an effective job of downloading the correct reference properties and values, but I've discovered some odd situations that have caused problems.


At this point, steps 1 and 2 in the VanderBot diagram have been completed by the data harvesting script, and the API-writing script described in the last post is ready to take over in step 3. When step 4 is complete, the blank cells in the CSV for missing item, statement, and reference identifiers should all be filled in, and the CSV can be filed for future reference.

Final thoughts


I tried to make the API-writing script generic and adaptable for writing statements and references about any kind of entity. That's achievable simply by editing the JSON schema file that maps the columns in the source CSV. However, getting the values for that CSV is the tricky part. If one were confident that only new items were being written, then the table could be filled with only the data to be written, without any item, statement, or reference identifiers. That would be the case if you were using the script to load your own Wikibase instance. However, for adding data to Wikidata about most kinds of items, like people or publications, one can't know in advance whether the data need to be written, and that's why a complex and somewhat idiosyncratic script like the data harvesting script is necessary. So there's no "magic bullet" that will make it possible to automatically know whether you can write data to Wikidata without creating duplicate assertions.

To find records that VanderBot has put into Wikidata, try this query at the Wikidata Query Service:

select distinct ?employee where {
  ?employee wdt:P1416/wdt:P749+ wd:Q29052.
}
limit 50

The triple pattern requires that the employee have an affiliation (P1416) with some item, and that the item be linked by one or more parent organization (P749) links to Vanderbilt University (Q29052). I linked the department items to their parent school or college using P749 and made sure that the University's schools and colleges were all linked to the University by P749 as well. However, some schools, like the Blair School of Music, do not really have departments, so their employees were affiliated directly with the school or college rather than with a department. So the search has to pick up administrative entity items that are either one or two P749 links from the university (hence the "+" property path operator after P749). Since there are a lot of employees, I limited the results to 50. If you click on any of the results, it will take you to the item page, and you can view the page history to confirm that VanderBot made edits to the page. (At some point, there may be people who were linked in this way by an account other than VanderBot, but thus far VanderBot is probably the only editor linking Vanderbilt employee items to departments via P1416, given that I recently created all of the department items from scratch.)

A variation of that query will tell you the number of records meeting the criteria of the previous query:

select (count(?employee) as ?count) where {
  ?employee wdt:P1416/wdt:P749+ wd:Q29052.
}

As of 2020-02-08, there are 1221 results. That number should grow as I use VanderBot to process other departments.

