Thursday, March 18, 2021

Writing your own data to Wikidata using spreadsheets: Part 4 - Downloading existing data

 This is the fourth part of a series of posts intended to help you manage and upload data to Wikidata using spreadsheets. It assumes that you have already read the first three posts and that you have tried the "do it yourself" experiments to get a better understanding of how the system works. However, it's possible that the only thing that you want to do is to download data from Wikidata into a spreadsheet, and if that is the case, you could get something out of this post without reading the others. The script that I will describe does require a configuration file whose structure is described in the second post, so you'll either need to read that post (if you like handholding) or read the more technical specifications here to know how to construct or hack that configuration file. 

If you are the kind of person who prefers to just look at the specs and play with software, then skip this whole post and go to the documentation for the download script here. Good luck! Come back and do the walk-through below if you can't figure it out.

The latter sections of the post do show how to use the downloaded data to carry out easy-to-implement improvements to item records. Those sections depend on an understanding of the earlier posts. So if that kind of editing beyond simply downloading data interests you, then you should read the whole series of posts.

Configuring and carrying out the download

Towards the end of the last post, I created a JSON configuration file for metadata about journal articles. The configuration specified the structure of three CSV files. The first CSV (articles.csv) was intended to have one row per article and contained headers for statements about the following properties: "instance of", DOI, date of publication, English title, journal, volume, page, and issue. The other two CSVs were expected to have multiple rows per article since they contained data about author items (authors.csv) and author name strings (author_strings.csv). Since articles can have one-to-many authors, these two tables could be expected to have zero-to-many rows per article. 

For the purposes of testing the download script, you can just use the JSON configuration file as-is. Download it, name it config.json, and put it in a new directory that you can easily navigate to from your home folder. We are going to specify the group of items to be downloaded by designating a graph pattern, so edit the fourth line of the file using a text editor so that it says

  "item_pattern_file": "graph_pattern.txt",

You can screen for articles using any kind of graph pattern that you know how to write, but if you don't know what to use, you can use this pattern:

?person wdt:P1416 wd:Q16849893. # person, affiliation, Vanderbilt Libraries
?qid wdt:P50 ?person. # work, author, person

Copy these two lines and save them in a plain text file called graph_pattern.txt in the same directory as the configuration file. The comments after the hash (#) mark will be ignored, so you can leave them off if you want. I chose the first triple pattern (people affiliated with Vanderbilt Libraries) because there is a relatively small number of people involved. You can use some other triple pattern to define the people, but if it designates a large number of people, the file of downloaded journal data may be large. Whatever pattern you use, the variable ?qid must be used to designate the works.

The last thing you need is a copy of the Python script that does the downloading. Go to this page and download the script into the same directory as the other two files. 

Open your console software (Terminal on Mac or Command Prompt on Windows) and navigate to the directory where you put the files. Enter

python acquire_wikidata_metadata.py

(or python3 if your installation requires that).

file run screenshot

The output should be similar to the screenshot above. 


Examining the results

Start by opening the authors.csv file with your spreadsheet software (LibreOffice Calc recommended, Excel OK). This file should be pretty much as expected. There is a label_en column that is there solely to make it easier to interpret the subject Q IDs -- that column is ignored when the spreadsheet is processed by VanderBot. In this case, every row has a value for the author property because we specified works that had author items in the graph pattern we used to screen the works. 

The author_strings.csv file should also be close to what you expect, although you might be surprised to see that some rows don't have any author strings. Those are cases where all of the authors of that particular work have been associated with Wikidata items. Because the script is very generic, it always generates at least one row per subject item and leaves a blank cell for every statement property that doesn't have a value, in case you want to add one later. Because there is only one statement property in this table, a missing value makes the row seem a bit weird: the whole row is empty except for the Q ID.

When you open the articles.csv file, you may be surprised or annoyed to discover that despite what I said about intending for there to be only one row per article, many articles have two or even more rows. Why is this the case? If you scroll to the right in the table, you will see that in most, if not all, of the cases of multiple rows there is more than one value for instance of. If we were creating an item, we would probably just say that it's an instance of one kind of thing. But there is no rule saying that an item in Wikidata can't be an instance of more than one class. You might think that the article is a scholarly article (Q13442814) and I may think it's an academic journal article (Q18918145), and there is nothing to stop us from both making our assertions. 

The underlying reason we get these multiple rows is that we are using a SPARQL query to retrieve the results. We will see why in the next section. The situation would be even worse if there were more than one property with multiple values. If there were 3 values of instance of for the item and 2 values for the publication date, we would get rows with every combination of the two, and end up with 3 x 2 = 6 rows for that article. That's unlikely, since I took care to select properties that (other than instance of) are supposed to have only a single value. But sometimes single-value properties are mistakenly given several values, and we end up with a proliferation of rows.
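If you want to see that multiplication in action without touching Wikidata at all, here is a toy Python illustration (the values are hypothetical and have nothing to do with the actual script):

from itertools import product

# Hypothetical values: three "instance of" values and two publication dates for one article
instance_of_values = ['scholarly article', 'academic journal article', 'review article']
publication_dates = ['2015-06-01', '2015-07-01']

combinations = list(product(instance_of_values, publication_dates))
for combination in combinations:
    print(combination)
print(len(combinations), 'rows')  # 3 x 2 = 6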


An aside on SPARQL 

It is not really necessary for you to understand anything about SPARQL to use this script, but if you are interested in understanding this "multiplier" phenomenon, you can read this section. Otherwise, skip to the next section.

Let's start by looking at the page of Carl H. Johnson, a researcher at Vanderbilt (Q28530058). 

Item record for Carl H. Johnson

As I'm writing this (2021-03-16), we can see that Carl is listed as having two occupations: biologist and researcher. That is true, his work involves both of those things. He is also listed as having been educated at UT Austin and Stanford. That is also true, he went to UT Austin as an undergrad and Stanford as a grad student. We can carry out the following SPARQL query to ask about Carl's occupation and education. 

select distinct ?item ?itemLabel ?occupation ?occupationLabel ?educatedAt ?educatedAtLabel {
  ?item wdt:P106 ?occupation.
  ?item wdt:P69 ?educatedAt.
  BIND(wd:Q28530058 AS ?item)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
  }

You can run the query yourself here, although if the information about Carl has changed since I wrote this post, you could get different results.

The last line of the query ("SERVICE...") is some "magic" that the Wikidata Query Service does to automatically generate labels. If you ask about a variable named "?x" and also include a variable named "?xLabel", the "magic" line makes the Query Service fill in "?xLabel" for you even though you never define it in the graph pattern. I've used this trick to generate labels for the three variables I'm asking about in the first line: ?item, ?occupation, and ?educatedAt.

The second and third lines of the query:

  ?item wdt:P106 ?occupation.
  ?item wdt:P69 ?educatedAt.

are the ones that are actually important. They restrict the value of the variable ?occupation to be the occupation (P106) of the item and restrict the value of the variable ?educatedAt to be the place where the item was educated (P69). 

The fourth line just forces the item to be Carl H. Johnson (Q28530058). If we left that line out, we would get thousands or millions of results: anyone in Wikidata who had an occupation and was educated. (Actually, the query would probably time out. Try it and see what happens if you want.)

So here's what happens when we run the query:

SPARQL query results

When we carry out a SPARQL query, we are asking the question: what combinations of variable values satisfy the graph pattern that we have specified? The query binds those combinations to the variables and then displays the combinations in the results. If we think about these combinations, we can see they all satisfy the pattern that we required: 

  • Carl has occupation researcher and went to school at UT Austin.  
  • Carl has occupation researcher and went to school at Stanford.
  • Carl has occupation biologist and went to school at UT Austin. 
  • Carl has occupation biologist and went to school at Stanford.

There are four ways we can bind values to the variables that are true and satisfy the pattern we required. We cannot ask SPARQL to read our mind and guess there was a special combination that we intended, or that there was one combination that we were more interested in than another.

This behavior sometimes produces results for SPARQL queries that seem unexpected because you get more results than you intend. But if you ask yourself what you really required in your graph pattern, you can usually figure out why you got the result that you did.
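One last aside: the download script gets its data in exactly this way, by sending a SPARQL query to the Query Service endpoint and looping through the combinations in the results. If you ever want to experiment outside of the Query Service web page, here is a minimal stand-alone Python sketch (not the script's actual code) that runs the Carl Johnson query and prints the combinations:

import requests

query = '''
select distinct ?item ?itemLabel ?occupation ?occupationLabel ?educatedAt ?educatedAtLabel {
  ?item wdt:P106 ?occupation.
  ?item wdt:P69 ?educatedAt.
  BIND(wd:Q28530058 AS ?item)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
'''

response = requests.get('https://query.wikidata.org/sparql',
                        params={'query': query},
                        headers={'Accept': 'application/sparql-results+json',
                                 'User-Agent': 'example-script/0.1 (replace with your contact information)'})
results = response.json()['results']['bindings']

# Each "binding" is one combination of variable values that satisfies the graph pattern.
for binding in results:
    print(binding['occupationLabel']['value'], '/', binding['educatedAtLabel']['value'])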


Restricting the combinations of values in the table

If you paid close attention to the output of the script, you will have noticed that for each of the three CSVs it said that there were no pre-existing CSVs. After the script runs the SPARQL query to collect the data from Wikidata, it tries to open the files. If it can't open the files, it creates new files and saves all of the combinations of values that it found. However, if the files already exist, it compares the data from the query to the data already in the CSV and ignores combinations of values that don't match what's already there.

That means that if we are annoyed about all of the possible combinations of values initially written to the table, we can delete lines that contain combinations we don't care about. For example, if one row of the table says that the article is a scholarly article (Q13442814) and another says it's an academic journal article (Q18918145), I can delete the scholarly article row and only pay attention to the row containing the statement that it is an academic journal article. In future downloads, the scholarly article assertion will be ignored by the script. It is a pain to have to manually delete the duplicate lines, but once you've done so, you shouldn't have to do it again if you try downloading more data later. You will have only a single line to deal with per existing item.

The situation is actually a bit more complicated than I just described. If you are interested in the details of how the script screens combinations of variables that it gets from the Query Service, you can look at the comments in the script starting in line 510. If not, you can assume that the script does the best it can and usually does the screening pretty well. It is still good to examine the CSV visually after doing a fresh download to make sure nothing weird happened. 
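To give a flavor of what that screening amounts to, here is a greatly simplified pandas sketch of the idea. It is not the script's actual code (the real logic starting around line 510 is considerably more involved), but it shows the general principle: rows for Q IDs already in the local CSV are kept only if their values match a row that is already there, while rows for brand-new Q IDs pass through untouched.

import pandas as pd

def screen_rows(query_df, existing_df):
    # Simplified illustration only -- not the actual screening code in the script.
    query_df = query_df.fillna('')
    existing_df = existing_df.fillna('')
    known_qids = set(existing_df['qid'])
    new_items = query_df[~query_df['qid'].isin(known_qids)]
    # An inner merge on the shared columns keeps only the value combinations
    # that agree with what is already in the local CSV.
    shared_columns = [column for column in query_df.columns if column in existing_df.columns]
    matching = query_df.merge(existing_df[shared_columns].drop_duplicates(),
                              how='inner', on=shared_columns)
    return pd.concat([matching, new_items], ignore_index=True)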


Repeating the download to pick up new data

Based on what I said in the previous section, you should have noticed that you can run this script repeatedly if you want to pick up new data that has been added to the items since the last time you ran it. That means that if you are adding data to the downloaded CSV as a way to make additions to Wikidata via the API, you can first check for updated information to make sure that you don't accidentally add duplicate statements when you use a stale CSV. VanderBot assumes that when you fill in empty cells, that represents new data to be written. So if you or someone else has actually made the same new statement using the web interface since the last time you ran VanderBot to do an upload, you risk creating duplicate statements. 

There are several important things that you need to keep in mind about updating existing CSVs prior to adding and uploading new data:

1. The screening to remove duplicate rows is only done for preexisting items. Any new items downloaded in the update will need to be visually screened by a human for duplicates. It doesn't really hurt anything if you leave the duplicate rows -- all of their statements and references are preexisting and will have identifiers, so VanderBot will ignore them anyway. But you will probably eventually want to clean up the duplicates to make the spreadsheet easier to use in the future.

2. If there is only a single combination of values for an item (i.e. only a single row), the script will automatically replace any changed values with the new ones regardless of the preexisting state of that row. The screening of rows against the existing spreadsheet only happens when there are two rows with the same Q ID. So if somebody changed the wonderful, correct value that you have in your spreadsheet to something icky and wrong, running the update will change your local copy of the data in the spreadsheet to their icky and wrong value. On the other hand, if they have fixed an error and turned your data into wonderful, correct data, that will be changed in your local copy as well. The point here is that the script is dumb and cannot tell the difference between vandalism and crowd-sourced improvements of your local data. It just always updates your local data when there aren't duplicate rows. In a later post, we will talk about a system to detect changes before downloading them, so that you can make a decision about whether to allow the change to be made to your local copy of the data (i.e. the CSV). A rough sketch of what such a comparison might look like appears after this list.

3. If you have enabled VanderBot to write the labels and descriptions of existing items (using the --update option), it is very important that you download fresh values prior to using a stale CSV for writing data to the API. If you do not, then you will effectively revert any recently changed labels and descriptions back to whatever they were the last time you downloaded data or wrote to the API with that CSV. That would be extremely irritating to anyone (including YOU) who put a lot of work into improving labels and descriptions using the web interface and then had them all changed by the VanderBot script back to what they were before. So be careful!
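A later post will describe a proper system for reviewing changes, but the basic idea is nothing more than comparing a stale copy of a CSV with a freshly downloaded one and reporting the cells that differ. Here is a rough sketch of that comparison, assuming hypothetical file names, identical columns in both files, and one row per Q ID:

import pandas as pd

# Hypothetical file names: a stale copy that you set aside and a fresh download.
old = pd.read_csv('articles_stale.csv', dtype=str).set_index('qid').fillna('')
new = pd.read_csv('articles_fresh.csv', dtype=str).set_index('qid').fillna('')
new = new[old.columns]  # make sure the columns are in the same order

# Compare only the items that appear in both files and report any cells that differ.
common = old.index.intersection(new.index)
differences = old.loc[common] != new.loc[common]
for qid in differences[differences.any(axis=1)].index:
    changed_columns = differences.loc[qid][differences.loc[qid]].index
    for column in changed_columns:
        print(qid, column, ':', old.at[qid, column], '->', new.at[qid, column])

Reviewing a report like that before writing anything lets you decide whether a difference is an improvement to keep or a change to investigate.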


Making additions to the downloaded data and pushing them to the API: Low hanging fruit

If you use this script to download existing data, it will not take you very long to realize that a lot of the data in Wikidata is pretty terrible. There are several common ways that data in Wikidata are terrible, and VanderBot can help you to improve the data with a lot less effort than doing a lot of manual editing using the web interface.

Changing a lot of descriptions at once

Many items in Wikidata were created by bots that had limited information about the items, and limited abilities to collect the information they were missing. The end result is that descriptions are often very poor. A very common example is describing a person as a "researcher". I believe that this happens because the person is an author of a research article, and since the bot knows nothing else about the person, it describes them as a "researcher". Since we are screening with a SPARQL query that establishes some criteria about the items, those criteria will often allow us to provide a better description. For example, if we are screening for people by requiring that they be faculty in a chemistry department who have published academic research articles, we can safely improve their descriptions by calling them "chemistry researchers".



In the case of our example, there is an even more obvious problem: many of the items have no description at all. There is a very easy solution, since we have the instance of (P31) information about the items. All of the items in the screenshot above that are missing descriptions are instances of Q13442814.

summary of instance types

I used the script (discussed in the previous post) for downloading information about values of statement properties to summarize all of the P31 values for the items in this group. So I know that all of these items with missing descriptions are instances of scholarly article. There may be better descriptions for those items, but at this point "scholarly article" is a much better description than no description at all.

One might argue that we aren't actually adding information to the items, given that our proposed description is simply re-stating the P31 value. That may be true, but the description is important because it shows up in the search results for an item, and the P31 value does not. In Wikidata, descriptions also play an important role in disambiguating items that have the same label, so it's best for all items to have a description. 

I am going to fix all of these descriptions at once by simply pasting the text "scholarly article" in the description column for these items, then running VanderBot with the --update option set to allow. If you have not read the earlier posts, be aware that prior to writing to the API, you will need to create a metadata description file for the CSVs (discussed in the second post) and also download a copy of VanderBot from here into the directory where the CSV files are located.  Run the API upload script using the command

python vanderbot.py --update allow --log log.txt


CSV with added descriptions

After it finished running, I deleted the three CSV files. After a little while I ran the download script again to see how things in Wikidata had changed. The results are in the screenshot above. (Note: there is a delay between when data are written to the API and when they become available at the Query Service, so the changes won't necessarily show up immediately. It can take anywhere from a few seconds to an hour for the changes to reach the Query Service.) I was able to improve the descriptions of 23 items with about 30 seconds of copying and pasting. That would probably have taken me at least 10 or 15 minutes if I had looked up each item using the web interface and entered the descriptions manually.
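Pasting into 23 cells is trivial, but if there had been thousands of rows to fix, the same edit could be scripted in a few lines of pandas. A minimal sketch -- the column names here (description_en and instance_of) are assumptions based on how my configuration was set up, so check your own headers first:

import pandas as pd

articles = pd.read_csv('articles.csv', dtype=str)

# Fill in "scholarly article" only where the description is empty and the
# item is an instance of Q13442814 (scholarly article).
empty_description = articles['description_en'].isna() | (articles['description_en'] == '')
scholarly = articles['instance_of'] == 'Q13442814'
articles.loc[empty_description & scholarly, 'description_en'] = 'scholarly article'

articles.to_csv('articles.csv', index=False)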

Changing a lot of labels at once

Another situation where I can make a large number of improvements with very little effort is adding labels in other languages to items for people. Most names of people will be represented in the same way across all languages that use the Latin character set. So I can easily improve label coverage in non-English languages by just copying the English names and using them as labels in the other languages. This would be extremely labor-intensive if you had to look up each item and do the copying and pasting one item at a time. However, when the labels are in spreadsheet form, I can easily copy an entire column and paste it into another column.

In our Vanderbilt Fine Arts Gallery WikiProject, we put in a lot of work on either disambiguating artist name strings against Wikidata items, or creating new items for artists that weren't already there. As an end result, we now have a list of 1325 artists whose works are included in the gallery collection. I can use that list of 1325 Q IDs as a way to define a category of items to be included in a download using the acquire_wikidata_metadata.py script. To set up the download, I created a config.json file containing this JSON:

{
  "data_path": "",
  "item_source_csv": "creators.csv",
  "item_pattern_file": "",
  "outfiles": [
    {
      "manage_descriptions": true,
      "label_description_language_list": ["en", "es", "pt", "fr", "it", "nl", "de", "da", "et", "hu", "ga", "ro", "sk", "sl", "zu", "tr", "sv"],
      "output_file_name": "creators_out.csv",
      "prop_list": []
    }
  ]
}

As you can see, I'm not concerned with any properties of the works items. I've simply listed the language codes for many languages that primarily use the Latin character set. The creators.csv file is my spreadsheet with the 1325 item identifiers in a column named qid. It defines the set of items I'm interested in. After running the acquire_wikidata_metadata.py script, the creators_out.csv spreadsheet looked like this:

CSV list of artist items in Wikidata

There are several things worth noting. In most cases, when the label is available in non-English languages, it's exactly the same as the English label. This confirms my assertion that it's probably fine to just re-use the "English" names as labels in the other languages. There are a couple exceptions. Buckminster Fuller has variation in his labels because "Buckminster" is apparently his middle name. So I'm going to mostly leave that row alone -- he's famous enough that he's represented in most languages anyway. The Haverford Painter's name isn't really a name. It's more of a description applied as a label and it does vary from language to language. I'll just delete that row since I have no idea how to translate "Haverford Painter" into most of the languages. 

The other interesting thing is that most of the names are represented in Dutch already. The reason is that there is a bot called Enderoobot which, among other things, automatically adds English people name labels as Dutch labels (see this edit for example). There are only a few missing Dutch labels to fill in. So I definitely should not just copy the entire English column of labels and paste it into the Dutch column.

Since the rows of the CSV are in alphabetical order by Q ID, the top of the spreadsheet contains mostly newer items with Q IDs over 100 million. In the lower part of the sheet, where Q IDs less than 100 million are located, there are a lot more well-known artists that have labels in more languages. It would take more time than I want to spend right now to scrutinize the existing labels to see if it's safe to paste over them. So for now I'll limit my copying and pasting to the top of the spreadsheet. 
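If you would rather script the copying than paste by hand, something like the following sketch does the same thing while guarding against the two problems just mentioned: it only touches the newer items (Q IDs over 100 million) and only fills cells that are currently empty, so no existing label ever gets overwritten. The label column names are assumptions based on the label_en pattern seen in the other CSVs, so check your own headers first.

import pandas as pd

creators = pd.read_csv('creators_out.csv', dtype=str)

# Assumed column names, following the label_en pattern.
language_columns = ['label_es', 'label_pt', 'label_fr', 'label_it', 'label_nl',
                    'label_de', 'label_da', 'label_et', 'label_hu', 'label_ga',
                    'label_ro', 'label_sk', 'label_sl', 'label_zu', 'label_tr', 'label_sv']

# Only touch items whose numeric Q ID is over 100 million, and only fill empty cells.
newer = creators['qid'].str.lstrip('Q').astype(int) > 100000000
for column in language_columns:
    empty = creators[column].isna() | (creators[column] == '')
    creators.loc[newer & empty, column] = creators.loc[newer & empty, 'label_en']

creators.to_csv('creators_out.csv', index=False)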

After pasting the English labels into all of the other columns and filling in the few missing Dutch labels, I'm ready to write the new labels to Wikidata. I needed to run the convert_json_to_metadata_schema.py script to generate the metadata description file that VanderBot needs to understand the new creators_out.csv spreadsheet I've just created and edited (see my second post if you don't know about that).  I'm now ready to run VanderBot using the same command I used earlier.

Using this method, I was able to add approximately 3500 multilingual labels with only about 30 seconds of time to copy and paste columns in the spreadsheet and about 10 minutes for VanderBot to write the new labels to the API. I can't even imagine how long that would take to do manually. 

One nice thing is that there is only one interaction with the API per item, regardless of the number of different languages of labels that are changed. Since most of the time that VanderBot takes to do the writing is actually just sleeping 1.25 seconds per item (to avoid exceeding the maximum writing rate for bots without a bot flag), it's important to bundle as many data items per API interaction as possible. 

Wikidata page showing added language labels

When I check one of the artist's pages, I see now that it has labels in many languages instead of only English.

Although it would be more labor-intensive, the same process could be used for adding labels in non-Latin character sets. A native speaker could simply go down the rows and type in the labels in Chinese characters, Cyrillic, Greek, Arabic, or any other non-Latin character set in an appropriate column and run the script to add those labels as well.

Adding multiple references to an item at once

Despite the importance of references to ensuring the reliability of Wikidata, many (most?) statements do not have them. That's understandable when humans create the statements, since including references is time consuming (although less so if you use some of the gadgets that are available to streamline the process, like currentDate and DuplicateReferences). For bots, it's inexcusable. Most bots are getting their data automatically from some data source and they know what that data source is, so there is no reason for them to not add references other than the laziness of their developers. 

We can't fix other people's bad behavior, but we can fix their missing references with minimal work if we have an easy way to acquire the information. 

example of item with few references

Q44943965 is an article that was created using QuickStatements. Some of the data about the item was curated manually and those statements have references. But most of the bot-created statements don't have any references and I'm too lazy to add them manually. Luckily, the article has a DOI statement near the bottom, so all I need to do is to click on it to verify that the information exists for the statements with missing references. As a reference URL, I'm going to use the HTTPS form of the DOI, https://doi.org/10.3233/SW-150203, which a human can click on to see the evidence that supports the statement. 

This publication was in the CSVs from the first example in this post, so prior to writing the references, I deleted the CSVs and used the acquire_wikidata_metadata.py script to download a fresh copy of the data.

CSV showing added references

I highlighted the row to make it easier to see and pasted the DOI URL into the doi_ref1_referenceUrl column. I typed today's date into the doi_ref1_retrieved_val column in the required format: 2021-03-17.

To create the references for the other statements, I just needed to copy the DOI URL into all of the columns whose names end in _ref1_referenceUrl and today's date into all of the columns whose names end in _ref1_retrieved_val.

Once I finished that, I saved the CSV and ran VanderBot (I already had the metadata description file from earlier work). I saved the output into a log file so that I could look at it later.

log showing added references

When VanderBot processes a CSV, it first writes any new items. It then runs a second check on the spreadsheet to find any items where statements already exist (indicated by the presence of a value in the _uuid column for that property), but where references have NOT been written (indicated by the absence of a _ref1_hash value). Scrolling through the log file, I saw that there was "no data to write" for any of the statements. In the "Writing references of existing claims" section (screenshot above), I saw the seven new references I created for Q44943965.  

item page with added references

Checking the item page again, I see that all of the statements now have references!

This is more labor-intensive than making the label changes that I demonstrated in the previous example, but if all of the items in a spreadsheet were derived from the same source, then copying and pasting all the way down the _ref1_referenceUrl and _ref1_retrieved_val columns would be really fast. In this case, it was not particularly fast, since I had to look up the DOI, then copy and paste the URL and date manually for each different item. However, since DOI data from CrossRef are machine-readable (via their API, see https://github.com/CrossRef/rest-api-doc), it won't be that hard to script the lookup in Python and have the script add all of the references to the CSV. I may write a post showing how to do that sometime in the future.
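To give a taste of what that scripted lookup might look like, here is a minimal sketch that asks the CrossRef REST API about the DOI used above and prints the fields that would feed the spreadsheet. Wiring the results into the CSV columns is left out to keep it short:

import requests

doi = '10.3233/SW-150203'
response = requests.get('https://api.crossref.org/works/' + doi,
                        headers={'User-Agent': 'example-script/0.1 (replace with your contact information)'})
work = response.json()['message']

print('title:  ', work.get('title', [''])[0])
print('journal:', work.get('container-title', [''])[0])
print('volume: ', work.get('volume', ''))
print('issue:  ', work.get('issue', ''))
print('pages:  ', work.get('page', ''))
print('issued: ', work.get('issued', {}).get('date-parts', [[]])[0])

# The same DOI, in URL form, is what goes into the _ref1_referenceUrl columns.
reference_url = 'https://doi.org/' + doi
print('reference URL:', reference_url)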

Conclusion

The script that downloads existing data from Wikidata into a CSV (acquire_wikidata_metadata.py) makes it possible to use the VanderBot API-writing script to improve certain kinds of information about items by simply copying multiple cells and pasting them elsewhere in the spreadsheet. Since CSVs are easily read and written by scripts, it is also possible to automate the addition of some kinds of data about existing items to the CSV (and eventually to Wikidata) by scripting. 

In future posts, I will show how to accomplish some of the more difficult aspects of managing your own data in Wikidata using spreadsheets, including automated data acquisition, disambiguation, and monitoring for changes over time.



Thursday, March 11, 2021

Writing your own data to Wikidata using spreadsheets: Part 3 - Determining what properties to use

 This is the third part of a series of posts intended to help you manage and upload data to Wikidata using spreadsheets. You probably won't get as much out of this post if you haven't already done the do-it-yourself exercises in the first two posts, but since this is a more general topic, you might still find it useful even if you haven't read the earlier ones.


Determining scope

The target audience of this post is people or groups who have particular defined datasets that they would like to upload and "manage" on Wikidata. I put "manage" in quotes because no one can absolutely manage any data on Wikidata, since by definition it is a knowledge graph that anyone can edit. So no matter how much we care about "our" data, once we put it into Wikidata, we need to be at peace with the fact that others may edit "our" items. 

There is a good chance that the data we are interested in "managing" may be some kind of data about which we have special knowledge. For example, if we are part of a museum, gallery, or library, we may have items in our collection that are worth describing in Wikidata. It is unlikely that others will have better information than we do about things like accession numbers and license information. I'm going to refer to this kind of information as "authoritative" data -- data that we probably know more about than other Wikidata users. There may be other data that we are very interested in tracking, but about which we may have no more information than anyone else.

In both of these situations, we have a vested interest in monitoring additions and changes made by others outside our group or organization. In the case of our "authoritative" data, we may want to be on the lookout for vandalism that needs to be reverted. As we track other non-authoritative data, we may discover useful information that's effectively crowd-sourced and available without cost (time or financial) to us. 

There will also be other statements that involve properties that we aren't interested in tracking. That doesn't mean this other information is useless -- it may just not be practical for us to track it since we can't really contribute to it nor gain much benefit from it. 

So an important part of planning a project to upload and manage data in Wikidata is determining the scope of statements you plan to monitor. This is true regardless of how you are doing that managing, but in the case of using CSV spreadsheets, the defined scope will determine what column headers will be present in the spreadsheet. So prior to moving forward with using spreadsheets to write data to Wikidata, we need to decide what properties, qualifiers, and references we plan to document in those spreadsheets.


Defining a group of items of interest

 The first thing we need to decide is what kind of items we are interested in. There may be an obvious target item type: works in a gallery, specimens in a museum, articles published by researchers in an institution, etc. There will also often be secondary item types associated with the primary one: artists associated with gallery works, collectors with specimens, authors and journals with articles, for example. After determining the primary and secondary item types, the next step is to figure out what value of P31 (instance of) goes with each type of item of interest. In some cases, this might be obvious (Q5 = human for authors, for example). In other cases it may not be so clear. Is a book Q571 (book) or is it Q3331189 (version, edition, or translation)? The best answer to this question is probably "what are other people using for items similar to mine?" We'll talk about some tools for figuring that out later in the post.

There are two useful ways to define a group of related items. The simplest is enumeration: creating a list of Q IDs of items that are similar. Less straightforward but more powerful is to define a graph pattern that can be used in SPARQL to designate the group.  We are not going to go off the deep end on SPARQL in this post, but it plays such an important role in using Wikidata that we need to talk about it a little bit. 

Wikidata is a knowledge graph, which means that its items are linked together by the statements that connect them, and those links are made by the properties used in the statements. The simplest connection between two items involves a single statement using a single property. For example, if we are interested in works in the Vanderbilt Fine Arts Gallery, we can define that group of items by stating that a work is in the collection (P195) of the Vanderbilt Fine Arts Gallery (Q18563658). We can abbreviate this relationship by the shorthand:

?item wdt:P195 wd:Q18563658.

The ?item means that the item is the thing we want to know, and the other two parts lay out how the item is related to the gallery. This shorthand is the simplest kind of graph pattern that we can create to define a group of items. 

We can use this graph pattern to get a list of the names of works in the gallery using the Wikidata Query Service. Click on this link and it will take you to the Query service with the appropriate query filled in. If you look at the query in the upper right, you'll see our graph pattern stuck between a pair of curly brackets. (The other line is a sort of "magic" line that produces English labels for items that are found.) If you are wondering how you might have known how to set up a query like this, you can drop down the Examples list and select the first query: "Cats". My query is just a hack of that one where I substituted my graph pattern for the one that defines cats (P31=instance of, Q146=house cat). 

We can narrow down the scope of our group if we add another requirement to the graph pattern. For example, if we want our group of items to include only paintings that are in the Vanderbilt gallery, we can add another statement to the graph pattern: the item must also be an instance of (P31) a painting (Q3305213).

?item wdt:P31 wd:Q3305213. 

The query using both restrictions is here.

These kinds of graph patterns are used all the time in Wikidata, sometimes when you don't even know it. If you visit the Vanderbilt Fine Arts Gallery WikiProject paintings page and look just below the star, you'll see that the graph pattern that we just defined is actually what generates that page. We will use such patterns later on to investigate property use by groups of items that are defined by graph patterns.


What properties are used with different kinds of items?

Recoin (Relative Completeness Indicator)

The simplest way to see what kind of properties tend to be used with certain kinds of items is to look at the page of an item of that kind and see what properties are there. That isn't a very systematic approach, but there is a gadget called Recoin that can make our investigation more robust. Recoin can be installed by clicking on the Preferences link at the top of any Wikidata page, then selecting the Gadgets tab. Check the box for Recoin, then click Save.

screenshot showing Recoin

After you enable Recoin, you can click the Recoin link just below the item description and a list will drop down showing the fraction of items having various properties for all items having the same P31 value. The example above shows values for instances of "art museum". Of course, this list shows properties that are missing for that page, so you would need to find a page with most properties missing to get a fuller list. If you create a new item having only a P31 value, you will be able to see the complete list.

Wikidata:WikiProjects

A more systematic approach is to look for a WikiProject that is interested in the same kind of items as you. The list of WikiProjects is somewhat intimidating, but if you succeed in finding the right project, it will often contain best-practices guidelines for describing certain types of items. For example, if you expand Cultural WikiProjects, then GLAM (galleries, libraries, archives, and museums) WikiProjects, you will see one called "Sum of all paintings". They have a list of recommendations for how to describe paintings. You can find similar lists in other areas, and if you are lucky, you will find extensive data model guidelines, such as the Stanford Libraries' data models for academia.

A small amount of time spent searching here will pay large dividends later if you start by using the consensus properties adopted by the community in which you are working. The items you put into Wikidata will be much more likely to be found and linked to by others if you describe them using the same model as is used with other items of the same type.


Determining what properties are used "in the wild"

If you find a WikiProject related to your type of item, you will probably have a good idea of the properties that group says you should be using for statements about that type of item. However, you might discover that in actuality some of those properties are not really used much. That could be the case if the values are not easily available or if it's too labor-intensive to disambiguate available string values against item values in Wikidata. So it is pretty useful to know what properties are actually being used by items similar to the ones you are interested in creating/editing. 

I have written a Python script, count_entities.py, that you can use to determine what properties have been used to describe a group of related items and the number of items that have used each property. The script details are described here. Before using the script with your own set of items, you will need to define your category of items using one of the two methods I described earlier. But for testing purposes, you can try running the script using the default built-in group: works in the Vanderbilt University Fine Arts Gallery. 

To run the script, you need the following:
  • Python 3 installed on your computer with the ability to run it at the command line.
  • The requests module installed using PIP, Conda, or some other package manager.
  • a plain text editor if you want to define the group by SPARQL graph pattern. You can use the built-in text editors TextEdit on Mac or Notepad on Windows.
  • a spreadsheet program to open CSV files. LibreOffice Calc is recommended.
  • knowledge of how to change directories and run a Python script from your computer's console (Terminal on Mac, Command Line on Windows).
You do NOT need to know how to code in Python. If you are uncertain about any of these requirements, please read the first post in this series, which includes a lot of hand-holding and additional information about them.

To run the script, go to the script's page on GitHub and right click on the Raw button. Select Save Link As... and save the script in a directory you can easily navigate to using your console. The script will generate CSV files as output, so it is best to put the script in a relatively empty directory so you can find the files that are created.

To test the script, go to your console and navigate to the directory where you saved the script. Enter

python count_entities.py

(or python3 if your installation requires that). The script will create a file called properties_summary.csv, which you can open using your spreadsheet program.

list of properties of gallery items

The table shows all of the properties used to make statements about items in the gallery and the number of items that use each property. Although there are (currently) 6000 items in the group, they use properties fairly consistently, so there aren't that many properties on the list. Other groups may have much longer lists. But often there will be a very long tail of properties used only once or a few times. 

Unless you want to keep investigating the Vanderbilt Fine Arts Gallery items, you must define your group using one of the two options described below: --csv (or its brief form -C) to enumerate items in the group by Q ID or --graph (or its brief form -G) to define the group by a graph pattern.

Defining a group by a list of Q IDs

Let's try using the script by defining the group by enumeration. Download the file bluffton_presidents.csv from here into the same directory as the script, using the Raw button as before. NOTE: if you are using a Mac, it may automatically try to change the file extension from .csv to .txt in the Save As... dialog. If so, change the format to All Files and change the extension back to .csv before saving. 

screenshot of test CSV

If you open the CSV that you downloaded, you'll see that the first column has the header qid. The script requires that the Q IDs be in a column with this header. The position of that column and the presence of other columns do not matter. The items in the column must be Q IDs, including the initial Q and omitting any namespace abbreviations like wd: .

Run the script again using this syntax:

python count_entities.py --csv bluffton_presidents.csv

Note that the previous output file will be overwritten when you run the script again. 


This time the script produces a list of properties appropriate for people.
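A side note: if the list of Q IDs for your own group comes from a Query Service download or some other source, the identifiers may be full entity URIs or prefixed with wd: rather than bare Q IDs. A quick sketch of the cleanup, using hypothetical file and column names:

import pandas as pd

raw = pd.read_csv('raw_items.csv', dtype=str)  # hypothetical input file

# Strip any entity URI or wd: prefix so that only the bare Q ID remains.
raw['qid'] = (raw['item']
              .str.replace('http://www.wikidata.org/entity/', '', regex=False)
              .str.replace('wd:', '', regex=False))

raw[['qid']].to_csv('my_items.csv', index=False)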

Defining a list by SPARQL graph pattern

Open your text editor and paste in the following text:

?qid wdt:P108 wd:Q29052.
?article wdt:P50 ?qid.
?article wdt:P31 wd:Q13442814.

The first line limits the group to items whose employer (P108) is Vanderbilt University (Q29052). The second line specifies that those items must be authors of something (P50). The third line limits those somethings to being instances of (P31) scholarly articles (Q13442814). So with this graph pattern, we have defined our group as authors of scholarly articles who work (or worked) at Vanderbilt University. 

Save the file using the name graph_pattern.txt in the same directory as the script. Run the script using this syntax:

python count_entities.py --graph graph_pattern.txt

Again, the script will overwrite the previous output file. 

list of properties of Vanderbilt authors

This time, the list of properties is much longer because the group is larger and more diverse than in the last example. Despite whatever advice any WikiProjects group may give about best-practices for describing academics, we can see that there is a very small number of properties that are actually given for most of these academic authors. Note that in many cases, given name and family name statements are generated automatically by bots. So if we wanted to create "typical" records, we would only need to provide the top six properties.  

If you are unfamiliar with creating SPARQL query graph patterns, I recommend experimenting at the Wikidata Query Service page. The Examples dropdown there shows a lot of examples. However, in most cases, we can define the groups we want with simple graph patterns of only one to three lines.

Examining property use in the wild

Before deciding for sure what properties you want to write/monitor, it is good to know what typical values are for that property. It is also critical to know whether it is conventional to use qualifiers with that property. The count_entities.py script can also collect that information if you use the --prop option (or its brief form -P). I will demonstrate this with the default group (Vanderbilt Fine Arts Gallery works), but you can supply a value for either the --csv or --graph option to define your own group. 

One of the most important properties to understand about a group is P31 (instance of). To see the distribution of values for P31 in the gallery, issue this command in your console:

python count_entities.py --prop P31

(or python3 if your installation requires that). The script generates a file whose name starts with the property ID and ends in _summary.csv (P31_summary.csv in this example). Here's what the results look like:

types of items in the VU gallery

We can see that most items in the gallery that are described in Wikidata are prints. There is a long tail of other types with a very small number of representatives (e.g. "shoe"). Note that it is possible for an item to have more than one value for P31, so the total count of items by type could be greater than the total number of items.

If any statements using the target property have qualifiers, the script will create a file listing the qualifiers used and the number of items with statements using those qualifiers. In the case of P31, there were no qualifiers used, so no file was created. Let's try again using P571 (inception):

python count_entities.py -P P571

The result in the P571_summary.csv file is not very useful.

inception dates list

It listed the 401 different inception dates (as of today) for works in the gallery. However, the P571_qualifiers_summary.csv is more interesting.

qualifiers used with P571

This gives me very important information. Most of the 401 dates were qualified with an uncertainty range defined using earliest date (P1319) and latest date (P1326). The other commonly used qualifier was P1480 (sourcing circumstances). Examining the property description, we see that P1480 is used to indicate that a date is "circa" (Q5727902). So all three of these qualifiers are really important and should probably be designated to be used with P571.

For properties that have a large number of possible values (e.g. properties that have unique values for every item), you probably don't want to have the script generate the file of values if all you want to know is the qualifiers that are used. You can get only the qualifiers output file if you use the --qual  (or -Q) option (with no value needed to go with it). A good example for this is P217 (inventory number). Every work has a unique value for this property, so there is no reason to download the values for the property. Using the --qual option, I can find out what qualifiers are used without recording the values.

python count_entities.py --prop P217 --qual

The P217_qualifiers_summary.csv file shows that there is a single qualifier used with P217: collection (P195). 

Putting it together

The reason for including this post in the series about writing to Wikidata using spreadsheets is that we need to decide what properties, qualifiers, and references to include in the metadata description of the CSV that we will use to manage the data. So I will demonstrate how to put this all together to create the spreadsheet and its metadata description.

I am interested in adding publications written by Vanderbilt researchers to Wikidata. Since data from Crossref is easily obtainable when DOIs are known, I'm interested in knowing what properties are used with existing items that have DOIs and were written by Vanderbilt researchers. So the first step is to define the group for the items. The first thing I tried was the graph pattern method. Here is my graph pattern:

?person wdt:P1416 wd:Q16849893. # person affiliation VU Libraries
?item wdt:P50 ?person. # work author person
?item wdt:P356 ?doi. # work DOI doi

I tested this pattern at the Query Service with this query. However, when I ran the script with the --graph option to determine property use, it timed out. 

Determining property use

Since Plan A did not work, I moved on to Plan B. I downloaded the results from the query that I ran at the Query Service and put them into a CSV file. I then did a bit of massaging to pull the Q IDs into their own column with the header qid. This time when I ran the script with the --csv option, I got some useful results.

properties of works with DOIs

Based on these results I probably need to plan to upload and track the first 10 properties (through author name string). For P31 and P1433, it would probably be useful to see what kind of values are usual, but for the rest I just need to know if they are typically used with qualifiers or not. 


The results for P31 indicate that although both scholarly article (Q13442814) and academic journal article (Q18918145) are used to describe these kinds of academic publications, scholarly article seems to be more widely used. There were no qualifiers used with P31. Not unexpectedly, a check of P1433 revealed many library-related journals. One item used qualifiers with P1433, but those qualifiers, P304 (pages), P433 (issue), and P478 (volume), appear to be misplaced since those properties are generally used directly in statements about the work. 


The only other items with qualifiers were P50 (author, shown above) and P2093 (author name string), which also had the qualifier P1545 (series ordinal). So this simplifies the situation quite a bit -- I really only need to worry about qualifiers with the two author-related terms, which are going to require some special handling anyway. 

Creating a config.json file for the spreadsheet

I now have enough information to know how I want to lay out the spreadsheet(s) to contain the data that I'll upload/manage about journal articles. To understand better how to structure the config.json file that I'll use to generate the spreadsheets and metadata description file, I looked at one of the articles to help understand the value types for the properties. 

example article


The style of the values on the page helps me to know the value type. Item values are hyperlinked text. String values are unlinked black text. Monolingual text values look like strings, but have their language following them in parentheses.
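If you would rather not guess from the page styling, you can also read the value types directly from an item's JSON. Here is a small sketch that uses the Special:EntityData endpoint to print the datatype of every property used on an example article item; substitute any Q ID similar to the items you plan to create:

import requests

qid = 'Q44943965'  # an example article item; substitute any Q ID you like
url = 'https://www.wikidata.org/wiki/Special:EntityData/' + qid + '.json'
entity = requests.get(url).json()['entities'][qid]

# The datatype of each property is recorded in the main snak of its statements.
for property_id, statements in entity['claims'].items():
    print(property_id, statements[0]['mainsnak']['datatype'])

Note that the datatype names reported by the API (for example wikibase-item, time, or external-id) don't map one-to-one onto the value_type names used in the config.json file, but they are close enough to tell you what kind of column you will need.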

To decide about the number of spreadsheets needed, I thought about which properties were likely to have multiple values per article item. Both author (item) and author name string could have multiple values. So I put them into separate spreadsheets. The rest of the properties will probably have only one value per article (or at least only one value that I'm interested in tracking). So here is what the overall structure of the config.json file looks like:

{
  "data_path": "",
  "item_source_csv": "",
  "item_pattern_file": "",
  "outfiles": [
    {
      "manage_descriptions": true,
      "label_description_language_list": ["en"],
      "output_file_name": "articles.csv",
      "prop_list": []
    },
    {
      "manage_descriptions": false,
      "label_description_language_list": [],
      "output_file_name": "authors.csv",
      "prop_list": []
    },
    {
      "manage_descriptions": false,
      "label_description_language_list": [],
      "output_file_name": "author_strings.csv",
      "prop_list": []
    }
  ]
}

I don't want to manage descriptions on the two author-related CSVs, and am only including the labels to make it easier to identify the article.  I'm only working in English, so that also simplifies the label situation.

Here are a few of the property descriptions that I used that illustrate several value types for the statement properties:

{
  "pid": "P31",
  "variable": "instance_of",
  "value_type": "item",
  "qual": [],
  "ref": []
},
{
  "pid": "P356",
  "variable": "doi",
  "value_type": "string",
  "qual": [],
  "ref": [
    {
      "pid": "P854",
      "variable": "referenceUrl",
      "value_type": "uri"
    },
    {
      "pid": "P813",
      "variable": "retrieved",
      "value_type": "date"
    }
  ]
},
{
  "pid": "P577",
  "variable": "published",
  "value_type": "date",
  "qual": [],
  "ref": [
    {
      "pid": "P854",
      "variable": "referenceUrl",
      "value_type": "uri"
    },
    {
      "pid": "P813",
      "variable": "retrieved",
      "value_type": "date"
    }
  ]
},
{
  "pid": "P1476",
  "variable": "title_en",
  "value_type": "monolingualtext",
  "language": "en",
  "qual": [],
  "ref": [
    {
      "pid": "P854",
      "variable": "referenceUrl",
      "value_type": "uri"
    },
    {
      "pid": "P813",
      "variable": "retrieved",
      "value_type": "date"
    }
  ]
},
Following typical practice, I'm skipping references for P31 (instance of). The rest of the properties only have reference properties for P854 (reference URL) and P813 (retrieved). Some existing items may have references for P248 (stated in), but since I'm going to be getting my data from Crossref DOIs, I'll probably just use the URL form of the DOI in all of the references. So I'll only use a column for P854. Notice also that the P1476 (title) property must have the extra language key/value pair since it's a monolingual string. If the title of the article isn't in English, I'm stuck, but I'll deal with that problem later if it arises.

The final version of my config.json file is here. I will now try running the convert_json_to_metadata_schema.py script discussed in the last post to generate the headers for the three CSV files and the metadata description file so that I can test them out.

Test data

To test whether this will work, I'm going to manually add data to the spreadsheet for an old article of mine that I know is not yet in Wikidata. It's https://doi.org/10.1603/0046-225X-30.2.181 . Here's a file that shows what the data look like when entered into the spreadsheet. You'll notice that I used the DOI as the reference URL. As I said in the last section, I intend to eventually automate the process of collecting the information from Crossref, but even though I got the information manually, the DOI URL will redirect to the journal article landing page, so anyone checking the reference will be able to see it in human-readable form. So this is a good solution that's honest about the data source and that also allows people to check the reference when they click on the link. 

Please note that I did NOT fill in the author CSV yet, even though I already know what the author items are. The reason is that if I filled it in without the article item Q ID in the qid column, the VanderBot API-writing script would create two new items that consisted only of author statements about unlabeled items. Instead, I created the item for the article first, then added the article Q ID in the qid column for both author rows in the authors.csv file. You can see what that looks like in this file. Since I knew the author item Q IDs for both authors, I could put them both in the authors.csv file, but if I had only known the name strings for some or all of the authors, I would have had to put them in the author_strings.csv file, again along with the article Q IDs after the article record had been written. 

finished item page

The final product seems to have turned out according to the plan. The page is here

page history of new article

If we examine the page history of the new page, we see that there were three edits. The two smaller, more recent ones were the author edits, and the first, larger edit was the one that created the original item. 

What's next?

You can try using the config.json file to generate your own CSV headers and metadata description files if you want to try uploading a journal article yourself. Just make sure that it isn't already in Wikidata. You can also hack the config.json file to use different properties, qualifiers, and references for a project of your own. I do highly recommend that you try writing only a single item at first so that if things do not go according to plan, the problems can easily be fixed manually. 

Workflow diagram

Although we have now set up spreadsheets and a metadata description JSON file that can write data to Wikidata, there is still too much manual work for this to be productive. In subsequent posts, I'll talk about how we can automate things we have thus far been doing by hand.

The diagram above shows the general workflow that I've been using in the various projects with which I've used the spreadsheet approach. We have basically been working backwards through that workflow, so in the next post I will talk about how we can use the Query Service to download existing data from Wikidata so that we don't duplicate any of the items, statements, or references that already exist in Wikidata. 

The image above is from a presentation I gave in Feb 2021 describing the "big picture", rationale, and potential benefits of managing data in Wikidata using spreadsheets. You can view that video here.