Friday, February 7, 2020

VanderBot part 2: The Wikibase data model and Wikidata identifiers


The Wikidata GUI and the Wikibase model

To read part 1 of this series, see this page.

If you've edited Wikidata using the human-friendly graphical user interface (GUI), you know that items can have multiple properties, each property can have multiple values, each property/value statement can be qualified in multiple ways, each property/value statement can have multiple references, and each reference can have multiple statements about that reference. The GUI keeps this tree-like proliferation of data tidy by collapsing the references and organizing the statements by property.


This organization of information arises from the Wikibase data model (summarized here, in detail here). For those unfamiliar with Wikibase, it is the underlying software system that Wikidata is built upon. Wikidata is just one instance of Wikibase and there are databases other than Wikidata that are built on the Wikibase system. All of those databases built on Wikibase will have a GUI that is similar to Wikidata, although the specific items and properties in those databases will be different from Wikidata.

To be honest, I found working through the Wikibase model documentation a real slog. (I was particularly mystified by the obscure term for basic assertions: "snak". Originally, I thought it was an acronym, but later realized it was an inside joke. A snak is "small, but more than a byte".) But understanding the Wikibase model is critical for anyone who wants to either write to the Wikidata API or query the Wikidata Query Service, and I wanted to do both. So I dug in.

The Wikibase model is an abstract model, but it is possible to represent it as a graph model. That's important because that is why the Wikidata dataset can be exported as RDF and made queryable by SPARQL in the Wikidata Query Service. After some exploration of Wikidata using SPARQL and puzzling over the data model documentation, I was able to draw out the major parts of the Wikibase model as a graph model. It's a bit too much to put in a single diagram, so I made one that showed references and another that showed qualifiers (inserted later in the post). Here's the diagram for references:


Note about namespace prefixes: the exact URI for a particular namespace abbreviation will depend on the Wikibase installation. The URIs shown in the diagrams are for Wikidata. A generic Wikibase instance will contain wikibase.svc as its domain name in place of www.wikidata.org, and other instances will use other domain names. However, the namespace abbreviations shown above are used consistently among installations, and when querying via the human-accessible Query Service or via HTTP, the standard abbreviations can be used without declaring the underlying namespaces. That's convenient because it allows code based on the namespace abbreviations to be generic enough to be used for any Wikibase installation. 

In the next several sections, I'm going to describe the Wikibase model and how Wikidata assigns identifiers to different parts of it. This will be important in deciding how to track data locally. Following that, I'll briefly describe my strategy for storing those data.

Item identifiers

The subject item of a statement is identified by a unique "Q" identifier. For example, Vanderbilt University is identified by Q29052 and the researcher Antonis Rokas is identified by Q42352198. We can make statements by connecting subject and object items with a defined Wikidata property. For example, the property P108 ("employer") can be used to state that Antonis Rokas' employer is Vanderbilt University: Q42352198 P108 Q29052. When the data are transferred from the Wikidata relational database backend fed by the API to the Blazegraph graph database backend of the Query Service, the "Q" item identifiers and "P" property identifiers are turned into URIs by prepending the appropriate namespace (wd:Q42352198 wdt:P108 wd:Q29052.)
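As a rough illustration of how the abbreviated names map to full URIs, here is a minimal Python sketch. The namespace paths are the ones Wikidata uses; the function name expand and the base parameter (for substituting another installation's domain) are my own inventions for this example, not part of any Wikibase tooling.

```python
# Namespace paths used by Wikidata. The `base` parameter is an assumption
# that lets the domain of another Wikibase installation be substituted.
NAMESPACE_PATHS = {
    "wd": "entity/",
    "wdt": "prop/direct/",
    "p": "prop/",
    "wds": "entity/statement/",
    "wdref": "reference/",
}


def expand(prefixed_name, base="http://www.wikidata.org/"):
    """Turn an abbreviated name such as 'wd:Q29052' into a full URI."""
    prefix, local_name = prefixed_name.split(":", 1)
    return base + NAMESPACE_PATHS[prefix] + local_name


# The "employer" statement from the text, written as three full URIs
triple = [expand(term) for term in ("wd:Q42352198", "wdt:P108", "wd:Q29052")]
for uri in triple:
    print(uri)
```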

We can check this out by running the following query at the Wikidata Query Service:

SELECT DISTINCT ?predicate ?object WHERE {
  wd:Q42352198 ?predicate ?object.
  }

This query returns all of the statements made about Antonis Rokas in Wikidata.
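The same query can also be sent over HTTP instead of being pasted into the web form. The following is only a sketch using the Python standard library: the function names build_request and run_query and the User-Agent string are placeholders I made up, and run_query is defined but not called here because it needs network access.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

# Standard prefixes such as wd: do not need to be declared for this endpoint
QUERY = """SELECT DISTINCT ?predicate ?object WHERE {
  wd:Q42352198 ?predicate ?object.
  }"""


def build_request(query, endpoint=ENDPOINT):
    """Build a GET request that asks for the results as JSON."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    headers = {
        "Accept": "application/sparql-results+json",
        # Placeholder value; replace with something that identifies your script
        "User-Agent": "ExampleSketch/0.1 (placeholder contact)",
    }
    return urllib.request.Request(url, headers=headers)


def run_query(query):
    """Send the query and return a list of (predicate, object) value pairs."""
    with urllib.request.urlopen(build_request(query)) as response:
        data = json.load(response)
    return [
        (binding["predicate"]["value"], binding["object"]["value"])
        for binding in data["results"]["bindings"]
    ]
```

Calling run_query(QUERY) should give the same predicate/object pairs that the Query Service GUI shows, with the abbreviated names expanded to full URIs.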

Statement identifiers

In order to be able to record further information about a statement itself, each statement is assigned a unique identifier in the form of a UUID. The UUID is generated at the time the statement is first made. For example, the particular statement above (Q42352198 P108 Q29052) has been assigned the UUID FB9EABCA-69C0-4CFC-BDC3-44CCA9782450. In the transfer from the relational database to Blazegraph, the namespace "wds:" is prepended and for some reason, the subject Q ID is also prepended with a dash. So our example statement would be identified with the URI wds:Q42352198-FB9EABCA-69C0-4CFC-BDC3-44CCA9782450. If you look at the results from the query above, you'll see

p:P108 wds:Q42352198-FB9EABCA-69C0-4CFC-BDC3-44CCA9782450

as one of the results.

We can ask what statements have been made about the statement itself by using a similar query, but with the statement URI as the subject:

SELECT DISTINCT ?predicate ?object WHERE {
  wds:Q42352198-FB9EABCA-69C0-4CFC-BDC3-44CCA9782450 ?predicate ?object.
  }

One important detail relates to case insensitivity. UUIDs are supposed to be output as lowercase, but they are supposed to be case-insensitive on input. So in theory, a UUID should represent the same value regardless of the case. However, in the Wikidata system the generated identifier is just a string and that string would be different depending on the case. So the URI

wds:Q42352198-FB9EABCA-69C0-4CFC-BDC3-44CCA9782450

is not the same as the URI

wds:Q42352198-fb9eabca-69c0-4cfc-bdc3-44cca9782450

(Try running the query with the lower case version to convince yourself that this is true.) Typically, the UUIDs generated in Wikidata are upper case, but there are some that are lower case. For example, try

wds:Q57756352-4a25cee4-45bc-63e8-74be-820454a8b7ad

in the query. Generally it is safe to assume that the "Q" in the Q ID is upper case, but I've discovered at least one case where the Q is lower case.
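The practical consequence for local record keeping is that statement identifiers should be stored and compared exactly as received, without case normalization. A small Python sketch of the difference (the helper name split_statement_id is made up for this example):

```python
import uuid


def split_statement_id(statement_id):
    """Split 'Q42352198-FB9E...' into the subject Q ID and the UUID part."""
    q_id, uuid_string = statement_id.split("-", 1)
    return q_id, uuid_string


upper = "Q42352198-FB9EABCA-69C0-4CFC-BDC3-44CCA9782450"
lower = "Q42352198-fb9eabca-69c0-4cfc-bdc3-44cca9782450"

upper_uuid = split_statement_id(upper)[1]
lower_uuid = split_statement_id(lower)[1]

# Interpreted as UUIDs, the two are the same value ...
print(uuid.UUID(upper_uuid) == uuid.UUID(lower_uuid))  # True

# ... but as identifier strings (and therefore as URIs) they differ
print(upper == lower)  # False
```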

Reference identifiers

If a statement has a reference, that reference will be assigned an identifier based on a hash algorithm. Here's an example: f9c309a55265fcddd2cb0be62a530a1787c3783e. The reference hash is turned into a URI by prepending the "wdref:" namespace. Statements are linked to references by the property prov:wasDerivedFrom. We can see an example in the results of the previous query:

prov:wasDerivedFrom wdref:8cfae665e8b64efffe44128acee5eaf584eda3a3

which shows the connection of the statement wds:Q42352198-FB9EABCA-69C0-4CFC-BDC3-44CCA9782450 (which states wd:Q42352198 wdt:P108 wd:Q29052.) to the reference wdref:8cfae665e8b64efffe44128acee5eaf584eda3a3 (which states "reference URL http://orcid.org/0000-0002-7248-6551 and retrieved 12 January 2019"). We can see this if we run a version of the previous query asking about the reference statement:

SELECT DISTINCT ?predicate ?object WHERE {
  wdref:8cfae665e8b64efffe44128acee5eaf584eda3a3 ?predicate ?object.
  }

As far as I know, reference hashes are consistently recorded in all lower case.

Reference identifiers are different from statement identifiers in that they denote the reference itself, and not a particular assertion of the reference. That is, they do not denote "statement prov:wasDerivedFrom reference", only the reference.  (In contrast, statement identifiers denote the whole statement "subject property value".) That means that any statement whose reference has exactly the same asserted statements will have the same reference hash (and URI). 

We can see that reference URIs are shared by multiple statements using this query:

SELECT DISTINCT ?statement WHERE {
  ?statement prov:wasDerivedFrom wdref:f9c309a55265fcddd2cb0be62a530a1787c3783e.
  }

Identifier examples

The following part of a table that I generated for Vanderbilt researchers shows examples of the identifiers I've described above.


We see that each item (researcher) has a unique Q ID, that each statement that the researcher is employed at Vanderbilt University (Q29052) has a unique UUID (some upper case, some lower case), and that more than one statement can share the same reference (having the same reference hash).

Statement qualifiers

In addition to linking references to a statement, the statements can also be qualified. For example, Brandt Eichman has worked at Vanderbilt since 2004.


Here's a diagram showing how the qualifier "start time 2004" is represented in Wikidata's graph database:


We can see that qualifiers are handled a little differently from references. If the qualifier property (in this case P580, "since") has a simple value (literal or item), the value is linked to the statement instance using the pq: namespace version of the property. 

If the qualifier has a complex value (e.g. a date), that value is assigned a hash and is linked to the statement instance using the pqv: version of the property. When the data are transferred to the graph database, the wdv: namespace is prepended to the hash. 

Because dates are complex, the qualifier "since" requires a non-literal value in addition to a literal value linked by the pq: version of the property (see this page for more on the Wikibase date model). We can use this query:

SELECT DISTINCT ?property ?value WHERE {
  wdv:849f00455434dc418fb4287a4f2b7638 ?property ?value.
  }

to explore the non-literal date instance.  In Wikidata, all dates are represented as full XML Schema dateTime values (year, month, day, hour, minute, second, timezone). In order to differentiate between the year "2004" and the date 1 January 2004 (both can be represented in Wikidata by the same dateTime value), the year 2004 is assigned a timePrecision of 9 and the date 1 January 2004 is assigned a timePrecision of 11.
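To make the role of timePrecision concrete, here is a minimal Python sketch that turns a dateTime string plus a precision into the date a human would read. It only handles the two precisions mentioned above and simple common-era dates; the function name describe_date is hypothetical.

```python
def describe_date(date_time, time_precision):
    """Reduce a full dateTime string to what its timePrecision says is known."""
    # Keep only the date part, e.g. '2004-01-01' from '2004-01-01T00:00:00Z'
    year, month, day = date_time.split("T")[0].split("-")
    if time_precision == 9:    # only the year is meaningful
        return year
    if time_precision == 11:   # the full date is meaningful
        return f"{year}-{month}-{day}"
    raise ValueError(f"precision {time_precision} is not handled in this sketch")


# The same stored value means different things at different precisions
print(describe_date("2004-01-01T00:00:00Z", 9))   # 2004
print(describe_date("2004-01-01T00:00:00Z", 11))  # 2004-01-01
```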

Not every qualifier will have a non-literal value. For example, the property "series ordinal" (P1545; used to indicate things like the order authors are listed) has only literal values (integer numbers). So there are values associated with pq:P1545, but not pqv:P1545. The same is true for "language of work or name" (P407; used to describe websites, songs, books, etc.), which has an entity value like Q1860 (English).

Labels, aliases, and descriptions

Labels, aliases, and descriptions are properties of items that are handled differently from other properties in Wikidata. Labels and descriptions are handled in a similar manner, so I will discuss them together.

Each item in Wikidata can have only one label and one description in any particular language. Therefore adding or changing a label or description requires specifying the appropriate ISO 639-1 code for the intended language.  When a label or description is changed in Wikidata, the previous version is replaced.

One important restriction is that the label/description combination in a particular language must be unique. For example, the person with the English label "John Jones" and English description "academic" can currently only be Q16089943. Because labels and descriptions can change, this label/description combination won't necessarily be permanently associated with Q16089943, because someone might give that John Jones a more detailed description, or make his name less generic by adding a middle name or initial. So at some point in the future, it might be possible for some other John Jones to be described as "academic".  An implication of the prohibition against two items sharing the same label/description pair is that it's better to create labels and descriptions that are as specific as possible to avoid collisions with pre-existing entities. As more entities get added to Wikidata, the probability of such collisions increases.

There is no limit to the number of aliases that an item can have per language. Aliases can be changed by either changing the value of a pre-existing alias or adding a new alias. As far as I know, there is no prohibition about aliases of one item matching aliases of another item.

When these statements are transferred to the Wikidata graph database, labels are values of rdfs:label, descriptions are values of schema:description, and aliases are values of skos:altLabel. All of the values are language-tagged.
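Since all three kinds of values are ordinary language-tagged literals in the graph, they can be retrieved with one query. The sketch below just builds the SPARQL text in Python; build_label_query is a made-up helper name, and the query relies on the Query Service recognizing the rdfs:, schema:, and skos: prefixes without declarations.

```python
def build_label_query(q_id, language="en"):
    """Build a query for the label, description, and aliases of one item."""
    return f"""SELECT DISTINCT ?predicate ?value WHERE {{
  VALUES ?predicate {{ rdfs:label schema:description skos:altLabel }}
  wd:{q_id} ?predicate ?value.
  FILTER(lang(?value) = "{language}")
  }}"""


# English label, description, and aliases for the John Jones example
print(build_label_query("Q16089943"))
```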

What am I skipping?

Another component of the Wikibase model that I have not discussed is ranks. I also haven't talked about statements that don’t have values (PropertyNoValueSnak and PropertySomeValueSnak), and sitelinks. These are features that may be important to some users, but have not yet been important enough to me to incorporate handling them in my code. 

Local data storage

If one wanted to make and track changes to Wikidata items, there are many ways to accomplish that with varying degrees of human intervention.  Last year, I spent some time pondering all of the options and came up with this diagram:


Tracking every statement, reference, and qualifier for items would be complicated because each item could have an indefinite number and kind of properties, values, references, and qualifiers.  To track all of those things would require a storage system as complicated as Wikidata itself (such as a separate relational database or a Wikibase instance as shown in the bottom of the diagram). That's way beyond what I'm interested in doing now. But what I learned about the Wikibase model and how data items are identified suggested to me a way to track all of the data that I care about in a single, flat spreadsheet. That workflow can be represented by this subset of the diagram above:


I decided on the following structure for the spreadsheet (a CSV file; example here). The Wikidata Q ID serves as the key for an item and the data in a row are about a particular item. A value in the Wikidata ID column indicates that the item already exists in Wikidata. If the Wikidata ID column does not have a value, that indicates that the item needs to be created. 

Each statement has a column representing the property with the value of that property for an item recorded in the cell for that item's row.  For each property column, there is an associated column for the UUID identifying the statement consisting of the item, property, and value. If there is no value for a property, no information is available to make that statement. If there is a value and no UUID, then the statement needs to be asserted. If there is a value and a UUID, the statement already exists in Wikidata.  

References consist of one or more columns representing the properties that describe the reference. References have a single column to record the hash identifier for the reference.  As with statements, if the identifier is absent, that indicates that the reference needs to be added to Wikidata. If the identifier is present, the reference has already been asserted.  
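The rules above for items, statements, and references amount to a simple "is the identifier cell empty?" test. Here is a rough Python sketch of that logic. The column names and the second sample row (including its example.org URL) are hypothetical and chosen only for illustration; they are not necessarily the headers used in the actual CSV file.

```python
import csv
import io

# Hypothetical column names. The first row uses identifiers from this post;
# the second row is an invented item that has not been written yet.
SAMPLE_CSV = """wikidataId,employer,employerUuid,employerReferenceUrl,employerReferenceHash
Q42352198,Q29052,FB9EABCA-69C0-4CFC-BDC3-44CCA9782450,http://orcid.org/0000-0002-7248-6551,8cfae665e8b64efffe44128acee5eaf584eda3a3
,Q29052,,http://example.org/source,
"""


def planned_actions(row):
    """List what would need to be written to Wikidata for one row."""
    actions = []
    if not row["wikidataId"]:
        actions.append("create item")
    # A value without a UUID means the statement has not been asserted yet
    if row["employer"] and not row["employerUuid"]:
        actions.append("add employer statement")
    # Reference data without a hash means the reference has not been added yet
    if row["employerReferenceUrl"] and not row["employerReferenceHash"]:
        actions.append("add employer reference")
    return actions


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for row in rows:
    print(row["wikidataId"] or "(new item)", planned_actions(row))
```

After a successful write, the returned Q ID, UUID, or reference hash would be recorded in the empty cell, so that running the check again produces no actions for that row.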

Because labels, descriptions, and many qualifiers do not have URIs assigned as their identifiers, their values are listed in columns of the table without corresponding identifier columns.  Determining whether the labels, descriptions, and qualifiers in the table already exist in Wikidata requires a SPARQL query. That process is described in the fourth blog post.

Where does VanderBot come in?

In the first post of this series, I showed a version of the following diagram to illustrate how I wanted VanderBot (my Python script for loading Vanderbilt researcher data into Wikidata) to work. That diagram is basically an elaboration of the simpler previous diagram.


The part of the workflow circled in green is the API writing script that I will describe in the third post of this series (the next one). The part of the workflow circled in orange is the data harvesting script that I will describe in the fourth post. Together these two scripts form VanderBot in its current incarnation.

Discussing the scripts in that order may seem a bit backwards because when VanderBot operates, the data harvesting script works before the API writing script. But in developing the two scripts, I needed to think about how I was going to write to the API before I thought about how to harvest the data. So it's probably more sensible for you to learn about the API writing script first as well. Also, the design of the API writing script is intimately related to the Wikidata data model, so that's another reason to talk about it next after this post.
