Steve Baskauf's blog

Favorite Nebula Award-winning Novels

2025-03-31T09:15:00.000-07:00

After finishing reading all of the Hugo Award winners for best novel, I decided to keep up the momentum and read all of the winners of the Nebula Award for best novel. I just finished the last one yesterday and decided to write a follow-up post to my earlier one where I talked about which of the Hugo books were my favorites.

In this post, I’ll discuss the Nebula winners that were not also Hugo winners, and list those double-winners that I already described in the previous post.

The Nebula Award started in 1966, so there are over 10 years of generally poor-quality books that were Hugo-eligible but that were out of contention in this quest (all of my 5 worst Hugo books were before 1962). Nevertheless, there were two that I disliked enough to put in the category of worst Nebula winners.

List of favorites

Not surprisingly, many of the really good books won both awards. So the list of favorites that won only the Nebula is rather short. It’s hard to be sure that I’m holding them to exactly the same standard as my Hugo favorites – I may be a little more generous here. But all of these favorites are solid and worth a read.

NOTE: to read my full reviews of all of the Nebula Award-winning novels see my Goodreads books.

Samuel R. Delany: Babel-17 (1967)

I’m not sure that this falls into my top books of all times, but it was one of my favorite of the Nebula winners. It was pretty weird, which often is a negative for me, but somehow this one was weird in an interesting way and also pretty good for 1967. In particular, I liked the strong female main character, which was refreshing for a book from that decade. Like Babel, the 2023 winner, it was in the “power of words and language” genre, but did not drag on for page after dull page as Babel did.

Daniel Keyes: Flowers for Algernon (1967)

It’s hard to place this book relative to the others since I haven’t read it since I first did in about 1974. But I think it made a big impression on me at the time and was a really solid story. Very different in style from the other 1967 winner that I just described.

Greg Bear: Darwin’s Radio (2001)

This was a really exciting and interesting book that I had trouble putting down. Not one of my all-time top books, and I found the biology a bit hard to swallow (as a biologist). But worth reading.

Elizabeth Moon: The Speed of Dark (2004)

This is not your typical sci-fi book: the focus was not really on the advanced technology, which was only tangential to the real story line: seeing our world from the eyes of someone with autism. I was a bit disappointed with the ending, but otherwise it was a really engaging and thought-provoking book.

Favorites that also won the Hugo

The following favorite Nebula winners were already discussed in my previous blog post on favorite Hugo winners, so I will just list them here and let you read about them in the other post.

Frank Herbert: Dune (1966)

Ursula K. Le Guin: Left Hand of Darkness (1970)

Frederik Pohl: Gateway (1978)

Orson Scott Card: Ender’s Game (1986), Speaker for the Dead (1987)

Connie Willis: Blackout/All Clear (2011), Doomsday Book (1993)

N. K. Jemisin: The Stone Sky (2018)

Lois McMaster Bujold: Falling Free (1996)

I’m putting this in a special category. This book did not win the Hugo and is actually not one of my favorite Bujold books, but I am including it because of my general love of the other Vorkosigan Universe books, some of which did win the Hugo.

5 star book that didn’t make my favorite list:

NOTE: there are other Nebula 5 start books that are not listed here because they were listed on the Hugo 5 star list of my other post and are therefore not repeated here.

Vonda N. McIntyre: The Moon and the Sun (1998)

Perhaps I was generous to give this 5 stars, but I liked the story. A bit slow to start and too much detail about the French court, but I really liked the characters and how different “good” characters had different viewpoints on topics like sex and religion.

Worst Nebula books

No “best” list would be complete without a corresponding “worst” list. Here it is:

Samuel R. Delany: The Einstein Interaction (1968)

Very weird book that was probably trying to make some point about myths that was lost on me. Interesting that Delany makes both my best and worst list!

Robert Silverberg: A Time of Changes (1972)

Preachy and depressing premise, disgusting attitude towards women, poor writing.

Favorite Hugo Award-winning Novels

2025-02-16T18:14:00.000-08:00

In May 2023, I completed a quest that was on my bucket list: reading all of the winners of the Hugo Award for best science fiction/fantasy novel. At that time, there were 71 books on the list (not counting “Retro-Hugo” winners). I’m not sure when I read my first one – the first one that I can unambiguously remember reading was Dune in about 1975 or 76. I read a number of the winners from the 70’s through 90’s soon after their publication, before having kids and going to grad school cut back on my pleasure reading time. Starting in 2021, I resolved to spend more time reading for fun and a number of the more recent winners were recommended to me by my daughter. This enticed me to take up the challenge of finishing all of them and Goodreads tells me that I read 31 of them in the 12 months preceding May 2023.

Having read them all, I am enjoying thinking back on them and pondering which were my favorites. I decided to write this post to list them.

Why a favorite?

There are several criteria that must have been met to make my “favorites” list. First and foremost, the book must be deeply engaging. To me, a great fiction book draws me into its world, and while I’m reading I’m transported to that world and barely aware that I am sitting in this world reading. Second, the story needs to be clever, creative, or explore a universe that has some really interesting and different twist. Third, the story can’t be ruined by being overtly sexist, dated, or transparently preaching about the author’s pet peeve. It is fine for the book to have a point, but that point needs to be made through the storytelling.

Another characteristic (but not a requirement) is that I found myself pondering and thinking about these stories for days or weeks after reading them, and years later thinking how I would like to re-read them.

List of my favorites

NOTE: to read my full reviews of all of the Hugo Award-winning novels, see my Goodreads books.

N. K. Jemisin: The Fifth Season (2016), The Obelisk Gate (2017), The Stone Sky (2018)

This trilogy was so different and interesting that I was quickly intrigued by it. The narrative style of The Fifth Season was also really cool. Some parts of The Stone Sky were a bit hard to believe, but the trilogy's overall the story was very satisfying.

Vernor Vinge: A Fire Upon the Deep (1993), A Deepness in the Sky (2000)

I was not familiar with Vernor Vinge before I started reading the Hugo books, but I now really appreciate his creativity and storytelling. Both of these books have a compelling story arc, but also have fascinating and creative alien species whose interactions with humans form an integral part of the story. One interesting character overlaps in the two books.

Frank Herbert: Dune (1966)

It is a bit difficult for me to objectively compare Dune to my other favorite Hugo books, since it was probably the first “epic” sci-fi/fantasy book that I read. But at that time, I was blown away by the complex vision that Herbert created in the book. Queen’s Night at the Opera had come out not long before I read Dune and I listened to “The Prophet’s Song” many times while reading. It has become indelibly associated with Dune in my mind. If you’ve read Dune, listen to The Prophet's Song and see if you can tell why it made such a strong connection for me.

Connie Willis: Blackout/All Clear (2011), Doomsday Book (1993)

Although both of these books involve pretty depressing topics (WW II and the plague), the story telling really immersed me in those time periods. The character’s struggle to survive and return to their own time, overlaid with their efforts to recognize the humanity and dignity of the people of those times in the most trying circumstances, made for a compelling plot.

Orson Scott Card: Ender’s Game (1986), Speaker for the Dead (1987)

Although these books might be classified as young adult books, they had really interesting and surprising plots.

Lois McMaster Bujold: The Vor Game (1991), Barrayar (1992), Mirror Dance (1995)

I include these books not because they were my particular Bujold favorites, but rather because the entire Miles Vorkosigan series were so clever, funny, and entertaining. They are certainly one of my favorite book series, with The Warrior's Apprentice (not a nominee) as the very best.

C. J. Cherryh: Downbelow Station (1982), Cyteen (1989)

I include these Cherryh books for a similar reason as the Bujold books. They weren’t necessarily my favorite Cherryh books (that would probably be the Chanur books, nominated in 1983 but did not win). But C. J. Cherryh is overall one of my favorite sci-fi authors and her Alliance/Union universe is complex and fascinating.

Ursula K. Le Guin: The Left Hand of Darkness (1970)

It would be difficult to not include Le Guin somewhere on my list. The Left Hand of Darkness is certainly one of her best books, although probably the Lathe of Heaven (nominated in 1972 but did not win) is my favorite. Le Guin is perhaps unmatched for her ability to situate interesting plots in worlds and cultures that are thought-provoking.

Frederik Pohl: Gateway (1978)

I read Gateway many years ago, so I’m not sure how I would feel about it now. But at the time, the novelty of the story premise and narrative style really appealed to me.

J. K. Rowling: Harry Potter and the Goblet of Fire (2001)

This is actually my least favorite Harry Potter book. But the Harry Potter saga is one of my top fantasy series, so I included it on that basis.

Runners up:

Robert Sawyer: Hominids (2003)

This book was borderline and did not quite make the cut, but I have to say that I was quite intrigued by the underlying concept of the world, and I just really liked the story and imagining how the world would be different if a different Homo species had come to dominate the earth.

Walter M. Miller Jr.: A Canticle for Leibowitz (1961)

I had first heard the NPR radio dramatization of this in the early 1980’s and was not overly impressed. I am also not a big fan of post-apocalyptic books. But when I read the book recently, I really enjoyed the story-telling and premise of the first two parts of the book. It was far superior to most other sci fi books I’ve read that were written in the 1950’s and early 1960’s. But it got booted from the favorites list because of the “no preachiness” criterion. The third part of the book was just transparently an anti-euthanasia sermon and that ruined the last part of the book for me.

Vonda N. McIntyre: Dreamsnake (1979)

When I started reading this book, I was expecting to dislike it. As I said, I don’t really like post-apocalyptic stories that well, and the beginning of the book seemed pretty hokey to me. But as the story was built out, I really found myself enjoying it. As a post-apocalyptic novel, it was pretty unusually in emphasizing kindness as a basic human characteristic. That was really refreshing to me.

John Scalzi: Redshirts (2013)

This was a short and very funny parody of Star Trek. Surprisingly, it was actually built into a somewhat clever story. Definitely work a read.

Other books I gave 5 star ratings to:

Ann Leckie: Ancillary Justice (2014)

Very interesting take on A.I.

Arkady Martine: A Memory Called Empire (2020), A Desolation Called Peace (2022)

Intriguing world told from the perspective of someone confused about another culture.

Neil Gaiman: The Graveyard Book (2009)

Did not think I would like, but did.

Mary Robinette Kowal: The Calculating Stars (2019)

Pretty interesting overall plot concept, but bordering on unbelievable.

Larry Nevin: Ringworld (1971)

Clever world-building, but too sexist for my tastes now.

Victor Vinge: Rainbows End (2007)

Enjoyable and interesting book, but not up to the level of his two books I put on my favorites.

Paolo Bacigalupi: The Windup Girl (2010)

Very interesting world, but a bit too violent and depressing for me to fully enjoy

Robert Charles Wilson: Spin (2006)

Engaging and suspenseful, but not top tier.

Roger Zelazny: Lord of Light (1968)

Really interesting presentation: not sure what was real and what was mythological. Far superior and more creative than many of the books from the 1960’s.

William Gibson: Neuromancer (1985)

Story line not amazing, but very prescient and the storytelling was vivid. The origin of "cyberspace" and cyberpunk.

Philip K. Dick: The Man in the High Castle (1963)

One of the rare excellent winners from the early 1960's. An early entry in the alternate universe genre and well-written.

Worst Hugo Books:

No "best" list would be complete without a corresponding "worst" list.

It seems to me pretty clear that the quality of science fiction and fantasy writing has generally improved over time. Anyone who thinks that the 1959’s and early 1960’s was some kind of golden age for science fiction clearly has not read these terrible books. One thing I cannot figure out is why Robert Heinlein is considered a great sci-fi author. The books of his that I have read ranged from mediocre to downright awful.

Robert A. Heinlein: Starship Troopers (1960)

This was just a deplorable book. One of the few I’ve given a one-star rating. Almost no plot and transparent right-wing ax-grinding.

Robert A. Heinlein: Stranger in a Strange Land (1962)

Despite this being a “famous book”, it was really terrible. Disgusting sexism (female character says “Nine times out of ten, if a girl gets raped, it’s partly her fault.”), characters droning on about Heinlein’s pet issues, etc.

Fritz Leiber: The Big Time (1958)

No real plot, stereotypical characters, dumb premise.

James Blish: A Case of Conscience (1959)

Lack of imagination about technology, shallow and stereotypical female characters, pages of pontification by characters with no plot development, …

Mark Clifton: They’d Rather Be Right (1955)

Almost impossible to obtain, and for good reason. An implausible story with annoying political overtones, masquerading as a science fiction story. Really, really dumb portrayal of A.I.

Building an Omeka website on AWS

2023-08-06T14:03:00.001-07:00


James H. Bassett, “Okapi,” Bassett Associates Archive, accessed August 5, 2023, https://bassettassociates.org/archive/items/show/337. Available under a CC BY 4.0 license.

Several years ago, I was given access to the digital files of Bassett Associates, a landscape architectural firm that operated for over 60 years in Lima, Ohio. This award-winning firm, which disbanded in 2017, was well known for its zoological design work and also did ground-breaking work in incorporating storm water retention as part of landscape site design. In addition to images of plans and site photographs, the files also included scans of sketches done by the firm's founder, James H. Bassett, which was artwork in its own right. I had been deliberating what the best way was to make these works publicly available and decided that this summer I would make it my project to set up an online digital archive featuring some of the images from the files.

Given my background as a Data Science and Data Curation Specialist at the Vanderbilt Libraries, it seemed like a good exercise to figure out how to set up Omeka Classic on Amazon Web Services (AWS), Vanderbilt's preferred cloud computing platform. Omeka is a free, open-source web platform that is popular in the library and digital humanities communities for creating online digital collections and exhibits, so it seemed like a good choice for me given that I would be funding this project on my own.

Preliminaries

The hard drive I have contains about 70 000 files collected over several decades. So the first task was to sort through the directories to figure out exactly what was there. For some of the later projects, there were some born-digital files, but the majority of the images were either digitizations of paper plans and sketches, or scans of 35mm slides. In some cases, the same original work was present several places on the drive with a variety of resolutions, so I needed to sort out where the highest quality files were located. Fortunately, some of the best works from signature projects had been digitized for an art exhibition, "James H. Bassett, Landscape Architect: A Retrospecive Exhibition 1952-2001" that took place in Artspace/Lima in 2001. Most of the digitized files were high-resolution TIFFs, which were ideal for preservation use. I focused on building the online image collection by featuring projects that were highlighted in that exhibition, since they covered the breadth of types of work done by the firm throughout its history.

The second major issue was to resolve the intellectual property status of the images. Some had previously been published in reports and brochures, and some had not. Some were from before the 1987 copyright law went into effect and some were after. Some could be attributed directly to James Bassett before the Bassett Associates corporation was formed and others could not be attributed to any particular individual. Fortunately, I was able to get permission from Mr. Bassett and the other two owners of the corporation when it disbanded to make the images freely available under a Creative Commons Attribution 4.0 International (CC BY 4.0) license. This basically eliminated complications around determining the copyright status of any particular work, and allows the images to be used by anyone as long as they provide the requested citation.

TIFF pyramid for a sketch of the African plains exhibit at the Kansas City Zoo. James H. Bassett, “African Plains,” Bassett Associates Archive, accessed August 6, 2023, https://bassettassociates.org/archive/items/show/415. Available under a CC BY 4.0 license.

Image pre-processing

For several years I have been investigating how to make use of the International Image Interoperability Framework (IIIF) to provide a richer image viewing experience. Based on previous work and experimentation with our Libraries' Cantaloupe IIIF server, I knew that large TIFF images needed to be converted to tiled pyramidal (multi-resolution) form to be effectively served. I also discovered that TIFFs using CMYK color mode did not display properly when served by Cantaloupe. So the first image processing step was to open TIFF or Photoshop format images in Photoshop, flatten any layers, convert to RGB color mode if necessary, reduce the image size to less than 35 MB (more on size limits later), and save the image in TIFF format. JPEG files were not modified -- I just used the highest resolution copy that I could find.

Because I wanted to make it easy in the future to use the images with IIIF, I used a Python script that I wrote to converting single-resolution TIFFs en mass to tiled pyramidal TIFFs via ImageMagick. These processed TIFFs or high-resolution JPEGs were the original files that I eventually uploaded to Omeka.

Why use AWS?

One of my primary reasons for using AWS as the hosting platform was the availability of S3 bucket storage. AWS S3 storage is very inexpensive and by storing the images there rather than within the file storage attached to the cloud server, the image storage capacity could basically expand indefinitely without requiring any changes to the configuration of the cloud server hosting the site. Fortunately, there is an Omeka plug-in that makes it easy to configure storage in S3.

Another advantage (not realized in this project) is that because image storage is outside the server in a public S3 bucket, the same image files can be used as source files for a Cantaloupe installation. Thus a single copy of an image in S3 can serve the purpose of provisioning Omeka, being the source file for IIIF image variants served by Cantaloupe, and having a citable, stable URL that allows the original raw image to be downloaded by anyone.

I've also determined through experimentation that one can run a relatively low-traffic Omeka site on AWS using a single t2.micro tier Elastic Compute Cloud (EC2) server. This minimally provisioned server currently costs only US$ 0.0116 per hour (about $8 per month) and is "free tier eligible", meaning that new users could run a Omeka on EC2 for free during the first year. Including the cost of the S3 storage, one could run an Omeka site on AWS with hundreds of images for under $10 per month.

The down side

The main problem with installing Omeka on AWS is that it is not a beginner-level project. I'm relatively well-acquainted with AWS and Unix command line, but it took me a couple months on and off to figure out how to get all of the necessary pieces to work together. Unfortunately, there wasn't a single web page that laid out all of the steps, so I had to read a number of blog posts and articles, then do a lot of experimenting to get the whole thing to work. I did take detailed notes, including all of the necessary commands and configuration details, so it should be possible for someone with moderate command-line skills and a willingness to learn the basics of AWS to replicate what I did.

Installation summary

In the remainder of this post, I'll walk through the general steps required to install Omeka Classic on AWS and describe important considerations and things I learned in the process. In general, there are three major components to the installation: setting up the S3 storage, installing Omeka on EC2, and getting a custom domain name to work with the site using secure HTTP. Each of these major steps includes several sub-tasks that will be described below.

S3 setup

The basic setup of an S3 bucket is very simple and involves only a few button clicks. However, the way Omeka operates, several additional steps are required for the bucket setup.

By design, AWS is secure and generally one wants to permit only the minimum required access to resources. But because Omeka exposes file URLs publicly so that people can download those files, the S3 bucket must be readable by anyone. Omeka also writes multiple image variant files to S3, and this requires generating access keys whose security must be carefully guarded.

You can manually upload files and enter their metadata by typing into boxes in the Omeka graphical interface. That's fine if you will only have a few items. However, if you will be uploading many items, uploading using the graphical interface is very tedious and requires many button clicks. To create an efficient upload workflow, I used the Omeka CSV import plugin. It requires loading the files via a URL during the import process, so I used a different public S3 bucket as the source of the raw images. I used a Python script to partially automate the process of generating the metadata CSV and as part of that script, I uploaded the images automatically to the source raw image bucket using the AWS Python library (boto3). This required creating access credentials for the raw image bucket and to reduce security risks, I created a special AWS user that was only allowed to write to that one bucket.

The ASW free tier allows a new user access to up to 5 GB for free during the first year. That corresponds to roughly a hundred high-resolution (50 MB) TIFF images.

Omeka installation on EC2

As with the set up of S3 buckets, launching an EC2 server instance just involves a few button clicks. What is trickier and somewhat tedious is performing the actual setup of Omeka within the server. Because the setup is happening at some mysterious location in the cloud, you can't point and click like you can on your local computer. To access the EC2 server, you have to essentially create a "tunnel" into it by connecting to it using SSH. Once you've done that, commands that you type into your terminal application are being applied to the remote server and not your local computer. Thus, everything you do must be done at the command line. This requires basic familiarity with Unix shell commands and since you also need to edit some configuration files, you need to know how to use a terminal-based editor like Nano.

The steps involve:

- installing a LAMP (Linux, Apache, MySQL, and PHP) server bundle

- creating a MySQL database

- downloading and installing Omeka

- modifying Apache and Omeka configuration files

- downloading an enabling the Omeka S3 Storage Adapter and CSV Import plugins

Once you have completed these steps (which actually involve issuing something like 50 complicated Unix commands that fortunately can be copied and pasted from my instructions), you will have a functional Omeka installation on AWS. However, accessing it would require users to use a confusing and insecure URL like

http://54.243.224.52/archive/

Mapping an Elastic IP address to a custom domain and enabling secure HTTP

To change this icky URL to a "normal" one that's easy to type into a browser and that is secure, several additional steps are required.

AWS provides a feature called an Elastic IP address that allows you to keep using the same IP address even if you change the underlying resource it refers to. Normally, if you had to spin up a new EC2 instance (for example to restore from a backup), it would be assigned a new IP address, requiring you to change any setting that referred to the IP address of the previous EC2 you were using. An Elastic IP address can be reassigned to any EC2 instance, so disruption caused by replacing the old EC2 with a new one can be avoided by just shifting the Elastic IP to the new instance. Elastic IPs are free as long as they remain associated with a running resource.

It is relatively easy to assign a custom domain name to the Elastic IP if the AWS Route 53 domain registration is used. The cost of the custom domain varies depending on the specific domain name that you select. I was able to obtain `bassettassociates.org` for US$12 per year, adding $1 per month to the cost of running the website.

After the domain name has been associated with the Elastic IP address, the last step is to enable secure HTTP (HTTPS). When initially searching the web for instructions on how to do that, I found a number of complicated and potentially expensive suggestions including installing an Nginx front-end server and using an AWS load balancer. Those options are overkill for a low-traffic Omeka site. In contrast, it is relatively easy to get free security certificate from Let's Encrypt and set it up to automatically renew monthly using Certbot for an Apache server.

After completing these steps, one can now access my Omeka instance at https://bassettassociates.org/archive/.

Optional additional steps

If you plan to have multiple users editing the Omeka site, you won't be able to add users beyond the default Super User without additional steps. It appears that it's not possible to add more users without enabling Omeka to send emails. This requires setting up AWS Simple Email Service (SES), then adding the SMPT credentials to the Omeka configuration file. SES is designed to enable sending mass emails, so being approved for production access requires applying for approval. I didn't have any problems getting approved when I explained that I was only going to use it to send a few confirmation emails, although the process took at least a day since apparently a human has to examine the application.

There are three additional plugins that I installed that you may consider using. The Exhibit Builder and Simple Pages plugins add the ability to create richer content. Installing them is trivial, so you will probably want to turn them on. I also installed the CSV Export Format plugin because I wanted to use it to capture identifier information as part of my partially automated workflow (see following sections for more details).

If you are interested in using IIIF on your site, you may also want to install the IIIF Toolkit plugin, explained in more detail later.

Efficient workflow

Once Omeka is installed and configured, it is possible to just upload content manually using the Omeka graphical interface. That's fine if you will only have a few objects. However, if you will be uploading many objects, uploading using the graphical interface is very tedious and requires many button clicks.

The workflow described here is based on assembling the metadata in the most automated way possible, using file naming conventions, a Python script, and programatically created CSV files. Python scripts are also used to upload the files to S3, and from there they can be automatically imported into Omeka.

After the items are imported, the CSV export plugin can be used to extract the ID numbers assigned to the items by Omeka. A Python script then extracts the IDs from the resulting CSV and inserts them into the original CSVs used to assemble the metadata.

For full details about the scripts and step-by-step instructions, see the detailed notes that accompany this post.

Notes about TIFF image files

If original image files are available as high-resolution TIFFs, that is probably the best format to archive from the preservation standpoint. However, most browsers will not display TIFFs natively, while JPEGs can be displayed onscreen. The practical implication of this is that image thumbnails are linked directly to the original highres image file. So when a user clicks on the thumbnail of a JPEG, the image is displayed in their browser, but when a TIFF thumbnail is clicked, the file downloads to the user's hard drive without being displayed. When an image is uploaded, Omeka makes several JPEG copies at lower resolution so that they can be displayed onscreen in the browser without downloading.

As explained in the preprocessing section above, the workflow includes an additional conversion step that only applies to TIFFs.

Note about file sizes

In the file configuration settings, I recommend seting a maximum file size of 100 MB. Virtually no JPEGs are ever that big, but some large TIFF files may exceed that size. As a practical matter, the upper limit on file size in this installation is actually about 50 MB. I have found from practical experience that importing original TIFF files between 50 and 100 MB can generate errors that will cause the Omeka server to hang. I have not been able to isolate the actual source of the problem, but it may be related to the process of generating the lower resolution JPEG copies. The problem may be isolated to using the CSV import plugin because some files that hung the server when using the CSV import were then able to be uploaded manually after creating the item record. In one instance, a JPEG that was only 11.4 MB repeatedly failed to upload using the CSV import. Apparently its large pixel dimensions (6144x4360) were the problem (it also was successfully uploaded manually).

The other thing to consider is that when TIFFs are converted to tiled pyramidal form, there is an increase in size of roughly 25% when the low-res layers are added to the original high-res layer. So a 40 MB raw TIFF may be at or over 50 MB after conversion. I have found that if I keep the original file size below 35 MB, the files usually load without problems. It is annoying to have to decrease the resolution of any souce files in order to add them to digital collection, but there is a workaround (described in the IIIF section below) for extremely large TIFF image files.

The CSV Import plugin

An efficient way to import multiple images is to use the CSV Import plugin. The plugin requires two things: a CSV spreadsheet containing file and item metadata, and files that are accessible directly using a URL. Because files on your local hard drive are not accessible via a URL, there are a number of workarounds that can be used, such a uploading the images to a cloud service like Google Drive or Dropbox. Since we are using AWS S3 storage, it makes sense to make the image files accessible from there, since files in a public S3 bucket can be accessed by a URL. (Example of raw image available from an S3 bucket via the URL: https://bassettassociates.s3.amazonaws.com/glf/haw/glf_haw_pl_00.tif)

One could create the metadata CSV entirely by hand by typing and copying and pasting in a spreadsheet editor. However, in my case, because of the general inconsistency in file names on the source hard drive, I was renaming all of the image files anyway. So I established a file identifier coding system that when used with file names would both group similar files together in the directory listing and also make it possible to automate populating some of the metadata fields in the CSV. The Python script that I wrote generated a metadata CSV with many of the columns already populated, including the image dimensions, which it extracted from the EXIF data in the image files. After generating a first draft of the CSV, I then had to manually add the date, title, and description fields, plus any tags I wanted to add in addition to the ones that the script generated automatically from the file names. (Example of completed CSV metadata file)

The CSV import plugin requires that all items imported as a batch be the same general type. Since my workflow was build to handle images, that wasn't be a problem -- all items were Still Images. As a practical matter, it was best to restrict all of the images in a batch to be for the same Omeka collection. If images intended for several collections were uploaded together in a batch, they would have had to be assigned to collections manually after upload.

Omeka identifiers

When Omeka ingests image files, it automatically assigns an opaque ID (e.g. 3244d9cdd5e9dce04e4e0522396ff779) to the image and generates JPEG versions of the original image at various sizes. These images are stored in the S3 bucket that you set up for Omeka storage. Since those images are publicly accessible by URL, you could provide access to them for other purposes. However, since the file names are based on the opaque identifiers and have no connection with the original file names, it would be difficult to know what the access URL would be. (Example: https://bassett-omeka-storage.s3.amazonaws.com/fullsize/3244d9cdd5e9dce04e4e0522396ff779.jpg)

Fortunately, there is a CSV Export Format plugin that can be used to discover the Omeka-assigned IDs along with the original identifiers assigned by the provider as part of the CSV metadata that was uploaded during the import process. In my workflow, I have added additional steps to do the CSV export, then run another Python script that pulls the Omeka identifiers from the CSV and archives them along with the original user-assigned identifier in an identifier CSV. At the end of processing each batch, I push the identifier and metadata CSV files to GitHub to archive the data used in the upload.

In theory, the images in the raw image upload CSV file could be deleted. However, S3 storage costs are so low that you probably will just want to leave them there. Since they have meaningful file names and a subfolder organization of your choice, they would make a pretty nice cloud backup system that is independent of the Omeka instance. After your archive project is complete, you could change the raw image source bucket over to one of the cheaper, low-access types (like Glacier) that have even lower storage costs than a standard S3 bucket. Because both buckets are public, you can use them as a means of giving access to the original high-res files by simply giving the Object URL to the person wanting a copy of the file.

Backing up the data

There are two mechanisms for backing up your data periodically.

The most straightforward is to create an Amazon Machine Image (AMI) of the EC2 server. Not only will this save all of your data, but it will also archive the complete configuration of the server at the time the image is made. This is critical if you have any disasters while making major configuration changes and need to roll back the EC2 to an earlier (functional) state. It is quite easy to roll back to an AMI and re-assign the Elastic IP to the new EC2 instance. However, this rollback will have no impact on any files saved in S3 by Omeka after the time when the backup AMI was created. Those files won't hurt anything, but they will effectively be uselessly orphaned there.

The CSV files pushed to GitHub after each CSV import (example) can also be used as a sort of backup. Any set of rows from the saved metadata CSV file can be used to re-upload those items onto any Omeka instance as long as the original files are still in the raw source image S3 bucket. Of course, if you make manual edits to the metadata, the metadata in the CSV file would become stale.

Using IIIF tools in Omeka

There are two Omeka plugins that add International Image Interoperability Framework (IIIF) capabilities.

The UniversalViewer plugin allows Omeka to serve images like a IIIF image server and it generates IIIF manifests using the existing metadata. That makes it possible for the Universal Viewer player (included in the plugin) to display images in a rich manner that allows pan and zoom. This plugin was very appealing to me because if it functioned well, it would enable IIIF capabilities without needing to manage any other servers. I was able to install it and the embedded Universal Viewer did launch, but the images never loaded in the viewer. Despite spending a lot of time messing around with the settings, disabling S3 storage, and launching a larger EC2 image, I was never able to get it to work, even for a tiny JPEG file. I read a number of Omeka forum posts about troubleshooting, but eventually gave up.

If I had gotten it to work, there was one potential problem with the setup anyway. The t2.micro instance that I'm running has very low resource capacity (memory, number of CPUs, drive storage), which is OK as I've configured it because the server just has to run a relatively tiny MySQL database and serve static files from S3. But presumably this plugin would also have to generate the image variants that it's serving on the fly and that could max out the server quite easily. I'm disappointed that I couldn't get it to work, but I'm not confident that it's the right tool for a budget installation like this one.

I had more success with the IIIF Toolkit plugin. It also provides an embedded Universal Viewer that can be inserted various places in Omeka. The major downside is that you must have access to a separate IIIF server to actually provide the images used in the viewer. I was able to test it out by loading images into the Vanderbilt Libraries' Cantaloupe IIIF server and it worked pretty well. However, setting up your own Cantaloupe server on AWS does not appear to be a trivial task and because of the resources required for the IIIF server to run effectively, it would probably cost a lot more per month to operate than the Omeka site itself. (Vanderbilt's server is running on a cluster with a load balancer, 2 vCPU, and 4 GB memory. All of these increases over a basic single t2.micro instance would involve a significantly increased cost.) So in the absence of an available external IIIF server, this plugin probably would not be useful for an independent user with a small budget.

One nice feature that I was not able to try was pointing the external server to the `original` folder of the S3 storage bucket. That would be a really nice feature since it would not require loading the images separately into dedicated storage for the IIIF server separate from what is already being provisioned for Omeka. Unfortunately, we have not yet got that working on the Libraries' Cantaloupe server as it seems to require some custom Ruby coding to implement.

Once the IIIF Toolkit is installed, there are two ways to include IIIF content into Omeka pages. If the Exhibit Builder plugin is enabled, the IIIF Toolkit adds a new kind of content block, "Manifest". Entering an IIIF manifest URL simply displays the contents of that manifest in an embedded Universal Viewer widget on the exhibit page without actually copying any images or metadata into the Omeka database.

The second way to include IIIF content is to make use of an alternate method of importing content that becomes available after the IIIF Tollkit is installed. There are three import types possible to use to import items. I explored importing `Manifest` and `Canvas` types since I had those types of structured data available.

Manifest is the most straightforward because it only requires a manifest URL (commonly available from many sources). But the import was messy and always created a new collection for each item imported. In theory, this could be avoided by selecting an existing collection using the `Parent` dropdown, but that feature never worked for me.

I concluded that importing canvases was the only feasible method. Unfortunately, canvas JSON usually doesn't exist in isolation -- it usually is part of the JSON for an entire manifest. The `From Paste` option is useful if you are capable of the tedious task of searching through the JSON of a whole manifest and copying just the JSON for a single canvas from it. I found it much more useful to just create Python script to generate minimal canvas JSON for an image and save it as a file, which could either be uploaded directly, or pushed to the web and read in through a URL. It gets the pixel dimensions from the image file, with labels and descriptions taken from a CSV file (the IIIF import does not use more information than that). These values are inserted into a JSON canvas template, then saved as a file. The script will loop through an entire directory of files, so it's relatively easy to make canvases for a number of images that were already uploaded using the CSV import function (just copy and paste labels and descriptions from the metadata CSV file). Once the canvases have been generated, either upload them or paste their URLs (if they were pushed to the web) on the IIIF Toolkit Import Items page.

The result of the import is an item similar to those created by direct upload or CSV import -- JPEG size variants are generated and stored and a small amount of metadata present in the canvas is assigned to the title and description metadata fields for the item. The main difference is that the import includes the canvas JSON as part of an Omeka-generated IIIF manifest that can be displayed in an embedded Universal Viewer either as part of an exhibit or on a Simple Pages web page. The viewer also shows up at the bottom of the item page.

Because there is no way to import IIIF items as a batch, nor to import metadata from the canvas beyond the title and description, each item needs to be imported one at a time and the metadata added manually, or added using the Bulk Metadata Editor plugin if possible. This makes uploading many items somewhat impractical. However, for very large images whose detail cannot be seen well in a single image on a screen, the ability to pan and zoom is pretty important. So for some items, like large maps, this tool can be very nice despite the extra work. For a good example, see the panels page from the Omeka exhibit I made for the 2001 Artspace/Lima exhibition. It is best viewed by changing the embedded viewer to full screen.

Entire master plan image. Bassett Associates. “Binder Park Zoo Master Plan (IIIF),” Bassett Associates Archive, accessed August 6, 2023, https://bassettassociates.org/archive/items/show/418. Available under a CC BY 4.0 license.

Maximum zoom level using embedded IIIF Universal Viewer

One thing that should be noted is that like other images associated with Omeka items, image import using the IIIF Toolkit generates size versions of the image. A IIIF import also generates an "original" JPEG version that is much smaller than the pyramidal tiled TIFF uploaded to the IIIF server. This means that it is possible to create items for TIFF images that are larger than the 50 MB recommended above. An example is the Binder Park Master Plan. If you scroll to the bottom of its page and zoom in (above), you will see that an incredible amount of detail is visible because the original TIFF file being used by the IIIF server is huge (347 MB). So using IIIF import is a way to display and make available very large image files that exceed the practical limit of 50 MB discussed above.

Conclusions

Although it took me a long time to figure out how to get all of the pieces to work together, I'm quite satisfied with the Omeka setup I now have running on AWS. I've been uploading works and as of this writing (2023-08-06), I've uploaded 400 items into 36 collections. I also created an Omeka Exhibit for the 2001 exhibition that includes the panels created for the exhibition using an "IIIF Items" block (allows arrowing through all of the panels with pan and zoom), a "works" block (displaying thumbnails for artworks displayed in the exhibition), and a "carousel" block (cycling through photographs of the exhibition). I still need to do more work on the landing page and on styling of the theme. But for now I have an adequate mechanism for exposing some of the images in the collection on a robust hosting system for a total cost of around $10 per month.

Structured Data in Commons and wikibase software tools

2023-04-12T13:45:00.004-07:00

In my last blog post, I described a tool (CommonsTool) that I created for uploading art images to Wikimedia Commons. One of the features of that Python script was to create Structured Data in Commons (SDoC) statements about the artwork that was being uploaded, such as "depicts" (P180) and "main subject" (P921) and "digital representation of" (P6243), necessary to "magically" populate the Commons page with extensive metadata about the artwork from Wikidata. The script also added "created" (P170) and "inception" (P571) statements, which are important for providing the required attribution when the work is under copyright.

Structured Data on Commons "depicts" statements

These properties serve important roles, but one of the key purposes of SDoC is to make it possible for potential users of the media item to find it by providing richer metadata about what is depicted in the media. SDoC depict statements go into the data that is indexed by the Commons search engine, which otherwise is primarily dependent on words present in the filename. My CommonsTool script does write one "depicts" statement (that the image depicts the artwork itself) and that's important for the semantics of understanding what the media item represents. However, from the standpoint of searching, that single depicts statement doesn't add much to improve discovery since the artwork title in Wikidata is probably similar to the filename of the media item -- neither of which necessarily describe what is depicted IN the artwork.

Of course, one can add depicts statements manually, and there are also some tools that can be used to help with the process. But if you aspire to add multiple depicts statements to hundreds or thousands of media items, this could be very tedious and time consuming. If we are clever, we can take advantage of the fact that Structured Data in Commons is actually just another instance of a wikibase. So generally any tools that can make it easier to work with a wikibase can also make it easier to work with Wikimedia Commons

In February, I gave a presentation about using VanderBot (a tool that I wrote to write data to Wikidata) to write to any wikibase. As part of that presentation, I put together some information about how to use VanderBot to write statements to SDoC using the Commons API, and how to use the Wikimedia Commons Query Service (WCQS) to acquire data programatically via Python. In this blog post, I will highlight some of the key points about interacting with Commons as a wikibase and link out to the details required to actually do the interacting.

Media file identifiers (M IDs)

Wikimedia Commons media files are assigned a unique identifier that is analogous to the Q IDs used with Wikidata items. They are known as "M IDs" and they are required to interact with the Commons API or the Wikimedia Commons Query Service programmatically as I will describe below.

It is not particularly straightforward to find the M ID for a media file. The easiest way is probably to find the Concept URI link in the left menu of a Commons page, right-click on the link, and then paste it somewhere. The M ID is the last part of that link. Here's an example: https://commons.wikimedia.org/entity/M113161207 . If the M ID for a media file is known, you can load its page using a URL of this form.

If you are automating the upload process as I described in my last post, CommonsTool records the M ID when it uploads the file. I also have a Python function that can be used to get the M ID from the Commons API using the media filename.

Properties and values in Structured Data on Commons come from Wikidata

Structured Data on Commons does not maintain its own system of properties. It exclusively uses properties from Wikidata, identified by P IDs. Similarly, the values of SDoC statements are nearly always Wikidata items identified by Q IDs (with dates being an exception). So one could generally represent a SDoC statement (subject property value) like this:

MID PID QID.

Captions

Captions are a feature of Commons that allows multilingual captions to be applied to media items. They show up under the "File information" tab.

Although captions can be added or edited using the graphical interface, under the hood the captions are the multilingual labels for the media items in the Commons wikibase. So they can be added or edited as wikibase labels via the Commons API using any tool that can edit wikibases.

Writing statements to the Commons API with VanderBot

VanderBot uses tabular data (spreadsheets) as a data source when it creates statements in a wikibase. One key piece of required information is the Q ID of the subject item that the statements are about and that is generally the first column in the table. When writing to Commons, the subject M ID is substituted for a Q ID in the table.

Statement values for a particular property are placed in one column in the table. Since all of the values in a column are assumed to be for the same property, the P ID doesn't need to be specified as data in the row. VanderBot just needs to know what P ID is associated with that column and that mapping of column with property is made separately. So at a minimum, to write a single kind of statement to Commons (like Depicts), VanderBot needs only two columns of data (one for the M ID and one for the Q ID of the value of the property).

Here is an example of a table with depicts data to be uploaded to Commons by VanderBot:

The qid column contains the subject M ID identifiers (for this media file). The depicts column contains the Q IDs of the values (the things that are depicted in the media item). The other three columns serve the following purposes:

- depicts_label is ignored by the script. It's just a place to put the label of the otherwise opaque Q ID for the depicted item so that a human looking at the spreadsheet has some idea about what's being depicted.

- label_en is the language-tagged caption/wikibase label. VanderBot has an option to either overwrite the existing label in the wikibase with the value in the table or ignore the label column and leave the label in Wikibase the same. In this example, we are not concerning ourselves with editing the captions, so we will use the "ignore" option. But if one wanted to add or update captions, VanderBot could be used for that.

- depicts_uuid stores the unique statement identifier after the statement is created. It is empty for statements that have not yet been uploaded.

I mentioned before that the connection between the property and the column that contains its values was made separately. This mapping is done in a YAML file that describes the columns in the table:

The details of this file structure are given elsewhere, but a few key details are obvious. The depicts_label column is designated as to be ignored. In the properties list, the header for a column is given as value of the variable key, with a value of depicts in this example. That column has item as its value type and P180 as its property.

As a part of the VanderBot workflow, this mapping file is converted into a JSON metadata description file and that file along with the CSV are all that are needed by VanderBot to create the SDoC depicts statements.

If you have used VanderBot to create new items in Wikidata, uploading to Commons is more restrictive than what you are used to. When writing to Wikidata, if the Q ID column for a row in a CSV is empty, Vanderbot will create a new item and if it's not, it edits an existing one. Creating new items directly via the API is not possible in Commons, because new items in the Commons wikibase are only created as a result of media uploads. So when VanderBot interacts with the Commons API, the qid column must contain an existing M ID.

After writing the SDoC statements, they will show up under the "Structured data" tab for the media item, like this:

Notice that the Q IDs for the depicts values have been replaced by their labels.

This is a very abbreviated overview of the process and is intended to make the point that once you have the system set up, all you need to write a large number of SDoC depicts statement is a spreadsheet with column for the M IDs of the media items and a column with the Q IDs of what is depicted in that media item. There are more details with linkouts to how to use VanderBot to write to Commons on a webpage that I made for the Wikibase Working Hour presentation.

Acquiring Structured Data on Commons from the Wikimedia Commons Query Service

A lot of people know about the Wikidata Query Service (WQS), which can be used to query Wikidata using SPARQL. Fewer people know about the Wikimedia Commons Query Service (WCQS) because it's newer and interests a narrower audience. You can access the WCQS at https://commons-query.wikimedia.org/ . It is still under development and is a bit fragile, so it is sometimes down or undergoing maintenance.

If you are working with SDoC, the WCQS is a very effective way to retrieve information about the current state of the structured data. For example, it's a very simple query to discover all media items that depict a particular item, as shown in the example below. There are quite a few examples of queries that you can run to get a feel for how the WCQS might be used.

It is actually quite easy to query the Wikidata Query Service programmatically, but there are additional challenges to using the WCQS because it requires authentication. I have struggled through reading the developer instructions for accessing the WCQS endpoint via Python and the result is functions and example code that you can use to query the WCQS in your Python scripts. One important warning: the authentication is done by setting a cookie on your computer. So you must be careful not to save this cookie in any location that will be exposed, such as in a GitHub repository. Anyone who gets a copy of this cookie can act as if they were you until the cookie is revoked. To avoid this, the script saves the cookie in your home directory by default.

The code for querying is very simple with the functions I provide:

user_agent = 'TestAgent/0.1 (mailto:username@email.com)' # put your own script name and email address here
endpoint_url = 'https://commons-query.wikimedia.org/sparql'
session = init_session(endpoint_url, retrieve_cookie_string())
wcqs = Sparqler(useragent=user_agent, endpoint=endpoint_url, session=session)

query_string = '''PREFIX sdc: <https://commons.wikimedia.org/entity/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT DISTINCT ?depicts WHERE {
  sdc:M113161207 wdt:P180 ?depicts.
  }'''

data = wcqs.query(query_string)
print(json.dumps(data, indent=2))

The query is set in the multi-line string assigned in the line that begins query_string =. One thing to notice is that in WCQS queries, you must define the prefixes wdt: and wd: using PREFIX statements in the query prologue. Those prefixes can be used in WQS queries without making PREFIX statements. In addition, you must define the Commons-specific sdc: prefix and use it with M IDs.

This particular query simply retrieves all of the depicts statements that we created in the example above for M113161207 . The resulting JSON is

[
  {
    "depicts": {
      "type": "uri",
      "value": "http://www.wikidata.org/entity/Q103304813"
    }
  },
  {
    "depicts": {
      "type": "uri",
      "value": "http://www.wikidata.org/entity/Q302"
    }
  },
  {
    "depicts": {
      "type": "uri",
      "value": "http://www.wikidata.org/entity/Q345"
    }
  },
  {
    "depicts": {
      "type": "uri",
      "value": "http://www.wikidata.org/entity/Q40662"
    }
  },
  {
    "depicts": {
      "type": "uri",
      "value": "http://www.wikidata.org/entity/Q235849"
    }
  }
]

The Q IDs can easily be extracted from these results using a list comprehension:

qids = [ item['depicts']['value'].split('/')[-1] for item in data ]

resulting in this list:

['Q103304813', 'Q302', 'Q345', 'Q40662', 'Q235849']

Comparison with the example table shows the same four Q IDs that we wrote to the API, plus the depicts value for the artwork (Q103304813) that was created by CommonsTool when the media file was uploaded. When adding new depicts statements, having this information about the ones that already exist can be critical to avoid creating duplicate statements.

For more details about how the code works, see the informational web page I made for the Wikibase Working Hour presentation.

Conclusion

I hope that this code will help make it possible to ramp up the rate at which we can add depicts statements to Wikimedia Commons media files. In the Vanderbilt Libraries, we are currently experimenting with using Google Cloud Vision to do object detection and we would like to combine that with artwork title analysis to be able to partially automate the process of describing what is depicted in the Vanderbilt Fine Arts Gallery works whose images have been uploaded to Commons. I plan to report on that work in a future post.

CommonsTool: A script for uploading art images to Wikimedia Commons

2022-09-07T05:21:00.003-07:00

A Ghost Painting Coming to Life in the Studio of the Painter Okyō, from the series Yoshitoshi ryakuga (Sketches by Yoshitoshi). 1882 print by Tsukioka Yoshitoshi. Vanderbilt University Fine Arts Gallery 1992.083 via Wikimedia Commons. Wikidata item Q102961245

For several years, I've been working with the Vanderbilt Fine Arts Gallery staff to create and improve Wikidata items for the approximately 7000 works in the Gallery collection through the WikiProject Vanderbilt Fine Arts Gallery. In the past year, I've been focused on creating a Python script to streamline the process of uploading images of Public Domain works in the collection to Wikimedia Commons, where they will be freely available for use. I've just completed work on that script, which I've called CommonsTool, and have used it to upload over 1300 images (covering about 20% of the collection and most of the Public Domain artworks that have been imaged).

In this post, I'll begin by describing some of the issues I dealt with and how they resulted in features of the script. I will conclude by outlining briefly how the script works.

The script is freely available for use and there are detailed instructions on GitHub for configuring and using it. Although it's designed to be usable in contexts other than the Vanderbilt Gallery, it hasn't been tested thoroughly in those circumstances. So if you try using it, I'd like to hear about your experience.

Wikidata, Commons, and structured data

If you have ever worked with editing metadata about art-related media in Wikimedia Commons, you are probably familiar with the various templates used to describe the metadata on the file page using Wiki syntax. Here's an example:

=={{int:filedesc}}==
{{Artwork
|artist             = {{ Creator | Wikidata = Q3695975 | Option = {{{1|}}} }}
|title              = {{en|'''Lake George'''.}}
|description        = {{en|1=Lake George, painting by David Johnson}}
|depicted people    =
|depicted place     =
|date               =
|medium             = {{technique|oil|canvas}}
|dimensions         = {{Size|in|24.5|19.5}}
|institution        = {{Institution:Vanderbilt University Fine Arts Gallery}}
|references         = {{cite web |title=Lake George |url=https://library.artstor.org/#/asset/26754443 |accessdate=30 November 2020}}
|source             = Vanderbilt University Fine Arts Gallery
|other_fields       =
}}

=={{int:license-header}}==
{{PD-Art|PD-old-100-expired}}

[[Category:Vanderbilt University Fine Arts Gallery]]

These templates are complicated to create and difficult to edit by automated means. In recognition of this, the Commons community has been moving towards storing metadata about the media files as structured data ("Structured Data on Commons", SDC). When media files depict artwork, the preference is to describe the artwork metadata in Wikidata rather than as wikitext on the Commons file page (as shown in the example above).

In July, Sandra Fauconnier gave a presentation at an ARLIS/NA (Art Libraries Society of North America) Wikidata group meeting that was extremely helpful for improving my understanding of the best practices for expressing metadata about visual artworks in Wikimedia Commons. She provided a link to a very useful reference page (still under construction as of September 2022) to which I referred while working on my script.

The CommonsTool script has been designed around two key features for simplifying management of the media and artwork metadata. The first is two very simple wikitexts: one for two-dimensional artwork and another for three-dimensional artwork. The 2D wikitext looks like this:

=={{int:filedesc}}==
{{Artwork
|source = Vanderbilt University
}}

=={{int:license-header}}==
{{PD-Art|PD-old-100-expired}}

[[Category:Vanderbilt University Fine Arts Gallery]]

and the 3D wikitext looks like this:

=={{int:filedesc}}==
{{Art Photo
|artwork license = {{PD-old-100-expired}}
|photo license = {{Cc-by-4.0 |1=photo © [https://www.vanderbilt.edu/ Vanderbilt University] / [https://www.library.vanderbilt.edu/gallery/ Fine Arts Gallery] / [https://creativecommons.org/licenses/by/4.0/ CC BY 4.0]}}
}}

[[Category:Vanderbilt University Fine Arts Gallery]]

By comparison with the wikitext in the first example, this is clearly much simpler, but also has the advantage that there is very little metadata in the wikitext itself that might need to be updated.

The second key feature involves using SDC to link the media file to the Wikidata item for the artwork. Here's an example for the work shown at the top of this post:

In order for this strategy to work, for all artwork images the depicts (P180) and main subject (P921) values must be set to the artwork's Wikidata item (in this case Q102961245). Two dimensional artwork images should also have a "digital representation of" (P6243) value with the artwork's Wikidata item. When these claims are created, the Wikidata metadata will "magically" populate the file information summary without entering it into a wikitext template.

The great advantage here is that when metadata are updated on Wikidata, they automatically are updated in Commons as well.

Copyright and licensing issues

One of the complicating issues that had slowed me down in developing the script was to figure out how to handle copyright and licensing issues. The images we are uploading depict old artwork that is out of copyright, but what about copyright of the images of the artwork? The Wikimedia Foundation takes the position that faithful photographic reproductions of old two-dimensional artwork lack originality and are therefore not subject to copyright. However, images of three-dimensional works can involve creativity, so those images must be usable under an open license acceptable for Commons uploads.

Wikitext tags

Unlike other metadata properties about a media item, the copyright and licensing details cannot (as of September 2022) be expressed only in SDC. They must be explicitly included in the file page's wikitext.

As shown in the example above, I used the license tags

for 2D artwork. The PD-Art tag asserts that the image is not copyrightable for the reason given above and PD-old-100-expired asserts that the artwork is not under copyright because it is old. When these tags are used together, they are rendered on the file page like this:

The example above for 3D artworks uses separate license tags for the artwork and the photo. The artwork license is PD-old-100-expired as before, and the photo license I used was

{{Cc-by-4.0 |1=photo © [https://www.vanderbilt.edu/ Vanderbilt University] / [https://www.library.vanderbilt.edu/gallery/ Fine Arts Gallery] / [https://creativecommons.org/licenses/by/4.0/ CC BY 4.0]}}

There are a number of possible licenses that can be used for both the photo and artwork and they can be set in the CommonsTool configuration file. Since the CC BY license requires attribution, I used the explicit credit line feature to make clear that it's the photo (not the artwork) that's under copyright and to provide links to Vanderbilt University (the copyright holder) and the Fine Arts Gallery. Here's how these tags are rendered on the file page of an image of a 3D artwork:

Using the format

{{Art Photo
|artwork license = {{artLicenseTag}}
|photo license = {{photoLicenseTag}}
}}

in the wikitext is great because it creates separate boxes that clarify that the permissions for the artwork are distinct from the permissions for the photo of the artwork.

Structured data about licensing

As noted previously, it's required to include copyright and licensing information in the page wikitext. However, file pages must also have certain structured data claims related to the file creator, copyright, and licensing or they will be flagged.

In the case of 2D images where the PD-Art tag was used, there should be a "digital representation of" (P6243) claim where the value is the Q ID of the Wikidata item depicted in the media file.

In the case of 3D images, they should not have a P6243 claim, but should have values for copyright status (P6216) and copyright license (P275). If under copyright, they should also have values for creator (P170, i.e. photographer) and inception (P571) date so that it can be determined to whom attribution should be given and when the copyright may expire. Keep in mind that for artwork SDC metadata is generally about the media file and not the depicted thing. So similar information about the depicted artwork would be expressed in the Wikidata item about the artwork, not in SDC.

Although not required when the PD-Art tag is used, it's a good idea to include the creator (photographer) and inception date of the image in the SDC metadata for 2D works. It's not yet clear to me whether a copyright status value should be provided. I suppose so, but if it's directly asserted in the SDC that the work is in the Public Domain, you are supposed to use a qualifier to indicate the reason, and I'm not sure what value would be used for that. I haven't seen any examples illustrating how to do that, so for now, I've omitted it.

To see examples of how this looks in practice see this example for 2D and this example for 3D. After the page loads, click on the Structured Data tab below the image.

What the script does: the Commons upload

The Commons upload takes place in three stages.

First, CommonsTool acquires necessary information about the artwork and the image from CSV tables. One key piece of information is what image or images to be uploaded to Commons are associated with a particular artwork (represented by a single Wikidata item). The main link from Commons to Wikidata is made using a depicts (P180) claim in the SDC and the link from Wikidata to Commons is made using an image (P18) claim.

Miriam by Anselm Feuerbach. Public Domain via Wikimedia Commons

It is important to know whether there are more than one image associated with the artwork. In the source CSV data about images, the image to be linked from Wikidata is designated as "primary" and additional images are designated as "secondary".

Both primary and secondary images will be linked from Commons to Wikidata using a depicts (P180) claim, but it's probably best for only the primary image to be linked from Wikidata using an image (P18) claim. Here is an example of a primary image page in Commons and here is an example of a secondary image page in Commons. Notice that the Wikidata page for the artwork only displays the primary image.

The CommonsTool script also constructs a descriptive Commons filename for the image using the Wikidata label, any sub-label particular to one of multiple images, the institution name, and the unique local filename. There are a number of characters that aren't allowed, so CommonsTool tries to find them and replace them with valid characters.

The script also performs a number of optional screens based on copyright status and file size. It can skip images deemed to be too small and will also skip images whose file size exceeds the API limit of 100 Mb. (See the configuration file for more details.)

The second stage is to upload the media file and the file page wikitext via the Commons API. Commons guidelines state that the rate of file upload should not be greater than one upload per 5 seconds, so the script introduces a delay of necessary to avoid exceeding this rate. If successful, the script moves on to the third stage and if not, it logs an error and moves to the next media item.

In the third stage, SDC claims are written to the API in a manner similar to how claims are written to Wikidata. The claims upload function respects the maxlag errors from the server and delays the upload if the server is lagged due to high usage (although this rarely seems to happen). If the SDC upload fails, it logs an error, but the script continues in order to record the results of the media upload in the existing uploads CSV file.

The links from the Commons image(s) to Wikidata are made using SDC statements, which results in a hyperlink in the file summary (the tiny Wikidata flag). However, the link in the other direction doesn't get made by CommonsTool.

The CSV file where existing uploads are recorded contains an image_name column and the primary values for "primary" images in that column can be used as values for the image (P18) property on the corresponding Wikidata artwork item page. After creating that claim, the primary image will be displayed on the artwork's Wikidata page:

Making this link manually can be tedious, so there is a script that will automatically transfer these values into the appropriate column of a CSV file that is set up to be used by the VanderBot script to upload data to Wikidata. In production, I have a shell script that runs CommonsTool, then the transfer script, followed by VanderBot. Once that shell script has finished running, the image claim will be present on the appropriate Wikidata page.

International Image Interoperability Framework (IIIF) functions

One of our goals at the Vanderbilt Libraries (of which the Fine Arts Gallery is part) is to develop the infrastructure to support serving images using the International Image Interoperability Framework (IIIF). To that end, we've set up a Cantaloupe image server on Amazon Web Services (AWS). The setup details are way beyond the scope of this web post, but now that we have this capability, we want to make the images that we've uploaded to Commons also be available as zoomable high-resolution images via our IIIF server.

For that reason, the CommonsTool script also has the capacity to upload images to the IIIF server storage (an AWS bucket) and to generate manifests that can be used to view those images. The IIIF functionalities are independent of the Commons upload capabilities -- either can be turned on or off. However, for my workflow, I do the IIIF functions immediately after the Commons upload so that I can use the results in Wikidata as I'll describe later.

Source images

One of the early things that I learned when experimenting with the server is that you don't want to upload large, raw TIFF files (i.e. greater than 10 MB). When a IIIF viewer tries to display such a file, it has to load the whole file, even if the screen area is much smaller that the entire TIFF would be if displayed at full resolution. This takes an incredibly long time, making viewing of the files very annoying. The solution to this is to convert the TIFF files into tiled pyramidal TIFFs.

When I view one of these files using Preview on my Mac, it becomes apparent why they are called "pyramidal". The TIFF file doesn't contain a single image. Rather, it contains a series of images that are increasingly small. If I click on the largest of the images (number 1), I see this:

and if I click on a smaller version (number 3), I see this:

If you think of the images as being stacked with the smaller ones on top of the larger ones, you can envision a pyramid.

When a client application requests an image from the IIIF server, the server looks through the images in the pyramid to find the smallest one that will fill up the viewer and sends that. If the viewer zooms in on the image, requiring greater resolution, the server will not send all of the next larger image. Since the images in the stack are tiled, it will only send the particular tiles from the larger, higher resolution image that will actually be seen in the viewer. The end result is that the tiled pyramidal TIFFs load much faster because the IIIF server is smart and doesn't send any more information than is necessary to display what the user wants to see.

The problem that I faced was how to automate the process of generating a large number of these tiled pyramidal TIFFs. After thrashing with various Python libraries, I finally ended up using the command line tool ImageMagick and calling it from a Python script using the os.system() function. The script I used is available on GitHub.

Because the Fine Arts Gallery has been working on imaging their collection for over 20 years, the source images that I'm using are in a variety of formats and sizes (hence the optional size screening criteria in the script to filter out images that have too low resolution). The newer images are high resolution TIFFs, but many of the older images are JPEGs or PNGs. So one task of the IIIF server upload part of the CommonsTool script is to sort out whether to pull the files from the directory where the pyramidal TIFFs are stored, or the directory where the original images are stored.

Once the location of the correct images are identified, the script uses the boto3 module (the AWS software development kit or SDK), to initiate the upload to the S3 bucket as part of the Python script. I won't go into the details of setting up and using credentials as that is described well in the AWS documentation.

Once the file is uploaded, it can be directly accessed using a URL constructed according to the IIIF Image API standard. Here's a URL you can play with:

https://iiif.library.vanderbilt.edu/iiif/3/gallery%2F1992%2F1992.083.tif/full/!400,400/0/default.jpg

If you adjust the URL (for example replacing the 400s with different numbers) according to the API 2.0 URL patterns, you can make the image display at different sizes directly in the browser.

IIIF manifests

The real reason for making images available through a IIIF server is to display them in a viewer application. One such application is Mirador. A IIIF viewer uses a manifest to understand how the image or set of images should be displayed. CommonsTool generates very simple IIIF manifests that display each image in a separate canvas, along with basic metadata about the artwork. To see what the manifest looks like for the image at the top of this post, go to this link.

IIIF manifests are written in machine-readable Javascript Object Notation (JSON), so they are not intended to be understood by humans. However, when the manifest is consumed by a viewer application, a human can use controls such as pan, zoom, and buttons to manipulate the image or to move to another canvas that displays a different image. The Mirador project provides an online IIIF viewer that can be used to view images described by a manifest. This link will display the manifest from above in the Mirador online viewer.

One nice thing about providing a IIIF manifest is that it allows multiple images of the same work to be viewed in the same viewer. For example, there might be multiple pages of a book, or the front and back sides of a sculpture. I'm still learning about constructing IIIF manifests, so I haven't done anything fancy yet with respect to generating IIIF manifests in the CommonsTool script. However, the script does generate a single manifest describing all of the images depicting the same artwork. The image designated as "primary" is shown in the initial view and any other images designated as "secondary" are shown in other canvases that can be selected using the viewer display options or be viewed sequentially using the buttons at the bottom of the viewer. Here is an example showing how the manifest for the primary and secondary images in an earlier example put the front and back images of a manuscript page in the same viewer window.

IIIF in Wikidata

Wikidata has a property "IIIF manifest" (P6108) that allows an item to be linked to a IIIF manifest that displays depictions of that item. The file where existing uploads are recorded includes a iiif_manifest column that contains the manifest URLs for the works depicted by the images.

Those values can be used to create IIIF manifest (P6108) claims for an item in Wikidata:

Because doing this manually would be tedious, the iiif_manifest values can be automatically transferred to a VanderBot-compatable CSV file using the same transfer script used to transfer the image_name.

In itself, adding a IIIF manifest claim isn't very exciting. However, Wikidata supports a user script that will display an embedded Mirador viewer anytime an item has a value for P6108. (For details on how to install that script, see this post.) With the viewer enabled, opening a Wikidata page for a Fine Arts Gallery item with images will display the viewer at the top of the page and a user can zoom in or use the buttons at the bottom to move to another image of the same artwork.

This is really nice because if only the primary image is linked using the image property, users would not necessarily know that there are other images of the object in Commons. But with the embedded viewer, the user can flip through all of the images of the item that are in Commons using the display features of the viewer, such as thumbnails.

Using the script

Although I wrote this script primarily to serve my own purposes, I tried to make it clean and customizable enough that someone with moderate computer skills should also be able to use it. The only installation requirements are Python and several modules that aren't included in the standard library. It should not generally be necessary to modify the script to use it -- most customizing should be possible by changing the configuration file.

If the script is only used to write files to Commons, it's operation is pretty straightforward. If you want to combine uploading image files to Commons with writing the image_names and iiif_manifest values to Wikidata, it's more complicated. You need to get the transfer_to_vanderbot.py script working and then learn how to operate VanderBot. There are details instructions, videos, etc. to do that on the VanderBot landing page.

What's next?

There are still a few more Fine Arts Gallery images that I need to upload after doing some file conversions, checking out some copyright statuses, and wranging some data for multiple files that depict the same work. However, I'm quite excited about developing better IIIF manifests that will make it possible to view related works in the same viewer. Having so many images in Commons now also makes it possible to see the real breadth of the collection by viewing the Listeria visualizations on the tabs of the WikiProject Vanderbilt Fine Arts Gallery website. I hope soon to create more fun SPARQL-based visualizations to add to those already on the website landing page.

Making SPARQL queries to Wikidata using Python

2022-06-11T10:23:00.000-07:00

"Welding sparkles" by Dhivya dhivi DJ, CC BY-SA 4.0, via Wikimedia Commons

Background

This is actually a sort of followup post to my most popular blog post: "Getting Data Out of Wikidata using Software", which has had about 6.5K views since 2019. That post was focused on the variety of query forms you could use and talked a lot about using Javascript to build web pages that acquired data from Wikidata dynamically. However, I did provide a link to some Python code, which included the line

r = requests.get(endpointUrl, params={'query': query}, headers={'Accept': 'application/sparql-results+json'})

for making the actual query to the Wikidata Query Service via HTTP GET.

Since that time, I've used some variation on that code in dozens of Python scripts that I've written to grab data from Wikidata. In the process, I experienced some frustration when things did not behave as I had expected and when I got unexpected errors from the API. My goal for this post is to describe some of those problems and how I solved them. I'll also provide a link to the "Sparqler" Python class that I wrote to make querying simpler and more reliable, along with some examples of how to use it to do several types of queries.

Note: SPARQL keywords are case insensitive. Although you often see them written in ALL CAPS in examples, I'm generally too lazy to do that and tend to use lower case, as you'll see in most of the examples below.

The Sparqler class

For those of you who don't care about the technical details, I'll cut right to the chase and tell you how to make queries to Wikidata using the code. You can access the code in GitHub here. I should note that the code is general-purpose and can be used with any SPARQL 1.1 compliant endpoint, not just the Wikidata Query Service (WDQS). This includes Wikibase instances and installations of Blazegraph, Fuseki, Neptune, etc. The code also supports SPARQL Update for loading data into a triplestore, but that's the topic of another post.

To use the code, you need to import three modules: datetime, time, and requests. The requests module isn't included in the standard Python distribution, so you may need to install it with PIP if you haven't already. If you are using Jupyter notebooks through Anaconda, or Colab notebooks, requests will probably already be installed. Copy the code from "class Sparqler:" through just before the "Body of script" comment near the bottom of the file, and paste it near the top of your script.

To test the code, you can run the entire script, which includes code at the end with an example of how to use the script. If you only run it once or twice, you can use the code as-is. However, if you make more than a few queries, you'll need to change the user_agent string from the example I gave to your own. You can read about that in the next section.

The body of the script has four main parts. Lines 238 through 256 create a value for the text query_string that gets sent to the WDQS endpoint. Lines 259 and 260 instantiate a Sparqler object called wdqs. Line 261 sends the query string that you created to the endpoint and returns the SELECT query results as a list of dictionaries called data. The remaining lines check for errors and display the results as pretty JSON (the reason for importing the json module at the top of the script). If you want to see the query_string as constructed or the raw response text from the endpoint, you can uncomment lines 257 and 266.

Here's what the response looks like:

[
{
    "item": {
      "type": "uri",
      "value": "http://www.wikidata.org/entity/Q102949359"
    },
    "label": {
      "xml:lang": "en",
      "type": "literal",
      "value": "\"I Hate You For Hitting My Mother,\" Minneapolis"
    }
},
{
    "item": {
      "type": "uri",
      "value": "http://www.wikidata.org/entity/Q102961315"
    },
    "label": {
      "xml:lang": "en",
      "type": "literal",
      "value": "A Picture from an Outline of Women's Manners - The Wedding Ceremony"
    }
},
{
    "item": {
      "type": "uri",
      "value": "http://www.wikidata.org/entity/Q1399"
    },
    "label": {
      "xml:lang": "en",
      "type": "literal",
      "value": "Niccol\u00f2 Machiavelli"
    }
}
]

It's in the standard SPARQL 1.1 JSON results format, so if you write code to extract the results from the data list of dictionaries, you can use it with the results of any query.

Features of the code

For those of you who are interested in knowing more about the code and the rationale behind it, read the following sections. If you just want to try it out, skip to the "Options for querying" section.

The User-Agent string

Often applications that request data from an API are requested to identify themselves as in indication that they aren't bad actors and to allow the API maintainers to contact the developers if the application is doing something the API maintainers don't like. In the case of the Wikimedia Foundation, they have adopted a User-Agent policy that requires that an HTTP User-Agent header be sent with all requests to their servers. This policy is not universally enforced, and I'm not sure whether it's enforced at all for the WDQS, but if you are writing a script that is making repeated queries at a high rate of speed, you should definitely supply a User-Agent header that identifies your application (and you) in the event that it is suspected to be a denial of service attack.

The details of what they would like developers to include in the string are given on the policy page, but the TLDR is that you should have a name for the "application" (your script) and either your email address or the URL of a page that describes your project. The value given in lines 259 and 260 of the body of the script for the user_agent variable can be used as a template. When instantiating the Sparqler object, the string MUST be passed in as the value of the useragent argument if the endpoint URL given as the value of the endpoint argument is https://query.wikidata.org/sparql (the default if no endpoint argument is given). If you don't provide one, the script will exit.

The sleep argument

When you create a Sparqler object, you can choose to supply a value (in seconds) for the sleep argument. If none is supplied, it defaults to 0.1 s. Each time a query is made, the script pauses execution for the length of time specified. The rationale for the default of 0.1 s for the WDQS is similar to the previous section -- you don't want the WDQS operators to think you are a bad actor if you are hitting the endpoint repeatedly without delay. If you are reading from a localhost endpoint, you can set the value of sleep to zero.

While I'm on the topic of being a courteous WDQS user, I would like to point out that often repetitive querying can be avoided if you use a "smarter" query. In the example code, I wanted to discover the Q IDs of three labels. I could have inserted the label value in the query as a literal in the position of ?value, e.g.

?item rdfs:label|skos:altLabel "尼可罗·马基亚维利"@zh.

then put the .query() method inside a loop that runs three times. However, instead in the script I used a loop to create an VALUES clause to enumerate the possible values of ?value . I still get the same information, but using the VALUES method only requires one interaction with the Query Service instead of three. For a small number like this, it's not that important, but I've sent queries with hundreds or thousands of values, and there the difference is significant.

GET vs. POST

This brings me to another important thing that I learned the hard way about interacting with SPARQL endpoints programmatically. If you drill down in the SPARQL 1.1 Protocol specification (which I doubt that anyone but me typically does!), you'll see that there are three options for sending queries via HTTP: one using GET and two using POST. When I first started running queries from scripts, I tended to use the GET method because it seemed simpler -- after URL-encoding the query just gets attached to the end of the URL as the value of a query parameter. However, what I discovered once I started making really long queries (like the one I previously described with thousands of VALUES) was that you can fairly easily exceed the length limits of a URL allowed by the server (something in the neighborhood of 5K to 15K characters). Once I discovered that, I switched to using POST since the query is passed as the message body and therefore has no particular length limit.

So why would you ever need to use GET? In some cases, a SPARQL endpoint will only support GET requests because the endpoint is read-only. In cases where a SPARQL service supports both Query and Update, a quick-and-dirty way to restrict writing to the triplestore using Update (which must be done using POST) is to disallow any un-authenticated POST requests. Another case is services like AWS Neptune that have separate read-only endpoints whose access is separate from the endpoint that supports writing. A read-only endpoint would only support GET requests.

For these reasons, you can specify that the Sparqler object use GET by providing a value of "get" for the method argument. Otherwise it defaults to POST.

UTF-8 support

If the literals that you are using only contain Latin characters, it doesn't really matter that much how you do the querying. However, a lot of projects I work on either involve languages with non-Latin character sets, or include characters with diacritics that aren't in the ASCII character set. Despite my best efforts to enforce UTF-8 encoding everywhere, I was still having queries that would fail to match labels in Wikidata that I knew should match. After wasting a bunch of time troubleshooting, I finally figured out the fix.

As I mentioned earlier, the SPARQL 1.1 Protocol Recommendation provides two ways to send queries. The simplest one is to just send the query as text without URL-encoding as the message body. That's awesome for testing because you can just paste a query into the message text box of Postman and if you use the right Content-Type header, you can send the query with the click of a button. I assumed that as long as the text was all UTF-8, I would be fine. However, using this option was actually the cause of the problems I was having with match failures. When I switched to the other POST method (which URL-encodes the query string), my matching problems disappeared. For that reason, my script only uses the "query via URL-encoded POST" option.

", ', and """ quoting for literals

I learned early on that in SPARQL you can use either double or single quotes for literals in queries. That's nice, because if you have a string containing a single quote like "don't", you can enclose it in double quotes and a string containing double quotes like 'say "hi" for me', you can enclose it in single quotes. But what if you have 'Mother said "don't forget to brush your teeth" to me.', which contains both double and single quotes? Also, in the situation of inserting strings into the query string using variables, you can 't know in advance what kind or kinds of quotes a string might contain.

This problem frustrated me for quite some time and I experimented with checking strings for both kinds of quotes, replacing double quotes with singles, escaping quotes in various ways, but none of these approaches worked and my scripts kept crashing because of quote mismatches.

Finally, I resorted to (you guessed it) reading the SPARQL 1.1. Query specification and there was the obvious (to Python users) answer in section 4.1.2: enclose the literals in sets of three quotes. I don't know why I didn't think of trying that. Note that in line 246 of the script, triple single-quotes are used to enclose the literals. Thus the script can handle both of the example English strings: the one with double quotes around the first part of the label and the label that includes "Women's" with an apostrophe.

After solving the quote and UTF-8 problems, my scripts now reliably handle literals that contain any UTF-8 characters.

Options for querying

The query in the code example uses the SELECT query form. This is probably the most common type of SPARQL query, but others are possible and Sparqler objects support any query form option. Depending on the chosen form of the query, there are also several possible response formats. Since we are talking about Python here, the most convenient response format is JSON, since it can easily be converted into a complex Python data structure. But in some situations, another format may be more convenient.

Query form

The query form is specified using the form keyword argument of the .query() method. It may seem a bit strange to specify the query form as an argument of the method when the query form is determined by the text of the query itself, but doing so allows the script to control the default format of the response and whether the raw response is processed prior to being returned from the method. For SELECT and ASK, the default response serialization is set to JSON. For the DESCRIBE and CONSTRUCT query forms that return graphs, the default serialization is Turtle.

SELECT

The default query form is SELECT, so it isn't necessary to provide a form argument to use it. That's convenient, since it's probably the most commonly used form. The raw JSON response (which you can view as the value of the .response attribute of the Sparqler object, e.g. wdqs.response) from the endpoint is structured in a more complicated way that is required to just get the results of the query. The results list is actually the value of a bindings key that forms an object that's the value of a results key, like this:

{
"head" : {
    "vars" : [ "item", "label" ]
},
"results" : {
    "bindings" : [ {
      "item" : {
        "type" : "uri",
        "value" : "http://www.wikidata.org/entity/Q102949359"
      },
...
      }
    } ]
}
}

For convenience, when handling SELECT queries with the default JSON serialization, the script converts the raw JSON to a complex Python data object, then extracts the results list that's nested as the value of the bindings key and returns that as the value of the .query() method. That produces the result shown in the example shown earlier in the post.

Here's an example that prints Douglas Adams' (Q42) name in all available languages:

query_string = '''select distinct ?label ?language where {
wd:Q42 rdfs:label ?label.
bind ( lang(?label) AS ?language )
}
order by ?language'''
names = wdqs.query(query_string)
for name in names:
print(name['language']['value'], name['label']['value'])

The loop iterates through all of the items in the results list and pulls the value for each variable. This structure: item['variableName']['value'] is consistent for all SELECT queries where variableName is the string you used for that variable in the query (e.g. ?variableName).

ASK

When the ASK query form is chosen, the result is a true or false, so the raw response is processed to return a Python boolean as the response value. That allows you to directly control program flow based on whether a particular graph pattern has any solutions, like this:

label_string = '尼可罗·马基亚维利'
language = 'zh'

query_string = '''ask where {
?entity rdfs:label """'''+ label_string + '"""@' + language + '''.
      }'''

if wdqs.query(query_string, form='ask'):
    print(label_string, 'is in Wikidata')
else:
    print('Could not find', label_string, 'in Wikidata')

I use this kind of query to check whether label/description combinations that I plan to use for new Wikidata items have already been used. If you try to create a new item that has the same label and description as an existing item, the Wikidata API will return an error message and refuse to create the item. So it's better to query ahead of time so that you can change either the label or description to make it unique. Here's some code that will perform that check for you:

label_string = 'Italian Lake Scene'
description_string = 'painting by Artist Unknown'

query_string = '''ask where {
?item rdfs:label """'''+ label_string + '''"""@en.
?item schema:description """'''+ description_string + '''"""@en.
      }'''

if wdqs.query(query_string, form='ask'):
    print('There is already an item in Wikidata with')
    print('label:', label_string)
    print('description:', description_string)
    print('The label or description must be changed before uploading.')

DESCRIBE

The DESCRIBE query form is probably the least commonly used SPARQL query form. Its behavior is somewhat dependent on the implementation. Blazegraph, which is the application that underlies the WDQS, returns all of the triples that include the resource that is the solution to the query. The simplest kind of DESCRIBE query just specifies the IRI of the resource to be described. Here's an example that will return all of the triples that provide some kind of information about Douglas Adams (Q42):

query_string = 'describe wd:Q42'
description = wdqs.query(query_string, form='describe')

description is a string containing the triples in Turtle serialization. That string could be saved as a file and loaded into an application that knows how to parse Turtle.

CONSTRUCT

CONSTRUCT queries are similar to DESCRIBE in that they produce triples. The triples are those that conform to a graph pattern that you specify. For example, this query will produce all of the triples (serialized as Turtle) that are direct claims about Douglas Adams.

query_string = '''construct {wd:Q42 ?p ?o.} where {
wd:Q42 ?p ?o.
?prop wikibase:directClaim ?p.
}'''
triples = wdqs.query(query_string, form='construct')
print(triples)

This might be useful to you if you want to load just those triples into a triplestore.

Response formats

Because of the ease with which JSON can be converted directly to an analogously structured complex Python data object, Sparqler objects default to JSON as the response format for SELECT queries. For the two query forms that return triples (DESCRIBE and CONSTRUCT), the default is Turtle. ASK defaults to JSON, from which a Python boolean is extracted. However, these response formats can be overridden using the mediatype keyword argument in the .query() method if desired.

The mediatype argument for some other possible response formats for SELECT are:

application/sparql-results+xml for XML

text/csv for CSV tabular data

For non-JSON response serializations, the return value of the .query() method is the raw text from the endpoint. That may be useful if you want to save the XML for use with some XML processing language like XQuery. It also makes it super simple to save the output as a CSV file with a few lines of code, like this:

data = sve.query(query_string, mediatype='text/csv')
with open('graph_dump.csv', 'wt', encoding='utf-8') as file_object:
file_object.write(data)

Triple output from DESCRIBE and CONSTRUCT can be serialized in other formats using these values of the mediatype argument:

application/rdf+xml for XML

application/n-triples for N-Triples

Monitoring the status of the query

The verbose keyword argument can be used to control whether you get printed feedback to monitor the status of the query. A False value (the default) suppresses printing. Supplying a True value prints a notification that the query has been requested and another when a response has been received, including the time to complete the query. This may be helpful during debugging or if the queries take a long time to execute. For small, routine queries, you probably want to turn this off. Note: the second notification takes place after the sleep delay, so the reported response time includes that delay.

FROM and FROM NAMED

The SPARQL 1.1 Protocol specification provides a mechanism for specifying graphs to be included in the default graph using a request parameter rather than by using the FROM and FROM NAMED keywords in the text of the query itself. Sparqler supports this mechanism through the default and named arguments. Given that this is an advanced feature and that the WDQS triplestore does not have named graphs, I won't say more about this here. However, I'm planning talk about this feature in a future post about the Vanderbilt Libraries' new Neptune triplestore. For more details, see the doc strings in the code.

Detecting errors

Detecting errors depends on how errors are reported by the SPARQL query service. In the case of Blazegraph (the service on which the WDQS is based), errors are reported as unformatted text in the response body. This is not the case with every SPARQL service -- they may report errors by some different mechanism, such as a log that must be checked.

Because the main use cases of the Sparqler class are SELECT and ASK queries to the WDQS, errors can be detected by checking whether the results are JSON or not (assuming the default JSON response format is used). When SELECT queries return JSON, the code tries to convert the response from JSON to a Python object. If it fails, it returns a None object. You can then detect a failed query by checking whether the value is None and if it is, you can try to parse out the error message string (provided as the value of the .response attribute of the Sparqler object, e.g. wdqs.response), or just print it for the user to see. Here is an example:

query_string = '''select distinct ?p ? where {
wd:Q42 ?p ?o.
}
limit 3'''
data = wdqs.query(query_string)
if data is None:
print(wdqs.response)
else:
print(data)

The example intentionally omits the name of the second variable (?o) to cause the query to be malformed. If you run this query, None will be returned as the value of data, and the error message will be printed. If you add the missing "o" after the question mark and re-run the query, you should get the query results.

Note that this mechanism detects actual errors and not a negative query result. For example, a select query with no matches will return an empty list ([]), which a negative result, not an error. The same is true for ASK queries that evaluate as False when there are no matches. That's why the code is written "if data is None:" rather than "if data:", which would evaluate as True if there were matches (non-empty list or True value) but as False for a either an error (a value of None) or no matches (an empty list or False value). The point is that "no matches" result should be handled differently than an error in your code, and that's why the code if data is None: is used.

For other query forms (DESCRIBE and CONSTRUCT) and response formats other than JSON, the .query() method simply returns the response text. So I leave it to you to figure out how to differentiate between errors and valid responses (maybe search for "ExecutionException" in the response string?).

SPARQL Update support

The Sparqler class supports changing graphs in the triplestore using SPARQL Update if the SPARQL service supports that. This is done using the .update() method and two more specific types of Update operations: .load() and .drop() . However, since changes to the data available on the WSQS triplestore must be made through the Wikidata API and not through SPARQL Update, I won't discuss these features in this post. I'm planning to describe them in more detail in an upcoming post where I talk about our Neptune triplestore. Until then, you can look at the doc strings in the code for details.

Birding in Puerto Rico

2022-03-16T19:44:00.008-07:00

Pearly-eyed Thrasher - Bosque Estatal de Guánica, Puerto Rico

NOTE: this information was accurate as of our trip in mid-March of 2022. It will undoubtedly change as time goes by.

Having just completed a week-long vacation in Puerto Rico focused primarily on bird-watching, I wanted to share some observations that might be helpful for others planning to do the same. Please note that we aren't top level birders who were focused on seeing every endemic species -- we just wanted to have fun seeing a variety of cool new birds. So that perspective influences my comments.

The Book

If you have been researching places to bird in PR, you have undoubtedly found out about "A Birdwatchers' Guitd to Cuba, Jamaica, Hispaniola, Puerto Rico, and the Cayans", by Kirwan, Kirkconnell, and Flieg. We purchased this book and it was helpful for deciding places to go and for some ideas about what we were likely to see at different locations. However, the edition of the book we bought (copyright 2010 and I think the most recent) is hopelessly outdated and therefore much of the information is useless.

There are several ways that the book was dated. It spends a lot of time explaining particular hotels where you might want to stay and gives descriptive text (go x miles, turn right on road so-and-so) describing how to get to the sites and hotels. In 2022 you'd be much better off getting an AirBNB than using the outdated hotel information. They are available all over the island for half the cost of the hotels and the two we stayed in were clean, safe, and had friendly and helpful hosts. There was also no point in trying to follow the text descriptions. For example: take "the beach road" past some mangrove trees -- which road was the beach road and which of the many mangroves were the right ones? The hand-drawn maps also usually did not seem to bear much resemblance to reality. Thankfully, I had used Google Maps to locate the preserves we visited in advance and save the locations. We were then able to drive directly to them using Google Maps on our phone. (I've included coordinates and links in the text below.) Another problem with the book was that some of the information about facilities was out of date, so we ended up discovering the actual situation (usually: closed) by arriving and finding out in person. The last deficiency (in my opinion) is that the book is super-focused on the birder who MUST see every endemic, so about a third of the text is devoted to how to see three or four of the most difficult birds, which was not our primary concern. So for "normal" birders like us, getting this book was helpful for thinking about where to go and for knowing likely birds to see, but that was about it.

General observations

If you have birded in a place like Costa Rica with a well-developed ecotourism industry, you will find Puerto Rico somewhat disappointing. Thankfully, PR does have a significant number of protected areas that are publicly accessible, but don't expect much in the way of signage, interpretation, or knowledgeable rangers or local guides. It became almost a joke with us that nearly every vistors' center and developed bathroom was closed and locked. This may be partly due to lingering effects of the hurricanes a few years ago and also the government fiscal crisis, but the bottom line is: bring your own toilet paper and use bathrooms whenever you have the opportunity. The main exception to this was the shiny new National Forest Service visitors' center in El Yunque, which I'll describe in more detail later.

Getting around is relatively easy if you rent a car. Nearly all of the roads we drove on were paved, although you can expect some of them to be pretty narrow and on some roads potholes were abundant. With the exception of Rio Abajo State Forest, we had at least one bar of cell phone coverage almost everywhere, so using Google Maps is quite feasible for navigation. Gas stations are not very abundant off the main roads, so it's probably advisable to keep your tank at least half full, although the distances are not far so you can easily visit the more remote places without worrying about running out of gas.

As I noted, places being closed can be a significant issue, particularly since some of the best birding is early in the morning or near sunset. So places that have locked gates are an issue that you need to plan around. I will note the places where we had problems with this in the descriptions of individual locations. We did not notice particular patterns, like differences between week days and weekends -- things were just closed a lot.

Overall strategy

We split our one-week trip in half, with the first half operating out of an AirBNB in Fajardo in the northeast and the second half in the southwest, operating out of Sabana Grande. Overall, that wasn't a bad idea, although with Cabezas de San Juan being closed and Humacao National Wildlife refuge being difficult to access, it would probably make sense to have spent 2 days in the northeast and the rest of the time in the southwest where there were a lot more locations to bird. We did not go out to either of the islands mentioned in the book (Culebra and Vieques), so if you were going to do that, then you'd want more time in the northeast. Also, I had hoped to snorkel from Seven Seas beach in Fajardo, but there were rip current warnings for the entire north coast of PR during our whole trip, so that didn't happen.

The northeast

El Yunque (Caribbean National Forest)

Catarata Coca drop pin (entrance gate): 18.325206, -65.769975

Sierra palms in El Yunque rainforest

The Caribbean National Forest (which is universally known as "El Yunque" in PR) is the most famous natural area in Puerto Rico and was the place where we saw the most other visitors. The most important thing to understand about visiting El Yunque is the ticketing system for entering the forest by car. To access most of the forest beyond the Catarata Coca (waterfall), you MUST get a "free" (with $2 handling fee) ticket at recreation.gov. We tried to get tickets over a month in advance but they weren't available yet. Then a few days ahead of our visit, all of the advance tickets were already sold out. There is apparently some release date that is not well described on the website, so we probably should have been checking for tickets every day. Thankfully, they hold back 95 tickets which become available at 8 AM local time the day ahead. That was annoying because it meant we needed to be somewhere with Internet at 8 AM the day before we wanted to visit. We were able to get 8 AM entry tickets for the two consecutive days we wanted to go into the forest. A second batch of tickets are available at 11 AM, but the morning is a better time to visit. Both times allow you to stay until the forest closes (I think at 5 or 6 PM).

All of the "real" toilets (with water) inside the forest are closed and locked. None of the port-a-potties at the Palo Colorado parking lot had toilet paper, and the toilets themselves were a mess. So plan for that. This situation is particularly pathetic given the shiny new million dollar visitor center near the entrance of the forest. The Sierra Palms parking lot is the best place to park for the most popular trail in the park: the one that goes to the Los Picachos and Mount Britton overlooks. There are several trails shown on the maps, but only one is actually functional -- the El Yunque trail that takes off just a short distance downhill from the parking lot. After hiking most of the way to the ridgetop, the trail splts. The right trail leads to the Los Picachos overlook, which provides a spectacular view, but is pretty muddy at the top. The left trail leads to the Mount Britton overlook. If you take the left trail, you can make it a loop by taking the trail all the way to the road and then walking down the road to the parking lot. The section of the trail from the split left to the Mount Britton overlook passes through the "Elfin forest", an area of stunted trees that is home to the endemic Elfin Woods Warbler. We visited that area on our second day by driving to the end of the road and parking there, then taking the trail towards the Mount Britton overlook from the other side. The elfin woods was quite interesting, but this isn't actually the best place in Puerto Rico to see the warbler (see Maricao State Forest later in the post).

We were somewhat surprised that we didn't see many birds along the trail. (The exception was several sitings of the bananaquit, which is abundant everywhere in PR.) That may be partly due to us not being experts and partly due to the difficulty of finding birds in the rainforest canopy, but we've been in other rainforests and this seemed rather disappointing to us. We actually saw more birds near the parking lot and in the area around the visitors' center.

You should bring a good raincoat. We got rained on at least once on nearly every hike we took on the trip and it poured on us in El Yunque.

I mentioned the visitors' center. It really is quite amazing. It was brand-new, so everything was in beautiful condition. They have some nice interpretive exhibits and the person at the desk was actually able to give us some advice about what birds people typically saw around the grounds and where. There are some paved trails right at the center and some well-maintained gravel trails further out. We came back a second time because the area around the visitors center was actually the most productive birding site for us in northeastern PR and maybe of any place in the Commonwealth. The cost to enter is a bit steep ($8 per person) but they honor National Park Service annual and senior passes, so if you have one, you can get in for free.

Humacao National Wildlife Refuge

beach access drop pin: 18.151809, -65.764071

Beach approaching the wildlife refuge near sunset

Following the advice of the book, we went to the Humacao National Wildlife Refuge in the evening to look for waterfowl. This area is one of the worst-described in the book. Almost nothing described about the entrance, where to park, etc. was still valid. Instead of a chain that you can step over, there is now a big steel fence with locked gate and unfriendly-looking barbed wire fences after that. I suppose they want to keep people out of the area where they rent out recreational equipment.

Since we drove all the way there, we decided to see if there was a way to enter the refuge from the beach, which looked like a reasonable point of access. The side roads nearest the preserve are part of a gated community, but going further down the road, we were able to park in a public beach parking lot. We had a nice scenic walk along the beach, where we identified a couple of shore birds. At the end of the beech, there was a short path that took us onto the wildlife refuge drive. From there we were able to easily walk to the drive between the two ponds shown on the hand-drawn map in the book. We were a bit apprehensive about going in the back way when the front gates were closed, but we as we walked, we met several local people who were jogging around the ponds. So clearly it was a normal thing for people to be enjoying the preserve after hours.

Unfortunately, there was very dense vegetation on both sides of the drive, making it difficult to actually see the ponds. We were able to walk out on some kind of old boat dock and actually see the pond on the south side. We saw some waterfowl, but would have needed a spotting scope to figure out exactly what they were (maybe Caribbean coots?). By this time it was getting dark, so we gave up and headed back along the beach.

If you make this kind of sunset visit, I'd recommend dropping a pin on Google Maps on your phone at the place where you enter the beach from the parking lot to make sure that you can find it walking back in the near darkness.

Cabezas de San Juan

This area was closed and seems to have been closed since the hurricane. After leaving northeast PR, we were told by someone we met that you can make arrangements to visit. However, there was no indication on their website of how that would be possible. So unless you have some inside information, I wouldn't plan to go there.

The northwest

We planned to stop in several places in the northwestern part of the island on our way to and from the southwest -- we didn't spend any nights there.

Cambalache State Forest (Bosque Estatal de Cambalache)

parking lot drop pin: 18.452568, -66.596961

The Birdwatchers Guide did not mention Cambalache State Forest (near Arecibo), but we had read several places on the Internet that it was a good spot and was a research base for the Puerto Rico Ornithological Society. So we decided to check it out. It turned out to be a nice area to bird after our somewhat disappointing experience in El Yunque. As usual, all of the facilities were closed when we arrived, but that didn't matter since we could park and walk on the trails. There is a network of trails that are quite well maintained. We spent a few hours walking slowly along the trail listed as "1" on the map above before having to turn back due to heavy rain. Surprisingly, at the campground in the upper left of the map there was actually one open restroom with a composting toilet that was operational (the other bathrooms were locked as usual). We saw both the Puerto Rican Lizard-Cuckoo and Mangrove Cuckoo here as well as the Puerto Rican bullfinch (which we saw elsewhere as well). So it was well worth a half day.

Parador Guajataca

overlook parking lot drop pin: 18.489983, -66.949409

Guajataca cliffs from picnic area

This spot was mentioned as a possible hotel venue in Quebradillas in the northwest. We wanted to check it out for the possibility of seeing the White-tailed Tropicbirds that are supposed to nest in the cliffs nearby. The actual site of the hotel/restaurant did not look like a particularly great birding spot and we didn't opt to stay or eat there. However, just to the east of the hotel turnoff is a small park with a parking area and several benches that overlook the ocean. They looked like a much more promising viewing spot. I saw one large white bird fly by when I was getting out of the car, but otherwise we only saw a couple brown pelicans fly by. But it might be a good place to try your luck if you want to stop for a picnic lunch or a break from driving.

Río Abajo State Forest ( Bosque Estatal de Río Abajo)

junction near headquarters drop pin: 18.320761, -66.683640

The Río Abajo State Forest is most well-known as the best place to see the endangered Puerto Rican Parrot. However, it's a long shot since you aren't allowed to get close to the aviaries area. We were told by some birders who had seen the parrot on the previous day that the best strategy was to walk down the trail towards the aviaries (but stopping before the electronic gate) after about 3:30 to 4 PM when they return to roost. We weren't there at the right time of day, so mostly were just interested in seeing birds in general.

The first issue was figuring out where you actually could go to bird. The road leading from the highway to the forest T's into another road. The sign directs you to the visitors' center a short distance on the right, near the intersection. It has a huge, fancy sign, but was not open (of course) and apparently hasn't been open for several years. To the left was some headquarters buildings (also closed). The access to the forest is on the left branch of the road. You have to drive a significant distance past a lot of residences, which gives you the impression that you were out of the forest or had somehow missed it. This was the one place in PR where we had no cell service, so we had to go on faith that the road actually eventually dead-ends at a closed gate. At the gate there is a sign that says "danger", although it was not at all apparent what the danger was. Beyond the gate is just a paved road through the forest that probably would have been pretty good for birding if we had been there earlier in the day. As it was, we mostly managed to finally see a black-whiskered vireo, which we had been hearing repeatedly throughout the trip. What we had been told by the other birders was that the forestry people were OK with people birding along that road as long as they stayed on the road and did not enter the parrot area after the second gate. We never made it to the second gate because we turned around due to lack of time.

The southwest

We spent two days making circuits through the southwest part of the island. The first day we went along the coast and the second day we visited the high elevations.

Guánica State Forest (Bosque Estatal de Guánica)

parking area drop pin: 17.971403, -66.868727

dry forest in Guánica State Forest

This is supposed to be one of the best birding spots in Puerto Rico and we were not disappointed by it. It is a very dry forest, so don't expect spectacular scenery, though. We took the main road (PR334) into the forest until it ended at the headquarters. When we arrived, there was briefly a guy sitting at an information booth, although by the time we got back around noon he was gone and everything (as usual) seemed completely closed up. Near the parking lot there was a reasonably nice picnic area with actual flush toilets and toilet paper (at least the day we were there). What we learned from the guy at the booth was that the trail starting at the picnic area was a loop if you went "left, left, left, left…". That turned out to be true and the trail was a nice length for a slow birding ramble. We had very satisfying multiple views of the Puerto Rican Tody and Adelaide's Warbler along the trail where they seemed quite common.

ANP Salias Fortuna Para La Naturaleza/Biolumenescent Bay

gate drop pin: 17.977386, -67.011882

On our way to from Guanica to La Parguera, we stopped at a wildlife refuge (operated by Para la Naturaleza https://www.paralanaturaleza.org/) that wasn't mentioned in the book, but that I'd seen on Google Maps. I have no idea when it's supposed to be open or if there is ever any kind of programming there. There was a small building on the site and some kind of construction of a small bridge or something, but there was no explanation or any indication of whether it was open to the public. So as usual, we parked the car by the gate and walked in. This was a nice area for observing wetlands birds and we saw a nice Great Egret, Short-billed Dowitcher, and several other wetland birds that were too far away to identify (a spotting scope would have been good here). I'm not sure this is any better than other wetlands in the area, but it was easy to get to and a nice stop if you are making the obligatory trip to La Parguera to try to see the Yellow-shouldered Blackbird.

Incidentally, we did not manage to see the blackbird in La Parguera. The instructions in the Birdwatcher's Guide were pretty incomprehensible. We found the Parador Villa Parguera with no problem, but it did not seem like the mangroves there were any better than others we could see from the road. We utterly failed to find the "general store" described in the book and after wasting about an hour looking around the town unsuccessfully, we moved on.

Although this has nothing to do with birds, it is worth mentioning that La Parguera is probably the best place from which to visit a bioluminescent bay. This bay is apparently the only place in PR where you are actually allowed to get in the water and there are various options, such as going out in a boat at sunset and snorkeling, or being towed out in kayaks by a boat and then bobbing around in a life jacket. Unfortunately, we did not book far enough in advance to do either of these options, so you should book online at least 2 or 3 days ahead of when you want to do it. We had no problem getting a spot on a boat with a glass bottom. The luminescence was really quite amazing, but unfortunately our timing was off since the moon was at first quarter and was making a lot of light at sunset. We really could only see the luminescence through the glass under the boat when one of the operators swam down and kicked his legs under the glass. It would have been much better to have actually been in the water or at least to have seen the effect on the boat's wake when the moon was not up and it was darker. But it was still pretty cool and the trip was only $15.

Cabo Rojo

lighthouse parking lot drop pin: 17.937730, -67.194344

Our last stop for the day of our coastal exploration was the peninsula of the Cabo Rojo National Wildlife Refuge. This area was not as scenic as we expected and the well-known "pink" lagoons looked like some kind of sedimentation ponds. We did manage to identify a couple shorebirds along the road and we managed to see the introduced Venezuelan Troupial, which was fun.

There were a couple issues you should be aware of. One is that upon entering the refuge proper, the road degenerates terribly into the worst road we encountered on the island. We had to weave back and forth from one side of the other of the road to avoid breaking the axle of our car on giant potholes and there was one degenerated bridge/culvert where we almost turned around because we weren't sure we could cross without damaging the bottom of our low-clearance rental car. We did finally make it to the end of the road to the lighthouse parking lot and had just gotten out to make the kilometer or so walk up to the lighthouse when a couple police officers warned us that we needed to make sure that we were out of the refuge in 45 minutes or we would get locked in when they closed the gate at 5 PM. So we abandoned the attempt to see the lighthouse and alleged brown boobies on the rocks below it. So if you plan to do this excursion, come early in the day and plan for plenty of time to crawl along the horrible road.

Maricao State Forest

km 16.8 drop pin: 18.156738, -66.997737

vacation cottages parking lot drop pin: 18.140393, -66.974230

Elfin forest in Maricao State Forest

We spent the entire morning of our southwestern uplands tour in the vicinity of the Maricao State Forest and it was one of our most productive birding excursions. One advantage of this forest is that a public road (PR 120) passes through it and there are several good stopping places along the road that are never closed off by gates. We started by going straight to km 16.8 where one of the two sets of serious birders we met on the island had seen the Elfin Woods Warbler. We pulled off the road in a small parking area by a gate and immediately heard several of the warblers in a big tree near where we parked. We managed to get a reasonably good look at them before they moved on. We walked for some distance along the trail and saw and heard several other birds including the Puerto Rican Vireo, Puerto Rican Woodpecker, and the Puerto Rican Bullfinch. On our way to checking out what we assumed was the visitors center, we stopped at La Torre de Piedra, a stone overlook in the form of a castle (built by the Civilian Conservation Corps) that had a great view. We saw the Puerto Rican Spindalis there. Beyond the stone tower towards Sabana Grande, we came to what we thought was the Forest Service buildings and picnic area shown on the hand-drawn map in the book. But the area bore no resemblance to the map -- it had vacation cottages, a swimming pool (and maintained bathrooms that were open!). We never did figure out where the supposed "concrete cistern" and other spots on the map were located. We did, however, have amazing looks at several Puerto Rican Woodpeckers that were hanging around in some dead trees around one of the parking lots.

After checking out that area, we headed back to km 16.2, which was an intersection of an actual road that branched off the main road. We parked and walked down the intersecting road and were treated to a second look at the Elfin Woods Warbler -- this time an adult and a juvenile. To cap things off, we also spotted an amazing Antillean Euphonia singing its heart out high in a tree along the road. All in all, this was one of the most productive areas for birding in the whole trip and we weren't locked out of any of it by gates!

Susua State Forest (Bosque Estatal de Susua)

entrance gate drop pin: 18.071079, -66.914372

Vista at Susua State Forest

We had planned to wrap up our day of birding in the southwestern uplands by spending some time in Susua State Forest, just to the east of where we were staying in Sabana Grande. We drove up the narrow road to the forest and were surprised to encounter a locked gate at the entrance. Apparently the gate is locked at 3 PM! We decided to park the car at the gate and walk for a while along the road into the forest, but it was hot, dry, and late in the afternoon, so after walking for about a half hour without seeing anything but vultures, we turned around and walked back to the car. We did see a single Scaly-naped Pigeon near the gate, but that was it for birds. The plants were interesting -- we saw several large cacti and a weird spiny plant with we were told by some botanists was the Puerto Rican version of poison ivy. But this is definitely a place you will want to visit earlier in the day if you plan to drive.

Summary

If you live in the U.S., Puerto Rico is a pretty easy and relatively inexpensive place to visit, since no special travel rules apply, you can find reasonably priced car rentals and accommodations, and many residents speak English if you don't know Spanish. If you've never been to a rainforest before, El Yunque is very interesting, and the southwestern part of the island has a wide variety of habitats in a relatively small area. As a birding spot, it can be fun if you've never birded in the tropics before, and most of the birds we saw were natives (as opposed to some places where you mostly see introduced birds). However, it pales in comparison to other places we've birded like Costa Rica and southern Africa where there are just a lot more species. Nevertheless, it is quite easy to see a dozen or so species that are endemic to Puerto Rico and the Virgin Islands, so it is a place you can go to see unique wildlife.

Investigating Wordle guesses

2022-01-17T19:09:00.005-08:00

Introduction

Those of you who know me know that I like to play somewhat complicated games. One of the reasons I enjoy that is the challenge of figuring out how the game works and how I might play the game better by using my understanding to develop rules of thumb for making decisions during the game.

I also enjoy writing Python scripts when I've found an interesting problem to solve. In this particular case, I ended up spending most of the MLK holiday weekend working on a Python script to investigate the best words to use as guesses in Wordle, that online game that has become viral.

I've been playing for less than two weeks, and during that time I've had an excessive number of discussions with my wife and two daughters about strategies for word choices. One such discussion was about "what's the best first word?" It seemed clear that one would like a word that contained common letters in order to increase the probability of getting correct guesses early. A much longer discussion was centered around whether it would be a better strategy to pick a second word that was complementary to the first word (containing different common letters, e.g. vowels) in order to get the most information during the first two guesses, or whether the second move should capitalize on information gained in the first guess (e.g. limiting the second guess to words that would both help discover new letters and determine the positions of any letters discovered in the first guess).

One appealing feature of this problem is that it is highly tractable using a modern computer. The number of 5-letter English words is limited and a script could sort them out in a minuscule amount of time. Thus a simple and brute-force approach would be totally practical.

Background knowledge

There are several key things that one would want to know before starting out. The first is whether game creator Josh Wardle is tricky and tries to pick "hard" words as a human opponent would, or whether the game words are random. In an interview on The World, he said that he wanted the words to be random so that he could play himself.

Another important thing to know is what set of words are actually used as a source for the game. I don't know the answer to that question, but in the interview, he mentioned that the words were drawn from a list of about 2500 English words. There was also a controversy about the January 12 word "favor", which raised the ire of British users who didn't consider that a proper five-letter word because they thought it should be spelled "favour". This is an indication that the word list might be derived from an American English rather than British English word list.

When I first started playing around, I tried extracting words from several rando English word lists that I found on the Internet. However, they either included a lot of questionable words to give Scrabble players an edge, or were polluted with capitalized proper names, abbreviations, etc. I finally came across a very high quality curated list of words called the "6 of 12" list (named so because it contains words found in at least 6 of the 12 dictionaries used as sources). This list is heavily curated and very clean: proper names are all capitalized and abbreviations without periods or spaces are specially marked so they can be excluded. After running some code to screen out names, abbreviations, and words with lengths other than 5, I came up with a clean list of 2529 words as a source for experimentation. Not only is this list about the same size as the list used by Wardle for the game, all of the words used in the game since 1 January 2022 are on it. (Spoiler alert: the previous link is updated daily, so following it may reveal the answer to today's puzzle if you haven't done it already.) So I consider my list to be a very good proxy for the possible words in the game.

The script

If you want to look at and try the code I'm going to discuss, you can run it yourself on a Colab notebook without installing anything. (You must have and be logged in to a Google account, however.) If you want to edit and save the code, select "Save a copy in Drive" from the file menu and you will be able to save your work. The data are in GitHub, so you don't have to download anything either. Before all of you software engineers start sharpening your knives, I'm not a professional coder, so be kind.

A major part of the code is a Wordle_list object. If you instantiate it without any arguments, you just get the full word list from GitHub. If you pass in a "guess code" and a Python list when you instantiate a Wordle_list, the code will be used to screen out words from the input list according to the information given in the guess code (equivalent to what you see on the screen after you've made a guess in the game.Values of instance variables and results of methods on a Wordle_list instance will tell you things about the screened list, such as how many words are in it and information about the frequencies of letters in the words.

The other major part of the code is the score_words_by_frequency() function. This is the actual "experimental" part of the code where I played around with various ways to score the words on the list to identify which words would make the best next guess. I will talk a lot more about this later.

To actually use the code, run the first cell to define everything, then scroll down to the "Screening algorithm test" section and start running the code in the next cell. There are some instructions you can read there, but here's the TLDR information:

1. To use the code with an actual puzzle, change the value of actual_word to an empty string (two consecutive single quote characters: ''). That will stop the script from trying to suggest guess codes for some other word. If you want to test one of the previous puzzle words (or any word) set the value of actual_word to the word you want.

2. Before you run each of the guess cells, you have to set the value of the next_guess_code variable. If you are testing using an actual known word entered in step 1, a guess code will be suggested for you at the end of the output of the previous cell using the highest ranking word based on the criteria used by the word-scoring function. It will be used automatically if you run the next cell without changing the value of next_guess_code. If you are using the output from the game, set the value of next_guess_code using an uppercase letter for a correct letter in the correct position, a lowercase letter for a correct letter in the wrong place, and a lowercase letter prefixed by a dash for an incorrect letter. Separate the individual letter codes by commas but no spaces.

Here's the code for the example above: 'S,-e,r,-g,E'.

3. As you run each cell, it will show you how it has reduced the number of possible answer words by applying various screens based on the guess:

then show you the top five scoring words for the next guess:

4. Repeat entering guess codes and running subsequent cells until there is only a single word remaining.

Scoring words

The most important open question is how words should be scored for selecting the next guess. Here's the general approach I took:

1. Determine the distribution of letters in the words by counting up how many times they occurred.

2. Order the letters of the alphabet from highest frequency to lowest.

3. Assign a rank to each letter based on its position (0=most frequently used, 25=least frequently used). Ties get the same rank, then missing ranks are skipped until the next non-tied letter (e.g. 1, 2, 2, 4, ...)

4. Score a word by adding up the ranks for each of the five letters in the word. The lowest score is the best word.

This sounds pretty simple and most people use some variation of this in their head based on knowledge they have about letter use from everyday life, playing Scrabble, trying to crack cyphers (didn't everyone do that?), etc. A lot of our early discussion about Wordle guesses revolved around what were the most common letters. For example we realized that what counts is frequency of dictionary words, not frequency of words "in the wild" since a relatively small number of words are used a lot in normal text (e.g. "their", "there", "where", etc.). But what matters the most is the frequency of letters in the 2529 words at play in the game, and since we have the list and a computer, we can know anything we want about letter distributions.

A key consideration is whether we should care more about the overall distribution of letters anywhere in the words or if we should be more concerned about the distribution of letters in particular positions within the words. For example, the letter "y" is not that common overall, but is a lot more frequent in the last position of the word. The overall distribution is particularly important early in the game when nothing is known about any position, but later in the game, the distribution in particular positions becomes more important as only some positions remain undetermined.

The other consideration is that on any particular turn, we don't really care about the distributions of letters in all 2529 words, but only about the distributions of the words that haven't yet been eliminated. An unassisted human player would lack concrete information about this, but with a script, it's easily known.

When a new Wordle_list object is created after each guess, the frequencies and ordinal positions of every letter are calculated both for the words as a whole and for each of the five positions, using the words that remain after eliminating words using the guess code. The ordinal positions for the letters are then available for calculating the word scores.

When I first started playing with this, I calculated two sets of scores: "overall scores" for words with unique letter combinations (no repeated letters) based on the overall letter frequencies, and "position scores" for all words based on the separate letter frequencies of each of the five positions in the word. Words that scored well by the first system were the efficient at discovering/eliminating common letters. Words that scored well by the second system weren't as efficient at discovering letters, but were more efficient at placing letters in the correct position. There are some pairs of words like "aster" and "stare" that have equally good overall scores because they have the same letters. But "stare" has a much better position score than "aster" (14 vs. 30), partly because a lot of words end in "e". So if two words have the same overall score, pick the one with the better position score.

Having two different scores was not a good solution, so I then tried to think how I could combine them into a single score. The easy solution was to assign weights to the two scores. I somewhat arbitrarily chose 0.7 for the overall score and 0.3 for the position score. This gave more weight to efficient letter discovery, while having some weight to the position as a sort of "tiebreaker". This solution worked fairly well, but I quickly discovered two problems.

The most obvious problem was that I was only calculating the overall scores for words with unique letters. How should I score words in which the same letter occurred more than once? The answer came to me when I realized that from the standpoint of letter discovery, repeating a letter is worse than using even the worst letter because it gives you no new information. Therefore, I assigned a score of 26 to the second (or more) instance of a letter in a word.

However, this caused an unintended consequence. Particularly late in the game, it may actually be more efficient to use words with repeated letters depending on the distribution of letters in the undetermined positions. The 26 point penalty on the overall score for repeat letters was so severe, it basically prevented repeat letter words from ever being selected. The solution that I chose was to reduce the overall score weight (from 0.7 to 0) and increase the position weight (from 0.3 to 1) as the number of words decreased towards zero. That made the repeat letter penalty disappear late in the game and had a secondary benefit of emphasizing letter selection (through the overall score) in the first few guesses and emphasizing position selection in the later guesses.

This "sliding scale" of weights is what is used in the final version of the script, although you can manually adjust the base scoring weights of 0.7 and 0.3 in the initial setup.

The optimal second guess

One of the longest points of discussion between my wife and me was whether one should always use two great "letter discovery" words in the first two guesses, or to vary the second guess by choosing the best word based on information gained in the first guess. I could see benefits to both strategies. The "letter discovery" system could be very efficient at nailing down the letters and eliminating a lot of words in the first two guesses. It is also a very simple strategy that doesn't require any thinking about what words might or might not have been eliminated until the third guess. The "information based" second-guess system makes use of what has been learned about letters in the first guess to allow the selection of a tailored second guess based on the remaining words (at least if you have a computer to keep track of the words for you). Once the program was finished, I had an opportunity to test the two approaches using the script.

One thing that you'll notice if you use the script is that the first guess is always "arose", since the selection of the best scoring word from the set is deterministic. There is a section of the notebook labeled "Test of rating words by frequencies" whose first cell finds the 10 best-scoring words with unique letters. The second cell then eliminates all of the unique-letter words that have any of the letters from the first word in order to score the best "complimentary" second word to go with the first word. The default is to use "arose" as the first word, and it results in "glint" as the best second word to go with it. You can hard code one of the other first words as the value of first_word in the second cell to find its complement, e.g. "stare" (first word) and "doily" (second word).

If you run the cell in the "Calculate stats for raw list" section, you'll find that the overall order of frequencies for letters in the five letter-word set is e, a, r, o, t, s, l, i, n, u, c, y, d, h, ... . "arose" includes five of the six most common letters and adding "glint" gets nine of the top ten. "stare" includes five of the top six, and adding "doily" gets all of the top eight and ten of the top thirteen.

I ran a test using the 17 game words from 1 January to today, comparing the strategies of always using "arose" as the first word and "glint" as the second word vs. using "arose" as the first word and letting the script's scoring system to select the second word. The metric was the number of words that remained after screening words out using the guesses. In three cases, it was a tie, with both strategies resulting in the same number of words (1 or 2 words left). In seven cases, the "glint" choice did better and in seven cases, using the best scoring word did better.

Based on this metric, both systems worked about equally well. However, in 6 of the seven cases where "glint" as a second choice won, the number of possible words remaining after the second guess was 1 (meaning that the strategy would have produced a pretty amazing score of 3 for each of those games). Only two of the cases where letting the scoring algorithm choose the word resulted in only one word remaining after the second guess.

Although this is a very low sample size, the somewhat surprising take-home message of this test is that the extremely simple strategy of always guessing "arose" and "glint" as your first two guesses is highly effective at bringing the number of words down to a low number for the third guess.

Other possible investigations

The obvious follow-up to this is to automate the process of guessing to the point where many tests could be run to compare how various strategies work. Since there are only 2529 words, it wouldn't even have to be a random sampling exercise -- one could literally try the strategy on every possible word. There are a few complications. One is that I'm not entirely confident about how the script is handling the evaluation of choices where the guess word contains two of the same letter (neither in the correct position) and the word has only one of that letter. I was having some problems with coding that and with uncertainty about how the game would indicate that result.

The other problem is that it is common near the end of the game for there to be two or even three possible guess words with the same score. So to have the computer fully play out the game would require a random selection. Many human players have been in this situation where there are two possible words that could fit in the final guess and it was necessary to just pick one and hope to get lucky (today for example when my last two choices were "spire" and "shire" -- I got unlucky and incorrectly guessed "spire"). There would probably be a more graceful way to handle the situation, but one could just have the script guess at random, then run it enough times for the probabilities to come out in the wash.

With this kind of automation, one could try many possible combinations of fixed first two words, adjust the base scoring weights to be something other than 0.7 and 0.3, or try out some other strategies I haven't though of yet.

However, I've totally burned up my holiday weekend, so if I do this it will have to be later! :-)

Writing your own data to Wikidata using spreadsheets: Part 4 - Downloading existing data

2021-03-18T05:36:00.002-07:00

This is the fourth part of a series of posts intended to help you manage and upload data to Wikidata using spreadsheets. It assumes that you have already read the first three posts and that you have tried the "do it yourself" experiments to get a better understanding of how the system works. However, it's possible that the only thing that you want to do is to download data from Wikidata into a spreadsheet, and if that is the case, you could get something out of this post without reading the others. The script that I will describe does require a configuration file whose structure is described in the second post, so you'll either need to read that post (if you like handholding) or read the more technical specifications here to know how to construct or hack that configuration file.

If you are the kind of person who prefers to just look at the specs and play with software, then skip this whole post and go to the documentation for the download script here. Good luck! Come back and do the walk-through below if you can't figure it out.

The latter sections of the post do show how to use the downloaded data to carry out easy-to-implement improvements to item records. Those sections depend on an understanding of the earlier posts. So if that kind of editing beyond simply downloading data interests you, then you should read the whole series of posts.

Configuring and carrying out the download

Towards the end of the last post, I created a JSON configuration file for metadata about journal articles. The configuration specified the structure of three CSV files. The first CSV (articles.csv) was intended to have one row per article and contained headers for statements about the following properties: "instance of", DOI, date of publication, English title, journal, volume, page, and issue. The other two CSVs were expected to have multiple rows per article since they contained data about author items (authors.csv) and author name strings (author_strings.csv). Since articles can have one-to-many authors, these two tables could be expected to have zero-to-many rows per article.

For the purposes of testing the download script, you can just use the JSON configuration file as-is. Download it, name it config.json, and put it in a new directory that you can easily navigate to from your home folder. We are going to specify the group of items to be downloaded by designating a graph pattern, so edit the fourth line of the file using a text editor so that it says

"item_pattern_file": "graph_pattern.txt",

You can screen for articles using any kind of graph pattern that you know how to write, but if you don't know what to use, you can use this pattern:

?person wdt:P1416 wd:Q16849893. # person, affiliation, Vanderbilt Libraries
?qid wdt:P50 ?person. # work, author, person

Copy these two lines and save them in a plain text file called graph_pattern.txt in the same directory as the configuration file. The comments after the hash (#) mark will be ignored, so you can leave them off if you want. I chose the first triple pattern (people affiliated with Vanderbilt Libraries) because there is a relatively small number of people involved. You can use some other triple pattern to define the people, but if it designates a large number of people, the file of downloaded journal data may be large. Whatever pattern you use, the variable ?qid must be used to designate the works.

The last thing you need is a copy of the Python script that does the downloading. Go to this page and download the script into the same directory as the other two files.

Open your console software (Terminal on Mac or Command Prompt on Windows) and navigate to the directory where you put the files. Enter

python acquire_wikidata_metadata.py

(or python3 if your installation requires that).

The output should be similar to the screenshot above.

Examining the results

Start by opening the authors.csv file with your spreadsheet software (LibreOffice Calc recommended, Excel OK). This file should be pretty much as expected. There is a label_en column that is there solely to make it easier to interpret the subject Q IDs -- that column is ignored when the spreadsheet is processed by VanderBot. In this case, every row has a value for the author property because we specified works that had author items in the graph pattern we used to screen the works.

The author_strings.csv file should also be close to what you expect, although you might be surprised to see that some rows don't have any author strings. Those are cases where all of the authors of that particular work have been associated with Wikidata items. The script always generates at least one row per subject item because it's very generic. It generally it leaves a blank cell for every statement property that doesn't have a value in case you want to add it later. Because there is only one statement property in this table, a missing value makes the row seem a bit weird because the whole row is then empty except for the Q ID.

When you open the articles.csv file, you may be surprised or annoyed to discover that despite what I said about intending for there to be only one row per article, many articles have two or even more rows. Why is this the case? If you scroll to the right in the table, you will see that in most, if not all, of the cases of multiple rows there is more than one value for instance of. If we were creating an item, we would probably just say that it's an instance of one kind of thing. But there is no rule saying that an item in Wikidata can't be an instance of more that one class. You might think that the article is a scholarly article (Q13442814) and I may think it's an academic journal article (Q18918145) and there is nothing to stop us from both making our assertions.

The underlying reason why we get these multiple rows is because we are using a SPARQL query to retrieve the results. We will see why in the next section. The situation would be even worse if there were more than one property with multiple values. If there were 3 values of instance of for the item and 2 values for the published date, we would get rows with every combination of the two, and end up with 3x2=6 rows for that article. That's unlikely, since I took care to select properties that (other than instance of) are supposed to only have a single value. But sometimes single-value properties are mistakenly given several values and we end up with a proliferation of rows.

An aside on SPARQL

It is not really necessary for you to understand anything about SPARQL to use this script, but if you are interested in understanding this "multiplier" phenomenon, you can read this section. Otherwise, skip to the next section.

Let's start by looking at the page of Carl H. Johnson, a researcher at Vanderbilt (Q28530058).

As I'm writing this (2021-03-16), we can see that Carl is listed as having two occupations: biologist and researcher. That is true, his work involves both of those things. He is also listed as having been educated at UT Austin and Stanford. That is also true, he went to UT Austin as an undergrad and Stanford as a grad student. We can carry out the following SPARQL query to ask about Carl's occupation and education.

select distinct ?item ?itemLabel ?occupation ?occupationLabel ?educatedAt ?educatedAtLabel {
?item wdt:P106 ?occupation.
?item wdt:P69 ?educatedAt.
BIND(wd:Q28530058 AS ?item)
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

You can run the query yourself here, although if the information about Carl has changed since I wrote this post, you could get different results.

The last line of the query ("SERVICE...") is some "magic" that the Wikidata Query service does to automatically generate labels for variables. If you are asking about a variable named "?x" and you also ask about the variable "?xLabel", with the "magic" line the Query Service will automatically generate "?xLabel" for you even if you don't define it as part of the graph pattern. I've used this method to generate labels for the three variables I'm asking about in the first line: ?item, ?occupation, and ?educatedAt.

The second and third lines of the query:

?item wdt:P106 ?occupation.
?item wdt:P69 ?educatedAt.

are the ones that are actually important. They restrict the value of the variable ?occupation to be the occupation (P106) of the item and restrict the value of the variable ?educatedAt to be the place where the item was educated at (P69).

The fourth line just forces the item to be Carl H. Johnson (Q28530058). If we left that line out, we would get thousands or millions of results: anyone in Wikidata who had an occupation and was educated. (Actually, the query would probably time out. Try it and see what happens if you want.)

So here's what happens when we run the query:

When we carry out a SPARQL query, we are asking the question: what combinations of variable values satisfy the graph pattern that we have specified? The query binds those combinations to the variables and then displays the combinations in the results. If we think about these combinations, we can see they all satisfy the pattern that we required:

Carl has occupation researcher and went to school at UT Austin.
Carl has occupation researcher and went to school at Stanford.
Carl has occupation biologist and went to school at UT Austin.
Carl has occupation biologist and went to school at Stanford.

There are four ways we can bind values to the variables that are true and satisfy the pattern we required. We cannot ask SPARQL to read our mind and guess there was a special combination that we intended, or that there was one combination that we were more interested in than another.

This behavior sometimes produces results for SPARQL queries that seem unexpected because you get more results than you intend. But if you ask yourself what you really required in your graph pattern, you can usually figure out why you got the result that you did.

Restricting the combinations of values in the table

If you paid close attention to the output of the script, you will have noticed that for each of the three CSVs it said that there were no pre-existing CSVs. After the script runs the SPARQL query to collect the data from Wikidata, it tries to open the files. If it can't open the files, it creates new files and saves all of the combinations of values that it found. However, if the files already exist, it compares the data from the query to the data already in the CSV and ignores combinations of values that don't match what's already there.

That means that if we are annoyed about all of the possible combinations of values initially written to the table, we can delete lines that contain combinations that we don't care about. For example, if one row of the table says that the article is a scholarly article (Q13442814) and another says its an academic journal article (Q18918145), I can delete the scholarly article row and only pay attention to row containing the statement that it is an academic journal article. In future downloads, the scholarly article assertion will be ignored by the script. It is a pain to have to manually delete the duplicate lines, but once you've done so, you shouldn't have to do it again if you try downloading more data later. You will have only a single line to deal with per existing item.

The situation is actually a bit more complicated than I just described. If you are interested in the details of how the script screens combinations of variables that it gets from the Query Service, you can look at the comments in the script starting in line 510. If not, you can assume that the script does the best it can and usually does the screening pretty well. It is still good to examine the CSV visually after doing a fresh download to make sure nothing weird happened.

Repeating the download to pick up new data

Based on what I said in the previous section, you should have noticed that you can run this script repeatedly if you want to pick up new data that has been added to the items since the last time you ran the script. That means that if you are adding data to the downloaded CSV as a way to make additions to Wikidata via the API, you first can check for updated information to make sure that you don't accidentally add duplicate statements when you use a stale CSV. VanderBot assumes that when you fill in empty cells, that represents new data to be written. So you or someone else has actually made the same new statement using the Web interface since the last time you ran VanderBot to do an upload, you risk creating duplicate statements.

There are several important things that you need to keep in mind about updating existing CSVs prior to adding and uploading new data:

1. The screening to remove duplicate rows is only done for preexisting items. Any new items downloaded in the update will need to be visually screened by a human for duplicates. It doesn't really hurt anything if you leave the duplicate rows -- all of their statements and references are preexisting and will have identifiers, so VanderBot will ignore them anyway. But you will probably eventually want to clean up the duplicates to make the spreadsheet easier to use in the future.

2. If there is only a single combination of values for an item (i.e. only a single row), the script will automatically replace any changed values with the new ones regardless of the preexisting state of that row. The screening of rows against the existing spreadsheet only happens when there are two rows with the same Q ID. So if somebody changed the wonderful, correct value that you have in your spreadsheet to something icky and wrong, running the update will change your local copy of the data in the spreadsheet to their icky and wrong value. On the other hand, if they have fixed an error and turned your data into wonderful, correct data, that will be changed in your local copy as well. The point here is that the script is dumb and cannot tell the difference between vandalism and crowd-sourced improvements of your local data. It just always updates your local data when there aren't duplicate rows. In a later post, we will talk about a system to detect changes before downloading them, so that you can make a decision about whether to allow the change to be made to your local copy of the data (i.e. the CSV).

3. If you have enabled VanderBot to write the labels and descriptions of existing items (using the --update option), it is very important that you download fresh values prior to using a stale CSV for writing data to the API. If you do not, then you will effectively revert any recently changed labels and descriptions back to whatever they were the last time you downloaded data or wrote to the API with that CSV. That would be extremely irritating to anyone (including YOU) who put a lot of work into improving labels and descriptions using the web interface and then had them all changed by the VanderBot script back to what they were before. So be careful!

Making additions to the downloaded data and pushing them to the API: Low hanging fruit

If you use this script to download existing data, it will not take you very long to realize that a lot of the data in Wikidata is pretty terrible. There are several common ways that data in Wikidata are terrible, and VanderBot can help you to improve the data with a lot less effort than doing a lot of manual editing using the web interface.

Changing a lot of descriptions at once

Many items in Wikidata were created by bots that had limited information about the items, and limited abilities to collect the information they were missing. The end result is that descriptions are often very poor. A very common example is describing a person as a "researcher". I believe that this happens because the person is an author of a research article, and since the bot knows nothing else about the person, it describes them as a "researcher". Since we are screening by a SPARQL query that establishes some criteria about the items, that criterion often will allow us to provide a better description. For example, if we are screening people by requiring that they be faculty in a chemistry department, and who have published academic research articles, we can safely improve their descriptions by calling them "chemistry researchers".

In the case of our example, there is an even more obvious problem: many of the items have no description at all. There is a very easy solution, since we have the instance of (P31) information about the items. All of the items in the screenshot above that are missing descriptions are instances of Q13442814.

I used the script (discussed in the previous post) for downloading information about values of statement properties to summarize all of the P31 values for the items in this group. So I know that all of these items with missing descriptions are instances of scholarly article. There may be better descriptions for those items, but at this point "scholarly article" is a much better description than no description at all.

One might argue that we aren't actually adding information to the items, given that our proposed description is simply re-stating the P31 value. That may be true, but the description is important because it shows up in the search results for an item, and the P31 value does not. In Wikidata, descriptions also play an important role in disambiguating items that have the same label, so it's best for all items to have a description.

I am going to fix all of these descriptions at once by simply pasting the text "scholarly article" in the description column for these items, then running VanderBot with the --update option set to allow. If you have not read the earlier posts, be aware that prior to writing to the API, you will need to create a metadata description file for the CSVs (discussed in the second post) and also download a copy of VanderBot from here into the directory where the CSV files are located. Run the API upload script using the command

python vanderbot.py --update allow --log log.txt

After it finished running, I deleted the three CSV files. After a little while I ran the download script again to see how things in Wikidata had changed. The results are in the screenshot above. (Note: there is a delay between when data are written to the API and when they are available at the Query Service, so the changes won't necessarily show up immediately. It can take between a few seconds and up to an hour for the changes to be transferred to the Query Service.) I was able to improve the descriptions of 23 items by about 30 seconds of copying and pasting. That would have probably taken me at least 10 or 15 minutes if I had looked up each item using the web interface and entered those descriptions manually.

Changing a lot of labels at once

Another situation where I can make a large number of improvements with very little effort is adding people item labels for other languages. Most names of people will be represented in the same way across all languages that use the Latin character set. So I can easily improve label coverage in non-English languages by just copying the English names and using them as labels in the other languages. This would be extremely labor-intensive if you had to look up each item and do the copying and pasting one item at a time. However, when the labels are in spreadsheet form, I can easily copy an entire column and paste it into another column.

In our Vanderbilt Fine Arts Gallery WikiProject, we put in a lot of work on either disambiguating artist name strings against Wikidata items, or creating new items for artists that weren't already there. As an end result, we now have a list of 1325 artists whose works are included in the gallery collection. I can use that list of 1325 Q IDs as a way to define a category of items to be included in a download using the acquire_wikidata_metadata.py script. To set up the download, I created a config.json file containing this JSON:

{
  "data_path": "",
  "item_source_csv": "creators.csv",
  "item_pattern_file": "",
  "outfiles": [
    {
      "manage_descriptions": true,
      "label_description_language_list": [
        "en",
        "es",
        "pt",
        "fr",
        "it",
        "nl",
        "de",
        "da",
        "et",
        "hu",
        "ga",
        "ro",
        "sk",
        "sl",
        "zu",
        "tr",
        "sv"
      ],
      "output_file_name": "creators_out.csv",
      "prop_list": [
      ]
    }
  ]
}

As you can see, I'm not concerned with any properties of the works items. I've simply listed the language codes for many languages that primarily use the Latin character set. The creators.csv file is my spreadsheet with the 1325 item identifiers in a column named qid. It defines the set of items I'm interested in. After running the acquire_wikidata_metadata.py script, the creators_out.csv spreadsheet looked like this:

There are several things worth noting. In most cases, when the label is available in non-English languages, it's exactly the same as the English label. This confirms my assertion that it's probably fine to just re-use the "English" names as labels in the other languages. There are a couple exceptions. Buckminster Fuller has variation in his labels because "Buckminster" is apparently his middle name. So I'm going to mostly leave that row alone -- he's famous enough that he's represented in most languages anyway. The Haverford Painter's name isn't really a name. It's more of a description applied as a label and it does vary from language to language. I'll just delete that row since I have no idea how to translate "Haverford Painter" into most of the languages.

The other interesting thing is that most of the names are represented in Dutch already. The reason is that there is a bot called Enderoobot which, among other things, automatically adds English people name labels as Dutch labels (see this edit for example). There are only a few missing Dutch labels to fill in. So I definitely should not just copy the entire English column of labels and paste it into the Dutch column.

Since the rows of the CSV are in alphabetical order by Q ID, the top of the spreadsheet contains mostly newer items with Q IDs over 100 million. In the lower part of the sheet where the Q IDs of less than 100 million are located, there are a lot more well-known artists that have labels in more languages. It would take more time than I want to spend right now to scrutinize the existing labels to see if it's safe to paste over them. So for now I'll limit my copying and pasting to the top of the spreadsheet.

After pasting the English labels into all of the other columns and filling in the few missing Dutch labels, I'm ready to write the new labels to Wikidata. I needed to run the convert_json_to_metadata_schema.py script to generate the metadata description file that VanderBot needs to understand the new creators_out.csv spreadsheet I've just created and edited (see my second post if you don't know about that). I'm now ready to run VanderBot using the same command I used earlier.

Using this method, I was able to add approximately 3500 multilingual labels with only about 30 seconds of time to copy and paste columns in the spreadsheet and about 10 minutes for VanderBot to write the new labels to the API. I can't even imagine how long that would take to do manually.

One nice thing is that there is only one interaction with the API per item, regardless of the number of different languages of labels that are changed. Since most of the time that VanderBot takes to do the writing is actually just sleeping 1.25 seconds per item (to avoid exceeding the maximum writing rate for bots without a bot flag), it's important to bundle as many data items per API interaction as possible.

When I check one of the artist's pages, I see now that it has labels in many languages instead of only English.

Although it would be more labor-intensive, the same process could be used for adding labels in non-Latin character sets. A native speaker could simply go down the rows and type in the labels in Chinese characters, Cyrillic, Greek, Arabic, or any other non-Latin character set in an appropriate column and run the script to add those labels as well.

Adding multiple references to an item at once

Despite the importance of references to ensuring the reliability of Wikidata, many (most?) statements do not have them. That's understandable when humans create the statements, since including references is time consuming (although less so if you use some of the gadgets that are available to streamline the process, like currentDate and DuplicateReferences). For bots, it's inexcusable. Most bots are getting their data automatically from some data source and they know what that data source is, so there is no reason for them to not add references other than the laziness of their developers.

We can't fix other people's bad behavior, but we can fix their missing references with minimal work if we have an easy way to acquire the information.

Q44943965 is an article that was created using QuickStatements. Some of the data about the item was curated manually and those statements have references. But most of the bot-created statements don't have any references and I'm too lazy to add them manually. Luckily, the article has a DOI statement near the bottom, so all I need to do is to click on it to verify that the information exists for the statements with missing references. As a reference URL, I'm going to use the HTTPS form of the DOI, https://doi.org/10.3233/SW-150203, which a human can click on to see the evidence that supports the statement.

This publication was in the CSVs from the first example in this post, so prior to writing the references, I deleted the CSVs and used the acquire_wikidata_metadata.py script download a fresh copy of the data.

I highlighted the row to make it easier to see and pasted the DOI URL into the doi_ref1_referenceUrl column. I typed in todays date into the doi_ref1_retrieved_val column in the required format: 2021-03-17.

To create the references for the other statements, I just needed to copy the DOI URL into all of the columns whose names end in _ref1_referenceUrl and today's date into all of the columns that end in _ref1_retrieved_val.

Once I finished that, I saved the CSV and ran VanderBot (I already had the metadata description file from earlier work). I saved the output into a log file so that I could look at it later.

When VanderBot processes a CSV, it first writes any new items. It then runs a second check on the spreadsheet to find any items where statements already exist (indicated by the presence of a value in the _uuid column for that property), but where references have NOT been written (indicated by the absence of a _ref1_hash value). Scrolling through the log file, I saw that there was "no data to write" for any of the statements. In the "Writing references of existing claims" section (screenshot above), I saw the seven new references I created for Q44943965.

Checking the item page again, I see that all of the statements now have references!

This is more labor-intensive than making the label changes that I demonstrated in the previous example, but if all of the items in a spreadsheet were derived from the same source, then copying and pasting all the way down the _ref1_referenceURL and _ref1_retrieved_val columns would be really fast. In this case, it was not particularly fast, since I had to look up the DOI, then copy and paste the URL and date manually for each different item. However, since DOI data from CrossRef are machine-readable (via their API, see https://github.com/CrossRef/rest-api-doc), it won't be that hard to script the lookup in Python and have the script add all of the references to the CSV. I may write a post showing how to do that sometime in the future.

Conclusion

The script that downloads existing data from Wikidata into a CSV (acquire_wikidata_metadata.py) makes it possible to use the VanderBot API-writing script to improve certain kinds of information about items by simply copying multiple cells and pasting them elsewhere in the spreadsheet. Since CSVs are easily read and written by scripts, it is also possible to automate the addition of some kinds of data about existing items to the CSV (and eventually to Wikidata) by scripting.

In future posts, I will show how to accomplish some of the more difficult aspects of managing your own data in Wikidata using spreadsheets, including automated data acquisition, disambiguation, and monitoring for changes over time.

Writing your own data to Wikidata using spreadsheets: Part 3 - Determining what properties to use

2021-03-11T18:45:00.002-08:00

This is the third part of a series of posts intended to help you manage and upload data to Wikidata using spreadsheets. You probably won't get as much out of this post if you haven't already done the do-it-yourself exercises in the first two posts, but since this is a more general topic, you might still find it useful even if you haven't read the earlier ones.

Determining scope

The target audience of this post is people or groups who have particular defined datasets that they would like to upload and "manage" on Wikidata. I put "manage" in quotes because no one can absolutely manage any data on Wikidata, since by definition it is a knowledge graph that anyone can edit. So no matter how much we care about "our" data, once we put it into Wikidata, we need to be at peace with the fact that others may edit "our" items.

There is a good chance that the data we are interested in "managing" may be some kind of data about which we have special knowledge. For example, if we are part of a museum, gallery, or library, we may have items in our collection that are worth describing in Wikidata. It is unlikely that others will have better information than we do about things like accession numbers and license information. I'm going to refer to this kind of information as "authoritative" data -- data that we probably know more about than other Wikidata users. There may be other data that we are very interested in tracking, but about which we may have no more information than anyone else.

In both of these situations, we have a vested interest in monitoring additions and changes made by others outside our group or organization. In the case of our "authoritative" data, we may want to be on the lookout for vandalism that needs to be reverted. As we track other non-authoritative data, we may discover useful information that's effectively crowd-sourced and available without cost (time or financial) to us.

There will also be other statements that involve properties that we aren't interested in tracking. That doesn't mean this other information is useless -- it may just not be practical for us to track it since we can't really contribute to it nor gain much benefit from it.

So an important part of planning a project to upload and manage data in Wikidata is determining the scope of statements you plan to monitor. This is true regardless of how you are doing that managing, but in the case of using CSV spreadsheets, the defined scope will determine what column headers will be present in the spreadsheet. So prior to moving forward with using spreadsheets to write data to Wikidata, we need to decide what properties, qualifiers, and references we plan to document in those spreadsheets.

Defining a group of items of interest

The first thing we need to decide is what kind of items are we interested in. There may be an obvious target item type: works in a gallery, specimens in a museum, articles published by researchers in an institution, etc. There will also often be secondary item types associated with the primary one: artists associated with gallery works, collectors with specimens, authors and journals with articles, for example. After determining the primary and secondary item types, the next step is to figure out the value of P31 (instance of) goes with each type of item of interest. In some cases, this might be obvious (Q5 = human for authors, for example). In other cases it may not be so clear. Is a book Q571 (book) or is it Q3331189 (version, edition, or translation)? The best answer to this question is probably "what are other people using for items similar to mine?" We'll talk about some tools for figuring that out later in the post.

There are two useful ways to define a group of related items. The simplest is enumeration: creating a list of Q IDs of items that are similar. Less straightforward but more powerful is to define a graph pattern that can be used in SPARQL to designate the group. We are not going to go off the deep end on SPARQL in this post, but it plays such an important role in using Wikidata that we need to talk about it a little bit.

Wikidata is a knowledge graph, which means that its items are linked together by the statements that connect them. Thus the links involve the properties that form the statements. The simplest connection between two statements involves a single statement using a single property. For example, if we are interested in works in the Vanderbilt Fine Arts Gallery, we can define that group of items by stating that a work is in the collection (P195) of the Vanderbilt Fine Arts Gallery (Q18563658). We can abbreviate this relationship by the shorthand:

?item wdt:P195 wd:Q18563658.

The ?item means that the item is the thing we want to know, and the other two parts lay out how the item is related to the gallery. This shorthand is the simplest kind of graph pattern that we can create to define a group of items.

We can use this graph pattern to get a list of the names of works in the gallery using the Wikidata Query Service. Click on this link and it will take you to the Query service with the appropriate query filled in. If you look at the query in the upper right, you'll see our graph pattern stuck between a pair of curly brackets. (The other line is a sort of "magic" line that produces English labels for items that are found.) If you are wondering how you might have known how to set up a query like this, you can drop down the Examples list and select the first query: "Cats". My query is just a hack of that one where I substituted my graph pattern for the one that defines cats (P31=instance of, Q146=house cat).

We can narrow down the scope of our group if we add another requirement to the graph pattern. For example, if we want our group of items to include only paintings that are in the Vanderbilt gallery, we can add another statement to the graph pattern: the item must also be an instance of (P31) a painting (Q3305213).

?item wdt:P31 wd:Q3305213.

The query using both restrictions is here.

These kinds of graph patterns are used all the time in Wikidata, sometimes when you don't even know it. If you visit the Vanderbilt Fine Arts Gallery WikiProject paintings page and look just below the star, you'll see that the graph pattern that we just defined is actually what generates that page. We will use such patterns later on to investigate property use by groups of items that are defined by graph patterns.

What properties are used with different kinds of items?

Recoin (Relative Completeness Indicator)

The simplest way to see what kind of properties tend to be used with certain kinds of items is to look at the page of an item of that kind and see what properties are there. That isn't a very systematic approach, but there is a gadget called Recoin that can make our investigation more robust. Recoin can be installed by clicking on the Preferences link at the top of any Wikidata page, then selecting the Gadgets tab. Check the box for Recoin, then click Save.

After you enable Recoin, you can click the Recoin link just below the item description and a list will drop down showing the fraction of items having various properties for all items having the same P31 value. The example above shows values for instances of "art museum". Of course, this list shows properties that are missing for that page, so you would need to find a page with most properties missing to get a more full list. If you can create a new item having only a P31 value, you will be able to get the complete list.

Wikidata:WikiProjects

A more systematic approach is to look for a WikiProject that is interested in the same kind of items as you. The list of WikiProjects is somewhat intimidating, but if you succeed in finding the right project, it will often contain best-practices guidelines for describing certain types of items. For example, if you expand Cultural WikiProjects, then GLAM (galleries, libraries, archives, and museums) WikiProjects, you will see one called "Sum of all paintings". They have a list of recommendations for how to describe paintings. You can find similar lists in other areas and if you are lucky, you will find a list of extensive data model guidelines, such as the Stanford Libraries's data models for academia.

A small amount of time spent searching here will pay large dividends later if you start by using the consensus properties adopted by the community in which you are working. The items you put into Wikidata will be much more likely to be found and linked to by others if you describe them using the same model as is used with other items of the same type.

Determining what properties are used "in the wild"

If you find a WikiProject related to your type of interest, you will probably have a good idea of the properties that group says you should be using for statements about that type of item. However, you might discover that in actuality some of those properties are not really used much. That could be the case if the values are not easily available or if it's to labor-intensive to disambiguate available string values with item values in Wikidata. So it is pretty useful to know what properties are actually being used by items similar to the ones you are interested in creating/editing.

I have written a Python script, count_entities.py, that you can use to determine what properties have been used to describe a group of related items and the number of items that have used each property. The script details are described here. Before using the script with your own set of items, you will need to define your category of items using one of the two methods I described earlier. But for testing purposes, you can try running the script using the default built-in group: works in the Vanderbilt University Fine Arts Gallery.

To run the script, you need the following:

Python 3 installed on your computer with the ability to run it at the command line.
The requests module installed using PIP, Conda, or some other package manager.
a plain text editor if you want to define the group by SPARQL graph pattern. You can use the built-in text editors TextEdit on Mac or Notepad on Windows.
a spreadsheet program to open CSV files. LibreOffice Calc is recommended.
knowledge of how to change directories and run a Python script from your computer's console (Terminal on Mac, Command Line on Windows).

You do NOT need to know how to code in Python. If you are uncertain about any of these requirements, please read the first post in this series, which includes a lot of hand-holding and additional information about them.

To run the script, go to the script's page on GitHub and right click on the Raw button. Select Save Link As... and save the script in a directory you can easily navigate to using your console. The script will general CSV files as output, so it is best to put the script in a relatively empty directory so you can find the files that are created.

To test the script, go to your console and navigate to the directory where you saved the script. Enter

python count_entities.py

(or python3 if your installation requires that). The script will create a file called properties_summary.csv, which you can open using your spreadsheet program.

The table shows all of the properties used to make statements about items in the gallery and the number of items that use each property. Although there are (currently) 6000 items in the group, they use properties fairly consistently, so there aren't that many properties on the list. Other groups may have much longer lists. But often there will be a very long tail of properties used only once or a few times.

Unless you want to keep investigating the Vanderbilt Fine Arts Gallery items, you must define your group using one of the two options described below: --csv (or its brief form -C) to enumerate items in the group by Q ID or --graph (or its brief form -G) to define the group by a graph pattern.

Defining a group by a list of Q IDs

Let's try using the script by defining the group by enumeration. Download the file bluffton_presidents.csv from here into the same directory as the script, using the Raw button as before. NOTE: if you are using a Mac, it may automatically try to change the file extension from .csv to .txt in the Save As... dialog. If so, change the format to All Files and change the extension back to .csv before saving.

If you open the CSV that you downloaded, you'll see that the first column has the header qid. The script requires that the Q IDs be in a column with this header. The position of that column and the presence of other columns do not matter. The items in the column must be Q IDs, including the initial Q and omitting any namespace abbreviations like wd: .

Run the script again using this syntax:

python count_entities.py --csv bluffton_presidents.csv

Note that the previous output file will be overwritten when you run the script again.

This time the script produces a list of properties appropriate for people.

Defining a list by SPARQL graph pattern

Open your text editor and paste in the following text:

?qid wdt:P108 wd:Q29052.

?article wdt:P50 ?qid.

?article wdt:P31 wd:Q13442814.

The first line limits the group to items whose employer (P108) is Vanderbilt University (Q29052). The second line specifies that those items must be authors of something (P50). The third line limits those somethings to being instances of (P31) scholarly articles (Q13442814). So with this graph pattern, we have defined our group as authors of scholarly articles who work (or worked) at Vanderbilt University.

Save the file using the name graph_pattern.txt in the same directory as the script. Run the script using this syntax:

python count_entities.py --graph graph_pattern.txt

Again, the script will overwrite the previous output file.

This time, the list of properties is much longer because the group is larger and more diverse than in the last example. Despite whatever advice any WikiProjects group may give about best-practices for describing academics, we can see that there is a very small number of properties that are actually given for most of these academic authors. Note that in many cases, given name and family name statements are generated automatically by bots. So if we wanted to create "typical" records, we would only need to provide the top six properties.

If you are unfamiliar with creating SPARQL query graph patterns, I recommend experimenting at the Wikidata Query Service page. The Examples dropdown there shows a lot of examples. However, in most cases, we can define the groups we want with simple graph patterns of only one to three lines.

Examining property use in the wild

Before deciding for sure what properties you want to write/monitor, it is good to know what typical values are for that property. It is also critical to know whether it is conventional to use qualifiers with that property. The count_entities.py script can also collect that information if you use the --prop option (or its brief form -P). I will demonstrate this with the default group (Vanderbilt Fine Arts Gallery works), but you can supply a value for either the --csv or --graph option to define your own group.

One of the most important properties to understand about a group is P31 (instance of). To see the distribution of values for P31 in the gallery, issue this command in your console:

python count_entities.py --prop P31

(or python3 if your installation requires that). The script generates a file whose name starts with the the property ID and ends in _summary.csv (P31_summary.csv in this example). Here's what the results look like:

We can see that most items in the gallery that are described in Wikidata are prints. There is a long tail of other types with a very small number of representatives (e.g. "shoe"). Note that it is possible for an item to have more than one value for P31, so the total count of item by type could be greater than the total number of items.

If any statements using the target property have qualifiers, the script will create a file listing the qualifiers used and the number of items with statements using those qualifiers. In the case of P31, there were no qualifiers used, so no file was created. Let's try again using P571, inception.

python count_entities.py -P P571

The result in the P571_summary.csv file is not very useful.

It listed the 401 different inception dates (as of today) for works in the gallery. However, the P571_qualifiers_summary.csv is more interesting.

This gives me very important information. For most of the 401 dates, they were qualified by defining an uncertainty range using earliest date (P1319) and latest date (P1326). The other commonly used qualifier was P1480 (sourcing circumstances). Examining the property description, we see that P1480 is used to indicate that a date is "circa" (Q5727902). So all three of these qualifiers are really important and should probably be designated to be used with P571.

For properties that have a large number of possible values (e.g. properties that have unique values for every item), you probably don't want to have the script generate the file of values if all you want to know is the qualifiers that are used. You can get only the qualifiers output file if you use the --qual (or -Q) option (with no value needed to go with it). A good example for this is P217 (inventory number). Every work has a unique value for this property, so there is no reason to download the values for the property. Using the --qual option, I can find out what qualifiers are used without recording the values.

python count_entities.py -prop P217 --qual

The P217_qualifiers_summary.csv file shows that there is a single qualifier used with P217: collection (P195).

Putting it together

The reason for including this post in the series about writing the Wikidata using spreadsheets is that we need to decide what properties, qualifiers, and references to include in the metadata description description of the CSV that we will use to manage the data. So I will demonstrate how to put this all together to create the spreadsheet and its metadata description.

I am interested in adding publications written by Vanderbilt researchers to Wikidata. Since data from Crossref is easily obtainable when DOIs are known, I'm interested in knowing what properties are used with existing items that have DOIs and were written by Vanderbilt researchers. So the first step is to define the group for the items. The first thing I tried was the graph pattern method. Here is my graph pattern:

?person wdt:P1416 wd:Q16849893. # person affiliation VU Libraries

?item wdt:P50 ?person. # work author person

?item wdt:P356 ?doi. # work has doi DOI.

I tested this pattern at the Query Service with this query. However, when I ran the script with the --graph option to determine property use, it timed out.

Determining property use

Since Plan A did not work, I moved on to Plan B. I downloaded the results from the query that I ran at the Query Service and put them into a CSV file. I then did a bit of massaging to pull the Q IDs into their own column with the header qid. This time when I ran the script with the --csv option, I got some useful results.

Based on these results I probably need to plan to upload and track the first 10 properties (through author name string). For P31 and P1433, it would probably be useful to see what kind of values are usual, but for the rest I just need to know if they are typically used with qualifiers or not.

The results for P31 indicate that although both scholarly article (Q13442814) and academic journal article (Q18918145) are used to describe these kind of academic publications, scholarly article seems to be more widely used. There were no qualifiers used with P31. Not unexpectedly, a check of P1433 revealed many library-related journals. One item used qualifiers with P1433, but those qualifiers, P304 (pages), P433 (issue), and P478 (volume), appear to be misplaced since those properties are generally used directly in statements about the work.

The only other items with qualifiers were P50 (author, shown above) and P2093 (author name string), which also had the qualifier P1545 (series ordinal). So this simplifies the situation quite a bit -- I really only need to worry about qualifiers with the two author-related terms, which are going to require some special handling anyway.

Creating a config.json file for the spreadsheet

I now have enough information to know how I want to lay out the spreadsheet(s) to contain the data that I'll upload/manage about journal articles. To understand better how to structure the config.json file that I'll use to generate the spreadsheets and metadata description file, I looked at one of the articles to help understand the value types for the properties.

The style of the values on the page help me to know the value type. The item values are hyperlinked text. The string values are unlinked black text. Monolingual text values look like strings, but have their language following them in parentheses.

To decide about the number of spreadsheets needed, I thought about which properties were likely to have multiple values per article item. Both author (item) and author name string could have multiple values. So I put them into separate spreadsheets. The rest of the properties will probably have only one value per article (or at least only one value that I'm interested in tracking). So here is what the overall structure of the config.json file looks like:

{
  "data_path": "",
  "item_source_csv": "",
  "item_pattern_file": "",
  "outfiles": [
    {
      "manage_descriptions": true,
      "label_description_language_list": [
        "en"
      ],
      "output_file_name": "articles.csv",
      "prop_list": [
      ]
    },
    {
      "manage_descriptions": false,
      "label_description_language_list": [],
      "output_file_name": "authors.csv",
      "prop_list": [
      ]
    },
    {
      "manage_descriptions": false,
      "label_description_language_list": [],
      "output_file_name": "author_strings.csv",
      "prop_list": [
      ]
    }
  ]
}

I don't want to manage descriptions on the two author-related CSVs, and am only including the labels to make it easier to identify the article. I'm only working in English, so that also simplifies the label situation.

Here are a few of the property descriptions that I used that illustrate several value types for the statement properties:

        {
          "pid": "P31",
          "variable": "instance_of",
          "value_type": "item",
          "qual": [],
          "ref": []
        },
        {
          "pid": "P356",
          "variable": "doi",
          "value_type": "string",
          "qual": [],
          "ref": [
            {
              "pid": "P854",
              "variable": "referenceUrl",
              "value_type": "uri"
            },
            {
              "pid": "P813",
              "variable": "retrieved",
              "value_type": "date"
            }
          ]
        },
        {
          "pid": "P577",
          "variable": "published",
          "value_type": "date",
          "qual": [],
          "ref": [
            {
              "pid": "P854",
              "variable": "referenceUrl",
              "value_type": "uri"
            },
            {
              "pid": "P813",
              "variable": "retrieved",
              "value_type": "date"
            }
          ]
        },
        {
          "pid": "P1476",
          "variable": "title_en",
          "value_type": "monolingualtext",
          "language": "en",
          "qual": [],
          "ref": [
            {
              "pid": "P854",
              "variable": "referenceUrl",
              "value_type": "uri"
            },
            {
              "pid": "P813",
              "variable": "retrieved",
              "value_type": "date"
            }
          ]
        },

Following typical practice, I'm skipping references for P31 (instance of). The rest of the properties only have reference properties for P854 (reference URL) and P813 (retrieved). Some existing items may have references for P248 (stated in), but since I'm going to be getting my data from Crossref DOIs, I'll probably just use the URL form of the DOI in all of the references. So I'll only use a column for P854. Notice also that the P1476 (title) property must have the extra language key/value pair since it's a monolingual string. If the title of the journal isn't in English, I'm stuck but I'll deal with that problem later if it arises.

The final version of my config.json file is here. I will now try running the convert_json_to_metadata_schema.py script discussed in the last post to generate the headers for the three CSV files and the metadata description file so that I can test them out.

Test data

To test whether this will work, I'm going to manually add data to the spreadsheet for an old article of mine that I know is not yet in Wikidata. It's https://doi.org/10.1603/0046-225X-30.2.181 . Here's a file that shows what the data look like when entered into the spreadsheet. You'll notice that I used the DOI as the reference URL. As I said in the last section, I intend to eventually automate the process of collecting the information from Crossref, but even though I got the information manually, the DOI URL will redirect to the journal article landing page, so anyone checking the reference will be able to see it in human-readable form. So this is a good solution that's honest on the data source and that also allows people to check the reference when the click on the link.

Please note that I did NOT fill in the author CSV yet, even though I already know what the author items are. The reason is that if I filled it in without the article item Q ID in the qid column, the VanderBot API-writing script would create two new items that consisted only of author statements about unlabeled items. Instead, I created the item for the article first, then added the article Q ID in the qid column for both author rows in the authors.csv file. You can see what that looks like in this file. Since I knew the author item Q IDs for both authors, I could put them both in the authors.csv file, but if I had only known the name strings for some or all of the authors, I would have had to put them in the author_strings.csv file, again along with the article Q IDs after the article record had been written.

The final product seems to have turned out according to the plan. The page is here.

If we examine the page history of the new page, we see that there were two edits. The two smaller, more recent ones were the two author edits and the first, larger edit was the one that created the original item.

What's next?

You can try using the config.json file to generate your own CSV headers and metadata description files if you want to try uploading a journal article yourself. Just make sure that it isn't already in Wikidata. You can also hack the config.json file to use different properties, qualifiers, and references for a project of your own. I do highly recommend that you try writing only a single item at first so that if things do not go according to plan, the problems can easily be fixed manually.

Although we have now set up spreadsheets and a metadata description JSON file that can write data to Wikidata, there is still too much manual work for this to be productive. In subsequent posts, I'll talk about how we can automate things we have thus far been doing by hand.

The diagram above shows the general workflow that I've been using in the various projects with which I've used the spreadsheet approach. We have basically been working backwards through that workflow, so in the next post I will talk about how we can use the Query Service to download existing data from Wikidata so that we don't duplicate any of the items, statements, or references that already exist in Wikidata.

The image above is from a presentation I gave in Feb 2021 describing the "big picture", rationale, and potential benefits of managing data in Wikidata using spreadsheets. You can view that video here.

Writing your own data to Wikidata using spreadsheets: Part 2 - editing the real Wikidata

2021-03-07T20:28:00.009-08:00

For a video walk-through of the previous blog post and this one, see this page.

In the previous post, I described how to create a Wikimedia bot password and use it to write spreadsheet data to the test Wikidata instance: https://test.wikidata.org/. The process required setting up a JSON metadata description file that mapped the CSV column headers to the RDF variant of the Wikibase data model. The VanderBot Python script used that mapping file to "understand" how to prepare the CSV data to be written to the Wikidata API. The script also recorded its interactions with the API by storing identifiers associated with the knowledge graph entities in the CSV along with the data.

This post will continue in the "do it yourself" vein of the previous post. In order to successfully complete the activities in this post, you must:

have a credentials plain text file (prepared in the last post)
have Python installed and know how to run a script from the command line (Python programming skills not required)
have downloaded the VanderBot script to a directory on your local drive where you plan to work.
understand that the edits you make are your responsibility just as if you had made them using the graphical interface. If you mess something up, you need to fix it -- mostly likely manually since VanderBot is designed to upload new data, not change or delete existing data.
have practiced on the test Wikidata instance enough to feel comfortable using VanderBot to make edits.

If any of these things are not true, then you need to go back and read the first blog post to prepare.

Options when running the script

In the last post, we practiced with VanderBot using all of the settings defaults. However, you may want to change some of those defaults depending you your situation. The most obvious change is to suppress the display of the giant blobs of response JSON from the API that fly up the screen as the script runs. You can redirect most of the output to a log file using the --log option. The log file will record the JSON output and at the end will include a summary of known errors that occurred throughout the writing process. (The same error report will be shown on the console screen, too.) You may choose to ignore the log file most of the time -- it will simply be overwritten the next time the script is run. However, it may be useful if the script terminates due to an error.

Most of the other options allow you to designate different file names or locations for the metadata description file and credentials file. It may be convenient to keep the credentials file in the same directory as the other files (the working directory option), but if you are using version control (e.g. GitHub) you should keep it elsewhere. You may wish to use different file names if you have multiple bot passwords or have different metadata description files for different CSVs.

The --update option is used to control whether the label and descriptions in the CSV will overwrite different values for existing items in Wikidata. It defaults to suppress updates, but we will talk later about when you might want to use a different option.

Options:

long form short values default

--log -L log filename, or path and appended none

filename

--json -J JSON metadata description filename "csv-metadata.json"

or path and appended filename

--credentials -C name of the credentials file "wikibase_credentials.txt"

--path -P credentials directory: "home", "home"

"working", or path with trailing "/"

--update -U "allow" or "suppress" automatic "suppress"

updates to labels and descriptions

Option examples:

Note: some installations of Python require using python3 instead of python in the command.

python vanderbot.py --json project-metadata.json --log ../log.txt

Metadata description file is called project-metadata.json and is in the current working directory. Progress and error logs saved to the file log.txt in the parent directory.

python vanderbot.py -P working -C wikidata-credentials.txt

Credentials file called wikidata-credentials.txt is in the current working directory.

python vanderbot.py --update allow -L update.log

Progress and error logs saved to the file update.log in the current working directory. Labels and descriptions of existing items in Wikidata are automatically replaced with local values if they differ. Notice that the long and short forms of the options can be mixed and are interchangeable.

Writing to the "real" Wikidata

Once you have set everything up, it is a simple matter to switch from writing from the test.wikidata.org API to the "real" www.wikidata.org API. All that is necessary is to change test to www in the first line of the configuration file:

endpointUrl=https://www.wikidata.org

The username and password lines can stay the same.

However, we cannot use the same CSV and metadata description files as before because the property and item IDs are different in the real Wikidata. We also don't yet want to create new items until we are comfortable with making several edits in the real Wikidata. Fortunately, there are several items in Wikidata that are designated as "sandbox" items, i.e. their metadata can be change to anything by anyone without consequence. They are generally lightly used, so you can edit them and still have time to examine what you have done before someone changes them to something else. The first sandbox item (Q4115189) is better known than the other two, so we will use sandbox items 2 and 3 in our practice.

Wikidata sandbox items:

Q ID Label

-------- ------------------

Q4115189 Wikidata Sandbox

Q13406268 Wikidata Sandbox 2

Q15397819 Wikidata Sandbox 3

With respect to etiquette regarding the sandbox items, I don't know that there are particular rules, but I would say that it would not be acceptable to change their labels, since that is the primary means by which users will know what they are. I would say that anything else, including descriptions and aliases is probably open to editing.

I would avoid adding a large number of statements to the sandbox items and then just leaving them, although a few edits probably don't matter. Probably the best thing to do after you are done playing with editing the sandbox items is to go the the View history page and undo your change (if you've only made one edit) or restore the last previous version before you started playing with the item if you've made a lot of changes. I'll show how to do that when we get to that point.

After you are comfortable playing with the sandbox items, we will try adding new real items.

Describing a CSV using simpler JSON format

We could use the web tool again to create a new metadata description file based on the real Wikidata properties, but I will tell you about another tool that you can use that requires many fewer button clicks. I created a simplified configuration file format that can be used to generate the standard metadata description file based on some rules about how to construct the column header names, assumptions about labels and descriptions, and one simplification of references. (The detailed specifications for the configuration file format are here.) The configuration file that we will be using has the default name config.json and can be viewed in this gist.

It is not necessary for you to edit this file to use it for the practice exercise. You can simply download it (right-click on the Raw button and select Save file as...). Download it into a directory you can access easily from your home folder -- you can use the same one you used last time, although it might get a bit cluttered.

If you understand JSON, the file structure will make sense to you. Even if you don't, you can probably copy and paste parts of it to change it to fit your needs. (It includes examples of most of the object types including two we haven't used before: monolingual text and quantity.) If you copy and paste, you will mostly need to be careful about placement of commas. Indentation is optional in JSON and is only used to make the structure more apparent.

For now, we can ignore the first three key:value pairs. The rest of the JSON after outfiles describes two CSV files that will be mapped by the metadata description file: artworks.csv and works_depicts.csv . artworks.csv contains data about statements involving 5 properties: P31 (instance of), P217 (inventory number), P1476 (title), P2048 (height), and P571 (inception). depicts.csv contains data about only one kind of statement: P180 (depicts). You may be wondering why I chose to put the depicts statements in a separate CSV. That is because all of the other properties will typically have only one value per item, while a particular artwork may depict several things. So in the first CSV, there will only be one row for each item, while in the second CSV there is an indefinite number of rows per item.

        {
          "pid": "P571",
          "variable": "inception",
          "value_type": "date",
          "qual": [
            {
              "pid": "P1319",
              "variable": "earliest_date",
              "value_type": "date"
            },
            {
              "pid": "P1326",
              "variable": "latest_date",
              "value_type": "date"
            }
          ],
          "ref": [
            {
              "pid": "P248",
              "variable": "statedIn",
              "value_type": "item"
            }
          ]
        }

Each property has a pid (property ID), a column header name (variable) and a value_type. The value_type will determine the details of the number of data columns needed to represent that kind of value and the kind of data that will be stored in those columns. Each property can also have zero or more qualifier properties and zero or more reference properties associated with it. In the snippet above, inception (P571) statements will have two associated qualifier (earliest date and latest date) properties and one reference property (stated in). The Wikibase model allows many references per statement, but this configuration file format restricts you to a single reference with as many properties as you want.

The structure of the qualifier and reference properties are the same as the statement properties (ID, variable, and value type) with the only restriction being that you must use properties that are appropriate for use in qualifiers or references.

    {
      "manage_descriptions": true,
      "label_description_language_list": [
        "en",
        "es"
      ],
      "output_file_name": "artworks.csv",
      "prop_list": [

The situation with labels and descriptions is a little more complicated. If you have more than one data table, you probably only really want to manage the labels and descriptions in one of the tables. In this case, it would make the most sense to manage them in the artworks.csv table, since it has a row for every item and the other table may have zero or more than one row per item. So the manage_description value for the first table is set to true.

    {
      "manage_descriptions": false,
      "label_description_language_list": [
        "en"
      ],
      "output_file_name": "works_depicts.csv",
      "prop_list": [

In the second table (works_depicts.csv) the manage_descriptions value is set to false. In that table, there will be a label column, but it will be set to be ignored during CSV processing and will only be to help humans understand what is in the rows. The label_description_language_list value contains a list of the ISO language codes for all languages to be included. If manage_description is set to true for a table, there will be both a label and description in the table for every language. If it is set to false for a table, there will only be a label for the default language. The default language of the suppressed output labels is set by the --lang option (see below). Any languages supplied in the JSON (as in the example above) will be ignored.

Generating the metadata description file and CSV headers

To generate the metadata description files and the CSV files, we need to download another script from GitHub called convert_json_to_metadata_schema.py . Download it into the same directory where you downloaded the config.json file. At the command line, run the following command if you used the default name config.json

python convert_json_to_metadata_schema.py

If you saved the input configuration file with a different name, or if you want a different name than csv-metadata.json to be used for the output metadata description file, use the following command line options:

long form short values default

--config -C input configuration file path config.json

--meta -M output metadata description fille path csv-metadata.json

--lang -L language of labels when output suppressed en

After you run the script, it will have generated the csv-metadata.json file and also variants of the two CSV files that were specified in the input config.json file: artworks.csv and works_depicts.csv . To prevent accidentally overwriting any existing data, the letter "h" is prepended to the file names of the generated CSVs (hartworks.csv and hworks_depicts.csv). So before you use the files, you need to delete the initial "h" from the file names.

The generated CSV files contain only the column headers with no data. But you can still open them with your spreadsheet software to look at them.

If you compare the columns in the created spreadsheet with the source JSON configuration file, you should see that the columns are in the order that they were designated in the JSON. The variable values are joined to any parent properties by underscores (e.g. earliest_date appended to inception to form inception_earliest_date). In cases where more than one column is required to describe a value node, the _nodeId, _val, etc. suffixes are added to the corresponding root column

Since we chose true as the value of manage_descriptions for the artworks.csv file, the generated spreadsheet includes both labels and descriptions for the two languages we designated. The script automatically prepends label_ and description_ to the language codes to generate the column headers.

For the second spreadsheet, works_depicts.csv, the value of manage_descriptions is false, so only labels are generated. Since the label_en column is only for local use to make the identity of the rows clearer, I only bothered to generate it as English. The value of the labels in this spreadsheet will be ignored by the API upload script.

Adding data to the CSV files

Since we are still testing, we won't create new items yet in the real Wikidata. Instead, we will add statements to two of the sandbox Wikidata items.

In the qid column of the artworks.csv file, add Q13406268 and Q15397819 to the first two rows after the header row. For purposes of keeping the row identities clear, I added Wikidata Sandbox 2 and Wikidata Sandbox 3 as label_en values for those rows, although since we will be using the default to suppress updating labels, these values will have no effect. I also chose to use Q3305213 (painting) and Q860861 (sculpture) as values of instance_of since the the CSV file is supposed to be about artworks.

If you want to see what other values I used in my test, you can look at or download this gist. You can use whatever values would amuse you as long as the types of the values match the types that are appropriate for properties specified in the configuration file. Leave all of the ID columns blank (those ending in _uuid, _hash, or _nodeId), since they will be filled in by the API upload script. For the dates, you can either use the abbreviated conventions discussed in the last post (in which case you MUST leave the _prec column empty). If you want to use dates that don't conform to those patterns (precisions less than year, BCI dates, or dates between 1 and 999 CE), you will need to use the long form values and provide an appropriate _prec value. See the VanderBot landing page for details.

If you looked at the configuration JSON carefully, you may have noticed that there were two new value types that we didn't see in the last post. The title property (P1476) has the type monolingualtext. Monolingual text values are required to have a language tag in addition to a provided string. Unfortunately, because of limitations of the W3C CSV2RDF Recommendation, the language tag has to be hard-coded in the metadata description file rather than being specified in the CSV. That's why the language is specified in the configuration JSON as the value of language for that property rather than as a column in the CSV table.

The other new value type is quantity. Like dates, quantities have value nodes that require two columns in the CSV table to be fully described. The _val column contains a decimal number and the _unit column should contain the Q ID for an item that is an appropriate measurement unit for the number (e.g. Q11573 for meter).

The second spreadsheet, works_depicts.csv, describes only one kind of statement, depicts (P180). It is intended to have multiple rows with the same qid, since a work can depict more than one thing. Since I described the Sandbox 2 item as a painting with title "Mickey Mouse house", I decided to say that it depicts Mickey and Minnie Mouse. You can set the depicts values to any item.

Writing the data

Before writing the data in the CSVs, open the pages for Sandbox 2 and Sandbox 3 so that you can see how they change when you write. Make sure that the two CSV files, the csv-metadata.json file you generated from config.json, and a copy of vanderbot.py are together in a directory that can easily be accessed from your home directory. Make sure that you removed the "h" from the beginning of the CSV filenames as well.

Open your console application (Terminal on Mac, Command Prompt in Windows), navigate to the directory where the files are, and run VanderBot. Unless you changed default file names and locations, you can just enter

python vanderbot.py

(or use python3 if your installation requires that). If you want to save the API response in a log file, specify its name using the --log (or -L) option, like this:

python vanderbot.py -L log.txt

When you run the script, you should see something like this (with logging to file):

There are two episodes of writing to the API, one for each of the CSVs. If you refresh the web pages for the two items, you should see the changes that you made.

Click on the View history link at the top of the Wikidata Sandbox 2 page. You will see the revision history for the page.

Notice that on Sandbox 2, there were three edits listed. Each line in a spreadsheet resulted in one write to the API. The first larger one (4997 bytes) was an update consisting of all of the statements made in the first line of artworks.csv . The two later and smaller ones were from the two single-statement depicts writes in the works_depicts.csv table.

It is not a requirement to get rid of all of your edits to the sandbox, but to avoid causing the sandbox items to be hopelessly cluttered, you should probably delete your edits. If you made only a single change to the page, you can just click the undo link after the edit. If you made several edits and your edits were the last ones, you can revert back to the last version before your changes by clicking on the restore link after the last edit that was made prior to yours.

The restore dialog will show you all of the changes you made so that you can review them before committing to the restore. Give a summary, and click Publish changes.

Changing labels

VanderBot handles labels and descriptions differently from statements and references.

Adding statements or references is controlled by the presence or absence of an identifier corresponding to the column(s) representing the statement or reference in the spreadsheet. The statements or references are only written if their corresponding identifier cell is empty. If you examine the CSVs after their data have been written to the API, you will see that identifiers have been added for all of the columns that contain statement values. That means that if you run the script again, nothing will happen because VanderBot will ignore the values -- they all have assigned identifiers.

The behavior of labels and descriptions is different. When a new item is created, any labels or descriptions that are present will be added to the item. However, VanderBot will NOT make any edits to labels or descriptions of existing items unless the the --update (or -U) option is set to allow when the script is run. If updating is allowed, the existing labels and descriptions will be changed to whatever is present in the spreadsheet for that item. (The exception to this is when a label or description cell is empty. Empty cells will not result in deleting the label or description.)

From the screenshot above, you can see that at the start of this experiment, Sandbox 2 had no descriptions in either Spanish or English. Also, the Spanish label isn't actually in Spanish. I'm going to use VanderBot to change that.

To make the changes, I started with the artworks.csv spreadsheet after my last edits. I deleted the line for Sandbox 3 since I didn't want to mess with it. I first made sure that my English label was exactly the same as the existing label so that it won't be changed. Then I added a Spanish label, and English and Spanish descriptions. I left the rest of the row the way it was since none of those statements would be written since they all had IDs.

The following command will write the labels and log to a file:

python vanderbot.py -L log.txt --update allow

After the script finishes, checking the log file shows the changes made.

Since the English label was identical, there were no changes to it. The log also shows that there were no changes in the other CSV.

Checking the web page shows

Since I'm done with the test, I'm going to delete the descriptions, but the Spanish label isn't any worse than what was there before, so I'll leave it. Checking the history, I can see that all of the labels and descriptions were changed in a single API write, so I'll have to delete them manually if I want to leave the Spanish label -- I can't undo the description changes without also undoing the new Spanish label.

The take-home message from this section is that you need to make sure that the existing labels and descriptions in the CSV match what is in Wikidata when label/description updates are allowed (unless you actually want to change them). This is particularly an issue if your data table is stale because you are coming back to work on it at a much later time after you initially wrote the data. If in the intervening time other users have improved the quality of the data by changing labels and descriptions, you would essentially be reverting their changes back to a worse state. That would be really irritating to someone who put in work to make the improvements. I will talk about strategies to avoid this problem in a later post.

Creating new items

At this point, you are hopefully comfortable with VanderBot enough to create or edit real items in the real Wikidata. For now, let's stick with creating new items since editing existing items has the issue of avoiding creating duplicate statements. We will address problem that in a future post.

There are several issues that you should consider before creating new items for testing. One is that you really should only create items that meet some minimal standard of notability. The actual notability requirements for Wikidata are so minimal that you could theoretically create items about almost anything. But as a practical matter, we really shouldn't just create junk items that don't have some relatively useful purpose. One type of item that seems to be relatively "safe" is university faculty, since they generally have the potential to be authors of academic works that could potentially be references for Wikipedia articles. When I'm testing VanderBot, I often add faculty from my alma mater, Bluffton University, since none of them were in Wikidata until I started adding them.

The second issue is that you should create items that have enough information that the item can actually be unambiguously identified. There are several really irritating categories of items that have been added to Wikidata without sufficient information. There are thousands of "Ming Dynasty person" and "Peerage person" items that have little but a name attached to them. They are pointless and just make it harder to find other useful items with similar labels. So, for example, if you add faculty to Wikidata, at a minimum you should include their university affiliation and field of work.

The third issue is that you should make sure that you are not creating a duplicate item. In a future post I will talk about strategies for computer-assisted disambiguation. But for now, just typing the label in the search box in the Wikidata is the easiest way to avoid duplication. Try typing it with and without middle initials and also with and without periods after the initials to make sure you tried every permutation.

Configuring the properties

If you want to try my strategy of practicing by creating faculty records, you can start with this template configuration file. It contains the obligatory instance of (P31) that should be provided for every item, and sex or gender (P21), which despite its issues is probably the most widespread property assigned to humans. I did not provide reference fields for those two properties since they are commonly given without references. The other two properties are probably the minimal properties that should be supplied for faculty: employer (P108) and field of work (P101). One reason why I chose these two properties is because they can both be determined easily from a single source, the Bluffton University faculty web listing. The statements for these two properties should definitely have references.

I've done some querying to try to discover what the most commonly used properties are for references. A key reference property is retrieved (P813). All references should probably have this property. The other property is usually an indication of the source of the reference. Commonly used source properties are: reference URL (P854, used for web pages), stated in (P248, used when the source is a described item in Wikidata with a Q ID value), and Wikimedia import URL (P4656, used when the data have been retrieved from another Wikimedia project with a URL value). Unless you are working specifically on a project to move data from another project like Wikipedia to Wikidata, the first two are the ones you are most likely to use. Since all of my data are coming from a web page, I'm using P854 and P813 as the reference properties for both of the statement types that have references.

Use the convert_json_to_metadata_schema.py Python script and the config.json file you downloaded to generate the metadata description file csv-metadata.json and the faculty.csv CSV file (with header row only).

Adding the data

I chose two of the faculty from the web page list and pasted their names into the Wikidata search box to make sure they weren't already existing items. I then added their names to the label_en column of faculty.csv and described them in the description_en column. See this gist for the examples. Q5 is the value for instance_of for all humans. Sex or gender options are given on the P21 property page. All of the faculty work at Bluffton University (Q886141). The trickiest value was their field of work, which I had to determine using the Wikidata search box.

I was then able to copy and paste the URL for the faculty web listing, https://www.bluffton.edu/catalog/officers/faculty.aspx, into the reference URL columns and today's date, 2021-03-07, into all of the _retrieved_val columns. I then saved the file.

Writing the data to the API

Note: what would happen if you tried to use my example CSV file to write to Wikidata without changing its contents? Before the VanderBot script tries to write a new record to the real Wikidata, it checks the Wikidata Query Service to see if there are already any items with exactly the same labels and descriptions in any language in that row. If it finds a match, it logs an error and goes on to the next row. So since those items were already created by me, VanderBot will do nothing as long as no one has changed either the label or description for the two example items. If a label or description for either of them has been changed since I created the items, then the API will create duplicate items that will need to be merged later. So don't try running the script with my unmodified example files unless you first check that the labels and descriptions are still exactly the same on the items' Wikidata item pages.

I ran the vanderbot.py script with logging to a text file.

When writing the statements, the two rows were identified as new records. When the rows were later checked for any new unwritten references, the Q IDs were already known since they had been reported in the API response.

To see how the faculty.csv file looked after its data were written to the API, see this gist.

I could check for the new item pages in Wikidata by either searching for the faculty names or by directly using the two new Q IDs.

The new page contains all of the data from the CSV table in the appropriate place!

Although for only two items this work flow probably took longer than just creating the records by hand, it doesn't take many more items to make this process much faster, particularly if references are added (and they should be!). Adding references requires many button clicks on the graphical web interface, but because the same reference can be added to many rows of the spreadsheet with a single copy and paste, it is very efficient to add references using the VanderBot script.

What's next?

In the next post, I'll talk about how you can determine what properties are most commonly used for various types of items. This is important information when you are planning your own projects that involve adding a lot of data to Wikidata.

Writing your own data to Wikidata using spreadsheets: Part 1 - test.wikidata.org

2021-03-01T10:08:00.009-08:00

Warning: this blog post involves extreme hand-holding. If that irritates you and you want to try to figure out how to use VanderBot on your own without hand-holding, you can go straight to the VanderBot landing page and look at the very abbreviated instructions there. However, make sure that you understand your responsibilities as a Wikidata user. If they are unclear to you, read the "Responsibility and good citizenship" section below.

On the other hand, if you love extreme hand-holding, there is a series of videos that will essentially walk you through the steps in this post.

It has been almost a year since I last wrote about my efforts to write to Wikidata using Python scripts. At that time, I was using a bespoke set of scripts for a very specific purpose: to create or upgrade items in Wikidata about researchers and scholars at Vanderbilt University. I was feeling pretty smug that I actually got the scripts to work, but at that point the scripts were pretty idiosyncratic. They were limited to a particular type of item (people), supported a restricted subset of property types, and used a particular spreadsheet mapping schema that wasn't easily modified.

Since that time, I have been working to adapt those scripts to be more broadly usable and have been testing them on several other projects: WikiProject Vanderbilt Fine Arts Gallery and WikiProject Art in the Christian Tradition (ACT), and several smaller ones. The scripts and my ability to explain how to use them have now evolved to the point where I feel like they could be used by others. The goal of this series is to make it possible for you to try them out in a do-it-yourself manner.

Background

This series of posts will not dwell on the conceptual and technical details except where necessary for you to make the scripts work. For those interested in more details, I refer you to previous things I've written:

blog post dealing with the minutiae of Wikibase, the Wikimedia API, and authentication (June 2019)
blog post describing how to retrieve data from Wikidata from the Query Service using HTTP and SPARQL (May 2019)
blog post giving a general overview of the paradigm of writing to Wikidata using spreadsheets (February 2020)
blog post with an overview of the Wikibase model and associated identifiers (February 2020)
web page with somewhat overlapping overview of the Wikibase model but details about property labels (2019, revised 2020)
blog post with a very brief overview of using the W3C CSV2RDF Recommendation to map spreadsheets to the Wikibase model and a discussion of issues related to timing of interactions with the API (February 2020)
submitted manuscript with very technical description of using the W3C CSV2RDF Recommendation to map spreadsheets to the Wikibase model (submitted to Semantic Web Journal in December 2020, revised version submitted June 2021)
blog post with overview of the workflow for the Vanderbilt scholar and researcher Wikidata project (February 2020)
video of presentation at the 2020 LD4 Conference on Linked Data in Libraries: VanderBot: Using a Python script to create and update researcher items in Wikidata (July 2020)
video of presentation to the Program for Cooperative Cataloging Wikidata Pilot group: VanderBot: A spreadsheet-based system for creating and updating items in Wikidata (February 2021)

It is not necessary to refer to any of this material in order to try out the system. But those interested in the technical details may find the links helpful.

Do I want to try this?

Before going any further, you should assess whether it is worth your time trying this out.

Requirements:

You need to know how to use the command line to navigate around directories and run a program. See this page for Mac or this page for Windows if you don't know how to open a console and issue basic commands. In particular, read the section on "Running a program using the command line".
You need to have Python installed on your computer so that you can run it from the command line. See this page for installation instructions. You do NOT need to know how to program in Python. I believe that the only module used in the script that is not part of the standard library is requests, so you may need to install that if you haven't already.
You need to have an application to open, edit, and save CSV files. The recommended application is LibreOffice Calc. Other alternatives are OpenOffice Calc and Excel, but there are situations where you can run into problems with either of them. For information on CSV spreadsheets and how to save them in Excel, see this video. For a deeper dive and description of the problems with Excel and OpenOffice Calc, see the first video in this lesson and the screenshots after the second video.
You must have a Wikimedia user account. The same user account is used across Wikimedia platforms, including Wikipedia, Wikidata, and Commons, so if you have an account any of those places you can use it here.
You need to be familiar with the Wikidata graphical editing interface. I assume that every reader has already done enough editing to understand the important features of the Wikibase model (items, properties, statements, qualifiers, and references) and how they are related to each other. They will not be explained in this post, so if you don't already have experience exploring these features using the graphical interface, you are probably not adequately equipped to continue with this exercise.

Other alternatives you should consider

There are a number of good alternatives to using the VanderBot scripts to write to Wikidata. They are:

Use the graphical interface at https://www.wikidata.org/ to edit items manually. Advantage: very easy to use and robust. Disadvantage: slow and labor intensive.
Use QuickStatements (https://www.wikidata.org/wiki/Help:QuickStatements). Advantage: very easy to use and robust, particularly when used as an integrated part of other tools like Scholia "missing". Disadvantage: there is a learning curve for constructing the input files from scratch. Users who aren't familiar with how CSV files work may find it confusing.
Use the Wikidata plugin with OpenRefine (https://www.wikidata.org/wiki/Wikidata:Tools/OpenRefine). Advantage: powerful and full-featured, I believe that some scripting is possible to the extent that scripting is generally possible in OpenRefine using GREL. Disadvantage: requires skill with using OpenRefine plus an additional learning curve for figuring out how to make the Wikidata plugin work, not sure whether it is possible to integrate OpenRefine with command line-based workflows involving other applications.
Use PyWikiBot (https://pypi.org/project/pywikibot/) or Wikidataintegrator (https://github.com/SuLab/WikidataIntegrator). Many (most?) Wikidata bots are built using one of these two Python frameworks. Advantage: powerful, full featured, robust. Disadvantage: you need to be a relatively experienced Python coder who understands object oriented programming with Python to use these libraries. There are a number of bot-building tutorials for PyWikiBot online. However, when starting out, I found the proliferation of materials on the subject confusing and for both of these platforms, when I couldn't get things to work, the libraries were so complex I couldn't figure out what was going on. Professional developers would probably not have that problem, but as someone who is self-taught, I was confused.

Factors that might make using VanderBot right for you

If any of the following situations apply to you, VanderBot might be useful for you:

Your data are already in spreadsheets or can be exported as spreadsheets and you would like to keep the data in spreadsheets for future reference, ingest, or editing using off-the-shelf applications like Libre Office or Excel.
You want to keep humans in the Wikidata data entry loop for quality assurance, but want to increase the speed at which edits can be made.
Over time, you are interested in keeping versioned snapshots of the data that you have written in a format that is suitable for archival preservation (CSV).
You are interested in comparing what is currently in Wikidata with what you put into Wikidata to discover beneficial information added by the community or to detect vandalism by bad actors.
You want to develop a workflow based on command-line tools that can be scheduled and monitored by humans.

The last of these two features are not yet fully developed, but I'm trying to design the scripts I'm writing to make them possible in the future.

If one or more of these factors applies to you and the other existing tools don't seem better suited for your purposes, then let's get started.

Responsibility and good citizenship

This "lesson" involves using the VanderBot API uploading script to write data to the test Wikidata API (application programming interface). In order to do that, you will need to create a bot password, but not a separate bot account. So let's clarify exactly what that means.

User account and bot password

When you create a bot password, you should be logged in under your umbrella Wikimedia account. That account applies across the entire Wikimedia universe: Wikipedia, Wikidata, Commons, and other Wikimedia projects. The bot password you create allows you to automate your interactions by using any Wikimedia API, but the edits that you make will be logged to your user account. That means that you bear the same responsibility for the edits that you make using the bot password as you would if you made them using the graphical interface or QuickStatements. Edits that you make using the bot password will show up in the page history just as if you had made them manually. If you make a mess using the bot password, you are responsible for cleaning it up just as you would be if you made errors using any other editing method. The whole point of scripting is to allow you to do things faster and easier, but the down side of that is that you can also make mistakes faster and easier as well.

Because of the potential for disaster, we will start by using Wikidata's test instance: https://test.wikidata.org/ . It behaves exactly like the "real" Wikidata, except that the items and properties there do not necessarily correspond to anything real. If you make a mess in test.wikidata.org, you do NOT have to clean it up -- that's the whole point of it. So it is a place we can experiment without risk and once we feel comfortable, we can easily move to using the "real" Wikidata.

Distinction between a User-Agent and user account

In the Wikimedia world, typically when one creates an autonomous bot (one that works without human intervention), a separate bot user account is created. That account is used with a particular application (script, program) that carries out the bot's defined task. However, VanderBot is not an autonomous bot and has no particular defined task. It is a general-purpose script that can be used by any account to make human-mediated edits. So we need to draw a distinction between the application (technical term: User-Agent) and the user account. There actually is a user account called VanderBot (https://www.wikidata.org/wiki/User:VanderBot). It is operated by me and it shows up as the user who made the edits when I use it with the API-writing script. But you can't use it because you don't have the account credentials -- edits that you make will be made under your own user account. On the other hand, regardless of the user account responsible for the edits, the VanderBot Python script will identify itself to the API as the software that is mediating the interaction between you and the API. Software that manages communications between a user and a server is called a User-Agent.

You can think of this situation as similar to the difference between your web browser and you. Your web browser is not responsible for the actions that you take with it. If you use Firefox to write the world's best Wikipedia article, the Mozilla Foundation that created Firefox doesn't get credit for that. If you use Chrome to buy drugs or organize an assassination, Google, which created Chrome, does not take responsibility for that. On the other hand, if your browser has a bug that causes it to repeatedly hit a website and create a denial of service problem for a web server, the website may use the User-Agent identification for the browser to either block the browser or to contact the browser's developer to ask them to fix the bug.

VanderBot, the User-Agent, has features that prevent it from doing "bad" things to the API, like making requests too fast or not backing off when the server says it's too busy. As the programmer, I'm responsible for those features. I am not responsible if you write bad statements, create duplicate items, or overwrite correct labels and descriptions with stupid ones. Those mistakes will be credited to your user account. On the other hand, if you significantly modify the VanderBot API-writing script (which you are allowed to do under its GNU General Public License v3.0), then you should change the value of its user_agent_header variable with your own URL and email address, particularly if you mess with its "good citizen" features and settings.

Do you need a bot user account and flag?

Wikidata has a bot policy, which you can read about here. However, that policy defines bots as "tools used to make edits without the necessity of human decision-making". By that definition, VanderBot is not technically a bot since its edits are under human supervision (it's not autonomous). That's good, because it means that you can use it without going through any bot approval process, just as if you had used QuickStatements or OpenRefine to make edits.

However, not having bot approval also places rate limitations on interactions with the API. User accounts without "bot flags" (granted after successfully completing the approval process) are limited to 50 writes per minute. Writing at a faster speed without a bot flag will cause the API to block your IP address. This is the primary limitation on the speed of writing data with VanderBot and a delay (api_sleep) is hard-coded in the script.

Note that bot approval and bot flags are granted to accounts, not User-Agents. If you use the VanderBot API-writing script along with other scripts as part of a defined automated process, you should set up a separate bot user account. You could then get a bot flag for that particular account and purpose, and remove the speed limitation from the script. I don't know if VanderBot is actually ready for that kind of use at this point. So you're on your own there.

The short answer to the overall question of whether you need a separate bot account is usually "no".

Generating a bot password

If you understand your responsibilities and have decided that experimenting with VanderBot is worth your time, let's get started on the DIY part. Because the bot password you create can be used across Wikimedia sites, I will illustrate the password creation process at https://test.wikidata.org/ since that is where we will first use it.

The test Wikidata instance looks similar to the regular one, except that the logo in the upper left is in monochrome rather than color. The functionality is identical. Click on the Special pages link in the left pane.

On the Special pages page, click on the Bot passwords link in the Users and rights section.

On the Bot passwords page, enter a name for the bot password. It is conventional to include "bot" or "Bot" somewhere in the name of a bot. However, since this password is actually going to be associated with your own user account and not a special bot account, including "bot" in the name is not that important. In the past when bot passwords were only associated with particular Wikimedia sites, it was more important to have mnemonic names to keep bots for different sites straight. However, since you can use the same bot across sites, this is no longer important.

There are actually two reasons why you might want to use multiple, differently-named bot passwords. One is that different passwords can have different scope restrictions (see next step). So one password might only be able to perform certain actions, while another might be less restricted. The other reason is that if a particular password is being used "in production", you might want to have another one for testing. In the event you accidentally expose the credentials for the testing bot password, you could revoke those credentials without affecting the production bot password. However, for most purposes it would probably make sense to just have the "bot name" be the same as your username.

After entering the name, click the Create button.

On the next page, select the rights that you want to grant to this bot password. I think the important ones are Edit existing pages, Create, edit, and move pages, and Delete pages, revisions, and log entries. However, just in case, I also selected High-volume editing, and View deleted files and pages as well. Leave the rest of the options at their defaults and click the Create button.

The resulting page will give you the username and passwords that you will need to write to the API. There are two variants: one where the bot name is appended to your username by @, and another where the username is used alone and the bot name is prepended to the password by @. We will use the first variant (username@botname).

You need to create a plain text file that contains the username and password. To do this, you should use a text or code editor and NOT a word processor like Microsoft Word. Your computer should have a built-in text editor (TextEdit for Mac, Notepad for Windows). If you don't know what text and code editor are, see the first three videos on this page. If you are using a Mac, the second video explains how to ensure that TextEdit saves your file as plain text rather as rich text (which will cause an error in our situation) and to ensure that files are opened and closed using UTF-8 character encoding.

Open a new document in the text editor. Create three lines of text similar to this:

endpointUrl=https://test.wikidata.org

username=User@bot

password=465jli90dslhgoiuhsaoi9s0sj5ki3lo

Be careful, since mistyping any character will cause VanderBot to not work. It's best to copy and paste rather than to try to type the credentials. (These are fake credentials, so you can't actually use them -- use your own username and password.) Do not leave a space between the equals sign (=) and the other characters. The first line specifies that we are going to use the test.wikidata.org API, so you can copy it exactly as written above. The username is the login name that includes the @ symbol (Baskaufs@BaskaufTestBot in the example above). The password is the password version that does not have the @ symbol in it. Double check that when you copied the username and password, you did not leave any characters off. Also, put the cursor at the end of each line and make sure that there are no trailing spaces after the text on the line. It does not matter whether the last line is followed with a newline (hard return) or not.

When you have entered the text, save the file as wikibase_credentials.txt in your home directory. In the next post, we will see how to use a different name or to change the location to somewhere else. Make sure that there is an underscore between "wikibase" and "credentials", not a dash or a space. If you do not know what your home directory is, or where it is located on your computer, see the Special directories in Windows section of this page or the Special directories on Mac section of this page. In Finder on a Mac, you can select Home from the Go menu to get there. In Windows File Explorer, start at the c: drive, then navigate to the Users folder. Your user folder will will be within the Users folder and have the same name as your username on the computer.

Preparing the metadata description file and CSV headers

The VanderBot API upload script uses CSV files as its data source. Each row in the table represents data about an item. The columns of the table represent various aspects of metadata about the items, such as statements, qualifiers, and references. In order to transfer the data from the CSV to the Wikidata API, the columns of the CSV spreadsheet need to be mapped to the Wikibase data model (the model used by Wikidata). Since the Wikibase model can be represented as RDF, the W3C Generating RDF from Tabular Data on the Web Recommendation can be used to systematically map the CSV columns to the Wikibase model. The VanderBot script uses that mapping to determine how to construct the JSON required to transfer the CSV data to the Wikidata API.

Initially, I constructed the mapping file (known as the CSV's "metadata description file") by hand while referring to the W3C Recommendation and its examples. However, it is extremely difficult to build the mapping file by hand without making errors that were difficult to detect. Fortunately, my collaborator, Jessie Baskauf, created a web tool that allows a user to construct the mapping file using drop-downs that are organized in a structure that reflects that of the Wikidata graphical user interface. We will use that tool to create both the mapping file and the CSV header whose field names correspond to those used in the mapping file.

The tool itself can be accessed online from this link. The Javascript that runs the tool runs entirely within the web browser, so it can be used offline by downloading the HTML, CSS, and Javascript files from GitHub into the same directory, then opening the HTML file in a browser.

On the tool page, leave the Wikidata ID field at its default, qid. Use the Add label and Add description buttons to enter the names of each of those fields. I have been using the convention labelEn, labelDe, descriptionEs, etc. where I use lower camelCase and append the language code. However, you can use any name that makes sense to you. Select the appropriate language codes from the dropdown.

One thing to note is that there is no correspondence between property and item identifiers in test.wikidata.org (the test Wikidata implementation) and www.wikidata.org (the real Wikidata). So before we can add properties and item values of those properties, we need to look in the test Wikidata site to find properties and items that we want to play with. From the https://test.wikidata.org/ landing page, select Special pages from the left pane as you did before.

Near the bottom of the Special pages page in the Wikibase section, click on the List of Properties link. In the real Wikidata instance, creation of properties is controlled by a community process. In the test Wikidata instance anyone can create, change, or delete properties. So although the properties used in this example may still be the same when you do this exercise, they also may have changed. Since we are practicing, you can substitute any other similar property for the ones shown in the examples. We want to chose a couple of properties that have different kinds of values in order to see how that affects the mapping file and CSV headers. So we are looking for a property that has an Item value and one that has a Point in time value.

I picked P17 (country) and P18 (Date of birth) to use in the practice example. Clicking on the links shows that P17 has an Item value and P18 has a Point in time value.

There are not necessarily items in the test Wikidata instance that correspond to those in the real Wikidata, so I searched for some countries to use as values of P17 in the test. I found Q346 (France) and Q53079 (Mexico). You can find your own, or create new items to use if you want.

I also wanted to select a property to use for a qualifier and another one to use for a reference. In the real Wikidata instance, many properties have constraints that indicate whether they are suitable to be used as properties in statements, qualifiers, or references. In the test Wikidata instance, most properties don't have any constraints. So I just picked a couple that seemed to make sense. I chose P87 (start date, having a Point in time value) as a qualifier property for P17 (country). (What does that mean? I don't know and it doesn't matter -- this is just a test.) I chose P93 (reference URL, having a URL value) as a reference property for P18 (Date of birth). Here is a summary of my chosen entities:

P17 country (Item value, used as a statement property)

P87 start date (Point in time value, used as a qualifier property for P17)

Q346 France (Item, used as a value for P17)

Q53079 Mexico (Item, used as a value for P17)

P18 Date of birth (Point in time value, used as a statement property)

P93 reference URL (URL value, used as a reference property)

Using the buttons and drop-downs, I selected the properties listed above on the web tool.

The field names that you choose for the properties used in statements can be whatever you want. It is best to keep them short and do NOT use spaces. If you use multi-word names, I recommend lower camelCase, since dashes may cause problems later on and underscores are used by the tool to indicate the hierarchy of qualifier and reference properties. The fields ending in _uuid and _hash are for statement and reference identifier fields and you should leave them at their defaults. When you create statement properties, by default the tool prefixes qualifier and reference properties their parent statement property names followed by an underscore. You can change these to shorten them if you want, but it's probably best to leave them at their defaults since when CSVs have many columns it becomes difficult to remember the structure without the prefixes.

A statement can have multiple qualifier properties, but a statement can have both multiple references and multiple properties within a reference. For simplicity's sake, I recommend sticking with a single reference having one or more properties.

Using the drop-downs, be sure to select a value type that is appropriate for the property. There is no quality control here at the point of the tool, but an error will be generated when writing to the API if the selected value type does not match with the value type specified for the property on the property page of the test Wikidata instance.

After you have entered all of the property information, scroll to the bottom and enter a filename in the box. Click on the Create CSV button. At this time, the script isn't sophisticated enough to actually generate the CSV file. (That is a possible future feature.) Rather it generates the header line for the CSV as raw text. Click the Copy to clipboard button, then open a new file using the same text editor that you used to generate the credentials file.

Paste the copied text into the new file window.

Select Save or Save As... from an appropriate menu on your editor. The exact appearance of the dialog window will depend on your editor. The screenshot above is for TextEdit on a Mac. Be sure that you use exactly the same file name as you entered in the filename box in the web tool, with a .csv file extension. If your editor gives you a choice of text encoding, be sure to choose UTF-8. The directory into which you save the CSV file will be the one from which you will be running the upload script using the command line. So it is best to save it in some folder that is a subfolder of your home folder. Generally, Downloads, Documents, and Desktop are directly below the home folder, so if you use a subfolder of one of those folders, you should be able to navigate to that folder easily using the command line.

Now click the Create JSON button.

The metadata description JSON for the CSV file columns that you set up will be generated on the screen below the button. Click the Copy to clipboard button. Open a new file in the text editor that you used before and paste the copied text into it. Save the file using the name csv-metadata.json in the same directory where you saved the CSV file.

I like to paste the JSON into my favorite code editor (VS Code) because it will validate the JSON and display it using syntax highlighting, but that isn't really any better than using a vanilla text editor.

Preparing data to create new items

Now we will open the CSV file to add the data that we want to write to the test Wikidata instance. For this practice exercise, you can use Excel to edit the CSV if that is all you have, but if you are serious about using this system in the future, I highly recommend downloading and installing LibreOffice and using its Calc application to open and edit CSVs. I explain the reasons for this in the Skills required section at the top of this post. You can probably just double-click on the file in your file handling system (Finder on Mac or File Explorer in Windows), but if that doesn't work, open your spreadsheet application and open the file via Open in the File menu.

When you open the CSV file, it should appear as a spreadsheet with the column names in the order that you created them with the web tool with empty rows below. You can now add data in the rows below the header.

The screenshot of my example above is too small to easily see, but you can get a better look at it by going to this GitHub gist. You must use different labels and descriptions from the ones I used because if you use the same ones, the API will not allow them to be written (more details about this in the next blog post). As values for the country column, you can use the Q ID of any item. (I used Q53079 for Mexico). Notice that the birthDate column does not have a prefix, indicating that it is a statement property and not a child property of something else. The startDate column was prefixed with country_ by the web tool. That prefix and its position following the country column are clues that this column is a qualifier for the country column. The refUri column was prefixed by the web tool with birthDate_ and ref1_, indicating that it is a property of the first reference for the birthDate statement. Because the value type of the birthDate_ref1_refUrl column is URL, it must be a valid IRI starting with either http:// or https://.

The two date fields are a more complicated type. Dates, globe coordinates, and quantities are complex data types that cannot be represented by single fields. In the case of dates, they require one column field for the date string and another column filed to indicate the precision of the date (e.g. to year, to month, to century, etc.). There is a somewhat complicated system for representing dates in the Wikibase model (see this page for details). Fortunately, the VanderBot script will automatically convert dates that are formatted according to its conventions into the format required by the API. Those conventions are:

character pattern example precision

----------------- ------- ---------

YYYY 1885 to year

YYYY-MM 2020-03 to month

YYYY-MM-DD 2001-09-11 to day

In the example spreadsheet, the country_startDate_val date value for the second item has precision to month, while the birthDate_val date values have precision to day.

The dates should be placed in the corresponding column with name ending in _val. The script knows that it should make the conversion when the corresponding column with name ending in _prec is empty. If the year has fewer than four digits, is BCE (a negative number), or has a precision lower than year (century, millennium, etc.), then a date string and precision integer properly formatted according to the Wikibase model must be provided explicitly. The script only provides minimal format checking (for the correct number of characters), so dates that are otherwise incorrectly formatted will result in an error that prevents the record from being written to the API.

You should also notice that the example spreadsheet has a number of empty columns. These columns will contain identifiers for the various entities described by the data columns. For example, the qid column will contain the identifier for the item. The country_uuid and birthDate_uuid columns will contain the identifiers for the country and birthDate statements. The birthDate_ref1_hash will contain the identifier for the first reference for birthDate, which contains a reference URL. In all of these cases, the Wikidata API will assign those identifiers when the various entities are created and they will be recorded in the CSV file immediately after the item has been created. The VanderBot script uses the presence or absence of these identifiers to know whether the particular identified entity exists and therefore whether it needs to be written to the API or not.

The situation with the two date columns whose names end in _nodeId is complicated. For technical reasons that I don't want to get into in this post, the node ID values are not assigned by the API, but rather are generated by VanderBot at the point of processing the dates. This is true for all of the properties with node value types (dates, globe coordinates, and quantities). All you need to know is that you should leave the columns ending in _nodeId blank and that the sets of three date-related columns that have the same first part (country_startDate_nodeId, country_startDate_val, and country_startDate_prec; birthDate_nodeId, birthDate_val, and birthDate_prec) represent complex values that can't be represented by a single column.

Note that I did not fill in every cell in the table that could contain values. I did that because in a later step we will practice adding values to the item statements and references after the items have already been created.

Be sure to close the CSV file before continuing to the next step. Failure to close the CSV will have different effects depending on the spreadsheet program you are using. I believe that both Excel and Open Office Calc place a lock on the file so that when the VanderBot script tries to write the API responses to the CSV file, it generates an error and crashes the script. Libre Office Calc will allow the changes to be written to the CSV file, but they will not show up unless the file is closed and re-opened. Libre Office Calc will warn you if you try to save an open file if it has been changed by the script while the file was open. In that case, close the file without saving and re-open it to see the changes.

Creating new items using the API

The last thing you need to have to actually write data to the API is the VanderBot Python script itself. Go to the code page on GitHub. Right-click on the Raw button in the upper right of the page. Select Save Link As..., navigate to the directory where you saved the csv-metadata.json file and the CSV file that you edited, and save the the vanderbot.py script there.

If you have not previously installed the requests library, you may need to do that before you can run the script. If you have Anaconda installed on your computer, requests may already be installed. If you aren't sure, just try running the script as described below. If you get an error message saying that Python doesn't know about requests, then try entering:

pip install requests

If that doesn't work, try

pip3 install requests

If you use some non-standard package manager like brew or conda, install requests by whatever means you normally install packages.

Open the appropriate console program for your operating system (probably Terminal for Mac or Command Prompt for Windows). Use the cd command to navigate to the directory where you saved the file, then list the files to make sure you are in the right place and the files are all there (ls for Mac Linux or dir for Windows DOS).

Depending on how you set up Python the command to run the script will probably either be

python vanderbot.py

python3 vanderbot.py

If things work correctly, the console should show the progress of writing to the API.

The first part of the output shows how VanderBot is interpreting the columns of the CSV based on the information from the csv-metadata.json column-mapping file. Then there is an indication that dates have been converted to the form required by the Wikibase model. As the script writes each row to the API, it displays the response of the API. The contents of the response don't matter as long as the end of the response contains "success". After writing statements for each row, the script then checks whether there were any existing statements with added references. Since there were none, nothing was reported. Finally, there is a report of any errors that occurred that prevented particular rows from being written. Not every possible type of error is trapped and some will result in the script terminating before finishing all of the rows of the CSV. In that situation, the last response from the API may give clues about what went wrong. All information about identifiers received from the API prior to termination of the script should be saved in the CSV file, so once the error is fixed, you can just run the script again to try again to write the problematic line.

If you re-open the CSV file, you should see results similar to this gist. All of the identifier columns in the table that are associated with value columns have now been filled in, indicating that those data now exist on Wikidata. Notice also that the dates have been converted into the more complicated format.

If you search the test.wikidata.org site, you should see the record for the new item that you created. Because the birthDate_ref1_refUrl column had a value, a reference was created for the Date of birth statement.

Because in the second row the country property column was followed by value columns for country_startDate that contained data, the country statement on the web page for the item displays a start date qualifier. The country_startDate_val column contained a value in the form 1986-02, so a precision of 10 (to month) was placed in the country_startDate_prec column of the table and therefore only the month is shown on the web page. In contrast, the birthDate_val column was given a value of 1982-02-03, so it was assigned a precision of 11 (to day) and the day is displayed on the web page for the date of birth statement.

Editing existing items using the API

We can add information to the two new items that we just created by filling in parts of the CSV that we left blank before. In this gist, I added a country value (Q346, France) for item Q214621 and I added a reference value in the birthDate_ref1_refUrl for the birth date statement, which already existed, but did not previously have any references. After making sure that I closed the CSV file in the spreadsheet program, I ran the VanderBot script again.

In the first section of the output, the script detected that it needed to add a statement to an existing item (there was already a value in the qid column, but there was a value in the country column without a corresponding identifier value in the country_uuid column). It found no statements to add in the second row, so it did nothing.

When it went through each row looking for new references for existing statements, it found one for the birth date reference URL column for the second record (there was an identifier in the birthDate_uuid column, but no identifier in the birthDate_ref1_hash column). It attempted to write the new data to the API, but an interesting thing happened. The server was too busy, and sent a message back to the script that it should wait a while and try again. In general, the script will keep trying with an increasing delay of up to 5 minutes, giving up after 10 tries. In this particular case, on the second retry the server was no longer too busy and the reference was successfully added.

When I reload the Juan Jose Garza page, I see that the date of birth statement now has a reference where there was none before.

Reloading Marie Gareau's page shows the new country statement. You may have noticed that I filled in the country value without giving any start date qualifier value for the statement in the country_startDate_val column. The Wikidata Query Service treats qualifiers and references differently in that it assigns IRI identifiers to references, but does not assign them to qualifiers. Because VanderBot is designed to get information about specific metadata about items using the Query Service, it does not capture and store any identifier for qualifiers. Thus it is currently not possible to add a qualifier to a statement once the statement has been created. This behavior may be modified at some point in the future, but for now you should be aware of that limitation.

Who's responsible for what just happened?

We can check the revision history of the Juan Jose Garza page to see how the edits we made were recorded.

Notice that since the edits were made by a script using a bot password associated with my user account (Baskaufs), the edits were credited to me just as if I had made them by hand using the graphical interface. One difference is that the original item was created using a single API interaction. So even though it involved creating a label, a description, and two statements, it was recorded as a single edit instead of four.

The benefit of editing as many parts of the item metadata at once as possible is that the interactions with the API are the rate-limiting factor when writing data to the API. VanderBot only makes one API call per row, even if the row contains many more columns than in this simple example. So it can make the edits much faster all at once than it could if it did them all separately.

Notice also that there is no record here that the VanderBot script was used. It identified itself to the API through its User-Agent HTTP header when it communicated with the server, and it was a "good citizen" by waiting to retry when the server reported that it was lagged. But there is no record of that interaction in the revision history.

To see the final state of the CSV file after all of the uploads shown here, see this gist.

What should you do next?

While you have the spreadsheet and JSON metadata description file set up, you should do a lot more experimenting. There is really nothing that you can "break" on either test.wikidata.org or VanderBot. In particular, you should try doing the following things to see what happens. Some of them are "wrong" things that produce bad results or aren't allowed by the API, while others are harmless or fine. If the script doesn't crash, reload the page or search for the new item in the graphical interface to see what happened.

Create another row in the spreadsheet where the label and description are the same as an existing item. What happens? (When writing to the "real" Wikidata, there is code in VanderBot that tries to prevent this, but it doesn't work with the test instance.)
What happens if either the label or the description (but not both) is the same as an existing item? Can you create an item that is missing either a label or a description?
What happens if you delete the uuid identifier for a statement property and run the script again?
What happens if you delete the uuid identifier for a statement property, change the value, and run the script again?
What happens if you delete a reference hash identifier and replace the reference value with a different one?
What happens if you leave off part of a date (e.g. 1997-9-23 with no leading zero for the month)? You may need to change the cell format to "text" in order to be able to make this mistake in your spreadsheet program.
What happens if you have the correct number of characters in a date, but the date is malformed (e.g. 199x-09-23)?
What happens if you change a value, but do not delete the corresponding identifier associated with the value?

In the next blog post, we will switch to writing to the "real" Wikidata, where we would prefer not to make these kinds of mistakes.

Answers are below.

Answers:

The API responds with an error message and the script ends prematurely.
A new item will be created with the same label (or description). You can also create items lacking either a label or a description (but not both).
A duplicate statement will be created. This is a bad practice.
A second value will be added for the property. This is perfectly fine as long as the second value is correct information.
The new reference gets added as a second reference for the same statement. This is perfectly fine.
The script does nothing and reports that there was an incorrectly formatted date in that row.
The script tries to write the value, but the API returns an error message saying that the date format is bad. The script then stops running.
Nothing happens. The script doesn't look at the value if it already has an identifier associated with it.

TDWG gets 5 Stars!

2020-03-05T08:12:00.002-08:00

Photo from W3C https://www.w3.org/DesignIssues/LinkedData.html

TDWG IRIs are dereferenceable with content negotiation!

Yesterday was a happy day for me because after several years of work, the switch was flipped and all of the IRIs minted by TDWG under the rs.tdwg.org subdomain became dereferenceable with content negotiation in most cases. For those readers who aren't hard-core Linked Open Data (LOD) buffs, I'll explain what that means.

An internationalized resource identifier (IRI; superset of uniform resource identifiers, URIs) is a globally unique identifier that generally looks like the well known URL. It usually starts with http:// or https://, which implies that something will happen if you put it in a web browser. That "something" is dereferencing - the browser uses the IRI to try to retrieve a document from a remote server and if successful, a web page shows up in the browser. Because a browser's job is to retrieve web pages, when it dereferences an IRI, it asks for a particular "content type" (text/html) indicating that it wants an HTML web page.

But there are other kinds of software designed to retrieve documents that are readable by machines rather than by humans. When those applications dereference an IRI, they ask for other content types (like text/turtle or application/rdf+xml) that can be interpreted as structured data and be integrated with data from other sources. The same IRI can be used to retrieve different documents that provide the same information in different formats depending on the content type that is requested. The process of determining what kind of document to return to the requesting application is called content negotiation.

In the past, the behavior of TDWG IRIs were inconsistent. Some IRIs like those of Darwin Core terms would retrieve a web page in a browser and provide machine-readable RDF/XML when requested. Other IRIs like those of Audubon Core terms would retrieve a web page, but no machine-readable formats. Obsolete IRIs like those of old versions of Darwin Core and the defunct TDWG ontology did nothing at all. Then there were many TDWG resources, such as old standards documents, that didn't even have IRIs.

In an earlier blog post, I described the IRI patterns that I established in order to be able to denote all of the kinds of TDWG standards components that were described in the TDWG Standards Documentation Specification. Those patterns made it possible to use IRIs to refer to things like vocabularies, term lists, and documents in a consistent way. Just creating the IRI patterns and using them to assign IRIs to vocabularies and documents provided a way to uniquely identify those resources, but did not create the "magic" of actually making it possible to use those IRIs to retrieve information. That's what happened yesterday.

What happens when the IRIs are dereferenced?

The action that takes place when an rs.tdwg.org IRI is dereferenced depends on the category of the resource and the content type that's requested. There are four categories of behavior that vary primarily on how they deliver human-readable content.

1. "Living" TDWG vocabulary terms. When a term from one of the actively maintained TDWG vocabularies (currently Darwin Core and Audubon Core) is dereferenced, the browser is redirected to the most helpful reference document for that vocabulary (the Quick Reference Guide for Darwin Core and the Term List document for Audubon Core). You can try this with dwc:recordedBy, http://rs.tdwg.org/dwc/terms/recordedBy and ac:caption, http://rs.tdwg.org/ac/terms/caption.

2. Obsolete TDWG vocabulary terms, vocabularies, term lists, and special categories of resources. When terms in these categories are dereferenced, a generic web page is generated by a script that provides vanilla information about the term. The same is true for some special categories like Executive Committee decisions. Try it with an obsolete term http://rs.tdwg.org/dwc/curatorial/Disposition, a decision http://rs.tdwg.org/decisions/decision-2011-10-16_6 and a term list http://rs.tdwg.org/ac/xmp/.

3. TDWG-maintained standards documents. The maintenance of TDWG standards documents is idiosyncratic and their location depends on where their maintainers happened to have stashed them. The URLs used to retrieve the documents might change if they are put into different places or if their format changes (e.g. changed from PDF to Markdown). To provide a stable way to denote those documents, the IRIs minted in rs.tdwg.org subdomain redirect to whatever current URL delivers that particular document. If the document moves or the access URL changes for some reason, the stable IRI will redirect to the new access URL. Try it with the TDWG Vocabulary Maintenance Specification http://rs.tdwg.org/vms/doc/specification/, the Audubon Core Structure document http://rs.tdwg.org/ac/doc/structure/, and the TAPIR Protocol Specification http://rs.tdwg.org/tapir/doc/specification/.

4. Non-TDWG-maintained standards documents. A lot of the old TDWG standards were not actually published by TDWG, and their maintenance is carried out by organizations whose websites are not under TDWG control. So we will just try to keep the TDWG-issued document IRIs pointing at whatever the access URL is currently for the document. Examples: Economic Botany Data Collection Standard specification http://rs.tdwg.org/ebdc/doc/specification/, Taxonomic Literature : A Selective Guide to Botanical Publications and Collections with Dates, Commentaries and Types (Second edition, vol. 1) http://rs.tdwg.org/tl/doc/v1/, and Index Herbariorum http://rs.tdwg.org/ih/doc/book/.

Machine-readable metadata
For these categories, the machine readable metadata is delivered in the same way: generated by script from the data in the rs.tdwg.org Github repository. To access the content through content negotiation, you can dereference any of the IRIs above using software like Postman that will allow you to specify an Accept header for the machine-readable content type that you want (text/turtle or application/rdf+xml). To access the machine-readable documents directly, drop any trailing slashes and append .ttl or .rdf to access RDF/Turtle or RDF/XML respectively. Examples: http://rs.tdwg.org/dwc/terms/recordedBy.ttl, http://rs.tdwg.org/dwc/terms/recordedBy.rdf, and http://rs.tdwg.org/tl/doc/v1.ttl.

There are also a number of legacy XML schemas that are still being retrieved by some applications and they are made available by just redirecting from the rs.tdwg.org IRI to wherever the schema lives. Example: http://rs.tdwg.org/dwc/text/tdwg_dwc_text.xsd .

How this happens
The script that handles all of these many variations of IRIs is written in XQuery (a functional programming language designed to process XML) and runs on a BaseX server instance. A second XQuery script generates the vanilla HTML web pages that are generated from the same data as the machine-readable metadata. I've written more extensively about this approach in an earlier post, so I won't say more about it here.

There was a lot of concern about maintaining a server that is based on a programming language that is not well-known among IT professionals. So it's likely that in the future the XQuery-based system will be replaced by something else. I'd like to use something based on the W3C Generating RDF from Tabular Data on the Web Recommendation, since the source data live as CSV files on Github. But for now, this is what we have.

5 Stars???

The title of this post says that TDWG now gets 5 stars. What does that mean? In 2010, Tim Berners-Lee promoted a 5 star system to rate the extent to which data sources are freely available in machine-readable form. The TDWG standards metadata have been available online in structured form under an open license (stars 1 through 3), but failed to achieve 5 stars since standards-based machine readable data (RDF) couldn't be acquired by dereferencing the IRIs (star 4) and the resources weren't linked to others in the machine-readable metadata (star 5). As of yesterday, we can tick off stars 4 and 5, so the TDWG standards metadata are now fully compliant with Linked Open Data best practices. Congratulations TDWG!

Special thanks to Matt Blissett of GBIF for working out the technical details of setting up the server and production protocol and to Tim Robertson of GBIF for his support in getting this done. Thanks also to Cliff Anderson and the XQuery Working Group of the Vanderbilt University Heard Library for introducing me to BaseX server.

VanderBot part 4: Preparing data to send to Wikidata

2020-02-08T07:28:00.002-08:00

In the previous blog post, I described how I used a Python script to upload data stored in a CSV spreadsheet to Wikidata via the Wikidata API. I noted that the spreadsheet contained information about whether data were already in Wikidata and if they needed to be written to the API, but I did not say how I acquired those data, nor how I determined whether they needed to be uploaded or not. That data acquisition and processing is the topic of this post.

The overall goal of the VanderBot project is to enter data about Vanderbilt employees (scholars and researchers) and their academic publications into Wikidata. Thus far in the project, I have focused primarily on acquiring and uploading data about the employees. The data acquisition process has three stages:

1. Acquiring the names of research employees (faculty, postdocs, and research staff) in departments of the university.

2. Determining whether those employees were already present in Wikidata or if items needed to be created for them.

3. Generating data required to make key statements about the employees and determining whether those statements (and associated references) had already been asserted in Wikidata.

The data harvesting script (coded in Python) required to carry out these processes is available via a Jupyter notebook available onGitHub.

Acquire names of research employees at Vanderbilt

Scrape departmental website

I've linked employees to Vanderbilt through their departmental affiliations. Therefore, the first task was to create items for departments in the various schools and colleges of Vanderbilt University. I won't go into detail about that process other than to say that the hacky code I used to do it is on GitHub.

The actual names of the employees were acquired by scraping departmental faculty and staff web pages. I developed the scraping script based on the web page of my old department, biological sciences. Fortunately, the same page template was used by many other departments in both the College of Arts and Sciences and the Peabody College of Education, so I was able to scrape about 2/3 of the departments in those schools without modifying the script I developed for the biological sciences department.

Because the departments had differing numbers of researcher pages covering different categories of researchers, I create a JSON configuration file where I recorded the base departmental URLs and the strings appended to that base to generate each of the researcher pages. The configuration file also included some other data needed by the script, such as the department's Wikidata Q ID, a generic description to use for researchers in the department (if they didn’t already have a description), and some strings that I used for fuzzy matching with other records (described later). Some sample JSON is included in the comments near the top of the script.

The result at the end of the "Scrape departmental website" section of the code was a CSV file with the researcher names and some other data that I made a feeble attempt to scrape, such as their title and affiliation.

Search ORCID for Vanderbilt employees

ORCID (Open Researcher and Contributor ID) plays an important part in disambiguating employees. Because ORCIDs are globally unique, associating an employee name with an ORCID allows one to know that the employee is different from someone with the same name who has a different ORCID.

For that reason, I began the disambiguation process by performing a search for "Vanderbilt University" using the ORCID API. The search produced several thousand results. I then dereferenced each of the resulting ORCID URIs to capture the full data about the researcher. That required an API call for each record and I used a quarter second delay per call to avoid hitting the API too fast. As a result, this stage of the process took hours to run.

I screened the results by recording only those that listed "Vanderbilt University" as part of the employments affiliation organization string. That excluded people who were only students and never employees, and included people whose affiliation was "Vanderbilt University Medical Center", "Vanderbilt University School of Nursing", etc. As part of the data recorded, I included their stated departmental affiliations (some had multiple affiliations if they moved from one department to another during their career). After this stage, I had 2240 name/department records.

Fuzzy matching of departmental and ORCID records

The next stage of the process was to try to match employees from the department that I was processing with the downloaded ORCID records. I used a Python fuzzy string matching function called fuzz.token_set_ratio() from the fuzzywuzzy package. I tested this function along with others in the package and it was highly effective at matching names with minor variations (both people and departmental names). Because this function was insensitive to word order, it matched names like "Department of Microbiology" and "Microbiology Department". However, it also made major errors for name order reversals ("John James" and "James Johns", for example) so I had an extra check for that.

If the person's name had a match score of greater than 90 (out of 100), I then performed a match check against the listed department. If it also had a match score of greater than 90, I assigned that ORCID to the person. If no listed department matched had a score over 90, I assigned the ORCID, but flagged that match for manual checking later.

Determine whether employees were already in Wikidata

Attempt automated matching with people in Wikidata known to work at Vanderbilt

I was then ready to start trying to match people with existing Wikidata records. The low-hanging fruit was people whose records already stated that their employer was Vanderbilt University (Q29052). I ran a SPARQL query for that using the Wikidata Query Service. For each match, I also recorded the employee's description, ORCID, start date, and end date (where available).

Once I had those data, I checked each departmental employee's record against the query results. If both the departmental employee and the potential match from Wikidata had the same ORCID, then I knew that they were the same person and I assigned the Wikidata Q ID to that employee. If the employee had an ORCID I could exclude any Wikidata records with non-matching ORCIDs and only check for name matches with Wikidata records that didn't have ORCIDs. Getting a name match alone was not a guarantee that the person in Wikidata was the same as the departmental employee, but given that the pool of possible Wikidata matches only included people employed at Vanderbilt, a good name match meant that it was probably the same person. If the person had a description in Wikidata, I printed the two names and the description and visually inspected the matches. For example, if there was a member of the Biological Sciences department named Jacob Reynolds and someone in Wikidata named Jacob C. Reynolds who was a microbiologist, the match was probably good. On the other hand, if Jacob C. Reynolds was a historian, then some manual checking was in order. I did a few other tricks that you can see in the code.

This "smart matching" with minimal human intervention was usually able to match a small fraction of people in the department. But there were plenty of departmental employees who were already in Wikidata without any indication that they worked at Vanderbilt. The obvious way to look for them would be to just do a SPARQL query for their name. There are some features built in to SPARQL that allow for REGEX checks, but those features are impossibly slow for a triplestore the size of Wikidata's. The strategy that I settled for was to generate as many possible variations of the person's name and query for all of them at once. You can see what I did in the generateNameAlternatives() function in the code. I searched labels and aliases for: the full name, names with middle initials with and without periods, first and middle initials with and without periods, etc. This approach was pretty good at matching with the right people, but it also matched with a lot of wrong people. For example, for Jacob C. Reynolds, I would also search for J. C. Reynolds. If John C. Reynolds had J. C. Reynolds as an alias, he would come up as a hit. I could have tried to automate the processing of the returned names more, but there usually weren't a lot of matches and with the other screening criteria I applied, it was pretty easy for me to just look at the results and bypass the false positives.

When I did the query for the name alternatives, I downloaded the values for several properties that were useful for eliminating hits. One important screen was to eliminate any matching items that were instances of classes (P31) other than human (Q5). I also screened out people who were listed as having died prior to some set data (2000 worked well - some departments still listed recently deceased emeriti and I didn't want to eliminate those). If both the employee and the name match in Wikidata had ORCIDs that were different, I also eliminated the hit. For all matches that passed these screens, I printed the description, occupation, and employer if they were given in Wikidata.

Clues from publications in PubMed and Crossref

The other powerful tool I used for disambiguation was to look up any articles linked to the putative Wikidata match. For each Wikidata person item who made it this far through the screen, I did a SPARQL query to find works authored by that person. For up to 10 works, I did the following. If the article had a PubMed ID, I retrieved the article metadata from the PubMed API and tried to match against the author names. When I got a match with an author, I checked for an ORCID match (or excluded if an ORCID mismatch) and also for a fuzzy match against any affiliation that was given. If either an ORCID or affiliation matched, I concluded that the departmental employee was the same as the Wikidata match and stopped looking.

If there was no match in PubMed and the article had a DOI, I then retrieved the metadata about the article from the CrossRef API and did the same kind of screening that I did in PubMed.

Human intervention

If there was no automatic match via the article searches, I printed out the full set of information (description, employer, articles, etc.) for every name match, along with the name from the department and the name from Wikidata in order for a human to check whether any of the matches seemed plausible. In a lot of cases, it was easy to eliminate matches that had descriptions like "Ming Dynasty person" or occupation = "golfer". If there was uncertainty, the script printed hyperlinked Wikidata URLs and I could just click on them to examine the Wikidata record manually.

Here's some typical output:

--------------------------
No Wikidata name match: Justine Bruyère

--------------------------
No Wikidata name match: Nicole Chaput Guizani

--------------------------
SPARQL name search: Caroline Christopher
(no ORCID)

0 Wikidata ID: Q83552019 Name variant: Caroline Christopher https://www.wikidata.org/wiki/Q83552019
No death date given.
description: human and organizational development educator
employer: Vanderbilt University
No articles authored by that person
Employee: Caroline Christopher vs. name variant: Caroline Christopher

Enter the number of the matched entity, or press Enter/return if none match: 0

--------------------------
SPARQL name search: Paul Cobb
(no ORCID)

0 Wikidata ID: Q28936750 Name variant: Paul Cobb https://www.wikidata.org/wiki/Q28936750
No death date given.
description: association football player
occupation: association football player
No articles authored by that person
Employee: Paul Cobb vs. name variant: Paul Cobb

1 Wikidata ID: Q55746009 Name variant: Paul Cobb https://www.wikidata.org/wiki/Q55746009
No death date given.
description: American newspaper publisher
occupation: newspaper proprietor
No articles authored by that person
Employee: Paul Cobb vs. name variant: Paul Cobb

Enter the number of the matched entity, or press Enter/return if none match:

--------------------------
No Wikidata name match: Molly Collins

--------------------------
No Wikidata name match: Ana Christina da Silva [Iddings]

Although this step did require human intervention, because of the large amount of information that the script collected about the Wikidata matches, it usually only took a few minutes to disambiguate a department with 30 to 50 employees.

Generate statements and references and determine which were already in Wikidata

Generating data for a minimal set of properties

The next to last step was to assign values to a minimal set of properties that I felt each employee should have in a Wikidata record. Here's what I settled on for that minimal set:

P31 Q5 (instance of human). This was automatically assigned to all records.

P108 Q29052 (employer Vanderbilt University). This applies to all employees in our project - the employer value can be set at the top of the script.

P1416 [Q ID of department] (affiliation with focal department). After searching through many possible properties, I decided that P1416 (affiliation) was the best property to use to assert the employee's connection to the department I was processing. P108 was also possible, but there were a lot of people with dual departmental appointments and I generally didn't know which department was the actual "employer". Affiliation seemed to be an appropriate connection for regular faculty, postdocs, visiting faculty, research staff, and other kinds of statuses where the person would have some kind of research or scholarly output.

P496 [ORCID identifier]. ORCIDs that I'd acquired for the employees were hard-won and an excellent means for anyone else to carry out disambiguation, so I definitely wanted to include that assertion if I could.

P21 [sex or gender]. I was really uncomfortable assigning a value of this property, but this is a property often flagged by Recoin as a top missing property and I didn't want some overzealous editor deleting my new items because their metadata were too skimpy. Generally, the departmental web pages had photos to go with the names, so I made a call and manually assigned a value for this property (options: m=male, f=female, i=intersex, tf=transgender female, tm=transgender male). Any time the sex or gender seemed uncertain, I did not provide a value.

The description. I made up a default description for the department, such as "biological science researcher", "historian", or "American Studies scholar" for the Biological Sciences, History, and American Studies departments respectively. I did not overwrite any existing descriptions by default, although as a last step I looked at the table to replace stupid ones like "researcher, ORCID: 0000-0002-1234-5678". These defaults were generally specific enough to prevent collisions where the label/description combination I was creating would collide with the label/description combination for an existing record and kill the record write.

When it made sense, I added references to the statements I was making. Generally, a reference is not expected for instance of human and I really couldn't give a reference for sex or gender. For the employer and affiliation references, I used the web page that I scraped to get their name as the reference URL and provided the current date as the value for P813 (retrieved). For ORCID, I created a reference that had a P813 (retrieved) property if I was able to successfully dereference the ORCID URI.

Because each of these properties had different criteria for assigning values and references, there was no standard code for assigning them. The code for each property is annotated, so if you are interested you can look at it to see how I made the assignments.

Check for existing data in Wikidata

In the earlier posts, I said that I did not want VanderBot to create duplicate items, statements, and references when they already existed in Wikidata. So a critical last step was to check for existing data using SPARQL. One important thing to keep in mind is the Query Service Updater lag that I talked about in the last post. That lag means that changes made up to 8 or 10 hours ago would not be included in this download. However, given that the Wikidata researcher item records I'm dealing with do not change frequently, the lag generally wasn't a problem. I should note that it would be possible to get these data directly from the Wikidata API, but the convenience of getting exactly the information I wanted using SPARQL outweighed my motivation to develop code to do that.

At this point in the workflow, I've already determined with a fairly high degree of confidence which of the departmental employees were already in Wikidata. That takes care of the potential problem of creating duplicate item records, and it also means that I do not need to check for the presence of statements or references for any of the new items either.

One interesting feature of SPARQL that I learned from this project was using the VALUES clause. Despite having used SPARQL for years and skimming through the SPARQL specification several times, I missed it. The VALUES clause allows you to specify which values the query should use for a particular variable in its pattern matching. That makes querying a large triplestore like Wikidata much faster that without it and it also reduces the number of results that the code has to sort through when results come back from the query service. Here's an example of a query using the VALUES clause that you can test at the Wikidata Query Service:

SELECT DISTINCT ?id ?statement WHERE {

VALUES ?id {

wd:Q4958

wd:Q39993

wd:Q234

}

?id p:P31 ?statement.

}

So the first part of the last step in the workflow is to generate a list of all of the existing item Q IDs for employees in the department. That list is passed to the searchStatementAtWikidata() function as its first argument. searchStatementAtWikidata() is a general purpose function that will search Wikidata for a particular property of items in the generated list. It can be used either to search for a particular property and value (like P108 Q29052, employer Vanderbilt University) and retrieve the references for that statement, or for only the property (like P496, ORCID) and retrieve both the values and references associated with those statements. This behavior is controlled by whether an empty string is sent for the value argument or not. For each of the minimal set of properties that I'm tracking for departmental employees, the searchStatementAtWikidata() is used to retrieve any available data for the listed employees. Those data are then matched with the appropriate employee records and recorded in the CSV file along with the previously generated property values.

In addition to the property checks, labels, descriptions, and aliases for the list of employees are retrieved via SPARQL queries. In the cases of labels and descriptions, if there is an existing label or description in Wikidata, it is written to the CSV file. If there is no existing label, the name scraped from the departmental website is written to the CSV as the label. If there is no existing description, the default description for the department is written to the CSV. Whatever alias lists are retrieved from Wikidata (including empty ones) are written to the CSV.

Final manual curation prior to writing to the Wikidata API

In theory, the CSV file resulting from the previous step should contain all of the information needed by the API-writing script that was discussed in the last post. However, I always manually examine the CSV to look for problems or things that are stupid such as bad descriptions.

If a description or label is changed, the API-writing script will detect that it's different from the current value being provided by the SPARQL endpoint and the new description or label will overwrite the existing one. The API-writing script is currently not very sophisticated about how it handles aliases. If there are more aliases in the CSV than are currently in Wikidata, the script will overwrite existing aliases in Wikidata with those in the spreadsheet. The assumption is that alias lists are only added to, rather than aliases being changed or deleted. At some point in the future, I intend to write a separate script that will handle labels and aliases in a more robust way, so I really didn't want to waste time now on making the alias-handling better than it is.

A typical situation is to discover a more specific label for the person than already exists in Wikidata. In that case, I usually add the existing label to the alias list, and replace the label value in the CSV with the better new one. WARNING! If you edit the alias list, make sure that your editor uses generic quotes (ASCII 32/Unicode +U0022) and not "smart quotes". They have a different Unicode value and will break the script. Open Office/Libre Office (the best applications for editing CSVs in my opinion) default to smart quotes, so this setting must be turned off manually.

I also just look over the rest of the spreadsheet to convince myself that nothing weird is going on. Usually the script does an effective job of downloading the correct reference properties and values, but I've discovered some odd situations that have caused problems.

At this point, steps 1 and 2 in the VanderBot diagram have been completed by the data harvesting script, and the API-writing script described in the last post is ready to take over in step 3. When step 4 is complete, the blank cells in the CSV for missing item, statement, and reference identifiers will should all be filled in and the CSV can be filed for future reference.

Final thoughts

I tried to make the API writing script generic and adaptable for writing statements and references about any kind of entity. That's achievable simply by editing the JSON schema file that maps the columns in the source CSV. However, getting the values for that CSV is the tricky part. If one were confident that only new items were being written, then the table could filled with only the data to be written and without any item, statement, or reference identifiers. That would be the case if you were using the script to load your own Wikibase instance. However, for adding data to Wikidata about most items like people or references, one can't know if the data needs to be written or not, and that's why a complex and somewhat idiosyncratic script like the data harvesting script is necessary. So there's no "magic bullet" that will make it possible to automatically know whether you can write data to Wikidata without creating duplicate assertions.

To find records that VanderBot has put into Wikidata, try this query at the Wikidata Query Service:

select distinct ?employee where {

?employee wdt:P1416/wdt:P749+ wd:Q29052.

}

limit 50

The triple pattern requires that the employee first have an affiliation (P1416) to some item, and that item be linked by one or more parent organization (P749) links to Vanderbilt University (Q29052). I linked the department items to their parent school or college using P749 and made sure that the University's schools and colleges were all linked to the University by P749 as well. However, some schools like the Blair School of Music do not really have departments, so their employees were affiliated directly to the school or college rather than a department. So the search has to pick up administrative entity items that were either one or two P749 links from the university (hence the "+" property path operator after P749). Since there are a lot of employees, I limited the results to 50. If you click on any of the results, it will take you to the item page and you can view the page history to confirm that VanderBot had made edits to the page. (At some point, there may be people who were linked in this way by an account other than VanderBot, but thus far, VanderBot is probably the only editor of Vanderbilt employees items that's linking to departments by P1416, given that I recently created all of the department items from scratch.)

A variation of that query will tell you the number of records meeting the criteria of the previous query:

select (count(?employee) as ?count) where {

?employee wdt:P1416/wdt:P749+ wd:Q29052.

}

As of 2020-02-08, there are 1221 results. That number should grow as I use VanderBot to process other departments.

VanderBot part 3: Writing data from a CSV file to Wikidata

2020-02-07T14:49:00.001-08:00

In the previous post of this series, I described how my investigation of the Wikibase data model led me to settle on a relatively simple spreadsheet layout for tracking what items, statements, and references needed to be created or edited in Wikidata. Since column headers in a CSV spreadsheet don't really have any meaning other than to a human, it's necessary to map columns to features of the Wikibase model so that a script would know how to write the data in those columns to appropriate data items in Wikidata.

Developing a schema to map spreadsheet columns to the Wikibase model

In a blog post from 2016, I wrote about a similar problem that I faced when creating an application that would translate tabular CSV data to RDF triples. In that case, I created a mapping CSV table that mapped table headers to particular RDF predicates, and that also indicated the kind of object represented in the table (language-tagged literal, IRI, etc.). That approach worked fine and had the advantage of simplicity, but it had the disadvantage that it was an entirely ad hoc solution that I made up for my own use.

When I learned about the "Generating RDF from Tabular Data on the Web" W3C Recommendation, I recognized that this was a more standardized way to accomplish a mapping from a CSV table to RDF. When I started working on the VanderBot project I realized that since the Wikibase model can be expressed as an RDF graph, I could construct a schema using this W3C standard to document how my CSV data should be mapped to Wikidata items, properties, references, labels, etc. The most relevant part of the standard is section 7.3, "Example with single table and using virtual columns to produce multiple subjects per row".

An example schema that maps the sample table from last the last post is here. The schema is written in JSON and if ingested by an application that can transform CSV files in accordance with the W3C specification, it should produce RDF triples identical to triples about the subject items that are stored in the Wikidata Query Service triplestore (not all triples, but many of the ones that would be generated if the CSV data were loaded into the Wikidata API). I haven't actually tried this since I haven't acquired such an application, but the point is that the JSON schema applied to the CSV data will generate part of the graph that will eventually be present in Wikidata when the data are loaded.

I will not go into every detail of the example schema, but show several examples of how parts of it map particular columns.

Column for the item identifier

Each column in the table has a corresponding JSON object in the schema. The first column, with the column header title "wikidataId" is mapped with:

{
"titles": "wikidataId",
"name": "wikidataId",
"datatype": "string",
"suppressOutput": true
}

This JSON simply associates a variable name (wikidataId) with the Wikidata Q ID for the item that's the subject of each row. (For simplicity, I've chosen to make the variable names the same as the column titles, but that isn't required.) The "true" value for suppressOutput means that no statement is directly generated from this column.

Column for the label

The "labelEn" column is mapped with this JSON object:

{

"titles": "labelEn",

"name": "labelEn",

"datatype": "string",

"aboutUrl": "http://www.wikidata.org/entity/{wikidataId}",

"propertyUrl": "rdfs:label",

"lang": "en"

}

The value of aboutUrl indicates the subject of the triple generated by this column. The curly brackets indicate that the wikidataId variable should be substituted in that place to generate the URI for the subject. The value of propertyUrl is rdfs:label, the RDF predicate that Wikibase uses for its label field. The object of the triple by default is the value present in that column for the row. The lang value provides the language tag for the literal.

So when this mapping is applied to the labelEn column of the first row, the triple

<http://www.wikidata.org/entity/Q84268104> rdfs:label "Vanderbilt Department of Biomedical Engineering"@en.

would be generated.

Column for a property having value that is an item (P749)

Here is the JSON object that maps the "parentUnit" column.

{

"titles": "parentUnit",

"name": "parentUnit",

"datatype": "string",

"aboutUrl": "http://www.wikidata.org/entity/{wikidataId}",

"propertyUrl": "http://www.wikidata.org/prop/direct/P749",

"valueUrl": "http://www.wikidata.org/entity/{parentUnit}"

}

As before, the subject URI is established by substituting the wikidataId variable into the URI template for aboutUrl. Instead of directly mapping the column value as the object of the triple, the column value is inserted into a valueUrl URI template in the same manner as the aboutUrl.

Applying this column mapping to the parentUnit column generates the triple:

<http://www.wikidata.org/entity/Q84268104> <http://www.wikidata.org/prop/direct/P749> <http://www.wikidata.org/entity/Q7914459>.

which can be abbreviated

wd:Q84268104 wdt:P749 wd:Q7914459.

The other columns in the CSV table are mapped similarly. If there is no valueURl key:value pair, the value for the column is a literal object, and if there is a value for valueURI, the value for the column is used to generate a URI denoting a non-literal object.

The value of datatype is important since it determines the xsd:datatype of literal values in the generated triples.

Not every column generates a triple with a subject that's the subject of the row. The subject may be the value of any other column. This allows the data in the row to form a more complicated graph structure.

How the VanderBot script writes the CSV data to the Wikidata API

The script that does the actual writing to the Wikidata API is here. The authentication process (line 338) is described in detail elsewhere.

The actual script begins (line 374) by loading the schema JSON into a Python data structure and loading the CSV table into a list of dictionaries.

The next section of the code (lines 402 to 554) uses the schema JSON to sort the columns of the tables into categories (labels, aliases, descriptions, statements with entity values, and statements with literal values).

From lines 556 to 756, the script steps through each row of the table to generate the data that needs to be passed to the API to upload new data. In each row, the script goes through each category of data (labels, aliases, etc.) and turns the value in a column into the specific JSON required by the API for uploading that kind of data. I call this "snak JSON" because the units in the JSON represent "snaks" (small, discrete statements) as defined by the Wikibase data model.

Originally, I had written the script in a simpler way, where each piece of information about the item was written in a separate API call. This seemed intuitive since there are individual API methods for uploading every category (label, description, property, reference, etc., see the API documentation). However, because of rate limitations that I'll talk about later, the most reasonable way to write the data was to determine which categories needed to be written for an item and then generate the JSON for all categories at once. I then used the "all in one" method wbeditentity to make all possible edits in a single API call. This resulted in much more complicated code that constructed deeply nested JSON that's difficult to read. The API help page didn't give any examples that were nearly this complicated, so getting this strategy to work required delving deeply into the Wikibase model. One lifesaver was that when a successful API call was made, the API's response included JSON structured according to the Wikibase model that was very similar to the JSON that was necessary to write to the API. Being able to look at this response JSON was really useful to help me figure out what subtle mistakes I was making when constructing the JSON to send to the API.

Simply creating labels, descriptions, and claims would not have been too hard, but I was determined to also have the capability to support references and qualifiers for claims. Here's how I hacked that task: for each statement column, I went through the columns and looked for other columns that the schema indicated were references or qualifiers of that statement. Currently, the script only handles one reference and one qualifier per statement, but when I get around to it, I'll improve the script to remove that limitation.

In line 759, the script checks whether it found any information about the item that wasn't already written to Wikidata. If there was at least one thing to write, the script attempts to post a parameter dictionary (including the complex, constructed snak JSON) to the API (lines 305 to 335) If the attempt was unsuccessful because the API was too busy, it retries several times. If the attempt was unsuccessful for other reasons, the script displays the server's response for debugging.

If the attempt was successful, the script extracts identifiers of newly-created data records (item Q IDs, statement UUIDs, and reference hashes - see the previous post for more on this) and adds them to the CSV table so that the script will know in the future that those data are already in Wikidata. The script rewrites the CSV table after every line so that if the script crashes or the API throws an error during a write attempt, one can simply re-start the script after fixing the problem and the script will know not to create duplicate data on the second go-around (since the identifiers for the already-written data have already been added to the CSV).

I mentioned near the end of my previous post that I don't have any way to record whether labels, descriptions, and qualifiers had already been written or not, since URI identifiers aren't generated for them. The lack of URI identifiers means that one can't refer to those particular assertions directly by URIs in a SPARQL query. Instead, one must make a query asking explicitly for the value of the label, description, or qualifier and then determine whether it's the same as the value in the CSV table. The way the script currently works, prior to creating JSON to send to the API the script sends a SPARQL query asking for the values of labels and descriptions of all of the entities in the table (lines 465 and 515). Then as the script processes each line of the table, it checks whether the value in the CSV is the same as what's already in Wikidata (and then does nothing) or different. If the value is different, it writes the new value from the CSV and overwrites the value in Wikidata.

It is important to understand this behavior, because if the CSV table is "stale" and has not been updated for a long time, other users may have improved the labels or descriptions. Running the script with the stale values will effectively revert their improvements. So it's important to update the CSV file with current values before running this script that writes to the API. After updating, then you can manually change any labels or descriptions that are unsatisfactory.

In the future, I plan to write additional scripts for managing labels and aliases, so this crude management system will hopefully be improved.

Cleaning up missing references

In some cases, other Wikidata contributors have already made statements about pre-existing Vanderbilt employee items. For example, someone may have already asserted that the Vanderbilt employee's employer was Vanderbilt University. In such cases, the primary API writing script will do nothing with those statements because it is not possible to write a reference as part of the wbeditentity API method without also writing its parent statement. So I had to create a separate script that is a hack of the primary script in order to write the missing references. I won't describe that script here because its operation is very similar to the main script. The main difference is that it uses the wbsetreference API method that is able to directly write a reference given a statement identifier. After running the main script, I run the cleanup script until all of the missing references have been added.

Timing issues

Maxlag

One of the things that I mentioned in my original post on writing data to Wikidata was that when writing to the "real" Wikidata API (vs. the test API or your own Wikibase instance) it's important to respect the maxlag parameter.

You can set the value of the maxlag parameter in line 381. The recommended value is 5 seconds. A higher maxlag value is more aggressive and a lower maxlag value is "nicer" but means that you are willing to be told more often by the API to wait. The value of maxlag you have chosen is added to the parameters sent to the API in line 764 just before the POST operation.

The API lag is the average amount of time between when a user requests an operation and the API is able to honor that request. At times of low usage (e.g. nighttime in the US and Europe), the lag may be small, but at times of high usage, the lag can be over 8 seconds (I've seen it go as high as 12 seconds). If you set maxlag to 5 seconds, you are basically telling the server that if the lag gets longer than 5 seconds, ignore your request and you'll try again later. The server tells you to wait by responding to your POST request with a response that contains a maxlag error code and the amount of time the server is lagged. This error is handled in line 315 of the script. When a lag error is detected, the recommended practice is to wait at least 5 seconds before retrying.

Bot flags

I naïvely believed that if I respected maxlag errors that I'd be able to write to the API as fast as conditions allowed. However, the very first time I used the VanderBot script to write more than 25 records in a row, I was blocked by the API as a potential spammer with the message "As an anti-abuse measure, you are limited from performing this action too many times in a short space of time, and you have exceeded this limit. Please try again in a few minutes." Clearly my assumption was wrong. Through trial and error, I determined that a write rate of one second per write was too fast and would result in being temporarily blocked, but a rate of two seconds per write was acceptable. So to handle cases when maxlag was not invoked, I put a delay of 2 seconds on the script (line 822).

I had several hypotheses about the cause of the blocking. One possible reason was because I didn't have a bot flag. (More on that later.) Another reason might be because I was running the script from my local computer rather than from PAWS. PAWS is a web-based interactive programming and publishing environment based on Jupyter notebooks. At Wikicon North America, I had an interesting and helpful conversation with Dominic Byrd-McDevitt of the National Archives who showed me how he used PAWS to publish NARA metadata to Wikidata via a PAWS-based system using Pywikibot. I don't think he had a bot flag and I think his publication rate was faster than one write per second. But I really didn't want to take the time to test this hypothesis by converting my script over to PAWS (which would require more experimentation with authentication). So I decided to make a post to Wikitech-l and see if I could get an answer.

I quickly got a helpful answer that confirmed that neither using PAWS nor Pywikibot should have any effect on the rate limit. If I had a bot flag, I might gain the "noratelimit" right, which might bypass rate limiting in many cases.

Bot flags are discussed here . In order to get a bot flag, one must detail the task that the bot will perform, then demonstrate by a test run of 50 and 250 edits that the bot is working correctly. When I was at Wikicon NA, I asked some of the Powers That Be whether it was important to get a bot flag if I was not running an autonomous bot. They said that it wasn't so important if I was monitoring the writing process. It would be difficult to "detail the task" that VanderBot will perform since it's just a general-purpose API writing script, and what it writes will depend on the CSV file and the JSON mapping schema.

In the end, I decided to just forget about getting a bot flag for now and keep the rate at 2 seconds per write. I usually don't write more than 50-100 edits in a session and often the server will be lagged anyway requiring me to wait much longer than 2 seconds. If VanderBot's task becomes more well-defined and autonomous, I might request a bot flag at some point in the future.

Query Service Updater lag

One of the principles upon which VanderBot is built is that data are written to Wikidata by POSTing to the API, but that the status of data in Wikidata is determined by SPARQL queries of the Query Service. That is a sound idea, but it has one serious limitation. Data that are added through either the API or the human GUI do not immediately appear in the graph database that supports the Query Service. There is a delay, known as the Updater lag, between the time of upload and the time of availability at the Query Service. We can gain a better understanding by looking at the Query Service dashboard.

Here's a view of the lag time on the day I wrote this post (2020-02-03):

The first thing to notice is that there isn't just one query service. There are actually seven servers running replicates of the Query Service that handle the queries. They are all being updated constantly with data from the relational database connected to the API, but since the updating process has to compete with queries that are being run, some servers cannot keep up with the updates and lag by as much as 10 hours. Other servers have lag times of less than one minute. So depending on the luck of the draw of which server takes your query, data that you wrote to the API may be visible via SPARQL in a few seconds or in half a day.

A practical implication of this is that if VanderBot updates its CSV record using SQARQL, the data could be as much as half a day out of date. Normally that isn't a problem, since the data I'm working with doesn't change much, and once I write new data, I usually don't mess with it for days. However, since the script depends on a SPARQL query to determine if the labels and descriptions in the CSV differ from what's already in Wikidata, there can be problems if the script crashes half way through the rows of the CSV. If I fix the problem and immediately re-run the script, a lagged Query Service will send a response to the query saying that the labels and descriptions that I successfully wrote a few moments earlier were in their previous state. That will cause VanderBot to attempt to re-write those labels and descriptions. Fortunately, if the API detects that a write operation is trying to set the value of a label or description to the value it already has, it will do nothing. So generally, no harm is done.

This lag is why I use the response JSON sent from the API after a write to update the CSV rather than depending on a separate SPARQL query to make the update. Because the data in the response JSON comes directly from the API and not the Query Service, it is not subject to any lag.

Summary

The API writing script part of VanderBot does the following:

Reads the JSON mapping schema to determine the meaning of the CSV table columns.
Reads in the data from the CSV table.
Sorts out the columns by type of data (label, alias, description, property).
Constructs snak JSON for any new data items that need to be written.
Checks new statements for references and qualifiers by looking at columns associated with the statement properties, then creates snak JSON for references or qualifiers as needed.
Inserts the constructed JSON object into the required parameter dictionary for the wbeditentity API method.
POSTs to the Wikidata API via HTTP.
Parses the response JSON from the API to discover the identifiers of newly created data items.
Inserts the new identifiers into the table and write the CSV file.

In the final post of this series, I'll describe how the data harvesting script part of VanderBot works.

VanderBot part 2: The Wikibase data model and Wikidata identifiers

2020-02-07T09:16:00.002-08:00

The Wikidata GUI and the Wikibase model

To read part 1 of this series, see this page.

If you've edited Wikidata using the human-friendly graphical user interface (GUI), you know that items can have multiple properties, each property can have multiple values, each property/value statement can be qualified in multiple ways, each property/value statement can have multiple references, and each reference can have multiple statements about that reference. The GUI keeps this tree-like proliferation of data tidy by collapsing the references and organizing the statements by property.

This organization of information arises from the Wikibase data model (summarized here, in detail here). For those unfamiliar with Wikibase, it is the underlying software system that Wikidata is built upon. Wikidata is just one instance of Wikibase and there are databases other than Wikidata that are built on the Wikibase system. All of those databases built on Wikibase will have a GUI that is similar to Wikidata, although the specific items and properties in those databases will be different from Wikidata.

To be honest, I found working through the Wikibase model documentation a real slog. (I was particularly mystified by the obscure term for basic assertions: "snak". Originally, I though it was an acronym, but later realized it was an inside joke. A snak is "small, but more than a byte".) But understanding the Wikibase model is critical for anyone who wants to either write to the Wikidata API or query the Wikidata Query Service and I wanted to do both. So I dug in.

The Wikibase model is an abstract model, but it is possible to represent it as a graph model. That's important because that is why the Wikidata dataset can be exported as RDF and made queryable by SPARQL in the Wikidata Query Service. After some exploration of Wikidata using SPARQL and puzzling over the data model documentation, I was able to draw out the major parts of the Wikibase model as a graph model. It's a bit too much to put in a single diagram, so I made one that showed references and another that showed qualifiers (inserted later in the post). Here's the diagram for references:

Note about namespace prefixes: the exact URI for a particular namespace abbreviation will depend on the Wikibase installation. The URIs shown in the diagrams are for Wikidata. A generic Wikibase instance will contain wikibase.svc as its domain name in place of www.wikidata.org, and other instances will use other domain names. However, the namespace abbreviations shown above are used consistently among installations, and when querying via the human-accessible Query Service or via HTTP, the standard abbreviations can be used without declaring the underlying namespaces. That's convenient because it allows code based on the namespace abbreviations to be generic enough to be used for any Wikibase installation.

In the next several sections, I'm going to describe the Wikibase model and how Wikidata assigns identifiers to different parts of it. This will be important in deciding how to track data locally. Following that, I'll briefly describe my strategy for storing those data.

Item identifiers

The subject item of a statement is identified by a unique "Q" identifier. For example, Vanderbilt University is identified by Q29052 and the researcher Antonis Rokas is identified by Q42352198. We can make statements by connecting subject and object items with a defined Wikidata property. For example, the property P108 ("employer") can be used to state that Antonis Rokas' employer is Vanderbilt University: Q42352198 P108 Q29052. When the data are transferred from the Wikidata relational database backend fed by the API to the Blazegraph graph database backend of the Query Service, the "Q" item identifiers and "P" property identifiers are turned into URIs by appending the appropriate namespace (wd:Q42352198 wdt:P108 wd:Q29052.)

We can check this out by running the following query at the Wikidata Query Service:

SELECT DISTINCT ?predicate ?object WHERE {

wd:Q42352198 ?predicate ?object.

}

This query returns all of the statements made about Antonis Rokas in Wikidata.

Statement identifiers

In order to be able to record further information about a statement itself, each statement is assigned a unique identifier in the form of a UUID. The UUID is generated at the time the statement is first made. For example, the particular statement above (Q42352198 P108 Q29052) has been assigned the UUID FB9EABCA-69C0-4CFC-BDC3-44CCA9782450. In the transfer from the relational database to Blazegraph, the namespace "wds:" is prepended and for some reason, the subject Q ID is also prepended with a dash. So our example statement would be identified with the URI wds:Q42352198-FB9EABCA-69C0-4CFC-BDC3-44CCA9782450. If you look at the results from the query above, you'll see

p:P108 wds:Q42352198-FB9EABCA-69C0-4CFC-BDC3-44CCA9782450

as one of the results.

We can ask what statements have been made about the statement itself by using a similar query, but with the statement URI as the subject:

SELECT DISTINCT ?predicate ?object WHERE {
wds:Q42352198-FB9EABCA-69C0-4CFC-BDC3-44CCA9782450 ?predicate ?object.
}

One important detail relates to case insensitivity. UUIDs are supposed to be output as lowercase, but they are supposed to be case-insensitive on input. So in theory, a UUID should represent the same value regardless of the case. However, in the Wikidata system the generated identifier is just a string and that string would be different depending on the case. So the URI

wds:Q42352198-FB9EABCA-69C0-4CFC-BDC3-44CCA9782450

is not the same as the URI

wds:Q42352198-fb9eabca-69c0-4cfc-bdc3-44cca9782450

(Try running the query with the lower case version to convince yourself that this is true.) Typically, the UUIDs generated in Wikidata are upper case, but there are some that are lower case. For example, try

wds:Q57756352-4a25cee4-45bc-63e8-74be-820454a8b7ad

in the query. Generally it is safe to assume that the "Q" in the Q ID is upper case, but I've discovered at least one case where the Q is lower case.

Reference identifiers

If a statement has a reference, that reference will be assigned an identifier based on a hash algorithm. Here's an example: f9c309a55265fcddd2cb0be62a530a1787c3783e. The reference hash is turned into a URL by prepending the "wdref:" namespace. Statements are linked to references by the property prov:wasDerivedFrom. We can see an example in the results of the previous query:

prov:wasDerivedFrom wdref:8cfae665e8b64efffe44128acee5eaf584eda3a3

which shows the connection of the statement wds:Q42352198-FB9EABCA-69C0-4CFC-BDC3-44CCA9782450 (which states wd:Q42352198 wdt:P108 wd:Q29052.) to the reference wdref:8cfae665e8b64efffe44128acee5eaf584eda3a3 (which states "reference URL http://orcid.org/0000-0002-7248-6551 and retrieved 12 January 2019"). We can see this if we run a version of the previous query asking about the reference statement:

SELECT DISTINCT ?predicate ?object WHERE {

wdref:8cfae665e8b64efffe44128acee5eaf584eda3a3?predicate ?object.

}

As far as I know reference hashes seem to be consistently recorded in all lower case.

Reference identifiers are different from statement identifiers in that they denote the reference itself, and not a particular assertion of the reference. That is, they do not denote "statement prov:wasDerivedFrom reference", only the reference. (In contrast, statement identifiers denote the whole statement "subject property value".) That means that any statement whose reference has exactly the same asserted statements will have the same reference hash (and URI).

We can see that reference URIs are shared by multiple statements using this query:

SELECT DISTINCT ?statement WHERE {

?statement prov:wasDerivedFrom wdref:f9c309a55265fcddd2cb0be62a530a1787c3783e.

}

Identifier examples

The following part of a table that I generated for Vanderbilt researchers shows examples of the identifiers I've described above.

We see that each item (researcher) has a unique Q ID and that each statement that the researcher is employed at Vanderbilt University (Q29052) has a unique UUID (some upper case, some lower case) and that there are more than one statement that share the same reference (having the same reference hash).

Statement qualifiers

In addition to linking references to a statement, the statements can also be qualified. For example, Brandt Eichman has worked at Vanderbilt since 2004.

Here's a diagram showing how the qualifier "start time 2004" is represented in Wikidata's graph database:

We can see that qualifiers are handled a little differently from references. If the qualifier property (in this case P580, "since") has a simple value (literal or item), the value is linked to the statement instance using the pq: namespace version of the property.

If the value has a complex value (e.g. date), that value is assigned a hash and is linked to the statement instance using the pqv: version of the property. When the data are transferred to the graph database, the wdv: namespace is prepended to the hash.

Because dates are complex, the qualifier "since" requires a non-literal value in addition to a literal value linked by the pq: version of the property (see this page for more on the Wikibase date model). We can use this query:

SELECT DISTINCT ?property ?value WHERE {

wdv:849f00455434dc418fb4287a4f2b7638 ?property ?value.

}

to explore the non-literal date instance. In Wikidata, all dates are represented as full XML Schema dateTime values (year, month, day, hour, minute, second, timezone). In order to differentiate between the year "2004" and the date 1 January 2004 (both can be represented in Wikidata by the same dateTime value), the year 2004 is assigned a timePrecision of 9 and the date 1 January 2004 is assigned a timePrecision of 11.

Not every qualifier will have a non-literal value. For example, the property "series ordinal" (P1545; used to indicate things like the order authors are listed) has only literal values (integer numbers). So there are values associated with pq:P1545, but not pqv:P1545. The same is true for "language of work or name" (P407; used to describe websites, songs, books, etc.), which has an entity value like Q1860 (English).

Labels, aliases, and descriptions

Labels, aliases, and descriptions are properties of items that are handled differently from other properties in Wikidata. Labels and descriptions are handled in a similar manner, so I will discuss them together.

Each item in Wikidata can have only one label and one description in any particular language. Therefore adding or changing a label or description requires specifying the appropriate ISO 639-1 code for the intended language. When a label or description is changed in Wikidata, the previous version is replaced.

One important restriction is that the label/description combination in a particularly language must be unique. For example, the person with the English label "John Jones" and English description "academic" can currently only be Q16089943. Because labels and descriptions can change, this label/description combination won't necessarily be permanent associated with Q16089943 because someone might give that John Jones a more detailed description, or make his name less generic by adding a middle name or initial. So at some point in the future, it might be possible for some other John Jones to be described as "academic". An implication of the prohibition against two items sharing the same label/description pair is that it's better to create labels and descriptions that are as specific as possible to avoid collisions with pre-existing entities. As more entities get added to Wikidata, the probability of such collisions increases.

There is no limit to the number of aliases that an item can have per language. Aliases can be changed by either changing the value of a pre-existing alias or adding a new alias. As far as I know, there is no prohibition about aliases of one item matching aliases of another item.

When these statements are transferred to the Wikidata graph database, labels are values of rdfs:label, descriptions are values of schema:description, and aliases are values of skos:altLabel. All of the values are language-tagged.

What am I skipping?

Another component of the Wikibase model that I have not discussed is ranks. I also haven't talked about statements that don’t have values (PropertyNoValueSnak and PropertySomeValueSnak), and sitelinks. These are features that may be important to some users, but have not yet been important enough to me to incorporate handling them in my code.

Local data storage

If one wanted to make and track changes to Wikidata items, there are many ways to accomplish that with varying degrees of human intervention. Last year, I spent some time pondering all of the options and came up with this diagram:

Tracking every statement, reference, and qualifier for items would be complicated because each item could have an indefinite number and kind of properties, values, references, and qualifiers. To track all of those things would require a storage system as complicated as Wikidata itself (such as a separate a relational database or a Wikibase instance as shown in the bottom of the diagram). That's way beyond what I'm interested in doing now. But what I learned about the Wikibase model and how data items are identified suggested to me a way to track all of the data that I care about in a single, flat spreadsheet. That workflow can be represented by this subset of the diagram above:

I decided on the following structure for the spreadsheet (a CSV file, example here.). The Wikidata Q ID serves as the key for an item and the data in a row is about a particular item. A value in the Wikidata ID column indicates that the item already exists in Wikidata. If the Wikidata ID column does not have a value, that indicates that the item needs to be created.

Each statement has a column representing the property with the value of that property for an item recorded in the cell for that item's row. For each property column, there is an associated column for the UUID identifying the statement consisting of the item, property, and value. If there is no value for a property, no information is available to make that statement. If there is a value and no UUID, then the statement needs to be asserted. If there is a value and a UUID, the statement already exists in Wikidata.

References consist of one or more columns representing the properties that describe the reference. References have a single column to record the hash identifier for the reference. As with statements, if the identifier is absent, that indicates that the reference needs to be added to Wikidata. If the identifier is present, the reference has already been asserted.

Because labels, descriptions, and many qualifiers do not have URIs assigned as their identifiers, their values are listed in columns of the table without corresponding identifier columns. Knowing whether the existing labels descriptions and qualifiers already exist in Wikidata requires making a SPARQL query to find out. That process is described in the fourth blog post.

Where does VanderBot come in?

In the first post of this series, I showed a version of the following diagram to illustrate how I wanted VanderBot (my Python script for loading Vanderbilt researcher data into Wikidata) to work. That diagram is basically an elaboration of the simpler previous diagram.

The part of the workflow circled in green is the API writing script that I will describe in the third post of this series (the next one). The part of the workflow circled in orange is the data harvesting script that I will describe in the fourth post. Together these two scripts form VanderBot in its current incarnation.

Discussing the scripts in that order may seem a bit backwards because when VanderBot operates, the data harvesting script works before the API writing script. But in developing the two scripts, I needed to think about how I was going to write to the API before I thought about how to harvest the data. So it's probably more sensible for you to learn about the API writing script first as well. Also, the design of the API writing script is intimately related to the Wikidata data model, so that's another reason to talk about it next after this post.

VanderBot: A Python Script for Writing to Wikidata (part 1)

2020-02-06T20:38:00.001-08:00

Note added 2021-03-13: Although this post is still relevant for understanding the conceptual ideas behind my project to write Vanderbilt researcher/scholar records to Wikidata, I have written another series of blog posts showing (with lots of screenshots and handholding) how you can safely write your own data to the Wikidata API using data that is stored in simple CSV spreadsheets. See this post for details.

If you follow my blog, you will notice that I haven't written much in the last six months. That is at least partly because I've spent a lot of time working out the practical details of creating a "bot" that I can use to upload data about Vanderbilt researchers and scholars into Wikidata. In an earlier post from June last year, I described in general terms some background about writing to Wikibase, the platform on which Wikidata is built. (You probably should review that post for background before starting in on this one.) However, there were a lot of practical details that needed to be worked out to write to the "real" Wikidata. Those details are what I'll talk about in this post.

One question I'll dispense with at the start of the post is "Why didn't you just use Pywikibot?" There are two reasons. One is that when I experimented with using Pywikibot and our Wikibase instance, I encountered an approximately 10 second delay between write operations. I'm sure that there is some way to defeat that delay, but I was not able to figure it out by looking through the Pywikibot code and documentation. This brings me to the second reason. I really don't like to use other people's code that I don't understand. When I looked through the Pywikibot code, there were layers of objects and functions calling other objects and functions in different files. After a short period of sorting through the code, I realized that there was no way that I was going to understand what was going with Pywikibot at my current level of skill with Python.

After that experience, I decided to build my bot from the ground up. Obviously that took more time, but in the end I actually understood everything that I was doing and also had a much better idea of how the Wikibase API works. The code that I've written is relatively linear and is liberally annotated with comments. So I hope that people with a moderate level of experience with Python can understand what I did and be able to hack the code to meet their own needs.

Where I last left off

In the previous post about writing to Wikidata, I described a simple script that took data from a CSV file and wrote it to a Wikibase instance (the test Wikidata instance, an independent Wikibase installation, or the real Wikidata).

That script was very limited. It was only able to write statements and could not associate references with those statements nor add qualifiers to the statements. It only created new items and had no way to know if the described entities already existed in the Wikibase instance. It also had no way to track data about items once they had been written. Finally, it simply wrote the data as fast as it could and did not consider whether it should slow its rate due to high load on the Wikibase API.

Where I wanted to be

A major deficiency of the previous script was that its communication with the Wikibase instance was only one-way. It wrote to the API, but made little use of the API's response and it made no use of Wikibase's capabilities to respond to SPARQL queries. The workflow that I wanted to facilitate was more complicated.

I wanted the script to first send a SPARQL query to the Query Service to determine which data (including references and qualifiers) that I wanted to write already existed in Wikidata. (From this point forward, I'm going to refer to the "real" Wikidata instance of Wikibase, so I will stop talking about Wikibase generically.) That information would then be used to determine for each record whether the script needed to: create a new item, add or change labels and descriptions, add statements to an existing item, to add references and qualifiers to existing statements, or do nothing because all of the desired information was already there.

Once it was determined what needed to be written; the script would then compose the appropriate JSON (based on the form of "snaks" in the Wikibase model) for an item and send it to the API. Using the response from the API, the script would update the records to indicate that the data were now present in Wikidata. Based on feedback from the API, the script would also limit its request rate to avoid hitting it too fast at times of high usage.

Eventually, the data uploaded to the API would become available via the Query Service, making it possible to track in the future whether the data were still present in Wikidata.

What is VanderBot?

The simple answer to this question is that VanderBot is the set of Python scripts that I created to write data to Wikidata. The code is freely available in GitHub. However, the question is a little more complicated than that.

When an application communicates with server over the internet, it is technically known as a "User-Agent". It is considered a polite and good practice for a User-Agent to identify itself to the server via an HTTP request header. When I use the scripts I've written, I send the header

VanderBot/0.8 (https://github.com/HeardLibrary/linked-data/tree/master/publications; mailto:steve.baskauf@vanderbilt.edu)

So VanderBot is also the name of a User-Agent. Technically, if you used my script without editing it, you would be using the VanderBot User-Agent, but it probably would be better to not send the header above, since I don't want server administrators to email me if you do bad things to their server. So you should change the User-Agent header values if you use or modify the VanderBot code. (Similarly, you should also change the tool name and email address sent to the NCBI API in that part of the code - please do not use mine!)

When you write to the Wikidata API, you need to be logged in as a Wikidata user. I have created a Wikidata user account called VanderBot, so if I make edits using that account, they are credited to VanderBot in the page history. So VanderBot is also a registered bot in Wikidata. But since you don't have my VanderBot access credentials, you can't make edits to Wikidata as VanderBot even if you use the Vanderbot scripts.

So the complicated answer is that you are welcome to use the VanderBot code, you probably shouldn't be using "VanderBot" in a User-Agent header (and definitely not my email address), and you can't use the VanderBot Wikidata bot account.

Upcoming posts

In part 2 of this series, I will talk about the Wikibase data model and identifiers used for entities in the Wikidata graph. The model and identifier system influenced my choices about how to write the code.

In part 3, I will describe the API writing script that maps tabular data to the Wikibase model, then writes those data to the Wikidata API.

In the final part 4, I will describe the data harvesting script that is used to assemble the data to be written to Wikidata and that ensures that duplicate data are not added.

Understanding the Standards Documentation Specification, Part 6: The rs.tdwg.org repository

2019-10-23T14:54:00.000-07:00

This is the sixth and final post in a series on the TDWG Standards Documentation Specification (SDS). The five earlier posts explain the history and model of the SDS, and how to retrieve the machine-readable metadata about TDWG standards.

Note: this post was revised on 2020-03-04 when IRI dereferencing of the http://rs.tdwg.org/ subdomain went from testing into production.

Where do standards data live?

In earlier posts in this series, I said that after the SDS was adopted, there wasn't any particular plan for actually putting it into practice. Since I had a vested interest in its success, I took it upon myself to work on the details of its implementation, particularly with respect to making standards metadata available in machine-readable form.

The SDS is silent about where data should live and how it should be turned into the various serializations that should be available when clients dereference resource IRIs. My thinking on this subject was influenced by my observations about previous management of TDWG standards data. In the past, the following things have happened to TDWG standards data:

the standards documents for TAPIR were accidentally overwritten and lost.
the authoritative Darwin Core (DwC) documents were locked up on a proprietary publishing system where only a few people could look at them or even know what was there.
the normative Darwin Core document was written in RDF/XML, which no one could read and which was in a document that had to be edited by hand.

Given that background I was pretty convinced that the place for the standards data to live was in a public GitHub repository. I was able to have a repository called rs.tdwg.org set up in the TDWG GitHub site for the purpose of storing the standards metadata.

Form of the standards metadata

Given past problems with formats that have become obsolete or that were difficult to read and edit, I was convinced that the standards metadata should be in a simple format. To me the obvious format was CSV.

At the time I started working on this project, I had been working on an application to transform CSV spreadsheets into various forms of RDF, so I had already been thinking about how the CSV spreadsheets should be set up to do that. I liked the model used for DwC Archives (DwC-A) and defined in the DwC text guide.

Example metadata CSV file for terms defined by Audubon Core: audubon.csv

In the DwC-A model, each table is "about" some class of thing. Each row in a data table represents an instance of that class, and each column represents some property of those instances. The contents of each cell represent the value of the property for that instance.

Darwin Core Archive model (from the Darwin Core Text Guide)

In order to associate the columns with their property terms, DwC Archives use an XML file (meta.xml) that associates the intended properties with the columns of the spreadsheet. Since a flat spreadsheet can't handle one-to-many relationships very well, the model connects the instances in the core spreadsheet with extension tables that allow properties to have multiple values.

For the purposes of generating RDF, the form of the meta.xml file is not adequate. One problem is that the meta.xml file does not indicate whether the value (known in RDF as the object) recorded in the cell is supposed to be a literal (string) or an IRI. A second problem is that in RDF values of properties can also have language tags or datatypes if they are not plain literals. Finally, a DwC Archive assumes that a row is a single type of thing, but actually a row may actually contain information about several types of things.

Example CSV mapping file: audubon-column-mappings.csv

For those reasons I ended up creating my own form of mapping file -- another CSV file rather than a file in XML format. I won't go into more details here, since I've already described the system of files in another blog post. But you can see from the example above that the file relates the column headers to properties, indicates the type of object (IRI, plain literal, datatyped literal, or language tagged literal), and provides the value of the language tag or datatype. The final column indicates whether that column applies to the main subject of the table or an instance of another class that has a one-to-one relationship with the subject resource.

Add captionExample extension links file: linked-classes.csv

The links between the core file and the extensions are described in a separate links file (e.g. linked-classes.csv). In this example, extension files are required because each term can have many versions and a term can also replace more than one term. Because in RDF the links can be described by properties in either direction, the links file lists the property linking from the extension to the core file (e.g. dcterms:isVersionOf) and from the core file to the extension (e.g. dcterms:hasVersion).

This system differs a bit from the DwC-A system where the fields in the linked extension files are described within the same meta.xml file. I opted to have a separate mapping file for each extension. The filenames listed in the linked-classes.csv file point to the extension data files and the mapping files associated with the extension data files use the same naming pattern as the mapping files for the core file.

The description of file types above explains most of the many files that you'll find if you look in a particular directory in the rs.tdwg.org repo.

Organization of directories in rs.tdwg.org

The set of files detailed above describe a single category of resources. Most of the directories in the rs.tdwg.org repository contain such a set that is associated with a particular namespace that is in use within a TDWG vocabulary (in the language of the SDS, "term lists"). For example, the directory "audubon" (containing the example files above) describes the current terms minted by Audubon Core and the directory "terms" describes terms minted by Darwin Core. There are also directories that describe terms that are borrowed by Audubon or Darwin Cores. Those directories have names that end with "-for-ac" or "-for-dwc".

For each of the directories that describe terms in a particular namespace, there is another directory that describes the versions of those terms. Those directory names have "-versions" appended to the directory name for their corresponding current terms.

Finally, there are some special directories that describe resources in the TDWG standards hierarchy at levels higher than individual terms: "term-lists", "vocabularies", and "standards". There is also a special directory for documents ("docs") that describe all of the documents that are associated with TDWG standards. Taken together, all of these directories contain the metadata necessary to completely characterize all of the components of TDWG standards.

Using rs.tdwg.org metadata

In theory, one could pick through all of the CSV files that I just described and learn anything you wanted to know about any part of any TDWG standard. However, that would be a lot to ask of a human. The real purpose of the repository is to provide source data for software that can generate the human- and machine-readable serializations that the SDS specifies. By building all of the serializations from the same CSV tables, we can reduce errors caused by human entry and guarantee that a consumer always receives exactly the same metadata regardless of the chosen format.

One option for creating the serializations is to run a build script that generates the serialization as a static file. I used this approach to generate the Audubon Core Term List document. A Python script generates Markdown from the appropriate CSV files. The generated file is pushed to GitHub where it is rendered as a web page via GitHub Pages.

Another option is to generate the serializations on the fly based on the CSV tables. In another blog post I describe my efforts to set up a web service that uses CSV files of the form described above to generate RDF/Turtle, RDF/XML, or JSON-LD serializations of the data. That system has now been implemented for TDWG standards components.

The SDS specifies that if an IRI is dereferenced with an Accept: header for one of the RDF serializations, the server should perform content negotiation (303 redirect) to direct the client to the URL for the serialization they want. For example, when a client that is a browser (with an Accept header of text/html) dereferences the Darwin Core term IRI http://rs.tdwg.org/dwc/terms/recordedBy, it will be redirected to the Darwin Core Quick Reference Guide bookmark for that term. However, if an Accept: header of text/turtle is used, the client will be redirected to http://rs.tdwg.org/dwc/terms/recordedBy.ttl . Similarly, application/rdf+xml redirects to a URL ending in .rdf and application/json or application/ld+json redirects to a URL ending in .json . Those URLs for specific serializations can also be requested directly without requiring content negotiation.

The test system also generates HTML web pages for obsolete Darwin Core terms that otherwise wouldn't be available via the Darwin Core website. For example: http://rs.tdwg.org/dwc/curatorial/Preparations redirects to http://rs.tdwg.org/dwc/curatorial/Preparations.htm, a web page describing an obsolete Darwin Core term from 2007.

Providing term dereferencing of this sort is considered a best practice in the Linked Data community. But for developers interested in obtaining the machine-readable metadata, as a practical matter it's probably easier to just get a machine-readable dump of all of the whole dataset by one of the methods described in my earlier posts. However, having the data available in CSV form on GitHub makes the data available in a primitive "machine-readable" form that doesn't really have anything to do with Linked Data. Anyone can write a script to retrieve the raw CSV files from the GitHub repo and process them using conventional means as long as they understand how the various CSV files within a directory are related to each other. Because of the simplicity of the format of the data, it is highly likely that they will be usable long into the future (or at least as long as GitHub is viable) even if Linked Data falls by the wayside.

Maintaining the CSV files in rs.tdwg.org

The files in rs.tdwg.org were originally manually assembled (by me) laboriously from a variety of sources. All of the current and obsolete Darwin Core data were pulled from the "complete history" RDF/XML file that was formerly maintained as the "normative document" for Darwin Core. Audubon Core terms data were assembled from the somewhat obsolete terms.tdwg.org website. Data on ancient TDWG standards documents and their authors was assembled by a lot of detective work on my part. However, maintaining the CSV files manually is not really a viable option. Whenever a new version of a term is generated, that should spawn a series of new versions up the standards hierarchy. The new term version should result in a new modified date for its corresponding current term, spawn a new version of its containing term list, result in an addition to the list of terms contained in the term list, generate a new version of the whole vocabulary, etc.

It would be unreliable to trust that a human could make all of the necessary modifications to all of the CSV files without errors. It is also unreasonable to expect standards maintainers to have to suffer through editing a bunch of CSV files every time they need to change a term. They should only have to make minimal changes to a single CSV file and the rest of the work should be done by a script.

I've written a Python script within a Jupyter notebook to do that work. Currently the script will make changes to the necessary CSV files for term changes and additions within a single term list (a.k.a. "namespace") of a vocabulary. It currently does not handle term deprecations and replacements -- presumably those will be uncommon enough that they could be done by manual editing. It also doesn't handle changes to the documents metadata. I haven't really implemented document versioning on rs.tdwg.org, mostly because that's either lost or unknown information for all of the older standards. That should change in the future, but it just isn't something I've had the time to work on yet.

Some final notes

Some might take issue with the fact that I've somewhat unilaterally made these implementation decisions (although I did discuss them with a number of key TDWG people during the time when I was setting up the rs.tdwg.org repo). The problem is that TDWG doesn't really have a very formal mechanism for handling this kind of work. There is the TAG and an Infrastructure interest group, but neither of them currently have operational procedures for this kind of implementation. Fortunately, TDWG generally has given a fairly free hand to people who are willing to do the work necessary for standards development, and I've received encouragement on this work, for which I'm grateful.

I feel relatively confident about the approach of archiving the standards data as CSV files. With respect to the method of mapping the columns to properties and my ad hoc system for linking tables, I think it would actually be better to use the JSON metadata description files specified in the W3C standard for generating RDF from CSV files. I wasn't aware of that standard when I started working on the project, but it would probably be a better way to clarify the relationships between CSV tables and to impart meaning to their columns.

So far the system that I created for dereferencing the rs.tdwg.org IRIs seems to be adequate. In the long run, it might be better to use an alternative system. One is to simply have a build script that generates all of the possible serializations as static files. There would be a lot of them, but who cares? They could then be served by a much simpler script that just carried out the content negotiation but did not actually have to generate the pages. Another alternative would be to pay a professional to create a better system. That would involve a commitment of funds on the part of TDWG. But in either case the alternative systems could draw their data from the CSV files in rs.tdwg.org as they currently exist.

When we were near the adoption of the SDS, someone asked whether the model we developed was too complicated. My answer was that it was just complicated enough to do all of the things that people said that they wanted. One of my goals in this implementation project was to show that it actually was possible to fully implement the SDS as we wrote it. Although the mechanism for managing and delivering the data may change in the future, the system that I've developed shows that it's reasonable to expect that TDWG can dereference (with content negotiation) the IRIs for all of the terms that it mints, and to provide a full version history for every term, vocabulary, and document that we've published in the past.

Note: although this is the last post in this series, some people have asked about how one would actually build a new vocabulary using this system. I'll try to write a follow-up showing how it can be done.

Comparing the ABCD model to Darwin Core

2019-06-11T09:41:00.001-07:00

This post is very focused on the details of two Biodiversity Information Standards (TDWG) standards as they relate to Linked Data and graph models. If you are generally interested in approaches to Linked Data graph modeling, you might find it interesting. Otherwise, if you aren't into TDWG standards, you may zone out.

Background

The TDWG Darwin Core and Access to Biological Collection Data (ABCD) standards

Access to Biological Collection Data (ABCD) is a standard of Biodiversity Information Standards (TDWG). It is classified as a "Current Standard" but is in a special category called "2005" standard because it was ratified just before the present TDWG by-laws (which specify the details of the standards development process) were adopted in 2006. Originally, ABCD was defined as an XML schema that could be used to validate XML records that describe biodiversity resources. The various versions of the ABCD XML schema can be found in the ABCD GitHub repository.

Darwin Core (DwC) is a current standard of TDWG that was ratified in 2009. It is modeled after Dublin Core, with which it shares many similarities. Biodiversity data can be transmitted in several ways: as simple spreadsheets, as XML, and as text files structured in a form known as a Darwin Core Archive.

Nearly all of the more than 1.3 billion records in the Global Biodiversity Information Facility (GBIF) have been marked up in either DwC or ABCD.

My role in Darwin Core

For some time I've been interested in the possibility of using Darwin Core terms as a way to transmit biodiversity data as Linked Open Data (LOD). That interest has manifested itself in my being involved in three ways with the development of Darwin Core:

as the instigator of the establishment of the dwc:Organism class
as the shepherd of the clarification of definitions of all of the Darwin Core (dwc: namespace) Classes and deprecation of the confusing alternative Darwin Core type vocabulary (dwctype: namespace) classes.
as the lead author of the Darwin Core RDF Guide (for details, see http://dx.doi.org/10.3233/SW-150199; open access at http://bit.ly/2e7i3Sj)

All three of these official changes to Darwin Core were approved by decision of the TDWG Executive Committee on October 26, 2014. Along with Cam Webb, I also was involved in an unofficial effort called Darwin-SW (DSW) to develop an RDF ontology to create the graph model and object properties that were missing from the Darwin Core vocabulary. (For details, see http://dx.doi.org/10.3233/SW-150203; open access at http://bit.ly/2dG85b5.) More on that later...

I've had no role with ABCD and honestly, I was pretty daunted about the prospect of plowing through the XML schema to try to understand how it worked. However, I've recently been using some new tools Linked Data tools to explore ABCD and they have been instrumental for putting the material together for this blog. More about them later...

A common model for ABCD and Darwin Core?

Recently, a call went out to people interested in developing a common model for TDWG that would encompass both ABCD and DwC. Because of my past interest in using Darwin Core terms as RDF, I joined the group, which has met online once so far. Because of my basic ignorance about ABCD, I've recently put in some time to try to understand the existing model for ABCD and how it is similar or different from Darwin Core. In the following sections, I'll discuss some issues with modeling Darwin Core, then report on what I've learned about ABCD and how it compares to Darwin Core.

Darwin Core's missing graph model

One of the things that surprises some people is that although a DwC RDF Guide exists, it is not currently possible to express biodiversity data as RDF using only terms currently in the standard.

What the RDF Guide does is to clear up how the existing terms of Darwin Core should be used and to mint some new terms that can be used for creating links between resources (i.e. to non-literal objects of triples). For example, as adopted, Darwin Core had the term dwc:recordedBy (http://rs.tdwg.org/dwc/terms/recordedBy) to indicate the person who recorded the occurrence of an organism. However, it was not clear whether the value of this term (i.e. the object of a triple of which the predicate was dwc:recordedBy) should be a literal (i.e. a name string) or an IRI (i.e. an identifier denoting an agent). The RDF Guide establishes that dwc:recordedBy should be used with a literal value, and that a new term, dwciri:recordedBy (http://rs.tdwg.org/dwc/iri/recordedBy) should be used to link to an IRI denoting an agent (i.e. a non-literal value). For each term in Darwin Core where it seemed appropriate for an existing term to have a non-literal (IRI) value, a dwciri: namespace analog of that term was created. The terms affected by this decision are detailed in the Term reference section of the guide.

So with the RDF Guide, it is now possible to express a lot of Darwin Core metadata as RDF. But at the time of the adoption of the RDF Guide there were no existing DwC terms that linked instances of the DwC classes (i.e. object properties), so there was no way to fully express a dataset as RDF. (Another way of saying this is that Darwin Core did not have a graph model for its classes.) It seems like there should be a simple solution to that problem: just define some object properties to connect the classes. But as Joel Sachs and I describe in a recent book chapter, that's not as simple as it seems. In section 3.2 of the chapter, we show how users with varying interests may want to use graph models that are more or less complex, and that inconsistencies on those models makes it difficult to query across datasets that use different models.

The Darwin Core RDF Guide was developed not long after a bruising, year-long online discussion about modeling Darwin Core (see this page for a summary of the gory details). It was clear that if we had planned to include a graph model and the necessary object properties, the RDF Guide would probably never get finished. So it was decided to create the RDF Guide to deal with the existing terms and leave the development of a graph model as a later effort.

Darwin-SW's graph model

After the exhausting online discussion (argument?) about modeling Darwin Core, I was so burned out on the subject, I had decided that I was basically done with that subject. However, Cam Webb, the eternal optimist, contacted me and said that we should just jump in and try to create a QL-type ontology that had the missing object properties. (See "For further reference" at the end for definitions of "ontology").

What made that project feasible was that despite the rancor of the online discussion, there actually did seem to be some degree of consensus about a model based on historical work done 20 years earlier. Rich Pyle had laid out a diagram of a model that we were discussing and Greg Whitbread noted that it was quite similar to the Association of Systematics Collections (ASC) model of 1993. All Cam and I really had to do was to create object properties to connect all of the nodes on Rich's diagram. We worked on it for a couple of weeks and the first draft of Darwin-SW (DSW) was done!

The diagram above shows the DSW graph model overlaid upon the ACS entity-relation (ER) diagram. I realize that it's impossible to see the details in this image, but you can download a poster-sized PowerPoint diagram from this page to see the details.

DSW differs a little from the ASC model in that it includes two Darwin Core classes (dwc:Organism and dwc:Occurrence) that weren't dealt with in the ACS model. Since the ACS model dealt only with museum specimens, it did not include the classes of Darwin Core that were developed later to deal with repeated records of the same organism, or records documented by forms of evidence other than specimens (i.e. human and machine observations, media, living specimens, etc.). But other than that, the DSW model is just a simplified version of the ACS model.

The diagram above shows the core of the DSW graph model (available poster-sized here if you have trouble seeing the details). The six red bubbles are the six major classes defined by Darwin Core. The yellow bubble is FOAF's Agent class, which can be linked DwC classes by two terms from the dwciri: namespace. The object of dwc:eventDate is a literal, and dwciri:toTaxon links to some yet-to-be-fully-described taxon-like entity that will hopefully be fleshed out by a successor to the Taxon Concept Transfer Schema (TCS) standard, but whose place is currently being held by the dwc:Taxon class. The seven object properties printed in blue are DSW's attempt to fill in the object properties that are missing from the Darwin Core standard.

The blue bubble, dsw:Token, is one of the few classes that we defined in DSW instead of borrowing from elsewhere. We probably should have called it dsw:Evidence, because "evidence" is what it represents, but too late now. I will talk more about the Token class in the next section.

What's an Occurrence???

One of the longstanding and vexing questions of users of Darwin Core is "what the heck is an occurrence?" The origin of dwc:Occurrence predates my involvement with TDWG, but I believe that its creation was to solve the problem of overlap of terms that applied to both observations and preserved specimens. For example, you could have terms called dwc:observer and dwc:collector, with observer being used with observations and collector being used with specimens. Similarly, you could have dwc:observationRemarks for observations and dwc:collectionRemarks for specimens. But fundamentally, both an observer and a collector are creating a record that an organism was at some place at some time, so why have two different terms for them? Why have two separate remarks term when one would do? So the dwc:Occurrence class was created as an artificial class to organize terms that applied to both specimens and observations (like the two terms dwc:recordedBy and dwc:occurrenceRemark that replace the four terms above). Any terms that applied to only specimens (like dwc:preparations and dwc:disposition) were thrown in the Occurrence group as well.

So for some time, dwc:Occurrence was considered by many to be a sort of superclass for both specimens and observations. However, its definition was pretty murky and a bit circular. Prior to our clarification of class definitions in October 2014, the definition was "The category of information pertaining to evidence of an occurrence in nature, in a collection, or in a dataset (specimen, observation, etc.)." After the class definition cleanup, it was "An existence of an Organism (sensu http://rs.tdwg.org/dwc/terms/Organism) at a particular place at a particular time." That's still a bit obtuse, but appropriate for an artificial class whose instances document that an organism was at a certain place at a certain time.

What DSW does is to clearly separate the artificial Occurrence class from the actual resources that serve to document that the organism occurred. The dsw:Token class is a superclass for any kind of resource that can serve as evidence for the Occurrence. The class name, Token, comes from the fact that the evidence also has a dsw:derivedFrom relationship with the organism that was documented -- it's a kind of token that represents the organism. There is no particular limit to what type of thing can be a token; it can be a preserved specimen, living specimen, image, machine record, DNA sequence, or any other kind of thing that can serve as evidence for an occurrence and is derived in some way from the documented organism. The properties of Tokens are any properties appropriate for any class of evidence: dwc:preparation for preserved specimens, ac:caption for images, etc.

Investigating ABCD

I mentioned that recently I gained access to some relatively new Linked Data tools for investigating ABCD. One that I'm really excited about is a Wikibase instance that is loaded with the ABCD terminology data. If you've read any of my recent blog posts, you'll know that I'm very interested in learning how Wikibase can be used as a way to manage Linked Data. So I was really excited both to see how the ABCD team had fit the ABCD model into the Wikibase model and also to be able to use the built-in Query service to explore the ABCD model.

The other useful thing that I just recently discovered is an ABCD OWL ontology document in RDF/XML serialization. It was loaded into the ABCD GitHub repo only a few days ago, so I'm excited to be able to use it as a point of comparison with the Wikibase data. I've loaded the ontology triples into the Vanderbilt Libraries' triplestore as the named graph http://rs.tdwg.org/abcd/terms/ so that I can query it using the SPARQL endpoint. In most of the comparisons that I've done, the results from the OWL document and the Wikibase data are identical. (As I noted in the "Time for a Snak" section of my previous post, the Wikibase data model differs significantly from the standard RDFS model of class, range, domain, etc. So querying the two data sources requires some significant adjustments in the actual queries used in order to fit the model that the data are encoded in.)

One caveat is that ABCD 3.0 is currently under development and the Wikibase installation is clearly marked as "experimental". So I'm assuming that the data both there and in the ontology are subject to change. Nevertheless, both of these data sources has given me a much better understanding of how ABCD models the biodiversity universe.

Term types

The Main Page of the Wikibase installation gives a good explanation of the types of terms included in its dataset. In the description, they use the word "concept", but I prefer to restrict the use of the word "concept" to what I consider to be its standard use: for controlled vocabulary terms. (See the "For further reference" section for more on this.") So to translate their "types" list, I would say they describe one type of vocabulary (Controlled Vocabulary Q14) and four types of terms: Class Q32, Object Property Q33, Datatype Property Q34, and Controlled Term (i.e. concept) Q16.

For comparison purposes, the class and property terms in the abcd_concepts.owl OWL ontology are typed as: owl:Class, owl:ObjectProperty, and owl:DatatypeProperty. The controlled vocabularies are typed as owl:Class rather than skos:ConceptScheme, so subsequently the controlled vocabulary terms are typed as instances of the classes that correspond to their containing controlled vocabularies (e.g. abcd:Female rdf:type abcd:Sex), rather than as skos:Concept. It's a valid modeling choice, but isn't according to the recommendations of the TDWG Standards Documentation Specification. (More details about this later in the " The place of controlled vocabularies in the model" section.)

The query service makes it easy to discover what properties have actually been used with each type of term. Here is an example for Classes:

PREFIX bwd: <http://wiki.bgbm.org/entity/>

PREFIX bwdt: <http://wiki.bgbm.org/prop/direct/>

SELECT DISTINCT ?predicate ?label WHERE {

?concept bwdt:P8 bwd:Q219.

?concept bwdt:P9 bwd:Q32.

?concept bwdt:P25 ?name.

?concept ?predicate ?value.

OPTIONAL {

?genericProp wikibase:directClaim ?predicate.

?genericProp rdfs:label ?label.

}

MINUS {

?otherGenericProp wikibase:claim ?predicate.

}

ORDER BY ?predicate

This query is complicated a bit by the somewhat complex way that Wikibase handles properties and their labels (see this for details), but you can see that it works by going to https://wiki.bgbm.org/bdidata/query/ and pasting the query into the box.

One of the cool things that the Wikibase Query service allows you to do is copy the link from the browser URL bar and the link contains the query itself as part of the URL. This means that you can link directly to the query so that when you click on the link, the query will load itself into the Query Service GUI box. So to avoid cluttering up this post with cut and paste queries, I'll just link the queries like this: properties used with object properties, datatype properties, controlled terms, and controlled vocabularies.

If you run each of the queries, you'll see that the properties used to describe the various term and vocabulary types are similar to the table shown at the bottom of the Main Page.

Classes

One of the things I was interested in finding out about were the classes that were included in ABCD. This query will create a table of all of the classes in ABCD 3.0 along with basic information about them. One thing that is very clear from running the query is that ABCD has a LOT more classes (57) than DwC (15). Fortunately, the classes are grouped into categories based on the core classes they are associated with. This was really helpful for me because it made it obvious to me that Gathering, Unit, and Identification were key classes in the model. The Identification class was basically the same as the dwc:identification class of Darwin Core. The Gathering class, defined as "A class to describe a collection or observation event." seems to be more or less synonymous to the dwc:Event class. The Unit class, defined as "A class to join all data referring to a unit such as specimen or observation record" is almost exactly how I described the dwc:Occurrence class: an artificial class that's used to group properties that are common to specimens and observations.

Object properties

Another key thing that I wanted to know was how the ABCD 3.0 graph model compared with the DSW graph model. In order to do that, I needed to study the object properties and find out how they connected instances of classes.

As we can see from the table of term properties on the Main Page, object properties are required to have a defined range. They are not required to have a domain. Cam and I got a lot of flack when we assigned ranges and domains to object properties in DSW because of the way ranges and domains can generate unintended entailments. There is a common misconception that if one assigns a range to an object property that it REQUIRES that the object to be in instance of that class. Actually what it does is to entail that the object IS an instance of that class, whether that makes sense or not. We were OK with assigning ranges and domains in DSW because we didn't want people to use the DSW object properties to link class instances other than those that we specified in our mode - if people ignored our guidance, then they got unintended entailments. In ABCD the object properties all have names like "hasX", so if the object of a triple using the property isn't an instance of class "X", it's pretty silly to use that property. So here is makes some sense to assign ranges. Perhaps wisely, few of the ABCD object properties have the optional domain declaration. That allows those properties to be used with subject resources other than types that might have been originally envisioned without it entailing anything silly.

Instead of assigning domains, ABCD uses the property abcd:associatedWithClass to indicate the class or classes whose instances you'd expect to have that property. Here's a query that lists all of the object properties, their ranges, and the subject class with which they are associated. The query shows that there are a much larger number of link types (135) than DSW has. That's to be expected since there are a lot more classes. The actual number of ABCD object properties (88) is less than the number of link types because some of the object properties are used to link more than one combination of class instances.

Comparison of the DSW and ABCD graph model

Color coding described in text

I went through the rather labor-intensive process of creating a PowerPoint diagram (above) that overlays part of the ABCD graph model on top of the DSW graph diagram that I showed previously. (There are other ABCD classes that I did't include because the diagram was too crowded and I was getting tired.) Although ABCD has a whole bunch of extra classes that don't correspond to DwC classes, the main DwC classes are have ABCD analogs that are connected in a very similar manner to the way they are connected in DSW. The resemblance is actually rather striking.

Here are a few notes about the diagram. First of all, it isn't surprising that ABCD doesn't have an Organism class that corresponds to dwc:Organism. As its name indicates, "Access to Biological Collections Data" is focused primarily on data from collections. As I learned from the fight to get dwc:Organism added to Darwin Core, collections people don't care much about repeated observations. They generally only sample an organism once since they usually kill it in the process. So they rarely have to deal with multiple occurrences linked to the same organism. However, people who track live whales or band birds care about the dwc:Organism class a lot since its primary purpose is to enable one-to-many relationships between organisms and occurrences (as opposed to having the purpose of creating some kind of semantic model of organisms).

Another obvious difference is the absence of any Location class that's separate from abcd:Gathering. Another common theme in discussing a model for Darwin Core was whether there was any need to have a dwc:Event class in addition to the dcterms:Location class, or if we could just denormalize it out of existence. In that case, the disagreement was between collections people (who often only collect at a particular location once) and people who conducted long-term monitoring of sites (who therefore had many sampling Events at one Location).

The general theme here is that people who don't have one-to-many (or many-to-many) relationships between classes don't see the need for the extra classes and omit them from their graph model. But the more diverse the kinds of datasets we want to handle with the model, the more complicated the core graph model needs to be.

The other thing that surprised me a little in the ABCD graph model was that the "Unit" was connected to the "Gathering Agent" through an instance of abcd:FieldNumber, instead of being connected directly as does dwciri:recordedBy. I guess that makes sense if there's a one-to-many relationship between the Unit and the FieldNumber (several Gathering Agents assign their own FieldNumber to the Unit). There are some parallels with dwciri:fieldNumber, although it is defined to have a subject that is field notes and an object that is a dwc:Event. (see table 3.7 in the DwC RDF Guide). Clearly there would be some work required to harmonize DwC and ABCD in this area.

The other part of the two graph models I want to draw attention to is the area of dsw:Token.

There are two different ways of imagining the dsw:Token class. One way is to say that dsw:Token is a class that includes every kind of evidence. In that view, we enumerate the token classes we can think of, then define them using the properties associated with those kinds of evidence. The other way to think about it is to say that all of the properties that we can't join together under the banner of dwc:Occurrence get grouped under an appropriate kind of token. In that view, our job is to sort properties, and we then name the token classes as a way to group the sorted properties. These are really just two different ways of describing the same thing.

The ABCD analog of the dsw:Token class is the class abcd:TypeSpecificInformation. Its definition is: "A super class to create and link to type specific information about a unit." Recall that the definition of a Unit is "A class to join all data referring to a unit such as specimen or observation record". These definitions correspond to the "sorting out of properties" view I described above. Properties common to all kinds of evidence are organized together under the Unit class, but properties that are not common get sorted out into the appropriate specific subclass of abcd:TypeSpecificInformation.

ABCD class hierarchy

The diagram above shows the "enumeration of types of evidence" view. In the diagram, you can see most of the imaginable kinds of specific evidence types listed as subclasses of abcd:TypeSpecificInformation. These subclasses correspond with some of the possible DwC classes that could serve as Tokens: abcd:HerbariumUnit corresponds to dwc:PreservedSpecimen, abcd:BotanicalGardenUnit corresponds to dwc:LivingSpecimen, abcd:ObservationUnit corresponds to dwc:HumanObservation, etc.

Object properties linking abcd:Unit instances and instances of subclasses of abcd:TypeSpecificInformation

In the same way that DSW uses the object property dsw:evidenceFor to link Tokens and Occurrences, ABCD uses the object property abcd:hasTypeSpecificInformation to link abcd:TypeSpecificInformation instances to Units. In addition, ABCD defines separate object properties that link an abcd:Unit to instances of each subclass of abcd:TypeSpecificInformation. To find all of those properties, I ran this query; the specific object properties are all shown in the diagram above.

Clearly, the diagram above diagram is too complicated to insert as part of the man diagram comparing ABCD and DwC. Instead, I abbreviated it in the main diagram as shown in the following detail:

In this part of the diagram, I generalized the nine subclasses by a single bubble for the superclass abcd:TypeSpecificInformation. The link from the Unit to the evidence instance can be made through the abcd:hasTypeSpecificInformation or it can be made using one of the nine object properties that connect the Unit directly to the evidence.

In addition, I also placed abcd:MultimediaObject in the position of dsw:Token. Although images (and other kinds of multimedia) taken directly of the organism at the time the occurrence is recorded is often ignored by the museum community, with the flood of data coming from iNaturalist into GBIF, media is now a very important type of direct evidence for occurrences.

So in general, abcd:TypeSpecificInformation is synonymous with dsw:Token, with the exception that multimedia objects can serve as Tokens but aren't explicitly listed as subclasses of abcd:TypeSpecificInformation.

The place of controlled vocabularies in the model

The last major difference between the ABCD model and Darwin Core is how they deal with controlled vocabularies. Take for example the property abcd:hasSex. In the Wikibase installation, it's item Q1057 and has the range abcd:Sex. The range property would entail that abcd:Sex is a Class, but it's type is given in the Wikibase installation as Controlled Vocabulary rather than Class. As I mentioned earlier, in the abcd_concepts.owl ontology document, the controlled vocabularies are actually typed as owl:Class rather than skos:ConceptScheme as I would expect, with the controlled terms as instances of the controlled vocabularies.

So let's assume we have an abcd:Unit instance called _:occurrence1 that is a female. Using the model of ABCD, the following triples could describe the situation:

abcd:Sex a rdfs:Class.

abcd:hasSex a owl:ObjectProperty;

rdfs:range abcd:Sex.

abcd:Female a abcd:Sex;

rdfs:label "female"@en.

_:occurrence1 abcd:hasSex abcd:Female.

Currently, there are many terms in Darwin Core that say "Recommended best practice is to use a controlled vocabulary." However, most of these terms do not (yet) have controlled vocabularies, although this could change soon. Let's assume that the Standards Documentation Specification is followed and a SKOS-based controlled vocabulary identified by the IRI dwcv:gender is created to be used to provide values for the term dwciri:sex. Assume that the controlled vocabulary contains the terms dwcv:male and dwcv:female. The following triples could then describe the situation:

dwcv:gender a skos:ConceptScheme.

dwcv:female a skos:Concept;

skos:prefLabel "female"@en;

rdf:value "female";

skos:inScheme dwcv:gender.

_:occurrence1 dwc:sex "female".

_:occurrence1 dwciri:sex dwcv:female.

From the standpoint of generic modeling, neither of these approaches are "right" or "wrong". However, the latter approach is consistent with sections 4.1.2, 4.5, and 4.5.4 of the TDWG Standards Documentation Specification as well as the pattern noted for controlled vocabularies in section 8.9 of the W3C Data on the Web Best Practices recommendation.

One reason that the ABCD graph diagram is more complicated than the DSW graph diagram is that some classes shown on the ABCD diagram as yellow bubbles (abcd:RecordBasis and abcd:Sex) and other classes not shown (like abcd:PermitType, abcd:NomenclaturalCode, etc.) represent controlled vocabularies rather than classes of linked resources.

Final thoughts

I have to say that I was somewhat surprised at how similar the ABCD and Darwin-SW graph models were. Perhaps I shouldn't be that surprised, given the DSW model's roots in the ACS model - it generally reflects the way the collections community views the universe and that view undoubtedly informs the ABCD model as well. That's good news, because it means that it should be possible to create a consensus graph model for Darwin Core and ABCD with minimal changes to either standard.

With such a model, it should be possible using SPARQL CONSTRUCT queries mediated by software to perform automated conversions from Darwin Core linked data to ABCD linked data. The CONSTRUCT query could insert blank nodes in places where the ABCD model has classes that aren't included in DwC. The conversion in the other direction would be more difficult since classes included in ABCD that aren't in DwC would have to be eliminated to make the conversion, and that might result in data loss as the data were denormalized. Still, the idea of any automated conversion is an encouraging thought!

The other thing that is clear to me from this investigation is that the current DwC and ABCD vocabularies could relatively easily be further developed into QL-like ontologies. That's basically what has already been done in the abcd_concepts.owl ontology document and in DSW. It has been suggested that TDWG ontology development be carried out using the OBO Foundry system, but that system is designed to create and maintain EL-like ontologies. Transforming Darwin Core and ABCD to EL-like ontologies would be be much more difficult and it is not clear to me what would be gained by that, given that the primary use case for ontology development in TDWG would be to facilitate querying of large volumes of instance data.

For further reference

Ontologies vs. controlled vocabularies

The distinction between ontologies and controlled vocabularies is discussed in several standards:

To paraphrase these references, there is a fundamental difference between ontologies and controlled vocabularies. Ontologies define knowledge related to some shared conceptualization in a formal way so that machines can carry out reasoning. They aren't primarily designed for human interaction. Controlled vocabularies are designed to help humans use natural language to organize and find items by associating consistent labels with concepts. Controlled vocabularies don't assert axioms or facts. A thesaurus (sensu ISO 25964) is a subset of controlled vocabulary where its concepts are organized with explicit relationships (e.g. broader, narrower, etc.).

The Data on the Web Best Practices recommendation notes in section 8.9 that controlled vocabularies and ontologies can be used together when the concepts defined in the controlled vocabulary are used as values for a property defined in an ontology. It gives the following example: "A concept from a thesaurus, say, 'architecture', will for example be used in the subject field for a book description (where 'subject' has been defined in an ontology for books)."

Kinds of ontologies

The Introduction of the W3C OWL 2 Web Ontology Language Profiles Recommendation describes several profiles or sublanguages of the OWL 2 language for building ontologies. These profiles place restrictions on the structure of OWL 2 ontologies in ways that make them more efficient for dealing with data of different sorts. The nature of these restrictions are very technical and way beyond the scope of this post, but I mention the profiles because they provide a convenient way the characterize ontology modeling approaches. (I also refer you to this post, which offers a very succinct description of the difference in the profiles.)

OWL 2 EL is suitable for "applications employing ontologies that define very large numbers of classes and/or properties". A classic example of such an ontology is the Gene Ontology, where the data themselves are represented as tens of thousands of classes. OWL 2 QL is suitable for "applications that use large volumes of instance data, and where query answering is the most important reasoning activity." A classic example of such an ontology is the GeoNames ontology, which contains only 7 classes and 28 properties, but is used with over eleven million place feature instances. In OWL 2 QL, query answering can be implemented using conventional relational database systems.

I refer to ontologies with many classes and properties for which OWL 2 EL is suitable as "EL-like ontologies", and ontologies with few classes and properties used with lots of instance data for which OWL 2 QL is suitable as "QL-like ontologies".

Vocabularies and terms

Section 8.9 of the W3C Data on the Web Best Practices Recommendation describes vocabularies and terms in this way:

Vocabularies define the concepts and relationships (also referred to as “terms” or “attributes”) used to describe and represent an area of interest. They are used to classify the terms that can be used in a particular application, characterize possible relationships, and define possible constraints on using those terms. Several near-synonyms for 'vocabulary' have been coined, for example, ontology, controlled vocabulary, thesaurus, taxonomy, code list, semantic network.

So a vocabulary is a broad category that includes both ontologies and controlled vocabularies, and it is a collection of terms. In this post, I use "vocabulary" and "term" in this context and avoid using the word "concept" unless I specifically mean it in the sense of a skos:Concept (i.e. a term in controlled vocabulary).

Note: this was originally posted 2019-06-11 but was edited on 2019-06-12 to clarify the position of the subclasses of abcd:hasTypeSpecificInformation in the model.

Putting Data into Wikidata using Software

2019-06-04T15:39:00.005-07:00

This is a followup post to an earlier post about getting data out of Wikidata, so although what I'm writing about here doesn't really depend on having read that post, you might want to take a look at it for background.

Note added 2021-03-13: Although this post is still relevant for understanding some of the basic ideas about writing to a Wikibase API (including Wikidata's), I have written another series of blog posts showing (with lots of screenshots and handholding) how you can safely write your own data to the Wikidata API using data that is stored in simple CSV spreadsheets. See this post for details.

Image from Wikimedia Commons; licensing murky but open

What do I mean by "putting data into Wikidata"?

I have two confessions to make right at the start. To some extent, the title of this post is misleading. What I am actually going to talk about is putting data into Wikibase, which isn't exactly the same thing as Wikidata. I'll explain about that in a moment. The second confession is that if all you really want are the technical details of how to write to Wikibase/Wikidata and the do-it-yourself scripts, you can just skip reading the rest of this post and go directly to a web page that I've already written on that subject. But hopefully you will read on and try the scripts after you've read the background information here.

Wikibase is the underlying application upon which Wikidata is built. So if you are able to write to Wikibase using a script, you are also able to use that same script to write to Wikidata. However, there is an important difference between the two. If you create your own instance of Wikibase, it is essentially a blank version of Wikidata into which you can put your own data, and whose properties you can tweak in any way that you want. In contrast, Wikidata is a community-supported project that contains data from many sources, and which has properties that have been developed by consensus. So you can't just do whatever you want with Wikidata. (Well, actually you can, but your changes might be reverted and you might get banned if you do things that the community considers bad.)

So before you start using a script to mess with the "real" Wikidata, it's really important to first understand the expectations and social conventions of the Wikidata community. Although I've been messing around with scripting interactions with Wikibase and Wikidata for months, I have not turned a script loose on the "real" Wikidata yet because I still have some work to do to meet the community expectations.

Before you start using a script to make edits to the real Wikidata, at a minimum you need to do the following:

read the MediaWiki API Etiquette page
study and understand the information on the Maxlag parameter
understand the bot approval process and follow the guidelines for creating and operating a Wikidata bot
test your script extensively on the Wikidata test instance

If you are only thinking about using a script to write to your own instance of Wikibase, you can ignore the steps above and just hack away. The worse case scenario is that you'll have to blow the whole thing up and start over, which is not that big of a deal if you haven't yet invested a lot of time in loading data.

Some basic background on Wikibase

Although we tend to talk about Wikibase as if it were a single application, it actually consists of several applications operating together in a coordinated installation. This is somewhat of a gory detail that we can usually ignore. However, having a basic understanding the structure of Wikidata will help us to understand why we even though Wikidata supports Linked Data, we have to write to Wikidata through the MediaWiki API. (Full disclosure: I'm not an expert on Wikibase and what I say here is based on the understanding that I have gained based on my own explorations.)

We can see the various pieces of Wikibase by looking its Docker Compose YAML file. Here are some of them:

a mysql database
a Blazegraph triplestore backend (exposed on port 8989)
the Wikidata Query Service frontend (exposed on port 8282)
the Mediawiki GUI and API (exposed on port 8181)
a Wikidata Query Service updater
Quickstatements (which doesn't work right out of the box, so we'll ignore it)

When data are entered into Wikibase using the Mediawiki instance at port 8181, they are stored in the mysql database. The Wikidata Query Service updater checks periodically for changes in the database. When it finds one, it loads the changed data into the Blazegraph triplestore. Although one can access the Blazegraph interface directly through port 8989, accessing the triplestore indirectly through the Wikidata Query Service frontend on port 8282 gives some additional bells and whistles that make querying easier.

If I look at the terminal window while Docker Compose is running Wikidata, I see this:

You can see that the updater is looking for changes every 10 seconds. This goes on in the terminal window as long as the instance is up. So when changes are made via Mediawiki, they show up in the Query Service within about 10 seconds.

If you access Blazegraph via http://localhost:8989/bigdata/, you'll see the normal GUI that will be familiar to you if you've used Blazegraph before:

However, if you go to the UPDATE tab and try to add data using SPARQL Update, you'll find that it's disabled. That means that the only way to actually get data into the system is through the Mediawiki GUI or API exposed through port 8181, and NOT through the standard Linked Data mechanism of SPARQL Update. So if you want to add data to Wikibase (either your local installation or the Wikidata instance of Wikibase), you need to figure out how to use the Mediawiki API, which is based on a specific Wikimedia data model and NOT on standard RDF or RDFS.

The MediaWiki API

The MediaWiki API is a generic web service for all installations in the WikiMedia universe. That includes not only familiar Wikimedia Foundation projects like Wikipedia in all of its various languages, Wikimedia Commons, and Wikidata, but also any of the many other projects built on the open source MediaWiki platform.

The API allows you to perform many possible read or write actions on a MediaWiki installation. Those actions are listed on the MediaWiki API help page and you can learn their details by clicking on the name of any of the actions. The actions whose names begin with "wb" are the ones specifically related to Wikibase and there is a special page that focuses only on that set of actions. Since this post is related to Wikibase, we will focus on those actions. Although a number of the Wikibase-related actions can read from the API, as I pointed out in my most recent previous post there is not much point in reading directly from the API when one can just use Wikibase's awesome SPARQL interface instead. So in my opinion, the most important Wikibase actions are the ones that write to the API rather than read.

The Wikibase-specific API page makes two important points about writing to a Wikibase instance: writing requires a token (more on that later) and must be done using an HTTP POST request. I have to confess that when I first started looking at the API documentation, I was mystified about how to translate the examples given there into request bodies that could be sent as part of a POST request. But there is a very useful tool that makes it much easier to construct the POST requests: the API sandbox. There are actually multiple sandboxes (e.g. real Wikidata, Wikidata test instance, real Wikipedia, Wikipedia test instance, etc.), but since tests that you do in an API sandbox cause real changes to their corresponding MediaWiki instances, you should practice using the Wikidata test instance sandbox (https://test.wikidata.org/wiki/Special:ApiSandbox) and not the sandbox for the real Wikidata, which looks and behaves exactly the same as the test instance sandbox.

When you go to the sandbox, you can select from the dropdown the action that you want to test. Alternatively, you can click on one of the actions on the MediaWiki API help page, then in the Examples section, click on the "[open in sandbox]" link to jump directly to the sandbox with the parameters already filled into the form.

Click on the "action=..." link in the menu on the left if needed to enter any necessary parameters. Note: since testing the write actions requires a token, you need to log in (same credentials as Wikipedia or any other Wikimedia site), then click the "Auto-fill the token" button before the write action will really work. Once the action has taken place, you can go to the edited entry in the test Wikidata instance and convince yourself that it really worked.

On the sandbox page, clicking on the "Results" link in the menu on the left will provide you with a really useful piece of information: the Request JSON that needs to be sent to the API as the body of the POST request:

Drop down the "Show request data as:" list to "JSON" and you can copy the Request JSON to use as you write and test your bot script. Once you've had a chance to look at several examples of request JSON, you can then compare it to the information given on the various API help pages to understand better what exactly you need to send to the API as the body of your POST request.

Authentication

In the last section, I mentioned that all write actions required a token. So what is that token, and how do you get it? In the API sandbox, you just click on a button and magic happens: a token is pasted into the box on the form. But what do you do for a real script?

The actual process of getting the necessary token is a bit convoluted an I won't go into the details here since they are covered in detail (with screenshots) on another web page in the Set up the bot and Use the bot to write to the Wikidata test instance sections. The abridged version is that you first need to create a bot username and password, then use those credentials to interact with the API to get the CSRF token that will allow you to perform the POST request.

For use in the test Wikidata instance or in your own Wikibase installation, you can just create the bot password using your own personal account. (Note: "bot" is just MediaWiki lingo for a script that automates edits.) However, the guidelines for getting approval for a Wikidata bot say that if you want to create a bot that carries out manipulations of the real Wikidata, you need to create a separate account specifically for the bot. An approved bot will receive a "bot flag" indicating that the community has given a thumbs-up to the bot to carry out its designated tasks. In the practice examples I've given, you don't need to do that, so you can ignore that part for now.

A CSRF token is issued for a particular editing session, so once it has been issued, it can be re-used for many actions that are carried out by the bot script during that session. I've written a Python function, authenticate(), that can be copied from this page and used to get the CSRF token - it's not necessary to understand the details unless you care about that kind of thing.

Time for a Snak

You can't get very far into the process of performing Wikibase actions on the MediaWiki API before you start running into the term snak. Despite reading various Wikibase documents and doing some minimal googling, I have not been able to find out the origin of the word "snak". I suppose it is either an inside joke, a term from some language other than English, or an acronym. If anybody out there knows, I would love to be set straight on this.

The Wikibase/DataModel reference page defines snaks as: "the basic information structures used to describe Entities in Wikidata. They are an integral part of each Statement (which can be viewed as collection of Snaks about an Entity, together with a list of references)." But what exactly does that mean?

Truthfully, I find the reference page a tough slog, so if you are unfamiliar with the Wikidata model and want to get a better understanding of it, I would recommend starting with the Data Model Primer page, which shows very clearly how the data model relates to the familiar MediaWiki item entry GUI (but ironically does not mention snaks anywhere on the entire page). I would also recommend studying the following graph diagram, which comes from a page that I wrote to help people get started making more complex Wikibase/Wikidata SPARQL queries.

Before I talk about how snaks fit into the Wikibase data model, I want to talk briefly about how the Wikibase modeling approach differs from modeling more typical for RDF-based Linked Data. A typical RDF-based graph model is built upon the RDFS, which includes an implicit notion of classes and type. One could then build a model on top of RDFS by creating an ontology where class relationships are define using subclass statements, restrictions are placed on class membership, ranges and domains are defined, etc. The overall goal is to describe some model of the world (real or imagined).

In contrast to that, a basic principle of Wikibase is that it is not about the truth. Rather, the Wikibase model is based on describing statements and their references. So the Wikibase model does not assume that we can model the world by placing items in a class. Rather, the Wikibase model allows us to state that "so-and-so says" that an item is a member of some class. A key property in Wikidata is P31 ("instance of"), which is used with almost every item to document a statement about class membership. But there is no requirement that some other installation of Wikibase have an "instance of" property, or that if an "instance of" property exists its identifier must be P31. "Instance of" is not an idea that's "baked into" the Wikibase model in the way it's build into RDFS. "Instance of" is just one of the many properties that the Wikidata community has decided it would like to use in statements that it documents. The same is true of "subclass of" (P279). A user can create the statement Q6256 P279 Q1048835 ("country" "subclass of" "political territorial entity"), but according to the Wikibase model, that is not some kind of special assertion of the state of reality. Rather, it's just one of the many other statements about items that have been documented in the Wikidata knowledge base.

So when we say that some part of the Wikidata community is "building a model" of their domain, they aren't doing it by building a formal ontology using RDF, RDFS, or OWL. Rather, they are doing it by making and documenting statements that involve the properties P31 and P279, just as they would make and document statements using any of the other thousands of properties that have been created by the Wikidata community.

What is actually "baked into" the Wikibase model (and Wikidata by extension) are the notions of property/value pairs associated with statements, reference property/value pairs associated with statements, and qualifiers and ranks for statements (not shown in the diagram above). The Wikibase data model assumes that the properties associated with statements and references exist, but does not define any of them a priori. Creating those particular properties are is up to the implementers of a particular Wikibase instance.

These key philosophical differences between the Wikibase model and the "standard" RDF/RDFS/OWL world need to be understood by implementers from the Linked Data world who are interested in using Wikibase as a platform to host their data. Building a knowledge graph on top of Wikibase will automatically include notions of statements and reference, but it will NOT automatically include notions of class membership and subclass relationships. Those features of the model will have to be built by the implementers through creation of appropriate properties. It's also possible to use SPARQL Construct to translate a statement in Wikidata lingo like

Q42 P31 Q5.

into a standard RDF/RDFS statement like

Q42 rdf:type Q5.

although there are OWL-related problems with this approach related to an item being used as both a class and an instance. But that's way beyond the scope of this post.

So after that rather lengthy aside, let's return to the question of snaks. A somewhat oversimplified description of a snak would be to say that it's a property/value pair of some sort. (There are also less commonly "no value" and "some value" snaks in cases where particular values aren't known - you can read about their details on the reference page.) The exact nature of the snak will depend on whether the value is a string, an item, or some other more complicated entity like a date range or geographic location. "Main" snaks are property/value pairs that are associated directly with the subject item and "qualifier" snaks qualify the statement made by the main snak. Zero to many reference records are linked to the statement, and each reference record has its own set of property/value snaks describing the reference itself (as opposed to describing the main statement). Given that the primary concern of the Wikibase data model is documenting statements involving property/value pairs, snaks are a central part of that model.

The reason I'm going out into the weeds on the subjects of snaks in this post is that a basic knowledge of snaks is required in order to understand the lingo of the Wikibase actions described in the MediaWiki API help. For example, if we look at the help page for the wbcreateclaim action, we can see how a knowledge of snaks will help us better understand the parameters required for that action.

In most cases, snaktype will have a value of value (unless you want to make a "no value" or "some value" assertion). If we want to write a claim having a typical snak, we will have to provide the API with values for both the property and value parameters. The property parameter is straightforward: the property's "P" identifier is simply given as the value of the parameter.

The value of the snak is more complicated. Its value is a string that also includes the delimiters necessary to describe the particular kind of value that's appropriate for the property. If the property is supposed to have a string value, then the value of the value parameter will be the string enclosed in quotes. If the property is supposed to have an item as a value, then the information about the item is given as a string that includes all of the JSON delimiters (quotes, colons, curly braces, etc.) required in the API documentation. Since all of the parameters and values for the action will be passed to the API as JSON in the POST request, the value of the value parameter will end up as a JSON string inside of JSON. Depending on the programming language you use, you may have to use escaping or some other mechanism to make sure that the JSON string for the value value is rendered properly. Here are some examples of how part of the POST request body JSON might look in a programming language where escaping is done by preceding a character with a backslash:

if the value is a string:

{
...
"property": "P1234",
"value": "\"WGS84\"",
...
}

if the value is an item:

{
...
"property": "P9876",
"value": "{\"entity-type\":\"item\",\"numeric-id\":1}",
...
}

Because the quotes that are part of the value parameter value string are inside the quotes required by the request body JSON, they were escaped as \".

For JSON data sent by the requests Python library as the body of a POST request, the JSON can be passed into the .post() method as a dictionary data structure, and requests will turn the dictionary into JSON before sending it to the API. To some extent, that allows one to dodge the whole escaping thing by using a combination of single and double quotes when constructing the dictionary. So in Python, we could code the dictionary to be passed by requests like this:

if the value is a string:

{
...
'property': 'P1234',
'value': '"WGS84"',
...
}

if the value is an item:

{
...
'property': 'P9876',
'value': '{"entity-type":"item","numeric-id":1}',
...
}

since Python dictionaries can be defined using using single quotes. Other kinds of values such as geocoordinates will have a different structure for their value string.

I ran into problems in Python when I tried to build the value values for the POST body dictionary by directly concatenating string variables with literals containing curly braces. Since Python uses curly braces to define string replacement fields, it got confused and threw an error in some of my lines of code. The simplest solution to that problem was to construct a dictionary for the data that needed to be turned into a string value, then pass that dictionary into the json.dumps() function to turn the dictionary into a valid JSON string (rather than trying to build that string directly). The string resulting as output of json.dumps() could then be assigned as the value of the appropriate parameter to be included in the JSON sent in the POST body. You can see how I used this approach in lines 128 through 148 of this script.

I realize that what I've just described here is about as confusing as trying to watch the movie Inception for the first time, but I probably wasted at least half of the time it took me to get my bot script to work by being confused about what a snak was and how to construct the value of the value parameter. So at least you will have a heads up about this confusing topic, and by looking at my example code you will hopefully be able to figure it out.

Putting it all together

So to summarize, here are the steps you need to take to write to any Wikibase installation using the MediaWiki API:

Create a bot to get a username and password.
Determine the structure of the JSON body that needs to be passed to the API in the POST request for the desired action.
Use the bot credentials to log into an HTTP session with the API and get a CSRF token.
Execute the code necessary to insert the data you want to write into the appropriate JSON structure for the action.
Execute the code necessary to perform the POST request and pass the JSON to the API.
Track the API response to determine if errors occurred and handle any errors.
Repeat many times (otherwise why are you automating with a bot?).

This tutorial will walk you through the steps and provides code examples and screenshots to get you going.

If you are writing to the "real" Wikidata instance of Wikibase, you need to take several additional steps:

Create a separate bot account.
Define what the bot will do and describe those tasks in the bot's talk page.
Request approval for permission to operate the bot.
In programming the bot, figure out how you will check for existing records and avoid creating duplicate items or claims.
Perform 50 to 250 edits with the bot to show that it works. Make sure that you throttle the bot appropriately using the Maxlag parameter.
After you get approval, put the bot into production mode and monitor its performance carefully.

The Wikidata:Bots page gives many of the necessary administrative details of setting up a Wikidata bot.

For writing to the "real" Wikidata, you might consider using the Pywikibot Python library to build your bot. I've written a tutorial for that here. Pywikibot has built-in throttling, so that takes care of potential problems with hitting the API at an unacceptable rate. However, in tests that I carried out on our test instance of Wikibase hosted on AWS, writing directly to the API as I've described here was about 60 times faster than using Pywikibot. So if you are writing a lot of data to a fresh and empty Wikibase instance, you may find using Pywikibot's slow speed frustrating.

Acknowledgements

Asaf Bartov and Andrew Lih's presentations and their answers to my questions at the 2019 LD4P conference were critical for helping me to finally figure out how to write effectively to Wikibase. Thanks!

Getting Data Out of Wikidata using Software

2019-05-28T10:24:00.003-07:00

For those of you who have been struggling through my series of posts on the TDWG Standards Documentation Specification, you may be happy to read a more "fun" post on everybody's current darling: Wikidata. This post has a lot of "try it yourself" activities, so if you have an installation of Python on your computer, you can try running the scripts yourself.

2022-06-11 note: a more recent followup post to this one describing Python code to reliably retrieve data from the Wikidata Query Service is here: "Making SPARQL queries to Wikidata using Python".

Because our group at the Vanderbilt Libraries is very interested in leveraging Wikidata and Wikibase for possible use in projects, I recently began participating in the LD4 Wikidata Affinity Group, an interest group of the Linked Data for Production project. I also attended the 2019 LD4 Conference on Linked Data in Libraries in Boston earlier this month. In both instances, most of the participants were librarians who were knowledgeable about Linked Data. So it's been a pleasure to participate in events where I don't have to explain what RDF is, or why one might be able to do cool things with Linked Open Data (LOD).

However, I have been surprised to hear people complain a couple of times at those events that Wikidata doesn't have a good API that people can use to acquire data to use in applications such as those that generate web pages. When I mentioned that Wikidata's query service effectively serves as a powerful API, I got blank looks from most people present.

I suspect that one reason why people don't think of the Wikidata Query Service as an API is because it has such an awesome graphical interface that people can interact with. Who wouldn't get into using a simple dropdown to create cool visualizations that include maps, timelines, and of course, pictures of cats? But underneath it all, the query service is a SPARQL endpoint, and as I have pontificated in a previous post, a SPARQL endpoint is just a glorified, program-it-yourself API.

In this post, I will demonstrate how you can use SPARQL to acquire both generic data and RDF triples from the Wikidata query service.

What is SPARQL for?

The recursive acronym SPARQL (pronounced like "sparkle") stands for "SPARQL Protocol and RDF Query Language". Most users of SPARQL know that it is a query language. It is less commonly known that SPARQL has a protocol associated with it that allows client software to communicate with the endpoint server using hypertext transfer protocol (HTTP). That protocol establishes things like how queries should be sent to the server using GET or POST, how the client indicates the desired form of the response, and how the server indicates the media type of the response. These are all standard kinds of things that are required for a software client to interact with an API, and we'll see the necessary details when we get to the examples.

The query parts of SPARQL demarcate the kinds of tasks we can accomplish using it. There are three main things we can do with SPARQL:

get generic data from the underlying graph database (triplestore) using SELECT
get RDF triples based on data in the graph database using CONSTRUCT
load RDF data into the triplestore using UPDATE

In the Wikidata system, data enters the Blazegraph triplestore directly from a separate database, so the third of these methods (UPDATE) is not enabled. That leaves the SELECT and CONSTRUCT query forms and we will examine each of them separately.

Getting generic data using SPARQL SELECT

The SELECT query form is probably most familiar to users of the Wikidata Query Service. It's the form used when you do all of those cool visualizations using the dropdown examples. The example queries do some magical things using commands that are not part of standard SPARQL, such as view settings that are in comments and the AUTO_LANGUAGE feature. In my examples, I will use only standard SPARQL for dealing with languages and ignore the view settings since we aren't using a graphical interface anyway.

We are going to develop an application that will allow us to discover what Wikidata knows about superheroes. The query that we are going to start off with is this one:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT DISTINCT ?name ?iri WHERE {
?iri wdt:P106 wd:Q188784.
?iri wdt:P1080 wd:Q931597.
?iri rdfs:label ?name.
FILTER(lang(?name)="en")
}
ORDER BY ASC(?name)

For reference purposes, in the query, wdt:P106 wd:Q188784 is "occupation superhero" and wdt:P1080 wd:Q931597 is "from fictional universe Marvel Universe". This is what restricts the results to Marvel superheroes. (You can leave this restriction out, but then the list gets unmanageably long.) The language filter restricts the labels to the English ones. The name of the superhero and its Wikidata identifier are what is returned by the query.

If you want to try the query, you can go the the graphical query interface (GUI), paste it into the box, and click the blue "run" button. I should note that Wikidata will allow you to get away with leaving off the PREFIX declarations, but that bugs me, so I'm going to include them since I think it's a good practice to be in the habit of including them.

When you run the query, you will see the result in the form of a table:

The table shows all of the bindings to the ?name variable in one column and the bindings for the ?iri variable in a second column. However, when you make the query programmatically rather than through the GUI, there are a number of possible non-tabular forms (i.e. serializations) in which you can receive the results.

To understand what is going on under the hood here, you need to issue the query to the SPARQL endpoint using client software that will allow you specify all of the settings required by the SPARQL HTTP protocol. Most programmers at this point would tell you to use CURL to issue the commands, but personally, I find CURL difficult to use and confusing for beginners. I always use Postman, which is free, and easy to use and understand.

The SPARQL protocol describes several ways to make a query. We will talk about two of them here.

Query via GET is pretty straightforward if you are used to interacting with APIs. The SPARQL query is sent to the endpoint (https://query.wikidata.org/sparql) as a query string with a key of query and value that is the URL-encoded query. The details of how to do that using Postman are here. The end result is that you are creating a long, ugly URL that looks like this:

https://query.wikidata.org/sparql?query=%0APREFIX+rdfs%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0APREFIX+wd%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2F%3E%0APREFIX+wdt%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fdirect%2F%3E%0ASELECT+DISTINCT+%3Fname+%3Firi+WHERE+%7B%0A%3Firi+wdt%3AP106+wd%3AQ188784.%0A%3Firi+wdt%3AP1080+wd%3AQ931597.%0A%3Firi+rdfs%3Alabel+%3Fname.%0AFILTER%28lang%28%3Fname%29%3D%22en%22%29%0A%7D%0AORDER+BY+ASC%28%3Fname%29%0A

If you are OK with getting your results in the default XML serialization, you just need to request that URL and the file that comes back will have your results. You can even do that by just pasting the ugly URL into the URL box of a web browser if you don't want to bother with Postman.

However, since we are planning to use the results in a program, it is much easier to use the results if they are in JSON. Getting the results in JSON requires sending an Accept request header of application/sparql-results+json along with the GET request. You can't do that in a web browser, but in Postman you can set request headers by filling in the appropriate boxes on the header tab as shown here. SPARQL endpoints may also accept the more generic JSON Accept request header application/json, but the previous header is the preferred one for SPARQL requests.

Query via POST is in some ways simpler than using GET. The SPARQL query is sent to the endpoint using only the base URL without any query string. The query itself is sent to the endpoint in unencoded form as the message body. A critical requirement is that the request must be sent with a Content-Type header of application/sparql-query. If you want the response to be in JSON, you must also include an Accept request header of application/sparql-results+json as was the case with query via GET.

There is no particular advantage of using POST instead of GET, except in cases where using GET would result in a URL that exceeds the allowed length for the endpoint server. I'm not sure what that limit is for Wikidata's server, but typically the maximum is between 5000 and 15000 characters. So if the query you are sending ends up being very long, it is safer to send it using POST.

Response. The JSON response that we get back look like this:

{
"head": {
"vars": [
"name",
"iri"
]
},
"results": {
"bindings": [
{
"name": {
"xml:lang": "en",
"type": "literal",
"value": "Amanda Sefton"
},
"iri": {
"type": "uri",
"value": "http://www.wikidata.org/entity/Q3613591"
}
},
{
"name": {
"xml:lang": "en",
"type": "literal",
"value": "Andreas von Strucker"
},
"iri": {
"type": "uri",
"value": "http://www.wikidata.org/entity/Q4755702"
}
},

...

{

"name": {

"xml:lang": "en",

"type": "literal",

"value": "Zeitgeist"

"iri": {

"type": "uri",

"value": "http://www.wikidata.org/entity/Q8068621"

}

]

}

The part of the results that we care about is the value of the bindings key. It's a array of objects that include a key for each of the variables that we used in the query (e.g. name and iri). The value for each variable key is another object representing a bound result. The bound result object contains key:value pairs that tell you more than we could learn from the Query Service GUI table, notably the language tag for strings and whether the result was a literal or URI. But basically, the information for each object in the array corresponds to the information that was in a row in the GUI table. In our program, we can step through each object in the array and pull out the bound results that we want.

The reason for going into these gory details is to point out that the generic HTTP operations that we just carried out can be done for any programming language that has libraries to perform HTTP calls. We will see how this is done in practice for two languages.

Using Python to get generic data using SPARQL SELECT

A Python 3 script that performs the query above can be downloaded from this page. The query itself is assigned to a variable as a multi-line string in lines 10-19. In line 3, the script allows the user to choose a language for the query and the code for that language is inserted as the variable isoLanguage in line 17.

The script uses the popular and easy-to-use requests library to make the HTTP call. It's not part of the standard library, so if you haven't used it before, you'll need to install it using PIP before you run the script. The actual HTTP GET call is made in line 26. The requests module is really smart and will automatically URL-encode the query when it's sent into the .get() method as a value of params. So you don't have to worry about that yourself.

If you uncomment line 27 and comment line 26, you can make the request using the .post() method instead of GET. For a query of this size, there is no particular advantage of one method over the other. The syntax varies slightly (specifying the query as a data payload rather than a query parameter) and the POST request includes the required Content-Type header in addition to the Accept header to receive JSON.

There are print statements in lines 21 and 29 so that you can see what the query looks like after the insertion of the language code, and after it's been URL-encoded and appended to the base endpoint URL. You can delete them later if they annoy you. If you uncomment line 31, you can see the raw JSON results as the have been received from the query service. They should look like what was shown above.

Line 33 converts the received JSON string into a Python data structure, and also pulls out the array value of the bindings key (now a Python "list" data structure). Lines 34 to 37 step through each result in the list, extract the values bound to the ?name and ?iri variables, then prints them on the screen. The result looks like this for English:

http://www.wikidata.org/entity/Q3613591 : Amanda Sefton
http://www.wikidata.org/entity/Q4755702 : Andreas von Strucker
http://www.wikidata.org/entity/Q14475812 : Anne Weying
http://www.wikidata.org/entity/Q2604744 : Anya Corazon
http://www.wikidata.org/entity/Q2299363 : Armor
http://www.wikidata.org/entity/Q2663986 : Aurora
http://www.wikidata.org/entity/Q647105 : Banshee
http://www.wikidata.org/entity/Q302186 : Beast
http://www.wikidata.org/entity/Q2893582 : Bedlam
http://www.wikidata.org/entity/Q2343504 : Ben Reilly
http://www.wikidata.org/entity/Q616633 : Betty Ross
...

If we run the program and enter the language code for Russian (ru), the results look like this:

http://www.wikidata.org/entity/Q49262738 : Ultimate Ангел
http://www.wikidata.org/entity/Q48891562 : Ultimate Джин Грей
http://www.wikidata.org/entity/Q39052195 : Ultimate Женщина-паук
http://www.wikidata.org/entity/Q4003146 : Ultimate Зверь
http://www.wikidata.org/entity/Q48958279 : Ultimate Китти Прайд
http://www.wikidata.org/entity/Q4003156 : Ultimate Колосс
http://www.wikidata.org/entity/Q16619139 : Ultimate Рик Джонс
http://www.wikidata.org/entity/Q7880273 : Ultimate Росомаха
http://www.wikidata.org/entity/Q48946153 : Ultimate Роуг
http://www.wikidata.org/entity/Q4003147 : Ultimate Циклоп
http://www.wikidata.org/entity/Q48947511 : Ultimate Человек-лёд
http://www.wikidata.org/entity/Q4003183 : Ultimate Шторм
http://www.wikidata.org/entity/Q2663986 : Аврора
http://www.wikidata.org/entity/Q3613591 : Аманда Сефтон
http://www.wikidata.org/entity/Q4755702 : Андреас фон Штрукер
http://www.wikidata.org/entity/Q2604744 : Аня Коразон
http://www.wikidata.org/entity/Q770064 : Архангел
http://www.wikidata.org/entity/Q28006858 : Баки Барнс
...

So did our script just use an API? I would argue that it did. But it's programmable: if we wanted it to retrieve superheroes from the DC universe, all we would need to do is to replace Q931597 with Q1152150 in line 15 of the script.

Now that we have the labels and ID numbers for the superheroes, we could let the user pick one and we could carry out a second query to find out more. I'll demonstrate that in the next example.

Using Javascript/JQuery to get generic data using SPARQL SELECT

Because the protocol to acquire the data is generic, we can go through the same steps in any programming language. Here is an example using Javascript with some JQuery functions. The accompanying web page sets up two dropdown lists, with the second one being populated by the Javascript using the superhero names retrieved using SPARQL. You can try the page directly from this web page. To have the page start off using a language other than English, append a question mark, followed by the language code, like this. If you want to try hacking the Javascript yourself, you can download both documents into the same local directory, then double click on the HTML file to open it in a browser. You can then edit the Javascript and reload the page to see the effect.

There are basically two things that happen in this page.

Use SPARQL to find superhero names in the selected language. The initial page load (line 157) and making a selection on the Language dropdown (lines 106-114) fire the setStatusOptions() function (lines 58-98). That function inserts the selected language into the SPARQL query string (lines 69-78), URL-encodes the query (line 79), then performs the GET to the constructed URL (lines 82-87). The script then steps through each result in the bindings array (line 89) and pulls out the bound name and iri value for the result (lines 90-91). Up to this point, the script is doing exactly the same things as the Python script. In line 93, the name is assigned to the dropdown option label and the IRI is assigned to the dropdown option value. The page then waits for something else to happen.

In this screenshot, I've turned on Chrome's developer tools so I can watch what the page is doing as it runs. This is what the screen looks like after the page loads. I've clicked on the Network tab and selected the sparql?query=PREFIC%20rdfs... item. I can see the result of the query that was executed by the setStatusOptions() function.

Use SPARQL to find properties and values associated with a selected superhero. When a selection is made in the Character dropdown, the $("#box1").change(function() (lines 117-154) is executed. It operates in a manner similar to the setStatusOptions() function, except that it uses a different query that finds properties associated with the superhero and the values of those properties. Lines 121 through 133 insert the IRI from the selected dropdown into line 125 and the code of the selected language in lines 130 and 131, resulting in a query like this (for Black Panther, Q998220):

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

PREFIX wd: <http://www.wikidata.org/entity/>

PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT DISTINCT ?property ?value WHERE {

<http://www.wikidata.org/entity/Q998220> ?propertyUri ?valueUri.

?valueUri rdfs:label ?value.

?genProp <http://wikiba.se/ontology#directClaim> ?propertyUri.

?genProp rdfs:label ?property.

FILTER(substr(str(?propertyUri),1,36)="http://www.wikidata.org/prop/direct/")

FILTER(LANG(?property) = "en")

FILTER(LANG(?value) = "en")

}

ORDER BY ASC(?property)

The triple patterns in lines 127 and 128 and filter in 129 give access to the label of a property used in a statement about the superhero. The model for property labels in Wikidata are a bit complex - see this page for more details. This query only finds statements whose values are items (not those with string values) because the triple pattern in line 126 requires the value to have a label (and strings don't have labels). A more complex graph pattern than this one would be necessary to get values of statements with both literal (string) and non-literal (item) values.

Lines 144-149 differ from the earlier function in that they build a text string (called text) containing the property and value strings from the returned query results. The completed string is inserted as HTML into the div1 element of the web page (line 149).

The screenshot above shows what happens in Developer Tools when the Character dropdown is used to select "Black Panther". You can see on the Network tab on the right that another network action has occurred - the second SPARQL query. Clicking on the response tab shows the response JSON, which is very similar in form to the previous query results. On the left side of the screen, you can see where the statements about the Black Panther have been inserted into the web page.

It is worth noting that the results that are acquired vary a lot depending on the language that is chosen. The first query that builds the character dropdown requires that the superhero have a label in the chosen language. If there isn't a label in that language for that character, then the superhero isn't listed. That's why the list of English superhero names is so long and the simplified Chinese list only has a few options. Similarly, properties for a character are only listed if they have labels in that language and if those labels also have a value that has a label in that language. So we miss a lot of superheros and properties that exist if no one has bothered to create labels for them in a given language.

This page is also very generic. Except for the page titles and headers in different languages, which are hard-coded, minor changes to the triple patterns in lines 73 and 74 would make it possible to retrieve information about almost any kind of thing described in Wikidata.

Getting RDF triples using SPARQL CONSTRUCT

In SPARQL SELECT, we specify any number of variables that we want the endpoint to send information about. The values that are bound to those variables can be any kind of string (datatyped or language tagged) or IRI. In contrast, SPARQL CONSTRUCT always returns the same kind of information: RDF triples. So the response to CONSTRUCT is always an RDF graph.

As with the SELECT query, you can issue a CONSTRUCT query at the Wikidata Query Service GUI by pasting it into the box. You can try it with this query:

PREFIX wd: <http://www.wikidata.org/entity/>

CONSTRUCT {

wd:Q42 ?p1 ?o.

?s ?p2 wd:Q42.

}

WHERE {

{wd:Q42 ?p1 ?o.}

UNION

{?s ?p2 wd:Q42.}

}

The WHERE clause of the query requires that triples match one of two patterns: triples where the subject is item Q42, and triples where the object is item Q42. The graph to be constructed and returned consists of all of the triples that conform to one of those two patterns. In other words, the graph that is returned to the client is all triples in Wikidata that are directly related to Douglas Adams (Q42).

When we compare the results to what we got when we pasted the SELECT query into the box, we see that there is also a table at the bottom. However, in a CONSTRUCT query there will always be three columns for the three parts of the triple (subject, predicate, and object), plus a column for the "context", which we won't worry about here. The triples that are shown here mostly look the same, but that's only because the table in the GUI doesn't tell us the language tags of the labels and the names in most languages written in Latin characters are the same.

If we use Postman to make the query, we have the option to specify the serialization that we want for the response graph. Blazegraph (the system that runs Wikidata's SPARQL endpoint) will support any of the common RDF serializations, so we just need to send an Accept header with the appropriate media type (text/turtle for Turtle, application/rdf+xml for XML, or application/ld+json for JSON-LD). Otherwise, the query is sent via GET or POST just as in the SELECT example.

These results (in Turtle serialization) show us the language tags of all of the labels, and we can see that the string part of many of them are the same when the language uses Latin characters.

We can also run this query using this Python script. As with the previous Python script, we assign the query to a variable in lines 9 to 18. This time, we substitute the value of the item variable, which is set to be Q42, but could be changed to retrieve triples about any other item. After we perform the GET request, the result is written to a text file, requestsOutput.ttl . We could then load that file into our own triplestore if we wanted.

Using rdflib in Python to manipulate RDF data from SPARQL CONSTRUCT

Since the result of our CONSTRUCT query is an RDF graph, there aren't the same kind of direct uses for the data in generic Python (or Javascript, for that matter) as the JSON results of the SELECT query. However, Python has an awesome library for working with RDF data called rdflib. Let's take a look at another Python script that uses rdflib to mess around with RDF graphs acquired from the Wikidata SPARQL endpoint. (Don't forget to use PIP to install rdflib if you haven't used it before.

Lines 11 through 24 do the same thing as the previous script: create variables containing the base endpoint URL and the query string.

The rdflib package allows you to create an instance of a graph (line 8) and has a .parse() method that will retrieve a file containing serialized RDF from a URL, parse it, and load it into the graph instance (line 29). In typical use, the URL of a specific file is passed into the method, but since all of the information necessary to initiate a SPARQL CONSTRUCT query via GET is encoded in a URL, and since the result of the query is a file containing serialized RDF, we can just pass the complex endpoint URL with the encoded query into the method and the graph returned from the query will go directly into the itemGraph graph instance.

There are two issues with using the method in this way. One is that unlike the requests .get() method, the rdflib .parse() method does not allow you to include request headers in your GET call. Fortunately, if no Accept header is sent to the Wikidata SPARQL endpoint, it defaults to returning RDF/XML, and the .parse() method is fine with that. The other issue is that unlike the requests .get() method, the rdflib .parse() method does not automatically URL-encode the query string and associate it with the query key. That is why line 25 builds the URL manually and uses the urllib.parse.quote() function to URL-encode the query string before appending it to the rest of the URL.

Upon completion of line 29, we now have the triples constructed by our query loaded into the itemGraph graph instance. What can we do with them? The rdflib documentation provides some ideas. If I am understanding it correctly, graph is an iterable object consisting of tuples, each of which represents a triple. So in line 33, we can simply use the len() function to determine how many triples were loaded into the graph instance. In lines 35 and 37, I used the .preferredLabel() method to search through the graph to find the labels for Q42 in two languages.

rdflib has a number of other powerful features that are worth exploring. One is its embedded SPARQL feature, which perhaps isn't that useful here since we just got the graph using a SPARQL query. Nevertheless, it's a cool function. The other capability that could be very powerful is rdflib's nearly effortless ability to merge RDF graphs. In the example script, the value of the item variable is hard-coded in line 5. However, the value of item could be determined by a for loop and the triples associated with many items could be accumulated into a single graph before saving the merged graph as a file (line 41) to be used elsewhere (e.g. loaded into a triplestore). You can imagine how a SPARQL SELECT query could be made to generate a list of items (as was done in the "Using Python to get generic data using SPARQL SELECT" section of this post), then that list could be passed into the code discussed here to create a graph containing all of the information about every item meeting some criteria set out in the SELECT query. That's pretty powerful stuff!

Alternate methods to get data from Wikidata

Although I've made the case here that SPARQL SELECT and CONSTRUCT queries are probably the best way to get data from Wikidata, there are other options. I'll describe three.

MediaWiki API

Since Wikidata is built on the MediaWiki system, the MediaWiki API is another mechanism to acquire generic data (not RDF triples) about items in Wikidata. I have written a Python script that uses the wbgetclaims action to get data about the claims (i.e. statements) made about a Wikidata item. I won't go into detail about the script, since it just uses the requests module's .get() method to get and parse JSON as was done in the first Python example of this post. The main tricky thing about this method is that you need to understand about "snaks", an idiosyncratic feature of the Wikibase data model. The structure of the JSON for the value of a claim varies depending on the type of the snak - thus the series of try...except... statements in lines 20 through 29.

If you intend to use the MediaWiki API, you will need to put in a significant amount of time studying the API documentation. A list of possible actions are on this page - actions whose names begin with "wb" are relevant to Wikidata. I will be talking a lot more about using the WikiMedia API in the next blog post, so stay tuned.

Dereferencing Wikidata item IRIs

Wikidata plays nicely in the Linked Data world in that they support content negotiation for dereferencing of their IRIs. That means that you can just do an HTTP GET for any item IRI with an Accept request header of one of the RDF media types, and you'll get a very complete description of the item in RDF.

For example, if I use Postman to dereference the IRI http://www.wikidata.org/entity/Q42 with an Accept header of text/turtle and allow Postman to automatically follow redirects, I eventually get redirected to the URL https://www.wikidata.org/wiki/Special:EntityData/Q42.ttl . The result is a pretty massive file that contains 66499 triples (as of 2019-05-28). In contrast, the SPARQL CONSTRUCT query to find all of the statements where Q42 was either the subject or object of the triple returned 884 triples. Why are there 75 times as many triples when we dereference the URI? If we scroll through the 66499 triples, we can see that not only do we have all of the triples that contain Q42, but also all of the triples about every part of every triple that contains Q42 (a complete description of the properties and a complete description of the values of statements about Q42). So this is a possible method to acquire information about an item in the form of RDF triples, but you get way more than you may be interested in knowing.

Using SPARQL DESCRIBE

One of the SPARQL query forms that I didn't mention earlier is DESCRIBE. The SPARQL 1.1 Query Language specification is a bit vague about what is supposed to happen in a DESCRIBE query. It says "The DESCRIBE form returns a single result RDF graph containing RDF data about resources. This data is not prescribed by a SPARQL query, where the query client would need to know the structure of the RDF in the data source, but, instead, is determined by the SPARQL query processor." In other words, it's up to the particular SPARQL query processor implementation to decide what information to send the client about the resource. It may opt to send triples that are indirectly related to the described resource, particularly if the connection is made by blank nodes (a situation that would make it more difficult for the client to "follow its nose" to find the other triples). So basically, the way to find out what a SPARQL endpoint will send as a response to a DESCRIBE query is to do one and see what you get.

When I issue the query

DESCRIBE <http://www.wikidata.org/entity/Q42>

to the Wikidata SPARQL endpoint with an Accept request header of text/turtle, I get 884 triples, all of which have Q42 as either the subject or object. So at least for the Wikidata query service SPARQL endpoint, the DESCRIBE query provides a simpler way to express the CONSTRUCT query that I described in the "Getting RDF triples using SPARQL CONSTRUCT" section above.

The power of SPARQL CONSTRUCT

In the simple example above, DESCRIBE was a more efficient way than CONSTRUCT to get all of the triples where Q42 was the subject or object. However, the advantage of using CONSTRUCT is that you can tailor the triples to be returned in more specific ways. For example, you could easily obtain only the triples where Q42 is the subject by just leaving out the

?s ?p2 wd:Q42.

part of the query.

In the CONSTRUCT examples I've discussed so far, the triples in the constructed graph all existed in the Wikidata dataset - we just plucked them out of there for our own use. However, there is no requirement that the constructed triples actually exist in the data source. We can actually "make up" triples to say anything we want. I'll illustrate this with an example involving references.

The Wikibase graph model (upon which the Wikidata model is based) is somewhat complex with respect to the references that support claims about items. (See this page for more information.) When a statement is made, a statement resource is instantiated and it's linked to the subject item by a "property" (p: namespace) analog of the "truthy direct property" (wdt: namespace) that is used to link the subject item to the object of the claim. The statement instance is then linked to zero to many reference instances by a prov:wasDerivedFrom (http://www.w3.org/ns/prov#wasDerivedFrom) predicate. The reference instances can then be linked to a variety of source resources by reference properties. These reference properties are not intrinsic to the Wikibase graph model and are created as needed by the community just as is the case with other properties in Wikidata.

We can explore the references that support claims about Q42 by going to the Wikidata Query Service GUI and pasting in this query:

PREFIX wd: <http://www.wikidata.org/entity/>

PREFIX p: <http://www.wikidata.org/prop/>

PREFIX pr: <http://www.wikidata.org/prop/reference/>

PREFIX prov: <http://www.w3.org/ns/prov#>

SELECT DISTINCT ?propertyUri ?referenceProperty ?source

WHERE {

wd:Q42 ?propertyUri ?statement.

?statement prov:wasDerivedFrom ?reference.

?reference ?referenceProperty ?source.

}

ORDER BY ?referenceProperty

After clicking on the blue "run" button, the results table shows us three things:

in the first column we see the property used in the claim
in the second column we see the kind of reference property that was used to support the claim
in the third column we see the value of the reference property, which is the cited source

Since the table is sorted by the reference properties, we can click on them to see what they are. One of the useful ones is P248, "stated in". It links to an item that is an information document or database that supports a claim. This is very reminiscent of dcterms:source, "a related resource from which the described resource is derived". If I wanted to capture this information in my own triplestore, but use the more standard Dublin Core term, I could construct a graph that contained the statement instance, but then connected the statement directly to the source using dcterms:source. Here's how I would write the CONSTRUCT query:

PREFIX wd: <http://www.wikidata.org/entity/>

PREFIX p: <http://www.wikidata.org/prop/>

PREFIX pr: <http://www.wikidata.org/prop/reference/>

PREFIX prov: <http://www.w3.org/ns/prov#>

PREFIX dcterms: <http://purl.org/dc/terms/>

CONSTRUCT {

wd:Q42 ?propertyUri ?statement.

?statement dcterms:source ?source.

}

WHERE {

wd:Q42 ?propertyUri ?statement.

?statement prov:wasDerivedFrom ?reference.

?reference pr:P248 ?source.

}

You can test out the query at the Wikidata Query Service GUI. You could simplify the situation even more if you made up your own predicate for "has a source about it", which we could call ex:source (http://example.org/source). In that case, the constructed graph would be defined as

CONSTRUCT {

wd:Q42 ex:source ?source.

}

This construct query could be incorporated into a Python script (or a script in any other language that supports HTTP calls) using the requests or rdflib modules as described earlier in this post.

Conclusion

I have hopefully made the case that the ability to perform SPARQL SELECT and CONSTRUCT querys at the Wikidata Query Service eliminates the need for the Wikimedia Foundation to create an additional API to provide data from Wikidata. Using SPARQL queries provides data retrieval capabilities that are only limited by your imagination. It's true that using a SPARQL endpoint as an API requires some knowledge about constructing SPARQL queries, but I would assert that this is a skill that must be acquired by anyone who is really serious about using LOD. I'm a relative novice at constructing SPARQL queries, but even so can think of a mind-boggling array of possibilities for using them to get data from Wikidata.

In my next post, I'm going to talk about the reverse situation: getting data into Wikidata using software.

Understanding the TDWG Standards Documentation Specification, Part 5: Acquiring Machine-readable using DCAT

2019-04-24T10:06:00.001-07:00

This is the fifth in a series of posts about the TDWG Standards Documentation Specification (SDS). For background on the SDS, see the first post. For information on the SDS hierarchical model and how it relates to IRI design, see the second post. For information about how TDWG standards metadata can be retrieved via IRI dereferencing, see the third post. For information about accessing TDWG standards metadata via a SPARQL API, see the fourth post.

Note: this post was revised on 2020-03-04 when IRI dereferencing of the http://rs.tdwg.org/ subdomain went from testing into production.

Acquiring the machine-readable TDWG standards metadata based on the W3C Data Catalog (DCAT) Vocabulary Recommendation.

Not-so-great methods of getting a dump of all of the machine-readable metadata

In the last two posts of this series, I showed two different ways that you could acquire machine-readable metadata about TDWG Standards and their components.

In the third post, I explained how the implementation of the Standards Documentation Specification (SDS) could allow a machine (i.e. computer software) use the classic Linked Open Data (LOD) method of "following its nose" and essentially scraping the standards metadata by discovering linked IRIs, then following those links to retrieve metadata about the linked components. There are two problems with this approach. One is that it's very inefficient. Multiple HTTP calls are required to acquire the metadata about a single resource and there are thousands of resources that would need to be scraped. A more serious problem is that some of the terms that are current or past terms of Darwin and Audubon Cores are not dereferenceable. For example, the International Press Telecommunications Council (IPTC) terms that are borrowed by Audubon Core are defined in a PDF document and don't dereference. There are many ancient Darwin Core terms in namespaces other than the rs.tdwg.org subdomain that don't even bring up a web page, let alone machine-readable metadata. And the "permanent URLs" of the standards themselves (e.g. http://www.tdwg.org/standards/116) do not use content negotiation to return machine-readable metadata (although they might at some future point). So there are many items of interest whose machine-readable metadata simply cannot be discovered by this means, since linked IRIs can't be dereferenced with a request for machine-readable metadata.

In the fourth post, I described how the SPARQL query language could be used to get all of the triples in the TDWG Standards dataset. The query to do so was really simple:

CONSTRUCT {?s ?p ?o}
FROM <http://rs.tdwg.org/>
WHERE {?s ?p ?o}

and by requesting the appropriate content type (XML, Turtle, or JSON-LD) via an Accept header, a single HTTP call would retrieve all of the metadata at once. If all goes well, this is a simple and effective method. However, this method depends critically on two things: there has to be a SPARQL endpoint that is functioning and publicly accessible, and the metadata in the triplestore of the underlying graph database must be up-to-date with the most recent data. At the moment, both of those things are true about the Vanderbilt Library SPARQL endpoint (https://sparql.vanderbilt.edu/sparql), but there is no guarantee that it will continue to be true indefinitely. There is no reason why there cannot be multiple SPARQL endpoints where the data are available, and TDWG itself could run its own, but currently there are no plans for that to happen and so we are stuck with depending on the Vanderbilt endpoint.

Getting a machine-readable data dump from TDWG itself

I'm now going to tell you about the best way to acquire authoritative machine-readable metadata from the rs.tdwg.org implementation itself. But first we need to talk about the W3C Data Catalog (DCAT) recommendation, which is used to organize the data dump. The SDS does not mention the DCAT recommendation, but since DCAT is an international standard, it is the logical choice to be used for describing the TDWG standards datasets.

Data Catalog Vocabulary (DCAT)

In 2014, the W3C ratified the DCAT vocabulary as a Recommendation (the W3C term for its ratified standards). DCAT is a vocabulary for describing datasets of any form. The described datasets can be machine-readable, but do not have to be, and could include non-machine-readable forms like spreadsheets. The description of the datasets is in RDF, although the Recommendation is agnostic about the serialization.

There are three classes of resources that are described by the DCAT vocabulary. A data catalog is the resource that describes datasets. It's type is dcat:Catalog (http://www.w3.org/ns/dcat#Catalog). The datasets described in the catalog are assigned the type dcat:Dataset, which is a subclass of dctype:Dataset (http://purl.org/dc/dcmitype/Dataset). The third class of resources, distributions, are described as "an accessible form of a dataset" and can include downloadable files or web services. Distributions are assigned the type dcat:Distribution (http://www.w3.org/ns/dcat#Distribution). The hierarchical relationship among these classes of resources is shown in the following diagram.

An important thing to notice is that the DCAT vocabulary defines several terms whose IRIs are very similar: dcat:dataset and dcat:Dataset, and dcat:distribution and dcat:Distribution. The only thing that differs between the pairs of terms is whether the local name is capitalized or not. Those with capitalized local names denote classes and those that begin with lower case denote object properties.

Organization of TDWG data according to the DCAT data model

I assigned the IRI http://rs.tdwg.org/index to denote the TDWG standards metadata catalog. The local name "index" is descriptive of a catalog, and the IRI has the added benefit of supporting a typical web behavior: if a base subdomain like http://rs.tdwg.org/ is dereferenced, it is typical for that form of IRI to dereference to a "homepage" having the IRI http://rs.tdwg.org/index.htm, and http://rs-test.tdwg.org/index.htm does indeed redirect to a "homepage" of sorts: the README.md page for the rs.tdwg.org GitHub repo where the authoritative metadata tables live. You can try this yourself by putting either http://rs.tdwg.org/or http://rs.tdwg.org/index.htm into a browser URL bar and see what happens. However, making an HTTP call to either of these IRIs with an Accept header for machine-readable RDF (text/turtle or application/rdf+xml) will redirect to a representation-specific IRI like http://rs.tdwg.org/index.ttl or http://rs.tdwg.org/index.rdf as you'd expect in the Linked Data world.

The data catalog denoted by http://rs.tdwg.org/index describes the data located in the GitHub repository https://github.com/tdwg/rs.tdwg.org. Those data are organized into a number of directories, with each directory containing all of the information required to map metadata-containing CSV files to machine-readable RDF. From the standpoint of DCAT, we can consider the information in each directory as a dataset. There is no philosophical reason why we should organize the datasets that way. Rather, it is based on practicality, since the server that dereferences TDWG IRIs can generate a data dump for each directory via a dump URL. See this file for a complete list of the datasets.

Each of the abstract datasets can be accessed through one of several distributions. Currently, the RDF metadata about the TDWG data says that there are three distributions for each of the datasets: one in RDF/XML, one in RDF/Turtle, and one in JSON-LD (with the JSON-LD having a problem I mentioned in the third post). The IANA media type for each distribution is given as the value of a dcat:mediaType property (see the diagram above for an example).

One thing that is a bit different from what one might consider the traditional Linked Data approach is that the distributions are not really considered representations of the datasets. That is, under the DCAT model, one does not necessarily expect to be redirected to the distribution IRI from dereferencing of the dataset IRI through content negotiation. That's because content negotiation generally results in direct retrieval of some human- or machine-readable serialization, but in the DCAT model, the distribution itself is a separate, abstract entity apart from the serialization. The serialization itself is connected via a dcat:downloadURL property of the distribution (see the diagram above). I'm not sure why the DCAT model adds this extra layer, but I think it is probably so that a permanent IRI can be assigned to the distribution, while the download URL can be a mutable thing that can change over time, yet still be discovered through its link to the distribution.

At the moment, the dataset IRIs don't dereference, although that could be changed in the future if need be. Despite that, their metadata are exposed when the data catalog IRI itself is dereferenced, so a machine could learn all it needed to know about them with a single HTTP call to the catalog IRI.

In the case of the TDWG data, I didn't actually mint IRIs for the distributions, since it's not that likely that anyone would ever need to address them directly and I wasn't interested in maintaining another set of identifiers. So they are represented by blank (anonymous) nodes in the dataset. The download URLs can be determined from the dataset URI by rules, so there's no need to maintain a record of them, either.

Here is an abbreviated bit of the Turtle that you get if you dereference the catalog IRI http://rs.tdwg.org/index and request text/turtle (or just retrieve http://rs.tdwg.org/index.ttl):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
@prefix dc: <http://purl.org/dc/elements/1.1/>.
@prefix dcterms: <http://purl.org/dc/terms/>.
@prefix dcat: <http://www.w3.org/ns/dcat#>.
@prefix dcmitype: <http://purl.org/dc/dcmitype/>.

<http://rs.tdwg.org/index>
dc:publisher "Biodiversity Information Standards (TDWG)"@en;
dcterms:publisher <https://www.grid.ac/institutes/grid.480498.9>;
dcterms:license <http://creativecommons.org/licenses/by/4.0/>;
dcterms:modified "2018-10-09"^^xsd:date;
rdfs:label "TDWG dataset catalog"@en;
rdfs:comment "This dataset contains the data that underlies TDWG standards and standards documents"@en;
dcat:dataset <http://rs.tdwg.org/index/audubon>;
a dcat:Catalog.

<http://rs.tdwg.org/index/audubon>
dcterms:modified "2018-10-09"^^xsd:date;
rdfs:label "Audubon Core-defined terms"@en;
dcat:distribution _:53c07f45-4561-448b-9bb9-396e47d3ad1d;
a dcmitype:Dataset.

_:53c07f45-4561-448b-9bb9-396e47d3ad1d
dcat:mediaType <https://www.iana.org/assignments/media-types/application/rdf+xml>;
dcterms:license <https://creativecommons.org/publicdomain/zero/1.0/>;
dcat:downloadURL <http://rs.tdwg.org/dump/audubon.rdf>;
a dcat:Distribution.

In this Turtle, you can see the DCAT-based structure as described above.

Returning to a comment that I made earlier, DCAT can describe data in any form and it's not restricted to RDF. So in theory, one could consider each dataset to have a distribution that is in CSV format, and use the GitHub raw URL for the CSV file as the download URL of that distribution. I haven't done that because complete information about the dataset requires the combination of the raw CSV file with a property mapping table and I don't know how to represent that complexity in DCAT. But at least in theory it could be done. One can also indicate that a distribution of the dataset is available from an API such as a SPARQL endpoint, which I also have not done because the datasets aren't compartmentalized into named graphs and therefore can't really be distinguished from each other. But again, in theory it could be done.

Getting a dump of all of the data

At the start of this post, I complained that there were potential issues with the first two methods that I described for retrieving all of the TDWG standards metadata. I promised a better way, so here it is!

In theory, a client could start with the catalog IRI (http://rs.tdwg.org/index), dereference it requesting the machine-readable serialization flavor of your choice, and follow the links to the download URLs of all 50 of the datasets currently in the catalog. That would be in the LOD style and would require far fewer HTTP calls than the thousands that would be required to scrape all of the machine-readable data one standards-related resource at a time.

However, here is a quick and dirty way that doesn't require using any Linked Data technology:

use a script of your favorite programming language to load the raw file for the datasets CSV table on GitHub
get the dataset name from the second ("term_localName") column (e.g. audubon)
prepend http://rs.tdwg.org/dump/ to the name (e.g. http://rs.tdwg.org/dump/audubon)
append the appropriate file extension for the serialization you want (.ttl for Turtle, .rdf for XML) to the URL from the previous step (e.g. http://rs.tdwg.org/dump/audubon.ttl)
make an HTTP GET call to that URL to acquire the machine-readable serialization for that dataset.
Repeat for the other 49 data rows in the table.

I've actually done something like this in lines 55 to 63 of a Python script on GitHub. Rather than making a GET request, the script actually uses the constructed URL to create a SPARQL Update command that loads the data directly from the TDWG server into a graph database triplestore (lines 133 and 127) via an HTTP POST request. But you could use GET to load the data directly into your own software using a library like Python's RDFLib if you preferred to work with it directly rather than through a SPARQL endpoint.

The advantage of getting the dump in this way is that it would be coming directly from the authoritative TDWG server (which gets its data from the CSVs in the rs.tdwg.org repo of the TDWG GitHub site). You would then be guaranteed to have the most up-to-date version of the data, something that would not necessarily happen if you got the data from somebody else's SPARQL endpoint.

In the future, this method will be important because it would be the best way to build reliable applications that made use of standards metadata. For many standards and the "regular" TDWG vocabularies that conform to the SDS (Darwin and Audubon Cores), retrieving up-to-date metadata probably isn't that critical because those standards don't change very quickly. However, in the case of controlled vocabularies, access to up-to-date data may be more important.

Understanding the TDWG Standards Documentation Specification, Part 4: Machine-readable Metadata Via an API

2019-04-07T20:44:00.000-07:00

This is the fourth in a series of posts about the TDWG Standards Documentation Specification (SDS). For background on the SDS, see the first post. For information on its hierarchical model and how it relates to IRI design, see the second post. For information about how metadata is retrieved via IRI dereferencing, see the third post.

Note: this post was revised on 2020-03-04 when IRI dereferencing of the http://rs.tdwg.org/ subdomain went from testing into production.

Retrieving metadata about TDWG standards using a web API

If you have persevered through the first three posts in this series, congratulations! The main reason for those earlier posts was to provide the background for this post, which is on the topic that will probably be most interesting to readers: how to effectively retrieve machine-readable metadata about TDWG standards using a web API.

Let's start with retrieving an example resource: the term IRIs and definitions of terms of a TDWG vocabulary (Darwin Core=dwc or Audubon Core=ac).

Here is what we need for the API call:

Resource URL: https://sparql.vanderbilt.edu/sparql
Method: GET
Authentication required: No
Request header key: Accept
Request header value: application/json, text/csv or application/xml
Parameter key: query
Parameter value: insert "dwc" or "ac" in place of {vocabularyAbbreviation} in the following string:
"prefix%20skos%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23%3E%0Aprefix%20dcterms%3A%20%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2F%3E%0ASELECT%20DISTINCT%20%3Firi%20%3Fdefinition%0AWHERE%20%7B%0A%20%20GRAPH%20%3Chttp%3A%2F%2Frs.tdwg.org%2F%3E%20%7B%0A%20%20%20%20%3Chttp%3A%2F%2Frs.tdwg.org%2F{vocabularyAbbreviation}%2F%3E%20dcterms%3AhasPart%20%3FtermList.%0A%20%20%20%20%3FtermList%20dcterms%3AhasPart%20%3Firi.%0A%20%20%20%20%3Firi%20skos%3AprefLabel%20%3Flabel.%0A%20%20%20%20%3Firi%20skos%3Adefinition%20%3Fdefinition.%0A%20%20%20%20FILTER(lang(%3Flabel)%3D%22en%22)%0A%20%20%20%20FILTER(lang(%3Fdefinition)%3D%22en%22)%0A%20%20%20%20%7D%0A%7D%0AORDER%20BY%20%3Firi"

Note: the Accept header is required to receive JSON -- omitting it returns XML.

Here's an example response that shows the structure of the JSON that is returned:

{
"head": {
"vars": [
"iri",
"definition"
]
},
"results": {
"bindings": [
{
"iri": {
"type": "uri",
"value": "http://ns.adobe.com/exif/1.0/PixelXDimension"
},
"definition": {
"xml:lang": "en",
"type": "literal",
"value": "Information specific to compressed data. When a compressed file is recorded, the valid width of the meaningful image shall be recorded in this tag, whether or not there is padding data or a restart marker. This tag shall not exist in an uncompressed file."
}
},

(... many more array values here ...)

{
"iri": {
"type": "uri",
"value": "http://rs.tdwg.org/dwc/terms/waterBody"
},
"definition": {
"xml:lang": "en",
"type": "literal",
"value": "The name of the water body in which the Location occurs. Recommended best practice is to use a controlled vocabulary such as the Getty Thesaurus of Geographic Names."
}
}
]
}
}

Here is an example script to use the API via Python 3. (You can convert to your own favorite programming language or see this page if you need to set up Python 3 on your computer.) Note: the requests module is not included in the Python standard library and must be installed using PIP or another package manager.

Although the API can return CSV and XML, we will only be using JSON in this example.
------

import requests

vocab = input('Enter the vocabulary abbreviation (dwc for Darwin Core or ac for Audubon Core): ')

# values required for the HTTP request
resourceUrl = 'https://sparql.vanderbilt.edu/sparql'
requestHeaderKey = 'Accept'
requestHeaderValue = 'application/json'
parameterKey = 'query'
parameterValue ='prefix%20skos%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23%3E%0Aprefix%20dcterms%3A%20%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2F%3E%0ASELECT%20DISTINCT%20%3Firi%20%3Fdefinition%0AWHERE%20%7B%0A%20%20GRAPH%20%3Chttp%3A%2F%2Frs.tdwg.org%2F%3E%20%7B%0A%20%20%20%20%3Chttp%3A%2F%2Frs.tdwg.org%2F'
parameterValue += vocab
parameterValue += '%2F%3E%20dcterms%3AhasPart%20%3FtermList.%0A%20%20%20%20%3FtermList%20dcterms%3AhasPart%20%3Firi.%0A%20%20%20%20%3Firi%20skos%3AprefLabel%20%3Flabel.%0A%20%20%20%20%3Firi%20skos%3Adefinition%20%3Fdefinition.%0A%20%20%20%20FILTER(lang(%3Flabel)%3D%22en%22)%0A%20%20%20%20FILTER(lang(%3Fdefinition)%3D%22en%22)%0A%20%20%20%20%7D%0A%7D%0AORDER%20BY%20%3Firi'
url = resourceUrl + '?' + parameterKey + '=' + parameterValue

# make the HTTP request and store the terms data in a list
r = requests.get(url, headers={requestHeaderKey: requestHeaderValue})
responseBody = r.json()
items = responseBody['results']['bindings']

# iterate through the list and print what we wanted
for item in items:
print(item['iri']['value'])
print(item['definition']['value'])
print()

------

For anyone who has programmed an application to retrieve data from an API, this is pretty standard stuff and because the requests module is so simple to use, the part of the code that actually retrieves the data from the API (lines 16-18) is only three lines long. So the coding required to retrieve the data is not complicated. For the output I just had the values for the IRI and definition printed to the console, but obviously you could do whatever you wanted with them in your own programming.

If you are familiar with using web APIs and if you examined the details of the code, you will probably have several questions:

- Why is the parameter value so much longer and weirder than what is typical for web APIs?

- What is this sparql.vanderbilt.edu API?

- What other kinds of resources can be obtained from the API?

About the API

The reason that the parameter value is so long and weird looking is because the required parameter value is a SPARQL query in URL-encoded form. I purposefully obfuscated the parameter value by URL-encoding it in the script because I wanted to emphasize how a SPARQL endpoint is fundamentally just like any other web API, except with a more complicated query parameter.

I feel like in the past Linked Data, RDF, and SPARQL has been talked about in the TDWG community like it is some kind of religion with secrets that only initiated members of the priesthood can know. (For a short introduction to this topic, see this video.) It is true that if you want to design an RDF data model or build the infrastructure to transform tabular data to RDF, you need to know a lot of technical details, but those are not tasks that most people need to do. You actually don't need to know anything about RDF, how it's structured, or how to create it in order to use a SPARQL endpoint, as I just demonstrated above.

The endpoint http://sparql.vanderbilt.edu/sparql provides public access to datasets that have been made available by the Vanderbilt Libraries. It is our intention to keep this API up and the datasets stable for as long as possible. (For more about the API, see this page.) However, there is nothing special about about the API - it's just an installation of Blazegraph, which is freely available without cost as a Docker image (see this page for instructions if you want to try installing it on your own computer). The TDWG dataset that is loaded into the Vanderbilt API is also freely available and can be installed in any Blazegraph instance. So although the Vanderbilt API provides a convenient way to access the TDWG data, there is nothing special about it. There is no custom programming that has been done to get it online and there has been no processing of the data that was loaded into it. There could be zero to many other APIs that could be set up to provide exactly the same services using exactly the same API calls. For those who are interested, later on in this post I will provide more details about how anyone can obtain the data, but those are details that most users can happily ignore.

The interesting thing about SPARQL endpoints is that there is an unlimited number of resources that can be obtained from the API. Conventional APIs, such as the GBIF or Twitter APIs, provide web pages that list the available resources and the parameter key/value pairs required to obtain them. If potential users want to obtain a resource that is not currently available, they have to ask the API developers to create the code required to allow them to access that resource. A SPARQL endpoint is much simpler. It has exactly one resource URL (the URL of the endpoint) and for read operations has only one parameter key (query). The value of that single parameter is the SPARQL query.

In a manner analogous to traditional API documentation, we can (and should) provide a list of queries that would retrieve the types of information that users typically might want to obtain. Developers who are satisfied with that list can simply follow the recipe and make API calls using that recipe as they would for any other API. But the great thing about a SPARQL endpoint is that you are NOT limited to any provided list of queries. If you are willing to study the TDWG standards data model that I described in the second post of this series and expend a minimal amount of time learning to construct SPARQL queries (see this beginner's page to get started), you can retrieve any kind of data that you can imagine without needing to beg some developers to add that functionality to their API.

In the next section, I'm going to simplify the Python 3 script that I listed above, then provide several additional API call examples.

A generic Python script for making other API calls

Here is the previous script in a more straightforward and hackable form:

------

import requests

vocab = input('Enter the vocabulary abbreviation (dwc for Darwin Core or ac for Audubon Core): ')

parameterValue ='''prefix skos: <http://www.w3.org/2004/02/skos/core#>

prefix dcterms: <http://purl.org/dc/terms/>

SELECT DISTINCT ?iri ?definition

WHERE {

GRAPH <http://rs.tdwg.org/> {

<http://rs.tdwg.org/'''

parameterValue += vocab

parameterValue += '''/> dcterms:hasPart ?termList.

?termList dcterms:hasPart ?iri.

?iri skos:prefLabel ?label.

?iri skos:definition ?definition.

FILTER(lang(?label)="en")

FILTER(lang(?definition)="en")

}

ORDER BY ?iri'''

endpointUrl = 'https://sparql.vanderbilt.edu/sparql'

requestHeaderValue = 'application/json'

# make the HTTP request and store the terms data in a list

r = requests.get(endpointUrl, headers={'Accept': requestHeaderValue}, params={'query': parameterValue})

responseBody = r.json()

items = responseBody['results']['bindings']

# iterate through the list and print what we wanted

for item in items:

print(item['iri']['value'])

print(item['definition']['value'])

print()

------

The awesome Python requests module allows you to pass the parameters to the .get() method as a dict, getting rid of the necessity of constructing the entire URL yourself. The values you pass are automatically URL-encoded, so that eliminates the necessity of doing the encoding yourself. As a result, I was able to create the parameter value by assigning multi-line strings that are formatted in a much more readable way. Since the only header we should ever need to send is Accept and the only parameter key we should need is query, I just hard-coded them into the corresponding dicts of the .get() method. I left the value for the Accept request header as a variable in line 24 as a variable in case anybody wants to play with requesting XML or a CSV table.

We can now request different kinds of data from the API by changing the parameter value that is assigned in lines 3 through 21.

Multilingual labels and definitions

To retrieve the label and definition for a Darwin Core term in a particular language, substitute these lines for lines 3-21:

------

localName = input('Enter the local name of a Darwin Core term: ')

language = input('Enter the two-letter code for the language you want (en, es, zh-hans): ')

parameterValue ='''prefix skos: <http://www.w3.org/2004/02/skos/core#>

SELECT DISTINCT ?label ?definition

WHERE {

GRAPH <http://rs.tdwg.org/> {

BIND(IRI("http://rs.tdwg.org/dwc/terms/'''

parameterValue += localName

parameterValue += '''") as ?iri)

BIND("'''

parameterValue += language

parameterValue += '''" as ?language)

?iri skos:prefLabel ?label.

?iri skos:definition ?definition.

FILTER(lang(?label)=?language)

FILTER(lang(?definition)=?language)

}

}'''

------

The printout section needs to be changed, since we asked for a label instead of an IRI:

------

for item in items:

print()

print(item['label']['value'])

print(item['definition']['value'])

------

The "local name" asked for by the script is the last part of a Darwin Core IRI. For example, the local name for dwc:recordedBy (that is, http://rs.tdwg.org/dwc/terms/recordedBy) would be recordedBy. (You can find more local names to try here.)

Other than English, we currently only have translations of term names and labels in Spanish and simplified Chinese. We also only have translations of dwc: namespace terms from Darwin Core and not dwciri:, dc:, or dcterms: terms. So this resource is currently somewhat limited, but could get better in the future with the addition of other languages to the dataset.

Track the history of any TDWG term to the beginning of the universe

The user sends the full IRI of any term ever created by TDWG and the API will return the term name, version date of issue, definition and status of every version that was a precursor of that term. Again, replace lines 3-21 with this:

------

iri = input('Enter the unabbreviated IRI of a TDWG vocabulary term: ')

parameterValue ='''prefix dcterms: <http://purl.org/dc/terms/>

prefix skos: <http://www.w3.org/2004/02/skos/core#>

prefix tdwgutility: <http://rs.tdwg.org/dwc/terms/attributes/>

SELECT DISTINCT ?term ?date ?definition ?status

WHERE {

GRAPH <http://rs.tdwg.org/> {

<'''

parameterValue += iri

parameterValue += '''> dcterms:hasVersion ?directVersion.

?directVersion dcterms:replaces* ?version.

?version dcterms:issued ?date.

?version tdwgutility:status ?status.

?version dcterms:isVersionOf ?term.

?version skos:definition ?definition.

FILTER(lang(?definition)="en")

}

ORDER BY DESC(?date)'''

------

and replace the printout section with this:

------

for item in items:

print()

print(item['date']['value'])

print(item['term']['value'])

print(item['definition']['value'])

print(item['status']['value'])

------

The results of this query allow you to see every possible previous term that might have been used in the past to refer to this concept, and to see how the definition of those earlier terms differed from the target term. You should try it with everyone's favorite confusing term, dwc:basisOfRecord, which has the unabbreviated IRI http://rs.tdwg.org/dwc/terms/basisOfRecord .

You can make a simple modification to the script to have the call return every term that has ever been used to replace an obsolete term, and the definitions of every version of those terms. Just replace dcterms:replaces* with dcterms:isReplacedby* in the second parameterValue string. If you want them to be ordered from oldest to newest, you can replace DESC(?date) with ASC(?date). Try it with this refugee from the past: http://digir.net/schema/conceptual/darwin/2003/1.0/YearCollected .

What are all of the TDWG Standards documents?

This version of the script lets you enter any part of a TDWG standard's name and it will retrieve all of the documents that are part of that standard, tell you the date it was last modified, and give the URL where you might be able to find it (some are print only and at least one -- XDF -- seems to be lost entirely). Press enter without any text and you will get all of them. Here's the code to generate the parameter value:

------

searchString = input('Enter part of the standard name, or press Enter for all: ')

parameterValue ='''PREFIX foaf: <http://xmlns.com/foaf/0.1/>

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT DISTINCT ?docLabel ?date ?stdLabel ?url

FROM <http://rs.tdwg.org/>

WHERE {

?standard a dcterms:Standard.

?standard rdfs:label ?stdLabel.

?standard dcterms:hasPart ?document.

?document a foaf:Document.

?document rdfs:label ?docLabel.

?document rdfs:seeAlso ?url.

?document dcterms:modified ?date.

FILTER(lang(?stdLabel)="en")

FILTER(lang(?docLabel)="en")

FILTER(contains(?stdLabel, "'''

parameterValue += searchString

parameterValue += '''"))

}

ORDER BY ?stdLabel'''

------

and here's the printout section:

------

for item in items:

print()

print(item['docLabel']['value'])

print(item['date']['value'])

print(item['stdLabel']['value'])

print(item['url']['value'])

------

Note: the URLs that are returned are access URLs, NOT the IRI identifiers for the documents!

The following is a variation of the API call above. In this variation, you enter the name of a standard (or press Enter for all), and you can retrieve the names of all of the contributors (whose roles might have included author, editor, translator, reviewer, or review manager). Parameter value code:

------

parameterValue ='''PREFIX foaf: <http://xmlns.com/foaf/0.1/>

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT DISTINCT ?contributor ?stdLabel

FROM <http://rs.tdwg.org/>

WHERE {

?standard a dcterms:Standard.

?standard rdfs:label ?stdLabel.

?standard dcterms:hasPart ?document.

?document a foaf:Document.

?document dcterms:contributor ?contribUri.

?contribUri rdfs:label ?contributor.

FILTER(contains(?stdLabel, "'''

parameterValue += searchString

parameterValue += '''"))

}

ORDER BY ?contributor'''

------

Printout code:

------

for item in items:

print()

print(item['contributor']['value'])

print(item['stdLabel']['value'])

------

Note: assembling this list of documents was my best shot at determining what documents should be considered to be part of the standards themselves, as opposed to ancillary documents not part of the standards. It's possible that I might have missed some, or included some that aren't considered key to the standards. This is more of a problem with older documents whose status was not clearly designated.

The SDS isn't very explicit about how to assign all of the properties that should probably be assigned to documents, so some information that might be important is missing, such as documentation of contributor roles. Also, I could not determine who all of the review managers were, where the authoritative locations were for all documents, nor find prior versions for some documents. So this part of the TDWG standards metadata still needs some work.

An actual Linked Data application

In the previous examples, the data involved was limited to metadata about TDWG standards. However, we can make an API call that is an actual bona fide application of Linked Data. Data from the Bioimages project are available as RDF/XML. You can examine the human-readable web page of an image at http://bioimages.vanderbilt.edu/thomas/0488-01-01 and the corresponding RDF/XML here. Both the human- and machine-readable versions of the image metadata use either Darwin Core or Audubon Core terms as most of their properties. However, the Bioimages metadata do not provide an explanation of what those TDWG vocabulary terms mean.

Both the Bioimages and TDWG metadata datasets have been loaded into the Vanderbilt Libraries SPARQL API, and we can include both datasets in the query's dataset using the FROM keyword. That allows us to make use of information from the TDWG dataset in the query of the Bioimages data because the two datasets are linked by use of common term IRIs. In the query, we can ask for the metadata values for the image (from the Bioimages dataset), but include the definition of the properties (from the TDWG dataset; not present in the Bioimages dataset).

------

iri = input('Enter the unabbreviated IRI of an image from Bioimages: ')

parameterValue ='''PREFIX dcterms: <http://purl.org/dc/terms/>

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT DISTINCT ?label ?value ?definition

FROM <http://rs.tdwg.org/>

FROM <http://bioimages.vanderbilt.edu/images>

WHERE {

<'''

parameterValue += iri

parameterValue += '''> ?property ?value.

?property skos:prefLabel ?label.

?property skos:definition ?definition.

FILTER(lang(?label)="en")

FILTER(lang(?definition)="en")

}'''

------

Printout code:

------

for item in items:

print()

print(item['label']['value'])

print(item['value']['value'])

print(item['definition']['value'])

------

You can try this script out on the example IRI I gave above (http://bioimages.vanderbilt.edu/thomas/0488-01-01) or on any other image identifier in the collection (listed under "Refer to this permanent identifier for the image:" on any of the image metadata pages that you get to by clicking on an image thumbnail).

Conclusion

Hopefully, these examples can give you a taste for the kind of metadata about TDWG standards that can be retrieved using an API. There are several final issues that I should discuss before I wrap up this post. I'm going to present them in a Q&A format.

Q: Can I build an application to use this API?

A. Yes, you could. We intend for the Vanderbilt SPARQL API to remain up indefinitely at the endpoint URL given in the examples. However, we can't make a hard promise about that, and the API is not set up to handle large amounts of traffic. There aren't any usage limits and subsequently it's already been crashed once by someone who hit it really hard. So if you need a robust service, you should probably set up your own installation of Blazegraph and populate it with the TDWG dataset.

Q: How can I get a dump of the TDWG data to populate my own version of the API?

A: The simplest way is to execute this query to the Vanderbilt SPARQL API as above with an Accept header of text/turtle:

CONSTRUCT {?s ?p ?o}

FROM <http://rs.tdwg.org/>

WHERE {?s ?p ?o}

URL-encoded, the query is:

CONSTRUCT%20%7B%3Fs%20%3Fp%20%3Fo%7D%0AFROM%20%3Chttp%3A%2F%2Frs.tdwg.org%2F%3E%0AWHERE%20%7B%3Fs%20%3Fp%20%3Fo%7D

If you use Postman, you can drop down the Send button to Send and Download and save the data in a file, which you can upload into your own instance of Blazegraph or some other SPARQL endpoint/triplestore. (There are approximately 43000 statements (triples) in the dataset, so copy and paste is not a great method of putting them into a file.) If your triplestore doesn't support RDF/Turtle, you can get RDF/XML instead by using an Accept header of application/rdf+xml.

There is a better method of acquiring the data that uses the authoritative source data, but I'll have to describe that in a subsequent post.

Q: How accurate are the data?

A: I've spent many, many hours over the last several years curating the source data in GitHub. Nevertheless, I still discover errors almost every time I try new queries on the data. If you discover errors, put them in the issues tracker and I'll try to fix them.

Q: How would this work for future controlled vocabularies?

A: This is a really important question. It's so important that I'm going to address it in a subsequent post in the series.

Q: How can I retrieve information from the API about resources that weren't described in the examples?

A: Since a SPARQL endpoint is essentially a program-it-yourself API, all you need is to have the right SPARQL query to retrieve the information you want. First you need to have a clear idea of the question you want to answer. Then you've got two options: find someone who knows how to write SPARQL queries and get them to write the query for you, or teach yourself how to write SPARQL queries and do it yourself. You can test your queries by pasting them in the box at https://sparql.vanderbilt.edu/ as you build them. It is not possible to create the queries without understanding the underlying data model (the graph model) and the machine-readable properties assigned to each kind of resource. That's why I wrote the first (boring) parts of this series and why we wrote the specification itself.

Q: Where did the data in the dataset come from and how is it managed?

A: That is an excellent question. Actually it is several questions:

- where does the data come from? (answer: the source csv tables in GitHub)

- how does the source data get turned into machine-readable data?

- how does the machine-readable data get into the API?

One of the beauties of REST is that when you request a URI from a server, you should be able to get a useful response from the server without having to worry about how the server generates that response. What that means in this context is that the intermediate steps that lie between the source data and what comes out of the API (the answers to the second and third questions above) can change and the client should never notice the difference since it would still be able to get exactly the same response. That's because the processing essentially involves implementing a mapping between what's in the tables on GitHub and what the SDS says the standardized machine-readable metadata should look like. There is no one particular way that mapping must happen, as long as the end result is the same. I will discuss this point in what will probably be the last post of the series.

Understanding the TDWG Standards Documentation Specification, Part 3: Machine-readable Metadata Via Content Negotiation

2019-04-02T21:17:00.000-07:00

This is the third in a series of posts about the TDWG Standards Documentation Specification (SDS). For background on the SDS, see the first post. For information on its hierarchical model and how it relates to IRI design, see the second post.

Note: this post was revised on 2020-03-04 when IRI dereferencing of the http://rs.tdwg.org/ subdomain went from testing into production.

Human- vs. Machine-readable metadata

In the previous posts, I made the point that the SDS considers standards-related resources such as standards, vocabularies, term lists, terms, and documents to be abstract entities (section 2.1). As such, the IRI assigned to a resource denotes that resource in its abstract form. That abstract resource does not have one particular representation -- rather it can have multiple representation syntaxes which differ in format, but which in most cases provide equivalent information.

For example, consider the deprecated Darwin Core term dwccuratorial:Disposition. It is denoted by the IRI http://rs.tdwg.org/dwc/curatorial/Disposition. The metadata for this term in human-readable form looks like this:

Term Name: dwccuratorial:Disposition
Label: Disposition
Term IRI: http://rs.tdwg.org/dwc/curatorial/Disposition
Term version IRI: http://rs.tdwg.org/dwc/curatorial/version/Disposition-2007-04-17
Modified: 2009-04-24
Definition: The current disposition of the cataloged item. Examples: "in collection", "missing", "voucher elsewhere", "duplicates elsewhere".
Type: Property
Note: This term is no longer recommended for use.
Is replaced by: http://rs.tdwg.org/dwc/terms/disposition

In RDF/Turtle machine-readable serializations, the metadata looks like this (namespace abbreviations omitted):

<http://rs.tdwg.org/dwc/curatorial/Disposition>
rdfs:isDefinedBy <http://rs.tdwg.org/dwc/curatorial/>;
dcterms:isPartOf <http://rs.tdwg.org/dwc/curatorial/>;
dcterms:created "2007-04-17"^^xsd:date;
dcterms:modified "2009-04-24"^^xsd:date;
owl:deprecated "true"^^xsd:boolean;
rdfs:label "Disposition"@en;
skos:prefLabel "Disposition"@en;
rdfs:comment "The current disposition of the cataloged item. Examples: \"in collection\", \"missing\", \"voucher elsewhere\", \"duplicates elsewhere\"."@en;
skos:definition "The current disposition of the cataloged item. Examples: \"in collection\", \"missing\", \"voucher elsewhere\", \"duplicates elsewhere\"."@en;
rdf:type <http://www.w3.org/1999/02/22-rdf-syntax-ns#Property>;
tdwgutility:abcdEquivalence "DataSets/DataSet/Units/Unit/SpecimenUnit/Disposition";
dcterms:hasVersion <http://rs.tdwg.org/dwc/curatorial/version/Disposition-2007-04-17>;
dcterms:isReplacedBy <http://rs.tdwg.org/dwc/terms/disposition>.

In RDF/XML machine-readable form, the metadata looks like this (namespace abbreviations omitted):

<rdf:Description rdf:about="http://rs.tdwg.org/dwc/curatorial/Disposition">

<rdfs:isDefinedBy rdf:resource="http://rs.tdwg.org/dwc/curatorial/"/>

<dcterms:isPartOf rdf:resource="http://rs.tdwg.org/dwc/curatorial/"/>

<dcterms:created rdf:datatype="http://www.w3.org/2001/XMLSchema#date">2007-04-17</dcterms:created>

<dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#date">2009-04-24</dcterms:modified>

<owl:deprecated rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean">true</owl:deprecated>

<rdfs:label xml:lang="en">Disposition</rdfs:label>

<skos:prefLabel xml:lang="en">Disposition</skos:prefLabel>

<rdfs:comment xml:lang="en">The current disposition of the cataloged item. Examples: "in collection", "missing", "voucher elsewhere", "duplicates elsewhere".</rdfs:comment>

<skos:definition xml:lang="en">The current disposition of the cataloged item. Examples: "in collection", "missing", "voucher elsewhere", "duplicates elsewhere".</skos:definition>

<rdf:type rdf:resource="http://www.w3.org/1999/02/22-rdf-syntax-ns#Property"/>

<tdwgutility:abcdEquivalence>DataSets/DataSet/Units/Unit/SpecimenUnit/Disposition</tdwgutility:abcdEquivalence>

<dcterms:hasVersion rdf:resource="http://rs.tdwg.org/dwc/curatorial/version/Disposition-2007-04-17"/>

<dcterms:isReplacedBy rdf:resource="http://rs.tdwg.org/dwc/terms/disposition"/>

</rdf:Description>

For brevity, I'll omit the JSON-LD serialization. If you make a careful comparison of the two machine-readable serializations shown here, you'll see that they contain exactly the same information.

The SDS requires that when a machine consumes any machine-readable serialization, it acquire information identical to any other serialization (section 4). For most resources (terms, vocabularies, etc.), the human-readable representation generally contains the same information as the machine-readable serializations for all of the key properties required by the SDS, although some that aren't required, such as the abdcEquivalence, are omitted. The exception to this is standards-related documents -- the human-readable representation is the document itself, while the machine-readable representations are metadata about the document. (In contrast, machine-readable metadata about vocabularies, term lists, and terms contain virtually complete data about the resource.)

Distinguishing between resources and the documents that describe them

Section 4.1 of the SDS requires that machine-readable documents must have IRIs that are different from the IRIs of the abstract resources that they describe. Although at first it many not be apparent why this is important, we can see why if we consider the case of some of the older TDWG standards documents. For instance, the document Floristic Regions of the World (denoted by the IRI http://rs.tdwg.org/frw/doc/book/) by A. L. Takhtahan was adopted as part of TDWG standard http://www.tdwg.org/standards/104. It is copyrighted by the University of California Press and is not available under an open license. However, the metadata about the book in RDF/Turtle serialization (denoted by the IRI http://rs.tdwg.org/frw/doc/book.ttl) is freely available. So we could make the statement

http://rs.tdwg.org/frw/doc/book.ttl dcterms:license https://creativecommons.org/publicdomain/zero/1.0/ .

but it would NOT be accurate to make the statement

http://rs.tdwg.org/frw/doc/book/ dcterms:license https://creativecommons.org/publicdomain/zero/1.0/ .

because the book isn't licensed as CC0. Similarly, it would be correct to say:

http://rs.tdwg.org/frw/doc/book/ dc:creator "A. L. Takhtahan" .

but not:

http://rs.tdwg.org/frw/doc/book/ dc:creator "Biodiversity Information Standards (TDWG)" .

because TDWG did not create the book. On the other hand, saying:

http://rs.tdwg.org/frw/doc/book.ttl dc:creator "Biodiversity Information Standards (TDWG)" .

would be correct, since TDWG did create the RDF/Turtle metadata document that describes the book.

Although in human-readable documents we tend to be fuzzy about the distinction between resources and the metadata about those resources, when we create machine-readable metadata representations we need to be careful to distinguish between the two.

The SDS prescribes a way to link metadata documents and the resources they are about: dcterms:references and dcterms:isReferencedBy (section 4.1). In the example above, we can say:

http://rs.tdwg.org/frw/doc/book.ttl dcterms:references http://rs.tdwg.org/frw/doc/book/ .

and

http://rs.tdwg.org/frw/doc/book/ dcterms:isReferencedBy http://rs.tdwg.org/frw/doc/book.ttl .

Content negotiation

As I explained in the second post of this series, IRIs are fundamentally identifiers. There is no requirement that an IRI actually dereference to retrieve a web page or any other kind of document, although if it did, that would be nice, since that's the kind of behavior that people expect, particularly if the IRI begins with "http://" or "https://". If you think about it, defining TDWG IRIs to denote an abstract conceptual thing is a bit of a problem, because only non-abstract files can actually be returned to a user from a server through the Internet. You can't retrieve an abstract thing like the emotion "love" or the concept "justice" through the Internet, although you could certainly mint IRIs to denote those kinds of things.

The standard practice when an IRI denotes a resource that is a physical object or abstract idea is to redirect the user to a document that is about the object or idea. Such a document containing descriptive metadata about the resource is called a representation of the resource. Users can specify what kind of document (human- or machine-readable) they want, and more specifically, the serialization that they want if they are asking for a machine-readable document. This process is called content negotiation.

Resolution of permanent identifiers indefinitely is specified by Recommendation 7 of the TDWG Globally Unique Identifier (GUID) Applicability Statement standard, although it does not go into details of how that resolution should happen. Section 2.1.1 and 2.1.2 of the SDS expands on the GUID AS by saying that the abstract IRI should be stable and generic, and that content negotiation should redirect the user to an IRI for a particular content type that will serve as a URL that can be used to retrieve the document of the content type the user wanted. That requirement is based on the widespread practice in the Linked Data community as expressed in the 2008 W3C Note "Cool URIs for the Semantic Web".

The SDS does not specify a particular way that this redirection should be accomplished, but given that it's desirable to support as many different serializations as possible, I chose to implement the "303 URIs forwarding to Different Documents" recipe described in the Cool URIs document. Here are the specific details:

1. Client software performs an HTTP GET request for the abstract IRI of the resource and includes an Accept header that specifies the content type that it wants.

2. The server responds with an HTTP status code of 303 and includes the URL for the specific content type requested. To construct the redirect URL, any abstract IRIs with trailing slashes first have the trailing slash removed. If text/html is requested (i.e. human-readable web page), .htm is appended to the IRI to form the redirect URL. If text/turtle is requested, .ttl is appended. If application/rdf+xml is requested, .rdf is appended. If application/ld+json is requested, .json is appended.

3. The client then requests the specific redirect URL and the server returns the appropriate document in the serialization requested. In this stage, the Accept header is ignored by the server. In the case of standards documents and current terms in Darwin and Audubon Cores, there typically will be an additional redirect to a web page that isn't generated programmatically by the rs.tdwg.org server and might be located anywhere.

We can test the behavior using curl or a graphical HTTP client like Postman. Here is an example using Postman (with automatic following of redirects turned off):

1. Client requests metadata about the basic Darwin Core vocabulary by HTTP GET to the generic IRI: http://rs.tdwg.org/dwc/ with an Accept header of text/turtle.

2. The server responds with a 303 (see other) code and redirects to http://rs.tdwg.org/dwc.ttl

3. The client sends another GET request to http://rs.tdwg.org/dwc.ttl, this time without any Accept header.

4. The server responds with a 200 (success) code and a Content-Type response header of text/turtle. The response body is the document serialized as RDF/Turtle.

This illustration was done "manually" using Postman, but it is relatively simple to use any typical programming language (such as Javascript or Python) to perform HTTP calls with appropriate Accept headers.[1] So enabling IRI dereferencing with content negotiation really starts to open up TDWG standards to machine readability.

One feature of this implementation method is that it allows a human user to examine a representation in any serialization using a browser by just by hacking the abstract IRI using the rules in step 2. Thus, if you want to see what the RDF/XML serialization looks like for the basic Darwin Core vocabulary, you can put the URL http://rs.tdwg.org/dwc.rdf into a browser. The browser will send an Accept header of text/html, but since the URL contains an extension for a specific file type, the server will ignore the Accept header and send RDF/XML anyway. (Depending on how the browser is set up to handle file types, it may display the retrieved file in the browser window, or may initiate a download of the file into the user's Downloads directory.)

Important note: currently (as of April 2019), there is an error in the algorithm that generates the JSON-LD that causes repeated properties to be serialized incorrectly. The JSON that is returned validates as JSON-LD, but when the document is interpreted, some instances of the repeated properties are ignored. So application designers should at this point plan to consume either RDF/XML or RDF/Turtle until this error is corrected.

Why does this matter?

There are three reasons why implementation of dereferencing TDWG standards-related IRIs through content negotiation is important.

1. The least important reason is probably the one that is given as a core rationale in the Linked Data world: when someone "looks up" a URI, they get useful information and can discover more things through the links in the metadata. In theory, one could "discover" any resource related to TDWG standards, scrape the machine-readable metadata about that resource, dereference other resources that are linked to the first one, scrape those resources' medata and follow their links, etc. until everything that there is to be known about TDWG standards has been discovered. Essentially, we could have an analog of the Google web scraper that scrapes machine-readable documents instead of web pages. In theory, this could be done, but it would result in many HTTP calls and would be a very inefficient way to keep up-to-date on TDWG standards. There is a much better way, and I'll discuss it in the next post.

2. Probably the most important reason is that implementing real permanent IRIs for TDWG vocabularies and documents puts a stop to the continual breaking of links and browser bookmarks that happens every time documents get moved to a new website, get changed from HTML to markdown, etc. If we stress that the permanent IRIs are what should be bookmarked and cited, we can always set up the server to redirect to the URL of the day where the document or information actually lives. Since the permanent IRIs are "cool" and don't include implementation-specific aspects like ".php" or "?pid=123&lan=en", we can change the way we actually generate and serve the data at will without ever "breaking" any links. This is really critical if we want people to be able to cite IRIs for TDWG standards components in journal articles with those IRIs continuing to dereference indefinitely.

3. The third reason is more philosophical. By having IRIs that dereference to human- and machine-readable metadata, we demonstrate that these are "real" IRIs that exhibit the behavior expected from "grown-up" organizations in the Linked Data world in specific, and the web in general. We show that TDWG is not some fly-by-night organization that creates identifiers one day and abandons them the next. The Internet is littered with the wreckage of vocabularies and ontologies from organizations that minted terms but stopped paying for their domain name, or couldn't keep their servers running. Having properly dereferencing, permanent IRIs marks TDWG as a real standards organization that can run with the big dogs like Dublin Core and the W3C. (We also get 5 stars !)

In my next post I'll talk about retrieving SDS-specified machine readable standards metadata en masse.

[1] Sample Python 3 code for dereferencing a term IRI

Note: you may need to use PIP to install the requests module if you don't already have it.

import requests
iri = 'http://rs.tdwg.org/ac/terms/caption'
accept = 'text/turtle'
r = requests.get(iri, headers={'Accept' : accept})
print(r.text)

Understanding the TDWG Standards Documentation Specification, Part 2: Hierarchy Model and Implementation of IRIs

2019-03-10T07:55:00.003-07:00

This is the second in a series of posts about the TDWG Standards Documentation Specification (SDS). For background on the SDS, see the first post.

Note: this post was revised on 2020-03-04 when IRI dereferencing of the http://rs.tdwg.org/ subdomain went from testing into production.

Implementation plan?

The SDS was ratified and issued in April of 2017. It did not, however, include any plan for its implementation. It wasn't actually clear whose responsibility it was to make implementation of the SDS happen. The Technical Architecture Group (TAG) might have been a logical group to take charge, but in 2017 it had not yet been reconstituted in its current form. As the architect of the SDS, I had a vested interest in seeing it become functional, so I decided to take the initiative to figure out how it could be implemented. As I worked on this project, I got feedback from the Darwin Core Maintenance Group, key people working on the TDWG website and other infrastructure, and later from some TAG members.

Although the SDS provided a general framework, it left a lot of the details to implementers. In particular, the SDS had relatively little to say about the form of URIs used as identifiers for documents whose form was specified by the SDS. For guidance, I looked to precedents set by Darwin Core, general practices in the Linked Data world, and practicalities of URI dereferencing.

The SDS model

The SDS describes a hierarchical model for resources within its scope. That hierarchy is relatively simple for documents within a standard: there is simply a hasPart/isPartOf relationship between the standard and its documents.

For vocabularies, the situation is more complicated. The SDS describes four levels in the hierarchy that applies to vocabularies: standard, vocabulary, term list, and term. There was some discussion in the run-up to ratification of the SDS as to whether the model needed to be this complicated. At that time, I asserted that this was the least complicated model that could accomplish all of the things that people said they wanted to do with vocabularies in TDWG.

It would be tempting to say that a much simpler model might be possible. For example, we could consider the Audubon Core Standard to be synonymous with the Audubon Core vocabulary. We could say that Audubon Core terms were a direct part of it -- a simple two-level hierarchy.

However, the Audubon Core Standard is more than just a set of terms. The Audubon Core vocabulary is distinct from the documents that describe how Audubon Core should be used (the structure document, term list document, etc.), which are also part of the standard. Although we might lump the standard, vocabulary, and documents together in our human minds, if we really aspire to have machine-readable descriptions of components of TDWG standards, we have to distinguish between things that are not the same -- things that have different authors, creation dates, and version histories.

Example of first (standards) and second (vocabularies and documents) levels of the TDWG Standards Documentation Specification hierarchy

As I described in the previous post, there was also a desire expressed in the community for the capability to have more than one "Darwin Core vocabulary". Some people might want only the basic vocabulary (a "bag of terms" with definitions). Others might want a more complicated vocabulary where some terms might be declared to be equivalent to terms outside of Darwin Core, or classes might be declared to be subclasses of classes in an outside ontology. Still others might want to create a Darwin Core vocabulary that restrict the values that can be used for certain terms, or entail class membership through range and domain declarations. So although we don't currently have more than one Darwin Core vocabulary, we want to allow for that possibility in the future. That's another reason to have a model that separates the standard from the vocabulary or vocabularies that it defines.

Example of second (vocabularies) and third (term lists) levels of the TDWG Standards Documentation Specification hierarchy

Within a vocabulary, the SDS describes an entity called "term list" (Section 3.3.3 and 4.4.2).

Example of third (term list) and fourth (term) levels of the TDWG Standards Documentation Specification hierarchy. This is an example of a list of terms defined by TDWG and only includes a few of the terms on the list.

For terms defined by a TDWG vocabulary, there is an authoritative term list for each namespace. For example, there is an authoritative term list for the dwc: namespace and another for the dwciri: namespace. These lists are considered authoritative because they define the terms they contain. Dereferencing a term list IRI should return the term list document.

Example of third (term list) and fourth (term) levels of the TDWG Standards Documentation Specification hierarchy. This is an example of a list of terms borrowed by TDWG and only includes a few of the terms on the list.

A term list can also contain terms that are borrowed from another vocabulary and included in the TDWG vocabulary. The SDS does not prescribe how borrowed terms should be organized in term lists -- for example, whether all borrowed terms should be included in a single list or whether there should be a separate term list for each namespace from which terms are borrowed. As a practical matter, it made sense to create a separate term list for each namespace.

Some notes about IRIs

According to the SDS, each resource in the hierarchy should be assigned an IRI as an identifier (Section 2.1.1). An IRI is a superset of URIs that allows for non-Latin characters to be used. For the purposes of this post, you can consider URIs and IRIs to be synonymous.

There has always been confusion between the use of IRIs/URIsas identifiers and URLs as resource locators. Fundamentally, an IRI is an identifier that may or may not actually dereference in a web browser to retrieve a web page about the resource. In the Linked Data community, it is considered a best practice for IRIs to dereference, but it isn't a requirement. In fact, there are a number of "borrowed" term IRIs in Audubon Core that don't dereference and probably never will. So although it isn't a requirement of the SDS that TDWG IRIs dereference, one goal of implementation is to eventually make that happen.

The origin of the subdomain rs.tdwg.org has always been a little mysterious to me. I believe that the "rs" part stands for "schema repository" and that it was originally intended to be a place from which XML and other schemas could be retrieved. Although I don't think there is any official policy that requires use of the rs.tdwg.org subdomain for TDWG-minted IRIs, that has become the convention with Darwin Core and Audubon Core and I've taken that as the precedent to be followed when creating other IRIs that denote resources associated with TDWG standards. The exception to this pattern are the IRIs for the standards themselves. The precedent there is that TDWG standards have IRIs in the form http://www.tdwg.org/standards/nnn, where "nnn" is a number assigned to a particular standard.

IRI patterns for vocabulary standards

I used the precedents established by the Darwin and Audubon Core standards, together with the URI specification (RFC 3986) itself to establish IRI patterns that are consistent with the hierarchy established by the SDS. Section 1.2.3 of RFC 3986 notes that a forward slash is used to "delimit components that are significant to the generic parser's hierarchical interpretation of an identifier" and the IRIs of components of vocabularies can be interpreted this way.

Here are the patterns I established or continued based on past practice:

Standards IRI:

http://www.tdwg.org/standards/nnn

where "nnn" consists of numeric characters assigned to the standard. Dereferencing these IRIs should lead the user to the landing page of the standard. Example of the Darwin Core standard:

http://www.tdwg.org/standards/450

Note that since these IRIs aren't within the rs.tdwg.org subdomain, the test system I've implemented does not handle their dereferencing. Standards IRI dereferencing is handled by a separate system and I don’t know how fully functional it is for all prior TDWG standards.

Vocabulary IRI:

http://rs.tdwg.org/vvv/

where "vvv" is a sequence of alphabetic characters assigned to the vocabulary. Example of the Darwin Core basic vocabulary:

http://rs.tdwg.org/dwc/

Term list IRI:

http://rs.tdwg.org/vvv/ttt/

where "vvv" is a sequence of alphabetic characters assigned to the vocabulary and "ttt" is a sequence of alphabetic characters assigned to the term list within that vocabulary. Example of the Darwin Core IRI-valued terms:

http://rs.tdwg.org/dwc/iri/

Term IRI:

http://rs.tdwg.org/vvv/ttt/nnn

where "vvv" is a sequence of alphabetic characters assigned to the vocabulary, "ttt" is a sequence of alphabetic characters assigned to the term list within that vocabulary, and "nnn" is the local name of the term. Example of the "in described place" term:

http://rs.tdwg.org/dwc/iri/inDescribedPlace

The term pattern described above is backward compatible with all current Darwin Core and Audubon Core term IRIs. Existing Darwin Core RDF/XML asserts relationships between terms and the resource that defines them like this:

http://rs.tdwg.org/dwc/terms/dateIdentified rdfs:isDefinedBy http://rs.tdwg.org/dwc/terms/ .

So the IRI pattern for term lists is also backwards compatible with this previous use, with the name "term list" now explicitly given to the resource that defines terms.

The IRI pattern for vocabularies is new, but is consistent with the hierarchy and is necessary to distinguish between vocabularies and the standards that create them.

IRI pattern for documents

Previously, there had been no consistent pattern for IRIs assigned to documents associated with standards. Here are some examples of IRIs for Darwin Core documents:

The Darwin Core XML guide: http://rs.tdwg.org/dwc/terms/guides/xml/

The Darwin Core simple text guide: http://rs.tdwg.org/dwc/terms/simple/

To maintain backwards compatibility, these pre-existing IRIs were left unchanged. However, the IRI patterns used for Darwin Core documents make it difficult to distinguish programmatically between term and document IRIs using pattern matching. So for all documents from standards other than Darwin Core, I used this pattern:

http://rs.tdwg.org/sss/doc/docname/

where "sss" is a sequence of alphabetic characters representing the standard and "docname" is a short series of alphabetic characters representing the document. For example:

http://rs.tdwg.org/ac/doc/structure/

is the IRI for the Audubon Core Structure document.

Redirection

One thing that should be made clear is the distinction between the IRI that identifies a resource and the URL that actually can be used to retrieve a document or metadata about some other resource. Because the SDS considers the resources it describes as abstract entities, those entities can have multiple formats or serializations that are distinct from the abstract resources themselves. For example, the Audubon Core Structure document is an abstract thing identified by http://rs.tdwg.org/ac/doc/structure/ . However, the HTML serialization of that document can currently be retrieved from the URL https://tdwg.github.io/ac/structure/ and in the future that document might be made available at different URLs in other formats such as PDF. It is required that the IRI of the abstract resource be stable and unchanged, but there is no requirement that the retrieval URL for a serialization stay the same over time. Thus it's important that citations and bookmarks be set to the permanent IRI of the resource, and that redirection from the permanent IRI to the retrieval URL be maintained so that people can actually acquire a copy of the resource using a browser.

In the past, obscure, deprecated Darwin Core terms simply didn't dereference. In the test system, they redirect programmatically to a URL that is the term IRI plus ".htm". Here's an example:

http://rs.tdwg.org/dwc/curatorial/CollectorNumber

redirects to

http://rs.tdwg.org/dwc/curatorial/CollectorNumber.htm

The document that is retrieved is an HTML, human-readable description of the term.

Historically, current Darwin Core terms redirected to the Darwin Core Quick Reference page and that behavior has been maintained in the test system. Here's an example:

http://rs.tdwg.org/dwc/terms/institutionCode

redirects to

https://dwc.tdwg.org/terms/#dwc:institutionCode

The same is true with Audubon Core terms, whose IRIs redirect to an appropriate place on the Audubon Core Term List document. The URLs of both the Audubon Core Term List page and Darwin Core Quick Reference page have changed recently, reinforcing the importance of citing the actual term IRIs rather than the redirected URLs.

TDWG Standards Documentation Specification version model (from Section 2.3)

Versions

Taking cues from Dublin Core and the W3C, the SDS describes a version model that can be used to track versions of resources associated with TDWG standards. For example, dereferencing the Darwin Core vocabulary IRI http://rs.tdwg.org/dwc/ shows that there are 19 versions: 18 previous version and a most recent version that corresponds to the current Darwin Core vocabulary.

For vocabularies and term lists, the version IRIs are constructed by appending an ISO 8601 date after the final slash and inserting "version/" before the terminal string. For example, the current Darwin Core vocabulary IRI is http://rs.tdwg.org/dwc/ and a version of the Darwin Core vocabulary is http://rs.tdwg.org/version/dwc/2015-03-27 . The current Darwin Core IRI-value term list IRI is http://rs.tdwg.org/dwc/iri/ and a version of it is http://rs.tdwg.org/dwc/version/iri/2015-03-27 . (Although it wouldn't be necessary to include the characters "version/" in the version IRI, doing so makes pattern recognition for those IRIs much simpler.)

Following the precedent already set for Darwin Core, term version IRIs are formed by appending an ISO 8601 date with a dash. Again "version/" is inserted ahead of the local name to make IRI pattern recognition easier. For example, the term IRI http://rs.tdwg.org/dwc/terms/establishmentMeans has a version http://rs.tdwg.org/dwc/terms/version/establishmentMeans-2009-04-24

For documents, the version IRI is formed by simply appending the ISO 8601 date after the trailing slash. (In the case of documents, IRI pattern recognition is less critical since there aren't hierarchical levels below the level of the document. So "version/" isn't inserted in the version IRI.) For example, the document http://rs.tdwg.org/sds/doc/specification/ has a version http://rs.tdwg.org/sds/doc/specification/2007-11-05 .

In the case of non-document resources, resolution of version IRIs is fully implemented, since human-readable pages can be constructed programmatically for those resources using data from the metadata database. However, since the human-readable versions of standards documents are generally created manually and have idiosyncratic redirection IRIs, version IRI resolution is currently only partially implemented. In the case of many standards documents, the location of previous versions is not known or they are not yet available online. So for now, one can't explore older versions of standards documents in the same way one can explore older versions of vocabularies, term lists, and terms.

Summary

I've implemented a system of IRIs that are consistent with the SDS and past practice of Darwin and Audubon Cores. Although the patterns I established aren't the only possible ones, they work well for facilitating pattern matching by a server that generates many of the documents programmatically, so I feel that the pattern system is sound.

Here are some starting points for exploration:

Audubon Core basic vocabulary:

http://rs.tdwg.org/ac/

Darwin Core basic vocabulary:

http://rs.tdwg.org/dwc/

From these two vocabulary pages you can surf to term lists, terms, and older versions of all of the resources.

Terms borrowed by Audubon Core from the IPTC Photo Metadata Extension:
http://rs.tdwg.org/ac/Iptc4xmpExt/

The October 16, 2011 version of the Darwin Core vocabulary:
http://rs.tdwg.org/version/dwc/2011-10-16

The April 24, 2009 version of the list of core Darwin Core terms:
http://rs.tdwg.org/dwc/version/terms/2009-04-24

The September 11, 2009 version of Basis of Record:
http://rs.tdwg.org/dwc/terms/version/basisOfRecord-2009-09-11

A deprecated Darwin Core term list:
http://rs.tdwg.org/dwc/curatorial/

A deprecated Darwin Core term:
http://rs.tdwg.org/dwc/dwctype/MachineObservation

Here are some examples of document IRIs that redirect:

http://rs.tdwg.org/ac/doc/introduction/

http://rs.tdwg.org/tapir/doc/xmlschema/

http://rs.tdwg.org/apn/doc/data/

In the next post, I'll describe how the system I've implemented allows retrieval of machine-readable metadata.