Saturday, March 11, 2017

Controlled values (again)

The connection between The Darwin Core Hour and the TDWG Standards Documentation Specification

I've included the word "again" in the title of this blog post because I wrote a series of blog posts [1] about a year ago exploring issues related to thesauri, ontologies, controlled vocabularies, and SKOS.  Those posts were of a somewhat technical nature since I was exploring possible ways to represent controlled vocabularies as RDF.  However, there has been a confluence of two events that have inspired me to write a less technical blog post on the subject of controlled vocabularies for the general TDWG audience.

One is the genesis of the excellent Darwin Core Hour webinar series.  I encourage you to participate in them if you want to learn more about Darwin Core. The previous webinars have been recorded and can be viewed online.  The most recent webinar, "Even Simple is Hard", presented by John Wieczorek on March 7, provided a nice introduction to issues related to controlled vocabularies, and the next one on April 4, "Thousands of shades for 'Controlled' Vocabularies", presented by Paula Zermoglio, will deal with the specifics of controlled vocabularies.

The other thing that's going on is that we are in the midst of the public comment period for the draft TDWG Standards Documentation Specification (SDS), of which I'm the lead author.  At the TDWG Annual Meeting in December, I led a session to inform people about the SDS and its sister standard, the TDWG Vocabulary Management Specification.  At that session, the topic of controlled vocabularies came up.  I made some statements explaining the way that the SDS specifies that controlled vocabularies will be described in machine-readable form.  What I said seemed to take some people by surprise, and although I provided a brief explanation, there wasn't enough time to have an in-depth discussion.  I hoped that the topic would come up during the SDS public comment period, but so far it has not.  Given the current interest in constructing controlled vocabularies, I hope that this blog post will either generate some discussion, or satisfy people's curiosity about how the SDS deals with machine-readable controlled vocabularies.

Definitions

It is probably best to start off by providing some definitions.  It turns out that there is actually an international standard that deals with controlled vocabularies.  It is ISO 25964: "Thesauri and interoperability with other vocabularies".  Unfortunately, that standard is hidden behind a paywall and is ridiculously expensive to buy.  As part of my work on the SDS, I obtained a copy of ISO 25964 by Interlibrary Loan.  I had to return that copy, but I took some notes that are on the VOCAB Task Group's GitHub site.  I encourage you to refer to those notes for more details about what I'm only going to briefly describe here.

Controlled vocabularies and thesauri

ISO 25964 defines a controlled vocabulary as a
prescribed list of terms, headings or codes, each representing a concept. NOTE: Controlled vocabularies are designed for applications in which it is useful to identify each concept with one consistent label, for example when classifying documents, indexing them and/or searching them. Thesauri, subject heading schemes and name authority lists are examples of controlled vocabularies.
It also defines a form of controlled vocabulary, which is the major subject of the standard: a thesaurus.  A thesaurus is a
controlled and structured vocabulary in which concepts are represented by terms, organized so that relationships between concepts are made explicit, and preferred terms are accompanied by lead-in entries for synonyms or quasi-synonyms. NOTE: The purpose of a thesaurus is to guide both the indexer and the searcher to select the same preferred term or combination of preferred terms to represent a given subject. For this reason a thesaurus is optimized for human navigability and terminological coverage of a domain.  [my emphasis]
If you participated in or listened to the Darwin Core Hour "Even Simple is Hard", you can see the close relationship between the way "controlled vocabulary" was used in that seminar and the definition of thesaurus given here.  When submitting metadata about an occurrence to an aggregator, we want to use the same controlled value term in our metadata as will be used by those who may be searching for our metadata in the future.  Referring to an example given in the webinar, if we (the "indexers") provide "PreservedSpecimen" as a value in metadata in our spreadsheet, others in the future who are searching (the "searchers") for occurrences documented by preserved specimens can search for "PreservedSpecimen", and find our record.  That won't happen if we use a value of "herbarium sheet".  Figuring out how to get indexers and searchers to use the same terms is the job of a thesaurus.

A thesaurus is also designed to capture relationships between controlled value terms, such as "broader" and "narrower".  A searcher who knows about preserved specimens but wants records documented by any kind of physical thing (including living specimens, fossils, material samples as well) can be directed to a broader term that encompasses all kinds of physical things, e.g. "PhysicalObject".

So although in Darwin Core (and in TDWG in general) we tend to talk about "controlled vocabularies", I would assert that we are, in fact, talking about thesauri as defined by ISO 25964.

Strings and URIs

If you have spent any time pondering TDWG vocabularies, you probably have noticed that all kinds of TDWG vocabulary terms (classes and properties) are named using Uniform Resource Identifiers, or URIs.  Because the URIs used in TDWG term names begin with "http://", people get the mistaken impression that URIs always make something "happen".  We are used to seeing Web URLs that start with "http://", and have come to believe that if we put a URI that starts with "http://" in a browser, we will get a web page.  However, there are some terms in TDWG vocabularies that will probably never "do" anything in a browser.  For example, Audubon Core co-opts the term http://ns.adobe.com/exif/1.0/PixelXDimension as a property whose value gives the number of pixels in an image in the X dimension.  You can try putting that term URI in a browser and prove to yourself that nothing useful happens.  So we need to get over the idea that URIs must "do" something (they might or might not), and get used to the idea that their primary purpose is to serve as a globally unique name that conforms to some particular structural rules [2].

You can see the value of using URIs over plain text strings if you consider the Darwin Core term "class".  When we use "class" in the context of Darwin Core, we intend its Darwin Core definition: "The full scientific name of the class in which the taxon is classified."  However, "Class" has a different meaning in a different international standard, the W3C's RDF Schema 1.1 (RDFS).  In that context, "Class" means "The class of resources that are RDF classes."  There may be many other meanings of "class" in other fields.  It can mean different things in education, in computer programming, and in sociology.  We can tell people exactly what we intend by the use of a term if we identify it with a URI rather than a plain text string.  So for example, if we use the term http://rs.tdwg.org/dwc/terms/class, people will know that we mean "class" in the Darwin Core sense, but if we use the term http://www.w3.org/2000/01/rdf-schema#Class, people will know that we mean "class" in the RDFS sense.

Clearly, it is a pain in the neck to write out a long and messy URI.  For convenience, there is an abbreviated form for URIs called "compact URIs" or CURIEs.  When we use a CURIE, we define an abbreviation for part of the URI (commonly called the "namespace" abbreviation).  So for example, we could declare the namespace abbreviations:

dwc: = http://rs.tdwg.org/dwc/terms/
rdfs: = http://www.w3.org/2000/01/rdf-schema#

and then abbreviate the URI by replacing the namespace with the abbreviation to form a CURIE.  With the defined abbreviations above, we could say dwc:class when we intend "class" in the Darwin Core sense and rdfs:Class when we intend "class" in the RDFS sense.  This is much shorter than writing out the full URI, and if the last part of the CURIE after the namespace (known as the "local name") is formed from a natural language string (possibly in camelCase), it's easy for a native speaker of that natural language to "read" the CURIE as part of a sentence.
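For example, in the Turtle serialization of RDF, the two abbreviations above would be declared with @prefix statements.  Here is a minimal sketch (the label triple at the end is just illustrative):

@prefix dwc: <http://rs.tdwg.org/dwc/terms/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# dwc:class now expands to http://rs.tdwg.org/dwc/terms/class,
# while rdfs:Class expands to http://www.w3.org/2000/01/rdf-schema#Class
dwc:class rdfs:label "Class"@en .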

It is not a requirement that the local name be a natural language string.  Some vocabularies prefer to have opaque identifiers, particularly when there is no assumed primary language.  So for example, the URI http://vocab.getty.edu/tgn/1000111, which is commonly abbreviated by the CURIE tgn:1000111, denotes the country China, which may have many names in natural language strings of various languages.

What's the dwciri: namespace for?

Those who are familiar with using Darwin Core in spreadsheets and relational databases are familiar with the "regular" Darwin Core terms in the "dwc:" namespace (http://rs.tdwg.org/dwc/terms/).  However, most are NOT familiar with the namespace http://rs.tdwg.org/dwc/iri/, commonly abbreviated dwciri: .  This namespace was created as a result of the adoption of the Darwin Core RDF Guide, and most people who don't care about RDF have probably ignored it.  However, I'd like to bring it up in this context because it can play an important role in disambiguation.

Here is a typical task.  You have a spreadsheet record whose dwc:county value says "Robertson".  You know that it's a third-level political subdivision because that's what dwc:county records.  However, there are several third-level political subdivisions named "Robertson" in the United States alone, and there probably are some in other countries as well.  So there is work to be done in disambiguating this value.  You'll probably need to check the dwc:stateProvince and dwc:country or dwc:countryCode values, too.  Of course, there may also be other records whose dwc:county values are "Robertson County" or "Comté de Robertson" or "Comte de Robertson" that are probably from the same third-level political subdivision as your record.  Once you've gone to the trouble of figuring out that the record is in Robertson County, Tennessee, USA, you (or other data aggregators) really should never have to go through that effort again.

There are two standardized controlled vocabularies that have created URI identifiers for geographic places: GeoNames and the Getty Thesaurus of Geographic Names (TGN).  There are reasons (beyond the scope of this blog post) why one might prefer one of these vocabularies over the other, but either one provides an unambiguous, globally unique URI for Robertson County, Tennessee, USA: http://sws.geonames.org/4653638/ from GeoNames and http://vocab.getty.edu/tgn/2001910 from the TGN.  The RDF Guide makes it clear that the value of every dwciri: term must be a URI, while the values of many dwc: terms may be a variety of text strings, including human-readable names, URIs, abbreviations, etc.  With a dwc: term, a user probably will not know whether disambiguation needs to be done, while with a dwciri: term, a user knows exactly what the value denotes, since a URI is a globally unique name.

In RDF, we could say that a particular location was in Robertson County, Tennessee, USA like this:

my:location dwciri:inDescribedPlace tgn:2001910.

However, there is also no rule that says you couldn't have a spreadsheet with a column header of "dwciri:inDescribedPlace", as long as the cells below it contain URI values.  So dwciri: terms could be used in non-RDF data representations as well as in RDF.

If you look at the table in Section 3.7 of the Darwin Core RDF Guide, you will see that there are dwciri: analogs for every dwc: term where we thought it made sense to use a URI as a value.[3]  In many cases, those were terms where Darwin Core recommended use of a controlled vocabulary.  Thus, once providers or aggregators went to the trouble to clean their data and determine the correct controlled values for a dwc: property, they and everybody else in the future could be done with that job forever if they recorded the results as a URI value from a controlled vocabulary for a dwciri: Darwin Core property.
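As a minimal Turtle sketch of what that might look like for the Robertson County example, a provider could record both the verbatim string and the cleaned-up URI value (my:location is a made-up identifier for illustration; the other namespaces are as described above):

@prefix dwc: <http://rs.tdwg.org/dwc/terms/> .
@prefix dwciri: <http://rs.tdwg.org/dwc/iri/> .
@prefix tgn: <http://vocab.getty.edu/tgn/> .
@prefix my: <http://example.org/> .

my:location dwc:county "Robertson" ;                # the verbatim string, still ambiguous
            dwciri:inDescribedPlace tgn:2001910 .   # unambiguous: Robertson County, Tennessee, USA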


The crux of the issue

OK, after a considerable amount of background, I need to get to the main point of the post.  John's "Even Simple is Hard" talk was directed to the vast majority of Darwin Core users: those who generate or consume data as Simple Darwin Core (spreadsheets), or who output or consume Darwin Core Archives (fielded text tables linked in a "star schema").  In both of those cases, the tables or spreadsheets will most likely be populated with plain text strings.  There may be a few users who have made the effort to find unambiguous URI values instead of plain strings, and hopefully that number will go up in the future as more URIs are minted for controlled vocabulary terms.  However, in the current context, when John talks about populating the values of Darwin Core properties with "controlled values", I think that he probably means using a single, consensus Unicode string that denotes the concept underlying that controlled value.

Preferred Unicode strings

I am reminded of a blog post that John wrote in 2013 called "Data Diversity of the Week: Sex" in which he describes some of the 189 distinct values used in VertNet to denote the concept of "maleness".  We could all agree that the appropriate Unicode string to denote maleness in a spreadsheet should be the four characters "male".  Anyone who cleans data and encounters values for dwc:sex like "M", "m", "Male", "MALE", "macho", "masculino", etc. etc. could substitute the string "male" in that field.  There would, of course, be the problem of losing the verbatim value if a substitution were made.

I suspect that most TDWG'ers would consider the task of developing a controlled vocabulary for dwc:sex to involve sitting a bunch of users and aggregators down at a big controlled vocabulary conference, and coming to some consensus about the particular strings that we should all use to denote maleness, femaleness, and all other flavors of gender that we find in the natural world.

I don't want to burst anybody's bubble, but as it currently stands, that's not how the draft Standards Documentation Specification would work with respect to controlled vocabularies.  A TDWG controlled vocabulary would be more than a list of acceptable strings.  It would have all of the same features that other TDWG vocabularies have.

SDS: URIs

For one thing, each term in a controlled vocabulary would be identified by a URI.  That is already current practice in TDWG vocabularies and in Dublin Core.  The SDS does not specify whether the URIs should use "English-friendly" local names or opaque numbers for local names.  Either would be fine.  For illustration purposes, I'll pick opaque numbers.  Let's use "12345" as the local name for "maleness".  The SDS is also silent about namespace construction.  One could do something like

dwcsex: = "http://rs.tdwg.org/dwc/cv/sex/"

for the namespace.  Then the URI for maleness would be

http://rs.tdwg.org/dwc/cv/sex/12345

as a full URI or

dwcsex:12345

as a CURIE.  Anybody who wants to unambiguously refer to maleness can use the URI dwcsex:12345 regardless of whether they are using a spreadsheet, relational database, or RDF.
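For instance, here is a minimal Turtle sketch of using that URI as the value of the dwciri: analog of dwc:sex (my:organism is a made-up identifier, and dwcsex: is the hypothetical namespace declared above):

@prefix dwciri: <http://rs.tdwg.org/dwc/iri/> .
@prefix dwcsex: <http://rs.tdwg.org/dwc/cv/sex/> .
@prefix my: <http://example.org/> .

my:organism dwciri:sex dwcsex:12345 .   # "male", unambiguously and in no particular language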

SDS: Machine-readable stuff

In John's talk, he mentioned the promise of using "semantics" to help automate the process of data cleaning.  A critical feature of the SDS is that in addition to specifying how human-readable documents should be laid out, it also specifies how metadata should be expressed in order to make those data machine readable.  The examples are given as RDF/Turtle, but the SDS makes it clear that it is agnostic about how machine-readability should be achieved.  RDF-haters are welcome to use JSON-LD.  HTML-lovers are welcome to use RDFa embedded in web page markup.  Or better yet, provide the machine-readable data in all of the above formats and let users choose.  The main requirement is that regardless of the chosen serialization, every machine-readable representation must "say" the same thing, i.e. must use the same properties specified in the SDS to describe the metadata.  So the SDS is clear about what properties should be used to describe each aspect of the metadata.

In the case of controlled vocabulary terms, several designated properties are the same as those used in other TDWG vocabularies.  For example, rdfs:comment is used to provide the English term definition and rdfs:label is used to indicate a human-readable label for the term in English.  The specification does a special thing to accommodate our community's idiosyncrasy of relying on a particular Unicode string to denote a controlled vocabulary term.  That Unicode string is designated as the value of rdf:value, a property that is well-known, but doesn't have a very specific meaning and could be used in this way.  It's possible that the particular consensus string might be the same as the label, but it wouldn't have to be.  For example, we could say this:

dwcsex:12345 rdfs:label "Male"@en;
             rdfs:comment "the male gender"@en;
             rdf:value "male".

In this setup, the label to be presented to humans starts with a capital letter, while the consensus string value denoting the term doesn't.  In other cases, the human readable label might contain several words with spaces between, while the consensus string value might be in camelCase with no spaces.  The label probably should be language-tagged, while the consensus string value is a plain literal.

Dealing with multiple labels for a controlled value: SKOS

As it currently stands, the SDS says that the normative definition and label for terms should be in English.  Thus the controlled value vocabulary document (both human- and machine-readable) will contain English only.  Given the international nature of TDWG, it would be desirable to also make documents, term labels, and definitions available in as many languages as possible.  However, it should not require invoking the change process outlined in the Vocabulary Management Specification every time a new translation is added, or if the wording in a non-English language version changes.  So the SDS assumes that there will be ancillary documents (human- and machine-readable) in which the content is translated to other languages, and that those ancillary documents will be associated with the actual standards documents.

This is where the Simple Knowledge Organization System (SKOS) comes in.  SKOS was developed as a W3C Recommendation in parallel with the development of ISO 25964.  SKOS is a vocabulary that was specifically designed to facilitate the development of thesauri.  Given that I've made the case that what TDWG calls "controlled vocabularies" are actually thesauri, SKOS has many of the terms we need to describe our controlled value terms.

An important SKOS term is skos:prefLabel (preferred label).  skos:prefLabel is actually defined as a subproperty of rdfs:label, so any value that is a skos:prefLabel is also a generic label.  However, the "rules" of SKOS say that you should never have more than one skos:prefLabel value for a given resource, in a given language.  Thus, there is only one skos:prefLabel value for English, but there can be other skos:prefLabel values for Spanish, German, Chinese, etc.

SKOS also provides the term skos:altLabel.  skos:altLabel is used to specify other labels that people might use, but that aren't really the "best" one.  There can be an unlimited number of skos:altLabel values in a given language for a given controlled vocabulary term.  There is also a property skos:hiddenLabel.  The values of skos:hiddenLabel are "bad" values that you know people use, but you really wouldn't want to suggest as possibilities (for example, misspellings).

SKOS has a particular term that indicates that a value is a definition: skos:definition. That has a more specific meaning than rdfs:comment, which could really be any kind of comment.  So using it in addition to rdfs:comment is a good idea.

So here is how the description of our "maleness" term would look in machine-readable form (serialized as human-friendly RDF/Turtle):

Within the standards document itself:

dwcsex:12345 rdfs:label "Male"@en;
             skos:prefLabel "Male"@en;
             rdfs:comment "the male gender"@en;
             skos:definition "the male gender"@en;
             rdf:value "male".


In an ancillary document that is outside the standard:

dwcsex:12345 skos:prefLabel "Masculino"@es;
             skos:altLabel "Macho"@es;
             skos:altLabel "macho"@es;
             skos:altLabel "masculino"@es;
             skos:altLabel "male"@en;
             skos:prefLabel "男"@zh-hans;
             skos:prefLabel "男"@zh-hant;
             skos:prefLabel "männlich"@de;
             skos:altLabel "M";
             skos:altLabel "M.";
             skos:hiddenLabel "M(ale)";
etc. etc.

In the ancillary document, one would attempt to include as many as possible of the 189 values for "male" that John mentioned in his blog post.  Having this diversity of labels available makes two things possible.  One is to automatically generate pick lists in any language.  If the user selects German as the preferred language, the pick list presents the German preferred label "männlich" to the user, but the value selected is actually recorded by the application as the language-independent URI dwcsex:12345.  Although I didn't show it, the ancillary document could also contain definitions in multiple languages to clarify things for international users in the event that viewing the label itself is not enough.
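Here is a sketch of the sort of SPARQL query an application could use to build that German pick list, assuming the controlled vocabulary terms are typed as skos:Concept as the SDS specifies:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?term ?label
WHERE {
  ?term a skos:Concept ;          # each controlled vocabulary term
        skos:prefLabel ?label .   # its one preferred label per language
  FILTER(lang(?label) = "de")     # keep only the German labels
}

The application would display ?label in the dropdown, but record ?term as the selected value.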

The many additional labels in the ancillary document also facilitate data cleaning.  For example, if GBIF has a million horrible spreadsheets to try to clean up, they could simply do string matching against the various label values without regard to the language tags and type of label (pref vs. alt vs. hidden).  Because the ancillary document is not part of the standard itself, the laundry list of possible labels can be extended at will every time a new possible value is discovered.
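A minimal sketch of that kind of matching as a SPARQL query, where the literal "MALE" stands in for whatever messy value is being cleaned:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?term
WHERE {
  # match against preferred, alternative, and hidden labels alike
  ?term skos:prefLabel|skos:altLabel|skos:hiddenLabel ?label .
  # str() drops the language tag (if any); lcase() makes the match case-insensitive
  FILTER(lcase(str(?label)) = lcase("MALE"))
}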

Making the data available

It is TOTALLY within the capabilities of TDWG to provide machine-readable data of this sort, and if the SDS is ratified, that's what we will be doing.  Setting up a SPARQL endpoint to deliver the machine-readable metadata is not hard.  For those who are RDF-phobic, a machine-readable version of the controlled vocabulary can be available through the API as JSON-LD, which provides exactly the same information as the RDF/Turtle above and would look like this:

{
  "@context": {
    "dwcsex": "http://rs.tdwg.org/dwc/cv/sex/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "xsd": "http://www.w3.org/2001/XMLSchema#"
  },
  "@id": "dwcsex:12345",
  "rdf:value": "male",
  "rdfs:comment": {
    "@language": "en",
    "@value": "the male gender"
  },
  "rdfs:label": {
    "@language": "en",
    "@value": "Male"
  },
  "skos:altLabel": [
    {
      "@language": "es",
      "@value": "macho"
    },
    "M",
    {
      "@language": "en",
      "@value": "male"
    },
    "M.",
    {
      "@language": "es",
      "@value": "masculino"
    },
    {
      "@language": "es",
      "@value": "Macho"
    }
  ],
  "skos:definition": {
    "@language": "en",
    "@value": "the male gender"
  },
  "skos:hiddenLabel": "M(ale)",
  "skos:prefLabel": [
    {
      "@language": "es",
      "@value": "Masculino"
    },
    {
      "@language": "zh-hant",
      "@value": "男"
    },
    {
      "@language": "en",
      "@value": "Male"
    },
    {
      "@language": "zh-hans",
      "@value": "男"
    },
    {
      "@language": "de",
      "@value": "männlich"
    }
  ]
}

People could write their own data-cleaning apps to consume this JSON description of the controlled vocabulary and never even have to think about RDF.  

"Semantics", SKOS concept schemes, and Ontologies

Up to this point, I've been dodging an issue that will concern some readers and which other readers won't care about one whit.  If you are a casual reader and don't care about the fine points of machine-readable data and ontologies, you can just stop reading here.  

The SDS says that controlled vocabulary terms should be typed as skos:Concept, as opposed to rdfs:Class.  That prescription has the implication that the controlled vocabulary will be a SKOS concept scheme rather than an ontology.  This was what freaked people out at the TDWG meeting, because there is a significant constituency of TDWG whose first inclination when faced with a machine-data problem is to construct an OWL ontology.  At the meeting, I made the statement that a SKOS concept scheme is a screwdriver and an OWL ontology is a hammer.  Neither a screwdriver nor a hammer is intrinsically a better tool.  You use a screwdriver when you want to put in screws and you use a hammer when you want to pound nails.  So in order to decide which tool is right, we need to be clear about what we are trying to accomplish with the controlled vocabulary.  

ISO 25964 provides the following commentary about thesauri and ontologies in section 21.2:
Whereas the role of most of the vocabularies ... is to guide the selection of search/indexing terms, or the browsing of organized document collections, the purpose of ontologies in the context of retrieval is different. Ontologies are not designed for information retrieval by index terms or class notation, but for making assertions about individuals, e.g. about real persons or abstract things such as a process. 
and in section 22.3:
One key difference is that, unlike thesauri, ontologies necessarily distinguish between classes and individuals, in order to enable reasoning and inferencing. ... The concepts of a thesaurus and the classes of an ontology represent meaning in two fundamentally different ways. Thesauri express the meaning of a concept through terms, supported by adjuncts such as a hierarchy, associated concepts, qualifiers, scope notes and/or a precise definition, all directed mainly to human users. Ontologies, in contrast, convey the meaning of classes through machine-readable membership conditions. ... The instance relationship used in some thesauri approximates to the class assertion used in ontologies. Likewise, the generic hierarchical relationship ... corresponds to the subclass axiom in ontologies. However, in practice few thesauri make the distinction between generic, whole-part and instance relationships. The undifferentiated hierarchical relationship most commonly found in thesauri is inadequate for the reasoning functions of ontologies. Similarly the associative relationship is unsuited to an ontology, because it is used in a multitude of different situations and therefore is not semantically precise enough to enable inferencing.
In layman's terms, there are two key differences between thesauri and ontologies.  The primary purpose of a thesaurus is to guide a human user to pick the right term for categorizing a resource.  The primary purpose of an ontology is to allow a machine to do automated reasoning about classes and instances of things.  The second difference is that reasoning and inferencing are, in a sense, "automatic" for an ontology, whereas the choice to make use of hierarchical relationships in a thesaurus is optional and controlled by a user.  

Let's apply these ideas to our dwc:sex example.  We could take the ontology approach and say that 

dwcsex:12345 rdf:type rdfs:Class.

We could then define another class in our gender ontology:

dwcsex:12347 rdf:type rdfs:Class;
             rdfs:label "Animal gender".

and assert

dwcsex:12345 rdfs:subClassOf dwcsex:12347.

This assertion is the sort that John described in his webinar: an "is_a" hierarchical relationship.  We could represent it in words as:

"Male" is_a "Animal gender".

As data providers, we don't have to "do anything" to assert this fact, or decide in particular cases whether we like the fact or not.  Anything that has a dwc:sex value of "male" will automatically have a dwc:sex value of "Animal gender" because that fact is entailed by the ontology.  

Alternatively, we could take the thesaurus approach and say that 

dwcsex:12345 rdf:type skos:Concept.

We could then define another concept in our gender thesaurus:

dwcsex:12347 rdf:type skos:Concept;
             rdfs:label "genders that animals can have".

and assert

dwcsex:12345 skos:broader dwcsex:12347.

As in the ontology example, this assertion also describes a hierarchical relationship ("has_broader_category").  We could represent it in words as:

"Male" has_the_broader_category "genders that animals can have".

In this case, nothing is entailed automatically.  If we assert that some thing has a dwc:sex value of "male", that's all we know.  However, if a human user is using a SKOS-aware application, the application could interact with the user and say "Hey, not finding what you want?  I could show you some other genders that animals can have." and then find other controlled vocabulary terms that have dwcsex:12347 as a broader concept.  It would also be no problem to assert this:

dwcsex:12348 rdf:type skos:Concept;
             rdfs:label "genders that parts of plants can have".
dwcsex:12345 skos:broader dwcsex:12348.

We aren't doing anything "bad" by somehow entailing that males are both plants and animals.  We are just saying that "male" is a gender that can fall into several broader categories as part of a concept scheme: "genders that animals can have" and "genders that parts of plants can have".  This is what was meant by "the undifferentiated hierarchical relationship most commonly found in thesauri is inadequate for the reasoning functions..." in the text of ISO 25964.  The hierarchical relationships of thesauri can guide human categorizers and searchers, but they don't automatically entail additional facts.
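To make the "guiding" role concrete, here is a sketch of the query a SKOS-aware application might run to suggest the other genders that animals can have (dwcsex: is the hypothetical namespace used throughout this post):

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dwcsex: <http://rs.tdwg.org/dwc/cv/sex/>

SELECT ?otherTerm ?label
WHERE {
  ?otherTerm skos:broader dwcsex:12347 ;   # concepts under "genders that animals can have"
             skos:prefLabel ?label .
  FILTER(lang(?label) = "en")
}

Nothing is asserted about the things being described; the query simply offers the user alternative concepts to consider.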

Screwdriver or hammer?

Now that I've explained a little about the differences between thesauri and ontologies, which one is the right tool for the controlled vocabularies?  There is no definite answer to this, but the common practice for most controlled vocabularies seems to be the thesaurus approach.  That's the approach used by all of the Getty Thesauri, and also the approach used by the Library of Congress for the controlled vocabularies it defines.  In the case of both of these providers, the machine-readable forms of their controlled vocabularies are expressed as SKOS concept schemes, not ontologies.  

That is not to say that the controlled vocabularies for all Darwin Core terms that currently say "best practice is to use a controlled vocabulary" should be defined as SKOS concept schemes.  In particular, the two vocabularies that John spoke about at length in his webcast (those for dwc:basisOfRecord and dcterms:type) should probably be defined as ontologies.  That's because both are ways of describing the kind of thing something is, and that's precisely the purpose of an rdfs:Class.  One could create an ontology that asserts:

dwc:PreservedSpecimen rdfs:subClassOf dctype:PhysicalObject.

and all preserved specimens could automatically be reasoned to be physical objects whether a data provider said so or not [4].  In other words, machines that are "aware" of that ontology would "know" that

"preserved specimen" is_a "physical object".

without any effort on the part of data providers or aggregators.
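For example, even without a full reasoner, a SPARQL 1.1 property path can exploit that subclass assertion.  This sketch would return anything typed as dctype:PhysicalObject or as any of its subclasses, including preserved specimens:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dctype: <http://purl.org/dc/dcmitype/>

SELECT ?thing
WHERE {
  # follow rdf:type, then zero or more rdfs:subClassOf links up the class hierarchy
  ?thing rdf:type/rdfs:subClassOf* dctype:PhysicalObject .
}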

But both dcterms:type and dwc:basisOfRecord are really ambiguous terms that are a concession to the fact that we try to cram information about all kinds of resources into a single row of a spreadsheet.  The ambiguity about what dwc:basisOfRecord actually means is the reason why the RDF guide says to use rdf:type instead [5].  There is no prohibition against having as many values of rdf:type as is useful.  You can assert:

my:specimen rdf:type dwc:PreservedSpecimen;
            rdf:type dctype:PhysicalObject;
            rdf:type dcterms:PhysicalResource.

with no problem, other than it won't fit in a single cell in a spreadsheet!

So what if we decide that "controlled values" for a term should be ontology classes?

The draft Standards Documentation Specification says that controlled value terms will be typed as skos:Concepts.  What if there is a case where it would be better for the "controlled values" to be classes from an ontology?  There is a simple answer to that.  Change the Darwin Core term definition to say "Best practice is to use a class from a well-known ontology." instead of saying "Best practice is to use a controlled vocabulary."  That language would be in keeping with the definitions and descriptions of controlled vocabularies and ontologies given in ISO 25964.  Problem solved.

I should note that it is quite possible to use all of the SKOS label-related properties (skos:prefLabel, skos:altLabel, skos:hiddenLabel) with any kind of resource, not just SKOS concepts.  So if it were decided in a particular case that it would be better for "controlled values" to be defined as classes in an ontology rather than as concepts in a thesaurus, one could still use the multi-lingual strategy described earlier in the post.
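For example, one could attach labels to an ontology class like dwc:PreservedSpecimen in exactly the way described earlier.  A sketch (the particular label strings are just illustrative):

@prefix dwc: <http://rs.tdwg.org/dwc/terms/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

# dwc:PreservedSpecimen is a class rather than a skos:Concept, but the labels still apply
dwc:PreservedSpecimen skos:prefLabel "Preserved specimen"@en ;
                      skos:altLabel "herbarium sheet"@en .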

Also, there is no particular type for the values of dwciri: terms.  The only requirement is that the value must be a URI rather than a string literal.  It would be fine for that URI to denote either a class or a concept.

So one of the tasks of a group creating a "controlled vocabulary" would be to define the use cases to be satisfied, and then decide whether those use cases would be best satisfied by a thesaurus or an ontology.

Feedback!  Feedback! Feedback!

If something in this post has pushed your button, then respond by making a comment about the Standards Documentation Specification before the 30 day public comment period ends on or around March 27, 2017.  There are directions from the Review Manager, Dag Endresen, on how to comment at http://lists.tdwg.org/pipermail/tdwg-content/2017-February/003690.html .  You can email anonymous comments directly to him, but I don't think any members of the Task Group will get their feelings hurt by criticisms, so an open dialog on the issue tracker or tdwg-content would be even better.

Footnotes

[1] Blog entries from my blog http://baskauf.blogspot.com/
March 14, 2016 "Ontologies, thesauri, and SKOS"
March 21, 2016 "Controlled values for Subject Category from Audubon Core"
April 1, 2016 "Controlled values for Establishment Means from Darwin Core"
April 4, 2016 "Controlled values for Country from Darwin Core"

[2] The IETF RFC 3986 defines the syntax of URIs.  A superset of URIs, known as Internationalized Resource Identifiers or IRIs is now commonly used in place of URIs in many standards documents.  For the purpose of this blog post, I'll consider them interchangeable.

[3] There are also terms, like dwciri:inDescribedPlace that are related to an entire set of Darwin Core terms ("convenience terms").  Talking about those terms is beyond the scope of this blog post, but for further reading, see the explanation in Section 2.7 of the RDF Guide.

[4] Cautionary quibble: in John's diagram, he asserted that dwc:LivingSpecimen is_a dctype:PhysicalObject.  However, the definition of dctype:PhysicalObject is "An inanimate, three-dimensional object or substance.", which would not apply to many living specimens.  A better alternative would be the class dcterms:PhysicalResource, which is defined as "a material thing".  However, dcterms:PhysicalResource is not included in the DCMI type vocabulary - it's in the generic Dublin Core vocabulary.  That's a problem if the type vocabulary is designated as the only valid controlled vocabulary to be used for dcterms:type.

[5] See Section 2.3.1.4 of the RDF Guide for details.  DCMI also recommends using rdf:type over dcterms:type in RDF applications.  A key problem that our community has not dealt with is clarifying whether we are talking about a resource, or the database record about that resource.  We continually confuse these two things in spreadsheets, and most of the time we don't care.  However, the difference becomes critical if we are talking about modification dates or licenses.  A spreadsheet can have a column for "license", but is the value describing a license for the database record, or for the image described in that record?  A spreadsheet can have a value for "modified", but does that mean the date that the record was last modified or the date that the herbarium sheet described by the record was last modified?  With respect to dwc:basisOfRecord, its local name ("basis of record") implies that the value is the type of the resource upon which a record is based, which implies that the metadata in the spreadsheet row is about something else.  That "something else" is probably an occurrence.  So we conflate "occurrence" with "preserved specimen" by talking about both of them in the same spreadsheet row.  According to its definition, dwc:Occurrence is "An existence of an Organism (sensu http://rs.tdwg.org/dwc/terms/Organism) at a particular place at a particular time." while a preserved specimen is a form of evidence for an occurrence, not a subclass of occurrence.  That distinction doesn't seem important to most people who use spreadsheets, but it is important if an occurrence is documented by multiple forms of evidence (specimens, images, DNA samples) - something that is becoming increasingly common.  What we should have is an rdf:type value for the occurrence, and a separate rdf:type value for the item of evidence (possibly more than one).  But spreadsheets aren't complex enough to handle that, so we limp along with dwc:basisOfRecord.

Monday, February 13, 2017

SPARQL: the weirdness of unnamed graphs

This post is of a rather technical nature and is directed at people who are serious about setting up and experimenting with the management of SPARQL endpoints.  If you want to get an abbreviated view, you could probably read down through the "Graphs: named and otherwise" section, then skip down to "Names of unnamed graphs (review)" and read through the end.

Background

This semester, the focus of our Linked Data and Semantic Web working group [1] at Vanderbilt has been to try to move from Linked Data in theory to Linked Data in reality.  Our main project has been to document the process of moving Veronica Ikeshoji-Orlati's dataset on ancient vases from a spreadsheet to a Linked Data dataset.  We have continued to make progress on moving Tracy Miller's data on images of Chinese temple architecture from a test RDF graph to production.

One of the remaining problems we have been facing is setting up a final production triplestore and SPARQL endpoint to host the RDF data that we are producing.  The Vanderbilt Heard Library has maintained a Callimachus-based SPARQL endpoint at http://rdf.library.vanderbilt.edu/ since the completion of Sean King's Dean's Fellow project in 2015.  But Callimachus has some serious deficiencies related to speed (see this post for details) and we had been considering replacing it with a Stardog-based endpoint.  However, the community version of Stardog has a limit of 25 million triples per database, and we could easily go over that by loading either the Getty Thesaurus of Geographic Names or the GeoNames dataset, both of which contain well over 100 million triples.  So we have been considering using Blazegraph (the graph database used by Wikidata), which has no triple limit.  I have already reported on my experience with loading 100+ million triples into Blazegraph in an earlier post.  One issue that became apparent to me through that exercise was that appropriate use of named graphs would be critical to effective maintenance of a production triplestore and SPARQL endpoint. It also became apparent that my understanding of RDF graphs in the context of SPARQL was too deficient to make further progress on this front.  This post is a record of my attempt to remedy that problem.

Assumptions

This post assumes that you have a basic understanding of RDF, the Turtle serialization of RDF, and the SPARQL query language.  There are many good sources of information about RDF in general - the RDF 1.1 Primer is a good place to start.  For a good introduction to SPARQL, I recommend Bob DuCharme's Learning SPARQL.


Graphs: named and otherwise

One of the impediments to understanding how graphs interact with SPARQL is understanding the terminology used in the SPARQL specification.  Please bear with me as I define some of the important terms needed to talk about graphs in the context of SPARQL.  The technical documentation defining these terms is the SPARQL 1.1 Query Language W3C Recommendation, Section 13.

In the abstract sense, a graph defines the connections between entities, with the entities represented as nodes and the connections between them represented as arcs (also known as edges).  The RDF data model is graph-based, and a graph in RDF is described by triples.  Each triple describes the relationship between two nodes connected by an arc.  Thus, in RDF a graph can be defined as a set of triples.

I used three graphs for the tests I'll be describing in this post.  The first graph contains 12 triples and describes the Chinese temple site Anchansi and some other things related to that site.  The full graph in Turtle serialization can be obtained at this gist, but two of the triples are shown in the diagram above.  As with any other resource in RDF, a graph can be named by assigning it a URI.  In this first graph, I've chosen not to assign it an identifying URI.  I will refer to this graph as the "unnamed graph".

The second graph contains 18 triples and describes the temple site Baitaisi.  The graph in Turtle serialization is at this gist, and two triples from the graph are shown in the diagram above.  I have chosen to name the second graph by assigning it the URI <http://tang-song/baitaisi>.  You should note that although the URI denotes the graph, it isn't a URL that "does" something.  There is no web page that will be loaded if you put the URI in a browser.  That is totally fine - the URI is really just a name.  I'll refer to this graph by its URI - it is an example of a named graph.
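To give a flavor of the data, here is a sketch of the kind of triples found in these graphs (the building fragment identifier is made up for illustration; see the gists for the real triples):

@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix schema: <http://schema.org/> .

# the Baitaisi temple site is typed as a spatial thing...
<http://lod.vanderbilt.edu/historyart/site/Baitaisi> a geo:SpatialThing .

# ...and its buildings are typed as landmarks or historical buildings
<http://lod.vanderbilt.edu/historyart/site/Baitaisi#SomeBuilding> a schema:LandmarksOrHistoricalBuildings .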

A third graph about the Chinese temple site Baiyugong is here.  I'll refer to it from time to time by its URI <http://tang-song/baiyugong>.

In the context of SPARQL, an RDF dataset is a collection of graphs.  This collection of graphs will be loaded into some kind of data store (which I will refer to as a "triplestore"), where it can be queried using SPARQL.  There may be many graphs in a triple store and SPARQL can query any or all of them.

In a SPARQL query, the default graph is the set of triples that is queried by default when graph patterns in the query are not restricted to a particular named graph.  There is always a default graph in an RDF dataset.  However, that graph may include an unnamed graph, the merge of one or more named graphs, or it may be an empty graph (a graph containing no triples).

A dataset may also include named graphs whose triples are searched exclusively when that graph is specified by name.



Aside on the three tested endpoints: setup and querying

This section of the post is geared towards those who want to try any of these experiments themselves, who want to work towards setting up one of the three systems as a functioning triple store/SPARQL endpoint, or who just want to have a better understanding how the query interface works.  If you don't care about any of those things, you can skip to the next section.

Each of the three systems can be downloaded and set up for free.  I believe that all three can be set up on Windows, Mac, and Linux, although I have only set them up on Windows.

Callimachus can be downloaded from here as a .zip bundle.  After downloading, the Getting Started Guide has straightforward installation instructions.  After the setup script is complete, you will need to set up a local administrator account using a one-time URL.  If the process fails, if you can't login, or if you destroy the installation (which I will tell you how to do later), you can delete the entire directory into which you unzipped the archive, unzip it again, and repeat the installation steps.  You can't re-use the account setup URL a second time.

To download Stardog, go to http://stardog.com/ and click on the Download button.  Unless you know you want to use the Enterprise version, select the Stardog Community version.  Unfortunately, it has been a while since I installed Stardog, so I can't remember the details.  However, I don't remember having any problems when I followed the Quick Start Guide.  In order to avoid having to set the STARDOG_HOME environmental variable every time I wanted to use Stardog, I made the following batch file in my user directory:

set STARDOG_HOME=C:\stardog-home
C:\stardog-4.0.3\bin\stardog-admin.bat server start

where the stardog-4.0.3\bin is the directory where the binaries were installed.  To start the server, I just run this batch file from a command prompt.  Stardog ships with a default superuser account "admin" with the password "admin", which is fine for testing on your local machine.

To download Blazegraph, go to https://www.blazegraph.com/ and click the download button. The executable is a Java .jar file that is invoked from the command line, so there is basically no installation required.  Blazegraph has a Quick Start guide as a part of its wiki, although the wiki in general is somewhat minimal and does not have much in the way of examples.  For convenience, I put the .jar file in my user directory and put that single command line into a batch file so that I can easily start Blazegraph by invoking the batch file.  There isn't any user login required to use Blazegraph - read-only access is set up by settings in the installation.  I've read about this on the developer's email list, but not really absorbed it.

So what exactly is happening when you start up each of these applications from the command line?  You'll get some kind of message saying that the software has started, but you won't get any kind of GUI interface to operate the software.  That's because what you are actually doing is starting up a web server that is running on your local computer ("localhost"), and is not actually connected to any outside network.  By default, each of the three applications allows you to access the local server endpoint through a different port (Callimachus = port 8080, Stardog = port 5820, and Blazegraph = port 9999), so you can run all three at once if you want.  If you wanted to operate one of the applications as an external server, you would change the port to something else (probably port 80).

So what does this mean?  As with most other Web interactions, the communication with each of these localhost servers can take place through HTTP-mediated communication.  SPARQL stands for "SPARQL Protocol and RDF Query Language" - the "Protocol" part means that a part of the SPARQL Recommendation describes the language by which communication with the server takes place.  The user sends a command via HTTP to the address of the server endpoint, coded using the SPARQL protocol, and the server sends a response back to the user in the format (XML, JSON) that the user requests.  If you enjoy such gory details, you can use cURL, Postman, or Advanced Rest Client to send raw queries to the localhost endpoint and then dissect the response to figure out what it means.  Most people are going to be way too lazy to do this for testing purposes.
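For the curious, a raw request with cURL would look roughly like this.  The endpoint path is an assumption on my part (Blazegraph's SPARQL endpoint for the default namespace is typically at /blazegraph/sparql; check your own installation):

curl -G "http://localhost:9999/blazegraph/sparql" --data-urlencode "query=SELECT DISTINCT ?g WHERE { GRAPH ?g { ?s ?p ?o } }" -H "Accept: application/sparql-results+json"

The query is sent as a URL-encoded "query" parameter, and the Accept header asks the server to return the results as JSON.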

Because it's a pain to send and receive raw HTTP, each of the three platforms provides a web interface that mediates between the human and the endpoint.  The web interface is a web form that allows the human user to type the query into a box, then click a button to send the query to the endpoint.  The code in the web page properly encodes the query, sends it to the localhost endpoint, receives the response, then decodes the response into a tabular form that is easier for a human to visualize than XML or JSON.  The web form makes it easy to interact with the endpoint for the purpose of developing and testing queries.

However, when the endpoint is ultimately put into production, the sending of queries and visualization of the response would be handled not by a web form, but by Javascript in web pages that make it possible for the end users to interact with the endpoint using typical web controls like dropdowns and buttons without having to have any knowledge of writing queries.  To see how this kind of interaction works, open the test Chinese Temple website at http://bioimages.vanderbilt.edu/tang-song.html using Chrome.  Click on the options button in the upper right corner of the browser and select "More tools" then "Developer tools".  Click on the Network tab and you can watch how the web page interacts with the endpoint.  Clicking on any of the "sparql?query=..." items, then the "header" tab on the right shows the queries that are being sent to the endpoint.  Clicking on "response" tab on the right shows the response of the endpoint.  This response is used by the Javascript in the web page to build the dropdown lists and the output at the bottom of the page.

In the rest of this post, I will describe interactions with the localhost endpoint through the web form interface, but keep in mind that the same queries and commands that we type into the box could be sent directly to the endpoint from any other kind of application (Javascript in a web page, desktop application, smartphone app) that is capable of communicating using HTTP.

Opening and using the web form interfaces

Each of the three applications has a similar web form interface, although the exact behavior of each interface varies.  There are actually two ways to interact with the server: through a SPARQL query (a read operation) and through a SPARQL Update command (a write operation).  The details of these two kinds of interactions are given for each of the applications.

Callimachus

To load the Callimachus web form interface after starting the server, paste the URL

http://localhost:8080/sparql?view

into the browser address box.  If everything is working, the Callimachus interface will look something like this:


Both queries and update commands are pasted into the same box.  However, to make a query, you must click the "Evaluate Query" button. To give an update command, you must click the "Execute Update" button.  After evaluating a query, you will be taken to another page where the response is displayed.  Hitting the back button will take you back to the query page with the query still intact in the box.  After executing an update, the orange button will "gray out" while the command is being executed and turn orange again when it is finished.  No other indication is given that the command was executed.

Namespace prefixes must be explicitly defined in the text of the box.  However, once prefixes are used, Callimachus "remembers" them, so it isn't necessary to re-define them with every query.

Stardog

To load the Stardog web form interface after starting the server, paste the URL

http://localhost:5820/myDB#!/query

into the browser address box.  If everything is working, the Stardog interface will look something like this:

Stardog does not differentiate between queries and update commands.  Both are typed into the same box and the "Execute" button is used to initiate both.  Query results will be given in the "Results" area at the bottom of the screen.  Successful update commands will display "True" in the Results area.

Commonly used prefixes that appear in the Prefixes: box don't have to be explicitly typed in the text box.  Additional pre-populated prefixes can be added in the Admin Console.


Blazegraph

To load the Blazegraph web form interface after starting the server, paste the URL

http://localhost:9999/blazegraph/#query

into the browser address box.  If everything is working, the Blazegraph query interface will look something like this:


Only queries can be pasted into this box.  Well-known namespace abbreviations can be inserted into the box using the "Namespace shortcuts" dropdowns above the box.  If the query executes successfully, the results will show up in the space below the Execute button.  The page also maintains a record of the past queries that have been run.  They are hyperlinked, and clicking on them reloads the query in the box.

To perform a SPARQL Update, the UPDATE tab must be selected.  That generates a different web form that looks like this:


There are several ways to interact with this page.  For now, the "Type:" dropdown should be set for "SPARQL Update".  A successful update will show a COMMIT message at the bottom of the screen.  The "mutationCount" gives an indication of the number of changes made; in this example 10 triples were added to the triplestore, so the mutationCount=10.



The SPARQL Update "nuclear option": DROP ALL

One important question in any kind of experimentation is: "What do I do if I've totally screwed things up and I just want to start over?"  In Stardog and Blazegraph, the answer is the SPARQL Update command "DROP ALL".  Executing DROP ALL causes all of the triples in all of the graphs in the database to be deleted.  You have a clean slate and an empty triplestore ready to start afresh.  Obviously, you don't want to do this if you've spent hours loading hundreds of millions of triples into your production triplestore.  But in the type of experiments I'm running here, it's a convenient way to clear things out for a new experiment.

However, you NEVER, NEVER, NEVER want to issue this command in Callimachus.  You will understand why later in this post, but for now I'll just say that the best case scenario is that you will be starting over with a clean install of Callimachus if you do it.  Instead of DROP ALL, you should drop each graph individually.  We will see how to do that below.


Putting a graph into the triplestore (finally)

All of these preliminaries were to get us to the point of being ready to load a graph into the triplestore.  In each of the three applications, there are multiple ways to accomplish this task, and many of those ways differ among the applications.  However, loading a graph using SPARQL Update works the same on all three (the beauty of W3C standards!), so that's how we will start.

If you want to try to achieve the same results as are shown in the examples here, save the three example files from my Gists: test-baitaisi.ttl, test-baiyugong.ttl, and test-unnamed.ttl.  Put them somewhere on your hard drive where they will have a short and simple file path.

Using the SPARQL Update web form of the application of your choice, type a command of this form:

LOAD <file:///c:/Dropbox/tang-song/test-baitaisi.ttl> INTO GRAPH <http://tang-song/baitaisi>

This command contains two URIs within angle brackets.  The second URI is the name that I want to use to denote the uploaded graph.  I'll use that URI any time I want to refer to the graph in a query.   Recall that this URI is just a name and doesn't have to actually represent any real URL on the Web.  The first URI in the LOAD command is a URL - it provides the means to retrieve a file.  It contains the path to the test-baitaisi.ttl file that you downloaded (or some other file that contains serialized RDF).  The triple slash thing after "file:" is kind of weird.  The "host name" would typically go between the second and third slashes, but on your local computer it can be omitted - resulting in three slashes in a row.  (I think you can actually use "file://localhost/c:/..." but I haven't tried it.)  The path can't be relative, so in Windows, the full path starting with the drive letter must be given.  I have not tried Mac and Linux, but see the answer to this stackoverflow question for the probable path forms.  If the path is wrong or the file doesn't exist, an error will be generated.

Execute the update by clicking on the button.  How do we know that the graph is actually there?  Here is a SPARQL query that can answer the question:

select distinct ?g where { 
  graph ?g {?s ?p ?o}
}

If you are using Blazegraph, you'll have to switch from the Update tab to the Query tab before pasting the query into the box.  Execute the query, and the results in Stardog and Blazegraph should show the URI that you used to name the graph that you just uploaded: http://tang-song/baitaisi .

The results in Callimachus are strange.  You should see //tang-song/baitaisi in the list, but there are a bunch of other graphs in the triplestore that you never put there.  These are graphs that are needed to make Callimachus operate.  Now you can understand why using the DROP ALL command has such a devastating effect in Callimachus.  The command DROP ALL is faithfully executed by Callimachus and wipes out every graph in the triplestore, including the ones that Callimachus needs to function.  The web server continues to operate, but it doesn't actually "know" how to do anything and fails to display any web page of the interface.  Why Callimachus allows users to execute this "self-destruct" command is beyond me!

The graceful way to get rid of your graph in Callimachus is to drop the specific graph rather than all graphs, using this SPARQL Update command:

DROP GRAPH <http://tang-song/baitaisi> 

This will leave intact the other graphs that are necessary for the operation of Callimachus.

Specifying the role of a named graph

For the examples in this section, you should load the test-baitaisi.ttl and test-baiyugong.ttl files from your hard drive into the triplestore(s) using the SPARQL Update LOAD command as shown in the previous section, naming them with the URIs http://tang-song/baitaisi and http://tang-song/baiyugong respectively.

The FROM clause is used to specify that triples from particular named graphs should be used as the default graph.  There can be more than one named graph specified - the default graph is the merge of triples from all of the specified named graphs [2].  For example, the query

PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

SELECT DISTINCT ?site
FROM <http://tang-song/baitaisi>
FROM <http://tang-song/baiyugong>
WHERE {
  ?site a geo:SpatialThing.
  }

designates that the default graph should be composed of the merge of the two graphs we loaded.  The graph pattern in the WHERE clause is applied to all of the triples in both of the graphs (i.e. the default graph).  Running the query returns the URIs of both sites represented in the graphs:

<http://lod.vanderbilt.edu/historyart/site/Baitaisi>
<http://lod.vanderbilt.edu/historyart/site/Baiyugong>

The FROM NAMED clause is used to say that a named graph is part of the RDF dataset, but that a graph pattern will be applied to that named graph only if it is specified explicitly using the GRAPH keyword.  If we wrote the query like this:

PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

SELECT DISTINCT ?site
FROM <http://tang-song/baitaisi>
FROM NAMED <http://tang-song/baiyugong>
WHERE {
  ?site a geo:SpatialThing.
  }

we only get one result:

<http://lod.vanderbilt.edu/historyart/site/Baitaisi>

because we didn't specify that the graph pattern should apply to the http://tang-song/baiyugong named graph.  In this query:

PREFIX schema: <http://schema.org/>

SELECT DISTINCT ?building
FROM <http://tang-song/baitaisi>
FROM NAMED <http://tang-song/baiyugong>
WHERE {
  GRAPH <http://tang-song/baiyugong> {
    ?building a schema:LandmarksOrHistoricalBuildings.
  }
  }

only buildings described in the <http://tang-song/baiyugong> graph are returned:

<http://lod.vanderbilt.edu/historyart/site/Baiyugong#Houdian>
<http://lod.vanderbilt.edu/historyart/site/Baiyugong#Sanxiandian>
<http://lod.vanderbilt.edu/historyart/site/Baiyugong#Shanmen>
<http://lod.vanderbilt.edu/historyart/site/Baiyugong#Zhengdian>

More complicated queries can be constructed, like this:

PREFIX schema: <http://schema.org/>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

SELECT DISTINCT ?site ?building
FROM <http://tang-song/baitaisi>
FROM NAMED <http://tang-song/baiyugong>
WHERE {
  ?site a geo:SpatialThing.
  GRAPH <http://tang-song/baiyugong> {
        ?building a schema:LandmarksOrHistoricalBuildings.
  }
  }

where ?site binds to sites in the default graph (composed of <http://tang-song/baitaisi>), while ?building binds only to buildings described in the explicitly specified <http://tang-song/baiyugong> named graph.

Using FROM and FROM NAMED clauses in a query makes it very clear which graphs should be considered for matching with the graph patterns in the WHERE clause.
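A useful variation (a sketch in the same vein as the queries above, which I haven't run as part of these tests) is to list both graphs in FROM NAMED clauses, put a variable in the graph position, and include that variable in the SELECT list.  The results then report which named graph each binding came from:

PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

SELECT DISTINCT ?g ?site
FROM NAMED <http://tang-song/baitaisi>
FROM NAMED <http://tang-song/baiyugong>
WHERE {
  GRAPH ?g {
    ?site a geo:SpatialThing.
  }
}

Each ?site should come back paired with the URI of the graph that contains its geo:SpatialThing triple.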

What happens if we load a graph without a name?

It is possible to load a graph into a triplestore without giving it a name, as in this SPARQL Update command:

LOAD <file:///c:/Dropbox/tang-song/test-unnamed.ttl>

Assume that this file has been loaded along with the previous two named graphs.  What would happen if we ran this query:

PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

SELECT DISTINCT ?site
WHERE {
  ?site a geo:SpatialThing.
  }

When no named graph is specified using a FROM clause, the endpoint applies the query to "the default graph".  The problem is that the SPARQL specification is not clear how the default graph should be constructed.  Section 13 says "An RDF Dataset comprises one graph, the default graph, which does not have a name, and zero or more named graphs...", which implies that triples loaded into the store without specifying a graph URI will become part of the default graph.  This is also implied in Example 1 in Section 13.1, which shows the "Default graph" as being the one without a name.  However, it is also clear that "default graph" cannot be synonymous with "unnamed graph", since the FROM clause allows named graphs to be specified as the default graph.  So what happens when we run this query?

On Stardog, the graph pattern binds only a single URI for ?site:

<http://lod.vanderbilt.edu/historyart/site/Anchansi>

This is the site described by the unnamed graph I loaded.  However, running the query on Blazegraph and Callimachus produces this result:

<http://lod.vanderbilt.edu/historyart/site/Baitaisi>
<http://lod.vanderbilt.edu/historyart/site/Baiyugong>
<http://lod.vanderbilt.edu/historyart/site/Anchansi>

which are the URIs for the sites described by the unnamed graph and both of the named graphs!

This behavior is somewhat disturbing, because it means that the same query, performed on the same graphs loaded into the triplestores using the same LOAD commands, does NOT produce the same results.  The results are implementation-specific.

Construction of the dataset in the absence of FROM and FROM NAMED

The GraphDB documentation sums up the situation like this:
The SPARQL specification does not define what happens when no FROM or FROM NAMED clauses are present in a query, i.e., it does not define how a SPARQL processor should behave when no dataset is defined. In this situation, implementations are free to construct the default dataset as necessary.
In the absence of FROM and FROM NAMED clauses, GraphDB constructs the dataset's default graph in the same way as Callimachus and Blazegraph: by merging the database's unnamed graph and all named graphs in the database.

In the absence of FROM and FROM NAMED clauses, all of the applications include all named graphs in the dataset, allowing graph patterns to be applied specifically to them using the GRAPH keyword.  So in the case of this query:

PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

SELECT DISTINCT ?site
WHERE {
  GRAPH ?g {?site a geo:SpatialThing.}
  }

we would expect the results to include ?site URI bindings from the two named graphs:

<http://lod.vanderbilt.edu/historyart/site/Baitaisi>
<http://lod.vanderbilt.edu/historyart/site/Baiyugong>

and indeed we do.  However, Callimachus and Blazegraph also include:

<http://lod.vanderbilt.edu/historyart/site/Anchansi>

in the results, indicating that they consider the unnamed graph to also bind to ?g (Stardog does not).

Construction of the dataset when FROM and FROM NAMED clauses are present

If we run this query:

PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

SELECT DISTINCT ?site
FROM <http://tang-song/baitaisi>
WHERE {
  ?site a geo:SpatialThing.
  }

we get the same result on all three platforms - only the site URI <http://lod.vanderbilt.edu/historyart/site/Baitaisi> from the named graph that was specified in the FROM clause.  This should be expected, since Section 13.2 of the SPARQL 1.1 specification says
A SPARQL query may specify the dataset to be used for matching by using the FROM clause and the FROM NAMED clause to describe the RDF dataset. If a query provides such a dataset description, then it is used in place of any dataset that the query service would use if no dataset description is provided in a query.
The phrase "any dataset that the query service would use if no dataset description is provided" in that passage is suitably vague about what would be included in the dataset in the absence of FROM and FROM NAMED clauses (i.e. the default graph).  Section 13.2 also says
If there is no FROM clause, but there is one or more FROM NAMED clauses, then the dataset includes an empty graph for the default graph.
that is, if only FROM NAMED clauses are included in the query, unnamed graph(s) will NOT be used as the default graph, since the default graph is required to be empty.
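In other words, a query like this sketch should return no results at all, because its bare graph pattern is matched only against the (empty) default graph - assuming the endpoint follows the specification on this point:

PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

SELECT DISTINCT ?site
FROM NAMED <http://tang-song/baiyugong>
WHERE {
  ?site a geo:SpatialThing.
  }

To actually get the Baiyugong site out of that dataset, the graph pattern would have to be wrapped in GRAPH <http://tang-song/baiyugong> { ... }.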

Querying the entire triplestore using Stardog

The way that Stardog constructs datasets is problematic since there is no straightforward way to include all unnamed and named graphs (i.e. all triples in the store) in the same query.  The following query is possible:

PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

SELECT DISTINCT ?site
WHERE {
   {?site a geo:SpatialThing.}
       UNION
  {GRAPH ?g {?site a geo:SpatialThing.}}
  }

but it is awkward, since the desired graph pattern has to be stated twice in the query.  The first graph pattern binds matching triples in the unnamed graph, and the second graph pattern binds matching triples in all of the named graphs.

What is the name of an unnamed graph?

Previously, we saw that the query

SELECT DISTINCT ?g WHERE { 
  GRAPH ?g {?s ?p ?o}
}

could be used to ask the names of graphs that were present in the triplestore.  Let's find out what happens when we run this query on the three triplestores with the unnamed graph and the two named graphs loaded.  Stardog gives this result:

http://tang-song/baitaisi
http://tang-song/baiyugong

which is not surprising, since we saw that an earlier query using GRAPH ?g bound ?g only to the two named graphs.  Running the query on Blazegraph produces this somewhat surprising result:

<http://tang-song/baitaisi>
<http://tang-song/baiyugong>
<file:/c:/Dropbox/tang-song/test-unnamed.ttl>

We see the two named graphs, but we also see a URI for the unnamed graph.  In the absence of a graph URI in the LOAD command, Blazegraph has assigned the graph a URI that is almost the file URI (only one slash after "file:" instead of three).  This might explain why all three graphs (including the unnamed one) bound to ?g in the earlier Blazegraph query.  However, it does not explain the same behavior in Callimachus, since in Callimachus the current query lists only the two named graphs (besides the many Callimachus utility graphs that make the thing run).

Loading graphs using the GUI

Each of the three platforms I've been testing also provides a means to load files into the store using a graphical user interface instead of the SPARQL Update LOAD command.


Callimachus

To get to the file manager in Callimachus, in the browser URL box enter:

http://localhost:8080/?view

You'll see a screen like this:

You can create subfolders and upload files using the red buttons at the top of the screen.  After uploading the unnamed graph file, I ran the query to show all of the named graphs. The one called test-unnamed.ttl showed up on the list.  So although loading files using the GUI does not provide an opportunity to specify a name for the graph, Callimachus assigns a name to the graph anyway (the file name).


Stardog

Before uploading using the GUI, I executed DROP ALL to make sure that all graphs (named and unnamed) were removed from the store.  To load a file, select Add from the Data dropdown list at the top of the page.  A popup will appear and you will have an opportunity to select the file using a dialog. It looks like this:


Stardog gives you an opportunity to specify a name for the graph in the box.  If you don't put in a name, the URI tag:stardog:api:context:default shows up in the box.  I loaded the unnamed graph file and left the graph name box empty, assuming that tag:stardog:api:context:default would be assigned as the name of the graph.  However, running the query to list all graphs produced only the URIs for the two explicitly named graphs.

When I performed the query

PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

SELECT DISTINCT ?site 
FROM <http://tang-song/baitaisi>
WHERE {
  ?site a geo:SpatialThing.
}

I only got one result:

http://lod.vanderbilt.edu/historyart/site/Baitaisi

But when I included the <tag:stardog:api:context:default> graph in a FROM clause:

PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

SELECT DISTINCT ?site 
FROM <http://tang-song/baitaisi>
FROM <tag:stardog:api:context:default>
WHERE {
  ?site a geo:SpatialThing.
}

I got two results:

http://lod.vanderbilt.edu/historyart/site/Anchansi
http://lod.vanderbilt.edu/historyart/site/Baitaisi

So in some circumstances, referring to the "name" of the unnamed graph <tag:stardog:api:context:default> causes it to behave as a named graph, even though it didn't show up when I ran the query to list the names of all graphs.

Blazegraph

I also ran DROP ALL on the Update page before loading files using the Blazegraph GUI.  At the bottom of the Update page, I changed the dropdown from "SPARQL Update" to "File Path or URL". I then used the Choose File button to initiate a file selection dialog.  After selecting the test-unnamed.ttl file, the dropdown switched on its own to "RDF Data" with Format: Turtle, and displayed the file in the box, like this:


There was no opportunity to specify a URI for the graph.  I clicked on the update button, then switched to the query page so that I could run the query that asks for the names of the graphs.  The graph name bd:nullGraph was given.  The namespace shortcuts indicate that bd: is the abbreviation for <http://www.bigdata.com/rdf#>.  So loading an unnamed graph through the GUI again results in Blazegraph assigning it a name, this time <http://www.bigdata.com/rdf#nullGraph> instead of a file name-based URI.

As with Stardog, including the IRI "name" of the unnamed graph in a FROM clause causes it to be added to the default graph.
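For example, a query analogous to the Stardog one above (a sketch of the idea, not a transcript of a test I ran) should pull the GUI-loaded triples into the default graph and bind the Anchansi site URI:

PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

SELECT DISTINCT ?site
FROM <http://www.bigdata.com/rdf#nullGraph>
WHERE {
  ?site a geo:SpatialThing.
}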

Names of unnamed graphs (review)

Here's what I've discovered so far about how the three SPARQL endpoints/triplestores name graphs when no name is assigned to them by the user upon loading.  The examples assume that test-unnamed.ttl is the name of the uploaded file.

Callimachus

When loaded using the GUI: the graph name is the file name, e.g. <test-unnamed.ttl>

When loaded using the SPARQL Update command: LOAD <file:///c:/Dropbox/tang-song/test-unnamed.ttl>: the triples do not appear to be assigned to any named graph.

Stardog

When loaded using the GUI, or when loaded using the SPARQL Update command: LOAD <file:///c:/Dropbox/tang-song/test-unnamed.ttl>:  the triples are added to the graph named <tag:stardog:api:context:default> (although that graph doesn't seem to bind to patterns where there is a variable in the graph position of a graph pattern).

Blazegraph

When loaded using the GUI: the triples are added to the graph named <http://www.bigdata.com/rdf#nullGraph>

When loaded using the SPARQL Update command: LOAD <file:///c:/Dropbox/tang-song/test-unnamed.ttl>: the graph name is a modification of the file name, e.g. <file:/c:/Dropbox/tang-song/test-unnamed.ttl>

For all practical purposes, Blazegraph does not have unnamed graphs - triples always load into some named graph that binds to variables in the graph position of a graph pattern.

Deleting unnamed graphs

As noted earlier, deleting a named graph is easy using the

DROP GRAPH <graphURI>

command of SPARQL Update.  What about deleting the "unnamed" graphs of the flavors I've just described?  In all of the cases of named "unnamed" graphs listed above, inserting the graph's URI into a DROP GRAPH command results in the deletion of the graph.  That's not surprising, since those graphs aren't really "unnamed" after all.  The problematic situation is where triples are loaded into Callimachus using SPARQL Update with no graph name.  Those triples can't be deleted using DROP GRAPH because there is no way to refer to their truly unnamed graph.
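For example, based on the graph names listed in the review above, commands like these (run one at a time in the appropriate application; treat the URIs as templates, since the exact URI depends on how and from where the graph was loaded) should do the job:

# Stardog: the graph that holds triples loaded without a graph name
DROP GRAPH <tag:stardog:api:context:default>

# Blazegraph: a graph named after the file it was loaded from via SPARQL Update LOAD
DROP GRAPH <file:/c:/Dropbox/tang-song/test-unnamed.ttl>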

DROP DEFAULT

The SPARQL 1.1 Update specification, Section 3.2.2, gives an option for the DROP command called DROP DEFAULT.  Given the uncertainty about what is actually considered the "default" graph in the three platforms, I decided to run some tests.

In Callimachus, DROP DEFAULT doesn't do anything as far as I can tell.  That's unfortunate, because it's the only platform that uploads triples into a graph that truly has no name.  As far as I can tell, there is no way to use the DROP command to clear out triples that are loaded using the SPARQL Update LOAD command with no graph URI provided.  (Well, actually DROP ALL will work if you want to self-destruct the whole system!)

In Stardog, every loaded graph with an unspecified name goes into the graph <tag:stardog:api:context:default>.  DROP DEFAULT does delete everything in that graph.

In Blazegraph, DROP DEFAULT deletes the graph <http://www.bigdata.com/rdf#nullGraph>, which is where triples uploaded via the GUI go.  However, DROP DEFAULT does not delete any of the "file name URI"-graphs that result when graphs are uploaded using the SPARQL Update LOAD command without a provided graph IRI.

Summary

Named graphs are likely to be an important part of managing a complex and changing triplestore/SPARQL endpoint, since they are the primary way to run queries over a specific part of the database and the primary way to remove a specified subset of the triples from the store without disturbing the rest of the loaded triples.

Although it is less complicated to load triples into the store without specifying them as part of a named graph, the handling of "unnamed" graphs by the various platforms is very idiosyncratic.  Unnamed graphs introduce complications in querying and management of triples in the store.  Specifically:

  • In Callimachus, in some cases there appears to be no simple way to get rid of triples that aren't associated with some flavor of named graph.  
  • In Stardog, there is no simple way to use a single graph pattern to query triples in both named and unnamed graphs.  
  • Blazegraph seems to be the most trouble-free, since omitting any FROM or FROM NAMED clauses allows triples in both named and "unnamed" graphs to be queried using a single graph pattern.  I put "unnamed" in quotes because Blazegraph always loads triples into a graph identified with a URI even when one isn't specified by the user.  However, knowing what those URIs are is a bit confusing, since the URI that is assigned depends on the method used to load the graph.

My take-home from these experiments is that we are probably best-off continuing with our plan to use Blazegraph as our triplestore, and that we should probably load triples from various sources and projects into named graphs.

What I've left out

There are a number of features of the SPARQL endpoints/triplestores that I have not discussed in this post.  One is the service description of the endpoint.  The SPARQL 1.1 suite of Recommendations includes the SPARQL 1.1 Service Description specification.  This specification describes how a SPARQL endpoint should provide information about the triplestore to machine clients that discover it.   This information includes number of triples, graphs present in the store, namespaces used, etc.  A client can request the service description by sending an HTTP GET request to the endpoint without any query string.  For example, with Blazegraph, sending a GET to

http://localhost:9999/blazegraph/sparql

returns the service description.

The various applications also enable partitioning data in the store on a level higher than graphs.  For example, Blazegraph supports named "datasets".  Querying across different datasets requires doing a federated query, even if the datasets are in the same triplestore.  Stardog has a similar feature where its triples are partitioned into "databases" that can be managed independently.
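For example, from a query submitted to one Blazegraph dataset you can reach another one with a SERVICE clause.  Something like this sketch should work, assuming a second dataset exists at the endpoint URL shown (the name "other" and the endpoint path are illustrative - check the URL that the Blazegraph workbench reports for the dataset you want to query):

PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

SELECT DISTINCT ?site
WHERE {
  SERVICE <http://localhost:9999/blazegraph/namespace/other/sparql> {
    ?site a geo:SpatialThing.
  }
}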

There are also alternate methods for loading graphs that I haven't mentioned or explored.  Because SPARQL Update commands can be made via HTTP, they can be automated by a desktop application that can issue HTTP requests (as opposed to typing the commands into a web form).  So with appropriate software, graph maintenance can be automated or carried out at specified intervals.  Blazegraph also has a "bulk loader" facility that I have not explored.  Clearly there are a lot more details to be learned!


[1] GitHub repo at https://github.com/HeardLibrary/semantic-web
[2] In the merge, blank nodes are kept distinct within the source graphs.  If the same blank node identifier is used in two of the merged graphs, the blank node identifier of one of the graphs will be changed to ensure that it denotes a different resource from the resource identified by the blank node identifier in the other graph.