Saturday, October 11, 2025

Camping at Windigo in Isle Royale National Park

 Our family loves to visit U.S. national parks, and we have visited most of the ones that are easy to get to. We had pretty much written off going to Isle Royale due to its remoteness and the difficulty of getting there. However, our daughter did some research on what it would take, and we decided to give it a try. In this post, I will share some information we gained about making the trip that would have been good to know while we were planning.

Who does this apply to?

There are several ways to visit Isle Royale. The information I'm providing here will mostly apply to people who:

  • want to camp,
  • are NOT backpacking, and
  • are arriving on the west side of the island from Minnesota by ferry.

It generally does not apply to people arriving by float plane, traveling to the east side of the island from Michigan, or planning to stay in the Rock Harbor Lodge.

I also should mention that our visit was late in the season when there were fewer visitors than during the peak of the season. So, if you go earlier in the summer when it is busier, you may have problems that we did not encounter. Also, this information is current as of September 2025. As time passes, it will become less accurate as conditions change on the island.

Getting there

There were three reasons why we decided to visit the western end of the island. One reason was that our daughter lives in Minneapolis. Another was that the ferry crossing from Minnesota is significantly shorter than the crossing from Michigan (2 hours vs. 6 hours). The final reason was that fewer people visit the west end of the island, and since it is not possible to make camping reservations, we thought it was more likely that we could get a non-reservable campsite there than on the eastern end.

The ferry terminal is located in Grand Portage at the extreme northeastern tip of Minnesota. It takes five or more hours to drive there from Minneapolis. So unless you want to start driving in the middle of the night, you'll probably want to stay in Grand Portage the night before a morning departure. Pretty much the only place to stay there is in the hotel of the casino (Grand Portage Lodge; no AirBNBs or other hotels there). The nearest other U.S. town is Grand Marais, MN, which has places to stay, restaurants, and stores, but is about a 45 minute drive from Grand Portage. (Thunder Bay, Ontario is under an hour away, but involves the additional complication of a border crossing.) There is a large parking lot at the ferry terminal where you can park your car while you visit the island. Be prepared to pay a small parking fee to the Village of Grand Portage, which owns the lot (credit cards accepted).

Sea Hunter III at the Grand Portage dock. This is the smaller ferry that runs between Minnesota and Windigo

There are two ferries that run from Grand Portage in the morning, and during most of the summer season there is at least one leaving on most days of the week, except Tuesdays. There are ferries returning in the afternoon on most days, except Mondays. Check the schedule for detailed information on days and times. You need to make reservations well in advance of your trip, as the ferries have limited capacity and fill up weeks or months ahead of time. The cost per person is about $200 round trip and is probably the largest expense of visiting.

Voyageur II at the Windigo dock. This is the larger ferry between Windigo and Minnesota

One important thing to pay attention to is the time. The terminal in Minnesota is on Central Time and Isle Royale is on Eastern Time. The ferries from Minnesota express all departure times in Central Time, regardless of whether you are departing from Grand Portage or Windigo. If you have a conventional watch and don't change it, their times will make sense. However, if you use your phone to set an alarm or as a clock, and if you use the Internet on the island (yes, there is Internet -- see below), your phone may automatically update to Eastern Time. The main point is to PAY ATTENTION to what time your ferry is arriving to take you back to Minnesota and what time your timekeeping device thinks it is.

Washington Harbor from the Windigo boat dock

The crossing itself was interesting and provides the best opportunity to see Lake Superior and the shores of Isle Royale. The park facilities at Windigo are located on the long and narrow Washington Harbor, so once you arrive, you aren't likely to see Lake Superior again before you leave unless you do significant hiking. 

Road from boat dock to Washington Creek campground. This is one of the backpacks we rented.

The ferries are passenger-only and there are no vehicles on the island, so you need to plan to carry everything that you will need on the island or buy it at the camp store in Windigo. One of the key pieces of information that we had trouble finding online was exactly how far the Washington Creek Campground is from the boat dock. It is 0.4 miles on a flat and easy road from the dock to the campground entrance, and you could wind up walking a few tenths of a mile further depending on how far your campsite is from the entrance. So even if you aren't backpacking, you'll need some means of carrying your gear, and you'll need to keep the weight down to what you can carry that distance.

Camping gear


As I mentioned earlier, one of the complications of non-backpacking camping on the island is that you have to carry everything onto the ferry and to the campground. If you exclusively do car camping, you probably don't have backpacks. One extremely useful thing we learned was that the University of Minnesota recreation center rents backpacks and other gear to the general public. We rented backpacks and bear canisters from there. Although there are no bears on the island, there have been cases of wolves becoming habituated to camp food, so all food must be kept in bear canisters to keep it secure from any animals it might attract. Because bear canisters are pretty expensive, it was nice to be able to rent rather than buy them. The type of canister we rented was the BearVault BV500. For more information about outdoor gear rentals, see the recreation center website. NOTE: The location shown on Google Maps for 244 Walnut St. SE, Minneapolis is the correct location for the loading dock where you pick up the rental equipment. However, using Google Maps to navigate there will send you to the other side of the building. So look at the map to find your way rather than following the turn-by-turn directions from Google Maps.


Although you will probably want to bring your own food, it is good to know that there is a camp store near the dock and visitors' center in Windigo. Supplies are somewhat expensive, but you can pick up most of the basics if you run short on food. The store sells standard-size butane cylinders, and there was even a box outside where people left partially used cylinders for anyone to take. We had bought an extra cylinder in case we ran out, which turned out to be unnecessary given their availability on the island. (Note: depending on where backpacks are stowed on the ferry, you may have to place butane cylinders in an outside location on the boat. So make sure that you have written your name on your cylinders and kept them near the top of your pack before you board.) The store also sells a limited selection of prepared foods like pizza and hot dogs in case you get tired of rehydrated backpacking food.

Because of the necessity of carrying everything, we packed backpacking tents rather than our larger (and heavier) car camping tents. When we arrived at the campground, we were surprised to discover that there were multiple shelters available. Normally you pitch your tent inside the shelter, but we just left our tents packed and slept in our sleeping bags directly on the wooden floor. However, since there is no guarantee that a shelter will be available, you will probably want to bring a tent even if you hope to get one.

Amenities detail


Campsites. In contrast to the relatively expensive ferry, camping on the island is free; you get a free camping permit when you arrive. In our case, one of the rangers met us at the kiosk by the dock to issue our permits. However, you will need to pay an entrance fee for the park unless you already have an annual or lifetime park pass.

At the Washington Creek campground, there are 10 shelters and 5 tent campsites without shelters. The signs at the shelters said that pitching tents outside the shelters was not allowed, although this apparently is not enforced; we saw people doing it. One of our major concerns was what we would do if we got off the ferry and discovered that no sites were available. Online information suggested that you could ask people who had already obtained sites to share them with you. However, the dynamic we observed was that when all of the sites were taken, people camped in the group campground next to the regular campground. So the group campground was effectively an overflow campground.
Shelter, back view

Shelter, front view

Inside of a shelter

When we visited at the end of August, there were several shelters available at the time of our arrival on a morning ferry, and most of the tent campsites were unoccupied. However, as the day progressed and backpackers arrived from other parts of the island, all of the shelter and tent sites filled up and people had to go to the group campsite. As far as I know, site sharing was not necessary. Earlier in the season, when visitation is higher, all of the main and overflow campsites might well be full, at least by the end of the day.

Water. Potable water is available, so we did not end up using the water purifier that we brought. There is a spigot near the boat dock that was the most convenient one for us since we regularly went from the campground to the dock area to use the nice bathrooms there. We later discovered that there was another spigot located at the second entrance to the campground (see map), but the spigot shown on the map at the first entrance was no longer there.

Bathrooms. There were three pit toilets located at the campground (see map) but be aware that they were not stocked with either toilet paper or hand sanitizer. There were nice bathrooms located near the dock that had toilet paper, hot running water, soap, and electric hand driers. So we used them in preference to the pit toilets when practical. The bath house also had pay showers and laundry facilities, but we did not use them so can't comment on their quality.

Kiosk with WiFi near boat dock

Technology. We had been warned that there was no cell service on the island, although it may be possible at some high places on the island to pick up Canadian cell towers (possibly resulting in large international charges depending on your plan). We were therefore very surprised to discover that there was free public WiFi at the kiosk near the boat dock. So it is possible to send emails and iMessages, and to do limited Internet audio calls from that location. The bandwidth is pretty narrow, so video calling would be sketchy. 

There are electrical outlets by the sinks in the bathrooms, so it's possible to recharge mobile phones and portable power supplies there.

Trash. You should be prepared to pack out the trash that you generate when camping. There were trash cans in the main bathroom, but they are intended for bathroom trash only. There was a trash can outside of the store, so it is possible to discard small items there (e.g. waste from prepared food purchased in the store), but don't plan to discard your camping waste there.

Food locker for tent campers

Backpack cage at kiosk near boat dock

Food storage. As I mentioned earlier, it is required to store your food in bear canisters. The campground has metal lockers into which you can place the canisters if you are sleeping in a tent. If you are able to get a shelter, we were told that it was OK to keep the canister in the shelter. When arriving and departing, there is a cage by the kiosk where you can put backpacks while using the bathrooms and visitors center. 

Bear canisters rented from U of M rec center

Since anything with a food scent needs to be secured inside the bear canisters, we ended up primarily eating the type of backpacking meals that you prepare by pouring hot water into the pouch containing the dehydrated food. That avoided the need to clean cooking utensils and dispose of gray water from dishwashing. The used pouches can then be rolled up, closed, and returned to the canister. Scented items like toothpaste also need to be kept in the canisters, so careful planning is necessary to keep the total volume below the capacity of the canisters. We were (barely) able to fit food and snacks for three people on a 3-night trip (two full days and two half days) into two canisters.

Visitor Center. The National Park visitor center is the place to go for information about trails and wildlife. There is also a part of the center that sells postcards, books, and maps.

Activities

Paddling a canoe in Washington Bay

Paddling up Washington Creek from the bay

View of Lake Superior from Grace Creek overlook

Using Windigo as a base, we were able to enjoy three days of activities with no problem. One day we rented a canoe at the camp store and paddled around the relatively sheltered Washington Bay and up Washington Creek. Another day we did a nice, easy day hike to the Grace Creek overlook, which gave us our only look at Lake Superior proper from the island and a panoramic view of the pond and wetlands associated with Grace Creek. We also did the short nature trail loop near the visitors' center and took the Feldtmann Lake trail as far as Washington Creek. If we had been more ambitious, it would have been feasible to hike to the Minong Ridge overlook as a day hike.

Moose in Washington Creek from our campsite

We were relatively lucky with animal sightings. From our campsite we twice saw a moose wade along Washington Creek, and we also saw a moose (possibly the same one) in the bay in the shallow water near the mouth of Washington Creek. We spotted a red fox and saw two river otters playing in the bay. At the Grace Creek overlook, we saw a family of kestrels doing aerobatics below. We did not see or hear any wolves, but did find a fresh wolf footprint on a trail. We met some people who did not see a moose during their entire visit, so what you see may just depend on your luck.

Conclusions 

With some careful planning, we were able to execute a three-night camping trip on the island at a reasonable cost. Although much of the scenery and vegetation was similar to what you might see in Michigan's Upper Peninsula or northern Minnesota, the isolated location and wilderness character of the island made it a nice place to visit if you enjoy nature.


Monday, March 31, 2025

Favorite Nebula Award-winning Novels

 After finishing reading all of the Hugo Award winners for best novel, I decided to keep up the momentum and read all of the winners of the Nebula Award for best novel. I just finished the last one yesterday and decided to write a follow-up to my earlier post where I talked about which of the Hugo books were my favorites.

In this post, I’ll discuss the Nebula winners that were not also Hugo winners, and list those double-winners that I already described in the previous post.

The Nebula Award started in 1966, so the more than ten years of generally poor-quality books that were eligible only for the Hugo were out of contention in this quest (all of my 5 worst Hugo books were from before 1962). Nevertheless, there were two that I disliked enough to put in the category of worst Nebula winners.

List of favorites

Not surprisingly, many of the really good books won both awards. So the list of favorites that won only the Nebula is rather short. It’s hard to be sure that I’m holding them to exactly the same standard as my Hugo favorites – I may be a little more generous here. But all of these favorites are solid and worth a read.

NOTE: to read my full reviews of all of the Nebula Award-winning novels see my Goodreads books.

Samuel R. Delany: Babel-17 (1967)

I’m not sure that this falls into my top books of all time, but it was one of my favorites among the Nebula winners. It was pretty weird, which often is a negative for me, but somehow this one was weird in an interesting way and also pretty good for 1967. In particular, I liked the strong female main character, which was refreshing for a book from that decade. Like Babel, the 2023 winner, it was in the “power of words and language” genre, but it did not drag on for page after dull page as Babel did.

Daniel Keyes: Flowers for Algernon (1967)

It’s hard to place this book relative to the others since I haven’t read it since I first did in about 1974. But I think it made a big impression on me at the time and was a really solid story. Very different in style from the other 1967 winner that I just described.

Greg Bear: Darwin’s Radio (2001)

This was a really exciting and interesting book that I had trouble putting down. Not one of my all-time top books, and I found the biology a bit hard to swallow (as a biologist). But worth reading.

Elizabeth Moon: The Speed of Dark (2004)

This is not your typical sci-fi book: the focus was not really on the advanced technology, which was only tangential to the real story line: seeing our world from the eyes of someone with autism. I was a bit disappointed with the ending, but otherwise it was a really engaging and thought-provoking book.

Favorites that also won the Hugo

The following favorite Nebula winners were already discussed in my previous blog post on favorite Hugo winners, so I will just list them here and let you read about them in the other post.

Frank Herbert: Dune (1966)

Ursula K. Le Guin: Left Hand of Darkness (1970)

Frederik Pohl: Gateway (1978)

Orson Scott Card: Ender’s Game (1986), Speaker for the Dead (1987)

Connie Willis: Blackout/All Clear (2011), Doomsday Book (1993)

N. K. Jemisin: The Stone Sky (2018)

 

Lois McMaster Bujold: Falling Free (1996) 

I’m putting this in a special category. This book did not win the Hugo and is actually not one of my favorite Bujold books, but I am including it because of my general love of the other Vorkosigan Universe books, some of which did win the Hugo.

5 star book that didn’t make my favorite list:

NOTE: other Nebula 5 star books are not listed here because they already appeared on the Hugo 5 star list in my other post.

Vonda N. McIntyre: The Moon and the Sun (1998)

Perhaps I was generous to give this 5 stars, but I liked the story. A bit slow to start and too much detail about the French court, but I really liked the characters and how different “good” characters had different viewpoints on topics like sex and religion.

Worst Nebula books

No “best” list would be complete without a corresponding “worst” list. Here it is:

Samuel R. Delany: The Einstein Intersection (1968)

Very weird book that was probably trying to make some point about myths that was lost on me. Interesting that Delany makes both my best and worst list!

Robert Silverberg: A Time of Changes (1972)

Preachy and depressing premise, disgusting attitude towards women, poor writing.
 

Sunday, February 16, 2025

Favorite Hugo Award-winning Novels


 In May 2023, I completed a quest that was on my bucket list: reading all of the winners of the Hugo Award for best science fiction/fantasy novel. At that time, there were 71 books on the list (not counting “Retro-Hugo” winners). I’m not sure when I read my first one – the first one that I can unambiguously remember reading was Dune in about 1975 or 76. I read a number of the winners from the 70’s through 90’s soon after their publication, before having kids and going to grad school cut back on my pleasure reading time. Starting in 2021, I resolved to spend more time reading for fun and a number of the more recent winners were recommended to me by my daughter. This enticed me to take up the challenge of finishing all of them and Goodreads tells me that I read 31 of them in the 12 months preceding May 2023.

Having read them all, I am enjoying thinking back on them and pondering which were my favorites. I decided to write this post to list them.

Why a favorite?


There are several criteria that must have been met to make my “favorites” list. First and foremost, the book must be deeply engaging. To me, a great fiction book draws me into its world, and while I’m reading I’m transported to that world and barely aware that I am sitting in this world reading. Second, the story needs to be clever, creative, or explore a universe that has some really interesting and different twist. Third, the story can’t be ruined by being overtly sexist, dated, or transparently preaching about the author’s pet peeve. It is fine for the book to have a point, but that point needs to be made through the storytelling.

Another characteristic (but not a requirement) is that I found myself pondering and thinking about these stories for days or weeks after reading them, and years later thinking how I would like to re-read them.

List of my favorites

NOTE: to read my full reviews of all of the Hugo Award-winning novels, see my Goodreads books.

 

N. K. Jemisin: The Fifth Season (2016), The Obelisk Gate (2017), The Stone Sky (2018)


This trilogy was so different and interesting that I was quickly intrigued by it. The narrative style of The Fifth Season was also really cool. Some parts of The Stone Sky were a bit hard to believe, but the trilogy's overall story was very satisfying.

Vernor Vinge: A Fire Upon the Deep (1993), A Deepness in the Sky (2000)


I was not familiar with Vernor Vinge before I started reading the Hugo books, but I now really appreciate his creativity and storytelling. Both of these books have a compelling story arc, but also have fascinating and creative alien species whose interactions with humans form an integral part of the story. One interesting character overlaps in the two books.

Frank Herbert: Dune (1966)


It is a bit difficult for me to objectively compare Dune to my other favorite Hugo books, since it was probably the first “epic” sci-fi/fantasy book that I read. But at that time, I was blown away by the complex vision that Herbert created in the book. Queen’s A Night at the Opera had come out not long before I read Dune and I listened to “The Prophet’s Song” many times while reading. It has become indelibly associated with Dune in my mind. If you’ve read Dune, listen to “The Prophet’s Song” and see if you can tell why it made such a strong connection for me.

Connie Willis: Blackout/All Clear (2011), Doomsday Book (1993)


Although both of these books involve pretty depressing topics (WW II and the plague), the storytelling really immersed me in those time periods. The characters’ struggle to survive and return to their own time, overlaid with their efforts to recognize the humanity and dignity of the people of those times in the most trying circumstances, made for a compelling plot.

Orson Scott Card: Ender’s Game (1986), Speaker for the Dead (1987)


Although these books might be classified as young adult books, they had really interesting and surprising plots.

Lois McMaster Bujold: The Vor Game (1991), Barrayar (1992), Mirror Dance (1995)


I include these books not because they were my particular Bujold favorites, but rather because the entire Miles Vorkosigan series is so clever, funny, and entertaining. It is certainly one of my favorite book series, with The Warrior's Apprentice (not a nominee) as the very best.

C. J. Cherryh: Downbelow Station (1982), Cyteen (1989)


I include these Cherryh books for a similar reason as the Bujold books. They weren’t necessarily my favorite Cherryh books (that would probably be the Chanur books, which were nominated in 1983 but did not win). But C. J. Cherryh is overall one of my favorite sci-fi authors and her Alliance/Union universe is complex and fascinating.

Ursula K. Le Guin: The Left Hand of Darkness (1970)


It would be difficult to not include Le Guin somewhere on my list. The Left Hand of Darkness is certainly one of her best books, although probably the Lathe of Heaven (nominated in 1972 but did not win) is my favorite. Le Guin is perhaps unmatched for her ability to situate interesting plots in worlds and cultures that are thought-provoking.

Frederik Pohl: Gateway (1978)


I read Gateway many years ago, so I’m not sure how I would feel about it now. But at the time, the novelty of the story premise and narrative style really appealed to me.

J. K. Rowling: Harry Potter and the Goblet of Fire (2001)


This is actually my least favorite Harry Potter book. But the Harry Potter saga is one of my top fantasy series, so I included it on that basis.

Runners up:


Robert Sawyer: Hominids (2003)


This book was borderline and did not quite make the cut, but I was quite intrigued by the underlying concept of its world, and I really enjoyed the story and imagining how the world would be different if a different Homo species had come to dominate the earth.

Walter M. Miller Jr.: A Canticle for Leibowitz (1961)


I had first heard the NPR radio dramatization of this in the early 1980’s and was not overly impressed. I am also not a big fan of post-apocalyptic books. But when I read the book recently, I really enjoyed the storytelling and premise of the first two parts. It was far superior to most other sci-fi books I’ve read that were written in the 1950’s and early 1960’s. But it got booted from the favorites list because of the “no preachiness” criterion. The third part was just transparently an anti-euthanasia sermon, and that ruined the end of the book for me.

Vonda N. McIntyre: Dreamsnake (1979)


When I started reading this book, I was expecting to dislike it. As I said, I don’t really like post-apocalyptic stories that well, and the beginning of the book seemed pretty hokey to me. But as the story was built out, I really found myself enjoying it. As a post-apocalyptic novel, it was pretty unusual in emphasizing kindness as a basic human characteristic. That was really refreshing to me.

John Scalzi: Redshirts (2013)


This was a short and very funny parody of Star Trek. Surprisingly, it was actually built into a somewhat clever story. Definitely worth a read.

Other books I gave 5 star ratings to:

Ann Leckie: Ancillary Justice (2014)


Very interesting take on A.I.

Arkady Martine: A Memory Called Empire (2020), A Desolation Called Peace (2022)


Intriguing world told from the perspective of someone confused about another culture.

Neil Gaiman: The Graveyard Book (2009)


Did not think I would like it, but I did.

Mary Robinette Kowal: The Calculating Stars (2019)


Pretty interesting overall plot concept, but bordering on unbelievable.

Larry Niven: Ringworld (1971)


Clever world-building, but too sexist for my tastes now.

Vernor Vinge: Rainbows End (2007)


Enjoyable and interesting book, but not up to the level of his two books I put on my favorites.

Paolo Bacigalupi: The Windup Girl (2010)


Very interesting world, but a bit too violent and depressing for me to fully enjoy.

Robert Charles Wilson: Spin (2006)


Engaging and suspenseful, but not top tier.

Roger Zelazny: Lord of Light (1968)


Really interesting presentation: not sure what was real and what was mythological. Far superior and more creative than many of the books from the 1960’s. 

William Gibson: Neuromancer (1985)

Story line not amazing, but very prescient and the storytelling was vivid. The origin of "cyberspace" and cyberpunk. 

Philip K. Dick: The Man in the High Castle (1963)

One of the rare excellent winners from the early 1960's. An early entry in the alternate universe genre and well-written.

Worst Hugo Books:


No "best" list would be complete without a corresponding "worst" list. 

It seems pretty clear to me that the quality of science fiction and fantasy writing has generally improved over time. Anyone who thinks that the 1950’s and early 1960’s were some kind of golden age for science fiction clearly has not read these terrible books. One thing I cannot figure out is why Robert Heinlein is considered a great sci-fi author. The books of his that I have read ranged from mediocre to downright awful.

Robert A. Heinlein: Starship Troopers (1960)

This was just a deplorable book. One of the few I’ve given a one-star rating. Almost no plot and transparent right-wing ax-grinding.

Robert A. Heinlein: Stranger in a Strange Land (1962)

Despite this being a “famous book”, it was really terrible. Disgusting sexism (female character says “Nine times out of ten, if a girl gets raped, it’s partly her fault.”), characters droning on about Heinlein’s pet issues, etc.

Fritz Leiber: The Big Time (1958)

No real plot, stereotypical characters, dumb premise.

James Blish: A Case of Conscience (1959)

Lack of imagination about technology, shallow and stereotypical female characters, pages of pontification by characters with no plot development, …

Mark Clifton: They’d Rather Be Right (1955)

Almost impossible to obtain, and for good reason. An implausible story with annoying political overtones, masquerading as a science fiction story. Really, really dumb portrayal of A.I.



Sunday, August 6, 2023

Building an Omeka website on AWS

 

James H. Bassett, “Okapi,” Bassett Associates Archive, accessed August 5, 2023, https://bassettassociates.org/archive/items/show/337. Available under a CC BY 4.0 license.
 

Several years ago, I was given access to the digital files of Bassett Associates, a landscape architectural firm that operated for over 60 years in Lima, Ohio. This award-winning firm, which disbanded in 2017, was well known for its zoological design work and also did ground-breaking work in incorporating storm water retention into landscape site design. In addition to images of plans and site photographs, the files included scans of sketches done by the firm's founder, James H. Bassett, which are artwork in their own right. I had been deliberating about the best way to make these works publicly available and decided that this summer I would make it my project to set up an online digital archive featuring some of the images from the files.

Given my background as a Data Science and Data Curation Specialist at the Vanderbilt Libraries, it seemed like a good exercise to figure out how to set up Omeka Classic on Amazon Web Services (AWS), Vanderbilt's preferred cloud computing platform. Omeka is a free, open-source web platform that is popular in the library and digital humanities communities for creating online digital collections and exhibits, so it seemed like a good choice for me given that I would be funding this project on my own. 

Preliminaries

The hard drive I have contains about 70 000 files collected over several decades. So the first task was to sort through the directories to figure out exactly what was there. For some of the later projects, there were some born-digital files, but the majority of the images were either digitizations of paper plans and sketches, or scans of 35mm slides. In some cases, the same original work was present in several places on the drive at a variety of resolutions, so I needed to sort out where the highest quality files were located. Fortunately, some of the best works from signature projects had been digitized for an art exhibition, "James H. Bassett, Landscape Architect: A Retrospective Exhibition 1952-2001", that took place in Artspace/Lima in 2001. Most of the digitized files were high-resolution TIFFs, which were ideal for preservation use. I focused on building the online image collection by featuring projects that were highlighted in that exhibition, since they covered the breadth of the types of work done by the firm throughout its history.

The second major issue was to resolve the intellectual property status of the images. Some had previously been published in reports and brochures, and some had not. Some were from before the 1987 copyright law went into effect and some were after. Some could be attributed directly to James Bassett before the Bassett Associates corporation was formed and others could not be attributed to any particular individual. Fortunately, I was able to get permission from Mr. Bassett and the other two owners of the corporation when it disbanded to make the images freely available under a Creative Commons Attribution 4.0 International (CC BY 4.0) license. This basically eliminated complications around determining the copyright status of any particular work, and allows the images to be used by anyone as long as they provide the requested citation.

 

TIFF pyramid for a sketch of the African plains exhibit at the Kansas City Zoo. James H. Bassett, “African Plains,” Bassett Associates Archive, accessed August 6, 2023, https://bassettassociates.org/archive/items/show/415. Available under a CC BY 4.0 license.

Image pre-processing

For several years I have been investigating how to make use of the International Image Interoperability Framework (IIIF) to provide a richer image viewing experience. Based on previous work and experimentation with our Libraries' Cantaloupe IIIF server, I knew that large TIFF images needed to be converted to tiled pyramidal (multi-resolution) form to be effectively served. I also discovered that TIFFs using CMYK color mode did not display properly when served by Cantaloupe. So the first image processing step was to open TIFF or Photoshop format images in Photoshop, flatten any layers, convert to RGB color mode if necessary, reduce the image size to less than 35 MB (more on size limits later), and save the image in TIFF format. JPEG files were not modified -- I just used the highest resolution copy that I could find.

Because I wanted to make it easy to use the images with IIIF in the future, I used a Python script that I wrote to convert single-resolution TIFFs en masse to tiled pyramidal TIFFs via ImageMagick. These processed TIFFs or high-resolution JPEGs were the original files that I eventually uploaded to Omeka.
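The conversion itself just shells out to ImageMagick for each file. Here is a minimal sketch of that step (not my original script); it assumes ImageMagick is installed, and the directory names are placeholders.

```python
# Minimal sketch (not my original script) of batch conversion to tiled
# pyramidal TIFFs by shelling out to ImageMagick. Assumes ImageMagick 7
# ("magick"; use "convert" with ImageMagick 6). Directory names are placeholders.
import subprocess
from pathlib import Path

SOURCE_DIR = Path("flattened_tiffs")   # RGB, flattened TIFFs out of Photoshop
DEST_DIR = Path("pyramidal_tiffs")
DEST_DIR.mkdir(exist_ok=True)

for tiff_path in sorted(SOURCE_DIR.glob("*.tif")):
    out_path = DEST_DIR / tiff_path.name
    # The "ptif:" prefix tells ImageMagick to write a tiled pyramidal TIFF;
    # 256x256 tiles and JPEG compression keep the output size manageable.
    subprocess.run(
        ["magick", str(tiff_path),
         "-define", "tiff:tile-geometry=256x256",
         "-compress", "jpeg",
         f"ptif:{out_path}"],
        check=True,
    )
    print("converted", tiff_path.name)
```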

Why use AWS?

One of my primary reasons for using AWS as the hosting platform was the availability of S3 bucket storage. AWS S3 storage is very inexpensive, and by storing the images there rather than within the file storage attached to the cloud server, the image storage capacity can expand essentially indefinitely without requiring any changes to the configuration of the cloud server hosting the site. Fortunately, there is an Omeka plug-in that makes it easy to configure storage in S3.

Another advantage (not realized in this project) is that because image storage is outside the server in a public S3 bucket, the same image files can be used as source files for a Cantaloupe installation. Thus a single copy of an image in S3 can serve the purpose of provisioning Omeka, being the source file for IIIF image variants served by Cantaloupe, and having a citable, stable URL that allows the original raw image to be downloaded by anyone. 

I've also determined through experimentation that one can run a relatively low-traffic Omeka site on AWS using a single t2.micro tier Elastic Compute Cloud (EC2) server. This minimally provisioned server currently costs only US$ 0.0116 per hour (about $8 per month) and is "free tier eligible", meaning that new users can run Omeka on EC2 for free during the first year. Including the cost of the S3 storage, one could run an Omeka site on AWS with hundreds of images for under $10 per month.

The down side

The main problem with installing Omeka on AWS is that it is not a beginner-level project. I'm relatively well acquainted with AWS and the Unix command line, but it took me a couple of months, on and off, to figure out how to get all of the necessary pieces to work together. Unfortunately, there wasn't a single web page that laid out all of the steps, so I had to read a number of blog posts and articles, then do a lot of experimenting to get the whole thing to work. I did take detailed notes, including all of the necessary commands and configuration details, so it should be possible for someone with moderate command-line skills and a willingness to learn the basics of AWS to replicate what I did.

Installation summary

 
In the remainder of this post, I'll walk through the general steps required to install Omeka Classic on AWS and describe important considerations and things I learned in the process. In general, there are three major components to the installation: setting up the S3 storage, installing Omeka on EC2, and getting a custom domain name to work with the site using secure HTTP. Each of these major steps includes several sub-tasks that will be described below. 


S3 setup


The basic setup of an S3 bucket is very simple and involves only a few button clicks. However, because of the way Omeka operates, several additional steps are required for the bucket setup.
 
By design, AWS is secure and generally one wants to permit only the minimum required access to resources. But because Omeka exposes file URLs publicly so that people can download those files, the S3 bucket must be readable by anyone. Omeka also writes multiple image variant files to S3, and this requires generating access keys whose security must be carefully guarded. 
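As an illustration of that public-read requirement, a bucket policy granting anonymous read access can be attached with boto3 along the lines of the sketch below. This is not necessarily the exact policy I applied; the bucket name is a placeholder, and the bucket's "Block Public Access" settings must also be relaxed for a public policy to take effect.

```python
# Sketch only: make an S3 bucket publicly readable so Omeka file URLs work.
# The bucket name is a placeholder; the bucket's "Block Public Access"
# settings must also allow public bucket policies for this to take effect.
import json
import boto3

BUCKET = "my-omeka-storage-bucket"  # placeholder name

public_read_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PublicReadGetObject",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
        }
    ],
}

s3 = boto3.client("s3")
s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(public_read_policy))
```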

You can manually upload files and enter their metadata by typing into boxes in the Omeka graphical interface. That's fine if you will only have a few items. However, if you will be uploading many items, the graphical interface is very tedious and requires many button clicks. To create an efficient upload workflow, I used the Omeka CSV Import plugin. It requires loading the files via a URL during the import process, so I used a different public S3 bucket as the source of the raw images. I used a Python script to partially automate generating the metadata CSV, and as part of that script, I uploaded the images automatically to the source raw image bucket using the AWS Python library (boto3). This required creating access credentials for the raw image bucket, and to reduce security risks, I created a special AWS user that was only allowed to write to that one bucket.
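The upload portion of that script boiled down to something like the following sketch (not the full script); the bucket name, key prefix, and credentials profile are placeholders standing in for the ones I actually used.

```python
# Simplified sketch of the raw-image upload step (not the full script).
# Bucket name, key prefix, and profile name are placeholders; the profile
# corresponds to the limited IAM user that can only write to this bucket.
import boto3
from pathlib import Path

session = boto3.Session(profile_name="raw-image-uploader")  # placeholder profile
s3 = session.client("s3")

BUCKET = "example-raw-images"  # placeholder source bucket
LOCAL_DIR = Path("to_upload")
KEY_PREFIX = "glf/haw/"        # subfolder structure of your choice

for image_path in sorted(LOCAL_DIR.glob("*.tif")):
    key = KEY_PREFIX + image_path.name
    s3.upload_file(
        str(image_path), BUCKET, key,
        ExtraArgs={"ContentType": "image/tiff"},
    )
    # The CSV Import plugin will fetch the file from this public object URL.
    print(f"https://{BUCKET}.s3.amazonaws.com/{key}")
```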
 
The AWS free tier allows a new user up to 5 GB of S3 storage for free during the first year. That corresponds to roughly a hundred high-resolution (50 MB) TIFF images.

Omeka installation on EC2

 

As with the setup of the S3 buckets, launching an EC2 server instance just involves a few button clicks. What is trickier and somewhat tedious is performing the actual setup of Omeka within the server. Because the setup is happening at some mysterious location in the cloud, you can't point and click like you can on your local computer. To access the EC2 server, you essentially have to create a "tunnel" into it by connecting to it using SSH. Once you've done that, commands that you type into your terminal application are applied to the remote server rather than your local computer. Thus, everything you do must be done at the command line. This requires basic familiarity with Unix shell commands, and since you also need to edit some configuration files, you need to know how to use a terminal-based editor like Nano.

The steps involve:
- installing a LAMP (Linux, Apache, MySQL, and PHP) server bundle
- creating a MySQL database
- downloading and installing Omeka
- modifying Apache and Omeka configuration files
- downloading and enabling the Omeka S3 Storage Adapter and CSV Import plugins

Once you have completed these steps (which actually involve issuing something like 50 complicated Unix commands that fortunately can be copied and pasted from my instructions), you will have a functional Omeka installation on AWS. However, accessing it would require users to use a confusing and insecure URL like
http://54.243.224.52/archive/

Mapping an Elastic IP address to a custom domain and enabling secure HTTP

 
To change this icky URL to a "normal" one that's easy to type into a browser and that is secure, several additional steps are required. 

AWS provides a feature called an Elastic IP address that allows you to keep using the same IP address even if you change the underlying resource it refers to. Normally, if you had to spin up a new EC2 instance (for example to restore from a backup), it would be assigned a new IP address, requiring you to change any setting that referred to the IP address of the previous EC2 you were using. An Elastic IP address can be reassigned to any EC2 instance, so disruption caused by replacing the old EC2 with a new one can be avoided by just shifting the Elastic IP to the new instance. Elastic IPs are free as long as they remain associated with a running resource.
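Reassociating the Elastic IP with a replacement instance can be done with a couple of clicks in the console; for completeness, the equivalent operation in boto3 is a single call, sketched below with made-up resource IDs.

```python
# Sketch: re-point an existing Elastic IP at a replacement EC2 instance
# (e.g., one restored from a backup AMI). The IDs below are made-up placeholders.
import boto3

ec2 = boto3.client("ec2")
ec2.associate_address(
    AllocationId="eipalloc-0123456789abcdef0",  # the Elastic IP's allocation ID
    InstanceId="i-0123456789abcdef0",           # the new EC2 instance
    AllowReassociation=True,                    # move it even if currently attached
)
```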

It is relatively easy to assign a custom domain name to the Elastic IP if the AWS Route 53 domain registration is used. The cost of the custom domain varies depending on the specific domain name that you select. I was able to obtain `bassettassociates.org` for US$12 per year, adding $1 per month to the cost of running the website.

After the domain name has been associated with the Elastic IP address, the last step is to enable secure HTTP (HTTPS). When initially searching the web for instructions on how to do that, I found a number of complicated and potentially expensive suggestions, including installing an Nginx front-end server and using an AWS load balancer. Those options are overkill for a low-traffic Omeka site. In contrast, it is relatively easy to get a free security certificate from Let's Encrypt and set it up to renew automatically using Certbot for an Apache server.

After completing these steps, my Omeka instance can now be accessed at https://bassettassociates.org/archive/.

 

Optional additional steps

 
If you plan to have multiple users editing the Omeka site, you won't be able to add users beyond the default Super User without additional steps. It appears that it's not possible to add more users without enabling Omeka to send emails. This requires setting up AWS Simple Email Service (SES), then adding the SMTP credentials to the Omeka configuration file. SES is designed for sending mass emails, so getting production access requires submitting an application for approval. I didn't have any problems getting approved when I explained that I was only going to use it to send a few confirmation emails, although the process took at least a day since apparently a human has to examine the application.
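Before editing the Omeka configuration file, it can be worth confirming that the SES SMTP credentials actually work. Here is a minimal sketch of such a check; the endpoint region, addresses, and credentials are placeholders, and while SES is in sandbox mode both the sender and recipient must be verified identities.

```python
# Sketch: sanity-check AWS SES SMTP credentials before adding them to Omeka.
# Endpoint region, addresses, and credentials are placeholders; in SES sandbox
# mode both the sender and recipient must be verified identities.
import smtplib
from email.message import EmailMessage

SMTP_HOST = "email-smtp.us-east-1.amazonaws.com"  # match your SES region
SMTP_USER = "REPLACE-WITH-SES-SMTP-USERNAME"
SMTP_PASS = "REPLACE-WITH-SES-SMTP-PASSWORD"

msg = EmailMessage()
msg["Subject"] = "SES SMTP test"
msg["From"] = "admin@example.org"
msg["To"] = "me@example.org"
msg.set_content("If you received this, the SES SMTP credentials work.")

with smtplib.SMTP(SMTP_HOST, 587) as server:
    server.starttls()                  # SES requires TLS on port 587
    server.login(SMTP_USER, SMTP_PASS)
    server.send_message(msg)
```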

There are three additional plugins that I installed that you may consider using. The Exhibit Builder and Simple Pages plugins add the ability to create richer content. Installing them is trivial, so you will probably want to turn them on. I also installed the CSV Export Format plugin because I wanted to use it to capture identifier information as part of my partially automated workflow (see following sections for more details).

If you are interested in using IIIF on your site, you may also want to install the IIIF Toolkit plugin, explained in more detail later.

Efficient workflow

Once Omeka is installed and configured, it is possible to upload content manually using the Omeka graphical interface. As noted above, that is fine for a few objects, but for many objects it is very tedious and requires many button clicks.

The workflow described here is based on assembling the metadata in the most automated way possible, using file naming conventions, a Python script, and programmatically created CSV files. Python scripts are also used to upload the files to S3, and from there they can be automatically imported into Omeka.

After the items are imported, the CSV export plugin can be used to extract the ID numbers assigned to the items by Omeka. A Python script then extracts the IDs from the resulting CSV and inserts them into the original CSVs used to assemble the metadata.

For full details about the scripts and step-by-step instructions, see the detailed notes that accompany this post.

Notes about TIFF image files

 
If original image files are available as high-resolution TIFFs, that is probably the best format to archive from a preservation standpoint. However, most browsers will not display TIFFs natively, while JPEGs can be displayed onscreen. In Omeka, image thumbnails are linked directly to the original high-resolution image file, so when a user clicks on the thumbnail of a JPEG, the image is displayed in their browser, but when a TIFF thumbnail is clicked, the file downloads to the user's hard drive without being displayed. When an image is uploaded, Omeka makes several JPEG copies at lower resolution so that they can be displayed onscreen in the browser without downloading.
 
As explained in the preprocessing section above, the workflow includes an additional conversion step that only applies to TIFFs. 

Note about file sizes

 
In the file configuration settings, I recommend setting a maximum file size of 100 MB. Virtually no JPEGs are ever that big, but some large TIFF files may exceed that size. As a practical matter, the upper limit on file size in this installation is actually about 50 MB. I have found from practical experience that importing original TIFF files between 50 and 100 MB can generate errors that will cause the Omeka server to hang. I have not been able to isolate the actual source of the problem, but it may be related to the process of generating the lower resolution JPEG copies. The problem may be isolated to the CSV Import plugin, because some files that hung the server when using the CSV import were then able to be uploaded manually after creating the item record. In one instance, a JPEG that was only 11.4 MB repeatedly failed to upload using the CSV import. Apparently its large pixel dimensions (6144x4360) were the problem (it also was successfully uploaded manually).

The other thing to consider is that when TIFFs are converted to tiled pyramidal form, there is an increase in size of roughly 25% when the low-res layers are added to the original high-res layer. So a 40 MB raw TIFF may be at or over 50 MB after conversion. I have found that if I keep the original file size below 35 MB, the files usually load without problems. It is annoying to have to decrease the resolution of any source files in order to add them to the digital collection, but there is a workaround (described in the IIIF section below) for extremely large TIFF image files.
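Given those limits, it is worth flagging oversized source files before converting and importing them. A trivial sketch of that check (the directory name is a placeholder; the 35 MB threshold comes from the experience described above):

```python
# Trivial sketch: flag source TIFFs over ~35 MB, since the pyramidal version
# will be roughly 25% larger and files much above 50 MB tend to hang the import.
from pathlib import Path

MAX_SOURCE_MB = 35
for tiff_path in sorted(Path("flattened_tiffs").glob("*.tif")):  # placeholder dir
    size_mb = tiff_path.stat().st_size / (1024 * 1024)
    if size_mb > MAX_SOURCE_MB:
        print(f"REDUCE: {tiff_path.name} is {size_mb:.1f} MB")
```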
 

The CSV Import plugin

 
An efficient way to import multiple images is to use the CSV Import plugin. The plugin requires two things: a CSV spreadsheet containing file and item metadata, and files that are accessible directly via a URL. Because files on your local hard drive are not accessible via a URL, there are a number of workarounds that can be used, such as uploading the images to a cloud service like Google Drive or Dropbox. Since we are using AWS S3 storage, it makes sense to make the image files accessible from there, since files in a public S3 bucket can be accessed by a URL. (Example of a raw image available from an S3 bucket via a URL: https://bassettassociates.s3.amazonaws.com/glf/haw/glf_haw_pl_00.tif)

One could create the metadata CSV entirely by hand by typing and copying and pasting in a spreadsheet editor. However, in my case, because of the general inconsistency of file names on the source hard drive, I was renaming all of the image files anyway. So I established a file identifier coding system that, when used in file names, would both group similar files together in the directory listing and make it possible to automate populating some of the metadata fields in the CSV. The Python script that I wrote generated a metadata CSV with many of the columns already populated, including the image dimensions, which it extracted from the EXIF data in the image files. After generating a first draft of the CSV, I then manually added the date, title, and description fields, plus any tags I wanted in addition to the ones that the script generated automatically from the file names. (Example of completed CSV metadata file)
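The script itself is in the detailed notes, but the core idea looks roughly like the sketch below. It assumes the Pillow library is installed, and the column names, file-naming convention, and bucket URL are illustrative placeholders rather than my actual ones.

```python
# Rough sketch of the metadata CSV generation (not the original script).
# Assumes Pillow is installed; column names, file-name convention, and the
# bucket URL pattern are illustrative placeholders.
import csv
from pathlib import Path
from PIL import Image

BUCKET_URL = "https://example-raw-images.s3.amazonaws.com"  # placeholder bucket
SOURCE_DIR = Path("to_upload")
FIELDNAMES = ["identifier", "fileUrl", "extent", "tags", "title", "description", "date"]
rows = []

for image_path in sorted(SOURCE_DIR.glob("*.tif")):
    with Image.open(image_path) as img:
        width, height = img.size  # pixel dimensions for the metadata
    project_code = image_path.stem.split("_")[0]  # e.g. "glf" from "glf_haw_pl_00"
    rows.append({
        "identifier": image_path.stem,
        "fileUrl": f"{BUCKET_URL}/{project_code}/{image_path.name}",
        "extent": f"{width} x {height} pixels",
        "tags": project_code,
        # title, description, and date get filled in manually afterwards
        "title": "",
        "description": "",
        "date": "",
    })

with open("metadata.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
```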
 
The CSV Import plugin requires that all items imported as a batch be of the same general type. Since my workflow was built to handle images, that wasn't a problem -- all items were Still Images. As a practical matter, it was best to restrict all of the images in a batch to the same Omeka collection. If images intended for several collections were uploaded together in a batch, they would have had to be assigned to collections manually after upload.
 

Omeka identifiers

 
When Omeka ingests image files, it automatically assigns an opaque ID (e.g. 3244d9cdd5e9dce04e4e0522396ff779) to the image and generates JPEG versions of the original image at various sizes. These images are stored in the S3 bucket that you set up for Omeka storage. Since those images are publicly accessible by URL, you could provide access to them for other purposes. However, since the file names are based on the opaque identifiers and have no connection with the original file names, it would be difficult to know what the access URL would be. (Example: https://bassett-omeka-storage.s3.amazonaws.com/fullsize/3244d9cdd5e9dce04e4e0522396ff779.jpg)

Fortunately, there is a CSV Export Format plugin that can be used to discover the Omeka-assigned IDs along with the original identifiers assigned by the provider as part of the CSV metadata that was uploaded during the import process. In my workflow, I have added additional steps to do the CSV export, then run another Python script that pulls the Omeka identifiers from the CSV and archives them along with the original user-assigned identifier in an identifier CSV. At the end of processing each batch, I push the identifier and metadata CSV files to GitHub to archive the data used in the upload. 
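Conceptually, the merge step looks like the sketch below. The column names in the exported CSV are guesses for illustration, not necessarily what the CSV Export Format plugin actually emits, so they would need to be adjusted to match a real export.

```python
# Sketch of the ID-merge step (not the original script). The column names in
# the exported CSV are guesses; adjust them to match the actual export.
import csv

# Map original identifiers to the IDs Omeka assigned during import.
omeka_ids = {}
with open("omeka_export.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        omeka_ids[row["Dublin Core:Identifier"]] = row["itemId"]

# Append the Omeka ID to each row of the original metadata CSV.
with open("metadata.csv", newline="", encoding="utf-8") as f:
    items = list(csv.DictReader(f))
for item in items:
    item["omekaId"] = omeka_ids.get(item["identifier"], "")

# Archive the combined identifiers for later reference.
with open("identifiers.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(items[0].keys()))
    writer.writeheader()
    writer.writerows(items)
```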
 
In theory, the raw images that were uploaded to the source bucket for the CSV import could be deleted afterwards. However, S3 storage costs are so low that you will probably just want to leave them there. Since they have meaningful file names and a subfolder organization of your choice, they make a pretty nice cloud backup that is independent of the Omeka instance. After your archive project is complete, you could switch the raw image source bucket over to one of the cheaper, infrequent-access storage classes (like Glacier) that have even lower costs than standard S3 storage. Because both buckets are public, you can use them as a means of giving access to the original high-res files by simply giving the Object URL to the person wanting a copy of the file.

Backing up the data

 
There are two mechanisms for backing up your data periodically.

The most straightforward is to create an Amazon Machine Image (AMI) of the EC2 server. Not only will this save all of your data, but it will also archive the complete configuration of the server at the time the image is made. This is critical if you have any disasters while making major configuration changes and need to roll back the EC2 to an earlier (functional) state. It is quite easy to roll back to an AMI and re-assign the Elastic IP to the new EC2 instance. However, this rollback will have no impact on any files saved in S3 by Omeka after the time when the backup AMI was created. Those files won't hurt anything, but they will effectively be uselessly orphaned there.
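Creating the AMI is just a few clicks in the EC2 console, but for completeness, the equivalent boto3 call is sketched below (the instance ID is a placeholder):

```python
# Sketch: create a backup AMI of the Omeka EC2 instance. The instance ID is a
# placeholder; NoReboot avoids downtime at the cost of a non-quiesced snapshot.
import datetime
import boto3

ec2 = boto3.client("ec2")
timestamp = datetime.date.today().isoformat()
response = ec2.create_image(
    InstanceId="i-0123456789abcdef0",
    Name=f"omeka-backup-{timestamp}",
    Description="Periodic backup of the Omeka EC2 server",
    NoReboot=True,
)
print("AMI ID:", response["ImageId"])
```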

The CSV files pushed to GitHub after each CSV import (example) can also be used as a sort of backup. Any set of rows from the saved metadata CSV file can be used to re-upload those items onto any Omeka instance as long as the original files are still in the raw source image S3 bucket.  Of course, if you make manual edits to the metadata, the metadata in the CSV file would become stale.
 

Using IIIF tools in Omeka

 
There are two Omeka plugins that add International Image Interoperability Framework (IIIF) capabilities.

The UniversalViewer plugin allows Omeka to serve images like a IIIF image server and it generates IIIF manifests using the existing metadata. That makes it possible for the Universal Viewer player (included in the plugin) to display images in a rich manner that allows pan and zoom. This plugin was very appealing to me because if it functioned well, it would enable IIIF capabilities without needing to manage any other servers. I was able to install it and the embedded Universal Viewer did launch, but the images never loaded in the viewer. Despite spending a lot of time messing around with the settings, disabling S3 storage, and launching a larger EC2 image, I was never able to get it to work, even for a tiny JPEG file. I read a number of Omeka forum posts about troubleshooting, but eventually gave up.

If I had gotten it to work, there was one potential problem with the setup anyway. The t2.micro instance that I'm running has very low resource capacity (memory, number of CPUs, drive storage), which is OK as I've configured it because the server just has to run a relatively tiny MySQL database and serve static files from S3. But presumably this plugin would also have to generate the image variants that it's serving on the fly and that could max out the server quite easily. I'm disappointed that I couldn't get it to work, but I'm not confident that it's the right tool for a budget installation like this one.

I had more success with the IIIF Toolkit plugin. It also provides an embedded Universal Viewer that can be inserted various places in Omeka. The major downside is that you must have access to a separate IIIF server to actually provide the images used in the viewer. I was able to test it out by loading images into the Vanderbilt Libraries' Cantaloupe IIIF server and it worked pretty well. However, setting up your own Cantaloupe server on AWS does not appear to be a trivial task and because of the resources required for the IIIF server to run effectively, it would probably cost a lot more per month to operate than the Omeka site itself. (Vanderbilt's server is running on a cluster with a load balancer, 2 vCPU, and 4 GB memory. All of these increases over a basic single t2.micro instance would involve a significantly increased cost.) So in the absence of an available external IIIF server, this plugin probably would not be useful for an independent user with a small budget.

One nice feature that I was not able to try was pointing the external server to the `original` folder of the S3 storage bucket. That would be a really nice feature since it would not require loading the images separately into dedicated storage for the IIIF server separate from what is already being provisioned for Omeka. Unfortunately, we have not yet got that working on the Libraries' Cantaloupe server as it seems to require some custom Ruby coding to implement.

Once the IIIF Toolkit is installed, there are two ways to include IIIF content into Omeka pages. If the Exhibit Builder plugin is enabled, the IIIF Toolkit adds a new kind of content block, "Manifest". Entering an IIIF manifest URL simply displays the contents of that manifest in an embedded Universal Viewer widget on the exhibit page without actually copying any images or metadata into the Omeka database.


The second way to include IIIF content is to make use of an alternate import method that becomes available after the IIIF Toolkit is installed. There are three import types that can be used to import items. I explored importing `Manifest` and `Canvas` types since I had those types of structured data available.

Manifest is the most straightforward because it only requires a manifest URL (commonly available from many sources). But the import was messy and always created a new collection for each item imported. In theory, this could be avoided by selecting an existing collection using the `Parent` dropdown, but that feature never worked for me.

I concluded that importing canvases was the only feasible method. Unfortunately, canvas JSON usually doesn't exist in isolation -- it usually is part of the JSON for an entire manifest. The `From Paste` option is useful if you are capable of the tedious task of searching through the JSON of a whole manifest and copying just the JSON for a single canvas from it. I found it much more useful to create a Python script to generate minimal canvas JSON for an image and save it as a file, which can either be uploaded directly or pushed to the web and read in through a URL. The script gets the pixel dimensions from the image file, with labels and descriptions taken from a CSV file (the IIIF import does not use more information than that). These values are inserted into a JSON canvas template, then saved as a file. The script will loop through an entire directory of files, so it's relatively easy to make canvases for a number of images that were already uploaded using the CSV import function (just copy and paste labels and descriptions from the metadata CSV file). Once the canvases have been generated, either upload them or paste their URLs (if they were pushed to the web) on the IIIF Toolkit Import Items page.
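To give a concrete picture of what such a script produces, here is a rough sketch for a single image, patterned on the IIIF Presentation API 2.1 canvas structure. It assumes Pillow is installed to read the pixel dimensions; the Cantaloupe base URL, image identifier, and label are placeholders, and the real script loops over a directory and pulls labels and descriptions from a CSV.

```python
# Rough sketch: generate a minimal IIIF Presentation 2.1 canvas for one image
# (not the original script). The IIIF base URL, image identifier, label, and
# file paths are placeholders.
import json
from PIL import Image

IIIF_BASE = "https://iiif.example.org/iiif/2"   # placeholder Cantaloupe endpoint
image_id = "example_plan_00.tif"                # identifier on the IIIF server
local_file = "pyramidal_tiffs/example_plan_00.tif"

with Image.open(local_file) as img:
    width, height = img.size

service_id = f"{IIIF_BASE}/{image_id}"
canvas_id = f"{service_id}/canvas"
canvas = {
    "@id": canvas_id,
    "@type": "sc:Canvas",
    "label": "Example master plan",             # taken from the CSV in practice
    "height": height,
    "width": width,
    "images": [{
        "@type": "oa:Annotation",
        "motivation": "sc:painting",
        "on": canvas_id,
        "resource": {
            "@id": f"{service_id}/full/full/0/default.jpg",
            "@type": "dctypes:Image",
            "format": "image/jpeg",
            "height": height,
            "width": width,
            "service": {
                "@context": "http://iiif.io/api/image/2/context.json",
                "@id": service_id,
                "profile": "http://iiif.io/api/image/2/level2.json",
            },
        },
    }],
}

with open("example_plan_00_canvas.json", "w", encoding="utf-8") as f:
    json.dump(canvas, f, indent=2)
```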
 
The result of the import is an item similar to those created by direct upload or CSV import -- JPEG size variants are generated and stored, and the small amount of metadata present in the canvas is assigned to the title and description metadata fields for the item. The main difference is that the import includes the canvas JSON as part of an Omeka-generated IIIF manifest that can be displayed in an embedded Universal Viewer, either as part of an exhibit or on a Simple Pages web page. The viewer also shows up at the bottom of the item page.

Because there is no way to import IIIF items as a batch, nor to import metadata from the canvas beyond the title and description, each item needs to be imported one at a time and the metadata added manually, or added using the Bulk Metadata Editor plugin if possible. This makes uploading many items somewhat impractical. However, for very large images whose detail cannot be seen well in a single image on a screen, the ability to pan and zoom is pretty important. So for some items, like large maps, this tool can be very nice despite the extra work. For a good example, see the panels page from the Omeka exhibit I made for the 2001 Artspace/Lima exhibition. It is best viewed by changing the embedded viewer to full screen.
 
Entire master plan image. Bassett Associates. “Binder Park Zoo Master Plan (IIIF),” Bassett Associates Archive, accessed August 6, 2023, https://bassettassociates.org/archive/items/show/418. Available under a CC BY 4.0 license.
 
Maximum zoom level using embedded IIIF Universal Viewer

One thing that should be noted is that, like other images associated with Omeka items, an image imported using the IIIF Toolkit gets generated size versions. An IIIF import also generates an "original" JPEG version that is much smaller than the pyramidal tiled TIFF uploaded to the IIIF server. This means that it is possible to create items for TIFF images that are larger than the 50 MB recommended above. An example is the Binder Park Master Plan. If you scroll to the bottom of its page and zoom in (above), you will see that an incredible amount of detail is visible, because the original TIFF file being used by the IIIF server is huge (347 MB). So IIIF import is a way to display and make available very large image files that exceed the practical limit of 50 MB discussed above.
 

Conclusions

 
Although it took me a long time to figure out how to get all of the pieces to work together, I'm quite satisfied with the Omeka setup I now have running on AWS. I've been uploading works, and as of this writing (2023-08-06) I've uploaded 400 items into 36 collections. I also created an Omeka exhibit for the 2001 exhibition that includes an "IIIF Items" block (which allows arrowing through all of the exhibition panels with pan and zoom), a "works" block (displaying thumbnails of the artworks shown in the exhibition), and a "carousel" block (cycling through photographs of the exhibition). I still need to do more work on the landing page and on styling the theme, but for now I have an adequate mechanism for exposing some of the images in the collection on a robust hosting system for a total cost of around $10 per month.


Wednesday, April 12, 2023

Structured Data in Commons and wikibase software tools

VanderBot workflow to Commons
 

In my last blog post, I described a tool (CommonsTool) that I created for uploading art images to Wikimedia Commons. One of the features of that Python script was to create Structured Data on Commons (SDoC) statements about the artwork being uploaded, such as "depicts" (P180), "main subject" (P921), and "digital representation of" (P6243), which are necessary to "magically" populate the Commons page with extensive metadata about the artwork from Wikidata. The script also added "created" (P170) and "inception" (P571) statements, which are important for providing the required attribution when the work is under copyright.

Structured Data on Commons "depicts" statements

These properties serve important roles, but one of the key purposes of SDoC is to make it possible for potential users of a media item to find it by providing richer metadata about what is depicted in the media. SDoC depicts statements go into the data that is indexed by the Commons search engine, which is otherwise primarily dependent on words present in the filename. My CommonsTool script does write one "depicts" statement (that the image depicts the artwork itself), and that's important for the semantics of understanding what the media item represents. However, from the standpoint of searching, that single depicts statement doesn't add much to improve discovery, since the artwork title in Wikidata is probably similar to the filename of the media item -- neither of which necessarily describes what is depicted IN the artwork.

Of course, one can add depicts statements manually, and there are also some tools that can help with the process. But if you aspire to add multiple depicts statements to hundreds or thousands of media items, this could be very tedious and time-consuming. If we are clever, we can take advantage of the fact that Structured Data on Commons is actually just another instance of a wikibase. So generally, any tools that make it easier to work with a wikibase can also make it easier to work with Wikimedia Commons.

In February, I gave a presentation about using VanderBot (a tool that I wrote to write data to Wikidata) to write to any wikibase. As part of that presentation, I put together some information about how to use VanderBot to write statements to SDoC using the Commons API, and how to use the Wikimedia Commons Query Service (WCQS) to acquire data programmatically via Python. In this blog post, I will highlight some of the key points about interacting with Commons as a wikibase and link out to the details required to actually do the interacting.

Media file identifiers (M IDs)

Each Wikimedia Commons media file is assigned a unique identifier analogous to the Q IDs used for Wikidata items. These identifiers are known as "M IDs", and they are required in order to interact with the Commons API or the Wikimedia Commons Query Service programmatically, as I will describe below.

It is not particularly straightforward to find the M ID for a media file. The easiest way is probably to find the Concept URI link in the left menu of a Commons page, right-click on the link to copy it, and then paste it somewhere. The M ID is the last part of that link. Here's an example: https://commons.wikimedia.org/entity/M113161207 . If the M ID for a media file is known, you can load its page using a URL of this form.

If you are automating the upload process as I described in my last post, CommonsTool records the M ID when it uploads the file. I also have a Python function that can be used to get the M ID from the Commons API using the media filename. 
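That function is linked from the post, but the underlying idea is simple enough to sketch here (a hedged approximation, not my exact code): the standard MediaWiki API returns the page ID of the File page, and the M ID is just that number prefixed with "M".

# Hedged sketch: look up the M ID for a Commons media file from its filename
# using the standard MediaWiki API. The M ID is 'M' plus the File page's page ID.
import requests

def get_commons_mid(filename, user_agent='TestAgent/0.1 (mailto:username@email.com)'):
    """Return the M ID (e.g. 'M113161207') for a Commons file like 'Example.jpg'."""
    params = {
        'action': 'query',
        'titles': 'File:' + filename,
        'format': 'json'
    }
    response = requests.get('https://commons.wikimedia.org/w/api.php',
                            params=params, headers={'User-Agent': user_agent})
    pages = response.json()['query']['pages']
    page_id = list(pages.keys())[0]  # '-1' if the file does not exist
    if page_id == '-1':
        return None
    return 'M' + page_id

# example (hypothetical filename):
# print(get_commons_mid('Example.jpg'))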

Properties and values in Structured Data on Commons come from Wikidata

Structured Data on Commons does not maintain its own system of properties. It exclusively uses properties from Wikidata, identified by P IDs. Similarly, the values of SDoC statements are nearly always Wikidata items identified by Q IDs (with dates being an exception). So one could generally represent a SDoC statement (subject property value) like this:

MID PID QID. 

Captions

Captions are a feature of Commons that allows multilingual captions to be applied to media items. They show up under the "File information" tab.

 Although captions can be added or edited using the graphical interface, under the hood the captions are the multilingual labels for the media items in the Commons wikibase. So they can be added or edited as wikibase labels via the Commons API using any tool that can edit wikibases.
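For example, here is a minimal sketch (not code from any of my tools) of setting a caption with the standard Wikibase wbsetlabel action. It assumes you already have a requests session that is authenticated to Commons (for example, with a bot password or OAuth):

# Hedged sketch: set an English caption (wikibase label) on a Commons media item
# using the standard Wikibase wbsetlabel API action. 'session' is assumed to be a
# requests.Session that is already logged in to Commons.
API_URL = 'https://commons.wikimedia.org/w/api.php'

def set_caption(session, m_id, caption, language='en'):
    # get a CSRF (edit) token for the authenticated session
    token_response = session.get(API_URL, params={
        'action': 'query', 'meta': 'tokens', 'format': 'json'})
    csrf_token = token_response.json()['query']['tokens']['csrftoken']

    # write the label; on Commons this shows up as the caption for the media item
    edit_response = session.post(API_URL, data={
        'action': 'wbsetlabel',
        'id': m_id,           # e.g. 'M113161207'
        'language': language,
        'value': caption,
        'token': csrf_token,
        'format': 'json'
    })
    return edit_response.json()

# usage (hypothetical; the session must already be authenticated):
# set_caption(session, 'M113161207', 'An example caption')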

Writing statements to the Commons API with VanderBot

VanderBot uses tabular data (spreadsheets) as a data source when it creates statements in a wikibase. One key piece of required information is the Q ID of the subject item that the statements are about, and that is generally the first column in the table. When writing to Commons, the subject M ID is substituted for a Q ID in the table.

Statement values for a particular property are placed in one column in the table. Since all of the values in a column are assumed to be for the same property, the P ID doesn't need to be specified as data in the row. VanderBot just needs to know what P ID is associated with that column, and that mapping of column to property is made separately. So at a minimum, to write a single kind of statement to Commons (like depicts), VanderBot needs only two columns of data: one for the M ID and one for the Q ID of the value of the property.

 Here is an example of a table with depicts data to be uploaded to Commons by VanderBot:
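(The original post shows a screenshot of the spreadsheet at this point. The CSV below is a hedged reconstruction of what such a table might look like, using the M ID and depicts Q IDs that appear in the query results later in this post; the caption text, labels, and column order are only illustrative placeholders.)

qid,label_en,depicts,depicts_label,depicts_uuid
M113161207,(caption for the media item),Q302,(label of Q302),
M113161207,(caption for the media item),Q345,(label of Q345),
M113161207,(caption for the media item),Q40662,(label of Q40662),
M113161207,(caption for the media item),Q235849,(label of Q235849),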

The qid column contains the subject M IDs (here, the same media file in every row). The depicts column contains the Q IDs of the values (the things that are depicted in the media item). The other three columns serve the following purposes:

- depicts_label is ignored by the script. It's just a place to put the label of the otherwise opaque Q ID for the depicted item so that a human looking at the spreadsheet has some idea about what's being depicted.

- label_en is the language-tagged caption/wikibase label. VanderBot has an option to either overwrite the existing label in the wikibase with the value in the table, or ignore the label column and leave the existing label unchanged. In this example, we are not concerning ourselves with editing the captions, so we will use the "ignore" option. But if one wanted to add or update captions, VanderBot could be used for that.

- depicts_uuid stores the unique statement identifier after the statement is created. It is empty for statements that have not yet been uploaded.

I mentioned before that the connection between a property and the column that contains its values is made separately. This mapping is done in a YAML file that describes the columns in the table:

The details of this file structure are given elsewhere, but a few key points are apparent. The depicts_label column is designated to be ignored. In the properties list, the header of a column is given as the value of the variable key (depicts in this example). That column has item as its value type and P180 as its property.

As part of the VanderBot workflow, this mapping file is converted into a JSON metadata description file, and that file, along with the CSV, is all that VanderBot needs to create the SDoC depicts statements.

If you have used VanderBot to create new items in Wikidata, uploading to Commons is more restrictive than what you are used to. When writing to Wikidata, if the Q ID column for a row in the CSV is empty, VanderBot creates a new item; if it is not empty, VanderBot edits an existing item. Creating new items directly via the API is not possible in Commons, because new items in the Commons wikibase are only created as a result of media uploads. So when VanderBot interacts with the Commons API, the qid column must contain an existing M ID.

After writing the SDoC statements, they will show up under the "Structured data" tab for the media item, like this:

Notice that the Q IDs for the depicts values have been replaced by their labels.

This is a very abbreviated overview of the process, and it is intended to make the point that once you have the system set up, all you need to write a large number of SDoC depicts statements is a spreadsheet with a column for the M IDs of the media items and a column for the Q IDs of what is depicted in each one. More details, with links out to instructions for using VanderBot to write to Commons, are on a webpage that I made for the Wikibase Working Hour presentation.

Acquiring Structured Data on Commons from the Wikimedia Commons Query Service

A lot of people know about the Wikidata Query Service (WQS), which can be used to query Wikidata using SPARQL. Fewer people know about the Wikimedia Commons Query Service (WCQS) because it's newer and interests a narrower audience. You can access the WCQS at https://commons-query.wikimedia.org/ . It is still under development and is a bit fragile, so it is sometimes down or undergoing maintenance. 

If you are working with SDoC, the WCQS is a very effective way to retrieve information about the current state of the structured data. For example, a very simple query can discover all media items that depict a particular item, as shown in the example below. There are quite a few example queries that you can run to get a feel for how the WCQS might be used.
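As a hedged illustration (the Q ID and variable name here are arbitrary examples, not taken from the post), a query that finds every media item whose structured data says it depicts a particular Wikidata item can be written as a Python string in the same style used later in this post, or pasted (without the quoting) into the WCQS web interface:

# Hedged example: find all Commons media items whose structured data says they depict Q302
depicts_query = '''PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT DISTINCT ?file WHERE {
  ?file wdt:P180 wd:Q302.
  }'''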

It is actually quite easy to query the Wikidata Query Service programmatically, but there are additional challenges to using the WCQS because it requires authentication. I struggled through the developer instructions for accessing the WCQS endpoint via Python, and the result is functions and example code that you can use to query the WCQS in your own Python scripts. One important warning: the authentication is done by setting a cookie on your computer. You must be careful not to save this cookie in any location that will be exposed, such as a GitHub repository, because anyone who gets a copy of it can act as you until the cookie is revoked. To avoid this, the script saves the cookie in your home directory by default.

The code for querying is very simple with the functions I provide:

import json  # used below to pretty-print the query results

# init_session, retrieve_cookie_string, and Sparqler come from the functions and example code I provide
user_agent = 'TestAgent/0.1 (mailto:username@email.com)' # put your own script name and email address here
endpoint_url = 'https://commons-query.wikimedia.org/sparql'
session = init_session(endpoint_url, retrieve_cookie_string())
wcqs = Sparqler(useragent=user_agent, endpoint=endpoint_url, session=session)

query_string = '''PREFIX sdc: <https://commons.wikimedia.org/entity/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT DISTINCT ?depicts WHERE {
  sdc:M113161207 wdt:P180 ?depicts.
  }'''

data = wcqs.query(query_string)
print(json.dumps(data, indent=2))

The query is set in the multi-line string assigned in the line that begins query_string =. One thing to notice is that in WCQS queries, you must define the prefixes wdt: and wd: using PREFIX statements in the query prologue. Those prefixes can be used in WQS queries without making PREFIX statements. In addition, you must define the Commons-specific sdc: prefix and use it with M IDs. 

This particular query simply retrieves all of the depicts statements that we created in the example above for M113161207. The resulting JSON is:

[
  {
    "depicts": {
      "type": "uri",
      "value": "http://www.wikidata.org/entity/Q103304813"
    }
  },
  {
    "depicts": {
      "type": "uri",
      "value": "http://www.wikidata.org/entity/Q302"
    }
  },
  {
    "depicts": {
      "type": "uri",
      "value": "http://www.wikidata.org/entity/Q345"
    }
  },
  {
    "depicts": {
      "type": "uri",
      "value": "http://www.wikidata.org/entity/Q40662"
    }
  },
  {
    "depicts": {
      "type": "uri",
      "value": "http://www.wikidata.org/entity/Q235849"
    }
  }
]

The Q IDs can easily be extracted from these results using a list comprehension:

 qids = [ item['depicts']['value'].split('/')[-1] for item in data ]

resulting in this list:

['Q103304813', 'Q302', 'Q345', 'Q40662', 'Q235849']
 

Comparison with the example table shows the same four Q IDs that we wrote to the API, plus the depicts value for the artwork (Q103304813) that was created by CommonsTool when the media file was uploaded. When adding new depicts statements, having this information about the ones that already exist can be critical to avoid creating duplicate statements.
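For example, here is a minimal sketch of such a duplicate check, assuming the candidate Q IDs are already in a Python list (the proposed values below are hypothetical):

# Hedged sketch: filter out proposed depicts values that already exist on the media item
existing_depicts = set(qids)  # the list extracted from the WCQS results above
proposed_depicts = ['Q302', 'Q1234567']  # hypothetical candidates, e.g. from object detection
new_depicts = [q for q in proposed_depicts if q not in existing_depicts]
print(new_depicts)  # only 'Q1234567' remains, since Q302 is already a depicts value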

For more details about how the code works, see the informational web page I made for the Wikibase Working Hour presentation.

Conclusion

I hope that this code will help make it possible to ramp up the rate at which we can add depicts statements to Wikimedia Commons media files. In the Vanderbilt Libraries, we are currently experimenting with using Google Cloud Vision to do object detection, and we would like to combine that with analysis of artwork titles to partially automate the process of describing what is depicted in the Vanderbilt Fine Arts Gallery works whose images have been uploaded to Commons. I plan to report on that work in a future post.