<h1 style="text-align: left;">Building an Omeka website on AWS</h1><p><i>Steve Baskauf, 2023-08-06</i></p><p> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://bassett-omeka-storage.s3.amazonaws.com/fullsize/980bdb54a75ae55b051cbc7c679ec373.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="550" data-original-width="800" height="440" src="https://bassett-omeka-storage.s3.amazonaws.com/fullsize/980bdb54a75ae55b051cbc7c679ec373.jpg" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">James H. Bassett, “Okapi,” <em>Bassett Associates Archive</em>, accessed August 5, 2023, <span class="citation-url"><a href="https://bassettassociates.org/archive/items/show/337">https://bassettassociates.org/archive/items/show/337</a></span>. Available under a <a href="https://creativecommons.org/licenses/by/4.0/" target="_blank">CC BY 4.0 license</a>.<br /></td><td class="tr-caption" style="text-align: center;"> </td></tr></tbody></table></p><p>Several years ago, I was given access to the digital files of Bassett Associates, a landscape architectural firm that operated for over 60 years in Lima, Ohio. This award-winning firm, which disbanded in 2017, was well known for its zoological design work and also did ground-breaking work in incorporating storm water retention as part of landscape site design. In addition to images of plans and site photographs, the files also included scans of sketches done by the firm's founder, James H. Bassett, which were artworks in their own right. I had been deliberating about the best way to make these works publicly available and decided that this summer I would make it my project to set up an online digital archive featuring some of the images from the files.</p><p>Given my background as a Data Science and Data Curation Specialist at the <a href="https://www.library.vanderbilt.edu/" target="_blank">Vanderbilt Libraries</a>, it seemed like a good exercise to figure out how to set up <a href="https://omeka.org/classic/" target="_blank">Omeka Classic</a> on <a href="https://aws.amazon.com/" target="_blank">Amazon Web Services (AWS)</a>, Vanderbilt's preferred cloud computing platform. Omeka is a free, open-source web platform that is popular in the library and digital humanities communities for creating online digital collections and exhibits, so it seemed like a good choice for me given that I would be funding this project on my own. </p><h3 style="text-align: left;">Preliminaries</h3><p>The hard drive I have contains about 70,000 files collected over several decades. So the first task was to sort through the directories to figure out exactly what was there. For some of the later projects, there were some born-digital files, but the majority of the images were either digitizations of paper plans and sketches, or scans of 35mm slides. In some cases, the same original work was present in several places on the drive at a variety of resolutions, so I needed to sort out where the highest quality files were located.
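</p><p>Sorting a collection like this is much easier with a quick inventory of what is where. As a rough illustration (this is not the exact process I followed, and the drive path is a placeholder), a few lines of Python with the Pillow library can record the pixel dimensions and file size of every TIFF and JPEG so that the largest copy of each work is easy to spot:</p><pre><code>import os
import csv
from PIL import Image  # Pillow

drive_path = '/Volumes/bassett_drive'  # placeholder for the mounted drive

# Walk the whole directory tree and record the pixel dimensions and size
# of every TIFF and JPEG so the highest-quality copies can be identified.
with open('image_inventory.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(['path', 'width', 'height', 'megabytes'])
    for root, dirs, files in os.walk(drive_path):
        for name in files:
            if name.lower().endswith(('.tif', '.tiff', '.jpg', '.jpeg')):
                path = os.path.join(root, name)
                try:
                    with Image.open(path) as img:
                        width, height = img.size
                except OSError:
                    continue  # skip unreadable or corrupt files
                megabytes = round(os.path.getsize(path) / 1048576, 1)
                writer.writerow([path, width, height, megabytes])
</code></pre><p>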
Fortunately, some of the best works from signature projects had been digitized for an <a href="https://bassettassociates.org/archive/exhibits/show/artspace/photos" target="_blank">art exhibition, "James H. Bassett, Landscape Architect: A Retrospective Exhibition 1952-2001"</a> that took place in <a href="https://www.artspacelima.com/" target="_blank">Artspace/Lima</a> in 2001. Most of the digitized files were high-resolution TIFFs, which were ideal for preservation use. I focused on building the online image collection by featuring projects that were highlighted in that exhibition, since they covered the breadth of types of work done by the firm throughout its history.</p><p>The second major issue was to resolve the intellectual property status of the images. Some had previously been published in reports and brochures, and some had not. Some were from before the 1987 copyright law went into effect and some were after. Some could be attributed directly to James Bassett before the Bassett Associates corporation was formed and others could not be attributed to any particular individual. Fortunately, I was able to get permission from Mr. Bassett and the other two owners of the corporation when it disbanded to make the images freely available under a <a href="https://creativecommons.org/licenses/by/4.0/" target="_blank">Creative Commons Attribution 4.0 International (CC BY 4.0) license</a>. This basically eliminated complications around determining the copyright status of any particular work, and allows the images to be used by anyone as long as they provide the requested citation.</p><p> <table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfqXAQJnJx-Ud3b0YS1PkMMs28G_GEEB0a6UIyOgvamRSIhsQM5Np4cbZ6FdHfk7CuC0NgyzB7YgiY4815jOK_CEXAgo3a9HhWkmE_USeECgEwR-GV_jgzKxy1OP7I1z4qj7L-8e4fJsNDl4wyFoRF87kJ8Ff8o42EHhU1ac--xtGqxfuU6DM3jD3FHDE/s1237/african_plains.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="992" data-original-width="1237" height="514" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhfqXAQJnJx-Ud3b0YS1PkMMs28G_GEEB0a6UIyOgvamRSIhsQM5Np4cbZ6FdHfk7CuC0NgyzB7YgiY4815jOK_CEXAgo3a9HhWkmE_USeECgEwR-GV_jgzKxy1OP7I1z4qj7L-8e4fJsNDl4wyFoRF87kJ8Ff8o42EHhU1ac--xtGqxfuU6DM3jD3FHDE/w640-h514/african_plains.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">TIFF pyramid for a sketch of the African plains exhibit at the Kansas City Zoo. James H. Bassett, “African Plains,” <em>Bassett Associates Archive</em>, accessed August 6, 2023, <span class="citation-url"><a href="https://bassettassociates.org/archive/items/show/415">https://bassettassociates.org/archive/items/show/415</a></span>. Available under a <a href="https://creativecommons.org/licenses/by/4.0/" target="_blank">CC BY 4.0 license</a>. </td></tr></tbody></table> <br /></p><p><b>Image pre-processing</b></p><p>For several years <a href="https://baskauf.blogspot.com/2022/09/commonstool-script-for-uploading-art.html" target="_blank">I have been investigating </a>how to make use of the <a href="https://iiif.io/" target="_blank">International Image Interoperability Framework (IIIF)</a> to provide a richer image viewing experience.
Based on previous work and experimentation with our Libraries' Cantaloupe IIIF server, I knew that large TIFF images needed to be converted to <a href="https://cantaloupe-project.github.io/manual/3.3/images.html#Multi-Resolution" target="_blank">tiled pyramidal (multi-resolution) form</a> to be effectively served. I also discovered that TIFFs using CMYK color mode did not display properly when served by Cantaloupe. So the first image processing step was to open TIFF or Photoshop format images in Photoshop, flatten any layers, convert to RGB color mode if necessary, reduce the image size to less than 35 MB (more on size limits later), and save the image in TIFF format. JPEG files were not modified -- I just used the highest resolution copy that I could find.</p><p>Because I wanted to make it easy in the future to use the images with IIIF, I used <a href="https://github.com/baskaufs/bassettassociates/blob/main/code/convert_to_pyramidal_tiled_tiff.py" target="_blank">a Python script that I wrote</a> to convert single-resolution TIFFs en masse to tiled pyramidal TIFFs via <a href="https://imagemagick.org/" target="_blank">ImageMagick</a>. These processed TIFFs or high-resolution JPEGs were the original files that I eventually uploaded to Omeka.</p><p><b>Why use AWS?</b></p><p>One of my primary reasons for using AWS as the hosting platform was the availability of <a href="https://aws.amazon.com/s3/" target="_blank">S3 bucket storage</a>. AWS S3 storage is very inexpensive and by storing the images there rather than within the file storage attached to the cloud server, the image storage capacity could basically expand indefinitely without requiring any changes to the configuration of the cloud server hosting the site. Fortunately, there is <a href="https://github.com/EHRI/omeka-amazon-s3-storage-adapter" target="_blank">an Omeka plug-in that makes it easy to configure storage in S3</a>. </p><p>Another advantage (not realized in this project) is that because image storage is outside the server in a public S3 bucket, the same image files can be used as <a href="https://cantaloupe-project.github.io/manual/5.0/sources.html" target="_blank">source files for a Cantaloupe installation</a>. Thus a single copy of an image in S3 can serve the purpose of provisioning Omeka, being the source file for IIIF image variants served by Cantaloupe, and having a citable, stable URL that allows the original raw image to be downloaded by anyone. </p><p>I've also determined through experimentation that one can run a relatively low-traffic Omeka site on AWS using a single t2.micro tier <a href="https://aws.amazon.com/ec2/" target="_blank">Elastic Compute Cloud (EC2)</a> server. This minimally provisioned server currently costs only US$ 0.0116 per hour (about $8 per month) and is "<a href="https://aws.amazon.com/free/" target="_blank">free tier eligible</a>", meaning that new users could run Omeka on EC2 for free during the first year. Including the cost of the S3 storage, one could run an Omeka site on AWS with hundreds of images for under $10 per month. </p><h3 style="text-align: left;">The down side</h3><p>The main problem with installing Omeka on AWS is that it is not a beginner-level project. I'm relatively well-acquainted with AWS and the Unix command line, but it took me a couple of months on and off to figure out how to get all of the necessary pieces to work together.
Unfortunately, there wasn't a single web page that laid out all of the steps, so I had to read a number of blog posts and articles, then do a lot of experimenting to get the whole thing to work. I did take <a href="https://heardlibrary.github.io/digital-scholarship/pubs/omeka/" target="_blank">detailed notes, including all of the necessary commands and configuration details</a>, so it should be possible for someone with moderate command-line skills and a willingness to learn the basics of AWS to replicate what I did. </p><h2 style="text-align: left;">Installation summary</h2><h2 style="text-align: left;"></h2><div style="text-align: left;"> </div><div style="text-align: left;">In the remainder of this post, I'll walk through the general steps required to install Omeka Classic on AWS and describe important considerations and things I learned in the process. In general, there are three major components to the installation: setting up the S3 storage, installing Omeka on EC2, and getting a custom domain name to work with the site using secure HTTP. Each of these major steps includes several sub-tasks that will be described below. </div><div style="text-align: left;"><br /></div><br /><div style="text-align: left;"><h3 style="text-align: left;">S3 setup</h3><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4CG3yLwvW2kakk6C1Hr7WCsMZNjn3oIpX9Y3dfAd-VoMEQnArE7Fus13A1zT15fb_owk3GtBrhKf6Q_JPsRg9MS6ro7wI1HmcWBnm4a-yk2RNaAa4rDjga3QtKRCJkw-KCJbvmCSe7KMnZSNCa5eevBq3TW85ZAKaLqZPzOZ2EgLUcRLfjAOXxj9SDos/s629/s3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="428" data-original-width="629" height="436" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4CG3yLwvW2kakk6C1Hr7WCsMZNjn3oIpX9Y3dfAd-VoMEQnArE7Fus13A1zT15fb_owk3GtBrhKf6Q_JPsRg9MS6ro7wI1HmcWBnm4a-yk2RNaAa4rDjga3QtKRCJkw-KCJbvmCSe7KMnZSNCa5eevBq3TW85ZAKaLqZPzOZ2EgLUcRLfjAOXxj9SDos/w640-h436/s3.png" width="640" /></a></div><br /></div><div style="text-align: left;">The basic setup of an S3 bucket is very simple and involves only a few button clicks. However, the way Omeka operates, several additional steps are required for the bucket setup. </div><div style="text-align: left;"> </div><div style="text-align: left;">By design, AWS is secure and generally one wants to permit only the minimum required access to resources. But because Omeka exposes file URLs publicly so that people can download those files, the S3 bucket must be readable by anyone. Omeka also writes multiple image variant files to S3, and this requires generating access keys whose security must be carefully guarded. </div><div style="text-align: left;"><br /></div><div style="text-align: left;">You can manually upload files and enter their metadata by typing into boxes in the Omeka graphical interface. That's fine if you will only have a few items. However, if you will be uploading many items, uploading using the graphical interface is very tedious and requires many button clicks. To create an efficient upload workflow, I used the Omeka CSV import plugin. It requires loading the files via a URL during the import process, so I used a different public S3 bucket as the source of the raw images. 
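</div><div style="text-align: left;">A rough sketch of what pushing files to such a bucket looks like with the AWS Python library (boto3) is shown below. The bucket name, local folder, and key prefix are placeholders, and it assumes boto3 is configured with credentials for a user that is allowed to write to the bucket (the bucket itself must already permit public reads).<br /><pre><code>import os
import boto3

s3 = boto3.client('s3')  # uses the credentials of the upload-only AWS user
bucket = 'example-raw-image-bucket'  # placeholder, not the real bucket name
local_dir = 'upload_staging'
key_prefix = 'glf/haw/'  # subfolder organization of your choice

for name in sorted(os.listdir(local_dir)):
    if name.lower().endswith(('.tif', '.jpg')):
        key = key_prefix + name
        s3.upload_file(os.path.join(local_dir, name), bucket, key)
        # if the bucket policy allows public reads, this is the object URL
        # that can go into the CSV used by the CSV Import plugin
        print('https://' + bucket + '.s3.amazonaws.com/' + key)
</code></pre></div><div style="text-align: left;">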
I used <a href="https://github.com/baskaufs/bassettassociates/blob/main/code/omeka_upload_data.py" target="_blank">a Python script</a> to partially automate the process of generating the metadata CSV and as part of that script, I uploaded the images automatically to the source raw image bucket using the AWS Python library (boto3). This required creating access credentials for the raw image bucket, and to reduce security risks, I created a special AWS user that was only allowed to write to that one bucket. </div><div style="text-align: left;"> </div><div style="text-align: left;">The AWS <a href="https://aws.amazon.com/free/storage/" target="_blank">free tier allows a new user access to up to 5 GB for free</a> during the first year. That corresponds to roughly a hundred high-resolution (50 MB) TIFF images.<br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><h3 style="text-align: left;">Omeka installation on EC2</h3><h3 style="text-align: left;"> </h3></div><div style="text-align: left;">As with the setup of S3 buckets, launching an EC2 server instance just involves a few button clicks. What is trickier and somewhat tedious is performing the actual setup of Omeka within the server. Because the setup is happening at some mysterious location in the cloud, you can't point and click like you can on your local computer. To access the EC2 server, you have to essentially create a "tunnel" into it by connecting to it using SSH. Once you've done that, commands that you type into your terminal application are being applied to the remote server and not your local computer. Thus, everything you do must be done at the command line. This requires basic familiarity with Unix shell commands and since you also need to edit some configuration files, you need to know how to use a terminal-based editor like Nano. </div><div style="text-align: left;"><br /></div><div style="text-align: left;">The steps involve:</div><div style="text-align: left;">- installing a LAMP (Linux, Apache, MySQL, and PHP) server bundle</div><div style="text-align: left;">- creating a MySQL database<br /></div><div style="text-align: left;">- downloading and installing Omeka</div><div style="text-align: left;">- modifying Apache and Omeka configuration files</div><div style="text-align: left;">- downloading and enabling the Omeka S3 Storage Adapter and CSV Import plugins</div><div style="text-align: left;"><br /></div><div style="text-align: left;">Once you have completed these steps (which actually involve issuing something like 50 complicated Unix commands that fortunately can be copied and pasted from <a href="https://heardlibrary.github.io/digital-scholarship/pubs/omeka/" target="_blank">my instructions</a>), you will have a functional Omeka installation on AWS. However, accessing it would require users to use a confusing and insecure URL like <br /><pre><code>http://54.243.224.52/archive/</code></pre></div><div style="text-align: left;"><h3 style="text-align: left;">Mapping an Elastic IP address to a custom domain and enabling secure HTTP</h3></div><div style="text-align: left;"> </div><div style="text-align: left;">To change this icky URL to a "normal" one that's easy to type into a browser and that is secure, several additional steps are required.
</div><div style="text-align: left;"><br /></div><div style="text-align: left;">AWS provides a feature called an <a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/elastic-ip-addresses-eip.html" target="_blank">Elastic IP address</a> that allows you to keep using the same IP address even if you change the underlying resource it refers to. Normally, if you had to spin up a new EC2 instance (for example to restore from a backup), it would be assigned a new IP address, requiring you to change any setting that referred to the IP address of the previous EC2 you were using. An Elastic IP address can be reassigned to any EC2 instance, so disruption caused by replacing the old EC2 with a new one can be avoided by just shifting the Elastic IP to the new instance. Elastic IPs are free as long as they remain associated with a running resource.<br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;">It is relatively easy to assign a custom domain name to the Elastic IP if AWS Route 53 is used for domain registration. The cost of the custom domain varies depending on the specific domain name that you select. I was able to obtain `bassettassociates.org` for US$12 per year, adding $1 per month to the cost of running the website. <br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;">After the domain name has been associated with the Elastic IP address, the last step is to enable secure HTTP (HTTPS). When initially searching the web for instructions on how to do that, I found a number of complicated and potentially expensive suggestions including installing an Nginx front-end server and using an AWS load balancer. Those options are overkill for a low-traffic Omeka site. In contrast, it is relatively easy to get a free security certificate from <a href="https://letsencrypt.org/" target="_blank">Let's Encrypt</a> and set it up to automatically renew monthly using Certbot for an Apache server.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">After completing <a href="https://heardlibrary.github.io/digital-scholarship/pubs/omeka/#enabling-https" target="_blank">these steps</a>, one can now access my Omeka instance at <a href="https://bassettassociates.org/archive/"><span style="font-family: courier;">https://bassettassociates.org/archive/</span></a>.<br /></div><div style="text-align: left;"><h3 style="text-align: left;"> </h3><h3 style="text-align: left;">Optional additional steps</h3></div><div style="text-align: left;"> </div><div style="text-align: left;">If you plan to have multiple users editing the Omeka site, you won't be able to add users beyond the default Super User without additional steps. It appears that it's not possible to add more users without enabling Omeka to send emails. This requires setting up <a href="https://aws.amazon.com/ses/" target="_blank">AWS Simple Email Service (SES)</a>, then adding the SMTP credentials to the Omeka configuration file. SES is designed to enable sending mass emails, so production access requires applying for approval. I didn't have any problems getting approved when I explained that I was only going to use it to send a few confirmation emails, although the process took at least a day since apparently a human has to examine the application. </div><div style="text-align: left;"><br /></div><div style="text-align: left;">There are three additional plugins that I installed that you may consider using.
The <a href="https://omeka.org/classic/docs/Plugins/ExhibitBuilder/" target="_blank">Exhibit Builder</a> and <a href="https://omeka.org/classic/docs/Plugins/SimplePages/" target="_blank">Simple Pages</a> plugins add the ability to create richer content. Installing them is trivial, so you will probably want to turn them on. I also installed the <a href="https://omeka.org/classic/plugins/CsvExport/" target="_blank">CSV Export Format</a> plugin because I wanted to use it to capture identifier information as part of my partially automated workflow (see following sections for more details).</div><div style="text-align: left;"><br /></div><div style="text-align: left;">If you are interested in using IIIF on your site, you may also want to install the IIIF Toolkit plugin, explained in more detail later.</div><div style="text-align: left;"><br /></div><div style="text-align: left;"><h2 style="text-align: left;">Efficient workflow</h2></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhtXRBKa06n685yrFqqoEiDj4u4wVvODp_LKo2QNtu3ow4PcyLmZ2-CDiCRxUpMiqC6atlJFV34aewnm9XvxGZbpf2uxvotglZxXn-E6MmVl0xFgdX5f4oRvyuig63-uADM5OCvsYN2gYbprnts-XwZnTAh84n4zdwewvGBrrdktRvfFGbuCs5Q0lO5YR0/s640/workflow.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="333" data-original-width="640" height="334" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhtXRBKa06n685yrFqqoEiDj4u4wVvODp_LKo2QNtu3ow4PcyLmZ2-CDiCRxUpMiqC6atlJFV34aewnm9XvxGZbpf2uxvotglZxXn-E6MmVl0xFgdX5f4oRvyuig63-uADM5OCvsYN2gYbprnts-XwZnTAh84n4zdwewvGBrrdktRvfFGbuCs5Q0lO5YR0/w640-h334/workflow.png" width="640" /></a></div><div style="text-align: left;">Once Omeka is installed and configured, it is possible to just upload content manually using the Omeka graphical interface. That's fine if you will only have a few objects. However, if you will be uploading many objects, uploading using the graphical interface is very tedious and requires many button clicks. <br /><br />The workflow described here is based on assembling the metadata in the most automated way possible, using file naming conventions, a Python script, and programmatically created CSV files. Python scripts are also used to upload the files to S3, and from there they can be automatically imported into Omeka. <br /><br />After the items are imported, the CSV export plugin can be used to extract the ID numbers assigned to the items by Omeka. A Python script then extracts the IDs from the resulting CSV and inserts them into the original CSVs used to assemble the metadata.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">For full details about the scripts and step-by-step instructions, see the <a href="https://heardlibrary.github.io/digital-scholarship/pubs/omeka/#establish-an-efficient-work-flow" target="_blank">detailed notes that accompany this post</a>.</div><div style="text-align: left;"><br /></div><div style="text-align: left;"><h3 style="text-align: left;">Notes about TIFF image files</h3></div><div style="text-align: left;"> </div><div style="text-align: left;">If original image files are available as high-resolution TIFFs, that is probably the best format to archive from the preservation standpoint. However, most browsers will not display TIFFs natively, while JPEGs can be displayed onscreen. The practical implication of this is that image thumbnails are linked directly to the original high-res image file.
So when a user clicks on the thumbnail of a JPEG, the image is displayed in their browser, but when a TIFF thumbnail is clicked, the file downloads to the user's hard drive without being displayed. When an image is uploaded, Omeka makes several JPEG copies at lower resolution so that they can be displayed onscreen in the browser without downloading.<br /> </div><div style="text-align: left;">As explained in the preprocessing section above, the workflow includes an additional conversion step that only applies to TIFFs. </div><div style="text-align: left;"><br /><h3 style="text-align: left;">Note about file sizes</h3></div><div style="text-align: left;"> </div><div style="text-align: left;">In the file configuration settings, I recommend setting a maximum file size of 100 MB. Virtually no JPEGs are ever that big, but some large TIFF files may exceed that size. As a practical matter, the upper limit on file size in this installation is actually about 50 MB. I have found from practical experience that importing original TIFF files between 50 and 100 MB can generate errors that will cause the Omeka server to hang. I have not been able to isolate the actual source of the problem, but it may be related to the process of generating the lower resolution JPEG copies. The problem may be isolated to using the CSV import plugin because some files that hung the server when using the CSV import were then able to be uploaded manually after creating the item record. In one instance, a JPEG that was only 11.4 MB repeatedly failed to upload using the CSV import. Apparently its large pixel dimensions (6144x4360) were the problem (it also was successfully uploaded manually).<br /><br />The other thing to consider is that when TIFFs are converted to tiled pyramidal form, there is an increase in size of roughly 25% when the low-res layers are added to the original high-res layer. So a 40 MB raw TIFF may be at or over 50 MB after conversion. I have found that if I keep the original file size below 35 MB, the files usually load without problems. It is annoying to have to decrease the resolution of any source files in order to add them to the digital collection, but there is a workaround (described in the IIIF section below) for extremely large TIFF image files.<br /></div><div style="text-align: left;"> </div><div style="text-align: left;"></div><div style="text-align: left;"></div><div style="text-align: left;"></div><div style="text-align: left;"><h3 style="text-align: left;">The CSV Import plugin</h3> </div><div style="text-align: left;">An efficient way to import multiple images is to use the <a href="https://omeka.org/classic/docs/Plugins/CSV_Import/" target="_blank">CSV Import plugin</a>. The plugin requires two things: a CSV spreadsheet containing file and item metadata, and files that are accessible directly using a URL. Because files on your local hard drive are not accessible via a URL, there are a number of workarounds that can be used, such as uploading the images to a cloud service like Google Drive or Dropbox. Since we are using AWS S3 storage, it makes sense to make the image files accessible from there, since files in a public S3 bucket can be accessed by a URL.
(Example of raw image available from an S3 bucket via the URL: <a href="https://bassettassociates.s3.amazonaws.com/glf/haw/glf_haw_pl_00.tif">https://bassettassociates.s3.amazonaws.com/glf/haw/glf_haw_pl_00.tif</a>)</div><div style="text-align: left;"><br /></div><div style="text-align: left;">One could create the metadata CSV entirely by hand by typing and copying and pasting in a spreadsheet editor. However, in my case, because of the general inconsistency in file names on the source hard drive, I was renaming all of the image files anyway. So I established a file identifier coding system that, when used with file names, would both group similar files together in the directory listing and also make it possible to automate populating some of the metadata fields in the CSV. The <a href="https://github.com/baskaufs/bassettassociates/blob/main/code/omeka_upload_data.py" target="_blank">Python script that I wrote</a> generated a metadata CSV with many of the columns already populated, including the image dimensions, which it extracted from the EXIF data in the image files. After generating a first draft of the CSV, I then had to manually add the date, title, and description fields, plus any tags I wanted to add in addition to the ones that the script generated automatically from the file names. (<a href="https://github.com/baskaufs/bassettassociates/blob/ea625f2991f73803dec5416bfce23181fa0af558/data/upload.csv" target="_blank">Example of completed CSV metadata file</a>)<br /></div><div style="text-align: left;"> </div><div style="text-align: left;">The CSV import plugin requires that all items imported as a batch be the same general type. Since my workflow was built to handle images, that wasn't a problem -- all items were Still Images. As a practical matter, it was best to restrict all of the images in a batch to be for the same Omeka collection. If images intended for several collections were uploaded together in a batch, they would have had to be assigned to collections manually after upload. </div><div style="text-align: left;"> </div><div style="text-align: left;"><h3 style="text-align: left;">Omeka identifiers</h3></div><div style="text-align: left;"> </div><div style="text-align: left;">When Omeka ingests image files, it automatically assigns an opaque ID (e.g. 3244d9cdd5e9dce04e4e0522396ff779) to the image and generates JPEG versions of the original image at various sizes. These images are stored in the S3 bucket that you set up for Omeka storage. Since those images are publicly accessible by URL, you could provide access to them for other purposes. However, since the file names are based on the opaque identifiers and have no connection with the original file names, it would be difficult to know what the access URL would be. (Example: <a href="https://bassett-omeka-storage.s3.amazonaws.com/fullsize/3244d9cdd5e9dce04e4e0522396ff779.jpg">https://bassett-omeka-storage.s3.amazonaws.com/fullsize/3244d9cdd5e9dce04e4e0522396ff779.jpg</a>)<br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;">Fortunately, there is a CSV Export Format plugin that can be used to discover the Omeka-assigned IDs along with the original identifiers assigned by the provider as part of the CSV metadata that was uploaded during the import process.
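</div><div style="text-align: left;">As a rough illustration of what that export enables (the column headers here are guesses for the sake of the example, not necessarily the headers that the plugin actually emits), pairing the Omeka item IDs with the original identifiers takes only a few lines of Python:<br /><pre><code>import csv

# Read the CSV Export Format output and pair each Omeka-assigned item ID
# with the original identifier that was uploaded as item metadata.
# Column names are illustrative; check the headers in your own export.
id_pairs = []
with open('omeka_export.csv', newline='') as infile:
    for row in csv.DictReader(infile):
        id_pairs.append({'identifier': row['Dublin Core:Identifier'],
                         'omeka_id': row['id']})

# Archive the pairs in an identifier CSV for later reference.
with open('identifiers.csv', 'w', newline='') as outfile:
    writer = csv.DictWriter(outfile, fieldnames=['identifier', 'omeka_id'])
    writer.writeheader()
    writer.writerows(id_pairs)
</code></pre></div><div style="text-align: left;">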
In my workflow, I have added additional steps to do the CSV export, then run <a href="https://github.com/baskaufs/bassettassociates/blob/main/code/extract_omeka_csv_export_data.py" target="_blank">another Python script</a> that pulls the Omeka identifiers from the CSV and archives them along with the original user-assigned identifier in an identifier CSV. At the end of processing each batch, I push the <a href="https://github.com/baskaufs/bassettassociates/tree/main/data" target="_blank">identifier and metadata CSV files to GitHub</a> to archive the data used in the upload. </div><div style="text-align: left;"> </div><div style="text-align: left;">In theory, the images in the raw image source bucket could be deleted once they have been imported. However, S3 storage costs are so low that you probably will just want to leave them there. Since they have meaningful file names and a subfolder organization of your choice, they would make a pretty nice cloud backup system that is independent of the Omeka instance. After your archive project is complete, you could change the raw image source bucket over to one of the <a href="https://aws.amazon.com/s3/storage-classes/glacier/" target="_blank">cheaper, low-access types (like Glacier)</a> that have even lower storage costs than a standard S3 bucket. Because both buckets are public, you can use them as a means of giving access to the original high-res files by simply giving the Object URL to the person wanting a copy of the file.<br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><h2 style="text-align: left;">Backing up the data</h2></div><div style="text-align: left;"></div><div style="text-align: left;"> </div><div style="text-align: left;">There are two mechanisms for backing up your data periodically. <br /><br />The most straightforward is to create an <a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html" target="_blank">Amazon Machine Image (AMI)</a> of the EC2 server. Not only will this save all of your data, but it will also archive the complete configuration of the server at the time the image is made. This is critical if you have any disasters while making major configuration changes and need to roll back the EC2 to an earlier (functional) state. It is quite easy to roll back to an AMI and re-assign the Elastic IP to the new EC2 instance. However, this rollback will have no impact on any files saved in S3 by Omeka after the time when the backup AMI was created. Those files won't hurt anything, but they will effectively be orphaned there.<br /><br />The CSV files pushed to GitHub after each CSV import (<a href="https://github.com/baskaufs/bassettassociates/tree/main" target="_blank">example</a>) can also be used as a sort of backup. Any set of rows from the saved metadata CSV file can be used to re-upload those items onto any Omeka instance as long as the original files are still in the raw source image S3 bucket. Of course, if you make manual edits to the metadata, the metadata in the CSV file would become stale.</div><div style="text-align: left;"> </div><div style="text-align: left;"><h2 style="text-align: left;">Using IIIF tools in Omeka</h2></div><div style="text-align: left;"> </div><div style="text-align: left;">There are two Omeka plugins that add International Image Interoperability Framework (IIIF) capabilities.
<br /><br />The <a href="https://omeka.org/classic/plugins/UniversalViewer/" target="_blank">UniversalViewer plugin</a> allows Omeka to serve images like an IIIF image server and it generates IIIF manifests using the existing metadata. That makes it possible for the Universal Viewer player (included in the plugin) to display images in a rich manner that allows pan and zoom. This plugin was very appealing to me because if it functioned well, it would enable IIIF capabilities without needing to manage any other servers. I was able to install it and the embedded Universal Viewer did launch, but the images never loaded in the viewer. Despite spending a lot of time messing around with the settings, disabling S3 storage, and launching a larger EC2 instance, I was never able to get it to work, even for a tiny JPEG file. I read a number of Omeka forum posts about troubleshooting, but eventually gave up. <br /><br />If I had gotten it to work, there was one potential problem with the setup anyway. The t2.micro instance that I'm running has very low resource capacity (memory, number of CPUs, drive storage), which is OK as I've configured it because the server just has to run a relatively tiny MySQL database and serve static files from S3. But presumably this plugin would also have to generate the image variants that it's serving on the fly, and that could max out the server quite easily. I'm disappointed that I couldn't get it to work, but I'm not confident that it's the right tool for a budget installation like this one.<br /><br />I had more success with the <a href="https://omeka.org/classic/plugins/IiifItems/" target="_blank">IIIF Toolkit plugin</a>. It also provides an embedded Universal Viewer that can be inserted in various places in Omeka. The major downside is that you must have access to a separate IIIF server to actually provide the images used in the viewer. I was able to test it out by loading images into the Vanderbilt Libraries' Cantaloupe IIIF server and it worked pretty well. However, setting up your own Cantaloupe server on AWS does not appear to be a trivial task and because of the resources required for the IIIF server to run effectively, it would probably cost a lot more per month to operate than the Omeka site itself. (Vanderbilt's server is running on a cluster with a load balancer, 2 vCPU, and 4 GB memory. All of these increases over a basic single t2.micro instance would involve a significantly increased cost.) So in the absence of an available external IIIF server, this plugin probably would not be useful for an independent user with a small budget. <br /><br />One nice feature that I was not able to try was pointing the external server to the `original` folder of the S3 storage bucket. That would be really nice since it would not require loading the images into dedicated storage for the IIIF server, separate from what is already being provisioned for Omeka. Unfortunately, we have not yet got that working on the Libraries' Cantaloupe server as it seems to require some custom Ruby coding to implement.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">Once the IIIF Toolkit is installed, there are two ways to include IIIF content into Omeka pages. If the Exhibit Builder plugin is enabled, the IIIF Toolkit adds a new kind of content block, "Manifest".
Entering an IIIF manifest URL simply displays the contents of that manifest in an embedded Universal Viewer widget on the exhibit page without actually copying any images or metadata into the Omeka database.</div><div style="text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiSAppJyYN3DuonHefYDY0o_BrA1cHsiFwyMWdZJkcPw9ZsIdl4o7LU6m0ZwIjCoT2VahjMQxd9m-Bk_KSKrdWDGjPx6UaC_e84FHP0OjDhFRjHHAwVAfT_PdBYQc-wFRE68erQq_vhUAFrJlMA21FT1TL_dcHlulUySJWCYyjfmJu62HSsJ5jDxdvU_-Y/s672/iiif_import_workflow.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="294" data-original-width="672" height="280" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiSAppJyYN3DuonHefYDY0o_BrA1cHsiFwyMWdZJkcPw9ZsIdl4o7LU6m0ZwIjCoT2VahjMQxd9m-Bk_KSKrdWDGjPx6UaC_e84FHP0OjDhFRjHHAwVAfT_PdBYQc-wFRE68erQq_vhUAFrJlMA21FT1TL_dcHlulUySJWCYyjfmJu62HSsJ5jDxdvU_-Y/w640-h280/iiif_import_workflow.png" width="640" /></a></div><br /><div style="text-align: left;"></div><div style="text-align: left;">The second way to include IIIF content is to make use of an alternate method of importing content that becomes available after the IIIF Toolkit is installed. There are three import types that can be used to import items. I explored importing `Manifest` and `Canvas` types since I had those types of structured data available. <br /><br />Manifest is the most straightforward because it only requires a manifest URL (commonly available from many sources). But the import was messy and always created a new collection for each item imported. In theory, this could be avoided by selecting an existing collection using the `Parent` dropdown, but that feature never worked for me. <br /><br />I concluded that importing canvases was the only feasible method. Unfortunately, canvas JSON usually doesn't exist in isolation -- it usually is part of the JSON for an entire manifest. The `From Paste` option is useful if you are capable of the tedious task of searching through the JSON of a whole manifest and copying just the JSON for a single canvas from it. I found it much more useful to just create <a href="https://github.com/baskaufs/bassettassociates/blob/main/code/manifests/minimal_manifest.py" target="_blank">a Python script to generate minimal canvas JSON</a> for an image and save it as a file, which could either be uploaded directly, or pushed to the web and read in through a URL. It gets the pixel dimensions from the image file, with labels and descriptions taken from a CSV file (the IIIF import does not use more information than that). These values are inserted into a JSON canvas template, then saved as a file. The script will loop through an entire directory of files, so it's relatively easy to make canvases for a number of images that were already uploaded using the CSV import function (just copy and paste labels and descriptions from the metadata CSV file). Once the canvases have been generated, either upload them or paste their URLs (if they were pushed to the web) on the IIIF Toolkit Import Items page. </div><div style="text-align: left;"> </div><div style="text-align: left;">The result of the import is an item similar to those created by direct upload or CSV import -- JPEG size variants are generated and stored, and a small amount of metadata present in the canvas is assigned to the title and description metadata fields for the item.
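</div><div style="text-align: left;">To make the idea concrete, a minimal IIIF Presentation API 2.x canvas can be generated along the lines of the sketch below. This is not the exact script linked above; the image service URL, canvas URI, file name, and label are placeholders, and it assumes the Pillow library for reading the pixel dimensions.<br /><pre><code>import json
from PIL import Image  # Pillow, used to read the pixel dimensions

iiif_service = 'https://iiif.example.org/iiif/2/binder_mp'  # placeholder image service
canvas_uri = 'https://example.org/iiif/canvas/binder_mp'    # placeholder canvas ID

with Image.open('binder_mp.tif') as img:
    width, height = img.size

canvas = {
    '@context': 'http://iiif.io/api/presentation/2/context.json',
    '@id': canvas_uri,
    '@type': 'sc:Canvas',
    'label': 'Binder Park Zoo Master Plan',
    'width': width,
    'height': height,
    'images': [{
        '@type': 'oa:Annotation',
        'motivation': 'sc:painting',
        'on': canvas_uri,
        'resource': {
            '@id': iiif_service + '/full/full/0/default.jpg',
            '@type': 'dctypes:Image',
            'format': 'image/jpeg',
            'width': width,
            'height': height,
            'service': {
                '@context': 'http://iiif.io/api/image/2/context.json',
                '@id': iiif_service,
                'profile': 'http://iiif.io/api/image/2/level2.json'
            }
        }
    }]
}

# Save the canvas so it can be uploaded directly or pushed to the web.
with open('binder_mp_canvas.json', 'w') as outfile:
    json.dump(canvas, outfile, indent=2)
</code></pre></div><div style="text-align: left;">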
The main difference is that the import includes the canvas JSON as part of an Omeka-generated IIIF manifest that can be displayed in an embedded Universal Viewer either as part of an exhibit or on a Simple Pages web page. The viewer also shows up at the bottom of the item page.<br /></div><div style="text-align: left;"><br /></div><div style="text-align: left;">Because there is no way to import IIIF items as a batch, nor to import metadata from the canvas beyond the title and description, each item needs to be imported one at a time and the metadata added manually, or added using the Bulk Metadata Editor plugin if possible. This makes uploading many items somewhat impractical. However, for very large images whose detail cannot be seen well in a single image on a screen, the ability to pan and zoom is pretty important. So for some items, like large maps, this tool can be very nice despite the extra work. For a good example, see the <a href="https://bassettassociates.org/archive/exhibits/show/artspace/panels" target="_blank">panels page</a> from the Omeka exhibit I made for the 2001 Artspace/Lima exhibition. It is best viewed by changing the embedded viewer to full screen.</div><div style="text-align: left;"> </div><div style="text-align: left;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsEUy2FvEEtbgJ0FwAU2eKw3n49YQCcjALo7OXCLsyBf4xeAPH4hNdai7Wv3RueZmvnci2iWzqdUMbvO2fw_sXu5MhYSxIvrWzHpNWhhzTYMbg2STc8U6HWpWrwJSMV94_Jh8-cCZDUDGmlpWHjBumFapSiGXBvunc3BN-V8mst8S4iyTPeAu1bWCMdkU/s847/binder_mp.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="447" data-original-width="847" height="338" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsEUy2FvEEtbgJ0FwAU2eKw3n49YQCcjALo7OXCLsyBf4xeAPH4hNdai7Wv3RueZmvnci2iWzqdUMbvO2fw_sXu5MhYSxIvrWzHpNWhhzTYMbg2STc8U6HWpWrwJSMV94_Jh8-cCZDUDGmlpWHjBumFapSiGXBvunc3BN-V8mst8S4iyTPeAu1bWCMdkU/w640-h338/binder_mp.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Entire master plan image. Bassett Associates. “Binder Park Zoo Master Plan (IIIF),” Bassett Associates Archive, accessed August 6, 2023, <a href="https://bassettassociates.org/archive/items/show/418">https://bassettassociates.org/archive/items/show/418</a>. 
Available under a CC BY 4.0 license.<br /></td></tr></tbody></table> <br /></div><div style="text-align: left;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibVN_68bRrT24FpSux6-BMaXMrNwp_CymmULQrh5cjcNfpPWr_Hr0U_DJMUcBM7aTxctHA6ItSW0SVZnGJyGRdeyHCAE-o9P4MdF3KIq5PVjTvmnJj3Y-SPZMvUszdce8cKLMWBeDqOCFMSquaQ3H__Gl1AUHUIm0ZCDbgzqPc22ksvq4hI2OOLRI1Cy0/s1033/zoom_example.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="664" data-original-width="1033" height="412" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibVN_68bRrT24FpSux6-BMaXMrNwp_CymmULQrh5cjcNfpPWr_Hr0U_DJMUcBM7aTxctHA6ItSW0SVZnGJyGRdeyHCAE-o9P4MdF3KIq5PVjTvmnJj3Y-SPZMvUszdce8cKLMWBeDqOCFMSquaQ3H__Gl1AUHUIm0ZCDbgzqPc22ksvq4hI2OOLRI1Cy0/w640-h412/zoom_example.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Maximum zoom level using embedded IIIF Universal Viewer<br /></td></tr></tbody></table><br /></div><div style="text-align: left;"></div><div style="text-align: left;">One thing that should be noted is that, like other images associated with Omeka items, image import using the IIIF Toolkit generates size variants of the image. An IIIF import also generates an "original" JPEG version that is much smaller than the pyramidal tiled TIFF uploaded to the IIIF server. This means that it is possible to create items for TIFF images that are larger than the 50 MB recommended above. An example is the <a href="https://bassettassociates.org/archive/items/show/418" target="_blank">Binder Park Master Plan</a>. If you scroll to the bottom of its page and zoom in (above), you will see that an incredible amount of detail is visible because the original TIFF file being used by the IIIF server is huge (347 MB). So using IIIF import is a way to display and make available very large image files that exceed the practical limit of 50 MB discussed above.</div><div style="text-align: left;"> </div><div style="text-align: left;"><h2>Conclusions</h2></div><div style="text-align: left;"> </div><div style="text-align: left;">Although it took me a long time to figure out how to get all of the pieces to work together, I'm quite satisfied with the Omeka setup I now have running on AWS. I've been uploading works, and as of this writing (2023-08-06), I've uploaded <a href="https://bassettassociates.org/archive/items/browse" target="_blank">400 items</a> into <a href="https://bassettassociates.org/archive/iiif-items/tree" target="_blank">36 collections</a>. I also created an <a href="https://bassettassociates.org/archive/exhibits/show/artspace" target="_blank">Omeka Exhibit for the 2001 exhibition</a> that includes the panels created for the exhibition using <a href="https://bassettassociates.org/archive/exhibits/show/artspace/panels" target="_blank">an "IIIF Items" block</a> (allows arrowing through all of the panels with pan and zoom), <a href="https://bassettassociates.org/archive/exhibits/show/artspace/works" target="_blank">a "works" block</a> (displaying thumbnails for artworks displayed in the exhibition), and <a href="https://bassettassociates.org/archive/exhibits/show/artspace/photos" target="_blank">a "carousel" block</a> (cycling through photographs of the exhibition). I still need to do more work on the landing page and on styling of the theme.
But for now I have an adequate mechanism for exposing some of the images in the collection on a robust hosting system for a total cost of around $10 per month.</div><div style="text-align: left;"><br /></div><div style="text-align: left;"><br /></div><h1 style="text-align: left;">Structured Data in Commons and wikibase software tools</h1><p><i>Steve Baskauf, 2023-04-12</i></p><div style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjI7S3dkvhn2dcO8D53dzVT67AXBSppkV9Njmlsr0beXRjz1WWpWJqNSWaO34aYErSaMCI5g7Ip2FM6Gsj6zQhvpZvG7meMfJ0kp5D1VJOHSNp98UrQxqcvP98aoSI-ke60uAhQ8VKgNgxEzo2rr6ZDeeQaJYKlkyhdlYPnaHe2A8UvpekmM0et75Qr/s1158/madonna_child_diagram.png"><img alt="VanderBot workflow to Commons" border="0" data-original-height="923" data-original-width="1158" height="510" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjI7S3dkvhn2dcO8D53dzVT67AXBSppkV9Njmlsr0beXRjz1WWpWJqNSWaO34aYErSaMCI5g7Ip2FM6Gsj6zQhvpZvG7meMfJ0kp5D1VJOHSNp98UrQxqcvP98aoSI-ke60uAhQ8VKgNgxEzo2rr6ZDeeQaJYKlkyhdlYPnaHe2A8UvpekmM0et75Qr/w640-h510/madonna_child_diagram.png" title="VanderBot workflow to Commons" width="640" /></a></div> <br /><p></p><p>In my <a href="https://baskauf.blogspot.com/2022/09/" target="_blank">last blog post</a>, I described a tool (<a href="https://github.com/HeardLibrary/linked-data/blob/master/commonsbot/README.md" target="_blank">CommonsTool</a>) that I created for uploading art images to Wikimedia Commons. One of the features of that Python script was to create <a href="https://commons.wikimedia.org/wiki/Commons:Structured_data" target="_blank">Structured Data in Commons</a> (SDoC) statements about the artwork that was being uploaded, such as "depicts" (P180), "main subject" (P921), and "digital representation of" (P6243), necessary <a href="https://commons.wikimedia.org/wiki/Commons:Structured_data/Modeling/Visual_artworks" target="_blank">to "magically" populate the Commons page with extensive metadata about the artwork from Wikidata</a>. The script also added "created" (P170) and "inception" (P571) statements, which are important for providing the required attribution when the work is under copyright.</p><h3 style="text-align: left;">Structured Data on Commons "depicts" statements<br /></h3><p>These properties serve important roles, but one of the key purposes of SDoC is to make it possible for potential users of the media item to find it by providing richer metadata about what is depicted in the media. SDoC depicts statements go into the data that is indexed by the Commons search engine, which otherwise is primarily dependent on words present in the filename. My CommonsTool script does write one "depicts" statement (that the image depicts the artwork itself) and that's important for the semantics of understanding what the media item represents. However, from the standpoint of searching, that single depicts statement doesn't add much to improve discovery since the artwork title in Wikidata is probably similar to the filename of the media item -- neither of which necessarily describes what is depicted IN the artwork. </p><p>Of course, one can add depicts statements manually, and there are also some tools that can be used to help with the process. But if you aspire to add multiple depicts statements to hundreds or thousands of media items, this could be very tedious and time consuming.
If we are clever, we can take advantage of the fact that Structured Data in Commons is actually just another instance of a wikibase. So generally any tools that can make it easier to work with a wikibase can also make it easier to work with Wikimedia Commons.</p><p>In February, I gave a <a href="https://drive.google.com/file/d/1VH47ej63-sEYNCD8DL25SeerW8Y9f7di/view" target="_blank">presentation</a> about using <a href="http://vanderbi.lt/vanderbot" target="_blank">VanderBot</a> (a tool that I wrote to write data to Wikidata) to write to any wikibase. As part of that presentation, I put together <a href="https://heardlibrary.github.io/digital-scholarship/lod/wikibase/sdoc/" target="_blank">some information about how to use VanderBot to write statements to SDoC using the Commons API</a>, and <a href="https://heardlibrary.github.io/digital-scholarship/lod/wikibase/sdoc/#querying" target="_blank">how to use the Wikimedia Commons Query Service (WCQS) to acquire data programmatically via Python</a>. In this blog post, I will highlight some of the key points about interacting with Commons as a wikibase and link out to the details required to actually do the interacting.</p><h3 style="text-align: left;">Media file identifiers (M IDs)</h3><p>Wikimedia Commons media files are assigned a unique identifier that is analogous to the Q IDs used with Wikidata items. They are known as "M IDs" and they are required to interact with the Commons API or the Wikimedia Commons Query Service programmatically as I will describe below. </p><p>It is not particularly straightforward to find the M ID for a media file. The easiest way is probably to find the <span style="font-family: courier;">Concept URI</span> link in the left menu of a Commons page, right-click on the link to copy it, and then paste it somewhere. The M ID is the last part of that link. Here's an example: <a href="https://commons.wikimedia.org/entity/M113161207">https://commons.wikimedia.org/entity/M113161207</a> . If the M ID for a media file is known, you can load its page using a URL of this form. </p><p>If you are automating the upload process as I described in my last post, CommonsTool records the M ID when it uploads the file. I also have a <a href="https://github.com/HeardLibrary/linked-data/blob/3d77805318cc0b8f8533c00d582dd0f81af9c4ca/commonsbot/commonstool.py#L659-L686" target="_blank">Python function</a> that can be used to get the M ID from the Commons API using the media filename. </p><h3 style="text-align: left;">Properties and values in Structured Data on Commons come from Wikidata<br /></h3><p>Structured Data on Commons does not maintain its own system of properties. It exclusively uses properties from Wikidata, identified by P IDs. Similarly, the values of SDoC statements are nearly always Wikidata items identified by Q IDs (with dates being an exception). So one could generally represent an SDoC statement (subject property value) like this:</p><p>MID PID QID. </p><h3 style="text-align: left;">Captions</h3><p>Captions are a feature of Commons that allows multilingual captions to be applied to media items.
They show up under the "File information" tab.<br /></p><p></p><p><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEga4dqNE610Fk5w-ZF_fE5DdcEGMQq7JTVJdpwvX_Q5sLNGPa0G1b8C1AucwfvoDXuiPB6U8VS2TPixg47UUCxZZ4FYgCOyfX-RtHIvHRFDOsTic1ZpdlvOerlye-tqHNdjCKRlfJDJjJMKEKJz20qhvdFnqNketf94t3nY3Lpzga9cjgyTe0txMU0Y/s1056/caption_screenshot.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="523" data-original-width="1056" height="316" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEga4dqNE610Fk5w-ZF_fE5DdcEGMQq7JTVJdpwvX_Q5sLNGPa0G1b8C1AucwfvoDXuiPB6U8VS2TPixg47UUCxZZ4FYgCOyfX-RtHIvHRFDOsTic1ZpdlvOerlye-tqHNdjCKRlfJDJjJMKEKJz20qhvdFnqNketf94t3nY3Lpzga9cjgyTe0txMU0Y/w640-h316/caption_screenshot.png" width="640" /></a></p> Although captions can be added or edited using the graphical interface, under the hood the captions are the multilingual labels for the media items in the Commons wikibase. So they can be added or edited as wikibase labels via the Commons API using any tool that can edit wikibases.<p></p><p></p><h2 style="text-align: left;"><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjlCYqPm6Lt6ByXQnpKBs_kbssIeV5s4bE8H8yilyjZCyI-pbXFi-Rh3uBcjFG59BJCfDpN62LMrwUvvW8Y7Gc5kUgqYLl4waIDqmuJrQZS3dAuSDBRU6LTf2kNsjOoJCN2TdtHmc2nLbgkiM1QeIUIeP_Oq8Iicjpd4mhKDBLjPEvyl57ltWkG6Xv5/s830/components_diagram.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="522" data-original-width="830" height="402" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjlCYqPm6Lt6ByXQnpKBs_kbssIeV5s4bE8H8yilyjZCyI-pbXFi-Rh3uBcjFG59BJCfDpN62LMrwUvvW8Y7Gc5kUgqYLl4waIDqmuJrQZS3dAuSDBRU6LTf2kNsjOoJCN2TdtHmc2nLbgkiM1QeIUIeP_Oq8Iicjpd4mhKDBLjPEvyl57ltWkG6Xv5/w640-h402/components_diagram.png" width="640" /></a></div></h2><h2 style="text-align: left;">Writing statements to the Commons API with VanderBot<br /></h2><p></p><p>VanderBot uses tabular data (spreadsheets) as a data source when it creates statements in a wikibase. One key piece of required information is the Q ID of the subject item that the statements are about and that is generally the first column in the table. When writing to Commons, the subject M ID is substituted for a Q ID in the table. </p><p>Statement values for a particular property are placed in one column in the table. Since all of the values in a column are assumed to be for the same property, the P ID doesn't need to be specified as data in the row. VanderBot just needs to know what P ID is associated with that column and that mapping of column with property is made separately. So at a minimum, to write a single kind of statement to Commons (like <span style="font-family: courier;">Depicts</span>), VanderBot needs only two columns of data (one for the M ID and one for the Q ID of the value of the property). 
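</p><p>If the M ID column needs to be filled programmatically, the identifier can be looked up by file name with an ordinary MediaWiki API query, since the M ID is just the letter M prepended to the page ID of the file. A minimal sketch (not the function from CommonsTool, but the same idea):</p><pre><code>import requests

def get_mid(commons_filename):
    """Return the M ID for a Commons media file, given its name without the 'File:' prefix."""
    params = {
        'action': 'query',
        'titles': 'File:' + commons_filename,
        'format': 'json'
    }
    response = requests.get('https://commons.wikimedia.org/w/api.php', params=params)
    pages = response.json()['query']['pages']
    page_id = list(pages.keys())[0]  # '-1' means the file was not found
    return 'M' + page_id

print(get_mid('Madonna_and_Child_with_St._Elizabeth_and_infant_John_the_Baptist_-_Vanderbilt_Fine_Arts_Gallery_-_1979.0321P_copy.tif'))
</code></pre><p>The query works the same whether the file name uses spaces or underscores, since MediaWiki normalizes titles.</p><p>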
</p><p> Here is <a href="https://github.com/HeardLibrary/linked-data/blob/master/commonsbot/depicts/depicts.csv" target="_blank">an example of a table with depicts data</a> to be uploaded to Commons by VanderBot:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEio5kICrZgnPT0uajA7mhB3EhhnWpNeu7rSQGYoI6Z21OhWbi_84Fqxg8FK7RpkWURCwbHGdaHIAEkAm2kSy1IGd_imqEfOvUyBkUPYGg_1pUe5mhGCkE0kC9mQmURcom-Uq3Xhy2X53pJOl31_ty4NmRZuM89Flv7xq46TTUMfS0-E7B217TwVvPMn/s1096/csv.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="163" data-original-width="1096" height="96" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEio5kICrZgnPT0uajA7mhB3EhhnWpNeu7rSQGYoI6Z21OhWbi_84Fqxg8FK7RpkWURCwbHGdaHIAEkAm2kSy1IGd_imqEfOvUyBkUPYGg_1pUe5mhGCkE0kC9mQmURcom-Uq3Xhy2X53pJOl31_ty4NmRZuM89Flv7xq46TTUMfS0-E7B217TwVvPMn/w640-h96/csv.png" width="640" /></a></div><p>The <span style="font-family: courier;">qid</span> column contains the subject M ID identifiers (for <a href="https://commons.wikimedia.org/wiki/File:Madonna_and_Child_with_St._Elizabeth_and_infant_John_the_Baptist_-_Vanderbilt_Fine_Arts_Gallery_-_1979.0321P_copy.tif" target="_blank">this media file</a>). The <span style="font-family: courier;">depicts</span> column contains the Q IDs of the values (the things that are depicted in the media item). The other three columns serve the following purposes:</p><p>- <span style="font-family: courier;">depicts_label</span> is ignored by the script. It's just a place to put the label of the otherwise opaque Q ID for the depicted item so that a human looking at the spreadsheet has some idea about what's being depicted.</p><p>- <span style="font-family: courier;">label_en</span> is the language-tagged caption/wikibase label. VanderBot has an option to either overwrite the existing label in the wikibase with the value in the table or ignore the label column and leave the label in Wikibase the same. In this example, we are not concerning ourselves with editing the captions, so we will use the "ignore" option. But if one wanted to add or update captions, VanderBot could be used for that.</p><p>- <span style="font-family: courier;">depicts_uuid</span> stores the unique statement identifier after the statement is created. It is empty for statements that have not yet been uploaded.</p><p>I mentioned before that the connection between the property and the column that contains its values was made separately. 
This mapping is done in <a href="https://github.com/HeardLibrary/linked-data/blob/master/commonsbot/depicts/config.yaml" target="_blank">a YAML file that</a> describes the columns in the table:</p><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEihMOgpmLINcZbUQuMdXOKEQQ9AOYovg1ea8hUYJTB5kSatb4Mhd2FruqwYHUom_inb45tPSVJU-x2SqtZ3Z8lWfLZuXM80b5IAo8l6IU1DcHozfhhO_D7tQWJ4fPHUFHBIYK6mRZJQ5P1pAeWwByxNy0nU-ItVQP_26FU1W4peoiY18lZDGqlC0KDl/s381/config_yaml.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="381" data-original-width="346" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEihMOgpmLINcZbUQuMdXOKEQQ9AOYovg1ea8hUYJTB5kSatb4Mhd2FruqwYHUom_inb45tPSVJU-x2SqtZ3Z8lWfLZuXM80b5IAo8l6IU1DcHozfhhO_D7tQWJ4fPHUFHBIYK6mRZJQ5P1pAeWwByxNy0nU-ItVQP_26FU1W4peoiY18lZDGqlC0KDl/s320/config_yaml.png" width="291" /></a></div><p>The details of this file structure are given <a href="https://github.com/HeardLibrary/linked-data/blob/master/vanderbot/convert-config.md" target="_blank">elsewhere</a>, but a few key details are obvious. The <span style="font-family: courier;">depicts_label</span> column is designated to be ignored. In the properties list, the header for a column is given as the value of the <span style="font-family: courier;">variable</span> key, with a value of <span style="font-family: courier;">depicts</span> in this example. That column has <span style="font-family: courier;">item</span> as its value type and <span style="font-family: courier;">P180</span> as its property. <br /></p><p>As a part of the VanderBot workflow, this mapping file is converted into a JSON metadata description file and that file along with the CSV are all that are needed by VanderBot to create the SDoC <span style="font-family: courier;">depicts</span> statements.<br /></p><p>If you have used VanderBot to create new items in Wikidata, uploading to Commons is more restrictive than what you are used to. When writing to Wikidata, if the Q ID column for a row in a CSV is empty, VanderBot will create a new item, and if it's not, it edits an existing one. Creating new items directly via the API is not possible in Commons, because new items in the Commons wikibase are only created as a result of media uploads. So when VanderBot interacts with the Commons API, the <span style="font-family: courier;">qid</span> column must contain an existing M ID. 
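<p>To make the shape of the input concrete, here is a minimal sketch (not code from VanderBot itself) of how such a depicts CSV could be generated with Python's csv module. The M ID happens to be the one for the media file used in this example, but the Q IDs, labels, and column order are just for illustration and would normally come from your own data:</p><pre><code>import csv

# Hypothetical rows: one depicts statement per row, all about the same media file.
# An empty depicts_uuid means the statement has not been uploaded yet, and the
# depicts_label values are informal human-readable notes that the script ignores.
rows = [
    {'qid': 'M113161207', 'label_en': '', 'depicts': 'Q302',
     'depicts_label': 'Jesus', 'depicts_uuid': ''},
    {'qid': 'M113161207', 'label_en': '', 'depicts': 'Q345',
     'depicts_label': 'Mary', 'depicts_uuid': ''}
]

fieldnames = ['qid', 'label_en', 'depicts', 'depicts_label', 'depicts_uuid']
with open('depicts.csv', 'w', newline='', encoding='utf-8') as file_object:
    writer = csv.DictWriter(file_object, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)</code></pre>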
</p><p>After the SDoC statements are written, they will show up under the "Structured data" tab for <a href="https://commons.wikimedia.org/wiki/File:Madonna_and_Child_with_St._Elizabeth_and_infant_John_the_Baptist_-_Vanderbilt_Fine_Arts_Gallery_-_1979.0321P_copy.tif" target="_blank">the media item</a>, like this:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEimg29jEed0SnXnn4lF6M7Yv33R1vpogI6y8pTx2-knh5YB6JvUE-KpH6YAj51TUpq_XebfjNuVj9_A3HA-BbT2a_LlEvrYaW45eMGni8C-_1RgtMv4zaCJB0qSldMf-QVi76h_a6bABfnWVVLY5Tdr-XqNEj28gGUSDX1zS38p3K-xEwP_BrVbvKpl/s575/structured_data.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="349" data-original-width="575" height="388" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEimg29jEed0SnXnn4lF6M7Yv33R1vpogI6y8pTx2-knh5YB6JvUE-KpH6YAj51TUpq_XebfjNuVj9_A3HA-BbT2a_LlEvrYaW45eMGni8C-_1RgtMv4zaCJB0qSldMf-QVi76h_a6bABfnWVVLY5Tdr-XqNEj28gGUSDX1zS38p3K-xEwP_BrVbvKpl/w640-h388/structured_data.png" width="640" /></a></div><p>Notice that the Q IDs for the depicts values have been replaced by their labels.</p><p>This is a very abbreviated overview of the process and is intended to make the point that once you have the system set up, all you need to write a large number of SDoC <span style="font-family: courier;">depicts</span> statements is a spreadsheet with a column for the M IDs of the media items and a column with the Q IDs of what is depicted in that media item. There are more details, with links out to how to use VanderBot to write to Commons, on <a href="https://heardlibrary.github.io/digital-scholarship/lod/wikibase/sdoc/" target="_blank">a webpage that I made for the Wikibase Working Hour presentation</a>.<br /></p><h2 style="text-align: left;">Acquiring Structured Data on Commons from the Wikimedia Commons Query Service<br /></h2><p>A lot of people know about the Wikidata Query Service (WQS), which can be used to query Wikidata using SPARQL. Fewer people know about the <a href="https://commons.wikimedia.org/wiki/Commons:SPARQL_query_service" target="_blank">Wikimedia Commons Query Service</a> (WCQS) because it's newer and interests a narrower audience. You can access the WCQS at <a href="https://commons-query.wikimedia.org/">https://commons-query.wikimedia.org/</a>. It is still under development and is a bit fragile, so it is sometimes down or undergoing maintenance. </p><p>If you are working with SDoC, the WCQS is a very effective way to retrieve information about the current state of the structured data. For example, it's a very simple query to discover all media items that depict a particular item, as shown in the example below. There are quite a few <a href="https://commons.wikimedia.org/wiki/Commons:SPARQL_query_service/queries/examples" target="_blank">examples of queries</a> that you can run to get a feel for how the WCQS might be used.<br /></p><p>It is actually quite easy to query the Wikidata Query Service programmatically, but there are additional challenges to using the WCQS
because it requires authentication. I have struggled through reading the developer instructions for accessing the WCQS endpoint via Python and the result is <a href="https://github.com/HeardLibrary/linked-data/blob/master/commonsbot/wcqs/wcqs_query.py" target="_blank">functions and example code</a> that you can use to query the WCQS in your Python scripts. <b>One important warning: the authentication is done by setting a cookie on your computer. </b>So you must be careful not to save this cookie in any location that will be exposed, such as in a GitHub repository. Anyone who gets a copy of this cookie can act as if they were you until the cookie is revoked. To avoid this, the script saves the cookie in your home directory by default. <br /></p><p>The code for querying is very simple with the functions I provide:</p><pre><code>user_agent = 'TestAgent/0.1 (mailto:username@email.com)' # put your own script name and email address here
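# Assumes "import json" plus the init_session(), retrieve_cookie_string(), and Sparqler definitions from the wcqs_query.py file linked above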
endpoint_url = 'https://commons-query.wikimedia.org/sparql'
session = init_session(endpoint_url, retrieve_cookie_string())
wcqs = Sparqler(useragent=user_agent, endpoint=endpoint_url, session=session)</code></pre><pre><code>query_string = '''PREFIX sdc: <https://commons.wikimedia.org/entity/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT DISTINCT ?depicts WHERE {
sdc:M113161207 wdt:P180 ?depicts.
}'''</code></pre><p></p><pre><code>data = wcqs.query(query_string)
print(json.dumps(data, indent=2))</code></pre><p></p><p>The query is set in the multi-line string assigned in the line that begins <span style="font-family: courier;">query_string =</span>. One thing to notice is that in WCQS queries, you must define the prefixes <span style="font-family: courier;">wdt:</span> and <span style="font-family: courier;">wd:</span> using <span style="font-family: courier;">PREFIX</span> statements in the query prologue. Those prefixes can be used in WQS queries without making <span style="font-family: courier;">PREFIX</span> statements. In addition, you must define the Commons-specific <span style="font-family: courier;">sdc:</span> prefix and use it with M IDs. </p><p>This particular query simply retrieves all of the depicts statements that we created in the example above for <code>M113161207 </code>. The resulting JSON is</p><pre><code>[
{
"depicts": {
"type": "uri",
"value": "http://www.wikidata.org/entity/Q103304813"
}
},
{
"depicts": {
"type": "uri",
"value": "http://www.wikidata.org/entity/Q302"
}
},
{
"depicts": {
"type": "uri",
"value": "http://www.wikidata.org/entity/Q345"
}
},
{
"depicts": {
"type": "uri",
"value": "http://www.wikidata.org/entity/Q40662"
}
}</code><code>,
{
"depicts": {
"type": "uri",
"value": "http://www.wikidata.org/entity/Q235849"
}
}</code><code>
]</code></pre><p></p><p>The Q IDs can easily be extracted from these results using a list comprehension:</p><p><span style="font-family: courier;"> qids = [ item['depicts']['value'].split('/')[-1] for item in data ]</span></p><p>resulting in this list:</p><p><span style="font-family: courier;">['Q103304813', 'Q302', 'Q345', 'Q40662', 'Q235849']</span><br /> <br /></p><p>Comparison with the example table shows the same four Q IDs that we wrote to the API, plus the depicts value for the artwork (Q103304813) that was created by CommonsTool when the media file was uploaded. When adding new depicts statements, having this information about the ones that already exist can be critical to avoid creating duplicate statements.<br /></p><p>For more details about how the code works, see the <a href="https://heardlibrary.github.io/digital-scholarship/lod/wikibase/sdoc/" target="_blank">informational web page</a> I made for the Wikibase Working Hour presentation.</p><h2 style="text-align: left;">Conclusion<br /></h2><p>I hope that this code will help make it possible to ramp up the rate at which we can add depicts statements to Wikimedia Commons media files. In the Vanderbilt Libraries, we are currently experimenting with using Google Cloud Vision to do object detection and we would like to combine that with artwork title analysis to be able to partially automate the process of describing what is depicted in the <a href="https://www.library.vanderbilt.edu/gallery/" target="_blank">Vanderbilt Fine Arts Gallery</a> works whose <a href="https://www.wikidata.org/wiki/Wikidata:WikiProject_Vanderbilt_Fine_Arts_Gallery" target="_blank">images have been uploaded to Commons</a>. I plan to report on that work in a future post.</p><p><br /></p><p><br /></p><p><br /></p><p><br /></p><br />Steve Baskaufhttp://www.blogger.com/profile/01896499749604153763noreply@blogger.com0tag:blogger.com,1999:blog-5299754536670281996.post-65987266635060212162022-09-07T05:21:00.003-07:002023-04-10T15:06:50.666-07:00CommonsTool: A script for uploading art images to Wikimedia Commons<p> </p><div class="separator" style="clear: both; text-align: center;"><a href="https://upload.wikimedia.org/wikipedia/commons/thumb/6/64/A_Ghost_Painting_Coming_to_Life_in_the_Studio_of_the_Painter_Oky%C5%8D%2C_from_the_series_Yoshitoshi_ryakuga_(Sketches_by_Yoshitoshi)_-_Vanderbilt_Fine_Arts_Gallery_-_1992.083.tif/lossy-page1-782px-thumbnail.tif.jpg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="599" data-original-width="782" height="490" src="https://upload.wikimedia.org/wikipedia/commons/thumb/6/64/A_Ghost_Painting_Coming_to_Life_in_the_Studio_of_the_Painter_Oky%C5%8D%2C_from_the_series_Yoshitoshi_ryakuga_(Sketches_by_Yoshitoshi)_-_Vanderbilt_Fine_Arts_Gallery_-_1992.083.tif/lossy-page1-782px-thumbnail.tif.jpg" width="640" /></a></div><p><span style="font-size: x-small;"><span style="font-family: arial;"><i>A Ghost Painting Coming to Life in the Studio of the Painter Okyō, from the series Yoshitoshi ryakuga (Sketches by Yoshitoshi).</i> 1882 print by Tsukioka Yoshitoshi. Vanderbilt University Fine Arts Gallery 1992.083 via <a href="https://commons.wikimedia.org/wiki/File:A_Ghost_Painting_Coming_to_Life_in_the_Studio_of_the_Painter_Oky%C5%8D,_from_the_series_Yoshitoshi_ryakuga_(Sketches_by_Yoshitoshi)_-_Vanderbilt_Fine_Arts_Gallery_-_1992.083.tif" target="_blank">Wikimedia Commons</a>. 
Wikidata item <a href="https://www.wikidata.org/wiki/Q102961245" target="_blank">Q102961245</a></span></span></p><p>For several years, I've been working with the <a href="https://www.library.vanderbilt.edu/gallery/" target="_blank">Vanderbilt Fine Arts Gallery</a> staff to create and improve <a href="https://www.wikidata.org/" target="_blank">Wikidata</a> items for the approximately 7000 works in the Gallery collection through the <a href="https://www.wikidata.org/wiki/Wikidata:WikiProject_Vanderbilt_Fine_Arts_Gallery" target="_blank">WikiProject Vanderbilt Fine Arts Gallery</a>. In the past year, I've been focused on creating a Python script to streamline the process of uploading images of Public Domain works in the collection to Wikimedia Commons, where they will be freely available for use. I've just completed work on that script, which I've called CommonsTool, and have used it to upload over 1300 images (covering about 20% of the collection and most of the Public Domain artworks that have been imaged). </p><p>In this post, I'll begin by describing some of the issues I dealt with and how they resulted in features of the script. I will conclude by outlining briefly how the script works.<br /></p><p>The script is freely available for use and there are <a href="https://github.com/HeardLibrary/linked-data/blob/master/commonsbot/README.md" target="_blank">detailed instructions on GitHub for configuring and using it</a>. Although it's designed to be usable in contexts other than the Vanderbilt Gallery, it hasn't been tested thoroughly in those circumstances. So if you try using it, <a href="mailto:steve.baskauf@vanderbilt.edu">I'd like to hear about your experience</a>. </p><h1 style="text-align: left;">Wikidata, Commons, and structured data</h1><p>If you have ever worked with editing metadata about art-related media in Wikimedia Commons, you are probably familiar with the various templates used to describe the metadata on the file page using Wiki syntax. Here's an example:</p><p><span style="font-size: x-small;"><span style="font-family: courier;">=={{int:filedesc}}==<br />{{Artwork<br /> |artist = {{ Creator | Wikidata = Q3695975 | Option = {{{1|}}} }}<br /> |title = {{en|'''Lake George'''.}}<br /> |description = {{en|1=Lake George, painting by David Johnson}}<br /> |depicted people =<br /> |depicted place =<br /> |date = <br /> |medium = {{technique|oil|canvas}}<br /> |dimensions = {{Size|in|24.5|19.5}}<br /> |institution = {{Institution:Vanderbilt University Fine Arts Gallery}}<br /> |references = {{cite web |title=Lake George |url=https://library.artstor.org/#/asset/26754443 |accessdate=30 November 2020}}<br /> |source = Vanderbilt University Fine Arts Gallery<br /> |other_fields =<br />}}<br /><br />=={{int:license-header}}==<br />{{PD-Art|PD-old-100-expired}}<br /><br />[[Category:Vanderbilt University Fine Arts Gallery]]</span></span><br /></p><p>These templates are complicated to create and difficult to edit by automated means. In recognition of this, the Commons community has been moving towards storing metadata about the media files as structured data ("<a href="https://commons.wikimedia.org/wiki/Commons:Structured_data" target="_blank">Structured Data on Commons</a>", SDC). When media files depict artwork, the preference is to describe the artwork metadata in Wikidata rather than as wikitext on the Commons file page (as shown in the example above). 
</p><p>In July, Sandra Fauconnier gave a presentation at an ARLIS/NA (Art Libraries Society of North America) Wikidata group meeting that was extremely helpful for improving my understanding of the best practices for expressing metadata about visual artworks in Wikimedia Commons. She provided a link to <a href="https://commons.wikimedia.org/wiki/Commons:Structured_data/Modeling/Visual_artworks" target="_blank">a very useful reference page</a> (still under construction as of September 2022) to which I referred while working on my script. </p><p>The CommonsTool script has been designed around two key features for simplifying management of the media and artwork metadata. The first is two very simple wikitexts: one for two-dimensional artwork and another for three-dimensional artwork. The 2D wikitext looks like this:</p><p><span style="font-family: courier;">=={{int:filedesc}}==<br />{{Artwork<br /> |source = Vanderbilt University<br />}}<br /><br />=={{int:license-header}}==<br />{{PD-Art|PD-old-100-expired}}<br /><br />[[Category:Vanderbilt University Fine Arts Gallery]]</span></p><p>and the 3D wikitext looks like this:</p><p><span style="font-family: courier;">=={{int:filedesc}}==<br />{{Art Photo<br /> |artwork license = {{PD-old-100-expired}}<br /> |photo license = {{Cc-by-4.0 |1=photo © [https://www.vanderbilt.edu/ Vanderbilt University] / [https://www.library.vanderbilt.edu/gallery/ Fine Arts Gallery] / [https://creativecommons.org/licenses/by/4.0/ CC BY 4.0]}}<br />}}<br /><br />[[Category:Vanderbilt University Fine Arts Gallery]]</span></p><p>By comparison with the wikitext in the first example, this is clearly much simpler, but also has the advantage that there is very little metadata in the wikitext itself that might need to be updated.</p><p>The second key feature involves using SDC to link the media file to the Wikidata item for the artwork. Here's an example for the work shown at the top of this post:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhTJcfAvA3BRUervPBURth65cJBAlq-KHPe_lilGiwGvdpnJqzzgltdnNpzKqQTnP0PwKXcXZ9x9btRyeRzLrye6kdx2o8Yu1B3pbtJOGta_XSMrjRfW7zvRUxL41SNGtk4tFWAhbdPPnWDw8E4JF__A0EQU_ORpv-kAF1ZzheW6ObNLzywHd98aIx9/s739/sdc_example.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="636" data-original-width="739" height="550" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhTJcfAvA3BRUervPBURth65cJBAlq-KHPe_lilGiwGvdpnJqzzgltdnNpzKqQTnP0PwKXcXZ9x9btRyeRzLrye6kdx2o8Yu1B3pbtJOGta_XSMrjRfW7zvRUxL41SNGtk4tFWAhbdPPnWDw8E4JF__A0EQU_ORpv-kAF1ZzheW6ObNLzywHd98aIx9/w640-h550/sdc_example.png" width="640" /></a></div><br /><p></p><p>In order for this strategy to work, for all artwork images the depicts (P180) and main subject (P921) values must be set to the artwork's Wikidata item (in this case <a href="https://www.wikidata.org/wiki/Q102961245" target="_blank">Q102961245</a>). Two dimensional artwork images should also have a "digital representation of" (P6243) value with the artwork's Wikidata item. When these claims are created, the Wikidata metadata will "magically" populate the file information summary without entering it into a wikitext template. 
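<p>CommonsTool makes these structured data edits for you (more on that below), but as a rough sketch of what a single claim of this kind involves at the API level, here is what a depicts (P180) edit could look like using the generic Wikibase <span style="font-family: courier;">wbcreateclaim</span> action with Python's requests module. The M ID is a placeholder, the Q ID is the artwork shown at the top of this post, and the snippet assumes you already have a logged-in session and a CSRF token:</p><pre><code>import json
import requests

# Sketch only: the session is assumed to already be authenticated to Commons
# (for example with a bot password), and the token below must be a real CSRF token.
session = requests.Session()
csrf_token = 'replace with a CSRF token obtained from the API'

parameters = {
    'action': 'wbcreateclaim',
    'entity': 'M12345678',   # placeholder M ID of the uploaded media file
    'property': 'P180',      # depicts; P921 and P6243 claims are made the same way
    'snaktype': 'value',
    'value': json.dumps({'entity-type': 'item', 'numeric-id': 102961245}),  # Q102961245, the artwork
    'token': csrf_token,
    'format': 'json'
}
response = session.post('https://commons.wikimedia.org/w/api.php', data=parameters)
print(response.json())</code></pre><p>In practice, CommonsTool handles the authentication, rate limiting, and error handling around edits like this, as described later in this post.</p>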
</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg0CWJu_P5cCp8BxRccJES-8AGOxq7WDKAhUNSiCMDC98xtRVfBLW3lN5pqT9jBX5fnQUi4-MG603Z9jBWginTUuwKxR-cwNkbvUXTD8pGpAlt-iOq3Ni_gXB4bDCw78j3Ky_eKHZ-MBf-_87BVAMxZon5_UqrBY4zXsqIPfZ11kGBkhCzlP1KOHCmw/s852/example_table.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="387" data-original-width="852" height="290" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg0CWJu_P5cCp8BxRccJES-8AGOxq7WDKAhUNSiCMDC98xtRVfBLW3lN5pqT9jBX5fnQUi4-MG603Z9jBWginTUuwKxR-cwNkbvUXTD8pGpAlt-iOq3Ni_gXB4bDCw78j3Ky_eKHZ-MBf-_87BVAMxZon5_UqrBY4zXsqIPfZ11kGBkhCzlP1KOHCmw/w640-h290/example_table.png" width="640" /></a></div><p></p><p>The great advantage here is that when metadata are updated on Wikidata, they automatically are updated in Commons as well.</p><h1 style="text-align: left;">Copyright and licensing issues</h1><p>One of the complicating issues that had slowed me down in developing the script was to figure out how to handle copyright and licensing issues. The images we are uploading depict old artwork that is out of copyright, but what about copyright of the images of the artwork? The Wikimedia Foundation <a href="https://commons.wikimedia.org/wiki/Commons:When_to_use_the_PD-Art_tag" target="_blank">takes the position</a> that faithful photographic reproductions of old two-dimensional artwork lack originality and are therefore not subject to copyright. However, images of three-dimensional works can involve creativity, so those images must be usable under an open license acceptable for Commons uploads.</p><h3 style="text-align: left;">Wikitext tags <br /></h3><p>Unlike other metadata properties about a media item, the copyright and licensing details cannot (as of September 2022) be expressed only in SDC. They must be explicitly included in the file page's wikitext. </p><p> As shown in the example above, I used the license tags</p><p><span style="font-family: courier;">{{PD-Art|PD-old-100-expired}}</span></p><p>for 2D artwork. The <span style="font-family: courier;"><span style="font-family: inherit;"></span>PD-Art</span> tag asserts that the image is not copyrightable for the reason given above and <span style="font-family: courier;">PD-old-100-expired</span> asserts that the artwork is not under copyright because it is old. When these tags are used together, they are rendered on the file page like this:</p><p></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVLG0CDmdReHfggoQ0MV_OPHZ-WaGlDmQLyHQxWfYl_Zcu4h65w4b2YBBU_n7PP1TtMFdC9Tw7t71UU3m9O0F0LL7LawZZYZU8dXNFo_6Vj53SlSnJqxeQrsZx1Y7lmrbnc5urOS9LEaiLNTBJmM5eJ7AeH1XeZJMGTTaHTMxAqmNsoJusFndWqjHM/s916/pd_art_rendered.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="300" data-original-width="916" height="210" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVLG0CDmdReHfggoQ0MV_OPHZ-WaGlDmQLyHQxWfYl_Zcu4h65w4b2YBBU_n7PP1TtMFdC9Tw7t71UU3m9O0F0LL7LawZZYZU8dXNFo_6Vj53SlSnJqxeQrsZx1Y7lmrbnc5urOS9LEaiLNTBJmM5eJ7AeH1XeZJMGTTaHTMxAqmNsoJusFndWqjHM/w640-h210/pd_art_rendered.png" width="640" /></a></div><br />The example above for 3D artworks uses separate license tags for the artwork and the photo. 
The artwork license is <span style="font-family: courier;">PD-old-100-expired</span> as before, and the photo license I used was <br /><p></p><p><span style="font-family: courier;">{{Cc-by-4.0 |1=photo ©
[https://www.vanderbilt.edu/ Vanderbilt University] /
[https://www.library.vanderbilt.edu/gallery/ Fine Arts Gallery] /
[https://creativecommons.org/licenses/by/4.0/ CC BY 4.0]}}</span></p><p>There are a number of <a href="https://commons.wikimedia.org/wiki/Commons:Licensing#License_information" target="_blank">possible licenses</a> that can be used for both the photo and artwork and they can be set in the <a href="https://github.com/HeardLibrary/linked-data/blob/master/commonsbot/commonstool_config.yml" target="_blank">CommonsTool configuration file</a>. Since the <a href="https://commons.wikimedia.org/wiki/Template:Cc-by-4.0" target="_blank">CC BY license</a> requires attribution, I used the explicit <a href="https://commons.wikimedia.org/wiki/Commons:Credit_line#Creative_Commons" target="_blank">credit line</a> feature to make clear that it's the photo (not the artwork) that's under copyright and to provide links to Vanderbilt University (the copyright holder) and the Fine Arts Gallery. Here's how these tags are rendered on the <a href="https://commons.wikimedia.org/wiki/File:Running_Bear_Effigy_-_Vanderbilt_Fine_Arts_Gallery_-_1979.0419P.JPG" target="_blank">file page of an image of a 3D artwork</a>:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8KLNpL_oy9DrEAun0KpxEHFxHele0qY-Mlh-Ck0eAOnhtM8fv5qbMrS7FY-gIaJAh4MBGyXh-L0rwzLsf06_ROoW1HnPy4N6zCf6YUEsu-eHc1LYnbllYOT3Aaq1C4DOnXNf0BkSCf0HC3b0BQM9Sm25KmpiK6SHpoUPqeEBzvk0X6fcsQhsqoqm7/s1008/dual_license_example.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1008" data-original-width="889" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8KLNpL_oy9DrEAun0KpxEHFxHele0qY-Mlh-Ck0eAOnhtM8fv5qbMrS7FY-gIaJAh4MBGyXh-L0rwzLsf06_ROoW1HnPy4N6zCf6YUEsu-eHc1LYnbllYOT3Aaq1C4DOnXNf0BkSCf0HC3b0BQM9Sm25KmpiK6SHpoUPqeEBzvk0X6fcsQhsqoqm7/w564-h640/dual_license_example.png" width="564" /></a></div><br />Using the format<p></p><p><span style="font-family: courier;">{{Art Photo<br /> |artwork license = {{artLicenseTag}}<br /> |photo license = {{photoLicenseTag}}<br />}}</span></p><p>in the wikitext is great because it creates separate boxes that clarify that the permissions for the artwork are distinct from the permissions for the photo of the artwork.<br /></p><h3 style="text-align: left;">Structured data about licensing</h3><p>As noted previously, it's required to include copyright and licensing information in the page wikitext. However, file pages must also have certain structured data claims related to the file creator, copyright, and licensing or they will be flagged.</p><p>In the case of 2D images where the <span style="font-family: courier;">PD-Art</span> tag was used, there should be a "digital representation of" (P6243) claim where the value is the Q ID of the Wikidata item depicted in the media file. </p><p>In the case of 3D images, they should not have a P6243 claim, but should have values for copyright status (P6216) and copyright license (P275). If under copyright, they should also have values for creator (P170, i.e. photographer) and inception (P571) date so that it can be determined to whom attribution should be given and when the copyright may expire. Keep in mind that for artwork SDC metadata is generally about the media file and not the depicted thing. So similar information about the depicted artwork would be expressed in the Wikidata item about the artwork, not in SDC. 
</p><p>Although not required when the <span style="font-family: courier;">PD-Art</span> tag is used, it's a good idea to include the creator (photographer) and inception date of the image in the SDC metadata for 2D works. It's not yet clear to me whether a copyright status value should be provided. I suppose so, but if it's directly asserted in the SDC that the work is in the Public Domain, you are supposed to use a qualifier to indicate the reason, and I'm not sure what value would be used for that. I haven't seen any examples illustrating how to do that, so for now, I've omitted it.</p><p>To see examples of how this looks in practice, see <a href="https://commons.wikimedia.org/wiki/File:A_Ghost_Painting_Coming_to_Life_in_the_Studio_of_the_Painter_Oky%C5%8D,_from_the_series_Yoshitoshi_ryakuga_(Sketches_by_Yoshitoshi)_-_Vanderbilt_Fine_Arts_Gallery_-_1992.083.tif" target="_blank">this example for 2D</a> and this <a href="https://commons.wikimedia.org/wiki/File:Running_Bear_Effigy_-_Vanderbilt_Fine_Arts_Gallery_-_1979.0419P.JPG" target="_blank">example for 3D</a>. After the page loads, click on the Structured Data tab below the image.</p><h1 style="text-align: left;">What the script does: the Commons upload</h1><p></p><p>The Commons upload takes place in three stages. </p><p>First, CommonsTool acquires necessary information about the artwork and the image from CSV tables. One key piece of information is which image or images to be uploaded to Commons are associated with a particular artwork (represented by a single Wikidata item). The main link from Commons to Wikidata is made using a depicts (P180) claim in the SDC and the link from Wikidata to Commons is made using an image (P18) claim.<br /></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiI5iJIlt1Mku6_vU13Gy-xrqMaVqVKB5U1WsLbnnxT9FlYgKnBCpKRy22o8SiZdLwQo7_JFLhtTR0zZ6RB7MgZwjfvZHN0sPAxaZviNE2xGwHAHwfvBY9fQaiYxBDPp3VwZAoeHw5kGdWB3-Cqm4Lhlksxy_s9Eb6XJZ1tQgyZIhKvd-qRpJjfUsNX/s1080/commons_wikidata.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="643" data-original-width="1080" height="382" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiI5iJIlt1Mku6_vU13Gy-xrqMaVqVKB5U1WsLbnnxT9FlYgKnBCpKRy22o8SiZdLwQo7_JFLhtTR0zZ6RB7MgZwjfvZHN0sPAxaZviNE2xGwHAHwfvBY9fQaiYxBDPp3VwZAoeHw5kGdWB3-Cqm4Lhlksxy_s9Eb6XJZ1tQgyZIhKvd-qRpJjfUsNX/w640-h382/commons_wikidata.png" width="640" /></a></div><p></p><p style="text-align: center;"><span style="font-size: x-small;"><span style="font-family: arial;"><i>Miriam</i> by Anselm Feuerbach. Public Domain via <a href="https://commons.wikimedia.org/wiki/File:Feuerbach_Mirjam_2.jpg" target="_blank">Wikimedia Commons </a></span></span><br /></p><p>It is important to know whether more than one image is associated with the artwork. In the source CSV data about images, the image to be linked from Wikidata is designated as "primary" and additional images are designated as "secondary". 
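<p>As a tiny sketch of how that designation gets used (the rows and column names here are invented for illustration and are not CommonsTool's actual input format), a script can link every image to the artwork with depicts (P180) but reserve the image (P18) link for the primary image, as discussed next:</p><pre><code># Hypothetical rows describing two images of the same artwork (the manuscript
# leaf example discussed below); the column names are invented for illustration.
image_rows = [
    {'local_file': '1970.040_recto_003.tif', 'artwork_qid': 'Q103304554', 'rank': 'primary'},
    {'local_file': '1970.040_verso_004.tif', 'artwork_qid': 'Q103304554', 'rank': 'secondary'}
]

for row in image_rows:
    # Every uploaded image gets a depicts (P180) link from Commons back to the artwork ...
    print(f"P180: {row['local_file']} depicts {row['artwork_qid']}")
    # ... but only the primary image is used for the image (P18) claim on the Wikidata item.
    if row['rank'] == 'primary':
        print(f"P18 image for {row['artwork_qid']}: {row['local_file']}")</code></pre>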
</p><p> </p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibquMjv9R844jHlHBvyQoxsLdINHgipyjKC658rq5ULnZEcx4dJFjf3gyEXiDzXkxCbC1yt5TFOGxlR2a9vTejfhidQA5FhDIqwRozCKm8GH-GxJhPVhLT57UtwhgB4OGiHj_i9d6SzCzhWBUeqFrhPiLqsHUJ8vd9KClbATDOX8PJdh1Dn8UEWPZT/s655/image_depicts.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="655" data-original-width="584" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibquMjv9R844jHlHBvyQoxsLdINHgipyjKC658rq5ULnZEcx4dJFjf3gyEXiDzXkxCbC1yt5TFOGxlR2a9vTejfhidQA5FhDIqwRozCKm8GH-GxJhPVhLT57UtwhgB4OGiHj_i9d6SzCzhWBUeqFrhPiLqsHUJ8vd9KClbATDOX8PJdh1Dn8UEWPZT/w570-h640/image_depicts.png" width="570" /></a></div><p></p><p>Both primary and secondary images will be linked from Commons to Wikidata using a depicts (P180) claim, but it's probably best for only the primary image to be linked from Wikidata using an image (P18) claim. <a href="https://commons.wikimedia.org/wiki/File:Leaf_from_Italian_Book_of_Hours_-_recto_-_Vanderbilt_Fine_Arts_Gallery_-_1970.040_recto_003.tif" target="_blank">Here is an example of a primary image page in Commons</a> and <a href="https://commons.wikimedia.org/wiki/File:Leaf_from_Italian_Book_of_Hours_-_verso_-_Vanderbilt_Fine_Arts_Gallery_-_1970.040_verso_004.tif" target="_blank">here is an example of a secondary image page in Commons</a>. Notice that the <a href="https://www.wikidata.org/wiki/Q103304554" target="_blank">Wikidata page for the artwork</a> only displays the primary image. <br /></p><p>The CommonsTool script also constructs a descriptive Commons filename for the image using the Wikidata label, any sub-label particular to one of multiple images, the institution name, and the unique local filename. There are a number of characters that aren't allowed, so CommonsTool tries to find them and replace them with valid characters. </p><p>The script also performs a number of optional screens based on copyright status and file size. It can skip images deemed to be too small and will also skip images whose file size exceeds the API limit of 100 MB. (See the configuration file for more details.)</p><p> The second stage is to upload the media file and the file page wikitext via the Commons API. Commons guidelines state that the rate of file upload should not be greater than one upload per 5 seconds, so the script introduces a delay if necessary to avoid exceeding this rate. If successful, the script moves on to the third stage and if not, it logs an error and moves to the next media item.</p><p>In the third stage, SDC claims are written to the API in a manner similar to how claims are written to Wikidata. The claims upload function respects the maxlag errors from the server and delays the upload if the server is lagged due to high usage (although this rarely seems to happen). 
If the SDC upload fails, it logs an error, but the script continues in order to record the results of the media upload in the existing uploads CSV file.</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEilWTkcgmMQf129ay85Rs3pHrp4xF_0UKR2Ze9MDZESO13VdPfbXAAgGhCm8Ka6dRXECYLNvXmdtWy-_VS-Gy4zi-yExjPdK9sxq1Katc_eJwd_-MlDDXfGLppqHOrTbxNA8jrrIXDPN5Nsx6Sc9qJ_Jlctc-hfBqQxwhE8HvuqA9uWjx4ZXHuYVpgH/s767/wikidata_link.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="334" data-original-width="767" height="278" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEilWTkcgmMQf129ay85Rs3pHrp4xF_0UKR2Ze9MDZESO13VdPfbXAAgGhCm8Ka6dRXECYLNvXmdtWy-_VS-Gy4zi-yExjPdK9sxq1Katc_eJwd_-MlDDXfGLppqHOrTbxNA8jrrIXDPN5Nsx6Sc9qJ_Jlctc-hfBqQxwhE8HvuqA9uWjx4ZXHuYVpgH/w640-h278/wikidata_link.png" width="640" /></a></div><br /> The links from the Commons image(s) to Wikidata are made using SDC statements, which results in a hyperlink in the file summary (the tiny Wikidata flag). However, the link in the other direction doesn't get made by CommonsTool. <p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhHEnvW6NqHD3zGvVRgW7K0sI-ebdaNt9DTWndhGNhBiFF91o9GjXxwODfFWbIM9_8a7-LSVXjkQOmfmMeuQhavnrad0ng0sV4VehOQhZ91FMnjvJrjbflyIHVrlnIvrUoeZKttUFxqv0m6Ntkb2l3nltkS2Bm3cMZqzRf9XKf40rgZnKWluq4LTucX/s1239/image_metadata_record.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="128" data-original-width="1239" height="66" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhHEnvW6NqHD3zGvVRgW7K0sI-ebdaNt9DTWndhGNhBiFF91o9GjXxwODfFWbIM9_8a7-LSVXjkQOmfmMeuQhavnrad0ng0sV4VehOQhZ91FMnjvJrjbflyIHVrlnIvrUoeZKttUFxqv0m6Ntkb2l3nltkS2Bm3cMZqzRf9XKf40rgZnKWluq4LTucX/w640-h66/image_metadata_record.png" width="640" /></a></div><p>The CSV file where existing uploads are recorded contains an image_name column, and the values in that column for "primary" images can be used as values for the image (P18) property on the corresponding Wikidata artwork item page. After creating that claim, the primary image will be displayed on the artwork's Wikidata page:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgl1nILQ1F1W0P1fEObCalzV6Hj3SVxHZOYsW8vWnmjO9rkTWm1nL8erdWStpCAvgXab7p_NKrH_XCPZuVlPgEeWEAJA0UbEIX39jSeZ6JjTNHx2f_bgVJMjZQaGUkNdSXYgoTZA1EQmzN9vVRqgKRnIW5Vn08gHAb6KSKeZclPY6JylNbYhm063Yx8/s785/wikidata_image_example.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="766" data-original-width="785" height="624" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgl1nILQ1F1W0P1fEObCalzV6Hj3SVxHZOYsW8vWnmjO9rkTWm1nL8erdWStpCAvgXab7p_NKrH_XCPZuVlPgEeWEAJA0UbEIX39jSeZ6JjTNHx2f_bgVJMjZQaGUkNdSXYgoTZA1EQmzN9vVRqgKRnIW5Vn08gHAb6KSKeZclPY6JylNbYhm063Yx8/w640-h624/wikidata_image_example.png" width="640" /></a></div><p>Making this link manually can be tedious, so there is <a href="https://github.com/HeardLibrary/linked-data/blob/master/commonsbot/transfer_to_vanderbot.py" target="_blank">a script that will automatically transfer these values</a> into the appropriate column of a CSV file that is set up to be used by <a href="http://vanderbi.lt/vanderbot" target="_blank">the VanderBot script</a> to upload data to Wikidata. 
In production, I have <a href="https://github.com/HeardLibrary/linked-data/blob/master/commonsbot/upload_artwork.sh" target="_blank">a shell script</a> that runs CommonsTool, then the transfer script, followed by VanderBot. Once that shell script has finished running, the image claim will be present on the appropriate Wikidata page.</p><h1 style="text-align: left;">International Image Interoperability Framework (IIIF) functions </h1><p>One of our goals at the <a href="https://www.library.vanderbilt.edu/" target="_blank">Vanderbilt Libraries</a> (of which the Fine Arts Gallery is part) is to develop the infrastructure to support serving images using the International Image Interoperability Framework (IIIF). To that end, we've set up a <a href="https://cantaloupe-project.github.io/" target="_blank">Cantaloupe image server</a> on Amazon Web Services (AWS). The setup details are way beyond the scope of this blog post, but now that we have this capability, we want to make the images that we've uploaded to Commons also available as zoomable high-resolution images via our IIIF server. </p><p>For that reason, the CommonsTool script also has the capacity to upload images to the IIIF server storage (an AWS bucket) and to generate manifests that can be used to view those images. The IIIF functionalities are independent of the Commons upload capabilities -- either can be turned on or off. However, for my workflow, I do the IIIF functions immediately after the Commons upload so that I can use the results in Wikidata as I'll describe later. </p><h3 style="text-align: left;">Source images <br /></h3><p>One of the early things that I learned when experimenting with the server is that you don't want to upload large, raw TIFF files (i.e. greater than 10 MB). When a IIIF viewer tries to display such a file, it has to load the whole file, even if the screen area is much smaller than the entire TIFF would be if displayed at full resolution. This takes an incredibly long time, making viewing of the files very annoying. The solution to this is to convert the TIFF files into tiled pyramidal TIFFs. </p><p>When I view one of these files using Preview on my Mac, it becomes apparent why they are called "pyramidal". The TIFF file doesn't contain a single image. Rather, it contains a series of images that are increasingly small. 
If I click on the largest of the images (number 1), I see this:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5Xz6_bAy1P0SX-ysuH2uiOR0hB9i48R49fiV3W-EYjDOd35GeEAS9O8MgCXbR-naazQhcBUKrvkl4JBBkrOcMpb8HN0_DwRwVVExhWAuwZqFwNeBairvJRnK7MogZOECkL5xeI2OqaeWQ_AYzeHPbyJWvDAFLnjqWgZQC38eD9Qcp_IvaEmL63qI9/s1093/pyramid_big.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="808" data-original-width="1093" height="296" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5Xz6_bAy1P0SX-ysuH2uiOR0hB9i48R49fiV3W-EYjDOd35GeEAS9O8MgCXbR-naazQhcBUKrvkl4JBBkrOcMpb8HN0_DwRwVVExhWAuwZqFwNeBairvJRnK7MogZOECkL5xeI2OqaeWQ_AYzeHPbyJWvDAFLnjqWgZQC38eD9Qcp_IvaEmL63qI9/w400-h296/pyramid_big.png" width="400" /></a></div><p> </p><p>and if I click on a smaller version (number 3), I see this:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5PTBWztkKyapcdgyyGDOCNqllMzdTaMyEE8eWRUiyjddmai1BHFXsvOewh8UUaceUVlBo114Dh5JIXIrU6Z9cin3xaxW6ZfNJ-wDwAuHK0eguq22txpqPjg5McfhHK6tYODGl99ZJQcxz2Q9-gNqupm7mrfpGsgQr6SujOz4PHolaEjJ25VxCw5JO/s1093/pyramid_small.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="808" data-original-width="1093" height="296" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5PTBWztkKyapcdgyyGDOCNqllMzdTaMyEE8eWRUiyjddmai1BHFXsvOewh8UUaceUVlBo114Dh5JIXIrU6Z9cin3xaxW6ZfNJ-wDwAuHK0eguq22txpqPjg5McfhHK6tYODGl99ZJQcxz2Q9-gNqupm7mrfpGsgQr6SujOz4PHolaEjJ25VxCw5JO/w400-h296/pyramid_small.png" width="400" /></a></div><br /><p>If you think of the images as being stacked with the smaller ones on top of the larger ones, you can envision a pyramid. </p><p>When a client application requests an image from the IIIF server, the server looks through the images in the pyramid to find the smallest one that will fill up the viewer and sends that. If the viewer zooms in on the image, requiring greater resolution, the server will not send all of the next larger image. Since the images in the stack are tiled, it will only send the particular tiles from the larger, higher resolution image that will actually be seen in the viewer. The end result is that the tiled pyramidal TIFFs load much faster because the IIIF server is smart and doesn't send any more information than is necessary to display what the user wants to see.</p><p>The problem that I faced was how to automate the process of generating a large number of these tiled pyramidal TIFFs. After thrashing with various Python libraries, I finally ended up using the command line tool ImageMagick and calling it from a Python script using the <span style="font-family: courier;">os.system()</span> function. The script I used is <a href="https://github.com/HeardLibrary/linked-data/blob/master/commonsbot/convert_to_pyramidal_tiled_tiff.ipynb" target="_blank">available on GitHub</a>. </p><p>Because the Fine Arts Gallery has been working on imaging their collection for over 20 years, the source images that I'm using are in a variety of formats and sizes (hence the optional size screening criteria in the script to filter out images that have too low resolution). The newer images are high resolution TIFFs, but many of the older images are JPEGs or PNGs. 
So one task of the IIIF server upload part of the CommonsTool script is to sort out whether to pull the files from the directory where the pyramidal TIFFs are stored, or the directory where the original images are stored. </p><p>Once the locations of the correct images are identified, the script uses the <span style="font-family: courier;">boto3</span> module (the AWS software development kit, or SDK) to initiate the upload to the S3 bucket as part of the Python script. I won't go into the details of setting up and using credentials, as that is described well in the AWS documentation. </p><p>Once the file is uploaded, it can be directly accessed using a URL constructed according to the IIIF Image API standard. Here's a URL you can play with:</p><p><a href="https://iiif.library.vanderbilt.edu/iiif/3/gallery%2F1992%2F1992.083.tif/full/!400,400/0/default.jpg">https://iiif.library.vanderbilt.edu/iiif/3/gallery%2F1992%2F1992.083.tif/full/!400,400/0/default.jpg</a><br /></p><p>If you adjust the URL (for example replacing the 400s with different numbers) according to the <a href="https://iiif.io/api/image/2.0/#size" target="_blank">API 2.0 URL patterns</a>, you can make the image display at different sizes directly in the browser. <br /></p><h3 style="text-align: left;">IIIF manifests</h3><p>The real reason for making images available through a IIIF server is to display them in a viewer application. One such application is <a href="https://projectmirador.org/" target="_blank">Mirador</a>. A IIIF viewer uses a manifest to understand how the image or set of images should be displayed. CommonsTool generates very simple IIIF manifests that display each image in a separate canvas, along with basic metadata about the artwork. To see what the manifest looks like for the image at the top of this post, go to <a href="https://iiif-manifest.library.vanderbilt.edu/gallery/1992/1992.083.json" target="_blank">this link</a>. </p><p>IIIF manifests are written in machine-readable Javascript Object Notation (JSON), so they are not intended to be understood by humans. However, when the manifest is consumed by a viewer application, a human can use controls such as pan, zoom, and buttons to manipulate the image or to move to another canvas that displays a different image. The Mirador project provides an online IIIF viewer that can be used to view images described by a manifest. <a href="https://projectmirador.org/embed/?iiif-content=https://iiif-manifest.library.vanderbilt.edu/gallery/1992/1992.083.json" target="_blank">This link</a> will display the manifest from above in the Mirador online viewer. </p><p>One nice thing about providing a IIIF manifest is that it allows multiple images of the same work to be viewed in the same viewer. For example, there might be multiple pages of a book, or the front and back sides of a sculpture. I'm still learning about constructing IIIF manifests, so I haven't done anything fancy yet with respect to generating IIIF manifests in the CommonsTool script. However, the script does generate a single manifest describing all of the images depicting the same artwork. The image designated as "primary" is shown in the initial view and any other images designated as "secondary" are shown in other canvases that can be selected using the viewer display options or be viewed sequentially using the buttons at the bottom of the viewer. 
<a href="https://projectmirador.org/embed/?iiif-content=https://iiif-manifest.library.vanderbilt.edu/gallery/1970/1970.040.json" target="_blank">Here is an example</a> showing how the manifest for the primary and secondary images in an earlier example put the front and back images of a manuscript page in the same viewer window. </p><h3 style="text-align: left;">IIIF in Wikidata</h3><p>Wikidata has a property "IIIF manifest" (P6108) that allows an item to be linked to a IIIF manifest that displays depictions of that item. The file where existing uploads are recorded includes a iiif_manifest column that contains the manifest URLs for the works depicted by the images. </p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtHSAfqtskbptvBF5W7Wjbwy0TH_zPdcz1g7fJS1zHVFQYorq8BOmElWtrqV73XUILlXl9mB7XbkajJZgGEc8RHOAvseZ_ooDDlkwmAv3mlQkFaQM3EOfPTl_a-3V40Al7iWzkMTP82HQHpb92v8G-QgF1c3ifN-qrh0UZWw8TOXbZTe6o2NJ5-rPn/s1139/tabled_manifest_values.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="162" data-original-width="1139" height="91" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtHSAfqtskbptvBF5W7Wjbwy0TH_zPdcz1g7fJS1zHVFQYorq8BOmElWtrqV73XUILlXl9mB7XbkajJZgGEc8RHOAvseZ_ooDDlkwmAv3mlQkFaQM3EOfPTl_a-3V40Al7iWzkMTP82HQHpb92v8G-QgF1c3ifN-qrh0UZWw8TOXbZTe6o2NJ5-rPn/w640-h91/tabled_manifest_values.png" width="640" /></a></div><p></p><p>Those values can be used to create IIIF manifest (P6108) claims for an item in Wikidata:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi2FBcPULl21q_FBHrhugvajY0dHIyrH7ynyWtVXcnhX9vutmKCWIq1ROKuqpkxfqTAx9hlnU8nVlJV3kDzaht16jTbYf4rojza8YvLzTip1B6nN2SORKdncF0esyAWRX_lFk09ICr-CzFRASwqUXyLrtNa67HNAe0YUml-ZGYJSFdqyO4FaaJqNB5i/s896/manifest_claim.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="380" data-original-width="896" height="272" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi2FBcPULl21q_FBHrhugvajY0dHIyrH7ynyWtVXcnhX9vutmKCWIq1ROKuqpkxfqTAx9hlnU8nVlJV3kDzaht16jTbYf4rojza8YvLzTip1B6nN2SORKdncF0esyAWRX_lFk09ICr-CzFRASwqUXyLrtNa67HNAe0YUml-ZGYJSFdqyO4FaaJqNB5i/w640-h272/manifest_claim.png" width="640" /></a></div><p>Because doing this manually would be tedious, the iiif_manifest values can be automatically transferred to a VanderBot-compatable CSV file using the same transfer script used to transfer the image_name.<br /></p><p>In itself, adding a IIIF manifest claim isn't very exciting. However, Wikidata supports a user script that will display an embedded Mirador viewer anytime an item has a value for P6108. (For details on how to install that script, see <a href="https://baskauf.github.io/2021/12/10/iiif/" target="_blank">this post</a>.) 
With the viewer enabled, opening a Wikidata page for a Fine Arts Gallery item with images will display the viewer at the top of the page, and a user can zoom in or use the buttons at the bottom to move to another image of the same artwork.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-bpdg1UqKqwwpdYogITlIWkZy4yC0i4InejhOaMeIepNk1EYlpLjJ41TvUKYikBj1_PxOY72eLG87MUoKMQtKCoZ0hRNgmNUEyIZWUCs0dGb8byKEwYURMTuCCMWt1db_mF6ie6sjC9kADHX9vC3txOWcaACbtkDhTDQxe9gA6WuKHdB6t7jdUpor/s1138/embedded_viewer.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1108" data-original-width="1138" height="624" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-bpdg1UqKqwwpdYogITlIWkZy4yC0i4InejhOaMeIepNk1EYlpLjJ41TvUKYikBj1_PxOY72eLG87MUoKMQtKCoZ0hRNgmNUEyIZWUCs0dGb8byKEwYURMTuCCMWt1db_mF6ie6sjC9kADHX9vC3txOWcaACbtkDhTDQxe9gA6WuKHdB6t7jdUpor/w640-h624/embedded_viewer.png" width="640" /></a></div><p>This is really nice because if only the primary image is linked using the image property, users would not necessarily know that there are other images of the object in Commons. But with the embedded viewer, the user can flip through all of the images of the item that are in Commons using the display features of the viewer, such as thumbnails.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4D_h3NiV4QaS84nJhiRgtzfTUSrOJpJUJaxa5J-tdjrLYC70Q987KbMbHhO9qI-0WB5_HMuiIJUwgMlR8iZxo0Qpn2rgAxnPtHBzM6D4BqweBvso7-j-aE8hG6s7cImBG5q7IMPP2qrukmnU1MyeQm3eTrJvlbJcltvQG7Vahjqls9wZ7KgeTwEoo/s1006/thumbnails.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1006" data-original-width="947" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4D_h3NiV4QaS84nJhiRgtzfTUSrOJpJUJaxa5J-tdjrLYC70Q987KbMbHhO9qI-0WB5_HMuiIJUwgMlR8iZxo0Qpn2rgAxnPtHBzM6D4BqweBvso7-j-aE8hG6s7cImBG5q7IMPP2qrukmnU1MyeQm3eTrJvlbJcltvQG7Vahjqls9wZ7KgeTwEoo/w602-h640/thumbnails.png" width="602" /></a></div><br /><h1 style="text-align: left;">Using the script</h1><p>Although I wrote this script primarily to serve my own purposes, I tried to make it clean and customizable enough that someone with moderate computer skills should also be able to use it. The only installation requirements are Python and several modules that aren't included in the standard library. It should not generally be necessary to modify the script to use it -- most customizing should be possible by changing the configuration file. </p><p>If the script is only used to write files to Commons, its operation is pretty straightforward. If you want to combine uploading image files to Commons with writing the image_names and iiif_manifest values to Wikidata, it's more complicated. You need to get the <a href="https://github.com/HeardLibrary/linked-data/blob/master/commonsbot/transfer_to_vanderbot.py" target="_blank">transfer_to_vanderbot.py script</a> working and then learn how to operate VanderBot. There are detailed instructions, videos, etc. 
to do that on the <a href="https://github.com/HeardLibrary/linked-data/tree/master/vanderbot#readme" target="_blank">VanderBot landing page</a>.<br /></p><h1 style="text-align: left;">What's next?</h1><p>There are still a few more Fine Arts Gallery images that I need to upload after doing some file conversions, checking out some copyright statuses, and wrangling some data for multiple files that depict the same work. However, I'm quite excited about developing better IIIF manifests that will make it possible to view related works in the same viewer. Having so many images in Commons now also makes it possible to see the real breadth of the collection by viewing the Listeria visualizations on the tabs of the <a href="https://www.wikidata.org/wiki/Wikidata:WikiProject_Vanderbilt_Fine_Arts_Gallery" target="_blank">WikiProject Vanderbilt Fine Arts Gallery website</a>. I hope soon to create more fun SPARQL-based visualizations to add to those already on the website landing page.<br /></p>Steve Baskaufhttp://www.blogger.com/profile/01896499749604153763noreply@blogger.com0tag:blogger.com,1999:blog-5299754536670281996.post-20258204975022372412022-06-11T10:23:00.000-07:002022-06-11T10:23:14.319-07:00Making SPARQL queries to Wikidata using Python<br /><h2 style="text-align: left;"><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDHJntncOYuJQh1QEtThPBEsB8OxsvfpPv4sN6qo2dnXERv0XvoggQon-ZVrbv6maZlWe33evcddZaENPJmQzkIWGsrGh07wxyVYECsGsr4OdBOoU2XahoqSUQCOEokdYBcvX7QOdQ0VI7GXfD7A_TbeJhQbBDXWbvAnH2XAHgLzVZGry89DaFqj2e/s480/Welding_sparkles.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="Welding sparkles" border="0" data-original-height="480" data-original-width="384" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgDHJntncOYuJQh1QEtThPBEsB8OxsvfpPv4sN6qo2dnXERv0XvoggQon-ZVrbv6maZlWe33evcddZaENPJmQzkIWGsrGh07wxyVYECsGsr4OdBOoU2XahoqSUQCOEokdYBcvX7QOdQ0VI7GXfD7A_TbeJhQbBDXWbvAnH2XAHgLzVZGry89DaFqj2e/w256-h320/Welding_sparkles.jpg" title=""Welding sparkles" by Dhivya dhivi DJ, CC BY-SA 4.0, via Wikimedia Commons" width="256" /> </a></div><div class="separator" style="clear: both; text-align: center;"><span style="font-size: xx-small;">"Welding sparkles" by Dhivya dhivi DJ, CC BY-SA 4.0, via Wikimedia Commons</span> <br /></div></h2><h2 style="text-align: left;">Background</h2><p>This is actually a sort of followup post to my most popular blog post: "<a href="https://baskauf.blogspot.com/2019/05/getting-data-out-of-wikidata-using.html" target="_blank">Getting Data Out of Wikidata using Software</a>", which has had about
6.5K views since 2019. That post was focused on the variety of query forms you could use and talked a lot about using Javascript to build web pages that acquired data from Wikidata dynamically. However, I did provide a link to some Python code, which included the line</p><p><span style="font-family: courier;">r = requests.get(endpointUrl, params={'query': query}, headers={'Accept': 'application/sparql-results+json'})</span></p><p>for making the actual query to the Wikidata Query Service via HTTP GET. </p><p>Since that time, I've used some variation on that code in dozens of Python scripts that I've written to grab data from Wikidata. In the process, I experienced some frustration when things did not behave as I had expected and when I got unexpected errors from the API. My goal for this post is to describe some of those problems and how I solved them. I'll also provide a <a href="https://github.com/HeardLibrary/digital-scholarship/blob/master/code/wikidata/sparqler.py" target="_blank">link to the "Sparqler" Python class</a> that I wrote to make querying simpler and more reliable, along with some examples of how to use it to do several types of queries.</p><p> Note: SPARQL keywords are case insensitive. Although you often see them
written in ALL CAPS in examples, I'm generally too lazy to do that and
tend to use lower case, as you'll see in most of the examples below.</p><h2 style="text-align: left;">The Sparqler class</h2><p>For those of you who don't care about the technical details, I'll cut right to the chase and tell you how to make queries to Wikidata using the code. You can access the code in GitHub <a href="https://github.com/HeardLibrary/digital-scholarship/blob/master/code/wikidata/sparqler.py" target="_blank">here</a>. I should note that the code is general-purpose and can be used with any SPARQL 1.1 compliant endpoint, not just the Wikidata Query Service (WDQS). This includes Wikibase instances and installations of Blazegraph, Fuseki, Neptune, etc. The code also supports SPARQL Update for loading data into a triplestore, but that's the topic of another post.<br /></p><p>To use the code, you need to import three modules: <span style="font-family: courier;">datetime</span>, <span style="font-family: courier;">time</span>, and <span style="font-family: courier;">requests</span>. The requests module isn't included in the standard Python distribution, so you may need to install it with PIP if you haven't already. If you are using Jupyter notebooks through Anaconda, or Colab notebooks, requests will probably already be installed. Copy the code from "class Sparqler:" through just before the "Body of script" comment near the bottom of the file, and paste it near the top of your script. </p><p>To test the code, you can run the entire script, which includes code at the end with an example of how to use the script. If you only run it once or twice, you can use the code as-is. However, if you make more than a few queries, you'll need to change the <span style="font-family: courier;">user_agent</span> string from the example I gave to your own. You can read about that in the next section. </p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjuog9He1lnJgzSgbl1lpttBz4RiuoumKJP2aUSwiZQuWM4kTOwPlRm5bd9bpR1qVdCqhyW7BqwdPSL3bDeyQJF01gCS5bi-XiBgpNFTyInEwwytkYjwPTQx9xG-p7_QvY2ondH4HttIPDaV8f2l4ryz54D4Z5ZpoSyNgOYhfMDgJmljMvPrk0ICD4O/s544/code.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="67" data-original-width="544" height="78" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjuog9He1lnJgzSgbl1lpttBz4RiuoumKJP2aUSwiZQuWM4kTOwPlRm5bd9bpR1qVdCqhyW7BqwdPSL3bDeyQJF01gCS5bi-XiBgpNFTyInEwwytkYjwPTQx9xG-p7_QvY2ondH4HttIPDaV8f2l4ryz54D4Z5ZpoSyNgOYhfMDgJmljMvPrk0ICD4O/w640-h78/code.png" width="640" /></a></div><br /> The body of the script has four main parts. Lines 238 through 256 create a value for the text <span style="font-family: courier;">query_string</span> that gets sent to the WDQS endpoint. Lines 259 and 260 instantiate a <span style="font-family: courier;">Sparqler</span> object called <span style="font-family: courier;">wdqs</span>. Line 261 sends the query string that you created to the endpoint and returns the SELECT query results as a list of dictionaries called <span style="font-family: courier;">data</span>. The remaining lines check for errors and display the results as pretty JSON (the reason for importing the <span style="font-family: courier;">json</span> module at the top of the script). If you want to see the <span style="font-family: courier;">query_string</span> as constructed or the raw response text from the endpoint, you can uncomment lines 257 and 266. 
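<p>If you would rather see the flow in one place than read through the file, here is a condensed sketch of what the body of the script does (paraphrased and simplified -- the exact query text, labels, and line numbers are in the GitHub file, and the <span style="font-family: courier;">user_agent</span> value below is only a placeholder that you should replace with your own information):</p><p><span style="font-family: courier;">import datetime<br />import time<br />import json  # only needed to pretty-print the results<br />import requests<br /><br /># ... paste the Sparqler class here ...<br /><br /># build a values clause from a list of label strings (shortened here)<br />labels = ['Niccolò Machiavelli']<br />values = ''<br />for label in labels:<br />    values += '"""' + label + '"""@en\n'<br />query_string = '''select distinct ?item ?label where {<br />values ?value {<br />''' + values + '''}<br />?item rdfs:label|skos:altLabel ?value.<br />?item rdfs:label ?label.<br />filter(lang(?label) = 'en')<br />}'''<br /><br /># instantiate a Sparqler object for the WDQS and send the query<br />user_agent = 'TestAgent/0.1 (mailto:someone@example.com)'<br />wdqs = Sparqler(useragent=user_agent)<br />data = wdqs.query(query_string)<br />print(json.dumps(data, indent=2))</span><br /></p><p>Once the <span style="font-family: courier;">wdqs</span> object exists, you can keep reusing it by calling its <span style="font-family: courier;">.query()</span> method with other query strings.</p>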
<br /><p></p><p>Here's what the response looks like:<br /><span style="font-size: x-small;"><br /><span style="font-family: courier;">[<br /> {<br /> "item": {<br /> "type": "uri",<br /> "value": "http://www.wikidata.org/entity/Q102949359"<br /> },<br /> "label": {<br /> "xml:lang": "en",<br /> "type": "literal",<br /> "value": "\"I Hate You For Hitting My Mother,\" Minneapolis"<br /> }<br /> },<br /> {<br /> "item": {<br /> "type": "uri",<br /> "value": "http://www.wikidata.org/entity/Q102961315"<br /> },<br /> "label": {<br /> "xml:lang": "en",<br /> "type": "literal",<br /> "value": "A Picture from an Outline of Women's Manners - The Wedding Ceremony"<br /> }<br /> },<br /> {<br /> "item": {<br /> "type": "uri",<br /> "value": "http://www.wikidata.org/entity/Q1399"<br /> },<br /> "label": {<br /> "xml:lang": "en",<br /> "type": "literal",<br /> "value": "Niccol\u00f2 Machiavelli"<br /> }<br /> }<br />]</span></span></p><p>It's in the standard SPARQL 1.1 JSON results format, so if you write code to extract the results from the <span style="font-family: courier;">data</span> list of dictionaries, you can use it with the results of any query.</p><h2 style="text-align: left;">Features of the code</h2><p>For those of you who are interested in knowing more about the code and the rationale behind it, read the following sections. If you just want to try it out, skip to the "Options for querying" section.<br /></p><h3 style="text-align: left;">The User-Agent string</h3><p>Often applications that request data from an API are asked to identify themselves as an indication that they aren't bad actors and to allow the API maintainers to contact the developers if the application is doing something the API maintainers don't like. In the case of the Wikimedia Foundation, they have adopted a <a href="https://meta.wikimedia.org/wiki/User-Agent_policy" target="_blank">User-Agent policy</a> that requires that an HTTP User-Agent header be sent with all requests to their servers. This policy is not universally enforced, and I'm not sure whether it's enforced at all for the WDQS, but if you are writing a script that is making repeated queries at a high rate of speed, you should definitely supply a User-Agent header that identifies your application (and you) in the event that it is suspected to be a denial of service attack. </p><p>The details of what they would like developers to include in the string are given on the policy page, but the TLDR is that you should have a name for the "application" (your script) and either your email address or the URL of a page that describes your project. The value given in lines 259 and 260 of the body of the script for the <span style="font-family: courier;">user_agent</span> variable can be used as a template. When instantiating the <span style="font-family: courier;">Sparqler</span> object, the string MUST be passed in as the value of the <span style="font-family: courier;">useragent</span> argument if the endpoint URL given as the value of the <span style="font-family: courier;">endpoint</span> argument is <span style="font-family: courier;">https://query.wikidata.org/sparql</span> (the default if no <span style="font-family: courier;">endpoint</span> argument is given). If you don't provide one, the script will exit. </p><h3 style="text-align: left;">The sleep argument</h3><p>When you create a <span style="font-family: courier;">Sparqler</span> object, you can choose to supply a value (in seconds) for the <span style="font-family: courier;">sleep</span> argument. 
If none is supplied, it defaults to 0.1 s. Each time a query is made, the script pauses execution for the length of time specified. The rationale for the default of 0.1 s for the WDQS is similar to that in the previous section -- you don't want the WDQS operators to think you are a bad actor if you are hitting the endpoint repeatedly without delay. If you are reading from a localhost endpoint, you can set the value of <span style="font-family: courier;">sleep</span> to zero. </p><p>While I'm on the topic of being a courteous WDQS user, I would like to point out that often repetitive querying can be avoided if you use a "smarter" query. In the example code, I wanted to discover the Q IDs of three labels. I could have inserted the label value in the query as a literal in the position of <span style="font-family: courier;">?value</span>, e.g.</p><p><span style="font-family: courier;">?item rdfs:label|skos:altLabel "</span>尼可罗·马基亚维利<span style="font-family: courier;">"@zh.</span><br /></p><p>then put the <span style="font-family: courier;">.query()</span> method inside a loop that runs three times. However, in the script I instead used a loop to create a <span style="font-family: courier;">VALUES</span> clause to enumerate the possible values of <span style="font-family: courier;">?value</span>. I still get the same information, but using the <span style="font-family: courier;">VALUES</span> method only requires one interaction with the Query Service instead of three. For a small number like this, it's not that important, but I've sent queries with hundreds or thousands of values, and there the difference is significant.<br /></p><h3 style="text-align: left;">GET vs. POST</h3><p>This brings me to another important thing that I learned the hard way about interacting with SPARQL endpoints programmatically. If you drill down in the <a href="https://www.w3.org/TR/sparql11-protocol/#query-operation" target="_blank">SPARQL 1.1 Protocol specification</a> (which I doubt that anyone but me typically does!), you'll see that there are three options for sending queries via HTTP: one using GET and two using POST. When I first started running queries from scripts, I tended to use the GET method because it seemed simpler -- after URL-encoding, the query just gets attached to the end of the URL as the value of a <span style="font-family: courier;">query</span> parameter. However, what I discovered once I started making really long queries (like the one I previously described with thousands of <span style="font-family: courier;">VALUES</span>) was that you can fairly easily exceed the length limits of a URL allowed by the server (something in the neighborhood of 5K to 15K characters). Once I discovered that, I switched to using POST since the query is passed as the message body and therefore has no particular length limit. </p><p>So why would you ever need to use GET? In some cases, a SPARQL endpoint will only support GET requests because the endpoint is read-only. In cases where a SPARQL service supports both Query and Update, a quick-and-dirty way to restrict writing to the triplestore using Update (which must be done using POST) is to disallow any un-authenticated POST requests. Another case is services like AWS Neptune that have separate read-only endpoints whose access is separate from the endpoint that supports writing. A read-only endpoint would only support GET requests.</p>
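<p>As an aside, here is a bare-bones sketch of what these two options look like when you call the <span style="font-family: courier;">requests</span> library directly (this is only an illustration, not the actual code inside the class, and the User-Agent value is a placeholder):</p><p><span style="font-family: courier;">import requests<br /><br />endpoint = 'https://query.wikidata.org/sparql'<br />query_string = 'select distinct ?p where {wd:Q42 ?p ?o.} limit 3'<br />headers = {'Accept': 'application/sparql-results+json',<br /> 'User-Agent': 'TestAgent/0.1 (mailto:someone@example.com)'}<br /><br /># GET: the URL-encoded query becomes part of the URL, so a very long<br /># query can exceed the server's URL length limit<br />r = requests.get(endpoint, params={'query': query_string}, headers=headers)<br /><br /># URL-encoded POST: the query travels in the message body as a form-encoded<br /># parameter, so there is no particular length limit<br />r = requests.post(endpoint, data={'query': query_string}, headers=headers)<br /><br />print(r.json())</span><br /></p>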
<p>For these reasons, you can specify that the <span style="font-family: courier;">Sparqler</span> object use GET by providing a value of "<span style="font-family: courier;">get</span>" for the <span style="font-family: courier;">method</span> argument. Otherwise it defaults to POST.</p><h3 style="text-align: left;">UTF-8 support</h3><p>If the literals that you are using only contain Latin characters, it doesn't really matter that much how you do the querying. However, a lot of projects I work on either involve languages with non-Latin character sets, or include characters with diacritics that aren't in the ASCII character set. Despite my best efforts to enforce UTF-8 encoding everywhere, I was still having queries that would fail to match labels in Wikidata that I knew should match. After wasting a bunch of time troubleshooting, I finally figured out the fix. </p><p>As I mentioned earlier, the SPARQL 1.1 Protocol Recommendation provides two ways to send queries using POST. The simplest one is to just send the query as text, without URL-encoding, as the message body. That's awesome for testing because you can just paste a query into the message text box of Postman and if you use the right Content-Type header, you can send the query with the click of a button. I assumed that as long as the text was all UTF-8, I would be fine. However, using this option was actually the cause of the problems I was having with match failures. When I switched to the other POST method (which URL-encodes the query string), my matching problems disappeared. For that reason, my script only uses the "query via URL-encoded POST" option. </p><h3 style="text-align: left;">", ', and """ quoting for literals<br /></h3><p>I learned early on that in SPARQL you can use either double or single quotes for literals in queries. That's nice, because if you have a string containing a single quote like "don't", you can enclose it in double quotes, and if you have a string containing double quotes like 'say "hi" for me', you can enclose it in single quotes. But what if you have 'Mother said "don't forget to brush your teeth" to me.', which contains both double and single quotes? Also, when you are inserting strings into the query using variables, you can't know in advance what kind or kinds of quotes a string might contain. </p><p>This problem frustrated me for quite some time. I experimented with checking strings for both kinds of quotes, replacing double quotes with singles, and escaping quotes in various ways, but none of these approaches worked, and my scripts kept crashing because of quote mismatches. </p><p>Finally, I resorted to (you guessed it) reading the SPARQL 1.1 Query specification, and there was the obvious (to Python users) answer in <a href="https://www.w3.org/TR/2013/REC-sparql11-query-20130321/#QSynLiterals" target="_blank">section 4.1.2</a>: enclose the literals in sets of three quotes. I don't know why I didn't think of trying that. Note that in line 246 of the script, triple single-quotes are used to enclose the literals. Thus the script can handle both of the example English strings: the one with double quotes around the first part of the label and the label that includes "Women's" with an apostrophe.<br /></p><p>After solving the quote and UTF-8 problems, my scripts now reliably handle literals that contain any UTF-8 characters.</p><h2 style="text-align: left;">Options for querying<br /></h2><p>The query in the code example uses the SELECT query form. 
This is probably the most common type of SPARQL query, but others are possible and <span style="font-family: courier;">Sparqler</span> objects support any query form option. Depending on the chosen form of the query, there are also several possible response formats. Since we are talking about Python here, the most convenient response format is JSON, since it can easily be converted into a complex Python data structure. But in some situations, another format may be more convenient. <br /></p><h3 style="text-align: left;">Query form <br /></h3><p>The query form is specified using the <span style="font-family: courier;">form</span> keyword argument of the <span style="font-family: courier;">.query()</span>
method. It may seem a bit strange to specify the query form as an
argument of the method when the query form is determined by the text of
the query itself, but doing so allows the script to control the default format of the response and whether the raw response is processed
prior to being returned from the method. For SELECT and ASK, the default
response serialization is set to JSON. For the DESCRIBE and CONSTRUCT
query forms that return graphs, the default serialization is Turtle. </p><h4 style="text-align: left;">SELECT <br /></h4><p>The default query form is SELECT, so it isn't necessary to provide a <span style="font-family: courier;">form</span> argument to use it. That's convenient, since it's probably the most commonly used form. The raw JSON response (which you can view as the value of the <span style="font-family: courier;">.response</span> attribute of the Sparqler object, e.g. <span style="font-family: courier;">wdqs.response</span>) from the endpoint is structured in a more complicated way than is required to just get the results of the query. The results list is actually the value of a <span style="font-family: courier;">bindings</span> key nested inside an object that's the value of a <span style="font-family: courier;">results</span> key, like this:</p><p><span style="font-family: courier;">{<br /> "head" : {<br /> "vars" : [ "item", "label" ]<br /> },<br /> "results" : {<br /> "bindings" : [ {<br /> "item" : {<br /> "type" : "uri",<br /> "value" : "http://www.wikidata.org/entity/Q102949359"<br /> },<br />...<br /> }<br /> } ]<br /> }<br />}</span></p><p>For convenience, when handling SELECT queries with the default JSON serialization, the script converts the raw JSON to a complex Python data
object, then extracts the results list that's nested as the value of the
<span style="font-family: courier;">bindings</span> key and returns that as the value of the <span style="font-family: courier;">.query()</span> method. That produces the result shown in the example shown earlier in the post. </p><p>Here's an example that prints Douglas Adams' (Q42) name in all available languages:</p><p><span style="font-family: courier;">query_string = '''select distinct ?label ?language where {<br />wd:Q42 rdfs:label ?label.<br />bind ( lang(?label) AS ?language )<br />}<br />order by ?language'''<br />names = wdqs.query(query_string)<br />for name in names:<br /> print(name['language']['value'], name['label']['value']) </span><br /></p><p>The loop iterates through all of the items in the results list and pulls the <span style="font-family: courier;">value</span> for each variable. This structure: <span style="font-family: courier;">item['variableName']['value']</span> is consistent for all SELECT queries where <span style="font-family: courier;">variableName</span> is the string you used for that variable in the query (e.g. <span style="font-family: courier;">?variableName</span>).</p><h4 style="text-align: left;">ASK<br /></h4><p>When
the ASK query form is chosen, the result is true or false, so the raw
response is processed to return a Python boolean as the response value.
That allows you to directly control program flow based on whether a
particular graph pattern has any solutions, like this:</p><p><span style="font-family: courier;">label_string = '</span>尼可罗·马基亚维利<span style="font-family: courier;">'<br />language = 'zh'<br /><br />query_string = '''ask where {<br /> ?entity rdfs:label """'''+ label_string + '"""@' + language + '''.<br /> }'''<br /><br />if wdqs.query(query_string, form='ask'):<br /> print(label_string, 'is in Wikidata')<br />else:<br /> print('Could not find', label_string, 'in Wikidata')</span><br /></p><p>I use this kind of query to check whether label/description combinations that I plan to use for new Wikidata items have already been used. If you try to create a new item that has the same label and description as an existing item, the Wikidata API will return an error message and refuse to create the item. So it's better to query ahead of time so that you can change either the label or description to make it unique. Here's some code that will perform that check for you:</p><p><span style="font-size: small;"><span style="font-family: courier;">label_string = 'Italian Lake Scene'<br />description_string = 'painting by Artist Unknown'<br /><br />query_string = '''ask where {<br /> ?item rdfs:label """'''+ label_string + '''"""@en.<br /> ?item schema:description """'''+ description_string + '''"""@en.<br /> }'''<br /><br />if wdqs.query(query_string, form='ask'):<br /> print('There is already an item in Wikidata with')<br /> print('label:', label_string)<br /> print('description:', description_string)<br /> print('The label or description must be changed before uploading.') </span></span><br /></p><h4 style="text-align: left;">DESCRIBE <br /></h4><p>The DESCRIBE query form is probably the least commonly used SPARQL query form. Its behavior is somewhat dependent on the implementation. Blazegraph, which is the application that underlies the WDQS, returns all of the triples that include the resource that is the solution to the query. The simplest kind of DESCRIBE query just specifies the IRI of the resource to be described. Here's an example that will return all of the triples that provide some kind of information about Douglas Adams (Q42):</p><p><span style="font-family: courier;">query_string = 'describe wd:Q42'<br />description = wdqs.query(query_string, form='describe')</span><br /></p><p><span style="font-family: courier;">description</span> is a string containing the triples in Turtle serialization. That string could be saved as a file and loaded into an application that knows how to parse Turtle.</p><h4 style="text-align: left;">CONSTRUCT <br /></h4><p> CONSTRUCT queries are similar to DESCRIBE in that they produce triples. The triples are those that conform to a graph pattern that you specify. For example, this query will produce all of the triples (serialized as Turtle) that are direct claims about Douglas Adams.</p><p><span style="font-family: courier;">query_string = '''construct {wd:Q42 ?p ?o.} where {<br />wd:Q42 ?p ?o.<br />?prop wikibase:directClaim ?p.<br />}'''<br />triples = wdqs.query(query_string, form='construct')<br />print(triples) </span><br /></p><p>This might be useful to you if you want to load just those triples into a triplestore.<br /></p><h3 style="text-align: left;">Response formats <br /></h3><p>Because of the ease with which JSON can be converted directly to an analogously structured complex Python data object, Sparqler objects default to JSON as the response format for SELECT queries. 
For the two query forms that return triples (DESCRIBE and CONSTRUCT), the default is Turtle. ASK defaults to JSON, from which a Python boolean is extracted. However, these response formats can be overridden using the <span style="font-family: courier;">mediatype</span> keyword argument in the <span style="font-family: courier;">.query()</span> method if desired.<br /></p><p>The <span style="font-family: courier;">mediatype</span> argument values for some other possible response formats for SELECT are:<br /></p><p><span style="font-family: courier;">application/sparql-results+xml</span> for XML</p><p><span style="font-family: courier;">text/csv</span> for CSV tabular data<br /></p><p>For non-JSON response serializations, the return value of the <span style="font-family: courier;">.query()</span> method is the raw text from the endpoint. That may be useful if you want to save the XML for use with some XML processing language like XQuery. It also makes it super simple to save the output as a CSV file with a few lines of code, like this:</p><p><span style="font-family: courier;">data = wdqs.query(query_string, mediatype='text/csv')<br />with open('graph_dump.csv', 'wt', encoding='utf-8') as file_object:<br /> file_object.write(data)</span><br /></p><p>Triple output from DESCRIBE and CONSTRUCT can be serialized in other formats using these values of the <span style="font-family: courier;">mediatype</span> argument:</p><p><span style="font-family: courier;">application/rdf+xml</span> for XML</p><p><span style="font-family: courier;">application/n-triples</span> for N-Triples</p><h3 style="text-align: left;">Monitoring the status of the query<br /></h3><p> The <span style="font-family: courier;">verbose</span> keyword argument can be used to control whether you get printed feedback to monitor the status of the query. A <span style="font-family: courier;">False</span> value (the default) suppresses printing. Supplying a <span style="font-family: courier;">True</span> value prints a notification that the query has been requested and another when a response has been received, including the time to complete the query. This may be helpful during debugging or if the queries take a long time to execute. For small, routine queries, you probably want to turn this off. Note: the second notification takes place after the <span style="font-family: courier;">sleep</span> delay, so the reported response time includes that delay. <br /></p><h3 style="text-align: left;"> FROM and FROM NAMED</h3><p>The <a href="https://www.w3.org/TR/sparql11-protocol/#dataset" target="_blank">SPARQL 1.1 Protocol specification</a> provides a mechanism for specifying graphs to be included in the default graph using a request parameter rather than by using the FROM and FROM NAMED keywords in the text of the query itself. Sparqler supports this mechanism through the <span style="font-family: courier;">default</span> and <span style="font-family: courier;">named</span> arguments. Given that this is an advanced feature and that the WDQS triplestore does not have named graphs, I won't say more about this here. However, I'm planning to talk about this feature in a future post about the Vanderbilt Libraries' new Neptune triplestore. For more details, see the doc strings in the code.</p><h2 style="text-align: left;">Detecting errors</h2><p>Detecting errors depends on how errors are reported by the SPARQL query service. In the case of Blazegraph (the service on which the WDQS is based), errors are reported as unformatted text in the response body. 
This is not the case with every SPARQL service -- they may report errors by some different mechanism, such as a log that must be checked. </p><p>Because the main use cases of the Sparqler class are SELECT and ASK queries to the WDQS, errors can be detected by checking whether the results are JSON or not (assuming the default JSON response format is used). When SELECT queries return JSON, the code tries to convert the response from JSON to a Python object. If it fails, it returns a <span style="font-family: courier;">None</span> object. You can then detect a failed query by checking whether the value is <span style="font-family: courier;">None</span> and if it is, you can try to parse out the error message string (provided as the value of the <span style="font-family: courier;">.response</span> attribute of the Sparqler object, e.g. <span style="font-family: courier;">wdqs.response</span>), or just print it for the user to see. Here is an example:</p><p><span style="font-family: courier;"> query_string = '''select distinct ?p ? where {<br />wd:Q42 ?p ?o.<br />}<br />limit 3'''<br />data = wdqs.query(query_string)<br />if data is None:<br /> print(wdqs.response)<br />else:<br /> print(data) </span><br /></p><p>The example intentionally omits the name of the second variable (<span style="font-family: courier;">?o</span>) to cause the query to be malformed. If you run this query, <span style="font-family: courier;">None</span> will be returned as the value of <span style="font-family: courier;">data</span>, and the error message will be printed. If you add the missing "o" after the question mark and re-run the query, you should get the query results. </p><p>Note that this mechanism detects actual errors and not a negative query result. For example, a select query with no matches will return an empty list (<span style="font-family: courier;">[]</span>), which is a negative result, not an error. The same is true for ASK queries that evaluate as <span style="font-family: courier;">False</span> when there are no matches. That's why the code is written "<span style="font-family: courier;">if data is None:</span>" rather than "<span style="font-family: courier;">if data:</span>", which would evaluate as <span style="font-family: courier;">True</span> if there were matches (non-empty list or <span style="font-family: courier;">True</span> value) but as <span style="font-family: courier;">False</span> for either an error (a value of <span style="font-family: courier;">None</span>) or no matches (an empty list or <span style="font-family: courier;">False</span> value). The point is that a "no matches" result should be handled differently than an error in your code, and that's why the code <span style="font-family: courier;">if data is None:</span> is used.<br /></p><p>For other query forms (DESCRIBE and CONSTRUCT) and response formats other than JSON, the <span style="font-family: courier;">.query()</span> method simply returns the response text. So I leave it to you to figure out how to differentiate between errors and valid responses (maybe search for "<span style="font-family: courier;">ExecutionException</span>" in the response string?). </p><h2 style="text-align: left;">SPARQL Update support</h2><p>The Sparqler class supports changing graphs in the triplestore using SPARQL Update if the SPARQL service supports that. 
This is done using the <span style="font-family: courier;">.update()</span> method and two more specific types of Update operations: <span style="font-family: courier;">.load()</span> and <span style="font-family: courier;">.drop()</span>. However, since changes to the data available on the WDQS triplestore must be made through the Wikidata API and not through SPARQL Update, I won't discuss these features in this post. I'm planning to describe them in more detail in an upcoming post where I talk about our Neptune triplestore. Until then, you can look at the doc strings in the code for details.</p><p><br /></p><p></p><p><br /></p>Steve Baskaufhttp://www.blogger.com/profile/01896499749604153763noreply@blogger.com0tag:blogger.com,1999:blog-5299754536670281996.post-46212748174637263042022-03-16T19:44:00.008-07:002022-05-17T12:14:41.950-07:00Birding in Puerto Rico<p></p><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhcBGvhcHCzC6S775UnyeMDAFr-ToLt4By2oohC9yOB_4t2Ki0OiA1YbKq8yxHA1vnhXFkum_j2DgF8e0K9OfvAyVUu5uREpzDwbJkNrLeoirFrF01QjTp5Bk_C8I7WEFTRGOp7-r9IL33pEysT38T75mk9kZoMXEiMJXS-nJdFgq3VBsIV_o7kzV_w=s1606" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1495" data-original-width="1606" height="298" src="https://blogger.googleusercontent.com/img/a/AVvXsEhcBGvhcHCzC6S775UnyeMDAFr-ToLt4By2oohC9yOB_4t2Ki0OiA1YbKq8yxHA1vnhXFkum_j2DgF8e0K9OfvAyVUu5uREpzDwbJkNrLeoirFrF01QjTp5Bk_C8I7WEFTRGOp7-r9IL33pEysT38T75mk9kZoMXEiMJXS-nJdFgq3VBsIV_o7kzV_w=s320" width="320" /></a></div><p style="text-align: center;">Pearly-eyed Thrasher - Bosque Estatal de Guánica, Puerto Rico</p><p><i> NOTE: this information was accurate as of our trip in mid-March of 2022. It will undoubtedly change as time goes by.</i><br /></p><p> Having just completed a week-long vacation in Puerto Rico focused primarily on bird-watching, I wanted to share some observations that might be helpful for others planning to do the same. Please note that we aren't top level birders who were focused on seeing every endemic species -- we just wanted to have fun seeing a variety of cool new birds. So that perspective influences my comments.<br /></p><p></p><h3 style="text-align: left;">The Book</h3><p>If you have been researching places to bird in PR, you have undoubtedly found out about "A Birdwatchers' Guide to Cuba, Jamaica, Hispaniola, Puerto Rico, and the Caymans", by Kirwan, Kirkconnell, and Flieg. We purchased this book and it was helpful for deciding places to go and for some ideas about what we were likely to see at different locations. However, the edition of the book we bought (copyright 2010 and I think the most recent) is hopelessly outdated and therefore much of the information is useless. <br /><br />There are several ways that the book was dated. It spends a lot of time explaining particular hotels where you might want to stay and gives descriptive text (go x miles, turn right on road so-and-so) describing how to get to the sites and hotels. In 2022 you'd be much better off getting an AirBNB than using the outdated hotel information. 
They are available all over the island for half the cost of the hotels and the two we stayed in were clean, safe, and had friendly and helpful hosts. There was also no point in trying to follow the text descriptions. For example: take "the beach road" past some mangrove trees -- which road was the beach road and which of the many mangroves were the right ones? The hand-drawn maps also usually did not seem to bear much resemblance to reality. Thankfully, I had used Google Maps to locate the preserves we visited in advance and save the locations. We were then able to drive directly to them using Google Maps on our phone. (I've included coordinates and links in the text below.) Another problem with the book was that some of the information about facilities was out of date, so we ended up discovering the actual situation (usually: closed) by arriving and finding out in person. The last deficiency (in my opinion) is that the book is super-focused on the birder who MUST see every endemic, so about a third of the text is devoted to how to see three or four of the most difficult birds, which was not our primary concern. So for "normal" birders like us, getting this book was helpful for thinking about where to go and for knowing likely birds to see, but that was about it.<br /></p><h3 style="text-align: left;">General observations</h3><p>If you have birded in a place like Costa Rica with a well-developed ecotourism industry, you will find Puerto Rico somewhat disappointing. Thankfully, PR does have a significant number of protected areas that are publicly accessible, but don't expect much in the way of signage, interpretation, or knowledgeable rangers or local guides. It became almost a joke with us that nearly every visitors' center and developed bathroom was closed and locked. This may be partly due to lingering effects of the hurricanes a few years ago and also the government fiscal crisis, but the bottom line is: bring your own toilet paper and use bathrooms whenever you have the opportunity. The main exception to this was the shiny new National Forest Service visitors' center in El Yunque, which I'll describe in more detail later. <br /><br />Getting around is relatively easy if you rent a car. Nearly all of the roads we drove on were paved, although you can expect some of them to be pretty narrow and on some roads potholes were abundant. With the exception of Rio Abajo State Forest, we had at least one bar of cell phone coverage almost everywhere, so using Google Maps is quite feasible for navigation. Gas stations are not very abundant off the main roads, so it's probably advisable to keep your tank at least half full, although the distances are not far so you can easily visit the more remote places without worrying about running out of gas. <br /><br />As I noted, places being closed can be a significant issue, particularly since some of the best birding is early in the morning or near sunset. So places that have locked gates are an issue that you need to plan around. I will note the places where we had problems with this in the descriptions of individual locations. We did not notice particular patterns, like differences between weekdays and weekends -- things were just closed a lot.</p><h3 style="text-align: left;">Overall strategy</h3><p>We split our one-week trip in half, with the first half operating out of an AirBNB in Fajardo in the northeast and the second half in the southwest, operating out of Sabana Grande. 
Overall, that wasn't a bad idea, although with Cabezas de San Juan being closed and Humacao National Wildlife refuge being difficult to access, it would probably make sense to have spent 2 days in the northeast and the rest of the time in the southwest where there were a lot more locations to bird. We did not go out to either of the islands mentioned in the book (Culebra and Vieques), so if you were going to do that, then you'd want more time in the northeast. Also, I had hoped to snorkel from Seven Seas beach in Fajardo, but there were rip current warnings for the entire north coast of PR during our whole trip, so that didn't happen.</p><h1 style="text-align: left;">The northeast</h1><h3 style="text-align: left;">El Yunque (Caribbean National Forest)</h3><p style="text-align: left;">Catarata Coca drop pin (entrance gate): <a href="https://goo.gl/maps/b7pVNoeUGBkxW6Aw7" target="_blank">18.325206, -65.769975 </a><br /></p><p></p><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjeIGcsZHXO6HuPkaFzrIBCQ5lWe1x35NtrR1frjHUXdsv3M1sgUYJ5OwK2RX0B9Zw9fyMzHZhur0Oi7iZdHN3t9mc52SjURs7ZsQhkdOPtYnhpo_AxAQG-en0gpBbn-iI9fxXd6ZpDjzdik02mjejDpMP4Hz46_UlmVf16Sg3YGlYwqsre_MpcF4Qx=s736" style="margin-left: 1em; margin-right: 1em;"><img alt="map of El Yunque trails" border="0" data-original-height="603" data-original-width="736" height="524" src="https://blogger.googleusercontent.com/img/a/AVvXsEjeIGcsZHXO6HuPkaFzrIBCQ5lWe1x35NtrR1frjHUXdsv3M1sgUYJ5OwK2RX0B9Zw9fyMzHZhur0Oi7iZdHN3t9mc52SjURs7ZsQhkdOPtYnhpo_AxAQG-en0gpBbn-iI9fxXd6ZpDjzdik02mjejDpMP4Hz46_UlmVf16Sg3YGlYwqsre_MpcF4Qx=w640-h524" title="map of El Yunque trails" width="640" /></a></div><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEirGih8OArEpECrtR42VUVGsFs4L225-KQsc44GFPZOiwfBwqYwzZdlENzR9wRxel-zhWcpeeTlPbSOGdJngXy6BuWvfT6NL6Urgg7vJeJGLAZdKnOWOlSFUYCysKVlBmcekxXLI0ce4Yu8iBIZjQtjSIuYtogxlOl2r0y46-6EADdYAMiDFe4Z_MC6=s2990" style="margin-left: 1em; margin-right: 1em;"><img alt="map of El Yunque trails from sign" border="0" data-original-height="2990" data-original-width="2292" height="640" src="https://blogger.googleusercontent.com/img/a/AVvXsEirGih8OArEpECrtR42VUVGsFs4L225-KQsc44GFPZOiwfBwqYwzZdlENzR9wRxel-zhWcpeeTlPbSOGdJngXy6BuWvfT6NL6Urgg7vJeJGLAZdKnOWOlSFUYCysKVlBmcekxXLI0ce4Yu8iBIZjQtjSIuYtogxlOl2r0y46-6EADdYAMiDFe4Z_MC6=w490-h640" title="map of El Yunque trails from sign" width="490" /></a></div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjGvYFDvp_j4WjRjfOEjfnuLnxYKVief2ase6IzolaXAL3SpIiFN0SOaCZA67DiAvdt9MYp2KJ14T6PGDSRYC-NXE6lNB6ZQtxQ1WIGZosBPDo4ihvkfma1vQHiUF0zYrLF7hUZlcaT7vNDP2EhSDP8jgpozhK2_Ij-XRmLVSVwUUd6-giGUbXjiKAe=s4032" style="margin-left: 1em; margin-right: 1em;"><img alt="view of El Yunque Sierra palms forest" border="0" data-original-height="3024" data-original-width="4032" height="300" src="https://blogger.googleusercontent.com/img/a/AVvXsEjGvYFDvp_j4WjRjfOEjfnuLnxYKVief2ase6IzolaXAL3SpIiFN0SOaCZA67DiAvdt9MYp2KJ14T6PGDSRYC-NXE6lNB6ZQtxQ1WIGZosBPDo4ihvkfma1vQHiUF0zYrLF7hUZlcaT7vNDP2EhSDP8jgpozhK2_Ij-XRmLVSVwUUd6-giGUbXjiKAe=w400-h300" title="view of El Yunque Sierra palms forest" width="400" /></a></div><div style="text-align: center;">Sierra palms in El Yunque rainforest<br /></div><div><p></p><p>The Caribbean National Forest 
(which is universally known as "El Yunque" in PR) is the most famous natural area in Puerto Rico and was the place where we saw the most other visitors. The most important thing to understand about visiting El Yunque is the ticketing system for entering the forest by car. To access most of the forest beyond the Catarata Coca (waterfall), you MUST get a "free" (with $2 handling fee) ticket at recreation.gov. We tried to get tickets over a month in advance but they weren't available yet. Then a few days ahead of our visit, all of the advance tickets were already sold out. There is apparently some release date that is not well described on the website, so we probably should have been checking for tickets every day. Thankfully, they hold back 95 tickets which become available at 8 AM local time the day ahead. That was annoying because it meant we needed to be somewhere with Internet at 8 AM the day before we wanted to visit. We were able to get 8 AM entry tickets for the two consecutive days we wanted to go into the forest. A second batch of tickets is available at 11 AM, but the morning is a better time to visit. Both times allow you to stay until the forest closes (I think at 5 or 6 PM). <br /><br />All of the "real" toilets (with water) inside the forest are closed and locked. None of the port-a-potties at the Palo Colorado parking lot had toilet paper, and the toilets themselves were a mess. So plan for that. This situation is particularly pathetic given the shiny new million dollar visitor center near the entrance of the forest. The Sierra Palms parking lot is the best place to park for the most popular trail in the park: the one that goes to the Los Picachos and Mount Britton overlooks. There are several trails shown on the maps, but only one is actually functional -- the El Yunque trail that takes off just a short distance downhill from the parking lot. After hiking most of the way to the ridgetop, the trail splits. The right trail leads to the Los Picachos overlook, which provides a spectacular view, but is pretty muddy at the top. The left trail leads to the Mount Britton overlook. If you take the left trail, you can make it a loop by taking the trail all the way to the road and then walking down the road to the parking lot. The section of the trail from the split left to the Mount Britton overlook passes through the "Elfin forest", an area of stunted trees that is home to the endemic Elfin Woods Warbler. We visited that area on our second day by driving to the end of the road and parking there, then taking the trail towards the Mount Britton overlook from the other side. The elfin woods was quite interesting, but this isn't actually the best place in Puerto Rico to see the warbler (see Maricao State Forest later in the post).<br /><br />We were somewhat surprised that we didn't see many birds along the trail. (The exception was several sightings of the bananaquit, which is abundant everywhere in PR.) That may be partly due to us not being experts and partly due to the difficulty of finding birds in the rainforest canopy, but we've been in other rainforests and this seemed rather disappointing to us. We actually saw more birds near the parking lot and in the area around the visitors' center.<br /><br />You should bring a good raincoat. We got rained on at least once on nearly every hike we took on the trip and it poured on us in El Yunque.<br /><br />I mentioned the visitors' center. It really is quite amazing. It was brand-new, so everything was in beautiful condition. 
They have some nice interpretive exhibits and the person at the desk was actually able to give us some advice about what birds people typically saw around the grounds and where. There are some paved trails right at the center and some well-maintained gravel trails further out. We came back a second time because the area around the visitors center was actually the most productive birding site for us in northeastern PR and maybe of any place in the Commonwealth. The cost to enter is a bit steep ($8 per person) but they honor National Park Service annual and senior passes, so if you have one, you can get in for free.<br /></p><h3><br /></h3><p style="text-align: left;"></p><h3 style="text-align: left;">Humacao National Wildlife Refuge</h3><p style="text-align: left;">beach access drop pin: <a href="https://goo.gl/maps/GDDMgdnUsUmdF8j37" target="_blank">18.151809, -65.764071</a><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhHdEbONngo4el8uW4GEx5h56CrK1pQTZ58eYeuCQryxvEMQf9c7dtnLVpa9LITdSdrATIv2Ks_TDRSenfxOrrJ0t7LsYQHNFblM0GNLl1qnSNwQpdQ9L8aSMMEUgHSJ84fMdL9YDUh1FnChBJG1vqD9vK90YRmhyypEPpw3x56EB9CBKEDhXPJLKtE=s971" style="margin-left: 1em; margin-right: 1em;"><img alt="Humacao National Wildlife Refuge map" border="0" data-original-height="587" data-original-width="971" height="386" src="https://blogger.googleusercontent.com/img/a/AVvXsEhHdEbONngo4el8uW4GEx5h56CrK1pQTZ58eYeuCQryxvEMQf9c7dtnLVpa9LITdSdrATIv2Ks_TDRSenfxOrrJ0t7LsYQHNFblM0GNLl1qnSNwQpdQ9L8aSMMEUgHSJ84fMdL9YDUh1FnChBJG1vqD9vK90YRmhyypEPpw3x56EB9CBKEDhXPJLKtE=w640-h386" title="Humacao National Wildlife Refuge map" width="640" /></a></div><p></p><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgTFqb-7C3s_lgbe_HstbyHriaiANcbW4-6dMedZ1HQVXJlVLZze-6HGvCHjjW1YkpqBFhkhNXBn4OPb3pXcinFGn2snw3308AypbCwVL_mF2AK8mIg87Zyw2nfpyQ6B_U3PKR5rPYILsSo6lSkSOHwIcwpi_HhXau8_ZsXJAO2q0mjGK1bdy4UsH2v=s4032" style="margin-left: 1em; margin-right: 1em;"><img alt="Beach approaching the wildlife refuge near sunset" border="0" data-original-height="3024" data-original-width="4032" height="300" src="https://blogger.googleusercontent.com/img/a/AVvXsEgTFqb-7C3s_lgbe_HstbyHriaiANcbW4-6dMedZ1HQVXJlVLZze-6HGvCHjjW1YkpqBFhkhNXBn4OPb3pXcinFGn2snw3308AypbCwVL_mF2AK8mIg87Zyw2nfpyQ6B_U3PKR5rPYILsSo6lSkSOHwIcwpi_HhXau8_ZsXJAO2q0mjGK1bdy4UsH2v=w400-h300" title="Beach approaching the wildlife refuge near sunset" width="400" /></a></div><p></p><p style="text-align: center;">Beach approaching the wildlife refuge near sunset<br /></p><p>Following the advice of the book, we went to the Humacao National Wildlife Refuge in the evening to look for waterfowl. This area is one of the worst-described in the book. Almost nothing described about the entrance, where to park, etc. was still valid. Instead of a chain that you can step over, there is now a big steel fence with locked gate and unfriendly-looking barbed wire fences after that. I suppose they want to keep people out of the area where they rent out recreational equipment. <br /><br />Since we drove all the way there, we decided to see if there was a way to enter the refuge from the beach, which looked like a reasonable point of access. 
The side roads nearest the preserve are part of a gated community, but going further down the road, we were able to park in a public beach parking lot. We had a nice scenic walk along the beach, where we identified a couple of shore birds. At the end of the beach, there was a short path that took us onto the wildlife refuge drive. From there we were able to easily walk to the drive between the two ponds shown on the hand-drawn map in the book. We were a bit apprehensive about going in the back way when the front gates were closed, but as we walked, we met several local people who were jogging around the ponds. So clearly it was a normal thing for people to be enjoying the preserve after hours.<br /><br />Unfortunately, there was very dense vegetation on both sides of the drive, making it difficult to actually see the ponds. We were able to walk out on some kind of old boat dock and actually see the pond on the south side. We saw some waterfowl, but would have needed a spotting scope to figure out exactly what they were (maybe Caribbean coots?). By this time it was getting dark, so we gave up and headed back along the beach. <br /><br />If you make this kind of sunset visit, I'd recommend dropping a pin on Google Maps on your phone at the place where you enter the beach from the parking lot to make sure that you can find it walking back in the near darkness. <br /></p><h3 style="text-align: left;">Cabezas de San Juan</h3><p>This area was closed and seems to have been closed since the hurricane. After leaving northeast PR, we were told by someone we met that you can make arrangements to visit. However, there was no indication on their website of how that would be possible. So unless you have some inside information, I wouldn't plan to go there.</p><h1 style="text-align: left;">The northwest</h1><p>We planned to stop in several places in the northwestern part of the island on our way to and from the southwest -- we didn't spend any nights there.</p><h3 style="text-align: left;">Cambalache State Forest (Bosque Estatal de Cambalache)</h3><p style="text-align: left;">parking lot drop pin: <a href="https://goo.gl/maps/vAwLs5GrAyNLk9ZT9" target="_blank">18.452568, -66.596961</a> <br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjCeP1zyiQSsv5GVg5kkfIUfiqgnFrl0MhjeFVUaApTsJEc57GAufFB5RcgA_Gk38ov5Zg1gzo7Ctt6ZXt1UvLAzuGhhdj96fuUim-YUOvauz7Pw8ESbg0_2eOQBV8Gmn6KGcfu21qwemC_utd2rhr3VsICbA3LjIdHSsKsaTrY7RBNmQLB9M8RO8aW=s420" style="margin-left: 1em; margin-right: 1em;"><img alt="map of Cambalache State Forest" border="0" data-original-height="337" data-original-width="420" height="514" src="https://blogger.googleusercontent.com/img/a/AVvXsEjCeP1zyiQSsv5GVg5kkfIUfiqgnFrl0MhjeFVUaApTsJEc57GAufFB5RcgA_Gk38ov5Zg1gzo7Ctt6ZXt1UvLAzuGhhdj96fuUim-YUOvauz7Pw8ESbg0_2eOQBV8Gmn6KGcfu21qwemC_utd2rhr3VsICbA3LjIdHSsKsaTrY7RBNmQLB9M8RO8aW=w640-h514" width="640" /></a></div><p>The Birdwatchers Guide did not mention Cambalache State Forest (near Arecibo), but we had read in several places on the Internet that it was a good spot and was a research base for the Puerto Rico Ornithological Society. So we decided to check it out. It turned out to be a nice area to bird after our somewhat disappointing experience in El Yunque. As usual, all of the facilities were closed when we arrived, but that didn't matter since we could park and walk on the trails. There is a network of trails that are quite well maintained. 
We spent a few hours walking slowly along the trail listed as "1" on the map above before having to turn back due to heavy rain. Surprisingly, at the campground in the upper left of the map there was actually one open restroom with a composting toilet that was operational (the other bathrooms were locked as usual). We saw both the Puerto Rican Lizard-Cuckoo and Mangrove Cuckoo here as well as the Puerto Rican bullfinch (which we saw elsewhere as well). So it was well worth a half day.</p><h3 style="text-align: left;">Parador Guajataca</h3><p style="text-align: left;">overlook parking lot drop pin: <a href="https://goo.gl/maps/bgsBKd2f2GvYHEhMA" target="_blank">18.489983, -66.949409</a><br /></p><p> <a href="https://blogger.googleusercontent.com/img/a/AVvXsEhOmfh3KDJHIwh1F_99cJlIlWeda-2Z-ujx0GjkkC6rB4tboZVe8L13pyZnPhAaAQ_YxoXFUhKtJCZ_JzWJt-3qw8ZXwf3OQ-YOam0oXUSSPqO86KJyPZLNKfxR5Gftgxp2yn5FvMMJuId5FVmbMb2BSm7G4_cgKfwvf5YysVgrEm0b08VpXjmfIUEY=s1017" style="margin-left: 1em; margin-right: 1em;"><img alt="map of Parador Guajataca" border="0" data-original-height="509" data-original-width="1017" height="320" src="https://blogger.googleusercontent.com/img/a/AVvXsEhOmfh3KDJHIwh1F_99cJlIlWeda-2Z-ujx0GjkkC6rB4tboZVe8L13pyZnPhAaAQ_YxoXFUhKtJCZ_JzWJt-3qw8ZXwf3OQ-YOam0oXUSSPqO86KJyPZLNKfxR5Gftgxp2yn5FvMMJuId5FVmbMb2BSm7G4_cgKfwvf5YysVgrEm0b08VpXjmfIUEY=w640-h320" title="map of Parador Guajataca" width="640" /></a></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhup93IaWuP-LzNB-w4LArPDQUP_KaEhm0m4Q5LWPFUAHkwGUlwPV-bxMGGYnKo0RkHqynyQT6dtQreeijDdk9grGI9nlT-NUhNzMvcTucTK06mWc52qcdcFY7E2p_tI10T5s4PF94ol5JzOonx43NswI1FV4qG-NGye5Qu5wK9xeadWSHBzTnP4bBy=s4032" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="3024" data-original-width="4032" height="300" src="https://blogger.googleusercontent.com/img/a/AVvXsEhup93IaWuP-LzNB-w4LArPDQUP_KaEhm0m4Q5LWPFUAHkwGUlwPV-bxMGGYnKo0RkHqynyQT6dtQreeijDdk9grGI9nlT-NUhNzMvcTucTK06mWc52qcdcFY7E2p_tI10T5s4PF94ol5JzOonx43NswI1FV4qG-NGye5Qu5wK9xeadWSHBzTnP4bBy=w400-h300" width="400" /></a></div><div style="text-align: center;">Guajataca cliffs from picnic area<br /></div><div><p>This spot was mentioned as a possible hotel venue in Quebradillas in the northwest. We wanted to check it out for the possibility of seeing the White-tailed Tropicbirds that are supposed to nest in the cliffs nearby. The actual site of the hotel/restaurant did not look like a particularly great birding spot and we didn't opt to stay or eat there. However, just to the east of the hotel turnoff is a small park with a parking area and several benches that overlook the ocean. They looked like a much more promising viewing spot. I saw one large white bird fly by when I was getting out of the car, but otherwise we only saw a couple brown pelicans fly by. But it might be a good place to try your luck if you want to stop for a picnic lunch or a break from driving. 
</p><h3 style="text-align: left;">Río Abajo State Forest ( Bosque Estatal de Río Abajo)</h3><p style="text-align: left;">junction near headquarters drop pin: <a href="https://goo.gl/maps/qR1XAsh16YXRhD4W7" target="_blank">18.320761, -66.683640</a><br /></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEiza5_y4pknxY4Aia-yaTjeM62FXlZngL2toHVEaKLoo-N_J0EHabxlL025LgqI3vsFTEo72e8zLAmJoY_bhcllCxjwcLYJZPGS2GHcHqqlFGqmHGjbHeMsYySDS17Jyzvk88WPw1YO9-DqKSyy-JRxuJNWpvO-ly_EvJL29XYqdH1cQCbgrF_Zc0kF=s1293" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="775" data-original-width="1293" height="384" src="https://blogger.googleusercontent.com/img/a/AVvXsEiza5_y4pknxY4Aia-yaTjeM62FXlZngL2toHVEaKLoo-N_J0EHabxlL025LgqI3vsFTEo72e8zLAmJoY_bhcllCxjwcLYJZPGS2GHcHqqlFGqmHGjbHeMsYySDS17Jyzvk88WPw1YO9-DqKSyy-JRxuJNWpvO-ly_EvJL29XYqdH1cQCbgrF_Zc0kF=w640-h384" width="640" /></a></div><br /> The Río Abajo State Forest is most well-known as the best place to see the endangered Puerto Rican Parrot. However, it's a long shot since you aren't allowed to get close to the aviaries area. We were told by some birders who had seen the parrot on the previous day that the best strategy was to walk down the trail towards the aviaries (but stopping before the electronic gate) after about 3:30 to 4 PM when they return to roost. We weren't there at the right time of day, so mostly were just interested in seeing birds in general. <p></p><p>The first issue was figuring out where you actually could go to bird. The road leading from the highway to the forest T's into another road. The sign directs you to the visitors' center a short distance on the right, near the intersection. It has a huge, fancy sign, but was not open (of course) and apparently hasn't been open for several years. To the left was some headquarters buildings (also closed). The access to the forest is on the left branch of the road. You have to drive a significant distance past a lot of residences, which gives you the impression that you were out of the forest or had somehow missed it. This was the one place in PR where we had no cell service, so we had to go on faith that the road actually eventually dead-ends at a closed gate. At the gate there is a sign that says "danger", although it was not at all apparent what the danger was. Beyond the gate is just a paved road through the forest that probably would have been pretty good for birding if we had been there earlier in the day. As it was, we mostly managed to finally see a black-whiskered vireo, which we had been hearing repeatedly throughout the trip. What we had been told by the other birders was that the forestry people were OK with people birding along that road as long as they stayed on the road and did not enter the parrot area after the second gate. We never made it to the second gate because we turned around due to lack of time. </p><h1 style="text-align: left;">The southwest<br /></h1><p>We spent two days making circuits through the southwest part of the island. 
The first day we went along the coast and the second day we visited the high elevations.</p><h3 style="text-align: left;">Guánica State Forest (Bosque Estatal de Guánica)</h3><p style="text-align: left;">parking area drop pin: <a href="https://goo.gl/maps/JoEgsEypJ7zZDyUo8" target="_blank">17.971403, -66.868727</a><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEi3JFO9WhEiXfcRntypY9QNDWnerfNJq-1nBpdojBxAkkD3rfL_N8wK3NID3TSmL3b1fUK4NNkwEb6dzRB5udFy2UkFqUdFVoLrI0jkV9-bUSqgnlo5oZuQb0INoTkCbLjASaHmBWS1BYha0ZwfJ_EM_6Lhlh22F4KN8I8O-Y3qzzR0VhfJf9HKrZuY=s1078" style="margin-left: 1em; margin-right: 1em;"><img alt="Guánica State Forest map" border="0" data-original-height="502" data-original-width="1078" height="298" src="https://blogger.googleusercontent.com/img/a/AVvXsEi3JFO9WhEiXfcRntypY9QNDWnerfNJq-1nBpdojBxAkkD3rfL_N8wK3NID3TSmL3b1fUK4NNkwEb6dzRB5udFy2UkFqUdFVoLrI0jkV9-bUSqgnlo5oZuQb0INoTkCbLjASaHmBWS1BYha0ZwfJ_EM_6Lhlh22F4KN8I8O-Y3qzzR0VhfJf9HKrZuY=w640-h298" title="Guánica State Forest map" width="640" /></a></div></div><div></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEh3-FFBcLIJYC-NskleOfOySzR_DuUXQ-5WivO6Ck0kXObBniFRAoX6o2cnjhGVH0fFEM4JEkHbPF1H2oWI9ntyBvmzaWUenbIiLV4lwdfCmK2aut5AjUNAGP7mnGjCDlddqO6DkRqsJI8ntFUiCm6CsFC0fe2vAAEbt_Y28JeAeKZr_lGaSEL64RY7=s4032" style="margin-left: 1em; margin-right: 1em;"><img alt="dry forest in Guánica State Forest" border="0" data-original-height="3024" data-original-width="4032" height="300" src="https://blogger.googleusercontent.com/img/a/AVvXsEh3-FFBcLIJYC-NskleOfOySzR_DuUXQ-5WivO6Ck0kXObBniFRAoX6o2cnjhGVH0fFEM4JEkHbPF1H2oWI9ntyBvmzaWUenbIiLV4lwdfCmK2aut5AjUNAGP7mnGjCDlddqO6DkRqsJI8ntFUiCm6CsFC0fe2vAAEbt_Y28JeAeKZr_lGaSEL64RY7=w400-h300" title="dry forest in Guánica State Forest" width="400" /></a></div><div style="text-align: center;">dry forest in Guánica State Forest<br /></div><div><div></div><div></div><div><br /><p>This is supposed to be one of the best birding spots in Puerto Rico and we were not disappointed by it. It is a very dry forest, so don't expect spectacular scenery, though. We took the main road (PR334) into the forest until it ended at the headquarters. When we arrived, there was briefly a guy sitting at an information booth, although by the time we got back around noon he was gone and everything (as usual) seemed completely closed up. Near the parking lot there was a reasonably nice picnic area with actual flush toilets and toilet paper (at least the day we were there). What we learned from the guy at the booth was that the trail starting at the picnic area was a loop if you went "left, left, left, left…". That turned out to be true and the trail was a nice length for a slow birding ramble. We had very satisfying multiple views of the Puerto Rican Tody and Adelaide's Warbler along the trail where they seemed quite common. 
</p><h3 style="text-align: left;">ANP Salias Fortuna Para La Naturaleza/Biolumenescent Bay</h3><p style="text-align: left;">gate drop pin: <a href="https://goo.gl/maps/Np9SYTMCjB4ieZ1Y8" target="_blank">17.977386, -67.011882</a><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEiQAfdxvqZThfarl17ciyRdgvXzJh7laQQlbv4YLh49v7Lju1mNhe3E_2yoPKfjEFSQrVYeb1T9rUpjVl5801wEXXLLsOClHBichxDhP3Q44LOsJa5OV2xfsTfhqvHrkNBZ5uHoMYD-YKPILgDrGo2i_RJqa54cwHVxhBR4SRGgUBq-xgGtqZpA-06a=s1447" style="margin-left: 1em; margin-right: 1em;"><img alt="ANP Salia Fortuna map" border="0" data-original-height="624" data-original-width="1447" height="276" src="https://blogger.googleusercontent.com/img/a/AVvXsEiQAfdxvqZThfarl17ciyRdgvXzJh7laQQlbv4YLh49v7Lju1mNhe3E_2yoPKfjEFSQrVYeb1T9rUpjVl5801wEXXLLsOClHBichxDhP3Q44LOsJa5OV2xfsTfhqvHrkNBZ5uHoMYD-YKPILgDrGo2i_RJqa54cwHVxhBR4SRGgUBq-xgGtqZpA-06a=w640-h276" title="ANP Salia Fortuna map" width="640" /></a></div><br /><p>On our way to from Guanica to La Parguera, we stopped at a wildlife refuge (operated by Para la Naturaleza <a href="https://www.paralanaturaleza.org/">https://www.paralanaturaleza.org/</a>) that wasn't mentioned in the book, but that I'd seen on Google Maps. I have no idea when it's supposed to be open or if there is ever any kind of programming there. There was a small building on the site and some kind of construction of a small bridge or something, but there was no explanation or any indication of whether it was open to the public. So as usual, we parked the car by the gate and walked in. This was a nice area for observing wetlands birds and we saw a nice Great Egret, Short-billed Dowitcher, and several other wetland birds that were too far away to identify (a spotting scope would have been good here). I'm not sure this is any better than other wetlands in the area, but it was easy to get to and a nice stop if you are making the obligatory trip to La Parguera to try to see the Yellow-shouldered Blackbird.<br /><br />Incidentally, we did not manage to see the blackbird in La Parguera. The instructions in the Birdwatcher's Guide were pretty incomprehensible. We found the Parador Villa Parguera with no problem, but it did not seem like the mangroves there were any better than others we could see from the road. We utterly failed to find the "general store" described in the book and after wasting about an hour looking around the town unsuccessfully, we moved on. <br /><br />Although this has nothing to do with birds, it is worth mentioning that La Parguera is probably the best place from which to visit a bioluminescent bay. This bay is apparently the only place in PR where you are actually allowed to get in the water and there are various options, such as going out in a boat at sunset and snorkeling, or being towed out in kayaks by a boat and then bobbing around in a life jacket. Unfortunately, we did not book far enough in advance to do either of these options, so you should book online at least 2 or 3 days ahead of when you want to do it. We had no problem getting a spot on a boat with a glass bottom. The luminescence was really quite amazing, but unfortunately our timing was off since the moon was at first quarter and was making a lot of light at sunset. We really could only see the luminescence through the glass under the boat when one of the operators swam down and kicked his legs under the glass. 
It would have been much better to have actually been in the water or at least to have seen the effect on the boat's wake when the moon was not up and it was darker. But it was still pretty cool and the trip was only $15. <br /></p><h3 style="text-align: left;">Cabo Rojo</h3><p>lighthouse parking lot drop pin: <a href="https://goo.gl/maps/GdGhkmYg79JMPa1R6" target="_blank">17.937730, -67.194344</a> <br /></p><p>Our last stop for the day of our coastal exploration was the peninsula of the Cabo Rojo National Wildlife Refuge. This area was not as scenic as we expected and the well-known "pink" lagoons looked like some kind of sedimentation ponds. We did manage to identify a couple of shorebirds along the road and we managed to see the introduced Venezuelan Troupial, which was fun. </p><p>There are a couple of issues you should be aware of. One is that upon entering the refuge proper, the road degenerates terribly into the worst road we encountered on the island. We had to weave back and forth from one side of the road to the other to avoid breaking the axle of our car on giant potholes, and there was one deteriorated bridge/culvert where we almost turned around because we weren't sure we could cross without damaging the bottom of our low-clearance rental car. We did finally make it to the end of the road to the lighthouse parking lot and had just gotten out to make the kilometer or so walk up to the lighthouse when a couple of police officers warned us that we needed to make sure that we were out of the refuge in 45 minutes or we would get locked in when they closed the gate at 5 PM. So we abandoned the attempt to see the lighthouse and the alleged Brown Boobies on the rocks below it. If you plan to do this excursion, come early in the day and plan for plenty of time to crawl along the horrible road.<br /></p><h3 style="text-align: left;">Maricao State Forest</h3><p> km 16.8 drop pin: <a href="https://goo.gl/maps/sRqKfWmh4uLaPxNU9" target="_blank">18.156738, -66.997737</a></p><p>vacation cottages parking lot drop pin: <a href="https://goo.gl/maps/bvquGyigRSz6J1Td6" target="_blank">18.140393, -66.974230 </a><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhirjYcI-_03rjNytY7yjxiBL_jC2XGBRjBhFBpnmepcIqsntIM4XdbYy51b1vgmnZzYCm-xBPEuDxl7LUgSylaDVy7RmdhbfVY3IXjIvPWssqyPCfllgkoGDXD5sbXxRFterN_wt_DE56TIx_0Up1l2t5J8rDeToPrXKzMRGNJjPvhGPhfm5RX6e9g=s1174" style="margin-left: 1em; margin-right: 1em;"><img alt="Maricao State Forest map" border="0" data-original-height="631" data-original-width="1174" height="344" src="https://blogger.googleusercontent.com/img/a/AVvXsEhirjYcI-_03rjNytY7yjxiBL_jC2XGBRjBhFBpnmepcIqsntIM4XdbYy51b1vgmnZzYCm-xBPEuDxl7LUgSylaDVy7RmdhbfVY3IXjIvPWssqyPCfllgkoGDXD5sbXxRFterN_wt_DE56TIx_0Up1l2t5J8rDeToPrXKzMRGNJjPvhGPhfm5RX6e9g=w640-h344" title="Maricao State Forest map" width="640" /></a></div><br /><p></p><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEiUcaniyN5s_IjAIQlFMpPEl6nCnsMtUjt68js7Rr7iZ3WFTczWKgfNILyahB6zjBStn1PCGca5Sp5eWZNgyaKl3uWoxU6gGZmrGYKmG4aBwyALJppGJqR61h4j7j6z7Av4Hy1InJ5aAasxZNeBQOs6Fey4EosrUTVEKYtd8R8FoCf4cD5Y--tM2-km=s4032" style="margin-left: 1em; margin-right: 1em;"><img alt="Elfin forest in Maricao State Forest" border="0" data-original-height="3024" data-original-width="4032" height="300" 
src="https://blogger.googleusercontent.com/img/a/AVvXsEiUcaniyN5s_IjAIQlFMpPEl6nCnsMtUjt68js7Rr7iZ3WFTczWKgfNILyahB6zjBStn1PCGca5Sp5eWZNgyaKl3uWoxU6gGZmrGYKmG4aBwyALJppGJqR61h4j7j6z7Av4Hy1InJ5aAasxZNeBQOs6Fey4EosrUTVEKYtd8R8FoCf4cD5Y--tM2-km=w400-h300" title="Elfin forest in Maricao State Forest" width="400" /></a></div><div style="text-align: center;">Elfin forest in Maricao State Forest<br /></div></div><div><p>We spent the entire morning of our southwestern uplands tour in the vicinity of the Maricao State Forest and it was one of our most productive birding excursions. One advantage of this forest is that a public road (PR 120) passes through it and there are several good stopping places along the road that are never closed off by gates. We started by going straight to km 16.8 where one of the two sets of serious birders we met on the island had seen the Elfin Woods Warbler. We pulled off the road in a small parking area by a gate and immediately heard several of the warblers in a big tree near where we parked. We managed to get a reasonably good look at them before they moved on. We walked for some distance along the trail and saw and heard several other birds including the Puerto Rican Vireo, Puerto Rican Woodpecker, and the Puerto Rican Bullfinch. On our way to checking out what we assumed was the visitors center, we stopped at La Torre de Piedra, a stone overlook in the form of a castle (built by the Civilian Conservation Corps) that had a great view. We saw the Puerto Rican Spindalis there. Beyond the stone tower towards Sabana Grande, we came to what we thought was the Forest Service buildings and picnic area shown on the hand-drawn map in the book. But the area bore no resemblance to the map -- it had vacation cottages, a swimming pool (and maintained bathrooms that were open!). We never did figure out where the supposed "concrete cistern" and other spots on the map were located. We did, however, have amazing looks at several Puerto Rican Woodpeckers that were hanging around in some dead trees around one of the parking lots. <br /><br />After checking out that area, we headed back to km 16.2, which was an intersection of an actual road that branched off the main road. We parked and walked down the intersecting road and were treated to a second look at the Elfin Woods Warbler -- this time an adult and a juvenile. To cap things off, we also spotted an amazing Antillean Euphonia singing its heart out high in a tree along the road. 
All in all, this was one of the most productive areas for birding in the whole trip and we weren't locked out of any of it by gates!<br /><br /></p><h3 style="text-align: left;">Susua State Forest (Bosque Estatal de Susua)</h3><p style="text-align: left;">entrance gate drop pin: <a href="https://goo.gl/maps/9BFZLCU6aPksz3FQ9" target="_blank">18.071079, -66.914372</a><br /></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEg5tp62sEpJ4Hbok0cfu_z3AiEuFEvC5bImmPH7wDxDKuA3-QVl_AYspMEYSKg1MS6BdhvTXXNKAI56ujE5XIKPY85wKVzudX8a_1YKZoTZQCN8iSObWpbFXFQjh2jvKG89xDBavexHoPLd_r0ZRFf-vmIaIvzd240oQ6iEtW0O-jgWMg5ZM9MJ7hDe=s4032" style="margin-left: 1em; margin-right: 1em;"><img alt="Susua State Forest vista" border="0" data-original-height="3024" data-original-width="4032" height="300" src="https://blogger.googleusercontent.com/img/a/AVvXsEg5tp62sEpJ4Hbok0cfu_z3AiEuFEvC5bImmPH7wDxDKuA3-QVl_AYspMEYSKg1MS6BdhvTXXNKAI56ujE5XIKPY85wKVzudX8a_1YKZoTZQCN8iSObWpbFXFQjh2jvKG89xDBavexHoPLd_r0ZRFf-vmIaIvzd240oQ6iEtW0O-jgWMg5ZM9MJ7hDe=w400-h300" title="Susua State Forest vista" width="400" /></a></div><p></p><p style="text-align: center;">Vista at Susua State Forest<br /></p><p> We had planned to wrap up our day of birding in the southwestern uplands by spending some time in Susua State Forest, just to the east of where we were staying in Sabana Grande. We drove up the narrow road to the forest and were surprised to encounter a locked gate at the entrance. Apparently the gate is locked at 3 PM! We decided to park the car at the gate and walk for a while along the road into the forest, but it was hot, dry, and late in the afternoon, so after walking for about a half hour without seeing anything but vultures, we turned around and walked back to the car. We did see a single Scaly-naped Pigeon near the gate, but that was it for birds. The plants were interesting -- we saw several large cacti and a weird spiny plant that we were told by some botanists was the Puerto Rican version of poison ivy. But this is definitely a place you will want to visit earlier in the day if you plan to drive in.<br /></p><h1 style="text-align: left;">Summary</h1><p>If you live in the U.S., Puerto Rico is a pretty easy and relatively inexpensive place to visit, since no special travel rules apply, you can find reasonably priced car rentals and accommodations, and many residents speak English if you don't know Spanish. If you've never been to a rainforest before, El Yunque is very interesting, and the southwestern part of the island has a wide variety of habitats in a relatively small area. As a birding spot, it can be fun if you've never birded in the tropics before, and most of the birds we saw were natives (as opposed to some places where you mostly see introduced birds). However, it pales in comparison to other places we've birded, like Costa Rica and southern Africa, where there are just a lot more species. 
Nevertheless, it is quite easy to see a dozen or so species that are endemic to Puerto Rico and the Virgin Islands, so it is a place you can go to see unique wildlife.<br /></p><p><br /></p><p><br /></p><p><br /></p><p><br /></p><p><br /></p><p><br /></p><p><br /></p></div></div></div>Steve Baskaufhttp://www.blogger.com/profile/01896499749604153763noreply@blogger.com1tag:blogger.com,1999:blog-5299754536670281996.post-34329465093687854732022-01-17T19:09:00.005-08:002022-01-17T19:58:17.956-08:00Investigating Wordle guesses<h2 style="text-align: left;">Introduction</h2><p>Those of you who know me know that I like to play somewhat complicated games. One of the reasons I enjoy that is the challenge of figuring out how the game works and how I might play the game better by using my understanding to develop rules of thumb for making decisions during the game. </p><p>I also enjoy writing Python scripts when I've found an interesting problem to solve. In this particular case, I ended up spending most of the MLK holiday weekend working on a Python script to investigate the best words to use as guesses in Wordle, the online game that has gone viral. </p><p>I've been playing for less than two weeks, and during that time I've had an excessive number of discussions with my wife and two daughters about strategies for word choices. One such discussion was about "what's the best first word?" It seemed clear that one would like a word that contained common letters in order to increase the probability of getting correct guesses early. A much longer discussion was centered around whether it would be a better strategy to pick a second word that was complementary to the first word (containing different common letters, e.g. vowels) in order to get the most information during the first two guesses, or whether the second move should capitalize on information gained in the first guess (e.g. limiting the second guess to words that would both help discover new letters and determine the positions of any letters discovered in the first guess).</p><p>One appealing feature of this problem is that it is highly tractable using a modern computer. The number of 5-letter English words is limited and a script could sort them out in a minuscule amount of time. Thus a simple and brute-force approach would be totally practical. <br /></p><h2 style="text-align: left;">Background knowledge</h2><p> There are several key things that one would want to know before starting out. The first is whether game creator Josh Wardle is tricky and tries to pick "hard" words as a human opponent would, or whether the game words are random. In an <a href="https://theworld.org/media/2022-01-14/wordle-goes-global" target="_blank">interview on The World</a>, he said that he wanted the words to be random so that he could play himself. </p><p>Another important thing to know is what set of words is actually used as a source for the game. I don't know the answer to that question, but in the interview, he mentioned that the words were drawn from a list of about 2500 English words. There was also a controversy about the January 12 word "favor", which raised the ire of British users who didn't consider that a proper five-letter word because they thought it should be spelled "favour". This is an indication that the word list might be derived from an American English rather than a British English word list.</p><p>When I first started playing around, I tried extracting words from several random English word lists that I found on the Internet. 
However, they either included a lot of questionable words to give Scrabble players an edge, or were polluted with capitalized proper names, abbreviations, etc. I finally came across a very high-quality curated list of words called the <a href="http://wordlist.aspell.net/12dicts-readme/#nof12" target="_blank">"6 of 12" list</a> (so named because it contains words found in at least 6 of the 12 dictionaries used as sources). This list is heavily curated and very clean: proper names are all capitalized and abbreviations without periods or spaces are specially marked so they can be excluded. After running some code to screen out names, abbreviations, and words with lengths other than 5, I came up with a clean <a href="https://gist.github.com/baskaufs/8c5f187e41f37af7e395c7094eb796d8/raw/cc40500c0ecc7b4e33dedf96451d26ef6362af2b/five_letter_6of12.txt" target="_blank">list of 2529 words</a> as a source for experimentation. Not only is this list about the same size as the list used by Wardle for the game, but all of the <a href="https://screenrant.com/wordle-answers-updated-word-puzzle-guide/" target="_blank">words used in the game since 1 January 2022</a> are on it. (Spoiler alert: the previous link is updated daily, so following it may reveal the answer to today's puzzle if you haven't done it already.) So I consider my list to be a very good proxy for the possible words in the game.</p><h2 style="text-align: left;">The script</h2><p>If you want to look at and try the code I'm going to discuss, you can run it yourself on <a href="https://colab.research.google.com/drive/1164Vrvh7uVO1jToItf1gXw-jcpyS63YE?usp=sharing" target="_blank">a Colab notebook</a> without installing anything. (You must have and be logged in to a Google account, however.) If you want to edit and save the code, select "Save a copy in Drive" from the file menu and you will be able to save your work. The data are in GitHub, so you don't have to download anything either. Before all of you software engineers start sharpening your knives, I'm not a professional coder, so be kind. <br /></p><p>A major part of the code is a <span style="font-family: courier;">Wordle_list</span> object. If you instantiate it without any arguments, you just get the full word list from GitHub. If you pass in a "guess code" and a Python list when you instantiate a <span style="font-family: courier;">Wordle_list</span>, the code will be used to screen out words from the input list according to the information given in the guess code (equivalent to what you see on the screen after you've made a guess in the game). Values of instance variables and results of methods on a <span style="font-family: courier;">Wordle_list</span> instance will tell you things about the screened list, such as how many words are in it and information about the frequencies of letters in the words.<br /></p><p>The other major part of the code is the <span style="font-family: courier;">score_words_by_frequency()</span> function. This is the actual "experimental" part of the code where I played around with various ways to score the words on the list to identify which words would make the best next guess. I will talk a lot more about this later.</p><p>To actually use the code, run the first cell to define everything, then scroll down to the "Screening algorithm test" section and start running the code in the next cell. There are some instructions you can read there, but here's the TLDR information:</p><p>1. 
To use the code with an actual puzzle, change the value of <span style="font-family: courier;">actual_word</span> to an empty string (two consecutive single quote characters: <span style="font-family: courier;">''</span>). That will stop the script from trying to suggest guess codes for some other word. If you want to test one of the previous puzzle words (or any word) set the value of <span style="font-family: courier;">actual_word </span>to the word you want. <br /></p><p>2. Before you run each of the guess cells, you have to set the value of the <span style="font-family: courier;">next_guess_code</span> variable. If you are testing using an actual known word entered in step 1, a guess code will be suggested for you at the end of the output of the previous cell using the highest ranking word based on the criteria used by the word-scoring function. It will be used automatically if you run the next cell without changing the value of <span style="font-family: courier;">next_guess_code</span>. If you are using the output from the game, set the value of <span style="font-family: courier;">next_guess_code</span> using an uppercase letter for a correct letter in the correct position, a lowercase letter for a correct letter in the wrong place, and a lowercase letter prefixed by a dash for an incorrect letter. Separate the individual letter codes by commas but no spaces.<br /></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhdGKzd7cZd6QS1hrtAB9IEOggh5xn_mNLV3A5j6LjdshVi2Xajn_nbsdrsXa0kbJi1gF26d58w0aL76BB7KSYQfj6v7eM6VZ634IrL7Uz0MDRORlg58De9t-sV7cYWuVwOiLMM1MNH2kHmr33mH6RpCRcQbrDsVBm2iujA4ATCagzQNHrt86Y97teC=s1334" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1334" data-original-width="750" height="320" src="https://blogger.googleusercontent.com/img/a/AVvXsEhdGKzd7cZd6QS1hrtAB9IEOggh5xn_mNLV3A5j6LjdshVi2Xajn_nbsdrsXa0kbJi1gF26d58w0aL76BB7KSYQfj6v7eM6VZ634IrL7Uz0MDRORlg58De9t-sV7cYWuVwOiLMM1MNH2kHmr33mH6RpCRcQbrDsVBm2iujA4ATCagzQNHrt86Y97teC=s320" width="180" /></a></div>Here's the code for the example above: <span style="font-family: courier;">'S,-e,r,-g,E'</span>.<br /><p></p><p>3. 
As you run each cell, it will show you how it has reduced the number of possible answer words by applying various screens based on the guess:</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhBhAHjCvLSL69CIC4WZN0yu8dig-_BLB45-MiKdtzhsUigCnna-8mzY5aDZjp6D4vaTl4k4xnbmhp_eRmE7ySBfaQXWynzZbQoEmTI82Ca_7lgxWfLTgUXhbcxnsMziIYBy9ysScaX-gbqrl0ujV42mjfRs1CwD7PTR1PO2HgX1FsgcXrqL6d94s3y=s1083" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="314" data-original-width="1083" height="186" src="https://blogger.googleusercontent.com/img/a/AVvXsEhBhAHjCvLSL69CIC4WZN0yu8dig-_BLB45-MiKdtzhsUigCnna-8mzY5aDZjp6D4vaTl4k4xnbmhp_eRmE7ySBfaQXWynzZbQoEmTI82Ca_7lgxWfLTgUXhbcxnsMziIYBy9ysScaX-gbqrl0ujV42mjfRs1CwD7PTR1PO2HgX1FsgcXrqL6d94s3y=w640-h186" width="640" /></a></div><br />then show you the top five scoring words for the next guess:<p></p><div class="separator" style="clear: both; text-align: center;"></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEhKr4e6vbABinz_OofvKdSeYBH84f-JPYcRyCnlk9LDx4ag4TGxNIv9ncNLUHqg0XmAN1vV_-YsbHNbVvUvxoawOM30PqnjC05SK5z8GmX9shhM4q_nqcfGOaVgwR1PfC6GkXWKR6hyqAzXzz33TMy5pDCEqqosKDQpRhSzqbQQiiByLWr8Rl2oKMD2=s444" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="444" data-original-width="337" height="320" src="https://blogger.googleusercontent.com/img/a/AVvXsEhKr4e6vbABinz_OofvKdSeYBH84f-JPYcRyCnlk9LDx4ag4TGxNIv9ncNLUHqg0XmAN1vV_-YsbHNbVvUvxoawOM30PqnjC05SK5z8GmX9shhM4q_nqcfGOaVgwR1PfC6GkXWKR6hyqAzXzz33TMy5pDCEqqosKDQpRhSzqbQQiiByLWr8Rl2oKMD2=s320" width="243" /></a></div><p></p><p>4. Repeat entering guess codes and running subsequent cells until there is only a single word remaining.</p><h2 style="text-align: left;">Scoring words</h2><p>The most important open question is how words should be scored for selecting the next guess. Here's the general approach I took:</p><p>1. Determine the distribution of letters in the words by counting up how many times they occurred.</p><p>2. Order the letters of the alphabet from highest frequency to lowest.</p><p>3. Assign a rank to each letter based on its position (0=most frequently used, 25=least frequently used). Ties get the same rank, then missing ranks are skipped until the next non-tied letter (e.g. 1, 2, 2, 4, ...)<br /></p><p>4. Score a word by adding up the ranks for each of the five letters in the word. The lowest score is the best word.</p><p>This sounds pretty simple and most people use some variation of this in their head based on knowledge they have about letter use from everyday life, playing Scrabble, trying to crack cyphers (didn't everyone do that?), etc. A lot of our early discussion about Wordle guesses revolved around what were the most common letters. For example we realized that what counts is frequency of dictionary words, not frequency of words "in the wild" since a relatively small number of words are used a lot in normal text (e.g. "their", "there", "where", etc.). But what matters the most is the frequency of letters in the 2529 words at play in the game, and since we have the list and a computer, we can know anything we want about letter distributions.</p><p>A key consideration is whether we should care more about the overall distribution of letters anywhere in the words or if we should be more concerned about the distribution of letters in particular positions within the words. 
For example, the letter "y" is not that common overall, but is a lot more frequent in the last position of the word. The overall distribution is particularly important early in the game when nothing is known about any position, but later in the game, the distribution in particular positions becomes more important as only some positions remain undetermined. </p><p>The other consideration is that on any particular turn, we don't really care about the distributions of letters in all 2529 words, but only about the distributions of letters in the words that haven't yet been eliminated. An unassisted human player would lack concrete information about this, but with a script, it's easily known. </p><p>When a new <span style="font-family: courier;">Wordle_list</span> object is created after each guess, the frequencies and ordinal positions of every letter are calculated both for the words as a whole and for each of the five positions, using the words that remain after eliminating words using the guess code. The ordinal positions for the letters are then available for calculating the word scores. <br /></p><p>When I first started playing with this, I calculated two sets of scores: "overall scores" for words with unique letter combinations (no repeated letters) based on the overall letter frequencies, and "position scores" for all words based on the separate letter frequencies of each of the five positions in the word. Words that scored well by the first system were the most efficient at discovering/eliminating common letters. Words that scored well by the second system weren't as efficient at discovering letters, but were more efficient at placing letters in the correct position. There are some pairs of words like "aster" and "stare" that have equally good overall scores because they have the same letters. But "stare" has a much better position score than "aster" (14 vs. 30), partly because a lot of words end in "e". So if two words have the same overall score, pick the one with the better position score. </p><p>Having two different scores was not a good solution, so I then tried to think how I could combine them into a single score. The easy solution was to assign weights to the two scores. I somewhat arbitrarily chose 0.7 for the overall score and 0.3 for the position score. This gave more weight to efficient letter discovery, while giving some weight to the position as a sort of "tiebreaker". This solution worked fairly well, but I quickly discovered two problems. </p><p>The most obvious problem was that I was only calculating the overall scores for words with unique letters. How should I score words in which the same letter occurred more than once? The answer came to me when I realized that from the standpoint of letter discovery, repeating a letter is worse than using even the worst letter because it gives you no new information. Therefore, I assigned a score of 26 to the second (or subsequent) instance of a letter in a word. </p><p>However, this had an unintended consequence. Particularly late in the game, it may actually be more efficient to use words with repeated letters depending on the distribution of letters in the undetermined positions. The 26-point penalty on the overall score for repeat letters was so severe, it basically prevented repeat-letter words from ever being selected. The solution that I chose was to reduce the overall score weight (from 0.7 to 0) and increase the position weight (from 0.3 to 1) as the number of words decreased towards zero. 
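</p><p>To make the weighting scheme concrete, here is a minimal sketch of the idea. This is not the actual <span style="font-family: courier;">Wordle_list</span> code: the function and variable names are made up, ties in the rankings are handled more crudely than described above, and the exact sliding formula is just one plausible way to move the weights from 0.7/0.3 toward 0/1 as the list shrinks.</p><pre style="font-family: courier; font-size: small;">
from collections import Counter

def letter_ranks(words):
    # Rank letters 0 (most common) to 25 by frequency in the remaining words.
    counts = Counter(letter for word in words for letter in word)
    ordered = [letter for letter, count in counts.most_common()]
    return {letter: rank for rank, letter in enumerate(ordered)}

def position_ranks(words):
    # One rank dictionary for each of the five letter positions.
    return [letter_ranks([word[i] for word in words]) for i in range(5)]

def word_score(word, overall, positional, words_left, full_size=2529, base_overall=0.7):
    # Lower scores are better. The overall weight slides from 0.7 toward 0 and
    # the position weight from 0.3 toward 1 as the remaining list shrinks.
    fraction = min(words_left / full_size, 1)
    overall_weight = base_overall * fraction
    position_weight = 1 - overall_weight
    overall_part = 0
    seen = set()
    for letter in word:
        # A repeated letter discovers nothing, so it gets a penalty worse than any rank.
        overall_part += 26 if letter in seen else overall.get(letter, 25)
        seen.add(letter)
    position_part = sum(positional[i].get(letter, 25) for i, letter in enumerate(word))
    return overall_weight * overall_part + position_weight * position_part

# Example: score the words that are still possible and pick the best next guess.
remaining = ['stare', 'aster', 'spire', 'shire']
overall = letter_ranks(remaining)
positional = position_ranks(remaining)
print(min(remaining, key=lambda w: word_score(w, overall, positional, len(remaining))))
</pre><p>With the full 2529-word list the score is dominated by overall letter frequency; as the remaining list shrinks, the positional ranks take over and the 26-point penalty no longer rules out words with repeated letters.</p><p>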
That made the repeat-letter penalty disappear late in the game and had the secondary benefit of emphasizing letter selection (through the overall score) in the first few guesses and emphasizing position selection in the later guesses. </p><p>This "sliding scale" of weights is what is used in the final version of the script, although you can manually adjust the base scoring weights of 0.7 and 0.3 in the initial setup. <br /></p><h2 style="text-align: left;">The optimal second guess</h2><p>One of the longest points of discussion between my wife and me was whether one should always use two great "letter discovery" words in the first two guesses, or vary the second guess by choosing the best word based on information gained in the first guess. I could see benefits to both strategies. The "letter discovery" system could be very efficient at nailing down the letters and eliminating a lot of words in the first two guesses. It is also a very simple strategy that doesn't require any thinking about what words might or might not have been eliminated until the third guess. The "information-based" second-guess system makes use of what has been learned about letters in the first guess to allow the selection of a tailored second guess based on the remaining words (at least if you have a computer to keep track of the words for you). Once the program was finished, I had an opportunity to test the two approaches using the script. </p><p>One thing that you'll notice if you use the script is that the first guess is always "arose", since the selection of the best-scoring word from the set is deterministic. There is a section of the notebook labeled "Test of rating words by frequencies" whose first cell finds the 10 best-scoring words with unique letters. The second cell then eliminates all of the unique-letter words that have any of the letters from the first word in order to score the best "complementary" second word to go with the first word. The default is to use "arose" as the first word, and it results in "glint" as the best second word to go with it. You can hard-code one of the other first words as the value of <span style="font-family: courier;">first_word</span> in the second cell to find its complement, e.g. "stare" (first word) and "doily" (second word). </p><p>If you run the cell in the "Calculate stats for raw list" section, you'll find that the overall order of frequencies for letters in the five-letter word set is e, a, r, o, t, s, l, i, n, u, c, y, d, h, ... . "arose" includes five of the six most common letters and adding "glint" gets nine of the top ten. "stare" includes five of the top six, and adding "doily" gets all of the top eight and ten of the top thirteen. </p><p>I ran a test using the 17 game words from 1 January to today, comparing the strategies of always using "arose" as the first word and "glint" as the second word vs. using "arose" as the first word and letting the script's scoring system select the second word. The metric was the number of words that remained after screening words out using the guesses. In three cases, it was a tie, with both strategies resulting in the same number of words (1 or 2 words left). In seven cases, the "glint" choice did better and in seven cases, using the best-scoring word did better. </p><p>Based on this metric, both systems worked about equally well. 
However, in six of the seven cases where "glint" as a second choice won, the number of possible words remaining after the second guess was 1 (meaning that the strategy would have produced a pretty amazing score of 3 for each of those games). Only two of the cases where I let the scoring algorithm choose the second word resulted in just one word remaining after the second guess. </p><p>Although this is a very small sample size, the somewhat surprising take-home message of this test is that the extremely simple strategy of always guessing "arose" and "glint" as your first two guesses is highly effective at bringing the number of words down to a low number for the third guess.<br /></p><h2 style="text-align: left;">Other possible investigations</h2><p> The obvious follow-up to this is to automate the process of guessing to the point where many tests could be run to compare how various strategies work. Since there are only 2529 words, it wouldn't even have to be a random sampling exercise -- one could literally try the strategy on every possible word. There are a few complications. One is that I'm not entirely confident about how the script is handling the evaluation of choices where the guess word contains two of the same letter (neither in the correct position) and the word has only one of that letter. I was having some problems with coding that and with uncertainty about how the game would indicate that result. </p><p>The other problem is that it is common near the end of the game for there to be two or even three possible guess words with the same score. So to have the computer fully play out the game would require a random selection. Many human players have been in this situation where there are two possible words that could fit in the final guess and it was necessary to just pick one and hope to get lucky (today, for example, when my last two choices were "spire" and "shire" -- I got unlucky and incorrectly guessed "spire"). There would probably be a more graceful way to handle the situation, but one could just have the script guess at random, then run it enough times for the probabilities to come out in the wash.</p><p>With this kind of automation, one could try many possible combinations of fixed first two words, adjust the base scoring weights to be something other than 0.7 and 0.3, or try out some other strategies I haven't thought of yet.</p><p>However, I've totally burned up my holiday weekend, so if I do this it will have to be later! :-)</p><p><br /></p><p><br /></p><p><br /></p>Steve Baskaufhttp://www.blogger.com/profile/01896499749604153763noreply@blogger.com0tag:blogger.com,1999:blog-5299754536670281996.post-91727894172281455522021-03-18T05:36:00.002-07:002021-03-18T05:53:59.233-07:00Writing your own data to Wikidata using spreadsheets: Part 4 - Downloading existing data<p> This is the fourth part of a series of posts intended to help you manage and upload data to Wikidata using spreadsheets. It assumes that you have already read the first three posts and that you have tried the "do it yourself" experiments to get a better understanding of how the system works. However, it's possible that the only thing that you want to do is to download data from Wikidata into a spreadsheet, and if that is the case, you could get something out of this post without reading the others. 
The script that I will describe does require a configuration file whose structure is described in the second post, so you'll either need to read <a href="http://baskauf.blogspot.com/2021/03/writing-your-own-data-to-wikidata-using_7.html" target="_blank">that post</a> (if you like handholding) or read the more technical specifications <a href="https://github.com/HeardLibrary/linked-data/blob/master/vanderbot/convert-config.md" target="_blank">here</a> to know how to construct or hack that configuration file. </p><p>If you are the kind of person who prefers to just look at the specs and play with software, then skip this whole post and go to the documentation for the download script <a href="https://github.com/HeardLibrary/linked-data/blob/master/vanderbot/acquire_wikidata.md" target="_blank">here</a>. Good luck! Come back and do the walk-through below if you can't figure it out.</p><p>The latter sections of the post do show how to use the downloaded data to carry out easy-to-implement improvements to item records. Those sections depend on an understanding of the earlier posts. So if that kind of editing beyond simply downloading data interests you, then you should read the whole series of posts.</p><h2 style="text-align: left;">Configuring and carrying out the download</h2><p>Towards the end of the last post, I created a JSON <a href="https://gist.github.com/baskaufs/53d24710f65a4a958e9b7ca7cb1f8b43" target="_blank">configuration file</a> for metadata about journal articles. The configuration specified the structure of three CSV files. The first CSV (<span style="font-family: courier;">articles.csv</span>) was intended to have one row per article and contained headers for statements about the following properties: "instance of", DOI, date of publication, English title, journal, volume, page, and issue. The other two CSVs were expected to have multiple rows per article since they contained data about author items (<span style="font-family: courier;">authors.csv</span>) and author name strings (<span style="font-family: courier;">author_strings.csv</span>). Since articles can have one-to-many authors, these two tables could be expected to have zero-to-many rows per article. </p><p>For the purposes of testing the download script, you can just use the JSON configuration file as-is. Download it, name it <span style="font-family: courier;">config.json</span>, and put it in a new directory that you can easily navigate to from your home folder. We are going to specify the group of items to be downloaded by designating a graph pattern, so edit the fourth line of the file using a text editor so that it says</p><p><span style="font-family: courier;"> "item_pattern_file": "graph_pattern.txt",</span></p><p>You can screen for articles using any kind of graph pattern that you know how to write, but if you don't know what to use, you can use this pattern:</p><div style="text-align: left;"><span style="font-family: courier; font-size: x-small;">?person wdt:P1416 wd:Q16849893. # person, affiliation, Vanderbilt Libraries<br /></span><span style="font-family: courier; font-size: x-small;">?qid wdt:P50 ?person. # work, author, person</span></div><p>Copy these two lines and save them in a plain text file called <span style="font-family: courier;">graph_pattern.txt</span> in the same directory as the configuration file. The comments after the hash (<span style="font-family: courier;">#</span>) mark will be ignored, so you can leave them off if you want. 
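</p><p>Before running the download, it can be worth checking that your graph pattern actually matches some works. This is not part of the script; it is just a quick sanity check you can run yourself. The sketch below uses the <span style="font-family: courier;">requests</span> library to wrap the pattern in a counting query and send it to the standard Wikidata Query Service endpoint (the namespace prefixes like <span style="font-family: courier;">wdt:</span> are predefined there, so no PREFIX declarations are needed).</p><pre style="font-family: courier; font-size: small;">
import requests

# Read the same graph pattern file that the download script will use.
with open('graph_pattern.txt', 'rt') as file_object:
    pattern = file_object.read()

# Wrap the pattern in a query that just counts the distinct works (?qid).
query = 'SELECT (COUNT(DISTINCT ?qid) AS ?works) WHERE {\n' + pattern + '\n}'

response = requests.get('https://query.wikidata.org/sparql',
                        params={'query': query},
                        headers={'Accept': 'application/sparql-results+json',
                                 'User-Agent': 'pattern-check-example/0.1'})
data = response.json()
print('works matching the pattern:', data['results']['bindings'][0]['works']['value'])
</pre><p>If the count is zero, fix the pattern before bothering with the download; if it is huge, expect large CSV files.</p><p>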
I chose the first triple pattern (people affiliated with Vanderbilt Libraries) because there is a relatively small number of people involved. You can use some other triple pattern to define the people, but if it designates a large number of people, the file of downloaded journal data may be large. Whatever pattern you use, the variable <span style="font-family: courier;">?qid</span> must be used to designate the works.</p><p>The last thing you need is a copy of the Python script that does the downloading. Go to <a href="https://github.com/HeardLibrary/linked-data/blob/master/vanderbot/acquire_wikidata_metadata.py" target="_blank">this page</a> and download the script into the same directory as the other two files. </p><p>Open your console software (Terminal on Mac or Command Prompt on Windows) and navigate to the directory where you put the files. Enter</p><p><span style="font-family: courier;">python acquire_wikidata_metadata.py</span></p><p>(or <span style="font-family: courier;">python3</span> if your installation requires that).</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjCucW9GJynaU_H7uPXrmcSemM0V_jCRqDf4iBcwB_2MEHY3DOJsQ7pz2hMoFwxyM_RRKcnfaEi0ul7grcUEr1IcusqaYQCKIz2sXevZACVuSowtHkBc8sl8BF7zgrdi2bThP5kZmT0LXU/s524/new_run.png" style="margin-left: 1em; margin-right: 1em;"><img alt="file run screenshot" border="0" data-original-height="330" data-original-width="524" height="405" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjCucW9GJynaU_H7uPXrmcSemM0V_jCRqDf4iBcwB_2MEHY3DOJsQ7pz2hMoFwxyM_RRKcnfaEi0ul7grcUEr1IcusqaYQCKIz2sXevZACVuSowtHkBc8sl8BF7zgrdi2bThP5kZmT0LXU/w640-h405/new_run.png" width="640" /></a></div><p></p><p>The output should be similar to the screenshot above. </p><p><br /></p><h2 style="text-align: left;">Examining the results</h2><p>Start by opening the <span style="font-family: courier;">authors.csv</span> file with your spreadsheet software (LibreOffice Calc recommended, Excel OK). This file should be pretty much as expected. There is a <span style="font-family: courier;">label_en</span> column that is there solely to make it easier to interpret the subject Q IDs -- that column is ignored when the spreadsheet is processed by VanderBot. In this case, every row has a value for the <span style="font-family: courier;">author</span> property because we specified works that had author items in the graph pattern we used to screen the works. </p><p>The <span style="font-family: courier;">author_strings.csv</span> file should also be close to what you expect, although you might be surprised to see that some rows don't have any author strings. Those are cases where all of the authors of that particular work have been associated with Wikidata items. The script always generates at least one row per subject item because it's very generic. It generally leaves a blank cell for every statement property that doesn't have a value in case you want to add it later. Because there is only one statement property in this table, a missing value makes the row seem a bit weird because the whole row is then empty except for the Q ID.</p><p>When you open the <span style="font-family: courier;">articles.csv</span> file, you may be surprised or annoyed to discover that despite what I said about intending for there to be only one row per article, many articles have two or even more rows. Why is this the case? 
If you scroll to the right in the table, you will see that in most, if not all, of the cases of multiple rows there is more than one value for <span style="font-family: courier;">instance of</span>. If we were creating an item, we would probably just say that it's an instance of one kind of thing. But there is no rule saying that an item in Wikidata can't be an instance of more that one class. You might think that the article is a <span style="font-family: courier;">scholarly article</span> (<span style="font-family: courier;">Q13442814</span>) and I may think it's an <span style="font-family: courier;">academic journal article</span> (<span style="font-family: courier;">Q18918145</span>) and there is nothing to stop us from both making our assertions. </p><p>The underlying reason why we get these multiple rows is because we are using a SPARQL query to retrieve the results. We will see why in the next section. The situation would be even worse if there were more than one property with multiple values. If there were 3 values of <span style="font-family: courier;">instance of </span>for the item and 2 values for the <span style="font-family: courier;">published</span> date, we would get rows with every combination of the two, and end up with 3x2=6 rows for that article. That's unlikely, since I took care to select properties that (other than <span style="font-family: courier;">instance of</span>) are supposed to only have a single value. But sometimes single-value properties are mistakenly given several values and we end up with a proliferation of rows.</p><p><br /></p><h2 style="text-align: left;">An aside on SPARQL </h2><p>It is not really necessary for you to understand anything about SPARQL to use this script, but if you are interested in understanding this "multiplier" phenomenon, you can read this section. Otherwise, skip to the next section.</p><p>Let's start by looking at the page of Carl H. Johnson, a researcher at Vanderbilt (<a href="https://www.wikidata.org/wiki/Q28530058" target="_blank">Q28530058</a>). </p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiCwhRhiQTLmdSf9Nq14e2qyggozFhjLMwJYvRTllUjqnPbyYD6SpLfEfoqwb_RvmWL2IHME_cWXrwB9isyDsGSHfuHXVxJCnL_j30UZsswHiazcngNYA1J1G-cjKmB4o_J34WP1EDqR50/s880/carl.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Item record for Carl H. Johnson" border="0" data-original-height="880" data-original-width="703" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiCwhRhiQTLmdSf9Nq14e2qyggozFhjLMwJYvRTllUjqnPbyYD6SpLfEfoqwb_RvmWL2IHME_cWXrwB9isyDsGSHfuHXVxJCnL_j30UZsswHiazcngNYA1J1G-cjKmB4o_J34WP1EDqR50/w512-h640/carl.png" width="512" /></a></div><br /><p>As I'm writing this (2021-03-16), we can see that Carl is listed as having two occupations: biologist and researcher. That is true, his work involves both of those things. He is also listed as having been educated at UT Austin and Stanford. That is also true, he went to UT Austin as an undergrad and Stanford as a grad student. We can carry out the following SPARQL query to ask about Carl's occupation and education. 
</p><div style="text-align: left;"><span style="font-family: courier; font-size: xx-small;">select distinct ?item ?itemLabel ?occupation ?occupationLabel ?educatedAt ?educatedAtLabel {<br /> ?item wdt:P106 ?occupation.<br /> ?item wdt:P69 ?educatedAt.<br /> BIND(wd:Q28530058 AS ?item)<br /> SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }<br /> }</span></div><p>You can run the query yourself <a href="https://w.wiki/36ex" target="_blank">here</a>, although if the information about Carl has changed since I wrote this post, you could get different results.</p><p>The last line of the query ("<span style="font-family: courier;">SERVICE...</span>") is some "magic" that the Wikidata Query service does to automatically generate labels for variables. If you are asking about a variable named "<span style="font-family: courier;">?x</span>" and you also ask about the variable "<span style="font-family: courier;">?xLabel</span>", with the "magic" line the Query Service will automatically generate "<span style="font-family: courier;">?xLabel</span>" for you even if you don't define it as part of the graph pattern. I've used this method to generate labels for the three variables I'm asking about in the first line: <span style="font-family: courier;">?item</span>, <span style="font-family: courier;">?occupation</span>, and <span style="font-family: courier;">?educatedAt</span>.</p><p>The second and third lines of the query:</p><div style="text-align: left;"><span style="font-family: courier;"> ?item wdt:P106 ?occupation.<br /></span><span style="font-family: courier;"> ?item wdt:P69 ?educatedAt.</span></div><p>are the ones that are actually important. They restrict the value of the variable <span style="font-family: courier;">?occupation</span> to be the <span style="font-family: courier;">occupation</span> (<span style="font-family: courier;">P106</span>) of the item and restrict the value of the variable <span style="font-family: courier;">?educatedAt</span> to be the place where the item was <span style="font-family: courier;">educated at</span> (<span style="font-family: courier;">P69</span>). </p><p>The fourth line just forces the item to be <span style="font-family: courier;">Carl H. Johnson</span> (<span style="font-family: courier;">Q28530058</span>). If we left that line out, we would get thousands or millions of results: anyone in Wikidata who had an occupation and was educated. (Actually, the query would probably time out. Try it and see what happens if you want.)</p><p>So here's what happens when we run the query:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEio8ncFpXYjL2MoNQD8kzG-uXbMKizUGO0kXyiyfN1-fHrE1nKtwF2V3DTWvul4cPj4V5z5OUyCKMbU691TIMVbw79Ah_41BWVk-M9W01BRyQHZTJbj5_foUdaxWmcbe1oFVn7_6fAKj_s/s1316/carl_query.png" style="margin-left: 1em; margin-right: 1em;"><img alt="SPARQL query results" border="0" data-original-height="202" data-original-width="1316" height="98" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEio8ncFpXYjL2MoNQD8kzG-uXbMKizUGO0kXyiyfN1-fHrE1nKtwF2V3DTWvul4cPj4V5z5OUyCKMbU691TIMVbw79Ah_41BWVk-M9W01BRyQHZTJbj5_foUdaxWmcbe1oFVn7_6fAKj_s/w640-h98/carl_query.png" width="640" /></a></div><p>When we carry out a SPARQL query, we are asking the question: what combinations of variable values satisfy the graph pattern that we have specified? The query binds those combinations to the variables and then displays the combinations in the results. 
If we think about these combinations, we can see they all satisfy the pattern that we required: </p><div style="text-align: left;"><ul style="text-align: left;"><li>Carl has occupation researcher and went to school at UT Austin. </li><li>Carl has occupation researcher and went to school at Stanford.</li><li>Carl has occupation biologist and went to school at UT Austin. </li><li>Carl has occupation biologist and went to school at Stanford.</li></ul></div><p>There are four ways we can bind values to the variables that are true and satisfy the pattern we required. We cannot ask SPARQL to read our minds and guess that there was a special combination that we intended, or that there was one combination that we were more interested in than another.</p><p>This behavior sometimes produces results for SPARQL queries that seem unexpected because you get more results than you intend. But if you ask yourself what you really required in your graph pattern, you can usually figure out why you got the result that you did.</p><p><br /></p><h2 style="text-align: left;">Restricting the combinations of values in the table</h2><p style="text-align: left;">If you paid close attention to the output of the script, you will have noticed that for each of the three CSVs it said that there were no pre-existing CSVs. After the script runs the SPARQL query to collect the data from Wikidata, it tries to open the files. If it can't open the files, it creates new files and saves all of the combinations of values that it found. However, if the files already exist, it compares the data from the query to the data already in the CSV and ignores combinations of values that don't match what's already there.</p><p style="text-align: left;">That means that if we are annoyed about all of the possible combinations of values initially written to the table, we can delete lines that contain combinations that we don't care about. For example, if one row of the table says that the article is a <span style="font-family: courier;">scholarly article</span> (<span style="font-family: courier;">Q13442814</span>) and another says it's an <span style="font-family: courier;">academic journal article</span> (<span style="font-family: courier;">Q18918145</span>), I can delete the <span style="font-family: courier;">scholarly article</span> row and only pay attention to the row containing the statement that it is an <span style="font-family: courier;">academic journal article</span>. In future downloads, the <span style="font-family: courier;">scholarly article</span> assertion will be ignored by the script. It is a pain to have to manually delete the duplicate lines, but once you've done so, you shouldn't have to do it again if you try downloading more data later. You will have only a single line to deal with per existing item.</p><p style="text-align: left;">The situation is actually a bit more complicated than I just described. If you are interested in the details of how the script screens combinations of variables that it gets from the Query Service, you can look at the comments in the <a href="https://github.com/HeardLibrary/linked-data/blob/master/vanderbot/acquire_wikidata_metadata.py#L510" target="_blank">script starting in line 510</a>. If not, you can assume that the script does the best it can and usually does the screening pretty well. It is still good to examine the CSV visually after doing a fresh download to make sure nothing weird happened. 
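</p><p style="text-align: left;">If you want a feel for the flavor of that screening, here is a rough sketch of the idea. It is not the actual logic in <span style="font-family: courier;">acquire_wikidata_metadata.py</span>, and the <span style="font-family: courier;">qid</span> and <span style="font-family: courier;">instance_of</span> column names are just stand-ins: downloaded rows for new Q IDs are kept as-is, while for Q IDs already in the local CSV only the combinations you have already accepted survive.</p><pre style="font-family: courier; font-size: small;">
import pandas as pd

def screen_combinations(existing, downloaded, value_columns):
    # Keep downloaded rows for new Q IDs, but for Q IDs already in the local CSV
    # keep only the combinations of values that the local CSV already contains.
    new_items = downloaded[~downloaded['qid'].isin(existing['qid'])]
    # An inner merge keeps only downloaded combinations that match an existing row.
    matching = downloaded.merge(existing[['qid'] + value_columns],
                                on=['qid'] + value_columns, how='inner')
    return pd.concat([new_items, matching], ignore_index=True)

# Toy example: the local CSV kept only the "academic journal article" row for Q1,
# so the downloaded "scholarly article" combination for Q1 is ignored.
existing = pd.DataFrame({'qid': ['Q1'], 'instance_of': ['Q18918145']})
downloaded = pd.DataFrame({'qid': ['Q1', 'Q1', 'Q2'],
                           'instance_of': ['Q13442814', 'Q18918145', 'Q13442814']})
print(screen_combinations(existing, downloaded, ['instance_of']))
</pre><p style="text-align: left;">The real script also has to deal with statement and reference identifiers, which is why its screening code is longer, but the basic comparison is the same kind of row matching.</p><p>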
</p><div style="text-align: left;"><br /></div><h2 style="text-align: left;">Repeating the download to pick up new data</h2><p style="text-align: left;">Based on what I said in the previous section, you should have noticed that you can run this script repeatedly if you want to pick up new data that has been added to the items since the last time you ran the script. That means that if you are adding data to the downloaded CSV as a way to make additions to Wikidata via the API, you first can check for updated information to make sure that you don't accidentally add duplicate statements when you use a stale CSV. VanderBot assumes that when you fill in empty cells, that represents new data to be written. So you or someone else has actually made the same new statement using the Web interface since the last time you ran VanderBot to do an upload, you risk creating duplicate statements. </p><p style="text-align: left;">There are several important things that you need to keep in mind about updating existing CSVs prior to adding and uploading new data:</p><p style="text-align: left;">1. The screening to remove duplicate rows is only done for preexisting items. Any new items downloaded in the update will need to be visually screened by a human for duplicates. It doesn't really hurt anything if you leave the duplicate rows -- all of their statements and references are preexisting and will have identifiers, so VanderBot will ignore them anyway. But you will probably eventually want to clean up the duplicates to make the spreadsheet easier to use in the future.</p><p style="text-align: left;">2. If there is only a single combination of values for an item (i.e. only a single row), the script will <b>automatically</b> replace any changed values with the new ones regardless of the preexisting state of that row. The screening of rows against the existing spreadsheet only happens when there are two rows with the same Q ID. So if somebody changed the wonderful, correct value that you have in your spreadsheet to something icky and wrong, running the update will change your local copy of the data in the spreadsheet to their icky and wrong value. On the other hand, if they have fixed an error and turned your data into wonderful, correct data, that will be changed in your local copy as well. The point here is that the script is dumb and cannot tell the difference between vandalism and crowd-sourced improvements of your local data. It just always updates your local data when there aren't duplicate rows. In a later post, we will talk about a system to detect changes before downloading them, so that you can make a decision about whether to allow the change to be made to your local copy of the data (i.e. the CSV).</p><p style="text-align: left;">3. If you have enabled VanderBot to write the labels and descriptions of existing items (using the <span style="font-family: courier;">--update</span> option), it is <b>very important</b> that you download fresh values prior to using a stale CSV for writing data to the API. If you do not, then you will effectively revert any recently changed labels and descriptions back to whatever they were the last time you downloaded data or wrote to the API with that CSV. That would be extremely irritating to anyone (including YOU) who put a lot of work into improving labels and descriptions using the web interface and then had them all changed by the VanderBot script back to what they were before. 
So be careful!</p><p style="text-align: left;"><br /></p><h2 style="text-align: left;">Making additions to the downloaded data and pushing them to the API: Low hanging fruit</h2><p style="text-align: left;">If you use this script to download existing data, it will not take you very long to realize that a lot of the data in Wikidata is pretty terrible. There are several common ways that data in Wikidata are terrible, and VanderBot can help you to improve the data with a lot less effort than doing a lot of manual editing using the web interface.</p><h3 style="text-align: left;">Changing a lot of descriptions at once</h3><p style="text-align: left;">Many items in Wikidata were created by bots that had limited information about the items, and limited abilities to collect the information they were missing. The end result is that descriptions are often very poor. A very common example is describing a person as a "researcher". I believe that this happens because the person is an author of a research article, and since the bot knows nothing else about the person, it describes them as a "researcher". Since we are screening by a SPARQL query that establishes some criteria about the items, that criterion often will allow us to provide a better description. For example, if we are screening people by requiring that they be faculty in a chemistry department, and who have published academic research articles, we can safely improve their descriptions by calling them "chemistry researchers".</p><p style="text-align: left;"><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhr_f5K4uAyF-cZF8nTdLxeity_kWauUWfWT0Xq7-Dok5F3dLjRTyqXiSUT0ydC86VpNHmhxBbcjb1ZDPB2QO1u9CRzj8WIAoNHWG3iwA-6VjRYtt9fSRP1vcZiCd6-e1dn5vj0L_Gr-8/s1392/missing-description.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="498" data-original-width="1392" height="228" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhr_f5K4uAyF-cZF8nTdLxeity_kWauUWfWT0Xq7-Dok5F3dLjRTyqXiSUT0ydC86VpNHmhxBbcjb1ZDPB2QO1u9CRzj8WIAoNHWG3iwA-6VjRYtt9fSRP1vcZiCd6-e1dn5vj0L_Gr-8/w640-h228/missing-description.png" title="missing descriptions in a CSV table" width="640" /></a></div><br /><p style="text-align: left;">In the case of our example, there is an even more obvious problem: many of the items have no description at all. There is a very easy solution, since we have the <span style="font-family: courier;">instance of</span> (<span style="font-family: courier;">P31</span>) information about the items. 
All of the items in the screenshot above that are missing descriptions are instances of <span style="font-family: courier;">Q13442814</span>.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgeDDGDqE9LbnlYJ-9gCreJoveBsRdMIMCQIFAieH5rrd4It3H3LnBHmz44lDuxKZhev3aRrlNWGWcsc6fTMmhq5E2vOQ8nlC4Ayg5eBCqkImYtiXbeTLmcQfAguyNY9bU0LuGi0tMcAm4/s735/p31-summary.png" style="margin-left: 1em; margin-right: 1em;"><img alt="summary of instance types" border="0" data-original-height="735" data-original-width="487" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgeDDGDqE9LbnlYJ-9gCreJoveBsRdMIMCQIFAieH5rrd4It3H3LnBHmz44lDuxKZhev3aRrlNWGWcsc6fTMmhq5E2vOQ8nlC4Ayg5eBCqkImYtiXbeTLmcQfAguyNY9bU0LuGi0tMcAm4/w265-h400/p31-summary.png" width="265" /></a></div><p style="text-align: left;">I used <a href="https://github.com/HeardLibrary/linked-data/blob/master/vanderbot/count_entities.py" target="_blank">the script</a> (discussed in <a href="http://baskauf.blogspot.com/2021/03/writing-your-own-data-to-wikidata-using_11.html" target="_blank">the previous post</a>) for downloading information about values of statement properties to summarize all of the <span style="font-family: courier;">P31</span> values for the items in this group. So I know that all of these items with missing descriptions are instances of <span style="font-family: courier;">scholarly article</span>. There may be better descriptions for those items, but at this point "scholarly article" is a much better description than no description at all.</p><p style="text-align: left;">One might argue that we aren't actually adding information to the items, given that our proposed description is simply re-stating the <span style="font-family: courier;">P31</span> value. That may be true, but the description is important because it shows up in the search results for an item, and the <span style="font-family: courier;">P31</span> value does not. In Wikidata, descriptions also play an important role in disambiguating items that have the same label, so it's best for all items to have a description. </p><p style="text-align: left;">I am going to fix all of these descriptions at once by simply pasting the text "scholarly article" in the description column for these items, then running VanderBot with the <span style="font-family: courier;">--update</span> option set to <span style="font-family: courier;">allow</span>. If you have not read the earlier posts, be aware that prior to writing to the API, you will need to create a metadata description file for the CSVs (discussed in the <a href="http://baskauf.blogspot.com/2021/03/writing-your-own-data-to-wikidata-using_7.html" target="_blank">second post</a>) and also download a copy of VanderBot from <a href="https://github.com/HeardLibrary/linked-data/blob/master/vanderbot/vanderbot.py" target="_blank">here</a> into the directory where the CSV files are located. 
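</p><p style="text-align: left;">If pasting by hand ever becomes tedious, the same fill can be scripted. Here is a minimal sketch, assuming the pandas library is installed; the file name and the description column header are placeholders, so substitute whatever headers your own CSV actually uses:</p><div style="background-color: black; color: white; font-family: monospace; font-size: 12px; line-height: 18px; white-space: pre;">import pandas as pd

# Minimal sketch: put "scholarly article" into every empty description cell.
# "items.csv" and the column name "description_en" are illustrative placeholders.
items = pd.read_csv("items.csv", na_filter=False)   # keep empty cells as ""
blank = items["description_en"] == ""
items.loc[blank, "description_en"] = "scholarly article"
items.to_csv("items.csv", index=False)</div><p style="text-align: left;">Whether you fill the cells by hand or with a script like this, the upload step is the same. 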
Run the API upload script using the command</p><p><span style="font-family: courier;">python vanderbot.py --update allow --log log.txt</span></p><div><span style="font-family: courier;"><br /></span></div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg-BVq_e3Mvk8OJdXreDr7vA1FKaZy00x73CC3qjT37qlbptUt2HyKBj-mMzYy7vTSS7u8tsNAF822rxP_u5acfBhSVa5KVefFDCUPNdvhqeLL6dUuUL7Ds0KZbol-oHX5cMtbKm0est2Q/s1286/added_descriptions.png" style="margin-left: 1em; margin-right: 1em;"><img alt="CSV with added descriptions" border="0" data-original-height="499" data-original-width="1286" height="248" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg-BVq_e3Mvk8OJdXreDr7vA1FKaZy00x73CC3qjT37qlbptUt2HyKBj-mMzYy7vTSS7u8tsNAF822rxP_u5acfBhSVa5KVefFDCUPNdvhqeLL6dUuUL7Ds0KZbol-oHX5cMtbKm0est2Q/w640-h248/added_descriptions.png" width="640" /></a></div><p style="text-align: left;">After it finished running, I deleted the three CSV files. After a little while I ran the download script again to see how things in Wikidata had changed. The results are in the screenshot above. (Note: there is a delay between when data are written to the API and when they are available at the Query Service, so the changes won't necessarily show up immediately. It can take anywhere from a few seconds to an hour for the changes to be transferred to the Query Service.) I was able to improve the descriptions of 23 items with about 30 seconds of copying and pasting. That would have probably taken me at least 10 or 15 minutes if I had looked up each item using the web interface and entered those descriptions manually.</p><h3>Changing a lot of labels at once</h3><p style="text-align: left;">Another situation where I can make a large number of improvements with very little effort is adding labels in other languages to items for people. Most names of people will be represented in the same way across all languages that use the Latin character set. So I can easily improve label coverage in non-English languages by just copying the English names and using them as labels in the other languages. This would be extremely labor-intensive if you had to look up each item and do the copying and pasting one item at a time. However, when the labels are in spreadsheet form, I can easily copy an entire column and paste it into another column.</p><p style="text-align: left;">In our <a href="https://www.wikidata.org/wiki/Wikidata:WikiProject_Vanderbilt_Fine_Arts_Gallery" target="_blank">Vanderbilt Fine Arts Gallery WikiProject</a>, we put in a lot of work on either disambiguating artist name strings against Wikidata items, or creating new items for artists that weren't already there. As a result, we now have a list of 1325 artists whose works are included in the gallery collection. I can use that list of 1325 Q IDs as a way to define a category of items to be included in a download using the <span style="font-family: courier;">acquire_wikidata_metadata.py</span> script. 
To set up the download, I created a <span style="font-family: courier;">config.json</span> file containing this JSON:</p><div style="text-align: left;"><div style="background-color: black; color: white; font-family: Menlo, Monaco, "Courier New", monospace; font-size: 12px; line-height: 18px; white-space: pre;"><div>{</div><div> <span style="color: #d4d4d4;">"data_path"</span>: <span style="color: #ce9178;">""</span>,</div><div> <span style="color: #d4d4d4;">"item_source_csv"</span>: <span style="color: #ce9178;">"creators.csv"</span>,</div><div> <span style="color: #d4d4d4;">"item_pattern_file"</span>: <span style="color: #ce9178;">""</span>,</div><div> <span style="color: #d4d4d4;">"outfiles"</span>: [</div><div> {</div><div> <span style="color: #d4d4d4;">"manage_descriptions"</span>: <span style="color: #569cd6;">true</span>,</div><div> <span style="color: #d4d4d4;">"label_description_language_list"</span>: [</div><div> <span style="color: #ce9178;">"en"</span>,</div><div> <span style="color: #ce9178;">"es"</span>,</div><div> <span style="color: #ce9178;">"pt"</span>,</div><div> <span style="color: #ce9178;">"fr"</span>,</div><div> <span style="color: #ce9178;">"it"</span>,</div><div> <span style="color: #ce9178;">"nl"</span>,</div><div> <span style="color: #ce9178;">"de"</span>,</div><div> <span style="color: #ce9178;">"da"</span>,</div><div> <span style="color: #ce9178;">"et"</span>,</div><div> <span style="color: #ce9178;">"hu"</span>,</div><div> <span style="color: #ce9178;">"ga"</span>,</div><div> <span style="color: #ce9178;">"ro"</span>,</div><div> <span style="color: #ce9178;">"sk"</span>,</div><div> <span style="color: #ce9178;">"sl"</span>,</div><div> <span style="color: #ce9178;">"zu"</span>,</div><div> <span style="color: #ce9178;">"tr"</span>,</div><div> <span style="color: #ce9178;">"sv"</span></div><div> ],</div><div> <span style="color: #d4d4d4;">"output_file_name"</span>: <span style="color: #ce9178;">"creators_out.csv"</span>,</div><div> <span style="color: #d4d4d4;">"prop_list"</span>: [</div><div> ]</div><div> }</div><div> ]</div><div>}</div></div></div><p style="text-align: left;">As you can see, I'm not concerned with any properties of the works items. I've simply listed the language codes for many languages that primarily use the Latin character set. The <span style="font-family: courier;">creators.csv</span> file is my spreadsheet with the 1325 item identifiers in a column named <span style="font-family: courier;">qid</span>. It defines the set of items I'm interested in. After running the <span style="font-family: courier;">acquire_wikidata_metadata.py</span> script, the <span style="font-family: courier;">creators_out.csv</span> spreadsheet looked like this:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRP0_oSCo861cri_dfCTzSymd4THNOqhWqXUsgtoz6VzrT2TQKPglwSB_KHMq28EM4n9BC4GIppY1DpVIJYew7AX7cHfUoVDmnRzj1Zbe2aCq65JNvDZ7N6H-q1WCwWX0EVfX-fk7A4yI/s1379/artists.png" style="margin-left: 1em; margin-right: 1em;"><img alt="CSV list of artist items in Wikidata" border="0" data-original-height="949" data-original-width="1379" height="440" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRP0_oSCo861cri_dfCTzSymd4THNOqhWqXUsgtoz6VzrT2TQKPglwSB_KHMq28EM4n9BC4GIppY1DpVIJYew7AX7cHfUoVDmnRzj1Zbe2aCq65JNvDZ7N6H-q1WCwWX0EVfX-fk7A4yI/w640-h440/artists.png" width="640" /></a></div><p style="text-align: left;">There are several things worth noting. 
In most cases, when the label is available in non-English languages, it's exactly the same as the English label. This confirms my assertion that it's probably fine to just re-use the "English" names as labels in the other languages. There are a couple of exceptions. Buckminster Fuller has variation in his labels because "Buckminster" is apparently his middle name. So I'm going to mostly leave that row alone -- he's famous enough that he's represented in most languages anyway. The Haverford Painter's name isn't really a name. It's more of a description applied as a label and it does vary from language to language. I'll just delete that row since I have no idea how to translate "Haverford Painter" into most of the languages. </p><p style="text-align: left;">The other interesting thing is that most of the names are represented in Dutch already. The reason is that there is a bot called <a href="https://www.wikidata.org/wiki/User:Edoderoobot" target="_blank">Edoderoobot</a> which, among other things, automatically adds English people name labels as Dutch labels (see <a href="https://www.wikidata.org/w/index.php?title=Q100243501&diff=1288735788&oldid=1288701583" target="_blank">this edit</a> for example). There are only a few missing Dutch labels to fill in. So I definitely should not just copy the entire English column of labels and paste it into the Dutch column.</p><p style="text-align: left;">Since the rows of the CSV are in alphabetical order by Q ID, the top of the spreadsheet contains mostly newer items with Q IDs over 100 million. In the lower part of the sheet, where the Q IDs less than 100 million are located, there are a lot more well-known artists that have labels in more languages. It would take more time than I want to spend right now to scrutinize the existing labels to see if it's safe to paste over them. So for now I'll limit my copying and pasting to the top of the spreadsheet. </p><p>After pasting the English labels into all of the other columns and filling in the few missing Dutch labels (a scripted version of this copying is sketched below), I'm ready to write the new labels to Wikidata. I needed to run the <span style="font-family: courier;">convert_json_to_metadata_schema.py</span> script to generate the metadata description file that VanderBot needs to understand the new <span style="font-family: courier;">creators_out.csv</span> spreadsheet I've just created and edited (see my <a href="http://baskauf.blogspot.com/2021/03/writing-your-own-data-to-wikidata-using_7.html" target="_blank">second post</a> if you don't know about that). I'm now ready to run VanderBot using the same command I used earlier.</p><p>Using this method, I was able to add approximately 3500 multilingual labels with only about 30 seconds spent copying and pasting columns in the spreadsheet and about 10 minutes for VanderBot to write the new labels to the API. I can't even imagine how long that would take to do manually. </p><p>One nice thing is that there is only one interaction with the API per item, regardless of the number of different languages of labels that are changed. Since most of the time that VanderBot takes to do the writing is actually just sleeping 1.25 seconds per item (to avoid exceeding the maximum writing rate for bots without a bot flag), it's important to bundle as many data items per API interaction as possible. 
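</p><p>For a much larger spreadsheet, the copying itself could also be scripted rather than done by hand. Here is a rough sketch, assuming the pandas library is installed; the label column headers are placeholders, so match them to whatever headers <span style="font-family: courier;">acquire_wikidata_metadata.py</span> actually put in <span style="font-family: courier;">creators_out.csv</span>. It only fills cells that are empty, so existing labels -- like the Dutch ones added by Edoderoobot -- are left alone:</p><div style="background-color: black; color: white; font-family: monospace; font-size: 12px; line-height: 18px; white-space: pre;">import pandas as pd

# Rough sketch: copy the English label into the other Latin-script label
# columns, but only where no label is present yet.
# The column names below are illustrative placeholders.
creators = pd.read_csv("creators_out.csv", na_filter=False)  # keep empty cells as ""

other_label_columns = ["label_es", "label_pt", "label_fr", "label_it", "label_nl"]
for column in other_label_columns:
    empty = creators[column] == ""
    creators.loc[empty, column] = creators.loc[empty, "label_en"]

creators.to_csv("creators_out.csv", index=False)</div><p>After checking the result in a spreadsheet program, the edited file can be written to the API with VanderBot exactly as before. 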
</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUOGXFe3gPxImkPMLz1mV1oY25lpwbs_rUHpN7qVunAP53HRna87GEm-awjc1b1QD5Rmt7yfotu8oL6xsSOoOYHF1XTETQjZ3jBAt2AP4ec_aJzlHfwV-Ztzlj-sUAx7QzvYgoJtQgX74/s933/marvin-bradley-page.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Wikidata page showing added language labels" border="0" data-original-height="933" data-original-width="867" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUOGXFe3gPxImkPMLz1mV1oY25lpwbs_rUHpN7qVunAP53HRna87GEm-awjc1b1QD5Rmt7yfotu8oL6xsSOoOYHF1XTETQjZ3jBAt2AP4ec_aJzlHfwV-Ztzlj-sUAx7QzvYgoJtQgX74/w594-h640/marvin-bradley-page.png" width="594" /></a></div><p>When I check one of the artist's pages, I see now that it has labels in many languages instead of only English.</p><p>Although it would be more labor-intensive, the same process could be used for adding labels in non-Latin character sets. A native speaker could simply go down the rows and type in the labels in Chinese characters, Cyrillic, Greek, Arabic, or any other non-Latin character set in an appropriate column and run the script to add those labels as well.</p><h3 style="text-align: left;">Adding multiple references to an item at once</h3><p>Despite the importance of references to ensuring the reliability of Wikidata, many (most?) statements do not have them. That's understandable when humans create the statements, since including references is time consuming (although less so if you use some of the <a href="https://www.wikidata.org/wiki/Special:Preferences#mw-prefsection-gadgets" target="_blank">gadgets that are available</a> to streamline the process, like <span style="font-family: courier;">currentDate</span> and <span style="font-family: courier;">DuplicateReferences</span>). For bots, it's inexcusable. Most bots are getting their data automatically from some data source and they know what that data source is, so there is no reason for them to not add references other than the laziness of their developers. </p><p>We can't fix other people's bad behavior, but we can fix their missing references with minimal work if we have an easy way to acquire the information. </p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSpufX3KCJp6ljWhrGBIyqOwL8P9wHv7gcSBtAvWXtPKJVejDfXBQv8bzFgJ6_as1mRhkvbEVvJG1kd_aj1vQ34ejTgYCh8jZWMlFBlYcZpeayANNFW_vAaXvr-Uh33uNnCFiiG4Ls1Vg/s1191/no-refs.png" style="margin-left: 1em; margin-right: 1em;"><img alt="example of item with few references" border="0" data-original-height="1191" data-original-width="1017" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSpufX3KCJp6ljWhrGBIyqOwL8P9wHv7gcSBtAvWXtPKJVejDfXBQv8bzFgJ6_as1mRhkvbEVvJG1kd_aj1vQ34ejTgYCh8jZWMlFBlYcZpeayANNFW_vAaXvr-Uh33uNnCFiiG4Ls1Vg/w546-h640/no-refs.png" width="546" /></a></div><br /><p><a href="https://www.wikidata.org/wiki/Q44943965" target="_blank">Q44943965</a> is an article that was created using QuickStatements. Some of the data about the item was curated manually and those statements have references. But most of the bot-created statements don't have any references and I'm too lazy to add them manually. Luckily, the article has a DOI statement near the bottom, so all I need to do is to click on it to verify that the information exists for the statements with missing references. 
As a reference URL, I'm going to use the HTTPS form of the DOI, <a href="https://doi.org/10.3233/SW-150203">https://doi.org/10.3233/SW-150203</a>, which a human can click on to see the evidence that supports the statement. </p><p>This publication was in the CSVs from the first example in this post, so prior to writing the references, I deleted the CSVs and used the <span style="font-family: courier;">acquire_wikidata_metadata.py</span> script to download a fresh copy of the data.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUQvtmGInTLSTazv3rDxNycye-7Ql3djvJzFJJXpv0KDOLjtBcKcGwEtR-0_mbc-uhXdBk_nLXjqIX6f9VHJzSPrHEM0GiX5Uu1gVBDyfmKQrRtcGotAnE_8MCZrRDBHuKDXjl1AmB6jg/s1199/add-references.png" style="margin-left: 1em; margin-right: 1em;"><img alt="CSV showing added references" border="0" data-original-height="155" data-original-width="1199" height="82" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUQvtmGInTLSTazv3rDxNycye-7Ql3djvJzFJJXpv0KDOLjtBcKcGwEtR-0_mbc-uhXdBk_nLXjqIX6f9VHJzSPrHEM0GiX5Uu1gVBDyfmKQrRtcGotAnE_8MCZrRDBHuKDXjl1AmB6jg/w640-h82/add-references.png" width="640" /></a></div><p>I highlighted the row to make it easier to see and pasted the DOI URL into the <span style="font-family: courier;">doi_ref1_referenceUrl</span> column. I typed today's date into the <span style="font-family: courier;">doi_ref1_retrieved_val</span> column in the required format: <span style="font-family: courier;">2021-03-17</span>. </p><p>To create the references for the other statements, I just needed to copy the DOI URL into all of the columns whose names end in <span style="font-family: courier;">_ref1_referenceUrl</span> and today's date into all of the columns that end in <span style="font-family: courier;">_ref1_retrieved_val</span>. (A scripted version of this copying is sketched below.) </p><p>Once I finished that, I saved the CSV and ran VanderBot (I already had the metadata description file from earlier work). I saved the output into a log file so that I could look at it later.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjwNvetlDOphBWWRwnQyI6FpyuG86AQY6fvfKMOTXkwaIv4PuuVyLTT-pWJ86ZiG23c_G7rIB3DipeKe6Fsk8Xl1-gIupb4uvwNOB-7YVnwl6MfsO0lEr1QNyfrqfkAOYqUDtpeQU7vLCM/s1052/reference-log.png" style="margin-left: 1em; margin-right: 1em;"><img alt="log showing added references" border="0" data-original-height="667" data-original-width="1052" height="406" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjwNvetlDOphBWWRwnQyI6FpyuG86AQY6fvfKMOTXkwaIv4PuuVyLTT-pWJ86ZiG23c_G7rIB3DipeKe6Fsk8Xl1-gIupb4uvwNOB-7YVnwl6MfsO0lEr1QNyfrqfkAOYqUDtpeQU7vLCM/w640-h406/reference-log.png" width="640" /></a></div><p>When VanderBot processes a CSV, it first writes any new items. It then runs a second check on the spreadsheet to find any items where statements already exist (indicated by the presence of a value in the <span style="font-family: courier;">_uuid</span> column for that property), but where references have NOT been written (indicated by the absence of a <span style="font-family: courier;">_ref1_hash</span> value). Scrolling through the log file, I saw that there was "no data to write" for any of the statements. In the "Writing references of existing claims" section (screenshot above), I saw the seven new references I created for <a href="https://www.wikidata.org/wiki/Q44943965" target="_blank">Q44943965</a>. 
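</p><p>The suffix-based copying described above is also easy to script. Here is a rough sketch, assuming the pandas library is installed; the file name and the <span style="font-family: courier;">qid</span> column header are placeholders for whatever your CSV actually uses, while the <span style="font-family: courier;">_ref1_referenceUrl</span> and <span style="font-family: courier;">_ref1_retrieved_val</span> suffixes are the ones in this example's CSV:</p><div style="background-color: black; color: white; font-family: monospace; font-size: 12px; line-height: 18px; white-space: pre;">import datetime
import pandas as pd

# Rough sketch: for one item's row, put the DOI URL into every column whose
# name ends in _ref1_referenceUrl and today's date into every column whose
# name ends in _ref1_retrieved_val.
# "articles.csv" and the "qid" column name are illustrative placeholders.
works = pd.read_csv("articles.csv", na_filter=False)
doi_url = "https://doi.org/10.3233/SW-150203"
today = datetime.date.today().isoformat()  # e.g. 2021-03-17

row = works["qid"] == "Q44943965"
for column in works.columns:
    if column.endswith("_ref1_referenceUrl"):
        works.loc[row, column] = doi_url
    elif column.endswith("_ref1_retrieved_val"):
        works.loc[row, column] = today

works.to_csv("articles.csv", index=False)</div><p>For a single item it was faster to paste the two values by hand, but for a large batch of items that all came from the same source, a loop like this would save a lot of time. 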
</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9kZaQCr0OKHrdqoV5lpOdTIU43xl2GQAgzFFmdN3m77-q5w4h8I4MessutZezTcGku4wfsKyAZBt1ZD-zGiQv_LQgWWgWbZJ5GV_3JmEQm42wCCYR9vAU5f8y18NQ4G9MmLw4E9aD9vU/s1153/item-with-refs.png" style="margin-left: 1em; margin-right: 1em;"><img alt="item page with added references" border="0" data-original-height="1153" data-original-width="840" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9kZaQCr0OKHrdqoV5lpOdTIU43xl2GQAgzFFmdN3m77-q5w4h8I4MessutZezTcGku4wfsKyAZBt1ZD-zGiQv_LQgWWgWbZJ5GV_3JmEQm42wCCYR9vAU5f8y18NQ4G9MmLw4E9aD9vU/w466-h640/item-with-refs.png" width="466" /></a></div><br /><p>Checking the item page again, I see that all of the statements now have references!</p><p>This is more labor-intensive than making the label changes that I demonstrated in the previous example, but if all of the items in a spreadsheet were derived from the same source, then copying and pasting all the way down the <span style="font-family: courier;">_ref1_referenceURL</span> and <span style="font-family: courier;">_ref1_retrieved_val </span>columns would be really fast. In this case, it was not particularly fast, since I had to look up the DOI, then copy and paste the URL and date manually for each different item. However, since DOI data from CrossRef are machine-readable (via their API, see <a href="https://github.com/CrossRef/rest-api-doc">https://github.com/CrossRef/rest-api-doc</a>), it won't be that hard to script the lookup in Python and have the script add all of the references to the CSV. I may write a post showing how to do that sometime in the future.</p><h2 style="text-align: left;">Conclusion</h2><p>The script that downloads existing data from Wikidata into a CSV (<a href="https://github.com/HeardLibrary/linked-data/blob/master/vanderbot/acquire_wikidata_metadata.py"><span style="font-family: courier;">acquire_wikidata_metadata.py</span></a>) makes it possible to use the <a href="https://github.com/HeardLibrary/linked-data/blob/master/vanderbot/vanderbot.py" target="_blank">VanderBot API-writing script</a> to improve certain kinds of information about items by simply copying multiple cells and pasting them elsewhere in the spreadsheet. Since CSVs are easily read and written by scripts, it is also possible to automate the addition of some kinds of data about existing items to the CSV (and eventually to Wikidata) by scripting. </p><p>In future posts, I will show how to accomplish some of the more difficult aspects of managing your own data in Wikidata using spreadsheets, including automated data acquisition, disambiguation, and monitoring for changes over time.</p><p><br /></p><p><br /></p></div>Steve Baskaufhttp://www.blogger.com/profile/01896499749604153763noreply@blogger.com0tag:blogger.com,1999:blog-5299754536670281996.post-50181913714919944672021-03-11T18:45:00.002-08:002021-03-13T12:19:22.724-08:00Writing your own data to Wikidata using spreadsheets: Part 3 - Determining what properties to use<p> This is the third part of a series of posts intended to help you manage and upload data to Wikidata using spreadsheets. 
You probably won't get as much out of this post if you haven't already done the do-it-yourself exercises in the first two posts, but since this is a more general topic, you might still find it useful even if you haven't read the earlier ones.</p><p><br /></p><h2 style="text-align: left;">Determining scope</h2><p>The target audience of this post is people or groups who have particular defined datasets that they would like to upload and "manage" on Wikidata. I put "manage" in quotes because no one can absolutely manage any data on Wikidata, since by definition it is a knowledge graph that anyone can edit. So no matter how much we care about "our" data, once we put it into Wikidata, we need to be at peace with the fact that others may edit "our" items. </p><p>There is a good chance that the data we are interested in "managing" may be some kind of data about which we have special knowledge. For example, if we are part of a museum, gallery, or library, we may have items in our collection that are worth describing in Wikidata. It is unlikely that others will have better information than we do about things like accession numbers and license information. I'm going to refer to this kind of information as "authoritative" data -- data that we probably know more about than other Wikidata users. There may be other data that we are very interested in tracking, but about which we may have no more information than anyone else.</p><p>In both of these situations, we have a vested interest in monitoring additions and changes made by others outside our group or organization. In the case of our "authoritative" data, we may want to be on the lookout for vandalism that needs to be reverted. As we track other non-authoritative data, we may discover useful information that's effectively crowd-sourced and available without cost (time or financial) to us. </p><p>There will also be other statements that involve properties that we aren't interested in tracking. That doesn't mean this other information is useless -- it may just not be practical for us to track it since we can't really contribute to it or gain much benefit from it. </p><p>So an important part of planning a project to upload and manage data in Wikidata is determining the scope of statements you plan to monitor. This is true regardless of how you are doing that managing, but in the case of using CSV spreadsheets, the defined scope will determine what column headers will be present in the spreadsheet. So prior to moving forward with using spreadsheets to write data to Wikidata, we need to decide what properties, qualifiers, and references we plan to document in those spreadsheets.</p><p><br /></p><h2 style="text-align: left;">Defining a group of items of interest</h2><p> The first thing we need to decide is what kind of items we are interested in. There may be an obvious target item type: works in a gallery, specimens in a museum, articles published by researchers in an institution, etc. There will also often be secondary item types associated with the primary one: artists associated with gallery works, collectors with specimens, authors and journals with articles, for example. After determining the primary and secondary item types, the next step is to figure out what value of <span style="font-family: courier;">P31</span> (<span style="font-family: courier;">instance of</span>) goes with each type of item of interest. 
In some cases, this might be obvious (<span style="font-family: courier;">Q5</span> = <span style="font-family: courier;">human</span> for authors, for example). In other cases it may not be so clear. Is a book <span style="font-family: courier;">Q571</span> (<span style="font-family: courier;">book</span>) or is it <span style="font-family: courier;">Q3331189</span> (<span style="font-family: courier;">version, edition, or translation</span>)? The best answer to this question is probably "what are other people using for items similar to mine?" We'll talk about some tools for figuring that out later in the post.</p><p>There are two useful ways to define a group of related items. The simplest is <i>enumeration</i>: creating a list of Q IDs of items that are similar. Less straightforward but more powerful is to define a <span style="font-family: inherit;"><i>graph pattern</i></span> that can be used in SPARQL to designate the group. We are not going to go off the deep end on SPARQL in this post, but it plays such an important role in using Wikidata that we need to talk about it a little bit. </p><p>Wikidata is a knowledge graph, which means that its items are linked together by the statements that connect them. Thus the links involve the properties that form the statements. The simplest connection between two items involves a single statement using a single property. For example, if we are interested in works in the Vanderbilt Fine Arts Gallery, we can define that group of items by stating that a work is in the collection (<span style="font-family: courier;">P195</span>) of the Vanderbilt Fine Arts Gallery (<span style="font-family: courier;">Q18563658</span>). We can abbreviate this relationship by the shorthand:</p><p><span style="font-family: courier;">?item wdt:P195 wd:Q18563658.</span></p><p>The <span style="font-family: courier;">?item</span> means that the item is the thing we want to know, and the other two parts lay out how the item is related to the gallery. This shorthand is the simplest kind of graph pattern that we can create to define a group of items. </p><p>We can use this graph pattern to get a list of the names of works in the gallery using the Wikidata Query Service. Click on <a href="https://w.wiki/34rA" target="_blank">this link</a> and it will take you to the Query service with the appropriate query filled in. If you look at the query in the upper right, you'll see our graph pattern stuck between a pair of curly brackets. (The other line is a sort of "magic" line that produces English labels for items that are found.) If you are wondering how you might have known how to set up a query like this, you can drop down the <span style="font-family: courier;">Examples</span> list and select the first query: "Cats". My query is just a hack of that one where I substituted my graph pattern for the one that defines cats (<span style="font-family: courier;">P31</span>=<span style="font-family: courier;">instance of</span>, <span style="font-family: courier;">Q146</span>=<span style="font-family: courier;">house cat</span>). </p><p>We can narrow down the scope of our group if we add another requirement to the graph pattern. 
For example, if we want our group of items to include only paintings that are in the Vanderbilt gallery, we can add another statement to the graph pattern: the item must also be an <span style="font-family: courier;">instance of</span> (<span style="font-family: courier;">P31</span>) a <span style="font-family: courier;">painting</span> (<span style="font-family: courier;">Q3305213</span>).</p><p><span style="font-family: courier;">?item wdt:P31 wd:Q3305213. </span></p><p>The query using both restrictions is <a href="https://w.wiki/34rE" target="_blank">here</a>.</p><p>These kinds of graph patterns are used all the time in Wikidata, sometimes when you don't even know it. If you visit the <a href="https://www.wikidata.org/wiki/Wikidata:WikiProject_Vanderbilt_Fine_Arts_Gallery/All_Paintings" target="_blank">Vanderbilt Fine Arts Gallery WikiProject paintings page</a> and look just below the star, you'll see that the graph pattern that we just defined is actually what generates that page. We will use such patterns later on to investigate property use by groups of items that are defined by graph patterns.</p><p><br /></p><h2><span style="font-family: inherit;">What properties are used with different kinds of items?</span></h2><h3 style="text-align: left;"><span style="font-family: inherit;">Recoin (Relative Completeness Indicator)</span></h3><div><span style="font-family: inherit;">The simplest way to see what kind of properties tend to be used with certain kinds of items is to look at the page of an item of that kind and see what properties are there. That isn't a very systematic approach, but there is a gadget called <i>Recoin</i> that can make our investigation more robust. Recoin can be installed by clicking on the </span><span style="font-family: courier;">Preferences</span><span style="font-family: inherit;"> link at the top of any Wikidata page, then selecting the </span><span style="font-family: courier;">Gadgets</span><span style="font-family: inherit;"> tab. Check the box for </span><span style="font-family: courier;">Recoin</span><span style="font-family: inherit;">, then click </span><span style="font-family: courier;">Save</span><span style="font-family: inherit;">.</span></div><div><span style="font-family: inherit;"><br /></span></div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibcxUPnwvL-57sXrnCAY_VODEthUh1DfyL6BzqxZoMJfFw5NYshhRvdMFqPiG81rnDiokR9Z88XbhvK96gcX7XDC2UXS5HkkUSXpeR0yyJiIzTLphj8VeQLFQP_pzBOBrYmwvuUs4_tmE/s1101/gallery-screenshot.png" style="margin-left: 1em; margin-right: 1em;"><img alt="screenshot showing Recoin" border="0" data-original-height="1101" data-original-width="896" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibcxUPnwvL-57sXrnCAY_VODEthUh1DfyL6BzqxZoMJfFw5NYshhRvdMFqPiG81rnDiokR9Z88XbhvK96gcX7XDC2UXS5HkkUSXpeR0yyJiIzTLphj8VeQLFQP_pzBOBrYmwvuUs4_tmE/w520-h640/gallery-screenshot.png" width="520" /></a></div><div><br /></div>After you enable Recoin, you can click the Recoin link just below the item description and a list will drop down showing the fraction of items having various properties for all items having the same <span style="font-family: courier;">P31</span> value. The example above shows values for instances of "art museum". Of course, this list shows properties that are missing for that page, so you would need to find a page with most properties missing to get a more full list. 
If you create a new item having only a <span style="font-family: courier;">P31</span> value, you will be able to get the complete list.</div><div><br /></div><h3 style="text-align: left;">Wikidata:WikiProjects</h3><div>A more systematic approach is to look for a WikiProject that is interested in the same kind of items as you. The <a href="https://www.wikidata.org/wiki/Wikidata:WikiProjects" target="_blank">list of WikiProjects</a> is somewhat intimidating, but if you succeed in finding the right project, it will often contain best-practices guidelines for describing certain types of items. For example, if you expand Cultural WikiProjects, then GLAM (galleries, libraries, archives, and museums) WikiProjects, you will see one called "Sum of all paintings". They have a list of <a href="https://www.wikidata.org/wiki/Wikidata:WikiProject_sum_of_all_paintings#Item_structure_to_describe_paintings_on_Wikidata" target="_blank">recommendations for how to describe paintings</a>. You can find similar lists in other areas and if you are lucky, you will find a list of extensive data model guidelines, such as the Stanford Libraries' data <a href="https://www.wikidata.org/wiki/Wikidata:WikiProject_Stanford_Libraries/Data_models" target="_blank">models for academia</a>. </div><div><br /></div><div>A small amount of time spent searching here will pay large dividends later if you start by using the consensus properties adopted by the community in which you are working. The items you put into Wikidata will be much more likely to be found and linked to by others if you describe them using the same model as is used with other items of the same type.</div><div><br /></div><div><br /><h2 style="text-align: left;"><span style="font-family: inherit;">Determining what properties are used "in the wild"</span></h2></div><div><span style="font-family: inherit;">If you find a WikiProject related to your type of interest, you will probably have a good idea of the properties that group says you should be using for statements about that type of item. However, you might discover that in actuality some of those properties are not really used much. That could be the case if the values are not easily available or if it's too labor-intensive to disambiguate available string values against item values in Wikidata. So it is pretty useful to know what properties are actually being used by items similar to the ones you are interested in creating/editing. </span></div><div><span style="font-family: inherit;"><br /></span></div><div>I have written a Python script, <span style="font-family: courier;">count_entities.py</span>, that you can use to determine what properties have been used to describe a group of related items and the number of items that have used each property. The script details are described <a href="https://github.com/HeardLibrary/linked-data/blob/master/vanderbot/count_entities.md" target="_blank">here</a>. Before using the script with your own set of items, you will need to define your category of items using one of the two methods I described earlier. But for testing purposes, you can try running the script using the default built-in group: works in the Vanderbilt University Fine Arts Gallery. 
</div><div><br /></div><div>To run the script, you need the following:</div><div><ul style="text-align: left;"><li>Python 3 installed on your computer with the ability to run it at the command line.</li><li>The <span style="font-family: courier;">requests</span> module installed using PIP, Conda, or some other package manager.</li><li>A plain text editor if you want to define the group by SPARQL graph pattern. You can use the built-in text editors TextEdit on Mac or Notepad on Windows.</li><li>A spreadsheet program to open CSV files. LibreOffice Calc is recommended.</li><li>Knowledge of how to change directories and run a Python script from your computer's console (Terminal on Mac, Command Line on Windows).</li></ul></div><div>You do NOT need to know how to code in Python. If you are uncertain about any of these requirements, please read the first post in this series, which includes a lot of hand-holding and additional information about them.</div><div><br /></div><div>To run the script, go to the <a href="https://github.com/HeardLibrary/linked-data/blob/master/vanderbot/count_entities.py" target="_blank">script's page on GitHub</a> and right click on the <span style="font-family: courier;">Raw</span> button. Select <span style="font-family: courier;">Save Link As... </span><span style="font-family: inherit;">and save the script in a directory you can easily navigate to using your console. The script will generate CSV files as output, so it is best to put the script in a relatively empty directory so you can find the files that are created.</span></div><div><span style="font-family: inherit;"><br /></span></div><div><span style="font-family: inherit;">To test the script, go to your console and navigate to the directory where you saved the script. Enter</span></div><div><span style="font-family: inherit;"><br /></span></div><div><span style="font-family: courier;">python count_entities.py</span></div><div><span style="font-family: inherit;"><br /></span></div><div>(or <span style="font-family: courier;">python3</span> if your installation requires that). The script will create a file called <span style="font-family: courier;">properties_summary.csv</span>, which you can open using your spreadsheet program.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiC2yK7VBR3TitoyKROMq46cuu-4hvv_wnxBVTJP9TaKpAL1t9gsTW8TjpTUfPzB7-6PDgfrZ25gfG9sTpILnNJVuwj9ke3IONQ3u4rieQLawcpwex-3TPcW1r40tl7pH6aMB-sUBd5Y6E/s614/property_list.png" style="margin-left: 1em; margin-right: 1em;"><img alt="list of properties of gallery items" border="0" data-original-height="614" data-original-width="396" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiC2yK7VBR3TitoyKROMq46cuu-4hvv_wnxBVTJP9TaKpAL1t9gsTW8TjpTUfPzB7-6PDgfrZ25gfG9sTpILnNJVuwj9ke3IONQ3u4rieQLawcpwex-3TPcW1r40tl7pH6aMB-sUBd5Y6E/w258-h400/property_list.png" width="258" /></a></div><div><br /></div>The table shows all of the properties used to make statements about items in the gallery and the number of items that use each property. Although there are (currently) 6000 items in the group, they use properties fairly consistently, so there aren't that many properties on the list. Other groups may have much longer lists. But often there will be a very long tail of properties used only once or a few times. 
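<div><br /></div><div>If you are curious about what kind of question the script is asking, a similar summary can be produced directly from the Wikidata Query Service. The sketch below is not the <span style="font-family: courier;">count_entities.py</span> script itself -- just an illustration of the kind of query involved, sent with the <span style="font-family: courier;">requests</span> module that is already on the requirements list above. It uses the same gallery graph pattern and counts how many items in the group have at least one statement for each property:</div><div><br /></div><div style="background-color: black; color: white; font-family: monospace; font-size: 12px; line-height: 18px; white-space: pre;">import requests

# Illustration only: count the items in the group that use each property.
# The first line of the WHERE clause is the graph pattern that defines the group.
query = """
SELECT ?property ?propertyLabel (COUNT(DISTINCT ?item) AS ?count) WHERE {
  ?item wdt:P195 wd:Q18563658.
  ?item ?directProp ?value.
  ?property wikibase:directClaim ?directProp.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
GROUP BY ?property ?propertyLabel
ORDER BY DESC(?count)
"""

response = requests.get("https://query.wikidata.org/sparql",
                        params={"query": query, "format": "json"})
for binding in response.json()["results"]["bindings"]:
    print(binding["count"]["value"], binding["propertyLabel"]["value"])</div><div><br /></div><div>The output lists the same kind of information as <span style="font-family: courier;">properties_summary.csv</span>: one row per property, with the number of items in the group that use it.</div>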
<div><br /></div><div>Unless you want to keep investigating the Vanderbilt Fine Arts Gallery items, you must define your group using one of the two options described below: <span style="font-family: courier;">--csv</span> (or its brief form <span style="font-family: courier;">-C</span>) to enumerate items in the group by Q ID or <span style="font-family: courier;">--graph</span> (or its brief form <span style="font-family: courier;">-G</span>) to define the group by a graph pattern.<br /><div><br /></div><h3 style="text-align: left;">Defining a group by a list of Q IDs</h3><div>Let's try using the script by defining the group by enumeration. Download the file <span style="font-family: courier;">bluffton_presidents.csv</span> from <a href="https://github.com/HeardLibrary/linked-data/blob/master/json_schema/bluffton_presidents.csv" target="_blank">here</a> into the same directory as the script, using the <span style="font-family: courier;">Raw</span> button as before. NOTE: if you are using a Mac, it may automatically try to change the file extension from <span style="font-family: courier;">.csv</span> to <span style="font-family: courier;">.txt</span> in the <span style="font-family: courier;">Save As... </span>dialog. If so, change the format to <span style="font-family: courier;">All Files</span> and change the extension back to <span style="font-family: courier;">.csv</span> before saving. </div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjSTwyGOB_59BB3lJwuj6STkfDM5a_LpWF_PRmJgA4BkdW0iBrE1tqo5JQKU_CeZ7OKcd8lYD85F2DEkGUNvZLknYwatrjes_shRRtgQUid7_k642qnLztvmEJBZXgniJySd91uZ1fNcv8/s804/president-csv.png" style="margin-left: 1em; margin-right: 1em;"><img alt="screenshot of test CSV" border="0" data-original-height="300" data-original-width="804" height="238" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjSTwyGOB_59BB3lJwuj6STkfDM5a_LpWF_PRmJgA4BkdW0iBrE1tqo5JQKU_CeZ7OKcd8lYD85F2DEkGUNvZLknYwatrjes_shRRtgQUid7_k642qnLztvmEJBZXgniJySd91uZ1fNcv8/w640-h238/president-csv.png" width="640" /></a></div><br /><div>If you open the CSV that you downloaded, you'll see that the first column has the header <span style="font-family: courier;">qid</span>. The script requires that the Q IDs be in a column with this header. The position of that column and the presence of other columns do not matter. The items in the column must be Q IDs, including the initial <span style="font-family: courier;">Q</span> and omitting any namespace abbreviations like <span style="font-family: courier;">wd:</span> .</div><div><br /></div><div>Run the script again using this syntax:</div><div><br /></div><div><span style="font-family: courier;">python count_entities.py --csv bluffton_presidents.csv</span></div><div><br /></div><div>Note that the previous output file will be overwritten when you run the script again. 
</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNEgRGB4EhRBd4zL7I3yIah5QkK5XriY-Fev6N4bYlfj50HrEBguid5Wict-JWK_S8Q_b-IFunfdD9G5YkBbWIYSN4_9u9Dd0X4mlyIPPEJo4JdL6D5tn3MwucQioAiNqhjz69d8Eijco/s342/president-prop-summary.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="218" data-original-width="342" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNEgRGB4EhRBd4zL7I3yIah5QkK5XriY-Fev6N4bYlfj50HrEBguid5Wict-JWK_S8Q_b-IFunfdD9G5YkBbWIYSN4_9u9Dd0X4mlyIPPEJo4JdL6D5tn3MwucQioAiNqhjz69d8Eijco/s320/president-prop-summary.png" width="320" /></a></div><br /><div>This time the script produces a list of properties appropriate for people.</div><div><div><br /></div><h3 style="text-align: left;"><span style="font-family: inherit;">Defining a list by SPARQL graph pattern</span></h3><div><span style="font-family: inherit;">Open your text editor and paste in the following text:</span></div><div><span style="font-family: inherit;"><br /></span></div><div><div><span style="font-family: courier;">?qid wdt:P108 wd:Q29052.</span></div><div><span style="font-family: courier;">?article wdt:P50 ?qid.</span></div><div><span style="font-family: courier;">?article wdt:P31 wd:Q13442814.</span></div></div><div><br /></div><div>The first line limits the group to items whose employer (<span style="font-family: courier;">P108</span>) is Vanderbilt University (<span style="font-family: courier;">Q29052</span>). The second line specifies that those items must be authors of something (<span style="font-family: courier;">P50</span>). The third line limits those somethings to being instances of (<span style="font-family: courier;">P31</span>) scholarly articles (<span style="font-family: courier;">Q13442814</span>). So with this graph pattern, we have defined our group as authors of scholarly articles who work (or worked) at Vanderbilt University. </div><div><br /></div><div>Save the file using the name <span style="font-family: courier;">graph_pattern.txt</span> in the same directory as the script. Run the script using this syntax:</div><div><br /></div><div><span style="font-family: courier;">python count_entities.py --graph graph_pattern.txt</span></div><div><br /></div><div>Again, the script will overwrite the previous output file. </div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjCzPpi8f7DvqW5V7FnemM3eANw8XQ5qN6NgG827H9lGlaA8Ik7h6-1RaAfyEkDql68PYmrjDFUZT-DOIQ9IbYC2meGODSCIEbXjJm6kGxdPXVkNZoi23rIeiQry5f6LaNeOsMvTaSuQw0/s577/vu-authors-props.png" style="margin-left: 1em; margin-right: 1em;"><img alt="list of properties of Vanderbilt authors" border="0" data-original-height="379" data-original-width="577" height="263" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjCzPpi8f7DvqW5V7FnemM3eANw8XQ5qN6NgG827H9lGlaA8Ik7h6-1RaAfyEkDql68PYmrjDFUZT-DOIQ9IbYC2meGODSCIEbXjJm6kGxdPXVkNZoi23rIeiQry5f6LaNeOsMvTaSuQw0/w400-h263/vu-authors-props.png" width="400" /></a></div><br /><div>This time, the list of properties is much longer because the group is larger and more diverse than in the last example. Despite whatever advice any WikiProjects group may give about best-practices for describing academics, we can see that there is a very small number of properties that are actually given for most of these academic authors. 
Note that in many cases, given name and family name statements are generated automatically by bots. So if we wanted to create "typical" records, we would only need to provide the top six properties. </div><div><span style="font-family: inherit;"><br /></span></div><div><span style="font-family: inherit;">If you are unfamiliar with creating SPARQL query graph patterns, I recommend experimenting at the <a href="https://query.wikidata.org/" target="_blank">Wikidata Query Service page</a>. The </span><span style="font-family: courier;">Examples</span><span style="font-family: inherit;"> dropdown there shows a lot of examples. However, in most cases, we can define the groups we want with simple graph patterns of only one to three lines.</span></div><div><span style="font-family: inherit;"><br /></span></div><h2 style="text-align: left;"><span style="font-family: inherit;">Examining property use in the wild</span></h2><div><span style="font-family: inherit;">Before deciding for sure what properties you want to write/monitor, it is good to know what the typical values are for each property. It is also critical to know whether it is conventional to use qualifiers with that property. The </span><span style="font-family: courier;">count_entities.py</span> script can also collect that information if you use the <span style="font-family: courier;">--prop</span> option (or its brief form <span style="font-family: courier;">-P</span>). I will demonstrate this with the default group (Vanderbilt Fine Arts Gallery works), but you can supply a value for either the <span style="font-family: courier;">--csv</span> or <span style="font-family: courier;">--graph</span> option to define your own group. </div><div><br /></div><div>One of the most important properties to understand about a group is <span style="font-family: courier;">P31</span> (<span style="font-family: courier;">instance of</span>). To see the distribution of values for <span style="font-family: courier;">P31</span> in the gallery, issue this command in your console:</div><div><br /></div><div><div><span style="font-family: courier;">python count_entities.py --prop P31</span></div><div><br /></div></div><div>(or <span style="font-family: courier;">python3</span> if your installation requires that). The script generates a file whose name starts with the property ID and ends in <span style="font-family: courier;">_summary.csv</span> (<span style="font-family: courier;">P31_summary.csv</span> in this example). Here's what the results look like:</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg4xVtrH95BQv8XUBqfUYBFrHdvuITmhBiSS_hVZeDlT1EalqsdCU3EttR66RxwMYqWNygT2b1nlVJUdCGn8LqGKtAIZDbRT0eNdlauh6FFtauV5bpEMwXq4bha_ovP47tPMROdrxesE9M/s577/p31-gallery.png" style="margin-left: 1em; margin-right: 1em;"><img alt="types of items in the VU gallery" border="0" data-original-height="379" data-original-width="577" height="263" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg4xVtrH95BQv8XUBqfUYBFrHdvuITmhBiSS_hVZeDlT1EalqsdCU3EttR66RxwMYqWNygT2b1nlVJUdCGn8LqGKtAIZDbRT0eNdlauh6FFtauV5bpEMwXq4bha_ovP47tPMROdrxesE9M/w400-h263/p31-gallery.png" width="400" /></a></div><br /><div>We can see that most items in the gallery that are described in Wikidata are prints. There is a long tail of other types with a very small number of representatives (e.g. "shoe"). 
Note that it is possible for an item to have more than one value for P31, so the total count of item by type could be greater than the total number of items.</div><div><br /></div><div><span style="font-family: inherit;">If any statements using the target property have qualifiers, the script will create a file listing the qualifiers used and the number of items with statements using those qualifiers. In the case of </span><span style="font-family: courier;">P31</span><span style="font-family: inherit;">, there were no qualifiers used, so no file was created. Let's try again using </span><span style="font-family: courier;">P571</span><span style="font-family: inherit;">, </span><span style="font-family: courier;">inception</span><span style="font-family: inherit;">. </span></div><div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;">python count_entities.py -P P571</span></div><div><br /></div></div><div><span style="font-family: inherit;">The result in the </span><span style="font-family: courier;">P571_summary.csv</span> file is not very useful.</div><div><span style="font-family: inherit;"><br /></span></div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdlGf3HAEBCbFaDMNGo9CR5s148480_EBhGnIN4f5qzt818bKcRBnN8j-4h-cCErffgpmRAie4K8mWNqTFrg70a48xJq6VlCpGG69N159F3cWWIo15ZKjkyCx9KyltoD4DH6rUfs4pa8o/s354/p571-gallery.png" style="margin-left: 1em; margin-right: 1em;"><img alt="inception dates list" border="0" data-original-height="354" data-original-width="324" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdlGf3HAEBCbFaDMNGo9CR5s148480_EBhGnIN4f5qzt818bKcRBnN8j-4h-cCErffgpmRAie4K8mWNqTFrg70a48xJq6VlCpGG69N159F3cWWIo15ZKjkyCx9KyltoD4DH6rUfs4pa8o/w183-h200/p571-gallery.png" width="183" /></a></div><br /><span style="font-family: inherit;">It listed the 401 different inception dates (as of today) for works in the gallery. However, the </span><span style="font-family: courier;">P571_qualifiers_summary.csv</span> is more interesting.</div><div><span style="font-family: inherit;"><br /></span></div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhz_ZbWRrsWcrRQvTZmmPprU7kKp6eWCO-843A9DD_FYxIpDpt2mMiQpmT_z2jrX_EDc9wX24eWvcaXTs27xQUh0RcwQGhAXvOXsLllALvRcYm9Z4t1NS7QLBud4Qlt6qggwCMW5BGkqo0/s382/p571-qual-gallery.png" style="margin-left: 1em; margin-right: 1em;"><img alt="qualifiers used with P571" border="0" data-original-height="130" data-original-width="382" height="109" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhz_ZbWRrsWcrRQvTZmmPprU7kKp6eWCO-843A9DD_FYxIpDpt2mMiQpmT_z2jrX_EDc9wX24eWvcaXTs27xQUh0RcwQGhAXvOXsLllALvRcYm9Z4t1NS7QLBud4Qlt6qggwCMW5BGkqo0/w320-h109/p571-qual-gallery.png" width="320" /></a></div><br /><span style="font-family: inherit;">This gives me very important information. For most of the 401 dates, they were qualified by defining an uncertainty range using </span><span style="font-family: courier;">earliest date</span><span style="font-family: inherit;"> (</span><span style="font-family: courier;">P1319</span><span style="font-family: inherit;">) and </span><span style="font-family: courier;">latest date</span><span style="font-family: inherit;"> (</span><span style="font-family: courier;">P1326</span><span style="font-family: inherit;">). 
The other commonly used qualifier was </span><span style="font-family: courier;">P1480</span><span style="font-family: inherit;"> (</span><span style="font-family: courier;">sourcing circumstances</span><span style="font-family: inherit;">). Examining the <a href="https://www.wikidata.org/wiki/Property:P1480" target="_blank">property description</a>, we see that </span><span style="font-family: courier;">P1480</span><span style="font-family: inherit;"> is used to indicate that a date is "circa" (</span><span style="font-family: courier;">Q5727902</span>). So all three of these qualifiers are really important and should probably be designated to be used with <span style="font-family: courier;">P571</span>.</div><div><br /></div><div>For properties that have a large number of possible values (e.g. properties that have unique values for every item), you probably don't want to have the script generate the file of values if all you want to know is the qualifiers that are used. You can get only the qualifiers output file if you use the <span style="font-family: courier;">--qual</span> (or <span style="font-family: courier;">-Q</span>) option (with no value needed to go with it). A good example for this is <span style="font-family: courier;">P217</span> (<span style="font-family: courier;">inventory number</span>). Every work has a unique value for this property, so there is no reason to download the values for the property. Using the <span style="font-family: courier;">--qual</span> option, I can find out what qualifiers are used without recording the values.</div><div><span style="font-family: inherit;"><br /></span></div><div><span style="font-family: courier;">python count_entities.py --prop P217 --qual</span></div><div><span style="font-family: inherit;"><br /></span></div></div></div><div><span style="font-family: inherit;">The </span><span style="font-family: courier;">P217_qualifiers_summary.csv</span> file shows that there is a single qualifier used with <span style="font-family: courier;">P217</span>: <span style="font-family: courier;">collection</span> (<span style="font-family: courier;">P195</span>). </div><div><br /></div><h2 style="text-align: left;">Putting it together</h2><div>The reason for including this post in the series about writing to Wikidata using spreadsheets is that we need to decide what properties, qualifiers, and references to include in the metadata description of the CSV that we will use to manage the data. So I will demonstrate how to put this all together to create the spreadsheet and its metadata description.</div><div><br /></div><div>I am interested in adding publications written by Vanderbilt researchers to Wikidata. Since data from <a href="https://www.crossref.org/" target="_blank">Crossref</a> is easily obtainable when DOIs are known, I'm interested in knowing what properties are used with existing items that have DOIs and were written by Vanderbilt researchers. So the first step is to define the group for the items. The first thing I tried was the graph pattern method. Here is my graph pattern:</div><div><br /></div><div><div><span style="font-family: courier;">?person wdt:P1416 wd:Q16849893.     # person affiliation VU Libraries</span></div><div><span style="font-family: courier;">?item wdt:P50 ?person.     # work author person</span></div><div><span style="font-family: courier;">?item wdt:P356 ?doi. 
# work has doi DOI.</span></div></div><div><br /></div><div>I tested this pattern at the Query Service with <a href="https://w.wiki/35Wm" target="_blank">this query</a>. However, when I ran the script with the <span style="font-family: courier;">--graph</span> option to determine property use, it timed out. </div><div><br /></div><h3 style="text-align: left;">Determining property use</h3><div>Since Plan A did not work, I moved on to Plan B. I downloaded the results from the query that I ran at the Query Service and put them into a CSV file. I then did a bit of massaging to pull the Q IDs into their own column with the header <span style="font-family: courier;">qid</span>. This time when I ran the script with the <span style="font-family: courier;">--csv</span> option, I got some useful results.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8dqbeEs0ruIYeNxCMzU3kD5ge4EwmiVKIm30ZC-lv3PmpX-2P8-iRjZYMFJQt-jEtldJiOV-TnSwNVRBfwhsQnttiLB_ua_bpV_1sSTYzbr-mgZPSjXdmGZiAj1iL3SYR2FM8aFKt_9g/s704/doi-props.png" style="margin-left: 1em; margin-right: 1em;"><img alt="properties of works with DOIs" border="0" data-original-height="704" data-original-width="498" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8dqbeEs0ruIYeNxCMzU3kD5ge4EwmiVKIm30ZC-lv3PmpX-2P8-iRjZYMFJQt-jEtldJiOV-TnSwNVRBfwhsQnttiLB_ua_bpV_1sSTYzbr-mgZPSjXdmGZiAj1iL3SYR2FM8aFKt_9g/w283-h400/doi-props.png" width="283" /></a></div><br /><div>Based on these results I probably need to plan to upload and track the first 10 properties (through author name string). For P31 and P1433, it would probably be useful to see what kind of values are usual, but for the rest I just need to know if they are typically used with qualifiers or not. </div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEickRICEcklleVUcNIeoFWJBfppWJXumXSTwThFF1pJaBoqwbrKZ9QIzpF_bx7Fmhx4Az3BocVkQ3TlOC0fFaruLBQnGIMt3YEKJDvGaOwWtkJ7EG4nDlHY58pWOVT6XJmu0o87w3FWYJ4/s466/article_p31.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="279" data-original-width="466" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEickRICEcklleVUcNIeoFWJBfppWJXumXSTwThFF1pJaBoqwbrKZ9QIzpF_bx7Fmhx4Az3BocVkQ3TlOC0fFaruLBQnGIMt3YEKJDvGaOwWtkJ7EG4nDlHY58pWOVT6XJmu0o87w3FWYJ4/s320/article_p31.png" width="320" /></a></div><br /><div>The results for P31 indicate that although both <span style="font-family: courier;">scholarly article</span> (<span style="font-family: courier;">Q13442814</span>) and <span style="font-family: courier;">academic journal article</span> (<span style="font-family: courier;">Q18918145</span>) are used to describe these kind of academic publications, <span style="font-family: courier;">scholarly article</span> seems to be more widely used. There were no qualifiers used with <span style="font-family: courier;">P31</span>. Not unexpectedly, a check of <span style="font-family: courier;">P1433</span> revealed many library-related journals. 
One item used qualifiers with <span style="font-family: courier;">P1433</span>, but those qualifiers, <span style="font-family: courier;">P304</span> (<span style="font-family: courier;">pages</span>), <span style="font-family: courier;">P433</span> (<span style="font-family: courier;">issue</span>), and <span style="font-family: courier;">P478</span> (<span style="font-family: courier;">volume</span>), appear to be misplaced since those properties are generally used directly in statements about the work. </div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEinJ1ayBp11deOzTvGsPo_jV1zQI5lMa0ZTVCk7xDXDOiTbg_2Fz_DubJEkn5PRn9PAhUXNs1lqLZwX6yVSGXaxIwfEDBnIbVEZtRy_OhmZGJNGbLnBvuYWcuVcryhCr3Rj6Q4RCLEe1EU/s293/P50-quals.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="127" data-original-width="293" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEinJ1ayBp11deOzTvGsPo_jV1zQI5lMa0ZTVCk7xDXDOiTbg_2Fz_DubJEkn5PRn9PAhUXNs1lqLZwX6yVSGXaxIwfEDBnIbVEZtRy_OhmZGJNGbLnBvuYWcuVcryhCr3Rj6Q4RCLEe1EU/s0/P50-quals.png" /></a></div><br /><div>The only other properties with qualifiers were <span style="font-family: courier;">P50</span> (<span style="font-family: courier;">author</span>, shown above) and <span style="font-family: courier;">P2093</span> (<span style="font-family: courier;">author name string</span>), which also had the qualifier <span style="font-family: courier;">P1545</span> (<span style="font-family: courier;">series ordinal</span>). So this simplifies the situation quite a bit -- I really only need to worry about qualifiers with the two author-related terms, which are going to require some special handling anyway. </div><div><br /></div><h3 style="text-align: left;">Creating a <span style="font-family: courier;">config.json</span> file for the spreadsheet</h3><div>I now have enough information to know how I want to lay out the spreadsheet(s) to contain the data that I'll upload/manage about journal articles. To understand better how to structure the <span style="font-family: courier;">config.json</span> file that I'll use to generate the spreadsheets and metadata description file, I looked at <a href="https://www.wikidata.org/wiki/Q56825541" target="_blank">one of the articles</a> to help understand the value types for the properties. </div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEip2aOPyjsePhvVCj8x4KXy-OJ-HxmaFAqcyjCi2KoW2jgiMpVxhcigrAUE2OMTwMi4a7ZQCBf5ENPloONNb37Oay37lc78BwK_8s-vgQE8u6GtuGAixI4aRpi0_0md9lIsnj1jLDFHzwg/s1175/article-example.png" style="margin-left: 1em; margin-right: 1em;"><img alt="example article" border="0" data-original-height="1175" data-original-width="890" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEip2aOPyjsePhvVCj8x4KXy-OJ-HxmaFAqcyjCi2KoW2jgiMpVxhcigrAUE2OMTwMi4a7ZQCBf5ENPloONNb37Oay37lc78BwK_8s-vgQE8u6GtuGAixI4aRpi0_0md9lIsnj1jLDFHzwg/w484-h640/article-example.png" width="484" /></a></div><br /><div><br /></div><div>The style of the values on the page helps me to know the value type. The item values are hyperlinked text. The string values are unlinked black text. Monolingual text values look like strings, but have their language following them in parentheses.</div><div><br /></div><div>To decide about the number of spreadsheets needed, I thought about which properties were likely to have multiple values per article item. 
Both author (item) and author name string could have multiple values. So I put them into separate spreadsheets. The rest of the properties will probably have only one value per article (or at least only one value that I'm interested in tracking). So here is what the overall structure of the <span style="font-family: courier;">config.json</span> file looks like:</div><div><br /></div><div><div style="background-color: black; color: white; font-family: Menlo, Monaco, "Courier New", monospace; font-size: 12px; line-height: 18px; white-space: pre;"><div>{</div><div> <span style="color: #d4d4d4;">"data_path"</span>: <span style="color: #ce9178;">""</span>,</div><div> <span style="color: #d4d4d4;">"item_source_csv"</span>: <span style="color: #ce9178;">""</span>,</div><div> <span style="color: #d4d4d4;">"item_pattern_file"</span>: <span style="color: #ce9178;">""</span>,</div><div> <span style="color: #d4d4d4;">"outfiles"</span>: [</div><div> {</div><div> <span style="color: #d4d4d4;">"manage_descriptions"</span>: <span style="color: #569cd6;">true</span>,</div><div> <span style="color: #d4d4d4;">"label_description_language_list"</span>: [</div><div> <span style="color: #ce9178;">"en"</span></div><div> ],</div><div> <span style="color: #d4d4d4;">"output_file_name"</span>: <span style="color: #ce9178;">"articles.csv"</span>,</div><div> <span style="color: #d4d4d4;">"prop_list"</span>: [</div><div> ]</div><div> },</div><div> {</div><div> <span style="color: #d4d4d4;">"manage_descriptions"</span>: <span style="color: #569cd6;">false</span>,</div><div> <span style="color: #d4d4d4;">"label_description_language_list"</span>: [],</div><div> <span style="color: #d4d4d4;">"output_file_name"</span>: <span style="color: #ce9178;">"authors.csv"</span>,</div><div> <span style="color: #d4d4d4;">"prop_list"</span>: [</div><div> ]</div><div> },</div><div> {</div><div> <span style="color: #d4d4d4;">"manage_descriptions"</span>: <span style="color: #569cd6;">false</span>,</div><div> <span style="color: #d4d4d4;">"label_description_language_list"</span>: [],</div><div> <span style="color: #d4d4d4;">"output_file_name"</span>: <span style="color: #ce9178;">"author_strings.csv"</span>,</div><div> <span style="color: #d4d4d4;">"prop_list"</span>: [</div><div> ]</div><div> }</div><div> ]</div><div>}</div></div></div><div><br /></div><div>I don't want to manage descriptions on the two author-related CSVs, and am only including the labels to make it easier to identify the article. 
I'm only working in English, so that also simplifies the label situation.</div><div><br /></div><div>Here are a few of the property descriptions that I used that illustrate several value types for the statement properties:</div><div><br /></div><div><div style="background-color: black; color: white; font-family: Menlo, Monaco, "Courier New", monospace; font-size: 12px; line-height: 18px; white-space: pre;"><div> {</div><div> <span style="color: #d4d4d4;">"pid"</span>: <span style="color: #ce9178;">"P31"</span>,</div><div> <span style="color: #d4d4d4;">"variable"</span>: <span style="color: #ce9178;">"instance_of"</span>,</div><div> <span style="color: #d4d4d4;">"value_type"</span>: <span style="color: #ce9178;">"item"</span>,</div><div> <span style="color: #d4d4d4;">"qual"</span>: [],</div><div> <span style="color: #d4d4d4;">"ref"</span>: []</div><div> },</div><div> {</div><div> <span style="color: #d4d4d4;">"pid"</span>: <span style="color: #ce9178;">"P356"</span>,</div><div> <span style="color: #d4d4d4;">"variable"</span>: <span style="color: #ce9178;">"doi"</span>,</div><div> <span style="color: #d4d4d4;">"value_type"</span>: <span style="color: #ce9178;">"string"</span>,</div><div> <span style="color: #d4d4d4;">"qual"</span>: [],</div><div> <span style="color: #d4d4d4;">"ref"</span>: [</div><div> {</div><div> <span style="color: #d4d4d4;">"pid"</span>: <span style="color: #ce9178;">"P854"</span>,</div><div> <span style="color: #d4d4d4;">"variable"</span>: <span style="color: #ce9178;">"referenceUrl"</span>,</div><div> <span style="color: #d4d4d4;">"value_type"</span>: <span style="color: #ce9178;">"uri"</span></div><div> },</div><div> {</div><div> <span style="color: #d4d4d4;">"pid"</span>: <span style="color: #ce9178;">"P813"</span>,</div><div> <span style="color: #d4d4d4;">"variable"</span>: <span style="color: #ce9178;">"retrieved"</span>,</div><div> <span style="color: #d4d4d4;">"value_type"</span>: <span style="color: #ce9178;">"date"</span></div><div> }</div><div> ]</div><div> },</div><div> {</div><div> <span style="color: #d4d4d4;">"pid"</span>: <span style="color: #ce9178;">"P577"</span>,</div><div> <span style="color: #d4d4d4;">"variable"</span>: <span style="color: #ce9178;">"published"</span>,</div><div> <span style="color: #d4d4d4;">"value_type"</span>: <span style="color: #ce9178;">"date"</span>,</div><div> <span style="color: #d4d4d4;">"qual"</span>: [],</div><div> <span style="color: #d4d4d4;">"ref"</span>: [</div><div> {</div><div> <span style="color: #d4d4d4;">"pid"</span>: <span style="color: #ce9178;">"P854"</span>,</div><div> <span style="color: #d4d4d4;">"variable"</span>: <span style="color: #ce9178;">"referenceUrl"</span>,</div><div> <span style="color: #d4d4d4;">"value_type"</span>: <span style="color: #ce9178;">"uri"</span></div><div> },</div><div> {</div><div> <span style="color: #d4d4d4;">"pid"</span>: <span style="color: #ce9178;">"P813"</span>,</div><div> <span style="color: #d4d4d4;">"variable"</span>: <span style="color: #ce9178;">"retrieved"</span>,</div><div> <span style="color: #d4d4d4;">"value_type"</span>: <span style="color: #ce9178;">"date"</span></div><div> }</div><div> ]</div><div> },</div><div> {</div><div> <span style="color: #d4d4d4;">"pid"</span>: <span style="color: #ce9178;">"P1476"</span>,</div><div> <span style="color: #d4d4d4;">"variable"</span>: <span style="color: #ce9178;">"title_en"</span>,</div><div> <span style="color: #d4d4d4;">"value_type"</span>: <span style="color: #ce9178;">"monolingualtext"</span>,</div><div> <span 
style="color: #d4d4d4;">"language"</span>: <span style="color: #ce9178;">"en"</span>,</div><div> <span style="color: #d4d4d4;">"qual"</span>: [],</div><div> <span style="color: #d4d4d4;">"ref"</span>: [</div><div> {</div><div> <span style="color: #d4d4d4;">"pid"</span>: <span style="color: #ce9178;">"P854"</span>,</div><div> <span style="color: #d4d4d4;">"variable"</span>: <span style="color: #ce9178;">"referenceUrl"</span>,</div><div> <span style="color: #d4d4d4;">"value_type"</span>: <span style="color: #ce9178;">"uri"</span></div><div> },</div><div> {</div><div> <span style="color: #d4d4d4;">"pid"</span>: <span style="color: #ce9178;">"P813"</span>,</div><div> <span style="color: #d4d4d4;">"variable"</span>: <span style="color: #ce9178;">"retrieved"</span>,</div><div> <span style="color: #d4d4d4;">"value_type"</span>: <span style="color: #ce9178;">"date"</span></div><div> }</div><div> ]</div><div> },</div><div></div></div></div><div><br /></div><div>Following typical practice, I'm skipping references for <span style="font-family: courier;">P31</span> (<span style="font-family: courier;">instance of</span>). The rest of the properties only have reference properties for <span style="font-family: courier;">P854</span> (<span style="font-family: courier;">reference URL</span>) and <span style="font-family: courier;">P813</span> (<span style="font-family: courier;">retrieved</span>). Some existing items may have references for <span style="font-family: courier;">P248</span> (<span style="font-family: courier;">stated in</span>), but since I'm going to be getting my data from Crossref DOIs, I'll probably just use the URL form of the DOI in all of the references. So I'll only use a column for <span style="font-family: courier;">P854</span>. Notice also that the <span style="font-family: courier;">P1476</span> (<span style="font-family: courier;">title</span>) property must have the extra language key/value pair since it's a monolingual string. If the title of the journal isn't in English, I'm stuck but I'll deal with that problem later if it arises.</div><div><br /></div><div>The final version of my config.json file is <a href="https://gist.github.com/baskaufs/53d24710f65a4a958e9b7ca7cb1f8b43" target="_blank">here</a>. I will now try running the <span style="font-family: courier;">convert_json_to_metadata_schema.py</span> script discussed in the last post to generate the headers for the three CSV files and the metadata description file so that I can test them out.</div><div><br /></div><h3 style="text-align: left;">Test data</h3><div>To test whether this will work, I'm going to manually add data to the spreadsheet for an old article of mine that I know is not yet in Wikidata. It's <a href="https://doi.org/10.1603/0046-225X-30.2.181">https://doi.org/10.1603/0046-225X-30.2.181</a> . Here's <a href="https://gist.github.com/baskaufs/1d7c00b9a442552be56aa81a94b85c4b" target="_blank">a file</a> that shows what the data look like when entered into the spreadsheet. You'll notice that I used the DOI as the reference URL. As I said in the last section, I intend to eventually automate the process of collecting the information from Crossref, but even though I got the information manually, the DOI URL will redirect to the journal article landing page, so anyone checking the reference will be able to see it in human-readable form. So this is a good solution that's honest on the data source and that also allows people to check the reference when the click on the link. 
</div><div><br /></div><div>Please note that I did NOT fill in the author CSV yet, even though I already know what the author items are. The reason is that if I filled it in without the article item Q ID in the <span style="font-family: courier;">qid</span> column, the VanderBot API-writing script would create two new items that consisted only of author statements about unlabeled items. Instead, I created the item for the article first, then added the article Q ID in the <span style="font-family: courier;">qid</span> column for both author rows in the <span style="font-family: courier;">authors.csv</span> file. You can see what that looks like in <a href="https://gist.github.com/baskaufs/d74b9795d7ddbe97496c234607495175" target="_blank">this file</a>. Since I knew the author item Q IDs for both authors, I could put them both in the <span style="font-family: courier;">authors.csv</span> file, but if I had only known the name strings for some or all of the authors, I would have had to put them in the <span style="font-family: courier;">author_strings.csv</span> file, again along with the article Q IDs after the article record had been written. </div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh60CSjkpkz92jpbr5SXBLp0Hkhb02KchyphenhyphenCxWoxKgSuz7lgG1nvB-gV9s3SLcHNYdnAplp_oDbdnvILl2APJmt1mWafzOLAanEZ57QuWdsRnAekstxcxZaBkOWaSFWq-FfMuIa2SrzIFYw/s1261/finished-article-item.png" style="margin-left: 1em; margin-right: 1em;"><img alt="finished item page" border="0" data-original-height="1261" data-original-width="1144" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh60CSjkpkz92jpbr5SXBLp0Hkhb02KchyphenhyphenCxWoxKgSuz7lgG1nvB-gV9s3SLcHNYdnAplp_oDbdnvILl2APJmt1mWafzOLAanEZ57QuWdsRnAekstxcxZaBkOWaSFWq-FfMuIa2SrzIFYw/w580-h640/finished-article-item.png" width="580" /></a></div><br /><div>The final product seems to have turned out according to the plan. The page <a href="https://www.wikidata.org/wiki/Q105899588" target="_blank">is here</a>. </div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj3KZXaw7unE0pBl2cM7mLXOoCVJxYfw8jNfOjvQWttd2rjuFjsPg8MTATlIVG6yIgqqTna7AJhpzcheoRnal1n98EzqXNFnuBs9ok-bLML5blLRMN82cb7rJV5u8n1lULbhrXfgLviIXI/s1120/page-history.png" style="margin-left: 1em; margin-right: 1em;"><img alt="page history of new article" border="0" data-original-height="582" data-original-width="1120" height="332" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj3KZXaw7unE0pBl2cM7mLXOoCVJxYfw8jNfOjvQWttd2rjuFjsPg8MTATlIVG6yIgqqTna7AJhpzcheoRnal1n98EzqXNFnuBs9ok-bLML5blLRMN82cb7rJV5u8n1lULbhrXfgLviIXI/w640-h332/page-history.png" width="640" /></a></div><br /><div>If we examine the page history of the new page, we see that there were three edits. The two smaller, more recent ones were the two author edits, and the first, larger edit was the one that created the original item. </div><div><br /></div><h2 style="text-align: left;">What's next?</h2><div>You can try using the <span style="font-family: courier;">config.json</span> file to generate your own CSV headers and metadata description files if you want to try uploading a journal article yourself. Just make sure that it isn't already in Wikidata. You can also hack the <span style="font-family: courier;">config.json</span> file to use different properties, qualifiers, and references for a project of your own. 
I do highly recommend that you try writing only a single item at first so that if things do not go according to plan, the problems can easily be fixed manually. </div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjwEPW9kKUBXfw-Z3oNPtBJj-oill5u84FDx6Lhb9biMG0DdIcJ7oyIRTw2W8zdPK5oJHuhxFUYXXdRI0eizpSkFapJX1H3aL5Nhz9hkX4tHBqWoUGKLaICnjzntKNGhdlGarrw_bHM-SI/s1284/workflow.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Workflow diagram" border="0" data-original-height="804" data-original-width="1284" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjwEPW9kKUBXfw-Z3oNPtBJj-oill5u84FDx6Lhb9biMG0DdIcJ7oyIRTw2W8zdPK5oJHuhxFUYXXdRI0eizpSkFapJX1H3aL5Nhz9hkX4tHBqWoUGKLaICnjzntKNGhdlGarrw_bHM-SI/w640-h400/workflow.png" width="640" /></a></div><br /><div>Although we have now set up spreadsheets and a metadata description JSON file that can write data to Wikidata, there is still too much manual work for this to be productive. In subsequent posts, I'll talk about how we can automate things we have thus far been doing by hand.</div><div><br /></div><div>The diagram above shows the general workflow that I've been using in the various projects with which I've used the spreadsheet approach. We have basically been working backwards through that workflow, so in the next post I will talk about how we can use the Query Service to download existing data from Wikidata so that we don't duplicate any of the items, statements, or references that already exist in Wikidata. </div><div><br /></div><div>The image above is from a presentation I gave in Feb 2021 describing the "big picture", rationale, and potential benefits of managing data in Wikidata using spreadsheets. You can view that video <a href="https://drive.google.com/file/d/1aB2XuQ_gqdB99tKcxEMoU7j75-x-i6RP/view?usp=sharing" target="_blank">here</a>.</div><div><br /></div>Steve Baskaufhttp://www.blogger.com/profile/01896499749604153763noreply@blogger.com0tag:blogger.com,1999:blog-5299754536670281996.post-20122212212231924792021-03-07T20:28:00.009-08:002021-05-25T10:48:53.826-07:00Writing your own data to Wikidata using spreadsheets: Part 2 - editing the real Wikidata<p> For a video walk-through of the previous blog post and this one, see <a href="https://heardlibrary.github.io/digital-scholarship/script/wikidata/vanderbot/" target="_blank">this page</a>.<br /></p><p> In the <a href="http://baskauf.blogspot.com/2021/03/writing-your-own-data-to-wikidata-using.html" target="_blank">previous post</a>, I described how to create a Wikimedia bot password and use it to write spreadsheet data to the test Wikidata instance: <a href="https://test.wikidata.org/">https://test.wikidata.org/</a>. The process required setting up a JSON metadata description file that mapped the CSV column headers to the RDF variant of the Wikibase data model. The VanderBot Python script used that mapping file to "understand" how to prepare the CSV data to be written to the Wikidata API. The script also recorded its interactions with the API by storing identifiers associated with the knowledge graph entities in the CSV along with the data.</p><p>This post will continue in the "do it yourself" vein of the previous post. 
In order to successfully complete the activities in this post, you must:</p><p></p><ul style="text-align: left;"><li>have a plain-text credentials file (prepared in the <a href="http://baskauf.blogspot.com/2021/03/writing-your-own-data-to-wikidata-using.html" target="_blank">last post</a>)</li><li>have Python installed and know how to run a script from the command line (Python programming skills not required)</li><li>have downloaded the VanderBot script to a directory on your local drive where you plan to work.</li><li>understand that the edits you make are your responsibility just as if you had made them using the graphical interface. If you mess something up, you need to fix it -- most likely manually, since VanderBot is designed to upload new data, not change or delete existing data.</li><li>have practiced on the test Wikidata instance enough to feel comfortable using VanderBot to make edits. </li></ul><div>If any of these things are not true, then you need to go back and read the <a href="http://baskauf.blogspot.com/2021/03/writing-your-own-data-to-wikidata-using.html">first blog post</a> to prepare. </div><div><br /></div><h2 style="text-align: left;">Options when running the script</h2><div>In the last post, we practiced with VanderBot using all of the settings defaults. However, you may want to change some of those defaults depending on your situation. The most obvious change is to suppress the display of the giant blobs of response JSON from the API that fly up the screen as the script runs. You can redirect most of the output to a log file using the <span style="font-family: courier;">--log</span> option. The log file will record the JSON output and at the end will include a summary of known errors that occurred throughout the writing process. (The same error report will be shown on the console screen, too.) You may choose to ignore the log file most of the time -- it will simply be overwritten the next time the script is run. However, it may be useful if the script terminates due to an error. </div><div><br /></div><div>Most of the other options allow you to designate different file names or locations for the metadata description file and credentials file. It may be convenient to keep the credentials file in the same directory as the other files (the <span style="font-family: courier;">working</span> directory option), but if you are using version control (e.g. GitHub) you should keep it elsewhere. You may wish to use different file names if you have multiple bot passwords or have different metadata description files for different CSVs.</div><div><br /></div><div>The <span style="font-family: courier;">--update</span> option is used to control whether the labels and descriptions in the CSV will overwrite different values for existing items in Wikidata. 
It defaults to <span style="font-family: courier;">suppress</span> updates, but we will talk later about when you might want to use a different option.</div><div><br /></div><h4 style="text-align: left;">Options:</h4><div><br /></div><div><span style="font-family: courier; font-size: x-small;">--log (-L): log filename, or path and appended filename. Default: none</span></div><div><span style="font-family: courier; font-size: x-small;">--json (-J): JSON metadata description filename, or path and appended filename. Default: "csv-metadata.json"</span></div><div><span style="font-family: courier; font-size: x-small;">--credentials (-C): name of the credentials file. Default: "wikibase_credentials.txt"</span></div><div><span style="font-family: courier; font-size: x-small;">--path (-P): credentials directory: "home", "working", or a path with a trailing "/". Default: "home"</span></div><div><span style="font-family: courier; font-size: x-small;">--update (-U): "allow" or "suppress" automatic updates to labels and descriptions. Default: "suppress"</span></div><div><br /></div><h4 style="text-align: left;">Option examples:</h4><div>Note: some installations of Python require using <span style="font-family: courier;">python3</span> instead of <span style="font-family: courier;">python</span> in the command.</div><div><br /></div><div><span style="font-family: courier;">python vanderbot.py --json project-metadata.json --log ../log.txt</span></div><div><span style="font-family: inherit;"><br /></span></div><div><span style="font-family: inherit;">Metadata description file is called </span><span style="font-family: courier;">project-metadata.json</span><span style="font-family: inherit;"> and is in the current working directory. 
Progress and error logs saved to the file </span><span style="font-family: courier;">log.txt</span><span style="font-family: inherit;"> in the parent directory.</span></div><div><span style="font-family: inherit;"><br /></span></div><div><span style="font-family: arial;"><br /></span></div><div><span style="font-family: courier;">python vanderbot.py -P working -C wikidata-credentials.txt</span></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: inherit;">Credentials file called </span><span style="font-family: courier;">wikidata-credentials.txt</span><span style="font-family: inherit;"> is in the current working directory.</span></div><div><span style="font-family: inherit;"><br /></span></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: courier;">python vanderbot.py --update allow -L update.log</span></div><div><span style="font-family: courier;"><br /></span></div><div><span style="font-family: inherit;">Progress and error logs saved to the file </span><span style="font-family: courier;">update.log</span><span style="font-family: inherit;"> in the current working directory. Labels and descriptions of existing items in Wikidata are automatically replaced with local values if they differ. Notice that the long and short forms of the options can be mixed and are interchangeable.</span></div><div><span style="font-family: inherit;"><br /></span></div><h2 style="text-align: left;"><span style="font-family: inherit;">Writing to the "real" Wikidata</span></h2><div><span style="font-family: inherit;">Once you have set everything up, it is a simple matter to switch from writing to the test.wikidata.org API to writing to the "real" www.wikidata.org API. All that is necessary is to change </span><span style="font-family: courier;">test</span><span style="font-family: inherit;"> to </span><span style="font-family: courier;">www</span><span style="font-family: inherit;"> in the first line of the credentials file:</span></div><div><span style="font-family: inherit;"><br /></span></div><div><span style="font-family: courier;">endpointUrl=https://www.wikidata.org</span></div><div><br /></div><div>The username and password lines can stay the same. </div><div><br /></div><div>However, we cannot use the same CSV and metadata description files as before because the property and item IDs are different in the real Wikidata. We also don't yet want to create new items until we are comfortable with making several edits in the real Wikidata. Fortunately, there are several items in Wikidata that are designated as "sandbox" items, i.e. their metadata can be changed to anything by anyone without consequence. They are generally lightly used, so you can edit them and still have time to examine what you have done before someone changes them to something else. The first sandbox item (<a href="https://www.wikidata.org/wiki/Q4115189" target="_blank">Q4115189</a>) is better known than the other two, so we will use sandbox items 2 and 3 in our practice. 
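<br /><br /><div>As an aside (not part of the VanderBot workflow), if you want to see the current state of the sandbox items before you touch them, you can ask the API for their English labels with the standard <span style="font-family: courier;">wbgetentities</span> action. A minimal sketch, assuming the <span style="font-family: courier;">requests</span> library is installed:</div><div><br /></div><div style="background-color: black; color: white; font-family: Menlo, Monaco, 'Courier New', monospace; font-size: 12px; line-height: 18px; white-space: pre;"># Sketch: look up the current English labels of the sandbox items before editing them.
import requests

resp = requests.get('https://www.wikidata.org/w/api.php',
                    params={'action': 'wbgetentities',
                            'ids': 'Q13406268|Q15397819',
                            'props': 'labels',
                            'languages': 'en',
                            'format': 'json'})
for qid, entity in resp.json()['entities'].items():
    print(qid, entity['labels']['en']['value'])</div>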
</div><p></p><h4 style="text-align: left;"> Wikidata sandbox items:</h4><p><span style="font-family: courier;">Q ID Label</span></p><p><span style="font-family: courier;">-------- ------------------</span></p><p><span style="font-family: courier;"><a href="https://www.wikidata.org/wiki/Q4115189" target="_blank">Q4115189</a> Wikidata Sandbox</span></p><p><span style="font-family: courier;"><a href="https://www.wikidata.org/wiki/Q13406268" target="_blank">Q13406268</a> Wikidata Sandbox 2</span></p><p><span style="font-family: courier;"><a href="https://www.wikidata.org/wiki/Q15397819" target="_blank">Q15397819</a> Wikidata Sandbox 3</span></p><p>With respect to etiquette regarding the sandbox items, I don't know that there are particular rules, but I would say that it would not be acceptable to change their labels, since that is the primary means by which users will know what they are. I would say that anything else, including descriptions and aliases, is probably open to editing. </p><p>I would avoid adding a large number of statements to the sandbox items and then just leaving them, although a few edits probably don't matter. Probably the best thing to do after you are done playing with editing the sandbox items is to go to the <span style="font-family: courier;">View history</span> page and undo your change (if you've only made one edit) or restore the last previous version before you started playing with the item if you've made a lot of changes. I'll show how to do that when we get to that point.</p><p>After you are comfortable playing with the sandbox items, we will try adding new real items. </p><p><br /></p><h2 style="text-align: left;">Describing a CSV using a simpler JSON format</h2><p>We could use the web tool again to create a new metadata description file based on the real Wikidata properties, but I will tell you about another tool that you can use that requires many fewer button clicks. I created a simplified configuration file format that can be used to generate the standard metadata description file based on some rules about how to construct the column header names, assumptions about labels and descriptions, and one simplification of references. (The detailed specifications for the configuration file format are <a href="https://github.com/HeardLibrary/linked-data/blob/master/vanderbot/convert-config.md" target="_blank">here</a>.) The configuration file that we will be using has the default name <span style="font-family: courier;">config.json</span> and can be viewed in <a href="https://gist.github.com/baskaufs/25a19cbb0edf9fcd16423bf231645939" target="_blank">this gist</a>. </p><p>It is not necessary for you to edit this file to use it for the practice exercise. You can simply download it (right-click on the <span style="font-family: courier;">Raw</span> button and select <span style="font-family: courier;">Save file as...</span>). Download it into a directory you can access easily from your home folder -- you can use the same one you used last time, although it might get a bit cluttered.</p><p>If you understand JSON, the file structure will make sense to you. Even if you don't, you can probably copy and paste parts of it to change it to fit your needs. (It includes examples of most of the object types including two we haven't used before: monolingual text and quantity.) If you copy and paste, you will mostly need to be careful about placement of commas. Indentation is optional in JSON and is only used to make the structure more apparent. 
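</p><p>Since a misplaced comma will make the whole file unparseable, it may be worth confirming that your edited <span style="font-family: courier;">config.json</span> is still valid JSON before going on. Any JSON-aware tool will do; here is one quick way using only Python's built-in <span style="font-family: courier;">json</span> module (a convenience check of my own, not part of the VanderBot scripts):</p><div style="background-color: black; color: white; font-family: Menlo, Monaco, 'Courier New', monospace; font-size: 12px; line-height: 18px; white-space: pre;"># Sketch: make sure an edited config.json still parses as valid JSON.
import json

with open('config.json', 'r', encoding='utf-8') as file_object:
    config = json.load(file_object)  # raises an error naming the line and column if malformed

print('config.json parsed OK;', len(config['outfiles']), 'output file(s) described')</div><p>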
</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjn_RkKXBbvtehC0xlHl0HKObKJ7GPk2GJ5U5TbFctPCwCotfmQp_RvoOEb8P9LxnJ59JtSMLqHH7SLvBWFfBYIuI-Qzk3oYJhy9XYaU2z4y-Ux828YJROqoEkcqVfeNSo_Olq2_NIkdFk/s610/csv_level_json.png" style="margin-left: 1em; margin-right: 1em;"><img alt="high-level JSON describing CSV files" border="0" data-original-height="610" data-original-width="513" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjn_RkKXBbvtehC0xlHl0HKObKJ7GPk2GJ5U5TbFctPCwCotfmQp_RvoOEb8P9LxnJ59JtSMLqHH7SLvBWFfBYIuI-Qzk3oYJhy9XYaU2z4y-Ux828YJROqoEkcqVfeNSo_Olq2_NIkdFk/w538-h640/csv_level_json.png" width="538" /></a></div><br /><p>For now, we can ignore the first three key:value pairs. The rest of the JSON after <span style="font-family: courier;">outfiles</span> describes two CSV files that will be mapped by the metadata description file: <span style="font-family: courier;">artworks.csv</span> and <span style="font-family: courier;">works_depicts.csv</span> . <span style="font-family: courier;">artworks.csv</span> contains data about statements involving 5 properties: <span style="font-family: courier;">P31</span> (<span style="font-family: courier;">instance of</span>), <span style="font-family: courier;">P217</span> (<span style="font-family: courier;">inventory number</span>), <span style="font-family: courier;">P1476</span> (<span style="font-family: courier;">title</span>), <span style="font-family: courier;">P2048</span> (<span style="font-family: courier;">height</span>), and <span style="font-family: courier;">P571</span> (<span style="font-family: courier;">inception</span>). <span style="font-family: courier;">depicts.csv</span> contains data about only one kind of statement: <span style="font-family: courier;">P180</span> (<span style="font-family: courier;">depicts</span>). You may be wondering why I chose to put the depicts statements in a separate CSV. That is because all of the other properties will typically have only one value per item, while a particular artwork may depict several things. So in the first CSV, there will only be one row for each item, while in the second CSV there is an indefinite number of rows per item. 
</p><div style="background-color: black; color: white; font-family: Menlo, Monaco, "Courier New", monospace; font-size: 12px; line-height: 18px; white-space: pre;"><div> {</div><div> <span style="color: #d4d4d4;">"pid"</span>: <span style="color: #ce9178;">"P571"</span>,</div><div> <span style="color: #d4d4d4;">"variable"</span>: <span style="color: #ce9178;">"inception"</span>,</div><div> <span style="color: #d4d4d4;">"value_type"</span>: <span style="color: #ce9178;">"date"</span>,</div><div> <span style="color: #d4d4d4;">"qual"</span>: [</div><div> {</div><div> <span style="color: #d4d4d4;">"pid"</span>: <span style="color: #ce9178;">"P1319"</span>,</div><div> <span style="color: #d4d4d4;">"variable"</span>: <span style="color: #ce9178;">"earliest_date"</span>,</div><div> <span style="color: #d4d4d4;">"value_type"</span>: <span style="color: #ce9178;">"date"</span></div><div> },</div><div> {</div><div> <span style="color: #d4d4d4;">"pid"</span>: <span style="color: #ce9178;">"P1326"</span>,</div><div> <span style="color: #d4d4d4;">"variable"</span>: <span style="color: #ce9178;">"latest_date"</span>,</div><div> <span style="color: #d4d4d4;">"value_type"</span>: <span style="color: #ce9178;">"date"</span></div><div> }</div><div> ],</div><div> <span style="color: #d4d4d4;">"ref"</span>: [</div><div> {</div><div> <span style="color: #d4d4d4;">"pid"</span>: <span style="color: #ce9178;">"P248"</span>,</div><div> <span style="color: #d4d4d4;">"variable"</span>: <span style="color: #ce9178;">"statedIn"</span>,</div><div> <span style="color: #d4d4d4;">"value_type"</span>: <span style="color: #ce9178;">"item"</span></div><div> }</div><div> ]</div><div> }</div><div></div></div><p>Each property has a <span style="font-family: courier;">pid</span> (property ID), a column header name (<span style="font-family: courier;">variable</span>) and a <span style="font-family: courier;">value_type</span>. The <span style="font-family: courier;">value_type</span> will determine the details of the number of data columns needed to represent that kind of value and the kind of data that will be stored in those columns. Each property can also have zero or more qualifier properties and zero or more reference properties associated with it. In the snippet above, <span style="font-family: courier;">inception</span> (<span style="font-family: courier;">P571</span>) statements will have two associated qualifier (<span style="font-family: courier;">earliest date</span> and <span style="font-family: courier;">latest date</span>) properties and one reference property (<span style="font-family: courier;">stated in</span>). The Wikibase model allows many references per statement, but this configuration file format restricts you to a single reference with as many properties as you want. </p><p>The structure of the qualifier and reference properties are the same as the statement properties (ID, variable, and value type) with the only restriction being that you must use properties that are appropriate for use in qualifiers or references. 
</p><div style="background-color: black; color: white; font-family: Menlo, Monaco, "Courier New", monospace; font-size: 12px; line-height: 18px; white-space: pre;"><div> {</div><div> <span style="color: #d4d4d4;">"manage_descriptions"</span>: <span style="color: #569cd6;">true</span>,</div><div> <span style="color: #d4d4d4;">"label_description_language_list"</span>: [</div><div> <span style="color: #ce9178;">"en"</span>,</div><div> <span style="color: #ce9178;">"es"</span></div><div> ],</div><div> <span style="color: #d4d4d4;">"output_file_name"</span>: <span style="color: #ce9178;">"artworks.csv"</span>,</div><div> <span style="color: #d4d4d4;">"prop_list"</span>: [</div><div></div></div><p>The situation with labels and descriptions is a little more complicated. If you have more than one data table, you probably only really want to manage the labels and descriptions in one of the tables. In this case, it would make the most sense to manage them in the <span style="font-family: courier;">artworks.csv</span> table, since it has a row for every item and the other table may have zero or more than one row per item. So the <span style="font-family: courier;">manage_description</span> value for the first table is set to <span style="font-family: courier;">true</span>. </p><div style="background-color: black; color: white; font-family: Menlo, Monaco, "Courier New", monospace; font-size: 12px; line-height: 18px; white-space: pre;"><div> {</div><div> <span style="color: #d4d4d4;">"manage_descriptions"</span>: <span style="color: #569cd6;">false</span>,</div><div> <span style="color: #d4d4d4;">"label_description_language_list"</span>: [</div><div> <span style="color: #ce9178;">"en"</span></div><div> ],</div><div> <span style="color: #d4d4d4;">"output_file_name"</span>: <span style="color: #ce9178;">"works_depicts.csv"</span>,</div><div> <span style="color: #d4d4d4;">"prop_list"</span>: [</div><div></div></div><p>In the second table (<span style="font-family: courier;">works_depicts.csv</span>) the <span style="font-family: courier;">manage_descriptions</span> value is set to <span style="font-family: courier;">false</span>. In that table, there will be a label column, but it will be set to be ignored during CSV processing and will only be to help humans understand what is in the rows. The <span style="font-family: courier;">label_description_language_list</span> value contains a list of the ISO language codes for all languages to be included. If <span style="font-family: courier;">manage_description</span> is set to <span style="font-family: courier;">true</span> for a table, there will be both a label and description in the table for every language. If it is set to <span style="font-family: courier;">false</span> for a table, there will only be a label for the default language. The default language of the suppressed output labels is set by the <span style="font-family: courier;">--lang</span> option (see below). Any languages supplied in the JSON (as in the example above) will be ignored.</p><h3 style="text-align: left;">Generating the metadata description file and CSV headers</h3><p>To generate the metadata description files and the CSV files, we need to download <a href="https://github.com/HeardLibrary/linked-data/blob/master/vanderbot/convert_json_to_metadata_schema.py" target="_blank">another script from GitHub</a> called <span style="font-family: courier;">convert_json_to_metadata_schema.py</span> . 
Download it into the same directory where you downloaded the <span style="font-family: courier;">config.json</span> file. At the command line, run the following command if you used the default name <span style="font-family: courier;">config.json</span>:</p><p><span style="font-family: courier;">python convert_json_to_metadata_schema.py</span></p><p>If you saved the input configuration file with a different name, or if you want a different name than <span style="font-family: courier;">csv-metadata.json</span> to be used for the output metadata description file, use the following command line options:</p><div><span style="font-family: courier; font-size: x-small;">--config (-C): input configuration file path. Default: config.json</span></div><div><span style="font-family: courier; font-size: x-small;">--meta (-M): output metadata description file path. Default: csv-metadata.json</span></div><div><span style="font-family: courier; font-size: x-small;">--lang (-L): language of labels when output is suppressed. Default: en</span></div><div><br /></div><div>After you run the script, it will have generated the <span style="font-family: courier;">csv-metadata.json</span> file and also variants of the two CSV files that were specified in the input <span style="font-family: courier;">config.json</span> file: <span style="font-family: courier;">artworks.csv</span> and <span style="font-family: courier;">works_depicts.csv</span>. To prevent accidentally overwriting any existing data, the letter "h" is prepended to the file names of the generated CSVs (<span style="font-family: courier;">hartworks.csv</span> and <span style="font-family: courier;">hworks_depicts.csv</span>). So before you use the files, you need to delete the initial "h" from the file names. </div><div><br /></div><div>The generated CSV files contain only the column headers with no data. But you can still open them with your spreadsheet software to look at them. 
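<br /><br /><div>If you would rather stay at the command line, a few lines of Python will list the generated headers as well. This is just a convenience aside (not part of the VanderBot workflow), run before renaming the files:</div><div><br /></div><div style="background-color: black; color: white; font-family: Menlo, Monaco, 'Courier New', monospace; font-size: 12px; line-height: 18px; white-space: pre;"># Sketch: print the column headers of the generated CSV files.
import csv

for filename in ['hartworks.csv', 'hworks_depicts.csv']:
    with open(filename, newline='', encoding='utf-8') as file_object:
        headers = next(csv.reader(file_object))
    print(filename)
    for header in headers:
        print('   ' + header)</div>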
</div><p> </p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEizEC_uLG-_fIkV9oNCu9JD8i3vDFWqydF6k0CwqeFwB-WrarKpDP8-wZBL4-g7S7mhGNk2Y4pXce7PZr5hNvBTe2WLrUJslbZzOucYfJ1qsLH_XzNQo8fbEJGnQ3ZVxRWF3W2gAW7g-9w/s450/json-snippet.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="450" data-original-width="411" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEizEC_uLG-_fIkV9oNCu9JD8i3vDFWqydF6k0CwqeFwB-WrarKpDP8-wZBL4-g7S7mhGNk2Y4pXce7PZr5hNvBTe2WLrUJslbZzOucYfJ1qsLH_XzNQo8fbEJGnQ3ZVxRWF3W2gAW7g-9w/w365-h400/json-snippet.png" width="365" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgMBd2ahNTc3_2vvG4wLK5L__blRkccIIlTR_5__0yln3l4nJ_6Vuz9v_W_wnjq9QOwhy6FXefpoZitQ-AOHl596EdY8pw5kVFf7Unf8a1Nzjd_ATBM-ZS02_5SnjpojkHETNqx_cZ1D7w/s1613/spreadsheet1.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="152" data-original-width="1613" height="60" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgMBd2ahNTc3_2vvG4wLK5L__blRkccIIlTR_5__0yln3l4nJ_6Vuz9v_W_wnjq9QOwhy6FXefpoZitQ-AOHl596EdY8pw5kVFf7Unf8a1Nzjd_ATBM-ZS02_5SnjpojkHETNqx_cZ1D7w/w640-h60/spreadsheet1.png" width="640" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgT-xgC11rRsrLCWvx1IjNMepA1PNny3oIRCLc9_LzXSRyWWD_Lfn7MIMLKLucTqdpwtedcUJQ8tjxASDc_cJLAN0pwhwvYVPeqbwrW_dwbvjmwmwUF7W-Sj6I-RDt9VwyfZH-YRTNYzyQ/s1435/spreadsheet2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="152" data-original-width="1435" height="68" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgT-xgC11rRsrLCWvx1IjNMepA1PNny3oIRCLc9_LzXSRyWWD_Lfn7MIMLKLucTqdpwtedcUJQ8tjxASDc_cJLAN0pwhwvYVPeqbwrW_dwbvjmwmwUF7W-Sj6I-RDt9VwyfZH-YRTNYzyQ/w640-h68/spreadsheet2.png" width="640" /></a></div><p>If you compare the columns in the created spreadsheet with the source JSON configuration file, you should see that the columns are in the order that they were designated in the JSON. The <span style="font-family: courier;">variable</span> values are joined to any parent properties by underscores (e.g. <span style="font-family: courier;">earliest_date</span> appended to <span style="font-family: courier;">inception</span> to form <span style="font-family: courier;">inception_earliest_date</span>). In cases where more than one column is required to describe a value node, the <span style="font-family: courier;">_nodeId</span>, <span style="font-family: courier;">_val</span>, etc. 
suffixes are added to the corresponding root column </p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSkIUdCYxGuDVcQKL-yB_7AdtwbZC9YQMVfJLxFsxLG1eolfOMP5S8K1ft_kn1drAfOdlYrx4KVHIi8TB79QbX5k6zt_IJUS_6BgswNf39gzp7b_MM_FtGA6_lL2dHMY_-Ud2aC0Y-orA/s424/labels-json.png" style="margin-left: 1em; margin-right: 1em;"><img alt="primary spreadsheet JSON" border="0" data-original-height="231" data-original-width="424" height="174" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSkIUdCYxGuDVcQKL-yB_7AdtwbZC9YQMVfJLxFsxLG1eolfOMP5S8K1ft_kn1drAfOdlYrx4KVHIi8TB79QbX5k6zt_IJUS_6BgswNf39gzp7b_MM_FtGA6_lL2dHMY_-Ud2aC0Y-orA/w320-h174/labels-json.png" width="320" /></a></div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibWbEsep3QUxMK6ArrDS2udjo3wTddVJAxYNNXwvk9c2okpM1cx9WDEy92PpmVifIaI6mQA7Xb2gvvFkLwlZtw_0wT-aH27Nv9bywJ7BvGyWqmDE_-Me1X11mU9R0IMffkWdi_DJ1W2w8/s637/labels-csv.png" style="margin-left: 1em; margin-right: 1em;"><img alt="primary spreadsheet headers" border="0" data-original-height="155" data-original-width="637" height="98" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibWbEsep3QUxMK6ArrDS2udjo3wTddVJAxYNNXwvk9c2okpM1cx9WDEy92PpmVifIaI6mQA7Xb2gvvFkLwlZtw_0wT-aH27Nv9bywJ7BvGyWqmDE_-Me1X11mU9R0IMffkWdi_DJ1W2w8/w400-h98/labels-csv.png" width="400" /></a></div><p></p><p>Since we chose <span style="font-family: courier;">true</span> as the value of <span style="font-family: courier;">manage_descriptions</span> for the <span style="font-family: courier;">artworks.csv</span> file, the generated spreadsheet includes both labels and descriptions for the two languages we designated. 
The script automatically prepends <span style="font-family: courier;">label_</span> and <span style="font-family: courier;">description_</span> to the language codes to generate the column headers.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi18DC9YisncFZox5m0q-hLT7yEUSIOASXwI58HXSVlPlIDOjqXWf-5TjwQwW1Y7c83sEMsu229Zt3aMo_LRUKQUj2E6QfUTXuv8ij-JcwQ-ecR8Kgyi2qLsPUkAkP4UJZ1O5E3bxfm8GQ/s511/labels2-json.png" style="margin-left: 1em; margin-right: 1em;"><img alt="secondary spreadsheet JSON definition" border="0" data-original-height="183" data-original-width="511" height="115" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi18DC9YisncFZox5m0q-hLT7yEUSIOASXwI58HXSVlPlIDOjqXWf-5TjwQwW1Y7c83sEMsu229Zt3aMo_LRUKQUj2E6QfUTXuv8ij-JcwQ-ecR8Kgyi2qLsPUkAkP4UJZ1O5E3bxfm8GQ/w320-h115/labels2-json.png" width="320" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiQu0AGisMx-LzrDsHSOKQC5ImJTsKGgatm7_RYa4EUwHbLnzoE7LdpUZ4w5oPIbIJ72JF3JiMxfiFfOIli5VADriwIB2R1sg4NwMQbkvTIOs1_Y3O_8NCqw0k7-lkb1PovyybyiaUlkKg/s481/labels2-csv.png" style="margin-left: 1em; margin-right: 1em;"><img alt="secondary spreadsheet headers" border="0" data-original-height="152" data-original-width="481" height="126" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiQu0AGisMx-LzrDsHSOKQC5ImJTsKGgatm7_RYa4EUwHbLnzoE7LdpUZ4w5oPIbIJ72JF3JiMxfiFfOIli5VADriwIB2R1sg4NwMQbkvTIOs1_Y3O_8NCqw0k7-lkb1PovyybyiaUlkKg/w400-h126/labels2-csv.png" width="400" /></a></div><p>For the second spreadsheet, <span style="font-family: courier;">works_depicts.csv</span>, the value of <span style="font-family: courier;">manage_descriptions</span> is <span style="font-family: courier;">false</span>, so only labels are generated. Since the <span style="font-family: courier;">label_en</span> column is only for local use to make the identity of the rows clearer, I only bothered to generate it as English. The value of the labels in this spreadsheet will be ignored by the API upload script.</p><h2 style="text-align: left;">Adding data to the CSV files</h2><div>Since we are still testing, we won't create new items yet in the real Wikidata. Instead, we will add statements to two of the sandbox Wikidata items. </div><div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh0bmVclizGhij02hqTO_MCDd-Itj6_Pg9PMF1toU4zxEnVtyrqFIwh7OmqG8lnzzeF7qa2L6PWHERV0erfa9MuYof0Im_jq3VJicV0hudd_cOtsA3ecTm7Qb-j3NQdBQJEGZaaa_KN19w/s1019/csv-with-data.png" style="margin-left: 1em; margin-right: 1em;"><img alt="spreadsheet with data" border="0" data-original-height="132" data-original-width="1019" height="82" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh0bmVclizGhij02hqTO_MCDd-Itj6_Pg9PMF1toU4zxEnVtyrqFIwh7OmqG8lnzzeF7qa2L6PWHERV0erfa9MuYof0Im_jq3VJicV0hudd_cOtsA3ecTm7Qb-j3NQdBQJEGZaaa_KN19w/w640-h82/csv-with-data.png" width="640" /></a></div></div><div><br /></div><div>In the <span style="font-family: courier;">qid</span> column of the <span style="font-family: courier;">artworks.csv</span> file, add <span style="font-family: courier;">Q13406268</span> and <span style="font-family: courier;">Q15397819</span> to the first two rows after the header row. 
For purposes of keeping the row identities clear, I added <span style="font-family: courier;">Wikidata Sandbox 2</span> and <span style="font-family: courier;">Wikidata Sandbox 3</span> as <span style="font-family: courier;">label_en</span> values for those rows, although since we will be using the default to suppress updating labels, these values will have no effect. I also chose to use <span style="font-family: courier;">Q3305213</span> (painting) and <span style="font-family: courier;">Q860861</span> (sculpture) as values of <span style="font-family: courier;">instance_of</span> since the CSV file is supposed to be about artworks. </div><div><br /></div><div>If you want to see what other values I used in my test, you can look at or download <a href="https://gist.github.com/baskaufs/cbd2334adcdf294c8aeb0e1d99d8d005" target="_blank">this gist</a>. You can use whatever values would amuse you as long as the types of the values match the types that are appropriate for properties specified in the configuration file. Leave all of the ID columns blank (those ending in <span style="font-family: courier;">_uuid</span>, <span style="font-family: courier;">_hash</span>, or <span style="font-family: courier;">_nodeId</span>), since they will be filled in by the API upload script. For the dates, you can use the abbreviated conventions discussed in the last post (in which case you MUST leave the <span style="font-family: courier;">_prec</span> column empty). If you want to use dates that don't conform to those patterns (precisions less than year, BCE dates, or dates between 1 and 999 CE), you will need to use the long form values and provide an appropriate <span style="font-family: courier;">_prec</span> value. See the <a href="http://vanderbi.lt/vanderbot" target="_blank">VanderBot landing page</a> for details.</div><div><br /></div><div>If you looked at the configuration JSON carefully, you may have noticed that there were two new value types that we didn't see in the last post. The <span style="font-family: courier;">title</span> property (<span style="font-family: courier;">P1476</span>) has the type <span style="font-family: courier;">monolingualtext</span>. Monolingual text values are required to have a language tag in addition to a provided string. Unfortunately, because of limitations of the W3C CSV2RDF Recommendation, the language tag has to be hard-coded in the metadata description file rather than being specified in the CSV. That's why the language is specified in the configuration JSON as the value of <span style="font-family: courier;">language</span> for that property rather than as a column in the CSV table. </div><div><br /></div><div>The other new value type is quantity. Like dates, quantities have value nodes that require two columns in the CSV table to be fully described. The <span style="font-family: courier;">_val</span> column contains a decimal number and the <span style="font-family: courier;">_unit</span> column should contain the Q ID for an item that is an appropriate measurement unit for the number (e.g. 
<span style="font-family: courier;">Q11573</span> for meter).</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjp_HkDMM08ah305fgPvXgI_6RxIwe-P4lSPZP3iZzLlywVX2Lgwrf3XzjRrXioR-Ii8onJePzKOTP5kcRNU4zHbhQv1gXpVcgxzP4_VX2HN9QsN53LZ9_e-__IiuDxJeqH0BhMpugTWWE/s706/depicts-table.png" style="margin-left: 1em; margin-right: 1em;"><img alt="depicts spreadsheet" border="0" data-original-height="127" data-original-width="706" height="73" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjp_HkDMM08ah305fgPvXgI_6RxIwe-P4lSPZP3iZzLlywVX2Lgwrf3XzjRrXioR-Ii8onJePzKOTP5kcRNU4zHbhQv1gXpVcgxzP4_VX2HN9QsN53LZ9_e-__IiuDxJeqH0BhMpugTWWE/w400-h73/depicts-table.png" width="400" /></a></div><br /><div>The second spreadsheet, <span style="font-family: courier;">works_depicts.csv</span>, describes only one kind of statement, <span style="font-family: courier;">depicts</span> (<span style="font-family: courier;">P180</span>). It is intended to have multiple rows with the same <span style="font-family: courier;">qid</span>, since a work can depict more than one thing. Since I described the Sandbox 2 item as a painting with title "Mickey Mouse house", I decided to say that it depicts Mickey and Minnie Mouse. You can set the depicts values to any item.</div><div><br /></div><h2 style="text-align: left;">Writing the data</h2><div>Before writing the data in the CSVs, open the pages for <a href="https://www.wikidata.org/wiki/Q13406268" target="_blank">Sandbox 2</a> and <a href="https://www.wikidata.org/wiki/Q15397819" target="_blank">Sandbox 3</a> so that you can see how they change when you write. Make sure that the two CSV files, the <span style="font-family: courier;">csv-metadata.json</span> file you generated from <span style="font-family: courier;">config.json</span>, and a copy of <span style="font-family: courier;">vanderbot.py</span> are together in a directory that can easily be accessed from your home directory. Make sure that you removed the "h" from the beginning of the CSV filenames as well.</div><div><br /></div>Open your console application (Terminal on Mac, Command Prompt in Windows), navigate to the directory where the files are, and run VanderBot. Unless you changed default file names and locations, you can just enter<div><br /></div><div><span style="font-family: courier;">python vanderbot.py</span></div><div><br /></div><div>(or use <span style="font-family: courier;">python3</span> if your installation requires that). 
If you want to save the API response in a log file, specify its name using the <span style="font-family: courier;">--log</span> (or <span style="font-family: courier;">-L</span>) option, like this:</div><div><br /></div><div><span style="font-family: courier;">python vanderbot.py -L log.txt</span></div><div><br /></div><div>When you run the script, you should see something like this (with logging to file):</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsRUdXjFVfQezhsMv5Ok47xcZbXd4dL0vkrzmaNqTxvZj6ogwY-kLAYCY3rA0RGuZS4t5cBvcaO8GotgttenkJVykJZGPsb4RripyxdHqbl0KfWybsdZzNiffc057uBl_-7ZjZzkme6pY/s730/run-screenshot.png" style="margin-left: 1em; margin-right: 1em;"><img alt="console output during run of VanderBot" border="0" data-original-height="730" data-original-width="636" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsRUdXjFVfQezhsMv5Ok47xcZbXd4dL0vkrzmaNqTxvZj6ogwY-kLAYCY3rA0RGuZS4t5cBvcaO8GotgttenkJVykJZGPsb4RripyxdHqbl0KfWybsdZzNiffc057uBl_-7ZjZzkme6pY/w558-h640/run-screenshot.png" width="558" /></a></div><br /><div>There are two episodes of writing to the API, one for each of the CSVs. If you refresh the web pages for the two items, you should see the changes that you made.</div><div><br /></div><div>Click on the View history link at the top of the Wikidata Sandbox 2 page. You will see the revision history for the page.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjvxUfuLH8QWVvsosRzh_LnSzlcd7D3cpbHpCSrfRhuX4mXp-MtrHyti-ejhSqkngj2sAEWsD_zwljW5u3xoysWdCTmCATgB0b3DYLUSVjYSoNFQKfklWxR20Rhlq6-EQJLuuKksGznVQc/s1057/revision-history.png" style="margin-left: 1em; margin-right: 1em;"><img alt="revision history screenshot" border="0" data-original-height="614" data-original-width="1057" height="372" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjvxUfuLH8QWVvsosRzh_LnSzlcd7D3cpbHpCSrfRhuX4mXp-MtrHyti-ejhSqkngj2sAEWsD_zwljW5u3xoysWdCTmCATgB0b3DYLUSVjYSoNFQKfklWxR20Rhlq6-EQJLuuKksGznVQc/w640-h372/revision-history.png" width="640" /></a></div><br /><div>Notice that on Sandbox 2, there were three edits listed. Each line in a spreadsheet resulted in one write to the API. The first larger one (4997 bytes) was an update consisting of all of the statements made in the first line of <span style="font-family: courier;">artworks.csv</span> . The two later and smaller ones were from the two single-statement <span style="font-family: courier;">depicts</span> writes in the <span style="font-family: courier;">works_depicts.csv</span> table.</div><div><br /></div><div>It is not a requirement to get rid of all of your edits to the sandbox, but to avoid causing the sandbox items to be hopelessly cluttered, you should probably delete your edits. If you made only a single change to the page, you can just click the <span style="font-family: courier;">undo</span> link after the edit. 
If you made several edits and your edits were the last ones, you can revert to the last version before your changes by clicking on the <span style="font-family: courier;">restore</span> link after the last edit that was made prior to yours.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEikUdEAraqW5wDk8kstB2PHUfmdlTgGQYQ78WUWaFyOTxkh6WmxlyslV5qsvyGFE8t7sxQuyMLArcRupElsm_I_q8V4pm40nWNpZRfnGo6tTs4pSmSprI7NsP62yCbvciRoE6LA89h1y4E/s825/restore.png" style="margin-left: 1em; margin-right: 1em;"><img alt="restore dialog" border="0" data-original-height="481" data-original-width="825" height="374" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEikUdEAraqW5wDk8kstB2PHUfmdlTgGQYQ78WUWaFyOTxkh6WmxlyslV5qsvyGFE8t7sxQuyMLArcRupElsm_I_q8V4pm40nWNpZRfnGo6tTs4pSmSprI7NsP62yCbvciRoE6LA89h1y4E/w640-h374/restore.png" width="640" /></a></div><br /><div>The restore dialog will show you all of the changes you made so that you can review them before committing to the restore. Give a summary, and click <span style="font-family: courier;">Publish changes</span>.</div><div><br /></div><div><br /></div><h3 style="text-align: left;">Changing labels</h3><div>VanderBot handles labels and descriptions differently from statements and references. </div><div><br /></div><div>Adding statements or references is controlled by the presence or absence of an identifier corresponding to the column(s) representing the statement or reference in the spreadsheet. The statements or references are only written if their corresponding identifier cell is empty. If you examine the CSVs after their data have been written to the API, you will see that identifiers have been added for all of the columns that contain statement values. That means that if you run the script again, nothing will happen because VanderBot will ignore the values -- they all have assigned identifiers. </div><div><br /></div><div>The behavior of labels and descriptions is different. When a <i><b>new</b></i> item is created, any labels or descriptions that are present will be added to the item. However, VanderBot will NOT make any edits to labels or descriptions of <i><b>existing</b></i> items unless the <span style="font-family: courier;">--update</span> (or <span style="font-family: courier;">-U</span>) option is set to <span style="font-family: courier;">allow</span> when the script is run. If updating is allowed, the existing labels and descriptions will be changed to whatever is present in the spreadsheet for that item. (The exception to this is when a label or description cell is empty. Empty cells will not result in deleting the label or description.) 
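<div><br /></div><div>To make these rules concrete, here is a simplified sketch in Python of the decision logic just described. This is not VanderBot's actual code; the row structure, key names, and example values are invented for illustration.</div><div><br /></div><pre style="font-family: courier;">
# Simplified sketch (NOT VanderBot's actual code) of the decision rules
# described above. The row structure, key names, and values are invented.

def plan_edits(row, existing_item=None, allow_label_updates=False):
    """Decide which parts of a CSV row would be sent to the API."""
    edits = []

    # Statements are written only when their identifier (UUID) cell is empty.
    for stmt in row["statements"]:
        if stmt["uuid"] == "":
            edits.append(("write statement", stmt["property"], stmt["value"]))

    # Labels behave differently: they are always included for a new item, but
    # for an existing item they are changed only when updating is explicitly
    # allowed. Empty cells never delete anything.
    for lang, label in row["labels"].items():
        if label == "":
            continue
        if existing_item is None:
            edits.append(("set label on new item", lang, label))
        elif allow_label_updates and label != existing_item["labels"].get(lang):
            edits.append(("update label", lang, label))

    return edits

# Example: one statement not yet written, plus a new Spanish label.
row = {"statements": [{"property": "P180", "value": "Q11111", "uuid": ""}],
       "labels": {"en": "Mickey Mouse house", "es": "Casa de Mickey Mouse"}}
item = {"labels": {"en": "Mickey Mouse house"}}
print(plan_edits(row, existing_item=item, allow_label_updates=True))
</pre><div><br /></div><div>Run as a script, this sketch reports one unwritten statement and one allowed label update, which mirrors the behavior described above.</div>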
</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhY60FjGqxKSj7lQfpgO6fS9vo45MJjJT-Y-aKo6RSKQx9w59fKMkQ5NBvlSpLd31U_IHirBlPGdVGAeJzf6WrBfIDvVzW9mneegAOZQn8zYQXp374xvKWhtea6hu6OzI1yz7yxAS717WM/s944/sandbox2-descriptions.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Sandbox 2 labels and descriptions" border="0" data-original-height="367" data-original-width="944" height="248" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhY60FjGqxKSj7lQfpgO6fS9vo45MJjJT-Y-aKo6RSKQx9w59fKMkQ5NBvlSpLd31U_IHirBlPGdVGAeJzf6WrBfIDvVzW9mneegAOZQn8zYQXp374xvKWhtea6hu6OzI1yz7yxAS717WM/w640-h248/sandbox2-descriptions.png" width="640" /></a></div><br /><div>From the screenshot above, you can see that at the start of this experiment, Sandbox 2 had no descriptions in either Spanish or English. Also, the Spanish label isn't actually in Spanish. I'm going to use VanderBot to change that.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjmwD4wA8G6E966xcfEXXyyoiCa-D-XULW2g_fn6EgZ7Qg5BACNMoA1YraRFVxGVs-Ckmzs_eOig4rYcbDsIMIkPwOGbuHR13oI_DZD8mK2gQoiT5Izkz_2fP3hTnhio8o3D5aekjAPMi4/s1185/csv-label-changes.png" style="margin-left: 1em; margin-right: 1em;"><img alt="CSV showing label and description changes" border="0" data-original-height="148" data-original-width="1185" height="80" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjmwD4wA8G6E966xcfEXXyyoiCa-D-XULW2g_fn6EgZ7Qg5BACNMoA1YraRFVxGVs-Ckmzs_eOig4rYcbDsIMIkPwOGbuHR13oI_DZD8mK2gQoiT5Izkz_2fP3hTnhio8o3D5aekjAPMi4/w640-h80/csv-label-changes.png" width="640" /></a></div><div><br /></div>To make the changes, I started with the <span style="font-family: courier;">artworks.csv</span> spreadsheet after my last edits. I deleted the line for Sandbox 3 since I didn't want to mess with it. I first made sure that my English label was exactly the same as the existing label so that it won't be changed. Then I added a Spanish label, and English and Spanish descriptions. I left the rest of the row the way it was since none of those statements would be written since they all had IDs. <div><br /></div><div>The following command will write the labels and log to a file:</div><div><br /></div><div><span style="font-family: courier;">python vanderbot.py -L log.txt --update allow</span><br /><div><br /></div><div>After the script finishes, checking the log file shows the changes made.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj6J3Z69bBrn_dMn3-U8nlUtYHq1hZ69Q-ni_loHPyFhq_5OaT7f9u522qyJhNxYLvXfXzG0R1S2rTo5Lr0Odq_q_-4B1umIizYYaW9qET4zqyt6GKUcAPtRYo90XmAswpEBKyvAUhwIPA/s1466/log-screenshot.png" style="margin-left: 1em; margin-right: 1em;"><img alt="log file" border="0" data-original-height="359" data-original-width="1466" height="156" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj6J3Z69bBrn_dMn3-U8nlUtYHq1hZ69Q-ni_loHPyFhq_5OaT7f9u522qyJhNxYLvXfXzG0R1S2rTo5Lr0Odq_q_-4B1umIizYYaW9qET4zqyt6GKUcAPtRYo90XmAswpEBKyvAUhwIPA/w640-h156/log-screenshot.png" width="640" /></a></div><br /><div>Since the English label was identical, there were no changes to it. The log also shows that there were no changes in the other CSV. 
</div><div><br /></div><div>Checking the web page shows the changes:</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhpIxhyUV4OeqWQfPUihpZk3LeaP2TNDYXR3Njrm3JaKNLJTQTQTjxwPhbJWq4bFJ4U9Arr7nP3Sb1sJe5z4mamCBnVuMqdmFZB_i65m0czpl_ARMkL3WhhOxWcKlbr_WiO0msnglgHdXY/s938/label-changes-online.png" style="margin-left: 1em; margin-right: 1em;"><img alt="sandbox item 2 after label and description changes" border="0" data-original-height="410" data-original-width="938" height="280" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhpIxhyUV4OeqWQfPUihpZk3LeaP2TNDYXR3Njrm3JaKNLJTQTQTjxwPhbJWq4bFJ4U9Arr7nP3Sb1sJe5z4mamCBnVuMqdmFZB_i65m0czpl_ARMkL3WhhOxWcKlbr_WiO0msnglgHdXY/w640-h280/label-changes-online.png" width="640" /></a></div><br /><div>Since I'm done with the test, I'm going to delete the descriptions, but the Spanish label isn't any worse than what was there before, so I'll leave it. Checking the history, I can see that all of the labels and descriptions were changed in a single API write, so I'll have to delete them manually if I want to leave the Spanish label -- I can't undo the description changes without also undoing the new Spanish label.</div><div><br /></div><div>The take-home message from this section is that you need to make sure that the existing labels and descriptions in the CSV match what is in Wikidata when label/description updates are allowed (unless you actually <i><b>want</b></i> to change them). This is particularly an issue if your data table is stale because you are coming back to work on it at a much later time after you initially wrote the data. If in the intervening time other users have improved the quality of the data by changing labels and descriptions, you would essentially be reverting their changes back to a worse state. That would be really irritating to someone who put in work to make the improvements. I will talk about strategies to avoid this problem in a later post.</div><div><br /></div><h2 style="text-align: left;">Creating new items</h2><div>At this point, you are hopefully comfortable enough with VanderBot to create or edit real items in the real Wikidata. For now, let's stick with creating new items, since editing existing items raises the issue of avoiding the creation of duplicate statements. We will address that problem in a future post. </div><div><br /></div><div>There are several issues that you should consider before creating new items for testing. One is that you really should only create items that meet some minimal standard of notability. The actual <a href="https://www.wikidata.org/wiki/Wikidata:Notability" target="_blank">notability requirements for Wikidata</a> are so minimal that you could theoretically create items about almost anything. But as a practical matter, we really shouldn't just create junk items that don't have some relatively useful purpose. One type of item that seems to be relatively "safe" is university faculty, since they generally have the potential to be authors of academic works that could serve as references for Wikipedia articles. When I'm testing VanderBot, I often add faculty from my alma mater, Bluffton University, since none of them were in Wikidata until I started adding them. </div><div><br /></div><div>The second issue is that you should create items that have enough information that the item can actually be unambiguously identified. 
There are several really irritating categories of items that have been added to Wikidata without sufficient information. There are thousands of "Ming Dynasty person" and "Peerage person" items that have little but a name attached to them. They are pointless and just make it harder to find other useful items with similar labels. So, for example, if you add faculty to Wikidata, at a minimum you should include their university affiliation and field of work. </div><div><br /></div><div>The third issue is that you should make sure that you are not creating a duplicate item. In a future post I will talk about strategies for computer-assisted disambiguation. But for now, just typing the label into the Wikidata search box is the easiest way to avoid duplication. Try typing it with and without middle initials and also with and without periods after the initials to make sure you have tried every permutation.</div><div><br /></div><h3 style="text-align: left;">Configuring the properties</h3><div>If you want to try my strategy of practicing by creating faculty records, you can start with this <a href="https://gist.github.com/baskaufs/6a37c39f70a228d38d5ebda28651ffca" target="_blank">template configuration file</a>. It contains the obligatory <span style="font-family: courier;">instance of</span> (P31) that should be provided for every item, and <span style="font-family: courier;">sex or gender</span> (P21), which despite its issues is probably the most widespread property assigned to humans. I did not provide reference fields for those two properties since they are commonly given without references. The other two properties are probably the minimal properties that should be supplied for faculty: <span style="font-family: courier;">employer</span> (P108) and <span style="font-family: courier;">field of work</span> (P101). One reason I chose these two properties is that they can both be determined easily from a single source, the <a href="https://www.bluffton.edu/catalog/officers/faculty.aspx" target="_blank">Bluffton University faculty web listing</a>. The statements for these two properties should definitely have references.</div><div><br /></div><div>I've done some querying to try to discover what the most commonly used properties are for references. A key reference property is <span style="font-family: courier;">retrieved</span> (<span style="font-family: courier;">P813</span>). All references should probably have this property. The other property is usually an indication of the source of the reference. Commonly used source properties are: <span style="font-family: courier;">reference URL</span> (<span style="font-family: courier;">P854</span>, used for web pages), <span style="font-family: courier;">stated in</span> (<span style="font-family: courier;">P248</span>, used when the source is a described item in Wikidata with a Q ID value), and <span style="font-family: courier;">Wikimedia import URL</span> (<span style="font-family: courier;">P4656</span>, used when the data have been retrieved from another Wikimedia project with a URL value). Unless you are working specifically on a project to move data from another project like Wikipedia to Wikidata, the first two are the ones you are most likely to use. 
Since all of my data are coming from a web page, I'm using <span style="font-family: courier;">P854</span> and <span style="font-family: courier;">P813</span> as the reference properties for both of the statement types that have references.</div><div><br /></div><div>Use the <span style="font-family: courier;">convert_json_to_metadata_schema.py</span> Python script and the <span style="font-family: courier;">config.json</span> file you downloaded to generate the metadata description file <span style="font-family: courier;">csv-metadata.json</span> and the <span style="font-family: courier;">faculty.csv</span> CSV file (with header row only). </div><div><br /></div><h3 style="text-align: left;">Adding the data</h3><div>I chose two of the faculty from the web page list and pasted their names into the Wikidata search box to make sure they weren't already existing items. I then added their names to the <span style="font-family: courier;">label_en</span> column of <span style="font-family: courier;">faculty.csv</span> and described them in the <span style="font-family: courier;">description_en</span> column. See <a href="https://gist.github.com/baskaufs/91c7617bac62e1cd79bc5cd20ad6837c" target="_blank">this gist</a> for the examples. <span style="font-family: courier;">Q5</span> is the value for <span style="font-family: courier;">instance_of</span> for all humans. <span style="font-family: courier;">Sex or gender</span> options are given on the <a href="https://www.wikidata.org/wiki/Property:P21" target="_blank">P21 property page</a>. All of the faculty work at Bluffton University (<span style="font-family: courier;">Q886141</span>). The trickiest value was their <span style="font-family: courier;">field of work</span>, which I had to determine using the Wikidata search box. </div><div><br /></div><div>I was then able to copy and paste the URL for the faculty web listing, <span style="font-family: courier;">https://www.bluffton.edu/catalog/officers/faculty.aspx</span>, into the reference URL columns and today's date, <span style="font-family: courier;">2021-03-07</span>, into all of the <span style="font-family: courier;">_retrieved_val</span> columns. I then saved the file. </div><div><br /></div><h3 style="text-align: left;">Writing the data to the API</h3><div><b><i>Note: </i></b>What would happen if you tried to use my example CSV file to write to Wikidata without changing its contents? Before the VanderBot script tries to write a new record to the real Wikidata, it checks the Wikidata Query Service to see if there are already any items with exactly the same labels and descriptions in any language in that row. If it finds a match, it logs an error and goes on to the next row. So since those items were already created by me, VanderBot will do nothing as long as no one has changed either the label or description for the two example items. If a label or description for either of them has been changed since I created the items, then the API will create duplicate items that will need to be merged later. 
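<div><br /></div><div>As an aside, here is a minimal sketch (not VanderBot's actual code) of the kind of Query Service check described above: it asks whether any item already has a particular English label and description. The name, description, and User-Agent string are made-up examples.</div><div><br /></div><pre style="font-family: courier;">
# Minimal sketch (not VanderBot's actual code) of a duplicate check against
# the Wikidata Query Service. The label, description, and User-Agent string
# below are made-up examples.
import requests

label = "Jane Doe"                                      # hypothetical name
description = "faculty member at Bluffton University"   # hypothetical description

query = '''SELECT ?item WHERE {
  ?item rdfs:label "''' + label + '''"@en ;
        schema:description "''' + description + '''"@en .
}'''

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "DuplicateCheckExample/0.1 (mailto:you@example.com)"}
)
matches = response.json()["results"]["bindings"]
if matches:
    print("Possible duplicates:", [m["item"]["value"] for m in matches])
else:
    print("No item found with that exact label and description.")
</pre><div><br /></div><div>An exact string match like this is simple but brittle, which is one more reason to check the item pages themselves, as suggested next.</div><div><br /></div>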
So don't try running the script with my unmodified example files unless you first check that the labels and descriptions are still exactly the same on the items' Wikidata item pages.</div><div><br /></div><div>I ran the <span style="font-family: courier;">vanderbot.py</span> script with logging to a text file.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgCeslbsW3djJVTj5JflQ61aK5fkXMhKhaTDzxXncBa8xENNXhj268beIfTfdRKnX7eMQWcF8v5re6LEwWC-tDB-ZrMFBOb5UEJkEmk7FNMx2jGhvFDk896vuG_EYA_sYlUh2KyHUcXYNM/s516/console-output-for-faculty.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Console output for faculty item upload" border="0" data-original-height="423" data-original-width="516" height="524" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgCeslbsW3djJVTj5JflQ61aK5fkXMhKhaTDzxXncBa8xENNXhj268beIfTfdRKnX7eMQWcF8v5re6LEwWC-tDB-ZrMFBOb5UEJkEmk7FNMx2jGhvFDk896vuG_EYA_sYlUh2KyHUcXYNM/w640-h524/console-output-for-faculty.png" width="640" /></a></div><br /><div>When writing the statements, the two rows were identified as new records. When the rows were later checked for any new unwritten references, the Q IDs were already known since they had been reported in the API response. </div><div><br /></div><div>To see how the <span style="font-family: courier;">faculty.csv</span> file looked after its data were written to the API, see <a href="https://gist.github.com/baskaufs/238066d209712c95c18d66b5c9bc4a88" target="_blank">this gist</a>.</div><div><br /></div><div>I could check for the new item pages in Wikidata by either searching for the faculty names or by directly using the two new Q IDs. </div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj9164uHxFn_hFNLPn_mG6fIS9GyyrP-lfX0FO7YlgIh3s4YcS4NYXyEDWYPxTVX00SZEWOmBw5WwGz-WFrcTs-mqKtZX7KeubV3cc1Pph2zU7qdtnqcB3MU7So08qy4C6JonuzVI4yNes/s1182/new-faculty-page.png" style="margin-left: 1em; margin-right: 1em;"><img alt="new faculty Wikidata page" border="0" data-original-height="1182" data-original-width="925" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj9164uHxFn_hFNLPn_mG6fIS9GyyrP-lfX0FO7YlgIh3s4YcS4NYXyEDWYPxTVX00SZEWOmBw5WwGz-WFrcTs-mqKtZX7KeubV3cc1Pph2zU7qdtnqcB3MU7So08qy4C6JonuzVI4yNes/w500-h640/new-faculty-page.png" width="500" /></a></div><br /><div>The new page contains all of the data from the CSV table in the appropriate place!</div><div><br /></div><div>Although for only two items this work flow probably took longer than just creating the records by hand, it doesn't take many more items to make this process much faster, particularly if references are added (and they should be!). Adding references requires many button clicks on the graphical web interface, but because the same reference can be added to many rows of the spreadsheet with a single copy and paste, it is very efficient to add references using the VanderBot script. </div><div><br /></div><h2 style="text-align: left;">What's next?</h2><div>In the <a href="http://baskauf.blogspot.com/2021/03/writing-your-own-data-to-wikidata-using_11.html" target="_blank">next post</a>, I'll talk about how you can determine what properties are most commonly used for various types of items. 
This is important information when you are planning your own projects that involve adding a lot of data to Wikidata.</div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div><br /><p></p></div></div>Steve Baskaufhttp://www.blogger.com/profile/01896499749604153763noreply@blogger.com0tag:blogger.com,1999:blog-5299754536670281996.post-70789380472323108102021-03-01T10:08:00.009-08:002021-06-05T20:36:59.295-07:00Writing your own data to Wikidata using spreadsheets: Part 1 - test.wikidata.org<p><b>Warning: </b>this blog post involves extreme hand-holding. If that irritates you and you want to try to figure out how to use VanderBot on your own without hand-holding, you can go straight to the <a href="http://vanderbi.lt/vanderbot" target="_blank">VanderBot landing page</a> and look at the very abbreviated instructions there. However, make sure that you understand your responsibilities as a Wikidata user. If they are unclear to you, read the "Responsibility and good citizenship" section below.</p><p>On the other hand, if you love extreme hand-holding, there is a <a href="https://heardlibrary.github.io/digital-scholarship/script/wikidata/vanderbot/" target="_blank">series of videos</a> that will essentially walk you through the steps in this post.<br /></p><p> It has been almost a year since I last wrote about my efforts to write to Wikidata using Python scripts. At that time, I was using a bespoke set of scripts for a very specific purpose: to <a href="https://github.com/HeardLibrary/linked-data/blob/master/vanderbot/researcher-project.md" target="_blank">create or upgrade items in Wikidata about researchers and scholars at Vanderbilt University</a>. I was feeling pretty smug that I actually got the scripts to work, but at that point the scripts were pretty idiosyncratic. They were limited to a particular type of item (people), supported a restricted subset of property types, and used a particular spreadsheet mapping schema that wasn't easily modified. </p><p>Since that time, I have been working to adapt those scripts to be more broadly usable and have been testing them on several other projects: <a href="https://www.wikidata.org/wiki/Wikidata:WikiProject_Vanderbilt_Fine_Arts_Gallery" target="_blank">WikiProject Vanderbilt Fine Arts Gallery</a> and <a href="https://www.wikidata.org/wiki/Wikidata:WikiProject_Art_in_the_Christian_Tradition_(ACT)" target="_blank">WikiProject Art in the Christian Tradition (ACT)</a>, and several smaller ones. The scripts and my ability to explain how to use them have now evolved to the point where I feel like they could be used by others. The goal of this series is to make it possible for you to try them out in a do-it-yourself manner. </p><h3 style="text-align: left;">Background</h3><p>This series of posts will not dwell on the conceptual and technical details except where necessary for you to make the scripts work. 
For those interested in more details, I refer you to previous things I've written:</p><p></p><ul style="text-align: left;"><li><a href="http://baskauf.blogspot.com/2019/06/putting-data-into-wikidata-using.html" target="_blank">blog post dealing with the minutiae of Wikibase, the Wikimedia API, and authentication</a> (June 2019)</li><li><a href="http://baskauf.blogspot.com/2019/05/getting-data-out-of-wikidata-using.html" target="_blank">blog post describing how to retrieve data from Wikidata from the Query Service using HTTP and SPARQL</a> (May 2019)</li><li><a href="http://baskauf.blogspot.com/2020/02/vanderbot-python-script-for-writing-to.html" target="_blank">blog post giving a general overview of the paradigm of writing to Wikidata using spreadsheets</a> (February 2020)</li><li><a href="http://baskauf.blogspot.com/2020/02/vanderbot-part-2-wikibase-data-model.html" target="_blank">blog post with an overview of the Wikibase model and associated identifiers</a> (February 2020)</li><li><a href="https://heardlibrary.github.io/digital-scholarship/lod/wikibase/" target="_blank">web page with somewhat overlapping overview of the Wikibase model but details about property labels</a> (2019, revised 2020)</li><li><a href="http://baskauf.blogspot.com/2020/02/vanderbot-part-3-writing-data-from-csv.html" target="_blank">blog post with a very brief overview of using the W3C CSV2RDF Recommendation to map spreadsheets to the Wikibase model and a discussion of issues related to timing of interactions with the API</a> (February 2020)</li><li><a href="http://www.semantic-web-journal.net/content/using-w3c-generating-rdf-tabular-data-web-recommendation-manage-small-wikidata-datasets" target="_blank">submitted manuscript with very technical description of using the W3C CSV2RDF Recommendation to map spreadsheets to the Wikibase model</a> (submitted to Semantic Web Journal in December 2020, revised version submitted June 2021)</li><li><a href="http://baskauf.blogspot.com/2020/02/vanderbot-part-4-preparing-data-to-send.html" target="_blank">blog post with overview of the workflow for the Vanderbilt scholar and researcher Wikidata project</a> (February 2020)</li><li>video of presentation at the 2020 LD4 Conference on Linked Data in Libraries: <a href="https://youtu.be/xjQ8rJufeOU" target="_blank">VanderBot: Using a Python script to create and update researcher items in Wikidata</a> (July 2020)</li><li>video of presentation to the Program for Cooperative Cataloging Wikidata Pilot group: <a href="https://drive.google.com/file/d/1aB2XuQ_gqdB99tKcxEMoU7j75-x-i6RP/view?usp=sharing" target="_blank">VanderBot: A spreadsheet-based system for creating and updating items in Wikidata</a> (February 2021)</li></ul><div>It is not necessary to refer to any of this material in order to try out the system. But those interested in the technical details may find the links helpful.</div><div><br /></div><h2 style="text-align: left;">Do I want to try this?</h2><div>Before going any further, you should assess whether it is worth your time trying this out. </div><div><br /></div><h3 style="text-align: left;">Requirements:</h3><div><ul style="text-align: left;"><li>You need to know how to use the command line to navigate around directories and run a program. 
See <a href="https://heardlibrary.github.io/digital-scholarship/computer/command-unix/" target="_blank">this page for Mac</a> or <a href="https://heardlibrary.github.io/digital-scholarship/computer/command-windows/" target="_blank">this page for Windows</a> if you don't know how to open a console and issue basic commands. In particular, read the section on "Running a program using the command line".</li><li>You need to have Python installed on your computer so that you can run it from the command line. See <a href="https://heardlibrary.github.io/digital-scholarship/script/python/install/" target="_blank">this page</a> for installation instructions. You do NOT need to know how to program in Python. I believe that the only module used in the script that is not part of the standard library is <span style="font-family: courier;">requests</span>, so you may need to install that if you haven't already. </li><li>You need to have an application to open, edit, and save CSV files. The recommended application is <a href="https://www.libreoffice.org/" target="_blank">LibreOffice Calc</a>. Other alternatives are OpenOffice Calc and Excel, but there are situations where you can run into problems with either of them. For information on CSV spreadsheets and how to save them in Excel, see <a href="https://heardlibrary.github.io/digital-scholarship/script/codegraf/018/#csv-spreadsheets-4m38s" target="_blank">this video</a>. For a deeper dive and description of the problems with Excel and OpenOffice Calc, see the first video in <a href="https://heardlibrary.github.io/digital-scholarship/script/codegraf/022/" target="_blank">this lesson</a> and the screenshots after the second video. </li><li>You must have a Wikimedia user account. The same user account is used across Wikimedia platforms, including Wikipedia, Wikidata, and Commons, so if you have an account in any of those places you can use it here.</li><li>You need to be familiar with the Wikidata graphical editing interface. I assume that every reader has already done enough editing to understand the important features of the Wikibase model (items, properties, statements, qualifiers, and references) and how they are related to each other. They will not be explained in this post, so if you don't already have experience exploring these features using the graphical interface, you are probably not adequately equipped to continue with this exercise.</li></ul></div><h3 style="text-align: left;">Other alternatives you should consider</h3><div>There are a number of good alternatives to using the VanderBot scripts to write to Wikidata. They are:</div><div><ol style="text-align: left;"><li>Use the graphical interface at <a href="https://www.wikidata.org/">https://www.wikidata.org/</a> to edit items manually. Advantage: very easy to use and robust. Disadvantage: slow and labor-intensive. </li><li>Use QuickStatements (<a href="https://www.wikidata.org/wiki/Help:QuickStatements">https://www.wikidata.org/wiki/Help:QuickStatements</a>). Advantage: very easy to use and robust, particularly when used as an integrated part of other tools like <a href="https://scholia.toolforge.org/" target="_blank">Scholia</a>'s "<a href="https://www.wikidata.org/wiki/Wikidata:Scholia#Missing_pages" target="_blank">missing</a>" pages. Disadvantage: there is a learning curve for constructing the input files from scratch. 
Users who aren't familiar with how CSV files work may find it confusing.</li><li>Use the Wikidata plugin with OpenRefine (<a href="https://www.wikidata.org/wiki/Wikidata:Tools/OpenRefine">https://www.wikidata.org/wiki/Wikidata:Tools/OpenRefine</a>). Advantage: powerful and full-featured; I believe that some scripting is possible to the extent that scripting is generally possible in OpenRefine using GREL. Disadvantage: requires skill with OpenRefine plus an additional learning curve for figuring out how to make the Wikidata plugin work; I am not sure whether it is possible to integrate OpenRefine with command line-based workflows involving other applications.</li><li>Use PyWikiBot (<a href="https://pypi.org/project/pywikibot/">https://pypi.org/project/pywikibot/</a>) or WikidataIntegrator (<a href="https://github.com/SuLab/WikidataIntegrator">https://github.com/SuLab/WikidataIntegrator</a>). Many (most?) Wikidata bots are built using one of these two Python frameworks. Advantage: powerful, full-featured, robust. Disadvantage: you need to be a relatively experienced Python coder who understands object-oriented programming with Python to use these libraries. There are a number of bot-building tutorials for PyWikiBot online. However, when starting out, I found the proliferation of materials on the subject confusing, and for both of these platforms, when I couldn't get things to work, the libraries were so complex that I couldn't figure out what was going on. Professional developers would probably not have that problem, but as someone who is self-taught, I was confused.</li></ol><h3 style="text-align: left;">Factors that might make using VanderBot right for you</h3></div><div>If any of the following situations apply to you, VanderBot might be useful for you:</div><div><ol style="text-align: left;"><li>Your data are already in spreadsheets or can be exported as spreadsheets and you would like to keep the data in spreadsheets for future reference, ingest, or editing using off-the-shelf applications like LibreOffice or Excel.</li><li>You want to keep humans in the Wikidata data entry loop for quality assurance, but want to increase the speed at which edits can be made.</li><li>Over time, you are interested in keeping versioned snapshots of the data that you have written in a format that is suitable for archival preservation (CSV). </li><li>You are interested in comparing what is currently in Wikidata with what you put into Wikidata to discover beneficial information added by the community or to detect vandalism by bad actors.</li><li>You want to develop a workflow based on command-line tools that can be scheduled and monitored by humans.</li></ol><div>The last two of these features are not yet fully developed, but I'm trying to design the scripts I'm writing to make them possible in the future. </div></div><div><br /></div><div>If one or more of these factors applies to you and the other existing tools don't seem better suited for your purposes, then let's get started.</div><div><br /></div><h2 style="text-align: left;">Responsibility and good citizenship</h2><div>This "lesson" involves using the VanderBot API uploading script to write data to the test Wikidata API (application programming interface). In order to do that, you will need to create a bot password, but not a separate bot account. 
So let's clarify exactly what that means.</div><div><br /></div><h3 style="text-align: left;">User account and bot password</h3><div>When you create a bot password, you should be logged in under your umbrella Wikimedia account. That account applies across the entire Wikimedia universe: Wikipedia, Wikidata, Commons, and other Wikimedia projects. The bot password you create allows you to automate your interactions by using any Wikimedia API, but the edits that you make will be logged to your user account. That means that you bear the same responsibility for the edits that you make using the bot password as you would if you made them using the graphical interface or QuickStatements. Edits that you make using the bot password will show up in the page history just as if you had made them manually. If you make a mess using the bot password, you are responsible for cleaning it up just as you would be if you made errors using any other editing method. The whole point of scripting is to allow you to do things faster and easier, but the down side of that is that you can also make mistakes faster and easier as well. </div><div><br /></div><div>Because of the potential for disaster, we will start by using Wikidata's test instance: <a href="https://test.wikidata.org/">https://test.wikidata.org/</a> . It behaves exactly like the "real" Wikidata, except that the items and properties there do not necessarily correspond to anything real. If you make a mess in test.wikidata.org, you do NOT have to clean it up -- that's the whole point of it. So it is a place we can experiment without risk and once we feel comfortable, we can easily move to using the "real" Wikidata.</div><div><br /></div><h3 style="text-align: left;">Distinction between a User-Agent and user account</h3><div>In the Wikimedia world, typically when one creates an autonomous bot (one that works without human intervention), a separate bot user account is created. That account is used with a particular application (script, program) that carries out the bot's defined task. However, VanderBot is not an autonomous bot and has no particular defined task. It is a general-purpose script that can be used by any account to make human-mediated edits. So we need to draw a distinction between the application (technical term: User-Agent) and the user account. There actually is a user account called VanderBot (<a href="https://www.wikidata.org/wiki/User:VanderBot">https://www.wikidata.org/wiki/User:VanderBot</a>). It is operated by me and it shows up as the user who made the edits when I use it with the API-writing script. But you can't use it because you don't have the account credentials -- edits that you make will be made under your own user account. On the other hand, regardless of the user account responsible for the edits, the VanderBot Python script will identify itself to the API as the software that is mediating the interaction between you and the API. Software that manages communications between a user and a server is called a <i>User-Agent</i>.</div><div><br /></div><div>You can think of this situation as similar to the difference between your web browser and you. Your web browser is not responsible for the actions that you take with it. If you use Firefox to write the world's best Wikipedia article, the Mozilla Foundation that created Firefox doesn't get credit for that. If you use Chrome to buy drugs or organize an assassination, Google, which created Chrome, does not take responsibility for that. 
On the other hand, if your browser has a bug that causes it to repeatedly hit a website and create a denial of service problem for a web server, the website may use the User-Agent identification for the browser to either block the browser or to contact the browser's developer to ask them to fix the bug. </div><div><br /></div><div>VanderBot, the User-Agent, has features that prevent it from doing "bad" things to the API, like making requests too fast or not backing off when the server says it's too busy. As the programmer, I'm responsible for those features. I am not responsible if you write bad statements, create duplicate items, or overwrite correct labels and descriptions with stupid ones. Those mistakes will be credited to your user account. On the other hand, if you significantly modify the VanderBot API-writing script (which you are allowed to do under its GNU General Public License v3.0), then you should change the value of its user_agent_header variable with your own URL and email address, particularly if you mess with its "good citizen" features and settings. </div><div><br /></div><h3 style="text-align: left;">Do you need a bot user account and flag?</h3><div>Wikidata has a bot policy, which you can read about <a href="https://www.wikidata.org/wiki/Wikidata:Bots" target="_blank">here</a>. However, that policy defines bots as "tools used to make edits without the necessity of human decision-making". By that definition, VanderBot is not technically a bot since its edits are under human supervision (it's not autonomous). That's good, because it means that you can use it without going through any bot approval process, just as if you had used QuickStatements or OpenRefine to make edits. </div><div><br /></div><div>However, not having bot approval also places rate limitations on interactions with the API. User accounts without "bot flags" (granted after successfully completing the approval process) are limited to 50 writes per minute. Writing at a faster speed without a bot flag will cause the API to block your IP address. This is the primary limitation on the speed of writing data with VanderBot and a delay (<span style="font-family: courier;">api_sleep</span>) is hard-coded in the script. </div><div><br /></div><div>Note that bot approval and bot flags are granted to accounts, not User-Agents. If you use the VanderBot API-writing script along with other scripts as part of a defined automated process, you should set up a separate bot user account. You could then get a bot flag for that particular account and purpose, and remove the speed limitation from the script. I don't know if VanderBot is actually ready for that kind of use at this point. So you're on your own there.</div><div><br /></div><div>The short answer to the overall question of whether you need a separate bot account is usually "no".</div><div><br /></div><h2 style="text-align: left;">Generating a bot password</h2><div>If you understand your responsibilities and have decided that experimenting with VanderBot is worth your time, let's get started on the DIY part. 
Because the bot password you create can be used across Wikimedia sites, I will illustrate the password creation process at <a href="https://test.wikidata.org/">https://test.wikidata.org/</a> since that is where we will first use it.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjG0t1XLzsNd2YHjF0dypNw5lSnIkx6wzBYf4Zd-Z9gyxlM29K7Mdmd4jHKbIVsP1K53sxc7YWQAYDVZgCUX012E3dBC-8GxJYFYf3vdAoiFtclUoPUUafrnu4QptxVxlnuv_8zgk8fHSg/s1142/1test-homepage.png" style="margin-left: 1em; margin-right: 1em;"><img alt="test.wikidata.org landing page showing location of Special pages link" border="0" data-original-height="633" data-original-width="1142" height="356" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjG0t1XLzsNd2YHjF0dypNw5lSnIkx6wzBYf4Zd-Z9gyxlM29K7Mdmd4jHKbIVsP1K53sxc7YWQAYDVZgCUX012E3dBC-8GxJYFYf3vdAoiFtclUoPUUafrnu4QptxVxlnuv_8zgk8fHSg/w640-h356/1test-homepage.png" width="640" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: left;">The test Wikidata instance looks similar to the regular one, except that the logo in the upper left is in monochrome rather than color. The functionality is identical. Click on the <span style="font-family: courier;">Special pages</span> link in the left pane.</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdyhkqCNPzH_hT1r6ycngjbQOO6BqYlzcRAmKoL63RUVttly-122iBdbEjoNWbckkBpsizEXjVCFnwSCLa5MQ88BACXepSGFVacCaQIMNepmvefLP3nL4pM3UwdYnrztXzt6pwNgS27M0/s1112/2special-pages.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Links on the Special pages page" border="0" data-original-height="1112" data-original-width="810" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhdyhkqCNPzH_hT1r6ycngjbQOO6BqYlzcRAmKoL63RUVttly-122iBdbEjoNWbckkBpsizEXjVCFnwSCLa5MQ88BACXepSGFVacCaQIMNepmvefLP3nL4pM3UwdYnrztXzt6pwNgS27M0/w291-h400/2special-pages.png" width="291" /></a></div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">On the Special pages page, click on the <span style="font-family: courier;">Bot passwords</span> link in the <span style="font-family: courier;">Users and rights</span> section.</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1E10eINCY9KQ64LrQq5drG1i3Nzcy0us78iiY15CN652Zj9hHrypvexfV1dBjIVpkhY7pWwwf_M78p2rbt-tGIiHVHLCXR9XdF6G4nsK-AkFLpagjrGBTS5PGDc2dAclgqtf2KeBEqHw/s957/3bot-passwords-page.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Bot passwords page" border="0" data-original-height="641" data-original-width="957" height="428" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1E10eINCY9KQ64LrQq5drG1i3Nzcy0us78iiY15CN652Zj9hHrypvexfV1dBjIVpkhY7pWwwf_M78p2rbt-tGIiHVHLCXR9XdF6G4nsK-AkFLpagjrGBTS5PGDc2dAclgqtf2KeBEqHw/w640-h428/3bot-passwords-page.png" width="640" /></a></div><br /><div class="separator" style="clear: both; text-align: left;">On the <span style="font-family: courier;">Bot passwords</span> page, enter a name for the bot password. 
It is conventional to include "bot" or "Bot" somewhere in the name of a bot. However, since this password is actually going to be associated with your own user account and not a special bot account, including "bot" in the name is not that important. In the past when bot passwords were only associated with particular Wikimedia sites, it was more important to have mnemonic names to keep bots for different sites straight. However, since you can use the same bot across sites, this is no longer important. </div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">There are actually two reasons why you might want to use multiple, differently-named bot passwords. One is that different passwords can have different scope restrictions (see next step). So one password might only be able to perform certain actions, while another might be less restricted. The other reason is that if a particular password is being used "in production", you might want to have another one for testing. In the event you accidentally expose the credentials for the testing bot password, you could revoke those credentials without affecting the production bot password. However, for most purposes it would probably make sense to just have the "bot name" be the same as your username. </div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">After entering the name, click the <span style="font-family: courier;">Create</span> button.</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjB2qvoOu_BvHtRtYvjUyQKcD9t366gzdKUXQnFycKZzsERMX3BWXOO84wYxMejJFSULWpVbPa4lY-JxPCqUhkI8thp98-KmuFQ95BesabWp1m13UjJZz0idXjn0irq7_oT-bEoP8F4W54/s1273/4grants-page.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Available rights that can be assigned to bots" border="0" data-original-height="1273" data-original-width="874" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjB2qvoOu_BvHtRtYvjUyQKcD9t366gzdKUXQnFycKZzsERMX3BWXOO84wYxMejJFSULWpVbPa4lY-JxPCqUhkI8thp98-KmuFQ95BesabWp1m13UjJZz0idXjn0irq7_oT-bEoP8F4W54/w439-h640/4grants-page.png" width="439" /></a></div><br /><div class="separator" style="clear: both; text-align: left;">On the next page, select the rights that you want to grant to this bot password. I think the important ones are <span style="font-family: courier;">Edit existing pages</span>, <span style="font-family: courier;">Create, edit, and move pages</span>, and <span style="font-family: courier;">Delete pages, revisions, and log entries</span>. However, just in case, I also selected <span style="font-family: courier;">High-volume editing</span>, and <span style="font-family: courier;">View deleted files and pages</span> as well. 
Leave the rest of the options at their defaults and click the <span style="font-family: courier;">Create</span> button.</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEim9Rinkp65251mlBg3Cc-mwrBD2nfsGB3hYS7z8a6OUJrAV4LFoMs_7tJoercgIO6nzWHmuNYGEW4h1Z9oxkEeL5lt00UCtTLtZygEeeCUAMTkeG-n0ILEOW3JxqfItSVoKZkyh2hFVng/s1129/5passwords.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Bot password results page" border="0" data-original-height="472" data-original-width="1129" height="268" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEim9Rinkp65251mlBg3Cc-mwrBD2nfsGB3hYS7z8a6OUJrAV4LFoMs_7tJoercgIO6nzWHmuNYGEW4h1Z9oxkEeL5lt00UCtTLtZygEeeCUAMTkeG-n0ILEOW3JxqfItSVoKZkyh2hFVng/w640-h268/5passwords.png" width="640" /></a></div><br /><div class="separator" style="clear: both; text-align: left;">The resulting page will give you the username and passwords that you will need to write to the API. There are two variants: one where the bot name is appended to your username by @, and another where the username is used alone and the bot name is prepended to the password by @. We will use the first variant (username@botname).</div><div class="separator" style="clear: both; text-align: left;"><br /></div><div class="separator" style="clear: both; text-align: left;">You need to create a plain text file that contains the username and password. To do this, you should use a text or code editor and NOT a word processor like Microsoft Word. Your computer should have a built-in text editor (TextEdit for Mac, Notepad for Windows). If you don't know what text and code editors are, see the first three videos on <a href="https://heardlibrary.github.io/digital-scholarship/script/codegraf/020/" target="_blank">this page</a>. If you are using a Mac, the second video explains how to ensure that TextEdit saves your file as plain text rather than as rich text (which will cause an error in our situation) and to ensure that files are opened and closed using UTF-8 character encoding. </div><div><br /></div><div>Open a new document in the text editor. Create three lines of text similar to this:</div><div><br /></div><div><div><span style="font-family: courier;">endpointUrl=https://test.wikidata.org</span></div><div><span style="font-family: courier;">username=User@bot</span></div><div><span style="font-family: courier;">password=465jli90dslhgoiuhsaoi9s0sj5ki3lo</span></div></div><div><br /></div><div>Be careful, since mistyping any character will prevent VanderBot from working. It's best to copy and paste rather than to try to type the credentials. (These are fake credentials, so you can't actually use them -- use your own username and password.) Do not leave a space between the equals sign (<span style="font-family: courier;">=</span>) and the other characters. The first line specifies that we are going to use the test.wikidata.org API, so you can copy it exactly as written above. The username is the login name that includes the @ symbol (<span style="font-family: courier;">Baskaufs@BaskaufTestBot</span> in the example above). The password is the password version that does not have the @ symbol in it. Double check that when you copied the username and password, you did not leave any characters off. Also, put the cursor at the end of each line and make sure that there are no trailing spaces after the text on the line. 
It does not matter whether the last line is followed with a newline (hard return) or not.</div><div><br /></div><div>When you have entered the text, save the file as <span style="font-family: courier;">wikibase_credentials.txt</span> in your home directory. In the next post, we will see how to use a different name or location for this file. Make sure that there is an underscore between "wikibase" and "credentials", not a dash or a space. If you do not know what your home directory is, or where it is located on your computer, see the <span style="font-family: courier;">Special directories in Windows</span> section of <a href="https://heardlibrary.github.io/digital-scholarship/computer/directories-windows/" target="_blank">this page</a> or the <span style="font-family: courier;">Special directories on Mac</span> section of <a href="https://heardlibrary.github.io/digital-scholarship/computer/directories-mac/" target="_blank">this page</a>. In Finder on a Mac, you can select <span style="font-family: courier;">Home</span> from the <span style="font-family: courier;">Go</span> menu to get there. In Windows File Explorer, start at the <span style="font-family: courier;">c:</span> drive, then navigate to the <span style="font-family: courier;">Users</span> folder. Your user folder will be within the <span style="font-family: courier;">Users</span> folder and have the same name as your username on the computer. </div><div><br /></div><h2 style="text-align: left;">Preparing the metadata description file and CSV headers</h2><div>The VanderBot API upload script uses CSV files as its data source. Each row in the table represents data about an item. The columns of the table represent various aspects of metadata about the items, such as statements, qualifiers, and references. In order to transfer the data from the CSV to the Wikidata API, the columns of the CSV spreadsheet need to be mapped to the Wikibase data model (the model used by Wikidata). Since the Wikibase model can be represented as RDF, the <a href="https://www.w3.org/TR/csv2rdf/" target="_blank">W3C Generating RDF from Tabular Data on the Web Recommendation</a> can be used to systematically map the CSV columns to the Wikibase model. The VanderBot script uses that mapping to determine how to construct the JSON required to transfer the CSV data to the Wikidata API.</div><div><br /></div><div>Initially, I constructed the mapping file (known as the CSV's "metadata description file") by hand while referring to the W3C Recommendation and its examples. However, it is extremely difficult to build the mapping file by hand without making errors that are difficult to detect. Fortunately, my collaborator, Jessie Baskauf, created a web tool that allows a user to construct the mapping file using drop-downs that are organized in a structure that reflects that of the Wikidata graphical user interface. We will use that tool to create both the mapping file and the CSV header whose field names correspond to those used in the mapping file. </div><div><br /></div><div>The tool itself can be accessed online from <a href="https://heardlibrary.github.io/digital-scholarship/script/wikidata/wikidata-csv2rdf-metadata.html" target="_blank">this link</a>. 
The Javascript that runs the tool runs entirely within the web browser, so it can be used offline by downloading the <a href="https://github.com/HeardLibrary/linked-data/blob/master/vanderbot/wikidata-csv2rdf-metadata.html" target="_blank">HTML</a>, <a href="https://github.com/HeardLibrary/linked-data/blob/master/vanderbot/wikidata-csv2rdf-metadata.css" target="_blank">CSS</a>, and <a href="https://github.com/HeardLibrary/linked-data/blob/master/vanderbot/wikidata-csv2rdf-metadata.js" target="_blank">Javascript</a> files from GitHub into the same directory, then opening the HTML file in a browser. </div><div><br /></div><div>On the tool page, leave the Wikidata ID field at its default, <span style="font-family: courier;">qid</span>. Use the Add label and Add description buttons to enter the names of each of those fields. I have been using the convention <span style="font-family: courier;">labelEn</span>, <span style="font-family: courier;">labelDe</span>, <span style="font-family: courier;">descriptionEs</span>, etc., where I use lower camelCase and append the language code. However, you can use any name that makes sense to you. Select the appropriate language codes from the dropdown.</div><div><br /></div><div>One thing to note is that there is no correspondence between property and item identifiers in test.wikidata.org (the test Wikidata implementation) and www.wikidata.org (the real Wikidata). So before we can add properties and the item values of those properties, we need to look in the test Wikidata site to find properties and items that we want to play with. From the <a href="https://test.wikidata.org/">https://test.wikidata.org/</a> landing page, select <span style="font-family: courier;">Special pages</span> from the left pane as you did before. </div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3JlnApRpBrbtZTBkgRT9tr9GXQVjg8MIi8WYpKwjK7Y_X7Tb4GfZc9iO0YO_WfD-0N42qnjRPeQ73A0D7GMjwI69-xMa-IJhu_B5aaXEeF9voY0JFQ1TTocjZAj_6BKGDqrngatWGHJE/s1129/6prop-list.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Special pages page showing List of properties link" border="0" data-original-height="861" data-original-width="1129" height="488" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3JlnApRpBrbtZTBkgRT9tr9GXQVjg8MIi8WYpKwjK7Y_X7Tb4GfZc9iO0YO_WfD-0N42qnjRPeQ73A0D7GMjwI69-xMa-IJhu_B5aaXEeF9voY0JFQ1TTocjZAj_6BKGDqrngatWGHJE/w640-h488/6prop-list.png" width="640" /></a></div><br /><div>Near the bottom of the <span style="font-family: courier;">Special pages</span> page in the <span style="font-family: courier;">Wikibase</span> section, click on the <span style="font-family: courier;">List of Properties</span> link. In the real Wikidata instance, creation of properties is controlled by a community process. In the test Wikidata instance anyone can create, change, or delete properties. So although the properties used in this example may still be the same when you do this exercise, they may also have changed. Since we are practicing, you can substitute any other similar property for the ones shown in the examples. We want to choose a couple of properties that have different kinds of values in order to see how that affects the mapping file and CSV headers. So we are looking for a property that has an <span style="font-family: courier;">Item</span> value and one that has a <span style="font-family: courier;">Point in time</span> value. 
</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRLRdAZyDbMMcCl-4VoRuiGm7-ccws0v31OJpZBnT6ZKklT7kLDk9a8xm-3Bsuhxv9PhlK8NDFba1Z8RJM_G1q5K6QkL6ySTrGFPM6BQ7M-PcmfT5mNnYsnZbAboQDEbqxRDCF_r55tYU/s900/7properties.png" style="margin-left: 1em; margin-right: 1em;"><img alt="List of Properties page showing Item and Date valued properties" border="0" data-original-height="900" data-original-width="544" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRLRdAZyDbMMcCl-4VoRuiGm7-ccws0v31OJpZBnT6ZKklT7kLDk9a8xm-3Bsuhxv9PhlK8NDFba1Z8RJM_G1q5K6QkL6ySTrGFPM6BQ7M-PcmfT5mNnYsnZbAboQDEbqxRDCF_r55tYU/w386-h640/7properties.png" width="386" /></a></div><br /><div>I picked P17 (country) and P18 (Date of birth) to use in the practice example. Clicking on the links shows that P17 has an <span style="font-family: courier;">Item</span> value and P18 has a <span style="font-family: courier;">Point in time</span> value. </div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgN4llecv4Kdo4J3awH0OKpCcedmMbijLClbO5JFM3d9ADexKWxtmEh15OO9tbwbxRBYxU5yOlKmiYNGEaZhuUXcJTnhayvudKtpGLPf55Rqf4YVZmgxBuDhi03qWOhdrC-Ftyxjm6ugMI/s1038/8fake-item.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Item page for France in the test Wikidata instance" border="0" data-original-height="803" data-original-width="1038" height="496" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgN4llecv4Kdo4J3awH0OKpCcedmMbijLClbO5JFM3d9ADexKWxtmEh15OO9tbwbxRBYxU5yOlKmiYNGEaZhuUXcJTnhayvudKtpGLPf55Rqf4YVZmgxBuDhi03qWOhdrC-Ftyxjm6ugMI/w640-h496/8fake-item.png" width="640" /></a></div><br /><div>There are not necessarily items in the test Wikidata instance that correspond to those in the real Wikidata, so I searched for some countries to use as values of P17 in the test. I found Q346 (France) and Q53079 (Mexico). You can find your own, or create new items to use if you want.</div><div><br /></div><div>I also wanted to select a property to use for a qualifier and another one to use for a reference. In the real Wikidata instance, many properties have constraints that indicate whether they are suitable to be used as properties in statements, qualifiers, or references. In the test Wikidata instance, most properties don't have any constraints. So I just picked a couple that seemed to make sense. I chose P87 (start date, having a <span style="font-family: courier;">Point in time</span> value) as a qualifier property for P17 (country). (What does that mean? I don't know and it doesn't matter -- this is just a test.) I chose P93 (reference URL, having a <span style="font-family: courier;">URL</span> value) as a reference property for P18 (Date of birth). 
Here is a summary of my chosen entities:</div><div><br /></div><div><div>P17 country (<span style="font-family: courier;">Item</span> value, used as a statement property)</div><div>P87 start date (<span style="font-family: courier;">Point in time</span> value, used as a qualifier property for P17)</div><div>Q346 France (<span style="font-family: courier;">Item</span>, used as a value for P17)</div><div>Q53079 Mexico (<span style="font-family: courier;">Item</span>, used as a value for P17)</div><div>P18 Date of birth (<span style="font-family: courier;">Point in time</span> value, used as a statement property)</div><div>P93 reference URL (<span style="font-family: courier;">URL</span> value, used as a reference property)</div></div><div><br /></div><div>Using the buttons and drop-downs, I selected the properties listed above on the web tool.</div><div><br /></div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhE9ZoyNyn25Go2Hc0TRFVCtIJ5Q9LcYgQltzxjBRrXWp5XY7v08bu2zhpV38Usukne9rdoNiBui9kY0XiXHFbsixrUcSECBo0w0e3mdS_6VT6u8gS982z1xkcY9-DGSVJATUBfpfqR7ns/s1190/9generate-schema.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Mapping file-generating web page showing settings" border="0" data-original-height="1190" data-original-width="862" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhE9ZoyNyn25Go2Hc0TRFVCtIJ5Q9LcYgQltzxjBRrXWp5XY7v08bu2zhpV38Usukne9rdoNiBui9kY0XiXHFbsixrUcSECBo0w0e3mdS_6VT6u8gS982z1xkcY9-DGSVJATUBfpfqR7ns/w464-h640/9generate-schema.png" width="464" /></a></div><br /> The field names that you choose for the properties used in statements can be whatever you want. It is best to keep them short and do NOT use spaces. If you use multi-word names, I recommend lower camelCase, since dashes may cause problems later on and underscores are used by the tool to indicate the hierarchy of qualifier and reference properties. The fields ending in <span style="font-family: courier;">_uuid</span> and <span style="font-family: courier;">_hash</span> are for statement and reference identifiers, and you should leave them at their defaults. When you create statement properties, by default the tool prefixes qualifier and reference properties with their parent statement property names followed by an underscore. You can change these to shorten them if you want, but it's probably best to leave them at their defaults, since when CSVs have many columns it becomes difficult to remember the structure without the prefixes. </div><div><br /></div><div>A statement can have multiple qualifier properties, and it can also have both multiple references and multiple properties within a reference. For simplicity's sake, I recommend sticking with a single reference having one or more properties. </div><div><br /></div><div>Using the drop-downs, be sure to select a value type that is appropriate for the property. 
There is no quality control here at the point of the tool, but an error will be generated when writing to the API if the selected value type does not match with the value type specified for the property on the property page of the test Wikidata instance.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYwXLjgBTW7r_2fsmhtwPUj0aaX91_wCBxJq0qZG0_jbaLdcera5EKLhssVPvSqsCMp_uIJbfnpqBLwWvTQiyOnbZBMdqBrBA_V25O4Yg1c1EcwaE3TIwgZNKv_o-UMNiOfZS7_Al2Vmw/s1202/10copy-csv.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Screenshot showing Copy to clipboard button for generating CSV headers" border="0" data-original-height="736" data-original-width="1202" height="392" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYwXLjgBTW7r_2fsmhtwPUj0aaX91_wCBxJq0qZG0_jbaLdcera5EKLhssVPvSqsCMp_uIJbfnpqBLwWvTQiyOnbZBMdqBrBA_V25O4Yg1c1EcwaE3TIwgZNKv_o-UMNiOfZS7_Al2Vmw/w640-h392/10copy-csv.png" width="640" /></a></div><br /><div>After you have entered all of the property information, scroll to the bottom and enter a filename in the box. Click on the <span style="font-family: courier;">Create CSV</span> button. At this time, the script isn't sophisticated enough to actually generate the CSV file. (That is a possible future feature.) Rather it generates the header line for the CSV as raw text. Click the <span style="font-family: courier;">Copy to clipboard</span> button, then open a new file using the same text editor that you used to generate the credentials file. </div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBWCL4Bw4gMft-jJCoMcSpJFTpr508iuy_5V_DdETR0dtF5N-fqUh-c9n01a1RFRnE9pJFC9mtZVmOYQy2vBpx8USZ-Xw8xjqMgchyphenhyphen-anG53-qgCSQp_YiLdQFa4Dc27EH75d9zJOv6kw/s1223/11paste-csv.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Text editor window showing pasted column header line" border="0" data-original-height="363" data-original-width="1223" height="190" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBWCL4Bw4gMft-jJCoMcSpJFTpr508iuy_5V_DdETR0dtF5N-fqUh-c9n01a1RFRnE9pJFC9mtZVmOYQy2vBpx8USZ-Xw8xjqMgchyphenhyphen-anG53-qgCSQp_YiLdQFa4Dc27EH75d9zJOv6kw/w640-h190/11paste-csv.png" width="640" /></a></div><br /><div>Paste the copied text into the new file window.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhTAFwmkRMLp1XBTJMUKz8RYrun7yjrXugDLuoftmebjmEdIqr0VRAo1a7X-a-wzkdPis11MyCHuj727zAr5Vj1dewrQkeiOggCwRdNkY4heeTEIvm3UFphLjeEUOeDkvKXfW4QjCm957w/s1212/12save-csv.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Save dialog for CSV file" border="0" data-original-height="527" data-original-width="1212" height="278" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhTAFwmkRMLp1XBTJMUKz8RYrun7yjrXugDLuoftmebjmEdIqr0VRAo1a7X-a-wzkdPis11MyCHuj727zAr5Vj1dewrQkeiOggCwRdNkY4heeTEIvm3UFphLjeEUOeDkvKXfW4QjCm957w/w640-h278/12save-csv.png" width="640" /></a></div><br /><div>Select <span style="font-family: courier;">Save</span> or <span style="font-family: courier;">Save As...</span> from an appropriate menu on your editor. The exact appearance of the dialog window will depend on your editor. The screenshot above is for TextEdit on a Mac. 
Be sure that you use exactly the same file name as you entered in the filename box in the web tool, with a <span style="font-family: courier;">.csv</span> file extension. If your editor gives you a choice of text encoding, be sure to choose UTF-8. The directory into which you save the CSV file will be the one from which you will be running the upload script using the command line. So it is best to save it in some folder that is a subfolder of your home folder. Generally, Downloads, Documents, and Desktop are directly below the home folder, so if you use a subfolder of one of those folders, you should be able to navigate to that folder easily using the command line. </div><div><br /></div><div>Now click the <span style="font-family: courier;">Create JSON</span> button. </div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJpvlV4R-s3HK4-MMQEQzjzasEfPg18b40-zbjXtFv_Ybi7fiFwAzhvced5pyk-2TiEW7GLP21Z96e2Fcyy5ynopDX73qz_3kJeEbwvsV3qmNZ_rToJo4gELAdqc_3Kzh-hic3kFmiSWU/s1044/13copy-json.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Creating the metadata JSON file" border="0" data-original-height="951" data-original-width="1044" height="582" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJpvlV4R-s3HK4-MMQEQzjzasEfPg18b40-zbjXtFv_Ybi7fiFwAzhvced5pyk-2TiEW7GLP21Z96e2Fcyy5ynopDX73qz_3kJeEbwvsV3qmNZ_rToJo4gELAdqc_3Kzh-hic3kFmiSWU/w640-h582/13copy-json.png" width="640" /></a></div><br /><div>The metadata description JSON for the CSV file columns that you set up will be generated on the screen below the button. Click the <span style="font-family: courier;">Copy to clipboard</span> button. Open a new file in the text editor that you used before and paste the copied text into it. Save the file using the name <span style="font-family: courier;">csv-metadata.json</span> in the same directory where you saved the CSV file.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgdKBDuWhL6UZaGD8DNj-yAmhoTYlsQfNStZFVegv6d9HgsDCH0-65o61vdm1eRKjgKCn-STm51mtdxToHDS4lf0a0o0lChv9uN7fEjqXZ8jQr-mVnSm-FWzRtxFFModPuVaMHejJCKtU4/s1526/14save-json.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Metadata description JSON in a code editor" border="0" data-original-height="956" data-original-width="1526" height="402" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgdKBDuWhL6UZaGD8DNj-yAmhoTYlsQfNStZFVegv6d9HgsDCH0-65o61vdm1eRKjgKCn-STm51mtdxToHDS4lf0a0o0lChv9uN7fEjqXZ8jQr-mVnSm-FWzRtxFFModPuVaMHejJCKtU4/w640-h402/14save-json.png" width="640" /></a></div><br /><div>I like to paste the JSON into my favorite code editor (VS Code) because it will validate the JSON and display it using syntax highlighting, but that isn't really any better than using a vanilla text editor.</div><div><br /></div><h2 style="text-align: left;">Preparing data to create new items</h2><div>Now we will open the CSV file to add the data that we want to write to the test Wikidata instance. For this practice exercise, you can use Excel to edit the CSV if that is all you have, but if you are serious about using this system in the future, I highly recommend downloading and installing LibreOffice and using its Calc application to open and edit CSVs. I explain the reasons for this in the <span style="font-family: courier;">Skills required</span> section at the top of this post. 
You can probably just double-click on the file in your file handling system (Finder on Mac or File Explorer in Windows), but if that doesn't work, open your spreadsheet application and open the file via Open in the File menu. </div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjUkGOsGI4WCK1Oye-DrJwSzQBxXOjGeMdQRuH9NECMi_5mTgiMN_ix0wHTSh5GlIVWZXcVf3-X96KE73rvIVa9jpks9avrWUfxvoo70HQfBgVLuOWu3aLhrkxCelXMCMpV8XpYDeMuJQE/s1700/15fake-csv-data.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Spreadsheet with fake data" border="0" data-original-height="227" data-original-width="1700" height="86" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjUkGOsGI4WCK1Oye-DrJwSzQBxXOjGeMdQRuH9NECMi_5mTgiMN_ix0wHTSh5GlIVWZXcVf3-X96KE73rvIVa9jpks9avrWUfxvoo70HQfBgVLuOWu3aLhrkxCelXMCMpV8XpYDeMuJQE/w640-h86/15fake-csv-data.png" width="640" /></a></div><br /><div>When you open the CSV file, it should appear as a spreadsheet with the column names in the order that you created them with the web tool, and empty rows below them. You can now add data in the rows below the header. </div><div><br /></div><div>The screenshot of my example above is too small to easily see, but you can get a better look at it by going to this <a href="https://gist.github.com/baskaufs/405e22ae25efcb34327dbd7f0a7cfa6e" target="_blank">GitHub gist</a>. You must use different labels and descriptions from the ones I used because if you use the same ones, the API will not allow them to be written (more details about this in the next blog post). As values for the country column, you can use the Q ID of any item. (I used Q53079 for Mexico). Notice that the <span style="font-family: courier;">birthDate</span> column does not have a prefix, indicating that it is a statement property and not a child property of something else. The <span style="font-family: courier;">startDate</span> column was prefixed with <span style="font-family: courier;">country_</span> by the web tool. That prefix and its position following the <span style="font-family: courier;">country</span> column are clues that this column is a qualifier for the <span style="font-family: courier;">country</span> column. The <span style="font-family: courier;">refUrl</span> column was prefixed by the web tool with <span style="font-family: courier;">birthDate_</span> and <span style="font-family: courier;">ref1_</span>, indicating that it is a property of the first reference for the birthDate statement. Because the value type of the <span style="font-family: courier;">birthDate_ref1_refUrl</span> column is URL, it must be a valid IRI starting with either <span style="font-family: courier;">http://</span> or <span style="font-family: courier;">https://</span>.</div><div><br /></div><div>The two date fields are of a more complicated type. Dates, globe coordinates, and quantities are complex data types that cannot be represented by single fields. In the case of dates, they require one column for the date string and another column to indicate the precision of the date (e.g. to year, to month, to century, etc.). There is a somewhat complicated system for representing dates in the Wikibase model (see <a href="https://www.mediawiki.org/wiki/Wikibase/DataModel#Dates_and_times" target="_blank">this page</a> for details). 
Fortunately, the VanderBot script will automatically convert dates that are formatted according to its conventions into the format required by the API. Those conventions are:</div><div><br /></div><div><span style="font-family: courier;">character pattern | example | precision</span></div><div><span style="font-family: courier;">YYYY | 1885 | to year</span></div><div><span style="font-family: courier;">YYYY-MM | 2020-03 | to month</span></div><div><span style="font-family: courier;">YYYY-MM-DD | 2001-09-11 | to day</span></div><div><br /></div><div>In the example spreadsheet, the <span style="font-family: courier;">country_startDate_val</span> date value for the second item has precision to month, while the <span style="font-family: courier;">birthDate_val</span> date values have precision to day. </div><div><br /></div><div>The dates should be placed in the corresponding column with name ending in <span style="font-family: courier;">_val</span>. The script knows that it should make the conversion when the corresponding column with name ending in <span style="font-family: courier;">_prec</span> is empty. If the year has fewer than four digits, is BCE (a negative number), or has a precision lower than year (century, millennium, etc.), then a date string and precision integer properly formatted according to the Wikibase model must be provided explicitly. The script only provides minimal format checking (for the correct number of characters), so dates that are otherwise incorrectly formatted will result in an error that prevents the record from being written to the API. </div><div><br /></div><div>You should also notice that the example spreadsheet has a number of empty columns. These columns will contain identifiers for the various entities described by the data columns. For example, the <span style="font-family: courier;">qid</span> column will contain the identifier for the item. The <span style="font-family: courier;">country_uuid</span> and <span style="font-family: courier;">birthDate_uuid</span> columns will contain the identifiers for the <span style="font-family: courier;">country</span> and <span style="font-family: courier;">birthDate</span> statements. The <span style="font-family: courier;">birthDate_ref1_hash</span> column will contain the identifier for the first reference for <span style="font-family: courier;">birthDate</span>, which contains a reference URL. In all of these cases, the Wikidata API will assign those identifiers when the various entities are created and they will be recorded in the CSV file immediately after the item has been created. The VanderBot script uses the presence or absence of these identifiers to know whether the particular identified entity exists and therefore whether it needs to be written to the API or not. </div><div><br /></div><div>The situation with the two date columns whose names end in <span style="font-family: courier;">_nodeId</span> is complicated. For technical reasons that I don't want to get into in this post, the node ID values are not assigned by the API, but rather are generated by VanderBot at the point of processing the dates. This is true for all of the properties with node value types (dates, globe coordinates, and quantities). 
All you need to know is that you should leave the columns ending in <span style="font-family: courier;">_nodeId</span> blank and that the sets of three date-related columns that have the same first part (<span style="font-family: courier;">country_startDate_nodeId</span>, <span style="font-family: courier;">country_startDate_val</span>, and <span style="font-family: courier;">country_startDate_prec</span>; <span style="font-family: courier;">birthDate_nodeId</span>, <span style="font-family: courier;">birthDate_val</span>, and <span style="font-family: courier;">birthDate_prec</span>) represent complex values that can't be represented by a single column. </div><div><br /></div><div>Note that I did not fill in every cell in the table that could contain values. I did that because in a later step we will practice adding values to the item statements and references after the items have already been created. </div><div><br /></div><div>Be sure to close the CSV file before continuing to the next step. Failure to close the CSV will have different effects depending on the spreadsheet program you are using. I believe that both Excel and OpenOffice Calc place a lock on the file so that when the VanderBot script tries to write the API responses to the CSV file, it generates an error and crashes the script. LibreOffice Calc will allow the changes to be written to the CSV file, but they will not show up unless the file is closed and re-opened. LibreOffice Calc will warn you if you try to save an open file that has been changed by the script while it was open. In that case, close the file without saving and re-open it to see the changes.</div><div><br /></div><h2 style="text-align: left;">Creating new items using the API</h2><div>The last thing you need in order to actually write data to the API is the VanderBot Python script itself. Go to the <a href="https://github.com/HeardLibrary/linked-data/blob/master/vanderbot/vanderbot.py" target="_blank">code page on GitHub</a>. Right-click on the <span style="font-family: courier;">Raw</span> button in the upper right of the page. Select Save Link As..., navigate to the directory where you saved the <span style="font-family: courier;">csv-metadata.json</span> file and the CSV file that you edited, and save the <span style="font-family: courier;">vanderbot.py</span> script there. </div><div><br /></div><div>If you have not previously installed the <span style="font-family: courier;">requests</span> library, you may need to do that before you can run the script. If you have Anaconda installed on your computer, <span style="font-family: courier;">requests</span> may already be installed. If you aren't sure, just try running the script as described below. If you get an error message saying that Python doesn't know about <span style="font-family: courier;">requests</span>, then try entering:</div><div><br /></div><div><span style="font-family: courier;">pip install requests</span></div><div><br /></div><div>If that doesn't work, try</div><div><br /></div><div><span style="font-family: courier;">pip3 install requests</span></div><div><br /></div><div>If you use some other package manager like brew or conda, install requests by whatever means you normally install packages.</div><div><br /></div><div>Open the appropriate console program for your operating system (probably <span style="font-family: courier;">Terminal</span> for Mac or <span style="font-family: courier;">Command Prompt</span> for Windows). 
Use the <span style="font-family: courier;">cd</span> command to navigate to the directory where you saved the file, then list the files to make sure you are in the right place and the files are all there (<span style="font-family: courier;">ls</span> for Mac or Linux, or <span style="font-family: courier;">dir</span> for Windows). </div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj-67m07J0znsCuuUmt25YNWeshwsIouxE6DsW1E-iWSK2NITrv-TmVKn8qXSmz-XI027DoUpwATwxnzxA62ydfY_q6U5VXLZXfdaG_LdOQ_3Hre0aOltFS3Jmv8btkiWuFqFZ6hc2I98k/s975/16run.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Console showing command to launch VanderBot" border="0" data-original-height="281" data-original-width="975" height="184" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj-67m07J0znsCuuUmt25YNWeshwsIouxE6DsW1E-iWSK2NITrv-TmVKn8qXSmz-XI027DoUpwATwxnzxA62ydfY_q6U5VXLZXfdaG_LdOQ_3Hre0aOltFS3Jmv8btkiWuFqFZ6hc2I98k/w640-h184/16run.png" width="640" /></a></div><br /><div>Depending on how you set up Python, the command to run the script will probably be either</div><div><br /></div><div><span style="font-family: courier;">python vanderbot.py</span></div><div><br /></div><div>or</div><div><br /></div><div><span style="font-family: courier;">python3 vanderbot.py</span></div><div><br /></div><div>If things work correctly, the console should show the progress of writing to the API.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3b9KVda-wcX4HQusLQxOAj4wZMIlfr0772tf5qPmehvo4cXu5DxtCTf2FV2AcZZJ_9jXVRfOhhi5n2KkAB6mHiucJFGCeVvSrB00UIDQAtOWN0vNQr6McUO-gX8w2iEv0AxmRagcKB50/s1072/16a-run-result.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Console result of writing to the API" border="0" data-original-height="844" data-original-width="1072" height="504" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3b9KVda-wcX4HQusLQxOAj4wZMIlfr0772tf5qPmehvo4cXu5DxtCTf2FV2AcZZJ_9jXVRfOhhi5n2KkAB6mHiucJFGCeVvSrB00UIDQAtOWN0vNQr6McUO-gX8w2iEv0AxmRagcKB50/w640-h504/16a-run-result.png" width="640" /></a></div><br /><div>The first part of the output shows how VanderBot is interpreting the columns of the CSV based on the information from the <span style="font-family: courier;">csv-metadata.json</span> column-mapping file. Then there is an indication that dates have been converted to the form required by the Wikibase model. As the script writes each row to the API, it displays the response of the API. The contents of the response don't matter as long as the end of the response contains "success". After writing statements for each row, the script then checks whether there were any existing statements with added references. Since there were none, nothing was reported. Finally, there is a report of any errors that occurred that prevented particular rows from being written. Not every possible type of error is trapped and some will result in the script terminating before finishing all of the rows of the CSV. In that situation, the last response from the API may give clues about what went wrong. All information about identifiers received from the API prior to termination of the script should be saved in the CSV file, so once the error is fixed, you can just run the script again to retry writing the problematic line. 
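<br /><br />As an aside on that date conversion: the following is a minimal sketch in Python (<b>not</b> the actual VanderBot code) of how a value following the YYYY, YYYY-MM, or YYYY-MM-DD conventions described above could be turned into a Wikibase-style time string and precision integer (9 = to year, 10 = to month, 11 = to day). The way the unused month and day parts are padded here is just for illustration; the real script may handle that detail differently.<br /><br />
<pre>
def convert_date(value):
    """Convert 'YYYY', 'YYYY-MM', or 'YYYY-MM-DD' into a (time string, precision) pair."""
    precision_by_length = {4: 9, 7: 10, 10: 11}  # to year, to month, to day
    if len(value) not in precision_by_length:
        raise ValueError('unsupported date format: ' + value)
    precision = precision_by_length[len(value)]
    # Pad the unused month and day parts, since the Wikibase model expects a full
    # timestamp even when the precision integer says to ignore the finer parts.
    padded = (value + '-01-01')[:10]
    return '+' + padded + 'T00:00:00Z', precision

# For example, convert_date('1986-02') returns ('+1986-02-01T00:00:00Z', 10).
</pre>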
</div><div><br /></div><div>If you re-open the CSV file, you should see results similar to <a href="https://gist.github.com/baskaufs/306f64a546b6d43c4810ffdc2fb55ef7" target="_blank">this gist</a>. All of the identifier columns in the table that are associated with value columns have now been filled in, indicating that those data now exist on Wikidata. Notice also that the dates have been converted into the more complicated format. </div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijWP4PcRRlEhKQibss7LPGPAqKcuBhZAWZcnz4mTq_2fcOZyQSPRO7IeRLQmXK3KcxYZxW5-CCQPe-BxgGP5UIYmlYJ5pTDPS2PAMt5cDzmsVm2uMRza-C9FiC5G2-8PW-4HPSM1PMpBg/s1304/18marie-result.png" style="margin-left: 1em; margin-right: 1em;"><img alt="New page created with reference" border="0" data-original-height="865" data-original-width="1304" height="424" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijWP4PcRRlEhKQibss7LPGPAqKcuBhZAWZcnz4mTq_2fcOZyQSPRO7IeRLQmXK3KcxYZxW5-CCQPe-BxgGP5UIYmlYJ5pTDPS2PAMt5cDzmsVm2uMRza-C9FiC5G2-8PW-4HPSM1PMpBg/w640-h424/18marie-result.png" width="640" /></a></div><br /><div>If you search the test.wikidata.org site, you should see the record for the new item that you created. Because the <span style="font-family: courier;">birthDate_ref1_refUrl</span> column had a value, a reference was created for the Date of birth statement. </div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgdy4gA3_4kPafzGrp8rJ3FcMfcLbsrRsUGqCEu2Jyx76mOXn4B_uV_F8_ht18utWv10XKCs6-dHZG3j6UjVWXuvZS1sKNg40_-u7bytqqUZjVxN63lJ_JFpdPI_rJnbMBbAE5umCrz2Is/s1169/19jose-result.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Test Wikidata item showing qualifier and date precision" border="0" data-original-height="885" data-original-width="1169" height="484" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgdy4gA3_4kPafzGrp8rJ3FcMfcLbsrRsUGqCEu2Jyx76mOXn4B_uV_F8_ht18utWv10XKCs6-dHZG3j6UjVWXuvZS1sKNg40_-u7bytqqUZjVxN63lJ_JFpdPI_rJnbMBbAE5umCrz2Is/w640-h484/19jose-result.png" width="640" /></a></div><br /><div>Because in the second row the <span style="font-family: courier;">country</span> property column was followed by value columns for <span style="font-family: courier;">country_startDate</span> that contained data, the country statement on the web page for the item displays a start date qualifier. The <span style="font-family: courier;">country_startDate_val</span> column contained a value in the form <span style="font-family: courier;">1986-02</span>, so a precision of <span style="font-family: courier;">10</span> (to month) was placed in the <span style="font-family: courier;">country_startDate_prec</span> column of the table and therefore only the month is shown on the web page. In contrast, the <span style="font-family: courier;">birthDate_val</span> column was given a value of <span style="font-family: courier;">1982-02-03</span>, so it was assigned a precision of <span style="font-family: courier;">11</span> (to day) and the day is displayed on the web page for the date of birth statement. </div><div><br /></div><h2 style="text-align: left;">Editing existing items using the API</h2><div>We can add information to the two new items that we just created by filling in parts of the CSV that we left blank before. 
In <a href="https://gist.github.com/baskaufs/d547642c78bc0d9e44cdf506d62d2c8d" target="_blank">this gist</a>, I added a country value (Q346, France) for item Q214621 and I added a reference value in the <span style="font-family: courier;">birthDate_ref1_refUrl</span> for the birth date statement, which already existed, but did not previously have any references. After making sure that I closed the CSV file in the spreadsheet program, I ran the VanderBot script again.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZ0vyL2-rsEddFii6pkflSl6ttZLqZYVyIN1arbPvFFSizsYivKEPId1tOBNeZnH8UNWqJRAezL7TieYvM7ASrgFjj3oB_u2dFE8A_Cz_PlBsOx4Gs5Uw1HS9QimqXZhu0CYLuIgPijt4/s1067/21run-second-time.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Second VanderBot run to add a statement and reference" border="0" data-original-height="917" data-original-width="1067" height="550" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZ0vyL2-rsEddFii6pkflSl6ttZLqZYVyIN1arbPvFFSizsYivKEPId1tOBNeZnH8UNWqJRAezL7TieYvM7ASrgFjj3oB_u2dFE8A_Cz_PlBsOx4Gs5Uw1HS9QimqXZhu0CYLuIgPijt4/w640-h550/21run-second-time.png" width="640" /></a></div><br /><div>In the first section of the output, the script detected that it needed to add a statement to an existing item (there was already a value in the <span style="font-family: courier;">qid</span> column, but there was a value in the <span style="font-family: courier;">country</span> column without a corresponding identifier value in the <span style="font-family: courier;">country_uuid</span> column). It found no statements to add in the second row, so it did nothing. </div><div><br /></div><div>When it went through each row looking for new references for existing statements, it found one for the birth date reference URL column for the second record (there was an identifier in the <span style="font-family: courier;">birthDate_uuid</span> column, but no identifier in the <span style="font-family: courier;">birthDate_ref1_hash</span> column). It attempted to write the new data to the API, but an interesting thing happened. The server was too busy, and sent a message back to the script that it should wait a while and try again. In general, the script will keep trying with an increasing delay of up to 5 minutes, giving up after 10 tries. In this particular case, on the second retry the server was no longer too busy and the reference was successfully added. 
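<br /><br />VanderBot's actual retry code isn't shown in this post, but the general pattern is the familiar one sketched below: keep re-posting with an increasing delay while the API reports that it is lagged. The <span style="font-family: courier;">maxlag</span> error code used here is the standard MediaWiki way of signaling a lagged server; the real script's details may differ.<br /><br />
<pre>
import time
import requests

def post_with_retry(url, parameters, max_tries=10, max_delay=300):
    """Post to a MediaWiki-style API, backing off while the server reports lag."""
    session = requests.Session()
    delay = 5
    response_data = {}
    for attempt in range(max_tries):
        response_data = session.post(url, data=parameters).json()
        error_code = response_data.get('error', {}).get('code', '')
        if error_code != 'maxlag':   # not a lag error: either success or a different problem
            break
        time.sleep(delay)            # the server asked us to wait, so try again later
        delay = min(delay * 2, max_delay)
    return response_data
</pre>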
</div><div><br /></div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqaGmjya0-uXNbjHRcPr0rMHc-wXBDqorZo85rx-5CapVA4lSAfT1WOBCBEMFesQaZIBI8xs3jQkT5SSchlk3aSJAzIEIyvdMgSC12MZ8D5H4mxq_ECekIYazk-5SPEIzad9t0Yd_0huU/s1130/22view-added-reference.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Item web page showing added reference" border="0" data-original-height="914" data-original-width="1130" height="518" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqaGmjya0-uXNbjHRcPr0rMHc-wXBDqorZo85rx-5CapVA4lSAfT1WOBCBEMFesQaZIBI8xs3jQkT5SSchlk3aSJAzIEIyvdMgSC12MZ8D5H4mxq_ECekIYazk-5SPEIzad9t0Yd_0huU/w640-h518/22view-added-reference.png" width="640" /></a></div><br /> When I reload the Juan Jose Garza page, I see that the date of birth statement now has a reference where there was none before.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjk35k4-d_7PD1S3f4Ssjq4n85-JdS6t7bnWdoY5zeXPQAAG28WkR6CpRle2f38vHw2doMVRJ0Rqe4joe06X6MV1DfsLwSE6knD8WhbA_AJEfNhYHe_0Ib51d7MYzEdU6P_w_IsNNyKGo8/s1130/24view-new-marie.png" style="margin-left: 1em; margin-right: 1em;"><img alt="new country statement without qualifier" border="0" data-original-height="914" data-original-width="1130" height="518" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjk35k4-d_7PD1S3f4Ssjq4n85-JdS6t7bnWdoY5zeXPQAAG28WkR6CpRle2f38vHw2doMVRJ0Rqe4joe06X6MV1DfsLwSE6knD8WhbA_AJEfNhYHe_0Ib51d7MYzEdU6P_w_IsNNyKGo8/w640-h518/24view-new-marie.png" width="640" /></a></div><br /><div>Reloading Marie Gareau's page shows the new country statement. You may have noticed that I filled in the country value without giving any start date qualifier value for the statement in the <span style="font-family: courier;">country_startDate_val</span> column. The Wikidata Query Service treats qualifiers and references differently in that it assigns IRI identifiers to references, but does not assign them to qualifiers. Because VanderBot is designed to get information about specific metadata about items using the Query Service, it does not capture and store any identifier for qualifiers. Thus it is currently not possible to add a qualifier to a statement once the statement has been created. This behavior may be modified at some point in the future, but for now you should be aware of that limitation. </div><div><br /></div><h2 style="text-align: left;">Who's responsible for what just happened?</h2><div>We can check the revision history of the Juan Jose Garza page to see how the edits we made were recorded. 
</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgr5LgGchb21Uc9J0wMTxvfaqZ1YcJHEKgDmq18rzAchyxbvRco5SySq0qZLyOnLkQjDgJI6UA4GgCR6KNFupmfOvqNeHRlTuXPD03k4jH47UiayaC2Ug4ZVjJdKehMjpIY3xdpCNOk450/s1309/23revision-history.png" style="margin-left: 1em; margin-right: 1em;"><img alt="Revision history of new item page" border="0" data-original-height="493" data-original-width="1309" height="242" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgr5LgGchb21Uc9J0wMTxvfaqZ1YcJHEKgDmq18rzAchyxbvRco5SySq0qZLyOnLkQjDgJI6UA4GgCR6KNFupmfOvqNeHRlTuXPD03k4jH47UiayaC2Ug4ZVjJdKehMjpIY3xdpCNOk450/w640-h242/23revision-history.png" width="640" /></a></div><br /><div>Notice that since the edits were made by a script using a bot password associated with my user account (Baskaufs), the edits were credited to me just as if I had made them by hand using the graphical interface. One difference is that the original item was created using a single API interaction. So even though it involved creating a label, a description, and two statements, it was recorded as a single edit instead of four. </div><div><br /></div><div>The benefit of editing as many parts of the item metadata at once as possible is that the interactions with the API are the rate-limiting factor when writing data to the API. VanderBot only makes one API call per row, even if the row contains many more columns than in this simple example. So it can make the edits much faster all at once than it could if it did them all separately.</div><div><br /></div><div>Notice also that there is no record here that the VanderBot script was used. It identified itself to the API through its User-Agent HTTP header when it communicated with the server, and it was a "good citizen" by waiting to retry when the server reported that it was lagged. But there is no record of that interaction in the revision history.</div><div><br /></div><div>To see the final state of the CSV file after all of the uploads shown here, see <a href="https://gist.github.com/baskaufs/ead5484bd579a5f03fe10a5326df236d" target="_blank">this gist</a>.</div><div><br /></div><h2 style="text-align: left;">What should you do next?</h2><div>While you have the spreadsheet and JSON metadata description file set up, you should do a lot more experimenting. There is really nothing that you can "break" on either test.wikidata.org or VanderBot. In particular, you should try doing the following things to see what happens. Some of them are "wrong" things that produce bad results or aren't allowed by the API, while others are harmless or fine. If the script doesn't crash, reload the page or search for the new item in the graphical interface to see what happened. </div><div><br /></div><div><ol style="text-align: left;"><li>Create another row in the spreadsheet where the label and description are the same as an existing item. What happens? (When writing to the "real" Wikidata, there is code in VanderBot that tries to prevent this, but it doesn't work with the test instance.)</li><li>What happens if either the label or the description (but not both) is the same as an existing item? 
Can you create an item that is missing either a label or a description?</li><li>What happens if you delete the uuid identifier for a statement property and run the script again?</li><li>What happens if you delete the uuid identifier for a statement property, change the value, and run the script again?</li><li>What happens if you delete a reference hash identifier and replace the reference value with a different one?</li><li>What happens if you leave off part of a date (e.g. 1997-9-23 with no leading zero for the month)? You may need to change the cell format to "text" in order to be able to make this mistake in your spreadsheet program. </li><li>What happens if you have the correct number of characters in a date, but the date is malformed (e.g. 199x-09-23)?</li><li>What happens if you change a value, but do not delete the corresponding identifier associated with the value?</li></ol><div>In the <a href="http://baskauf.blogspot.com/2021/03/writing-your-own-data-to-wikidata-using_7.html">next blog post</a>, we will switch to writing to the "real" Wikidata, where we would prefer not to make these kinds of mistakes.</div></div><div><br /></div><div>Answers are below.</div><div>.</div><div>,</div><div>,</div><div>,</div><div>,</div><div>,</div><div>,</div><div>,</div><div>,</div><div>,</div><div>,</div><div>,</div><div>.</div><div>.</div><div>.</div><div>.</div><div>.</div><div>.</div><div>.</div><div>.</div><div>.</div><div>.</div><div>.</div><div>.</div><div>.</div><div>.</div><div>.</div><div>Answers:</div><div><ol style="text-align: left;"><li>The API responds with an error message and the script ends prematurely. </li><li>A new item will be created with the same label (or description). You can also create items lacking either a label or a description (but not both).</li><li>A duplicate statement will be created. This is a bad practice.</li><li>A second value will be added for the property. This is perfectly fine as long as the second value is correct information.</li><li>The new reference gets added as a second reference for the same statement. This is perfectly fine.</li><li>The script does nothing and reports that there was an incorrectly formatted date in that row.</li><li>The script tries to write the value, but the API returns an error message saying that the date format is bad. The script then stops running.</li><li>Nothing happens. The script doesn't look at the value if it already has an identifier associated with it.</li></ol><div><br /></div></div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><div><br /></div><p></p>Steve Baskaufhttp://www.blogger.com/profile/01896499749604153763noreply@blogger.com0tag:blogger.com,1999:blog-5299754536670281996.post-3985756916823932362020-03-05T08:12:00.002-08:002020-03-05T08:41:34.116-08:00TDWG gets 5 Stars!<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://www.w3.org/DesignIssues/diagrams/lod/597992118v2_350x350_Back.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="350" data-original-width="350" height="320" src="https://www.w3.org/DesignIssues/diagrams/lod/597992118v2_350x350_Back.jpg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Photo from W3C <a href="https://www.w3.org/DesignIssues/LinkedData.html">https://www.w3.org/DesignIssues/LinkedData.html</a></td></tr>
</tbody></table>
<h2>
<br />
TDWG IRIs are dereferenceable with content negotiation!</h2>
<br />
Yesterday was a happy day for me because after several years of work, the switch was flipped and all of the IRIs minted by TDWG under the <span style="font-family: "courier new" , "courier" , monospace;">rs.tdwg.org</span> subdomain became <b><i>dereferenceable with content negotiation</i></b> in most cases. For those readers who aren't hard-core Linked Open Data (LOD) buffs, I'll explain what that means.<br />
<br />
An <a href="https://tools.ietf.org/html/rfc3987" target="_blank">internationalized resource identifier</a> (IRI; superset of uniform resource identifiers, URIs) is a globally unique identifier that generally looks like a familiar web URL. It usually starts with <span style="font-family: "courier new" , "courier" , monospace;">http://</span> or <span style="font-family: "courier new" , "courier" , monospace;">https://</span>, which implies that something will happen if you put it in a web browser. That "something" is <i>dereferencing</i>: the browser uses the IRI to try to retrieve a document from a remote server and, if successful, a web page shows up in the browser. Because a browser's job is to retrieve web pages, when it dereferences an IRI, it asks for a particular "content type" (<span style="font-family: "courier new" , "courier" , monospace;">text/html</span>) indicating that it wants an HTML web page.<br />
<br />
But there are other kinds of software designed to retrieve documents that are readable by machines rather than by humans. When those applications dereference an IRI, they ask for other content types (like <span style="font-family: "courier new" , "courier" , monospace;">text/turtle</span> or <span style="font-family: "courier new" , "courier" , monospace;">application/rdf+xml</span>) that can be interpreted as structured data and be integrated with data from other sources. The same IRI can be used to retrieve different documents that provide the same information in different formats depending on the content type that is requested. The process of determining what kind of document to return to the requesting application is called <i>content negotiation</i>.<br />
<br />
In the past, the behavior of TDWG IRIs was inconsistent. Some IRIs like those of Darwin Core terms would retrieve a web page in a browser and provide machine-readable RDF/XML when requested. Other IRIs like those of Audubon Core terms would retrieve a web page, but no machine-readable formats. Obsolete IRIs like those of old versions of Darwin Core and the defunct TDWG ontology did nothing at all. Then there were many TDWG resources, such as old standards documents, that didn't even have IRIs.<br />
<br />
In an <a href="http://baskauf.blogspot.com/2019/03/understanding-tdwg-standards_10.html" target="_blank">earlier blog post</a>, I described the IRI patterns that I established in order to be able to denote all of the kinds of TDWG standards components that were described in the <a href="https://github.com/tdwg/vocab/blob/master/sds/documentation-specification.md" target="_blank">TDWG Standards Documentation Specification</a>. Those patterns made it possible to use IRIs to refer to things like vocabularies, term lists, and documents in a consistent way. Just creating the IRI patterns and using them to assign IRIs to vocabularies and documents provided a way to uniquely identify those resources, but did not create the "magic" of actually making it possible to use those IRIs to retrieve information. That's what happened yesterday.<br />
<br />
<br />
<h2>
What happens when the IRIs are dereferenced?</h2>
The action that takes place when an <span style="font-family: "courier new" , "courier" , monospace;">rs.tdwg.org</span> IRI is dereferenced depends on the category of the resource and the content type that's requested. There are four categories of behavior that vary primarily in how they deliver human-readable content.<br />
<br />
1. <b>"Living" TDWG vocabulary terms.</b> When a term from one of the actively maintained TDWG vocabularies (currently Darwin Core and Audubon Core) is dereferenced, the browser is redirected to the most helpful reference document for that vocabulary (the Quick Reference Guide for Darwin Core and the Term List document for Audubon Core). You can try this with <span style="font-family: "courier new" , "courier" , monospace;">dwc:recordedBy</span>, <a href="http://rs.tdwg.org/dwc/terms/recordedBy" target="_blank">http://rs.tdwg.org/dwc/terms/recordedBy</a> and <span style="font-family: "courier new" , "courier" , monospace;">ac:caption,</span> <a href="http://rs.tdwg.org/ac/terms/caption" target="_blank">http://rs.tdwg.org/ac/terms/caption</a>.<br />
<br />
2. <b>Obsolete TDWG vocabulary terms, vocabularies, term lists, and special categories of resources. </b>When terms in these categories are dereferenced, a generic web page is generated by a script that provides vanilla information about the term. The same is true for some special categories like Executive Committee decisions. Try it with an obsolete term <a href="http://rs.tdwg.org/dwc/curatorial/Disposition" target="_blank">http://rs.tdwg.org/dwc/curatorial/Disposition</a>, a decision <a href="http://rs.tdwg.org/decisions/decision-2011-10-16_6" target="_blank">http://rs.tdwg.org/decisions/decision-2011-10-16_6</a> and a term list <a href="http://rs.tdwg.org/ac/xmp/" target="_blank">http://rs.tdwg.org/ac/xmp/</a>.<br />
<br />
3. <b>TDWG-maintained standards documents. </b>The maintenance of TDWG standards documents is idiosyncratic and their location depends on where their maintainers happened to have stashed them. The URLs used to retrieve the documents might change if they are put into different places or if their format changes (e.g. changed from PDF to Markdown). To provide a stable way to denote those documents, the IRIs minted in <span style="font-family: "courier new" , "courier" , monospace;">rs.tdwg.org</span> subdomain redirect to whatever current URL delivers that particular document. If the document moves or the access URL changes for some reason, the stable IRI will redirect to the new access URL. Try it with the TDWG Vocabulary Maintenance Specification <a href="http://rs.tdwg.org/vms/doc/specification/" target="_blank">http://rs.tdwg.org/vms/doc/specification/</a>, the Audubon Core Structure document <a href="http://rs.tdwg.org/ac/doc/structure/" target="_blank">http://rs.tdwg.org/ac/doc/structure/</a>, and the TAPIR Protocol Specification <a href="http://rs.tdwg.org/tapir/doc/specification/" target="_blank">http://rs.tdwg.org/tapir/doc/specification/</a>.<br />
<br />
4. <b>Non-TDWG-maintained standards documents. </b>A lot of the old TDWG standards were not actually published by TDWG, and their maintenance is carried out by organizations whose websites are not under TDWG control. So we will just try to keep the TDWG-issued document IRIs pointing at whatever the access URL is currently for the document. Examples: Economic Botany Data Collection Standard specification <a href="http://rs.tdwg.org/ebdc/doc/specification/" target="_blank">http://rs.tdwg.org/ebdc/doc/specification/</a>, Taxonomic Literature : A Selective Guide to Botanical Publications and Collections with Dates, Commentaries and Types (Second edition, vol. 1) <a href="http://rs.tdwg.org/tl/doc/v1/" target="_blank">http://rs.tdwg.org/tl/doc/v1/</a>, and Index Herbariorum <a href="http://rs.tdwg.org/ih/doc/book/" target="_blank">http://rs.tdwg.org/ih/doc/book/</a>.<br />
<br />
<b>Machine-readable metadata</b><br />
For these categories, the machine readable metadata is delivered in the same way: generated by script from the data in the <a href="https://github.com/tdwg/rs.tdwg.org" target="_blank">rs.tdwg.org Github repository</a>. To access the content through content negotiation, you can dereference any of the IRIs above using software like <a href="https://www.postman.com/" target="_blank">Postman</a> that will allow you to specify an <span style="font-family: "courier new" , "courier" , monospace;">Accept</span> header for the machine-readable content type that you want (<span style="font-family: "courier new" , "courier" , monospace;">text/turtle</span> or <span style="font-family: "courier new" , "courier" , monospace;">application/rdf+xml</span>). To access the machine-readable documents directly, drop any trailing slashes and append <span style="font-family: "courier new" , "courier" , monospace;">.ttl </span>or <span style="font-family: "courier new" , "courier" , monospace;">.rdf</span> to access RDF/Turtle or RDF/XML respectively. Examples: <a href="http://rs.tdwg.org/dwc/terms/recordedBy.ttl" target="_blank">http://rs.tdwg.org/dwc/terms/recordedBy.ttl</a>, <a href="http://rs.tdwg.org/dwc/terms/recordedBy.rdf" target="_blank">http://rs.tdwg.org/dwc/terms/recordedBy.rdf</a>, and <a href="http://rs.tdwg.org/tl/doc/v1.ttl" target="_blank">http://rs.tdwg.org/tl/doc/v1.ttl</a>.<br />
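If you would rather do the content negotiation from code than from Postman, here is a minimal sketch using the Python <span style="font-family: "courier new" , "courier" , monospace;">requests</span> library (any HTTP client that lets you set an <span style="font-family: "courier new" , "courier" , monospace;">Accept</span> header will behave the same way):<br />
<pre>
import requests

# Ask for machine-readable RDF/Turtle instead of a web page by setting the Accept header.
# The library follows the redirect that the server issues during content negotiation.
iri = 'http://rs.tdwg.org/dwc/terms/recordedBy'
response = requests.get(iri, headers={'Accept': 'text/turtle'})
print(response.text)
</pre>
Requesting <span style="font-family: "courier new" , "courier" , monospace;">application/rdf+xml</span> instead should get you the RDF/XML version of the same metadata.<br />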
<br />
There are also a number of legacy XML schemas that are still being retrieved by some applications and they are made available by just redirecting from the <span style="font-family: "courier new" , "courier" , monospace;">rs.tdwg.org</span> IRI to wherever the schema lives. Example: <a href="http://rs.tdwg.org/dwc/text/tdwg_dwc_text.xsd" target="_blank">http://rs.tdwg.org/dwc/text/tdwg_dwc_text.xsd</a> .<br />
<br />
<b>How this happens</b><br />
The <a href="https://github.com/tdwg/rs.tdwg.org/blob/master/html/restxq.xqm" target="_blank">script that handles all of these many variations</a> of IRIs is written in XQuery (a functional programming language designed to process XML) and runs on a <a href="http://docs.basex.org/wiki/RESTXQ" target="_blank">BaseX server</a> instance. A <a href="https://github.com/tdwg/rs.tdwg.org/blob/master/html/html.xqm" target="_blank">second XQuery script </a>generates the vanilla HTML web pages from the same data as the machine-readable metadata. I've written more extensively about this approach in <a href="http://baskauf.blogspot.com/2017/03/a-web-service-with-content-negotiation.html" target="_blank">an earlier post</a>, so I won't say more about it here.<br />
<br />
There was a lot of concern about maintaining a server that is based on a programming language that is not well-known among IT professionals. So it's likely that in the future the XQuery-based system will be replaced by something else. I'd like to use something based on the <a href="https://www.w3.org/TR/csv2rdf/" target="_blank">W3C Generating RDF from Tabular Data on the Web Recommendation</a>, since the source data live as CSV files on Github. But for now, this is what we have.<br />
<br />
<h2>
5 Stars???</h2>
The title of this post says that TDWG now gets 5 stars. What does that mean? In 2010, <a href="https://www.w3.org/DesignIssues/LinkedData.html" target="_blank">Tim Berners-Lee promoted a 5 star system</a> to rate the extent to which data sources are freely available in machine-readable form. The TDWG standards metadata have been available online in structured form under an open license (stars 1 through 3), but failed to achieve 5 stars since standards-based machine readable data (RDF) couldn't be acquired by dereferencing the IRIs (star 4) and the resources weren't linked to others in the machine-readable metadata (star 5). As of yesterday, we can tick off stars 4 and 5, so the TDWG standards metadata are now fully compliant with Linked Open Data best practices. Congratulations TDWG!<br />
<br />
Special thanks to Matt Blissett of GBIF for working out the technical details of setting up the server and production protocol and to Tim Robertson of GBIF for his support in getting this done. Thanks also to Cliff Anderson and the XQuery Working Group of the Vanderbilt University Heard Library for introducing me to BaseX server.<br />
<br />
<br />
<br />
<div>
<br /></div>
Steve Baskaufhttp://www.blogger.com/profile/01896499749604153763noreply@blogger.com3tag:blogger.com,1999:blog-5299754536670281996.post-69835889326768545992020-02-08T07:28:00.002-08:002020-02-08T10:40:22.868-08:00VanderBot part 4: Preparing data to send to Wikidata<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj4MaVLvnwqh_fDhADRkaW0V17G5sq-qfwvNDrN-vGjxtddQsjGnhsEGl05ieP0CAYuvQN7wF0j8YDo2jVf2eKXpev9WRnc4CirXeoG78RugOxjHwQgAlSKSXdk6cZrTorEK-GHx77k1xU/s1600/diagram17.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="601" data-original-width="797" height="482" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj4MaVLvnwqh_fDhADRkaW0V17G5sq-qfwvNDrN-vGjxtddQsjGnhsEGl05ieP0CAYuvQN7wF0j8YDo2jVf2eKXpev9WRnc4CirXeoG78RugOxjHwQgAlSKSXdk6cZrTorEK-GHx77k1xU/s640/diagram17.png" width="640" /></a></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
In the <a href="http://baskauf.blogspot.com/2020/02/vanderbot-part-3-writing-data-from-csv.html" target="_blank">previous blog post</a>, I described how I used a Python
script to upload data stored in a CSV spreadsheet to Wikidata via the Wikidata API.<span style="mso-spacerun: yes;"> </span>I noted that the spreadsheet
contained information about whether data were already in Wikidata and if they needed to be written to the API, but I did not say how I acquired those data, nor how I
determined whether they needed to be uploaded or not. That data acquisition and processing is the topic of this post.</div>
<br />
<div class="MsoNormal">
The overall goal of the VanderBot project is to enter data
about Vanderbilt employees (scholars and researchers) and their academic publications into Wikidata.
Thus far in the project, I have focused primarily on acquiring and uploading data about the
employees. The data acquisition process has three stages:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
1. Acquiring the names of research employees (faculty,
postdocs, and research staff) in departments of the university.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
2. Determining whether those employees were already present
in Wikidata or if items needed to be created for them.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
3. Generating data required to make key statements about the
employees and determining whether those statements (and associated references) had
already been asserted in Wikidata.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
The data harvesting script (coded in Python) required to
carry out these processes is available via a <a href="https://github.com/HeardLibrary/linked-data/blob/master/publications/process_department.ipynb" target="_blank">Jupyter notebook on GitHub</a>.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjEkcScUlXLcH27rRh2x6noUdII5ZcrUK8CVZWx_Rni0KJWD-XUiFSmObQHJvytQWyx9zM8bd-RW8XzasRVQ87E9aS2OECqhv0HMuI_S-wvScHrjbeZkQ507fAevvNhOnGTE50fBuaw9V4/s1600/diagram18.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="598" data-original-width="780" height="489" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjEkcScUlXLcH27rRh2x6noUdII5ZcrUK8CVZWx_Rni0KJWD-XUiFSmObQHJvytQWyx9zM8bd-RW8XzasRVQ87E9aS2OECqhv0HMuI_S-wvScHrjbeZkQ507fAevvNhOnGTE50fBuaw9V4/s640/diagram18.png" width="640" /></a></div>
<br />
<br />
<h2>
Acquire names of research employees at Vanderbilt</h2>
<br />
<h4>
Scrape departmental website</h4>
<div class="MsoNormal">
I've linked employees to Vanderbilt through
their departmental affiliations. Therefore, the first task was to create items
for departments in the various schools and colleges of Vanderbilt University. I
won't go into detail about that process other than to say that the hacky code I
used to do it is <a href="https://github.com/HeardLibrary/linked-data/tree/master/publications/departments" target="_blank">on GitHub</a>.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
The actual names of the employees were acquired by scraping departmental
faculty and staff web pages. I developed the scraping script based on the web
page of my old department, biological sciences. Fortunately, the same page template was used
by many other departments in both the College of Arts and Sciences and the
Peabody College of Education, so I was able to scrape about 2/3 of the
departments in those schools without modifying the script I developed for the biological sciences department.
<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Because the departments had differing numbers of researcher
pages covering different categories of researchers, I created a JSON
configuration file where I recorded the base departmental URLs and the strings
appended to that base to generate each of the researcher pages. The
configuration file also included some other data needed by the script, such as
the department's Wikidata Q ID, a generic description to use for researchers
in the department (if they didn’t already have a description), and some strings
that I used for fuzzy matching with other records (described later).
Some sample JSON is included in the comments near the top of the script.<o:p></o:p></div>
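<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
To give a flavor of what such a configuration might look like, here is a purely illustrative record. The key names, URL, and Q ID are all invented; the real keys are documented in the comments at the top of the script.</div>
<pre>
{
  "departmentQId": "Q98765432",
  "baseUrl": "https://example.vanderbilt.edu/biosci/",
  "pageStrings": ["people/faculty/", "people/research-staff/", "people/postdocs/"],
  "defaultDescription": "researcher in biological sciences at Vanderbilt University",
  "fuzzyMatchStrings": ["Biological Sciences", "Department of Biological Sciences"]
}
</pre>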
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
The result at the end of the "Scrape departmental
website" section of the code was a CSV file with the researcher names and
some other data that I made a feeble attempt to scrape, such as their title and
affiliation. <o:p></o:p></div>
<br />
<br />
<h4>
Search ORCID for Vanderbilt employees</h4>
<a href="https://orcid.org/" target="_blank">ORCID</a> (Open Researcher and Contributor ID) plays an
important part in disambiguating employees. Because ORCIDs are globally unique,
associating an employee name with an ORCID allows one to know that the employee is different from someone with the same name who has a different ORCID.<br />
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
For that reason, I began the disambiguation process by
performing a search for "Vanderbilt University" using the <a href="https://orcid.org/organizations/integrators/API" target="_blank">ORCID API</a>. The search produced several thousand results. I then dereferenced each of the
resulting ORCID URIs to capture the full data about the researcher. That required an API call for each record, and I used a quarter-second delay per call
to avoid hitting the API too fast. As a result, this stage of the process took
hours to run.<o:p></o:p></div>
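<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
In outline, those two API steps look something like the sketch below. The v3.0 public API endpoints and the search field shown here are illustrative assumptions, not necessarily exactly what the notebook uses:</div>
<div class="MsoNormal">
<br /></div>
<span style="font-family: Courier New, Courier, monospace;">import time<br />
import requests<br />
<br />
HEADERS = {'Accept': 'application/json'}<br />
<br />
# Step 1: search for ORCID records that mention Vanderbilt University.<br />
search = requests.get('https://pub.orcid.org/v3.0/search/',<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;params={'q': 'affiliation-org-name:"Vanderbilt University"', 'rows': 200},<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;headers=HEADERS).json()<br />
orcids = [hit['orcid-identifier']['path'] for hit in search.get('result', [])]<br />
<br />
# Step 2: dereference each ORCID URI, pausing a quarter second between calls.<br />
for orcid in orcids:<br />
&nbsp;&nbsp;&nbsp;&nbsp;record = requests.get('https://pub.orcid.org/v3.0/' + orcid + '/record', headers=HEADERS).json()<br />
&nbsp;&nbsp;&nbsp;&nbsp;# ... examine the employment affiliations in the record here ...<br />
&nbsp;&nbsp;&nbsp;&nbsp;time.sleep(0.25)</span><br />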
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
I screened the results by recording only those that listed
"Vanderbilt University" as part of the employments affiliation
organization string. That excluded people who were only students and never
employees, and included people whose affiliation was "Vanderbilt
University Medical Center", "Vanderbilt University School of
Nursing", etc. As part of the data recorded, I included their stated
departmental affiliations (some had multiple affiliations if they moved from
one department to another during their career). After this stage, I had 2240 name/department
records.<o:p></o:p></div>
<br />
<br />
<h4>
Fuzzy matching of departmental and ORCID records</h4>
The next stage of the process was to try to match employees
from the department that I was processing with the downloaded ORCID records. I
used a Python fuzzy string matching function called <span style="font-family: "courier new" , "courier" , monospace;">fuzz.token_set_ratio()</span> from
the <a href="https://github.com/seatgeek/fuzzywuzzy" target="_blank">fuzzywuzzy</a> package. I tested this function along with others in the package
and it was highly effective at matching names with minor variations (both
people and departmental names). Because this function was insensitive to word
order, it matched names like "Department of Microbiology" and
"Microbiology Department". However, it also made major errors for
name order reversals ("John James" and "James Johns", for
example) so I had an extra check for that.<br />
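<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
A quick illustration of that behavior (the exact scores can vary slightly between versions of the package):</div>
<div class="MsoNormal">
<br /></div>
<span style="font-family: Courier New, Courier, monospace;">from fuzzywuzzy import fuzz<br />
<br />
fuzz.token_set_ratio('Department of Microbiology', 'Microbiology Department')&nbsp; # 100: word order is ignored<br />
fuzz.token_set_ratio('Jacob Reynolds', 'Jacob C. Reynolds')&nbsp; # near 100: minor variation<br />
fuzz.token_set_ratio('John James', 'James John')&nbsp; # also 100, which is why the extra name-order check was needed</span><br />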
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
If the person's name had a match score of greater than 90
(out of 100), I then performed a match check against the listed department. If it
also had a match score of greater than 90, I assigned that ORCID to the person.
If no listed department match had a score over 90, I assigned the ORCID, but flagged
that match for manual checking later.<o:p></o:p></div>
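<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
In simplified form, that assignment logic was something like the following sketch; the variable and key names are invented for illustration:</div>
<div class="MsoNormal">
<br /></div>
<span style="font-family: Courier New, Courier, monospace;">from fuzzywuzzy import fuzz<br />
<br />
employee = {'name': 'Jacob Reynolds', 'department': 'Biological Sciences'}<br />
candidate = {'orcid': '0000-0000-0000-0000',&nbsp; # placeholder ORCID from the downloaded records<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;'name': 'Jacob C. Reynolds',<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;'departments': ['Dept. of Biological Sciences', 'Microbiology']}<br />
<br />
if fuzz.token_set_ratio(employee['name'], candidate['name']) > 90:<br />
&nbsp;&nbsp;&nbsp;&nbsp;dept_scores = [fuzz.token_set_ratio(employee['department'], d) for d in candidate['departments']]<br />
&nbsp;&nbsp;&nbsp;&nbsp;if dept_scores and max(dept_scores) > 90:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;employee['orcid'] = candidate['orcid']&nbsp; # confident assignment<br />
&nbsp;&nbsp;&nbsp;&nbsp;else:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;employee['orcid'] = candidate['orcid']<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;employee['flag'] = 'check manually'&nbsp; # name matched but no department corroboration</span><br />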
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjIUr2T9o7_LbDhEU6f24xW3RcOi2c6bGMfG9L0erOD9i5pb1Aanj0-JQlnpkJxLbSSorvfe1qPtBp7wSyXZh3LW6NJ_KyMXyR-veYV53-SHDBUdd2cgvHy2BMtF0i71-sLQdUoJ6gvTWM/s1600/diagram19.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="596" data-original-width="779" height="488" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjIUr2T9o7_LbDhEU6f24xW3RcOi2c6bGMfG9L0erOD9i5pb1Aanj0-JQlnpkJxLbSSorvfe1qPtBp7wSyXZh3LW6NJ_KyMXyR-veYV53-SHDBUdd2cgvHy2BMtF0i71-sLQdUoJ6gvTWM/s640/diagram19.png" width="640" /></a></div>
<br />
<h2>
Determine whether employees were already in Wikidata</h2>
<br />
<h4>
Attempt automated matching with people in Wikidata known to work at Vanderbilt</h4>
<div class="MsoNormal">
I was then ready to start trying to match people with existing
Wikidata records. The low-hanging fruit was people whose records already stated
that their employer was Vanderbilt University (Q29052). I ran a SPARQL query for that using the Wikidata Query Service. For each match, I also recorded the employee's
description, ORCID, start date, and end date (where available). <o:p></o:p></div>
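<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
A simplified version of that kind of query, run from Python against the query service (the query I actually used also pulled descriptions and start/end dates):</div>
<div class="MsoNormal">
<br /></div>
<span style="font-family: Courier New, Courier, monospace;">import requests<br />
<br />
query = '''SELECT DISTINCT ?person ?personLabel ?orcid WHERE {<br />
&nbsp;&nbsp;?person wdt:P108 wd:Q29052.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # employer: Vanderbilt University<br />
&nbsp;&nbsp;OPTIONAL { ?person wdt:P496 ?orcid. }&nbsp; # ORCID iD, if stated<br />
&nbsp;&nbsp;SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }<br />
}'''<br />
<br />
response = requests.get('https://query.wikidata.org/sparql', params={'query': query, 'format': 'json'})<br />
for row in response.json()['results']['bindings']:<br />
&nbsp;&nbsp;&nbsp;&nbsp;print(row['personLabel']['value'], row.get('orcid', {}).get('value', ''))</span><br />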
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Once I had those data, I checked each departmental employee's
record against the query results. If both the departmental employee and the
potential match from Wikidata had the same ORCID, then I knew that they were
the same person and I assigned the Wikidata Q ID to that employee. If the
employee had an ORCID I could exclude any Wikidata records with non-matching
ORCIDs and only check for name matches with Wikidata records that didn't have
ORCIDs.<span style="mso-spacerun: yes;"> </span>Getting a name match alone was
not a guarantee that the person in Wikidata was the same as the departmental
employee, but given that the pool of possible Wikidata matches only included
people employed at Vanderbilt, a good name match meant that it was probably the same person. If the person had
a description in Wikidata, I printed the two names and the description and
visually inspected the matches. For example, if there was a member of the
Biological Sciences department named Jacob Reynolds and someone in Wikidata
named Jacob C. Reynolds who was a microbiologist, the match was probably good.
On the other hand, if Jacob C. Reynolds was a historian, then some manual
checking was in order.<span style="mso-spacerun: yes;"> I did</span> a few
other tricks that you can see in the code.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
This "smart matching" with minimal human intervention was usually able to match a
small fraction of people in the department. But
there were plenty of departmental employees who were already in Wikidata
without any indication that they worked at Vanderbilt. The obvious way to look
for them would be to just do a SPARQL query for their name. There are some
features built in to SPARQL that allow for REGEX checks, but those features are
impossibly slow for a triplestore the size of Wikidata's. The strategy that I
settled on was to generate as many variations of the person's name as possible
and query for all of them at once. You can see what I did in the <span style="font-family: "courier new" , "courier" , monospace;">generateNameAlternatives()</span>
function in the code. I searched labels and aliases for: the full name, names with middle
initials with and without periods, first and middle initials with and without
periods, etc. This approach was pretty good at matching with the right people,
but it also matched with a lot of wrong people. For example, for Jacob C.
Reynolds, I would also search for J. C. Reynolds. If John C. Reynolds had J. C.
Reynolds as an alias, he would come up as a hit. I could have tried to automate the processing of the returned names more, but there usually weren't a lot of matches and with
the other screening criteria I applied, it was pretty easy for me to just look at the results and
bypass the false positives.</div>
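<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
A much-simplified sketch of the idea behind <span style="font-family: Courier New, Courier, monospace;">generateNameAlternatives()</span> (the real function handles more cases than this):</div>
<div class="MsoNormal">
<br /></div>
<span style="font-family: Courier New, Courier, monospace;">def generate_name_alternatives(first, middle, last):<br />
&nbsp;&nbsp;&nbsp;&nbsp;"""Return plausible label/alias forms for a name (simplified sketch)."""<br />
&nbsp;&nbsp;&nbsp;&nbsp;variants = {first + ' ' + last}<br />
&nbsp;&nbsp;&nbsp;&nbsp;if middle:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;variants.add(first + ' ' + middle + ' ' + last)&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # full middle name<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;variants.add(first + ' ' + middle[0] + '. ' + last)&nbsp;&nbsp;&nbsp;&nbsp; # middle initial with period<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;variants.add(first + ' ' + middle[0] + ' ' + last)&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # middle initial without period<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;variants.add(first[0] + '. ' + middle[0] + '. ' + last)&nbsp; # first and middle initials with periods<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;variants.add(first[0] + middle[0] + ' ' + last)&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; # both initials, no periods<br />
&nbsp;&nbsp;&nbsp;&nbsp;return sorted(variants)<br />
<br />
print(generate_name_alternatives('Jacob', 'Charles', 'Reynolds'))</span><br />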
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
When I did the query for the name alternatives, I downloaded the values for several
properties that were useful for eliminating hits. One important screen was to
eliminate any matching items that were instances of classes (P31) other than
human (Q5). I also screened out people who were listed as having died prior to
some set date (2000 worked well - some departments still listed recently
deceased emeriti and I didn't want to eliminate those).<span style="mso-spacerun: yes;"> </span>If both the employee and the name match in
Wikidata had ORCIDs that were different, I also eliminated the hit.<span style="mso-spacerun: yes;"> </span>For all matches that passed these screens, I
printed the description, occupation, and employer if they were given in
Wikidata. <o:p></o:p></div>
<br />
<br />
<h4>
Clues from publications in PubMed and Crossref</h4>
The other powerful tool I used for disambiguation was to look up any
articles linked to the putative Wikidata match.<span style="mso-spacerun: yes;"> </span>For each Wikidata person item who made it this far through
the screen, I did a SPARQL query to find works authored by that person. For up
to 10 works, I did the following.<span style="mso-spacerun: yes;"> </span>If the
article had a PubMed ID, I retrieved the article metadata from the <a href="https://www.ncbi.nlm.nih.gov/books/NBK25501/" target="_blank">PubMed API</a>
and tried to match against the author names. When I got a match with an author,
I checked for an ORCID match (or excluded if an ORCID mismatch) and also for a
fuzzy match against any affiliation that was given.<span style="mso-spacerun: yes;"> </span>If either an ORCID or affiliation matched, I
concluded that the departmental employee was the same as the Wikidata match and
stopped looking.<br />
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
If there was no match in PubMed and the article had a DOI, I
then retrieved the metadata about the article from the <a href="https://www.crossref.org/services/metadata-delivery/rest-api/" target="_blank">CrossRef API</a> and did the same
kind of screening that I did in PubMed.<span style="mso-spacerun: yes;"> </span><o:p></o:p></div>
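<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
The Crossref lookup itself is a single call per DOI. Here is a rough sketch of that kind of check, with a placeholder DOI and name and no error handling:</div>
<div class="MsoNormal">
<br /></div>
<span style="font-family: Courier New, Courier, monospace;">import requests<br />
from fuzzywuzzy import fuzz<br />
<br />
doi = '10.1000/example.doi'&nbsp; # placeholder<br />
work = requests.get('https://api.crossref.org/works/' + doi).json()['message']<br />
<br />
for author in work.get('author', []):<br />
&nbsp;&nbsp;&nbsp;&nbsp;author_name = (author.get('given', '') + ' ' + author.get('family', '')).strip()<br />
&nbsp;&nbsp;&nbsp;&nbsp;if fuzz.token_set_ratio(author_name, 'Jacob C. Reynolds') > 90:<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;orcid = author.get('ORCID', '')&nbsp; # given as a full URI when present<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;affiliations = [a.get('name', '') for a in author.get('affiliation', [])]<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;print(author_name, orcid, affiliations)</span><br />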
<br />
<br />
<h4>
Human intervention</h4>
If there was no automatic match via the article searches, I printed out the full set of information (description, employer, articles, etc.) for every name match, along with the name from the department and the name from Wikidata in order for a human to check whether any of the matches seemed plausible. In a lot of cases, it was easy to eliminate matches that had descriptions like "Ming Dynasty person" or occupation = "golfer". If there was uncertainty, the script printed hyperlinked Wikidata URLs and I could just click on them to examine the Wikidata record manually.<br />
<br />
Here's some typical output:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">--------------------------</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">No Wikidata name match: Justine Bruyère</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">--------------------------</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">No Wikidata name match: Nicole Chaput Guizani</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">--------------------------</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">SPARQL name search: Caroline Christopher</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(no ORCID)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">0 Wikidata ID: Q83552019 Name variant: Caroline Christopher <a href="https://www.wikidata.org/wiki/Q83552019" target="_blank">https://www.wikidata.org/wiki/Q83552019</a></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">No death date given.</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">description: human and organizational development educator</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">employer: Vanderbilt University</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">No articles authored by that person</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Employee: Caroline Christopher vs. name variant: Caroline Christopher</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">Enter the number of the matched entity, or press Enter/return if none match: 0</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">--------------------------</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">SPARQL name search: Paul Cobb</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">(no ORCID)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">0 Wikidata ID: Q28936750 Name variant: Paul Cobb <a href="https://www.wikidata.org/wiki/Q28936750" target="_blank">https://www.wikidata.org/wiki/Q28936750</a></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">No death date given.</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">description: association football player</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">occupation: association football player</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">No articles authored by that person</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Employee: Paul Cobb vs. name variant: Paul Cobb</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">1 Wikidata ID: Q55746009 Name variant: Paul Cobb <a href="https://www.wikidata.org/wiki/Q55746009" target="_blank">https://www.wikidata.org/wiki/Q55746009</a></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">No death date given.</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">description: American newspaper publisher</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">occupation: newspaper proprietor</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">No articles authored by that person</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Employee: Paul Cobb vs. name variant: Paul Cobb</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">Enter the number of the matched entity, or press Enter/return if none match: </span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">--------------------------</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">No Wikidata name match: Molly Collins</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">--------------------------</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">No Wikidata name match: Ana Christina da Silva [Iddings]</span><br />
<br />
<br />
Although this step did require human intervention, because of the large amount of information that the script collected about the Wikidata matches, it usually only took a few minutes to disambiguate a department with 30 to 50 employees.<br />
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPsVESAarTbzNBolXgPC-MtXplJMIIvNvJvwvykHBL9eoLniv5Ln3lLQC421PnPGhInYvHnnHAjGLQjEdKJuhUVIk8c8xmXUMB3ORCPxvKU-IGxpok50NFr6tD1nUKG1lIV68EIkwMpeM/s1600/diagram20.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="597" data-original-width="786" height="486" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPsVESAarTbzNBolXgPC-MtXplJMIIvNvJvwvykHBL9eoLniv5Ln3lLQC421PnPGhInYvHnnHAjGLQjEdKJuhUVIk8c8xmXUMB3ORCPxvKU-IGxpok50NFr6tD1nUKG1lIV68EIkwMpeM/s640/diagram20.png" width="640" /></a></div>
<div>
<br /></div>
<h2>
Generate statements and references and determine which were already in Wikidata</h2>
<div>
<br /></div>
<h4>
Generating data for a minimal set of properties</h4>
<div>
<div class="MsoNormal">
The next to last step was to assign values to a minimal set
of properties that I felt each employee should have in a Wikidata record.
Here's what I settled on for that minimal set:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;">P31 Q5 </span>(<i>instance of human</i>). This was automatically assigned
to all records.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;">P108 Q29052</span> (<i>employer Vanderbilt University</i>). This applies
to all employees in our project - the employer value can be set at the top of
the script.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;">P1416</span> <span style="font-family: "courier new" , "courier" , monospace;">[Q ID of department]</span> (<i>affiliation with focal
department</i>). After searching through many possible properties, I decided that
<span style="font-family: "courier new" , "courier" , monospace;">P1416 </span>(<i>affiliation</i>) was the best property to use to assert the employee's
connection to the department I was processing. <span style="font-family: "courier new" , "courier" , monospace;">P108 </span>was also possible, but
there were a lot of people with dual departmental appointments and I generally
didn't know which department was the actual "employer". Affiliation seemed
to be an appropriate connection for regular faculty, postdocs, visiting
faculty, research staff, and other kinds of statuses where the person would
have some kind of research or scholarly output. <o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;">P496 [ORCID identifier]</span>. ORCIDs that I'd acquired for the
employees were hard-won and an excellent means for anyone else to carry out
disambiguation, so I definitely wanted to include that assertion if I could. <o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;">P21 [sex or gender]</span>. I was really uncomfortable assigning a
value for this property, but this is a property often flagged by <a href="https://www.wikidata.org/wiki/Wikidata:Recoin" target="_blank">Recoin</a> as a top
missing property and I didn't want some overzealous editor deleting my new items because their metadata were too skimpy. Generally, the departmental web pages had photos
to go with the names, so I made a call and manually assigned a value for this
property (options: m=male, f=female, i=intersex, tf=transgender female,
tm=transgender male). Any time the sex or gender seemed uncertain, I did not
provide a value.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<i>The description</i>.<span style="mso-spacerun: yes;"> </span>I
made up a default description for the department, such as "biological
science researcher", "historian", or "American Studies
scholar" for the Biological Sciences, History, and American Studies
departments respectively. I did not overwrite any existing descriptions by
default, although as a last step I looked at the table to replace stupid ones
like "researcher, ORCID: 0000-0002-1234-5678". These defaults were
generally specific enough to prevent cases where the label/description
combination I was creating would collide with that of an existing record
and kill the record write. <o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
When it made sense, I added references to the statements I
was making. Generally, a reference is not expected for <i>instance of human</i> and I
really couldn't give a reference for <i>sex or gender</i>.<span style="mso-spacerun: yes;"> </span>For the <i>employer </i>and <i>affiliation </i>references,
I used the web page that I scraped to get their name as the <i>reference URL</i> and
provided the current date as the value for <span style="font-family: "courier new" , "courier" , monospace;">P813 </span>(<i>retrieved</i>).<span style="mso-spacerun: yes;"> </span>For ORCID, I created a reference that had a
<span style="font-family: "courier new" , "courier" , monospace;">P813 </span>(<i>retrieved</i>) property if I was able to successfully dereference the ORCID
URI. <o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Because each of these properties had different criteria for
assigning values and references, there was no standard code for assigning them.
The code for each property is annotated, so if you are interested you can look
at it to see how I made the assignments.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_xaBGWKHzOc5s-xnoSV9tBzXG6NeLT2nSuEfcKaweUBwx00MiZcucXnky3eEqvtOjfzzSX95bOV9tpEaW8lNxjIPrwKqtUOIwSaiTLIg1Fhw7Hbyl-osHe7iLWVu9Ry_BijnysBzuBAg/s1600/diagram21.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="528" data-original-width="975" height="346" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_xaBGWKHzOc5s-xnoSV9tBzXG6NeLT2nSuEfcKaweUBwx00MiZcucXnky3eEqvtOjfzzSX95bOV9tpEaW8lNxjIPrwKqtUOIwSaiTLIg1Fhw7Hbyl-osHe7iLWVu9Ry_BijnysBzuBAg/s640/diagram21.png" width="640" /></a></div>
<div>
<br /></div>
<h4>
Check for existing data in Wikidata</h4>
<div>
<div class="MsoNormal">
In the earlier posts, I said that I did not want VanderBot to create duplicate items, statements, and references when they already existed in Wikidata.
So a critical last step was to check for existing data using SPARQL. One
important thing to keep in mind is the Query Service Updater lag that I talked
about in the last post. That lag means that changes made up to 8 or 10 hours
ago would not be included in this download. However, given that the Wikidata researcher item records I'm dealing with do not change frequently, the lag generally wasn't a
problem. I should note that it would be possible to get these data directly
from the Wikidata API, but the convenience of getting exactly the information I wanted
using SPARQL outweighed my motivation to develop code to do that.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
At this point in the workflow, I've already determined with
a fairly high degree of confidence which of the departmental employees were
already in Wikidata. That takes care of the potential problem of creating duplicate item
records, and it also means that I do not need to check for the presence of
statements or references for any of the new items either.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
One interesting feature of SPARQL that I learned from this
project was using the <span style="font-family: "courier new" , "courier" , monospace;">VALUES </span>clause. Despite having used SPARQL for years and
skimming through the SPARQL specification several times, I missed it. The
<span style="font-family: "courier new" , "courier" , monospace;">VALUES </span>clause allows you to specify which values the query should use for a particular
variable in its pattern matching.<span style="mso-spacerun: yes;"> </span>That
makes querying a large triplestore like Wikidata much faster than without it
and it also reduces the number of results that the code has to sort through
when results come back from the query service. Here's an example of a query
using the <span style="font-family: "courier new" , "courier" , monospace;">VALUES </span>clause that you can test at the <a href="https://query.wikidata.org/" target="_blank">Wikidata Query Service</a>:<br />
<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;">SELECT DISTINCT ?id ?statement WHERE {<o:p></o:p></span></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;">VALUES ?id {<o:p></o:p></span></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;"><span style="mso-spacerun: yes;"> </span>wd:Q4958<o:p></o:p></span></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;"><span style="mso-spacerun: yes;"> </span>wd:Q39993<o:p></o:p></span></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;"><span style="mso-spacerun: yes;"> </span>wd:Q234<o:p></o:p></span></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;"><span style="mso-spacerun: yes;"> </span>}<o:p></o:p></span></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;">?id p:P31 ?statement.<o:p></o:p></span></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;">}</span><o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="MsoNormal">
So the first part of the last step in the workflow is to
generate a list of all of the existing item Q IDs for employees in the
department. That list is passed to the <span style="font-family: "courier new" , "courier" , monospace;">searchStatementAtWikidata()</span> function as
its first argument. <span style="font-family: "courier new" , "courier" , monospace;">searchStatementAtWikidata()</span> is a general purpose function
that will search Wikidata for a particular property of items in the generated list. It can be used either to search for a particular property and value (like
<span style="font-family: "courier new" , "courier" , monospace;">P108 Q29052</span>, <i>employer Vanderbilt University</i>) and retrieve the references for
that statement, or for only the property (like <span style="font-family: "courier new" , "courier" , monospace;">P496</span>, <i>ORCID</i>) and retrieve both
the values and references associated with those statements.<span style="mso-spacerun: yes;"> </span>This behavior is controlled by whether an
empty string is sent for the value argument or not.<span style="mso-spacerun: yes;"> </span>For each of the minimal set of properties
that I'm tracking for departmental employees, the <span style="font-family: "courier new" , "courier" , monospace;">searchStatementAtWikidata()</span>
function is used to retrieve any available data for the listed employees. Those data are
then matched with the appropriate employee records and recorded in the CSV file
along with the previously generated property values. <o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
In addition to the property checks, labels, descriptions,
and aliases for the list of employees are retrieved via SPARQL queries. In the
cases of labels and descriptions, if there is an existing label or description
in Wikidata, it is written to the CSV file. If there is no existing label, the
name scraped from the departmental website is written to the CSV as the label.
If there is no existing description, the default description for the department
is written to the CSV. Whatever alias lists are retrieved from Wikidata
(including empty ones) are written to the CSV.<o:p></o:p><br />
<br /></div>
</div>
<h4>
Final manual curation prior to writing to the Wikidata API</h4>
<div>
<div class="MsoNormal">
In theory, the CSV file resulting from the previous step should
contain all of the information needed by the API-writing script that was
discussed in the last post. However, I always manually examine the CSV to look
for problems or things that are just stupid, such as bad descriptions. <o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
If a description or label is changed, the API-writing script
will detect that it's different from the current value being provided by the
SPARQL endpoint and the new description or label will overwrite the existing
one. The API-writing script is currently not very sophisticated about how it
handles aliases. If there are more aliases in the CSV than are currently in
Wikidata, the script will overwrite existing aliases in Wikidata with those in the
spreadsheet. The assumption is that alias lists are only added to, rather than
aliases being changed or deleted.<span style="mso-spacerun: yes;"> </span>At
some point in the future, I intend to write a separate script that will handle
labels and aliases in a more robust way, so I really didn't want to waste time
now on making the alias-handling better than it is. </div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
A typical situation is to
discover a more specific label for the person than already exists in Wikidata.
In that case, I usually add the existing label to the alias list, and replace
the label value in the CSV with the better new one. <b>WARNING!</b> If you edit the
alias list, make sure that your editor uses generic straight quotes (ASCII <span style="font-family: "courier new" , "courier" , monospace;">34</span>/Unicode
<span style="font-family: "courier new" , "courier" , monospace;">U+0022</span>) and not "smart quotes". Smart quotes have a different Unicode value
and will break the script. <a href="https://www.openoffice.org/" target="_blank">Open Office</a>/<a href="https://www.libreoffice.org/" target="_blank">Libre Office</a> (the best applications for editing CSVs in my opinion)
default to smart quotes, so this setting must be turned off manually.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
I also just look over the rest of the spreadsheet to
convince myself that nothing weird is going on. Usually the script does an
effective job of downloading the correct reference properties and values, but
I've discovered some odd situations that have caused problems. <o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJWOeuqH6OVL-NGjf9BqV9_LOK7UoKWSU9oV_640OnwkHtTFFqilJRWFmLek6sD1wcVHkjCwvBBH3kKtK91fIuM_6AIlgBKt-P7vskQxkfLdNd-BGRYMbm6q6Hv5KG7-6UU4wkCCrbasc/s1600/diagram2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="534" data-original-width="975" height="350" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhJWOeuqH6OVL-NGjf9BqV9_LOK7UoKWSU9oV_640OnwkHtTFFqilJRWFmLek6sD1wcVHkjCwvBBH3kKtK91fIuM_6AIlgBKt-P7vskQxkfLdNd-BGRYMbm6q6Hv5KG7-6UU4wkCCrbasc/s640/diagram2.png" width="640" /></a></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
At this point, steps 1 and 2 in the VanderBot diagram have
been completed by the data harvesting script, and the API-writing script
described in the last post is ready to take over in step 3.<span style="mso-spacerun: yes;"> </span>When step 4 is complete, the blank cells in
the CSV for missing item, statement, and reference identifiers should all
be filled in and the CSV can be filed for future reference. <o:p></o:p><br />
<br /></div>
</div>
<h2>
Final thoughts</h2>
<div>
<div class="MsoNormal">
<br />
I tried to make the API writing script generic and
adaptable for writing statements and references about any kind of entity. That's
achievable simply by editing the JSON schema file that maps the columns in the
source CSV. However, getting the values for that CSV is the tricky part. If one
were confident that only new items were being written, then the table could
be filled with only the data to be written and without any item, statement, or
reference identifiers.<span style="mso-spacerun: yes;"> </span>That would be the
case if you were using the script to load your own Wikibase instance. However,
for adding data to Wikidata about most items like people or references, one can't
know if the data needs to be written or not, and that's why a complex and
somewhat idiosyncratic script like the data harvesting script is necessary. So there's no "magic bullet" that will make it possible to automatically know whether you can write data to Wikidata without creating duplicate assertions.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
To find records that VanderBot has put into Wikidata, try this query at the <a href="https://query.wikidata.org/" target="_blank">Wikidata Query Service</a>:</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;">select distinct ?employee where {</span></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;"> ?employee wdt:P1416/wdt:P749+ wd:Q29052.</span></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;"> }</span></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;">limit 50</span></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
The triple pattern requires that the employee first have an <i>affiliation</i> (<span style="font-family: "courier new" , "courier" , monospace;">P1416</span>) to some item, and that item be linked by one or more <i>parent organization</i> (<span style="font-family: "courier new" , "courier" , monospace;">P749</span>) links to Vanderbilt University (<span style="font-family: "courier new" , "courier" , monospace;">Q29052</span>). I linked the department items to their parent school or college using <span style="font-family: "courier new" , "courier" , monospace;">P749 </span>and made sure that the University's schools and colleges were all linked to the University by <span style="font-family: "courier new" , "courier" , monospace;">P749 </span>as well. However, some schools like the Blair School of Music do not really have departments, so their employees were affiliated directly to the school or college rather than a department. So the search has to pick up administrative entity items that were either one or two <span style="font-family: "courier new" , "courier" , monospace;">P749 </span>links from the university (hence the "+" property path operator after <span style="font-family: "courier new" , "courier" , monospace;">P749</span>, which matches one or more links). Since there are a lot of employees, I limited the results to 50. If you click on any of the results, it will take you to the item page and you can view the page history to confirm that VanderBot has made edits to the page. (At some point, there may be people who were linked in this way by an account other than VanderBot, but thus far, VanderBot is probably the only editor of Vanderbilt employee items that's linking to departments by <span style="font-family: Courier New, Courier, monospace;">P1416</span>, given that I recently created all of the department items from scratch.)</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
A variation of that query will tell you the number of records meeting the criteria of the previous query:</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;">select (count(?employee) as ?count) where {</span></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;"> ?employee wdt:P1416/wdt:P749+ wd:Q29052.</span></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;"> }</span></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
As of 2020-02-08, there are 1221 results. That number should grow as I use VanderBot to process other departments.</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<br /></div>
</div>
Steve Baskaufhttp://www.blogger.com/profile/01896499749604153763noreply@blogger.com0tag:blogger.com,1999:blog-5299754536670281996.post-13624778160709014472020-02-07T14:49:00.001-08:002020-02-08T19:03:21.410-08:00VanderBot part 3: Writing data from a CSV file to Wikidata<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDq5PGdWsB-Kh8NxHHT0OxQlCJWCoIdpXcrtf8JasiDWMEvufhX3JDpBYFNJizS3dZ47_9-Yrc0wOHV83vKQ2ueHRnkTh08AUoQwgQuGDLEIxJUAFf_apqmQ6VAJH16rz1iWCik1DAqLg/s1600/diagram12.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="533" data-original-width="974" height="350" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiDq5PGdWsB-Kh8NxHHT0OxQlCJWCoIdpXcrtf8JasiDWMEvufhX3JDpBYFNJizS3dZ47_9-Yrc0wOHV83vKQ2ueHRnkTh08AUoQwgQuGDLEIxJUAFf_apqmQ6VAJH16rz1iWCik1DAqLg/s640/diagram12.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
In <a href="http://baskauf.blogspot.com/2020/02/vanderbot-part-2-wikibase-data-model.html" target="_blank">the previous post of this series</a>, I described how my investigation of the Wikibase data model led me to settle on a relatively simple spreadsheet layout for tracking what items, statements, and references needed to be created or edited in Wikidata. Since column headers in a CSV spreadsheet don't really have any meaning other than to a human, it's necessary to map columns to features of the Wikibase model so that a script would know how to write the data in those columns to appropriate data items in Wikidata. </div>
<br />
<h2>
Developing a schema to map spreadsheet columns to the Wikibase model</h2>
In <a href="http://baskauf.blogspot.com/2016/10/guid-o-matic-goes-to-china.html" target="_blank">a blog post from 2016</a>, I wrote about a similar problem that I faced when creating an application that would translate tabular CSV data to RDF triples. In that case, I created a mapping CSV table that mapped table headers to particular RDF predicates, and that also indicated the kind of object represented in the table (language-tagged literal, IRI, etc.). That approach worked fine and had the advantage of simplicity, but it had the disadvantage that it was an entirely ad hoc solution that I made up for my own use.<br />
<br />
When I learned about the <a href="https://www.w3.org/TR/csv2rdf/" target="_blank">"Generating RDF from Tabular Data on the Web" W3C Recommendation</a>, I recognized that this was a more standardized way to accomplish a mapping from a CSV table to RDF. When I started working on the VanderBot project I realized that since the Wikibase model can be expressed as an RDF graph, I could construct a schema using this W3C standard to document how my CSV data should be mapped to Wikidata items, properties, references, labels, etc. The most relevant part of the standard is <a href="https://www.w3.org/TR/csv2rdf/#example-events-listing" target="_blank">section 7.3, "Example with single table and using virtual columns to produce multiple subjects per row"</a>.<br />
<br />
An example schema that maps the <a href="https://github.com/HeardLibrary/linked-data/blob/master/publications/departments/engineering-to-write.csv" target="_blank">sample table from the last post</a> is <a href="https://github.com/HeardLibrary/linked-data/blob/master/publications/departments/csv-metadata.json" target="_blank">here</a>. The schema is written in JSON, and if ingested by an application that can transform CSV files in accordance with the W3C specification, it should produce RDF triples identical to triples about the subject items that are stored in the Wikidata Query Service triplestore (not all triples, but many of the ones that would be generated if the CSV data were loaded into the Wikidata API). I haven't actually tried this since I haven't acquired such an application, but the point is that the JSON schema applied to the CSV data will generate part of the graph that will eventually be present in Wikidata when the data are loaded.<br />
<br />
I will not go into every detail of the example schema, but show several examples of how parts of it map particular columns.<br />
<div>
<br /></div>
<h4>
Column for the item identifier</h4>
Each column in the table has a corresponding JSON object in the schema. The first column, with the column header title "wikidataId" is mapped with:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">{</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">"titles": "wikidataId",</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">"name": "wikidataId",</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">"datatype": "string", </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">"suppressOutput": true</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
<br />
This JSON simply associates a variable name (<span style="font-family: "courier new" , "courier" , monospace;">wikidataId</span>) with the Wikidata Q ID for the item that's the subject of each row. (For simplicity, I've chosen to make the variable names the same as the column titles, but that isn't required.) The "true" value for <span style="font-family: "courier new" , "courier" , monospace;">suppressOutput</span> means that no statement is directly generated from this column.<br />
<div>
<br /></div>
<h4>
Column for the label</h4>
<div>
The "labelEn" column is mapped with this JSON object:</div>
<div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">{</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">"titles": "labelEn",</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">"name": "labelEn",</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">"datatype": "string",</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">"aboutUrl": "http://www.wikidata.org/entity/{wikidataId}",</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">"propertyUrl": "rdfs:label",</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">"lang": "en"</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">}</span></div>
<div>
<br /></div>
<div>
The value of <span style="font-family: "courier new" , "courier" , monospace;">aboutUrl</span> indicates the subject of the triple generated by this column. The curly brackets indicate that the <span style="font-family: "courier new" , "courier" , monospace;">wikidataId</span> variable should be substituted in that place to generate the URI for the subject. The value of <span style="font-family: "courier new" , "courier" , monospace;">propertyUrl</span> is <span style="font-family: "courier new" , "courier" , monospace;">rdfs:label</span>, the RDF predicate that Wikibase uses for its label field. The object of the triple by default is the value present in that column for the row. The <span style="font-family: "courier new" , "courier" , monospace;">lang</span> value provides the language tag for the literal.</div>
</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgXy-Y7R1Z90ZweF5uzj0Vmjq-YFffogo1wimFRk4Sz-v-Swyn7H3qCgYnv6eW7VKPFiPZkgp4ykhmLSLVUct0i-mTuWK7SZLdJG2GlMQmu_Ic9viS7dPrrQGqL47GeLNojs9cNASEF0Qc/s1600/diagram13.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="104" data-original-width="974" height="68" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgXy-Y7R1Z90ZweF5uzj0Vmjq-YFffogo1wimFRk4Sz-v-Swyn7H3qCgYnv6eW7VKPFiPZkgp4ykhmLSLVUct0i-mTuWK7SZLdJG2GlMQmu_Ic9viS7dPrrQGqL47GeLNojs9cNASEF0Qc/s640/diagram13.png" width="640" /></a></div>
<div>
<br /></div>
<div>
<div>
So when this mapping is applied to the <span style="font-family: "courier new" , "courier" , monospace;">labelEn</span> column of the first row, the triple</div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><http://www.wikidata.org/entity/Q84268104> rdfs:label "Vanderbilt Department of Biomedical Engineering"@en.</span></div>
</div>
<div>
<br /></div>
<div>
would be generated.</div>
<div>
<br /></div>
<h4>
Column for a property having a value that is an item (<span style="font-family: "courier new" , "courier" , monospace;">P749</span>)</h4>
<div>
Here is the JSON object that maps the "<span style="font-family: "courier new" , "courier" , monospace;">parentUnit</span>" column.</div>
<div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">{</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">"titles": "parentUnit",</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">"name": "parentUnit",</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">"datatype": "string",</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">"aboutUrl": "http://www.wikidata.org/entity/{wikidataId}",</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">"propertyUrl": "http://www.wikidata.org/prop/direct/P749",</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">"valueUrl": "http://www.wikidata.org/entity/{parentUnit}"</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">}</span></div>
<div>
<br /></div>
<div>
As before, the subject URI is established by substituting the <span style="font-family: "courier new" , "courier" , monospace;">wikidataId</span> variable into the URI template for <span style="font-family: "courier new" , "courier" , monospace;">aboutUrl</span>. Instead of directly mapping the column value as the object of the triple, the column value is inserted into a <span style="font-family: "courier new" , "courier" , monospace;">valueUrl</span> URI template in the same manner as the <span style="font-family: "courier new" , "courier" , monospace;">aboutUrl</span>. </div>
</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-bYfhyphenhyphenloSTMPqKpNOcl5tu2y1fJbQFO0XeAD_eatE9e8-X1Yy2uXjtxRqBlp4gELq0UvHgZ5IhrusJcIMuqblkMoqr9sTSXt1-u7xSmRRT7HaHpIIV4ZHBPVIfJn9EHW9AEbiJ2MGAKM/s1600/diagram14.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="131" data-original-width="974" height="86" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-bYfhyphenhyphenloSTMPqKpNOcl5tu2y1fJbQFO0XeAD_eatE9e8-X1Yy2uXjtxRqBlp4gELq0UvHgZ5IhrusJcIMuqblkMoqr9sTSXt1-u7xSmRRT7HaHpIIV4ZHBPVIfJn9EHW9AEbiJ2MGAKM/s640/diagram14.png" width="640" /></a></div>
<div>
<br /></div>
<div>
<div>
Applying this column mapping to the <span style="font-family: "courier new" , "courier" , monospace;">parentUnit</span> column generates the triple:</div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><http://www.wikidata.org/entity/Q84268104> <http://www.wikidata.org/prop/direct/P749> <http://www.wikidata.org/entity/Q7914459>.</span></div>
<div>
<br /></div>
<div>
which can be abbreviated</div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">wd:Q84268104 wdt:P749 wd:Q7914459.</span></div>
</div>
<div>
<br /></div>
<div>
<div>
The other columns in the CSV table are mapped similarly. If there is no <span style="font-family: "courier new" , "courier" , monospace;">valueUrl</span> key:value pair, the value for the column is a literal object, and if there is a value for <span style="font-family: "courier new" , "courier" , monospace;">valueUrl</span>, the value for the column is used to generate a URI denoting a non-literal object. </div>
<div>
<br /></div>
<div>
The value of <span style="font-family: "courier new" , "courier" , monospace;">datatype</span> is important since it determines the <span style="font-family: "courier new" , "courier" , monospace;">xsd:datatype</span> of literal values in the generated triples.</div>
<div>
<br /></div>
<div>
Not every column generates a triple with a subject that's the subject of the row. The subject may be the value of any other column. This allows the data in the row to form a more complicated graph structure.</div>
</div>
<div>
<br /></div>
<h2>
How the VanderBot script writes the CSV data to the Wikidata API</h2>
<div>
<div>
The script that does the actual writing to the Wikidata API is <a href="https://github.com/HeardLibrary/linked-data/blob/master/publications/process_csv_metadata_full.py" target="_blank">here</a>. The authentication process (<a href="https://github.com/HeardLibrary/linked-data/blob/master/publications/process_csv_metadata_full.py#L338" target="_blank">line 338</a>) is described in detail <a href="https://heardlibrary.github.io/digital-scholarship/host/wikidata/bot/#use-the-bot-to-write-to-the-wikidata-test-instance" target="_blank">elsewhere</a>. </div>
<div>
<br /></div>
<div>
The actual script begins (<a href="https://github.com/HeardLibrary/linked-data/blob/master/publications/process_csv_metadata_full.py#L374" target="_blank">line 374</a>) by loading the schema JSON into a Python data structure and loading the CSV table into a list of dictionaries. </div>
<div>
<br /></div>
<div>
The next section of the code (<a href="https://github.com/HeardLibrary/linked-data/blob/master/publications/process_csv_metadata_full.py#L402" target="_blank">lines 402 to 554</a>) uses the schema JSON to sort the columns of the tables into categories (labels, aliases, descriptions, statements with entity values, and statements with literal values). </div>
<div>
<br /></div>
<div>
From lines <a href="https://github.com/HeardLibrary/linked-data/blob/master/publications/process_csv_metadata_full.py#L556" target="_blank">556 to 756</a>, the script steps through each row of the table to generate the data that needs to be passed to the API to upload new data. In each row, the script goes through each category of data (labels, aliases, etc.) and turns the value in a column into the specific JSON required by the API for uploading that kind of data. I call this "snak JSON" because the units in the JSON represent "snaks" (small, discrete statements) as defined by the Wikibase data model.</div>
<div>
<br /></div>
<div>
Originally, I had written the script in a simpler way, where each piece of information about the item was written in a separate API call. This seemed intuitive since there are individual API methods for uploading every category (label, description, property, reference, etc., see the <a href="https://www.wikidata.org/w/api.php" target="_blank">API documentation</a>). However, because of rate limitations that I'll talk about later, the most reasonable way to write the data was to determine which categories needed to be written for an item and then generate the JSON for all categories at once. I then used the "all in one" method <span style="font-family: "courier new" , "courier" , monospace;">wbeditentity</span> to make all possible edits in a single API call. This resulted in much more complicated code that constructed deeply nested JSON that's difficult to read. The API help page didn't give any examples that were nearly this complicated, so getting this strategy to work required delving deeply into the Wikibase model. One lifesaver was that when a successful API call was made, the API's response included JSON structured according to the Wikibase model that was very similar to the JSON that was necessary to write to the API. Being able to look at this response JSON was really useful to help me figure out what subtle mistakes I was making when constructing the JSON to send to the API.</div>
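<div>
<br /></div>
<div>
To give a flavor of what gets sent, here is an abbreviated, illustrative example of the kind of JSON data object that <span style="font-family: Courier New, Courier, monospace;">wbeditentity</span> accepts: one label and one item-valued statement, with references and qualifiers omitted. The objects the script actually constructs are considerably deeper than this.</div>
<div>
<br /></div>
<span style="font-family: Courier New, Courier, monospace;">{<br />
&nbsp;&nbsp;"labels": {"en": {"language": "en", "value": "Vanderbilt Department of Biomedical Engineering"}},<br />
&nbsp;&nbsp;"claims": [<br />
&nbsp;&nbsp;&nbsp;&nbsp;{<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"type": "statement",<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"rank": "normal",<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"mainsnak": {<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"snaktype": "value",<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"property": "P749",<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"datavalue": {<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"type": "wikibase-entityid",<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"value": {"entity-type": "item", "numeric-id": 7914459}<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />
&nbsp;&nbsp;&nbsp;&nbsp;}<br />
&nbsp;&nbsp;]<br />
}</span><br />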
<div>
<br /></div>
<div>
Simply creating labels, descriptions, and claims would not have been too hard, but I was determined to also have the capability to support references and qualifiers for claims. Here's how I hacked that task: for each statement column, I went through the columns and looked for other columns that the schema indicated were references or qualifiers of that statement. Currently, the script only handles one reference and one qualifier per statement, but when I get around to it, I'll improve the script to remove that limitation. </div>
<div>
<br /></div>
<div>
In line <a href="https://github.com/HeardLibrary/linked-data/blob/master/publications/process_csv_metadata_full.py#L759" target="_blank">759</a>, the script checks whether it found any information about the item that wasn't already written to Wikidata. If there was at least one thing to write, the script attempts to post a parameter dictionary (including the complex, constructed snak JSON) to the API (<a href="https://github.com/HeardLibrary/linked-data/blob/master/publications/process_csv_metadata_full.py#L305" target="_blank">lines 305 to 335</a>). If the attempt was unsuccessful because the API was too busy, it retries several times. If the attempt was unsuccessful for other reasons, the script displays the server's response for debugging. </div>
<div>
<br /></div>
<div>
If the attempt was successful, the script extracts identifiers of newly-created data records (item Q IDs, statement UUIDs, and reference hashes - see the previous post for more on this) and adds them to the CSV table so that the script will know in the future that those data are already in Wikidata. The script rewrites the CSV table after every line so that if the script crashes or the API throws an error during a write attempt, one can simply re-start the script after fixing the problem and the script will know not to create duplicate data on the second go-around (since the identifiers for the already-written data have already been added to the CSV). </div>
<div>
<br /></div>
<div>
I mentioned near the end of my previous post that I don't have any way to record whether labels, descriptions, and qualifiers had already been written or not, since URI identifiers aren't generated for them. The lack of URI identifiers means that one can't refer to those particular assertions directly by URIs in a SPARQL query. Instead, one must make a query asking explicitly for the value of the label, description, or qualifier and then determine whether it's the same as the value in the CSV table. The way the script currently works, prior to creating JSON to send to the API the script sends a SPARQL query asking for the values of labels and descriptions of all of the entities in the table (lines <a href="https://github.com/HeardLibrary/linked-data/blob/master/publications/process_csv_metadata_full.py#L465" target="_blank">465</a> and <a href="https://github.com/HeardLibrary/linked-data/blob/master/publications/process_csv_metadata_full.py#L515" target="_blank">515</a>). Then as the script processes each line of the table, it checks whether the value in the CSV is the same as what's already in Wikidata (and then does nothing) or different. If the value is different, it writes the new value from the CSV and overwrites the value in Wikidata. </div>
<div>
<br /></div>
<div>
It is important to understand this behavior, because if the CSV table is "stale" and has not been updated for a long time, other users may have improved the labels or descriptions. Running the script with the stale values will effectively revert their improvements. So it's important to update the CSV file with current values before running this script that writes to the API. After updating, then you can manually change any labels or descriptions that are unsatisfactory. </div>
<div>
<br /></div>
<div>
In the future, I plan to write additional scripts for managing labels and aliases, so this crude management system will hopefully be improved.</div>
</div>
<div>
<br /></div>
<h2>
Cleaning up missing references</h2>
<div>
In some cases, other Wikidata contributors have already made statements about pre-existing Vanderbilt employee items. For example, someone may have already asserted that the Vanderbilt employee's employer was Vanderbilt University. In such cases, the primary API writing script will do nothing with those statements because it is not possible to write a reference as part of the <span style="font-family: "courier new" , "courier" , monospace;">wbeditentity</span> API method without also writing its parent statement. So I had to create <a href="https://github.com/HeardLibrary/linked-data/blob/master/publications/cleanup_csv_metadata.py" target="_blank">a separate script</a> that is a hack of the primary script in order to write the missing references. I won't describe that script here because its operation is very similar to the main script. The main difference is that it uses the <a href="https://www.wikidata.org/w/api.php?action=help&modules=wbsetreference" target="_blank"><span style="font-family: "courier new" , "courier" , monospace;">wbsetreference</span> API method</a> that is able to directly write a reference given a statement identifier. After running the main script, I run the cleanup script until all of the missing references have been added.</div>
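<div>
<br /></div>
<div>
For the sake of completeness, here is roughly what a <span style="font-family: "courier new" , "courier" , monospace;">wbsetreference</span> POST looks like. The statement identifier and reference URL are the examples from the previous post, the token is a placeholder, and the snak JSON is simplified; this is a sketch of the method rather than the cleanup script's actual code.</div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">import json<br />
import requests<br />
<br />
session = requests.Session()<br />
csrf_token = 'PLACEHOLDER_TOKEN'&nbsp;&nbsp;# a real edit needs a CSRF token from an authenticated session<br />
<br />
reference_snaks = {<br />
&nbsp;&nbsp;&nbsp;&nbsp;'P854': [{&nbsp;&nbsp;# P854 = reference URL<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;'snaktype': 'value',<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;'property': 'P854',<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;'datavalue': {'type': 'string', 'value': 'http://orcid.org/0000-0002-7248-6551'}<br />
&nbsp;&nbsp;&nbsp;&nbsp;}]<br />
}<br />
parameters = {<br />
&nbsp;&nbsp;&nbsp;&nbsp;'action': 'wbsetreference',<br />
&nbsp;&nbsp;&nbsp;&nbsp;'format': 'json',<br />
&nbsp;&nbsp;&nbsp;&nbsp;'token': csrf_token,<br />
&nbsp;&nbsp;&nbsp;&nbsp;'statement': 'Q42352198$FB9EABCA-69C0-4CFC-BDC3-44CCA9782450',<br />
&nbsp;&nbsp;&nbsp;&nbsp;'snaks': json.dumps(reference_snaks)<br />
}<br />
response = session.post('https://www.wikidata.org/w/api.php', data=parameters).json()<br />
print(response)</span></div>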
<div>
<br /></div>
<h2>
Timing issues</h2>
<h4>
Maxlag</h4>
<div>
One of the things that I mentioned in my <a href="http://baskauf.blogspot.com/2019/06/putting-data-into-wikidata-using.html" target="_blank">original post on writing data to Wikidata</a> was that when writing to the "real" Wikidata API (vs. the test API or your own Wikibase instance) it's important to respect the <span style="font-family: "courier new" , "courier" , monospace;">maxlag</span> parameter.</div>
<div>
<div>
<br /></div>
<div>
You can set the value of the <span style="font-family: "courier new" , "courier" , monospace;">maxlag</span> parameter in <a href="https://github.com/HeardLibrary/linked-data/blob/master/publications/process_csv_metadata_full.py#L376" target="_blank">line 381</a>. The recommended value is 5 seconds. A higher <span style="font-family: "courier new" , "courier" , monospace;">maxlag</span> value is more aggressive and a lower <span style="font-family: "courier new" , "courier" , monospace;">maxlag</span> value is "nicer" but means that you are willing to be told more often by the API to wait. The value of <span style="font-family: "courier new" , "courier" , monospace;">maxlag</span> you have chosen is added to the parameters sent to the API in <a href="https://github.com/HeardLibrary/linked-data/blob/master/publications/process_csv_metadata_full.py#L764" target="_blank">line 764</a> just before the POST operation. </div>
<div>
<br /></div>
<div>
The API lag is the average amount of time between when a user requests an operation and when the API is able to honor that request. At times of low usage (e.g. nighttime in the US and Europe), the lag may be small, but at times of high usage, the lag can be over 8 seconds (I've seen it go as high as 12 seconds). If you set <span style="font-family: "courier new" , "courier" , monospace;">maxlag</span> to 5 seconds, you are basically telling the server that if the lag gets longer than 5 seconds, it should ignore your request and you'll try again later. The server tells you to wait by responding to your POST request with a response that contains a <span style="font-family: "courier new" , "courier" , monospace;">maxlag</span> error code and the amount of time the server is lagged. This error is handled in <a href="https://github.com/HeardLibrary/linked-data/blob/master/publications/process_csv_metadata_full.py#L313" target="_blank">line 315</a> of the script. When a lag error is detected, the recommended practice is to wait at least 5 seconds before retrying.</div>
</div>
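<div>
<br /></div>
<div>
Putting those pieces together, the <span style="font-family: "courier new" , "courier" , monospace;">maxlag</span> handling can be sketched like this. The error code and the Retry-After header are part of the standard MediaWiki API behavior; the function and variable names are only illustrative, and the real script's logic differs in its details. The <span style="font-family: "courier new" , "courier" , monospace;">session</span> argument is assumed to be a requests.Session object.</div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">import time<br />
<br />
def post_with_maxlag(session, api_url, parameters, maxlag=5, retries=10):<br />
&nbsp;&nbsp;&nbsp;&nbsp;parameters['maxlag'] = maxlag&nbsp;&nbsp;# added to the parameters just before the POST<br />
&nbsp;&nbsp;&nbsp;&nbsp;for attempt in range(retries):<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;http_response = session.post(api_url, data=parameters)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;body = http_response.json()<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if body.get('error', {}).get('code') == 'maxlag':<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;# The server is more lagged than we are willing to tolerate.<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;wait = int(http_response.headers.get('Retry-After', 5))<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;time.sleep(max(wait, 5))&nbsp;&nbsp;# wait at least 5 seconds before retrying<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;continue<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return body<br />
&nbsp;&nbsp;&nbsp;&nbsp;raise RuntimeError('server still lagged after ' + str(retries) + ' attempts')<br />
<br />
# Example use: body = post_with_maxlag(session, 'https://www.wikidata.org/w/api.php', parameters)</span></div>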
<div>
<br /></div>
<h4>
Bot flags</h4>
<div>
I naïvely believed that if I respected <span style="font-family: "courier new" , "courier" , monospace;">maxlag</span> errors, I'd be able to write to the API as fast as conditions allowed. However, the very first time I used the VanderBot script to write more than 25 records in a row, I was blocked by the API as a potential spammer with the message "As an anti-abuse measure, you are limited from performing this action too many times in a short space of time, and you have exceeded this limit. Please try again in a few minutes." Clearly my assumption was wrong. Through trial and error, I determined that a write rate of one second per write was too fast and would result in being temporarily blocked, but a rate of two seconds per write was acceptable. So to handle cases when maxlag was not invoked, I put a delay of 2 seconds on the script (<a href="https://github.com/HeardLibrary/linked-data/blob/master/publications/process_csv_metadata_full.py#L821" target="_blank">line 822</a>).</div>
<div>
<div>
<br /></div>
<div>
I had several hypotheses about the cause of the blocking. One possible reason was that I didn't have a bot flag. (More on that later.) Another was that I was running the script from my local computer rather than from <a href="https://www.mediawiki.org/wiki/PAWS" target="_blank">PAWS</a>. PAWS is a web-based interactive programming and publishing environment based on Jupyter notebooks. At Wikicon North America, I had an interesting and helpful conversation with Dominic Byrd-McDevitt of the National Archives, who showed me how he published NARA metadata to Wikidata via a PAWS-based system using Pywikibot. I don't think he had a bot flag, and I think his publication rate was faster than one write per second. But I really didn't want to take the time to test this hypothesis by converting my script over to PAWS (which would require more experimentation with authentication). So I decided to make <a href="https://lists.wikimedia.org/pipermail/wikitech-l/2020-January/092946.html" target="_blank">a post to Wikitech-l</a> and see if I could get an answer. </div>
<div>
<br /></div>
<div>
I quickly got <a href="https://lists.wikimedia.org/pipermail/wikitech-l/2020-January/092947.html" target="_blank">a helpful answer</a> that confirmed that neither using PAWS nor Pywikibot should have any effect on the rate limit. If I had a bot flag, I might gain the "noratelimit" right, which might bypass rate limiting in many cases. </div>
<div>
<br /></div>
<div>
Bot flags are discussed <a href="https://www.wikidata.org/wiki/Wikidata:Bots" target="_blank">here</a>. In order to get a bot flag, one must detail the task that the bot will perform, then demonstrate by a test run of 50 to 250 edits that the bot is working correctly. When I was at Wikicon NA, I asked some of the Powers That Be whether it was important to get a bot flag if I was not running an autonomous bot. They said that it wasn't so important if I was monitoring the writing process. It would be difficult to "detail the task" that VanderBot will perform since it's just a general-purpose API writing script, and what it writes will depend on the CSV file and the JSON mapping schema. </div>
<div>
<br /></div>
<div>
In the end, I decided to just forget about getting a bot flag for now and keep the rate at 2 seconds per write. I usually don't write more than 50-100 edits in a session, and often the server will be lagged anyway, requiring me to wait much longer than 2 seconds. If VanderBot's task becomes better defined and more autonomous, I might request a bot flag at some point in the future.</div>
</div>
<div>
<br /></div>
<h4>
Query Service Updater lag</h4>
<div>
One of the principles upon which VanderBot is built is that data are written to Wikidata by POSTing to the API, but that the status of data in Wikidata is determined by SPARQL queries of the Query Service. That is a sound idea, but it has one serious limitation. Data that are added through either the API or the human GUI do not immediately appear in the graph database that supports the Query Service. There is a delay, known as the Updater lag, between the time of upload and the time of availability at the Query Service. We can gain a better understanding by looking at the <a href="https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1" target="_blank">Query Service dashboard</a>.</div>
<div>
<div>
<br /></div>
<div>
Here's a view of the lag time on the day I wrote this post (2020-02-03):</div>
</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhavyjj-Nu16dEYaAUHOhCFWPH0gMT7u19lo8dOGwRWZVjMT1N_x_bgdZ3oSRImu2BbX8ZfjvFA4Lc3VMtSxUSyX3AaYpcTq2NTz3j-V8WZLDLE9_YWG_v0k0b6SI9zFfaa2DOw-SkT6Kc/s1600/diagram15.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="525" data-original-width="974" height="344" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhavyjj-Nu16dEYaAUHOhCFWPH0gMT7u19lo8dOGwRWZVjMT1N_x_bgdZ3oSRImu2BbX8ZfjvFA4Lc3VMtSxUSyX3AaYpcTq2NTz3j-V8WZLDLE9_YWG_v0k0b6SI9zFfaa2DOw-SkT6Kc/s640/diagram15.png" width="640" /></a></div>
<div>
<br /></div>
<div>
The first thing to notice is that there isn't just one query service. There are actually seven servers running replicas of the Query Service that handle the queries. They are all being updated constantly with data from the relational database connected to the API, but since the updating process has to compete with queries that are being run, some servers cannot keep up with the updates and lag by as much as 10 hours. Other servers have lag times of less than one minute. So depending on the luck of the draw of which server takes your query, data that you wrote to the API may be visible via SPARQL in a few seconds or in half a day.</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgGwmzXpCUHuzNlCqsPV6NcxPAf6CH4R5emKJbTvEISSPHGw8WUCyyNoZMR46tLBumPxwIBoJn2G-oHwQvoBh6ry0N3mxoVMyb3WclhQStUZEutfE52M_nzM93JArNjqBhXrUExdwqT2go/s1600/diagram16.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="547" data-original-width="974" height="358" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgGwmzXpCUHuzNlCqsPV6NcxPAf6CH4R5emKJbTvEISSPHGw8WUCyyNoZMR46tLBumPxwIBoJn2G-oHwQvoBh6ry0N3mxoVMyb3WclhQStUZEutfE52M_nzM93JArNjqBhXrUExdwqT2go/s640/diagram16.png" width="640" /></a></div>
<div>
<br /></div>
<div>
<div>
A practical implication of this is that if VanderBot updates its CSV record using SPARQL, the data could be as much as half a day out of date. Normally that isn't a problem, since the data I'm working with doesn't change much, and once I write new data, I usually don't mess with it for days. However, since the script depends on a SPARQL query to determine if the labels and descriptions in the CSV differ from what's already in Wikidata, there can be problems if the script crashes halfway through the rows of the CSV. If I fix the problem and immediately re-run the script, a lagged Query Service may report that the labels and descriptions I successfully wrote a few moments earlier are still in their previous state. That will cause VanderBot to attempt to re-write those labels and descriptions. Fortunately, if the API detects that a write operation is trying to set the value of a label or description to the value it already has, it will do nothing. So generally, no harm is done. </div>
<div>
<br /></div>
<div>
This lag is why I use the response JSON sent from the API after a write to update the CSV rather than depending on a separate SPARQL query to make the update. Because the data in the response JSON comes directly from the API and not the Query Service, it is not subject to any lag.</div>
</div>
<div>
<br /></div>
<h2>
Summary</h2>
<div>
<br /></div>
<div>
The API writing script part of VanderBot does the following:</div>
<div>
<ol>
<li>Reads the JSON mapping schema to determine the meaning of the CSV table columns.</li>
<li>Reads in the data from the CSV table.</li>
<li>Sorts out the columns by type of data (label, alias, description, property).</li>
<li>Constructs snak JSON for any new data items that need to be written.</li>
<li>Checks new statements for references and qualifiers by looking at columns associated with the statement properties, then creates snak JSON for references or qualifiers as needed.</li>
<li>Inserts the constructed JSON object into the required parameter dictionary for the <span style="font-family: "courier new" , "courier" , monospace;">wbeditentity</span> API method.</li>
<li>POSTs to the Wikidata API via HTTP.</li>
<li>Parses the response JSON from the API to discover the identifiers of newly created data items.</li>
<li>Inserts the new identifiers into the table and writes the CSV file.</li>
</ol>
</div>
<div>
In the <a href="http://baskauf.blogspot.com/2020/02/vanderbot-part-4-preparing-data-to-send.html" target="_blank">final post of this series</a>, I'll describe how the data harvesting script part of VanderBot works.</div>
<div>
<br /></div>
Steve Baskaufhttp://www.blogger.com/profile/01896499749604153763noreply@blogger.com0tag:blogger.com,1999:blog-5299754536670281996.post-86255223488561478072020-02-07T09:16:00.002-08:002020-02-08T19:02:41.715-08:00VanderBot part 2: The Wikibase data model and Wikidata identifiers<img border="0" data-original-height="534" data-original-width="975" height="350" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5xlZPlCGpbO_-k5B3opwaELGdK0auVHn6yGFLD6m4pfs0m3vn3U6l3JNOZQ-5otK1jxLf3SOHUf6NvZ0yITHx8yRBhURD3D0FO2HKGTommBXKgA3GfegzV-XADSPHrIwfZScQ7ET6LfQ/s640/diagram4.png" style="display: none;" width="640" /><br />
<h2>
The Wikidata GUI and the Wikibase model</h2>
To read part 1 of this series, see <a href="http://baskauf.blogspot.com/2020/02/vanderbot-python-script-for-writing-to.html" target="_blank">this page</a>.<br />
<br />
If you've edited Wikidata using the human-friendly graphical user interface (GUI), you know that items can have multiple properties, each property can have multiple values, each property/value statement can be qualified in multiple ways, each property/value statement can have multiple references, and each reference can have multiple statements about that reference. The GUI keeps this tree-like proliferation of data tidy by collapsing the references and organizing the statements by property.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhqIQyEdlvSvC8Kc-MXKJuMiKNedjfUg7t0SxZxmkVF_eK_p80r2FDZ5QN-ERnUGBOr1Tdi34OY5VYy2u_LijYhGuqUPg4iqny2DLBuuM9s6xS44ujhyubSFJYLSiXMq25inU0TSFlN-5I/s1600/diagram3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="354" data-original-width="974" height="232" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhqIQyEdlvSvC8Kc-MXKJuMiKNedjfUg7t0SxZxmkVF_eK_p80r2FDZ5QN-ERnUGBOr1Tdi34OY5VYy2u_LijYhGuqUPg4iqny2DLBuuM9s6xS44ujhyubSFJYLSiXMq25inU0TSFlN-5I/s640/diagram3.png" width="640" /></a></div>
<br />
This organization of information arises from the Wikibase data model (summarized <a href="https://www.mediawiki.org/wiki/Wikibase/DataModel/Primer" target="_blank">here</a>, in detail <a href="https://www.mediawiki.org/wiki/Wikibase/DataModel" target="_blank">here</a>). For those unfamiliar with Wikibase, it is the underlying software system that Wikidata is built upon. Wikidata is just one instance of Wikibase and there are databases other than Wikidata that are built on the Wikibase system. All of those databases built on Wikibase will have a GUI that is similar to Wikidata, although the specific items and properties in those databases will be different from Wikidata.<br />
<br />
To be honest, I found working through the Wikibase model documentation a real slog. (I was particularly mystified by the obscure term for basic assertions: "snak". Originally, I thought it was an acronym, but later realized it was an inside joke. A snak is "small, but more than a byte".) But understanding the Wikibase model is critical for anyone who wants to either write to the Wikidata API or query the Wikidata Query Service, and I wanted to do both. So I dug in.<br />
<br />
The Wikibase model is an abstract model, but it is possible to represent it as a graph model. That's important because that is why the Wikidata dataset can be exported as RDF and made queryable by SPARQL in the Wikidata Query Service. After some exploration of Wikidata using SPARQL and puzzling over the data model documentation, I was able to draw out the major parts of the Wikibase model as a graph model. It's a bit too much to put in a single diagram, so I made one that showed references and another that showed qualifiers (inserted later in the post). Here's the diagram for references:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5xlZPlCGpbO_-k5B3opwaELGdK0auVHn6yGFLD6m4pfs0m3vn3U6l3JNOZQ-5otK1jxLf3SOHUf6NvZ0yITHx8yRBhURD3D0FO2HKGTommBXKgA3GfegzV-XADSPHrIwfZScQ7ET6LfQ/s1600/diagram4.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="584" data-original-width="779" height="478" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5xlZPlCGpbO_-k5B3opwaELGdK0auVHn6yGFLD6m4pfs0m3vn3U6l3JNOZQ-5otK1jxLf3SOHUf6NvZ0yITHx8yRBhURD3D0FO2HKGTommBXKgA3GfegzV-XADSPHrIwfZScQ7ET6LfQ/s640/diagram4.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<i>Note about namespace prefixes:</i> the exact URI for a particular namespace abbreviation will depend on the Wikibase installation. The URIs shown in the diagrams are for Wikidata. A generic Wikibase instance will contain <span style="font-family: "courier new" , "courier" , monospace;">wikibase.svc</span> as its domain name in place of <span style="font-family: "courier new" , "courier" , monospace;">www.wikidata.org</span>, and other instances will use other domain names. However, the namespace abbreviations shown above are used consistently among installations, and when querying via the human-accessible Query Service or via HTTP, the standard abbreviations can be used without declaring the underlying namespaces. That's convenient because it allows code based on the namespace abbreviations to be generic enough to be used for any Wikibase installation. </div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
In the next several sections, I'm going to describe the Wikibase model and how Wikidata assigns identifiers to different parts of it. This will be important in deciding how to track data locally. Following that, I'll briefly describe my strategy for storing those data.<br />
<br />
<h2 style="clear: both; text-align: left;">
Item identifiers</h2>
<div class="separator" style="clear: both; text-align: left;">
The subject item of a statement is identified by a unique "Q" identifier. For example, Vanderbilt University is identified by <span style="font-family: "courier new" , "courier" , monospace;">Q29052</span> and the researcher Antonis Rokas is identified by <span style="font-family: "courier new" , "courier" , monospace;">Q42352198</span>. We can make statements by connecting subject and object items with a defined Wikidata property. For example, the property <span style="font-family: "courier new" , "courier" , monospace;">P108</span> ("employer") can be used to state that Antonis Rokas' employer is Vanderbilt University: <span style="font-family: "courier new" , "courier" , monospace;">Q42352198 P108 Q29052</span>. When the data are transferred from the Wikidata relational database backend fed by the API to the Blazegraph graph database backend of the Query Service, the "Q" item identifiers and "P" property identifiers are turned into URIs by appending the appropriate namespace (<span style="font-family: "courier new" , "courier" , monospace;">wd:Q42352198 wdt:P108 wd:Q29052.</span>)</div>
<div class="separator" style="clear: both;">
<br /></div>
<div class="separator" style="clear: both;">
We can check this out by running the following query at the <a href="https://query.wikidata.org/" target="_blank">Wikidata Query Service</a>:</div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;">SELECT DISTINCT ?predicate ?object WHERE {</span></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;"> wd:Q42352198 ?predicate ?object.</span></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;"> }</span></div>
<div class="separator" style="clear: both;">
<br /></div>
<div class="separator" style="clear: both;">
This query returns all of the statements made about Antonis Rokas in Wikidata.</div>
<div>
<br /></div>
<h2>
Statement identifiers</h2>
In order to be able to record further information about a statement itself, each statement is assigned a unique identifier in the form of a UUID. The UUID is generated at the time the statement is first made. For example, the particular statement above (<span style="font-family: "courier new" , "courier" , monospace;">Q42352198 P108 Q29052</span>) has been assigned the UUID <span style="font-family: "courier new" , "courier" , monospace;">FB9EABCA-69C0-4CFC-BDC3-44CCA9782450</span>. In the transfer from the relational database to Blazegraph, the subject Q ID plus a dash is prepended to the UUID, and the result is placed in the "<span style="font-family: "courier new" , "courier" , monospace;">wds:</span>" namespace. So our example statement would be identified with the URI <span style="font-family: "courier new" , "courier" , monospace;">wds:Q42352198-FB9EABCA-69C0-4CFC-BDC3-44CCA9782450</span>. If you look at the results from the query above, you'll see<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">p:P108 wds:Q42352198-FB9EABCA-69C0-4CFC-BDC3-44CCA9782450</span><br />
<br />
as one of the results.<br />
<br />
We can ask what statements have been made about the statement itself by using a similar query, but with the statement URI as the subject:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">SELECT DISTINCT ?predicate ?object WHERE {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> wds:Q42352198-FB9EABCA-69C0-4CFC-BDC3-44CCA9782450 ?predicate ?object.</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> }</span><br />
<br />
One important detail relates to case insensitivity. UUIDs are supposed to be output as lowercase, but they are supposed to be case-insensitive on input. So in theory, a UUID should represent the same value regardless of the case. However, in the Wikidata system the generated identifier is just a string and that string would be different depending on the case. So the URI<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">wds:Q42352198-FB9EABCA-69C0-4CFC-BDC3-44CCA9782450</span><br />
<br />
is <b>not</b> the same as the URI<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">wds:Q42352198-fb9eabca-69c0-4cfc-bdc3-44cca9782450</span><br />
<br />
(Try running the query with the lower case version to convince yourself that this is true.) Typically, the UUIDs generated in Wikidata are upper case, but there are some that are lower case. For example, try<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">wds:Q57756352-4a25cee4-45bc-63e8-74be-820454a8b7ad</span><br />
<br />
in the query. Generally it is safe to assume that the "Q" in the Q ID is upper case, but I've discovered at least one case where the Q is lower case.<br />
<div>
<br /></div>
<h2>
Reference identifiers</h2>
<div>
<div>
If a statement has a reference, that reference will be assigned an identifier based on a hash algorithm. Here's an example: <span style="font-family: "courier new" , "courier" , monospace;">f9c309a55265fcddd2cb0be62a530a1787c3783e</span>. The reference hash is turned into a URL by prepending the "<span style="font-family: "courier new" , "courier" , monospace;">wdref:</span>" namespace. Statements are linked to references by the property <span style="font-family: "courier new" , "courier" , monospace;">prov:wasDerivedFrom</span>. We can see an example in the results of the previous query:</div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">prov:wasDerivedFrom wdref:8cfae665e8b64efffe44128acee5eaf584eda3a3</span></div>
<div>
<br /></div>
<div>
which shows the connection of the statement <span style="font-family: "courier new" , "courier" , monospace;">wds:Q42352198-FB9EABCA-69C0-4CFC-BDC3-44CCA9782450</span> (which states <span style="font-family: "courier new" , "courier" , monospace;">wd:Q42352198 wdt:P108 wd:Q29052.</span>) to the reference <span style="font-family: "courier new" , "courier" , monospace;">wdref:8cfae665e8b64efffe44128acee5eaf584eda3a3</span> (which states "reference URL http://orcid.org/0000-0002-7248-6551 and retrieved 12 January 2019"). We can see this if we run a version of the previous query asking about the reference statement:</div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">SELECT DISTINCT ?predicate ?object WHERE {</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> wdref:8cfae665e8b64efffe44128acee5eaf584eda3a3?predicate ?object.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> }</span></div>
<div>
<br /></div>
<div>
As far as I know, reference hashes are consistently recorded in all lower case.</div>
<div>
<br /></div>
<div>
Reference identifiers are different from statement identifiers in that they denote the reference itself, and not a particular assertion of the reference. That is, they do not denote "statement <span style="font-family: "courier new" , "courier" , monospace;">prov:wasDerivedFrom</span> reference", only the reference. (In contrast, statement identifiers denote the whole statement "subject property value".) That means that any statement whose reference has exactly the same asserted statements will have the same reference hash (and URI). </div>
<div>
<br /></div>
<div>
We can see that reference URIs are shared by multiple statements using this query:</div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">SELECT DISTINCT ?statement WHERE {</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> ?statement prov:wasDerivedFrom wdref:f9c309a55265fcddd2cb0be62a530a1787c3783e.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> }</span></div>
</div>
<div>
<br /></div>
<h2>
Identifier examples</h2>
<div>
The following part of a table that I generated for Vanderbilt researchers shows examples of the identifiers I've described above.</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjz7uN1GhhHqSWW8Dgzyltbs4_u9ILNzBY30e7Zql1ErMGztSQRK1vQiELVG49v9-CtKITjPK1PG3cMN52ZgTwtNWvBQ4j72VJrzy2OswcEKj4WifmtmIooj7cV00RkuG5HHeCYigD7Jig/s1600/diagram5.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="95" data-original-width="974" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjz7uN1GhhHqSWW8Dgzyltbs4_u9ILNzBY30e7Zql1ErMGztSQRK1vQiELVG49v9-CtKITjPK1PG3cMN52ZgTwtNWvBQ4j72VJrzy2OswcEKj4WifmtmIooj7cV00RkuG5HHeCYigD7Jig/s1600/diagram5.png" /></a></div>
<div>
<br /></div>
<div>
We see that each item (researcher) has a unique Q ID, that each statement that the researcher is employed at Vanderbilt University (Q29052) has a unique UUID (some upper case, some lower case), and that more than one statement can share the same reference (having the same reference hash). </div>
<div>
<br /></div>
<h2>
Statement qualifiers</h2>
<div>
In addition to linking references to a statement, the statements can also be qualified. For example, Brandt Eichman has worked at Vanderbilt since 2004.</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhga_BOadXLgd6d6HEGnlpfsiLhMQFkrPjisUWVEEppnhJQBsOWYwlyFJARRTIYT921FWmtABNMZa6K66KHv5g7AGx9miWOPHuzBMj7_7iQ6Fv2MFDzWqhu7p9mNGKTstMTKYX5Y9rsNG0/s1600/diagram6.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="385" data-original-width="974" height="252" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhga_BOadXLgd6d6HEGnlpfsiLhMQFkrPjisUWVEEppnhJQBsOWYwlyFJARRTIYT921FWmtABNMZa6K66KHv5g7AGx9miWOPHuzBMj7_7iQ6Fv2MFDzWqhu7p9mNGKTstMTKYX5Y9rsNG0/s640/diagram6.png" width="640" /></a></div>
<div>
<br /></div>
<div>
Here's a diagram showing how the qualifier "start time 2004" is represented in Wikidata's graph database:</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjF4sY2DK2WW4mGL5_E5x39YJQijzcLG4DnbiBZfH4CVrFHjBZ28SoVwuVl9yxM_P-sJv0GZYMaqKota0lzrrVGJT-dlZayJp9cJ7VI94SNHmPpNHBN67Mc0tydvv58xQo0F7dxvKP_hc8/s1600/diagram7.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="584" data-original-width="782" height="476" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjF4sY2DK2WW4mGL5_E5x39YJQijzcLG4DnbiBZfH4CVrFHjBZ28SoVwuVl9yxM_P-sJv0GZYMaqKota0lzrrVGJT-dlZayJp9cJ7VI94SNHmPpNHBN67Mc0tydvv58xQo0F7dxvKP_hc8/s640/diagram7.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div>
<br /></div>
<div>
We can see that qualifiers are handled a little differently from references. If the qualifier property (in this case <span style="font-family: "courier new" , "courier" , monospace;">P580</span>, "since") has a simple value (literal or item), the value is linked to the statement instance using the <span style="font-family: "courier new" , "courier" , monospace;">pq:</span> namespace version of the property. </div>
<div>
<div>
<br /></div>
<div>
If the property has a complex value (e.g. a date), that value is assigned a hash and is linked to the statement instance using the <span style="font-family: "courier new" , "courier" , monospace;">pqv:</span> version of the property. When the data are transferred to the graph database, the <span style="font-family: "courier new" , "courier" , monospace;">wdv:</span> namespace is prepended to the hash. </div>
<div>
<br /></div>
<div>
Because dates are complex, the qualifier "since" requires a non-literal value in addition to a literal value linked by the <span style="font-family: "courier new" , "courier" , monospace;">pq:</span> version of the property (see <a href="https://www.wikidata.org/wiki/Help:Dates" target="_blank">this page</a> for more on the Wikibase date model). We can use this query:</div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">SELECT DISTINCT ?property ?value WHERE {</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> wdv:849f00455434dc418fb4287a4f2b7638 ?property ?value.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> }</span></div>
<div>
<br /></div>
<div>
to explore the non-literal date instance. In Wikidata, all dates are represented as full XML Schema dateTime values (year, month, day, hour, minute, second, timezone). In order to differentiate between the year "2004" and the date 1 January 2004 (both can be represented in Wikidata by the same dateTime value), the year 2004 is assigned a timePrecision of 9 and the date 1 January 2004 is assigned a timePrecision of 11.</div>
<div>
<br /></div>
<div>
Not every qualifier will have a non-literal value. For example, the property "series ordinal" (<span style="font-family: "courier new" , "courier" , monospace;">P1545</span>; used to indicate things like the order authors are listed) has only literal values (integer numbers). So there are values associated with <span style="font-family: "courier new" , "courier" , monospace;">pq:P1545</span>, but not <span style="font-family: "courier new" , "courier" , monospace;">pqv:P1545</span>. The same is true for "language of work or name" (<span style="font-family: "courier new" , "courier" , monospace;">P407</span>; used to describe websites, songs, books, etc.), which has an entity value like <span style="font-family: "courier new" , "courier" , monospace;">Q1860</span> (English).</div>
</div>
<div>
<br /></div>
<h2>
Labels, aliases, and descriptions</h2>
<div>
<div>
Labels, aliases, and descriptions are properties of items that are handled differently from other properties in Wikidata. Labels and descriptions are handled in a similar manner, so I will discuss them together.</div>
<div>
<br /></div>
<div>
Each item in Wikidata can have only one label and one description in any particular language. Therefore adding or changing a label or description requires specifying the appropriate <a href="https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes" target="_blank">ISO 639-1 code</a> for the intended language. When a label or description is changed in Wikidata, the previous version is replaced.</div>
<div>
<br /></div>
<div>
One important restriction is that the label/description combination in a particular language must be unique. For example, the person with the English label "John Jones" and English description "academic" can currently only be <span style="font-family: "courier new" , "courier" , monospace;">Q16089943</span>. Because labels and descriptions can change, this label/description combination won't necessarily be permanently associated with <span style="font-family: "courier new" , "courier" , monospace;">Q16089943</span> because someone might give that John Jones a more detailed description, or make his name less generic by adding a middle name or initial. So at some point in the future, it might be possible for some other John Jones to be described as "academic". An implication of the prohibition against two items sharing the same label/description pair is that it's better to create labels and descriptions that are as specific as possible to avoid collisions with pre-existing entities. As more entities get added to Wikidata, the probability of such collisions increases.</div>
<div>
<br /></div>
<div>
There is no limit to the number of aliases that an item can have per language. Aliases can be changed by either changing the value of a pre-existing alias or adding a new alias. As far as I know, there is no prohibition against aliases of one item matching aliases of another item.</div>
<div>
<br /></div>
<div>
When these statements are transferred to the Wikidata graph database, labels are values of <span style="font-family: "courier new" , "courier" , monospace;">rdfs:label</span>, descriptions are values of <span style="font-family: "courier new" , "courier" , monospace;">schema:description,</span> and aliases are values of <span style="font-family: "courier new" , "courier" , monospace;">skos:altLabel</span>. All of the values are language-tagged.</div>
</div>
<div>
<br /></div>
<h2>
What am I skipping?</h2>
<div>
One component of the Wikibase model that I have not discussed is ranks. I also haven't talked about statements that don't have values (PropertyNoValueSnak and PropertySomeValueSnak) or about sitelinks. These are features that may be important to some users, but they have not yet been important enough for me to incorporate handling them in my code. </div>
<div>
<br /></div>
<h2>
Local data storage</h2>
<div>
If one wanted to make and track changes to Wikidata items, there are many ways to accomplish that with varying degrees of human intervention. Last year, I spent some time pondering all of the options and came up with this diagram:</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqp6cBS09-1snJ9mc39LTpeXwStvdvSMRVui8ogjowrAiPQCiP46kAKLk_Rz0cyC9TKjAJAY3Cv54m0wI-MKXQFDicvJqEgLVw5jkYLJXy4KPRu4jeeHzJaGGNaH0S_Urv0i8PBqyBIg0/s1600/diagram9.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="531" data-original-width="974" height="348" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqp6cBS09-1snJ9mc39LTpeXwStvdvSMRVui8ogjowrAiPQCiP46kAKLk_Rz0cyC9TKjAJAY3Cv54m0wI-MKXQFDicvJqEgLVw5jkYLJXy4KPRu4jeeHzJaGGNaH0S_Urv0i8PBqyBIg0/s640/diagram9.png" width="640" /></a></div>
<div>
<br /></div>
<div>
Tracking every statement, reference, and qualifier for items would be complicated because each item could have an indefinite number and kind of properties, values, references, and qualifiers. To track all of those things would require a storage system as complicated as Wikidata itself (such as a separate relational database or a Wikibase instance, as shown at the bottom of the diagram). That's way beyond what I'm interested in doing now. But what I learned about the Wikibase model and how data items are identified suggested to me a way to track all of the data that I care about in a single, flat spreadsheet. That workflow can be represented by this subset of the diagram above:</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgleQdiX1ZSdXwAwCwswc7QwfI5hO2o00U6BWbABZnTJem249MR-LVq2DaiDsVuW-MxRGhA15wnW58QeXVD3KLPn4CHlfnjhLmZwXnQRFJo0mwrcoadata1biXu7SsNg_IshQnkxCzB2s0/s1600/diagram10.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="279" data-original-width="445" height="250" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgleQdiX1ZSdXwAwCwswc7QwfI5hO2o00U6BWbABZnTJem249MR-LVq2DaiDsVuW-MxRGhA15wnW58QeXVD3KLPn4CHlfnjhLmZwXnQRFJo0mwrcoadata1biXu7SsNg_IshQnkxCzB2s0/s400/diagram10.png" width="400" /></a></div>
<div>
<br /></div>
<div>
<div>
I decided on the following structure for the spreadsheet (a CSV file, example <a href="https://github.com/HeardLibrary/linked-data/blob/master/publications/departments/engineering-to-write.csv" target="_blank">here</a>). The Wikidata Q ID serves as the key for an item, and each row contains the data about a particular item. A value in the Wikidata ID column indicates that the item already exists in Wikidata. If the Wikidata ID column does not have a value, that indicates that the item needs to be created. </div>
<div>
<br /></div>
<div>
Each statement has a column representing the property with the value of that property for an item recorded in the cell for that item's row. For each property column, there is an associated column for the UUID identifying the statement consisting of the item, property, and value. If there is no value for a property, no information is available to make that statement. If there is a value and no UUID, then the statement needs to be asserted. If there is a value and a UUID, the statement already exists in Wikidata. </div>
<div>
<br /></div>
<div>
References consist of one or more columns representing the properties that describe the reference. References have a single column to record the hash identifier for the reference. As with statements, if the identifier is absent, that indicates that the reference needs to be added to Wikidata. If the identifier is present, the reference has already been asserted. </div>
<div>
<br /></div>
<div>
Because labels, descriptions, and many qualifiers do not have URIs assigned as their identifiers, their values are listed in columns of the table without corresponding identifier columns. Knowing whether the labels, descriptions, and qualifiers in the table already exist in Wikidata requires making a SPARQL query. That process is described in the fourth blog post.</div>
</div>
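<div>
<br /></div>
<div>
To make that structure concrete, here is a hypothetical fragment of such a table. The column headers are invented for this illustration (the real CSV linked above uses its own naming conventions), and the identifier values in the first row are the examples discussed earlier in this post. The second row shows an item that has not yet been created: it has a label, a description, and an employer value, but no Q ID, statement UUID, or reference hash yet.</div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">wikidata_id,label_en,description_en,employer,employer_uuid,employer_ref_retrieved,employer_ref_hash<br />
Q42352198,Antonis Rokas,(a description),Q29052,FB9EABCA-69C0-4CFC-BDC3-44CCA9782450,2019-01-12,8cfae665e8b64efffe44128acee5eaf584eda3a3<br />
,(a label),(a description),Q29052,,,</span></div>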
<div>
<br /></div>
<h2>
Where does VanderBot come in?</h2>
<div>
In the first post of this series, I showed a version of the following diagram to illustrate how I wanted VanderBot (my Python script for loading Vanderbilt researcher data into Wikidata) to work. That diagram is basically an elaboration of the simpler previous diagram.</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg7rr66cXGcQK6m41Mh87qYFi1mO4oQLROSMant1rICiiBG_ik_x1VWBw1B536oz8KtHVpwOxbLmwwA78wZdm96rYU_bcAPw7PqjHPQRTH3OvNnLWRymTXC_L6-mGwVeItQWcKcf5kHxDU/s1600/diagram11.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="537" data-original-width="974" height="352" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg7rr66cXGcQK6m41Mh87qYFi1mO4oQLROSMant1rICiiBG_ik_x1VWBw1B536oz8KtHVpwOxbLmwwA78wZdm96rYU_bcAPw7PqjHPQRTH3OvNnLWRymTXC_L6-mGwVeItQWcKcf5kHxDU/s640/diagram11.png" width="640" /></a></div>
<div>
<br /></div>
<div>
The part of the workflow circled in green is the <b>API writing script </b>that I will describe in the <a href="http://baskauf.blogspot.com/2020/02/vanderbot-part-3-writing-data-from-csv.html" target="_blank">third post of this series</a> (the next one). The part of the workflow circled in orange is the <b>data harvesting script</b> that I will describe in the <a href="http://baskauf.blogspot.com/2020/02/vanderbot-part-4-preparing-data-to-send.html" target="_blank">fourth post</a>. Together these two scripts form VanderBot in its current incarnation.</div>
<div>
<br /></div>
<div>
Discussing the scripts in that order may seem a bit backwards because when VanderBot operates, the data harvesting script works before the API writing script. But in developing the two scripts, I needed to think about how I was going to write to the API before I thought about how to harvest the data. So it's probably more sensible for you to learn about the API writing script first as well. Also, the design of the API writing script is intimately related to the Wikidata data model, so that's another reason to talk about it next after this post.</div>
<div>
<br /></div>
Steve Baskaufhttp://www.blogger.com/profile/01896499749604153763noreply@blogger.com0tag:blogger.com,1999:blog-5299754536670281996.post-10728762733069414042020-02-06T20:38:00.001-08:002021-03-13T07:33:30.697-08:00VanderBot: A Python Script for Writing to Wikidata (part 1)<img border="0" data-original-height="534" data-original-width="975" height="350" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEglXZX_VaU_L4rlpa2deouJE5ysjwBM6jR7sXCsVFo4l7SyPDRwn1-8uO-hU0ekaDKdF48HoMlBV1gwrRcBLx4KhKULzRQrOBBcMSIU4uDbPi5QO9T3t1sujxwV1Cfb7Ib2yJvddQ23YN8/s640/diagram2.png" style="display: none;" width="640" /><br /><div class="MsoNormal"><b>Note added 2021-03-13:</b> Although this post is still relevant for understanding the conceptual ideas behind my project to write Vanderbilt researcher/scholar records to Wikidata, I have written another series of blog posts showing (with lots of screenshots and handholding) how you can safely write <b>your own data</b> to the Wikidata API using data that is stored in simple CSV spreadsheets. See <a href="http://baskauf.blogspot.com/2021/03/writing-your-own-data-to-wikidata-using.html" target="_blank">this post</a> for details.</div><div class="MsoNormal"><br /></div><div class="MsoNormal">If you follow my blog, you will notice that I haven't
written much in the last six months. That is at least partly because I've spent
a lot of time working out the practical details of creating a "bot"
that I can use to upload data about Vanderbilt researchers and scholars into
Wikidata. In an <a href="http://baskauf.blogspot.com/2019/06/putting-data-into-wikidata-using.html" target="_blank">earlier post from June last year</a>,
I described in general terms some background about writing to Wikibase, the
platform on which Wikidata is built. (You probably should review that post for
background before starting in on this one.) However, there were a lot of
practical details that needed to be worked out to write to the "real"
Wikidata.<span style="mso-spacerun: yes;"> </span>Those details are what I'll
talk about in this post. <o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
One question I'll dispense with at the start of the post is
"Why didn't you just use Pywikibot?" There are two reasons. One is
that when I<a href="https://heardlibrary.github.io/digital-scholarship/host/wikidata/pywikibot/" target="_blank"> experimented with using Pywikibot and our Wikibase instance</a>,
I encountered an approximately 10 second delay between write operations. I'm
sure that there is some way to defeat that delay, but I was not able to figure
it out by looking through the Pywikibot code and documentation. This brings me
to the second reason. I really don't like to use other people's code that I
don't understand. When I looked through the Pywikibot code, there were layers
of objects and functions calling other objects and functions in different
files. After a short period of sorting through the code, I realized that there
was no way that I was going to understand what was going on with Pywikibot at my
current level of skill with Python. <o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
After that experience, I decided to build my bot from the
ground up. Obviously that took more time, but in the end I actually understood
everything that I was doing and also had a much better idea of how the Wikibase
API works.<span style="mso-spacerun: yes;"> </span>The code that I've written is
relatively linear and is liberally annotated with comments. So I hope that people
with a moderate level of experience with Python can understand what I did and
be able to hack the code to meet their own needs.</div>
<div class="MsoNormal">
<br /></div>
<h2>
Where I last left off</h2>
<div class="MsoNormal">
In the previous post about writing to Wikidata, I described
a simple script that took data from a CSV file and wrote it to a Wikibase
instance (the test Wikidata instance, an independent Wikibase installation, or
the real Wikidata).</div>
<div class="MsoNormal">
<o:p></o:p></div>
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgjJfQJvu8lx5M9OOTRVe_fdUBof9RqgTS2ofe93iPln0n4xbqqINye_r9LdQkgtLvvDPsGkA1iPu9dz3lqVI8TJYQGJ0FU09Sa0ffX8KZ-YOcUPo_J2cFmhHePHWNmXEzQ2HTrb5stH6w/s1600/diagram1.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="560" data-original-width="975" height="366" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgjJfQJvu8lx5M9OOTRVe_fdUBof9RqgTS2ofe93iPln0n4xbqqINye_r9LdQkgtLvvDPsGkA1iPu9dz3lqVI8TJYQGJ0FU09Sa0ffX8KZ-YOcUPo_J2cFmhHePHWNmXEzQ2HTrb5stH6w/s640/diagram1.png" width="640" /></a></div>
<br />
<br />
<div class="MsoNormal">
That script was very limited. It was only able to write
statements and could not associate references with those statements nor add
qualifiers to the statements. It only created new items and had no way to know
if the described entities already existed in the Wikibase instance.<span style="mso-spacerun: yes;"> </span>It also had no way to track data about items
once they had been written. Finally, it simply wrote the data as fast as it could
and did not consider whether it should slow its rate due to high load on the
Wikibase API.</div>
<div class="MsoNormal">
<br /></div>
<h2>
Where I wanted to be</h2>
<div class="MsoNormal">
A major deficiency of the previous script was that its
communication with the Wikibase instance was only one-way. It wrote to the API,
but made little use of the API's response and it made no use of Wikibase's
capabilities to respond to SPARQL queries.
The workflow that I wanted to facilitate was more complicated.</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEglXZX_VaU_L4rlpa2deouJE5ysjwBM6jR7sXCsVFo4l7SyPDRwn1-8uO-hU0ekaDKdF48HoMlBV1gwrRcBLx4KhKULzRQrOBBcMSIU4uDbPi5QO9T3t1sujxwV1Cfb7Ib2yJvddQ23YN8/s1600/diagram2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="534" data-original-width="975" height="350" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEglXZX_VaU_L4rlpa2deouJE5ysjwBM6jR7sXCsVFo4l7SyPDRwn1-8uO-hU0ekaDKdF48HoMlBV1gwrRcBLx4KhKULzRQrOBBcMSIU4uDbPi5QO9T3t1sujxwV1Cfb7Ib2yJvddQ23YN8/s640/diagram2.png" width="640" /></a></div>
<o:p></o:p><br />
<br />
<br />
<div class="MsoNormal">
I wanted the script to first send a SPARQL query to the Query
Service to determine which of the data (including references and qualifiers) I
wanted to write already existed in Wikidata. (From this point forward, I'm
going to refer to the "real" Wikidata instance of Wikibase, so I will
stop talking about Wikibase generically.) That information would then be used
to determine for each record whether the script needed to: create a new item,
add or change labels and descriptions, add statements to an existing item,
add references and qualifiers to existing statements, or do nothing because all
of the desired information was already there.<span style="mso-spacerun: yes;">
</span><o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Once it was determined what needed to be written, the script
would then compose the appropriate JSON (based on the form of "snaks"
in the Wikibase model) for an item and send it to the API. Using the response
from the API, the script would update the records to indicate that the data
were now present in Wikidata. Based on feedback from the API, the script would also
limit its request rate to avoid hitting it too fast at times of high usage.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Eventually, the data uploaded to the API would become
available via the Query Service, making it possible to track in the future
whether the data were still present in Wikidata.</div>
<div class="MsoNormal">
<br /></div>
<h2>
What is VanderBot?</h2>
<div class="MsoNormal">
The simple answer to this question is that VanderBot is the
set of Python scripts that I created to write data to Wikidata. The code is
<a href="https://github.com/HeardLibrary/linked-data/tree/master/publications" target="_blank">freely available in GitHub</a>. However, the question is a little more
complicated than that. </div>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
When an application communicates with a server over the
internet, it is technically known as a "User-Agent". It is considered
polite and good practice for a User-Agent to identify itself to the server via
an HTTP request header. When I use the scripts I've written, I send the header </div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;">VanderBot/0.8
(https://github.com/HeardLibrary/linked-data/tree/master/publications;
mailto:steve.baskauf@vanderbilt.edu)</span></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
So VanderBot is also the name of a
User-Agent. Technically, if you used my script without editing it, you would be
using the VanderBot User-Agent, but it probably would be better to not send the
header above, since I don't want server administrators to email me if you do
bad things to their server.<span style="mso-spacerun: yes;"> </span>So you should
change the User-Agent header values if you use or modify the VanderBot code.
(Similarly, you should also change the tool name and email address sent to the
NCBI API in that part of the code - please do not use mine!)<o:p></o:p></div>
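<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
If you do adapt the code, setting your own header with the requests library looks roughly like this. The tool name, URL, and email address below are placeholders that you would replace with your own, and the token request is just an example call to show the header being sent.</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;">import requests<br />
<br />
session = requests.Session()<br />
# Replace these placeholder values with your own tool name, repository URL, and email address.<br />
session.headers.update({<br />
&nbsp;&nbsp;&nbsp;&nbsp;'User-Agent': 'MyWikidataBot/0.1 (https://example.org/my-bot; mailto:me@example.org)'<br />
})<br />
# Example request: ask the API for a CSRF token (needed before any edit).<br />
response = session.get('https://www.wikidata.org/w/api.php',<br />
&nbsp;&nbsp;&nbsp;&nbsp;params={'action': 'query', 'meta': 'tokens', 'type': 'csrf', 'format': 'json'})<br />
print(response.json())</span></div>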
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
When you write to the Wikidata API, you need to be logged in
as a Wikidata user. I have created a <a href="https://www.wikidata.org/wiki/User:VanderBot" target="_blank">Wikidata user account called VanderBot</a>,
so if I make edits using that account, they are credited to VanderBot in the
page history. So VanderBot is also a registered bot in Wikidata. But since you
don't have my VanderBot access credentials, you can't make edits to Wikidata as
VanderBot even if you use the Vanderbot scripts.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
So the complicated answer is that you are welcome to use the
VanderBot code, you probably shouldn't be using "VanderBot" in a
User-Agent header (and definitely not my email address), and you can't use the
VanderBot Wikidata bot account.</div>
<div class="MsoNormal">
<br /></div>
<h2>
Upcoming posts</h2>
<div class="MsoNormal">
In <a href="http://baskauf.blogspot.com/2020/02/vanderbot-part-2-wikibase-data-model.html" target="_blank">part 2 of this series</a>, I will talk about the <b>Wikibase data model</b> and identifiers used for entities in the Wikidata graph. The model and identifier system influenced my choices about how to write the code.</div>
<br />
In <a href="http://baskauf.blogspot.com/2020/02/vanderbot-part-3-writing-data-from-csv.html" target="_blank">part 3</a>, I will describe the <b>API writing script</b> that maps tabular data to the Wikibase model, then writes those data to the Wikidata API.<br />
<br />
In the final <a href="http://baskauf.blogspot.com/2020/02/vanderbot-part-4-preparing-data-to-send.html" target="_blank">part 4</a>, I will describe the <b>data harvesting script</b> that is used to assemble the data to be written to Wikidata and that ensures that duplicate data are not added.<br />
<br />
<br />Steve Baskaufhttp://www.blogger.com/profile/01896499749604153763noreply@blogger.com1tag:blogger.com,1999:blog-5299754536670281996.post-44156780539854009722019-10-23T14:54:00.000-07:002020-03-04T19:11:56.976-08:00Understanding the Standards Documentation Specification, Part 6: The rs.tdwg.org repository<br />
<div class="MsoNormal">
This is the sixth and final post in a series on the TDWG
Standards Documentation Specification (SDS).<span style="mso-spacerun: yes;">
</span>The five earlier posts explain the history and model of the SDS, and how
to retrieve the machine-readable metadata about TDWG standards.</div>
<div class="MsoNormal">
<br />
Note: this post was revised on 2020-03-04 when IRI dereferencing of the http://rs.tdwg.org/ subdomain went from testing into production.</div>
<h2>
Where do standards data live? </h2>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
<o:p> </o:p>In earlier posts in this series, I said that after the SDS
was adopted, there wasn't any particular plan for actually putting it into
practice. Since I had a vested interest in its success, I took it upon myself
to work on the details of its implementation, particularly with respect to making
standards metadata available in machine-readable form.</div>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
The SDS is silent about where data should live and how it
should be turned into the various serializations that should be available when clients
dereference resource IRIs. <span style="mso-spacerun: yes;"> </span>My thinking
on this subject was influenced by my observations about previous management of TDWG
standards data.<span style="mso-spacerun: yes;"> </span>In the past, the
following things have happened to TDWG standards data:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
</div>
<ul>
<li>the standards documents for TAPIR were accidentally
overwritten and lost.</li>
<li>the authoritative Darwin Core (DwC) documents were locked
up on a proprietary publishing system where only a few people could look at
them or even know what was there.</li>
<li>the normative Darwin Core document was written in RDF/XML,
which no one could read and which had to be edited by hand.</li>
</ul>
<o:p></o:p><br />
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Given that background I was pretty convinced that the place
for the standards data to live was in a public GitHub repository.<span style="mso-spacerun: yes;"> </span>I was able to have a <a href="https://github.com/tdwg/rs.tdwg.org" target="_blank">repository called rs.tdwg.org</a> set up in the TDWG GitHub site for the purpose of storing the standards
metadata.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<h2>
Form of the standards metadata</h2>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
<o:p> </o:p>Given past problems with formats that have become obsolete or
that were difficult to read and edit, I was convinced that the standards
metadata should be in a simple format.
To me the obvious format was CSV. </div>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
At the time I started working on this project, I had been
working on an application to transform CSV spreadsheets into various forms of
RDF, so I had already been thinking about how the CSV spreadsheets should be
set up to do that.<span style="mso-spacerun: yes;"> </span>I liked the model used
for DwC Archives (DwC-A) and defined in the <a href="https://dwc.tdwg.org/text/" target="_blank">DwC text guide</a>.</div>
<div class="MsoNormal">
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiuKNmiF39Bew-x_qq_Bijnjuyr2IDuOCPY4FHZ0N1zozwBgQIKQJu4Pf1wXNqQ2vDidQvjRxb0mzUdZ37vwRt9Pt1MxAs7O1qyq7p6xEFxNpyrAH-oVRP-dp3niMkUGkGdjsMsIuhyphenhyphen124/s1600/table.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="350" data-original-width="1200" height="186" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiuKNmiF39Bew-x_qq_Bijnjuyr2IDuOCPY4FHZ0N1zozwBgQIKQJu4Pf1wXNqQ2vDidQvjRxb0mzUdZ37vwRt9Pt1MxAs7O1qyq7p6xEFxNpyrAH-oVRP-dp3niMkUGkGdjsMsIuhyphenhyphen124/s640/table.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><div class="MsoNormal">
Example metadata CSV file for terms defined by Audubon Core:
<a href="https://github.com/tdwg/rs.tdwg.org/blob/master/audubon/audubon.csv" target="_blank">audubon.csv</a><o:p></o:p></div>
</td></tr>
</tbody></table>
<br />
In the DwC-A model, each table is "about" some
class of thing.<span style="mso-spacerun: yes;"> </span>Each row in a data table
represents an instance of that class, and each column represents some property
of those instances.<span style="mso-spacerun: yes;"> </span>The contents of each
cell represent the value of the property for that instance.<span style="mso-spacerun: yes;"> </span></div>
<div class="MsoNormal">
<o:p></o:p></div>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgK5ZjGBUlJDLMN95LCe7RLIziEbUqODjL-SzJz5pnLWUxmHm9QGqwUfHn6yKrJOoQ52J5c4r_HGreKruqXWTNim4tccZztx_WkoDIJHukgXkE3tcaTlvNMAYicVKHGN9Kdcne3SQezHVE/s1600/dwc-a.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="358" data-original-width="555" height="412" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgK5ZjGBUlJDLMN95LCe7RLIziEbUqODjL-SzJz5pnLWUxmHm9QGqwUfHn6yKrJOoQ52J5c4r_HGreKruqXWTNim4tccZztx_WkoDIJHukgXkE3tcaTlvNMAYicVKHGN9Kdcne3SQezHVE/s640/dwc-a.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: small; text-align: start;">Darwin Core Archive model (from the </span><a href="https://dwc.tdwg.org/text/" style="font-size: medium; text-align: start;" target="_blank">Darwin Core Text Guide</a><span style="font-size: small; text-align: start;">)</span></td></tr>
</tbody></table>
<br />
In order to associate the columns with their property terms,
DwC Archives use an XML file (meta.xml) that maps the intended properties
to the columns of the spreadsheet. Since
a flat spreadsheet can't handle one-to-many relationships very well, the model
connects the instances in the core spreadsheet with extension tables that allow
properties to have multiple values.<br />
<div>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
For the purposes of generating RDF, the form of the meta.xml
file is not adequate.<span style="mso-spacerun: yes;"> </span>One problem is
that the meta.xml file does not indicate whether the value (known in RDF as the
object) recorded in the cell is supposed to be a literal (string) or an IRI. <span style="mso-spacerun: yes;"> </span>A second problem is that in RDF values of
properties can also have language tags or datatypes if they are not plain
literals. Finally, a DwC Archive assumes that each row represents a
single type of thing, but a row may actually contain information about
several types of things.<br />
<span style="mso-spacerun: yes;"> </span><o:p></o:p></div>
</div>
<div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEizRuw9lzfQfQpNbxgWdkZSAXK1UuWou1z8SFUTDenmg4Kd2fjy_CCNJl0syYAp8IY7znOE4-ucMi8tn0KnI6RJU6eG2cm9Lzm7mQtT0oVdTH4OgXDPkYh-YzHFbsJMxWPZVnMJ37ddR5E/s1600/mapping.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="236" data-original-width="635" height="236" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEizRuw9lzfQfQpNbxgWdkZSAXK1UuWou1z8SFUTDenmg4Kd2fjy_CCNJl0syYAp8IY7znOE4-ucMi8tn0KnI6RJU6eG2cm9Lzm7mQtT0oVdTH4OgXDPkYh-YzHFbsJMxWPZVnMJ37ddR5E/s640/mapping.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><div class="MsoNormal">
Example CSV mapping file: <a href="https://github.com/tdwg/rs.tdwg.org/blob/master/audubon/audubon-column-mappings.csv" target="_blank">audubon-column-mappings.csv</a><o:p></o:p></div>
</td></tr>
</tbody></table>
<br />
For those reasons I ended up creating my own form of mapping
file -- another CSV file rather than a file in XML format. I won't go into more details here, since I've
already described the system of files in <a href="http://baskauf.blogspot.com/2016/10/guid-o-matic-goes-to-china.html" target="_blank">another blog post</a>. But you can see from the example above that the file relates the column
headers to properties, indicates the type of object (IRI, plain literal, datatyped
literal, or language tagged literal), and provides the value of the language
tag or datatype. The final column
indicates whether that column applies to the main subject of the table or an
instance of another class that has a one-to-one relationship with the subject
resource. </div>
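<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
To make the role of the mapping file more concrete, here is a minimal sketch in Python (not the actual code used for rs.tdwg.org) of how a script could combine a core CSV table with its column-mapping CSV to generate Turtle triples. The mapping-file column names used here ("header", "predicate", "type", and "value") and the name of the subject IRI column are assumptions for illustration; the real files may use different headers.</div>
<pre style="font-family: 'courier new', courier, monospace;">
# Minimal sketch (not the actual rs.tdwg.org code): combine a core CSV table
# with its column-mapping CSV to produce RDF triples in Turtle.  No escaping
# of special characters is done -- this is just to show the basic idea.
import csv

def load_rows(path):
    # Read a CSV file into a list of dictionaries keyed by column header.
    with open(path, newline='', encoding='utf-8') as f:
        return list(csv.DictReader(f))

def turtle_object(cell, mapping):
    # Format a cell value as a Turtle object according to its mapping row.
    if mapping['type'] == 'iri':
        return '<' + cell + '>'
    if mapping['type'] == 'datatype':
        return '"' + cell + '"^^<' + mapping['value'] + '>'
    if mapping['type'] == 'language':
        return '"' + cell + '"@' + mapping['value']
    return '"' + cell + '"'   # plain literal

def emit_turtle(core_csv, mapping_csv, iri_column):
    # Yield one Turtle statement per mapped, non-empty cell in the core table.
    mappings = load_rows(mapping_csv)
    for row in load_rows(core_csv):
        subject = '<' + row[iri_column] + '>'
        for m in mappings:
            cell = row.get(m['header'], '')
            if cell:
                yield subject + ' <' + m['predicate'] + '> ' + turtle_object(cell, m) + ' .'

# Hypothetical usage:
# for triple in emit_turtle('audubon.csv', 'audubon-column-mappings.csv', 'iri'):
#     print(triple)
</pre>
<div class="MsoNormal">
A real implementation would also need to use the links file described below to handle columns that describe instances of other classes, but the basic pattern is the same.</div>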
<div>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhldp1lnaMdElhWYOlkL8G2XtFXjwh6vyhPtMIpftJcpK0s7ZWmC8OgdoQstDB6O9e5n2O0WSCeQU5zgPCfznD6zRLP4sYvw8qRepEWBTvhPDlsRxfH_lpNPq5Dmj9hHANkWy8PJ0p1P0s/s1600/links-table.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="76" data-original-width="887" height="54" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhldp1lnaMdElhWYOlkL8G2XtFXjwh6vyhPtMIpftJcpK0s7ZWmC8OgdoQstDB6O9e5n2O0WSCeQU5zgPCfznD6zRLP4sYvw8qRepEWBTvhPDlsRxfH_lpNPq5Dmj9hHANkWy8PJ0p1P0s/s640/links-table.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Example extension links file: <a href="https://github.com/tdwg/rs.tdwg.org/blob/master/audubon/linked-classes.csv" target="_blank">linked-classes.csv</a><br />
<div class="MsoNormal">
<o:p></o:p></div>
</td></tr>
</tbody></table>
<br /></div>
<div class="MsoNormal">
<o:p><br /></o:p></div>
<div class="MsoNormal">
The links between the core file and the extensions are
described in a separate links file (e.g. <a href="https://github.com/tdwg/rs.tdwg.org/blob/master/audubon/linked-classes.csv" target="_blank">linked-classes.csv</a>). In this example, extension files are required
because each term can have many versions and a term can also replace more than
one term. Because in RDF the links can
be described by properties in either direction, the links file lists the
property linking from the extension to the core file (e.g. <span style="font-family: "courier new" , "courier" , monospace;">dcterms:isVersionOf</span>)
and from the core file to the extension (e.g. <span style="font-family: "courier new" , "courier" , monospace;">dcterms:hasVersion</span>). <o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
This system differs a bit from the DwC-A system where the
fields in the linked extension files are described within the same meta.xml
file. I opted to have a separate mapping file for each extension. The filenames listed in the
linked-classes.csv file point to the extension data files and the mapping files
associated with the extension data files use the same naming pattern as the
mapping files for the core file.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
The description of file types above explains most of the
many files that you'll find if you look in a particular directory in the
rs.tdwg.org repo.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<h2>
Organization of directories in rs.tdwg.org</h2>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
<o:p> </o:p>The set of files detailed above describe a single category
of resources. Most of the directories in
the rs.tdwg.org repository contain such a set that is associated with a
particular namespace that is in use within a TDWG vocabulary (in the language
of the SDS, "term lists"). For
example, <a href="https://github.com/tdwg/rs.tdwg.org/tree/master/audubon" target="_blank">the directory "audubon"</a> (containing the example files above)
describes the current terms minted by Audubon Core and <a href="https://github.com/tdwg/rs.tdwg.org/tree/master/terms" target="_blank">the directory "terms"</a> describes terms minted by Darwin Core. There are also directories that describe
terms that are borrowed by Audubon Core or Darwin Core. Those directories have names that end with
"<span style="font-family: "courier new" , "courier" , monospace;">-for-ac</span>" or "<span style="font-family: "courier new" , "courier" , monospace;">-for-dwc</span>". </div>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
For each of the directories that describe terms in a
particular namespace, there is another directory that describes the versions of
those terms. Those directory names have
"<span style="font-family: "courier new" , "courier" , monospace;">-versions</span>" appended to the directory name for their corresponding
current terms. <o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Finally, there are some special directories that describe
resources in the TDWG standards hierarchy at levels higher than individual
terms: "<a href="https://github.com/tdwg/rs.tdwg.org/tree/master/term-lists" target="_blank">term-lists</a>", "<a href="https://github.com/tdwg/rs.tdwg.org/tree/master/vocabularies" target="_blank">vocabularies</a>", and
"<a href="https://github.com/tdwg/rs.tdwg.org/tree/master/standards" target="_blank">standards</a>". There is also a
special directory for documents ("<a href="https://github.com/tdwg/rs.tdwg.org/tree/master/docs" target="_blank">docs</a>") that describes all of the
documents that are associated with TDWG standards. Taken together, all of these directories
contain the metadata necessary to completely characterize all of the components
of TDWG standards.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<h2>
Using rs.tdwg.org metadata</h2>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
<o:p> </o:p>In theory, you could pick through all of the CSV files that
I just described and learn anything you wanted to know about any part of any
TDWG standard. However, that would be a
lot to ask of a human. The real purpose
of the repository is to provide source data for software that can generate the
human- and machine-readable serializations that the SDS specifies. By building all of the serializations from
the same CSV tables, we can reduce errors caused by human entry and guarantee
that a consumer always receives exactly the same metadata regardless of the
chosen format.</div>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
One option for creating the serializations is to run a build
script that generates the serialization as a static file. I used this approach to generate the Audubon
Core Term List document. <a href="https://github.com/tdwg/ac/blob/master/code/build_page.py" target="_blank">A Python script</a> generates Markdown from the appropriate CSV files. The generated file is pushed to GitHub where
it is rendered as a web page via GitHub Pages.<o:p></o:p></div>
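<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
The sketch below shows the general pattern of that kind of build script (it is not the actual build_page.py, and the CSV column names "term_localName", "label", and "definition" are assumptions for illustration): read the term CSV file and write a chunk of Markdown for each row.</div>
<pre style="font-family: 'courier new', courier, monospace;">
# Minimal sketch of the "build script" approach: read a term CSV file and
# generate a static Markdown term list that GitHub Pages can render.
# Column names are assumptions and may not match the actual rs.tdwg.org headers.
import csv

def build_term_list(csv_path, namespace_iri, out_path):
    with open(csv_path, newline='', encoding='utf-8') as f:
        terms = list(csv.DictReader(f))
    lines = ['# Term list', '']
    for term in terms:
        lines.append('## ' + term['label'])
        lines.append('')
        lines.append('Term IRI: ' + namespace_iri + term['term_localName'])
        lines.append('')
        lines.append(term['definition'])
        lines.append('')
    with open(out_path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(lines))

# Hypothetical usage:
# build_term_list('audubon.csv', 'http://rs.tdwg.org/ac/terms/', 'termlist.md')
</pre>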
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Another option is to generate the serializations on the fly based
on the CSV tables. In <a href="http://baskauf.blogspot.com/2017/03/a-web-service-with-content-negotiation.html" target="_blank">another blog post</a> I describe my efforts to set up a web service that uses CSV files of the form
described above to generate RDF/Turtle, RDF/XML, or JSON-LD serializations of
the data. That system has now been implemented for TDWG standards components. <o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
The SDS specifies that if an IRI is dereferenced with an
<span style="font-family: "courier new" , "courier" , monospace;">Accept:</span> header for one of the RDF serializations, the server should perform
content negotiation (303 redirect) to direct the client to the URL for the
serialization they want. For example, when a client that is a browser (with an Accept header of text/html) dereferences the Darwin Core term IRI <span style="font-family: "courier new" , "courier" , monospace;"><a href="http://rs.tdwg.org/dwc/terms/recordedBy" target="_blank">http://rs.tdwg.org/dwc/terms/recordedBy</a></span>, it will be redirected to the Darwin Core Quick Reference
Guide bookmark for that term. However,
if an <span style="font-family: "courier new" , "courier" , monospace;">Accept:</span> header of <span style="font-family: "courier new" , "courier" , monospace;">text/turtle</span> is used, the client will be redirected to <span style="font-family: "courier new" , "courier" , monospace;"><a href="http://rs.tdwg.org/dwc/terms/recordedBy.ttl" target="_blank">http://rs.tdwg.org/dwc/terms/recordedBy.ttl</a></span>
. Similarly, <span style="font-family: "courier new" , "courier" , monospace;">application/rdf+xml</span>
redirects to a URL ending in <span style="font-family: "courier new" , "courier" , monospace;">.rdf</span> and <span style="font-family: "courier new" , "courier" , monospace;">application/json</span> or <span style="font-family: "courier new" , "courier" , monospace;">application/ld+json</span>
redirects to a URL ending in <span style="font-family: "courier new" , "courier" , monospace;">.json</span> .
Those URLs for specific serializations can also be requested directly
without requiring content negotiation.<br />
<o:p></o:p></div>
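<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
You can check that behavior with a few lines of Python -- a minimal sketch using the requests library; the expected redirect targets noted in the comments are the ones described above.</div>
<pre style="font-family: 'courier new', courier, monospace;">
# Minimal sketch: exercise content negotiation on a Darwin Core term IRI.
# With allow_redirects=False the 303 redirect itself is visible; otherwise
# requests would follow it automatically.
import requests

term_iri = 'http://rs.tdwg.org/dwc/terms/recordedBy'

# Ask for RDF/Turtle.
response = requests.get(term_iri, headers={'Accept': 'text/turtle'},
                        allow_redirects=False)
print(response.status_code)               # expect 303
print(response.headers.get('Location'))   # expect a URL ending in .ttl

# A browser-style request should instead be redirected to the Quick Reference
# Guide bookmark for the term.
response = requests.get(term_iri, headers={'Accept': 'text/html'},
                        allow_redirects=False)
print(response.headers.get('Location'))
</pre>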
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
The test system also generates HTML web pages for obsolete
Darwin Core terms that otherwise wouldn't be available via the Darwin Core
website. For example: <span style="font-family: "courier new" , "courier" , monospace;"><a href="http://rs.tdwg.org/dwc/curatorial/Preparations" target="_blank">http://rs.tdwg.org/dwc/curatorial/Preparations</a></span>
redirects to <span style="font-family: "courier new" , "courier" , monospace;"><a href="http://rs.tdwg.org/dwc/curatorial/Preparations.htm" target="_blank">http://rs.tdwg.org/dwc/curatorial/Preparations.htm</a></span>, a web page describing an
obsolete Darwin Core term from 2007.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Providing term dereferencing of this sort is considered a
best practice in the Linked Data community.
But for developers interested in obtaining the machine-readable
metadata, as a practical matter it's probably easier to just get a
machine-readable dump of the whole dataset by one of the methods
described in my earlier posts. However,
having the data available in CSV form on GitHub makes the data available in a primitive
"machine-readable" form that doesn't really have anything to do with
Linked Data. Anyone can write a script
to retrieve the raw CSV files from the GitHub repo and process them using
conventional means as long as they understand how the various CSV files within
a directory are related to each other.
Because of the simplicity of the format of the data, it is highly likely
that they will be usable long into the future (or at least as long as GitHub is
viable) even if Linked Data falls by the wayside.<o:p></o:p></div>
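<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
For example, here is a minimal sketch that retrieves the audubon.csv file shown earlier directly from the repository (using the standard raw.githubusercontent.com URL pattern) and reads it with Python's csv module.</div>
<pre style="font-family: 'courier new', courier, monospace;">
# Minimal sketch: fetch one of the raw CSV files from the rs.tdwg.org GitHub
# repository and process it with conventional tools.
import csv
import io
import requests

url = ('https://raw.githubusercontent.com/tdwg/rs.tdwg.org/'
       'master/audubon/audubon.csv')
response = requests.get(url)
response.raise_for_status()

for row in csv.DictReader(io.StringIO(response.text)):
    # Each row describes one current Audubon Core term; the available columns
    # are whatever headers the CSV file defines.
    print(row)
</pre>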
<div class="MsoNormal">
<br /></div>
<h2>
Maintaining the CSV files in rs.tdwg.org</h2>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
<o:p> </o:p>The files in rs.tdwg.org were originally assembled laboriously
by hand (by me) from a variety of sources.
All of the current and obsolete Darwin Core data were pulled from the "complete
history" RDF/XML file that was formerly maintained as the "normative
document" for Darwin Core. Audubon
Core terms data were assembled from the somewhat obsolete terms.tdwg.org website. Data on ancient TDWG standards documents and
their authors were assembled by a lot of detective work on my part. However, maintaining the CSV files manually
is not really a viable option. Whenever a
new version of a term is generated, that should spawn a series of new versions
up the standards hierarchy. The new term
version should result in a new modified date for its corresponding current term,
spawn a new version of its containing term list, result in an addition to the
list of terms contained in the term list, generate a new version of the whole
vocabulary, etc. </div>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
It would be unreliable to trust that a human could make all
of the necessary modifications to all of the CSV files without errors. It is also unreasonable to expect standards
maintainers to have to suffer through editing a bunch of CSV files every time
they need to change a term. They should
only have to make minimal changes to a single CSV file and the rest of the work
should be done by a script. <o:p></o:p></div>
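<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
To give a sense of the bookkeeping involved, here is a minimal sketch of just one step in that cascade: updating the modified date of a current term and appending a row to its versions table. This is not the actual maintenance script, and the file and column names are assumptions for illustration.</div>
<pre style="font-family: 'courier new', courier, monospace;">
# Minimal sketch (not the actual maintenance script) of one step in the
# cascade of edits triggered by a term change: update the "modified" date of
# the current term and append a row to the corresponding versions table.
# Column and file names are assumptions for illustration only.
import csv
import datetime

def touch_term(current_csv, versions_csv, local_name):
    today = datetime.date.today().isoformat()

    # Update the "modified" date of the current term.
    with open(current_csv, newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)
    for row in rows:
        if row['term_localName'] == local_name:
            row['modified'] = today
    with open(current_csv, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    # Append a row to the versions table (assumed columns: term, version, issued).
    with open(versions_csv, 'a', newline='', encoding='utf-8') as f:
        csv.writer(f).writerow([local_name, local_name + '-' + today, today])

# Hypothetical usage:
# touch_term('audubon.csv', 'audubon-versions.csv', 'caption')
</pre>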
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
I've written <a href="https://github.com/tdwg/rs.tdwg.org/blob/master/process/process_rs_tdwg_org.ipynb" target="_blank">a Python script within a Jupyter notebook</a> to do
that work. Currently the script will make changes to the necessary CSV files for term
changes and additions within a single term list (a.k.a. "namespace")
of a vocabulary. It does not yet
handle term deprecations and replacements -- presumably those will be uncommon
enough that they could be done by manual editing. It also doesn't handle changes to the document
metadata. I haven't really implemented
document versioning on rs.tdwg.org, mostly because that's either lost or
unknown information for all of the older standards. That should change in the future, but it just
isn't something I've had the time to work on yet.<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<h2>
Some final notes</h2>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
<o:p> </o:p>Some might take issue with the fact that I've somewhat
unilaterally made these implementation decisions (although I did discuss them
with a number of key TDWG people during the time when I was setting up the rs.tdwg.org
repo). The problem is that TDWG doesn't
really have a very formal mechanism for handling this kind of work. There is the TAG and an Infrastructure interest
group, but neither of them currently has operational procedures for this kind
of implementation. Fortunately, TDWG
generally has given a fairly free hand to people who are willing to do the work
necessary for standards development, and I've received encouragement on this
work, for which I'm grateful. </div>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
I feel relatively confident about the approach of archiving
the standards data as CSV files. With
respect to the method of mapping the columns to properties and my ad hoc system
for linking tables, I think it would actually be better to use the JSON
metadata description files specified in the <a href="https://www.w3.org/TR/csv2rdf/" target="_blank">W3C standard for generating RDF from CSV files</a>. I wasn't aware of that standard when
I started working on the project, but it would probably be a better way to
clarify the relationships between CSV tables and to impart meaning to their
columns. <o:p></o:p></div>
<div class="MsoNormal">
<br />
So far the system that I created
for dereferencing the rs.tdwg.org IRIs seems to be adequate. In the long run, it might be better to use an alternative system. One is to simply have a build script that
generates all of the possible serializations as static files. There would be a lot of them, but who
cares? They could then be served by a
much simpler script that just carried out the content negotiation but did not
actually have to generate the pages.
Another alternative would be to pay a professional to create a better
system. That would involve a commitment
of funds on the part of TDWG. But in
either case the alternative systems could draw their data from the CSV files in
rs.tdwg.org as they currently exist. </div>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<o:p>
</o:p></div>
<div class="MsoNormal">
When we were near the adoption of the SDS, someone asked
whether the model we developed was too complicated. My answer was that it was just complicated
enough to do all of the things that people said that they wanted. One of my goals in this implementation
project was to show that it actually was possible to fully implement the SDS as
we wrote it. Although the mechanism for
managing and delivering the data may change in the future, the system that I've
developed shows that it's reasonable to expect that TDWG can dereference (with content
negotiation) the IRIs for all of the terms that it mints, and provide a full
version history for every term, vocabulary, and document that we've published
in the past.<o:p></o:p><br />
<br />
Note: although this is the last post in this series, some people have asked about how one would actually build a new vocabulary using this system. I'll try to write a follow-up showing how it can be done.</div>
</div>
<div>
<br /></div>
Steve Baskaufhttp://www.blogger.com/profile/01896499749604153763noreply@blogger.com0tag:blogger.com,1999:blog-5299754536670281996.post-51617408856711891862019-06-11T09:41:00.001-07:002019-07-18T16:54:22.627-07:00Comparing the ABCD model to Darwin CoreThis post is very focused on the details of two <a href="https://www.tdwg.org/" target="_blank">Biodiversity Information Standards (TDWG)</a> standards as they relate to Linked Data and graph models. If you are generally interested in approaches to Linked Data graph modeling, you might find it interesting. Otherwise, if you aren't into TDWG standards, you may zone out.<br />
<br />
<h2>
Background</h2>
<h3>
</h3>
<h3>
The TDWG Darwin Core and Access to Biological Collection Data (ABCD) standards</h3>
<a href="https://www.tdwg.org/standards/abcd/" target="_blank">Access to Biological Collection Data</a> (ABCD) is a standard of Biodiversity Information Standards (TDWG). It is classified as a "Current Standard" but is in a special category called "2005" standard because it was ratified just before the present <a href="https://www.tdwg.org/about/process/" target="_blank">TDWG by-laws</a> (which specify the details of the standards development process) were adopted in 2006. Originally, ABCD was defined as an XML schema that could be used to validate XML records that describe biodiversity resources. The various versions of the ABCD XML schema can be found in the <a href="https://github.com/tdwg/abcd/tree/master/xml" target="_blank">ABCD GitHub repository</a>.<br />
<br />
<a href="https://www.tdwg.org/standards/dwc/" target="_blank">Darwin Core</a> (DwC) is a current standard of TDWG that was ratified in 2009. It is modeled after <a href="http://www.dublincore.org/specifications/dublin-core/dcmi-terms/" target="_blank">Dublin Core</a>, with which it shares many similarities. Biodiversity data can be transmitted in several ways: as <a href="http://rs.tdwg.org/dwc/terms/simple/" target="_blank">simple spreadsheets</a>, as <a href="http://rs.tdwg.org/dwc/terms/guides/xml/" target="_blank">XML</a>, and as <a href="https://dwc.tdwg.org/text/" target="_blank">text files structured in a form known as a Darwin Core Archive</a>.<br />
<br />
Nearly all of the more than 1.3 billion records in the <a href="https://www.gbif.org/" target="_blank">Global Biodiversity Information Facility (GBIF)</a> have been marked up in either DwC or ABCD.<br />
<br />
<h3>
My role in Darwin Core</h3>
For some time I've been interested in the possibility of using Darwin Core terms as a way to transmit biodiversity data as Linked Open Data (LOD). That interest has manifested itself in my being involved in three ways with the development of Darwin Core:<br />
<br />
<ul>
<li>as the instigator of the establishment of the <a href="http://rs.tdwg.org/dwc/terms/Organism" target="_blank">dwc:Organism</a> class</li>
<li>as the shepherd of the clarification of definitions of all of the Darwin Core (dwc: namespace) Classes and deprecation of the confusing alternative Darwin Core type vocabulary (dwctype: namespace) classes. </li>
<li>as the lead author of the <a href="https://dwc.tdwg.org/rdf/" target="_blank">Darwin Core RDF Guide</a> (for details, see <a href="http://dx.doi.org/10.3233/SW-150199" target="_blank">http://dx.doi.org/10.3233/SW-150199</a>; open access at <a href="http://bit.ly/2e7i3Sj" target="_blank">http://bit.ly/2e7i3Sj</a>)</li>
</ul>
<br />
All three of these official changes to Darwin Core were approved by decision of the TDWG Executive Committee on October 26, 2014. Along with Cam Webb, I also was involved in an unofficial effort called <a href="https://github.com/darwin-sw/dsw" target="_blank">Darwin-SW</a> (DSW) to develop an RDF ontology to create the graph model and object properties that were missing from the Darwin Core vocabulary. (For details, see <a href="http://dx.doi.org/10.3233/SW-150203" target="_blank">http://dx.doi.org/10.3233/SW-150203</a>; open access at <a href="http://bit.ly/2dG85b5" target="_blank">http://bit.ly/2dG85b5</a>.) More on that later...<br />
<br />
I've had no role with ABCD and honestly, I was pretty daunted by the prospect of plowing through the XML schema to try to understand how it worked. However, I've recently been using some new Linked Data tools to explore ABCD and they have been instrumental for putting the material together for this blog post. More about them later...<br />
<br />
<h2>
A common model for ABCD and Darwin Core?</h2>
Recently, a call went out to people interested in developing a common model for TDWG that would encompass both ABCD and DwC. Because of my past interest in using Darwin Core terms as RDF, I joined the group, which has met online once so far. Because of my basic ignorance about ABCD, I've recently put in some time to try to understand the existing model for ABCD and how it is similar to or different from Darwin Core. In the following sections, I'll discuss some issues with modeling Darwin Core, then report on what I've learned about ABCD and how it compares to Darwin Core.<br />
<br />
<h3>
Darwin Core's missing graph model</h3>
One of the things that surprises some people is that although a DwC RDF Guide exists, it is not possible to express biodiversity data as RDF using only terms currently in the standard.<br />
<br />
What the RDF Guide does is to clear up how the existing terms of Darwin Core should be used and to mint some new terms that can be used for creating links between resources (i.e. to non-literal objects of triples). For example, as adopted, Darwin Core had the term <span style="font-family: "courier new" , "courier" , monospace;">dwc:recordedBy</span> (<span style="font-family: "courier new" , "courier" , monospace;">http://rs.tdwg.org/dwc/terms/recordedBy</span>) to indicate the person who recorded the occurrence of an organism. However, it was not clear whether the value of this term (i.e. the object of a triple of which the predicate was <span style="font-family: "courier new" , "courier" , monospace;">dwc:recordedBy</span>) should be a literal (i.e. a name string) or an IRI (i.e. an identifier denoting an agent). The RDF Guide establishes that <span style="font-family: "courier new" , "courier" , monospace;">dwc:recordedBy</span> should be used with a literal value, and that a new term, <span style="font-family: "courier new" , "courier" , monospace;">dwciri:recordedBy</span> (<span style="font-family: "courier new" , "courier" , monospace;">http://rs.tdwg.org/dwc/iri/recordedBy</span>) should be used to link to an IRI denoting an agent (i.e. a non-literal value). For each term in Darwin Core where it seemed appropriate for an existing term to have a non-literal (IRI) value, a <span style="font-family: "courier new" , "courier" , monospace;">dwciri:</span> namespace analog of that term was created. The terms affected by this decision are detailed in the <span style="font-family: inherit;"><a href="https://dwc.tdwg.org/rdf/#3-term-reference-normative" target="_blank">Term reference section</a></span> of the guide.<br />
<br />
So with the RDF Guide, it is now possible to express a lot of Darwin Core metadata as RDF. But at the time of the adoption of the RDF Guide there were no existing DwC terms that linked instances of the DwC classes (i.e. object properties), so there was no way to fully express a dataset as RDF. (Another way of saying this is that Darwin Core did not have a graph model for its classes.) It seems like there should be a simple solution to that problem: just define some object properties to connect the classes. But as Joel Sachs and I describe in <a href="http://hdl.handle.net/1803/9296" target="_blank">a recent book chapter</a>, that's not as simple as it seems. In section 3.2 of the chapter, we show how users with varying interests may want to use graph models that are more or less complex, and that inconsistencies among those models make it difficult to query across datasets that use different models.<br />
<br />
The Darwin Core RDF Guide was developed not long after a bruising, year-long online discussion about modeling Darwin Core (see <a href="https://github.com/darwin-sw/dsw/wiki/TdwgContentEmailSummary" target="_blank">this page</a> for a summary of the gory details). It was clear that if we had planned to include a graph model and the necessary object properties, the RDF Guide would probably never get finished. So it was decided to create the RDF Guide to deal with the existing terms and leave the development of a graph model as a later effort.<br />
<br />
<h3>
Darwin-SW's graph model</h3>
After the exhausting online discussion (argument?) about modeling Darwin Core, I was so burned out on the subject that I had decided I was basically done with it. However, Cam Webb, the eternal optimist, contacted me and said that we should just jump in and try to create a QL-type ontology that had the missing object properties. (See "For further reference" at the end for definitions of "ontology").<br />
<br />
What made that project feasible was that despite the rancor of the online discussion, there actually did seem to be some degree of consensus about a model based on historical work done 20 years earlier. Rich Pyle had laid out a diagram of a model that we were discussing and Greg Whitbread noted that it was quite similar to the Association of Systematics Collections (ASC) model of 1993. All Cam and I really had to do was to create object properties to connect all of the nodes on Rich's diagram. We worked on it for a couple of weeks and the first draft of Darwin-SW (DSW) was done!<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEifHO_CMNAzkC7k1hqKBV975zWra8RsYuiFLBR9-LiewC7etPywODqRNTnn0TpJz4yo4PDL6JADFmZVlvljfW4x55jDrqX3ppn1_TnIQMXgI-403T5EQmbEa9s8UicvoNLjw648tZBdlh8/s1600/acs-diagram.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="856" data-original-width="1114" height="490" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEifHO_CMNAzkC7k1hqKBV975zWra8RsYuiFLBR9-LiewC7etPywODqRNTnn0TpJz4yo4PDL6JADFmZVlvljfW4x55jDrqX3ppn1_TnIQMXgI-403T5EQmbEa9s8UicvoNLjw648tZBdlh8/s640/acs-diagram.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
The diagram above shows the DSW graph model overlaid upon the ASC entity-relation (ER) diagram. I realize that it's impossible to see the details in this image, but you can download a poster-sized PowerPoint diagram from <a href="https://github.com/darwin-sw/dsw/raw/master/img/acs-dsw-poster-colorchange.pptx" target="_blank">this link</a> to examine them.<br />
<br />
DSW differs a little from the ASC model in that it includes two Darwin Core classes (dwc:Organism and dwc:Occurrence) that weren't dealt with in the ASC model. Since the ASC model dealt only with museum specimens, it did not include the classes of Darwin Core that were developed later to deal with repeated records of the same organism, or records documented by forms of evidence other than specimens (i.e. human and machine observations, media, living specimens, etc.). But other than that, the DSW model is just a simplified version of the ASC model.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://raw.githubusercontent.com/darwin-sw/dsw/master/img/dsw-1-0-graph-model.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="703" data-original-width="800" height="562" src="https://raw.githubusercontent.com/darwin-sw/dsw/master/img/dsw-1-0-graph-model.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
The diagram above shows the core of the DSW graph model (available <a href="https://github.com/darwin-sw/dsw/raw/master/img/dsw-1.0-graph-model.pptx" target="_blank">poster-sized here</a> if you have trouble seeing the details). The six red bubbles are the six major classes defined by Darwin Core. The yellow bubble is FOAF's Agent class, which can be linked to DwC classes by two terms from the <span style="font-family: "courier new" , "courier" , monospace;">dwciri:</span> namespace. The object of <span style="font-family: "courier new" , "courier" , monospace;">dwc:eventDate</span> is a literal, and <span style="font-family: "courier new" , "courier" , monospace;">dwciri:toTaxon</span> links to some yet-to-be-fully-described taxon-like entity that will hopefully be fleshed out by a successor to the <a href="https://www.tdwg.org/standards/tcs/" target="_blank">Taxon Concept Transfer Schema</a> (TCS) standard, but whose place is currently being held by the <span style="font-family: "courier new" , "courier" , monospace;">dwc:Taxon</span> class. The seven object properties printed in blue are DSW's attempt to fill in the object properties that are missing from the Darwin Core standard. </div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
The blue bubble, <span style="font-family: "courier new" , "courier" , monospace;">dsw:Token</span>, is one of the few classes that we defined in DSW instead of borrowing from elsewhere. We probably should have called it <span style="font-family: "courier new" , "courier" , monospace;">dsw:Evidence</span>, because "evidence" is what it represents, but too late now. I will talk more about the Token class in the next section. </div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<h3 style="clear: both; text-align: left;">
What's an Occurrence???</h3>
<div class="separator" style="clear: both; text-align: left;">
One of the longstanding and vexing questions of users of Darwin Core is "what the heck is an occurrence?" The origin of <span style="font-family: "courier new" , "courier" , monospace;">dwc:Occurrence</span> predates my involvement with TDWG, but I believe that its creation was to solve the problem of overlap of terms that applied to both observations and preserved specimens. For example, you could have terms called <span style="font-family: "courier new" , "courier" , monospace;">dwc:observer</span> and <span style="font-family: "courier new" , "courier" , monospace;">dwc:collector</span>, with observer being used with observations and collector being used with specimens. Similarly, you could have <span style="font-family: "courier new" , "courier" , monospace;">dwc:observationRemarks</span> for observations and <span style="font-family: "courier new" , "courier" , monospace;">dwc:collectionRemarks</span> for specimens. But fundamentally, both an observer and a collector are creating a record that an organism was at some place at some time, so why have two different terms for them? Why have two separate remarks term when one would do? So the <span style="font-family: "courier new" , "courier" , monospace;">dwc:Occurrence</span> class was created as an artificial class to organize terms that applied to both specimens and observations (like the two terms <span style="font-family: "courier new" , "courier" , monospace;">dwc:recordedBy</span> and <span style="font-family: "courier new" , "courier" , monospace;">dwc:occurrenceRemark</span> that replace the four terms above). Any terms that applied to only specimens (like <span style="font-family: "courier new" , "courier" , monospace;">dwc:preparations</span> and <span style="font-family: "courier new" , "courier" , monospace;">dwc:disposition</span>) were thrown in the Occurrence group as well. </div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
So for some time, <span style="font-family: "courier new" , "courier" , monospace;">dwc:Occurrence</span> was considered by many to be a sort of superclass for both specimens and observations. However, its definition was pretty murky and a bit circular. Prior to our clarification of class definitions in October 2014, the definition was "The category of information pertaining to evidence of an occurrence in nature, in a collection, or in a dataset (specimen, observation, etc.)." After the class definition cleanup, it was "An existence of an Organism (sensu http://rs.tdwg.org/dwc/terms/Organism) at a particular place at a particular time." That's still a bit obtuse, but appropriate for an artificial class whose instances document that an organism was at a certain place at a certain time. </div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
What DSW does is to clearly separate the artificial Occurrence class from the actual resources that serve to document that the organism occurred. The <span style="font-family: "courier new" , "courier" , monospace;">dsw:Token</span> class is a superclass for any kind of resource that can serve as evidence for the Occurrence. The class name, Token, comes from the fact that the evidence also has a <span style="font-family: "courier new" , "courier" , monospace;">dsw:derivedFrom</span> relationship with the organism that was documented -- it's a kind of token that represents the organism. There is no particular limit to what type of thing can be a token; it can be a preserved specimen, living specimen, image, machine record, DNA sequence, or any other kind of thing that can serve as evidence for an occurrence and is derived in some way from the documented organism. The properties of Tokens are any properties appropriate for any class of evidence: <span style="font-family: "courier new" , "courier" , monospace;">dwc:preparation</span> for preserved specimens, <span style="font-family: "courier new" , "courier" , monospace;">ac:caption</span> for images, etc.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<h2 style="clear: both; text-align: left;">
Investigating ABCD</h2>
<div class="separator" style="clear: both; text-align: left;">
I mentioned that recently I gained access to some relatively new Linked Data tools for investigating ABCD. One that I'm really excited about is a <a href="https://wiki.bgbm.org/bdidata/index.php/BDI_Data:Main_Page" target="_blank">Wikibase instance that is loaded with the ABCD terminology data</a>. If you've read any of my <a href="http://baskauf.blogspot.com/2019/06/putting-data-into-wikidata-using.html" target="_blank">recent blog posts</a>, you'll know that I'm very interested in learning how Wikibase can be used as a way to manage Linked Data. So I was really excited both to see how the ABCD team had fit the ABCD model into the Wikibase model and also to be able to use the built-in <a href="https://wiki.bgbm.org/bdidata/query/" target="_blank">Query service</a> to explore the ABCD model. </div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
The other useful thing that I just recently discovered is an <a href="https://github.com/tdwg/abcd/blob/master/ontology/abcd_concepts.owl" target="_blank">ABCD OWL ontology document</a> in RDF/XML serialization. It was loaded into the ABCD GitHub repo only a few days ago, so I'm excited to be able to use it as a point of comparison with the Wikibase data. I've loaded the ontology triples into the Vanderbilt Libraries' triplestore as the named graph <span style="font-family: "courier new" , "courier" , monospace;">http://rs.tdwg.org/abcd/terms/</span> so that I can query it using <a href="https://sparql.vanderbilt.edu/" target="_blank">the SPARQL endpoint</a>. In most of the comparisons that I've done, the results from the OWL document and the Wikibase data are identical. (As I noted in the "Time for a Snak" section of <a href="http://baskauf.blogspot.com/2019/06/putting-data-into-wikidata-using.html" target="_blank">my previous post</a>, the Wikibase data model differs significantly from the standard RDFS model of class, range, domain, etc. So querying the two data sources requires some significant adjustments in the actual queries used in order to fit the model that the data are encoded in.)</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
One caveat is that <a href="https://abcd.biowikifarm.net/wiki/Main_Page" target="_blank">ABCD 3.0</a> is currently under development and the Wikibase installation is clearly marked as "experimental". So I'm assuming that the data both there and in the ontology are subject to change. Nevertheless, both of these data sources have given me a much better understanding of how ABCD models the biodiversity universe.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<h3 style="clear: both; text-align: left;">
Term types</h3>
<div class="separator" style="clear: both; text-align: left;">
The <a href="https://wiki.bgbm.org/bdidata/index.php/BDI_Data:Main_Page" target="_blank">Main Page</a> of the Wikibase installation gives a good explanation of the types of terms included in its dataset. In the description, they use the word "concept", but I prefer to restrict the use of the word "concept" to what I consider to be its standard use: for controlled vocabulary terms. (See the "For further reference" section for more on this.) So to translate their "types" list, I would say they describe one type of vocabulary (Controlled Vocabulary Q14) and four types of terms: Class Q32, Object Property Q33, Datatype Property Q34, and Controlled Term (i.e. concept) Q16. </div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
For comparison purposes, the class and property terms in the <a href="https://github.com/tdwg/abcd/blob/master/ontology/abcd_concepts.owl" target="_blank"><span style="font-family: "courier new" , "courier" , monospace;"><span id="goog_116719628"></span>abcd_concepts.owl</span> OWL ontology</a><span id="goog_116719629"></span><span id="goog_116719631"></span><a href="https://www.blogger.com/"></a><span id="goog_116719632"></span> are typed as: <span style="font-family: "courier new" , "courier" , monospace;">owl:Class</span>, <span style="font-family: "courier new" , "courier" , monospace;">owl:ObjectProperty</span>, and <span style="font-family: "courier new" , "courier" , monospace;">owl:DatatypeProperty</span>. The controlled vocabularies are typed as <span style="font-family: "courier new" , "courier" , monospace;">owl:Class</span> rather than <span style="font-family: "courier new" , "courier" , monospace;">skos:ConceptScheme</span>, so subsequently the controlled vocabulary terms are typed as instances of the classes that correspond to their containing controlled vocabularies (e.g. <span style="font-family: "courier new" , "courier" , monospace;">abcd:Female rdf:type abcd:Sex</span>), rather than as <span style="font-family: "courier new" , "courier" , monospace;">skos:Concept</span>. It's a valid modeling choice, but isn't according to the recommendations of the TDWG Standards Documentation Specification. (More details about this later in the " The place of controlled vocabularies in the model" section.)</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
The query service makes it easy to discover what properties have actually been used with each type of term. Here is an example for Classes:</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;">PREFIX bwd: <http://wiki.bgbm.org/entity/></span></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;">PREFIX bwdt: <http://wiki.bgbm.org/prop/direct/></span></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;">SELECT DISTINCT ?predicate ?label WHERE {</span></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;"> ?concept bwdt:P8 bwd:Q219.</span></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;"> ?concept bwdt:P9 bwd:Q32.</span></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;"> ?concept bwdt:P25 ?name.</span></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;"> ?concept ?predicate ?value.</span></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;">OPTIONAL {</span></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;"> ?genericProp wikibase:directClaim ?predicate.</span></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;"> ?genericProp rdfs:label ?label.</span></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;"> }</span></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;">MINUS {</span></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;"> ?otherGenericProp wikibase:claim ?predicate.</span></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;"> }</span></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;">}</span></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;">ORDER BY ?predicate</span></div>
<div class="separator" style="clear: both;">
<br /></div>
<div class="separator" style="clear: both;">
This query is complicated a bit by the somewhat complex way that Wikibase handles properties and their labels (see <a href="https://heardlibrary.github.io/digital-scholarship/lod/wikibase/#references" target="_blank">this</a> for details), but you can see that it works by going to <a href="https://wiki.bgbm.org/bdidata/query/" target="_blank">https://wiki.bgbm.org/bdidata/query/</a> and pasting the query into the box. </div>
<div class="separator" style="clear: both;">
<br /></div>
<div class="separator" style="clear: both;">
One of the cool things that the Wikibase Query service allows you to do is copy the link from the browser URL bar and the link contains the query itself as part of the URL. This means that you can link directly to the query so that when you click on <a href="https://wiki.bgbm.org/bdidata/query/#PREFIX%20bwd%3A%20%3Chttp%3A%2F%2Fwiki.bgbm.org%2Fentity%2F%3E%0APREFIX%20bwdt%3A%20%3Chttp%3A%2F%2Fwiki.bgbm.org%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20DISTINCT%20%3Fpredicate%20%3Flabel%20%20WHERE%20%7B%0A%20%20%3Fconcept%20bwdt%3AP8%20bwd%3AQ219.%0A%20%20%3Fconcept%20bwdt%3AP9%20bwd%3AQ32.%0A%20%20%3Fconcept%20bwdt%3AP25%20%3Fname.%0A%0A%20%20%3Fconcept%20%3Fpredicate%20%3Fvalue.%0AOPTIONAL%20%7B%0A%20%20%3FgenericProp%20wikibase%3AdirectClaim%20%3Fpredicate.%0A%20%20%3FgenericProp%20rdfs%3Alabel%20%3Flabel.%0A%20%20%7D%0AMINUS%20%7B%0A%20%20%3FotherGenericProp%20wikibase%3Aclaim%20%3Fpredicate.%0A%20%20%7D%0A%7D%0AORDER%20BY%20%3Fpredicate%0A%0A" target="_blank">the link</a>, the query will load itself into the Query Service GUI box. So to avoid cluttering up this post with cut and paste queries, I'll just link the queries like this: <a href="https://wiki.bgbm.org/bdidata/query/#PREFIX%20bwd%3A%20%3Chttp%3A%2F%2Fwiki.bgbm.org%2Fentity%2F%3E%0APREFIX%20bwdt%3A%20%3Chttp%3A%2F%2Fwiki.bgbm.org%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20DISTINCT%20%3Fpredicate%20%3Flabel%20%20WHERE%20%7B%0A%20%20%3Fconcept%20bwdt%3AP8%20bwd%3AQ219.%0A%20%20%3Fconcept%20bwdt%3AP9%20bwd%3AQ33.%0A%20%20%3Fconcept%20bwdt%3AP25%20%3Fname.%0A%0A%20%20%3Fconcept%20%3Fpredicate%20%3Fvalue.%0AOPTIONAL%20%7B%0A%20%20%3FgenericProp%20wikibase%3AdirectClaim%20%3Fpredicate.%0A%20%20%3FgenericProp%20rdfs%3Alabel%20%3Flabel.%0A%20%20%7D%0AMINUS%20%7B%0A%20%20%3FotherGenericProp%20wikibase%3Aclaim%20%3Fpredicate.%0A%20%20%7D%0A%7D%0AORDER%20BY%20%3Fpredicate%0A%0A" target="_blank">properties used with object properties</a>, <a href="https://wiki.bgbm.org/bdidata/query/#PREFIX%20bwd%3A%20%3Chttp%3A%2F%2Fwiki.bgbm.org%2Fentity%2F%3E%0APREFIX%20bwdt%3A%20%3Chttp%3A%2F%2Fwiki.bgbm.org%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20DISTINCT%20%3Fpredicate%20%3Flabel%20%20WHERE%20%7B%0A%20%20%3Fconcept%20bwdt%3AP8%20bwd%3AQ219.%0A%20%20%3Fconcept%20bwdt%3AP9%20bwd%3AQ34.%0A%20%20%3Fconcept%20bwdt%3AP25%20%3Fname.%0A%0A%20%20%3Fconcept%20%3Fpredicate%20%3Fvalue.%0AOPTIONAL%20%7B%0A%20%20%3FgenericProp%20wikibase%3AdirectClaim%20%3Fpredicate.%0A%20%20%3FgenericProp%20rdfs%3Alabel%20%3Flabel.%0A%20%20%7D%0AMINUS%20%7B%0A%20%20%3FotherGenericProp%20wikibase%3Aclaim%20%3Fpredicate.%0A%20%20%7D%0A%7D%0AORDER%20BY%20%3Fpredicate%0A%0A" target="_blank">datatype properties</a>, <a href="https://wiki.bgbm.org/bdidata/query/#PREFIX%20bwd%3A%20%3Chttp%3A%2F%2Fwiki.bgbm.org%2Fentity%2F%3E%0APREFIX%20bwdt%3A%20%3Chttp%3A%2F%2Fwiki.bgbm.org%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20DISTINCT%20%3Fpredicate%20%3Flabel%20%20WHERE%20%7B%0A%20%20%3Fconcept%20bwdt%3AP8%20bwd%3AQ219.%0A%20%20%3Fconcept%20bwdt%3AP9%20bwd%3AQ16.%0A%20%20%3Fconcept%20bwdt%3AP25%20%3Fname.%0A%0A%20%20%3Fconcept%20%3Fpredicate%20%3Fvalue.%0AOPTIONAL%20%7B%0A%20%20%3FgenericProp%20wikibase%3AdirectClaim%20%3Fpredicate.%0A%20%20%3FgenericProp%20rdfs%3Alabel%20%3Flabel.%0A%20%20%7D%0AMINUS%20%7B%0A%20%20%3FotherGenericProp%20wikibase%3Aclaim%20%3Fpredicate.%0A%20%20%7D%0A%7D%0AORDER%20BY%20%3Fpredicate%0A%0A" target="_blank">controlled terms</a>, and <a 
href="https://wiki.bgbm.org/bdidata/query/#PREFIX%20bwd%3A%20%3Chttp%3A%2F%2Fwiki.bgbm.org%2Fentity%2F%3E%0APREFIX%20bwdt%3A%20%3Chttp%3A%2F%2Fwiki.bgbm.org%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20DISTINCT%20%3Fpredicate%20%3Flabel%20%20WHERE%20%7B%0A%20%20%3Fconcept%20bwdt%3AP8%20bwd%3AQ219.%0A%20%20%3Fconcept%20bwdt%3AP9%20bwd%3AQ14.%0A%20%20%3Fconcept%20bwdt%3AP25%20%3Fname.%0A%0A%20%20%3Fconcept%20%3Fpredicate%20%3Fvalue.%0AOPTIONAL%20%7B%0A%20%20%3FgenericProp%20wikibase%3AdirectClaim%20%3Fpredicate.%0A%20%20%3FgenericProp%20rdfs%3Alabel%20%3Flabel.%0A%20%20%7D%0AMINUS%20%7B%0A%20%20%3FotherGenericProp%20wikibase%3Aclaim%20%3Fpredicate.%0A%20%20%7D%0A%7D%0AORDER%20BY%20%3Fpredicate%0A%0A" target="_blank">controlled vocabularies</a>. </div>
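<div class="separator" style="clear: both;">
<br /></div>
<div class="separator" style="clear: both;">
If you want to construct one of those links yourself, a minimal sketch (using only the Python standard library) is to URL-encode the query text and append it after the "#" of the query service URL. The query below is an abbreviated version of the class query shown earlier.</div>
<div class="separator" style="clear: both;">
<br /></div>
<pre style="font-family: 'courier new', courier, monospace;">
# Minimal sketch: build a Wikibase Query Service link that carries the SPARQL
# query itself after the "#", like the links above.
from urllib.parse import quote

query = """PREFIX bwd: <http://wiki.bgbm.org/entity/>
PREFIX bwdt: <http://wiki.bgbm.org/prop/direct/>

SELECT DISTINCT ?predicate WHERE {
  ?concept bwdt:P8 bwd:Q219.
  ?concept bwdt:P9 bwd:Q32.
  ?concept bwdt:P25 ?name.
  ?concept ?predicate ?value.
}
ORDER BY ?predicate"""

link = 'https://wiki.bgbm.org/bdidata/query/#' + quote(query, safe='')
print(link)
</pre>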
<div class="separator" style="clear: both;">
<br /></div>
<div class="separator" style="clear: both;">
If you run each of the queries, you'll see that the properties used to describe the various term and vocabulary types are similar to the table shown at the bottom of the <a href="https://wiki.bgbm.org/bdidata/index.php/BDI_Data:Main_Page" target="_blank">Main Page</a>.</div>
<div class="separator" style="clear: both;">
<br /></div>
<h3 style="clear: both;">
Classes</h3>
<div class="separator" style="clear: both;">
One of the things I was interested in finding out about were the classes that were included in ABCD. <a href="https://wiki.bgbm.org/bdidata/query/#PREFIX%20bwd%3A%20%3Chttp%3A%2F%2Fwiki.bgbm.org%2Fentity%2F%3E%0APREFIX%20bwdt%3A%20%3Chttp%3A%2F%2Fwiki.bgbm.org%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20DISTINCT%20%3FwikibaseId%20%3Firi%20%3Frdfs_label%20%3Frdfs_comment%20%3FgroupLabel%20WHERE%20%7B%0A%20%20%3FwikibaseId%20bwdt%3AP8%20bwd%3AQ219.%0A%20%20%3FwikibaseId%20bwdt%3AP9%20bwd%3AQ32.%20%23restrict%20to%20Classes%0A%20%20%3FwikibaseId%20bwdt%3AP25%20%3FlocalName.%0A%20%20BIND%28CONCAT%28%22http%3A%2F%2Frs.tdwg.org%2Fabcd%2Fterms%2F%22%2C%3FlocalName%29%20AS%20%3Firi%29%0A%20%20%3FwikibaseId%20rdfs%3Alabel%20%3Frdfs_label.%0A%20%20%3FwikibaseId%20schema%3Adescription%20%3Frdfs_comment.%0A%20%20%3FwikibaseId%20bwdt%3AP48%20%3FconceptGroup.%0A%20%20%3FconceptGroup%20rdfs%3Alabel%20%3FgroupLabel.%0A%7D%0AORDER%20BY%20%3FgroupLabel%0A" target="_blank">This query</a> will create a table of all of the classes in ABCD 3.0 along with basic information about them. One thing that is very clear from running the query is that ABCD has a LOT more classes (57) than DwC (15). Fortunately, the classes are grouped into categories based on the core classes they are associated with. This was really helpful for me because it made it obvious to me that Gathering, Unit, and Identification were key classes in the model. The Identification class was basically the same as the <span style="font-family: "courier new" , "courier" , monospace;">dwc:identification</span> class of Darwin Core. The Gathering class, defined as "A class to describe a collection or observation event." seems to be more or less synonymous to the <span style="font-family: "courier new" , "courier" , monospace;">dwc:Event</span> class. The Unit class, defined as "A class to join all data referring to a unit such as specimen or observation record" is almost exactly how I described the <span style="font-family: "courier new" , "courier" , monospace;">dwc:Occurrence</span> class: an artificial class that's used to group properties that are common to specimens and observations. </div>
<div class="separator" style="clear: both;">
<br /></div>
<h3 style="clear: both;">
Object properties</h3>
<div class="separator" style="clear: both;">
Another key thing that I wanted to know was how the ABCD 3.0 graph model compared with the DSW graph model. In order to do that, I needed to study the object properties and find out how they connected instances of classes. </div>
<div class="separator" style="clear: both;">
<br /></div>
<div class="separator" style="clear: both;">
As we can see from the table of term properties on the Main Page, object properties are required to have a defined range. They are not required to have a domain. Cam and I got a lot of flak when we assigned ranges and domains to object properties in DSW because of the way ranges and domains can generate unintended entailments. There is a common misconception that assigning a range to an object property REQUIRES the object to be an instance of that class. Actually, what it does is entail that the object IS an instance of that class, whether that makes sense or not. We were OK with assigning ranges and domains in DSW because we didn't want people to use the DSW object properties to link class instances other than those that we specified in our model - if people ignored our guidance, then they got unintended entailments. In ABCD the object properties all have names like "hasX", so if the object of a triple using the property isn't an instance of class "X", it's pretty silly to use that property. So here it makes some sense to assign ranges. Perhaps wisely, few of the ABCD object properties have the optional domain declaration. That allows those properties to be used with subject resources other than the types originally envisioned, without entailing anything silly. </div>
<div class="separator" style="clear: both;">
<br /></div>
<div class="separator" style="clear: both;">
Instead of assigning domains, ABCD uses the property <span style="font-family: "courier new" , "courier" , monospace;">abcd:associatedWithClass</span> to indicate the class or classes whose instances you'd expect to have that property. <a href="https://wiki.bgbm.org/bdidata/query/#PREFIX%20bwd%3A%20%3Chttp%3A%2F%2Fwiki.bgbm.org%2Fentity%2F%3E%0APREFIX%20bwdt%3A%20%3Chttp%3A%2F%2Fwiki.bgbm.org%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20DISTINCT%20%3FwikibaseId%20%3Firi%20%3FrangeName%20%3FassociatedWithClass%20%3FdomainName%20WHERE%20%7B%0A%20%20%3FwikibaseId%20bwdt%3AP8%20bwd%3AQ219.%0A%20%20%3FwikibaseId%20bwdt%3AP9%20bwd%3AQ33.%20%23restrict%20to%20Object%20Properties%0A%20%20%3FwikibaseId%20bwdt%3AP25%20%3FlocalName.%0A%20%20BIND%28CONCAT%28%22http%3A%2F%2Frs.tdwg.org%2Fabcd%2Fterms%2F%22%2C%3FlocalName%29%20AS%20%3Firi%29%0AOPTIONAL%20%7B%0A%20%20%3FwikibaseId%20bwdt%3AP13%20%3Frange.%0A%20%20%3Frange%20rdfs%3Alabel%20%3FrangeName.%0A%20%20%7D%0AOPTIONAL%20%7B%0A%20%20%3FwikibaseId%20bwdt%3AP29%20%3Fdomain.%0A%20%20%3Fdomain%20rdfs%3Alabel%20%3FdomainName.%0A%20%20%7D%0AOPTIONAL%20%7B%0A%20%20%3Fclass%20bwdt%3AP45%20%3FwikibaseId.%0A%20%20%3Fclass%20rdfs%3Alabel%20%3FassociatedWithClass.%0A%20%20%7D%0A%7D%0AORDER%20BY%20%3FassociatedWithClass" target="_blank">Here's a query</a> that lists all of the object properties, their ranges, and the subject class with which they are associated. The query shows that there is a much larger number of link types (135) than DSW has. That's to be expected since there are a lot more classes. The actual number of ABCD object properties (88) is smaller than the number of link types because some of the object properties are used to link more than one combination of class instances. </div>
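<div class="separator" style="clear: both;">
<br /></div>
<div class="separator" style="clear: both;">
As with the classes query, the query behind that link can be decoded and run from a script. Here it is as a Python string that could be passed to the <span style="font-family: "courier new" , "courier" , monospace;">run_query()</span> helper sketched in the Classes section above (that helper, and its endpoint URL, remain assumptions):</div>
<div class="separator" style="clear: both;">
<br /></div>
<pre style="font-family: monospace; overflow: auto;">
# The object-property query, decoded from the link above. In this Wikibase
# installation P13 holds the range, P29 the (optional) domain, and P45 appears
# to link a class to the properties associated with it.
OBJECT_PROPERTY_QUERY = '''
PREFIX bwd: &lt;http://wiki.bgbm.org/entity/&gt;
PREFIX bwdt: &lt;http://wiki.bgbm.org/prop/direct/&gt;

SELECT DISTINCT ?wikibaseId ?iri ?rangeName ?associatedWithClass ?domainName WHERE {
  ?wikibaseId bwdt:P8 bwd:Q219.
  ?wikibaseId bwdt:P9 bwd:Q33. #restrict to Object Properties
  ?wikibaseId bwdt:P25 ?localName.
  BIND(CONCAT("http://rs.tdwg.org/abcd/terms/",?localName) AS ?iri)
  OPTIONAL {
    ?wikibaseId bwdt:P13 ?range.
    ?range rdfs:label ?rangeName.
  }
  OPTIONAL {
    ?wikibaseId bwdt:P29 ?domain.
    ?domain rdfs:label ?domainName.
  }
  OPTIONAL {
    ?class bwdt:P45 ?wikibaseId.
    ?class rdfs:label ?associatedWithClass.
  }
}
ORDER BY ?associatedWithClass
'''

rows = run_query(OBJECT_PROPERTY_QUERY)  # run_query() as sketched in the Classes section
</pre>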
<div class="separator" style="clear: both;">
<br /></div>
<h2 style="clear: both;">
Comparison of the DSW and ABCD graph model</h2>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEik0tYtyYPzUi__wiJ8HuP-01oBylzG_hO2It9Am0URj6LX_EJ7VZ3p2KVNvy2aYGzQ7URIQaRFB02b2jb0cGrghNDJYT1dV8TGVIDwTielVevCV6Jr5K5XWN43PP7cSyg3WeheFS2msm8/s1600/abcd-graph-model.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="784" data-original-width="1097" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEik0tYtyYPzUi__wiJ8HuP-01oBylzG_hO2It9Am0URj6LX_EJ7VZ3p2KVNvy2aYGzQ7URIQaRFB02b2jb0cGrghNDJYT1dV8TGVIDwTielVevCV6Jr5K5XWN43PP7cSyg3WeheFS2msm8/s1600/abcd-graph-model.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Color coding described in text</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both;">
<br /></div>
<div class="separator" style="clear: both;">
I went through the rather labor-intensive process of creating a PowerPoint diagram (above) that overlays part of the ABCD graph model on top of the DSW graph diagram that I showed previously. (There are other ABCD classes that I didn't include because the diagram was too crowded and I was getting tired.) Although ABCD has a whole bunch of extra classes that don't correspond to DwC classes, the main DwC classes have ABCD analogs that are connected in a very similar manner to the way they are connected in DSW. The resemblance is actually rather striking. </div>
<div class="separator" style="clear: both;">
<br /></div>
<div class="separator" style="clear: both;">
Here are a few notes about the diagram. First of all, it isn't surprising that ABCD doesn't have an Organism class that corresponds to <span style="font-family: "courier new" , "courier" , monospace;">dwc:Organism</span>. As its name indicates, "Access to Biological Collections Data" is focused primarily on data from collections. As I learned from the fight to get <span style="font-family: "courier new" , "courier" , monospace;">dwc:Organism</span> added to Darwin Core, collections people don't care much about repeated observations. They generally only sample an organism once since they usually kill it in the process. So they rarely have to deal with multiple occurrences linked to the same organism. However, people who track live whales or band birds care about the <span style="font-family: "courier new" , "courier" , monospace;">dwc:Organism</span> class a lot since its primary purpose is to enable one-to-many relationships between organisms and occurrences (as opposed to having the purpose of creating some kind of semantic model of organisms). </div>
<div class="separator" style="clear: both;">
<br /></div>
<div class="separator" style="clear: both;">
Another obvious difference is the absence of any Location class that's separate from <span style="font-family: "courier new" , "courier" , monospace;">abcd:Gathering</span>. Another common theme in discussing a model for Darwin Core was whether there was any need to have a <span style="font-family: "courier new" , "courier" , monospace;">dwc:Event</span> class in addition to the <span style="font-family: "courier new" , "courier" , monospace;">dcterms:Location</span> class, or if we could just denormalize it out of existence. In that case, the disagreement was between collections people (who often only collect at a particular location once) and people who conducted long-term monitoring of sites (who therefore had many sampling Events at one Location). </div>
<div class="separator" style="clear: both;">
<br /></div>
<div class="separator" style="clear: both;">
The general theme here is that people who don't have one-to-many (or many-to-many) relationships between classes don't see the need for the extra classes and omit them from their graph model. But the more diverse the kinds of datasets we want to handle with the model, the more complicated the core graph model needs to be. </div>
<div class="separator" style="clear: both;">
<br /></div>
<div class="separator" style="clear: both;">
The other thing that surprised me a little in the ABCD graph model was that the "Unit" was connected to the "Gathering Agent" through an instance of <span style="font-family: "courier new" , "courier" , monospace;">abcd:FieldNumber</span>, instead of being connected directly as <span style="font-family: "courier new" , "courier" , monospace;">dwciri:recordedBy</span> does. I guess that makes sense if there's a one-to-many relationship between the Unit and the FieldNumber (several Gathering Agents assign their own FieldNumber to the Unit). There are some parallels with <span style="font-family: "courier new" , "courier" , monospace;">dwciri:fieldNumber</span>, although it is defined to have a subject that is field notes and an object that is a <span style="font-family: "courier new" , "courier" , monospace;">dwc:Event</span> (see Table 3.7 in the <a href="https://dwc.tdwg.org/rdf/" target="_blank">DwC RDF Guide</a>). Clearly there would be some work required to harmonize DwC and ABCD in this area.</div>
<div class="separator" style="clear: both;">
<br /></div>
<div class="separator" style="clear: both;">
The other part of the two graph models I want to draw attention to is the area of <span style="font-family: "courier new" , "courier" , monospace;">dsw:Token</span>. </div>
<div class="separator" style="clear: both;">
<br /></div>
<div class="separator" style="clear: both;">
There are two different ways of imagining the <span style="font-family: "courier new" , "courier" , monospace;">dsw:Token</span> class. One way is to say that <span style="font-family: "courier new" , "courier" , monospace;">dsw:Token</span> is a class that includes every kind of evidence. In that view, we enumerate the token classes we can think of, then define them using the properties associated with those kinds of evidence. The other way to think about it is to say that all of the properties that we can't join together under the banner of <span style="font-family: "courier new" , "courier" , monospace;">dwc:Occurrence</span> get grouped under an appropriate kind of token. In that view, our job is to sort properties, and we then name the token classes as a way to group the sorted properties. These are really just two different ways of describing the same thing. </div>
<div class="separator" style="clear: both;">
<br /></div>
<div class="separator" style="clear: both;">
The ABCD analog of the <span style="font-family: "courier new" , "courier" , monospace;">dsw:Token</span> class is the class <span style="font-family: "courier new" , "courier" , monospace;">abcd:TypeSpecificInformation</span>. Its definition is: "A super class to create and link to type specific information about a unit." Recall that the definition of a Unit is "A class to join all data referring to a unit such as specimen or observation record". These definitions correspond to the "sorting out of properties" view I described above. Properties common to all kinds of evidence are organized together under the Unit class, but properties that are not common get sorted out into the appropriate specific subclass of <span style="font-family: "courier new" , "courier" , monospace;">abcd:TypeSpecificInformation</span>. </div>
<div class="separator" style="clear: both;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqudCQNeaAf4fViHtvsB2-PyKTsLEXlNpQHnI12qyEUdmm5YwaiT7UZrhkNas4IQvd6BhJrdzWg12IG5SBCnGNn3w87uVv9b02EkwUdPMCw5uIucAnUwNcPOWafHIp1VEK-Q-yw-Hi8zg/s1600/abcd-subclass.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="391" data-original-width="1093" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqudCQNeaAf4fViHtvsB2-PyKTsLEXlNpQHnI12qyEUdmm5YwaiT7UZrhkNas4IQvd6BhJrdzWg12IG5SBCnGNn3w87uVv9b02EkwUdPMCw5uIucAnUwNcPOWafHIp1VEK-Q-yw-Hi8zg/s1600/abcd-subclass.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">ABCD class hierarchy</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both;">
The diagram above shows the "enumeration of types of evidence" view. In the diagram, you can see most of the imaginable kinds of specific evidence types listed as subclasses of <span style="font-family: "courier new" , "courier" , monospace;">abcd:TypeSpecificInformation</span>. These subclasses correspond with some of the possible DwC classes that could serve as Tokens: <span style="font-family: "courier new" , "courier" , monospace;">abcd:HerbariumUnit</span> corresponds to <span style="font-family: "courier new" , "courier" , monospace;">dwc:PreservedSpecimen</span>, <span style="font-family: "courier new" , "courier" , monospace;">abcd:BotanicalGardenUnit</span> corresponds to <span style="font-family: "courier new" , "courier" , monospace;">dwc:LivingSpecimen</span>, <span style="font-family: "courier new" , "courier" , monospace;">abcd:ObservationUnit</span> corresponds to <span style="font-family: "courier new" , "courier" , monospace;">dwc:HumanObservation</span>, etc. </div>
<div class="separator" style="clear: both;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgi08hXz1uiE31vuSwDo85Hwn1U1edudAibfhaQf8cRIYm-C7_DWCCm6HHRqOtd4v0akLxF_W64o2g7U0qNIJksbF6r-e-Q_ycDqcjpgK6r2KS_cHM8rsZlXsuVFLtcz5V06HZUYpfZG14/s1600/abcd-subclass-links.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="531" data-original-width="1143" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgi08hXz1uiE31vuSwDo85Hwn1U1edudAibfhaQf8cRIYm-C7_DWCCm6HHRqOtd4v0akLxF_W64o2g7U0qNIJksbF6r-e-Q_ycDqcjpgK6r2KS_cHM8rsZlXsuVFLtcz5V06HZUYpfZG14/s1600/abcd-subclass-links.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Object properties linking abcd:Unit instances and instances of subclasses of <span style="font-family: "courier new" , "courier" , monospace;">abcd:TypeSpecificInformation</span></td></tr>
</tbody></table>
<div class="separator" style="clear: both;">
<br /></div>
<div class="separator" style="clear: both;">
In the same way that DSW uses the object property <span style="font-family: "courier new" , "courier" , monospace;">dsw:evidenceFor</span> to link Tokens and Occurrences, ABCD uses the object property <span style="font-family: "courier new" , "courier" , monospace;">abcd:hasTypeSpecificInformation</span> to link <span style="font-family: "courier new" , "courier" , monospace;">abcd:TypeSpecificInformation</span> instances to Units. In addition, ABCD defines separate object properties that link an <span style="font-family: "courier new" , "courier" , monospace;">abcd:Unit</span> to instances of each subclass of <span style="font-family: "courier new" , "courier" , monospace;">abcd:TypeSpecificInformation</span>. To find all of those properties, I ran <a href="https://wiki.bgbm.org/bdidata/query/#PREFIX%20bwd%3A%20%3Chttp%3A%2F%2Fwiki.bgbm.org%2Fentity%2F%3E%0APREFIX%20bwdt%3A%20%3Chttp%3A%2F%2Fwiki.bgbm.org%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20DISTINCT%20%3FwikibaseId%20%3FclassIri%20%3Fproperty%20%3FpropIri%20WHERE%20%7B%0A%20%20%3FwikibaseId%20bwdt%3AP1%20bwd%3AQ2025.%0A%20%20%3Fproperty%20bwdt%3AP13%20%3FwikibaseId.%0A%20%20%3Fproperty%20bwdt%3AP46%20bwd%3AQ1762.%0A%20%20%3Fproperty%20rdfs%3Alabel%20%3Flabel.%0A%20%20%3FwikibaseId%20bwdt%3AP25%20%3FclassLocalName.%0A%20%20BIND%28CONCAT%28%22http%3A%2F%2Frs.tdwg.org%2Fabcd%2Fterms%2F%22%2C%3FclassLocalName%29%20AS%20%3FclassIri%29%0A%20%20%3Fproperty%20bwdt%3AP25%20%3FpropLocalName.%0A%20%20BIND%28CONCAT%28%22http%3A%2F%2Frs.tdwg.org%2Fabcd%2Fterms%2F%22%2C%3FpropLocalName%29%20AS%20%3FpropIri%29%0A%7D%0AORDER%20BY%20%3Flabel" target="_blank">this query</a>; the specific object properties are all shown in the diagram above. </div>
<div class="separator" style="clear: both;">
<br /></div>
<div class="separator" style="clear: both;">
Clearly, the diagram above is too complicated to insert as part of the main diagram comparing ABCD and DwC. Instead, I abbreviated it in the main diagram as shown in the following detail:</div>
<div class="separator" style="clear: both;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6ZII0nl5I8On1uB5EUCNv91aZWU1FZ-2teLqx5U2EFobpD8sMYrmxMB7TNjSyAn0fGbntMiwLv42i6eK8SXLdrPylKJaJWVL7-CAEbiSi5FZS_82hSXz6EP_d74N1got28wOTAf4ordo/s1600/type-specific-information-detail.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="372" data-original-width="454" height="523" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg6ZII0nl5I8On1uB5EUCNv91aZWU1FZ-2teLqx5U2EFobpD8sMYrmxMB7TNjSyAn0fGbntMiwLv42i6eK8SXLdrPylKJaJWVL7-CAEbiSi5FZS_82hSXz6EP_d74N1got28wOTAf4ordo/s640/type-specific-information-detail.png" width="640" /></a></div>
<div class="separator" style="clear: both;">
<br /></div>
<div class="separator" style="clear: both;">
In this part of the diagram, I generalized the nine subclasses into a single bubble for the superclass <span style="font-family: "courier new" , "courier" , monospace;">abcd:TypeSpecificInformation</span>. The link from the Unit to the evidence instance can be made through the <span style="font-family: "courier new" , "courier" , monospace;">abcd:hasTypeSpecificInformation</span> property, or it can be made using one of the nine object properties that connect the Unit directly to the evidence. </div>
<div class="separator" style="clear: both;">
<br /></div>
<div class="separator" style="clear: both;">
In addition, I also placed <span style="font-family: "courier new" , "courier" , monospace;">abcd:MultimediaObject</span> in the position of <span style="font-family: "courier new" , "courier" , monospace;">dsw:Token</span>. Although images (and other kinds of multimedia) taken directly of the organism at the time the occurrence is recorded are often ignored by the museum community, with the flood of data coming from iNaturalist into GBIF, media is now a very important type of direct evidence for occurrences. </div>
<div class="separator" style="clear: both;">
<br /></div>
<div class="separator" style="clear: both;">
So in general, <span style="font-family: "courier new" , "courier" , monospace;">abcd:TypeSpecificInformation</span> is synonymous with <span style="font-family: "courier new" , "courier" , monospace;">dsw:Token</span>, with the exception that multimedia objects can serve as Tokens but aren't explicitly listed as subclasses of <span style="font-family: "courier new" , "courier" , monospace;">abcd:TypeSpecificInformation</span>.</div>
<div class="separator" style="clear: both;">
<br /></div>
<h3 style="clear: both;">
The place of controlled vocabularies in the model</h3>
<div class="separator" style="clear: both;">
The last major difference between the ABCD model and Darwin Core is how they deal with controlled vocabularies. Take, for example, the property <span style="font-family: "courier new" , "courier" , monospace;">abcd:hasSex</span>. In the Wikibase installation, it's item Q1057 and has the range <span style="font-family: "courier new" , "courier" , monospace;">abcd:Sex</span>. The range property would entail that <span style="font-family: "courier new" , "courier" , monospace;">abcd:Sex</span> is a Class, but its type is given in the Wikibase installation as Controlled Vocabulary rather than Class. As I mentioned earlier, in the <span style="font-family: "courier new" , "courier" , monospace;">abcd_concepts.owl</span> ontology document, the controlled vocabularies are actually typed as <span style="font-family: "courier new" , "courier" , monospace;">owl:Class</span> rather than <span style="font-family: "courier new" , "courier" , monospace;">skos:ConceptScheme</span> as I would expect, with the controlled terms as instances of the controlled vocabularies. </div>
<div class="separator" style="clear: both;">
<br /></div>
<div class="separator" style="clear: both;">
So let's assume we have an <span style="font-family: "courier new" , "courier" , monospace;">abcd:Unit</span> instance called <span style="font-family: "courier new" , "courier" , monospace;">_:occurrence1</span> that is a female. Using the model of ABCD, the following triples could describe the situation:</div>
<div class="separator" style="clear: both;">
<br /></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;">abcd:Sex a rdfs:Class.</span></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;">abcd:hasSex a owl:ObjectProperty;</span></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;"> rdfs:range abcd:Sex.</span></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;">abcd:Female a abcd:Sex;</span></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;"> rdfs:label "female"@en.</span></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;">_:occurrence1 abcd:hasSex abcd:Female.</span></div>
<div class="separator" style="clear: both;">
<br /></div>
<div class="separator" style="clear: both;">
Currently, there are many terms in Darwin Core that say "Recommended best practice is to use a controlled vocabulary." However, most of these terms do not (yet) have controlled vocabularies, although this could change soon. Let's assume that the Standards Documentation Specification is followed and a SKOS-based controlled vocabulary identified by the IRI <span style="font-family: "courier new" , "courier" , monospace;">dwcv:gender</span> is created to be used to provide values for the term <span style="font-family: "courier new" , "courier" , monospace;">dwciri:sex</span>. Assume that the controlled vocabulary contains the terms <span style="font-family: "courier new" , "courier" , monospace;">dwcv:male</span> and <span style="font-family: "courier new" , "courier" , monospace;">dwcv:female</span>. The following triples could then describe the situation:</div>
<div class="separator" style="clear: both;">
<br /></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;">dwcv:gender</span><span style="font-family: "courier new" , "courier" , monospace;"> a skos:ConceptScheme.</span></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;">dwcv:female</span><span style="font-family: "courier new" , "courier" , monospace;"> a skos:Concept;</span></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;"> skos:prefLabel "female"@en;</span></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;"> rdf:value "female";</span></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;"> skos:inScheme dwcv:gender.</span></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;">_:occurrence1 dwc:sex "</span><span style="font-family: "courier new" , "courier" , monospace;">female"</span><span style="font-family: "courier new" , "courier" , monospace;">.</span></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;">_:occurrence1 dwciri:sex </span><span style="font-family: "courier new" , "courier" , monospace;">dwcv:female</span><span style="font-family: "courier new" , "courier" , monospace;">.</span></div>
<div class="separator" style="clear: both;">
<br /></div>
<div class="separator" style="clear: both;">
From the standpoint of generic modeling, neither of these approaches is "right" or "wrong". However, the latter approach is consistent with sections 4.1.2, 4.5, and 4.5.4 of the <a href="https://github.com/tdwg/vocab/blob/master/sds/documentation-specification.md" target="_blank">TDWG Standards Documentation Specification</a> as well as the pattern noted for controlled vocabularies in <a href="https://www.w3.org/TR/dwbp/#dataVocabularies" target="_blank">section 8.9</a> of the W3C <i>Data on the Web Best Practices</i> recommendation.</div>
<div class="separator" style="clear: both;">
<br /></div>
<div class="separator" style="clear: both;">
One reason that the ABCD graph diagram is more complicated than the DSW graph diagram is that some classes shown on the ABCD diagram as yellow bubbles (<span style="font-family: "courier new" , "courier" , monospace;">abcd:RecordBasis</span> and <span style="font-family: "courier new" , "courier" , monospace;">abcd:Sex</span>) and other classes not shown (like <span style="font-family: "courier new" , "courier" , monospace;">abcd:PermitType</span>, <span style="font-family: "courier new" , "courier" , monospace;">abcd:NomenclaturalCode</span>, etc.) represent controlled vocabularies rather than classes of linked resources. </div>
<div class="separator" style="clear: both;">
<br /></div>
<h2 style="clear: both;">
Final thoughts</h2>
<div class="separator" style="clear: both;">
</div>
I have to say that I was somewhat surprised at how similar the ABCD and Darwin-SW graph models were. Perhaps I shouldn't be that surprised, given the DSW model's roots in the ACS model - it generally reflects the way the collections community views the universe and that view undoubtedly informs the ABCD model as well. That's good news, because it means that it should be possible to create a consensus graph model for Darwin Core and ABCD with minimal changes to either standard.<br />
<div>
<br /></div>
<div>
With such a model, it should be possible, using SPARQL CONSTRUCT queries mediated by software, to perform automated conversions from Darwin Core linked data to ABCD linked data. The CONSTRUCT query could insert blank nodes in places where the ABCD model has classes that aren't included in DwC. The conversion in the other direction would be more difficult since classes included in ABCD that aren't in DwC would have to be eliminated to make the conversion, and that might result in data loss as the data were denormalized. Still, the idea of any automated conversion is an encouraging thought!</div>
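<div>
<br /></div>
<div>
To make the blank-node idea concrete, here is a small, purely illustrative sketch of such a conversion using the Python rdflib library. The <span style="font-family: "courier new" , "courier" , monospace;">abcd:hasGathering</span> property name is hypothetical (invented here to follow ABCD's "hasX" naming convention), the example data and IRIs are made up, and the mapping covers only one of the correspondences discussed above (<span style="font-family: "courier new" , "courier" , monospace;">dwc:PreservedSpecimen</span> to <span style="font-family: "courier new" , "courier" , monospace;">abcd:HerbariumUnit</span>); it is meant to show the CONSTRUCT pattern, not a real DwC-to-ABCD mapping.</div>
<div>
<br /></div>
<pre style="font-family: monospace; overflow: auto;">
from rdflib import Graph

# Toy DwC/DSW data: a preserved specimen serving as evidence for an occurrence.
dwc_data = '''
@prefix dwc: &lt;http://rs.tdwg.org/dwc/terms/&gt; .
@prefix dsw: &lt;http://purl.org/dsw/&gt; .
@prefix ex:  &lt;http://example.org/&gt; .

ex:specimen1   a dwc:PreservedSpecimen ;
               dsw:evidenceFor ex:occurrence1 .
ex:occurrence1 a dwc:Occurrence .
'''

# A CONSTRUCT query that recasts the data in ABCD terms. The blank node in the
# template stands in for an abcd:Gathering, a class with no analog in the source
# data, and abcd:hasGathering is a hypothetical linking property name.
convert = '''
PREFIX dwc:  &lt;http://rs.tdwg.org/dwc/terms/&gt;
PREFIX dsw:  &lt;http://purl.org/dsw/&gt;
PREFIX abcd: &lt;http://rs.tdwg.org/abcd/terms/&gt;

CONSTRUCT {
  ?occ a abcd:Unit ;
       abcd:hasTypeSpecificInformation ?token ;
       abcd:hasGathering _:gathering .
  ?token a abcd:HerbariumUnit .
  _:gathering a abcd:Gathering .
}
WHERE {
  ?token a dwc:PreservedSpecimen ;
         dsw:evidenceFor ?occ .
}
'''

source = Graph()
source.parse(data=dwc_data, format='turtle')
for triple in source.query(convert):  # a CONSTRUCT result iterates as triples
    print(triple)
</pre>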
<div>
<br /></div>
<div>
The other thing that is clear to me from this investigation is that the current DwC and ABCD vocabularies could relatively easily be further developed into QL-like ontologies. That's basically what has already been done in the <a href="https://github.com/tdwg/abcd/blob/master/ontology/abcd_concepts.owl" target="_blank">abcd_concepts.owl ontology document</a> and in DSW. It has been suggested that TDWG ontology development be carried out using the OBO Foundry system, but that system is designed to create and maintain EL-like ontologies. Transforming Darwin Core and ABCD to EL-like ontologies would be much more difficult, and it is not clear to me what would be gained by that, given that the primary use case for ontology development in TDWG would be to facilitate querying of large volumes of instance data.<br />
<br />
<div class="separator" style="clear: both;">
<br /></div>
<h2>
For further reference</h2>
<h3>
</h3>
<h3>
Ontologies vs. controlled vocabularies</h3>
The distinction between ontologies and controlled vocabularies is discussed in several standards:<br />
<br />
<ul>
<li><a href="https://www.w3.org/TR/dwbp/#dataVocabularies" target="_blank">Section 8.9 of the W3C Data on the Web Best Practices Recommendation</a> </li>
<li><a href="https://github.com/tdwg/vocab/blob/master/iso25964.md" target="_blank">ISO 25964 (Thesauri and interoperability with other vocabularies)</a></li>
<li><a href="https://www.w3.org/TR/skos-reference/#L1045" target="_blank">Section 1.3 of the SKOS Simple Knowledge Organization System Reference</a></li>
</ul>
<br />
To paraphrase these references, there is a fundamental difference between ontologies and controlled vocabularies. <b>Ontologies </b>define knowledge related to some shared conceptualization in a formal way so that machines can carry out reasoning. They aren't primarily designed for human interaction. <b>Controlled vocabularies</b> are designed to help humans use natural language to organize and find items by associating consistent labels with concepts. Controlled vocabularies don't assert axioms or facts. A <b>thesaurus</b> (sensu <a href="https://github.com/tdwg/vocab/blob/master/iso25964.md" target="_blank">ISO 25964</a>) is a kind of controlled vocabulary whose concepts are organized with explicit relationships (e.g. broader, narrower).<br />
<br />
The <i>Data on the Web Best Practices</i> recommendation notes in section 8.9 that controlled vocabularies and ontologies can be used together when the concepts defined in the controlled vocabulary are used as values for a property defined in an ontology. It gives the following example: "A concept from a thesaurus, say, 'architecture', will for example be used in the subject field for a book description (where 'subject' has been defined in an ontology for books)."<br />
<br />
<h3>
Kinds of ontologies</h3>
The <a href="https://www.w3.org/TR/owl2-profiles/#Introduction" target="_blank">Introduction of the W3C OWL 2 Web Ontology Language Profiles</a> Recommendation describes several profiles or sublanguages of the OWL 2 language for building ontologies. These profiles place restrictions on the structure of OWL 2 ontologies in ways that make them more efficient for dealing with data of different sorts. The nature of these restrictions are very technical and way beyond the scope of this post, but I mention the profiles because they provide a convenient way the characterize ontology modeling approaches. (I also refer you to <a href="https://www.cambridgesemantics.com/blog/semantic-university/learn-owl-rdfs/flavors-of-owl/" target="_blank">this post</a>, which offers a very succinct description of the difference in the profiles.)<br />
<br />
<b>OWL 2 EL</b> is suitable for "applications employing ontologies that define very large numbers of classes and/or properties". A classic example of such an ontology is the <a href="http://geneontology.org/docs/ontology-documentation/" target="_blank">Gene Ontology</a>, where the data themselves are represented as tens of thousands of classes. <b>OWL 2 QL</b> is suitable for "applications that use large volumes of instance data, and where query answering is the most important reasoning activity." A classic example of such an ontology is the <a href="http://www.geonames.org/ontology/documentation.html" target="_blank">GeoNames ontology</a>, which contains only 7 classes and 28 properties, but is used with over eleven million place feature instances. In OWL 2 QL, query answering can be implemented using conventional relational database systems.<br />
<br />
I refer to ontologies with many classes and properties for which OWL 2 EL is suitable as "<b>EL-like ontologies</b>", and ontologies with few classes and properties used with lots of instance data for which OWL 2 QL is suitable as "<b>QL-like ontologies</b>".<br />
<br />
<h3>
Vocabularies and terms</h3>
<a href="https://www.w3.org/TR/dwbp/#dataVocabularies" target="_blank">Section 8.9 of the W3C Data on the Web Best Practices Recommendation</a> describes vocabularies and terms in this way:<br />
<blockquote class="tr_bq">
<b>Vocabularies</b> define the concepts and relationships (also referred to as “<b>terms</b>” or “attributes”) used to describe and represent an area of interest. They are used to classify the terms that can be used in a particular application, characterize possible relationships, and define possible constraints on using those terms. Several near-synonyms for 'vocabulary' have been coined, for example, ontology, controlled vocabulary, thesaurus, taxonomy, code list, semantic network.</blockquote>
So a vocabulary is a broad category that includes both ontologies and controlled vocabularies, and it is a collection of terms. In this post, I use "vocabulary" and "term" in this context and avoid using the word "<b>concept</b>" unless I specifically mean it in the sense of a <span style="font-family: "courier new" , "courier" , monospace;">skos:Concept</span> (i.e. a term in a controlled vocabulary).<br />
<br />
<span style="font-size: xx-small;">Note: this was originally posted 2019-06-11 but was edited on 2019-06-12 to clarify the position of the subclasses of <span style="font-family: "courier new" , "courier" , monospace;">abcd:hasTypeSpecificInformation</span> in the model.</span><br />
<br /></div>
Steve Baskaufhttp://www.blogger.com/profile/01896499749604153763noreply@blogger.com2tag:blogger.com,1999:blog-5299754536670281996.post-70754714521796970652019-06-04T15:39:00.005-07:002021-03-13T07:31:39.097-08:00Putting Data into Wikidata using SoftwareThis is a followup post to <a href="http://baskauf.blogspot.com/2019/05/getting-data-out-of-wikidata-using.html" target="_blank">an earlier post about getting data out of Wikidata</a>, so although what I'm writing about here doesn't really depend on having read that post, you might want to take a look at it for background.<div><br /></div><div><b>Note added 2021-03-13:</b> Although this post is still relevant for understanding some of the basic ideas about writing to a Wikibase API (including Wikidata's), I have written another series of blog posts showing (with lots of screenshots and handholding) <b>how you can safely write your own data to the Wikidata API</b> using data that is stored in simple CSV spreadsheets. See <a href="http://baskauf.blogspot.com/2021/03/writing-your-own-data-to-wikidata-using.html" target="_blank">this post</a> for details.</div><div>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://upload.wikimedia.org/wikipedia/commons/thumb/1/1a/Wikidata_Bots.png/210px-Wikidata_Bots.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="149" data-original-width="210" src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/1a/Wikidata_Bots.png/210px-Wikidata_Bots.png" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: xx-small;">Image from <a href="https://commons.wikimedia.org/wiki/File:Wikidata_Bots.png" target="_blank">Wikimedia Commons</a>; licensing murky but open</span></td></tr>
</tbody></table>
<h2>
What do I mean by "putting data into Wikidata"?</h2>
<div>
I have two confessions to make right at the start. To some extent, the title of this post is misleading. What I am actually going to talk about is putting data into Wikibase, which isn't exactly the same thing as Wikidata. I'll explain about that in a moment. The second confession is that if all you really want are the technical details of how to write to Wikibase/Wikidata and the do-it-yourself scripts, you can just skip reading the rest of this post and go directly to<a href="https://heardlibrary.github.io/digital-scholarship/host/wikidata/bot/" target="_blank"> a web page that I've already written</a> on that subject. But hopefully you will read on and try the scripts after you've read the background information here.</div>
<div>
<br /></div>
<div>
<a href="http://wikiba.se/" target="_blank">Wikibase</a> is the underlying application upon which <a href="https://www.wikidata.org/" target="_blank">Wikidata</a> is built. So if you are able to write to Wikibase using a script, you are also able to use that same script to write to Wikidata. However, there is an important difference between the two. If you <a href="https://heardlibrary.github.io/digital-scholarship/lod/install/#using-docker-compose-to-create-an-instance-of-wikibase-on-your-local-computer" target="_blank">create your own instance of Wikibase</a>, it is essentially a blank version of Wikidata into which you can put your own data, and whose properties you can tweak in any way that you want. In contrast, Wikidata is a community-supported project that contains data from many sources, and which has properties that have been developed by consensus. So you can't just do whatever you want with Wikidata. (Well, actually you can, but your changes might be reverted and you might get banned if you do things that the community considers bad.)</div>
<div>
<br /></div>
<div>
So before you start using a script to mess with the "real" Wikidata, it's really important to first understand the expectations and social conventions of the Wikidata community. Although I've been messing around with scripting interactions with Wikibase and Wikidata for months, I have not turned a script loose on the "real" Wikidata yet because I still have some work to do to meet the community expectations.<br />
<br />
Before you start using a script to make edits to the real Wikidata, at a minimum you need to do the following:<br />
<br />
<ul>
<li>read the <a href="https://www.mediawiki.org/wiki/API:Etiquette" target="_blank">MediaWiki API Etiquette page</a></li>
<li>study and understand the <a href="https://www.mediawiki.org/wiki/Manual:Maxlag_parameter" target="_blank">information on the Maxlag parameter</a></li>
<li>understand the <a href="https://www.wikidata.org/wiki/Wikidata:Bots" target="_blank">bot approval process and follow the guidelines for creating and operating a Wikidata bot</a></li>
<li>test your script extensively on the <a href="https://test.wikidata.org/" target="_blank">Wikidata test instance</a></li>
</ul>
<br />
If you are only thinking about using a script to write to your own instance of Wikibase, you can ignore the steps above and just hack away. The worst-case scenario is that you'll have to blow the whole thing up and start over, which is not that big of a deal if you haven't yet invested a lot of time in loading data.<br />
<br />
<h2>
Some basic background on Wikibase</h2>
Although we tend to talk about Wikibase as if it were a single application, it actually consists of several applications operating together in a coordinated installation. This is somewhat of a gory detail that we can usually ignore. However, having a basic understanding of the structure of Wikidata will help us to understand why, even though Wikidata supports Linked Data, we have to write to Wikidata through the MediaWiki API. (Full disclosure: I'm not an expert on Wikibase, and what I say here is based on the understanding that I have gained from my own explorations.)<br />
<br />
We can see the various pieces of Wikibase by looking its <a href="https://github.com/wmde/wikibase-docker/blob/master/docker-compose.yml" target="_blank">Docker Compose YAML file</a>. Here are some of them:<br />
<br />
<ul>
<li>a mysql database</li>
<li>a Blazegraph triplestore backend (exposed on port 8989)</li>
<li>the Wikidata Query Service frontend (exposed on port 8282)</li>
<li>the Mediawiki GUI and API (exposed on port 8181)</li>
<li>a Wikidata Query Service updater</li>
<li>Quickstatements (which doesn't work right out of the box, so we'll ignore it)</li>
</ul>
<br />
When data are entered into Wikibase using the Mediawiki instance at port 8181, they are stored in the mysql database. The Wikidata Query Service updater checks periodically for changes in the database. When it finds one, it loads the changed data into the Blazegraph triplestore. Although one can access the Blazegraph interface directly through port 8989, accessing the triplestore indirectly through the Wikidata Query Service frontend on port 8282 gives some additional bells and whistles that make querying easier.<br />
<br />
If I look at the terminal window while Docker Compose is running Wikibase, I see this:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh2LP10U7O52aW4rnMSO7Knsihap5LcsTeqGcYWOJiOjvER5L9OOtkoS006YYWGt74U81SMU-XEWLz9C1cNsE1q_OTmMj1Lokf82BhF9BQfBLHCUH_OaF2QFL8QvBT4yivxDbv9RszTwOk/s1600/wikibase-terminal.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="268" data-original-width="815" height="209" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh2LP10U7O52aW4rnMSO7Knsihap5LcsTeqGcYWOJiOjvER5L9OOtkoS006YYWGt74U81SMU-XEWLz9C1cNsE1q_OTmMj1Lokf82BhF9BQfBLHCUH_OaF2QFL8QvBT4yivxDbv9RszTwOk/s640/wikibase-terminal.png" width="640" /></a></div>
<br />
You can see that the updater is looking for changes every 10 seconds. This goes on in the terminal window as long as the instance is up. So when changes are made via Mediawiki, they show up in the Query Service within about 10 seconds.<br />
<br />
If you access Blazegraph via <a href="http://localhost:8989/bigdata/">http://localhost:8989/bigdata/</a>, you'll see the normal GUI that will be familiar to you if you've used Blazegraph before:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiQDuO0yO7dN0LXH1h7UBkybfZ_90Q5oJjUhwZJPOuBMlTVMg2Oy_R79nDwe3P6xy32EVWSMoXO4hZP-klmRBRxl55mmNaeDhmO7RAmFqjiBHVsBxIKyAzS-aZ-RDxKQ0bOZbZCSFPiPiU/s1600/blazegraph-interface.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="872" data-original-width="1162" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiQDuO0yO7dN0LXH1h7UBkybfZ_90Q5oJjUhwZJPOuBMlTVMg2Oy_R79nDwe3P6xy32EVWSMoXO4hZP-klmRBRxl55mmNaeDhmO7RAmFqjiBHVsBxIKyAzS-aZ-RDxKQ0bOZbZCSFPiPiU/s640/blazegraph-interface.png" width="640" /></a></div>
<br />
<div style="clear: both; text-align: left;">
However, if you go to the UPDATE tab and try to add data using SPARQL Update, you'll find that it's disabled. That means that the only way to actually get data into the system is through the Mediawiki GUI or API exposed through port 8181, and NOT through the standard Linked Data mechanism of SPARQL Update. So if you want to add data to Wikibase (either your local installation or the Wikidata instance of Wikibase), you need to figure out how to use the Mediawiki API, which is based on a specific Wikimedia data model and NOT on standard RDF or RDFS. </div>
<br />
<h2>
The MediaWiki API</h2>
The <a href="https://www.mediawiki.org/wiki/API:Main_page" target="_blank">MediaWiki API</a> is a generic web service for all installations in the WikiMedia universe. That includes not only familiar <a href="https://en.wikipedia.org/wiki/Wikimedia_Foundation" target="_blank">Wikimedia Foundation</a> projects like <a href="https://www.wikipedia.org/" target="_blank">Wikipedia</a> in all of its various languages, <a href="https://commons.wikimedia.org/" target="_blank">Wikimedia Commons</a>, and <a href="https://www.wikidata.org/" target="_blank">Wikidata</a>, but also any of the many other projects built on the open source MediaWiki platform.<br />
<br />
The API allows you to perform many possible read or write actions on a MediaWiki installation. Those actions are listed on the <a href="https://www.wikidata.org/w/api.php" target="_blank">MediaWiki API help page</a> and you can learn their details by clicking on the name of any of the actions. The actions whose names begin with "wb" are the ones specifically related to Wikibase and there is <a href="https://www.mediawiki.org/wiki/Wikibase/API" target="_blank">a special page</a> that focuses only on that set of actions. Since this post is related to Wikibase, we will focus on those actions. Although a number of the Wikibase-related actions can read from the API, as I pointed out in <a href="http://baskauf.blogspot.com/2019/05/getting-data-out-of-wikidata-using.html" target="_blank">my most recent previous post </a>there is not much point in reading directly from the API when one can just use Wikibase's awesome SPARQL interface instead. So in my opinion, the most important Wikibase actions are the ones that write to the API rather than read.<br />
<br />
The Wikibase-specific API page makes <a href="https://www.mediawiki.org/wiki/Wikibase/API#Post_vs._get" target="_blank">two important points about writing to a Wikibase instance</a>: writing requires a <i>token</i> (more on that later) and must be done using an HTTP POST request. I have to confess that when I first started looking at the API documentation, I was mystified about how to translate the examples given there into request bodies that could be sent as part of a POST request. But there is a very useful tool that makes it much easier to construct the POST requests: the <a href="https://test.wikidata.org/wiki/Special:ApiSandbox" target="_blank">API sandbox</a>. There are actually multiple sandboxes (e.g. real Wikidata, Wikidata test instance, real Wikipedia, Wikipedia test instance, etc.), but since tests that you do in an API sandbox cause real changes to their corresponding MediaWiki instances, you should practice using the <a href="https://test.wikidata.org/" target="_blank">Wikidata test instance</a> sandbox (<a href="https://test.wikidata.org/wiki/Special:ApiSandbox">https://test.wikidata.org/wiki/Special:ApiSandbox</a>) and not the sandbox for the real Wikidata, which looks and behaves exactly the same as the test instance sandbox.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPxalauwPbGbA89GVF0Q28ZFd01a6nB9-s_Oue_4r8RG8SFckhp6P6nVBbaWqUvGozoV6kJwtCHsxgtZz0Bc7Vo-0DU058FC5laCFJnUu3vMTd2rQNF0OhRbPYNVlByiADEIg2Vqak6UE/s1600/sandbox.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="942" data-original-width="1143" height="526" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPxalauwPbGbA89GVF0Q28ZFd01a6nB9-s_Oue_4r8RG8SFckhp6P6nVBbaWqUvGozoV6kJwtCHsxgtZz0Bc7Vo-0DU058FC5laCFJnUu3vMTd2rQNF0OhRbPYNVlByiADEIg2Vqak6UE/s640/sandbox.png" width="640" /></a></div>
<br />
<br />
When you go to the sandbox, you can select from the dropdown the action that you want to test. Alternatively, you can click on <a href="https://test.wikidata.org/w/api.php?action=help&modules=wbcreateclaim" target="_blank">one of the actions</a> on the MediaWiki API help page, then in the Examples section, click on the "[open in sandbox]" link to jump directly to the sandbox with the parameters already filled into the form. <br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEilSaCfwy41pdL5-IydJZU9ESc33ttEqqWSWywwIqm6gYHiXuZR2gGWIBrdzVyOOT2szDZKYo_bvhU8TPt15rkvLnpdrA6NE3cJmZdXWslw1tZQQccL5FeDHvLiVMKCXnjPV6exE0GRiJA/s1600/examples.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="259" data-original-width="1143" height="144" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEilSaCfwy41pdL5-IydJZU9ESc33ttEqqWSWywwIqm6gYHiXuZR2gGWIBrdzVyOOT2szDZKYo_bvhU8TPt15rkvLnpdrA6NE3cJmZdXWslw1tZQQccL5FeDHvLiVMKCXnjPV6exE0GRiJA/s640/examples.png" width="640" /></a></div>
<br />
Click on the "action=..." link in the menu on the left if needed to enter any necessary parameters. Note: since testing the write actions requires a token, you need to log in (same credentials as Wikipedia or any other Wikimedia site), then click the "Auto-fill the token" button before the write action will really work. Once the action has taken place, you can go to the edited entry in the test Wikidata instance and convince yourself that it really worked.<br />
<br />
On the sandbox page, clicking on the "Results" link in the menu on the left will provide you with a really useful piece of information: the Request JSON that needs to be sent to the API as the body of the POST request:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg2G0bnp3ishBfHB7qfTjvBTCJBPgEd9tW_oJ7pljJdlUS20sjpYBaoqttFbAuINz0UzDh7HMSIwYqnCWBmcmDXivnqztTVkwCu_XnECOs6jxicbMV1zHRgLNXlie8w5dvp_nYH67BsZmg/s1600/api-request-results-page.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="497" data-original-width="939" height="338" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg2G0bnp3ishBfHB7qfTjvBTCJBPgEd9tW_oJ7pljJdlUS20sjpYBaoqttFbAuINz0UzDh7HMSIwYqnCWBmcmDXivnqztTVkwCu_XnECOs6jxicbMV1zHRgLNXlie8w5dvp_nYH67BsZmg/s640/api-request-results-page.png" width="640" /></a></div>
<br />
Drop down the "Show request data as:" list to "JSON" and you can copy the Request JSON to use as you write and test your bot script. Once you've had a chance to look at several examples of request JSON, you can then compare it to the information given on the various API help pages to understand better what exactly you need to send to the API as the body of your POST request.<br />
<br />
<h2>
Authentication</h2>
In the last section, I mentioned that all write actions required a token. So what is that token, and how do you get it? In the API sandbox, you just click on a button and magic happens: a token is pasted into the box on the form. But what do you do for a real script?<br />
<br />
The actual process of getting the necessary token is a bit convoluted and I won't go into the details here, since they are covered (with screenshots) on another web page in the <a href="https://heardlibrary.github.io/digital-scholarship/host/wikidata/bot/#set-up-the-bot" target="_blank">Set up the bot</a> and <a href="https://heardlibrary.github.io/digital-scholarship/host/wikidata/bot/#use-the-bot-to-write-to-the-wikidata-test-instance" target="_blank">Use the bot to write to the Wikidata test instance</a> sections. The abridged version is that you first need to create a bot username and password, then use those credentials to interact with the API to get the CSRF token that will allow you to perform the POST request.<br />
<br />
For use in the test Wikidata instance or in your own Wikibase installation, you can just create the bot password using your own personal account. (Note: "bot" is just MediaWiki lingo for a script that automates edits.) However, the guidelines for <a href="https://www.wikidata.org/wiki/Wikidata:Bots" target="_blank">getting approval for a Wikidata bot</a> say that if you want to create a bot that carries out manipulations of the real Wikidata, you need to create a separate account specifically for the bot. An approved bot will receive a "bot flag" indicating that the community has given a thumbs-up to the bot to carry out its designated tasks. In the practice examples I've given, you don't need to do that, so you can ignore that part for now.<br />
<br />
A CSRF token is issued for a particular editing session, so once it has been issued, it can be re-used for many actions that are carried out by the bot script during that session. I've written a Python function, <span style="font-family: "courier new" , "courier" , monospace;">authenticate()</span>, that can be copied from <a href="https://github.com/HeardLibrary/digital-scholarship/blob/master/code/wikibase/api/load_csv.py" target="_blank">this page</a> and used to get the CSRF token - it's not necessary to understand the details unless you care about that kind of thing.<br />
<br />
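If you're curious what that abridged version looks like in code, here is a minimal sketch of the token-getting process using the Python <span style="font-family: "courier new" , "courier" , monospace;">requests</span> library; the <span style="font-family: "courier new" , "courier" , monospace;">authenticate()</span> function linked above does essentially this (with more error handling). The bot username and password are placeholders that you create as described on that page.<br />
<br />
<pre style="font-family: monospace; overflow: auto;">
import requests

API_URL = 'https://test.wikidata.org/w/api.php'  # the Wikidata test instance

def get_csrf_token(bot_username, bot_password):
    """Log in with a bot password and return the session plus a CSRF token."""
    session = requests.Session()  # a Session keeps the login cookies

    # 1. Get a login token.
    data = session.get(API_URL, params={'action': 'query', 'meta': 'tokens',
                                        'type': 'login', 'format': 'json'}).json()
    login_token = data['query']['tokens']['logintoken']

    # 2. Log in using the bot username/password and the login token.
    session.post(API_URL, data={'action': 'login', 'lgname': bot_username,
                                'lgpassword': bot_password, 'lgtoken': login_token,
                                'format': 'json'})

    # 3. Ask for the CSRF token that must accompany every write action.
    data = session.get(API_URL, params={'action': 'query', 'meta': 'tokens',
                                        'format': 'json'}).json()
    return session, data['query']['tokens']['csrftoken']
</pre>
<br />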
<h2>
Time for a Snak</h2>
You can't get very far into the process of performing Wikibase actions on the MediaWiki API before you start running into the term <i>snak</i>. Despite reading various Wikibase documents and doing some minimal googling, I have not been able to find out the origin of the word "snak". I suppose it is either an inside joke, a term from some language other than English, or an acronym. If anybody out there knows, I would love to be set straight on this.<br />
<br />
The <a href="https://www.mediawiki.org/wiki/Wikibase/DataModel" target="_blank">Wikibase/DataModel reference page</a> <a href="https://www.mediawiki.org/wiki/Wikibase/DataModel#Snaks" target="_blank">defines snaks as</a>: "the basic information structures used to describe Entities in Wikidata. They are an integral part of each Statement (which can be viewed as collection of Snaks about an Entity, together with a list of references)." But what exactly does that mean?<br />
<br />
Truthfully, I find the reference page a tough slog, so if you are unfamiliar with the Wikidata model and want to get a better understanding of it, I would recommend starting with the <a href="https://www.mediawiki.org/wiki/Wikibase/DataModel/Primer" target="_blank">Data Model Primer page</a>, which shows very clearly how the data model relates to the familiar MediaWiki item entry GUI (but ironically does not mention snaks anywhere on the entire page). I would also recommend studying the following graph diagram, which comes from a <a href="https://heardlibrary.github.io/digital-scholarship/lod/wikibase/" target="_blank">page that I wrote</a> to help people get started making more complex Wikibase/Wikidata SPARQL queries.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://heardlibrary.github.io/digital-scholarship/lod/images/wikidata-statement-reference.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="554" data-original-width="734" height="482" src="https://heardlibrary.github.io/digital-scholarship/lod/images/wikidata-statement-reference.png" width="640" /></a></div>
<br />
Before I talk about how snaks fit into the Wikibase data model, I want to talk briefly about how the Wikibase modeling approach differs from the modeling that is more typical of RDF-based Linked Data. A typical RDF-based graph model is built upon <a href="https://www.w3.org/TR/rdf-schema/" target="_blank">RDFS</a>, which includes an implicit notion of classes and types. One could then build a model on top of RDFS by creating an ontology where class relationships are defined using subclass statements, restrictions are placed on class membership, ranges and domains are defined, etc. The overall goal is to describe some model of the world (real or imagined).<br />
<br />
In contrast to that, a basic principle of Wikibase is that <a href="https://www.mediawiki.org/wiki/Wikibase/DataModel/Primer#Statements" target="_blank">it is not about the truth</a>. Rather, the Wikibase model is based on describing statements and their references. So the Wikibase model does not assume that we can model the world by placing items in a class. Rather, the Wikibase model allows us to state that "so-and-so says" that an item is a member of some class. A key property in Wikidata is P31 ("instance of"), which is used with almost every item to document a statement about class membership. But there is no requirement that some other installation of Wikibase have an "instance of" property, or that if an "instance of" property exists, its identifier must be P31. "Instance of" is not an idea that's "baked into" the Wikibase model in the way it's built into RDFS. "Instance of" is just one of the many properties that the Wikidata community has decided it would like to use in statements that it documents. The same is true of "subclass of" (P279). A user can create the statement Q6256 P279 Q1048835 ("country" "subclass of" "political territorial entity"), but according to the Wikibase model, that is not some kind of special assertion of the state of reality. Rather, it's just one of the many other statements about items that have been documented in the Wikidata knowledge base.<br />
<br />
So when we say that some part of the Wikidata community is "building a model" of their domain, they aren't doing it by building a formal ontology using RDF, RDFS, or OWL. Rather, they are doing it by making and documenting statements that involve the properties P31 and P279, just as they would make and document statements using any of the other thousands of properties that have been created by the Wikidata community.<br />
<br />
What is actually "baked into" the Wikibase model (and Wikidata by extension) are the notions of property/value pairs associated with statements, reference property/value pairs associated with statements, and qualifiers and ranks for statements (not shown in the diagram above). The Wikibase data model assumes that the properties associated with statements and references exist, but does not define any of them <i>a priori</i>. Creating those particular properties are is up to the implementers of a particular Wikibase instance.<br />
<br />
These key philosophical differences between the Wikibase model and the "standard" RDF/RDFS/OWL world need to be understood by implementers from the Linked Data world who are interested in using Wikibase as a platform to host their data. Building a knowledge graph on top of Wikibase will automatically include notions of statements and references, but it will NOT automatically include notions of class membership and subclass relationships. Those features of the model will have to be built by the implementers through the creation of appropriate properties. It's also possible to use SPARQL CONSTRUCT to translate a statement in Wikidata lingo like<br />
<br />
Q42 P31 Q5.<br />
<br />
into a standard RDF/RDFS statement like<br />
<br />
Q42 rdf:type Q5.<br />
<br />
although this approach raises OWL-related problems when an item is used as both a class and an instance. But that's way beyond the scope of this post.<br />
<br />
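Setting those OWL issues aside, the translation itself is just a CONSTRUCT query. Here is a minimal sketch in Python using the <span style="font-family: "courier new" , "courier" , monospace;">requests</span> library; the choice of Q42 and the public Wikidata query endpoint are only examples, and any Wikibase SPARQL endpoint would work the same way.<br />
<br />
<pre>
# A minimal sketch: ask a SPARQL endpoint to translate wdt:P31 statements
# about an item into standard rdf:type triples using CONSTRUCT.
import requests

query = '''PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
CONSTRUCT {wd:Q42 rdf:type ?class.}
WHERE {wd:Q42 wdt:P31 ?class.}'''

# Ask for Turtle so that the returned graph is easy to read.
response = requests.get('https://query.wikidata.org/sparql',
                        params={'query': query},
                        headers={'Accept': 'text/turtle'})
print(response.text)
</pre>
<br />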
So after that rather lengthy aside, let's return to the question of snaks. A somewhat oversimplified description of a snak would be to say that it's a property/value pair of some sort. (There are also, less commonly, "no value" and "some value" snaks for cases where a value either doesn't exist or isn't known - you can read about their details on the reference page.) The exact nature of the snak will depend on whether the value is a string, an item, or some other more complicated entity like a date range or geographic location. "Main" snaks are property/value pairs that are associated directly with the subject item, and "qualifier" snaks qualify the statement made by the main snak. Zero to many reference records are linked to the statement, and each reference record has its own set of property/value snaks describing the reference itself (as opposed to describing the main statement). Given that the primary concern of the Wikibase data model is documenting statements involving property/value pairs, snaks are a central part of that model.<br />
<br />
The reason I'm going out into the weeds on the subject of snaks in this post is that a basic knowledge of snaks is required in order to understand the lingo of the Wikibase actions described in the MediaWiki API help. For example, if we look at the <a href="https://test.wikidata.org/w/api.php?action=help&modules=wbcreateclaim" target="_blank">help page for the wbcreateclaim</a> action, we can see how a knowledge of snaks will help us better understand the parameters required for that action.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHQY6wJGf01AbBvFyX8q_fhJCb4xHz09rhevieKJA4VYuDqaiyZ_-HLSN-0_4zY5ChF0_KjmoJbSRlZXFx7l9fNOGEaW-bWRDpqH3IsVrk7zX-GcjqX5JNej57kSegO1dM5MC-4VsBdvk/s1600/create-claim-help.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="471" data-original-width="1167" height="258" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHQY6wJGf01AbBvFyX8q_fhJCb4xHz09rhevieKJA4VYuDqaiyZ_-HLSN-0_4zY5ChF0_KjmoJbSRlZXFx7l9fNOGEaW-bWRDpqH3IsVrk7zX-GcjqX5JNej57kSegO1dM5MC-4VsBdvk/s640/create-claim-help.png" width="640" /></a></div>
<br />
<br />
In most cases, <span style="font-family: "courier new" , "courier" , monospace;">snaktype</span> will have a value of <span style="font-family: "courier new" , "courier" , monospace;">value</span> (unless you want to make a "no value" or "some value" assertion). If we want to write a claim having a typical snak, we will have to provide the API with values for both the <span style="font-family: "courier new" , "courier" , monospace;">property</span> and <span style="font-family: "courier new" , "courier" , monospace;">value</span> parameters. The <span style="font-family: "courier new" , "courier" , monospace;">property</span> parameter is straightforward: the property's "P" identifier is simply given as the value of the parameter.<br />
<br />
The value of the snak is more complicated. Its value is a string that also includes the delimiters necessary to describe the particular kind of value that's appropriate for the property. If the property is supposed to have a string value, then the value of the <span style="font-family: "courier new" , "courier" , monospace;">value</span> parameter will be the string enclosed in quotes. If the property is supposed to have an item as a value, then the information about the item is given as a string that includes all of the JSON delimiters (quotes, colons, curly braces, etc.) required in the API documentation. Since all of the parameters and values for the action will be passed to the API as JSON in the POST request, the value of the <span style="font-family: "courier new" , "courier" , monospace;">value</span> parameter will end up as a JSON string inside of JSON. Depending on the programming language you use, you may have to use escaping or some other mechanism to make sure that the JSON string for the <span style="font-family: "courier new" , "courier" , monospace;">value</span> value is rendered properly. Here are some examples of how part of the POST request body JSON might look in a programming language where escaping is done by preceding a character with a backslash:<br />
<br />
<i>if the value is a string:</i><br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">{</span><br />
<span style="font-family: inherit;">...</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "property": "P1234",</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "value": "\"WGS84\"",</span><br />
<span style="font-family: inherit;">...</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
<br />
<i>if the value is an item:</i><br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">{</span><br />
...<br />
<span style="font-family: "courier new" , "courier" , monospace;"> "property": "P9876",</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "value": "{\"entity-type\":\"item\",\"numeric-id\":1}",</span><br />
...<br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
<br />
Because the quotes that are part of the <span style="font-family: "courier new" , "courier" , monospace;">value</span> parameter value string are inside the quotes required by the request body JSON, they were escaped as <span style="font-family: "courier new" , "courier" , monospace;">\"</span>.<br />
<br />
For JSON data sent by the <span style="font-family: "courier new" , "courier" , monospace;">requests</span> Python library as the body of a POST request, the JSON can be passed into the <span style="font-family: "courier new" , "courier" , monospace;">.post()</span> method as a dictionary data structure, and <span style="font-family: "courier new" , "courier" , monospace;">requests</span> will turn the dictionary into JSON before sending it to the API. To some extent, that allows one to dodge the whole escaping thing by using a combination of single and double quotes when constructing the dictionary. So in Python, we could code the dictionary to be passed by <span style="font-family: "courier new" , "courier" , monospace;">requests</span> like this:<br />
<br />
<i>if the value is a string:</i><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">{</span><br />
<span style="font-family: inherit;">...</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 'property': 'P1234',</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 'value': '"WGS84"',</span><br />
<span style="font-family: inherit;">...</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
<br />
<i>if the value is an item:</i><br />
<div>
<i><br /></i></div>
<span style="font-family: "courier new" , "courier" , monospace;">{</span><br />
...<br />
<span style="font-family: "courier new" , "courier" , monospace;"> 'property': 'P9876',</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> 'value': '{"entity-type":"item","numeric-id":1}',</span><br />
...<br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
<br />
since Python dictionaries can be defined using single quotes. Other kinds of values such as geocoordinates will have a different structure for their <span style="font-family: "courier new" , "courier" , monospace;">value</span> string.<br />
<br />
I ran into problems in Python when I tried to build the <span style="font-family: "courier new" , "courier" , monospace;">value</span> values for the POST body dictionary by directly concatenating string variables with literals containing curly braces. Since Python uses curly braces to define string replacement fields, it got confused and threw an error in some of my lines of code. The simplest solution to that problem was to construct a dictionary for the data that needed to be turned into a string value, then pass that dictionary into the <span style="font-family: "courier new" , "courier" , monospace;">json.dumps()</span> function to turn the dictionary into a valid JSON string (rather than trying to build that string directly). The string resulting from <span style="font-family: "courier new" , "courier" , monospace;">json.dumps()</span> could then be assigned as the value of the appropriate parameter to be included in the JSON sent in the POST body. You can see how I used this approach in lines 128 through 148 of <a href="https://github.com/HeardLibrary/digital-scholarship/blob/master/code/wikibase/api/load_csv.py" target="_blank">this script</a>.<br />
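To make that concrete, here is a minimal sketch of the pattern. The item and property identifiers are hypothetical placeholders, not the ones from my script.<br />
<br />
<pre>
# A minimal sketch: build the "value" string for wbcreateclaim with json.dumps()
# rather than by concatenating literals that contain curly braces.
# Q3600 and P9876 are hypothetical placeholder identifiers.
import json

value_dict = {'entity-type': 'item', 'numeric-id': 42}  # the item to be linked as the value

parameters = {
    'action': 'wbcreateclaim',
    'entity': 'Q3600',                 # subject item of the new claim
    'property': 'P9876',               # property of the new claim
    'snaktype': 'value',
    'value': json.dumps(value_dict),   # a JSON string nested inside the request data
    'format': 'json'
    # a 'token' key with a CSRF token is also required; see the next section
}
</pre>
<br />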
<br />
I realize that what I've just described here is about as confusing as trying to watch the movie <i>Inception</i> for the first time, but I probably wasted at least half of the time it took me to get my bot script to work by being confused about what a snak was and how to construct the value of the <span style="font-family: "courier new" , "courier" , monospace;">value</span> parameter. So at least you will have a heads up about this confusing topic, and by looking at my example code you will hopefully be able to figure it out.<br />
<br />
<h2>
Putting it all together</h2>
So to summarize, here are the steps you need to take to write to any Wikibase installation using the MediaWiki API:<br />
<br />
<ol>
<li>Create a bot to get a username and password.</li>
<li>Determine the structure of the JSON body that needs to be passed to the API in the POST request for the desired action.</li>
<li>Use the bot credentials to log into an HTTP session with the API and get a CSRF token.</li>
<li>Execute the code necessary to insert the data you want to write into the appropriate JSON structure for the action.</li>
<li>Execute the code necessary to perform the POST request and pass the JSON to the API. </li>
<li>Track the API response to determine if errors occurred and handle any errors. </li>
<li>Repeat many times (otherwise why are you automating with a bot?). </li>
</ol>
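Here is a minimal sketch of steps 3 through 6 in Python using the <span style="font-family: "courier new" , "courier" , monospace;">requests</span> library. The endpoint is the test Wikidata API, and the bot credentials, item, and property identifiers are placeholders; a real bot would also need the duplicate checking, throttling, and error handling discussed below.<br />
<br />
<pre>
# A minimal sketch of steps 3 through 6: log in with a bot username and password,
# get a CSRF token, and POST one wbcreateclaim action. Credentials and identifiers
# are placeholders.
import json
import requests

api_url = 'https://test.wikidata.org/w/api.php'  # or your own Wikibase API endpoint
username = 'MyBot@MyBotPassword'                 # placeholder bot login name
password = 'botpasswordstring'                   # placeholder bot password

session = requests.Session()

# Step 3: log in and get a CSRF token.
login_token = session.get(api_url, params={
    'action': 'query', 'meta': 'tokens', 'type': 'login',
    'format': 'json'}).json()['query']['tokens']['logintoken']
session.post(api_url, data={
    'action': 'login', 'lgname': username, 'lgpassword': password,
    'lgtoken': login_token, 'format': 'json'})
csrf_token = session.get(api_url, params={
    'action': 'query', 'meta': 'tokens',
    'format': 'json'}).json()['query']['tokens']['csrftoken']

# Steps 4 and 5: build the data for the desired action and POST it to the API.
response = session.post(api_url, data={
    'action': 'wbcreateclaim',
    'entity': 'Q3600',        # placeholder subject item
    'property': 'P9876',      # placeholder property
    'snaktype': 'value',
    'value': json.dumps({'entity-type': 'item', 'numeric-id': 42}),
    'token': csrf_token,
    'format': 'json'})

# Step 6: check the response for errors. Step 7 would wrap steps 4-6 in a loop.
data = response.json()
if 'error' in data:
    print('API error:', data['error'])
else:
    print('Success:', data)
</pre>
<br />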
<br />
<a href="https://heardlibrary.github.io/digital-scholarship/host/wikidata/bot/" target="_blank">This tutorial</a> will walk you through the steps and provides code examples and screenshots to get you going.<br />
<br />
If you are writing to the "real" Wikidata instance of Wikibase, you need to take several additional steps:<br />
<br />
<ul>
<li>Create a separate bot account.</li>
<li>Define what the bot will do and describe those tasks in the bot's talk page.</li>
<li>Request approval for permission to operate the bot.</li>
<li>In programming the bot, figure out how you will check for existing records and avoid creating duplicate items or claims. </li>
<li>Perform 50 to 250 edits with the bot to show that it works. Make sure that you throttle the bot appropriately using the <span style="font-family: "courier new" , "courier" , monospace;">maxlag</span> parameter.</li>
<li>After you get approval, put the bot into production mode and monitor its performance carefully.</li>
</ul>
<br />
The <a href="https://www.wikidata.org/wiki/Wikidata:Bots" target="_blank">Wikidata:Bots</a> page gives many of the necessary administrative details of setting up a Wikidata bot.<br />
<br />
For writing to the "real" Wikidata, you might consider using the <a href="https://github.com/wikimedia/pywikibot" target="_blank">Pywikibot Python library</a> to build your bot. I've written a tutorial for that <a href="https://heardlibrary.github.io/digital-scholarship/host/wikidata/pywikibot/" target="_blank">here</a>. Pywikibot has built-in throttling, so that takes care of potential problems with hitting the API at an unacceptable rate. However, in tests that I carried out on our test instance of Wikibase hosted on AWS, writing directly to the API as I've described here was about 60 times faster than using Pywikibot. So if you are writing a lot of data to a fresh and empty Wikibase instance, you may find using Pywikibot's slow speed frustrating.<br />
<br />
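For comparison, the basic Pywikibot pattern for adding a single claim looks roughly like the sketch below. This assumes Pywikibot is installed and a user-config.py has been generated for the target site, and the item and property identifiers are placeholders - see the tutorial for working details.<br />
<br />
<pre>
# A rough sketch of adding one claim with Pywikibot (placeholders throughout).
import pywikibot

site = pywikibot.Site('test', 'wikidata')        # the test Wikidata instance
repo = site.data_repository()

item = pywikibot.ItemPage(repo, 'Q68')           # placeholder subject item
claim = pywikibot.Claim(repo, 'P31')             # placeholder property
claim.setTarget(pywikibot.ItemPage(repo, 'Q5'))  # placeholder value item
item.addClaim(claim, summary='Adding a test claim')  # Pywikibot handles throttling
</pre>
<br />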
<h2>
Acknowledgements</h2>
<a href="https://wikimediafoundation.org/profile/asaf-bartov/" target="_blank">Asaf Bartov</a> and <a href="https://en.wikipedia.org/wiki/Andrew_Lih" target="_blank">Andrew Lih</a>'s presentations and their answers to my questions at the <a href="https://wiki.duraspace.org/display/LD4P2/2019+LD4+Conference+on+Linked+Data+in+Libraries" target="_blank">2019 LD4P conference</a> were critical for helping me to finally figure out how to write effectively to Wikibase. Thanks!<br />
<br /></div>
</div>Steve Baskaufhttp://www.blogger.com/profile/01896499749604153763noreply@blogger.com1tag:blogger.com,1999:blog-5299754536670281996.post-55834155518588678692019-05-28T10:24:00.003-07:002022-06-11T11:25:46.708-07:00Getting Data Out of Wikidata using Software<p>For those of you who have been struggling through my series of posts on the TDWG Standards Documentation Specification, you may be happy to read a more "fun" post on everybody's current darling: Wikidata. This post has a lot of "try it yourself" activities, so if you have an installation of Python on your computer, you can try running the scripts yourself.</p><p>2022-06-11 note: a more recent followup post to this one describing Python code to reliably retrieve data from the Wikidata Query Service is here: "<a href="https://baskauf.blogspot.com/2022/06/making-sparql-queries-to-wikidata-using.html" target="_blank">Making SPARQL queries to Wikidata using Python</a>".<br />
<br />
</p><div class="separator" style="clear: both; text-align: center;">
<a href="https://www.wikidata.org/" target="_blank"><img border="0" data-original-height="141" data-original-width="200" src="https://upload.wikimedia.org/wikipedia/commons/thumb/6/66/Wikidata-logo-en.svg/200px-Wikidata-logo-en.svg.png" /></a></div>
<br />
<br />
Because <a href="https://www.library.vanderbilt.edu/scholarly/" target="_blank">our group at the Vanderbilt Libraries</a> is very interested in leveraging Wikidata and Wikibase for possible use in projects, I recently began participating in the <a href="https://wiki.duraspace.org/display/LD4P2/Wikidata+Affinity+Group" target="_blank">LD4 Wikidata Affinity Group</a>, an interest group of the <a href="https://wiki.duraspace.org/pages/viewpage.action?pageId=74515029" target="_blank">Linked Data for Production</a> project. I also attended the <a href="https://wiki.duraspace.org/display/LD4P2/2019+LD4+Conference+on+Linked+Data+in+Libraries" target="_blank">2019 LD4 Conference on Linked Data in Libraries</a> in Boston earlier this month. In both instances, most of the participants were librarians who were knowledgeable about Linked Data. So it's been a pleasure to participate in events where I don't have to explain what RDF is, or why one might be able to do cool things with Linked Open Data (LOD).<br />
<br />
However, I have been surprised to hear people complain a couple of times at those events that Wikidata doesn't have a good API that people can use to acquire data to use in applications such as those that generate web pages. When I mentioned that Wikidata's query service effectively serves as a powerful API, I got blank looks from most people present.<br />
<br />
I suspect that one reason why people don't think of the <a href="https://query.wikidata.org/" target="_blank">Wikidata Query Service</a> as an API is because it has such an awesome graphical interface that people can interact with. Who wouldn't get into using a simple dropdown to create cool visualizations that include maps, timelines, and of course, pictures of cats? But underneath it all, the query service is a SPARQL endpoint, and as I have pontificated in <a href="http://baskauf.blogspot.com/2019/04/understanding-tdwg-standards_7.html" target="_blank">a previous post</a>, a SPARQL endpoint is just a glorified, program-it-yourself API.<br />
<br />
In this post, I will demonstrate how you can use SPARQL to acquire both generic data and RDF triples from the Wikidata query service.<br />
<br />
<h2>
What is SPARQL for?</h2>
The recursive acronym SPARQL (pronounced like "sparkle") stands for "SPARQL Protocol and RDF Query Language". Most users of SPARQL know that it is a query language. It is less commonly known that SPARQL has <a href="https://www.w3.org/TR/sparql11-protocol/" target="_blank">a protocol</a> associated with it that allows client software to communicate with the endpoint server using <a href="https://tools.ietf.org/html/rfc2616" target="_blank">hypertext transfer protocol</a> (HTTP). That protocol establishes things like how queries should be sent to the server using GET or POST, how the client indicates the desired form of the response, and how the server indicates the media type of the response. These are all standard kinds of things that are required for a software client to interact with an API, and we'll see the necessary details when we get to the examples.<br />
<br />
The query language part of SPARQL determines the kinds of tasks we can accomplish with it. There are three main things we can do with SPARQL:<br />
<br />
<ul>
<li>get generic data from the underlying graph database (triplestore) using SELECT</li>
<li>get RDF triples based on data in the graph database using CONSTRUCT</li>
<li>load RDF data into the triplestore using UPDATE</li>
</ul>
<br />
In the Wikidata system, data enters the Blazegraph triplestore directly from a separate database, so the third of these methods (UPDATE) is not enabled. That leaves the SELECT and CONSTRUCT query forms and we will examine each of them separately.<br />
<br />
<h2>
Getting generic data using SPARQL SELECT</h2>
The SELECT query form is probably most familiar to users of the Wikidata Query Service. It's the form used when you do all of those cool visualizations using the dropdown examples. The example queries do some magical things using commands that are not part of standard SPARQL, such as view settings that are in comments and the AUTO_LANGUAGE feature. In my examples, I will use only standard SPARQL for dealing with languages and ignore the view settings since we aren't using a graphical interface anyway.<br />
<br />
We are going to develop an application that will allow us to discover what Wikidata knows about superheroes. The query that we are going to start off with is this one:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">PREFIX wd: <http://www.wikidata.org/entity/></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">PREFIX wdt: <http://www.wikidata.org/prop/direct/></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">SELECT DISTINCT ?name ?iri WHERE {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">?iri wdt:P106 wd:Q188784.</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">?iri wdt:P1080 wd:Q931597.</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">?iri rdfs:label ?name.</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">FILTER(lang(?name)="en")</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">ORDER BY ASC(?name)</span><br />
<br />
For reference purposes, in the query, <span style="font-family: "courier new" , "courier" , monospace;">wdt:P106 wd:Q188784</span> is "occupation superhero" and <span style="font-family: "courier new" , "courier" , monospace;">wdt:P1080 wd:Q931597</span> is "from fictional universe Marvel Universe". This is what restricts the results to Marvel superheroes. (You can leave this restriction out, but then the list gets unmanageably long.) The language filter restricts the labels to the English ones. The name of the superhero and its Wikidata identifier are what is returned by the query.<br />
<br />
If you want to try the query, you can go to the <a href="https://query.wikidata.org/" target="_blank">graphical query interface</a> (GUI), paste it into the box, and click the blue "run" button. I should note that Wikidata will allow you to get away with leaving off the PREFIX declarations, but that bugs me, so I'm going to include them since I think it's a good practice to be in the habit of including them.<br />
<br />
When you run the query, you will see the result in the form of a table:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg802D6K5awHrapMNyllqsFVOafFZ2Lyz3ghnD1Kd0a0Fu7K51AgoRCN0ekPfhwF_PDBGHW6lzfrv8SVPmEHboYNDWKTpLHUbco8BmpGjCF5lw6dmEapVlz1FbFNyyLjNVEXRAjiuK65lY/s1600/query-wikidata-org.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="798" data-original-width="1171" height="435" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg802D6K5awHrapMNyllqsFVOafFZ2Lyz3ghnD1Kd0a0Fu7K51AgoRCN0ekPfhwF_PDBGHW6lzfrv8SVPmEHboYNDWKTpLHUbco8BmpGjCF5lw6dmEapVlz1FbFNyyLjNVEXRAjiuK65lY/s640/query-wikidata-org.png" width="640" /></a></div>
<br />
The table shows all of the bindings to the <span style="font-family: "courier new" , "courier" , monospace;">?name</span> variable in one column and the bindings for the <span style="font-family: "courier new" , "courier" , monospace;">?iri</span> variable in a second column. However, when you make the query programmatically rather than through the GUI, there are a number of possible non-tabular forms (i.e. serializations) in which you can receive the results.<br />
<br />
To understand what is going on under the hood here, you need to issue the query to the SPARQL endpoint using client software that will allow you to specify all of the settings required by the SPARQL HTTP protocol. Most programmers at this point would tell you to use <a href="https://curl.haxx.se/" target="_blank">CURL</a> to issue the commands, but personally, I find CURL difficult to use and confusing for beginners. I always use <a href="https://www.getpostman.com/" target="_blank">Postman</a>, which is free and easy to use and understand.<br />
<br />
The SPARQL protocol describes <a href="https://www.w3.org/TR/sparql11-protocol/#query-operation" target="_blank">several ways to make a query</a>. We will talk about two of them here.<br />
<br />
<b>Query via GET</b> is pretty straightforward if you are used to interacting with APIs. The SPARQL query is sent to the endpoint (<span style="font-family: "courier new" , "courier" , monospace;">https://query.wikidata.org/sparql</span>) as a query string with a key of query and value that is the URL-encoded query. The details of how to do that using Postman are <a href="https://heardlibrary.github.io/digital-scholarship/lod/sparql/#acquiring-triples-from-an-endpoint-using-post" target="_blank">here</a>. The end result is that you are creating a long, ugly URL that looks like this:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">https://query.wikidata.org/sparql?query=%0APREFIX+rdfs%3A+%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0APREFIX+wd%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2F%3E%0APREFIX+wdt%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fdirect%2F%3E%0ASELECT+DISTINCT+%3Fname+%3Firi+WHERE+%7B%0A%3Firi+wdt%3AP106+wd%3AQ188784.%0A%3Firi+wdt%3AP1080+wd%3AQ931597.%0A%3Firi+rdfs%3Alabel+%3Fname.%0AFILTER%28lang%28%3Fname%29%3D%22en%22%29%0A%7D%0AORDER+BY+ASC%28%3Fname%29%0A</span><br />
<br />
If you are OK with getting your results in the default XML serialization, you just need to request that URL and the file that comes back will have your results. You can even do that by just pasting the ugly URL into the URL box of a web browser if you don't want to bother with Postman.<br />
<br />
However, since we are planning to use the results in a program, it is much easier to use the results if they are in JSON. Getting the results in JSON requires sending an <span style="font-family: "courier new" , "courier" , monospace;">Accept</span> request header of <span style="font-family: "courier new" , "courier" , monospace;">application/sparql-results+json</span> along with the GET request. You can't do that in a web browser, but in Postman you can set request headers by filling in the appropriate boxes on the header tab as shown <a href="https://heardlibrary.github.io/digital-scholarship/lod/sparql/#retrieving-sparql-query-data-using-http" target="_blank">here</a>. SPARQL endpoints may also accept the more generic JSON <span style="font-family: "courier new" , "courier" , monospace;">Accept</span> request header <span style="font-family: "courier new" , "courier" , monospace;">application/json</span>, but the previous header is the preferred one for SPARQL requests.<br />
<br />
<b>Query via POST </b>is in some ways simpler than using GET. The SPARQL query is sent to the endpoint using only the base URL without any query string. The query itself is sent to the endpoint in unencoded form as the message body. A <a href="https://www.w3.org/TR/2013/REC-sparql11-protocol-20130321/#query-via-post-direct" target="_blank">critical requirement</a> is that the request must be sent with a <span style="font-family: "courier new" , "courier" , monospace;">Content-Type</span> header of <span style="font-family: "courier new" , "courier" , monospace;">application/sparql-query</span>. If you want the response to be in JSON, you must also include an <span style="font-family: "courier new" , "courier" , monospace;">Accept</span> request header of <span style="font-family: "courier new" , "courier" , monospace;">application/sparql-results+json</span> as was the case with query via GET.<br />
<br />
There is no particular advantage of using POST instead of GET, except in cases where using GET would result in a URL that exceeds the allowed length for the endpoint server. I'm not sure what that limit is for Wikidata's server, but typically the maximum is between 5000 and 15000 characters. So if the query you are sending ends up being very long, it is safer to send it using POST.<br />
<br />
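As a preview of the Python example later in this post, here is a minimal sketch of both approaches using the <span style="font-family: "courier new" , "courier" , monospace;">requests</span> library; the query is the Marvel superhero query from above.<br />
<br />
<pre>
# A minimal sketch of query via GET and query via POST using the requests library.
import requests

endpoint = 'https://query.wikidata.org/sparql'
query = '''PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT DISTINCT ?name ?iri WHERE {
?iri wdt:P106 wd:Q188784.
?iri wdt:P1080 wd:Q931597.
?iri rdfs:label ?name.
FILTER(lang(?name)="en")
}
ORDER BY ASC(?name)'''

# Query via GET: requests URL-encodes the query and appends it as the "query" key.
response = requests.get(endpoint, params={'query': query},
                        headers={'Accept': 'application/sparql-results+json'})

# Query via POST: the unencoded query is sent as the message body along with the
# required Content-Type header. (Uncomment to use instead of the GET above.)
# response = requests.post(endpoint, data=query,
#                          headers={'Content-Type': 'application/sparql-query',
#                                   'Accept': 'application/sparql-results+json'})

data = response.json()
</pre>
<br />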
<b>Response.</b> The JSON response that we get back looks like this:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">{</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "head": {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "vars": [</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "name",</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "iri"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> ]</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> },</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "results": {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "bindings": [</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "name": {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "xml:lang": "en",</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "type": "literal",</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "value": "Amanda Sefton"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> },</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "iri": {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "type": "uri",</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "value": "http://www.wikidata.org/entity/Q3613591"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> }</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> },</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "name": {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "xml:lang": "en",</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "type": "literal",</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "value": "Andreas von Strucker"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> },</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "iri": {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "type": "uri",</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "value": "http://www.wikidata.org/entity/Q4755702"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> }</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> },</span><br />
<div>
...</div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> {</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> "name": {</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> "xml:lang": "en",</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> "type": "literal",</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> "value": "Zeitgeist"</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> },</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> "iri": {</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> "type": "uri",</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> "value": "http://www.wikidata.org/entity/Q8068621"</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> }</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> }</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> ]</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> }</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">}</span></div>
</div>
<div>
<br /></div>
The part of the results that we care about is the value of the <span style="font-family: "courier new" , "courier" , monospace;">bindings </span>key. It's an array of objects that include a key for each of the variables that we used in the query (e.g. <span style="font-family: "courier new" , "courier" , monospace;">name </span>and <span style="font-family: "courier new" , "courier" , monospace;">iri</span>). The value for each variable key is another object representing a bound result. The bound result object contains key:value pairs that tell you more than we could learn from the Query Service GUI table, notably the language tag for strings and whether the result was a literal or a URI. But basically, the information for each object in the array corresponds to the information that was in a row in the GUI table. In our program, we can step through each object in the array and pull out the bound results that we want.<br />
<br />
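For example, if the JSON response shown above has been parsed into a Python dictionary called <span style="font-family: "courier new" , "courier" , monospace;">data</span> (as in the sketch in the previous section), stepping through the bindings takes just a couple of lines:<br />
<br />
<pre>
# Step through the bindings array and print the IRI and name bound in each result.
for result in data['results']['bindings']:
    print(result['iri']['value'], ':', result['name']['value'])
</pre>
<br />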
The reason for going into these gory details is to point out that the generic HTTP operations that we just carried out can be done for any programming language that has libraries to perform HTTP calls. We will see how this is done in practice for two languages.<br />
<br />
<h3>
Using Python to get generic data using SPARQL SELECT</h3>
A Python 3 script that performs the query above can be downloaded from <a href="https://github.com/HeardLibrary/digital-scholarship/blob/master/code/wikidata/requests_wikidata_json.py" target="_blank">this page</a>. The query itself is assigned to a variable as a multi-line string in lines 10-19. In line 3, the script allows the user to choose a language for the query and the code for that language is inserted as the variable <span style="font-family: "courier new" , "courier" , monospace;">isoLanguage </span>in line 17.<br />
<br />
The script uses the popular and easy-to-use <span style="font-family: "courier new" , "courier" , monospace;">requests </span>library to make the HTTP call. It's not part of the standard library, so if you haven't used it before, you'll need to install it using PIP before you run the script. The actual HTTP GET call is made in line 26. The <span style="font-family: "courier new" , "courier" , monospace;">requests </span>module is really smart and will automatically URL-encode the query when it's sent into the <span style="font-family: "courier new" , "courier" , monospace;">.get()</span> method as a value of <span style="font-family: "courier new" , "courier" , monospace;">params</span>. So you don't have to worry about that yourself.<br />
<br />
If you uncomment line 27 and comment line 26, you can make the request using the <span style="font-family: "courier new" , "courier" , monospace;">.post()</span> method instead of GET. For a query of this size, there is no particular advantage of one method over the other. The syntax varies slightly (specifying the query as a data payload rather than a query parameter) and the POST request includes the required <span style="font-family: "courier new" , "courier" , monospace;">Content-Type</span> header in addition to the <span style="font-family: "courier new" , "courier" , monospace;">Accept</span> header to receive JSON.<br />
<br />
There are print statements in lines 21 and 29 so that you can see what the query looks like after the insertion of the language code, and after it's been URL-encoded and appended to the base endpoint URL. You can delete them later if they annoy you. If you uncomment line 31, you can see the raw JSON results as they have been received from the query service. They should look like what was shown above.<br />
<br />
Line 33 converts the received JSON string into a Python data structure, and also pulls out the array value of the <span style="font-family: "courier new" , "courier" , monospace;">bindings </span>key (now a Python "list" data structure). Lines 34 to 37 step through each result in the list, extract the values bound to the <span style="font-family: "courier new" , "courier" , monospace;">?name</span> and <span style="font-family: "courier new" , "courier" , monospace;">?iri</span> variables, then print them on the screen. The result looks like this for English:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">http://www.wikidata.org/entity/Q3613591 : Amanda Sefton</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">http://www.wikidata.org/entity/Q4755702 : Andreas von Strucker</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">http://www.wikidata.org/entity/Q14475812 : Anne Weying</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">http://www.wikidata.org/entity/Q2604744 : Anya Corazon</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">http://www.wikidata.org/entity/Q2299363 : Armor</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">http://www.wikidata.org/entity/Q2663986 : Aurora</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">http://www.wikidata.org/entity/Q647105 : Banshee</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">http://www.wikidata.org/entity/Q302186 : Beast</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">http://www.wikidata.org/entity/Q2893582 : Bedlam</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">http://www.wikidata.org/entity/Q2343504 : Ben Reilly</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">http://www.wikidata.org/entity/Q616633 : Betty Ross</span><br />
...<br />
<br />
If we run the program and enter the language code for Russian (<span style="font-family: "courier new" , "courier" , monospace;">ru</span>), the results look like this:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">http://www.wikidata.org/entity/Q49262738 : Ultimate Ангел</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">http://www.wikidata.org/entity/Q48891562 : Ultimate Джин Грей</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">http://www.wikidata.org/entity/Q39052195 : Ultimate Женщина-паук</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">http://www.wikidata.org/entity/Q4003146 : Ultimate Зверь</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">http://www.wikidata.org/entity/Q48958279 : Ultimate Китти Прайд</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">http://www.wikidata.org/entity/Q4003156 : Ultimate Колосс</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">http://www.wikidata.org/entity/Q16619139 : Ultimate Рик Джонс</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">http://www.wikidata.org/entity/Q7880273 : Ultimate Росомаха</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">http://www.wikidata.org/entity/Q48946153 : Ultimate Роуг</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">http://www.wikidata.org/entity/Q4003147 : Ultimate Циклоп</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">http://www.wikidata.org/entity/Q48947511 : Ultimate Человек-лёд</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">http://www.wikidata.org/entity/Q4003183 : Ultimate Шторм</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">http://www.wikidata.org/entity/Q2663986 : Аврора</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">http://www.wikidata.org/entity/Q3613591 : Аманда Сефтон</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">http://www.wikidata.org/entity/Q4755702 : Андреас фон Штрукер</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">http://www.wikidata.org/entity/Q2604744 : Аня Коразон</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">http://www.wikidata.org/entity/Q770064 : Архангел</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">http://www.wikidata.org/entity/Q28006858 : Баки Барнс</span><br />
...<br />
<br />
So did our script just use an API? I would argue that it did. But it's programmable: if we wanted it to retrieve superheroes from the DC universe, all we would need to do is to replace Q931597 with Q1152150 in line 15 of the script.<br />
<br />
Now that we have the labels and ID numbers for the superheroes, we could let the user pick one and we could carry out a second query to find out more. I'll demonstrate that in the next example.<br />
<br />
<h3>
Using Javascript/JQuery to get generic data using SPARQL SELECT</h3>
<div>
Because the protocol to acquire the data is generic, we can go through the same steps in any programming language. <a href="https://github.com/HeardLibrary/digital-scholarship/blob/master/code/wikidata/item-properties.js" target="_blank">Here is an example</a> using Javascript with some JQuery functions. The <a href="https://github.com/HeardLibrary/digital-scholarship/blob/master/code/wikidata/item-properties.html" target="_blank">accompanying web page</a> sets up two dropdown lists, with the second one being populated by the Javascript using the superhero names retrieved using SPARQL. You can try the page directly <a href="https://s3.us-east-2.amazonaws.com/sparql-upload/item-properties.html" target="_blank">from this web page</a>. To have the page start off using a language other than English, append a question mark, followed by the language code, like <a href="https://s3.us-east-2.amazonaws.com/sparql-upload/item-properties.html?de" target="_blank">this</a>. If you want to try hacking the Javascript yourself, you can download both documents into the same local directory, then double click on the HTML file to open it in a browser. You can then edit the Javascript and reload the page to see the effect.</div>
<div>
<br /></div>
<div>
There are basically two things that happen in this page. </div>
<div>
<br /></div>
<div>
<b>Use SPARQL to find superhero names in the selected language.</b> The initial page load (line 157) and making a selection on the Language dropdown (lines 106-114) fire the <span style="font-family: "courier new" , "courier" , monospace;">setStatusOptions()</span> function (lines 58-98). That function inserts the selected language into the SPARQL query string (lines 69-78), URL-encodes the query (line 79), then performs the GET to the constructed URL (lines 82-87). The script then steps through each result in the bindings array (line 89) and pulls out the bound <span style="font-family: "courier new" , "courier" , monospace;">name </span>and <span style="font-family: "courier new" , "courier" , monospace;">iri </span>value for the result (lines 90-91). Up to this point, the script is doing exactly the same things as the Python script. In line 93, the name is assigned to the dropdown option label and the IRI is assigned to the dropdown option value. The page then waits for something else to happen.</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEix2wiMbGUuPIaeWSVom-s63CuGuCOXwH9wTZt92k13FYSEfHt0okyfx7xJo7h2jS67M-tjkH3v1JhuxHfxIKbYKmQfbqXfD7JLdhgieqLFSqTYtcKjyDIGzmnzrZJwqFwP9Ww5LjxxZ40/s1600/javascript-query-results.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="535" data-original-width="1182" height="288" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEix2wiMbGUuPIaeWSVom-s63CuGuCOXwH9wTZt92k13FYSEfHt0okyfx7xJo7h2jS67M-tjkH3v1JhuxHfxIKbYKmQfbqXfD7JLdhgieqLFSqTYtcKjyDIGzmnzrZJwqFwP9Ww5LjxxZ40/s640/javascript-query-results.png" width="640" /></a></div>
<div>
<br /></div>
<div>
In this screenshot, I've turned on Chrome's developer tools so I can watch what the page is doing as it runs. This is what the screen looks like after the page loads. I've clicked on the <span style="font-family: "courier new" , "courier" , monospace;">Network </span>tab and selected the <span style="font-family: "courier new" , "courier" , monospace;">sparql?query=PREFIX%20rdfs...</span> item. I can see the result of the query that was executed by the <span style="font-family: "courier new" , "courier" , monospace;">setStatusOptions()</span> function.</div>
<div>
<br /></div>
<div>
<b>Use SPARQL to find properties and values associated with a selected superhero.</b> When a selection is made in the Character dropdown, the <span style="font-family: "courier new" , "courier" , monospace;">$("#box1").change(function()</span> (lines 117-154) is executed. It operates in a manner similar to the <span style="font-family: "courier new" , "courier" , monospace;">setStatusOptions()</span> function, except that it uses a different query that finds properties associated with the superhero and the values of those properties. Lines 121 through 133 insert the IRI from the selected dropdown into line 125 and the code of the selected language in lines 130 and 131, resulting in a query like this (for Black Panther, <span style="font-family: "courier new" , "courier" , monospace;">Q998220</span>):</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">PREFIX wd: <http://www.wikidata.org/entity/></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">PREFIX wdt: <http://www.wikidata.org/prop/direct/></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">SELECT DISTINCT ?property ?value WHERE {</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><http://www.wikidata.org/entity/Q998220> ?propertyUri ?valueUri.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">?valueUri rdfs:label ?value.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">?genProp <http://wikiba.se/ontology#directClaim> ?propertyUri.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">?genProp rdfs:label ?property.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">FILTER(substr(str(?propertyUri),1,36)="http://www.wikidata.org/prop/direct/")</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">FILTER(LANG(?property) = "en")</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">FILTER(LANG(?value) = "en")</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">}</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">ORDER BY ASC(?property)</span></div>
</div>
<div>
<br /></div>
<div>
The triple patterns in lines 127 and 128 and the filter in line 129 give access to the label of a property used in a statement about the superhero. The model for property labels in Wikidata is a bit complex - see <a href="https://heardlibrary.github.io/digital-scholarship/lod/wikibase/#references" target="_blank">this page</a> for more details. This query only finds statements whose values are items (not those with string values) because the triple pattern in line 126 requires the value to have a label (and strings don't have labels). A more complex graph pattern than this one would be necessary to get values of statements with both literal (string) and non-literal (item) values.</div>
<div>
<br /></div>
<div>
Lines 144-149 differ from the earlier function in that they build a text string (called <span style="font-family: "courier new" , "courier" , monospace;">text</span>) containing the property and value strings from the returned query results. The completed string is inserted as HTML into the <span style="font-family: "courier new" , "courier" , monospace;">div1 </span>element of the web page (line 149). </div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3NBiZ8GsFR1Ba0GSouMY3Njh2GYzDEPh-mx1P2rSxevHm4mAVcAdbKQEMneBW_LtAmWDsjRWY5VXOD-J4cWqgeIEt9o8FGRc6G_pZcV2YfhSooE-Pxbg1G4NY_qA8QYjK0p50-aGlBWY/s1600/javascript-query-results2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="736" data-original-width="1173" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3NBiZ8GsFR1Ba0GSouMY3Njh2GYzDEPh-mx1P2rSxevHm4mAVcAdbKQEMneBW_LtAmWDsjRWY5VXOD-J4cWqgeIEt9o8FGRc6G_pZcV2YfhSooE-Pxbg1G4NY_qA8QYjK0p50-aGlBWY/s640/javascript-query-results2.png" width="640" /></a></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
The screenshot above shows what happens in Developer Tools when the Character dropdown is used to select "Black Panther". You can see on the Network tab on the right that another network action has occurred - the second SPARQL query. Clicking on the response tab shows the response JSON, which is very similar in form to the previous query results. On the left side of the screen, you can see where the statements about the Black Panther have been inserted into the web page.</div>
<div>
<br /></div>
<div>
It is worth noting that the results that are acquired vary a lot depending on the language that is chosen. The first query that builds the character dropdown requires that the superhero have a label in the chosen language. If there isn't a label in that language for that character, then the superhero isn't listed. That's why the list of English superhero names is so long and the simplified Chinese list only has a few options. Similarly, properties for a character are only listed if the properties have labels in that language and if the values of those statements also have labels in that language. So we miss a lot of superheroes and properties that exist if no one has bothered to create labels for them in a given language. </div>
<div>
<br /></div>
<div>
This page is also very generic. Except for the page titles and headers in different languages, which are hard-coded, minor changes to the triple patterns in lines 73 and 74 would make it possible to retrieve information about almost any kind of thing described in Wikidata.</div>
<div>
<br /></div>
<div>
<h2>
Getting RDF triples using SPARQL CONSTRUCT</h2>
</div>
<div>
In SPARQL SELECT, we specify any number of variables that we want the endpoint to send information about. The values that are bound to those variables can be any kind of string (datatyped or language tagged) or IRI. In contrast, SPARQL CONSTRUCT always returns the same kind of information: RDF triples. So the response to CONSTRUCT is always an RDF graph.</div>
<div>
<br /></div>
<div>
As with the SELECT query, you can issue a CONSTRUCT query at the Wikidata Query Service GUI by pasting it into the box. You can try it with this query:</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">PREFIX wd: <http://www.wikidata.org/entity/></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">CONSTRUCT {</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> wd:Q42 ?p1 ?o.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> ?s ?p2 wd:Q42.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">}</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">WHERE {</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> {wd:Q42 ?p1 ?o.}</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> UNION</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> {?s ?p2 wd:Q42.}</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">}</span></div>
</div>
<div>
<br /></div>
<div>
The WHERE clause of the query requires that triples match one of two patterns: triples where the subject is item Q42, and triples where the object is item Q42. The graph to be constructed and returned consists of all of the triples that conform to one of those two patterns. In other words, the graph that is returned to the client is all triples in Wikidata that are directly related to Douglas Adams (Q42). </div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEguuLRinm9ngZB0SR7lZzRYRaTmTJv9tfWmo9SrGugoYACV2EmlAPlIrpfNnF6yBi1BgRcbpGXrUtQaTK36dr8bpdu6IaKgpgtmyc2rHOnZHwq_Cryj91E6IqAmbNl2XuSJBoItldmZF2Q/s1600/query-wikidata-org2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="918" data-original-width="1251" height="468" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEguuLRinm9ngZB0SR7lZzRYRaTmTJv9tfWmo9SrGugoYACV2EmlAPlIrpfNnF6yBi1BgRcbpGXrUtQaTK36dr8bpdu6IaKgpgtmyc2rHOnZHwq_Cryj91E6IqAmbNl2XuSJBoItldmZF2Q/s640/query-wikidata-org2.png" width="640" /></a></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
When we compare the results to what we got when we pasted the SELECT query into the box, we see that there is also a table at the bottom. However, in a CONSTRUCT query there will always be three columns for the three parts of the triple (subject, predicate, and object), plus a column for the "context", which we won't worry about here. The triples that are shown here mostly look the same, but that's only because the table in the GUI doesn't show us the language tags of the labels, and the label strings are the same in most languages that use Latin characters. </div>
<div>
<br /></div>
<div>
If we use Postman to make the query, we have the option to specify the serialization that we want for the response graph. Blazegraph (the system that runs Wikidata's SPARQL endpoint) will support any of the common RDF serializations, so we just need to send an <span style="font-family: "courier new" , "courier" , monospace;">Accept</span> header with the appropriate media type (<span style="font-family: "courier new" , "courier" , monospace;">text/turtle</span> for Turtle, <span style="font-family: "courier new" , "courier" , monospace;">application/rdf+xml</span> for XML, or <span style="font-family: "courier new" , "courier" , monospace;">application/ld+json</span> for JSON-LD). Otherwise, the query is sent via GET or POST just as in the SELECT example.</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj-9SEIWQLtCTzYhZPjczvUK8miEeLQl2VuBHnK8tMOfc9qeln4kdSZcYoAMnscQ4eLhoAYCV7q064i9tmX3QQUWWLUyzt-3ignTj_wz3zOTEThIoFTQu_XlIYL4K8Ii4fctfTsM1-mErg/s1600/postman-query-results.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1155" data-original-width="915" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj-9SEIWQLtCTzYhZPjczvUK8miEeLQl2VuBHnK8tMOfc9qeln4kdSZcYoAMnscQ4eLhoAYCV7q064i9tmX3QQUWWLUyzt-3ignTj_wz3zOTEThIoFTQu_XlIYL4K8Ii4fctfTsM1-mErg/s1600/postman-query-results.png" /></a></div>
<div>
<br /></div>
<div>
These results (in Turtle serialization) show us the language tags of all of the labels, and we can see that the string parts of many of them are the same when the language uses Latin characters. </div>
<div>
<br /></div>
<div>
We can also run this query using <a href="https://github.com/HeardLibrary/digital-scholarship/blob/master/code/wikidata/requests_wikidata_triples.py" target="_blank">this Python script</a>. As with the previous Python script, we assign the query to a variable in lines 9 to 18. This time, we substitute the value of the <span style="font-family: "courier new" , "courier" , monospace;">item </span>variable, which is set to be <span style="font-family: "courier new" , "courier" , monospace;">Q42</span>, but could be changed to retrieve triples about any other item. After we perform the GET request, the result is written to a text file, <span style="font-family: "courier new" , "courier" , monospace;">requestsOutput.ttl</span>. We could then load that file into our own triplestore if we wanted.</div>
<div>
<br /></div>
<h3>
Using <span style="font-family: "courier new" , "courier" , monospace;">rdflib</span> in Python to manipulate RDF data from SPARQL CONSTRUCT</h3>
<div>
Since the result of our CONSTRUCT query is an RDF graph, there aren't the same kind of direct uses for the data in generic Python (or Javascript, for that matter) as the JSON results of the SELECT query. However, Python has an awesome library for working with RDF data called <span style="font-family: "courier new" , "courier" , monospace;">rdflib</span>. Let's take a look at <a href="https://github.com/HeardLibrary/digital-scholarship/blob/master/code/wikidata/rdflib_wikidata.py" target="_blank">another Python script</a> that uses <span style="font-family: "courier new" , "courier" , monospace;">rdflib </span>to mess around with RDF graphs acquired from the Wikidata SPARQL endpoint. (Don't forget to use PIP to install <span style="font-family: "courier new" , "courier" , monospace;">rdflib</span> if you haven't used it before.)</div>
<div>
<br /></div>
<div>
Lines 11 through 24 do the same thing as the previous script: create variables containing the base endpoint URL and the query string. </div>
<div>
<br /></div>
<div>
The <span style="font-family: "courier new" , "courier" , monospace;">rdflib </span>package allows you to create an instance of a graph (line 8) and has a <span style="font-family: "courier new" , "courier" , monospace;">.parse() </span>method that will retrieve a file containing serialized RDF from a URL, parse it, and load it into the graph instance (line 29). In typical use, the URL of a specific file is passed into the method, but since all of the information necessary to initiate a SPARQL CONSTRUCT query via GET is encoded in a URL, and since the result of the query is a file containing serialized RDF, we can just pass the complex endpoint URL with the encoded query into the method and the graph returned from the query will go directly into the <span style="font-family: "courier new" , "courier" , monospace;">itemGraph </span>graph instance. </div>
<div>
<br /></div>
<div>
There are two issues with using the method in this way. One is that unlike the <span style="font-family: "courier new" , "courier" , monospace;">requests .get()</span> method, the <span style="font-family: "courier new" , "courier" , monospace;">rdflib .parse()</span> method does not allow you to include request headers in your GET call. Fortunately, if no <span style="font-family: "courier new" , "courier" , monospace;">Accept </span>header is sent to the Wikidata SPARQL endpoint, it defaults to returning RDF/XML, and the <span style="font-family: "courier new" , "courier" , monospace;">.parse()</span> method is fine with that. The other issue is that unlike the <span style="font-family: "courier new" , "courier" , monospace;">requests .get()</span> method, the <span style="font-family: "courier new" , "courier" , monospace;">rdflib .parse()</span> method does not automatically URL-encode the query string and associate it with the <span style="font-family: "courier new" , "courier" , monospace;">query </span>key. That is why line 25 builds the URL manually and uses the <span style="font-family: "courier new" , "courier" , monospace;">urllib.parse.quote()</span> function to URL-encode the query string before appending it to the rest of the URL. </div>
<div>
<br /></div>
<div>
Upon completion of line 29, we now have the triples constructed by our query loaded into the <span style="font-family: "courier new" , "courier" , monospace;">itemGraph </span>graph instance. What can we do with them? The <a href="https://rdflib.readthedocs.io/en/stable/" target="_blank">rdflib documentation</a> provides some ideas. If I am understanding it correctly, graph is an iterable object consisting of tuples, each of which represents a triple. So in line 33, we can simply use the <span style="font-family: "courier new" , "courier" , monospace;">len()</span> function to determine how many triples were loaded into the graph instance. In lines 35 and 37, I used the <span style="font-family: "courier new" , "courier" , monospace;">.preferredLabel()</span> method to search through the graph to find the labels for Q42 in two languages. </div>
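<div>
<br /></div>
<div>
Putting those pieces together, a condensed sketch of the approach looks something like this. Again, this is not a verbatim copy of the linked script: the UNION query and the variable names are just for illustration, and the .preferredLabel() method shown here comes from the rdflib versions that were current when this was written.</div>
<div>
<br /></div>
<pre>
from rdflib import Graph, URIRef
import urllib.parse

item = 'Q42'
endpointUrl = 'https://query.wikidata.org/sparql'

query = '''PREFIX wd: <http://www.wikidata.org/entity/>
CONSTRUCT {
  wd:''' + item + ''' ?p1 ?o.
  ?s ?p2 wd:''' + item + '''.
}
WHERE {
  {wd:''' + item + ''' ?p1 ?o.}
  UNION
  {?s ?p2 wd:''' + item + '''.}
}'''

# .parse() can't send an Accept header, so build the GET URL by hand;
# the endpoint defaults to RDF/XML, which .parse() handles fine
itemGraph = Graph()
itemGraph.parse(endpointUrl + '?query=' + urllib.parse.quote(query))

# the graph is iterable, so len() tells us how many triples we got
print(len(itemGraph), 'triples retrieved')

# look up the item's labels in two languages
subject = URIRef('http://www.wikidata.org/entity/' + item)
print(itemGraph.preferredLabel(subject, lang='en'))
print(itemGraph.preferredLabel(subject, lang='de'))
</pre>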
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">rdflib </span>has a number of other powerful features that are worth exploring. One is its embedded SPARQL feature, which perhaps isn't that useful here since we just got the graph using a SPARQL query. Nevertheless, it's a cool function. The other capability that could be very powerful is <span style="font-family: "courier new" , "courier" , monospace;">rdflib</span>'s nearly effortless ability to <a href="https://rdflib.readthedocs.io/en/stable/merging.html" target="_blank">merge RDF graphs</a>. In the example script, the value of the <span style="font-family: "courier new" , "courier" , monospace;">item </span>variable is hard-coded in line 5. However, the value of <span style="font-family: "courier new" , "courier" , monospace;">item </span>could be determined by a <span style="font-family: "courier new" , "courier" , monospace;">for </span>loop and the triples associated with many items could be accumulated into a single graph before saving the merged graph as a file (line 41) to be used elsewhere (e.g. loaded into a triplestore). You can imagine how a SPARQL SELECT query could be made to generate a list of items (as was done in the "Using Python to get generic data using SPARQL SELECT" section of this post), then that list could be passed into the code discussed here to create a graph containing all of the information about every item meeting some criteria set out in the SELECT query. That's pretty powerful stuff!</div>
<div>
<br /></div>
<h2>
Alternate methods to get data from Wikidata</h2>
<div>
Although I've made the case here that SPARQL SELECT and CONSTRUCT queries are probably the best way to get data from Wikidata, there are other options. I'll describe three.</div>
<div>
<br /></div>
<h3>
MediaWiki API</h3>
<div>
Since Wikidata is built on the MediaWiki system, the MediaWiki API is another mechanism to acquire generic data (not RDF triples) about items in Wikidata. I have written <a href="https://github.com/HeardLibrary/digital-scholarship/blob/master/code/wikibase/api/read-statements.py" target="_blank">a Python script</a> that uses <a href="https://test.wikidata.org/w/api.php?action=help&modules=wbgetclaims" target="_blank">the <span style="font-family: "courier new" , "courier" , monospace;">wbgetclaims </span>action</a> to get data about the claims (i.e. statements) made about a Wikidata item. I won't go into detail about the script, since it just uses the <span style="font-family: "courier new" , "courier" , monospace;">requests </span>module's <span style="font-family: "courier new" , "courier" , monospace;">.get()</span> method to get and parse JSON as was done in the first Python example of this post. The main tricky thing about this method is that you need to understand "snaks", an <a href="https://www.mediawiki.org/wiki/Wikibase/DataModel#Snaks" target="_blank">idiosyncratic feature of the Wikibase data model</a>. The structure of the JSON for the value of a claim varies depending on the type of the snak - thus the series of <span style="font-family: "courier new" , "courier" , monospace;">try...except...</span> statements in lines 20 through 29. </div>
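<div>
<br /></div>
<div>
To give a rough idea of what that looks like, here is a minimal sketch. It is not the linked script, and the fallback logic below is only illustrative of how the varying snak value structures might be handled.</div>
<div>
<br /></div>
<pre>
import requests

# wbgetclaims retrieves all of the claims (statements) about an item
r = requests.get('https://www.wikidata.org/w/api.php',
                 params={'action': 'wbgetclaims', 'entity': 'Q42', 'format': 'json'})
claims = r.json()['claims']

for propertyId, statements in claims.items():
    for statement in statements:
        snak = statement['mainsnak']
        # the structure of the value depends on the datatype of the snak,
        # so fall back through the possibilities
        try:
            value = snak['datavalue']['value']['id']        # item values
        except (KeyError, TypeError):
            try:
                value = snak['datavalue']['value']['time']  # date values
            except (KeyError, TypeError):
                try:
                    value = snak['datavalue']['value']      # strings and other structures
                except KeyError:
                    value = '(no value)'                    # somevalue/novalue snaks
        print(propertyId, value)
</pre>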
<div>
<br /></div>
<div>
If you intend to use the MediaWiki API, you will need to put in a significant amount of time studying the <a href="https://www.mediawiki.org/wiki/API:Main_page" target="_blank">API documentation</a>. A list of possible actions is on <a href="https://test.wikidata.org/w/api.php?action=help&modules=main" target="_blank">this page</a> - actions whose names begin with "wb" are relevant to Wikidata. I will be talking a lot more about using the MediaWiki API in the next blog post, so stay tuned.</div>
<div>
<br /></div>
<h3>
Dereferencing Wikidata item IRIs</h3>
<div>
Wikidata plays nicely in the Linked Data world in that it supports content negotiation for dereferencing of its IRIs. That means that you can just do an HTTP GET for any item IRI with an <span style="font-family: "courier new" , "courier" , monospace;">Accept </span>request header of one of the RDF media types, and you'll get a very complete description of the item in RDF. </div>
<div>
<br /></div>
<div>
For example, if I use Postman to dereference the IRI <span style="font-family: "courier new" , "courier" , monospace;">http://www.wikidata.org/entity/Q42</span> with an <span style="font-family: "courier new" , "courier" , monospace;">Accept </span>header of <span style="font-family: "courier new" , "courier" , monospace;">text/turtle</span> and allow Postman to automatically follow redirects, I eventually get redirected to the URL <span style="font-family: "courier new" , "courier" , monospace;">https://www.wikidata.org/wiki/Special:EntityData/Q42.ttl</span> . The result is a pretty massive file that contains 66499 triples (as of 2019-05-28). In contrast, the SPARQL CONSTRUCT query to find all of the statements where Q42 was either the subject or object of the triple returned 884 triples. Why are there 75 times as many triples when we dereference the URI? If we scroll through the 66499 triples, we can see that not only do we have all of the triples that contain Q42, but also all of the triples about every part of every triple that contains Q42 (a complete description of the properties and a complete description of the values of statements about Q42). So this is a possible method to acquire information about an item in the form of RDF triples, but you get way more than you may be interested in knowing. </div>
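<div>
<br /></div>
<div>
Outside of Postman, the same thing can be done in a couple of lines with the requests module, which follows the redirects automatically. A minimal sketch:</div>
<div>
<br /></div>
<pre>
import requests

# dereference the item IRI with an Accept header for Turtle;
# requests follows the redirects automatically
r = requests.get('http://www.wikidata.org/entity/Q42',
                 headers={'Accept': 'text/turtle'})

print(r.url)          # the Special:EntityData URL that we were redirected to
print(r.text[:1000])  # the beginning of a very large Turtle document
</pre>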
<div>
<br /></div>
<h3>
Using SPARQL DESCRIBE</h3>
<div>
One of the SPARQL query forms that I didn't mention earlier is DESCRIBE. The <a href="https://www.w3.org/TR/2013/REC-sparql11-query-20130321/#describe" target="_blank">SPARQL 1.1 Query Language specification</a> is a bit vague about what is supposed to happen in a DESCRIBE query. It says "The DESCRIBE form returns a single result RDF graph containing RDF data about resources. This data is not prescribed by a SPARQL query, where the query client would need to know the structure of the RDF in the data source, but, instead, is determined by the SPARQL query processor." In other words, it's up to the particular SPARQL query processor implementation to decide what information to send the client about the resource. It may opt to send triples that are indirectly related to the described resource, particularly if the connection is made by blank nodes (a situation that would make it more difficult for the client to "follow its nose" to find the other triples). So basically, the way to find out what a SPARQL endpoint will send as a response to a DESCRIBE query is to do one and see what you get. </div>
<div>
<br /></div>
<div>
When I issue the query</div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">DESCRIBE <http://www.wikidata.org/entity/Q42></span></div>
<div>
<br /></div>
<div>
to the Wikidata SPARQL endpoint with an <span style="font-family: "courier new" , "courier" , monospace;">Accept </span>request header of <span style="font-family: "courier new" , "courier" , monospace;">text/turtle</span>, I get 884 triples, all of which have Q42 as either the subject or object. So at least for the Wikidata query service SPARQL endpoint, the DESCRIBE query provides a simpler way to express the CONSTRUCT query that I described in the "Getting RDF triples using SPARQL CONSTRUCT" section above. </div>
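<div>
<br /></div>
<div>
If you want to try this without Postman, a minimal sketch with the requests module looks like the following (the endpoint URL and the decision to print only the start of the response are just illustrative):</div>
<div>
<br /></div>
<pre>
import requests

query = 'DESCRIBE <http://www.wikidata.org/entity/Q42>'
r = requests.get('https://query.wikidata.org/sparql',
                 params={'query': query},
                 headers={'Accept': 'text/turtle'})
print(r.text[:1000])  # the beginning of the Turtle serialization of the result
</pre>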
<div>
<br /></div>
<h2>
The power of SPARQL CONSTRUCT</h2>
<div>
In the simple example above, DESCRIBE was a more efficient way than CONSTRUCT to get all of the triples where Q42 was the subject or object. However, the advantage of using CONSTRUCT is that you can tailor the triples to be returned in more specific ways. For example, you could easily obtain only the triples where Q42 is the subject by just leaving out the </div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">?s ?p2 wd:Q42.</span></div>
</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
part of the query.</div>
<div>
<br /></div>
<div>
In the CONSTRUCT examples I've discussed so far, the triples in the constructed graph all existed in the Wikidata dataset - we just plucked them out of there for our own use. However, there is no requirement that the constructed triples actually exist in the data source. We can actually "make up" triples to say anything we want. I'll illustrate this with an example involving references.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://heardlibrary.github.io/digital-scholarship/lod/images/wikidata-statement-reference.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="554" data-original-width="734" height="482" src="https://heardlibrary.github.io/digital-scholarship/lod/images/wikidata-statement-reference.png" width="640" /></a></div>
<br /></div>
<div>
<br /></div>
<div>
The Wikibase graph model (upon which the Wikidata model is based) is somewhat complex with respect to the references that support claims about items. (See <a href="https://heardlibrary.github.io/digital-scholarship/lod/wikibase/#references" target="_blank">this page</a> for more information.) When a statement is made, a statement resource is instantiated and it's linked to the subject item by a "property" (<span style="font-family: "courier new" , "courier" , monospace;">p:</span> namespace) analog of the "truthy direct property" (<span style="font-family: "courier new" , "courier" , monospace;">wdt:</span> namespace) that is used to link the subject item to the object of the claim. The statement instance is then linked to zero to many reference instances by a <span style="font-family: "courier new" , "courier" , monospace;">prov:wasDerivedFrom</span> (<span style="font-family: "courier new" , "courier" , monospace;">http://www.w3.org/ns/prov#wasDerivedFrom</span>) predicate. The reference instances can then be linked to a variety of source resources by reference properties. These reference properties are not intrinsic to the Wikibase graph model and are created as needed by the community just as is the case with other properties in Wikidata. </div>
<div>
<br /></div>
<div>
We can explore the references that support claims about Q42 by going to the <a href="https://query.wikidata.org/" target="_blank">Wikidata Query Service GUI</a> and pasting in <a href="https://query.wikidata.org/#PREFIX%20wd%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2F%3E%0APREFIX%20p%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2F%3E%0APREFIX%20pr%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Freference%2F%3E%0APREFIX%20prov%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2Fns%2Fprov%23%3E%0ASELECT%20DISTINCT%20%3FpropertyUri%20%3FreferenceProperty%20%3Fsource%0AWHERE%20%7B%0A%20%20wd%3AQ42%20%3FpropertyUri%20%3Fstatement.%0A%20%20%3Fstatement%20prov%3AwasDerivedFrom%20%3Freference.%0A%20%20%3Freference%20%3FreferenceProperty%20%3Fsource.%0A%7D%0AORDER%20BY%20%3FreferenceProperty" target="_blank">this query</a>: </div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">PREFIX wd: <http://www.wikidata.org/entity/></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">PREFIX p: <http://www.wikidata.org/prop/></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">PREFIX pr: <http://www.wikidata.org/prop/reference/></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">PREFIX prov: <http://www.w3.org/ns/prov#></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">SELECT DISTINCT ?propertyUri ?referenceProperty ?source</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">WHERE {</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> wd:Q42 ?propertyUri ?statement.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> ?statement prov:wasDerivedFrom ?reference.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> ?reference ?referenceProperty ?source.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">}</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">ORDER BY ?referenceProperty</span></div>
</div>
<div>
<br /></div>
<div>
After clicking on the blue "run" button, the results table shows us three things:</div>
<div>
<ul>
<li>in the first column we see the property used in the claim</li>
<li>in the second column we see the kind of reference property that was used to support the claim</li>
<li>in the third column we see the value of the reference property, which is the cited source</li>
</ul>
</div>
<div>
Since the table is sorted by the reference properties, we can click on them to see what they are. One of the useful ones is <a href="https://www.wikidata.org/wiki/Property:P248" target="_blank">P248, "stated in"</a>. It links to an item that is an information document or database that supports a claim. This is very reminiscent of <a href="http://www.dublincore.org/specifications/dublin-core/dcmi-terms/#terms-source" target="_blank">dcterms:source</a>, "a related resource from which the described resource is derived". If I wanted to capture this information in my own triplestore, but use the more standard Dublin Core term, I could construct a graph that contained the statement instance, but then connected the statement directly to the source using dcterms:source. Here's how I would write the CONSTRUCT query:</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">PREFIX wd: <http://www.wikidata.org/entity/></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">PREFIX p: <http://www.wikidata.org/prop/></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">PREFIX pr: <http://www.wikidata.org/prop/reference/></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">PREFIX prov: <http://www.w3.org/ns/prov#></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">PREFIX dcterms: <http://purl.org/dc/terms/></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">CONSTRUCT {</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> wd:Q42 ?propertyUri ?statement.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> ?statement dcterms:source ?source.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> }</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">WHERE {</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> wd:Q42 ?propertyUri ?statement.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> ?statement prov:wasDerivedFrom ?reference.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> ?reference pr:P248 ?source.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">}</span></div>
</div>
<div>
<br /></div>
<div>
You can test out <a href="https://query.wikidata.org/#PREFIX%20wd%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2F%3E%0APREFIX%20p%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2F%3E%0APREFIX%20pr%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Freference%2F%3E%0APREFIX%20prov%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2Fns%2Fprov%23%3E%0APREFIX%20dcterms%3A%20%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2F%3E%0ACONSTRUCT%20%7B%0A%20%20wd%3AQ42%20%3FpropertyUri%20%3Fstatement.%0A%20%20%3Fstatement%20dcterms%3Asource%20%3Fsource.%0A%20%20%7D%0AWHERE%20%7B%0A%20%20wd%3AQ42%20%3FpropertyUri%20%3Fstatement.%0A%20%20%3Fstatement%20prov%3AwasDerivedFrom%20%3Freference.%0A%20%20%3Freference%20pr%3AP248%20%3Fsource.%0A%7D%0A" target="_blank">the query at the Wikidata Query Service GUI</a>. You could simplify the situation even more if you made up your own predicate for "has a source about it", which we could call <span style="font-family: "courier new" , "courier" , monospace;">ex:source</span> (<span style="font-family: "courier new" , "courier" , monospace;">http://example.org/source</span>). In that case, the constructed graph would be defined as</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">CONSTRUCT {</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> wd:Q42 ex:source</span><span style="font-family: "courier new" , "courier" , monospace;"> ?source.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> }</span></div>
</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<a href="https://query.wikidata.org/#PREFIX%20wd%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2F%3E%0APREFIX%20p%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2F%3E%0APREFIX%20pr%3A%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Freference%2F%3E%0APREFIX%20prov%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2Fns%2Fprov%23%3E%0APREFIX%20dcterms%3A%20%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2F%3E%0APREFIX%20ex%3A%20%3Chttp%3A%2F%2Fexample.org%2F%3E%0ACONSTRUCT%20%7B%0A%20%20wd%3AQ42%20ex%3Asource%20%3Fsource.%0A%20%20%7D%0AWHERE%20%7B%0A%20%20wd%3AQ42%20%3FpropertyUri%20%3Fstatement.%0A%20%20%3Fstatement%20prov%3AwasDerivedFrom%20%3Freference.%0A%20%20%3Freference%20pr%3AP248%20%3Fsource.%0A%7D" target="_blank">This construct query</a> could be incorporated into a Python script (or a script in any other language that supports HTTP calls) using the <span style="font-family: "courier new" , "courier" , monospace;">requests </span>or <span style="font-family: "courier new" , "courier" , monospace;">rdflib </span>modules as described earlier in this post.<br />
<br />
<h2>
Conclusion</h2>
I have hopefully made the case that the ability to perform SPARQL SELECT and CONSTRUCT queries at the Wikidata Query Service eliminates the need for the Wikimedia Foundation to create an additional API to provide data from Wikidata. Using SPARQL queries provides data retrieval capabilities that are only limited by your imagination. It's true that using a SPARQL endpoint as an API requires some knowledge about constructing SPARQL queries, but I would assert that this is a skill that must be acquired by anyone who is really serious about using LOD. I'm a relative novice at constructing SPARQL queries, but even so I can think of a mind-boggling array of possibilities for using them to get data from Wikidata.<br />
<br />
In my next post, I'm going to talk about the reverse situation: getting data into Wikidata using software.<br />
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
Steve Baskaufhttp://www.blogger.com/profile/01896499749604153763noreply@blogger.com0tag:blogger.com,1999:blog-5299754536670281996.post-91897039361409392082019-04-24T10:06:00.001-07:002020-03-04T19:11:34.832-08:00Understanding the TDWG Standards Documentation Specification, Part 5: Acquiring Machine-readable using DCATThis is the fifth in a series of posts about the TDWG Standards Documentation Specification (SDS). For background on the SDS, see the <a href="http://baskauf.blogspot.com/2019/03/understanding-tdwg-standards.html" target="_blank">first post</a>. For information on the SDS hierarchical model and how it relates to IRI design, see the <a href="http://baskauf.blogspot.com/2019/03/understanding-tdwg-standards_10.html" target="_blank">second post</a>. For information about how TDWG standards metadata can be retrieved via IRI dereferencing, see the <a href="http://baskauf.blogspot.com/2019/04/understanding-tdwg-standards.html" target="_blank">third post</a>. For information about accessing TDWG standards metadata via a SPARQL API, see the <a href="http://baskauf.blogspot.com/2019/04/understanding-tdwg-standards_7.html" target="_blank">fourth post</a>.<br />
<br />
Note: this post was revised on 2020-03-04 when IRI dereferencing of the http://rs.tdwg.org/ subdomain went from testing into production.<br />
<h2>
Acquiring the machine-readable TDWG standards metadata based on the W3C Data Catalog (DCAT) Vocabulary Recommendation.</h2>
<div>
<br /></div>
<h3>
Not-so-great methods of getting a dump of all of the machine-readable metadata</h3>
In the last two posts of this series, I showed two different ways that you could acquire machine-readable metadata about TDWG Standards and their components.<br />
<br />
In the <a href="http://baskauf.blogspot.com/2019/04/understanding-tdwg-standards.html" target="_blank">third post</a>, I explained how the implementation of the Standards Documentation Specification (SDS) could allow a machine (i.e. computer software) to use the classic Linked Open Data (LOD) method of "following its nose" and essentially scraping the standards metadata by discovering linked IRIs, then following those links to retrieve metadata about the linked components. There are two problems with this approach. One is that it's very inefficient. Multiple HTTP calls are required to acquire the metadata about a single resource and there are thousands of resources that would need to be scraped. A more serious problem is that some of the terms that are current or past terms of Darwin and Audubon Cores are not dereferenceable. For example, the International Press Telecommunications Council (IPTC) terms that are borrowed by Audubon Core are defined in a PDF document and don't dereference. There are many ancient Darwin Core terms in namespaces other than the<span style="font-family: "courier new" , "courier" , monospace;"> rs.tdwg.org</span> subdomain that don't even bring up a web page, let alone machine-readable metadata. And the "permanent URLs" of the standards themselves (e.g. <span style="font-family: "courier new" , "courier" , monospace;">http://www.tdwg.org/standards/116</span>) do not use content negotiation to return machine-readable metadata (although they might at some future point). So there are many items of interest whose machine-readable metadata simply cannot be discovered by this means, since linked IRIs can't be dereferenced with a request for machine-readable metadata.<br />
<br />
In the <a href="http://baskauf.blogspot.com/2019/04/understanding-tdwg-standards_7.html" target="_blank">fourth post</a>, I described how the SPARQL query language could be used to get all of the triples in the TDWG Standards dataset. The query to do so was really simple:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">CONSTRUCT {?s ?p ?o}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">FROM <http://rs.tdwg.org/></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">WHERE {?s ?p ?o}</span><br />
<div>
<br /></div>
<div>
and by requesting the appropriate content type (XML, Turtle, or JSON-LD) via an Accept header, a single HTTP call would retrieve all of the metadata at once. If all goes well, this is a simple and effective method. However, this method depends critically on two things: there has to be a SPARQL endpoint that is functioning and publicly accessible, and the metadata in the triplestore of the underlying graph database must be up-to-date with the most recent data. At the moment, both of those things are true about the <a href="https://sparql.vanderbilt.edu/" target="_blank">Vanderbilt Library SPARQL endpoint</a> (<span style="font-family: "courier new" , "courier" , monospace;">https://sparql.vanderbilt.edu/sparql</span>), but there is no guarantee that it will continue to be true indefinitely. There is no reason why there cannot be multiple SPARQL endpoints where the data are available, and TDWG itself could run its own, but currently there are no plans for that to happen and so we are stuck with depending on the Vanderbilt endpoint.</div>
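<div>
<br /></div>
<div>
For the record, a minimal sketch of that call using Python's requests module might look like this (the output file name is arbitrary, and it assumes that the endpoint is up and that you want Turtle):</div>
<div>
<br /></div>
<pre>
import requests

query = '''CONSTRUCT {?s ?p ?o}
FROM <http://rs.tdwg.org/>
WHERE {?s ?p ?o}'''

r = requests.get('https://sparql.vanderbilt.edu/sparql',
                 params={'query': query},
                 headers={'Accept': 'text/turtle'})

with open('tdwg-metadata.ttl', 'wt', encoding='utf-8') as fileObject:
    fileObject.write(r.text)
</pre>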
<div>
<br /></div>
<h2>
Getting a machine-readable data dump from TDWG itself</h2>
<div>
<br /></div>
<div>
I'm now going to tell you about the best way to acquire authoritative machine-readable metadata from the <span style="font-family: "courier new" , "courier" , monospace;">rs.tdwg.org</span> implementation itself. But first we need to talk about the W3C Data Catalog (DCAT) recommendation, which is used to organize the data dump. The SDS does not mention the DCAT recommendation, but since DCAT is an international standard, it is the logical choice to be used for describing the TDWG standards datasets.<br />
<br /></div>
<div>
<br /></div>
<h3>
Data Catalog Vocabulary (DCAT)</h3>
<div>
In 2014, the W3C ratified the <a href="https://www.w3.org/TR/vocab-dcat/" target="_blank">DCAT vocabulary</a> as a Recommendation (the W3C term for its ratified standards). DCAT is a vocabulary for describing datasets of any form. The described datasets can be machine-readable, but do not have to be, and could include non-machine-readable forms like spreadsheets. The description of the datasets is in RDF, although the Recommendation is agnostic about the serialization. </div>
<div>
<br /></div>
<div>
There are three classes of resources that are described by the DCAT vocabulary. A <b><i>data catalog</i></b> is the resource that describes datasets. Its type is <span style="font-family: "courier new" , "courier" , monospace;">dcat:Catalog</span> (<span style="font-family: "courier new" , "courier" , monospace;">http://www.w3.org/ns/dcat#Catalog</span>). The <i><b>datasets</b></i> described in the catalog are assigned the type <span style="font-family: "courier new" , "courier" , monospace;">dcat:Dataset</span>, which is a subclass of <span style="font-family: "courier new" , "courier" , monospace;">dctype:Dataset</span> (<span style="font-family: "courier new" , "courier" , monospace;">http://purl.org/dc/dcmitype/Dataset</span>). The third class of resources, <b><i>distributions</i></b>, is described as "an accessible form of a dataset" and can include downloadable files or web services. Distributions are assigned the type <span style="font-family: "courier new" , "courier" , monospace;">dcat:Distribution</span> (<span style="font-family: "courier new" , "courier" , monospace;">http://www.w3.org/ns/dcat#Distribution</span>). The hierarchical relationship among these classes of resources is shown in the following diagram.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiz3b18P5IKqa1T8c8ey2n9E1gDlKFZrSPqWzhx8tGQs-WMtuGAyh8WGrYcLJyoeBx9qXOcod8GwgQ0HYCPQU_-5Oc-qGijdRtRnCkIf8ahh9o2h9AU_TA96F_NzFQsalzl3VU4gWlEPEY/s1600/dcat.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="720" data-original-width="1002" height="458" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiz3b18P5IKqa1T8c8ey2n9E1gDlKFZrSPqWzhx8tGQs-WMtuGAyh8WGrYcLJyoeBx9qXOcod8GwgQ0HYCPQU_-5Oc-qGijdRtRnCkIf8ahh9o2h9AU_TA96F_NzFQsalzl3VU4gWlEPEY/s640/dcat.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
An important thing to notice is that the DCAT vocabulary defines several terms whose IRIs are very similar: <span style="font-family: "courier new" , "courier" , monospace;">dcat:dataset</span> and <span style="font-family: "courier new" , "courier" , monospace;">dcat:Dataset</span>, and <span style="font-family: "courier new" , "courier" , monospace;">dcat:distribution</span> and <span style="font-family: "courier new" , "courier" , monospace;">dcat:Distribution</span>. The only thing that differs between the pairs of terms is whether the local name is capitalized or not. Those with capitalized local names denote <b><i>classes</i></b> and those that begin with lower case denote object <b><i>properties</i></b>.<br />
<br /></div>
<div>
<h3>
Organization of TDWG data according to the DCAT data model</h3>
</div>
<div>
I assigned the IRI <span style="font-family: "courier new" , "courier" , monospace;">http://rs.tdwg.org/index</span> to denote the TDWG standards metadata catalog. The local name "index" is descriptive of a catalog, and the IRI has the added benefit of supporting a typical web behavior: if a base subdomain like <span style="font-family: "courier new" , "courier" , monospace;">http://rs.tdwg.org/</span> is dereferenced, it is typical for that form of IRI to dereference to a "homepage" having the IRI <span style="font-family: "courier new" , "courier" , monospace;">http://rs.tdwg.org/index.htm</span>, and <span style="font-family: "courier new" , "courier" , monospace;">http://rs-test.tdwg.org/index.htm</span> does indeed redirect to a "homepage" of sorts: the README.md page for the rs.tdwg.org GitHub repo where the authoritative metadata tables live. You can try this yourself by putting either <span style="font-family: "courier new" , "courier" , monospace;"><a href="http://rs.tdwg.org/" target="_blank">http://rs.tdwg.org/</a></span>or <span style="font-family: "courier new" , "courier" , monospace;"><a href="http://rs.tdwg.org/index.htm" target="_blank">http://rs.tdwg.org/index.htm</a></span> into a browser URL bar and see what happens. However, making an HTTP call to either of these IRIs with an <span style="font-family: "courier new" , "courier" , monospace;">Accept </span>header for machine-readable RDF (<span style="font-family: "courier new" , "courier" , monospace;">text/turtle</span> or <span style="font-family: "courier new" , "courier" , monospace;">application/rdf+xml</span>) will redirect to a representation-specific IRI like <a href="http://rs.tdwg.org/index.ttl" style="font-family: "Courier New", Courier, monospace;" target="_blank">http://rs.tdwg.org/index.ttl</a> or <a href="http://rs.tdwg.org/index.rdf" style="font-family: "Courier New", Courier, monospace;" target="_blank">http://rs.tdwg.org/index.rdf</a> as you'd expect in the Linked Data world.<br />
<br />
The data catalog denoted by <span style="font-family: "courier new" , "courier" , monospace;">http://rs.tdwg.org/index</span> describes the data located in the GitHub repository <a href="https://github.com/tdwg/rs.tdwg.org" target="_blank">https://github.com/tdwg/rs.tdwg.org</a>. Those data are organized into a number of directories, with each directory containing all of the information required to map metadata-containing CSV files to machine-readable RDF. From the standpoint of DCAT, we can consider the information in each directory as a dataset. There is no philosophical reason why we should organize the datasets that way. Rather, it is based on practicality, since the server that dereferences TDWG IRIs can generate a data dump for each directory via a dump URL. See <a href="https://github.com/tdwg/rs.tdwg.org/blob/master/index/index-datasets.csv" target="_blank">this file</a> for a complete list of the datasets.<br />
<br />
Each of the abstract datasets can be accessed through one of several distributions. Currently, the RDF metadata about the TDWG data says that there are three distributions for each of the datasets: one in RDF/XML, one in RDF/Turtle, and one in JSON-LD (with the JSON-LD having a problem I mentioned in the <a href="http://baskauf.blogspot.com/2019/04/understanding-tdwg-standards.html" target="_blank">third post</a>). The IANA media type for each distribution is given as the value of a <span style="font-family: "courier new" , "courier" , monospace;">dcat:mediaType</span> property (see the diagram above for an example).<br />
<br />
One thing that is a bit different from what one might consider the traditional Linked Data approach is that the distributions are not really considered representations of the datasets. That is, under the DCAT model, one does not necessarily expect to be redirected to the distribution IRI from dereferencing of the dataset IRI through content negotiation. That's because content negotiation generally results in direct retrieval of some human- or machine-readable serialization, but in the DCAT model, the distribution itself is a separate, abstract entity apart from the serialization. The serialization itself is connected via a <span style="font-family: "courier new" , "courier" , monospace;">dcat:downloadURL</span> property of the distribution (see the diagram above). I'm not sure why the DCAT model adds this extra layer, but I think it is probably so that a permanent IRI can be assigned to the distribution, while the download URL can be a mutable thing that can change over time, yet still be discovered through its link to the distribution.<br />
<br />
At the moment, the dataset IRIs don't dereference, although that could be changed in the future if need be. Despite that, their metadata are exposed when the data catalog IRI itself is dereferenced, so a machine could learn all it needed to know about them with a single HTTP call to the catalog IRI.<br />
<br />
In the case of the TDWG data, I didn't actually mint IRIs for the distributions, since it's not that likely that anyone would ever need to address them directly and I wasn't interested in maintaining another set of identifiers. So they are represented by blank (anonymous) nodes in the dataset. The download URLs can be determined from the dataset URI by rules, so there's no need to maintain a record of them, either.<br />
<br />
Here is an abbreviated bit of the Turtle that you get if you dereference the catalog IRI <span style="font-family: "courier new" , "courier" , monospace;">http://rs.tdwg.org/index</span> and request <span style="font-family: "courier new" , "courier" , monospace;">text/turtle</span> (or just retrieve <a href="http://rs.tdwg.org/index.ttl" target="_blank">http://rs.tdwg.org/index.ttl</a>):<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">@prefix dc: <http://purl.org/dc/elements/1.1/>.</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">@prefix dcterms: <http://purl.org/dc/terms/>.</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">@prefix dcat: <http://www.w3.org/ns/dcat#>.</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">@prefix dcmitype: <http://purl.org/dc/dcmitype/>.</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;"><http://rs.tdwg.org/index></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> dc:publisher "Biodiversity Information Standards (TDWG)"@en;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> dcterms:publisher <https://www.grid.ac/institutes/grid.480498.9>;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> dcterms:license <http://creativecommons.org/licenses/by/4.0/>;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> dcterms:modified "2018-10-09"^^xsd:date;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> rdfs:label "TDWG dataset catalog"@en;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> rdfs:comment "This dataset contains the data that underlies TDWG standards and standards documents"@en;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> dcat:dataset <http://rs.tdwg.org/index/audubon>;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> a dcat:Catalog.</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;"><http://rs.tdwg.org/index/audubon></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> dcterms:modified "2018-10-09"^^xsd:date;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> rdfs:label "Audubon Core-defined terms"@en;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> dcat:distribution _:53c07f45-4561-448b-9bb9-396e47d3ad1d;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> a dcmitype:Dataset.</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">_:53c07f45-4561-448b-9bb9-396e47d3ad1d</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> dcat:mediaType <https://www.iana.org/assignments/media-types/application/rdf+xml>;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> dcterms:license <https://creativecommons.org/publicdomain/zero/1.0/>;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> dcat:downloadURL <http://rs.tdwg.org/dump/audubon.rdf>;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> a dcat:Distribution.</span><br />
<div>
<br /></div>
In this Turtle, you can see the DCAT-based structure as described above.<br />
<br />
Returning to a comment that I made earlier, DCAT can describe data in any form and it's not restricted to RDF. So in theory, one could consider each dataset to have a distribution that is in CSV format, and use the GitHub raw URL for the CSV file as the download URL of that distribution. I haven't done that because complete information about the dataset requires the combination of the raw CSV file with a property mapping table and I don't know how to represent that complexity in DCAT. But at least in theory it could be done. One can also indicate that a distribution of the dataset is available from an API such as a SPARQL endpoint, which I also have not done because the datasets aren't compartmentalized into named graphs and therefore can't really be distinguished from each other. But again, in theory it could be done.</div>
<div>
<br />
<h3>
Getting a dump of all of the data</h3>
At the start of this post, I complained that there were potential issues with the first two methods that I described for retrieving all of the TDWG standards metadata. I promised a better way, so here it is!<br />
<br />
In theory, a client could start with the catalog IRI (<span style="font-family: "courier new" , "courier" , monospace;">http://rs.tdwg.org/index</span>), dereference it requesting the machine-readable serialization flavor of your choice, and follow the links to the download URLs of all 50 of the datasets currently in the catalog. That would be in the LOD style and would require far fewer HTTP calls than the thousands that would be required to scrape all of the machine-readable data one standards-related resource at a time.<br />
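<br />
A rough sketch of that LOD-style approach with rdflib might look like the following. Here I simply fetch the Turtle representation of the catalog directly rather than using content negotiation, and I follow every distribution's download URL, so each dataset would be retrieved in all three serializations; treat it as an illustration rather than a finished client.<br />
<br />
<pre>
from rdflib import Graph, Namespace
import requests

DCAT = Namespace('http://www.w3.org/ns/dcat#')

# load the catalog description into a graph
catalog = Graph()
catalog.parse('http://rs.tdwg.org/index.ttl', format='turtle')

# follow the links: catalog -> dataset -> distribution -> download URL
for dataset in catalog.objects(None, DCAT.dataset):
    for distribution in catalog.objects(dataset, DCAT.distribution):
        for downloadUrl in catalog.objects(distribution, DCAT.downloadURL):
            print('retrieving', downloadUrl)
            serialization = requests.get(str(downloadUrl)).text
            # parse or store the retrieved serialization here
</pre>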
<br />
However, here is a quick and dirty way that doesn't require using any Linked Data technology (a Python sketch of this recipe follows the list):<br />
<ul>
<li>use a script of your favorite programming language to load the <a href="https://raw.githubusercontent.com/tdwg/rs.tdwg.org/master/index/index-datasets.csv" target="_blank">raw file for the datasets CSV table on GitHub</a></li>
<li>get the dataset name from the second ("term_localName") column (e.g. <span style="font-family: "courier new" , "courier" , monospace;">audubon</span>)</li>
<li>prepend <span style="font-family: "courier new" , "courier" , monospace;">http://rs.tdwg.org/dump/</span> to the name (e.g. <span style="font-family: "courier new" , "courier" , monospace;">http://rs.tdwg.org/dump/audubon</span>)</li>
<li>append the appropriate file extension for the serialization you want (<span style="font-family: "courier new" , "courier" , monospace;">.ttl</span> for Turtle, <span style="font-family: "courier new" , "courier" , monospace;">.rdf</span> for XML) to the URL from the previous step (e.g. <span style="font-family: "courier new" , "courier" , monospace;">http://rs.tdwg.org/dump/audubon.ttl</span>)</li>
<li>make an HTTP GET call to that URL to acquire the machine-readable serialization for that dataset. </li>
<li>Repeat for the other 49 data rows in the table.</li>
</ul>
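Here is a sketch of that recipe in Python. It assumes the column header "term_localName" and the dump URL pattern described in the list above; writing each dataset to its own Turtle file is just one of many things you could do with the responses.<br />
<br />
<pre>
import csv
import requests

# read the table of datasets from the rs.tdwg.org GitHub repository
tableUrl = 'https://raw.githubusercontent.com/tdwg/rs.tdwg.org/master/index/index-datasets.csv'
rows = csv.DictReader(requests.get(tableUrl).text.splitlines())

for row in rows:
    datasetName = row['term_localName']                          # e.g. audubon
    dumpUrl = 'http://rs.tdwg.org/dump/' + datasetName + '.ttl'  # use .rdf for RDF/XML
    turtle = requests.get(dumpUrl).text
    # save the serialization (or load it into rdflib, a triplestore, etc.)
    with open(datasetName + '.ttl', 'wt', encoding='utf-8') as fileObject:
        fileObject.write(turtle)
</pre>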
<br />
I've actually done something like this in lines 55 to 63 of <a href="https://github.com/tdwg/rs.tdwg.org/blob/master/index/database-triple-loader.py" target="_blank">a Python script</a> on GitHub. Rather than making a GET request, the script actually uses the constructed URL to create a <a href="https://www.w3.org/TR/sparql11-update/" target="_blank">SPARQL Update</a> command that loads the data directly from the TDWG server into a graph database triplestore (lines 133 and 127) via an HTTP POST request. But you could use GET to load the data directly into your own software using a library like Python's <a href="https://github.com/RDFLib/rdflib" target="_blank">RDFLib</a> if you preferred to work with it directly rather than through a SPARQL endpoint.<br />
<br />
The advantage of getting the dump in this way is that it would be coming directly from the authoritative TDWG server (which gets its data from the CSVs in the rs.tdwg.org repo of the TDWG GitHub site). You would then be guaranteed to have the most up-to-date version of the data, something that would not necessarily happen if you got the data from somebody else's SPARQL endpoint.<br />
<br />
In the future, this method will be important because it would be the best way to build reliable applications that made use of standards metadata. For many standards and the "regular" TDWG vocabularies that conform to the SDS (Darwin and Audubon Cores), retrieving up-to-date metadata probably isn't that critical because those standards don't change very quickly. However, in the case of controlled vocabularies, access to up-to-date data may be more important.<br />
<br /></div>
Steve Baskaufhttp://www.blogger.com/profile/01896499749604153763noreply@blogger.com0tag:blogger.com,1999:blog-5299754536670281996.post-50618535727662009952019-04-07T20:44:00.000-07:002020-03-04T19:15:35.581-08:00Understanding the TDWG Standards Documentation Specification, Part 4: Machine-readable Metadata Via an APIThis is the fourth in a series of posts about the TDWG Standards Documentation Specification (SDS). For background on the SDS, see the <a href="http://baskauf.blogspot.com/2019/03/understanding-tdwg-standards.html" target="_blank">first post</a>. For information on its hierarchical model and how it relates to IRI design, see the <a href="http://baskauf.blogspot.com/2019/03/understanding-tdwg-standards_10.html" target="_blank">second post</a>. For information about how metadata is retrieved via IRI dereferencing, see the <a href="http://baskauf.blogspot.com/2019/04/understanding-tdwg-standards.html" target="_blank">third post</a>.<br />
<br />
Note: this post was revised on 2020-03-04 when IRI dereferencing of the http://rs.tdwg.org/ subdomain went from testing into production.<br />
<br />
<h2>
Retrieving metadata about TDWG standards using a web API</h2>
<br />
If you have persevered through the first three posts in this series, congratulations! The main reason for those earlier posts was to provide the background for this post, which is on the topic that will probably be most interesting to readers: how to effectively retrieve machine-readable metadata about TDWG standards using a web API.<br />
<br />
Let's start with retrieving an example resource: the term IRIs and definitions of terms of a TDWG vocabulary (Darwin Core=dwc or Audubon Core=ac).<br />
<br />
Here is what we need for the API call:<br />
<br />
<b>Resource URL</b>: <span style="font-family: "courier new" , "courier" , monospace;">https://sparql.vanderbilt.edu/sparql</span><br />
<b>Method</b>: <span style="font-family: "courier new" , "courier" , monospace;">GET</span><br />
<b>Authentication required:</b> No<br />
<b>Request header key</b>: <span style="font-family: "courier new" , "courier" , monospace;">Accept</span><br />
<b>Request header value</b>: <span style="font-family: "courier new" , "courier" , monospace;">application/json</span><span style="font-family: inherit;">, </span><span style="font-family: "courier new" , "courier" , monospace;">text/csv</span><span style="font-family: inherit;"> or </span><span style="font-family: "courier new" , "courier" , monospace;">application/xml</span><br />
<b>Parameter key</b>: <span style="font-family: "courier new" , "courier" , monospace;">query</span><br />
<b>Parameter value</b>: insert "<span style="font-family: "courier new" , "courier" , monospace;">dwc</span>" or "<span style="font-family: "courier new" , "courier" , monospace;">ac</span>" in place of <b>{vocabularyAbbreviation}</b> in the following string:<br />
"<span style="font-family: "courier new" , "courier" , monospace;">prefix%20skos%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23%3E%0Aprefix%20dcterms%3A%20%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2F%3E%0ASELECT%20DISTINCT%20%3Firi%20%3Fdefinition%0AWHERE%20%7B%0A%20%20GRAPH%20%3Chttp%3A%2F%2Frs.tdwg.org%2F%3E%20%7B%0A%20%20%20%20%3Chttp%3A%2F%2Frs.tdwg.org%2F</span><span style="font-family: inherit;"><b>{vocabularyAbbreviation}</b></span><span style="font-family: "courier new" , "courier" , monospace;">%2F%3E%20dcterms%3AhasPart%20%3FtermList.%0A%20%20%20%20%3FtermList%20dcterms%3AhasPart%20%3Firi.%0A%20%20%20%20%3Firi%20skos%3AprefLabel%20%3Flabel.%0A%20%20%20%20%3Firi%20skos%3Adefinition%20%3Fdefinition.%0A%20%20%20%20FILTER(lang(%3Flabel)%3D%22en%22)%0A%20%20%20%20FILTER(lang(%3Fdefinition)%3D%22en%22)%0A%20%20%20%20%7D%0A%7D%0AORDER%20BY%20%3Firi</span>"<br />
<br />
<b>Note:</b> the <span style="font-family: "courier new" , "courier" , monospace;">Accept</span> header is required to receive JSON -- omitting it returns XML.<br />
<br />
Here's an example response that shows the structure of the JSON that is returned:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">{</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "head": {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "vars": [</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "iri",</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "definition"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> ]</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> },</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "results": {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "bindings": [</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "iri": {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "type": "uri",</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "value": "http://ns.adobe.com/exif/1.0/PixelXDimension"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> },</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "definition": {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "xml:lang": "en",</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "type": "literal",</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "value": "Information specific to compressed data. When a compressed file is recorded, the valid width of the meaningful image shall be recorded in this tag, whether or not there is padding data or a restart marker. This tag shall not exist in an uncompressed file."</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> }</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> },</span><br />
<div>
<b>(... many more array values here ...)</b></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "iri": {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "type": "uri",</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "value": "http://rs.tdwg.org/dwc/terms/waterBody"</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> },</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "definition": {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "xml:lang": "en",</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "type": "literal",</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> "value": "The name of the water body in which the Location occurs. Recommended best practice is to use a controlled vocabulary such as the Getty Thesaurus of Geographic Names."</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> }</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> }</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> ]</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> }</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span></div>
<br />
Here is an example script to use the API via Python 3. (You can convert to your own favorite programming language or see <a href="https://heardlibrary.github.io/digital-scholarship/script/python/install/" target="_blank">this page</a> if you need to set up Python 3 on your computer.) Note: the <span style="font-family: "courier new" , "courier" , monospace;">requests</span> module is not included in the Python standard library and must be installed using PIP or another package manager.<br />
<br />
Although the API can return CSV and XML, we will only be using JSON in this example.<br />
------<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">import requests</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">vocab = input('Enter the vocabulary abbreviation (dwc for Darwin Core or ac for Audubon Core): ')</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;"># values required for the HTTP request</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">resourceUrl = 'https://sparql.vanderbilt.edu/sparql'</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">requestHeaderKey = 'Accept'</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">requestHeaderValue = 'application/json'</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">parameterKey = 'query'</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">parameterValue ='prefix%20skos%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23%3E%0Aprefix%20dcterms%3A%20%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2F%3E%0ASELECT%20DISTINCT%20%3Firi%20%3Fdefinition%0AWHERE%20%7B%0A%20%20GRAPH%20%3Chttp%3A%2F%2Frs.tdwg.org%2F%3E%20%7B%0A%20%20%20%20%3Chttp%3A%2F%2Frs.tdwg.org%2F'</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">parameterValue += vocab</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">parameterValue += '%2F%3E%20dcterms%3AhasPart%20%3FtermList.%0A%20%20%20%20%3FtermList%20dcterms%3AhasPart%20%3Firi.%0A%20%20%20%20%3Firi%20skos%3AprefLabel%20%3Flabel.%0A%20%20%20%20%3Firi%20skos%3Adefinition%20%3Fdefinition.%0A%20%20%20%20FILTER(lang(%3Flabel)%3D%22en%22)%0A%20%20%20%20FILTER(lang(%3Fdefinition)%3D%22en%22)%0A%20%20%20%20%7D%0A%7D%0AORDER%20BY%20%3Firi'</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">url = resourceUrl + '?' + parameterKey + '=' + parameterValue</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;"># make the HTTP request and store the terms data in a list</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">r = requests.get(url, headers={requestHeaderKey: requestHeaderValue})</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">responseBody = r.json()</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">items = responseBody['results']['bindings']</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;"># iterate through the list and print what we wanted</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">for item in items:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> print(item['iri']['value'])</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> print(item['definition']['value'])</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> print()</span><br />
<div>
<br /></div>
<div>
------</div>
<div>
For anyone who has programmed an application to retrieve data from an API, this is pretty standard stuff. Because the <span style="font-family: "courier new" , "courier" , monospace;">requests </span>module is so simple to use, the part of the code that actually retrieves the data from the API (the three lines following the <span style="font-family: "courier new" , "courier" , monospace;"># make the HTTP request</span> comment) is very short, so the coding required to retrieve the data is not complicated. For the output I just had the values for the IRI and definition printed to the console, but obviously you could do whatever you want with them in your own programming.</div>
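<div>
<br /></div>
<div>
For example, to see the same results as a CSV table rather than JSON (a sketch; it assumes the endpoint honors the standard SPARQL results media type <span style="font-family: "courier new" , "courier" , monospace;">text/csv</span>), keep everything through the <span style="font-family: "courier new" , "courier" , monospace;">url = ...</span> line and replace the rest of the script with these lines:</div>
<div>
------</div>
<div>
<br /></div>
<span style="font-family: "courier new" , "courier" , monospace;"># request the results as a CSV table rather than parsing JSON</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">r = requests.get(url, headers={requestHeaderKey: 'text/csv'})</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">print(r.text)</span><br />
<div>
<br /></div>
<div>
------</div>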
<div>
<br /></div>
<div>
If you are familiar with using web APIs and you examine the details of the code, you will probably have several questions:</div>
<div>
<br /></div>
<div>
- Why is the parameter value so much longer and weirder than what is typical for web APIs?</div>
<div>
- What is this <span style="font-family: "courier new" , "courier" , monospace;">sparql.vanderbilt.edu</span> API?</div>
<div>
- What other kinds of resources can be obtained from the API?</div>
<div>
<br /></div>
<h2>
About the API</h2>
<div>
The reason that the parameter value is so long and weird-looking is that the required parameter value is a SPARQL query in URL-encoded form. I purposefully obfuscated the parameter value by URL-encoding it in the script because I wanted to emphasize that a SPARQL endpoint is fundamentally just like any other web API, except with a more complicated query parameter. </div>
<div>
<br /></div>
<div>
I feel like in the past, Linked Data, RDF, and SPARQL have been talked about in the TDWG community as if they were some kind of religion with secrets that only initiated members of the priesthood can know. (For a short introduction to this topic, see <a href="https://youtu.be/YWyCCJ6B2WE" target="_blank">this video</a>.) It is true that if you want to design an RDF data model or build the infrastructure to transform tabular data to RDF, you need to know a lot of technical details, but those are not tasks that most people need to do. You actually don't need to know anything about RDF, how it's structured, or how to create it in order to use a SPARQL endpoint, as I just demonstrated above.</div>
<div>
<br /></div>
<div>
The endpoint <span style="font-family: "courier new" , "courier" , monospace;">http://sparql.vanderbilt.edu/sparql</span> provides public access to datasets that have been made available by the <a href="https://www.library.vanderbilt.edu/" target="_blank">Vanderbilt Libraries</a>. It is our intention to keep this API up and the datasets stable for as long as possible. (For more about the API, see <a href="https://github.com/HeardLibrary/semantic-web/blob/master/sparql/README.md" target="_blank">this page</a>.) However, there is nothing special about the API - it's just an installation of Blazegraph, which is freely available as a Docker image (see <a href="https://heardlibrary.github.io/digital-scholarship/lod/install/" target="_blank">this page</a> for instructions if you want to try installing it on your own computer). The TDWG dataset that is loaded into the Vanderbilt API is also freely available and can be loaded into any Blazegraph instance. So although the Vanderbilt API provides a convenient way to access the TDWG data, there is nothing special about it: no custom programming was done to get it online, and no special processing was done to the data that was loaded into it. Any number of other APIs could be set up to provide exactly the same services using exactly the same API calls. For those who are interested, later in this post I will provide more details about how anyone can obtain the data, but those are details that most users can happily ignore.</div>
<div>
<br /></div>
<div>
The interesting thing about SPARQL endpoints is that there is an unlimited number of resources that can be obtained from the API. Conventional APIs, such as the <a href="https://www.gbif.org/developer/summary" target="_blank">GBIF</a> or <a href="https://developer.twitter.com/en/docs.html" target="_blank">Twitter</a> APIs, provide web pages that list the available resources and the parameter key/value pairs required to obtain them. If potential users want to obtain a resource that is not currently available, they have to ask the API developers to create the code required to allow them to access that resource. A SPARQL endpoint is much simpler. It has exactly one resource URL (the URL of the endpoint) and for read operations has only one parameter key (<span style="font-family: "courier new" , "courier" , monospace;">query</span>). The value of that single parameter is the SPARQL query. </div>
<div>
<br /></div>
<div>
In a manner analogous to traditional API documentation, we can (and should) provide a list of queries that would retrieve the types of information that users typically might want to obtain. Developers who are satisfied with that list can simply follow the recipe and make API calls using that recipe as they would for any other API. But the great thing about a SPARQL endpoint is that you are NOT limited to any provided list of queries. If you are willing to study the TDWG standards data model that I described in the <a href="http://baskauf.blogspot.com/2019/03/understanding-tdwg-standards_10.html" target="_blank">second post of this series</a> and expend a minimal amount of time learning to construct SPARQL queries (see <a href="https://heardlibrary.github.io/digital-scholarship/lod/sparql/" target="_blank">this beginner's page</a> to get started), you can retrieve any kind of data that you can imagine without needing to beg some developers to add that functionality to their API. </div>
<div>
<br /></div>
<div>
In the next section, I'm going to simplify the Python 3 script that I listed above, then provide several additional API call examples.</div>
<div>
<br /></div>
<h2>
A generic Python script for making other API calls</h2>
<div>
Here is the previous script in a more straightforward and hackable form:</div>
<div>
------</div>
<div>
<br /></div>
<div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">import requests</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">vocab = input('Enter the vocabulary abbreviation (dwc for Darwin Core or ac for Audubon Core): ')</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">parameterValue ='''prefix skos: <http://www.w3.org/2004/02/skos/core#></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">prefix dcterms: <http://purl.org/dc/terms/></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">SELECT DISTINCT ?iri ?definition</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">WHERE {</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> GRAPH <http://rs.tdwg.org/> {</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> <http://rs.tdwg.org/'''</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">parameterValue += vocab</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">parameterValue += '''/> dcterms:hasPart ?termList.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> ?termList dcterms:hasPart ?iri.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> ?iri skos:prefLabel ?label.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> ?iri skos:definition ?definition.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> FILTER(lang(?label)="en")</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> FILTER(lang(?definition)="en")</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> }</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">}</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">ORDER BY ?iri'''</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">endpointUrl = 'https://sparql.vanderbilt.edu/sparql'</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">requestHeaderValue = 'application/json'</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"># make the HTTP request and store the terms data in a list</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">r = requests.get(endpointUrl, headers={'Accept': requestHeaderValue}, params={'query': parameterValue})</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">responseBody = r.json()</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">items = responseBody['results']['bindings']</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"># iterate through the list and print what we wanted</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">for item in items:</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> print(item['iri']['value'])</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> print(item['definition']['value'])</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> print()</span></div>
</div>
<div>
<br /></div>
</div>
<div>
------</div>
<div>
<br /></div>
<div>
The awesome Python <span style="font-family: "courier new" , "courier" , monospace;">requests </span>module allows you to pass the parameters to the <span style="font-family: "courier new" , "courier" , monospace;">.get()</span> method as a dict, so you don't have to construct the entire URL yourself. The values you pass are automatically URL-encoded, which also eliminates the need to do the encoding yourself. As a result, I was able to create the parameter value by assigning multi-line strings that are formatted in a much more readable way. Since the only header we should ever need to send is <span style="font-family: "courier new" , "courier" , monospace;">Accept</span> and the only parameter key we should need is <span style="font-family: "courier new" , "courier" , monospace;">query</span>, I just hard-coded them into the corresponding dicts of the <span style="font-family: "courier new" , "courier" , monospace;">.get() </span>method. I left the value for the <span style="font-family: "courier new" , "courier" , monospace;">Accept </span>request header as a variable (<span style="font-family: "courier new" , "courier" , monospace;">requestHeaderValue</span>) in case anybody wants to play with requesting XML or a CSV table.</div>
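<div>
<br /></div>
<div>
If you want to see exactly what <span style="font-family: "courier new" , "courier" , monospace;">requests</span> is doing for you, here is a tiny sketch (using a trivial made-up query, just for illustration) that prints the fully URL-encoded request URL that <span style="font-family: "courier new" , "courier" , monospace;">requests</span> constructs from the params dict:</div>
<div>
------</div>
<div>
<br /></div>
<span style="font-family: "courier new" , "courier" , monospace;">import requests</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;"># a trivial query, just to illustrate the automatic URL-encoding</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">testQuery = 'SELECT * WHERE {?s ?p ?o} LIMIT 1'</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">r = requests.get('https://sparql.vanderbilt.edu/sparql', headers={'Accept': 'application/json'}, params={'query': testQuery})</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">print(r.url)  # the full request URL, with the query URL-encoded for you</span><br />
<div>
<br /></div>
<div>
------</div>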
<div>
<br /></div>
<div>
We can now request different kinds of data from the API by changing the query-building section of the script (everything from the <span style="font-family: "courier new" , "courier" , monospace;">vocab = input(...)</span> prompt through the line ending in <span style="font-family: "courier new" , "courier" , monospace;">ORDER BY ?iri'''</span>). </div>
<div>
<br /></div>
<h3>
Multilingual labels and definitions </h3>
<div>
To retrieve the label and definition for a Darwin Core term in a particular language, substitute these lines for that query-building section:</div>
<div>
------</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">localName = input('Enter the local name of a Darwin Core term: ')</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">language = input('Enter the two-letter code for the language you want (en, es, zh-hans): ')</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">parameterValue ='''prefix skos: <http://www.w3.org/2004/02/skos/core#></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">SELECT DISTINCT ?label ?definition</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">WHERE {</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> GRAPH <http://rs.tdwg.org/> {</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> BIND(IRI("http://rs.tdwg.org/dwc/terms/'''</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">parameterValue += localName</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">parameterValue += '''") as ?iri)</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> BIND("'''</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">parameterValue += language</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">parameterValue += '''" as ?language)</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> ?iri skos:prefLabel ?label.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> ?iri skos:definition ?definition.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> FILTER(lang(?label)=?language)</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> FILTER(lang(?definition)=?language)</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> }</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">}'''</span></div>
</div>
<div>
<br /></div>
<div>
------</div>
<div>
The printout section needs to be changed, since we asked for a label instead of an IRI:</div>
<div>
------</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">for item in items:</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> print()</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> print(item['label']['value'])</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> print(item['definition']['value'])</span></div>
</div>
<div>
<br /></div>
<div>
------</div>
<div>
The "local name" asked for by the script is the last part of a Darwin Core IRI. For example, the local name for <span style="font-family: "courier new" , "courier" , monospace;">dwc:recordedBy</span> (that is, <span style="font-family: "courier new" , "courier" , monospace;">http://rs.tdwg.org/dwc/terms/recordedBy</span>) would be <span style="font-family: "courier new" , "courier" , monospace;">recordedBy</span>. (You can find more local names to try <a href="http://rs.tdwg.org/dwc/terms/" target="_blank">here</a>.)<br />
<br />
Other than English, we currently only have translations of term labels and definitions in Spanish and simplified Chinese. We also only have translations of <span style="font-family: "courier new" , "courier" , monospace;">dwc:</span> namespace terms from Darwin Core and not <span style="font-family: "courier new" , "courier" , monospace;">dwciri:</span>, <span style="font-family: "courier new" , "courier" , monospace;">dc:</span>, or <span style="font-family: "courier new" , "courier" , monospace;">dcterms:</span> terms. So this resource is currently somewhat limited, but could get better in the future with the addition of other languages to the dataset.</div>
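<div>
<br /></div>
<div>
To make the string concatenation concrete, if you entered <span style="font-family: "courier new" , "courier" , monospace;">recordedBy</span> and <span style="font-family: "courier new" , "courier" , monospace;">es</span> at the prompts, the assembled <span style="font-family: "courier new" , "courier" , monospace;">parameterValue</span> string would contain (ignoring minor whitespace differences) this query:</div>
<div>
------</div>
<div>
<br /></div>
<span style="font-family: "courier new" , "courier" , monospace;">prefix skos: <http://www.w3.org/2004/02/skos/core#></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">SELECT DISTINCT ?label ?definition</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">WHERE {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">  GRAPH <http://rs.tdwg.org/> {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">    BIND(IRI("http://rs.tdwg.org/dwc/terms/recordedBy") as ?iri)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">    BIND("es" as ?language)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">    ?iri skos:prefLabel ?label.</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">    ?iri skos:definition ?definition.</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">    FILTER(lang(?label)=?language)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">    FILTER(lang(?definition)=?language)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">  }</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
<div>
<br /></div>
<div>
------</div>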
<div>
<br /></div>
<h3>
Track the history of any TDWG term to the beginning of the universe</h3>
<div>
The user sends the full IRI of any term ever created by TDWG, and the API will return the term name, version date of issue, definition, and status of every version that was a precursor of that term. Again, replace the query-building section with this:</div>
<div>
------</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">iri = input('Enter the unabbreviated IRI of a TDWG vocabulary term: ')</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">parameterValue ='''prefix dcterms: <http://purl.org/dc/terms/></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">prefix skos: <http://www.w3.org/2004/02/skos/core#></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">prefix tdwgutility: <http://rs.tdwg.org/dwc/terms/attributes/></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">SELECT DISTINCT ?term ?date ?definition ?status</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">WHERE {</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> GRAPH <http://rs.tdwg.org/> {</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> <'''</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">parameterValue += iri</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">parameterValue += '''> dcterms:hasVersion ?directVersion.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> ?directVersion dcterms:replaces* ?version.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> ?version dcterms:issued ?date.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> ?version tdwgutility:status ?status.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> ?version dcterms:isVersionOf ?term.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> ?version skos:definition ?definition.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> FILTER(lang(?definition)="en")</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> }</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">}</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">ORDER BY DESC(?date)'''</span></div>
</div>
<div>
<br /></div>
<div>
------</div>
<div>
and replace the printout section with this:</div>
<div>
------</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">for item in items:</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> print()</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> print(item['date']['value'])</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> print(item['term']['value'])</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> print(item['definition']['value'])</span></div>
</div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> print(item['status']['value'])</span></div>
</div>
<div>
<br /></div>
<div>
------</div>
<div>
<br /></div>
<div>
The results of this query allow you to see every previous term that might have been used in the past to refer to this concept, and to see how the definitions of those earlier terms differed from that of the target term. You should try it with everyone's favorite confusing term, <span style="font-family: "courier new" , "courier" , monospace;">dwc:basisOfRecord</span>, which has the unabbreviated IRI <span style="font-family: "courier new" , "courier" , monospace;">http://rs.tdwg.org/dwc/terms/basisOfRecord</span>. </div>
<div>
<br /></div>
<div>
You can make a simple modification to the script to have the call return every term that has ever been used to replace an obsolete term, along with the definitions of every version of those terms. Just replace <span style="font-family: "courier new" , "courier" , monospace;">dcterms:replaces*</span> with <span style="font-family: "courier new" , "courier" , monospace;">dcterms:isReplacedBy*</span> in the second <span style="font-family: "courier new" , "courier" , monospace;">parameterValue</span> string. If you want the results ordered from oldest to newest, you can also replace <span style="font-family: "courier new" , "courier" , monospace;">DESC(?date)</span> with <span style="font-family: "courier new" , "courier" , monospace;">ASC(?date)</span> (the modified lines are shown below). Try it with this refugee from the past: <span style="font-family: "courier new" , "courier" , monospace;">http://digir.net/schema/conceptual/darwin/2003/1.0/YearCollected</span>.</div>
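<div>
<br /></div>
<div>
With those two changes, the affected lines of the query become:</div>
<div>
------</div>
<div>
<br /></div>
<span style="font-family: "courier new" , "courier" , monospace;"> ?directVersion dcterms:isReplacedBy* ?version.</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">...</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">ORDER BY ASC(?date)'''</span><br />
<div>
<br /></div>
<div>
------</div>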
<div>
<br /></div>
<h3>
What are all of the TDWG Standards documents?</h3>
<div>
This version of the script lets you enter any part of a TDWG standard's name, and it will retrieve all of the documents that are part of that standard, tell you the date each was last modified, and give the URL where you might be able to find it (some are print only, and at least one, XDF, seems to be lost entirely). Press Enter without any text and you will get all of them. Here's the code to generate the parameter value:</div>
<div>
------</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">searchString = input('Enter part of the standard name, or press Enter for all: ')</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">parameterValue ='''PREFIX foaf: <http://xmlns.com/foaf/0.1/></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">PREFIX dcterms: <http://purl.org/dc/terms/></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">SELECT DISTINCT ?docLabel ?date ?stdLabel ?url</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">FROM <http://rs.tdwg.org/></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">WHERE {</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> ?standard a dcterms:Standard.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> ?standard rdfs:label ?stdLabel.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> ?standard dcterms:hasPart ?document.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> ?document a foaf:Document.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> ?document rdfs:label ?docLabel.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> ?document rdfs:seeAlso ?url.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> ?document dcterms:modified ?date.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> FILTER(lang(?stdLabel)="en")</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> FILTER(lang(?docLabel)="en")</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> FILTER(contains(?stdLabel, "'''</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">parameterValue += searchString</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">parameterValue += '''")) </span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">}</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">ORDER BY ?stdLabel'''</span></div>
<div>
<br /></div>
<div>
------</div>
<div>
and here's the printout section:</div>
<div>
------</div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">for item in items:</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> print()</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> print(item['docLabel']['value'])</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> print(item['date']['value'])</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> print(item['stdLabel']['value'])</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> print(item['url']['value'])</span></div>
</div>
<div>
<br /></div>
<div>
------</div>
<div>
<br /></div>
<div>
<b>Note: </b>the URLs that are returned are access URLs, NOT the IRI identifiers for the documents!<br />
<br />
The following is a variation of the API call above. In this variation, you enter part of the name of a standard at the same prompt as before (or press Enter for all), and you retrieve the names of all of the contributors (whose roles might have included author, editor, translator, reviewer, or review manager). Parameter value code (the <span style="font-family: "courier new" , "courier" , monospace;">searchString</span> input line from the previous example is kept):</div>
<div>
<br /></div>
<div>
------</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">parameterValue ='''PREFIX foaf: <http://xmlns.com/foaf/0.1/></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">PREFIX dcterms: <http://purl.org/dc/terms/></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">SELECT DISTINCT ?contributor ?stdLabel</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">FROM <http://rs.tdwg.org/></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">WHERE {</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> ?standard a dcterms:Standard.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> ?standard rdfs:label ?stdLabel.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> ?standard dcterms:hasPart ?document.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> ?document a foaf:Document.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> ?document dcterms:contributor ?contribUri.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> ?contribUri rdfs:label ?contributor.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> FILTER(contains(?stdLabel, "'''</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">parameterValue += searchString</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">parameterValue += '''")) </span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">}</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">ORDER BY ?contributor'''</span></div>
</div>
<div>
<br /></div>
<div>
------</div>
<div>
Printout code:</div>
<div>
------</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">for item in items:</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> print()</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> print(item['contributor']['value'])</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> print(item['stdLabel']['value'])</span></div>
</div>
<div>
<br /></div>
<div>
------</div>
<div>
<br />
<b>Note: </b>assembling this list of documents was my best shot at determining which documents should be considered part of the standards themselves, as opposed to ancillary documents that are not part of the standards. I might have missed some, or included some that aren't considered key to the standards. This is more of a problem with older documents whose status was not clearly designated.<br />
<br /></div>
<div>
The SDS isn't very explicit about how to assign all of the properties that should probably be assigned to documents, so some information that might be important is missing, such as documentation of contributor roles. Also, I could not determine who all of the review managers were or where the authoritative locations were for all documents, nor could I find prior versions for some documents. So this part of the TDWG standards metadata still needs some work. </div>
<div>
<br /></div>
<h3>
An actual Linked Data application</h3>
<div>
In the previous examples, the data involved was limited to metadata about TDWG standards. However, we can make an API call that is a bona fide application of Linked Data. Data from the <a href="http://bioimages.vanderbilt.edu/" target="_blank">Bioimages project</a> are available as RDF/XML. You can examine the human-readable web page of an image at <a href="http://bioimages.vanderbilt.edu/thomas/0488-01-01">http://bioimages.vanderbilt.edu/thomas/0488-01-01</a> and the corresponding RDF/XML <a href="http://bioimages.vanderbilt.edu/thomas/0488-01-01.rdf" target="_blank">here</a>. Both the human- and machine-readable versions of the image metadata use either Darwin Core or Audubon Core terms as most of their properties. However, the Bioimages metadata do not provide an explanation of what those TDWG vocabulary terms mean. </div>
<div>
<br /></div>
<div>
Both the Bioimages and TDWG metadata datasets have been loaded into the Vanderbilt Libraries SPARQL API, and we can include both datasets in the query's dataset using the FROM keyword. That allows us to make use of information from the TDWG dataset in the query of the Bioimages data because the two datasets are linked by use of common term IRIs. In the query, we can ask for the metadata values for the image (from the Bioimages dataset), but include the definition of the properties (from the TDWG dataset; not present in the Bioimages dataset). </div>
<div>
------</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">iri = input('Enter the unabbreviated IRI of an image from Bioimages: ')</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">parameterValue ='''PREFIX dcterms: <http://purl.org/dc/terms/></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">PREFIX skos: <http://www.w3.org/2004/02/skos/core#></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">SELECT DISTINCT ?label ?value ?definition</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">FROM <http://rs.tdwg.org/></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">FROM <http://bioimages.vanderbilt.edu/images></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">WHERE {</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> <'''</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">parameterValue += iri</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">parameterValue += '''> ?property ?value.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> ?property skos:prefLabel ?label.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> ?property skos:definition ?definition.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> FILTER(lang(?label)="en")</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> FILTER(lang(?definition)="en")</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">}'''</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
------</div>
<div>
Printout code:</div>
<div>
------</div>
<div>
<br /></div>
<div>
</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">for item in items:</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> print()</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> print(item['label']['value'])</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> print(item['value']['value'])</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> print(item['definition']['value'])</span></div>
</div>
<div>
<br /></div>
<div>
------</div>
<div>
<br />
You can try this script out on the example IRI I gave above (<span style="font-family: "courier new" , "courier" , monospace;">http://bioimages.vanderbilt.edu/thomas/0488-01-01</span>) or on any other image identifier in the collection (listed under "Refer to this permanent identifier for the image:" on any of the image metadata pages that you get to by clicking on an image thumbnail). </div>
<h2>
Conclusion</h2>
<div>
Hopefully, these examples have given you a taste of the kind of metadata about TDWG standards that can be retrieved using an API. There are several final issues that I should discuss before I wrap up this post. I'm going to present them in a Q&A format.</div>
<div>
<br /></div>
<div>
Q: <b>Can I build an application to use this API?</b></div>
<div>
A: Yes, you could. We intend for the Vanderbilt SPARQL API to remain up indefinitely at the endpoint URL given in the examples. However, we can't make a hard promise about that, and the API is not set up to handle large amounts of traffic. There aren't any usage limits, and consequently it has already been crashed once by someone who hit it really hard. So if you need a robust service, you should probably set up your own installation of Blazegraph and populate it with the TDWG dataset.</div>
<div>
<br /></div>
<div>
Q: <b>How can I get a dump of the TDWG data to populate my own version of the API?</b></div>
<div>
A: The simplest way is to send this query to the Vanderbilt SPARQL API as in the examples above, with an <span style="font-family: "courier new" , "courier" , monospace;">Accept</span> header of <span style="font-family: "courier new" , "courier" , monospace;">text/turtle</span>:</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">CONSTRUCT {?s ?p ?o}</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">FROM <http://rs.tdwg.org/></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">WHERE {?s ?p ?o}</span></div>
</div>
<div>
<br /></div>
<div>
URL-encoded, the query is:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">CONSTRUCT%20%7B%3Fs%20%3Fp%20%3Fo%7D%0AFROM%20%3Chttp%3A%2F%2Frs.tdwg.org%2F%3E%0AWHERE%20%7B%3Fs%20%3Fp%20%3Fo%7D</span><br />
<br />
If you use <a href="https://www.getpostman.com/" target="_blank">Postman</a>, you can drop down the Send button to <span style="font-family: "courier new" , "courier" , monospace;">Send and Download</span> and save the data in a file, which you can upload into your own instance of Blazegraph or some other SPARQL endpoint/triplestore. (There are approximately 43000 statements (triples) in the dataset, so copy and paste is not a great method of putting them into a file.) If your triplestore doesn't support RDF/Turtle, you can get RDF/XML instead by using an <span style="font-family: "courier new" , "courier" , monospace;">Accept</span> header of <span style="font-family: "courier new" , "courier" , monospace;">application/rdf+xml</span>.<br />
<br />
There is a better method of acquiring the data that uses the authoritative source data, but I'll have to describe that in a subsequent post.</div>
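<div>
<br /></div>
<div>
If you would rather script the download than use Postman, here is a minimal Python sketch (it uses the same <span style="font-family: "courier new" , "courier" , monospace;">requests</span> module as the scripts above; the output file name is just an example) that saves the RDF/Turtle dump to a file:</div>
<div>
------</div>
<div>
<br /></div>
<span style="font-family: "courier new" , "courier" , monospace;">import requests</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">query = '''CONSTRUCT {?s ?p ?o}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">FROM <http://rs.tdwg.org/></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">WHERE {?s ?p ?o}'''</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;"># request the whole graph as RDF/Turtle and save it to a file</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">r = requests.get('https://sparql.vanderbilt.edu/sparql', headers={'Accept': 'text/turtle'}, params={'query': query})</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">with open('rs-tdwg-org.ttl', 'wt', encoding='utf-8') as outFile:</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">    outFile.write(r.text)</span><br />
<div>
<br /></div>
<div>
------</div>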
<div>
<br /></div>
<div>
Q: <b>How accurate are the data?</b></div>
<div>
A: I've spent many, many hours over the last several years curating the <a href="https://github.com/tdwg/rs.tdwg.org" target="_blank">source data in GitHub</a>. Nevertheless, I still discover errors almost every time I try new queries on the data. If you discover errors, put them in the <a href="https://github.com/tdwg/rs.tdwg.org/issues" target="_blank">issues tracker</a> and I'll try to fix them.</div>
<div>
<br /></div>
<div>
Q: <b>How would this work for future controlled vocabularies?</b></div>
<div>
A: This is a really important question. It's so important that I'm going to address it in a subsequent post in the series.</div>
<div>
<br /></div>
<div>
Q: <b>How can I retrieve information from the API about resources that weren't described in the examples?</b></div>
<div>
A: Since a SPARQL endpoint is essentially a program-it-yourself API, all you need is the right SPARQL query to retrieve the information you want. First you need to have a clear idea of the question you want to answer. Then you've got two options: find someone who knows how to write SPARQL queries and get them to write the query for you, or teach yourself how to write SPARQL queries and do it yourself. You can test your queries by pasting them into the box at <a href="https://sparql.vanderbilt.edu/">https://sparql.vanderbilt.edu/</a> as you build them. It is not possible to create the queries without understanding the underlying data model (the graph model) and the machine-readable properties assigned to each kind of resource. That's why I wrote the first (boring) parts of this series and why we wrote the specification itself.</div>
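<div>
<br /></div>
<div>
As a small example of the kind of ad hoc question you can answer once you understand the model, here is a query (a sketch built only from the graph patterns used earlier in this post) that counts the terms in each term list of Darwin Core. You can paste it directly into the box at <a href="https://sparql.vanderbilt.edu/">https://sparql.vanderbilt.edu/</a> to try it:</div>
<div>
------</div>
<div>
<br /></div>
<span style="font-family: "courier new" , "courier" , monospace;">prefix dcterms: <http://purl.org/dc/terms/></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">SELECT ?termList (COUNT(?iri) AS ?termCount)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">FROM <http://rs.tdwg.org/></span><br />
<span style="font-family: "courier new" , "courier" , monospace;">WHERE {</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">  <http://rs.tdwg.org/dwc/> dcterms:hasPart ?termList.</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">  ?termList dcterms:hasPart ?iri.</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">}</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">GROUP BY ?termList</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">ORDER BY DESC(?termCount)</span><br />
<div>
<br /></div>
<div>
------</div>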
<div>
<br /></div>
<div>
Q: <b>Where did the data in the dataset come from and how is it managed?</b></div>
<div>
A: That is an excellent question. Actually it is several questions:</div>
<div>
<br /></div>
<div>
- where does the data come from? (answer: the <a href="https://github.com/tdwg/rs.tdwg.org" target="_blank">source csv tables in GitHub</a>)</div>
<div>
- how does the source data get turned into machine-readable data?</div>
<div>
- how does the machine-readable data get into the API?</div>
<div>
<br /></div>
<div>
One of the beauties of <a href="https://en.wikipedia.org/wiki/Representational_state_transfer" target="_blank">REST</a> is that when you request a URI from a server, you should be able to get a useful response from the server without having to worry about how the server generates that response. What that means in this context is that the intermediate steps that lie between the source data and what comes out of the API (the answers to the second and third questions above) can change and the client should never notice the difference since it would still be able to get exactly the same response. That's because the processing essentially involves implementing a mapping between what's in the tables on GitHub and what the SDS says the standardized machine-readable metadata should look like. There is no one particular way that mapping must happen, as long as the end result is the same. I will discuss this point in what will probably be the last post of the series.</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
Steve Baskaufhttp://www.blogger.com/profile/01896499749604153763noreply@blogger.com0tag:blogger.com,1999:blog-5299754536670281996.post-75417362575066228542019-04-02T21:17:00.000-07:002020-03-04T19:30:32.073-08:00Understanding the TDWG Standards Documentation Specification, Part 3: Machine-readable Metadata Via Content NegotiationThis is the third in a series of posts about the TDWG Standards Documentation Specification (SDS). For background on the SDS, see the <a href="http://baskauf.blogspot.com/2019/03/understanding-tdwg-standards.html" target="_blank">first post</a>. For information on its hierarchical model and how it relates to IRI design, see the <a href="http://baskauf.blogspot.com/2019/03/understanding-tdwg-standards_10.html" target="_blank">second post</a>.<br />
<br />
Note: this post was revised on 2020-03-04 when IRI dereferencing of the http://rs.tdwg.org/ subdomain went from testing into production.<br />
<br />
<br />
<h2>
Human- vs. Machine-readable metadata</h2>
In the previous posts, I made the point that the SDS considers standards-related resources such as standards, vocabularies, term lists, terms, and documents to be abstract entities (<a href="https://github.com/tdwg/vocab/blob/master/sds/documentation-specification.md#21-abstract-resources-and-representations" target="_blank">section 2.1</a>). As such, the IRI assigned to a resource denotes that resource in its abstract form. That abstract resource does not have one particular <i>representation</i> -- rather it can have multiple representation syntaxes which differ in format, but which in most cases provide equivalent information.<br />
<br />
For example, consider the deprecated Darwin Core term <span style="font-family: "courier new" , "courier" , monospace;">dwccuratorial:Disposition</span>. It is denoted by the IRI <span style="font-family: "courier new" , "courier" , monospace;">http://rs.tdwg.org/dwc/curatorial/Disposition</span>. The metadata for this term in human-readable form looks like this:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">Term Name: dwccuratorial:Disposition</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Label: Disposition</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Term IRI: http://rs.tdwg.org/dwc/curatorial/Disposition</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Term version IRI: http://rs.tdwg.org/dwc/curatorial/version/Disposition-2007-04-17</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Modified: 2009-04-24</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Definition: The current disposition of the cataloged item. Examples: "in collection", "missing", "voucher elsewhere", "duplicates elsewhere".</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Type: Property</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Note: This term is no longer recommended for use.</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">Is replaced by: http://rs.tdwg.org/dwc/terms/disposition</span><br />
<br />
In the RDF/Turtle machine-readable serialization, the metadata looks like this (namespace prefix declarations omitted):<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;"><http://rs.tdwg.org/dwc/curatorial/Disposition></span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> rdfs:isDefinedBy <http://rs.tdwg.org/dwc/curatorial/>;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> dcterms:isPartOf <http://rs.tdwg.org/dwc/curatorial/>;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> dcterms:created "2007-04-17"^^xsd:date;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> dcterms:modified "2009-04-24"^^xsd:date;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> owl:deprecated "true"^^xsd:boolean;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> rdfs:label "Disposition"@en;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> skos:prefLabel "Disposition"@en;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> rdfs:comment "The current disposition of the cataloged item. Examples: \"in collection\", \"missing\", \"voucher elsewhere\", \"duplicates elsewhere\"."@en;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> skos:definition "The current disposition of the cataloged item. Examples: \"in collection\", \"missing\", \"voucher elsewhere\", \"duplicates elsewhere\"."@en;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> rdf:type <http://www.w3.org/1999/02/22-rdf-syntax-ns#Property>;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> tdwgutility:abcdEquivalence "DataSets/DataSet/Units/Unit/SpecimenUnit/Disposition";</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> dcterms:hasVersion <http://rs.tdwg.org/dwc/curatorial/version/Disposition-2007-04-17>;</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"> dcterms:isReplacedBy <http://rs.tdwg.org/dwc/terms/disposition>.</span><br />
<div>
<br /></div>
<div>
In RDF/XML machine-readable form, the metadata looks like this (namespace declarations omitted):</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><rdf:Description rdf:about="http://rs.tdwg.org/dwc/curatorial/Disposition"></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> <rdfs:isDefinedBy rdf:resource="http://rs.tdwg.org/dwc/curatorial/"/></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> <dcterms:isPartOf rdf:resource="http://rs.tdwg.org/dwc/curatorial/"/></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> <dcterms:created rdf:datatype="http://www.w3.org/2001/XMLSchema#date">2007-04-17</dcterms:created></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> <dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#date">2009-04-24</dcterms:modified></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> <owl:deprecated rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean">true</owl:deprecated></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> <rdfs:label xml:lang="en">Disposition</rdfs:label></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> <skos:prefLabel xml:lang="en">Disposition</skos:prefLabel></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> <rdfs:comment xml:lang="en">The current disposition of the cataloged item. Examples: "in collection", "missing", "voucher elsewhere", "duplicates elsewhere".</rdfs:comment></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> <skos:definition xml:lang="en">The current disposition of the cataloged item. Examples: "in collection", "missing", "voucher elsewhere", "duplicates elsewhere".</skos:definition></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> <rdf:type rdf:resource="http://www.w3.org/1999/02/22-rdf-syntax-ns#Property"/></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> <tdwgutility:abcdEquivalence>DataSets/DataSet/Units/Unit/SpecimenUnit/Disposition</tdwgutility:abcdEquivalence></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> <dcterms:hasVersion rdf:resource="http://rs.tdwg.org/dwc/curatorial/version/Disposition-2007-04-17"/></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"> <dcterms:isReplacedBy rdf:resource="http://rs.tdwg.org/dwc/terms/disposition"/></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"></rdf:Description></span></div>
</div>
<div>
<br /></div>
<div>
For brevity, I'll omit the JSON-LD serialization. If you make a careful comparison of the two machine-readable serializations shown here, you'll see that they contain exactly the same information. </div>
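<div>
<br /></div>
<div>
If you want to check that claim programmatically, here is a small sketch using the Python <span style="font-family: "courier new" , "courier" , monospace;">rdflib</span> package (not part of the standard library; installable with pip). It retrieves the Turtle and RDF/XML representations from the content-specific URLs described later in this post and reports whether the two graphs are isomorphic, that is, whether they contain the same triples:</div>
<div>
<br /></div>
<span style="font-family: "courier new" , "courier" , monospace;">from rdflib import Graph</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">from rdflib.compare import isomorphic</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;"># parse the two serializations of the same abstract term</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">turtleGraph = Graph().parse('http://rs.tdwg.org/dwc/curatorial/Disposition.ttl', format='turtle')</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">xmlGraph = Graph().parse('http://rs.tdwg.org/dwc/curatorial/Disposition.rdf', format='xml')</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">print(isomorphic(turtleGraph, xmlGraph))</span><br />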
<div>
<br /></div>
<div>
The SDS requires that when a machine consumes any machine-readable serialization, it acquire information identical to any other serialization (<a href="https://github.com/tdwg/vocab/blob/master/sds/documentation-specification.md#4-machine-readable-documents" target="_blank">section 4</a>). For most resources (terms, vocabularies, etc.), the human-readable representation generally contains the same information as the machine-readable serializations for all of the key properties required by the SDS, although some that aren't required, such as the abcdEquivalence, are omitted. The exception to this is standards-related documents -- the human-readable representation is the <b>document itself</b>, while the machine-readable representations are <b>metadata about</b> the document. (In contrast, machine-readable metadata about vocabularies, term lists, and terms contain virtually complete data about the resource.) </div>
<div>
<br /></div>
<h2>
Distinguishing between resources and the documents that describe them</h2>
<div>
<a href="https://github.com/tdwg/vocab/blob/master/sds/documentation-specification.md#4-machine-readable-documents" target="_blank">Section 4.1</a> of the SDS requires that machine-readable documents must have IRIs that are different from the IRIs of the abstract resources that they describe. Although at first it many not be apparent why this is important, we can see why if we consider the case of some of the older TDWG standards documents. For instance, the document <i>Floristic Regions of the World</i> (denoted by the IRI <span style="font-family: "courier new" , "courier" , monospace;"><a href="http://rs.tdwg.org/frw/doc/book/" target="_blank">http://rs.tdwg.org/frw/doc/book/</a></span>) by A. L. Takhtahan was adopted as part of TDWG standard <span style="font-family: "courier new" , "courier" , monospace;">http://www.tdwg.org/standards/104</span>. It is copyrighted by the University of California Press and is not available under an open license. However, the metadata about the book in RDF/Turtle serialization (denoted by the IRI <span style="font-family: "courier new" , "courier" , monospace;"><a href="http://rs.tdwg.org/frw/doc/book.ttl" target="_blank">http://rs.tdwg.org/frw/doc/book.ttl</a></span>) is freely available. So we could make the statement</div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">http://rs.tdwg.org/frw/doc/book.ttl dcterms:license https://creativecommons.org/publicdomain/zero/1.0/ .</span></div>
<div>
<br /></div>
<div>
but it would NOT be accurate to make the statement </div>
<div>
<br /></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">http://rs.tdwg.org/frw/doc/book/ dcterms:license https://creativecommons.org/publicdomain/zero/1.0/ .</span></div>
<div>
<br /></div>
<div>
because the book isn't licensed as CC0. Similarly, it would be correct to say:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">http://rs.tdwg.org/frw/doc/book/</span><span style="font-family: "courier new" , "courier" , monospace;"> dc:creator "A. L. Takhtahan" .</span><br />
<br />
but not:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">http://rs.tdwg.org/frw/doc/book/</span><span style="font-family: "courier new" , "courier" , monospace;"> dc:creator "Biodiversity Information Standards (TDWG)" .</span><br />
<br />
because TDWG did not create the book. On the other hand, saying:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">http://rs.tdwg.org/frw/doc/book.ttl</span><span style="font-family: "courier new" , "courier" , monospace;"> dc:creator "Biodiversity Information Standards (TDWG)" .</span><br />
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
would be correct, since TDWG did create the RDF/Turtle metadata document that describes the book.<br />
<br />
Although in human-readable documents we tend to be fuzzy about the distinction between resources and the metadata about those resources, when we create machine-readable metadata representations we need to be careful to distinguish between the two.<br />
<br />
The SDS prescribes a way to link metadata documents and the resources they are about: <span style="font-family: "courier new" , "courier" , monospace;">dcterms:references</span> and <span style="font-family: "courier new" , "courier" , monospace;">dcterms:isReferencedBy</span> (<a href="https://github.com/tdwg/vocab/blob/master/sds/documentation-specification.md#41-identifying-a-resource-and-the-machine-readable-document-that-describes-it" target="_blank">section 4.1</a>). In the example above, we can say:<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">http://rs.tdwg.org/frw/doc/book.ttl dcterms:references </span><span style="font-family: "courier new" , "courier" , monospace;">http://rs.tdwg.org/frw/doc/book/ .</span><br />
<br />
and<br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace;">http://rs.tdwg.org/frw/doc/book/ dcterms:isReferencedBy </span><span style="font-family: "courier new" , "courier" , monospace;">http://rs.tdwg.org/frw/doc/book.ttl .</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<div>
<h2>
Content negotiation</h2>
</div>
<div>
As I explained in the <a href="http://baskauf.blogspot.com/2019/03/understanding-tdwg-standards_10.html" target="_blank">second post of this series</a>, IRIs are fundamentally identifiers. There is no requirement that an IRI actually dereference to retrieve a web page or any other kind of document, although if it did, that would be nice, since that's the kind of behavior that people expect, particularly if the IRI begins with "http://" or "https://". If you think about it, defining TDWG IRIs to denote an abstract conceptual thing is a bit of a problem, because only non-abstract files can actually be returned to a user from a server through the Internet. You can't retrieve an abstract thing like the emotion "love" or the concept "justice" through the Internet, although you could certainly mint IRIs to denote those kinds of things.<br />
<br />
The standard practice when an IRI denotes a resource that is a physical object or abstract idea is to redirect the user to a <b>document </b>that is <b>about </b>the object or idea. Such a document containing descriptive metadata about the resource is called a <i>representation </i>of the resource. Users can specify what kind of document (human- or machine-readable) they want, and more specifically, the serialization that they want if they are asking for a machine-readable document. This process is called <i>content negotiation</i>.<br />
<br />
Indefinite resolution of permanent identifiers is specified by Recommendation 7 of the <a href="https://github.com/tdwg/guid-as/blob/master/guid/tdwg_guid_applicability_statement.pdf" target="_blank">TDWG Globally Unique Identifier (GUID) Applicability Statement</a> standard, although that standard does not go into the details of how the resolution should happen. <a href="https://github.com/tdwg/vocab/blob/master/sds/documentation-specification.md#2-the-structure-of-tdwg-standards" target="_blank">Sections 2.1.1 and 2.1.2</a> of the SDS expand on the GUID AS by saying that the abstract IRI should be stable and generic, and that content negotiation should redirect the user to an IRI for a particular content type, which serves as a URL that can be used to retrieve a document of the content type the user wanted. That requirement is based on widespread practice in the Linked Data community as expressed in the 2008 W3C Note "<a href="https://www.w3.org/TR/cooluris/">Cool URIs for the Semantic Web</a>".<br />
<br />
The SDS does not specify a particular way that this redirection should be accomplished, but given that it's desirable to support as many different serializations as possible, I chose to implement the "<a href="https://www.w3.org/TR/cooluris/#r303uri" target="_blank">303 URIs forwarding to Different Documents</a>" recipe described in the Cool URIs document. Here are the specific details:<br />
<br />
1. Client software performs an HTTP GET request for the abstract IRI of the resource and includes an <span style="font-family: "courier new" , "courier" , monospace;">Accept</span> header that specifies the content type that it wants.<br />
<br />
2. The server responds with an HTTP status code of 303 and includes the URL for the specific content type requested. To construct the redirect URL, any abstract IRIs with trailing slashes first have the trailing slash removed. If <span style="font-family: "courier new" , "courier" , monospace;">text/html</span> is requested (i.e. human-readable web page), <span style="font-family: "courier new" , "courier" , monospace;">.htm</span> is appended to the IRI to form the redirect URL. If <span style="font-family: "courier new" , "courier" , monospace;">text/turtle</span> is requested, <span style="font-family: "courier new" , "courier" , monospace;">.ttl</span> is appended. If <span style="font-family: "courier new" , "courier" , monospace;">application/rdf+xml</span> is requested, <span style="font-family: "courier new" , "courier" , monospace;">.rdf</span> is appended. If <span style="font-family: "courier new" , "courier" , monospace;">application/ld+json</span> is requested, <span style="font-family: "courier new" , "courier" , monospace;">.json</span> is appended.<br />
<br />
3. The client then requests the specific redirect URL and the server returns the appropriate document in the serialization requested. In this stage, the <span style="font-family: "courier new" , "courier" , monospace;">Accept</span> header is ignored by the server. In the case of standards documents and current terms in Darwin and Audubon Cores, there typically will be an additional redirect to a web page that isn't generated programmatically by the rs.tdwg.org server and might be located anywhere.<br />
<br />
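Before looking at a manual example, here is a minimal Python sketch of what this recipe looks like from the client side, using the requests library with redirect-following turned off so that the intermediate 303 response is visible. This is only an illustration of the recipe, not the server's implementation; the values in the comments are what the steps above predict.<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">import requests</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"># step 1: request the abstract IRI with an Accept header, without following redirects</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">r = requests.get('http://rs.tdwg.org/dwc/', headers={'Accept': 'text/turtle'}, allow_redirects=False)</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">print(r.status_code)  # expect 303</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">print(r.headers['Location'])  # expect http://rs.tdwg.org/dwc.ttl</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"># step 3: request the content-type-specific URL supplied by the server</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">doc = requests.get(r.headers['Location'])</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">print(doc.headers['Content-Type'])  # expect text/turtle</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">print(doc.text[:300])  # beginning of the Turtle document</span><br />
<br />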
We can test the behavior using <a href="https://curl.haxx.se/" target="_blank">curl</a> or a graphical HTTP client like <a href="https://www.getpostman.com/" target="_blank">Postman</a>. Here is an example using Postman (with automatic following of redirects turned off):<br />
<br />
1. Client requests metadata about the basic Darwin Core vocabulary by HTTP GET to the generic IRI: <span style="font-family: "courier new" , "courier" , monospace;">http://rs.tdwg.org/dwc/</span> with an <span style="font-family: "courier new" , "courier" , monospace;">Accept</span> header of <span style="font-family: "courier new" , "courier" , monospace;">text/turtle</span>.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqN4aM0MmSDg0qdNss61xM7LCgFUc3FG0W5mFD9YfJgnNpW0gXHu0KJZi2-MPpRBx9v2X5UasnQuZ0ayY1QzCU9ulq5Yyf-57VjEMAzPMaePEFNClg5eq9ekUmxm3oL-ojOvbPDoO50s8/s1600/303-redirect.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="1" data-original-height="344" data-original-width="937" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqN4aM0MmSDg0qdNss61xM7LCgFUc3FG0W5mFD9YfJgnNpW0gXHu0KJZi2-MPpRBx9v2X5UasnQuZ0ayY1QzCU9ulq5Yyf-57VjEMAzPMaePEFNClg5eq9ekUmxm3oL-ojOvbPDoO50s8/s1600/303-redirect.png" /></a></div>
<br />
<br />
2. The server responds with a 303 (see other) code and redirects to <span style="font-family: "courier new" , "courier" , monospace;">http://rs.tdwg.org/dwc.ttl</span><br />
<br />
3. The client sends another GET request to <span style="font-family: "courier new" , "courier" , monospace;"><a href="http://rs.tdwg.org/dwc.ttl" target="_blank">http://rs.tdwg.org/dwc.ttl</a></span>, this time without any <span style="font-family: "courier new" , "courier" , monospace;">Accept </span>header.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhVIMOXMW1RIgneuXOry4J9EJDWpTtzM4kxxQvJL31zkObduFV45DpACaP_RS6aADV_gjRblx_0WHwzKiECrov4tbVC12hvz1YJhgXmPeWn5vPhyphenhypheniFn6Ma51Q2BORHojr3Qc9gdmHx1DRM/s1600/200-response-code.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="1" data-original-height="350" data-original-width="919" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhVIMOXMW1RIgneuXOry4J9EJDWpTtzM4kxxQvJL31zkObduFV45DpACaP_RS6aADV_gjRblx_0WHwzKiECrov4tbVC12hvz1YJhgXmPeWn5vPhyphenhypheniFn6Ma51Q2BORHojr3Qc9gdmHx1DRM/s1600/200-response-code.png" /></a></div>
<br />
<br />
4. The server responds with a 200 (success) code and a <span style="font-family: "courier new" , "courier" , monospace;">Content-Type</span> response header of <span style="font-family: "courier new" , "courier" , monospace;">text/turtle</span>. The response body is the document serialized as RDF/Turtle.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcgkBqYzlQvnD7mSNA4fs0L18Ew-SIvpGBRHPUz_1DsbfAHq5k8HSDLWpto9ATAohq0nVEf-5VURVvrfyPX-qoQ3ck6ketuBc8jbMq-Dquyort7tXRl5fMhaHu_OHaIRATsFu6JlSt6Bs/s1600/200-response-body.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="1" data-original-height="560" data-original-width="867" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcgkBqYzlQvnD7mSNA4fs0L18Ew-SIvpGBRHPUz_1DsbfAHq5k8HSDLWpto9ATAohq0nVEf-5VURVvrfyPX-qoQ3ck6ketuBc8jbMq-Dquyort7tXRl5fMhaHu_OHaIRATsFu6JlSt6Bs/s1600/200-response-body.png" /></a></div>
<br />
This illustration was done "manually" using Postman, but it is relatively simple to use any typical programming language (such as Javascript or Python) to perform HTTP calls with appropriate <span style="font-family: "courier new" , "courier" , monospace;">Accept </span>headers.[1] So enabling IRI dereferencing with content negotiation really starts to open up TDWG standards to machine readability.<br />
<br />
One feature of this implementation method is that it allows a human user to examine a representation in any serialization using a browser just by hacking the abstract IRI using the rules in step 2. Thus, if you want to see what the RDF/XML serialization looks like for the basic Darwin Core vocabulary, you can put the URL <span style="font-family: "courier new" , "courier" , monospace;"><a href="http://rs.tdwg.org/dwc.rdf" target="_blank">http://rs.tdwg.org/dwc.rdf</a></span> into a browser. The browser will send an <span style="font-family: "courier new" , "courier" , monospace;">Accept</span> header of <span style="font-family: "courier new" , "courier" , monospace;">text/html</span>, but since the URL contains an extension for a specific file type, the server will ignore the <span style="font-family: "courier new" , "courier" , monospace;">Accept</span> header and send RDF/XML anyway. (Depending on how the browser is set up to handle file types, it may display the retrieved file in the browser window, or may initiate a download of the file into the user's Downloads directory.)<br />
<br />
<b><i>Important note:</i> currently (as of April 2019), there is an error in the algorithm that generates the JSON-LD that causes repeated properties to be serialized incorrectly. The JSON that is returned validates as JSON-LD, but when the document is interpreted, some instances of the repeated properties are ignored. So application designers should at this point plan to consume either RDF/XML or RDF/Turtle until this error is corrected.</b><br />
<br />
<h2>
Why does this matter?</h2>
There are three reasons why it is important to implement dereferencing of TDWG standards-related IRIs through content negotiation.<br />
<br />
1. The least important reason is probably the one that is given as a <a href="https://www.w3.org/DesignIssues/LinkedData.html" target="_blank">core rationale in the Linked Data world</a>: when someone "looks up" a URI, they get useful information and can discover more things through the links in the metadata. In theory, one could "discover" any resource related to TDWG standards, scrape the machine-readable metadata about that resource, dereference other resources that are linked to the first one, scrape those resources' metadata and follow their links, etc. until everything there is to be known about TDWG standards has been discovered. Essentially, we could have an analog of the Google web scraper that scrapes machine-readable documents instead of web pages. This could be done, but it would result in many HTTP calls and would be a very inefficient way to keep up-to-date on TDWG standards. There is a much better way, and I'll discuss it in the next post.<br />
<br />
2. Probably the most important reason is that implementing real permanent IRIs for TDWG vocabularies and documents puts a stop to the continual breaking of links and browser bookmarks that happens every time documents get moved to a new website, get changed from HTML to markdown, etc. If we stress that the permanent IRIs are what should be bookmarked and cited, we can always set up the server to redirect to the URL of the day where the document or information actually lives. Since the permanent IRIs are "cool" and don't include implementation-specific aspects like "<span style="font-family: "courier new" , "courier" , monospace;">.php</span>" or "<span style="font-family: "courier new" , "courier" , monospace;">?pid=123&lan=en</span>", we can change the way we actually generate and serve the data at will without ever "breaking" any links. This is really critical if we want people to be able to cite IRIs for TDWG standards components in journal articles with those IRIs continuing to dereference indefinitely.<br />
<br />
3. The third reason is more philosophical. By having IRIs that dereference to human- and machine-readable metadata, we demonstrate that these are "real" IRIs that exhibit the behavior expected from "grown-up" organizations in the Linked Data world specifically, and on the web in general. We show that TDWG is not some fly-by-night organization that creates identifiers one day and abandons them the next. The Internet is littered with the wreckage of vocabularies and ontologies from organizations that minted terms but stopped paying for their domain name, or couldn't keep their servers running. Having properly dereferencing, permanent IRIs marks TDWG as a real standards organization that can run with the big dogs like Dublin Core and the W3C. (We also get <a href="https://www.w3.org/community/webize/2014/01/17/what-is-5-star-linked-data/" target="_blank">5 stars</a>!)<br />
<br />
In my next post I'll talk about retrieving SDS-specified machine readable standards metadata en masse.<br />
<br />
<h3>
[1] Sample Python 3 code for dereferencing a term IRI</h3>
Note: you may need to use pip to install the requests module if you don't already have it.<br />
<br />
<span style="font-family: "courier new" , "courier" , monospace;">import requests</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">iri = 'http://rs.tdwg.org/ac/terms/caption'</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">accept = 'text/turtle'</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">r = requests.get(iri, headers={'Accept' : accept})</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">print(r.text)</span><br />
<br /></div>
</div>
<div>
<br /></div>
Steve Baskaufhttp://www.blogger.com/profile/01896499749604153763noreply@blogger.com0tag:blogger.com,1999:blog-5299754536670281996.post-81303935708140016952019-03-10T07:55:00.003-07:002020-03-04T19:29:38.154-08:00Understanding the TDWG Standards Documentation Specification, Part 2: Hierarchy Model and Implementation of IRIs<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<br />
<div class="MsoNormal">
This is the second in a series of posts about the TDWG
Standards Documentation Specification (SDS).<span style="mso-spacerun: yes;">
</span>For background on the SDS, see the first post.<br />
<br />
Note: this post was revised on 2020-03-04 when IRI dereferencing of the http://rs.tdwg.org/ subdomain went from testing into production.</div>
<h2>
Implementation plan?</h2>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
The SDS was ratified and issued in April of 2017. It did not, however, include any plan for its
implementation. It wasn't actually clear
whose responsibility it was to make implementation of the SDS happen. The Technical Architecture Group (TAG) might have
been a logical group to take charge, but in 2017 it had not yet been
reconstituted in its current form. As
the architect of the SDS, I had a vested interest in seeing it become
functional, so I decided to take the initiative to figure out how it could be
implemented. As I worked on this project, I got feedback from the Darwin Core Maintenance Group, key people working on the TDWG website and other infrastructure, and later from some TAG members.</div>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Although the SDS provided a general framework, it left a lot
of the details to implementers.<span style="mso-spacerun: yes;"> </span>In
particular, the SDS had relatively little to say about the form of URIs used as
identifiers for documents whose form was specified by the SDS.<span style="mso-spacerun: yes;"> </span>For guidance, I looked to precedents set by
Darwin Core, general practices in the Linked Data world, and practicalities of
URI dereferencing.<br />
<br /></div>
<h2>
The SDS model</h2>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
The SDS describes a <a href="https://github.com/tdwg/vocab/blob/master/sds/documentation-specification.md#22-standards-components-hierarchy" target="_blank">hierarchical model</a> for resources within its
scope. That hierarchy is relatively
simple for documents within a standard: there is simply a hasPart/isPartOf
relationship between the standard and its documents. </div>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
For vocabularies, the situation is more complicated.<span style="mso-spacerun: yes;"> </span>The SDS describes four levels in the hierarchy
that applies to vocabularies: standard, vocabulary, term list, and term.<span style="mso-spacerun: yes;"> </span>There was some discussion in the run-up to ratification
of the SDS as to whether the model needed to be this complicated.<span style="mso-spacerun: yes;"> </span>At that time, I asserted that this was the
least complicated model that could accomplish all of the things that people
said they wanted to do with vocabularies in TDWG.<span style="mso-spacerun: yes;"> </span><o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
It would be tempting to say that a much simpler model might
be possible.<span style="mso-spacerun: yes;"> </span>For example, we could
consider the Audubon Core Standard to be synonymous with the Audubon Core
vocabulary.<span style="mso-spacerun: yes;"> </span>We could say that Audubon Core terms were a direct part of it -- a simple two-level hierarchy.<span style="mso-spacerun: yes;"> </span></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
However, the Audubon Core Standard is more than just a set of
terms.<span style="mso-spacerun: yes;"> </span>The Audubon Core vocabulary is
distinct from the documents that describe how Audubon Core should be used (the
<a href="https://tdwg.github.io/ac/structure/" target="_blank">structure document</a>, <a href="https://tdwg.github.io/ac/termlist/" target="_blank">term list document</a>, etc.), which are also part of the standard.<span style="mso-spacerun: yes;"> </span>Although we might lump the standard, vocabulary,
and documents together in our human minds, if we really aspire to have machine-readable
descriptions of components of TDWG standards, we have to distinguish between
things that are not the same -- things that have different authors, creation dates,
and version histories.<span style="mso-spacerun: yes;"> </span><o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjYPV5Ql5X4b4xcDhgIuYzPvit2xOPqQj9MVoQZDRxblgGx2Oqi1wPkeWk58kgbYZmEGfMt0zYtXX7eRbfNNLPGLrAsM_SGsaa52fSmheoy1Bz-csLJh0Qq8c9v6ikpkQWEuztriPXNePU/s1600/standard.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="466" data-original-width="907" height="328" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjYPV5Ql5X4b4xcDhgIuYzPvit2xOPqQj9MVoQZDRxblgGx2Oqi1wPkeWk58kgbYZmEGfMt0zYtXX7eRbfNNLPGLrAsM_SGsaa52fSmheoy1Bz-csLJh0Qq8c9v6ikpkQWEuztriPXNePU/s640/standard.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Example of first (standards) and second (vocabularies and documents) levels of the TDWG Standards Documentation Specification hierarchy</td></tr>
</tbody></table>
<div class="MsoNormal">
<br />
As I described in the previous post, there was also a desire
expressed in the community for the capability to have more than one
"Darwin Core vocabulary".<span style="mso-spacerun: yes;"> </span>Some
people might want only the basic vocabulary (a "bag of terms" with
definitions). Others might want a more complicated vocabulary where some terms
might be declared to be equivalent to terms outside of Darwin Core, or classes
might be declared to be subclasses of classes in an outside ontology.<span style="mso-spacerun: yes;"> </span>Still others might want to create a Darwin
Core vocabulary that restricts the values that can be used for certain terms, or
entail class membership through range and domain declarations.<span style="mso-spacerun: yes;"> </span>So although we don't currently have more than
one Darwin Core vocabulary, we want to allow for that possibility in the
future.<span style="mso-spacerun: yes;"> That's another reason to have a model that separates the standard from the vocabulary or vocabularies that it defines.</span><o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhOcU7BWMyeLP70Ct9Lvw_3-sPSOU9EKrO4KRq6WA3SCZ4VFjlSdMRh3DV33vaKagJg7HRxexTiKB38Us2WEjpZtotY7gw8SrAUmrJ7-MtbW6qQgVBjl52kKxInbheeeLb38sTPEehXEA0/s1600/vocabulary.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="420" data-original-width="889" height="302" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhOcU7BWMyeLP70Ct9Lvw_3-sPSOU9EKrO4KRq6WA3SCZ4VFjlSdMRh3DV33vaKagJg7HRxexTiKB38Us2WEjpZtotY7gw8SrAUmrJ7-MtbW6qQgVBjl52kKxInbheeeLb38sTPEehXEA0/s640/vocabulary.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Example of second (vocabularies) and third (term lists) levels of the TDWG Standards Documentation Specification hierarchy</td></tr>
</tbody></table>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Within a vocabulary, the SDS describes an entity called
"term list" (<a href="https://github.com/tdwg/vocab/blob/master/sds/documentation-specification.md#33-vocabulary-descriptions" target="_blank">Section 3.3.3</a> and <a href="https://github.com/tdwg/vocab/blob/master/sds/documentation-specification.md#44-vocabularies-term-lists-and-terms" target="_blank">4.4.2</a>).<span style="mso-spacerun: yes;"> </span><o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRRawknQV_efb9Dlpon7LYsfsOHC28uTzafAEALrJi9yymUI862zVg-OyE0hFThtS10Wfv_-L0Nqa-tUF8pWEL4jlQ2Vhcg7oPvXaJYTQauoKiUfaA4nUGYPtfWrz3FWzq1fOA8x5AXRI/s1600/defining-term-list.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="422" data-original-width="893" height="302" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRRawknQV_efb9Dlpon7LYsfsOHC28uTzafAEALrJi9yymUI862zVg-OyE0hFThtS10Wfv_-L0Nqa-tUF8pWEL4jlQ2Vhcg7oPvXaJYTQauoKiUfaA4nUGYPtfWrz3FWzq1fOA8x5AXRI/s640/defining-term-list.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Example of third (term list) and fourth (term) levels of the TDWG Standards Documentation Specification hierarchy. This is an example of a list of terms defined by TDWG and only includes a few of the terms on the list.</td></tr>
</tbody></table>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
For terms defined by a TDWG vocabulary, there is an authoritative
term list for each namespace.<span style="mso-spacerun: yes;"> </span>For
example, there is an authoritative term list for <a href="http://rs.tdwg.org/dwc/terms/" target="_blank">the dwc: namespace</a> and another
for <a href="http://rs.tdwg.org/dwc/iri/" target="_blank">the dwciri: namespace</a>.<span style="mso-spacerun: yes;"> </span>These lists
are considered authoritative because they define the terms they contain.<span style="mso-spacerun: yes;"> </span>Dereferencing a term list IRI should return
the term list document.<span style="mso-spacerun: yes;"> </span><o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTeheaahZ9fUX_hCFr_3ShQUFvY_TnxKOGuCKO0YVOwqFfH2REp63alXK2_GoHNq4Ia9Iz_aEzQq8c62rz7NEPZ9RbujayqshKtJGf2RvHavaNv_z_sj-XucTaaxjEUI49_PFiC0jwczM/s1600/borrowed-term-list.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="430" data-original-width="914" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTeheaahZ9fUX_hCFr_3ShQUFvY_TnxKOGuCKO0YVOwqFfH2REp63alXK2_GoHNq4Ia9Iz_aEzQq8c62rz7NEPZ9RbujayqshKtJGf2RvHavaNv_z_sj-XucTaaxjEUI49_PFiC0jwczM/s640/borrowed-term-list.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Example of third (term list) and fourth (term) levels of the TDWG Standards Documentation Specification hierarchy. This is an example of a list of terms borrowed by TDWG and only includes a few of the terms on the list.</td></tr>
</tbody></table>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<o:p> </o:p>A term list can also contain terms that are borrowed from
another vocabulary and included in the TDWG vocabulary. The SDS does not prescribe how borrowed terms
should be organized in term lists -- for example, whether all borrowed terms
should be included in a single list or whether there should be a separate term
list for each namespace from which terms are borrowed. As a practical matter, it made sense to create
a separate term list for each namespace. <br />
<br /></div>
<h2>
Some notes about IRIs</h2>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
According to the SDS, each resource in the hierarchy should
be assigned an IRI as an identifier (<a href="https://github.com/tdwg/vocab/blob/master/sds/documentation-specification.md#21-abstract-resources-and-representations" target="_blank">Section 2.1.1</a>). An
IRI is a generalized form of a URI that allows non-Latin characters to be used. For the purposes of this post, you can consider
URIs and IRIs to be synonymous.</div>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
There has always been <a href="https://tools.ietf.org/html/rfc3986#section-1.2.2" target="_blank">confusion between the use of IRIs/URIs as identifiers and URLs as resource locators</a>.<span style="mso-spacerun: yes;">
</span>Fundamentally, an IRI is an identifier that may or may not actually
dereference in a web browser to retrieve a web page about the resource.<span style="mso-spacerun: yes;"> </span>In the Linked Data community, it is
considered a best practice for IRIs to dereference, but it isn't a
requirement.<span style="mso-spacerun: yes;"> </span>In fact, there are a number
of "borrowed" term IRIs in Audubon Core that don't dereference and
probably never will.<span style="mso-spacerun: yes;"> </span>So although it
isn't a requirement of the SDS that TDWG IRIs dereference, one goal of
implementation is to eventually make that happen.<span style="mso-spacerun: yes;"> </span></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
The origin of the subdomain <span style="font-family: "courier new" , "courier" , monospace;">rs.tdwg.org</span> has always been a
little mysterious to me.<span style="mso-spacerun: yes;"> </span>I believe that
the "rs" part stands for "schema repository" and that it
was originally intended to be a place from which XML and other schemas could be
retrieved.<span style="mso-spacerun: yes;"> </span>Although I don't think there
is any official policy that requires use of the <span style="font-family: "courier new" , "courier" , monospace;">rs.tdwg.org</span> subdomain for
TDWG-minted IRIs, that has become the convention with Darwin Core and Audubon Core
and I've taken that as the precedent to be followed when creating other IRIs that
denote resources associated with TDWG standards.<span style="mso-spacerun: yes;"> </span>The exceptions to this pattern are the IRIs
for the standards themselves.<span style="mso-spacerun: yes;"> </span>The
precedent there is that TDWG standards have IRIs in the form
<span style="font-family: "courier new" , "courier" , monospace;">http://www.tdwg.org/standards/nnn</span>, where "<span style="font-family: "courier new" , "courier" , monospace;">nnn</span>" is a number assigned to a
particular standard.<span style="mso-spacerun: yes;"> </span><br />
<span style="mso-spacerun: yes;"><br /></span></div>
<h2>
IRI patterns for vocabulary standards</h2>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
<o:p> </o:p>I used the precedents established by the Darwin and Audubon Core
standards, together with the URI specification (<a href="https://tools.ietf.org/html/rfc3986" target="_blank">RFC 3986</a>) itself to establish IRI
patterns that are consistent with the hierarchy established by the SDS. <a href="https://tools.ietf.org/html/rfc3986#section-1.2.3" target="_blank">Section 1.2.3 of RFC 3986</a> notes that a forward slash is used to "delimit components
that are significant to the generic parser's hierarchical interpretation of an identifier" and the IRIs of components of vocabularies can be interpreted this way. </div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Here
are the patterns I established or continued based on past practice:<br />
<br /></div>
<h3>
Standards IRI:</h3>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;">http://www.tdwg.org/standards/nnn</span><o:p></o:p></div>
<div class="MsoNormal">
where "<span style="font-family: "courier new" , "courier" , monospace;">nnn</span>" consists of numeric characters assigned
to the standard.<span style="mso-spacerun: yes;"> </span>Dereferencing these
IRIs should lead the user to the landing page of the standard.<span style="mso-spacerun: yes;"> </span>Example of the Darwin Core standard:<o:p></o:p></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;"><a href="http://www.tdwg.org/standards/450" target="_blank">http://www.tdwg.org/standards/450</a></span><o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Note that since these IRIs aren't within the <span style="font-family: "courier new" , "courier" , monospace;">rs.tdwg.org</span>
subdomain, the test system I've implemented does not handle their
dereferencing. <span style="mso-spacerun: yes;"> </span>Standards IRI
dereferencing is handled by a separate system and I don’t know how fully functional
it is for all prior TDWG standards.<br />
<br /></div>
<h3>
Vocabulary IRI:</h3>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;">http://rs.tdwg.org/vvv/</span><o:p></o:p></div>
<div class="MsoNormal">
where "<span style="font-family: "courier new" , "courier" , monospace;">vvv</span>" is a sequence of alphabetic characters
assigned to the vocabulary.<span style="mso-spacerun: yes;"> </span>Example of
the Darwin Core basic vocabulary:<o:p></o:p></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;"><a href="http://rs.tdwg.org/dwc/" target="_blank">http://rs.tdwg.org/dwc/</a></span><br />
<br /></div>
<h3>
Term list IRI:</h3>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;">http://rs.tdwg.org/vvv/ttt/</span><o:p></o:p></div>
<div class="MsoNormal">
where "<span style="font-family: "courier new" , "courier" , monospace;">vvv</span>" is a sequence of alphabetic characters
assigned to the vocabulary and "<span style="font-family: "courier new" , "courier" , monospace;">ttt</span>" is a sequence of alphabetic characters
assigned to the term list within that vocabulary.<span style="mso-spacerun: yes;"> </span>Example of the Darwin Core IRI-valued terms:<o:p></o:p></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;"><a href="http://rs.tdwg.org/dwc/iri/" target="_blank">http://rs.tdwg.org/dwc/iri/</a></span><br />
<br /></div>
<h3>
Term IRI:</h3>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;">http://rs.tdwg.org/vvv/ttt/nnn</span><o:p></o:p></div>
<div class="MsoNormal">
where "<span style="font-family: "courier new" , "courier" , monospace;">vvv</span>" is a sequence of alphabetic characters
assigned to the vocabulary, "<span style="font-family: "courier new" , "courier" , monospace;">ttt</span>" is a sequence of alphabetic
characters assigned to the term list within that vocabulary, and "<span style="font-family: "courier new" , "courier" , monospace;">nnn</span>"
is the local name of the term.<span style="mso-spacerun: yes;"> </span>Example
of the "in described place" term:<o:p></o:p></div>
<div class="MsoNormal">
<a href="http://rs.tdwg.org/dwc/iri/inDescribedPlace" target="_blank"><span style="font-family: "courier new" , "courier" , monospace;">http://rs.tdwg.org/dwc/iri/inDescribedPlace</span></a><o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
The term pattern described above is backward compatible with
all current Darwin Core and Audubon Core term IRIs.<span style="mso-spacerun: yes;"> </span>Existing Darwin Core RDF/XML asserts relationships
between terms and the resource that defines them like this:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">http://rs.tdwg.org/dwc/terms/dateIdentified rdfs:isDefinedBy
http://rs.tdwg.org/dwc/terms/ .</span><o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
So the IRI pattern for term lists is also backwards compatible
with this previous use, with the name "term list" now explicitly given to
the resource that defines terms.<span style="mso-spacerun: yes;"> </span><o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
The IRI pattern for vocabularies is new, but is consistent
with the hierarchy and is necessary to distinguish between vocabularies and the
standards that create them.<span style="mso-spacerun: yes;"> </span><o:p></o:p><br />
<span style="mso-spacerun: yes;"><br /></span></div>
<div class="MsoNormal">
<br /></div>
<h2>
IRI pattern for documents</h2>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Previously, there had been no consistent pattern for IRIs
assigned to documents associated with standards.<span style="mso-spacerun: yes;"> </span>Here are some examples of IRIs for Darwin Core documents:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
The Darwin Core XML guide: <span style="font-family: "courier new" , "courier" , monospace;">http://rs.tdwg.org/dwc/terms/guides/xml/</span><o:p></o:p></div>
<div class="MsoNormal">
The Darwin Core simple text guide: <span style="font-family: "courier new" , "courier" , monospace;">http://rs.tdwg.org/dwc/terms/simple/</span><o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
To maintain backwards compatibility, these pre-existing IRIs
were left unchanged.<span style="mso-spacerun: yes;"> </span>However, the IRI patterns
used for Darwin Core documents make it difficult to distinguish programmatically
between term and document IRIs using pattern matching.<span style="mso-spacerun: yes;"> </span>So for all documents from standards other than Darwin Core, I
used this pattern:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;">http://rs.tdwg.org/sss/doc/docname/</span><o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
where "<span style="font-family: "courier new" , "courier" , monospace;">sss</span>" is a sequence of alphabetic characters
representing the standard and "<span style="font-family: "courier new" , "courier" , monospace;">docname</span>" is a short series of alphabetic
characters representing the document.<span style="mso-spacerun: yes;">
</span>For example:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;"><a href="http://rs.tdwg.org/ac/doc/structure/" target="_blank">http://rs.tdwg.org/ac/doc/structure/</a></span><o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
is the IRI for the Audubon Core Structure document.<span style="mso-spacerun: yes;"> </span><br />
<span style="mso-spacerun: yes;"><br /></span></div>
<h2>
Redirection</h2>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
One thing that should be made clear is the distinction between
the IRI that identifies a resource and the URL that actually can be used to
retrieve a document or metadata about some other resource. Because the SDS <a href="https://github.com/tdwg/vocab/blob/master/sds/documentation-specification.md#21-abstract-resources-and-representations" target="_blank">considers the resources it describes as abstract entities</a>, those entities can have multiple formats or serializations
that are distinct from the abstract resources themselves. For example, the Audubon Core Structure
document is an abstract thing identified by <span style="font-family: "courier new" , "courier" , monospace;">http://rs.tdwg.org/ac/doc/structure/</span>
. However, the HTML serialization of
that document can currently be retrieved from the URL <span style="font-family: "courier new" , "courier" , monospace;"><a href="https://tdwg.github.io/ac/structure/" target="_blank">https://tdwg.github.io/ac/structure/</a></span>
and in the future that document might be made available at different URLs in
other formats such as PDF. It is
required that the IRI of the abstract resource be stable and unchanged, but
there is no requirement that the retrieval URL for a serialization stay the
same over time. Thus it's important that
citations and bookmarks be set to the permanent IRI of the resource, and that redirection
from the permanent IRI to the retrieval URL be maintained so that people can
actually acquire a copy of the resource using a browser. </div>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
In the past, obscure, deprecated Darwin Core terms simply
didn't dereference.<span style="mso-spacerun: yes;"> </span>In the test system,
they redirect programmatically to a URL that is the term IRI plus
"<span style="font-family: "courier new" , "courier" , monospace;">.htm</span>".<span style="mso-spacerun: yes;"> </span>Here's an example:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;"><a href="http://rs.tdwg.org/dwc/curatorial/CollectorNumber" target="_blank">http://rs.tdwg.org/dwc/curatorial/CollectorNumber</a></span><o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
redirects to <o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;"><a href="http://rs.tdwg.org/dwc/curatorial/CollectorNumber.htm" target="_blank">http://rs.tdwg.org/dwc/curatorial/CollectorNumber.htm</a></span><o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
The document that is retrieved is an HTML, human-readable
description of the term.<span style="mso-spacerun: yes;"> </span><o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Historically, current Darwin Core terms redirected to the Darwin
Core Quick Reference page and that behavior has been maintained in the test
system.<span style="mso-spacerun: yes;"> </span>Here's an example:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;"><a href="http://rs.tdwg.org/dwc/terms/institutionCode" target="_blank">http://rs.tdwg.org/dwc/terms/institutionCode</a></span><o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
redirects to <o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;"><a href="https://dwc.tdwg.org/terms/#dwc:institutionCode" target="_blank">https://dwc.tdwg.org/terms/#dwc:institutionCode</a></span><o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
The same is true with Audubon Core terms, whose IRIs
redirect to an appropriate place on the Audubon Core Term List document. <span style="mso-spacerun: yes;"> </span>The URLs of both the Audubon Core Term List
page and Darwin Core Quick Reference page have changed recently, reinforcing
the importance of citing the actual term IRIs rather than the redirected URLs.<o:p></o:p></div>
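<div class="MsoNormal">
A quick way to see where a particular term IRI currently ends up is to let an HTTP library follow the whole redirect chain and report the final URL. Here is a small sketch using the Python requests library; the final URL noted in the comment is simply the current destination given above and may change in the future.</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;">import requests</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">r = requests.get('http://rs.tdwg.org/dwc/terms/institutionCode')</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">print([resp.status_code for resp in r.history])  # status codes of the intermediate redirects</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">print(r.url)  # final retrieval URL, currently on the Quick Reference page</span><br />
</div>
<div class="MsoNormal">
<br /></div>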
<div class="MsoNormal">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://github.com/tdwg/vocab/raw/master/graphics/version-model.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="287" data-original-width="800" height="228" src="https://github.com/tdwg/vocab/raw/master/graphics/version-model.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">TDWG Standards Documentation Specification version model (from <a href="https://github.com/tdwg/vocab/blob/master/sds/documentation-specification.md#23-versioning-model" target="_blank">Section 2.3</a>)</td></tr>
</tbody></table>
<h2>
Versions</h2>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
Taking cues from Dublin Core and the W3C, the SDS describes
a version model that can be used to track versions of resources associated with
TDWG standards. For example,
dereferencing the Darwin Core vocabulary IRI <span style="font-family: "courier new" , "courier" , monospace;"><a href="http://rs.tdwg.org/dwc/" target="_blank">http://rs.tdwg.org/dwc/</a></span> shows that
there are 19 versions: 18 previous versions and a most recent version that
corresponds to the current Darwin Core vocabulary. </div>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
For vocabularies and term lists, the version IRIs are constructed by appending an ISO 8601 date after the
final slash and inserting "<span style="font-family: "courier new" , "courier" , monospace;">version/</span>" before the terminal string.<span style="mso-spacerun: yes;"> </span>For example, the current Darwin Core vocabulary IRI is <span style="font-family: "courier new" , "courier" , monospace;"><a href="http://rs.tdwg.org/dwc/" target="_blank">http://rs.tdwg.org/dwc/</a></span>
and a version of the Darwin Core vocabulary is <a href="http://rs.tdwg.org/version/dwc/2015-03-27" target="_blank"><span style="font-family: "courier new" , "courier" , monospace;">http://rs.tdwg.org/version/dwc/2015-03-27</span></a>
.<span style="mso-spacerun: yes;"> </span>The current Darwin Core IRI-value term
list IRI is <span style="font-family: "courier new" , "courier" , monospace;"><a href="http://rs.tdwg.org/dwc/iri/" target="_blank">http://rs.tdwg.org/dwc/iri/</a></span> and a version of it is <span style="font-family: "courier new" , "courier" , monospace;"><a href="http://rs.tdwg.org/dwc/version/iri/2015-03-27" target="_blank">http://rs.tdwg.org/dwc/version/iri/2015-03-27</a></span>
.<span style="mso-spacerun: yes;"> </span>(Although it wouldn't be necessary to
include the characters "<span style="font-family: "courier new" , "courier" , monospace;">version/</span>" in the version IRI, doing so makes pattern
recognition for those IRIs much simpler.)<span style="mso-spacerun: yes;">
</span><o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Following the precedent already set for Darwin Core, term
version IRIs are formed by appending an ISO 8601 date with a dash.<span style="mso-spacerun: yes;"> </span>Again "<span style="font-family: "courier new" , "courier" , monospace;">version/</span>" is inserted ahead
of the local name to make IRI pattern recognition easier.<span style="mso-spacerun: yes;"> </span>For example, the term IRI <span style="font-family: "courier new" , "courier" , monospace;"><a href="http://rs.tdwg.org/dwc/terms/establishmentMeans" target="_blank">http://rs.tdwg.org/dwc/terms/establishmentMeans</a></span>
has a version <span style="font-family: "courier new" , "courier" , monospace;"><a href="http://rs.tdwg.org/dwc/terms/version/establishmentMeans-2009-04-24" target="_blank">http://rs.tdwg.org/dwc/terms/version/establishmentMeans-2009-04-24</a></span><o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
For documents, the version IRI is formed by simply appending
the ISO 8601 date after the trailing slash.<span style="mso-spacerun: yes;">
</span>(In the case of documents, IRI pattern recognition is less critical
since there aren't hierarchical levels below the level of the document. So
"<span style="font-family: "courier new" , "courier" , monospace;">version/</span>" isn't inserted in the version IRI.)<span style="mso-spacerun: yes;"> </span>For example, the document <span style="font-family: "courier new" , "courier" , monospace;"><a href="http://rs.tdwg.org/sds/doc/specification/" target="_blank">http://rs.tdwg.org/sds/doc/specification/</a></span>
has a version <a href="http://rs.tdwg.org/sds/doc/specification/2007-11-05" target="_blank"><span style="font-family: "courier new" , "courier" , monospace;">http://rs.tdwg.org/sds/doc/specification/2007-11-05</span></a> .<span style="mso-spacerun: yes;"> </span><o:p></o:p></div>
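<div class="MsoNormal">
To make the version rules concrete, here is a minimal Python sketch (again my own illustration, not anything from the actual implementation) that derives each kind of version IRI from a current IRI and an ISO 8601 date:</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;"># vocabulary or term list: insert 'version/' before the terminal segment and append the date</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">current = 'http://rs.tdwg.org/dwc/iri/'</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">head, sep, tail = current.rstrip('/').rpartition('/')</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">print(head + '/version/' + tail + '/2015-03-27')  # http://rs.tdwg.org/dwc/version/iri/2015-03-27</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"># term: insert 'version/' before the local name and append the date with a dash</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">term = 'http://rs.tdwg.org/dwc/terms/establishmentMeans'</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">head, sep, local = term.rpartition('/')</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">print(head + '/version/' + local + '-2009-04-24')  # http://rs.tdwg.org/dwc/terms/version/establishmentMeans-2009-04-24</span><br />
<span style="font-family: "courier new" , "courier" , monospace;"># document: simply append the date after the trailing slash</span><br />
<span style="font-family: "courier new" , "courier" , monospace;">print('http://rs.tdwg.org/sds/doc/specification/' + '2007-11-05')</span><br />
</div>
<div class="MsoNormal">
<br /></div>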
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
In the case of non-document resources, resolution of version
IRIs is fully implemented, since human-readable pages can be constructed programmatically for those
resources using data from the metadata database.<span style="mso-spacerun: yes;"> </span>However, since the human-readable versions of
standards documents are generally created manually and have idiosyncratic redirection
IRIs, version IRI resolution is currently only partially implemented.<span style="mso-spacerun: yes;"> </span>In the case of many standards documents, the location
of previous versions is not known or they are not yet available online.<span style="mso-spacerun: yes;"> </span>So for now, one can't explore older versions
of standards documents in the same way one can explore older versions of
vocabularies, term lists, and terms.<br />
<br /></div>
<h2>
Summary</h2>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
I've implemented a system of IRIs that are consistent with
the SDS and past practice of Darwin and Audubon Cores. Although the patterns I established aren't the only possible ones, they work well for facilitating pattern matching by a server that generates many of the documents programmatically, so I feel that the pattern system is sound.</div>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Here are some starting points for exploration:<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<b>Audubon Core basic vocabulary:</b><o:p></o:p></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;"><a href="http://rs.tdwg.org/ac/" target="_blank">http://rs.tdwg.org/ac/</a></span><o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<b>Darwin Core basic vocabulary:</b><o:p></o:p></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;"><a href="http://rs.tdwg.org/dwc/" target="_blank">http://rs.tdwg.org/dwc/</a></span><o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
From these two vocabulary pages you can surf to term lists,
terms, and older versions of all of the resources.<o:p></o:p><br />
<br />
<b>Terms borrowed by Audubon Core from the IPTC Photo Metadata Extension:</b><br />
<span style="font-family: "courier new" , "courier" , monospace;"><a href="http://rs.tdwg.org/ac/Iptc4xmpExt/" target="_blank">http://rs.tdwg.org/ac/Iptc4xmpExt/</a></span><br />
<br />
<b>The October 16, 2011 version of the Darwin Core vocabulary:</b><br />
<a href="http://rs.tdwg.org/version/dwc/2011-10-16" target="_blank"><span style="font-family: "courier new" , "courier" , monospace;">http://rs.tdwg.org/version/dwc/2011-10-16</span></a><br />
<br />
<b>The April 24, 2009 version of the list of core Darwin Core terms:</b><br />
<span style="font-family: "courier new" , "courier" , monospace;"><a href="http://rs.tdwg.org/dwc/version/terms/2009-04-24" target="_blank">http://rs.tdwg.org/dwc/version/terms/2009-04-24</a></span><br />
<br />
<b>The September 11, 2009 version of Basis of Record:</b><br />
<span style="background-color: white; color: #212529;"><span style="font-family: "courier new" , "courier" , monospace;"><a href="http://rs.tdwg.org/dwc/terms/version/basisOfRecord-2009-09-11" target="_blank">http://rs.tdwg.org/dwc/terms/version/basisOfRecord-2009-09-11</a></span></span><br />
<span style="background-color: white; color: #212529; font-family: , "blinkmacsystemfont" , "segoe ui" , "roboto" , "helvetica neue" , "arial" , sans-serif , "apple color emoji" , "segoe ui emoji" , "segoe ui symbol" , "noto color emoji";"><br /></span>
<b>A deprecated Darwin Core term list:</b><br />
<span style="font-family: "courier new" , "courier" , monospace;"><a href="http://rs.tdwg.org/dwc/curatorial/" target="_blank">http://rs.tdwg.org/dwc/curatorial/</a></span></div>
<div class="MsoNormal">
<br />
<b>A deprecated Darwin Core term:</b><br />
<span style="font-family: "courier new" , "courier" , monospace;"><a href="http://rs.tdwg.org/dwc/dwctype/MachineObservation" target="_blank">http://rs.tdwg.org/dwc/dwctype/MachineObservation</a></span><br />
<br /></div>
<div class="MsoNormal">
<b>Here are some examples of document IRIs that redirect:</b><o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;"><a href="http://rs.tdwg.org/ac/doc/introduction/" target="_blank">http://rs.tdwg.org/ac/doc/introduction/</a></span><o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;"><a href="http://rs.tdwg.org/tapir/doc/xmlschema/" target="_blank">http://rs.tdwg.org/tapir/doc/xmlschema/</a></span><o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span style="font-family: "courier new" , "courier" , monospace;"><a href="http://rs.tdwg.org/apn/doc/data/" target="_blank">http://rs.tdwg.org/apn/doc/data/</a></span><o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
In the next post, I'll describe how the system I've
implemented allows retrieval of machine-readable metadata.<o:p></o:p></div>
<br />Steve Baskaufhttp://www.blogger.com/profile/01896499749604153763noreply@blogger.com0tag:blogger.com,1999:blog-5299754536670281996.post-64032007977592852262019-03-03T22:48:00.000-08:002019-03-04T07:41:07.300-08:00Understanding the TDWG Standards Documentation Specification Part 1: BackgroundThis is the first in a series of posts about the <a href="https://github.com/tdwg/vocab/blob/master/sds/documentation-specification.md">TDWG Standards Documentation Specification (SDS)</a>, with special reference to how its implementation enables machine access to information about TDWG standards. In particular, the SDS makes it possible to acquire all available information about TDWG vocabularies, including all historical versions of terms. In this post, I'm going to describe the genesis of the SDS and how the practical experience of the TDWG community influenced the ultimate state of the specification.<br />
<br />
<h2>
Historical background </h2>
<br />
The <a href="https://github.com/tdwg/vocab/blob/master/tdwg-stds-spec.pdf">original draft of the SDS</a> was written in 2007 by Roger Hyam as part of the effort to modernize the TDWG standards development process. The original draft was focused on how human-readable documents should be formatted. The SDS remained in draft form for several years and during that time, new standards documents generally reflected the directives of that draft.<br />
<br />
In 2013, a Vocabulary Management Task Group examined the status of the old TDWG Ontology and the experience of the community with the term change section of the Darwin Core Namespace Policy. The <a href="http://www.gbif.org/resource/80862">task group recommended</a> that a new SDS be written with guidelines for the formatting of both human- and computer-readable documents, and that the Darwin Core Namespace Policy be used as the starting point for writing a specification describing how vocabularies should be maintained.<br />
<br />
In 2014, I was asked to lead a task group to revise the SDS and to move it forward to the status of ratified standard. One advantage of returning to work on the specification after seven years had elapsed was that we had the benefit of experience from work with the Darwin Core standard and had learned several important lessons from that. Some of those lessons were about weaknesses in policies related to standards documents and some were process-oriented. Because of the interrelation between the documentation of standards and the processes of their development and maintenance, the parallel development of both the SDS and the <a href="https://github.com/tdwg/vocab/blob/master/vms/maintenance-specification.md">Vocabulary Maintenance Specification (VMS)</a> by the task group allowed the two specifications to be developed in a complementary fashion.<br />
<br />
One of the key problems with the state of TDWG standards documents was that it was difficult to know which documents associated with a complex standard like Darwin Core were actually part of the standard, and which documents were ancillary documents that provided useful information about the standard, but that were not actually part of the standard. That distinction was important because changes to documents within a standard should be subject to a potentially rigorous process of review, while documents outside the standard could be changed at will. There was a similar problem with the idea expressed in the original SDS draft that certain documents that were part of a standard should be considered normative, while other documents that were part of the standard were not normative. If the status of "normative" were bestowed on an entire document, what did that mean for parts of that document such as examples, or mutable URLs? Did changing an example or URL require invoking a standards review process or could they just be changed or corrected at will?<br />
<br />
To make matters worse, the final designation of documents that were considered authoritatively to be part of the standard was determined by which documents were included in a .zip file that was uploaded to the OJS instance that was managing the standards adoption process at the time. That made it virtually impossible for any layman to actually know whether a particular document was part of a standard or not.<br />
<br />
In the Darwin Core Standard at that time, the RDF/XML representation of the vocabulary was designated as the normative document. That presented several problems. One problem was that the XML document was by its nature a machine-readable document, making it difficult for people to read and understand it. Another question involved the text, XML, and RDF guides that specified how the standard was to be implemented, but that were not considered "the normative document". Clearly those documents were required to comply with the standard, so shouldn't they be considered in some way normative? The problem was made worse by the fact that the RDF guide document minted an entire category of Darwin Core terms (the <span style="font-family: "courier new" , "courier" , monospace;">dwciri:</span> terms having IRI values), but those terms weren't actually found in the normative RDF/XML document.<br />
<br />
Defining the Darwin Core vocabulary as RDF/XML also anchored it in a serialization that was becoming less commonly used. With the ratification of RDF/Turtle and JSON-LD by the W3C as alternative machine-readable serializations, it made less sense to define Darwin Core specifically in RDF/XML.
<br />
<br />
During the time between the writing of the draft SDS in 2007 and the convening of the task group, there was also significant discussion in the community as to whether Darwin Core should be developed as a full ontology, or whether it should remain a simple "bag of terms" having minimal human-readable definitions. The first option would allow for greater expressiveness, but the second option would allow for the broadest possible use of the vocabulary.<br />
<br />
<h2>
Strategies of the Standards Documentation Specification </h2>
<br />
The problems outlined above led the task group to create a specification with several key features that addressed those problems.<br />
<br />
The ratified SDS threw out the idea that inclusion of a document as part of a standard was determined by presence in a .zip file. Instead, a document is considered part of a standard if it is designated as such. That designation was to take place in two ways. First, <a href="https://github.com/tdwg/vocab/blob/master/sds/documentation-specification.md#32-descriptive-documents">a human-readable document itself should state clearly in its header section that it is part of a standard</a> (Section 3.2.3.1). <a href="https://github.com/tdwg/vocab/blob/master/sds/documentation-specification.md#42-general-metadata">Machine-readable documents would have a <span style="font-family: "courier new" , "courier" , monospace;">dcterms:isPartOf</span> property that links them to a standard</a> (Section 4.2.2). Second, <a href="https://github.com/tdwg/vocab/blob/master/sds/documentation-specification.md#31-landing-page-for-the-standard">each standard will have an official "landing page" that would state clearly which documents were parts of the standard</a> (Section 3.1).
<br />
<br />
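As a concrete illustration (my own, not an example taken from the SDS itself), the machine-readable assertion could be generated with Python's rdflib; the IRIs below are placeholders rather than actual TDWG identifiers:<br />
<br />
<pre>
# Minimal sketch of asserting that a machine-readable document is part of a
# standard using dcterms:isPartOf.  The IRIs are placeholders, not real TDWG IRIs.
from rdflib import Graph, URIRef
from rdflib.namespace import DCTERMS

g = Graph()
document = URIRef("https://example.org/standard/term-list")  # a document in the standard
standard = URIRef("https://example.org/standard/")           # the standard it belongs to

g.add((document, DCTERMS.isPartOf, standard))
print(g.serialize(format="turtle"))
</pre>
<br />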
The SDS also got rid of the idea that particular documents were normative. Any document that is part of a standard can contain parts that are normative and parts that are not. Each human-readable document <a href="https://github.com/tdwg/vocab/blob/master/sds/documentation-specification.md#32-descriptive-documents">will contain a statement in its introduction outlining what parts are normative and what parts are not</a> (Section 3.2.1). This designation can be made by labeling certain parts as normative, or by rules such as "all parts are normative except sections labeled as 'example' in their subtitles".
<br />
<br />
The problem of serializations for standards and their parts was addressed by considering <a href="https://github.com/tdwg/vocab/blob/master/sds/documentation-specification.md#21-abstract-resources-and-representations">standards components to be abstract entities that can have multiple equivalent serializations</a> (Section 2.1). For human-readable documents, it is irrelevant whether a document is in HTML, PDF, or Markdown format. It is desirable to make documents available in as many formats as possible as long as they contain substantially the same content. For machine-readable documents, there is no preferred serialization. It is required that <a href="https://github.com/tdwg/vocab/blob/master/sds/documentation-specification.md#22-standards-components-hierarchy">a machine consuming any of the serializations should receive exactly the same information</a> (Section 2.2.4). Again, the more available serializations the better, as long as the abstract meaning of their content is the same.<br />
<br />
The issue of enhancing vocabularies through added semantics was addressed by a "layered" approach that had been suggested in online discussion prior to the formation of the task group. All TDWG vocabularies will consist of a set of terms with basic properties that delineate their definition, label, and housekeeping metadata. This "basic" vocabulary can be used in a broad range of applications. <a href="https://github.com/tdwg/vocab/blob/master/sds/documentation-specification.md#44-vocabularies-term-lists-and-terms">Additional vocabularies could be constructed by adding components</a> to the basic vocabulary, such as constraints and properties generating entailments (Section 4.4.2.2). Thus, there could eventually be several Darwin Core vocabularies, one consisting of only the basic components, and zero to many additional vocabularies consisting of the basic vocabulary plus "enhancement" components layered on top of the basic vocabulary. Because the nature of such enhancements could not be known in advance, the Vocabulary Maintenance Specification (VMS) contains a <a href="https://github.com/tdwg/vocab/blob/master/vms/maintenance-specification.md#4-vocabulary-enhancements">process for the development of vocabulary enhancements</a> that includes use-case collection and implementation experience reports (Section 4). At the present time, there aren't any additional enhanced vocabularies, but they could be created in the future if members of the community can show that those enhancements are needed to accomplish some useful purpose.<br />
<br />
In the next post of this series, I'll discuss how these strategies resulted in the model for machine-readable metadata embodied in the final standards documentation specification.
Steve Baskaufhttp://www.blogger.com/profile/01896499749604153763noreply@blogger.com0tag:blogger.com,1999:blog-5299754536670281996.post-90158208450292496212018-02-25T19:56:00.001-08:002018-09-25T11:51:55.642-07:00Turning stuff on and off using a Raspberry Pi: Froggy resurrected<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "BlogPosting",
"mainEntityOfPage": {
"@type": "WebPage",
"@id": "http://baskauf.blogspot.com/2018/02/turning-stuff-on-and-off-using.html"
},
"headline": "Turning stuff on and off using a Raspberry Pi: Froggy resurrected",
"image": {
"@type": "ImageObject",
"url": "https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjeDmR5zbUOgOjZh9QbJGpzRRjzqs5on6MzDBrhxPpFLzbtTY1DPM6WYjq9oB6nKm5PYuRSt4gGnI_s-t-PvQVVC0eVKKBZ5u1MKsACiLAr2xzsXOmeUdQVOKpRdj7BTysTFrIAIc3-wpo/s640/2018-02-24+12.52.31.jpg",
"height": 640,
"width": 480
},
"datePublished": "2018-02-18T08:07:56-07:00",
"dateModified": "2018-02-18T08:07:56-07:00",
"author": {
"@type": "Person",
"name": "Steve Baskauf",
"@id":"https://orcid.org/0000-0003-4365-3135",
"sameAs": "http://bioimages.vanderbilt.edu/contact/baskauf"
},
"publisher": {
"@type": "Organization",
"name": "Baskauf personal blog",
"logo": {
"@type": "ImageObject",
"url": "https://scontent-atl3-1.cdninstagram.com/vp/94f43ab4c94ca56c3d74f3d275fb7670/5C2E73A3/t51.2885-19/s150x150/13774778_1753029844967575_204181086_a.jpg",
"width": 60,
"height": 60
}
},
"description": "This article explains how a Raspberry Pi computer can be used to turn electrical devices on and off, including the necessary materials. It shows how the Raspberry Pi can be used to control several motors to make a homemade robot move."
}
</script>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjeDmR5zbUOgOjZh9QbJGpzRRjzqs5on6MzDBrhxPpFLzbtTY1DPM6WYjq9oB6nKm5PYuRSt4gGnI_s-t-PvQVVC0eVKKBZ5u1MKsACiLAr2xzsXOmeUdQVOKpRdj7BTysTFrIAIc3-wpo/s1600/2018-02-24+12.52.31.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1200" data-original-width="1600" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjeDmR5zbUOgOjZh9QbJGpzRRjzqs5on6MzDBrhxPpFLzbtTY1DPM6WYjq9oB6nKm5PYuRSt4gGnI_s-t-PvQVVC0eVKKBZ5u1MKsACiLAr2xzsXOmeUdQVOKpRdj7BTysTFrIAIc3-wpo/s640/2018-02-24+12.52.31.jpg" width="640" /></a></div>
<br />
I recently listened to an <a href="http://www.bbc.co.uk/programmes/b09ly60f">interview of Eben Upton on The Life Scientific podcast</a>, where he talked about what led to his development of the Raspberry Pi (RPi), a fully functioning computer that you can buy for as little as $5. The concept of a computer that was essentially disposable fascinated me, and I decided to get one to play with. Originally, I was going to get the cheapest model, but in order to use it, one needed to have a wireless keyboard (which I didn't have). Since I was hoping to run the computer entirely with junk that I had lying around the house, I opted instead to get the $35 <a href="https://www.raspberrypi.org/products/raspberry-pi-3-model-b/">model 3B</a>, which has 4 USB 2 ports, a standard size HDMI connector, and an Ethernet connector in addition to built-in WiFi and Bluetooth. I was able to use a micro SD card, a mouse, and a keyboard that I already had, although I did have to pay about $15 more at WalMart to get an HDMI to VGA adapter in order to use an old monitor that I had stored down in the basement.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://pbs.twimg.com/media/DU4A50OX4AA4a_E.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="600" data-original-width="800" height="300" src="https://pbs.twimg.com/media/DU4A50OX4AA4a_E.jpg" width="400" /></a></div>
<br />
<br />
I was able to use an old iPad USB power supply to provide the power to the RPi, but it barely puts out the minimum required current, so I frequently see the little lightning bolt in the upper right of the screen indicating that the computer is being underpowered. At one point, I also used a junky USB cable, and it caused the computer to reboot endlessly until I replaced it with a better-quality one.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgByhe81YUCu4w-QUUn_4WEj0uLyhxd74HUPW_ZifG3xPNJpQTKW0iqJv7e29F74f6CKKRPSsMLmfHk80QSXKTWn8_qHKI_bWlkJj64bxKSIKvNGqH0GR4pbILJmhyphenhyphenRRMswCawKmvp-VlE/s1600/2018-02-24+21.48.25.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1200" data-original-width="1600" height="480" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgByhe81YUCu4w-QUUn_4WEj0uLyhxd74HUPW_ZifG3xPNJpQTKW0iqJv7e29F74f6CKKRPSsMLmfHk80QSXKTWn8_qHKI_bWlkJj64bxKSIKvNGqH0GR4pbILJmhyphenhyphenRRMswCawKmvp-VlE/s640/2018-02-24+21.48.25.jpg" width="640" /></a></div>
One of the features that was very attractive to me was the apparent ease with which one could interface the RPi with external devices. The RPi 3B is about the size of a credit card (although much thicker due to the various ports sticking up out of the circuit board), and it has 40 pins sticking up on one side that serve as the general purpose input/output interface (GPIO). Many of those pins can be used either to send output to a device being controlled by the RPi, or to receive input from some kind of sensor. Other pins serve as ground connections or provide 3.3V or 5V power.<br />
<br />
After the initial euphoria of successfully downloading the Linux OS (known as Raspbian) onto the micro SD card and booting the computer, I realized that I didn't have the stuff that I needed to actually do the interfacing. In the past, I would have just made a trip to Radio Shack to pick up the items I needed, but since there is no longer any store in Nashville that sells electronics components to consumers, I had to make a careful assessment of what I needed so that I could make a minimal number of purchases online and save on shipping. In the following section, I'll list the items that I decided I needed.<br />
<br />
<h2>
Useful stuff for interfacing the Raspberry Pi</h2>
One of the most basic things that anyone who wants to play with interfacing the RPi should have is a solderless breadboard. I already had one, so I didn't need to order one, but if you don't have one, you need to get one. The size isn't that important because we aren't going to be hooking up a lot of things. In this post, I'm going to assume that the reader knows how to use a breadboard. If not, just read about it online. It's not complicated.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqIjejEP5QLT7DGUTqCOWYz9MyjW36FngedS1plNTSxn8vud-eKJRzj9C6VgH1sJ8IpBV6qCObq-dmDLVsW7nul00D82PIBslSnmWQTlsjqMAonyJhGbd3jSawvZE8L1Fz5Fx-OTeGYm8/s1600/2018-02-24+22.05.04.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1200" data-original-width="1600" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqIjejEP5QLT7DGUTqCOWYz9MyjW36FngedS1plNTSxn8vud-eKJRzj9C6VgH1sJ8IpBV6qCObq-dmDLVsW7nul00D82PIBslSnmWQTlsjqMAonyJhGbd3jSawvZE8L1Fz5Fx-OTeGYm8/s320/2018-02-24+22.05.04.jpg" width="320" /></a></div>
Another item that would be difficult to do without is a set of jumper wires. For $7, I found a set that had all three combinations of sockets and prongs. (All prices are in US dollars.) The ones I've used the most have a socket on one end (to fit over the GPIO pins) and a prong on the other end (to fit in a hole in the breadboard). For making shorter connections within the breadboard, I used short pieces of insulated solid wire. Wire cutters/strippers are very useful for preparing those wires.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhU51Fbxhxoyq6sySVjeSX5wwsdRqfILM2q9rJokkqvrwpNhMDZTsXjianW4rpNZBcH-_I_SwIFdYBkgTcdALbs0ouGGVMixcVGmLD4b3SZ-e1F59jRpQ1RUhn70xFhrbn7ZMdBtbubdqs/s1600/2018-02-25+07.44.31.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1200" data-original-width="1600" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhU51Fbxhxoyq6sySVjeSX5wwsdRqfILM2q9rJokkqvrwpNhMDZTsXjianW4rpNZBcH-_I_SwIFdYBkgTcdALbs0ouGGVMixcVGmLD4b3SZ-e1F59jRpQ1RUhn70xFhrbn7ZMdBtbubdqs/s320/2018-02-25+07.44.31.jpg" width="320" /></a></div>
One thing that you will hear repeatedly as you read about using the GPIO connections of the RPi is that you should never expose the pins to voltages over 3.3 volts, nor draw too much current from the power outputs or pins. The GPIO pins can output enough current to light an LED, or to turn a transistor on, but they can't handle outputting larger amounts of current, such as would be necessary to drive a motor or the coil of a relay. Because the pins shouldn't be exposed to voltages over 3.3 volts, they can't accept input from things like TTL chips that work at 5 volts. The solution in both of these cases is to use optoisolator (or optocoupler) chips. An optoisolator consists of an LED pointing at a phototransistor (effectively acting as a switch) inside an opaque package. The operative principle is that when the LED turns on, the transistor gets turned on, but without any direct electrical connection between the two circuits.<br />
<br />
For output, the optoisolator LED is turned on by the GPIO interface and the controlled device is turned on by the phototransistor. For input, the optoisolator LED is turned on by the external sensor and the transistor is used to change the voltage present on the GPIO pin. This means that all kinds of bad things (such as over-voltaging or excessive current) can happen on the external side without having any effect on the Raspberry Pi. The worst-case scenario is that the optoisolator will get fried. Since they only cost me 35 cents each (a bag of 20 for $7), that's no big loss. The part number that I bought was AE1143, but there are probably others that are equivalent. However, if you want to use them in a breadboard, be sure that the ones you buy have a normal DIP package (see picture above) that will fit in the breadboard holes.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_ER1Hvws_sI5JNLvmPpEJQAiwA-17sTsomxz-1lwjlrkVWJ_ZSikCqrPshPNzcz3-l_jTAxFqwp2k00XwIeDigJH2KllyiT_cyFfRxBzV2M2sTvPfrsZYixowCH4vEwjRl-ECD1fj95A/s1600/2018-02-25+07.43.56.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="652" data-original-width="1600" height="259" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg_ER1Hvws_sI5JNLvmPpEJQAiwA-17sTsomxz-1lwjlrkVWJ_ZSikCqrPshPNzcz3-l_jTAxFqwp2k00XwIeDigJH2KllyiT_cyFfRxBzV2M2sTvPfrsZYixowCH4vEwjRl-ECD1fj95A/s640/2018-02-25+07.43.56.jpg" width="640" /></a></div>
In order to actually turn stuff on and off, you need to have relays that can be turned on and off by the optoisolator. You could buy individual relays and the various electronic parts that need to go between them and the optoisolators, but I decided it would be simplest to just buy a board that had 8 single-pole, double-throw (SPDT) relays with most of the necessary circuitry already on board, including built-in optoisolators. You can also buy modules that have 4 or fewer relays, but they aren't much cheaper than the $10 I paid for this one. There are a bunch of places online that sell them and put their brand name on them, but it appears that they are all the same and made by the same manufacturer.<br />
<br />
If the only thing you want to do is output, you don't need to buy the discrete optoisolators I described above since they are already included on this board. But if you also want to do input from sensors to the RPi, you should buy the separate optoisolators. (I'm not going to describe how to do input in this post, but it isn't very complicated.)<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh0e2ZTphVJVp9BpQ1aMiYKLMiil42KLbznLAIq9o47q9HH808MsJB8HgAktZKx2Y085DuESHErehwINqP3w_eG7hLfdASVqByEXPpnSKs1V8zbZ0Ft2omp3erLfwUg5waQiwyClFbMi64/s1600/2018-02-25+07.51.02.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="873" data-original-width="1600" height="174" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh0e2ZTphVJVp9BpQ1aMiYKLMiil42KLbznLAIq9o47q9HH808MsJB8HgAktZKx2Y085DuESHErehwINqP3w_eG7hLfdASVqByEXPpnSKs1V8zbZ0Ft2omp3erLfwUg5waQiwyClFbMi64/s320/2018-02-25+07.51.02.jpg" width="320" /></a></div>
You can actually connect the relay module directly to the RPi GPIO pins via jumper wires, but for reasons that I'll get into later, it is probably better to buy a chip like the ULN2803APG, which contains an array of eight Darlington transistor pairs, and use it between the GPIO interface and the relay board. I bought a pack of two ULN2803APG chips for $5 - each chip can control 8 relays, so one chip is actually all you need to drive the relay module.<br />
<br />
So if you already have a monitor, keyboard, USB power supply, etc. to hook up the computer, and if you already have a breadboard, your total cost to get off the ground with the RPi computer, relay module, and parts necessary to connect them is about $60 (plus whatever you have to pay for shipping).<br />
<h2>
Important issues relating to interfacing the Raspberry Pi</h2>
Although the Raspberry Pi provides a great opportunity for learning electronics, I already had enough experience with electronics that I wasn't really looking at this project as a means to increase my knowledge in that area. Mostly, I just wanted to turn things on and off with the minimal amount of effort. So of course, I started by googling topics related to interfacing an RPi.<br />
<br />
Unfortunately, most of the results fell into two categories: questions asked by people who knew little or nothing about electronics that were answered by people who also didn't really know much about electronics, or highly technical questions asked by people who knew a lot about electronics that resulted in technical answers given by electrical engineers. Neither of these kinds of sources of information really told me what I wanted to know: the most straightforward way to safely turn things on and off using the RPi GPIO pins.<br />
<br />
After consulting a number of online sources, I reached some conclusions, which I will summarize below. I should also say that I found the book "Exploring Raspberry Pi: Interfacing to the Real World with Embedded Linux" by Derek Molloy (John Wiley & Sons, 2016) very useful as a comprehensive reference. The book went far beyond where I was interested in going, but there were two sections that were particularly helpful. The general introduction to the GPIO (pgs. 220-223) and the introduction to digital input and output to powered circuits (pgs. 224-229) provided pretty much all of the technical details I needed to safely start interfacing without having to worry about frying the RPi.<br />
<br />
<h3>
Pins on the GPIO</h3>
One of the most important details is knowing the purpose of the 40 pins of the GPIO. Fig. 6-1 (p. 221) of the Molloy book is probably the best diagram I've seen, but since I don't have permission to post it here, I'll instead include a diagram from <a href="https://pinout.xyz/">https://pinout.xyz/</a>:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://pinout.xyz/resources/raspberry-pi-pinout.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="249" data-original-width="800" height="198" src="https://pinout.xyz/resources/raspberry-pi-pinout.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
The orientation of the pins in this diagram corresponds to the orientation shown in the close-up image of the RPi shown earlier in this post. There are two numbering systems for referring to the pins. One refers to the physical position of the pin. That system starts with 1 at the lower left, 2 at the upper left, 3 lower 2nd column, 4 upper 2nd column, 5 lower 3rd column, etc. to pin 40 at the upper right. The other numbering system, which is probably more commonly used, is the "GPIO" number. I believe that this numbering system is consistent with earlier models of RPi that had fewer than 40 pins. In the diagram above, the GPIO numbers are shown above and below the pins (e.g. GPIO14 and GPIO15 in the upper row of pins, 4th and 5th pins from the left). In the default operating mode, any of these numbered GPIO pins can be used for input or output, with the exception of the ID_SD and ID_SC pins numbered 0 and 1 in this diagram. You should not connect anything to these pins unless you do further research into their function. Various pins can serve purposes (indicated by the color highlighting) other than general input and output when the GPIO is put into other modes, but that is beyond the scope of this post. </div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<h3 style="clear: both; text-align: left;">
Power from the GPIO</h3>
<div class="separator" style="clear: both; text-align: left;">
In the diagram above, there are 12 pins that don't have GPIO numbers. The 8 black-colored pins are ground pins. They are all equivalent. The two red-colored pins labeled "5V" can provide power at 5 volts, and the two tan-colored pins labeled "3V3" can provide power at 3.3 volts. These pins provide a convenient source of power for things that you've connected to the GPIO pins, but they have a limited power output. Your circuit should not draw more than 200-300 mA from the 5 V source and should draw no more than 50 mA from the 3.3 V source. </div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
You should make sure that when the RPi is turned off, there is no power being applied to the GPIO pins. That's not a problem if you are using only the built-in power supply from the 5.0 and 3.3 V pins because they'll power down when the RPi powers down. If your circuit needs more current than what the built-in power supply can provide, you should provide an external power supply. But it's best that such externally-powered circuits be electrically isolated on the other side of optoisolators anyway so that you don't have to worry about them accidentally applying power to the GPIO pins of the turned-off RPi. It is good to use the internal GPIO power pins only for parts of the circuit on the computer side of the optoisolators.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://pbs.twimg.com/media/DVZ1_EPVMAAaLsa.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="600" data-original-width="800" height="240" src="https://pbs.twimg.com/media/DVZ1_EPVMAAaLsa.jpg" width="320" /></a></div>
<h2 style="clear: both; text-align: left;">
Turning on an LED with a GPIO pin</h2>
<div class="separator" style="clear: both; text-align: left;">
There are abundant examples on the web showing how to turn an LED on and off using one of the GPIO pins set in output mode. Essentially, when the GPIO pin is set to "on", it outputs 3.3 volts and when it is set to "off", it is at ground. To turn on and off an LED, you simply place an LED in series with a resistor and connect the ends to the GPIO pin and one of the ground pins. (If you put the LED in backwards, nothing bad happens - just turn it around and try again.) The resistance of the resistor should be low enough that the LED lights up enough to see, but not so low that the circuit draws more than 2-3 mA from the 3.3V output of the GPIO pin. A 1 k ohm resistor should be OK for that purpose. (If you are only turning on a single LED, you can make the LED brighter by using a smaller resistor and draw more than 3 mA from a GPIO pin. But you don't want to do that with multiple pins at once.) I chose to use GPIO pin 18 because it was conveniently located next to a ground pin.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
There are a number of ways to use software to turn a GPIO pin on and off. Since I planned to use Python to write the controlling software, I imported a module called "gpiozero" that has a simple function for turning a GPIO pin on and off. (There are other more sophisticated Python modules for interacting with the GPIO, but that's beyond the scope of this post.) Here's the code:</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;">from gpiozero import LED</span></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;">from time import sleep</span></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;">led18 = LED(18)</span></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;">led18.on()</span></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;">sleep(5)</span></div>
<div class="separator" style="clear: both;">
<span style="font-family: "courier new" , "courier" , monospace;">led18.off()</span></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
The program makes GPIO18 go to 3.3 volts (turning the LED on), waits for 5 seconds, then makes GPIO18 go to 0 volts (turning the LED off). This was pretty exciting for about the first minute or so after I got it to work, but it wasn't really what I was trying to accomplish: turning any device on and off.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://upload.wikimedia.org/wikipedia/commons/thumb/0/02/Optoisolator_Pinout.svg/200px-Optoisolator_Pinout.svg.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="120" data-original-width="200" src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/02/Optoisolator_Pinout.svg/200px-Optoisolator_Pinout.svg.png" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<span style="font-size: xx-small;">Wikimedia Commons, Optoisolator_Pinout.svg</span></div>
<br />
However, if you look at the circuit diagram of an optoisolator, you can see that the left side is simply an LED. So the simple task of turning on an LED is really useful if that LED is inside an optoisolator. Following the example of Fig. 6-7 (p. 228) of the Molloy book, I connected pin 2 of the optoisolator to one of the GPIO ground pins, connected a resistor of about 2 k ohm to pin 1 of the optoisolator, and connected the other end of the resistor to the GPIO pin that I wanted to use to control the circuit (e.g. GPIO18). (Note: pin 1 is designated by a small dot on the top of the optoisolator DIP.) Based on Fig. 6-7, that should result in drawing a safe current of about 1 mA from the GPIO pin.<br />
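<br />
<br />
To sanity-check resistor choices like these, the arithmetic is just Ohm's law applied to the voltage left over after the LED's forward drop. Here is a small Python sketch; the forward-voltage figures are typical assumptions, not measurements:<br />
<br />
<pre>
# Rough LED current estimate for a 3.3 V GPIO pin driving an LED through a resistor.
# The forward voltages are typical assumed values, not measured ones.
def led_current_ma(supply_v, forward_v, resistor_ohms):
    """Approximate LED current in milliamps."""
    return (supply_v - forward_v) / resistor_ohms * 1000

# Discrete red LED (about 2 V drop) with a 1 k ohm resistor: about 1.3 mA
print(round(led_current_ma(3.3, 2.0, 1000), 2))

# Optoisolator input LED (about 1.2 V drop) with a 2 k ohm resistor: about 1 mA
print(round(led_current_ma(3.3, 1.2, 2000), 2))
</pre>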
<br />
<h2>
Turning something on and off with an optoisolator</h2>
The phototransistor side of the optoisolator is essentially a switch. When sufficient light comes from the LED inside the optoisolator, current flows from pin 3 to pin 4. When the LED is dark, no current flows. However, the amount of current that flows through the phototransistor is pretty small when the LED is lit with only 1 mA of current. So the phototransistor can be used to turn on another transistor in one of two ways:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbYTrpKm5AcnYCs3yncqzNjRdsOBPiDgSuXBWVXpd98Ed75C7OIxuhmlgNArEj8vThb0_xi0Up5PNqELNaLKeCzNzA4H7s9tOxogU46g9uiqcYEShq9lExW2cvWObu9_ZhuIDHIidepJQ/s1600/transistors.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="298" data-original-width="576" height="206" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbYTrpKm5AcnYCs3yncqzNjRdsOBPiDgSuXBWVXpd98Ed75C7OIxuhmlgNArEj8vThb0_xi0Up5PNqELNaLKeCzNzA4H7s9tOxogU46g9uiqcYEShq9lExW2cvWObu9_ZhuIDHIidepJQ/s400/transistors.png" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<span style="font-size: xx-small;">Wikimedia Commons, left: Darlington_pair_diagram.svg CC BY-SA by user Michael9422, right: Compound_trans.svg</span></div>
<br />
The transistor pair on the left is called a Darlington pair and the pair on the right is called a Sziklai pair. In both pairs, the transistor on the left (Q1) would represent the phototransistor inside the optoisolator. Instead of being controlled by current flowing into its base (B), Q1 is controlled by the light from the LED striking it. In the Darlington pair, Q2 is an NPN transistor, which is turned on by current flowing <b>into</b> its base. So when the phototransistor Q1 turns on, the small current flowing from it into the base of transistor Q2 is enough to saturate Q2, causing a lot of current to flow through Q2 from its collector (C) to its emitter (E). In the Sziklai pair, Q2 is a PNP transistor, which is turned on by current flowing <b>out</b> of its base. So when the phototransistor Q1 turns on, the small current flowing through it pulls enough current from the base of transistor Q2 to saturate Q2, causing a lot of current to flow through Q2 from its collector (C) to its emitter (E).<br />
<br />
Either of these two configurations produces the same result: turning on the phototransistor in the optoisolator turns on a second transistor that can sink a lot more current. The current sunk by the second transistor is enough to turn on a small light, or to energize the coil of a relay. If the thing that you want to turn on and off is something that draws a lot of current, uses a voltage higher than about 5 volts, or uses alternating current, then you will need to use the second transistor to turn on a relay. A relay is a mechanical switch that is closed when its coil is energized. Since the switch is mechanical, it doesn't care about the nature of the electricity passing through it as long as the voltage and current don't exceed the maximum for which it is rated. So for example, if you want to turn the lights of your house on and off, you'll need a relay since the voltage is over 100 volts and is alternating current. (Note: I do NOT advise that you try this unless you are familiar with the safety hazards associated with household wiring. You can electrocute yourself if you make a mistake.) You also should use a relay if you want to control any kind of motor.<br />
<br />
<h2>
Turning the 8-relay module on and off</h2>
If you don't care about how the electronics work and just want to put the relay circuit together, skip this section.<br />
<br />
This kind of setup is exactly what is built into the 8-relay module that I bought online. Sunfounder has <a href="http://wiki.sunfounder.cc/index.php?title=8_Channel_5V_Relay_Module">a useful page</a> that provides a helpful circuit diagram of the 8-relay module. From that page, you can download a large scale circuit diagram as well as a wiring diagram of the module. I've pulled out the circuit diagram for one of the relay modules:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEheDq4HePwbyMyhIeNQ-Jg-ukV9XVzo0KGkYwuuLC3Qb9CuUpwCt297ES9IrDm5MRITfLnMYjSWyoOrhweOoHh-SAQ2Uo6iXjga9wI4KXDtQ75oHtvKTuMNcn03rgi8hVTBPPY532MwEUw/s1600/8-channel+relay+module.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="367" data-original-width="864" height="270" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEheDq4HePwbyMyhIeNQ-Jg-ukV9XVzo0KGkYwuuLC3Qb9CuUpwCt297ES9IrDm5MRITfLnMYjSWyoOrhweOoHh-SAQ2Uo6iXjga9wI4KXDtQ75oHtvKTuMNcn03rgi8hVTBPPY532MwEUw/s640/8-channel+relay+module.png" width="640" /></a></div>
<br />
If you compare this diagram to the ones above, you'll see that the central part of the circuit is a Darlington pair. The load that is turned on by the NPN transistor T5 is a relay, shown in the upper right of the diagram. When the coil is not energized, pin 1 of the relay is connected to pin 2. When the coil is energized, pin 1 is connected to pin 3 of the relay. Thus, the relay is a single pole, double throw (SPDT) switch. The diode D5 is there because when a coil is de-energized, its collapsing magnetic field can generate a surge of current that can damage the transistors, so the diode allows that current to safely dissipate.<br />
<br />
According to what I've read online, you should be able to connect the input of the relay module directly to one of the GPIO pins and use it to turn the relay on and off. The "testing experiment" for Raspberry Pi on the Sunfounder page shows how to do this. But I would NOT recommend that you try that experiment for several reasons. The most obvious reason is that the photos on the web page do not show clearly how to connect the wires (some wires hide others, making it hard to tell what's going on). The other reason why following the pattern in that example is a bad idea is because it uses the Raspberry Pi's power supply to run everything. In their example, they get away with it, but if you are really planning to use the relay module to power 8 devices, you need to have a better understanding of how to make the connections in a safe way that doesn't risk over-voltaging or drawing too much current from the GPIO connections.<br />
<br />
The first issue with the Sunfounder example circuit involves using the RPi's 5 volt power pin to run everything, including the LEDs shown on the circuit board. In real use, the load driven through the relays would be powered separately, using any possible voltage and current that the relays are rated for. The particular relays in the module say that they can handle up to 10 A, AC voltages up to 250 V, and DC voltages up to 30 V. For my testing, I connected a little battery-powered motor to pins 1 and 3 so that the motor would turn on and off. That circuit should have no connection to anything else on the circuit board.<br />
<br />
The other issue is whether the voltage supply pins labeled "VCC" and "JD-VCC" should be tied together and supplied with a single power supply, or if they should be supplied with power separately. When the relay unit ships, it comes with a jumper that connects VCC and JD-VCC pins:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg986kIcGI0MSdob-3_c3xgjyCqibuJKYRnXw5h8FlYsPgfnVFiTUHx7qZWoOn1dFHPfUs_qlqeG4prK_IUSW_1NFNr_qH5pshYmjE_aSUtOQQOJiZvnAFUL56qbzHK6s7QSUcx_DWRWMQ/s1600/jumper.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="199" data-original-width="359" height="177" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg986kIcGI0MSdob-3_c3xgjyCqibuJKYRnXw5h8FlYsPgfnVFiTUHx7qZWoOn1dFHPfUs_qlqeG4prK_IUSW_1NFNr_qH5pshYmjE_aSUtOQQOJiZvnAFUL56qbzHK6s7QSUcx_DWRWMQ/s320/jumper.png" width="320" /></a></div>
You should pull this jumper off the board and leave the two pins disconnected. If you want to connect them in the future, it should be a conscious decision on your part after considering the implications, but it should not happen by default. If you look at the circuit diagram above, you'll see that connecting VCC with JD-VCC defeats the purpose of even having the optoisolator in the circuit, since it makes an electrical connection between the circuits on its two sides.<br />
<br />
The circuit diagram shows that JD-VCC supplies 5 volt power to the transistors and coils of the relays. When the relays aren't energized, the current is minimal, but when a single coil is energized, it draws about 65 mA. So if you were only going to use one of the 8 relays, you could easily use the 5 V power pin from the RPi GPIO pin set to supply the power. However, if you used three relays and they were routinely energized at the same time, you would be approaching the 200 mA limit of output for the 5 V GPIO power pin. Using all 8 relays would draw over 500 mA, which would probably either fry or at least crash the RPi. In addition, if like me you are running the RPi off of an iPad charger/power supply that barely provides enough current to run the RPi even without interfacing, you might crash the RPi with even fewer than 3 relays connected. A relatively simple solution is to just create your own battery-operated power supply using 4 D cells and a voltage regulator. That could easily run the relay unit for a pretty long time, and by being battery powered would enable you to use it in a robot without having to have a cord plugged into a wall receptacle. See the Appendix at the end for more on this.<br />
<br />
The other problem with the relay board is that the inputs are "active low". That means that they are turned on when the GPIO pins are "off" (at ground = 0 V). To turn the relays off, the GPIO pins controlling them need to be "on" (3.3 V). Since the GPIO pins' starting state is "off", that isn't really a good thing, because it means that the coils on the relay board would be energized as soon as the RPi is turned on - before you even start running the software to control it. It would be better to make the inputs "active high" so that the coils would only be energized when you deliberately send a signal to energize them (i.e. turn the GPIO state to "on"). I became aware of this issue while reading <a href="https://www.raspberrypi.org/forums/viewtopic.php?t=36225">a thread on the raspberrypi.org forum</a>. I don't recommend reading the thread unless you are really hard-core, because it gets deep into the weeds and the suggested solution involves hooking up a bunch of discrete transistors and resistors to solve the problem.<br />
<br />
There is actually a much simpler solution that has been mentioned in several other places: using a ULN2803APG chip (the last item on the list of supplies that I bought for this project). You can view <a href="https://cdn-shop.adafruit.com/datasheets/ULN2803A.pdf">the data sheet</a> for the ULN2803, but I'll cut to the chase by inserting the circuit diagram here:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi4eXph9cle0cHgiRBrPnQ6Is6cl5f2LNffAiYqUbqppCri6OgHkIVBHeNy2WLpgjUJtkzce1zKvBOwi7_-ddSZqrGB-ue-IV7TiG8ZjOyKrcB_vLxqIZg3YpgGNdTv299d_R5bd-Het5Y/s1600/uln2803.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="239" data-original-width="409" height="232" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi4eXph9cle0cHgiRBrPnQ6Is6cl5f2LNffAiYqUbqppCri6OgHkIVBHeNy2WLpgjUJtkzce1zKvBOwi7_-ddSZqrGB-ue-IV7TiG8ZjOyKrcB_vLxqIZg3YpgGNdTv299d_R5bd-Het5Y/s400/uln2803.PNG" width="400" /></a></div>
<br />
You can ignore all of the resistors and diodes and just focus on the two transistors. If you compare this diagram with my earlier diagrams, you'll see that the ULN2803APG contains a Darlington pair. When the input of the ULN2803APG goes "high", current flows into the base of the transistor on the left, turning it on. Current flows from its emitter into the base of the second transistor, turning it on - effectively closing a switch that connects the output to ground, i.e. making the output "low". If the output is connected to the input of one of the relay controllers, it will ground the relay input, causing both the LED inside the optoisolator to turn on and the indicator LED on the relay circuit board to turn on.<br />
<br />
When the input of the ULN2803APG goes low, both transistors turn off, disconnecting the output from the ground and allowing its voltage to float. If the output is connected to the input of one of the relay controllers, it will be at the VCC voltage and the optoisolator won't be turned on.<br />
<br />
The combination of resistors inside the ULN2803APG was chosen so that the output goes low when the input exceeds 2.5 volts (just right for the GPIO output voltage of 3.3 V).<br />
<br />
So essentially, the ULN2803APG chip inverts the control signals, so that from the Raspberry Pi's point of view the relay inputs behave as active high rather than active low. That inversion is shown symbolically in the pinout diagram for the ULN2803:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhWsM1zh0gJkbzLJm89PxcaB5mw8zg9UTGEXa6pcdVKFl9rcPZhrmYm9jrCkT7l8ApsZZ_0fGBySRTDJUjKPki-e9gZGYgrA2oVMog481WIJptpiiLEZFHmhW9aWSu0laSYDLrp95V8qc/s1600/pins.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="260" data-original-width="435" height="191" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhWsM1zh0gJkbzLJm89PxcaB5mw8zg9UTGEXa6pcdVKFl9rcPZhrmYm9jrCkT7l8ApsZZ_0fGBySRTDJUjKPki-e9gZGYgrA2oVMog481WIJptpiiLEZFHmhW9aWSu0laSYDLrp95V8qc/s320/pins.png" width="320" /></a></div>
where the circuitry is summarized as "not" gates (changing low to high and high to low).<br />
<br />
The wiring configuration is super simple. For each relay that you want to control, connect the output of the GPIO to one of the bottom pins on the chip (1 through 8), then connect the pin on the opposite side (11 through 18) to the input on the relay board that you want to control. The GND pin needs to be tied to one of the ground pins on the RPi and to the ground pin on the relay board. It does not seem to me that it should be necessary to connect the "COMMON" pin to anything, although in the examples I've seen it's been connected to VCC of the relay board. I don't think it matters, since under normal operation the diode on the common connection will block any current from flowing anyway.<br />
<br />
The only remaining question is what power source to use for the VCC connection on the relay board. If one created an external 5 V power supply (such as the one I suggested using D batteries), that supply could be used. However, in the spirit of keeping the RPi completely isolated electrically from external circuits, it would probably be best to connect the VCC connection on the relay board to one of the 5 V power pins on the GPIO since the VCC connection supplies the computer side of the optoisolators. In my test circuit, I found that the ULN2803APG chip drew a negligible amount of current when connected by itself to the 5 V pin (as expected given the diode in the "common" connection inside the chip). When the VCC pin of the relay board was connected to the 5 V pin of the GPIO, it only drew about 1.3 mA per relay control circuit. So even if all 8 relays were in use, it would draw only about 10 mA from the 5 V pin - way below the 200 mA maximum "safe" output for that pin. I didn't actually measure the current being drawn from the GPIO control output pin, but I would imagine that it would be at a safe, low value since it is only turning on the transistors on the ULN2803APG, and not actually driving the optoisolator and status LEDs on the relay board as it would have been if there were a direct connection from the GPIO to the relay board.<br />
<br />
After all of the stress I encountered trying to figure out a "safe" way to run the 8-relay module from the RPi GPIO, I'm pretty satisfied because this setup is both really simple to wire and also keeps the currents and voltages on the GPIO pins far below their safety limits. If a separate 5V supply is used to power the relay coils via JD-VCC (rather than using a 5 V power pin from the GPIO), the RPi is also completely isolated electrically from external circuits on the far side of the optoisolator.<br />
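<br />
If you prefer the software to say explicitly that the relay is active high once the ULN2803APG is in the path, gpiozero's OutputDevice class can be used instead of the LED class from the earlier snippet. This is just a sketch of the idea, using the same GPIO18 as my example:<br />
<br />
<pre>
# Sketch: driving one relay channel through the ULN2803APG with gpiozero.
# With the chip in the path, setting the GPIO pin high energizes the relay,
# so active_high=True matches the wiring described above.
from gpiozero import OutputDevice
from time import sleep

relay1 = OutputDevice(18, active_high=True, initial_value=False)  # relay off at startup

relay1.on()   # energize the coil; the indicator LED on the relay board should light
sleep(5)
relay1.off()  # de-energize the coil

# If the relay input were wired directly to the GPIO pin (not recommended above),
# active_high=False would keep .on() meaning "relay energized".
</pre>
<br />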
<h2>
<br />Quick and dirty instructions for connecting the 8-relay board to the Raspberry Pi</h2>
It is best to make the initial connections with the RPi turned OFF in case you plug a wire in the wrong place during the setup.<br />
<br />
1. Remove the jumper connecting the VCC and JD-VCC pins on the 8-relay board and leave it off.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEihXnRVH2RjBzA6WdVS5sEM05lIWciK_vdhd_XUVRmnmYd7kxw4TQdDKhVGw60ANwN2nOpYYI38Xn9efmWXszIyAkS4txjUtR4yxcI7cWXd5WGvoqsyeKfkwkoz1EWfqeStxtVWGllhuk0/s1600/jumper.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="199" data-original-width="359" height="177" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEihXnRVH2RjBzA6WdVS5sEM05lIWciK_vdhd_XUVRmnmYd7kxw4TQdDKhVGw60ANwN2nOpYYI38Xn9efmWXszIyAkS4txjUtR4yxcI7cWXd5WGvoqsyeKfkwkoz1EWfqeStxtVWGllhuk0/s320/jumper.png" width="320" /></a></div>
2. Insert the ULN2803APG chip into your solderless breadboard.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgrPPgtMxj0cnu4-Nh72bOw2KJxuZPHG3nYpe_BUzmMV-5jyJzyzByn_6stqxGZB4taDlGuSm-wBPfTwpereVAA5eXOOEZ1btdhg3Lvi4gqJP9tXjjn8B6v3-l17I14S6QcnKhyTbHLSrg/s1600/pins.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="260" data-original-width="435" height="191" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgrPPgtMxj0cnu4-Nh72bOw2KJxuZPHG3nYpe_BUzmMV-5jyJzyzByn_6stqxGZB4taDlGuSm-wBPfTwpereVAA5eXOOEZ1btdhg3Lvi4gqJP9tXjjn8B6v3-l17I14S6QcnKhyTbHLSrg/s320/pins.png" width="320" /></a></div>
3. Use a jumper wire to connect one of the ground pins from the GPIO (it doesn't matter which one) to the GND pin of the ULN2803APG chip.<br />
<br />
4. Use a jumper wire to connect the GND pin of the ULN2803APG chip to the GND pin of the 8-relay board (it doesn't matter which GND pin).<br />
<br />
5. Use a jumper wire to connect one of the 5 V pins on the Raspberry Pi's GPIO (it doesn't matter which of the 5 V pins) to the common pin of the ULN2803APG chip.<br />
<br />
6. Use a jumper wire to connect the common pin of the ULN2803APG chip to a VCC pin on the 8-relay module (it doesn't matter which VCC pin).<br />
<br />
7. Connect a jumper wire from the GPIO output pin that you want to use to one of the input pins on the ULN2803APG chip. If you want to use the code in my example, use GPIO18 (physical pin 12 on the header). Using input pin 1 on the ULN2803APG chip would be sensible.<br />
<br />
8. Connect a jumper wire from the corresponding output pin of the ULN2803APG chip to the input of the relay that you want to use on the 8-relay board. If you used ULN2803APG input pin 1 in the last step, the corresponding output is pin 18, directly across the chip.<br />
<br />
9. Connect the JD-VCC pin on the 8-relay board to a 5 V source of power. If you just want to test the system with a single relay, you could connect it by a jumper to a 5 V power pin of the RPi's GPIO (or to the common pin of the ULN2803APG chip, which is itself connected to the 5 V pin). But don't do this if you are going to use more than 2 or 3 of the relays (see details above for the reason). In that case, buy or make a separate 5 V power supply to supply JD-VCC (see appendix).<br />
<br />
10. Turn on the Raspberry Pi and your external 5 V power supply (if you used one).<br />
<br />
11. Run the code snippet that I gave above if you are using Python (you must first have imported the gpiozero module). For other programming languages, look up appropriate code on the web. If everything is working, you should see the indicator LED for your chosen relay turn on for 5 seconds, then turn off. If you listen carefully, you should also be able to hear a quiet clicking sound as the relay closes and opens.<br />
<br />
12. If everything has worked up to this point, connect something that you want to turn on to the relays. You'll need a jeweler's screwdriver or some other small screwdriver to open the screw that clamps down on the output wires from the relay. A small, battery-powered motor is good for a test. Here's how it worked for me: <a href="https://twitter.com/baskaufs/status/963243621012123648">https://twitter.com/baskaufs/status/963243621012123648</a><br />
<br />
<h2>
Froggy the Robot, take 1</h2>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxdjELtrgr5EH0h_Zrh2M7nqtxASzeqjIrfGiSnAIbUkBe1MKTMZihyphenhyphenA9MCfXeD1aUEHLkYhmWST0BMXJeLSV6SUlcew7jMM1RXKbEeUZbUxDOrk-yPNnp1LWjrYUmtA-owecEugIxT4U/s1600/51-jXz7nISL._SY344_BO1%252C204%252C203%252C200_.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="346" data-original-width="202" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxdjELtrgr5EH0h_Zrh2M7nqtxASzeqjIrfGiSnAIbUkBe1MKTMZihyphenhyphenA9MCfXeD1aUEHLkYhmWST0BMXJeLSV6SUlcew7jMM1RXKbEeUZbUxDOrk-yPNnp1LWjrYUmtA-owecEugIxT4U/s320/51-jXz7nISL._SY344_BO1%252C204%252C203%252C200_.jpg" width="186" /></a></div>
Ever since I was a kid and read <em>Andy Buckram's Tin Men</em> (Carol Ryrie Brink, 1966), I always thought it would be really cool to build a robot. When I was in college, I had the opportunity to take a course that focused on digital electronics and we had fun in the class building burglar alarms and other cool stuff with TTL logic chips. So over the years, I accumulated various power supplies, chips, and other miscellaneous junk with the intention of actually building a robot some day.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhZxpcTspHQ_HPwGFfBTFiS4amawZ3PCwrmbzJDUcCl4O4RW-lAe8_bh-3WX1gP2Wk79Ru46JA8YV9u_233y5FidzDZnZ_2pdvrMgDrFbS1DqpGaITTR6-IJNCaY9kPOVlPb8e_QNDoFaU/s1600/new+doc+2018-02-25+18.45.08_1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1600" data-original-width="1446" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhZxpcTspHQ_HPwGFfBTFiS4amawZ3PCwrmbzJDUcCl4O4RW-lAe8_bh-3WX1gP2Wk79Ru46JA8YV9u_233y5FidzDZnZ_2pdvrMgDrFbS1DqpGaITTR6-IJNCaY9kPOVlPb8e_QNDoFaU/s400/new+doc+2018-02-25+18.45.08_1.jpg" width="361" /></a></div>
About ten years ago when my two daughters were in middle school, I decided that the time was ripe for actually building the robot as a father-daughter project. Somewhere along the line, I acquired the plans for building an RS232 interface based on an AY-3-1015D Universal Asynchronous Receiver/Transmitter (UART) and the UART chip itself. The plan was to use the UART to communicate between a laptop's serial port and the robot, and use the UART data bit outputs to control relays on the robot. So after some soldering lessons for the girls, we started putting it together. I think I underestimated the patience of pre-teens for hours of soldering and ended up doing most of it myself, but I think they understood the basic principle of what we were building.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8GIGig1hg9hyImRFYK_etYjLHx1f9ngV4kDqZRQP9h6ORqx5Y2FfOFNjJxqzWPrFgIjO81VJfY1vfA2q-AKvA-5qCBAx5TFEW5EBWUTBqiIoWJNzQCQBOvaD8aTonCzDSGIessYg6i7E/s1600/2018-02-25+18.59.06.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1200" data-original-width="1600" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8GIGig1hg9hyImRFYK_etYjLHx1f9ngV4kDqZRQP9h6ORqx5Y2FfOFNjJxqzWPrFgIjO81VJfY1vfA2q-AKvA-5qCBAx5TFEW5EBWUTBqiIoWJNzQCQBOvaD8aTonCzDSGIessYg6i7E/s400/2018-02-25+18.59.06.jpg" width="400" /></a></div>
<br />
In the end, we had a little Visual Basic program with buttons that sent a number whose bits determined which relays should be turned on and off. Each bit of the output of the UART went through a 7404 TTL NOT chip (to prevent backwards frying of the UART chip and to invert the signal), and the output of the 7404 drove a PNP transistor in a manner very analogous to the Sziklai pair discussed earlier in this post.<br />
<br />
One difference between the relays that we used in our project and the relays that come on the 8-relay board is that the relays in our robot project were double pole, double throw (DPDT) rather than SPDT. The reason this was important to us was that we wanted to use the relays to be able to reverse the direction of the robot motors. See this diagram that I borrowed from quora.com:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgo48jFvYA9cfs4y9GcXGTYFfXDy5xZ1KZ4FJKu_0MPwyxt1Sf3Rw8TtA-ZU4IqCQhbgXzqFzGVyvNxV3jV3eIK58Zjqosg4JEA1XZZOiCp22W3oV7SetgywNCm6xHMuMGY4bj9exeeym8/s1600/main-qimg-5a0e7843f693ffa73e40b017eda6b7ae.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="177" data-original-width="559" height="126" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgo48jFvYA9cfs4y9GcXGTYFfXDy5xZ1KZ4FJKu_0MPwyxt1Sf3Rw8TtA-ZU4IqCQhbgXzqFzGVyvNxV3jV3eIK58Zjqosg4JEA1XZZOiCp22W3oV7SetgywNCm6xHMuMGY4bj9exeeym8/s400/main-qimg-5a0e7843f693ffa73e40b017eda6b7ae.png" width="400" /></a></div>
When the switch contacts are thrown up, the positive side of the battery is connected to the + end of the motor and the motor rotates one way. When the switch contacts are thrown down, the positive side of the battery is connected to the - end of the motor and the motor runs the other way. This kind of reversing action could be mimicked by the SPDT relays of the 8-relay unit, but it would require using two of the relays in tandem (i.e. 2 relays to reverse one motor). It would also be tricky to avoid shorting the battery if the two switches didn't throw at exactly the same time.<br />
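<br />
<br />
As an aside, here is a sketch of how two of the SPDT relays could be thrown in tandem from Python to mimic the DPDT reversing switch. The GPIO pin numbers are arbitrary choices for illustration, not part of my actual wiring:<br />
<br />
<pre>
# Sketch: reversing a motor's polarity with two SPDT relays driven in tandem
# (one relay per motor terminal).  Pins 23 and 24 are arbitrary example choices.
from gpiozero import OutputDevice
from time import sleep

relay_a = OutputDevice(23)  # switches one motor terminal between the two battery poles
relay_b = OutputDevice(24)  # switches the other motor terminal

def forward():
    relay_a.off()
    relay_b.off()

def reverse():
    relay_a.on()
    relay_b.on()

forward()
sleep(2)
reverse()   # both relays throw, flipping the polarity the motor sees
sleep(2)
forward()
</pre>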
<br />
Our robot was not very sophisticated - it just had 2 wheels whose direction could be controlled independently. When both wheels went forward, the robot went forward. When both went backwards, the robot went backwards. When one wheel went forwards and the other went backwards, the robot rotated in an appropriate direction (we had a third unpowered caster wheel to support the back side of the robot platform).<br />
<br />
The most exciting feature of the robot was an old drawer from a CD drive (back in the days when they were motorized). With two more relays, we could power the drawer motor and control its direction (in or out). The out-and-in movement of the drawer reminded my daughters of a frog's tongue, so that's how the robot got the name "Froggy". In the end, my daughters created a game where a magnet hung from the end of the "tongue" and they could drive the robot around picking up small iron BBs scattered on the floor.<br />
<br />
After the novelty wore off, Froggy was put away in a box. Between then and now, serial ports have virtually disappeared from computers, although I was able to get my old Dell laptop running long enough to make this video of the old Froggy in operation:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<iframe allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/8nNL0Svz1Zc/0.jpg" frameborder="0" height="266" src="https://www.youtube.com/embed/8nNL0Svz1Zc?feature=player_embedded" width="320"></iframe></div>
<br />
<h2>
Froggy the Robot, take 2</h2>
When I decided to buy a Raspberry Pi, I knew immediately that one of my first projects would be to try to rebuild Froggy to be controlled directly by the RPi GPIO interface. All of the parts of Froggy's brain that ran the UART could be lobotomized, leaving the board with the relays and all of their connections to the motors. In the part of this diagram:<br />
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8GIGig1hg9hyImRFYK_etYjLHx1f9ngV4kDqZRQP9h6ORqx5Y2FfOFNjJxqzWPrFgIjO81VJfY1vfA2q-AKvA-5qCBAx5TFEW5EBWUTBqiIoWJNzQCQBOvaD8aTonCzDSGIessYg6i7E/s1600/2018-02-25+18.59.06.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="1200" data-original-width="1600" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8GIGig1hg9hyImRFYK_etYjLHx1f9ngV4kDqZRQP9h6ORqx5Y2FfOFNjJxqzWPrFgIjO81VJfY1vfA2q-AKvA-5qCBAx5TFEW5EBWUTBqiIoWJNzQCQBOvaD8aTonCzDSGIessYg6i7E/s400/2018-02-25+18.59.06.jpg" width="400" /></a><br />
<br />
where the NOT gate was, I replaced it with the collector side (pin 3) of the phototransistor in one of the discrete optoisolators. Pin 4 was connected to a common ground with the relay coil. I was a bit concerned about whether the phototransistor could sink enough current to light the indicator LED as well as turn on the PNP transistor that drove the relay, but it had no problem with that, so I was rather quickly able to run the five control wires for the relays into the outputs of five optoisolators.<br />
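<br />
On the Raspberry Pi side, each of those five control wires simply corresponds to one GPIO output. A minimal sketch of the setup might look like the following; the BCM pin numbers are made up, and whether HIGH or LOW energizes a relay depends on how the LED side of each optoisolator is wired, so take the polarity here as an assumption:<br />
<br />
<pre># Hypothetical sketch of driving the five relay control lines from the RPi.
# Pin numbers are made up; HIGH = relay energized is an assumption that depends
# on how the optoisolator LED side is wired.
import time
import RPi.GPIO as GPIO

RELAY_PINS = [5, 6, 13, 19, 26]    # one BCM pin per optoisolator/relay

GPIO.setmode(GPIO.BCM)
for pin in RELAY_PINS:
    GPIO.setup(pin, GPIO.OUT, initial=GPIO.LOW)   # start with every relay released

def set_relay(index, on):
    """Energize (on=True) or release (on=False) relay number index (0 through 4)."""
    GPIO.output(RELAY_PINS[index], GPIO.HIGH if on else GPIO.LOW)

# Example: pulse relay 0 for half a second, then clean up.
set_relay(0, True)
time.sleep(0.5)
set_relay(0, False)
GPIO.cleanup()
</pre>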
<br />
The most difficult part of making the conversion was to make a cable to connect Froggy to the RPi. When I ran Froggy with the RS232 interface, it only required two wires (a ground wire and the signal wire). I was able to splice together several old telephone cords to make Froggy's tether quite long. However, when controlling Froggy using the RPi, I needed a separate control wire for each of the five relays, plus a ground wire. Luckily, I was able to find an ancient ribbon cable that had been spliced to a multi-wire cable, which I had salvaged from some old piece of junk. It had only narrowly escaped being thrown out last summer when I cleaned out the basement. Unfortunately, there were more wires in the multi-wire cable than in the ribbon cable, and apparently some of the ribbon cable wires weren't actually connected to anything. So I had to spend over an hour with my ohmmeter trying to figure out which of the wires at the two ends of the cable were actually connected to each other. Eventually, I had something like 10 usable wires in the cable - 6 for running Froggy now and the rest available for future expansion.<br />
<br />
Here is the end result:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<iframe allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/1HHWUh3uZZA/0.jpg" frameborder="0" height="266" src="https://www.youtube.com/embed/1HHWUh3uZZA?feature=player_embedded" width="320"></iframe></div>
<br />
You can see the Python code that runs the Froggy controller in <a href="https://gist.github.com/baskaufs/3ca84908d5fc88865957929700549b1e">this gist</a>.<br />
<h2>
Future projects</h2>
I probably won't devote a lot of energy to embellishments for Froggy. I may add one or more sensor buttons to the end of the tongue that will detect if the robot has run into something. In this post, I didn't go into how to accept input through the GPIO interface. It is much simpler than output and only requires 5 V power, an optoisolator, a resistor, and a switch. I may write another post if I get sensor buttons working.<br />
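<br />
To give a flavor of the software side (the hardware details can wait for that possible future post), reading a bumper switch through a GPIO pin just means configuring the pin as an input and either polling it or waiting for an edge. The pin number here is made up, and I am assuming the optoisolator pulls the pin low when the switch closes:<br />
<br />
<pre># Hypothetical sketch of reading a bumper switch on a GPIO input pin.
# Assumes the optoisolator pulls the pin LOW when the switch is pressed;
# the Pi's internal pull-up keeps it HIGH otherwise.
import RPi.GPIO as GPIO

BUMPER_PIN = 21    # hypothetical BCM pin

GPIO.setmode(GPIO.BCM)
GPIO.setup(BUMPER_PIN, GPIO.IN, pull_up_down=GPIO.PUD_UP)

def bumper_pressed():
    return GPIO.input(BUMPER_PIN) == GPIO.LOW

# Or block until the switch closes instead of polling in a loop:
GPIO.wait_for_edge(BUMPER_PIN, GPIO.FALLING)
print("Froggy bumped into something!")
</pre>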
<br />
What I really want to do is to figure out how to set up a web server on the RPi so that I can communicate with it through WiFi and the Internet. If I manage to do that, the RPi will just ride on Froggy with a portable power supply - no monitor, keyboard, or mouse required - and I could control the robot from a remote computer or perhaps my phone. We'll see if I ever get around to doing that!<br />
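<br />
If I (or a reader) ever tackle that, one plausible approach would be a tiny web framework like Flask serving a few URLs that call the same relay-control functions. This is purely a hypothetical sketch - I have not built it - and the set_relay() helper is just a stand-in for whatever actually drives the relays:<br />
<br />
<pre># Hypothetical sketch (not something I have built): a minimal Flask web server
# that could drive the robot over WiFi by calling relay-control functions.
from flask import Flask

app = Flask(__name__)

def set_relay(index, on):
    # Placeholder so the sketch runs anywhere; on the robot this would be a
    # GPIO-based function like the one sketched earlier.
    print("relay", index, "on" if on else "off")

@app.route("/forward")
def go_forward():
    set_relay(0, True)     # which relays mean "forward" depends on the wiring
    set_relay(1, True)
    return "driving forward"

@app.route("/stop")
def stop():
    for i in range(5):
        set_relay(i, False)
    return "stopped"

if __name__ == "__main__":
    # Listen on all interfaces so a phone or laptop on the same network can connect.
    app.run(host="0.0.0.0", port=8080)
</pre>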
<br />
<h2>
Appendix</h2>
To run Froggy's onboard 5 V electronics, I just bought a little MC78LXXA 5 volt, 0.1 A positive voltage regulator (TO-92 package). It was super-simple to hook it up. Here's a diagram of a different 5 V voltage regulator, but the concept is the same.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://www.electroschematics.com/wp-content/uploads/2013/03/ldo-voltage-regulator-mcp1755.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://www.electroschematics.com/wp-content/uploads/2013/03/ldo-voltage-regulator-mcp1755.jpg" data-original-height="195" data-original-width="466" height="133" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<span style="font-size: xx-small;">image from <a href="https://www.electroschematics.com/8475/ldo-5v-voltage-regulator-with-mcp1755/">https://www.electroschematics.com/8475/ldo-5v-voltage-regulator-with-mcp1755/</a></span></div>
<br />
I bought a D cell holder that would hold 4 cells in series. At 1.5 V per cell, that's 6 volts. I connected the negative end of the cell holder to the ground pin of the voltage regulator and the positive end to the "+5.5V ... 16V" connection at the left side of the diagram. The "+5V" connection at the right serves as the regulated 5 V supply (with a common ground to the negative end of the battery holder). My data sheet says "Bypass Capacitors are recommended for optimum stability and transient response and should be located as close as possible." I actually just left them out and got away with it, although it probably would have been better to put them in.<br />
<br />
The MC78LXXA is rated for an output of 100 mA. I suspect that when I was driving all 5 of the relays, I might have gone over that, but it was always able to run the circuitry anyway. When the robot is at rest with no energized relays, I don't think that it is drawing more than about 10 mA. If you wanted to use the 4 D cell system to provide power to the 8 relay module, I think you could just use a 5 volt regulator with a higher current output rating. For example, I googled and found a μA7805CKC regulator in a TO-220 package that is rated at 1.5 A. That would easily provide the 500 mA that I estimated was required when all 8 relays on the module were energized, with 1000 mA to spare.<br />
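<br />
The arithmetic behind that is simple enough to jot down. Working backwards, the 500 mA estimate comes out to roughly 60 mA per energized relay coil (an assumed figure, not a measurement):<br />
<br />
<pre># Back-of-the-envelope current budget (rough estimates, not measurements).
relay_coil_mA = 62             # assumed draw per energized relay coil plus indicator LED
relays = 8
load_mA = relay_coil_mA * relays           # roughly 500 mA with all eight relays energized

regulator_rating_mA = 1500     # uA7805CKC in a TO-220 package
headroom_mA = regulator_rating_mA - load_mA

print("load:", load_mA, "mA   headroom:", headroom_mA, "mA")   # roughly 1000 mA to spare
</pre>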