Module 4 Assignment Getting Data

Getting Data in Module 4

For this week, what I struggled with the most was the syntax of the language. I frequently got stuck on so many stupid, easy typos. I wrote ('items') rather than ['items'] in the API Requests assignment, I tried to run a function with text.strip rather than text.strip() in the Webscraping assignment, and I wrote placelist.txt instead of 'placelist.txt' in the Geocoding assignment. That last one took me the longest because I had no clue why Colab was telling me placelist.txt was not defined since it obviously was! It took me an embarrassing amount of time before I finally copied Prof. Kane’s code at the bottom and figured it out: without the quotes, Python reads placelist as a variable name rather than a file name. I liked concluding with the Gender Inference assignment. The way it incorporated all the functions of the previous assignments was helpful to me.
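To make those three slips concrete, here is a minimal sketch in Python. The dictionary and string are made-up stand-ins for the assignment data, not the actual values from the notebooks.

```python
# Hypothetical stand-ins for the assignment data
response = {"items": ["a", "b", "c"]}
text = "  hello  "

# 1. Square brackets look up a key; parentheses try to call the dict like a function.
items = response["items"]
try:
    response("items")
except TypeError as err:
    call_error = str(err)      # message says the dict is not callable

# 2. Without parentheses you get the method itself, not its result.
method_only = text.strip       # a method object, nothing has run yet
stripped = text.strip()        # the cleaned-up string

# 3. Without quotes, Python looks for a variable named placelist.
try:
    filename = placelist.txt
except NameError as err:
    name_error = str(err)      # message says the name is not defined
filename = "placelist.txt"     # with quotes it is just a string
```

None of these are reported as typos by Python; each one is valid syntax that means something different, which is probably why they are so hard to spot.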

A concept that will help me in my own research is identifying the range of pages to request. At first it took me some time to figure out why Prof. Kane used 31 pages as her range for a list of 610 results, until I saw that there were 20 results per page: 610 divided by 20 is 30.5, so the last few results spill onto a 31st page. I’ll be using lots of site files and catalogue records to research material collections of trade assemblages, an endeavor that will require pulling lots of pages from many databases (“data dumps”, as one of my anthro professors called them).
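The page arithmetic can be sketched in a few lines of Python. The variable names are my own, and I am assuming the pages are numbered starting from 1; the assignment code may have set up its range differently.

```python
import math

total_results = 610   # number of results in the list
per_page = 20         # results shown on each page

# Round up so the leftover results still get a page of their own.
pages = math.ceil(total_results / per_page)

# range() stops before its end value, so add 1 to include the last page.
page_numbers = list(range(1, pages + 1))
```

Rounding up instead of hard-coding 31 means the same lines still work when a different database returns a different number of results.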

The US National Archives has a Flickr API that makes more than 16,000 historical photographs, maps, newspaper pages, and other documents publicly available. Each Flickr post contains an image and other data variables such as Production Date, Series, Creator, and an Identification Number. I found the link to the Flickr account on this National Archives webpage, but I’m having trouble locating the link to the actual API data. It looks like whoever runs this Flickr account pulls individual records of images from the API and uploads them here with all the data each record contains. In order to access the API you’d probably have to do some digging to find the contact person who manages the Flickr and would know how to grant permission.

I think it would be cool to fetch the Production Date data and compare the time periods of the images to see which years are better represented than others. According to the description, the National Archives’ holdings date back to 1775, but I’m wondering how many images they actually uploaded from the eighteenth century. To use an archaeological term, this could be similar to taphonomic bias – since younger materials preserve better, there’s more imagery to work with from later time periods. This causes researchers to focus more on recent records as opposed to the earlier ones.
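A rough sketch of that comparison in Python might look like the following. I have not seen how the Production Date field is actually formatted, so this assumes each value is a string with a four-digit year somewhere in it. The function name and the sample values are made up for illustration and are not real records.

```python
import re
from collections import Counter

def century_counts(production_dates):
    """Tally records by century, based on the first four-digit year found."""
    counts = Counter()
    for date in production_dates:
        match = re.search(r"\b(1[5-9]\d{2}|20\d{2})\b", date)
        if match is None:
            counts["unknown"] += 1        # no usable year in this value
            continue
        year = int(match.group())
        century = (year - 1) // 100 + 1   # e.g. 1776 falls in the 18th century
        counts[century] += 1
    return counts

# Made-up values, only to show the kinds of formats this handles
sample = ["1776", "ca. 1863", "1942-05-01", "07/04/1918", "undated"]
result = century_counts(sample)
```

Keeping an "unknown" bucket matters here, since undated records are part of the same bias question and would otherwise drop out of the comparison without notice.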

One reply on “Getting Data in Module 4”

No need to contact anyone–Flickr as a service has an API for the whole site, and individual account holders like the National Archives only toggle whether their images are available through it. (So the Natl Archives has an account here, not an API of its own.)

And yes, those fiddly typos are a common tripping point–you know the vocabulary, but the grammar and punctuation are much harder 🙂

The recency bias is a huge problem for digitization–archival preservation gets worse the further back you go, and print text/mass production images in the 19th and 20th centuries are easier to digitize and better funded for various reasons. There’s also the annoying problem of institutions sometimes cataloging digital images with the “creation” date of when the image was scanned, though that’s less of a problem in recent years.

Keep me in the loop when you start sifting through collections databases; I’m interested to see what you’ll end up working with. If you have access to something like that now, it could make for an interesting final project this semester.

Sent you access requests for the assignments.
