4: Getting Data – 2021 Practicum in Digital History

How do we find data?

For this class, I’ve curated a collection of datasets that I know will work well for the kinds of questions historians ask for our final projects. But once you leave class, you won’t always have nicely curated data at your fingertips! Sometimes you’ll have to find it on your own in the wild. If your career takes you to a museum or archive, it’s also helpful to know how to work with collections data in bulk and why researchers might want to access your collections programmatically.

So this week we will:

Learn about APIs and how to get data from them
Learn how to scrape data from sites that don’t have an API
Learn how to extend existing data by fetching geographical or gender information
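To make the first item concrete, here is a minimal sketch of what "getting data from an API" usually involves: building a URL with query parameters, downloading it, and picking fields out of the JSON that comes back. The endpoint, the parameter names, and the shape of the response are all assumptions made up for illustration; every real API documents its own.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint, used only for illustration; substitute a real API's URL.
BASE_URL = "https://example.org/api/search"


def build_url(base_url, **params):
    """Attach query parameters to a base URL, e.g. ?q=maps&page=2."""
    return f"{base_url}?{urlencode(params)}"


def fetch_json(url):
    """Download a URL and decode its JSON body into Python objects."""
    with urlopen(url) as response:
        return json.load(response)


def titles(payload):
    """Pull the 'title' of each record out of a decoded response.

    Assumes the response looks like {"results": [{"title": ...}, ...]};
    real APIs name these fields differently.
    """
    return [record["title"] for record in payload.get("results", [])]


# A made-up response body, standing in for what fetch_json() might return.
sample_body = '{"results": [{"title": "Map of Albany"}, {"title": "Letter, 1756"}]}'
sample_payload = json.loads(sample_body)

url = build_url(BASE_URL, q="fort ticonderoga", page=1)
print(url)
print(titles(sample_payload))
```

fetch_json() is defined but not called here, because the example endpoint does not exist. With a real API you would pass the URL from build_url() to it and hand the result to a function like titles().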

Start	Meet	Post
Mar 1	Mar 3	Mar 5

Reading
Watch
Wednesday Agenda
Assignment: Data Cleaning
Tasks

Reading

No reading this week!

Watch

After watching the video above, you may be asking why I included it. Historians aren’t journalists, and we rarely put together our own datasets. However, every time I teach this class I get students who want to transcribe or otherwise assemble their own data for the final project. You should not do this, and I hope the video helps explain why: putting together a good, well-made dataset is incredibly time-consuming.

This week we’re going to learn some basics of acquiring data that will extend our Python skills. Acquiring data that has already been digitized (meaning the text is already machine-readable) is relatively easy but still time-consuming, as you will find out when you curse me for the assignments this week. Acquiring data that is not digitized, even if it’s available in digital photos or images, is a giant pain in the ass.
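To give a sense of why machine-readable text is the easier case, here is a minimal sketch that pulls the rows out of an HTML table using only Python's standard library. The HTML is an invented stand-in for a downloaded page, and the class name TableRows is my own. It is not the method from this week's Colab assignments, and real pages are usually messier than this.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Made-up HTML standing in for a page you have already downloaded.
sample_html = """
<table>
  <tr><th>Name</th><th>Year</th></tr>
  <tr><td>Item A</td><td>1750</td></tr>
  <tr><td>Item B</td><td>1760</td></tr>
</table>
"""

parser = TableRows()
parser.feed(sample_html)
rows = parser.rows
print(rows)
```

None of this is possible with a photograph of the same table: there are no tags and no characters to read, only pixels.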

“Digitized” archival material that is only photographed or scanned is not truly digitized because it’s not machine-readable. Computers have a very hard time making meaning of scanned text. Optical Character Recognition (or OCR) is available to convert images of scanned text to machine-readable text (the kind you can copy and paste). If you encounter a scan of a page and you’re not sure it’s been OCR’d, try selecting some of the text and pasting it elsewhere–if you can’t select it, it’s not OCR’d. If you can select it but you can’t paste it elsewhere, it’s OCR’d but protected, like Google Books’ scans. At the scale most historians and small institutions need to OCR, Adobe Acrobat Pro ($300, available on campus) or DevonThink ($99, less for students and non-profits) are the best available software packages to OCR documents with a button click. Google Docs can OCR small images, but struggles with book-length documents.

However, even great OCR doesn’t work for all historical use cases. First, as of 2021, OCR only works on typeset documents, meaning that it doesn’t work on handwritten archival material or things like medieval manuscripts. There is one large project in the works that is getting increasingly accurate results for OCR of handwritten texts, but it requires a large “training set” of known, transcribed documents in the same hand to learn the handwriting style.

Second, OCR isn’t great for all typeset documents. Consider this PDF of the Fort Ticonderoga research library’s pre-digitized card catalog or the Papers of Sir William Johnson. Both are typeset, but the OCR for both is garbage because the image and typeset quality, though fairly standard for documents like these, is too poor for high-quality results. Copy and paste some of the text out of the page images to see how well the OCR’d text matches what you can read with your human eyeballs. Both were OCR’d using top-notch software, but their OCR’d text is unusable for research purposes.
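If you want a rough number to go with the eyeball test, one option is to type out a short passage yourself and compare it to the pasted OCR text. The sketch below uses difflib from Python's standard library. The two strings are invented for illustration and are not taken from either collection, and the score is only a crude similarity measure, not a formal OCR accuracy metric.

```python
from difflib import SequenceMatcher


def similarity(human_text, ocr_text):
    """Return a score from 0 to 1 for how closely two strings match,
    ignoring letter case and runs of whitespace."""

    def normalize(text):
        return " ".join(text.lower().split())

    matcher = SequenceMatcher(
        None, normalize(human_text), normalize(ocr_text), autojunk=False
    )
    return matcher.ratio()


# Invented strings for illustration, not taken from either collection.
typed_by_hand = "Received of the commissary ten barrels of flour"
from_ocr = "Rcceived of thc comm1ssary tcn barre1s of f1our"

score = similarity(typed_by_hand, from_ocr)
print(f"{score:.2f}")
```

A score of 1.0 means the two strings are identical after normalizing. Where "unusable" begins depends on what you plan to do with the text, so treat any cutoff as a judgment call.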

The gold standard for data input is human transcription, and many institutions have set up crowd-sourced volunteer transcription interfaces for that reason. For a long-term, institutional project or a long-term, book-length project, human transcription can make good sense because it gives the highest quality results.

Transcription is not a good idea for a semester-length project. I will not approve any final projects that require transcription. If you can’t show me a CSV of your data or where you will download/scrape it from after this week, you will have to use one of the datasets I’ve provided for your final project.
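Before you show anyone a CSV, it is worth a quick check that it actually loads and has the columns you expect. Here is a minimal sketch using the standard library's csv module. The function name summarize_csv and the sample data are made up for illustration, and the file name in the closing note is a placeholder.

```python
import csv
import io


def is_blank(value):
    """True for a missing value or a string that is empty or only whitespace."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def summarize_csv(handle):
    """Return (column names, number of rows, number of rows with a blank value)."""
    reader = csv.DictReader(handle)
    total = 0
    incomplete = 0
    for row in reader:
        total += 1
        if any(is_blank(value) for value in row.values()):
            incomplete += 1
    return reader.fieldnames, total, incomplete


# Invented sample data standing in for a real file.
sample = io.StringIO("name,year,place\nItem A,1750,Albany\nItem B,,Boston\n")

columns, total, incomplete = summarize_csv(sample)
print(columns, total, incomplete)
```

To run it on a real file, open the file and pass the handle instead of the sample, for example with open("my_data.csv", newline="", encoding="utf-8") as handle, where my_data.csv is a placeholder name.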

Wednesday Agenda

During our Wednesday meeting we’ll troubleshoot our many, many assignments this week. Get ready to stretch your Python skills!

This week’s assignments are fairly long (which is why there is no reading). Be sure to set aside as much time as you would usually spend on reading plus class time to work through them.

Watch: Some Advice

This week’s assignments are hard!

Tasks

Complete the API Request Colabs assignment
Complete the Web Scraping Colabs assignment
Complete the Georeferencing Colabs assignment
Complete the Gender Inference assignment
When you’re done, you should have 4 links to share in a post here on the course blog: one link each to your API, Web Scraping, Georeferencing, and Gender Colabs assignments. Tag your post with the Getting Data category. Link your assignments in the body of your post, and briefly discuss the following:
1. a concept you found difficult,
2. a concept you think will be helpful in your own research, and
3. a web resource (such as a museum/archive website or API) that you think would be interesting to fetch data from; link to it in your post. (If you find something that we can easily pull data from, this is a good time to identify something you’re interested in working with for your final project!)
Your write-up only needs to be 200-300 words.

If you get stuck, don’t suffer in silence. Ask on Slack or check out others’ posts on the course site if you need help with something.

Watch: Data Joins

There’s no corresponding assignment for this video, but I provide it so that you can see what you might do with your data after fetching it from a couple of different sources. You may want to refer back to this as you work on your final project later in the semester.
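As a rough idea of what a join does (this is not necessarily the approach the video takes), the sketch below matches rows from two sources on a shared column using plain Python dictionaries. The function name left_join, the sample records, and the approximate coordinates are all made up for illustration. In practice a library such as pandas is the more common tool for this.

```python
def left_join(left_rows, right_rows, key):
    """Attach matching right-hand fields to each left-hand row.

    Left rows with no match are kept unchanged. If the right-hand data
    repeats a key, the last row with that key is the one used.
    """
    lookup = {row[key]: row for row in right_rows}
    joined = []
    for row in left_rows:
        merged = dict(row)
        for field, value in lookup.get(row[key], {}).items():
            # Keep the left-hand value if both sides share a field name.
            merged.setdefault(field, value)
        joined.append(merged)
    return joined


# Invented records: one list of people, one list of places with rough coordinates.
people = [
    {"name": "A. Smith", "town": "Albany"},
    {"name": "B. Jones", "town": "Boston"},
    {"name": "C. Lee", "town": "Unknown"},
]
places = [
    {"town": "Albany", "lat": 42.65, "lon": -73.76},
    {"town": "Boston", "lat": 42.36, "lon": -71.06},
]

joined = left_join(people, places, "town")
for row in joined:
    print(row)
```

The row for "Unknown" comes through without coordinates. That is the behavior to watch for when you combine sources: unmatched rows should stay visible rather than silently disappear.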