How do we find data?
For this class, I’ve curated a collection of datasets that I know will work well for the kinds of questions historians ask for our final projects. But once you leave class, you won’t always have nicely curated data at your fingertips! Sometimes you’ll have to find it on your own in the wild. If your career takes you to a museum or archive, it’s also helpful to know how to work with collections data in bulk and why researchers might want to access your collections programmatically.
So this week we will:
- Learn about APIs and how to get data from them
- Learn how to scrape data that doesn’t have an API
- Learn how to extend existing data by fetching geographical or gender information
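To give a rough sense of what "getting data from an API" looks like in Python, here is a minimal sketch using only the standard library. The endpoint URL, the query parameters, and the "results" key are all made up for illustration; every real API names these differently, so check its documentation. The sample response is also invented so the example runs without a network connection.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint, used only for illustration; swap in a real API's URL.
BASE_URL = "https://example.org/api/search"


def build_url(base_url, **params):
    """Attach query parameters (search terms, page numbers) to an API URL."""
    return f"{base_url}?{urlencode(params)}"


def fetch_json(url):
    """Download a URL and parse the JSON response into Python objects."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_rows(payload, fields):
    """Keep only the chosen fields from each record in an API response.

    Assumes the records live under a "results" key, which varies by API.
    """
    return [
        {field: record.get(field) for field in fields}
        for record in payload.get("results", [])
    ]


# A made-up response standing in for what fetch_json(url) might return.
sample = json.loads(
    '{"results": [{"title": "Letter to Johnson", "year": 1756, "id": 1}]}'
)

url = build_url(BASE_URL, q="ticonderoga", page=1)
rows = extract_rows(sample, ["title", "year"])
print(url)
print(rows)
```

With a real API you would call `fetch_json(url)` instead of using `sample`, then pass the result to `extract_rows` with the field names that API actually returns.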
No reading this week!
After watching the video above, you may be asking why I included it. Historians aren’t journalists and we rarely put together our own datasets. However, every time I teach this class I get students who want to transcribe or otherwise put together their own data for the final project. You should not do this and I hope the video above helps explain why. Putting together a good, well-made dataset is incredibly time-consuming.
This week we’re going to learn some basics of acquiring data that will extend our Python skills. Acquiring data that has already been digitized (meaning the text is already machine-readable) is relatively easy but still time-consuming, as you will find out when you curse me for the assignments this week. Acquiring data that is not digitized, even if it’s available in digital photos or images, is a giant pain in the ass.
“Digitized” archival material that is only photographed or scanned is not truly digitized because it’s not machine-readable. Computers have a very hard time making meaning of scanned text. Optical Character Recognition (or OCR) is available to convert images of scanned text to machine-readable text (the kind you can copy and paste). At the scale most historians and small institutions need, Adobe Acrobat Pro ($300, available on campus) or DevonThink ($99, less for students and non-profits) are the best available software packages. Google Docs can OCR small images, but struggles with book-length documents.
However, even great OCR doesn’t work for all historical use cases. First, as of 2021, OCR only works on typeset documents, meaning that it doesn’t work on handwritten archival material or things like medieval manuscripts. There is one large project in the works that is getting increasingly accurate results for OCR of handwritten texts, but it requires a large “training set” of known, transcribed documents in the same hand to learn the handwriting style.
Second, OCR isn’t great for all typeset documents. Consider this PDF of the Fort Ticonderoga research library’s pre-digitized card catalog or the Papers of Sir William Johnson. Both are typeset, but the OCR for both is garbage because the image and typeset quality, which is fairly standard for this kind of material, is too poor for high-quality results. Copy and paste some of the text out of the page images to see how well the OCR’d text matches what you can read with your human eyeballs. Both were OCR’d using top-notch software, but their OCR’d text is unusable for research purposes.
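If you want to put a rough number on how far OCR’d text drifts from what you can read yourself, one option is a quick comparison with Python’s built-in `difflib` module. This is only a sketch: the function name is mine, the "OCR" string below contains errors I invented for illustration, and a similarity ratio is a crude measure, not a formal OCR accuracy metric.

```python
from difflib import SequenceMatcher


def ocr_similarity(human_text, ocr_text):
    """Return a score from 0 to 1 comparing a human reading to OCR output."""

    def normalize(text):
        # Lowercase and collapse whitespace so line breaks and
        # capitalization don't dominate the comparison.
        return " ".join(text.lower().split())

    matcher = SequenceMatcher(None, normalize(human_text), normalize(ocr_text))
    return matcher.ratio()


# What you read with your own eyes vs. what you pasted out of the PDF.
human = "Papers of Sir William Johnson"
ocr = "Papcrs of Sir Wil1iam Johnfon"  # invented OCR errors, for illustration

score = ocr_similarity(human, ocr)
print(round(score, 2))
```

A score of 1.0 means the two strings match exactly after normalization; the lower the score, the more cleanup the OCR’d text would need before you could use it.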
The gold standard for data input is human transcription, and many institutions have set up crowd-sourced volunteer transcription interfaces for that reason. For a long-term, institutional project or a long-term, book-length project, human transcription can make good sense because it gives the highest quality results.
Transcription is not a good idea for a semester-length project. I will not approve any final projects that require transcription. If you can’t show me a CSV of your data or where you will download/scrape it from after this week, you will have to use one of the datasets I’ve provided for your final project.
During our Wednesday meeting we’ll troubleshoot our many, many assignments this week. Get ready to stretch your Python skills!
This week’s assignments are fairly long (which is why there is no reading). Be sure to set aside as much time as you would usually spend on reading plus class time to work through them.
- Complete the API Request Colabs assignment
- Complete the Web Scraping Colabs assignment
- Complete the Georeferencing Colabs assignment
- Complete the Gender Inference assignment
- When you’re done, you should have 4 links to share in the comments below: one link each to your API, Web Scraping, Georeferencing, and Gender Colabs assignments. Share them using a pretty link <a href>