Categories
Module 4 Assignment Getting Data

Module 4 Assignments

This week’s assignments had me staring at my computer screen for hours hoping the mysteries of Python would reveal themselves to me.

In all honesty, it wasn’t that bad. After awhile, it was fun, like a puzzle that needed solving. The biggest thing I need to remember is to slow down. Unlike the task of reading large quantities of pages per week, which requires skimming skills, reading each. piece. of. code. makes life so much easier. One misplaced quotation mark stalled me for 10 minute while I tried to figure out what was not working.

The Genderize resource is going to be really helpful in my research. Women’s names appear so infrequently in the documents I’m dealing with, that being able to spot one will be important–especially since I’m dealing with some Dutch and Palatine names that can be unclear if it is a male or female name (like ‘Asa,’ I didn’t know that was likely a male name.)

I use Newspapers.com all the time, especially the Poughkeepsie Journal, and I was wondering, can I scrape search results? For instance, if I wanted a file with all the instances of the search term “eloped” or “runaway“?

Assignment Links

API Requests

Webscraping 

Geocoding 

Gender Inference

One reply on “Module 4 Assignments”

Newspapers.com is an Ancestry product, so it annoyingly does not allow programmatic access and does not have a public API, because Ancestry’s profit model depends on charging for restricted access, and bulk programmatic access could cut into that. One of the things that annoys me about this is that an increasing number of institutions like the New York State Archives contract with Ancestry to digitize their collections, and they’re nominally free to view, but Ancestry doesn’t allow any research downloads of those collections, meaning that they are effectively not digitized despite being full text searchable.

You would be able to do that kind of programmatic text search on Chronicling America, but only for the papers they have digitized, which doesn’t look like it includes Poughkeepsie Journal. The New York Times has its own historic API, and the NYPL has a metadata API for their digital collections, but a quick glance through their Poughkeepsie collections looks likes they mainly have images.

Assignments look good, and good point about the different reading styles required–it’s much closer to close reading a primary document than it is to skimming secondary work! 🙂

Comments are closed.