
Module 9: Work

I’m using the “What’s On the Menu?” data, compiled by the New York Public Library. This is an archival dataset of many thousands of restaurant, hotel, transportation, and special occasion menus from around the world, spanning the 1850s to the 2010s. The NYPL is crowdsourcing the transcription, which means that any of us can go edit the data to help digitize these menus. This is great, and makes digitization possible, but at the same time… there are a lot of data issues.

There are four datasets: one containing menu information (year, venue, etc.), one for menu pages (detailing each specific page of each menu), one for menu items (each item on each menu page), and one for dishes (naming the specific dishes that appear on menus). See data critique below!

Data Cleaning:

Wrangling the data took me a very long time. I first had to figure out how to link the datasets. Since I’m hoping to look at the popularity of individual dishes across decades, I needed a path from the menu dataset (which contains the date) to the dish dataset. I cleaned each spreadsheet slightly in Excel, renaming the ‘id’ variable in each to match what it’s called in the other sheets. For example, MenuPage.csv has both ‘id’ and ‘menu_id’; the ‘menu_id’ identifier aligns with a column in Menu.csv, but in that sheet it’s just called ‘id’.
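For anyone who’d rather do that renaming in code, here’s a rough R equivalent of the Excel step (file paths and column names are assumptions based on the files described above):

```r
library(readr)
library(dplyr)

# Read the raw CSVs (paths are placeholders)
menu      <- read_csv("Menu.csv")
menu_page <- read_csv("MenuPage.csv")

# MenuPage.csv's 'menu_id' points at Menu.csv's 'id';
# rename Menu.csv's 'id' so the join keys share a name.
menu <- menu %>% rename(menu_id = id)
```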

I used Excel to filter the Menu.csv file to only include menus published from 1855 to 1965. No real reason for those dates; it just sounded like a nice century block that aligns with a lot of immigration and urbanization in the United States. I checked outliers against the NYPL website and corrected them. For example, a date of ‘Jan 01 0001’ is obviously a transcription error, and looking at the actual scanned menu, I could see that the date is Oct 31, 1912. I created a ‘year’ column from ‘date’, since I’m not concerned with days or months.
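I did this step in Excel, but the same filtering could be done in R; a minimal sketch, assuming the ‘date’ column is in a year-month-day format that as.Date can parse:

```r
library(dplyr)
library(lubridate)

# Pull a 'year' out of 'date' and keep the 1855-1965 window.
menu <- menu %>%
  mutate(year = year(as.Date(date))) %>%
  filter(year >= 1855, year <= 1965)
```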

I brought this filtered file into OpenRefine, where I used text faceting and filtering to identify menus from New York State. I probably could’ve written code to do this, but it would be less reliable. For example, if I wrote Python code to flag rows with “NY” or “New York” in the ‘place’ column, it would miss things like “66th and Broadway” or “Hotel Savoy”; with my human brain, I know those are places in NYC. I faceted first to identify any big clusters, then hand-searched for variations, like typos (“New Yok”) and smaller cities in New York State (“Albany”, “Buffalo”, etc.).

I then moved the dataset into RStudio (an IDE for the R language, which I’m more familiar with than Python), and noticed that, for some reason, there were no records after 1940. I realized this is because the dataset’s conventions shifted: for the twentieth century, locations were more often listed in the ‘venue_name’ variable. I went back to OpenRefine and repeated the search-and-facet process with the ‘venue_name’ and ‘notes’ columns. ‘venue_name’ was tedious and involved a lot of Googling to figure out where various nineteenth-century hotels were located.
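The gap was easy to spot with a quick count of menus per year; something like this, assuming the ‘menu’ data frame from above:

```r
library(dplyr)

# How many menus per year? An empty stretch after 1940
# was the tip-off that locations were recorded differently.
menu %>%
  count(year) %>%
  arrange(year) %>%
  print(n = Inf)
```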

In RStudio, I left-joined the cleaned and filtered Menu.csv with MenuPage.csv, then left-joined MenuPage.csv with MenuItem.csv, and finally MenuItem.csv with Dish.csv. This leaves a dataset where each row is a single menu item, with information on its dish, menu, location, and year.
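Roughly, the join chain looks like this in dplyr (the key and column names assume the renaming described above; the exact columns in my files may differ slightly):

```r
library(readr)
library(dplyr)

menu_item <- read_csv("MenuItem.csv")
dish      <- read_csv("Dish.csv")

# Rename each table's 'id' to match the foreign key used elsewhere,
# the same renaming I did by hand in Excel.
menu_page <- menu_page %>% rename(menu_page_id = id)
dish      <- dish %>% rename(dish_id = id)

# menu -> page -> item -> dish: one row per menu item.
# Selecting only the columns I need avoids name clashes between files.
combined <- menu %>%
  select(menu_id, year, place) %>%
  left_join(select(menu_page, menu_page_id, menu_id), by = "menu_id") %>%
  left_join(select(menu_item, menu_page_id, dish_id), by = "menu_page_id") %>%
  left_join(select(dish, dish_id, name), by = "dish_id")
```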

I dropped unnecessary variables and created a ‘decade’ variable, then built a summary dataset containing four variables: decade, dish_id, name, and count. This gives me a count of how many times any given dish appears within a decade.
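In dplyr terms, something like this (again a sketch; ‘name’ here is the dish name carried over from Dish.csv):

```r
library(dplyr)

# One row per (decade, dish): how many times each dish appears.
dish_counts <- combined %>%
  mutate(decade = floor(year / 10) * 10) %>%
  group_by(decade, dish_id, name) %>%
  summarise(count = n(), .groups = "drop")
```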

My next step is to go back into OpenRefine and start faceting the dish names, where I’ll hopefully find some interesting trends to write up for this project!

Extending the Data:

I think my original idea may shift after I look through the dishes, but the plan was to compare the popularity of various dishes in each decade with immigration trends. To do that, I’d use US Census data and the American Community Survey, which are publicly available for download.

Visualizations:

We’ll see how the last bit of data cleaning goes. I think I’ll use some maps, for sure, and probably line graphs showing the popularity of certain dishes over time.
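For the line graphs, I’m imagining something like this in ggplot2 (the dish names below are placeholders, not results):

```r
library(dplyr)
library(ggplot2)

# Appearances per decade for a handful of dishes,
# using the dish_counts summary built above.
dishes_of_interest <- c("Consomme", "Chicken Salad", "Lobster Newburg")

dish_counts %>%
  filter(name %in% dishes_of_interest) %>%
  ggplot(aes(x = decade, y = count, colour = name)) +
  geom_line() +
  labs(x = "Decade", y = "Menu appearances", colour = "Dish")
```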

3 replies on “Module 9: Work”

Looks good. When you get a chance, put your cleaned data files up somewhere so I can see how you restructured things. I assume from your visualizations that you’re planning on using Tableau? Have you merged everything into a single sheet, or are you going to use Tableau’s joins and relationships? I don’t think it matters either way, but I want to be clear for my own reference.

What’s the geographic extent of the data, from what you’ve seen? From what I recall, there’s not much outside of NY, but it’s been a bit since I looked through it in detail. Did you sort out the years/venue issue so that you have data after 1940 now?

From what I recall of the NYPL project, they did a first pass of machine OCR of the menu scans, and then the crowdsource project is to correct the OCR, so that might explain some of the stranger errors like 0001 for 1912. Doesn’t change much for you, just an interesting project process note.

I was planning to make visualizations in R, since that’s where I ended up doing the bulk of my analysis, but I can try Tableau out as well! I combined everything using R; I’ll upload the combined dataset and my code and link it for you when I’m back home tomorrow.

Yes, I did end up figuring out the year issue, so I now have the complete 1850-1959 time block. I’m not sure about the full geographic scope, but I came across some European menus, as well as some in Chicago, Philadelphia, and New Orleans, at least. Since I revised my research plan (as I just posted in my Week 11 update!), I may end up using the entire dataset anyway, instead of the NY-limited dataset I put together. I already have the code written, so I’m going to rerun it and see if the results are wildly different when I don’t limit geographically. TBH, I’m a little concerned that my limited set may miss a lot of NY menus, since the methodology wasn’t exactly perfect.

That sounds like a fine plan; just make sure to note in your write-up what your geographic scope is and whether you checked for any geographic differences (it’s fine if you don’t see any, but it’s often something that editors ask when you move to publication).
