I’m using the “What’s On the Menu?” data, compiled by the New York Public Library. This is an archival dataset of thousands and thousands of restaurant, hotel, transportation, and special occasion menus from around the word, spanning from the 1850s to the 2010s. The NYPL is crowdsourcing the transcription, which means that any of us can go edit the data to try to help digitize these menus — this is great, and makes digitization possible, but at the same time… there are a lot of data issues.
There are four datasets – one that contains menu information (year, venue, etc), one for menu pages (detailing each specific page of each menu), one for menu items (each item on each menu page), and then dishes (naming specific dishes on menus). See data critique below!
Wrangling the data took me a very long time. I first had to figure out how to link the datasets — I’m hoping to look at the popularity of individual dishes in different decades, so needed to figure out how to get from the menu dataset (which contained the date) to the dish dataset. I cleaned each spreadsheet in Excel slightly, changing the name of the ‘id’ variable in each to match what it’s called in other sheets — for example, MenuPage.csv has ‘id’ and ‘menu_id’. The ‘menu_id’ identifier aligns with a column in Menu.csv, but in that sheet, it’s just called ‘id’.
I used Excel to filter the Menu.csv file to only include menus published from 1855-1965. No real reason for those dates — just sounded like a nice century block that aligns with a lot of immigration/urbanization in the United States. I checked outliers with the NYPL website and corrected them — for example, a date of ‘Jan 01 0001’ is obviously a transcription error, so looking at the actual scanned menu, I could see that the date is Oct 31 1912. I created a ‘year’ column from ‘date’, since I’m not concerned with days or months.
I brought this filtered file into OpenRefine, where I used text faceting and filtering to identify menus from New York State. I probably could’ve written code to do this, but it would be less reliable — for example, if I wrote Python code to identify rows with “NY” or “New York” in the ‘place’ column, it would miss things like “66th and Broadway” or “Hotel Savoy”, but with my human brain, I know those are places in NYC. I faceted first to identify any big clusters, then hand searched for variations like typos like “New Yok” and for smaller cities in NY (“Albany”, “Buffalo”, etc).
I then moved the dataset into RStudio (a statistical software similar to Python that I’m more familiar with), and noticed that for some reason, there were no records after 1940. I realized this is because the dataset shifted, and for the twentieth century, locations were more often listed in the ‘venue_name’ variable. I went back to OpenRefine and repeated the search and facet process with the ‘venue_name’ and ‘notes’ columns — ‘venue_name’ was tedious and involved a lot of Googling to figure out where various nineteenth century hotels were located.
In RStudio, I left-joined the cleaned and filtered Menu.csv with MenuPage.csv, then left-joined MenuPage.csv with MenuItem.csv, and finally MenuItem.csv with Dish.csv. This leaves a dataset where each row is a specific menu item, with information on its specific dish, menu, location, and year.
I dropped unnecessary variables and created a ‘Decade’ variable, and then created a summary dataset, containing 4 variables: decade, dish_id, name, and count. This gives me a count of how many times any given dish appears within a decade.
My next step is to go back into OpenRefine and start faceting the dish names, where I’ll hopefully find some interesting trends to write up for this project!
Extending the Data:
I think that my original idea may shift after looking through the dishes, but my original idea was to compare popularity of various dishes each decade with immigration trends. To do that, I’d use US Census Data / the American Community Survey, which is publicly available for download.
We’ll see how the last bit of data cleaning goes. I think I’ll use some maps, for sure, and also probably line graphs with timelines, showing the popularity of certain dishes over time.