Categories: Module Page

2: Humanities, Digitally

Where is history on the web and how’d it get there?

This week we’re thinking about the pieces that make up the internet, and how to put stuff on it.

This week we will:

  • Think about what mass digitization means for the historical profession
  • Think about what we can and can’t know as historians and why
  • Learn how to “read” data
Start: Feb 15 | Met: Feb 17 | Post: Feb 19

Module Outline

Wednesday agenda

During our Wednesday meeting, we’ll troubleshoot the Data Cleaning assignment and answer questions about the Data Critique assignment listed below in Tasks.

Reading

Everyone

Discussion Starter

The discussion starter video for this module is on Blackboard under the title “Module 2: Humanities Digitally.”

After reading all the materials and watching the discussion starter video, respond on the #module2 Slack channel using the 3CQ method: compliment, comment, connection, question.

Compliments should emphasize something you liked about what a discussion starter said or what the group discussed. Comments should reinforce, and then deepen, an idea they shared. Connections should tie what the discussion starters talked about to your own unique thought or reaction, extending the discussion to new ideas, examples, or concepts. Finally, questions should open up space for class discussion: they may question something the discussion starters discussed, raise a question based on the readings, complicate an existing example or idea, or direct us to think about something in the reading the discussion starters overlooked.

Remember that I’m not requiring you to respond to a certain number of classmates, but you will get as much out of this class as you put in; talking to one another will help you deepen your own understanding!

Tasks

  1. Review my email for this week if you missed it
  2. Read the assigned materials and respond in the #module2 Slack channel to the discussion starter video using the 3CQ method above.
  3. Do the Data Critique assignment below
  4. Post your data critique to the shared google sheet. You will need to fill out a separate row for every column in your dataset!
  5. Do the Data Cleaning assignment below
  6. In the comments of this post, use pretty links (<a href>) to post links to the CSV and JSON files on your GitHub

You may have noticed the past two weeks that I sometimes have very specific instructions for where and how to link your homework assignments. As we progress through the semester, we’ll start doing programming tasks or interacting with programs where steps need to be done in very particular orders that seem arbitrary at the time. I want you to get used to carefully following directions so that you build the skill of being precise in your work!

Watch: What is Data

Assignment: Data Critique

You’ve been assigned one or more data sets (see your assigned datasets here; download the actual data here). I’ve assigned everyone roughly the same amount of work; some datasets are larger or more complex than others. It may be helpful to review last week’s data filtering lesson and the Working with Data assignment. Filters will let you quickly see what kind of data is in a column without scrolling through the whole thing.

For this assignment, you will need to examine what information is in your dataset, what kind of events, people, or phenomena your dataset describes, and what it cannot describe.  Use a spreadsheet filter to get an idea of what kind of data you’re working with.  What’s the scope of your data temporally, geographically, in number of records, or in other dimensions?

What’s the “thing” that composes a row? Is a row a person, an event, an object, something else? What attributes are documented by the columns? Is there any kind of column missing that you might expect given the kind of “thing” the row documents? For example, if we have a row describing a person, it might be unusual that it doesn’t have a column for gender. If this dataset were your only source, what kind of information would be left out?

As best you’re able to determine, you should also describe how the data was generated, what the original sources were, how the data was collected, and how your data is divided. Some datasets have links to a research project that explains how it was collected; others don’t. 

To post your data critique, open the shared google sheet and fill out one row in the Dataset tab for each dataset, and one row per column of your datasets in the Row descriptions tab. You will need to fill out a separate row for every column in your dataset! See the example rows for how I did this assignment for the Albany Manumissions dataset in the shared google sheet.

To get the name of each column in your dataset into the shared google sheet, it may be helpful to select all the column names with cmd + shift + arrow left (Mac) / ctrl + shift + arrow left (PC), copy them, and paste them transposed (Edit > Paste special > Transposed) into the field names column. This will save you the time of typing all the field names by hand.
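If you’d rather script this step, the same result (one column name per line, ready to paste into the field names column) can be had outside the spreadsheet. A minimal Python sketch, assuming your dataset is saved locally as a CSV file (the path is up to you):

```python
import csv

def header_fields(path):
    """Return the column names from the first (header) row of a CSV file,
    one list entry per column."""
    with open(path, newline="", encoding="utf-8") as f:
        return next(csv.reader(f))

# Print one field name per line, ready to paste:
# for name in header_fields("dataset.csv"):
#     print(name)
```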

You will also need to get the count of how many records your dataset includes and the minimum and maximum for any numeric fields (including dates)! Remember our use of formulas last module: =COUNTA(range) will count the text entries in a column. Use a column like ID or record number that you’re sure has an entry in every cell; some columns may not have an entry in every cell, which will give you an inaccurate count! To get the smallest number, use =MIN(range), which returns the minimum in the searched range, and use =MAX(range) to get the largest. Since your dataset isn’t connected to the shared data critique sheet, you won’t be able to enter these formulas into the data critique sheet directly. But you can use the formulas in your dataset and then hand-type or paste special > values only the results into the data critique sheet.
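The spreadsheet formulas above have straightforward scripted equivalents if you prefer to work outside the sheet. A sketch, assuming a CSV dataset and a field name you pass in (both are placeholders, not part of the assignment):

```python
import csv

def column_stats(path, field):
    """Spreadsheet-style summary for one CSV column: COUNTA (count of
    non-blank cells), plus MIN and MAX over the entries that parse as
    numbers."""
    non_blank, numbers = [], []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            cell = (row.get(field) or "").strip()
            if cell:  # COUNTA: any non-blank entry counts
                non_blank.append(cell)
                try:
                    numbers.append(float(cell))  # MIN/MAX use numerics only
                except ValueError:
                    pass
    count = len(non_blank)
    lo = min(numbers) if numbers else None
    hi = max(numbers) if numbers else None
    return count, lo, hi
```

Unlike the spreadsheet route, the results here can be computed for every column in one pass, then hand-copied into the shared sheet the same way.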

If your dataset includes blank cells in a column you need a minimum for, =MIN() may return 0 instead of the real minimum (assuming no numbers smaller than zero). To get around this, you can use the formula =SMALL(range, 2), which gets the second smallest number in the range you’re searching (if you know the second smallest number is what you need). If you don’t know that the second smallest number is what you need, you can use =MINIFS(range, range, ">0") to search the range while excluding zeros (and anything smaller). The range is repeated twice because you can use this function to search one range and return the value from a different range. ">0" is a criterion and can be swapped for any other numeric check, like greater than 20 (">20"), equal to a certain cell ("="&A2), etc.
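The same blank-cell pitfall shows up in code if you coerce empty cells to 0 before taking a minimum. A small sketch of the fix, dropping blanks before computing anything (analogous to the MINIFS workaround above when every real value is positive):

```python
def safe_min(cells):
    """Minimum of the numeric entries in a column, ignoring blank cells
    entirely so that blanks can't masquerade as 0."""
    numbers = [float(c) for c in cells if str(c).strip()]
    return min(numbers) if numbers else None
```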

This assignment is adapted from Miriam Posner’s Data Critique.

Assignment: Data Cleaning

You will need to download and install OpenRefine for this assignment. Please email or ask on Slack if you run into difficulties with the installation.

The slides below will walk you through some major data cleaning steps and ask you to post your final data, but note that I often don’t tell you when to stop cleaning in these slides! That’s intentional. I don’t actually care if you get this dataset completely clean; for this assignment I just want to see if you understand the concept of cleaning data, sharing the final file, and extracting your steps (ie, documenting your methodology and making it reproducible). You could spend a very long time cleaning this particular dataset, but you don’t need to. Once you feel like you understand a concept, feel free to move on and leave the messy data behind you.
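OpenRefine is the tool for this assignment, but to make the concept concrete: two of the most common cleaning steps (trimming surrounding whitespace and collapsing repeated internal spaces, both one-click transforms in OpenRefine) look roughly like this in Python:

```python
def clean_value(s):
    """Trim leading/trailing whitespace and collapse internal runs of
    spaces, two basic cleanups OpenRefine offers as built-in transforms."""
    return " ".join(s.split())
```

The point of documenting steps like these, whether as extracted OpenRefine operations or as a script, is the same: anyone (including future you) can rerun them on the raw data and get the same result.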

Remember that with lesson slides, you can view the slides full screen and copy and paste text out of the slides to put in your project.

Watch: Data Cleaning

This is a short video of me cleaning some data for one of our later assignments. You don’t need to do the cleaning shown in this video, but it’ll give you a quick overview of how the cluster and merge function works.
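For the curious: the cluster function in the video groups values by a normalized “key,” so spellings that differ only in case, punctuation, spacing, or word order land in the same cluster. A rough sketch of that keying idea (simplified from what OpenRefine actually does):

```python
import string

def fingerprint(value):
    """Rough sketch of key-collision clustering: lowercase, strip
    punctuation, split into tokens, deduplicate and sort them, rejoin.
    Values that produce the same key cluster together for merging."""
    value = value.lower().strip()
    value = value.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(value.split())))
```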

34 replies on “2: Humanities, Digitally”

Looks like the problem you’re having with your pretty links is an extra quote mark at the end of the url; if you’re copy/pasting the <a href=""> bit, try hand-writing it and pasting the url in so you don’t end up with extra quote marks. (I mention this not to make an example of you, but because it’s a common problem 🙂 )

The files look good! What you did with join(cell, cell, '') is smart; I hadn’t seen that before.

Here’s my GitHub Repo with the cleaned .CSV and .JSON file.

OpenRefine is a cool software that I’ve never used before! The clustering tool is especially helpful — I think that’d be more difficult to do in the programs I’m used to, though I don’t have a ton of experience with qualitative data. I’m surprised this tool isn’t more well known!

Looks good! Yeah, OpenRefine is super. It has a command line interface if you’re comfortable with that. Similar clustering can be done in Python, but it can be a big pain in the ass. Not sure about R, but I’d be surprised if there wasn’t some package out there to handle it. It would fall under natural language processing if you’re familiar with any of those.

The files look good! Watch your typos: you’ve got an extra quote mark in your pretty links, and your json file is .jsn rather than .json. In the next couple of modules these little typos will trip you up a lot!
