{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data cleaning\n", "An example Jupyter notebook for hypothetical classroom use, as part of the *Programming Historian* lesson on Jupyter notebooks, by Quinn Dombrowski, Tassie Gniady, and David Kloster." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What's data cleaning?\n", "If you want to use a computer to do any sort of data analysis, you need to make sure that the data is *clean*: consistently formatted using conventions the computer can work with. For instance, if you're manually counting all the times cats are mentioned in a text, you probably won't think twice about including \"cat\", \"Cat\", \"cats\", \"Cat--\", \"cat.\" as part of your count. Depending on your exact research question, you might also include related words like \"kitty\", \"kitten\", or \"feline\".\n", "\n", "Having a person do this task could take a lot of time, depending on the length of the text in question, but you can reasonably expect that a person will take all this variation into account when counting, especially if you ask them to count \"all words referring to cats\". Computers don't have the same understanding of a text that a person does, nor the same understanding of the concept of a cat. It may be possible to get a computer to count similar things to what a human would count when given this task, but first you need to modify the text so that the computer doesn't get confused by minor variation (like capitalization or punctuation) that a human wouldn't necessarily give any thought to.\n", "\n", "Even when all your data is in a consistent format, some research questions may require you to modify that format. The following example comes from a research project on how *Harry Potter* fan fiction gets written across different languages and cultures." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The data\n", "The example data that we're working with was captured from the Italian fan fiction archive efpfanfic.net. Each line of the file has the rating (like a movie rating) for the fanfic, ranging from *verde* (green) to *giallo* (yellow) to *arancione* (orange) to *rosso* (red), the date when the fanfic was originally published, and the most recent update date. More information was captured about each story (including the title, URL, author, author page URL, genre, characters, character pairings, and description), but the information here is enough to start exploring a certain set of questions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The direction of the research inquiry\n", "The data that we're working with -- just the rating, publication date, and updated date -- can serve as the basis for finding answers to questions like:\n", "\n", "- What are the patterns -- if any -- in *when* people publish fanfic? \n", "- Are there different patterns depending on the rating?\n", "- Are writers more likely to update their fanfic on certain days of the week?\n", "- What trends can we see in the time interval between publication and updates? (To make sense of this, we may need to reference more data, like length or intended length of the fanfic.)\n", "- Are any patterns that we find consistent over time, or do they change?\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Date formats\n", "The publication and updated dates that we scraped from the fanfic archive take the format *day*/*month*/*year*. To start answering any of the questions that relate to days of the week, we need to convert the date as we have it into the day of the week.\n", "\n", "