{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data cleaning\n",
    "An example Jupyter notebook for hypothetical classroom use, as part of the *Programming Historian* lesson on Jupyter notebooks, by Quinn Dombrowski, Tassie Gniady, and David Kloster."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What's data cleaning?\n",
    "If you want use a computer to do any sort of data analysis, you need to make sure that the data is *clean*: consistently formatted using conventions the computer can work with. For instance, if you're manually counting all the times cats are mentioned in a text, you probably won't think twice about including \"cat\", \"Cat\", \"cats\", \"Cat--\", \"cat.\" as part of your count. Depending on your exact research question, you might also include related words like \"kitty\", \"kitten\", or \"feline\".\n",
    "\n",
    "Having a person do this task could take a lot of time, depending on the length of the text in question, but you can reasonably expect that a person will take all this variation into account when counting, especially if you ask them to count \"all words referring to cats\". Computers don't have the same understanding of a text that a person does, nor the same understanding of the concept of a cat. It may be possible to get a computer to count similar things to what a human would count when given this task, but first you need to do things to modify the text so that the computer doesn't get confused by minor variation (like capitalization or punctuation) that a human wouldn't necessarily give any thought to.\n",
    "\n",
    "Even when all your data is in a consistent format, some research questions may require you to modify that format. The following example comes from a research project on how *Harry Potter* fan fiction gets written across different languages and cultures."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The data\n",
    "The example data that we're working with was captured from the Italian fan fiction archive efpfanfic.net. Each line of the file has the rating (like a movie rating) for the fanfic, ranging from *verde* (green) to *giallo* (yellow) to *arancione* (orange) to *rosso* (red), the date when the fanfic was originally published, and the most recent update date. More information was captured about each story (including the title, URL, author, author page URL, genre, characters, character pairings, and description), but the information here is enough to start exploring a certain set of questions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The direction of the reseach inquiry\n",
    "The data that we're working with  -- just the rating, publication date, and updated date -- can serve as the basis for finding answers to questions like:\n",
    "\n",
    "- What are the patterns -- if any -- in *when* people publish fanfic? \n",
    "- Are there different patterns depending on the rating?\n",
    "- Are writers more likely to update their fanfic on certain days of the week?\n",
    "- What trends can we see in the time interval between publication and updates? (To make sense of this, we may need to reference more data, like length or intended length of the fanfic.)\n",
    "- Are any patterns that we find consistent over time, or do they change?\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Date formats\n",
    "The publication and updated dates that we scraped from the fanfic archive take the format *day*/*month*/*year*. To start answering any of the questions that related to days of the week, we need to convert the date as we have it into the day of the week.\n",
    "\n",
    "<div class=\"alert alert-warning\"><strong>Group activity 1:</strong>\n",
    "\n",
    "In your table groups, find out what day of the week 11/3/14 was.</div>\n",
    "\n",
    "Now that you've converted one date, let's figure out how long it'll take to convert more of them!\n",
    "\n",
    "<div class=\"alert alert-warning\"><strong>Group activity 2:</strong>\n",
    "\n",
    "In your table groups, one student should be the timekeeper, while the other students each convert the following dates to their day of the week. The timekeeper will keep track of how long each student takes, and average the times from the students at the table. Don't race, and don't get distracted with browsing other websites; work at a pace that you imagine could be sustainable over an hour.\n",
    "<ul>\n",
    "    <li>11/3/14</li>\n",
    "    <li>7/6/14</li>\n",
    "    <li>7/6/14</li>\n",
    "    <li>27/06/11</li>\n",
    "    <li>15/02/11</li>\n",
    "    <li>2/11/10</li>\n",
    "    <li>3/11/10</li>\n",
    "    <li>20/10/10</li>\n",
    "    <li>27/08/09</li>\n",
    "    <li>30/08/09</li>\n",
    "    <li>27/12/08</li>\n",
    "    <li>2/12/08</li>\n",
    "    <li>16/06/08</li>\n",
    "    <li>22/11/08</li>\n",
    "    <li>30/01/07</li>\n",
    "    <li>16/10/08</li>\n",
    "    <li>6/9/08</li>\n",
    "    <li>3/10/08</li>\n",
    "    <li>7/9/08</li>\n",
    "    <li>26/07/07</li>\n",
    "</ul>\n",
    "    </div>\n",
    "\n",
    "Before we move on, compare answers for all 20 dates. If someone at your table got a different answer, walk through the process together step by step to figure out why.\n",
    "\n",
    "<div class=\"alert alert-warning\"><strong>Group activity 3:</strong>\n",
    "\n",
    "For two minutes, think about how many date conversions you'd be willing to do this way. \n",
    "\n",
    "- Would that number be different if you were being paid (at the average hourly wage for student research assistants at your university)? \n",
    "- What if it were your own project and you weren't being paid?\n",
    "- Based on the average time it took for people at your table to convert 20 dates, how much would you have to pay (as the imaginary project directory) to have students convert 112,286 dates, paying the average hourly rate for student research assisants?\n",
    "- As a project director, what is most you'd be willing to pay to answer questions that require the dates to be converted this way?\n",
    "\n",
    "Write down your own answers, then discuss your answers with your table group.</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Writing code\n",
    "While there have been large-scale projects involving painstaking manual labor that have run for decades (e.g. many dictionary projects for historical languages), people only undertake that kind of work for projects where they're sure the result will be very valuable for an enter disciplinary community. Mere curiosity isn't worth it -- you might not find anything worthwhile.\n",
    "\n",
    "When a data conversion task doesn't require much or any human judgement, it might be something that you can automate by writing code. Tasks that involve doing exact same thing over and over again (i.e. turning dates that are consistently formatted into a different consistent format, rather than changing something inconsistent into a different inconsistent thing) are *great* candidates for writing code: writing code *once* will have a huge payoff.\n",
    "\n",
    "Below is the code that can convert 112,286 dates to their day of the week. It should run for less than a minute (depending on the computer that's running it), and display all the results below the code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#datetime is used to convert to the date from the parsed format\n",
    "#to the day of the week\n",
    "import datetime\n",
    "#csv reads data from a csv (comma separated value) format\n",
    "import csv\n",
    "#dateutil is used to parse the input dates\n",
    "import dateutil"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "#the \"magic word\" above will print the time it takes to run this entire cell\n",
    "#you can scroll to the bottom of the converted dates to see how long it took\n",
    "#identifies the source file to open, calls it f\n",
    "with open('ph-jupyter-notebook-example.csv') as f:\n",
    "    #defines \"csv_reader\" as running the function csv.reader on the file\n",
    "    csv_reader = csv.reader(f, delimiter=',')\n",
    "    #for each row that's being read by csv_reader...\n",
    "    for row in csv_reader:\n",
    "        #creates a list called \"values\" with the contents of the row\n",
    "        values = list(row)\n",
    "        #defines \"rating\" as the first thing in the list\n",
    "        #counting in Python starts with 0, not 1\n",
    "        rating = values[0]\n",
    "        #defines \"parseddatepub\" as the second thing (1, because we start with 0) in the list,\n",
    "        #converted into a standard date format using dateutil.parser\n",
    "        #and when those dates are parsed, the parser should know\n",
    "        #that the first value in the sequence is the day\n",
    "        parseddatepub = dateutil.parser.parse(values[1], dayfirst=True)\n",
    "        #same as above for the updated date, the third thing (2) in the list\n",
    "        parseddateupdate = dateutil.parser.parse(values[2], dayfirst=True)\n",
    "        #defines \"dayofweekpub\" as parseddatepub (defined above), converted to the day of week\n",
    "        #%A is what you use to change it to the day of the week\n",
    "        #You can see othe formats here: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior\n",
    "        dayofweekpub = datetime.date.strftime(parseddatepub, '%A')\n",
    "        #same thing for update date\n",
    "        dayofweekupdate = datetime.date.strftime(parseddateupdate, '%A')\n",
    "        #creates a list of the rating and the newly formatted dates\n",
    "        updatedvalues = [rating, dayofweekpub, dayofweekupdate]\n",
    "        #writes all the values under this code cell\n",
    "        print(updatedvalues)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Editing code\n",
    "Once you've engaged time-consuming human labor in a text cleaning task, it's an expensive prospect to change your mind about the output format. When you're using code, it may take some time to figure out how to modify the code to get the new output format, but once you've worked it out, you can get your new output immediately.\n",
    "\n",
    "<div class=\"alert alert-warning\"><strong>Group activity 4:</strong>\n",
    "\n",
    "In your table group, work together to modify the code in the cell below (identical to the code above) so that the date is formatted as follows: \"Monday, September 01, 2014\"\n",
    "\n",
    "<em>Hint:</em> start by looking through the [datetime module documentation](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior).</div>\n",
    "\n",
    "<div class=\"alert alert-warning\"><strong>Group activity 5:</strong>\n",
    "\n",
    "It looks awkward to have those zeroes in the date numbers, but the Python *datetime* module really likes what's called \"zero padded dates\". In your group, search the web for what you can do to *not* display a zero-padded date using Python's datetime module. If you can find a solution, see if you can make it work in your code.\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data cleaning & analysis for self-identified coders\n",
    "People who identify as coders can gather together in their own table groups to tackle the following tasks. There'll be an opportunity to share what you managed to work out before the end of the class.\n",
    "\n",
    "Tasks marked with a <i class=\"fa-question-circle fa\"></i> are ones where the project doesn't already have code written. If you come up with a working solution, you'll be credited as a collaborator on the project in real academic publications that stem from that work.\n",
    "\n",
    "<div class=\"alert alert-warning\"><strong>Coding task 1:</strong>\n",
    "\n",
    "Convert the dates in the example file to days of the week (you can crib from the code above, if you like). Write some additional code that works out what percentage of the stories were updated on the same day of the week they were written.\n",
    "\n",
    "</div>\n",
    "\n",
    "<div class=\"alert alert-warning\"><strong>Coding task 2:</strong>\n",
    "\n",
    "Do you find anything surprising about your results from the first task? Why do you think the results are that way? Is there anything you need to revise in the code? *(Hint: one thing you might want to check on is how long it usually takes between publication and update.)*\n",
    "</div>\n",
    "\n",
    "<div class=\"alert alert-warning\"><strong>Coding task 3:</strong>\n",
    "\n",
    "What percentage of stories in each \"rating\" are published on each day of the week? (e.g. \"Rosso: 10% Monday, 20% Tuesday, etc.\")\n",
    "</div>\n",
    "\n",
    "<div class=\"alert alert-warning\"><strong>Coding task 4:</strong>\n",
    "\n",
    "What percentage of the stories overall are posted <em>or updated</em> on a Sunday? Do Easter Sundays show higher, lower, or the same level of fanfic activity?\n",
    "</div>\n",
    "\n",
    "<div class=\"alert alert-warning\"><i class=\"fa-question-circle fa\"></i><strong>Coding task 4:</strong>\n",
    "\n",
    "Look at these same metrics (being updated on the same day, day of the week of publication -- in general and per rating) year by year (2003-2019). Are the results similar every year, or are there changes over time?\n",
    "</div>"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}