Normalizing Textual Data with Python

William J. Turkel and Adam Crymble

In this lesson, we will make the list we created in the ‘From HTML to a List of Words’ lesson easier to analyze by normalizing this data.

Peer-reviewed

edited by

  • Miriam Posner

reviewed by

  • Jim Clifford
  • Francesca Benatti
  • Frederik Elwert

published: 2012-07-17

modified: 2012-07-17

difficulty: Medium

DOI: https://doi.org/10.46430/phen0014


Lesson Goals

The list that we created in From HTML to a List of Words (2) needs some normalizing before it can be used further. We are going to do this by applying additional string methods, as well as by using regular expressions. Once normalized, our data will be easier to analyze.

Files Needed For This Lesson

  • html-to-list1.py
  • obo.py

If you do not have these files from the previous lesson, you can download a zip file.

Cleaning up the List

In From HTML to a List of Words (2), we wrote a Python program called html-to-list1.py which downloaded a web page, stripped out the HTML formatting and metadata, and returned a list of “words” like the one shown below. Technically, these entities are called “tokens” rather than “words”. They include some things that are, strictly speaking, not words at all (like the abbreviation &c. for “etcetera”). They also include some things that may be considered composites of more than one word. The possessive “Akerman’s,” for example, is sometimes analyzed by linguists as two words: “Akerman” plus a possessive marker. Is “o’clock” one word or two? And so on.

Turn back to your program html-to-list1.py and make sure that your results look something like this:

['324.', '\xa0', 'BENJAMIN', 'BOWSEY', '(a', 'blackmoor', ')', 'was',
'indicted', 'for', 'that', 'he', 'together', 'with', 'five', 'hundred',
'other', 'persons', 'and', 'more,', 'did,', 'unlawfully,', 'riotously,',
'and', 'tumultuously', 'assemble', 'on', 'the', '6th', 'of', 'June', 'to',
'the', 'disturbance', 'of', 'the', 'public', 'peace', 'and', 'did', 'begin',
'to', 'demolish', 'and', 'pull', 'down', 'the', 'dwelling', 'house', 'of',
'\xa0', 'Richard', 'Akerman', ',', 'against', 'the', 'form', 'of',
'the', 'statute,', '&c.', '\xa0', 'ROSE', 'JENNINGS', ',', 'Esq.',
'sworn.', 'Had', 'you', 'any', 'occasion', 'to', 'be', 'in', 'this', 'part',
'of', 'the', 'town,', 'on', 'the', '6th', 'of', 'June', 'in', 'the',
'evening?', '-', 'I', 'dined', 'with', 'my', 'brother', 'who', 'lives',
'opposite', 'Mr.', "Akerman's", 'house.', 'They', 'attacked', 'Mr.',
"Akerman's", 'house', 'precisely', 'at', 'seven', "o'clock;", 'they',
'were', 'preceded', 'by', 'a', 'man', 'better', 'dressed', 'than', 'the',
'rest,', 'who']

By itself, this ability to separate the document into words doesn’t buy us much because we already know how to read. We can use the text, however, to do things that aren’t usually possible without special software. We’re going to start by computing the frequencies of tokens and other linguistic units, a classic measure of a text.

It is clear that our list is going to need some cleaning up before we can use it to count frequencies. In keeping with the practices established in From HTML to a List of Words (1), let’s try to describe our algorithm in plain English first. We want to know the frequency of each meaningful word that appears in the trial transcript. So, the steps involved might look like this (a compact sketch in code follows the list):

  • Convert all words to lower case so that “BENJAMIN” and “benjamin” are counted as the same word
  • Remove any strange or unusual characters
  • Count the number of times each word appears
  • Remove overly common words such as “it”, “the”, “and”, etc.
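
Before implementing anything, here is a compact sketch of how those four steps might look in Python. This is a hypothetical preview only: the sample tokens and the tiny stop word list are placeholders, and the lessons in this series develop each step properly.

from collections import Counter

# A hypothetical, miniature version of the four steps above;
# the sample tokens and stop word list are placeholders only.
tokens = ['BENJAMIN', 'and', 'the', 'riot', 'Riot.', 'and']

lowered = [t.lower() for t in tokens]                              # 1. lower case
cleaned = [''.join(c for c in t if c.isalnum()) for t in lowered]  # 2. drop odd characters
counts = Counter(cleaned)                                          # 3. count each word
stopwords = {'and', 'the', 'it'}                                   # 4. remove common words
meaningful = {word: n for word, n in counts.items() if word not in stopwords}

print(meaningful)  # {'benjamin': 1, 'riot': 2}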

Convert to Lower Case

Typically, tokens are folded to lower case when counting frequencies, so we’ll do that using the string method lower, which was introduced in Manipulating Strings in Python. Since lower is a string method, we will have to apply it to the string text in the html-to-list1.py program. Amend html-to-list1.py by adding the string method lower() to the end of the text string.

#html-to-list1.py
import urllib.request, urllib.error, urllib.parse, obo

url = 'http://www.oldbaileyonline.org/browse.jsp?id=t17800628-33&div=t17800628-33'

response = urllib.request.urlopen(url)
html = response.read().decode('UTF-8')
text = obo.stripTags(html).lower() #add the string method here.
wordlist = text.split()

print(wordlist)

You should now see the same list of words as before, but with all characters changed to lower case.

By calling methods one after another like this, we can keep our code short and make some pretty significant changes to our program.

As we said before, Python makes it easy to do a lot with very little code!
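
For instance, here is a minimal illustration of chaining with a made-up string: each string method returns a new string, so the next method can be called directly on the result.

# Each method returns a new string, so calls can be chained.
headline = '  BENJAMIN BOWSEY  '
print(headline.strip().lower())  # -> 'benjamin bowsey'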

At this point, we might look through a number of other Old Bailey Online entries and a wide range of other potential sources to make sure that there aren’t other special characters that are going to cause problems later. We might also try to anticipate situations where we don’t want to get rid of punctuation (e.g., distinguishing monetary amounts like “$1629” or “£1295” from dates, or recognizing that “1629-40” has a different meaning than “1629 40”.) This is what professional programmers get paid to do: try to think of everything that might go wrong and deal with it in advance.
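
For example (a hypothetical sketch, not part of this lesson’s program), a more defensive version of our code might test each token against patterns like these before stripping punctuation:

import re

# Hypothetical patterns a more careful program might use to spare
# monetary amounts and date ranges from punctuation stripping.
money = re.compile(r'^[$£]\d+$')             # matches "$1629" or "£1295"
date_range = re.compile(r'^\d{4}-\d{1,4}$')  # matches "1629-40"

for token in ['$1629', '£1295', '1629-40', 'evening?']:
    if money.match(token) or date_range.match(token):
        print(token, '-> leave intact')
    else:
        print(token, '-> strip punctuation')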

We’re going to take a different approach. Our main goal is to develop techniques that a working historian can use during the research process. This means that we will almost always prefer approximately correct solutions that can be developed quickly. So rather than taking the time now to make our program robust in the face of exceptions, we’re simply going to get rid of anything that isn’t an accented or unaccented letter or an Arabic numeral. Programming is typically a process of “stepwise refinement”. You start with a problem and part of a solution, and then you keep refining your solution until you have something that works better.

Python Regular Expressions

We’ve eliminated the upper case letters. That just leaves all the punctuation to get rid of. Punctuation will throw off our frequency counts if we leave it in: we want “evening?” to be counted as “evening” and “1780.” as “1780”, of course.

It is possible to use the replace string method to remove each type of punctuation:

text = text.replace('[', '')
text = text.replace(']', '')
text = text.replace(',', '')
#etc...

But that’s not very efficient. In keeping with our goal of creating short, powerful programs, we’re going to use a mechanism called regular expressions. Regular expressions are provided by many programming languages in a range of different forms.

Regular expressions allow you to search for well defined patterns and can drastically shorten the length of your code. For instance, if you wanted to know if a substring matched a letter of the alphabet, rather than use an if/else statement to check if it matched the letter “a” then “b” then “c”, and so on, you could use a regular expression to see if the substring matched a letter between “a” and “z”. Or, you could check for the presence of a digit, or a capital letter, or any alphanumeric character, or a carriage return, or any combination of the above, and more.
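
As a quick illustration of that last point, compare an if/else chain with the equivalent one-line pattern (this uses Python’s re module, which is introduced just below):

import re

substring = 'b'

# Without a regular expression: test every letter in turn...
if substring == 'a' or substring == 'b' or substring == 'c':  # ...and so on to 'z'
    print('found a letter')

# With a regular expression: one pattern covers the whole range.
if re.search(r'^[a-z]$', substring):
    print('found a letter')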

In Python, regular expressions are provided by a module. To speed up processing, the module is not loaded automatically, because not all programs require it. So, you will have to import the module (called re) in the same way that you imported your obo.py module.

Since we’re interested in only alphanumeric characters, we’ll create a regular expression that will isolate only these and remove the rest. Copy the following function and paste it into the obo.py module at the end. You can leave the other functions in the module alone, as we’ll continue to use those.

# Given a text string, remove all non-alphanumeric
# characters (using Unicode definition of alphanumeric).

def stripNonAlphaNum(text):
    import re
    return re.compile(r'\W+', re.UNICODE).split(text)

The regular expression in the above code is the material inside the string, in other words \W+. The \W is shorthand for the class of non-alphanumeric characters, and in a Python regular expression the plus sign (+) matches one or more copies of the preceding element. The re.UNICODE flag tells the interpreter that we want to include characters from the world’s other languages in our definition of “alphanumeric”, as well as the A to Z, a to z and 0-9 of English. Regular expressions have to be compiled before they can be used, which is what the rest of the statement does. Don’t worry about understanding the compilation part right now.
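
To see what the compiled expression actually does, try it on a small made-up sample at the Python prompt. Splitting on runs of non-alphanumeric characters throws away the punctuation between the words:

import re

# Splitting on runs of non-alphanumeric characters (\W+) discards
# the punctuation between words; a final period leaves a trailing
# empty string in the list.
pattern = re.compile(r'\W+', re.UNICODE)
print(pattern.split("evening? - i dined with mr. akerman's brother."))
# -> ['evening', 'i', 'dined', 'with', 'mr', 'akerman', 's', 'brother', '']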

After this refinement, our html-to-list1.py program looks like this:

#html-to-list1.py
import urllib.request, urllib.error, urllib.parse, obo

url = 'http://www.oldbaileyonline.org/browse.jsp?id=t17800628-33&div=t17800628-33'

response = urllib.request.urlopen(url)
html = response.read().decode('UTF-8')
text = obo.stripTags(html).lower()
wordlist = obo.stripNonAlphaNum(text)

print(wordlist)

When you execute the program and look through its output in the “Command Output” pane, you’ll see that it has done a pretty good job. This code will still split hyphenated forms like “coach-wheels” into two words, and it loses apostrophes, splitting possessives like “Akerman’s” and forms like “o’clock” into separate tokens. But it is a good enough approximation to what we want that we should move on to counting frequencies before attempting to make it better. (If you work with sources in more than one language, you will need to learn more about the Unicode standard and about Python’s support for it.)
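
You can verify those limitations directly with a small test string of your own:

import obo

# Hyphens and apostrophes count as non-alphanumeric, so they become
# split points and disappear from the output.
print(obo.stripNonAlphaNum("coach-wheels at seven o'clock"))
# -> ['coach', 'wheels', 'at', 'seven', 'o', 'clock']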

Suggested Reading

For extra practice with Regular Expressions, you may find Chapter 7 of Mark Pilgrim’s “Dive into Python” a useful tutorial.

Code Syncing

To follow along with future lessons it is important that you have the right files and programs in your programming-historian directory. At the end of each chapter in this series you can download the programming-historian zip file to make sure you have the correct code.

About the authors

William J. Turkel is Professor of History at the University of Western Ontario.

Adam Crymble is at University College London.

Suggested Citation

William J. Turkel and Adam Crymble, "Normalizing Textual Data with Python," Programming Historian 1 (2012), https://doi.org/10.46430/phen0014.
