April 1, 2013

Counting Frequencies from Zotero Items

    Reviewed by Fred Gibbs
    Recommended for Beginning Users

Lesson Goals

In Counting Frequencies you learned how to count the frequency of specific words in a list using Python. In this lesson, we will expand on that topic by showing you how to retrieve webpage items from a Zotero library, save the HTML content of those items, and count the frequencies of the words they contain. It may be helpful to review the previous lesson before we begin.

Files Needed For This Lesson

  • obo.py

If you do not have this file, you can download programming-historian-3, a (zip) file from the previous lesson.

Modifying the obo.py Module

Before we begin, we need to adjust obo.py so that the module can work with HTML files from other sources. The stripTags function was originally written for Old Bailey Online content only: it looks for the end of the page header and begins processing from that point. First, we will remove the line that looks for the end of the header; then we will change the following line so that processing begins at the start of the page. Open the obo.py file in your text editor and follow the instructions below:

def stripTags(pageContents):
    #remove the following line
    #startLoc = pageContents.find("<hr/><h2>")

    #modify the following line
    #pageContents = pageContents[startLoc:]

    #so that it looks like this
    pageContents = pageContents[0:]

    inside = 0
    text = ' '

    for char in pageContents:
        if char == '<':
            inside = 1
        elif (inside == 1 and char =='>'):
            inside = 0
        elif inside == 1:
            continue
        else:
            text += char

    return text 

Remember to save your changes before we continue.
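
To see why the original lines fail on pages from other sites, consider the short sketch below. The sample page is a made-up string rather than a real Zotero item, and the code is written to run under either Python 2 or Python 3. When the Old Bailey marker is missing, Python's find method returns -1, so the old slice keeps only the last character of the page.

```python
# Hypothetical page that lacks the Old Bailey marker "<hr/><h2>"
page = "<html><body><p>Digital history</p></body></html>"

startLoc = page.find("<hr/><h2>")   # find returns -1 when nothing matches
old_slice = page[startLoc:]         # page[-1:] keeps only the final character
new_slice = page[0:]                # the modified line keeps the whole page

print(startLoc)                     # -1
print(old_slice)                    # >
print(new_slice == page)            # True
```

With the modified line, stripTags receives the complete page no matter where it came from.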

Get Items from Zotero and Save Local Copy

After we have modified the obo.py file, we can create a program designed to request the top two items from a collection within a Zotero library, retrieve their associated URLs, read the web pages, and save the content to a local copy. This particular program will only work on webpage-type items with html content (for instance, entering the URLs of JSTOR or Google Books pages will not result in an analysis of the actual content).

First, create a new .py file and save it in your programming historian directory. Make sure your copy of the obo.py file is in the same location. Once you have saved your file, we can begin by importing the libraries and program data we will need to run this program:

#Get urls from Zotero items, create local copy, count frequencies
import obo
from libZotero import zotero
import urllib2 

Next, we need to tell our program where to find the items we will be using in our analysis. Using either the sample Zotero library from the lesson on the Zotero API or your personal library, we will pull the first two top-level items from the whole library or from a specific collection within it. (To find your collection key, hover over the RSS button on that collection’s page and use the second alphanumeric sequence in the URL. To connect to an individual user library, change the word group to the word user, replace the six-digit number with your user ID, and insert your own API key.)

#links to Zotero library
zlib = zotero.Library('group', '155975', '<null>', 'f4Bfk3OTYb7bukNwfcKXKNLG')

#specifies subcollection - leave blank to use whole library
collectionKey = 'I253KRDT'

#retrieves top two items from library
items = zlib.fetchItemsTop({'limit': 2, 'collectionKey': collectionKey, 'content': 'json,bib,coins'}) 

Now we can instruct our program to retrieve the URL from each of our items, create a filename using that URL, and save a copy of the html on the page.

#retrieves url from each item, creates a filename from the url, saves a local copy
for item in items:
    url = item.get('url')
    filename = url.split('/')[-1] + '.html'             #splits url at last /
    filename = filename.split('=')[-1]                  #splits url at last =
    filename = filename.replace('.html.html', '.html')  #removes double .html
    print 'Saving local copy of ' + filename

    response = urllib2.urlopen(url)
    webContent = response.read()
    f = open(filename,'w')
    f.write(webContent)
    f.close()

Running this portion of the program will result in the following:

Saving local copy of PastsFutures.html
Saving local copy of 29.html 
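
The three filename lines in the loop above can be tried on their own, without contacting Zotero. The sketch below wraps them in a helper function; the name filename_from_url and the example.org addresses are hypothetical and are not the real URLs of the sample items. It runs under either Python 2 or Python 3.

```python
def filename_from_url(url):
    """Derive a local filename using the same three steps as the lesson."""
    filename = url.split('/')[-1] + '.html'             # keep the text after the last /
    filename = filename.split('=')[-1]                  # keep the text after the last =
    filename = filename.replace('.html.html', '.html')  # remove a doubled extension
    return filename

# Hypothetical URLs chosen to exercise each step
print(filename_from_url('http://example.org/essays/PastsFutures.html'))  # PastsFutures.html
print(filename_from_url('http://example.org/view.php?id=29'))            # 29.html
print(filename_from_url('http://example.org/essays/'))                   # .html
```

The last call shows a limitation of this approach: a URL that ends in a slash produces the bare name .html, so items with such URLs would need a different naming rule.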

Get Item URLs from Zotero and Count Frequencies

Now that we’ve retrieved our items and created local html files, we can use the next portion of our program to retrieve the URLs, read the web pages, create a list of words, count their frequencies, and display them. Most of this should be familiar to you from the Counting Frequencies lesson.

#retrieves url from each item, creates a filename from the url
for item in items:
    itemTitle = item.get('title')
    url = item.get('url')
    filename = url.split('/')[-1] + '.html'             #splits url at last /
    filename = filename.split('=')[-1]                  #splits url at last =
    filename = filename.replace('.html.html', '.html')  #removes double .html
    print '\n' + itemTitle +'\nFilename: ' + filename + '\nWord Frequencies\n'
    response = urllib2.urlopen(url)
    html = response.read()
    

This section of code reads the title and URL of each item, builds a filename from the URL in the same way as before, prints a heading for the item, and downloads the HTML of the page. For the items in our sample collection, the output looks something like this:

The Pasts and Futures of Digital History
Filename: PastsFutures.html
Word Frequencies

History and the Web, From the Illustrated Newspaper to Cyberspace: Visual Technologies and Interaction in the Nineteenth and Twenty-First Centuries
Filename: 29.html
Word Frequencies 

Now we can create our list of words and their frequencies. Add the following lines to the end of the for loop above, keeping them indented so that they run once for each item:

    #strips HTML tags, strips nonAlpha characters, removes stopwords
    text = obo.stripTags(html).lower()
    fullwordlist = obo.stripNonAlphaNum(text)
    wordlist = obo.removeStopwords(fullwordlist, obo.stopwords)

    #counts frequencies
    dictionary = obo.wordListToFreqDict(wordlist)
    sorteddict = obo.sortFreqDict(dictionary)

    #displays list of words and frequencies
    for s in sorteddict: print str(s)

Your final output will include a long list of words accompanied by their frequency within the html file:

Saving local copy of PastsFutures.html
Saving local copy of 29.html

The Pasts and Futures of Digital History
Filename: PastsFutures.html
Word Frequencies

(51, 'history')
(43, 'new')
(31, '9')
(27, 'historians')
(24, 'digital')
(23, 'social')
(21, 'narrative')
(16, 'media')
(15, 'time')
(13, 'possibilities')
(13, 'past')
(12, 'science')
...

History and the Web, From the Illustrated Newspaper to Cyberspace: Visual Technologies and Interaction in the Nineteenth and Twenty-First Centuries
Filename: 29.html
Word Frequencies

(52, 'new')
(49, 'history')
(46, 'media')
(44, 'ndash')
(34, 'figure')
(34, 'digital')
(24, 'visual')
(24, 'museum')
(24, 'http')
(23, 'edu')
(22, 'web')
(22, 'text')
(22, 'barnum')
(21, 'users')
(21, 'information')
...
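
The counting steps can also be sketched with the standard library alone, which may help when reading the output above. This is not the code in obo.py: the function word_frequencies, the short stopword list, and the sample sentence are all assumptions made for illustration. The sketch runs under either Python 2 or Python 3.

```python
import re

# A tiny illustrative stopword list (an assumption, much shorter than obo.stopwords)
STOPWORDS = ['the', 'of', 'and', 'a', 'in']

def word_frequencies(text, stopwords):
    """Return (count, word) pairs sorted from most to least frequent."""
    # Split on runs of non-alphanumeric characters and drop empty strings
    words = [w for w in re.split(r'\W+', text.lower()) if w]
    words = [w for w in words if w not in stopwords]
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    pairs = [(counts[w], w) for w in counts]
    pairs.sort(reverse=True)
    return pairs

# Made-up sample text containing an HTML entity left behind after tag stripping
sample = 'The history of digital history &ndash; and the new media'
for pair in word_frequencies(sample, STOPWORDS):
    print(str(pair))
# (2, 'history')
# (1, 'new')
# (1, 'ndash')
# (1, 'media')
# (1, 'digital')
```

The sample suggests where an entry such as 'ndash' in the list above probably comes from: an entity like &ndash; is not a tag, so it survives tag stripping, and splitting on non-alphanumeric characters leaves the bare word behind.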

About the author

Spencer Roberts is Research Assistant and former Digital History Research Fellow at the Roy Rosenzweig Center for History and New Media, and a Ph.D. graduate student at George Mason University in the Department of History.  

Suggested Citation

Spencer Roberts, "Counting Frequencies from Zotero Items," Programming Historian, (01 April 2013), http://programminghistorian.org/lessons/counting-frequencies-from-zotero-items