While surfing through Wikipedia one day, I came across the page for Ulysses, and in my quest to better my favorite source of knowledge, I decided to provide a citation for the word statistics of the novel.

The article claims that Ulysses totals 250,000 words from a vocabulary of 30,000 words. It’s not a bad estimate. The version of Ulysses I analyzed is from Project Gutenberg and contains 264,834 words (about 6% more than the article’s figure), using a vocabulary of 30,030 words. Unsurprisingly, the most frequently used word in the novel is “the”, at 14,905 occurrences.

The code I used is very simple. You can download it from the links at the bottom of this post, along with the text I analyzed. (Note: I’ve removed all the copyright information that Project Gutenberg places in the downloaded text so that it doesn’t skew the counts. If anyone has a problem with this, I will take it down.) I’ve also included the book’s vocabulary with word frequencies in case you want to play around with that.
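If you’d rather not download anything, here is a rough sketch of how counts like these could be produced. This is not the script linked below, just my guess at a minimal version: the function names `count_words` and `summarize` are made up for illustration, and the tokenization rule (lowercase everything, treat runs of ASCII letters with optional internal apostrophes as words) is an assumption. A different rule will give somewhat different totals.

```python
import re
from collections import Counter


def count_words(text):
    """Return a Counter mapping each lowercased word to its frequency."""
    # Assumption: a "word" is a run of ASCII letters, optionally with
    # internal straight apostrophes (so "don't" counts as one word).
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return Counter(tokens)


def summarize(text):
    """Return (total words, vocabulary size, most common (word, count))."""
    counts = count_words(text)
    total = sum(counts.values())      # every occurrence of every word
    vocabulary = len(counts)          # number of distinct words
    most_common = counts.most_common(1)[0] if counts else None
    return total, vocabulary, most_common
```

To try it on the novel, read the text file into a string and pass it to `summarize`. The file name is up to you; something like `ulysses.txt` is only a placeholder.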

You can do some interesting stuff with the word list. For example, to see how many times Joyce used a specific word, say “cheese”, you could do the following:

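# Load the word list: each line holds a word and its frequency.
words = {}

with open('ulysses_words.txt', 'r') as f:
    for line in f:
        word, frequency = line.split()
        words[word] = int(frequency)  # store the count as a number

print(words['cheese'])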

And boom, “11”. Shoot me an email if you do something cool with this.
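One more idea to get you started: listing the most frequent words. This is only a sketch, assuming the word list keeps the same one-word-one-count-per-line layout used above. The helper names `load_frequencies` and `top_words` are my own inventions, not part of the download.

```python
def load_frequencies(lines):
    """Build a {word: count} dict from lines of the form 'word count'."""
    words = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word, frequency = parts
        words[word] = int(frequency)
    return words


def top_words(words, n=10):
    """Return the n most frequent (word, count) pairs, highest first."""
    return sorted(words.items(), key=lambda item: item[1], reverse=True)[:n]


# Example usage, assuming the word list sits next to the script:
# with open('ulysses_words.txt', 'r') as f:
#     for word, count in top_words(load_frequencies(f)):
#         print(word, count)
```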


