Just to learn, I made a small script in Python that finds the most used words in a book or any other piece of text. It has special support for most Gutenberg files: it can read the title of the book and exclude the licence (above and below the book content) from the count.
The usage is simple:
Usage:
-i[arg], --input[arg] : The input to get the content from (URL, local file, or raw text). Special support for Gutenberg books (the licence is removed, and the module can be extended to read the book title).
-h, --help : Displays this help screen.
-o[arg], --output[arg] : The file to write the CSV output to. Note: the name should end with .csv.
-c, --common : Removes the common words from the final statistics. Support for customizing the common word list may come later.
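For anyone curious what the counting behind these options might look like, here is a minimal sketch. It is not the code from the download: `count_words` and `write_csv` are names made up for illustration, and the word pattern and the CSV layout (one `word,count` row per line, no header) are assumptions.

```python
import csv
import re
from collections import Counter

# Assumed definition of a "word": runs of letters, optionally joined by
# apostrophes (so "don't" counts as one word).
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")

def count_words(text, common=()):
    """Count lowercase words in text, skipping any word listed in common."""
    skip = set(common)
    return Counter(w for w in WORD_RE.findall(text.lower()) if w not in skip)

def write_csv(counts, out):
    """Write word,count rows to the open file-like object out, most frequent first."""
    writer = csv.writer(out)
    for word, n in counts.most_common():
        writer.writerow([word, n])
```

With something like this, the -c option would just mean passing the contents of common.txt as `common`.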
This uses Mark Pilgrim’s openAnything module (I already know how it works, so I don’t think coding it myself would have taught me much). The download also contains a script called mergecsv, which merges the statistics for two books by adding up the counts for each word.
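The merging idea can be sketched in a few lines. This is only an illustration of adding up counts, not the actual mergecsv script; `read_counts` and `merge_counts` are hypothetical names, and the sketch assumes each CSV row is `word,count` with no header row.

```python
import csv
from collections import Counter

def read_counts(lines):
    """Build a Counter from CSV lines of the assumed form word,count."""
    counts = Counter()
    for row in csv.reader(lines):
        if len(row) >= 2:  # skip blank or malformed rows
            counts[row[0]] += int(row[1])
    return counts

def merge_counts(first, second):
    """Merge two sets of statistics by adding the counts word by word."""
    return first + second
```

`read_counts` accepts any iterable of lines, so an open file works as well as a list of strings.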
Download (with openAnything, mergecsv, common.txt, the unit test booktest.py, and sample books of Alice in Wonderland and Sherlock Holmes): book.tar.gz
Highlighted Source:
It doesn’t parse Gutenberg E-texts, but it does parse Gutenberg E-Books (the difference being the way the start and end of the book are indicated). I don’t plan to add E-text support, because it’s almost the same thing, and the purpose of this is only to learn.
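To give an idea of what parsing the start and end markers involves, here is a rough sketch. It assumes the E-Book markers look like `*** START OF THIS PROJECT GUTENBERG EBOOK <TITLE> ***` and `*** END OF THIS PROJECT GUTENBERG EBOOK ...`; the exact wording varies between files, so treat the patterns as an assumption. `split_gutenberg` is a made-up name, not a function from the download.

```python
import re

# Assumed marker formats; real files may word these lines differently.
START_RE = re.compile(
    r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK (.*?) ?\*\*\*[ \t\r]*$",
    re.MULTILINE,
)
END_RE = re.compile(
    r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK",
    re.MULTILINE,
)

def split_gutenberg(text):
    """Return (title, body). If no start marker is found, return (None, text)."""
    start = START_RE.search(text)
    if start is None:
        return None, text
    end = END_RE.search(text, start.end())
    body_end = end.start() if end else len(text)
    # The title is returned exactly as it appears in the start marker.
    return start.group(1).strip(), text[start.end():body_end].strip()
```

Supporting the older E-texts would presumably mean adding a second pair of patterns for their markers.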
Feedback welcome!