Frequency of words and phrases of a document

Sometimes, out of curiosity, you want to know how often each word appears in one of your documents. Or, to make it a little more generic, how often each group of N consecutive words appears. For that purpose I’ve written the following Python script, which takes N as its only argument and reads the document from standard input:

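# -*- coding: utf-8 -*-
import re
import sys

def overlapped_chunks(l, n):
  # Yield every run of n consecutive items of l (a sliding window).
  for i in xrange(0, len(l) - n + 1):
    yield l[i:i+n]

# Group size N: 1 counts single words, 2 counts word pairs, and so on.
count = int(sys.argv[1])

# The document is read from standard input.
data = sys.stdin.read()
data = unicode(data, 'utf8')
# Split on whitespace and punctuation, then drop the empty strings
# that split() leaves at the edges of the text.
words = re.compile(u'[\s.:,/\[\]\(\)«»•–…\']+', re.UNICODE).split(data)
words = [w for w in words if w]
freq = {}
for word in overlapped_chunks(words, count):
  # Skip groups that contain a purely numeric token.
  if any([i.isdigit() for i in word]):
    continue
  word = ' '.join(word)
  word = word.replace('-','')
  word = word.lower()
  if word not in freq:
    freq[word] = 0
  freq[word] += 1

# Sort by count, then by phrase, both in descending order.
pairs = [(freq[x],x) for x in freq]
pairs.sort(reverse=True)
for (x,y) in pairs:
  print '%7d %s' % (x,y.encode('utf8'))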
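
For readers on Python 3, the same idea can be sketched with collections.Counter. This is only a minimal sketch, not a drop-in replacement for the script above: the names SEPARATORS, ngram_counts and report are made up for illustration, the sample sentence is invented, and two behaviours are assumptions read off the script and its sample output (groups containing a purely numeric token are skipped, and results are sorted by count and then by phrase, both descending).

```python
import re
from collections import Counter

# Same separator set as the script above: whitespace plus common punctuation.
SEPARATORS = re.compile(r"[\s.:,/\[\]()«»•–…']+")

def ngram_counts(text, n):
    """Count every run of n consecutive words in text (hypothetical helper)."""
    # split() leaves empty strings at the edges of the text; drop them.
    words = [w for w in SEPARATORS.split(text) if w]
    counts = Counter()
    for i in range(len(words) - n + 1):
        chunk = words[i:i + n]
        # Assumption: skip groups that contain a purely numeric token.
        if any(w.isdigit() for w in chunk):
            continue
        counts[" ".join(chunk).replace("-", "").lower()] += 1
    return counts

def report(counts):
    """Return (count, phrase) pairs sorted by count, then phrase, descending."""
    return sorted(((c, p) for p, c in counts.items()), reverse=True)

if __name__ == "__main__":
    # Invented sample text, only to exercise the functions.
    sample = "Lorem ipsum dolor sit amet, lorem ipsum dolor. Sit amet 42 lorem."
    for c, p in report(ngram_counts(sample, 2)):
        print("%7d %s" % (c, p))
```

Sorting the (count, phrase) tuples in reverse is what puts equal counts in reverse alphabetical order, as in the listings in this post; Counter.most_common() would sort by count only.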

Let’s try the program on a document consisting of 1000 words of Lorem Ipsum, first with N = 1 and then with N = 2:

$ < body.txt 1 | head
     22 ut
     20 in
     19 ac
     17 id
     16 a
     15 sed
     15 eu
     14 vitae
     13 nulla
     13 non
$ < body.txt 2 | head
     10 sit amet
      2 varius a
      2 ut viverra
      2 ut velit
      2 ut ante
      2 turpis quis
      2 tincidunt ligula
      2 sodales sed
      2 sodales pellentesque
      2 sociis natoque
