Frequency of words and phrases in a document

Sometimes, out of curiosity, you want to know how often each word appears in one of your documents. Or you may want something a little more general: the frequency of every group of N consecutive words in the document. For that purpose I’ve written the following Python 2 script:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
import sys

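def overlapped_chunks(l, n):
  # Yield every run of n consecutive items; stop before the tail,
  # where fewer than n items remain.
  for i in xrange(0, len(l) - n + 1):
    yield l[i:i+n]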

count = int(sys.argv[1])

data = sys.stdin.read()
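data = unicode(data, 'utf8')
words = re.compile(u'[\s.:,/\[\]\(\)«»•–…\']+', re.UNICODE).split(data)
# split() leaves empty strings at the edges of the text; drop them.
words = [w for w in words if w]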
counts = {}
for chunk in overlapped_chunks(words, count):
  # Skip any group that contains a purely numeric token.
  if any(w.isdigit() for w in chunk):
    continue
  phrase = ' '.join(chunk)
  phrase = phrase.replace('-','')
  phrase = phrase.lower()
  counts[phrase] = counts.get(phrase, 0) + 1

# Sort by count, highest first; ties fall in reverse alphabetical order.
pairs = [(counts[x], x) for x in counts]
pairs.sort(reverse=True)
for (n, phrase) in pairs:
  print '%7d %s' % (n, phrase.encode('utf8'))
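
The script above targets Python 2. For Python 3, the same idea can be sketched roughly as follows. This is only an illustration, not a drop-in replacement: the function name `ngram_counts`, the constant `SEPARATORS` and the sample sentence are my own assumptions, and `collections.Counter` stands in for the hand-rolled dictionary.

```python
import re
from collections import Counter

# Separator set copied from the script above (assumed unchanged).
SEPARATORS = re.compile(r"[\s.:,/\[\]()«»•–…']+")

def ngram_counts(text, n):
    """Count every group of n consecutive words in text."""
    # Drop the empty strings that split() leaves at the edges.
    words = [w for w in SEPARATORS.split(text) if w]
    counts = Counter()
    # Zipping n staggered views of the list yields each full window once.
    for chunk in zip(*(words[i:] for i in range(n))):
        if any(w.isdigit() for w in chunk):
            continue  # skip groups containing a purely numeric token
        counts[" ".join(chunk).replace("-", "").lower()] += 1
    return counts

# Hypothetical sample text, for illustration only.
sample = "Lorem ipsum dolor sit amet. Dolor sit amet, lorem ipsum."
for phrase, freq in ngram_counts(sample, 2).most_common(3):
    print("%7d %s" % (freq, phrase))
```

Because `zip` stops at the shortest of its inputs, only complete groups of N words are counted, so no shorter leftover group at the end of the text slips into the totals.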

Let’s try the program on a document consisting of 1000 words of Lorem Ipsum:

$ count_words.py < body.txt 1 | head
     22 ut
     20 in
     19 ac
     17 id
     16 a
     15 sed
     15 eu
     14 vitae
     13 nulla
     13 non
$ count_words.py < body.txt 2 | head
     10 sit amet
      2 varius a
      2 ut viverra
      2 ut velit
      2 ut ante
      2 turpis quis
      2 tincidunt ligula
      2 sodales sed
      2 sodales pellentesque
      2 sociis natoque
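
In the two-word listing above, the phrases tied at a count of 2 appear in reverse alphabetical order, because the script sorts whole (count, phrase) tuples in reverse. If alphabetical ties are preferred, one possible approach is to sort on a key that negates the count. The sketch below is Python 3; the helper name `rank` is hypothetical, and the sample holds a few rows taken from the listing above.

```python
def rank(counts):
    """Return (phrase, count) pairs: highest count first, ties A to Z."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# A few rows from the listing above, used as sample input.
sample = {"ut velit": 2, "sit amet": 10, "ut ante": 2, "varius a": 2}
for phrase, freq in rank(sample):
    print("%7d %s" % (freq, phrase))
```

With this key, "sit amet" still comes first, followed by "ut ante", "ut velit" and "varius a".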
