When working with large amounts of text, it can be interesting to know how often different words appear. For example, an author may overuse certain words, in which case he or she might wish to find synonyms to substitute for words that appear too often. This subsection develops a program for counting words and presenting the frequency information in a useful format.
At first glance, a program like this would seem to do the job:
# wordfreq-first-try.awk --- print list of word frequencies

{
    for (i = 1; i <= NF; i++)
        freq[$i]++
}

END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}
The program relies on awk’s default field-splitting mechanism to break each line up into “words” and uses an associative array named freq, indexed by each word, to count the number of times the word occurs. In the END rule, it prints the counts.
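For instance, here is a hypothetical run on a single line of input (because ‘for (word in freq)’ visits array elements in an unspecified order, the output lines may appear in any order):

    $ echo "to be or not to be" | awk -f wordfreq-first-try.awk
    to      2
    be      2
    or      1
    not     1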
This program has several problems that prevent it from being useful on real text files:

- The awk language considers upper- and lowercase characters to be distinct. Therefore, “bartender” and “Bartender” are not treated as the same word. This is undesirable, because words are capitalized if they begin sentences in normal text, and a frequency analyzer should not be sensitive to capitalization.

- Words are detected using the awk convention that fields are separated just by whitespace. Other characters in the input (except newlines) don’t have any special meaning to awk. This means that punctuation characters count as part of words.

- The output does not come out in any useful order; you are more likely to be interested in which words occur most frequently, or in having an alphabetized table of how frequently each word occurs.
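A small, hypothetical run shows the first two problems: “Hello,” and “hello” are counted separately because of capitalization and the trailing comma, and likewise “world.” and “world” (again, the order of the output lines is not guaranteed):

    $ printf 'Hello, world.\nhello world\n' | awk -f wordfreq-first-try.awk
    Hello,  1
    world.  1
    hello   1
    world   1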
The first problem can be solved by using tolower() to remove case distinctions. The second problem can be solved by using gsub() to remove punctuation characters. Finally, we solve the third problem by using the system sort utility to process the output of the awk script. Here is the new version of the program:
# wordfreq.awk --- print list of word frequencies

{
    $0 = tolower($0)    # remove case distinctions
    # remove punctuation
    gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
    for (i = 1; i <= NF; i++)
        freq[$i]++
}

END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}
The regexp /[^[:alnum:]_[:blank:]]/ might have been written /[[:punct:]]/, but then underscores would also be removed, and we want to keep them.
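A pair of throwaway one-liners (not part of the program) illustrates the difference on an identifier such as foo_bar, where [[:punct:]] strips the underscore as well:

    $ echo "foo_bar, baz." | awk '{ gsub(/[^[:alnum:]_[:blank:]]/, ""); print }'
    foo_bar baz
    $ echo "foo_bar, baz." | awk '{ gsub(/[[:punct:]]/, ""); print }'
    foobar baz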
Assuming we have saved this program in a file named wordfreq.awk, and that the data is in file1, the following pipeline:
awk -f wordfreq.awk file1 | sort -k 2nr
produces a table of the words appearing in file1 in order of decreasing frequency.
The awk program suitably massages the data and produces a word frequency table, which is not ordered. The awk script’s output is then sorted by the sort utility and printed on the screen.
The options given to sort specify a sort that uses the second field of each input line (skipping one field), that the sort keys should be treated as numeric quantities (otherwise ‘15’ would come before ‘5’), and that the sorting should be done in descending (reverse) order.
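If an alphabetized table is wanted instead, no sort options are needed at all: because the word comes first on each output line, the default lexicographic sort orders the table by word:

    awk -f wordfreq.awk file1 | sort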
The sort could even be done from within the program, by changing the END action to:
END {
    sort = "sort -k 2nr"
    for (word in freq)
        printf "%s\t%d\n", word, freq[word] | sort
    close(sort)
}
This way of sorting must be used on systems that do not have true pipes at the command-line (or batch-file) level. See the general operating system documentation for more information on how to use the sort program.
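With gawk specifically, the external sort command could be avoided altogether: gawk’s PROCINFO["sorted_in"] extension controls the order in which ‘for (word in freq)’ traverses the array, so the END rule can visit the counts from highest to lowest directly. This is a gawk-only sketch, not portable awk:

    # Requires gawk: PROCINFO["sorted_in"] is a gawk extension
    END {
        PROCINFO["sorted_in"] = "@val_num_desc"   # traverse by numeric value, descending
        for (word in freq)
            printf "%s\t%d\n", word, freq[word]
    }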