Word-frequency filter

Excerpted from Robbins A., Beebe N. - Classic Shell Scripting - 2005


Chapter 5.


Problem:

Given a text file and an integer n, you are to print the words (and their frequencies of occurrence) whose frequencies of occurrence are among the n largest in order of decreasing frequency. (In other words: find the n most frequent words in a document and show how many times each one occurs.)


McIlroy’s program illustrates the power of the Unix tools approach: break a complex problem into simpler parts that you already know how to handle. To solve the word-frequency problem, McIlroy converted the text file to a list of words, one per line (tr does the job), mapped words to a single lettercase (tr again), sorted the list (sort), reduced it to a list of unique words with counts (uniq), sorted that list by descending counts (sort), and finally, printed the first several entries in the list (sed, though head would work too).

Example 5-5. Word-frequency filter
#! /bin/sh
# Read a text stream on standard input, and output a list of
# the n (default: 25) most frequently occurring words and
# their frequency counts, in order of descending counts, on
# standard output.
#
# Usage:
# wf [n]

tr -cs A-Za-z\' '\n' |              # Replace nonletters with newlines
    tr A-Z a-z |                    # Map uppercase to lowercase
        sort |                      # Sort the words in ascending order
            uniq -c |               # Eliminate duplicates, showing their counts
                sort -k1,1nr -k2 |  # Sort by descending count, and then by ascending word
                    sed ${1:-25}q   # Print only the first n (default: 25) lines; see Chapter 3
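
As a quick check of the pipeline in Example 5-5, the sketch below wraps it in a shell function and runs it on a two-line sample. The function name wf follows the usage comment in the script; the sample text is made up for illustration, and the width of the padding before each count depends on the local uniq implementation.

```shell
#!/bin/sh
# wf: the Example 5-5 pipeline wrapped in a function for experimentation.
# The optional argument is n, the number of lines to print (default: 25).
wf() {
    tr -cs "A-Za-z'" '\n' |
        tr A-Z a-z |
            sort |
                uniq -c |
                    sort -k1,1nr -k2 |
                        sed "${1:-25}q"
}

# Hypothetical sample input: "the" occurs 3 times, "dog" twice,
# and every other word once.
printf 'The cat and the dog.\nThe dog barked.\n' | wf 2
# Prints two lines: a count of 3 for "the", then a count of 2 for "dog"
# (each count is preceded by implementation-dependent padding).
```

As the text notes, head would also work for the last stage: replacing sed "${1:-25}q" with head -n "${1:-25}" should select the same lines.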




