TranslateProject/sources/tech/20210401 Use awk to calculate letter frequency.md
DarkSun 4f1400d882 选题[tech]: 20210401 Use awk to calculate letter frequency
sources/tech/20210401 Use awk to calculate letter frequency.md
2021-04-02 05:04:50 +08:00

287 lines
8.6 KiB
Markdown
Raw Blame History

This file contains invisible Unicode characters

This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

[#]: subject: (Use awk to calculate letter frequency)
[#]: via: (https://opensource.com/article/21/4/gawk-letter-game)
[#]: author: (Jim Hall https://opensource.com/users/jim-hall)
[#]: collector: (lujun9972)
[#]: translator: ( )
[#]: reviewer: ( )
[#]: publisher: ( )
[#]: url: ( )
Use awk to calculate letter frequency
======
Write an awk script to determine the most (and least) common letters in
a set of words.
![Typewriter keys in multicolor][1]
I recently started writing a game where you build words using letter tiles. To create the game, I needed to know the frequency of letters across regular words in the English language, so I could present a useful set of letter tiles. Letter frequency is discussed in various places, including [on Wikipedia][2], but I wanted to calculate the letter frequency myself.
Linux provides a list of words in the `/usr/share/dict/words` file, so I already have a list of likely words to use. The `words` file contains lots of words that I want, but a few that I don't. I wanted a list of all words that weren't compound words (no hyphens or spaces) or proper nouns (no uppercase letters). To get that list, I can run the `grep` command to pull out only the lines that consist solely of lowercase letters:
```
`$ grep  '^[a-z]*$' /usr/share/dict/words`
```
This regular expression asks `grep` to match patterns that are only lowercase letters. The characters `^` and `$` in the pattern represent the start and end of the line, respectively. The `[a-z]` grouping will match only the lowercase letters **a** to **z**.
Here's a quick sample of the output:
```
$ grep  '^[a-z]*$' /usr/share/dict/words | head
a
aa
aaa
aah
aahed
aahing
aahs
aal
aalii
aaliis
```
And yes, those are all valid words. For example, "aahed" is the past tense exclamation of "aah," as in relaxation. And an "aalii" is a bushy tropical shrub.
Now I just need to write a `gawk` script to do the work of counting the letters in each word, and then print the relative frequency of each letter it finds.
### Counting letters
One way to count letters in `gawk` is to iterate through each character in each input line and count occurrences of each letter **a** to **z**. The `substr` function will return a substring of a given length, such as a single letter, from a larger string. For example, this code example will evaluate each character `c` from the input:
```
{
    len = length($0); for (i = 1; i <= len; i++) {
        c = substr($0, i, 1);
    }
}
```
If I start with a global string `LETTERS` that contains the alphabet, I can use the `index` function to find the location of a single letter in the alphabet. I'll expand the `gawk` code example to evaluate only the letters **a** to **z** in the input:
```
BEGIN { LETTERS = "abcdefghijklmnopqrstuvwxyz" }
 
{
    len = length($0); for (i = 1; i <= len; i++) {
        c = substr($0, i, 1);
        ltr = index(LETTERS, c);
    }
}
```
Note that the index function returns the first occurrence of the letter from the `LETTERS` string, starting with 1 at the first letter, or zero if not found. If I have an array that is 26 elements long, I can use the array to count the occurrences of each letter. I'll add this to my code example to increment (using `++`) the count for each letter as it appears in the input:
```
BEGIN { LETTERS = "abcdefghijklmnopqrstuvwxyz" }
 
{
    len = length($0); for (i = 1; i <= len; i++) {
        c = substr($0, i, 1);
        ltr = index(LETTERS, c);
 
        if (ltr > 0) {
            ++count[ltr];
        }
    }
}
```
### Printing relative frequency
After the `gawk` script counts all the letters, I want to print the frequency of each letter it finds. I am not interested in the total number of each letter from the input, but rather the _relative frequency_ of each letter. The relative frequency scales the counts so that the letter with the fewest occurrences (such as the letter **q**) is set to 1, and other letters are relative to that.
I'll start with the count for the letter **a**, then compare that value to the counts for each of the other letters **b** to **z**:
```
END {
    min = count[1]; for (ltr = 2; ltr <= 26; ltr++) {
        if (count[ltr] < min) {
            min = count[ltr];
        }
    }
}
```
At the end of that loop, the variable `min` contains the minimum count for any letter. I can use that to provide a scale for the counts to print the relative frequency of each letter. For example, if the letter with the lowest occurrence is **q**, then `min` will be equal to the **q** count.
Then I loop through each letter and print it with its relative frequency. I divide each count by `min` to print the relative frequency, which means the letter with the lowest count will be printed with a relative frequency of 1. If another letter appears twice as often as the lowest count, that letter will have a relative frequency of 2. I'm only interested in integer values here, so 2.1 and 2.9 are the same as 2 for my purposes:
```
END {
    min = count[1]; for (ltr = 2; ltr <= 26; ltr++) {
        if (count[ltr] < min) {
            min = count[ltr];
        }
    }
 
    for (ltr = 1; ltr <= 26; ltr++) {
        print substr(LETTERS, ltr, 1), int(count[ltr] / min);
    }
}
```
### Putting it all together
Now I have a `gawk` script that can count the relative frequency of letters in its input:
```
#!/usr/bin/gawk -f
 
# only count a-z, ignore A-Z and any other characters
 
BEGIN { LETTERS = "abcdefghijklmnopqrstuvwxyz" }
 
{
    len = length($0); for (i = 1; i <= len; i++) {
        c = substr($0, i, 1);
        ltr = index(LETTERS, c);
 
        if (ltr > 0) {
            ++count[ltr];
        }
    }
}
 
# print relative frequency of each letter
   
END {
    min = count[1]; for (ltr = 2; ltr <= 26; ltr++) {
        if (count[ltr] < min) {
            min = count[ltr];
        }
    }
 
    for (ltr = 1; ltr <= 26; ltr++) {
        print substr(LETTERS, ltr, 1), int(count[ltr] / min);
    }
}
```
I'll save that to a file called `letter-freq.awk` so that I can use it more easily from the command line.
If you prefer, you can also use `chmod +x` to make the file executable on its own. The `#!/usr/bin/gawk -f` on the first line means Linux will run it as a script using the `/usr/bin/gawk` program. And because the `gawk` command line uses `-f` to indicate which file it should use as a script, you need that hanging `-f` so that executing `letter-freq.awk` at the shell will be properly interpreted as running `/usr/bin/gawk -f letter-freq.awk` instead.
I can test the script with a few simple inputs. For example, if I feed the alphabet into my `gawk` script, each letter should have a relative frequency of 1:
```
$ echo abcdefghijklmnopqrstuvwxyz | gawk -f letter-freq.awk
a 1
b 1
c 1
d 1
e 1
f 1
g 1
h 1
i 1
j 1
k 1
l 1
m 1
n 1
o 1
p 1
q 1
r 1
s 1
t 1
u 1
v 1
w 1
x 1
y 1
z 1
```
Repeating that example but adding an extra instance of the letter **e** will print the letter **e** with a relative frequency of 2 and every other letter as 1:
```
$ echo abcdeefghijklmnopqrstuvwxyz | gawk -f letter-freq.awk
a 1
b 1
c 1
d 1
e 2
f 1
g 1
h 1
i 1
j 1
k 1
l 1
m 1
n 1
o 1
p 1
q 1
r 1
s 1
t 1
u 1
v 1
w 1
x 1
y 1
z 1
```
And now I can take the big step! I'll use the `grep` command with the `/usr/share/dict/words` file and identify the letter frequency for all words spelled entirely with lowercase letters:
```
$ grep  '^[a-z]*$' /usr/share/dict/words | gawk -f letter-freq.awk
a 53
b 12
c 28
d 21
e 72
f 7
g 15
h 17
i 58
j 1
k 5
l 36
m 19
n 47
o 47
p 21
q 1
r 46
s 48
t 44
u 25
v 6
w 4
x 1
y 13
z 2
```
Of all the lowercase words in the `/usr/share/dict/words` file, the letters **j**, **q**, and **x** occur least frequently. The letter **z** is also pretty rare. Not surprisingly, the letter **e** is the most frequently used.
--------------------------------------------------------------------------------
via: https://opensource.com/article/21/4/gawk-letter-game
作者:[Jim Hall][a]
选题:[lujun9972][b]
译者:[译者ID](https://github.com/译者ID)
校对:[校对者ID](https://github.com/校对者ID)
本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出
[a]: https://opensource.com/users/jim-hall
[b]: https://github.com/lujun9972
[1]: https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/osdc-docdish-typewriterkeys-3.png?itok=NyBwMdK_ (Typewriter keys in multicolor)
[2]: https://en.wikipedia.org/wiki/Letter_frequency