From 7a326c87e7701a54abb761aca8d87ba91614271e Mon Sep 17 00:00:00 2001
From: Xingyu Wang
Date: Mon, 28 Sep 2020 00:03:12 +0800
Subject: [PATCH] TSL

---
 ...your awk skills with two easy tutorials.md | 287 ------------------
 ...your awk skills with two easy tutorials.md | 272 +++++++++++++++++
 2 files changed, 272 insertions(+), 287 deletions(-)
 delete mode 100644 sources/tech/20191031 Advance your awk skills with two easy tutorials.md
 create mode 100644 translated/tech/20191031 Advance your awk skills with two easy tutorials.md

diff --git a/sources/tech/20191031 Advance your awk skills with two easy tutorials.md b/sources/tech/20191031 Advance your awk skills with two easy tutorials.md
deleted file mode 100644
index 083dbbb011..0000000000
--- a/sources/tech/20191031 Advance your awk skills with two easy tutorials.md
+++ /dev/null
@@ -1,287 +0,0 @@
-[#]: collector: (lujun9972)
-[#]: translator: (wxy)
-[#]: reviewer: ( )
-[#]: publisher: ( )
-[#]: url: ( )
-[#]: subject: (Advance your awk skills with two easy tutorials)
-[#]: via: (https://opensource.com/article/19/10/advanced-awk)
-[#]: author: (Dave Neary https://opensource.com/users/dneary)
-
-Advance your awk skills with two easy tutorials
-======
-Go beyond one-line awk scripts with mail merge and word counting.
-![a checklist for a team][1]
-
-Awk is one of the oldest tools in the Unix and Linux user's toolbox. Created in the 1970s by Alfred Aho, Peter Weinberger, and Brian Kernighan (the A, W, and K of the tool's name), awk was created for complex processing of text streams. It is a companion tool to sed, the stream editor, which is designed for line-by-line processing of text files. Awk allows more complex structured programs and is a complete programming language.
-
-This article will explain how to use awk for more structured and complex tasks, including a simple mail merge application.
-
-### Awk program structure
-
-An awk script is made up of functional blocks surrounded by **{}** (curly brackets). There are two special function blocks, **BEGIN** and **END**, that execute before processing the first line of the input stream and after the last line is processed. In between, blocks have the format:
-
-
-```
-pattern { action statements }
-```
-
-Each block executes when the line in the input buffer matches the pattern. If no pattern is included, the function block executes on every line of the input stream.
-
-Also, the following syntax can be used to define functions in awk that can be called from any block:
-
-
-```
-function name(parameter list) { statements }
-```
-
-This combination of pattern-matching blocks and functions allows the developer to structure awk programs for reuse and readability.
-
-### How awk processes text streams
-
-Awk reads text from its input file or stream one line at a time and uses a field separator to parse it into a number of fields. In awk terminology, the current buffer is a _record_. There are a number of special variables that affect how awk reads and processes a file:
-
-  * **FS** (field separator): By default, this is any whitespace (spaces or tabs)
-  * **RS** (record separator): By default, a newline (**\n**)
-  * **NF** (number of fields): When awk parses a line, this variable is set to the number of fields that have been parsed
-  * **$0:** The current record
-  * **$1, $2, $3, etc.:** The first, second, third, etc. field from the current record
-  * **NR** (number of records): The number of records that have been parsed so far by the awk script
-
-
-
-There are many other variables that affect awk's behavior, but this is enough to start with.
-
-### Awk one-liners
-
-For a tool so powerful, it's interesting that most of awk's usage is basic one-liners. Perhaps the most common awk program prints selected fields from an input line from a CSV file, a log file, etc. For example, the following one-liner prints a list of usernames from **/etc/passwd**:
-
-
-```
-awk -F":" '{print $1 }' /etc/passwd
-```
-
-As mentioned above, **$1** is the first field in the current record. The **-F** option sets the FS variable to the character **:**.
-
-The field separator can also be set in a BEGIN function block:
-
-
-```
-awk 'BEGIN { FS=":" } {print $1 }' /etc/passwd
-```
-
-In the following example, every user whose shell is not **/sbin/nologin** can be printed by preceding the block with a pattern match:
-
-
-```
-awk 'BEGIN { FS=":" } ! /\/sbin\/nologin/ {print $1 }' /etc/passwd
-```
-
-### Advanced awk: Mail merge
-
-Now that you have some of the basics, try delving deeper into awk with a more structured example: creating a mail merge.
-
-A mail merge uses two files, one (called in this example **email_template.txt**) containing a template for an email you want to send:
-
-
-```
-From: Program committee <pc@event.org>
-To: {firstname} {lastname} <{email}>
-Subject: Your presentation proposal
-
-Dear {firstname},
-
-Thank you for your presentation proposal:
-  {title}
-
-We are pleased to inform you that your proposal has been successful! We
-will contact you shortly with further information about the event
-schedule.
-
-Thank you,
-The Program Committee
-```
-
-And the other is a CSV file (called **proposals.csv**) with the people you want to send the email to:
-
-
-```
-firstname,lastname,email,title
-Harry,Potter,hpotter@hogwarts.edu,"Defeating your nemesis in 3 easy steps"
-Jack,Reacher,reacher@covert.mil,"Hand-to-hand combat for beginners"
-Mickey,Mouse,mmouse@disney.com,"Surviving public speaking with a squeaky voice"
-Santa,Claus,sclaus@northpole.org,"Efficient list-making"
-```
-
-You want to read the CSV file (skipping its header line), replace the relevant fields in the template file, then write the result to a file called **acceptanceN.txt**, incrementing **N** for each line you parse.
-
-Write the awk program in a file called **mail_merge.awk**. Statements are separated by **;** (or by newlines) in awk scripts. The first task is to set the field separator variable and a couple of other variables the script needs. You also need to read and discard the first line in the CSV, or a file will be created starting with _Dear firstname_. To do this, use the special function **getline** and reset the record counter to 0 after reading it.
-
-
-```
-BEGIN {
-  FS=",";
-  template="email_template.txt";
-  output="acceptance";
-  getline;
-  NR=0;
-}
-```
-
-The main function is very straightforward: for each line processed, a variable is set for each of the fields: **firstname**, **lastname**, **email**, and **title**. The template file is read line by line, and the function **sub** is used to substitute any occurrence of the special character sequences with the value of the relevant variable. Then the line, with any substitutions made, is output to the output file.
-
-Since you are dealing with the template file and a different output file for each line, you need to clean up and close the file handles for these files before processing the next record.
-
-
-```
-{
-        # Read relevant fields from input file
-        firstname=$1;
-        lastname=$2;
-        email=$3;
-        title=$4;
-
-        # Set output filename
-        outfile=(output NR ".txt");
-
-        # Read a line from template, replace special fields, and
-        # print result to output file
-        while ( (getline ln < template) > 0 )
-        {
-                sub(/{firstname}/,firstname,ln);
-                sub(/{lastname}/,lastname,ln);
-                sub(/{email}/,email,ln);
-                sub(/{title}/,title,ln);
-                print(ln) > outfile;
-        }
-
-        # Close template and output file in advance of next record
-        close(outfile);
-        close(template);
-}
-```
-
-You're done! Run the script on the command line with:
-
-
-```
-awk -f mail_merge.awk proposals.csv
-```
-
-or
-
-
-```
-awk -f mail_merge.awk < proposals.csv
-```
-
-and you will find text files generated in the current directory.
-
-### Advanced awk: Word frequency count
-
-One of the most powerful features in awk is the associative array. In most programming languages, array entries are typically indexed by a number, but in awk, arrays are referenced by a key string. You could, for example, store an entry from the file _proposals.csv_ from the previous section in a single associative array, like this:
-
-
-```
-        proposer["firstname"]=$1;
-        proposer["lastname"]=$2;
-        proposer["email"]=$3;
-        proposer["title"]=$4;
-```
-
-This makes text processing very easy. A simple program that uses this concept is a word frequency counter. You can parse a file, break out words (ignoring punctuation) in each line, increment the counter for each word in the line, then output the 20 most frequent words in the text.
-
-First, in a file called **wordcount.awk**, set the field separator to a regular expression that includes whitespace and punctuation:
-
-
-```
-BEGIN {
-        # ignore 1 or more consecutive occurrences of the characters
-        # in the character group below
-        FS="[ .,:;()<>{}@!\"'\t]+";
-}
-```
-
-Next, the main loop function will iterate over each field, ignoring any empty fields (which happens if there is punctuation at the end of a line), and increment the word count for the words in the line.
-
-
-```
-{
-        for (i = 1; i <= NF; i++) {
-                if ($i != "") {
-                        words[$i]++;
-                }
-        }
-}
-```
-
-Finally, after the text is processed, use the END function to print the contents of the array, then use awk's capability of piping output into a shell command to do a numerical sort and print the 20 most frequently occurring words:
-
-
-```
-END {
-        sort_head = "sort -k2 -nr | head -n 20";
-        for (word in words) {
-                printf "%s\t%d\n", word, words[word] | sort_head;
-        }
-        close (sort_head);
-}
-```
-
-Running this script on an earlier draft of this article produced this output:
-
-
-```
-[dneary@dhcp-49-32.bos.redhat.com]$ awk -f wordcount.awk < awk_article.txt
-the     79
-awk     41
-a       39
-and     33
-of      32
-in      27
-to      26
-is      25
-line    23
-for     23
-will    22
-file    21
-we      16
-We      15
-with    12
-which   12
-by      12
-this    11
-output  11
-function        11
-```
-
-### What's next?
-
-If you want to learn more about awk programming, I strongly recommend the book [_Sed and awk_][8] by Dale Dougherty and Arnold Robbins.
-
-One of the keys to progressing in awk programming is mastering "extended regular expressions." Awk offers several powerful additions to the sed [regular expression][9] syntax you may already be familiar with.
-
-Another great resource for learning awk is the [GNU awk user guide][10]. It has a full reference for awk's built-in function library, as well as lots of examples of simple and complex awk scripts.
-
---------------------------------------------------------------------------------
-
-via: https://opensource.com/article/19/10/advanced-awk
-
-Author: [Dave Neary][a]
-Topic selection: [lujun9972][b]
-Translator: [translator ID](https://github.com/译者ID)
-Proofreader: [proofreader ID](https://github.com/校对者ID)
-
-This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](https://linux.cn/)
-
-[a]: https://opensource.com/users/dneary
-[b]: https://github.com/lujun9972
-[1]: https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/checklist_hands_team_collaboration.png?itok=u82QepPk (a checklist for a team)
-[2]: mailto:pc@event.org
-[3]: mailto:hpotter@hogwarts.edu
-[4]: mailto:reacher@covert.mil
-[5]: mailto:mmouse@disney.com
-[6]: mailto:sclaus@northpole.org
-[7]: mailto:dneary@dhcp-49-32.bos.redhat.com
-[8]: https://www.amazon.com/sed-awk-Dale-Dougherty/dp/1565922255/book
-[9]: https://en.wikibooks.org/wiki/Regular_Expressions/POSIX-Extended_Regular_Expressions
-[10]: https://www.gnu.org/software/gawk/manual/gawk.html
diff --git a/translated/tech/20191031 Advance your awk skills with two easy tutorials.md b/translated/tech/20191031 Advance your awk skills with two easy tutorials.md
new file mode 100644
index 0000000000..044c9c9280
--- /dev/null
+++ b/translated/tech/20191031 Advance your awk skills with two easy tutorials.md
@@ -0,0 +1,272 @@
+[#]: collector: (lujun9972)
+[#]: translator: (wxy)
+[#]: reviewer: ( )
+[#]: publisher: ( )
+[#]: url: ( )
+[#]: subject: (Advance your awk skills with two easy tutorials)
+[#]: via: (https://opensource.com/article/19/10/advanced-awk)
+[#]: author: (Dave Neary https://opensource.com/users/dneary)
+
+
+Advance your awk skills with two easy tutorials
+======
+
+> Go beyond one-line awk scripts with mail merge and word counting.
+![a checklist for a team][1]
+
+`awk` is one of the oldest tools in the Unix and Linux user's toolbox. Created in the 1970s by Alfred Aho, Peter Weinberger, and Brian Kernighan (the A, W, and K of the tool's name), `awk` was designed for complex processing of text streams. It is a companion tool to the stream editor `sed`, which is designed for line-by-line processing of text files. `awk` supports more complex structured programs and is a complete programming language.
+
+This article explains how to use `awk` for more structured and complex tasks, including a simple mail merge application.
+
+### awk program structure
+
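+An `awk` script is made up of functional blocks surrounded by `{}` (curly brackets). Two of them are special: `BEGIN` runs before the first line of the input stream is processed, and `END` runs after the last line has been processed. In between, blocks have the format:
+
+```
+pattern { action statements }
+```
+
+Each block executes when the line in the input buffer matches its pattern. If no pattern is given, the block executes on every line of the input stream.
+
+In addition, the following syntax defines functions in `awk` that can be called from any block:
+
+```
+function name(parameter list) { statements }
+```
+
+This combination of pattern-matching blocks and functions lets the developer structure `awk` programs for reuse and readability.
+
+### How awk processes text streams
+
+`awk` reads text from its input file or stream one line at a time and uses a field separator to parse it into a number of fields. In `awk` terminology, the current buffer is a *record*. A number of special variables affect how `awk` reads and processes a file:
+
+  * `FS` (field separator): By default, this is any whitespace (spaces or tabs).
+  * `RS` (record separator): By default, a newline (`\n`).
+  * `NF` (number of fields): When `awk` parses a line, this variable is set to the number of fields that have been parsed.
+  * `$0`: The current record.
+  * `$1`, `$2`, `$3`, etc.: The first, second, third, etc. field of the current record.
+  * `NR` (number of records): The number of records parsed so far by the `awk` script.
+
+Many other variables affect `awk`'s behavior, but these are enough to start with.
+
+### awk one-liners
+
+For such a powerful tool, it is interesting that most `awk` usage consists of basic one-liners. Perhaps the most common `awk` program prints selected fields from each input line of a CSV file, a log file, and so on. For example, the following one-liner prints a list of usernames from `/etc/passwd`:
+
+```
+awk -F":" '{print $1 }' /etc/passwd
+```
+
+As mentioned above, `$1` is the first field in the current record. The `-F` option sets the `FS` variable to the character `:`.
+
+The field separator can also be set in a `BEGIN` block:
+
+```
+awk 'BEGIN { FS=":" } {print $1 }' /etc/passwd
+```
+
+In the following example, preceding the block with a pattern match prints every user whose shell is not `/sbin/nologin`:
+
+```
+awk 'BEGIN { FS=":" } ! /\/sbin\/nologin/ {print $1 }' /etc/passwd
+```
+
+### Advanced awk: Mail merge
+
+Now that you have some of the basics, try delving deeper into `awk` with a more structured example: creating a mail merge.
+
+A mail merge uses two files. One (called `email_template.txt` in this example) contains a template for the email you want to send:
+
+```
+From: Program committee <pc@event.org>
+To: {firstname} {lastname} <{email}>
+Subject: Your presentation proposal
+
+Dear {firstname},
+
+Thank you for your presentation proposal:
+  {title}
+
+We are pleased to inform you that your proposal has been successful! We
+will contact you shortly with further information about the event
+schedule.
+
+Thank you,
+The Program Committee
+```
+
+The other is a CSV file (called `proposals.csv`) listing the people you want to send the email to:
+
+```
+firstname,lastname,email,title
+Harry,Potter,hpotter@hogwarts.edu,"Defeating your nemesis in 3 easy steps"
+Jack,Reacher,reacher@covert.mil,"Hand-to-hand combat for beginners"
+Mickey,Mouse,mmouse@disney.com,"Surviving public speaking with a squeaky voice"
+Santa,Claus,sclaus@northpole.org,"Efficient list-making"
+```
+
+You want to read the CSV file (skipping its header line), replace the relevant fields in the template file, and write the result to a file called `acceptanceN.txt`, incrementing `N` for each line you parse.
+
+Write the `awk` program in a file called `mail_merge.awk`. Statements in `awk` scripts are separated by `;` (or by newlines). The first task is to set the field separator variable and a couple of other variables the script needs. You also need to read and discard the first line of the CSV, or a file will be created starting with `Dear firstname`. To do this, use the special function `getline` and reset the record counter to 0 after reading it.
+
+```
+BEGIN {
+  FS=",";
+  template="email_template.txt";
+  output="acceptance";
+  getline;
+  NR=0;
+}
+```
+
+The main block is very straightforward: for each line processed, a variable is set for each of the fields: `firstname`, `lastname`, `email`, and `title`. The template file is read line by line, and the function `sub` substitutes any occurrence of the special character sequences with the value of the relevant variable. Then the line, with any substitutions made, is written to the output file.
+
+Since each record uses the template file and a different output file, you need to clean up and close the file handles for these files before processing the next record.
+
+```
+{
+        # Read relevant fields from input file
+        firstname=$1;
+        lastname=$2;
+        email=$3;
+        title=$4;
+
+        # Set output filename
+        outfile=(output NR ".txt");
+
+        # Read a line from template, replace special fields, and
+        # print result to output file
+        while ( (getline ln < template) > 0 )
+        {
+                sub(/{firstname}/,firstname,ln);
+                sub(/{lastname}/,lastname,ln);
+                sub(/{email}/,email,ln);
+                sub(/{title}/,title,ln);
+                print(ln) > outfile;
+        }
+
+        # Close template and output file in advance of next record
+        close(outfile);
+        close(template);
+}
+```
+
+You're done! Run the script on the command line with:
+
+
+```
+awk -f mail_merge.awk proposals.csv
+```
+
+or
+
+```
+awk -f mail_merge.awk < proposals.csv
+```
+
+and you will find the generated text files in the current directory.
+
+### Advanced awk: Word frequency count
+
+One of the most powerful features in `awk` is the associative array. In most programming languages, array entries are typically indexed by a number, but in `awk`, arrays are referenced by a key string. You could, for example, store an entry from the file `proposals.csv` from the previous section in a single associative array, like this:
+
+```
+        proposer["firstname"]=$1;
+        proposer["lastname"]=$2;
+        proposer["email"]=$3;
+        proposer["title"]=$4;
+```
+
+This makes text processing very easy. A simple program that uses this concept is a word frequency counter. You can parse a file, break out the words in each line (ignoring punctuation), increment the counter for each word in the line, and then output the 20 most frequent words in the text.
+
+First, in a file called `wordcount.awk`, set the field separator to a regular expression that includes whitespace and punctuation:
+
+```
+BEGIN {
+        # ignore 1 or more consecutive occurrences of the characters
+        # in the character group below
+        FS="[ .,:;()<>{}@!\"'\t]+";
+}
+```
+
+Next, the main block iterates over each field, ignoring any empty fields (which occur if there is punctuation at the end of a line), and increments the word count for each word in the line:
+
+```
+{
+        for (i = 1; i <= NF; i++) {
+                if ($i != "") {
+                        words[$i]++;
+                }
+        }
+}
+```
+
+Finally, after the text is processed, use the `END` block to print the contents of the array, then use `awk`'s ability to pipe output into a shell command to do a numerical sort and print the 20 most frequently occurring words:
+
+```
+END {
+        sort_head = "sort -k2 -nr | head -n 20";
+        for (word in words) {
+                printf "%s\t%d\n", word, words[word] | sort_head;
+        }
+        close (sort_head);
+}
+```
+
+Running this script on an earlier draft of this article produced this output:
+
+```
+[dneary@dhcp-49-32.bos.redhat.com]$ awk -f wordcount.awk < awk_article.txt
+the     79
+awk     41
+a       39
+and     33
+of      32
+in      27
+to      26
+is      25
+line    23
+for     23
+will    22
+file    21
+we      16
+We      15
+with    12
+which   12
+by      12
+this    11
+output  11
+function        11
+```
+
+### What's next?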
+
+If you want to learn more about `awk` programming, I strongly recommend the book [_Sed and awk_][8] by Dale Dougherty and Arnold Robbins.
+
+One of the keys to progressing in `awk` programming is mastering "extended regular expressions." `awk` offers several powerful additions to the sed [regular expression][9] syntax you may already be familiar with.
+
+Another great resource for learning `awk` is the [GNU awk user guide][10]. It has a full reference for `awk`'s built-in function library, as well as lots of examples of simple and complex `awk` scripts.
+
+--------------------------------------------------------------------------------
+
+via: https://opensource.com/article/19/10/advanced-awk
+
+Author: [Dave Neary][a]
+Topic selection: [lujun9972][b]
+Translator: [wxy](https://github.com/wxy)
+Proofreader: [proofreader ID](https://github.com/校对者ID)
+
+This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](https://linux.cn/)
+
+[a]: https://opensource.com/users/dneary
+[b]: https://github.com/lujun9972
+[1]: https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/checklist_hands_team_collaboration.png?itok=u82QepPk (a checklist for a team)
+[2]: mailto:pc@event.org
+[3]: mailto:hpotter@hogwarts.edu
+[4]: mailto:reacher@covert.mil
+[5]: mailto:mmouse@disney.com
+[6]: mailto:sclaus@northpole.org
+[7]: mailto:dneary@dhcp-49-32.bos.redhat.com
+[8]: https://www.amazon.com/sed-awk-Dale-Dougherty/dp/1565922255/book
+[9]: https://en.wikibooks.org/wiki/Regular_Expressions/POSIX-Extended_Regular_Expressions
+[10]: https://www.gnu.org/software/gawk/manual/gawk.html
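
A note on the word frequency example in this patch: its sample output lists `we` (16) and `We` (15) as separate entries, because `wordcount.awk` counts each field exactly as it appears. The following is a minimal sketch, not part of the article or of this patch, of a case-insensitive variant. The function name `wordfreq`, its optional count argument, and the tie-breaking `sort` keys are assumptions made for illustration; `tolower` is a standard awk built-in, and the field separator is the one used in `wordcount.awk`.

```shell
# wordfreq: hypothetical helper (not from the article). Reads text on stdin
# and prints "word<TAB>count" for the N most frequent words (default 20),
# folding case so that "We" and "we" are counted together.
wordfreq() {
    awk '
        BEGIN {
            # Same separators as wordcount.awk; \047 is the apostrophe in octal,
            # since the program itself sits inside shell single quotes.
            FS = "[ .,:;()<>{}@!\"\047\t]+"
        }
        {
            # Count every non-empty field under its lowercase form.
            for (i = 1; i <= NF; i++)
                if ($i != "")
                    words[tolower($i)]++
        }
        END {
            for (word in words)
                printf "%s\t%d\n", word, words[word]
        }
    ' | LC_ALL=C sort -t "$(printf '\t')" -k2,2nr -k1,1 | head -n "${1:-20}"
}
```

Something like `wordfreq 10 < awk_article.txt` would then list the ten most frequent words regardless of capitalization. Sorting outside awk, with the word as a secondary key, is a choice made here only to keep the order of ties predictable; the article's approach of piping from the `END` block works just as well.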