translated over: disylee <disylee@hotmail.com>

disylee 2014-08-11 00:40:15 +08:00
parent 1ebe99e0fd
commit 6dff8e0752
2 changed files with 126 additions and 117 deletions


@ -1,117 +0,0 @@
Claimed for translation by disylee~~
Top 10! Fun On The Command Line
================================================================================
**If you just love making 'top 10'-type lists but are a little embarrassed to say so, tell people you're passionate about data exploration. To impress them even more, explain that you do your data exploration on the command line. But don't ruin the impression by telling them how easy that is!**
In this article I'll do some data exploration with basic GNU/Linux tools and 'one-column tables', by which I mean simple lists. For more information on the commands used here, see their Linux 'man' pages, or ask for an explanation in the 'Comments' section.
### Passwords ###
The first list to explore is Mark Burnett's 2011 compilation of the [10000 most commonly used passwords][1]. The list is ordered most-frequent-first, and is one source of the widely known factoid that 'password' is the most commonly used password, with '123456' in second place. Here I've put the list in a file called passwords, and used the head command to show the first 10 lines:
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/1.png)
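The screenshot is not reproduced as text here, so the following is only a minimal sketch of the step just described. The file name `passwords` comes from the article; the file contents are invented placeholders, apart from the two top entries the article names.

```shell
# Scratch directory, so nothing real gets overwritten.
cd "$(mktemp -d)" || exit 1

# Invented stand-in for the real list: one password per line, most frequent first.
printf '%s\n' password 123456 aaa bbb ccc ddd eee fff ggg hhh iii jjj > passwords

# head prints the first 10 lines by default; -n 10 just makes that explicit.
top10=$(head -n 10 passwords)
printf '%s\n' "$top10"
```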
(Burnett explains how he collects his passwords [here][2]. Note that he converted all uppercase letters to lowercase in his list.)
OK, so 'password' is top of the Burnett list. What about individual digits?
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/2.png)
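One plausible way to rank the digits, offered as a sketch rather than as the exact command in the screenshot: split out every digit with `grep -o`, then count with `sort` and `uniq -c`. The sample passwords below are invented.

```shell
cd "$(mktemp -d)" || exit 1
export LC_ALL=C   # byte-order sorting keeps the results predictable

# Invented sample standing in for the real passwords file.
printf '%s\n' abc123 111111 pass1 2000 letmein 12345 > passwords

# grep -o prints every single digit on its own line, sort groups them,
# uniq -c counts each group, and sort -nr puts the biggest count first.
digit_counts=$(grep -o '[0-9]' passwords | sort | uniq -c | sort -nr)
printf '%s\n' "$digit_counts"
```

The same `sort | uniq -c | sort -nr` chain is the core of almost every ranking in this article; only the step that produces the one-column list changes.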
Interesting. The digit '1' appears in the password list more than twice as often as the next most-used digit, '2', and the 10 digits are in numerical as well as popularity order, except for 0 and 9. And the top 10 letters?
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/3.png)
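The letter ranking can be sketched the same way, assuming (as the article notes) that the list is already lowercase. The sample data is invented and the command is a guess at the approach, not a transcription of the screenshot.

```shell
cd "$(mktemp -d)" || exit 1
export LC_ALL=C   # makes the [a-z] range and the sort order predictable

# Invented sample standing in for the real passwords file.
printf '%s\n' eagle eel tree 123abc > passwords

# One letter per line, then count and rank; head keeps the top 10.
letter_counts=$(grep -o '[a-z]' passwords | sort | uniq -c | sort -nr | head -n 10)
printf '%s\n' "$letter_counts"
```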
The most frequent letters in the passwords file are EARONISTLC. That's not too far off EAIRTONSLC, which is the frequency pattern in at least [one published table][3] of letter usage in common English words. Does this mean that most passwords are actually common English words, maybe with a few digits thrown in?
To find out, I'll first convert passwords to a list of letters-only strings, then see how many of those strings are in an English dictionary.
First I'll delete all the digits in passwords with a **sed** command, then delete all the punctuation marks, then all the blank lines. This creates a list of letters-only passwords. Then I'll prune that list with **sort** and **uniq** to get rid of any duplicates. (For example, 'abc1234def' and 'abc1!2!3!def!' both reduce to 'abcdef'.) According to the wc command, my pruning reduces the 10000 passwords to 8583 letters-only strings:
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/4.png)
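A minimal sketch of the pruning steps described above, run on a tiny invented sample that includes the two example strings from the text. The output file name `letters_only` is an assumption; the article does not name it.

```shell
cd "$(mktemp -d)" || exit 1
export LC_ALL=C   # byte-order sorting keeps the results predictable

# Invented sample; the first two entries are the examples from the article.
printf '%s\n' 'abc1234def' 'abc1!2!3!def!' '123456' 'letmein' 'pass_word9' > passwords

# 1) delete digits, 2) delete punctuation, 3) delete lines left empty,
# then sort and uniq to drop the duplicates that the deletions create.
sed -e 's/[0-9]//g' -e 's/[[:punct:]]//g' -e '/^$/d' passwords | sort | uniq > letters_only

# wc -l reports how many letters-only strings survive.
count=$(wc -l < letters_only)
echo "$count"
```

Here '123456' disappears entirely and the two 'abc...def' variants collapse into one 'abcdef', which is the same effect that takes the real list from 10000 lines to 8583.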
For a handy English dictionary I'll use the file `/usr/share/dict/american-english`, which came with my Debian Linux distribution. It contains 99171 words. I'll first convert this wordlist to lowercase-only with the **tr** command, then delete any duplicate entries with **sort** and **uniq** (like 'A' and 'a' both becoming 'a'). That reduces the wordlist to 97723 entries:
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/5.png)
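The dictionary clean-up might look like the sketch below. To keep it runnable on systems without the Debian wordlist, it uses a small invented file named `wordlist` in place of the real dictionary, and `dict_lower` is an assumed name for the result.

```shell
cd "$(mktemp -d)" || exit 1
export LC_ALL=C   # byte-order sorting keeps the results predictable

# Invented stand-in for the system dictionary file.
printf '%s\n' A a Apple apple Zulu zebra > wordlist

# tr maps every uppercase letter to lowercase; sort and uniq then
# remove the duplicates this creates (such as 'A' and 'a').
tr '[:upper:]' '[:lower:]' < wordlist | sort | uniq > dict_lower

dict_count=$(wc -l < dict_lower)
echo "$dict_count"
```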
I can now ask the comm command with the '-23' option to compare the two lists and report just the words in the letters-only file that are not found in the dictionary:
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/6.png)
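As a sketch of the comparison: `comm` expects both inputs to be sorted the same way, and `-23` suppresses column 2 (lines only in the second file) and column 3 (lines in both), leaving just the lines unique to the first file. The file names and contents below are invented for illustration.

```shell
cd "$(mktemp -d)" || exit 1
export LC_ALL=C   # comm needs both files sorted in the same collation order

# Invented, already-sorted samples of the two one-column lists.
printf '%s\n' letmein monkey password qwerty > letters_only
printf '%s\n' apple monkey password zebra > dict_lower

# Keep only the strings that appear in letters_only but not in dict_lower.
not_in_dict=$(comm -23 letters_only dict_lower)
printf '%s\n' "$not_in_dict"
```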
The total is 3137, so at least 8583 - 3137 = 5446 'core' passwords in Burnett's lowercase-only list (about 63%) are either plain English words, or plain English words with some digits or punctuation marks added. I wrote 'at least' because a big proportion of the 3137 strings are only slight modifications of plain English words or names, or are words or names missing from the /usr/share dictionary. Among the strings beginning with 'la', for example, are 'labtec', 'ladyboy', 'lakeside', 'lalakers', 'lalala', 'laserjet', 'lasvegas', 'lavalamp' and 'lawman'.
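The arithmetic can be checked on the command line too. This sketch uses only the two totals quoted in the text; the variable names are arbitrary.

```shell
total=8583       # letters-only strings after pruning
unmatched=3137   # strings that comm reported as missing from the dictionary

# Shell arithmetic handles the integer subtraction.
matched=$(( total - unmatched ))

# awk handles the floating-point percentage, rounded to a whole number.
percent=$(awk -v m="$matched" -v t="$total" 'BEGIN { printf "%.0f", 100 * m / t }')

echo "$matched of $total strings matched the dictionary (about ${percent}%)"
```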
### Placenames ###
In a previous [Linux Rain article][4], I described how I built a table of Australian placenames with more than 370 000 entries. Using it, I can now answer vital questions like 'Is Round Hill the most popular name for hills in Australia?' and 'Is Sandy Beach tops for beaches, and Rocky Creek for creeks?'
The placename field in the gazetteer table is number 2, so here goes:
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/7.png)
Wow. I wasn't even close. (But note how I saved typing by using the **^string1^string2** command. It repeats the last command, but substitutes string2 for string1. Wonderful BASH trick!)
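A sketch of the placename ranking, assuming the gazetteer is tab-separated (the article only says the placename is field 2) and using invented rows. At an interactive Bash prompt, typing `^Hill^Beach` after the first pipeline reruns it with 'Beach' in place of 'Hill'; that history shortcut is normally unavailable inside scripts, so the sketch passes the word to a hypothetical helper function, `top_names`, instead.

```shell
cd "$(mktemp -d)" || exit 1
export LC_ALL=C   # byte-order sorting keeps the results predictable

# Invented two-column, tab-separated table: id, placename.
printf '1\tRound Hill\n2\tRed Hill\n3\tRed Hill\n4\tSandy Beach\n5\tLong Beach\n6\tLong Beach\n7\tLong Beach\n' > gazetteer

top_names() {
    # $1 is the feature word to rank, for example Hill or Beach.
    # cut -f2 pulls out the placename field; grep keeps names ending in that word.
    cut -f2 gazetteer | grep " $1\$" | sort | uniq -c | sort -nr | head -n 10
}

top_names Hill
top_names Beach
```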
Another burning question is how many placenames there are with 'Mile' in them, like 'Six Mile Creek', and how they rank:
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/8.png)
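For the 'Mile' question, a sketch along the same lines: `grep -w` matches 'Mile' only as a whole word, and `-c` counts matching lines. The table layout and rows are invented, and the screenshot may well use a different approach.

```shell
cd "$(mktemp -d)" || exit 1
export LC_ALL=C   # byte-order sorting keeps the results predictable

# Invented tab-separated sample: id, placename.
printf '1\tSix Mile Creek\n2\tSix Mile Creek\n3\tFive Mile Creek\n4\tTen Mile Hill\n5\tMiles Creek\n' > gazetteer

# How many placenames contain the whole word 'Mile'? ('Miles Creek' is not counted.)
mile_total=$(cut -f2 gazetteer | grep -cw 'Mile')
echo "$mile_total"

# And how do those names rank?
mile_ranked=$(cut -f2 gazetteer | grep -w 'Mile' | sort | uniq -c | sort -nr)
printf '%s\n' "$mile_ranked"
```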
I've noticed a lot of Dead Horse Creeks in my Australian travels, and so has the gazetteer:
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/9.png)
### Species ###
The third list to explore comes from a table I published this year of new Australian insect species named in the period 1961-2010. From the table I've pulled out all the 'species epithets', which are the second parts of genus-species combinations like Homo sapiens (you and me) and Apis mellifera (European honeybee).
(Tech note: The insects table, which is available from the open data Zenodo repository at [https://zenodo.org/record/10481][5], includes subspecies. For my 'top 10' exercise I first isolated all the unique genus-species combinations, to avoid duplication from subspecies like Apis mellifera iberica, Apis mellifera intermissa, etc. The final species file has 18155 species epithets.)
Most people who make jokes about scientific names use the '-us' ending, as in 'Biggus buggus'. What about entomologists? There are a couple of good, command-line ways to get the last 2 letters of a string, and here I've used both:
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/10.png)
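Two possible ways to grab the last 2 letters are sketched below; they may or may not be the two in the screenshot. The file name `species` follows the article, while the epithets in the sample are stand-ins chosen for illustration.

```shell
cd "$(mktemp -d)" || exit 1
export LC_ALL=C   # byte-order sorting keeps the results predictable

# Stand-in list of species epithets, one per line.
printf '%s\n' australis commoni niger albus fuscus rufus > species

# Way 1: grep -o prints only the part of each line matched by '..$',
# which is the final two characters.
endings_grep=$(grep -o '..$' species | sort | uniq -c | sort -nr)

# Way 2: sed replaces each whole line with its last two characters.
endings_sed=$(sed 's/.*\(..\)$/\1/' species | sort | uniq -c | sort -nr)

printf '%s\n' "$endings_grep"
```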
Yep, entomologists prefer '-us', too. Next, I wonder how many species are named for my home State of Tasmania? (Below I ask head for the first 100 lines to make sure I get all the 'tasman' combinations.)
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/11.png)
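A sketch of the 'tasman' search, with hypothetical epithets in place of the real data. Asking `head` for 100 lines is simply a generous limit, so that no 'tasman' variant is cut off.

```shell
cd "$(mktemp -d)" || exit 1
export LC_ALL=C   # byte-order sorting keeps the results predictable

# Hypothetical epithets, one per line.
printf '%s\n' tasmaniae tasmanicus tasmaniae tasmaniensis australis tasmani tasmaniae tasmanicus > species

# Keep every epithet containing 'tasman', then count and rank the variants.
tasman_ranked=$(grep 'tasman' species | sort | uniq -c | sort -nr | head -n 100)
printf '%s\n' "$tasman_ranked"
```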
How about Queensland?
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/12.png)
And generally speaking, what are the top 10 names in that insect species list?
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/13.png)
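The general 'top 10' pattern for any one-column list can be sketched as follows, again on invented sample data.

```shell
cd "$(mktemp -d)" || exit 1
export LC_ALL=C   # byte-order sorting keeps the results predictable

# Invented sample; the real file has 18155 epithets.
printf '%s\n' australis commoni australis occidentalis commoni australis > species

# sort groups identical lines, uniq -c counts each group,
# sort -nr ranks by count, and head keeps the 10 most frequent.
top_species=$(sort species | uniq -c | sort -nr | head -n 10)
printf '%s\n' "$top_species"
```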
Hmm. Apart from the obvious 'australis' and 'australiensis', and the geographical 'occidentalis' (of the west), the other 7 epithets in the 10-most-popular list have been created by entomologists to honour other entomologists. (The epithet 'commoni' honours the Australian butterfly and moth specialist Ian F.B. Common, 1917-2006.)
### Speechifying ###
The commands used above work on simple lists. To make a simple list out of a block of text, the command line is again your friend. For example, I've saved a rather filibustery [speech][6] in the Australian Senate on 16 July 2014 as the text file hansard. To split hansard into a list of words:
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/14.png)
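One common way to split text into a word list is with `tr`, sketched here on two invented sentences rather than the real speech. The output file name `words` is an assumption.

```shell
cd "$(mktemp -d)" || exit 1

# Invented text standing in for the saved speech.
printf 'The bill is bad. The bill is late.\nThe end.\n' > hansard

# -c takes the complement of the set (everything that is not a letter),
# -s squeezes runs of those characters, and each run becomes one newline.
tr -cs '[:alpha:]' '\n' < hansard > words

word_count=$(wc -l < words)
echo "$word_count"
```

This treats anything that is not a letter as a separator, so a word like "don't" would be split in two; whether that matters depends on the text.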
And to look at word frequency in the speech:
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/15.png)
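Word frequency then follows from the same counting chain. This sketch folds everything to lowercase first so that 'The' and 'the' are counted together, which is an assumption about what is wanted; the sample text is invented.

```shell
cd "$(mktemp -d)" || exit 1
export LC_ALL=C   # byte-order sorting keeps the results predictable

# Invented text standing in for the saved speech.
printf 'The bill is bad. The bill is late.\nThe end.\n' > hansard

# Split into words, fold to lowercase, then count and rank.
word_freq=$(tr -cs '[:alpha:]' '\n' < hansard | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -nr | head -n 10)
printf '%s\n' "$word_freq"
```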
### Coming soon... ###
Doing 'top 10' and other rankings from multi-column tables requires a few more command-line tools. I'll demonstrate their use in a future article.
--------------------------------------------------------------------------------
via: http://thelinuxrain.com/articles/top-10-fun-on-the-command-line
Original author: Bob Mesibov (Bob Mesibov is Tasmanian, retired and a keen Linux tinkerer.)
Translator: [translator ID](https://github.com/译者ID) Proofreader: [proofreader ID](https://github.com/校对者ID)
This article was translated by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](http://linux.cn/)
[1]:https://xato.net/passwords/more-top-worst-passwords/#.U8eD13AvDy0
[2]:https://xato.net/passwords/how-i-collect-passwords/#.U8eEdnAvDy0
[3]:http://www.rinkworks.com/words/letterfreq.shtml
[4]:http://www.thelinuxrain.com/articles/building-a-gazetteer-table-from-kml-files
[5]:https://zenodo.org/record/10481
[6]:http://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;db=CHAMBER;id=chamber%2Fhansards%2F232fa1a8-d7e8-4b22-9018-1a99b5a96812%2F0025;query=Id%3A%22chamber%2Fhansards%2F232fa1a8-d7e8-4b22-9018-1a99b5a96812%2F0000%22


@ -0,0 +1,126 @@
Top 10! Fun On The Command Line
================================================================================
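**If you just love making 'top 10'-type lists but are a little embarrassed to say so, tell people you're passionate about data exploration. To impress them even more, explain that you do your data exploration on the command line. But don't ruin the impression by telling them how easy that is!**
In this article I'll do some data exploration with basic GNU/Linux tools and 'one-column tables', by which I mean simple lists. For more information on the commands used here, see their Linux 'man' pages, or ask for an explanation in the 'Comments' section.
### Passwords ###
The first list to explore is Mark Burnett's 2011 compilation of the [10000 most commonly used passwords][1]. The list is ordered most-frequent-first, and is one source of the widely known factoid that 'password' is the most commonly used password, with '123456' in second place. Here I've put the list in a file called passwords, and used the head command to show the first 10 lines: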
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/1.png)
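(Burnett explains how he collects his passwords [here][2]. Note that he converted all uppercase letters to lowercase in his list.)
OK, so 'password' is top of the Burnett list. What about individual digits?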
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/2.png)
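Interesting. The digit '1' appears in the password list more than twice as often as the next most-used digit, '2', and the 10 digits are in numerical as well as popularity order, except for 0 and 9. And the top 10 letters?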
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/3.png)
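The most frequent letters in the passwords file are EARONISTLC. That's not too far off EAIRTONSLC, which is the frequency pattern in at least [one published table][3] of letter usage in common English words. Does this mean that most passwords are actually common English words, maybe with a few digits thrown in?
To find out, I'll first convert passwords to a list of letters-only strings, then see how many of those strings are in an English dictionary.
First I'll delete all the digits in passwords with a **sed** command, then delete all the punctuation marks, then all the blank lines. This creates a list of letters-only passwords. Then I'll prune that list with **sort** and **uniq** to get rid of any duplicates. (For example, 'abc1234def' and 'abc1!2!3!def!' both reduce to 'abcdef'.) According to the wc command, my pruning reduces the 10000 passwords to 8583 letters-only strings: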
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/4.png)
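For a handy English dictionary I'll use the file `/usr/share/dict/american-english`, which came with my Debian Linux distribution. It contains 99171 words. I'll first convert this wordlist to lowercase-only with the **tr** command, then delete any duplicate entries with **sort** and **uniq** (like 'A' and 'a' both becoming 'a'). That reduces the wordlist to 97723 entries: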
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/5.png)
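I can now ask the comm command with the '-23' option to compare the two lists and report just the words in the letters-only file that are not found in the dictionary: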
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/6.png)
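The total is 3137, so at least 8583 - 3137 = 5446 'core' passwords in Burnett's lowercase-only list (about 63%) are either plain English words, or plain English words with some digits or punctuation marks added. I wrote 'at least' because a big proportion of the 3137 strings are only slight modifications of plain English words or names, or are words or names missing from the /usr/share dictionary. Among the strings beginning with 'la', for example, are 'labtec', 'ladyboy', 'lakeside', 'lalakers', 'lalala', 'laserjet', 'lasvegas', 'lavalamp' and 'lawman'.
### Placenames ###
In a previous [Linux Rain article][4], I described how I built a table of Australian placenames with more than 370 000 entries. Using it, I can now answer vital questions like 'Is Round Hill the most popular name for hills in Australia?' and 'Is Sandy Beach tops for beaches, and Rocky Creek for creeks?'
The placename field in the gazetteer table is number 2, so here goes: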
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/7.png)
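Wow. I wasn't even close. (But note how I saved typing by using the **^string1^string2** command. It repeats the last command, but substitutes string2 for string1. Wonderful BASH trick!)
Another burning question is how many placenames there are with 'Mile' in them, like 'Six Mile Creek', and how they rank: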
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/8.png)
I've noticed a lot of Dead Horse Creeks in my Australian travels, and so has the gazetteer:
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/9.png)
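### Species ###
The third list to explore comes from a table I published this year of new Australian insect species named in the period 1961-2010. From the table I've pulled out all the 'species epithets', which are the second parts of genus-species combinations like Homo sapiens (you and me) and Apis mellifera (European honeybee).
(Tech note: The insects table, which is available from the open data Zenodo repository at [https://zenodo.org/record/10481][5], includes subspecies. For my 'top 10' exercise I first isolated all the unique genus-species combinations, to avoid duplication from subspecies like Apis mellifera iberica, Apis mellifera intermissa, etc. The final species file has 18155 species epithets.)
Most people who make jokes about scientific names use the '-us' ending, as in 'Biggus buggus'. What about entomologists? There are a couple of good, command-line ways to get the last 2 letters of a string, and here I've used both: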
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/10.png)
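Yep, entomologists prefer '-us', too. Next, I wonder how many species are named for my home State of Tasmania? (Below I ask head for the first 100 lines to make sure I get all the 'tasman' combinations.)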
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/11.png)
How about Queensland?
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/12.png)
And generally speaking, what are the top 10 names in that insect species list?
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/13.png)
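Hmm. Apart from the obvious 'australis' and 'australiensis', and the geographical 'occidentalis' (of the west), the other 7 epithets in the 10-most-popular list have been created by entomologists to honour other entomologists. (The epithet 'commoni' honours the Australian butterfly and moth specialist Ian F.B. Common, 1917-2006.)
### Speechifying ###
The commands used above work on simple lists. To make a simple list out of a block of text, the command line is again your friend. For example, I've saved a rather filibustery [speech][6] in the Australian Senate on 16 July 2014 as the text file hansard. To split hansard into a list of words: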
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/14.png)
And to look at word frequency in the speech:
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/15.png)
### Coming soon... ###
Doing 'top 10' and other rankings from multi-column tables requires a few more command-line tools. I'll demonstrate their use in a future article.
--------------------------------------------------------------------------------
via: http://thelinuxrain.com/articles/top-10-fun-on-the-command-line
Original author: Bob Mesibov (Bob Mesibov is Tasmanian, retired and a keen Linux tinkerer.)
Translator: [disylee](https://github.com/译者ID) Proofreader: [proofreader ID](https://github.com/校对者ID)
This article was translated by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](http://linux.cn/)
[1]:https://xato.net/passwords/more-top-worst-passwords/#.U8eD13AvDy0
[2]:https://xato.net/passwords/how-i-collect-passwords/#.U8eEdnAvDy0
[3]:http://www.rinkworks.com/words/letterfreq.shtml
[4]:http://www.thelinuxrain.com/articles/building-a-gazetteer-table-from-kml-files
[5]:https://zenodo.org/record/10481
[6]:http://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;db=CHAMBER;id=chamber%2Fhansards%2F232fa1a8-d7e8-4b22-9018-1a99b5a96812%2F0025;query=Id%3A%22chamber%2Fhansards%2F232fa1a8-d7e8-4b22-9018-1a99b5a96812%2F0000%22