mirror of
https://github.com/LCTT/TranslateProject.git
synced 2025-01-25 23:11:02 +08:00
translated over: disylee <disylee@hotmail.com>
parent
1ebe99e0fd
commit
6dff8e0752
@@ -1,117 +0,0 @@
Claimed for translation by disylee~~
Top 10! Fun On The Command Line
================================================================================
**If you just love making 'top 10'-type lists but are a little embarrassed to say so, tell people you're passionate about data exploration. To impress them even more, explain that you do your data exploration on the command line. But don't ruin the impression by telling them how easy that is!**
In this article I'll do some data exploration with basic GNU/Linux tools and 'one-column tables', by which I mean simple lists. For more information on the commands used here, see their Linux 'man' pages, or ask for an explanation in the 'Comments' section.
### Passwords ###
The first list to explore is Mark Burnett's 2011 compilation of the [10000 most commonly used passwords][1]. The list is ordered most-frequent-first, and is one source of the widely known factoid that 'password' is the most commonly used password, with '123456' in second place. Here I've put the list in a file called passwords, and used the head command to show the first 10 lines:
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/1.png)
(Burnett explains how he collects his passwords [here][2]. Note that he converted all uppercase letters to lowercase in his list.)
OK, so 'password' is top of the Burnett list. What about individual digits?
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/2.png)
Interesting. The digit '1' appears in the password list more than twice as often as the next most-used digit, '2', and the 10 digits are in numerical as well as popularity order, except for 0 and 9. And the top 10 letters?
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/3.png)
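
The commands behind the two screenshots above are not reproduced in the text, so what follows is only a sketch of one way to get such counts with `grep -o`, `sort` and `uniq -c`. The function name `char_freq` and the four-line sample list are assumptions made for illustration; with the real data you would pass the `passwords` file instead.

```shell
#!/usr/bin/env bash
# Sketch only: rank the characters of one class (digits, letters, ...) in a list.
# char_freq is a hypothetical helper name, not taken from the article.
char_freq() {
    # $1 = bracket expression such as '[0-9]', $2 = one-column list file
    # grep -o prints every matching character on its own line,
    # sort | uniq -c counts them, sort -nr puts the biggest count first.
    grep -o "$1" "$2" | sort | uniq -c | sort -nr | head -n 10
}

# Tiny stand-in for the 'passwords' file (made-up sample data)
sample=$(mktemp)
printf '%s\n' password 123456 abc123 letmein1 > "$sample"

char_freq '[0-9]' "$sample"   # digit ranking
char_freq '[a-z]' "$sample"   # letter ranking
rm -f "$sample"
```

Because the list is already lowercase, `[a-z]` is enough for the letters; a mixed-case list would need a `tr` step first.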
The most frequent letters in the passwords file are EARONISTLC. That's not too far off EAIRTONSLC, which is the frequency pattern in at least [one published table][3] of letter usage in common English words. Does this mean that most passwords are actually common English words, maybe with a few digits thrown in?
To find out, I'll first convert passwords to a list of letters-only strings, then see how many of those strings are in an English dictionary.
First I'll delete all the digits in passwords with a **sed** command, then delete all the punctuation marks, then all the blank lines. This creates a list of letters-only passwords. Then I'll prune that list with **sort** and **uniq** to get rid of any duplicates. (For example, 'abc1234def' and 'abc1!2!3!def!' both reduce to 'abcdef'.) According to the wc command, my pruning reduces the 10000 passwords to 8583 letters-only strings:
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/4.png)
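
The exact **sed** invocation is only visible in the screenshot, so here is a minimal sketch of the pruning steps just described: delete digits, delete punctuation, delete blank lines, then **sort** and **uniq**. The helper name `letters_only` and the sample strings are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: reduce a password list to unique letters-only strings.
# letters_only is a hypothetical helper name.
letters_only() {
    # 1st expression deletes digits, 2nd deletes punctuation,
    # 3rd drops lines that are now empty; sort | uniq removes duplicates.
    sed -e 's/[0-9]//g' -e 's/[[:punct:]]//g' -e '/^$/d' "$1" | sort | uniq
}

# Made-up sample data, including the two variants mentioned in the text
sample=$(mktemp)
printf '%s\n' 'abc1234def' 'abc1!2!3!def!' '123456' 'password' > "$sample"

letters_only "$sample"           # both 'abc...def' variants collapse to one line
letters_only "$sample" | wc -l   # how many letters-only strings remain
rm -f "$sample"
```

A digits-only password such as '123456' becomes an empty line after the first two steps, which is why the blank-line deletion is needed.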
For a handy English dictionary I'll use the file `/usr/share/dict/american-english`, which came with my Debian Linux distribution. It contains 99171 words. I'll first convert this wordlist to lowercase-only with the **tr** command, then delete any duplicate entries with **sort** and **uniq** (like 'A' and 'a' both becoming 'a'). That reduces the wordlist to 97723 entries:
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/5.png)
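
As a rough sketch of that dictionary clean-up (the screenshot shows the real command; the helper name `lowercase_unique` and the sample words below are assumptions), **tr** can map uppercase to lowercase before **sort** and **uniq** remove the duplicates this creates:

```shell
#!/usr/bin/env bash
# Sketch only: lowercase a wordlist and drop the duplicates that result.
# lowercase_unique is a hypothetical helper name.
lowercase_unique() {
    # tr reads stdin, so the file is redirected in
    tr '[:upper:]' '[:lower:]' < "$1" | sort | uniq
}

# Made-up sample wordlist: 'A'/'a' and 'Apple'/'apple' should each merge
sample=$(mktemp)
printf '%s\n' A a Apple apple zebra > "$sample"

lowercase_unique "$sample"
lowercase_unique "$sample" | wc -l
rm -f "$sample"
```

On a Debian-style system the same function could be pointed at `/usr/share/dict/american-english` and its output saved to a file for the comparison step.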
I can now ask the comm command with the '-23' option to compare the two lists and report just the words in the letters-only file that are not found in the dictionary:
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/6.png)
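
A minimal sketch of that comparison, assuming both lists are already sorted in the same collation order (which **comm** requires). The helper name `not_in_dictionary` and the sample words are made up for illustration; the arithmetic at the end uses the article's own totals.

```shell
#!/usr/bin/env bash
# Sketch only: list strings that are in the first sorted list but not the second.
# Using one locale for sort and comm keeps their idea of "sorted" consistent.
export LC_ALL=C

# not_in_dictionary is a hypothetical helper name.
not_in_dictionary() {
    # -2 hides lines unique to the second file, -3 hides lines common to both,
    # so only lines unique to the first file are printed.
    comm -23 "$1" "$2"
}

letters=$(mktemp); dict=$(mktemp)
printf '%s\n' password lalala abcdef | sort > "$letters"
printf '%s\n' zebra password apple   | sort > "$dict"

not_in_dictionary "$letters" "$dict"

# Share of letters-only strings that were found in the dictionary,
# from the totals quoted in the article (integer division)
echo $(( (8583 - 3137) * 100 / 8583 ))   # prints 63
rm -f "$letters" "$dict"
```

Piping the `comm` output through `wc -l` gives the count of strings missing from the dictionary.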
The total is 3137, so at least 8583 - 3137 = 5446 'core' passwords in Burnett's lowercase-only list (about 63%) are either plain English words, or plain English words with some digits or punctuation marks added. I wrote 'at least' because a big proportion of the 3137 strings are only slight modifications of plain English words or names, or are words or names missing from the /usr/share dictionary. Among the LA's, for example, are 'labtec', 'ladyboy', 'lakeside', 'lalakers', 'lalala', 'laserjet', 'lasvegas', 'lavalamp' and 'lawman'.
### Placenames ###
In a previous [Linux Rain article][4], I described how I built a table of Australian placenames with more than 370 000 entries. Using it, I can now answer vital questions like 'Is Round Hill the most popular name for hills in Australia?' and 'Is Sandy Beach tops for beaches, and Rocky Creek for creeks?'
The placename field in the gazetteer table is number 2, so here goes:
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/7.png)
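
The screenshot holds the actual commands; a sketch of the same idea follows. It assumes the gazetteer is tab-separated (the text only says the placename is field 2), and the helper name `top_names`, the file name `gazetteer` and the sample rows are all hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: top 10 placenames matching a pattern in a tab-separated table.
# top_names is a hypothetical helper name.
top_names() {
    # $1 = table file, $2 = grep pattern applied to field 2
    # cut -f2 uses a tab as its default field delimiter
    cut -f2 "$1" | grep "$2" | sort | uniq -c | sort -nr | head -n 10
}

# Made-up sample rows: id <tab> placename
sample=$(mktemp)
printf '1\tRound Hill\n2\tRed Hill\n3\tRed Hill\n4\tSandy Beach\n5\tBald Hill\n' > "$sample"

top_names "$sample" 'Hill$'    # names ending in 'Hill'
top_names "$sample" 'Beach$'   # names ending in 'Beach'
rm -f "$sample"
```

In an interactive Bash session, after running `top_names gazetteer 'Hill$'`, typing `^Hill^Beach` would rerun the command with 'Beach' in place of 'Hill'.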
Wow. I wasn't even close. (But note how I saved typing by using the **^string1^string2** command. It repeats the last command, but substitutes string2 for string1. Wonderful BASH trick!)
Another burning question is how many placenames there are with 'Mile' in them, like 'Six Mile Creek', and how they rank:
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/8.png)
I've noticed a lot of Dead Horse Creeks in my Australian travels, and so has the gazetteer:
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/9.png)
### Species ###
The third list to explore comes from a table I published this year of new Australian insect species named in the period 1961-2010. From the table I've pulled out all the 'species epithets', which are the second parts of genus-species combinations like Homo sapiens (you and me) and Apis mellifera (European honeybee).
(Tech note: The insects table, which is available from the open data Zenodo repository at [https://zenodo.org/record/10481][5], includes subspecies. For my 'top 10' exercise I first isolated all the unique genus-species combinations, to avoid duplication from subspecies like Apis mellifera iberica, Apis mellifera intermissa, etc. The final species file has 18155 species epithets.)
Most people who make jokes about scientific names use the '-us' ending, as in 'Biggus buggus'. What about entomologists? There are a couple of good, command-line ways to get the last 2 letters of a string, and here I've used both:
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/10.png)
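
The two methods used in the screenshot are not named in the text, so the pair below is only a guess at two common options: `grep -o` with an end-of-line anchor, and a **sed** substitution. The helper names and the sample epithets are assumptions, and both helpers assume every string has at least two characters.

```shell
#!/usr/bin/env bash
# Sketch only: two ways to pull the last 2 characters from each line.
# last2_grep and last2_sed are hypothetical helper names.
last2_grep() {
    # '..$' matches any two characters at the end of the line; -o prints only the match
    grep -o '..$' "$1"
}
last2_sed() {
    # capture the final two characters and replace the whole line with them
    sed 's/.*\(..\)$/\1/' "$1"
}

# Made-up sample of species epithets
sample=$(mktemp)
printf '%s\n' sapiens mellifera australis commoni tasmanicus occidentalis > "$sample"

last2_grep "$sample" | sort | uniq -c | sort -nr | head -n 10
last2_sed  "$sample" | sort | uniq -c | sort -nr | head -n 10
rm -f "$sample"
```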
Yep, entomologists prefer '-us', too. Next, I wonder how many species are named for my home State of Tasmania? (Below I ask head for the first 100 lines to make sure I get all the 'tasman' combinations.)
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/11.png)
How about Queensland?
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/12.png)
And generally speaking, what are the top 10 names in that insect species list?
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/13.png)
Hmm. Apart from the obvious 'australis' and 'australiensis', and the geographical 'occidentalis' (of the west), the other 7 epithets in the 10-most-popular list have been created by entomologists to honour other entomologists. (The epithet 'commoni' honours the Australian butterfly and moth specialist Ian F.B. Common, 1917-2006.)
### Speechifying ###
The commands used above work on simple lists. To make a simple list out of a block of text, the command line is again your friend. For example, I've saved a rather filibustery [speech][6] in the Australian Senate on 16 July 2014 as the text file hansard. To split hansard into a list of words:
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/14.png)
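
One common way to do such a split, offered here as a sketch rather than as the command in the screenshot, is `tr -cs`, which turns every run of non-letters into a single newline. The helper name `split_words` and the sample sentence are assumptions.

```shell
#!/usr/bin/env bash
# Sketch only: split a block of text into one word per line.
# split_words is a hypothetical helper name.
split_words() {
    # -c: operate on the complement of the set (everything that is not a letter)
    # -s: squeeze each run of resulting newlines into one
    tr -cs 'A-Za-z' '\n' < "$1"
}

# Made-up sample text standing in for the 'hansard' file
sample=$(mktemp)
printf '%s\n' 'Mr President, I rise to speak. I rise again.' > "$sample"

split_words "$sample"
rm -f "$sample"
```

This simple character set also splits on apostrophes and hyphens, so "don't" becomes 'don' and 't'; adding those characters to the set would keep such words whole.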
And to look at word frequency in the speech:
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/15.png)
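
A sketch of the ranking step, assuming a one-word-per-line list on standard input. Folding case before counting is an assumption here (the screenshot may not do it), and `word_freq` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: rank words by frequency from a one-word-per-line list on stdin.
# word_freq is a hypothetical helper name; $1 = how many to show (default 10).
word_freq() {
    # lowercase so 'The' and 'the' count together, then count and rank
    tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -nr | head -n "${1:-10}"
}

# Made-up sample word list
printf '%s\n' the The senate the bill Senate | word_freq
```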
### Coming soon... ###
Doing 'top 10' and other rankings from multi-column tables requires a few more command-line tools. I'll demonstrate their use in a future article.
--------------------------------------------------------------------------------
via: http://thelinuxrain.com/articles/top-10-fun-on-the-command-line
Original author: Bob Mesibov (Bob Mesibov is Tasmanian, retired and a keen Linux tinkerer.)
Translator: [translator ID](https://github.com/译者ID) Proofreader: [proofreader ID](https://github.com/校对者ID)
This article was originally translated by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](http://linux.cn/)
[1]:https://xato.net/passwords/more-top-worst-passwords/#.U8eD13AvDy0
[2]:https://xato.net/passwords/how-i-collect-passwords/#.U8eEdnAvDy0
[3]:http://www.rinkworks.com/words/letterfreq.shtml
[4]:http://www.thelinuxrain.com/articles/building-a-gazetteer-table-from-kml-files
[5]:https://zenodo.org/record/10481
[6]:http://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;db=CHAMBER;id=chamber%2Fhansards%2F232fa1a8-d7e8-4b22-9018-1a99b5a96812%2F0025;query=Id%3A%22chamber%2Fhansards%2F232fa1a8-d7e8-4b22-9018-1a99b5a96812%2F0000%22
126
translated/tech/20140723 Top 10 Fun On The Command Line.md
Normal file
@@ -0,0 +1,126 @@
Top 10! Fun On The Command Line
================================================================================

**If you just love making 'top 10'-type lists but are a little embarrassed to say so, tell people you're passionate about data exploration. To impress them even more, explain that you do your data exploration on the command line. But don't ruin the impression by telling them how easy that is!**

In this article I'll do some data exploration with basic GNU/Linux tools and 'one-column tables', by which I mean simple lists. For more information on the commands used here, see their Linux 'man' pages, or ask for an explanation in the 'Comments' section.
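
### Passwords ###

The first list to explore is Mark Burnett's 2011 compilation of the [10000 most commonly used passwords][1]. The list is ordered most-frequent-first, and is one source of the widely known factoid that 'password' is the most commonly used password, with '123456' in second place. Here I've put the list in a file called passwords, and used the head command to show the first 10 lines:

![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/1.png)

(Burnett explains how he collects his passwords [here][2]. Note that he converted all uppercase letters to lowercase in his list.)

OK, so 'password' is top of the Burnett list. What about individual digits?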
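
![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/2.png)

Interesting. The digit '1' appears in the password list more than twice as often as the next most-used digit, '2', and the 10 digits are in numerical as well as popularity order, except for 0 and 9. And the top 10 letters?

![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/3.png)

The most frequent letters in the passwords file are EARONISTLC. That's not too far off EAIRTONSLC, which is the frequency pattern in at least [one published table][3] of letter usage in common English words. Does this mean that most passwords are actually common English words, maybe with a few digits thrown in?

To find out, I'll first convert passwords to a list of letters-only strings, then see how many of those strings are in an English dictionary.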
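
First I'll delete all the digits in passwords with a **sed** command, then delete all the punctuation marks, then all the blank lines. This creates a list of letters-only passwords. Then I'll prune that list with **sort** and **uniq** to get rid of any duplicates. (For example, 'abc1234def' and 'abc1!2!3!def!' both reduce to 'abcdef'.) According to the wc command, my pruning reduces the 10000 passwords to 8583 letters-only strings:

![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/4.png)

For a handy English dictionary I'll use the file `/usr/share/dict/american-english`, which came with my Debian Linux distribution. It contains 99171 words. I'll first convert this wordlist to lowercase-only with the **tr** command, then delete any duplicate entries with **sort** and **uniq** (like 'A' and 'a' both becoming 'a'). That reduces the wordlist to 97723 entries:

![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/5.png)

I can now ask the comm command with the '-23' option to compare the two lists and report just the words in the letters-only file that are not found in the dictionary:

![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/6.png)

The total is 3137, so at least 8583 - 3137 = 5446 'core' passwords in Burnett's lowercase-only list (about 63%) are either plain English words, or plain English words with some digits or punctuation marks added. I wrote 'at least' because a big proportion of the 3137 strings are only slight modifications of plain English words or names, or are words or names missing from the /usr/share dictionary. Among the LA's, for example, are 'labtec', 'ladyboy', 'lakeside', 'lalakers', 'lalala', 'laserjet', 'lasvegas', 'lavalamp' and 'lawman'.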

### Placenames ###

In a previous [Linux Rain article][4], I described how I built a table of Australian placenames with more than 370 000 entries. Using it, I can now answer vital questions like 'Is Round Hill the most popular name for hills in Australia?' and 'Is Sandy Beach tops for beaches, and Rocky Creek for creeks?'

The placename field in the gazetteer table is number 2, so here goes:

![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/7.png)

Wow. I wasn't even close. (But note how I saved typing by using the **^string1^string2** command. It repeats the last command, but substitutes string2 for string1. Wonderful BASH trick!)

Another burning question is how many placenames there are with 'Mile' in them, like 'Six Mile Creek', and how they rank:

![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/8.png)

I've noticed a lot of Dead Horse Creeks in my Australian travels, and so has the gazetteer:

![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/9.png)
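
### Species ###

The third list to explore comes from a table I published this year of new Australian insect species named in the period 1961-2010. From the table I've pulled out all the 'species epithets', which are the second parts of genus-species combinations like Homo sapiens (you and me) and Apis mellifera (European honeybee).

(Tech note: The insects table, which is available from the open data Zenodo repository at [https://zenodo.org/record/10481][5], includes subspecies. For my 'top 10' exercise I first isolated all the unique genus-species combinations, to avoid duplication from subspecies like Apis mellifera iberica, Apis mellifera intermissa, etc. The final species file has 18155 species epithets.)

Most people who make jokes about scientific names use the '-us' ending, as in 'Biggus buggus'. What about entomologists? There are a couple of good, command-line ways to get the last 2 letters of a string, and here I've used both:

![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/10.png)

Yep, entomologists prefer '-us', too. Next, I wonder how many species are named for my home State of Tasmania? (Below I ask head for the first 100 lines to make sure I get all the 'tasman' combinations.)

![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/11.png)

How about Queensland?

![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/12.png)

And generally speaking, what are the top 10 names in that insect species list?

![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/13.png)

Hmm. Apart from the obvious 'australis' and 'australiensis', and the geographical 'occidentalis' (of the west), the other 7 epithets in the 10-most-popular list have been created by entomologists to honour other entomologists. (The epithet 'commoni' honours the Australian butterfly and moth specialist Ian F.B. Common, 1917-2006.)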

### Speechifying ###

The commands used above work on simple lists. To make a simple list out of a block of text, the command line is again your friend. For example, I've saved a rather filibustery [speech][6] in the Australian Senate on 16 July 2014 as the text file hansard. To split hansard into a list of words:

![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/14.png)

And to look at word frequency in the speech:

![](http://thelinuxrain.com/content/01-articles/67-top-10-fun-on-the-command-line/15.png)

### Coming soon... ###

Doing 'top 10' and other rankings from multi-column tables requires a few more command-line tools. I'll demonstrate their use in a future article.

--------------------------------------------------------------------------------

via: http://thelinuxrain.com/articles/top-10-fun-on-the-command-line

Original author: Bob Mesibov (Bob Mesibov is Tasmanian, retired and a keen Linux tinkerer.)

Translator: [disylee](https://github.com/译者ID) Proofreader: [proofreader ID](https://github.com/校对者ID)

This article was originally translated by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](http://linux.cn/)

[1]:https://xato.net/passwords/more-top-worst-passwords/#.U8eD13AvDy0
[2]:https://xato.net/passwords/how-i-collect-passwords/#.U8eEdnAvDy0
[3]:http://www.rinkworks.com/words/letterfreq.shtml
[4]:http://www.thelinuxrain.com/articles/building-a-gazetteer-table-from-kml-files
[5]:https://zenodo.org/record/10481
[6]:http://parlinfo.aph.gov.au/parlInfo/search/display/display.w3p;db=CHAMBER;id=chamber%2Fhansards%2F232fa1a8-d7e8-4b22-9018-1a99b5a96812%2F0025;query=Id%3A%22chamber%2Fhansards%2F232fa1a8-d7e8-4b22-9018-1a99b5a96812%2F0000%22