GraveAccent translated 20180422...
@ -1,526 +0,0 @@
Translating by GraveAccent - Command Line Tricks For Data Scientists • kade killary
|
||||
======
|
||||
|
||||
![](https://i.imgur.com/0mzQMcB.png)
|
||||
|
||||
For many data scientists, data manipulation begins and ends with Pandas or the Tidyverse. In theory, there is nothing wrong with this notion. It is, after all, why these tools exist in the first place. Yet, these options can often be overkill for simple tasks like delimiter conversion.

Aspiring to master the command line should be on every developer’s list, especially data scientists. Learning the ins and outs of your shell will undeniably make you more productive. Beyond that, the command line serves as a great history lesson in computing. Take awk, for instance, a data-driven scripting language. Awk first appeared in 1977 with the help of [Brian Kernighan][1], the K in the legendary [K&R book][2]. Today, roughly 40 years later, awk remains relevant, with [new books][3] still appearing every year! Thus, it’s safe to assume that an investment in command line wizardry won’t depreciate any time soon.
|
||||
|
||||
### What We’ll Cover
|
||||
|
||||
* ICONV
|
||||
* HEAD
|
||||
* TR
|
||||
* WC
|
||||
* SPLIT
|
||||
* SORT & UNIQ
|
||||
* CUT
|
||||
* PASTE
|
||||
* JOIN
|
||||
* GREP
|
||||
* SED
|
||||
* AWK
|
||||
|
||||
|
||||
|
||||
### ICONV
|
||||
|
||||
File encodings can be tricky. For the most part files these days are all UTF-8 encoded. To understand some of the magic behind UTF-8, check out this [excellent video][4]. Nonetheless, there are times where we receive a file that isn’t in this format. This can lead to some wonky attempts at swapping the encoding schema. Here, `iconv` is a life saver. Iconv is a simple program that will take text in one encoding and output the text in another.
|
||||
```
# Converting -f (from) latin1 (ISO-8859-1)
# -t (to) standard UTF-8

iconv -f ISO-8859-1 -t UTF-8 < input.txt > output.txt
```
|
||||
|
||||
* Useful options:
|
||||
|
||||
* `iconv -l` list all known encodings
|
||||
* `iconv -c` silently discard characters that cannot be converted
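
A quick, hedged illustration of `-c` in practice (file names here are placeholders): forcing a file down to plain ASCII before handing it to a tool that chokes on anything else.

```
# Sketch: drop any character that has no ASCII equivalent (e.g. accented letters)
iconv -c -f UTF-8 -t ASCII input.txt > ascii_only.txt

# List the encodings your iconv build knows about
iconv -l | head
```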
|
||||
|
||||
|
||||
|
||||
### HEAD
|
||||
|
||||
If you are a frequent Pandas user then `head` will be familiar. Often when dealing with new data the first thing we want to do is get a sense of what exists. This leads to firing up Pandas, reading in the data and then calling `df.head()` - strenuous, to say the least. Head, without any flags, will print out the first 10 lines of a file. The true power of `head` lies in testing out cleaning operations. For instance, if we wanted to change the delimiter of a file from commas to pipes, one quick test would be: `head mydata.csv | sed 's/,/|/g'`.
|
||||
```
# Prints out first 10 lines
head filename.csv

# Print first 3 lines
head -n 3 filename.csv
```
|
||||
|
||||
* Useful options:
|
||||
|
||||
* `head -n` print a specific number of lines
|
||||
* `head -c` print a specific number of bytes
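
As a small, hedged add-on (the file name is a placeholder): `-c` is handy for peeking at the raw start of a file, for instance to spot a stray BOM or an unexpected encoding before any parsing happens.

```
# Dump the first 512 bytes to check for a BOM or odd encoding
head -c 512 filename.csv
```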
|
||||
|
||||
|
||||
|
||||
### TR
|
||||
|
||||
Tr is analogous to translate. This powerful utility is a workhorse for basic file cleaning. An ideal use case is for swapping out the delimiters within a file.
|
||||
```
# Converting a tab delimited file into commas
cat tab_delimited.txt | tr "\t" "," > comma_delimited.csv
```
|
||||
|
||||
Another feature of `tr` is all the built in `[:class:]` variables at your disposal. These include:
|
||||
```
|
||||
[:alnum:] all letters and digits
|
||||
[:alpha:] all letters
|
||||
[:blank:] all horizontal whitespace
|
||||
[:cntrl:] all control characters
|
||||
[:digit:] all digits
|
||||
[:graph:] all printable characters, not including space
|
||||
[:lower:] all lower case letters
|
||||
[:print:] all printable characters, including space
|
||||
[:punct:] all punctuation characters
|
||||
[:space:] all horizontal or vertical whitespace
|
||||
[:upper:] all upper case letters
|
||||
[:xdigit:] all hexadecimal digits
|
||||
|
||||
```
|
||||
|
||||
You can chain a variety of these together to compose powerful programs. The following is a basic word count program you could use to check your READMEs for overused words.
|
||||
```
|
||||
cat README.md | tr "[:punct:][:space:]" "\n" | tr "[:upper:]" "[:lower:]" | grep . | sort | uniq -c | sort -nr
|
||||
|
||||
```
|
||||
|
||||
Another example using basic regex:
|
||||
```
|
||||
# Converting all upper case letters to lower case
|
||||
|
||||
cat filename.csv | tr '[A-Z]' '[a-z]'
|
||||
|
||||
```
|
||||
|
||||
* Useful options:
|
||||
|
||||
* `tr -d` delete characters
|
||||
* `tr -s` squeeze characters
|
||||
* `\b` backspace
|
||||
* `\f` form feed
|
||||
* `\v` vertical tab
|
||||
* `\NNN` character with octal value NNN
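
Two more hedged sketches of `-d` and `-s` in that spirit (file names are placeholders): stripping Windows carriage returns and squeezing runs of repeated spaces, both everyday clean-up chores.

```
# Strip carriage returns from a CRLF (Windows) file
tr -d '\r' < dos_file.csv > unix_file.csv

# Squeeze runs of spaces down to a single space
tr -s ' ' < messy.txt > tidy.txt
```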
|
||||
|
||||
|
||||
|
||||
### WC
|
||||
|
||||
Word count. Its value is primarily derived from the `-l` flag, which will give you the line count.
|
||||
```
|
||||
# Will return number of lines in CSV
|
||||
|
||||
wc -l gigantic_comma.csv
|
||||
|
||||
```
|
||||
|
||||
This tool comes in handy to confirm the output of various commands. So, if we were to convert the delimiters within a file and then run `wc -l` we would expect the total lines to be the same. If not, then we know something went wrong.
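
A minimal sketch of that sanity check, reusing the delimiter swap from the `tr` section above (file names are placeholders):

```
# Line counts should match before and after the conversion
wc -l tab_delimited.txt
cat tab_delimited.txt | tr "\t" "," > comma_delimited.csv
wc -l comma_delimited.csv
```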
|
||||
|
||||
* Useful options:
|
||||
|
||||
* `wc -c` print the byte counts
|
||||
* `wc -m` print the character counts
|
||||
* `wc -L` print length of longest line
|
||||
* `wc -w` print word counts
|
||||
|
||||
|
||||
|
||||
### SPLIT
|
||||
|
||||
File sizes can range dramatically. And depending on the job, it could be beneficial to split up the file - thus `split`. The basic syntax for split is:
|
||||
```
|
||||
# We will split our CSV into new_filename every 500 lines
|
||||
|
||||
split -l 500 filename.csv new_filename_
|
||||
|
||||
# filename.csv
|
||||
# ls output
|
||||
# new_filename_aaa
|
||||
# new_filename_aab
|
||||
# new_filename_aac
|
||||
|
||||
```
|
||||
|
||||
Two quirks are the naming convention and lack of file extensions. The suffix convention can be numeric via the `-d` flag. To add file extensions, you’ll need to run the following `find` command. It will change the names of ALL files within the current directory by appending `.csv`, so be careful.
|
||||
```
|
||||
find . -type f -exec mv '{}' '{}'.csv \;
|
||||
|
||||
# ls output
|
||||
# filename.csv.csv
|
||||
# new_filename_aaa.csv
|
||||
# new_filename_aab.csv
|
||||
# new_filename_aac.csv
|
||||
|
||||
```
|
||||
|
||||
* Useful options:
|
||||
|
||||
* `split -b` split by certain byte size
|
||||
* `split -a` generate suffixes of length N
|
||||
* `split -x` split using hex suffixes
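
A hedged example combining a couple of these flags; note that `--additional-suffix` is GNU `split` only, so treat this as a sketch rather than a portable recipe.

```
# 500-line chunks with numeric suffixes and a .csv extension (GNU split)
split -l 500 -d --additional-suffix=.csv filename.csv new_filename_

# ls output
# new_filename_00.csv
# new_filename_01.csv
```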
|
||||
|
||||
|
||||
|
||||
### SORT & UNIQ
|
||||
|
||||
The preceding commands are obvious: they do what they say they do. These two provide the most punch in tandem (e.g. unique word counts). This is due to `uniq`, which only operates on duplicate adjacent lines - hence the need to `sort` before piping the output through. One interesting note is that `sort -u` will achieve the same result as the typical `sort file.txt | uniq` pattern.
|
||||
|
||||
Sort does have a sneakily useful ability for data scientists: the ability to sort an entire CSV based on a particular column.
|
||||
```
|
||||
# Sorting a CSV file by the second column alphabetically
|
||||
|
||||
sort -t"," -k2,2 filename.csv
|
||||
|
||||
# Numerically
|
||||
|
||||
sort -t"," -k2n,2 filename.csv
|
||||
|
||||
# Reverse order
|
||||
|
||||
sort -t"," -k2nr,2 filename.csv
|
||||
|
||||
```
|
||||
|
||||
The `-t` option here specifies the comma as our delimiter; more often than not, spaces or tabs are assumed. Furthermore, the `-k` flag is for specifying our key. The syntax for this is `-km,n`, with `m` being the starting field and `n` being the last.
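
One practical wrinkle worth a sketch: a real CSV usually has a header row that you don't want sorted into the middle of the data. A common (hedged) workaround is to print the header first and sort everything after it.

```
# Keep the header in place while sorting the rest numerically by column 2
(head -n 1 filename.csv && tail -n +2 filename.csv | sort -t"," -k2,2n)
```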
|
||||
|
||||
* Useful options:
|
||||
|
||||
* `sort -f` ignore case
|
||||
* `sort -r` reverse sort order
|
||||
* `sort -R` scramble order
|
||||
* `uniq -c` count number of occurrences
|
||||
* `uniq -d` only print duplicate lines
|
||||
|
||||
|
||||
|
||||
### CUT
|
||||
|
||||
Cut is for extracting columns. To illustrate, suppose we only want the first and third columns.
|
||||
```
|
||||
cut -d, -f 1,3 filename.csv
|
||||
|
||||
```
|
||||
|
||||
To select every column other than the first.
|
||||
```
|
||||
cut -d, -f 2- filename.csv
|
||||
|
||||
```
|
||||
|
||||
In combination with other commands, `cut` serves as a filter.
|
||||
```
|
||||
# Print first 10 lines of column 1 and 3, where "some_string_value" is present
|
||||
|
||||
head filename.csv | grep "some_string_value" | cut -d, -f 1,3
|
||||
|
||||
```
|
||||
|
||||
Finding out the number of unique values within the second column.
|
||||
```
|
||||
cat filename.csv | cut -d, -f 2 | sort | uniq | wc -l
|
||||
|
||||
# Count occurrences of unique values, limiting to first 10 results
|
||||
|
||||
cat filename.csv | cut -d, -f 2 | sort | uniq -c | head
|
||||
|
||||
```
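
One more hedged variant: `cut` can also slice by character position rather than by field, which is occasionally useful for fixed-width exports.

```
# First 10 characters of every line (fixed-width style slicing)
cut -c 1-10 filename.csv
```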
|
||||
|
||||
### PASTE
|
||||
|
||||
Paste is a niche command with an interesting function. If you have two files that you need merged, and they are already sorted, `paste` has you covered.
|
||||
```
|
||||
# names.txt
|
||||
adam
|
||||
john
|
||||
zach
|
||||
|
||||
# jobs.txt
|
||||
lawyer
|
||||
youtuber
|
||||
developer
|
||||
|
||||
# Join the two into a CSV
|
||||
|
||||
paste -d ',' names.txt jobs.txt > person_data.txt
|
||||
|
||||
# Output
|
||||
adam,lawyer
|
||||
john,youtuber
|
||||
zach,developer
|
||||
|
||||
```
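
A related trick worth sketching: with `-s`, `paste` works serially, collapsing a single column into one delimited row, which is handy for building a quick comma-separated list (reusing `names.txt` from above).

```
# Collapse a column into a single comma-separated line
paste -sd, names.txt

# Output
# adam,john,zach
```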
|
||||
|
||||
For a more SQL-esque variant, see below.
|
||||
|
||||
### JOIN
|
||||
|
||||
Join is simplistic, quasi-tangential SQL. The largest differences are that `join` will return all columns, and matches can only be on one field. By default, `join` will try to use the first column as the match key. For a different result, the following syntax is necessary:
|
||||
```
|
||||
# Join the first file (-1) by the second column
|
||||
# and the second file (-2) by the first
|
||||
|
||||
join -t"," -1 2 -2 1 first_file.txt second_file.txt
|
||||
|
||||
```
|
||||
|
||||
The standard join is an inner join. However, an outer join is also viable through the `-a` flag. Another noteworthy quirk is the `-e` flag, which can be used to substitute a value if a missing field is found.
|
||||
```
|
||||
# Outer join, replace blanks with NULL in columns 1 and 2
|
||||
# -o which fields to substitute - 0 is key, 1.1 is first column, etc...
|
||||
|
||||
join -t"," -1 2 -a 1 -a2 -e ' NULL' -o '0,1.1,2.2' first_file.txt second_file.txt
|
||||
|
||||
```
|
||||
|
||||
Not the most user-friendly command, but desperate times, desperate measures.
|
||||
|
||||
* Useful options:
|
||||
|
||||
* `join -a` print unpairable lines
|
||||
* `join -e` replace missing input fields
|
||||
* `join -j` equivalent to `-1 FIELD -2 FIELD`
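
One caveat worth a sketch of its own: `join` expects both inputs to already be sorted on the join field, so a pre-sort (here on the matching columns used in the example above) is usually part of the pipeline. File names are the same placeholders as above.

```
# join silently misbehaves on unsorted input, so sort the key columns first
sort -t"," -k2,2 first_file.txt > first_sorted.txt
sort -t"," -k1,1 second_file.txt > second_sorted.txt
join -t"," -1 2 -2 1 first_sorted.txt second_sorted.txt
```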
|
||||
|
||||
|
||||
|
||||
### GREP
|
||||
|
||||
Global search for a regular expression and print, better known as `grep`: likely the most well-known command, and with good reason. Grep has a lot of power, especially for finding your way around large codebases. Within the realm of data science, it acts as a refining mechanism for other commands, although its standard usage is valuable as well.
|
||||
```
|
||||
# Recursively search and list all files in directory containing 'word'
|
||||
|
||||
grep -lr 'word' .
|
||||
|
||||
# Count the number of files containing 'word'
|
||||
|
||||
grep -lr 'word' . | wc -l
|
||||
|
||||
```
|
||||
|
||||
Count total number of lines containing word / pattern.
|
||||
```
|
||||
grep -c 'some_value' filename.csv
|
||||
|
||||
# Same thing, but in all files in current directory by file name
|
||||
|
||||
grep -c 'some_value' *
|
||||
|
||||
```
|
||||
|
||||
Grep for multiple values using the or operator - `\|`.
|
||||
```
|
||||
grep "first_value\|second_value" filename.csv
|
||||
|
||||
```
|
||||
|
||||
* Useful options
|
||||
|
||||
* `alias grep="grep --color=auto"` make grep colorful
|
||||
* `grep -E` use extended regexps
|
||||
* `grep -w` only match whole words
|
||||
* `grep -l` print name of files with match
|
||||
* `grep -v` inverted matching
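
A couple of these flags combined, as a hedged sketch (pattern and file are placeholders): extended regex plus whole-word matching keeps partial hits like `first_value_2` out of the results.

```
# Whole-word match for either value, using extended regex
grep -Ew 'first_value|second_value' filename.csv

# Everything except the matching lines
grep -v 'first_value' filename.csv
```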
|
||||
|
||||
|
||||
|
||||
### THE BIG GUNS
|
||||
|
||||
Sed and Awk are the two most powerful commands in this article. For brevity, I’m not going to go into exhaustive detail about either. Instead, I will cover a variety of commands that prove their impressive might. If you want to know more, [there is a book][5] just for that.
|
||||
|
||||
### SED
|
||||
|
||||
At its core `sed` is a stream editor. It excels at substitutions, but can also be leveraged for all-out refactoring.
|
||||
|
||||
The most basic `sed` command consists of `s/old/new/g`. This translates to search for old value, replace with new globally. Without the `/g` our command would terminate after the first occurrence.
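
A tiny sketch of that difference, using `echo` as the input so nothing on disk is touched:

```
echo "red red red" | sed 's/red/blue/'
# blue red red

echo "red red red" | sed 's/red/blue/g'
# blue blue blue
```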
|
||||
|
||||
To get a quick taste of the power, let’s dive into an example. In this scenario you’ve been given the following file:
|
||||
```
|
||||
balance,name
|
||||
$1,000,john
|
||||
$2,000,jack
|
||||
|
||||
```
|
||||
|
||||
The first thing we may want to do is remove the dollar signs. The `-i` flag indicates in-place editing. The `''` supplies a zero-length backup extension (the BSD/macOS form of `sed -i`), so the original file is overwritten with no backup. Ideally, you would test each of these individually and then output to a new file.
|
||||
```
|
||||
sed -i '' 's/\$//g' data.txt
|
||||
|
||||
# balance,name
|
||||
# 1,000,john
|
||||
# 2,000,jack
|
||||
|
||||
```
|
||||
|
||||
Next up, the commas in our `balance` column values.
|
||||
```
|
||||
sed -i '' 's/\([0-9]\),\([0-9]\)/\1\2/g' data.txt
|
||||
|
||||
# balance,name
|
||||
# 1000,john
|
||||
# 2000,jack
|
||||
|
||||
```
|
||||
|
||||
Lastly, Jack up and decided to quit one day. So, au revoir, mon ami.
|
||||
```
|
||||
sed -i '' '/jack/d' data.txt
|
||||
|
||||
# balance,name
|
||||
# 1000,john
|
||||
|
||||
```
|
||||
|
||||
As you can see, `sed` packs quite a punch, but the fun doesn’t stop there.
|
||||
|
||||
### AWK
|
||||
|
||||
The best for last. Awk is much more than a simple command: it is a full-blown language. Of everything covered in this article, `awk` is by far the coolest. If you find yourself impressed, there are loads of great resources - see [here][6], [here][7] and [here][8].
|
||||
|
||||
Common use cases for `awk` include:
|
||||
|
||||
* Text processing
|
||||
* Formatted text reports
|
||||
* Performing arithmetic operations
|
||||
* Performing string operations
|
||||
|
||||
|
||||
|
||||
In its most nascent form, `awk` can parallel `grep`.
|
||||
```
|
||||
awk '/word/' filename.csv
|
||||
|
||||
```
|
||||
|
||||
Or, with a little more magic, it can replace the combination of `grep` and `cut`. Here, `awk` prints the third and fourth columns, tab separated, for all lines containing our word. The `-F,` flag merely changes our delimiter to a comma.
|
||||
```
|
||||
awk -F, '/word/ { print $3 "\t" $4 }' filename.csv
|
||||
|
||||
```
|
||||
|
||||
Awk comes with a lot of nifty variables built in. For instance, `NF` (number of fields) and `NR` (number of records). To get the fifty-third record in a file:
|
||||
```
|
||||
awk -F, 'NR == 53' filename.csv
|
||||
|
||||
```
|
||||
|
||||
An added wrinkle is the ability to filter based off of one or more values. The first example, below, will print the line number and columns for records where the first column equals "string".
|
||||
```
|
||||
awk -F, ' $1 == "string" { print NR, $0 } ' filename.csv
|
||||
|
||||
# Filter based off of numerical value in second column
|
||||
|
||||
awk -F, ' $2 == 1000 { print NR, $0 } ' filename.csv
|
||||
|
||||
```
|
||||
|
||||
Multiple numerical expressions:
|
||||
```
|
||||
# Print line number and columns where column three greater
|
||||
# than 2005 and column five less than one thousand
|
||||
|
||||
awk -F, ' $3 >= 2005 && $5 <= 1000 { print NR, $0 } ' filename.csv
|
||||
|
||||
```
|
||||
|
||||
Sum the third column:
|
||||
```
|
||||
awk -F, '{ x+=$3 } END { print x }' filename.csv
|
||||
|
||||
```
|
||||
|
||||
The sum of the third column, for values where the first column equals “something”.
|
||||
```
|
||||
awk -F, '$1 == "something" { x+=$3 } END { print x }' filename.csv
|
||||
|
||||
```
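
The same pattern extends naturally to other aggregates. As a hedged sketch (assuming the column is numeric and the file has no header row), a mean of the third column:

```
awk -F, '{ x+=$3 } END { print x/NR }' filename.csv
```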
|
||||
|
||||
Get the dimensions of a file:
|
||||
```
|
||||
awk -F, 'END { print NF, NR }' filename.csv
|
||||
|
||||
# Prettier version
|
||||
|
||||
awk -F, 'BEGIN { print "COLUMNS", "ROWS" }; END { print NF, NR }' filename.csv
|
||||
|
||||
```
|
||||
|
||||
Print lines appearing twice:
|
||||
```
|
||||
awk -F, '++seen[$0] == 2' filename.csv
|
||||
|
||||
```
|
||||
|
||||
Remove duplicate lines:
|
||||
```
# Consecutive lines
awk 'a !~ $0; {a=$0}' filename.csv

# Nonconsecutive lines
awk '! a[$0]++' filename.csv

# More efficient
awk '!($0 in a) {a[$0]; print}' filename.csv
```
|
||||
|
||||
Substitute multiple values using built-in function `gsub()`.
|
||||
```
|
||||
awk '{gsub(/scarlet|ruby|puce/, "red"); print}'
|
||||
|
||||
```
|
||||
|
||||
This `awk` command will combine multiple CSV files, skipping the repeated header rows so that only the first file’s header is kept.
|
||||
```
|
||||
awk 'FNR==1 && NR!=1{next;}{print}' *.csv > final_file.csv
|
||||
|
||||
```
|
||||
|
||||
Need to downsize a massive file? Welp, `awk` can handle that with help from `sed`. Specifically, this command breaks one big file into multiple smaller ones based on a line count. This one-liner will also add an extension.
|
||||
```
|
||||
sed '1d;$d' filename.csv | awk 'NR%NUMBER_OF_LINES==1{x="filename-"++i".csv";}{print > x}'
|
||||
|
||||
# Example: splitting big_data.csv into data_(n).csv every 100,000 lines
|
||||
|
||||
sed '1d;$d' big_data.csv | awk 'NR%100000==1{x="data_"++i".csv";}{print > x}'
|
||||
|
||||
```
|
||||
|
||||
### CLOSING
|
||||
|
||||
The command line boasts endless power. The commands covered in this article are enough to elevate you from zero to hero in no time. Beyond those covered, there are many utilities to consider for daily data operations. [Csvkit][9], [xsv][10] and [q][11] are three of note. If you’re looking to take an even deeper dive into command line data science, then look no further than [this book][12]. It’s also available online [for free][13]!
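
As a hedged taste of those tools (each must be installed separately, and the file name is a placeholder), here are csvkit and xsv equivalents of a couple of tricks from above:

```
# csvkit: select columns 1 and 3, then pretty-print
csvcut -c 1,3 filename.csv | csvlook

# xsv: per-column summary statistics
xsv stats filename.csv
```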
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
via: http://kadekillary.work/post/cli-4-ds/
|
||||
|
||||
作者:[Kade Killary][a]
|
||||
选题:[lujun9972](https://github.com/lujun9972)
|
||||
译者:[译者ID](https://github.com/译者ID)
|
||||
校对:[校对者ID](https://github.com/校对者ID)
|
||||
|
||||
本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出
|
||||
|
||||
[a]:http://kadekillary.work/authors/kadekillary
|
||||
[1]:https://en.wikipedia.org/wiki/Brian_Kernighan
|
||||
[2]:https://en.wikipedia.org/wiki/The_C_Programming_Language
|
||||
[3]:https://www.amazon.com/Learning-AWK-Programming-cutting-edge-text-processing-ebook/dp/B07BT98HDS
|
||||
[4]:https://www.youtube.com/watch?v=MijmeoH9LT4
|
||||
[5]:https://www.amazon.com/sed-awk-Dale-Dougherty/dp/1565922255/ref=sr_1_1?ie=UTF8&qid=1524381457&sr=8-1&keywords=sed+and+awk
|
||||
[6]:https://www.amazon.com/AWK-Programming-Language-Alfred-Aho/dp/020107981X/ref=sr_1_1?ie=UTF8&qid=1524388936&sr=8-1&keywords=awk
|
||||
[7]:http://www.grymoire.com/Unix/Awk.html
|
||||
[8]:https://www.tutorialspoint.com/awk/index.htm
|
||||
[9]:http://csvkit.readthedocs.io/en/1.0.3/
|
||||
[10]:https://github.com/BurntSushi/xsv
|
||||
[11]:https://github.com/harelba/q
|
||||
[12]:https://www.amazon.com/Data-Science-Command-Line-Time-Tested/dp/1491947853/ref=sr_1_1?ie=UTF8&qid=1524390894&sr=8-1&keywords=data+science+at+the+command+line
|
||||
[13]:https://www.datascienceatthecommandline.com/
|
@ -0,0 +1,469 @@
|
||||
数据科学家的命令行技巧
|
||||
======
|
||||
|
||||
![](https://i.imgur.com/0mzQMcB.png)
|
||||
|
||||
对于许多数据科学家来说,数据操作始于和结束于 Pandas 或 Tidyverse。从理论上讲,这样做没有任何问题。毕竟,这就是这些工具存在的原因。然而,对于像分隔符转换这样的简单任务,这些工具是大材小用了。
|
||||
|
||||
立志掌握命令行应该在每个开发人员的清单上,特别是数据科学家。学习 shell 的来龙去脉将无可否认地提高你的生产力。除此之外,命令行还是计算领域的一个重要历史课程。以 awk 为例,它是一种数据驱动的脚本语言。Awk 于 1977 年首次出现,这要归功于 [Brian Kernighan][1],也就是传奇的 [K&R 书][2]中的 K。今天,大约四十年过去了,awk 依然保持着它的相关性,每年仍有[新书][3]问世!因此,可以有把握地认为,对命令行魔法的投资不会很快贬值。
|
||||
|
||||
### 我们将涵盖什么
|
||||
|
||||
* ICONV
|
||||
* HEAD
|
||||
* TR
|
||||
* WC
|
||||
* SPLIT
|
||||
* SORT & UNIQ
|
||||
* CUT
|
||||
* PASTE
|
||||
* JOIN
|
||||
* GREP
|
||||
* SED
|
||||
* AWK
|
||||
|
||||
|
||||
|
||||
### ICONV
|
||||
|
||||
文件编码可能会很棘手。现在大部分文件都是 UTF-8 编码的。要了解 UTF-8 背后的一些魔力,请查看这个出色的[视频][4]。尽管如此,有时我们收到的文件不是这种格式。这可能引起对改变编码模式的一些不靠谱尝试。这里,iconv 是一个拯救者。Iconv 是一个简单的程序,它将获取采用一种编码的文本并输出采用另一种编码的文本。
|
||||
```
|
||||
# Converting -f (from) latin1 (ISO-8859-1)
|
||||
# -t (to) standard UTF_8
|
||||
|
||||
iconv -f ISO-8859-1 -t UTF-8 < input.txt > output.txt
|
||||
```
|
||||
|
||||
* 实用选项:
|
||||
|
||||
* `iconv -l` 列出所有已知编码
|
||||
* `iconv -c` 默默丢弃无法转换的字符
|
||||
|
||||
|
||||
|
||||
### HEAD
|
||||
|
||||
如果你是一个频繁的 Pandas 用户,那么你会很熟悉 `head`。通常在处理新数据时,我们想做的第一件事就是了解其内容。这需要启动 Pandas,读取数据,然后调用 `df.head()`,说轻了也挺费劲的。没有任何标志的 `head` 将打印出文件的前 10 行。`head` 的真正力量在于测试数据清洗操作。例如,如果我们想将文件的分隔符从逗号更改为管道符,一个快速测试将是:`head mydata.csv | sed 's/,/|/g'`。
|
||||
```bash
|
||||
# Prints out first 10 lines
|
||||
head filename.csv
|
||||
|
||||
# Print first 3 lines
|
||||
head -n 3 filename.csv
|
||||
```
|
||||
|
||||
* 实用选项:
|
||||
|
||||
* `head -n` 打印特定行数
|
||||
* `head -c` 打印具体的字节数
|
||||
|
||||
|
||||
|
||||
### TR
|
||||
|
||||
Tr 类似于翻译。这个功能强大的实用程序是基本文件清理的主力。理想的用例是交换文件中的分隔符。
|
||||
```bash
|
||||
# Converting a tab delimited file into commas
|
||||
cat tab_delimited.txt | tr "\t" "," > comma_delimited.csv
|
||||
```
|
||||
|
||||
`tr` 另一个功能是在你支配中的内建 `[:class:]` 变量(POSIX 字符类)。这些包括了:
|
||||
|
||||
```
|
||||
[:alnum:] all letters and digits
|
||||
[:alpha:] all letters
|
||||
[:blank:] all horizontal whitespace
|
||||
[:cntrl:] all control characters
|
||||
[:digit:] all digits
|
||||
[:graph:] all printable characters, not including space
|
||||
[:lower:] all lower case letters
|
||||
[:print:] all printable characters, including space
|
||||
[:punct:] all punctuation characters
|
||||
[:space:] all horizontal or vertical whitespace
|
||||
[:upper:] all upper case letters
|
||||
[:xdigit:] all hexadecimal digits
|
||||
```
|
||||
|
||||
你可以将这些连接在一起以组成强大的程序。以下是一个基本的字数统计程序,可用于检查 README 中有哪些词被过度使用。
|
||||
```
|
||||
cat README.md | tr "[:punct:][:space:]" "\n" | tr "[:upper:]" "[:lower:]" | grep . | sort | uniq -c | sort -nr
|
||||
```
|
||||
|
||||
另一个使用基本正则表达式的例子:
|
||||
```
|
||||
# Converting all upper case letters to lower case
|
||||
cat filename.csv | tr '[A-Z]' '[a-z]'
|
||||
```
|
||||
|
||||
* 实用选项:
|
||||
|
||||
* `tr -d` 删除字符
|
||||
* `tr -s` 压缩字符
|
||||
* `\b` 退格
|
||||
* `\f` 换页
|
||||
* `\v` 垂直制表符
|
||||
* `\NNN` 八进制字符
|
||||
|
||||
|
||||
|
||||
### WC
|
||||
|
||||
单词数量。它的值主要来自 `-l` 标志,它会给你提供行数。
|
||||
```
|
||||
# Will return number of lines in CSV
|
||||
wc -l gigantic_comma.csv
|
||||
```
|
||||
|
||||
这个工具可以方便地确认各种命令的输出。所以,如果我们在转换文件中的分隔符之后运行 `wc -l`,我们会期待总行数是一样的,如果不一致,我们就知道有地方出错了。
|
||||
|
||||
* 实用选项:
|
||||
|
||||
* `wc -c` 打印字节数
|
||||
* `wc -m` 打印字符数
|
||||
* `wc -L` 打印最长行的长度
|
||||
* `wc -w` 打印单词数量
|
||||
|
||||
|
||||
|
||||
### SPLIT
|
||||
|
||||
文件大小的范围可以很广。取决于任务,拆分文件可以是有益的,所以使用 `split` 吧。split的基本语法是:
|
||||
```bash
|
||||
# We will split our CSV into new_filename every 500 lines
|
||||
split -l 500 filename.csv new_filename_
|
||||
# filename.csv
|
||||
# ls output
|
||||
# new_filename_aaa
|
||||
# new_filename_aab
|
||||
# new_filename_aac
|
||||
```
|
||||
|
||||
两个奇怪的地方是命名约定和缺少文件扩展名。后缀约定可以通过 `-d` 标志变为数字。要添加文件扩展名,你需要运行以下 `find` 命令。它将通过附加 `.csv`更改当前目录中所有文件的名称,所以小心了。
|
||||
```bash
|
||||
find . -type f -exec mv '{}' '{}'.csv \;
|
||||
# ls output
|
||||
# filename.csv.csv
|
||||
# new_filename_aaa.csv
|
||||
# new_filename_aab.csv
|
||||
# new_filename_aac.csv
|
||||
```
|
||||
|
||||
* 实用选项:
|
||||
|
||||
* `split -b` 按特定字节大小分割
|
||||
* `split -a` 生成长度为 N 的后缀
|
||||
* `split -x` 使用十六进制后缀分割
|
||||
|
||||
|
||||
|
||||
### SORT & UNIQ
|
||||
|
||||
以上两个命令很明显:它们的作用就是字面意思。这两者结合起来威力最大(例如统计不重复单词的数量)。这是由于 `uniq` 只作用于重复的相邻行,这也是在管道输出前先 `sort` 的原因。一个有趣的地方是 `sort -u` 会达到和典型的 `sort file.txt | uniq` 模式一样的结果。
|
||||
|
||||
Sort 对数据科学家来说确实具有潜在的有用能力:能够根据特定列对整个 CSV 进行排序。
|
||||
```bash
|
||||
# Sorting a CSV file by the second column alphabetically
|
||||
sort -t"," -k2,2 filename.csv
|
||||
|
||||
# Numerically
|
||||
sort -t"," -k2n,2 filename.csv
|
||||
|
||||
# Reverse order
|
||||
sort -t"," -k2nr,2 filename.csv
|
||||
```
|
||||
|
||||
这里的 `-t` 选项将逗号指定为分隔符。通常假设分隔符是空格或制表符。此外,`-k` 标志是为了确定我们的 key。这里的语法是 `-km,n`,`m` 作为开始列,`n` 作为结束列。
|
||||
|
||||
* 实用选项:
|
||||
|
||||
* `sort -f` 忽略大小写
|
||||
* `sort -r` 反向排序
|
||||
* `sort -R` 乱序
|
||||
* `uniq -c` 统计出现次数
|
||||
* `uniq -d` 只打印重复行
|
||||
|
||||
|
||||
|
||||
### CUT
|
||||
|
||||
Cut 用于提取列。为了演示,假设我们只想要第一列和第三列。
|
||||
```bash
|
||||
cut -d, -f 1,3 filename.csv
|
||||
```
|
||||
|
||||
选择除了第一列之外的所有列。
|
||||
```bash
|
||||
cut -d, -f 2- filename.csv
|
||||
```
|
||||
|
||||
结合其他命令,将`cut` 用作过滤器。
|
||||
```bash
|
||||
# Print first 10 lines of column 1 and 3, where "some_string_value" is present
|
||||
head filename.csv | grep "some_string_value" | cut -d, -f 1,3
|
||||
```
|
||||
|
||||
查出第二列中唯一值的数量。
|
||||
```bash
|
||||
cat filename.csv | cut -d, -f 2 | sort | uniq | wc -l
|
||||
|
||||
# Count occurences of unique values, limiting to first 10 results
|
||||
cat filename.csv | cut -d, -f 2 | sort | uniq -c | head
|
||||
```
|
||||
|
||||
### PASTE
|
||||
|
||||
Paste 是一个带有趣味性功能的粘贴命令。如果你有两个需要合并的文件,并且它们已经排序了,`paste` 帮你解决了接下来的步骤。
|
||||
```bash
|
||||
# names.txt
|
||||
adam
|
||||
john
|
||||
zach
|
||||
|
||||
# jobs.txt
|
||||
lawyer
|
||||
youtuber
|
||||
developer
|
||||
|
||||
# Join the two into a CSV
|
||||
paste -d ',' names.txt jobs.txt > person_data.txt
|
||||
|
||||
# Output
|
||||
adam,lawyer
|
||||
john,youtuber
|
||||
zach,developer
|
||||
```
|
||||
|
||||
对于更类似 SQL 的变体,请参阅下文。
|
||||
|
||||
### JOIN
|
||||
|
||||
Join 是一种简化的、有点类似 SQL 的操作。最大的区别是 `join` 会返回所有列,并且只能在一个字段上进行匹配。默认情况下,`join` 会尝试使用第一列作为匹配键。要获得不同的结果,必须使用以下语法:
|
||||
```bash
|
||||
# Join the first file (-1) by the second column
|
||||
# and the second file (-2) by the first
|
||||
join -t "," -1 2 -2 1 first_file.txt second_file.txt
|
||||
```
|
||||
|
||||
标准的 join 是内连接。然而,通过 `-a` 标志,外连接也是可行的。另一个值得一提的技巧是 `-e` 标志,如果发现有缺失的字段,可以用它来替换值。
|
||||
```bash
|
||||
# Outer join, replace blanks with NULL in columns 1 and 2
|
||||
# -o which fields to substitute - 0 is key, 1.1 is first column, etc...
|
||||
join -t"," -1 2 -a 1 -a2 -e ' NULL' -o '0,1.1,2.2' first_file.txt second_file.txt
|
||||
```
|
||||
|
||||
这不是对用户最友好的命令,但是非常时期需要非常手段。
|
||||
|
||||
* 实用选项:
|
||||
|
||||
* `join -a` 打印不可配对的行
|
||||
* `join -e` 替换丢失的输入字段
|
||||
* `join -j` 相当于 `-1 FIELD -2 FIELD`
|
||||
|
||||
|
||||
|
||||
### GREP
|
||||
|
||||
用正则表达式全局搜索并且打印,即 `grep`,可能是最著名的命令,这名气当之无愧。Grep 很强大,特别适合在大型代码库中摸清方向。在数据科学领域,它充当其他命令的提炼机制,当然它的标准用法本身也很有价值。
|
||||
```
|
||||
# Recursively search and list all files in directory containing 'word'
|
||||
|
||||
grep -lr 'word' .
|
||||
|
||||
# List number of files containing word
|
||||
|
||||
grep -lr 'word' . | wc -l
|
||||
|
||||
```
|
||||
|
||||
计算包含单词或模式的总行数。
|
||||
```
|
||||
grep -c 'some_value' filename.csv
|
||||
|
||||
# Same thing, but in all files in current directory by file name
|
||||
|
||||
grep -c 'some_value' *
|
||||
```
|
||||
|
||||
使用 or 运算符 - `\|` 为多个值 Grep。
|
||||
```
|
||||
grep "first_value\|second_value" filename.csv
|
||||
```
|
||||
|
||||
* 实用选项:
|
||||
|
||||
* `alias grep="grep --color=auto"` 使 grep 丰富多彩
|
||||
* `grep -E` 使用扩展的 regexp
|
||||
* `grep -w` 只匹配整个单词
|
||||
* `grep -l` 打印匹配的文件名
|
||||
* `grep -v` 倒置匹配
|
||||
|
||||
|
||||
|
||||
### 大人物们
|
||||
|
||||
Sed 和 Awk 是本文中最强大的两个命令。为简介起见,我不打算详细讨论这两个命令。相反,我将介绍各种能证明其令人印象深刻的力量的命令。如果你想了解更多,[这儿就有一本书][5]是关于它们的。
|
||||
|
||||
### SED
|
||||
|
||||
`sed` 本质上是一个流编辑器。它擅长替换,但也可以用于全面的重构。
|
||||
|
||||
最基本的 `sed` 命令由 `s/old/new/g` 组成。这转换为搜索旧值,全局替换为新值。 如果没有 `/g`,我们的命令将在旧值第一次出现后终止。
|
||||
|
||||
为了快速了解它的功能,我们可以深入了解一个例子。 在以下情景中,你已获得以下文件:
|
||||
```
|
||||
balance,name
|
||||
$1,000,john
|
||||
$2,000,jack
|
||||
```
|
||||
|
||||
我们可能想要做的第一件事是删除美元符号。`-i` 标志表示原位。`''` 表示零长度文件扩展名,从而覆盖我们的初始文件。理想情况下,你可以单独测试每个,然后输出到新文件。
|
||||
```
|
||||
sed -i '' 's/\$//g' data.txt
|
||||
# balance,name
|
||||
# 1,000,john
|
||||
# 2,000,jack
|
||||
```
|
||||
|
||||
接下来,处理 `balance` 列中的逗号。
|
||||
```
|
||||
sed -i '' 's/\([0-9]\),\([0-9]\)/\1\2/g' data.txt
|
||||
# balance,name
|
||||
# 1000,john
|
||||
# 2000,jack
|
||||
```
|
||||
|
||||
最后杰克有一天决定退出。所以,再见了,我的朋友。
|
||||
```
|
||||
sed -i '' '/jack/d' data.txt
|
||||
# balance,name
|
||||
# 1000,john
|
||||
```
|
||||
|
||||
正如你所看到的,`sed` 有很多强大的功能,但乐趣并不止于此。
|
||||
|
||||
### AWK
|
||||
|
||||
最好的留在最后。Awk 不仅仅是一个简单的命令:它是一个成熟的语言。在本文中涉及的所有内容中,`awk` 是目前为止最酷的。如果你发现自己对其印象深刻,这里有很多很棒的资源 - 看 [这里][6], [这里][7] 和 [这里][8]。
|
||||
|
||||
`awk` 的常见用例包括:
|
||||
|
||||
* 文字处理
|
||||
* 格式化文本报告
|
||||
* 执行算术运算
|
||||
* 执行字符串操作
|
||||
|
||||
|
||||
|
||||
在最基本的形式下,`awk` 可以与 `grep` 相媲美。
|
||||
```
|
||||
awk '/word/' filename.csv
|
||||
```
|
||||
|
||||
或者更加神奇:将 `grep` 和 `cut` 组合起来。在这里,`awk` 打印第三和第四列,用 tab 分隔,对于所有带我们指定单词的行。`-F,` 只是改变我们的分隔符为逗号。
|
||||
```bash
|
||||
awk -F, '/word/ { print $3 "\t" $4 }' filename.csv
|
||||
```
|
||||
|
||||
Awk 内置了许多精巧的变量。比如,`NF` \- 字段数 - 和 `NR` \- 记录数。要获取文件中的第53条记录:
|
||||
```bash
|
||||
awk -F, 'NR == 53' filename.csv
|
||||
```
|
||||
|
||||
增加的代码是基于一个或多个值进行过滤的能力。下面的第一个示例将打印第一列等于给定字符串的记录的行号和列。
|
||||
```bash
|
||||
awk -F, ' $1 == "string" { print NR, $0 } ' filename.csv
|
||||
|
||||
# Filter based off of numerical value in second column
|
||||
awk -F, ' $2 == 1000 { print NR, $0 } ' filename.csv
|
||||
```
|
||||
|
||||
多个数值表达式:
|
||||
```bash
|
||||
# Print line number and columns where column three greater
|
||||
# than 2005 and column five less than one thousand
|
||||
|
||||
awk -F, ' $3 >= 2005 && $5 <= 1000 { print NR, $0 } ' filename.csv
|
||||
```
|
||||
|
||||
求出第三列的总和:
|
||||
```bash
|
||||
awk -F, '{ x+=$3 } END { print x }' filename.csv
|
||||
```
|
||||
|
||||
在第一列等于 “something”的那些行,求出第三列值的总和。
|
||||
```bash
|
||||
awk -F, '$1 == "something" { x+=$3 } END { print x }' filename.csv
|
||||
```
|
||||
|
||||
获取文件的尺寸:
|
||||
```bash
|
||||
awk -F, 'END { print NF, NR }' filename.csv
|
||||
|
||||
# Prettier version
|
||||
awk -F, 'BEGIN { print "COLUMNS", "ROWS" }; END { print NF, NR }' filename.csv
|
||||
```
|
||||
|
||||
打印出现了两次的行:
|
||||
```bash
|
||||
awk -F, '++seen[$0] == 2' filename.csv
|
||||
```
|
||||
|
||||
删除重复的行:
|
||||
```bash
|
||||
# Consecutive lines
|
||||
awk 'a !~ $0; {a=$0}' filename.csv
|
||||
|
||||
# Nonconsecutive lines
|
||||
awk '! a[$0]++' filename.csv
|
||||
|
||||
# More efficient
|
||||
awk '!($0 in a) {a[$0]; print}' filename.csv
|
||||
```
|
||||
|
||||
使用内置函数 `gsub()` 替换多个值。
|
||||
```bash
|
||||
awk '{gsub(/scarlet|ruby|puce/, "red"); print}'
|
||||
```
|
||||
|
||||
这个 `awk` 命令将组合多个 CSV 文件,并跳过重复的标题行,只保留第一个文件的标题。
|
||||
```bash
|
||||
awk 'FNR==1 && NR!=1{next;}{print}' *.csv > final_file.csv
|
||||
```
|
||||
|
||||
需要缩小一个庞大的文件? `awk` 可以在 `sed` 的帮助下处理它。具体来说,该命令根据行数将一个大文件分成多个较小的文件。这个 one-liner 也将增加一个扩展。
|
||||
```bash
|
||||
sed '1d;$d' filename.csv | awk 'NR%NUMBER_OF_LINES==1{x="filename-"++i".csv";}{print > x}'
|
||||
|
||||
# Example: splitting big_data.csv into data_(n).csv every 100,000 lines
|
||||
sed '1d;$d' big_data.csv | awk 'NR%100000==1{x="data_"++i".csv";}{print > x}'
|
||||
```
|
||||
|
||||
### 结语
|
||||
|
||||
命令行拥有无穷无尽的力量。本文中介绍的命令足以将你从一无所知提升到英雄人物。除了涵盖的内容之外,还有许多实用程序需要考虑用于日常数据操作。[Csvkit][9], [xsv][10] 还有 [q][11] 是需要记住的三个。如果你希望更深入地了解命令行数据科学,查看[这本书][12]。它也可以[免费][13]在线获得!
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
|
||||
via: http://kadekillary.work/post/cli-4-ds/
|
||||
|
||||
作者:[Kade Killary][a]
|
||||
选题:[lujun9972](https://github.com/lujun9972)
|
||||
译者:[GraveAccent](https://github.com/graveaccent)
|
||||
校对:[校对者ID](https://github.com/校对者ID)
|
||||
|
||||
本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出
|
||||
|
||||
[a]:http://kadekillary.work/authors/kadekillary
|
||||
[1]:https://en.wikipedia.org/wiki/Brian_Kernighan
|
||||
[2]:https://en.wikipedia.org/wiki/The_C_Programming_Language
|
||||
[3]:https://www.amazon.com/Learning-AWK-Programming-cutting-edge-text-processing-ebook/dp/B07BT98HDS
|
||||
[4]:https://www.youtube.com/watch?v=MijmeoH9LT4
|
||||
[5]:https://www.amazon.com/sed-awk-Dale-Dougherty/dp/1565922255/ref=sr_1_1?ie=UTF8&qid=1524381457&sr=8-1&keywords=sed+and+awk
|
||||
[6]:https://www.amazon.com/AWK-Programming-Language-Alfred-Aho/dp/020107981X/ref=sr_1_1?ie=UTF8&qid=1524388936&sr=8-1&keywords=awk
|
||||
[7]:http://www.grymoire.com/Unix/Awk.html
|
||||
[8]:https://www.tutorialspoint.com/awk/index.htm
|
||||
[9]:http://csvkit.readthedocs.io/en/1.0.3/
|
||||
[10]:https://github.com/BurntSushi/xsv
|
||||
[11]:https://github.com/harelba/q
|
||||
[12]:https://www.amazon.com/Data-Science-Command-Line-Time-Tested/dp/1491947853/ref=sr_1_1?ie=UTF8&qid=1524390894&sr=8-1&keywords=data+science+at+the+command+line
|
||||
[13]:https://www.datascienceatthecommandline.com/
|