How to count the number of files and folders/directories in Linux
======

Hi folks, today we bring you another series of handy commands that can help you in many ways. They can count the files and directories in the current directory, count recursively, list the files created by a particular user, and more.

In this tutorial, we will show you how to combine several commands, performing some advanced operations with `ls`, `egrep`, `wc` and `find`. The commands below are useful in many situations.

For testing, I am going to create 7 files and 2 folders in total (5 regular files and 2 hidden files). The output of the `tree` command below clearly shows the list of files and folders.

```
# tree -a /opt
/opt
├── magi
│   └── 2g
│       ├── test5.txt
│       └── .test6.txt
├── test1.txt
├── test2.txt
├── test3.txt
├── .test4.txt
└── test.txt

2 directories, 7 files
```

### Example 1

Count the files in the current directory (excluding hidden files). Run the following command to determine how many files there are in the current directory, not counting dotfiles (Translator's note: dotfiles are files whose names begin with a ".", and they are hidden by default on Linux).

```
# ls -l . | egrep -c '^-'
4
```

**Details:**

* `ls`: list directory contents
* `-l`: use the long listing format
* `.`: the file to list information about (the current directory by default)
* `|`: a control operator that sends the output of one program to another program for further processing
* `egrep`: print lines matching a pattern
* `-c`: print a count of the matching lines rather than the lines themselves
* `'^-'`: lines beginning with "-" (in an `ls -l` long listing, a leading "-" marks a regular file)
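
For comparison, the same count can be had without parsing `ls` output at all; a minimal `find` sketch (assuming GNU find, which supports `-maxdepth`):

```
# find . -maxdepth 1 -type f ! -name '.*' | wc -l
4
```

`-maxdepth 1` keeps the search non-recursive and `! -name '.*'` filters out the dotfiles, which matches the `4` above on this sample tree.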

### Example 2

Count the files in the current directory including hidden files, i.e. the dotfiles in the current directory are counted as well.

```
# ls -la . | egrep -c '^-'
5
```

### Example 3

Run the following command to count the files and folders in the current directory. It counts all the files and directories together.

```
# ls -1 | wc -l
5
```

**Details:**

* `ls`: list directory contents
* `-1`: list one file per line
* `|`: a control operator that sends the output of one program to another program for further processing
* `wc`: a command that prints newline, word, and byte counts for each file
* `-l`: print the newline count
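
Be careful to type the digit `-1` here rather than the letter `-l`: the long listing format prints an extra `total` header line, which inflates the count. On the sample directory:

```
# ls -l | wc -l
6
# ls -1 | wc -l
5
```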

### Example 4

Count the files and folders in the current directory, including hidden files and directories.

```
# ls -1a | wc -l
8
```

### Example 5

Recursively count the files under the current directory, including hidden files.

```
# find . -type f | wc -l
7
```

**Details:**

* `find`: search for files in a directory hierarchy
* `-type`: the type of file to look for
* `f`: regular files
* `wc`: a command that prints newline, word, and byte counts for each file
* `-l`: print the newline count

### Example 6

Print the directory and file counts using the `tree` command (excluding hidden files).

```
# tree | tail -1
2 directories, 5 files
```

### Example 7

Print the directory and file counts using the `tree` command, including hidden files.

```
# tree -a | tail -1
2 directories, 7 files
```

### Example 8

Run the command below to recursively count the directories, including hidden ones.

```
# find . -type d | wc -l
3
```
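
Note that this count includes the starting point `.` itself (here: `.`, `magi` and `magi/2g`). To count only the subdirectories below it, a sketch assuming GNU find's `-mindepth` option:

```
# find . -mindepth 1 -type d | wc -l
2
```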

### Example 9

Count files by file extension. Here we count the `.txt` files.

```
# find . -name "*.txt" | wc -l
7
```
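
To break the count down by extension rather than checking one extension at a time, a small pipeline sketch built only from standard `find`, `sed`, `sort` and `uniq`:

```
# find . -type f -name '*.*' | sed 's/.*\.//' | sort | uniq -c
      7 txt
```

On the sample tree all seven files end in `.txt`, so a single line comes back.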

### Example 10

Count the files in the current directory by combining the `echo` and `wc` commands; the glob `*.*` expands to the non-hidden names containing a dot. The second field of the output, `4`, is the number of files in the current directory.

```
# echo *.* | wc
1 4 39
```

### Example 11

Count the directories in the current directory by combining the `echo` and `wc` commands. The second field of the output (the word count, `1`) is the number of directories in the current directory.

```
# echo */ | wc
1 1 6
```

### Example 12

Count all the files and directories in the current directory by combining the `echo` and `wc` commands. The `5` in the output is the combined number of directories and files in the current directory.

```
# echo * | wc
1 5 44
```
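
These `echo | wc` one-liners rely on `wc`'s word count, so they miscount names that contain whitespace. A more robust sketch prints one name per line and counts lines instead:

```
# printf '%s\n' * | wc -l
5
```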

### Example 13

Count the number of files on the whole system.

```
# find / -type f | wc -l
69769
```

### Example 14

Count the number of folders on the whole system.

```
# find / -type d | wc -l
8819
```
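
On a live system these whole-system totals also sweep up pseudo-filesystems such as `/proc` and `/sys`, as well as any network mounts. To restrict the count to the root filesystem only, `find`'s `-xdev` option can be used:

```
# find / -xdev -type f | wc -l
```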

### Example 15

Run the following command to count the files, folders, hard links and symbolic links on the whole system.

```
# find / -type d -exec echo dirs \; -o -type l -exec echo symlinks \; -o -type f -links +1 -exec echo hardlinks \; -o -type f -exec echo files \; | sort | uniq -c
8779 dirs
69343 files
20 hardlinks
11646 symlinks
```

--------------------------------------------------------------------------------

via: https://www.2daygeek.com/how-to-count-the-number-of-files-and-folders-directories-in-linux/

Author: [Magesh Maruthamuthu][a]
Translator: [Flowsnow](https://github.com/Flowsnow)
Proofreader: [wxy](https://github.com/wxy)

This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](https://linux.cn/).

[a]:https://www.2daygeek.com/author/magesh/

Fastest way to unzip a zip file in Python
======

So the context is this: a zip file is uploaded into a [web service][1] and Python then needs to extract it and analyze and deal with each file within. In this particular application, it looks at each file's individual name and size, compares that to what has already been uploaded in AWS S3, and if the file is believed to be different or new, it gets uploaded to AWS S3.

[![Uploads today][2]][3]

The challenge is that these zip files that come in are huuuge. The average is 560MB but some are as much as 1GB. Within them, there are mostly plain text files but there are some binary files in there too that are huge. It's not unusual that each zip file contains 100 files and 1-3 of those make up 95% of the zip file size.

At first I tried unzipping the file, in memory, and dealing with one file at a time. That failed spectacularly with various memory explosions and EC2 running out of memory. I guess it makes sense. First you have the 1GB file in RAM, then you unzip each file and now you have possibly 2-3GB all in memory. So, the solution, after much testing, was to dump the zip file to disk (in a temporary directory in `/tmp`) and then iterate over the files. This worked much better, but I still noticed the whole unzipping was taking up a huge amount of time. **Is there perhaps a way to optimize that?**
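
A minimal sketch of that dump-to-disk approach, assuming the upload arrives as a file-like buffer (the function name and the size-summing body here are illustrative, not the original code):

```
import os
import tempfile
import zipfile


def process_upload(file_buffer):
    # Spill the uploaded ZIP to /tmp, extract it there, then visit the
    # extracted files one at a time.
    total = 0
    with tempfile.TemporaryDirectory(dir="/tmp") as tmpdir:
        zip_path = os.path.join(tmpdir, "upload.zip")
        with open(zip_path, "wb") as out:
            out.write(file_buffer.read())
        dest = os.path.join(tmpdir, "extracted")
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(dest)
        for root, dirs, files in os.walk(dest):
            for name in files:
                total += os.path.getsize(os.path.join(root, name))
    return total
```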

### Baseline function

First, here are the common functions that simulate actually doing something with the files in the zip file:

```
# Imports shared by every snippet below.
import concurrent.futures
import os
import zipfile


def _count_file(fn):
    with open(fn, 'rb') as f:
        return _count_file_object(f)


def _count_file_object(f):
    # Note that this iterates on 'f'.
    # You *could* do 'return len(f.read())'
    # which would be faster but potentially memory
    # inefficient and unrealistic in terms of this
    # benchmark experiment.
    total = 0
    for line in f:
        total += len(line)
    return total
```

Here's the simplest one possible:

```
def f1(fn, dest):
    with open(fn, 'rb') as f:
        zf = zipfile.ZipFile(f)
        zf.extractall(dest)

    total = 0
    for root, dirs, files in os.walk(dest):
        for file_ in files:
            fn = os.path.join(root, file_)
            total += _count_file(fn)
    return total
```

If I analyze it a bit more carefully, I find that it spends about 40% of the time doing the `extractall` and 60% doing the looping over the files and reading their full length.
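
A rough way to reproduce that kind of measurement, for anyone following along (the 40/60 split is the post's own finding, not something this harness guarantees):

```
import time


def timed(fn, *args):
    # Run fn once and report wall-clock seconds.
    t0 = time.perf_counter()
    result = fn(*args)
    print(f"{fn.__name__}: {time.perf_counter() - t0:.2f}s -> {result}")
    return result

# e.g. timed(f1, 'symbols.zip', '/tmp/dest')  # paths are illustrative
```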

### First attempt

My first attempt was to try to use threads. You create an instance of `zipfile.ZipFile`, extract every file name within, and start a thread for each name. Each thread is given a function that does the "meat of the work" (in this benchmark, iterating over the file and getting its total size). In reality that function does a bunch of complicated S3, Redis and PostgreSQL stuff, but in my benchmark I just made it a function that figures out the total length of the file. The thread pool function:

```
def f2(fn, dest):

    def unzip_member(zf, member, dest):
        zf.extract(member, dest)
        fn = os.path.join(dest, member.filename)
        return _count_file(fn)

    with open(fn, 'rb') as f:
        zf = zipfile.ZipFile(f)
        futures = []
        with concurrent.futures.ThreadPoolExecutor() as executor:
            for member in zf.infolist():
                futures.append(
                    executor.submit(
                        unzip_member,
                        zf,
                        member,
                        dest,
                    )
                )
            total = 0
            for future in concurrent.futures.as_completed(futures):
                total += future.result()
    return total
```

**Result: ~10% faster**

### Second attempt

So perhaps the GIL is blocking me. The natural inclination is to try to use multiprocessing to spread the work across multiple available CPUs. But doing so has the disadvantage that you can't pass around a non-pickleable object, so you have to send just the filename to each future function:

```
def unzip_member_f3(zip_filepath, filename, dest):
    with open(zip_filepath, 'rb') as f:
        zf = zipfile.ZipFile(f)
        zf.extract(filename, dest)
    fn = os.path.join(dest, filename)
    return _count_file(fn)


def f3(fn, dest):
    with open(fn, 'rb') as f:
        zf = zipfile.ZipFile(f)
        futures = []
        with concurrent.futures.ProcessPoolExecutor() as executor:
            for member in zf.infolist():
                futures.append(
                    executor.submit(
                        unzip_member_f3,
                        fn,
                        member.filename,
                        dest,
                    )
                )
            total = 0
            for future in concurrent.futures.as_completed(futures):
                total += future.result()
    return total
```

**Result: ~300% faster**

### That's cheating!

The problem with using a pool of processes is that it requires that the original `.zip` file exists on disk. So in my web server, to use this solution, I'd first have to save the in-memory ZIP file to disk, then invoke this function. Not sure what the cost of that is, but it's not likely to be cheap.
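
Persisting the buffer is only a few lines; a hedged sketch (`f3` is the process-pool function above, while the helper's name and cleanup policy are mine):

```
import os
import shutil
import tempfile


def f3_from_buffer(file_buffer, dest):
    # Spill the in-memory ZIP to a real file so worker processes can open it.
    with tempfile.NamedTemporaryFile(suffix='.zip', delete=False) as tmp:
        shutil.copyfileobj(file_buffer, tmp)
        path = tmp.name
    try:
        return f3(path, dest)
    finally:
        os.remove(path)
```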

Well, it doesn't hurt to poke around. Perhaps it could be worth it if the extraction were significantly faster.

But remember! This optimization depends on using up as many CPUs as it possibly can. What if some of those other CPUs are needed for something else going on in `gunicorn`? Those other processes would have to patiently wait till there's a CPU available. Since there are other things going on in this server, I'm not sure I'm willing to let one process take over all the other CPUs.

### Conclusion

Doing it serially turns out to be quite nice. You're bound to one CPU but the performance is still pretty good. Also, just look at the difference in the code between `f1` and `f2`! With the `concurrent.futures` pool classes you can cap the number of CPUs it's allowed to use, but that doesn't feel great either. What if you get the number wrong in a virtual environment? Or if the number is too low, you don't benefit from spreading the workload and you're just paying the overhead of moving the work around?
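
If you do want a cap, the knob is `max_workers`; a sketch (the "leave two cores free" policy is just an example, not a recommendation from the post):

```
# Leave a couple of cores free for gunicorn and friends.
workers = max(1, (os.cpu_count() or 1) - 2)
with concurrent.futures.ProcessPoolExecutor(max_workers=workers) as executor:
    pass  # submit unzip_member_f3 jobs here, as in f3()
```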

I'm going to stick with `zipfile.ZipFile(file_buffer).extractall(temp_dir)`. It's good enough for this.

### Want to try your hands on it?

I did my benchmarking using a `c5.4xlarge` EC2 server. The files can be downloaded from:

```
wget https://www.peterbe.com/unzip-in-parallel/hack.unzip-in-parallel.py
wget https://www.peterbe.com/unzip-in-parallel/symbols-2017-11-27T14_15_30.zip
```

The `.zip` file there is 34MB, which is relatively small compared to what's happening on the server.

The `hack.unzip-in-parallel.py` is a hot mess. It contains a bunch of terrible hacks and ugly stuff but hopefully it's a start.

--------------------------------------------------------------------------------

via: https://www.peterbe.com/plog/fastest-way-to-unzip-a-zip-file-in-python

Author: [Peterbe][a]
Translator: [译者ID](https://github.com/译者ID)
Proofreader: [校对者ID](https://github.com/校对者ID)

This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](https://linux.cn/).

[a]:https://www.peterbe.com/
[1]:https://symbols.mozilla.org
[2]:https://cdn-2916.kxcdn.com/cache/b7/bb/b7bbcf60347a5fa91420f71bbeed6d37.png
[3]:https://cdn-2916.kxcdn.com/cache/e6/dc/e6dc20acd37d94239edbbc0727721e4a.png