Merge pull request #19279 from geekpi/translating

translating
This commit is contained in:
geekpi 2020-08-10 08:47:09 +08:00 committed by GitHub
commit caf60880ae
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
2 changed files with 116 additions and 143 deletions

View File

@ -1,143 +0,0 @@
[#]: collector: (lujun9972)
[#]: translator: (geekpi)
[#]: reviewer: ( )
[#]: publisher: ( )
[#]: url: ( )
[#]: subject: (Analyze your web server log files with this Python tool)
[#]: via: (https://opensource.com/article/20/7/python-lars)
[#]: author: (Ben Nuttall https://opensource.com/users/bennuttall)
Analyze your web server log files with this Python tool
======
This Python module can collect website usage logs in multiple formats
and output well structured data for analysis.
![Person standing in front of a giant computer screen with numbers, data][1]
Ever wanted to know how many visitors you've had to your website? Or which pages, articles, or downloads are the most popular? If you're self-hosting your blog or website, whether you use Apache, Nginx, or even Microsoft IIS (yes, really), [lars][2] is here to help.
Lars is a web server-log toolkit for [Python][3]. That means you can use Python to parse log files retrospectively (or in real time) using simple code, and do whatever you want with the data—store it in a database, save it as a CSV file, or analyze it right away using more Python.
Lars is another hidden gem written by [Dave Jones][4]. I first saw Dave present lars at a local Python user group. Then a few years later, we started using it in the [piwheels][5] project to read in the Apache logs and insert rows into our Postgres database. In real time, as Raspberry Pi users download Python packages from [piwheels.org][6], we log the filename, timestamp, system architecture (Arm version), distro name/version, Python version, and so on. Since it's a relational database, we can join these results on other tables to get more contextual information about the file.
You can install lars with:
```
`$ pip install lars`
```
On some systems, the right route will be [ `sudo` ] `pip3 install lars`.
To get started, find a single web access log and make a copy of it. You'll want to download the log file onto your computer to play around with it. I'm using Apache logs in my examples, but with some small (and obvious) alterations, you can use Nginx or IIS. On a typical web server, you'll find Apache logs in `/var/log/apache2/` then usually `access.log` , `ssl_access.log` (for HTTPS), or gzipped rotated logfiles like `access-20200101.gz` or `ssl_access-20200101.gz` .
First of all, what does a log entry look like?
```
81.174.152.222 - - [30/Jun/2020:23:38:03 +0000] "GET / HTTP/1.1" 200 6763 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0"
```
This is a request showing the IP address of the origin of the request, the timestamp, the requested file path (in this case / , the homepage, the HTTP status code, the user agent (Firefox on Ubuntu), and so on.
Your log files will be full of entries like this, not just every single page hit, but every file and resource served—every CSS stylesheet, JavaScript file and image, every 404, every redirect, every bot crawl. To get any sensible data out of your logs, you need to parse, filter, and sort the entries. That's what lars is for. This example will open a single log file and print the contents of every row:
[code]
```
with open('ssl_access.log') as f:
    with ApacheSource(f) as source:
        for row in source:
            print(row)
```
```
Which will show results like this for every log entry:
```
```
Row(remote_host=IPv4Address('81.174.152.222'), ident=None, remote_user=None, time=DateTime(2020, 6, 30, 23, 38, 3), request=Request(method='GET', url=Url(scheme='', netloc='', path_str='/', params='', query_str='', fragment=''), protocol='HTTP/1.1'), status=200, size=6763)
```
It's parsed the log entry and put the data into a structured format. The entry has become a namedtuple with attributes relating to the entry data, so for example, you can access the status code with row.status and the path with row.request.url.path_str:
[code]
```
```
```
with open('ssl_access.log') as f:
    with ApacheSource(f) as source:
        for row in source:
            print(f'hit {row.request.url.path_str} with status code {row.status}')
```
```
If you wanted to show only the 404s, you could do:
[code]
```
with open('ssl_access.log') as f:
    with ApacheSource(f) as source:
        for row in source:
            if row.status == 404:
                print(row.request.url.path_str)
```
```
You might want to de-duplicate these and print the number of unique pages with 404s:
[code]
```
s = set()
with open('ssl_access.log') as f:
    with ApacheSource(f) as source:
        for row in source:
            if row.status == 404:
                s.add(row.request.url.path_str)
print(len(s))
```
```
Dave and I have been working on expanding piwheels' logger to include web-page hits, package searches, and more, and it's been a piece of cake, thanks to lars. It's not going to tell us any answers about our users—we still have to do the data analysis, but it's taken an awkward file format and put it into our database in a way we can make use of it.
Check out lars' documentation to see how to read Apache, Nginx, and IIS logs, and learn what else you can do with it. Thanks, yet again, to Dave for another great tool!
```
* * *
```
This originally appeared on Ben Nuttall's Tooling Blog and is republished with permission.
```
--------------------------------------------------------------------------------
via: https://opensource.com/article/20/7/python-lars
作者:[Ben Nuttall][a]
选题:[lujun9972][b]
译者:[译者ID](https://github.com/译者ID)
校对:[校对者ID](https://github.com/校对者ID)
本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出
[a]: https://opensource.com/users/bennuttall
[b]: https://github.com/lujun9972
[1]: https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/data_metrics_analytics_desktop_laptop.png?itok=9QXd7AUr (Person standing in front of a giant computer screen with numbers, data)
[2]: https://lars.readthedocs.io/en/latest/
[3]: https://opensource.com/resources/python
[4]: https://twitter.com/waveform80/
[5]: https://opensource.com/article/18/10/piwheels-python-raspberrypi
[6]: http://piwheels.org

View File

@ -0,0 +1,116 @@
[#]: collector: (lujun9972)
[#]: translator: (geekpi)
[#]: reviewer: ( )
[#]: publisher: ( )
[#]: url: ( )
[#]: subject: (Analyze your web server log files with this Python tool)
[#]: via: (https://opensource.com/article/20/7/python-lars)
[#]: author: (Ben Nuttall https://opensource.com/users/bennuttall)
使用这个 Python 工具分析你的 Web 服务器日志文件
======
这个 Python 模块可以以多种格式收集网站使用日志并输出良好结构化数据进行分析。
![Person standing in front of a giant computer screen with numbers, data][1]
是否曾经想知道有多少访问者访问过你的网站?或哪个页面、文章或下载最受欢迎?如果你是自托管的博客或网站,那么无论你使用的是 Apache、Nginx 还是 Microsoft IIS是的没错[lars][2]都可以为你提供帮助。
Lars 是 [Python][3] 写的 Web 服务器日志工具包。这意味着你可以使用 Python 通过简单的代码回溯(或实时)解析日志,并对数据做任何你想做的事:将它存储在数据库中、另存为 CSV 文件,或者立即使用 Python 进行更多分析。
Lars 是 [Dave Jones][4] 写的另一个隐藏的宝石。我第一次在本地 Python 用户中看到 Dave 演示 lars。几年后我们开始在 [piwheels][5] 项目中使用它来读取 Apache 日志并将行插入到我们的 Postgres 数据库中。当树莓派用户从 [piwheels.org][6] 下载 Python 包时我们会记录文件名、时间戳、系统架构Arm 版本)、发行版名称/版本Python 版本等。由于它是一个关系数据库,因此我们可以将这些结果加入其他表中以获得有关文件的更多上下文信息。
你可以使用以下方法安装lars
```
`$ pip install lars`
```
在某些系统上,正确的方式是 [ `sudo` ] `pip3 install lars`.
首先,找到一个 Web 访问日志并制作一个副本。你需要将日志文件下载到计算机上进行操作。我在示例中使用的是 Apache 日志,但是经过一些小(且明显)的更改,你可以使用 Nginx 或 IIS。在典型的 Web 服务器上,你会在 `/var/log/apache2/` 中找到 Apache 日志,通常是 access.log、ssl_access.log对于 HTTPS或 gzip 压缩后的轮转日志文件,如 `access-20200101.gz` 或者 `ssl_access-20200101.gz`
首先,日志是什么样的?
```
81.174.152.222 - - [30/Jun/2020:23:38:03 +0000] "GET / HTTP/1.1" 200 6763 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0"
```
这是一个显示了请求源 IP 地址、时间戳、请求文件路径(在本例中是主页 /、HTTP 状态代码用户代理Ubuntu 上的 Firefox等的请求。
你的日志文件将充满这样的条目,不仅是每个打开的页面,还包括返回的每个文件和资源:每个 CSS 样式表、JavaScript 文件和图像,每个 404、每个重定向、每个爬虫。要从日志中获取有意义的数据你需要对条目进行解析、过滤和排序。这就是 Lars 的用处。本示例将打开一个日志文件并打印每一行的内容:
```
with open('ssl_access.log') as f:
    with ApacheSource(f) as source:
        for row in source:
            print(row)
```
它会为每条日志显示如下结果:
```
Row(remote_host=IPv4Address('81.174.152.222'), ident=None, remote_user=None, time=DateTime(2020, 6, 30, 23, 38, 3), request=Request(method='GET', url=Url(scheme='', netloc='', path_str='/', params='', query_str='', fragment=''), protocol='HTTP/1.1'), status=200, size=6763)
```
它解析了日志条目,并将数据放入结构化格式中。该条目已成为具有与条目数据相关属性的 namedtuple因此例如你可以使用 row.status 访问状态代码,并使用 row.request.url.path_str 访问路径:
```
with open('ssl_access.log') as f:
    with ApacheSource(f) as source:
        for row in source:
            print(f'hit {row.request.url.path_str} with status code {row.status}')
```
如果你只想显示 404可以执行以下操作
```
with open('ssl_access.log') as f:
    with ApacheSource(f) as source:
        for row in source:
            if row.status == 404:
                print(row.request.url.path_str)
```
你可能要对这些数据去重,并打印独立的 404 页面数量:
```
s = set()
with open('ssl_access.log') as f:
    with ApacheSource(f) as source:
        for row in source:
            if row.status == 404:
                s.add(row.request.url.path_str)
print(len(s))
```
我和 Dave 一直在努力扩展 piwheel 的日志记录器,使其包含网页点击量、软件包搜索等内容,这些并不难,都要归功于 lars。它不会告诉我们有关用户的任何答案。我们仍然需要进行数据分析但它采取了一种粗糙的文件格式并以我们可以利用它的方式将它放入我们的数据库。
查阅 lars 的文档,以了解如何读取 Apache、Nginx 和 IIS 日志,并了解你还可以使用它做什么。再次感谢 Dave 提供的出色工具!
* * *
它最初发布在 Ben Nuttall 的 Tooling Blog 中,并获许重新发布。
--------------------------------------------------------------------------------
via: https://opensource.com/article/20/7/python-lars
作者:[Ben Nuttall][a]
选题:[lujun9972][b]
译者:[geekpi](https://github.com/geekpi)
校对:[校对者ID](https://github.com/校对者ID)
本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出
[a]: https://opensource.com/users/bennuttall
[b]: https://github.com/lujun9972
[1]: https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/data_metrics_analytics_desktop_laptop.png?itok=9QXd7AUr (Person standing in front of a giant computer screen with numbers, data)
[2]: https://lars.readthedocs.io/en/latest/
[3]: https://opensource.com/resources/python
[4]: https://twitter.com/waveform80/
[5]: https://opensource.com/article/18/10/piwheels-python-raspberrypi
[6]: http://piwheels.org