Merge pull request #26638 from Yufei-Yan/master

Finish translating.
Xingyu.Wang 2022-07-30 23:01:51 +08:00 committed by GitHub
commit 4e00500bde
2 changed files with 182 additions and 188 deletions


@ -1,188 +0,0 @@
[#]: subject: "7 key components of observability in Python"
[#]: via: "https://opensource.com/article/21/11/observability-python"
[#]: author: "Moshe Zadka https://opensource.com/users/moshez"
[#]: collector: "lujun9972"
[#]: translator: "MCGA"
[#]: reviewer: " "
[#]: publisher: " "
[#]: url: " "
7 key components of observability in Python
======
Learn why observability is important for Python and how to implement it
into your software development lifecycle.
![Searching for code][1]
The applications you write execute a lot of code, in a way that's essentially invisible. So how can you know:
* Is the code working?
* Is it working well?
* Who's using it, and how?
Observability is the ability to look at data that tells you what your code is doing. In this context, the main problem area is server code in distributed systems. It's not that observability isn't important for client applications; it's that clients tend not to be written in Python. It's not that observability does not matter for, say, data science; it's that tooling for observability in data science (mostly Jupyter and quick feedback) is different.
### Why observability matters
So why does observability matter? Observability is a vital part of the software development life cycle (SDLC).
Shipping an application is not the end; it is the beginning of a new cycle. In that cycle, the first stage is to confirm that the new version is running well. Otherwise, a rollback is probably needed. Which features are working well? Which ones have subtle bugs? You need to know what's going on to know what to work on next. Things fail in weird ways. Whether it's a natural disaster, a rollout of underlying infrastructure, or an application getting into a strange state, things can fail at any time, for any reason.
Outside of the standard SDLC, you need to know that everything is still running. If it's not running, it's essential to have a way to know how it is failing.
### Feedback
The first part of observability is getting feedback. When code gives information about what it is doing, feedback can help in many ways. In a staging or testing environment, feedback helps find problems and, more importantly, triage them in a faster way. This improves the tooling and communication around the validation step.
When doing a canary deployment or changing a feature flag, feedback is also important to let you know whether to continue, wait longer, or roll it back.
### Monitor
Sometimes you suspect that something has gone wrong. Maybe a dependent service is having issues, or maybe social media is barraging you with questions about your site. Maybe there's a complicated operation in a related system, and you want to make sure your system is handling it well. In those cases, you want to aggregate the data from your observability system into dashboards.
When writing an application, these dashboards need to be part of the design criteria. The only way they have data to display is when your application shares it with them.
### Alerts
Watching dashboards for more than 15 minutes at a time is like watching paint dry. No human should be subjected to this. For that task, we have alerting systems. Alerting systems compare the observability data to the expected data and send a notification when it doesn't match up. Fully delving into incident management is beyond the scope of this article. However, observable applications are alert-friendly in two ways:
* They produce enough data, with enough quality, that high-quality alerts can be sent.
* The alert has enough data, or the receiver can easily get the data, to help triage the source.
High-quality alerts have three properties:
* Low false alarms: If there's an alert, there's definitely a problem.
* Low missing alarms: When there's a problem, an alert is triggered.
* Timely: An alert is sent quickly to minimize time to recovery.
These three properties are in a three-way conflict. You can reduce false alarms by raising the threshold of detection at the cost of increasing missing alarms. You can reduce missing alarms by lowering the threshold of detection at the expense of increasing false alarms. You can reduce both false alarms and missing alarms by collecting more data at the cost of timeliness.
Improving all three parameters is harder. This is where the quality of observability data comes in. Higher quality data can reduce all three.
### Logging
Some people like to make fun of print-based debugging. But in a world where most software runs on not-your-local-PC, print debugging is all you can do. Logging is a formalization of print debugging. The Python logging library, for all of its faults, allows standardized logging. Most importantly, it means you can log from libraries.
The application is responsible for configuring which logs go where. Ironically, after many years where applications were literally responsible for configuration, this is less and less true. Modern applications in a modern container orchestration environment log to standard error and standard output and trust the orchestration system to manage the log properly.
However, you should not rely on print in libraries, or really anywhere else. If you want to let the operator know what's going on, _use logging, not print_.
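As a minimal sketch of that application-level configuration (the handler name and format string are arbitrary choices, not a standard), a containerized application might route every logger to standard output with `logging.config.dictConfig` and leave the rest to the orchestrator:
```
import logging.config

# Send everything at INFO and above to standard output; the orchestrator
# (or whatever collects container output) takes it from there.
logging.config.dictConfig({
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "plain": {"format": "%(asctime)s %(name)s %(levelname)s %(message)s"},
    },
    "handlers": {
        "stdout": {
            "class": "logging.StreamHandler",
            "stream": "ext://sys.stdout",
            "formatter": "plain",
        },
    },
    "root": {"level": "INFO", "handlers": ["stdout"]},
})
```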
#### Logging levels
One of the most important features of logging is _logging levels_. Logging levels allow you to filter and route logs appropriately. But this can only be done if logging levels are consistent. At the very least, you should make them consistent across your applications.
With a little help, libraries that choose incompatible semantics can be retroactively fixed by appropriate configuration at the application level. Do this by following the most important universal convention in Python: getting loggers with `getLogger(__name__)`.
Most reasonable libraries follow this convention. Filters can modify logging objects in place before they are emitted. You can attach a filter to the handler that will modify the messages based on the name to have appropriate levels.
```
import logging

# A module-level logger named after the module that creates it
LOGGER = logging.getLogger(__name__)
```
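Building on that convention, here is a sketch of the filter trick. Assume a hypothetical library called `chatty_lib` that logs routine events at `ERROR`; because its loggers are named after its modules, a filter attached to the application's handler can demote those records in place:
```
import logging

class DemoteChattyLib(logging.Filter):
    """Relabel a hypothetical library's over-eager ERROR records as INFO."""

    def filter(self, record):
        if record.name.startswith("chatty_lib") and record.levelno == logging.ERROR:
            record.levelno = logging.INFO
            record.levelname = "INFO"
        return True  # keep every record; only the level is changed

handler = logging.StreamHandler()
handler.addFilter(DemoteChattyLib())
logging.getLogger().addHandler(handler)
```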
With this in mind, you now have to actually specify semantics for logging levels. There are a lot of options, but the following are my favorite (a short sketch of how they read at call sites follows the list):
* Error: This sends an immediate alert. The application is in a state that requires operator attention. (This means that Critical and Error are folded into one level.)
* Warning: I like to call these “Business hours alerts.” Someone should look at this within one business day.
* Info: This is emitted during normal flow. It's designed to help people understand what the application is doing if they already suspect a problem.
* Debug: This is not emitted in the production environment by default. It might or might not be emitted in development or staging, and it can be turned on explicitly in production if more information is needed.
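Here is what those semantics might look like in code; the function and messages are invented for illustration:
```
import logging

LOGGER = logging.getLogger(__name__)

def refresh_cache(fetch):
    """Illustrative only: 'fetch' is a hypothetical callable that loads fresh data."""
    LOGGER.info("starting cache refresh")        # normal flow
    LOGGER.debug("refresh callable: %r", fetch)  # usually off in production
    try:
        fetch()
    except TimeoutError:
        # Business-hours alert: degraded, but not an emergency.
        LOGGER.warning("cache refresh timed out; serving stale data")
    except Exception:
        # Immediate alert: the application needs operator attention.
        LOGGER.error("cache refresh failed", exc_info=True)
```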
In no case should you include PII (Personally Identifiable Information) or passwords in logs. This is true regardless of levels. Levels change, debug levels get activated, and so on. Logging aggregation systems are rarely PII-safe, especially with evolving PII regulation (HIPAA, GDPR, and others).
#### Log aggregation
Modern systems are almost always distributed. Redundancy, scaling, and sometimes jurisdictional needs mean horizontal distribution. Microservices mean vertical distribution. Logging into each machine to check the logs is no longer realistic. It is often a bad idea for proper control reasons: allowing developers to log into machines gives them too many privileges.
All logs should be sent into an aggregator. There are commercial offerings, you can configure an ELK stack, or you can use any other database (SQL or no-SQL). As a really low-tech solution, you can write the logs to files and ship them to an object storage. There are too many solutions to explain, but the most important thing is choosing one and aggregating everything.
#### Logging queries
After logging everything to one place, there are too many logs. The specific aggregator defines how you write queries, but whether you're grepping through storage or writing NoSQL queries, being able to query logs by source and details is useful.
### Metric scraping
Metrics scraping is a server pull model. The metrics server connects to the application periodically and pulls the metrics.
At the very least, this means the server needs connectivity and discovery for all relevant application servers.
#### Prometheus as a standard
The [Prometheus][2] format as an endpoint is useful if your metrics aggregator is Prometheus. But it is also useful if it is not! Almost all systems contain a compatibility shim for Prometheus endpoints.
Adding a Prometheus shim to your application using the client Python library allows it to be scraped by most metrics aggregators. Prometheus expects to find, once it discovers the server, a metrics endpoint. This is often part of the application routing, often at `/metrics`. Regardless of the platform of the web application, if you can serve a custom byte stream with a custom content type at a given endpoint, you can be scraped by Prometheus.
For the most popular frameworks, there is also a middleware plugin or something equivalent that automatically collects some metrics, like latency and error rates. This is not usually enough. You want to collect custom application data: for example, cache hit/miss rates per endpoint, database latency, and so on.
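As one possible sketch with the official `prometheus_client` library (the metric names, label, and port below are illustrative choices, not requirements), you can count per-endpoint cache hits and misses and expose them on a standalone metrics port:
```
from prometheus_client import Counter, start_http_server

# Illustrative metric names; a real application would pick its own.
CACHE_HITS = Counter("myapp_cache_hits_total", "Cache hits by endpoint", ["endpoint"])
CACHE_MISSES = Counter("myapp_cache_misses_total", "Cache misses by endpoint", ["endpoint"])

def cached_lookup(endpoint, cache, key):
    """Record per-endpoint hit/miss counts alongside the real work."""
    if key in cache:
        CACHE_HITS.labels(endpoint=endpoint).inc()
        return cache[key]
    CACHE_MISSES.labels(endpoint=endpoint).inc()
    return None

# Serve the metrics on a side port for Prometheus to scrape.
start_http_server(9090)
```
Mounting the same metrics under the application's own `/metrics` route works just as well; the standalone server is simply the shortest path for a sketch.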
#### Using counters
Prometheus supports several data types. One important and subtle type is the counter. Counters always advance—with one caveat.
When the application resets, the counter goes back to zero. These “epochs” in counters are managed by having the counter “creation time” sent as metadata. Prometheus will know not to compare counters from two different epochs.
#### Using gauges
Gauges are much simpler: They measure instantaneous values. Use them for measurements that go up and down: for example, total allocated memory, size of cache, and so on.
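A gauge sketch with `prometheus_client`, again with invented names:
```
from prometheus_client import Gauge

# An instantaneous value that can go up and down.
CACHE_SIZE = Gauge("myapp_cache_size_bytes", "Current size of the in-memory cache")

def record_cache_size(size_in_bytes):
    CACHE_SIZE.set(size_in_bytes)
```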
#### Using enums
Enums are useful for states of the application as a whole, although they can be collected on a more granular basis. For example, if you are using a feature-gating framework, a feature that can have several states (e.g., in use, disabled, shadowing) might be useful to have as an enum.
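An enum sketch mirroring the feature states mentioned above (the metric and state names are illustrative):
```
from prometheus_client import Enum

CHECKOUT_FLAG = Enum(
    "myapp_checkout_feature_state",
    "State of the checkout feature gate",
    states=["in_use", "disabled", "shadowing"],
)

# Report the current state of the feature gate.
CHECKOUT_FLAG.state("shadowing")
```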
### Analytics
Analytics are different from metrics in that they correspond to coherent events. For example, in network servers, an event is one outside request and its resulting work. In particular, the analytics event cannot be sent until the event is finished.
An event contains specific measurements: latency, number and possibly details of resulting requests to other services, and so on.
#### Structured Logging
One popular option is structured logging. Sending an event then just means emitting a log with a properly formatted payload. This data can be queried from the log aggregator, parsed, and ingested into an appropriate system that gives visibility into it.
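A minimal sketch using nothing beyond the standard library: emit one JSON document per finished event through the regular logging pipeline. The field names are invented for illustration:
```
import json
import logging
import time

LOGGER = logging.getLogger("events")

def handle_request(request_id, do_work):
    """Emit one structured event per finished request."""
    start = time.monotonic()
    outcome = "ok"
    try:
        do_work()
    except Exception:
        outcome = "error"
        raise
    finally:
        LOGGER.info(json.dumps({
            "event": "request_finished",
            "request_id": request_id,
            "latency_seconds": time.monotonic() - start,
            "outcome": outcome,
        }))
```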
### Error tracking
You can use logs to track errors, and you can use analytics to track errors. But a dedicated error system is worthwhile. A system optimized for errors can afford to send more data since errors are rare. It can send the right data, and it can do smart things with the data. Error-tracking systems in Python usually hook into a generic exception handler, collect data, and send it to a dedicated error aggregator.
#### Using Sentry
In many cases, running Sentry yourself is the right thing to do. When an error has occurred, something has gone wrong. Reliably removing sensitive data is not possible, since these are precisely the cases where the sensitive data might have ended up somewhere it shouldn't.
It is often not a big load: exceptions are supposed to be rare. Finally, this is not a system that needs high-quality, high-reliability backups. Yesterday's errors are already fixed, hopefully, and if they are not—you'll know!
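Wiring an application to a Sentry instance (self-hosted or otherwise) usually takes only a few lines with the `sentry_sdk` package. The DSN below is a placeholder for your own instance; once `init` has run, unhandled exceptions are reported automatically, and handled ones can be reported explicitly:
```
import sentry_sdk

# Placeholder DSN; point it at your own Sentry instance.
sentry_sdk.init(
    dsn="https://examplePublicKey@sentry.example.com/1",
    environment="production",
)

def risky_operation():
    raise RuntimeError("illustrative failure")

try:
    risky_operation()
except RuntimeError as exc:
    # Handled exceptions can still be sent to Sentry explicitly.
    sentry_sdk.capture_exception(exc)
```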
### Fast, safe, repeatable: choose all three
Observable systems are faster to develop since they give you feedback. They are safer to run since, when they go wrong, they let you know sooner. Finally, observability lends itself to building repeatable processes around it since there is a feedback loop. Observability gives you knowledge about your application. And knowing is half the battle.
#### Upfront investment pays off
Building all the observability layers is hard work. It also often feels like wasted work, or at least like “nice to have but not urgent.”
Can you build it later? Maybe, but you shouldn't. Building it right lets you speed up the rest of development so much at all stages: testing, monitoring, and even onboarding new people. In an industry with as much churn as tech, just reducing the overhead of onboarding a new person is worth it.
The fact is, observability is important, so write it in early in the process and maintain it throughout. In turn, it will help you maintain your software.
--------------------------------------------------------------------------------
via: https://opensource.com/article/21/11/observability-python
Author: [Moshe Zadka][a]
Selected by: [lujun9972][b]
Translator: [译者ID](https://github.com/译者ID)
Proofreader: [校对者ID](https://github.com/校对者ID)
This article was translated by [LCTT](https://github.com/LCTT/TranslateProject) and proudly presented by [Linux中国](https://linux.cn/)
[a]: https://opensource.com/users/moshez
[b]: https://github.com/lujun9972
[1]: https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/search_find_code_python_programming.png?itok=ynSL8XRV (Searching for code)
[2]: https://opensource.com/article/21/7/run-prometheus-home-container


@ -0,0 +1,182 @@
[#]: subject: "7 key components of observability in Python"
[#]: via: "https://opensource.com/article/21/11/observability-python"
[#]: author: "Moshe Zadka https://opensource.com/users/moshez"
[#]: collector: "lujun9972"
[#]: translator: "MCGA"
[#]: reviewer: " "
[#]: publisher: " "
[#]: url: " "
7 key components of observability in Python
======
Learn why observability is important for Python and how to implement it into your software development lifecycle.
![Searching for code][1]
The applications you write execute a lot of code, in a way that's essentially invisible. So how can you know:
* Is the code running?
* Is it working correctly?
* Who's using it, and how?
Observability is the ability to look at data that tells you what your code is doing. In this article, the main problem area is server code in distributed systems. That's not to say observability is unimportant for client application code; it's that clients tend not to be written in Python. Nor is it that observability doesn't matter for data science; it's that the observability tooling in data science (mostly Jupyter and quick feedback) is different.
### Why observability matters
So why does observability matter? Observability is a key part of the software development life cycle (SDLC).
Shipping an application is not the end; it is the beginning of a new cycle. In that cycle, the first stage is to confirm that the new version is running normally. Otherwise, a rollback will probably be needed. Which features are working well? Which ones have subtle bugs? You need to know what's going on in order to know what to work on next. Things fail in strange ways. Whether it's a natural disaster, a rollout of the underlying infrastructure, or an application getting into a strange state, things can stop working at any time, for any reason.
Outside of the standard SDLC, you need to know that everything is still running. If it isn't, it's essential to have a way of knowing how it is failing.
### Feedback
The first part of observability is getting feedback. When code gives information about what it is doing, feedback can help in many ways. In a staging or testing environment, feedback helps you find problems and, more importantly, triage them faster. This improves the tooling and communication around the validation step.
Feedback is also important when doing a canary deployment or changing a feature flag: it tells you whether to continue, wait longer, or roll back.
### Monitor
Sometimes you suspect that something has gone wrong. Maybe a dependent service is having issues, or maybe social media is flooding you with questions about your site. Maybe there's a complicated operation in a related system, and you want to make sure your system handles it well. In those cases, you want to aggregate the data from your observability system into dashboards.
When writing an application, these dashboards need to be part of the design criteria. They only have data to display when your application shares it with them.
### Alerts
Watching dashboards for more than 15 minutes at a time is like watching paint dry. No one should be subjected to that. For that task, we have alerting systems. An alerting system compares the observability data with the expected data and sends a notification when they don't match. Fully delving into incident management is beyond the scope of this article. However, observable applications are alert-friendly in two ways:
* They produce enough data, of high enough quality, that high-quality alerts can be sent.
* The alert carries enough data, or the receiver can easily get at the data, to help track down the source.
High-quality alerts have three properties:
* Few false alarms: If there's an alert, there's definitely a problem.
* Few missed alarms: When there's a problem, an alert is triggered.
* Timeliness: Alerts are sent quickly to minimize time to recovery.
These three properties conflict with one another. You can reduce false alarms by raising the detection threshold, at the cost of more missed alarms. You can reduce missed alarms by lowering the detection threshold, at the cost of more false alarms. By collecting more data, you can reduce both false alarms and missed alarms, at the cost of timeliness.
Improving all three parameters at once is harder. This is where the quality of the observability data comes in: higher-quality data can improve all three.
### Logging
Some people like to make fun of print-based debugging. But in a world where most software doesn't run on your own machine, print debugging is all you can do. Logging is a formalized form of print debugging. For all of its faults, the Python logging library allows standardized logging. Most importantly, it means you can log from libraries.
The application is responsible for configuring where logs go. Ironically, after many years in which applications really were responsible for that configuration, this is less and less true. In modern container orchestration environments, modern applications log to standard error and standard output and trust the orchestration system to handle the logs properly.
However, you should not rely on print in libraries, or really anywhere else. If you want to let the operators know what's going on, _use logging, not print_.
#### Logging levels
One of the most important features of logging is _logging levels_. Logging levels let you filter and route logs appropriately. But this can only be done if the levels are consistent. At the very least, you should keep them consistent across your applications.
With a little help, libraries that chose incompatible semantics can be retroactively fixed by appropriate configuration at the application level. Do this by following the most important universal convention in Python: getting loggers with `getLogger(__name__)`.
Most reasonable libraries follow this convention. Filters can modify logging objects in place before they are emitted. You can attach a filter to the handler that modifies messages based on the logger name so they carry the appropriate levels.
```
import logging

# A module-level logger named after the module that creates it
LOGGER = logging.getLogger(__name__)
```
With this in mind, you now have to give the logging levels actual semantics. There are a lot of options, but the following are my favorites:
* Error: This sends an immediate alert. The application is in a state that requires operator attention. (This means that Critical and Error are folded into one level.)
* Warning: I like to call these "business-hours alerts." Someone should look at them within one business day.
* Info: This is emitted during normal flow. It's designed to help people understand what the application is doing if they already suspect a problem.
* Debug: This is not emitted in production by default. It may or may not be emitted in development or staging, and it can be turned on explicitly in production if more information is needed.
In no case should you include PII (Personally Identifiable Information) or passwords in logs. This is true regardless of level. Levels change, debug levels get activated, and so on. Log aggregation systems are rarely PII-safe, especially with evolving PII regulation (HIPAA, GDPR, and others).
#### Log aggregation
Modern systems are almost always distributed. Redundancy, scaling, and sometimes jurisdictional needs mean horizontal distribution. Microservices mean vertical distribution. Logging into each machine to check the logs is no longer realistic. It is also often a bad idea for proper control reasons: allowing developers to log into machines gives them too many privileges.
All logs should be sent to an aggregator. There are commercial offerings, you can configure an ELK stack, or you can use any other database (SQL or no-SQL). As a really low-tech solution, you can write the logs to files and ship them to object storage. There are too many solutions to cover, but the most important thing is to choose one and aggregate everything.
#### Logging queries
Once everything is logged in one place, there are too many logs. The specific aggregator defines how you write queries, but whether it's grepping through storage or writing NoSQL queries, being able to query logs by source and details is useful.
### Metrics scraping
Metrics scraping is a server-pull model. The metrics server connects to the application periodically and pulls the metrics.
At the very least, this means the server needs connectivity to, and discovery of, all the relevant application servers.
#### Prometheus as a standard
The [Prometheus][2] format as an endpoint is useful if your metrics aggregator is Prometheus. But it is also useful if it is not! Almost all systems include a compatibility shim for Prometheus endpoints.
Adding a Prometheus shim to your application using the Python client library allows it to be scraped by most metrics aggregators. Once Prometheus discovers a server, it expects to find a metrics endpoint. This is often part of the application routing, usually at `/metrics`. Whatever platform the web application runs on, if you can serve a custom byte stream with a custom content type at a given endpoint, Prometheus can scrape it.
For the most popular frameworks, there is also a middleware plugin or something equivalent that automatically collects some metrics, like latency and error rates. Usually this is not enough. You want to collect custom application data: for example, per-endpoint cache hit/miss rates, database latency, and so on.
#### Using counters
Prometheus supports several data types. One important and subtle type is the counter. Counters always advance, with one caveat.
When the application resets, the counter goes back to zero. These "epochs" in counters are managed by sending the counter's "creation time" as metadata. Prometheus knows not to compare counters from two different epochs.
#### Using gauges
Gauges are much simpler: they measure instantaneous values. Use them for measurements that go up and down: for example, total allocated memory, cache size, and so on.
#### Using enums
Enums are useful for the state of the application as a whole, although they can be collected at a more granular level. For example, if you are using a feature-gating framework, a feature that can have several states (e.g., in use, disabled, shadowing) may be useful to expose as an enum.
### Analytics
Analytics differ from metrics in that they correspond to coherent events. For example, in network servers, an event is one outside request and the work that results from it. In particular, the analytics event cannot be sent until the event has finished.
An event contains specific measurements: latency, the number and possibly details of resulting requests to other services, and so on.
#### Structured logging
One currently popular option is structured logging. Sending an event then just means sending a log with a properly formatted payload. This data can be queried from the log aggregator, parsed, and ingested into an appropriate system that gives visibility into it.
### Error tracking
You can use logs to track errors, and you can use analytics to track errors. But a dedicated error system is worthwhile. A system optimized for errors can afford to send more data, because errors are, after all, rare. It can send the right data, and it can do smart things with that data. Error-tracking systems in Python usually hook into a generic exception handler, collect data, and send it to a dedicated error aggregator.
#### Using Sentry
In many cases, running Sentry yourself is the right thing to do. When an error has occurred, something has gone wrong. Reliably removing sensitive data is not possible, because these are precisely the cases where sensitive data may have ended up somewhere it shouldn't.
The load is usually not large: exceptions are supposed to be rare. Finally, this is not a system that needs high-quality, high-reliability backups. Yesterday's errors have hopefully already been fixed, and if they haven't, you'll find out!
### Fast, safe, repeatable: choose all three
Observable systems are faster to develop because they give you feedback. They are safer to run because, when things go wrong, they let you know sooner. Finally, because of that feedback loop, observability helps you build repeatable processes around it. Observability gives you knowledge about your application. And knowing is half the battle.
#### Upfront investment pays off
Building all the observability layers is hard work. It can also feel like wasted work, or at least like something that's "nice to have, but not urgent."
Can you build it later? Maybe, but you shouldn't. Building it right speeds up development at every later stage: testing, monitoring, and even onboarding new people. In an industry with as much churn as tech, just reducing the overhead of onboarding a new person is worth it.
The fact is, observability is important, so write it in early in the process and maintain it throughout. In turn, it will help you maintain your software.
--------------------------------------------------------------------------------
via: https://opensource.com/article/21/11/observability-python
Author: [Moshe Zadka][a]
Selected by: [lujun9972][b]
Translator: [译者ID](https://github.com/译者ID)
Proofreader: [Yufei-Yan](https://github.com/Yufei-Yan)
This article was translated by [LCTT](https://github.com/LCTT/TranslateProject) and proudly presented by [Linux中国](https://linux.cn/)
[a]: https://opensource.com/users/moshez
[b]: https://github.com/lujun9972
[1]: https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/search_find_code_python_programming.png?itok=ynSL8XRV (Searching for code)
[2]: https://opensource.com/article/21/7/run-prometheus-home-container