Merge pull request #4299 from ChrisLeeGit/master

完成翻译 Web Service Efficiency at Instagram with Python
This commit is contained in:
Ezio 2016-08-10 12:28:34 +08:00 committed by GitHub
commit 54760b557d
2 changed files with 78 additions and 75 deletions


@ -1,75 +0,0 @@
Being translated by ChrisLeeGit
Web Service Efficiency at Instagram with Python
===============================================
Instagram currently features the world's largest deployment of the Django web framework, which is written entirely in Python. We initially chose to use Python because of its reputation for simplicity and practicality, which aligns well with our philosophy of “do the simple thing first.” But simplicity can come with a tradeoff: efficiency. Instagram has doubled in size over the last two years and recently crossed 500 million users, so there is a strong need to maximize web service efficiency so that our platform can continue to scale smoothly. In the past year we've made our efficiency program a priority, and over the last six months we've been able to maintain our user growth without adding new capacity to our Django tiers. In this post, we'll share some of the tools we built and how we use them to optimize our daily deployment flow.
### Why Efficiency?
Instagram, like all software, is limited by physical constraints like servers and datacenter power. With these constraints in mind, there are two main goals we want to achieve with our efficiency program:
1. Instagram should be able to serve traffic normally, with continuous code rollouts, even when one data center region loses capacity due to a natural disaster, regional network issues, etc.
2. Instagram should be able to freely roll out new products and features without being blocked by capacity.
To meet these goals, we realized we needed to persistently monitor our system and battle regressions.
### Defining Efficiency
Web services are usually bottlenecked by the available CPU time on each server. Efficiency in this context means using the same amount of CPU resources to do more work, i.e., processing more user requests per second (RPS). As we look for ways to optimize, our first challenge is quantifying our current efficiency. Up to this point, we were approximating efficiency using the average CPU time per request, but there were two inherent limitations to using this metric:
1. Diversity of devices. Using CPU time for measuring CPU resources is not ideal because it is affected by both CPU models and CPU loads.
2. Request mix impacts the data. Measuring CPU resources per request is not ideal because adding or removing light or heavy requests also shifts the per-request efficiency metric.
Compared to CPU time, CPU instruction is a better metric, as it reports the same numbers regardless of CPU models and CPU loads for the same request. Instead of linking all our data to each user request, we chose to use a per active user metric. We eventually landed on measuring efficiency by using CPU instruction per active user during peak minute. With our new metric established, our next step was to learn more about our regressions by profiling Django.
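As a toy illustration of why the per-active-user metric is stabler than the per-request one (every number below is invented for the sketch, not Instagram data):

```python
# two sampled requests: one light, one heavy (instruction counts are made up)
requests = [1_000_000, 80_000_000]
active_users = 2  # suppose both requests came from two users at the peak minute

per_request = sum(requests) / len(requests)
per_active_user = sum(requests) / active_users

# adding one more *light* request drags the per-request average down sharply,
# even though the total work (and the user base) barely changed
requests.append(1_000_000)
per_request_after = sum(requests) / len(requests)
per_active_user_after = sum(requests) / active_users
```

The per-request average swings from 40.5M down toward 27M instructions after one cheap request is added, while the per-active-user figure barely moves, which is the stability property the metric is chosen for.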
### Profiling the Django Service
There are two major questions we want to answer by profiling our Django web service:
1. Does a CPU regression happen?
2. What causes the CPU regression and how do we fix it?
To answer the first question, we need to track the CPU-instruction-per-active-user metric. If this metric increases, we know a CPU regression has occurred.
The tool we built for this purpose is called Dynostats. Dynostats utilizes Django middleware to sample user requests at a certain rate, recording key efficiency and performance metrics such as total CPU instructions, end-to-end request latency, time spent accessing memcache and database services, etc. On the other hand, each request carries multiple metadata fields that we can use for aggregation, such as the endpoint name, the HTTP return code of the request, the name of the server that served it, and the latest commit hash on the request. Having both measurements and metadata on a single request record is especially powerful because we can slice and dice across various dimensions to narrow down the cause of any CPU regression. For example, we can aggregate all requests by their endpoint names, as shown in the time series chart below, where it is easy to spot whether a regression happens on a specific endpoint.
![](https://d262ilb51hltx0.cloudfront.net/max/800/1*3iouYiAchYBwzF-v0bALMw.png)
CPU instructions matter most for measuring efficiency, and they're also the hardest to get. Python does not have common libraries that support direct access to the CPU hardware counters (CPU hardware counters are CPU registers that can be programmed to measure performance metrics, such as CPU instructions). The Linux kernel, on the other hand, provides the perf_event_open system call. Bridging through Python ctypes enables us to call the syscall function in the standard C library, which also provides C-compatible data types for programming the hardware counters and reading data from them.
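A minimal sketch of that ctypes bridge might look as follows. It assumes Linux on x86-64 (where `perf_event_open` is syscall 298) and sufficient `perf_event_paranoid` permissions, and it declares only the first 64 bytes of `struct perf_event_attr` (`PERF_ATTR_SIZE_VER0`), which is enough for a plain counting event:

```python
import ctypes
import ctypes.util
import os
import struct

# constants from <linux/perf_event.h>
PERF_TYPE_HARDWARE = 0
PERF_COUNT_HW_INSTRUCTIONS = 1
SYS_perf_event_open = 298  # x86-64 syscall number (architecture-specific)

class PerfEventAttr(ctypes.Structure):
    """First 64 bytes (PERF_ATTR_SIZE_VER0) of struct perf_event_attr."""
    _fields_ = [
        ("type", ctypes.c_uint32),
        ("size", ctypes.c_uint32),
        ("config", ctypes.c_uint64),
        ("sample_period", ctypes.c_uint64),
        ("sample_type", ctypes.c_uint64),
        ("read_format", ctypes.c_uint64),
        ("flags", ctypes.c_uint64),      # bitfield: disabled, exclude_kernel, ...
        ("wakeup_events", ctypes.c_uint32),
        ("bp_type", ctypes.c_uint32),
        ("bp_addr", ctypes.c_uint64),
    ]

FLAG_EXCLUDE_KERNEL = 1 << 5  # count user-space instructions only
FLAG_EXCLUDE_HV = 1 << 6

def open_instruction_counter():
    """Open a counting event for retired instructions on the calling thread."""
    attr = PerfEventAttr()
    attr.type = PERF_TYPE_HARDWARE
    attr.size = ctypes.sizeof(PerfEventAttr)
    attr.config = PERF_COUNT_HW_INSTRUCTIONS
    attr.flags = FLAG_EXCLUDE_KERNEL | FLAG_EXCLUDE_HV
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    # pid=0 (this thread), cpu=-1 (any CPU), group_fd=-1, flags=0
    fd = libc.syscall(SYS_perf_event_open, ctypes.byref(attr), 0, -1, -1, 0)
    if fd < 0:
        raise OSError(ctypes.get_errno(), "perf_event_open failed")
    return fd

def read_instructions(fd):
    """Read the current 64-bit counter value from the event fd."""
    return struct.unpack("q", os.read(fd, 8))[0]
```

Calling `read_instructions` before and after a request handler yields the instruction count for that request; on other architectures the syscall number differs, and in containers the call may be denied outright.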
With Dynostats, we can already find CPU regressions and dig into their causes, such as which endpoint is impacted most, who committed the change that actually caused the regression, etc. However, when a developer is notified that their changes have caused a CPU regression, they usually have a hard time finding the problem. If it was obvious, the regression probably wouldn't have been committed in the first place!
That's why we needed a Python profiler that developers can use to find the root cause of a regression (once Dynostats identifies it). Instead of starting from scratch, we decided to make slight alterations to cProfile, a readily available Python profiler. The cProfile module normally provides a set of statistics describing how long and how often various parts of a program were executed. Instead of measuring in time, we took cProfile and replaced the timer with a CPU instruction counter that reads from the hardware counters. The data is created at the end of each sampled request and sent to our data pipelines, along with metadata similar to what we have in Dynostats: server name, cluster, region, endpoint name, etc.
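The timer swap can be sketched with cProfile's public `timer` parameter. Since the hardware counter itself is platform-specific, this sketch substitutes a nanosecond clock as a stand-in for the instruction counter; any monotonically increasing number source plugs into the same hook:

```python
import cProfile
import io
import pstats
import time

def read_counter():
    # stand-in for reading the CPU instruction hardware counter; in the real
    # system this would return the value of the perf counter instead
    return time.perf_counter_ns()

profiler = cProfile.Profile(timer=read_counter)
profiler.enable()
sum(i * i for i in range(10_000))  # the sampled request handling would go here
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("tottime").print_stats(5)
```

Every figure in the resulting report is then expressed in the counter's units (here nanoseconds, in production instructions) rather than wall-clock seconds.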
On the other side of the data pipeline, we created a tailer to consume the data. The main functionality of the tailer is to parse the cProfile stats data and create entities that represent Python function-level CPU instructions. By doing so, we can aggregate CPU instructions by Python functions, making it easier to tell which functions contribute to CPU regression.
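The entity creation the tailer performs can be sketched from cProfile's documented stats layout, where each key is a `(filename, line, function)` triple and each value holds call counts and totals; the `work` function here is just a stand-in workload:

```python
import cProfile

def work():
    return sum(i * i for i in range(1000))

profiler = cProfile.Profile()
profiler.runcall(work)
profiler.create_stats()

# profiler.stats maps (filename, lineno, funcname) ->
#   (call_count, primitive_calls, total_time, cumulative_time, callers);
# with the swapped-in timer, "time" totals are instruction counts instead
per_function = {
    func[2]: total
    for func, (cc, nc, total, ct, callers) in profiler.stats.items()
}
```

Aggregating these per-function totals across many sampled requests is what lets the pipeline attribute a regression to a specific Python function.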
### Monitoring and Alerting Mechanism
At Instagram, we [deploy our backend 30-50 times a day][1]. Any one of these deployments can contain troublesome CPU regressions. Since each rollout usually includes at least one diff, it is easy to identify the cause of any regression. Our efficiency monitoring mechanism includes scanning the CPU instructions in Dynostats before and after each rollout and sending out alerts when the change exceeds a certain threshold. For CPU regressions that happen over longer periods of time, we also have a detector that scans daily and weekly changes for the most heavily loaded endpoints.
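The before/after rollout comparison can be sketched as a simple threshold check; the 5% threshold and the per-endpoint metric dictionaries are illustrative assumptions:

```python
REGRESSION_THRESHOLD = 0.05  # assumed: alert on a >5% jump in the metric

def find_regressions(before, after, threshold=REGRESSION_THRESHOLD):
    """Compare per-endpoint metrics (e.g. CPU instructions per active user)
    before and after a rollout; return (endpoint, relative_change) alerts.
    Assumes baselines are nonzero."""
    alerts = []
    for endpoint, baseline in before.items():
        current = after.get(endpoint, baseline)
        change = (current - baseline) / baseline
        if change > threshold:
            alerts.append((endpoint, change))
    return alerts
```

The longer-horizon detector would run the same comparison against day-old and week-old baselines instead of the pre-rollout snapshot.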
Deploying new changes is not the only thing that can trigger a CPU regression. In many cases, new features or new code paths are controlled by global environment variables (GEVs), and rolling out new features to a subset of users on a planned schedule is a very common practice. We added this information as extra metadata fields on each request in the Dynostats and cProfile stats data. Grouping requests by those fields reveals possible CPU regressions caused by toggling GEVs. This enables us to catch CPU regressions before they can impact performance.
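Grouping sampled requests by a GEV metadata field might be sketched like this; the records, the `new_ranking` flag name, and the instruction counts are all hypothetical:

```python
from collections import defaultdict
from statistics import mean

# hypothetical sampled-request records carrying a GEV metadata field
records = [
    {"endpoint": "feed", "gev": {"new_ranking": True}, "instructions": 130},
    {"endpoint": "feed", "gev": {"new_ranking": True}, "instructions": 126},
    {"endpoint": "feed", "gev": {"new_ranking": False}, "instructions": 100},
    {"endpoint": "feed", "gev": {"new_ranking": False}, "instructions": 104},
]

# bucket requests by the flag's value, then compare the two groups
by_flag = defaultdict(list)
for record in records:
    by_flag[record["gev"]["new_ranking"]].append(record["instructions"])

# a large positive ratio ties the regression to the flag itself,
# independent of any code deployment
regression_ratio = mean(by_flag[True]) / mean(by_flag[False]) - 1
```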
### What's Next?
Dynostats and our customized cProfile, along with the monitoring and alerting mechanism we've built to support them, can effectively identify the culprit for most CPU regressions. These developments have helped us recover more than 50% of unnecessary CPU regressions, which would have otherwise gone unnoticed.
There are still areas where we can improve and make the system easier to embed into Instagram's daily deployment flow:
1. The CPU instruction metric is supposed to be more stable than other metrics like CPU time, but we still observe variances that make our alerting noisy. Keeping the signal-to-noise ratio reasonably high is important so that developers can focus on the real regressions. This could be improved by introducing the concept of confidence intervals and only alerting when the confidence is high. The variation threshold could also be set differently for different endpoints.
2. One limitation of detecting CPU regressions by GEV change is that we have to manually enable the logging of those comparisons in Dynostats. As the number of GEVs increases and more features are developed, this won't scale well. Instead, we could build an automatic framework that schedules the logging of these comparisons, iterates through all GEVs, and sends alerts when regressions are detected.
3. cProfile needs some enhancements to handle wrapper functions and their child functions better.
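The confidence-interval idea in item 1 above might be sketched as follows; the z-score and the sample data in the check are assumptions, and a production version would use a proper statistical test:

```python
from statistics import mean, stdev

def regression_confident(baseline_samples, current_samples, z=2.0):
    """Alert only when the current mean lies outside roughly a 95% interval
    around the baseline mean (a simple sketch, not a rigorous test)."""
    margin = z * stdev(baseline_samples) / len(baseline_samples) ** 0.5
    return mean(current_samples) > mean(baseline_samples) + margin
```

Small fluctuations inside the baseline's normal range then no longer fire alerts, while a genuine jump still does.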
With the work we've put into building the efficiency framework for Instagram's web service, we are confident that we will keep scaling our service infrastructure using Python. We've also started to invest more in the Python language itself, and are beginning to explore moving our Python from version 2 to 3. We will continue to explore this and more experiments to keep improving both infrastructure and developer efficiency, and look forward to sharing more soon.
--------------------------------------------------------------------------------
via: https://engineering.instagram.com/web-service-efficiency-at-instagram-with-python-4976d078e366#.tiakuoi4p
作者:[Min Ni][a]
译者:[ChrisLeeGit](https://github.com/chrisleegit)
校对:[校对者ID](https://github.com/校对者ID)
本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出
[a]: https://engineering.instagram.com/@InstagramEng?source=post_header_lockup
[1]: https://engineering.instagram.com/continuous-deployment-at-instagram-1e18548f01d1#.p5adp7kcz


@ -0,0 +1,78 @@
Instagram Web 服务效率与 Python
===============================================
Instagram 目前部署了世界上最大规模的 Django web 框架(该框架完全使用 Python 编写)。我们最初选用 Python 是因为它久负盛名的简洁性与实用性,这非常符合我们的哲学思想——“先做简单的事情”。但简洁性也会带来效率方面的折衷。Instagram 的规模在过去两年中已经翻番,并且最近已突破 5 亿用户,所以急需最大程度地提升 web 服务效率,以便我们的平台能够继续顺利地扩展。在过去的一年里,我们已经将效率计划efficiency program提上日程并且在过去的六个月中我们已经能够做到无需向我们的 Django 层Django tiers添加新的容量即可应对用户增长。我们将在本文分享一些我们构建的工具以及如何使用它们来优化我们的日常部署流程。
### 为何需要提升效率?
Instagram正如所有的软件一样受限于服务器和数据中心电力这样的物理限制。鉴于这些限制在我们的效率计划中有两个希望实现的主要目标
1. 在因自然灾害、区域性网络问题等导致某个数据中心区域的容量损失时Instagram 应当仍能在持续发布代码的情况下正常地提供服务。
2. Instagram 应当能够自由地滚动发布新产品和新功能,不必因容量而受阻。
想要实现这些目标,我们意识到需要持续不断地监控我们的系统,并与回归battle regression作斗争。
### 定义效率
Web 服务的瓶颈通常在于每台服务器上可用的 CPU 时间。在这种环境下,效率就意味着利用相同的 CPU 资源完成更多的任务也就是说每秒处理更多的用户请求requests per second, RPS。当我们寻找优化方法时我们面临的第一个挑战就是尝试量化我们当前的效率。到目前为止我们一直在使用“每次请求的平均 CPU 时间”来近似度量效率,但使用这种指标有两个固有的限制:
1. 设备多样性。使用 CPU 时间来测量 CPU 资源并非理想方案,因为它同时受到 CPU 型号与 CPU 负载的影响。
2. 请求的构成会影响数据。测量每次请求的 CPU 资源并非理想方案因为添加或移除轻量级或重量级的请求也会影响基于每次请求per-request measurement的效率指标。
相对于 CPU 时间来说CPU 指令是一种更好的指标,因为对于相同的请求,无论 CPU 型号和 CPU 负载情况如何,它都会报告相同的数字。我们选择使用一种“每个活动用户per active user”的指标而不是将所有数据关联到每个用户请求上。我们最终采用“每个活动用户在高峰期间的 CPU 指令数CPU instructions per active user during peak minute”来测量效率。确立好新的度量标准后我们的下一步就是通过对 Django 服务进行分析来更多地了解我们的回归问题。
### Django 服务分析
通过分析我们的 Django web 服务,我们希望回答两个主要问题:
1. 是否发生了 CPU 回归?
2. 是什么导致了 CPU 回归,以及我们该怎样修复它?
想要回答第一个问题,我们需要追踪“每个活动用户的 CPU 指令数CPU-instruction-per-active-user”指标。如果该指标增加我们就知道发生了一次 CPU 回归。
我们为此构建的工具叫做 Dynostats。Dynostats 利用 Django 中间件以一定的比率采样用户请求,记录关键的效率和性能指标,例如 CPU 总指令数、端到端请求时延、花费在访问内存缓存memcache和数据库服务上的时间等。另一方面每个请求都带有很多可用于聚合的元数据metadata例如端点名称、HTTP 请求返回码、服务该请求的服务器名称以及该请求对应的最新提交哈希值commit hash。单个请求记录同时拥有这两方面的信息非常强大因为我们可以在不同的维度上对数据进行切分从而帮助我们缩小导致 CPU 回归的原因的范围。例如,我们可以按照端点名称聚合所有请求,正如下面的时间序列图所示,从图中可以清晰地看出在特定端点上是否发生了回归。
![](https://d262ilb51hltx0.cloudfront.net/max/800/1*3iouYiAchYBwzF-v0bALMw.png)
CPU 指令对测量效率很重要——当然它们也很难获得。Python 并没有支持直接访问 CPU 硬件计数器的公共库CPU 硬件计数器是指可以编程用于测量性能指标(例如 CPU 指令数)的 CPU 寄存器。另一方面Linux 内核提供了 `perf_event_open` 系统调用。通过 Python 的 ctypes 桥接,我们能够调用标准 C 库中的 syscall 函数,它也为我们提供了兼容 C 的数据类型,从而可以对硬件计数器进行编程并从中读取数据。
使用 Dynostats我们已经可以找出 CPU 回归,并探究 CPU 回归发生的原因,例如哪个端点受到的影响最大,谁提交了真正导致 CPU 回归的变更等。然而,当开发者收到通知,得知他们的变更导致了一次 CPU 回归时,他们通常难以找出问题所在。如果问题很明显,那么这个回归一开始大概就不会被提交上来了!
这就是为何我们需要一个 Python 分析器,让开发者能够用它来找出回归(一旦 Dynostats 发现了它)发生的根本原因。我们决定对一个现成的 Python 分析器 cProfile 做适当的修改而不是从头开始开发。cProfile 模块通常会提供一个统计集合,用来描述程序各个部分的执行时长和执行频率。我们将 cProfile 的计时器timer替换成了一个从硬件计数器读取数据的 CPU 指令计数器,以此取代对时间的测量。数据会在被采样的请求结束时生成,并发送到数据流水线。我们也会发送与 Dynostats 中类似的元数据,例如服务器名称、集群、区域、端点名称等。
在数据流水线的另一边我们创建了一个消费数据的尾随者tailer。尾随者的主要功能是解析 cProfile 的统计数据,并创建能够表示 Python 函数级别的 CPU 指令的实体。如此,我们就能按照 Python 函数来聚合 CPU 指令,从而更加方便地找出是什么函数导致了 CPU 回归。
### 监控与警报机制
在 Instagram我们[每天部署 30-50 次后端服务][1]。这些部署中的任何一个都可能引入麻烦的 CPU 回归。由于每次发布通常至少包含一个差异diff所以很容易定位任何回归的原因。我们的效率监控机制包括在每次发布前后扫描 Dynostats 中的 CPU 指令,并在变更超出某个阈值时发出警报。对于在较长时间内发生的 CPU 回归,我们也有一个检测器,会对负载最重的端点进行每日和每周的变更扫描。
部署新的变更并非触发 CPU 回归的唯一情况。在许多情况下新的功能和新的代码路径是由全局环境变量global environment variables, GEV控制的。按照计划的时间表向一部分用户发布新功能是一种非常普遍的做法。我们在 Dynostats 和 cProfile 的统计数据中为每个请求添加了这类信息作为额外的元数据字段。按照这些字段对请求进行分组,可以发现因切换 GEV 而导致的潜在 CPU 回归问题。这让我们能够在 CPU 回归对性能造成影响之前就捕获到它们。
### 接下来是什么?
Dynostats 和我们定制的 cProfile以及我们为支持它们而建立的监控和警报机制能够有效地找出大多数 CPU 回归的元凶。这些成果已经帮助我们挽回了超过 50% 的不必要的 CPU 回归,否则这些回归根本不会被我们察觉。
我们仍然还有一些可以提升的方面,好让它们更方便地融入 Instagram 的日常部署流程中:
1. CPU 指令指标应该要比其它指标如 CPU 时间更加稳定但我们仍然观察到一些让警报变得嘈杂的波动。保持合理的高信噪比signal-to-noise ratio非常重要这样开发者们才能集中精力于真实的回归上。这可以通过引入置信区间confidence intervals的概念来改进仅在置信度较高时才发出警报。针对不同的端点波动的阈值也可以设置为不同的值。
2. 通过 GEV 变更检测 CPU 回归的一个限制是,我们必须在 Dynostats 中手动启用这些比较的日志记录。随着 GEV 数量的增加和更多功能的开发,这种方式将难以扩展。作为替代,我们可以利用一个自动化框架来调度这些比较的日志记录,遍历所有的 GEV并在检测到回归时发出警报。
3. cProfile 需要一些增强以便更好地处理包装函数wrapper functions及其子函数。
鉴于我们在为 Instagram 的 web 服务构建效率框架中所投入的工作,我们有信心使用 Python 继续扩展我们的服务基础设施。我们也开始向 Python 语言自身投入更多,并且开始探索将我们的 Python 从版本 2 迁移到版本 3。我们将继续探索并进行更多实验以不断提升基础设施与开发者效率我们期待着很快能分享更多的经验。
--------------------------------------------------------------------------------
via: https://engineering.instagram.com/web-service-efficiency-at-instagram-with-python-4976d078e366#.tiakuoi4p
作者:[Min Ni][a]
译者:[ChrisLeeGit](https://github.com/chrisleegit)
校对:[校对者ID](https://github.com/校对者ID)
本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出
[a]: https://engineering.instagram.com/@InstagramEng?source=post_header_lockup
[1]: https://engineering.instagram.com/continuous-deployment-at-instagram-1e18548f01d1#.p5adp7kcz