Merge pull request #11114 from belitex/master

翻译完成:What breaks our systems- A taxonomy of black swans
This commit is contained in:
Xingyu.Wang 2018-11-08 21:06:44 +08:00 committed by GitHub
commit 194b993122
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
2 changed files with 122 additions and 134 deletions

View File

@ -1,134 +0,0 @@
translating by belitex
What breaks our systems: A taxonomy of black swans
======
Find and fix outlier events that create issues before they trigger severe production problems.
![](https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/black-swan-pair_0.png?itok=MkshwqVg)
Black swans are a metaphor for outlier events that are severe in impact (like the 2008 financial crash). In production systems, these are the incidents that trigger problems that you didn't know you had, cause major visible impact, and can't be fixed quickly and easily by a rollback or some other standard response from your on-call playbook. They are the events you tell new engineers about years after the fact.
Black swans, by definition, can't be predicted, but sometimes there are patterns we can find and use to create defenses against categories of related problems.
For example, a large proportion of failures are a direct result of changes (code, environment, or configuration). Each bug triggered in this way is distinctive and unpredictable, but the common practice of canarying all changes is somewhat effective against this class of problems, and automated rollbacks have become a standard mitigation.
As our profession continues to mature, other kinds of problems are becoming well-understood classes of hazards with generalized prevention strategies.
### Black swans observed in the wild
All technology organizations have production problems, but not all of them share their analyses. The organizations that publicly discuss incidents are doing us all a service. The following incidents describe one class of a problem and are by no means isolated instances. We all have black swans lurking in our systems; it's just some of us don't know it yet.
#### Hitting limits
Running headlong into any sort of limit can produce very severe incidents. A canonical example of this was [Instapaper's outage in February 2017][1] . I challenge any engineer who has carried a pager to read the outage report without a chill running up their spine. Instapaper's production database was on a filesystem that, unknown to the team running the service, had a 2TB limit. With no warning, it stopped accepting writes. Full recovery took days and required migrating its database.
Limits can strike in various ways. Sentry hit [limits on maximum transaction IDs in Postgres][2] . Platform.sh hit [size limits on a pipe buffer][3] . SparkPost [triggered AWS's DDoS protection][4] . Foursquare hit a performance cliff when one of its [datastores ran out of RAM][5]
One way to get advance knowledge of system limits is to test periodically. Good load testing (on a production replica) ought to involve write transactions and should involve growing each datastore beyond its current production size. It's easy to forget to test things that aren't your main datastores (such as Zookeeper). If you hit limits during testing, you have time to fix the problems. Given that resolution of limits-related issues can involve major changes (like splitting a datastore), time is invaluable.
When it comes to cloud services, if your service generates unusual loads or uses less widely used products or features (such as older or newer ones), you may be more at risk of hitting limits. It's worth load testing these, too. But warn your cloud provider first.
Finally, where limits are known, add monitoring (with associated documentation) so you will know when your systems are approaching those ceilings. Don't rely on people still being around to remember.
#### Spreading slowness
> "The world is much more correlated than we give credit to. And so we see more of what Nassim Taleb calls 'black swan events'—rare events happen more often than they should because the world is more correlated."
> —[Richard Thaler][6]
HostedGraphite's postmortem on how an [AWS outage took down its load balancers][7] (which are not hosted on AWS) is a good example of just how much correlation exists in distributed computing systems. In this case, the load-balancer connection pools were saturated by slow connections from customers that were hosted in AWS. The same kinds of saturation can happen with application threads, locks, and database connections—any kind of resource monopolized by slow operations.
HostedGraphite's incident is an example of externally imposed slowness, but often slowness can result from saturation somewhere in your own system creating a cascade and causing other parts of your system to slow down. An [incident at Spotify][8] demonstrates such spread—the streaming service's frontends became unhealthy due to saturation in a different microservice. Enforcing deadlines for all requests, as well as limiting the length of request queues, can prevent such spread. Your service will serve at least some traffic, and recovery will be easier because fewer parts of your system will be broken.
Retries should be limited with exponential backoff and some jitter. An outage at Square, in which its [Redis datastore became overloaded][9] due to a piece of code that retried failed transactions up to 500 times with no backoff, demonstrates the potential severity of excessive retries. The [Circuit Breaker][10] design pattern can be helpful here, too.
Dashboards should be designed to clearly show [utilization, saturation, and errors][11] for all resources so problems can be found quickly.
#### Thundering herds
Often, failure scenarios arise when a system is under unusually heavy load. This can arise organically from users, but often it arises from systems. A surge of cron jobs that starts at midnight is a venerable example. Mobile clients can also be a source of coordinated demand if they are programmed to fetch updates at the same time (of course, it is much better to jitter such requests).
Events occurring at pre-configured times aren't the only source of thundering herds. Slack experienced [multiple outages][12] over a short time due to large numbers of clients being disconnected and immediately reconnecting, causing large spikes of load. CircleCI saw a [severe outage][13] when a GitLab outage ended, leading to a surge of builds queued in its database, which became saturated and very slow.
Almost any service can be the target of a thundering herd. Planning for such eventualities—and testing that your plan works as intended—is therefore a must. Client backoff and [load shedding][14] are often core to such approaches.
If your systems must constantly ingest data that can't be dropped, it's key to have a scalable way to buffer this data in a queue for later processing.
#### Automation systems are complex systems
> "Complex systems are intrinsically hazardous systems."
> —[Richard Cook, MD][15]
If your systems must constantly ingest data that can't be dropped, it's key to have a scalable way to buffer this data in a queue for later processing.
The trend for the past several years has been strongly towards more automation of software operations. Automation of anything that can reduce your system's capacity (e.g., erasing disks, decommissioning devices, taking down serving jobs) needs to be done with care. Accidents (due to bugs or incorrect invocations) with this kind of automation can take down your system very efficiently, potentially in ways that are hard to recover from.
The trend for the past several years has been strongly towards more automation of software operations. Automation of anything that can reduce your system's capacity (e.g., erasing disks, decommissioning devices, taking down serving jobs) needs to be done with care. Accidents (due to bugs or incorrect invocations) with this kind of automation can take down your system very efficiently, potentially in ways that are hard to recover from.
Christina Schulman and Etienne Perot of Google describe some examples in their talk [Help Protect Your Data Centers with Safety Constraints][16]. One incident sent Google's entire in-house content delivery network (CDN) to disk-erase.
Schulman and Perot suggest using a central service to manage constraints, which limits the pace at which destructive automation can operate, and being aware of system conditions (for example, avoiding destructive operations if the service has recently had an alert).
Automation systems can also cause havoc when they interact with operators (or with other automated systems). [Reddit][17] experienced a major outage when its automation restarted a system that operators had stopped for maintenance. Once you have multiple automation systems, their potential interactions become extremely complex and impossible to predict.
It will help to deal with the inevitable surprises if all this automation writes logs to an easily searchable, central place. Automation systems should always have a mechanism to allow them to be quickly turned off (fully or only for a subset of operations or targets).
### Defense against the dark swans
These are not the only black swans that might be waiting to strike your systems. There are many other kinds of severe problem that can be avoided using techniques such as canarying, load testing, chaos engineering, disaster testing, and fuzz testing—and of course designing for redundancy and resiliency. Even with all that, at some point your system will fail.
To ensure your organization can respond effectively, make sure your key technical staff and your leadership have a way to coordinate during an outage. For example, one unpleasant issue you might have to deal with is a complete outage of your network. It's important to have a fail-safe communications channel completely independent of your own infrastructure and its dependencies. For instance, if you run on AWS, using a service that also runs on AWS as your fail-safe communication method is not a good idea. A phone bridge or an IRC server that runs somewhere separate from your main systems is good. Make sure everyone knows what the communications platform is and practices using it.
Another principle is to ensure that your monitoring and your operational tools rely on your production systems as little as possible. Separate your control and your data planes so you can make changes even when systems are not healthy. Don't use a single message queue for both data processing and config changes or monitoring, for example—use separate instances. In [SparkPost: The Day the DNS Died][4], Jeremy Blosser presents an example where critical tools relied on the production DNS setup, which failed.
### The psychology of battling the black swan
To ensure your organization can respond effectively, make sure your key technical staff and your leadership have a way to coordinate during an outage.
Dealing with major incidents in production can be stressful. It really helps to have a structured incident-management process in place for these situations. Many technology organizations (
Dealing with major incidents in production can be stressful. It really helps to have a structured incident-management process in place for these situations. Many technology organizations ( [including Google][18] ) successfully use a version of FEMA's Incident Command System. There should be a clear way for any on-call individual to call for assistance in the event of a major problem they can't resolve alone.
For long-running incidents, it's important to make sure people don't work for unreasonable lengths of time and get breaks to eat and sleep (uninterrupted by a pager). It's easy for exhausted engineers to make a mistake or overlook something that might resolve the incident faster.
### Learn more
There are many other things that could be said about black (or formerly black) swans and strategies for dealing with them. If you'd like to learn more, I highly recommend these two books dealing with resilience and stability in production: Susan Fowler's [Production-Ready Microservices][19] and Michael T. Nygard's [Release It!][20].
Laura Nolan will present [What Breaks Our Systems: A Taxonomy of Black Swans][21] at [LISA18][22], October 29-31 in Nashville, Tennessee, USA.
--------------------------------------------------------------------------------
via: https://opensource.com/article/18/10/taxonomy-black-swans
作者:[Laura Nolan][a]
选题:[lujun9972][b]
译者:[译者ID](https://github.com/译者ID)
校对:[校对者ID](https://github.com/校对者ID)
本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出
[a]: https://opensource.com/users/lauranolan
[b]: https://github.com/lujun9972
[1]: https://medium.com/making-instapaper/instapaper-outage-cause-recovery-3c32a7e9cc5f
[2]: https://blog.sentry.io/2015/07/23/transaction-id-wraparound-in-postgres.html
[3]: https://medium.com/@florian_7764/technical-post-mortem-of-the-august-incident-82ab4c3d6547
[4]: https://www.usenix.org/conference/srecon18americas/presentation/blosser
[5]: https://groups.google.com/forum/#!topic/mongodb-user/UoqU8ofp134
[6]: https://en.wikipedia.org/wiki/Richard_Thaler
[7]: https://blog.hostedgraphite.com/2018/03/01/spooky-action-at-a-distance-how-an-aws-outage-ate-our-load-balancer/
[8]: https://labs.spotify.com/2013/06/04/incident-management-at-spotify/
[9]: https://medium.com/square-corner-blog/incident-summary-2017-03-16-2f65be39297
[10]: https://en.wikipedia.org/wiki/Circuit_breaker_design_pattern
[11]: http://www.brendangregg.com/usemethod.html
[12]: https://slackhq.com/this-was-not-normal-really
[13]: https://circleci.statuspage.io/incidents/hr0mm9xmm3x6
[14]: https://www.youtube.com/watch?v=XNEIkivvaV4
[15]: https://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf
[16]: https://www.usenix.org/conference/srecon18americas/presentation/schulman
[17]: https://www.reddit.com/r/announcements/comments/4y0m56/why_reddit_was_down_on_aug_11/
[18]: https://landing.google.com/sre/book/chapters/managing-incidents.html
[19]: http://shop.oreilly.com/product/0636920053675.do
[20]: https://www.oreilly.com/library/view/release-it/9781680500264/
[21]: https://www.usenix.org/conference/lisa18/presentation/nolan
[22]: https://www.usenix.org/conference/lisa18

View File

@ -0,0 +1,122 @@
让系统崩溃的黑天鹅分类
======
在严重的故障发生之前,找到引起问题的异常事件,并修复它。
![](https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/black-swan-pair_0.png?itok=MkshwqVg)
黑天鹅用来比喻造成严重影响的小概率事件(比如 2008 年的金融危机)。在生产环境的系统中,黑天鹅是指这样的事情:它引发了你不知道的问题,造成了重大影响,不能快速修复或回滚,也不能用值班说明书上的其他标准响应来解决。它是事发几年后你还在给新人说起的事件。
从定义上看,黑天鹅是不可预测的,不过有时候我们能找到其中的一些模式,针对有关联的某一类问题准备防御措施。
例如,大部分故障的直接原因是变更(代码、环境或配置)。虽然这种方式触发的 bug 是独特的,不可预测的,但是常见的金丝雀发布对避免这类问题有一定的作用,而且自动回滚已经成了一种标准止损策略。
随着我们的专业性不断成熟,一些其他的问题也正逐渐变得容易理解,被归类到某种风险并有普适的预防策略。
### 公布出来的黑天鹅事件
所有科技公司都有生产环境的故障,只不过并不是所有公司都会分享他们的事故分析。那些公开讨论事故的公司帮了我们的忙。下列事故都描述了某一类问题,但它们绝对不是只属于一个类别。我们的系统中都有黑天鹅在潜伏着,只是有些人还不知道而已。
#### 达到上限
达到任何类型的限制都会引发严重事故。这类问题的一个典型例子是 2017 年 2 月 [Instapaper 的一次服务中断][1]。我把这份事故报告给任何一个运维工作者看他们读完都会脊背发凉。Instapaper 生产环境的数据库所在的文件系统有 2 TB 的大小限制,但是数据库服务团队并不知情。在没有任何报错的情况下,数据库不再接受任何写入了。完全恢复需要好几天,而且还得迁移数据库。
资源限制有各式各样的触发场景。Sentry 遇到了 [Postgres 的最大事务 ID 限制][2]。Platform.sh 遇到了[管道缓冲区大小限制][3]。SparkPost [触发了 AWS 的 DDos 保护][4]。Foursquare 在他们的一个 [MongoDB 耗尽内存][5]时遭遇了性能骤降。
提前了解系统限制的一个办法是定期做测试。好的压力测试(在生产环境的副本上做)应该包含写入事务,并且应该把每一种数据存储都写到超过当前生产环境的容量。压力测试时很容易忽略的是次要存储(比如 Zookeeper。如果你是在测试时遇到了资源限制那么你还有时间去解决问题。鉴于这种资源限制问题的解决方案可能涉及重大的变更比如数据存储拆分所以时间是非常宝贵的。
说到云产品的使用,如果你的服务产生了异常的负载,或者你用的产品或功能还没有被广泛使用(比如老旧的或者新兴的),那么你遇到资源上限的风险很大。对这些云产品做一下压力测试是值得的。不过,做之前要提醒一下你的云服务提供商。
最后,知道了哪里有限制之后,要增加监控(和对应文档),这样你才能知道系统在什么时候接近了资源上限。不要寄希望于那些还在维护服务的人会记得。
#### 扩散的慢请求
> “这个世界的关联性远比我们想象中更大。所以我们看到了更多 Nassim Taleb 所说的‘黑天鹅事件’ —— 即罕见事件以更高的频率离谱地发生了,因为世界是相互关联的”
> — [Richard Thaler][6]
HostedGraphite 的负载均衡器并没有托管在 AWS 上,却[被 AWS 的服务中断给搞垮了][7],他们关于这次事故原因的分析报告很好地诠释了分布式计算系统之间存在多么大的关联。在这个事件里,负载均衡器的连接池被来自 AWS 上的客户访问占满了,因为这些连接很耗时。同样的现象还会发生在应用的线程、锁、数据库连接上 —— 任何能被慢操作占满的资源。
这个 HostedGraphite 的例子中,慢速连接是外部系统施加的,不过慢速连接经常是由内部某个系统的饱和所引起的,饱和与慢操作的级联,拖慢了系统中的其他部分。[Spotify 的一个事故][8]就说明了这样的传播 —— 流媒体服务的前端被另一个微服务的饱和所影响,造成健康检查失败。强制给所有请求设置超时时间,以及限制请求队列的长度,可以预防这一类故障传播。这样即使有问题,至少你的服务还能承担一些流量,而且因为整体上你的系统里故障的部分更少了,恢复起来也会更快。
重试的间隔应该用指数退避来限制一下并加入一些时间抖动。Square 有一次服务中断是 [Redis 存储的过载][9],原因是有一段代码对失败的事务重试了 500 次,没有任何重试退避的方案,也说明了过度重试的潜在风险。另外,针对这种情况,[断路器][10]设计模式也是有用的。
应该设计出监控仪表盘来清晰地展示所有资源的[使用率,饱和度和报错][11],这样才能快速发现问题。
#### 突发的高负载
系统在异常高的负载下经常会发生故障。用户天然会引发高负载,不过也常常是由系统引发的。午夜突发的 cron 定时任务是老生常谈了。如果程序让移动客户端同时去获取更新,这些客户端也会造成突发的大流量(当然,给这种请求加入时间抖动会好很多)。
在预定时刻同时发生的事件并不是突发大流量的唯一原因。Slack 经历过一次短时间内的[多次服务中断][12],原因是非常多的客户端断开连接后立即重连,造成了突发的大负载。 CircleCI 也经历过一次[严重的服务中断][13],当时 Gitlab 从故障中恢复了,所以数据库里积累了大量的构建任务队列,服务变得饱和而且缓慢。
几乎所有的服务都会受突发的高负载所影响。所以对这类可能出现的事情做应急预案——并测试一下预案能否正常工作——是必须的。客户端退避和[减载][14]通常是这些方案的核心。
如果你的系统必须不间断地接收数据,并且数据不能被丢掉,关键是用可伸缩的方式把数据缓冲到队列中,后续再处理。
#### 自动化系统是复杂的系统
> “复杂的系统本身就是有风险的系统”
> —— [Richard Cook, MD][15]
过去几年里软件的运维操作趋势是更加自动化。任何可能降低系统容量的自动化操作(比如擦除磁盘,退役设备,关闭服务)都应该谨慎操作。这类自动化操作的故障(由于系统有 bug 或者有不正确的调用)能很快地搞垮你的系统,而且可能很难恢复。
谷歌的 Christina Schulman 和 Etienne Perot 在[用安全规约协助保护你的数据中心][16]的演讲中给了一些例子。其中一次事故是将谷歌整个内部的内容分发网络CDN提交给了擦除磁盘的自动化系统。
Schulman 和 Perot 建议使用一个中心服务来管理规约,限制破坏性自动化操作的速度,并能感知到系统状态(比如避免在最近有告警的服务上执行破坏性的操作)。
自动化系统在与运维人员(或其他自动化系统)交互时,也可能造成严重事故。[Reddit][17] 遭遇过一次严重的服务中断,当时他们的自动化系统重启了一个服务,但是这个服务是运维人员停掉做维护的。一旦有了多个自动化系统,它们之间潜在的交互就变得异常复杂和不可预测。
所有的自动化系统都把日志输出到一个容易搜索的中心存储上,能帮助到对这类不可避免的意外情况的处理。自动化系统总是应该具备这样一种机制,即允许快速地关掉它们(完全关掉或者只关掉其中一部分操作或一部分目标)。
### 防止黑天鹅事件
可能在等着击垮系统的黑天鹅可不止上面这些。有很多其他的严重问题是能通过一些技术来避免的,像金丝雀发布,压力测试,混沌工程,灾难测试和模糊测试——当然还有冗余性和弹性的设计。但是即使用了这些技术,有时候你的系统还是会有故障。
为了确保你的组织能有效地响应,在服务中断期间,请保证关键技术人员和领导层有办法沟通协调。例如,有一种你可能需要处理的烦人的事情,那就是网络完全中断。拥有故障时仍然可用的通信通道非常重要,这个通信通道要完全独立于你们自己的基础设施和基础设施的依赖。举个例子,假如你使用 AWS那么把故障时可用的通信服务部署在 AWS 上就不明智了。在和你的主系统无关的地方,运行电话网桥或 IRC 服务器是比较好的方案。确保每个人都知道这个通信平台,并练习使用它。
另一个原则是,确保监控和运维工具对生产环境系统的依赖尽可能的少。将控制平面和数据平面分开,你才能在系统不健康的时候做变更。不要让数据处理和配置变更或监控使用同一个消息队列,比如——应该使用不同的消息队列实例。在 [SparkPost: DNS 挂掉的那一天][4] 这个演讲中Jeremy Blosser 讲了一个这类例子,很关键的工具依赖了生产环境的 DNS 配置,但是生产环境的 DNS 出了问题。
### 对抗黑天鹅的心理学
处理生产环境的重大事故时会产生很大的压力。为这些场景制定结构化的事故管理流程确实是有帮助的。很多科技公司([包括谷歌][18])成功地使用了联邦应急管理局事故指挥系统的某个版本。对于每一个值班的人,遇到了他们无法独立解决的重大问题时,都应该有一个明确的寻求协助的方法。
对于那些持续很长时间的事故,有一点很重要,要确保工程师不会连续工作到不合理的时长,确保他们不会不吃不睡(没有报警打扰的睡觉)。疲惫不堪的工程师很容易犯错或者漏掉了可能更快解决故障的信息。
### 了解更多
关于黑天鹅或者以前的黑天鹅事件以及应对策略还有很多其他的事情可以说。如果你想了解更多我强烈推荐你去看这两本书它们是关于生产环境中的弹性和稳定性的Susan Fowler 写的[生产微服务][19],还有 Michael T. Nygard 的 [Release It!][20]。
--------------------------------------------------------------------------------
via: https://opensource.com/article/18/10/taxonomy-black-swans
作者:[Laura Nolan][a]
选题:[lujun9972][b]
译者:[BeliteX](https://github.com/belitex)
校对:[校对者ID](https://github.com/校对者ID)
本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出
[a]: https://opensource.com/users/lauranolan
[b]: https://github.com/lujun9972
[1]: https://medium.com/making-instapaper/instapaper-outage-cause-recovery-3c32a7e9cc5f
[2]: https://blog.sentry.io/2015/07/23/transaction-id-wraparound-in-postgres.html
[3]: https://medium.com/@florian_7764/technical-post-mortem-of-the-august-incident-82ab4c3d6547
[4]: https://www.usenix.org/conference/srecon18americas/presentation/blosser
[5]: https://groups.google.com/forum/#!topic/mongodb-user/UoqU8ofp134
[6]: https://en.wikipedia.org/wiki/Richard_Thaler
[7]: https://blog.hostedgraphite.com/2018/03/01/spooky-action-at-a-distance-how-an-aws-outage-ate-our-load-balancer/
[8]: https://labs.spotify.com/2013/06/04/incident-management-at-spotify/
[9]: https://medium.com/square-corner-blog/incident-summary-2017-03-16-2f65be39297
[10]: https://en.wikipedia.org/wiki/Circuit_breaker_design_pattern
[11]: http://www.brendangregg.com/usemethod.html
[12]: https://slackhq.com/this-was-not-normal-really
[13]: https://circleci.statuspage.io/incidents/hr0mm9xmm3x6
[14]: https://www.youtube.com/watch?v=XNEIkivvaV4
[15]: https://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf
[16]: https://www.usenix.org/conference/srecon18americas/presentation/schulman
[17]: https://www.reddit.com/r/announcements/comments/4y0m56/why_reddit_was_down_on_aug_11/
[18]: https://landing.google.com/sre/book/chapters/managing-incidents.html
[19]: http://shop.oreilly.com/product/0636920053675.do
[20]: https://www.oreilly.com/library/view/release-it/9781680500264/
[21]: https://www.usenix.org/conference/lisa18/presentation/nolan
[22]: https://www.usenix.org/conference/lisa18