translating---geekpi

Creating better disaster recovery plans
============================================================

Five questions for Tanya Reilly: How service interdependencies make recovery harder and why it’s a good idea to deliberately and preemptively manage dependencies.

I recently asked Tanya Reilly, Site Reliability Engineer at Google, to share her thoughts on how to make better disaster recovery plans. Tanya is presenting a session titled [_Have you tried turning it off and turning it on again?_][9] at the O’Reilly Velocity Conference, taking place Oct. 1-4 in New York.

### 1\. What are the most common mistakes people make when planning their backup systems strategy?

The classic line is "you don't need a backup strategy, you need a restore strategy." If you have backups, but you haven't tested restoring them, you don't really have backups. Testing doesn't just mean knowing you can get the data back; it means knowing how to put it back into the database, how to handle incremental changes, how to reinstall the whole thing if you need to. It means being sure that your recovery path doesn't rely on some system that could be lost at the same time as the data.

But testing restores is tedious. It's the sort of thing that people will cut corners on if they're busy. It's worth taking the time to make it as simple and painless and automated as possible; never rely on human willpower for anything! At the same time, you have to be sure that the people involved know what to do, so it's good to plan regular wide-scale disaster tests. Recovery exercises are a great way to find out that the documentation for the process is missing or out of date, or that you don't have enough resources (disk, network, etc.) to transfer and reinsert the data.
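
As a concrete illustration, the whole drill can be one small scheduled script. Here is a minimal sketch, assuming a PostgreSQL database, a dump at an illustrative path, and a hypothetical `orders` table for the sanity check (none of these specifics come from the interview):

```python
#!/usr/bin/env python3
"""Scheduled restore drill: prove the latest backup can actually be restored."""
import subprocess
import sys

BACKUP_FILE = "/backups/latest.dump"   # illustrative path
SCRATCH_DB = "restore_drill"           # throwaway database, never production

def run(*cmd: str) -> None:
    """Run a command, echoing it first; raise if it fails."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def main() -> None:
    # Restore into a fresh scratch database so the drill is repeatable.
    run("dropdb", "--if-exists", SCRATCH_DB)
    run("createdb", SCRATCH_DB)
    run("pg_restore", "--dbname", SCRATCH_DB, BACKUP_FILE)

    # Testing doesn't just mean getting the bytes back:
    # query the restored copy to see that real data came through.
    out = subprocess.check_output(
        ["psql", "-tA", "-d", SCRATCH_DB, "-c", "SELECT count(*) FROM orders;"]
    )
    if int(out) == 0:
        sys.exit("restore drill FAILED: restored table is empty")
    print("restore drill passed")

if __name__ == "__main__":
    main()
```

Run from cron or a CI scheduler, a drill like this turns "we think the backups work" into a check that fails loudly when the restore path breaks.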

### 2\. What are the most common challenges in creating a disaster recovery (DR) plan?

I think a lot of DR is an afterthought: "We have this great system, and our business relies on it ... I guess we should do DR for it?" And by that point, the system is extremely complex, full of interdependencies and hard to duplicate.

The first time something is installed, it's often hand-crafted by a human who is tweaking things and getting it right, and sometimes that's the version that sticks around. When you build the _second_ one, it's hard to be sure it's exactly the same. Even in sites with serious config management, you can leave something out, or let it get out of date.

Encrypted backups aren't much use if you've lost access to the decryption key, for example. And any parts that are only used in a disaster may have bit-rotted since you last checked in on them. The only way to be sure you've covered everything is to fail over in earnest. Plan your disaster for a time when you're ready for it!

It's better if you can design the system so that the disaster recovery modes are part of normal operation. If your service is designed from the start to be replicated, adding more replicas is a regular operation and probably automated. There are no new pathways; it's just a capacity problem. But there can still be some forgotten components of the system that only run in one or two places. An occasional scheduled fake disaster is good for shaking those out.

By the way, those forgotten components could include information that's only in one person's brain, so if you find yourself saying, "We can't do our DR failover test until X is back from vacation," then that person is a dangerous single point of failure.

Parts of the system that are only used in disasters need the most testing, or they'll fail you when you need them. The fewer of those you have, the safer you are and the less toilsome testing you have to do.

### 3\. Why do service interdependencies make recovery harder after a disaster?

If you've got just one binary, then recovering it is relatively easy: you start that binary back up. But we increasingly break out common functionality into separate services. Microservices mean we have more flexibility and less reinvention of wheels: if we need a backend to do something and one already exists, great, we can just use that. But someone needs to keep a big picture of what depends on what, because it can get very tangled very fast.

You may know what backends you use directly, but you might not notice when new ones are added into libraries you use. You might depend on something that also indirectly depends on you. After an outage, you can end up with a deadlock: two systems that each can't start until the other is running and providing some functionality. It's a hard situation to recover from!

You can even end up with things that indirectly depend on themselves—for example, a device that you need to configure to bring up the network, but you can't get to it while the network is down. Often people have thought about these circular dependencies in advance and have some sort of fallback plan, but those are inherently the road less traveled: they're only intended to be used in extreme cases, and they follow a different path through your systems or processes or code. This means they're more likely to have a bug that won't be uncovered until you really, really need them to work.
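
Both failure modes are just cycles in the dependency graph, which means they can be found mechanically. A small sketch of that idea, using depth-first search over an invented graph (`auth` and `config` stand in for the two deadlocked systems above):

```python
# Depth-first search for circular dependencies in a service graph.
from typing import Dict, List

deps: Dict[str, List[str]] = {
    "network": [],
    "storage": ["network"],
    "auth":    ["storage", "config"],
    "config":  ["auth"],          # auth <-> config: the deadlock described above
}

def find_cycle(graph: Dict[str, List[str]]) -> List[str]:
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / in progress / done
    color = {node: WHITE for node in graph}
    stack: List[str] = []                 # current DFS path

    def visit(node: str) -> List[str]:
        color[node] = GRAY
        stack.append(node)
        for dep in graph.get(node, []):
            if color.get(dep) == GRAY:    # back edge: we found a cycle
                return stack[stack.index(dep):] + [dep]
            if color.get(dep, WHITE) == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        stack.pop()
        color[node] = BLACK
        return []

    for node in graph:
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return []

print(find_cycle(deps))   # ['auth', 'config', 'auth']
```

Anything a search like this flags is either a bug to fix or a circular dependency that needs an explicit, regularly tested fallback.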

### 4\. You advise people to start deliberately managing their dependencies long before they think they need to, in order to ward off potentially catastrophic system failure. Why is this important, and what’s your advice for doing it effectively?

Managing your dependencies is essential for being sure you can recover from a disaster. It makes operating the systems easier too. If your dependencies aren't reliable, you can't be reliable, so you need to know what they are.

It's possible to start managing dependencies after they've become chaotic, but it's much, much easier if you start early. You can set policies on the use of various services—for example, you must be this high in the stack to depend on this set of systems. You can introduce a culture of thinking about dependencies by making it a regular part of design document review. But bear in mind that lists of dependencies will quickly become stale; it's best if you have programmatic dependency discovery, and even dependency enforcement. [My Velocity talk][10] covers more about how we do that.
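
What that enforcement looks like varies by site; the talk describes Google's approach. As a generic illustration only, a pre-merge check could diff each service's declared backends against a reviewed allowlist (the services, names, and policy format below are all assumptions):

```python
# Illustrative dependency-policy check: fail the build when a service
# declares a backend that hasn't been approved for it.
from typing import Dict, Set

# What policy says each service may depend on (reviewed and versioned).
allowed: Dict[str, Set[str]] = {
    "frontend": {"auth", "storage"},
    "storage":  {"network"},
}

# What the services actually declare — ideally discovered programmatically
# (build metadata, RPC telemetry), not maintained by hand.
declared: Dict[str, Set[str]] = {
    "frontend": {"auth", "storage", "experiments"},  # new, unreviewed backend
    "storage":  {"network"},
}

violations = [
    (svc, dep)
    for svc, deps in declared.items()
    for dep in sorted(deps - allowed.get(svc, set()))
]
for svc, dep in violations:
    print(f"policy violation: {svc} may not depend on {dep}")
if violations:
    raise SystemExit(1)
```

The point of the split is that the allowlist is human-reviewed while the declared set is machine-discovered, so the two can't silently drift apart the way a hand-maintained list does.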

The other advantage of starting early is that you can split up your services into vertical "strata," where the functionality in each stratum must be able to come completely online before the next one begins. So, for example, you could say that the network has to be able to completely start up without using any other services. Then, say, your storage systems should depend on nothing but the network, the application backends should only depend on network and storage, and so on. Different strata will make sense for different architectures.

If you plan this in advance, it's much easier for new services to choose dependencies. Each one should only depend on services lower in the stack. You can still end up with cycles—things in the same stratum depending on each other—but they're more tightly contained and easier to deal with on a case-by-case basis.
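
The stratum rule itself is easy to check mechanically. A sketch, with invented strata and services:

```python
# Each service may only depend on services in a strictly lower stratum;
# anything else (including same-stratum cycles) gets flagged for review.
stratum = {
    "network": 0,
    "storage": 1,
    "backend": 2,
    "frontend": 3,
    "frontend-cache": 3,
}

deps = {
    "storage":  ["network"],
    "backend":  ["network", "storage"],
    "frontend": ["storage", "backend", "frontend-cache"],  # same-stratum dep
}

for svc, ds in deps.items():
    for dep in ds:
        if stratum[dep] >= stratum[svc]:
            print(f"review needed: {svc} (stratum {stratum[svc]}) "
                  f"depends on {dep} (stratum {stratum[dep]})")
```

New services then choose dependencies by stratum number, and any violation the check surfaces is contained to a single stratum instead of being smeared across the stack.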

### 5\. What other parts of the program for Velocity NY are of interest to you?

I've got my whole Tuesday and Wednesday schedule completely worked out! As you might have gathered, I care a lot about making huge interdependent systems manageable, so I'm looking forward to hearing [Carin Meier's thoughts on managing system complexity][11], [Sarah Wells on microservices][12] and [Baron Schwartz on observability][13]. I'm fascinated to hear [Jon Moore's story][14] on how Comcast went from yearly release cycles to releasing daily. And as an ex-sysadmin, I'm looking forward to hearing [where Bryan Liles sees that role going][15].

--------------------------------------------------------------------------------

About the authors:

Nikki McDonald

Nikki McDonald is a content director at O'Reilly Media, Inc. She lives in Ann Arbor, Michigan.

Tanya Reilly

Tanya Reilly has been a Systems Administrator and Site Reliability Engineer at Google since 2005, working on low-level infrastructure like distributed locking, load balancing, and bootstrapping. Before Google, she was a Systems Administrator at eircom.net, Ireland’s largest ISP, and before that she was the entire IT Department for a small software house.

----------------------------

via: https://www.oreilly.com/ideas/creating-better-disaster-recovery-plans

Authors: [Nikki McDonald][a], [Tanya Reilly][b]
Translator: [译者ID](https://github.com/译者ID)
Proofreader: [校对者ID](https://github.com/校对者ID)

This article is translated by [LCTT](https://github.com/LCTT/TranslateProject) and proudly presented by [Linux中国](https://linux.cn/).

[a]: https://www.oreilly.com/people/nikki-mcdonald
[b]: https://www.oreilly.com/people/5c97a-tanya-reilly
[1]: https://pixabay.com/en/crane-baukran-load-crane-crane-arm-2436704/
[2]: https://conferences.oreilly.com/velocity/vl-ny?intcmp=il-webops-confreg-reg-vlny17_new_site_right_rail_cta
[3]: https://www.oreilly.com/people/nikki-mcdonald
[4]: https://www.oreilly.com/people/5c97a-tanya-reilly
[5]: https://conferences.oreilly.com/velocity/vl-ny?intcmp=il-webops-confreg-reg-vlny17_new_site_creating_better_disaster_recovery_plans_text_cta
[6]: https://www.oreilly.com/ideas/creating-better-disaster-recovery-plans
[7]: https://conferences.oreilly.com/velocity/vl-ny?intcmp=il-webops-confreg-reg-vlny17_new_site_right_rail_cta
[8]: https://conferences.oreilly.com/velocity/vl-ny?intcmp=il-webops-confreg-reg-vlny17_new_site_right_rail_cta
[9]: https://conferences.oreilly.com/velocity/vl-ny/public/schedule/detail/61400?intcmp=il-webops-confreg-reg-vlny17_new_site_creating_better_disaster_recovery_plans_body_text_cta
[10]: https://conferences.oreilly.com/velocity/vl-ny/public/schedule/detail/61400?intcmp=il-webops-confreg-reg-vlny17_new_site_creating_better_disaster_recovery_plans_body_text_cta
[11]: https://conferences.oreilly.com/velocity/vl-ny/public/schedule/detail/62779?intcmp=il-webops-confreg-reg-vlny17_new_site_creating_better_disaster_recovery_plans_body_text_cta
[12]: https://conferences.oreilly.com/velocity/vl-ny/public/schedule/detail/61597?intcmp=il-webops-confreg-reg-vlny17_new_site_creating_better_disaster_recovery_plans_body_text_cta
[13]: https://conferences.oreilly.com/velocity/vl-ny/public/schedule/detail/61630?intcmp=il-webops-confreg-reg-vlny17_new_site_creating_better_disaster_recovery_plans_body_text_cta
[14]: https://conferences.oreilly.com/velocity/vl-ny/public/schedule/detail/62733?intcmp=il-webops-confreg-reg-vlny17_new_site_creating_better_disaster_recovery_plans_body_text_cta
[15]: https://conferences.oreilly.com/velocity/vl-ny/public/schedule/detail/62893?intcmp=il-webops-confreg-reg-vlny17_new_site_creating_better_disaster_recovery_plans_body_text_cta