Merge pull request #4867 from geekpi/master

translated
This commit is contained in:
geekpi 2017-01-03 13:35:02 +08:00 committed by GitHub
commit 4b471c8262
2 changed files with 84 additions and 85 deletions


@ -1,85 +0,0 @@
translating---geekpi
What is SRE (Site Reliability Engineering)?
============================================================
Site Reliability Engineer is a job title we are starting to see more and more these days. What does it mean? Where does it come from? Learn from Google's SRE team.
![Bridge](https://d3tdunqjn7n0wj.cloudfront.net/360x240/bridge-1031545-1400-389c9609ff7c64083c93db48dc77eeff.jpg)
This is an excerpt from [Site Reliability Engineering][9], edited by Niall Richard Murphy, Jennifer Petoff, Chris Jones, Betsy Beyer.
Site Reliability Engineering will also be covered at the [O'Reilly Velocity Conference, Nov. 7-10 in Amsterdam][10].
### Introduction
> Hope is not a strategy.
>
> Traditional SRE saying
It is a truth universally acknowledged that systems do not run themselves. How, then, _should_ a system—particularly a complex computing system that operates at a large scale—be run?
### The Sysadmin Approach to Service Management
The sysadmin model of service management has several advantages. For companies deciding how to run and staff a service, this approach is relatively easy to implement: as a familiar industry paradigm, there are many examples from which to learn and emulate. A relevant talent pool is already widely available. An array of existing tools, software components (off the shelf or otherwise), and integration companies are available to help run those assembled systems, so a novice sysadmin team doesn't have to reinvent the wheel and design a system from scratch.
Traditional operations teams and their counterparts in product development thus often end up in conflict, most visibly over how quickly software can be released to production. At their core, the development teams want to launch new features and see them adopted by users. At _their_ core, the ops teams want to make sure the service doesn't break while they are holding the pager. Because most outages are caused by some kind of change—a new configuration, a new feature launch, or a new type of user traffic—the two teams' goals are fundamentally in tension.
Both groups understand that it is unacceptable to state their interests in the baldest possible terms ("We want to launch anything, any time, without hindrance" versus "We won't want to ever change anything in the system once it works"). And because their vocabulary and risk assumptions differ, both groups often resort to a familiar form of trench warfare to advance their interests. The ops team attempts to safeguard the running system against the risk of change by introducing launch and change gates. For example, launch reviews may contain an explicit check for _every_ problem that has _ever_ caused an outage in the past—that could be an arbitrarily long list, with not all elements providing equal value. The dev team quickly learns how to respond. They have fewer "launches" and more "flag flips," "incremental updates," or "cherrypicks." They adopt tactics such as sharding the product so that fewer features are subject to the launch review.
### Google's Approach to Service Management: Site Reliability Engineering
Conflict isn't an inevitable part of offering a software service. Google has chosen to run our systems with a different approach: our Site Reliability Engineering teams focus on hiring software engineers to run our products and to create systems to accomplish the work that would otherwise be performed, often manually, by sysadmins.
What exactly is Site Reliability Engineering, as it has come to be defined at Google? My explanation is simple: SRE is what happens when you ask a software engineer to design an operations team. When I joined Google in 2003 and was tasked with running a "Production Team" of seven engineers, my entire life up to that point had been software engineering. So I designed and managed the group the way _I_ would want it to work if I worked as an SRE myself. That group has since matured to become Google's present-day SRE team, which remains true to its origins as envisioned by a lifelong software engineer.
A primary building block of Google's approach to service management is the composition of each SRE team. As a whole, SRE can be broken down into two main categories.
50–60% are Google Software Engineers, or more precisely, people who have been hired via the standard procedure for Google Software Engineers. The other 40–50% are candidates who were very close to the Google Software Engineering qualifications (i.e., 85–99% of the skill set required), and who _in addition_ had a set of technical skills that is useful to SRE but is rare for most software engineers. By far, UNIX system internals and networking (Layer 1 to Layer 3) expertise are the two most common types of alternate technical skills we seek.
Common to all SREs is the belief in and aptitude for developing software systems to solve complex problems. Within SRE, we track the career progress of both groups closely, and have to date found no practical difference in performance between engineers from the two tracks. In fact, the somewhat diverse background of the SRE team frequently results in clever, high-quality systems that are clearly the product of the synthesis of several skill sets.
The result of our approach to hiring for SRE is that we end up with a team of people who (a) will quickly become bored by performing tasks by hand, and (b) have the skill set necessary to write software to replace their previously manual work, even when the solution is complicated. SREs also end up sharing academic and intellectual background with the rest of the development organization. Therefore, SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, design and implement automation with software to replace human labor.
By design, it is crucial that SRE teams are focused on engineering. Without constant engineering, operations load increases and teams will need more people just to keep pace with the workload. Eventually, a traditional ops-focused group scales linearly with service size: if the products supported by the service succeed, the operational load will grow with traffic. That means hiring more people to do the same tasks over and over again.
To avoid this fate, the team tasked with managing a service needs to code or it will drown. Therefore, Google places _a 50% cap on the aggregate "ops" work for all SREs_—tickets, on-call, manual tasks, etc. This cap ensures that the SRE team has enough time in their schedule to make the service stable and operable. This cap is an upper bound; over time, left to their own devices, the SRE team should end up with very little operational load and almost entirely engage in development tasks, because the service basically runs and repairs itself: we want systems that are _automatic_, not just _automated_. In practice, scale and new features keep SREs on their toes.
Google's rule of thumb is that an SRE team must spend the remaining 50% of its time actually doing development. So how do we enforce that threshold? In the first place, we have to measure how SRE time is spent. With that measurement in hand, we ensure that the teams consistently spending less than 50% of their time on development work change their practices. Often this means shifting some of the operations burden back to the development team, or adding staff to the team without assigning that team additional operational responsibilities. Consciously maintaining this balance between ops and development work allows us to ensure that SREs have the bandwidth to engage in creative, autonomous engineering, while still retaining the wisdom gleaned from the operations side of running a service.
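To make that bookkeeping concrete, here is a minimal sketch of how a team might tally its aggregate ops load against the 50% cap. It is not taken from the book; the `TimeEntry` structure, the category names, and the sample week are illustrative assumptions.

```python
from dataclasses import dataclass

# Hypothetical categories: the text names tickets, on-call, and manual tasks
# as examples of "ops" work; everything else counts as engineering here.
OPS_CATEGORIES = {"tickets", "on-call", "manual-tasks"}

@dataclass
class TimeEntry:
    engineer: str
    category: str   # e.g. "tickets", "on-call", "project-work"
    hours: float

def ops_fraction(entries):
    """Fraction of all logged hours spent on operational work."""
    total = sum(e.hours for e in entries)
    ops = sum(e.hours for e in entries if e.category in OPS_CATEGORIES)
    return ops / total if total else 0.0

def over_cap(entries, cap=0.5):
    """True if the team's aggregate ops load exceeds the cap (50% by default)."""
    return ops_fraction(entries) > cap

# A week of invented time entries for a two-person team.
week = [
    TimeEntry("ana", "tickets", 12),
    TimeEntry("ana", "project-work", 28),
    TimeEntry("bo", "on-call", 25),
    TimeEntry("bo", "project-work", 15),
]
print(f"ops fraction: {ops_fraction(week):.0%}, over cap: {over_cap(week)}")
```

A team consistently over the cap would, per the paragraph above, shift ops work back to the development team or add staff without adding operational responsibilities.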
We've found that Google SRE's approach to running large-scale systems has many advantages. Because SREs are directly modifying code in their pursuit of making Google's systems run themselves, SRE teams are characterized by both rapid innovation and a large acceptance of change. Such teams are relatively inexpensive—supporting the same service with an ops-oriented team would require a significantly larger number of people. Instead, the number of SREs needed to run, maintain, and improve a system scales sublinearly with the size of the system. Finally, not only does SRE circumvent the dysfunctionality of the dev/ops split, but this structure also improves our product development teams: easy transfers between product development and SRE teams cross-train the entire group, and improve skills of developers who otherwise may have difficulty learning how to build a million-core distributed system.
Despite these net gains, the SRE model is characterized by its own distinct set of challenges. One continual challenge Google faces is hiring SREs: not only does SRE compete for the same candidates as the product development hiring pipeline, but the fact that we set the hiring bar so high in terms of both coding and system engineering skills means that our hiring pool is necessarily small. As our discipline is relatively new and unique, not much industry information exists on how to build and manage an SRE team (although hopefully this book will make strides in that direction!). And once an SRE team is in place, their potentially unorthodox approaches to service management require strong management support. For example, the decision to stop releases for the remainder of the quarter once an error budget is depleted might not be embraced by a product development team unless mandated by their management.
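As a side note on the error-budget mechanism mentioned above, the underlying arithmetic is simple. The sketch below is only illustrative; the 99.9% SLO and the downtime figure are invented, not taken from the article.

```python
# Illustrative error-budget arithmetic for an availability SLO.
# The SLO target and the measured downtime below are made-up numbers.
slo = 0.999                        # 99.9% availability target for the quarter
quarter_minutes = 90 * 24 * 60     # roughly one quarter, expressed in minutes

error_budget = (1 - slo) * quarter_minutes   # allowed "bad" minutes (~130)
downtime_so_far = 150                        # measured outage minutes so far

remaining = error_budget - downtime_so_far
print(f"budget: {error_budget:.0f} min, remaining: {remaining:.0f} min")
if remaining <= 0:
    print("Error budget depleted: hold feature releases for the rest of the quarter.")
```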
###### DevOps or SRE?
The term "DevOps" emerged in industry in late 2008 and as of this writing (early 2016) is still in a state of flux. Its core principles—involvement of the IT function in each phase of a system's design and development, heavy reliance on automation versus human effort, the application of engineering practices and tools to operations tasks—are consistent with many of SRE's principles and practices. One could view DevOps as a generalization of several core SRE principles to a wider range of organizations, management structures, and personnel. One could equivalently view SRE as a specific implementation of DevOps with some idiosyncratic extensions.
------------------------
About the author: Benjamin Treynor Sloss coined the term "Site Reliability Engineering" and has been responsible for global operations, networking, and production engineering at Google since 2003. As of 2016, he manages a team of approximately 4,000 software, hardware, and network engineers across the globe.
--------------------------------------------------------------------------------
via: https://www.oreilly.com/ideas/what-is-sre-site-reliability-engineering
Author: [Benjamin Treynor][a]
Translator: [译者ID](https://github.com/译者ID)
Proofreader: [校对者ID](https://github.com/校对者ID)
This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and proudly presented by [Linux中国](https://linux.cn/).
[a]:https://www.oreilly.com/people/benjamin-treynor-sloss
[1]:https://shop.oreilly.com/product/0636920053385.do
[2]:https://shop.oreilly.com/product/0636920053385.do
[3]:https://www.oreilly.com/ideas/what-is-sre-site-reliability-engineering
[4]:https://shop.oreilly.com/product/0636920053385.do
[5]:https://shop.oreilly.com/product/0636920053385.do
[6]:https://www.oreilly.com/people/benjamin-treynor-sloss
[7]:https://pixabay.com/
[8]:https://www.oreilly.com/people/benjamin-treynor-sloss
[9]:http://shop.oreilly.com/product/0636920041528.do?intcmp=il-webops-books-videos-update-na_new_site_site_reliability_engineering_text_cta
[10]:http://conferences.oreilly.com/velocity/devops-web-performance-eu?intcmp=il-webops-confreg-update-vleu16_new_site_what_is_sre_text_cta
[11]:https://pixabay.com/


@ -0,0 +1,84 @@
What is SRE (Site Reliability Engineering)?
============================================================
Site Reliability Engineer is a job title we are starting to see more and more these days. What does it mean? Where does it come from? Let's learn from Google's SRE team.
![Bridge](https://d3tdunqjn7n0wj.cloudfront.net/360x240/bridge-1031545-1400-389c9609ff7c64083c93db48dc77eeff.jpg)
This is an excerpt from [Site Reliability Engineering][9], edited by Niall Richard Murphy, Jennifer Petoff, Chris Jones, and Betsy Beyer.
Site Reliability Engineering will also be covered at the [O'Reilly Velocity Conference, Nov. 7-10 in Amsterdam][10].
### Introduction
> Hope is not a strategy.
>
> Traditional SRE saying
It is a truth universally acknowledged that systems do not run themselves. How, then, _should_ a system be run, particularly a complex computing system that operates at a large scale?
### The Sysadmin Approach to Service Management
The sysadmin model of service management has several advantages. For companies deciding how to run and staff a service, this approach is relatively easy to implement: as a familiar industry paradigm, there are many examples to learn from and imitate. A relevant talent pool is already widely available. An array of existing tools, software components (off the shelf or otherwise), and integration companies is available to help run those assembled systems, so a novice sysadmin team does not have to reinvent the wheel and design a system from scratch.
Traditional operations teams and their counterparts in product development therefore often end up in conflict, most visibly over how quickly software can be released to production. At their core, the development teams want to launch new features and see them adopted by users. At _their_ core, the ops teams want to make sure the service does not break while they are the ones holding the pager. Because most outages are caused by some kind of change - a new configuration, a new feature launch, or a new type of user traffic - the two teams' goals are fundamentally in tension.
Both groups understand that it is unacceptable to state their interests in the baldest possible terms ("We want to launch anything, any time, without hindrance" versus "We won't want to change anything in the system once it works"). And because their vocabulary and assumptions about risk differ, the two groups often resort to a familiar form of trench warfare to advance their interests. The ops team tries to safeguard the running system against the risk of change by introducing launch and change gates. For example, a launch review may contain an explicit check for _every_ problem that has _ever_ caused an outage in the past - a potentially arbitrarily long list in which not every element provides equal value. The dev team quickly learns how to respond: fewer "launches" and more "flag flips", "incremental updates", or "cherrypicks". They adopt tactics such as sharding the product so that fewer features are subject to the launch review.
### Google's Approach to Service Management: Site Reliability Engineering
Conflict is not an inevitable part of offering a software service. Google has chosen to run our systems with a different approach: our Site Reliability Engineering teams focus on hiring software engineers to run our products and to create systems to accomplish the work that would otherwise be done, often manually, by sysadmins.
What exactly is Site Reliability Engineering, as it has come to be defined at Google? My explanation is simple: SRE is what happens when you ask a software engineer to design an operations team. When I joined Google in 2003 and was tasked with running a "Production Team" of seven engineers, my entire career up to that point had been software engineering. So I designed and managed the group the way _I_ would want it to work if I worked as an SRE myself. That group has since matured to become Google's present-day SRE team, which remains true to its origins as envisioned by a lifelong software engineer.
A primary building block of Google's approach to service management is the composition of each SRE team. As a whole, SRE can be broken down into two main categories.
50-60% are Google software engineers, or more precisely, people hired through the standard procedure for Google software engineers. The other 40-50% are candidates who came very close to the Google software engineering qualifications (i.e., 85-99% of the required skill set), and who in addition have a set of technical skills that is useful to SRE but rare among most software engineers. By far, expertise in UNIX system internals and in networking (Layer 1 to Layer 3) are the two most common alternative technical skills we look for.
Common to all SREs is the belief in, and aptitude for, developing software systems to solve complex problems. Within SRE we track the career progress of both groups closely, and to date have found no practical difference in performance between engineers from the two tracks. In fact, the somewhat diverse background of the SRE team frequently produces clever, high-quality systems that are clearly the product of a synthesis of several skill sets.
The result of hiring SREs this way is that we end up with a team of people who (a) will quickly get bored performing tasks by hand, and (b) have the skills needed to write software to replace their previously manual work, even when the solution is complicated. SREs also share academic and intellectual background with the rest of the development organization. SRE is therefore fundamentally doing the work historically done by an operations team, but using engineers with software expertise, and counting on the fact that these engineers are both inherently inclined toward, and capable of, designing and implementing automation in software to replace human labor.
By design, it is crucial that SRE teams stay focused on engineering. Without constant engineering work, the operations load grows and the team will need more people just to keep up with the workload. Eventually, a traditional ops-focused team scales linearly with the size of the service: if the products supported by the service succeed, the operational load grows with traffic. That means hiring more and more people to do the same tasks over and over again.
To avoid this fate, the team tasked with managing a service needs to write code or it will drown in work. Google therefore places _a 50% cap on the aggregate "ops" work (tickets, on-call, manual tasks, and so on) for all SREs_. This cap ensures that the SRE team has enough room in its schedule to make the service stable and operable. 50% is an upper bound; over time, left to its own devices, an SRE team should end up with very little operational load and be engaged almost entirely in development work, because the service basically runs and repairs itself: we want systems that are _automatic_, not merely _automated_. In practice, scale and new features keep SREs on their toes.
Google's rule of thumb is that an SRE team must spend the remaining 50% of its time actually doing development. So how do we enforce that threshold? First, we have to measure how SRE time is spent. With that measurement in hand, we make sure that teams consistently spending less than 50% of their time on development work change their practices. Often that means shifting some of the operational burden back to the development team, or adding staff to the team without assigning it additional operational responsibilities. Consciously maintaining this balance between ops and development work lets us ensure that SREs have the bandwidth to engage in creative, autonomous engineering, while still retaining the wisdom gained from the operations side of running a service.
We've found that Google SRE's approach to running large-scale systems has many advantages. Because SREs directly modify code in their pursuit of making Google's systems run themselves, SRE teams are characterized by both rapid innovation and a broad acceptance of change. Such teams can support the same service relatively cheaply, whereas an ops-oriented team would need far more people. Instead, the number of SREs needed to run, maintain, and improve a system scales sublinearly with the size of the system. Finally, SRE not only circumvents the dysfunction of the dev/ops split, but this structure also improves our product development teams: easy transfers between product development and SRE teams cross-train the whole group, and improve the skills of developers who might otherwise have difficulty learning how to build a million-core distributed system.
Despite these gains, the SRE model is characterized by its own distinct set of challenges. One continual challenge Google faces is hiring SREs: not only does SRE compete for the same candidates as the product development hiring pipeline, but because we set the hiring bar so high in terms of both coding and systems engineering skills, our hiring pool is necessarily small. Since our discipline is relatively new and unique, not much industry information exists on how to build and manage an SRE team (although hopefully this book will make strides in that direction!). And once an SRE team is in place, its potentially unorthodox approaches to service management require strong management support. For example, the decision to stop releases for the remainder of the quarter once the error budget is depleted might not be embraced by a product development team unless mandated by its management.
###### DevOps or SRE?
The term "DevOps" emerged in the industry in late 2008 and, as of this writing (early 2016), is still in a state of flux. Its core principles (involvement of the IT function in each phase of a system's design and development, heavy reliance on automation rather than human effort, and the application of engineering practices and tools to operations tasks) are consistent with many of SRE's principles and practices. One could view DevOps as a generalization of several core SRE principles to a wider range of organizations, management structures, and personnel. One could equivalently view SRE as a specific implementation of DevOps with some idiosyncratic extensions.
------------------------
About the author: Benjamin Treynor Sloss coined the term "Site Reliability Engineering" and has been responsible for Google's global operations, networking, and production engineering since 2003. As of 2016, he manages a team of approximately 4,000 software, hardware, and network engineers around the world.
--------------------------------------------------------------------------------
via: https://www.oreilly.com/ideas/what-is-sre-site-reliability-engineering
Author: [Benjamin Treynor][a]
Translator: [geekpi](https://github.com/geekpi)
Proofreader: [校对者ID](https://github.com/校对者ID)
This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and proudly presented by [Linux中国](https://linux.cn/).
[a]:https://www.oreilly.com/people/benjamin-treynor-sloss
[1]:https://shop.oreilly.com/product/0636920053385.do
[2]:https://shop.oreilly.com/product/0636920053385.do
[3]:https://www.oreilly.com/ideas/what-is-sre-site-reliability-engineering
[4]:https://shop.oreilly.com/product/0636920053385.do
[5]:https://shop.oreilly.com/product/0636920053385.do
[6]:https://www.oreilly.com/people/benjamin-treynor-sloss
[7]:https://pixabay.com/
[8]:https://www.oreilly.com/people/benjamin-treynor-sloss
[9]:http://shop.oreilly.com/product/0636920041528.do?intcmp=il-webops-books-videos-update-na_new_site_site_reliability_engineering_text_cta
[10]:http://conferences.oreilly.com/velocity/devops-web-performance-eu?intcmp=il-webops-confreg-update-vleu16_new_site_what_is_sre_text_cta
[11]:https://pixabay.com/