Translation complete: 20160823 The infrastructure behind Twitter - efficiency and optimization.md (#4447)

* Published with https://stackedit.io/ * remove useless files * continue * finished translating
2025-03-27 02:30:10 +08:00 · 2016-09-20 13:20:33 +08:00 · 2016-09-20 13:20:33 +08:00 · 43e8e22e31
commit 43e8e22e31
parent 234b9068ad
2 changed files with 129 additions and 129 deletions
--- a/sources/tech/20160823
+++ b/sources/tech/20160823
@@ -1,129 +0,0 @@
-Eriwoon Start to translate this article
-The infrastructure behind Twitter: efficiency and optimization
-===========
-
-
-In the past, we've published details about Finagle, Manhattan, and the summary of how we re-architected the site to be able to handle events like Castle in the Sky, the Super Bowl, 2014 World Cup, the global New Year's Eve celebration, among others. In this infrastructure series, we're focusing on the core infrastructure and components that run Twitter. We're also going to focus each blog on efforts surrounding scalability, reliability, and efficiency in a way that highlights the history of our infrastructure, challenges we've faced, lessons learned, upgrades made, and where we're heading.
-
-### Data center efficiency
-
-#### History
-
-Twitter hardware and data centers are at the scale few technology companies ever reach. However, this was not accomplished without a few missteps along the way. Our uptime has matured through a combination of physical improvements and software-based changes.
-
-During the period when the fail whale was prevalent, outages occurred due to software limitations, as well as physical failures at the hardware or infrastructure level. Failure domains existed in various definitions which had to be aggregated to determine the risk and required redundancy for services. As the business scaled in customers, services, media content, and global presence, the strategy evolved to efficiently and resiliently support the service.
-
-#### Challenges
-
-Software dependencies on bare metal were further dependant on our data centers' ability to operate and maintain uptime of power, fiber connectivity, and environment. These discrete physical failure domains had to be reviewed against the services distributed on the hardware to provide for fault tolerance.
-
-The initial decision of which data center service provider to scale with was done when specialization in site selection, operation, and design was in its infancy. We began in a hosted provider then migrated to a colocation facility as we scaled. Early service interruptions occurred as result of equipment failures, data center design issues, maintenance issues, and human error. As a result, we continually iterated on the physical layer designs to increase the resiliency of the hardware and the data center operations.
-
-The physical reasons for service interruptions were inclusive of hardware failures at the server component level, top of rack switch, and core switches. For example, during the initial evaluation of our customized servers, the hardware team determined the cost of the second power supply was not warranted given the low rate of failure of server power supplies — so they were removed from the design. The data center power topology provides redundancy through separate physical whips to the racks and requires the second power supply. Removal of the second power supply eliminated the redundant power path, leaving the hardware vulnerable to impact during distribution faults in the power system. To mitigate the impact of the single power supply, ATS units were required to be added at the rack level to allow a secondary path for power.
-
-The layering of systems with diverse fiber paths, power sources, and physical domains continued to separate services from impacts at relatively small scale interruptions, thus improving resiliency.
-
-#### Lessons learned and major technology upgrades, migrations, and adoptions
-
-We learned to model dependencies between the physical failure domains, (i.e. building power and cooling, hardware, fiber) and the services distributed across them to better predict fault tolerance and drive improvements.
-
-We added additional data centers providing regional diversity to mitigate risk from natural disaster and the ability to fail between regions when it was needed during major upgrades, deploys or incidents. The active-active operation of data centers provided for staged code deployment reducing overall impacts of code rollouts.
-
-The efficiency of power use by the data centers has improved with expanding the operating ranges of the environmental envelope and designing the hardware for resiliency at the higher operating temperatures.
-
-#### Future work
-
-Our data centers continue to evolve in strategy and operation, providing for live changes to the operating network and hardware without interruption to the users. Our strategy will continue to focus on scale within the existing power and physical footprints through optimization and maintaining flexibility while driving efficiency in the coming years.
-
-### Hardware efficiency
-
-#### History and challenges
-
-Our hardware engineering team was started to qualify and validate performance of off-the-shelf purchased hardware, and evolved into customization of hardware for cost and performance optimizations.
-
-Procuring and consuming hardware at Twitter's scale comes with a unique set of challenges. In order to meet the demands of our internal customers, we initially started a program to qualify and ensure the quality of purchased hardware. The team was primarily focused on performance and reliability testing ensuring that systems could meet the demands. Running systematic tests to validate the behavior was predictable, and there were very few bugs introduced.
-
-As we scaled our major workloads (Mesos, Hadoop, Manhattan, and MySQL) it became apparent the available market offerings didn't quite meet the needs. Off-the-shelf servers come with enterprise features, like raid controllers and hot swap power supplies. These components improve reliability at small scale, but often decrease performance and increase cost; for example some raid controllers interfered with the performance of SSDs and could be a third of the cost of the system.
-
-At the time, we were a large user of mysql databases. Issues arose from both supply and performance of SAS media. The majority of deployments were 1u servers, and the total number of drives used plus a writeback cache could predict the performance of a system often time limited to a sustained 2000 sequential IOPS. In order to continue scaling this workload, we were stranding CPU cores and disk capacity to meet IOPS requirement. We were unable to find cost-effective solutions at this time.
-
-As our volume of hardware reached a critical mass, it made sense to invest in a hardware engineering team for customized white box solutions with focus on reducing the capital expenses and increased performance metrics.
-
-#### Major technology changes and adoption
-
-We've made many transitions in our hardware technology stack. Below is a timeline for adoptions of new technology and internally developed platforms.
-
- 2012 - SSDs become the primary storage media for our MySQL and key/value databases.
- 2013 - Our first custom solution for Hadoop workloads is developed, and becomes our primary bulk storage solution.
- 2013 - Our custom solution is developed for Mesos, TFE, and cache workloads.
- 2014 - Our custom SSD key/value server completes development.
- 2015 - Our custom database solution is developed.
- 2016 - We developed GPU systems for inference and training of machine learning models.
-
-#### Lessons learned
-
-The objective of our Hardware Engineering team is to significantly reduce the capital expenditure and operating expenditure by making small tradeoffs that improve our TCO. Two generalizations can apply to reduce the cost of a server:
-
-1. Removing the unused components
-2. Improving utilization
-
-Twitter's workload is divided into four main verticals: storage, compute, database, and gpu. Twitter defines requirements on a per vertical basis, allowing Hardware Engineering to produce a focused feature set for each. This approach allows us to optimize component selection where the equipment may go unused or underutilized. For example, our storage configuration has been designed specifically for Hadoop workloads and was delivered at a TCO reduction of 20% over the original OEM solution. At the same time, the design improved both the performance and reliability of the hardware. Similarly, for our compute vertical, the Hardware Engineering Team has improved the efficiency of these systems by removing unnecessary features.
-
-There is a minimum overhead required to operate a server, and we quickly reached a point where it could no longer remove components to reduce cost. In the compute vertical specifically, we decided the best approach was to look at solutions that replaced multiple nodes with a single node, and rely on Aurora/Mesos to manage the capacity. We settled on a design that replaced two of our previous generation compute nodes with a single node.
-
-Our design verification began with a series of rough benchmarks, and then progressed to a series of production load tests confirming a scaling factor of 2. Most of this improvement came from simply increasing the thread count of the CPU, but our testing confirmed a 20-50% improvement in our per thread performance. Additionally we saw a 25% increase in our per thread power efficiency, due to sharing the overhead of the server across more threads.
-
-For the initial deployment, our monitoring showed a 1.5 replacement factor, which was well below the design goal. An examination of the performance data revealed there was a flawed assumption in the workload characteristics, and that it needed to be identified.
-
-Our Hardware Engineering Team's initial action was to develop a model to predict the packing efficiency of the current Aurora job set into various hardware configurations. This model correctly predicted the scaling factor we were observing in the fleet, and suggested we were stranding cores due to unforeseen storage requirements. Additionally, the model predicted we would see a still improved scaling factor by changing the memory configuration as well.
-
-Hardware configuration changes take time to implement, so Hardware Engineering identified a few large jobs and worked with our SRE teams to adjust the scheduling requirements to reduce the storage needs. These changes were quick to deploy, and resulted in an immediate improvement to a 1.85 scaling factor.
-
-In order to address the situation permanently, we needed to adjust to configuration of the server. Simply expanding the installed memory and disk capacity resulted in a 20% improvement in the CPU core utilization, at a minimal cost increase. Hardware Engineering worked with our manufacturing partners to adjust the bill of materials for the initial shipments of these servers. Follow up observations confirmed a 2.4 scaling factor exceeding the target design.
-
-### Migration from bare metal to mesos
-
-Until 2012, running a service inside Twitter required hardware requisitions. Service owners had to find out and request the particular model or class of server, worry about your rack diversity, maintain scripts to deploy code, and manage dead hardware. There was essentially no "service discovery." When a web service needed to talk to the user service, it typically loaded up a YAML file containing all of the host IPs and ports of the user service and the service used that list (port reservations were tracked in a wiki page). As hardware died or was added, managing required editing and committing changes to the YAML file that would go out with the next deploy. Making changes in the caching tier meant many deploys over hours and days, adding a few hosts at a time and deploying in stages. Dealing with cache inconsistencies during the deploy was a common occurrence, since some hosts would be using the new list and some the old. It was possible to have a host running old code (because the box was temporarily down during the deploy) resulting in a flaky behavior with the site.
-
-In 2012/2013, two things started to get adopted at Twitter: service discovery (via a zookeeper cluster and a library in the core module of Finagle) and Mesos (including our own scheduler framework on top of Mesos called Aurora, now an Apache project).
-
-Service discovery no longer required static YAML host lists. A service either self-registered on startup or was automatically registered under mesos into a "serverset" (which is just a path to a list of znodes in zookeeper based on the role, environment, and service name). Any service that needed to talk to that service would just watch that path and get a live view of what servers were out there.
-
-With Mesos/Aurora, instead of having a script (we were heavy users of Capistrano) that took a list of hosts, pushed binaries around and orchestrated a rolling restart, a service owner pushed the package into a service called "packer" (which is a service backed by HDFS), uploaded an aurora configuration that described the service (how many CPUs it needed, how much memory, how many instances needed, the command lines of all the tasks each instance should run) and Aurora would complete the deploy. It schedules instances on an available hosts, downloads the artifact from packer, registers it in service discovery, and launches it. If there are any failures (hardware dies, network fails, etc), Mesos/Aurora automatically reschedules the instance on another host.
-
-#### Twitter's Private PaaS
-
-Mesos/Aurora and Service Discovery in combination were revolutionary. There were many bugs and growing pains over the next few years and many hard lessons learned about distributed systems, but the fundamental design was sound. In the old world, the teams were constantly dealing with and thinking about hardware and its management. In the new world, the engineers only have to think about how best to configure their services and how much capacity to deploy. We were also able to radically improve the CPU utilization of Twitter's fleet over time, since generally each service that got their own bare metal hardware didn't fully utilize its resources and did a poor job of managing capacity. Mesos allows us to pack multiple services into a box without having to think about it, and adding capacity to a service is only requesting quota, changing one line of a config, and doing a deploy.
-
-Within two years, most "stateless" services moved into Mesos. Some of the most important and largest services (including our user service and our ads serving system) were among the first to move. Being the largest, they saw the biggest benefit to their operational burden. This allowed them to reduce their operational burden.
-
-We are continuously looking for ways to improve the efficiency and optimization of the infrastructure. As part of this, we regularly benchmark against public cloud providers and offerings to validate our TCO and performance expectations of the infrastructure. We also have a good presence in public cloud, and will continue to utilize the public cloud when it's the best available option. The next series of this post will mainly focus on the scale of our infrastructure.
-
-Special thanks to Jennifer Fraser, David Barr, Geoff Papilion, Matt Singer, and Lam Dong for all their contributions to this blog post.
-
-
-
-
-
--------------------------------------------------------------------------------
-
-via: https://blog.twitter.com/2016/the-infrastructure-behind-twitter-efficiency-and-optimization?utm_source=webopsweekly&utm_medium=email
-
-作者：[mazdakh][a]
-译者：[译者ID](https://github.com/译者ID)
-校对：[校对者ID](https://github.com/校对者ID)
-
-本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译，[Linux中国](https://linux.cn/) 荣誉推出
-
-[a]: https://twitter.com/intent/user?screen_name=mazdakh
-[1]: https://twitter.com/jenniferfraser
-[2]: https://twitter.com/davebarr
-[3]: https://twitter.com/gpapilion
-[4]: https://twitter.com/lamdong
-
-
-
-
-
-
-
--- a/translated/tech/20160823
+++ b/translated/tech/20160823
@@ -0,0 +1,129 @@
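+The infrastructure behind Twitter: efficiency and optimization
+===========
+
+In the past, we've published details about [Finagle](https://twitter.github.io/finagle/), [Manhattan](https://blog.twitter.com/2014/manhattan-our-real-time-multi-tenant-distributed-database-for-twitter-scale), and a summary of how we re-architected the site to handle events like Castle in the Sky, the Super Bowl, the 2014 World Cup, and the global New Year's Eve celebration, among others. In this infrastructure series, we're focusing on the core infrastructure and components that run Twitter. Each post will also cover our efforts around scalability, reliability, and efficiency in a way that highlights the history of our infrastructure, the challenges we've faced, lessons learned, upgrades made, and where we're heading.
+
+> Castle in the Sky: on August 2, 2013, Hayao Miyazaki's "Castle in the Sky" had its 14th television rebroadcast on NTV. As the plot reached its climax, Twitter's TPS (Tweets Per Second) hit a new peak of 143,199, 25 times the average, a record that had not been broken at the time of writing. (Translator's note)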
+
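+### Data center efficiency
+
+#### History
+
+Twitter's hardware and data centers now operate at a scale few technology companies ever reach. This was not accomplished without a few missteps along the way; our uptime has matured through a combination of physical improvements and software-based changes.
+
+During the period when the Fail Whale was prevalent, outages occurred due to software limitations as well as physical failures at the hardware or infrastructure level. Failure domains were defined in various ways, and those definitions had to be aggregated to determine the risk and the redundancy each service required. As the business scaled in customers, services, media content, and global presence, our strategy evolved to support the service efficiently and resiliently.
+
+> The Twitter outage page was known as the Fail Whale, shown below. (Translator's note)
+![Fail Whale](https://upload.wikimedia.org/wikipedia/en/d/de/Failwhale.png)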
+
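+#### Challenges
+
+Software running on bare metal depended in turn on our data centers' ability to maintain uptime for power, fiber connectivity, and environment. These discrete physical failure domains had to be reviewed against the services distributed across the hardware in order to provide fault tolerance.
+
+Our initial choice of data center service provider was made when specialization in site selection, operation, and design was still in its infancy. We began with a hosted provider, then migrated to a colocation facility as we scaled. Early service interruptions resulted from equipment failures, data center design issues, maintenance issues, and human error. In response, we continually iterated on the physical layer designs to increase the resiliency of the hardware and of data center operations.
+
+Physical causes of service interruptions included hardware failures at the server component level, at top-of-rack switches, and at core switches. For example, during the initial evaluation of our customized servers, the hardware team determined that the cost of a second power supply was not warranted given the low failure rate of server power supplies, so it was removed from the design. However, the data center power topology provides redundancy through separate physical whips to each rack, and that redundancy requires a second power supply. Removing it eliminated the redundant power path and left the hardware exposed to distribution faults in the power system. To mitigate the impact of the single power supply, ATS (automatic transfer switch) units had to be added at the rack level to provide a secondary power path.
+
+Layering systems with diverse fiber paths, power sources, and physical domains continued to insulate services from relatively small-scale interruptions, improving resiliency.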
+
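+#### Lessons learned and major technology upgrades, migrations, and adoptions
+
+We learned to model the dependencies between the physical failure domains (i.e. building power and cooling, hardware, fiber) and the services distributed across them, so that we could better predict fault tolerance and drive improvements.
+
+We added data centers to provide regional diversity, which mitigates the risk from natural disasters and lets us fail over between regions when needed during major upgrades, deploys, or incidents. Active-active operation of the data centers also allowed staged code deployment, reducing the overall impact of code rollouts.
+
+The data centers' power efficiency has improved as we expanded the operating ranges of the environmental envelope and designed the hardware to remain resilient at higher operating temperatures.
+
+#### Future work
+
+Our data centers continue to evolve in strategy and operation, allowing live changes to the operating network and hardware without interruption to users. In the coming years, our strategy will continue to focus on scaling within the existing power and physical footprints through optimization, maintaining flexibility while driving efficiency.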
+
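+### Hardware efficiency
+
+#### History and challenges
+
+Our hardware engineering team was started to qualify and validate the performance of off-the-shelf hardware, and it evolved into customizing hardware for cost and performance optimization.
+
+Procuring and consuming hardware at Twitter's scale comes with a unique set of challenges. To meet the demands of our internal customers, we initially started a program to qualify purchased hardware and ensure its quality. The team focused primarily on performance and reliability testing to ensure that systems could meet those demands. We ran systematic tests to validate that behavior was predictable, and very few bugs were introduced.
+
+As we scaled our major workloads (Mesos, Hadoop, Manhattan, and MySQL), it became apparent that the available market offerings didn't quite meet our needs. Off-the-shelf servers come with enterprise features such as RAID controllers and hot-swap power supplies. These components improve reliability at small scale, but they often decrease performance and increase cost; for example, some RAID controllers interfered with SSD performance and could account for a third of the cost of the system.
+
+At the time, we were a large user of MySQL databases. Issues arose with both the supply and the performance of SAS (Serial Attached SCSI) media. The majority of deployments were 1U servers, and the total number of drives plus a writeback cache determined the performance of a system, which was often limited to a sustained 2000 sequential IOPS. To keep scaling this workload, we were stranding CPU cores and disk capacity just to meet the IOPS requirement, and we could not find a cost-effective solution at the time.
+
+As our volume of hardware reached a critical mass, it made sense to invest in a hardware engineering team that builds customized white-box solutions, with a focus on reducing capital expenses and improving performance metrics.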
+
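+#### Major technology changes and adoption
+
+We've made many transitions in our hardware technology stack. Below is a timeline of our adoption of new technologies and internally developed platforms.
+
+- 2012 - SSDs become the primary storage media for our MySQL and key/value databases.
+- 2013 - Our first custom solution for Hadoop workloads is developed and becomes our primary bulk storage solution.
+- 2013 - Our custom solution is developed for Mesos, TFE (Twitter Front-End), and cache workloads.
+- 2014 - Our custom SSD key/value server completes development.
+- 2015 - Our custom database solution is developed.
+- 2016 - We develop GPU systems for inference and training of machine learning models.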
+
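+#### Lessons learned
+
+The objective of our Hardware Engineering team is to significantly reduce capital expenditure (CAPEX) and operating expenditure (OPEX) by making small tradeoffs that improve our TCO (total cost of ownership). Two generalizations apply to reducing the cost of a server:
+
+1. Removing the unused components
+2. Improving utilization
+
+Twitter's workload is divided into four main verticals: storage, compute, database, and GPU. Twitter defines requirements on a per-vertical basis, which allows Hardware Engineering to produce a focused feature set for each and to optimize away components that would otherwise go unused or underutilized. For example, our storage configuration was designed specifically for Hadoop workloads and was delivered at a TCO reduction of 20% over the original OEM solution. At the same time, the design improved both the performance and the reliability of the hardware. Similarly, for our compute vertical, Hardware Engineering has improved the efficiency of these systems by removing unnecessary features.
+
+There is a minimum overhead required to operate a server, and we quickly reached a point where we could no longer remove components to reduce cost. In the compute vertical specifically, we decided the best approach was to look at solutions that replace multiple nodes with a single node and rely on Aurora/Mesos to manage the capacity. We settled on a design that replaced two of our previous-generation compute nodes with a single node.
+
+Our design verification began with a series of rough benchmarks and then progressed to a series of production load tests, which confirmed a scaling factor of 2. Most of this improvement came simply from increasing the thread count of the CPU, but our testing also confirmed a 20-50% improvement in per-thread performance. In addition, we saw a 25% increase in per-thread power efficiency, because the server's overhead is shared across more threads.
+
+In the initial deployment, our monitoring showed a replacement factor of 1.5, well below the design goal. An examination of the performance data revealed a flawed assumption about the workload characteristics, which we then needed to identify.
+
+Hardware Engineering's first step was to develop a model that predicts the packing efficiency of the current Aurora job set on various hardware configurations. The model correctly predicted the scaling factor we were observing in the fleet and suggested that we were stranding cores because of unforeseen storage requirements. It also predicted that changing the memory configuration would improve the scaling factor further.
+
+Hardware configuration changes take time to implement, so Hardware Engineering identified a few large jobs and worked with our SRE teams to adjust their scheduling requirements and reduce their storage needs. These changes were quick to deploy and resulted in an immediate improvement to a 1.85 scaling factor.
+
+To address the situation permanently, we needed to adjust the configuration of the server. Simply expanding the installed memory and disk capacity resulted in a 20% improvement in CPU core utilization at a minimal cost increase. Hardware Engineering worked with our manufacturing partners to adjust the bill of materials for the initial shipments of these servers. Follow-up observations confirmed a 2.4 scaling factor, exceeding the design target.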
+
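+### Migration from bare metal to Mesos
+
+Until 2012, running a service inside Twitter required hardware requisitions. Service owners had to find out and request the particular model or class of server, worry about rack diversity, maintain scripts to deploy code, and manage dead hardware. There was essentially no "service discovery." When a web service needed to talk to the user service, it typically loaded a YAML file containing all of the host IPs and ports of the user service and used that list (port reservations were tracked on a wiki page). As hardware died or was added, managing the list meant editing the YAML file and committing changes that would go out with the next deploy. Making changes in the caching tier meant many deploys over hours and days, adding a few hosts at a time and deploying in stages. Cache inconsistencies during a deploy were common, since some hosts would be using the new list and some the old. It was also possible for a host to be running old code (because the box was temporarily down during the deploy), resulting in flaky behavior on the site.
+
+In 2012/2013, two things started to be adopted at Twitter: service discovery (via a ZooKeeper cluster and a library in the core module of Finagle) and Mesos (including Aurora, our own scheduler framework on top of Mesos, now an Apache project).
+
+With service discovery, static YAML host lists were no longer required. A service either self-registered on startup or was automatically registered under Mesos into a "serverset" (which is just a path to a list of znodes in ZooKeeper based on the role, environment, and service name). Any service that needed to talk to that service would simply watch that path and get a live view of which servers were available.
+
+With Mesos/Aurora, a service owner no longer needed a script (we were heavy users of Capistrano) that took a list of hosts, pushed binaries around, and orchestrated a rolling restart. Instead, the owner pushed the package into a service called "packer" (a service backed by HDFS) and uploaded an Aurora configuration describing the service (how many CPUs and how much memory it needed, how many instances to run, and the command lines of all the tasks each instance should run), and Aurora completed the deploy. Aurora schedules the instances on available hosts, downloads the artifact from packer, registers each instance in service discovery, and launches it. If there are any failures (hardware dies, the network fails, etc.), Mesos/Aurora automatically reschedules the instance on another host.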
+
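+#### Twitter's private PaaS
+
+Mesos/Aurora and service discovery in combination were revolutionary. There were many bugs and growing pains over the next few years, and many hard lessons learned about distributed systems, but the fundamental design was sound. In the old world, teams were constantly dealing with and thinking about hardware and its management. In the new world, engineers only have to think about how best to configure their services and how much capacity to deploy. We were also able to radically improve the CPU utilization of Twitter's fleet over time, since a service with its own bare-metal hardware generally didn't fully utilize its resources and did a poor job of managing capacity. Mesos lets us pack multiple services onto a box without having to think about it, and adding capacity to a service only requires requesting quota, changing one line of a config, and doing a deploy.
+
+Within two years, most "stateless" services had moved into Mesos. Some of the most important and largest services (including our user service and our ads serving system) were among the first to move. Being the largest, they saw the biggest reduction in their operational burden.
+
+We are continuously looking for ways to improve the efficiency and optimization of the infrastructure. As part of this, we regularly benchmark against public cloud providers and offerings to validate our TCO and performance expectations for the infrastructure. We also have a good presence in the public cloud and will continue to use it when it's the best available option. The next post in this series will focus mainly on the scale of our infrastructure.
+
+Special thanks to Jennifer Fraser, David Barr, Geoff Papilion, Matt Singer, and Lam Dong for all their contributions to this blog post.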
+
+
+
+
+--------------------------------------------------------------------------------
+
+via: https://blog.twitter.com/2016/the-infrastructure-behind-twitter-efficiency-and-optimization?utm_source=webopsweekly&utm_medium=email
+
+Author: [mazdakh][a]
+Translator: [译者ID](https://github.com/译者ID)
+Proofreader: [校对者ID](https://github.com/校对者ID)
+
+This article was originally translated by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](https://linux.cn/)
+
+[a]: https://twitter.com/intent/user?screen_name=mazdakh
+[1]: https://twitter.com/jenniferfraser
+[2]: https://twitter.com/davebarr
+[3]: https://twitter.com/gpapilion
+[4]: https://twitter.com/lamdong
+
+
+
+
+