Merge remote-tracking branch 'LCTT/master'

Xingyu Wang 2020-08-29 20:57:00 +08:00
commit 9d8924b297
4 changed files with 250 additions and 157 deletions


@@ -1,39 +1,42 @@
[#]: collector: (lujun9972)
[#]: translator: (geekpi)
[#]: reviewer: ( )
[#]: publisher: ( )
[#]: url: ( )
[#]: reviewer: (wxy)
[#]: publisher: (wxy)
[#]: url: (https://linux.cn/article-12561-1.html)
[#]: subject: (Being open to open values)
[#]: via: (https://opensource.com/open-organization/20/8/being-open-to-open-values)
[#]: author: (Heidi Hess von Ludewig https://opensource.com/users/heidi-hess-von-ludewig)
对开放的价值观持开放态度
======
开放管理可能会让人感到恐惧。一位经理人解释了它为什么值得冒险。
> 开放管理可能会让人感到恐惧。一位经理人解释了为什么值得冒这个风险。
![Open Lego CAD][1]
在本期的“[用开放的价值观管理][2]”系列中,我和美国一家全国性保险公司的定价总监、人事经理 Braxton 聊了聊。
2018 年 6 月Braxton 联系到了开放组织社区的红帽人员。他想了解更多关于他_和_他的团队如何使用开放的价值观,以不同的方式工作。我们很乐意提供帮助。于是我帮助 Braxton 和他的团队组织了一个关于[开放组织原则][3]的研讨会,并在之后还保持着联系,这样我就可以了解他在变得更加开放的过程中的风险。
2018 年 6 月Braxton 联系到了开放组织社区的红帽人员。他想了解更多关于他*和*他的团队如何使用开放的价值观,以不同的方式工作。我们很乐意提供帮助。于是我帮助 Braxton 和他的团队组织了一个关于[开放组织原则][3]的研讨会,并在之后还保持着联系,这样我就可以了解他在变得更加开放的过程中的风险。
最近我们采访了 Braxton并和他一起坐下来听了事情的进展。[产业/组织心理学家和员工参与度专家][4] Tracy Guiliani 和 [Bryan Behrenshausen][5] 一起加入了。我们的谈话范围很广,探讨了了解开源价值观后的感受,如何利用它们来改变组织,以及它们如何帮助 Braxton 和他的团队更好地工作和提高参与度。
最近我们采访了 Braxton并和他一起坐下来听了事情的进展。[产业/组织心理学家和员工参与度专家][4] Tracy Guiliani 和 [Bryan Behrenshausen][5] 一起加入了采访。我们的谈话范围很广,探讨了了解开源价值观后的感受,如何利用它们来改变组织,以及它们如何帮助 Braxton 和他的团队更好地工作和提高参与度。
与 Braxton 合作是一次异常有意义的经历。它让我们直接见证了一个人如何将开放组织社区驱动的研讨会材料融入动态变化,并使他、他的团队和他的组织受益。开放组织大使_一直_在寻求帮助人们获得关于开放价值的见解和知识,使他们能够理解文化变革和[自己组织内的转型][6]。
与 Braxton 合作是一次异常有意义的经历。它让我们直接见证了一个人如何将开放组织社区驱动的研讨会材料融入动态变化,并使他、他的团队和他的组织受益。开放组织大使*一直*在寻求帮助人们获得关于开放价值的见解和知识,使他们能够理解文化变革和[自己组织内的转型][6]。
他和他的团队正在以对他们有效的方式执行他们独特的开放价值观,并且让团队实现的利益超过了提议变革在时间和精力上的投入。
Braxton 对开放组织原则的_解释_和使组织更加开放的策略的接受,让我们深受启发。
Braxton 对开放组织原则的*解释*和使组织更加开放的策略,让我们深受启发。
Braxton 承认,他的更开放的目标并不包括“制造另一个红帽”。相反,他和他的团队是在以对他们有效的方式,以及让团队实现的利益超过提议的变革所带来的时间和精力投入,来执行他们独特的开放价值观。
In the first part of our interview, you'll also hear Braxton describe:
在我们采访的第一部分,你还会听到 Braxton 描述。
在我们采访的第一部分,你还会听到 Braxton 描述:
1. 在了解了透明性、协作性、适应性、社区性和包容性这五种开放式组织价值观之后,“开放式管理”对他意味着什么?
2. 他的一些开放管理做法
2. 他的一些开放管理做法
3. 他如何在他的团队中倡导开放文化,如何在后来者中鼓励开源价值观,以及他所体验到的好处。
4. 当人们试图改造自己的组织时,对开源价值观最大的误解是什么?
- [收听对 Braxton 的采访](https://opensource.com/sites/default/files/images/open-org/braxton_1.ogg)
--------------------------------------------------------------------------------
via: https://opensource.com/open-organization/20/8/being-open-to-open-values
@@ -41,7 +44,7 @@ via: https://opensource.com/open-organization/20/8/being-open-to-open-values
作者:[Heidi Hess von Ludewig][a]
选题:[lujun9972][b]
译者:[geekpi](https://github.com/geekpi)
校对:[校对者ID](https://github.com/校对者ID)
校对:[wxy](https://github.com/wxy)
本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出


@@ -1,145 +0,0 @@
[#]: collector: (lujun9972)
[#]: translator: ( chenmu-kk )
[#]: reviewer: ( )
[#]: publisher: ( )
[#]: url: ( )
[#]: subject: (9 open source tools for building a fault-tolerant system)
[#]: via: (https://opensource.com/article/19/3/tools-fault-tolerant-system)
[#]: author: (Bryant Son (Red Hat, Community Moderator) https://opensource.com/users/brson)
9 open source tools for building a fault-tolerant system
九个用来构建容错系统的开源工具
======
Maximize uptime and minimize problems with these open source tools.
这些开源工具可以最大化延长运行时间并且在最大程度上减少问题。
![magnifying glass on computer screen, finding a bug in the code][1]
I've always been interested in web development and software architecture because I like to see the broader picture of a working system. Whether you are building a mobile app or a web application, it has to be connected to the internet to exchange data among different modules, which means you need a web service.
我总是对web开发和软件体系结构很感兴趣因为我喜欢看到一个工作系统的更广阔的前景。无论是构建一个移动应用程序还是一个web应用程序都必须连接到internet才能在不同的模块中交换数据这意味着你需要web服务。
If you use a cloud system as your application's backend, you can take advantage of greater computing power, as the backend service will scale horizontally and vertically and orchestrate different services. But whether or not you use a cloud backend, it's important to build a _fault-tolerant system_ —one that is resilient, stable, fast, and safe.
如果选择云系统作为应用程序的后端,则可以利用更优秀的计算能力,因为后端服务将会在水平和垂直方向上进行扩展并编排不同的服务。但无论你是否使用云后端,建造一个灵活、稳定、快速又安全的容错系统是必不可少的。
To understand fault-tolerant systems, let's use Facebook, Amazon, Google, and Netflix as examples. Millions and billions of users access these platforms simultaneously while transmitting enormous amounts of data via peer-to-peer and user-to-server networks, and you can be sure there are also malicious users with bad intentions, like hacking or denial-of-service (DoS) attacks. Even so, these platforms can operate 24 hours a day and 365 days a year without downtime.
要了解容错系统,让我们以脸书、亚马逊、谷歌和奈飞为例。数以亿计的用户会同时接入这些平台并通过对等网络和用户-服务器网络传输大量数据你可以肯定这其中还存在许多的带有不法目的的恶意用户例如黑客攻击和拒绝服务DoS攻击。即使如此这些平台无需停机也可以全年无休地运转。
Although machine learning and smart algorithms are the backbones of these systems, the fact that they achieve consistent service without a single minute of downtime is praiseworthy. Their expensive hardware and gigantic datacenters certainly matter, but the elegant software designs supporting the services are equally important. And the fault-tolerant system is one of the principles to build such an elegant system.
尽管机器学习和智能算法是这些系统的基础,但它们实现一致服务而不需要一分钟停机的事实值得称赞。它们昂贵的硬件设备和巨大的数据中心当然十分重要,但是支持服务的精密软件设计也同样重要。而且容错系统是一个构建如此精密系统的法则之一。
### Two behaviors that cause problems in production
Here's another way to think of a fault-tolerant system. When you run your application service locally, everything seems to be fine. Great! But when you promote your service to the production environment, all hell breaks loose. In a situation like this, a fault-tolerant system helps by addressing two problems: Fail-stop behavior and Byzantine behavior.
#### Fail-stop behavior
Fail-stop behavior is when a running system suddenly halts or a few parts of the system fail. Server downtime and database inaccessibility fall under this category. For example, in the diagram below, Service 1 can't communicate with Service 2 because Service 2 is inaccessible:
![Fail-stop behavior due to Service 2 downtime][2]
But the problem can also occur if there is a network problem between the services, like this:
![Fail-stop behavior due to network failure][3]
#### Byzantine behavior
Byzantine behavior is when the system continuously runs but doesn't produce the expected behavior (e.g., wrong data or an invalid value).
Byzantine failure can happen if Service 2 has corrupted data or values, even though the service looks to be operating just fine, like in this example:
![Byzantine failure due to corrupted service][4]
Or, there can be a malicious middleman intercepting between the services and injecting unwanted data:
![Byzantine failure due to malicious middleman][5]
Neither fail-stop nor Byzantine behavior is a desired situation, so we need ways to prevent or fix them. That's where fault-tolerant systems come into play. Following are eight open source tools that can help you address these problems.
### Tools for building a fault-tolerant system
Although building a truly practical fault-tolerant system touches upon in-depth _distributed computing theory_ and complex computer science principles, there are many software tools—many of them, like the following, open source—to alleviate undesirable results by building a fault-tolerant system.
#### Circuit-breaker pattern: Hystrix and Resilience4j
The [circuit-breaker pattern][6] is a technique that helps to return a prepared dummy response or a simple response when a service fails:
![Circuit breaker pattern][7]
Netflix's open source **[Hystrix][8]** is the most popular implementation of the circuit-breaker pattern.
Many companies where I've worked previously are leveraging this wonderful tool. Surprisingly, Netflix announced that it will no longer update Hystrix. (Yeah, I know.) Instead, Netflix recommends using an alternative solution like [**Resilence4j**][9], which supports Java 8 and functional programming, or an alternative practice like [Adaptive Concurrency Limit][10].
#### Load balancing: Nginx and HaProxy
Load balancing is one of the most fundamental concepts in a distributed system and must be present to have a production-quality environment. To understand load balancers, we first need to understand the concept of _redundancy_. Every production-quality web service has multiple servers that provide redundancy to take over and maintain services when servers go down.
![Load balancer][11]
Think about modern airplanes: their dual engines provide redundancy that allows them to land safely even if an engine catches fire. (It also helps that most commercial airplanes have state-of-art, automated systems.) But, having multiple engines (or servers) means that there must be some kind of scheduling mechanism to effectively route the system when something fails.
A load balancer is a device or software that optimizes heavy traffic transactions by balancing multiple server nodes. For instance, when thousands of requests come in, the load balancer acts as the middle layer to route and evenly distribute traffic across different servers. If a server goes down, the load balancer forwards requests to the other servers that are running well.
There are many load balancers available, but the two best-known ones are Nginx and HaProxy.
[**Nginx**][12] is more than a load balancer. It is an HTTP and reverse proxy server, a mail proxy server, and a generic TCP/UDP proxy server. Companies like Groupon, Capital One, Adobe, and NASA use it.
[**HaProxy**][13] is also popular, as it is a free, very fast and reliable solution offering high availability, load balancing, and proxying for TCP and HTTP-based applications. Many large internet companies, including GitHub, Reddit, Twitter, and Stack Overflow, use HaProxy. Oh and yes, Red Hat Enterprise Linux also supports HaProxy configuration.
#### Actor model: Akka
The [actor model][14] is a concurrency design pattern that delegates responsibility when an _actor_ , which is a primitive unit of computation, receives a message. An actor can create even more actors and delegate the message to them.
[**Akka**][15] is one of the most well-known tools for the actor model implementation. The framework supports Java and Scala, which are both based on JVM.
#### Asynchronous, non-blocking I/O using messaging queue: Kafka and RabbitMQ
Multi-threaded development has been popular in the past, but this practice has been discouraged and replaced with asynchronous, non-blocking I/O patterns. For Java, this is explicitly stated in its [Enterprise Java Bean (EJB) specifications][16]:
> "An enterprise bean must not use thread synchronization primitives to synchronize execution of multiple instances.
>
> "The enterprise bean must not attempt to manage threads. The enterprise bean must not attempt to start, stop, suspend, or resume a thread, or to change a thread's priority or name. The enterprise bean must not attempt to manage thread groups."
Now, there are other practices like stream APIs and actor models. But messaging queues like [**Kafka**][17] and [**RabbitMQ**][18] offer the out-of-box support for asynchronous and non-blocking IO features, and they are powerful open source tools that can be replacements for threads by handling concurrent processes.
#### Other options: Eureka and Chaos Monkey
Other useful tools for fault-tolerant systems include monitoring tools, such as Netflix's **[Eureka][19]** , and stress-testing tools, like **[Chaos Monkey][20]**. They aim to discover potential issues earlier by testing in lower environments, like integration (INT), quality assurance (QA), and user acceptance testing (UAT), to prevent potential problems before moving to the production environment.
* * *
What open source tools are you using for building a fault-tolerant system? Please share your favorites in the comments.
--------------------------------------------------------------------------------
via: https://opensource.com/article/19/3/tools-fault-tolerant-system
作者:[Bryant Son (Red Hat, Community Moderator)][a]
选题:[lujun9972][b]
译者:[译者ID](https://github.com/译者ID)
校对:[校对者ID](https://github.com/校对者ID)
本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出
[a]: https://opensource.com/users/brson
[b]: https://github.com/lujun9972
[1]: https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/mistake_bug_fix_find_error.png?itok=PZaz3dga (magnifying glass on computer screen, finding a bug in the code)
[2]: https://opensource.com/sites/default/files/uploads/1_errordowntimeservice.jpg (Fail-stop behavior due to Service 2 downtime)
[3]: https://opensource.com/sites/default/files/uploads/2_errordowntimenetwork.jpg (Fail-stop behavior due to network failure)
[4]: https://opensource.com/sites/default/files/uploads/3_byzantinefailuremalicious.jpg (Byzantine failure due to corrupted service)
[5]: https://opensource.com/sites/default/files/uploads/4_byzantinefailuremiddleman.jpg (Byzantine failure due to malicious middleman)
[6]: https://martinfowler.com/bliki/CircuitBreaker.html
[7]: https://opensource.com/sites/default/files/uploads/5_circuitbreakerpattern.jpg (Circuit breaker pattern)
[8]: https://github.com/Netflix/Hystrix/wiki
[9]: https://github.com/resilience4j/resilience4j
[10]: https://medium.com/@NetflixTechBlog/performance-under-load-3e6fa9a60581
[11]: https://opensource.com/sites/default/files/uploads/7_loadbalancer.jpg (Load balancer)
[12]: https://www.nginx.com
[13]: https://www.haproxy.org
[14]: https://en.wikipedia.org/wiki/Actor_model
[15]: https://akka.io
[16]: https://jcp.org/aboutJava/communityprocess/final/jsr220/index.html
[17]: https://kafka.apache.org
[18]: https://www.rabbitmq.com
[19]: https://github.com/Netflix/eureka
[20]: https://github.com/Netflix/chaosmonkey


@@ -0,0 +1,95 @@
[#]: collector: (lujun9972)
[#]: translator: ( )
[#]: reviewer: ( )
[#]: publisher: ( )
[#]: url: ( )
[#]: subject: (Why Comcast open sourced its DNS management tool)
[#]: via: (https://opensource.com/article/20/9/open-source-dns)
[#]: author: (Paul Cleary https://opensource.com/users/pauljamescleary)
Why Comcast open sourced its DNS management tool
======
This open source DNS management tool was built by and for the telecom giant, but it is establishing itself in its own right and welcoming more contributors.
![An intersection of pipes.][1]
Adoption of [DevOps][2] practices at Comcast led to increased automation and configuration of the infrastructure that supports applications, back-office systems, data centers, and our network. These practices require teams to move fast and be self-reliant. Infrastructure is constantly being turned upside down, with network traffic continually rerouted around it. Good DNS record management is critical to support this level of autonomy and automation, but how can a large, diverse enterprise move quickly while safely governing its DNS assets?
### Challenge
Prior to 2016, DNS record management was mostly done through an online ticketing system—users would submit tickets for DNS changes that were manually reviewed and implemented by a separate team of DNS technicians. This system frequently required manual intervention for many of the DNS requests, which was time-consuming.
Turnaround times for DNS changes were in hours, which is not suitable for infrastructure automation. Large Internet companies can manage millions of DNS records, making it practically impossible for DNS technicians to certify the correctness of the thousands of DNS updates being requested daily. This increased the possibility of an inadvertent errant update to a critical DNS record that ultimately would lead to a downtime event.
In addition, engineering teams are intimately familiar with their DNS needs—much more so than a single group of DNS technicians serving an entire enterprise. So, we needed to enable engineering teams to self-service their own DNS records, implement changes quickly (in seconds), and at the same time, make sure all changes are done safely.
### Solution
VinylDNS was built at Comcast and subsequently opened to empower engineering teams to automate as they please while providing the safety and administrative controls demanded by DNS operators and the Comcast Security team.
### Security as a way of life
VinylDNS is all about automation and enhanced security. The VinylDNS team worked in close coordination with Comcast's DNS, engineering, and security teams to meet stringent engineering and security requirements. An incredible array of access controls was implemented that gives both DNS operators and engineering teams extreme flexibility in controlling their DNS assets.
Access controls implemented at the DNS zone level allow any team to control who can make updates to their DNS zones. When a DNS zone is registered and authorized to a VinylDNS group, only members of that group can make changes to DNS records in that DNS zone. In addition, access-list (ACL) rules provide extreme flexibility to allow other VinylDNS users to manage records in that zone. These ACL rules can be defined using regular expression masks or classless inter-domain routing (CIDR) rules and DNS record types that lock down access to specific users and groups to certain records in specific DNS zones.
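To make the zone-level access controls more concrete, here is a minimal sketch of what an ACL rule might look like when composed for a VinylDNS-style API. The field names (`accessLevel`, `recordMask`, `recordTypes`, `groupId`) and all values are illustrative assumptions, not the exact VinylDNS schema.

```python
# Hypothetical sketch of a zone ACL rule: let one group manage only the
# "api-*" A/AAAA/CNAME records in a single zone. Field names are
# illustrative and may differ from the real VinylDNS API schema.
acl_rule = {
    "accessLevel": "Write",                  # what the grantee is allowed to do
    "groupId": "platform-team-group-id",     # hypothetical VinylDNS group ID
    "recordMask": "api-.*",                  # regular-expression mask on record names
    "recordTypes": ["A", "AAAA", "CNAME"],   # lock the rule down to these types
}

zone_update = {
    "name": "example.net.",                  # hypothetical zone
    "adminGroupId": "dns-operators-group-id",
    "acl": {"rules": [acl_rule]},
}
```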
### Meeting the demands of automation
A [representational state transfer (REST) API][3] was built along with the system. This uses request signing to help eliminate man-in-the-middle attacks. Once the engineering teams at Comcast caught wind of the kind of automation afforded by VinylDNS, many began building out tooling to integrate directly with VinylDNS via its API. It wasn't long before most of them were using organically developed tooling integrated with the VinylDNS API to support their DNS needs.
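As an illustration of the request-signing idea mentioned above (not the exact scheme VinylDNS uses), the sketch below signs a record-change payload with an HMAC so the server can detect tampering in transit; the endpoint, header names, and credentials are hypothetical.

```python
import hashlib
import hmac
import json
import urllib.request

# Hypothetical credentials and endpoint -- illustrative only.
ACCESS_KEY = "my-access-key"
SECRET_KEY = b"my-secret-key"
API_URL = "https://vinyldns.example.com/zones/123/recordsets"

body = json.dumps({
    "name": "api-01",
    "type": "A",
    "ttl": 300,
    "records": [{"address": "192.0.2.10"}],
}).encode()

# Sign the payload; a middleman who alters the body cannot produce a valid signature.
signature = hmac.new(SECRET_KEY, body, hashlib.sha256).hexdigest()

request = urllib.request.Request(
    API_URL,
    data=body,
    method="POST",
    headers={
        "Content-Type": "application/json",
        "X-Access-Key": ACCESS_KEY,   # hypothetical header names
        "X-Signature": signature,
    },
)
# urllib.request.urlopen(request)  # would submit the signed change request
```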
### Performing at large enterprise scale
Very quickly, VinylDNS was managing a million DNS records and thousands of DNS zones, and supporting hundreds of engineers. As we sought to expand VinylDNS to support the rest of Comcast, we recognized some challenges.
1. Certain DNS records were off-limits, deemed too critical to manage in any way other than by hand.
2. The ACL rule model, while flexible, would be impossible to set up and maintain across the entirety of Comcast's DNS footprint (which has millions of DNS zones, and hundreds of millions of DNS records).
3. Many DNS domains are considered "universal" and not locked down to a single group. This holds true for reverse zones, as IP space can often be freely assigned to anyone.
4. Certain DNS change requests still require a manual review and approval, i.e., you cannot truly automate everything.
5. Some teams that provision a DNS record are not the same engineers responsible for its lifecycle. The engineers that ultimately decommission a DNS record might be unknown at the time of creation.
6. Certain teams require DNS changes to be scheduled at some point in the future. For example, maintenance may be done off-hours, and the employee doing the maintenance may not have access to VinylDNS.
To address these issues, VinylDNS added more access controls and features. Shared zones allow universal access while maintaining security via record ownership. Record ownership ensures that the party who creates a DNS record is the only one that can manage that record. This feature alone allowed us to move much of the DNS reverse space into VinylDNS.
Manual review was added to support tighter governance on certain DNS zones and records. For example, a sensitive DNS zone might demand review before implementing changes, as opposed to having all changes immediately applied.
High-value domains support was added to block VinylDNS from ever being able to update certain DNS records. High-value DNS records like [www.comcast.com][4], for example, are impossible to manage via VinylDNS and require extreme governance that can't be accomplished via an automation platform.
Global ACLs were added to support situations where teams that created DNS records were not responsible for the maintenance and decommissioning of those DNS records. This allowed overrides for certain groups by fully qualified domain name (FQDN) and IP address for certain DNS domains.
Finally, scheduled changes allow users to schedule a DNS change for a future time.
### Results
VinylDNS now governs most of Comcast's internal DNS space, managing millions of DNS records across thousands of DNS zones, and supporting thousands of engineers. In addition, we leverage integration with a wide array of tools and programming languages, including Java, Python, Go, and Ruby (most of which are open source).
### Toward the future
There are several opportunities for additional feature development, which Comcast has planned as part of its ongoing evolution of the platform. The same level of access controls and governance is needed for DNS assets managed in public cloud settings. In addition, we are looking into the ability to manage DNS zones (create and delete), which is required for IPv6 reverse zones. Finally, we are looking to create a powerful admin experience for our DNS operators who are looking to take advantage of the data that lives in the VinylDNS database.
### Opening up
[VinylDNS][5] is an open source project released and managed by [Comcast Open Source][6]. VinylDNS and its accompanying ecosystem were built by engineers in several organizations across Comcast, leveraging our inner source program. It is free for use, licensed under Apache License 2.0. We welcome all contributors, from code to bugs to feature requests, from new projects to project ideas. You can [contact our team on Gitter][7].
--------------------------------------------------------------------------------
via: https://opensource.com/article/20/9/open-source-dns
作者:[Paul Cleary][a]
选题:[lujun9972][b]
译者:[译者ID](https://github.com/译者ID)
校对:[校对者ID](https://github.com/校对者ID)
本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出
[a]: https://opensource.com/users/pauljamescleary
[b]: https://github.com/lujun9972
[1]: https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/LAW-Internet_construction_9401467_520x292_0512_dc.png?itok=RPkPPtDe (An intersection of pipes.)
[2]: https://opensource.com/resources/devops
[3]: https://www.redhat.com/en/topics/api/what-is-a-rest-api
[4]: http://www.comcast.com
[5]: https://www.vinyldns.io
[6]: https://comcast.github.io/
[7]: https://gitter.im/vinyldns/vinyldns


@@ -0,0 +1,140 @@
[#]: collector: (lujun9972)
[#]: translator: ( chenmu-kk )
[#]: reviewer: ( )
[#]: publisher: ( )
[#]: url: ( )
[#]: subject: (9 open source tools for building a fault-tolerant system)
[#]: via: (https://opensource.com/article/19/3/tools-fault-tolerant-system)
[#]: author: (Bryant Son (Red Hat, Community Moderator) https://opensource.com/users/brson)
九个用来构建容错系统的开源工具
======
这些开源工具可以最大化延长运行时间并且在最大程度上减少问题。
![magnifying glass on computer screen, finding a bug in the code][1]
我一直对 Web 开发和软件体系结构很感兴趣,因为我喜欢看到一个运行中的系统的全貌。无论是构建移动应用还是 Web 应用,都必须连接到互联网,以便在不同的模块之间交换数据,这意味着你需要 Web 服务。
如果选择云系统作为应用程序的后端,则可以利用更优秀的计算能力,因为后端服务将会在水平和垂直方向上进行扩展并编排不同的服务。但无论你是否使用云后端,建造一个灵活、稳定、快速又安全的容错系统是必不可少的。
要了解容错系统,让我们以脸书、亚马逊、谷歌和奈飞为例。数以亿计的用户会同时接入这些平台并通过对等网络和用户-服务器网络传输大量数据你可以肯定这其中还存在许多的带有不法目的的恶意用户例如黑客攻击和拒绝服务DoS攻击。即使如此这些平台无需停机也可以全年无休地运转。
尽管机器学习和智能算法是这些系统的基石,但它们能够提供始终如一的服务、没有一分钟的停机时间,这一点值得称赞。昂贵的硬件和庞大的数据中心当然很重要,但支撑这些服务的精巧的软件设计同样重要,而容错正是构建这种精巧系统的原则之一。
### 在生产环境中导致问题的两种行为
换个角度来理解容错系统:当你在本地运行应用程序服务时,一切似乎都很完美。棒极了!但当你把服务发布到生产环境时,就会变得一团糟。在这种情况下,容错系统可以通过解决两个问题来提供帮助:故障停止行为和拜占庭行为。
#### 故障停止行为
故障停止行为是指运行中的系统突然停止运行,或者系统中的某些部分发生故障。服务器停机和数据库无法访问都属于这种类型。举个例子,在下图中,由于服务 2 无法访问,服务 1 无法与服务 2 进行通信:
![Fail-stop behavior due to Service 2 downtime][2]
但是,如果服务之间存在网络问题,也会出现此问题,如下图所示:
![Fail-stop behavior due to network failure][3]
#### 拜占庭行为
拜占庭行为是指系统连续运行,但并不产生预期行为(例如:错误的数据或者无效的数据)。
如果服务2的数据或者值已损坏则可能会发生拜占庭故障即使服务看起来运行得很好比如下面的例子:
![Byzantine failure due to corrupted service][4]
或者,可能存在恶意的中间人在服务之间拦截并注入不需要的数据:
![Byzantine failure due to malicious middleman][5]
故障停止和拜占庭行为都不是我们想要的情况,因此我们需要一些预防或修复它们的手段,这正是容错系统发挥作用的地方。以下是可以帮助你解决这些问题的 9 个开源工具。
### 构建容错系统的工具
尽管构建一个真正实用的容错系统涉及深入的*分布式计算理论*和复杂的计算机科学原理,但也有许多软件工具(比如下面这些,其中许多是开源的)可以通过构建容错系统来减轻不良后果。
#### 断路模式:Hystrix 和 Resilience4j
[断路模式][6]是一种在服务失败时返回预先准备好的虚拟回应或简单回应的技术:
![Circuit breaker pattern][7]
奈飞开源的 **[Hystrix][8]** 是断路模式最流行的实现。
我之前工作过的很多家公司都在用这款出色的工具。令人意外的是,奈飞宣布将不再更新 Hystrix(是的,我知道)。奈飞建议改用 [**Resilience4j**][9] 这样的替代方案(它支持 Java 8 和函数式编程),或者 [Adaptive Concurrency Limit][10] 这样的替代做法。
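为了让断路模式更直观,下面给出一个不依赖任何库的极简 Python 示意(Hystrix 和 Resilience4j 是 Java 库,这里只是演示思路;失败阈值和恢复时间都是任意假设的):

```python
import time

class CircuitBreaker:
    """断路模式的极简示意(非生产代码)。"""

    def __init__(self, failure_threshold=3, reset_timeout=10.0):
        self.failure_threshold = failure_threshold  # 连续失败多少次后断开
        self.reset_timeout = reset_timeout          # 断开多少秒后允许重试
        self.failures = 0
        self.opened_at = None

    def call(self, func, fallback):
        # 断路器处于断开状态时,直接返回预先准备好的简单回应
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()
            self.opened_at = None  # 半开状态:放行一次试探调用
        try:
            result = func()
            self.failures = 0      # 调用成功,重新闭合断路器
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # 失败过多,断开断路器
            return fallback()

# 用法示意:用预置的虚拟回应包装一个不稳定的远程调用
breaker = CircuitBreaker()
# breaker.call(lambda: call_service_2(), fallback=lambda: {"status": "cached"})
```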
#### 负载均衡Nginx 和HaProxy
负载均衡是分布式系统中最基本的概念之一,要搭建达到生产质量的环境,就必须有负载均衡。要理解负载均衡器,首先我们需要明白*冗余*的概念:每个达到生产质量的 Web 服务都有多台服务器,当某台服务器宕机时,它们提供冗余来接管并维持服务。
![Load balancer][11]
想想现代飞机:它们的双引擎提供了冗余,即使一个引擎着火也能安全着陆。(大多数商用飞机都配备了最先进的自动化系统,这同样很有帮助。)但是,拥有多个引擎(或者多台服务器)意味着必须有某种调度机制,在出现故障时有效地对系统进行路由。
负载均衡器是一种通过在多个服务器节点之间进行均衡来优化大流量事务的设备或软件。举个例子,当数以千计的请求涌入时,负载均衡器作为中间层,在不同的服务器之间路由并均匀地分配流量。如果某台服务器宕机,负载均衡器会把请求转发给其他运行良好的服务器。
有许多可用的负载均衡器,但其中最出名的两个就是 Nginx 和 HaProxy。
[**Nginx**][12] 不仅仅是一个负载均衡器,它还是 HTTP 服务器和反向代理服务器、邮件代理服务器以及通用的 TCP/UDP 代理服务器。Groupon、Capital One、Adobe 和 NASA 等公司都在使用它。
[**HaProxy**][13] 也很受欢迎,它是一个免费、非常快速且可靠的解决方案,为基于 TCP 和 HTTP 的应用程序提供高可用性、负载均衡和代理。包括 GitHub、Reddit、Twitter 和 Stack Overflow 在内的许多大型互联网公司都在使用 HaProxy。对了,Red Hat Enterprise Linux 也支持 HaProxy 配置。
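Nginx 和 HaProxy 属于通过配置来使用的软件,但其核心的调度思想可以用几行代码来示意。下面是一个玩具级的轮询负载均衡器草图(Python),它会跳过被标记为不健康的后端;其中的服务器地址和健康检查方式都是假设的,仅作说明:

```python
import itertools

class RoundRobinBalancer:
    """玩具级轮询负载均衡器:跳过被标记为宕机的后端。"""

    def __init__(self, backends):
        self.backends = backends                 # 多台服务器提供冗余
        self.healthy = set(backends)
        self._cycle = itertools.cycle(backends)

    def mark_down(self, backend):
        self.healthy.discard(backend)            # 健康检查失败,摘除该节点

    def mark_up(self, backend):
        self.healthy.add(backend)                # 节点恢复后重新加入

    def pick(self):
        # 沿着轮询环寻找下一台健康的服务器
        for _ in range(len(self.backends)):
            candidate = next(self._cycle)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("没有可用的健康后端")

balancer = RoundRobinBalancer(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
balancer.mark_down("10.0.0.2")    # 模拟一台服务器宕机
print(balancer.pick())            # 流量仍会被路由到健康的服务器上
```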
#### 参与者模型:Akka
[参与者模型][14]是一种并发设计模式:当作为基本计算单元的*参与者*(actor)收到一条消息时,由它来负责处理;一个参与者还可以创建更多的参与者,并把消息委派给它们。
[**Akka**][15] 是参与者模型最著名的实现之一,该框架同时支持 Java 和 Scala,两者都基于 JVM。
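Akka 是 JVM 上的框架,但参与者模型本身的思想很简单:每个参与者拥有自己的信箱,一次只处理一条消息。下面用 Python 的线程加队列做一个极简示意(这不是 Akka 的 API,只是概念演示):

```python
import queue
import threading
import time

class Actor:
    """极简参与者:私有信箱 + 单线程逐条处理消息。"""

    def __init__(self, handler):
        self.mailbox = queue.Queue()
        self.handler = handler
        threading.Thread(target=self._run, daemon=True).start()

    def send(self, message):
        self.mailbox.put(message)        # 发送是异步的,调用方不会被阻塞

    def _run(self):
        while True:
            message = self.mailbox.get()
            self.handler(message)        # 消息被逐条处理,无需加锁

# 用法示意:一个打印消息的参与者,以及一个把工作委派给它的父参与者
printer = Actor(lambda msg: print("printer 收到:", msg))
parent = Actor(lambda msg: printer.send(msg.upper()))   # 把消息委派给其他参与者
parent.send("hello, actors")
time.sleep(0.2)                          # 给后台线程一点时间处理完信箱
```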
#### 使用消息队列的异步、非阻塞 I/O:Kafka 和 RabbitMQ
多线程开发在过去很流行,但是这种做法已经不被鼓励,取而代之的是异步的、非阻塞的 I/O 模式。对于 Java,这一点在其 [Enterprise JavaBean(EJB)规范][16] 中有明确说明:
> “企业bean一定不能使用线程同步原语来同步多个实例的执行。”
>
> “企业bean不得试图去管理线程。企业bean不得试图启动、停止、挂起或恢复线程或者去更改线程的优先级或者名称。企业bean不得试图管理线程组。”
如今有其他做法像流API和参与者模型。但像[**Kafka**][17] 和[**RabbitMQ**][18]之类的消息队列为异步和非阻塞IO功能提供了开箱即用的支持同时它们也是功能强大的开源工具通过处理并发进程可以替代线程。
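下面用第三方的 kafka-python 包给出一个生产者/消费者的小示意,演示把并发工作交给消息队列、而不是自己管理线程的做法;它假设本机 localhost:9092 上有可用的 Kafka 代理,并且已通过 `pip install kafka-python` 安装了该包,代码只是示意性的草图:

```python
from kafka import KafkaProducer, KafkaConsumer   # 假设已安装 kafka-python

# 生产者:把消息异步写入 "orders" 主题,而不是自己启动线程去处理
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("orders", b'{"order_id": 42}')
producer.flush()                                  # 确保消息真正发送出去

# 消费者:另一个进程(或服务)按自己的节奏拉取并处理消息
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,                     # 没有新消息时 5 秒后退出循环
)
for message in consumer:
    print("处理订单:", message.value)
```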
#### 其他的选择:Eureka 和 Chaos Monkey
其他对容错系统有用的工具还包括奈飞的 **[Eureka][19]** 之类的监控工具,以及 **[Chaos Monkey][20]** 之类的压力测试工具。它们旨在通过在集成(INT)、质量保障(QA)和用户验收测试(UAT)等较低级别的环境中进行测试,及早发现潜在问题,防止它们进入生产环境。
* * *
你在使用什么开源工具来构建容错系统呢?请在评论中分享你最喜欢的工具。
--------------------------------------------------------------------------------
via: https://opensource.com/article/19/3/tools-fault-tolerant-system
作者:[Bryant Son (Red Hat, Community Moderator)][a]
选题:[lujun9972][b]
译者:[chenmu-kk](https://github.com/chenmu-kk)
校对:[校对者ID](https://github.com/校对者ID)
本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出
[a]: https://opensource.com/users/brson
[b]: https://github.com/lujun9972
[1]: https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/mistake_bug_fix_find_error.png?itok=PZaz3dga (magnifying glass on computer screen, finding a bug in the code)
[2]: https://opensource.com/sites/default/files/uploads/1_errordowntimeservice.jpg (Fail-stop behavior due to Service 2 downtime)
[3]: https://opensource.com/sites/default/files/uploads/2_errordowntimenetwork.jpg (Fail-stop behavior due to network failure)
[4]: https://opensource.com/sites/default/files/uploads/3_byzantinefailuremalicious.jpg (Byzantine failure due to corrupted service)
[5]: https://opensource.com/sites/default/files/uploads/4_byzantinefailuremiddleman.jpg (Byzantine failure due to malicious middleman)
[6]: https://martinfowler.com/bliki/CircuitBreaker.html
[7]: https://opensource.com/sites/default/files/uploads/5_circuitbreakerpattern.jpg (Circuit breaker pattern)
[8]: https://github.com/Netflix/Hystrix/wiki
[9]: https://github.com/resilience4j/resilience4j
[10]: https://medium.com/@NetflixTechBlog/performance-under-load-3e6fa9a60581
[11]: https://opensource.com/sites/default/files/uploads/7_loadbalancer.jpg (Load balancer)
[12]: https://www.nginx.com
[13]: https://www.haproxy.org
[14]: https://en.wikipedia.org/wiki/Actor_model
[15]: https://akka.io
[16]: https://jcp.org/aboutJava/communityprocess/final/jsr220/index.html
[17]: https://kafka.apache.org
[18]: https://www.rabbitmq.com
[19]: https://github.com/Netflix/eureka
[20]: https://github.com/Netflix/chaosmonkey