translating

This commit is contained in:
geekpi 2017-10-17 09:01:57 +08:00
parent 9981ce953d
commit cab4a35311
2 changed files with 57 additions and 59 deletions


@ -1,59 +0,0 @@
translating---geekpi
Our journey from Redis 2 to Redis 3 without taking the site down
============================================================
We use [Redis][2] within Sky Betting & Gaming as a shared in-memory cache for things like identity tokens that need to be known across API servers or web servers. Within the Core Tribe this is used to help deal with the huge number of logins we have to handle day to day, particularly at busy times when we could have more than 20,000 people logging in within a single minute. This works well insofar as the data is readily available to a large number of servers (in the case of SSO tokens, 70 Apache HTTPD servers). We've recently embarked upon a process of upgrading our Redis servers, and this upgrade is intended to enable the use of the native clustering features available from Redis 3.2. This blog post hopes to explain why we're using clustering, what problems we have encountered along the way, and what our solutions have been.
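To make the pattern concrete, here is a minimal sketch of what a shared token cache looks like at the Redis level; the key name, payload, and TTL are invented for illustration and are not from our systems:

```
redis-cli SET sso:token:3f9a8c '{"user":12345}' EX 14400   # write once, with a few hours of TTL
redis-cli GET sso:token:3f9a8c                             # any of the web servers can now read it
redis-cli TTL sso:token:3f9a8c                             # and see how long it has left to live
```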
### In the beginning (or at least before the upgrade)
Our legacy caches consisted of a pair of Redis servers for each cache that we had, with keepalived running to ensure that there was always a master node listening on a floating IP address. These failover pairs required considerable effort to manage when things went wrong, and the failure modes were sometimes quite interesting. On occasion the slave node, which would only allow reads of the data it held and not writes, would end up with the floating IP address. This was relatively easy to diagnose, but it broke whichever application was trying to use that cache at the time in painful ways.
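For readers unfamiliar with that arrangement, a keepalived failover pair for a cache looks roughly like the sketch below; the interface, priorities, and floating IP are invented rather than taken from our configuration:

```
vrrp_instance redis_cache {
    state MASTER              # BACKUP on the other node of the pair
    interface eth0
    virtual_router_id 51
    priority 150              # lower priority on the slave
    virtual_ipaddress {
        192.168.10.100        # the floating IP the applications connect to
    }
}
```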
### The new application
So while we were in this situation we needed to build a new application, one that used a shared in-memory cache, but we didn't want that cache to be at the mercy of a dodgy failover process. So our requirements were a shared in-memory cache with no single point of failure, one that could cope with multiple different failure modes using as little human intervention as possible, and that could recover cleanly after those events, again with little human intervention. An additional ask was to improve the security of the cache to reduce the scope for data exfiltration (more on that later). At the time Redis Sentinel was looking promising, and there were a number of applications floating about to allow proxying of Redis connections, such as [twemproxy][3]. This would have led to a setup with many moving parts; it should have worked, with minimal human interaction, but it was complex and needed a large number of servers and services running and communicating with each other.
![Redis Sentinel and TwemProxy](http://engineering.skybettingandgaming.com/images/Redis-Sentinel-and-TwemProxy.svg)
There would be a number of application servers talking to twemproxy, which would route their calls to an appropriate Redis master; twemproxy would get the information on the masters from a sentinel cluster, which would control which Redis instances were master and which were slave. This setup, as well as being complex, still had a single point of failure: it relied on twemproxy to handle sharding and connections to the correct Redis instance. It had the advantage of being transparent to the application, so we could in theory, having built this, have moved existing applications over to this Redis configuration without changing them. But we were building an application from scratch, so migration of an application wasn't a requirement, yet.
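As a rough illustration of the twemproxy piece of that diagram, a nutcracker pool definition looks something like this; the pool name, ports, and server addresses are invented:

```yaml
sso-cache:
  listen: 127.0.0.1:22121
  hash: fnv1a_64
  distribution: ketama       # twemproxy, not Redis, decides which server owns each key
  redis: true
  servers:
    - 192.168.10.11:6379:1
    - 192.168.10.12:6379:1
```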
Fortunately it was at this time that Redis 3.2 came out, with native clustering built in, removing the need for a separate sentinel cluster.
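For context, enabling native clustering on a Redis 3.2 node comes down to a few redis.conf directives, with the cluster itself typically created using the redis-trib.rb tool that ships with Redis 3; the directives are standard, but the values and addresses below are illustrative only:

```
cluster-enabled yes
cluster-config-file nodes.conf
cluster-node-timeout 5000

# ./redis-trib.rb create --replicas 1 10.0.0.1:6379 10.0.0.2:6379 10.0.0.3:6379 \
#     10.0.0.4:6379 10.0.0.5:6379 10.0.0.6:6379
```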
![Redis3 Cluster and Twemproxy](http://engineering.skybettingandgaming.com/images/Redis3-Cluster-and-Twemproxy.svg)
This allowed for a simpler setup, but alas twemproxy didn't support Redis cluster sharding; it could shard data for you, but if it tried to do so in a manner inconsistent with the cluster's sharding it would cause problems. There were guides available to make the two match up, but the cluster could change shape automatically and change the way the sharding was set up. And it still had a single point of failure. It is at this point that I will be forever grateful to one of my colleagues who found a Redis cluster-aware driver for Node.js, allowing us to drop twemproxy altogether.
![Redis3 Cluster](http://engineering.skybettingandgaming.com/images/Redis3-Cluster.svg)
With this we were able to shard data automatically, and failovers and failbacks were largely automatic. The application knew which nodes existed, and when writing data, if it wrote to the wrong node the cluster would redirect that write automatically. This was the configuration that was chosen, and it worked: we had a shared in-memory cache that was reasonably robust and could cope with basic failure modes without intervention. During testing we did find some flaws. Replication was on a node-by-node basis, so if we lost a master node, then its slave became a single point of failure until the dead node was restored into service. Also, only the masters voted on the cluster health, so if we lost too many masters too quickly the cluster wouldn't self-heal. But this was better than what we had.
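The post doesn't name the driver, so purely as a sketch, here is what a cluster-aware Node.js client such as ioredis looks like in use; the hosts, ports, and key names are invented:

```js
const Redis = require('ioredis');

const cache = new Redis.Cluster([
  { host: '10.0.0.1', port: 6379 },
  { host: '10.0.0.2', port: 6379 },
  { host: '10.0.0.3', port: 6379 },
]);

async function main() {
  // If this key hashes to a slot owned by a different master, the driver follows
  // the cluster's MOVED redirection transparently, which is the behaviour described above.
  await cache.set('sso:token:3f9a8c', JSON.stringify({ user: 12345 }), 'EX', 14400);
  console.log(await cache.get('sso:token:3f9a8c'));
  cache.disconnect();
}

main();
```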
### Moving forward
With a new application using a clustered Redis configuration, we became increasingly uncomfortable with the state of the legacy Redis instances, but the new application simply wasn't of the same scale as the existing ones (over 30GB of memory is dedicated to the database of our largest legacy Redis instance). So with Redis cluster proven at this smaller scale, we decided to migrate off the legacy Redis instances to new Redis clusters.
As we had a Node.js Redis driver that supported Redis cluster natively, we started with the migration of our Node.js applications onto Redis cluster. But how do you go about moving tens of gigabytes of constantly shifting data from one place to another without causing major problems? Especially given these bits of data are things like authentication tokens, so if they were wrong our end users would be logged out. One option was to ask for a full site outage, point everything over to the new Redis cluster, migrate the data into it, and hope for the best. Another option was to switch over to the new cluster and force all our users to log in again. Neither of these proved to be terribly palatable, for obvious reasons. The alternative that was decided upon was to write the data to both the legacy Redis instance and the cluster that was replacing it at the same time; we would then read from the cluster increasingly often as time went on. As the data has a limited shelf life (tokens expire after a few hours), this approach should result in zero downtime and no risk of data loss. And so it was. The migration was a success.
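A rough sketch of that dual-write approach follows; it is not the actual migration code, and the connection details and the read-ratio knob are invented for illustration:

```js
const Redis = require('ioredis');

// Invented connection details; the real ones are not in the post.
const legacyRedis = new Redis({ host: 'legacy-redis.internal', port: 6379 });
const redisCluster = new Redis.Cluster([{ host: '10.0.0.1', port: 6379 }]);

// Ramped from 0 towards 1 over the course of the migration.
const READ_FROM_CLUSTER_RATIO = 0.25;

// Every write goes to both the legacy instance and the cluster replacing it.
async function setToken(key, value, ttlSeconds) {
  await Promise.all([
    legacyRedis.set(key, value, 'EX', ttlSeconds),
    redisCluster.set(key, value, 'EX', ttlSeconds),
  ]);
}

// Reads shift gradually towards the cluster; because tokens expire after a few
// hours, the legacy instance simply drains once the ratio reaches 1.
async function getToken(key) {
  const source = Math.random() < READ_FROM_CLUSTER_RATIO ? redisCluster : legacyRedis;
  return source.get(key);
}
```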
All that remained was the Redis instances that served our PHP code (well, one of them anyway; the other turned out to be unnecessary in the end), and we hit a thorn, actually a couple, in the road. First, and most pressing, was finding a Redis cluster-aware driver that we could use in PHP, with the version of PHP we were using. This proved to be doable, because we had upgraded to a recent version of PHP. Alas, the driver we chose did not like using Redis auth, something we had decided to use with Redis cluster as an extra security step (I told you there would be more on that security thing). As we were replacing each legacy Redis instance with its own Redis cluster, the fix seemed straightforward: turn Redis auth off, and all would be well with the world. However, this did not prove to be true; for some reason, having done this, the Redis cluster wouldn't accept connections from the web servers. A new security feature introduced by Redis in version 3, called protected mode, stops Redis listening to connections from external IP addresses when Redis is bound to all interfaces and no Redis auth password is configured. This proved reasonably easy to fix, but it caught us off guard.
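For anyone hitting the same surprise, these are the redis.conf knobs involved: with the defaults, no requirepass and no explicit bind address, protected mode refuses connections from other hosts. Any one of the following (values illustrative) lets remote clients back in:

```
requirepass a-long-random-password   # keep Redis auth on, or...
protected-mode no                    # ...explicitly turn protected mode off, or...
bind 10.0.0.5                        # ...bind to a specific, non-default address
```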
### And now?
So this is where we find ourselves. We have migrated off some of our legacy Redis instances, and are migrating off the rest. We have, by doing this, paid down some of our technical debt and improved our platform's stability. With Redis cluster we can also scale out the in-memory databases as well as scale them up. Redis is single-threaded, so just throwing more memory at a single instance is only ever going to allow so much growth, and we are already nipping at the heels of that limit. We're expecting improved performance from the new cluster, as well as more options for expansion and load balancing.
### What about the Future?
So we have solved some technical debt, and made our services easier to support and more stable. That doesn't mean the job is done; indeed, Redis 4 appears to have some features that we may want to look into. And Redis isn't the only software we use. We will continue to work to improve the platform and reduce the time spent dealing with technical debt, but as our customer base expands and we strive to offer ever richer services, we are always going to end up with things that need improving. The next challenge is likely to be related to scaling from more than 20,000 logins a minute to more than 40,000, and even beyond.
--------------------------------------------------------------------------------
via: http://engineering.skybettingandgaming.com/2017/09/25/redis-2-to-redis-3/
Author: [Craig Stewart][a]
Translator: [译者ID](https://github.com/译者ID)
Proofreader: [校对者ID](https://github.com/校对者ID)
This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](https://linux.cn/)
[a]:http://engineering.skybettingandgaming.com/authors#craig_stewart
[1]:http://engineering.skybettingandgaming.com/category/devops/
[2]:https://redis.io/
[3]:https://github.com/twitter/twemproxy


@@ -0,0 +1,57 @@
Our journey from Redis 2 to Redis 3 without taking the site down
============================================================
We use [Redis][2] within Sky Betting & Gaming as a shared in-memory cache for things like identity tokens that need to be known across API servers or web servers. Within the Core Tribe it is used to help deal with the huge number of logins we have to handle, particularly at busy times when more than 20,000 people can log in within a single minute. This works well insofar as the data is readily available to a large number of servers (in the case of SSO tokens, 70 Apache HTTPD servers). We recently began the process of upgrading our Redis servers, an upgrade intended to enable the native clustering features provided by Redis 3.2. This blog post hopes to explain why we are using clustering, the problems we encountered along the way, and our solutions.
### In the beginning (or at least before the upgrade)
Each of our legacy caches consisted of a pair of Redis servers, with keepalived running to ensure that there was always a master node listening on a floating IP address. These failover pairs took considerable effort to manage when things went wrong, and the failure modes were sometimes quite interesting. On occasion the slave node, which would only allow reads of the data it held and not writes, would end up with the floating IP address. This was relatively easy to diagnose, but it broke whichever application was trying to use the cache at the time in painful ways.
### The new application
So, in this situation, we needed to build a new application, one that used a shared in-memory cache, but we did not want that cache to depend on a dodgy failover process. Our requirements were therefore a shared in-memory cache with no single point of failure, able to cope with multiple different failure modes with as little human intervention as possible, and able to recover cleanly after those events, again with little human intervention. An additional requirement was to improve the security of the cache to reduce the scope for data exfiltration (more on that later). At the time Redis Sentinel looked promising, and there were a number of applications that supported proxying Redis connections, such as [twemproxy][3]. This would have meant installing many more components; it should have worked with minimal human interaction, but it was complex and required a large number of servers and services running and communicating with each other.
![Redis Sentinel and TwemProxy](http://engineering.skybettingandgaming.com/images/Redis-Sentinel-and-TwemProxy.svg)
There would be a large number of application servers talking to twemproxy, which would route their calls to the appropriate Redis master; twemproxy would get the information about the masters from a sentinel cluster, which would control which Redis instance was master and which was slave. This setup was complex and still had a single point of failure: it relied on twemproxy to handle sharding and to connect to the correct Redis instance. It had the advantage of being transparent to the application, so in theory we could have moved existing applications over to this Redis configuration without changing them. But we were building an application from scratch, so migrating an application was not yet a requirement.
Fortunately, it was at this time that Redis 3.2 came out, with native clustering built in, removing the need for a separate sentinel cluster.
![Redis3 Cluster and Twemproxy](http://engineering.skybettingandgaming.com/images/Redis3-Cluster-and-Twemproxy.svg)
This allowed a simpler setup, but alas twemproxy did not support Redis cluster sharding; it could shard data for you, but trying to do so in a way inconsistent with the cluster's own sharding would cause problems. There were guides available to make the two match up, but the cluster could change shape automatically and change the way the sharding was set up. And it still had a single point of failure. It is at this point that I will be forever grateful to one of my colleagues who found a Redis cluster-aware driver for Node.js, which let us drop twemproxy altogether.
![Redis3 Cluster](http://engineering.skybettingandgaming.com/images/Redis3-Cluster.svg)
With this we were able to shard data automatically, and failover and failback were largely automatic. The application knew which nodes existed, and when writing data, if it wrote to the wrong node the cluster would redirect that write automatically. This was the configuration that was chosen, and it gave us a shared in-memory cache that was reasonably robust and could cope with basic failure modes without intervention. During testing we did find some flaws. Replication was on a node-by-node basis, so if we lost a master node its slave became a single point of failure until the dead node was restored into service; also, only the masters voted on the cluster health, so if we lost too many masters too quickly the cluster could not heal itself. But this was better than what we had before.
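When reasoning about those failure modes, the cluster's own view of its health is visible from any node; a quick check looks like this (the host is invented):

```
redis-cli -h 10.0.0.1 -p 6379 cluster info    # cluster_state:ok, known nodes, slots assigned
redis-cli -h 10.0.0.1 -p 6379 cluster nodes   # which nodes are masters, and which are slaves of whom
```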
### Moving forward
With the new application using a clustered Redis configuration, we became increasingly uncomfortable with the state of the legacy Redis instances, but the new application simply was not of the same scale as the existing ones (over 30GB of memory is dedicated to the database of our largest legacy Redis instance). So, with Redis cluster proven at this smaller scale, we decided to migrate off the legacy Redis instances to new Redis clusters.
As we had a Node.js Redis driver that natively supported Redis cluster, we started by migrating our Node.js applications to Redis cluster. But how do you move tens of gigabytes of constantly changing data from one place to another without causing major problems? Especially given that this data consists of things like authentication tokens, so if they were wrong our end users would be logged out. One option was to ask for a full site outage, point everything at the new Redis cluster, migrate the data into it, and hope for the best. Another option was to switch over to the new cluster and force all our users to log in again. For obvious reasons, neither of these was very palatable. The alternative that was decided upon was to write the data to both the legacy Redis instance and the cluster replacing it at the same time, while gradually reading from the cluster more and more often over time. As the data has a limited lifetime (tokens expire after a few hours), this approach could achieve zero downtime with no risk of data loss. And so it was done. The migration was a success.
All that remained was the Redis instances serving our PHP code (well, one of them anyway; the other turned out to be unnecessary in the end), and here we hit a thorn in the road, actually a couple. First, and most pressing, was finding a Redis cluster-aware driver that we could use in PHP, with the version of PHP we were using. This proved to be doable, because we had upgraded to a recent version of PHP. Alas, the driver we chose did not like using Redis auth, something we had decided to use with Redis cluster as an extra security step (I told you there would be more on that security thing). As we were replacing each legacy Redis instance with its own Redis cluster, the fix seemed straightforward: turn Redis auth off and all would be well. However, this turned out not to be true; for some reason the Redis cluster then would not accept connections from the web servers. A new security feature introduced in Redis 3, called protected mode, stops Redis listening for connections from external IP addresses when Redis is bound to all interfaces and no Redis auth password is configured. This proved fairly easy to fix, but it caught us off guard.
### And now?
This is where we find ourselves now. We have migrated some of our legacy Redis instances and are migrating the rest. By doing this we have paid down some of our technical debt and improved the stability of our platform. With Redis cluster we can scale out the in-memory databases as well as scale them up. Redis is single-threaded, so just throwing more memory at a single instance will only allow so much growth, and we are already close to that limit. We expect improved performance from the new clusters, as well as more options for expansion and load balancing.
### What about the future?
We have solved some technical debt, making our services easier to support and more stable. But that does not mean the job is done; Redis 4 appears to have some features that we may want to look into. And Redis is not the only software we use. We will continue to work to improve the platform and reduce the time spent dealing with technical debt, but as our customer base expands and we strive to offer ever richer services, we will always end up with things that need improving. The next challenge is likely to be related to scaling from more than 20,000 logins a minute to more than 40,000, and even beyond.
--------------------------------------------------------------------------------
via: http://engineering.skybettingandgaming.com/2017/09/25/redis-2-to-redis-3/
Author: [Craig Stewart][a]
Translator: [geekpi](https://github.com/geekpi)
Proofreader: [校对者ID](https://github.com/校对者ID)
This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](https://linux.cn/)
[a]:http://engineering.skybettingandgaming.com/authors#craig_stewart
[1]:http://engineering.skybettingandgaming.com/category/devops/
[2]:https://redis.io/
[3]:https://github.com/twitter/twemproxy