From 287612d0afbd09c3f4484b22fdab0562c2792698 Mon Sep 17 00:00:00 2001
From: geekpi
Date: Tue, 14 Nov 2017 09:05:15 +0800
Subject: [PATCH] translated

---
 .../tech/20170524 View Counting at Reddit.md | 60 ++++++------
 .../tech/20170524 View Counting at Reddit.md | 91 +++++++++++++++++++
 2 files changed, 120 insertions(+), 31 deletions(-)
 create mode 100644 translated/tech/20170524 View Counting at Reddit.md

diff --git a/sources/tech/20170524 View Counting at Reddit.md b/sources/tech/20170524 View Counting at Reddit.md
index 74d151d22b..8807345d63 100644
--- a/sources/tech/20170524 View Counting at Reddit.md
+++ b/sources/tech/20170524 View Counting at Reddit.md
@@ -1,64 +1,62 @@
-translating---geekpi
-
-View Counting at Reddit
+View Counting at Reddit
 ======================

 ![](https://redditupvoted.files.wordpress.com/2017/05/content-view-header.png?crop=0px%2C9px%2C1000px%2C483px&resize=1200%2C580)

-We want to better communicate the scale of Reddit to our users. Up to this point, vote score and number of comments were the main indicators of activity on a given post. However, Reddit has many visitors that consume content without voting or commenting. We wanted to build a system that could capture this activity by counting the number of views a post received. This number is then shown to content creators and moderators to provide them better insight into the activity on specific posts.
+We want to better communicate the scale of Reddit to our users. Up to this point, vote score and number of comments were the main indicators of activity on a given post. However, Reddit has many visitors that consume content without voting or commenting. We wanted to build a system that could capture this activity by counting the number of views a post received. This number is then shown to content creators and moderators to provide them better insight into the activity on specific posts.

 ![](https://redditupvoted.files.wordpress.com/2017/05/cvcs-neeson-fix.png?w=372&h=743&zoom=2)

-In this post, we’re going to talk about how we implemented counting at scale.
+In this post, we’re going to talk about how we implemented counting at scale.

-### Counting Methodology
+### Counting Methodology

-We had four main requirements for counting views:
+We had four main requirements for counting views:

-*Counts must be real time or near-real time. No daily or hourly aggregates.
-*Each user must only be counted once within a short time window.
-*The displayed count must be within a few percentage points of the actual tally.
-*The system must be able to run at production scale and process events within a few seconds of their occurrence.
+* Counts must be real time or near-real time. No daily or hourly aggregates.
+* Each user must only be counted once within a short time window.
+* The displayed count must be within a few percentage points of the actual tally.
+* The system must be able to run at production scale and process events within a few seconds of their occurrence.

-Satisfying all four of these requirements is trickier than it sounds. In order to maintain an exact count in real time we would need to know whether or not a specific user visited the post before. To know that information, we would need to store the set of users who had previously visited each post, and then check that set every time we processed a new view on a post. A naive implementation of this solution would be to store the unique user set as a hash table in memory, with the post ID as the key.
+Satisfying all four of these requirements is trickier than it sounds. In order to maintain an exact count in real time we would need to know whether or not a specific user visited the post before. To know that information, we would need to store the set of users who had previously visited each post, and then check that set every time we processed a new view on a post. A naive implementation of this solution would be to store the unique user set as a hash table in memory, with the post ID as the key.

-This approach works well for less trafficked posts, but is very difficult to scale once a post becomes popular and the number of viewers rapidly increases. Several popular posts have over one million unique viewers! On posts like these, it becomes extremely taxing on both memory and CPU to store all the IDs and do frequent lookups into the set to see if someone has already visited before.
+This approach works well for less trafficked posts, but is very difficult to scale once a post becomes popular and the number of viewers rapidly increases. Several popular posts have over one million unique viewers! On posts like these, it becomes extremely taxing on both memory and CPU to store all the IDs and do frequent lookups into the set to see if someone has already visited before.

-Since we could not provide exact counts, we looked at a few different [cardinality estimation][1] algorithms. We considered two options that closely matched what we were looking to accomplish:
+Since we could not provide exact counts, we looked at a few different [cardinality estimation][1] algorithms. We considered two options that closely matched what we were looking to accomplish:

-1. A linear probabilistic counting approach, which is very accurate, but requires linearly more memory as the set being counted gets larger.
-2. A [HyperLogLog][2] (HLL)-based counting approach.
HLLs grow sub-linearly with set size, but do not provide the same level of accuracy as linear counters.
+1. A linear probabilistic counting approach, which is very accurate, but requires linearly more memory as the set being counted gets larger.
+2. A [HyperLogLog][2] (HLL)-based counting approach. HLLs grow sub-linearly with set size, but do not provide the same level of accuracy as linear counters.

-For an understanding of just how much space HLLs really save, consider the r/pics post included at the top of this blog post. It received over 1 million unique users. If we had to store 1 million unique user IDs, and each user ID is an 8-byte long, then we would require 8 megabytes of memory just to count the unique users for a single post! In contrast, using an HLL for counting would take significantly less memory. The amount of memory varies per implementation, but in the case of [this implementation][3], we could count over 1 million IDs using just 12 kilobytes of space, which would be 0.15% of the original space usage!
+For an understanding of just how much space HLLs really save, consider the r/pics post included at the top of this blog post. It received over 1 million unique users. If we had to store 1 million unique user IDs, and each user ID is an 8-byte long, then we would require 8 megabytes of memory just to count the unique users for a single post! In contrast, using an HLL for counting would take significantly less memory. The amount of memory varies per implementation, but in the case of [this implementation][3], we could count over 1 million IDs using just 12 kilobytes of space, which would be 0.15% of the original space usage!

-([This article on High Scalability][5] has a good overview of both of the above algorithms.)
+([This article on High Scalability][5] has a good overview of both of the above algorithms.)

-Many HLL implementations use a combination of the above two approaches, by starting with linear counting for small sets and switching over to HLL once the size reaches a certain point. The former is frequently referred to as a “sparse” HLL representation, while the latter is referred to as a “dense” HLL representation. The hybrid approach is very advantageous, because it can provide accurate results for both small sets and large sets while retaining a modest memory footprint. This approach is described in more detail in [Google’s HyperLogLog++ paper][6].
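The sparse/dense trade-off and the 12 KB figure above can be made concrete with a toy HyperLogLog. The sketch below is illustrative Python, not Reddit's implementation (their systems use Redis's HLL from Java/Scala services); the SHA-1 hash, the precision `p = 14` (16,384 registers, roughly the 12 KB of Redis's dense representation), and the omission of the sparse/linear-counting path and bias corrections are all simplifications:

```python
import hashlib

def hll_estimate(items, p=14):
    """Toy dense HyperLogLog: m = 2**p registers (~12 KB at p=14, 6 bits each).

    Illustrative only: no sparse mode, no small-range (linear counting)
    correction, so it is only trustworthy when the true count is well above m.
    """
    m = 1 << p
    registers = [0] * m
    for item in items:
        # Use the first 8 digest bytes as a 64-bit "random" hash of the item.
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        bucket = h >> (64 - p)                    # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)          # remaining 64 - p bits
        rank = (64 - p) - rest.bit_length() + 1   # leading zeros in `rest`, plus 1
        registers[bucket] = max(registers[bucket], rank)
    alpha = 0.7213 / (1 + 1.079 / m)              # standard bias constant, m >= 128
    return alpha * m * m / sum(2.0 ** -r for r in registers)

# Typical error at p=14 is around 1%, versus 8 MB to store 1M raw 8-byte IDs.
est = hll_estimate(range(100_000))
print(f"estimated {est:.0f} unique items (true count: 100000)")
```

Because the toy version lacks the linear-counting path, its estimates degrade badly for small sets; closing exactly that gap is what the hybrid sparse/dense designs discussed above are for.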
+Many HLL implementations use a combination of the above two approaches, by starting with linear counting for small sets and switching over to HLL once the size reaches a certain point. The former is frequently referred to as a “sparse” HLL representation, while the latter is referred to as a “dense” HLL representation. The hybrid approach is very advantageous, because it can provide accurate results for both small sets and large sets while retaining a modest memory footprint. This approach is described in more detail in [Google’s HyperLogLog++ paper][6].

-While the HLL algorithm is fairly standard, there were three variants we considered using in our implementation. Note that for in-memory HLL implementations, we only looked at Java and Scala implementations as we primarily use Java and Scala on the data engineering team.
+While the HLL algorithm is fairly standard, there were three variants we considered using in our implementation. Note that for in-memory HLL implementations, we only looked at Java and Scala implementations as we primarily use Java and Scala on the data engineering team.

-1. Twitter’s Algebird library, implemented in Scala. Algebird has good usage docs, but the implementation details of the sparse and dense HLL representations were not easily understandable.
-2. An implementation of HyperLogLog++ located in stream-lib, implemented in Java. The code in stream-lib is very well-documented, but it was somewhat difficult to understand how to use the library properly and tune it to our needs.
-3. Redis’s HLL implementation (which we chose). We felt that the Redis implementation of HLLs was well-documented and easily configurable, and the HLL-related APIs provided were easy to integrate. As an added benefit, using Redis alleviated many of our performance concerns by taking the CPU- and memory-intensive portion of the counting application (HLL computations) out and moving that onto a dedicated server.
+1. Twitter’s Algebird library, implemented in Scala. Algebird has good usage docs, but the implementation details of the sparse and dense HLL representations were not easily understandable.
+2. An implementation of HyperLogLog++ located in stream-lib, implemented in Java. The code in stream-lib is very well-documented, but it was somewhat difficult to understand how to use the library properly and tune it to our needs.
+3. Redis’s HLL implementation (which we chose). We felt that the Redis implementation of HLLs was well-documented and easily configurable, and the HLL-related APIs provided were easy to integrate. As an added benefit, using Redis alleviated many of our performance concerns by taking the CPU- and memory-intensive portion of the counting application (HLL computations) out and moving that onto a dedicated server.

 ![](https://redditupvoted.files.wordpress.com/2017/05/view-counts-architecture.png)

-Reddit’s data pipeline is primarily oriented around [Apache Kafka][7].
When a user views a post, an event gets fired and sent to an event collector server, which batches the events and persists them into Kafka.
+Reddit’s data pipeline is primarily oriented around [Apache Kafka][7]. When a user views a post, an event gets fired and sent to an event collector server, which batches the events and persists them into Kafka.

-From here, the view-counting system has two components which operate sequentially. The first part of our counting architecture is a Kafka consumer called [Nazar][8], which will read each event from Kafka and pass it through a set of rules we’ve concocted to determine whether or not an event should be counted. We gave it this name because just as a nazar is an eye-shaped amulet protecting you from evil, the Nazar system is an “eye” that protects us from bad actors trying to game the system. Nazar uses Redis to maintain state and keep track of potential reasons why a view should not be counted. One reason we may not count an event is if it’s the result of repeat views from the same user over a short period of time. Nazar will then alter the event, adding a Boolean flag indicating whether or not it should be counted, before sending the event back to Kafka.
+From here, the view-counting system has two components which operate sequentially. The first part of our counting architecture is a Kafka consumer called [Nazar][8], which reads each event from Kafka and passes it through a set of rules we’ve concocted to determine whether or not an event should be counted. We gave it this name because just as a nazar is an eye-shaped amulet protecting you from evil, the Nazar system is an “eye” that protects us from bad actors trying to game the system. Nazar uses Redis to maintain state and keep track of potential reasons why a view should not be counted. One reason we may not count an event is if it’s the result of repeat views from the same user over a short period of time. Nazar then alters the event, adding a Boolean flag indicating whether or not it should be counted, before sending the event back to Kafka.

-This is where the second part of the project picks up. We have a second Kafka consumer called [Abacus][9], which does the actual counting of views and makes the counts available for the site or clients to display. Abacus reads the events from Kafka that were output by Nazar; then, depending on Nazar’s determination, it either counts or skips over the view. If the event is marked for counting, then Abacus first checks if there is an HLL counter already existing in Redis for the post corresponding to the event.
If the counter is already in Redis, then Abacus makes a [PFADD][10] request to Redis for that post. If the counter is not already in Redis, then Abacus makes a request to a Cassandra cluster, which we use to persist both the HLL counters and the raw count numbers, and makes a [SET][11] request into Redis to add the filter. This usually happens when people view older posts whose counters have been evicted from Redis.
+This is where the second part of the project picks up. We have a second Kafka consumer called [Abacus][9], which does the actual counting of views and makes the counts available for the site or clients to display. Abacus reads the events from Kafka that were output by Nazar; then, depending on Nazar’s determination, it either counts or skips over the view. If the event is marked for counting, then Abacus first checks if there is an HLL counter already existing in Redis for the post corresponding to the event. If the counter is already in Redis, then Abacus makes a [PFADD][10] request to Redis for that post. If the counter is not already in Redis, then Abacus makes a request to a Cassandra cluster, which we use to persist both the HLL counters and the raw count numbers, and makes a [SET][11] request into Redis to add the filter. This usually happens when people view older posts whose counters have been evicted from Redis.

-In order to allow for maintaining counts on older posts that might have been evicted from Redis, Abacus periodically writes out both the full HLL filter from Redis along with the count for each post to a Cassandra cluster. Writes to Cassandra are batched in 10-second groups per post in order to avoid overloading the cluster. Below is a diagram outlining this event flow at a high level.
+In order to allow for maintaining counts on older posts that might have been evicted from Redis, Abacus periodically writes out both the full HLL filter from Redis along with the count for each post to a Cassandra cluster. Writes to Cassandra are batched in 10-second groups per post in order to avoid overloading the cluster. Below is a diagram outlining this event flow at a high level.

 ![](https://redditupvoted.files.wordpress.com/2017/05/abacus-event-flow.png)

-### Conclusion
+### Conclusion

-We hope that view counts will better enable content creators to understand the full reach of their posts, and help moderators to quickly identify which posts are receiving large amounts of traffic in their communities. In the future, we plan to leverage the real-time potential of our data pipeline to deliver more useful feedback to redditors.
+We hope that view counts will better enable content creators to understand the full reach of their posts, and help moderators to quickly identify which posts are receiving large amounts of traffic in their communities. In the future, we plan to leverage the real-time potential of our data pipeline to deliver more useful feedback to redditors.

-If you’re interested in solving problems like this at scale, [check out our Careers page][12].
+If you’re interested in solving problems like this at scale, [check out our Careers page][12].
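The Nazar-then-Abacus flow described in the post can be sketched end to end. The following is a hypothetical Python sketch: plain dicts and sets stand in for Kafka, Redis, Cassandra, and the HLL filter, and names such as `DEDUP_WINDOW`, `nazar`, and `abacus` are illustrative, not Reddit's actual code:

```python
# Hypothetical sketch of the Nazar -> Abacus pipeline. Plain dicts stand in for
# Kafka topics, Redis, and Cassandra; a Python set stands in for the HLL filter.
DEDUP_WINDOW = 600  # seconds within which a repeat view by the same user is skipped

def nazar(event, seen, now):
    """Tag the event with a Boolean 'countable' flag, as Nazar does."""
    key = (event["post_id"], event["user_id"])
    countable = now - seen.get(key, float("-inf")) > DEDUP_WINDOW
    seen[key] = now
    return {**event, "countable": countable}

def abacus(event, redis_hll, cassandra_hll):
    """Count a flagged view; restore an evicted counter from Cassandra first."""
    if not event["countable"]:
        return
    post = event["post_id"]
    if post not in redis_hll:
        # Counter was evicted from Redis: fetch the persisted filter (SET).
        redis_hll[post] = set(cassandra_hll.get(post, ()))
    redis_hll[post].add(event["user_id"])  # the set's add() stands in for PFADD

seen, redis_hll, cassandra_hll = {}, {}, {}
events = [
    {"post_id": "t3_demo", "user_id": "alice"},
    {"post_id": "t3_demo", "user_id": "alice"},  # repeat inside the window
    {"post_id": "t3_demo", "user_id": "bob"},
]
for t, event in enumerate(events):
    abacus(nazar(event, seen, now=t), redis_hll, cassandra_hll)
print(len(redis_hll["t3_demo"]), "unique viewers counted")  # prints: 2 ...
```

The split mirrors the design choice in the post: the policy decision (should this view count?) is made once by Nazar and travels with the event, so Abacus stays a simple, stateless-per-event counter.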
+如果你有兴趣解决这样的问题,[请查看我们的职位页面][12]。 ---------------------- @@ -66,7 +64,7 @@ If you’re interested in solving problems like this at scale, [check out our Ca via: https://redditblog.com/2017/05/24/view-counting-at-reddit/ 作者:[Krishnan Chandra ][a] -译者:[译者ID](https://github.com/译者ID) +译者:[geekpi](https://github.com/geekpi) 校对:[校对者ID](https://github.com/校对者ID) 本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出 diff --git a/translated/tech/20170524 View Counting at Reddit.md b/translated/tech/20170524 View Counting at Reddit.md new file mode 100644 index 0000000000..8807345d63 --- /dev/null +++ b/translated/tech/20170524 View Counting at Reddit.md @@ -0,0 +1,91 @@ +Reddit 的浏览计数 +====================== + +![](https://redditupvoted.files.wordpress.com/2017/05/content-view-header.png?crop=0px%2C9px%2C1000px%2C483px&resize=1200%2C580) + + +我们希望更好地将 Reddit 的规模传达给我们的用户。到目前为止,投票得分和评论数量是特定文章活动的主要指标。然而,Reddit 有许多访问者在没有投票或评论的情况下阅读内容。我们希望建立一个能够捕捉到帖子阅读数量的系统。然后将数量展示给内容创建者和版主,以便他们更好地了解特定帖子上的活动。 + +![](https://redditupvoted.files.wordpress.com/2017/05/cvcs-neeson-fix.png?w=372&h=743&zoom=2) + +在这篇文章中,我们将讨论我们如何伸缩地实现计数。 + +### 计数方法 + +对浏览计数有四个主要要求: + +* 计数必须是实时的或接近实时的。不是每天或每小时的总量。 +* 每个用户在短时间内只能计数一次。 +* 显示的数量与实际的误差在百分之几。 +* 系统必须能够在生产环境运行,并在事件发生后几秒内处理事件。 + +满足这四项要求比听起来要复杂得多。为了实时保持准确的计数,我们需要知道某个特定的用户是否曾经访问过这个帖子。要知道这些信息,我们需要存储先前访问过每个帖子的用户组,然后在每次处理新帖子时查看该组。这个解决方案的一个原始实现是将唯一的用户集作为散列表存储在内存中,并且以发布 ID 作为关键字。 + +这种方法适用于较少浏览量的文章,但一旦文章流行,阅读人数迅速增加,这种方法很难扩展。几个热门的帖子有超过一百万的唯一读者!对于这种帖子,对于内存和 CPU 来说影响都很大,因为要存储所有的 ID,并频繁地查找集合,看看是否有人已经访问过。 + +由于我们不能提供精确的计数, 我们研究了几个不同的[基数估计][1] 算法。我们考虑了两个非常符合我们期望的选择: + +1. 线性概率计数方法,非常准确,但要计数的集合越大,则线性地需要更多的内存。 +2. 基于 [HyperLogLog][2](HLL)的计数方法。HLL 随集合大小次线性增长,但不能提供与线性计数器相同的准确度。 + +要了解 HLL 真正节省的空间大小,考虑到右侧图片包括顶部的文章。它有超过 100 万的唯一用户。如果我们存储 100 万个唯一用户 ID,并且每个用户 ID 是 8 个字节长,那么我们需要 8 兆内存来计算单个帖子的唯一用户数!相比之下,使用 HLL 进行计数会占用更少的内存。每个实现的内存量是不一样的,但是对于[这个实现][3],我们可以使用仅仅 12 千字节的空间计算超过一百万个 ID,这将是原始空间使用量的 0.15%! 
+
+([This article on High Scalability][5] has a good overview of both of the above algorithms.)
+
+Many HLL implementations use a combination of the above two approaches, by starting with linear counting for small sets and switching over to HLL once the size reaches a certain point. The former is frequently referred to as a “sparse” HLL representation, while the latter is referred to as a “dense” HLL representation. The hybrid approach is very advantageous, because it can provide accurate results for both small sets and large sets while retaining a modest memory footprint. This approach is described in more detail in [Google’s HyperLogLog++ paper][6].
+
+While the HLL algorithm is fairly standard, there were three variants we considered using in our implementation. Note that for in-memory HLL implementations, we only looked at Java and Scala implementations as we primarily use Java and Scala on the data engineering team.
+
+1. Twitter’s Algebird library, implemented in Scala. Algebird has good usage docs, but the implementation details of the sparse and dense HLL representations were not easily understandable.
+2. An implementation of HyperLogLog++ located in stream-lib, implemented in Java. The code in stream-lib is very well-documented, but it was somewhat difficult to understand how to use the library properly and tune it to our needs.
+3. Redis’s HLL implementation (which we chose). We felt that the Redis implementation of HLLs was well-documented and easily configurable, and the HLL-related APIs provided were easy to integrate. As an added benefit, using Redis alleviated many of our performance concerns by taking the CPU- and memory-intensive portion of the counting application (HLL computations) out and moving that onto a dedicated server.
+
+![](https://redditupvoted.files.wordpress.com/2017/05/view-counts-architecture.png)
+
+Reddit’s data pipeline is primarily oriented around [Apache Kafka][7]. When a user views a post, an event gets fired and sent to an event collector server, which batches the events and persists them into Kafka.
+
+From here, the view-counting system has two components which operate sequentially. The first part of our counting architecture is a Kafka consumer called [Nazar][8], which reads each event from Kafka and passes it through a set of rules we’ve concocted to determine whether or not an event should be counted. We gave it this name because just as a nazar is an eye-shaped amulet protecting you from evil, the Nazar system is an “eye” that protects us from bad actors trying to game the system. Nazar uses Redis to maintain state and keep track of potential reasons why a view should not be counted. One reason we may not count an event is if it’s the result of repeat views from the same user over a short period of time. Nazar then alters the event, adding a Boolean flag indicating whether or not it should be counted, before sending the event back to Kafka.
+
+This is where the second part of the project picks up. We have a second Kafka consumer called [Abacus][9], which does the actual counting of views and makes the counts available for the site or clients to display. Abacus reads the events from Kafka that were output by Nazar; then, depending on Nazar’s determination, it either counts or skips over the view. If the event is marked for counting, then Abacus first checks if there is an HLL counter already existing in Redis for the post corresponding to the event. If the counter is already in Redis, then Abacus makes a [PFADD][10] request to Redis for that post. If the counter is not already in Redis, then Abacus makes a request to a Cassandra cluster, which we use to persist both the HLL counters and the raw count numbers, and makes a [SET][11] request into Redis to add the filter. This usually happens when people view older posts whose counters have been evicted from Redis.
+
+In order to allow for maintaining counts on older posts that might have been evicted from Redis, Abacus periodically writes out both the full HLL filter from Redis along with the count for each post to a Cassandra cluster. Writes to Cassandra are batched in 10-second groups per post in order to avoid overloading the cluster. Below is a diagram outlining this event flow at a high level.
+
+![](https://redditupvoted.files.wordpress.com/2017/05/abacus-event-flow.png)
+
+### Conclusion
+
+We hope that view counts will better enable content creators to understand the full reach of their posts, and help moderators to quickly identify which posts are receiving large amounts of traffic in their communities. In the future, we plan to leverage the real-time potential of our data pipeline to deliver more useful feedback to redditors.
+
+If you’re interested in solving problems like this at scale, [check out our Careers page][12].
+
+----------------------
+
+via: https://redditblog.com/2017/05/24/view-counting-at-reddit/
+
+Author: [Krishnan
Chandra][a]
+Translator: [geekpi](https://github.com/geekpi)
+Proofreader: [校对者ID](https://github.com/校对者ID)
+
+This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](https://linux.cn/)
+
+[a]:https://redditblog.com/topic/technology/
+[1]:https://en.wikipedia.org/wiki/Count-distinct_problem
+[2]:http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
+[3]:http://antirez.com/news/75
+[5]:http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html
+[6]:https://stefanheule.com/papers/edbt13-hyperloglog.pdf
+[7]:https://kafka.apache.org/
+[8]:https://en.wikipedia.org/wiki/Nazar_(amulet)
+[9]:https://en.wikipedia.org/wiki/Abacus
+[10]:https://redis.io/commands/pfadd
+[11]:https://redis.io/commands/set
+[12]:https://about.reddit.com/careers/