Merge pull request #23521 from unigeorge/unigeorge-branch

finished
Xingyu.Wang 2021-10-13 00:45:34 +08:00 committed by GitHub
commit ef1735aeb6
2 changed files with 82 additions and 87 deletions

@@ -1,87 +0,0 @@
[#]: collector: (lujun9972)
[#]: translator: (unigeorge)
[#]: reviewer: ( )
[#]: publisher: ( )
[#]: url: ( )
[#]: subject: (How to process real-time data with Apache)
[#]: via: (https://opensource.com/article/20/2/real-time-data-processing)
[#]: author: (Simon Crosby https://opensource.com/users/simon-crosby)
How to process real-time data with Apache
======
Open source is leading the way with a rich canvas of projects for
processing real-time events.
![Alarm clocks with different time][1]
In the "always-on" future with billions of connected devices, storing raw data for analysis later will not be an option because users want accurate responses in real time. Predicting failures and other context-sensitive conditions requires data to be processed in real time, certainly before it hits a database.
It's tempting to simply say "the cloud will scale" to meet demands to process streaming data in real time, but some simple examples show that it can never meet the need for real-time responsiveness to boundless data streams. In these situations—from mobile devices to IoT—a new paradigm is needed. Whereas cloud computing relies on a "store then analyze" big data approach, there is a critical need for software frameworks that are comfortable instantly processing endless, noisy, and voluminous streams of data as they arrive to permit a real-time response, prediction, or insight.
For example, the city of Palo Alto, Calif. produces more streaming data from its traffic infrastructure per day than the Twitter Firehose. That's a lot of data. Predicting city traffic for consumers like Uber, Lyft, and FedEx requires real-time analysis, learning, and prediction. Event processing in the cloud leads to an inescapable latency of about half a second per event.
We need a simple yet powerful programming paradigm that lets applications process boundless data streams on the fly in these and similar situations:
* Data volumes are huge, or moving raw data is expensive.
* Data is generated by widely distributed assets (such as mobile devices).
* Data is of ephemeral value, and analysis can't wait.
* It is critical to always have the latest insight, and extrapolation won't do.
### Publish and subscribe
A key architectural pattern in the domain of event-driven systems is the concept of pub/sub or publish/subscribe messaging. This is an asynchronous communication method in which messages are delivered from _publishers_ (anything producing data) to _subscribers_ (applications that process data). Pub/sub decouples arbitrary numbers of senders from an unknown set of consumers.
In pub/sub, sources _publish_ events for a _topic_ to a _broker_ that stores them in the order in which they are received. An application _subscribes_ to one or more _topics_, and the _broker_ forwards matching events. Apache Kafka and Pulsar and CNCF NATS are pub/sub systems. Cloud services for pub/sub include Google Pub/Sub, AWS Kinesis, Azure Service Bus, Confluent Cloud, and others.
Pub/sub systems do not _run_ subscriber applications—they simply _deliver_ data to topic subscribers.
Streaming data often contains events that are updates to the state of applications or infrastructure. When choosing an architecture to process data, the role of a data-distribution system such as a pub/sub framework is limited. The "how" of the consumer application lies beyond the scope of the pub/sub system. This leaves an enormous amount of complexity for the developer to manage. So-called stream processors are a special kind of subscriber that analyzes data on the fly and delivers results back to the same broker.
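As a rough illustration of the pattern above, here is a minimal in-memory sketch in Python. It is not the API of Kafka, Pulsar, NATS, or any cloud service: the `Broker` class, the topic names, and the `speed_monitor` subscriber are all hypothetical, and delivery is synchronous for simplicity, whereas real brokers deliver asynchronously over the network.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List

Handler = Callable[[Any], None]


class Broker:
    """Toy broker (hypothetical): stores events per topic in arrival order and forwards them."""

    def __init__(self) -> None:
        self.log: DefaultDict[str, List[Any]] = defaultdict(list)
        self.subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self.subscribers[topic].append(handler)

    def publish(self, topic: str, event: Any) -> None:
        self.log[topic].append(event)  # keep events in the order received
        for handler in list(self.subscribers[topic]):
            handler(event)  # the broker only delivers; it runs no application logic of its own


broker = Broker()
alerts: List[dict] = []


def speed_monitor(event: dict) -> None:
    """Stream-processor-style subscriber: analyzes each event, publishes a result to the same broker."""
    if event["speed_kmh"] < 10:
        broker.publish("traffic.alerts", {"sensor": event["sensor"], "status": "congested"})


broker.subscribe("traffic.raw", speed_monitor)
broker.subscribe("traffic.alerts", alerts.append)

broker.publish("traffic.raw", {"sensor": "A1", "speed_kmh": 42})
broker.publish("traffic.raw", {"sensor": "B7", "speed_kmh": 6})
```

The publisher of `traffic.raw` never learns who consumes its events, and the decision about what counts as congestion lives entirely in the subscriber, which is the complexity the broker leaves to the developer.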
### Apache Spark
[Apache Spark][2] is a unified analytics engine for large-scale data processing. Often, Apache Spark Streaming is used as a stream processor, for example, to feed machine learning models with new data. Spark Streaming breaks data into mini-batches that are each independently analyzed by a Spark model or some other system. The stream of events is grouped into mini-batches for analysis, but the stream processor itself must be elastic:
* The stream processor must be capable of scaling with the data rate, even across servers and clouds, and also balance load across instances, ensuring resilience and other application-layer needs.
* It must be able to analyze data from sources that report at widely different rates, meaning it must be stateful—or store state in a database. This latter approach is often used when Spark Streaming is used as the stream processor and can cause performance problems when ultra-low latency responses are needed.
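The two requirements above can be sketched in plain Python. This is not Spark Streaming's API. It assumes a hypothetical `(source, reading)` event shape and fixed-size batches (Spark Streaming batches by time interval), and it keeps per-source state in memory so that sources reporting at different rates can still be analyzed across batches.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Event = Tuple[str, float]  # (source id, reading); a hypothetical event shape


def mini_batches(events: Iterable[Event], size: int) -> Iterator[List[Event]]:
    """Group an event stream into fixed-size mini-batches (the last one may be shorter)."""
    batch: List[Event] = []
    for event in events:
        batch.append(event)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def process(events: Iterable[Event], size: int) -> Dict[str, Tuple[int, float]]:
    """Keep a running (count, mean) per source that survives from one batch to the next."""
    state: Dict[str, Tuple[int, float]] = {}
    for batch in mini_batches(events, size):
        for source, value in batch:
            count, mean = state.get(source, (0, 0.0))
            count += 1
            mean += (value - mean) / count  # incremental mean update
            state[source] = (count, mean)
    return state


# "fast" reports more often than "slow"; the state carries across batch boundaries.
stream = [("fast", 1.0), ("fast", 3.0), ("slow", 10.0), ("fast", 5.0), ("slow", 20.0)]
state = process(stream, size=2)
```

If `state` lived in an external database instead of a local dictionary, every event would cost a round trip, which is the latency concern raised in the second bullet.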
A related project, [Apache Samza][3], offers a way to process real-time event streams, and to scale elastically using [Hadoop YARN][4] or [Apache Mesos][5] to manage compute resources.
### Solving the problem of scaling data
It's important to note that even Samza cannot entirely alleviate data processing demands for the application developer. Scaling data rates mean that tasks to process events need to be load-balanced across many instances, and the only way to share the resulting application-layer state between instances is to use a database. However, the moment state coordination between tasks of an application devolves to a database, there is an inevitable knock-on effect upon performance. Moreover, the choice of database is crucial. As the system scales, cluster management for the database becomes the next potential bottleneck.
This can be solved with alternative solutions that are stateful, elastic, and can be used in place of a stream processor. At the application level (within each container or instance), these solutions build a stateful model of concurrent, interlinked "web agents" on the fly from streaming updates. Agents are concurrent "nano-services" that consume raw data for a single source and maintain their state. Agents interlink to share state based on real-world relationships between sources found in the data, such as containment and proximity. Agents thus form a graph of concurrent services that can analyze their own state and the states of agents to which they are linked. Each agent provides a nano-service for a single data source that converts from raw data to state and analyzes, learns, and predicts from its own changes and those of its linked subgraph.
These solutions simplify application architecture by allowing agents—digital twins of real-world sources—to be widely distributed, even while maintaining the distributed graph that interlinks them at the application layer. This is because the links are URLs that map to the current runtime execution instance of the solution and the agent itself. In this way, the application seamlessly scales across instances without DevOps concerns. Agents consume data and maintain state. They also compute over their own state and that of other agents. Because agents are stateful, there is no need for a database, and insights are computed at memory speed.
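The agent model described above can be sketched in a few lines of Python. This is only an in-process illustration under stated assumptions, not the API of SwimOS or any other framework: the `Agent` class, the `on_event` helper, and the intersection names are hypothetical, and the links are plain object references rather than URLs that resolve across instances.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)  # compare agents by identity, since linked agents form cycles
class Agent:
    """Hypothetical stateful agent for a single data source."""
    name: str
    state: Optional[float] = None
    links: List["Agent"] = field(default_factory=list)

    def link(self, other: "Agent") -> None:
        """Link two agents in both directions, e.g. because their sources are adjacent."""
        if other not in self.links:
            self.links.append(other)
            other.links.append(self)

    def update(self, reading: float) -> None:
        """Convert a raw reading into the agent's current state."""
        self.state = reading

    def neighborhood_mean(self) -> Optional[float]:
        """Analyze the agent's own state together with the states of its linked agents."""
        values = [a.state for a in [self, *self.links] if a.state is not None]
        return sum(values) / len(values) if values else None


agents: Dict[str, Agent] = {}


def on_event(source: str, reading: float) -> Agent:
    """Create the agent the first time a source is seen, then update its in-memory state."""
    agent = agents.setdefault(source, Agent(source))
    agent.update(reading)
    return agent


on_event("intersection/1", 30.0)
on_event("intersection/2", 50.0)
on_event("intersection/3", 10.0)
agents["intersection/1"].link(agents["intersection/2"])  # proximity relationship found in the data
mean_1 = agents["intersection/1"].neighborhood_mean()
```

Each agent reads its neighbors' state directly from memory, so no database sits between an incoming event and the analysis that uses it.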
### Reading world data with open source
There is a sea change afoot in the way we view data: Instead of the database being the system of record, the real world is, and digital twins of real-world things can continuously stream their state. Fortunately, the open source community is leading the way with a rich canvas of projects for processing real-time events. From pub/sub, where the most active communities are Apache Kafka, Pulsar, and CNCF NATS, to the analytical frameworks that continually process streamed data, including Apache Spark, [Flink][6], [Beam][7], Samza, and Apache-licensed [SwimOS][8] and [Hazelcast][9], developers have the widest choices of software systems. Specifically, there is no richer set of proprietary software frameworks available. Developers have spoken, and the future of software is open source.
--------------------------------------------------------------------------------
via: https://opensource.com/article/20/2/real-time-data-processing
Author: [Simon Crosby][a]
Topic selection: [lujun9972][b]
Translator: [unigeorge](https://github.com/unigeorge)
Proofreader: [proofreader ID](https://github.com/校对者ID)
This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux China](https://linux.cn/)
[a]: https://opensource.com/users/simon-crosby
[b]: https://github.com/lujun9972
[1]: https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/clocks_time.png?itok=_ID09GDk (Alarm clocks with different time)
[2]: https://spark.apache.org/
[3]: https://samza.apache.org/
[4]: https://hadoop.apache.org/
[5]: http://mesos.apache.org/
[6]: https://flink.apache.org/
[7]: https://beam.apache.org
[8]: https://github.com/swimos/swim
[9]: https://hazelcast.com/

@@ -0,0 +1,82 @@
[#]: collector: (lujun9972)
[#]: translator: (unigeorge)
[#]: reviewer: ( )
[#]: publisher: ( )
[#]: url: ( )
[#]: subject: (How to process real-time data with Apache)
[#]: via: (https://opensource.com/article/20/2/real-time-data-processing)
[#]: author: (Simon Crosby https://opensource.com/users/simon-crosby)
How to process real-time data with Apache software
======
The open source community leads the way with a rich range of projects for processing real-time events.
![Alarm clocks with different time][1]
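In the "always-on" future, the number of connected devices may reach the billions. Storing raw data and analyzing it later will no longer meet the need, because users want accurate responses in real time. Predicting failures and other context-sensitive conditions also requires processing data in real time; waiting until the data has reached a database is certainly too late.
Some may say that "the cloud will scale" to meet the demand for processing streaming data in real time, but a few simple examples show that it can never meet the need for real-time responses to boundless data streams. From mobile devices to IoT, a new paradigm is needed. Whereas cloud computing relies on a "store then analyze" approach to big data, there is also an urgent need for software frameworks that can handle continuous, noisy, and voluminous data streams and process them the moment they arrive, to permit real-time response, prediction, and insight.
For example, the city of Palo Alto, California produces more streaming data from its traffic infrastructure each day than the Twitter Firehose. That is a lot of data. Predicting city traffic for consumers such as Uber, Lyft, and FedEx requires real-time analysis, learning, and prediction. Processing in the cloud inevitably adds a latency of about half a second per event.
We need a simple yet powerful programming paradigm that lets applications process boundless data streams on the fly in situations like the following: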
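* Data volumes are huge, or moving raw data is expensive.
* Data is generated by widely distributed assets (such as mobile devices).
* Data is of ephemeral value, and analysis cannot wait.
* It is critical to always have the latest insight, and extrapolation will not do.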
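### Publish and subscribe
A key architectural pattern in the domain of event-driven systems is publish/subscribe messaging. This is an asynchronous communication method in which messages are delivered from _publishers_ (whatever produces the data) to _subscribers_ (applications that process the data). The publish/subscribe pattern decouples message senders from consumers.
In the publish/subscribe pattern, sources _publish_ events for a _topic_ to a _broker_, which stores them in the order in which they are received. An application can _subscribe_ to one or more _topics_, and the _broker_ then forwards the matching events. Apache Kafka and Pulsar and CNCF NATS are publish/subscribe systems. Cloud services for publish/subscribe include Google Pub/Sub, AWS Kinesis, Azure Service Bus, Confluent Cloud, and others. (LCTT translator's note: some terms in this paragraph are more widely used in their English form, so the Chinese text keeps them in English and gives the Chinese equivalent only in parentheses.)
Publish/subscribe systems do not _run_ subscriber applications; they simply _deliver_ data to the subscribers of the relevant topic.
Streaming data often contains events that are updates to the state of applications or infrastructure. When choosing an architecture to process data, the role of a data-distribution system such as a publish/subscribe framework is limited. The "how" of the consumer application lies beyond the scope of the publish/subscribe system. This leaves an enormous amount of complexity for the developer to manage. So-called stream processors are a special kind of subscriber that analyzes data on the fly and delivers results back to the same broker.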
### Apache Spark
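[Apache Spark][2] is a unified analytics engine for large-scale data processing. Apache Spark Streaming is often used as a stream processor, for example to feed machine learning models with new data. Spark Streaming splits data into mini-batches, each of which is analyzed independently by a Spark model or some other system. The event stream can be grouped into mini-batches for analysis, but the stream processor itself must be elastic:
* The stream processor must be able to scale with the data rate, even across servers and clouds, and also balance load across instances, ensuring resilience and other application-layer needs.
* It must be able to analyze data from different sources whose reporting rates may differ widely. This means it must be stateful, or store state in a database. The latter approach is the usual one when Spark Streaming is the stream processor, and it can cause performance problems when ultra-low-latency responses are needed.
A related project, [Apache Samza][3], also offers a way to process real-time event streams, and uses [Hadoop YARN][4] or [Apache Mesos][5] to manage compute resources so that it can scale elastically.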
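### Solving the problem of scaling data
Note that even Samza cannot fully relieve the application developer of data processing demands. Growing data rates mean that the tasks that process events must be load-balanced across many instances, and a database is the only way to share the resulting application-layer state between instances. However, the moment state coordination between the tasks of an application moves to a database, there is an inevitable knock-on effect on performance. Moreover, the choice of database is crucial. As the system scales, cluster management for the database becomes the next potential bottleneck.
This problem can be solved with alternative solutions that are stateful and elastic, and that can be used in place of a stream processor. At the application level (within each container or instance), these solutions build a stateful model of concurrent, interlinked "web agents" on the fly from streaming updates. Agents are concurrent "nano-services" that consume raw data from a single source and maintain their state. Agents interlink to share state based on real-world relationships between sources found in the data, such as containment and proximity. Agents thus form a graph of concurrent services that can analyze their own state and the states of the agents to which they are linked. Each agent provides a nano-service for a single data source that converts raw data into state and analyzes, learns, and predicts from its own changes and those of its linked subgraph.
These solutions simplify application architecture by allowing large numbers of agents (digital twins of real-world sources) to be widely distributed while still maintaining, at the application layer, the distributed graph that interlinks them. This is because the links between agents are URLs that map to the current runtime execution instance of the solution and to the agent itself. In this way, the application scales seamlessly across instances without DevOps concerns. Agents consume data and maintain state. They also compute over their own state and that of other agents. Because agents are stateful, there is no need for a database, and insights are computed at memory speed.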
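### Reading world data with open source
The way we view data is undergoing a sea change: instead of the database being the system of record, the real world is, and digital twins of real-world things can continuously stream their state. Fortunately, the open source community leads the way with a rich range of projects for processing real-time events. From publish/subscribe (where the most active communities are Apache Kafka, Pulsar, and CNCF NATS) to the analytical frameworks that continually process streamed data, including Apache Spark, [Flink][6], [Beam][7], Samza, and the Apache-licensed [SwimOS][8] and [Hazelcast][9], developers have a very wide choice of projects. Put plainly, no set of proprietary software frameworks is richer. Developers have spoken, and the future of software is open source.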
--------------------------------------------------------------------------------
via: https://opensource.com/article/20/2/real-time-data-processing
Author: [Simon Crosby][a]
Topic selection: [lujun9972][b]
Translator: [unigeorge](https://github.com/unigeorge)
Proofreader: [proofreader ID](https://github.com/校对者ID)
This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux China](https://linux.cn/)
[a]: https://opensource.com/users/simon-crosby
[b]: https://github.com/lujun9972
[1]: https://opensource.com/sites/default/files/styles/image-full-size/public/lead-images/clocks_time.png?itok=_ID09GDk (Alarm clocks with different time)
[2]: https://spark.apache.org/
[3]: https://samza.apache.org/
[4]: https://hadoop.apache.org/
[5]: http://mesos.apache.org/
[6]: https://flink.apache.org/
[7]: https://beam.apache.org
[8]: https://github.com/swimos/swim
[9]: https://hazelcast.com/