Translation complete

cool-summer-021 2023-04-17 16:31:14 +08:00 committed by GitHub
parent 295256b343
commit 906a0145ea
GPG Key ID: 4AEE18F83AFDEB23
2 changed files with 135 additions and 135 deletions


@ -1,135 +0,0 @@
[#]: subject: "Synchronize databases more easily with open source tools"
[#]: via: "https://opensource.com/article/23/3/synchronize-databases-apache-seatunnel"
[#]: author: "Li Zongwen https://opensource.com/users/li-zongwen"
[#]: collector: "lkxed"
[#]: translator: "cool-summer-021"
[#]: reviewer: " "
[#]: publisher: " "
[#]: url: " "
Synchronize databases more easily with open source tools
======
Change Data Capture (CDC) uses Server Agents to record insert, update, and delete activity applied to database tables. CDC provides details on changes in an easy-to-use relational format. For modified rows, it captures the column information and metadata needed to apply the changes to the target environment. A change table that mirrors the column structure of the tracked source table stores this information.
Capturing change data is no easy feat. However, the open source [Apache SeaTunnel project][1] is a data integration platform that provides CDC functionality, with a design philosophy and feature set that make these captures possible and that go beyond existing solutions.
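To make the idea of a change table concrete, here is a minimal Python sketch. It assumes a hypothetical in-memory `ChangeTable` class and hypothetical metadata columns `__op` and `__seq`; none of these names come from Apache SeaTunnel or any particular CDC product. It only shows how a change row can mirror the source columns and add the metadata needed to replay changes in order.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ChangeTable:
    """Hypothetical change table: source columns plus CDC metadata columns."""

    source_columns: tuple[str, ...]
    rows: list[dict[str, Any]] = field(default_factory=list)
    _next_seq: int = 1

    def record(self, op: str, row: dict[str, Any]) -> dict[str, Any]:
        if op not in ("insert", "update", "delete"):
            raise ValueError(f"unknown operation: {op}")
        # Mirror the tracked table's columns; columns missing from the row become None.
        change = {col: row.get(col) for col in self.source_columns}
        # Metadata (assumed names) that lets a target replay changes in order.
        change["__op"] = op
        change["__seq"] = self._next_seq
        self._next_seq += 1
        self.rows.append(change)
        return change


changes = ChangeTable(source_columns=("id", "name"))
changes.record("insert", {"id": 1, "name": "Ada"})
changes.record("update", {"id": 1, "name": "Ada L."})
changes.record("delete", {"id": 1})
```

A real CDC implementation would read these changes from the database log rather than from explicit `record` calls; the sketch only illustrates the shape of the stored data.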
### CDC usage scenarios
A classic use case for CDC is data synchronization or backup between heterogeneous databases. In one scenario, you might synchronize data between [MySQL][2], PostgreSQL, MariaDB, and similar databases. In another, you could synchronize the data to a full-text search engine. With CDC, you can also create backups of data based on what CDC has captured.
When designed well, the data analysis system obtains data for processing by subscribing to changes in the target data tables. There's no need to embed the analysis process into the existing system.
### Sharing data state between microservices
[Microservices][3] are popular, but sharing information between them is often complicated. CDC is a possible solution. Microservices can use CDC to obtain changes in other microservice databases, acquire data status updates, and execute the corresponding logic.
### Update cache
The concept of [Command Query Responsibility Segregation (CQRS)][4] is the separation of command activity from query activity. The two are fundamentally different:
- A command writes data to a data source.
- A query reads data from a data source.
The problem is, when does a read event happen in relation to when a write event happened, and what bears the burden of making those events occur?
It can be difficult to update a cache. You can use CDC to obtain data update events from a database and let that control the refresh or invalidation of the cache.
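As a rough illustration of CDC-driven cache maintenance, the following Python sketch applies one change event to a dictionary used as a cache. The event shape (`{"op": ..., "row": {...}}`), the function name `apply_to_cache`, and the `refresh` flag are all assumptions made for this example, not the format of any specific CDC tool.

```python
from typing import Any


def apply_to_cache(
    cache: dict[Any, dict[str, Any]],
    event: dict[str, Any],
    key_column: str = "id",
    refresh: bool = False,
) -> None:
    """Keep a cache in step with one change event (hypothetical event shape)."""
    row = event["row"]
    key = row[key_column]
    if event["op"] == "delete" or not refresh:
        # Invalidate: the next read falls through to the database.
        cache.pop(key, None)
    else:
        # Refresh: overwrite the cached entry with the new row image.
        cache[key] = dict(row)


cache = {1: {"id": 1, "name": "Ada"}}
apply_to_cache(cache, {"op": "update", "row": {"id": 1, "name": "Ada L."}}, refresh=True)
apply_to_cache(cache, {"op": "delete", "row": {"id": 1}})
```

Invalidation is the simpler choice because the cache never holds data it did not read from the database itself; refreshing saves one read at the cost of trusting the event's row image.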
CQRS designs usually use two different storage instances, one for business queries and one for change operations. With two stores, data consistency has to be ensured somehow. You can use distributed transactions to guarantee strong consistency, at the cost of availability, performance, and scalability. Alternatively, you can use CDC to provide eventual consistency, which performs and scales better but introduces data latency. The industry can currently keep that latency in the millisecond range.
For example, you could use CDC to synchronize MySQL data to your full-text search engine, such as Elasticsearch. In this architecture, Elasticsearch serves all queries, but when you want to modify data, you don't directly change Elasticsearch. Instead, you modify the upstream MySQL data, which generates a data update event. The Elasticsearch side, which monitors the database, consumes this event, and the event prompts an update within Elasticsearch.
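The flow described above, where writes go to the upstream database and the search side only consumes change events, can be sketched in Python with a toy in-memory index standing in for the search engine. The class `TinySearchIndex`, the `text` column, and the event shape are hypothetical; a real deployment would call the search engine's own client instead.

```python
from collections import defaultdict
from typing import Any


class TinySearchIndex:
    """Toy inverted index standing in for a full-text search engine."""

    def __init__(self) -> None:
        self.docs: dict[Any, str] = {}
        self.terms: defaultdict[str, set] = defaultdict(set)

    def _remove(self, doc_id: Any) -> None:
        # Drop the old document and its postings, if it was indexed.
        old_text = self.docs.pop(doc_id, None)
        if old_text is not None:
            for term in old_text.lower().split():
                self.terms[term].discard(doc_id)

    def apply(self, event: dict[str, Any]) -> None:
        """Apply one upstream change event; the index is never written directly."""
        doc_id = event["row"]["id"]
        self._remove(doc_id)
        if event["op"] != "delete":
            text = event["row"]["text"]
            self.docs[doc_id] = text
            for term in text.lower().split():
                self.terms[term].add(doc_id)

    def search(self, term: str) -> set:
        return set(self.terms.get(term.lower(), set()))


index = TinySearchIndex()
index.apply({"op": "insert", "row": {"id": 1, "text": "open source tools"}})
index.apply({"op": "update", "row": {"id": 1, "text": "data tools"}})
```

Treating an update as "remove, then add" keeps the index consistent even when the event carries only the new row image.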
In some CQRS systems, a similar method can be used to update the query view.
### Pain points
CDC isn't a new concept and various existing projects implement it. For many users, though, there are some disadvantages to the existing solutions.
#### Single table configuration
With some CDC software, you must configure each table separately. For example, to synchronize ten tables, you need to write ten source SQL statements and ten sink SQL statements. To perform a transform, you also need to write the transform SQL.
Writing the configuration by hand is feasible only when the number of tables is small. When it is large, type mapping or parameter configuration errors may occur, resulting in high operation and maintenance costs.
Apache SeaTunnel is an easy-to-use data integration platform that aims to solve this problem.
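To illustrate why per-table configuration scales poorly, the Python sketch below expands a single table pattern into the per-table entries that would otherwise be written by hand. The function `expand_table_config` and the dictionary keys are hypothetical and do not reflect SeaTunnel's actual configuration format; the matching uses the standard library's `fnmatch` module.

```python
from fnmatch import fnmatchcase


def expand_table_config(pattern: str, available_tables: list[str]) -> list[dict[str, str]]:
    """Expand one pattern such as 'shop.*' into per-table sync entries."""
    return [
        # Each entry is what a user would otherwise have to write manually.
        {"source_table": table, "sink_table": table}
        for table in sorted(available_tables)
        if fnmatchcase(table, pattern)
    ]


entries = expand_table_config("shop.*", ["shop.orders", "shop.users", "hr.staff"])
```

With a pattern, adding an eleventh table to the database requires no new configuration, whereas the hand-written approach needs another set of source and sink statements.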
#### Schema evolution is not supported
Some CDC solutions can emit DDL events but cannot deliver them to the Sink so that it can make the corresponding changes. Even a CDC tool that can capture a DDL event may not be able to send it to the engine, because the engine cannot change the type information of the Transform based on the DDL event (so the Sink cannot follow the DDL event and change accordingly).
#### Too many links
On some CDC platforms, a separate link (connection) must be used for each table being synchronized. When there are many tables, a lot of links are required. This puts pressure on the source JDBC database and causes too many binlog readers, which may result in repeated log parsing.
### SeaTunnel CDC architecture goals
Apache SeaTunnel is an open source, high-performance, distributed framework for massive data integration. To tackle the problems that the CDC functions of existing data integration tools cannot solve, the community "reinvented the wheel" and developed a CDC platform with unique features. Its architectural design is based on the strengths and weaknesses of existing CDC tools.
Apache SeaTunnel supports:
- Lock-free parallel snapshots of historical data.
- Log heartbeat detection and dynamic table addition.
- Sub-database, sub-table, and multi-structure table reading.
- Schema evolution.
- All the basic CDC functions.
Apache SeaTunnel reduces operations and maintenance costs for users and can dynamically add tables.
For example, when you want to synchronize the entire database and add a new table later, you don't need to maintain it manually, change the job configuration, or stop and restart jobs.
Additionally, Apache SeaTunnel supports reading sub-databases, sub-tables, and multi-structure tables in parallel. It also allows schema evolution and DDL transmission, and it supports schema changes in the engine so that they can be propagated to Transform and Sink.
### SeaTunnel CDC current status
Currently, SeaTunnel CDC has the basic capabilities to support the incremental and snapshot phases. It also supports MySQL for real-time and offline use. The MySQL real-time test is complete, and the offline test is coming. Schema evolution is not supported yet because it involves changes to Transform and Sink. The dynamic discovery of new tables is not yet supported, and some interfaces have been reserved for multi-structure tables.
### Project outlook
As an Apache incubator project, Apache SeaTunnel has a rapidly developing community. The next community planning session has these main directions:
#### 1. Expand and improve connector and catalog ecology
We're working to enhance many connector and catalog features, including:
- Supporting more connectors, including TiDB, Doris, and Stripe.
- Improving existing connectors in terms of usability and performance.
- Supporting CDC connectors for real-time, incremental synchronization scenarios.
Anyone interested in connectors can review [Umbrella][5].
#### 2. Support for more data integration scenarios (SeaTunnel Engine)
There are pain points that existing engines cannot solve, such as the synchronization of an entire database, the synchronization of table structure changes, and the large granularity of task failure.
We're working to solve those issues. Anyone interested in the CDC engine should look at [issue 2272][6].
#### 3. Easier to use (SeaTunnel Web)
We're working to provide a web interface to make operations easier and more intuitive. Through the web interface, it will be possible to display Catalog, Connector, Job, and related information in the form of DAG/SQL. We're also giving users access to the scheduling platform so that they can manage tasks easily.
Visit the [web sub-project][7] for more information on the web UI.
### Wrap up
Database activity often must be carefully tracked to manage changes based on activities such as record updates, deletions, or insertions. Change Data Capture provides this capability. Apache SeaTunnel is an open source solution that addresses these needs and continues to evolve to offer more features. The project and community are active and your participation is welcome.
--------------------------------------------------------------------------------
via: https://opensource.com/article/23/3/synchronize-databases-apache-seatunnel
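Author: [Li Zongwen][a]
Topic selection: [lkxed][b]
Translator: [Translator ID](https://github.com/译者ID)
Proofreader: [Proofreader ID](https://github.com/校对者ID)
This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux China](https://linux.cn/)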
[a]: https://opensource.com/users/li-zongwen
[b]: https://github.com/lkxed/
[1]: https://seatunnel.apache.org/
[2]: https://opensource.com/downloads/mariadb-mysql-cheat-sheet
[3]: https://www.redhat.com/en/topics/microservices?intcmp=7013a000002qLH8AAM
[4]: https://www.redhat.com/architect/illustrated-cqrs
[5]: https://github.com/apache/incubator-seatunnel/issues/1946
[6]: https://github.com/apache/incubator-seatunnel/issues/2272
[7]: https://github.com/apache/incubator-seatunnel-web


@ -0,0 +1,135 @@
[#]: subject: "Synchronize databases more easily with open source tools"
[#]: via: "https://opensource.com/article/23/3/synchronize-databases-apache-seatunnel"
[#]: author: "Li Zongwen https://opensource.com/users/li-zongwen"
[#]: collector: "lkxed"
[#]: translator: "cool-summer-021"
[#]: reviewer: " "
[#]: publisher: " "
[#]: url: " "
Synchronize databases more easily with open source tools
======
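CDC uses server-side agents to record insert, update, and delete operations applied to data tables. It provides details about data changes in an easy-to-use relational format. For modified rows, it captures the column information and metadata needed to apply the changes to the target environment. This information is stored in a change table whose column structure mirrors that of the tracked source table.
Capturing changed data is no easy task. However, there is an open source project, [Apache SeaTunnel][1], a data integration platform that provides CDC functionality with a design philosophy and feature set that make capturing data changes possible, including features that go beyond existing solutions.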
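### Usage scenarios
A classic application of CDC is data synchronization or backup between different databases. You can synchronize data between [MySQL][2], PostgreSQL, MariaDB, and similar databases. You can also synchronize data to a full-text search engine. With CDC, you can also create backups based on the data that CDC has captured.
When designed well, the data analysis system obtains the data it needs to process by subscribing to changes in the target data tables. There is no need to embed the analysis process in the existing system.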
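### Sharing data state between microservices
[Microservices][3] are popular now, but sharing information between them is often complicated. CDC is one possible solution. Microservices can use CDC to obtain changes in other microservices' databases, acquire data status updates, and execute the corresponding logic.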
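### Updating the cache
[Command Query Responsibility Segregation (CQRS)][4] is the idea of separating command operations from query operations. The two are fundamentally different:
- A command writes data to a data source.
- A query reads data from a data source.
The question is when a read event happens relative to when a write event happened, and who is responsible for making those events occur.
Updating a cache is difficult. You can use CDC to obtain data update events from the database and let them control the refresh or invalidation of the cache.
CQRS designs usually use two different storage instances to support business queries and update operations. Because two stores are used, data consistency has to be ensured. You can use distributed transactions to guarantee strong consistency, at the cost of availability, performance, and scalability. You can also use CDC to ensure eventual consistency, which offers better performance and scalability at the cost of data latency. The industry can currently keep that latency within the millisecond range.
For example, you can use CDC to synchronize data from MySQL to your full-text search engine, such as Elasticsearch. In this architecture, Elasticsearch serves all queries, but when you need to modify data, you do not operate on Elasticsearch directly. Instead, you modify the upstream MySQL data, which generates an update event. The Elasticsearch side, which monitors the database, consumes this event, and the event prompts an update within Elasticsearch.
In some CQRS systems, a similar method can be used to update the query view.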
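### Pain points
CDC is not a new concept, and many existing projects already implement it. For many users, though, the existing solutions have some disadvantages.
#### Single-table configuration
With some CDC software, you must configure each table separately. For example, to synchronize ten tables, you need to write ten source SQL statements and ten sink SQL statements. To perform a transform, you also need to write the transform SQL.
Writing the configuration by hand is feasible only when the number of tables is small. When it is large, type mapping or parameter configuration errors may occur, resulting in high operation and maintenance costs.
SeaTunnel is an easy-to-use data integration platform that aims to solve this problem.
#### Schema evolution is not supported
Some CDC solutions can emit DDL events but cannot deliver them to the Sink so that it can make the corresponding changes. Even a CDC tool that can capture a DDL event may not be able to send it to the engine, because the type information of the Transform cannot be changed based on the DDL event (so the Sink cannot follow the DDL event and change accordingly).
#### Too many links
On some CDC platforms, a separate link (connection) must be used for each table being synchronized. When there are many tables, a lot of links are required. This puts pressure on the source JDBC database, causes too many binlog readers, and may result in repeated log parsing.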
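### SeaTunnel CDC architecture goals
Apache SeaTunnel is an open source, high-performance, distributed framework for massive data integration. To tackle the problems that existing data integration tools cannot solve, the developer community "reinvented the wheel" and developed a CDC platform with unique features. Its architectural design is based on the strengths and weaknesses of existing tools.
Apache SeaTunnel supports:
- Lock-free parallel snapshots of historical data.
- Log heartbeat detection and dynamic table addition.
- Sub-database, sub-table, and multi-structure table reading.
- Schema evolution.
- All the basic CDC functions.
Apache SeaTunnel reduces operations and maintenance costs for users and can dynamically add tables.
For example, when you want to synchronize an entire database and add a new table later, you don't need to maintain it manually, change the job configuration, or stop and restart jobs.
Additionally, Apache SeaTunnel supports reading sub-databases, sub-tables, and multi-structure tables in parallel. It also allows schema evolution and DDL transmission, and it supports schema changes in the engine so that they can be propagated to Transform and Sink.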
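### SeaTunnel CDC current status
Currently, SeaTunnel CDC has the basic capabilities and supports the incremental and snapshot phases. It also supports MySQL for real-time and offline use. The MySQL real-time test is complete, and the offline test is coming. Schema evolution is not supported yet because it involves changes to Transform and Sink. Dynamic discovery of new tables is not yet supported, and some interfaces have been reserved for multi-structure tables.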
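### Project outlook
As an Apache incubator project, Apache SeaTunnel has a rapidly developing community. The next community planning session has these main goals:
#### 1. Expand and improve the connector and catalog ecosystem
We're working to improve connector and catalog features, including:
- Supporting more connectors, including TiDB, Doris, and Stripe.
- Improving the usability and performance of existing connectors.
- Supporting CDC connectors for real-time, incremental synchronization scenarios.
Anyone interested in connectors can review [Umbrella][5].
#### 2. Support for more data integration scenarios (SeaTunnel Engine)
There are still pain points that existing engines cannot solve, such as the synchronization of an entire database, the synchronization of table structure changes, and the large granularity of task failure.
We're working to solve these issues. Anyone interested should look at [issue 2272][6].
#### 3. Easier to use (SeaTunnel Web)
We're working to provide a web interface to make operations easier. Through the web interface, it will be possible to view Catalog, Connector, Job, and related information in the form of DAG/SQL. We're also giving users access to the scheduling platform so that they can manage tasks more easily.
For more information on the web UI, visit the [web sub-project][7].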
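### Wrap up
Database activity must be tracked carefully in order to manage changes such as record updates, deletions, or insertions. CDC provides this capability. Apache SeaTunnel is an open source solution that addresses these needs and continues to evolve to offer more features. The project and community are active, and your participation is welcome.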
--------------------------------------------------------------------------------
via: https://opensource.com/article/23/3/synchronize-databases-apache-seatunnel
Author: [Li Zongwen][a]
Topic selection: [lkxed][b]
Translator: [cool-summer-021](https://github.com/cool-summer-021)
Proofreader: [Proofreader ID](https://github.com/校对者ID)
This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux China](https://linux.cn/)
[a]: https://opensource.com/users/li-zongwen
[b]: https://github.com/lkxed/
[1]: https://seatunnel.apache.org/
[2]: https://opensource.com/downloads/mariadb-mysql-cheat-sheet
[3]: https://www.redhat.com/en/topics/microservices?intcmp=7013a000002qLH8AAM
[4]: https://www.redhat.com/architect/illustrated-cqrs
[5]: https://github.com/apache/incubator-seatunnel/issues/1946
[6]: https://github.com/apache/incubator-seatunnel/issues/2272
[7]: https://github.com/apache/incubator-seatunnel-web