From 6537bcf2431fe7bbaefde3d4f91e07f91f4d28d8 Mon Sep 17 00:00:00 2001 From: Vonng Date: Thu, 29 Aug 2024 10:25:44 +0800 Subject: [PATCH] update ch2 ch3 v2 struct --- | 17 +++- | 10 -- | 131 +++++++++++++++++---------- | 256 +++++++++++++++++++++++++++++++++++----------------- | 24 ++--- | 22 ++--- | 14 +-- 7 files changed, 298 insertions(+), 176 deletions(-) diff --git a/ b/ index 02b42ae..a8e3a76 100644 --- a/ +++ b/ @@ -13,6 +13,17 @@ > 本地:你可在项目根目录中执行 `make`,并通过浏览器阅读([在线预览](。 +-------- + +> **预览版读者须知** +> +> 在预览版中,你可以最早获取到未经编辑的作者原始撰写稿 —— 因此你能在这些书籍出版前就用上这些技术。 +> +> 本书的 GitHub 反馈仓库位于 *[](*。 +> +> 如果你对如何改进本书的内容和/或示例有任何建议,或者你发现本章有缺失的材料,请在 GitHub 上联系我们。 + + -------- ## 译序 @@ -60,8 +71,9 @@ * [数据系统,法律与社会](数据系统法律与社会) * [本章小结](本章小结) * [第二章:定义非功能性要求]( - * [关于数据系统的思考](关于数据系统的思考) - * [可靠性](可靠性) + * [案例学习:主页时间线](案例学习社交网络主页时间线) + * [描述性能](描述性能) + * [可靠性与容灾](可靠性与容灾) * [可伸缩性](可伸缩性) * [可维护性](可维护性) * [本章小结](本章小结) @@ -69,6 +81,7 @@ * [关系模型与文档模型](关系模型与文档模型) * [数据查询语言](数据查询语言) * [图数据模型](图数据模型) + * [数据框、矩阵和数组](数据框矩阵和数组) * [本章小结](本章小结) * [第四章:存储与检索]( * [驱动数据库的数据结构](驱动数据库的数据结构) diff --git a/ b/ index 8f02b80..0709541 100644 --- a/ +++ b/ @@ -6,16 +6,6 @@ > > [Thomas Sowell](, 与 Fred Barnes 的采访 (2005) --------- - -> **预览版读者须知** -> -> 在提前发布的电子书中,你可以最早获取到作者撰写时的原始未编辑内容——因此你可以在这些书籍正式发布前长时间利用这些技术。 -> -> 这将是最终书籍的第一章。本书的 GitHub 仓库位于 *[](*。 -> -> 如果你对如何改进本书的内容和/或示例有任何建议,或者你发现本章有缺失的材料,请在 GitHub 上联系我们。 - 数据在今天的许多应用程序开发中居于核心地位。随着网络和移动应用、软件即服务(SaaS)以及云服务的普及,将来自不同用户的数据存储在共享的基于服务器的数据基础设施中已成为常态。需要存储和供分析使用的数据包括用户活动、商业交易、设备和传感器的数据。当用户与应用程序交互时,他们既读取存储的数据,也生成更多数据。 小量数据,可在单一机器上存储和处理,通常相对容易处理。然而,随着数据量或查询率的增加,需要将数据分布到多台机器上,这引入了许多挑战。随着应用程序需求的复杂化,仅在一个系统中存储所有数据已不再足够,可能需要结合多个提供不同功能的存储或处理系统。 diff --git a/ b/ index 93975b6..f7ec812 100644 --- a/ +++ b/ @@ -38,9 +38,9 @@ 其次,越来越多的应用程序有着各种严格而广泛的要求,单个工具不足以满足所有的数据处理和存储需求。取而代之的是,总体工作被拆分成一系列能被单个工具高效完成的任务,并通过应用代码将它们缝合起来。 -例如,如果将缓存(应用管理的缓存层,Memcached 或同类产品)和全文搜索(全文搜索服务器,例如 Elasticsearch 或 Solr)功能从主数据库剥离出来,那么使缓存 / 索引与主数据库保持同步通常是应用代码的责任。[图 1-1](../img/fig1-1.png) 
给出了这种架构可能的样子(细节将在后面的章节中详细介绍)。 +例如,如果将缓存(应用管理的缓存层,Memcached 或同类产品)和全文搜索(全文搜索服务器,例如 Elasticsearch 或 Solr)功能从主数据库剥离出来,那么使缓存 / 索引与主数据库保持同步通常是应用代码的责任。[图 1-1](img/fig1-1.png) 给出了这种架构可能的样子(细节将在后面的章节中详细介绍)。 -![](../img/fig1-1.png) +![](img/fig1-1.png) **图 1-1 一个可能的组合使用多个组件的数据系统架构** @@ -165,7 +165,7 @@ 大体上讲,这一对操作有两种实现方式。 -1. 发布推文时,只需将新推文插入全局推文集合即可。当一个用户请求自己的主页时间线时,首先查找他关注的所有人,查询这些被关注用户发布的推文并按时间顺序合并。在如 [图 1-2](../img/fig1-2.png) 所示的关系型数据库中,可以编写这样的查询: +1. 发布推文时,只需将新推文插入全局推文集合即可。当一个用户请求自己的主页时间线时,首先查找他关注的所有人,查询这些被关注用户发布的推文并按时间顺序合并。在如 [图 1-2](img/fig1-2.png) 所示的关系型数据库中,可以编写这样的查询: ```sql SELECT tweets.*, users.* @@ -175,13 +175,13 @@ WHERE follows.follower_id = current_user ``` - ![](../img/fig1-2.png) + ![](img/fig1-2.png) **图 1-2 推特主页时间线的关系型模式简单实现** -2. 为每个用户的主页时间线维护一个缓存,就像每个用户的推文收件箱([图 1-3](../img/fig1-3.png))。当一个用户发布推文时,查找所有关注该用户的人,并将新的推文插入到每个主页时间线缓存中。因此读取主页时间线的请求开销很小,因为结果已经提前计算好了。 +2. 为每个用户的主页时间线维护一个缓存,就像每个用户的推文收件箱([图 1-3](img/fig1-3.png))。当一个用户发布推文时,查找所有关注该用户的人,并将新的推文插入到每个主页时间线缓存中。因此读取主页时间线的请求开销很小,因为结果已经提前计算好了。 - ![](../img/fig1-3.png) + ![](img/fig1-3.png) **图 1-3 用于分发推特至关注者的数据流水线,2012 年 11 月的负载参数【16】** @@ -191,7 +191,7 @@ 在推特的例子中,每个用户粉丝数的分布(可能按这些用户的发推频率来加权)是探讨可伸缩性的一个关键负载参数,因为它决定了扇出负载。你的应用程序可能具有非常不同的特征,但可以采用相似的原则来考虑它的负载。 -推特轶事的最终转折:现在已经稳健地实现了方法 2,推特逐步转向了两种方法的混合。大多数用户发的推文会被扇出写入其粉丝主页时间线缓存中。但是少数拥有海量粉丝的用户(即名流)会被排除在外。当用户读取主页时间线时,分别地获取出该用户所关注的每位名流的推文,再与用户的主页时间线缓存合并,如方法 1 所示。这种混合方法能始终如一地提供良好性能。在 [第十二章](../ 中我们将重新讨论这个例子,这在覆盖更多技术层面之后。 +推特轶事的最终转折:现在已经稳健地实现了方法 2,推特逐步转向了两种方法的混合。大多数用户发的推文会被扇出写入其粉丝主页时间线缓存中。但是少数拥有海量粉丝的用户(即名流)会被排除在外。当用户读取主页时间线时,分别地获取出该用户所关注的每位名流的推文,再与用户的主页时间线缓存合并,如方法 1 所示。这种混合方法能始终如一地提供良好性能。在 [第十二章]( 中我们将重新讨论这个例子,这在覆盖更多技术层面之后。 ### 描述性能 @@ -212,9 +212,9 @@ 即使不断重复发送同样的请求,每次得到的响应时间也都会略有不同。现实世界的系统会处理各式各样的请求,响应时间可能会有很大差异。因此我们需要将响应时间视为一个可以测量的数值 **分布(distribution)**,而不是单个数值。 -在 [图 1-4](../img/fig1-4.png) 中,每个灰条代表一次对服务的请求,其高度表示请求花费了多长时间。大多数请求是相当快的,但偶尔会出现需要更长的时间的异常值。这也许是因为缓慢的请求实质上开销更大,例如它们可能会处理更多的数据。但即使(你认为)所有请求都花费相同时间的情况下,随机的附加延迟也会导致结果变化,例如:上下文切换到后台进程,网络数据包丢失与 TCP 
重传,垃圾收集暂停,强制从磁盘读取的页面错误,服务器机架中的震动【18】,还有很多其他原因。 +在 [图 1-4](img/fig1-4.png) 中,每个灰条代表一次对服务的请求,其高度表示请求花费了多长时间。大多数请求是相当快的,但偶尔会出现需要更长的时间的异常值。这也许是因为缓慢的请求实质上开销更大,例如它们可能会处理更多的数据。但即使(你认为)所有请求都花费相同时间的情况下,随机的附加延迟也会导致结果变化,例如:上下文切换到后台进程,网络数据包丢失与 TCP 重传,垃圾收集暂停,强制从磁盘读取的页面错误,服务器机架中的震动【18】,还有很多其他原因。 -![](../img/fig1-4.png) +![](img/fig1-4.png) **图 1-4 展示了一个服务 100 次请求响应时间的均值与百分位数** @@ -224,7 +224,7 @@ 如果想知道典型场景下用户需要等待多长时间,那么中位数是一个好的度量标准:一半用户请求的响应时间少于响应时间的中位数,另一半服务时间比中位数长。中位数也被称为第 50 百分位点,有时缩写为 p50。注意中位数是关于单个请求的;如果用户同时发出几个请求(在一个会话过程中,或者由于一个页面中包含了多个资源),则至少一个请求比中位数慢的概率远大于 50%。 -为了弄清异常值有多糟糕,可以看看更高的百分位点,例如第 95、99 和 99.9 百分位点(缩写为 p95,p99 和 p999)。它们意味着 95%、99% 或 99.9% 的请求响应时间要比该阈值快,例如:如果第 95 百分位点响应时间是 1.5 秒,则意味着 100 个请求中的 95 个响应时间快于 1.5 秒,而 100 个请求中的 5 个响应时间超过 1.5 秒。如 [图 1-4](../img/fig1-4.png) 所示。 +为了弄清异常值有多糟糕,可以看看更高的百分位点,例如第 95、99 和 99.9 百分位点(缩写为 p95,p99 和 p999)。它们意味着 95%、99% 或 99.9% 的请求响应时间要比该阈值快,例如:如果第 95 百分位点响应时间是 1.5 秒,则意味着 100 个请求中的 95 个响应时间快于 1.5 秒,而 100 个请求中的 5 个响应时间超过 1.5 秒。如 [图 1-4](img/fig1-4.png) 所示。 响应时间的高百分位点(也称为 **尾部延迟**,即 **tail latencies**)非常重要,因为它们直接影响用户的服务体验。例如亚马逊在描述内部服务的响应时间要求时是以 99.9 百分位点为准,即使它只影响一千个请求中的一个。这是因为请求响应最慢的客户往往也是数据最多的客户,也可以说是最有价值的客户 —— 因为他们掏钱了【19】。保证网站响应迅速对于保持客户的满意度非常重要,亚马逊观察到:响应时间增加 100 毫秒,销售量就减少 1%【20】;而另一些报告说:慢 1 秒钟会让客户满意度指标减少 16%【21,22】。 @@ -238,13 +238,13 @@ > #### 实践中的百分位点 > -> 在多重调用的后端服务里,高百分位数变得特别重要。即使并行调用,最终用户请求仍然需要等待最慢的并行调用完成。如 [图 1-5](../img/fig1-5.png) 所示,只需要一个缓慢的调用就可以使整个最终用户请求变慢。即使只有一小部分后端调用速度较慢,如果最终用户请求需要多个后端调用,则获得较慢调用的机会也会增加,因此较高比例的最终用户请求速度会变慢(该效果称为尾部延迟放大,即 tail latency amplification【24】)。 +> 在多重调用的后端服务里,高百分位数变得特别重要。即使并行调用,最终用户请求仍然需要等待最慢的并行调用完成。如 [图 1-5](img/fig1-5.png) 所示,只需要一个缓慢的调用就可以使整个最终用户请求变慢。即使只有一小部分后端调用速度较慢,如果最终用户请求需要多个后端调用,则获得较慢调用的机会也会增加,因此较高比例的最终用户请求速度会变慢(该效果称为尾部延迟放大,即 tail latency amplification【24】)。 > > 如果你想将响应时间百分点添加到你的服务的监视仪表板,则需要持续有效地计算它们。例如,你可以使用滑动窗口来跟踪连续10分钟内的请求响应时间。每一分钟,你都会计算出该窗口中的响应时间中值和各种百分数,并将这些度量值绘制在图上。 > > 简单的实现是在时间窗口内保存所有请求的响应时间列表,并且每分钟对列表进行排序。如果对你来说效率太低,那么有一些算法能够以最小的 CPU 和内存成本(如前向衰减【25】、t-digest【26】或 HdrHistogram 
【27】)来计算百分位数的近似值。请注意,平均百分比(例如,减少时间分辨率或合并来自多台机器的数据)在数学上没有意义 - 聚合响应时间数据的正确方法是添加直方图【28】。 -![](../img/fig1-5.png) +![](img/fig1-5.png) **图 1-5 当一个请求需要多个后端请求时,单个后端慢请求就会拖慢整个终端用户的请求** @@ -256,7 +256,7 @@ 人们经常讨论 **纵向伸缩**(scaling up,也称为垂直伸缩,即 vertical scaling,转向更强大的机器)和 **横向伸缩**(scaling out,也称为水平伸缩,即 horizontal scaling,将负载分布到多台小机器上)之间的对立。跨多台机器分配负载也称为 “**无共享(shared-nothing)**” 架构。可以在单台机器上运行的系统通常更简单,但高端机器可能非常贵,所以非常密集的负载通常无法避免地需要横向伸缩。现实世界中的优秀架构需要将这两种方法务实地结合,因为使用几台足够强大的机器可能比使用大量的小型虚拟机更简单也更便宜。 -有些系统是 **弹性(elastic)** 的,这意味着可以在检测到负载增加时自动增加计算资源,而其他系统则是手动伸缩(人工分析容量并决定向系统添加更多的机器)。如果负载 **极难预测(highly unpredictable)**,则弹性系统可能很有用,但手动伸缩系统更简单,并且意外操作可能会更少(请参阅 “[分区再平衡](../分区再平衡)”)。 +有些系统是 **弹性(elastic)** 的,这意味着可以在检测到负载增加时自动增加计算资源,而其他系统则是手动伸缩(人工分析容量并决定向系统添加更多的机器)。如果负载 **极难预测(highly unpredictable)**,则弹性系统可能很有用,但手动伸缩系统更简单,并且意外操作可能会更少(请参阅 “[分区再平衡](分区再平衡)”)。 跨多台机器部署 **无状态服务(stateless services)** 非常简单,但将带状态的数据系统从单节点变为分布式配置则可能引入许多额外复杂度。出于这个原因,常识告诉我们应该将数据库放在单个节点上(纵向伸缩),直到伸缩成本或可用性需求迫使其改为分布式。 @@ -365,17 +365,10 @@ 不幸的是,使应用可靠、可伸缩或可维护并不容易。但是某些模式和技术会不断重新出现在不同的应用中。在接下来的几章中,我们将看到一些数据系统的例子,并分析它们如何实现这些目标。 -在本书后面的 [第三部分](../ 中,我们将看到一种模式:几个组件协同工作以构成一个完整的系统(如 [图 1-1](../img/fig1-1.png) 中的例子) +在本书后面的 [第三部分]( 中,我们将看到一种模式:几个组件协同工作以构成一个完整的系统(如 [图 1-1](img/fig1-1.png) 中的例子) - -# A Note for Early Release Readers - -With Early Release ebooks, you get books in their earliest form—the author’s raw and unedited content as they write—so you can take advantage of these technologies long before the official release of these titles. - -This will be the 2nd chapter of the final book. The GitHub repo for this book is *[\*\*](*. - -If you have comments about how we might improve the content and/or examples in this book, or if you notice missing material within this chapter, please reach out on GitHub. +-------- If you are building an application, you will be driven by a list of requirements. 
At the top of your list is most likely the functionality that the application must offer: what screens and what buttons you need, and what each operation is supposed to do in order to fulfill the purpose of your software. These are your *functional requirements*. @@ -390,13 +383,18 @@ Not all nonfunctional requirements fall within the scope of this book, but sever The terminology introduced in this chapter will also be useful in the following chapters, when we go into the details of how data-intensive systems are implemented. However, abstract definitions can be quite dry; to make the ideas more concrete, we will start this chapter with a case study of how a social networking service might work, which will provide practical examples of performance and scalability. -# Case Study: Social Network Home Timelines + + + +-------- + +## 案例学习:社交网络主页时间线 Imagine you are given the task of implementing a social network in the style of X (formerly Twitter), in which users can post messages and follow other users. This will be a huge simplification of how such a service actually works [[1](, [2](, [3](], but it will help illustrate some of the issues that arise in large-scale systems. Let’s assume that users make 500 million posts per day, or 5,700 posts per second on average. Occasionally, the rate can spike as high as 150,000 posts/second [[4](]. Let’s also assume that the average user follows 200 people and has 200 followers (although there is a very wide range: most people have only a handful of followers, and a few celebrities such as Barack Obama have over 100 million followers). -## Representing Users, Posts, and Follows +### Representing Users, Posts, and Follows Imagine we keep all of the data in a relational database as shown in [Figure 2-1]( We have one table for users, one table for posts, and one table for follow relationships. 
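The three tables described above can be sketched with Python's built-in `sqlite3` module and an in-memory database. The column names (`screen_name`, `sender_id`, `posted_at`, `follower_id`, `followee_id`) and the sample rows are illustrative assumptions, not the schema from Figure 2-1. The query joins the three tables to collect recent posts from followed accounts, similar in shape to the `tweets`/`follows` query in the earlier hunk.

```python
import sqlite3

# Hypothetical schema: one table each for users, posts, and follow relationships.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users   (id INTEGER PRIMARY KEY, screen_name TEXT NOT NULL);
    CREATE TABLE posts   (id INTEGER PRIMARY KEY,
                          sender_id INTEGER NOT NULL REFERENCES users(id),
                          text TEXT NOT NULL,
                          posted_at INTEGER NOT NULL);
    CREATE TABLE follows (follower_id INTEGER NOT NULL REFERENCES users(id),
                          followee_id INTEGER NOT NULL REFERENCES users(id),
                          PRIMARY KEY (follower_id, followee_id));
""")

# Made-up sample data: alice (1) follows bob (2) and carol (3).
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "alice"), (2, "bob"), (3, "carol")])
conn.executemany("INSERT INTO follows VALUES (?, ?)", [(1, 2), (1, 3)])
conn.executemany("INSERT INTO posts VALUES (?, ?, ?, ?)",
                 [(10, 2, "hello from bob", 100),
                  (11, 3, "hello from carol", 200),
                  (12, 1, "alice's own post", 300)])


def home_timeline(conn, user_id, limit=1000):
    """Return (screen_name, text) for posts by accounts that user_id follows, newest first."""
    return conn.execute("""
        SELECT users.screen_name, posts.text
        FROM posts
        JOIN follows ON posts.sender_id = follows.followee_id
        JOIN users   ON posts.sender_id = users.id
        WHERE follows.follower_id = ?
        ORDER BY posts.posted_at DESC
        LIMIT ?
    """, (user_id, limit)).fetchall()


timeline = home_timeline(conn, 1)
print(timeline)
```

Every timeline read in this sketch has to touch the posts of each followed account, which is the cost the following paragraphs set out to reduce.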
@@ -421,7 +419,7 @@ Posts are supposed to be timely, so let’s assume that after somebody makes a p Moreover, the query above is quite expensive: if you are following 200 people, it needs to fetch a list of recent posts by each of those 200 people, and merge those lists. 2 million timeline queries per second then means that the database needs to look up the recent posts from some sender 400 million times per second—a huge number. And that is the average case. Some users follow tens of thousands of accounts; for them, this query is very expensive to execute, and difficult to make fast. -## Materializing and Updating Timelines +### Materializing and Updating Timelines How can we do better? Firstly, instead of polling, it would be better if the server actively pushed new posts to any followers who are currently online. Secondly, we should precompute the results of the query above so that a user’s request for their home timeline can be served from a cache. @@ -441,7 +439,17 @@ This process of precomputing and updating the results of a query is called *mate One way of solving this problem is to handle celebrity posts separately from everyone else’s posts: we can save ourselves the effort of adding them to millions of timelines by storing the celebrity posts separately and merging them with the materialized timeline when it is read. Despite such optimizations, handling celebrities on a social network can require a lot of infrastructure [[5](]. -# Describing Performance + + + + + + + + +-------- + +## 描述性能 Most discussions of software performance consider two main types of metric: @@ -461,7 +469,7 @@ There is often a connection between throughput and response time; an example of ###### Figure 2-3. As the throughput of a service approaches its capacity, the response time increases dramatically due to queueing. 
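A rough way to see why response time climbs so steeply near capacity is the textbook M/M/1 queueing formula, in which mean response time equals service time divided by (1 - utilization). The M/M/1 model is an assumption chosen for illustration; the text does not say which model underlies Figure 2-3, and real services rarely match it exactly.

```python
def mean_response_time(service_time_s: float, utilization: float) -> float:
    """Mean response time of an M/M/1 queue (illustrative model, not from the source text)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_s / (1 - utilization)


# Assumed 10 ms service time, evaluated at increasing utilization.
for utilization in (0.5, 0.9, 0.99):
    ms = mean_response_time(0.010, utilization) * 1000
    print(f"utilization {utilization:.0%}: about {ms:.0f} ms")
```

Under this model a 10 ms service time turns into roughly 20 ms at 50% utilization, 100 ms at 90%, and 1 s at 99%, which matches the qualitative shape the caption describes.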
-# When an overloaded system won’t recover +#### When an overloaded system won’t recover If a system is close to overload, with throughput pushed close to the limit, it can sometimes enter a vicious cycle where it becomes less efficient and hence even more overloaded. For example, if there is a long queue of requests waiting to be handled, response times may increase so much that clients time out and resend their request. This causes the rate of requests to increase even further, making the problem worse—a *retry storm*. Even when the load is reduced again, such a system may remain in an overloaded state until it is rebooted or otherwise reset. This phenomenon is called a *metastable failure*, and it can cause serious outages in production systems [[6](, [7](]. @@ -471,7 +479,7 @@ In terms of performance metrics, the response time is usually what users care ab In this section we will focus primarily on response times, and we will return to throughput and scalability in [“Scalability”]( -## Latency and Response Time +### 延迟与响应时间 “Latency” and “response time” are sometimes used interchangeably, but in this book we will use the terms in a specific way (illustrated in [Figure 2-4]( @@ -488,7 +496,7 @@ The response time can vary significantly from one request to the next, even if y Queueing delays often account for a large part of the variability in response times. As a server can only process a small number of things in parallel (limited, for example, by its number of CPU cores), it only takes a small number of slow requests to hold up the processing of subsequent requests—an effect known as *head-of-line blocking*. Even if those subsequent requests have fast service times, the client will see a slow overall response time due to the time waiting for the prior request to complete. The queueing delay is not part of the service time, and for this reason it is important to measure response times on the client side. 
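Head-of-line blocking can be illustrated with a toy model of a single worker that serves requests strictly in arrival order. The function name and the assumption that all requests arrive at time 0 are simplifications made for this sketch only.

```python
def fifo_response_times(service_times_s):
    """Response time of each request when one worker serves them in arrival order.

    All requests are assumed to arrive at time 0 (a simplification for illustration),
    so each response time is the queueing delay plus the request's own service time.
    """
    clock = 0.0
    response_times = []
    for service_time in service_times_s:
        clock += service_time          # the worker finishes this request at `clock`
        response_times.append(clock)   # time spent waiting + time being served
    return response_times


# One slow request (1 s) at the head of the queue, followed by three fast ones (10 ms each).
service_times = [1.0, 0.01, 0.01, 0.01]
for service, response in zip(service_times, fifo_response_times(service_times)):
    print(f"service {service * 1000:.0f} ms -> response {response * 1000:.0f} ms")
```

The fast requests need only 10 ms of service each, yet every one of them sees a response time above one second. A server-side timer that measures only service time would miss this, which is why the text recommends measuring on the client side.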
-## Average, Median, and Percentiles +### 平均数,中位数与百分位点 Because the response time varies from one request to the next, we need to think of it not as a single number, but as a *distribution* of values that you can measure. In [Figure 2-5](, each gray bar represents a request to a service, and its height shows how long that request took. Most requests are reasonably fast, but there are occasional *outliers* that take much longer. Variation in network delay is also known as *jitter*. @@ -506,7 +514,7 @@ High percentiles of response times, also known as *tail latencies*, are importan On the other hand, optimizing the 99.99th percentile (the slowest 1 in 10,000 requests) was deemed too expensive and to not yield enough benefit for Amazon’s purposes. Reducing response times at very high percentiles is difficult because they are easily affected by random events outside of your control, and the benefits are diminishing. -# The user impact of response times +### The user impact of response times It seems intuitively obvious that a fast service is better for users than a slow service [[17](]. However, it is surprisingly difficult to get hold of reliable data to quantify the effect that latency has on user behavior. @@ -516,7 +524,7 @@ A more recent Akamai study [[21]( A study by Yahoo [[22](] compares click-through rates on fast-loading versus slow-loading search results, controlling for quality of search results. It finds 20–30% more clicks on fast searches when the difference between fast and slow responses is 1.25 seconds or more. -## Use of Response Time Metrics +#### 使用响应时间指标 High percentiles are especially important in backend services that are called multiple times as part of serving a single end-user request. Even if you make the calls in parallel, the end-user request still needs to wait for the slowest of the parallel calls to complete. 
It takes just one slow call to make the entire end-user request slow, as illustrated in [Figure 2-6]( Even if only a small percentage of backend calls are slow, the chance of getting a slow call increases if an end-user request requires multiple backend calls, and so a higher proportion of end-user requests end up being slow (an effect known as *tail latency amplification* [[23](]). @@ -526,7 +534,7 @@ High percentiles are especially important in backend services that are called mu Percentiles are often used in *service level objectives* (SLOs) and *service level agreements* (SLAs) as ways of defining the expected performance and availability of a service [[24](]. For example, an SLO may set a target for a service to have a median response time of less than 200 ms and a 99th percentile under 1 s, and a target that at least 99.9% of valid requests result in non-error responses. An SLA is a contract that specifies what happens if the SLO is not met (for example, customers may be entitled to a refund). That is the basic idea, at least; in practice, defining good availability metrics for SLOs and SLAs is not straightforward [[25](, [26](]. -# Computing percentiles +#### 计算百分位点 If you want to add response time percentiles to the monitoring dashboards for your services, you need to efficiently calculate them on an ongoing basis. For example, you may want to keep a rolling window of response times of requests in the last 10 minutes. Every minute, you calculate the median and various percentiles over the values in that window and plot those metrics on a graph. @@ -534,7 +542,11 @@ The simplest implementation is to keep a list of response times for all requests Beware that averaging percentiles, e.g., to reduce the time resolution or to combine data from several machines, is mathematically meaningless—the right way of aggregating response time data is to add the histograms [[31](]. 
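The warning about averaging percentiles can be checked with a small experiment. The sketch below uses made-up response times for two machines and a simple nearest-rank percentile; the function names, the 10 ms bucket width, and the data are all illustrative assumptions, and none of this reproduces the approximation algorithms cited above.

```python
import math
from collections import Counter


def percentile(sorted_values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def histogram(values_ms, bucket_ms=10):
    """Count samples per fixed-width bucket, keyed by the bucket's upper bound in ms."""
    return Counter(math.ceil(v / bucket_ms) * bucket_ms for v in values_ms)


def percentile_from_histogram(hist, p):
    """Nearest-rank percentile read from a histogram, accurate to the bucket width."""
    total = sum(hist.values())
    if total == 0:
        raise ValueError("empty histogram")
    target = max(1, math.ceil(p * total / 100))
    seen = 0
    for upper_bound in sorted(hist):
        seen += hist[upper_bound]
        if seen >= target:
            return upper_bound


machine_a = [10] * 900    # 900 fast requests, 10 ms each (made-up data)
machine_b = [500] * 100   # 100 slow requests, 500 ms each (made-up data)

p99_a = percentile(sorted(machine_a), 99)
p99_b = percentile(sorted(machine_b), 99)
averaged_p99 = (p99_a + p99_b) / 2                      # the meaningless aggregate

merged = histogram(machine_a) + histogram(machine_b)    # add the histograms instead
merged_p99 = percentile_from_histogram(merged, 99)
true_p99 = percentile(sorted(machine_a + machine_b), 99)

print(averaged_p99, merged_p99, true_p99)
```

Averaging the two per-machine p99 values gives 255 ms, a number that describes no actual request. Adding the histograms and reading the percentile from the merged counts gives 500 ms, the same answer as computing p99 over all 1,000 raw samples.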
-# Reliability and Fault Tolerance + + +-------- + +## 可靠性与容灾 Everybody has an intuitive idea of what it means for something to be reliable or unreliable. For software, typical expectations include: @@ -555,7 +567,7 @@ If all those things together mean “working correctly,” then we can understan The distinction between fault and failure can be confusing because they are the same thing, just at different levels. For example, if a hard drive stops working, we say that the hard drive has failed: if the system consists only of that one hard drive, it has stopped providing the required service. However, if the system you’re talking about contains many hard drives, then the failure of a single hard drive is only a fault from the point of view of the bigger system, and the bigger system might be able to tolerate that fault by having a copy of the data on another hard drive. -## Fault Tolerance +### 容灾 We call a system *fault-tolerant* if it continues providing the required service to the user in spite of certain faults occurring. If a system cannot tolerate a certain part becoming faulty, we call that part a *single point of failure* (SPOF), because a fault in that part escalates to cause the failure of the whole system. @@ -567,7 +579,7 @@ Counter-intuitively, in such fault-tolerant systems, it can make sense to *incre Although we generally prefer tolerating faults over preventing faults, there are cases where prevention is better than cure (e.g., because no cure exists). This is the case with security matters, for example: if an attacker has compromised a system and gained access to sensitive data, that event cannot be undone. However, this book mostly deals with the kinds of faults that can be cured, as described in the following sections. 
-## Hardware and Software Faults +### 硬件与软件缺陷 When we think of causes of system failure, hardware faults quickly come to mind: @@ -580,7 +592,7 @@ When we think of causes of system failure, hardware faults quickly come to mind: These events are rare enough that you often don’t need to worry about them when working on a small system, as long as you can easily replace hardware that becomes faulty. However, in a large-scale system, hardware faults happen often enough that they become part of the normal system operation. -### Tolerating hardware faults through redundancy +#### 通过冗余容忍硬件缺陷 Our first response to unreliable hardware is usually to add redundancy to the individual hardware components in order to reduce the failure rate of the system. Disks may be set up in a RAID configuration (spreading data across multiple disks in the same machine so that a failed disk does not cause data loss), servers may have dual power supplies and hot-swappable CPUs, and datacenters may have batteries and diesel generators for backup power. Such redundancy can often keep a machine running uninterrupted for years. @@ -592,7 +604,7 @@ The fault-tolerance techniques we discuss in this book are designed to tolerate Systems that can tolerate the loss of entire machines also have operational advantages: a single-server system requires planned downtime if you need to reboot the machine (to apply operating system security patches, for example), whereas a multi-node fault-tolerant system can be patched by restarting one node at a time, without affecting the service for users. This is called a *rolling upgrade*, and we will discuss it further in [Link to Come]. -### Software faults +#### 软件缺陷 Although hardware failures can be weakly correlated, they are still mostly independent: for example, if one disk fails, it’s likely that other disks in the same machine will be fine for another while. 
On the other hand, software faults are often very highly correlated, because it is common for many nodes to run the same software and thus have the same bugs [[53](, [54](]. Such faults are harder to anticipate, and they tend to cause many more system failures than uncorrelated hardware faults [[43](]. For example: @@ -606,7 +618,7 @@ The bugs that cause these kinds of software faults often lie dormant for a long There is no quick solution to the problem of systematic faults in software. Lots of small things can help: carefully thinking about assumptions and interactions in the system; thorough testing; process isolation; allowing processes to crash and restart; avoiding feedback loops such as retry storms (see [“When an overloaded system won’t recover”](; measuring, monitoring, and analyzing system behavior in production. -## Humans and Reliability +### 人类与可靠性 Humans design and build software systems, and the operators who keep the systems running are also human. Unlike machines, humans don’t just follow rules; their strength is being creative and adaptive in getting their job done. However, this characteristic also leads to unpredictability, and sometimes mistakes that can lead to failures, despite best intentions. For example, one study of large internet services found that configuration changes by operators were the leading cause of outages, whereas hardware faults (servers or network) played a role in only 10–25% of outages [[63](]. @@ -620,7 +632,7 @@ Increasingly, organizations are adopting a culture of *blameless postmortems*: a As a general principle, when investigating an incident, you should be suspicious of simplistic answers. 
“Bob should have been more careful when deploying that change” is not productive, but neither is “We must rewrite the backend in Haskell.” Instead, management should take the opportunity to learn the details of how the sociotechnical system works from the point of view of the people who work with it every day, and take steps to improve it based on this feedback [[64](]. -# How Important Is Reliability? +### 可靠性到底有多重要? Reliability is not just for nuclear power stations and air traffic control—more mundane applications are also expected to work reliably. Bugs in business applications cause lost productivity (and legal risks if figures are reported incorrectly), and outages of e-commerce sites can have huge costs in terms of lost revenue and damage to reputation. @@ -632,7 +644,11 @@ There are situations in which we may choose to sacrifice reliability in order to -# Scalability + + +-------- + +## 可伸缩性 Even if a system is working reliably today, that doesn’t mean it will necessarily work reliably in the future. One common reason for degradation is increased load: perhaps the system has grown from 10,000 concurrent users to 100,000 concurrent users, or from 1 million to 10 million. Perhaps it is processing much larger volumes of data than it did before. @@ -648,7 +664,7 @@ The reason is that scalability is not a one-dimensional label: it is meaningless If you succeed in making your application popular, and therefore handling a growing amount of load, you will learn where your performance bottlenecks lie, and therefore you will know along which dimensions you need to scale. At that point it’s time to start worrying about techniques for scalability. -## Describing Load +### 描述负载 First, we need to succinctly describe the current load on the system; only then can we discuss growth questions (what happens if our load doubles?). 
Often this will be a measure of throughput: for example, the number of requests per second to a service, how many gigabytes of new data arrive per day, or the number of shopping cart checkouts per hour. Sometimes you care about the peak of some variable quantity, such as the number of simultaneously online users in [“Case Study: Social Network Home Timelines”]( @@ -663,7 +679,7 @@ Usually our goal is to keep the performance of the system within the requirement If you can double the resources in order to handle twice the load, while keeping performance the same, we say that you have *linear scalability*, and this is considered a good thing. Occasionally it is possible to handle twice the load with less than double the resources, due to economies of scale or a better distribution of peak load [[71](, [72](]. Much more likely is that the cost grows faster than linearly, and there may be many reasons for the inefficiency. For example, if you have a lot of data, then processing a single write request may involve more work than if you have a small amount of data, even if the size of the request is the same. -## Shared-Memory, Shared-Disk, and Shared-Nothing Architecture +### 共享内存,共享磁盘,无共享架构 The simplest way of increasing the hardware resources of a service is to move it to a more powerful machine. Individual CPU cores are no longer getting significantly faster, but you can buy a machine (or rent a cloud instance) with more CPU cores, more RAM, and more disk space. This approach is called *vertical scaling* or *scaling up*. @@ -677,7 +693,7 @@ The advantages of shared-nothing are that it has the potential to scale linearly Some cloud-native database systems use separate services for storage and transaction execution (see [“Separation of storage and compute”](, with multiple compute nodes sharing access to the same storage service. 
This model has some similarity to a shared-disk architecture, but it avoids the scalability problems of older systems: instead of providing a filesystem (NAS) or block device (SAN) abstraction, the storage service offers a specialized API that is designed for the specific needs of the database [[75](]. -## Principles for Scalability +### 可伸缩性原则 The architecture of systems that operate at large scale is usually highly specific to the application—there is no such thing as a generic, one-size-fits-all scalable architecture (informally known as *magic scaling sauce*). For example, a system that is designed to handle 100,000 requests per second, each 1 kB in size, looks very different from a system that is designed for 3 requests per minute, each 2 GB in size—even though the two systems have the same data throughput (100 MB/sec). @@ -687,7 +703,14 @@ A good general principle for scalability is to break a system down into smaller Another good principle is not to make things more complicated than necessary. If a single-machine database will do the job, it’s probably preferable to a complicated distributed setup. Auto-scaling systems (which automatically add or remove resources in response to demand) are cool, but if your load is fairly predictable, a manually scaled system may have fewer operational surprises (see [Link to Come]). A system with five services is simpler than one with fifty. Good architectures usually involve a pragmatic mixture of approaches. -# Maintainability + + + + + +-------- + +## 可维护性 Software does not wear out or suffer material fatigue, so it does not break in the same ways as mechanical objects do. But the requirements for an application frequently change, the environment that the software runs in changes (such as its dependencies and the underlying platform), and it has bugs that need fixing. 
@@ -709,7 +732,7 @@ Every system we create today will one day become a legacy system if it is valuab Make it easy for engineers to make changes to the system in the future, adapting it and extending it for unanticipated use cases as requirements change. -## Operability: Making Life Easy for Operations +### 可操作性:人生苦短,关爱运维 We previously discussed the role of operations in [“Operations in the Cloud Era”](, and we saw that human processes are at least as important for reliable operations as software tools. In fact, it has been suggested that “good operations can often work around the limitations of bad (or incomplete) software, but good software cannot run reliably with bad operations” [[54](]. @@ -726,7 +749,7 @@ Good operability means making routine tasks easy, allowing the operations team t - Self-healing where appropriate, but also giving administrators manual control over the system state when needed - Exhibiting predictable behavior, minimizing surprises -## Simplicity: Managing Complexity +### 简单性:管理复杂度 Small software projects can have delightfully simple and expressive code, but as projects get larger, they often become very complex and difficult to understand. This complexity slows down everyone who needs to work on the system, further increasing the cost of maintenance. A software project mired in complexity is sometimes described as a *big ball of mud* [[83](]. @@ -742,7 +765,7 @@ For example, high-level programming languages are abstractions that hide machine Abstractions for application code, which aim to reduce its complexity, can be created using methodologies such as *design patterns* [[87](] and *domain-driven design* (DDD) [[88](]. This book is not about such application-specific abstractions, but rather about general-purpose abstractions on top of which you can build your applications, such as database transactions, indexes, and event logs. 
If you want to use techniques such as DDD, you can implement them on top of the foundations described in this book. -## Evolvability: Making Change Easy +### 可演化性:让变更更容易 It’s extremely unlikely that your system’s requirements will remain unchanged forever. They are much more likely to be in constant flux: you learn new facts, previously unanticipated use cases emerge, business priorities change, users request new features, new platforms replace old platforms, legal or regulatory requirements change, growth of the system forces architectural changes, etc. @@ -752,7 +775,13 @@ The ease with which you can modify a data system, and adapt it to changing requi One major factor that makes change difficult in large systems is when some action is irreversible, and therefore that action needs to be taken very carefully [[90](]. For example, say you are migrating from one database to another: if you cannot switch back to the old system in case of problems with the new one, the stakes are much higher than if you can easily go back. Minimizing irreversibility improves flexibility. -# Summary + + + + +-------- + +## 本章小结 In this chapter we examined several examples of nonfunctional requirements: performance, reliability, scalability, and maintainability. Through these topics we have also encountered principles and terminology that we will need throughout the rest of the book. We started with a case study of how one might implement home timelines in a social network, which illustrated some of the challenges that arise at scale. @@ -762,9 +791,11 @@ To achieve reliability, you can use fault tolerance techniques, which allow a sy Finally, we examined several facets of maintainability, including supporting the work of operations teams, managing complexity, and making it easy to evolve an application’s functionality over time. 
There are no easy answers on how to achieve these things, but one thing that can help is to build applications using well-understood building blocks that provide useful abstractions. The rest of this book will cover a selection of the most important such building blocks. -##### Footnotes -##### References + +-------- + +## 参考文献 [[1](] Mike Cvet. [How We Learned to Stop Worrying and Love Fan-In at Twitter]( At *QCon San Francisco*, December 2016. diff --git a/ b/ index 14b8988..a128fbd 100644 --- a/ +++ b/ @@ -26,9 +26,14 @@ 掌握一个数据模型需要花费很多精力(想想关系数据建模有多少本书)。即便只使用一个数据模型,不用操心其内部工作机制,构建软件也是非常困难的。然而,因为数据模型对上层软件的功能(能做什么,不能做什么)有着至深的影响,所以选择一个适合的数据模型是非常重要的。 -在本章中,我们将研究一系列用于数据存储和查询的通用数据模型(前面列表中的第 2 点)。特别地,我们将比较关系模型,文档模型和少量基于图形的数据模型。我们还将查看各种查询语言并比较它们的用例。在 [第三章](../ 中,我们将讨论存储引擎是如何工作的。也就是说,这些数据模型实际上是如何实现的(列表中的第 3 点)。 +在本章中,我们将研究一系列用于数据存储和查询的通用数据模型(前面列表中的第 2 点)。特别地,我们将比较关系模型,文档模型和少量基于图形的数据模型。我们还将查看各种查询语言并比较它们的用例。在 [第三章]( 中,我们将讨论存储引擎是如何工作的。也就是说,这些数据模型实际上是如何实现的(列表中的第 3 点)。 + + + +---------- + ## 关系模型与文档模型 现在最著名的数据模型可能是 SQL。它基于 Edgar Codd 在 1970 年提出的关系模型【1】:数据被组织成 **关系**(SQL 中称作 **表**),其中每个关系是 **元组**(SQL 中称作 **行**) 的无序集合。 @@ -116,7 +121,7 @@ } ``` -有一些开发人员认为 JSON 模型减少了应用程序代码和存储层之间的阻抗不匹配。不过,正如我们将在 [第四章](../ 中看到的那样,JSON 作为数据编码格式也存在问题。无模式对 JSON 模型来说往往被认为是一个优势;我们将在 “[文档模型中的模式灵活性](#文档模型中的模式灵活性)” 中讨论这个问题。 +有一些开发人员认为 JSON 模型减少了应用程序代码和存储层之间的阻抗不匹配。不过,正如我们将在 [第四章]( 中看到的那样,JSON 作为数据编码格式也存在问题。无模式对 JSON 模型来说往往被认为是一个优势;我们将在 “[文档模型中的模式灵活性](#文档模型中的模式灵活性)” 中讨论这个问题。 JSON 表示比 [图 2-1](img/fig2-1.png) 中的多表模式具有更好的 **局部性(locality)**。如果在前面的关系型示例中获取简介,那需要执行多个查询(通过 `user_id` 查询每个表),或者在 User 表与其下属表之间混乱地执行多路连接。而在 JSON 表示中,所有相关信息都在同一个地方,一个查询就足够了。 @@ -144,7 +149,7 @@ JSON 表示比 [图 2-1](img/fig2-1.png) 中的多表模式具有更好的 **局 [^ii]: 关于关系模型的文献区分了几种不同的规范形式,但这些区别几乎没有实际意义。一个经验法则是,如果重复存储了可以存储在一个地方的值,则模式就不是 **规范化(normalized)** 的。 -> 数据库管理员和开发人员喜欢争论规范化和非规范化,让我们暂时保留判断吧。在本书的 [第三部分](../,我们将回到这个话题,探讨系统的方法用以处理缓存,非规范化和衍生数据。 +> 数据库管理员和开发人员喜欢争论规范化和非规范化,让我们暂时保留判断吧。在本书的 [第三部分](,我们将回到这个话题,探讨系统的方法用以处理缓存,非规范化和衍生数据。 
不幸的是,对这些数据进行规范化需要多对一的关系(许多人生活在一个特定的地区,许多人在一个特定的行业工作),这与文档模型不太吻合。在关系数据库中,通过 ID 来引用其他表中的行是正常的,因为连接很容易。在文档数据库中,一对多树结构没有必要用连接,对连接的支持通常很弱 [^iii]。 @@ -220,7 +225,7 @@ CODASYL 中的查询是通过利用遍历记录列和跟随访问路径表在数 ### 关系型数据库与文档数据库在今日的对比 -将关系数据库与文档数据库进行比较时,可以考虑许多方面的差异,包括它们的容错属性(请参阅 [第五章](../和处理并发性(请参阅 [第七章](../。本章将只关注数据模型中的差异。 +将关系数据库与文档数据库进行比较时,可以考虑许多方面的差异,包括它们的容错属性(请参阅 [第五章](和处理并发性(请参阅 [第七章](。本章将只关注数据模型中的差异。 支持文档数据模型的主要论据是架构灵活性,因局部性而拥有更好的性能,以及对于某些应用程序而言更接近于应用程序使用的数据结构。关系模型通过为连接提供更好的支持以及支持多对一和多对多的关系来反击。 @@ -280,7 +285,7 @@ UPDATE users SET first_name = substring_index(name, ' ', 1); -- MySQL 值得指出的是,为了局部性而分组集合相关数据的想法并不局限于文档模型。例如,Google 的 Spanner 数据库在关系数据模型中提供了同样的局部性属性,允许模式声明一个表的行应该交错(嵌套)在父表内【27】。Oracle 类似地允许使用一个称为 **多表索引集群表(multi-table index cluster tables)** 的类似特性【28】。Bigtable 数据模型(用于 Cassandra 和 HBase)中的 **列族(column-family)** 概念与管理局部性的目的类似【29】。 -在 [第三章](../ 将还会看到更多关于局部性的内容。 +在 [第三章]( 将还会看到更多关于局部性的内容。 #### 文档和关系数据库的融合 @@ -420,7 +425,7 @@ for (var i = 0; i < liElements.length; i++) { MapReduce 是一个由 Google 推广的编程模型,用于在多台机器上批量处理大规模的数据【33】。一些 NoSQL 数据存储(包括 MongoDB 和 CouchDB)支持有限形式的 MapReduce,作为在多个文档中执行只读查询的机制。 -关于 MapReduce 更详细的介绍在 [第十章](../。现在我们只简要讨论一下 MongoDB 使用的模型。 +关于 MapReduce 更详细的介绍在 [第十章](。现在我们只简要讨论一下 MongoDB 使用的模型。 MapReduce 既不是一个声明式的查询语言,也不是一个完全命令式的查询 API,而是处于两者之间:查询的逻辑用代码片段来表示,这些代码片段会被处理框架重复性调用。它基于 `map`(也称为 `collect`)和 `reduce`(也称为 `fold` 或 `inject`)函数,两个函数存在于许多函数式编程语言中。 @@ -488,7 +493,7 @@ db.observations.mapReduce(function map() { map 和 reduce 函数在功能上有所限制:它们必须是 **纯** 函数,这意味着它们只使用传递给它们的数据作为输入,它们不能执行额外的数据库查询,也不能有任何副作用。这些限制允许数据库以任何顺序运行任何功能,并在失败时重新运行它们。然而,map 和 reduce 函数仍然是强大的:它们可以解析字符串、调用库函数、执行计算等等。 -MapReduce 是一个相当底层的编程模型,用于计算机集群上的分布式执行。像 SQL 这样的更高级的查询语言可以用一系列的 MapReduce 操作来实现(见 [第十章](../,但是也有很多不使用 MapReduce 的分布式 SQL 实现。須注意,SQL 并没有限制它只能在单一机器上运行,而 MapReduce 也并没有垄断所有的分布式查询执行。 +MapReduce 是一个相当底层的编程模型,用于计算机集群上的分布式执行。像 SQL 这样的更高级的查询语言可以用一系列的 MapReduce 操作来实现(见 [第十章](,但是也有很多不使用 MapReduce 的分布式 SQL 实现。須注意,SQL 并没有限制它只能在单一机器上运行,而 MapReduce 也并没有垄断所有的分布式查询执行。 能够在查询中使用 JavaScript 
代码是高级查询的一个重要特性,但这不限于 MapReduce,一些 SQL 数据库也可以用 JavaScript 函数进行扩展【34】。 @@ -539,7 +544,7 @@ db.observations.aggregate([ **图 2-5 图数据结构示例(框代表顶点,箭头代表边)** -有几种不同但相关的方法用来构建和查询图表中的数据。在本节中,我们将讨论属性图模型(由 Neo4j,Titan 和 InfiniteGraph 实现)和三元组存储(triple-store)模型(由 Datomic、AllegroGraph 等实现)。我们将查看图的三种声明式查询语言:Cypher,SPARQL 和 Datalog。除此之外,还有像 Gremlin 【36】这样的图形查询语言和像 Pregel 这样的图形处理框架(见 [第十章](../。 +有几种不同但相关的方法用来构建和查询图表中的数据。在本节中,我们将讨论属性图模型(由 Neo4j,Titan 和 InfiniteGraph 实现)和三元组存储(triple-store)模型(由 Datomic、AllegroGraph 等实现)。我们将查看图的三种声明式查询语言:Cypher,SPARQL 和 Datalog。除此之外,还有像 Gremlin 【36】这样的图形查询语言和像 Pregel 这样的图形处理框架(见 [第十章](。 ### 属性图 @@ -927,19 +932,12 @@ Cypher 和 SPARQL 使用 SELECT 立即跳转,但是 Datalog 一次只进行一 * 粒子物理学家数十年来一直在进行大数据类型的大规模数据分析,像大型强子对撞机(LHC)这样的项目现在会处理数百 PB 的数据!在这样的规模下,需要定制解决方案来阻止硬件成本的失控【49】。 * **全文搜索** 可以说是一种经常与数据库一起使用的数据模型。信息检索是一个很大的专业课题,我们不会在本书中详细介绍,但是我们将在第三章和第三部分中介绍搜索索引。 -让我们暂时将其放在一边。在 [下一章](../ 中,我们将讨论在 **实现** 本章描述的数据模型时会遇到的一些权衡。 +让我们暂时将其放在一边。在 [下一章]( 中,我们将讨论在 **实现** 本章描述的数据模型时会遇到的一些权衡。 + +----------------------------- - -# A Note for Early Release Readers - -With Early Release ebooks, you get books in their earliest form—the author’s raw and unedited content as they write—so you can take advantage of these technologies long before the official release of these titles. - -This will be the 3rd chapter of the final book. The GitHub repo for this book is *[\*\*](*. - -If you have comments about how we might improve the content and/or examples in this book, or if you notice missing material within this chapter, please reach out on GitHub. - Data models are perhaps the most important part of developing software, because they have such a profound effect: not only on how the software is written, but also on how we *think about the problem* that we are solving. Most applications are built by layering one data model on top of another. For each layer, the key question is: how is it *represented* in terms of the next-lower layer? 
For example: @@ -953,7 +951,7 @@ In a complex application there may be more intermediary levels, such as APIs bui Several different data models are widely used in practice, often for different purposes. Some types of data and some queries are easy to express in one model, and awkward in another. In this chapter we will explore those trade-offs by comparing the relational model, the document model, graph-based data models, event sourcing, and dataframes. We will also briefly look at query languages that allow you to work with these models. This comparison will help you decide when to use which model. -# Terminology: Declarative Query Languages +## 术语:声明式查询语言 Many of the query languages in this chapter (such as SQL, Cypher, SPARQL, or Datalog) are *declarative*, which means that you specify the pattern of the data you want—what conditions the results must meet, and how you want the data to be transformed (e.g., sorted, grouped, and aggregated)—but not *how* to achieve that goal. The database system’s query optimizer can decide which indexes and which join algorithms to use, and in which order to execute various parts of the query. @@ -961,7 +959,13 @@ In contrast, with most programming languages you would have to write an *algorit For example, a database might be able to execute a declarative query in parallel across multiple CPU cores and machines, without you having to worry about how to implement that parallelism [[2](]. In a hand-coded algorithm it would be a lot of work to implement such parallel execution yourself. -# Relational Model versus Document Model + + + + +-------- + +## 关系模型与文档模型 The best-known data model today is probably that of SQL, based on the relational model proposed by Edgar Codd in 1970 [[3](]: data is organized into *relations* (called *tables* in SQL), where each relation is an unordered collection of *tuples* (*rows* in SQL). 
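The contrast between a declarative query and a hand-written algorithm can be made concrete with a small sketch. It assumes Python with the standard-library `sqlite3` module; the `animals` table, its columns, and the sample rows are invented for illustration and do not come from the text above.

```python
import sqlite3

# Hypothetical sample data, invented for this sketch.
ANIMALS = [
    ("Great white shark", "Sharks"),
    ("Bottlenose dolphin", "Dolphins"),
    ("Hammerhead shark", "Sharks"),
]


def sharks_imperative(animals):
    """Imperative style: we spell out the algorithm (scan, test, collect, sort)."""
    result = []
    for name, family in animals:
        if family == "Sharks":
            result.append(name)
    return sorted(result)


def sharks_declarative(animals):
    """Declarative style: we state the condition and the ordering we want;
    the database engine decides how to execute the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE animals (name TEXT, family TEXT)")
    conn.executemany("INSERT INTO animals VALUES (?, ?)", animals)
    rows = conn.execute(
        "SELECT name FROM animals WHERE family = 'Sharks' ORDER BY name"
    ).fetchall()
    conn.close()
    return [name for (name,) in rows]
```

Both functions return the same list. The difference is that the SQL text would stay unchanged if an index were later added on `family`, whereas the Python loop would have to be rewritten to take advantage of it.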
@@ -975,15 +979,15 @@ One lasting effect of the NoSQL movement is the popularity of the *document mode The pros and cons of document and relational data have been debated extensively; let’s examine some of the key points of that debate. -## The Object-Relational Mismatch +### 对象关系不匹配 Much application development today is done in object-oriented programming languages, which leads to a common criticism of the SQL data model: if data is stored in relational tables, an awkward translation layer is required between the objects in the application code and the database model of tables, rows, and columns. The disconnect between the models is sometimes called an *impedance mismatch*. -###### Note +> **注意** +> +> The term *impedance mismatch* is borrowed from electronics. Every electric circuit has a certain impedance (resistance to alternating current) on its inputs and outputs. When you connect one circuit’s output to another one’s input, the power transfer across the connection is maximized if the output and input impedances of the two circuits match. An impedance mismatch can lead to signal reflections and other troubles. -The term *impedance mismatch* is borrowed from electronics. Every electric circuit has a certain impedance (resistance to alternating current) on its inputs and outputs. When you connect one circuit’s output to another one’s input, the power transfer across the connection is maximized if the output and input impedances of the two circuits match. An impedance mismatch can lead to signal reflections and other troubles. - -### Object-relational mapping (ORM) +#### 对象关系映射(ORM) Object-relational mapping (ORM) frameworks like ActiveRecord and Hibernate reduce the amount of boilerplate code required for this translation layer, but they are often criticized [[6](]. 
Some commonly cited problems are: @@ -1000,7 +1004,7 @@ Nevertheless, ORMs also have advantages: - Some ORMs help with caching the results of database queries, which can help reduce the load on the database. - ORMs can also help with managing schema migrations and other administrative activities. -### The document data model for one-to-many relationships +#### The document data model for one-to-many relationships Not all data lends itself well to a relational representation; let’s look at an example to explore a limitation of the relational model. [Figure 3-1]( illustrates how a résumé (a LinkedIn profile) could be expressed in a relational schema. The profile as a whole can be identified by a unique identifier, `user_id`. Fields like `first_name` and `last_name` appear exactly once per user, so they can be modeled as columns on the `users` table. @@ -1008,11 +1012,11 @@ Most people have had more than one job in their career (positions), and people m ![ddia 0201](img/ddia_0201.png) -###### Figure 3-1. Representing a LinkedIn profile using a relational schema. +> Figure 3-1. Representing a LinkedIn profile using a relational schema. Another way of representing the same information, which is perhaps more natural and maps more closely to an object structure in application code, is as a JSON document as shown in [Example 3-1]( -##### Example 3-1. Representing a LinkedIn profile as a JSON document +> Example 3-1. Representing a LinkedIn profile as a JSON document ``` { @@ -1045,13 +1049,18 @@ The one-to-many relationships from the user profile to the user’s positions, e ![ddia 0202](img/ddia_0202.png) -###### Figure 3-2. One-to-many relationships forming a tree structure. +> Figure 3-2. One-to-many relationships forming a tree structure. -###### Note +> **注意** +> +> This type of relationship is sometimes called *one-to-few* rather than *one-to-many*, since a résumé typically has a small number of positions [[9](, [10](]. 
In situations where there may be a genuinely large number of related items—say, comments on a celebrity’s social media post, of which there could be many thousands—embedding them all in the same document may be too unwieldy, so the relational approach in [Figure 3-1]( is preferable. -This type of relationship is sometimes called *one-to-few* rather than *one-to-many*, since a résumé typically has a small number of positions [[9](, [10](]. In situations where there may be a genuinely large number of related items—say, comments on a celebrity’s social media post, of which there could be many thousands—embedding them all in the same document may be too unwieldy, so the relational approach in [Figure 3-1]( is preferable. -## Normalization, Denormalization, and Joins + + + +-------- + +### 范式化,反范式化,连接 In [Example 3-1]( in the preceding section, `region_id` is given as an ID, not as the plain-text string `"Washington, DC, United States"`. Why? @@ -1069,7 +1078,7 @@ The advantage of using an ID is that because it has no meaning to humans, it nev The downside of a normalized representation is that every time you want to display a record containing an ID, you have to do an additional lookup to resolve the ID into something human-readable. In a relational data model, this is done using a *join*, for example: -``` +```sql SELECT users.*, regions.region_name FROM users JOIN regions ON users.region_id = @@ -1078,7 +1087,7 @@ WHERE = 251; In a document database, it is more common to either use a denormalized representation that needs no join when reading, or to perform the join in application code—that is, you first fetch a document containing an ID, and then perform a second query to resolve that ID into another document. 
In MongoDB, it is also possible to perform a join using the `$lookup` operator in an aggregation pipeline: -``` +```mongodb-json db.users.aggregate([ { $match: { _id: 251 } }, { $lookup: { @@ -1090,7 +1099,7 @@ db.users.aggregate([ ]) ``` -### Trade-offs of normalization +#### Trade-offs of normalization In the résumé example, while the `region_id` field is a reference into a standardized set of regions, the name of the `organization` (the company or government where the person worked) and `school_name` (where they studied) are just strings. This representation is denormalized: many people may have worked at the same company, but there is no ID linking them. @@ -1105,7 +1114,7 @@ Besides the cost of performing all these updates, you also need to consider the Normalization tends to be better for OLTP systems, where both reads and updates need to be fast; analytics systems often fare better with denormalized data, since they perform updates in bulk, and the performance of read-only queries is the dominant concern. Moreover, in systems of small to moderate scale, a normalized data model is often best, because you don’t have to worry about keeping multiple copies of the data consistent with each other, and the cost of performing joins is acceptable. However, in very large-scale systems, the cost of joins can become problematic. -### Denormalization in the social networking case study +#### Denormalization in the social networking case study In [“Case Study: Social Network Home Timelines”]( we compared a normalized representation ([Figure 2-1]( and a denormalized one (precomputed, materialized timelines): here, the join between `posts` and `follows` was too expensive, and the materialized timeline is a cache of the result of that join. The fan-out process that inserts a new post into followers’ timelines was our way of keeping the denormalized representation consistent. 
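The read/write trade-off between normalized and denormalized representations can be sketched as follows. This is a minimal illustration using Python's standard-library `sqlite3`, not the book's own schema: the `users_denorm` table, the user names, and the helper functions are assumptions made up for this example.

```python
import sqlite3


def build_db():
    """Create a tiny database holding the same data in both representations."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE regions (id INTEGER PRIMARY KEY, region_name TEXT);
        -- Normalized: each user refers to a region by ID.
        CREATE TABLE users (
            id INTEGER PRIMARY KEY,
            first_name TEXT,
            region_id INTEGER REFERENCES regions(id)
        );
        -- Denormalized (hypothetical table): each row has its own copy of the name.
        CREATE TABLE users_denorm (
            id INTEGER PRIMARY KEY,
            first_name TEXT,
            region_name TEXT
        );
        INSERT INTO regions VALUES (1, 'Washington, DC, United States');
        INSERT INTO users VALUES (251, 'Ada', 1), (252, 'Grace', 1);
        INSERT INTO users_denorm VALUES
            (251, 'Ada', 'Washington, DC, United States'),
            (252, 'Grace', 'Washington, DC, United States');
    """)
    return conn


def region_of(conn, user_id):
    """Normalized read: a join resolves the ID into a human-readable name."""
    return conn.execute(
        "SELECT users.first_name, regions.region_name "
        "FROM users JOIN regions ON users.region_id = regions.id "
        "WHERE users.id = ?",
        (user_id,),
    ).fetchone()


def rename_region(conn, old_name, new_name):
    """Return how many rows each representation must rewrite for one rename."""
    normalized = conn.execute(
        "UPDATE regions SET region_name = ? WHERE region_name = ?",
        (new_name, old_name),
    ).rowcount
    denormalized = conn.execute(
        "UPDATE users_denorm SET region_name = ? WHERE region_name = ?",
        (new_name, old_name),
    ).rowcount
    return normalized, denormalized
```

With this sample data, a rename touches one row in the normalized schema and one row per affected user in the denormalized one. The normalized read pays for a join each time, which is the cost the denormalized copy avoids.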
@@ -1127,7 +1136,13 @@ This example shows that having to perform joins when reading data is not, as som If you need to decide whether to denormalize something in your application, the social network case study shows that the choice is not immediately obvious: the most scalable approach may involve denormalizing some things and leaving other things normalized. You will have to carefully consider how often the information changes, and the cost of reads and writes (which might be dominated by outliers, such as users with many follows/followers in the case of a typical social network). Normalization and denormalization are not inherently good or bad—they are just a trade-off in terms of performance of reads and writes, as well as the amount of effort to implement. -## Many-to-One and Many-to-Many Relationships + + + + +-------- + +### 多对一与多对多关系 While `positions` and `education` in [Figure 3-1]( are examples of one-to-many or one-to-few relationships (one résumé has several positions, but each position belongs only to one résumé), the `region_id` field is an example of a *many-to-one* relationship (many people live in the same region, but we assume that each person lives in only one region at any one time). @@ -1135,11 +1150,11 @@ If we introduce entities for organizations and schools, and reference them by ID ![ddia 0203](img/ddia_0203.png) -###### Figure 3-3. Many-to-many relationships in the relational model. +> Figure 3-3. Many-to-many relationships in the relational model. Many-to-one and many-to-many relationships do not easily fit within one self-contained JSON document; they lend themselves more to a normalized representation. In a document model, one possible representation is given in [Example 3-2]( and illustrated in [Figure 3-4]( the data within each dotted rectangle can be grouped into one document, but the links to organizations and schools are best represented as references to other documents. -##### Example 3-2. 
A résumé that references organizations by ID. +> Example 3-2. A résumé that references organizations by ID. ``` { @@ -1156,7 +1171,7 @@ Many-to-one and many-to-many relationships do not easily fit within one self-con ![ddia 0204](img/ddia_0204.png) -###### Figure 3-4. Many-to-many relationships in the document model: the data within each dotted box can be grouped into one document. +> Figure 3-4. Many-to-many relationships in the document model: the data within each dotted box can be grouped into one document. Many-to-many relationships often need to be queried in “both directions”: for example, finding all of the organizations that a particular person has worked for, and finding all of the people who have worked at a particular organization. One way of enabling such queries is to store ID references on both sides, i.e., a résumé includes the ID of each organization where the person has worked, and the organization document includes the IDs of the résumés that mention that organization. This representation is denormalized, since the relationship is stored in two places, which could become inconsistent with each other. @@ -1164,7 +1179,7 @@ A normalized representation stores the relationship in only one place, and relie In the document model of [Example 3-2](, the database needs to index the `org_id` field of objects inside the `positions` array. Many document databases and relational databases with JSON support are able to create such indexes on values inside a document. -## Stars and Snowflakes: Schemas for Analytics +#### Stars and Snowflakes: Schemas for Analytics Data warehouses (see [“Data Warehousing”]( are usually relational, and there are a few widely-used conventions for the structure of tables in a data warehouse: a *star schema*, *snowflake schema*, *dimensional modeling* [[12](], and *one big table* (OBT). These structures are optimized for the needs of business analysts. ETL processes translate data from operational systems into this schema. 
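Querying a many-to-many relationship in both directions from a single, normalized copy of the relationship might look like the sketch below. It uses Python's standard-library `sqlite3`; the table layout, index names, and sample rows are illustrative assumptions rather than the schema from the figures.

```python
import sqlite3


def build_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE organizations (id INTEGER PRIMARY KEY, name TEXT);
        -- The relationship is stored in exactly one place: the join table.
        CREATE TABLE positions (
            person_id INTEGER REFERENCES people(id),
            org_id INTEGER REFERENCES organizations(id),
            job_title TEXT
        );
        -- Secondary indexes let the database answer both directions efficiently.
        CREATE INDEX positions_by_person ON positions (person_id);
        CREATE INDEX positions_by_org ON positions (org_id);

        INSERT INTO people VALUES (1, 'Ada'), (2, 'Grace');
        INSERT INTO organizations VALUES (10, 'Example Corp'), (11, 'Sample Labs');
        INSERT INTO positions VALUES
            (1, 10, 'Engineer'), (2, 10, 'Director'), (1, 11, 'Researcher');
    """)
    return conn


def orgs_for_person(conn, person_id):
    """Direction 1: all organizations a particular person has worked for."""
    rows = conn.execute(
        "SELECT organizations.name FROM positions "
        "JOIN organizations ON positions.org_id = organizations.id "
        "WHERE positions.person_id = ? ORDER BY organizations.name",
        (person_id,),
    )
    return [name for (name,) in rows]


def people_at_org(conn, org_id):
    """Direction 2: all people who have worked at a particular organization."""
    rows = conn.execute(
        "SELECT people.name FROM positions "
        "JOIN people ON positions.person_id = people.id "
        "WHERE positions.org_id = ? ORDER BY people.name",
        (org_id,),
    )
    return [name for (name,) in rows]
```

Because the relationship lives only in `positions`, the two directions cannot drift out of sync, unlike a design that stores ID references on both sides.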
@@ -1172,7 +1187,7 @@ The example schema in [Figure 3-5]( ![ddia 0309](img/ddia_0309.png) -###### Figure 3-5. Example of a star schema for use in a data warehouse. +> Figure 3-5. Example of a star schema for use in a data warehouse. Usually, facts are captured as individual events, because this allows maximum flexibility of analysis later. However, this means that the fact table can become extremely large. A big enterprise may have many petabytes of transaction history in its data warehouse, mostly represented as fact tables. @@ -1194,7 +1209,7 @@ Some data warehouse schemas take denormalization even further and leave out the In the context of analytics, such denormalization is unproblematic, since the data typically represents a log of historical data that is not going to change (except maybe for occasionally correcting an error). The issues of data consistency and write overheads that occur with denormalization in OLTP systems are not as pressing in analytics. -## When to Use Which Model +#### When to Use Which Model The main arguments in favor of the document data model are schema flexibility, better performance due to locality, and that for some applications it is closer to the object model used by the application. The relational model counters by providing better support for joins, many-to-one, and many-to-many relationships. Let’s examine these arguments in more detail. @@ -1204,7 +1219,7 @@ The document model has limitations: for example, you cannot refer directly to a Some applications allow the user to choose the order of items: for example, imagine a to-do list or issue tracker where the user can drag and drop tasks to reorder them. The document model supports such applications well, because the items (or their IDs) can simply be stored in a JSON array to determine their order. 
In relational databases there isn’t a standard way of representing such reorderable lists, and various tricks are used: sorting by an integer column (requiring renumbering when you insert into the middle), a linked list of IDs, or fractional indexing [[14](, [15](, [16](]. -### Schema flexibility in the document model +#### Schema flexibility in the document model Most document databases, and the JSON support in relational databases, do not enforce any schema on the data in documents. XML support in relational databases usually comes with optional schema validation. No schema means that arbitrary keys and values can be added to a document, and when reading, clients have no guarantees as to what fields the documents may contain. @@ -1240,7 +1255,7 @@ The schema-on-read approach is advantageous if the items in the collection don In situations like these, a schema may hurt more than it helps, and schemaless documents can be a much more natural data model. But in cases where all records are expected to have the same structure, schemas are a useful mechanism for documenting and enforcing that structure. We will discuss schemas and schema evolution in more detail in [Link to Come]. -### Data locality for reads and writes +#### Data locality for reads and writes A document is usually stored as a single continuous string, encoded as JSON, XML, or a binary variant thereof (such as MongoDB’s BSON). If your application often needs to access the entire document (for example, to render it on a web page), there is a performance advantage to this *storage locality*. If data is split across multiple tables, like in [Figure 3-1](, multiple index lookups are required to retrieve it all, which may require more disk seeks and take more time. @@ -1248,7 +1263,7 @@ The locality advantage only applies if you need large parts of the document at t However, the idea of storing related data together for locality is not limited to the document model. 
For example, Google’s Spanner database offers the same locality properties in a relational data model, by allowing the schema to declare that a table’s rows should be interleaved (nested) within a parent table [[25](]. Oracle allows the same, using a feature called *multi-table index cluster tables* [[26](]. The *column-family* concept in the Bigtable data model (used in Cassandra, HBase, and ScyllaDB), also known as a *wide-column* model, has a similar purpose of managing locality [[27](]. -### Query languages for documents +#### Query languages for documents Another difference between a relational and a document database is the language or API that you use to query it. Most relational databases are queried using SQL, but document databases are more varied. Some allow only key-value access by primary key, while others also offer secondary indexes to query for values inside documents, and some provide rich query languages. @@ -1256,7 +1271,7 @@ XML databases are often queried using XQuery and XPath, which are designed to al Let’s look at another example to get a feel for this language—this time an aggregation, which is especially needed for analytics. Imagine you are a marine biologist, and you add an observation record to your database every time you see animals in the ocean. Now you want to generate a report saying how many sharks you have sighted per month. In PostgreSQL you might express that query like this: -``` +```sql SELECT date_trunc('month', observation_timestamp) AS observation_month, sum(num_animals) AS total_animals FROM observations @@ -1285,17 +1300,30 @@ db.observations.aggregate([ The aggregation pipeline language is similar in expressiveness to a subset of SQL, but it uses a JSON-based syntax rather than SQL’s English-sentence-style syntax; the difference is perhaps a matter of taste. 
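To try the sharks-per-month aggregation without a PostgreSQL or MongoDB server, here is a small sketch using Python's standard-library `sqlite3`. SQLite has no `date_trunc`, so `strftime('%Y-%m', ...)` is swapped in to bucket timestamps by month. The `family` column and the sample observations are assumptions invented for this example.

```python
import sqlite3


def sharks_per_month(observations):
    """Sum shark sightings per calendar month.

    `observations` is a list of (timestamp, family, num_animals) tuples,
    with timestamps as 'YYYY-MM-DD HH:MM:SS' strings.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE observations ("
        "observation_timestamp TEXT, family TEXT, num_animals INTEGER)"
    )
    conn.executemany("INSERT INTO observations VALUES (?, ?, ?)", observations)
    rows = conn.execute("""
        SELECT strftime('%Y-%m', observation_timestamp) AS observation_month,
               sum(num_animals) AS total_animals
        FROM observations
        WHERE family = 'Sharks'
        GROUP BY observation_month
        ORDER BY observation_month
    """).fetchall()
    conn.close()
    return rows


# Hypothetical sample data.
SAMPLE = [
    ("1995-12-25 12:34:56", "Sharks", 3),
    ("1995-12-12 16:17:18", "Sharks", 4),
    ("1996-01-03 09:00:00", "Sharks", 1),
    ("1996-01-05 10:00:00", "Dolphins", 7),
]
```

Calling `sharks_per_month(SAMPLE)` groups the two December sightings together and ignores the dolphin row, mirroring the filter, group, and sum stages that the aggregation pipeline expresses in JSON.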
-### Convergence of document and relational databases +#### Convergence of document and relational databases Document databases and relational databases started out as very different approaches to data management, but they have grown more similar over time. Relational databases added support for JSON types and query operators, and the ability to index properties inside documents. Some document databases (such as MongoDB, Couchbase, and RethinkDB) added support for joins, secondary indexes, and declarative query languages. This convergence of the models is good news for application developers, because the relational model and the document model work best when you can combine both in the same database. Many document databases need relational-style references to other documents, and many relational databases have sections where schema flexibility is beneficial. Relational-document hybrids are a powerful combination. -###### Note +> **注意** +> +> Codd’s original description of the relational model [[3](] actually allowed something similar to JSON within a relational schema. He called it *nonsimple domains*. The idea was that a value in a row doesn’t have to just be a primitive datatype like a number or a string, but it could also be a nested relation (table)—so you can have an arbitrarily nested tree structure as a value, much like the JSON or XML support that was added to SQL over 30 years later. -Codd’s original description of the relational model [[3](] actually allowed something similar to JSON within a relational schema. He called it *nonsimple domains*. The idea was that a value in a row doesn’t have to just be a primitive datatype like a number or a string, but it could also be a nested relation (table)—so you can have an arbitrarily nested tree structure as a value, much like the JSON or XML support that was added to SQL over 30 years later. 
-# Graph-Like Data Models + + + + + + + + + + +-------- + +## 类图数据模型 We saw earlier that the type of relationships is an important distinguishing feature between different data models. If your application has mostly one-to-many relationships (tree-structured data) and few other relationships between records, the document model is appropriate. @@ -1332,9 +1360,12 @@ To illustrate these different languages and models, this section uses the graph ![ddia 0205](img/ddia_0205.png) -###### Figure 3-6. Example of graph-structured data (boxes represent vertices, arrows represent edges). +> Figure 3-6. Example of graph-structured data (boxes represent vertices, arrows represent edges). -## Property Graphs + +-------- + +### 属性图 In the *property graph* (also known as *labeled property graph*) model, each vertex consists of: @@ -1383,15 +1414,19 @@ Some important aspects of this model are: The edges table is like the many-to-many associative table/join table we saw in [“Many-to-One and Many-to-Many Relationships”](, generalized to allow many different types of relationship to be stored in the same table. There may also be indexes on the labels and the properties, allowing vertices or edges with certain properties to be found efficiently. -###### Note +> **Note** + +> A limitation of graph models is that an edge can only associate two vertices with each other, whereas a relational join table can represent three-way or even higher-degree relationships by having multiple foreign key references on a single row. Such relationships can be represented in a graph by creating an additional vertex corresponding to each row of the join table, and edges to/from that vertex, or by using a *hypergraph*. -A limitation of graph models is that an edge can only associate two vertices with each other, whereas a relational join table can represent three-way or even higher-degree relationships by having multiple foreign key references on a single row. 
Such relationships can be represented in a graph by creating an additional vertex corresponding to each row of the join table, and edges to/from that vertex, or by using a *hypergraph*. Those features give graphs a great deal of flexibility for data modeling, as illustrated in [Figure 3-6]( The figure shows a few things that would be difficult to express in a traditional relational schema, such as different kinds of regional structures in different countries (France has *départements* and *régions*, whereas the US has *counties* and *states*), quirks of history such as a country within a country (ignoring for now the intricacies of sovereign states and nations), and varying granularity of data (Lucy’s current residence is specified as a city, whereas her place of birth is specified only at the level of a state). You could imagine extending the graph to also include many other facts about Lucy and Alain, or other people. For instance, you could use it to indicate any food allergies they have (by introducing a vertex for each allergen, and an edge between a person and an allergen to indicate an allergy), and link the allergens with a set of vertices that show which foods contain which substances. Then you could write a query to find out what is safe for each person to eat. Graphs are good for evolvability: as you add features to your application, a graph can easily be extended to accommodate changes in your application’s data structures. -## The Cypher Query Language + +-------- + +### Cypher查询语言 *Cypher* is a query language for property graphs, originally created for the Neo4j graph database, and later developed into an open standard as *openCypher* [[37](]. Besides Neo4j, Cypher is supported by Memgraph, KùzuDB [[34](], Amazon Neptune, Apache AGE (with storage in PostgreSQL), and others. It is named after a character in the movie *The Matrix* and is not related to ciphers in cryptography [[38](]. 
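Returning to the property graph model and the allergen extension described above, the idea can be sketched in a few lines of Python. This is a toy in-memory structure, not how any particular graph database stores data; the class, the edge labels `ALLERGIC_TO` and `CONTAINS`, and the sample vertices are all hypothetical.

```python
from collections import defaultdict


class PropertyGraph:
    """Toy property graph: labeled vertices with properties, labeled directed edges."""

    def __init__(self):
        self.vertices = {}                   # vertex_id -> (label, properties)
        self.out_edges = defaultdict(list)   # tail vertex -> [(edge label, head vertex)]
        self.in_edges = defaultdict(list)    # head vertex -> [(edge label, tail vertex)]

    def add_vertex(self, vertex_id, label, **properties):
        self.vertices[vertex_id] = (label, properties)

    def add_edge(self, tail, label, head):
        # Indexing both ends lets us traverse edges forward and backward.
        self.out_edges[tail].append((label, head))
        self.in_edges[head].append((label, tail))

    def out(self, vertex_id, label):
        """Heads of all outgoing edges with the given label."""
        return [head for (lbl, head) in self.out_edges[vertex_id] if lbl == label]


def safe_foods(graph, person_id):
    """Foods that contain none of the allergens the person is allergic to."""
    allergens = set(graph.out(person_id, "ALLERGIC_TO"))
    return sorted(
        vertex_id
        for vertex_id, (label, _props) in graph.vertices.items()
        if label == "Food" and not allergens & set(graph.out(vertex_id, "CONTAINS"))
    )


g = PropertyGraph()
g.add_vertex("lucy", "Person", name="Lucy")
g.add_vertex("peanut", "Allergen", name="Peanut")
g.add_vertex("gluten", "Allergen", name="Gluten")
g.add_vertex("satay", "Food")
g.add_vertex("bread", "Food")
g.add_vertex("salad", "Food")
g.add_edge("lucy", "ALLERGIC_TO", "peanut")
g.add_edge("satay", "CONTAINS", "peanut")
g.add_edge("bread", "CONTAINS", "gluten")
```

Adding the allergen feature required no schema change: new kinds of vertices and edges were simply added alongside the existing ones, which is the evolvability argument made for graph models.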
@@ -1435,7 +1470,9 @@ There are several possible ways of executing the query. The description given he But equivalently, you could start with the two `Location` vertices and work backward. If there is an index on the `name` property, you can efficiently find the two vertices representing the US and Europe. Then you can proceed to find all locations (states, regions, cities, etc.) in the US and Europe respectively by following all incoming `WITHIN` edges. Finally, you can look for people who can be found through an incoming `BORN_IN` or `LIVES_IN` edge at one of the location vertices. -## Graph Queries in SQL +-------- + +### SQL中的图查询 [Example 3-3]( suggested that graph data can be represented in a relational database. But if we put graph data in a relational structure, can we also query it using SQL? @@ -1447,9 +1484,9 @@ In Cypher, `:WITHIN*0..` expresses that fact very concisely: it means “follow Since SQL:1999, this idea of variable-length traversal paths in a query can be expressed using something called *recursive common table expressions* (the `WITH RECURSIVE` syntax). [Example 3-6]( shows the same query—finding the names of people who emigrated from the US to Europe—expressed in SQL using this technique. However, the syntax is very clumsy in comparison to Cypher. -##### Example 3-6. The same query as [Example 3-5](, written in SQL using recursive common table expressions +> Example 3-6. The same query as [Example 3-5](, written in SQL using recursive common table expressions -``` +```postgresql WITH RECURSIVE -- in_usa is the set of vertex IDs of all locations within the United States @@ -1521,7 +1558,10 @@ The fact that a 4-line Cypher query requires 31 lines in SQL shows how much of a However, the situation may be improving: at the time of writing, there are plans to add a graph query language called GQL to the SQL standard [[41](, [42](], which will provide a syntax inspired by Cypher, GSQL [[43](], and PGQL [[44](]. 
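The core of the recursive common table expression technique, following a chain of `within` edges of unknown length, can be tried in isolation. The sketch below uses Python's standard-library `sqlite3` (SQLite supports `WITH RECURSIVE`) with a simplified two-table schema; the column set and the sample locations are assumptions for illustration and are smaller than the schema used in the examples.

```python
import sqlite3


def build_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE vertices (vertex_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE edges (tail_vertex INTEGER, head_vertex INTEGER, label TEXT);
        INSERT INTO vertices VALUES
            (1, 'United States'), (2, 'Idaho'), (3, 'Boise'), (4, 'France');
        INSERT INTO edges VALUES
            (2, 1, 'within'),   -- Idaho is within the United States
            (3, 2, 'within');   -- Boise is within Idaho
    """)
    return conn


# in_usa starts with the vertex named 'United States', then repeatedly adds
# the tail of every 'within' edge whose head is already in the set.
IN_USA_QUERY = """
WITH RECURSIVE in_usa(vertex_id) AS (
    SELECT vertex_id FROM vertices WHERE name = 'United States'
  UNION
    SELECT edges.tail_vertex FROM edges
    JOIN in_usa ON edges.head_vertex = in_usa.vertex_id
    WHERE edges.label = 'within'
)
SELECT vertices.name FROM vertices
JOIN in_usa ON vertices.vertex_id = in_usa.vertex_id
ORDER BY vertices.name
"""


def locations_in_usa(conn):
    """Names of all locations reachable by following 'within' edges to the US."""
    return [name for (name,) in conn.execute(IN_USA_QUERY)]
```

The recursion stops on its own once no new vertex IDs are produced, so the same query works whether a location is nested two levels deep or ten.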
-## Triple-Stores and SPARQL + +-------- + +### 三元组与SPARQL The triple-store model is mostly equivalent to the property graph model, using different words to describe the same ideas. It is nevertheless worth discussing, because there are various tools and languages for triple-stores that can be valuable additions to your toolbox for building applications. @@ -1532,13 +1572,13 @@ The subject of a triple is equivalent to a vertex in a graph. The object is one 1. A value of a primitive datatype, such as a string or a number. In that case, the predicate and object of the triple are equivalent to the key and value of a property on the subject vertex. Using the example from [Figure 3-6](, (*lucy*, *birthYear*, *1989*) is like a vertex `lucy` with properties `{"birthYear": 1989}`. 2. Another vertex in the graph. In that case, the predicate is an edge in the graph, the subject is the tail vertex, and the object is the head vertex. For example, in (*lucy*, *marriedTo*, *alain*) the subject and object *lucy* and *alain* are both vertices, and the predicate *marriedTo* is the label of the edge that connects them. -###### Note - -To be precise, databases that offer a triple-like data model often need to store some additional metadata on each tuple. For example, AWS Neptune uses quads (4-tuples) by adding a graph ID to each triple [[45](]; Datomic uses 5-tuples, extending each triple with a transaction ID and a boolean to indicate deletion [[46](]. Since these databases retain the basic *subject-predicate-object* structure explained above, this book nevertheless calls them triple-stores. +> **注意** +> +> To be precise, databases that offer a triple-like data model often need to store some additional metadata on each tuple. For example, AWS Neptune uses quads (4-tuples) by adding a graph ID to each triple [[45](]; Datomic uses 5-tuples, extending each triple with a transaction ID and a boolean to indicate deletion [[46](]. 
Since these databases retain the basic *subject-predicate-object* structure explained above, this book nevertheless calls them triple-stores.

 [Example 3-7]( shows the same data as in [Example 3-4](, written as triples in a format called *Turtle*, a subset of *Notation3* (*N3*) [[47](].

-##### Example 3-7. A subset of the data in [Figure 3-6](, represented as Turtle triples
+> Example 3-7. A subset of the data in [Figure 3-6](, represented as Turtle triples

 ```
 @prefix : .
@@ -1562,7 +1602,7 @@ In this example, vertices of the graph are written as `_:*someName*`. The name d

 It’s quite repetitive to repeat the same subject over and over again, but fortunately you can use semicolons to say multiple things about the same subject. This makes the Turtle format quite readable: see [Example 3-8](

-##### Example 3-8. A more concise way of writing the data in [Example 3-7](
+> Example 3-8. A more concise way of writing the data in [Example 3-7](

 ```
 @prefix : .
@@ -1572,17 +1612,17 @@ _:usa a :Location; :name "United States"; :type "country"; :within _:nameri
 _:namerica a :Location; :name "North America"; :type "continent".
 ```

-# The Semantic Web
+#### The Semantic Web

 Some of the research and development effort on triple stores was motivated by the *Semantic Web*, an early-2000s effort to facilitate internet-wide data exchange by publishing data not only as human-readable web pages, but also in a standardized, machine-readable format.
Although the Semantic Web as originally envisioned did not succeed [[48](, [49](], the legacy of the Semantic Web project lives on in a couple of specific technologies: *linked data* standards such as JSON-LD [[50](], *ontologies* used in biomedical science [[51](], Facebook’s Open Graph protocol [[52](] (which is used for link unfurling [[53](]), knowledge graphs such as Wikidata, and standardized vocabularies for structured data maintained by [``](

 Triple-stores are another Semantic Web technology that has found use outside of its original use case: even if you have no interest in the Semantic Web, triples can be a good internal data model for applications.

-### The RDF data model
+#### The RDF data model

 The Turtle language we used in [Example 3-8]( is actually a way of encoding data in the *Resource Description Framework* (RDF) [[54](], a data model that was designed for the Semantic Web. RDF data can also be encoded in other ways, for example (more verbosely) in XML, as shown in [Example 3-9]( Tools like Apache Jena can automatically convert between different RDF encodings.

-##### Example 3-9. The data of [Example 3-8](, expressed using RDF/XML syntax
+> Example 3-9. The data of [Example 3-8](, expressed using RDF/XML syntax

 ```
 ` doesn’t necessarily need to resolve to anything—from RDF’s point of view, it is simply a namespace. To avoid potential confusion with `http://` URLs, the examples in this section use non-resolvable URIs such as `urn:example:within`. Fortunately, you can just specify this prefix once at the top of the file, and then forget about it.

-### The SPARQL query language
+#### The SPARQL Query Language

 *SPARQL* is a query language for triple-stores using the RDF data model [[55](]. (It is an acronym for *SPARQL Protocol and RDF Query Language*, pronounced “sparkle.”) It predates Cypher, and since Cypher’s pattern matching is borrowed from SPARQL, they look quite similar.
The same query as before—finding people who have moved from the US to Europe—is similarly concise in SPARQL as it is in Cypher (see [Example 3-10](

-##### Example 3-10. The same query as [Example 3-5](, expressed in SPARQL
+> Example 3-10. The same query as [Example 3-5](, expressed in SPARQL

 ```
 PREFIX :
@@ -1652,7 +1692,10 @@ Because RDF doesn’t distinguish between properties and edges but just uses pre

 SPARQL is supported by Amazon Neptune, AllegroGraph, Blazegraph, OpenLink Virtuoso, Apache Jena, and various other triple stores [[35](].

-## Datalog: Recursive Relational Queries
+
+--------
+
+### Datalog: Recursive Relational Queries

 Datalog is a much older language than SPARQL or Cypher: it arose from academic research in the 1980s [[56](, [57](, [58](]. It is less well known among software engineers and not widely supported in mainstream databases, but it ought to be better-known since it is a very expressive language that is particularly powerful for complex queries. Several niche databases, including Datomic, LogicBlox, CozoDB, and LinkedIn’s LIquid [[59](] use Datalog as their query language.

@@ -1662,7 +1705,7 @@ The contents of a Datalog database consists of *facts*, and each fact correspond

 [Example 3-11]( shows how to write the data from the left-hand side of [Figure 3-6]( in Datalog. The edges of the graph (`within`, `born_in`, and `lives_in`) are represented as two-column join tables. For example, Lucy has the ID 100 and Idaho has the ID 3, so the relationship “Lucy was born in Idaho” is represented as `born_in(100, 3)`.

-##### Example 3-11. A subset of the data in [Figure 3-6](, represented as Datalog facts
+> Example 3-11. A subset of the data in [Figure 3-6](, represented as Datalog facts

 ```
 location(1, "North America", "continent").
@@ -1678,7 +1721,7 @@ born_in(100, 3).
/* Lucy was born in Idaho */

 Now that we have defined the data, we can write the same query as before, as shown in [Example 3-12]( It looks a bit different from the equivalent in Cypher or SPARQL, but don’t let that put you off. Datalog is a subset of Prolog, a programming language that you might have seen before if you’ve studied computer science.

-##### Example 3-12. The same query as [Example 3-5](, expressed in Datalog
+> Example 3-12. The same query as [Example 3-5](, expressed in Datalog

 ```
 within_recursive(LocID, PlaceName) :- location(LocID, PlaceName, _). /* Rule 1 */
@@ -1712,13 +1755,15 @@ By repeated application of rules 1 and 2, the `within_recursive` virtual table c

 ![ddia 0206](img/ddia_0206.png)

-###### Figure 3-7. Determining that Idaho is in North America, using the Datalog rules from [Example 3-12](
+> Figure 3-7. Determining that Idaho is in North America, using the Datalog rules from [Example 3-12](

 Now rule 3 can find people who were born in some location `BornIn` and live in some location `LivingIn`. Rule 4 invokes rule 3 with `BornIn = 'United States'` and `LivingIn = 'Europe'`, and returns only the names of the people who match the search. By querying the contents of the virtual `us_to_europe` table, the Datalog system finally gets the same answer as in the earlier Cypher and SPARQL queries.

 The Datalog approach requires a different kind of thinking compared to the other query languages discussed in this chapter. It allows complex queries to be built up rule by rule, with one rule referring to other rules, similarly to the way that you break down code into functions that call each other. Just like functions can be recursive, Datalog rules can also invoke themselves, like rule 2 in [Example 3-12](, which enables graph traversals in Datalog queries.

-## GraphQL
+--------
+
+### GraphQL

 GraphQL is a query language that, by design, is much more restrictive than the other query languages we have seen in this chapter.
The purpose of GraphQL is to allow client software running on a user’s device (such as a mobile app or a JavaScript web app frontend) to request a JSON document with a particular structure, containing the fields necessary for rendering its user interface. GraphQL interfaces allow developers to rapidly change queries in client code without changing server-side APIs.

@@ -1726,7 +1771,7 @@ GraphQL’s flexibility comes at a cost. Organizations that adopt GraphQL often

 Nevertheless, GraphQL is useful. [Example 3-13]( shows how you might implement a group chat application such as Discord or Slack using GraphQL. The query requests all the channels that the user has access to, including the channel name and the 50 most recent messages in each channel. For each message it requests the timestamp, the message content, and the name and profile picture URL for the sender of the message. Moreover, if a message is a reply to another message, the query also requests the sender name and the content of the message it is replying to (which might be rendered in a smaller font above the reply, in order to provide some context).

-##### Example 3-13. Example GraphQL query for a group chat application
+> Example 3-13. Example GraphQL query for a group chat application

 ```
 query ChatApp {
@@ -1752,7 +1797,7 @@ query ChatApp {

 [Example 3-14]( shows what a response to the query in [Example 3-13]( might look like. The response is a JSON document that mirrors the structure of the query: it contains exactly those attributes that were requested, no more and no less. This approach has the advantage that the server does not need to know which attributes the client requires in order to render the user interface; instead, the client can simply request what it needs.
For example, this query does not request a profile picture URL for the sender of the `replyTo` message, but if the user interface were changed to add that profile picture, it would be easy for the client to add the required `imageUrl` attribute to the query without changing the server.

-##### Example 3-14. A possible response to the query in [Example 3-13](
+> Example 3-14. A possible response to the query in [Example 3-13](

 ```
 {
@@ -1787,7 +1832,18 @@ The server’s database can store the data in a more normalized form, and perfor

 Even though the response to a GraphQL query looks similar to a response from a document database, and even though it has “graph” in the name, GraphQL can be implemented on top of any type of database—relational, document, or graph.

-# Event Sourcing and CQRS
+
+
+
+
+
+
+
+
+
+--------
+
+## Event Sourcing and CQRS

 In all the data models we have discussed so far, the data is queried in the same form as it is written—be it JSON documents, rows in tables, or vertices and edges in a graph. However, in complex applications it can sometimes be difficult to find a single data representation that is able to satisfy all the different ways that the data needs to be queried and presented. In such situations, it can be beneficial to write data in one form, and then to derive from it several representations that are optimized for different types of reads.

@@ -1799,7 +1855,7 @@ Perhaps the simplest, fastest, and most expressive way of writing data is an *ev

 ![ddia 0208](img/ddia_0208.png)

-###### Figure 3-8. Using a log of immutable events as source of truth, and deriving materialized views from it.
+> Figure 3-8. Using a log of immutable events as source of truth, and deriving materialized views from it.

 In [Figure 3-8](, every change to the state of the conference (such as the organizer opening registrations, or attendees making and cancelling registrations) is first stored as an event.
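Whenever an event is appended to the log, several *materialized views* (also known as *projections* or *read models*) are also updated to reflect the effect of that event. In the conference example, there might be one materialized view that collects all information related to the status of each booking, another that computes charts for the conference organizer’s dashboard, and a third that generates files for the printer that produces the attendees’ badges.

@@ -1830,7 +1886,24 @@ You can implement event sourcing on top of any database, but there are also some

 The only important requirement is that the event storage system must guarantee that all materialized views process the events in exactly the same order as they appear in the log; as we shall see in [Link to Come], this is not always easy to achieve in a distributed system.

-# Dataframes, Matrices, and Arrays
+
+
+
+--------
+
+## Dataframes, Matrices, and Arrays
+
+The data models we have seen so far in this chapter are generally used for both transaction processing and analytics purposes (see [“Transaction Processing versus Analytics”](. There are also some data models that you are likely to encounter in an analytical or scientific context, but that rarely feature in OLTP systems: dataframes and multidimensional arrays of numbers such as matrices.
+
+Dataframes are a data model supported by the R language, the Pandas library for Python, Apache Spark, ArcticDB, Dask, and other systems. They are a popular tool for data scientists preparing data for training machine learning models, but they are also widely used for data exploration, statistical data analysis, data visualization, and similar purposes.
+
+At first glance, a dataframe is similar to a table in a relational database or a spreadsheet. It supports relational-like operators that perform bulk operations on the contents of the dataframe: for example, applying a function to all of the rows, filtering the rows based on some condition, grouping rows by some columns and aggregating other columns, and merging the rows in one dataframe with another dataframe based on some key (what a relational database calls a *join* is typically called a *merge* on dataframes).
+
+Instead of a declarative query such as SQL, a dataframe is typically manipulated through a series of commands that modify its structure and content. This matches the typical workflow of data scientists, who incrementally “wrangle” the data into a form that allows them to find answers to the questions they are asking. These manipulations usually take place on the data scientist’s private copy of the dataset, often on their local machine, although the end result may be shared with other users.
+
+Dataframe APIs also offer a wide variety of operations that go far beyond what relational databases offer, and the data model is often used in ways that are very different from typical relational data modelling [[64](]. For example, a common use of dataframes is to transform data from a relational-like representation into a matrix or multidimensional array representation, which is the form that many machine learning algorithms expect of their input.
+
+A simple example of such a transformation is shown in [Figure 3-9](. On the left is a relational table of how different users have rated various movies (on a scale of 1 to 5), and on the right the data has been transformed into a matrix where each column is a movie and each row is a user (similar to a *pivot table* in a spreadsheet). The matrix is *sparse*, which means there is no data for many user-movie combinations, but this is fine. This matrix may have many thousands of columns and would therefore not fit well in a relational database, but dataframes and libraries that offer sparse arrays (such as NumPy for Python) can handle such data easily.

 The data models we have seen so far in this chapter are generally used for both transaction processing and analytics purposes (see [“Transaction Processing versus Analytics”]( There are also some data models that you are likely to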
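encounter in an analytical or scientific context, but that rarely feature in OLTP systems: dataframes and multidimensional arrays of numbers such as matrices.

@@ -1846,7 +1919,17 @@ A simple example of such a transformation is shown in [Figure 3-9](https://learn

 ![ddia 0207](img/ddia_0207.png)

-###### Figure 3-9. Transforming a relational database of movie ratings into a matrix representation.
+> Figure 3-9. Transforming a relational database of movie ratings into a matrix representation.
+
+
+A matrix can only contain numbers, and various techniques are used to transform non-numerical data into numbers in the matrix. For example:
+
+- Dates (which in [Figure 3-9]( are omitted from the example matrix) can be scaled to floating-point numbers within some suitable range.
+- For columns that can only take one of a small, fixed set of values (for example, the genre of a movie in a database of movies), a *one-hot encoding* is often used: we create a column for each possible value (one for “comedy”, one for “drama”, one for “horror”, etc.), and for each row representing a movie, we place a 1 in the column corresponding to the genre of that movie and a 0 in all the other columns. This representation also generalizes easily to movies that fit within several genres.
+
+Once the data is in the form of a matrix of numbers, it is amenable to linear algebra operations, which form the basis of many machine learning algorithms. For example, the data in [Figure 3-9]( could be part of a system for recommending movies that the user may like. Dataframes are flexible enough to allow data to be gradually evolved from a relational form into a matrix representation, while giving the data scientist control over the representation that is most suitable for achieving the goals of the data analysis or model training process.
+
+There are also databases such as TileDB [[65](] that specialize in storing large multidimensional arrays of numbers; they are called *array databases* and are most commonly used for scientific datasets such as geospatial measurements (raster data on a regularly spaced grid), medical imaging, or observations from astronomical telescopes [[66](]. Dataframes are also used in the financial industry for representing *time series data*, such as the prices of assets and trades over time [[67](].

 A matrix can only contain numbers, and various techniques are used to transform non-numerical data into numbers in the matrix. For example:

@@ -1857,7 +1940,11 @@ Once the data is in the form of a matrix of numbers, it is amenable to linear al

 There are also databases such as TileDB [[65](] that specialize in storing large multidimensional arrays of numbers; they are called *array databases* and are most commonly used for scientific datasets such as geospatial measurements (raster data on a regularly spaced grid), medical imaging, or observations from astronomical telescopes [[66](]. Dataframes are also used in the financial industry for representing *time series data*, such as the prices of assets and trades over time [[67](].

-# Summary
+
+
+--------
+
+## Summary

 Data models are a huge subject, and in this chapter we have taken a quick look at a broad variety of different models.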
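We didn’t have space to go into all the details of each model, but hopefully the overview has been enough to whet your appetite to find out more about the model that best fits your application’s requirements.

@@ -1883,9 +1970,10 @@ Although we have covered a lot of ground, there are still data models left unmen

 We have to leave it there for now. In the next chapter we will discuss some of the trade-offs that come into play when *implementing* the data models described in this chapter.

-##### Footnotes

-##### References
+--------
+
+## References

 [[1](] Jamie Brandon. [Unexplanations: query optimization works because sql is declarative]( **, February 2024. Archived at [](

diff --git a/ b/
index 5bed5c8..7e6290a 100644
--- a/
+++ b/
@@ -2,25 +2,25 @@

 The first four chapters of this book cover the fundamental ideas that underlie data systems, and they apply both to systems running on a single machine and to systems distributed across multiple machines.

-1. [Chapter 1]( introduces the terminology and approach used in this book. **Reliability, scalability, and maintainability**: what do these words actually mean, and how can we achieve these goals?
-2. [Chapter 2]( compares several different **data models and query languages**. From a developer's point of view, this is the most visible difference between databases. Different data models suit different kinds of application.
-3. [Chapter 3]( goes inside **storage engines** and looks at how databases lay out data on disk. Different storage engines are optimized for different workloads, and choosing the right one can have a huge effect on performance.
-4. [Chapter 4](ch4) compares several **data encoding** formats. In particular, it examines how these formats fare in an environment where application requirements change frequently and schemas need to evolve over time.
+1. [Chapter 1]( introduces the **trade-offs in data systems architecture**: it examines some important choices that shape a data system's architecture and introduces terminology needed for the rest of the book.
+2. [Chapter 2]( introduces the terminology and approach used in this book. **Reliability, scalability, and maintainability**: what do these words actually mean, and how can we achieve these goals?
+3. [Chapter 3]( compares several different **data models and query languages**. From a developer's point of view, this is the most visible difference between databases. Different data models suit different kinds of application.
+4. [Chapter 4]( goes inside **storage engines** and looks at how databases lay out data on disk. Different storage engines are optimized for different workloads, and choosing the right one can have a huge effect on performance.
+5. [Chapter 5]( compares several **data encoding** formats. In particular, it examines how these formats fare in an environment where application requirements change frequently and schemas need to evolve over time.

-Part II focuses on the issues that are specific to **distributed data systems**.

 ## Contents
-
 1. [Reliability, Scalability, and Maintainability](
-2. [Data Models and Query Languages](
-3. [Storage and Retrieval](
-4. [Encoding and Evolution](
+2. [Defining Nonfunctional Requirements](
+3. [Data Models and Query Languages](
+4. [Storage and Retrieval](
+5.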
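[Encoding and Evolution](

 ------

-| Previous | Contents | Next |
-| ------------------ | ------------------------------- | -------------------------------------------- |
-| [Preface]( | [Designing Data-Intensive Applications]( | [Chapter 1: Reliability, Scalability, and Maintainability]( |
+| Previous | Contents | Next |
+|------------------|------------------------|----------------------------|
+| [Preface]( | [Designing Data-Intensive Applications]( | [Chapter 1: Trade-offs in Data Systems Architecture]( |
diff --git a/ b/
index 613e0cc..6474dc7 100644
--- a/
+++ b/
@@ -53,11 +53,11 @@

 * Replication

- Keeping a copy of the same data on several different nodes, potentially in different locations. Replication provides redundancy: if some nodes are unavailable, the data can still be served from the remaining nodes. Replication can also help improve performance. [Chapter 5]( discusses replication.
+ Keeping a copy of the same data on several different nodes, potentially in different locations. Replication provides redundancy: if some nodes are unavailable, the data can still be served from the remaining nodes. Replication can also help improve performance. [Chapter 6]( discusses replication.

 * Partitioning

- Splitting a big database into smaller subsets called **partitions**, so that different partitions can be assigned to different **nodes** (also known as **sharding**). [Chapter 6]( discusses partitioning.
+ Splitting a big database into smaller subsets called **partitions**, so that different partitions can be assigned to different **nodes** (also known as **sharding**). [Chapter 7]( discusses partitioning.

 Replication and partitioning are different mechanisms, but they often go hand in hand, as shown in [Figure II-1](img/figii-1.png).

@@ -65,18 +65,18 @@

 **Figure II-1. A database split into two partitions, with two replicas per partition**

-With an understanding of these concepts, we can discuss the difficult trade-offs that you need to make in a distributed system. [Chapter 7]( discusses **transactions**, which help you understand the many things that can go wrong in a data system and what you can do about them. [Chapter 8]( and [Chapter 9]( discuss the fundamental limitations of distributed systems.
+With an understanding of these concepts, we can discuss the difficult trade-offs that you need to make in a distributed system. [Chapter 8]( discusses **transactions**, which help you understand the many things that can go wrong in a data system and what you can do about them. [Chapter 9]( and [Chapter 10]( discuss the fundamental limitations of distributed systems.

 In [Part III]( of this book, we will discuss how to integrate several (potentially distributed) datastores into a larger system that satisfies the needs of a complex application. But first, let's talk about distributed data.


 ## Index

-5. [Replication](
-6. [Partitioning](
-7. [Transactions](
-8. [The Trouble with Distributed Systems](
-9. [Consistency and Consensus](
+6. [Replication](
+7. [Partitioning](
+8. [Transactions](
+9. [The Trouble with Distributed Systems](
+10.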
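[Consistency and Consensus](


 ## References
@@ -88,6 +88,6 @@

 ------

-| Previous | Contents | Next |
-| ---------------------------- | ------------------------------- | ---------------------- |
-| [Chapter 4: Encoding and Evolution]( | [Designing Data-Intensive Applications]( | [Chapter 5: Replication]( |
\ No newline at end of file
+| Previous | Contents | Next |
+|---------------------|------------------------|------------------|
+| [Chapter 5: Encoding and Evolution]( | [Designing Data-Intensive Applications]( | [Chapter 6: Replication]( |
\ No newline at end of file
diff --git a/ b/
index 3842ff1..483802b 100644
--- a/
+++ b/
@@ -28,17 +28,17 @@

 ## Overview of Chapters

-We will start in [Chapter 10]( by examining **batch-oriented** dataflow systems such as MapReduce, and we will see that they give us good tools and principles for building large-scale data systems. [Chapter 11]( applies those ideas to **data streams**, which lets us do the same kinds of things with lower latency. [Chapter 12]( concludes the book by exploring how to use these tools to build reliable, scalable, and maintainable applications.
+We will start in [Chapter 11]( by examining **batch-oriented** dataflow systems such as MapReduce, and we will see that they give us good tools and principles for building large-scale data systems. [Chapter 11]( applies those ideas to **data streams**, which lets us do the same kinds of things with lower latency. [Chapter 12]( concludes the book by exploring how to use these tools to build reliable, scalable, and maintainable applications.


 ## Index

-10. [Batch Processing](
-11. [Stream Processing](
-12. [The Future of Data Systems](
+10. [Batch Processing](
+11. [Stream Processing](
+12. [Doing the Right Thing](

 ------

-| Previous | Contents | Next |
-| ------------------------------ | ------------------------------- | ------------------------- |
-| [Chapter 9: Consistency and Consensus]( | [Designing Data-Intensive Applications]( | [Chapter 10: Batch Processing]( |
\ No newline at end of file
+| Previous | Contents | Next |
+|-----------------------|------------------------|---------------------|
+| [Chapter 10: Consistency and Consensus]( | [Designing Data-Intensive Applications]( | [Chapter 11: Batch Processing]( |
\ No newline at end of file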