add basic structure for 2nd edition

This commit is contained in:
Vonng 2024-08-29 07:41:22 +08:00
parent 25541f68f8
commit c3b8169085
49 changed files with 10810 additions and 7271 deletions

1
.gitignore vendored
View File

@ -2,3 +2,4 @@
.code/ .code/
__pycache__/ __pycache__/
.DS_Store .DS_Store
tmp/

159
README.md
View File

@ -1,6 +1,6 @@
# 设计数据密集型应用(第二版) - 中文翻译 # 设计数据密集型应用(第二版) - 中文翻译
> [v2](https://github.com/Vonng/ddia/tree/v2) 分支正在进行 DDIA 第二版的[中文翻译](https://github.com/Vonng/ddia/issues/345),欢迎参与校对与翻译。 > DDIA 第二版的[中文翻译](https://github.com/Vonng/ddia/issues/345) 正在 [v2](https://github.com/Vonng/ddia/tree/v2) 分支上进行,欢迎参与校对与翻译。
- 作者: [Martin Kleppmann](https://martin.kleppmann.com) - 作者: [Martin Kleppmann](https://martin.kleppmann.com)
- 原名:[《Designing Data-Intensive Applications》](http://shop.oreilly.com/product/0636920032175.do) - 原名:[《Designing Data-Intensive Applications》](http://shop.oreilly.com/product/0636920032175.do)
@ -12,6 +12,8 @@
> >
> 本地:你可在项目根目录中执行 `make`,并通过浏览器阅读([在线预览](http://ddia.vonng.com/#/))。 > 本地:你可在项目根目录中执行 `make`,并通过浏览器阅读([在线预览](http://ddia.vonng.com/#/))。
--------
## 译序 ## 译序
> 不懂数据库的全栈工程师不是好架构师 > 不懂数据库的全栈工程师不是好架构师
@ -46,78 +48,79 @@
### [第一部分:数据系统基础](part-i.md) ### [第一部分:数据系统基础](part-i.md)
* [第一章:可靠性、可伸缩性和可维护性](ch1.md) * [第一章:数据系统架构中的利弊权衡](ch1.md)
* [关于数据系统的思考](ch1.md#关于数据系统的思考) * [第二章:定义非功能性要求](ch2.md)
* [可靠性](ch1.md#可靠性) * [关于数据系统的思考](ch2.md#关于数据系统的思考)
* [可伸缩性](ch1.md#可伸缩性) * [可靠性](ch2.md#可靠性)
* [可维护性](ch1.md#可维护性) * [可伸缩性](ch2.md#可伸缩性)
* [本章小结](ch1.md#本章小结) * [可维护性](ch2.md#可维护性)
* [第二章:数据模型与查询语言](ch2.md)
* [关系模型与文档模型](ch2.md#关系模型与文档模型)
* [数据查询语言](ch2.md#数据查询语言)
* [图数据模型](ch2.md#图数据模型)
* [本章小结](ch2.md#本章小结) * [本章小结](ch2.md#本章小结)
* [第三章:存储与检索](ch3.md) * [第三章:数据模型与查询语言](ch3.md)
* [驱动数据库的数据结构](ch3.md#驱动数据库的数据结构) * [关系模型与文档模型](ch3.md#关系模型与文档模型)
* [事务处理还是分析?](ch3.md#事务处理还是分析?) * [数据查询语言](ch3.md#数据查询语言)
* [列式存储](ch3.md#列式存储) * [图数据模型](ch3.md#图数据模型)
* [本章小结](ch3.md#本章小结) * [本章小结](ch3.md#本章小结)
* [第四章:编码与演化](ch4.md) * [第四章:存储与检索](ch4.md)
* [编码数据的格式](ch4.md#编码数据的格式) * [驱动数据库的数据结构](ch4.md#驱动数据库的数据结构)
* [数据流的类型](ch4.md#数据流的类型) * [事务处理还是分析?](ch4.md#事务处理还是分析?)
* [列式存储](ch4.md#列式存储)
* [本章小结](ch4.md#本章小结) * [本章小结](ch4.md#本章小结)
* [第五章:编码与演化](ch5.md)
* [编码数据的格式](ch5.md#编码数据的格式)
* [数据流的类型](ch5.md#数据流的类型)
* [本章小结](ch5.md#本章小结)
### [第二部分:分布式数据](part-ii.md) ### [第二部分:分布式数据](part-ii.md)
* [第五章:复制](ch5.md) * [第五章:复制](ch6.md)
* [领导者与追随者](ch5.md#领导者与追随者) * [领导者与追随者](ch6.md#领导者与追随者)
* [复制延迟问题](ch5.md#复制延迟问题) * [复制延迟问题](ch6.md#复制延迟问题)
* [多主复制](ch5.md#多主复制) * [多主复制](ch6.md#多主复制)
* [无主复制](ch5.md#无主复制) * [无主复制](ch6.md#无主复制)
* [本章小结](ch5.md#本章小结)
* [第六章:分区](ch6.md)
* [分区与复制](ch6.md#分区与复制)
* [键值数据的分区](ch6.md#键值数据的分区)
* [分区与次级索引](ch6.md#分区与次级索引)
* [分区再平衡](ch6.md#分区再平衡)
* [请求路由](ch6.md#请求路由)
* [本章小结](ch6.md#本章小结) * [本章小结](ch6.md#本章小结)
* [第七章:事务](ch7.md) * [第六章:分区](ch7.md)
* [事务的棘手概念](ch7.md#事务的棘手概念) * [分区与复制](ch7.md#分区与复制)
* [弱隔离级别](ch7.md#弱隔离级别) * [键值数据的分区](ch7.md#键值数据的分区)
* [可串行化](ch7.md#可串行化) * [分区与次级索引](ch7.md#分区与次级索引)
* [分区再平衡](ch7.md#分区再平衡)
* [请求路由](ch7.md#请求路由)
* [本章小结](ch7.md#本章小结) * [本章小结](ch7.md#本章小结)
* [第八章:分布式系统的麻烦](ch8.md) * [第七章:事务](ch8.md)
* [故障与部分失效](ch8.md#故障与部分失效) * [事务的棘手概念](ch8.md#事务的棘手概念)
* [不可靠的网络](ch8.md#不可靠的网络) * [弱隔离级别](ch8.md#弱隔离级别)
* [不可靠的时钟](ch8.md#不可靠的时钟) * [可串行化](ch8.md#可串行化)
* [知识、真相与谎言](ch8.md#知识、真相与谎言)
* [本章小结](ch8.md#本章小结) * [本章小结](ch8.md#本章小结)
* [九章:一致性与共识](ch9.md) * [八章:分布式系统的麻烦](ch9.md)
* [一致性保证](ch9.md#一致性保证) * [故障与部分失效](ch9.md#故障与部分失效)
* [线性一致性](ch9.md#线性一致性) * [不可靠的网络](ch9.md#不可靠的网络)
* [顺序保证](ch9.md#顺序保证) * [不可靠的时钟](ch9.md#不可靠的时钟)
* [分布式事务与共识](ch9.md#分布式事务与共识) * [知识、真相与谎言](ch9.md#知识、真相与谎言)
* [本章小结](ch9.md#本章小结) * [本章小结](ch9.md#本章小结)
* [第九章:一致性与共识](ch10.md)
* [一致性保证](ch10.md#一致性保证)
* [线性一致性](ch10.md#线性一致性)
* [顺序保证](ch10.md#顺序保证)
* [分布式事务与共识](ch10.md#分布式事务与共识)
* [本章小结](ch10.md#本章小结)
### [第三部分:衍生数据](part-iii.md) ### [第三部分:衍生数据](part-iii.md)
* [第十章:批处理](ch10.md) * [第十一章:批处理](ch11.md)
* [使用Unix工具的批处理](ch10.md#使用Unix工具的批处理) * [使用Unix工具的批处理](ch11.md#使用Unix工具的批处理)
* [MapReduce和分布式文件系统](ch10.md#MapReduce和分布式文件系统) * [MapReduce和分布式文件系统](ch11.md#MapReduce和分布式文件系统)
* [MapReduce之后](ch10.md#MapReduce之后) * [MapReduce之后](ch11.md#MapReduce之后)
* [本章小结](ch10.md#本章小结)
* [第十一章:流处理](ch11.md)
* [传递事件流](ch11.md#传递事件流)
* [数据库与流](ch11.md#数据库与流)
* [流处理](ch11.md#流处理)
* [本章小结](ch11.md#本章小结) * [本章小结](ch11.md#本章小结)
* [第十二章:数据系统的未来](ch12.md) * [第十二章:流处理](ch12.md)
* [数据集成](ch12.md#数据集成) * [传递事件流](ch12.md#传递事件流)
* [分拆数据库](ch12.md#分拆数据库) * [数据库与流](ch12.md#数据库与流)
* [将事情做正确](ch12.md#将事情做正确) * [流处理](ch12.md#流处理)
* [做正确的事情](ch12.md#做正确的事情)
* [本章小结](ch12.md#本章小结) * [本章小结](ch12.md#本章小结)
* [第十三章:数据系统的未来](ch13.md)
* [数据集成](ch13.md#数据集成)
* [分拆数据库](ch13.md#分拆数据库)
* [将事情做正确](ch13.md#将事情做正确)
* [做正确的事情](ch13.md#做正确的事情)
* [本章小结](ch13.md#本章小结)
### [术语表](glossary.md) ### [术语表](glossary.md)
@ -140,7 +143,7 @@
1. [序言初翻修正](https://github.com/Vonng/ddia/commit/afb5edab55c62ed23474149f229677e3b42dfc2c) by [@seagullbird](https://github.com/Vonng/ddia/commits?author=seagullbird) 1. [序言初翻修正](https://github.com/Vonng/ddia/commit/afb5edab55c62ed23474149f229677e3b42dfc2c) by [@seagullbird](https://github.com/Vonng/ddia/commits?author=seagullbird)
2. [第一章语法标点校正](https://github.com/Vonng/ddia/commit/973b12cd8f8fcdf4852f1eb1649ddd9d187e3644) by [@nevertiree](https://github.com/Vonng/ddia/commits?author=nevertiree) 2. [第一章语法标点校正](https://github.com/Vonng/ddia/commit/973b12cd8f8fcdf4852f1eb1649ddd9d187e3644) by [@nevertiree](https://github.com/Vonng/ddia/commits?author=nevertiree)
3. [第六章部分校正](https://github.com/Vonng/ddia/commit/d4eb0852c0ec1e93c8aacc496c80b915bb1e6d48) 与[第十章的初翻](https://github.com/Vonng/ddia/commit/9de8dbd1bfe6fbb03b3bf6c1a1aa2291aed2490e) by [@MuAlex](https://github.com/Vonng/ddia/commits?author=MuAlex) 3. [第六章部分校正](https://github.com/Vonng/ddia/commit/d4eb0852c0ec1e93c8aacc496c80b915bb1e6d48) 与[第十章的初翻](https://github.com/Vonng/ddia/commit/9de8dbd1bfe6fbb03b3bf6c1a1aa2291aed2490e) by [@MuAlex](https://github.com/Vonng/ddia/commits?author=MuAlex)
4. [第一部分](part-i.md)前言,[ch2](ch2.md)校正 by [@jiajiadebug](https://github.com/Vonng/ddia/commits?author=jiajiadebug) 4. [第一部分](part-i.md)前言,[ch2](ch3.md)校正 by [@jiajiadebug](https://github.com/Vonng/ddia/commits?author=jiajiadebug)
5. [词汇表](glossary.md)、[后记](colophon.md)关于野猪的部分 by [@Chowss](https://github.com/Vonng/ddia/commits?author=Chowss) 5. [词汇表](glossary.md)、[后记](colophon.md)关于野猪的部分 by [@Chowss](https://github.com/Vonng/ddia/commits?author=Chowss)
6. [繁體中文](https://github.com/Vonng/ddia/pulls)版本与转换脚本 by [@afunTW](https://github.com/afunTW) 6. [繁體中文](https://github.com/Vonng/ddia/pulls)版本与转换脚本 by [@afunTW](https://github.com/afunTW)
7. 多处翻译修正 by [@songzhibin97](https://github.com/Vonng/ddia/commits?author=songzhibin97) [@MamaShip](https://github.com/Vonng/ddia/commits?author=MamaShip) [@FangYuan33](https://github.com/Vonng/ddia/commits?author=FangYuan33) 7. 多处翻译修正 by [@songzhibin97](https://github.com/Vonng/ddia/commits?author=songzhibin97) [@MamaShip](https://github.com/Vonng/ddia/commits?author=MamaShip) [@FangYuan33](https://github.com/Vonng/ddia/commits?author=FangYuan33)
@ -252,7 +255,7 @@
| [115](https://github.com/Vonng/ddia/pull/115) | [@NageNalock](https://github.com/NageNalock) | 第七章病句修改: 重复词语 | | [115](https://github.com/Vonng/ddia/pull/115) | [@NageNalock](https://github.com/NageNalock) | 第七章病句修改: 重复词语 |
| [114](https://github.com/Vonng/ddia/pull/114) | [@Sunt-ing](https://github.com/Sunt-ing) | Update README.md: correct the book name | | [114](https://github.com/Vonng/ddia/pull/114) | [@Sunt-ing](https://github.com/Sunt-ing) | Update README.md: correct the book name |
| [113](https://github.com/Vonng/ddia/pull/113) | [@lpxxn](https://github.com/lpxxn) | 修改语句 | | [113](https://github.com/Vonng/ddia/pull/113) | [@lpxxn](https://github.com/lpxxn) | 修改语句 |
| [112](https://github.com/Vonng/ddia/pull/112) | [@ibyte2011](https://github.com/ibyte2011) | Update ch9.md | | [112](https://github.com/Vonng/ddia/pull/112) | [@ibyte2011](https://github.com/ibyte2011) | Update ch10.md |
| [110](https://github.com/Vonng/ddia/pull/110) | [@lpxxn](https://github.com/lpxxn) | 读已写入数据 | | [110](https://github.com/Vonng/ddia/pull/110) | [@lpxxn](https://github.com/lpxxn) | 读已写入数据 |
| [107](https://github.com/Vonng/ddia/pull/107) | [@abbychau](https://github.com/abbychau) | 單調鐘和好死还是赖活着 | | [107](https://github.com/Vonng/ddia/pull/107) | [@abbychau](https://github.com/abbychau) | 單調鐘和好死还是赖活着 |
| [106](https://github.com/Vonng/ddia/pull/106) | [@enochii](https://github.com/enochii) | typo in ch2: fix braces typo | | [106](https://github.com/Vonng/ddia/pull/106) | [@enochii](https://github.com/enochii) | typo in ch2: fix braces typo |
@ -263,7 +266,7 @@
| [101](https://github.com/Vonng/ddia/pull/101) | [@Sunt-ing](https://github.com/Sunt-ing) | typo in Ch4: should be "改变" rathr than "盖面" | | [101](https://github.com/Vonng/ddia/pull/101) | [@Sunt-ing](https://github.com/Sunt-ing) | typo in Ch4: should be "改变" rathr than "盖面" |
| [100](https://github.com/Vonng/ddia/pull/100) | [@LiminCode](https://github.com/LiminCode) | fix missing translation | | [100](https://github.com/Vonng/ddia/pull/100) | [@LiminCode](https://github.com/LiminCode) | fix missing translation |
| [99 ](https://github.com/Vonng/ddia/pull/99) | [@mrdrivingduck](https://github.com/mrdrivingduck) | ch6: fix the word rebalancing | | [99 ](https://github.com/Vonng/ddia/pull/99) | [@mrdrivingduck](https://github.com/mrdrivingduck) | ch6: fix the word rebalancing |
| [98 ](https://github.com/Vonng/ddia/pull/98) | [@jacklightChen](https://github.com/jacklightChen) | fix ch7.md: fix wrong references | | [98 ](https://github.com/Vonng/ddia/pull/98) | [@jacklightChen](https://github.com/jacklightChen) | fix ch8.md: fix wrong references |
| [97 ](https://github.com/Vonng/ddia/pull/97) | [@jenac](https://github.com/jenac) | 96 | | [97 ](https://github.com/Vonng/ddia/pull/97) | [@jenac](https://github.com/jenac) | 96 |
| [96 ](https://github.com/Vonng/ddia/pull/96) | [@PragmaTwice](https://github.com/PragmaTwice) | ch2: fix typo about 'may or may not be' | | [96 ](https://github.com/Vonng/ddia/pull/96) | [@PragmaTwice](https://github.com/PragmaTwice) | ch2: fix typo about 'may or may not be' |
| [95 ](https://github.com/Vonng/ddia/pull/95) | [@EvanMu96](https://github.com/EvanMu96) | fix translation of "the battle cry" in ch5 | | [95 ](https://github.com/Vonng/ddia/pull/95) | [@EvanMu96](https://github.com/EvanMu96) | fix translation of "the battle cry" in ch5 |
@ -271,32 +274,32 @@
| [93 ](https://github.com/Vonng/ddia/pull/93) | [@kemingy](https://github.com/kemingy) | ch5: fix markdown and some typos | | [93 ](https://github.com/Vonng/ddia/pull/93) | [@kemingy](https://github.com/kemingy) | ch5: fix markdown and some typos |
| [92 ](https://github.com/Vonng/ddia/pull/92) | [@Gilbert1024](https://github.com/Gilbert1024) | Merge pull request #1 from Vonng/master | | [92 ](https://github.com/Vonng/ddia/pull/92) | [@Gilbert1024](https://github.com/Gilbert1024) | Merge pull request #1 from Vonng/master |
| [88 ](https://github.com/Vonng/ddia/pull/88) | [@kemingy](https://github.com/kemingy) | fix typo for ch1, ch2, ch3, ch4 | | [88 ](https://github.com/Vonng/ddia/pull/88) | [@kemingy](https://github.com/kemingy) | fix typo for ch1, ch2, ch3, ch4 |
| [87 ](https://github.com/Vonng/ddia/pull/87) | [@wynn5a](https://github.com/wynn5a) | Update ch3.md | | [87 ](https://github.com/Vonng/ddia/pull/87) | [@wynn5a](https://github.com/wynn5a) | Update ch4.md |
| [86 ](https://github.com/Vonng/ddia/pull/86) | [@northmorn](https://github.com/northmorn) | Update ch1.md | | [86 ](https://github.com/Vonng/ddia/pull/86) | [@northmorn](https://github.com/northmorn) | Update ch2.md |
| [85 ](https://github.com/Vonng/ddia/pull/85) | [@sunbuhui](https://github.com/sunbuhui) | fix ch2.md: fix ch2 ambiguous translation | | [85 ](https://github.com/Vonng/ddia/pull/85) | [@sunbuhui](https://github.com/sunbuhui) | fix ch3.md: fix ch2 ambiguous translation |
| [84 ](https://github.com/Vonng/ddia/pull/84) | [@ganler](https://github.com/ganler) | Fix translation: use up | | [84 ](https://github.com/Vonng/ddia/pull/84) | [@ganler](https://github.com/ganler) | Fix translation: use up |
| [83 ](https://github.com/Vonng/ddia/pull/83) | [@afunTW](https://github.com/afunTW) | Using OpenCC to convert from zh-cn to zh-tw | | [83 ](https://github.com/Vonng/ddia/pull/83) | [@afunTW](https://github.com/afunTW) | Using OpenCC to convert from zh-cn to zh-tw |
| [82 ](https://github.com/Vonng/ddia/pull/82) | [@kangni](https://github.com/kangni) | fix gitbook url | | [82 ](https://github.com/Vonng/ddia/pull/82) | [@kangni](https://github.com/kangni) | fix gitbook url |
| [78 ](https://github.com/Vonng/ddia/pull/78) | [@hanyu2](https://github.com/hanyu2) | Fix unappropriated translation | | [78 ](https://github.com/Vonng/ddia/pull/78) | [@hanyu2](https://github.com/hanyu2) | Fix unappropriated translation |
| [77 ](https://github.com/Vonng/ddia/pull/77) | [@Ozarklake](https://github.com/Ozarklake) | fix typo | | [77 ](https://github.com/Vonng/ddia/pull/77) | [@Ozarklake](https://github.com/Ozarklake) | fix typo |
| [75 ](https://github.com/Vonng/ddia/pull/75) | [@2997ms](https://github.com/2997ms) | Fix typo | | [75 ](https://github.com/Vonng/ddia/pull/75) | [@2997ms](https://github.com/2997ms) | Fix typo |
| [74 ](https://github.com/Vonng/ddia/pull/74) | [@2997ms](https://github.com/2997ms) | Update ch9.md | | [74 ](https://github.com/Vonng/ddia/pull/74) | [@2997ms](https://github.com/2997ms) | Update ch10.md |
| [70 ](https://github.com/Vonng/ddia/pull/70) | [@2997ms](https://github.com/2997ms) | Update ch7.md | | [70 ](https://github.com/Vonng/ddia/pull/70) | [@2997ms](https://github.com/2997ms) | Update ch8.md |
| [67 ](https://github.com/Vonng/ddia/pull/67) | [@jiajiadebug](https://github.com/jiajiadebug) | fix issues in ch2 - ch9 and glossary | | [67 ](https://github.com/Vonng/ddia/pull/67) | [@jiajiadebug](https://github.com/jiajiadebug) | fix issues in ch2 - ch9 and glossary |
| [66 ](https://github.com/Vonng/ddia/pull/66) | [@blindpirate](https://github.com/blindpirate) | Fix typo | | [66 ](https://github.com/Vonng/ddia/pull/66) | [@blindpirate](https://github.com/blindpirate) | Fix typo |
| [63 ](https://github.com/Vonng/ddia/pull/63) | [@haifeiWu](https://github.com/haifeiWu) | Update ch10.md | | [63 ](https://github.com/Vonng/ddia/pull/63) | [@haifeiWu](https://github.com/haifeiWu) | Update ch11.md |
| [62 ](https://github.com/Vonng/ddia/pull/62) | [@ych](https://github.com/ych) | fix ch1.md typesetting problem | | [62 ](https://github.com/Vonng/ddia/pull/62) | [@ych](https://github.com/ych) | fix ch2.md typesetting problem |
| [61 ](https://github.com/Vonng/ddia/pull/61) | [@xianlaioy](https://github.com/xianlaioy) | docs:钟-->种去掉ou | | [61 ](https://github.com/Vonng/ddia/pull/61) | [@xianlaioy](https://github.com/xianlaioy) | docs:钟-->种去掉ou |
| [60 ](https://github.com/Vonng/ddia/pull/60) | [@Zombo1296](https://github.com/Zombo1296) | 否则 -> 或者 | | [60 ](https://github.com/Vonng/ddia/pull/60) | [@Zombo1296](https://github.com/Zombo1296) | 否则 -> 或者 |
| [59 ](https://github.com/Vonng/ddia/pull/59) | [@AlexanderMisel](https://github.com/AlexanderMisel) | 呼叫->调用,显着->显著 | | [59 ](https://github.com/Vonng/ddia/pull/59) | [@AlexanderMisel](https://github.com/AlexanderMisel) | 呼叫->调用,显着->显著 |
| [58 ](https://github.com/Vonng/ddia/pull/58) | [@ibyte2011](https://github.com/ibyte2011) | Update ch8.md | | [58 ](https://github.com/Vonng/ddia/pull/58) | [@ibyte2011](https://github.com/ibyte2011) | Update ch9.md |
| [55 ](https://github.com/Vonng/ddia/pull/55) | [@saintube](https://github.com/saintube) | ch8: 修改链接错误 | | [55 ](https://github.com/Vonng/ddia/pull/55) | [@saintube](https://github.com/saintube) | ch8: 修改链接错误 |
| [54 ](https://github.com/Vonng/ddia/pull/54) | [@Panmax](https://github.com/Panmax) | Update ch2.md | | [54 ](https://github.com/Vonng/ddia/pull/54) | [@Panmax](https://github.com/Panmax) | Update ch3.md |
| [53 ](https://github.com/Vonng/ddia/pull/53) | [@ibyte2011](https://github.com/ibyte2011) | Update ch9.md | | [53 ](https://github.com/Vonng/ddia/pull/53) | [@ibyte2011](https://github.com/ibyte2011) | Update ch10.md |
| [52 ](https://github.com/Vonng/ddia/pull/52) | [@hecenjie](https://github.com/hecenjie) | Update ch1.md | | [52 ](https://github.com/Vonng/ddia/pull/52) | [@hecenjie](https://github.com/hecenjie) | Update ch2.md |
| [51 ](https://github.com/Vonng/ddia/pull/51) | [@latavin243](https://github.com/latavin243) | fix 修正ch3 ch4几处翻译 | | [51 ](https://github.com/Vonng/ddia/pull/51) | [@latavin243](https://github.com/latavin243) | fix 修正ch3 ch4几处翻译 |
| [50 ](https://github.com/Vonng/ddia/pull/50) | [@AlexZFX](https://github.com/AlexZFX) | 几个疏漏和格式错误 | | [50 ](https://github.com/Vonng/ddia/pull/50) | [@AlexZFX](https://github.com/AlexZFX) | 几个疏漏和格式错误 |
| [49 ](https://github.com/Vonng/ddia/pull/49) | [@haifeiWu](https://github.com/haifeiWu) | Update ch1.md | | [49 ](https://github.com/Vonng/ddia/pull/49) | [@haifeiWu](https://github.com/haifeiWu) | Update ch2.md |
| [48 ](https://github.com/Vonng/ddia/pull/48) | [@scaugrated](https://github.com/scaugrated) | fix typo | | [48 ](https://github.com/Vonng/ddia/pull/48) | [@scaugrated](https://github.com/scaugrated) | fix typo |
| [47 ](https://github.com/Vonng/ddia/pull/47) | [@lzwill](https://github.com/lzwill) | Fixed typos in ch2 | | [47 ](https://github.com/Vonng/ddia/pull/47) | [@lzwill](https://github.com/lzwill) | Fixed typos in ch2 |
| [45 ](https://github.com/Vonng/ddia/pull/45) | [@zenuo](https://github.com/zenuo) | 删除一个多余的右括号 | | [45 ](https://github.com/Vonng/ddia/pull/45) | [@zenuo](https://github.com/zenuo) | 删除一个多余的右括号 |
@ -304,20 +307,20 @@
| [43 ](https://github.com/Vonng/ddia/pull/43) | [@baijinping](https://github.com/baijinping) | "更假简单"->"更加简单" | | [43 ](https://github.com/Vonng/ddia/pull/43) | [@baijinping](https://github.com/baijinping) | "更假简单"->"更加简单" |
| [42 ](https://github.com/Vonng/ddia/pull/42) | [@tisonkun](https://github.com/tisonkun) | 修复 ch1 中的无序列表格式 | | [42 ](https://github.com/Vonng/ddia/pull/42) | [@tisonkun](https://github.com/tisonkun) | 修复 ch1 中的无序列表格式 |
| [38 ](https://github.com/Vonng/ddia/pull/38) | [@renjie-c](https://github.com/renjie-c) | 纠正多处的翻译小错误 | | [38 ](https://github.com/Vonng/ddia/pull/38) | [@renjie-c](https://github.com/renjie-c) | 纠正多处的翻译小错误 |
| [37 ](https://github.com/Vonng/ddia/pull/37) | [@tankilo](https://github.com/tankilo) | fix translation mistakes in ch4.md | | [37 ](https://github.com/Vonng/ddia/pull/37) | [@tankilo](https://github.com/tankilo) | fix translation mistakes in ch5.md |
| [36 ](https://github.com/Vonng/ddia/pull/36) | [@wwek](https://github.com/wwek) | 1.修复多个链接错误 2.名词优化修订 3.错误修订 | | [36 ](https://github.com/Vonng/ddia/pull/36) | [@wwek](https://github.com/wwek) | 1.修复多个链接错误 2.名词优化修订 3.错误修订 |
| [35 ](https://github.com/Vonng/ddia/pull/35) | [@wwek](https://github.com/wwek) | fix ch7.md to ch8.md link error | | [35 ](https://github.com/Vonng/ddia/pull/35) | [@wwek](https://github.com/wwek) | fix ch8.md to ch9.md link error |
| [34 ](https://github.com/Vonng/ddia/pull/34) | [@wwek](https://github.com/wwek) | Merge pull request #1 from Vonng/master | | [34 ](https://github.com/Vonng/ddia/pull/34) | [@wwek](https://github.com/wwek) | Merge pull request #1 from Vonng/master |
| [33 ](https://github.com/Vonng/ddia/pull/33) | [@wwek](https://github.com/wwek) | fix part-ii.md link error | | [33 ](https://github.com/Vonng/ddia/pull/33) | [@wwek](https://github.com/wwek) | fix part-ii.md link error |
| [32 ](https://github.com/Vonng/ddia/pull/32) | [@JCYoky](https://github.com/JCYoky) | Update ch2.md | | [32 ](https://github.com/Vonng/ddia/pull/32) | [@JCYoky](https://github.com/JCYoky) | Update ch3.md |
| [31 ](https://github.com/Vonng/ddia/pull/31) | [@elsonLee](https://github.com/elsonLee) | Update ch7.md | | [31 ](https://github.com/Vonng/ddia/pull/31) | [@elsonLee](https://github.com/elsonLee) | Update ch8.md |
| [26 ](https://github.com/Vonng/ddia/pull/26) | [@yjhmelody](https://github.com/yjhmelody) | 修复一些明显错误 | | [26 ](https://github.com/Vonng/ddia/pull/26) | [@yjhmelody](https://github.com/yjhmelody) | 修复一些明显错误 |
| [25 ](https://github.com/Vonng/ddia/pull/25) | [@lqbilbo](https://github.com/lqbilbo) | 修复链接错误 | | [25 ](https://github.com/Vonng/ddia/pull/25) | [@lqbilbo](https://github.com/lqbilbo) | 修复链接错误 |
| [24 ](https://github.com/Vonng/ddia/pull/24) | [@artiship](https://github.com/artiship) | 修改词语顺序 | | [24 ](https://github.com/Vonng/ddia/pull/24) | [@artiship](https://github.com/artiship) | 修改词语顺序 |
| [23 ](https://github.com/Vonng/ddia/pull/23) | [@artiship](https://github.com/artiship) | 修正错别字 | | [23 ](https://github.com/Vonng/ddia/pull/23) | [@artiship](https://github.com/artiship) | 修正错别字 |
| [22 ](https://github.com/Vonng/ddia/pull/22) | [@artiship](https://github.com/artiship) | 纠正翻译错误 | | [22 ](https://github.com/Vonng/ddia/pull/22) | [@artiship](https://github.com/artiship) | 纠正翻译错误 |
| [21 ](https://github.com/Vonng/ddia/pull/21) | [@zhtisi](https://github.com/zhtisi) | 修正目录和本章标题不符的情况 | | [21 ](https://github.com/Vonng/ddia/pull/21) | [@zhtisi](https://github.com/zhtisi) | 修正目录和本章标题不符的情况 |
| [20 ](https://github.com/Vonng/ddia/pull/20) | [@rentiansheng](https://github.com/rentiansheng) | Update ch7.md | | [20 ](https://github.com/Vonng/ddia/pull/20) | [@rentiansheng](https://github.com/rentiansheng) | Update ch8.md |
| [19 ](https://github.com/Vonng/ddia/pull/19) | [@LHRchina](https://github.com/LHRchina) | 修复语句小bug | | [19 ](https://github.com/Vonng/ddia/pull/19) | [@LHRchina](https://github.com/LHRchina) | 修复语句小bug |
| [16 ](https://github.com/Vonng/ddia/pull/16) | [@MuAlex](https://github.com/MuAlex) | Master | | [16 ](https://github.com/Vonng/ddia/pull/16) | [@MuAlex](https://github.com/MuAlex) | Master |
| [15 ](https://github.com/Vonng/ddia/pull/15) | [@cg-zhou](https://github.com/cg-zhou) | Update translation progress | | [15 ](https://github.com/Vonng/ddia/pull/15) | [@cg-zhou](https://github.com/cg-zhou) | Update translation progress |

View File

@ -1,18 +1,19 @@
- [序言](preface.md) - [序言](preface.md)
- [第一部分:数据系统基础](part-i.md) - [第一部分:数据系统基础](part-i.md)
- [第一章:可靠性、可伸缩性和可维护性](ch1.md) - [第一章:数据系统架构中的利弊权衡](ch1.md)
- [第二章:数据模型与查询语言](ch2.md) - [第二章:定义非功能性要求](ch2.md)
- [第三章:存储与检索](ch3.md) - [第三章:数据模型与查询语言](ch3.md)
- [第四章:编码与演化](ch4.md) - [第四章:存储与检索](ch4.md)
- [第五章:编码与演化](ch5.md)
- [第二部分:分布式数据](part-ii.md) - [第二部分:分布式数据](part-ii.md)
- [五章:复制](ch5.md) - [六章:复制](ch6.md)
- [六章:分区](ch6.md) - [七章:分区](ch7.md)
- [七章:事务](ch7.md) - [八章:事务](ch8.md)
- [八章:分布式系统的麻烦](ch8.md) - [九章:分布式系统的麻烦](ch9.md)
- [九章:一致性与共识](ch9.md) - [十章:一致性与共识](ch10.md)
- [第三部分:衍生数据](part-iii.md) - [第三部分:衍生数据](part-iii.md)
- [第十章:批处理](ch10.md) - [第十一章:批处理](ch11.md)
- [第十一章:流处理](ch11.md) - [第十二章:流处理](ch12.md)
- [第十二章:数据系统的未来](ch12.md) - [第十三章:数据系统的未来](ch13.md)
- [术语表](glossary.md) - [术语表](glossary.md)
- [后记](colophon.md) - [后记](colophon.md)

585
ch1.md
View File

@ -1,414 +1,511 @@
# 第一章:可靠性、可伸缩性和可维护性 # 第一章:数据系统架构中的利弊权衡
![](img/ch1.png) > *没有解决方案,只有利弊权衡。[…] 尽你所能获取最好的利弊权衡,这是你唯一能指望的事。*
> 互联网做得太棒了,以至于大多数人将它看作像太平洋这样的自然资源,而不是什么人工产物。上一次出现这种大规模且无差错的技术,你还记得是什么时候吗?
> >
> —— [艾伦・凯](http://www.drdobbs.com/architecture-and-design/interview-with-alan-kay/240003442) 在接受 Dobb 博士杂志采访时说2012 年) > [Thomas Sowell](https://www.youtube.com/watch?v=2YUtKr8-_Fg), 与 Fred Barnes 的采访 (2005)
-----------------------
[TOC]
现今很多应用程序都是 **数据密集型data-intensive** 的,而非 **计算密集型compute-intensive** 的。因此 CPU 很少成为这类应用的瓶颈,更大的问题通常来自数据量、数据复杂性、以及数据的变更速度。 ![img](img/ch1.png)
数据密集型应用通常由标准组件构建而成,标准组件提供了很多通用的功能;例如,许多应用程序都需要: # A Note for Early Release Readers
- 存储数据,以便自己或其他应用程序之后能再次找到 *数据库,即 databases* With Early Release ebooks, you get books in their earliest form—the authors raw and unedited content as they write—so you can take advantage of these technologies long before the official release of these titles.
- 记住开销昂贵操作的结果,加快读取速度(*缓存,即 caches*
- 允许用户按关键字搜索数据,或以各种方式对数据进行过滤(*搜索索引,即 search indexes*
- 向其他进程发送消息,进行异步处理(*流处理,即 stream processing*
- 定期处理累积的大批量数据(*批处理,即 batch processing*
如果这些功能听上去平淡无奇,那是因为这些 **数据系统data system** 是非常成功的抽象:我们一直不假思索地使用它们并习以为常。绝大多数工程师不会幻想从零开始编写存储引擎,因为在开发应用时,数据库已经是足够完美的工具了。 This will be the 1st chapter of the final book. The GitHub repo for this book is *[\*https://github.com/ept/ddia2-feedback\*](https://github.com/ept/ddia2-feedback)*.
但现实没有这么简单。不同的应用有着不同的需求,因而数据库系统也是百花齐放,有着各式各样的特性。实现缓存有很多种手段,创建搜索索引也有好几种方法,诸如此类。因此在开发应用前,我们依然有必要先弄清楚最适合手头工作的工具和方法。而且当单个工具解决不了你的问题时,组合使用这些工具可能还是有些难度的。 If you have comments about how we might improve the content and/or examples in this book, or if you notice missing material within this chapter, please reach out on GitHub.
本书将是一趟关于数据系统原理、实践与应用的旅程,并讲述了设计数据密集型应用的方法。我们将探索不同工具之间的共性与特性,以及各自的实现原理。 Data is central to much application development today. With web and mobile apps, software as a service (SaaS), and cloud services, it has become normal to store data from many different users in a shared server-based data infrastructure. Data from user activity, business transactions, devices and sensors needs to be stored and made available for analysis. As users interact with an application, they both read the data that is stored, and also generate more data.
本章将从我们所要实现的基础目标开始:可靠、可伸缩、可维护的数据系统。我们将澄清这些词语的含义,概述考量这些目标的方法。并回顾一些后续章节所需的基础知识。在接下来的章节中我们将抽丝剥茧,研究设计数据密集型应用时可能遇到的设计决策。 Small amounts of data, which can be stored and processed on a single machine, are often fairly easy to deal with. However, as the data volume or the rate of queries grows, it needs to be distributed across multiple machines, which introduces many challenges. As the needs of the application become more complex, it is no longer sufficient to store everything in one system, but it might be necessary to combine multiple storage or processing systems that provide different capabilities.
We call an application *data-intensive* if data management is one of the primary challenges in developing the application [[1](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Kouzes2009)]. While in *compute-intensive* systems the challenge is parallelizing some very large computation, in data-intensive applications we usually worry more about things like storing and processing large data volumes, managing changes to data, ensuring consistency in the face of failures and concurrency, and making sure services are highly available.
## 关于数据系统的思考 Such applications are typically built from standard building blocks that provide commonly needed functionality. For example, many applications need to:
我们通常认为,数据库、消息队列、缓存等工具分属于几个差异显著的类别。虽然数据库和消息队列表面上有一些相似性 —— 它们都会存储一段时间的数据 —— 但它们有迥然不同的访问模式,这意味着迥异的性能特征和实现手段。 - Store data so that they, or another application, can find it again later (*databases*)
- Remember the result of an expensive operation, to speed up reads (*caches*)
- Allow users to search data by keyword or filter it in various ways (*search indexes*)
- Handle events and data changes as soon as they occur (*stream processing*)
- Periodically crunch a large amount of accumulated data (*batch processing*)
那我们为什么要把这些东西放在 **数据系统data system** 的总称之下混为一谈呢? In building an application we typically take several software systems or services, such as databases or APIs, and glue them together with some application code. If you are doing exactly what the data systems were designed for, then this process can be quite easy.
近些年来出现了许多新的数据存储工具与数据处理工具。它们针对不同应用场景进行优化因此不再适合生硬地归入传统类别【1】。类别之间的界限变得越来越模糊例如数据存储可以被当成消息队列用Redis消息队列则带有类似数据库的持久保证Apache Kafka However, as your application becomes more ambitious, challenges arise. There are many database systems with different characteristics, suitable for different purposes—how do you choose which one to use? There are various approaches to caching, several ways of building search indexes, and so on—how do you reason about their trade-offs? You need to figure out which tools and which approaches are the most appropriate for the task at hand, and it can be difficult to combine tools when you need to do something that a single tool cannot do alone.
其次,越来越多的应用程序有着各种严格而广泛的要求,单个工具不足以满足所有的数据处理和存储需求。取而代之的是,总体工作被拆分成一系列能被单个工具高效完成的任务,并通过应用代码将它们缝合起来。 This book is a guide to help you make decisions about which technologies to use and how to combine them. As you will see, there is no one approach that is fundamentally better than others; everything has pros and cons. With this book, you will learn to ask the right questions to evaluate and compare data systems, so that you can figure out which approach will best serve the needs of your particular application.
例如如果将缓存应用管理的缓存层Memcached 或同类产品)和全文搜索(全文搜索服务器,例如 Elasticsearch 或 Solr功能从主数据库剥离出来那么使缓存 / 索引与主数据库保持同步通常是应用代码的责任。[图 1-1](img/fig1-1.png) 给出了这种架构可能的样子(细节将在后面的章节中详细介绍)。 We will start our journey by looking at some of the ways that data is typically used in organizations today. Many of the ideas here have their origin in *enterprise software* (i.e., the software needs and engineering practices of large organizations, such as big corporations and governments), since historically, only large organizations had the large data volumes that required sophisticated technical solutions. If your data volume is small enough, you can simply keep it in a spreadsheet! However, more recently it has also become common for smaller companies and startups to manage large data volumes and build data-intensive systems.
![](img/fig1-1.png) One of the key challenges with data systems is that different people need to do very different things with data. If you are working at a company, you and your team will have one set of priorities, while another team may have entirely different goals, although you might even be working with the same dataset! Moreover, those goals might not be explicitly articulated, which can lead to misunderstandings and disagreement about the right approach.
**图 1-1 一个可能的组合使用多个组件的数据系统架构** To help you understand what choices you can make, this chapter compares several contrasting concepts, and explores their trade-offs:
当你将多个工具组合在一起提供服务时,服务的接口或 **应用程序编程接口API, Application Programming Interface** 通常向客户端隐藏这些实现细节。现在,你基本上已经使用较小的通用组件创建了一个全新的、专用的数据系统。这个新的复合数据系统可能会提供特定的保证,例如:缓存在写入时会作废或更新,以便外部客户端获取一致的结果。现在你不仅是应用程序开发人员,还是数据系统设计人员了。 - the difference between transaction processing and analytics ([“Transaction Processing versus Analytics”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_analytics));
- pros and cons of cloud services and self-hosted systems ([“Cloud versus Self-Hosting”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_cloud));
- when to move from single-node systems to distributed systems ([“Distributed versus Single-Node Systems”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_distributed)); and
- balancing the needs of the business and the rights of the user ([“Data Systems, Law, and Society”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_compliance)).
设计数据系统或服务时可能会遇到很多棘手的问题,例如:当系统出问题时,如何确保数据的正确性和完整性?当部分系统退化降级时,如何为客户提供始终如一的良好性能?当负载增加时,如何扩容应对?什么样的 API 才是好的 API Moreover, this chapter will provide you with terminology that we will need for the rest of the book.
影响数据系统设计的因素很多,包括参与人员的技能和经验、历史遗留问题、系统路径依赖、交付时限、公司的风险容忍度、监管约束等,这些因素都需要具体问题具体分析。 # Terminology: Frontends and Backends
本书着重讨论三个在大多数软件系统中都很重要的问题: Much of what we will discuss in this book relates to *backend development*. To explain that term: for web applications, the client-side code (which runs in a web browser) is called the *frontend*, and the server-side code that handles user requests is known as the *backend*. Mobile apps are similar to frontends in that they provide user interfaces, which often communicate over the Internet with a server-side backend. Frontends sometimes manage data locally on the users device [[2](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Kleppmann2019)], but the greatest data infrastructure challenges often lie in the backend: a frontend only needs to handle one users data, whereas the backend manages data on behalf of *all* of the users.
* 可靠性Reliability A backend service is often reachable via HTTP; it usually consists of some application code that reads and writes data in one or more databases, and sometimes interfaces with additional data systems such as caches or message queues (which we might collectively call *data infrastructure*). The application code is often *stateless* (i.e., when it finishes handling one HTTP request, it forgets everything about that request), and any information that needs to persist from one request to another needs to be stored either on the client, or in the server-side data infrastructure.
系统在 **困境**adversity比如硬件故障、软件故障、人为错误中仍可正常工作正确完成功能并能达到期望的性能水准。请参阅 “[可靠性](#可靠性)”。 # Transaction Processing versus Analytics
* 可伸缩性Scalability If you are working on data systems in an enterprise, you are likely to encounter several different types of people who work with data. The first type are *backend engineers* who build services that handle requests for reading and updating data; these services often serve external users, either directly or indirectly via other services (see [“Microservices and Serverless”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_microservices)). Sometimes services are for internal use by other parts of the organization.
有合理的办法应对系统的增长(数据量、流量、复杂性)。请参阅 “[可伸缩性](#可伸缩性)”。 In addition to the teams managing backend services, two other groups of people typically require access to an organizations data: *business analysts*, who generate reports about the activities of the organization in order to help the management make better decisions (*business intelligence* or *BI*), and *data scientists*, who look for novel insights in data or who create user-facing product features that are enabled by data analysis and machine learning/AI (for example, “people who bought X also bought Y” recommendations on an e-commerce website, predictive analytics such as risk scoring or spam filtering, and ranking of search results).
* 可维护性Maintainability Although business analysts and data scientists tend to use different tools and operate in different ways, they have some things in common: both perform *analytics*, which means they look at the data that the users and backend services have generated, but they generally do not modify this data (except perhaps for fixing mistakes). They might create derived datasets in which the original data has been processed in some way. This has led to a split between two types of systems—a distinction that we will use throughout this book:
许多不同的人(工程师、运维)在不同的生命周期,都能高效地在系统上工作(使系统保持现有行为,并适应新的应用场景)。请参阅 “[可维护性](#可维护性)”。 - *Operational systems* consist of the backend services and data infrastructure where data is created, for example by serving external users. Here, the application code both reads and modifies the data in its databases, based on the actions performed by the users.
- *Analytical systems* serve the needs of business analysts and data scientists. They contain a read-only copy of the data from the operational systems, and they are optimized for the types of data processing that are needed for analytics.
人们经常追求这些词汇,却没有清楚理解它们到底意味着什么。为了工程的严谨性,本章的剩余部分将探讨可靠性、可伸缩性和可维护性的含义。为实现这些目标而使用的各种技术,架构和算法将在后续的章节中研究。 As we shall see in the next section, operational and analytical systems are often kept separate, for good reasons. As these systems have matured, two new specialized roles have emerged: *data engineers* and *analytics engineers*. Data engineers are the people who know how to integrate the operational and the analytical systems, and who take responsibility for the organizations data infrastructure more widely [[3](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Reis2022)]. Analytics engineers model and transform data to make it more useful for end users querying data in an organization [[4](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Machado2023)].
Many engineers specialize on either the operational or the analytical side. However, this book covers both operational and analytical data systems, since both play an important role in the lifecycle of data within an organization. We will explore in-depth the data infrastructure that is used to deliver services both to internal and external users, so that you can work better with your colleagues on the other side of this divide.
## 可靠性 ## Characterizing Analytical and Operational Systems
人们对于一个东西是否可靠,都有一个直观的想法。人们对可靠软件的典型期望包括: In the early days of business data processing, a write to the database typically corresponded to a *commercial transaction* taking place: making a sale, placing an order with a supplier, paying an employees salary, etc. As databases expanded into areas that didnt involve money changing hands, the term *transaction* nevertheless stuck, referring to a group of reads and writes that form a logical unit.
* 应用程序表现出用户所期望的功能。 ###### Note
* 允许用户犯错,允许用户以出乎意料的方式使用软件。
* 在预期的负载和数据量下,性能满足要求。
* 系统能防止未经授权的访问和滥用。
如果所有这些在一起意味着 “正确工作”,那么可以把可靠性粗略理解为 “即使出现问题,也能继续正确工作”。 [Link to Come] explores in detail what we mean with a transaction. This chapter uses the term loosely to refer to low-latency reads and writes.
造成错误的原因叫做 **故障fault**,能预料并应对故障的系统特性可称为 **容错fault-tolerant****回弹性resilient**。“**容错**” 一词可能会产生误导,因为它暗示着系统可以容忍所有可能的错误,但在实际中这是不可能的。比方说,如果整个地球(及其上的所有服务器)都被黑洞吞噬了,想要容忍这种错误,需要把网络托管到太空中 —— 这种预算能不能批准就祝你好运了。所以在讨论容错时,只有谈论特定类型的错误才有意义。 Even though databases started being used for many different kinds of data—posts on social media, moves in a game, contacts in an address book, and many others—the basic access pattern remained similar to processing business transactions. An operational system typically looks up a small number of records by some key (this is called a *point query*). Records are inserted, updated, or deleted based on the users input. Because these applications are interactive, this access pattern became known as *online transaction processing* (OLTP).
注意 **故障fault** 不同于 **失效failure**【2】。**故障** 通常定义为系统的一部分状态偏离其标准,而 **失效** 则是系统作为一个整体停止向用户提供服务。故障的概率不可能降到零,因此最好设计容错机制以防因 **故障** 而导致 **失效**。本书中我们将介绍几种用不可靠的部件构建可靠系统的技术。 However, databases also started being increasingly used for analytics, which has very different access patterns compared to OLTP. Usually an analytic query scans over a huge number of records, and calculates aggregate statistics (such as count, sum, or average) rather than returning the individual records to the user. For example, a business analyst at a supermarket chain may want to answer analytic queries such as:
反直觉的是,在这类容错系统中,通过故意触发来 **提高** 故障率是有意义的例如在没有警告的情况下随机地杀死单个进程。许多高危漏洞实际上是由糟糕的错误处理导致的【3】因此我们可以通过故意引发故障来确保容错机制不断运行并接受考验从而提高故障自然发生时系统能正确处理的信心。Netflix 公司的 *Chaos Monkey*【4】就是这种方法的一个例子。 - What was the total revenue of each of our stores in January?
- How many more bananas than usual did we sell during our latest promotion?
- Which brand of baby food is most often purchased together with brand X diapers?
尽管比起 **阻止错误prevent error**,我们通常更倾向于 **容忍错误**。但也有 **预防胜于治疗** 的情况(比如不存在治疗方法时)。安全问题就属于这种情况。例如,如果攻击者破坏了系统,并获取了敏感数据,这种事是撤销不了的。但本书主要讨论的是可以恢复的故障种类,正如下面几节所述。 The reports that result from these types of queries are important for business intelligence, helping the management decide what to do next. In order to differentiate this pattern of using databases from transaction processing, it has been called *online analytic processing* (OLAP) [[5](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Codd1993)]. The difference between OLTP and analytics is not always clear-cut, but some typical characteristics are listed in [Table 1-1](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#tab_oltp_vs_olap).
### 硬件故障 | Property | Operational systems (OLTP) | Analytical systems (OLAP) |
| :------------------ | :---------------------------------------------- | :---------------------------------------- |
| Main read pattern | Point queries (fetch individual records by key) | Aggregate over large number of records |
| Main write pattern | Create, update, and delete individual records | Bulk import (ETL) or event stream |
| Human user example | End user of web/mobile application | Internal analyst, for decision support |
| Machine use example | Checking if an action is authorized | Detecting fraud/abuse patterns |
| Type of queries | Fixed set of queries, predefined by application | Analyst can make arbitrary queries |
| Data represents | Latest state of data (current point in time) | History of events that happened over time |
| Dataset size | Gigabytes to terabytes | Terabytes to petabytes |
当想到系统失效的原因时,**硬件故障hardware faults** 总会第一个进入脑海。硬盘崩溃、内存出错、机房断电、有人拔错网线…… 任何与大型数据中心打过交道的人都会告诉你:一旦你拥有很多机器,这些事情 **总** 会发生! ###### Note
据报道称,硬盘的 **平均无故障时间MTTF, mean time to failure** 约为 10 到 50 年【5】【6】。因此从数学期望上讲在拥有 10000 个磁盘的存储集群上,平均每天会有 1 个磁盘出故障。 The meaning of *online* in *OLAP* is unclear; it probably refers to the fact that queries are not just for predefined reports, but that analysts use the OLAP system interactively for explorative queries.
为了减少系统的故障率,第一反应通常都是增加单个硬件的冗余度,例如:磁盘可以组建 RAID服务器可能有双路电源和热插拔 CPU数据中心可能有电池和柴油发电机作为后备电源某个组件挂掉时冗余组件可以立刻接管。这种方法虽然不能完全防止由硬件问题导致的系统失效但它简单易懂通常也足以让机器不间断运行很多年。 With operational systems, users are generally not allowed to construct custom SQL queries and run them on the database, since that would potentially allow them to read or modify data that they do not have permission to access. Moreover, they might write queries that are expensive to execute, and hence affect the database performance for other users. For these reasons, OLTP systems mostly run a fixed set of queries that are baked into the application code, and use one-off custom queries only occasionally for maintenance or troubleshooting. On the other hand, analytic databases usually give their users the freedom to write arbitrary SQL queries by hand, or to generate queries automatically using a data visualization or dashboard tool such as Tableau, Looker, or Microsoft Power BI.
直到最近,硬件冗余对于大多数应用来说已经足够了,它使单台机器完全失效变得相当罕见。只要你能快速地把备份恢复到新机器上,故障停机时间对大多数应用而言都算不上灾难性的。只有少量高可用性至关重要的应用才会要求有多套硬件冗余。 ## Data Warehousing
但是随着数据量和应用计算需求的增加,越来越多的应用开始大量使用机器,这会相应地增加硬件故障率。此外,在类似亚马逊 AWSAmazon Web Services的一些云服务平台上虚拟机实例不可用却没有任何警告也是很常见的【7】因为云平台的设计就是优先考虑 **灵活性flexibility****弹性elasticity**[^i],而不是单机可靠性。 At first, the same databases were used for both transaction processing and analytic queries. SQL turned out to be quite flexible in this regard: it works well for both types of queries. Nevertheless, in the late 1980s and early 1990s, there was a trend for companies to stop using their OLTP systems for analytics purposes, and to run the analytics on a separate database system instead. This separate database was called a *data warehouse*.
如果在硬件冗余的基础上进一步引入软件容错机制,那么系统在容忍整个(单台)机器故障的道路上就更进一步了。这样的系统也有运维上的便利,例如:如果需要重启机器(例如应用操作系统安全补丁),单服务器系统就需要计划停机。而允许机器失效的系统则可以一次修复一个节点,无需整个系统停机。 A large enterprise may have dozens, even hundreds, of operational transaction processing systems: systems powering the customer-facing website, controlling point of sale (checkout) systems in physical stores, tracking inventory in warehouses, planning routes for vehicles, managing suppliers, administering employees, and performing many other tasks. Each of these systems is complex and needs a team of people to maintain it, so these systems end up operating mostly independently from each other.
[^i]: 在 [应对负载的方法](#应对负载的方法) 一节定义 It is usually undesirable for business analysts and data scientists to directly query these OLTP systems, for several reasons:
### 软件错误 - the data of interest may be spread across multiple operational systems, making it difficult to combine those datasets in a single query (a problem known as *data silos*);
- the kinds of schemas and data layouts that are good for OLTP are less well suited for analytics (see [“Stars and Snowflakes: Schemas for Analytics”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_analytics));
- analytic queries can be quite expensive, and running them on an OLTP database would impact the performance for other users; and
- the OLTP systems might reside in a separate network that users are not allowed direct access to for security or compliance reasons.
我们通常认为硬件故障是随机的、相互独立的:一台机器的磁盘失效并不意味着另一台机器的磁盘也会失效。虽然大量硬件组件之间可能存在微弱的相关性(例如服务器机架的温度等共同的原因),但同时发生故障也是极为罕见的。 A *data warehouse*, by contrast, is a separate database that analysts can query to their hearts content, without affecting OLTP operations [[6](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Chaudhuri1997)]. As we shall see in [Link to Come], data warehouses often store data in a way that is very different from OLTP databases, in order to optimize for the types of queries that are common in analytics.
另一类错误是内部的 **系统性错误systematic error**【8】。这类错误难以预料而且因为是跨节点相关的所以比起不相关的硬件故障往往可能造成更多的 **系统失效**【5】。例子包括 The data warehouse contains a read-only copy of the data in all the various OLTP systems in the company. Data is extracted from OLTP databases (using either a periodic data dump or a continuous stream of updates), transformed into an analysis-friendly schema, cleaned up, and then loaded into the data warehouse. This process of getting data into the data warehouse is known as *ExtractTransformLoad* (ETL) and is illustrated in [Figure 1-1](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#fig_dwh_etl). Sometimes the order of the *transform* and *load* steps is swapped (i.e., the transformation is done in the data warehouse, after loading), resulting in *ELT*.
* 接受特定的错误输入,便导致所有应用服务器实例崩溃的 BUG。例如 2012 年 6 月 30 日的闰秒,由于 Linux 内核中的一个错误【9】许多应用同时挂掉了。 ![ddia 0308](img/ddia_0308.png)
* 失控进程会用尽一些共享资源,包括 CPU 时间、内存、磁盘空间或网络带宽。
* 系统依赖的服务变慢,没有响应,或者开始返回错误的响应。
* 级联故障一个组件中的小故障触发另一个组件中的故障进而触发更多的故障【10】。
导致这类软件故障的 BUG 通常会潜伏很长时间,直到被异常情况触发为止。这种情况意味着软件对其环境做出了某种假设 —— 虽然这种假设通常来说是正确的但由于某种原因最后不再成立了【11】。 ###### Figure 1-1. Simplified outline of ETL into a data warehouse.
虽然软件中的系统性故障没有速效药,但我们还是有很多小办法,例如:仔细考虑系统中的假设和交互;彻底的测试;进程隔离;允许进程崩溃并重启;测量、监控并分析生产环境中的系统行为。如果系统能够提供一些保证(例如在一个消息队列中,进入与发出的消息数量相等),那么系统就可以在运行时不断自检,并在出现 **差异discrepancy** 时报警【12】。 In some cases the data sources of the ETL processes are external SaaS products such as customer relationship management (CRM), email marketing, or credit card processing systems. In those cases, you do not have direct access to the original database, since it is accessible only via the software vendors API. Bringing the data from these external systems into your own data warehouse can enable analyses that are not possible via the SaaS API. ETL for SaaS APIs is often implemented by specialist data connector services such as Fivetran, Singer, or AirByte.
### 人为错误 Some database systems offer *hybrid transactional/analytic processing* (HTAP), which aims to enable OLTP and analytics in a single system without requiring ETL from one system into another [[7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Ozcan2017), [8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Prout2022)]. However, many HTAP systems internally consist of an OLTP system coupled with a separate analytical system, hidden behind a common interface—so the distinction beween the two remains important for understanding how these systems work.
设计并构建了软件系统的工程师是人类,维持系统运行的运维也是人类。即使他们怀有最大的善意,人类也是不可靠的。举个例子,一项关于大型互联网服务的研究发现,运维配置错误是导致服务中断的首要原因,而硬件故障(服务器或网络)仅导致了 10-25% 的服务中断【13】。 Moreover, even though HTAP exists, it is common to have a separation between transactional and analytic systems due to their different goals and requirements. In particular, it is considered good practice for each operational system to have its own database (see [“Microservices and Serverless”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_microservices)), leading to hundreds of separate operational databases; on the other hand, an enterprise usually has a single data warehouse, so that business analysts can combine data from several operational systems in a single query.
尽管人类不可靠,但怎么做才能让系统变得可靠?最好的系统会组合使用以下几种办法: The separation between operational and analytical systems is part of a wider trend: as workloads have become more demanding, systems have become more specialized and optimized for particular workloads. General-purpose systems can handle small data volumes comfortably, but the greater the scale, the more specialized systems tend to become [[9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Stonebraker2005fitsall)].
* 以最小化犯错机会的方式设计系统。例如精心设计的抽象、API 和管理后台使做对事情更容易,搞砸事情更困难。但如果接口限制太多,人们就会忽略它们的好处而想办法绕开。很难正确把握这种微妙的平衡。 ### From data warehouse to data lake
* 将人们最容易犯错的地方与可能导致失效的地方 **解耦decouple**。特别是提供一个功能齐全的非生产环境 **沙箱sandbox**,使人们可以在不影响真实用户的情况下,使用真实数据安全地探索和实验。
* 在各个层次进行彻底的测试【3】从单元测试、全系统集成测试到手动测试。自动化测试易于理解已经被广泛使用特别适合用来覆盖正常情况中少见的 **边缘场景corner case**
* 允许从人为错误中简单快速地恢复,以最大限度地减少失效情况带来的影响。例如,快速回滚配置变更,分批发布新代码(以便任何意外错误只影响一小部分用户),并提供数据重算工具(以备旧的计算出错)。
* 配置详细和明确的监控,比如性能指标和错误率。在其他工程学科中这指的是 **遥测telemetry**(一旦火箭离开了地面,遥测技术对于跟踪发生的事情和理解失败是至关重要的)。监控可以向我们发出预警信号,并允许我们检查是否有任何地方违反了假设和约束。当出现问题时,指标数据对于问题诊断是非常宝贵的。
* 良好的管理实践与充分的培训 —— 一个复杂而重要的方面,但超出了本书的范围。
### 可靠性有多重要? A data warehouse often uses a *relational* data model that is queried through SQL (see [Chapter 3](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#ch_datamodels)), perhaps using specialized business intelligence software. This model works well for the types of queries that business analysts need to make, but it is less well suited to the needs of data scientists, who might need to perform tasks such as:
可靠性不仅仅是针对核电站和空中交通管制软件而言,我们也期望更多平凡的应用能可靠地运行。商务应用中的错误会导致生产力损失(也许数据报告不完整还会有法律风险),而电商网站的中断则可能会导致收入和声誉的巨大损失。 - Transform data into a form that is suitable for training a machine learning model; often this requires turning the rows and columns of a database table into a vector or matrix of numerical values called *features*. The process of performing this transformation in a way that maximizes the performance of the trained model is called *feature engineering*, and it often requires custom code that is difficult to express using SQL.
- Take textual data (e.g., reviews of a product) and use natural language processing techniques to try to extract structured information from it (e.g., the sentiment of the author, or which topics they mention). Similarly, they might need to extract structured information from photos using computer vision techniques.
即使在 “非关键” 应用中我们也对用户负有责任。试想一位家长把所有的照片和孩子的视频储存在你的照片应用里【15】。如果数据库突然损坏他们会感觉如何他们可能会知道如何从备份恢复吗 Although there have been efforts to add machine learning operators to a SQL data model [[10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Cohen2009)] and to build efficient machine learning systems on top of a relational foundation [[11](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Olteanu2020)], many data scientists prefer not to work in a relational database such as a data warehouse. Instead, many prefer to use Python data analysis libraries such as pandas and scikit-learn, statistical analysis languages such as R, and distributed analytics frameworks such as Spark [[12](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Bornstein2020)]. We discuss these further in [“Dataframes, Matrices, and Arrays”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_dataframes).
在某些情况下,我们可能会选择牺牲可靠性来降低开发成本(例如为未经证实的市场开发产品原型)或运营成本(例如利润率极低的服务),但我们偷工减料时,应该清楚意识到自己在做什么。 Consequently, organizations face a need to make data available in a form that is suitable for use by data scientists. The answer is a *data lake*: a centralized data repository that holds a copy of any data that might be useful for analysis, obtained from operational systems via ETL processes. The difference from a data warehouse is that a data lake simply contains files, without imposing any particular file format or data model. Files in a data lake might be collections of database records, encoded using a file format such as Avro or Parquet (see [Link to Come]), but they can equally well contain text, images, videos, sensor readings, sparse matrices, feature vectors, genome sequences, or any other kind of data [[13](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Fowler2015)].
ETL processes have been generalized to *data pipelines*, and in some cases the data lake has become an intermediate stop on the path from the operational systems to the data warehouse. The data lake contains data in a “raw” form produced by the operational systems, without the transformation into a relational data warehouse schema. This approach has the advantage that each consumer of the data can transform the raw data into a form that best suits their needs. It has been dubbed the *sushi principle*: “raw data is better” [[14](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Johnson2015)].
## 可伸缩性 Besides loading data from a data lake into a separate data warehouse, it is also possible to run typical data warehousing workloads (SQL queries and business analytics) directly on the files in the data lake, alongside data science/machine learning workloads. This architecture is known as a *data lakehouse*, and it requires a query execution engine and a metadata (e.g., schema management) layer that extend the data lakes file storage [[15](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Armbrust2021)]. Apache Hive, Spark SQL, Presto, and Trino are examples of this approach.
系统今天能可靠运行,并不意味未来也能可靠运行。服务 **降级degradation** 的一个常见原因是负载增加,例如:系统负载已经从一万个并发用户增长到十万个并发用户,或者从一百万增长到一千万。也许现在处理的数据量级要比过去大得多。 ### Beyond the data lake
**可伸缩性Scalability** 是用来描述系统应对负载增长能力的术语。但是请注意,这不是贴在系统上的一维标签:说 “X 可伸缩” 或 “Y 不可伸缩” 是没有任何意义的。相反,讨论可伸缩性意味着考虑诸如 “如果系统以特定方式增长,有什么选项可以应对增长?” 和 “如何增加计算资源来处理额外的负载?” 等问题。 As analytics practices have matured, organizations have been increasingly paying attention to the management and operations of analytics systems and data pipelines, as captured for example in the DataOps manifesto [[16](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#DataOps)]. Part of this are issues of governance, privacy, and compliance with regulation such as GDPR and CCPA, which we discuss in [“Data Systems, Law, and Society”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_compliance) and [Link to Come].
### 描述负载 Moreover, analytical data is increasingly made available not only as files and relational tables, but also as streams of events (see [Link to Come]). With file-based data analysis you can re-run the analysis periodically (e.g., daily) in order to respond to changes in the data, but stream processing allows analytics systems to respond to events much faster, on the order of seconds. Depending on the application and how time-sensitive it is, a stream processing approach can be valuable, for example to identify and block potentially fraudulent or abusive activity.
在讨论增长问题(如果负载加倍会发生什么?)前,首先要能简要描述系统的当前负载。负载可以用一些称为 **负载参数load parameters** 的数字来描述。参数的最佳选择取决于系统架构,它可能是每秒向 Web 服务器发出的请求、数据库中的读写比率、聊天室中同时活跃的用户数量、缓存命中率或其他东西。除此之外,也许平均情况对你很重要,也许你的瓶颈是少数极端场景。 In some cases the outputs of analytics systems are made available to operational systems (a process sometimes known as *reverse ETL* [[17](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Manohar2021)]). For example, a machine-learning model that was trained on data in an analytics system may be deployed to production, so that it can generate recommendations for end-users, such as “people who bought X also bought Y”. Such deployed outputs of analytics systems are also known as *data products* [[18](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#ORegan2018)]. Machine learning models can be deployed to operational systems using specialized tools such as TFX, Kubeflow, or MLflow.
为了使这个概念更加具体,我们以推特在 2012 年 11 月发布的数据【16】为例。推特的两个主要业务是 ## Systems of Record and Derived Data
* 发布推文 Related to the distinction between operational and analytical systems, this book also distinguishes between *systems of record* and *derived data systems*. These terms are useful because they can help you clarify the flow of data through a system:
用户可以向其粉丝发布新消息(平均 4.6k 请求 / 秒,峰值超过 12k 请求 / 秒)。 - Systems of record
* 主页时间线 A system of record, also known as *source of truth*, holds the authoritative or *canonical* version of some data. When new data comes in, e.g., as user input, it is first written here. Each fact is represented exactly once (the representation is typically *normalized*; see [“Normalization, Denormalization, and Joins”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_normalization)). If there is any discrepancy between another system and the system of record, then the value in the system of record is (by definition) the correct one.
用户可以查阅他们关注的人发布的推文300k 请求 / 秒)。 - Derived data systems
处理每秒 12,000 次写入(发推文的速率峰值)还是很简单的。然而推特的伸缩性挑战并不是主要来自推特量,而是来自 **扇出fan-out**[^ii]—— 每个用户关注了很多人,也被很多人关注。 Data in a derived system is the result of taking some existing data from another system and transforming or processing it in some way. If you lose derived data, you can recreate it from the original source. A classic example is a cache: data can be served from the cache if present, but if the cache doesnt contain what you need, you can fall back to the underlying database. Denormalized values, indexes, materialized views, transformed data representations, and models trained on a dataset also fall into this category.
[^ii]: 扇出:从电子工程学中借用的术语,它描述了输入连接到另一个门输出的逻辑门数量。输出需要提供足够的电流来驱动所有连接的输入。在事务处理系统中,我们使用它来描述为了服务一个传入请求而需要执行其他服务的请求数量。 Technically speaking, derived data is *redundant*, in the sense that it duplicates existing information. However, it is often essential for getting good performance on read queries. You can derive several different datasets from a single source, enabling you to look at the data from different “points of view.”
大体上讲,这一对操作有两种实现方式。 Analytical systems are usually derived data systems, because they are consumers of data created elsewhere. Operational services may contain a mixture of systems of record and derived data systems. The systems of record are the primary databases to which data is first written, whereas the derived data systems are the indexes and caches that speed up common read operations, especially for queries that the system of record cannot answer efficiently.
1. 发布推文时,只需将新推文插入全局推文集合即可。当一个用户请求自己的主页时间线时,首先查找他关注的所有人,查询这些被关注用户发布的推文并按时间顺序合并。在如 [图 1-2](img/fig1-2.png) 所示的关系型数据库中,可以编写这样的查询: Most databases, storage engines, and query languages are not inherently a system of record or a derived system. A database is just a tool: how you use it is up to you. The distinction between system of record and derived data system depends not on the tool, but on how you use it in your application. By being clear about which data is derived from which other data, you can bring clarity to an otherwise confusing system architecture.
```sql When the data in one system is derived from the data in another, you need a process for updating the derived data when the original in the system of record changes. Unfortunately, many databases are designed based on the assumption that your application only ever needs to use that one database, and they do not make it easy to integrate multiple systems in order to propagate such updates. In [Link to Come] we will discuss approaches to *data integration*, which allow us to compose multiple data systems to achieve things that one system alone cannot do.
SELECT tweets.*, users.*
FROM tweets
JOIN users ON tweets.sender_id = users.id
JOIN follows ON follows.followee_id = users.id
WHERE follows.follower_id = current_user
```
![](img/fig1-2.png) That brings us to the end of our comparison of analytics and transaction processing. In the next section, we will examine another trade-off that you might have already seen debated multiple times.
**图 1-2 推特主页时间线的关系型模式简单实现** # Cloud versus Self-Hosting
2. 为每个用户的主页时间线维护一个缓存,就像每个用户的推文收件箱([图 1-3](img/fig1-3.png))。当一个用户发布推文时,查找所有关注该用户的人,并将新的推文插入到每个主页时间线缓存中。因此读取主页时间线的请求开销很小,因为结果已经提前计算好了。 With anything that an organization needs to do, one of the first questions is: should it be done in-house, or should it be outsourced? Should you build or should you buy?
![](img/fig1-3.png) Ultimately, this is a question about business priorities. The received management wisdom is that things that are a core competency or a competitive advantage of your organization should be done in-house, whereas things that are non-core, routine, or commonplace should be left to a vendor [[19](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Fournier2021)]. To give an extreme example, most companies do not generate their own electricity (unless they are an energy company, and leaving aside emergency backup power), since it is cheaper to buy electricity from the grid.
**图 1-3 用于分发推特至关注者的数据流水线2012 年 11 月的负载参数【16】** With software, two important decisions to be made are who builds the software and who deploys it. There is a spectrum of possibilities that outsource each decision to various degrees, as illustrated in [Figure 1-2](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#fig_cloud_spectrum). At one extreme is bespoke software that you write and run in-house; at the other extreme are widely-used cloud services or Software as a Service (SaaS) products that are implemented and operated by an external vendor, and which you only access through a web interface or API.
推特的第一个版本使用了方法 1但系统很难跟上主页时间线查询的负载。所以公司转向了方法 2方法 2 的效果更好,因为发推频率比查询主页时间线的频率几乎低了两个数量级,所以在这种情况下,最好在写入时做更多的工作,而在读取时做更少的工作。 ![ddia 0101](img/ddia_0101.png)
然而方法 2 的缺点是,发推现在需要大量的额外工作。平均来说,一条推文会发往约 75 个关注者,所以每秒 4.6k 的发推写入,变成了对主页时间线缓存每秒 345k 的写入。但这个平均值隐藏了用户粉丝数差异巨大这一现实,一些用户有超过 3000 万的粉丝,这意味着一条推文就可能会导致主页时间线缓存的 3000 万次写入!及时完成这种操作是一个巨大的挑战 —— 推特尝试在 5 秒内向粉丝发送推文。 ###### Figure 1-2. A spectrum of types of software and its operations.
在推特的例子中,每个用户粉丝数的分布(可能按这些用户的发推频率来加权)是探讨可伸缩性的一个关键负载参数,因为它决定了扇出负载。你的应用程序可能具有非常不同的特征,但可以采用相似的原则来考虑它的负载。 The middle ground is off-the-shelf software (open source or commercial) that you *self-host*, i.e., deploy yourself—for example, if you download MySQL and install it on a server you control. This could be on your own hardware (often called *on-premises*, even if the server is actually in a rented datacenter rack and not literally on your own premises), or on a virtual machine in the cloud (*Infrastructure as a Service* or IaaS). There are still more points along this spectrum, e.g., taking open source software and running a modified version of it.
推特轶事的最终转折:现在已经稳健地实现了方法 2推特逐步转向了两种方法的混合。大多数用户发的推文会被扇出写入其粉丝主页时间线缓存中。但是少数拥有海量粉丝的用户即名流会被排除在外。当用户读取主页时间线时分别地获取出该用户所关注的每位名流的推文再与用户的主页时间线缓存合并如方法 1 所示。这种混合方法能始终如一地提供良好性能。在 [第十二章](ch12.md) 中我们将重新讨论这个例子,这在覆盖更多技术层面之后。 Seperately from this spectrum there is also the question of *how* you deploy services, either in the cloud or on-premises—for example, whether you use an orchestration framework such as Kubernetes. However, choice of deployment tooling is out of scope of this book, since other factors have a greater influence on the architecture of data systems.
### 描述性能 ## Pros and Cons of Cloud Services
一旦系统的负载被描述好,就可以研究当负载增加会发生什么。我们可以从两种角度来看: Using a cloud service, rather than running comparable software yourself, essentially outsources the operation of that software to the cloud provider. There are good arguments for and against cloud services. Cloud providers claim that using their services saves you time and money, and allows you to move faster compared to setting up your own infrastructure.
* 增加负载参数并保持系统资源CPU、内存、网络带宽等不变时系统性能将受到什么影响 Whether a cloud service is actually cheaper and easier than self-hosting depends very much on your skills and the workload on your systems. If you already have experience setting up and operating the systems you need, and if your load is quite predictable (i.e., the number of machines you need does not fluctuate wildly), then its often cheaper to buy your own machines and run the software on them yourself [[20](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#HeinemeierHansson2022), [21](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Badizadegan2022)].
* 增加负载参数并希望保持性能不变时,需要增加多少系统资源?
这两个问题都需要性能数据,所以让我们简单地看一下如何描述系统性能。 On the other hand, if you need a system that you dont already know how to deploy and operate, then adopting a cloud service is often easier and quicker than learning to manage the system yourself. If you have to hire and train staff specifically to maintain and operate the system, that can get very expensive. You still need an operations team when youre using the cloud (see [“Operations in the Cloud Era”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_operations)), but outsourcing the basic system administration can free up your team to focus on higher-level concerns.
对于 Hadoop 这样的批处理系统,通常关心的是 **吞吐量throughput**,即每秒可以处理的记录数量,或者在特定规模数据集上运行作业的总时间 [^iii]。对于在线系统,通常更重要的是服务的 **响应时间response time**,即客户端发送请求到接收响应之间的时间。 When you outsource the operation of a system to a company that specializes in running that service, that can potentially result in a better service, since the provider gains operational expertise from providing the service to many customers. On the other hand, if you run the service yourself, you can configure and tune it to perform well on your particular workload; it is unlikely that a cloud service would be willing to make such customizations on your behalf.
[^iii]: 理想情况下,批量作业的运行时间是数据集的大小除以吞吐量。在实践中由于数据倾斜(数据不是均匀分布在每个工作进程中),需要等待最慢的任务完成,所以运行时间往往更长。 Cloud services are particularly valuable if the load on your systems varies a lot over time. If you provision your machines to be able to handle peak load, but those computing resources are idle most of the time, the system becomes less cost-effective. In this situation, cloud services have the advantage that they can make it easier to scale your computing resources up or down in response to changes in demand.
> #### 延迟和响应时间 For example, analytics systems often have extremely variable load: running a large analytical query quickly requires a lot of computing resources in parallel, but once the query completes, those resources sit idle until the user makes the next query. Predefined queries (e.g., for daily reports) can be enqueued and scheduled to smooth out the load, but for interactive queries, the faster you want them to complete, the more variable the workload becomes. If your dataset is so large that querying it quickly requires significant computing resources, using the cloud can save money, since you can return unused resources to the provider rather than leaving them idle. For smaller datasets, this difference is less significant.
>
> **延迟latency****响应时间response time** 经常用作同义词,但实际上它们并不一样。响应时间是客户所看到的,除了实际处理请求的时间( **服务时间service time** )之外,还包括网络延迟和排队延迟。延迟是某个请求等待处理的 **持续时长**,在此期间它处于 **休眠latent** 状态并等待服务【17】。
即使不断重复发送同样的请求,每次得到的响应时间也都会略有不同。现实世界的系统会处理各式各样的请求,响应时间可能会有很大差异。因此我们需要将响应时间视为一个可以测量的数值 **分布distribution**,而不是单个数值。 The biggest downside of a cloud service is that you have no control over it:
在 [图 1-4](img/fig1-4.png) 中,每个灰条代表一次对服务的请求,其高度表示请求花费了多长时间。大多数请求是相当快的,但偶尔会出现需要更长的时间的异常值。这也许是因为缓慢的请求实质上开销更大,例如它们可能会处理更多的数据。但即使(你认为)所有请求都花费相同时间的情况下,随机的附加延迟也会导致结果变化,例如:上下文切换到后台进程,网络数据包丢失与 TCP 重传垃圾收集暂停强制从磁盘读取的页面错误服务器机架中的震动【18】还有很多其他原因。 - If it is lacking a feature you need, all you can do is to politely ask the vendor whether they will add it; you generally cannot implement it yourself.
- If the service goes down, all you can do is to wait for it to recover.
- If you are using the service in a way that triggers a bug or causes performance problems, it will be difficult for you to diagnose the issue. With software that you run yourself, you can get performance metrics and debugging information from the operating system to help you understand its behavior, and you can look at the server logs, but with a service hosted by a vendor you usually do not have access to these internals.
- Moreover, if the service shuts down or becomes unacceptably expensive, or if the vendor decides to change their product in a way you dont like, you are at their mercy—continuing to run an old version of the software is usually not an option, so you will be forced to migrate to an alternative service [[22](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Yegge2020)]. This risk is mitigated if there are alternative services that expose a compatible API, but for many cloud services there are no standard APIs, which raises the cost of switching, making vendor lock-in a problem.
![](img/fig1-4.png) Despite all these risks, it has become more and more popular for organizations to build new applications on top of cloud services. However, cloud services will not subsume all in-house data systems: many older systems predate the cloud, and for any services that have specialist requirements that existing cloud services cannot meet, in-house systems remain necessary. For example, very latency-sensitive applications such as high-frequency trading require full control of the hardware.
**图 1-4 展示了一个服务 100 次请求响应时间的均值与百分位数** ## Cloud-Native System Architecture
通常报表都会展示服务的平均响应时间。(严格来讲 “平均” 一词并不指代任何特定公式,但实际上它通常被理解为 **算术平均值arithmetic mean**:给定 n 个值,加起来除以 n )。然而如果你想知道 “**典型typical**” 响应时间,那么平均值并不是一个非常好的指标,因为它不能告诉你有多少用户实际上经历了这个延迟。 Besides having a different economic model (subscribing to a service instead of buying hardware and licensing software to run on it), the rise of the cloud has also had a profound effect on how data systems are implemented on a technical level. The term *cloud-native* is used to describe an architecture that is designed to take advantage of cloud services.
通常使用 **百分位点percentiles** 会更好。如果将响应时间列表按最快到最慢排序,那么 **中位数median** 就在正中间:举个例子,如果你的响应时间中位数是 200 毫秒,这意味着一半请求的返回时间少于 200 毫秒,另一半比这个要长。 In principle, almost any software that you can self-host could also be provided as a cloud service, and indeed such managed services are now available for many popular data systems. However, systems that have been designed from the ground up to be cloud-native have been shown to have several advantages: better performance on the same hardware, faster recovery from failures, being able to quickly scale computing resources to match the load, and supporting larger datasets [[23](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Verbitski2017), [24](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Antonopoulos2019_ch1), [25](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Vuppalapati2020)]. [Table 1-2](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#tab_cloud_native_dbs) lists some examples of both types of systems.
如果想知道典型场景下用户需要等待多长时间,那么中位数是一个好的度量标准:一半用户请求的响应时间少于响应时间的中位数,另一半服务时间比中位数长。中位数也被称为第 50 百分位点,有时缩写为 p50。注意中位数是关于单个请求的如果用户同时发出几个请求在一个会话过程中或者由于一个页面中包含了多个资源则至少一个请求比中位数慢的概率远大于 50%。 | Category | Self-hosted systems | Cloud-native systems |
| :--------------- | :-------------------------- | :----------------------------------------------------------- |
| Operational/OLTP | MySQL, PostgreSQL, MongoDB | AWS Aurora [[23](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Verbitski2017)], Azure SQL DB Hyperscale [[24](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Antonopoulos2019_ch1)], Google Cloud Spanner |
| Analytical/OLAP | Teradata, ClickHouse, Spark | Snowflake [[25](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Vuppalapati2020)], Google BigQuery, Azure Synapse Analytics |
为了弄清异常值有多糟糕,可以看看更高的百分位点,例如第 95、99 和 99.9 百分位点(缩写为 p95p99 和 p999。它们意味着 95%、99% 或 99.9% 的请求响应时间要比该阈值快,例如:如果第 95 百分位点响应时间是 1.5 秒,则意味着 100 个请求中的 95 个响应时间快于 1.5 秒,而 100 个请求中的 5 个响应时间超过 1.5 秒。如 [图 1-4](img/fig1-4.png) 所示。 ### Layering of cloud services
响应时间的高百分位点(也称为 **尾部延迟**,即 **tail latencies**)非常重要,因为它们直接影响用户的服务体验。例如亚马逊在描述内部服务的响应时间要求时是以 99.9 百分位点为准,即使它只影响一千个请求中的一个。这是因为请求响应最慢的客户往往也是数据最多的客户,也可以说是最有价值的客户 —— 因为他们掏钱了【19】。保证网站响应迅速对于保持客户的满意度非常重要亚马逊观察到响应时间增加 100 毫秒,销售量就减少 1%【20】而另一些报告说慢 1 秒钟会让客户满意度指标减少 16%【2122】。 Many self-hosted data systems have very simple system requirements: they run on a conventional operating system such as Linux or Windows, they store their data as files on the filesystem, and they communicate via standard network protocols such as TCP/IP. A few systems depend on special hardware such as GPUs (for machine learning) or RDMA network interfaces, but on the whole, self-hosted software tends to use very generic computing resources: CPU, RAM, a filesystem, and an IP network.
另一方面,优化第 99.99 百分位点(一万个请求中最慢的一个)被认为太昂贵了,不能为亚马逊的目标带来足够好处。减小高百分位点处的响应时间相当困难,因为它很容易受到随机事件的影响,这超出了控制范围,而且效益也很小。 In a cloud, this type of software can be run on an Infrastructure-as-a-Service environment, using one or more virtual machines (or *instances*) with a certain allocation of CPUs, memory, disk, and network bandwidth. Compared to physical machines, cloud instances can be provisioned faster and they come in a greater variety of sizes, but otherwise they are similar to a traditional computer: you can run any software you like on it, but you are responsible for administering it yourself.
百分位点通常用于 **服务级别目标SLO, service level objectives****服务级别协议SLA, service level agreements**即定义服务预期性能和可用性的合同。SLA 可能会声明,如果服务响应时间的中位数小于 200 毫秒,且 99.9 百分位点低于 1 秒,则认为服务工作正常(如果响应时间更长,就认为服务不达标)。这些指标为客户设定了期望值,并允许客户在 SLA 未达标的情况下要求退款。 In contrast, the key idea of cloud-native services is to use not only the computing resources managed by your operating system, but also to build upon lower-level cloud services to create higher-level services. For example:
**排队延迟queueing delay** 通常占了高百分位点处响应时间的很大一部分。由于服务器只能并行处理少量的事务(如受其 CPU 核数的限制),所以只要有少量缓慢的请求就能阻碍后续请求的处理,这种效应有时被称为 **头部阻塞head-of-line blocking** 。即使后续请求在服务器上处理的非常迅速,由于需要等待先前请求完成,客户端最终看到的是缓慢的总体响应时间。因为存在这种效应,测量客户端的响应时间非常重要。 - *Object storage* services such as Amazon S3, Azure Blob Storage, and Cloudflare R2 store large files. They provide more limited APIs than a typical filesystem (basic file reads and writes), but they have the advantage that they hide the underlying physical machines: the service automatically distributes the data across many machines, so that you dont have to worry about running out of disk space on any one machine. Even if some machines or their disks fail entirely, no data is lost.
- Many other services are in turn built upon object storage and other cloud services: for example, Snowflake is a cloud-based analytic database (data warehouse) that relies on S3 for data storage [[25](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Vuppalapati2020)], and some other services in turn build upon Snowflake.
为测试系统的可伸缩性而人为产生负载时产生负载的客户端要独立于响应时间不断发送请求。如果客户端在发送下一个请求之前等待先前的请求完成这种行为会产生人为排队的效果使得测试时的队列比现实情况更短使测量结果产生偏差【23】。 As always with abstractions in computing, there is no one right answer to what you should use. As a general rule, higher-level abstractions tend to be more oriented towards particular use cases. If your needs match the situations for which a higher-level system is designed, using the existing higher-level system will probably provide what you need with much less hassle than building it yourself from lower-level systems. On the other hand, if there is no high-level system that meets your needs, then building it yourself from lower-level components is the only option.
> #### 实践中的百分位点 ### Separation of storage and compute
>
> 在多重调用的后端服务里,高百分位数变得特别重要。即使并行调用,最终用户请求仍然需要等待最慢的并行调用完成。如 [图 1-5](img/fig1-5.png) 所示,只需要一个缓慢的调用就可以使整个最终用户请求变慢。即使只有一小部分后端调用速度较慢,如果最终用户请求需要多个后端调用,则获得较慢调用的机会也会增加,因此较高比例的最终用户请求速度会变慢(该效果称为尾部延迟放大,即 tail latency amplification【24】
>
> 如果你想将响应时间百分点添加到你的服务的监视仪表板则需要持续有效地计算它们。例如你可以使用滑动窗口来跟踪连续10分钟内的请求响应时间。每一分钟你都会计算出该窗口中的响应时间中值和各种百分数并将这些度量值绘制在图上。
>
> 简单的实现是在时间窗口内保存所有请求的响应时间列表,并且每分钟对列表进行排序。如果对你来说效率太低,那么有一些算法能够以最小的 CPU 和内存成本如前向衰减【25】、t-digest【26】或 HdrHistogram 【27】来计算百分位数的近似值。请注意平均百分比例如减少时间分辨率或合并来自多台机器的数据在数学上没有意义 - 聚合响应时间数据的正确方法是添加直方图【28】。
![](img/fig1-5.png) In traditional computing, disk storage is regarded as durable (we assume that once something is written to disk, it will not be lost); to tolerate the failure of an individual hard disk, RAID is often used to maintain copies of the data on several disks. In the cloud, compute instances (virtual machines) may also have local disks attached, but cloud-native systems typically treat these disks more like an ephemeral cache, and less like long-term storage. This is because the local disk becomes inaccessible if the associated instance fails, or if the instance is replaced with a bigger or a smaller one (on a different physical machine) in order to adapt to changes in load.
**图 1-5 当一个请求需要多个后端请求时,单个后端慢请求就会拖慢整个终端用户的请求** As an alternative to local disks, cloud services also offer virtual disk storage that can be detached from one instance and attached to a different one (Amazon EBS, Azure managed disks, and persistent disks in Google Cloud). Such a virtual disk is not actually a physical disk, but rather a cloud service provided by a separate set of machines, which emulates the behavior of a disk (a *block device*, where each block is typically 4 KiB in size). This technology makes it possible to run traditional disk-based software in the cloud, but it often suffers from poor performance and poor scalability [[23](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Verbitski2017)].
### 应对负载的方法 To address this problem, cloud-native services generally avoid using virtual disks, and instead build on dedicated storage services that are optimized for particular workloads. Object storage services such as S3 are designed for long-term storage of fairly large files, ranging from hundreds of kilobytes to several gigabytes in size. The individual rows or values stored in a database are typically much smaller than this; cloud databases therefore typically manage smaller values in a separate service, and store larger data blocks (containing many individual values) in an object store [[24](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Antonopoulos2019_ch1)].
现在我们已经讨论了用于描述负载的参数和用于衡量性能的指标。可以开始认真讨论可伸缩性了:当负载参数增加时,如何保持良好的性能? In a traditional systems architecture, the same computer is responsible for both storage (disk) and computation (CPU and RAM), but in cloud-native systems, these two responsibilities have become somewhat separated or *disaggregated* [[8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Prout2022), [25](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Vuppalapati2020), [26](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Shapira2023), [27](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Murthy2022)]: for example, S3 only stores files, and if you want to analyze that data, you will have to run the analysis code somewhere outside of S3. This implies transferring the data over the network, which we will discuss further in [“Distributed versus Single-Node Systems”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_distributed).
适应某个级别负载的架构不太可能应付 10 倍于此的负载。如果你正在开发一个快速增长的服务,那么每次负载发生数量级的增长时,你可能都需要重新考虑架构 —— 或者更频繁。 Moreover, cloud-native systems are often *multitenant*, which means that rather than having a separate machine for each customer, data and computation from several different customers are handled on the same shared hardware by the same service [[28](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Vanlightly2023)]. Multitenancy can enable better hardware utilization, easier scalability, and easier management by the cloud provider, but it also requires careful engineering to ensure that one customers activity does not affect the performance or security of the system for other customers [[29](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Jonas2019)].
人们经常讨论 **纵向伸缩**scaling up也称为垂直伸缩即 vertical scaling转向更强大的机器**横向伸缩**scaling out也称为水平伸缩即 horizontal scaling将负载分布到多台小机器上之间的对立。跨多台机器分配负载也称为 “**无共享shared-nothing**” 架构。可以在单台机器上运行的系统通常更简单,但高端机器可能非常贵,所以非常密集的负载通常无法避免地需要横向伸缩。现实世界中的优秀架构需要将这两种方法务实地结合,因为使用几台足够强大的机器可能比使用大量的小型虚拟机更简单也更便宜。 ## Operations in the Cloud Era
有些系统是 **弹性elastic** 的,这意味着可以在检测到负载增加时自动增加计算资源,而其他系统则是手动伸缩(人工分析容量并决定向系统添加更多的机器)。如果负载 **极难预测highly unpredictable**,则弹性系统可能很有用,但手动伸缩系统更简单,并且意外操作可能会更少(请参阅 “[分区再平衡](ch6.md#分区再平衡)”)。 Traditionally, the people managing an organizations server-side data infrastructure were known as *database administrators* (DBAs) or *system administrators* (sysadmins). More recently, many organizations have tried to integrate the roles of software development and operations into teams with a shared responsibility for both backend services and data infrastructure; the *DevOps* philosophy has guided this trend. *Site Reliability Engineers* (SREs) are Googles implementation of this idea [[30](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Beyer2016)].
跨多台机器部署 **无状态服务stateless services** 非常简单,但将带状态的数据系统从单节点变为分布式配置则可能引入许多额外复杂度。出于这个原因,常识告诉我们应该将数据库放在单个节点上(纵向伸缩),直到伸缩成本或可用性需求迫使其改为分布式。 The role of operations is to ensure services are reliably delivered to users (including configuring infrastructure and deploying applications), and to ensure a stable production environment (including monitoring and diagnosing any problems that may affect reliability). For self-hosted systems, operations traditionally involves a significant amount of work at the level of individual machines, such as capacity planning (e.g., monitoring available disk space and adding more disks before you run out of space), provisioning new machines, moving services from one machine to another, and installing operating system patches.
随着分布式系统的工具和抽象越来越好,至少对于某些类型的应用而言,这种常识可能会改变。可以预见分布式数据系统将成为未来的默认设置,即使对不处理大量数据或流量的场景也如此。本书的其余部分将介绍多种分布式数据系统,不仅讨论它们在可伸缩性方面的表现,还包括易用性和可维护性。 Many cloud services present an API that hides the individual machines that actually implement the service. For example, cloud storage replaces fixed-size disks with *metered billing*, where you can store data without planning your capacity needs in advance, and you are then charged based on the space actually used. Moreover, many cloud services remain highly available, even when individual machines have failed (see [“Reliability and Fault Tolerance”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch02.html#sec_introduction_reliability)).
大规模的系统架构通常是应用特定的 —— 没有一招鲜吃遍天的通用可伸缩架构(不正式的叫法:**万金油magic scaling sauce** )。应用的问题可能是读取量、写入量、要存储的数据量、数据的复杂度、响应时间要求、访问模式或者所有问题的大杂烩。 This shift in emphasis from individual machines to services has been accompanied by a change in the role of operations. The high-level goal of providing a reliable service remains the same, but the processes and tools have evolved. The DevOps/SRE philosophy places greater emphasis on:
举个例子,用于处理每秒十万个请求(每个大小为 1 kB的系统与用于处理每分钟 3 个请求(每个大小为 2GB的系统看上去会非常不一样尽管两个系统有同样的数据吞吐量。 - automation—preferring repeatable processes over manual one-off jobs,
- preferring ephemeral virtual machines and services over long running servers,
- enabling frequent application updates,
- learning from incidents, and
- preserving the organizations knowledge about the system, even as individual people come and go [[31](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Limoncelli2020)].
一个良好适配应用的可伸缩架构,是围绕着 **假设assumption** 建立的:哪些操作是常见的?哪些操作是罕见的?这就是所谓负载参数。如果假设最终是错误的,那么为伸缩所做的工程投入就白费了,最糟糕的是适得其反。在早期创业公司或非正式产品中,通常支持产品快速迭代的能力,要比可伸缩至未来的假想负载要重要的多。 With the rise of cloud services, there has been a bifurcation of roles: operations teams at infrastructure companies specialize in the details of providing a reliable service to a large number of customers, while the customers of the service spend as little time and effort as possible on infrastructure [[32](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Majors2020)].
尽管这些架构是应用程序特定的,但可伸缩的架构通常也是从通用的积木块搭建而成的,并以常见的模式排列。在本书中,我们将讨论这些构件和模式。 Customers of cloud services still require operations, but they focus on different aspects, such as choosing the most appropriate service for a given task, integrating different services with each other, and migrating from one service to another. Even though metered billing removes the need for capacity planning in the traditional sense, its still important to know what resources you are using for which purpose, so that you dont waste money on cloud resources that are not needed: capacity planning becomes financial planning, and performance optimization becomes cost optimization [[33](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Cherkasky2021)]. Moreover, cloud services do have resource limits or *quotas* (such as the maximum number of processes you can run concurrently), which you need to know about and plan for before you run into them [[34](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Kushchi2023)].
Adopting a cloud service can be easier and quicker than running your own infrastructure, although even here there is a cost in learning how to use it, and perhaps working around its limitations. Integration between different services becomes a particular challenge as a growing number of vendors offers an ever broader range of cloud services targeting different use cases [[35](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Bernhardsson2021), [36](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Stancil2021)]. ETL (see [“Data Warehousing”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_dwh)) is only part of the story; operational cloud services also need to be integrated with each other. At present, there is a lack of standards that would facilitate this sort of integration, so it often involves significant manual effort.
## 可维护性 Other operational aspects that cannot fully be outsourced to cloud services include maintaining the security of an application and the libraries it uses, managing the interactions between your own services, monitoring the load on your services, and tracking down the cause of problems such as performance degradations or outages. While the cloud is changing the role of operations, the need for operations is as great as ever.
众所周知,软件的大部分开销并不在最初的开发阶段,而是在持续的维护阶段,包括修复漏洞、保持系统正常运行、调查失效、适配新的平台、为新的场景进行修改、偿还技术债和添加新的功能。 # Distributed versus Single-Node Systems
不幸的是,许多从事软件系统行业的人不喜欢维护所谓的 **遗留legacy** 系统,—— 也许因为涉及修复其他人的错误、和过时的平台打交道,或者系统被迫使用于一些份外工作。每一个遗留系统都以自己的方式让人不爽,所以很难给出一个通用的建议来和它们打交道。 A system that involves several machines communicating via a network is called a *distributed system*. Each of the processes participating in a distributed system is called a *node*. There are various reasons why you might want a system to be distributed:
但是我们可以,也应该以这样一种方式来设计软件:在设计之初就尽量考虑尽可能减少维护期间的痛苦,从而避免自己的软件系统变成遗留系统。为此,我们将特别关注软件系统的三个设计原则: - Inherently distributed systems
* 可操作性Operability If an application involves two or more interacting users, each using their own device, then the system is unavoidably distributed: the communication between the devices will have to go via a network.
便于运维团队保持系统平稳运行。 - Requests between cloud services
* 简单性Simplicity If data is stored in one service but processed in another, it must be transferred over the network from one service to the other.
从系统中消除尽可能多的 **复杂度complexity**,使新工程师也能轻松理解系统(注意这和用户接口的简单性不一样)。 - Fault tolerance/high availability
* 可演化性evolvability If your application needs to continue working even if one machine (or several machines, or the network, or an entire datacenter) goes down, you can use multiple machines to give you redundancy. When one fails, another one can take over. See [“Reliability and Fault Tolerance”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch02.html#sec_introduction_reliability).
使工程师在未来能轻松地对系统进行更改,当需求变化时为新应用场景做适配。也称为 **可扩展性extensibility**、**可修改性modifiability** 或 **可塑性plasticity** - Scalability
和之前提到的可靠性、可伸缩性一样,实现这些目标也没有简单的解决方案。不过我们会试着想象具有可操作性,简单性和可演化性的系统会是什么样子。 If your data volume or computing requirements grow bigger than a single machine can handle, you can potentially spread the load across multiple machines. See [“Scalability”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch02.html#sec_introduction_scalability).
### 可操作性:人生苦短,关爱运维 - Latency
有人认为,“良好的运维经常可以绕开垃圾(或不完整)软件的局限性,而再好的软件摊上垃圾运维也没法可靠运行”。尽管运维的某些方面可以,而且应该是自动化的,但在最初建立正确运作的自动化机制仍然取决于人。 If you have users around the world, you might want to have servers at various locations worldwide so that each user can be served from a datacenter that is geographically close to them. That avoids the users having to wait for network packets to travel halfway around the world to answer their requests. See [“Describing Performance”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch02.html#sec_introduction_percentiles).
运维团队对于保持软件系统顺利运行至关重要。一个优秀运维团队的典型职责如下或者更多【29】 - Elasticity
* 监控系统的运行状况,并在服务状态不佳时快速恢复服务。 If your application is busy at some times and idle at other times, a cloud deployment can scale up or down to meet the demand, so that you pay only for resources you are actively using. This more difficult on a single machine, which needs to be provisioned to handle the maximum load, even at times when it is barely used.
* 跟踪问题的原因,例如系统故障或性能下降。
* 及时更新软件和平台,比如安全补丁。
* 了解系统间的相互作用,以便在异常变更造成损失前进行规避。
* 预测未来的问题,并在问题出现之前加以解决(例如,容量规划)。
* 建立部署、配置、管理方面的良好实践,编写相应工具。
* 执行复杂的维护任务,例如将应用程序从一个平台迁移到另一个平台。
* 当配置变更时,维持系统的安全性。
* 定义工作流程,使运维操作可预测,并保持生产环境稳定。
* 铁打的营盘流水的兵,维持组织对系统的了解。
良好的可操作性意味着更轻松的日常工作,进而运维团队能专注于高价值的事情。数据系统可以通过各种方式使日常任务更轻松: - Using specialized hardware
* 通过良好的监控,提供对系统内部状态和运行时行为的 **可见性visibility** Different parts of the system can take advantage of different types of hardware to match their workload. For example, an object store may use machines with many disks but few CPUs, whereas a data analysis system may use machines with lots of CPU and memory but no disks, and a machine learning system may use machines with GPUs (which are much more efficient than CPUs for training deep neural networks and other machine learning tasks).
* 为自动化提供良好支持,将系统与标准化工具相集成。
* 避免依赖单台机器(在整个系统继续不间断运行的情况下允许机器停机维护)。
* 提供良好的文档和易于理解的操作模型(“如果做 X会发生 Y”
* 提供良好的默认行为,但需要时也允许管理员自由覆盖默认值。
* 有条件时进行自我修复,但需要时也允许管理员手动控制系统状态。
* 行为可预测,最大限度减少意外。
- Legal compliance
### 简单性:管理复杂度 Some countries have data residency laws that require data about people in their jurisdiction to be stored and processed geographically within that country [[37](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Korolov2022)]. The scope of these rules varies—for example, in some cases it applies only to medical or financial data, while other cases are broader. A service with users in several such jurisdictions will therefore have to distribute their data across servers in several locations.
小型软件项目可以使用简单讨喜的、富表现力的代码,但随着项目越来越大,代码往往变得非常复杂,难以理解。这种复杂度拖慢了所有系统相关人员,进一步增加了维护成本。一个陷入复杂泥潭的软件项目有时被描述为 **烂泥潭a big ball of mud** 【30】。 These reasons apply both to services that you write yourself (application code) and services consisting of off-the-shelf software (such as databases).
**复杂度complexity** 有各种可能的症状,例如:状态空间激增、模块间紧密耦合、纠结的依赖关系、不一致的命名和术语、解决性能问题的 Hack、需要绕开的特例等等现在已经有很多关于这个话题的讨论【31,32,33】。 ## Problems with Distributed Systems
因为复杂度导致维护困难时,预算和时间安排通常会超支。在复杂的软件中进行变更,引入错误的风险也更大:当开发人员难以理解系统时,隐藏的假设、无意的后果和意外的交互就更容易被忽略。相反,降低复杂度能极大地提高软件的可维护性,因此简单性应该是构建系统的一个关键目标。 Distributed systems also have downsides. Every request and API call that goes via the network needs to deal with the possibility of failure: the network may be interrupted, or the service may be overloaded or crashed, and therefore any request may time out without receiving a response. In this case, we dont know whether the service received the request, and simply retrying it might not be safe. We will discuss these problems in detail in [Link to Come].
简化系统并不一定意味着减少功能;它也可以意味着消除 **额外的accidental** 的复杂度。Moseley 和 Marks【32】把 **额外复杂度** 定义为:由具体实现中涌现,而非(从用户视角看,系统所解决的)问题本身固有的复杂度。 Although datacenter networks are fast, making a call to another service is still vastly slower than calling a function in the same process [[38](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Nath2019)]. When operating on large volumes of data, rather than transferring the data from storage to a separate machine that processes it, it can be faster to bring the computation to the machine that already has the data [[39](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Hellerstein2019)]. More nodes are not always faster: in some cases, a simple single-threaded program on one computer can perform significantly better than a cluster with over 100 CPU cores [[40](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#McSherry2015_ch1)].
用于消除 **额外复杂度** 的最好工具之一是 **抽象abstraction**。一个好的抽象可以将大量实现细节隐藏在一个干净,简单易懂的外观下面。一个好的抽象也可以广泛用于各类不同应用。比起重复造很多轮子,重用抽象不仅更有效率,而且有助于开发高质量的软件。抽象组件的质量改进将使所有使用它的应用受益。 Troubleshooting a distributed system is often difficult: if the system is slow to respond, how do you figure out where the problem lies? Techniques for diagnosing problems in distributed systems are developed under the heading of *observability* [[41](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Sridharan2018), [42](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Majors2019)], which involves collecting data about the execution of a system, and allowing it to be queried in ways that allows both high-level metrics and individual events to be analyzed. *Tracing* tools such as OpenTelemetry allow you to track which client called which server for which operation, and how long each call took [[43](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Sigelman2010)].
例如高级编程语言是一种抽象隐藏了机器码、CPU 寄存器和系统调用。SQL 也是一种抽象,隐藏了复杂的磁盘 / 内存数据结构、来自其他客户端的并发请求、崩溃后的不一致性。当然在用高级语言编程时,我们仍然用到了机器码;只不过没有 **直接directly** 使用罢了,正是因为编程语言的抽象,我们才不必去考虑这些实现细节。 Databases provide various mechanisms for ensuring data consistency, as we shall see in [Link to Come] and [Link to Come]. However, when each service has its own database, maintaining consistency of data across those different services becomes the applications problem. Distributed transactions, which we explore in [Link to Come], are a possible technique for ensuring consistency, but they are rarely used in a microservices context because they run counter to the goal of making services independent from each other [[44](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Laigner2021)].
抽象可以帮助我们将系统的复杂度控制在可管理的水平,不过,找到好的抽象是非常困难的。在分布式系统领域虽然有许多好的算法,但我们并不清楚它们应该打包成什么样抽象。 For all these reasons, if you can do something on a single machine, this is often much simpler and easier compared to setting up a distributed system [[21](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Badizadegan2022)]. CPUs, memory, and disks have grown larger, faster, and more reliable. When combined with single-node databases such as DuckDB, SQLite, and KùzuDB, many workloads can now run on a single node. We will explore more on this topic in [Link to Come].
本书将紧盯那些允许我们将大型系统的部分提取为定义明确的、可重用的组件的优秀抽象。 ## Microservices and Serverless
### 可演化性:拥抱变化 The most common way of distributing a system across multiple machines is to divide them into clients and servers, and let the clients make requests to the servers. Most commonly HTTP is used for this communication, as we will discuss in [Link to Come]. The same process may be both a server (handling incoming requests) and a client (making outbound requests to other services).
系统的需求永远不变,基本是不可能的。更可能的情况是,它们处于常态的变化中,例如:你了解了新的事实、出现意想不到的应用场景、业务优先级发生变化、用户要求新功能、新平台取代旧平台、法律或监管要求发生变化、系统增长迫使架构变化等。 This way of building applications has traditionally been called a *service-oriented architecture* (SOA); more recently the idea has been refined into a *microservices* architecture [[45](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Newman2021_ch1), [46](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Richardson2014)]. In this architecture, a service has one well-defined purpose (for example, in the case of S3, this would be file storage); each service exposes an API that can be called by clients via the network, and each service has one team that is responsible for its maintenance. A complex application can thus be decomposed into multiple interacting services, each managed by a separate team.
在组织流程方面,**敏捷agile** 工作模式为适应变化提供了一个框架。敏捷社区还开发了对在频繁变化的环境中开发软件很有帮助的技术工具和模式,如 **测试驱动开发TDD, test-driven development****重构refactoring** There are several advantages to breaking down a complex piece of software into multiple services: each service can be updated independently, reducing coordination effort among teams; each service can be assigned the hardware resources it needs; and by hiding the implementation details behind an API, the service owners are free to change the implementation without affecting clients. In terms of data storage, it is common for each service to have its own databases, and not to share databases between services: sharing a database would effectively make the entire database structure a part of the services API, and then that structure would be difficult to change. Shared databases could also cause one services queries to negatively impact the performance of other services.
这些敏捷技术的大部分讨论都集中在相当小的规模(同一个应用中的几个代码文件)。本书将探索在更大数据系统层面上提高敏捷性的方法,可能由几个不同的应用或服务组成。例如,为了将装配主页时间线的方法从方法 1 变为方法 2你会如何 “重构” 推特的架构 On the other hand, having many services can itself breed complexity: each service requires infrastructure for deploying new releases, adjusting the allocated hardware resources to match the load, collecting logs, monitoring service health, and alerting an on-call engineer in the case of a problem. *Orchestration* frameworks such as Kubernetes have become a popular way of deploying services, since they provide a foundation for this infrastructure. Testing a service during development can be complicated, since you also need to run all the other services that it depends on.
修改数据系统并使其适应不断变化需求的容易程度,是与 **简单性****抽象性** 密切相关的:简单易懂的系统通常比复杂系统更容易修改。但由于这是一个非常重要的概念,我们将用一个不同的词来指代数据系统层面的敏捷性: **可演化性evolvability** 【34】。 Microservice APIs can be challenging to evolve. Clients that call an API expect the API to have certain fields. Developers might wish to add or remove fields to an API as business needs change, but doing so can cause clients to fail. Worse still, such failures are often not discovered until late in the development cycle when the updated service API is deployed to a staging or production environment. API description standards such as OpenAPI and gRPC help manage the relationship between client and server APIs; we discuss these further in [Link to Come].
Microservices are primarily a technical solution to a people problem: allowing different teams to make progress independently without having to coordinate with each other. This is valuable in a large company, but in a small company where there are not many teams, using microservices is likely to be unnecessary overhead, and it is preferable to implement the application in the simplest way possible [[45](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Newman2021_ch1)].
## 本章小结 *Serverless*, or *function-as-a-service* (FaaS), is another approach to deploying services, in which the management of the infrastructure is outsourced to a cloud vendor [[29](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Jonas2019)]. When using virtual machines, you have to explicitly choose when to start up or shut down an instance; in contrast, with the serverless model, the cloud provider automatically allocates and frees hardware resources as needed, based on the incoming requests to your service [[47](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Shahrad2020)]. The term “serverless” can misleading: each serverless function execution still runs on a server, but subsequent executions might run on a different one.
本章探讨了一些关于数据密集型应用的基本思考方式。这些原则将指导我们阅读本书的其余部分,那里将会深入技术细节。 Just like cloud storage replaced capacity planning (deciding in advance how many disks to buy) with a metered billing model, the serverless approach is bringing metered billing to code execution: you only pay for the time that your application code is actually running, rather than having to provision resources in advance.
一个应用必须满足各种需求才称得上有用。有一些 **功能需求**functional requirements即它应该做什么比如允许以各种方式存储检索搜索和处理数据以及一些 **非功能性需求**nonfunctional即通用属性例如安全性、可靠性、合规性、可伸缩性、兼容性和可维护性。在本章详细讨论了可靠性可伸缩性和可维护性。 ## Cloud Computing versus Supercomputing
Cloud computing is not the only way of building large-scale computing systems; an alternative is *high-performance computing* (HPC), also known as *supercomputing*. Although there are overlaps, HPC often has different priorities and uses different techniques compared to cloud computing and enterprise datacenter systems. Some of those differences are:
**可靠性Reliability** 意味着即使发生故障,系统也能正常工作。故障可能发生在硬件(通常是随机的和不相关的)、软件(通常是系统性的 Bug很难处理和人类不可避免地时不时出错。**容错技术** 可以对终端用户隐藏某些类型的故障。 - Supercomputers are typically used for computationally intensive scientific computing tasks, such as weather forecasting, molecular dynamics (simulating the movement of atoms and molecules), complex optimization problems, and solving partial differential equations. On the other hand, cloud computing tends to be used for online services, business data systems, and similar systems that need to serve user requests with high availability.
- A supercomputer typically runs large batch jobs that checkpoint the state of their computation to disk from time to time. If a node fails, a common solution is to simply stop the entire cluster workload, repair the faulty node, and then restart the computation from the last checkpoint [[48](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Barroso2018), [49](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Fiala2012)]. With cloud services, it is usually not desirable to stop the entire cluster, since the services need to continually serve users with minimal interruptions.
- Supercomputers are typically built from specialized hardware, where each node is quite reliable. Nodes in cloud services are usually built from commodity machines, which can provide equivalent performance at lower cost due to economies of scale, but which also have higher failure rates (see [“Hardware and Software Faults”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch02.html#sec_introduction_hardware_faults)).
- Supercomputer nodes typically communicate through shared memory and remote direct memory access (RDMA), which support high bandwidth and low latency, but assume a high level of trust among the users of the system [[50](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#KornfeldSimpson2020)]. In cloud computing, the network and the machines are often shared by mutually untrusting organizations, requiring stronger security mechanisms such as resource isolation (e.g., virtual machines), encryption and authentication.
- Cloud datacenter networks are often based on IP and Ethernet, arranged in Clos topologies to provide high bisection bandwidth—a commonly used measure of a networks overall performance [[48](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Barroso2018), [51](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Singh2015)]. Supercomputers often use specialized network topologies, such as multi-dimensional meshes and toruses [[52](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Lockwood2014)], which yield better performance for HPC workloads with known communication patterns.
- Cloud computing allows nodes to be distributed across multiple geographic locations, whereas supercomputers generally assume that all of their nodes are close together.
**可伸缩性Scalability** 意味着即使在负载增加的情况下也有保持性能的策略。为了讨论可伸缩性,我们首先需要定量描述负载和性能的方法。我们简要了解了推特主页时间线的例子,介绍描述负载的方法,并将响应时间百分位点作为衡量性能的一种方式。在可伸缩的系统中可以添加 **处理容量processing capacity** 以在高负载下保持可靠。 Large-scale analytics systems sometimes share some characteristics with supercomputing, which is why it can be worth knowing about these techniques if you are working in this area. However, this book is mostly concerned with services that need to be continually available, as discussed in [“Reliability and Fault Tolerance”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch02.html#sec_introduction_reliability).
**可维护性Maintainability** 有许多方面,但实质上是关于工程师和运维团队的生活质量的。良好的抽象可以帮助降低复杂度,并使系统易于修改和适应新的应用场景。良好的可操作性意味着对系统的健康状态具有良好的可见性,并拥有有效的管理手段。 # Data Systems, Law, and Society
不幸的是,使应用可靠、可伸缩或可维护并不容易。但是某些模式和技术会不断重新出现在不同的应用中。在接下来的几章中,我们将看到一些数据系统的例子,并分析它们如何实现这些目标。 So far youve seen in this chapter that the architecture of data systems is influenced not only by technical goals and requirements, but also by the human needs of the organizations that they support. Increasingly, data systems engineers are realizing that serving the needs of their own business is not enough: we also have a responsibility towards society at large.
在本书后面的 [第三部分](part-iii.md) 中,我们将看到一种模式:几个组件协同工作以构成一个完整的系统(如 [图 1-1](img/fig1-1.png) 中的例子) One particular concern are systems that store data about people and their behavior. Since 2018 the *General Data Protection Regulation* (GDPR) has given residents of many European countries greater control and legal rights over their personal data, and similar privacy regulation has been adopted in various other countries and states around the world, including for example the California Consumer Privacy Act (CCPA). Regulations around AI, such as the *EU AI Act*, place further restrictions on how personal data can be used.
Moreover, even in areas that are not directly subject to regulation, there is increasing recognition of the effects that computer systems have on people and society. Social media has changed how individuals consume news, which influences their political opinions and hence may affect the outcome of elections. Automated systems increasingly make decisions that have profound consequences for individuals, such as deciding who should be given a loan or insurance coverage, who should be invited to a job interview, or who should be suspected of a crime [[53](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#ONeil2016_ch1)].
## 参考文献 Everyone who works on such systems shares a responsibility for considering the ethical impact and ensuring that they comply with relevant law. It is not necessary for everybody to become an expert in law and ethics, but a basic awareness of legal and ethical principles is just as important as, say, some foundational knowledge in distributed systems.
1. Michael Stonebraker and Uğur Çetintemel: “['One Size Fits All': An Idea Whose Time Has Come and Gone](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.68.9136&rep=rep1&type=pdf),” at *21st International Conference on Data Engineering* (ICDE), April 2005. Legal considerations are influencing the very foundations of how data systems are being designed [[54](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Shastri2020)]. For example, the GDPR grants individuals the right to have their data erased on request (sometimes known as the *right to be forgotten*). However, as we shall see in this book, many data systems rely on immutable constructs such as append-only logs as part of their design; how can we ensure deletion of some data in the middle of a file that is supposed to be immutable? How do we handle deletion of data that has been incorporated into derived datasets (see [“Systems of Record and Derived Data”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_derived)), such as training data for machine learning models? Answering these questions creates new engineering challenges.
1. Walter L. Heimerdinger and Charles B. Weinstock: “[A Conceptual Framework for System Fault Tolerance](http://www.sei.cmu.edu/reports/92tr033.pdf),” Technical Report CMU/SEI-92-TR-033, Software Engineering Institute, Carnegie Mellon University, October 1992.
1. Ding Yuan, Yu Luo, Xin Zhuang, et al.: “[Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems](https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf),” at *11th USENIX Symposium on Operating Systems Design and Implementation* (OSDI), October 2014.
1. Yury Izrailevsky and Ariel Tseitlin: “[The Netflix Simian Army](http://techblog.netflix.com/2011/07/netflix-simian-army.html),” *techblog.netflix.com*, July 19, 2011.
1. Daniel Ford, François Labelle, Florentina I. Popovici, et al.: “[Availability in Globally Distributed Storage Systems](http://research.google.com/pubs/archive/36737.pdf),” at *9th USENIX Symposium on Operating Systems Design and Implementation* (OSDI), October 2010.
1. Brian Beach: “[Hard Drive Reliability Update Sep 2014](https://www.backblaze.com/blog/hard-drive-reliability-update-september-2014/),” *backblaze.com*, September 23, 2014.
1. Laurie Voss: “[AWS: The Good, the Bad and the Ugly](https://web.archive.org/web/20160429075023/http://blog.awe.sm/2012/12/18/aws-the-good-the-bad-and-the-ugly/),” *blog.awe.sm*, December 18, 2012.
1. Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, et al.: “[What Bugs Live in the Cloud?](http://ucare.cs.uchicago.edu/pdf/socc14-cbs.pdf),” at *5th ACM Symposium on Cloud Computing* (SoCC), November 2014. [doi:10.1145/2670979.2670986](http://dx.doi.org/10.1145/2670979.2670986)
1. Nelson Minar: “[Leap Second Crashes Half the Internet](http://www.somebits.com/weblog/tech/bad/leap-second-2012.html),” *somebits.com*, July 3, 2012.
1. Amazon Web Services: “[Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region](http://aws.amazon.com/message/65648/),” *aws.amazon.com*, April 29, 2011.
1. Richard I. Cook: “[How Complex Systems Fail](http://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf),” Cognitive Technologies Laboratory, April 2000.
1. Jay Kreps: “[Getting Real About Distributed System Reliability](http://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability),” *blog.empathybox.com*, March 19, 2012.
1. David Oppenheimer, Archana Ganapathi, and David A. Patterson: “[Why Do Internet Services Fail, and What Can Be Done About It?](http://static.usenix.org/legacy/events/usits03/tech/full_papers/oppenheimer/oppenheimer.pdf),” at *4th USENIX Symposium on Internet Technologies and Systems* (USITS), March 2003.
1. Nathan Marz: “[Principles of Software Engineering, Part 1](http://nathanmarz.com/blog/principles-of-software-engineering-part-1.html),” *nathanmarz.com*, April 2, 2013.
1. Michael Jurewitz:“[The Human Impact of Bugs](http://jury.me/blog/2013/3/14/the-human-impact-of-bugs),” *jury.me*, March 15, 2013.
1. Raffi Krikorian: “[Timelines at Scale](http://www.infoq.com/presentations/Twitter-Timeline-Scalability),” at *QCon San Francisco*, November 2012.
1. Martin Fowler: *Patterns of Enterprise Application Architecture*. Addison Wesley, 2002. ISBN: 978-0-321-12742-6
1. Kelly Sommers: “[After all that run around, what caused 500ms disk latency even when we replaced physical server?](https://twitter.com/kellabyte/status/532930540777635840)” *twitter.com*, November 13, 2014.
1. Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, et al.: “[Dynamo: Amazon's Highly Available Key-Value Store](http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf),” at *21st ACM Symposium on Operating Systems Principles* (SOSP), October 2007.
1. Greg Linden: “[Make Data Useful](http://glinden.blogspot.co.uk/2006/12/slides-from-my-talk-at-stanford.html),” slides from presentation at Stanford University Data Mining class (CS345), December 2006.
1. Tammy Everts: “[The Real Cost of Slow Time vs Downtime](http://www.webperformancetoday.com/2014/11/12/real-cost-slow-time-vs-downtime-slides/),” *webperformancetoday.com*, November 12, 2014.
1. Jake Brutlag:“[Speed Matters for Google Web Search](http://googleresearch.blogspot.co.uk/2009/06/speed-matters.html),” *googleresearch.blogspot.co.uk*, June 22, 2009.
1. Tyler Treat: “[Everything You Know About Latency Is Wrong](http://bravenewgeek.com/everything-you-know-about-latency-is-wrong/),” *bravenewgeek.com*, December 12, 2015.
1. Jeffrey Dean and Luiz André Barroso: “[The Tail at Scale](http://cacm.acm.org/magazines/2013/2/160173-the-tail-at-scale/fulltext),” *Communications of the ACM*, volume 56, number 2, pages 7480, February 2013. [doi:10.1145/2408776.2408794](http://dx.doi.org/10.1145/2408776.2408794)
1. Graham Cormode, Vladislav Shkapenyuk, Divesh Srivastava, and Bojian Xu: “[Forward Decay: A Practical Time Decay Model for Streaming Systems](http://dimacs.rutgers.edu/~graham/pubs/papers/fwddecay.pdf),” at *25th IEEE International Conference on Data Engineering* (ICDE), March 2009.
1. Ted Dunning and Otmar Ertl: “[Computing Extremely Accurate Quantiles Using t-Digests](https://github.com/tdunning/t-digest),” *github.com*, March 2014.
1. Gil Tene: “[HdrHistogram](http://www.hdrhistogram.org/),” *hdrhistogram.org*.
1. Baron Schwartz: “[Why Percentiles Dont Work the Way You Think](https://www.vividcortex.com/blog/why-percentiles-dont-work-the-way-you-think),” *vividcortex.com*, December 7, 2015.
1. James Hamilton: “[On Designing and Deploying Internet-Scale Services](https://www.usenix.org/legacy/events/lisa07/tech/full_papers/hamilton/hamilton.pdf),” at *21st Large Installation System Administration Conference* (LISA), November 2007.
1. Brian Foote and Joseph Yoder: “[Big Ball of Mud](http://www.laputan.org/pub/foote/mud.pdf),” at *4th Conference on Pattern Languages of Programs* (PLoP), September 1997.
1. Frederick P Brooks: “No Silver Bullet Essence and Accident in Software Engineering,” in *The Mythical Man-Month*, Anniversary edition, Addison-Wesley, 1995. ISBN: 978-0-201-83595-3
1. Ben Moseley and Peter Marks: “[Out of the Tar Pit](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.93.8928),” at *BCS Software Practice Advancement* (SPA), 2006.
1. Rich Hickey: “[Simple Made Easy](http://www.infoq.com/presentations/Simple-Made-Easy),” at *Strange Loop*, September 2011.
1. Hongyu Pei Breivold, Ivica Crnkovic, and Peter J. Eriksson: “[Analyzing Software Evolvability](http://www.mrtc.mdh.se/publications/1478.pdf),” at *32nd Annual IEEE International Computer Software and Applications Conference* (COMPSAC), July 2008. [doi:10.1109/COMPSAC.2008.50](http://dx.doi.org/10.1109/COMPSAC.2008.50)
At present we dont have clear guidelines on which particular technologies or system architectures should be considered “GDPR-compliant” or not. The regulation deliberately does not mandate particular technologies, because these may quickly change as technology progresses. Instead, the legal texts set out high-level principles that are subject to interpretation. This means that there are no simple answers to the question of how to comply with privacy regulation, but we will look at some of the technologies in this book through this lens.
------ In general, we store data because we think that its value is greater than the costs of storing it. However, it is worth remembering that the costs of storage are not just the bill you pay for Amazon S3 or another service: the cost-benefit calculation should also take into account the risks of liability and reputational damage if the data were to be leaked or compromised by adversaries, and the risk of legal costs and fines if the storage and processing of the data is found not to be compliant with the law.
| 上一章 | 目录 | 下一章 | Governments or police forces might also compel companies to hand over data. When there is a risk that the data may reveal criminalized behaviors (for example, homosexuality in several Middle Eastern and African countries, or seeking an abortion in several US states), storing that data creates real safety risks for users. Travel to an abortion clinic, for example, could easily be revealed by location data, perhaps even by a log of the users IP addresses over time (which indicate approximate location).
| ----------------------------------- | ------------------------------- | ------------------------------------ |
| [第一部分:数据系统基础](part-i.md) | [设计数据密集型应用](README.md) | [第二章:数据模型与查询语言](ch2.md) | Once all the risks are taken into account, it might be reasonable to decide that some data is simply not worth storing, and that it should therefore be deleted. This principle of *data minimization* (sometimes known by the German term *Datensparsamkeit*) runs counter to the “big data” philosophy of storing lots of data speculatively in case it turns out to be useful in the future [[55](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Datensparsamkeit)]. But it fits with the GDPR, which mandates that personal data many only be collected for a specified, explicit purpose, that this data may not later be used for any other purpose, and that the data must not be kept for longer than necessary for the purposes for which it was collected [[56](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#GDPR)].
Businesses have also taken notice of privacy and safety concerns. Credit card companies require payment processing businesses to adhere to strict payment card industry (PCI) standards. Processors undergo frequent evaluations from independent auditors to verify continued compliance. Software vendors have also seen increased scrutiny. Many buyers now require their vendors to comply with Service Organization Control (SOC) Type 2 standards. As with PCI compliance, vendors undergo third party audits to verify adherence.
Generally, it is important to balance the needs of your business against the needs of the people whose data you are collecting and processing. There is much more to this topic; in [Link to Come] we will go deeper into the topics of ethics and legal compliance, including the problems of bias and discrimination.
# Summary
The theme of this chapter has been to understand trade-offs: that is, to recognize that for many questions there is not one right answer, but several different approaches that each have various pros and cons. We explored some of the most important choices that affect the architecture of data systems, and introduced terminology that will be needed throughout the rest of this book.
We started by making a distinction between operational (transaction-processing, OLTP) and analytical (OLAP) systems, and saw their different characteristics: not only managing different types of data with different access patterns, but also serving different audiences. We encountered the concept of a data warehouse and data lake, which receive data feeds from operational systems via ETL. In [Link to Come] we will see that operational and analytical systems often use very different internal data layouts because of the different types of queries they need to serve.
We then compared cloud services, a comparatively recent development, to the traditional paradigm of self-hosted software that has previously dominated data systems architecture. Which of these approaches is more cost-effective depends a lot on your particular situation, but its undeniable that cloud-native approaches are bringing big changes to the way data systems are architected, for example in the way they separate storage and compute.
Cloud systems are intrinsically distributed, and we briefly examined some of the trade-offs of distributed systems compared to using a single machine. There are situations in which you cant avoid going distributed, but its advisable not to rush into making a system distributed if its possible to keep it on a single machine. In [Link to Come] and [Link to Come] we will cover the challenges with distributed systems in more detail.
Finally, we saw that data systems architecture is determined not only by the needs of the business deploying the system, but also by privacy regulation that protects the rights of the people whose data is being processed—an aspect that many engineers are prone to ignoring. How we translate legal requirements into technical implementations is not yet well understood, but its important to keep this question in mind as we move through the rest of this book.
##### Footnotes
##### References
[[1](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Kouzes2009-marker)] Richard T. Kouzes, Gordon A. Anderson, Stephen T. Elbert, Ian Gorton, and Deborah K. Gracio. [The Changing Paradigm of Data-Intensive Computing](http://www2.ic.uff.br/~boeres/slides_AP/papers/TheChanginParadigmDataIntensiveComputing_2009.pdf). *IEEE Computer*, volume 42, issue 1, January 2009. [doi:10.1109/MC.2009.26](https://doi.org/10.1109/MC.2009.26)
[[2](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Kleppmann2019-marker)] Martin Kleppmann, Adam Wiggins, Peter van Hardenberg, and Mark McGranaghan. [Local-first software: you own your data, in spite of the cloud](https://www.inkandswitch.com/local-first/). At *2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software* (Onward!), October 2019. [doi:10.1145/3359591.3359737](https://doi.org/10.1145/3359591.3359737)
[[3](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Reis2022-marker)] Joe Reis and Matt Housley. [*Fundamentals of Data Engineering*](https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/). OReilly Media, 2022. ISBN: 9781098108304
[[4](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Machado2023-marker)] Rui Pedro Machado and Helder Russa. [*Analytics Engineering with SQL and dbt*](https://www.oreilly.com/library/view/analytics-engineering-with/9781098142377/). OReilly Media, 2023. ISBN: 9781098142384
[[5](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Codd1993-marker)] Edgar F. Codd, S. B. Codd, and C. T. Salley. [Providing OLAP to User-Analysts: An IT Mandate](http://www.estgv.ipv.pt/PaginasPessoais/jloureiro/ESI_AID2007_2008/fichas/codd.pdf). E. F. Codd Associates, 1993. Archived at [perma.cc/RKX8-2GEE](https://perma.cc/RKX8-2GEE)
[[6](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Chaudhuri1997-marker)] Surajit Chaudhuri and Umeshwar Dayal. [An Overview of Data Warehousing and OLAP Technology](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/sigrecord.pdf). *ACM SIGMOD Record*, volume 26, issue 1, pages 6574, March 1997. [doi:10.1145/248603.248616](https://doi.org/10.1145/248603.248616)
[[7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Ozcan2017-marker)] Fatma Özcan, Yuanyuan Tian, and Pinar Tözün. [Hybrid Transactional/Analytical Processing: A Survey](https://humming80.github.io/papers/sigmod-htaptut.pdf). At *ACM International Conference on Management of Data* (SIGMOD), May 2017. [doi:10.1145/3035918.3054784](https://doi.org/10.1145/3035918.3054784)
[[8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Prout2022-marker)] Adam Prout, Szu-Po Wang, Joseph Victor, Zhou Sun, Yongzhu Li, Jack Chen, Evan Bergeron, Eric Hanson, Robert Walzer, Rodrigo Gomes, and Nikita Shamgunov. [Cloud-Native Transactions and Analytics in SingleStore](https://dl.acm.org/doi/abs/10.1145/3514221.3526055). At *International Conference on Management of Data* (SIGMOD), June 2022. [doi:10.1145/3514221.3526055](https://doi.org/10.1145/3514221.3526055)
[[9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Stonebraker2005fitsall-marker)] Michael Stonebraker and Uğur Çetintemel. [One Size Fits All: An Idea Whose Time Has Come and Gone](https://pages.cs.wisc.edu/~shivaram/cs744-readings/fits_all.pdf). At *21st International Conference on Data Engineering* (ICDE), April 2005. [doi:10.1109/ICDE.2005.1](https://doi.org/10.1109/ICDE.2005.1)
[[10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Cohen2009-marker)] Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph M Hellerstein, and Caleb Welton. [MAD Skills: New Analysis Practices for Big Data](http://www.vldb.org/pvldb/vol2/vldb09-219.pdf). *Proceedings of the VLDB Endowment*, volume 2, issue 2, pages 14811492, August 2009. [doi:10.14778/1687553.1687576](https://doi.org/10.14778/1687553.1687576)
[[11](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Olteanu2020-marker)] Dan Olteanu. [The Relational Data Borg is Learning](http://www.vldb.org/pvldb/vol13/p3502-olteanu.pdf). *Proceedings of the VLDB Endowment*, volume 13, issue 12, August 2020. [doi:10.14778/3415478.3415572](https://doi.org/10.14778/3415478.3415572)
[[12](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Bornstein2020-marker)] Matt Bornstein, Martin Casado, and Jennifer Li. [Emerging Architectures for Modern Data Infrastructure: 2020](https://future.a16z.com/emerging-architectures-for-modern-data-infrastructure-2020/). *future.a16z.com*, October 2020. Archived at [perma.cc/LF8W-KDCC](https://perma.cc/LF8W-KDCC)
[[13](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Fowler2015-marker)] Martin Fowler. [DataLake](https://www.martinfowler.com/bliki/DataLake.html). *martinfowler.com*, February 2015. Archived at [perma.cc/4WKN-CZUK](https://perma.cc/4WKN-CZUK)
[[14](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Johnson2015-marker)] Bobby Johnson and Joseph Adler. [The Sushi Principle: Raw Data Is Better](https://learning.oreilly.com/videos/strata-hadoop/9781491924143/9781491924143-video210840/). At *Strata+Hadoop World*, February 2015.
[[15](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Armbrust2021-marker)] Michael Armbrust, Ali Ghodsi, Reynold Xin, and Matei Zaharia. [Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics](https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf). At *11th Annual Conference on Innovative Data Systems Research* (CIDR), January 2021.
[[16](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#DataOps-marker)] DataKitchen, Inc. [The DataOps Manifesto](https://dataopsmanifesto.org/en/). *dataopsmanifesto.org*, 2017. Archived at [perma.cc/3F5N-FUQ4](https://perma.cc/3F5N-FUQ4)
[[17](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Manohar2021-marker)] Tejas Manohar. [What is Reverse ETL: A Definition & Why Its Taking Off](https://hightouch.io/blog/reverse-etl/). *hightouch.io*, November 2021. Archived at [perma.cc/A7TN-GLYJ](https://perma.cc/A7TN-GLYJ)
[[18](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#ORegan2018-marker)] Simon ORegan. [Designing Data Products](https://towardsdatascience.com/designing-data-products-b6b93edf3d23). *towardsdatascience.com*, August 2018. Archived at [perma.cc/HU67-3RV8](https://perma.cc/HU67-3RV8)
[[19](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Fournier2021-marker)] Camille Fournier. [Why is it so hard to decide to buy?](https://skamille.medium.com/why-is-it-so-hard-to-decide-to-buy-d86fee98e88e) *skamille.medium.com*, July 2021. Archived at [perma.cc/6VSG-HQ5X](https://perma.cc/6VSG-HQ5X)
[[20](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#HeinemeierHansson2022-marker)] David Heinemeier Hansson. [Why were leaving the cloud](https://world.hey.com/dhh/why-we-re-leaving-the-cloud-654b47e0). *world.hey.com*, October 2022. Archived at [perma.cc/82E6-UJ65](https://perma.cc/82E6-UJ65)
[[21](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Badizadegan2022-marker)] Nima Badizadegan. [Use One Big Server](https://specbranch.com/posts/one-big-server/). *specbranch.com*, August 2022. Archived at [perma.cc/M8NB-95UK](https://perma.cc/M8NB-95UK)
[[22](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Yegge2020-marker)] Steve Yegge. [Dear Google Cloud: Your Deprecation Policy is Killing You](https://steve-yegge.medium.com/dear-google-cloud-your-deprecation-policy-is-killing-you-ee7525dc05dc). *steve-yegge.medium.com*, August 2020. Archived at [perma.cc/KQP9-SPGU](https://perma.cc/KQP9-SPGU)
[[23](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Verbitski2017-marker)] Alexandre Verbitski, Anurag Gupta, Debanjan Saha, Murali Brahmadesam, Kamal Gupta, Raman Mittal, Sailesh Krishnamurthy, Sandor Maurice, Tengiz Kharatishvili, and Xiaofeng Bao. [Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases](https://media.amazonwebservices.com/blog/2017/aurora-design-considerations-paper.pdf). At *ACM International Conference on Management of Data* (SIGMOD), pages 10411052, May 2017. [doi:10.1145/3035918.3056101](https://doi.org/10.1145/3035918.3056101)
[[24](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Antonopoulos2019_ch1-marker)] Panagiotis Antonopoulos, Alex Budovski, Cristian Diaconu, Alejandro Hernandez Saenz, Jack Hu, Hanuma Kodavalla, Donald Kossmann, Sandeep Lingam, Umar Farooq Minhas, Naveen Prakash, Vijendra Purohit, Hugh Qu, Chaitanya Sreenivas Ravella, Krystyna Reisteter, Sheetal Shrotri, Dixin Tang, and Vikram Wakade. [Socrates: The New SQL Server in the Cloud](https://www.microsoft.com/en-us/research/uploads/prod/2019/05/socrates.pdf). At *ACM International Conference on Management of Data* (SIGMOD), pages 17431756, June 2019. [doi:10.1145/3299869.3314047](https://doi.org/10.1145/3299869.3314047)
[[25](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Vuppalapati2020-marker)] Midhul Vuppalapati, Justin Miron, Rachit Agarwal, Dan Truong, Ashish Motivala, and Thierry Cruanes. [Building An Elastic Query Engine on Disaggregated Storage](https://www.usenix.org/system/files/nsdi20-paper-vuppalapati.pdf). At *17th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), February 2020.
[[26](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Shapira2023-marker)] Gwen Shapira. [Compute-Storage Separation Explained](https://www.thenile.dev/blog/storage-compute). *thenile.dev*, January 2023. Archived at [perma.cc/QCV3-XJNZ](https://perma.cc/QCV3-XJNZ)
[[27](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Murthy2022-marker)] Ravi Murthy and Gurmeet Goindi. [AlloyDB for PostgreSQL under the hood: Intelligent, database-aware storage](https://cloud.google.com/blog/products/databases/alloydb-for-postgresql-intelligent-scalable-storage). *cloud.google.com*, May 2022. Archived at [archive.org](https://web.archive.org/web/20220514021120/https://cloud.google.com/blog/products/databases/alloydb-for-postgresql-intelligent-scalable-storage)
[[28](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Vanlightly2023-marker)] Jack Vanlightly. [The Architecture of Serverless Data Systems](https://jack-vanlightly.com/blog/2023/11/14/the-architecture-of-serverless-data-systems). *jack-vanlightly.com*, November 2023. Archived at [perma.cc/UDV4-TNJ5](https://perma.cc/UDV4-TNJ5)
[[29](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Jonas2019-marker)] Eric Jonas, Johann Schleier-Smith, Vikram Sreekanti, Chia-Che Tsai, Anurag Khandelwal, Qifan Pu, Vaishaal Shankar, Joao Carreira, Karl Krauth, Neeraja Yadwadkar, Joseph E Gonzalez, Raluca Ada Popa, Ion Stoica, David A Patterson. [Cloud Programming Simplified: A Berkeley View on Serverless Computing](https://arxiv.org/abs/1902.03383). *arxiv.org*, February 2019.
[[30](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Beyer2016-marker)] Betsy Beyer, Jennifer Petoff, Chris Jones, and Niall Richard Murphy. [*Site Reliability Engineering: How Google Runs Production Systems*](https://www.oreilly.com/library/view/site-reliability-engineering/9781491929117/). OReilly Media, 2016. ISBN: 9781491929124
[[31](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Limoncelli2020-marker)] Thomas Limoncelli. [The Time I Stole $10,000 from Bell Labs](https://queue.acm.org/detail.cfm?id=3434773). *ACM Queue*, volume 18, issue 5, November 2020. [doi:10.1145/3434571.3434773](https://doi.org/10.1145/3434571.3434773)
[[32](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Majors2020-marker)] Charity Majors. [The Future of Ops Jobs](https://acloudguru.com/blog/engineering/the-future-of-ops-jobs). *acloudguru.com*, August 2020. Archived at [perma.cc/GRU2-CZG3](https://perma.cc/GRU2-CZG3)
[[33](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Cherkasky2021-marker)] Boris Cherkasky. [(Over)Pay As You Go for Your Datastore](https://medium.com/riskified-technology/over-pay-as-you-go-for-your-datastore-11a29ae49a8b). *medium.com*, September 2021. Archived at [perma.cc/Q8TV-2AM2](https://perma.cc/Q8TV-2AM2)
[[34](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Kushchi2023-marker)] Shlomi Kushchi. [Serverless Doesnt Mean DevOpsLess or NoOps](https://thenewstack.io/serverless-doesnt-mean-devopsless-or-noops/). *thenewstack.io*, February 2023. Archived at [perma.cc/3NJR-AYYU](https://perma.cc/3NJR-AYYU)
[[35](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Bernhardsson2021-marker)] Erik Bernhardsson. [Storm in the stratosphere: how the cloud will be reshuffled](https://erikbern.com/2021/11/30/storm-in-the-stratosphere-how-the-cloud-will-be-reshuffled.html). *erikbern.com*, November 2021. Archived at [perma.cc/SYB2-99P3](https://perma.cc/SYB2-99P3)
[[36](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Stancil2021-marker)] Benn Stancil. [The data OS](https://benn.substack.com/p/the-data-os). *benn.substack.com*, September 2021. Archived at [perma.cc/WQ43-FHS6](https://perma.cc/WQ43-FHS6)
[[37](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Korolov2022-marker)] Maria Korolov. [Data residency laws pushing companies toward residency as a service](https://www.csoonline.com/article/3647761/data-residency-laws-pushing-companies-toward-residency-as-a-service.html). *csoonline.com*, January 2022. Archived at [perma.cc/CHE4-XZZ2](https://perma.cc/CHE4-XZZ2)
[[38](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Nath2019-marker)] Kousik Nath. [These are the numbers every computer engineer should know](https://www.freecodecamp.org/news/must-know-numbers-for-every-computer-engineer/). *freecodecamp.org*, September 2019. Archived at [perma.cc/RW73-36RL](https://perma.cc/RW73-36RL)
[[39](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Hellerstein2019-marker)] Joseph M Hellerstein, Jose Faleiro, Joseph E Gonzalez, Johann Schleier-Smith, Vikram Sreekanti, Alexey Tumanov, and Chenggang Wu. [Serverless Computing: One Step Forward, Two Steps Back](https://arxiv.org/abs/1812.03651). At *Conference on Innovative Data Systems Research* (CIDR), January 2019.
[[40](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#McSherry2015_ch1-marker)] Frank McSherry, Michael Isard, and Derek G. Murray. [Scalability! But at What COST?](https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf) At *15th USENIX Workshop on Hot Topics in Operating Systems* (HotOS), May 2015.
[[41](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Sridharan2018-marker)] Cindy Sridharan. *[Distributed Systems Observability: A Guide to Building Robust Systems](https://unlimited.humio.com/rs/756-LMY-106/images/Distributed-Systems-Observability-eBook.pdf)*. Report, OReilly Media, May 2018. Archived at [perma.cc/M6JL-XKCM](https://perma.cc/M6JL-XKCM)
[[42](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Majors2019-marker)] Charity Majors. [Observability — A 3-Year Retrospective](https://thenewstack.io/observability-a-3-year-retrospective/). *thenewstack.io*, August 2019. Archived at [perma.cc/CG62-TJWL](https://perma.cc/CG62-TJWL)
[[43](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Sigelman2010-marker)] Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag. [Dapper, a Large-Scale Distributed Systems Tracing Infrastructure](https://research.google/pubs/pub36356/). Google Technical Report dapper-2010-1, April 2010. Archived at [perma.cc/K7KU-2TMH](https://perma.cc/K7KU-2TMH)
[[44](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Laigner2021-marker)] Rodrigo Laigner, Yongluan Zhou, Marcos Antonio Vaz Salles, Yijian Liu, and Marcos Kalinowski. [Data management in microservices: State of the practice, challenges, and research directions](http://www.vldb.org/pvldb/vol14/p3348-laigner.pdf). *Proceedings of the VLDB Endowment*, volume 14, issue 13, pages 33483361, September 2021. [doi:10.14778/3484224.3484232](https://doi.org/10.14778/3484224.3484232)
[[45](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Newman2021_ch1-marker)] Sam Newman. [*Building Microservices*, second edition](https://www.oreilly.com/library/view/building-microservices-2nd/9781492034018/). OReilly Media, 2021. ISBN: 9781492034025
[[46](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Richardson2014-marker)] Chris Richardson. [Microservices: Decomposing Applications for Deployability and Scalability](http://www.infoq.com/articles/microservices-intro). *infoq.com*, May 2014. Archived at [perma.cc/CKN4-YEQ2](https://perma.cc/CKN4-YEQ2)
[[47](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Shahrad2020-marker)] Mohammad Shahrad, Rodrigo Fonseca, Íñigo Goiri, Gohar Chaudhry, Paul Batum, Jason Cooke, Eduardo Laureano, Colby Tresness, Mark Russinovich, Ricardo Bianchini. [Serverless in the Wild: Characterizing and Optimizing the Serverless Workload at a Large Cloud Provider](https://www.usenix.org/system/files/atc20-shahrad.pdf). At *USENIX Annual Technical Conference* (ATC), July 2020.
[[48](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Barroso2018-marker)] Luiz André Barroso, Urs Hölzle, and Parthasarathy Ranganathan. [The Datacenter as a Computer: Designing Warehouse-Scale Machines](https://www.morganclaypool.com/doi/10.2200/S00874ED3V01Y201809CAC046), third edition. Morgan & Claypool Synthesis Lectures on Computer Architecture, October 2018. [doi:10.2200/S00874ED3V01Y201809CAC046](https://doi.org/10.2200/S00874ED3V01Y201809CAC046)
[[49](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Fiala2012-marker)] David Fiala, Frank Mueller, Christian Engelmann, Rolf Riesen, Kurt Ferreira, and Ron Brightwell. [Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing](http://moss.csc.ncsu.edu/~mueller/ftp/pub/mueller/papers/sc12.pdf),” at *International Conference for High Performance Computing, Networking, Storage and Analysis* (SC), November 2012. [doi:10.1109/SC.2012.49](https://doi.org/10.1109/SC.2012.49)
[[50](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#KornfeldSimpson2020-marker)] Anna Kornfeld Simpson, Adriana Szekeres, Jacob Nelson, and Irene Zhang. [Securing RDMA for High-Performance Datacenter Storage Systems](https://www.usenix.org/conference/hotcloud20/presentation/kornfeld-simpson). At *12th USENIX Workshop on Hot Topics in Cloud Computing* (HotCloud), July 2020.
[[51](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Singh2015-marker)] Arjun Singh, Joon Ong, Amit Agarwal, Glen Anderson, Ashby Armistead, Roy Bannon, Seb Boving, Gaurav Desai, Bob Felderman, Paulie Germano, Anand Kanagala, Jeff Provost, Jason Simmons, Eiichi Tanda, Jim Wanderer, Urs Hölzle, Stephen Stuart, and Amin Vahdat. [Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Googles Datacenter Network](http://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p183.pdf). At *Annual Conference of the ACM Special Interest Group on Data Communication* (SIGCOMM), August 2015. [doi:10.1145/2785956.2787508](https://doi.org/10.1145/2785956.2787508)
[[52](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Lockwood2014-marker)] Glenn K. Lockwood. [Hadoops Uncomfortable Fit in HPC](http://glennklockwood.blogspot.co.uk/2014/05/hadoops-uncomfortable-fit-in-hpc.html). *glennklockwood.blogspot.co.uk*, May 2014. Archived at [perma.cc/S8XX-Y67B](https://perma.cc/S8XX-Y67B)
[[53](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#ONeil2016_ch1-marker)] Cathy ONeil: *Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy*. Crown Publishing, 2016. ISBN: 9780553418811
[[54](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Shastri2020-marker)] Supreeth Shastri, Vinay Banakar, Melissa Wasserman, Arun Kumar, and Vijay Chidambaram. [Understanding and Benchmarking the Impact of GDPR on Database Systems](http://www.vldb.org/pvldb/vol13/p1064-shastri.pdf). *Proceedings of the VLDB Endowment*, volume 13, issue 7, pages 10641077, March 2020. [doi:10.14778/3384345.3384354](https://doi.org/10.14778/3384345.3384354)
[[55](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Datensparsamkeit-marker)] Martin Fowler. [Datensparsamkeit](https://www.martinfowler.com/bliki/Datensparsamkeit.html). *martinfowler.com*, December 2013. Archived at [perma.cc/R9QX-CME6](https://perma.cc/R9QX-CME6)
[[56](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#GDPR-marker)] [Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 (General Data Protection Regulation)](https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32016R0679&from=EN). *Official Journal of the European Union* L 119/1, May 2016.

1181
ch10.md

File diff suppressed because it is too large Load Diff

1465
ch11.md

File diff suppressed because it is too large Load Diff

1296
ch12.md

File diff suppressed because it is too large Load Diff

1012
ch13.md Normal file

File diff suppressed because it is too large Load Diff

1671
ch2.md

File diff suppressed because it is too large Load Diff

2392
ch3.md

File diff suppressed because it is too large Load Diff

1000
ch4.md

File diff suppressed because it is too large Load Diff

954
ch5.md

File diff suppressed because it is too large Load Diff

867
ch6.md

File diff suppressed because it is too large Load Diff

1037
ch7.md

File diff suppressed because it is too large Load Diff

1431
ch8.md

File diff suppressed because it is too large Load Diff

1474
ch9.md

File diff suppressed because it is too large Load Diff

View File

@ -32,93 +32,131 @@ In this chapter, we will start by exploring the fundamentals of what we are tryi
## Summary ## Summary
In this chapter, we have explored some fundamental ways of thinking about data-intensive applications. These principles will guide us through the rest of the book, where we dive into deep technical detail. The theme of this chapter has been to understand trade-offs: that is, to recognize that for many questions there is not one right answer, but several different approaches that each have various pros and cons. We explored some of the most important choices that affect the architecture of data systems, and introduced terminology that will be needed throughout the rest of this book.
An application has to meet various requirements in order to be useful. There are *functional requirements* (what it should do, such as allowing data to be stored, retrieved, searched, and processed in various ways), and some *nonfunctional require ments* (general properties like security, reliability, compliance, scalability, compatibil ity, and maintainability). In this chapter we discussed reliability, scalability, and maintainability in detail. We started by making a distinction between operational (transaction-processing, OLTP) and analytical (OLAP) systems, and saw their different characteristics: not only managing different types of data with different access patterns, but also serving different audiences. We encountered the concept of a data warehouse and data lake, which receive data feeds from operational systems via ETL. In [Link to Come] we will see that operational and analytical systems often use very different internal data layouts because of the different types of queries they need to serve.
*Reliability* means making systems work correctly, even when faults occur. Faults can be in hardware (typically random and uncorrelated), software (bugs are typically sys tematic and hard to deal with), and humans (who inevitably make mistakes from time to time). Fault-tolerance techniques can hide certain types of faults from the end user. We then compared cloud services, a comparatively recent development, to the traditional paradigm of self-hosted software that has previously dominated data systems architecture. Which of these approaches is more cost-effective depends a lot on your particular situation, but its undeniable that cloud-native approaches are bringing big changes to the way data systems are architected, for example in the way they separate storage and compute.
*Scalability* means having strategies for keeping performance good, even when load increases. In order to discuss scalability, we first need ways of describing load and performance quantitatively. We briefly looked at Twitters home timelines as an example of describing load, and response time percentiles as a way of measuring performance. In a scalable system, you can add processing capacity in order to remain reliable under high load. Cloud systems are intrinsically distributed, and we briefly examined some of the trade-offs of distributed systems compared to using a single machine. There are situations in which you cant avoid going distributed, but its advisable not to rush into making a system distributed if its possible to keep it on a single machine. In [Link to Come] and [Link to Come] we will cover the challenges with distributed systems in more detail.
*Maintainability* has many facets, but in essence its about making life better for the engineering and operations teams who need to work with the system. Good abstrac tions can help reduce complexity and make the system easier to modify and adapt for new use cases. Good operability means having good visibility into the systems health, and having effective ways of managing it. Finally, we saw that data systems architecture is determined not only by the needs of the business deploying the system, but also by privacy regulation that protects the rights of the people whose data is being processed—an aspect that many engineers are prone to ignoring. How we translate legal requirements into technical implementations is not yet well understood, but its important to keep this question in mind as we move through the rest of this book.
There is unfortunately no easy fix for making applications reliable, scalable, or main tainable. However, there are certain patterns and techniques that keep reappearing in different kinds of applications. In the next few chapters we will take a look at some examples of data systems and analyze how they work toward those goals. ##### Footnotes
Later in the book, in [Part III](part-iii.md), we will look at patterns for systems that consist of sev eral components working together, such as the one in [Figure 1-1](../img/fig1-1.png). ##### References
[[1](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Kouzes2009-marker)] Richard T. Kouzes, Gordon A. Anderson, Stephen T. Elbert, Ian Gorton, and Deborah K. Gracio. [The Changing Paradigm of Data-Intensive Computing](http://www2.ic.uff.br/~boeres/slides_AP/papers/TheChanginParadigmDataIntensiveComputing_2009.pdf). *IEEE Computer*, volume 42, issue 1, January 2009. [doi:10.1109/MC.2009.26](https://doi.org/10.1109/MC.2009.26)
[[2](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Kleppmann2019-marker)] Martin Kleppmann, Adam Wiggins, Peter van Hardenberg, and Mark McGranaghan. [Local-first software: you own your data, in spite of the cloud](https://www.inkandswitch.com/local-first/). At *2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software* (Onward!), October 2019. [doi:10.1145/3359591.3359737](https://doi.org/10.1145/3359591.3359737)
## References [[3](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Reis2022-marker)] Joe Reis and Matt Housley. [*Fundamentals of Data Engineering*](https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/). OReilly Media, 2022. ISBN: 9781098108304
--------------------
1. Michael Stonebraker and Uğur Çetintemel: “['One Size Fits All': An Idea Whose Time Has Come and Gone](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.68.9136&rep=rep1&type=pdf),” at *21st International Conference on Data Engineering* (ICDE), April 2005. [[4](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Machado2023-marker)] Rui Pedro Machado and Helder Russa. [*Analytics Engineering with SQL and dbt*](https://www.oreilly.com/library/view/analytics-engineering-with/9781098142377/). OReilly Media, 2023. ISBN: 9781098142384
1. Walter L. Heimerdinger and Charles B. Weinstock: “[A Conceptual Framework for System Fault Tolerance](http://www.sei.cmu.edu/reports/92tr033.pdf),” Technical Report CMU/SEI-92-TR-033, Software Engineering Institute, Carnegie Mellon University, October 1992. [[5](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Codd1993-marker)] Edgar F. Codd, S. B. Codd, and C. T. Salley. [Providing OLAP to User-Analysts: An IT Mandate](http://www.estgv.ipv.pt/PaginasPessoais/jloureiro/ESI_AID2007_2008/fichas/codd.pdf). E. F. Codd Associates, 1993. Archived at [perma.cc/RKX8-2GEE](https://perma.cc/RKX8-2GEE)
2. Ding Yuan, Yu Luo, Xin Zhuang, et al.: “[Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems](https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf),” at *11th USENIX Symposium on Operating Systems Design and Implementation* (OSDI), October 2014. [[6](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Chaudhuri1997-marker)] Surajit Chaudhuri and Umeshwar Dayal. [An Overview of Data Warehousing and OLAP Technology](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/sigrecord.pdf). *ACM SIGMOD Record*, volume 26, issue 1, pages 6574, March 1997. [doi:10.1145/248603.248616](https://doi.org/10.1145/248603.248616)
3. Yury Izrailevsky and Ariel Tseitlin: “[The Netflix Simian Army](http://techblog.netflix.com/2011/07/netflix-simian-army.html),” *techblog.netflix.com*, July 19, 2011. [[7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Ozcan2017-marker)] Fatma Özcan, Yuanyuan Tian, and Pinar Tözün. [Hybrid Transactional/Analytical Processing: A Survey](https://humming80.github.io/papers/sigmod-htaptut.pdf). At *ACM International Conference on Management of Data* (SIGMOD), May 2017. [doi:10.1145/3035918.3054784](https://doi.org/10.1145/3035918.3054784)
4. Daniel Ford, François Labelle, Florentina I. Popovici, et al.: “[Availability in Globally Distributed Storage Systems](http://research.google.com/pubs/archive/36737.pdf),” at *9th USENIX Symposium on Operating Systems Design and Implementation* (OSDI), [[8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Prout2022-marker)] Adam Prout, Szu-Po Wang, Joseph Victor, Zhou Sun, Yongzhu Li, Jack Chen, Evan Bergeron, Eric Hanson, Robert Walzer, Rodrigo Gomes, and Nikita Shamgunov. [Cloud-Native Transactions and Analytics in SingleStore](https://dl.acm.org/doi/abs/10.1145/3514221.3526055). At *International Conference on Management of Data* (SIGMOD), June 2022. [doi:10.1145/3514221.3526055](https://doi.org/10.1145/3514221.3526055)
October 2010.
5. Brian Beach: “[Hard Drive Reliability Update Sep 2014](https://www.backblaze.com/blog/hard-drive-reliability-update-september-2014/),” *backblaze.com*, September 23, 2014. [[9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Stonebraker2005fitsall-marker)] Michael Stonebraker and Uğur Çetintemel. [One Size Fits All: An Idea Whose Time Has Come and Gone](https://pages.cs.wisc.edu/~shivaram/cs744-readings/fits_all.pdf). At *21st International Conference on Data Engineering* (ICDE), April 2005. [doi:10.1109/ICDE.2005.1](https://doi.org/10.1109/ICDE.2005.1)
6. Laurie Voss: “[AWS: The Good, the Bad and the Ugly](https://web.archive.org/web/20160429075023/http://blog.awe.sm/2012/12/18/aws-the-good-the-bad-and-the-ugly/),” *blog.awe.sm*, December 18, 2012. [[10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Cohen2009-marker)] Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph M Hellerstein, and Caleb Welton. [MAD Skills: New Analysis Practices for Big Data](http://www.vldb.org/pvldb/vol2/vldb09-219.pdf). *Proceedings of the VLDB Endowment*, volume 2, issue 2, pages 14811492, August 2009. [doi:10.14778/1687553.1687576](https://doi.org/10.14778/1687553.1687576)
7. Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, et al.: “[What Bugs Live in the Cloud?](http://ucare.cs.uchicago.edu/pdf/socc14-cbs.pdf),” at *5th ACM Symposium on Cloud Computing* (SoCC), November 2014. [doi:10.1145/2670979.2670986](http://dx.doi.org/10.1145/2670979.2670986) [[11](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Olteanu2020-marker)] Dan Olteanu. [The Relational Data Borg is Learning](http://www.vldb.org/pvldb/vol13/p3502-olteanu.pdf). *Proceedings of the VLDB Endowment*, volume 13, issue 12, August 2020. [doi:10.14778/3415478.3415572](https://doi.org/10.14778/3415478.3415572)
8. Nelson Minar: “[Leap Second Crashes Half the Internet](http://www.somebits.com/weblog/tech/bad/leap-second-2012.html),” *somebits.com*, July 3, 2012. [[12](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Bornstein2020-marker)] Matt Bornstein, Martin Casado, and Jennifer Li. [Emerging Architectures for Modern Data Infrastructure: 2020](https://future.a16z.com/emerging-architectures-for-modern-data-infrastructure-2020/). *future.a16z.com*, October 2020. Archived at [perma.cc/LF8W-KDCC](https://perma.cc/LF8W-KDCC)
9. Amazon Web Services: “[Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region](http://aws.amazon.com/message/65648/),” *aws.amazon.com*, April 29, 2011. [[13](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Fowler2015-marker)] Martin Fowler. [DataLake](https://www.martinfowler.com/bliki/DataLake.html). *martinfowler.com*, February 2015. Archived at [perma.cc/4WKN-CZUK](https://perma.cc/4WKN-CZUK)
10. Richard I. Cook: “[How Complex Systems Fail](http://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf),” Cognitive Technologies Laboratory, April 2000. [[14](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Johnson2015-marker)] Bobby Johnson and Joseph Adler. [The Sushi Principle: Raw Data Is Better](https://learning.oreilly.com/videos/strata-hadoop/9781491924143/9781491924143-video210840/). At *Strata+Hadoop World*, February 2015.
11. Jay Kreps: “[Getting Real About Distributed System Reliability](http://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability),” *blog.empathybox.com*, March 19, 2012. [[15](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Armbrust2021-marker)] Michael Armbrust, Ali Ghodsi, Reynold Xin, and Matei Zaharia. [Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics](https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf). At *11th Annual Conference on Innovative Data Systems Research* (CIDR), January 2021.
12. David Oppenheimer, Archana Ganapathi, and David A. Patterson: “[Why Do Internet Services Fail, and What Can Be Done About It?](http://static.usenix.org/legacy/events/usits03/tech/full_papers/oppenheimer/oppenheimer.pdf),” at *4th USENIX Symposium on Internet Technologies and Systems* (USITS), March 2003. [[16](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#DataOps-marker)] DataKitchen, Inc. [The DataOps Manifesto](https://dataopsmanifesto.org/en/). *dataopsmanifesto.org*, 2017. Archived at [perma.cc/3F5N-FUQ4](https://perma.cc/3F5N-FUQ4)
13. Nathan Marz: “[Principles of Software Engineering, Part 1](http://nathanmarz.com/blog/principles-of-software-engineering-part-1.html),” *nathanmarz.com*, April 2, 2013. [[17](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Manohar2021-marker)] Tejas Manohar. [What is Reverse ETL: A Definition & Why Its Taking Off](https://hightouch.io/blog/reverse-etl/). *hightouch.io*, November 2021. Archived at [perma.cc/A7TN-GLYJ](https://perma.cc/A7TN-GLYJ)
14. Michael Jurewitz:“[The Human Impact of Bugs](http://jury.me/blog/2013/3/14/the-human-impact-of-bugs),” *jury.me*, March 15, 2013. [[18](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#ORegan2018-marker)] Simon ORegan. [Designing Data Products](https://towardsdatascience.com/designing-data-products-b6b93edf3d23). *towardsdatascience.com*, August 2018. Archived at [perma.cc/HU67-3RV8](https://perma.cc/HU67-3RV8)
15. Raffi Krikorian: “[Timelines at Scale](http://www.infoq.com/presentations/Twitter-Timeline-Scalability),” at *QCon San Francisco*, November 2012. [[19](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Fournier2021-marker)] Camille Fournier. [Why is it so hard to decide to buy?](https://skamille.medium.com/why-is-it-so-hard-to-decide-to-buy-d86fee98e88e) *skamille.medium.com*, July 2021. Archived at [perma.cc/6VSG-HQ5X](https://perma.cc/6VSG-HQ5X)
16. Martin Fowler: *Patterns of Enterprise Application Architecture*. Addison Wesley, 2002. ISBN: 978-0-321-12742-6 [[20](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#HeinemeierHansson2022-marker)] David Heinemeier Hansson. [Why were leaving the cloud](https://world.hey.com/dhh/why-we-re-leaving-the-cloud-654b47e0). *world.hey.com*, October 2022. Archived at [perma.cc/82E6-UJ65](https://perma.cc/82E6-UJ65)
17. Kelly Sommers: “[After all that run around, what caused 500ms disk latency even when we replaced physical server?](https://twitter.com/kellabyte/status/532930540777635840)” *twitter.com*, November 13, 2014. [[21](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Badizadegan2022-marker)] Nima Badizadegan. [Use One Big Server](https://specbranch.com/posts/one-big-server/). *specbranch.com*, August 2022. Archived at [perma.cc/M8NB-95UK](https://perma.cc/M8NB-95UK)
18. Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, et al.: “[Dynamo: Amazon's Highly Available Key-Value Store](http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf),” at *21st ACM Symposium on Operating Systems Principles* (SOSP), October 2007. [[22](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Yegge2020-marker)] Steve Yegge. [Dear Google Cloud: Your Deprecation Policy is Killing You](https://steve-yegge.medium.com/dear-google-cloud-your-deprecation-policy-is-killing-you-ee7525dc05dc). *steve-yegge.medium.com*, August 2020. Archived at [perma.cc/KQP9-SPGU](https://perma.cc/KQP9-SPGU)
19. Greg Linden: “[Make Data Useful](http://glinden.blogspot.co.uk/2006/12/slides-from-my-talk-at-stanford.html),” slides from presentation at Stanford University Data Mining class (CS345), December 2006. [[23](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Verbitski2017-marker)] Alexandre Verbitski, Anurag Gupta, Debanjan Saha, Murali Brahmadesam, Kamal Gupta, Raman Mittal, Sailesh Krishnamurthy, Sandor Maurice, Tengiz Kharatishvili, and Xiaofeng Bao. [Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases](https://media.amazonwebservices.com/blog/2017/aurora-design-considerations-paper.pdf). At *ACM International Conference on Management of Data* (SIGMOD), pages 10411052, May 2017. [doi:10.1145/3035918.3056101](https://doi.org/10.1145/3035918.3056101)
20. Tammy Everts: “[The Real Cost of Slow Time vs Downtime](http://www.webperformancetoday.com/2014/11/12/real-cost-slow-time-vs-downtime-slides/),” *webperformancetoday.com*, November 12, 2014. [[24](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Antonopoulos2019_ch1-marker)] Panagiotis Antonopoulos, Alex Budovski, Cristian Diaconu, Alejandro Hernandez Saenz, Jack Hu, Hanuma Kodavalla, Donald Kossmann, Sandeep Lingam, Umar Farooq Minhas, Naveen Prakash, Vijendra Purohit, Hugh Qu, Chaitanya Sreenivas Ravella, Krystyna Reisteter, Sheetal Shrotri, Dixin Tang, and Vikram Wakade. [Socrates: The New SQL Server in the Cloud](https://www.microsoft.com/en-us/research/uploads/prod/2019/05/socrates.pdf). At *ACM International Conference on Management of Data* (SIGMOD), pages 17431756, June 2019. [doi:10.1145/3299869.3314047](https://doi.org/10.1145/3299869.3314047)
21. Jake Brutlag:“[Speed Matters for Google Web Search](http://googleresearch.blogspot.co.uk/2009/06/speed-matters.html),” *googleresearch.blogspot.co.uk*, June 22, 2009. [[25](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Vuppalapati2020-marker)] Midhul Vuppalapati, Justin Miron, Rachit Agarwal, Dan Truong, Ashish Motivala, and Thierry Cruanes. [Building An Elastic Query Engine on Disaggregated Storage](https://www.usenix.org/system/files/nsdi20-paper-vuppalapati.pdf). At *17th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), February 2020.
22. Tyler Treat: “[Everything You Know About Latency Is Wrong](http://bravenewgeek.com/everything-you-know-about-latency-is-wrong/),” *bravenewgeek.com*, December 12, 2015. [[26](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Shapira2023-marker)] Gwen Shapira. [Compute-Storage Separation Explained](https://www.thenile.dev/blog/storage-compute). *thenile.dev*, January 2023. Archived at [perma.cc/QCV3-XJNZ](https://perma.cc/QCV3-XJNZ)
23. Jeffrey Dean and Luiz André Barroso: “[The Tail at Scale](http://cacm.acm.org/magazines/2013/2/160173-the-tail-at-scale/fulltext),” *Communications of the ACM*, volume 56, number 2, pages 7480, February 2013. [doi:10.1145/2408776.2408794](http://dx.doi.org/10.1145/2408776.2408794) [[27](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Murthy2022-marker)] Ravi Murthy and Gurmeet Goindi. [AlloyDB for PostgreSQL under the hood: Intelligent, database-aware storage](https://cloud.google.com/blog/products/databases/alloydb-for-postgresql-intelligent-scalable-storage). *cloud.google.com*, May 2022. Archived at [archive.org](https://web.archive.org/web/20220514021120/https://cloud.google.com/blog/products/databases/alloydb-for-postgresql-intelligent-scalable-storage)
24. Graham Cormode, Vladislav Shkapenyuk, Divesh Srivastava, and Bojian Xu: “[Forward Decay: A Practical Time Decay Model for Streaming Systems](http://dimacs.rutgers.edu/~graham/pubs/papers/fwddecay.pdf),” at *25th IEEE International Conference on Data Engineering* (ICDE), March 2009. [[28](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Vanlightly2023-marker)] Jack Vanlightly. [The Architecture of Serverless Data Systems](https://jack-vanlightly.com/blog/2023/11/14/the-architecture-of-serverless-data-systems). *jack-vanlightly.com*, November 2023. Archived at [perma.cc/UDV4-TNJ5](https://perma.cc/UDV4-TNJ5)
25. Ted Dunning and Otmar Ertl: “[Computing Extremely Accurate Quantiles Using t-Digests](https://github.com/tdunning/t-digest),” *github.com*, March 2014. [[29](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Jonas2019-marker)] Eric Jonas, Johann Schleier-Smith, Vikram Sreekanti, Chia-Che Tsai, Anurag Khandelwal, Qifan Pu, Vaishaal Shankar, Joao Carreira, Karl Krauth, Neeraja Yadwadkar, Joseph E Gonzalez, Raluca Ada Popa, Ion Stoica, David A Patterson. [Cloud Programming Simplified: A Berkeley View on Serverless Computing](https://arxiv.org/abs/1902.03383). *arxiv.org*, February 2019.
26. Gil Tene: “[HdrHistogram](http://www.hdrhistogram.org/),” *hdrhistogram.org*. [[30](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Beyer2016-marker)] Betsy Beyer, Jennifer Petoff, Chris Jones, and Niall Richard Murphy. [*Site Reliability Engineering: How Google Runs Production Systems*](https://www.oreilly.com/library/view/site-reliability-engineering/9781491929117/). OReilly Media, 2016. ISBN: 9781491929124
27. Baron Schwartz: “[Why Percentiles Dont Work the Way You Think](https://www.vividcortex.com/blog/why-percentiles-dont-work-the-way-you-think),” *vividcortex.com*, December 7, 2015. [[31](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Limoncelli2020-marker)] Thomas Limoncelli. [The Time I Stole $10,000 from Bell Labs](https://queue.acm.org/detail.cfm?id=3434773). *ACM Queue*, volume 18, issue 5, November 2020. [doi:10.1145/3434571.3434773](https://doi.org/10.1145/3434571.3434773)
28. James Hamilton: “[On Designing and Deploying Internet-Scale Services](https://www.usenix.org/legacy/events/lisa07/tech/full_papers/hamilton/hamilton.pdf),” at *21st Large Installation [[32](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Majors2020-marker)] Charity Majors. [The Future of Ops Jobs](https://acloudguru.com/blog/engineering/the-future-of-ops-jobs). *acloudguru.com*, August 2020. Archived at [perma.cc/GRU2-CZG3](https://perma.cc/GRU2-CZG3)
System Administration Conference* (LISA), November 2007.
29. Brian Foote and Joseph Yoder: “[Big Ball of Mud](http://www.laputan.org/pub/foote/mud.pdf),” at *4th Conference on Pattern Languages of Programs* (PLoP), September 1997. [[33](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Cherkasky2021-marker)] Boris Cherkasky. [(Over)Pay As You Go for Your Datastore](https://medium.com/riskified-technology/over-pay-as-you-go-for-your-datastore-11a29ae49a8b). *medium.com*, September 2021. Archived at [perma.cc/Q8TV-2AM2](https://perma.cc/Q8TV-2AM2)
30. Frederick P Brooks: “No Silver Bullet Essence and Accident in Software Engineering,” in *The Mythical Man-Month*, Anniversary edition, Addison-Wesley, 1995. ISBN: 978-0-201-83595-3 [[34](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Kushchi2023-marker)] Shlomi Kushchi. [Serverless Doesnt Mean DevOpsLess or NoOps](https://thenewstack.io/serverless-doesnt-mean-devopsless-or-noops/). *thenewstack.io*, February 2023. Archived at [perma.cc/3NJR-AYYU](https://perma.cc/3NJR-AYYU)
31. Ben Moseley and Peter Marks: “[Out of the Tar Pit](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.93.8928),” at *BCS Software Practice Advancement* (SPA), 2006. [[35](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Bernhardsson2021-marker)] Erik Bernhardsson. [Storm in the stratosphere: how the cloud will be reshuffled](https://erikbern.com/2021/11/30/storm-in-the-stratosphere-how-the-cloud-will-be-reshuffled.html). *erikbern.com*, November 2021. Archived at [perma.cc/SYB2-99P3](https://perma.cc/SYB2-99P3)
32. Rich Hickey: “[Simple Made Easy](http://www.infoq.com/presentations/Simple-Made-Easy),” at *Strange Loop*, September 2011. [[36](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Stancil2021-marker)] Benn Stancil. [The data OS](https://benn.substack.com/p/the-data-os). *benn.substack.com*, September 2021. Archived at [perma.cc/WQ43-FHS6](https://perma.cc/WQ43-FHS6)
33. Hongyu Pei Breivold, Ivica Crnkovic, and Peter J. Eriksson: “[Analyzing Software Evolvability](http://www.mrtc.mdh.se/publications/1478.pdf),” at *32nd Annual IEEE International Computer Software and Applications Conference* (COMPSAC), July 2008. [doi:10.1109/COMPSAC.2008.50](http://dx.doi.org/10.1109/COMPSAC.2008.50) [[37](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Korolov2022-marker)] Maria Korolov. [Data residency laws pushing companies toward residency as a service](https://www.csoonline.com/article/3647761/data-residency-laws-pushing-companies-toward-residency-as-a-service.html). *csoonline.com*, January 2022. Archived at [perma.cc/CHE4-XZZ2](https://perma.cc/CHE4-XZZ2)
[[38](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Nath2019-marker)] Kousik Nath. [These are the numbers every computer engineer should know](https://www.freecodecamp.org/news/must-know-numbers-for-every-computer-engineer/). *freecodecamp.org*, September 2019. Archived at [perma.cc/RW73-36RL](https://perma.cc/RW73-36RL)
[[39](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Hellerstein2019-marker)] Joseph M Hellerstein, Jose Faleiro, Joseph E Gonzalez, Johann Schleier-Smith, Vikram Sreekanti, Alexey Tumanov, and Chenggang Wu. [Serverless Computing: One Step Forward, Two Steps Back](https://arxiv.org/abs/1812.03651). At *Conference on Innovative Data Systems Research* (CIDR), January 2019.
[[40](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#McSherry2015_ch1-marker)] Frank McSherry, Michael Isard, and Derek G. Murray. [Scalability! But at What COST?](https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf) At *15th USENIX Workshop on Hot Topics in Operating Systems* (HotOS), May 2015.
[[41](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Sridharan2018-marker)] Cindy Sridharan. *[Distributed Systems Observability: A Guide to Building Robust Systems](https://unlimited.humio.com/rs/756-LMY-106/images/Distributed-Systems-Observability-eBook.pdf)*. Report, OReilly Media, May 2018. Archived at [perma.cc/M6JL-XKCM](https://perma.cc/M6JL-XKCM)
[[42](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Majors2019-marker)] Charity Majors. [Observability — A 3-Year Retrospective](https://thenewstack.io/observability-a-3-year-retrospective/). *thenewstack.io*, August 2019. Archived at [perma.cc/CG62-TJWL](https://perma.cc/CG62-TJWL)
[[43](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Sigelman2010-marker)] Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag. [Dapper, a Large-Scale Distributed Systems Tracing Infrastructure](https://research.google/pubs/pub36356/). Google Technical Report dapper-2010-1, April 2010. Archived at [perma.cc/K7KU-2TMH](https://perma.cc/K7KU-2TMH)
[[44](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Laigner2021-marker)] Rodrigo Laigner, Yongluan Zhou, Marcos Antonio Vaz Salles, Yijian Liu, and Marcos Kalinowski. [Data management in microservices: State of the practice, challenges, and research directions](http://www.vldb.org/pvldb/vol14/p3348-laigner.pdf). *Proceedings of the VLDB Endowment*, volume 14, issue 13, pages 33483361, September 2021. [doi:10.14778/3484224.3484232](https://doi.org/10.14778/3484224.3484232)
[[45](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Newman2021_ch1-marker)] Sam Newman. [*Building Microservices*, second edition](https://www.oreilly.com/library/view/building-microservices-2nd/9781492034018/). OReilly Media, 2021. ISBN: 9781492034025
[[46](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Richardson2014-marker)] Chris Richardson. [Microservices: Decomposing Applications for Deployability and Scalability](http://www.infoq.com/articles/microservices-intro). *infoq.com*, May 2014. Archived at [perma.cc/CKN4-YEQ2](https://perma.cc/CKN4-YEQ2)
[[47](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Shahrad2020-marker)] Mohammad Shahrad, Rodrigo Fonseca, Íñigo Goiri, Gohar Chaudhry, Paul Batum, Jason Cooke, Eduardo Laureano, Colby Tresness, Mark Russinovich, Ricardo Bianchini. [Serverless in the Wild: Characterizing and Optimizing the Serverless Workload at a Large Cloud Provider](https://www.usenix.org/system/files/atc20-shahrad.pdf). At *USENIX Annual Technical Conference* (ATC), July 2020.
[[48](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Barroso2018-marker)] Luiz André Barroso, Urs Hölzle, and Parthasarathy Ranganathan. [The Datacenter as a Computer: Designing Warehouse-Scale Machines](https://www.morganclaypool.com/doi/10.2200/S00874ED3V01Y201809CAC046), third edition. Morgan & Claypool Synthesis Lectures on Computer Architecture, October 2018. [doi:10.2200/S00874ED3V01Y201809CAC046](https://doi.org/10.2200/S00874ED3V01Y201809CAC046)
[[49](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Fiala2012-marker)] David Fiala, Frank Mueller, Christian Engelmann, Rolf Riesen, Kurt Ferreira, and Ron Brightwell. [Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing](http://moss.csc.ncsu.edu/~mueller/ftp/pub/mueller/papers/sc12.pdf),” at *International Conference for High Performance Computing, Networking, Storage and Analysis* (SC), November 2012. [doi:10.1109/SC.2012.49](https://doi.org/10.1109/SC.2012.49)
[[50](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#KornfeldSimpson2020-marker)] Anna Kornfeld Simpson, Adriana Szekeres, Jacob Nelson, and Irene Zhang. [Securing RDMA for High-Performance Datacenter Storage Systems](https://www.usenix.org/conference/hotcloud20/presentation/kornfeld-simpson). At *12th USENIX Workshop on Hot Topics in Cloud Computing* (HotCloud), July 2020.
[[51](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Singh2015-marker)] Arjun Singh, Joon Ong, Amit Agarwal, Glen Anderson, Ashby Armistead, Roy Bannon, Seb Boving, Gaurav Desai, Bob Felderman, Paulie Germano, Anand Kanagala, Jeff Provost, Jason Simmons, Eiichi Tanda, Jim Wanderer, Urs Hölzle, Stephen Stuart, and Amin Vahdat. [Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Googles Datacenter Network](http://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p183.pdf). At *Annual Conference of the ACM Special Interest Group on Data Communication* (SIGCOMM), August 2015. [doi:10.1145/2785956.2787508](https://doi.org/10.1145/2785956.2787508)
[[52](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Lockwood2014-marker)] Glenn K. Lockwood. [Hadoops Uncomfortable Fit in HPC](http://glennklockwood.blogspot.co.uk/2014/05/hadoops-uncomfortable-fit-in-hpc.html). *glennklockwood.blogspot.co.uk*, May 2014. Archived at [perma.cc/S8XX-Y67B](https://perma.cc/S8XX-Y67B)
[[53](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#ONeil2016_ch1-marker)] Cathy ONeil: *Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy*. Crown Publishing, 2016. ISBN: 9780553418811
[[54](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Shastri2020-marker)] Supreeth Shastri, Vinay Banakar, Melissa Wasserman, Arun Kumar, and Vijay Chidambaram. [Understanding and Benchmarking the Impact of GDPR on Database Systems](http://www.vldb.org/pvldb/vol13/p1064-shastri.pdf). *Proceedings of the VLDB Endowment*, volume 13, issue 7, pages 10641077, March 2020. [doi:10.14778/3384345.3384354](https://doi.org/10.14778/3384345.3384354)
[[55](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#Datensparsamkeit-marker)] Martin Fowler. [Datensparsamkeit](https://www.martinfowler.com/bliki/Datensparsamkeit.html). *martinfowler.com*, December 2013. Archived at [perma.cc/R9QX-CME6](https://perma.cc/R9QX-CME6)
[[56](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#GDPR-marker)] [Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 (General Data Protection Regulation)](https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32016R0679&from=EN). *Official Journal of the European Union* L 119/1, May 2016.

Binary file not shown.

Before

Width:  |  Height:  |  Size: 586 KiB

After

Width:  |  Height:  |  Size: 898 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 834 KiB

After

Width:  |  Height:  |  Size: 1016 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 966 KiB

After

Width:  |  Height:  |  Size: 834 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 737 KiB

After

Width:  |  Height:  |  Size: 966 KiB

BIN
img/ch13.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 737 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 810 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 975 KiB

After

Width:  |  Height:  |  Size: 1.1 MiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 813 KiB

After

Width:  |  Height:  |  Size: 1.4 MiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 821 KiB

After

Width:  |  Height:  |  Size: 813 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 646 KiB

After

Width:  |  Height:  |  Size: 821 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 832 KiB

After

Width:  |  Height:  |  Size: 646 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 905 KiB

After

Width:  |  Height:  |  Size: 832 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 1016 KiB

After

Width:  |  Height:  |  Size: 905 KiB

BIN
img/ddia_0101.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 30 KiB

BIN
img/ddia_0102.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 60 KiB

BIN
img/ddia_0103.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 61 KiB

BIN
img/ddia_0104.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 38 KiB

BIN
img/ddia_0104a.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 64 KiB

BIN
img/ddia_0104b.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 42 KiB

BIN
img/ddia_0105.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 23 KiB

BIN
img/ddia_0201.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 178 KiB

BIN
img/ddia_0202.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 46 KiB

BIN
img/ddia_0203.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 62 KiB

BIN
img/ddia_0204.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 78 KiB

BIN
img/ddia_0205.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 86 KiB

BIN
img/ddia_0206.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 75 KiB

BIN
img/ddia_0207.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 62 KiB

BIN
img/ddia_0208.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 189 KiB

BIN
img/ddia_0308.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 134 KiB

BIN
img/ddia_0309.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 126 KiB

414
tmp/ch1o.md Normal file
View File

@ -0,0 +1,414 @@
# 第一章:可靠性、可伸缩性和可维护性
![](../img/ch1.png)
> 互联网做得太棒了,以至于大多数人将它看作像太平洋这样的自然资源,而不是什么人工产物。上一次出现这种大规模且无差错的技术,你还记得是什么时候吗?
>
> —— [艾伦・凯](http://www.drdobbs.com/architecture-and-design/interview-with-alan-kay/240003442) 在接受 Dobb 博士杂志采访时说2012 年)
-----------------------
[TOC]
现今很多应用程序都是 **数据密集型data-intensive** 的,而非 **计算密集型compute-intensive** 的。因此 CPU 很少成为这类应用的瓶颈,更大的问题通常来自数据量、数据复杂性、以及数据的变更速度。
数据密集型应用通常由标准组件构建而成,标准组件提供了很多通用的功能;例如,许多应用程序都需要:
- 存储数据,以便自己或其他应用程序之后能再次找到 *数据库,即 databases*
- 记住开销昂贵操作的结果,加快读取速度(*缓存,即 caches*
- 允许用户按关键字搜索数据,或以各种方式对数据进行过滤(*搜索索引,即 search indexes*
- 向其他进程发送消息,进行异步处理(*流处理,即 stream processing*
- 定期处理累积的大批量数据(*批处理,即 batch processing*
如果这些功能听上去平淡无奇,那是因为这些 **数据系统data system** 是非常成功的抽象:我们一直不假思索地使用它们并习以为常。绝大多数工程师不会幻想从零开始编写存储引擎,因为在开发应用时,数据库已经是足够完美的工具了。
但现实没有这么简单。不同的应用有着不同的需求,因而数据库系统也是百花齐放,有着各式各样的特性。实现缓存有很多种手段,创建搜索索引也有好几种方法,诸如此类。因此在开发应用前,我们依然有必要先弄清楚最适合手头工作的工具和方法。而且当单个工具解决不了你的问题时,组合使用这些工具可能还是有些难度的。
本书将是一趟关于数据系统原理、实践与应用的旅程,并讲述了设计数据密集型应用的方法。我们将探索不同工具之间的共性与特性,以及各自的实现原理。
本章将从我们所要实现的基础目标开始:可靠、可伸缩、可维护的数据系统。我们将澄清这些词语的含义,概述考量这些目标的方法。并回顾一些后续章节所需的基础知识。在接下来的章节中我们将抽丝剥茧,研究设计数据密集型应用时可能遇到的设计决策。
## 关于数据系统的思考
我们通常认为,数据库、消息队列、缓存等工具分属于几个差异显著的类别。虽然数据库和消息队列表面上有一些相似性 —— 它们都会存储一段时间的数据 —— 但它们有迥然不同的访问模式,这意味着迥异的性能特征和实现手段。
那我们为什么要把这些东西放在 **数据系统data system** 的总称之下混为一谈呢?
近些年来出现了许多新的数据存储工具与数据处理工具。它们针对不同应用场景进行优化因此不再适合生硬地归入传统类别【1】。类别之间的界限变得越来越模糊例如数据存储可以被当成消息队列用Redis消息队列则带有类似数据库的持久保证Apache Kafka
其次,越来越多的应用程序有着各种严格而广泛的要求,单个工具不足以满足所有的数据处理和存储需求。取而代之的是,总体工作被拆分成一系列能被单个工具高效完成的任务,并通过应用代码将它们缝合起来。
例如如果将缓存应用管理的缓存层Memcached 或同类产品)和全文搜索(全文搜索服务器,例如 Elasticsearch 或 Solr功能从主数据库剥离出来那么使缓存 / 索引与主数据库保持同步通常是应用代码的责任。[图 1-1](../img/fig1-1.png) 给出了这种架构可能的样子(细节将在后面的章节中详细介绍)。
![](../img/fig1-1.png)
**图 1-1 一个可能的组合使用多个组件的数据系统架构**
当你将多个工具组合在一起提供服务时,服务的接口或 **应用程序编程接口API, Application Programming Interface** 通常向客户端隐藏这些实现细节。现在,你基本上已经使用较小的通用组件创建了一个全新的、专用的数据系统。这个新的复合数据系统可能会提供特定的保证,例如:缓存在写入时会作废或更新,以便外部客户端获取一致的结果。现在你不仅是应用程序开发人员,还是数据系统设计人员了。
设计数据系统或服务时可能会遇到很多棘手的问题,例如:当系统出问题时,如何确保数据的正确性和完整性?当部分系统退化降级时,如何为客户提供始终如一的良好性能?当负载增加时,如何扩容应对?什么样的 API 才是好的 API
影响数据系统设计的因素很多,包括参与人员的技能和经验、历史遗留问题、系统路径依赖、交付时限、公司的风险容忍度、监管约束等,这些因素都需要具体问题具体分析。
本书着重讨论三个在大多数软件系统中都很重要的问题:
* 可靠性Reliability
系统在 **困境**adversity比如硬件故障、软件故障、人为错误中仍可正常工作正确完成功能并能达到期望的性能水准。请参阅 “[可靠性](#可靠性)”。
* 可伸缩性Scalability
有合理的办法应对系统的增长(数据量、流量、复杂性)。请参阅 “[可伸缩性](#可伸缩性)”。
* 可维护性Maintainability
许多不同的人(工程师、运维)在不同的生命周期,都能高效地在系统上工作(使系统保持现有行为,并适应新的应用场景)。请参阅 “[可维护性](#可维护性)”。
人们经常追求这些词汇,却没有清楚理解它们到底意味着什么。为了工程的严谨性,本章的剩余部分将探讨可靠性、可伸缩性和可维护性的含义。为实现这些目标而使用的各种技术,架构和算法将在后续的章节中研究。
## 可靠性
人们对于一个东西是否可靠,都有一个直观的想法。人们对可靠软件的典型期望包括:
* 应用程序表现出用户所期望的功能。
* 允许用户犯错,允许用户以出乎意料的方式使用软件。
* 在预期的负载和数据量下,性能满足要求。
* 系统能防止未经授权的访问和滥用。
如果所有这些在一起意味着 “正确工作”,那么可以把可靠性粗略理解为 “即使出现问题,也能继续正确工作”。
造成错误的原因叫做 **故障fault**,能预料并应对故障的系统特性可称为 **容错fault-tolerant****回弹性resilient**。“**容错**” 一词可能会产生误导,因为它暗示着系统可以容忍所有可能的错误,但在实际中这是不可能的。比方说,如果整个地球(及其上的所有服务器)都被黑洞吞噬了,想要容忍这种错误,需要把网络托管到太空中 —— 这种预算能不能批准就祝你好运了。所以在讨论容错时,只有谈论特定类型的错误才有意义。
注意 **故障fault** 不同于 **失效failure**【2】。**故障** 通常定义为系统的一部分状态偏离其标准,而 **失效** 则是系统作为一个整体停止向用户提供服务。故障的概率不可能降到零,因此最好设计容错机制以防因 **故障** 而导致 **失效**。本书中我们将介绍几种用不可靠的部件构建可靠系统的技术。
反直觉的是,在这类容错系统中,通过故意触发来 **提高** 故障率是有意义的例如在没有警告的情况下随机地杀死单个进程。许多高危漏洞实际上是由糟糕的错误处理导致的【3】因此我们可以通过故意引发故障来确保容错机制不断运行并接受考验从而提高故障自然发生时系统能正确处理的信心。Netflix 公司的 *Chaos Monkey*【4】就是这种方法的一个例子。
尽管比起 **阻止错误prevent error**,我们通常更倾向于 **容忍错误**。但也有 **预防胜于治疗** 的情况(比如不存在治疗方法时)。安全问题就属于这种情况。例如,如果攻击者破坏了系统,并获取了敏感数据,这种事是撤销不了的。但本书主要讨论的是可以恢复的故障种类,正如下面几节所述。
### 硬件故障
当想到系统失效的原因时,**硬件故障hardware faults** 总会第一个进入脑海。硬盘崩溃、内存出错、机房断电、有人拔错网线…… 任何与大型数据中心打过交道的人都会告诉你:一旦你拥有很多机器,这些事情 **总** 会发生!
据报道称,硬盘的 **平均无故障时间MTTF, mean time to failure** 约为 10 到 50 年【5】【6】。因此从数学期望上讲在拥有 10000 个磁盘的存储集群上,平均每天会有 1 个磁盘出故障。
为了减少系统的故障率,第一反应通常都是增加单个硬件的冗余度,例如:磁盘可以组建 RAID服务器可能有双路电源和热插拔 CPU数据中心可能有电池和柴油发电机作为后备电源某个组件挂掉时冗余组件可以立刻接管。这种方法虽然不能完全防止由硬件问题导致的系统失效但它简单易懂通常也足以让机器不间断运行很多年。
直到最近,硬件冗余对于大多数应用来说已经足够了,它使单台机器完全失效变得相当罕见。只要你能快速地把备份恢复到新机器上,故障停机时间对大多数应用而言都算不上灾难性的。只有少量高可用性至关重要的应用才会要求有多套硬件冗余。
但是随着数据量和应用计算需求的增加,越来越多的应用开始大量使用机器,这会相应地增加硬件故障率。此外,在类似亚马逊 AWSAmazon Web Services的一些云服务平台上虚拟机实例不可用却没有任何警告也是很常见的【7】因为云平台的设计就是优先考虑 **灵活性flexibility****弹性elasticity**[^i],而不是单机可靠性。
如果在硬件冗余的基础上进一步引入软件容错机制,那么系统在容忍整个(单台)机器故障的道路上就更进一步了。这样的系统也有运维上的便利,例如:如果需要重启机器(例如应用操作系统安全补丁),单服务器系统就需要计划停机。而允许机器失效的系统则可以一次修复一个节点,无需整个系统停机。
[^i]: 在 [应对负载的方法](#应对负载的方法) 一节定义
### 软件错误
我们通常认为硬件故障是随机的、相互独立的:一台机器的磁盘失效并不意味着另一台机器的磁盘也会失效。虽然大量硬件组件之间可能存在微弱的相关性(例如服务器机架的温度等共同的原因),但同时发生故障也是极为罕见的。
另一类错误是内部的 **系统性错误systematic error**【8】。这类错误难以预料而且因为是跨节点相关的所以比起不相关的硬件故障往往可能造成更多的 **系统失效**【5】。例子包括
* 接受特定的错误输入,便导致所有应用服务器实例崩溃的 BUG。例如 2012 年 6 月 30 日的闰秒,由于 Linux 内核中的一个错误【9】许多应用同时挂掉了。
* 失控进程会用尽一些共享资源,包括 CPU 时间、内存、磁盘空间或网络带宽。
* 系统依赖的服务变慢,没有响应,或者开始返回错误的响应。
* 级联故障一个组件中的小故障触发另一个组件中的故障进而触发更多的故障【10】。
导致这类软件故障的 BUG 通常会潜伏很长时间,直到被异常情况触发为止。这种情况意味着软件对其环境做出了某种假设 —— 虽然这种假设通常来说是正确的但由于某种原因最后不再成立了【11】。
虽然软件中的系统性故障没有速效药,但我们还是有很多小办法,例如:仔细考虑系统中的假设和交互;彻底的测试;进程隔离;允许进程崩溃并重启;测量、监控并分析生产环境中的系统行为。如果系统能够提供一些保证(例如在一个消息队列中,进入与发出的消息数量相等),那么系统就可以在运行时不断自检,并在出现 **差异discrepancy** 时报警【12】。
### 人为错误
设计并构建了软件系统的工程师是人类,维持系统运行的运维也是人类。即使他们怀有最大的善意,人类也是不可靠的。举个例子,一项关于大型互联网服务的研究发现,运维配置错误是导致服务中断的首要原因,而硬件故障(服务器或网络)仅导致了 10-25% 的服务中断【13】。
尽管人类不可靠,但怎么做才能让系统变得可靠?最好的系统会组合使用以下几种办法:
* 以最小化犯错机会的方式设计系统。例如精心设计的抽象、API 和管理后台使做对事情更容易,搞砸事情更困难。但如果接口限制太多,人们就会忽略它们的好处而想办法绕开。很难正确把握这种微妙的平衡。
* 将人们最容易犯错的地方与可能导致失效的地方 **解耦decouple**。特别是提供一个功能齐全的非生产环境 **沙箱sandbox**,使人们可以在不影响真实用户的情况下,使用真实数据安全地探索和实验。
* 在各个层次进行彻底的测试【3】从单元测试、全系统集成测试到手动测试。自动化测试易于理解已经被广泛使用特别适合用来覆盖正常情况中少见的 **边缘场景corner case**
* 允许从人为错误中简单快速地恢复,以最大限度地减少失效情况带来的影响。例如,快速回滚配置变更,分批发布新代码(以便任何意外错误只影响一小部分用户),并提供数据重算工具(以备旧的计算出错)。
* 配置详细和明确的监控,比如性能指标和错误率。在其他工程学科中这指的是 **遥测telemetry**(一旦火箭离开了地面,遥测技术对于跟踪发生的事情和理解失败是至关重要的)。监控可以向我们发出预警信号,并允许我们检查是否有任何地方违反了假设和约束。当出现问题时,指标数据对于问题诊断是非常宝贵的。
* 良好的管理实践与充分的培训 —— 一个复杂而重要的方面,但超出了本书的范围。
### 可靠性有多重要?
可靠性不仅仅是针对核电站和空中交通管制软件而言,我们也期望更多平凡的应用能可靠地运行。商务应用中的错误会导致生产力损失(也许数据报告不完整还会有法律风险),而电商网站的中断则可能会导致收入和声誉的巨大损失。
即使在 “非关键” 应用中我们也对用户负有责任。试想一位家长把所有的照片和孩子的视频储存在你的照片应用里【15】。如果数据库突然损坏他们会感觉如何他们可能会知道如何从备份恢复吗
在某些情况下,我们可能会选择牺牲可靠性来降低开发成本(例如为未经证实的市场开发产品原型)或运营成本(例如利润率极低的服务),但我们偷工减料时,应该清楚意识到自己在做什么。
## 可伸缩性
系统今天能可靠运行,并不意味未来也能可靠运行。服务 **降级degradation** 的一个常见原因是负载增加,例如:系统负载已经从一万个并发用户增长到十万个并发用户,或者从一百万增长到一千万。也许现在处理的数据量级要比过去大得多。
**可伸缩性Scalability** 是用来描述系统应对负载增长能力的术语。但是请注意,这不是贴在系统上的一维标签:说 “X 可伸缩” 或 “Y 不可伸缩” 是没有任何意义的。相反,讨论可伸缩性意味着考虑诸如 “如果系统以特定方式增长,有什么选项可以应对增长?” 和 “如何增加计算资源来处理额外的负载?” 等问题。
### 描述负载
在讨论增长问题(如果负载加倍会发生什么?)前,首先要能简要描述系统的当前负载。负载可以用一些称为 **负载参数load parameters** 的数字来描述。参数的最佳选择取决于系统架构,它可能是每秒向 Web 服务器发出的请求、数据库中的读写比率、聊天室中同时活跃的用户数量、缓存命中率或其他东西。除此之外,也许平均情况对你很重要,也许你的瓶颈是少数极端场景。
为了使这个概念更加具体,我们以推特在 2012 年 11 月发布的数据【16】为例。推特的两个主要业务是
* 发布推文
用户可以向其粉丝发布新消息(平均 4.6k 请求 / 秒,峰值超过 12k 请求 / 秒)。
* 主页时间线
用户可以查阅他们关注的人发布的推文300k 请求 / 秒)。
处理每秒 12,000 次写入(发推文的速率峰值)还是很简单的。然而推特的伸缩性挑战并不是主要来自推特量,而是来自 **扇出fan-out**[^ii]—— 每个用户关注了很多人,也被很多人关注。
[^ii]: 扇出:从电子工程学中借用的术语,它描述了输入连接到另一个门输出的逻辑门数量。输出需要提供足够的电流来驱动所有连接的输入。在事务处理系统中,我们使用它来描述为了服务一个传入请求而需要执行其他服务的请求数量。
大体上讲,这一对操作有两种实现方式。
1. 发布推文时,只需将新推文插入全局推文集合即可。当一个用户请求自己的主页时间线时,首先查找他关注的所有人,查询这些被关注用户发布的推文并按时间顺序合并。在如 [图 1-2](../img/fig1-2.png) 所示的关系型数据库中,可以编写这样的查询:
```sql
SELECT tweets.*, users.*
FROM tweets
JOIN users ON tweets.sender_id = users.id
JOIN follows ON follows.followee_id = users.id
WHERE follows.follower_id = current_user
```
![](../img/fig1-2.png)
**图 1-2 推特主页时间线的关系型模式简单实现**
2. 为每个用户的主页时间线维护一个缓存,就像每个用户的推文收件箱([图 1-3](../img/fig1-3.png))。当一个用户发布推文时,查找所有关注该用户的人,并将新的推文插入到每个主页时间线缓存中。因此读取主页时间线的请求开销很小,因为结果已经提前计算好了。
![](../img/fig1-3.png)
**图 1-3 用于分发推特至关注者的数据流水线2012 年 11 月的负载参数【16】**
推特的第一个版本使用了方法 1但系统很难跟上主页时间线查询的负载。所以公司转向了方法 2方法 2 的效果更好,因为发推频率比查询主页时间线的频率几乎低了两个数量级,所以在这种情况下,最好在写入时做更多的工作,而在读取时做更少的工作。
然而方法 2 的缺点是,发推现在需要大量的额外工作。平均来说,一条推文会发往约 75 个关注者,所以每秒 4.6k 的发推写入,变成了对主页时间线缓存每秒 345k 的写入。但这个平均值隐藏了用户粉丝数差异巨大这一现实,一些用户有超过 3000 万的粉丝,这意味着一条推文就可能会导致主页时间线缓存的 3000 万次写入!及时完成这种操作是一个巨大的挑战 —— 推特尝试在 5 秒内向粉丝发送推文。
在推特的例子中,每个用户粉丝数的分布(可能按这些用户的发推频率来加权)是探讨可伸缩性的一个关键负载参数,因为它决定了扇出负载。你的应用程序可能具有非常不同的特征,但可以采用相似的原则来考虑它的负载。
推特轶事的最终转折:现在已经稳健地实现了方法 2推特逐步转向了两种方法的混合。大多数用户发的推文会被扇出写入其粉丝主页时间线缓存中。但是少数拥有海量粉丝的用户即名流会被排除在外。当用户读取主页时间线时分别地获取出该用户所关注的每位名流的推文再与用户的主页时间线缓存合并如方法 1 所示。这种混合方法能始终如一地提供良好性能。在 [第十二章](../ch12.md) 中我们将重新讨论这个例子,这在覆盖更多技术层面之后。
### 描述性能
一旦系统的负载被描述好,就可以研究当负载增加会发生什么。我们可以从两种角度来看:
* 增加负载参数并保持系统资源CPU、内存、网络带宽等不变时系统性能将受到什么影响
* 增加负载参数并希望保持性能不变时,需要增加多少系统资源?
这两个问题都需要性能数据,所以让我们简单地看一下如何描述系统性能。
对于 Hadoop 这样的批处理系统,通常关心的是 **吞吐量throughput**,即每秒可以处理的记录数量,或者在特定规模数据集上运行作业的总时间 [^iii]。对于在线系统,通常更重要的是服务的 **响应时间response time**,即客户端发送请求到接收响应之间的时间。
[^iii]: 理想情况下,批量作业的运行时间是数据集的大小除以吞吐量。在实践中由于数据倾斜(数据不是均匀分布在每个工作进程中),需要等待最慢的任务完成,所以运行时间往往更长。
> #### 延迟和响应时间
>
> **延迟latency****响应时间response time** 经常用作同义词,但实际上它们并不一样。响应时间是客户所看到的,除了实际处理请求的时间( **服务时间service time** )之外,还包括网络延迟和排队延迟。延迟是某个请求等待处理的 **持续时长**,在此期间它处于 **休眠latent** 状态并等待服务【17】。
即使不断重复发送同样的请求,每次得到的响应时间也都会略有不同。现实世界的系统会处理各式各样的请求,响应时间可能会有很大差异。因此我们需要将响应时间视为一个可以测量的数值 **分布distribution**,而不是单个数值。
在 [图 1-4](../img/fig1-4.png) 中,每个灰条代表一次对服务的请求,其高度表示请求花费了多长时间。大多数请求是相当快的,但偶尔会出现需要更长的时间的异常值。这也许是因为缓慢的请求实质上开销更大,例如它们可能会处理更多的数据。但即使(你认为)所有请求都花费相同时间的情况下,随机的附加延迟也会导致结果变化,例如:上下文切换到后台进程,网络数据包丢失与 TCP 重传垃圾收集暂停强制从磁盘读取的页面错误服务器机架中的震动【18】还有很多其他原因。
![](../img/fig1-4.png)
**图 1-4 展示了一个服务 100 次请求响应时间的均值与百分位数**
通常报表都会展示服务的平均响应时间。(严格来讲 “平均” 一词并不指代任何特定公式,但实际上它通常被理解为 **算术平均值arithmetic mean**:给定 n 个值,加起来除以 n )。然而如果你想知道 “**典型typical**” 响应时间,那么平均值并不是一个非常好的指标,因为它不能告诉你有多少用户实际上经历了这个延迟。
通常使用 **百分位点percentiles** 会更好。如果将响应时间列表按最快到最慢排序,那么 **中位数median** 就在正中间:举个例子,如果你的响应时间中位数是 200 毫秒,这意味着一半请求的返回时间少于 200 毫秒,另一半比这个要长。
如果想知道典型场景下用户需要等待多长时间,那么中位数是一个好的度量标准:一半用户请求的响应时间少于响应时间的中位数,另一半服务时间比中位数长。中位数也被称为第 50 百分位点,有时缩写为 p50。注意中位数是关于单个请求的如果用户同时发出几个请求在一个会话过程中或者由于一个页面中包含了多个资源则至少一个请求比中位数慢的概率远大于 50%。
为了弄清异常值有多糟糕,可以看看更高的百分位点,例如第 95、99 和 99.9 百分位点(缩写为 p95p99 和 p999。它们意味着 95%、99% 或 99.9% 的请求响应时间要比该阈值快,例如:如果第 95 百分位点响应时间是 1.5 秒,则意味着 100 个请求中的 95 个响应时间快于 1.5 秒,而 100 个请求中的 5 个响应时间超过 1.5 秒。如 [图 1-4](../img/fig1-4.png) 所示。
响应时间的高百分位点(也称为 **尾部延迟**,即 **tail latencies**)非常重要,因为它们直接影响用户的服务体验。例如亚马逊在描述内部服务的响应时间要求时是以 99.9 百分位点为准,即使它只影响一千个请求中的一个。这是因为请求响应最慢的客户往往也是数据最多的客户,也可以说是最有价值的客户 —— 因为他们掏钱了【19】。保证网站响应迅速对于保持客户的满意度非常重要亚马逊观察到响应时间增加 100 毫秒,销售量就减少 1%【20】而另一些报告说慢 1 秒钟会让客户满意度指标减少 16%【2122】。
另一方面,优化第 99.99 百分位点(一万个请求中最慢的一个)被认为太昂贵了,不能为亚马逊的目标带来足够好处。减小高百分位点处的响应时间相当困难,因为它很容易受到随机事件的影响,这超出了控制范围,而且效益也很小。
百分位点通常用于 **服务级别目标SLO, service level objectives****服务级别协议SLA, service level agreements**即定义服务预期性能和可用性的合同。SLA 可能会声明,如果服务响应时间的中位数小于 200 毫秒,且 99.9 百分位点低于 1 秒,则认为服务工作正常(如果响应时间更长,就认为服务不达标)。这些指标为客户设定了期望值,并允许客户在 SLA 未达标的情况下要求退款。
**排队延迟queueing delay** 通常占了高百分位点处响应时间的很大一部分。由于服务器只能并行处理少量的事务(如受其 CPU 核数的限制),所以只要有少量缓慢的请求就能阻碍后续请求的处理,这种效应有时被称为 **头部阻塞head-of-line blocking** 。即使后续请求在服务器上处理的非常迅速,由于需要等待先前请求完成,客户端最终看到的是缓慢的总体响应时间。因为存在这种效应,测量客户端的响应时间非常重要。
为测试系统的可伸缩性而人为产生负载时产生负载的客户端要独立于响应时间不断发送请求。如果客户端在发送下一个请求之前等待先前的请求完成这种行为会产生人为排队的效果使得测试时的队列比现实情况更短使测量结果产生偏差【23】。
> #### 实践中的百分位点
>
> 在多重调用的后端服务里,高百分位数变得特别重要。即使并行调用,最终用户请求仍然需要等待最慢的并行调用完成。如 [图 1-5](../img/fig1-5.png) 所示,只需要一个缓慢的调用就可以使整个最终用户请求变慢。即使只有一小部分后端调用速度较慢,如果最终用户请求需要多个后端调用,则获得较慢调用的机会也会增加,因此较高比例的最终用户请求速度会变慢(该效果称为尾部延迟放大,即 tail latency amplification【24】
>
> 如果你想将响应时间百分点添加到你的服务的监视仪表板则需要持续有效地计算它们。例如你可以使用滑动窗口来跟踪连续10分钟内的请求响应时间。每一分钟你都会计算出该窗口中的响应时间中值和各种百分数并将这些度量值绘制在图上。
>
> 简单的实现是在时间窗口内保存所有请求的响应时间列表,并且每分钟对列表进行排序。如果对你来说效率太低,那么有一些算法能够以最小的 CPU 和内存成本如前向衰减【25】、t-digest【26】或 HdrHistogram 【27】来计算百分位数的近似值。请注意平均百分比例如减少时间分辨率或合并来自多台机器的数据在数学上没有意义 - 聚合响应时间数据的正确方法是添加直方图【28】。
![](../img/fig1-5.png)
**图 1-5 当一个请求需要多个后端请求时,单个后端慢请求就会拖慢整个终端用户的请求**
### 应对负载的方法
现在我们已经讨论了用于描述负载的参数和用于衡量性能的指标。可以开始认真讨论可伸缩性了:当负载参数增加时,如何保持良好的性能?
适应某个级别负载的架构不太可能应付 10 倍于此的负载。如果你正在开发一个快速增长的服务,那么每次负载发生数量级的增长时,你可能都需要重新考虑架构 —— 或者更频繁。
人们经常讨论 **纵向伸缩**scaling up也称为垂直伸缩即 vertical scaling转向更强大的机器**横向伸缩**scaling out也称为水平伸缩即 horizontal scaling将负载分布到多台小机器上之间的对立。跨多台机器分配负载也称为 “**无共享shared-nothing**” 架构。可以在单台机器上运行的系统通常更简单,但高端机器可能非常贵,所以非常密集的负载通常无法避免地需要横向伸缩。现实世界中的优秀架构需要将这两种方法务实地结合,因为使用几台足够强大的机器可能比使用大量的小型虚拟机更简单也更便宜。
有些系统是 **弹性elastic** 的,这意味着可以在检测到负载增加时自动增加计算资源,而其他系统则是手动伸缩(人工分析容量并决定向系统添加更多的机器)。如果负载 **极难预测highly unpredictable**,则弹性系统可能很有用,但手动伸缩系统更简单,并且意外操作可能会更少(请参阅 “[分区再平衡](../ch6.md#分区再平衡)”)。
跨多台机器部署 **无状态服务stateless services** 非常简单,但将带状态的数据系统从单节点变为分布式配置则可能引入许多额外复杂度。出于这个原因,常识告诉我们应该将数据库放在单个节点上(纵向伸缩),直到伸缩成本或可用性需求迫使其改为分布式。
随着分布式系统的工具和抽象越来越好,至少对于某些类型的应用而言,这种常识可能会改变。可以预见分布式数据系统将成为未来的默认设置,即使对不处理大量数据或流量的场景也如此。本书的其余部分将介绍多种分布式数据系统,不仅讨论它们在可伸缩性方面的表现,还包括易用性和可维护性。
大规模的系统架构通常是应用特定的 —— 没有一招鲜吃遍天的通用可伸缩架构(不正式的叫法:**万金油magic scaling sauce** )。应用的问题可能是读取量、写入量、要存储的数据量、数据的复杂度、响应时间要求、访问模式或者所有问题的大杂烩。
举个例子,用于处理每秒十万个请求(每个大小为 1 kB的系统与用于处理每分钟 3 个请求(每个大小为 2GB的系统看上去会非常不一样尽管两个系统有同样的数据吞吐量。
一个良好适配应用的可伸缩架构,是围绕着 **假设assumption** 建立的:哪些操作是常见的?哪些操作是罕见的?这就是所谓负载参数。如果假设最终是错误的,那么为伸缩所做的工程投入就白费了,最糟糕的是适得其反。在早期创业公司或非正式产品中,通常支持产品快速迭代的能力,要比可伸缩至未来的假想负载要重要的多。
尽管这些架构是应用程序特定的,但可伸缩的架构通常也是从通用的积木块搭建而成的,并以常见的模式排列。在本书中,我们将讨论这些构件和模式。
## 可维护性
众所周知,软件的大部分开销并不在最初的开发阶段,而是在持续的维护阶段,包括修复漏洞、保持系统正常运行、调查失效、适配新的平台、为新的场景进行修改、偿还技术债和添加新的功能。
不幸的是,许多从事软件系统行业的人不喜欢维护所谓的 **遗留legacy** 系统,—— 也许因为涉及修复其他人的错误、和过时的平台打交道,或者系统被迫使用于一些份外工作。每一个遗留系统都以自己的方式让人不爽,所以很难给出一个通用的建议来和它们打交道。
但是我们可以,也应该以这样一种方式来设计软件:在设计之初就尽量考虑尽可能减少维护期间的痛苦,从而避免自己的软件系统变成遗留系统。为此,我们将特别关注软件系统的三个设计原则:
* 可操作性Operability
便于运维团队保持系统平稳运行。
* 简单性Simplicity
从系统中消除尽可能多的 **复杂度complexity**,使新工程师也能轻松理解系统(注意这和用户接口的简单性不一样)。
* 可演化性evolvability
使工程师在未来能轻松地对系统进行更改,当需求变化时为新应用场景做适配。也称为 **可扩展性extensibility**、**可修改性modifiability** 或 **可塑性plasticity**
和之前提到的可靠性、可伸缩性一样,实现这些目标也没有简单的解决方案。不过我们会试着想象具有可操作性,简单性和可演化性的系统会是什么样子。
### 可操作性:人生苦短,关爱运维
有人认为,“良好的运维经常可以绕开垃圾(或不完整)软件的局限性,而再好的软件摊上垃圾运维也没法可靠运行”。尽管运维的某些方面可以,而且应该是自动化的,但在最初建立正确运作的自动化机制仍然取决于人。
运维团队对于保持软件系统顺利运行至关重要。一个优秀运维团队的典型职责如下或者更多【29】
* 监控系统的运行状况,并在服务状态不佳时快速恢复服务。
* 跟踪问题的原因,例如系统故障或性能下降。
* 及时更新软件和平台,比如安全补丁。
* 了解系统间的相互作用,以便在异常变更造成损失前进行规避。
* 预测未来的问题,并在问题出现之前加以解决(例如,容量规划)。
* 建立部署、配置、管理方面的良好实践,编写相应工具。
* 执行复杂的维护任务,例如将应用程序从一个平台迁移到另一个平台。
* 当配置变更时,维持系统的安全性。
* 定义工作流程,使运维操作可预测,并保持生产环境稳定。
* 铁打的营盘流水的兵,维持组织对系统的了解。
良好的可操作性意味着更轻松的日常工作,进而运维团队能专注于高价值的事情。数据系统可以通过各种方式使日常任务更轻松:
* 通过良好的监控,提供对系统内部状态和运行时行为的 **可见性visibility**
* 为自动化提供良好支持,将系统与标准化工具相集成。
* 避免依赖单台机器(在整个系统继续不间断运行的情况下允许机器停机维护)。
* 提供良好的文档和易于理解的操作模型(“如果做 X会发生 Y”
* 提供良好的默认行为,但需要时也允许管理员自由覆盖默认值。
* 有条件时进行自我修复,但需要时也允许管理员手动控制系统状态。
* 行为可预测,最大限度减少意外。
### 简单性:管理复杂度
小型软件项目可以使用简单讨喜的、富表现力的代码,但随着项目越来越大,代码往往变得非常复杂,难以理解。这种复杂度拖慢了所有系统相关人员,进一步增加了维护成本。一个陷入复杂泥潭的软件项目有时被描述为 **烂泥潭a big ball of mud** 【30】。
**复杂度complexity** 有各种可能的症状,例如:状态空间激增、模块间紧密耦合、纠结的依赖关系、不一致的命名和术语、解决性能问题的 Hack、需要绕开的特例等等现在已经有很多关于这个话题的讨论【31,32,33】。
因为复杂度导致维护困难时,预算和时间安排通常会超支。在复杂的软件中进行变更,引入错误的风险也更大:当开发人员难以理解系统时,隐藏的假设、无意的后果和意外的交互就更容易被忽略。相反,降低复杂度能极大地提高软件的可维护性,因此简单性应该是构建系统的一个关键目标。
简化系统并不一定意味着减少功能;它也可以意味着消除 **额外的accidental** 的复杂度。Moseley 和 Marks【32】把 **额外复杂度** 定义为:由具体实现中涌现,而非(从用户视角看,系统所解决的)问题本身固有的复杂度。
用于消除 **额外复杂度** 的最好工具之一是 **抽象abstraction**。一个好的抽象可以将大量实现细节隐藏在一个干净,简单易懂的外观下面。一个好的抽象也可以广泛用于各类不同应用。比起重复造很多轮子,重用抽象不仅更有效率,而且有助于开发高质量的软件。抽象组件的质量改进将使所有使用它的应用受益。
例如高级编程语言是一种抽象隐藏了机器码、CPU 寄存器和系统调用。SQL 也是一种抽象,隐藏了复杂的磁盘 / 内存数据结构、来自其他客户端的并发请求、崩溃后的不一致性。当然在用高级语言编程时,我们仍然用到了机器码;只不过没有 **直接directly** 使用罢了,正是因为编程语言的抽象,我们才不必去考虑这些实现细节。
抽象可以帮助我们将系统的复杂度控制在可管理的水平,不过,找到好的抽象是非常困难的。在分布式系统领域虽然有许多好的算法,但我们并不清楚它们应该打包成什么样抽象。
本书将紧盯那些允许我们将大型系统的部分提取为定义明确的、可重用的组件的优秀抽象。
### 可演化性:拥抱变化
系统的需求永远不变,基本是不可能的。更可能的情况是,它们处于常态的变化中,例如:你了解了新的事实、出现意想不到的应用场景、业务优先级发生变化、用户要求新功能、新平台取代旧平台、法律或监管要求发生变化、系统增长迫使架构变化等。
在组织流程方面,**敏捷agile** 工作模式为适应变化提供了一个框架。敏捷社区还开发了对在频繁变化的环境中开发软件很有帮助的技术工具和模式,如 **测试驱动开发TDD, test-driven development****重构refactoring**
这些敏捷技术的大部分讨论都集中在相当小的规模(同一个应用中的几个代码文件)。本书将探索在更大数据系统层面上提高敏捷性的方法,可能由几个不同的应用或服务组成。例如,为了将装配主页时间线的方法从方法 1 变为方法 2你会如何 “重构” 推特的架构
修改数据系统并使其适应不断变化需求的容易程度,是与 **简单性****抽象性** 密切相关的:简单易懂的系统通常比复杂系统更容易修改。但由于这是一个非常重要的概念,我们将用一个不同的词来指代数据系统层面的敏捷性: **可演化性evolvability** 【34】。
## 本章小结
本章探讨了一些关于数据密集型应用的基本思考方式。这些原则将指导我们阅读本书的其余部分,那里将会深入技术细节。
一个应用必须满足各种需求才称得上有用。有一些 **功能需求**functional requirements即它应该做什么比如允许以各种方式存储检索搜索和处理数据以及一些 **非功能性需求**nonfunctional即通用属性例如安全性、可靠性、合规性、可伸缩性、兼容性和可维护性。在本章详细讨论了可靠性可伸缩性和可维护性。
**可靠性Reliability** 意味着即使发生故障,系统也能正常工作。故障可能发生在硬件(通常是随机的和不相关的)、软件(通常是系统性的 Bug很难处理和人类不可避免地时不时出错。**容错技术** 可以对终端用户隐藏某些类型的故障。
**可伸缩性Scalability** 意味着即使在负载增加的情况下也有保持性能的策略。为了讨论可伸缩性,我们首先需要定量描述负载和性能的方法。我们简要了解了推特主页时间线的例子,介绍描述负载的方法,并将响应时间百分位点作为衡量性能的一种方式。在可伸缩的系统中可以添加 **处理容量processing capacity** 以在高负载下保持可靠。
**可维护性Maintainability** 有许多方面,但实质上是关于工程师和运维团队的生活质量的。良好的抽象可以帮助降低复杂度,并使系统易于修改和适应新的应用场景。良好的可操作性意味着对系统的健康状态具有良好的可见性,并拥有有效的管理手段。
不幸的是,使应用可靠、可伸缩或可维护并不容易。但是某些模式和技术会不断重新出现在不同的应用中。在接下来的几章中,我们将看到一些数据系统的例子,并分析它们如何实现这些目标。
在本书后面的 [第三部分](../part-iii.md) 中,我们将看到一种模式:几个组件协同工作以构成一个完整的系统(如 [图 1-1](../img/fig1-1.png) 中的例子)
## 参考文献
1. Michael Stonebraker and Uğur Çetintemel: “['One Size Fits All': An Idea Whose Time Has Come and Gone](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.68.9136&rep=rep1&type=pdf),” at *21st International Conference on Data Engineering* (ICDE), April 2005.
1. Walter L. Heimerdinger and Charles B. Weinstock: “[A Conceptual Framework for System Fault Tolerance](http://www.sei.cmu.edu/reports/92tr033.pdf),” Technical Report CMU/SEI-92-TR-033, Software Engineering Institute, Carnegie Mellon University, October 1992.
1. Ding Yuan, Yu Luo, Xin Zhuang, et al.: “[Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems](https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf),” at *11th USENIX Symposium on Operating Systems Design and Implementation* (OSDI), October 2014.
1. Yury Izrailevsky and Ariel Tseitlin: “[The Netflix Simian Army](http://techblog.netflix.com/2011/07/netflix-simian-army.html),” *techblog.netflix.com*, July 19, 2011.
1. Daniel Ford, François Labelle, Florentina I. Popovici, et al.: “[Availability in Globally Distributed Storage Systems](http://research.google.com/pubs/archive/36737.pdf),” at *9th USENIX Symposium on Operating Systems Design and Implementation* (OSDI), October 2010.
1. Brian Beach: “[Hard Drive Reliability Update Sep 2014](https://www.backblaze.com/blog/hard-drive-reliability-update-september-2014/),” *backblaze.com*, September 23, 2014.
1. Laurie Voss: “[AWS: The Good, the Bad and the Ugly](https://web.archive.org/web/20160429075023/http://blog.awe.sm/2012/12/18/aws-the-good-the-bad-and-the-ugly/),” *blog.awe.sm*, December 18, 2012.
1. Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, et al.: “[What Bugs Live in the Cloud?](http://ucare.cs.uchicago.edu/pdf/socc14-cbs.pdf),” at *5th ACM Symposium on Cloud Computing* (SoCC), November 2014. [doi:10.1145/2670979.2670986](http://dx.doi.org/10.1145/2670979.2670986)
1. Nelson Minar: “[Leap Second Crashes Half the Internet](http://www.somebits.com/weblog/tech/bad/leap-second-2012.html),” *somebits.com*, July 3, 2012.
1. Amazon Web Services: “[Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region](http://aws.amazon.com/message/65648/),” *aws.amazon.com*, April 29, 2011.
1. Richard I. Cook: “[How Complex Systems Fail](http://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf),” Cognitive Technologies Laboratory, April 2000.
1. Jay Kreps: “[Getting Real About Distributed System Reliability](http://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability),” *blog.empathybox.com*, March 19, 2012.
1. David Oppenheimer, Archana Ganapathi, and David A. Patterson: “[Why Do Internet Services Fail, and What Can Be Done About It?](http://static.usenix.org/legacy/events/usits03/tech/full_papers/oppenheimer/oppenheimer.pdf),” at *4th USENIX Symposium on Internet Technologies and Systems* (USITS), March 2003.
1. Nathan Marz: “[Principles of Software Engineering, Part 1](http://nathanmarz.com/blog/principles-of-software-engineering-part-1.html),” *nathanmarz.com*, April 2, 2013.
1. Michael Jurewitz:“[The Human Impact of Bugs](http://jury.me/blog/2013/3/14/the-human-impact-of-bugs),” *jury.me*, March 15, 2013.
1. Raffi Krikorian: “[Timelines at Scale](http://www.infoq.com/presentations/Twitter-Timeline-Scalability),” at *QCon San Francisco*, November 2012.
1. Martin Fowler: *Patterns of Enterprise Application Architecture*. Addison Wesley, 2002. ISBN: 978-0-321-12742-6
1. Kelly Sommers: “[After all that run around, what caused 500ms disk latency even when we replaced physical server?](https://twitter.com/kellabyte/status/532930540777635840)” *twitter.com*, November 13, 2014.
1. Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, et al.: “[Dynamo: Amazon's Highly Available Key-Value Store](http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf),” at *21st ACM Symposium on Operating Systems Principles* (SOSP), October 2007.
1. Greg Linden: “[Make Data Useful](http://glinden.blogspot.co.uk/2006/12/slides-from-my-talk-at-stanford.html),” slides from presentation at Stanford University Data Mining class (CS345), December 2006.
1. Tammy Everts: “[The Real Cost of Slow Time vs Downtime](http://www.webperformancetoday.com/2014/11/12/real-cost-slow-time-vs-downtime-slides/),” *webperformancetoday.com*, November 12, 2014.
1. Jake Brutlag:“[Speed Matters for Google Web Search](http://googleresearch.blogspot.co.uk/2009/06/speed-matters.html),” *googleresearch.blogspot.co.uk*, June 22, 2009.
1. Tyler Treat: “[Everything You Know About Latency Is Wrong](http://bravenewgeek.com/everything-you-know-about-latency-is-wrong/),” *bravenewgeek.com*, December 12, 2015.
1. Jeffrey Dean and Luiz André Barroso: “[The Tail at Scale](http://cacm.acm.org/magazines/2013/2/160173-the-tail-at-scale/fulltext),” *Communications of the ACM*, volume 56, number 2, pages 7480, February 2013. [doi:10.1145/2408776.2408794](http://dx.doi.org/10.1145/2408776.2408794)
1. Graham Cormode, Vladislav Shkapenyuk, Divesh Srivastava, and Bojian Xu: “[Forward Decay: A Practical Time Decay Model for Streaming Systems](http://dimacs.rutgers.edu/~graham/pubs/papers/fwddecay.pdf),” at *25th IEEE International Conference on Data Engineering* (ICDE), March 2009.
1. Ted Dunning and Otmar Ertl: “[Computing Extremely Accurate Quantiles Using t-Digests](https://github.com/tdunning/t-digest),” *github.com*, March 2014.
1. Gil Tene: “[HdrHistogram](http://www.hdrhistogram.org/),” *hdrhistogram.org*.
1. Baron Schwartz: “[Why Percentiles Dont Work the Way You Think](https://www.vividcortex.com/blog/why-percentiles-dont-work-the-way-you-think),” *vividcortex.com*, December 7, 2015.
1. James Hamilton: “[On Designing and Deploying Internet-Scale Services](https://www.usenix.org/legacy/events/lisa07/tech/full_papers/hamilton/hamilton.pdf),” at *21st Large Installation System Administration Conference* (LISA), November 2007.
1. Brian Foote and Joseph Yoder: “[Big Ball of Mud](http://www.laputan.org/pub/foote/mud.pdf),” at *4th Conference on Pattern Languages of Programs* (PLoP), September 1997.
1. Frederick P Brooks: “No Silver Bullet Essence and Accident in Software Engineering,” in *The Mythical Man-Month*, Anniversary edition, Addison-Wesley, 1995. ISBN: 978-0-201-83595-3
1. Ben Moseley and Peter Marks: “[Out of the Tar Pit](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.93.8928),” at *BCS Software Practice Advancement* (SPA), 2006.
1. Rich Hickey: “[Simple Made Easy](http://www.infoq.com/presentations/Simple-Made-Easy),” at *Strange Loop*, September 2011.
1. Hongyu Pei Breivold, Ivica Crnkovic, and Peter J. Eriksson: “[Analyzing Software Evolvability](http://www.mrtc.mdh.se/publications/1478.pdf),” at *32nd Annual IEEE International Computer Software and Applications Conference* (COMPSAC), July 2008. [doi:10.1109/COMPSAC.2008.50](http://dx.doi.org/10.1109/COMPSAC.2008.50)
------
| 上一章 | 目录 | 下一章 |
| ----------------------------------- | ------------------------------- | ------------------------------------ |
| [第一部分:数据系统基础](../part-i.md) | [设计数据密集型应用](../README.md) | [第二章:数据模型与查询语言](../ch2.md) |

989
tmp/ch2o.md Normal file
View File

@ -0,0 +1,989 @@
# 第二章:数据模型与查询语言
![](img//ch2.png)
> 语言的边界就是思想的边界。
>
> —— 路德维奇・维特根斯坦《逻辑哲学》1922
-------------------
[TOC]
数据模型可能是软件开发中最重要的部分了,因为它们的影响如此深远:不仅仅影响着软件的编写方式,而且影响着我们的 **解题思路**
多数应用使用层层叠加的数据模型构建。对于每层数据模型的关键问题是:它是如何用低一层数据模型来 **表示** 的?例如:
1. 作为一名应用开发人员,你观察现实世界(里面有人员、组织、货物、行为、资金流向、传感器等),并采用对象或数据结构,以及操控那些数据结构的 API 来进行建模。那些结构通常是特定于应用程序的。
2. 当要存储那些数据结构时,你可以利用通用数据模型来表示它们,如 JSON 或 XML 文档、关系数据库中的表或图模型。
3. 数据库软件的工程师选定如何以内存、磁盘或网络上的字节来表示 JSON / XML/ 关系 / 图数据。这类表示形式使数据有可能以各种方式来查询,搜索,操纵和处理。
4. 在更低的层次上,硬件工程师已经想出了使用电流、光脉冲、磁场或者其他东西来表示字节的方法。
一个复杂的应用程序可能会有更多的中间层次,比如基于 API 的 API不过基本思想仍然是一样的每个层都通过提供一个明确的数据模型来隐藏更低层次中的复杂性。这些抽象允许不同的人群有效地协作例如数据库厂商的工程师和使用数据库的应用程序开发人员
数据模型种类繁多,每个数据模型都带有如何使用的设想。有些用法很容易,有些则不支持如此;有些操作运行很快,有些则表现很差;有些数据转换非常自然,有些则很麻烦。
掌握一个数据模型需要花费很多精力(想想关系数据建模有多少本书)。即便只使用一个数据模型,不用操心其内部工作机制,构建软件也是非常困难的。然而,因为数据模型对上层软件的功能(能做什么,不能做什么)有着至深的影响,所以选择一个适合的数据模型是非常重要的。
在本章中,我们将研究一系列用于数据存储和查询的通用数据模型(前面列表中的第 2 点)。特别地,我们将比较关系模型,文档模型和少量基于图形的数据模型。我们还将查看各种查询语言并比较它们的用例。在 [第三章](../ch3.md) 中,我们将讨论存储引擎是如何工作的。也就是说,这些数据模型实际上是如何实现的(列表中的第 3 点)。
## 关系模型与文档模型
现在最著名的数据模型可能是 SQL。它基于 Edgar Codd 在 1970 年提出的关系模型【1】数据被组织成 **关系**SQL 中称作 **表**),其中每个关系是 **元组**SQL 中称作 **行**) 的无序集合。
关系模型曾是一个理论性的提议,当时很多人都怀疑是否能够有效实现它。然而到了 20 世纪 80 年代中期关系数据库管理系统RDBMSes和 SQL 已成为大多数人们存储和查询某些常规结构的数据的首选工具。关系数据库已经持续称霸了大约 25~30 年 —— 这对计算机史来说是极其漫长的时间。
关系数据库起源于商业数据处理,在 20 世纪 60 年代和 70 年代用大型计算机来执行。从今天的角度来看,那些用例显得很平常:典型的 **事务处理**(将销售或银行交易,航空公司预订,库存管理信息记录在库)和 **批处理**(客户发票,工资单,报告)。
当时的其他数据库迫使应用程序开发人员必须考虑数据库内部的数据表示形式。关系模型致力于将上述实现细节隐藏在更简洁的接口之后。
多年来,在数据存储和查询方面存在着许多相互竞争的方法。在 20 世纪 70 年代和 80 年代初网状模型network model和层次模型hierarchical model曾是主要的选择但关系模型relational model随后占据了主导地位。对象数据库在 20 世纪 80 年代末和 90 年代初来了又去。XML 数据库在二十一世纪初出现但只有小众采用过。关系模型的每个竞争者都在其时代产生了大量的炒作但从来没有持续【2】。
随着电脑越来越强大和互联,它们开始用于日益多样化的目的。关系数据库非常成功地被推广到业务数据处理的原始范围之外更为广泛的用例上。你今天在网上看到的大部分内容依旧是由关系数据库来提供支持,无论是在线发布、讨论、社交网络、电子商务、游戏、软件即服务生产力应用程序等内容。
### NoSQL 的诞生
现在 - 2010 年代NoSQL 开始了最新一轮尝试试图推翻关系模型的统治地位。“NoSQL” 这个名字让人遗憾,因为实际上它并没有涉及到任何特定的技术。最初它只是作为一个醒目的 Twitter 标签,用在 2009 年一个关于分布式,非关系数据库上的开源聚会上。无论如何,这个术语触动了某些神经,并迅速在网络创业社区内外传播开来。好些有趣的数据库系统现在都与 *#NoSQL* 标签相关联,并且 NoSQL 被追溯性地重新解释为 **不仅是 SQLNot Only SQL** 【4】。
采用 NoSQL 数据库的背后有几个驱动因素,其中包括:
* 需要比关系数据库更好的可伸缩性,包括非常大的数据集或非常高的写入吞吐量
* 相比商业数据库产品,免费和开源软件更受偏爱
* 关系模型不能很好地支持一些特殊的查询操作
* 受挫于关系模型的限制性渴望一种更具多动态性与表现力的数据模型【5】
不同的应用程序有不同的需求,一个用例的最佳技术选择可能不同于另一个用例的最佳技术选择。因此,在可预见的未来,关系数据库似乎可能会继续与各种非关系数据库一起使用 - 这种想法有时也被称为 **混合持久化polyglot persistence**
### 对象关系不匹配
目前大多数应用程序开发都使用面向对象的编程语言来开发,这导致了对 SQL 数据模型的普遍批评:如果数据存储在关系表中,那么需要一个笨拙的转换层,处于应用程序代码中的对象和表,行,列的数据库模型之间。模型之间的不连贯有时被称为 **阻抗不匹配impedance mismatch**[^i]。
[^i]: 一个从电子学借用的术语。每个电路的输入和输出都有一定的阻抗(交流电阻)。当你将一个电路的输出连接到另一个电路的输入时,如果两个电路的输出和输入阻抗匹配,则连接上的功率传输将被最大化。阻抗不匹配会导致信号反射及其他问题。
像 ActiveRecord 和 Hibernate 这样的 **对象关系映射ORM object-relational mapping** 框架可以减少这个转换层所需的样板代码的数量,但是它们不能完全隐藏这两个模型之间的差异。
![](img//fig2-1.png)
**图 2-1 使用关系型模式来表示领英简介**
例如,[图 2-1](img//fig2-1.png) 展示了如何在关系模式中表示简历(一个 LinkedIn 简介)。整个简介可以通过一个唯一的标识符 `user_id` 来标识。像 `first_name``last_name` 这样的字段每个用户只出现一次,所以可以在 User 表上将其建模为列。但是,大多数人在职业生涯中拥有多于一份的工作,人们可能有不同样的教育阶段和任意数量的联系信息。从用户到这些项目之间存在一对多的关系,可以用多种方式来表示:
* 传统 SQL 模型SQL1999 之前)中,最常见的规范化表示形式是将职位,教育和联系信息放在单独的表中,对 User 表提供外键引用,如 [图 2-1](img//fig2-1.png) 所示。
* 后续的 SQL 标准增加了对结构化数据类型和 XML 数据的支持;这允许将多值数据存储在单行内,并支持在这些文档内查询和索引。这些功能在 OracleIBM DB2MS SQL Server 和 PostgreSQL 中都有不同程度的支持【6,7】。JSON 数据类型也得到多个数据库的支持,包括 IBM DB2MySQL 和 PostgreSQL 【8】。
* 第三种选择是将职业,教育和联系信息编码为 JSON 或 XML 文档,将其存储在数据库的文本列中,并让应用程序解析其结构和内容。这种配置下,通常不能使用数据库来查询该编码列中的值。
对于一个像简历这样自包含文档的数据结构而言JSON 表示是非常合适的:请参阅 [例 2-1]()。JSON 比 XML 更简单。面向文档的数据库(如 MongoDB 【9】RethinkDB 【10】CouchDB 【11】和 Espresso【12】支持这种数据模型。
**例 2-1. 用 JSON 文档表示一个 LinkedIn 简介**
```json
{
"user_id": 251,
"first_name": "Bill",
"last_name": "Gates",
"summary": "Co-chair of the Bill & Melinda Gates... Active blogger.",
"region_id": "us:91",
"industry_id": 131,
"photo_url": "/p/7/000/253/05b/308dd6e.jpg",
"positions": [
{
"job_title": "Co-chair",
"organization": "Bill & Melinda Gates Foundation"
},
{
"job_title": "Co-founder, Chairman",
"organization": "Microsoft"
}
],
"education": [
{
"school_name": "Harvard University",
"start": 1973,
"end": 1975
},
{
"school_name": "Lakeside School, Seattle",
"start": null,
"end": null
}
],
"contact_info": {
"blog": "http://thegatesnotes.com",
"twitter": "http://twitter.com/BillGates"
}
}
```
有一些开发人员认为 JSON 模型减少了应用程序代码和存储层之间的阻抗不匹配。不过,正如我们将在 [第四章](../ch4.md) 中看到的那样JSON 作为数据编码格式也存在问题。无模式对 JSON 模型来说往往被认为是一个优势;我们将在 “[文档模型中的模式灵活性](#文档模型中的模式灵活性)” 中讨论这个问题。
JSON 表示比 [图 2-1](img//fig2-1.png) 中的多表模式具有更好的 **局部性locality**。如果在前面的关系型示例中获取简介,那需要执行多个查询(通过 `user_id` 查询每个表),或者在 User 表与其下属表之间混乱地执行多路连接。而在 JSON 表示中,所有相关信息都在同一个地方,一个查询就足够了。
从用户简介文件到用户职位,教育历史和联系信息,这种一对多关系隐含了数据中的一个树状结构,而 JSON 表示使得这个树状结构变得明确(见 [图 2-2](img//fig2-2.png))。
![](img//fig2-2.png)
**图 2-2 一对多关系构建了一个树结构**
### 多对一和多对多的关系
在上一节的 [例 2-1]() 中,`region_id` 和 `industry_id` 是以 ID而不是纯字符串 “Greater Seattle Area” 和 “Philanthropy” 的形式给出的。为什么?
如果用户界面用一个自由文本字段来输入区域和行业,那么将他们存储为纯文本字符串是合理的。另一方式是给出地理区域和行业的标准化的列表,并让用户从下拉列表或自动填充器中进行选择,其优势如下:
* 各个简介之间样式和拼写统一
* 避免歧义(例如,如果有几个同名的城市)
* 易于更新 —— 名称只存储在一个地方,如果需要更改(例如,由于政治事件而改变城市名称),很容易进行全面更新。
* 本地化支持 —— 当网站翻译成其他语言时,标准化的列表可以被本地化,使得地区和行业可以使用用户的语言来显示
* 更好的搜索 —— 例如,搜索华盛顿州的慈善家就会匹配这份简介,因为地区列表可以编码记录西雅图在华盛顿这一事实(从 “Greater Seattle Area” 这个字符串中看不出来)
存储 ID 还是文本字符串,这是个 **副本duplication** 问题。当使用 ID 时对人类有意义的信息比如单词Philanthropy只存储在一处所有引用它的地方使用 IDID 只在数据库中有意义)。当直接存储文本时,对人类有意义的信息会复制在每处使用记录中。
使用 ID 的好处是ID 对人类没有任何意义因而永远不需要改变ID 可以保持不变,即使它标识的信息发生变化。任何对人类有意义的东西都可能需要在将来某个时候改变 —— 如果这些信息被复制,所有的冗余副本都需要更新。这会导致写入开销,也存在不一致的风险(一些副本被更新了,还有些副本没有被更新)。去除此类重复是数据库 **规范化normalization** 的关键思想。[^ii]
[^ii]: 关于关系模型的文献区分了几种不同的规范形式,但这些区别几乎没有实际意义。一个经验法则是,如果重复存储了可以存储在一个地方的值,则模式就不是 **规范化normalized** 的。
> 数据库管理员和开发人员喜欢争论规范化和非规范化,让我们暂时保留判断吧。在本书的 [第三部分](../part-iii.md),我们将回到这个话题,探讨系统的方法用以处理缓存,非规范化和衍生数据。
不幸的是,对这些数据进行规范化需要多对一的关系(许多人生活在一个特定的地区,许多人在一个特定的行业工作),这与文档模型不太吻合。在关系数据库中,通过 ID 来引用其他表中的行是正常的,因为连接很容易。在文档数据库中,一对多树结构没有必要用连接,对连接的支持通常很弱 [^iii]。
[^iii]: 在撰写本文时RethinkDB 支持连接MongoDB 不支持连接,而 CouchDB 只支持预先声明的视图。
如果数据库本身不支持连接,则必须在应用程序代码中通过对数据库进行多个查询来模拟连接。(在这种情况中,地区和行业的列表可能很小,改动很少,应用程序可以简单地将其保存在内存中。不过,执行连接的工作从数据库被转移到应用程序代码上。)
此外,即便应用程序的最初版本适合无连接的文档模型,随着功能添加到应用程序中,数据会变得更加互联。例如,考虑一下对简历例子进行的一些修改:
* 组织和学校作为实体
在前面的描述中,`organization`(用户工作的公司)和 `school_name`(他们学习的地方)只是字符串。也许他们应该是对实体的引用呢?然后,每个组织、学校或大学都可以拥有自己的网页(标识、新闻提要等)。每个简历可以链接到它所提到的组织和学校,并且包括他们的图标和其他信息(请参阅 [图 2-3](img//fig2-3.png),来自 LinkedIn 的一个例子)。
* 推荐
假设你想添加一个新的功能:一个用户可以为另一个用户写一个推荐。在用户的简历上显示推荐,并附上推荐用户的姓名和照片。如果推荐人更新他们的照片,那他们写的任何推荐都需要显示新的照片。因此,推荐应该拥有作者个人简介的引用。
![](img//fig2-3.png)
**图 2-3 公司名不仅是字符串还是一个指向公司实体的链接LinkedIn 截图)**
[图 2-4](img//fig2-4.png) 阐明了这些新功能需要如何使用多对多关系。每个虚线矩形内的数据可以分组成一个文档,但是对单位,学校和其他用户的引用需要表示成引用,并且在查询时需要连接。
![](img//fig2-4.png)
**图 2-4 使用多对多关系扩展简历**
### 文档数据库是否在重蹈覆辙?
在多对多的关系和连接已常规用在关系数据库时,文档数据库和 NoSQL 重启了辩论:如何以最佳方式在数据库中表示多对多关系。那场辩论可比 NoSQL 古老得多,事实上,最早可以追溯到计算机化数据库系统。
20 世纪 70 年代最受欢迎的业务数据处理数据库是 IBM 的信息管理系统IMS最初是为了阿波罗太空计划的库存管理而开发的并于 1968 年有了首次商业发布【13】。目前它仍在使用和维护运行在 IBM 大型机的 OS/390 上【14】。
IMS 的设计中使用了一个相当简单的数据模型,称为 **层次模型hierarchical model**,它与文档数据库使用的 JSON 模型有一些惊人的相似之处【2】。它将所有数据表示为嵌套在记录中的记录树这很像 [图 2-2](img//fig2-2.png) 的 JSON 结构。
同文档数据库一样IMS 能良好处理一对多的关系但是很难应对多对多的关系并且不支持连接。开发人员必须决定是否复制非规范化数据或手动解决从一个记录到另一个记录的引用。这些二十世纪六七十年代的问题与现在开发人员遇到的文档数据库问题非常相似【15】。
那时人们提出了各种不同的解决方案来解决层次模型的局限性。其中最突出的两个是 **关系模型**relational model它变成了 SQL并统治了世界**网状模型**network model最初很受关注但最终变得冷门。这两个阵营之间的 “大辩论” 在 70 年代持续了很久时间【2】。
那两个模式解决的问题与当前的问题相关,因此值得简要回顾一下那场辩论。
#### 网状模型
网状模型由一个称为数据系统语言会议CODASYL的委员会进行了标准化并被数个不同的数据库厂商实现它也被称为 CODASYL 模型【16】。
CODASYL 模型是层次模型的推广。在层次模型的树结构中每条记录只有一个父节点在网络模式中每条记录可能有多个父节点。例如“Greater Seattle Area” 地区可能是一条记录,每个居住在该地区的用户都可以与之相关联。这允许对多对一和多对多的关系进行建模。
网状模型中记录之间的链接不是外键,而更像编程语言中的指针(同时仍然存储在磁盘上)。访问记录的唯一方法是跟随从根记录起沿这些链路所形成的路径。这被称为 **访问路径access path**
最简单的情况下,访问路径类似遍历链表:从列表头开始,每次查看一条记录,直到找到所需的记录。但在多对多关系的情况中,数条不同的路径可以到达相同的记录,网状模型的程序员必须跟踪这些不同的访问路径。
CODASYL 中的查询是通过利用遍历记录列和跟随访问路径表在数据库中移动游标来执行的。如果记录有多个父结点(即多个来自其他记录的传入指针),则应用程序代码必须跟踪所有的各种关系。甚至 CODASYL 委员会成员也承认,这就像在 n 维数据空间中进行导航【17】。
尽管手动选择访问路径能够最有效地利用 20 世纪 70 年代非常有限的硬件功能(如磁带驱动器,其搜索速度非常慢),但这使得查询和更新数据库的代码变得复杂不灵活。无论是分层还是网状模型,如果你没有所需数据的路径,就会陷入困境。你可以改变访问路径,但是必须浏览大量手写数据库查询代码,并重写来处理新的访问路径。更改应用程序的数据模型是很难的。
#### 关系模型
相比之下,关系模型做的就是将所有的数据放在光天化日之下:一个 **关系(表)** 只是一个 **元组(行)** 的集合,仅此而已。如果你想读取数据,它没有迷宫似的嵌套结构,也没有复杂的访问路径。你可以选中符合任意条件的行,读取表中的任何或所有行。你可以通过指定某些列作为匹配关键字来读取特定行。你可以在任何表中插入一个新的行,而不必担心与其他表的外键关系 [^iv]。
[^iv]: 外键约束允许对修改进行限制,但对于关系模型这并不是必选项。即使有约束,外键连接在查询时执行,而在 CODASYL 中,连接在插入时高效完成。
在关系数据库中,查询优化器自动决定查询的哪些部分以哪个顺序执行,以及使用哪些索引。这些选择实际上是 “访问路径”,但最大的区别在于它们是由查询优化器自动生成的,而不是由程序员生成,所以我们很少需要考虑它们。
如果想按新的方式查询数据,你可以声明一个新的索引,查询会自动使用最合适的那些索引。无需更改查询来利用新的索引(请参阅 “[数据查询语言](#数据查询语言)”)。关系模型因此使添加应用程序新功能变得更加容易。
关系数据库的查询优化器是复杂的已耗费了多年的研究和开发精力【18】。关系模型的一个关键洞察是只需构建一次查询优化器随后使用该数据库的所有应用程序都可以从中受益。如果你没有查询优化器的话那么为特定查询手动编写访问路径比编写通用优化器更容易 —— 不过从长期看通用解决方案更好。
#### 与文档数据库相比
在一个方面,文档数据库还原为层次模型:在其父记录中存储嵌套记录([图 2-1](img//fig2-1.png) 中的一对多关系,如 `positions``education` 和 `contact_info`),而不是在单独的表中。
但是,在表示多对一和多对多的关系时,关系数据库和文档数据库并没有根本的不同:在这两种情况下,相关项目都被一个唯一的标识符引用,这个标识符在关系模型中被称为 **外键**,在文档模型中称为 **文档引用**【9】。该标识符在读取时通过连接或后续查询来解析。迄今为止文档数据库没有走 CODASYL 的老路。
### 关系型数据库与文档数据库在今日的对比
将关系数据库与文档数据库进行比较时,可以考虑许多方面的差异,包括它们的容错属性(请参阅 [第五章](../ch5.md))和处理并发性(请参阅 [第七章](../ch7.md))。本章将只关注数据模型中的差异。
支持文档数据模型的主要论据是架构灵活性,因局部性而拥有更好的性能,以及对于某些应用程序而言更接近于应用程序使用的数据结构。关系模型通过为连接提供更好的支持以及支持多对一和多对多的关系来反击。
#### 哪种数据模型更有助于简化应用代码?
如果应用程序中的数据具有类似文档的结构(即,一对多关系树,通常一次性加载整个树),那么使用文档模型可能是一个好主意。将类似文档的结构分解成多个表(如 [图 2-1](img//fig2-1.png) 中的 `positions`、`education` 和 `contact_info`)的关系技术可能导致繁琐的模式和不必要的复杂的应用程序代码。
文档模型有一定的局限性:例如,不能直接引用文档中的嵌套的项目,而是需要说 “用户 251 的位置列表中的第二项”(很像层次模型中的访问路径)。但是,只要文件嵌套不太深,这通常不是问题。
文档数据库对连接的糟糕支持可能是个问题也可能不是问题这取决于应用程序。例如如果某分析型应用程序使用一个文档数据库来记录何时何地发生了何事那么多对多关系可能永远也用不上。【19】。
但如果你的应用程序确实会用到多对多关系那么文档模型就没有那么诱人了。尽管可以通过反规范化来消除对连接的需求但这需要应用程序代码来做额外的工作以确保数据一致性。尽管应用程序代码可以通过向数据库发出多个请求的方式来模拟连接但这也将复杂性转移到应用程序中而且通常也会比由数据库内的专用代码更慢。在这种情况下使用文档模型可能会导致更复杂的应用代码与更差的性能【15】。
我们没有办法说哪种数据模型更有助于简化应用代码,因为它取决于数据项之间的关系种类。对高度关联的数据而言,文档模型是极其糟糕的,关系模型是可以接受的,而选用图形模型(请参阅 “[图数据模型](#图数据模型)”)是最自然的。
#### 文档模型中的模式灵活性
大多数文档数据库以及关系数据库中的 JSON 支持都不会强制文档中的数据采用何种模式。关系数据库的 XML 支持通常带有可选的模式验证。没有模式意味着可以将任意的键和值添加到文档中,并且当读取时,客户端无法保证文档可能包含的字段。
文档数据库有时称为 **无模式schemaless**,但这具有误导性,因为读取数据的代码通常假定某种结构 —— 即存在隐式模式但不由数据库强制执行【20】。一个更精确的术语是 **读时模式**(即 schema-on-read数据的结构是隐含的只有在数据被读取时才被解释相应的是 **写时模式**(即 schema-on-write传统的关系数据库方法中模式明确且数据库确保所有的数据都符合其模式【21】。
读时模式类似于编程语言中的动态运行时类型检查而写时模式类似于静态编译时类型检查。就像静态和动态类型检查的相对优点具有很大的争议性一样【22】数据库中模式的强制性是一个具有争议的话题一般来说没有正确或错误的答案。
在应用程序想要改变其数据格式的情况下这些方法之间的区别尤其明显。例如假设你把每个用户的全名存储在一个字段中而现在想分别存储名字和姓氏【23】。在文档数据库中只需开始写入具有新字段的新文档并在应用程序中使用代码来处理读取旧文档的情况。例如
```go
if (user && user.name && !user.first_name) {
// Documents written before Dec 8, 2013 don't have first_name
user.first_name = user.name.split(" ")[0];
}
```
另一方面,在 “静态类型” 数据库模式中,通常会执行以下 **迁移migration** 操作:
```sql
ALTER TABLE users ADD COLUMN first_name text;
UPDATE users SET first_name = split_part(name, ' ', 1); -- PostgreSQL
UPDATE users SET first_name = substring_index(name, ' ', 1); -- MySQL
```
模式变更的速度很慢,而且要求停运。它的这种坏名誉并不是完全应得的:大多数关系数据库系统可在几毫秒内执行 `ALTER TABLE` 语句。MySQL 是一个值得注意的例外,它执行 `ALTER TABLE` 时会复制整个表这可能意味着在更改一个大型表时会花费几分钟甚至几个小时的停机时间尽管存在各种工具来解决这个限制【24,25,26】。
大型表上运行 `UPDATE` 语句在任何数据库上都可能会很慢,因为每一行都需要重写。要是不可接受的话,应用程序可以将 `first_name` 设置为默认值 `NULL`,并在读取时再填充,就像使用文档数据库一样。
当由于某种原因(例如,数据是异构的)集合中的项目并不都具有相同的结构时,读时模式更具优势。例如,如果:
* 存在许多不同类型的对象,将每种类型的对象放在自己的表中是不现实的。
* 数据的结构由外部系统决定。你无法控制外部系统且它随时可能变化。
在上述情况下,模式的坏处远大于它的帮助,无模式文档可能是一个更加自然的数据模型。但是,要是所有记录都具有相同的结构,那么模式是记录并强制这种结构的有效机制。第四章将更详细地讨论模式和模式演化。
#### 查询的数据局部性
文档通常以单个连续字符串形式进行存储,编码为 JSON、XML 或其二进制变体(如 MongoDB 的 BSON。如果应用程序经常需要访问整个文档例如将其渲染至网页那么存储局部性会带来性能优势。如果将数据分割到多个表中如 [图 2-1](img//fig2-1.png) 所示),则需要进行多次索引查找才能将其全部检索出来,这可能需要更多的磁盘查找并花费更多的时间。
局部性仅仅适用于同时需要文档绝大部分内容的情况。即使只访问文档其中的一小部分数据库通常需要加载整个文档对于大型文档来说这种加载行为是很浪费的。更新文档时通常需要整个重写。只有不改变文档大小的修改才可以容易地原地执行。因此通常建议保持相对小的文档并避免增加文档大小的写入【9】。这些性能限制大大减少了文档数据库的实用场景。
值得指出的是为了局部性而分组集合相关数据的想法并不局限于文档模型。例如Google 的 Spanner 数据库在关系数据模型中提供了同样的局部性属性允许模式声明一个表的行应该交错嵌套在父表内【27】。Oracle 类似地允许使用一个称为 **多表索引集群表multi-table index cluster tables** 的类似特性【28】。Bigtable 数据模型(用于 Cassandra 和 HBase中的 **列族column-family** 概念与管理局部性的目的类似【29】。
在 [第三章](../ch3.md) 将还会看到更多关于局部性的内容。
#### 文档和关系数据库的融合
自 2000 年代中期以来大多数关系数据库系统MySQL 除外)都已支持 XML。这包括对 XML 文档进行本地修改的功能,以及在 XML 文档中进行索引和查询的功能。这允许应用程序使用那种与文档数据库应当使用的非常类似的数据模型。
从 9.3 版本开始的 PostgreSQL 【8】从 5.7 版本开始的 MySQL 以及从版本 10.5 开始的 IBM DB2【30】也对 JSON 文档提供了类似的支持级别。鉴于用在 Web APIs 的 JSON 流行趋势,其他关系数据库很可能会跟随他们的脚步并添加 JSON 支持。
在文档数据库中RethinkDB 在其查询语言中支持类似关系的连接,一些 MongoDB 驱动程序可以自动解析数据库引用(有效地执行客户端连接,尽管这可能比在数据库中执行的连接慢,需要额外的网络往返,并且优化更少)。
随着时间的推移,关系数据库和文档数据库似乎变得越来越相似,这是一件好事:数据模型相互补充 [^v],如果一个数据库能够处理类似文档的数据,并能够对其执行关系查询,那么应用程序就可以使用最符合其需求的功能组合。
关系模型和文档模型的混合是未来数据库一条很好的路线。
[^v]: Codd 对关系模型【1】的原始描述实际上允许在关系模式中与 JSON 文档非常相似。他称之为 **非简单域nonsimple domains**。这个想法是,一行中的值不一定是一个像数字或字符串一样的原始数据类型,也可以是一个嵌套的关系(表),因此可以把一个任意嵌套的树结构作为一个值,这很像 30 年后添加到 SQL 中的 JSON 或 XML 支持。
## 数据查询语言
当引入关系模型时关系模型包含了一种查询数据的新方法SQL 是一种 **声明式** 查询语言,而 IMS 和 CODASYL 使用 **命令式** 代码来查询数据库。那是什么意思?
许多常用的编程语言是命令式的。例如,给定一个动物物种的列表,返回列表中的鲨鱼可以这样写:
```js
function getSharks() {
var sharks = [];
for (var i = 0; i < animals.length; i++) {
if (animals[i].family === "Sharks") {
sharks.push(animals[i]);
}
}
return sharks;
}
```
而在关系代数中,你可以这样写:
$$
sharks = \sigma_{family = "sharks"}(animals)
$$
其中 $\sigma$(希腊字母西格玛)是选择操作符,只返回符合 `family="shark"` 条件的动物。
定义 SQL 时,它紧密地遵循关系代数的结构:
```sql
SELECT * FROM animals WHERE family ='Sharks';
```
命令式语言告诉计算机以特定顺序执行某些操作。可以想象一下,逐行地遍历代码,评估条件,更新变量,并决定是否再循环一遍。
在声明式查询语言(如 SQL 或关系代数)中,你只需指定所需数据的模式 - 结果必须符合哪些条件,以及如何将数据转换(例如,排序,分组和集合) - 但不是如何实现这一目标。数据库系统的查询优化器决定使用哪些索引和哪些连接方法,以及以何种顺序执行查询的各个部分。
声明式查询语言是迷人的,因为它通常比命令式 API 更加简洁和容易。但更重要的是,它还隐藏了数据库引擎的实现细节,这使得数据库系统可以在无需对查询做任何更改的情况下进行性能提升。
例如,在本节开头所示的命令代码中,动物列表以特定顺序出现。如果数据库想要在后台回收未使用的磁盘空间,则可能需要移动记录,这会改变动物出现的顺序。数据库能否安全地执行,而不会中断查询?
SQL 示例不确保任何特定的顺序因此不在意顺序是否改变。但是如果查询用命令式的代码来写的话那么数据库就永远不可能确定代码是否依赖于排序。SQL 相当有限的功能性为数据库提供了更多自动优化的空间。
最后声明式语言往往适合并行执行。现在CPU 的速度通过核心core的增加变得更快而不是以比以前更高的时钟速度运行【31】。命令代码很难在多个核心和多个机器之间并行化因为它指定了指令必须以特定顺序执行。声明式语言更具有并行执行的潜力因为它们仅指定结果的模式而不指定用于确定结果的算法。在适当情况下数据库可以自由使用查询语言的并行实现【32】。
### Web 上的声明式查询
声明式查询语言的优势不仅限于数据库。为了说明这一点,让我们在一个完全不同的环境中比较声明式和命令式方法:一个 Web 浏览器。
假设你有一个关于海洋动物的网站。用户当前正在查看鲨鱼页面,因此你将当前所选的导航项目 “鲨鱼” 标记为当前选中项目。
```html
<ul>
<li class="selected">
<p>Sharks</p>
<ul>
<li>Great White Shark</li>
<li>Tiger Shark</li>
<li>Hammerhead Shark</li>
</ul>
</li>
<li><p>Whales</p>
<ul>
<li>Blue Whale</li>
<li>Humpback Whale</li>
<li>Fin Whale</li>
</ul>
</li>
</ul>
```
现在想让当前所选页面的标题具有一个蓝色的背景,以便在视觉上突出显示。使用 CSS 实现起来非常简单:
```css
li.selected > p {
background-color: blue;
}
```
这里的 CSS 选择器 `li.selected > p` 声明了我们想要应用蓝色样式的元素的模式:即其直接父元素是具有 CSS 类 `selected``<li>` 元素的所有 `<p>` 元素。示例中的元素 `<p>Sharks</p>` 匹配此模式,但 `<p>Whales</p>` 不匹配,因为其 `<li>` 父元素缺少 `class="selected"`
如果使用 XSL 而不是 CSS你可以做类似的事情
```xml
<xsl:template match="li[@class='selected']/p">
<fo:block background-color="blue">
<xsl:apply-templates/>
</fo:block>
</xsl:template>
```
这里的 XPath 表达式 `li[@class='selected']/p` 相当于上例中的 CSS 选择器 `li.selected > p`。CSS 和 XSL 的共同之处在于,它们都是用于指定文档样式的声明式语言。
想象一下,必须使用命令式方法的情况会是如何。在 Javascript 中,使用 **文档对象模型DOM** API其结果可能如下所示
```js
var liElements = document.getElementsByTagName("li");
for (var i = 0; i < liElements.length; i++) {
if (liElements[i].className === "selected") {
var children = liElements[i].childNodes;
for (var j = 0; j < children.length; j++) {
var child = children[j];
if (child.nodeType === Node.ELEMENT_NODE && child.tagName === "P") {
child.setAttribute("style", "background-color: blue");
}
}
}
}
```
这段 JavaScript 代码命令式地将元素设置为蓝色背景,但是代码看起来很糟糕。不仅比 CSS 和 XSL 等价物更长,更难理解,而且还有一些严重的问题:
* 如果选定的类被移除(例如,因为用户点击了不同的页面),即使代码重新运行,蓝色背景也不会被移除 - 因此该项目将保持突出显示,直到整个页面被重新加载。使用 CSS浏览器会自动检测 `li.selected > p` 规则何时不再适用,并在选定的类被移除后立即移除蓝色背景。
* 如果你想要利用新的 API例如 `document.getElementsByClassName("selected")` 甚至 `document.evaluate()`)来提高性能,则必须重写代码。另一方面,浏览器供应商可以在不破坏兼容性的情况下提高 CSS 和 XPath 的性能。
在 Web 浏览器中,使用声明式 CSS 样式比使用 JavaScript 命令式地操作样式要好得多。类似地,在数据库中,使用像 SQL 这样的声明式查询语言比使用命令式查询 API 要好得多 [^vi]。
[^vi]: IMS 和 CODASYL 都使用命令式 API。应用程序通常使用 COBOL 代码遍历数据库中的记录一次一条记录【2,16】。
### MapReduce查询
MapReduce 是一个由 Google 推广的编程模型用于在多台机器上批量处理大规模的数据【33】。一些 NoSQL 数据存储(包括 MongoDB 和 CouchDB支持有限形式的 MapReduce作为在多个文档中执行只读查询的机制。
关于 MapReduce 更详细的介绍在 [第十章](../ch10.md)。现在我们只简要讨论一下 MongoDB 使用的模型。
MapReduce 既不是一个声明式的查询语言,也不是一个完全命令式的查询 API而是处于两者之间查询的逻辑用代码片段来表示这些代码片段会被处理框架重复性调用。它基于 `map`(也称为 `collect`)和 `reduce`(也称为 `fold``inject`)函数,两个函数存在于许多函数式编程语言中。
最好举例来解释 MapReduce 模型。假设你是一名海洋生物学家,每当你看到海洋中的动物时,你都会在数据库中添加一条观察记录。现在你想生成一个报告,说明你每月看到多少鲨鱼。
在 PostgreSQL 中,你可以像这样表述这个查询:
```sql
SELECT
date_trunc('month', observation_timestamp) AS observation_month,
sum(num_animals) AS total_animals
FROM observations
WHERE family = 'Sharks'
GROUP BY observation_month;
```
`date_trunc('month'timestamp)` 函数用于确定包含 `timestamp` 的日历月份,并返回代表该月份开始的另一个时间戳。换句话说,它将时间戳舍入成最近的月份。
这个查询首先过滤观察记录,以只显示鲨鱼家族的物种,然后根据它们发生的日历月份对观察记录果进行分组,最后将在该月的所有观察记录中看到的动物数目加起来。
同样的查询用 MongoDB 的 MapReduce 功能可以按如下来表述:
```js
db.observations.mapReduce(function map() {
var year = this.observationTimestamp.getFullYear();
var month = this.observationTimestamp.getMonth() + 1;
emit(year + "-" + month, this.numAnimals);
},
function reduce(key, values) {
return Array.sum(values);
},
{
query: {
family: "Sharks"
},
out: "monthlySharkReport"
});
```
* 可以声明式地指定一个只考虑鲨鱼种类的过滤器(这是 MongoDB 特定的 MapReduce 扩展)。
* 每个匹配查询的文档都会调用一次 JavaScript 函数 `map`,将 `this` 设置为文档对象。
* `map` 函数发出一个键(包括年份和月份的字符串,如 `"2013-12"``"2014-1"`)和一个值(该观察记录中的动物数量)。
* `map` 发出的键值对按键来分组。对于具有相同键(即,相同的月份和年份)的所有键值对,调用一次 `reduce` 函数。
* `reduce` 函数将特定月份内所有观测记录中的动物数量相加。
* 将最终的输出写入到 `monthlySharkReport` 集合中。
例如,假设 `observations` 集合包含这两个文档:
```json
{
observationTimestamp: Date.parse( "Mon, 25 Dec 1995 12:34:56 GMT"),
family: "Sharks",
species: "Carcharodon carcharias",
numAnimals: 3
}
{
observationTimestamp: Date.parse("Tue, 12 Dec 1995 16:17:18 GMT"),
family: "Sharks",
species: "Carcharias taurus",
numAnimals: 4
}
```
对每个文档都会调用一次 `map` 函数,结果将是 `emit("1995-12",3)``emit("1995-12",4)`。随后,以 `reduce("1995-12",[3,4])` 调用 `reduce` 函数,将返回 `7`
map 和 reduce 函数在功能上有所限制:它们必须是 **纯** 函数这意味着它们只使用传递给它们的数据作为输入它们不能执行额外的数据库查询也不能有任何副作用。这些限制允许数据库以任何顺序运行任何功能并在失败时重新运行它们。然而map 和 reduce 函数仍然是强大的:它们可以解析字符串、调用库函数、执行计算等等。
MapReduce 是一个相当底层的编程模型,用于计算机集群上的分布式执行。像 SQL 这样的更高级的查询语言可以用一系列的 MapReduce 操作来实现(见 [第十章](../ch10.md)),但是也有很多不使用 MapReduce 的分布式 SQL 实现。須注意SQL 并没有限制它只能在单一机器上运行,而 MapReduce 也并没有垄断所有的分布式查询执行。
能够在查询中使用 JavaScript 代码是高级查询的一个重要特性,但这不限于 MapReduce一些 SQL 数据库也可以用 JavaScript 函数进行扩展【34】。
MapReduce 的一个可用性问题是,必须编写两个密切合作的 JavaScript 函数这通常比编写单个查询更困难。此外声明式查询语言为查询优化器提供了更多机会来提高查询的性能。基于这些原因MongoDB 2.2 添加了一种叫做 **聚合管道** 的声明式查询语言的支持【9】。用这种语言表述鲨鱼计数查询如下所示
```js
db.observations.aggregate([
{ $match: { family: "Sharks" } },
{ $group: {
_id: {
year: { $year: "$observationTimestamp" },
month: { $month: "$observationTimestamp" }
},
totalAnimals: { $sum: "$numAnimals" } }}
]);
```
聚合管道语言的表现力与(前述 PostgreSQL 例子的SQL 子集相当,但是它使用基于 JSON 的语法而不是 SQL 那种接近英文句式的语法这种差异也许只是口味问题。这个故事的寓意是NoSQL 系统可能会意外发现自己只是重新发明了一套经过乔装改扮的 SQL。
## 图数据模型
如我们之前所见,多对多关系是不同数据模型之间具有区别性的重要特征。如果你的应用程序大多数的关系是一对多关系(树状结构化数据),或者大多数记录之间不存在关系,那么使用文档模型是合适的。
但是,要是多对多关系在你的数据中很常见呢?关系模型可以处理多对多关系的简单情况,但是随着数据之间的连接变得更加复杂,将数据建模为图形显得更加自然。
一个图由两种对象组成:**顶点**vertices也称为 **节点**,即 nodes**实体**,即 entities**边**edges也称为 **关系**,即 relationships**弧**,即 arcs。多种数据可以被建模为一个图形。典型的例子包括
* 社交图谱
顶点是人,边指示哪些人彼此认识。
* 网络图谱
顶点是网页,边缘表示指向其他页面的 HTML 链接。
* 公路或铁路网络
顶点是交叉路口,边线代表它们之间的道路或铁路线。
可以将那些众所周知的算法运用到这些图上例如汽车导航系统搜索道路网络中两点之间的最短路径PageRank 可以用在网络图上来确定网页的流行程度,从而确定该网页在搜索结果中的排名。
在刚刚给出的例子中图中的所有顶点代表了相同类型的事物人、网页或交叉路口。不过图并不局限于这样的同类数据同样强大地是图提供了一种一致的方式用来在单个数据存储中存储完全不同类型的对象。例如Facebook 维护一个包含许多不同类型的顶点和边的单个图顶点表示人、地点、事件、签到和用户的评论边表示哪些人是好友、签到发生在哪里、谁评论了什么帖子、谁参与了什么事件等等【35】。
在本节中,我们将使用 [图 2-5](img//fig2-5.png) 所示的示例。它可以从社交网络或系谱数据库中获得:它显示了两个人,来自爱达荷州的 Lucy 和来自法国 Beaune 的 Alain。他们已婚住在伦敦。
![](img//fig2-5.png)
**图 2-5 图数据结构示例(框代表顶点,箭头代表边)**
有几种不同但相关的方法用来构建和查询图表中的数据。在本节中,我们将讨论属性图模型(由 Neo4jTitan 和 InfiniteGraph 实现和三元组存储triple-store模型由 Datomic、AllegroGraph 等实现。我们将查看图的三种声明式查询语言CypherSPARQL 和 Datalog。除此之外还有像 Gremlin 【36】这样的图形查询语言和像 Pregel 这样的图形处理框架(见 [第十章](../ch10.md))。
### 属性图
在属性图模型中每个顶点vertex包括
* 唯一的标识符
* 一组出边outgoing edges
* 一组入边ingoing edges
* 一组属性(键值对)
每条边edge包括
* 唯一标识符
* 边的起点(**尾部顶点**,即 tail vertex
* 边的终点(**头部顶点**,即 head vertex
* 描述两个顶点之间关系类型的标签
* 一组属性(键值对)
可以将图存储看作由两个关系表组成:一个存储顶点,另一个存储边,如 [例 2-2]() 所示(该模式使用 PostgreSQL JSON 数据类型来存储每个顶点或每条边的属性)。头部和尾部顶点用来存储每条边;如果你想要一组顶点的输入或输出边,你可以分别通过 `head_vertex``tail_vertex` 来查询 `edges` 表。
**例 2-2 使用关系模式来表示属性图**
```sql
CREATE TABLE vertices (
vertex_id INTEGER PRIMARY KEY,
properties JSON
);
CREATE TABLE edges (
edge_id INTEGER PRIMARY KEY,
tail_vertex INTEGER REFERENCES vertices (vertex_id),
head_vertex INTEGER REFERENCES vertices (vertex_id),
label TEXT,
properties JSON
);
CREATE INDEX edges_tails ON edges (tail_vertex);
CREATE INDEX edges_heads ON edges (head_vertex);
```
关于这个模型的一些重要方面是:
1. 任何顶点都可以有一条边连接到任何其他顶点。没有模式限制哪种事物可不可以关联。
2. 给定任何顶点,可以高效地找到它的入边和出边,从而遍历图,即沿着一系列顶点的路径前后移动(这就是为什么 [例 2-2]() 在 `tail_vertex``head_vertex` 列上都有索引的原因)。
3. 通过对不同类型的关系使用不同的标签,可以在一个图中存储几种不同的信息,同时仍然保持一个清晰的数据模型。
这些特性为数据建模提供了很大的灵活性,如 [图 2-5](img//fig2-5.png) 所示。图中显示了一些传统关系模式难以表达的事情例如不同国家的不同地区结构法国有省和大区美国有县和州国中国的怪事先忽略主权国家和民族错综复杂的烂摊子不同的数据粒度Lucy 现在的住所记录具体到城市,而她的出生地点只是在一个州的级别)。
你可以想象该图还能延伸出许多关于 Lucy 和 Alain 的事实,或其他人的其他更多的事实。例如,你可以用它来表示食物过敏(为每个过敏源增加一个顶点,并增加人与过敏源之间的一条边来指示一种过敏情况),并链接到过敏源,每个过敏源具有一组顶点用来显示哪些食物含有哪些物质。然后,你可以写一个查询,找出每个人吃什么是安全的。图在可演化性方面是富有优势的:当你向应用程序添加功能时,可以轻松扩展图以适应程序数据结构的变化。
### Cypher 查询语言
Cypher 是属性图的声明式查询语言,为 Neo4j 图形数据库而发明【37】它是以电影 “黑客帝国” 中的一个角色来命名的而与密码学中的加密算法无关【38】
[例 2-3]() 显示了将 [图 2-5](img//fig2-5.png) 的左边部分插入图形数据库的 Cypher 查询。你可以以类似的方式把图的剩余部分添加进去,但这里为了文章可閱读性而省略这部分的示例。每个顶点都有一个像 `USA``Idaho` 这样的符号名称,查询的其他部分可以使用这些名称在顶点之间创建边,使用箭头符号:`Idaho - [WITHIN] ->USA` 创建一条标记为 `WITHIN` 的边,`Idaho` 为尾节点,`USA` 为头节点。
**例 2-3 将图 2-5 中的数据子集表示为 Cypher 查询**
```cypher
CREATE
(NAmerica:Location {name:'North America', type:'continent'}),
(USA:Location {name:'United States', type:'country' }),
(Idaho:Location {name:'Idaho', type:'state' }),
(Lucy:Person {name:'Lucy' }),
(Idaho) -[:WITHIN]-> (USA) -[:WITHIN]-> (NAmerica),
(Lucy) -[:BORN_IN]-> (Idaho)
```
当 [图 2-5](img//fig2-5.png) 的所有顶点和边被添加到数据库后,让我们提些有趣的问题:例如,找到所有从美国移民到欧洲的人的名字。更确切地说,这里我们想要找到符合下面条件的所有顶点,并且返回这些顶点的 `name` 属性:该顶点拥有一条连到美国任一位置的 `BORN_IN` 边,和一条连到欧洲的任一位置的 `LIVING_IN` 边。
[例 2-4]() 展示了如何在 Cypher 中表达这个查询。在 MATCH 子句中使用相同的箭头符号来查找图中的模式:`(person) -[:BORN_IN]-> ()` 可以匹配 `BORN_IN` 边的任意两个顶点。该边的尾节点被绑定了变量 `person`,头节点则未被绑定。
**例 2-4 查找所有从美国移民到欧洲的人的 Cypher 查询:**
```cypher
MATCH
(person) -[:BORN_IN]-> () -[:WITHIN*0..]-> (us:Location {name:'United States'}),
(person) -[:LIVES_IN]-> () -[:WITHIN*0..]-> (eu:Location {name:'Europe'})
RETURN person.name
```
查询按如下来解读:
> 找到满足以下两个条件的所有顶点(称之为 person 顶点):
> 1. `person` 顶点拥有一条到某个顶点的 `BORN_IN` 出边。从那个顶点开始,沿着一系列 `WITHIN` 出边最终到达一个类型为 `Location``name` 属性为 `United States` 的顶点。
>
> 2. `person` 顶点还拥有一条 `LIVES_IN` 出边。沿着这条边,可以通过一系列 `WITHIN` 出边最终到达一个类型为 `Location``name` 属性为 `Europe` 的顶点。
>
> 对于这样的 `Person` 顶点,返回其 `name` 属性。
执行这条查询可能会有几种可行的查询路径。这里给出的描述建议首先扫描数据库中的所有人,检查每个人的出生地和居住地,然后只返回符合条件的那些人。
等价地,也可以从两个 `Location` 顶点开始反向地查找。假如 `name` 属性上有索引,则可以高效地找到代表美国和欧洲的两个顶点。然后,沿着所有 `WITHIN` 入边,可以继续查找出所有在美国和欧洲的位置(州、地区、城市等)。最后,查找出那些可以由 `BORN_IN``LIVES_IN` 入边到那些位置顶点的人。
通常对于声明式查询语言来说,在编写查询语句时,不需要指定执行细节:查询优化程序会自动选择预测效率最高的策略,因此你可以专注于编写应用程序的其他部分。
### SQL 中的图查询
[例 2-2]() 指出,可以在关系数据库中表示图数据。但是,如果图数据已经以关系结构存储,我们是否也可以使用 SQL 查询它?
答案是肯定的,但有些困难。在关系数据库中,你通常会事先知道在查询中需要哪些连接。在图查询中,你可能需要在找到待查找的顶点之前,遍历可变数量的边。也就是说,连接的数量事先并不确定。
在我们的例子中,这发生在 Cypher 查询中的 `() -[:WITHIN*0..]-> ()` 规则中。一个人的 `LIVES_IN` 边可以指向任何类型的位置街道、城市、地区、国家等。一个城市可以在WITHIN一个地区内一个地区可以在WITHIN在一个州内一个州可以在WITHIN一个国家内等等。`LIVES_IN` 边可以直接指向正在查找的位置,或者一个在位置层次结构中隔了数层的位置。
在 Cypher 中,用 `WITHIN*0..` 非常简洁地表述了上述事实:“沿着 `WITHIN` 边,零次或多次”。它很像正则表达式中的 `*` 运算符。
自 SQL:1999查询可变长度遍历路径的思想可以使用称为 **递归公用表表达式**`WITH RECURSIVE` 语法)的东西来表示。[例 2-5]() 显示了同样的查询 - 查找从美国移民到欧洲的人的姓名 - 在 SQL 使用这种技术PostgreSQL、IBM DB2、Oracle 和 SQL Server 均支持)来表述。但是,与 Cypher 相比,其语法非常笨拙。
**例 2-5 与示例 2-4 同样的查询,在 SQL 中使用递归公用表表达式表示**
```sql
WITH RECURSIVE
-- in_usa 包含所有的美国境内的位置 ID
in_usa(vertex_id) AS (
SELECT vertex_id FROM vertices WHERE properties ->> 'name' = 'United States'
UNION
SELECT edges.tail_vertex FROM edges
JOIN in_usa ON edges.head_vertex = in_usa.vertex_id
WHERE edges.label = 'within'
),
-- in_europe 包含所有的欧洲境内的位置 ID
in_europe(vertex_id) AS (
SELECT vertex_id FROM vertices WHERE properties ->> 'name' = 'Europe'
UNION
SELECT edges.tail_vertex FROM edges
JOIN in_europe ON edges.head_vertex = in_europe.vertex_id
WHERE edges.label = 'within' ),
-- born_in_usa 包含了所有类型为 Person且出生在美国的顶点
born_in_usa(vertex_id) AS (
SELECT edges.tail_vertex FROM edges
JOIN in_usa ON edges.head_vertex = in_usa.vertex_id
WHERE edges.label = 'born_in' ),
-- lives_in_europe 包含了所有类型为 Person且居住在欧洲的顶点。
lives_in_europe(vertex_id) AS (
SELECT edges.tail_vertex FROM edges
JOIN in_europe ON edges.head_vertex = in_europe.vertex_id
WHERE edges.label = 'lives_in')
SELECT vertices.properties ->> 'name'
FROM vertices
JOIN born_in_usa ON vertices.vertex_id = born_in_usa.vertex_id
JOIN lives_in_europe ON vertices.vertex_id = lives_in_europe.vertex_id;
```
* 首先,查找 `name` 属性为 `United States` 的顶点,将其作为 `in_usa` 顶点的集合的第一个元素。
* 从 `in_usa` 集合的顶点出发,沿着所有的 `with_in` 入边,将其尾顶点加入同一集合,不断递归直到所有 `with_in` 入边都被访问完毕。
* 同理,从 `name` 属性为 `Europe` 的顶点出发,建立 `in_europe` 顶点的集合。
* 对于 `in_usa` 集合中的每个顶点,根据 `born_in` 入边来查找出生在美国某个地方的人。
* 同样,对于 `in_europe` 集合中的每个顶点,根据 `lives_in` 入边来查找居住在欧洲的人。
* 最后,把在美国出生的人的集合与在欧洲居住的人的集合相交。
同一个查询,用某一个查询语言可以写成 4 行,而用另一个查询语言需要 29 行,这恰恰说明了不同的数据模型是为不同的应用场景而设计的。选择适合应用程序的数据模型非常重要。
### 三元组存储和 SPARQL
三元组存储模式大体上与属性图模型相同,用不同的词来描述相同的想法。不过仍然值得讨论,因为三元组存储有很多现成的工具和语言,这些工具和语言对于构建应用程序的工具箱可能是宝贵的补充。
在三元组存储中,所有信息都以非常简单的三部分表示形式存储(**主语****谓语****宾语**)。例如,三元组 **(吉姆, 喜欢, 香蕉)** 中,**吉姆** 是主语,**喜欢** 是谓语(动词),**香蕉** 是对象。
三元组的主语相当于图中的一个顶点。而宾语是下面两者之一:
1. 原始数据类型中的值,例如字符串或数字。在这种情况下,三元组的谓语和宾语相当于主语顶点上的属性的键和值。例如,`(lucy, age, 33)` 就像属性 `{“age”33}` 的顶点 lucy。
2. 图中的另一个顶点。在这种情况下,谓语是图中的一条边,主语是其尾部顶点,而宾语是其头部顶点。例如,在 `(lucy, marriedTo, alain)` 中主语和宾语 `lucy``alain` 都是顶点,并且谓语 `marriedTo` 是连接他们的边的标签。
[例 2-6]() 展示了与 [例 2-3]() 相同的数据,以称为 Turtle 的格式Notation3N3【39】的一个子集写成三元组。
**例 2-6 图 2-5 中的数据子集,表示为 Turtle 三元组**
```reStructuredText
@prefix : <urn:example:>.
_:lucy a :Person.
_:lucy :name "Lucy".
_:lucy :bornIn _:idaho.
_:idaho a :Location.
_:idaho :name "Idaho".
_:idaho :type "state".
_:idaho :within _:usa.
_:usa a :Location
_:usa :name "United States"
_:usa :type "country".
_:usa :within _:namerica.
_:namerica a :Location
_:namerica :name "North America"
_:namerica :type :"continent"
```
在这个例子中,图的顶点被写为:`_someName`。这个名字并不意味着这个文件以外的任何东西。它的存在只是帮助我们明确哪些三元组引用了同一顶点。当谓语表示边时,该宾语是一个顶点,如 `_:idaho :within _:usa.`。当谓语是一个属性时,该宾语是一个字符串,如 `_:usa :name"United States"`
一遍又一遍地重复相同的主语看起来相当重复,但幸运的是,可以使用分号来说明关于同一主语的多个事情。这使得 Turtle 格式相当不错,可读性强:请参阅 [例 2-7]()。
**例 2-7 一种相对例 2-6 写入数据的更为简洁的方法。**
```
@prefix : <urn:example:>.
_:lucy a :Person; :name "Lucy"; :bornIn _:idaho.
_:idaho a :Location; :name "Idaho"; :type "state"; :within _:usa
_:usa a :Loaction; :name "United States"; :type "country"; :within _:namerica.
_:namerica a :Location; :name "North America"; :type "continent".
```
#### 语义网
如果你深入了解关于三元组存储的信息,可能会陷入关于**语义网**的讨论漩涡中。三元组存储模型其实是完全独立于语义网存在的例如Datomic【40】作为一种三元组存储数据库 [^vii],从未被用于语义网中。但是,由于在很多人眼中这两者紧密相连,我们应该简要地讨论一下。
[^vii]: 从技术上讲Datomic 使用的是五元组而不是三元组,两个额外的字段是用于版本控制的元数据
从本质上讲,语义网是一个简单且合理的想法:网站已经将信息发布为文字和图片供人类阅读,为什么不将信息作为机器可读的数据也发布给计算机呢?(基于三元组模型的)**资源描述框架****RDF**【41】被用作不同网站以统一的格式发布数据的一种机制允许来自不同网站的数据自动合并成 **一个数据网络** —— 成为一种互联网范围内的 “通用语义网数据库”。
不幸的是,语义网在二十一世纪初被过度炒作,但到目前为止没有任何迹象表明已在实践中应用,这使得许多人嗤之以鼻。它还饱受眼花缭乱的缩略词、过于复杂的标准提案和狂妄自大的困扰。
然而,如果从过去的失败中汲取教训,语义网项目还是拥有很多优秀的成果。即使你没有兴趣在语义网上发布 RDF 数据,三元组这种模型也是一种好的应用程序内部数据模型。
#### RDF 数据模型
[例 2-7]() 中使用的 Turtle 语言是一种用于 RDF 数据的人类可读格式。有时候RDF 也可以以 XML 格式编写,不过完成同样的事情会相对啰嗦,请参阅 [例 2-8]()。Turtle/N3 是更可取的,因为它更容易阅读,像 Apache Jena 【42】这样的工具可以根据需要在不同的 RDF 格式之间进行自动转换。
**例 2-8 用 RDF/XML 语法表示例 2-7 的数据**
```xml
<rdf:RDF xmlns="urn:example:"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<Location rdf:nodeID="idaho">
<name>Idaho</name>
<type>state</type>
<within>
<Location rdf:nodeID="usa">
<name>United States</name>
<type>country</type>
<within>
<Location rdf:nodeID="namerica">
<name>North America</name>
<type>continent</type>
</Location>
</within>
</Location>
</within>
</Location>
<Person rdf:nodeID="lucy">
<name>Lucy</name>
<bornIn rdf:nodeID="idaho"/>
</Person>
</rdf:RDF>
```
RDF 有一些奇怪之处,因为它是为了在互联网上交换数据而设计的。三元组的主语,谓语和宾语通常是 URI。例如谓语可能是一个 URI`<http://my-company.com/namespace#within>``<http://my-company.com/namespace#lives_in>`,而不仅仅是 `WITHIN``LIVES_IN`。这个设计背后的原因为了让你能够把你的数据和其他人的数据结合起来,如果他们赋予单词 `within` 或者 `lives_in` 不同的含义,两者也不会冲突,因为它们的谓语实际上是 `<http://other.org/foo#within>``<http://other.org/foo#lives_in>`
从 RDF 的角度来看URL `<http://my-company.com/namespace>` 不一定需要能解析成什么东西,它只是一个命名空间。为避免与 `http://URL` 混淆,本节中的示例使用不可解析的 URI`urnexamplewithin`。幸运的是,你只需在文件顶部对这个前缀做一次声明,后续就不用再管了。
### SPARQL 查询语言
**SPARQL** 是一种用于三元组存储的面向 RDF 数据模型的查询语言【43】它是 SPARQL 协议和 RDF 查询语言的缩写,发音为 “sparkle”。SPARQL 早于 Cypher并且由于 Cypher 的模式匹配借鉴于 SPARQL这使得它们看起来非常相似【37】。
与之前相同的查询 —— 查找从美国移民到欧洲的人 —— 使用 SPARQL 比使用 Cypher 甚至更为简洁(请参阅 [例 2-9]())。
**例 2-9 与示例 2-4 相同的查询,用 SPARQL 表示**
```sparql
PREFIX : <urn:example:>
SELECT ?personName WHERE {
?person :name ?personName.
?person :bornIn / :within* / :name "United States".
?person :livesIn / :within* / :name "Europe".
}
```
结构非常相似。以下两个表达式是等价的SPARQL 中的变量以问号开头):
```
(person) -[:BORN_IN]-> () -[:WITHIN*0..]-> (location) # Cypher
?person :bornIn / :within* ?location. # SPARQL
```
因为 RDF 不区分属性和边,而只是将它们作为谓语,所以可以使用相同的语法来匹配属性。在下面的表达式中,变量 `usa` 被绑定到任意 `name` 属性为字符串值 `"United States"` 的顶点:
```
(usa {name:'United States'}) # Cypher
?usa :name "United States". # SPARQL
```
SPARQL 是一种很好的查询语言 —— 尽管它构想的语义网从未实现,但它仍然是一种可用于应用程序内部的强大工具。
> #### 图形数据库与网状模型相比较
>
> 在 “[文档数据库是否在重蹈覆辙?](#文档数据库是否在重蹈覆辙?)” 中,我们讨论了 CODASYL 和关系模型如何竞相解决 IMS 中的多对多关系问题。乍一看CODASYL 的网状模型看起来与图模型相似。CODASYL 是否是图形数据库的第二个变种?
>
> 不,他们在几个重要方面有所不同:
>
> * 在 CODASYL 中,数据库有一个模式,用于指定哪种记录类型可以嵌套在其他记录类型中。在图形数据库中,不存在这样的限制:任何顶点都可以具有到其他任何顶点的边。这为应用程序适应不断变化的需求提供了更大的灵活性。
> * 在 CODASYL 中,达到特定记录的唯一方法是遍历其中的一个访问路径。在图形数据库中,可以通过其唯一 ID 直接引用任何顶点,也可以使用索引来查找具有特定值的顶点。
> * 在 CODASYL 中,记录的子项目是一个有序集合,所以数据库必须去管理它们的次序(这会影响存储布局),并且应用程序在插入新记录到数据库时必须关注新记录在这些集合中的位置。在图形数据库中,顶点和边是无序的(只能在查询时对结果进行排序)。
> * 在 CODASYL 中,所有查询都是命令式的,难以编写,并且很容易因架构变化而受到破坏。在图形数据库中,你可以在命令式代码中手写遍历过程,但大多数图形数据库都支持高级声明式查询,如 Cypher 或 SPARQL。
>
>
### 基础Datalog
**Datalog** 是比 SPARQL、Cypher 更古老的语言,在 20 世纪 80 年代被学者广泛研究【44,45,46】。它在软件工程师中不太知名但是它是重要的因为它为以后的查询语言提供了基础。
实践中Datalog 在有限的几个数据系统中使用:例如,它是 Datomic 【40】的查询语言Cascalog 【47】是一种用于查询 Hadoop 大数据集的 Datalog 实现 [^viii]。
[^viii]: Datomic 和 Cascalog 使用 Datalog 的 Clojure S 表达式语法。在下面的例子中使用了一个更容易阅读的 Prolog 语法,但两者没有任何功能差异。
Datalog 的数据模型类似于三元组模式,但进行了一点泛化。把三元组写成 **谓语****主语,宾语**),而不是写三元语(**主语,谓语,宾语**)。[例 2-10]() 显示了如何用 Datalog 写入我们的例子中的数据。
**例 2-10 用 Datalog 来表示图 2-5 中的数据子集**
```prolog
name(namerica, 'North America').
type(namerica, continent).
name(usa, 'United States').
type(usa, country).
within(usa, namerica).
name(idaho, 'Idaho').
type(idaho, state).
within(idaho, usa).
name(lucy, 'Lucy').
born_in(lucy, idaho).
```
既然已经定义了数据,我们可以像之前一样编写相同的查询,如 [例 2-11]() 所示。它看起来与 Cypher 或 SPARQL 的语法差异较大但请不要抗拒它。Datalog 是 Prolog 的一个子集,如果你是计算机科学专业的学生,可能已经见过 Prolog。
**例 2-11 与示例 2-4 相同的查询,用 Datalog 表示**
```
within_recursive(Location, Name) :- name(Location, Name). /* Rule 1 */
within_recursive(Location, Name) :- within(Location, Via), /* Rule 2 */
within_recursive(Via, Name).
migrated(Name, BornIn, LivingIn) :- name(Person, Name), /* Rule 3 */
born_in(Person, BornLoc),
within_recursive(BornLoc, BornIn),
lives_in(Person, LivingLoc),
within_recursive(LivingLoc, LivingIn).
?- migrated(Who, 'United States', 'Europe'). /* Who = 'Lucy'. */
```
Cypher 和 SPARQL 使用 SELECT 立即跳转,但是 Datalog 一次只进行一小步。我们定义 **规则**,以将新谓语告诉数据库:在这里,我们定义了两个新的谓语,`within_recursive` 和 `migrated`。这些谓语不是存储在数据库中的三元组中,而是从数据或其他规则派生而来的。规则可以引用其他规则,就像函数可以调用其他函数或者递归地调用自己一样。像这样,复杂的查询可以借由小的砖瓦构建起来。
在规则中,以大写字母开头的单词是变量,谓语则用 Cypher 和 SPARQL 的方式一样来匹配。例如,`name(Location, Name)` 通过变量绑定 `Location = namerica``Name ='North America'` 可以匹配三元组 `name(namerica, 'North America')`
要是系统可以在 `:-` 操作符的右侧找到与所有谓语的一个匹配,就运用该规则。当规则运用时,就好像通过 `:-` 的左侧将其添加到数据库(将变量替换成它们匹配的值)。
因此,一种可能的应用规则的方式是:
1. 数据库存在 `name (namerica, 'North America')`,故运用规则 1。它生成 `within_recursive (namerica, 'North America')`
2. 数据库存在 `within (usa, namerica)`,在上一步骤中生成 `within_recursive (namerica, 'North America')`,故运用规则 2。它会产生 `within_recursive (usa, 'North America')`
3. 数据库存在 `within (idaho, usa)`,在上一步生成 `within_recursive (usa, 'North America')`,故运用规则 2。它产生 `within_recursive (idaho, 'North America')`
通过重复应用规则 1 和 2`within_recursive` 谓语可以告诉我们在数据库中包含北美(或任何其他位置名称)的所有位置。这个过程如 [图 2-6](img//fig2-6.png) 所示。
![](img//fig2-6.png)
**图 2-6 使用示例 2-11 中的 Datalog 规则来确定爱达荷州在北美。**
现在规则 3 可以找到出生在某个地方 `BornIn` 的人,并住在某个地方 `LivingIn`。通过查询 `BornIn ='United States'``LivingIn ='Europe'`,并将此人作为变量 `Who`,让 Datalog 系统找出变量 `Who` 会出现哪些值。因此,最后得到了与早先的 Cypher 和 SPARQL 查询相同的答案。
相对于本章讨论的其他查询语言,我们需要采取不同的思维方式来思考 Datalog 方法,但这是一种非常强大的方法,因为规则可以在不同的查询中进行组合和重用。虽然对于简单的一次性查询,显得不太方便,但是它可以更好地处理数据很复杂的情况。
## 本章小结
数据模型是一个巨大的课题,在本章中,我们快速浏览了各种不同的模型。我们没有足够的篇幅来详述每个模型的细节,但是希望这个概述足以激起你的兴趣,以更多地了解最适合你的应用需求的模型。
在历史上,数据最开始被表示为一棵大树(层次数据模型),但是这不利于表示多对多的关系,所以发明了关系模型来解决这个问题。最近,开发人员发现一些应用程序也不适合采用关系模型。新的非关系型 “NoSQL” 数据存储分化为两个主要方向:
1. **文档数据库** 主要关注自我包含的数据文档,而且文档之间的关系非常稀少。
2. **图形数据库** 用于相反的场景:任意事物之间都可能存在潜在的关联。
这三种模型(文档,关系和图形)在今天都被广泛使用,并且在各自的领域都发挥很好。一个模型可以用另一个模型来模拟 —— 例如,图数据可以在关系数据库中表示 —— 但结果往往是糟糕的。这就是为什么我们有着针对不同目的的不同系统,而不是一个单一的万能解决方案。
文档数据库和图数据库有一个共同点,那就是它们通常不会将存储的数据强制约束为特定模式,这可以使应用程序更容易适应不断变化的需求。但是应用程序很可能仍会假定数据具有一定的结构;区别仅在于模式是**明确的**(写入时强制)还是**隐含的**(读取时处理)。
每个数据模型都具有各自的查询语言或框架我们讨论了几个例子SQLMapReduceMongoDB 的聚合管道CypherSPARQL 和 Datalog。我们也谈到了 CSS 和 XSL/XPath它们不是数据库查询语言而包含有趣的相似之处。
虽然我们已经覆盖了很多层面,但仍然有许多数据模型没有提到。举几个简单的例子:
* 使用基因组数据的研究人员通常需要执行 **序列相似性搜索**,这意味着需要一个很长的字符串(代表一个 DNA 序列),并在一个拥有类似但不完全相同的字符串的大型数据库中寻找匹配。这里所描述的数据库都不能处理这种用法,这就是为什么研究人员编写了像 GenBank 这样的专门的基因组数据库软件的原因【48】。
* 粒子物理学家数十年来一直在进行大数据类型的大规模数据分析像大型强子对撞机LHC这样的项目现在会处理数百 PB 的数据在这样的规模下需要定制解决方案来阻止硬件成本的失控【49】。
* **全文搜索** 可以说是一种经常与数据库一起使用的数据模型。信息检索是一个很大的专业课题,我们不会在本书中详细介绍,但是我们将在第三章和第三部分中介绍搜索索引。
让我们暂时将其放在一边。在 [下一章](../ch3.md) 中,我们将讨论在 **实现** 本章描述的数据模型时会遇到的一些权衡。
## 参考文献
1. Edgar F. Codd: “[A Relational Model of Data for Large Shared Data Banks](https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf),” *Communications of the ACM*, volume 13, number 6, pages 377387, June 1970. [doi:10.1145/362384.362685](http://dx.doi.org/10.1145/362384.362685)
1. Michael Stonebraker and Joseph M. Hellerstein: “[What Goes Around Comes Around](http://mitpress2.mit.edu/books/chapters/0262693143chapm1.pdf),” in *Readings in Database Systems*, 4th edition, MIT Press, pages 241, 2005. ISBN: 978-0-262-69314-1
1. Pramod J. Sadalage and Martin Fowler: *NoSQL Distilled*. Addison-Wesley, August 2012. ISBN: 978-0-321-82662-6
1. Eric Evans: “[NoSQL: What's in a Name?](http://blog.sym-link.com/2009/10/30/nosql_whats_in_a_name.html),” *blog.sym-link.com*, October 30, 2009.
1. James Phillips: “[Surprises in Our NoSQL Adoption Survey](http://blog.couchbase.com/nosql-adoption-survey-surprises),” *blog.couchbase.com*, February 8, 2012.
1. Michael Wagner: *SQL/XML:2006 Evaluierung der Standardkonformität ausgewählter Datenbanksysteme*. Diplomica Verlag, Hamburg, 2010. ISBN: 978-3-836-64609-3
1. “[XML Data in SQL Server](http://technet.microsoft.com/en-us/library/bb522446.aspx),” SQL Server 2012 documentation, *technet.microsoft.com*, 2013.
1. “[PostgreSQL 9.3.1 Documentation](http://www.postgresql.org/docs/9.3/static/index.html),” The PostgreSQL Global Development Group, 2013.
1. “[The MongoDB 2.4 Manual](http://docs.mongodb.org/manual/),” MongoDB, Inc., 2013.
1. “[RethinkDB 1.11 Documentation](http://www.rethinkdb.com/docs/),” *rethinkdb.com*, 2013.
1. “[Apache CouchDB 1.6 Documentation](http://docs.couchdb.org/en/latest/),” *docs.couchdb.org*, 2014.
1. Lin Qiao, Kapil Surlaker, Shirshanka Das, et al.: “[On Brewing Fresh Espresso: LinkedIns Distributed Data Serving Platform](http://www.slideshare.net/amywtang/espresso-20952131),” at *ACM International Conference on Management of Data* (SIGMOD), June 2013.
1. Rick Long, Mark Harrington, Robert Hain, and Geoff Nicholls: <a href="http://www.redbooks.ibm.com/redbooks/pdfs/sg245352.pdf">*IMS Primer*</a>. IBM Redbook SG24-5352-00, IBM International Technical Support Organization, January 2000.
1. Stephen D. Bartlett: “[IBMs IMS—Myths, Realities, and Opportunities](ftp://public.dhe.ibm.com/software/data/ims/pdf/TCG2013015LI.pdf),” The Clipper Group Navigator, TCG2013015LI, July 2013.
1. Sarah Mei: “[Why You Should Never Use MongoDB](http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/),” *sarahmei.com*, November 11, 2013.
1. J. S. Knowles and D. M. R. Bell: “The CODASYL Model,” in *Databases—Role and Structure: An Advanced Course*, edited by P. M. Stocker, P. M. D. Gray, and M. P. Atkinson, pages 1956, Cambridge University Press, 1984. ISBN: 978-0-521-25430-4
1. Charles W. Bachman: “[The Programmer as Navigator](http://dl.acm.org/citation.cfm?id=362534),” *Communications of the ACM*, volume 16, number 11, pages 653658, November 1973. [doi:10.1145/355611.362534](http://dx.doi.org/10.1145/355611.362534)
1. Joseph M. Hellerstein, Michael Stonebraker, and James Hamilton: “[Architecture of a Database System](http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf),” *Foundations and Trends in Databases*, volume 1, number 2, pages 141259, November 2007. [doi:10.1561/1900000002](http://dx.doi.org/10.1561/1900000002)
1. Sandeep Parikh and Kelly Stirman: “[Schema Design for Time Series Data in MongoDB](http://blog.mongodb.org/post/65517193370/schema-design-for-time-series-data-in-mongodb),” *blog.mongodb.org*, October 30, 2013.
1. Martin Fowler: “[Schemaless Data Structures](http://martinfowler.com/articles/schemaless/),” *martinfowler.com*, January 7, 2013.
1. Amr Awadallah: “[Schema-on-Read vs. Schema-on-Write](http://www.slideshare.net/awadallah/schemaonread-vs-schemaonwrite),” at *Berkeley EECS RAD Lab Retreat*, Santa Cruz, CA, May 2009.
1. Martin Odersky: “[The Trouble with Types](http://www.infoq.com/presentations/data-types-issues),” at *Strange Loop*, September 2013.
1. Conrad Irwin: “[MongoDB—Confessions of a PostgreSQL Lover](https://speakerdeck.com/conradirwin/mongodb-confessions-of-a-postgresql-lover),” at *HTML5DevConf*, October 2013.
1. “[Percona Toolkit Documentation: pt-online-schema-change](http://www.percona.com/doc/percona-toolkit/2.2/pt-online-schema-change.html),” Percona Ireland Ltd., 2013.
1. Rany Keddo, Tobias Bielohlawek, and Tobias Schmidt: “[Large Hadron Migrator](https://github.com/soundcloud/lhm),” SoundCloud, 2013.
1. Shlomi Noach: “[gh-ost: GitHub's Online Schema Migration Tool for MySQL](http://githubengineering.com/gh-ost-github-s-online-migration-tool-for-mysql/),” *githubengineering.com*, August 1, 2016.
1. James C. Corbett, Jeffrey Dean, Michael Epstein, et al.: “[Spanner: Googles Globally-Distributed Database](http://research.google.com/archive/spanner.html),” at *10th USENIX Symposium on Operating System Design and Implementation* (OSDI), October 2012.
1. Donald K. Burleson: “[Reduce I/O with Oracle Cluster Tables](http://www.dba-oracle.com/oracle_tip_hash_index_cluster_table.htm),” *dba-oracle.com*.
1. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, et al.: “[Bigtable: A Distributed Storage System for Structured Data](http://research.google.com/archive/bigtable.html),” at *7th USENIX Symposium on Operating System Design and Implementation* (OSDI), November 2006.
1. Bobbie J. Cochrane and Kathy A. McKnight: “[DB2 JSON Capabilities, Part 1: Introduction to DB2 JSON](http://www.ibm.com/developerworks/data/library/techarticle/dm-1306nosqlforjson1/),” IBM developerWorks, June 20, 2013.
1. Herb Sutter: “[The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software](http://www.gotw.ca/publications/concurrency-ddj.htm),” *Dr. Dobb's Journal*, volume 30, number 3, pages 202-210, March 2005.
1. Joseph M. Hellerstein: “[The Declarative Imperative: Experiences and Conjectures in Distributed Logic](http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-90.pdf),” Electrical Engineering and Computer Sciences, University of California at Berkeley, Tech report UCB/EECS-2010-90, June 2010.
1. Jeffrey Dean and Sanjay Ghemawat: “[MapReduce: Simplified Data Processing on Large Clusters](http://research.google.com/archive/mapreduce.html),” at *6th USENIX Symposium on Operating System Design and Implementation* (OSDI), December 2004.
1. Craig Kerstiens: “[JavaScript in Your Postgres](https://blog.heroku.com/javascript_in_your_postgres),” *blog.heroku.com*, June 5, 2013.
1. Nathan Bronson, Zach Amsden, George Cabrera, et al.: “[TAO: Facebooks Distributed Data Store for the Social Graph](https://www.usenix.org/conference/atc13/technical-sessions/presentation/bronson),” at *USENIX Annual Technical Conference* (USENIX ATC), June 2013.
1. “[Apache TinkerPop3.2.3 Documentation](http://tinkerpop.apache.org/docs/3.2.3/reference/),” *tinkerpop.apache.org*, October 2016.
1. “[The Neo4j Manual v2.0.0](http://docs.neo4j.org/chunked/2.0.0/index.html),” Neo Technology, 2013.
1. Emil Eifrem: [Twitter correspondence](https://twitter.com/emileifrem/status/419107961512804352), January 3, 2014.
1. David Beckett and Tim Berners-Lee: “[Turtle Terse RDF Triple Language](http://www.w3.org/TeamSubmission/turtle/),” W3C Team Submission, March 28, 2011.
1. “[Datomic Development Resources](http://docs.datomic.com/),” Metadata Partners, LLC, 2013.
1. W3C RDF Working Group: “[Resource Description Framework (RDF)](http://www.w3.org/RDF/),” *w3.org*, 10 February 2004.
1. “[Apache Jena](http://jena.apache.org/),” Apache Software Foundation.
1. Steve Harris, Andy Seaborne, and Eric Prud'hommeaux: “[SPARQL 1.1 Query Language](http://www.w3.org/TR/sparql11-query/),” W3C Recommendation, March 2013.
1. Todd J. Green, Shan Shan Huang, Boon Thau Loo, and Wenchao Zhou: “[Datalog and Recursive Query Processing](http://blogs.evergreen.edu/sosw/files/2014/04/Green-Vol5-DBS-017.pdf),” *Foundations and Trends in Databases*, volume 5, number 2, pages 105195, November 2013. [doi:10.1561/1900000017](http://dx.doi.org/10.1561/1900000017)
1. Stefano Ceri, Georg Gottlob, and Letizia Tanca: “[What You Always Wanted to Know About Datalog (And Never Dared to Ask)](https://www.researchgate.net/profile/Letizia_Tanca/publication/3296132_What_you_always_wanted_to_know_about_Datalog_and_never_dared_to_ask/links/0fcfd50ca2d20473ca000000.pdf),” *IEEE Transactions on Knowledge and Data Engineering*, volume 1, number 1, pages 146166, March 1989. [doi:10.1109/69.43410](http://dx.doi.org/10.1109/69.43410)
1. Serge Abiteboul, Richard Hull, and Victor Vianu: <a href="http://webdam.inria.fr/Alice/">*Foundations of Databases*</a>. Addison-Wesley, 1995. ISBN: 978-0-201-53771-0, available online at *webdam.inria.fr/Alice*
1. Nathan Marz: “[Cascalog](http://cascalog.org/),” *cascalog.org*.
1. Dennis A. Benson, Ilene Karsch-Mizrachi, David J. Lipman, et al.: “[GenBank](http://nar.oxfordjournals.org/content/36/suppl_1/D25.full-text-lowres.pdf),” *Nucleic Acids Research*, volume 36, Database issue, pages D25D30, December 2007. [doi:10.1093/nar/gkm929](http://dx.doi.org/10.1093/nar/gkm929)
1. Fons Rademakers: “[ROOT for Big Data Analysis](http://indico.cern.ch/getFile.py/access?contribId=13&resId=0&materialId=slides&confId=246453),” at *Workshop on the Future of Big Data Management*, London, UK, June 2013.
------
| 上一章 | 目录 | 下一章 |
|-----------------------------|---------------------------|-------------------------|
| [第一章:可靠性、可伸缩性和可维护性](ch1.md) | [设计数据密集型应用](../README.md) | [第三章:数据模型与查询语言](ch3.md) |