Merge pull request #5696 from ictlyh/master

Translated tech/20170308 Many SQL Performance Problems Stem from  Unn…
This commit is contained in:
Yuanhao Luo 2017-06-14 16:28:49 +08:00 committed by GitHub
commit 5d3068b171
2 changed files with 394 additions and 396 deletions

Many SQL Performance Problems Stem from “Unnecessary, Mandatory Work”
============================================================ 
Probably the most impactful thing you could learn about when writing efficient SQL is [indexing][1]. A very close runner-up, however, is the fact that a lot of SQL clients demand tons of **“unnecessary, mandatory work”** from the database.
Repeat this after me:
> Unnecessary, Mandatory Work
What is **“unnecessary, mandatory work”**? It's two things (duh):
### Unnecessary
Let's assume your client application needs this information here:
[
![](https://lukaseder.files.wordpress.com/2017/03/title-rating.png?w=662)
][2]
Nothing out of the ordinary. We run a movie database ([e.g. the Sakila database][3]) and we want to display the title and rating of each film to the user.
This is the query that would produce the above result:
```
SELECT title, rating
FROM film
```
However, our application (or our ORM) runs this query instead:
```
SELECT *
FROM film
```
What are we getting? Guess what. We're getting tons of useless information:
[
![](https://lukaseder.files.wordpress.com/2017/03/useless-information.png?w=662&h=131)
][4]
There's even some complex JSON all the way to the right, which is loaded:
* From the disk
* Into the caches
* Over the wire
* Into the client memory
* And then discarded
Yes, we discard most of this information. The work that was performed to retrieve it was completely unnecessary. Right? Agreed.
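The cost of all that discarded data is easy to make visible on the client side. Here is a small sketch in Python using an in-memory SQLite database; the table contents are invented for the example (this is not the Sakila schema) and the byte counting is only a rough proxy for the wire and memory overhead:

```python
import sqlite3

# Hypothetical film table with one bulky column, to show how much extra
# payload "SELECT *" drags along compared to selecting only what we need.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE film (film_id INTEGER PRIMARY KEY, "
    "title TEXT, rating TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO film (title, rating, description) VALUES (?, ?, ?)",
    [(f"FILM {i}", "PG", "A very long description... " * 50)
     for i in range(1000)],
)

def payload_bytes(sql):
    # Rough client-side measure: characters of text fetched "over the wire"
    return sum(len(str(col)) for row in conn.execute(sql) for col in row)

star = payload_bytes("SELECT * FROM film")
needed = payload_bytes("SELECT title, rating FROM film")
print(star, needed)  # the * payload dwarfs the columns we actually use
assert star > 10 * needed
```

The exact ratio depends on the schema, of course; the point is only that every byte of that difference was loaded, transferred, and then thrown away.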
### Mandatory
That's the worst part. While optimisers have become quite smart these days, this work is mandatory for the database. There's no way the database can  _know_  that the client application actually didn't need 95% of the data. And that's just a simple example. Imagine if we joined more tables…
So what, you think? Databases are fast? Let me offer you some insight you may not have thought of before:
### Memory consumption
Sure, the individual execution time doesn't really change much. Perhaps it'll be 1.5x slower, but we can handle that, right? For the sake of convenience? Sometimes that's true. But if you're sacrificing performance for convenience  _every time_ , things add up. We're no longer talking about performance (speed of individual queries), but throughput (system response time), and that's when stuff gets really hairy and tough to fix. That's when you stop being able to scale.
Let's look at execution plans (Oracle, in this case):
```
--------------------------------------------------
| Id | Operation | Name | Rows | Bytes |
--------------------------------------------------
| 0 | SELECT STATEMENT | | 1000 | 166K|
| 1 | TABLE ACCESS FULL| FILM | 1000 | 166K|
--------------------------------------------------
```
Versus
```
--------------------------------------------------
| Id | Operation | Name | Rows | Bytes |
--------------------------------------------------
| 0 | SELECT STATEMENT | | 1000 | 20000 |
| 1 | TABLE ACCESS FULL| FILM | 1000 | 20000 |
--------------------------------------------------
```
We're using 8x as much memory in the database when doing `SELECT *` rather than `SELECT title, rating`. That's not really surprising, though, is it? We knew that. Yet we accepted it in many, many of our queries where we simply didn't need all that data. We generated **needless, mandatory work** for the database, and it does add up. We're using 8x too much memory (the number will differ, of course).
Now, all the other steps (disk I/O, wire transfer, client memory consumption) are affected in the same way, but I'm skipping those. Instead, I'd like to look at…
### Index usage
Most databases these days have figured out the concept of [ _covering indexes_ ][5]. A covering index is not a special index per se. But it can turn into a “special index” for a given query, either “accidentally,” or by design.
Check out this query:
```
SELECT *
FROM actor
WHERE last_name LIKE 'A%'
```
There's nothing extraordinary to be seen in the execution plan. It's a simple query. Index range scan, table access, done:
```
-------------------------------------------------------------------
| Id | Operation | Name | Rows |
-------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 8 |
| 1 | TABLE ACCESS BY INDEX ROWID| ACTOR | 8 |
|* 2 | INDEX RANGE SCAN | IDX_ACTOR_LAST_NAME | 8 |
-------------------------------------------------------------------
```
Is it a good plan, though? Well, if what we really needed was this, then it's not:
[
![](https://lukaseder.files.wordpress.com/2017/03/first-name-last-name.png?w=662)
][6]
Sure, we're wasting memory, etc. But check out this alternative query:
```
SELECT first_name, last_name
FROM actor
WHERE last_name LIKE 'A%'
```
Its plan is this:
```
----------------------------------------------------
| Id | Operation | Name | Rows |
----------------------------------------------------
| 0 | SELECT STATEMENT | | 8 |
|* 1 | INDEX RANGE SCAN| IDX_ACTOR_NAMES | 8 |
----------------------------------------------------
```
We could now eliminate the table access entirely, because there's an index that covers all the needs of our query… a covering index. Does it matter? Absolutely! This approach can speed up some of your queries by an order of magnitude (or slow them down by an order of magnitude when your index stops being covering after a change).
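The same effect is easy to observe outside Oracle. Here is a hedged sketch using SQLite's `EXPLAIN QUERY PLAN` (table and index names are invented to mirror the example; a range predicate is used instead of `LIKE` so the index is always applicable):

```python
import sqlite3

# SQLite analogue: when the index holds every column the query touches,
# the plan reports a COVERING INDEX and skips the table entirely.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE actor (actor_id INTEGER PRIMARY KEY, "
    "first_name TEXT, last_name TEXT, last_update TEXT)")
conn.execute("CREATE INDEX idx_actor_names ON actor (last_name, first_name)")
conn.executemany(
    "INSERT INTO actor (first_name, last_name) VALUES (?, ?)",
    [("PENELOPE", "AKROYD"), ("BOB", "ALLEN"), ("GRACE", "BAILEY")])

q = "FROM actor WHERE last_name >= 'A' AND last_name < 'B'"
plan_star = " ".join(
    r[3] for r in conn.execute("EXPLAIN QUERY PLAN SELECT * " + q))
plan_cols = " ".join(
    r[3] for r in conn.execute(
        "EXPLAIN QUERY PLAN SELECT first_name, last_name " + q))

print(plan_star)  # must still visit the table for the other columns
print(plan_cols)  # served from the covering index alone
assert "COVERING INDEX" not in plan_star
assert "COVERING INDEX" in plan_cols
```

Exact plan wording varies between SQLite versions, but the "COVERING INDEX" marker is the part that matters here.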
You cannot always profit from covering indexes. Indexes come with their own cost and you shouldn't add too many of them, but in cases like these, it's a no-brainer. Let's run a benchmark:
```
SET SERVEROUTPUT ON
DECLARE
  v_ts TIMESTAMP;
  v_repeat CONSTANT NUMBER := 100000;
BEGIN
  v_ts := SYSTIMESTAMP;

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      -- Worst query: Memory overhead AND table access
      SELECT *
      FROM actor
      WHERE last_name LIKE 'A%'
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  dbms_output.put_line('Statement 1 : ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      -- Better query: Still table access
      SELECT /*+INDEX(actor(last_name))*/
        first_name, last_name
      FROM actor
      WHERE last_name LIKE 'A%'
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  dbms_output.put_line('Statement 2 : ' || (SYSTIMESTAMP - v_ts));
  v_ts := SYSTIMESTAMP;

  FOR i IN 1..v_repeat LOOP
    FOR rec IN (
      -- Best query: Covering index
      SELECT /*+INDEX(actor(last_name, first_name))*/
        first_name, last_name
      FROM actor
      WHERE last_name LIKE 'A%'
    ) LOOP
      NULL;
    END LOOP;
  END LOOP;

  dbms_output.put_line('Statement 3 : ' || (SYSTIMESTAMP - v_ts));
END;
/
```
The result is:
```
Statement 1 : +000000000 00:00:02.479000000
Statement 2 : +000000000 00:00:02.261000000
Statement 3 : +000000000 00:00:01.857000000
```
Note that the actor table only has 4 columns, so the difference between statements 1 and 2 is not too impressive, but still significant. Note also that I'm using Oracle's hints to force the optimiser to pick one or the other index for the query. Statement 3 clearly wins in this case. It's a  _much_  better query, and that's just an extremely simple query.
Again, when we write `SELECT *`, we create **needless, mandatory work** for the database, which it cannot optimise. It won't pick the covering index because that index has a bit more overhead than the `LAST_NAME` index that it did pick, and after all, it had to go to the table anyway to fetch the useless `LAST_UPDATE` column, for instance.
But things get worse with `SELECT *`. Consider…
### SQL transformations
Optimisers work so well because they transform your SQL queries ([watch my recent talk at Voxxed Days Zurich about how this works][7]). For instance, there's a SQL transformation called “`JOIN` elimination”, and it is really powerful. Consider this auxiliary view, which we wrote because we grew so incredibly tired of joining all these tables all the time:
```
CREATE VIEW v_customer AS
SELECT
  c.first_name, c.last_name,
  a.address, ci.city, co.country
FROM customer c
JOIN address a USING (address_id)
JOIN city ci USING (city_id)
JOIN country co USING (country_id)
```
This view just connects all the “to-one” relationships between a `CUSTOMER` and their different `ADDRESS` parts. Thanks, normalisation.
Now, after working with this view for a while, imagine we've become so accustomed to it that we've forgotten all about the underlying tables. And now, we're running this query:
```
SELECT *
FROM v_customer
```
We're getting quite an impressive plan:
```
----------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost |
----------------------------------------------------------------
| 0 | SELECT STATEMENT | | 599 | 47920 | 14 |
|* 1 | HASH JOIN | | 599 | 47920 | 14 |
| 2 | TABLE ACCESS FULL | COUNTRY | 109 | 1526 | 2 |
|* 3 | HASH JOIN | | 599 | 39534 | 11 |
| 4 | TABLE ACCESS FULL | CITY | 600 | 10800 | 3 |
|* 5 | HASH JOIN | | 599 | 28752 | 8 |
| 6 | TABLE ACCESS FULL| CUSTOMER | 599 | 11381 | 4 |
| 7 | TABLE ACCESS FULL| ADDRESS | 603 | 17487 | 3 |
----------------------------------------------------------------
```
Well, of course. We run all these joins and full table scans, because that's what we told the database to do. Fetch all this data.
Now, again, imagine that what we really wanted on one particular screen was this:
[
![](https://lukaseder.files.wordpress.com/2017/03/first-name-last-name-customers.png?w=662)
][8]
Yeah, duh, right? By now you get my point. But imagine we've learned from the previous mistakes, and we're now actually running the following, better query:
```
SELECT first_name, last_name
FROM v_customer
```
Now, check this out!
```
------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost |
------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 599 | 16173 | 4 |
| 1 | NESTED LOOPS | | 599 | 16173 | 4 |
| 2 | TABLE ACCESS FULL| CUSTOMER | 599 | 11381 | 4 |
|* 3 | INDEX UNIQUE SCAN| SYS_C007120 | 1 | 8 | 0 |
------------------------------------------------------------------
```
That's a  _drastic_  improvement in the execution plan. Our joins were eliminated, because the optimiser could prove they were **needless**. Once it can prove this (and you don't make the work **mandatory** by selecting `*`), it can remove the work and simply not do it. Why is that the case?
Each `CUSTOMER.ADDRESS_ID` foreign key guarantees that there is  _exactly one_  `ADDRESS.ADDRESS_ID` primary key value, so the `JOIN` operation is guaranteed to be a to-one join that neither adds nor removes rows. If we don't even select or query those rows, well, we don't need to actually load them at all. Removing the `JOIN` provably won't change the outcome of the query.
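SQLite offers a close cousin of this transformation that is easy to demonstrate: to my knowledge it does not reason from foreign keys to eliminate inner joins, but it does drop an unused `LEFT JOIN` onto a unique key, since that join provably cannot change the row count. A sketch with invented tables:

```python
import sqlite3

# LEFT JOIN elimination in SQLite: address_id is a unique key, so a join
# that contributes no selected columns is a no-op and the planner drops it.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE address (address_id INTEGER PRIMARY KEY, address TEXT)")
conn.execute(
    "CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, "
    "first_name TEXT, last_name TEXT, address_id INTEGER)")

# No address columns selected: the join can be eliminated
plan1 = list(conn.execute(
    "EXPLAIN QUERY PLAN SELECT c.first_name, c.last_name "
    "FROM customer c LEFT JOIN address a USING (address_id)"))
# An address column selected: both tables must be touched
plan2 = list(conn.execute(
    "EXPLAIN QUERY PLAN SELECT c.first_name, a.address "
    "FROM customer c LEFT JOIN address a USING (address_id)"))

print(plan1)  # a single scan of customer
print(plan2)  # customer scan plus an address lookup
assert len(plan1) == 1
assert len(plan2) == 2
```

The number of plan steps is the tell: the unused join simply vanishes from the first plan.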
Databases do these things all the time. You can try this on most databases:
```
-- Oracle
SELECT CASE WHEN EXISTS (
  SELECT 1 / 0 FROM dual
) THEN 1 ELSE 0 END
FROM dual

-- More reasonable SQL dialects, e.g. PostgreSQL
SELECT EXISTS (SELECT 1 / 0)
```
In this case, you might expect an arithmetic exception to be raised, as when you run this query:
```
SELECT 1 / 0 FROM dual
```
yielding
```
ORA-01476: divisor is equal to zero
```
But it doesn't happen. The optimiser (or even the parser) can prove that any `SELECT` column expression in an `EXISTS (SELECT ..)` predicate will not change the outcome of the query, so there's no need to evaluate it. Huh!
### Meanwhile…
One of the most unfortunate problems of many ORMs is the fact that they make `SELECT *` queries so easy to write. In fact, HQL / JPQL, for instance, went on to make it the default. You can even omit the `SELECT` clause entirely, because after all, you're going to be fetching the entire entity, as declared, right?
For instance:
```
FROM v_customer
```
[Vlad Mihalcea, for instance, a Hibernate expert and Hibernate developer advocate][9], recommends you use queries almost every time you're sure you don't want to persist any modifications after fetching. ORMs make it easy to solve the object graph persistence problem. Note: persistence. The idea of actually modifying the object graph and persisting the modifications is inherent.
But if you don't intend to do that, why bother fetching the entity? Why not write a query? Let's be very clear: from a performance perspective, writing a query tailored to the exact use-case you're solving is  _always_  going to outperform any other option. You may not care because your data set is small and it doesn't matter. Fine. But eventually, you'll need to scale, and re-designing your application to favour a query language over imperative entity graph traversal will be quite hard. You'll have other things to do.
### Counting for existence
One of the worst wastes of resources is when people run `COUNT(*)` queries when they simply want to check for existence. E.g.
> Did this user have any orders at all?
And well run:
```
SELECT count(*)
FROM orders
WHERE user_id = :user_id
```
Easy. If `COUNT = 0`: No orders. Otherwise: Yes, orders.
The performance will not be horrible, because we probably have an index on the `ORDERS.USER_ID` column. But what do you think the performance of the above will be, compared to this alternative:
```
-- Oracle
SELECT CASE WHEN EXISTS (
  SELECT *
  FROM orders
  WHERE user_id = :user_id
) THEN 1 ELSE 0 END
FROM dual

-- Reasonable SQL dialects, like PostgreSQL
SELECT EXISTS (
  SELECT *
  FROM orders
  WHERE user_id = :user_id
)
```
It doesn't take a rocket scientist to figure out that an actual existence predicate can stop looking for additional rows as soon as it has found  _one_ . So, if the answer is “no orders”, then the speed will be comparable. If, however, the answer is “yes, orders”, then the answer might be  _drastically_  faster in the case where we do not calculate the exact count.
Because we  _don't care_  about the exact count. Yet, we told the database to calculate it (**needless**), and the database doesn't know we're discarding all results bigger than 1 (**mandatory**).
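The gap is visible even in a quick SQLite sketch (schema invented for the example): `COUNT(*)` must visit every matching index entry, while `EXISTS` stops at the first one.

```python
import sqlite3
import time

# One user with a large number of orders, so the count has real work to do.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER)")
conn.execute("CREATE INDEX idx_orders_user ON orders (user_id)")
conn.executemany(
    "INSERT INTO orders (user_id) VALUES (?)",
    [(42,) for _ in range(200_000)])

def timed(sql):
    t = time.perf_counter()
    result = conn.execute(sql, {"user_id": 42}).fetchone()[0]
    return result, time.perf_counter() - t

count, t_count = timed(
    "SELECT count(*) FROM orders WHERE user_id = :user_id")
exists, t_exists = timed(
    "SELECT EXISTS (SELECT * FROM orders WHERE user_id = :user_id)")

print(count, t_count)    # 200000 matching rows, all counted
print(exists, t_exists)  # 1, found after a single index probe
assert (count > 0) == bool(exists)
```

On any recent machine the `EXISTS` probe should finish orders of magnitude faster than the count, while answering the actual question just as well.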
Of course, things get much worse if you call `list.size()` on a JPA-backed collection to do the same…
[Ive blogged about this recently, and benchmarked the alternatives on different databases. Do check it out.][10]
### Conclusion
This article stated the “obvious”. Don't tell the database to perform **needless, mandatory work**.
It's **needless** because, given your requirements, you  _knew_  that some specific piece of work did not need to be done. Yet, you tell the database to do it.
It's **mandatory** because the database has no way to prove it's **needless**. This information is contained only in the client and is inaccessible to the server. So, the database has to do it.
This article talked mostly about `SELECT *`, because that's such an easy target. But this isn't about databases only. This is about any distributed algorithm where a client instructs a server to perform **needless, mandatory work**. How many N+1 problems does your average AngularJS application have, where the UI loops over service result A, calling service B many times, instead of batching all calls to B into a single call? It's a recurring pattern.
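The N+1 pattern and its batched fix can be sketched in miniature (invented tables, SQLite standing in for the remote service): one round trip per item versus one round trip for all items, with identical results.

```python
import sqlite3

# Tiny invented schema: users and their order totals.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, "
    "user_id INTEGER, total REAL)")
conn.executemany("INSERT INTO users (user_id, name) VALUES (?, ?)",
                 [(i, f"user{i}") for i in range(1, 6)])
conn.executemany("INSERT INTO orders (user_id, total) VALUES (?, ?)",
                 [(i, 10.0 * i) for i in range(1, 6)])

user_ids = [r[0] for r in conn.execute("SELECT user_id FROM users")]

# N+1: one query (round trip) per user
n_plus_1 = {
    uid: conn.execute("SELECT total FROM orders WHERE user_id = ?",
                      (uid,)).fetchone()[0]
    for uid in user_ids
}

# Batched: a single query for all users
placeholders = ",".join("?" * len(user_ids))
batched = dict(conn.execute(
    f"SELECT user_id, total FROM orders WHERE user_id IN ({placeholders})",
    user_ids))

assert n_plus_1 == batched  # same answer, 1 round trip instead of N+1
```

Same answer, a fraction of the round trips; the more information the server gets per call, the less mandatory work it is forced to repeat.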
The solution is always the same. The more information you give to the entity executing your command, the faster it can (in principle) execute that command. Write a better query. Every time. Your entire system will thank you for it.
### If you liked this article…
… do also check out my recent talk at Voxxed Days Zurich, where I show some hyperbolic examples of why SQL will beat Java at data processing algorithms every time:
--------------------------------------------------------------------------------
via: https://blog.jooq.org/2017/03/08/many-sql-performance-problems-stem-from-unnecessary-mandatory-work
Author: [jooq][a]
Translator: [译者ID](https://github.com/译者ID)
Proofreader: [校对者ID](https://github.com/校对者ID)
This article is translated by [LCTT](https://github.com/LCTT/TranslateProject) and proudly presented by [Linux中国](https://linux.cn/)
[a]:https://blog.jooq.org/
[1]:http://use-the-index-luke.com/
[2]:https://lukaseder.files.wordpress.com/2017/03/title-rating.png
[3]:https://github.com/jOOQ/jOOQ/tree/master/jOOQ-examples/Sakila
[4]:https://lukaseder.files.wordpress.com/2017/03/useless-information.png
[5]:https://blog.jooq.org/2015/04/28/do-not-think-that-one-second-is-fast-for-query-execution/
[6]:https://lukaseder.files.wordpress.com/2017/03/first-name-last-name.png
[7]:https://www.youtube.com/watch?v=wTPGW1PNy_Y
[8]:https://lukaseder.files.wordpress.com/2017/03/first-name-last-name-customers.png
[9]:https://vladmihalcea.com/2016/09/13/the-best-way-to-handle-the-lazyinitializationexception/
[10]:https://blog.jooq.org/2016/09/14/avoid-using-count-in-sql-when-you-could-use-exists/
