SQL queries don't start with SELECT

Okay, obviously many SQL queries do start with SELECT (and actually this post is only about SELECT queries, not INSERTs or anything).

But! Yesterday I was working on an explanation of window functions, and I found myself googling “can you filter based on the result of a window function”. As in, can you filter on the result of a window function in a WHERE or HAVING or something?

Eventually I concluded “window functions must run after WHERE and GROUP BY happen, so you can't do it”. But this led me to a bigger question: what order do SQL queries actually run in?

This was something that I felt like I knew intuitively (“I've written at least 10,000 SQL queries, some of them were really complicated! I must know this!”) but I struggled to actually articulate what the order was.

SQL queries happen in this order

I looked up the order, and here it is! (SELECT isn't the first thing, it's like the 5th thing!) (here it is in a tweet).

(I really want to find a more accurate way of phrasing this than “sql queries happen/run in this order” but I haven't figured it out yet)

https://jvns.ca/images/sql-queries.jpeg

In a non-image format, the order is:

  • FROM/JOIN and all the ON conditions
  • WHERE
  • GROUP BY
  • HAVING
  • SELECT (including window functions)
  • ORDER BY
  • LIMIT
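To make the order above concrete, here is a minimal sketch that walks through the seven steps by hand in plain Python, one step per line. The `owners` and `cats` data and the example query in the comment are made up for illustration. This is only a model of the logical order, not how any database engine is implemented.

```python
# Hypothetical sample data standing in for two tables.
owners = [
    {"id": 1, "name": "ana"},
    {"id": 2, "name": "bo"},
    {"id": 3, "name": "cy"},
]
cats = [
    {"name": "mr darcy", "owner": 1, "age": 3},
    {"name": "bingley", "owner": 1, "age": 5},
    {"name": "lizzy", "owner": 2, "age": 2},
    {"name": "kitty", "owner": 2, "age": 1},
    {"name": "jane", "owner": 2, "age": 7},
    {"name": "lydia", "owner": 2, "age": 4},
    {"name": "wickham", "owner": 3, "age": 4},
]

# The query being modelled:
#   SELECT owners.name AS owner, COUNT(*) AS num_cats
#   FROM owners JOIN cats ON owners.id = cats.owner
#   WHERE cats.age >= 2
#   GROUP BY owners.name
#   HAVING COUNT(*) >= 2
#   ORDER BY num_cats DESC
#   LIMIT 1

# 1. FROM/JOIN and the ON condition: pair up matching rows.
joined = [(o, c) for o in owners for c in cats if o["id"] == c["owner"]]

# 2. WHERE: filter individual joined rows (no groups exist yet).
filtered = [(o, c) for o, c in joined if c["age"] >= 2]

# 3. GROUP BY: bucket the surviving rows by owner name.
groups = {}
for o, c in filtered:
    groups.setdefault(o["name"], []).append(c)

# 4. HAVING: filter whole groups, so aggregates are available here.
kept = {owner: rows for owner, rows in groups.items() if len(rows) >= 2}

# 5. SELECT: compute the output columns for each remaining group.
selected = [{"owner": owner, "num_cats": len(rows)} for owner, rows in kept.items()]

# 6. ORDER BY: sort the selected rows (it can see num_cats).
ordered = sorted(selected, key=lambda r: r["num_cats"], reverse=True)

# 7. LIMIT: keep only the first row.
result = ordered[:1]
print(result)  # [{'owner': 'bo', 'num_cats': 3}]
```

Each step only sees what the previous steps produced, which is why step 2 can't refer to `num_cats`: it doesn't exist until step 5.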

questions this diagram helps you answer

This diagram is about the semantics of SQL queries: it lets you reason through what a given query will return, and answers questions like:

  • Can I do WHERE on something that came from a GROUP BY? (no! WHERE happens before GROUP BY!)
  • Can I filter based on the results of a window function? (no! window functions happen in SELECT, which happens after both WHERE and GROUP BY)
  • Can I ORDER BY based on something I did in GROUP BY? (yes! ORDER BY is basically the last thing, you can ORDER BY based on anything!)
  • When does LIMIT happen? (at the very end!)
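The answers above can be checked against a real engine. Here is a small sketch using Python's built-in `sqlite3` module; the `cats` table and its rows are invented for the example, and it assumes the bundled SQLite is 3.25 or newer (needed for window functions). Other databases should agree on which queries are valid, though their error messages will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cats (name TEXT, owner TEXT)")
conn.executemany(
    "INSERT INTO cats VALUES (?, ?)",
    [("mr darcy", "ana"), ("bingley", "ana"), ("lizzy", "bo"),
     ("jane", "bo"), ("lydia", "bo"), ("wickham", "cy")],
)

def is_rejected(sql):
    """Return True if SQLite refuses to run the query."""
    try:
        conn.execute(sql).fetchall()
        return False
    except sqlite3.OperationalError:
        return True

# WHERE happens before GROUP BY, so an aggregate can't be used there.
where_on_aggregate_rejected = is_rejected(
    "SELECT owner, COUNT(*) FROM cats WHERE COUNT(*) > 1 GROUP BY owner"
)

# Window functions happen in SELECT, so WHERE can't see them either.
where_on_window_rejected = is_rejected(
    "SELECT name FROM cats "
    "WHERE ROW_NUMBER() OVER (PARTITION BY owner ORDER BY name) = 1"
)

# HAVING happens after GROUP BY; ORDER BY and LIMIT come after SELECT,
# so ORDER BY can use the num_cats alias.
top_owner = conn.execute(
    """
    SELECT owner, COUNT(*) AS num_cats
    FROM cats
    GROUP BY owner
    HAVING COUNT(*) > 1
    ORDER BY num_cats DESC
    LIMIT 1
    """
).fetchall()

# To filter on a window function, compute it in an inner SELECT and
# filter in the outer query's WHERE.
first_cat_per_owner = conn.execute(
    """
    SELECT owner, name FROM (
        SELECT owner, name,
               ROW_NUMBER() OVER (PARTITION BY owner ORDER BY name) AS rn
        FROM cats
    )
    WHERE rn = 1
    ORDER BY owner
    """
).fetchall()

print(where_on_aggregate_rejected, where_on_window_rejected)
print(top_owner)
print(first_cat_per_owner)
```

The subquery in the last query is the usual workaround: the inner SELECT finishes (window function included) before the outer WHERE runs.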

Database engines don't actually literally run queries in this order, because they implement a bunch of optimizations to make queries run faster. We'll get to that a little later in the post.

So:

  • you can use this diagram when you just want to understand which queries are valid and how to reason about what the results of a given query will be
  • you shouldn't use this diagram to reason about query performance or anything involving indexes; that's a much more complicated thing with a lot more variables

confounding factor: column aliases

Someone on Twitter pointed out that many SQL implementations let you use the syntax:

SELECT CONCAT(first_name, ' ', last_name) AS full_name, count(*)
FROM table
GROUP BY full_name

This query makes it look like GROUP BY happens after SELECT even though GROUP BY is first, because the GROUP BY references an alias from the SELECT. But it's not actually necessary for the GROUP BY to run after the SELECT for this to work: the database engine can just rewrite the query as

SELECT CONCAT(first_name, ' ', last_name) AS full_name, count(*)
FROM table
GROUP BY CONCAT(first_name, ' ', last_name)

and run the GROUP BY first.
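As a quick check that the two spellings are interchangeable, here is a sketch using Python's built-in `sqlite3` module. The `people` table and its rows are made up, and it uses SQLite's `||` operator instead of `CONCAT` because not every SQLite version provides `CONCAT`. It only shows that SQLite accepts the alias and returns the same rows. It doesn't show how SQLite handles the alias internally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (first_name TEXT, last_name TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("ada", "lovelace"), ("ada", "lovelace"), ("grace", "hopper")],
)

# GROUP BY refers to the alias defined in SELECT.
by_alias = conn.execute(
    """
    SELECT first_name || ' ' || last_name AS full_name, COUNT(*)
    FROM people
    GROUP BY full_name
    ORDER BY full_name
    """
).fetchall()

# GROUP BY repeats the full expression instead of the alias.
by_expression = conn.execute(
    """
    SELECT first_name || ' ' || last_name AS full_name, COUNT(*)
    FROM people
    GROUP BY first_name || ' ' || last_name
    ORDER BY full_name
    """
).fetchall()

print(by_alias)                   # [('ada lovelace', 2), ('grace hopper', 1)]
print(by_alias == by_expression)  # True
```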

Your database engine also definitely does a bunch of checks to make sure that what you put in SELECT and GROUP BY makes sense together before it even starts to run the query, so it has to look at the query as a whole anyway before it starts to come up with an execution plan.

queries aren't actually run in this order (optimizations!)

Database engines in practice don't actually run queries by joining, and then filtering, and then grouping, because they implement a bunch of optimizations that reorder things to make the query run faster, as long as reordering things won't change the results of the query.

One simple example of why database engines need to run queries in a different order to make them fast is this query:

SELECT * FROM
owners LEFT JOIN cats ON owners.id = cats.owner
WHERE cats.name = 'mr darcy'

Here it would be silly to do the whole left join and match up all the rows in the 2 tables if you just need to look up the 3 cats named ‘mr darcy’. It's way faster to do some filtering first for cats named ‘mr darcy’. And in this case filtering first doesn't change the results of the query!
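Here is a sketch of that equivalence, using Python's built-in `sqlite3` module and made-up `owners` and `cats` rows. The second query filters the cats first by hand and then joins. It can use a plain JOIN because the WHERE on `cats.name` already throws away the NULL-padded rows a LEFT JOIN would add for owners without a matching cat. The hand-written rewrite only illustrates the kind of reordering an optimizer is allowed to do. It is not a claim about what any particular engine does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE owners (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cats (name TEXT, owner INTEGER);
    INSERT INTO owners VALUES (1, 'ana'), (2, 'bo'), (3, 'cy');
    INSERT INTO cats VALUES ('mr darcy', 1), ('bingley', 1), ('mr darcy', 2);
    """
)

# The query as written: join everything, then filter.
join_then_filter = conn.execute(
    """
    SELECT * FROM
    owners LEFT JOIN cats ON owners.id = cats.owner
    WHERE cats.name = 'mr darcy'
    """
).fetchall()

# Manual rewrite: filter the cats first, then join only the survivors.
filter_then_join = conn.execute(
    """
    SELECT * FROM
    owners JOIN (SELECT * FROM cats WHERE name = 'mr darcy') AS c
    ON owners.id = c.owner
    """
).fetchall()

# Row order isn't guaranteed without ORDER BY, so compare sorted rows.
print(sorted(join_then_filter) == sorted(filter_then_join))  # True
```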

There are lots of other optimizations that database engines implement in practice that might make them run queries in a different order, but there's no room for that here and honestly it's not something I'm an expert on.

LINQ starts queries with FROM

LINQ (a querying syntax in C# and VB.NET) uses the order FROM ... WHERE ... SELECT. Here's an example of a LINQ query:

var teenAgerStudent = from s in studentList
                      where s.Age > 12 && s.Age < 20
                      select s;

pandas (my favourite data wrangling tool) also basically works like this, though you don't need to use this exact order. I'll often write pandas code like this:

df = thing1.join(thing2)      # like a JOIN
df = df[df.created_at > 1000] # like a WHERE
df = df.groupby('something', as_index=False).agg(num_yes=('yes', 'sum')) # like a GROUP BY
df = df[df.num_yes > 2]       # like a HAVING, filtering on the result of a GROUP BY
df = df[['num_yes', 'something']] # pick the columns I want to display, like a SELECT
df = df.sort_values('something', ascending=True) # like an ORDER BY
df[:30]                       # like a LIMIT

This isn't because pandas is imposing any specific rule on how you have to write your code, though. It's just that it often makes sense to write code in the order JOIN / WHERE / GROUP BY / HAVING. (I'll often put a WHERE first to improve performance though, and I think most database engines will also do a WHERE first in practice)

dplyr in R also lets you use a different syntax for querying SQL databases like Postgres, MySQL and SQLite, which is also in a more logical order.

I was really surprised that I didn't know this

I'm writing a blog post about this because when I found out the order I was SO SURPRISED that I'd never seen it written down that way before. It explains basically everything that I knew intuitively about why some queries are allowed and others aren't. So I wanted to write it down in the hopes that it will help other people also understand how to write SQL queries.


via: https://jvns.ca/blog/2019/10/03/sql-queries-don-t-start-with-select/

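Author: Julia Evans. Topic selection: lujun9972. Translator: (translator ID). Proofreader: (proofreader ID).

This article was originally compiled by LCTT and is proudly presented by Linux中国 (Linux China).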