memgraph/docs/dev/query/planning.md
Teon Banek 6f10b1c115 Move query implementation from Phriction to this repo
Summary:
Our query parsing, planning and execution architecture was described on
Phabricator wiki pages, Phriction. This commit copies the said
documentation here, so that it's easier to access for all developers.
Additional benefit is tracking the changes and hopefully suggesting to
developers to keep it up to date.

Besides making a copy, the documentation has been updated to reflect the
current state of the codebase. Note that some things are still missing,
but what was written should now be correct.

Reviewers: mtomic, llugovic

Reviewed By: mtomic

Subscribers: pullbot

Differential Revision: https://phabricator.memgraph.io/D1854
2019-02-15 16:58:39 +01:00


# Logical Planning
After the semantic analysis and symbol generation, the AST is converted to a
tree of logical operators. This conversion is called *planning* and the tree
of logical operators is called a *plan*. The whole planning process is done in
the following steps.

  1. [AST Preprocessing](#ast-preprocessing)

     The first step is to preprocess the AST by collecting information on
     filters, dividing the query into parts, normalizing patterns in `MATCH`
     clauses, etc.

  2. [Logical Operator Planning](#logical-operator-planning)

     After the preprocess step, the planning can be done via 2 planners:
     `VariableStartPlanner` and `RuleBasedPlanner`. The first planner will
     generate multiple plans where each plan has different starting points for
     searching the patterns in `MATCH` clauses. The second planner produces a
     single plan by mapping the query parts as they are to logical operators.

  3. [Logical Plan Postprocessing](#logical-plan-postprocessing)

     In this stage, we perform various transformations on the generated logical
     plan. Here we want to optimize the operations in order to improve
     performance during the execution. Naturally, transformations need to
     preserve the semantic behaviour of the original plan.

  4. [Cost Estimation](#cost-estimation)

     After the generation, the execution cost of each plan is estimated. This
     estimation is used to select the best plan which will be executed.

  5. [Distributed Planning](#distributed-planning)

     In case we are running distributed Memgraph, the final plan is adapted
     for distributed execution. NOTE: This ordering appears to be an error in
     the workflow. Distributed planning should be moved before step 3 or
     integrated with it. With the workflow ordered as it is now, cost
     estimation doesn't consider the distributed plan.

The implementation can be found in the `query/plan` directory, with the public
entry point being `query/plan/planner.hpp`.
## AST Preprocessing
Each openCypher query consists of at least 1 **single query**. Multiple single
queries are chained together using a **query combinator**. Currently, there is
only one combinator, `UNION`. The preprocessing step starts in the
`CollectQueryParts` function. This function takes a look at each single
query and divides it into parts. Parts are separated by `RETURN` and
`WITH` clauses. For example:

    MATCH (n) CREATE (m) WITH m MATCH (l)-[]-(m) RETURN l
    |                          |                        |
    |------- part 1 -----------+-------- part 2 --------|
    |                                                   |
    |-------------------- single query -----------------|
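
The splitting rule can be illustrated with a small Python sketch. It is not
the actual `CollectQueryParts` code: the function name `split_query_parts` and
the representation of clauses as plain strings are assumptions made for this
example, as is the handling of clauses that follow the last `WITH` or
`RETURN`.

```python
def split_query_parts(clauses):
    """Split a list of clause names into parts; WITH and RETURN close a part."""
    parts, current = [], []
    for clause in clauses:
        current.append(clause)
        if clause in ("WITH", "RETURN"):
            parts.append(current)
            current = []
    if current:
        # Assumption of this sketch: leftover clauses form a final part.
        parts.append(current)
    return parts


# The single query from the diagram above, reduced to its clause names.
parts = split_query_parts(["MATCH", "CREATE", "WITH", "MATCH", "RETURN"])
print(parts)
```
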
Each part is created by collecting all `MATCH` clauses and *normalizing* their
patterns. Pattern normalization is the process of converting an arbitrarily
long pattern chain of nodes and edges into a list of triplets `(start node,
edge, end node)`. The triplets should preserve the semantics of the match. For
example:

    MATCH (a)-[p]-(b)-[q]-(c)-[r]-(d)

is equivalent to:

    MATCH (a)-[p]-(b), (b)-[q]-(c), (c)-[r]-(d)

With this representation, it becomes easier to reorder the triplets and choose
different strategies for pattern matching.
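
A minimal Python sketch of pattern normalization follows. The list-based
encoding of a pattern chain (alternating node and edge names) and the function
name `normalize_pattern` are assumptions made for illustration; the real
implementation operates on AST pattern atoms.

```python
def normalize_pattern(chain):
    """Convert [n0, e0, n1, e1, n2, ...] into (start, edge, end) triplets."""
    if len(chain) % 2 == 0:
        raise ValueError("a pattern chain must start and end with a node")
    if len(chain) == 1:
        # A standalone node: the triplet is reduced to the start node only.
        return [(chain[0], None, None)]
    return [
        (chain[i], chain[i + 1], chain[i + 2])
        for i in range(0, len(chain) - 2, 2)
    ]


# MATCH (a)-[p]-(b)-[q]-(c)-[r]-(d)
triplets = normalize_pattern(["a", "p", "b", "q", "c", "r", "d"])
print(triplets)
```
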
In addition to normalizing patterns, all of the filter expressions in patterns
and inside of the `WHERE` clause (of the accompanying `MATCH`) are extracted
and stored separately. During the extraction, symbols used in the filter
expression are collected. This allows for planning filters in a valid order,
as the matching for triplets is being done. Another important benefit of
having extra information on filters is recognizing when a database index
could be used.

After all `MATCH` clauses are processed, they are grouped, so that even whole
`MATCH` clauses may be reordered. The important thing is to remember which
symbols were used to name edges in each `MATCH`. With those symbols we can
plan for *cyphermorphism*, i.e. ensure different edges in the search pattern
of a single `MATCH` map to different edges in the graph. This preserves the
semantics of the query, even though we may have reordered the matching. The
same steps are done for `OPTIONAL MATCH`.

Another clause which needs processing is `MERGE`. Here we normalize the
pattern, since the `MERGE` is a bit like `MATCH` and `CREATE` in one.

All the other clauses are left as is.

In the end, each query part consists of:

  * processed and grouped `MATCH` clauses;
  * processed and grouped `OPTIONAL MATCH` clauses;
  * processed `MERGE` matching pattern and
  * unchanged remaining clauses.

The last stored clause is guaranteed to be either `WITH` or `RETURN`.
## Logical Operator Planning
### Variable Start Planner
The `VariableStartPlanner` generates multiple plans for a single query. Each
plan is generated by selecting a different starting point for pattern
matching.

The algorithm works as follows.

  1. For each query part:
     1. For each node in triplets of collected `MATCH` clauses:
        i. Add the node to a set of `expanded` nodes
        ii. Select a triplet `(start node, edge, end node)` whose `start node` is
            in the `expanded` set
        iii. If no triplet was selected, choose a new starting node that isn't in
             `expanded` and continue expanding
        iv. Repeat steps ii. -- iii. until all triplets have been selected
            and store that as a variation of the `MATCH` clauses
     2. Do step 1.1. for `OPTIONAL MATCH` and `MERGE` clauses
     3. Take all combinations of the generated `MATCH`, `OPTIONAL MATCH` and
        `MERGE` and store them as variations of the query part.
  2. For each combination of query part variations:
     1. Generate a plan using the rule based planner
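
Step 1.1 can be sketched in Python as shown below. This is only an
illustration under several assumptions: the name `match_variations` is made
up, every triplet is assumed to contain an edge, and edges are treated as
undirected so that a triplet whose *end* node is already expanded can be
flipped. The flipping is not spelled out in the steps above; without it, a
start node that only appears at the end of a triplet could never be expanded
from.

```python
def match_variations(triplets):
    """Return one ordering of the triplets per possible starting node."""
    nodes = []
    for start, _, end in triplets:
        for node in (start, end):
            if node not in nodes:
                nodes.append(node)

    variations = []
    for first in nodes:
        expanded, remaining, ordered = {first}, list(triplets), []
        while remaining:
            # Prefer a triplet that touches an already expanded node.
            pick = next(
                (t for t in remaining if t[0] in expanded or t[2] in expanded),
                None,
            )
            if pick is None:
                # Disconnected pattern: continue from a new starting node.
                pick = remaining[0]
            remaining.remove(pick)
            start, edge, end = pick
            if start not in expanded and end in expanded:
                start, end = end, start  # flip to expand from a known node
            expanded.update((start, end))
            ordered.append((start, edge, end))
        variations.append(ordered)
    return variations


# MATCH (a)-[p]-(b), (b)-[q]-(c) yields one variation per node a, b and c.
variations = match_variations([("a", "p", "b"), ("b", "q", "c")])
for variation in variations:
    print(variation)
```
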
### Rule Based Planner
The `RuleBasedPlanner` generates a single plan for a single query. A plan is
generated by following hardcoded rules for producing logical operators. The
following sections give an overview of how each openCypher clause is converted
to a `LogicalOperator`.
#### MATCH
The `MATCH` clause is used to specify which patterns need to be searched for
in the database. These patterns are normalized in the preprocess step to be
represented as triplets `(start node, edge, end node)`. When there is no edge,
the triplet is reduced to only the `start node`. Generating the operators
is done by looping over these triplets.
##### Searching for Nodes
The simplest search is finding standalone nodes. For example, `MATCH (n)`
will find all the nodes in the graph. This is accomplished by generating a
`ScanAll` operator and forwarding the node symbol which should store the
results. In this case, all the nodes will be referenced by `n`.

Multiple nodes can be specified in a single match, e.g. `MATCH (n), (m)`.
Planning is done by repeating the same steps for each sub pattern (separated
by a comma). In this case, we would get 2 `ScanAll` operators chained one
after the other. An optimization can be obtained if the node in the pattern is
already searched for. In `MATCH (n), (n)` we can drop the second `ScanAll`
operator since we have already generated it for the first node.
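
The following Python sketch shows the idea of skipping a `ScanAll` for a
symbol that is already bound. Operators are rendered as plain text in the same
notation used in this document; the function name `plan_node_scans` is a
hypothetical helper, not part of the planner.

```python
def plan_node_scans(node_symbols):
    """Chain one ScanAll per distinct node symbol, in order of appearance."""
    bound, operators = set(), []
    for symbol in node_symbols:
        if symbol in bound:
            continue  # already scanned, e.g. the second (n) in MATCH (n), (n)
        bound.add(symbol)
        operators.append(f"ScanAll ({symbol})")
    return " > ".join(operators)


print(plan_node_scans(["n", "m"]))
print(plan_node_scans(["n", "n"]))
```
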
##### Searching for Relationships
A more advanced search includes finding nodes with relationships. For example,
`MATCH (n)-[r]-(m)` should find every pair of connected nodes in the database.
This means that if a single node has multiple connections, it will be
repeated for each combination of pairs. The generation of operators starts
from the first node in the pattern. If we are referencing a new starting node,
we need to generate a `ScanAll` which finds all the nodes and stores them
into `n`. Then, we generate an `Expand` operator which reads the `n` and
traverses all the edges of that node. The edge is stored into `r`, while the
destination node is stored in `m`.

Matching multiple relationships proceeds similarly, by repeating the same
steps. The only difference is that we need to ensure different edges in the
search pattern map to different edges in the graph. This means that after
each `Expand` operator, we need to generate an `ExpandUniquenessFilter`. We
provide this operator with a list of symbols for the previously matched edges
and the symbol for the current edge.

For example.

    MATCH (n)-[r1]-(m)-[r2]-(l)

The above is preprocessed into

    MATCH (n)-[r1]-(m), (m)-[r2]-(l)

Then we look at each triplet in order and perform the described steps. This
way, we would generate:

    ScanAll (n) > Expand (n, r1, m) > Expand (m, r2, l) >
    ExpandUniquenessFilter ([r1], r2)

Note that we don't need to make `ExpandUniquenessFilter` after the first
`Expand`, since there are no edges to compare to. This filtering needs to work
across multiple patterns, but inside a *single* `MATCH` clause.

Let's take a look at the following.

    MATCH (n)-[r1]-(m), (m)-[r2]-(l)

We would also generate the exact same operators.

    ScanAll (n) > Expand (n, r1, m) > Expand (m, r2, l) >
    ExpandUniquenessFilter ([r1], r2)

On the other hand,

    MATCH (n)-[r1]-(m) MATCH (m)-[r2]-(l)-[r3]-(i)

We would reset the uniqueness filtering at the start of the second match. This
would mean that we output the following:

    ScanAll (n) > Expand (n, r1, m) > Expand (m, r2, l) > Expand (l, r3, i) >
    ExpandUniquenessFilter ([r2], r3)
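
The rules above can be condensed into a short Python sketch that reproduces
the operator chains from the examples. It is a simplification under stated
assumptions: operators are rendered as text, the function name `plan_matches`
is hypothetical, and cases such as an already bound end node or a repeated
edge symbol are not handled.

```python
def plan_matches(matches):
    """matches: one list of (start, edge, end) triplets per MATCH clause."""
    bound, operators = set(), []
    for triplets in matches:
        previous_edges = []  # uniqueness is reset for every MATCH clause
        for start, edge, end in triplets:
            if start not in bound:
                operators.append(f"ScanAll ({start})")
                bound.add(start)
            operators.append(f"Expand ({start}, {edge}, {end})")
            bound.add(end)
            if previous_edges:
                others = ", ".join(previous_edges)
                operators.append(f"ExpandUniquenessFilter ([{others}], {edge})")
            previous_edges.append(edge)
    return " > ".join(operators)


# MATCH (n)-[r1]-(m), (m)-[r2]-(l)
single = plan_matches([[("n", "r1", "m"), ("m", "r2", "l")]])
# MATCH (n)-[r1]-(m) MATCH (m)-[r2]-(l)-[r3]-(i)
double = plan_matches([[("n", "r1", "m")], [("m", "r2", "l"), ("l", "r3", "i")]])
print(single)
print(double)
```
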
There is a difference in how we handle edge uniqueness compared to Neo4j.
Neo4j does not allow searching for a single edge multiple times, but we've
decided to support that.

For example, the user can say the following.

    MATCH (n)-[r]-(m)-[r]-(l)

We would ensure that both `r` variables match to the same edge. In our
terminology, we call this the *edge cycle*. For the above example, we would
generate this plan.

    ScanAll (n) > Expand (n, r, m) > Expand (m, r, l)

We do not put an `ExpandUniquenessFilter` operator between the 2 `Expand`
operators and we tell the 2nd `Expand` that it is an edge cycle. The 2nd
`Expand` will then ensure that both occurrences of `r` matched the same edge.
##### Filtering
To narrow the search down, the patterns in `MATCH` can have filtered labels
and properties. A more general filtering is done using the accompanying
`WHERE` clause. During the preprocess step, all filters are collected and
extracted into expressions. Additional information on which symbols are used
is also stored. This way, each time we generate a `ScanAll` or `Expand`, we
look at all the filters to see if any of them can be used, i.e. if the symbols
they use have been bound by a newly produced operator. If a filter expression
can be used, we immediately add a `Filter` operator with that expression.

For example.

    MATCH (n)-[r]-(m :label) WHERE n.prop = 42

We would produce:

    ScanAll (n) > Filter (n.prop) > Expand (n, r, m) > Filter (m :label)

This means that the same plan is generated for the query:

    MATCH (n {prop: 42})-[r]-(m :label)
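
A small Python sketch of this filter placement is given below. The data layout
(each operator paired with the symbols it binds, each filter paired with the
symbols it uses) and the name `plan_with_filters` are assumptions chosen for
the example.

```python
def plan_with_filters(operators, filters):
    """Emit each filter right after the operator that binds its last symbol.

    operators: list of (operator text, symbols bound by it)
    filters:   list of (filter expression text, symbols used by it)
    """
    bound, pending, plan = set(), list(filters), []
    for text, new_symbols in operators:
        plan.append(text)
        bound |= set(new_symbols)
        for candidate in list(pending):
            expression, used_symbols = candidate
            if set(used_symbols) <= bound:
                plan.append(f"Filter ({expression})")
                pending.remove(candidate)
    return " > ".join(plan)


# MATCH (n)-[r]-(m :label) WHERE n.prop = 42
plan = plan_with_filters(
    [("ScanAll (n)", ["n"]), ("Expand (n, r, m)", ["r", "m"])],
    [("m :label", ["m"]), ("n.prop", ["n"])],
)
print(plan)
```
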
#### OPTIONAL
If a `MATCH` clause is preceded by `OPTIONAL`, then we need to generate a plan
such that we produce results even if we fail to match anything. This is
accomplished by generating an `Optional` operator, which takes 2 operator
trees:

  * input operation and
  * optional operation.

The input is the operation we generated for the part of the query before
`OPTIONAL MATCH`. For the optional operation, we simply generate the `OPTIONAL
MATCH` part just like we would for regular `MATCH`. In addition to operations,
we need to send the symbols which are set during optional matching to the
`Optional` operator. The operator will reset values of those symbols to
`null` when the optional part fails to match.
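
The behaviour of `Optional` can be sketched with Python generators, with
frames modelled as dictionaries and `null` as `None`. The function signature
and the toy `expand` operation are assumptions made for this illustration
only.

```python
def optional(input_rows, optional_op, optional_symbols):
    """Yield matches of optional_op, or the row with its symbols set to None."""
    for row in input_rows:
        matched = False
        for extended in optional_op(row):
            matched = True
            yield extended
        if not matched:
            yield {**row, **{symbol: None for symbol in optional_symbols}}


# Hypothetical adjacency data: node "a" has a neighbour, node "c" has none.
neighbours = {"a": ["b"], "c": []}


def expand(row):
    for m in neighbours[row["n"]]:
        yield {**row, "m": m}


# MATCH (n) OPTIONAL MATCH (n)--(m)
rows = list(optional([{"n": "a"}, {"n": "c"}], expand, ["m"]))
print(rows)
```
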
#### RETURN & WITH
`RETURN` and `WITH` clauses are very similar to each other. The only
difference is that `WITH` separates parts of the query and can be paired with
a `WHERE` clause.

The common part is generating operators for the body of the clause. Separation
of query parts is mostly done in semantic analysis, which checks that only the
symbols exposed through `WITH` are visible in the query parts after the
clause. Only a minor part is done in planning.
##### Named Results
Both clauses contain multiple named expressions (`expr AS name`) which are
used to generate a `Produce` operator.
##### Aggregations
If an expression contains an aggregation operator (`sum`, `avg`, ...) we need
to plan the `Aggregate` operator as input to `Produce`. This case is more
complex, because aggregation in openCypher can perform implicit grouping of
results used for aggregation.

For example, `WITH/RETURN sum(n.x) AS s, n.y AS group` will implicitly group
by the `n.y` expression.

Another, more obscure grouping can be achieved with
`RETURN sum(n.a) + n.b AS s`. Here, the `n.b` will be used for grouping, even
though both the `sum` and `n.b` are in the same named expression.

Therefore, we need to collect all expressions which do not contain
aggregations and use them for grouping. You may have noticed that in the last
example `sum` is actually a sub-expression of `+`. The `Aggregate` operator
does not see that (nor should it), so the responsibility of evaluating that
falls on `Produce`. One way is for `Aggregate` to store results of grouping
expressions on the frame in addition to aggregation results. Unfortunately,
this would require rewiring named expressions in `Produce` to reference
already evaluated expressions. In the current implementation, we opted for
`Aggregate` to store only aggregation results on the frame, while `Produce`
will re-evaluate all the other (grouping) expressions. To handle that, symbols
which are used in expressions are passed to `Aggregate`, so that they can be
remembered. `Produce` will read those symbols from the frame and use them to
re-evaluate the needed expressions.
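
Collecting the grouping expressions can be sketched in Python as follows.
Expressions are encoded as nested tuples `(operator, arguments...)` with
strings as leaves; this encoding, the `AGGREGATIONS` set and both function
names are assumptions for the sake of the example and do not mirror the
planner's AST visitor.

```python
AGGREGATIONS = {"sum", "avg", "count", "min", "max"}


def contains_aggregation(expr):
    if isinstance(expr, str):
        return False
    op, *args = expr
    return op in AGGREGATIONS or any(contains_aggregation(a) for a in args)


def grouping_expressions(expr):
    """Return the largest sub-expressions of expr that contain no aggregation."""
    if not contains_aggregation(expr):
        return [expr]
    op, *args = expr
    if op in AGGREGATIONS:
        return []  # arguments of an aggregation are aggregated, not grouped by
    return [group for arg in args for group in grouping_expressions(arg)]


# RETURN sum(n.a) + n.b AS s  groups by n.b
print(grouping_expressions(("+", ("sum", "n.a"), "n.b")))
# RETURN sum(n.x) AS s, n.y AS group  groups by n.y
print(grouping_expressions(("sum", "n.x")) + grouping_expressions("n.y"))
```
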
##### Accumulation
After we have `Produce` and potentially `Aggregate`, we need to handle a
special case when the part of the query before `RETURN` or `WITH` performs
updates. For that, we want to run that part of the query fully, so that we get
the latest results. This is accomplished by adding an `Accumulate` operator as
input to `Aggregate` or `Produce` (if there is no aggregation). Accumulation
will store all the values for all the used symbols inside `RETURN` and `WITH`,
so that they can be used in the operator which follows. This way, only parts
of the frame are copied, instead of the whole frame. There is one minor
difference between planning `WITH` and `RETURN`. Since `WITH` can
separate writing from reading, we need to advance the transaction command.
This enables the later, read parts of the query to obtain the newest changes.
This is supported by passing the `advance_command` flag to the `Accumulate`
operator.

In the simplest case, common to both clauses, we have
`Accumulate > Aggregate > Produce` operators, where `Accumulate` and
`Aggregate` may be left out.
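
The effect of `Accumulate` can be shown with a Python generator sketch. Frames
are dictionaries, and `advance_command` is modelled here as an optional
callback rather than the boolean flag mentioned above; both choices are
assumptions of this example.

```python
def accumulate(input_rows, symbols, advance_command=None):
    """Exhaust the input, keep only the used symbols, then yield the rows."""
    cache = [{s: row[s] for s in symbols} for row in input_rows]
    if advance_command is not None:
        advance_command()  # make the writes visible to later query parts
    yield from cache


log = []


def updating_input():
    # Stands in for the updating part of the query before WITH or RETURN.
    for n in (1, 2, 3):
        log.append(f"update {n}")
        yield {"n": n, "tmp": n * 10}


rows = accumulate(updating_input(), ["n"], lambda: log.append("advance"))
first = next(rows)
# All updates ran before the first row was produced.
print(first, log)
```
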
##### Ordering
Planning `ORDER BY` is simple enough. Since it may see new symbols (filled in
`Produce`), we add the `OrderBy` operator at the end. The operator will change
the order of produced results, so we pass it the ordering expressions and the
output symbols of named expressions.
##### Filtering
A final difference in `WITH` is when it contains a `WHERE` clause. For that,
we simply generate the `Filter` operator, appended after `Produce` or
`OrderBy` (depending on which operator is last).
##### Skipping and Limiting
If we have `SKIP` or `LIMIT`, we generate `Skip` or `Limit` operators,
respectively. These operators are put at the end of the clause.

This placement may have some unexpected behaviour when combined with
operations that update the graph. For example.

    MATCH (n) SET n.x = n.x + 1 RETURN n LIMIT 1

The above query may be interpreted as if the `SET` will be done only once.
Since this is a write query, we need to accumulate results, so the part before
`RETURN` will execute completely. The accumulated results will be yielded up
to the given limit, and the user would get only the first `n` that was
updated. This may confuse the user because in reality, every node in the
database has been updated.

Note that `Skip` always comes before `Limit`. In the current implementation,
they are generated directly one after the other.
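
The surprising interaction between updates and `LIMIT` can be reproduced with
a few lines of Python. This is a toy model, not the executor: nodes are
dictionaries, the write part is hardcoded to `SET n.x = n.x + 1`, and the
function name `run_query` is hypothetical.

```python
from itertools import islice


def run_query(nodes, skip=0, limit=None):
    """Model of MATCH (n) SET n.x = n.x + 1 RETURN n SKIP skip LIMIT limit."""
    accumulated = []
    for node in nodes:  # Accumulate: the write part runs for every node
        node["x"] += 1
        accumulated.append(node)
    # Skip is applied first, then Limit.
    stop = None if limit is None else skip + limit
    return list(islice(accumulated, skip, stop))


nodes = [{"x": 0}, {"x": 10}, {"x": 20}]
result = run_query(nodes, limit=1)
print(result)  # only one row is returned
print(nodes)   # but every node was updated
```
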
#### CREATE
The `CREATE` clause is used to create nodes and edges (relationships).

For multiple `CREATE` clauses or multiple creation patterns in a single
clause, we perform the same steps, which are described below.
##### Creating a Single Node
A node is created by simply specifying a node pattern.

For example `CREATE (n :label {property: "value"}), ()` would create 2 nodes.
The 1st one would be created with a label and a property. This node could be
referenced later in the query, by using the variable `n`. The 2nd node cannot
be referenced and it would be created without any labels or properties. For
node creation, we generate a `CreateNode` operator and pass it all the details
of node creation: variable symbol, labels and properties. In the mentioned
example, we would have `CreateNode > CreateNode`.
##### Creating a Relationship
To create a relationship, the `CREATE` clause must contain a pattern with a
directed edge. Compared to creating a single node, this case is a bit more
complicated, because either side of the edge may not exist. By exist, we mean
that the endpoint is a variable which already references a node.

For example, `MATCH (n) CREATE (n)-[r]->(m)` would create an edge `r` and a
node `m` for each matched node `n`. If we focus on the `CREATE` part, we
generate `CreateExpand (n, r, m)` where `n` already exists (refers to matched
node) and `m` would be newly created along with edge `r`. If we had only
`CREATE (n) -[r]-> (m)`, then we would need to create both nodes of the edge
`r`. This is done by generating `CreateNode (n) > CreateExpand (n, r, m)`. The
final case is when both endpoints refer to an existing node. For example, when
adding a node with a cyclical connection `CREATE (n)-[r]->(n)`. In this case,
we would generate `CreateNode (n) > CreateExpand (n, r, n)`. We would tell
`CreateExpand` to only create the edge `r` between the already created `n`.
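
The three cases can be sketched in Python. Operators are rendered as text, the
name `plan_create` and its `bound` parameter (symbols that already exist
before the `CREATE`) are assumptions of this example, and anonymous nodes are
left out for brevity.

```python
def plan_create(patterns, bound=()):
    """patterns: node symbols or (start, edge, end) triplets to create."""
    bound, operators = set(bound), []
    for pattern in patterns:
        if isinstance(pattern, str):  # a single node pattern
            operators.append(f"CreateNode ({pattern})")
            bound.add(pattern)
            continue
        start, edge, end = pattern
        if start not in bound:
            operators.append(f"CreateNode ({start})")
            bound.add(start)
        # CreateExpand creates the edge, and the end node unless it is bound.
        operators.append(f"CreateExpand ({start}, {edge}, {end})")
        bound.update((edge, end))
    return " > ".join(operators)


print(plan_create([("n", "r", "m")], bound=["n"]))  # MATCH (n) CREATE (n)-[r]->(m)
print(plan_create([("n", "r", "m")]))               # CREATE (n)-[r]->(m)
print(plan_create([("n", "r", "n")]))               # CREATE (n)-[r]->(n)
```
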
#### MERGE
Although the merge operation is complex, planning turns out to be relatively
simple. The pattern inside the `MERGE` clause is used for both matching and
creating. Therefore, we create 2 operator trees, one for each action.

For example.

    MERGE (n)-[r:r]-(m)

We would generate a single `Merge` operator which has the following.

  * No input operation (since it is not preceded by any other clause).
  * On match operation

    `ScanAll (n) > Expand (n, r, m) > Filter (r)`

  * On create operation

    `CreateNode (n) > CreateExpand (n, r, m)`

In cases when `MERGE` contains `ON MATCH` and `ON CREATE` parts, we simply
append their operations to the respective operator trees.

Observe the following example.

    MERGE (n)-[r:r]-(m) ON MATCH SET n.x = 42 ON CREATE SET m :label

The `Merge` would be generated with the following.

  * No input operation (again, since there is no clause preceding it).
  * On match operation

    `ScanAll (n) > Expand (n, r, m) > Filter (r) > SetProperty (n.x, 42)`

  * On create operation

    `CreateNode (n) > CreateExpand (n, r, m) > SetLabels (m, :label)`

When we have preceding clauses, we simply put their operator as input to
`Merge`.

    MATCH (n) MERGE (n)-[r:r]-(m)

The above would be generated as

    ScanAll (n) > Merge (on_match_operation, on_create_operation)

Here we need to be careful to recognize which symbols are already declared.
But, since the `on_match_operation` uses the same algorithm as generating a
`Match`, that problem is handled there. The same should hold for
`on_create_operation`, which uses the process of generating a `Create`. So,
finally for this example, the `Merge` would have:

  * Input operation

    `ScanAll (n)`

  * On match operation

    `Expand (n, r, m) > Filter (r)`

    Note that `ScanAll` is not needed since we get the nodes from input.

  * On create operation

    `CreateExpand (n, r, m)`

    Note that `CreateNode` is dropped, since we want to expand the existing one.
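
How a `Merge` operator might combine its three operations at execution time
can be sketched with Python generators. This is an illustration under
assumptions, not the actual operator: frames are dictionaries, the graph is a
set of directed `(start, end)` pairs, and `on_match` and `on_create` are toy
stand-ins for the two operator trees.

```python
def merge(input_rows, on_match, on_create):
    """For each input row, yield the matches, or create when nothing matched."""
    for row in input_rows:
        matched = False
        for result in on_match(row):
            matched = True
            yield result
        if not matched:
            yield from on_create(row)


graph = {("a", "b")}  # hypothetical edge set


def on_match(row):  # stands in for: Expand (n, r, m) > Filter (r)
    for start, end in sorted(graph):
        if start == row["n"]:
            yield {**row, "m": end}


def on_create(row):  # stands in for: CreateExpand (n, r, m)
    new_node = f"new_{row['n']}"
    graph.add((row["n"], new_node))
    yield {**row, "m": new_node}


# MATCH (n) MERGE (n)-[r:r]-(m), with the input already producing n
rows = list(merge([{"n": "a"}, {"n": "c"}], on_match, on_create))
print(rows)
```
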
## Logical Plan Postprocessing
NOTE: TODO
## Cost Estimation
NOTE: TODO
## Distributed Planning
NOTE: TODO