6f10b1c115
Summary: Our query parsing, planning and execution architecture was described on Phabricator wiki pages, Phriction. This commit copies the said documentation here, so that it's easier to access for all developers. Additional benefit is tracking the changes and hopefully suggesting to developers to keep it up to date. Besides making a copy, the documentation has been updated to reflect the current state of the codebase. Note that some things are still missing, but what was written should now be correct. Reviewers: mtomic, llugovic Reviewed By: mtomic Subscribers: pullbot Differential Revision: https://phabricator.memgraph.io/D1854
488 lines
20 KiB
Markdown
488 lines
20 KiB
Markdown
# Logical Planning
|
|
|
|
After the semantic analysis and symbol generation, the AST is converted to a
|
|
tree of logical operators. This conversion is called *planning* and the tree
|
|
of logical operators is called a *plan*. The whole planning process is done in
|
|
the following steps.
|
|
|
|
1. [AST Preprocessing](#ast-preprocessing)
|
|
|
|
The first step is to preprocess the AST by collecting
|
|
information on filters, divide the query into parts, normalize patterns
|
|
in `MATCH` clauses, etc.
|
|
|
|
2. [Logical Operator Planning](#logical-operator-planning)
|
|
|
|
After the preprocess step, the planning can be done via 2 planners:
|
|
`VariableStartPlanner` and `RuleBasedPlanner`. The first planner will
|
|
generate multiple plans where each plan has different starting points for
|
|
searching the patterns in `MATCH` clauses. The second planner produces a
|
|
single plan by mapping the query parts as they are to logical operators.
|
|
|
|
3. [Logical Plan Postprocessing](#logical-plan-postprocessing)
|
|
|
|
In this stage, we perform various transformations on the generated logical
|
|
plan. Here we want to optimize the operations in order to improve
|
|
performance during the execution. Naturally, transformations need to
|
|
preserve the semantic behaviour of the original plan.
|
|
|
|
4. [Cost Estimation](#cost-estimation)
|
|
|
|
After the generation, the execution cost of each plan is estimated. This
|
|
estimation is used to select the best plan which will be executed.
|
|
|
|
5. [Distributed Planning](#distributed-planning)
|
|
|
|
In case we are running distributed Memgraph, the final plan is adapted
|
|
for distributed execution. NOTE: This appears to be an error in the
|
|
workflow. Distributed planning should be moved before step 3. or
|
|
integrated with it. With the workflow ordered as is now, cost estimation
|
|
doesn't consider the distributed plan.
|
|
|
|
The implementation can be found in the `query/plan` directory, with the public
|
|
entry point being `query/plan/planner.hpp`.
|
|
|
|
## AST Preprocessing
|
|
|
|
Each openCypher query consists of at least 1 **single query**. Multiple single
|
|
queries are chained together using a **query combinator**. Currently, there is
|
|
only one combinator, `UNION`. The preprocessing step starts in the
|
|
`CollectQueryParts` function. This function will take a look at each single
|
|
query and divide it into parts. Each part is separated with `RETURN` and
|
|
`WITH` clauses. For example:
|
|
|
|
MATCH (n) CREATE (m) WITH m MATCH (l)-[]-(m) RETURN l
|
|
| | |
|
|
|------- part 1 -----------+-------- part 2 --------|
|
|
| |
|
|
|-------------------- single query -----------------|
|
|
|
|
Each part is created by collecting all `MATCH` clauses and *normalizing* their
|
|
patterns. Pattern normalization is the process of converting an arbitrarily
|
|
long pattern chain of nodes and edges into a list of triplets `(start node,
|
|
edge, end node)`. The triplets should preserve the semantics of the match. For
|
|
example:
|
|
|
|
MATCH (a)-[p]-(b)-[q]-(c)-[r]-(d)
|
|
|
|
is equivalent to:
|
|
|
|
MATCH (a)-[p]-(b), (b)-[q]-(c), (c)-[r]-(d)
|
|
|
|
With this representation, it becomes easier to reorder the triplets and choose
|
|
different strategies for pattern matching.
|
|
|
|
In addition to normalizing patterns, all of the filter expressions in patterns
|
|
and inside of the `WHERE` clause (of the accompanying `MATCH`) are extracted
|
|
and stored separately. During the extraction, symbols used in the filter
|
|
expression are collected. This allows for planning filters in a valid order,
|
|
as the matching for triplets is being done. Another important benefit of
|
|
having extra information on filters, is to recognize when a database index
|
|
could be used.
|
|
|
|
After each `MATCH` is processed, they are all grouped, so that even the whole
|
|
`MATCH` clauses may be reordered. The important thing is to remember which
|
|
symbols were used to name edges in each `MATCH`. With those symbols we can
|
|
plan for *cyphermorphism*, i.e. ensure different edges in the search pattern
|
|
of a single `MATCH` map to different edges in the graph. This preserves the
|
|
semantic of the query, even though we may have reordered the matching. The
|
|
same steps are done for `OPTIONAL MATCH`.
|
|
|
|
Another clause which needs processing is `MERGE`. Here we normalize the
|
|
pattern, since the `MERGE` is a bit like `MATCH` and `CREATE` in one.
|
|
|
|
All the other clauses are left as is.
|
|
|
|
In the end, each query part consists of:
|
|
|
|
* processed and grouped `MATCH` clauses;
|
|
* processed and grouped `OPTIONAL MATCH` clauses;
|
|
* processed `MERGE` matching pattern and
|
|
* unchanged remaining clauses.
|
|
|
|
The last stored clause is guaranteed to be either `WITH` or `RETURN`.
|
|
|
|
## Logical Operator Planning
|
|
|
|
### Variable Start Planner
|
|
|
|
The `VariableStartPlanner` generates multiple plans for a single query. Each
|
|
plan is generated by selecting a different starting point for pattern
|
|
matching.
|
|
|
|
The algorithm works as follows.
|
|
|
|
1. For each query part:
|
|
1. For each node in triplets of collected `MATCH` clauses:
|
|
i. Add the node to a set of `expanded` nodes
|
|
ii. Select a triplet `(start node, edge, end node)` whose `start node` is
|
|
in the `expanded` set
|
|
iii. If no triplet was selected, choose a new starting node that isn't in
|
|
`expanded` and continue expanding
|
|
iv. Repeat steps ii. -- iii. until all triplets have been selected
|
|
and store that as a variation of the `MATCH` clauses
|
|
2. Do step 1.1. for `OPTIONAL MATCH` and `MERGE` clauses
|
|
3. Take all combinations of the generated `MATCH`, `OPTIONAL MATCH` and
|
|
`MERGE` and store them as variations of the query part.
|
|
2. For each combination of query part variations:
|
|
1. Generate a plan using the rule based planner
|
|
|
|
### Rule Based Planner
|
|
|
|
The `RuleBasedPlanner` generates a single plan for a single query. A plan is
|
|
generated by following hardcoded rules for producing logical operators. The
|
|
following sections are an overview on how each openCypher clause is converted
|
|
to a `LogicalOperator`.
|
|
|
|
#### MATCH
|
|
|
|
`MATCH` clause is used to specify which patterns need to be searched for in
|
|
the database. These patterns are normalized in the preprocess step to be
|
|
represented as triplets `(start node, edge, end node)`. When there is no edge,
|
|
then the triplet is reduced only to the `start node`. Generating the operators
|
|
is done by looping over these triplets.
|
|
|
|
##### Searching for Nodes
|
|
|
|
The simplest search is finding stand alone nodes. For example, `MATCH (n)`
|
|
will find all the nodes in the graph. This is accomplished by generating a
|
|
`ScanAll` operator and forwarding the node symbol which should store the
|
|
results. In this case, all the nodes will be referenced by `n`.
|
|
|
|
Multiple nodes can be specified in a single match, e.g. `MATCH (n), (m)`.
|
|
Planning is done by repeating the same steps for each sub pattern (separated
|
|
by a comma). In this case, we would get 2 `ScanAll` operators chained one
|
|
after the other. An optimization can be obtained if the node in the pattern is
|
|
already searched for. In `MATCH (n), (n)` we can drop the second `ScanAll`
|
|
operator since we have already generated it for the first node.
|
|
|
|
##### Searching for Relationships
|
|
|
|
A more advanced search includes finding nodes with relationships. For example,
|
|
`MATCH (n)-[r]-(m)` should find every pair of connected nodes in the database.
|
|
This means, that if a single node has multiple connections, it will be
|
|
repeated for each combination of pairs. The generation of operators starts
|
|
from the first node in the pattern. If we are referencing a new starting node,
|
|
, we need to generate a `ScanAll` which finds all the nodes and stores them
|
|
into `n`. Then, we generate an `Expand` operator which reads the `n` and
|
|
traverses all the edges of that node. The edge is stored into `r`, while the
|
|
destination node is stored in `m`.
|
|
|
|
Matching multiple relationships proceeds similarly, by repeating the same
|
|
steps. The only difference is that we need to ensure different edges in the
|
|
search pattern, map to different edges in the graph. This means that after
|
|
each `Expand` operator, we need to generate an `ExpandUniquenessFilter`. We
|
|
provide this operator with a list of symbols for the previously matched edges
|
|
and the symbol for the current edge.
|
|
|
|
For example.
|
|
|
|
MATCH (n)-[r1]-(m)-[r2]-(l)
|
|
|
|
The above is preprocessed into
|
|
|
|
MATCH (n)-[r1]-(m), (m)-[r2]-(l)
|
|
|
|
Then we look at each triplet in order and perform the described steps. This
|
|
way, we would generate:
|
|
|
|
ScanAll (n) > Expand (n, r1, m) > Expand (m, r2, l) >
|
|
ExpandUniquenessFilter ([r1], r2)
|
|
|
|
Note that we don't need to make `ExpandUniquenessFilter` after the first
|
|
`Expand`, since there are no edges to compare to. This filtering needs to work
|
|
across multiple pattern, but inside a *single* `MATCH` clause.
|
|
|
|
Let's take a look at the following.
|
|
|
|
MATCH (n)-[r1]-(m), (m)-[r2]-(l)
|
|
|
|
We would also generate the exact same operators.
|
|
|
|
ScanAll (n) > Expand (n, r1, m) > Expand (m, r2, l) >
|
|
ExpandUniquenessFilter ([r1], r2)
|
|
|
|
On the other hand,
|
|
|
|
MATCH (n)-[r1]-(m) MATCH (m)-[r2]-(l)-[r3]-(i)
|
|
|
|
We would reset the uniqueness filtering at the start of the second match. This
|
|
would mean that we output the following:
|
|
|
|
ScanAll (n) > Expand (n, r1, m) > Expand (m, r2, l) > Expand (l, r3, i) >
|
|
ExpandUniquenessFilter ([r2], r3)
|
|
|
|
There is a difference in how we handle edge uniqueness compared to Neo4j.
|
|
Neo4j does not allow searching for a single edge multiple times, but we've
|
|
decided to support that.
|
|
|
|
For example, the user can say the following.
|
|
|
|
MATCH (n)-[r]-(m)-[r]-l
|
|
|
|
We would ensure that both `r` variables match to the same edge. In our
|
|
terminology, we call this the *edge cycle*. For the above example, we would
|
|
generate this plan.
|
|
|
|
ScanAll (n) > Expand (n, r, m) > Expand (m, r, l)
|
|
|
|
We do not put an `ExpandUniquenessFilter` operator between 2 `Expand`
|
|
operators and we tell the 2nd `Expand` that it is an edge cycle. This, 2nd
|
|
`Expand` will ensure we have matched both the same edges.
|
|
|
|
##### Filtering
|
|
|
|
To narrow the search down, the patterns in `MATCH` can have filtered labels
|
|
and properties. A more general filtering is done using the accompanying
|
|
`WHERE` clause. During the preprocess step, all filters are collected and
|
|
extracted into expressions. Additional information on which symbols are used
|
|
is also stored. This way, each time we generate a `ScanAll` or `Expand`, we
|
|
look at all the filters to see if any of them can be used. I.e. if the symbols
|
|
they use have been bound by a newly produced operator. If a filter expression
|
|
can be used, we immediately add a `Filter` operator with that expression.
|
|
|
|
For example.
|
|
|
|
MATCH (n)-[r]-(m :label) WHERE n.prop = 42
|
|
|
|
We would produce:
|
|
|
|
ScanAll (n) > Filter (n.prop) > Expand (n, r, m) > Filter (m :label)
|
|
|
|
This means that the same plan is generated for the query:
|
|
|
|
MATCH (n {prop: 42})-[r]-(m :label)
|
|
|
|
#### OPTIONAL
|
|
|
|
If a `MATCH` clause is preceded by `OPTIONAL`, then we need to generate a plan
|
|
such that we produce results even if we fail to match anything. This is
|
|
accomplished by generating an `Optional` operator, which takes 2 operator
|
|
trees:
|
|
|
|
* input operation and
|
|
* optional operation.
|
|
|
|
The input is the operation we generated for the part of the query before
|
|
`OPTIONAL MATCH`. For the optional operation, we simply generate the `OPTIONAL
|
|
MATCH` part just like we would for regular `MATCH`. In addition to operations,
|
|
we need to send the symbols which are set during optional matching to the
|
|
`Optional` operator. The operator will reset values of those symbols to
|
|
`null`, when the optional part fails to match.
|
|
|
|
#### RETURN & WITH
|
|
|
|
`RETURN` and `WITH` clauses are very similar to each other. The only
|
|
difference is that `WITH` separates parts of the query and can be paired with
|
|
`WHERE` clause.
|
|
|
|
The common part is generating operators for the body of the clause. Separation
|
|
of query parts is mostly done in semantic analysis, which checks that only the
|
|
symbols exposed through `WITH` are visible in the query parts after the
|
|
clause. The minor part is done in planning.
|
|
|
|
##### Named Results
|
|
|
|
Both clauses contain multiple named expressions (`expr AS name`) which are
|
|
used to generate `Produce` operator.
|
|
|
|
##### Aggregations
|
|
|
|
If an expression contains an aggregation operator (`sum`, `avg`, ...) we need
|
|
to plan the `Aggregate` operator as input to `Produce`. This case is more
|
|
complex, because aggregation in openCypher can perform implicit grouping of
|
|
results used for aggregation.
|
|
|
|
For example, `WITH/RETURN sum(n.x) AS s, n.y AS group` will implicitly group
|
|
by `n.y` expression.
|
|
|
|
Another, obscure grouping can be achieved with `RETURN sum(n.a) + n.b AS s`.
|
|
Here, the `n.b` will be used for grouping, even though both the `sum` and
|
|
`n.b` are in the same named expression.
|
|
|
|
Therefore, we need to collect all expressions which do not contain
|
|
aggregations and use them for grouping. You may have noticed that in the last
|
|
example `sum` is actually a sub-expression of `+`. `Aggregate` operator does
|
|
not see that (nor it should), so the responsibility of evaluating that falls
|
|
on `Produce`. One way is for `Aggregate` to store results of grouping
|
|
expressions on the frame in addition to aggregation results. Unfortunately,
|
|
this would require rewiring named expressions in `Produce` to reference
|
|
already evaluated expressions. In the current implementation, we opted for
|
|
`Aggregate` to store only aggregation results on the frame, while `Produce`
|
|
will re-evaluate all the other (grouping) expressions. To handle that, symbols
|
|
which are used in expressions are passed to `Aggregate`, so that they can be
|
|
remembered. `Produce` will read those symbols from the frame and use it to
|
|
re-evaluate the needed expressions.
|
|
|
|
##### Accumulation
|
|
|
|
After we have `Produce` and potentially `Aggregate`, we need to handle a
|
|
special case when the part of the query before `RETURN` or `WITH` performs
|
|
updates. For that, we want to run that part of the query fully, so that we get
|
|
the latest results. This is accomplished by adding `Accumulate` operator as
|
|
input to `Aggregate` or `Produce` (if there is no aggregation). Accumulation
|
|
will store all the values for all the used symbols inside `RETURN` and `WITH`,
|
|
so that they can be used in the operator which follows. This way, only parts
|
|
of the frame are copied, instead of the whole frame. Here is a minor
|
|
difference between planning `WITH`, compared to `RETURN`. Since `WITH` can
|
|
separate writing from reading, we need to advance the transaction command.
|
|
This enables the later, read parts of the query to obtain the newest changes.
|
|
This is supported by passing `advance_command` flag to `Accumulate` operator.
|
|
|
|
In the simplest case, common to both clauses, we have `Accumulate > Aggregate
|
|
> Produce` operators, where `Accumulate` and `Aggregate` may be left out.
|
|
|
|
##### Ordering
|
|
|
|
Planning `ORDER BY` is simple enough. Since it may see new symbols (filled in
|
|
`Produce`), we add the `OrderBy` operator at the end. The operator will change
|
|
the order of produced results, so we pass it the ordering expressions and the
|
|
output symbols of named expressions.
|
|
|
|
##### Filtering
|
|
|
|
A final difference in `WITH`, is when it contains a `WHERE` clause. For that,
|
|
we simply generate the `Filter` operator, appended after `Produce` or
|
|
`OrderBy` (depending which operator is last).
|
|
|
|
##### Skipping and Limiting
|
|
|
|
If we have `SKIP` or `LIMIT`, we generate `Skip` or `Limit` operators,
|
|
respectively. These operators are put at the end of the clause.
|
|
|
|
This placement may have some unexpected behaviour when combined with
|
|
operations that update the graph. For example.
|
|
|
|
MATCH (n) SET n.x = n.x + 1 RETURN n LIMIT 1
|
|
|
|
The above query may be interpreted as if the `SET` will be done only once.
|
|
Since this is a write query, we need to accumulate results, so the part before
|
|
`RETURN` will execute completely. The accumulated results will be yielded up
|
|
to the given limit, and the user would get only the first `n` that was
|
|
updated. This may confuse the user because in reality, every node in the
|
|
database had been updated.
|
|
|
|
Note that `Skip` always comes before `Limit`. In the current implementation,
|
|
they are generated directly one after the other.
|
|
|
|
#### CREATE
|
|
|
|
`CREATE` clause is used to create nodes and edges (relationships).
|
|
|
|
For multiple `CREATE` clauses or multiple creation patterns in a single
|
|
clause, we perform the same, following steps.
|
|
|
|
##### Creating a Single Node
|
|
|
|
A node is created by simply specifying a node pattern.
|
|
|
|
For example `CREATE (n :label {property: "value"}), ()` would create 2 nodes.
|
|
The 1st one would be created with a label and a property. This node could be
|
|
referenced later in the query, by using the variable `n`. The 2nd node cannot
|
|
be referenced and it would be created without any labels nor properties. For
|
|
node creation, we generate a `CreateNode` operator and pass it all the details
|
|
of node creation: variable symbol, labels and properties. In the mentioned
|
|
example, we would have `CreateNode > CreateNode`.
|
|
|
|
##### Creating a Relationship
|
|
|
|
To create a relationship, the `CREATE` clause must contain a pattern with a
|
|
directed edge. Compared to creating a single node, this case is a bit more
|
|
complicated, because either side of the edge may not exist. By exist, we mean
|
|
that the endpoint is a variable which already references a node.
|
|
|
|
For example, `MATCH (n) CREATE (n)-[r]->(m)` would create an edge `r` and a
|
|
node `m` for each matched node `n`. If we focus on the `CREATE` part, we
|
|
generate `CreateExpand (n, r, m)` where `n` already exists (refers to matched
|
|
node) and `m` would be newly created along with edge `r`. If we had only
|
|
`CREATE (n) -[r]-> (m)`, then we would need to create both nodes of the edge
|
|
`r`. This is done by generating `CreateNode (n) > CreateExpand(n, r, m)`. The
|
|
final case is when both endpoints refer to an existing node. For example, when
|
|
adding a node with a cyclical connection `CREATE (n)-[r]->(n)`. In this case,
|
|
we would generate `CreateNode (n) > CreateExpand (n, r, n)`. We would tell
|
|
`CreateExpand` to only create the edge `r` between the already created `n`.
|
|
|
|
#### MERGE
|
|
|
|
Although the merge operation is complex, planning turns out to be relatively
|
|
simple. The pattern inside the `MERGE` clause is used for both matching and
|
|
creating. Therefore, we create 2 operator trees, one for each action.
|
|
|
|
For example.
|
|
|
|
MERGE (n)-[r:r]-(m)
|
|
|
|
We would generated a single `Merge` operator which has the following.
|
|
|
|
* No input operation (since it is not preceded by any other clause).
|
|
|
|
* On match operation
|
|
|
|
`ScanAll (n) > Expand (n, r, m) > Filter (r)`
|
|
|
|
* On create operation
|
|
|
|
`CreateNode (n) > CreateExpand (n, r, m)`
|
|
|
|
In cases when `MERGE` contains `ON MATCH` and `ON CREATE` parts, we simply
|
|
append their operations to the respective operator trees.
|
|
|
|
Observe the following example.
|
|
|
|
MERGE (n)-[r:r]-(m) ON MATCH SET n.x = 42 ON CREATE SET m :label
|
|
|
|
The `Merge` would be generated with the following.
|
|
|
|
* No input operation (again, since there is no clause preceding it).
|
|
|
|
* On match operation
|
|
|
|
`ScanAll (n) > Expand (n, r, m) > Filter (r) > SetProperty (n.x, 42)`
|
|
|
|
* On create operation
|
|
|
|
`CreateNode (n) > CreateExpand (n, r, m) > SetLabels (n, :label)`
|
|
|
|
When we have preceding clauses, we simply put their operator as input to
|
|
`Merge`.
|
|
|
|
MATCH (n) MERGE (n)-[r:r]-(m)
|
|
|
|
The above would be generated as
|
|
|
|
ScanAll (n) > Merge (on_match_operation, on_create_operation)
|
|
|
|
Here we need to be careful to recognize which symbols are already declared.
|
|
But, since the `on_match_operation` uses the same algorithm for generating a
|
|
`Match`, that problem is handled there. The same should hold for
|
|
`on_create_operation`, which uses the process of generating a `Create`. So,
|
|
finally for this example, the `Merge` would have:
|
|
|
|
* Input operation
|
|
|
|
`ScanAll (n)`
|
|
|
|
* On match operation
|
|
|
|
`Expand (n, r, m) > Filter (r)`
|
|
|
|
Note that `ScanAll` is not needed since we get the nodes from input.
|
|
|
|
* On create operation
|
|
|
|
`CreateExpand (n, r, m)`
|
|
|
|
Note that `CreateNode` is dropped, since we want to expand the existing one.
|
|
|
|
## Logical Plan Postprocessing
|
|
|
|
NOTE: TODO
|
|
|
|
## Cost Estimation
|
|
|
|
NOTE: TODO
|
|
|
|
## Distributed Planning
|
|
|
|
NOTE: TODO
|