# Logical Planning

After the semantic analysis and symbol generation, the AST is converted to a
tree of logical operators. This conversion is called *planning* and the tree
of logical operators is called a *plan*. The whole planning process is done in
the following steps.

  1. [AST Preprocessing](#ast-preprocessing)

     The first step is to preprocess the AST by collecting information on
     filters, dividing the query into parts, normalizing patterns in `MATCH`
     clauses, etc.

  2. [Logical Operator Planning](#logical-operator-planning)

     After the preprocessing step, the planning can be done via 2 planners:
     `VariableStartPlanner` and `RuleBasedPlanner`. The first planner will
     generate multiple plans where each plan has different starting points for
     searching the patterns in `MATCH` clauses. The second planner produces a
     single plan by mapping the query parts as they are to logical operators.

  3. [Logical Plan Postprocessing](#logical-plan-postprocessing)

     In this stage, we perform various transformations on the generated
     logical plan. Here we want to optimize the operations in order to improve
     performance during the execution. Naturally, transformations need to
     preserve the semantic behaviour of the original plan.

  4. [Cost Estimation](#cost-estimation)

     After the generation, the execution cost of each plan is estimated. This
     estimation is used to select the best plan, which will be executed.

  5. [Distributed Planning](#distributed-planning)

     In case we are running distributed Memgraph, the final plan is adapted
     for distributed execution. NOTE: This appears to be an error in the
     workflow. Distributed planning should be moved before step 3 or
     integrated with it. With the workflow ordered as it is now, cost
     estimation doesn't consider the distributed plan.

The implementation can be found in the `query/plan` directory, with the public
entry point being `query/plan/planner.hpp`.

## AST Preprocessing

Each openCypher query consists of at least 1 **single query**. Multiple single
queries are chained together using a **query combinator**. Currently, there is
only one combinator, `UNION`. The preprocessing step starts in the
`CollectQueryParts` function. This function looks at each single query and
divides it into parts. Parts are delimited by `RETURN` and `WITH` clauses. For
example:

    MATCH (n) CREATE (m) WITH m MATCH (l)-[]-(m) RETURN l
    |                          |                        |
    |------- part 1 -----------+-------- part 2 --------|
    |                                                   |
    |-------------------- single query -----------------|

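The division into parts can be sketched in a few lines of Python. This is only an illustration and not the C++ code behind `CollectQueryParts`: the helper name `split_into_parts` is hypothetical, and clauses are modelled as plain strings instead of AST nodes.

```python
def split_into_parts(clauses):
    """Split a flat list of clause strings into parts closed by WITH or RETURN."""
    parts, current = [], []
    for clause in clauses:
        current.append(clause)
        # The first word of the clause is its keyword, e.g. "WITH" in "WITH m".
        if clause.split()[0] in ("WITH", "RETURN"):
            parts.append(current)
            current = []
    if current:  # clauses left over without a closing WITH/RETURN
        parts.append(current)
    return parts


single_query = ["MATCH (n)", "CREATE (m)", "WITH m", "MATCH (l)-[]-(m)", "RETURN l"]
parts = split_into_parts(single_query)
```

For the query in the diagram, `parts` holds two lists: the clauses up to and including `WITH m`, and the clauses up to and including `RETURN l`.
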
Each part is created by collecting all `MATCH` clauses and *normalizing* their
patterns. Pattern normalization is the process of converting an arbitrarily
long pattern chain of nodes and edges into a list of triplets `(start node,
edge, end node)`. The triplets should preserve the semantics of the match. For
example:

    MATCH (a)-[p]-(b)-[q]-(c)-[r]-(d)

is equivalent to:

    MATCH (a)-[p]-(b), (b)-[q]-(c), (c)-[r]-(d)

With this representation, it becomes easier to reorder the triplets and choose
different strategies for pattern matching.

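As a rough sketch of pattern normalization, assume a pattern chain is given as a flat Python list that alternates node and edge symbols. The function `normalize_pattern` and this list encoding are assumptions made for illustration; the real planner works on AST pattern objects.

```python
def normalize_pattern(chain):
    """Turn [node, edge, node, edge, node, ...] into (start, edge, end) triplets."""
    if len(chain) % 2 == 0:
        raise ValueError("a pattern chain must alternate node, edge, ..., node")
    if len(chain) == 1:
        # A stand-alone node: the triplet is reduced to its start node.
        return [(chain[0], None, None)]
    # Every edge sits at an odd index, between its two endpoint nodes.
    return [(chain[i], chain[i + 1], chain[i + 2]) for i in range(0, len(chain) - 2, 2)]


triplets = normalize_pattern(["a", "p", "b", "q", "c", "r", "d"])
```

Here `triplets` corresponds to `(a)-[p]-(b), (b)-[q]-(c), (c)-[r]-(d)`, with each inner node shared by two neighbouring triplets.
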
In addition to normalizing patterns, all of the filter expressions in patterns
and inside of the `WHERE` clause (of the accompanying `MATCH`) are extracted
and stored separately. During the extraction, symbols used in the filter
expression are collected. This allows for planning filters in a valid order,
as the matching for triplets is being done. Another important benefit of
having this extra information on filters is that we can recognize when a
database index could be used.

After each `MATCH` is processed, they are all grouped, so that even whole
`MATCH` clauses may be reordered. The important thing is to remember which
symbols were used to name edges in each `MATCH`. With those symbols we can
plan for *cyphermorphism*, i.e. ensure that different edges in the search
pattern of a single `MATCH` map to different edges in the graph. This
preserves the semantics of the query, even though we may have reordered the
matching. The same steps are done for `OPTIONAL MATCH`.

Another clause which needs processing is `MERGE`. Here we normalize the
pattern, since `MERGE` is a bit like `MATCH` and `CREATE` in one.

All the other clauses are left as is.

In the end, each query part consists of:

  * processed and grouped `MATCH` clauses;
  * processed and grouped `OPTIONAL MATCH` clauses;
  * processed `MERGE` matching pattern and
  * unchanged remaining clauses.

The last stored clause is guaranteed to be either `WITH` or `RETURN`.

## Logical Operator Planning

### Variable Start Planner

The `VariableStartPlanner` generates multiple plans for a single query. Each
plan is generated by selecting a different starting point for pattern
matching.

The algorithm works as follows.

  1. For each query part:
     1. For each node in triplets of collected `MATCH` clauses:
        i. Add the node to a set of `expanded` nodes
        ii. Select a triplet `(start node, edge, end node)` whose `start node` is
            in the `expanded` set
        iii. If no triplet was selected, choose a new starting node that isn't in
             `expanded` and continue expanding
        iv. Repeat steps ii. -- iii. until all triplets have been selected
            and store that as a variation of the `MATCH` clauses
     2. Do step 1.1. for `OPTIONAL MATCH` and `MERGE` clauses
     3. Take all combinations of the generated `MATCH`, `OPTIONAL MATCH` and
        `MERGE` and store them as variations of the query part.
  2. For each combination of query part variations:
     1. Generate a plan using the rule based planner

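The inner loop of the algorithm (step 1.1) might look like the following Python sketch. It is not the actual `VariableStartPlanner` code. The function name `orderings_by_start` is hypothetical, and the sketch makes one assumption the steps above do not spell out: a triplet whose *end* node is already expanded is flipped, so that the expanded node becomes its start. It also only handles full triplets, not stand-alone nodes.

```python
def orderings_by_start(triplets):
    """Return one ordering of (start, edge, end) triplets per starting node."""
    # All distinct nodes, in order of first appearance.
    nodes = list(dict.fromkeys(n for s, _e, t in triplets for n in (s, t)))
    variations = []
    for first in nodes:
        expanded, remaining, ordered = {first}, list(triplets), []
        while remaining:
            # Select a triplet that touches an already expanded node.
            pick = next((t for t in remaining if t[0] in expanded or t[2] in expanded), None)
            if pick is None:
                pick = remaining[0]  # disconnected pattern: choose a new starting node
            remaining.remove(pick)
            start, edge, end = pick
            if start not in expanded and end in expanded:
                start, end = end, start  # assumption: expand from the known side
            expanded.update((start, end))
            ordered.append((start, edge, end))
        variations.append(ordered)
    return variations


variations = orderings_by_start([("a", "p", "b"), ("b", "q", "c")])
```

For the pattern `(a)-[p]-(b)-[q]-(c)` this yields three variations, one starting the search at each of `a`, `b` and `c`. Each variation would then be handed to the rule based planner.
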
### Rule Based Planner

The `RuleBasedPlanner` generates a single plan for a single query. A plan is
generated by following hardcoded rules for producing logical operators. The
following sections give an overview of how each openCypher clause is converted
to a `LogicalOperator`.

#### MATCH

The `MATCH` clause specifies which patterns need to be searched for in the
database. These patterns are normalized in the preprocessing step and
represented as triplets `(start node, edge, end node)`. When there is no edge,
the triplet is reduced to just the `start node`. The operators are generated
by looping over these triplets.

##### Searching for Nodes

The simplest search is finding stand-alone nodes. For example, `MATCH (n)`
will find all the nodes in the graph. This is accomplished by generating a
`ScanAll` operator and forwarding the node symbol which should store the
results. In this case, all the nodes will be referenced by `n`.

Multiple nodes can be specified in a single match, e.g. `MATCH (n), (m)`.
Planning is done by repeating the same steps for each sub-pattern (separated
by a comma). In this case, we would get 2 `ScanAll` operators chained one
after the other. An optimization is possible when a node in the pattern has
already been searched for. In `MATCH (n), (n)` we can drop the second
`ScanAll` operator since we have already generated one for the first node.

##### Searching for Relationships

A more advanced search includes finding nodes with relationships. For example,
`MATCH (n)-[r]-(m)` should find every pair of connected nodes in the database.
This means that if a single node has multiple connections, it will be
repeated for each combination of pairs. The generation of operators starts
from the first node in the pattern. If we are referencing a new starting node,
we need to generate a `ScanAll` which finds all the nodes and stores them
into `n`. Then, we generate an `Expand` operator which reads `n` and
traverses all the edges of that node. The edge is stored into `r`, while the
destination node is stored in `m`.

Matching multiple relationships proceeds similarly, by repeating the same
steps. The only difference is that we need to ensure that different edges in
the search pattern map to different edges in the graph. This means that after
each `Expand` operator, we need to generate an `ExpandUniquenessFilter`. We
provide this operator with a list of symbols for the previously matched edges
and the symbol for the current edge.

For example:

    MATCH (n)-[r1]-(m)-[r2]-(l)

The above is preprocessed into

    MATCH (n)-[r1]-(m), (m)-[r2]-(l)

Then we look at each triplet in order and perform the described steps. This
way, we would generate:

    ScanAll (n) > Expand (n, r1, m) > Expand (m, r2, l) >
    ExpandUniquenessFilter ([r1], r2)

Note that we don't need to generate an `ExpandUniquenessFilter` after the
first `Expand`, since there are no edges to compare to. This filtering needs
to work across multiple patterns, but inside a *single* `MATCH` clause.

Let's take a look at the following.

    MATCH (n)-[r1]-(m), (m)-[r2]-(l)

We would also generate the exact same operators.

    ScanAll (n) > Expand (n, r1, m) > Expand (m, r2, l) >
    ExpandUniquenessFilter ([r1], r2)

On the other hand, for

    MATCH (n)-[r1]-(m) MATCH (m)-[r2]-(l)-[r3]-(i)

we would reset the uniqueness filtering at the start of the second match. This
would mean that we output the following:

    ScanAll (n) > Expand (n, r1, m) > Expand (m, r2, l) > Expand (l, r3, i) >
    ExpandUniquenessFilter ([r2], r3)

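The three plans above follow from a simple loop over triplets, which can be sketched in Python. The function `plan_matches` is a hypothetical stand-in that prints operators in the notation used in this document. It assumes every edge symbol is new and every end node is unbound, so it does not cover edge cycles or expansion to an already matched node.

```python
def plan_matches(match_clauses):
    """Plan a list of MATCH clauses, each a list of (start, edge, end) triplets."""
    plan, bound_nodes = [], set()
    for triplets in match_clauses:
        previous_edges = []  # uniqueness is tracked per MATCH clause
        for start, edge, end in triplets:
            if start not in bound_nodes:
                plan.append(f"ScanAll ({start})")
                bound_nodes.add(start)
            plan.append(f"Expand ({start}, {edge}, {end})")
            bound_nodes.add(end)
            if previous_edges:
                # Compare the new edge against edges matched earlier in this MATCH.
                plan.append(f"ExpandUniquenessFilter ([{', '.join(previous_edges)}], {edge})")
            previous_edges.append(edge)
    return " > ".join(plan)


one_match = plan_matches([[("n", "r1", "m"), ("m", "r2", "l")]])
two_matches = plan_matches([[("n", "r1", "m")], [("m", "r2", "l"), ("l", "r3", "i")]])
```

Resetting `previous_edges` for each clause is what makes `r2` and `r3` in the second `MATCH` independent of `r1` in the first.
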
There is a difference in how we handle edge uniqueness compared to Neo4j.
Neo4j does not allow searching for a single edge multiple times, but we've
decided to support that.

For example, the user can say the following.

    MATCH (n)-[r]-(m)-[r]-(l)

We would ensure that both `r` variables match the same edge. In our
terminology, we call this an *edge cycle*. For the above example, we would
generate this plan.

    ScanAll (n) > Expand (n, r, m) > Expand (m, r, l)

We do not put an `ExpandUniquenessFilter` operator between the 2 `Expand`
operators. Instead, we tell the 2nd `Expand` that it is an edge cycle. The 2nd
`Expand` will then ensure that both expansions matched the same edge.

##### Filtering

To narrow the search down, the patterns in `MATCH` can filter on labels and
properties. More general filtering is done using the accompanying `WHERE`
clause. During the preprocessing step, all filters are collected and
extracted into expressions. Additional information on which symbols are used
is also stored. This way, each time we generate a `ScanAll` or `Expand`, we
look at all the filters to see if any of them can be used, i.e. if the symbols
they use have been bound by a newly produced operator. If a filter expression
can be used, we immediately add a `Filter` operator with that expression.

For example:

    MATCH (n)-[r]-(m :label) WHERE n.prop = 42

We would produce:

    ScanAll (n) > Filter (n.prop) > Expand (n, r, m) > Filter (m :label)

This means that the same plan is generated for the query:

    MATCH (n {prop: 42})-[r]-(m :label)

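The rule "add a `Filter` as soon as its symbols are bound" can be illustrated with a small Python sketch. The names `plan_with_filters` and `emit` are hypothetical, and each filter is reduced to a display label plus the set of symbols it uses, which is a simplification of the stored filter information described above.

```python
def plan_with_filters(triplets, filters):
    """`filters` maps a display label to the set of symbols the filter uses."""
    plan, bound, pending = [], set(), dict(filters)

    def emit(operator, new_symbols):
        plan.append(operator)
        bound.update(new_symbols)
        # Attach every pending filter whose symbols are now all bound.
        for label in [name for name, used in pending.items() if used <= bound]:
            plan.append(f"Filter ({label})")
            del pending[label]

    for start, edge, end in triplets:
        if start not in bound:
            emit(f"ScanAll ({start})", {start})
        emit(f"Expand ({start}, {edge}, {end})", {edge, end})
    return " > ".join(plan)


plan = plan_with_filters([("n", "r", "m")], {"n.prop": {"n"}, "m :label": {"m"}})
```

Because the filter on `n.prop` only needs `n`, it lands directly after `ScanAll (n)` and before the more expensive `Expand`.
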
#### OPTIONAL

If a `MATCH` clause is preceded by `OPTIONAL`, then we need to generate a plan
such that we produce results even if we fail to match anything. This is
accomplished by generating an `Optional` operator, which takes 2 operator
trees:

  * input operation and
  * optional operation.

The input is the operation we generated for the part of the query before
`OPTIONAL MATCH`. For the optional operation, we simply generate the `OPTIONAL
MATCH` part just like we would for regular `MATCH`. In addition to operations,
we need to send the symbols which are set during optional matching to the
`Optional` operator. The operator will reset the values of those symbols to
`null` when the optional part fails to match.

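The behaviour expected from `Optional` can be modelled with Python generators. This is a sketch of the semantics only, under the assumption that a frame is a plain dictionary from symbol to value. The names `optional`, `expand_n_to_m` and the toy `edges` adjacency list are all made up for the example.

```python
def optional(input_rows, optional_op, optional_symbols):
    """For each input frame, yield its optional matches or one null-filled frame."""
    for frame in input_rows:
        matched = False
        for extended in optional_op(frame):
            matched = True
            yield extended
        if not matched:
            # The optional part failed: reset its symbols to null (None).
            yield {**frame, **{symbol: None for symbol in optional_symbols}}


edges = {"a": ["b"], "b": []}  # toy adjacency list standing in for the graph


def expand_n_to_m(frame):
    for target in edges[frame["n"]]:
        yield {**frame, "m": target}


rows = list(optional([{"n": "a"}, {"n": "b"}], expand_n_to_m, ["m"]))
```

Node `b` has no outgoing edge in the toy graph, so its row is still produced, with `m` set to `None`.
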
#### RETURN & WITH

`RETURN` and `WITH` clauses are very similar to each other. The only
difference is that `WITH` separates parts of the query and can be paired with
a `WHERE` clause.

The common part is generating operators for the body of the clause. Separation
of query parts is mostly done in semantic analysis, which checks that only the
symbols exposed through `WITH` are visible in the query parts after the
clause. Only a minor part is done in planning.

##### Named Results

Both clauses contain multiple named expressions (`expr AS name`) which are
used to generate the `Produce` operator.

##### Aggregations

If an expression contains an aggregation operator (`sum`, `avg`, ...) we need
to plan the `Aggregate` operator as input to `Produce`. This case is more
complex, because aggregation in openCypher can perform implicit grouping of
results used for aggregation.

For example, `WITH/RETURN sum(n.x) AS s, n.y AS group` will implicitly group
by the `n.y` expression.

Another, more obscure grouping can be achieved with
`RETURN sum(n.a) + n.b AS s`. Here, `n.b` will be used for grouping, even
though both the `sum` and `n.b` are in the same named expression.

Therefore, we need to collect all expressions which do not contain
aggregations and use them for grouping. You may have noticed that in the last
example `sum` is actually a sub-expression of `+`. The `Aggregate` operator
does not see that (nor should it), so the responsibility of evaluating that
falls on `Produce`. One way is for `Aggregate` to store results of grouping
expressions on the frame in addition to aggregation results. Unfortunately,
this would require rewiring named expressions in `Produce` to reference
already evaluated expressions. In the current implementation, we opted for
`Aggregate` to store only aggregation results on the frame, while `Produce`
will re-evaluate all the other (grouping) expressions. To handle that, symbols
which are used in expressions are passed to `Aggregate`, so that they can be
remembered. `Produce` will read those symbols from the frame and use them to
re-evaluate the needed expressions.

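Collecting the grouping expressions can be sketched as a recursive walk over an expression tree. In this illustration an expression is either a string (a leaf such as `n.b`) or a tuple `(operator, argument, ...)`. That encoding, the `AGGREGATIONS` set and both function names are assumptions for the example, not the planner's actual data structures.

```python
AGGREGATIONS = {"sum", "avg", "count", "min", "max"}


def contains_aggregation(expr):
    if isinstance(expr, str):
        return False
    op, *args = expr
    return op in AGGREGATIONS or any(contains_aggregation(a) for a in args)


def split_expression(expr, aggregations, group_by):
    """Sort the sub-expressions of `expr` into aggregations and grouping keys."""
    if not contains_aggregation(expr):
        group_by.append(expr)  # aggregation-free: used for implicit grouping
    elif expr[0] in AGGREGATIONS:
        aggregations.append(expr)
    else:
        # An operator such as `+` that mixes both kinds: look at its arguments.
        for arg in expr[1:]:
            split_expression(arg, aggregations, group_by)


aggregations, group_by = [], []
split_expression(("+", ("sum", "n.a"), "n.b"), aggregations, group_by)
```

For `sum(n.a) + n.b` the walk finds `sum(n.a)` as the aggregation and `n.b` as the grouping expression, while the `+` itself is left for `Produce` to evaluate.
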
##### Accumulation

After we have `Produce` and potentially `Aggregate`, we need to handle a
special case when the part of the query before `RETURN` or `WITH` performs
updates. In that case, we want to run that part of the query fully, so that we
get the latest results. This is accomplished by adding an `Accumulate`
operator as input to `Aggregate` or `Produce` (if there is no aggregation).
Accumulation will store the values of all the symbols used inside `RETURN` and
`WITH`, so that they can be used in the operator which follows. This way, only
parts of the frame are copied, instead of the whole frame. There is a minor
difference between planning `WITH` and planning `RETURN`. Since `WITH` can
separate writing from reading, we need to advance the transaction command.
This enables the later, read parts of the query to obtain the newest changes.
This is supported by passing the `advance_command` flag to the `Accumulate`
operator.

In the simplest case, common to both clauses, we have `Accumulate > Aggregate
> Produce` operators, where `Accumulate` and `Aggregate` may be left out.

##### Ordering

Planning `ORDER BY` is simple enough. Since it may see new symbols (filled in
`Produce`), we add the `OrderBy` operator at the end. The operator will change
the order of produced results, so we pass it the ordering expressions and the
output symbols of named expressions.

##### Filtering

A final difference in `WITH` is when it contains a `WHERE` clause. For that,
we simply generate the `Filter` operator, appended after `Produce` or
`OrderBy` (depending on which operator is last).

##### Skipping and Limiting

If we have `SKIP` or `LIMIT`, we generate `Skip` or `Limit` operators,
respectively. These operators are put at the end of the clause.

This placement may cause some unexpected behaviour when combined with
operations that update the graph. For example:

    MATCH (n) SET n.x = n.x + 1 RETURN n LIMIT 1

The above query may be interpreted as if the `SET` will be done only once.
Since this is a write query, we need to accumulate results, so the part before
`RETURN` will execute completely. The accumulated results will be yielded up
to the given limit, and the user would get only the first `n` that was
updated. This may confuse the user because in reality, every node in the
database has been updated.

Note that `Skip` always comes before `Limit`. In the current implementation,
they are generated directly one after the other.

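The interaction between accumulation and `LIMIT` can be reproduced with Python generators, which pull rows lazily in a similar way to chained operators. This is an analogy under simplifying assumptions: nodes are dictionaries, and `set_x` and `accumulate` are hypothetical stand-ins for the `SET` and `Accumulate` operators.

```python
from itertools import islice


def set_x(nodes):
    for node in nodes:  # updates one node per row pulled by the consumer
        node["x"] += 1
        yield node


def accumulate(rows):
    yield from list(rows)  # drains the whole input before yielding anything


# With accumulation, as planned for a write query followed by RETURN ... LIMIT 1.
nodes = [{"x": 0}, {"x": 0}, {"x": 0}]
returned = list(islice(accumulate(set_x(nodes)), 1))

# Without accumulation, the limit would stop the updates after the first row.
lazy_nodes = [{"x": 0}, {"x": 0}, {"x": 0}]
lazy_returned = list(islice(set_x(lazy_nodes), 1))
```

In the accumulated case a single row is returned but all three nodes end up with `x` equal to 1, which is the behaviour described above.
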
#### CREATE

The `CREATE` clause is used to create nodes and edges (relationships).

For multiple `CREATE` clauses or multiple creation patterns in a single
clause, we perform the same steps, described below.

##### Creating a Single Node

A node is created by simply specifying a node pattern.

For example, `CREATE (n :label {property: "value"}), ()` would create 2 nodes.
The 1st one would be created with a label and a property. This node could be
referenced later in the query, by using the variable `n`. The 2nd node cannot
be referenced and it would be created without any labels or properties. For
node creation, we generate a `CreateNode` operator and pass it all the details
of node creation: variable symbol, labels and properties. In the mentioned
example, we would have `CreateNode > CreateNode`.

##### Creating a Relationship

To create a relationship, the `CREATE` clause must contain a pattern with a
directed edge. Compared to creating a single node, this case is a bit more
complicated, because either side of the edge may not exist. By exist, we mean
that the endpoint is a variable which already references a node.

For example, `MATCH (n) CREATE (n)-[r]->(m)` would create an edge `r` and a
node `m` for each matched node `n`. If we focus on the `CREATE` part, we
generate `CreateExpand (n, r, m)` where `n` already exists (refers to the
matched node) and `m` would be newly created along with edge `r`. If we had
only `CREATE (n)-[r]->(m)`, then we would need to create both nodes of the
edge `r`. This is done by generating `CreateNode (n) > CreateExpand (n, r,
m)`. The final case is when both endpoints refer to an existing node, for
example, when adding a node with a cyclical connection:
`CREATE (n)-[r]->(n)`. In this case, we would generate
`CreateNode (n) > CreateExpand (n, r, n)`. We would tell `CreateExpand` to
only create the edge `r` between the already created `n`.

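The three cases can be condensed into a short Python sketch. The function `plan_create` is hypothetical and prints operators in this document's notation. A real `CreateExpand` would additionally have to be told whether its end node already exists, which the sketch only notes in a comment.

```python
def plan_create(triplets, bound=()):
    """Plan CREATE for (start, edge, end) triplets; `bound` holds existing symbols."""
    plan, bound = [], set(bound)
    for start, edge, end in triplets:
        if start not in bound:
            plan.append(f"CreateNode ({start})")
            bound.add(start)
        # If `end` is already bound here, only the edge should be created.
        plan.append(f"CreateExpand ({start}, {edge}, {end})")
        bound.update((edge, end))
    return " > ".join(plan)


after_match = plan_create([("n", "r", "m")], bound={"n"})  # MATCH (n) CREATE (n)-[r]->(m)
standalone = plan_create([("n", "r", "m")])                # CREATE (n)-[r]->(m)
cyclical = plan_create([("n", "r", "n")])                  # CREATE (n)-[r]->(n)
```

Only the set of already bound symbols differs between the three calls, which is why the same planning steps can be reused for every creation pattern.
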
#### MERGE

Although the merge operation is complex, planning turns out to be relatively
simple. The pattern inside the `MERGE` clause is used for both matching and
creating. Therefore, we create 2 operator trees, one for each action.

For example:

    MERGE (n)-[r:r]-(m)

We would generate a single `Merge` operator which has the following.

  * No input operation (since it is not preceded by any other clause).

  * On match operation

    `ScanAll (n) > Expand (n, r, m) > Filter (r)`

  * On create operation

    `CreateNode (n) > CreateExpand (n, r, m)`

In cases when `MERGE` contains `ON MATCH` and `ON CREATE` parts, we simply
append their operations to the respective operator trees.

Observe the following example.

    MERGE (n)-[r:r]-(m) ON MATCH SET n.x = 42 ON CREATE SET m :label

The `Merge` would be generated with the following.

  * No input operation (again, since there is no clause preceding it).

  * On match operation

    `ScanAll (n) > Expand (n, r, m) > Filter (r) > SetProperty (n.x, 42)`

  * On create operation

    `CreateNode (n) > CreateExpand (n, r, m) > SetLabels (m, :label)`

When we have preceding clauses, we simply put their operator as input to
`Merge`.

    MATCH (n) MERGE (n)-[r:r]-(m)

The above would be generated as

    ScanAll (n) > Merge (on_match_operation, on_create_operation)

Here we need to be careful to recognize which symbols are already declared.
But, since the `on_match_operation` is generated with the same algorithm as a
`Match`, that problem is handled there. The same should hold for
`on_create_operation`, which uses the process of generating a `Create`. So,
finally for this example, the `Merge` would have:

  * Input operation

    `ScanAll (n)`

  * On match operation

    `Expand (n, r, m) > Filter (r)`

    Note that `ScanAll` is not needed since we get the nodes from input.

  * On create operation

    `CreateExpand (n, r, m)`

    Note that `CreateNode` is dropped, since we want to expand the existing one.

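How a `Merge` operator might combine its three operations at execution time can be sketched with Python generators. This models the usual openCypher meaning of `MERGE` (create only when nothing matches) and is an assumption about behaviour, not a description of Memgraph's operator code. The names `merge`, `match_edge`, `create_edge` and the toy `edges` set are all hypothetical.

```python
def merge(input_rows, on_match, on_create):
    """For each input frame, yield its matches if any exist, otherwise create."""
    for frame in input_rows:
        matches = list(on_match(frame))
        yield from matches if matches else on_create(frame)


edges = {("a", "b")}  # toy graph: a set of (start, end) pairs


def match_edge(frame):
    for start, end in sorted(edges):
        if start == frame["n"]:
            yield {**frame, "m": end}


def create_edge(frame):
    new_node = f"new_{frame['n']}"
    edges.add((frame["n"], new_node))
    yield {**frame, "m": new_node}


rows = list(merge([{"n": "a"}, {"n": "c"}], match_edge, create_edge))
```

The input frames play the role of `ScanAll (n)`. Node `a` already has an edge, so it is matched, while node `c` falls through to the create branch.
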
## Logical Plan Postprocessing

NOTE: TODO

## Cost Estimation

NOTE: TODO

## Distributed Planning

NOTE: TODO