# Logical Planning
After the semantic analysis and symbol generation, the AST is converted to a
tree of logical operators. This conversion is called planning and the tree of
logical operators is called a plan. The whole planning process is done in the
following steps.

1. The AST is preprocessed by collecting information on filters, dividing the
   query into parts, normalizing patterns in `MATCH` clauses, etc.

2. After the preprocessing step, the planning can be done via 2 planners:
   `VariableStartPlanner` and `RuleBasedPlanner`. The first planner generates
   multiple plans, where each plan has different starting points for
   searching the patterns in `MATCH` clauses. The second planner produces a
   single plan by mapping the query parts as they are to logical operators.

3. In this stage, we perform various transformations on the generated logical
   plan. Here we want to optimize the operations in order to improve
   performance during execution. Naturally, the transformations need to
   preserve the semantic behaviour of the original plan.

4. After the generation, the execution cost of each plan is estimated. This
   estimation is used to select the best plan, which will be executed.

5. In case we are running distributed Memgraph, the final plan is adapted for
   distributed execution.

   NOTE: This appears to be an error in the workflow. Distributed planning
   should be moved before step 3 or integrated with it. With the workflow
   ordered as it is now, cost estimation doesn't consider the distributed
   plan.
The implementation can be found in the `query/plan` directory, with the
public entry point being `query/plan/planner.hpp`.
## AST Preprocessing
Each openCypher query consists of at least 1 single query. Multiple single
queries are chained together using a query combinator. Currently, there is
only one combinator, `UNION`. The preprocessing step starts in the
`CollectQueryParts` function. This function will take a look at each single
query and divide it into parts. The parts are separated by `RETURN` and
`WITH` clauses. For example:
```
MATCH (n) CREATE (m) WITH m MATCH (l)-[]-(m) RETURN l
|                          |                        |
|------- part 1 -----------+-------- part 2 --------|
|                                                   |
|-------------------- single query -----------------|
```
Each part is created by collecting all `MATCH` clauses and normalizing their
patterns. Pattern normalization is the process of converting an arbitrarily
long pattern chain of nodes and edges into a list of triplets
`(start node, edge, end node)`. The triplets should preserve the semantics of
the match. For example:

```
MATCH (a)-[p]-(b)-[q]-(c)-[r]-(d)
```

is equivalent to:

```
MATCH (a)-[p]-(b), (b)-[q]-(c), (c)-[r]-(d)
```

With this representation, it becomes easier to reorder the triplets and
choose different strategies for pattern matching.
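For illustration, here is a minimal, self-contained sketch of this
normalization. It is not Memgraph's implementation; the `Triplet` struct and
`NormalizePattern` function are made-up names, and nodes and edges are
represented as plain strings instead of AST atoms.

```cpp
#include <iostream>
#include <string>
#include <vector>

// Hypothetical triplet representation; the real planner works on AST atoms.
struct Triplet {
  std::string start_node;
  std::string edge;
  std::string end_node;
};

// Split an alternating node/edge chain, e.g. {a, p, b, q, c, r, d}, into
// overlapping triplets: (a, p, b), (b, q, c), (c, r, d).
std::vector<Triplet> NormalizePattern(const std::vector<std::string> &chain) {
  std::vector<Triplet> triplets;
  for (size_t i = 0; i + 2 < chain.size(); i += 2)
    triplets.push_back({chain[i], chain[i + 1], chain[i + 2]});
  return triplets;
}

int main() {
  // MATCH (a)-[p]-(b)-[q]-(c)-[r]-(d)
  for (const auto &t : NormalizePattern({"a", "p", "b", "q", "c", "r", "d"}))
    std::cout << "(" << t.start_node << ", " << t.edge << ", "
              << t.end_node << ")\n";
}
```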
In addition to normalizing patterns, all of the filter expressions in
patterns and inside the `WHERE` clause (of the accompanying `MATCH`) are
extracted and stored separately. During the extraction, the symbols used in
each filter expression are collected. This allows for planning filters in a
valid order as the matching for triplets is being done. Another important
benefit of having this extra information on filters is recognizing when a
database index could be used.
After each `MATCH` is processed, they are all grouped, so that even whole
`MATCH` clauses may be reordered. The important thing is to remember which
symbols were used to name edges in each `MATCH`. With those symbols we can
plan for cyphermorphism, i.e. ensure that different edges in the search
pattern of a single `MATCH` map to different edges in the graph. This
preserves the semantics of the query, even though we may have reordered the
matching. The same steps are done for `OPTIONAL MATCH`.
Another clause which needs processing is `MERGE`. Here we normalize its
pattern, since `MERGE` is a bit like `MATCH` and `CREATE` in one. All the
other clauses are left as they are.
In the end, each query part consists of:

- processed and grouped `MATCH` clauses;
- processed and grouped `OPTIONAL MATCH` clauses;
- a processed `MERGE` matching pattern and
- the unchanged remaining clauses.

The last stored clause is guaranteed to be either `WITH` or `RETURN`.
## Logical Operator Planning

### Variable Start Planner
The `VariableStartPlanner` generates multiple plans for a single query. Each
plan is generated by selecting a different starting point for pattern
matching.
The algorithm works as follows.

1. For each query part:
   1. For each node in the triplets of the collected `MATCH` clauses:
      i. Add the node to a set of `expanded` nodes.
      ii. Select a triplet `(start node, edge, end node)` whose `start node`
          is in the `expanded` set.
      iii. If no triplet was selected, choose a new starting node that isn't
           in `expanded` and continue expanding.
      iv. Repeat steps ii. -- iii. until all triplets have been selected, and
          store that as a variation of the `MATCH` clauses.
   2. Do step 1.1. for `OPTIONAL MATCH` and `MERGE` clauses.
   3. Take all combinations of the generated `MATCH`, `OPTIONAL MATCH` and
      `MERGE` variations and store them as variations of the query part.
2. For each combination of query part variations:
   1. Generate a plan using the rule based planner.
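The expansion step above can be condensed into the following sketch. The
`Triplet` struct and `ExpandFrom` function are assumed names operating on
strings, not the actual `VariableStartPlanner` code.

```cpp
#include <iostream>
#include <set>
#include <string>
#include <vector>

struct Triplet {
  std::string start, edge, end;
};

// Order triplets by expansion from `start_node`: repeatedly pick a triplet
// whose start node is already expanded, flipping a triplet around when only
// its end node has been reached (the pattern is undirected here).
std::vector<Triplet> ExpandFrom(const std::string &start_node,
                                std::vector<Triplet> triplets) {
  std::vector<Triplet> ordered;
  std::set<std::string> expanded{start_node};
  while (!triplets.empty()) {
    bool progressed = false;
    for (auto it = triplets.begin(); it != triplets.end(); ++it) {
      if (expanded.count(it->end) && !expanded.count(it->start))
        std::swap(it->start, it->end);
      if (expanded.count(it->start)) {
        expanded.insert(it->end);
        ordered.push_back(*it);
        triplets.erase(it);
        progressed = true;
        break;
      }
    }
    // Disconnected pattern: choose a new starting node and keep expanding.
    if (!progressed) expanded.insert(triplets.front().start);
  }
  return ordered;
}

int main() {
  // MATCH (a)-[p]-(b), (b)-[q]-(c): start matching from `c` instead of `a`.
  for (const auto &t : ExpandFrom("c", {{"a", "p", "b"}, {"b", "q", "c"}}))
    std::cout << t.start << " -[" << t.edge << "]- " << t.end << "\n";
}
```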
### Rule Based Planner
The `RuleBasedPlanner` generates a single plan for a single query. A plan is
generated by following hardcoded rules for producing logical operators. The
following sections give an overview of how each openCypher clause is
converted to a `LogicalOperator`.
#### MATCH

The `MATCH` clause is used to specify which patterns need to be searched for
in the database. These patterns are normalized in the preprocessing step to
be represented as triplets `(start node, edge, end node)`. When there is no
edge, the triplet is reduced to only the `start node`. Generating the
operators is done by looping over these triplets.
##### Searching for Nodes

The simplest search is finding standalone nodes. For example, `MATCH (n)`
will find all the nodes in the graph. This is accomplished by generating a
`ScanAll` operator and forwarding the node symbol which should store the
results. In this case, all the nodes will be referenced by `n`.

Multiple nodes can be specified in a single match, e.g. `MATCH (n), (m)`.
Planning is done by repeating the same steps for each sub-pattern (separated
by a comma). In this case, we would get 2 `ScanAll` operators chained one
after the other. An optimization can be made when a node in the pattern has
already been searched for. In `MATCH (n), (n)` we can drop the second
`ScanAll` operator, since we have already generated it for the first node.
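A toy sketch of this, with a hypothetical `PlanNodeScans` helper; the real
planner tracks bound symbols across all operators, not just node scans.

```cpp
#include <iostream>
#include <set>
#include <string>
#include <vector>

// Emit one ScanAll per new node symbol; a symbol that is already bound
// needs no second ScanAll.
void PlanNodeScans(const std::vector<std::string> &node_symbols) {
  std::set<std::string> bound;
  for (const auto &sym : node_symbols)
    if (bound.insert(sym).second) std::cout << "ScanAll (" << sym << ")\n";
}

int main() {
  PlanNodeScans({"n", "m"});  // MATCH (n), (m) -> ScanAll (n) > ScanAll (m)
  PlanNodeScans({"n", "n"});  // MATCH (n), (n) -> a single ScanAll (n)
}
```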
##### Searching for Relationships

A more advanced search includes finding nodes with relationships. For
example, `MATCH (n)-[r]-(m)` should find every pair of connected nodes in the
database. This means that, if a single node has multiple connections, it will
be repeated for each combination of pairs. The generation of operators starts
from the first node in the pattern. If we are referencing a new starting
node, we need to generate a `ScanAll` which finds all the nodes and stores
them into `n`. Then, we generate an `Expand` operator which reads `n` and
traverses all the edges of that node. The edge is stored into `r`, while the
destination node is stored in `m`.
Matching multiple relationships proceeds similarly, by repeating the same
steps. The only difference is that we need to ensure that different edges in
the search pattern map to different edges in the graph. This means that after
each `Expand` operator, we need to generate an `EdgeUniquenessFilter`. We
provide this operator with a list of symbols for the previously matched edges
and the symbol for the current edge.
For example:

```
MATCH (n)-[r1]-(m)-[r2]-(l)
```

The above is preprocessed into

```
MATCH (n)-[r1]-(m), (m)-[r2]-(l)
```

Then we look at each triplet in order and perform the described steps. This
way, we would generate:

```
ScanAll (n) > Expand (n, r1, m) > Expand (m, r2, l) >
EdgeUniquenessFilter ([r1], r2)
```
Note that we don't need to generate an `EdgeUniquenessFilter` after the first
`Expand`, since there are no edges to compare to. This filtering needs to
work across multiple patterns, but inside a single `MATCH` clause.
Let's take a look at the following.

```
MATCH (n)-[r1]-(m), (m)-[r2]-(l)
```

We would generate the exact same operators:

```
ScanAll (n) > Expand (n, r1, m) > Expand (m, r2, l) >
EdgeUniquenessFilter ([r1], r2)
```

On the other hand,

```
MATCH (n)-[r1]-(m) MATCH (m)-[r2]-(l)-[r3]-(i)
```

would reset the uniqueness filtering at the start of the second `MATCH`, so
we output the following:

```
ScanAll (n) > Expand (n, r1, m) > Expand (m, r2, l) > Expand (l, r3, i) >
EdgeUniquenessFilter ([r2], r3)
```
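The generation of these chains can be sketched as follows. The `PlanMatch`
helper is hypothetical, and for brevity it always opens with a `ScanAll`,
whereas the real planner skips scanning a node symbol already bound by a
previous clause.

```cpp
#include <iostream>
#include <string>
#include <vector>

struct Triplet {
  std::string start, edge, end;
};

// Plan one MATCH clause from normalized triplets. `seen_edges` is local to
// the clause, so a following MATCH starts with an empty uniqueness list.
void PlanMatch(const std::vector<Triplet> &triplets) {
  std::vector<std::string> seen_edges;
  std::cout << "ScanAll (" << triplets.front().start << ")";
  for (const auto &t : triplets) {
    std::cout << " > Expand (" << t.start << ", " << t.edge << ", "
              << t.end << ")";
    // Every Expand after the first compares its edge to all previous ones.
    if (!seen_edges.empty()) {
      std::cout << " > EdgeUniquenessFilter ([";
      for (size_t i = 0; i < seen_edges.size(); ++i)
        std::cout << (i ? ", " : "") << seen_edges[i];
      std::cout << "], " << t.edge << ")";
    }
    seen_edges.push_back(t.edge);
  }
  std::cout << "\n";
}

int main() {
  // MATCH (n)-[r1]-(m), (m)-[r2]-(l): one uniqueness list spans both triplets.
  PlanMatch({{"n", "r1", "m"}, {"m", "r2", "l"}});
  // A separate MATCH (m)-[r2]-(l)-[r3]-(i) resets the list.
  PlanMatch({{"m", "r2", "l"}, {"l", "r3", "i"}});
}
```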
There is a difference in how we handle edge uniqueness compared to Neo4j.
Neo4j does not allow searching for a single edge multiple times, but we've
decided to support that. For example, the user can write the following.

```
MATCH (n)-[r]-(m)-[r]-(l)
```

We would ensure that both `r` variables match the same edge. In our
terminology, we call this an edge cycle. For the above example, we would
generate this plan:

```
ScanAll (n) > Expand (n, r, m) > Expand (m, r, l)
```

We do not put an `EdgeUniquenessFilter` operator between the 2 `Expand`
operators, and we tell the 2nd `Expand` that it is an edge cycle. Thus, the
2nd `Expand` will ensure that both expansions have matched the same edge.
##### Filtering

To narrow the search down, the patterns in `MATCH` can have filtered labels
and properties. More general filtering is done using the accompanying `WHERE`
clause. During the preprocessing step, all filters are collected and
extracted into expressions. Additional information on which symbols are used
is also stored. This way, each time we generate a `ScanAll` or `Expand`, we
look at all the filters to see if any of them can be used, i.e. whether the
symbols they use have been bound by a newly produced operator. If a filter
expression can be used, we immediately add a `Filter` operator with that
expression.

For example:

```
MATCH (n)-[r]-(m :label) WHERE n.prop = 42
```

We would produce:

```
ScanAll (n) > Filter (n.prop) > Expand (n, r, m) > Filter (m :label)
```

This means that the same plan is generated for the query:

```
MATCH (n {prop: 42})-[r]-(m :label)
```
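A simplified sketch of this greedy filter placement; the `FilterInfo` struct
and `PlanFiltersFor` helper are made up for illustration.

```cpp
#include <algorithm>
#include <iostream>
#include <set>
#include <string>
#include <vector>

struct FilterInfo {
  std::string expression;             // e.g. "n.prop = 42"
  std::set<std::string> used_symbols; // symbols the expression references
};

// Plan every pending filter whose used symbols are all bound.
void PlanFiltersFor(const std::set<std::string> &bound,
                    std::vector<FilterInfo> &pending) {
  for (auto it = pending.begin(); it != pending.end();) {
    if (std::includes(bound.begin(), bound.end(), it->used_symbols.begin(),
                      it->used_symbols.end())) {
      std::cout << "Filter (" << it->expression << ")\n";
      it = pending.erase(it);
    } else {
      ++it;
    }
  }
}

int main() {
  // MATCH (n)-[r]-(m :label) WHERE n.prop = 42
  std::vector<FilterInfo> pending{{"n.prop = 42", {"n"}},
                                  {"m :label", {"m"}}};
  std::set<std::string> bound;

  std::cout << "ScanAll (n)\n";
  bound.insert("n");
  PlanFiltersFor(bound, pending);  // plans Filter (n.prop = 42)

  std::cout << "Expand (n, r, m)\n";
  bound.insert({"r", "m"});
  PlanFiltersFor(bound, pending);  // plans Filter (m :label)
}
```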
##### OPTIONAL

If a `MATCH` clause is preceded by `OPTIONAL`, then we need to generate a
plan such that we produce results even if we fail to match anything. This is
accomplished by generating an `Optional` operator, which takes 2 operator
trees:

- the input operation and
- the optional operation.

The input is the operation we generated for the part of the query before
`OPTIONAL MATCH`. For the optional operation, we simply generate the
`OPTIONAL MATCH` part just like we would a regular `MATCH`. In addition to
the operations, we need to send the symbols which are set during optional
matching to the `Optional` operator. The operator will reset the values of
those symbols to `null` when the optional part fails to match.
#### RETURN & WITH

The `RETURN` and `WITH` clauses are very similar to each other. The only
difference is that `WITH` separates parts of the query and can be paired with
a `WHERE` clause.

The common part is generating operators for the body of the clause. The
separation of query parts is mostly done in semantic analysis, which checks
that only the symbols exposed through `WITH` are visible in the query parts
after the clause. A minor part is done in planning.
##### Named Results

Both clauses contain multiple named expressions (`expr AS name`) which are
used to generate the `Produce` operator.
##### Aggregations

If an expression contains an aggregation operator (`sum`, `avg`, ...), we
need to plan the `Aggregate` operator as input to `Produce`. This case is
more complex, because aggregation in openCypher can perform implicit grouping
of the results used for aggregation.

For example, `WITH/RETURN sum(n.x) AS s, n.y AS group` will implicitly group
by the `n.y` expression.

Another, more obscure grouping can be achieved with
`RETURN sum(n.a) + n.b AS s`. Here, `n.b` will be used for grouping, even
though both the `sum` and `n.b` are in the same named expression.
Therefore, we need to collect all expressions which do not contain
aggregations and use them for grouping. You may have noticed that in the last
example `sum` is actually a sub-expression of `+`. The `Aggregate` operator
does not see that (nor should it), so the responsibility of evaluating the
whole expression falls on `Produce`. One option would be for `Aggregate` to
store the results of the grouping expressions on the frame in addition to the
aggregation results. Unfortunately, this would require rewiring the named
expressions in `Produce` to reference the already evaluated expressions. In
the current implementation, we opted for `Aggregate` to store only the
aggregation results on the frame, while `Produce` re-evaluates all the other
(grouping) expressions. To handle that, the symbols which are used in the
expressions are passed to `Aggregate`, so that they can be remembered.
`Produce` will read those symbols from the frame and use them to re-evaluate
the needed expressions.
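The split between aggregations and grouping expressions can be illustrated
with a toy expression tree. The `Expr` type and `Collect` function are
invented; the real implementation walks the AST.

```cpp
#include <iostream>
#include <memory>
#include <string>
#include <vector>

struct Expr {
  std::string text;
  bool is_aggregation = false;
  std::vector<std::shared_ptr<Expr>> children;
};

bool ContainsAggregation(const Expr &e) {
  if (e.is_aggregation) return true;
  for (const auto &c : e.children)
    if (ContainsAggregation(*c)) return true;
  return false;
}

// Split one named expression into aggregations and group-by expressions.
void Collect(const std::shared_ptr<Expr> &e,
             std::vector<std::string> &aggregations,
             std::vector<std::string> &group_by) {
  if (e->is_aggregation) {
    aggregations.push_back(e->text);
  } else if (!ContainsAggregation(*e)) {
    // No aggregation anywhere below: the whole subexpression groups results.
    group_by.push_back(e->text);
  } else {
    // Mixed expression such as `sum(n.a) + n.b`: recurse into the children.
    for (const auto &c : e->children) Collect(c, aggregations, group_by);
  }
}

int main() {
  // RETURN sum(n.a) + n.b AS s
  auto sum = std::make_shared<Expr>(Expr{"sum(n.a)", true, {}});
  auto nb = std::make_shared<Expr>(Expr{"n.b", false, {}});
  auto plus = std::make_shared<Expr>(Expr{"sum(n.a) + n.b", false, {sum, nb}});
  std::vector<std::string> aggregations, group_by;
  Collect(plus, aggregations, group_by);
  for (const auto &a : aggregations) std::cout << "aggregate: " << a << "\n";
  for (const auto &g : group_by) std::cout << "group by: " << g << "\n";
}
```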
##### Accumulation

After we have `Produce` and potentially `Aggregate`, we need to handle a
special case: when the part of the query before `RETURN` or `WITH` performs
updates. In that case, we want to run that part of the query fully, so that
we get the latest results. This is accomplished by adding an `Accumulate`
operator as input to `Aggregate` or `Produce` (if there is no aggregation).
Accumulation will store all the values for all the symbols used inside
`RETURN` and `WITH`, so that they can be used in the operator which follows.
This way, only parts of the frame are copied, instead of the whole frame.

There is a minor difference between planning `WITH` compared to `RETURN`.
Since `WITH` can separate writing from reading, we need to advance the
transaction command. This enables the later, read parts of the query to
obtain the newest changes. This is supported by passing the `advance_command`
flag to the `Accumulate` operator.

In the simplest case, common to both clauses, we have
`Accumulate > Aggregate > Produce` operators, where `Accumulate` and
`Aggregate` may be left out.
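Schematically, the wiring decision might look like this. The
`LogicalOperator` struct and `PlanClauseBody` helper are invented, shown only
to make the branching explicit.

```cpp
#include <iostream>
#include <memory>
#include <string>

struct LogicalOperator {
  std::string name;
  std::shared_ptr<LogicalOperator> input;
};

std::shared_ptr<LogicalOperator> Chain(std::string name,
                                       std::shared_ptr<LogicalOperator> input) {
  return std::make_shared<LogicalOperator>(
      LogicalOperator{std::move(name), std::move(input)});
}

// Wire the RETURN/WITH body: Accumulate only if the preceding part writes,
// Aggregate only if the body aggregates; WITH after a write also advances
// the transaction command so later reads see the changes.
std::shared_ptr<LogicalOperator> PlanClauseBody(
    std::shared_ptr<LogicalOperator> input, bool performs_updates,
    bool has_aggregation, bool is_with) {
  if (performs_updates)
    input = Chain(is_with ? "Accumulate (advance_command)" : "Accumulate",
                  std::move(input));
  if (has_aggregation) input = Chain("Aggregate", std::move(input));
  return Chain("Produce", std::move(input));
}

int main() {
  // MATCH (n) SET n.x = 1 WITH sum(n.x) AS s ...
  auto plan = PlanClauseBody(Chain("SetProperty", Chain("ScanAll", nullptr)),
                             /*performs_updates=*/true,
                             /*has_aggregation=*/true, /*is_with=*/true);
  for (auto op = plan; op; op = op->input) std::cout << op->name << "\n";
}
```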
##### Ordering

Planning `ORDER BY` is simple enough. Since it may see new symbols (filled in
`Produce`), we add the `OrderBy` operator at the end. The operator will
change the order of the produced results, so we pass it the ordering
expressions and the output symbols of the named expressions.
##### Filtering

A final difference in `WITH` is when it contains a `WHERE` clause. For that,
we simply generate the `Filter` operator, appended after `Produce` or
`OrderBy` (depending on which operator comes last).
##### Skipping and Limiting

If we have `SKIP` or `LIMIT`, we generate the `Skip` or `Limit` operators,
respectively. These operators are put at the end of the clause.

This placement may have some unexpected behaviour when combined with
operations that update the graph. For example:

```
MATCH (n) SET n.x = n.x + 1 RETURN n LIMIT 1
```

The above query may be interpreted as if `SET` will be done only once. Since
this is a write query, we need to accumulate results, so the part before
`RETURN` will execute completely. The accumulated results will be yielded up
to the given limit, and the user will get only the first `n` that was
updated. This may confuse the user, because in reality every node in the
database has been updated.

Note that `Skip` always comes before `Limit`. In the current implementation,
they are generated directly one after the other.
#### CREATE

The `CREATE` clause is used to create nodes and edges (relationships). For
multiple `CREATE` clauses, or multiple creation patterns in a single clause,
we perform the following steps for each.

##### Creating a Single Node

A node is created by simply specifying a node pattern. For example,
`CREATE (n :label {property: "value"}), ()` would create 2 nodes. The 1st one
would be created with a label and a property. This node could be referenced
later in the query, by using the variable `n`. The 2nd node cannot be
referenced, and it would be created without any labels or properties. For
node creation, we generate a `CreateNode` operator and pass it all the
details of node creation: the variable symbol, labels and properties. In the
mentioned example, we would have `CreateNode > CreateNode`.
##### Creating a Relationship

To create a relationship, the `CREATE` clause must contain a pattern with a
directed edge. Compared to creating a single node, this case is a bit more
complicated, because either side of the edge may not exist. By exist, we mean
that the endpoint is a variable which already references a node.

For example, `MATCH (n) CREATE (n)-[r]->(m)` would create an edge `r` and a
node `m` for each matched node `n`. If we focus on the `CREATE` part, we
generate `CreateExpand (n, r, m)`, where `n` already exists (refers to a
matched node) and `m` will be newly created along with the edge `r`. If we
had only `CREATE (n)-[r]->(m)`, then we would need to create both nodes of
the edge `r`. This is done by generating
`CreateNode (n) > CreateExpand (n, r, m)`. The final case is when both
endpoints refer to an existing node, for example, when adding a node with a
cyclical connection: `CREATE (n)-[r]->(n)`. In this case, we would generate
`CreateNode (n) > CreateExpand (n, r, n)`, and we would tell `CreateExpand`
to only create the edge `r` between the already created `n`.
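The endpoint-existence logic can be sketched like this, with a hypothetical
`PlanCreate` helper operating on a set of bound symbols.

```cpp
#include <iostream>
#include <set>
#include <string>

// Plan CREATE for one triplet (u, e, v): emit CreateNode for the start node
// if it is not already bound; CreateExpand then creates the edge and, when
// unbound, the end node as well.
void PlanCreate(const std::string &u, const std::string &e,
                const std::string &v, std::set<std::string> &bound) {
  if (!bound.count(u)) {
    std::cout << "CreateNode (" << u << ") > ";
    bound.insert(u);
  }
  // CreateExpand itself creates `v` when it is unbound, so no separate
  // CreateNode is planned for the end node.
  std::cout << "CreateExpand (" << u << ", " << e << ", " << v << ")\n";
  bound.insert(v);
}

int main() {
  std::set<std::string> bound{"n"};  // `n` was bound by a preceding MATCH
  PlanCreate("n", "r", "m", bound);  // MATCH (n) CREATE (n)-[r]->(m)

  std::set<std::string> fresh;
  PlanCreate("n", "r", "m", fresh);  // CREATE (n)-[r]->(m)

  std::set<std::string> cycle;
  PlanCreate("n", "r", "n", cycle);  // CREATE (n)-[r]->(n)
}
```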
#### MERGE

Although the merge operation is complex, planning turns out to be relatively
simple. The pattern inside the `MERGE` clause is used for both matching and
creating. Therefore, we create 2 operator trees, one for each action.

For example:

```
MERGE (n)-[r:r]-(m)
```

We would generate a single `Merge` operator which has the following.

- No input operation (since it is not preceded by any other clause).
- On match operation:

  ```
  ScanAll (n) > Expand (n, r, m) > Filter (r)
  ```

- On create operation:

  ```
  CreateNode (n) > CreateExpand (n, r, m)
  ```
In cases when `MERGE` contains `ON MATCH` and `ON CREATE` parts, we simply
append their operations to the respective operator trees. Observe the
following example.

```
MERGE (n)-[r:r]-(m) ON MATCH SET n.x = 42 ON CREATE SET m :label
```

The `Merge` would be generated with the following.

- No input operation (again, since there is no clause preceding it).
- On match operation:

  ```
  ScanAll (n) > Expand (n, r, m) > Filter (r) > SetProperty (n.x, 42)
  ```

- On create operation:

  ```
  CreateNode (n) > CreateExpand (n, r, m) > SetLabels (m, :label)
  ```
When we have preceding clauses, we simply put their operator as input to
`Merge`.

```
MATCH (n) MERGE (n)-[r:r]-(m)
```

The above would be generated as

```
ScanAll (n) > Merge (on_match_operation, on_create_operation)
```

Here we need to be careful to recognize which symbols are already declared.
But, since the `on_match_operation` uses the same algorithm for generating a
`Match`, that problem is handled there. The same holds for
`on_create_operation`, which uses the process of generating a `Create`. So,
finally, for this example the `Merge` would have:

- Input operation:

  ```
  ScanAll (n)
  ```

- On match operation:

  ```
  Expand (n, r, m) > Filter (r)
  ```

  Note that `ScanAll` is not needed, since we get the nodes from the input.

- On create operation:

  ```
  CreateExpand (n, r, m)
  ```

  Note that `CreateNode` is dropped, since we want to expand the existing
  node.
## Logical Plan Postprocessing

NOTE: TODO

## Cost Estimation

NOTE: TODO

## Distributed Planning

NOTE: TODO