memgraph/docs/dev/query/planning.md
Lovro Lugovic 4e7292f2f4 Fix planning.md typos
Reviewers: mtomic, teon.banek

Reviewed By: teon.banek

Differential Revision: https://phabricator.memgraph.io/D1897
2019-03-04 13:18:44 +01:00

20 KiB

Logical Planning

After the semantic analysis and symbol generation, the AST is converted to a tree of logical operators. This conversion is called planning and the tree of logical operators is called a plan. The whole planning process is done in the following steps.

  1. AST Preprocessing

    The first step is to preprocess the AST by collecting information on filters, divide the query into parts, normalize patterns in MATCH clauses, etc.

  2. Logical Operator Planning

    After the preprocess step, the planning can be done via 2 planners: VariableStartPlanner and RuleBasedPlanner. The first planner will generate multiple plans where each plan has different starting points for searching the patterns in MATCH clauses. The second planner produces a single plan by mapping the query parts as they are to logical operators.

  3. Logical Plan Postprocessing

    In this stage, we perform various transformations on the generated logical plan. Here we want to optimize the operations in order to improve performance during the execution. Naturally, transformations need to preserve the semantic behaviour of the original plan.

  4. Cost Estimation

    After the generation, the execution cost of each plan is estimated. This estimation is used to select the best plan which will be executed.

  5. Distributed Planning

    In case we are running distributed Memgraph, the final plan is adapted for distributed execution. NOTE: This appears to be an error in the workflow. Distributed planning should be moved before step 3. or integrated with it. With the workflow ordered as is now, cost estimation doesn't consider the distributed plan.

The implementation can be found in the query/plan directory, with the public entry point being query/plan/planner.hpp.

AST Preprocessing

Each openCypher query consists of at least 1 single query. Multiple single queries are chained together using a query combinator. Currently, there is only one combinator, UNION. The preprocessing step starts in the CollectQueryParts function. This function will take a look at each single query and divide it into parts. Each part is separated with RETURN and WITH clauses. For example:

MATCH (n) CREATE (m) WITH m MATCH (l)-[]-(m) RETURN l
|                          |                        |
|------- part 1 -----------+-------- part 2 --------|
|                                                   |
|-------------------- single query -----------------|

Each part is created by collecting all MATCH clauses and normalizing their patterns. Pattern normalization is the process of converting an arbitrarily long pattern chain of nodes and edges into a list of triplets (start node, edge, end node). The triplets should preserve the semantics of the match. For example:

MATCH (a)-[p]-(b)-[q]-(c)-[r]-(d)

is equivalent to:

MATCH (a)-[p]-(b), (b)-[q]-(c), (c)-[r]-(d)

With this representation, it becomes easier to reorder the triplets and choose different strategies for pattern matching.

In addition to normalizing patterns, all of the filter expressions in patterns and inside of the WHERE clause (of the accompanying MATCH) are extracted and stored separately. During the extraction, symbols used in the filter expression are collected. This allows for planning filters in a valid order, as the matching for triplets is being done. Another important benefit of having extra information on filters, is to recognize when a database index could be used.

After each MATCH is processed, they are all grouped, so that even the whole MATCH clauses may be reordered. The important thing is to remember which symbols were used to name edges in each MATCH. With those symbols we can plan for cyphermorphism, i.e. ensure different edges in the search pattern of a single MATCH map to different edges in the graph. This preserves the semantic of the query, even though we may have reordered the matching. The same steps are done for OPTIONAL MATCH.

Another clause which needs processing is MERGE. Here we normalize the pattern, since the MERGE is a bit like MATCH and CREATE in one.

All the other clauses are left as is.

In the end, each query part consists of:

  • processed and grouped MATCH clauses;
  • processed and grouped OPTIONAL MATCH clauses;
  • processed MERGE matching pattern and
  • unchanged remaining clauses.

The last stored clause is guaranteed to be either WITH or RETURN.

Logical Operator Planning

Variable Start Planner

The VariableStartPlanner generates multiple plans for a single query. Each plan is generated by selecting a different starting point for pattern matching.

The algorithm works as follows.

  1. For each query part:
    1. For each node in triplets of collected MATCH clauses: i. Add the node to a set of expanded nodes ii. Select a triplet (start node, edge, end node) whose start node is in the expanded set iii. If no triplet was selected, choose a new starting node that isn't in expanded and continue expanding iv. Repeat steps ii. -- iii. until all triplets have been selected and store that as a variation of the MATCH clauses
    2. Do step 1.1. for OPTIONAL MATCH and MERGE clauses
    3. Take all combinations of the generated MATCH, OPTIONAL MATCH and MERGE and store them as variations of the query part.
  2. For each combination of query part variations:
    1. Generate a plan using the rule based planner

Rule Based Planner

The RuleBasedPlanner generates a single plan for a single query. A plan is generated by following hardcoded rules for producing logical operators. The following sections are an overview on how each openCypher clause is converted to a LogicalOperator.

MATCH

MATCH clause is used to specify which patterns need to be searched for in the database. These patterns are normalized in the preprocess step to be represented as triplets (start node, edge, end node). When there is no edge, then the triplet is reduced only to the start node. Generating the operators is done by looping over these triplets.

Searching for Nodes

The simplest search is finding standalone nodes. For example, MATCH (n) will find all the nodes in the graph. This is accomplished by generating a ScanAll operator and forwarding the node symbol which should store the results. In this case, all the nodes will be referenced by n.

Multiple nodes can be specified in a single match, e.g. MATCH (n), (m). Planning is done by repeating the same steps for each sub pattern (separated by a comma). In this case, we would get 2 ScanAll operators chained one after the other. An optimization can be obtained if the node in the pattern is already searched for. In MATCH (n), (n) we can drop the second ScanAll operator since we have already generated it for the first node.

Searching for Relationships

A more advanced search includes finding nodes with relationships. For example, MATCH (n)-[r]-(m) should find every pair of connected nodes in the database. This means, that if a single node has multiple connections, it will be repeated for each combination of pairs. The generation of operators starts from the first node in the pattern. If we are referencing a new starting node, we need to generate a ScanAll which finds all the nodes and stores them into n. Then, we generate an Expand operator which reads the n and traverses all the edges of that node. The edge is stored into r, while the destination node is stored in m.

Matching multiple relationships proceeds similarly, by repeating the same steps. The only difference is that we need to ensure different edges in the search pattern, map to different edges in the graph. This means that after each Expand operator, we need to generate an ExpandUniquenessFilter. We provide this operator with a list of symbols for the previously matched edges and the symbol for the current edge.

For example.

MATCH (n)-[r1]-(m)-[r2]-(l)

The above is preprocessed into

MATCH (n)-[r1]-(m), (m)-[r2]-(l)

Then we look at each triplet in order and perform the described steps. This way, we would generate:

ScanAll (n) > Expand (n, r1, m) > Expand (m, r2, l) >
    ExpandUniquenessFilter ([r1], r2)

Note that we don't need to make ExpandUniquenessFilter after the first Expand, since there are no edges to compare to. This filtering needs to work across multiple pattern, but inside a single MATCH clause.

Let's take a look at the following.

MATCH (n)-[r1]-(m), (m)-[r2]-(l)

We would also generate the exact same operators.

ScanAll (n) > Expand (n, r1, m) > Expand (m, r2, l) >
    ExpandUniquenessFilter ([r1], r2)

On the other hand,

MATCH (n)-[r1]-(m) MATCH (m)-[r2]-(l)-[r3]-(i)

would reset the uniqueness filtering at the start of the second match. This would mean that we output the following:

ScanAll (n) > Expand (n, r1, m) > Expand (m, r2, l) > Expand (l, r3, i) >
    ExpandUniquenessFilter ([r2], r3)

There is a difference in how we handle edge uniqueness compared to Neo4j. Neo4j does not allow searching for a single edge multiple times, but we've decided to support that.

For example, the user can say the following.

MATCH (n)-[r]-(m)-[r]-l

We would ensure that both r variables match to the same edge. In our terminology, we call this the edge cycle. For the above example, we would generate this plan.

ScanAll (n) > Expand (n, r, m) > Expand (m, r, l)

We do not put an ExpandUniquenessFilter operator between 2 Expand operators and we tell the 2nd Expand that it is an edge cycle. This, 2nd Expand will ensure we have matched both the same edges.

Filtering

To narrow the search down, the patterns in MATCH can have filtered labels and properties. A more general filtering is done using the accompanying WHERE clause. During the preprocess step, all filters are collected and extracted into expressions. Additional information on which symbols are used is also stored. This way, each time we generate a ScanAll or Expand, we look at all the filters to see if any of them can be used. I.e. if the symbols they use have been bound by a newly produced operator. If a filter expression can be used, we immediately add a Filter operator with that expression.

For example.

MATCH (n)-[r]-(m :label) WHERE n.prop = 42

We would produce:

ScanAll (n) > Filter (n.prop) > Expand (n, r, m) > Filter (m :label)

This means that the same plan is generated for the query:

MATCH (n {prop: 42})-[r]-(m :label)

OPTIONAL

If a MATCH clause is preceded by OPTIONAL, then we need to generate a plan such that we produce results even if we fail to match anything. This is accomplished by generating an Optional operator, which takes 2 operator trees:

  • input operation and
  • optional operation.

The input is the operation we generated for the part of the query before OPTIONAL MATCH. For the optional operation, we simply generate the OPTIONAL MATCH part just like we would for regular MATCH. In addition to operations, we need to send the symbols which are set during optional matching to the Optional operator. The operator will reset values of those symbols to null, when the optional part fails to match.

RETURN & WITH

RETURN and WITH clauses are very similar to each other. The only difference is that WITH separates parts of the query and can be paired with WHERE clause.

The common part is generating operators for the body of the clause. Separation of query parts is mostly done in semantic analysis, which checks that only the symbols exposed through WITH are visible in the query parts after the clause. The minor part is done in planning.

Named Results

Both clauses contain multiple named expressions (expr AS name) which are used to generate Produce operator.

Aggregations

If an expression contains an aggregation operator (sum, avg, ...) we need to plan the Aggregate operator as input to Produce. This case is more complex, because aggregation in openCypher can perform implicit grouping of results used for aggregation.

For example, WITH/RETURN sum(n.x) AS s, n.y AS group will implicitly group by n.y expression.

Another, obscure grouping can be achieved with RETURN sum(n.a) + n.b AS s. Here, the n.b will be used for grouping, even though both the sum and n.b are in the same named expression.

Therefore, we need to collect all expressions which do not contain aggregations and use them for grouping. You may have noticed that in the last example sum is actually a sub-expression of +. Aggregate operator does not see that (nor it should), so the responsibility of evaluating that falls on Produce. One way is for Aggregate to store results of grouping expressions on the frame in addition to aggregation results. Unfortunately, this would require rewiring named expressions in Produce to reference already evaluated expressions. In the current implementation, we opted for Aggregate to store only aggregation results on the frame, while Produce will re-evaluate all the other (grouping) expressions. To handle that, symbols which are used in expressions are passed to Aggregate, so that they can be remembered. Produce will read those symbols from the frame and use it to re-evaluate the needed expressions.

Accumulation

After we have Produce and potentially Aggregate, we need to handle a special case when the part of the query before RETURN or WITH performs updates. For that, we want to run that part of the query fully, so that we get the latest results. This is accomplished by adding Accumulate operator as input to Aggregate or Produce (if there is no aggregation). Accumulation will store all the values for all the used symbols inside RETURN and WITH, so that they can be used in the operator which follows. This way, only parts of the frame are copied, instead of the whole frame. Here is a minor difference between planning WITH, compared to RETURN. Since WITH can separate writing from reading, we need to advance the transaction command. This enables the later, read parts of the query to obtain the newest changes. This is supported by passing advance_command flag to Accumulate operator.

In the simplest case, common to both clauses, we have `Accumulate > Aggregate

Produceoperators, whereAccumulateandAggregate` may be left out.

Ordering

Planning ORDER BY is simple enough. Since it may see new symbols (filled in Produce), we add the OrderBy operator at the end. The operator will change the order of produced results, so we pass it the ordering expressions and the output symbols of named expressions.

Filtering

A final difference in WITH, is when it contains a WHERE clause. For that, we simply generate the Filter operator, appended after Produce or OrderBy (depending which operator is last).

Skipping and Limiting

If we have SKIP or LIMIT, we generate Skip or Limit operators, respectively. These operators are put at the end of the clause.

This placement may have some unexpected behaviour when combined with operations that update the graph. For example.

MATCH (n) SET n.x = n.x + 1 RETURN n LIMIT 1

The above query may be interpreted as if the SET will be done only once. Since this is a write query, we need to accumulate results, so the part before RETURN will execute completely. The accumulated results will be yielded up to the given limit, and the user would get only the first n that was updated. This may confuse the user because in reality, every node in the database had been updated.

Note that Skip always comes before Limit. In the current implementation, they are generated directly one after the other.

CREATE

CREATE clause is used to create nodes and edges (relationships).

For multiple CREATE clauses or multiple creation patterns in a single clause, we perform the same, following steps.

Creating a Single Node

A node is created by simply specifying a node pattern.

For example CREATE (n :label {property: "value"}), () would create 2 nodes. The 1st one would be created with a label and a property. This node could be referenced later in the query, by using the variable n. The 2nd node cannot be referenced and it would be created without any labels nor properties. For node creation, we generate a CreateNode operator and pass it all the details of node creation: variable symbol, labels and properties. In the mentioned example, we would have CreateNode > CreateNode.

Creating a Relationship

To create a relationship, the CREATE clause must contain a pattern with a directed edge. Compared to creating a single node, this case is a bit more complicated, because either side of the edge may not exist. By exist, we mean that the endpoint is a variable which already references a node.

For example, MATCH (n) CREATE (n)-[r]->(m) would create an edge r and a node m for each matched node n. If we focus on the CREATE part, we generate CreateExpand (n, r, m) where n already exists (refers to matched node) and m would be newly created along with edge r. If we had only CREATE (n)-[r]->(m), then we would need to create both nodes of the edge r. This is done by generating CreateNode (n) > CreateExpand(n, r, m). The final case is when both endpoints refer to an existing node. For example, when adding a node with a cyclical connection CREATE (n)-[r]->(n). In this case, we would generate CreateNode (n) > CreateExpand (n, r, n). We would tell CreateExpand to only create the edge r between the already created n.

MERGE

Although the merge operation is complex, planning turns out to be relatively simple. The pattern inside the MERGE clause is used for both matching and creating. Therefore, we create 2 operator trees, one for each action.

For example.

MERGE (n)-[r:r]-(m)

We would generate a single Merge operator which has the following.

  • No input operation (since it is not preceded by any other clause).

  • On match operation

    ScanAll (n) > Expand (n, r, m) > Filter (r)

  • On create operation

    CreateNode (n) > CreateExpand (n, r, m)

In cases when MERGE contains ON MATCH and ON CREATE parts, we simply append their operations to the respective operator trees.

Observe the following example.

MERGE (n)-[r:r]-(m) ON MATCH SET n.x = 42 ON CREATE SET m :label

The Merge would be generated with the following.

  • No input operation (again, since there is no clause preceding it).

  • On match operation

    ScanAll (n) > Expand (n, r, m) > Filter (r) > SetProperty (n.x, 42)

  • On create operation

    CreateNode (n) > CreateExpand (n, r, m) > SetLabels (n, :label)

When we have preceding clauses, we simply put their operator as input to Merge.

MATCH (n) MERGE (n)-[r:r]-(m)

The above would be generated as

ScanAll (n) > Merge (on_match_operation, on_create_operation)

Here we need to be careful to recognize which symbols are already declared. But, since the on_match_operation uses the same algorithm for generating a Match, that problem is handled there. The same should hold for on_create_operation, which uses the process of generating a Create. So, finally for this example, the Merge would have:

  • Input operation

    ScanAll (n)

  • On match operation

    Expand (n, r, m) > Filter (r)

    Note that ScanAll is not needed since we get the nodes from input.

  • On create operation

    CreateExpand (n, r, m)

    Note that CreateNode is dropped, since we want to expand the existing one.

Logical Plan Postprocessing

NOTE: TODO

Cost Estimation

NOTE: TODO

Distributed Planning

NOTE: TODO