Move query implementation from Phriction to this repo

Summary:
Our query parsing, planning and execution architecture was described on
Phriction, Phabricator's wiki. This commit copies that documentation here so
that it is easier for all developers to access. An additional benefit is that
changes are now tracked, which should encourage developers to keep the
documentation up to date.

Besides making a copy, the documentation has been updated to reflect the
current state of the codebase. Note that some things are still missing,
but what was written should now be correct.

Reviewers: mtomic, llugovic

Reviewed By: mtomic

Subscribers: pullbot

Differential Revision: https://phabricator.memgraph.io/D1854
Teon Banek 2019-02-13 16:50:32 +01:00
parent a14c4f1864
commit 6f10b1c115
7 changed files with 1082 additions and 0 deletions

1
docs/dev/query/.gitignore vendored Normal file

@@ -0,0 +1 @@
html/

11
docs/dev/query/build-html Executable file

@@ -0,0 +1,11 @@
#!/bin/bash
script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
mkdir -p "$script_dir/html"
for markdown_file in $(find "$script_dir" -name '*.md'); do
  name=$(basename -s .md "$markdown_file")
  sed -e 's/\.md/\.html/g' "$markdown_file" | \
    pandoc -s -f markdown -t html -o "$script_dir/html/$name.html"
done


@@ -0,0 +1,14 @@
# Query Parsing, Planning and Execution
This part of the documentation deals with query execution.
Memgraph currently supports only query interpretation. Each new query is
parsed, analysed and translated into a sequence of operations which are then
executed on the main database storage. Query execution is organized into the
following phases:
1. [Lexical Analysis (Tokenization)](parsing.md)
2. [Syntactic Analysis (Parsing)](parsing.md)
3. [Semantic Analysis and Symbol Generation](semantic.md)
4. [Logical Planning](planning.md)
5. [Logical Plan Execution](execution.md)

373
docs/dev/query/execution.md Normal file

@@ -0,0 +1,373 @@
# Logical Plan Execution
We implement classical iterator-style operators. Logical operators define
operations on the database. They encapsulate the following info: what the input is
(another `LogicalOperator`), what to do with the data, and how to do it.
Currently logical operators can have zero or more input operations, and thus a
`LogicalOperator` tree is formed. Most `LogicalOperator` types have only one
input, so we are mostly working with chains instead of full-fledged trees.
You can find information on each operator in `src/query/plan/operator.lcp`.
## Cursor
Logical operators do not perform database work themselves. Instead they create
`Cursor` objects that do the actual work, based on the info in the operator.
Cursors expose a `Pull` method that gets called by the cursor's consumer. The
consumer keeps pulling as long as the `Pull` returns `true` (indicating it
successfully performed some work and might be eligible for another `Pull`).
Most cursors will call the `Pull` function of the cursor of their input
operator, so typically a cursor chain is created that is analogous to the
logical operator chain it was created from.
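
To make the iterator style concrete, below is a minimal sketch of a `Cursor`
interface and a consumer's pull loop. The names and signatures are simplified
and hypothetical; the real interfaces live in `src/query/plan/operator.lcp`
and take additional context arguments.

    #include <memory>

    class Frame;  // Holds the data of the current Pull chain (see next section).

    // Simplified cursor interface: Pull returns true while it produces rows.
    struct Cursor {
      virtual ~Cursor() = default;
      virtual bool Pull(Frame &frame) = 0;
    };

    // Simplified logical operator interface: it only knows how to make a cursor.
    struct LogicalOperator {
      virtual ~LogicalOperator() = default;
      virtual std::unique_ptr<Cursor> MakeCursor() const = 0;
    };

    // The consumer keeps pulling until the top-most cursor is exhausted.
    inline void PullAll(const LogicalOperator &plan, Frame &frame) {
      auto cursor = plan.MakeCursor();
      while (cursor->Pull(frame)) {
        // Each successful Pull leaves one row of results on the frame.
      }
    }
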
## Frame
The `Frame` object contains all the data of the current `Pull` chain. It
serves for communicating data between cursors.
For example, in a `MATCH (n) RETURN n` query the `ScanAllCursor` places a
vertex on the `Frame` for each `Pull`. It places it in the position reserved for
the `n` symbol. Then the `ProduceCursor` can take that same value from the
`Frame` because it knows the appropriate symbol. `Frame` positions are indexed
by `Symbol` objects.
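
Conceptually, the `Frame` behaves like a flat array of values indexed by
symbol positions. The following sketch is only illustrative and uses
hypothetical member names:

    #include <cstdint>
    #include <vector>

    struct TypedValue {};  // Stand-in for the dynamically typed value class.

    // A symbol carries the position reserved for it in the frame.
    struct Symbol {
      int32_t position;
    };

    class Frame {
     public:
      explicit Frame(int64_t size) : elems_(size) {}

      // Cursors read and write values through the symbol's reserved position.
      TypedValue &operator[](const Symbol &symbol) { return elems_[symbol.position]; }
      const TypedValue &operator[](const Symbol &symbol) const {
        return elems_[symbol.position];
      }

     private:
      std::vector<TypedValue> elems_;
    };
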
## ExpressionEvaluator
Expression results are not placed on the `Frame` since they do not need to be
communicated between different `Cursors`. Instead, expressions are evaluated
using an instance of `ExpressionEvaluator`. Since generally speaking an
expression can be defined by a tree of subexpressions, the
`ExpressionEvaluator` is implemented as a tree visitor. There is a performance
sub-optimality here because a stack is used to communicate intermediary
expression results between elements of the tree. This is one of the reasons
why it's planned to use `Frame` for intermediary expression results as well.
The other reason is that it might facilitate compilation later on.
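
As a toy illustration of the idea (not the actual visitor over our AST types),
evaluating an expression tree by recursion looks roughly like this; the
intermediary results live on the call stack rather than on the `Frame`:

    #include <cstdint>
    #include <memory>
    #include <utility>

    // Toy expression tree: integer literals and binary addition only.
    struct Expression {
      virtual ~Expression() = default;
    };
    struct Literal : Expression {
      explicit Literal(int64_t v) : value(v) {}
      int64_t value;
    };
    struct Add : Expression {
      Add(std::unique_ptr<Expression> l, std::unique_ptr<Expression> r)
          : lhs(std::move(l)), rhs(std::move(r)) {}
      std::unique_ptr<Expression> lhs, rhs;
    };

    // Recursively evaluate the tree of subexpressions into a single value.
    inline int64_t Evaluate(const Expression &expr) {
      if (const auto *lit = dynamic_cast<const Literal *>(&expr)) return lit->value;
      const auto &add = dynamic_cast<const Add &>(expr);
      return Evaluate(*add.lhs) + Evaluate(*add.rhs);
    }
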
## Cypher Execution Semantics
Cypher query execution has *mostly* well-defined semantics. Some are
explicitly defined by openCypher and its TCK, while others are implicitly
defined by Neo4j's implementation of Cypher that we want to be generally
compatible with.
These semantics can in short be described as follows: a Cypher query consists
of multiple clauses, some of which modify the graph. Generally, every clause in the
query, when reading it left to right, operates on a consistent state of the
property graph, untouched by subsequent clauses. This means that a `MATCH`
clause in the beginning operates on a graph-state in which modifications by
the subsequent `SET` are not visible.
The stated semantics feel very natural to the end-user, and Neo seems to
implement them well. For Memgraph the situation is complex because
`LogicalOperator` execution (through a `Cursor`) happens one `Pull` at a time
(generally meaning all the query clauses get executed for every top-level
`Pull`). This is not inherently consistent with Cypher semantics because a
`SET` clause can modify data, and the `MATCH` clause that precedes it might
see the modification in a subsequent `Pull`. Also, the `RETURN` clause might
want to stream results to the user before all `SET` clauses have been
executed, so the user might see some intermediate graph state. There are many
edge-cases that Memgraph does its best to avoid to stay true to Cypher
semantics, while at the same time using a high-performance streaming approach.
The edge-cases are enumerated in this document along with the implementation
details they imply.
## Implementation Peculiarities
### Once
An operator that does nothing but whose `Cursor::Pull` returns `true` on the
first `Pull` and `false` on subsequent ones. This operator is used when
another operator has an optional input, because in Cypher a clause will
typically execute once for every input from the preceding clauses, or just
once if there was no preceding input. For example, consider the `CREATE`
clause. In the query `CREATE (n)` only one node is created, while in the query
`MATCH (n) CREATE (m)` a node is created for each existing node. Thus in our
`CreateNode` logical operator the input is either a `ScanAll` operator, or a
`Once` operator.
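
A sketch of such a cursor, reusing the simplified `Cursor` and `Frame` types
from the [Cursor](#cursor) section (hypothetical; the real implementation is
in `operator.lcp`):

    // Signals a single successful Pull so that operators with no real input
    // still execute exactly once.
    class OnceCursor final : public Cursor {
     public:
      bool Pull(Frame &) override {
        if (did_pull_) return false;
        did_pull_ = true;
        return true;
      }

     private:
      bool did_pull_ = false;
    };
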
### GraphView
In the previous section, [Cypher Execution
Semantics](#cypher-execution-semantics), we mentioned how the preceding
clauses should not see changes made in subsequent ones. For that reason, some
operators take a `GraphView` enum value. This value determines which state of
the graph an operator sees.
Consider the query `MATCH (n)--(m) WHERE n.x = 0 SET m.x = 1`. Naive streaming
could match a vertex `n` on the given criteria, expand to `m`, update its
property, and in the next iteration consider the vertex previously matched to
`m` and skip it because its newly set property value does not qualify. This
is not how Cypher works. To handle this issue properly, Memgraph designed the
`VertexAccessor` class that tracks two versions of data: one that was visible
before the current transaction+command, and the optional other that was
created in the current transaction+command. The `MATCH` clause will be planned
as `ScanAll` and `Expand` operations using `GraphView::OLD` value. This will
ensure modifications performed in the same query do not affect it. The same
applies to edges and the `EdgeAccessor` class.
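
The following is a heavily simplified, hypothetical illustration of the
two-version idea; the real `VertexAccessor` additionally deals with MVCC,
transactions and storage concerns:

    #include <cstdint>
    #include <map>
    #include <optional>
    #include <string>
    #include <utility>

    enum class GraphView { OLD, NEW };

    // Simplified stand-in for vertex data (labels, properties, ...).
    struct VertexData {
      std::map<std::string, int64_t> properties;
    };

    class VertexAccessorSketch {
     public:
      explicit VertexAccessorSketch(VertexData old_version)
          : old_(std::move(old_version)) {}

      // Operators planned with GraphView::OLD (ScanAll, Expand in MATCH) read
      // the version visible before the current transaction+command.
      const VertexData &View(GraphView view) const {
        if (view == GraphView::NEW && new_) return *new_;
        return old_;
      }

      // SET-like operations create or update the new version.
      VertexData &Modify() {
        if (!new_) new_ = old_;
        return *new_;
      }

     private:
      VertexData old_;
      std::optional<VertexData> new_;
    };
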
### Existing Record Detection
It's possible that a pattern element has already been declared in the same
pattern, or a preceding pattern. For example `MATCH (n)--(m), (n)--(l)` or a
cycle-detection match `MATCH (n)-->(n) RETURN n`. Implementation-wise,
existing record detection just checks that the expanded record is equal to the
one already on the frame.
### Why Not Use Separate Expansion Ops for Edges and Vertices?
Expanding an edge and a vertex in separate ops is not feasible when matching a
cycle in bi-directional expansions. Consider the query `MATCH (n)--(n) RETURN
n`. Let's try to expand the edge first in one op, and the vertex in the next. The
vertex expansion consumes the edge expansion input. It takes the expanded edge
from the frame. It needs to detect a cycle by comparing the vertex existing on
the frame with one of the edge vertices (`from` or `to`). But which one? It
doesn't know, and can't ensure correct cycle detection.
### Data Visibility During and After SET
In Cypher, setting values always works on the latest version of data (from
preceding or current clause). That means that within a `SET` clause all the
changes from previous clauses must be visible, as well as changes done by the
current `SET` clause. Also, if there is a clause after `SET` it must see *all*
the changes performed by the preceding `SET`. Both these things are best
illustrated with the following queries executed on an empty database:
CREATE (n:A {x:0})-[:EdgeType]->(m:B {x:0})
MATCH (n)--(m) SET m.x = n.x + 1 RETURN labels(n), n.x, labels(m), m.x
This returns:
+---------+---+---------+---+
|labels(n)|n.x|labels(m)|m.x|
+:=======:+:=:+:=======:+:=:+
|[A] |2 |[B] |1 |
+---------+---+---------+---+
|[B] |1 |[A] |2 |
+---------+---+---------+---+
The obtained result implies the following operations:
1. In the first iteration we set the value of `B.x` to 1.
2. In the second iteration we observe `B.x` with the value of 1 and set
`A.x` to 2.
3. In `RETURN` we see all the changes made in both iterations.
To implement the desired behavior Memgraph utilizes two techniques. First is
the already mentioned tracking of two versions of data in vertex accessors.
Using this approach ensures that the second iteration in the example query
sees the data modification performed by the preceding iteration. The second
technique is the `Accumulate` operation that accumulates all the iterations
from the preceding logical op before passing them to the next logical op. In
the example query, `Accumulate` ensures that the results returned to the user
reflect changes performed in all iterations of the query (naive streaming
could stream results at the end of first iteration producing inconsistent
results). Note that `Accumulate` is demanding regarding memory and slows down
query execution. For that reason it should be used only when necessary, for
example it does not have to be used in a query that has `MATCH` and `SET` but
no `RETURN`.
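
The core of `Accumulate` can be sketched as follows, reusing the simplified
`Cursor`, `Frame` and `Symbol` types from above (hypothetical; the real
operator also handles advancing the transaction command and other details):

    #include <cstddef>
    #include <memory>
    #include <utility>
    #include <vector>

    class AccumulateCursorSketch final : public Cursor {
     public:
      AccumulateCursorSketch(std::unique_ptr<Cursor> input, std::vector<Symbol> symbols)
          : input_(std::move(input)), symbols_(std::move(symbols)) {}

      bool Pull(Frame &frame) override {
        if (!accumulated_) {
          // Run the whole preceding part of the query before producing anything.
          while (input_->Pull(frame)) {
            std::vector<TypedValue> row;
            for (const auto &symbol : symbols_) row.push_back(frame[symbol]);
            cache_.push_back(std::move(row));
          }
          accumulated_ = true;
        }
        if (next_row_ == cache_.size()) return false;
        // Restore the cached values onto the frame for the operator that follows.
        const auto &row = cache_[next_row_++];
        for (std::size_t i = 0; i < symbols_.size(); ++i) frame[symbols_[i]] = row[i];
        return true;
      }

     private:
      std::unique_ptr<Cursor> input_;
      std::vector<Symbol> symbols_;
      std::vector<std::vector<TypedValue>> cache_;
      bool accumulated_ = false;
      std::size_t next_row_ = 0;
    };
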
### Neo4j Inconsistency on Multiple SET Clauses
Considering the preceding example, it could be expected that when a query has
multiple `SET` clauses, all the changes from the preceding ones are visible.
This is not the case in Neo4j's implementation. Consider the following queries
executed on an empty database:
CREATE (n:A {x:0})-[:EdgeType]->(m:B {x:0})
MATCH (n)--(m) SET n.x = n.x + 1 SET m.x = m.x * 2
RETURN labels(n), n.x, labels(m), m.x
This returns:
+---------+---+---------+---+
|labels(n)|n.x|labels(m)|m.x|
+:=======:+:=:+:=======:+:=:+
|[A] |2 |[B] |1 |
+---------+---+---------+---+
|[B] |1 |[A] |2 |
+---------+---+---------+---+
If all the iterations of the first `SET` clause were executed before executing
the second, all the resulting values would be 2. This not being the case, we
conclude that Neo4j does not use a barrier-like mechanism between `SET`
clauses. It is Memgraph's current vision that this is inconsistent and we
plan to reduce Neo4j compliance in favour of operation consistency.
### Double Deletion
It's possible to match the same graph element multiple times in a single query
and delete it. Neo supports this, and so do we. The relevant implementation
detail is in the `GraphDbAccessor` class, where the record deletion functions
reside, and not in the logical plan execution. It comes down to checking if a
record has already been deleted in the current transaction+command and not
attempting to do it again (which would otherwise result in a crash).
### Set + Delete Edge-case
It's legal for a query to combine `SET` and `DELETE` clauses. Consider the
following queries executed on an empty database:
CREATE ()-[:T]->()
MATCH (n)--(m) SET n.x = 42 DETACH DELETE m
Due to the `MATCH` being undirected the second pull will attempt to set data
on a deleted vertex. This is not a legal operation in Memgraph's storage
implementation. For that reason the logical operator for `SET` must check if
the record it's trying to set something on has been deleted by the current
transaction+command. If so, the modification is not executed.
### Deletion Accumulation
Sometimes it's necessary to accumulate deletions of all the matches before
attempting to execute them. Consider the following. Start with an empty
database and execute these queries:
CREATE ()-[:T]->()-[:T]->()
MATCH (a)-[r1]-(b)-[r2]-(c) DELETE r1, b, c
Note that the `DELETE` clause attempts to delete node `c`, but it does not
detach it by deleting edge `r2`. However, due to the undirected edge in the
`MATCH`, both edges get pulled and deleted.
Currently Memgraph does not support this behavior, while Neo does. There are a
few ways we could implement it.
* Accumulate on deletion (that sucks because we have to keep track of
everything that gets returned after the deletion).
* Maybe we could stream through the deletion op, but defer actual deletion
until plan-execution end.
* Ignore this because it's very edgy (this is the currently selected option).
### Aggregation Without Input
It is necessary to define what aggregation ops return when they receive no
input. Following is a table that shows what Neo4j's Cypher implementation and
SQL produce.
+-------------+------------------------+---------------------+---------------------+------------------+
| \<OP\> | 1. Cypher, no group-by | 2. Cypher, group-by | 3. SQL, no group-by | 4. SQL, group-by |
+=============+:======================:+:===================:+:===================:+:================:+
| Count(\*) | 0 | \<NO\_ROWS> | 0 | \<NO\_ROWS> |
+-------------+------------------------+---------------------+---------------------+------------------+
| Count(prop) | 0 | \<NO\_ROWS> | 0 | \<NO\_ROWS> |
+-------------+------------------------+---------------------+---------------------+------------------+
| Sum | 0 | \<NO\_ROWS> | NULL | \<NO\_ROWS> |
+-------------+------------------------+---------------------+---------------------+------------------+
| Avg | NULL | \<NO\_ROWS> | NULL | \<NO\_ROWS> |
+-------------+------------------------+---------------------+---------------------+------------------+
| Min | NULL | \<NO\_ROWS> | NULL | \<NO\_ROWS> |
+-------------+------------------------+---------------------+---------------------+------------------+
| Max | NULL | \<NO\_ROWS> | NULL | \<NO\_ROWS> |
+-------------+------------------------+---------------------+---------------------+------------------+
| Collect | [] | \<NO\_ROWS> | N/A | N/A |
+-------------+------------------------+---------------------+---------------------+------------------+
Where:
1. `MATCH (n) RETURN <OP>(n.prop)`
2. `MATCH (n) RETURN <OP>(n.prop), n.prop2`
3. `SELECT <OP>(prop) FROM Table`
4. `SELECT <OP>(prop), prop2 FROM Table GROUP BY prop2`
Neo's Cypher implementation diverges from SQL only when performing `SUM`.
Memgraph implements SQL-like behavior. It is considered that `SUM` of
arbitrary elements should not be implicitly 0, especially in a property graph
without a strict schema (the property in question can contain values of
arbitrary types, or no values at all).
### OrderBy
The `OrderBy` logical operator sorts the results in the desired order. It
occurs in Cypher as part of a `WITH` or `RETURN` clause. Both the concept and
the implementation are straightforward. It's necessary for the logical op to
`Pull` everything from its input so it can be sorted. It's not necessary to
keep the whole `Frame` state of each input, it is sufficient to keep a list of
`TypedValues` on which the results will be sorted, and another list of values
that need to be remembered and recreated on the `Frame` when yielding.
The sorting itself is made to reflect that of Neo's implementation, which comes
down to the following points (a sketch of such a comparator is given after the
list).
* `Null` comes last (as if it's greater than anything).
* Primitive types compare naturally, with no implicit casting except from
`int` to `double`.
* Complex types are not comparable.
* Every unsupported comparison results in an exception that gets propagated
to the end user.
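
A comparator respecting those rules might look roughly like the sketch below;
it is illustrative only, with a toy value type instead of Memgraph's
`TypedValue`:

    #include <cstdint>
    #include <variant>

    // Toy value type: Null, int or double. The real TypedValue has many more
    // alternatives (bool, string, list, map, vertex, edge, ...); comparing
    // those complex types raises an exception that propagates to the end user.
    struct Null {};
    using Value = std::variant<Null, int64_t, double>;

    // Returns true if a sorts strictly before b in ascending order.
    inline bool SortsBefore(const Value &a, const Value &b) {
      // Null comes last, as if it were greater than anything.
      if (std::holds_alternative<Null>(a)) return false;
      if (std::holds_alternative<Null>(b)) return true;
      // Primitives compare naturally; the only implicit cast is int to double.
      auto as_double = [](const Value &v) {
        if (const auto *i = std::get_if<int64_t>(&v)) return static_cast<double>(*i);
        return std::get<double>(v);
      };
      return as_double(a) < as_double(b);
    }

Such a predicate can then be handed to a standard sort over the accumulated
results.
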
### Limit in Write Queries
`Limit` can be used as part of a write query, in which case it will *not*
reduce the amount of performed updates. For example, consider a database that
has 10 vertices. The query `MATCH (n) SET n.x = 1 RETURN n LIMIT 3` will
result in all vertices having their property value changed, while returning
only the first 3 to the client. This makes sense from the implementation
standpoint, because `Accumulate` is planned after `SetProperty` but before
`Produce` and `Limit` operations. Note that this behavior can be
non-deterministic in some queries, since it relies on the order of iteration
over nodes which is undefined when not explicitly specified.
### Merge
`MERGE` in Cypher attempts to match a pattern. If it already exists, it does
nothing and subsequent clauses like `RETURN` can use the matched pattern
elements. If the pattern can't match to any data, it creates it. For detailed
information see Neo4j's [merge
documentation.](https://neo4j.com/docs/developer-manual/current/cypher/clauses/merge/)
An important thing about `MERGE` is visibility of modified data. `MERGE` takes
an input (typically a `MATCH`) and has two additional *phases*: the merging
part, and the subsequent set parts (`ON MATCH SET` and `ON CREATE SET`).
Analysis of Neo4j's behavior indicates that each of these three phases (input,
merge, set) does not see changes to the graph state done by a subsequent phase.
The input phase does not see data created by the merge phase, nor the set
phase. This is consistent with what seems like the general Cypher philosophy
that query clause effects aren't visible in the preceding clauses.
We define the `Merge` logical operator as a *routing* operator that uses three
logical operator branches.
1. The input from a preceding clause.
For example in `MATCH (n), (m) MERGE (n)-[:T]-(m)`. This input is
optional because `MERGE` is allowed to be the first clause in a query.
2. The `merge_match` branch.
This logical operator branch is `Pull`-ed from until exhausted for each
successful `Pull` from the input branch.
3. The `merge_create` branch.
This branch is `Pull`ed when the `merge_match` branch does not match
anything (no successful `Pull`s) for an input `Pull`. It is `Pull`ed only
once in such a situation, since only one creation needs to occur for a
failed match.
The `ON MATCH SET` and `ON CREATE SET` parts of the `MERGE` clause are
included in the `merge_match` and `merge_create` branches respectively. They
are placed on the end of their branches so that they execute only when those
branches succeed.
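
The routing logic can be sketched roughly as follows, reusing the simplified
`Cursor` and `Frame` types from the sections above (hypothetical; details such
as resetting the branch cursors for every new input row are omitted):

    #include <memory>
    #include <utility>

    class MergeCursorSketch final : public Cursor {
     public:
      MergeCursorSketch(std::unique_ptr<Cursor> input, std::unique_ptr<Cursor> merge_match,
                        std::unique_ptr<Cursor> merge_create)
          : input_(std::move(input)),
            merge_match_(std::move(merge_match)),
            merge_create_(std::move(merge_create)) {}

      bool Pull(Frame &frame) override {
        while (true) {
          if (pull_input_) {
            if (!input_->Pull(frame)) return false;
            pull_input_ = false;
            matched_ = false;
          }
          // Exhaust the merge_match branch for the current input row.
          if (merge_match_->Pull(frame)) {
            matched_ = true;
            return true;
          }
          pull_input_ = true;
          // Nothing matched for this input row: create the pattern exactly once.
          if (!matched_ && merge_create_->Pull(frame)) return true;
        }
      }

     private:
      std::unique_ptr<Cursor> input_, merge_match_, merge_create_;
      bool pull_input_ = true;
      bool matched_ = false;
    };
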
Memgraph strives to be consistent with Neo in its `MERGE` implementation,
while at the same time keeping performance as good as possible. Consistency
with Neo w.r.t. graph state visibility is not trivial. Documentation for
`Expand` and `Set` describe how Memgraph keeps track of both the updated
version of an edge/vertex and the old one, as it was before the current
transaction+command. This technique is also used in `Merge`. The input
phase/branch of `Merge` always looks at the old data. The merge phase needs to
see the new data so it doesn't create more data than necessary.
For example, consider the query.
MATCH (p:Person) MERGE (c:City {name: p.lives_in})
This query needs to create a city node only once for each unique `p.lives_in`.
Finally, the set phase of a `MERGE` clause should not affect the merge phase.
To achieve this, the `merge_match` branch of the `Merge` operator should see
the latest created nodes, but filter them on their old state (if those nodes
were not created by the `merge_create` branch). Implementation-wise that means
that `ScanAll` and `Expand` operators in the `merge_match` branch need to look
at the new graph state, while `Filter` operators look at the old, if available.

62
docs/dev/query/parsing.md Normal file

@@ -0,0 +1,62 @@
# Lexical and Syntactic Analysis
## Antlr
We use Antlr for lexical and syntactic analysis of Cypher queries. Antlr uses
the grammar file `Cypher.g4`, downloaded from http://www.opencypher.org, to
generate the parser and the visitor for the Cypher parse tree. Even though the
provided grammar is not very pleasant to work with, we decided not to do any drastic
changes to it so that our transition to newly published versions of
`Cypher.g4` would be easier. Nevertheless, we had to fix some bugs and add
features, so our version is not completely the same.
In addition to using `Cypher.g4`, we have `MemgraphCypher.g4`. This grammar
file defines Memgraph specific extensions to the original grammar. The most
notable example is the inclusion of syntax for handling authorization. At the
moment, some extensions are also found in `Cypher.g4`. For example, the syntax
for using a lambda function in relationship patterns. These extensions should
be moved out of `Cypher.g4`, so that it remains as close to the original
grammar as possible. Additionally, having `MemgraphCypher.g4` may not be
enough if we wish to split the functionality for community and enterprise
editions of Memgraph.
## Abstract Syntax Tree (AST)
Since the Antlr generated visitor and the official openCypher grammar are not
very practical to use, we translate Antlr's AST into our own AST. Currently there
are ~40 types of nodes in our AST. Their definitions can be found in
`src/query/frontend/ast/ast.lcp`.
Major groups of types can be found under the following base types.
* `Expression` --- types corresponding to Cypher expressions.
* `Clause` --- types corresponding to Cypher clauses.
* `PatternAtom` --- node or edge related information.
* `Query` --- different kinds of queries, allows extending the language with
Memgraph specific query syntax.
Memory management of created AST nodes is done with `AstStorage`. Each node
must be created by invoking the `AstStorage::Create` method. This way, all of the
pointers to nodes and their children are raw pointers. The only owner of
allocated memory is the `AstStorage`. When the storage goes out of scope, the
pointers become invalid. It may be more natural to handle tree ownership via
`unique_ptr`, i.e. each node owns its children. But there are some benefits to
having a custom storage and allocation scheme.
The primary reason we opted for not using `unique_ptr` is the requirement of
Antlr's base visitor class that the resulting values must be copyable. The
result is wrapped in `antlr::Any` so that the derived visitor classes may
return any type they wish when visiting Antlr's AST. Unfortunately,
`antlr::Any` does not work with non-copyable types.
Another benefit of having `AstStorage` is that we can easily add a different
allocation scheme for AST nodes. The interface of node creation would not
change.
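
The ownership scheme can be illustrated with a sketch like the following; the
names are simplified and the real `AstStorage` in `src/query/frontend/ast/`
does more:

    #include <memory>
    #include <string>
    #include <utility>
    #include <vector>

    // Simplified base class for all AST nodes.
    struct Tree {
      virtual ~Tree() = default;
    };

    struct Identifier : Tree {
      explicit Identifier(std::string name) : name_(std::move(name)) {}
      std::string name_;
    };

    // Owns every created node; hands out raw pointers that stay valid only
    // while the storage itself is alive.
    class AstStorageSketch {
     public:
      template <class TNode, class... TArgs>
      TNode *Create(TArgs &&...args) {
        auto node = std::make_unique<TNode>(std::forward<TArgs>(args)...);
        TNode *ptr = node.get();
        storage_.push_back(std::move(node));
        return ptr;
      }

     private:
      std::vector<std::unique_ptr<Tree>> storage_;
    };

    // Usage: the returned raw pointer is owned by the storage.
    //   AstStorageSketch storage;
    //   auto *ident = storage.Create<Identifier>("n");
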
### AST Translation
The translation process is done via the `CypherMainVisitor` class, which is
derived from the Antlr generated visitor. Besides instantiating our AST types, a
minimal number of syntactic checks are done on a query. These checks handle
the cases which were valid in the original openCypher grammar, but may be invalid
when combined with other syntax elements.

487
docs/dev/query/planning.md Normal file

@@ -0,0 +1,487 @@
# Logical Planning
After the semantic analysis and symbol generation, the AST is converted to a
tree of logical operators. This conversion is called *planning* and the tree
of logical operators is called a *plan*. The whole planning process is done in
the following steps.
1. [AST Preprocessing](#ast-preprocessing)
The first step is to preprocess the AST by collecting
information on filters, divide the query into parts, normalize patterns
in `MATCH` clauses, etc.
2. [Logical Operator Planning](#logical-operator-planning)
After the preprocess step, the planning can be done via 2 planners:
`VariableStartPlanner` and `RuleBasedPlanner`. The first planner will
generate multiple plans where each plan has different starting points for
searching the patterns in `MATCH` clauses. The second planner produces a
single plan by mapping the query parts as they are to logical operators.
3. [Logical Plan Postprocessing](#logical-plan-postprocessing)
In this stage, we perform various transformations on the generated logical
plan. Here we want to optimize the operations in order to improve
performance during the execution. Naturally, transformations need to
preserve the semantic behaviour of the original plan.
4. [Cost Estimation](#cost-estimation)
After the generation, the execution cost of each plan is estimated. This
estimation is used to select the best plan which will be executed.
5. [Distributed Planning](#distributed-planning)
In case we are running distributed Memgraph, the final plan is adapted
for distributed execution. NOTE: This appears to be an error in the
workflow. Distributed planning should be moved before step 3. or
integrated with it. With the workflow ordered as is now, cost estimation
doesn't consider the distributed plan.
The implementation can be found in the `query/plan` directory, with the public
entry point being `query/plan/planner.hpp`.
## AST Preprocessing
Each openCypher query consists of at least 1 **single query**. Multiple single
queries are chained together using a **query combinator**. Currently, there is
only one combinator, `UNION`. The preprocessing step starts in the
`CollectQueryParts` function. This function will take a look at each single
query and divide it into parts. Each part is separated with `RETURN` and
`WITH` clauses. For example:
    MATCH (n) CREATE (m) WITH m MATCH (l)-[]-(m) RETURN l
    |                          |                        |
    |--------- part 1 --------+-------- part 2 --------|
    |                                                   |
    |------------------- single query ------------------|
Each part is created by collecting all `MATCH` clauses and *normalizing* their
patterns. Pattern normalization is the process of converting an arbitrarily
long pattern chain of nodes and edges into a list of triplets `(start node,
edge, end node)`. The triplets should preserve the semantics of the match. For
example:
MATCH (a)-[p]-(b)-[q]-(c)-[r]-(d)
is equivalent to:
MATCH (a)-[p]-(b), (b)-[q]-(c), (c)-[r]-(d)
With this representation, it becomes easier to reorder the triplets and choose
different strategies for pattern matching.
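
As an illustrative sketch (with hypothetical, simplified types), normalization
walks the chain of pattern atoms and emits one triplet per edge:

    #include <cstddef>
    #include <string>
    #include <vector>

    struct Triplet {
      std::string start_node, edge, end_node;
    };

    // The chain alternates node and edge names, e.g. {"a", "p", "b", "q", "c"}.
    inline std::vector<Triplet> NormalizePattern(const std::vector<std::string> &chain) {
      std::vector<Triplet> triplets;
      // Every edge sits at an odd index and connects the nodes around it.
      for (std::size_t i = 1; i + 1 < chain.size(); i += 2) {
        triplets.push_back({chain[i - 1], chain[i], chain[i + 1]});
      }
      return triplets;
    }
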
In addition to normalizing patterns, all of the filter expressions in patterns
and inside of the `WHERE` clause (of the accompanying `MATCH`) are extracted
and stored separately. During the extraction, symbols used in the filter
expression are collected. This allows for planning filters in a valid order,
as the matching for triplets is being done. Another important benefit of
having extra information on filters is the ability to recognize when a database index
could be used.
After each `MATCH` is processed, they are all grouped, so that even the whole
`MATCH` clauses may be reordered. The important thing is to remember which
symbols were used to name edges in each `MATCH`. With those symbols we can
plan for *cyphermorphism*, i.e. ensure different edges in the search pattern
of a single `MATCH` map to different edges in the graph. This preserves the
semantics of the query, even though we may have reordered the matching. The
same steps are done for `OPTIONAL MATCH`.
Another clause which needs processing is `MERGE`. Here we normalize the
pattern, since the `MERGE` is a bit like `MATCH` and `CREATE` in one.
All the other clauses are left as is.
In the end, each query part consists of:
* processed and grouped `MATCH` clauses;
* processed and grouped `OPTIONAL MATCH` clauses;
* processed `MERGE` matching pattern and
* unchanged remaining clauses.
The last stored clause is guaranteed to be either `WITH` or `RETURN`.
## Logical Operator Planning
### Variable Start Planner
The `VariableStartPlanner` generates multiple plans for a single query. Each
plan is generated by selecting a different starting point for pattern
matching.
The algorithm works as follows.
1. For each query part:
1. For each node in triplets of collected `MATCH` clauses:
i. Add the node to a set of `expanded` nodes
ii. Select a triplet `(start node, edge, end node)` whose `start node` is
in the `expanded` set
iii. If no triplet was selected, choose a new starting node that isn't in
`expanded` and continue expanding
iv. Repeat steps ii. -- iii. until all triplets have been selected
and store that as a variation of the `MATCH` clauses
2. Do step 1.1. for `OPTIONAL MATCH` and `MERGE` clauses
3. Take all combinations of the generated `MATCH`, `OPTIONAL MATCH` and
`MERGE` and store them as variations of the query part.
2. For each combination of query part variations:
1. Generate a plan using the rule based planner
### Rule Based Planner
The `RuleBasedPlanner` generates a single plan for a single query. A plan is
generated by following hardcoded rules for producing logical operators. The
following sections are an overview on how each openCypher clause is converted
to a `LogicalOperator`.
#### MATCH
The `MATCH` clause is used to specify which patterns need to be searched for in
the database. These patterns are normalized in the preprocess step to be
represented as triplets `(start node, edge, end node)`. When there is no edge,
then the triplet is reduced only to the `start node`. Generating the operators
is done by looping over these triplets.
##### Searching for Nodes
The simplest search is finding stand-alone nodes. For example, `MATCH (n)`
will find all the nodes in the graph. This is accomplished by generating a
`ScanAll` operator and forwarding the node symbol which should store the
results. In this case, all the nodes will be referenced by `n`.
Multiple nodes can be specified in a single match, e.g. `MATCH (n), (m)`.
Planning is done by repeating the same steps for each sub pattern (separated
by a comma). In this case, we would get 2 `ScanAll` operators chained one
after the other. An optimization can be obtained if the node in the pattern is
already searched for. In `MATCH (n), (n)` we can drop the second `ScanAll`
operator since we have already generated it for the first node.
##### Searching for Relationships
A more advanced search includes finding nodes with relationships. For example,
`MATCH (n)-[r]-(m)` should find every pair of connected nodes in the database.
This means, that if a single node has multiple connections, it will be
repeated for each combination of pairs. The generation of operators starts
from the first node in the pattern. If we are referencing a new starting node,
we need to generate a `ScanAll` which finds all the nodes and stores them
into `n`. Then, we generate an `Expand` operator which reads the `n` and
traverses all the edges of that node. The edge is stored into `r`, while the
destination node is stored in `m`.
Matching multiple relationships proceeds similarly, by repeating the same
steps. The only difference is that we need to ensure different edges in the
search pattern map to different edges in the graph. This means that after
each `Expand` operator, we need to generate an `ExpandUniquenessFilter`. We
provide this operator with a list of symbols for the previously matched edges
and the symbol for the current edge.
For example.
MATCH (n)-[r1]-(m)-[r2]-(l)
The above is preprocessed into
MATCH (n)-[r1]-(m), (m)-[r2]-(l)
Then we look at each triplet in order and perform the described steps. This
way, we would generate:
ScanAll (n) > Expand (n, r1, m) > Expand (m, r2, l) >
ExpandUniquenessFilter ([r1], r2)
Note that we don't need to make `ExpandUniquenessFilter` after the first
`Expand`, since there are no edges to compare to. This filtering needs to work
across multiple patterns, but inside a *single* `MATCH` clause.
Let's take a look at the following.
MATCH (n)-[r1]-(m), (m)-[r2]-(l)
We would also generate the exact same operators.
ScanAll (n) > Expand (n, r1, m) > Expand (m, r2, l) >
ExpandUniquenessFilter ([r1], r2)
On the other hand,
MATCH (n)-[r1]-(m) MATCH (m)-[r2]-(l)-[r3]-(i)
We would reset the uniqueness filtering at the start of the second match. This
would mean that we output the following:
ScanAll (n) > Expand (n, r1, m) > Expand (m, r2, l) > Expand (l, r3, i) >
ExpandUniquenessFilter ([r2], r3)
There is a difference in how we handle edge uniqueness compared to Neo4j.
Neo4j does not allow searching for a single edge multiple times, but we've
decided to support that.
For example, the user can say the following.
MATCH (n)-[r]-(m)-[r]-(l)
We would ensure that both `r` variables match to the same edge. In our
terminology, we call this the *edge cycle*. For the above example, we would
generate this plan.
ScanAll (n) > Expand (n, r, m) > Expand (m, r, l)
We do not put an `ExpandUniquenessFilter` operator between 2 `Expand`
operators and we tell the 2nd `Expand` that it is an edge cycle. Thus, the 2nd
`Expand` will ensure that both `r` variables match the same edge.
##### Filtering
To narrow the search down, the patterns in `MATCH` can have filtered labels
and properties. A more general filtering is done using the accompanying
`WHERE` clause. During the preprocess step, all filters are collected and
extracted into expressions. Additional information on which symbols are used
is also stored. This way, each time we generate a `ScanAll` or `Expand`, we
look at all the filters to see if any of them can be used, i.e. if the symbols
they use have been bound by a newly produced operator. If a filter expression
can be used, we immediately add a `Filter` operator with that expression.
For example.
MATCH (n)-[r]-(m :label) WHERE n.prop = 42
We would produce:
ScanAll (n) > Filter (n.prop) > Expand (n, r, m) > Filter (m :label)
This means that the same plan is generated for the query:
MATCH (n {prop: 42})-[r]-(m :label)
#### OPTIONAL
If a `MATCH` clause is preceded by `OPTIONAL`, then we need to generate a plan
such that we produce results even if we fail to match anything. This is
accomplished by generating an `Optional` operator, which takes 2 operator
trees:
* input operation and
* optional operation.
The input is the operation we generated for the part of the query before
`OPTIONAL MATCH`. For the optional operation, we simply generate the `OPTIONAL
MATCH` part just like we would for regular `MATCH`. In addition to operations,
we need to send the symbols which are set during optional matching to the
`Optional` operator. The operator will reset values of those symbols to
`null`, when the optional part fails to match.
#### RETURN & WITH
`RETURN` and `WITH` clauses are very similar to each other. The only
difference is that `WITH` separates parts of the query and can be paired with
a `WHERE` clause.
The common part is generating operators for the body of the clause. Separation
of query parts is mostly done in semantic analysis, which checks that only the
symbols exposed through `WITH` are visible in the query parts after the
clause. The minor part is done in planning.
##### Named Results
Both clauses contain multiple named expressions (`expr AS name`) which are
used to generate `Produce` operator.
##### Aggregations
If an expression contains an aggregation operator (`sum`, `avg`, ...) we need
to plan the `Aggregate` operator as input to `Produce`. This case is more
complex, because aggregation in openCypher can perform implicit grouping of
results used for aggregation.
For example, `WITH/RETURN sum(n.x) AS s, n.y AS group` will implicitly group
by `n.y` expression.
Another, more obscure grouping can be achieved with `RETURN sum(n.a) + n.b AS s`.
Here, the `n.b` will be used for grouping, even though both the `sum` and
`n.b` are in the same named expression.
Therefore, we need to collect all expressions which do not contain
aggregations and use them for grouping. You may have noticed that in the last
example `sum` is actually a sub-expression of `+`. The `Aggregate` operator does
not see that (nor should it), so the responsibility of evaluating that falls
on `Produce`. One way is for `Aggregate` to store results of grouping
expressions on the frame in addition to aggregation results. Unfortunately,
this would require rewiring named expressions in `Produce` to reference
already evaluated expressions. In the current implementation, we opted for
`Aggregate` to store only aggregation results on the frame, while `Produce`
will re-evaluate all the other (grouping) expressions. To handle that, symbols
which are used in expressions are passed to `Aggregate`, so that they can be
remembered. `Produce` will read those symbols from the frame and use them to
re-evaluate the needed expressions.
##### Accumulation
After we have `Produce` and potentially `Aggregate`, we need to handle a
special case when the part of the query before `RETURN` or `WITH` performs
updates. For that, we want to run that part of the query fully, so that we get
the latest results. This is accomplished by adding `Accumulate` operator as
input to `Aggregate` or `Produce` (if there is no aggregation). Accumulation
will store all the values for all the used symbols inside `RETURN` and `WITH`,
so that they can be used in the operator which follows. This way, only parts
of the frame are copied, instead of the whole frame. There is a minor
difference between planning `WITH` compared to `RETURN`. Since `WITH` can
separate writing from reading, we need to advance the transaction command.
This enables the later, read parts of the query to obtain the newest changes.
This is supported by passing the `advance_command` flag to the `Accumulate` operator.
In the simplest case, common to both clauses, we have `Accumulate > Aggregate
> Produce` operators, where `Accumulate` and `Aggregate` may be left out.
##### Ordering
Planning `ORDER BY` is simple enough. Since it may see new symbols (filled in
`Produce`), we add the `OrderBy` operator at the end. The operator will change
the order of produced results, so we pass it the ordering expressions and the
output symbols of named expressions.
##### Filtering
A final difference in `WITH` is when it contains a `WHERE` clause. For that,
we simply generate the `Filter` operator, appended after `Produce` or
`OrderBy` (depending on which operator is last).
##### Skipping and Limiting
If we have `SKIP` or `LIMIT`, we generate `Skip` or `Limit` operators,
respectively. These operators are put at the end of the clause.
This placement may have some unexpected behaviour when combined with
operations that update the graph. For example.
MATCH (n) SET n.x = n.x + 1 RETURN n LIMIT 1
The above query may be interpreted as if the `SET` will be done only once.
Since this is a write query, we need to accumulate results, so the part before
`RETURN` will execute completely. The accumulated results will be yielded up
to the given limit, and the user would get only the first `n` that was
updated. This may confuse the user because in reality, every node in the
database had been updated.
Note that `Skip` always comes before `Limit`. In the current implementation,
they are generated directly one after the other.
#### CREATE
The `CREATE` clause is used to create nodes and edges (relationships).
For multiple `CREATE` clauses or multiple creation patterns in a single
clause, we perform the following steps for each of them.
##### Creating a Single Node
A node is created by simply specifying a node pattern.
For example `CREATE (n :label {property: "value"}), ()` would create 2 nodes.
The 1st one would be created with a label and a property. This node could be
referenced later in the query, by using the variable `n`. The 2nd node cannot
be referenced and it would be created without any labels or properties. For
node creation, we generate a `CreateNode` operator and pass it all the details
of node creation: variable symbol, labels and properties. In the mentioned
example, we would have `CreateNode > CreateNode`.
##### Creating a Relationship
To create a relationship, the `CREATE` clause must contain a pattern with a
directed edge. Compared to creating a single node, this case is a bit more
complicated, because either side of the edge may not exist. By exist, we mean
that the endpoint is a variable which already references a node.
For example, `MATCH (n) CREATE (n)-[r]->(m)` would create an edge `r` and a
node `m` for each matched node `n`. If we focus on the `CREATE` part, we
generate `CreateExpand (n, r, m)` where `n` already exists (refers to matched
node) and `m` would be newly created along with edge `r`. If we had only
`CREATE (n) -[r]-> (m)`, then we would need to create both nodes of the edge
`r`. This is done by generating `CreateNode (n) > CreateExpand(n, r, m)`. The
final case is when both endpoints refer to an existing node. For example, when
adding a node with a cyclical connection `CREATE (n)-[r]->(n)`. In this case,
we would generate `CreateNode (n) > CreateExpand (n, r, n)`. We would tell
`CreateExpand` to only create the edge `r` between the already created `n`.
#### MERGE
Although the merge operation is complex, planning turns out to be relatively
simple. The pattern inside the `MERGE` clause is used for both matching and
creating. Therefore, we create 2 operator trees, one for each action.
For example.
MERGE (n)-[r:r]-(m)
We would generate a single `Merge` operator which has the following.
* No input operation (since it is not preceded by any other clause).
* On match operation
`ScanAll (n) > Expand (n, r, m) > Filter (r)`
* On create operation
`CreateNode (n) > CreateExpand (n, r, m)`
In cases when `MERGE` contains `ON MATCH` and `ON CREATE` parts, we simply
append their operations to the respective operator trees.
Observe the following example.
MERGE (n)-[r:r]-(m) ON MATCH SET n.x = 42 ON CREATE SET m :label
The `Merge` would be generated with the following.
* No input operation (again, since there is no clause preceding it).
* On match operation
`ScanAll (n) > Expand (n, r, m) > Filter (r) > SetProperty (n.x, 42)`
* On create operation
`CreateNode (n) > CreateExpand (n, r, m) > SetLabels (m, :label)`
When we have preceding clauses, we simply put their operator as input to
`Merge`.
MATCH (n) MERGE (n)-[r:r]-(m)
The above would be generated as
ScanAll (n) > Merge (on_match_operation, on_create_operation)
Here we need to be careful to recognize which symbols are already declared.
But, since the `on_match_operation` uses the same algorithm for generating a
`Match`, that problem is handled there. The same should hold for
`on_create_operation`, which uses the process of generating a `Create`. So,
finally for this example, the `Merge` would have:
* Input operation
`ScanAll (n)`
* On match operation
`Expand (n, r, m) > Filter (r)`
Note that `ScanAll` is not needed since we get the nodes from input.
* On create operation
`CreateExpand (n, r, m)`
Note that `CreateNode` is dropped, since we want to expand the existing one.
## Logical Plan Postprocessing
NOTE: TODO
## Cost Estimation
NOTE: TODO
## Distributed Planning
NOTE: TODO

134
docs/dev/query/semantic.md Normal file

@@ -0,0 +1,134 @@
# Semantic Analysis and Symbol Generation
In this phase, various semantic and variable type checks are performed.
Additionally, we generate symbols which map AST nodes to stored values
computed from evaluated expressions.
## Symbol Generation
Implementation can be found in `query/frontend/semantic/symbol_generator.cpp`.
Symbols are generated for each AST node that represents data that needs to
have storage. Currently, these are:
* `NamedExpression`
* `CypherUnion`
* `Identifier`
* `Aggregation`
You may notice that the above AST nodes may not correspond to something named
by a user. For example, `Aggregation` can be a part of a larger expression and
thus remain unnamed. The reason we still generate symbols is to have a uniform
behaviour when executing a query, as well as to allow caching the results of
expression evaluation.
AST nodes do not actually store a `Symbol` instance; instead they have an
`int32_t` index identifying the symbol in the `SymbolTable` class. This is
done to minimize the size of AST types as well as to allow easier sharing of
the same symbols among multiple instances of AST nodes.
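
A rough sketch of that relationship, with simplified, hypothetical member
names:

    #include <cstdint>
    #include <string>
    #include <vector>

    // Each symbol knows its name and its position, which doubles as the index
    // of its slot in the frame during execution.
    struct Symbol {
      std::string name;
      int32_t position;
    };

    class SymbolTable {
     public:
      // Called during symbol generation; the AST node keeps only the position.
      Symbol CreateSymbol(const std::string &name) {
        Symbol symbol{name, static_cast<int32_t>(symbols_.size())};
        symbols_.push_back(symbol);
        return symbol;
      }

      // Look a symbol up from the position stored in an AST node.
      const Symbol &at(int32_t position) const { return symbols_[position]; }

     private:
      std::vector<Symbol> symbols_;
    };
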
The storage for evaluated data is represented by the `Frame` class. Each
symbol determines a unique position in the frame. During interpretation,
evaluation of expressions which have a symbol will either read or store values
in the frame. For example, instance of an `Identifier` will use the symbol to
find and read the value from `Frame`. On the other hand, `NamedExpression`
will take the result of evaluating its own expression and store it in the
`Frame`.
When a symbol is created, the context of creation is used to assign a type to that
symbol. This type is used for simple type checking operations. For example,
`MATCH (n)` will create a symbol for variable `n`. Since the `MATCH (n)`
represents finding a vertex in the graph, we can set `Symbol::Type::Vertex`
for that symbol. Later, for example in `MATCH ()-[n]-()` we see that variable
`n` is used as an edge. Since we already have a symbol for that variable, we
detect this type mismatch and raise a `SemanticException`.
The basic rule of symbol generation is that variables inside `MATCH`, `CREATE`,
`MERGE`, `WITH ... AS` and `RETURN ... AS` clauses establish new symbols.
### Symbols in Patterns
Inside `MATCH`, symbols are created only if they didn't exist before. For
example, patterns in `MATCH (n {a: 5})--(m {b: 5}) RETURN n, m` will create 2
symbols: one for `n` and one for `m`. `RETURN` clause will, in turn, reference
those symbols. Symbols established in a part of a pattern are immediately bound
and visible in later parts. For example, `MATCH (n)--(n)` will create a symbol
for variable `n` in the 1st `(n)`. That symbol is referenced in the 2nd `(n)`. Note
that the symbol is not bound inside the 1st `(n)` itself. What this means is that,
for example, `MATCH (n {a: n.b})` should raise an error, because `n` is not
yet bound when encountering `n.b`. On the other hand,
`MATCH (n)--(n {a: n.b})` is fine.
The `CREATE` is similar to `MATCH`, but it *always* establishes symbols for
variables which create graph elements. What this means is that, for example
`MATCH (n) CREATE (n)` is not allowed. `CREATE` wants to create a new node,
for which we already have a symbol. In such a case, we need to throw an error
that the variable `n` is being redeclared. On the other hand `MATCH (n) CREATE
(n)-[r :r]->(n)` is fine, because `CREATE` will only create the edge `r`,
connecting the already existing node `n`. The remaining behaviour is the same as
in `MATCH`. This means that we can simplify `CREATE` to be like `MATCH` with 2
special cases.
1. Are we creating a node, i.e. `CREATE (n)`? If yes, then the symbol for
`n` must not have been created before. Otherwise, we reference the
existing symbol.
2. Are we creating an edge, i.e. we encounter a variable for an edge inside
`CREATE`? If yes, then that variable must not reference a symbol.
The `MERGE` clause is treated the same as `CREATE` with regards to symbol
generation. The only difference is that we allow bidirectional edges in the
pattern. When creating such a pattern, the direction of the created edge is
arbitrarily determined.
### Symbols in WITH and RETURN
In addition to patterns, new symbols are established in the `WITH` clause.
This clause makes the new symbols visible *only* to the rest of the query.
For example, `MATCH (old) WITH old AS new RETURN new, old` should raise an
error that `old` is unbound inside `RETURN`.
There is a special case with symbol visibility in `WHERE` and `ORDER BY`. They
need to see both the old and the new symbols. Therefore `MATCH (old) RETURN
old AS new ORDER BY old.prop` needs to work. On the other hand, if we perform
aggregations inside `WITH` or `RETURN`, then the old symbols should not be
visible in either `WHERE` or `ORDER BY`. Since the aggregation has to go
through all the results in order to generate the final value, it makes no
sense to store old symbols and their values. A query like `MATCH (old) WITH
SUM(old.prop) AS sum WHERE old.prop = 42 RETURN sum` needs to raise an error
that `old` is unbound inside `WHERE`.
For cases when `SKIP` and `LIMIT` appear, we disallow any identifiers from
appearing in their expressions. Basically, `SKIP` and `LIMIT` can only be
constant expressions[^1]. For example, `MATCH (old) RETURN old AS new SKIP
new.prop` needs to raise an error that variables are not allowed in `SKIP`. It
makes no sense to allow variables, since their values may vary on each
iteration. On the other hand, we could support variables bound to constant
expressions, but for simplicity we do not. For example, `MATCH (old) RETURN old, 2 AS limit_var
LIMIT limit_var` would still throw an error.
Finally, we generate symbols for names created in `RETURN` clause. These
symbols are used for the final results of a query.
NOTE: New symbols in `WITH` and `RETURN` should be unique. This means that
`WITH a AS same, b AS same` is not allowed, and neither is a construct like
`RETURN 2, 2`.
### Symbols in Functions which Establish New Scope
Symbols can also be created in some functions. These functions usually take an
expression, bind a single variable and run the expression inside the newly
established scope.
The `all` function takes a list, creates a variable for the list elements and runs
the predicate expression. For example:
MATCH (n) RETURN n, all(n IN n.prop_list WHERE n < 42)
We create a new symbol for use inside `all`. This means that `WHERE n <
42` uses the `n` which takes values from the elements of `n.prop_list`. The
original `n` bound by `MATCH` is not visible inside the `all` function, but it
is visible outside. Therefore, the `RETURN n` and `n.prop_list` reference the
`n` from `MATCH`.
[^1]: Constant expressions are expressions for which the result can be
computed at compile time.