Move query implementation from Phriction to this repo
Summary:
Our query parsing, planning and execution architecture was described on Phabricator wiki pages, Phriction. This commit copies the said documentation here, so that it's easier to access for all developers. Additional benefit is tracking the changes and hopefully suggesting to developers to keep it up to date. Besides making a copy, the documentation has been updated to reflect the current state of the codebase. Note that some things are still missing, but what was written should now be correct.

Reviewers: mtomic, llugovic

Reviewed By: mtomic

Subscribers: pullbot

Differential Revision: https://phabricator.memgraph.io/D1854
This commit is contained in:
parent
a14c4f1864
commit
6f10b1c115
1
docs/dev/query/.gitignore
vendored
Normal file
@@ -0,0 +1 @@
|
||||
html/
|
11
docs/dev/query/build-html
Executable file
@@ -0,0 +1,11 @@
|
||||
#!/bin/bash
|
||||
|
||||
script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
|
||||
|
||||
mkdir -p $script_dir/html
|
||||
|
||||
for markdown_file in $(find $script_dir -name '*.md'); do
|
||||
name=$(basename -s .md $markdown_file)
|
||||
sed -e 's/\.md/.html/g' "$markdown_file" | \
|
||||
pandoc -s -f markdown -t html -o $script_dir/html/$name.html
|
||||
done
|
14
docs/dev/query/contents.md
Normal file
@@ -0,0 +1,14 @@
|
||||
# Query Parsing, Planning and Execution
|
||||
|
||||
This part of the documentation deals with query execution.
|
||||
|
||||
Memgraph currently supports only query interpretation. Each new query is
|
||||
parsed, analysed and translated into a sequence of operations which are then
|
||||
executed on the main database storage. Query execution is organized into the
|
||||
following phases:
|
||||
|
||||
1. [Lexical Analysis (Tokenization)](parsing.md)
|
||||
2. [Syntactic Analysis (Parsing)](parsing.md)
|
||||
3. [Semantic Analysis and Symbol Generation](semantic.md)
|
||||
4. [Logical Planning](planning.md)
|
||||
5. [Logical Plan Execution](execution.md)
|
373
docs/dev/query/execution.md
Normal file
@@ -0,0 +1,373 @@
|
||||
# Logical Plan Execution
|
||||
|
||||
We implement classical iterator style operators. Logical operators define
|
||||
operations on the database. They encapsulate the following info: what the input is
|
||||
(another `LogicalOperator`), what to do with the data, and how to do it.
|
||||
|
||||
Currently logical operators can have zero or more input operations, and thus a
|
||||
`LogicalOperator` tree is formed. Most `LogicalOperator` types have only one
|
||||
input, so we are mostly working with chains instead of full-fledged trees.
|
||||
You can find information on each operator in `src/query/plan/operator.lcp`.
|
||||
|
||||
## Cursor
|
||||
|
||||
Logical operators do not perform database work themselves. Instead they create
|
||||
`Cursor` objects that do the actual work, based on the info in the operator.
|
||||
Cursors expose a `Pull` method that gets called by the cursor's consumer. The
|
||||
consumer keeps pulling as long as the `Pull` returns `true` (indicating it
|
||||
successfully performed some work and might be eligible for another `Pull`).
|
||||
Most cursors will call the `Pull` function of their input cursor, so
typically a cursor chain is created that is analogous to the logical operator
|
||||
chain it's created from.
|
||||
|
||||
## Frame
|
||||
|
||||
The `Frame` object contains all the data of the current `Pull` chain. It
|
||||
serves for communicating data between cursors.
|
||||
|
||||
For example, in a `MATCH (n) RETURN n` query the `ScanAllCursor` places a
|
||||
vertex on the `Frame` for each `Pull`. It places it on the place reserved for
|
||||
the `n` symbol. Then the `ProduceCursor` can take that same value from the
|
||||
`Frame` because it knows the appropriate symbol. `Frame` positions are indexed
|
||||
by `Symbol` objects.
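
The interplay between cursors and the `Frame` can be illustrated with a small Python sketch. This is not Memgraph's C++ code: the class names mirror the operators mentioned above, but the constructors, the `pull` signature, the string symbols and the dictionary-based frame are assumptions made purely for illustration.

```python
class Frame(dict):
    """Maps a symbol (here just a string) to its current value."""


class ScanAllCursor:
    """Places one vertex per pull on the frame slot of `symbol`."""

    def __init__(self, vertices, symbol):
        self._it = iter(vertices)
        self._symbol = symbol

    def pull(self, frame):
        for vertex in self._it:
            frame[self._symbol] = vertex
            return True
        return False  # input exhausted


class ProduceCursor:
    """Pulls from its input and reads the named symbols off the frame."""

    def __init__(self, input_cursor, symbols):
        self._input = input_cursor
        self._symbols = symbols
        self.row = None

    def pull(self, frame):
        if not self._input.pull(frame):
            return False
        self.row = [frame[symbol] for symbol in self._symbols]
        return True


def run(cursor):
    """The consumer: keeps pulling while pull() returns True."""
    frame, rows = Frame(), []
    while cursor.pull(frame):
        rows.append(cursor.row)
    return rows


# MATCH (n) RETURN n over three hypothetical vertices.
plan = ProduceCursor(ScanAllCursor(["v1", "v2", "v3"], "n"), ["n"])
results = run(plan)
```

Each top-level `pull` travels down the chain, so a vertex is scanned, written to the frame and read back before the next vertex is touched.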
|
||||
|
||||
## ExpressionEvaluator
|
||||
|
||||
Expression results are not placed on the `Frame` since they do not need to be
|
||||
communicated between different `Cursors`. Instead, expressions are evaluated
|
||||
using an instance of `ExpressionEvaluator`. Since generally speaking an
|
||||
expression can be defined by a tree of subexpressions, the
|
||||
`ExpressionEvaluator` is implemented as a tree visitor. There is a performance
|
||||
sub-optimality here because a stack is used to communicate intermediary
|
||||
expression results between elements of the tree. This is one of the reasons
|
||||
why it's planned to use `Frame` for intermediary expression results as well.
|
||||
The other reason is that it might facilitate compilation later on.
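
The stack-based visitor described above can be sketched in Python as follows. The node classes (`Literal`, `Add`, `Multiply`) and method names are hypothetical and chosen only for this illustration; the real evaluator is written in C++ and handles many more expression types.

```python
class Literal:
    def __init__(self, value):
        self.value = value

    def accept(self, visitor):
        visitor.visit_literal(self)


class Add:
    def __init__(self, left, right):
        self.left, self.right = left, right

    def accept(self, visitor):
        visitor.visit_add(self)


class Multiply:
    def __init__(self, left, right):
        self.left, self.right = left, right

    def accept(self, visitor):
        visitor.visit_multiply(self)


class ExpressionEvaluator:
    """Tree visitor that passes intermediate results through a stack."""

    def __init__(self):
        self._stack = []

    def visit_literal(self, node):
        self._stack.append(node.value)

    def _visit_binary(self, node, operation):
        node.left.accept(self)
        node.right.accept(self)
        right = self._stack.pop()  # pushed last, popped first
        left = self._stack.pop()
        self._stack.append(operation(left, right))

    def visit_add(self, node):
        self._visit_binary(node, lambda a, b: a + b)

    def visit_multiply(self, node):
        self._visit_binary(node, lambda a, b: a * b)

    def evaluate(self, expression):
        expression.accept(self)
        return self._stack.pop()


# 1 + 2 * 3
value = ExpressionEvaluator().evaluate(
    Add(Literal(1), Multiply(Literal(2), Literal(3))))
```

Every subexpression costs a push and a pop, which is the overhead the paragraph above refers to.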
|
||||
|
||||
## Cypher Execution Semantics
|
||||
|
||||
Cypher query execution has *mostly* well-defined semantics. Some are
|
||||
explicitly defined by openCypher and its TCK, while others are implicitly
|
||||
defined by Neo4j's implementation of Cypher that we want to be generally
|
||||
compatible with.
|
||||
|
||||
These semantics can in short be described as follows: a Cypher query consists
|
||||
of multiple clauses, some of which modify the graph. Generally, every clause in the
|
||||
query, when reading it left to right, operates on a consistent state of the
|
||||
property graph, untouched by subsequent clauses. This means that a `MATCH`
|
||||
clause in the beginning operates on a graph-state in which modifications by
|
||||
the subsequent `SET` are not visible.
|
||||
|
||||
The stated semantics feel very natural to the end-user, and Neo seems to
|
||||
implement them well. For Memgraph the situation is complex because
|
||||
`LogicalOperator` execution (through a `Cursor`) happens one `Pull` at a time
|
||||
(generally meaning all the query clauses get executed for every top-level
|
||||
`Pull`). This is not inherently consistent with Cypher semantics because a
|
||||
`SET` clause can modify data, and the `MATCH` clause that precedes it might
|
||||
see the modification in a subsequent `Pull`. Also, the `RETURN` clause might
|
||||
want to stream results to the user before all `SET` clauses have been
|
||||
executed, so the user might see some intermediate graph state. There are many
|
||||
edge-cases that Memgraph does its best to avoid to stay true to Cypher
|
||||
semantics, while at the same time using a high-performance streaming approach.
|
||||
The edge-cases are enumerated in this document along with the implementation
|
||||
details they imply.
|
||||
|
||||
## Implementation Peculiarities
|
||||
|
||||
### Once
|
||||
|
||||
An operator that does nothing but whose `Cursor::Pull` returns `true` on the
|
||||
first `Pull` and `false` on subsequent ones. This operator is used when
|
||||
another operator has an optional input, because in Cypher a clause will
|
||||
typically execute once for every input from the preceding clauses, or just
|
||||
once if there was no preceding input. For example, consider the `CREATE`
|
||||
clause. In the query `CREATE (n)` only one node is created, while in the query
|
||||
`MATCH (n) CREATE (m)` a node is created for each existing node. Thus in our
|
||||
`CreateNode` logical operator the input is either a `ScanAll` operator, or a
|
||||
`Once` operator.
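
A minimal Python sketch of how `Once` lets `CreateNode` treat both cases uniformly is given below. The cursor classes and their constructors are assumptions for illustration only, and the list-based graph is hypothetical.

```python
class OnceCursor:
    """Succeeds on the first pull only."""

    def __init__(self):
        self._pulled = False

    def pull(self, frame):
        if self._pulled:
            return False
        self._pulled = True
        return True


class ScanAllCursor:
    """Yields one existing vertex per pull."""

    def __init__(self, graph, symbol):
        # Iterate over a snapshot so that vertices created by this query
        # are not scanned (a crude stand-in for the old graph view).
        self._it = iter(list(graph))
        self._symbol = symbol

    def pull(self, frame):
        for vertex in self._it:
            frame[self._symbol] = vertex
            return True
        return False


class CreateNodeCursor:
    """Creates one vertex for every successful pull of its input."""

    def __init__(self, input_cursor, graph):
        self._input = input_cursor
        self._graph = graph

    def pull(self, frame):
        if not self._input.pull(frame):
            return False
        self._graph.append({})  # the new, empty vertex
        return True


def execute(cursor):
    frame = {}
    while cursor.pull(frame):
        pass


# CREATE (n) on an empty graph: the input is Once.
empty_graph = []
execute(CreateNodeCursor(OnceCursor(), empty_graph))

# MATCH (n) CREATE (m) on three vertices: the input is ScanAll.
graph = [{}, {}, {}]
execute(CreateNodeCursor(ScanAllCursor(graph, "n"), graph))
```

With `Once` as the input, `CreateNodeCursor` needs no special case for a missing preceding clause.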
|
||||
|
||||
### GraphView
|
||||
|
||||
In the previous section, [Cypher Execution
|
||||
Semantics](#cypher-execution-semantics), we mentioned how the preceding
|
||||
clauses should not see changes made in subsequent ones. For that reason, some
|
||||
operators take a `GraphView` enum value. This value determines which state of
|
||||
the graph an operator sees.
|
||||
|
||||
Consider the query `MATCH (n)--(m) WHERE n.x = 0 SET m.x = 1`. Naive streaming
|
||||
could match a vertex `n` on the given criteria, expand to `m`, update its
property, and in the next iteration consider the vertex previously matched to
`m` and skip it because its newly set property value does not qualify. This
|
||||
is not how Cypher works. To handle this issue properly, Memgraph designed the
|
||||
`VertexAccessor` class that tracks two versions of data: one that was visible
|
||||
before the current transaction+command, and the optional other that was
|
||||
created in the current transaction+command. The `MATCH` clause will be planned
|
||||
as `ScanAll` and `Expand` operations using `GraphView::OLD` value. This will
|
||||
ensure modifications performed in the same query do not affect it. The same
|
||||
applies to edges and the `EdgeAccessor` class.
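
The effect of the graph view can be shown with a small Python model of the query above, run on two connected vertices that both have `x = 0`. The `VertexAccessor` below is a deliberately simplified, hypothetical stand-in (its `get` and `set` methods are assumptions), not the actual class.

```python
from enum import Enum


class GraphView(Enum):
    OLD = 0
    NEW = 1


class VertexAccessor:
    """Tracks the state before the command and an optional updated state."""

    def __init__(self, **props):
        self.old = dict(props)
        self.new = None  # created lazily on the first write

    def get(self, key, view):
        if view is GraphView.NEW and self.new is not None:
            return self.new.get(key)
        return self.old.get(key)

    def set(self, key, value):
        if self.new is None:
            self.new = dict(self.old)
        self.new[key] = value


def run_query(match_view):
    """MATCH (n)--(m) WHERE n.x = 0 SET m.x = 1 on a single edge a--b."""
    a, b = VertexAccessor(x=0), VertexAccessor(x=0)
    # The undirected pattern matches the edge in both directions.
    for n, m in [(a, b), (b, a)]:
        if n.get("x", match_view) == 0:
            m.set("x", 1)
    return a.get("x", GraphView.NEW), b.get("x", GraphView.NEW)


old_result = run_query(GraphView.OLD)  # both vertices are updated
new_result = run_query(GraphView.NEW)  # the second match is wrongly skipped
```

Reading through `GraphView.OLD` makes the filter independent of writes performed earlier in the same query.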
|
||||
|
||||
### Existing Record Detection
|
||||
|
||||
It's possible that a pattern element has already been declared in the same
|
||||
pattern, or a preceding pattern. For example `MATCH (n)--(m), (n)--(l)` or a
|
||||
cycle-detection match `MATCH (n)-->(n) RETURN n`. Implementation-wise,
|
||||
existing record detection just checks that the expanded record is equal to the
|
||||
one already on the frame.
|
||||
|
||||
### Why Not Use Separate Expansion Ops for Edges and Vertices?
|
||||
|
||||
Expanding an edge and a vertex in separate ops is not feasible when matching a
|
||||
cycle in bi-directional expansions. Consider the query `MATCH (n)--(n) RETURN
|
||||
n`. Let's try to expand first the edge in one op, and vertex in the next. The
|
||||
vertex expansion consumes the edge expansion input. It takes the expanded edge
|
||||
from the frame. It needs to detect a cycle by comparing the vertex existing on
|
||||
the frame with one of the edge vertices (`from` or `to`). But which one? It
|
||||
doesn't know, and can't ensure correct cycle detection.
|
||||
|
||||
### Data Visibility During and After SET
|
||||
|
||||
In Cypher, setting values always works on the latest version of data (from
|
||||
preceding or current clause). That means that within a `SET` clause all the
|
||||
changes from previous clauses must be visible, as well as changes done by the
|
||||
current `SET` clause. Also, if there is a clause after `SET` it must see *all*
|
||||
the changes performed by the preceding `SET`. Both these things are best
|
||||
illustrated with the following queries executed on an empty database:
|
||||
|
||||
    CREATE (n:A {x:0})-[:EdgeType]->(m:B {x:0})
    MATCH (n)--(m) SET m.x = n.x + 1 RETURN labels(n), n.x, labels(m), m.x
|
||||
|
||||
This returns:
|
||||
|
||||
+---------+---+---------+---+
|labels(n)|n.x|labels(m)|m.x|
+:=======:+:=:+:=======:+:=:+
|[A]      |2  |[B]      |1  |
+---------+---+---------+---+
|[B]      |1  |[A]      |2  |
+---------+---+---------+---+
|
||||
|
||||
The obtained result implies the following operations:
|
||||
|
||||
1. In the first iteration set the value of the `B.x` to 1
|
||||
2. In the second iteration we observe `B.x` with the value of 1 and set
|
||||
`A.x` to 2
|
||||
3. In `RETURN` we see all the changes made in both iterations
|
||||
|
||||
To implement the desired behavior Memgraph utilizes two techniques. First is
|
||||
the already mentioned tracking of two versions of data in vertex accessors.
|
||||
Using this approach ensures that the second iteration in the example query
|
||||
sees the data modification performed by the preceding iteration. The second
|
||||
technique is the `Accumulate` operation that accumulates all the iterations
|
||||
from the preceding logical op before passing them to the next logical op. In
|
||||
the example query, `Accumulate` ensures that the results returned to the user
|
||||
reflect changes performed in all iterations of the query (naive streaming
|
||||
could stream results at the end of first iteration producing inconsistent
|
||||
results). Note that `Accumulate` is demanding regarding memory and slows down
|
||||
query execution. For that reason it should be used only when necessary, for
|
||||
example it does not have to be used in a query that has `MATCH` and `SET` but
|
||||
no `RETURN`.
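
The role of `Accumulate` in the example query can be modelled with Python generators. This is only a sketch under simplifying assumptions: the graph is a plain dictionary, `SET` reads the latest values directly instead of going through accessors, and the helper names are hypothetical.

```python
def run(accumulate):
    graph = {"A": {"x": 0}, "B": {"x": 0}}
    matches = [("A", "B"), ("B", "A")]  # the undirected edge matches twice

    def set_rows():
        # SET m.x = n.x + 1, always working on the latest data.
        for n, m in matches:
            graph[m]["x"] = graph[n]["x"] + 1
            yield n, m

    def produce(rows):
        # RETURN labels(n), n.x, labels(m), m.x
        for n, m in rows:
            yield (n, graph[n]["x"], m, graph[m]["x"])

    rows = set_rows()
    if accumulate:
        rows = list(rows)  # exhaust the input before passing anything on
    return list(produce(rows))


accumulated = run(accumulate=True)
streamed = run(accumulate=False)
```

Without accumulation the first returned row reports `A.x` as 0, an intermediate state, while the accumulated run reproduces the table above.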
|
||||
|
||||
### Neo4j Inconsistency on Multiple SET Clauses
|
||||
|
||||
Considering the preceding example it could be expected that when a query has
|
||||
multiple `SET` clauses all the changes from the preceding ones are visible.
|
||||
This is not the case in Neo4j's implementation. Consider the following queries
|
||||
executed on an empty database:
|
||||
|
||||
    CREATE (n:A {x:0})-[:EdgeType]->(m:B {x:0})
    MATCH (n)--(m) SET n.x = n.x + 1 SET m.x = m.x * 2
    RETURN labels(n), n.x, labels(m), m.x
|
||||
|
||||
This returns:
|
||||
|
||||
+---------+---+---------+---+
|labels(n)|n.x|labels(m)|m.x|
+:=======:+:=:+:=======:+:=:+
|[A]      |2  |[B]      |1  |
+---------+---+---------+---+
|[B]      |1  |[A]      |2  |
+---------+---+---------+---+
|
||||
|
||||
If all the iterations of the first `SET` clause were executed before executing
|
||||
the second, all the resulting values would be 2. This not being the case, we
|
||||
conclude that Neo4j does not use a barrier-like mechanism between `SET`
|
||||
clauses. It is Memgraph's current vision that this is inconsistent and we
|
||||
plan to reduce Neo4j compliance in favour of operation consistency.
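
The difference between the two behaviours can be reproduced with a short Python simulation. It is a sketch only: the two `SET` clauses are modelled as generators over a dictionary of `x` values, and the `barrier` flag is a hypothetical switch standing in for a barrier-like mechanism between them.

```python
def run(barrier):
    x = {"A": 0, "B": 0}
    matches = [("A", "B"), ("B", "A")]  # the undirected edge matches twice

    def first_set(rows):
        for n, m in rows:
            x[n] += 1  # SET n.x = n.x + 1
            yield n, m

    def second_set(rows):
        for n, m in rows:
            x[m] *= 2  # SET m.x = m.x * 2
            yield n, m

    rows = first_set(matches)
    if barrier:
        rows = list(rows)  # finish the first SET for all matches
    list(second_set(rows))  # drain the plan
    return x


without_barrier = run(barrier=False)  # matches the table above
with_barrier = run(barrier=True)      # all values end up as 2
```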
|
||||
|
||||
### Double Deletion
|
||||
|
||||
It's possible to match the same graph element multiple times in a single query
|
||||
and delete it. Neo supports this, and so do we. The relevant implementation
|
||||
detail is in the `GraphDbAccessor` class, where the record deletion functions
|
||||
reside, and not in the logical plan execution. It comes down to checking if a
|
||||
record has already been deleted in the current transaction+command and not
|
||||
attempting to do it again (which would result in a crash).
|
||||
|
||||
### Set + Delete Edge-case
|
||||
|
||||
It's legal for a query to combine `SET` and `DELETE` clauses. Consider the
|
||||
following queries executed on an empty database:
|
||||
|
||||
|
||||
    CREATE ()-[:T]->()
    MATCH (n)--(m) SET n.x = 42 DETACH DELETE m
|
||||
|
||||
Due to the `MATCH` being undirected the second pull will attempt to set data
|
||||
on a deleted vertex. This is not a legal operation in Memgraph storage
|
||||
implementation. For that reason the logical operator for `SET` must check if
|
||||
the record it's trying to set something on has been deleted by the current
|
||||
transaction+command. If so, the modification is not executed.
|
||||
|
||||
### Deletion Accumulation
|
||||
|
||||
Sometimes it's necessary to accumulate deletions of all the matches before
|
||||
attempting to execute them. Consider the following. Start with an empty
|
||||
database and execute queries:
|
||||
|
||||
    CREATE ()-[:T]->()-[:T]->()
    MATCH (a)-[r1]-(b)-[r2]-(c) DELETE r1, b, c
|
||||
|
||||
Note that the `DELETE` clause attempts to delete node `c`, but it does not
|
||||
detach it by deleting edge `r2`. However, due to undirected edge in the
|
||||
`MATCH`, both edges get pulled and deleted.
|
||||
|
||||
Currently Memgraph does not support this behavior, while Neo does. There are a few
|
||||
ways that we could do this.
|
||||
|
||||
* Accumulate on deletion (that sucks because we have to keep track of
|
||||
everything that gets returned after the deletion).
|
||||
* Maybe we could stream through the deletion op, but defer actual deletion
|
||||
until plan-execution end.
|
||||
* Ignore this because it's very edgy (this is the currently selected option).
|
||||
|
||||
### Aggregation Without Input
|
||||
|
||||
It is necessary to define what aggregation ops return when they receive no
|
||||
input. Following is a table that shows what Neo4j's Cypher implementation and
|
||||
SQL produce.
|
||||
|
||||
|
||||
+-------------+------------------------+---------------------+---------------------+------------------+
| \<OP\>      | 1. Cypher, no group-by | 2. Cypher, group-by | 3. SQL, no group-by | 4. SQL, group-by |
+=============+:======================:+:===================:+:===================:+:================:+
| Count(\*)   | 0                      | \<NO\_ROWS>         | 0                   | \<NO\_ROWS>      |
+-------------+------------------------+---------------------+---------------------+------------------+
| Count(prop) | 0                      | \<NO\_ROWS>         | 0                   | \<NO\_ROWS>      |
+-------------+------------------------+---------------------+---------------------+------------------+
| Sum         | 0                      | \<NO\_ROWS>         | NULL                | \<NO\_ROWS>      |
+-------------+------------------------+---------------------+---------------------+------------------+
| Avg         | NULL                   | \<NO\_ROWS>         | NULL                | \<NO\_ROWS>      |
+-------------+------------------------+---------------------+---------------------+------------------+
| Min         | NULL                   | \<NO\_ROWS>         | NULL                | \<NO\_ROWS>      |
+-------------+------------------------+---------------------+---------------------+------------------+
| Max         | NULL                   | \<NO\_ROWS>         | NULL                | \<NO\_ROWS>      |
+-------------+------------------------+---------------------+---------------------+------------------+
| Collect     | []                     | \<NO\_ROWS>         | N/A                 | N/A              |
+-------------+------------------------+---------------------+---------------------+------------------+
|
||||
|
||||
Where:
|
||||
|
||||
1. `MATCH (n) RETURN <OP>(n.prop)`
|
||||
2. `MATCH (n) RETURN <OP>(n.prop), (n.prop2)`
|
||||
3. `SELECT <OP>(prop) FROM Table`
|
||||
4. `SELECT <OP>(prop), prop2 FROM Table GROUP BY prop2`
|
||||
|
||||
Neo's Cypher implementation diverges from SQL only when performing `SUM`.
|
||||
Memgraph implements SQL-like behavior. It is considered that `SUM` of
|
||||
arbitrary elements should not be implicitly 0, especially in a property graph
|
||||
without a strict schema (the property in question can contain values of
|
||||
arbitrary types, or no values at all).
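
The SQL-like behavior on empty input can be summarized with the following Python sketch. The function names are hypothetical, `None` stands in for `Null`, and the handling of `Null` inputs (for example by `Count(prop)`) is deliberately left out.

```python
def aggregate(op, values):
    """Aggregation without group-by: always exactly one result."""
    if op == "count":
        return len(values)
    if op == "collect":
        return list(values)
    if not values:
        return None  # SUM, AVG, MIN and MAX of no input are Null
    if op == "sum":
        return sum(values)
    if op == "avg":
        return sum(values) / len(values)
    if op == "min":
        return min(values)
    if op == "max":
        return max(values)
    raise ValueError(f"unknown aggregation: {op}")


def aggregate_grouped(op, rows):
    """Aggregation with group-by over (key, value) rows."""
    groups = {}
    for key, value in rows:
        groups.setdefault(key, []).append(value)
    # No input rows means no groups, hence no output rows at all.
    return {key: aggregate(op, group) for key, group in groups.items()}


empty_results = {op: aggregate(op, []) for op in
                 ("count", "sum", "avg", "min", "max", "collect")}
```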
|
||||
|
||||
### OrderBy
|
||||
|
||||
The `OrderBy` logical operator sorts the results in the desired order. It
|
||||
occurs in Cypher as part of a `WITH` or `RETURN` clause. Both the concept and
|
||||
the implementation are straightforward. It's necessary for the logical op to
|
||||
`Pull` everything from its input so it can be sorted. It's not necessary to
|
||||
keep the whole `Frame` state of each input, it is sufficient to keep a list of
|
||||
`TypedValues` on which the results will be sorted, and another list of values
|
||||
that need to be remembered and recreated on the `Frame` when yielding.
|
||||
|
||||
The sorting itself is made to reflect that of Neo's implementation which comes
|
||||
down to these points.
|
||||
|
||||
* `Null` comes last (as if it's greater than anything).
|
||||
* Primitive types compare naturally, with no implicit casting except from
|
||||
`int` to `double`.
|
||||
* Complex types are not comparable.
|
||||
* Every unsupported comparison results in an exception that gets propagated
|
||||
to the end user.
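
The ordering rules listed above could be expressed as a three-way comparison, sketched here in Python. The mapping of Cypher types onto Python types (`None` for `Null`, lists as an example of a complex type) and the choice of `TypeError` are assumptions made for this example.

```python
import functools


def compare(a, b):
    """Three-way comparison; None stands in for Null."""
    if a is None or b is None:
        # Null sorts last, as if it were greater than anything.
        return (a is None) - (b is None)

    def numeric(value):
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)

    comparable = (
        (numeric(a) and numeric(b))  # int and double compare with each other
        or (isinstance(a, bool) and isinstance(b, bool))
        or (isinstance(a, str) and isinstance(b, str))
    )
    if not comparable:
        raise TypeError(
            f"cannot order {type(a).__name__} and {type(b).__name__}")
    return (a > b) - (a < b)


def order_by(values):
    return sorted(values, key=functools.cmp_to_key(compare))


ordered = order_by([3, None, 1.5, 2])
```

Sorting a mix such as `[1, "a"]`, or a list of lists, raises the exception, which corresponds to the error propagated to the end user.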
|
||||
|
||||
### Limit in Write Queries
|
||||
|
||||
`Limit` can be used as part of a write query, in which case it will *not*
|
||||
reduce the amount of performed updates. For example, consider a database that
|
||||
has 10 vertices. The query `MATCH (n) SET n.x = 1 RETURN n LIMIT 3` will
|
||||
result in all vertices having their property value changed, while returning
|
||||
only the first three to the client. This makes sense from the implementation
|
||||
standpoint, because `Accumulate` is planned after `SetProperty` but before
|
||||
`Produce` and `Limit` operations. Note that this behavior can be
|
||||
non-deterministic in some queries, since it relies on the order of iteration
|
||||
over nodes which is undefined when not explicitly specified.
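
A generator-based Python model shows why the position of `Accumulate` decides how many updates are performed. The functions below are hypothetical stand-ins for the `SetProperty`, `Accumulate` and `Limit` operators, not their actual implementation.

```python
from itertools import islice


def set_property(vertices):
    for vertex in vertices:
        vertex["x"] = 1  # SET n.x = 1
        yield vertex


def accumulate(rows):
    # Exhausts its input on the first pull, then replays the rows.
    yield from list(rows)


def run(with_accumulate):
    vertices = [{} for _ in range(10)]
    rows = set_property(vertices)
    if with_accumulate:
        rows = accumulate(rows)
    returned = list(islice(rows, 3))  # RETURN n LIMIT 3
    updated = sum(1 for vertex in vertices if vertex.get("x") == 1)
    return len(returned), updated


accumulated = run(with_accumulate=True)   # 3 rows returned, 10 updates
streamed = run(with_accumulate=False)     # 3 rows returned, only 3 updates
```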
|
||||
|
||||
### Merge
|
||||
|
||||
`MERGE` in Cypher attempts to match a pattern. If it already exists, it does
|
||||
nothing and subsequent clauses like `RETURN` can use the matched pattern
|
||||
elements. If the pattern can't match to any data, it creates it. For detailed
|
||||
information see Neo4j's [merge
|
||||
documentation.](https://neo4j.com/docs/developer-manual/current/cypher/clauses/merge/)
|
||||
|
||||
An important thing about `MERGE` is visibility of modified data. `MERGE` takes
|
||||
an input (typically a `MATCH`) and has two additional *phases*: the merging
|
||||
part, and the subsequent set parts (`ON MATCH SET` and `ON CREATE SET`).
|
||||
Analysis of Neo4j's behavior indicates that each of these three phases (input,
|
||||
merge, set) does not see changes to the graph state done by a subsequent phase.
|
||||
The input phase does not see data created by the merge phase, nor the set
|
||||
phase. This is consistent with what seems like the general Cypher philosophy
|
||||
that query clause effects aren't visible in the preceding clauses.
|
||||
|
||||
We define the `Merge` logical operator as a *routing* operator that uses three
|
||||
logical operator branches.
|
||||
|
||||
1. The input from a preceding clause.
|
||||
|
||||
For example in `MATCH (n), (m) MERGE (n)-[:T]-(m)`. This input is
|
||||
optional because `MERGE` is allowed to be the first clause in a query.
|
||||
|
||||
2. The `merge_match` branch.
|
||||
|
||||
This logical operator branch is `Pull`-ed from until exhausted for each
|
||||
successful `Pull` from the input branch.
|
||||
|
||||
3. The `merge_create` branch.
|
||||
|
||||
This branch is `Pull`ed when the `merge_match` branch does not match
|
||||
anything (no successful `Pull`s) for an input `Pull`. It is `Pull`ed only
|
||||
once in such a situation, since only one creation needs to occur for a
|
||||
failed match.
|
||||
|
||||
The `ON MATCH SET` and `ON CREATE SET` parts of the `MERGE` clause are
|
||||
included in the `merge_match` and `merge_create` branches respectively. They
|
||||
are placed on the end of their branches so that they execute only when those
|
||||
branches succeed.
|
||||
|
||||
Memgraph strives to be consistent with Neo in its `MERGE` implementation,
|
||||
while at the same time keeping performance as good as possible. Consistency
|
||||
with Neo w.r.t. graph state visibility is not trivial. Documentation for
|
||||
`Expand` and `Set` describes how Memgraph keeps track of both the updated
|
||||
version of an edge/vertex and the old one, as it was before the current
|
||||
transaction+command. This technique is also used in `Merge`. The input
|
||||
phase/branch of `Merge` always looks at the old data. The merge phase needs to
|
||||
see the new data so it doesn't create more data than necessary.
|
||||
|
||||
For example, consider the query.
|
||||
|
||||
    MATCH (p:Person) MERGE (c:City {name: p.lives_in})
|
||||
|
||||
This query needs to create a city node only once for each unique `p.lives_in`.
|
||||
Finally the set phase of a `MERGE` clause should not affect the merge phase.
|
||||
To achieve this the `merge_match` branch of the `Merge` operator should see
|
||||
the latest created nodes, but filter them on their old state (if those nodes
|
||||
were not created by the `merge_create` branch). Implementation-wise that means that
`ScanAll` and `Expand` operators in the `merge_match` branch need to look at the new
graph state, while `Filter` operators need to look at the old one, if available.
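
The routing performed by `Merge` can be sketched with a Python generator. This is a simplified illustration: the branches are plain callables, cities are stored as a list of names, and neither the `ON MATCH SET` / `ON CREATE SET` parts nor the old/new graph views are modelled.

```python
def merge(input_rows, merge_match, merge_create):
    """Routes every input row through the match branch, else the create branch."""
    for row in input_rows:
        matched = False
        for result in merge_match(row):  # pulled until exhausted
            matched = True
            yield result
        if not matched:
            yield merge_create(row)  # pulled exactly once


cities = []  # hypothetical storage of City names


def match_city(lives_in):
    # Sees cities created by earlier pulls of the create branch.
    return [city for city in cities if city == lives_in]


def create_city(lives_in):
    cities.append(lives_in)
    return lives_in


# MATCH (p:Person) MERGE (c:City {name: p.lives_in})
people = ["Zagreb", "Split", "Zagreb"]  # p.lives_in of each matched person
rows = list(merge(people, match_city, create_city))
```

Because the match branch sees what the create branch produced earlier, the second person living in Zagreb matches the existing city instead of creating a duplicate.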
|
62
docs/dev/query/parsing.md
Normal file
@@ -0,0 +1,62 @@
|
||||
# Lexical and Syntactic Analysis
|
||||
|
||||
## Antlr
|
||||
|
||||
We use Antlr for lexical and syntax analysis of Cypher queries. Antlr uses
|
||||
grammar file `Cypher.g4` downloaded from http://www.opencypher.org to generate
|
||||
the parser and the visitor for the Cypher parse tree. Even though the provided
|
||||
grammar is not very pleasant to work with, we decided not to make any drastic
|
||||
changes to it so that our transition to newly published versions of
|
||||
`Cypher.g4` would be easier. Nevertheless, we had to fix some bugs and add
|
||||
features, so our version is not completely the same.
|
||||
|
||||
In addition to using `Cypher.g4`, we have `MemgraphCypher.g4`. This grammar
|
||||
file defines Memgraph-specific extensions to the original grammar. The most
|
||||
notable example is the inclusion of syntax for handling authorization. At the
|
||||
moment, some extensions are also found in `Cypher.g4`. For example, the syntax
|
||||
for using a lambda function in relationship patterns. These extensions should
|
||||
be moved out of `Cypher.g4`, so that it remains as close to the original
|
||||
grammar as possible. Additionally, having `MemgraphCypher.g4` may not be
|
||||
enough if we wish to split the functionality for community and enterprise
|
||||
editions of Memgraph.
|
||||
|
||||
## Abstract Syntax Tree (AST)
|
||||
|
||||
Since the Antlr-generated visitor and the official openCypher grammar are not very
practical to use, we translate Antlr's AST to our own AST. Currently there
|
||||
are ~40 types of nodes in our AST. Their definitions can be found in
|
||||
`src/query/frontend/ast/ast.lcp`.
|
||||
|
||||
Major groups of types can be found under the following base types.
|
||||
|
||||
* `Expression` --- types corresponding to Cypher expressions.
|
||||
* `Clause` --- types corresponding to Cypher clauses.
|
||||
* `PatternAtom` --- node or edge related information.
|
||||
* `Query` --- different kinds of queries, allows extending the language with
|
||||
Memgraph specific query syntax.
|
||||
|
||||
Memory management of created AST nodes is done with `AstStorage`. Each type
|
||||
must be created by invoking the `AstStorage::Create` method. This way all of the
|
||||
pointers to nodes and their children are raw pointers. The only owner of
|
||||
allocated memory is the `AstStorage`. When the storage goes out of scope, the
|
||||
pointers become invalid. It may be more natural to handle tree ownership via
|
||||
`unique_ptr`, i.e. each node owns its children. But there are some benefits to
|
||||
having a custom storage and allocation scheme.
|
||||
|
||||
The primary reason we opted for not using `unique_ptr` is the requirement of
|
||||
Antlr's base visitor class that the resulting values must be copyable. The
|
||||
result is wrapped in `antlrcpp::Any` so that the derived visitor classes may
|
||||
return any type they wish when visiting Antlr's AST. Unfortunately,
|
||||
`antlrcpp::Any` does not work with non-copyable types.
|
||||
|
||||
Another benefit of having `AstStorage` is that we can easily add a different
|
||||
allocation scheme for AST nodes. The interface of node creation would not
|
||||
change.
|
||||
|
||||
### AST Translation
|
||||
|
||||
The translation process is done via the `CypherMainVisitor` class, which is
derived from the Antlr-generated visitor. Besides instantiating our AST types, a
|
||||
minimal number of syntactic checks are done on a query. These checks handle
|
||||
the cases which were valid in original openCypher grammar, but may be invalid
|
||||
when combined with other syntax elements.
|
487
docs/dev/query/planning.md
Normal file
@@ -0,0 +1,487 @@
|
||||
# Logical Planning
|
||||
|
||||
After the semantic analysis and symbol generation, the AST is converted to a
|
||||
tree of logical operators. This conversion is called *planning* and the tree
|
||||
of logical operators is called a *plan*. The whole planning process is done in
|
||||
the following steps.
|
||||
|
||||
1. [AST Preprocessing](#ast-preprocessing)
|
||||
|
||||
The first step is to preprocess the AST by collecting
|
||||
information on filters, dividing the query into parts, normalizing patterns
|
||||
in `MATCH` clauses, etc.
|
||||
|
||||
2. [Logical Operator Planning](#logical-operator-planning)
|
||||
|
||||
After the preprocess step, the planning can be done via 2 planners:
|
||||
`VariableStartPlanner` and `RuleBasedPlanner`. The first planner will
|
||||
generate multiple plans where each plan has different starting points for
|
||||
searching the patterns in `MATCH` clauses. The second planner produces a
|
||||
single plan by mapping the query parts as they are to logical operators.
|
||||
|
||||
3. [Logical Plan Postprocessing](#logical-plan-postprocessing)
|
||||
|
||||
In this stage, we perform various transformations on the generated logical
|
||||
plan. Here we want to optimize the operations in order to improve
|
||||
performance during the execution. Naturally, transformations need to
|
||||
preserve the semantic behaviour of the original plan.
|
||||
|
||||
4. [Cost Estimation](#cost-estimation)
|
||||
|
||||
After the generation, the execution cost of each plan is estimated. This
|
||||
estimation is used to select the best plan which will be executed.
|
||||
|
||||
5. [Distributed Planning](#distributed-planning)
|
||||
|
||||
In case we are running distributed Memgraph, the final plan is adapted
|
||||
for distributed execution. NOTE: This appears to be an error in the
|
||||
workflow. Distributed planning should be moved before step 3. or
|
||||
integrated with it. With the workflow ordered as is now, cost estimation
|
||||
doesn't consider the distributed plan.
|
||||
|
||||
The implementation can be found in the `query/plan` directory, with the public
|
||||
entry point being `query/plan/planner.hpp`.
|
||||
|
||||
## AST Preprocessing
|
||||
|
||||
Each openCypher query consists of at least 1 **single query**. Multiple single
|
||||
queries are chained together using a **query combinator**. Currently, there is
|
||||
only one combinator, `UNION`. The preprocessing step starts in the
|
||||
`CollectQueryParts` function. This function will take a look at each single
|
||||
query and divide it into parts. Each part is separated with `RETURN` and
|
||||
`WITH` clauses. For example:
|
||||
|
||||
    MATCH (n) CREATE (m) WITH m MATCH (l)-[]-(m) RETURN l
    |                          |                        |
    |------- part 1 -----------+-------- part 2 --------|
    |                                                   |
    |-------------------- single query -----------------|
|
||||
|
||||
Each part is created by collecting all `MATCH` clauses and *normalizing* their
|
||||
patterns. Pattern normalization is the process of converting an arbitrarily
|
||||
long pattern chain of nodes and edges into a list of triplets `(start node,
|
||||
edge, end node)`. The triplets should preserve the semantics of the match. For
|
||||
example:
|
||||
|
||||
    MATCH (a)-[p]-(b)-[q]-(c)-[r]-(d)
|
||||
|
||||
is equivalent to:
|
||||
|
||||
    MATCH (a)-[p]-(b), (b)-[q]-(c), (c)-[r]-(d)
|
||||
|
||||
With this representation, it becomes easier to reorder the triplets and choose
|
||||
different strategies for pattern matching.
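
Pattern normalization can be sketched in a few lines of Python. The flat list representation of a pattern (alternating node and edge names) and the `normalize` function are assumptions made for this example, not the data structures used by the planner.

```python
def normalize(pattern):
    """Splits a chain [node, edge, node, edge, node, ...] into triplets."""
    if len(pattern) == 1:
        # A lone node has no edge to expand; keep only the start node.
        return [(pattern[0], None, None)]
    return [tuple(pattern[i:i + 3]) for i in range(0, len(pattern) - 2, 2)]


# MATCH (a)-[p]-(b)-[q]-(c)-[r]-(d)
triplets = normalize(["a", "p", "b", "q", "c", "r", "d"])
single = normalize(["n"])
```

Each triplet shares its start node with the end node of the previous one, which is what keeps the normalized form equivalent to the original chain.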
|
||||
|
||||
In addition to normalizing patterns, all of the filter expressions in patterns
|
||||
and inside of the `WHERE` clause (of the accompanying `MATCH`) are extracted
|
||||
and stored separately. During the extraction, symbols used in the filter
|
||||
expression are collected. This allows for planning filters in a valid order,
|
||||
as the matching for triplets is being done. Another important benefit of
|
||||
having extra information on filters is that we can recognize when a database index
|
||||
could be used.
|
||||
|
||||
After each `MATCH` is processed, they are all grouped, so that even the whole
|
||||
`MATCH` clauses may be reordered. The important thing is to remember which
|
||||
symbols were used to name edges in each `MATCH`. With those symbols we can
|
||||
plan for *cyphermorphism*, i.e. ensure different edges in the search pattern
|
||||
of a single `MATCH` map to different edges in the graph. This preserves the
|
||||
semantics of the query, even though we may have reordered the matching. The
|
||||
same steps are done for `OPTIONAL MATCH`.
|
||||
|
||||
Another clause which needs processing is `MERGE`. Here we normalize the
|
||||
pattern, since the `MERGE` is a bit like `MATCH` and `CREATE` in one.
|
||||
|
||||
All the other clauses are left as is.
|
||||
|
||||
In the end, each query part consists of:
|
||||
|
||||
* processed and grouped `MATCH` clauses;
|
||||
* processed and grouped `OPTIONAL MATCH` clauses;
|
||||
* processed `MERGE` matching pattern and
|
||||
* unchanged remaining clauses.
|
||||
|
||||
The last stored clause is guaranteed to be either `WITH` or `RETURN`.
|
||||
|
||||
## Logical Operator Planning
|
||||
|
||||
### Variable Start Planner
|
||||
|
||||
The `VariableStartPlanner` generates multiple plans for a single query. Each
|
||||
plan is generated by selecting a different starting point for pattern
|
||||
matching.
|
||||
|
||||
The algorithm works as follows.
|
||||
|
||||
1. For each query part:
    1. For each node in triplets of collected `MATCH` clauses:
        i. Add the node to a set of `expanded` nodes
        ii. Select a triplet `(start node, edge, end node)` whose `start node` is
            in the `expanded` set
        iii. If no triplet was selected, choose a new starting node that isn't in
             `expanded` and continue expanding
        iv. Repeat steps ii. -- iii. until all triplets have been selected
            and store that as a variation of the `MATCH` clauses
    2. Do step 1.1. for `OPTIONAL MATCH` and `MERGE` clauses
    3. Take all combinations of the generated `MATCH`, `OPTIONAL MATCH` and
       `MERGE` and store them as variations of the query part.
2. For each combination of query part variations:
    1. Generate a plan using the rule based planner
|
||||
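The inner loop of the algorithm (steps i. to iv. for one group of `MATCH` triplets) can be sketched roughly as follows. This is a simplified Python illustration, not the planner's code: the name `variations` is hypothetical, a selected triplet is assumed to mark both of its nodes as expanded, edge direction is ignored, and duplicate orderings are dropped.

```python
def variations(triplets):
    """Return one ordering of `triplets` per distinct starting node."""
    nodes = []
    for start, _edge, end in triplets:
        for node in (start, end):
            if node is not None and node not in nodes:
                nodes.append(node)
    result = []
    for first in nodes:
        expanded = {first}                      # step i
        remaining = list(triplets)
        ordered = []
        while remaining:                        # step iv
            # Step ii: pick a triplet that starts at an expanded node.
            pick = next((t for t in remaining if t[0] in expanded), None)
            if pick is None:
                # Step iii: nothing is reachable, start again elsewhere.
                pick = remaining[0]
            expanded.update((pick[0], pick[2]))
            ordered.append(pick)
            remaining.remove(pick)
        if ordered not in result:
            result.append(ordered)
    return result


chain = [("a", "p", "b"), ("b", "q", "c"), ("c", "r", "d")]
orders = variations(chain)
```

Every ordering contains the same triplets; only the order in which they would be handed to the rule based planner differs.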
|
||||
### Rule Based Planner
|
||||
|
||||
The `RuleBasedPlanner` generates a single plan for a single query. A plan is
|
||||
generated by following hardcoded rules for producing logical operators. The
|
||||
following sections give an overview of how each openCypher clause is converted
|
||||
to a `LogicalOperator`.
|
||||
|
||||
#### MATCH
|
||||
|
||||
`MATCH` clause is used to specify which patterns need to be searched for in
|
||||
the database. These patterns are normalized in the preprocess step to be
|
||||
represented as triplets `(start node, edge, end node)`. When there is no edge,
|
||||
then the triplet is reduced only to the `start node`. Generating the operators
|
||||
is done by looping over these triplets.
|
||||
|
||||
##### Searching for Nodes
|
||||
|
||||
The simplest search is finding standalone nodes. For example, `MATCH (n)`
|
||||
will find all the nodes in the graph. This is accomplished by generating a
|
||||
`ScanAll` operator and forwarding the node symbol which should store the
|
||||
results. In this case, all the nodes will be referenced by `n`.
|
||||
|
||||
Multiple nodes can be specified in a single match, e.g. `MATCH (n), (m)`.
|
||||
Planning is done by repeating the same steps for each sub pattern (separated
|
||||
by a comma). In this case, we would get 2 `ScanAll` operators chained one
|
||||
after the other. An optimization can be obtained if the node in the pattern is
|
||||
already searched for. In `MATCH (n), (n)` we can drop the second `ScanAll`
|
||||
operator since we have already generated it for the first node.
|
||||
|
||||
##### Searching for Relationships
|
||||
|
||||
A more advanced search includes finding nodes with relationships. For example,
|
||||
`MATCH (n)-[r]-(m)` should find every pair of connected nodes in the database.
|
||||
This means that if a single node has multiple connections, it will be
|
||||
repeated for each combination of pairs. The generation of operators starts
|
||||
from the first node in the pattern. If we are referencing a new starting node,
|
||||
we need to generate a `ScanAll` which finds all the nodes and stores them
|
||||
into `n`. Then, we generate an `Expand` operator which reads the `n` and
|
||||
traverses all the edges of that node. The edge is stored into `r`, while the
|
||||
destination node is stored in `m`.
|
||||
|
||||
Matching multiple relationships proceeds similarly, by repeating the same
|
||||
steps. The only difference is that we need to ensure different edges in the
|
||||
search pattern map to different edges in the graph. This means that after
|
||||
each `Expand` operator, we need to generate an `ExpandUniquenessFilter`. We
|
||||
provide this operator with a list of symbols for the previously matched edges
|
||||
and the symbol for the current edge.
|
||||
|
||||
For example.
|
||||
|
||||
MATCH (n)-[r1]-(m)-[r2]-(l)
|
||||
|
||||
The above is preprocessed into
|
||||
|
||||
MATCH (n)-[r1]-(m), (m)-[r2]-(l)
|
||||
|
||||
Then we look at each triplet in order and perform the described steps. This
|
||||
way, we would generate:
|
||||
|
||||
ScanAll (n) > Expand (n, r1, m) > Expand (m, r2, l) >
|
||||
ExpandUniquenessFilter ([r1], r2)
|
||||
|
||||
Note that we don't need to make `ExpandUniquenessFilter` after the first
|
||||
`Expand`, since there are no edges to compare to. This filtering needs to work
|
||||
across multiple patterns, but inside a *single* `MATCH` clause.
|
||||
|
||||
Let's take a look at the following.
|
||||
|
||||
MATCH (n)-[r1]-(m), (m)-[r2]-(l)
|
||||
|
||||
We would also generate the exact same operators.
|
||||
|
||||
ScanAll (n) > Expand (n, r1, m) > Expand (m, r2, l) >
|
||||
ExpandUniquenessFilter ([r1], r2)
|
||||
|
||||
On the other hand,
|
||||
|
||||
MATCH (n)-[r1]-(m) MATCH (m)-[r2]-(l)-[r3]-(i)
|
||||
|
||||
We would reset the uniqueness filtering at the start of the second match. This
|
||||
would mean that we output the following:
|
||||
|
||||
ScanAll (n) > Expand (n, r1, m) > Expand (m, r2, l) > Expand (l, r3, i) >
|
||||
ExpandUniquenessFilter ([r2], r3)
|
||||
|
||||
There is a difference in how we handle edge uniqueness compared to Neo4j.
|
||||
Neo4j does not allow searching for a single edge multiple times, but we've
|
||||
decided to support that.
|
||||
|
||||
For example, the user can say the following.
|
||||
|
||||
MATCH (n)-[r]-(m)-[r]-(l)
|
||||
|
||||
We would ensure that both `r` variables match to the same edge. In our
|
||||
terminology, we call this the *edge cycle*. For the above example, we would
|
||||
generate this plan.
|
||||
|
||||
ScanAll (n) > Expand (n, r, m) > Expand (m, r, l)
|
||||
|
||||
We do not put an `ExpandUniquenessFilter` operator between 2 `Expand`
|
||||
operators and we tell the 2nd `Expand` that it is an edge cycle. The 2nd
`Expand` will then ensure that both occurrences of `r` match the same edge.
|
||||
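The rules from this section can be summarized in a small Python sketch that turns normalized triplets into the operator chains shown above. The function `plan_matches` and its string output are illustrative assumptions. The real planner builds `LogicalOperator` objects and handles more cases (for example, an already bound end node).

```python
def plan_matches(matches):
    """Plan a list of MATCH clauses, each given as a list of triplets."""
    ops = []
    bound = set()                  # symbols bound so far, across clauses
    for triplets in matches:
        previous_edges = []        # uniqueness is tracked per MATCH clause
        for start, edge, end in triplets:
            if start not in bound:
                ops.append(f"ScanAll ({start})")
                bound.add(start)
            if edge is None:
                continue
            ops.append(f"Expand ({start}, {edge}, {end})")
            bound.add(end)
            if edge in bound:
                # Edge cycle: the same edge is matched again, so no
                # uniqueness filter is generated.
                continue
            bound.add(edge)
            if previous_edges:
                listed = ", ".join(previous_edges)
                ops.append(f"ExpandUniquenessFilter ([{listed}], {edge})")
            previous_edges.append(edge)
    return " > ".join(ops)


single = plan_matches([[("n", "r1", "m"), ("m", "r2", "l")]])
split = plan_matches([[("n", "r1", "m")],
                      [("m", "r2", "l"), ("l", "r3", "i")]])
cycle = plan_matches([[("n", "r", "m"), ("m", "r", "l")]])
```

Resetting `previous_edges` for each clause is what limits the uniqueness check to a single `MATCH`.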
|
||||
##### Filtering
|
||||
|
||||
To narrow the search down, the patterns in `MATCH` can have filtered labels
|
||||
and properties. A more general filtering is done using the accompanying
|
||||
`WHERE` clause. During the preprocess step, all filters are collected and
|
||||
extracted into expressions. Additional information on which symbols are used
|
||||
is also stored. This way, each time we generate a `ScanAll` or `Expand`, we
|
||||
look at all the filters to see if any of them can be used. I.e. if the symbols
|
||||
they use have been bound by a newly produced operator. If a filter expression
|
||||
can be used, we immediately add a `Filter` operator with that expression.
|
||||
|
||||
For example.
|
||||
|
||||
MATCH (n)-[r]-(m :label) WHERE n.prop = 42
|
||||
|
||||
We would produce:
|
||||
|
||||
ScanAll (n) > Filter (n.prop) > Expand (n, r, m) > Filter (m :label)
|
||||
|
||||
This means that the same plan is generated for the query:
|
||||
|
||||
MATCH (n {prop: 42})-[r]-(m :label)
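The placement of filters described in this section can be sketched as follows. This is a hedged Python illustration: `plan_with_filters` is a hypothetical helper, operators are plain strings, and each filter is reduced to its text plus the set of symbols it uses.

```python
def plan_with_filters(ops, filters):
    """ops: (text, symbols bound by the operator) pairs.
    filters: (expression text, symbols used) pairs."""
    plan, bound, pending = [], set(), list(filters)
    for text, new_symbols in ops:
        plan.append(text)
        bound |= set(new_symbols)
        for flt in list(pending):
            expression, used = flt
            # A filter becomes usable once all of its symbols are bound.
            if set(used) <= bound:
                plan.append(f"Filter ({expression})")
                pending.remove(flt)
    return " > ".join(plan)


plan = plan_with_filters(
    [("ScanAll (n)", {"n"}), ("Expand (n, r, m)", {"r", "m"})],
    [("m :label", {"m"}), ("n.prop", {"n"})],
)
```

The filter on `n.prop` is emitted right after `ScanAll`, before any expansion takes place, regardless of where it was written in the query.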
|
||||
|
||||
#### OPTIONAL
|
||||
|
||||
If a `MATCH` clause is preceded by `OPTIONAL`, then we need to generate a plan
|
||||
such that we produce results even if we fail to match anything. This is
|
||||
accomplished by generating an `Optional` operator, which takes 2 operator
|
||||
trees:
|
||||
|
||||
* input operation and
|
||||
* optional operation.
|
||||
|
||||
The input is the operation we generated for the part of the query before
|
||||
`OPTIONAL MATCH`. For the optional operation, we simply generate the `OPTIONAL
|
||||
MATCH` part just like we would for regular `MATCH`. In addition to operations,
|
||||
we need to send the symbols which are set during optional matching to the
|
||||
`Optional` operator. The operator will reset values of those symbols to
|
||||
`null`, when the optional part fails to match.
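The behaviour of the `Optional` operator can be approximated with a Python generator. This is only a sketch of the semantics described above: frames are modelled as dictionaries, and `optional`, `expand_m` and `edges` are hypothetical names.

```python
def optional(input_frames, optional_op, optional_symbols):
    for frame in input_frames:
        matched = False
        for result in optional_op(frame):
            matched = True
            yield result
        if not matched:
            # Nothing matched: keep the input row and reset the symbols
            # of the optional part to null (None).
            yield {**frame, **dict.fromkeys(optional_symbols)}


edges = {1: [10, 11]}  # hypothetical adjacency: only node 1 has neighbours


def expand_m(frame):
    for neighbour in edges.get(frame["n"], []):
        yield {**frame, "m": neighbour}


# MATCH (n) OPTIONAL MATCH (n)--(m)
rows = list(optional([{"n": 1}, {"n": 2}], expand_m, ["m"]))
```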
|
||||
|
||||
#### RETURN & WITH
|
||||
|
||||
`RETURN` and `WITH` clauses are very similar to each other. The only
|
||||
difference is that `WITH` separates parts of the query and can be paired with
|
||||
`WHERE` clause.
|
||||
|
||||
The common part is generating operators for the body of the clause. Separation
|
||||
of query parts is mostly done in semantic analysis, which checks that only the
|
||||
symbols exposed through `WITH` are visible in the query parts after the
|
||||
clause. The minor part is done in planning.
|
||||
|
||||
##### Named Results
|
||||
|
||||
Both clauses contain multiple named expressions (`expr AS name`) which are
|
||||
used to generate `Produce` operator.
|
||||
|
||||
##### Aggregations
|
||||
|
||||
If an expression contains an aggregation operator (`sum`, `avg`, ...) we need
|
||||
to plan the `Aggregate` operator as input to `Produce`. This case is more
|
||||
complex, because aggregation in openCypher can perform implicit grouping of
|
||||
results used for aggregation.
|
||||
|
||||
For example, `WITH/RETURN sum(n.x) AS s, n.y AS group` will implicitly group
|
||||
by `n.y` expression.
|
||||
|
||||
Another, obscure grouping can be achieved with `RETURN sum(n.a) + n.b AS s`.
|
||||
Here, the `n.b` will be used for grouping, even though both the `sum` and
|
||||
`n.b` are in the same named expression.
|
||||
|
||||
Therefore, we need to collect all expressions which do not contain
|
||||
aggregations and use them for grouping. You may have noticed that in the last
|
||||
example `sum` is actually a sub-expression of `+`. `Aggregate` operator does
|
||||
not see that (nor should it), so the responsibility of evaluating that falls
|
||||
on `Produce`. One way is for `Aggregate` to store results of grouping
|
||||
expressions on the frame in addition to aggregation results. Unfortunately,
|
||||
this would require rewiring named expressions in `Produce` to reference
|
||||
already evaluated expressions. In the current implementation, we opted for
|
||||
`Aggregate` to store only aggregation results on the frame, while `Produce`
|
||||
will re-evaluate all the other (grouping) expressions. To handle that, symbols
|
||||
which are used in expressions are passed to `Aggregate`, so that they can be
|
||||
remembered. `Produce` will read those symbols from the frame and use them to
|
||||
re-evaluate the needed expressions.
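Collecting the grouping expressions can be illustrated with a tiny expression tree. The tuple encoding (`"agg"`, `"binop"`, `"prop"`) and the helper names below are assumptions made for this Python sketch. The actual planner works on AST classes.

```python
def has_aggregation(expr):
    if expr[0] == "agg":
        return True
    return any(has_aggregation(c) for c in expr[1:] if isinstance(c, tuple))


def group_by(expr):
    """Collect the largest sub-expressions that contain no aggregation."""
    if expr[0] == "agg":
        return []
    if not has_aggregation(expr):
        return [expr]
    result = []
    for child in expr[1:]:
        if isinstance(child, tuple):
            result.extend(group_by(child))
    return result


# RETURN sum(n.a) + n.b AS s
expr = ("binop", "+", ("agg", "sum", ("prop", "n", "a")), ("prop", "n", "b"))
grouping = group_by(expr)
```

Here `n.b` ends up as the only grouping expression, even though it shares a named expression with `sum`.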
|
||||
|
||||
##### Accumulation
|
||||
|
||||
After we have `Produce` and potentially `Aggregate`, we need to handle a
|
||||
special case when the part of the query before `RETURN` or `WITH` performs
|
||||
updates. For that, we want to run that part of the query fully, so that we get
|
||||
the latest results. This is accomplished by adding `Accumulate` operator as
|
||||
input to `Aggregate` or `Produce` (if there is no aggregation). Accumulation
|
||||
will store all the values for all the used symbols inside `RETURN` and `WITH`,
|
||||
so that they can be used in the operator which follows. This way, only parts
|
||||
of the frame are copied, instead of the whole frame. Here is a minor
|
||||
difference between planning `WITH`, compared to `RETURN`. Since `WITH` can
|
||||
separate writing from reading, we need to advance the transaction command.
|
||||
This enables the later, read parts of the query to obtain the newest changes.
|
||||
This is supported by passing `advance_command` flag to `Accumulate` operator.
|
||||
|
||||
In the simplest case, common to both clauses, we have `Accumulate > Aggregate
|
||||
> Produce` operators, where `Accumulate` and `Aggregate` may be left out.
|
||||
|
||||
##### Ordering
|
||||
|
||||
Planning `ORDER BY` is simple enough. Since it may see new symbols (filled in
|
||||
`Produce`), we add the `OrderBy` operator at the end. The operator will change
|
||||
the order of produced results, so we pass it the ordering expressions and the
|
||||
output symbols of named expressions.
|
||||
|
||||
##### Filtering
|
||||
|
||||
A final difference in `WITH` is when it contains a `WHERE` clause. For that,
|
||||
we simply generate the `Filter` operator, appended after `Produce` or
|
||||
`OrderBy` (depending on which operator is last).
|
||||
|
||||
##### Skipping and Limiting
|
||||
|
||||
If we have `SKIP` or `LIMIT`, we generate `Skip` or `Limit` operators,
|
||||
respectively. These operators are put at the end of the clause.
|
||||
|
||||
This placement may have some unexpected behaviour when combined with
|
||||
operations that update the graph. For example.
|
||||
|
||||
MATCH (n) SET n.x = n.x + 1 RETURN n LIMIT 1
|
||||
|
||||
The above query may be interpreted as if the `SET` will be done only once.
|
||||
Since this is a write query, we need to accumulate results, so the part before
|
||||
`RETURN` will execute completely. The accumulated results will be yielded up
|
||||
to the given limit, and the user would get only the first `n` that was
|
||||
updated. This may confuse the user because in reality, every node in the
|
||||
database had been updated.
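The effect can be reproduced with Python generators standing in for operators. This is a loose analogy under stated assumptions: `set_x` and `accumulate` are hypothetical stand-ins for the update and `Accumulate` operators, and `itertools.islice` plays the role of `Limit`.

```python
from itertools import islice


def set_x(rows):
    for node in rows:
        node["x"] += 1          # stand-in for SET n.x = n.x + 1
        yield node


def accumulate(rows):
    # Drain the whole input before yielding anything downstream.
    yield from list(rows)


# With accumulation: every node is updated, one row is returned.
nodes = [{"x": 0}, {"x": 0}, {"x": 0}]
returned = list(islice(accumulate(set_x(iter(nodes))), 1))

# Without accumulation, pulling lazily would update only the first node.
lazy_nodes = [{"x": 0}, {"x": 0}, {"x": 0}]
lazy_returned = list(islice(set_x(iter(lazy_nodes)), 1))
```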
|
||||
|
||||
Note that `Skip` always comes before `Limit`. In the current implementation,
|
||||
they are generated directly one after the other.
|
||||
|
||||
#### CREATE
|
||||
|
||||
`CREATE` clause is used to create nodes and edges (relationships).
|
||||
|
||||
For multiple `CREATE` clauses or multiple creation patterns in a single
|
||||
clause, we perform the same, following steps.
|
||||
|
||||
##### Creating a Single Node
|
||||
|
||||
A node is created by simply specifying a node pattern.
|
||||
|
||||
For example `CREATE (n :label {property: "value"}), ()` would create 2 nodes.
|
||||
The 1st one would be created with a label and a property. This node could be
|
||||
referenced later in the query, by using the variable `n`. The 2nd node cannot
|
||||
be referenced and it would be created without any labels or properties. For
|
||||
node creation, we generate a `CreateNode` operator and pass it all the details
|
||||
of node creation: variable symbol, labels and properties. In the mentioned
|
||||
example, we would have `CreateNode > CreateNode`.
|
||||
|
||||
##### Creating a Relationship
|
||||
|
||||
To create a relationship, the `CREATE` clause must contain a pattern with a
|
||||
directed edge. Compared to creating a single node, this case is a bit more
|
||||
complicated, because either side of the edge may not exist. By exist, we mean
|
||||
that the endpoint is a variable which already references a node.
|
||||
|
||||
For example, `MATCH (n) CREATE (n)-[r]->(m)` would create an edge `r` and a
|
||||
node `m` for each matched node `n`. If we focus on the `CREATE` part, we
|
||||
generate `CreateExpand (n, r, m)` where `n` already exists (refers to matched
|
||||
node) and `m` would be newly created along with edge `r`. If we had only
|
||||
`CREATE (n) -[r]-> (m)`, then we would need to create both nodes of the edge
|
||||
`r`. This is done by generating `CreateNode (n) > CreateExpand (n, r, m)`. The
|
||||
final case is when both endpoints refer to an existing node. For example, when
|
||||
adding a node with a cyclical connection `CREATE (n)-[r]->(n)`. In this case,
|
||||
we would generate `CreateNode (n) > CreateExpand (n, r, n)`. We would tell
|
||||
`CreateExpand` to only create the edge `r` between the already created `n`.
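The three cases can be condensed into a short Python sketch. The helper `plan_create` and its return value (the operator chain as text, plus a flag telling `CreateExpand` that the end node already exists) are assumptions for illustration.

```python
def plan_create(start, edge, end, bound):
    """Plan CREATE (start)-[edge]->(end), given the already bound symbols."""
    ops = []
    if start not in bound:
        ops.append(f"CreateNode ({start})")
        bound.add(start)
    # When the end node already exists, only the edge should be created.
    end_exists = end in bound
    ops.append(f"CreateExpand ({start}, {edge}, {end})")
    bound.update((edge, end))
    return " > ".join(ops), end_exists


after_match = plan_create("n", "r", "m", {"n"})   # MATCH (n) CREATE (n)-[r]->(m)
both_new = plan_create("n", "r", "m", set())      # CREATE (n)-[r]->(m)
self_loop = plan_create("n", "r", "n", set())     # CREATE (n)-[r]->(n)
```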
|
||||
|
||||
#### MERGE
|
||||
|
||||
Although the merge operation is complex, planning turns out to be relatively
|
||||
simple. The pattern inside the `MERGE` clause is used for both matching and
|
||||
creating. Therefore, we create 2 operator trees, one for each action.
|
||||
|
||||
For example.
|
||||
|
||||
MERGE (n)-[r:r]-(m)
|
||||
|
||||
We would generate a single `Merge` operator which has the following.
|
||||
|
||||
* No input operation (since it is not preceded by any other clause).
|
||||
|
||||
* On match operation
|
||||
|
||||
`ScanAll (n) > Expand (n, r, m) > Filter (r)`
|
||||
|
||||
* On create operation
|
||||
|
||||
`CreateNode (n) > CreateExpand (n, r, m)`
|
||||
|
||||
In cases when `MERGE` contains `ON MATCH` and `ON CREATE` parts, we simply
|
||||
append their operations to the respective operator trees.
|
||||
|
||||
Observe the following example.
|
||||
|
||||
MERGE (n)-[r:r]-(m) ON MATCH SET n.x = 42 ON CREATE SET m :label
|
||||
|
||||
The `Merge` would be generated with the following.
|
||||
|
||||
* No input operation (again, since there is no clause preceding it).
|
||||
|
||||
* On match operation
|
||||
|
||||
`ScanAll (n) > Expand (n, r, m) > Filter (r) > SetProperty (n.x, 42)`
|
||||
|
||||
* On create operation
|
||||
|
||||
`CreateNode (n) > CreateExpand (n, r, m) > SetLabels (m, :label)`
|
||||
|
||||
When we have preceding clauses, we simply put their operator as input to
|
||||
`Merge`.
|
||||
|
||||
MATCH (n) MERGE (n)-[r:r]-(m)
|
||||
|
||||
The above would be generated as
|
||||
|
||||
ScanAll (n) > Merge (on_match_operation, on_create_operation)
|
||||
|
||||
Here we need to be careful to recognize which symbols are already declared.
|
||||
But, since the `on_match_operation` uses the same algorithm for generating a
|
||||
`Match`, that problem is handled there. The same should hold for
|
||||
`on_create_operation`, which uses the process of generating a `Create`. So,
|
||||
finally for this example, the `Merge` would have:
|
||||
|
||||
* Input operation
|
||||
|
||||
`ScanAll (n)`
|
||||
|
||||
* On match operation
|
||||
|
||||
`Expand (n, r, m) > Filter (r)`
|
||||
|
||||
Note that `ScanAll` is not needed since we get the nodes from input.
|
||||
|
||||
* On create operation
|
||||
|
||||
`CreateExpand (n, r, m)`
|
||||
|
||||
Note that `CreateNode` is dropped, since we want to expand the existing one.
|
||||
|
||||
## Logical Plan Postprocessing
|
||||
|
||||
NOTE: TODO
|
||||
|
||||
## Cost Estimation
|
||||
|
||||
NOTE: TODO
|
||||
|
||||
## Distributed Planning
|
||||
|
||||
NOTE: TODO
|
134
docs/dev/query/semantic.md
Normal file
@ -0,0 +1,134 @@
|
||||
# Semantic Analysis and Symbol Generation
|
||||
|
||||
In this phase, various semantic and variable type checks are performed.
|
||||
Additionally, we generate symbols which map AST nodes to stored values
|
||||
computed from evaluated expressions.
|
||||
|
||||
## Symbol Generation
|
||||
|
||||
Implementation can be found in `query/frontend/semantic/symbol_generator.cpp`.
|
||||
|
||||
Symbols are generated for each AST node that represents data that needs to
|
||||
have storage. Currently, these are:
|
||||
|
||||
* `NamedExpression`
|
||||
* `CypherUnion`
|
||||
* `Identifier`
|
||||
* `Aggregation`
|
||||
|
||||
You may notice that the above AST nodes may not correspond to something named
|
||||
by a user. For example, `Aggregation` can be a part of larger expression and
|
||||
thus remain unnamed. The reason we still generate symbols is to have a uniform
|
||||
behaviour when executing a query as well as allow for caching the results of
|
||||
expression evaluation.
|
||||
|
||||
AST nodes do not actually store a `Symbol` instance, instead they have a
|
||||
`int32_t` index identifying the symbol in the `SymbolTable` class. This is
|
||||
done to minimize the size of AST types as well as allow easier sharing of the same
|
||||
symbols with multiple instances of AST nodes.
|
||||
|
||||
The storage for evaluated data is represented by the `Frame` class. Each
|
||||
symbol determines a unique position in the frame. During interpretation,
|
||||
evaluation of expressions which have a symbol will either read or store values
|
||||
in the frame. For example, an instance of an `Identifier` will use the symbol to
|
||||
find and read the value from `Frame`. On the other hand, `NamedExpression`
|
||||
will take the result of evaluating its own expression and store it in the
|
||||
`Frame`.
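The relationship between AST nodes, the symbol table and the frame can be sketched in Python. The classes below only borrow the names `Symbol`, `SymbolTable` and `Frame` from the text. Their methods (`create`, `at`) and fields are assumptions, not the actual C++ interfaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Symbol:
    name: str
    position: int               # unique position in the frame


class SymbolTable:
    def __init__(self):
        self._symbols = []

    def create(self, name):
        symbol = Symbol(name, len(self._symbols))
        self._symbols.append(symbol)
        return symbol.position  # AST nodes keep only this index

    def at(self, index):
        return self._symbols[index]

    def __len__(self):
        return len(self._symbols)


class Frame:
    def __init__(self, size):
        self._values = [None] * size

    def __getitem__(self, symbol):
        return self._values[symbol.position]

    def __setitem__(self, symbol, value):
        self._values[symbol.position] = value


table = SymbolTable()
n_index = table.create("n")
total_index = table.create("total")
frame = Frame(len(table))
frame[table.at(total_index)] = 42    # e.g. a NamedExpression storing its result
value = frame[table.at(total_index)] # e.g. an Identifier reading it back
```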
|
||||
|
||||
When a symbol is created, the context of creation is used to assign a type to that
|
||||
symbol. This type is used for simple type checking operations. For example,
|
||||
`MATCH (n)` will create a symbol for variable `n`. Since the `MATCH (n)`
|
||||
represents finding a vertex in the graph, we can set `Symbol::Type::Vertex`
|
||||
for that symbol. Later, for example in `MATCH ()-[n]-()` we see that variable
|
||||
`n` is used as an edge. Since we already have a symbol for that variable, we
|
||||
detect this type mismatch and raise a `SemanticException`.
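A minimal Python sketch of this check is shown below. The dictionary of symbol types and the helper `bind` are hypothetical. `SemanticException` is a local stand-in for the exception named in the text.

```python
class SemanticException(Exception):
    """Stand-in for the exception type named in the text."""


def bind(symbol_types, name, symbol_type):
    # Reuse the existing symbol if there is one, otherwise create it.
    known = symbol_types.setdefault(name, symbol_type)
    if known != symbol_type:
        raise SemanticException(
            f"Type mismatch: {name} already defined as {known}")
    return known


symbol_types = {}
bind(symbol_types, "n", "Vertex")       # MATCH (n)
try:
    bind(symbol_types, "n", "Edge")     # MATCH ()-[n]-()
    mismatch = False
except SemanticException:
    mismatch = True
```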
|
||||
|
||||
The basic rule of symbol generation is that variables inside `MATCH`, `CREATE`,
|
||||
`MERGE`, `WITH ... AS` and `RETURN ... AS` clauses establish new symbols.
|
||||
|
||||
### Symbols in Patterns
|
||||
|
||||
Inside `MATCH`, symbols are created only if they didn't exist before. For
|
||||
example, patterns in `MATCH (n {a: 5})--(m {b: 5}) RETURN n, m` will create 2
|
||||
symbols: one for `n` and one for `m`. `RETURN` clause will, in turn, reference
|
||||
those symbols. Symbols established in a part of pattern are immediately bound
|
||||
and visible in later parts. For example, `MATCH (n)--(n)` will create a symbol
|
||||
for variable `n` for 1st `(n)`. That symbol is referenced in 2nd `(n)`. Note
|
||||
that the symbol is not bound inside 1st `(n)` itself. What this means is that,
|
||||
for example, `MATCH (n {a: n.b})` should raise an error, because `n` is not
|
||||
yet bound when encountering `n.b`. On the other hand,
|
||||
`MATCH (n)--(n {a: n.b})` is fine.
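The binding order can be illustrated with a short Python sketch, where each part of a pattern is reduced to its variable and the variables used in its property map. The names `check_pattern` and `UnboundVariable` are assumptions for this example.

```python
class UnboundVariable(Exception):
    pass


def check_pattern(parts, bound=()):
    """parts: (variable, variables used in its property map) pairs."""
    bound = set(bound)
    for variable, used in parts:
        for name in used:
            if name not in bound:
                raise UnboundVariable(name)
        # The variable becomes visible only after its own part is processed.
        bound.add(variable)
    return bound


try:
    check_pattern([("n", ["n"])])                  # MATCH (n {a: n.b})
    first_ok = True
except UnboundVariable:
    first_ok = False

second = check_pattern([("n", []), ("n", ["n"])])  # MATCH (n)--(n {a: n.b})
```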
|
||||
|
||||
The `CREATE` is similar to `MATCH`, but it *always* establishes symbols for
|
||||
variables which create graph elements. What this means is that, for example
|
||||
`MATCH (n) CREATE (n)` is not allowed. `CREATE` wants to create a new node,
|
||||
for which we already have a symbol. In such a case, we need to throw an error
|
||||
that the variable `n` is being redeclared. On the other hand `MATCH (n) CREATE
|
||||
(n)-[r :r]->(n)` is fine, because `CREATE` will only create the edge `r`,
|
||||
connecting the already existing node `n`. Remaining behaviour is the same as
|
||||
in `MATCH`. This means that we can simplify `CREATE` to be like `MATCH` with 2
|
||||
special cases.
|
||||
|
||||
1. Are we creating a node, i.e. `CREATE (n)`? If yes, then the symbol for
|
||||
`n` must not have been created before. Otherwise, we reference the
|
||||
existing symbol.
|
||||
2. Are we creating an edge, i.e. we encounter a variable for an edge inside
|
||||
`CREATE`? If yes, then that variable must not reference a symbol.
|
||||
|
||||
The `MERGE` clause is treated the same as `CREATE` with regards to symbol
|
||||
generation. The only difference is that we allow bidirectional edges in the
|
||||
pattern. When creating such a pattern, the direction of the created edge is
|
||||
arbitrarily determined.
|
||||
|
||||
### Symbols in WITH and RETURN
|
||||
|
||||
In addition to patterns, new symbols are established in the `WITH` clause.
|
||||
This clause makes the new symbols visible *only* to the rest of the query.
|
||||
For example, `MATCH (old) WITH old AS new RETURN new, old` should raise an
|
||||
error that `old` is unbound inside `RETURN`.
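The scoping rule can be sketched in Python by modelling a scope as a dictionary that `WITH` replaces. The helper `with_scope` is hypothetical, and `SemanticException` is a local stand-in. The sketch also rejects duplicate new names within one clause.

```python
class SemanticException(Exception):
    """Stand-in for the exception type named in the text."""


def with_scope(scope, named_expressions):
    """named_expressions: (used variable, new name) pairs."""
    new_scope = {}
    for used, name in named_expressions:
        if used not in scope:
            raise SemanticException(f"Unbound variable: {used}")
        if name in new_scope:
            raise SemanticException(f"Redeclaration of: {name}")
        new_scope[name] = scope[used]
    return new_scope      # only the new names survive past WITH


# MATCH (old) WITH old AS new ...
scope = with_scope({"old": "Vertex"}, [("old", "new")])

try:
    with_scope({"a": 1, "b": 2}, [("a", "same"), ("b", "same")])
    duplicate_allowed = True
except SemanticException:
    duplicate_allowed = False
```

A later `RETURN old` would look up `old` in the returned scope and fail, since only `new` is present.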
|
||||
|
||||
There is a special case with symbol visibility in `WHERE` and `ORDER BY`. They
|
||||
need to see both the old and the new symbols. Therefore `MATCH (old) RETURN
|
||||
old AS new ORDER BY old.prop` needs to work. On the other hand, if we perform
|
||||
aggregations inside `WITH` or `RETURN`, then the old symbols should not be
|
||||
visible in either `WHERE` or `ORDER BY`. Since the aggregation has to go
|
||||
through all the results in order to generate the final value, it makes no
|
||||
sense to store old symbols and their values. A query like `MATCH (old) WITH
|
||||
SUM(old.prop) AS sum WHERE old.prop = 42 RETURN sum` needs to raise an error
|
||||
that `old` is unbound inside `WHERE`.
|
||||
|
||||
For cases when `SKIP` and `LIMIT` appear, we disallow any identifiers from
|
||||
appearing in their expressions. Basically, `SKIP` and `LIMIT` can only be
|
||||
constant expressions[^1]. For example, `MATCH (old) RETURN old AS new SKIP
|
||||
new.prop` needs to raise an error that variables are not allowed in `SKIP`. It makes no
|
||||
sense to allow variables, since their values may vary on each iteration. On
|
||||
the other hand, we could support variables bound to constant expressions, but for
|
||||
simplicity we do not. For example, `MATCH (old) RETURN old, 2 AS limit_var
|
||||
LIMIT limit_var` would still throw an error.
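The check amounts to walking the `SKIP` or `LIMIT` expression and rejecting any identifier. Below is a Python sketch with a hypothetical tuple encoding of expressions. `SemanticException` is again a local stand-in.

```python
class SemanticException(Exception):
    """Stand-in for the exception type named in the text."""


def check_no_identifiers(expr, clause):
    if expr[0] == "identifier":
        raise SemanticException(f"Variables are not allowed in {clause}")
    for child in expr[1:]:
        if isinstance(child, tuple):
            check_no_identifiers(child, clause)


def is_allowed(expr, clause):
    try:
        check_no_identifiers(expr, clause)
        return True
    except SemanticException:
        return False


literal_ok = is_allowed(("add", ("literal", 1), ("literal", 2)), "LIMIT")
# SKIP new.prop
prop_ok = is_allowed(("property", ("identifier", "new"), "prop"), "SKIP")
# LIMIT limit_var
var_ok = is_allowed(("identifier", "limit_var"), "LIMIT")
```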
|
||||
|
||||
Finally, we generate symbols for names created in `RETURN` clause. These
|
||||
symbols are used for the final results of a query.
|
||||
|
||||
NOTE: New symbols in `WITH` and `RETURN` should be unique. This means that
|
||||
`WITH a AS same, b AS same` is not allowed, neither is a construct like
|
||||
`RETURN 2, 2`
|
||||
|
||||
### Symbols in Functions which Establish New Scope
|
||||
|
||||
Symbols can also be created in some functions. These functions usually take an
|
||||
expression, bind a single variable and run the expression inside the newly
|
||||
established scope.
|
||||
|
||||
The `all` function takes a list, creates a variable for list element and runs
|
||||
the predicate expression. For example:
|
||||
|
||||
MATCH (n) RETURN n, all(n IN n.prop_list WHERE n < 42)
|
||||
|
||||
We create a new symbol for use inside `all`. This means that the `WHERE n <
42` uses the `n` which takes its values from the elements of `n.prop_list`. The
|
||||
original `n` bound by `MATCH` is not visible inside the `all` function, but it
|
||||
is visible outside. Therefore, the `RETURN n` and `n.prop_list` reference the
|
||||
`n` from `MATCH`.
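The shadowing can be mimicked in Python by evaluating the predicate in a copy of the outer scope. The helper `eval_all` and the dictionary-based scopes are assumptions made for this sketch.

```python
def eval_all(variable, values, predicate, scope):
    inner = dict(scope)             # new scope for the predicate
    for value in values:
        inner[variable] = value     # shadows an outer symbol of the same name
        if not predicate(inner):
            return False
    return True


# MATCH (n) RETURN n, all(n IN n.prop_list WHERE n < 42)
outer = {"n": {"prop_list": [1, 2, 41]}}
result = eval_all("n", outer["n"]["prop_list"], lambda s: s["n"] < 42, outer)
```

The list expression `n.prop_list` is evaluated in the outer scope before the inner `n` exists, and the outer `n` is left untouched afterwards.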
|
||||
|
||||
[^1]: Constant expressions are expressions for which the result can be
|
||||
computed at compile time.
|