Move query implementation from Phriction to this repo

Summary:
Our query parsing, planning and execution architecture was described on
Phriction, Phabricator's wiki. This commit copies that documentation here so
that it is easier for all developers to access. An additional benefit is that
changes are now tracked, which should encourage developers to keep the
documentation up to date.

Besides making a copy, the documentation has been updated to reflect the
current state of the codebase. Note that some things are still missing,
but what was written should now be correct.

Reviewers: mtomic, llugovic

Reviewed By: mtomic

Subscribers: pullbot

Differential Revision: https://phabricator.memgraph.io/D1854
Teon Banek 2019-02-13 16:50:32 +01:00
parent a14c4f1864
commit 6f10b1c115
7 changed files with 1082 additions and 0 deletions

1
docs/dev/query/.gitignore vendored Normal file

@@ -0,0 +1 @@
html/

11
docs/dev/query/build-html Executable file

@@ -0,0 +1,11 @@
#!/bin/bash
script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
mkdir -p "$script_dir/html"
for markdown_file in $(find "$script_dir" -name '*.md'); do
  name=$(basename -s .md "$markdown_file")
  sed -e 's/\.md/\.html/g' "$markdown_file" | \
    pandoc -s -f markdown -t html -o "$script_dir/html/$name.html"
done


@@ -0,0 +1,14 @@
# Query Parsing, Planning and Execution
This part of the documentation deals with query execution.
Memgraph currently supports only query interpretation. Each new query is
parsed, analysed and translated into a sequence of operations which are then
executed on the main database storage. Query execution is organized into the
following phases:
1. [Lexical Analysis (Tokenization)](parsing.md)
2. [Syntactic Analysis (Parsing)](parsing.md)
3. [Semantic Analysis and Symbol Generation](semantic.md)
4. [Logical Planning](planning.md)
5. [Logical Plan Execution](execution.md)

373
docs/dev/query/execution.md Normal file

@@ -0,0 +1,373 @@
# Logical Plan Execution
We implement classical iterator-style operators. Logical operators define
operations on the database. They encapsulate the following info: what the input is
(another `LogicalOperator`), what to do with the data, and how to do it.
Currently logical operators can have zero or more input operations, and thus a
`LogicalOperator` tree is formed. Most `LogicalOperator` types have only one
input, so we are mostly working with chains instead of full-fledged trees.
You can find information on each operator in `src/query/plan/operator.lcp`.
## Cursor
Logical operators do not perform database work themselves. Instead they create
`Cursor` objects that do the actual work, based on the info in the operator.
Cursors expose a `Pull` method that gets called by the cursor's consumer. The
consumer keeps pulling as long as the `Pull` returns `true` (indicating it
successfully performed some work and might be eligible for another `Pull`).
Most cursors will call the `Pull` function of the cursor of their input
operator, so typically a cursor chain is created that is analogous to the
logical operator chain it was created from.
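
To make the iterator style concrete, below is a minimal sketch of a `Cursor`
interface and a consumer's pull loop. The names and signatures are simplified
and hypothetical; the real interfaces live in `src/query/plan/operator.lcp`
and take additional context arguments.

    #include <memory>

    class Frame;  // Holds the data of the current Pull chain (see next section).

    // Simplified cursor interface: Pull returns true while it produces rows.
    struct Cursor {
      virtual ~Cursor() = default;
      virtual bool Pull(Frame &frame) = 0;
    };

    // Simplified logical operator interface: it only knows how to make a cursor.
    struct LogicalOperator {
      virtual ~LogicalOperator() = default;
      virtual std::unique_ptr<Cursor> MakeCursor() const = 0;
    };

    // The consumer keeps pulling until the top-most cursor is exhausted.
    inline void PullAll(const LogicalOperator &plan, Frame &frame) {
      auto cursor = plan.MakeCursor();
      while (cursor->Pull(frame)) {
        // Each successful Pull leaves one row of results on the frame.
      }
    }
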
## Frame
The `Frame` object contains all the data of the current `Pull` chain. It
serves for communicating data between cursors.
For example, in a `MATCH (n) RETURN n` query the `ScanAllCursor` places a
vertex on the `Frame` for each `Pull`. It places it in the position reserved for
the `n` symbol. Then the `ProduceCursor` can take that same value from the
`Frame` because it knows the appropriate symbol. `Frame` positions are indexed
by `Symbol` objects.
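
Conceptually, the `Frame` behaves like a flat array of values indexed by
symbol positions. The following sketch is only illustrative and uses
hypothetical member names:

    #include <cstdint>
    #include <vector>

    struct TypedValue {};  // Stand-in for the dynamically typed value class.

    // A symbol carries the position reserved for it in the frame.
    struct Symbol {
      int32_t position;
    };

    class Frame {
     public:
      explicit Frame(int64_t size) : elems_(size) {}

      // Cursors read and write values through the symbol's reserved position.
      TypedValue &operator[](const Symbol &symbol) { return elems_[symbol.position]; }
      const TypedValue &operator[](const Symbol &symbol) const {
        return elems_[symbol.position];
      }

     private:
      std::vector<TypedValue> elems_;
    };
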
## ExpressionEvaluator
Expression results are not placed on the `Frame` since they do not need to be
communicated between different `Cursors`. Instead, expressions are evaluated
using an instance of `ExpressionEvaluator`. Since generally speaking an
expression can be defined by a tree of subexpressions, the
`ExpressionEvaluator` is implemented as a tree visitor. There is a performance
sub-optimality here because a stack is used to communicate intermediary
expression results between elements of the tree. This is one of the reasons
why it's planned to use `Frame` for intermediary expression results as well.
The other reason is that it might facilitate compilation later on.
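
As a toy illustration of the idea (not the actual visitor over our AST types),
evaluating an expression tree by recursion looks roughly like this; the
intermediary results live on the call stack rather than on the `Frame`:

    #include <cstdint>
    #include <memory>
    #include <utility>

    // Toy expression tree: integer literals and binary addition only.
    struct Expression {
      virtual ~Expression() = default;
    };
    struct Literal : Expression {
      explicit Literal(int64_t v) : value(v) {}
      int64_t value;
    };
    struct Add : Expression {
      Add(std::unique_ptr<Expression> l, std::unique_ptr<Expression> r)
          : lhs(std::move(l)), rhs(std::move(r)) {}
      std::unique_ptr<Expression> lhs, rhs;
    };

    // Recursively evaluate the tree of subexpressions into a single value.
    inline int64_t Evaluate(const Expression &expr) {
      if (const auto *lit = dynamic_cast<const Literal *>(&expr)) return lit->value;
      const auto &add = dynamic_cast<const Add &>(expr);
      return Evaluate(*add.lhs) + Evaluate(*add.rhs);
    }
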
## Cypher Execution Semantics
Cypher query execution has *mostly* well-defined semantics. Some are
explicitly defined by openCypher and its TCK, while others are implicitly
defined by Neo4j's implementation of Cypher that we want to be generally
compatible with.
These semantics can in short be described as follows: a Cypher query consists
of multiple clauses, some of which modify the graph. Generally, every clause in the
query, when reading it left to right, operates on a consistent state of the
property graph, untouched by subsequent clauses. This means that a `MATCH`
clause in the beginning operates on a graph-state in which modifications by
the subsequent `SET` are not visible.
The stated semantics feel very natural to the end-user, and Neo seems to
implement them well. For Memgraph the situation is complex because
`LogicalOperator` execution (through a `Cursor`) happens one `Pull` at a time
(generally meaning all the query clauses get executed for every top-level
`Pull`). This is not inherently consistent with Cypher semantics because a
`SET` clause can modify data, and the `MATCH` clause that precedes it might
see the modification in a subsequent `Pull`. Also, the `RETURN` clause might
want to stream results to the user before all `SET` clauses have been
executed, so the user might see some intermediate graph state. There are many
edge-cases that Memgraph does its best to avoid to stay true to Cypher
semantics, while at the same time using a high-performance streaming approach.
The edge-cases are enumerated in this document along with the implementation
details they imply.
## Implementation Peculiarities
### Once
An operator that does nothing but whose `Cursor::Pull` returns `true` on the
first `Pull` and `false` on subsequent ones. This operator is used when
another operator has an optional input, because in Cypher a clause will
typically execute once for every input from the preceding clauses, or just
once if there was no preceding input. For example, consider the `CREATE`
clause. In the query `CREATE (n)` only one node is created, while in the query
`MATCH (n) CREATE (m)` a node is created for each existing node. Thus in our
`CreateNode` logical operator the input is either a `ScanAll` operator, or a
`Once` operator.
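
A sketch of such a cursor, reusing the simplified `Cursor` and `Frame` types
from the [Cursor](#cursor) section (hypothetical; the real implementation is
in `operator.lcp`):

    // Signals a single successful Pull so that operators with no real input
    // still execute exactly once.
    class OnceCursor final : public Cursor {
     public:
      bool Pull(Frame &) override {
        if (did_pull_) return false;
        did_pull_ = true;
        return true;
      }

     private:
      bool did_pull_ = false;
    };
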
### GraphView
In the previous section, [Cypher Execution
Semantics](#cypher-execution-semantics), we mentioned how the preceding
clauses should not see changes made in subsequent ones. For that reason, some
operators take a `GraphView` enum value. This value determines which state of
the graph an operator sees.
Consider the query `MATCH (n)--(m) WHERE n.x = 0 SET m.x = 1`. Naive streaming
could match a vertex `n` on the given criteria, expand to `m`, update its
property, and in the next iteration consider the vertex previously matched to
`m` and skip it because its newly set property value does not qualify. This
is not how Cypher works. To handle this issue properly, Memgraph designed the
`VertexAccessor` class that tracks two versions of data: one that was visible
before the current transaction+command, and the optional other that was
created in the current transaction+command. The `MATCH` clause will be planned
as `ScanAll` and `Expand` operations using `GraphView::OLD` value. This will
ensure modifications performed in the same query do not affect it. The same
applies to edges and the `EdgeAccessor` class.
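
The following is a heavily simplified, hypothetical illustration of the
two-version idea; the real `VertexAccessor` additionally deals with MVCC,
transactions and storage concerns:

    #include <cstdint>
    #include <map>
    #include <optional>
    #include <string>
    #include <utility>

    enum class GraphView { OLD, NEW };

    // Simplified stand-in for vertex data (labels, properties, ...).
    struct VertexData {
      std::map<std::string, int64_t> properties;
    };

    class VertexAccessorSketch {
     public:
      explicit VertexAccessorSketch(VertexData old_version)
          : old_(std::move(old_version)) {}

      // Operators planned with GraphView::OLD (ScanAll, Expand in MATCH) read
      // the version visible before the current transaction+command.
      const VertexData &View(GraphView view) const {
        if (view == GraphView::NEW && new_) return *new_;
        return old_;
      }

      // SET-like operations create or update the new version.
      VertexData &Modify() {
        if (!new_) new_ = old_;
        return *new_;
      }

     private:
      VertexData old_;
      std::optional<VertexData> new_;
    };
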
### Existing Record Detection
It's possible that a pattern element has already been declared in the same
pattern, or a preceding pattern. For example `MATCH (n)--(m), (n)--(l)` or a
cycle-detection match `MATCH (n)-->(n) RETURN n`. Implementation-wise,
existing record detection just checks that the expanded record is equal to the
one already on the frame.
### Why Not Use Separate Expansion Ops for Edges and Vertices?
Expanding an edge and a vertex in separate ops is not feasible when matching a
cycle in bi-directional expansions. Consider the query `MATCH (n)--(n) RETURN
n`. Let's try to expand the edge first in one op, and the vertex in the next. The
vertex expansion consumes the edge expansion input. It takes the expanded edge
from the frame. It needs to detect a cycle by comparing the vertex existing on
the frame with one of the edge vertices (`from` or `to`). But which one? It
doesn't know, and can't ensure correct cycle detection.
### Data Visibility During and After SET
In Cypher, setting values always works on the latest version of data (from
preceding or current clause). That means that within a `SET` clause all the
changes from previous clauses must be visible, as well as changes done by the
current `SET` clause. Also, if there is a clause after `SET` it must see *all*
the changes performed by the preceding `SET`. Both these things are best
illustrated with the following queries executed on an empty database:
CREATE (n:A {x:0})-[:EdgeType]->(m:B {x:0})
MATCH (n)--(m) SET m.x = n.x + 1 RETURN labels(n), n.x, labels(m), m.x
This returns:
+---------+---+---------+---+
|labels(n)|n.x|labels(m)|m.x|
+:=======:+:=:+:=======:+:=:+
|[A] |2 |[B] |1 |
+---------+---+---------+---+
|[B] |1 |[A] |2 |
+---------+---+---------+---+
The obtained result implies the following operations:
1. In the first iteration we set the value of `B.x` to 1.
2. In the second iteration we observe `B.x` with the value of 1 and set
`A.x` to 2.
3. In `RETURN` we see all the changes made in both iterations.
To implement the desired behavior Memgraph utilizes two techniques. First is
the already mentioned tracking of two versions of data in vertex accessors.
Using this approach ensures that the second iteration in the example query
sees the data modification performed by the preceding iteration. The second
technique is the `Accumulate` operation that accumulates all the iterations
from the preceding logical op before passing them to the next logical op. In
the example query, `Accumulate` ensures that the results returned to the user
reflect changes performed in all iterations of the query (naive streaming
could stream results at the end of first iteration producing inconsistent
results). Note that `Accumulate` is demanding regarding memory and slows down
query execution. For that reason it should be used only when necessary, for
example it does not have to be used in a query that has `MATCH` and `SET` but
no `RETURN`.
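
The core of `Accumulate` can be sketched as follows, reusing the simplified
`Cursor`, `Frame` and `Symbol` types from above (hypothetical; the real
operator also handles advancing the transaction command and other details):

    #include <cstddef>
    #include <memory>
    #include <utility>
    #include <vector>

    class AccumulateCursorSketch final : public Cursor {
     public:
      AccumulateCursorSketch(std::unique_ptr<Cursor> input, std::vector<Symbol> symbols)
          : input_(std::move(input)), symbols_(std::move(symbols)) {}

      bool Pull(Frame &frame) override {
        if (!accumulated_) {
          // Run the whole preceding part of the query before producing anything.
          while (input_->Pull(frame)) {
            std::vector<TypedValue> row;
            for (const auto &symbol : symbols_) row.push_back(frame[symbol]);
            cache_.push_back(std::move(row));
          }
          accumulated_ = true;
        }
        if (next_row_ == cache_.size()) return false;
        // Restore the cached values onto the frame for the operator that follows.
        const auto &row = cache_[next_row_++];
        for (std::size_t i = 0; i < symbols_.size(); ++i) frame[symbols_[i]] = row[i];
        return true;
      }

     private:
      std::unique_ptr<Cursor> input_;
      std::vector<Symbol> symbols_;
      std::vector<std::vector<TypedValue>> cache_;
      bool accumulated_ = false;
      std::size_t next_row_ = 0;
    };
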
### Neo4j Inconsistency on Multiple SET Clauses
Considering the preceding example, it could be expected that when a query has
multiple `SET` clauses, all the changes from the preceding ones are visible.
This is not the case in Neo4j's implementation. Consider the following queries
executed on an empty database:
CREATE (n:A {x:0})-[:EdgeType]->(m:B {x:0})
MATCH (n)--(m) SET n.x = n.x + 1 SET m.x = m.x * 2
RETURN labels(n), n.x, labels(m), m.x
This returns:
+---------+---+---------+---+
|labels(n)|n.x|labels(m)|m.x|
+:=======:+:=:+:=======:+:=:+
|[A] |2 |[B] |1 |
+---------+---+---------+---+
|[B] |1 |[A] |2 |
+---------+---+---------+---+
If all the iterations of the first `SET` clause were executed before executing
the second, all the resulting values would be 2. This not being the case, we
conclude that Neo4j does not use a barrier-like mechanism between `SET`
clauses. It is Memgraph's current vision that this is inconsistent and we
plan to reduce Neo4j compliance in favour of operation consistency.
### Double Deletion
It's possible to match the same graph element multiple times in a single query
and delete it. Neo supports this, and so do we. The relevant implementation
detail is in the `GraphDbAccessor` class, where the record deletion functions
reside, and not in the logical plan execution. It comes down to checking if a
record has already been deleted in the current transaction+command and not
attempting to do it again (which would otherwise result in a crash).
### Set + Delete Edge-case
It's legal for a query to combine `SET` and `DELETE` clauses. Consider the
following queries executed on an empty database:
CREATE ()-[:T]->()
MATCH (n)--(m) SET n.x = 42 DETACH DELETE m
Due to the `MATCH` being undirected the second pull will attempt to set data
on a deleted vertex. This is not a legal operation in Memgraph's storage
implementation. For that reason the logical operator for `SET` must check if
the record it's trying to set something on has been deleted by the current
transaction+command. If so, the modification is not executed.
### Deletion Accumulation
Sometimes it's necessary to accumulate deletions of all the matches before
attempting to execute them. Consider the following. Start with an empty
database and execute these queries:
CREATE ()-[:T]->()-[:T]->()
MATCH (a)-[r1]-(b)-[r2]-(c) DELETE r1, b, c
Note that the `DELETE` clause attempts to delete node `c`, but it does not
detach it by deleting edge `r2`. However, due to the undirected edge in the
`MATCH`, both edges get pulled and deleted.
Currently Memgraph does not support this behavior, while Neo does. There are a
few ways we could implement it.
* Accumulate on deletion (that sucks because we have to keep track of
everything that gets returned after the deletion).
* Maybe we could stream through the deletion op, but defer actual deletion
until plan-execution end.
* Ignore this because it's very edgy (this is the currently selected option).
### Aggregation Without Input
It is necessary to define what aggregation ops return when they receive no
input. Following is a table that shows what Neo4j's Cypher implementation and
SQL produce.
+-------------+------------------------+---------------------+---------------------+------------------+
| \<OP\> | 1. Cypher, no group-by | 2. Cypher, group-by | 3. SQL, no group-by | 4. SQL, group-by |
+=============+:======================:+:===================:+:===================:+:================:+
| Count(\*) | 0 | \<NO\_ROWS> | 0 | \<NO\_ROWS> |
+-------------+------------------------+---------------------+---------------------+------------------+
| Count(prop) | 0 | \<NO\_ROWS> | 0 | \<NO\_ROWS> |
+-------------+------------------------+---------------------+---------------------+------------------+
| Sum | 0 | \<NO\_ROWS> | NULL | \<NO\_ROWS> |
+-------------+------------------------+---------------------+---------------------+------------------+
| Avg | NULL | \<NO\_ROWS> | NULL | \<NO\_ROWS> |
+-------------+------------------------+---------------------+---------------------+------------------+
| Min | NULL | \<NO\_ROWS> | NULL | \<NO\_ROWS> |
+-------------+------------------------+---------------------+---------------------+------------------+
| Max | NULL | \<NO\_ROWS> | NULL | \<NO\_ROWS> |
+-------------+------------------------+---------------------+---------------------+------------------+
| Collect | [] | \<NO\_ROWS> | N/A | N/A |
+-------------+------------------------+---------------------+---------------------+------------------+
Where:
1. `MATCH (n) RETURN <OP>(n.prop)`
2. `MATCH (n) RETURN <OP>(n.prop), n.prop2`
3. `SELECT <OP>(prop) FROM Table`
4. `SELECT <OP>(prop), prop2 FROM Table GROUP BY prop2`
Neo's Cypher implementation diverges from SQL only when performing `SUM`.
Memgraph implements SQL-like behavior. It is considered that `SUM` of
arbitrary elements should not be implicitly 0, especially in a property graph
without a strict schema (the property in question can contain values of
arbitrary types, or no values at all).
### OrderBy
The `OrderBy` logical operator sorts the results in the desired order. It
occurs in Cypher as part of a `WITH` or `RETURN` clause. Both the concept and
the implementation are straightforward. It's necessary for the logical op to
`Pull` everything from its input so it can be sorted. It's not necessary to
keep the whole `Frame` state of each input, it is sufficient to keep a list of
`TypedValues` on which the results will be sorted, and another list of values
that need to be remembered and recreated on the `Frame` when yielding.
The sorting itself is made to reflect that of Neo's implementation, which comes
down to the following points (a sketch of such a comparator is given after the
list).
* `Null` comes last (as if it's greater than anything).
* Primitive types compare naturally, with no implicit casting except from
`int` to `double`.
* Complex types are not comparable.
* Every unsupported comparison results in an exception that gets propagated
to the end user.
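
A comparator respecting those rules might look roughly like the sketch below;
it is illustrative only, with a toy value type instead of Memgraph's
`TypedValue`:

    #include <cstdint>
    #include <variant>

    // Toy value type: Null, int or double. The real TypedValue has many more
    // alternatives (bool, string, list, map, vertex, edge, ...); comparing
    // those complex types raises an exception that propagates to the end user.
    struct Null {};
    using Value = std::variant<Null, int64_t, double>;

    // Returns true if a sorts strictly before b in ascending order.
    inline bool SortsBefore(const Value &a, const Value &b) {
      // Null comes last, as if it were greater than anything.
      if (std::holds_alternative<Null>(a)) return false;
      if (std::holds_alternative<Null>(b)) return true;
      // Primitives compare naturally; the only implicit cast is int to double.
      auto as_double = [](const Value &v) {
        if (const auto *i = std::get_if<int64_t>(&v)) return static_cast<double>(*i);
        return std::get<double>(v);
      };
      return as_double(a) < as_double(b);
    }

Such a predicate can then be handed to a standard sort over the accumulated
results.
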
### Limit in Write Queries
`Limit` can be used as part of a write query, in which case it will *not*
reduce the amount of performed updates. For example, consider a database that
has 10 vertices. The query `MATCH (n) SET n.x = 1 RETURN n LIMIT 3` will
result in all vertices having their property value changed, while returning
only the first 3 to the client. This makes sense from the implementation
standpoint, because `Accumulate` is planned after `SetProperty` but before
`Produce` and `Limit` operations. Note that this behavior can be
non-deterministic in some queries, since it relies on the order of iteration
over nodes which is undefined when not explicitly specified.
### Merge
`MERGE` in Cypher attempts to match a pattern. If it already exists, it does
nothing and subsequent clauses like `RETURN` can use the matched pattern
elements. If the pattern can't match to any data, it creates it. For detailed
information see Neo4j's [merge
documentation.](https://neo4j.com/docs/developer-manual/current/cypher/clauses/merge/)
An important thing about `MERGE` is visibility of modified data. `MERGE` takes
an input (typically a `MATCH`) and has two additional *phases*: the merging
part, and the subsequent set parts (`ON MATCH SET` and `ON CREATE SET`).
Analysis of Neo4j's behavior indicates that each of these three phases (input,
merge, set) does not see changes to the graph state done by a subsequent phase.
The input phase does not see data created by the merge phase, nor the set
phase. This is consistent with what seems like the general Cypher philosophy
that query clause effects aren't visible in the preceding clauses.
We define the `Merge` logical operator as a *routing* operator that uses three
logical operator branches.
1. The input from a preceding clause.
For example in `MATCH (n), (m) MERGE (n)-[:T]-(m)`. This input is
optional because `MERGE` is allowed to be the first clause in a query.
2. The `merge_match` branch.
This logical operator branch is `Pull`-ed from until exhausted for each
successful `Pull` from the input branch.
3. The `merge_create` branch.
This branch is `Pull`ed when the `merge_match` branch does not match
anything (no successful `Pull`s) for an input `Pull`. It is `Pull`ed only
once in such a situation, since only one creation needs to occur for a
failed match.
The `ON MATCH SET` and `ON CREATE SET` parts of the `MERGE` clause are
included in the `merge_match` and `merge_create` branches respectively. They
are placed on the end of their branches so that they execute only when those
branches succeed.
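
The routing logic can be sketched roughly as follows, reusing the simplified
`Cursor` and `Frame` types from the sections above (hypothetical; details such
as resetting the branch cursors for every new input row are omitted):

    #include <memory>
    #include <utility>

    class MergeCursorSketch final : public Cursor {
     public:
      MergeCursorSketch(std::unique_ptr<Cursor> input, std::unique_ptr<Cursor> merge_match,
                        std::unique_ptr<Cursor> merge_create)
          : input_(std::move(input)),
            merge_match_(std::move(merge_match)),
            merge_create_(std::move(merge_create)) {}

      bool Pull(Frame &frame) override {
        while (true) {
          if (pull_input_) {
            if (!input_->Pull(frame)) return false;
            pull_input_ = false;
            matched_ = false;
          }
          // Exhaust the merge_match branch for the current input row.
          if (merge_match_->Pull(frame)) {
            matched_ = true;
            return true;
          }
          pull_input_ = true;
          // Nothing matched for this input row: create the pattern exactly once.
          if (!matched_ && merge_create_->Pull(frame)) return true;
        }
      }

     private:
      std::unique_ptr<Cursor> input_, merge_match_, merge_create_;
      bool pull_input_ = true;
      bool matched_ = false;
    };
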
Memgraph strives to be consistent with Neo in its `MERGE` implementation,
while at the same time keeping performance as good as possible. Consistency
with Neo w.r.t. graph state visibility is not trivial. Documentation for
`Expand` and `Set` describe how Memgraph keeps track of both the updated
version of an edge/vertex and the old one, as it was before the current
transaction+command. This technique is also used in `Merge`. The input
phase/branch of `Merge` always looks at the old data. The merge phase needs to
see the new data so it doesn't create more data than necessary.
For example, consider the query.
MATCH (p:Person) MERGE (c:City {name: p.lives_in})
This query needs to create a city node only once for each unique `p.lives_in`.
Finally, the set phase of a `MERGE` clause should not affect the merge phase.
To achieve this, the `merge_match` branch of the `Merge` operator should see
the latest created nodes, but filter them on their old state (if those nodes
were not created by the `merge_create` branch). Implementation-wise that means
that `ScanAll` and `Expand` operators in the `merge_match` branch need to look
at the new graph state, while `Filter` operators look at the old, if available.

62
docs/dev/query/parsing.md Normal file

@@ -0,0 +1,62 @@
# Lexical and Syntactic Analysis
## Antlr
We use Antlr for lexical and syntactic analysis of Cypher queries. Antlr uses
the grammar file `Cypher.g4`, downloaded from http://www.opencypher.org, to
generate the parser and the visitor for the Cypher parse tree. Even though the
provided grammar is not very pleasant to work with, we decided not to do any drastic
changes to it so that our transition to newly published versions of
`Cypher.g4` would be easier. Nevertheless, we had to fix some bugs and add
features, so our version is not completely the same.
In addition to using `Cypher.g4`, we have `MemgraphCypher.g4`. This grammar
file defines Memgraph specific extensions to the original grammar. The most
notable example is the inclusion of syntax for handling authorization. At the
moment, some extensions are also found in `Cypher.g4`. For example, the syntax
for using a lambda function in relationship patterns. These extensions should
be moved out of `Cypher.g4`, so that it remains as close to the original
grammar as possible. Additionally, having `MemgraphCypher.g4` may not be
enough if we wish to split the functionality for community and enterprise
editions of Memgraph.
## Abstract Syntax Tree (AST)
Since the Antlr generated visitor and the official openCypher grammar are not
very practical to use, we translate Antlr's AST into our own AST. Currently there
are ~40 types of nodes in our AST. Their definitions can be found in
`src/query/frontend/ast/ast.lcp`.
Major groups of types can be found under the following base types.
* `Expression` --- types corresponding to Cypher expressions.
* `Clause` --- types corresponding to Cypher clauses.
* `PatternAtom` --- node or edge related information.
* `Query` --- different kinds of queries, allows extending the language with
Memgraph specific query syntax.
Memory management of created AST nodes is done with `AstStorage`. Each node
must be created by invoking the `AstStorage::Create` method. This way, all of the
pointers to nodes and their children are raw pointers. The only owner of
allocated memory is the `AstStorage`. When the storage goes out of scope, the
pointers become invalid. It may be more natural to handle tree ownership via
`unique_ptr`, i.e. each node owns its children. But there are some benefits to
having a custom storage and allocation scheme.
The primary reason we opted for not using `unique_ptr` is the requirement of
Antlr's base visitor class that the resulting values must be copyable. The
result is wrapped in `antlr::Any` so that the derived visitor classes may
return any type they wish when visiting Antlr's AST. Unfortunately,
`antlr::Any` does not work with non-copyable types.
Another benefit of having `AstStorage` is that we can easily add a different
allocation scheme for AST nodes. The interface of node creation would not
change.
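
The ownership scheme can be illustrated with a sketch like the following; the
names are simplified and the real `AstStorage` in `src/query/frontend/ast/`
does more:

    #include <memory>
    #include <string>
    #include <utility>
    #include <vector>

    // Simplified base class for all AST nodes.
    struct Tree {
      virtual ~Tree() = default;
    };

    struct Identifier : Tree {
      explicit Identifier(std::string name) : name_(std::move(name)) {}
      std::string name_;
    };

    // Owns every created node; hands out raw pointers that stay valid only
    // while the storage itself is alive.
    class AstStorageSketch {
     public:
      template <class TNode, class... TArgs>
      TNode *Create(TArgs &&...args) {
        auto node = std::make_unique<TNode>(std::forward<TArgs>(args)...);
        TNode *ptr = node.get();
        storage_.push_back(std::move(node));
        return ptr;
      }

     private:
      std::vector<std::unique_ptr<Tree>> storage_;
    };

    // Usage: the returned raw pointer is owned by the storage.
    //   AstStorageSketch storage;
    //   auto *ident = storage.Create<Identifier>("n");
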
### AST Translation
The translation process is done via the `CypherMainVisitor` class, which is
derived from the Antlr generated visitor. Besides instantiating our AST types, a
minimal number of syntactic checks are done on a query. These checks handle
the cases which were valid in the original openCypher grammar, but may be invalid
when combined with other syntax elements.

487
docs/dev/query/planning.md Normal file

@@ -0,0 +1,487 @@
# Logical Planning
After the semantic analysis and symbol generation, the AST is converted to a
tree of logical operators. This conversion is called *planning* and the tree
of logical operators is called a *plan*. The whole planning process is done in
the following steps.
1. [AST Preprocessing](#ast-preprocessing)
The first step is to preprocess the AST by collecting
information on filters, divide the query into parts, normalize patterns
in `MATCH` clauses, etc.
2. [Logical Operator Planning](#logical-operator-planning)
After the preprocess step, the planning can be done via 2 planners:
`VariableStartPlanner` and `RuleBasedPlanner`. The first planner will
generate multiple plans where each plan has different starting points for
searching the patterns in `MATCH` clauses. The second planner produces a
single plan by mapping the query parts as they are to logical operators.
3. [Logical Plan Postprocessing](#logical-plan-postprocessing)
In this stage, we perform various transformations on the generated logical
plan. Here we want to optimize the operations in order to improve
performance during the execution. Naturally, transformations need to
preserve the semantic behaviour of the original plan.
4. [Cost Estimation](#cost-estimation)
After the generation, the execution cost of each plan is estimated. This
estimation is used to select the best plan which will be executed.
5. [Distributed Planning](#distributed-planning)
In case we are running distributed Memgraph, the final plan is adapted
for distributed execution. NOTE: This appears to be an error in the
workflow. Distributed planning should be moved before step 3. or
integrated with it. With the workflow ordered as is now, cost estimation
doesn't consider the distributed plan.
The implementation can be found in the `query/plan` directory, with the public
entry point being `query/plan/planner.hpp`.
## AST Preprocessing
Each openCypher query consists of at least 1 **single query**. Multiple single
queries are chained together using a **query combinator**. Currently, there is
only one combinator, `UNION`. The preprocessing step starts in the
`CollectQueryParts` function. This function will take a look at each single
query and divide it into parts. Each part is separated with `RETURN` and
`WITH` clauses. For example:
    MATCH (n) CREATE (m) WITH m MATCH (l)-[]-(m) RETURN l
    |                          |                        |
    |--------- part 1 --------+-------- part 2 --------|
    |                                                   |
    |------------------- single query ------------------|
Each part is created by collecting all `MATCH` clauses and *normalizing* their
patterns. Pattern normalization is the process of converting an arbitrarily
long pattern chain of nodes and edges into a list of triplets `(start node,
edge, end node)`. The triplets should preserve the semantics of the match. For
example:
MATCH (a)-[p]-(b)-[q]-(c)-[r]-(d)
is equivalent to:
MATCH (a)-[p]-(b), (b)-[q]-(c), (c)-[r]-(d)
With this representation, it becomes easier to reorder the triplets and choose
different strategies for pattern matching.
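
As an illustrative sketch (with hypothetical, simplified types), normalization
walks the chain of pattern atoms and emits one triplet per edge:

    #include <cstddef>
    #include <string>
    #include <vector>

    struct Triplet {
      std::string start_node, edge, end_node;
    };

    // The chain alternates node and edge names, e.g. {"a", "p", "b", "q", "c"}.
    inline std::vector<Triplet> NormalizePattern(const std::vector<std::string> &chain) {
      std::vector<Triplet> triplets;
      // Every edge sits at an odd index and connects the nodes around it.
      for (std::size_t i = 1; i + 1 < chain.size(); i += 2) {
        triplets.push_back({chain[i - 1], chain[i], chain[i + 1]});
      }
      return triplets;
    }
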
In addition to normalizing patterns, all of the filter expressions in patterns
and inside of the `WHERE` clause (of the accompanying `MATCH`) are extracted
and stored separately. During the extraction, symbols used in the filter
expression are collected. This allows for planning filters in a valid order,
as the matching for triplets is being done. Another important benefit of
having extra information on filters is the ability to recognize when a database index
could be used.
After each `MATCH` is processed, they are all grouped, so that even the whole
`MATCH` clauses may be reordered. The important thing is to remember which
symbols were used to name edges in each `MATCH`. With those symbols we can
plan for *cyphermorphism*, i.e. ensure different edges in the search pattern
of a single `MATCH` map to different edges in the graph. This preserves the
semantics of the query, even though we may have reordered the matching. The
same steps are done for `OPTIONAL MATCH`.
Another clause which needs processing is `MERGE`. Here we normalize the
pattern, since the `MERGE` is a bit like `MATCH` and `CREATE` in one.
All the other clauses are left as is.
In the end, each query part consists of:
* processed and grouped `MATCH` clauses;
* processed and grouped `OPTIONAL MATCH` clauses;
* processed `MERGE` matching pattern and
* unchanged remaining clauses.
The last stored clause is guaranteed to be either `WITH` or `RETURN`.
## Logical Operator Planning
### Variable Start Planner
The `VariableStartPlanner` generates multiple plans for a single query. Each
plan is generated by selecting a different starting point for pattern
matching.
The algorithm works as follows.
1. For each query part:
1. For each node in triplets of collected `MATCH` clauses:
i. Add the node to a set of `expanded` nodes
ii. Select a triplet `(start node, edge, end node)` whose `start node` is
in the `expanded` set
iii. If no triplet was selected, choose a new starting node that isn't in
`expanded` and continue expanding
iv. Repeat steps ii. -- iii. until all triplets have been selected
and store that as a variation of the `MATCH` clauses
2. Do step 1.1. for `OPTIONAL MATCH` and `MERGE` clauses
3. Take all combinations of the generated `MATCH`, `OPTIONAL MATCH` and
`MERGE` and store them as variations of the query part.
2. For each combination of query part variations:
1. Generate a plan using the rule based planner
### Rule Based Planner
The `RuleBasedPlanner` generates a single plan for a single query. A plan is
generated by following hardcoded rules for producing logical operators. The
following sections are an overview on how each openCypher clause is converted
to a `LogicalOperator`.
#### MATCH
The `MATCH` clause is used to specify which patterns need to be searched for in
the database. These patterns are normalized in the preprocess step to be
represented as triplets `(start node, edge, end node)`. When there is no edge,
then the triplet is reduced only to the `start node`. Generating the operators
is done by looping over these triplets.
##### Searching for Nodes
The simplest search is finding stand-alone nodes. For example, `MATCH (n)`
will find all the nodes in the graph. This is accomplished by generating a
`ScanAll` operator and forwarding the node symbol which should store the
results. In this case, all the nodes will be referenced by `n`.
Multiple nodes can be specified in a single match, e.g. `MATCH (n), (m)`.
Planning is done by repeating the same steps for each sub pattern (separated
by a comma). In this case, we would get 2 `ScanAll` operators chained one
after the other. An optimization can be obtained if the node in the pattern is
already searched for. In `MATCH (n), (n)` we can drop the second `ScanAll`
operator since we have already generated it for the first node.
##### Searching for Relationships
A more advanced search includes finding nodes with relationships. For example,
`MATCH (n)-[r]-(m)` should find every pair of connected nodes in the database.
This means, that if a single node has multiple connections, it will be
repeated for each combination of pairs. The generation of operators starts
from the first node in the pattern. If we are referencing a new starting node,
we need to generate a `ScanAll` which finds all the nodes and stores them
into `n`. Then, we generate an `Expand` operator which reads the `n` and
traverses all the edges of that node. The edge is stored into `r`, while the
destination node is stored in `m`.
Matching multiple relationships proceeds similarly, by repeating the same
steps. The only difference is that we need to ensure different edges in the
search pattern map to different edges in the graph. This means that after
each `Expand` operator, we need to generate an `ExpandUniquenessFilter`. We
provide this operator with a list of symbols for the previously matched edges
and the symbol for the current edge.
For example.
MATCH (n)-[r1]-(m)-[r2]-(l)
The above is preprocessed into
MATCH (n)-[r1]-(m), (m)-[r2]-(l)
Then we look at each triplet in order and perform the described steps. This
way, we would generate:
ScanAll (n) > Expand (n, r1, m) > Expand (m, r2, l) >
ExpandUniquenessFilter ([r1], r2)
Note that we don't need to make `ExpandUniquenessFilter` after the first
`Expand`, since there are no edges to compare to. This filtering needs to work
across multiple patterns, but inside a *single* `MATCH` clause.
Let's take a look at the following.
MATCH (n)-[r1]-(m), (m)-[r2]-(l)
We would also generate the exact same operators.
ScanAll (n) > Expand (n, r1, m) > Expand (m, r2, l) >
ExpandUniquenessFilter ([r1], r2)
On the other hand,
MATCH (n)-[r1]-(m) MATCH (m)-[r2]-(l)-[r3]-(i)
We would reset the uniqueness filtering at the start of the second match. This
would mean that we output the following:
ScanAll (n) > Expand (n, r1, m) > Expand (m, r2, l) > Expand (l, r3, i) >
ExpandUniquenessFilter ([r2], r3)
There is a difference in how we handle edge uniqueness compared to Neo4j.
Neo4j does not allow searching for a single edge multiple times, but we've
decided to support that.
For example, the user can say the following.
MATCH (n)-[r]-(m)-[r]-(l)
We would ensure that both `r` variables match to the same edge. In our
terminology, we call this the *edge cycle*. For the above example, we would
generate this plan.
ScanAll (n) > Expand (n, r, m) > Expand (m, r, l)
We do not put an `ExpandUniquenessFilter` operator between 2 `Expand`
operators and we tell the 2nd `Expand` that it is an edge cycle. Thus, the 2nd
`Expand` will ensure that both `r` variables match the same edge.
##### Filtering
To narrow the search down, the patterns in `MATCH` can have filtered labels
and properties. A more general filtering is done using the accompanying
`WHERE` clause. During the preprocess step, all filters are collected and
extracted into expressions. Additional information on which symbols are used
is also stored. This way, each time we generate a `ScanAll` or `Expand`, we
look at all the filters to see if any of them can be used, i.e. if the symbols
they use have been bound by a newly produced operator. If a filter expression
can be used, we immediately add a `Filter` operator with that expression.
For example.
MATCH (n)-[r]-(m :label) WHERE n.prop = 42
We would produce:
ScanAll (n) > Filter (n.prop) > Expand (n, r, m) > Filter (m :label)
This means that the same plan is generated for the query:
MATCH (n {prop: 42})-[r]-(m :label)
#### OPTIONAL
If a `MATCH` clause is preceded by `OPTIONAL`, then we need to generate a plan
such that we produce results even if we fail to match anything. This is
accomplished by generating an `Optional` operator, which takes 2 operator
trees:
* input operation and
* optional operation.
The input is the operation we generated for the part of the query before
`OPTIONAL MATCH`. For the optional operation, we simply generate the `OPTIONAL
MATCH` part just like we would for regular `MATCH`. In addition to operations,
we need to send the symbols which are set during optional matching to the
`Optional` operator. The operator will reset values of those symbols to
`null`, when the optional part fails to match.
#### RETURN & WITH
`RETURN` and `WITH` clauses are very similar to each other. The only
difference is that `WITH` separates parts of the query and can be paired with
a `WHERE` clause.
The common part is generating operators for the body of the clause. Separation
of query parts is mostly done in semantic analysis, which checks that only the
symbols exposed through `WITH` are visible in the query parts after the
clause. The minor part is done in planning.
##### Named Results
Both clauses contain multiple named expressions (`expr AS name`) which are
used to generate `Produce` operator.
##### Aggregations
If an expression contains an aggregation operator (`sum`, `avg`, ...) we need
to plan the `Aggregate` operator as input to `Produce`. This case is more
complex, because aggregation in openCypher can perform implicit grouping of
results used for aggregation.
For example, `WITH/RETURN sum(n.x) AS s, n.y AS group` will implicitly group
by `n.y` expression.
Another, more obscure grouping can be achieved with `RETURN sum(n.a) + n.b AS s`.
Here, the `n.b` will be used for grouping, even though both the `sum` and
`n.b` are in the same named expression.
Therefore, we need to collect all expressions which do not contain
aggregations and use them for grouping. You may have noticed that in the last
example `sum` is actually a sub-expression of `+`. The `Aggregate` operator does
not see that (nor should it), so the responsibility of evaluating that falls
on `Produce`. One way is for `Aggregate` to store results of grouping
expressions on the frame in addition to aggregation results. Unfortunately,
this would require rewiring named expressions in `Produce` to reference
already evaluated expressions. In the current implementation, we opted for
`Aggregate` to store only aggregation results on the frame, while `Produce`
will re-evaluate all the other (grouping) expressions. To handle that, symbols
which are used in expressions are passed to `Aggregate`, so that they can be
remembered. `Produce` will read those symbols from the frame and use them to
re-evaluate the needed expressions.
##### Accumulation
After we have `Produce` and potentially `Aggregate`, we need to handle a
special case when the part of the query before `RETURN` or `WITH` performs
updates. For that, we want to run that part of the query fully, so that we get
the latest results. This is accomplished by adding `Accumulate` operator as
input to `Aggregate` or `Produce` (if there is no aggregation). Accumulation
will store all the values for all the used symbols inside `RETURN` and `WITH`,
so that they can be used in the operator which follows. This way, only parts
of the frame are copied, instead of the whole frame. There is a minor
difference between planning `WITH` compared to `RETURN`. Since `WITH` can
separate writing from reading, we need to advance the transaction command.
This enables the later, read parts of the query to obtain the newest changes.
This is supported by passing the `advance_command` flag to the `Accumulate` operator.
In the simplest case, common to both clauses, we have `Accumulate > Aggregate
> Produce` operators, where `Accumulate` and `Aggregate` may be left out.
##### Ordering
Planning `ORDER BY` is simple enough. Since it may see new symbols (filled in
`Produce`), we add the `OrderBy` operator at the end. The operator will change
the order of produced results, so we pass it the ordering expressions and the
output symbols of named expressions.
##### Filtering
A final difference in `WITH` is when it contains a `WHERE` clause. For that,
we simply generate the `Filter` operator, appended after `Produce` or
`OrderBy` (depending on which operator is last).
##### Skipping and Limiting
If we have `SKIP` or `LIMIT`, we generate `Skip` or `Limit` operators,
respectively. These operators are put at the end of the clause.
This placement may have some unexpected behaviour when combined with
operations that update the graph. For example.
MATCH (n) SET n.x = n.x + 1 RETURN n LIMIT 1
The above query may be interpreted as if the `SET` will be done only once.
Since this is a write query, we need to accumulate results, so the part before
`RETURN` will execute completely. The accumulated results will be yielded up
to the given limit, and the user would get only the first `n` that was
updated. This may confuse the user because in reality, every node in the
database had been updated.
Note that `Skip` always comes before `Limit`. In the current implementation,
they are generated directly one after the other.
#### CREATE
The `CREATE` clause is used to create nodes and edges (relationships).
For multiple `CREATE` clauses or multiple creation patterns in a single
clause, we perform the following steps for each of them.
##### Creating a Single Node
A node is created by simply specifying a node pattern.
For example `CREATE (n :label {property: "value"}), ()` would create 2 nodes.
The 1st one would be created with a label and a property. This node could be
referenced later in the query, by using the variable `n`. The 2nd node cannot
be referenced and it would be created without any labels or properties. For
node creation, we generate a `CreateNode` operator and pass it all the details
of node creation: variable symbol, labels and properties. In the mentioned
example, we would have `CreateNode > CreateNode`.
##### Creating a Relationship
To create a relationship, the `CREATE` clause must contain a pattern with a
directed edge. Compared to creating a single node, this case is a bit more
complicated, because either side of the edge may not exist. By exist, we mean
that the endpoint is a variable which already references a node.
For example, `MATCH (n) CREATE (n)-[r]->(m)` would create an edge `r` and a
node `m` for each matched node `n`. If we focus on the `CREATE` part, we
generate `CreateExpand (n, r, m)` where `n` already exists (refers to matched
node) and `m` would be newly created along with edge `r`. If we had only
`CREATE (n) -[r]-> (m)`, then we would need to create both nodes of the edge
`r`. This is done by generating `CreateNode (n) > CreateExpand(n, r, m)`. The
final case is when both endpoints refer to an existing node. For example, when
adding a node with a cyclical connection `CREATE (n)-[r]->(n)`. In this case,
we would generate `CreateNode (n) > CreateExpand (n, r, n)`. We would tell
`CreateExpand` to only create the edge `r` between the already created `n`.
#### MERGE
Although the merge operation is complex, planning turns out to be relatively
simple. The pattern inside the `MERGE` clause is used for both matching and
creating. Therefore, we create 2 operator trees, one for each action.
For example.
MERGE (n)-[r:r]-(m)
We would generate a single `Merge` operator which has the following.
* No input operation (since it is not preceded by any other clause).
* On match operation
`ScanAll (n) > Expand (n, r, m) > Filter (r)`
* On create operation
`CreateNode (n) > CreateExpand (n, r, m)`
In cases when `MERGE` contains `ON MATCH` and `ON CREATE` parts, we simply
append their operations to the respective operator trees.
Observe the following example.
MERGE (n)-[r:r]-(m) ON MATCH SET n.x = 42 ON CREATE SET m :label
The `Merge` would be generated with the following.
* No input operation (again, since there is no clause preceding it).
* On match operation
`ScanAll (n) > Expand (n, r, m) > Filter (r) > SetProperty (n.x, 42)`
* On create operation
`CreateNode (n) > CreateExpand (n, r, m) > SetLabels (m, :label)`
When we have preceding clauses, we simply put their operator as input to
`Merge`.
MATCH (n) MERGE (n)-[r:r]-(m)
The above would be generated as
ScanAll (n) > Merge (on_match_operation, on_create_operation)
Here we need to be careful to recognize which symbols are already declared.
But, since the `on_match_operation` uses the same algorithm for generating a
`Match`, that problem is handled there. The same should hold for
`on_create_operation`, which uses the process of generating a `Create`. So,
finally for this example, the `Merge` would have:
* Input operation
`ScanAll (n)`
* On match operation
`Expand (n, r, m) > Filter (r)`
Note that `ScanAll` is not needed since we get the nodes from input.
* On create operation
`CreateExpand (n, r, m)`
Note that `CreateNode` is dropped, since we want to expand the existing one.
## Logical Plan Postprocessing
NOTE: TODO
## Cost Estimation
NOTE: TODO
## Distributed Planning
NOTE: TODO

134
docs/dev/query/semantic.md Normal file

@@ -0,0 +1,134 @@
# Semantic Analysis and Symbol Generation
In this phase, various semantic and variable type checks are performed.
Additionally, we generate symbols which map AST nodes to stored values
computed from evaluated expressions.
## Symbol Generation
Implementation can be found in `query/frontend/semantic/symbol_generator.cpp`.
Symbols are generated for each AST node that represents data that needs to
have storage. Currently, these are:
* `NamedExpression`
* `CypherUnion`
* `Identifier`
* `Aggregation`
You may notice that the above AST nodes may not correspond to something named
by a user. For example, `Aggregation` can be a part of a larger expression and
thus remain unnamed. The reason we still generate symbols is to have a uniform
behaviour when executing a query, as well as to allow caching the results of
expression evaluation.
AST nodes do not actually store a `Symbol` instance; instead they have an
`int32_t` index identifying the symbol in the `SymbolTable` class. This is
done to minimize the size of AST types as well as to allow easier sharing of
the same symbols among multiple instances of AST nodes.
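
A rough sketch of that relationship, with simplified, hypothetical member
names:

    #include <cstdint>
    #include <string>
    #include <vector>

    // Each symbol knows its name and its position, which doubles as the index
    // of its slot in the frame during execution.
    struct Symbol {
      std::string name;
      int32_t position;
    };

    class SymbolTable {
     public:
      // Called during symbol generation; the AST node keeps only the position.
      Symbol CreateSymbol(const std::string &name) {
        Symbol symbol{name, static_cast<int32_t>(symbols_.size())};
        symbols_.push_back(symbol);
        return symbol;
      }

      // Look a symbol up from the position stored in an AST node.
      const Symbol &at(int32_t position) const { return symbols_[position]; }

     private:
      std::vector<Symbol> symbols_;
    };
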
The storage for evaluated data is represented by the `Frame` class. Each
symbol determines a unique position in the frame. During interpretation,
evaluation of expressions which have a symbol will either read or store values
in the frame. For example, instance of an `Identifier` will use the symbol to
find and read the value from `Frame`. On the other hand, `NamedExpression`
will take the result of evaluating its own expression and store it in the
`Frame`.
When a symbol is created, the context of creation is used to assign a type to that
symbol. This type is used for simple type checking operations. For example,
`MATCH (n)` will create a symbol for variable `n`. Since the `MATCH (n)`
represents finding a vertex in the graph, we can set `Symbol::Type::Vertex`
for that symbol. Later, for example in `MATCH ()-[n]-()` we see that variable
`n` is used as an edge. Since we already have a symbol for that variable, we
detect this type mismatch and raise a `SemanticException`.
The basic rule of symbol generation is that variables inside `MATCH`, `CREATE`,
`MERGE`, `WITH ... AS` and `RETURN ... AS` clauses establish new symbols.
### Symbols in Patterns
Inside `MATCH`, symbols are created only if they didn't exist before. For
example, patterns in `MATCH (n {a: 5})--(m {b: 5}) RETURN n, m` will create 2
symbols: one for `n` and one for `m`. `RETURN` clause will, in turn, reference
those symbols. Symbols established in a part of a pattern are immediately bound
and visible in later parts. For example, `MATCH (n)--(n)` will create a symbol
for variable `n` in the 1st `(n)`. That symbol is referenced in the 2nd `(n)`. Note
that the symbol is not bound inside the 1st `(n)` itself. What this means is that,
for example, `MATCH (n {a: n.b})` should raise an error, because `n` is not
yet bound when encountering `n.b`. On the other hand,
`MATCH (n)--(n {a: n.b})` is fine.
The `CREATE` is similar to `MATCH`, but it *always* establishes symbols for
variables which create graph elements. What this means is that, for example
`MATCH (n) CREATE (n)` is not allowed. `CREATE` wants to create a new node,
for which we already have a symbol. In such a case, we need to throw an error
that the variable `n` is being redeclared. On the other hand `MATCH (n) CREATE
(n)-[r :r]->(n)` is fine, because `CREATE` will only create the edge `r`,
connecting the already existing node `n`. The remaining behaviour is the same as
in `MATCH`. This means that we can simplify `CREATE` to be like `MATCH` with 2
special cases.
1. Are we creating a node, i.e. `CREATE (n)`? If yes, then the symbol for
`n` must not have been created before. Otherwise, we reference the
existing symbol.
2. Are we creating an edge, i.e. we encounter a variable for an edge inside
`CREATE`? If yes, then that variable must not reference a symbol.
The `MERGE` clause is treated the same as `CREATE` with regards to symbol
generation. The only difference is that we allow bidirectional edges in the
pattern. When creating such a pattern, the direction of the created edge is
arbitrarily determined.
### Symbols in WITH and RETURN
In addition to patterns, new symbols are established in the `WITH` clause.
This clause makes the new symbols visible *only* to the rest of the query.
For example, `MATCH (old) WITH old AS new RETURN new, old` should raise an
error that `old` is unbound inside `RETURN`.
There is a special case with symbol visibility in `WHERE` and `ORDER BY`. They
need to see both the old and the new symbols. Therefore `MATCH (old) RETURN
old AS new ORDER BY old.prop` needs to work. On the other hand, if we perform
aggregations inside `WITH` or `RETURN`, then the old symbols should not be
visible in either `WHERE` or `ORDER BY`. Since the aggregation has to go
through all the results in order to generate the final value, it makes no
sense to store old symbols and their values. A query like `MATCH (old) WITH
SUM(old.prop) AS sum WHERE old.prop = 42 RETURN sum` needs to raise an error
that `old` is unbound inside `WHERE`.
For cases when `SKIP` and `LIMIT` appear, we disallow any identifiers from
appearing in their expressions. Basically, `SKIP` and `LIMIT` can only be
constant expressions[^1]. For example, `MATCH (old) RETURN old AS new SKIP
new.prop` needs to raise an error that variables are not allowed in `SKIP`. It
makes no sense to allow variables, since their values may vary on each
iteration. On the other hand, we could support variables bound to constant
expressions, but for simplicity we do not. For example, `MATCH (old) RETURN old, 2 AS limit_var
LIMIT limit_var` would still throw an error.
Finally, we generate symbols for names created in `RETURN` clause. These
symbols are used for the final results of a query.
NOTE: New symbols in `WITH` and `RETURN` should be unique. This means that
`WITH a AS same, b AS same` is not allowed, and neither is a construct like
`RETURN 2, 2`.
### Symbols in Functions which Establish New Scope
Symbols can also be created in some functions. These functions usually take an
expression, bind a single variable and run the expression inside the newly
established scope.
The `all` function takes a list, creates a variable for the list elements and runs
the predicate expression. For example:
MATCH (n) RETURN n, all(n IN n.prop_list WHERE n < 42)
We create a new symbol for use inside `all`. This means that `WHERE n <
42` uses the `n` which takes values from the elements of `n.prop_list`. The
original `n` bound by `MATCH` is not visible inside the `all` function, but it
is visible outside. Therefore, the `RETURN n` and `n.prop_list` reference the
`n` from `MATCH`.
[^1]: Constant expressions are expressions for which the result can be
computed at compile time.