diff --git a/docs/dev/query/.gitignore b/docs/dev/query/.gitignore
new file mode 100644
index 000000000..5ccff1a6b
--- /dev/null
+++ b/docs/dev/query/.gitignore
@@ -0,0 +1 @@
html/

diff --git a/docs/dev/query/build-html b/docs/dev/query/build-html
new file mode 100755
index 000000000..ec445bc73
--- /dev/null
+++ b/docs/dev/query/build-html
@@ -0,0 +1,11 @@
#!/bin/bash

# Convert every markdown file in this directory to a standalone HTML page.

script_dir="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"

mkdir -p "$script_dir/html"

for markdown_file in $(find "$script_dir" -name '*.md'); do
  name=$(basename -s .md "$markdown_file")
  # Rewrite inter-document links from .md to .html before conversion.
  sed -e 's/\.md/.html/g' "$markdown_file" | \
    pandoc -s -f markdown -t html -o "$script_dir/html/$name.html"
done

diff --git a/docs/dev/query/contents.md b/docs/dev/query/contents.md
new file mode 100644
index 000000000..6a3494daf
--- /dev/null
+++ b/docs/dev/query/contents.md
@@ -0,0 +1,14 @@
# Query Parsing, Planning and Execution

This part of the documentation deals with query execution.

Memgraph currently supports only query interpretation. Each new query is
parsed, analysed and translated into a sequence of operations which are then
executed on the main database storage. Query execution is organized into the
following phases:

 1. [Lexical Analysis (Tokenization)](parsing.md)
 2. [Syntactic Analysis (Parsing)](parsing.md)
 3. [Semantic Analysis and Symbol Generation](semantic.md)
 4. [Logical Planning](planning.md)
 5. [Logical Plan Execution](execution.md)

diff --git a/docs/dev/query/execution.md b/docs/dev/query/execution.md
new file mode 100644
index 000000000..1305572e4
--- /dev/null
+++ b/docs/dev/query/execution.md
@@ -0,0 +1,373 @@
# Logical Plan Execution

We implement classical iterator style operators. Logical operators define
operations on the database. They encapsulate the following info: what the
input is (another `LogicalOperator`), what to do with the data, and how to do
it.

Currently, logical operators can have zero or more input operations, and thus
a `LogicalOperator` tree is formed. Most `LogicalOperator` types have only one
input, so we are mostly working with chains instead of full-fledged trees.
You can find information on each operator in `src/query/plan/operator.lcp`.

## Cursor

Logical operators do not perform database work themselves. Instead they create
`Cursor` objects that do the actual work, based on the info in the operator.
Cursors expose a `Pull` method that gets called by the cursor's consumer. The
consumer keeps pulling as long as `Pull` returns `true` (indicating it
successfully performed some work and might be eligible for another `Pull`).
Most cursors call the `Pull` function of their input operator's cursor, so
typically a cursor chain is created that is analogous to the logical operator
chain it was created from.

## Frame

The `Frame` object contains all the data of the current `Pull` chain. It
serves for communicating data between cursors.

For example, in a `MATCH (n) RETURN n` query the `ScanAllCursor` places a
vertex on the `Frame` for each `Pull`. It places it in the slot reserved for
the `n` symbol. Then the `ProduceCursor` can take that same value from the
`Frame` because it knows the appropriate symbol. `Frame` positions are indexed
by `Symbol` objects.
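The following is a minimal sketch of how these pieces fit together. The
interfaces are simplified and the `Execute` driver is illustrative only; the
real definitions live in `src/query/plan/operator.lcp`:

    // A minimal sketch of iterator style execution, under simplified
    // interfaces; the real classes are richer than this.
    #include <memory>

    struct Frame;  // storage for one row of values, indexed by Symbol

    struct Cursor {
      virtual ~Cursor() = default;
      // Returns true if a row was produced on the frame and the cursor may
      // be pulled again; false once it is exhausted.
      virtual bool Pull(Frame &frame) = 0;
    };

    struct LogicalOperator {
      virtual ~LogicalOperator() = default;
      virtual std::unique_ptr<Cursor> MakeCursor() const = 0;
    };

    // The consumer drives execution by pulling from the top-most cursor
    // until it is exhausted; each successful Pull leaves one row's values
    // on the frame.
    void Execute(const LogicalOperator &plan, Frame &frame) {
      auto cursor = plan.MakeCursor();
      while (cursor->Pull(frame)) {
        // Read the output symbols from the frame and stream them onward.
      }
    }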
## ExpressionEvaluator

Expression results are not placed on the `Frame` since they do not need to be
communicated between different `Cursors`. Instead, expressions are evaluated
using an instance of `ExpressionEvaluator`. Since, generally speaking, an
expression can be defined by a tree of subexpressions, the
`ExpressionEvaluator` is implemented as a tree visitor. There is a performance
inefficiency here because a stack is used to communicate intermediary
expression results between elements of the tree. This is one of the reasons
why we plan to use the `Frame` for intermediary expression results as well.
The other reason is that it might facilitate compilation later on.

## Cypher Execution Semantics

Cypher query execution has *mostly* well-defined semantics. Some are
explicitly defined by openCypher and its TCK, while others are implicitly
defined by Neo4j's implementation of Cypher, with which we want to be
generally compatible.

These semantics can in short be described as follows: a Cypher query consists
of multiple clauses, some of which modify the graph. Generally, every clause
in the query, when reading it left to right, operates on a consistent state
of the property graph, untouched by subsequent clauses. This means that a
`MATCH` clause at the beginning operates on a graph state in which
modifications by the subsequent `SET` are not visible.

The stated semantics feel very natural to the end-user, and Neo seems to
implement them well. For Memgraph the situation is more complex because
`LogicalOperator` execution (through a `Cursor`) happens one `Pull` at a time
(generally meaning all the query clauses get executed for every top-level
`Pull`). This is not inherently consistent with Cypher semantics because a
`SET` clause can modify data, and the `MATCH` clause that precedes it might
see the modification in a subsequent `Pull`. Also, the `RETURN` clause might
want to stream results to the user before all `SET` clauses have been
executed, so the user might see some intermediate graph state. There are many
edge-cases that Memgraph does its best to avoid to stay true to Cypher
semantics, while at the same time using a high-performance streaming approach.
The edge-cases are enumerated in this document along with the implementation
details they imply.

## Implementation Peculiarities

### Once

An operator that does nothing but whose `Cursor::Pull` returns `true` on the
first `Pull` and `false` on subsequent ones. This operator is used when
another operator has an optional input, because in Cypher a clause will
typically execute once for every input from the preceding clauses, or just
once if there was no preceding input. For example, consider the `CREATE`
clause. In the query `CREATE (n)` only one node is created, while in the query
`MATCH (n) CREATE (m)` a node is created for each existing node. Thus in our
`CreateNode` logical operator the input is either a `ScanAll` operator, or a
`Once` operator.
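The `Once` cursor is simple enough to sketch in full. This is a simplified
rendition, not the exact implementation:

    struct Frame;

    // Simplified model of the Once cursor: exactly one successful Pull, so
    // a consumer with no real input still executes its own logic once.
    class OnceCursor {
     public:
      bool Pull(Frame &) {
        if (did_pull_) return false;  // already produced its single "row"
        did_pull_ = true;
        return true;
      }

     private:
      bool did_pull_ = false;
    };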
### GraphView

In the previous section, [Cypher Execution
Semantics](#cypher-execution-semantics), we mentioned how the preceding
clauses should not see changes made in subsequent ones. For that reason, some
operators take a `GraphView` enum value. This value determines which state of
the graph an operator sees.

Consider the query `MATCH (n)--(m) WHERE n.x = 0 SET m.x = 1`. Naive streaming
could match a vertex `n` on the given criteria, expand to `m`, update its
property, and in the next iteration consider the vertex previously matched to
`m` and skip it because its newly set property value does not qualify. This
is not how Cypher works. To handle this issue properly, Memgraph designed the
`VertexAccessor` class that tracks two versions of data: one that was visible
before the current transaction+command, and an optional other that was
created in the current transaction+command. The `MATCH` clause will be planned
as `ScanAll` and `Expand` operations using the `GraphView::OLD` value. This
will ensure modifications performed in the same query do not affect it. The
same applies to edges and the `EdgeAccessor` class.

### Existing Record Detection

It's possible that a pattern element has already been declared in the same
pattern, or a preceding pattern. For example `MATCH (n)--(m), (n)--(l)` or a
cycle-detection match `MATCH (n)-->(n) RETURN n`. Implementation-wise,
existing record detection just checks that the expanded record is equal to the
one already on the frame.

### Why Not Use Separate Expansion Ops for Edges and Vertices?

Expanding an edge and a vertex in separate ops is not feasible when matching a
cycle in bi-directional expansions. Consider the query `MATCH (n)--(n) RETURN
n`. Let's try to expand first the edge in one op, and the vertex in the next.
The vertex expansion consumes the edge expansion input. It takes the expanded
edge from the frame. It needs to detect a cycle by comparing the vertex
existing on the frame with one of the edge vertices (`from` or `to`). But
which one? It doesn't know, and can't ensure correct cycle detection.

### Data Visibility During and After SET

In Cypher, setting values always works on the latest version of data (from
the preceding or current clause). That means that within a `SET` clause all
the changes from previous clauses must be visible, as well as changes done by
the current `SET` clause. Also, if there is a clause after `SET` it must see
*all* the changes performed by the preceding `SET`. Both these things are best
illustrated with the following queries executed on an empty database:

    CREATE (n:A {x:0})-[:EdgeType]->(m:B {x:0})
    MATCH (n)--(m) SET m.x = n.x + 1 RETURN labels(n), n.x, labels(m), m.x

This returns:

+---------+---+---------+---+
|labels(n)|n.x|labels(m)|m.x|
+:=======:+:=:+:=======:+:=:+
|[A]      |2  |[B]      |1  |
+---------+---+---------+---+
|[B]      |1  |[A]      |2  |
+---------+---+---------+---+

The obtained result implies the following operations:

 1. In the first iteration set the value of `B.x` to 1
 2. In the second iteration we observe `B.x` with the value of 1 and set
    `A.x` to 2
 3. In `RETURN` we see all the changes made in both iterations

To implement the desired behavior Memgraph utilizes two techniques. The first
is the already mentioned tracking of two versions of data in vertex
accessors. Using this approach ensures that the second iteration in the
example query sees the data modification performed by the preceding
iteration. The second technique is the `Accumulate` operation that
accumulates all the iterations from the preceding logical op before passing
them to the next logical op. In the example query, `Accumulate` ensures that
the results returned to the user reflect changes performed in all iterations
of the query (naive streaming could stream results at the end of the first
iteration, producing inconsistent results). Note that `Accumulate` is
demanding regarding memory and slows down query execution. For that reason it
should be used only when necessary; for example, it does not have to be used
in a query that has `MATCH` and `SET` but no `RETURN`.
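A condensed sketch of the two-version technique follows. The member layout
and the `View` method are assumptions made for illustration; the real
`VertexAccessor` is considerably more involved:

    enum class GraphView { OLD, NEW };

    struct Properties {};  // stand-in for the real property store

    // Simplified model: `old_` is the version visible before the current
    // transaction+command, `new_` the one (lazily) created by it.
    class VertexAccessor {
     public:
      // Operators planned with GraphView::OLD read the pre-command version,
      // so modifications made by the same query are invisible to them.
      // Operators that need the latest data ask for GraphView::NEW.
      const Properties &View(GraphView view) const {
        if (view == GraphView::NEW && new_) return *new_;
        return *old_;
      }

     private:
      const Properties *old_ = nullptr;
      Properties *new_ = nullptr;  // created on first modification
    };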
### Neo4j Inconsistency on Multiple SET Clauses

Considering the preceding example, it could be expected that when a query has
multiple `SET` clauses all the changes from the preceding ones are visible.
This is not the case in Neo4j's implementation. Consider the following
queries executed on an empty database:

    CREATE (n:A {x:0})-[:EdgeType]->(m:B {x:0})
    MATCH (n)--(m) SET n.x = n.x + 1 SET m.x = m.x * 2
    RETURN labels(n), n.x, labels(m), m.x

This returns:

+---------+---+---------+---+
|labels(n)|n.x|labels(m)|m.x|
+:=======:+:=:+:=======:+:=:+
|[A]      |2  |[B]      |1  |
+---------+---+---------+---+
|[B]      |1  |[A]      |2  |
+---------+---+---------+---+

If all the iterations of the first `SET` clause were executed before executing
the second, all the resulting values would be 2. This not being the case, we
conclude that Neo4j does not use a barrier-like mechanism between `SET`
clauses. It is Memgraph's current vision that this is inconsistent and we
plan to reduce Neo4j compliance in favour of operation consistency.

### Double Deletion

It's possible to match the same graph element multiple times in a single query
and delete it. Neo supports this, and so do we. The relevant implementation
detail is in the `GraphDbAccessor` class, where the record deletion functions
reside, and not in the logical plan execution. It comes down to checking if a
record has already been deleted in the current transaction+command and not
attempting to do it again (attempting to do so results in a crash).

### Set + Delete Edge-case

It's legal for a query to combine `SET` and `DELETE` clauses. Consider the
following queries executed on an empty database:

    CREATE ()-[:T]->()
    MATCH (n)--(m) SET n.x = 42 DETACH DELETE m

Due to the `MATCH` being undirected, the second pull will attempt to set data
on a deleted vertex. This is not a legal operation in the Memgraph storage
implementation. For that reason the logical operator for `SET` must check if
the record it's trying to set something on has been deleted by the current
transaction+command. If so, the modification is not executed.

### Deletion Accumulation

Sometimes it's necessary to accumulate deletions of all the matches before
attempting to execute them. Consider the following. Start with an empty
database and execute the queries:

    CREATE ()-[:T]->()-[:T]->()
    MATCH (a)-[r1]-(b)-[r2]-(c) DELETE r1, b, c

Note that the `DELETE` clause attempts to delete node `c`, but it does not
detach it by deleting edge `r2`. However, due to the undirected edge in the
`MATCH`, both edges get pulled and deleted.

Currently Memgraph does not support this behavior, while Neo does. There are
a few ways that we could do this.

  * Accumulate on deletion (undesirable because we have to keep track of
    everything that gets returned after the deletion).
  * Maybe we could stream through the deletion op, but defer actual deletion
    until plan-execution end.
  * Ignore this because it's a very obscure edge case (this is the currently
    selected option).

### Aggregation Without Input

It is necessary to define what aggregation ops return when they receive no
input. The following table shows what Neo4j's Cypher implementation and SQL
produce.
+-------------+------------------------+---------------------+---------------------+------------------+
| \           | 1. Cypher, no group-by | 2. Cypher, group-by | 3. SQL, no group-by | 4. SQL, group-by |
+=============+:======================:+:===================:+:===================:+:================:+
| Count(\*)   | 0                      | \                   | 0                   | \                |
+-------------+------------------------+---------------------+---------------------+------------------+
| Count(prop) | 0                      | \                   | 0                   | \                |
+-------------+------------------------+---------------------+---------------------+------------------+
| Sum         | 0                      | \                   | NULL                | \                |
+-------------+------------------------+---------------------+---------------------+------------------+
| Avg         | NULL                   | \                   | NULL                | \                |
+-------------+------------------------+---------------------+---------------------+------------------+
| Min         | NULL                   | \                   | NULL                | \                |
+-------------+------------------------+---------------------+---------------------+------------------+
| Max         | NULL                   | \                   | NULL                | \                |
+-------------+------------------------+---------------------+---------------------+------------------+
| Collect     | []                     | \                   | N/A                 | N/A              |
+-------------+------------------------+---------------------+---------------------+------------------+

Where:

 1. `MATCH (n) RETURN (n.prop)`
 2. `MATCH (n) RETURN (n.prop), (n.prop2)`
 3. `SELECT (prop) FROM Table`
 4. `SELECT (prop), prop2 FROM Table GROUP BY prop2`

Neo's Cypher implementation diverges from SQL only when performing `SUM`.
Memgraph implements SQL-like behavior. We consider that the `SUM` of
arbitrary elements should not be implicitly 0, especially in a property graph
without a strict schema (the property in question can contain values of
arbitrary types, or no values at all).

### OrderBy

The `OrderBy` logical operator sorts the results in the desired order. It
occurs in Cypher as part of a `WITH` or `RETURN` clause. Both the concept and
the implementation are straightforward. It's necessary for the logical op to
`Pull` everything from its input so it can be sorted. It's not necessary to
keep the whole `Frame` state of each input; it is sufficient to keep a list
of `TypedValues` on which the results will be sorted, and another list of
values that need to be remembered and recreated on the `Frame` when yielding.

The sorting itself is made to reflect that of Neo's implementation, which
comes down to the following points.

  * `Null` comes last (as if it's greater than anything).
  * Primitive types compare naturally, with no implicit casting except from
    `int` to `double`.
  * Complex types are not comparable.
  * Every unsupported comparison results in an exception that gets propagated
    to the end user.

### Limit in Write Queries

`Limit` can be used as part of a write query, in which case it will *not*
reduce the amount of performed updates. For example, consider a database that
has 10 vertices. The query `MATCH (n) SET n.x = 1 RETURN n LIMIT 3` will
result in all vertices having their property value changed, while returning
only the first 3 to the client. This makes sense from the implementation
standpoint, because `Accumulate` is planned after `SetProperty` but before
`Produce` and `Limit` operations. Note that this behavior can be
non-deterministic in some queries, since it relies on the order of iteration
over nodes, which is undefined when not explicitly specified.

### Merge

`MERGE` in Cypher attempts to match a pattern. If the pattern already exists,
`MERGE` does nothing, and subsequent clauses like `RETURN` can use the
matched pattern elements. If the pattern can't be matched to any data,
`MERGE` creates it.
For detailed information see Neo4j's [merge
documentation](https://neo4j.com/docs/developer-manual/current/cypher/clauses/merge/).

An important thing about `MERGE` is the visibility of modified data. `MERGE`
takes an input (typically a `MATCH`) and has two additional *phases*: the
merging part, and the subsequent set parts (`ON MATCH SET` and `ON CREATE
SET`). Analysis of Neo4j's behavior indicates that each of these three phases
(input, merge, set) does not see changes to the graph state done by a
subsequent phase. The input phase does not see data created by the merge
phase, nor by the set phase. This is consistent with what seems like the
general Cypher philosophy that query clause effects aren't visible in the
preceding clauses.

We define the `Merge` logical operator as a *routing* operator that uses three
logical operator branches.

 1. The input from a preceding clause.

    For example in `MATCH (n), (m) MERGE (n)-[:T]-(m)`. This input is
    optional because `MERGE` is allowed to be the first clause in a query.

 2. The `merge_match` branch.

    This logical operator branch is `Pull`-ed from until exhausted for each
    successful `Pull` from the input branch.

 3. The `merge_create` branch.

    This branch is `Pull`-ed when the `merge_match` branch does not match
    anything (no successful `Pull`s) for an input `Pull`. It is `Pull`-ed
    only once in such a situation, since only one creation needs to occur for
    a failed match.

The `ON MATCH SET` and `ON CREATE SET` parts of the `MERGE` clause are
included in the `merge_match` and `merge_create` branches respectively. They
are placed at the end of their branches so that they execute only when those
branches succeed.

Memgraph strives to be consistent with Neo in its `MERGE` implementation,
while at the same time keeping performance as good as possible. Consistency
with Neo w.r.t. graph state visibility is not trivial. The documentation for
`Expand` and `Set` describes how Memgraph keeps track of both the updated
version of an edge/vertex and the old one, as it was before the current
transaction+command. This technique is also used in `Merge`. The input
phase/branch of `Merge` always looks at the old data. The merge phase needs
to see the new data so it doesn't create more data than necessary.

For example, consider the following query.

    MATCH (p:Person) MERGE (c:City {name: p.lives_in})

This query needs to create a city node only once for each unique `p.lives_in`.

Finally, the set phase of a `MERGE` clause should not affect the merge phase.
To achieve this, the `merge_match` branch of the `Merge` operator should see
the latest created nodes, but filter them on their old state (if those nodes
were not created by the `merge_create` branch). Implementation-wise that
means that the `ScanAll` and `Expand` operators in the `merge_match` branch
need to look at the new graph state, while the `Filter` operators look at the
old one, if available.
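The routing described above can be sketched as the following simplified
cursor loop. The member names and the `Reset` call are assumptions made for
the sake of the example, not the exact implementation:

    #include <memory>

    struct Frame;

    struct Cursor {
      virtual ~Cursor() = default;
      virtual bool Pull(Frame &frame) = 0;
      virtual void Reset() = 0;  // rewind so the branch can be pulled again
    };

    // Simplified Merge routing: for every input row, exhaust merge_match;
    // if nothing matched, pull merge_create exactly once.
    class MergeCursor : public Cursor {
     public:
      bool Pull(Frame &frame) override {
        while (true) {
          if (pull_input_) {
            if (!input_->Pull(frame)) return false;  // input exhausted
            pull_input_ = false;
            matched_ = false;
            merge_match_->Reset();  // restart matching for this input row
          }
          if (merge_match_->Pull(frame)) {
            matched_ = true;
            return true;  // stream every successful match
          }
          pull_input_ = true;  // merge_match exhausted for this input row
          if (!matched_) {
            merge_create_->Pull(frame);  // create once for the failed match
            return true;
          }
        }
      }

      void Reset() override { pull_input_ = true; }

     private:
      std::unique_ptr<Cursor> input_, merge_match_, merge_create_;
      bool pull_input_ = true;
      bool matched_ = false;
    };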
diff --git a/docs/dev/query/parsing.md b/docs/dev/query/parsing.md
new file mode 100644
index 000000000..3db3c7f49
--- /dev/null
+++ b/docs/dev/query/parsing.md
@@ -0,0 +1,62 @@
# Lexical and Syntactic Analysis

## Antlr

We use Antlr for lexical and syntax analysis of Cypher queries. Antlr uses
the grammar file `Cypher.g4`, downloaded from http://www.opencypher.org, to
generate the parser and the visitor for the Cypher parse tree. Even though
the provided grammar is not very pleasant to work with, we decided not to
make any drastic changes to it so that our transition to newly published
versions of `Cypher.g4` would be easier. Nevertheless, we had to fix some
bugs and add features, so our version is not completely the same.

In addition to using `Cypher.g4`, we have `MemgraphCypher.g4`. This grammar
file defines Memgraph specific extensions to the original grammar. The most
notable example is the inclusion of syntax for handling authorization. At the
moment, some extensions are also found in `Cypher.g4`. For example, the syntax
for using a lambda function in relationship patterns. These extensions should
be moved out of `Cypher.g4`, so that it remains as close to the original
grammar as possible. Additionally, having `MemgraphCypher.g4` may not be
enough if we wish to split the functionality for community and enterprise
editions of Memgraph.

## Abstract Syntax Tree (AST)

Since the Antlr generated visitor and the official openCypher grammar are not
very practical to use, we translate Antlr's AST to our own AST. Currently
there are ~40 types of nodes in our AST. Their definitions can be found in
`src/query/frontend/ast/ast.lcp`.

Major groups of types can be found under the following base types.

  * `Expression` --- types corresponding to Cypher expressions.
  * `Clause` --- types corresponding to Cypher clauses.
  * `PatternAtom` --- node or edge related information.
  * `Query` --- different kinds of queries, allows extending the language with
    Memgraph specific query syntax.

Memory management of created AST nodes is done with `AstStorage`. Each node
must be created by invoking the `AstStorage::Create` method. This way all of
the pointers to nodes and their children are raw pointers. The only owner of
the allocated memory is the `AstStorage`. When the storage goes out of scope,
the pointers become invalid. It may be more natural to handle tree ownership
via `unique_ptr`, i.e. each node owning its children, but there are some
benefits to having a custom storage and allocation scheme.

The primary reason we opted against `unique_ptr` is the requirement of
Antlr's base visitor class that the resulting values must be copyable. The
result is wrapped in `antlr::Any` so that the derived visitor classes may
return any type they wish when visiting Antlr's AST. Unfortunately,
`antlr::Any` does not work with non-copyable types.

Another benefit of having `AstStorage` is that we can easily add a different
allocation scheme for AST nodes. The interface of node creation would not
change.

### AST Translation

The translation process is done via the `CypherMainVisitor` class, which is
derived from the Antlr generated visitor. Besides instantiating our AST
types, a minimal number of syntactic checks are done on a query. These checks
handle the cases which were valid in the original openCypher grammar, but may
be invalid when combined with other syntax elements.
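The ownership scheme described above can be condensed into a sketch like the
following. This is simplified; the real storage and node types are generated
from `ast.lcp`:

    #include <memory>
    #include <utility>
    #include <vector>

    struct Tree {  // common base of all AST node types
      virtual ~Tree() = default;
    };

    class AstStorage {
     public:
      // Every node is created through the storage, which keeps sole
      // ownership; callers receive raw pointers that are valid only while
      // the storage is alive.
      template <class TNode, class... TArgs>
      TNode *Create(TArgs &&...args) {
        nodes_.emplace_back(
            std::make_unique<TNode>(std::forward<TArgs>(args)...));
        return static_cast<TNode *>(nodes_.back().get());
      }

     private:
      std::vector<std::unique_ptr<Tree>> nodes_;
    };

Because the visitor only passes raw pointers around, the intermediate results
stay copyable and therefore compatible with `antlr::Any`.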
diff --git a/docs/dev/query/planning.md b/docs/dev/query/planning.md
new file mode 100644
index 000000000..07ac05224
--- /dev/null
+++ b/docs/dev/query/planning.md
@@ -0,0 +1,487 @@
# Logical Planning

After the semantic analysis and symbol generation, the AST is converted to a
tree of logical operators. This conversion is called *planning* and the tree
of logical operators is called a *plan*. The whole planning process is done in
the following steps.

 1. [AST Preprocessing](#ast-preprocessing)

    The first step is to preprocess the AST by collecting information on
    filters, dividing the query into parts, normalizing patterns in `MATCH`
    clauses, etc.

 2. [Logical Operator Planning](#logical-operator-planning)

    After the preprocessing step, the planning can be done via 2 planners:
    `VariableStartPlanner` and `RuleBasedPlanner`. The first planner will
    generate multiple plans where each plan has different starting points for
    searching the patterns in `MATCH` clauses. The second planner produces a
    single plan by mapping the query parts as they are to logical operators.

 3. [Logical Plan Postprocessing](#logical-plan-postprocessing)

    In this stage, we perform various transformations on the generated logical
    plan. Here we want to optimize the operations in order to improve
    performance during the execution. Naturally, transformations need to
    preserve the semantic behaviour of the original plan.

 4. [Cost Estimation](#cost-estimation)

    After the generation, the execution cost of each plan is estimated. This
    estimation is used to select the best plan which will be executed.

 5. [Distributed Planning](#distributed-planning)

    In case we are running distributed Memgraph, the final plan is adapted
    for distributed execution. NOTE: This appears to be an error in the
    workflow. Distributed planning should be moved before step 3. or
    integrated with it. With the workflow ordered as it is now, cost
    estimation doesn't consider the distributed plan.

The implementation can be found in the `query/plan` directory, with the public
entry point being `query/plan/planner.hpp`.

## AST Preprocessing

Each openCypher query consists of at least 1 **single query**. Multiple single
queries are chained together using a **query combinator**. Currently, there is
only one combinator, `UNION`. The preprocessing step starts in the
`CollectQueryParts` function. This function will take a look at each single
query and divide it into parts. Each part is separated by `RETURN` and
`WITH` clauses. For example:

    MATCH (n) CREATE (m) WITH m MATCH (l)-[]-(m) RETURN l
    |                         |                        |
    |------- part 1 ----------+-------- part 2 --------|
    |                                                  |
    |------------------- single query ----------------|

Each part is created by collecting all `MATCH` clauses and *normalizing* their
patterns. Pattern normalization is the process of converting an arbitrarily
long pattern chain of nodes and edges into a list of triplets `(start node,
edge, end node)`. The triplets should preserve the semantics of the match. For
example:

    MATCH (a)-[p]-(b)-[q]-(c)-[r]-(d)

is equivalent to:

    MATCH (a)-[p]-(b), (b)-[q]-(c), (c)-[r]-(d)

With this representation, it becomes easier to reorder the triplets and choose
different strategies for pattern matching.

In addition to normalizing patterns, all of the filter expressions in patterns
and inside of the `WHERE` clause (of the accompanying `MATCH`) are extracted
and stored separately. During the extraction, symbols used in the filter
expression are collected. This allows for planning filters in a valid order,
as the matching for triplets is being done. Another important benefit of
having extra information on filters is the ability to recognize when a
database index could be used.
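The normalization step can be sketched as follows; the types are hypothetical
stand-ins for the real pattern atom classes:

    #include <cstddef>
    #include <vector>

    struct NodeAtom;  // stand-ins for the real pattern atom types
    struct EdgeAtom;

    struct ExpansionTriplet {
      NodeAtom *node1;
      EdgeAtom *edge;   // null for a stand-alone node pattern
      NodeAtom *node2;  // null when edge is null
    };

    // A pattern is an alternating chain of N node atoms and N-1 edge atoms;
    // (a)-[p]-(b)-[q]-(c) becomes {(a, p, b), (b, q, c)}.
    std::vector<ExpansionTriplet> NormalizePattern(
        const std::vector<NodeAtom *> &nodes,
        const std::vector<EdgeAtom *> &edges) {
      std::vector<ExpansionTriplet> triplets;
      if (edges.empty()) {
        // A pattern like (a) is reduced to just its start node.
        triplets.push_back({nodes[0], nullptr, nullptr});
        return triplets;
      }
      for (std::size_t i = 0; i < edges.size(); ++i)
        triplets.push_back({nodes[i], edges[i], nodes[i + 1]});
      return triplets;
    }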
After each `MATCH` is processed, all of them are grouped, so that even whole
`MATCH` clauses may be reordered. The important thing is to remember which
symbols were used to name edges in each `MATCH`. With those symbols we can
plan for *cyphermorphism*, i.e. ensure that different edges in the search
pattern of a single `MATCH` map to different edges in the graph. This
preserves the semantics of the query, even though we may have reordered the
matching. The same steps are done for `OPTIONAL MATCH`.

Another clause which needs processing is `MERGE`. Here we normalize the
pattern, since `MERGE` is a bit like `MATCH` and `CREATE` in one.

All the other clauses are left as is.

In the end, each query part consists of:

  * processed and grouped `MATCH` clauses;
  * processed and grouped `OPTIONAL MATCH` clauses;
  * a processed `MERGE` matching pattern and
  * the unchanged remaining clauses.

The last stored clause is guaranteed to be either `WITH` or `RETURN`.

## Logical Operator Planning

### Variable Start Planner

The `VariableStartPlanner` generates multiple plans for a single query. Each
plan is generated by selecting a different starting point for pattern
matching.

The algorithm works as follows (a simplified sketch of the enumeration is
given after the list).

 1. For each query part:
    1. For each node in the triplets of the collected `MATCH` clauses:
       i.   Add the node to a set of `expanded` nodes
       ii.  Select a triplet `(start node, edge, end node)` whose `start
            node` is in the `expanded` set
       iii. If no triplet was selected, choose a new starting node that isn't
            in `expanded` and continue expanding
       iv.  Repeat steps ii. -- iii. until all triplets have been selected
            and store that as a variation of the `MATCH` clauses
    2. Do step 1.1. for `OPTIONAL MATCH` and `MERGE` clauses
    3. Take all combinations of the generated `MATCH`, `OPTIONAL MATCH` and
       `MERGE` variations and store them as variations of the query part.
 2. For each combination of query part variations:
    1. Generate a plan using the rule based planner
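Below is a toy version of the core idea, under heavy simplification: symbols
are plain integers, triplets are taken as directed, and only one ordering per
starting node is produced, whereas the real planner enumerates all
variations:

    #include <cstddef>
    #include <set>
    #include <vector>

    struct Triplet {
      int start, edge, end;  // symbols reduced to plain ints for the sketch
    };

    // For each candidate starting node, order the triplets so that every
    // next triplet begins at an already expanded node.
    std::vector<std::vector<Triplet>> OrderingsPerStart(
        const std::vector<Triplet> &triplets,
        const std::set<int> &start_nodes) {
      std::vector<std::vector<Triplet>> variations;
      for (int start : start_nodes) {
        std::set<int> expanded = {start};
        std::vector<Triplet> ordering;
        std::vector<bool> used(triplets.size(), false);
        bool progressed = true;
        while (progressed && ordering.size() < triplets.size()) {
          progressed = false;
          for (std::size_t i = 0; i < triplets.size(); ++i) {
            if (used[i] || !expanded.count(triplets[i].start)) continue;
            expanded.insert(triplets[i].end);
            ordering.push_back(triplets[i]);
            used[i] = true;
            progressed = true;
          }
          // A disconnected pattern would require choosing a fresh starting
          // node here; omitted for brevity.
        }
        variations.push_back(ordering);
      }
      return variations;
    }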
### Rule Based Planner

The `RuleBasedPlanner` generates a single plan for a single query. A plan is
generated by following hardcoded rules for producing logical operators. The
following sections are an overview of how each openCypher clause is converted
to a `LogicalOperator`.

#### MATCH

The `MATCH` clause is used to specify which patterns need to be searched for
in the database. These patterns are normalized in the preprocessing step to
be represented as triplets `(start node, edge, end node)`. When there is no
edge, the triplet is reduced to only the `start node`. Generating the
operators is done by looping over these triplets.

##### Searching for Nodes

The simplest search is finding stand-alone nodes. For example, `MATCH (n)`
will find all the nodes in the graph. This is accomplished by generating a
`ScanAll` operator and forwarding the node symbol which should store the
results. In this case, all the nodes will be referenced by `n`.

Multiple nodes can be specified in a single match, e.g. `MATCH (n), (m)`.
Planning is done by repeating the same steps for each sub pattern (separated
by a comma). In this case, we would get 2 `ScanAll` operators chained one
after the other. An optimization can be obtained if the node in the pattern is
already searched for. In `MATCH (n), (n)` we can drop the second `ScanAll`
operator since we have already generated it for the first node.

##### Searching for Relationships

A more advanced search includes finding nodes with relationships. For example,
`MATCH (n)-[r]-(m)` should find every pair of connected nodes in the database.
This means that if a single node has multiple connections, it will be
repeated for each combination of pairs. The generation of operators starts
from the first node in the pattern. If we are referencing a new starting
node, we need to generate a `ScanAll` which finds all the nodes and stores
them into `n`. Then, we generate an `Expand` operator which reads `n` and
traverses all the edges of that node. The edge is stored into `r`, while the
destination node is stored in `m`.

Matching multiple relationships proceeds similarly, by repeating the same
steps. The only difference is that we need to ensure that different edges in
the search pattern map to different edges in the graph. This means that after
each `Expand` operator, we need to generate an `ExpandUniquenessFilter`. We
provide this operator with a list of symbols for the previously matched edges
and the symbol for the current edge (a sketch of this check is given at the
end of this section).

For example:

    MATCH (n)-[r1]-(m)-[r2]-(l)

The above is preprocessed into

    MATCH (n)-[r1]-(m), (m)-[r2]-(l)

Then we look at each triplet in order and perform the described steps. This
way, we would generate:

    ScanAll (n) > Expand (n, r1, m) > Expand (m, r2, l) >
    ExpandUniquenessFilter ([r1], r2)

Note that we don't need to generate an `ExpandUniquenessFilter` after the
first `Expand`, since there are no edges to compare to. This filtering needs
to work across multiple patterns, but inside a *single* `MATCH` clause.

Let's take a look at the following.

    MATCH (n)-[r1]-(m), (m)-[r2]-(l)

We would also generate the exact same operators.

    ScanAll (n) > Expand (n, r1, m) > Expand (m, r2, l) >
    ExpandUniquenessFilter ([r1], r2)

On the other hand, for

    MATCH (n)-[r1]-(m) MATCH (m)-[r2]-(l)-[r3]-(i)

we would reset the uniqueness filtering at the start of the second `MATCH`.
This means that we output the following:

    ScanAll (n) > Expand (n, r1, m) > Expand (m, r2, l) > Expand (l, r3, i) >
    ExpandUniquenessFilter ([r2], r3)

There is a difference in how we handle edge uniqueness compared to Neo4j.
Neo4j does not allow searching for a single edge multiple times, but we've
decided to support that.

For example, the user can write the following.

    MATCH (n)-[r]-(m)-[r]-(l)

We would ensure that both `r` variables match the same edge. In our
terminology, we call this an *edge cycle*. For the above example, we would
generate this plan.

    ScanAll (n) > Expand (n, r, m) > Expand (m, r, l)

We do not put an `ExpandUniquenessFilter` operator between the 2 `Expand`
operators; instead, we tell the 2nd `Expand` that it is an edge cycle. This
2nd `Expand` will then ensure that both `r` variables matched the same edge.
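The check performed by `ExpandUniquenessFilter` is, in essence, the
following, with a hypothetical `EdgeRef` standing in for the edge value on
the frame:

    #include <vector>

    struct EdgeRef {  // hypothetical stand-in for an edge value on the frame
      long id;
    };
    bool operator==(EdgeRef a, EdgeRef b) { return a.id == b.id; }

    // The freshly expanded edge passes the filter only if it differs from
    // every previously matched edge of the same MATCH clause.
    bool EdgeIsUnique(EdgeRef candidate,
                      const std::vector<EdgeRef> &previous) {
      for (auto edge : previous)
        if (edge == candidate) return false;
      return true;
    }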
##### Filtering

To narrow the search down, the patterns in `MATCH` can have filtered labels
and properties. A more general filtering is done using the accompanying
`WHERE` clause. During the preprocessing step, all filters are collected and
extracted into expressions. Additional information on which symbols they use
is also stored. This way, each time we generate a `ScanAll` or `Expand`, we
look at all the filters to see if any of them can be used, i.e. whether the
symbols they use have been bound by a newly produced operator. If a filter
expression can be used, we immediately add a `Filter` operator with that
expression.

For example:

    MATCH (n)-[r]-(m :label) WHERE n.prop = 42

We would produce:

    ScanAll (n) > Filter (n.prop) > Expand (n, r, m) > Filter (m :label)

This means that the same plan is generated for the query:

    MATCH (n {prop: 42})-[r]-(m :label)

#### OPTIONAL

If a `MATCH` clause is preceded by `OPTIONAL`, then we need to generate a plan
such that we produce results even if we fail to match anything. This is
accomplished by generating an `Optional` operator, which takes 2 operator
trees:

  * the input operation and
  * the optional operation.

The input is the operation we generated for the part of the query before
`OPTIONAL MATCH`. For the optional operation, we simply generate the `OPTIONAL
MATCH` part just like we would for a regular `MATCH`. In addition to the
operations, we need to send the symbols which are set during optional
matching to the `Optional` operator. The operator will reset the values of
those symbols to `null` when the optional part fails to match.

#### RETURN & WITH

`RETURN` and `WITH` clauses are very similar to each other. The only
difference is that `WITH` separates parts of the query and can be paired with
a `WHERE` clause.

The common part is generating operators for the body of the clause. Separation
of query parts is mostly done in semantic analysis, which checks that only the
symbols exposed through `WITH` are visible in the query parts after the
clause. The minor part is done in planning.

##### Named Results

Both clauses contain multiple named expressions (`expr AS name`) which are
used to generate the `Produce` operator.

##### Aggregations

If an expression contains an aggregation operator (`sum`, `avg`, ...) we need
to plan the `Aggregate` operator as input to `Produce`. This case is more
complex, because aggregation in openCypher can perform implicit grouping of
results used for aggregation.

For example, `WITH/RETURN sum(n.x) AS s, n.y AS group` will implicitly group
by the `n.y` expression.

Another, more obscure grouping can be achieved with `RETURN sum(n.a) + n.b AS
s`. Here, `n.b` will be used for grouping, even though both the `sum` and
`n.b` are in the same named expression.

Therefore, we need to collect all expressions which do not contain
aggregations and use them for grouping (a toy model of this grouping is
sketched below). You may have noticed that in the last example `sum` is
actually a sub-expression of `+`. The `Aggregate` operator does not see that
(nor should it), so the responsibility of evaluating that falls on `Produce`.
One way would be for `Aggregate` to store the results of grouping expressions
on the frame in addition to the aggregation results. Unfortunately, this
would require rewiring named expressions in `Produce` to reference already
evaluated expressions. In the current implementation, we opted for
`Aggregate` to store only aggregation results on the frame, while `Produce`
will re-evaluate all the other (grouping) expressions. To handle that,
symbols which are used in expressions are passed to `Aggregate`, so that they
can be remembered. `Produce` will read those symbols from the frame and use
them to re-evaluate the needed expressions.
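As a toy model of the implicit grouping, consider keeping per-group
aggregation state keyed by the evaluated grouping expressions. This is an
illustration only; the real `TypedValue` and the various aggregation kinds
are simplified away:

    #include <map>
    #include <vector>

    using TypedValue = double;                 // stand-in for TypedValue
    using GroupKey = std::vector<TypedValue>;  // evaluated grouping exprs

    struct SumState {
      TypedValue sum = 0;
    };

    // For sum(n.x) grouped by n.y: key = {value of n.y}, input = value of
    // n.x. Each pulled row folds its input into the state of its group.
    void Accumulate(std::map<GroupKey, SumState> &groups, const GroupKey &key,
                    TypedValue aggregated_input) {
      groups[key].sum += aggregated_input;
    }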
##### Accumulation

After we have `Produce` and potentially `Aggregate`, we need to handle a
special case when the part of the query before `RETURN` or `WITH` performs
updates. For that, we want to run that part of the query fully, so that we get
the latest results. This is accomplished by adding the `Accumulate` operator
as input to `Aggregate` or `Produce` (if there is no aggregation).
Accumulation will store all the values of all the symbols used inside
`RETURN` and `WITH`, so that they can be used in the operator which follows.
This way, only parts of the frame are copied, instead of the whole frame.
There is a minor difference between planning `WITH` and `RETURN`. Since
`WITH` can separate writing from reading, we need to advance the transaction
command. This enables the later, read parts of the query to obtain the newest
changes. This is supported by passing the `advance_command` flag to the
`Accumulate` operator.

In the simplest case, common to both clauses, we have `Accumulate > Aggregate
> Produce` operators, where `Accumulate` and `Aggregate` may be left out.

##### Ordering

Planning `ORDER BY` is simple enough. Since it may see new symbols (filled in
`Produce`), we add the `OrderBy` operator at the end. The operator will change
the order of produced results, so we pass it the ordering expressions and the
output symbols of named expressions.

##### Filtering

A final difference in `WITH` is when it contains a `WHERE` clause. For that,
we simply generate the `Filter` operator, appended after `Produce` or
`OrderBy` (depending on which operator comes last).

##### Skipping and Limiting

If we have `SKIP` or `LIMIT`, we generate `Skip` or `Limit` operators,
respectively. These operators are put at the end of the clause.

This placement may have some unexpected behaviour when combined with
operations that update the graph. For example:

    MATCH (n) SET n.x = n.x + 1 RETURN n LIMIT 1

The above query may be interpreted as if the `SET` will be done only once.
Since this is a write query, we need to accumulate results, so the part before
`RETURN` will execute completely. The accumulated results will be yielded up
to the given limit, and the user would get only the first `n` that was
updated. This may confuse the user because in reality every node in the
database has been updated.

Note that `Skip` always comes before `Limit`. In the current implementation,
they are generated directly one after the other.

#### CREATE

The `CREATE` clause is used to create nodes and edges (relationships).

For multiple `CREATE` clauses, or multiple creation patterns in a single
clause, we perform the following same steps for each.

##### Creating a Single Node

A node is created by simply specifying a node pattern.

For example, `CREATE (n :label {property: "value"}), ()` would create 2
nodes. The 1st one would be created with a label and a property. This node
could be referenced later in the query, by using the variable `n`. The 2nd
node cannot be referenced and it would be created without any labels nor
properties. For node creation, we generate a `CreateNode` operator and pass
it all the details of node creation: the variable symbol, labels and
properties. In the mentioned example, we would have `CreateNode >
CreateNode`.

##### Creating a Relationship

To create a relationship, the `CREATE` clause must contain a pattern with a
directed edge. Compared to creating a single node, this case is a bit more
complicated, because either endpoint of the edge may not yet exist. By exist,
we mean that the endpoint is a variable which already references a node.

For example, `MATCH (n) CREATE (n)-[r]->(m)` would create an edge `r` and a
node `m` for each matched node `n`. If we focus on the `CREATE` part, we
generate `CreateExpand (n, r, m)` where `n` already exists (refers to a
matched node) and `m` would be newly created along with the edge `r`.
If we had only `CREATE (n)-[r]->(m)`, then we would need to create both nodes
of the edge `r`. This is done by generating `CreateNode (n) > CreateExpand
(n, r, m)`. The final case is when both endpoints refer to an existing node.
For example, when adding a node with a cyclical connection `CREATE
(n)-[r]->(n)`. In this case, we would generate `CreateNode (n) > CreateExpand
(n, r, n)`. We would tell `CreateExpand` to only create the edge `r` between
the already created `n`.

#### MERGE

Although the merge operation is complex, planning turns out to be relatively
simple. The pattern inside the `MERGE` clause is used for both matching and
creating. Therefore, we create 2 operator trees, one for each action.

For example:

    MERGE (n)-[r:r]-(m)

We would generate a single `Merge` operator which has the following.

  * No input operation (since it is not preceded by any other clause).

  * On match operation

    `ScanAll (n) > Expand (n, r, m) > Filter (r)`

  * On create operation

    `CreateNode (n) > CreateExpand (n, r, m)`

In cases when `MERGE` contains `ON MATCH` and `ON CREATE` parts, we simply
append their operations to the respective operator trees.

Observe the following example.

    MERGE (n)-[r:r]-(m) ON MATCH SET n.x = 42 ON CREATE SET m :label

The `Merge` would be generated with the following.

  * No input operation (again, since there is no clause preceding it).

  * On match operation

    `ScanAll (n) > Expand (n, r, m) > Filter (r) > SetProperty (n.x, 42)`

  * On create operation

    `CreateNode (n) > CreateExpand (n, r, m) > SetLabels (m, :label)`

When we have preceding clauses, we simply put their operator as input to
`Merge`.

    MATCH (n) MERGE (n)-[r:r]-(m)

The above would be generated as

    ScanAll (n) > Merge (on_match_operation, on_create_operation)

Here we need to be careful to recognize which symbols are already declared.
But, since the `on_match_operation` uses the same algorithm for generating a
`Match`, that problem is handled there. The same holds for
`on_create_operation`, which uses the process of generating a `Create`. So,
finally, for this example the `Merge` would have:

  * Input operation

    `ScanAll (n)`

  * On match operation

    `Expand (n, r, m) > Filter (r)`

    Note that `ScanAll` is not needed since we get the nodes from the input.

  * On create operation

    `CreateExpand (n, r, m)`

    Note that `CreateNode` is dropped, since we want to expand the existing
    one.

## Logical Plan Postprocessing

NOTE: TODO

## Cost Estimation

NOTE: TODO

## Distributed Planning

NOTE: TODO

diff --git a/docs/dev/query/semantic.md b/docs/dev/query/semantic.md
new file mode 100644
index 000000000..ff10cbb5e
--- /dev/null
+++ b/docs/dev/query/semantic.md
@@ -0,0 +1,134 @@
# Semantic Analysis and Symbol Generation

In this phase, various semantic and variable type checks are performed.
Additionally, we generate symbols which map AST nodes to stored values
computed from evaluated expressions.

## Symbol Generation

The implementation can be found in
`query/frontend/semantic/symbol_generator.cpp`.

Symbols are generated for each AST node that represents data that needs to
have storage. Currently, these are:

  * `NamedExpression`
  * `CypherUnion`
  * `Identifier`
  * `Aggregation`

You may notice that the above AST nodes may not correspond to something named
by a user. For example, `Aggregation` can be a part of a larger expression
and thus remain unnamed.
The reason we still generate symbols is to have a uniform behaviour when
executing a query, as well as to allow caching the results of expression
evaluation.

AST nodes do not actually store a `Symbol` instance; instead they have an
`int32_t` index identifying the symbol in the `SymbolTable` class. This is
done to minimize the size of AST types as well as to allow easier sharing of
the same symbols between multiple instances of AST nodes.

The storage for evaluated data is represented by the `Frame` class. Each
symbol determines a unique position in the frame. During interpretation,
evaluation of expressions which have a symbol will either read or store
values in the frame. For example, an instance of `Identifier` will use the
symbol to find and read the value from the `Frame`. On the other hand,
`NamedExpression` will take the result of evaluating its own expression and
store it in the `Frame`.

When a symbol is created, the context of creation is used to assign a type to
that symbol. This type is used for simple type checking operations. For
example, `MATCH (n)` will create a symbol for the variable `n`. Since `MATCH
(n)` represents finding a vertex in the graph, we can set
`Symbol::Type::Vertex` for that symbol. Later, for example in
`MATCH ()-[n]-()`, we may see that the variable `n` is used as an edge. Since
we already have a symbol for that variable, we detect this type mismatch and
raise a `SemanticException`.

The basic rule of symbol generation is that variables inside `MATCH`,
`CREATE`, `MERGE`, `WITH ... AS` and `RETURN ... AS` clauses establish new
symbols.

### Symbols in Patterns

Inside `MATCH`, symbols are created only if they didn't exist before. For
example, the patterns in `MATCH (n {a: 5})--(m {b: 5}) RETURN n, m` will
create 2 symbols: one for `n` and one for `m`. The `RETURN` clause will, in
turn, reference those symbols. Symbols established in a part of a pattern are
immediately bound and visible in later parts. For example, `MATCH (n)--(n)`
will create a symbol for the variable `n` in the 1st `(n)`. That symbol is
then referenced by the 2nd `(n)`. Note that the symbol is not bound inside
the 1st `(n)` itself. What this means is that, for example, `MATCH (n {a:
n.b})` should raise an error, because `n` is not yet bound when encountering
`n.b`. On the other hand, `MATCH (n)--(n {a: n.b})` is fine.

`CREATE` is similar to `MATCH`, but it *always* establishes symbols for
variables which create graph elements. What this means is that, for example,
`MATCH (n) CREATE (n)` is not allowed. `CREATE` wants to create a new node,
for which we already have a symbol. In such a case, we need to throw an error
that the variable `n` is being redeclared. On the other hand, `MATCH (n)
CREATE (n)-[r :r]->(n)` is fine, because `CREATE` will only create the edge
`r`, connecting the already existing node `n`. The remaining behaviour is the
same as in `MATCH`. This means that we can simplify `CREATE` to be like
`MATCH` with 2 special cases.

 1. Are we creating a node, i.e. `CREATE (n)`? If yes, then the symbol for
    `n` must not have been created before. Otherwise, we reference the
    existing symbol.
 2. Are we creating an edge, i.e. do we encounter a variable for an edge
    inside `CREATE`? If yes, then that variable must not reference a symbol.

The `MERGE` clause is treated the same as `CREATE` with regards to symbol
generation. The only difference is that we allow bidirectional edges in the
pattern. When creating such a pattern, the direction of the created edge is
arbitrarily determined.
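The relationship between symbols, the symbol table and the frame can be
sketched as follows. These are hypothetical, condensed versions of the real
classes:

    #include <string>
    #include <vector>

    // Condensed sketch: an AST node stores only `position`, which
    // identifies the symbol in the SymbolTable and doubles as the node's
    // slot in the Frame.
    struct Symbol {
      std::string name;
      int position;
    };

    class SymbolTable {
     public:
      const Symbol &CreateSymbol(const std::string &name) {
        symbols_.push_back({name, static_cast<int>(symbols_.size())});
        return symbols_.back();
      }
      const Symbol &At(int position) const { return symbols_[position]; }

     private:
      std::vector<Symbol> symbols_;
    };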
### Symbols in WITH and RETURN

In addition to patterns, new symbols are established in the `WITH` clause.
This clause makes the new symbols visible *only* to the rest of the query.
For example, `MATCH (old) WITH old AS new RETURN new, old` should raise an
error that `old` is unbound inside `RETURN`.

There is a special case with symbol visibility in `WHERE` and `ORDER BY`. They
need to see both the old and the new symbols. Therefore, `MATCH (old) RETURN
old AS new ORDER BY old.prop` needs to work. On the other hand, if we perform
aggregations inside `WITH` or `RETURN`, then the old symbols should not be
visible in either `WHERE` or `ORDER BY`. Since the aggregation has to go
through all the results in order to generate the final value, it makes no
sense to store the old symbols and their values. A query like `MATCH (old)
WITH SUM(old.prop) AS sum WHERE old.prop = 42 RETURN sum` needs to raise an
error that `old` is unbound inside `WHERE`.

For cases when `SKIP` and `LIMIT` appear, we disallow any identifiers from
appearing in their expressions. Basically, `SKIP` and `LIMIT` can only be
constant expressions[^1]. For example, `MATCH (old) RETURN old AS new SKIP
new.prop` needs to raise an error that variables are not allowed in `SKIP`.
It makes no sense to allow variables, since their values may vary on each
iteration. On the other hand, we could support variables bound to constant
expressions, but for simplicity we do not. For example, `MATCH (old) RETURN
old, 2 AS limit_var LIMIT limit_var` would still throw an error.

Finally, we generate symbols for names created in the `RETURN` clause. These
symbols are used for the final results of a query.

NOTE: New symbols in `WITH` and `RETURN` must be unique. This means that
`WITH a AS same, b AS same` is not allowed, and neither is a construct like
`RETURN 2, 2`.

### Symbols in Functions which Establish New Scope

Symbols can also be created in some functions. These functions usually take an
expression, bind a single variable and run the expression inside the newly
established scope.

The `all` function takes a list, creates a variable for the list element and
runs the predicate expression. For example:

    MATCH (n) RETURN n, all(n IN n.prop_list WHERE n < 42)

We create a new symbol for use inside `all`. This means that in `WHERE n <
42` the `n` takes its values from the elements of `n.prop_list`. The original
`n` bound by `MATCH` is not visible inside the `all` function, but it is
visible outside. Therefore, the `RETURN n` and `n.prop_list` reference the
`n` from `MATCH`.

[^1]: Constant expressions are expressions for which the result can be
      computed at compile time.