Teon Banek 6f10b1c115 Move query implementation from Phriction to this repo

Summary:
Our query parsing, planning and execution architecture was described on
Phabricator wiki pages, Phriction. This commit copies the said
documentation here, so that it's easier to access for all developers.
Additional benefit is tracking the changes and hopefully suggesting to
developers to keep it up to date.

Besides making a copy, the documentation has been updated to reflect the
current state of the codebase. Note that some things are still missing,
but what was written should now be correct.

Reviewers: mtomic, llugovic

Reviewed By: mtomic

Subscribers: pullbot

Differential Revision: https://phabricator.memgraph.io/D1854

2019-02-15 16:58:39 +01:00

6.7 KiB

Raw Blame History

Semantic Analysis and Symbol Generation

In this phase, various semantic and variable type checks are performed. Additionally, we generate symbols which map AST nodes to stored values computed from evaluated expressions.

Symbol Generation

Implementation can be found in query/frontend/semantic/symbol_generator.cpp.

Symbols are generated for each AST node that represents data that needs to have storage. Currently, these are:

NamedExpression
CypherUnion
Identifier
Aggregation

You may notice that the above AST nodes may not correspond to something named by a user. For example, Aggregation can be a part of larger expression and thus remain unnamed. The reason we still generate symbols is to have a uniform behaviour when executing a query as well as allow for caching the results of expression evaluation.

AST nodes do not actually store a Symbol instance, instead they have a int32_t index identifying the symbol in the SymbolTable class. This is done to minimize the size of AST types as well as allow easier sharing of same symbols with multiple instances of AST nodes.

The storage for evaluated data is represented by the Frame class. Each symbol determines a unique position in the frame. During interpretation, evaluation of expressions which have a symbol will either read or store values in the frame. For example, instance of an Identifier will use the symbol to find and read the value from Frame. On the other hand, NamedExpression will take the result of evaluating its own expression and store it in the Frame.

When a symbol is created, context of creation is used to assign a type to that symbol. This type is used for simple type checking operations. For example, MATCH (n) will create a symbol for variable n. Since the MATCH (n) represents finding a vertex in the graph, we can set Symbol::Type::Vertex for that symbol. Later, for example in MATCH ()-[n]-() we see that variable n is used as an edge. Since we already have a symbol for that variable, we detect this type mismatch and raise a SemanticException.

Basic rule of symbol generation, is that variables inside MATCH, CREATE, MERGE, WITH ... AS and RETURN ... AS clauses establish new symbols.

Symbols in Patterns

Inside MATCH, symbols are created only if they didn't exist before. For example, patterns in MATCH (n {a: 5})--(m {b: 5}) RETURN n, m will create 2 symbols: one for n and one for m. RETURN clause will, in turn, reference those symbols. Symbols established in a part of pattern are immediately bound and visible in later parts. For example, MATCH (n)--(n) will create a symbol for variable n for 1st (n). That symbol is referenced in 2nd (n). Note that the symbol is not bound inside 1st (n) itself. What this means is that, for example, MATCH (n {a: n.b}) should raise an error, because n is not yet bound when encountering n.b. On the other hand, MATCH (n)--(n {a: n.b}) is fine.

The CREATE is similar to MATCH, but it always establishes symbols for variables which create graph elements. What this means is that, for example MATCH (n) CREATE (n) is not allowed. CREATE wants to create a new node, for which we already have a symbol. In such a case, we need to throw an error that the variable n is being redeclared. On the other hand MATCH (n) CREATE (n)-[r :r]->(n) is fine, because CREATE will only create the edge r, connecting the already existing node n. Remaining behaviour is the same as in MATCH. This means that we can simplify CREATE to be like MATCH with 2 special cases.

Are we creating a node, i.e. CREATE (n)? If yes, then the symbol for n must not have been created before. Otherwise, we reference the existing symbol.
Are we creating an edge, i.e. we encounter a variable for an edge inside CREATE? If yes, then that variable must not reference a symbol.

The MERGE clause is treated the same as CREATE with regards to symbol generation. The only difference is that we allow bidirectional edges in the pattern. When creating such a pattern, the direction of the created edge is arbitrarily determined.

Symbols in WITH and RETURN

In addition to patterns, new symbols are established in the WITH clause. This clause makes the new symbols visible only to the rest of the query. For example, MATCH (old) WITH old AS new RETURN new, old should raise an error that old is unbound inside RETURN.

There is a special case with symbol visibility in WHERE and ORDER BY. They need to see both the old and the new symbols. Therefore MATCH (old) RETURN old AS new ORDER BY old.prop needs to work. On the other hand, if we perform aggregations inside WITH or RETURN, then the old symbols should not be visible neither in WHERE nor in ORDER BY. Since the aggregation has to go through all the results in order to generate the final value, it makes no sense to store old symbols and their values. A query like MATCH (old) WITH SUM(old.prop) AS sum WHERE old.prop = 42 RETURN sum needs to raise an error that old is unbound inside WHERE.

For cases when SKIP and LIMIT appear, we disallow any identifiers from appearing in their expressions. Basically, SKIP and LIMIT can only be constant expressions¹. For example, MATCH (old) RETURN old AS new SKIP new.prop needs to raise that variables are not allowed in SKIP. It makes no sense to allow variables, since their values may vary on each iteration. On the other hand, we could support variables to constant expressions, but for simplicity we do not. For example, MATCH (old) RETURN old, 2 AS limit_var LIMIT limit_var would still throw an error.

Finally, we generate symbols for names created in RETURN clause. These symbols are used for the final results of a query.

NOTE: New symbols in WITH and RETURN should be unique. This means that WITH a AS same, b AS same is not allowed, neither is a construct like RETURN 2, 2

Symbols in Functions which Establish New Scope

Symbols can also be created in some functions. These functions usually take an expression, bind a single variable and run the expression inside the newly established scope.

The all function takes a list, creates a variable for list element and runs the predicate expression. For example:

MATCH (n) RETURN n, all(n IN n.prop_list WHERE n < 42)

We create a new symbol for use inside all, this means that the WHERE n < 42 uses the n which takes values from a n.prop_list elements. The original n bound by MATCH is not visible inside the all function, but it is visible outside. Therefore, the RETURN n and n.prop_list reference the n from MATCH.

Constant expressions are expressions for which the result can be computed at compile time. ↩︎

6.7 KiB Raw Blame History