# Semantic Analysis and Symbol Generation

In this phase, various semantic and variable type checks are performed.
Additionally, we generate symbols which map AST nodes to stored values
computed from evaluated expressions.

## Symbol Generation

Implementation can be found in `query/frontend/semantic/symbol_generator.cpp`.

Symbols are generated for each AST node that represents data that needs to
have storage. Currently, these are:

  * `NamedExpression`
  * `CypherUnion`
  * `Identifier`
  * `Aggregation`

You may notice that the above AST nodes may not correspond to something named
by a user. For example, `Aggregation` can be a part of a larger expression and
thus remain unnamed. We still generate symbols for such nodes in order to have
uniform behaviour when executing a query, as well as to allow caching the
results of expression evaluation.

AST nodes do not actually store a `Symbol` instance. Instead, they hold an
`int32_t` index identifying the symbol in the `SymbolTable` class. This
minimizes the size of AST types and makes it easier to share the same symbol
among multiple AST node instances.
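
The indirection can be pictured with a minimal Python sketch. It is not the
C++ implementation: the `Symbol`, `SymbolTable` and `Identifier` classes below
are simplified stand-ins, and the names `create_symbol`, `at` and `symbol_pos`
are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Symbol:
    name: str
    position: int  # index of this symbol in the table (hypothetical field)


@dataclass
class SymbolTable:
    symbols: list = field(default_factory=list)

    def create_symbol(self, name):
        # Append a new symbol and hand back only its index.
        symbol = Symbol(name, len(self.symbols))
        self.symbols.append(symbol)
        return symbol.position

    def at(self, index):
        return self.symbols[index]


@dataclass
class Identifier:
    name: str
    symbol_pos: int = -1  # index into SymbolTable, not a Symbol instance


table = SymbolTable()
pos = table.create_symbol("n")
# Two AST nodes for the same variable share one symbol through its index.
first_use = Identifier("n", pos)
second_use = Identifier("n", pos)
```

Both `Identifier` instances stay small because they carry a single integer,
and looking up either index in the table yields the same `Symbol`.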

The storage for evaluated data is represented by the `Frame` class. Each
symbol determines a unique position in the frame. During interpretation,
evaluating an expression which has a symbol will either read a value from the
frame or store one in it. For example, an instance of `Identifier` will use
its symbol to find and read the value from the `Frame`. On the other hand, a
`NamedExpression` will take the result of evaluating its own expression and
store it in the `Frame`.
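
A minimal Python sketch of this read/store split follows. The `Frame` class
here is a simplified model, and the helper functions
`evaluate_identifier` and `evaluate_named_expression` are hypothetical names,
not functions from the codebase.

```python
class Frame:
    """Fixed-size storage with one slot per symbol (simplified model)."""

    def __init__(self, size):
        self.slots = [None] * size

    def __getitem__(self, position):
        return self.slots[position]

    def __setitem__(self, position, value):
        self.slots[position] = value


def evaluate_identifier(frame, symbol_pos):
    # An Identifier only reads the value stored for its symbol.
    return frame[symbol_pos]


def evaluate_named_expression(frame, symbol_pos, expression):
    # A NamedExpression evaluates its expression and stores the result.
    value = expression(frame)
    frame[symbol_pos] = value
    return value


# Assumed symbol positions for `WITH 20 AS x RETURN x + 1 AS y`.
X, Y = 0, 1
frame = Frame(2)
evaluate_named_expression(frame, X, lambda f: 20)
evaluate_named_expression(frame, Y, lambda f: evaluate_identifier(f, X) + 1)
```

After both named expressions are evaluated, slot `X` holds 20 and slot `Y`
holds 21, because the second expression read `x` back from the frame.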

When a symbol is created, the context of its creation is used to assign a type
to that symbol. This type is used for simple type checking operations. For
example, `MATCH (n)` will create a symbol for variable `n`. Since `MATCH (n)`
represents finding a vertex in the graph, we can set `Symbol::Type::Vertex`
for that symbol. If we later see, for example, `MATCH ()-[n]-()`, variable
`n` is being used as an edge. Since we already have a symbol for that
variable, we detect this type mismatch and raise a `SemanticException`.
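
The check can be sketched in Python as follows. This is a simplified model
under stated assumptions: only the vertex type is named in the text above, so
the `EDGE` member, the `TypeChecker` class and the error message wording are
all hypothetical.

```python
from enum import Enum


class SymbolType(Enum):
    VERTEX = "Vertex"
    EDGE = "Edge"  # assumed by analogy with Symbol::Type::Vertex


class SemanticException(Exception):
    pass


class TypeChecker:
    def __init__(self):
        self.types = {}  # variable name -> SymbolType

    def use(self, name, expected):
        known = self.types.get(name)
        if known is None:
            # First occurrence: the context of creation decides the type.
            self.types[name] = expected
        elif known is not expected:
            raise SemanticException(
                f"Type mismatch: {name} already defined as {known.value}, "
                f"expected {expected.value}"
            )


checker = TypeChecker()
checker.use("n", SymbolType.VERTEX)  # MATCH (n)
try:
    checker.use("n", SymbolType.EDGE)  # MATCH ()-[n]-()
    mismatch_detected = False
except SemanticException:
    mismatch_detected = True
```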

The basic rule of symbol generation is that variables inside `MATCH`, `CREATE`,
`MERGE`, `WITH ... AS` and `RETURN ... AS` clauses establish new symbols.

### Symbols in Patterns

Inside `MATCH`, symbols are created only if they didn't exist before. For
example, the pattern in `MATCH (n {a: 5})--(m {b: 5}) RETURN n, m` will create
2 symbols: one for `n` and one for `m`. The `RETURN` clause will, in turn,
reference those symbols. Symbols established in one part of a pattern are
immediately bound and visible in later parts. For example, `MATCH (n)--(n)`
will create a symbol for variable `n` at the first `(n)`. That symbol is then
referenced in the second `(n)`. Note that the symbol is not bound inside the
first `(n)` itself. What this means is that, for example,
`MATCH (n {a: n.b})` should raise an error, because `n` is not yet bound when
encountering `n.b`. On the other hand, `MATCH (n)--(n {a: n.b})` is fine.
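
The left-to-right binding order can be modelled with a short Python sketch.
The function `bind_match_pattern`, the error class and the tuple encoding of
pattern nodes are hypothetical simplifications, not the visitor used in
`symbol_generator.cpp`.

```python
class UnboundVariableError(Exception):
    pass


def bind_match_pattern(nodes, bound=None):
    """Process pattern nodes left to right and return the bound names.

    Each node is a (variable, referenced_variables) pair, where
    referenced_variables are the names used in the node's property map.
    """
    bound = set(bound or ())
    for variable, referenced in nodes:
        for name in referenced:
            if name not in bound:
                raise UnboundVariableError(f"Unbound variable: {name}")
        # The variable becomes visible only after its own node is processed.
        bound.add(variable)
    return bound


# MATCH (n)--(n {a: n.b}): `n` is bound by the first node.
ok = bind_match_pattern([("n", []), ("n", ["n"])])

# MATCH (n {a: n.b}): `n` is referenced before it is bound.
try:
    bind_match_pattern([("n", ["n"])])
    failed = False
except UnboundVariableError:
    failed = True
```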

The `CREATE` clause is similar to `MATCH`, but it *always* establishes symbols
for variables which create graph elements. What this means is that, for
example, `MATCH (n) CREATE (n)` is not allowed. `CREATE` wants to create a new
node, for which we already have a symbol. In such a case, we need to throw an
error that the variable `n` is being redeclared. On the other hand,
`MATCH (n) CREATE (n)-[r :r]->(n)` is fine, because `CREATE` will only create
the edge `r`, connecting the already existing node `n`. The remaining
behaviour is the same as in `MATCH`. This means that we can simplify `CREATE`
to be like `MATCH` with 2 special cases.

  1. Are we creating a node, i.e. `CREATE (n)`? If yes, then the symbol for
     `n` must not have been created before. Otherwise, we reference the
     existing symbol.
  2. Are we creating an edge, i.e. we encounter a variable for an edge inside
     `CREATE`? If yes, then that variable must not reference an existing
     symbol.
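
The two special cases can be sketched in Python. This is a rough model, not
the actual implementation: `bind_create_pattern` is a hypothetical function,
and treating a pattern that consists of a single node and no edges as "node
creation" is an assumption based on the examples above.

```python
class RedeclareVariableError(Exception):
    pass


def bind_create_pattern(node_vars, edge_vars, bound):
    """Return the set of bound names after a CREATE pattern.

    node_vars and edge_vars list the variables in the pattern; bound holds
    the names that already have a symbol.
    """
    bound = set(bound)
    creates_single_node = len(node_vars) == 1 and not edge_vars
    if creates_single_node and node_vars[0] in bound:
        # Case 1: CREATE (n) must introduce a new symbol for n.
        raise RedeclareVariableError(node_vars[0])
    # Other node variables behave as in MATCH: reuse or create.
    bound.update(node_vars)
    for edge in edge_vars:
        if edge in bound:
            # Case 2: an edge variable in CREATE is always a new symbol.
            raise RedeclareVariableError(edge)
        bound.add(edge)
    return bound


# MATCH (n) CREATE (n)-[r :r]->(n): only the edge `r` is new.
allowed = bind_create_pattern(["n", "n"], ["r"], {"n"})

# MATCH (n) CREATE (n): `n` would be redeclared.
try:
    bind_create_pattern(["n"], [], {"n"})
    rejected = False
except RedeclareVariableError:
    rejected = True
```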

The `MERGE` clause is treated the same as `CREATE` with regard to symbol
generation. The only difference is that we allow bidirectional edges in the
pattern. When such a pattern leads to creating an edge, the direction of the
created edge is chosen arbitrarily.

### Symbols in WITH and RETURN

In addition to patterns, new symbols are established in the `WITH` clause.
This clause makes *only* the new symbols visible to the rest of the query.
For example, `MATCH (old) WITH old AS new RETURN new, old` should raise an
error that `old` is unbound inside `RETURN`.

There is a special case with symbol visibility in `WHERE` and `ORDER BY`. They
need to see both the old and the new symbols. Therefore,
`MATCH (old) RETURN old AS new ORDER BY old.prop` needs to work. On the other
hand, if we perform aggregations inside `WITH` or `RETURN`, then the old
symbols should be visible in neither `WHERE` nor `ORDER BY`. Since the
aggregation has to go through all the results in order to generate the final
value, it makes no sense to store old symbols and their values. A query like
`MATCH (old) WITH SUM(old.prop) AS sum WHERE old.prop = 42 RETURN sum` needs
to raise an error that `old` is unbound inside `WHERE`.
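
The visibility rules above can be summarized with a small Python sketch. The
function `visible_after_projection` and its parameters are hypothetical; it
only models which names are in scope, not how the symbol generator tracks
them.

```python
def visible_after_projection(old_symbols, new_symbols, has_aggregation):
    """Model symbol visibility after a WITH or RETURN projection.

    Returns a pair of sets: the names visible in the attached WHERE and
    ORDER BY, and the names visible to the rest of the query.
    """
    rest_of_query = set(new_symbols)
    if has_aggregation:
        # Aggregation consumes all input rows, so old symbols are dropped.
        local = set(new_symbols)
    else:
        local = set(old_symbols) | set(new_symbols)
    return local, rest_of_query


# MATCH (old) RETURN old AS new ORDER BY old.prop
local_plain, rest_plain = visible_after_projection({"old"}, {"new"}, False)

# MATCH (old) WITH SUM(old.prop) AS sum WHERE old.prop = 42 RETURN sum
local_agg, rest_agg = visible_after_projection({"old"}, {"sum"}, True)
```

In the first query `old` is still in `local_plain`, so `ORDER BY old.prop`
resolves. In the second query `old` is missing from `local_agg`, which
corresponds to the unbound variable error in `WHERE`.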

When `SKIP` and `LIMIT` appear, we disallow any identifiers from appearing in
their expressions. Basically, `SKIP` and `LIMIT` can only be constant
expressions[^1]. For example,
`MATCH (old) RETURN old AS new SKIP new.prop` needs to raise an error that
variables are not allowed in `SKIP`. It makes no sense to allow variables,
since their values may vary on each iteration. We could support variables
bound to constant expressions, but for simplicity we do not. For example,
`MATCH (old) RETURN old, 2 AS limit_var LIMIT limit_var` would still throw an
error.
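
One way to picture this check is a recursive walk over the expression that
rejects any identifier. The Python sketch below uses a tiny hypothetical
expression tree; the class names mirror common AST node names but are
simplified stand-ins, and `check_skip_or_limit` is an illustrative helper.

```python
from dataclasses import dataclass


class SemanticException(Exception):
    pass


@dataclass
class Literal:
    value: object


@dataclass
class Identifier:
    name: str


@dataclass
class PropertyLookup:
    expression: object
    prop: str


@dataclass
class Addition:
    left: object
    right: object


def contains_identifier(expr):
    # Walk the tree and report whether any Identifier occurs in it.
    if isinstance(expr, Identifier):
        return True
    if isinstance(expr, PropertyLookup):
        return contains_identifier(expr.expression)
    if isinstance(expr, Addition):
        return contains_identifier(expr.left) or contains_identifier(expr.right)
    return False


def check_skip_or_limit(clause, expr):
    if contains_identifier(expr):
        raise SemanticException(f"Variables are not allowed in {clause}")


check_skip_or_limit("LIMIT", Addition(Literal(1), Literal(2)))  # LIMIT 1 + 2
try:
    # SKIP new.prop
    check_skip_or_limit("SKIP", PropertyLookup(Identifier("new"), "prop"))
    skip_rejected = False
except SemanticException:
    skip_rejected = True
```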

Finally, we generate symbols for names created in the `RETURN` clause. These
symbols are used for the final results of a query.

NOTE: New symbols in `WITH` and `RETURN` should be unique. This means that
`WITH a AS same, b AS same` is not allowed, and neither is a construct like
`RETURN 2, 2`.
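
A minimal Python sketch of the uniqueness check follows. The function
`check_unique_names` is hypothetical, and it assumes that an expression
without `AS` is named after its own text, which is what makes `RETURN 2, 2`
a duplicate.

```python
class SemanticException(Exception):
    pass


def check_unique_names(projections):
    """Return the result names in order, or raise on a duplicate.

    projections is a list of (expression_text, alias_or_None) pairs.
    """
    names = []
    for expression_text, alias in projections:
        # Assumption: an unaliased expression is named after its text.
        name = alias if alias is not None else expression_text
        if name in names:
            raise SemanticException(f"Multiple results with the same name: {name}")
        names.append(name)
    return names


# RETURN 2 AS a, 2 AS b
unique = check_unique_names([("2", "a"), ("2", "b")])

# RETURN 2, 2
try:
    check_unique_names([("2", None), ("2", None)])
    duplicate_rejected = False
except SemanticException:
    duplicate_rejected = True
```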

### Symbols in Functions which Establish New Scope

Symbols can also be created in some functions. These functions usually take an
expression, bind a single variable and run the expression inside the newly
established scope.

The `all` function takes a list, creates a variable for the list element and
runs the predicate expression on each element. For example:

    MATCH (n) RETURN n, all(n IN n.prop_list WHERE n < 42)

We create a new symbol for use inside `all`. This means that `WHERE n < 42`
uses the `n` which takes its values from the elements of `n.prop_list`. The
original `n` bound by `MATCH` is not visible inside the predicate of the `all`
function, but it is visible outside. Therefore, `RETURN n` and the list
expression `n.prop_list` both reference the `n` from `MATCH`.
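
The shadowing can be modelled with a stack of scopes, as in the Python sketch
below. The `Scopes` class and its methods are hypothetical and only
illustrate name resolution; they are not how the symbol generator is
structured.

```python
class Scopes:
    """A stack of name -> symbol position maps (simplified model)."""

    def __init__(self):
        self.stack = [{}]
        self.next_position = 0

    def declare(self, name):
        # Always creates a fresh symbol in the innermost scope.
        position = self.next_position
        self.next_position += 1
        self.stack[-1][name] = position
        return position

    def lookup(self, name):
        # The innermost declaration wins.
        for scope in reversed(self.stack):
            if name in scope:
                return scope[name]
        raise KeyError(name)

    def push(self):
        self.stack.append({})

    def pop(self):
        self.stack.pop()


# MATCH (n) RETURN n, all(n IN n.prop_list WHERE n < 42)
scopes = Scopes()
match_n = scopes.declare("n")     # MATCH (n)
list_n = scopes.lookup("n")       # n.prop_list, resolved in the outer scope
scopes.push()
all_n = scopes.declare("n")       # all(n IN ...)
predicate_n = scopes.lookup("n")  # WHERE n < 42
scopes.pop()
return_n = scopes.lookup("n")     # RETURN n
```

The predicate resolves to the symbol declared by `all`, while the list
expression and `RETURN n` resolve to the symbol declared by `MATCH`.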

[^1]: Constant expressions are expressions for which the result can be
  computed at compile time.