memgraph/docs/dev/query/parsing.md
Teon Banek 6f10b1c115 Move query implementation from Phriction to this repo
Summary:
Our query parsing, planning and execution architecture was described on
Phabricator wiki pages, Phriction. This commit copies the said
documentation here, so that it's easier to access for all developers.
Additional benefit is tracking the changes and hopefully suggesting to
developers to keep it up to date.

Besides making a copy, the documentation has been updated to reflect the
current state of the codebase. Note that some things are still missing,
but what was written should now be correct.

2019-02-15 16:58:39 +01:00

Lexical and Syntactic Analysis

Antlr

We use Antlr for lexical and syntactic analysis of Cypher queries. Antlr uses the grammar file Cypher.g4, downloaded from http://www.opencypher.org, to generate the parser and the visitor for the Cypher parse tree. Even though the provided grammar is not very pleasant to work with, we decided against making drastic changes to it, so that moving to newly published versions of Cypher.g4 stays easy. Nevertheless, we had to fix some bugs and add features, so our version is not identical to the original.

In addition to Cypher.g4, we have MemgraphCypher.g4. This grammar file defines Memgraph-specific extensions to the original grammar. The most notable example is the syntax for handling authorization. At the moment, some extensions are also found in Cypher.g4, for example the syntax for using a lambda function in relationship patterns. These extensions should be moved out of Cypher.g4, so that it remains as close to the original grammar as possible. Additionally, having MemgraphCypher.g4 may not be enough if we wish to split the functionality between the community and enterprise editions of Memgraph.

Abstract Syntax Tree (AST)

Since the Antlr-generated visitor and the official openCypher grammar are not very practical to use, we translate Antlr's parse tree into our own AST. Currently, there are ~40 types of nodes in our AST. Their definitions can be found in src/query/frontend/ast/ast.lcp.

The major groups of node types derive from the following base types:

  • Expression --- types corresponding to Cypher expressions.
  • Clause --- types corresponding to Cypher clauses.
  • PatternAtom --- node or edge related information.
  • Query --- different kinds of queries; this allows extending the language with Memgraph-specific query syntax.

Memory management of created AST nodes is done with AstStorage. Each node must be created by invoking the AstStorage::Create method. This way, all of the pointers to nodes and their children are raw pointers, and the only owner of the allocated memory is the AstStorage. When the storage goes out of scope, the pointers become invalid. It may be more natural to handle tree ownership via unique_ptr, i.e. each node owns its children, but there are some benefits to having a custom storage and allocation scheme.
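The ownership scheme described above can be sketched roughly as follows. This is a simplified illustration and not the actual implementation: the node types (Tree, Identifier, PropertyLookup), their members, the size() helper and BuildExample are stand-ins assumed for the example; only the AstStorage name and its Create method come from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins for the real AST node types.
struct Tree {
  virtual ~Tree() = default;
};

struct Expression : Tree {};

struct Identifier : Expression {
  explicit Identifier(std::string name) : name(std::move(name)) {}
  std::string name;
};

struct PropertyLookup : Expression {
  PropertyLookup(Expression *expression, std::string property)
      : expression(expression), property(std::move(property)) {}
  Expression *expression;  // Non-owning; the storage owns the child node.
  std::string property;
};

// Sole owner of every node created through it (sketch, assumed layout).
class AstStorage {
 public:
  template <typename T, typename... Args>
  T *Create(Args &&...args) {
    auto node = std::make_unique<T>(std::forward<Args>(args)...);
    T *raw = node.get();
    nodes_.push_back(std::move(node));  // Ownership moves into the storage.
    return raw;                         // Callers only ever see raw pointers.
  }

  std::size_t size() const { return nodes_.size(); }

 private:
  std::vector<std::unique_ptr<Tree>> nodes_;
};

// Builds the AST for the expression `n.age` and returns its root.
inline PropertyLookup *BuildExample(AstStorage &storage) {
  auto *ident = storage.Create<Identifier>("n");
  return storage.Create<PropertyLookup>(ident, "age");
}
```

In this sketch, the pointer returned by BuildExample stays valid only for as long as the AstStorage instance passed to it is alive.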

The primary reason we opted for not using unique_ptr is the requirement of Antlr's base visitor class that the resulting values must be copyable. The result is wrapped in antlrcpp::Any so that the derived visitor classes may return any type they wish when visiting Antlr's parse tree. Unfortunately, antlrcpp::Any does not work with non-copyable types.
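The copyability constraint can be illustrated with std::any, which is used here only as a stand-in for Antlr's wrapper type because it imposes a similar requirement on the values it holds. The Node type and the VisitLiteral and Unwrap functions are hypothetical names invented for this sketch.

```cpp
#include <any>
#include <cassert>
#include <memory>
#include <type_traits>

// Hypothetical AST node, used only for this illustration.
struct Node {
  int value;
};

// An owning pointer cannot be copied, so it cannot travel through a
// wrapper that requires copyable values.
static_assert(!std::is_copy_constructible_v<std::unique_ptr<Node>>,
              "unique_ptr is move-only");

// A raw pointer is trivially copyable, so it can.
static_assert(std::is_copy_constructible_v<Node *>,
              "raw pointers are copyable");

// Mimics a visitor method: the result is returned through a type-erased,
// copyable wrapper (std::any as a stand-in for Antlr's wrapper).
inline std::any VisitLiteral(Node *storage_owned_node) {
  return std::any(storage_owned_node);
}

// Recovers the raw pointer on the caller's side.
inline Node *Unwrap(const std::any &result) {
  return std::any_cast<Node *>(result);
}
```

Copying the wrapper copies only the pointer, never the node, which is why ownership has to live elsewhere.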

Another benefit of having AstStorage is that we can easily add a different allocation scheme for AST nodes. The interface of node creation would not change.

AST Translation

The translation process is done via the CypherMainVisitor class, which is derived from the Antlr-generated visitor. Besides instantiating our AST types, it performs a minimal number of syntactic checks on a query. These checks handle the cases which were valid in the original openCypher grammar, but may be invalid when combined with other syntax elements.
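A toy version of such a translation step might look like the sketch below. Everything in it is hypothetical: RelationshipPatternContext, EdgeAtom and ToyVisitor are invented stand-ins, and the rule that a lambda requires a variable-length pattern is made up purely to show where a check of this kind would sit. It is not a description of what CypherMainVisitor actually verifies.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a parse-tree context produced by the parser.
struct RelationshipPatternContext {
  std::string variable;
  bool has_range = false;   // A variable-length range was written.
  bool has_lambda = false;  // A lambda was attached to the pattern.
};

// Hypothetical AST node for an edge in a pattern.
struct EdgeAtom {
  std::string variable;
  bool variable_length = false;
  bool has_filter = false;
};

class ToyVisitor {
 public:
  // Translates one parse-tree context into an AST node.
  EdgeAtom *Visit(const RelationshipPatternContext &ctx) {
    // The grammar accepts both elements independently, so the invalid
    // combination (an invented rule for this sketch) is rejected here.
    if (ctx.has_lambda && !ctx.has_range) {
      throw std::invalid_argument(
          "lambda requires a variable-length pattern");
    }
    nodes_.push_back(std::make_unique<EdgeAtom>(
        EdgeAtom{ctx.variable, ctx.has_range, ctx.has_lambda}));
    return nodes_.back().get();  // Non-owning pointer; the visitor owns it.
  }

 private:
  std::vector<std::unique_ptr<EdgeAtom>> nodes_;
};
```

The point of the sketch is the placement of the check: it lives in the translation step, so the grammar itself can stay close to the upstream version.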