# Lexical and Syntactic Analysis
## Antlr
We use Antlr for lexical and syntax analysis of Cypher queries. Antlr uses the
grammar file `Cypher.g4`, downloaded from http://www.opencypher.org, to generate
the parser and the visitor for the Cypher parse tree. Even though the provided
grammar is not very pleasant to work with, we decided not to make any drastic
changes to it so that our transition to newly published versions of
`Cypher.g4` would be easier. Nevertheless, we had to fix some bugs and add
features, so our version is not completely the same.

In addition to `Cypher.g4`, we have `MemgraphCypher.g4`. This grammar
file defines Memgraph-specific extensions to the original grammar. The most
notable example is the inclusion of syntax for handling authorization. At the
moment, some extensions are also found in `Cypher.g4`, for example the syntax
for using a lambda function in relationship patterns. These extensions should
be moved out of `Cypher.g4`, so that it remains as close to the original
grammar as possible. Additionally, having `MemgraphCypher.g4` may not be
enough if we wish to split the functionality for community and enterprise
editions of Memgraph.
## Abstract Syntax Tree (AST)
Since the Antlr-generated visitor and the official openCypher grammar are not
very practical to use, we translate Antlr's AST to our own AST. Currently there
are ~40 types of nodes in our AST. Their definitions can be found in
`src/query/frontend/ast/ast.lcp`.

Major groups of types can be found under the following base types:
* `Expression` --- types corresponding to Cypher expressions.
* `Clause` --- types corresponding to Cypher clauses.
* `PatternAtom` --- node or edge related information.
* `Query` --- different kinds of queries; allows extending the language with
  Memgraph-specific query syntax.

Memory management of created AST nodes is done with `AstStorage`. Each node
must be created by invoking the `AstStorage::Create` method. This way, all
pointers to nodes and their children are raw pointers. The only owner of
allocated memory is the `AstStorage`. When the storage goes out of scope, the
pointers become invalid. It may be more natural to handle tree ownership via
`unique_ptr`, i.e. each node owns its children. But there are some benefits to
having a custom storage and allocation scheme.

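The ownership scheme described above can be sketched roughly as follows. This is a minimal illustration, not the actual Memgraph implementation: the node types `Tree`, `Identifier` and `Addition`, the `size` accessor, and the `BuildSum` helper are hypothetical, and the real `AstStorage::Create` may differ in signature and internals.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical base class; the real node types are defined in ast.lcp.
struct Tree {
  virtual ~Tree() = default;
};

struct Identifier : Tree {
  explicit Identifier(std::string name) : name(std::move(name)) {}
  std::string name;
};

struct Addition : Tree {
  Addition(Tree *lhs, Tree *rhs) : lhs(lhs), rhs(rhs) {}
  // Children are non-owning raw pointers into the same storage.
  Tree *lhs;
  Tree *rhs;
};

// Sketch of a storage that is the sole owner of every node it creates.
class AstStorage {
 public:
  // Constructs a node and returns a non-owning pointer to it.
  template <typename T, typename... Args>
  T *Create(Args &&...args) {
    auto node = std::make_unique<T>(std::forward<Args>(args)...);
    T *raw = node.get();
    nodes_.push_back(std::move(node));
    return raw;
  }

  std::size_t size() const { return nodes_.size(); }

 private:
  // All nodes are destroyed together when the storage is destroyed.
  std::vector<std::unique_ptr<Tree>> nodes_;
};

// Builds the tree for `a + b`; the result is valid only while `storage` lives.
inline Addition *BuildSum(AstStorage &storage) {
  auto *a = storage.Create<Identifier>("a");
  auto *b = storage.Create<Identifier>("b");
  return storage.Create<Addition>(a, b);
}
```

Because callers only ever see `Create` and raw pointers, the way nodes are allocated inside the storage can change without touching the code that builds the tree.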
The primary reason we opted for not using `unique_ptr` is the requirement of
Antlr's base visitor class that the resulting values must be copyable. The
result is wrapped in `antlr::Any` so that the derived visitor classes may
return any type they wish when visiting Antlr's AST. Unfortunately,
`antlr::Any` does not work with non-copyable types.

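The copyability constraint can be illustrated without the Antlr runtime. The sketch below uses `std::any` purely as a stand-in, on the assumption that it behaves similarly in this respect: it also requires the stored type to be copy constructible. The names `Node`, `VisitNode` and `Unwrap` are hypothetical.

```cpp
#include <any>
#include <cassert>
#include <memory>
#include <type_traits>

struct Node {
  int value;
};

// unique_ptr is move-only, so a wrapper that must copy its contents
// cannot hold it. A raw pointer has no such restriction.
static_assert(!std::is_copy_constructible_v<std::unique_ptr<Node>>,
              "unique_ptr cannot be copied");
static_assert(std::is_copy_constructible_v<Node *>,
              "a raw pointer copies trivially");

// Visitor-style function returning a type-erased result.
// std::any stands in for the Antlr wrapper here (an assumption).
inline std::any VisitNode(Node *node) {
  // std::any result = std::make_unique<Node>();  // would not compile
  return std::any(node);
}

// Recovers the raw pointer from the type-erased result.
inline Node *Unwrap(const std::any &result) {
  return std::any_cast<Node *>(result);
}
```

Copying the wrapper copies only the pointer, so every copy refers to the same node, which stays owned by whatever created it.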
Another benefit of having `AstStorage` is that we can easily add a different
allocation scheme for AST nodes. The interface of node creation would not
change.
### AST Translation
The translation process is done via the `CypherMainVisitor` class, which is
derived from the Antlr-generated visitor. Besides instantiating our AST types, a
minimal number of syntactic checks are done on a query. These checks handle
the cases which were valid in the original openCypher grammar, but may be
invalid when combined with other syntax elements.
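To show the general shape of translating while checking, here is a small self-contained sketch. It does not use the Antlr runtime or the real `CypherMainVisitor` interface. The types `ParsedQuery`, `Clause`, `SyntaxCheckError` and `TranslatingVisitor` are hypothetical, and the rule "nothing may follow `RETURN`" is only an illustrative example of a check that a permissive grammar would not enforce by itself.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Stand-in for the parser output: a query as a flat list of clause keywords.
struct ParsedQuery {
  std::vector<std::string> clause_keywords;
};

// Stand-in for one of our own AST node types.
struct Clause {
  std::string keyword;
};

struct SyntaxCheckError : std::runtime_error {
  using std::runtime_error::runtime_error;
};

class TranslatingVisitor {
 public:
  // Translates the parsed query into AST nodes, validating as it goes.
  // Returned pointers are non-owning; the visitor's storage owns the nodes.
  std::vector<Clause *> Visit(const ParsedQuery &parsed) {
    std::vector<Clause *> clauses;
    bool seen_return = false;
    for (const auto &keyword : parsed.clause_keywords) {
      // Illustrative check: each clause is valid by itself, but the
      // combination "RETURN followed by another clause" is rejected.
      if (seen_return) {
        throw SyntaxCheckError("RETURN must be the last clause");
      }
      if (keyword == "RETURN") seen_return = true;
      storage_.push_back(std::make_unique<Clause>(Clause{keyword}));
      clauses.push_back(storage_.back().get());
    }
    return clauses;
  }

 private:
  std::vector<std::unique_ptr<Clause>> storage_;
};
```

Doing such checks during translation keeps the grammar file close to the published one, since the restriction lives in visitor code rather than in modified grammar rules.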