memgraph/docs/feature_spec/python-query-modules.md
Teon Banek 029a36eab1 Add a feature spec for Python Query Modules
Reviewers: mferencevic, buda, ipaljak, dsantl, tlastre

Reviewed By: mferencevic, buda, ipaljak, dsantl

Differential Revision: https://phabricator.memgraph.io/D2629
2020-01-23 17:10:23 +01:00

# Python 3 Query Modules
## Introduction
Memgraph exposes a C API for writing so-called Query Modules. These
modules contain definitions of procedures which can be invoked through the
query language using the `CALL ... YIELD ...` syntax. This mechanism allows
database users to extend Memgraph with their own algorithms and
functionalities.

Using a low-level language like C can be quite cumbersome for writing modules,
so it seems natural to add support for a higher-level language on top of the
existing C API.

There are languages designed exactly for this purpose of extending C with
high-level constructs, for example Lua and Guile. Instead of those, we have
chosen Python 3 as the first high-level language we will support. The primary
reason is that it's very popular, so more people should be able to write modules.
Another benefit of Python which comes out of its popularity is the large
ecosystem of libraries, especially graph algorithm related ones like NetworkX.
Python does have significant performance and implementation downsides compared
to Lua and Guile, but these are described in more detail later in this
document.
## Python 3 API Overview
The Python 3 API should be as user friendly as possible as well as look
Pythonic. This implies that some functions from the C API will not map to the
exact same functions. The most obvious case for a Pythonic approach is
registering procedures of a query module. Let's take a look at the C example
and its transformation to Python.
```c
static void procedure(const struct mgp_list *args,
                      const struct mgp_graph *graph, struct mgp_result *result,
                      struct mgp_memory *memory);

int mgp_init_module(struct mgp_module *module, struct mgp_memory *memory) {
  struct mgp_proc *proc =
      mgp_module_add_read_procedure(module, "procedure", procedure);
  if (!proc) return 1;
  if (!mgp_proc_add_arg(proc, "required_arg",
                        mgp_type_nullable(mgp_type_any())))
    return 1;
  struct mgp_value *null_value = mgp_value_make_null(memory);
  if (!mgp_proc_add_opt_arg(proc, "optional_arg",
                            mgp_type_nullable(mgp_type_any()), null_value)) {
    mgp_value_destroy(null_value);
    return 1;
  }
  mgp_value_destroy(null_value);
  if (!mgp_proc_add_result(proc, "result", mgp_type_string())) return 1;
  if (!mgp_proc_add_result(proc, "args",
                           mgp_type_list(mgp_type_nullable(mgp_type_any()))))
    return 1;
  return 0;
}
```
In Python things should be a lot simpler.
```Python
# mgp.read_proc obtains the procedure name via __name__ attribute of a function.
@mgp.read_proc(# Arguments passed to multiple mgp_proc_add_arg calls
               (('required_arg', mgp.Nullable(mgp.Any)), ('optional_arg', mgp.Nullable(mgp.Any), None)),
               # Result fields passed to multiple mgp_proc_add_result calls
               (('result', str), ('args', mgp.List(mgp.Nullable(mgp.Any)))))
def procedure(args, graph, result, memory):
    pass
```
Here we have replaced the `mgp_module_*` and `mgp_proc_*` C API with a much
simpler decorator function in Python -- `mgp.read_proc`. The types of
arguments and result fields can be our own types as well as Python builtin
types which map to supported `mgp_value` types. The builtin types we expect
to support are `bool`, `str`, `int`, `float` and `dict`, while the rest of
the types are provided via our Python API. Optionally, we can add
convenience support for the `object` type, which would map to
`mgp.Nullable(mgp.Any)`, and `list`, which would map to
`mgp.List(mgp.Nullable(mgp.Any))`. It also makes sense to check whether we
can leverage Python's `typing` module here.
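To make the registration idea more concrete, here is a minimal sketch of how such a decorator could collect procedure metadata. Everything in it is an assumption made for illustration: `read_proc`, `to_mgp_type`, `BUILTIN_TO_MGP` and `REGISTERED_PROCEDURES` are hypothetical stand-ins rather than the real `mgp` module, and the C API type constructors are represented as plain strings.

```python
# Hypothetical sketch: none of these names are the real `mgp` implementation.
# The C API type constructors are represented here as plain strings.
BUILTIN_TO_MGP = {
    bool: "bool",
    str: "string",
    int: "int",
    float: "float",
    dict: "map",
    object: "nullable(any)",      # optional convenience mapping
    list: "list(nullable(any))",  # optional convenience mapping
}

REGISTERED_PROCEDURES = {}


def to_mgp_type(type_spec):
    """Translate a builtin Python type into a stand-in mgp type name."""
    if isinstance(type_spec, str):
        return type_spec  # already a stand-in mgp type name
    try:
        return BUILTIN_TO_MGP[type_spec]
    except KeyError:
        raise TypeError("unsupported type: {!r}".format(type_spec))


def read_proc(args, results):
    """Register the decorated function under its __name__."""
    def decorator(func):
        arg_specs = []
        # Each entry is (name, type) or (name, type, default).
        for name, type_spec, *default in args:
            arg_specs.append({
                "name": name,
                "type": to_mgp_type(type_spec),
                "optional": bool(default),
                "default": default[0] if default else None,
            })
        REGISTERED_PROCEDURES[func.__name__] = {
            "args": arg_specs,
            "results": [(name, to_mgp_type(t)) for name, t in results],
            "callback": func,
        }
        return func
    return decorator


@read_proc((("required_arg", object), ("optional_arg", object, None)),
           (("result", str), ("args", list)))
def procedure(args, graph, result, memory):
    pass
```

In a real implementation the registry entries would presumably be turned into `mgp_proc_add_arg`, `mgp_proc_add_opt_arg` and `mgp_proc_add_result` calls instead of being stored in a dictionary.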
Another Pythonic change is to remove the `mgp_value` C API from Python altogether.
This means that the arguments a Python procedure receives are not `mgp_value`
instances but rather `PyObject` instances. In other words, our implementation
would immediately marshal `mgp_value` to the corresponding type in Python.
Obviously we would need to provide our own Python types for non-builtin
things like `mgp.Vertex` (equivalent to `mgp_vertex`) and others.

Continuing from our example above, let's say the procedure was invoked through
Cypher using the following query.

    MATCH (n) CALL py_module.procedure(42, n) YIELD *;

The Python procedure could then do the following and complete without raising
either the AssertionError or the ValueError.
```Python
def procedure(args, graph, result, memory):
    assert isinstance(args, list)
    # Unpacking throws ValueError if args does not contain exactly 2 values.
    required_arg, optional_arg = args
    assert isinstance(required_arg, int)
    assert isinstance(optional_arg, mgp.Vertex)
```
The rest of the C API should naturally map to either top level functions or
class methods as appropriate.
## Loading Python Query Modules
Our current mechanism for loading the modules is to look for `.so` files in
the directory specified by the `--query-modules` flag. This is done when Memgraph
is started. We can extend this mechanism to look for `.py` files in addition
to `.so` files in the same directory and import them in the embedded Python
interpreter. The only issue is embedding the interpreter in Memgraph. There
are multiple choices:
1. Building Memgraph and statically linking to Python.
2. Building Memgraph and dynamically linking to Python, and distributing
   Python with Memgraph's installation.
3. Building Memgraph and dynamically linking to Python, but without
   distributing the Python library.
4. Building Memgraph and optionally loading the Python library by trying to
   `dlopen` it.

The first two options are only viable if the Python license allows it, and this
will need further investigation.

The third option adds Python as an installation dependency for Memgraph, and
without it Memgraph will not run. This is problematic for users who cannot
or do not want to install Python 3.

The fourth option avoids all of the issues present in the first three options, but
comes at a higher implementation cost. We would need to try to `dlopen` the
Python library and set up function pointers. If we succeed, we would import
`.py` files from the `--query-modules` directory. On the other hand, if the
user does not have Python, `dlopen` would fail and Memgraph would run without
Python support.
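The fallback logic of the fourth option can be sketched as follows. The real implementation would be written against `dlopen` and `dlsym` in C or C++; this sketch uses Python's `ctypes.CDLL`, which goes through the platform's dynamic loader, only to show the control flow. The function name `load_optional_library` and the library file name are made up for illustration.

```python
import ctypes


def load_optional_library(library_name, symbol_names):
    """Return a dict of function pointers, or None if loading fails.

    None means the caller should continue without the optional feature.
    """
    try:
        library = ctypes.CDLL(library_name)
    except OSError:
        return None  # the library is not installed or cannot be loaded
    pointers = {}
    for symbol in symbol_names:
        try:
            pointers[symbol] = getattr(library, symbol)
        except AttributeError:
            return None  # the library lacks a symbol we depend on
    return pointers


# Hypothetical library name, chosen so that loading is expected to fail.
pointers = load_optional_library(
    "libdefinitely_missing_memgraph_example.so", ["Py_Initialize"])
python_support = pointers is not None
print("Python support enabled:", python_support)
```

The point of the sketch is that a missing library or symbol is treated as a normal outcome rather than a startup error, which is what makes the dependency optional.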
After a live discussion, we've decided to go with option 3. This way we don't
have to worry about a mismatch between the Python versions we support and what
the users expect. Also, we should target Python 3.5 as that should be common
between Debian and CentOS, for which we ship installation packages.
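The extended directory scan described in this section could look roughly like the sketch below. It is written in plain Python for illustration, whereas the actual loader would run inside Memgraph and drive the embedded interpreter through the Python C API. The function names and the file names in the demonstration are hypothetical.

```python
import importlib.util
import tempfile
from pathlib import Path


def discover_query_modules(directory):
    """Split the files in a directory into .so and .py query modules."""
    shared_objects, python_files = [], []
    for path in sorted(Path(directory).iterdir()):
        if path.suffix == ".so":
            shared_objects.append(path)
        elif path.suffix == ".py":
            python_files.append(path)
    return shared_objects, python_files


def import_python_module(path):
    """Import a single .py file as a module named after the file."""
    spec = importlib.util.spec_from_file_location(path.stem, str(path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Demonstration on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as directory:
    Path(directory, "native_module.so").write_bytes(b"")
    Path(directory, "py_module.py").write_text("ANSWER = 42\n")
    Path(directory, "notes.txt").write_text("ignored\n")
    shared_objects, python_files = discover_query_modules(directory)
    loaded = import_python_module(python_files[0])
```

Naming the module after the file stem mirrors the `py_module.procedure` call syntax used in the Cypher example earlier, although how names are actually derived is an assumption here.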
## Performance and Implementation Problems
As previously mentioned, embedding Python introduces usability issues compared
to other embeddable languages.

The first, major issue is the Global Interpreter Lock (GIL). Initializing Python
will start a single global interpreter and running multiple threads will
require acquiring the GIL. In practice, this means that when multiple users run a
procedure written in Python in parallel, the execution will not actually be
parallel. Python's interpreter will jump between executing one user's
procedure and the other's. This can be quite an issue for long running
procedures when multiple users are querying Memgraph. The solution for this
issue is Python's API for sub-interpreters. Unfortunately, the support for
them is rather poor and we ran into a lot of critical bugs in the API when we
tried to use them. For the time being, we will have to accept the GIL and its
downsides. Perhaps in the future we will gain more knowledge on how we could
reduce the acquire rate of the GIL or the sub-interpreter API will get improved.
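The effect of the GIL on CPU-bound procedures can be observed with a small experiment. The sketch below runs the same pure-Python loop first sequentially and then in several threads. On a CPython build with the GIL enabled, the threaded wall time is usually close to the sequential one instead of being divided by the number of threads. The workload and function names are made up for illustration, and the exact timings depend on the machine and interpreter build.

```python
import threading
import time


def busy_sum(n):
    """Pure-Python CPU-bound loop with no blocking I/O."""
    total = 0
    for i in range(n):
        total += i
    return total


def run_sequential(n, workers):
    start = time.perf_counter()
    results = [busy_sum(n) for _ in range(workers)]
    return results, time.perf_counter() - start


def run_threaded(n, workers):
    results = [None] * workers

    def work(slot):
        results[slot] = busy_sum(n)

    threads = [threading.Thread(target=work, args=(i,))
               for i in range(workers)]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results, time.perf_counter() - start


sequential_results, sequential_time = run_sequential(200000, 4)
threaded_results, threaded_time = run_threaded(200000, 4)
print("sequential: {:.3f}s, threaded: {:.3f}s".format(
    sequential_time, threaded_time))
```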
Another major issue is memory allocation. Python's C API does not have support
for setting up a temporary allocator during execution of a single function.
It only has support for setting up a global heap allocator. This obviously
impacts our control of memory during a query procedure invocation. Besides a
potential performance penalty, a procedure could allocate much more memory
than we would actually allow for execution of a single query. This means that
options controlling the memory limit during query execution have no effect on
Python procedures. On
the bright side, Python does use block style allocators and reference
counting, so the performance penalty and global memory usage should not be
that terrible.

The final issue, which isn't as major as the ones above, is the global state of
the interpreter. In practice this means that any registered procedure and
imported module has access to any other procedure and module. This may pollute
the namespace for other users, but it should not be much of a problem because
Python always has things under a module scope. The other, slightly bigger
downside is that a malicious user could use this knowledge to modify other
modules and procedures. This seems like a major issue, but if we take the
bigger picture into consideration, we already have a security issue in general
by invoking `dlopen` on `.so` and potentially running arbitrary code. This was
the trade-off we chose to allow users to extend Memgraph. It's up to the users
to write sane extensions and protect their servers from access.
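The shared global state is easy to demonstrate. In the sketch below, two modules are created in the same interpreter, and the second one replaces a function of the first through `sys.modules`. The module names, the `shortest_path` function and the `make_module` helper are all made up for illustration.

```python
import sys
import types


def make_module(name, source):
    """Create a module from source text and register it globally."""
    module = types.ModuleType(name)
    exec(source, module.__dict__)
    sys.modules[name] = module
    return module


trusted = make_module(
    "trusted_module",
    "def shortest_path():\n"
    "    return 'real result'\n",
)
before = trusted.shortest_path()

# The second module reaches into the first one and swaps its procedure.
make_module(
    "malicious_module",
    "import sys\n"
    "sys.modules['trusted_module'].shortest_path = "
    "lambda: 'tampered result'\n",
)
after = trusted.shortest_path()

print(before, "->", after)
```

Nothing in the interpreter prevents the second module from doing this, which is why the isolation has to come from controlling who can place files in the query modules directory.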