Add a feature spec for Python Query Modules
Reviewers: mferencevic, buda, ipaljak, dsantl, tlastre Reviewed By: mferencevic, buda, ipaljak, dsantl Differential Revision: https://phabricator.memgraph.io/D2629
This commit is contained in:
parent
a1fa7de115
commit
029a36eab1
185
docs/feature_spec/python-query-modules.md
Normal file
185
docs/feature_spec/python-query-modules.md
Normal file
@ -0,0 +1,185 @@
|
||||
# Python 3 Query Modules
|
||||
|
||||
## Introduction
|
||||
|
||||
Memgraph exposes a C API for writing the so called Query Modules. These
|
||||
modules contain definitions of procedures which can be invoked through the
|
||||
query language using the `CALL ... YIELD ...` syntax. This mechanism allows
|
||||
database users to extend Memgraph with their own algorithms and
|
||||
functionalities.
|
||||
|
||||
Using a low level language like C can be quite cumbersome for writing modules,
|
||||
so it seems natural to add support for a higher level language on top of the
|
||||
existing C API.
|
||||
|
||||
There are languages written exactly for this purpose of extending C with high
|
||||
level constructs, for example Lua and Guile. Instead of those, we have chosen
|
||||
Python 3 to be the first high level language we will support. The primary reason
|
||||
being that it's very popular, so more people should be able to write modules.
|
||||
Another benefit of Python which comes out of its popularity is the large
|
||||
ecosystem of libraries, especially graph algorithm related ones like NetworkX.
|
||||
Python does have significant performance and implementation downsides compared
|
||||
to Lua and Guile, but these are described in more detail later in this
|
||||
document.
|
||||
|
||||
## Python 3 API Overview
|
||||
|
||||
The Python 3 API should be as user friendly as possible as well as look
|
||||
Pythonic. This implies that some functions from the C API will not map to the
|
||||
exact same functions. The most obvious case for a Pythonic approach is
|
||||
registering procedures of a query module. Let's take a look at the C example
|
||||
and its transformation to Python.
|
||||
|
||||
```c
|
||||
static void procedure(const struct mgp_list *args,
|
||||
const struct mgp_graph *graph, struct mgp_result *result,
|
||||
struct mgp_memory *memory);
|
||||
|
||||
int mgp_init_module(struct mgp_module *module, struct mgp_memory *memory) {
|
||||
struct mgp_proc *proc =
|
||||
mgp_module_add_read_procedure(module, "procedure", procedure);
|
||||
if (!proc) return 1;
|
||||
if (!mgp_proc_add_arg(proc, "required_arg",
|
||||
mgp_type_nullable(mgp_type_any())))
|
||||
return 1;
|
||||
struct mgp_value *null_value = mgp_value_make_null(memory);
|
||||
if (!mgp_proc_add_opt_arg(proc, "optional_arg",
|
||||
mgp_type_nullable(mgp_type_any()), null_value)) {
|
||||
mgp_value_destroy(null_value);
|
||||
return 1;
|
||||
}
|
||||
mgp_value_destroy(null_value);
|
||||
if (!mgp_proc_add_result(proc, "result", mgp_type_string())) return 1;
|
||||
if (!mgp_proc_add_result(proc, "args",
|
||||
mgp_type_list(mgp_type_nullable(mgp_type_any()))))
|
||||
return 1;
|
||||
return 0;
|
||||
}
|
||||
```
|
||||
|
||||
In Python things should be a lot simpler.
|
||||
|
||||
```Python
|
||||
# mgp.read_proc obtains the procedure name via __name__ attribute of a function.
|
||||
@mgp.read_proc(# Arguments passed to multiple mgp_proc_add_arg calls
|
||||
(('required_arg', mgp.Nullable(mgp.Any)), ('optional_arg', mgp.Nullable(mgp.Any), None)),
|
||||
# Result fields passed to multiple mgp_proc_add_result calls
|
||||
(('result', str), ('args', mgp.List(mgp.Nullable(mgp.Any)))))
|
||||
def procedure(args, graph, result, memory):
|
||||
pass
|
||||
```
|
||||
|
||||
Here we have replaced `mgp_module_*` and `mgp_proc_*` C API with a much
|
||||
simpler decorator function in Python -- `mgp.read_proc`. The types of
|
||||
arguments and result fields can both be our types as well as Python builtin
|
||||
types which can map to supported `mgp_value` types. The expected builtin types
|
||||
we ought to support are: `bool`, `str`, `int`, `float` and `map`. While the
|
||||
rest of the types are provided via our Python API. Optionally, we can add
|
||||
convenience support for `object` type which would map to
|
||||
`mgp.Nullable(mgp.Any)` and `list` which would map to
|
||||
`mgp.List(mgp.Nullable(mgp.Any))`. Also, it makes sense to take a look if we
|
||||
can leverage Python's `typing` module here.
|
||||
|
||||
Another Pythonic change is to remove `mgp_value` C API from Python altogether.
|
||||
This means that the arguments a Python procedure receives are not `mgp_value`
|
||||
instances but rather `PyObject` instances. In other words, our implementation
|
||||
would immediately marshal `mgp_value` to corresponding type in Python.
|
||||
Obviously we would need to provide our own Python types for non-builtin
|
||||
things like `mgp.Vertex` (equivalent to `mgp_vertex`) and other.
|
||||
|
||||
Continuing from our example above, let's say the procedure was invoked through
|
||||
Cypher using the following query.
|
||||
|
||||
MATCH (n) CALL py_module.procedure(42, n) YIELD *;
|
||||
|
||||
The Python procedure could then do the following and complete without throwing
|
||||
neither the AssertionError nor the ValueError.
|
||||
|
||||
```Python
|
||||
def procedure(args, graph, result, memory):
|
||||
assert isinstance(args, list)
|
||||
# Unpacking throws ValueError if args does not contain exactly 2 values.
|
||||
required_arg, optional_arg = args
|
||||
assert isintance(required_arg, int)
|
||||
assert isinstance(optional_arg, mgp.Vertex)
|
||||
```
|
||||
|
||||
The rest of the C API should naturally map to either top level functions or
|
||||
class methods as appropriate.
|
||||
|
||||
## Loading Python Query Modules
|
||||
|
||||
Our current mechanism for loading the modules is to look for `.so` files in
|
||||
the directory specified by `--query-modules` flag. This is done when Memgraph
|
||||
is started. We can extend this mechanism to look for `.py` files in addition
|
||||
to `.so` files in the same directory and import them in the embedded Python
|
||||
interpreter. The only issue is embedding the interpreter in Memgraph. There
|
||||
are multiple choices:
|
||||
|
||||
1. Building Memgraph and statically linking to Python.
|
||||
2. Building Memgraph and dynamically linking to Python, and distributing
|
||||
Python with Memgraph's installation.
|
||||
3. Building Memgraph and dynamically linking to Python, but without
|
||||
distributing the Python library.
|
||||
4. Building Memgraph and optionally loading Python library by trying to
|
||||
`dlopen` it.
|
||||
|
||||
The first two options are only viable if the Python license allows, and this
|
||||
will need further investigation.
|
||||
|
||||
The third option adds Python as an installation dependency for Memgraph, and
|
||||
without it Memgraph will not run. This is problematic for users which cannot
|
||||
or do not want to install Python 3.
|
||||
|
||||
The fourth option avoids all of the issues present in the first 3 options, but
|
||||
comes at a higher implementation cost. We would need to try to `dlopen` the
|
||||
Python library and setup function pointers. If we succeed we would import
|
||||
`.py` files from the `--query-modules` directory. On the other hand, if the
|
||||
user does not have Python, `dlopen` would fail and Memgraph would run without
|
||||
Python support.
|
||||
|
||||
After live discussion, we've decided to go with option 3. This way we don't
|
||||
have to worry about mismatching Python versions we support and what the users
|
||||
expect. Also, we should target Python 3.5 as that should be common between
|
||||
Debian and CentOS for which we ship installation packages.
|
||||
|
||||
## Performance and Implementation Problems
|
||||
|
||||
As previously mentioned, embedding Python introduces usability issues compared
|
||||
to other embeddable languages.
|
||||
|
||||
The first, major issue is Global Interpreter Lock (GIL). Initializing Python
|
||||
will start a single global interpreter and running multiple threads will
|
||||
require acquiring GIL. In practice, this means that when multiple users run a
|
||||
procedure written in Python in parallel the execution will not actually be
|
||||
parallel. Python's interpreter will jump between executing one user's
|
||||
procedure and the other's. This can be quite an issue for long running
|
||||
procedures when multiple users are querying Memgraph. The solution for this
|
||||
issue is Python's API for sub-interpreters. Unfortunately, the support for
|
||||
them is rather poor and the API contains a lot of critical bugs when we tried
|
||||
to use them. For the time being, we will have to accept GIL and its downsides.
|
||||
Perhaps in the future we will gain more knowledge on how we could reduce the
|
||||
acquire rate of GIL or the sub-interpreter API will get improved.
|
||||
|
||||
Another major issue is memory allocation. Python's C API does not have support
|
||||
for setting up a temporary allocator during execution of a single function.
|
||||
It only has support for setting up a global heap allocator. This obviously
|
||||
impacts our control of memory during a query procedure invocation. Besides
|
||||
potential performance penalty, a procedure could allocate much more memory
|
||||
than we would actually allow for execution of a single query. This means that
|
||||
options controlling the memory limit during query execution are useless. On
|
||||
the bright side, Python does use block style allocators and reference
|
||||
counting, so the performance penalty and global memory usage should not be
|
||||
that terrible.
|
||||
|
||||
The final issue that isn't as major as the ones above is the global state of
|
||||
the interpreter. In practice this means that any registered procedure and
|
||||
imported module has access to any other procedure and module. This may pollute
|
||||
the namespace for other users, but it should not be much of a problem because
|
||||
Python always has things under a module scope. The other, slightly bigger
|
||||
downside is that a malicious user could use this knowledge to modify other
|
||||
modules and procedures. This seems like a major issue, but if we take the
|
||||
bigger picture into consideration, we already have a security issue in general
|
||||
by invoking `dlopen` on `.so` and potentially running arbitrary code. This was
|
||||
the trade off we chose to allow users to extend Memgraph. It's up to the users
|
||||
to write sane extensions and protect their servers from access.
|
Loading…
Reference in New Issue
Block a user