Reviewers: mferencevic, buda, ipaljak, dsantl, tlastre
Reviewed By: mferencevic, buda, ipaljak, dsantl
Differential Revision: https://phabricator.memgraph.io/D2629
Python 3 Query Modules
Introduction
Memgraph exposes a C API for writing so-called Query Modules. These modules contain definitions of procedures which can be invoked through the query language using the `CALL ... YIELD ...` syntax. This mechanism allows database users to extend Memgraph with their own algorithms and functionality.
Using a low-level language like C can be quite cumbersome for writing modules, so it seems natural to add support for a higher-level language on top of the existing C API.
There are languages designed exactly for this purpose of extending C with high-level constructs, for example Lua and Guile. Instead of those, we have chosen Python 3 as the first high-level language we will support. The primary reason is that it is very popular, so more people should be able to write modules. Another benefit of Python, which follows from its popularity, is the large ecosystem of libraries, especially graph-algorithm-related ones like NetworkX. Python does have significant performance and implementation downsides compared to Lua and Guile; these are described in more detail later in this document.
Python 3 API Overview
The Python 3 API should be as user friendly as possible and look Pythonic. This implies that some functions from the C API will not map to the exact same functions. The most obvious case for a Pythonic approach is registering the procedures of a query module. Let's take a look at the C example and its transformation to Python.
```c
static void procedure(const struct mgp_list *args,
                      const struct mgp_graph *graph, struct mgp_result *result,
                      struct mgp_memory *memory);

int mgp_init_module(struct mgp_module *module, struct mgp_memory *memory) {
  struct mgp_proc *proc =
      mgp_module_add_read_procedure(module, "procedure", procedure);
  if (!proc) return 1;
  if (!mgp_proc_add_arg(proc, "required_arg",
                        mgp_type_nullable(mgp_type_any())))
    return 1;
  struct mgp_value *null_value = mgp_value_make_null(memory);
  if (!mgp_proc_add_opt_arg(proc, "optional_arg",
                            mgp_type_nullable(mgp_type_any()), null_value)) {
    mgp_value_destroy(null_value);
    return 1;
  }
  mgp_value_destroy(null_value);
  if (!mgp_proc_add_result(proc, "result", mgp_type_string())) return 1;
  if (!mgp_proc_add_result(proc, "args",
                           mgp_type_list(mgp_type_nullable(mgp_type_any()))))
    return 1;
  return 0;
}
```
In Python things should be a lot simpler.
```python
# mgp.read_proc obtains the procedure name via the __name__ attribute of a function.
@mgp.read_proc(
    # Arguments passed to multiple mgp_proc_add_arg calls.
    (('required_arg', mgp.Nullable(mgp.Any)),
     ('optional_arg', mgp.Nullable(mgp.Any), None)),
    # Result fields passed to multiple mgp_proc_add_result calls.
    (('result', str), ('args', mgp.List(mgp.Nullable(mgp.Any)))))
def procedure(args, graph, result, memory):
    pass
```
Here we have replaced the `mgp_module_*` and `mgp_proc_*` C API with a much simpler decorator function in Python -- `mgp.read_proc`. The types of arguments and result fields can be either our own types or Python builtin types that map to supported `mgp_value` types. The builtin types we ought to support are `bool`, `str`, `int`, `float` and `dict`, while the rest of the types are provided via our Python API. Optionally, we can add convenience support for the `object` type, which would map to `mgp.Nullable(mgp.Any)`, and for `list`, which would map to `mgp.List(mgp.Nullable(mgp.Any))`. It also makes sense to look into whether we can leverage Python's `typing` module here.
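To make the `typing` idea concrete, here is a hedged sketch of how builtin annotations (and `typing.Optional`) could be translated into mgp-style type descriptors. The descriptor strings and the helper name are invented for this example; it also uses `typing.get_origin`/`typing.get_args`, which require Python 3.8+, purely for illustration.

```python
import typing

# Invented descriptor strings standing in for the mgp_type_* constructors.
_BUILTIN_TO_MGP = {
    bool: "BOOL",
    str: "STRING",
    int: "INTEGER",
    float: "FLOAT",
    dict: "MAP",
}

def to_mgp_type(annotation):
    """Translate a Python type annotation to an mgp-style type descriptor."""
    if annotation in _BUILTIN_TO_MGP:
        return _BUILTIN_TO_MGP[annotation]
    # Convenience mappings suggested above.
    if annotation is object:
        return "NULLABLE(ANY)"
    if annotation is list:
        return "LIST(NULLABLE(ANY))"
    # typing.Optional[T] is typing.Union[T, None], i.e. a nullable T.
    if typing.get_origin(annotation) is typing.Union:
        args = [a for a in typing.get_args(annotation) if a is not type(None)]
        if len(args) == 1:
            return "NULLABLE({})".format(to_mgp_type(args[0]))
    raise TypeError("unsupported annotation: {!r}".format(annotation))
```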
Another Pythonic change is to remove the `mgp_value` C API from Python altogether. This means that the arguments a Python procedure receives are not `mgp_value` instances but rather `PyObject` instances. In other words, our implementation would immediately marshal each `mgp_value` to the corresponding type in Python. Obviously, we would need to provide our own Python types for non-builtin things like `mgp.Vertex` (equivalent to `mgp_vertex`) and others.
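To illustrate the eager-marshalling idea in pure Python: a tagged `mgp_value`-like record is converted to a native Python object before the procedure ever sees it. The tag names and record shape below are invented for this sketch; the real implementation would operate on the C structs through the C API.

```python
# Sketch of eagerly converting a tagged mgp_value-like record into a native
# Python object. Tags are illustrative, not the real mgp_value variants.
def marshal_value(tag, payload):
    if tag == "null":
        return None
    if tag == "bool":
        return bool(payload)
    if tag == "int":
        return int(payload)
    if tag == "double":
        return float(payload)
    if tag == "string":
        return str(payload)
    if tag == "list":
        # Recursively marshal each (tag, payload) element.
        return [marshal_value(t, p) for (t, p) in payload]
    if tag == "map":
        return {key: marshal_value(t, p) for key, (t, p) in payload.items()}
    raise ValueError("unknown mgp_value tag: {}".format(tag))
```

Vertices, edges and paths would marshal to our own wrapper types (`mgp.Vertex` and friends) in the same pass.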
Continuing from our example above, let's say the procedure was invoked through Cypher using the following query.
```cypher
MATCH (n) CALL py_module.procedure(42, n) YIELD *;
```
The Python procedure could then do the following and complete without raising either AssertionError or ValueError.
```python
def procedure(args, graph, result, memory):
    assert isinstance(args, list)
    # Unpacking raises ValueError if args does not contain exactly 2 values.
    required_arg, optional_arg = args
    assert isinstance(required_arg, int)
    assert isinstance(optional_arg, mgp.Vertex)
```
The rest of the C API should naturally map to either top level functions or class methods as appropriate.
Loading Python Query Modules
Our current mechanism for loading modules is to look for `.so` files in the directory specified by the `--query-modules` flag. This is done when Memgraph is started. We can extend this mechanism to look for `.py` files in addition to `.so` files in the same directory and import them in an embedded Python interpreter. The only issue is embedding the interpreter in Memgraph. There are multiple choices:
- Building Memgraph and statically linking to Python.
- Building Memgraph and dynamically linking to Python, and distributing Python with Memgraph's installation.
- Building Memgraph and dynamically linking to Python, but without distributing the Python library.
- Building Memgraph and optionally loading the Python library by trying to `dlopen` it.
The first two options are only viable if the Python license allows it; this will need further investigation.
The third option adds Python as an installation dependency for Memgraph, and without it Memgraph will not run. This is problematic for users who cannot or do not want to install Python 3.
The fourth option avoids all of the issues present in the first three options, but comes at a higher implementation cost. We would need to try to `dlopen` the Python library and set up function pointers. If we succeed, we would import `.py` files from the `--query-modules` directory. On the other hand, if the user does not have Python, `dlopen` would fail and Memgraph would run without Python support.
After live discussion, we've decided to go with option 3. This way we don't have to worry about a mismatch between the Python version we support and the version users expect. Also, we should target Python 3.5, as that should be common between Debian and CentOS, for which we ship installation packages.
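On the Python side, the directory scan itself is straightforward. A hedged sketch using `importlib` follows; the function name and return shape are illustrative, and the real loader would also register the procedures each module defines.

```python
import importlib.util
import pathlib

def load_query_modules(directory):
    """Import every .py file in `directory` and return them keyed by name.

    Illustrative sketch only: it runs each file as a module named after
    its stem, mirroring how --query-modules names .so modules.
    """
    modules = {}
    for path in sorted(pathlib.Path(directory).glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        # Executing the module runs its top-level code, including any
        # mgp.read_proc decorators it uses.
        spec.loader.exec_module(module)
        modules[path.stem] = module
    return modules
```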
Performance and Implementation Problems
As previously mentioned, embedding Python introduces usability issues compared to other embeddable languages.
The first, major issue is the Global Interpreter Lock (GIL). Initializing Python starts a single global interpreter, and running multiple threads requires acquiring the GIL. In practice, this means that when multiple users run a procedure written in Python in parallel, the execution will not actually be parallel: Python's interpreter will jump between executing one user's procedure and the other's. This can be quite an issue for long-running procedures when multiple users are querying Memgraph. The solution for this issue would be Python's API for sub-interpreters. Unfortunately, support for them is rather poor, and we ran into critical bugs in the API when we tried to use them. For the time being, we will have to accept the GIL and its downsides. Perhaps in the future we will learn how to reduce how often the GIL needs to be acquired, or the sub-interpreter API will improve.
Another major issue is memory allocation. Python's C API does not have support for setting up a temporary allocator during the execution of a single function; it only has support for setting up a global heap allocator. This obviously impacts our control of memory during a query procedure invocation. Besides the potential performance penalty, a procedure could allocate much more memory than we would actually allow for the execution of a single query. This means that the options controlling the memory limit during query execution become useless. On the bright side, Python does use block-style allocators and reference counting, so the performance penalty and global memory usage should not be that terrible.
The final issue, not as major as the ones above, is the global state of the interpreter. In practice this means that any registered procedure and imported module has access to any other procedure and module. This may pollute the namespace for other users, but it should not be much of a problem because Python always keeps things under a module scope. The other, slightly bigger downside is that a malicious user could use this knowledge to modify other modules and procedures. This seems like a major issue, but if we take the bigger picture into consideration, we already have a security issue in general by invoking `dlopen` on `.so` files and potentially running arbitrary code. This is the trade-off we chose to allow users to extend Memgraph. It's up to the users to write sane extensions and protect their servers from access.