memgraph/docs/feature_spec/python-query-modules.md
Teon Banek 029a36eab1 Add a feature spec for Python Query Modules
Reviewers: mferencevic, buda, ipaljak, dsantl, tlastre

Reviewed By: mferencevic, buda, ipaljak, dsantl

Differential Revision: https://phabricator.memgraph.io/D2629
2020-01-23 17:10:23 +01:00


Python 3 Query Modules

Introduction

Memgraph exposes a C API for writing so-called Query Modules. These modules contain definitions of procedures that can be invoked through the query language using the CALL ... YIELD ... syntax. This mechanism allows database users to extend Memgraph with their own algorithms and functionality.

Using a low-level language like C can be quite cumbersome for writing modules, so it seems natural to add support for a higher-level language on top of the existing C API.

There are languages designed exactly for this purpose of extending C with high-level constructs, for example Lua and Guile. Instead of those, we have chosen Python 3 as the first high-level language we will support. The primary reason is its popularity, which means more people should be able to write modules. Another benefit that follows from that popularity is the large ecosystem of libraries, especially graph algorithm libraries such as NetworkX. Python does have significant performance and implementation downsides compared to Lua and Guile; these are described in more detail later in this document.

Python 3 API Overview

The Python 3 API should be as user-friendly as possible and should look Pythonic. This implies that some functions from the C API will not map to the exact same functions in Python. The most obvious case for a Pythonic approach is registering the procedures of a query module. Let's take a look at the C example and its transformation to Python.

static void procedure(const struct mgp_list *args,
                      const struct mgp_graph *graph, struct mgp_result *result,
                      struct mgp_memory *memory);

int mgp_init_module(struct mgp_module *module, struct mgp_memory *memory) {
  struct mgp_proc *proc =
      mgp_module_add_read_procedure(module, "procedure", procedure);
  if (!proc) return 1;
  if (!mgp_proc_add_arg(proc, "required_arg",
                        mgp_type_nullable(mgp_type_any())))
    return 1;
  struct mgp_value *null_value = mgp_value_make_null(memory);
  if (!mgp_proc_add_opt_arg(proc, "optional_arg",
                            mgp_type_nullable(mgp_type_any()), null_value)) {
    mgp_value_destroy(null_value);
    return 1;
  }
  mgp_value_destroy(null_value);
  if (!mgp_proc_add_result(proc, "result", mgp_type_string())) return 1;
  if (!mgp_proc_add_result(proc, "args",
                           mgp_type_list(mgp_type_nullable(mgp_type_any()))))
    return 1;
  return 0;
}

In Python things should be a lot simpler.

# mgp.read_proc obtains the procedure name via __name__ attribute of a function.
@mgp.read_proc(# Arguments passed to multiple mgp_proc_add_arg calls
               (('required_arg', mgp.Nullable(mgp.Any)), ('optional_arg', mgp.Nullable(mgp.Any), None)),
               # Result fields passed to multiple mgp_proc_add_result calls
               (('result', str), ('args', mgp.List(mgp.Nullable(mgp.Any)))))
def procedure(args, graph, result, memory):
    pass

Here we have replaced the mgp_module_* and mgp_proc_* C API with a much simpler decorator function in Python, mgp.read_proc. The types of arguments and result fields can be either our own types or Python builtin types that map to supported mgp_value types. The builtin types we ought to support are bool, str, int, float and dict (for map values), while the rest of the types are provided via our Python API. Optionally, we can add convenience support for the object type, which would map to mgp.Nullable(mgp.Any), and for list, which would map to mgp.List(mgp.Nullable(mgp.Any)). It also makes sense to check whether we can leverage Python's typing module here.
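To make the idea of leveraging the typing module more concrete, here is a minimal sketch of a decorator that derives the argument list from type annotations. Everything in it is an assumption made for illustration: this `read_proc` is a local stand-in and not the proposed `mgp.read_proc`, the procedure takes its arguments as named parameters instead of `(args, graph, result, memory)`, result fields are left out, and the type labels are plain strings, not C API identifiers. It relies on `typing.get_origin` and `typing.get_args`, which need Python 3.8 or newer, so it is newer than the 3.5 target discussed later.

```python
import inspect
import typing

# Builtin types the spec expects to accept directly. The labels on the right
# are illustrative strings only, not identifiers from the real C API.
BUILTIN_TYPES = {bool: "bool", str: "string", int: "int", float: "float", dict: "map"}


def describe_type(annotation):
    """Translate a Python annotation into an illustrative type description."""
    if annotation is object:
        # Convenience mapping from the spec: object -> Nullable(Any).
        return "nullable(any)"
    if annotation is list:
        # Convenience mapping from the spec: list -> List(Nullable(Any)).
        return "list(nullable(any))"
    if annotation in BUILTIN_TYPES:
        return BUILTIN_TYPES[annotation]
    origin = typing.get_origin(annotation)
    args = typing.get_args(annotation)
    if origin is list and args:
        return "list({})".format(describe_type(args[0]))
    if origin is typing.Union and len(args) == 2 and type(None) in args:
        # typing.Optional[T] is Union[T, None].
        inner = args[0] if args[1] is type(None) else args[1]
        return "nullable({})".format(describe_type(inner))
    raise TypeError("unsupported annotation: {!r}".format(annotation))


def read_proc(func):
    """Hypothetical decorator: derive the argument list from annotations."""
    arguments = []
    for name, param in inspect.signature(func).parameters.items():
        entry = (name, describe_type(param.annotation))
        if param.default is not inspect.Parameter.empty:
            # A default value marks the argument as optional.
            entry += (param.default,)
        arguments.append(entry)
    # A real implementation would call into the C API here; the sketch only
    # records what it would register.
    func.signature_description = (func.__name__, arguments)
    return func


@read_proc
def procedure(required_arg: object, optional_arg: typing.Optional[int] = None):
    pass
```

The appeal of this shape is that the argument names, types and defaults are written once, in the function signature, instead of being repeated in a tuple passed to the decorator.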

Another Pythonic change is to remove the mgp_value C API from Python altogether. This means that the arguments a Python procedure receives are not mgp_value instances but rather PyObject instances. In other words, our implementation would immediately marshal each mgp_value to the corresponding type in Python. We would also need to provide our own Python types for non-builtin things such as mgp.Vertex (equivalent to mgp_vertex) and others.

Continuing with the `procedure` registered earlier, let's say it was invoked through Cypher using the following query.

MATCH (n) CALL py_module.procedure(42, n) YIELD *;

The Python procedure could then do the following and complete without raising either AssertionError or ValueError.

def procedure(args, graph, result, memory):
    assert isinstance(args, list)
    # Unpacking throws ValueError if args does not contain exactly 2 values.
    required_arg, optional_arg = args
    assert isinstance(required_arg, int)
    assert isinstance(optional_arg, mgp.Vertex)

The rest of the C API should naturally map to either top level functions or class methods as appropriate.
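The marshalling step described in this section, converting each mgp_value into a native Python object before the procedure sees it, could look roughly like the sketch below. The real conversion would happen in C against the actual mgp_value API; `RawValue`, its string tags and this `Vertex` class are hypothetical stand-ins used only to show the recursive shape of the conversion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawValue:
    """Hypothetical stand-in for a C-side mgp_value: a type tag and a payload."""
    tag: str
    payload: object = None


@dataclass(frozen=True)
class Vertex:
    """Hypothetical stand-in for the proposed mgp.Vertex wrapper type."""
    id: int


def to_python(value):
    """Recursively convert a RawValue into builtin or wrapper Python objects."""
    if value.tag == "null":
        return None
    if value.tag in ("bool", "int", "float", "string"):
        # Scalars map directly onto Python builtins.
        return value.payload
    if value.tag == "list":
        return [to_python(item) for item in value.payload]
    if value.tag == "map":
        return {key: to_python(item) for key, item in value.payload.items()}
    if value.tag == "vertex":
        # Non-builtin values get a dedicated Python type.
        return Vertex(value.payload)
    raise TypeError("unsupported value tag: {!r}".format(value.tag))


# Arguments of a call such as py_module.procedure(42, n), as the procedure
# would receive them after marshalling.
args = to_python(RawValue("list", [RawValue("int", 42), RawValue("vertex", 7)]))
required_arg, optional_arg = args
```

With this conversion in place, the assertions from the example procedure hold without the procedure ever touching a raw value.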

Loading Python Query Modules

Our current mechanism for loading the modules is to look for .so files in the directory specified by the --query-modules flag. This is done when Memgraph is started. We can extend this mechanism to look for .py files in addition to .so files in the same directory and import them in the embedded Python interpreter. The only issue is embedding the interpreter in Memgraph. There are multiple choices:

  1. Building Memgraph and statically linking to Python.
  2. Building Memgraph and dynamically linking to Python, and distributing Python with Memgraph's installation.
  3. Building Memgraph and dynamically linking to Python, but without distributing the Python library.
  4. Building Memgraph and optionally loading Python library by trying to dlopen it.

The first two options are only viable if the Python license allows it, and this will need further investigation.

The third option adds Python as an installation dependency for Memgraph, and without it Memgraph will not run. This is problematic for users who cannot or do not want to install Python 3.

The fourth option avoids all of the issues present in the first three options, but comes at a higher implementation cost. We would need to try to dlopen the Python library and set up function pointers. If we succeed, we would import .py files from the --query-modules directory. On the other hand, if the user does not have Python, dlopen would fail and Memgraph would run without Python support.
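The control flow of the fourth option can be sketched as follows. Memgraph itself would do this in C or C++ with dlopen and dlsym; Python's ctypes wraps the same mechanism and is used here only to keep the illustration short. The library file name in the usage comment is an assumption, and the symbol list is a small sample, not the full set Memgraph would need.

```python
import ctypes

# A small sample of CPython C API entry points that would need resolving.
REQUIRED_SYMBOLS = ["Py_Initialize", "Py_Finalize"]


def try_load_library(name):
    """Return a handle to the shared library, or None if loading fails."""
    try:
        return ctypes.CDLL(name)
    except OSError:
        return None


def resolve_symbols(library, names):
    """Build a table of function pointers, or return None if one is missing."""
    table = {}
    for symbol in names:
        function = getattr(library, symbol, None)
        if function is None:
            return None
        table[symbol] = function
    return table


def python_support_available(library_name):
    """True only if the library loads and every required symbol resolves."""
    library = try_load_library(library_name)
    if library is None:
        # No Python on this machine: keep running, only .so modules are loaded.
        return False
    return resolve_symbols(library, REQUIRED_SYMBOLS) is not None


# Usage, with an assumed library file name:
#   python_support_available("libpython3.5m.so.1.0")
```

The extra cost mentioned above is visible even in this small sketch: every C API function used has to be looked up by name and called through a pointer instead of being linked directly.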

After live discussion, we've decided to go with option 3. This way we don't have to worry about a mismatch between the Python versions we support and the ones users expect. Also, we should target Python 3.5, as that should be common between Debian and CentOS, for which we ship installation packages.
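Independent of how the interpreter ends up being linked, the directory scan described at the start of this section could look roughly like the sketch below. It is written in Python for brevity, although the actual scan would live in Memgraph's startup code. The function names are hypothetical, and using the file stem as the module name is an assumption based on the `py_module.procedure` call syntax.

```python
import importlib.util
from pathlib import Path


def find_query_modules(directory):
    """Split the files in `directory` into shared-library and Python modules."""
    shared_libraries, python_files = [], []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        if path.suffix == ".so":
            shared_libraries.append(path)
        elif path.suffix == ".py":
            python_files.append(path)
    return shared_libraries, python_files


def import_python_module(path):
    """Import a .py file by path, naming the module after the file stem."""
    spec = importlib.util.spec_from_file_location(path.stem, str(path))
    module = importlib.util.module_from_spec(spec)
    # Executing the module body is where decorators would register procedures.
    spec.loader.exec_module(module)
    return module
```

Files with any other extension are ignored, so existing directories containing only .so modules behave exactly as before.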

Performance and Implementation Problems

As previously mentioned, embedding Python introduces usability issues compared to other embeddable languages.

The first major issue is the Global Interpreter Lock (GIL). Initializing Python starts a single global interpreter, and running multiple threads requires acquiring the GIL. In practice, this means that when multiple users run a procedure written in Python in parallel, the execution will not actually be parallel: Python's interpreter will switch between executing one user's procedure and the other's. This can be quite an issue for long-running procedures when multiple users are querying Memgraph. The solution for this issue would be Python's API for sub-interpreters. Unfortunately, support for them is rather poor, and we ran into a lot of critical bugs when we tried to use them. For the time being, we will have to accept the GIL and its downsides. Perhaps in the future we will learn how to reduce how often the GIL has to be acquired, or the sub-interpreter API will improve.
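The effect of the GIL on concurrent procedures can be observed with a small experiment. The sketch below runs the same CPU-bound function first sequentially and then in one thread per simulated user, and reports the wall-clock time of each run. `count_down` is a made-up stand-in for a long-running procedure. On a CPython build with the GIL, the threaded run is typically not faster than the sequential one; the exact numbers depend on the machine and the interpreter version, so none are shown here.

```python
import threading
import time


def count_down(n):
    """CPU-bound stand-in for a long-running query procedure."""
    while n > 0:
        n -= 1


def run_sequential(work, users):
    """Run the procedure once per user, one after another."""
    start = time.perf_counter()
    for _ in range(users):
        count_down(work)
    return time.perf_counter() - start


def run_threaded(work, users):
    """Run the procedure once per user, each in its own thread."""
    threads = [
        threading.Thread(target=count_down, args=(work,)) for _ in range(users)
    ]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    print("sequential: {:.2f}s".format(run_sequential(5_000_000, 2)))
    print("threaded:   {:.2f}s".format(run_threaded(5_000_000, 2)))
```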

Another major issue is memory allocation. Python's C API does not support setting up a temporary allocator for the duration of a single function call. It only supports setting up a global heap allocator. This limits our control of memory during a query procedure invocation. Besides a potential performance penalty, a procedure could allocate much more memory than we would actually allow for the execution of a single query. This means that the options controlling the memory limit during query execution are ineffective for Python procedures. On the bright side, Python does use block-style allocators and reference counting, so the performance penalty and global memory usage should not be that terrible.

The final issue, which isn't as major as the ones above, is the global state of the interpreter. In practice this means that any registered procedure and imported module has access to any other procedure and module. This may pollute the namespace for other users, but it should not be much of a problem because Python always keeps things under a module scope. The other, slightly bigger downside is that a malicious user could use this knowledge to modify other modules and procedures. This seems like a major issue, but if we take the bigger picture into consideration, we already have a security issue in general by invoking dlopen on .so files and potentially running arbitrary code. This was the trade-off we chose to allow users to extend Memgraph. It's up to the users to write sane extensions and protect their servers from unauthorized access.
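The shared interpreter state is easy to demonstrate in plain Python. In the sketch below, two modules are created in the same interpreter and the second one replaces a procedure defined by the first. The module names and the `make_module` helper are hypothetical and exist only for this illustration; real query modules would be imported from .py files.

```python
import sys
import types


def make_module(name, source):
    """Create a module from source text and register it in sys.modules."""
    module = types.ModuleType(name)
    # Register first so that the module is importable by name.
    sys.modules[name] = module
    exec(source, module.__dict__)
    return module


trusted = make_module(
    "trusted_module",
    """
def procedure():
    return "original"
""",
)

# A second module loaded into the same interpreter can import the first one
# and overwrite its procedure, because all modules share sys.modules.
make_module(
    "other_module",
    """
import trusted_module
trusted_module.procedure = lambda: "replaced"
""",
)

result = trusted.procedure()
```

After the second module is loaded, `result` holds the string returned by the replacement function, not by the original procedure.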