Add a feature spec for Python Query Modules

Reviewers: mferencevic, buda, ipaljak, dsantl, tlastre
Reviewed By: mferencevic, buda, ipaljak, dsantl
Differential Revision: https://phabricator.memgraph.io/D2629
2020-01-23 15:34:40 +01:00 · 2020-01-23 15:34:40 +01:00 · 029a36eab1
commit 029a36eab1
parent a1fa7de115
1 changed files with 185 additions and 0 deletions
--- a/docs/feature_spec/python-query-modules.md
+++ b/docs/feature_spec/python-query-modules.md
@@ -0,0 +1,185 @@
+# Python 3 Query Modules
+
+## Introduction
+
+Memgraph exposes a C API for writing so-called Query Modules. These
+modules contain definitions of procedures which can be invoked through the
+query language using the `CALL ... YIELD ...` syntax. This mechanism allows
+database users to extend Memgraph with their own algorithms and
+functionalities.
+
+Using a low-level language like C can be quite cumbersome for writing modules,
+so it seems natural to add support for a higher-level language on top of the
+existing C API.
+
+There are languages designed exactly for this purpose of extending C with
+high-level constructs, for example Lua and Guile. Instead of those, we have
+chosen Python 3 as the first high-level language we will support. The primary
+reason is that it is very popular, so more people should be able to write
+modules. Another benefit that follows from Python's popularity is its large
+ecosystem of libraries, especially graph algorithm libraries like NetworkX.
+Python does have significant performance and implementation downsides compared
+to Lua and Guile; these are described in more detail later in this document.
+
+## Python 3 API Overview
+
+The Python 3 API should be as user friendly as possible and should look
+Pythonic. This implies that some functions from the C API will not map to the
+exact same functions. The most obvious case for a Pythonic approach is
+registering procedures of a query module. Let's take a look at the C example
+and its transformation to Python.
+
+```c
+static void procedure(const struct mgp_list *args,
+                      const struct mgp_graph *graph, struct mgp_result *result,
+                      struct mgp_memory *memory);
+
+int mgp_init_module(struct mgp_module *module, struct mgp_memory *memory) {
+  struct mgp_proc *proc =
+      mgp_module_add_read_procedure(module, "procedure", procedure);
+  if (!proc) return 1;
+  if (!mgp_proc_add_arg(proc, "required_arg",
+                        mgp_type_nullable(mgp_type_any())))
+    return 1;
+  struct mgp_value *null_value = mgp_value_make_null(memory);
+  if (!mgp_proc_add_opt_arg(proc, "optional_arg",
+                            mgp_type_nullable(mgp_type_any()), null_value)) {
+    mgp_value_destroy(null_value);
+    return 1;
+  }
+  mgp_value_destroy(null_value);
+  if (!mgp_proc_add_result(proc, "result", mgp_type_string())) return 1;
+  if (!mgp_proc_add_result(proc, "args",
+                           mgp_type_list(mgp_type_nullable(mgp_type_any()))))
+    return 1;
+  return 0;
+}
+```
+
+In Python things should be a lot simpler.
+
+```Python
+# mgp.read_proc obtains the procedure name via __name__ attribute of a function.
+@mgp.read_proc(# Arguments passed to multiple mgp_proc_add_arg calls
+               (('required_arg', mgp.Nullable(mgp.Any)), ('optional_arg', mgp.Nullable(mgp.Any), None)),
+               # Result fields passed to multiple mgp_proc_add_result calls
+               (('result', str), ('args', mgp.List(mgp.Nullable(mgp.Any)))))
+def procedure(args, graph, result, memory):
+    pass
+```
+
+Here we have replaced the `mgp_module_*` and `mgp_proc_*` C API with a much
+simpler decorator function in Python, `mgp.read_proc`. The types of arguments
+and result fields can be either our own types or Python builtin types that map
+to supported `mgp_value` types. The builtin types we expect to support are
+`bool`, `str`, `int`, `float` and `dict` (for map values), while the rest of
+the types are provided via our Python API. Optionally, we can add convenience
+support for the `object` type, which would map to `mgp.Nullable(mgp.Any)`, and
+for `list`, which would map to `mgp.List(mgp.Nullable(mgp.Any))`. It also
+makes sense to check whether we can leverage Python's `typing` module here.
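+
+To make the registration idea more concrete, here is a minimal sketch of how
+such a decorator could record a procedure's signature. It is not the real
+`mgp` implementation: the module-level `read_proc` and the `_REGISTERED`
+dictionary are hypothetical stand-ins, and the sketch uses the `object` and
+`list` convenience types mentioned above instead of `mgp.Nullable(mgp.Any)`.
+
+```Python
+# Hypothetical stand-in for the proposed `mgp` module, for illustration only.
+_REGISTERED = {}
+
+
+def read_proc(args, results):
+    """Return a decorator that records a procedure under its __name__."""
+    def decorator(func):
+        _REGISTERED[func.__name__] = {
+            "func": func,
+            # 2-tuples are required arguments, 3-tuples carry a default value.
+            "required": [a for a in args if len(a) == 2],
+            "optional": [a for a in args if len(a) == 3],
+            "results": list(results),
+        }
+        return func
+    return decorator
+
+
+@read_proc((("required_arg", object), ("optional_arg", object, None)),
+           (("result", str), ("args", list)))
+def procedure(args, graph, result, memory):
+    pass
+```
+
+A real implementation would forward the recorded entries to
+`mgp_proc_add_arg`, `mgp_proc_add_opt_arg` and `mgp_proc_add_result` instead
+of storing them in a dictionary.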
+
+Another Pythonic change is to remove the `mgp_value` C API from Python
+altogether. This means that the arguments a Python procedure receives are not
+`mgp_value` instances but rather `PyObject` instances. In other words, our
+implementation would immediately marshal each `mgp_value` to the corresponding
+Python type. We would still need to provide our own Python types for
+non-builtin things like `mgp.Vertex` (equivalent to `mgp_vertex`) and others.
+
+Continuing from our example above, let's say the procedure was invoked through
+Cypher using the following query.
+
+    MATCH (n) CALL py_module.procedure(42, n) YIELD *;
+
+The Python procedure could then do the following and complete without raising
+either the AssertionError or the ValueError.
+
+```Python
+def procedure(args, graph, result, memory):
+    assert isinstance(args, list)
+    # Unpacking throws ValueError if args does not contain exactly 2 values.
+    required_arg, optional_arg = args
+    assert isinstance(required_arg, int)
+    assert isinstance(optional_arg, mgp.Vertex)
+```
+
+The rest of the C API should naturally map to either top level functions or
+class methods as appropriate.
+
+## Loading Python Query Modules
+
+Our current mechanism for loading the modules is to look for `.so` files in
+the directory specified by the `--query-modules` flag. This is done when
+Memgraph is started. We can extend this mechanism to look for `.py` files in
+addition to `.so` files in the same directory and import them in the embedded
+Python interpreter. The only issue is embedding the interpreter in Memgraph.
+There are multiple choices:
+
+  1. Building Memgraph and statically linking to Python.
+  2. Building Memgraph and dynamically linking to Python, and distributing
+     Python with Memgraph's installation.
+  3. Building Memgraph and dynamically linking to Python, but without
+     distributing the Python library.
+  4. Building Memgraph and optionally loading Python library by trying to
+     `dlopen` it.
+
+The first two options are only viable if the Python license allows it, and
+this will need further investigation.
+
+The third option adds Python as an installation dependency for Memgraph, and
+without it Memgraph will not run. This is problematic for users who cannot
+or do not want to install Python 3.
+
+The fourth option avoids all of the issues present in the first 3 options, but
+comes at a higher implementation cost. We would need to try to `dlopen` the
+Python library and set up function pointers. If we succeed, we would import
+`.py` files from the `--query-modules` directory. On the other hand, if the
+user does not have Python, `dlopen` would fail and Memgraph would run without
+Python support.
+
+After live discussion, we've decided to go with option 3. This way we don't
+have to worry about a mismatch between the Python versions we support and the
+ones users expect. Also, we should target Python 3.5 as that should be common
+between Debian and CentOS, for which we ship installation packages.
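+
+The directory scan described at the start of this section could look roughly
+like the sketch below. It is written in plain Python for illustration, whereas
+Memgraph would drive the import through the embedded interpreter's C API. The
+function name `load_python_modules` is hypothetical.
+
+```Python
+import importlib.util
+from pathlib import Path
+
+
+def load_python_modules(directory):
+    """Import every .py file in `directory`, keyed by module name."""
+    modules = {}
+    for path in sorted(Path(directory).glob("*.py")):
+        # Build an import spec straight from the file path, so the directory
+        # does not need to be on sys.path.
+        spec = importlib.util.spec_from_file_location(path.stem, str(path))
+        module = importlib.util.module_from_spec(spec)
+        spec.loader.exec_module(module)
+        modules[path.stem] = module
+    return modules
+```
+
+Files with other extensions, such as `.so`, are skipped by the glob pattern
+and would continue to go through the existing loading path.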
+
+## Performance and Implementation Problems
+
+As previously mentioned, embedding Python introduces usability issues compared
+to other embeddable languages.
+
+The first major issue is the Global Interpreter Lock (GIL). Initializing
+Python starts a single global interpreter, and running multiple threads
+requires acquiring the GIL. In practice, this means that when multiple users
+run a procedure written in Python in parallel, the execution will not actually
+be parallel. Python's interpreter will jump between executing one user's
+procedure and the other's. This can be quite an issue for long running
+procedures when multiple users are querying Memgraph. The solution for this
+issue is Python's API for sub-interpreters. Unfortunately, the support for
+them is rather poor, and we ran into a lot of critical bugs in the API when we
+tried to use it. For the time being, we will have to accept the GIL and its
+downsides. Perhaps in the future we will learn how to reduce how often the GIL
+is acquired, or the sub-interpreter API will be improved.
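+
+A rough way to observe the effect is to run the same CPU-bound function in
+several threads and time it. The sketch below assumes a standard CPython build
+with the GIL enabled; the helper names are hypothetical and the measured time
+depends on the machine, so no expected timing is given.
+
+```Python
+import threading
+import time
+
+
+def busy_sum(n):
+    """CPU-bound work standing in for a long running procedure."""
+    total = 0
+    for i in range(n):
+        total += i
+    return total
+
+
+def run_in_threads(n, thread_count):
+    """Run busy_sum(n) in thread_count threads; return results and seconds."""
+    results = [None] * thread_count
+
+    def worker(index):
+        results[index] = busy_sum(n)
+
+    threads = [threading.Thread(target=worker, args=(i,))
+               for i in range(thread_count)]
+    start = time.perf_counter()
+    for thread in threads:
+        thread.start()
+    for thread in threads:
+        thread.join()
+    return results, time.perf_counter() - start
+```
+
+With the GIL held during bytecode execution, the elapsed time for two threads
+is expected to be close to the time of running `busy_sum` twice in a row,
+rather than close to a single run.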
+
+Another major issue is memory allocation. Python's C API does not have support
+for setting up a temporary allocator during execution of a single function.
+It only has support for setting up a global heap allocator. This limits our
+control of memory during a query procedure invocation. Besides a potential
+performance penalty, a procedure could allocate much more memory than we would
+actually allow for execution of a single query. This means that options
+controlling the memory limit during query execution have no effect on Python
+procedures. On the bright side, Python does use block style allocators and
+reference counting, so the performance penalty and global memory usage should
+not be that terrible.
+
+The final issue that isn't as major as the ones above is the global state of
+the interpreter. In practice this means that any registered procedure and
+imported module has access to any other procedure and module. This may pollute
+the namespace for other users, but it should not be much of a problem because
+Python always has things under a module scope. The other, slightly bigger
+downside is that a malicious user could use this knowledge to modify other
+modules and procedures. This seems like a major issue, but if we take the
+bigger picture into consideration, we already have a security issue in general
+by invoking `dlopen` on `.so` and potentially running arbitrary code. This was
+the trade-off we chose to allow users to extend Memgraph. It's up to the users
+to write sane extensions and protect their servers from unwanted access.
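+
+The shared interpreter state can be illustrated with a short sketch. The
+module name `victim_module` and the `tamper` function are hypothetical; they
+stand in for two unrelated query modules loaded into the same interpreter.
+
+```Python
+import sys
+import types
+
+# A hypothetical query module registered in the interpreter-wide module table.
+victim = types.ModuleType("victim_module")
+exec("def procedure():\n    return 'original'", victim.__dict__)
+sys.modules["victim_module"] = victim
+
+
+def tamper():
+    """Code from any other module can reach the victim via sys.modules."""
+    target = sys.modules["victim_module"]
+    target.procedure = lambda: "tampered"
+
+
+before = victim.procedure()
+tamper()
+after = victim.procedure()
+```
+
+After `tamper` runs, every caller of `victim_module.procedure` sees the
+replaced function, because all modules share one `sys.modules` table.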