Add Tensorflow op feature spec

Reviewers: buda, teon.banek, mferencevic, msantl

Reviewed By: buda, teon.banek

Differential Revision: https://phabricator.memgraph.io/D1691
This commit is contained in:
Marin Petricevic 2018-10-29 13:28:30 +01:00
parent 4ad9cfe1f4
commit 20dcb05c50
2 changed files with 203 additions and 0 deletions

# Tensorflow Op - Technicalities
The final result should be a shared object (".so") file that can be
dynamically loaded by the Tensorflow runtime in order to directly
access the Bolt client.
## About Tensorflow
Tensorflow is usually used with Python such that the Python code is used
to define a directed acyclic computation graph. Basically no computation
is done in Python. Instead, values from Python are copied into the graph
structure as constants to be used by other Ops. The directed acyclic graph
naturally ends up with two sets of border nodes, one for inputs, one for
outputs. These are sometimes called "feeds".
Following the Python definition of the graph, during training the entire
data-processing graph/pipeline is invoked from Python as a single expression.
Evaluation is thus lazy: calling the expression executes a graph that was
fully defined beforehand.
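As a toy, framework-free sketch of this define-then-evaluate pattern (plain Python, no Tensorflow; all names here are illustrative):

```python
# Minimal illustration of deferred (lazy) evaluation: build an
# expression graph first, then evaluate it in a separate step, the way
# a Tensorflow graph is defined in Python but only run by sess.run().

class Node:
    def __init__(self, op, *inputs):
        self.op = op          # callable producing this node's value
        self.inputs = inputs  # upstream nodes

    def run(self):
        # Evaluate upstream nodes first, then apply this node's op
        return self.op(*(i.run() for i in self.inputs))

def constant(value):
    return Node(lambda: value)

def add(a, b):
    return Node(lambda x, y: x + y, a, b)

# "Graph definition" phase: nothing is computed here
x = constant(2)
y = constant(3)
z = add(x, y)

# "Evaluation" phase: the whole expression runs at once
print(z.run())  # -> 5
```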
Tensorflow internally works with tensors, i.e. n-dimensional arrays, so
all of its inputs and outputs need to be tensors as well. While it is
possible to feed data directly from Python numpy arrays straight into
Tensorflow, this is less desirable than using the Tensorflow data API
(which defines data input and processing as a Tensorflow graph) because:
1. The data API is written in C++, entirely avoiding Python, and as such
is faster.
2. The data API, unlike plain Python, is available in Tensorflow Serving,
the default way to serve Tensorflow models in production.
Once the entire input pipeline is defined via the tf.data API, its input
is basically a list of node IDs the model is supposed to work with. The
model, through the data API, knows how to connect to Memgraph and execute
openCypher queries to fetch the remaining data it needs
(for example, the features of neighbouring nodes).
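Conceptually, the Python-visible input can be that small; everything else is fetched by the Op. A rough, framework-free sketch of the data flow (`fetch_features` is a stand-in for the Op's openCypher round-trip, not a real API):

```python
# The training script only supplies node IDs; retrieving features is
# the Op's job. Here the openCypher round-trip to Memgraph is faked
# with a dict lookup so the shape of the data flow is visible.

def fetch_features(node_ids):
    # Stand-in for something like:
    #   MATCH (n:Train) WHERE n.id IN $node_ids
    #   RETURN n.id, n.features
    fake_db = {1: [0.1, 0.2], 2: [0.3, 0.4], 3: [0.5, 0.6]}
    return [(nid, fake_db[nid]) for nid in node_ids]

# What the model actually receives from Python: just IDs
input_node_ids = [1, 2, 3]
batch = fetch_features(input_node_ids)
print(batch[0])  # -> (1, [0.1, 0.2])
```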
## The Interface
I think it's best you read the official guide...
<https://www.tensorflow.org/extend/adding_an_op>
And especially the addition that specifies how data ops are special
<https://www.tensorflow.org/extend/new_data_formats>
## Compiling the TF Op
There are two options for compiling a custom op.
One of them involves pulling the TF source, adding your code to it and
compiling via bazel.
This is probably awkward to do for us and would
significantly slow down compilation.
The other method involves installing Tensorflow as a Python package and
pulling the required headers from, for example:
`/usr/local/lib/python3.6/site-packages/tensorflow/include`
We can then compile our Op with our regular build system.
This is practical since we can copy the required headers to our repo.
If necessary, we can have several versions of the headers to build several
versions of our Op for every TF version which we want to support.
(But this is unlikely to be required as the API should be stable).
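As a sketch, the resulting compiler invocation would look roughly like the following (the source file name, paths, and flags are illustrative only; a real build should take them from `tf.sysconfig.get_compile_flags()` / `get_link_flags()` of the installed package):

```python
# Sketch of the g++ command for building the Op against headers shipped
# in the Tensorflow pip package. All paths and flags are illustrative;
# query tf.sysconfig at configure time for the real values.

TF_INCLUDE = "/usr/local/lib/python3.6/site-packages/tensorflow/include"

cmd = " ".join([
    "g++", "-std=c++11", "-shared", "-fPIC", "-O2",
    "-I" + TF_INCLUDE,
    # TF 1.x pip packages are built with the old C++ ABI
    "-D_GLIBCXX_USE_CXX11_ABI=0",
    "tensorflow_ops.cc", "-o", "tensorflow_ops.so",
])
print(cmd)
```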

# Example for Using the Bolt Client Tensorflow Op
## Dynamic Loading
```python
import tensorflow as tf

mg_ops = tf.load_op_library('/usr/bin/memgraph/tensorflow_ops.so')
```
## Basic Usage
```python
dataset = mg_ops.OpenCypherDataset(
    # This is probably unfortunate as the username and password
    # get hardcoded into the graph, but for the simple case it's fine
    "hostname:7687", auth=("user", "pass"),

    # Our query
    '''
    MATCH (n:Train) RETURN n.id, n.features
    ''',

    # Cast return values to these types
    (tf.string, tf.float32))

# Some Tensorflow data API boilerplate
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

# Up to now we have only defined our computation graph, which basically
# just connects to Memgraph.
# `next_element` is not actual data but a handle to a node in the
# Tensorflow graph, which we can (and do) evaluate.
# It is a Tensorflow tensor with shape=(None, 2) and
# dtype=(tf.string, tf.float32).
# Shape `None` means the shape is unknown at definition time; it is
# dynamic and will only be known once the tensor has been evaluated.

with tf.Session() as sess:
    node_ids = sess.run(next_element)
    # `node_ids` contains IDs and features of all the nodes
    # in the graph with the label "Train".
    # It is a numpy.ndarray with shape ($n_matching_nodes, 2).
```
## Memgraph Client as a Generic Tensorflow Op
Besides the Tensorflow Data Op, we'll want to support a generic Tensorflow
Op which can be placed anywhere in the Tensorflow computation graph. It
takes an arbitrary tensor as input and produces a tensor. This would be
used in the GraphSage algorithm to fetch the lowest-level features into
Tensorflow.
```python
import numpy as np
import tensorflow as tf

requested_ids = np.array([1, 2, 3])
ids_placeholder = tf.placeholder(tf.int32)

model = mg_ops.OpenCypher(
    "hostname:7687", auth=("user", "pass"),
    """
    UNWIND $node_ids AS nid
    MATCH (n:Train {id: nid})
    RETURN n.features
    """,

    # What to call the input tensor as an openCypher parameter
    parameter_name="node_ids",

    # Type of our resulting tensor
    dtype=(tf.float32))

features = model(ids_placeholder)

with tf.Session() as sess:
    result = sess.run(features,
                      feed_dict={ids_placeholder: requested_ids})
```
This is probably easier to implement than the Data Op, so it might be a good
idea to start with.
## Production Usage
During training, in the GraphSage algorithm at least, Memgraph sits at both
the beginning and the end of the Tensorflow computation graph.
At the beginning, the Data Op provides the node IDs, which are fed into the
generic Tensorflow Op to find their neighbours, their neighbours'
neighbours, and their features.
Production usage differs in that we don't use the Data Op. The Data Op is
effectively cut off and the initial input is fed by Tensorflow serving,
with the data found in the request.
For example, a JSON request to classify a node might look like:
`POST http://host:port/v1/models/GraphSage/versions/v1:classify`
With the contents:
```json
{
    "examples": [
        {"node_id": 1},
        {"node_id": 2}
    ]
}
```
Every element of the "examples" list is an example to be computed. Each is
represented by a dict whose keys match the names of feeds in the Tensorflow
graph, and whose values are the values to feed in for that example.
The REST API then replies in kind, with the classification result in JSON.
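For instance, the request body above could be assembled from Python as follows (the host, port, and model name in the comment are illustrative):

```python
import json

# Each entry in "examples" maps feed names in the Tensorflow graph to
# the values to feed in for that example.
payload = {
    "examples": [
        {"node_id": 1},
        {"node_id": 2},
    ]
}

body = json.dumps(payload)
# This body would then be POSTed to, e.g.:
#   http://host:port/v1/models/GraphSage/versions/v1:classify
print(body)
```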
A note about adding our custom Op to Tensorflow Serving: our Op's `.so`
can be added to the Bazel build to link with Tensorflow Serving, or it can
be dynamically loaded by starting Tensorflow Serving with the
`--custom_op_paths` flag.
## Considerations
One issue is that the URL used to connect to Memgraph is hardcoded into
the Op and would thus be wrong when the graph is moved to production,
requiring some kind of hack to make it work. We probably want to solve
this by having the client Op take another tf.Variable as an input, which
would contain the connection URL and username/password.
We have to research whether this makes moving to production easy enough:
the connection string variable is still part of the graph, but it may be
easier to replace.
It is probably the best idea to utilize openCypher parameters to make
our queries flexible. The exact API as to how to declare the parameters
in Python is open to discussion.
The Data Op might not even be necessary to implement, as it is not
key for production use. It can be replaced during training with
feed dicts and either:
1. Getting the initial list of nodes via a Python Bolt client, or
2. Creating a separate Tensorflow computation graph that gets all the
relevant node IDs into Python.
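Option 1 could look roughly like this. This is a sketch only: it assumes a Bolt-speaking Python driver (the `neo4j` package is used here for illustration) and a reachable Memgraph instance, so the connection part is wrapped in a function that is not invoked:

```python
TRAIN_IDS_QUERY = "MATCH (n:Train) RETURN n.id AS id"

def fetch_train_ids(uri="bolt://hostname:7687", auth=("user", "pass")):
    # Requires a Bolt driver package (e.g. `neo4j`) and a running
    # server; shown for illustration, not executed in this sketch.
    from neo4j import GraphDatabase
    driver = GraphDatabase.driver(uri, auth=auth)
    with driver.session() as session:
        return [record["id"] for record in session.run(TRAIN_IDS_QUERY)]

# The resulting ID list would then be fed into the generic Op via a
# feed dict, replacing the Data Op during training.
print(TRAIN_IDS_QUERY)
```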