From 20dcb05c50ccd0c99d23e00abd3a9b591477b3e1 Mon Sep 17 00:00:00 2001
From: Marin Petricevic
Date: Mon, 29 Oct 2018 13:28:30 +0100
Subject: [PATCH] Add Tensorflow op feature spec

Reviewers: buda, teon.banek, mferencevic, msantl

Reviewed By: buda, teon.banek

Differential Revision: https://phabricator.memgraph.io/D1691
---
 .../tensorflow_op/technicalities.md |  61 ++++++++
 .../tensorflow_op/usage_example.md  | 142 ++++++++++++++++++
 2 files changed, 203 insertions(+)
 create mode 100644 docs/feature_spec/tensorflow_op/technicalities.md
 create mode 100644 docs/feature_spec/tensorflow_op/usage_example.md

diff --git a/docs/feature_spec/tensorflow_op/technicalities.md b/docs/feature_spec/tensorflow_op/technicalities.md
new file mode 100644
index 000000000..0241c8152
--- /dev/null
+++ b/docs/feature_spec/tensorflow_op/technicalities.md
@@ -0,0 +1,61 @@

# Tensorflow Op - Technicalities

The final result should be a shared object (".so") file that can be
dynamically loaded by the Tensorflow runtime in order to directly
access the Bolt client.

## About Tensorflow

Tensorflow is usually used from Python: the Python code defines a
directed acyclic computation graph, but essentially no computation is
done in Python itself. Instead, values from Python are copied into the
graph structure as constants, to be used by other ops. The directed
acyclic graph naturally ends up with two sets of border nodes, one for
inputs and one for outputs. These are sometimes called "feeds".

Once the graph has been defined in Python, the entire data processing
graph/pipeline is invoked from Python as a single expression during
training. This amounts to lazy evaluation, since the expression being
evaluated was defined well in advance.

Tensorflow internally works with tensors, i.e. n-dimensional arrays,
so all of its inputs and outputs need to be tensors as well. While it
is possible to feed data straight from Python's numpy arrays into
Tensorflow, this is less desirable than using the Tensorflow data API
(which defines data input and processing as a Tensorflow graph)
because:

  1. The data API is written in C++ and entirely avoids Python, and is
     therefore faster.
  2. The data API, unlike Python, is available in Tensorflow Serving,
     the default way to serve Tensorflow models in production.

Once the entire input pipeline is defined via the tf.data API, its
input is basically a list of node IDs the model is supposed to work
with. The model, through the data API, knows how to connect to Memgraph
and execute openCypher queries in order to get the remaining data it
needs (for example, the features of neighbouring nodes).

## The Interface

I think it's best you read the official guide...

And especially the addition that specifies how data ops are special.

## Compiling the TF Op

There are two options for compiling a custom op. One of them involves
pulling the TF source, adding our code to it and compiling via Bazel.
This is probably awkward for us to do and would significantly slow down
compilation.

The other method involves installing Tensorflow as a Python package and
pulling the required headers from, for example,
`/usr/local/lib/python3.6/site-packages/tensorflow/include`. We can
then compile our op with our regular build system.

This is practical since we can copy the required headers into our repo.
If necessary, we can keep several versions of the headers and build a
version of our op for every TF version we want to support (but this is
unlikely to be required, as the API should be stable).
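As a hint for the second approach, TF exposes the header location and
the suggested build flags through `tf.sysconfig`; a minimal sketch,
assuming a pip-installed Tensorflow (the exact paths and flags vary per
installation and TF version):

```python3
import tensorflow as tf

# Directory containing the bundled headers, e.g.
# /usr/local/lib/python3.6/site-packages/tensorflow/include
print(tf.sysconfig.get_include())

# Compiler and linker flags for building a custom op
# against this Tensorflow installation
print(tf.sysconfig.get_compile_flags())
print(tf.sysconfig.get_link_flags())
```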
diff --git a/docs/feature_spec/tensorflow_op/usage_example.md b/docs/feature_spec/tensorflow_op/usage_example.md
new file mode 100644
index 000000000..68dbbb2c2
--- /dev/null
+++ b/docs/feature_spec/tensorflow_op/usage_example.md
@@ -0,0 +1,142 @@

# Example for Using the Bolt Client Tensorflow Op

## Dynamic Loading

```python3
import tensorflow as tf

mg_ops = tf.load_op_library('/usr/bin/memgraph/tensorflow_ops.so')
```

## Basic Usage

```python3
dataset = mg_ops.OpenCypherDataset(
    # This is probably unfortunate, as the username and password
    # get hardcoded into the graph, but for the simple case it's fine
    "hostname:7687", auth=("user", "pass"),

    # Our query
    '''
    MATCH (n:Train) RETURN n.id, n.features
    ''',

    # Cast return values to these types
    (tf.string, tf.float32))

# Some Tensorflow data API boilerplate
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

# Up to this point we have only defined our computation graph, which
# basically just connects to Memgraph.
# `next_element` is not actual data but a handle to a node in the
# Tensorflow graph, which we can (and do) evaluate.
# It is a Tensorflow tensor with shape=(None, 2) and
# dtype=(tf.string, tf.float32).
# A shape of `None` means the shape is unknown at definition time:
# it is dynamic and will only be known once the tensor is evaluated.

with tf.Session() as sess:
    node_ids = sess.run(next_element)
    # `node_ids` contains the IDs and features of all the nodes
    # in the graph with the label "Train".
    # It is a numpy.ndarray with shape (n_matching_nodes, 2).
```

## Memgraph Client as a Generic Tensorflow Op

Besides the Tensorflow data op, we'll want to support a generic
Tensorflow op which can be put anywhere in the Tensorflow computation
graph. It takes an arbitrary tensor as input and produces a tensor.
This would be used in the GraphSage algorithm to fetch the lowest-level
features into Tensorflow.

```python3
import numpy as np

requested_ids = np.array([1, 2, 3])
ids_placeholder = tf.placeholder(tf.int32)

model = mg_ops.OpenCypher(
    "hostname:7687", auth=("user", "pass"),
    """
    UNWIND $node_ids AS nid
    MATCH (n:Train {id: nid})
    RETURN n.features
    """,

    # What to call the input tensor as an openCypher parameter
    parameter_name="node_ids",

    # Type of our resulting tensor
    dtype=(tf.float32))

features = model(ids_placeholder)

with tf.Session() as sess:
    result = sess.run(features,
                      feed_dict={ids_placeholder: requested_ids})
```

This is probably easier to implement than the data op, so it might be a
good idea to start with it.

## Production Usage

During training, in the GraphSage algorithm at least, Memgraph sits at
both the beginning and the end of the Tensorflow computation graph. At
the beginning, the data op provides the node IDs, which are fed into
the generic Tensorflow op to find their neighbours, those neighbours'
neighbours, and their features.

Production usage differs in that we don't use the data op. The data op
is effectively cut off, and the initial input is fed by Tensorflow
Serving, with the data found in the request.

For example, a JSON request to classify a node might look like:

`POST http://host:port/v1/models/GraphSage/versions/v1:classify`

With the contents:

```json
{
  "examples": [
    {"node_id": 1},
    {"node_id": 2}
  ]
}
```

Every element of the "examples" list is an example to be computed. Each
is represented by a dict whose keys match the names of feeds in the
Tensorflow graph, and whose values are the values we want fed in for
that example. The REST API then replies in kind, with the
classification result in JSON.
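For illustration, such a classify request could be issued from Python
with the `requests` library; a minimal sketch, where the host, port and
model name are the placeholders from the example above:

```python3
import requests

# Ask Tensorflow Serving to classify two nodes by their IDs
response = requests.post(
    "http://host:port/v1/models/GraphSage/versions/v1:classify",
    json={"examples": [{"node_id": 1}, {"node_id": 2}]})

# The reply is JSON as well, with one result per example
print(response.json())
```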
A note about adding our custom op to Tensorflow Serving: our op's .so
can be added to the Bazel build so it is linked into Tensorflow
Serving, or it can be dynamically loaded by starting Tensorflow Serving
with the `--custom_op_paths` flag.

## Considerations

One possible issue is that the URL used to connect to Memgraph is
hardcoded into the op and would thus be wrong when the graph is moved
to production, requiring some kind of hack to make it work. We probably
want to solve this by having the client op take another tf.Variable as
an input, which would contain the connection URL and username/password.
We have to research whether this makes the move to production easy
enough, as the connection string variable is still a part of the graph,
but perhaps easier to replace.

It is probably best to utilize openCypher parameters to make our
queries flexible. The exact Python API for declaring the parameters is
open to discussion.

The data op might not even be necessary to implement, as it is not key
for production use. In training mode it can be replaced with feed dicts
and either:

  1. getting the initial list of nodes via a Python Bolt client (see
     the sketch below), or
  2. creating a separate Tensorflow computation graph that gets all the
     relevant node IDs into Python.
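As a rough illustration of the first option, the initial IDs could be
fetched with a Python Bolt driver and passed in through a feed dict; a
minimal sketch, assuming the `neo4j` driver package and reusing the
hypothetical `features` and `ids_placeholder` names from the generic op
example above:

```python3
from neo4j import GraphDatabase

# Fetch the IDs of all training nodes over Bolt
driver = GraphDatabase.driver("bolt://hostname:7687",
                              auth=("user", "pass"))
with driver.session() as session:
    records = session.run("MATCH (n:Train) RETURN n.id")
    train_ids = [record["n.id"] for record in records]

# Feed the IDs into the rest of the graph, skipping the data op
with tf.Session() as sess:
    result = sess.run(features,
                      feed_dict={ids_placeholder: train_ids})
```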