Marin Petricevic 20dcb05c50 Add Tensorflow op feature spec

Reviewers: buda, teon.banek, mferencevic, msantl

Reviewed By: buda, teon.banek

Differential Revision: https://phabricator.memgraph.io/D1691

2018-11-08 11:17:20 +01:00


Example for Using the Bolt Client Tensorflow Op

Dynamic Loading

import tensorflow as tf

mg_ops = tf.load_op_library('/usr/bin/memgraph/tensorflow_ops.so')

Basic Usage

dataset = mg_ops.OpenCypherDataset(
            "hostname:7687",

            # Our query
            '''
            MATCH (n:Train) RETURN n.id, n.features
            ''',

            # Cast return values to these types
            (tf.string, tf.float32),

            # This is probably unfortunate as the username and password
            # get hardcoded into the graph, but for the simple case it's fine
            auth=("user", "pass"))

# Some Tensorflow data api boilerplate
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

# Up to now we have only defined our computation graph, which basically
# just connects to Memgraph.
# `next_element` is not really data but a handle to a node in the Tensorflow
# graph, which we can and do evaluate.
# It is a Tensorflow tensor with shape=(None, 2)
# and dtype=(tf.string, tf.float32).
# A shape of `None` means that dimension is unknown at definition time;
# it is dynamic and only becomes known once the tensor has been evaluated.

with tf.Session() as sess:
    node_ids = sess.run(next_element)
    # `node_ids` contains the IDs and features of all the nodes
    # in the graph with the label "Train".
    # It is a numpy.ndarray with shape (n_matching_nodes, 2).

Memgraph Client as a Generic Tensorflow Op

Besides the Tensorflow Data Op, we'll want to support a generic Tensorflow Op which can be placed anywhere in the Tensorflow computation graph. It takes in an arbitrary tensor and produces a tensor. This would be used in the GraphSage algorithm to fetch the lowest-level features into Tensorflow.

import numpy as np

requested_ids = np.array([1, 2, 3])
ids_placeholder = tf.placeholder(tf.int32)

model = mg_ops.OpenCypher(
    "hostname:7687",
    """
    UNWIND $node_ids as nid
    MATCH (n:Train {id: nid})
    RETURN n.features
    """,

    # Credentials for the Bolt connection
    auth=("user", "pass"),

    # What to call the input tensor as an openCypher parameter
    parameter_name="node_ids",

    # Type of our resulting tensor
    dtype=tf.float32
)

features = model(ids_placeholder)

with tf.Session() as sess:
    result = sess.run(features,
                    feed_dict={ids_placeholder: requested_ids})

This is probably easier to implement than the Data Op, so it might be a good place to start.

Production Usage

During training, in the GraphSage algorithm at least, Memgraph sits at both the beginning and the end of the Tensorflow computation graph.
At the beginning, the Data Op provides the node IDs, which are fed into the generic Tensorflow Op to find their neighbours, their neighbours' neighbours, and the features of all of those nodes.
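To make the neighbourhood expansion concrete, here is a minimal pure-Python sketch of the lookups the generic Op would perform. The toy `NEIGHBOURS` and `FEATURES` dictionaries stand in for data stored in Memgraph and are purely hypothetical, as is the `expand` helper. GraphSage normally samples a fixed number of neighbours per hop, which is omitted here for brevity.

```python
# Toy graph standing in for data stored in Memgraph (hypothetical values).
NEIGHBOURS = {1: [2, 3], 2: [4], 3: [], 4: [1]}
FEATURES = {1: [0.1], 2: [0.2], 3: [0.3], 4: [0.4]}


def expand(node_ids, hops=2):
    """Return one list of node IDs per hop, starting with the input IDs."""
    layers = [list(node_ids)]
    for _ in range(hops):
        frontier = layers[-1]
        # Collect the distinct neighbours of every node in the current frontier.
        next_layer = sorted({m for n in frontier for m in NEIGHBOURS.get(n, [])})
        layers.append(next_layer)
    return layers


layers = expand([1])
# Look up the features of every node reached in any hop.
features = {n: FEATURES[n] for layer in layers for n in layer}
print(layers)
print(features)
```

In the real setup each hop would be one call to the generic Op with a Cypher query, rather than a dictionary lookup.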

Production usage differs in that we don't use the Data Op. The Data Op is effectively cut off and the initial input is fed by Tensorflow serving, with the data found in the request.

For example, a JSON request to classify a node might look like:

POST http://host:port/v1/models/GraphSage/versions/1:classify

With the contents:

{
    "examples": [
        {"node_id": 1},
        {"node_id": 2}
    ]
}

Every element of the "examples" list is an example to be computed. Each is represented by a dict whose keys match the names of feeds in the Tensorflow graph and whose values are the values we want fed in for that example.
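As a rough illustration, the request body could be assembled in Python like this. It is only a sketch: `build_classify_body` is a hypothetical helper, the `node_id` key is taken from the example above, and nothing is actually sent to a server.

```python
import json


def build_classify_body(node_ids):
    """Serialize one example per node ID, keyed by the feed name `node_id`."""
    examples = [{"node_id": node_id} for node_id in node_ids]
    return json.dumps({"examples": examples})


body = build_classify_body([1, 2])
print(body)
```

The resulting string would be sent as the body of the POST request shown earlier.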

The REST API then replies in kind, with the classification result in JSON.

A note about adding our custom Op to Tensorflow serving:
the Op's .so can either be added to the Bazel build so that it is linked with Tensorflow serving, or it can be loaded dynamically by starting Tensorflow serving with the flag --custom_op_paths.

Considerations

One possible issue is that the URL used to connect to Memgraph is hardcoded into the op and would therefore be wrong once the model is moved to production, requiring some kind of hack to make it work. We probably want to solve this by having the client op take another tf.Variable as an input, containing the connection URL and username/password.
We have to research whether this makes the move to production easy enough, since the connection string variable is still part of the graph, although it may be easier to replace.

It is probably best to use openCypher parameters to keep our queries flexible. The exact API for declaring the parameters in Python is open to discussion.
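One possible shape for this, shown purely as a discussion aid: the query text stays fixed and the input values are bound to a single named parameter. `bind_parameters` is a hypothetical helper and not part of any existing API; the query and the `node_ids` name are taken from the generic Op example.

```python
QUERY = """
UNWIND $node_ids AS nid
MATCH (n:Train {id: nid})
RETURN n.features
"""


def bind_parameters(query, parameter_name, values):
    """Return the parameter map to send alongside `query`."""
    # Fail early if the query never refers to the declared parameter.
    if "$" + parameter_name not in query:
        raise ValueError("query does not use parameter $" + parameter_name)
    return {parameter_name: list(values)}


params = bind_parameters(QUERY, "node_ids", [1, 2, 3])
print(params)
```

Because the values travel separately from the query text, the same query can be reused for every batch of IDs without string formatting.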

The Data Op might not even be necessary to implement, as it is not key for production use. In training mode it can be replaced with feed dicts, combined with either getting the initial list of nodes via a Python Bolt client, or creating a separate Tensorflow computation graph that gets all the relevant node IDs into Python.
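A minimal sketch of the feed dict alternative follows. It assumes the node IDs are fetched once up front and then fed in batches. `fetch_train_ids`, `batches` and `fake_run_query` are hypothetical names; `fake_run_query` stands in for a real Bolt client so that the sketch runs without a database.

```python
def fetch_train_ids(run_query):
    """Collect node IDs using any callable that runs a query and returns rows."""
    rows = run_query("MATCH (n:Train) RETURN n.id AS id")
    return [row["id"] for row in rows]


def batches(ids, batch_size):
    """Yield consecutive slices of `ids`, each usable as a feed dict value."""
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


def fake_run_query(query):
    # Stand-in for a Bolt client call; returns five made-up rows.
    return [{"id": i} for i in range(1, 6)]


train_ids = fetch_train_ids(fake_run_query)
for batch in batches(train_ids, 2):
    # In training, each batch would be passed as
    # feed_dict={ids_placeholder: batch} to sess.run.
    print(batch)
```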
