memgraph/docs/user_technical/integrations.md
Matija Santl 4ee3db80b0 Add kafka documentation
Summary:
Updated the feature specs, the changelog and added a new section in
user technical.

Reviewers: mferencevic, mculinovic, buda, ipaljak

Reviewed By: ipaljak

Subscribers: pullbot

Differential Revision: https://phabricator.memgraph.io/D1534
2018-08-09 16:52:52 +02:00

7.8 KiB

Integrations

Kafka

Apache Kafka is an open-source stream-processing software platform. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.

Memgraph offers easy data import at the source using Kafka as the high-throughput messaging system.

openCypher

Memgraphs custom openCypher clause for creating a stream is:

CREATE STREAM stream_name AS
  LOAD DATA KAFKA 'URI'
  WITH TOPIC 'topic'
  WITH TRANSFORM 'URI'
  [BATCH_INTERVAL milliseconds]
  [BATCH_SIZE count]

The CREATE STREAM clause happens in a transaction.

WITH TOPIC parameter specifies the Kafka topic from which we'll stream data.

WITH TRANSFORM parameter should contain a URI of the transform script. We cover more about the transform script later, in the transform section.

BATCH_INTERVAL parameter defines the time interval in milliseconds which is the time between two successive stream importing operations.

BATCH_SIZE parameter defines the count of Kafka messages that will be batched together before import.

If both BATCH_INTERVAL and BATCH_SIZE parameters are given, the condition that is satisfied first will trigger the batched import.

Default value for BATCH_INTERVAL is 100 milliseconds, and the default value for BATCH_SIZE is 10.

The DROP clause deletes a stream:

DROP STREAM stream_name;

The SHOW clause enables you to see all configured streams:

SHOW STREAMS;

You can also start/stop streams with the START and STOP clauses:

START STREAM stream_name [LIMIT count BATCHES];
STOP STREAM stream_name;

A stream needs to be stopped in order to start it and it needs to be started in order to stop it. Starting a started or stopping a stopped stream will not affect that stream.

There are also convenience clauses to start and stop all streams:

START ALL STREAMS;
STOP ALL STREAMS;

Before the actual import, you can also test the stream with the TEST STREAM clause:

TEST STREAM stream_name [LIMIT count BATCHES];

When a stream is tested, data extraction and transformation occurs, but nothing is inserted into the graph.

A stream needs to be stopped in order to test it. When the batch limit is omitted, TEST STREAM will run for only one batch by default.

Transform

The transform script allows Memgraph users to have custom Kafka messages and still be able to import data in Memgraph by adding the logic to decode the messages in the transform script.

The entry point of the transform script from Memgraph is the stream function. Input for the stream function is a list of bytes that represent byte encoded Kafka messages, and the output of the stream function must be a list of tuples containing openCypher string queries and corresponding parameters stored in a dictionary.

To be more precise, the signature of the stream function looks like the following:

stream : [bytes] -> [(str, {str : type})]
type : none | bool | int | float | str | list | dict

An example of a simple transform script that creates vertices if the message contains one number (the vertex id) or it creates edges if the message contains two numbers (origin vertex id and destination vertex id) would look like the following:

def create_vertex(vertex_id):
  return ("CREATE (:Node {id: $id})", {"id": vertex_id})


def create_edge(from_id, to_id):
  return ("MATCH (n:Node {id: $from_id}), (m:Node {id: $to_id}) "\
          "CREATE (n)-[:Edge]->(m)", {"from_id": from_id, "to_id": to_id})


def stream(batch):
    result = []
    for item in batch:
        message = item.decode('utf-8').split()
        if len(message) == 1:
          result.append(create_vertex(message[0]))
        elif len(message) == 2:
          result.append(create_edge(message[0], message[1]))
    return result

Example

For this example, we assume you have a local instance of Kafka. You can find more about running Kafka here.

From this point forth, we assume you have a instance of Kafka running on localhost:9092 with a topic test and that you've started Memgraph and have Memgraph client running.

Each Kafka stream in Memgraph requires a transform script written in Python that knows how to interpret incoming data and transform the data to queries that Memgraph understands. Lets assume you have script available on http://localhost/transform.py.

Lets also assume the Kafka topic contains two types of messages:

  • Node creation: the message contains a single number, the node id.
  • Edge creation: the message contains two numbers, origin node id and destination node id.

In order to create a stream input the following query in the client:

CREATE STREAM mystream AS LOAD DATA KAFKA 'localhost:9092' WITH TOPIC 'test' WITH
TRANSFORM 'http://localhost/transform.py'

This will create the stream inside Memgraph but will not start it yet. However, if the Kafka instance isn't available on the given URI, or the topic doesn't exist, the query will fail with an appropriate message.

E.g. if the transform script can't be found at the given URI, the following error will be shown:

Client received exception: Couldn't get the transform script from http://localhost/transform.py

Similar, if the given Kafka topic doesn't exist, we'll get the following:

Client received exception: Kafka stream mystream, topic not found

After a successful stream creation, you can check the status of all streams by executing:

SHOW STREAMS

This should produce the following output:

+----------+----------------+-------+------------------------------+---------+
| name     | uri            | topic | transform                    | status  |
+---------------------------+--------------------------------------+---------+
| mystream | localhost:9092 | test  | http://localhost/memgraph.py | stopped |
+----------+----------------+-------+------------------------------+---------+

As you can notice, the status of this stream is stopped.

In order to see if everything is correct, you can test the stream by executing:

TEST STREAM mystream;

This will ingest data from Kafka, but instead of writing it to Memgraph, it will just output the result.

If the test Kafka topic would contain two messages, 1 and 1 2 the result of the TEST STREAM query would look like:

+-------------------------------------------------------------------------------+-------------------------+
| query                                                                         | params                  |
+-------------------------------------------------------------------------------+-------------------------+
| CREATE (:Node {id: $id})                                                      | {id:"1"}                |
| MATCH (n:Node {id: $from_id}), (m:Node {id: $to_id}) CREATE (n)-[:Edge]->(m)  | {from_id:"1",to_id:"2"} |
+-------------------------------------------------------------------------------+-------------------------+

To start ingesting data from a stream, you need to execute the following query:

START STREAM mystream;

If we check the stream status now, the output would look like this:

+----------+----------------+-------+------------------------------+---------+
| name     | uri            | topic | transform                    | status  |
+---------------------------+--------------------------------------+---------+
| mystream | localhost:9092 | test  | http://localhost/memgraph.py | running |
+----------+----------------+-------+------------------------------+---------+

To stop ingesting data, the stop stream query needs to be executed:

STOP STREAM mystream;

If Memgraph shuts down, all streams that existed before the shutdown are going to be recovered.