# Kafka - data transform
The transform script is a user-defined script written in Python. The script
must be aware of the data format used in the Kafka messages.
Each Kafka message is length-prefixed: the first eight bytes of each message
contain the length of the message.
More on the message format can be found in
[A Guide To The Kafka Protocol](https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-Messagesets).
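
To make the framing concrete, here is a minimal sketch of writing and reading
length-prefixed records. The helper names `frame_record` and `read_records` are
hypothetical. The 8-byte prefix in native byte order (`struct` format `Q`) is an
assumption based on the description above, not a confirmed detail of the
implementation.

```python
import io
import struct

# Assumption: the prefix is an 8-byte unsigned integer in native byte order.
LENGTH_PREFIX = struct.Struct("Q")


def frame_record(payload: bytes) -> bytes:
    """Prepend the payload length as an 8-byte prefix (hypothetical helper)."""
    return LENGTH_PREFIX.pack(len(payload)) + payload


def read_records(stream):
    """Yield payloads from a binary stream of length-prefixed records."""
    while True:
        prefix = stream.read(LENGTH_PREFIX.size)
        if not prefix:
            return  # clean end of stream
        if len(prefix) != LENGTH_PREFIX.size:
            raise ValueError("truncated length prefix")
        (length,) = LENGTH_PREFIX.unpack(prefix)
        payload = stream.read(length)
        if len(payload) != length:
            raise ValueError("truncated record")
        yield payload


# Round trip: frame two records, then read them back from an in-memory stream.
framed = frame_record(b"v\t1") + frame_record(b"e\t1\t2\t3")
records = list(read_records(io.BytesIO(framed)))
```

A helper like `frame_record` can also be used to produce test input for a
transform script without a running Kafka instance.
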
The script will be embedded in our C++ codebase using Python's
[embedding](https://docs.python.org/3.5/extending/embedding.html) feature.

Sample code for a streaming transform script could look like this:
```python
import struct
import sys


def get_records():
    # Read length-prefixed records from stdin until end of input.
    stream = sys.stdin.buffer  # binary stream: the length prefix is raw bytes
    while True:
        message_len = stream.read(8)
        if len(message_len) == 8:
            # "Q" is an 8-byte unsigned integer in native byte order.
            message_len = struct.unpack("Q", message_len)[0]
            record = stream.read(message_len)
            yield record
        else:
            assert len(message_len) == 0, message_len
            return


def create_vertex(fields):
    return "CREATE (n:Node {{id: {}}})".format(fields[1])


def create_edge(fields):
    return "MATCH (n:Node {{id: {}}}) "\
           "MATCH (m:Node {{id: {}}}) "\
           "CREATE (n)-[e:Edge {{value: {}}}]->(m)"\
        .format(fields[1], fields[2], fields[3])


for record in get_records():
    # Assumes each record is UTF-8 text with tab-separated fields.
    fields = record.decode("utf-8").split("\t")
    if fields[0] == "v":
        print(create_vertex(fields))
    else:
        print(create_edge(fields))
```
The script should output openCypher query strings based on the type of the
records.
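
As an illustration of that mapping, the sketch below converts a few sample
records into queries. The tab-separated layout (`v` and an id for a vertex;
any other tag followed by a source id, a target id, and a value for an edge)
mirrors the sample script above and is an assumption about the input, not a
fixed format. `record_to_query` is a hypothetical name.

```python
def record_to_query(record: str) -> str:
    """Map one tab-separated record to an openCypher query (hypothetical helper)."""
    fields = record.split("\t")
    if fields[0] == "v":
        # Vertex record: v <id>
        return "CREATE (n:Node {{id: {}}})".format(fields[1])
    # Edge record: <tag> <source id> <target id> <value>
    return (
        "MATCH (n:Node {{id: {}}}) "
        "MATCH (m:Node {{id: {}}}) "
        "CREATE (n)-[e:Edge {{value: {}}}]->(m)"
    ).format(fields[1], fields[2], fields[3])


queries = [record_to_query(r) for r in ("v\t1", "v\t2", "e\t1\t2\t10")]
for query in queries:
    print(query)
```

For these three records the script prints:

```
CREATE (n:Node {id: 1})
CREATE (n:Node {id: 2})
MATCH (n:Node {id: 1}) MATCH (m:Node {id: 2}) CREATE (n)-[e:Edge {value: 10}]->(m)
```
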