# Kafka - data transform

The transform script is a user-defined script written in Python. The script
must be aware of the data format of the Kafka messages it receives.

Each Kafka message is length-prefixed: the first eight bytes of each message
contain the length of the message.

More on the message format can be found
[here](https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-Messagesets).
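
As a rough illustration of the length-prefixed framing described above, the
following sketch packs and unpacks messages on an in-memory stream. It assumes
the prefix is an unsigned 64-bit integer in native byte order and that the
length excludes the prefix itself; the helper names `frame` and `read_frames`
are hypothetical and not part of any existing API.

```python
import io
import struct

# Assumption: the 8-byte prefix is an unsigned 64-bit integer, native byte order.
LENGTH_PREFIX = struct.Struct("=Q")


def frame(payload):
    """Prepend the 8-byte length prefix to a payload (hypothetical helper)."""
    return LENGTH_PREFIX.pack(len(payload)) + payload


def read_frames(stream):
    """Yield payloads from a binary stream of length-prefixed messages."""
    while True:
        header = stream.read(LENGTH_PREFIX.size)
        if not header:
            return  # clean end of stream
        if len(header) != LENGTH_PREFIX.size:
            raise ValueError("truncated length prefix")
        (length,) = LENGTH_PREFIX.unpack(header)
        payload = stream.read(length)
        if len(payload) != length:
            raise ValueError("truncated message")
        yield payload


# Two framed messages written back to back, then read again.
stream = io.BytesIO(frame(b"v\t1") + frame(b"e\t1\t2\t3"))
messages = list(read_frames(stream))
```

Reading the header first and then exactly `length` bytes is what lets the
reader split a continuous byte stream back into individual messages.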

The script is embedded in our C++ codebase using Python's
[embedding](https://docs.python.org/3.5/extending/embedding.html) feature.

Sample code for a streaming transform script could look like this:

```python
import struct
import sys


def get_records():
    # Read raw bytes; the length prefix is binary, not text.
    stdin = sys.stdin.buffer
    while True:
        message_len = stdin.read(8)
        if len(message_len) == 8:
            # "=Q" is an 8-byte unsigned integer in native byte order.
            message_len = struct.unpack("=Q", message_len)[0]
            record = stdin.read(message_len)
            # Assumes the record payload is UTF-8 encoded text.
            yield record.decode("utf-8")
        else:
            # Anything other than a clean end of input is a truncated header.
            assert len(message_len) == 0, message_len
            return


def create_vertex(fields):
    return "CREATE (n:Node {{id: {}}})".format(fields[1])


def create_edge(fields):
    return "MATCH (n:Node {{id: {}}}) "\
           "MATCH (m:Node {{id: {}}}) "\
           "CREATE (n)-[e:Edge {{value: {}}}]->(m)"\
           .format(fields[1], fields[2], fields[3])


# Emit one openCypher query per record on standard output.
for record in get_records():
    fields = record.split("\t")
    if fields[0] == "v":
        print(create_vertex(fields))
    else:
        print(create_edge(fields))
```

The script should output openCypher query strings based on the type of each
record.
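
To make the mapping from records to queries concrete, here is a small sketch
that applies the same rules as the sample script to two made-up records. The
function name `to_query` and the sample records are hypothetical; the record
layout (type, then ids, then value, separated by tabs) is assumed from the
sample script.

```python
def to_query(record):
    """Map one tab-separated record to an openCypher query string."""
    fields = record.split("\t")
    if fields[0] == "v":
        # Vertex record: type, id
        return "CREATE (n:Node {{id: {}}})".format(fields[1])
    # Any other record is treated as an edge: type, source id, target id, value
    return ("MATCH (n:Node {{id: {}}}) "
            "MATCH (m:Node {{id: {}}}) "
            "CREATE (n)-[e:Edge {{value: {}}}]->(m)"
            ).format(fields[1], fields[2], fields[3])


# Hypothetical sample records: one vertex and one edge.
queries = [to_query(r) for r in ["v\t1", "e\t1\t2\t42"]]
for query in queries:
    print(query)
# CREATE (n:Node {id: 1})
# MATCH (n:Node {id: 1}) MATCH (m:Node {id: 2}) CREATE (n)-[e:Edge {value: 42}]->(m)
```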