From 56a5c61732cfdcf0f9fcbd322a43dac10bc34011 Mon Sep 17 00:00:00 2001
From: Matija Santl <matija.santl@memgraph.com>
Date: Mon, 3 Sep 2018 13:59:36 +0200
Subject: [PATCH] Distributed durability feature spec

Summary: Notes from the meeting with @mferencevic.

Reviewers: mferencevic, buda

Reviewed By: buda

Subscribers: buda, mferencevic

Differential Revision: https://phabricator.memgraph.io/D1585
---
 docs/feature_specs/distributed.md | 78 +++++++++++++++++++++++++++++++
 1 file changed, 78 insertions(+)
 create mode 100644 docs/feature_specs/distributed.md

diff --git a/docs/feature_specs/distributed.md b/docs/feature_specs/distributed.md
new file mode 100644
index 000000000..a9f935dce
--- /dev/null
+++ b/docs/feature_specs/distributed.md
@@ -0,0 +1,78 @@
+# Distributed Memgraph specs
+This document describes the reasoning behind Memgraph's distributed concepts.
+
+## Distributed state machine
+Memgraph's distributed mode introduces two cluster states, Recovering and
+Working. The change between states shouldn't happen often, but when it does,
+the transition from one to the other can take a while.
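+
+A minimal sketch of how these two states could be represented; the enum and
+its names are illustrative assumptions, not existing Memgraph code:
+
+```cpp
+// Hypothetical representation of the cluster state machine.
+enum class ClusterState {
+  RECOVERING,  // default on startup with recovery flags, or after a failure
+  WORKING      // accepting Bolt connections and executing queries
+};
+```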
+
+### Recovering
+This state is the default state for Memgraph when the cluster starts with
+recovery flags. If the recovery finishes successfully, the state changes to
+Working. If recovery fails, the user is presented with a message that
+explains what happened and what the next steps are.
+
+Another way to enter this state is through failure. If the cluster encounters
+a failure, the master will enter the Recovering state. This time, it will wait
+for all workers to respond with a message saying they are alive and well, and
+make sure they all have a consistent state.
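+
+A minimal sketch of that health check, assuming a simple per-worker RPC; all
+names below are illustrative:
+
+```cpp
+// Illustrative only: after a failure the master could poll every worker until
+// all of them report they are alive. The RPC shape and names are assumptions.
+#include <vector>
+
+struct WorkerClient {
+  bool IsAlive() const { return true; }  // stub for an "are you alive?" RPC
+};
+
+bool AllWorkersAlive(const std::vector<WorkerClient> &workers) {
+  for (const auto &w : workers)
+    if (!w.IsAlive()) return false;  // in practice the master keeps retrying
+  return true;
+}
+```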
+
+### Working
+This state should be the default state of Memgraph most of the time. When in
+this state, Memgraph accepts connections from Bolt clients and allows query
+execution.
+
+If distributed execution fails for a transaction, that transaction and all
+other active transactions will be aborted, and the cluster will enter the
+Recovering state.
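+
+The sketch below illustrates that transition with hypothetical types; it is
+not Memgraph's actual transaction engine:
+
+```cpp
+// Self-contained illustration (hypothetical types, not Memgraph's API): a
+// failed distributed execution aborts every active transaction and moves the
+// cluster back to the Recovering state.
+#include <vector>
+
+enum class ClusterState { RECOVERING, WORKING };  // as in the sketch above
+
+struct Transaction {
+  bool active = true;
+  void Abort() { active = false; }
+};
+
+struct Cluster {
+  ClusterState state = ClusterState::WORKING;
+  std::vector<Transaction> transactions;
+
+  void OnDistributedExecutionFailure() {
+    for (auto &tx : transactions)
+      if (tx.active) tx.Abort();       // abort all active transactions
+    state = ClusterState::RECOVERING;  // re-enter the Recovering state
+  }
+};
+```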
+
+## Durability
+One of the important concepts in distributed Memgraph is durability.
+
+### Cluster configuration
+When running Memgraph in distributed mode, the master will store cluster
+metadata in a persistent store. If for some reason the cluster shuts down,
+recovering Memgraph from durability files shouldn't require any additional
+flags.
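+
+A sketch of what that metadata could contain and how the master might persist
+it; the struct, file format and names below are assumptions for illustration
+only:
+
+```cpp
+// Hypothetical shape of the persisted cluster metadata and a naive writer.
+// The actual format and field names are assumptions, not Memgraph's layout.
+#include <fstream>
+#include <map>
+#include <string>
+
+struct ClusterMetadata {
+  std::string database_id;                      // see the Database ID section
+  std::map<int, std::string> worker_endpoints;  // worker id -> "host:port"
+};
+
+void PersistMetadata(const ClusterMetadata &meta, const std::string &path) {
+  std::ofstream out(path);
+  out << meta.database_id << "\n";
+  for (const auto &[id, endpoint] : meta.worker_endpoints)
+    out << id << " " << endpoint << "\n";
+}
+```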
+
+### Database ID
+Each new and clean run of Memgraph should generate a new globally unique
+database id. This id associates all files persisted during this run. Adding
+the database id to snapshots, write-ahead logs and cluster metadata files ties
+them to a specific Memgraph run, and it makes recovery easier to reason about.
+
+When recovering, the cluster won't generate a new id, but will reuse the one
+from the snapshot/WAL that it was able to recover from.
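+
+A minimal sketch of that idea, assuming a random hex tag stands in for a
+proper UUID; both helpers are hypothetical names used only for illustration:
+
+```cpp
+// Hypothetical helpers: generate a fresh database id on a clean run, or reuse
+// the id recovered from an existing snapshot/WAL.
+#include <optional>
+#include <random>
+#include <sstream>
+#include <string>
+
+std::string GenerateDatabaseId() {
+  // Not a real UUID implementation, just a random 64-bit hex tag.
+  std::mt19937_64 rng{std::random_device{}()};
+  std::ostringstream os;
+  os << std::hex << rng();
+  return os.str();
+}
+
+std::string ResolveDatabaseId(const std::optional<std::string> &recovered_id) {
+  // Reuse the id found in recovered durability files; otherwise start fresh.
+  return recovered_id ? *recovered_id : GenerateDatabaseId();
+}
+```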
+
+### Durability files
+Memgraph uses snapshots and write-ahead logs for durability.
+
+When Memgraph recovers it has to make sure all machines in the cluster recover
+to the same recovery point. This is done by finding a common snapshot and
+finding common transactions in per-machine available write-ahead logs.
+
+Since we cannot be sure that each machine persisted its durability files, we
+need to be able to negotiate a common recovery point in the cluster.
+Durability file failures could require starting the cluster from scratch,
+purging everything from storage and recovering from the existing durability
+files.
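+
+One way to negotiate such a common recovery point is sketched below, assuming
+each machine reports the set of transaction ids it can recover to; the newest
+id present on every machine is the common recovery point. The function and its
+types are illustrative, not existing Memgraph code:
+
+```cpp
+// Illustrative: pick the newest recovery point that every machine can reach.
+#include <cstdint>
+#include <optional>
+#include <set>
+#include <vector>
+
+std::optional<uint64_t> FindCommonRecoveryPoint(
+    const std::vector<std::set<uint64_t>> &per_worker_points) {
+  if (per_worker_points.empty()) return std::nullopt;
+  std::set<uint64_t> common = per_worker_points.front();
+  for (const auto &points : per_worker_points) {
+    std::set<uint64_t> kept;
+    for (uint64_t p : common)
+      if (points.count(p)) kept.insert(p);
+    common = std::move(kept);
+  }
+  if (common.empty()) return std::nullopt;  // may force starting from scratch
+  return *common.rbegin();                  // newest point available everywhere
+}
+```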
+
+We need to ensure that we keep WAL files containing information about
+transactions between all existing snapshots. This provides better durability
+in the case of a durability file failure on a random machine, because the
+cluster can still find a common recovery point that all machines have.
+
+Also, the documentation should clearly state that keeping fewer than two
+snapshots isn't considered safe for recovery.
+
+### Recovery
+Recovery happens in the following steps (a sketch of the master's side of this
+protocol follows the list):
+* Master enables worker registration.
+* Master recovers cluster metadata from the persisted storage.
+* Master waits for all required workers to register.
+* Master broadcasts a recovery request to all workers.
+* Workers respond with a set of possible recovery points.
+* Master finds a common recovery point for the whole cluster.
+* Master broadcasts a recovery request with the common recovery point.
+* Master waits for the cluster to recover.
+* After a successful cluster recovery, the master can enter the Working state.
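+
+Below is a non-authoritative sketch of the master's side of this protocol.
+Every type and function name is hypothetical; `FindCommonRecoveryPoint` is the
+helper from the earlier sketch and is only declared here:
+
+```cpp
+// Hypothetical master-side orchestration of the recovery steps above. Real
+// code would issue RPCs and block on responses; here everything is stubbed.
+#include <cstdint>
+#include <optional>
+#include <set>
+#include <vector>
+
+struct Worker {
+  std::set<uint64_t> PossibleRecoveryPoints() const { return {}; }  // RPC stub
+  void RecoverTo(uint64_t /*tx_id*/) {}                             // RPC stub
+};
+
+// Intersection of per-worker recovery points, as in the earlier sketch.
+std::optional<uint64_t> FindCommonRecoveryPoint(
+    const std::vector<std::set<uint64_t>> &per_worker_points);
+
+bool RunRecovery(std::vector<Worker> &workers) {
+  // Worker registration and cluster metadata recovery are assumed to be done.
+  std::vector<std::set<uint64_t>> per_worker;
+  for (const auto &w : workers)
+    per_worker.push_back(w.PossibleRecoveryPoints());  // broadcast + collect
+  auto point = FindCommonRecoveryPoint(per_worker);
+  if (!point) return false;                     // no common point: recovery fails
+  for (auto &w : workers) w.RecoverTo(*point);  // broadcast the recovery point
+  return true;                                  // master may now enter Working
+}
+```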