# Distributed durability

Durability in the distributed setting is slightly different than in a
single-node deployment, because the database state is spread across
multiple workers and the per-worker states are not independent of one
another.

Recovering from persistent storage must result in a stable database
state. This means that, across the cluster, the state modifications of
every transaction that was running are either recovered fully or not at
all. Also, if transaction A committed before transaction B, then B may
only be recovered if A is recovered as well.
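
The second requirement means that the recovered transactions must form a
prefix of the commit order. The following check makes that precise; it is
purely illustrative (the types and the function are hypothetical, not
Memgraph code):

```cpp
#include <cstdint>
#include <set>
#include <vector>

using TransactionId = std::uint64_t;

// Returns true if the set of recovered transactions is a prefix of the
// commit order: once some committed transaction is missing, no transaction
// that committed after it may be recovered.
bool IsStableRecovery(const std::vector<TransactionId> &commit_order,
                      const std::set<TransactionId> &recovered) {
  bool missing_seen = false;
  for (TransactionId tx : commit_order) {
    const bool is_recovered = recovered.count(tx) > 0;
    if (missing_seen && is_recovered) return false;
    if (!is_recovered) missing_seen = true;
  }
  return true;
}
```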

## Snapshots

It is possibly avoidable, but highly desirable, that the database can be
recovered from snapshots only, without relying on WAL files. For this to
be possible in the distributed setting, it must be ensured that the same
transactions are recovered on all the workers (including the master) in
the cluster. Since a snapshot does not contain information about which
state change happened in which transaction, the only way to achieve this
is to have synchronized snapshots. Creating a snapshot is in itself
transactional: it happens within a transaction and thus observes some
consistent database state. Synchronization therefore means that snapshot
creation must happen within the same transaction on every worker. This
is achieved by the master starting a snapshot-generating transaction and
triggering the snapshot process on all workers in the cluster.
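
A minimal sketch of how the master could drive such a synchronized
snapshot follows. The `Master` and `WorkerClient` interfaces and their
methods (`BeginTransaction`, `MakeSnapshot`, and so on) are assumed names
used only for illustration, not Memgraph's actual API.

```cpp
#include <cstdint>
#include <vector>

using TransactionId = std::uint64_t;

// Assumed view of a worker from the master's side (illustrative only).
class WorkerClient {
 public:
  virtual ~WorkerClient() = default;
  // Creates a snapshot of the worker's partition as seen by transaction
  // `tx_id`; returns false on failure.
  virtual bool MakeSnapshot(TransactionId tx_id) = 0;
};

// Assumed master-side facilities (illustrative only).
class Master {
 public:
  virtual ~Master() = default;
  virtual TransactionId BeginTransaction() = 0;
  virtual void CommitTransaction(TransactionId tx_id) = 0;
  virtual bool MakeLocalSnapshot(TransactionId tx_id) = 0;
  virtual std::vector<WorkerClient *> &Workers() = 0;

  // Every snapshot in the cluster observes the state visible to the same
  // transaction, so recovery from snapshots alone yields a consistent
  // cluster-wide state.
  bool MakeSynchronizedSnapshot() {
    const TransactionId tx_id = BeginTransaction();
    bool success = MakeLocalSnapshot(tx_id);
    for (WorkerClient *worker : Workers())
      success = worker->MakeSnapshot(tx_id) && success;
    CommitTransaction(tx_id);
    return success;
  }
};
```

The important property is that every worker receives the same transaction
id, so all snapshots observe the same consistent database state.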

## WAL

Unlike snapshots, write-ahead logs contain the information about which
transaction made which state change. This makes it possible to include
or exclude individual transactions during the recovery process. What is
necessary, however, is a global consensus on which transactions should
be recovered and which should not, to ensure recovery into a consistent
state. It would be possible to achieve this with some kind of
synchronized recovery process, but that would impose constraints on
cluster startup and would not be trivial.

A simpler alternative is to reach that consensus beforehand, while the
database (to be recovered) is still operational. What is necessary is to
keep track of which transactions are guaranteed to have been flushed to
the WAL files on all the workers in the cluster. It makes sense to keep
this record on the master, so a mechanism is introduced in which the
master periodically pings all the workers, telling them to flush their
WALs, and writes some sort of a log entry indicating that this has been
confirmed. The downsides are that a periodic broadcast must be done, and
that potentially slightly less data can be recovered after a crash than
with a post-crash consensus. It is, however, much simpler to implement.
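
A minimal sketch of that mechanism, assuming hypothetical
`WalClient::FlushWalAndReport` and `WriteDurabilityLog` helpers (again,
not Memgraph's actual API): the master periodically asks every worker to
flush its WAL and report the highest transaction id that is now durable
there, then persists the minimum of those values as the cluster-wide
recovery point.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <fstream>
#include <thread>
#include <vector>

using TransactionId = std::uint64_t;

// Assumed view of a worker's WAL from the master (illustrative only).
class WalClient {
 public:
  virtual ~WalClient() = default;
  // Flushes the worker's WAL and returns the highest transaction id whose
  // changes are now guaranteed to be on that worker's disk.
  virtual TransactionId FlushWalAndReport() = 0;
};

// Stub for persisting the agreed-upon recovery point on the master; a real
// implementation would append to a durable log and sync it to disk.
void WriteDurabilityLog(TransactionId durable_up_to) {
  std::ofstream log("durability.log", std::ios::app);
  log << durable_up_to << '\n';
}

// Periodically asks the master's own WAL and every worker to flush, then
// records the minimum reported transaction id. Everything up to that id is
// durable cluster-wide, so recovery applies exactly that prefix of
// transactions and discards the rest.
void PeriodicWalFlush(WalClient &master_wal, std::vector<WalClient *> &workers,
                      std::chrono::milliseconds period) {
  while (true) {  // simplified: a real loop would honor a shutdown signal
    TransactionId durable_up_to = master_wal.FlushWalAndReport();
    for (WalClient *worker : workers)
      durable_up_to = std::min(durable_up_to, worker->FlushWalAndReport());
    WriteDurabilityLog(durable_up_to);
    std::this_thread::sleep_for(period);
  }
}
```

Taking the minimum is what makes the consensus safe: a transaction counts
as recoverable only once its WAL records are on disk on every worker.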