Distributed durability feature spec
Summary: Notes from the meeting with @mferencevic. Reviewers: mferencevic, buda Reviewed By: buda Subscribers: buda, mferencevic Differential Revision: https://phabricator.memgraph.io/D1585
This commit is contained in:
parent
343fe54f2c
commit
56a5c61732
78
docs/feature_specs/distributed.md
Normal file
78
docs/feature_specs/distributed.md
Normal file
@ -0,0 +1,78 @@
|
||||
# Distributed Memgraph specs
|
||||
This document describes reasnonings behind Memgraphs distributed concepts.
|
||||
|
||||
## Distributed state machine
|
||||
Memgraphs distributed mode introduces two states of the cluster, recovering and
|
||||
working. The change between states shouldn't happen often, but when it happens
|
||||
it can take a while to make a transition from one to another.
|
||||
|
||||
### Recovering
|
||||
This state is the default state for Memgraph when the cluster starts with
|
||||
recovery flags. If the recovery finishes successfully, the state changes to
|
||||
working. If recovery fails, the user will be presented with a message that
|
||||
explains what happened and what are the next steps.
|
||||
|
||||
Another way to enter this state is failure. If the cluster encounters a failure,
|
||||
the master will enter the Recovering mode. This time, it will wait for all
|
||||
workers to respond with a message saying they are alive and well, and making
|
||||
sure they all have consistent state.
|
||||
|
||||
### Working
|
||||
This state should be the default state of Memgraph most of the time. When in
|
||||
this state, Memgraph accepts connections from Bolt clients and allows query
|
||||
execution.
|
||||
|
||||
If distributed execution fails for a transaction, that transaction, and all
|
||||
other active transactions will be aborted and the cluster will enter the
|
||||
Recovering state.
|
||||
|
||||
## Durability
|
||||
One of the important concepts in distributed Memgraph is durability.
|
||||
|
||||
### Cluster configuration
|
||||
When running Memgraph in distributed mode, the master will store cluster
|
||||
metadata in a persistent store. If fore some reason the cluster shuts down,
|
||||
recovering Memgraph from durability files shouldn't require any additional
|
||||
flags.
|
||||
|
||||
### Database ID
|
||||
Each new and clean run of Memgraph should generate a new globally unique
|
||||
database id. This id will associate all files that have persisted with this
|
||||
run. Adding the database id to snapshots, write-ahead logs and cluster metadata
|
||||
files ties them a specific Memgraph run, and it makes recovery easier to reason
|
||||
about.
|
||||
|
||||
When recovering, the cluster won't generate a new id, but will reuse the one
|
||||
from the snapshot/wal that it was able to recover from.
|
||||
|
||||
### Durability files
|
||||
Memgraph uses snapshots and write-ahead logs for durability.
|
||||
|
||||
When Memgraph recovers it has to make sure all machines in the cluster recover
|
||||
to the same recovery point. This is done by finding a common snapshot and
|
||||
finding common transactions in per-machine available write-ahead logs.
|
||||
|
||||
Since we can not be sure that each machine persisted durability files, we need
|
||||
to be able to negotiate a common recovery point in the cluster. Possible
|
||||
durability file failures could require to start the cluster from scratch,
|
||||
purging everything from storage and recovering from existing durability files.
|
||||
|
||||
We need to ensure that we keep wal files containing information about
|
||||
transactions between all existing snapshots. This will provide better durability
|
||||
in the case of a random machine durability file failure, where the cluster can
|
||||
find a common recovery point that all machines in the cluster have.
|
||||
|
||||
Also, we should suggest and make clear docs that anything less than two
|
||||
snapshots isn't considered safe for recovery.
|
||||
|
||||
### Recovery
|
||||
The recovery happens in following steps:
|
||||
* Master enables worker registration.
|
||||
* Master recovers cluster metadata from the persisted storage.
|
||||
* Master waits all required workers to register.
|
||||
* Master broadcasts a recovery request to all workers.
|
||||
* Workers respond with with a set of possible recovery points.
|
||||
* Master finds a common recovery point for the whole cluster.
|
||||
* Master broadcasts a recovery request with the common recovery point.
|
||||
* Master waits for the cluster to recover.
|
||||
* After a successful cluster recovery, master can enter Working state.
|
Loading…
Reference in New Issue
Block a user