# Distributed Memgraph specs

This document describes the reasoning behind Memgraph's distributed concepts.
## Distributed state machine

Memgraph's distributed mode introduces two cluster states, Recovering and Working. Transitions between these states shouldn't happen often, but when they do, moving from one state to the other can take a while.
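As a rough illustration, the sketch below models the two states and the allowed transitions between them. The `ClusterState` and `ClusterStateMachine` names are illustrative assumptions, not actual Memgraph types.

```cpp
#include <atomic>
#include <cassert>

// Illustrative model of the cluster state machine described above.
enum class ClusterState { Recovering, Working };

class ClusterStateMachine {
 public:
  ClusterState Current() const { return state_.load(); }

  // Called when recovery (initial or after a failure) completes successfully.
  void RecoveryDone() {
    assert(state_.load() == ClusterState::Recovering);
    state_.store(ClusterState::Working);
  }

  // Called when a failure is detected while the cluster is working.
  void FailureDetected() {
    assert(state_.load() == ClusterState::Working);
    state_.store(ClusterState::Recovering);
  }

 private:
  // The cluster starts out in the Recovering state.
  std::atomic<ClusterState> state_{ClusterState::Recovering};
};
```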
### Recovering

This is the default state for Memgraph when the cluster starts with recovery flags. If recovery finishes successfully, the state changes to Working. If recovery fails, the user will be presented with a message that explains what happened and what the next steps are.

Another way to enter this state is through failure. If the cluster encounters a failure, the master will enter the Recovering state. This time, it will wait for all workers to respond with a message saying they are alive and well, and make sure they all have a consistent state.
### Working
This state should be the default state of Memgraph most of the time. When in this state, Memgraph accepts connections from Bolt clients and allows query execution.
If distributed execution fails for a transaction, that transaction and all other active transactions will be aborted, and the cluster will enter the Recovering state.
## Durability
One of the important concepts in distributed Memgraph is durability.
### Cluster configuration

When running Memgraph in distributed mode, the master will store cluster metadata in a persistent store. If for some reason the cluster shuts down, recovering Memgraph from durability files shouldn't require any additional flags.
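A minimal sketch of what persisting cluster metadata could look like, assuming a metadata record holding the database id and worker endpoints; the struct fields, function name and on-disk format are hypothetical, not Memgraph's actual implementation.

```cpp
#include <fstream>
#include <map>
#include <string>

// Hypothetical shape of the persisted cluster metadata.
struct ClusterMetadata {
  std::string database_id;                      // globally unique run id
  std::map<int, std::string> worker_endpoints;  // worker id -> "host:port"
};

// The master writes the metadata into its durability directory so that a
// later restart can restore the cluster layout without additional flags.
void PersistClusterMetadata(const ClusterMetadata &meta,
                            const std::string &durability_dir) {
  std::ofstream out(durability_dir + "/cluster_metadata");
  out << meta.database_id << "\n";
  for (const auto &[worker_id, endpoint] : meta.worker_endpoints)
    out << worker_id << " " << endpoint << "\n";
}
```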
### Database ID

Each new, clean run of Memgraph should generate a new globally unique database id. This id associates all files persisted during that run. Adding the database id to snapshots, write-ahead logs and cluster metadata files ties them to a specific Memgraph run, which makes recovery easier to reason about.

When recovering, the cluster won't generate a new id, but will reuse the one from the snapshot/WAL it was able to recover from.
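A sketch of this id selection logic, assuming a random hex string stands in for whatever globally unique id scheme is actually used:

```cpp
#include <optional>
#include <random>
#include <sstream>
#include <string>

// Stand-in for a globally unique database id generator.
std::string GenerateDatabaseId() {
  std::random_device rd;
  std::mt19937_64 gen(rd());
  std::ostringstream id;
  id << std::hex << gen() << gen();
  return id.str();
}

// A clean run gets a fresh id; a recovering run reuses the id found in the
// snapshot/WAL it recovered from, keeping all durability files associated
// with the same run.
std::string ChooseDatabaseId(const std::optional<std::string> &recovered_id) {
  return recovered_id ? *recovered_id : GenerateDatabaseId();
}
```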
### Durability files
Memgraph uses snapshots and write-ahead logs for durability.
When Memgraph recovers, it has to make sure all machines in the cluster recover to the same recovery point. This is done by finding a common snapshot and finding common transactions in the write-ahead logs available on each machine.
Since we cannot be sure that each machine persisted its durability files, we need to be able to negotiate a common recovery point in the cluster. Durability file failures could require starting the cluster from scratch, purging everything from storage and recovering from the existing durability files.
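The negotiation could be modeled as a simple set intersection, where each worker reports the set of transaction ids it can recover to and the master picks the latest id common to all of them. This is a simplified sketch; the real recovery points may carry more information than a single transaction id.

```cpp
#include <cstdint>
#include <optional>
#include <set>
#include <vector>

// Each worker reports the set of transaction ids it can recover to, derived
// from its local snapshots and WAL files.
using RecoveryPoints = std::set<uint64_t>;

// Intersect the per-worker sets and pick the most recent point that every
// machine can reach. Returns std::nullopt when no common point exists, in
// which case the cluster cannot recover to a consistent state.
std::optional<uint64_t> FindCommonRecoveryPoint(
    const std::vector<RecoveryPoints> &per_worker_points) {
  if (per_worker_points.empty()) return std::nullopt;
  RecoveryPoints common = per_worker_points.front();
  for (size_t i = 1; i < per_worker_points.size(); ++i) {
    RecoveryPoints next;
    for (uint64_t point : common)
      if (per_worker_points[i].count(point)) next.insert(point);
    common = std::move(next);
  }
  if (common.empty()) return std::nullopt;
  return *common.rbegin();  // std::set is ordered, so this is the latest point
}
```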
We need to ensure that we keep WAL files containing the transactions between all existing snapshots. This provides better durability in the case of a durability file failure on a single machine, since the cluster can still find a common recovery point that all machines in the cluster have.
The documentation should also clearly state that keeping fewer than two snapshots isn't considered safe for recovery.
### Recovery

The recovery happens in the following steps (sketched in code after the list):
- Master enables worker registration.
- Master recovers cluster metadata from the persisted storage.
- Master waits for all required workers to register.
- Master broadcasts a recovery request to all workers.
- Workers respond with a set of possible recovery points.
- Master finds a common recovery point for the whole cluster.
- Master broadcasts a recovery request with the common recovery point.
- Master waits for the cluster to recover.
- After a successful cluster recovery, the master can enter the Working state.
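Expressed as code, the sequence above could look roughly like the driver routine below. The `Master` interface and its method names are hypothetical placeholders for the actual master/worker RPCs, and `FindCommonRecoveryPoint` refers to the sketch from the durability files section.

```cpp
#include <cstdint>
#include <optional>
#include <set>
#include <vector>

// Hypothetical master-side interface; method names mirror the steps above
// and do not correspond to Memgraph's actual API.
struct Master {
  void EnableWorkerRegistration();
  void RecoverClusterMetadata();
  void WaitForWorkersToRegister();
  std::vector<std::set<uint64_t>> RequestRecoveryPointsFromWorkers();
  void BroadcastRecoveryRequest(uint64_t recovery_point);
  void WaitForClusterToRecover();
  void EnterWorkingState();
};

// See the sketch in the "Durability files" section.
std::optional<uint64_t> FindCommonRecoveryPoint(
    const std::vector<std::set<uint64_t>> &per_worker_points);

// Driver routine for the recovery steps listed above.
bool RunRecovery(Master &master) {
  master.EnableWorkerRegistration();
  master.RecoverClusterMetadata();
  master.WaitForWorkersToRegister();
  const auto points = master.RequestRecoveryPointsFromWorkers();
  const auto common = FindCommonRecoveryPoint(points);
  if (!common) return false;  // no common recovery point, recovery fails
  master.BroadcastRecoveryRequest(*common);
  master.WaitForClusterToRecover();
  master.EnterWorkingState();
  return true;
}
```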