Add HA feature spec
Reviewers: msantl, buda, teon.banek, mferencevic, mtomic, vkasljevic
Reviewed By: msantl
Differential Revision: https://phabricator.memgraph.io/D1692
docs/feature_spec/high_availability.md (new file, 185 lines)
# High Availability (abbr. HA)

## Introduction

High availability is a characteristic of a system which aims to ensure a
certain level of operational performance for a higher-than-normal period.
Although there are multiple ways to design highly available systems, Memgraph
strives to achieve HA by eliminating single points of failure. In essence,
this means adding redundancy to the system so that the failure of a single
component does not bring down the entire system.

## Theoretical Background

The following chapter serves as an introduction to some theoretical aspects
of Memgraph's high availability implementation. Readers solely interested in
the design decisions around the HA implementation can skip this chapter.

An important implication of any HA implementation stems from Eric Brewer's
[CAP Theorem](https://fenix.tecnico.ulisboa.pt/downloadFile/1126518382178117/10.e-CAP-3.pdf)
which states that it is impossible for a distributed system to simultaneously
achieve:

* Consistency (C - every read receives the most recent write or an error)
* Availability (A - every request receives a response that is not an error)
* Partition tolerance (P - the system continues to operate despite an
  arbitrary number of messages being dropped by the network between nodes)

In the context of HA, Memgraph should strive to achieve CA.

### Consensus

Implications of the CAP theorem naturally lead us towards introducing a
cluster of machines with identical internal states. When the designated
machine for handling client requests fails, it can simply be replaced with
another.

Well... it turns out this is not as easy as it sounds :(

Keeping a cluster of machines in a consistent internal state is an
inherently difficult problem. More precisely, it is as hard as getting a
cluster of machines to agree on a single value, which is a heavily researched
area in distributed systems. Our survey of state-of-the-art consensus
algorithms led us to Diego Ongaro's
[Raft algorithm](https://raft.github.io/raft.pdf).

#### Raft

As you might have guessed, analyzing each subtle detail of Raft goes way
beyond the scope of this document. In the remainder of this chapter we will
outline only the most important ideas and implications, leaving any further
analysis to the reader. A detailed explanation can be found either in Diego's
[dissertation](https://ramcloud.stanford.edu/~ongaro/thesis.pdf) or in the
Raft [paper](https://raft.github.io/raft.pdf).

In essence, Raft allows us to implement the previously mentioned idea of
managing a cluster of machines with identical internal states. In other
words, the Raft protocol allows us to manage a cluster of replicated
state machines which remains fully functional as long as the *majority* of
the machines in the cluster operate correctly.

Another important fact is that those state machines must be *deterministic*:
the same command applied to two different machines with the same internal
state must yield the same result. This is important because Memgraph,
as a black box, is not entirely deterministic. Non-determinism can easily be
introduced by the user (e.g. by using the `rand` function) or by the algorithms
behind query execution (e.g. introducing fuzzy logic into the planner could
yield a different order of results). Luckily, once we enter the storage level,
everything should be fully deterministic.

To summarize, Raft is a protocol which achieves consensus in a cluster of
deterministic state machines via log replication. The cluster is fully
functional if the majority of the machines work correctly. The reader
is strongly encouraged to gain a deeper understanding of Raft (at least read
through the paper) before reading the rest of this document.
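
For readers who prefer code over prose, here is a minimal sketch of a
replicated, deterministic state machine. All type and method names are
invented for this illustration and are not part of Memgraph's codebase; the
point is only that every machine applies the same committed log entries, in
the same order, and therefore ends up in the same state.

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct LogEntry {
  uint64_t term;        // Raft term in which the entry was created.
  std::string command;  // Opaque, deterministic command.
};

class ReplicatedStateMachine {
 public:
  // Applies all newly committed entries, in log order. Because Apply is
  // deterministic, every machine in the cluster converges to the same state.
  void AdvanceTo(uint64_t commit_index, const std::vector<LogEntry> &log) {
    while (last_applied_ < commit_index) {
      Apply(log[last_applied_]);
      ++last_applied_;
    }
  }

 private:
  void Apply(const LogEntry &entry) { /* mutate local state deterministically */ }
  uint64_t last_applied_ = 0;
};
```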

## Integration with Memgraph

The first thing that should be defined is a single instruction within the
context of Raft (i.e. a single entry in the replicated log). As mentioned
before, these instructions should be completely deterministic when applied
to the state machine. We have therefore decided that the appropriate level
of abstraction within Memgraph corresponds to `StateDelta`s (data structures
which describe a single change to the Memgraph state, used for durability
in the WAL). Moreover, a single instruction in the replicated log will consist
of a batch of `StateDelta`s which correspond to a single **committed**
transaction. This decision both improves performance and, by leveraging the
knowledge that the transaction is going to be committed, handles some special
cases that would otherwise arise.
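
To make the chosen granularity concrete, a single replicated log entry could
look roughly like the sketch below. `RaftLogEntry`, its fields, and the
placeholder `StateDelta` are assumptions made for illustration, not the
actual Memgraph definitions.

```cpp
#include <cstdint>
#include <vector>

// Placeholder standing in for the real durability structure.
struct StateDelta { /* describes one change to Memgraph's state (as in the WAL) */ };

struct RaftLogEntry {
  uint64_t term;  // Raft term in which the leader created this entry.
  // All StateDeltas produced by a single transaction; an entry is only
  // appended for transactions that are going to be committed.
  std::vector<StateDelta> deltas;
};
```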
"What happens with aborted transactions?"
|
||||
|
||||
A great question, they are handled solely by the leader which is the only
|
||||
machine that communicates with the client. Aborted transactions do not alter
|
||||
the state of the database and there is no need to replicate it to other machines
|
||||
in the cluster. If, for instance, the leader dies before returning the result
|
||||
of some read operation in an aborted transaction, the client will notice that
|
||||
the leader has crashed. A new leader will be elected in the next term and the
|
||||
client should retry the transaction.
|
||||
|
||||
"OK, that makes sense! But, wait a minute, this is broken by design! Merely
|
||||
generating `StateDelta`s on the leader for any transaction will taint its
|
||||
internal storage before sending the first RPC to some follower. This deviates
|
||||
from Raft and will crash the universe!"
|
||||
|
||||
Another great observation. It is indeed true that applying `StateDelta`s makes
|
||||
changes to local storage, but only a single type of `StateDelta` makes that
|
||||
change durable. That `StateDelta` type is called `TRANSACTION_COMMIT` and we
|
||||
will change its behaviour when working as a HA instance. More precisely, we
|
||||
must not allow the transaction engine to modify the commit log saying that
|
||||
the transaction has been committed. That action should be delayed until those
|
||||
`StateDelta`s have been applied to the majority of the cluster. At that point
|
||||
the commit log can be safely modified leaving it up to Raft to ensure the
|
||||
durability of the transaction.
|
||||
|
||||
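
A rough sketch of that leader-side commit path follows. All helper and type
names are hypothetical, not taken from the codebase; what matters is the
ordering: replicate to a majority first, only then mark the transaction as
committed in the commit log.

```cpp
#include <cstdint>
#include <vector>

struct StateDelta { /* one change to Memgraph's state */ };

// Stub: append the deltas as one Raft log entry and block until a majority
// of the cluster has stored it (or leadership is lost).
bool ReplicateToMajority(uint64_t term, const std::vector<StateDelta> &deltas) {
  (void)term; (void)deltas;
  return true;
}

// Stub: the only place where the transaction engine's commit log is touched.
void MarkCommittedInCommitLog(uint64_t transaction_id) { (void)transaction_id; }

// Called instead of directly applying TRANSACTION_COMMIT on an HA instance.
bool CommitTransaction(uint64_t term, uint64_t transaction_id,
                       const std::vector<StateDelta> &deltas) {
  if (!ReplicateToMajority(term, deltas)) return false;  // e.g. lost leadership
  MarkCommittedInCommitLog(transaction_id);  // Raft now guarantees durability
  return true;
}
```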

We should also address one subtle detail that arises in this case. Consider
the following scenario:

* The leader starts working on a transaction which creates a new record in the
  database. Suppose that record is stored in the leader's internal storage
  but the transaction was not committed (i.e. there is no such entry in the
  commit log).
* The leader should start replicating those `StateDelta`s to its followers
  but, suddenly, it is cut off from the rest of the cluster.
* Due to a timeout, a new election is held and a new leader is elected.
* Our old leader comes back to life and becomes a follower.
* The new leader receives a transaction which creates that same record, but
  this transaction is successfully replicated and committed by the new leader.

The problem lies in the fact that there is still a record within the internal
storage of our old leader with the same transaction ID and GID as the record
recently committed by the new leader. Obviously, this is broken. As a solution,
we will prefix the transaction ID with the Raft term, thus ensuring that the
transaction IDs will differ.
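
A possible way to compose such a term-prefixed ID is sketched below. The exact
bit layout (16 bits of term, 48 bits of local counter) and the function name
are assumptions made purely for illustration.

```cpp
#include <cstdint>

constexpr int kTermBits = 16;
constexpr int kLocalBits = 64 - kTermBits;

// Transactions started in different Raft terms can never share an ID, even
// if their local counters collide.
constexpr uint64_t MakeTransactionId(uint64_t raft_term, uint64_t local_id) {
  return (raft_term << kLocalBits) | (local_id & ((1ULL << kLocalBits) - 1));
}

static_assert(MakeTransactionId(2, 42) != MakeTransactionId(3, 42),
              "same local counter, different terms -> different IDs");
```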
"When will followers append that transaction to their commit logs?"
|
||||
|
||||
When the leader deduces that the transaction is safe to commit, it will include
|
||||
the relevant information in all further heartbeats which will alert the
|
||||
followers that it is safe to commit those entries from their raft logs.
|
||||
Naturally, the followers need not to delay appending data to the commit log
|
||||
as they know that the transaction has already been committed (from the clusters
|
||||
point of view). If this sounds really messed up, seriously, read the Raft paper.
|
||||
|
||||
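
Seen from the follower's side, the idea looks roughly like this. The message
shape and all names are assumptions for illustration, not Memgraph's actual
RPC layer.

```cpp
#include <cstdint>

struct Heartbeat {
  uint64_t leader_term;
  uint64_t leader_commit_index;  // highest log index known to be committed
};

class Follower {
 public:
  void HandleHeartbeat(const Heartbeat &hb) {
    if (hb.leader_term < current_term_) return;  // stale leader, ignore
    current_term_ = hb.leader_term;
    // Commit (append to the local commit log) every entry the leader already
    // considers committed; there is no reason to delay this on the follower.
    while (commit_index_ < hb.leader_commit_index &&
           commit_index_ < raft_log_size_) {
      ++commit_index_;
      MarkEntryCommitted(commit_index_);
    }
  }

 private:
  void MarkEntryCommitted(uint64_t index) { /* update local commit log */ }
  uint64_t current_term_ = 0;
  uint64_t commit_index_ = 0;
  uint64_t raft_log_size_ = 0;
};
```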
"How does the raft log differ from WAL"
|
||||
|
||||
Conceptually, it doesn't. When operating in HA, we don't really need the
|
||||
recovery mechanisms implemented in Memgraph thus far. When a dead machine
|
||||
comes back to life, it will eventually come in sync with the rest of the
|
||||
cluster and everything will be done using the machine's raft log as well
|
||||
as the messages received from the cluster leader.
|
||||
|
||||
"Those logs will become huge, isn't that recovery going to be painfully slow?"
|
||||
|
||||
True, but there are mechanisms for making raft logs more compact. The most
|
||||
popular method is, wait for it, making snapshots :)
|
||||
Although the process of bringing an old machine back to life is a long one,
|
||||
it doesn't really affect the performance of the cluster in a great degree.
|
||||
The cluster will work perfectly fine with that machine being way out of sync.
|
||||
|
||||
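
A sketch of what such snapshot-based compaction could look like; the
threshold and every name below are assumptions for illustration only.

```cpp
#include <cstdint>

constexpr uint64_t kLogEntriesPerSnapshot = 100000;  // assumed threshold

void TakeSnapshot(uint64_t up_to_index) { /* persist full state */ (void)up_to_index; }
void DiscardLogPrefix(uint64_t up_to_index) { /* drop entries <= index */ (void)up_to_index; }

// Once enough entries have been applied, the applied prefix of the Raft log
// is replaced by a snapshot of the state, keeping the log (and the recovery
// time of machines that fall behind) bounded.
void MaybeCompactLog(uint64_t last_applied_index, uint64_t first_log_index) {
  if (last_applied_index - first_log_index < kLogEntriesPerSnapshot) return;
  TakeSnapshot(last_applied_index);
  DiscardLogPrefix(last_applied_index);
}
```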
"I don't know, everything seems to be a lot slower than before!"
|
||||
|
||||
Absolutely true, the user should be aware that they will suffer dire
|
||||
consequences on the performance side if they choose to be highly available.
|
||||
As Frankie says, "That's life!".
|
||||
|
||||
"Also, I didn't really care about most of the things you've said. I'm
|
||||
not a part of the storage team and couldn't care less about the issues you
|
||||
face, how does HA affect 'my part of the codebase'?"
|
||||
|
||||
Answer for query execution: That's ok, you'll be able to use the same beloved
|
||||
API (when we implement it, he he :) towards storage and continue to
|
||||
make fun of us when you find a bug.
|
||||
|
||||
Answer for infrastructure: We'll talk. Some changes will surely need to
|
||||
be made on the Memgraph client. There is a chapter in Diego's dissertation
|
||||
called 'Client interaction', but we'll cross that bridge when we get there.
|
||||
There will also be the whole 'integration with Jepsen tests' thing going on.
|
||||
|
||||
Answer for analytics: I'm astonished you've read this article. Wanna join
|
||||
storage?
|
||||
|
||||

## How do we test HA

[Check this out](https://jepsen.io/analyses/dgraph-1-0-2)