# High Availability (abbr. HA)

## Introduction

High availability is a characteristic of a system which aims to ensure a
certain level of operational performance for a higher-than-normal period.
Although there are multiple ways to design highly available systems, Memgraph
strives to achieve HA by eliminating single points of failure. In essence,
this implies adding redundancy to the system so that the failure of a single
component does not imply the failure of the entire system.

## Theoretical Background

The following chapter serves as an introduction to some theoretical aspects
of Memgraph's high availability implementation. Readers interested solely in
the design decisions around the HA implementation can skip this chapter.

An important implication of any HA implementation stems from Eric Brewer's
[CAP Theorem](https://fenix.tecnico.ulisboa.pt/downloadFile/1126518382178117/10.e-CAP-3.pdf)
which states that it is impossible for a distributed system to simultaneously
achieve:

* Consistency (C - every read receives the most recent write or an error)
* Availability (A - every request receives a response that is not an error)
* Partition tolerance (P - the system continues to operate despite an
  arbitrary number of messages being dropped by the network between nodes)

In the context of HA, Memgraph should strive to achieve CA.

### Consensus

Implications of the CAP theorem naturally lead us towards introducing a
cluster of machines which will have identical internal states. When a designated
machine for handling client requests fails, it can simply be replaced with
another.

Well... turns out this is not as easy as it sounds :(

Keeping around a cluster of machines with consistent internal state is an
inherently difficult problem. More precisely, this problem is as hard as
getting a cluster of machines to agree on a single value, which is a highly
researched area in distributed systems. Our research of state-of-the-art
consensus algorithms led us to Diego Ongaro's
[Raft algorithm](https://raft.github.io/raft.pdf).

#### Raft

As you might have guessed, analyzing each subtle detail of Raft goes way
beyond the scope of this document. In the remainder of this chapter we will
outline only the most important ideas and implications, leaving all further
analysis to the reader. A detailed explanation can be found either in Diego's
[dissertation](https://ramcloud.stanford.edu/~ongaro/thesis.pdf) or in the
Raft [paper](https://raft.github.io/raft.pdf).

In essence, Raft allows us to implement the previously mentioned idea of
managing a cluster of machines with identical internal states. In other
words, the Raft protocol allows us to manage a cluster of replicated
state machines which is fully functional as long as the *majority* of
the machines in the cluster operate correctly.

Another important fact is that those state machines must be *deterministic*.
In other words, the same command on two different machines with the same
internal state must yield the same result. This is important because Memgraph,
as a black box, is not entirely deterministic. Non-determinism can easily be
introduced by the user (e.g. by using the `rand` function) or by algorithms
behind query execution (e.g. introducing fuzzy logic in the planner could yield
a different order of results). Luckily, once we enter the storage level,
everything should be fully deterministic.

To summarize, Raft is a protocol which achieves consensus in a cluster of
deterministic state machines via log replication. The cluster is fully
functional if the majority of the machines work correctly. The reader
is strongly encouraged to gain a deeper understanding (at least read through
the paper) of Raft before reading the rest of this document.

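To make the replicated state machine idea concrete, here is a minimal,
self-contained C++ sketch (toy code, not Memgraph internals): two machines
that apply the same log of deterministic commands to the same initial state
necessarily end up in identical states.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// A single replicated log entry: one deterministic command.
struct Command {
  std::string key;
  int64_t value;
};

// A toy deterministic state machine: applying the same log to the same
// initial state always produces the same final state.
class StateMachine {
 public:
  void Apply(const Command &cmd) { state_[cmd.key] = cmd.value; }
  const std::map<std::string, int64_t> &state() const { return state_; }

 private:
  std::map<std::string, int64_t> state_;
};

int main() {
  std::vector<Command> log = {{"a", 1}, {"b", 2}, {"a", 3}};

  // Two members of the cluster replay the exact same log...
  StateMachine machine1, machine2;
  for (const auto &cmd : log) machine1.Apply(cmd);
  for (const auto &cmd : log) machine2.Apply(cmd);

  // ...and therefore end up in the exact same state.
  assert(machine1.state() == machine2.state());
  return 0;
}
```
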
## Integration with Memgraph

The first thing that should be defined is a single instruction within the
context of Raft (i.e. a single entry in a replicated log). As mentioned
before, these instructions should be completely deterministic when applied
to the state machine. We have therefore decided that the appropriate level
of abstraction within Memgraph corresponds to `StateDelta`s (data structures
which describe a single change to the Memgraph state, used for durability
in the WAL). Moreover, a single instruction in a replicated log will consist
of a batch of `StateDelta`s which correspond to a single **committed**
transaction. This decision both improves performance and handles some special
cases that would otherwise present themselves, by leveraging the knowledge
that the transaction should be committed.

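For illustration only, a single entry of the replicated log in this design
could look roughly like the sketch below; the simplified `StateDelta` and all
field names are assumptions made for the example, not the actual Memgraph
definitions.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Simplified stand-in for Memgraph's StateDelta: one atomic change to the
// database state (the real structure carries much more information).
struct StateDelta {
  enum class Type { CREATE_VERTEX, ADD_LABEL, SET_PROPERTY, TRANSACTION_COMMIT };
  Type type;
  uint64_t transaction_id;
  std::string payload;  // e.g. a label name or a serialized property value
};

// One entry in the replicated Raft log: the whole batch of deltas produced
// by a single committed transaction, tagged with the term it was created in.
struct LogEntry {
  uint64_t term;                   // Raft term of the leader that appended it
  uint64_t transaction_id;         // the committed transaction this batch belongs to
  std::vector<StateDelta> deltas;  // all changes of that transaction, in order
};

int main() {
  LogEntry entry{/*term=*/3, /*transaction_id=*/42,
                 {{StateDelta::Type::CREATE_VERTEX, 42, ""},
                  {StateDelta::Type::SET_PROPERTY, 42, "name=Alice"},
                  {StateDelta::Type::TRANSACTION_COMMIT, 42, ""}}};
  return entry.deltas.size() == 3 ? 0 : 1;
}
```
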
"What happens with aborted transactions?"
|
||
|
|
||
|
A great question, they are handled solely by the leader which is the only
|
||
|
machine that communicates with the client. Aborted transactions do not alter
|
||
|
the state of the database and there is no need to replicate it to other machines
|
||
|
in the cluster. If, for instance, the leader dies before returning the result
|
||
|
of some read operation in an aborted transaction, the client will notice that
|
||
|
the leader has crashed. A new leader will be elected in the next term and the
|
||
|
client should retry the transaction.
|
||
|
|
||
|
"OK, that makes sense! But, wait a minute, this is broken by design! Merely
|
||
|
generating `StateDelta`s on the leader for any transaction will taint its
|
||
|
internal storage before sending the first RPC to some follower. This deviates
|
||
|
from Raft and will crash the universe!"
|
||
|
|
||
|
Another great observation. It is indeed true that applying `StateDelta`s makes
changes to local storage, but only a single type of `StateDelta` makes that
change durable. That `StateDelta` type is called `TRANSACTION_COMMIT`, and we
will change its behaviour when working as an HA instance. More precisely, we
must not allow the transaction engine to mark the transaction as committed in
the commit log. That action should be delayed until those `StateDelta`s have
been applied to the majority of the cluster. At that point the commit log can
be safely modified, leaving it up to Raft to ensure the durability of the
transaction.

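The commit path described above might look roughly like the following sketch;
`DeltaBatch`, `ReplicateToMajority`, `CommitLog` and `HandleCommit` are
hypothetical placeholders standing in for the real transaction engine and Raft
machinery.

```cpp
#include <cstdint>

// Hypothetical placeholder for the batch of StateDeltas produced by one
// transaction; only the transaction ID matters for this sketch.
struct DeltaBatch {
  uint64_t transaction_id;
};

// Stub standing in for Raft replication: in a real cluster this would block
// until the entry is stored on a majority of machines, or fail (e.g. because
// leadership was lost in the meantime).
bool ReplicateToMajority(const DeltaBatch & /*batch*/) { return true; }

// Hypothetical commit log interface: records which transactions are durable.
struct CommitLog {
  uint64_t last_committed = 0;
  void MarkCommitted(uint64_t transaction_id) { last_committed = transaction_id; }
};

// Leader-side commit path: the deltas were already applied to local storage
// while the transaction ran, but the commit log is only touched after the
// batch has been replicated to a majority of the cluster.
bool HandleCommit(CommitLog &commit_log, const DeltaBatch &batch) {
  if (!ReplicateToMajority(batch)) {
    // Replication failed (e.g. lost leadership); the client should retry
    // against the new leader.
    return false;
  }
  // Only now is it safe to mark the transaction as committed; from this point
  // on, Raft guarantees the durability of the batch.
  commit_log.MarkCommitted(batch.transaction_id);
  return true;
}

int main() {
  CommitLog commit_log;
  return HandleCommit(commit_log, DeltaBatch{7}) ? 0 : 1;
}
```
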
We should also address one subtle detail that arises in this case. Consider
the following scenario:

* The leader starts working on a transaction which creates a new record in the
  database. Suppose that record is stored in the leader's internal storage
  but the transaction was not committed (i.e. there is no such entry in the
  commit log).
* The leader should start replicating those `StateDelta`s to its followers
  but, suddenly, it's cut off from the rest of the cluster.
* Due to timeout, a new election is held and a new leader is elected.
* Our old leader comes back to life and becomes a follower.
* The new leader receives a transaction which creates that same record, but
  this transaction is successfully replicated and committed by the new leader.

The problem lies in the fact that there is still a record within the internal
storage of our old leader with the same transaction ID and GID as the record
recently committed by the new leader. Obviously, this is broken. As a solution,
we will prefix the transaction ID with the Raft term, thus ensuring that the
transaction IDs will differ.

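One possible way to realize that term prefix is sketched below; the 32/32 bit
split and the helper name are assumptions made for the example, not a committed
design.

```cpp
#include <cassert>
#include <cstdint>

// One possible encoding of "prefix the transaction ID with the Raft term":
// pack the term into the upper bits and the per-term local counter into the
// lower bits of a 64-bit ID.
constexpr uint64_t MakeTransactionId(uint32_t raft_term, uint32_t local_id) {
  return (static_cast<uint64_t>(raft_term) << 32) | local_id;
}

int main() {
  // The same local counter value produced in two different terms (e.g. by the
  // old leader before it was cut off and by the new leader afterwards) now
  // yields two distinct transaction IDs.
  uint64_t old_leader_txn = MakeTransactionId(/*raft_term=*/4, /*local_id=*/17);
  uint64_t new_leader_txn = MakeTransactionId(/*raft_term=*/5, /*local_id=*/17);
  assert(old_leader_txn != new_leader_txn);
  return 0;
}
```
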
"When will followers append that transaction to their commit logs?"
|
||
|
|
||
|
When the leader deduces that the transaction is safe to commit, it will include
|
||
|
the relevant information in all further heartbeats which will alert the
|
||
|
followers that it is safe to commit those entries from their raft logs.
|
||
|
Naturally, the followers need not to delay appending data to the commit log
|
||
|
as they know that the transaction has already been committed (from the clusters
|
||
|
point of view). If this sounds really messed up, seriously, read the Raft paper.
|
||
|
|
||
|
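A hedged sketch of the follower side of this exchange is given below; the
`Heartbeat` and `FollowerState` structures are invented for the example and
only mirror the idea that the leader's commit index travels with the heartbeat.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// The leader's heartbeat carries the index up to which log entries are known
// to be committed cluster-wide.
struct Heartbeat {
  uint64_t leader_commit_index;
};

struct FollowerState {
  std::vector<uint64_t> log_transaction_ids;  // transaction ID per log index (1-based)
  uint64_t commit_index = 0;                  // how far this follower has committed
};

// Stand-in for writing an entry to the follower's commit log.
void MarkCommitted(uint64_t /*transaction_id*/) { /* update the commit log */ }

void OnHeartbeat(FollowerState &follower, const Heartbeat &hb) {
  // Never commit past what is actually present in the local log.
  uint64_t new_commit_index = std::min<uint64_t>(
      hb.leader_commit_index, follower.log_transaction_ids.size());
  // Commit everything between the old and the new commit index, in order.
  for (uint64_t i = follower.commit_index + 1; i <= new_commit_index; ++i) {
    MarkCommitted(follower.log_transaction_ids[i - 1]);
  }
  follower.commit_index = new_commit_index;
}

int main() {
  FollowerState follower{{101, 102, 103}, 1};  // entry 1 is already committed
  OnHeartbeat(follower, Heartbeat{3});         // leader says index 3 is committed
  return follower.commit_index == 3 ? 0 : 1;
}
```
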
"How does the raft log differ from WAL"
|
||
|
|
||
|
Conceptually, it doesn't. When operating in HA, we don't really need the
|
||
|
recovery mechanisms implemented in Memgraph thus far. When a dead machine
|
||
|
comes back to life, it will eventually come in sync with the rest of the
|
||
|
cluster and everything will be done using the machine's raft log as well
|
||
|
as the messages received from the cluster leader.
|
||
|
|
||
|
"Those logs will become huge, isn't that recovery going to be painfully slow?"
|
||
|
|
||
|
True, but there are mechanisms for making raft logs more compact. The most
|
||
|
popular method is, wait for it, making snapshots :)
|
||
|
Although the process of bringing an old machine back to life is a long one,
|
||
|
it doesn't really affect the performance of the cluster in a great degree.
|
||
|
The cluster will work perfectly fine with that machine being way out of sync.
|
||
|
|
||
|
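A minimal sketch of snapshot-based compaction, with illustrative names only:
once the state machine has been snapshotted up to some log index, every entry
at or below that index can be dropped, and a lagging machine can be brought up
to date from the snapshot instead of replaying the whole log.

```cpp
#include <cstdint>
#include <deque>
#include <map>
#include <string>

// A snapshot of the state machine together with the last log entry it covers.
struct Snapshot {
  uint64_t last_included_index;
  uint64_t last_included_term;
  std::map<std::string, int64_t> state;  // full state machine image
};

// A trimmed-down Raft log: only the term of each kept entry is modeled here.
struct RaftLog {
  uint64_t first_index = 1;          // index of the first entry still kept
  std::deque<uint64_t> entry_terms;  // one element per kept entry
};

// Drop every entry covered by the snapshot.
void CompactLog(RaftLog &log, const Snapshot &snapshot) {
  while (log.first_index <= snapshot.last_included_index &&
         !log.entry_terms.empty()) {
    log.entry_terms.pop_front();
    ++log.first_index;
  }
}

int main() {
  RaftLog log{1, {2, 2, 3, 3, 3}};      // five entries, indices 1..5
  Snapshot snapshot{3, 3, {{"a", 1}}};  // state up to index 3 is snapshotted
  CompactLog(log, snapshot);
  return (log.first_index == 4 && log.entry_terms.size() == 2) ? 0 : 1;
}
```
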
"I don't know, everything seems to be a lot slower than before!"
|
||
|
|
||
|
Absolutely true, the user should be aware that they will suffer dire
|
||
|
consequences on the performance side if they choose to be highly available.
|
||
|
As Frankie says, "That's life!".
|
||
|
|
||
|
"Also, I didn't really care about most of the things you've said. I'm
|
||
|
not a part of the storage team and couldn't care less about the issues you
|
||
|
face, how does HA affect 'my part of the codebase'?"
|
||
|
|
||
|
Answer for query execution: That's ok, you'll be able to use the same beloved
API (when we implement it, he he :) towards storage and continue to
make fun of us when you find a bug.

Answer for infrastructure: We'll talk. Some changes will surely need to
be made on the Memgraph client. There is a chapter in Diego's dissertation
called 'Client interaction', but we'll cross that bridge when we get there.
There will also be the whole 'integration with Jepsen tests' thing going on.

Answer for analytics: I'm astonished you've read this article. Wanna join
storage?

## How do we test HA

[Check this out](https://jepsen.io/analyses/dgraph-1-0-2)