# High Availability (abbr. HA)

## Introduction

High availability is a characteristic of a system which aims to ensure a
certain level of operational performance for a higher-than-normal period.
Although there are multiple ways to design highly available systems, Memgraph
strives to achieve HA by eliminating single points of failure. In essence,
this implies adding redundancy to the system so that the failure of a single
component does not imply the failure of the entire system.

## Theoretical Background

The following chapter serves as an introduction to some theoretical aspects
of Memgraph's high availability implementation. Readers interested solely in
the design decisions around the HA implementation can skip this chapter.

An important implication of any HA implementation stems from Eric Brewer's
[CAP Theorem](https://fenix.tecnico.ulisboa.pt/downloadFile/1126518382178117/10.e-CAP-3.pdf)
which states that it is impossible for a distributed system to simultaneously
achieve:

* Consistency (C - every read receives the most recent write or an error)
* Availability (A - every request receives a response that is not an error)
* Partition tolerance (P - the system continues to operate despite an
  arbitrary number of messages being dropped by the network between nodes)

In the context of HA, Memgraph should strive to achieve CA.

### Consensus

Implications of the CAP theorem naturally lead us towards introducing a
cluster of machines which will have identical internal states. When a designated
machine for handling client requests fails, it can simply be replaced with
another.

Well... turns out this is not as easy as it sounds :(

Keeping around a cluster of machines with consistent internal state is an
inherently difficult problem. More precisely, this problem is as hard as
getting a cluster of machines to agree on a single value, which is a highly
researched area in distributed systems. Our research of state-of-the-art
consensus algorithms led us to Diego Ongaro's
[Raft algorithm](https://raft.github.io/raft.pdf).

#### Raft

As you might have guessed, analyzing each subtle detail of Raft goes way
beyond the scope of this document. In the remainder of this chapter we will
outline only the most important ideas and implications, leaving all further
analysis to the reader. A detailed explanation can be found either in Diego's
[dissertation](https://ramcloud.stanford.edu/~ongaro/thesis.pdf) or in the
Raft [paper](https://raft.github.io/raft.pdf).

In essence, Raft allows us to implement the previously mentioned idea of
managing a cluster of machines with identical internal states. In other
words, the Raft protocol allows us to manage a cluster of replicated
state machines which is fully functional as long as the *majority* of
the machines in the cluster operate correctly.

Another important fact is that those state machines must be *deterministic*.
In other words, the same command on two different machines with the same
internal state must yield the same result. This is important because Memgraph,
as a black box, is not entirely deterministic. Non-determinism can easily be
introduced by the user (e.g. by using the `rand` function) or by algorithms
behind query execution (e.g. introducing fuzzy logic in the planner could yield
a different order of results). Luckily, once we enter the storage level,
everything should be fully deterministic.

To summarize, Raft is a protocol which achieves consensus in a cluster of
deterministic state machines via log replication. The cluster is fully
functional if the majority of the machines work correctly. The reader
is strongly encouraged to gain a deeper understanding (at least read through
the paper) of Raft before reading the rest of this document.

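To make the replicated state machine idea concrete, here is a minimal,
self-contained C++ sketch (toy code, not Memgraph internals): two machines
that apply the same log of deterministic commands to the same initial state
necessarily end up in identical states.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// A single replicated log entry: one deterministic command.
struct Command {
  std::string key;
  int64_t value;
};

// A toy deterministic state machine: applying the same log to the same
// initial state always produces the same final state.
class StateMachine {
 public:
  void Apply(const Command &cmd) { state_[cmd.key] = cmd.value; }
  const std::map<std::string, int64_t> &state() const { return state_; }

 private:
  std::map<std::string, int64_t> state_;
};

int main() {
  std::vector<Command> log = {{"a", 1}, {"b", 2}, {"a", 3}};

  // Two members of the cluster replay the exact same log...
  StateMachine machine1, machine2;
  for (const auto &cmd : log) machine1.Apply(cmd);
  for (const auto &cmd : log) machine2.Apply(cmd);

  // ...and therefore end up in the exact same state.
  assert(machine1.state() == machine2.state());
  return 0;
}
```
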
## Integration with Memgraph

The first thing that should be defined is a single instruction within the
context of Raft (i.e. a single entry in a replicated log). As mentioned
before, these instructions should be completely deterministic when applied
to the state machine. We have therefore decided that the appropriate level
of abstraction within Memgraph corresponds to `StateDelta`s (data structures
which describe a single change to the Memgraph state, used for durability
in the WAL). Moreover, a single instruction in a replicated log will consist
of a batch of `StateDelta`s which correspond to a single **committed**
transaction. This decision both improves performance and handles some special
cases that would otherwise present themselves, by leveraging the knowledge
that the transaction should be committed.

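For illustration only, a single entry of the replicated log in this design
could look roughly like the sketch below; the simplified `StateDelta` and all
field names are assumptions made for the example, not the actual Memgraph
definitions.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Simplified stand-in for Memgraph's StateDelta: one atomic change to the
// database state (the real structure carries much more information).
struct StateDelta {
  enum class Type { CREATE_VERTEX, ADD_LABEL, SET_PROPERTY, TRANSACTION_COMMIT };
  Type type;
  uint64_t transaction_id;
  std::string payload;  // e.g. a label name or a serialized property value
};

// One entry in the replicated Raft log: the whole batch of deltas produced
// by a single committed transaction, tagged with the term it was created in.
struct LogEntry {
  uint64_t term;                   // Raft term of the leader that appended it
  uint64_t transaction_id;         // the committed transaction this batch belongs to
  std::vector<StateDelta> deltas;  // all changes of that transaction, in order
};

int main() {
  LogEntry entry{/*term=*/3, /*transaction_id=*/42,
                 {{StateDelta::Type::CREATE_VERTEX, 42, ""},
                  {StateDelta::Type::SET_PROPERTY, 42, "name=Alice"},
                  {StateDelta::Type::TRANSACTION_COMMIT, 42, ""}}};
  return entry.deltas.size() == 3 ? 0 : 1;
}
```
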
"What happens with aborted transactions?"
|
||
|
|
||
|
A great question, they are handled solely by the leader which is the only
|
||
|
machine that communicates with the client. Aborted transactions do not alter
|
||
|
the state of the database and there is no need to replicate it to other machines
|
||
|
in the cluster. If, for instance, the leader dies before returning the result
|
||
|
of some read operation in an aborted transaction, the client will notice that
|
||
|
the leader has crashed. A new leader will be elected in the next term and the
|
||
|
client should retry the transaction.
|
||
|
|
||
|
"OK, that makes sense! But, wait a minute, this is broken by design! Merely
|
||
|
generating `StateDelta`s on the leader for any transaction will taint its
|
||
|
internal storage before sending the first RPC to some follower. This deviates
|
||
|
from Raft and will crash the universe!"
|
||
|
|
||
|
Another great observation. It is indeed true that applying `StateDelta`s makes
changes to local storage, but only a single type of `StateDelta` makes that
change durable. That `StateDelta` type is called `TRANSACTION_COMMIT`, and we
will change its behaviour when working as an HA instance. More precisely, we
must not allow the transaction engine to mark the transaction as committed in
the commit log. That action should be delayed until those `StateDelta`s have
been applied to the majority of the cluster. At that point the commit log can
be safely modified, leaving it up to Raft to ensure the durability of the
transaction.

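The commit path described above might look roughly like the following sketch;
`DeltaBatch`, `ReplicateToMajority`, `CommitLog` and `HandleCommit` are
hypothetical placeholders standing in for the real transaction engine and Raft
machinery.

```cpp
#include <cstdint>

// Hypothetical placeholder for the batch of StateDeltas produced by one
// transaction; only the transaction ID matters for this sketch.
struct DeltaBatch {
  uint64_t transaction_id;
};

// Stub standing in for Raft replication: in a real cluster this would block
// until the entry is stored on a majority of machines, or fail (e.g. because
// leadership was lost in the meantime).
bool ReplicateToMajority(const DeltaBatch & /*batch*/) { return true; }

// Hypothetical commit log interface: records which transactions are durable.
struct CommitLog {
  uint64_t last_committed = 0;
  void MarkCommitted(uint64_t transaction_id) { last_committed = transaction_id; }
};

// Leader-side commit path: the deltas were already applied to local storage
// while the transaction ran, but the commit log is only touched after the
// batch has been replicated to a majority of the cluster.
bool HandleCommit(CommitLog &commit_log, const DeltaBatch &batch) {
  if (!ReplicateToMajority(batch)) {
    // Replication failed (e.g. lost leadership); the client should retry
    // against the new leader.
    return false;
  }
  // Only now is it safe to mark the transaction as committed; from this point
  // on, Raft guarantees the durability of the batch.
  commit_log.MarkCommitted(batch.transaction_id);
  return true;
}

int main() {
  CommitLog commit_log;
  return HandleCommit(commit_log, DeltaBatch{7}) ? 0 : 1;
}
```
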
We should also address one subtle detail that arises in this case. Consider
the following scenario:

* The leader starts working on a transaction which creates a new record in the
  database. Suppose that record is stored in the leader's internal storage
  but the transaction was not committed (i.e. there is no such entry in the
  commit log).
* The leader should start replicating those `StateDelta`s to its followers
  but, suddenly, it's cut off from the rest of the cluster.
* Due to timeout, a new election is held and a new leader is elected.
* Our old leader comes back to life and becomes a follower.
* The new leader receives a transaction which creates that same record, but
  this transaction is successfully replicated and committed by the new leader.

The problem lies in the fact that there is still a record within the internal
storage of our old leader with the same transaction ID and GID as the record
recently committed by the new leader. Obviously, this is broken. As a solution,
we will prefix the transaction ID with the Raft term, thus ensuring that the
transaction IDs will differ.

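One possible way to realize that term prefix is sketched below; the 32/32 bit
split and the helper name are assumptions made for the example, not a committed
design.

```cpp
#include <cassert>
#include <cstdint>

// One possible encoding of "prefix the transaction ID with the Raft term":
// pack the term into the upper bits and the per-term local counter into the
// lower bits of a 64-bit ID.
constexpr uint64_t MakeTransactionId(uint32_t raft_term, uint32_t local_id) {
  return (static_cast<uint64_t>(raft_term) << 32) | local_id;
}

int main() {
  // The same local counter value produced in two different terms (e.g. by the
  // old leader before it was cut off and by the new leader afterwards) now
  // yields two distinct transaction IDs.
  uint64_t old_leader_txn = MakeTransactionId(/*raft_term=*/4, /*local_id=*/17);
  uint64_t new_leader_txn = MakeTransactionId(/*raft_term=*/5, /*local_id=*/17);
  assert(old_leader_txn != new_leader_txn);
  return 0;
}
```
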
"When will followers append that transaction to their commit logs?"
|
||
|
|
||
|
When the leader deduces that the transaction is safe to commit, it will include
|
||
|
the relevant information in all further heartbeats which will alert the
|
||
|
followers that it is safe to commit those entries from their raft logs.
|
||
|
Naturally, the followers need not to delay appending data to the commit log
|
||
|
as they know that the transaction has already been committed (from the clusters
|
||
|
point of view). If this sounds really messed up, seriously, read the Raft paper.
|
||
|
|
||
|
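A hedged sketch of the follower side of this exchange is given below; the
`Heartbeat` and `FollowerState` structures are invented for the example and
only mirror the idea that the leader's commit index travels with the heartbeat.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// The leader's heartbeat carries the index up to which log entries are known
// to be committed cluster-wide.
struct Heartbeat {
  uint64_t leader_commit_index;
};

struct FollowerState {
  std::vector<uint64_t> log_transaction_ids;  // transaction ID per log index (1-based)
  uint64_t commit_index = 0;                  // how far this follower has committed
};

// Stand-in for writing an entry to the follower's commit log.
void MarkCommitted(uint64_t /*transaction_id*/) { /* update the commit log */ }

void OnHeartbeat(FollowerState &follower, const Heartbeat &hb) {
  // Never commit past what is actually present in the local log.
  uint64_t new_commit_index = std::min<uint64_t>(
      hb.leader_commit_index, follower.log_transaction_ids.size());
  // Commit everything between the old and the new commit index, in order.
  for (uint64_t i = follower.commit_index + 1; i <= new_commit_index; ++i) {
    MarkCommitted(follower.log_transaction_ids[i - 1]);
  }
  follower.commit_index = new_commit_index;
}

int main() {
  FollowerState follower{{101, 102, 103}, 1};  // entry 1 is already committed
  OnHeartbeat(follower, Heartbeat{3});         // leader says index 3 is committed
  return follower.commit_index == 3 ? 0 : 1;
}
```
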
"How does the raft log differ from WAL"
|
||
|
|
||
|
Conceptually, it doesn't. When operating in HA, we don't really need the
|
||
|
recovery mechanisms implemented in Memgraph thus far. When a dead machine
|
||
|
comes back to life, it will eventually come in sync with the rest of the
|
||
|
cluster and everything will be done using the machine's raft log as well
|
||
|
as the messages received from the cluster leader.
|
||
|
|
||
|
"Those logs will become huge, isn't that recovery going to be painfully slow?"
|
||
|
|
||
|
True, but there are mechanisms for making raft logs more compact. The most
|
||
|
popular method is, wait for it, making snapshots :)
|
||
|
Although the process of bringing an old machine back to life is a long one,
|
||
|
it doesn't really affect the performance of the cluster in a great degree.
|
||
|
The cluster will work perfectly fine with that machine being way out of sync.
|
||
|
|
||
|
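A minimal sketch of snapshot-based compaction, with illustrative names only:
once the state machine has been snapshotted up to some log index, every entry
at or below that index can be dropped, and a lagging machine can be brought up
to date from the snapshot instead of replaying the whole log.

```cpp
#include <cstdint>
#include <deque>
#include <map>
#include <string>

// A snapshot of the state machine together with the last log entry it covers.
struct Snapshot {
  uint64_t last_included_index;
  uint64_t last_included_term;
  std::map<std::string, int64_t> state;  // full state machine image
};

// A trimmed-down Raft log: only the term of each kept entry is modeled here.
struct RaftLog {
  uint64_t first_index = 1;          // index of the first entry still kept
  std::deque<uint64_t> entry_terms;  // one element per kept entry
};

// Drop every entry covered by the snapshot.
void CompactLog(RaftLog &log, const Snapshot &snapshot) {
  while (log.first_index <= snapshot.last_included_index &&
         !log.entry_terms.empty()) {
    log.entry_terms.pop_front();
    ++log.first_index;
  }
}

int main() {
  RaftLog log{1, {2, 2, 3, 3, 3}};      // five entries, indices 1..5
  Snapshot snapshot{3, 3, {{"a", 1}}};  // state up to index 3 is snapshotted
  CompactLog(log, snapshot);
  return (log.first_index == 4 && log.entry_terms.size() == 2) ? 0 : 1;
}
```
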
"I don't know, everything seems to be a lot slower than before!"
|
||
|
|
||
|
Absolutely true, the user should be aware that they will suffer dire
|
||
|
consequences on the performance side if they choose to be highly available.
|
||
|
As Frankie says, "That's life!".
|
||
|
|
||
|
"Also, I didn't really care about most of the things you've said. I'm
|
||
|
not a part of the storage team and couldn't care less about the issues you
|
||
|
face, how does HA affect 'my part of the codebase'?"
|
||
|
|
||
|
Answer for query execution: That's ok, you'll be able to use the same beloved
API (when we implement it, he he :) towards storage and continue to
make fun of us when you find a bug.

Answer for infrastructure: We'll talk. Some changes will surely need to
be made on the Memgraph client. There is a chapter in Diego's dissertation
called 'Client interaction', but we'll cross that bridge when we get there.
There will also be the whole 'integration with Jepsen tests' thing going on.

Answer for analytics: I'm astonished you've read this article. Wanna join
storage?

## How do we test HA

[Check this out](https://jepsen.io/analyses/dgraph-1-0-2)