Add HA feature spec
Reviewers: msantl, buda, teon.banek, mferencevic, mtomic, vkasljevic
Reviewed By: msantl
Differential Revision: https://phabricator.memgraph.io/D1692
docs/feature_spec/high_availability.md (new file, 185 lines)
# High Availability (abbr. HA)

## Introduction

High availability is a characteristic of a system which aims to ensure a
certain level of operational performance for a higher-than-normal period.
Although there are multiple ways to design highly available systems, Memgraph
strives to achieve HA by eliminating single points of failure. In essence,
this means adding redundancy to the system so that the failure of a single
component does not bring down the entire system.

## Theoretical Background

The following chapter serves as an introduction to some theoretical aspects
of Memgraph's high availability implementation. Readers solely interested in
the design decisions around the HA implementation can skip this chapter.

An important implication of any HA implementation stems from Eric Brewer's
[CAP Theorem](https://fenix.tecnico.ulisboa.pt/downloadFile/1126518382178117/10.e-CAP-3.pdf)
which states that it is impossible for a distributed system to simultaneously
achieve:

* Consistency (C - every read receives the most recent write or an error)
* Availability (A - every request receives a response that is not an error)
* Partition tolerance (P - the system continues to operate despite an
  arbitrary number of messages being dropped by the network between nodes)

In the context of HA, Memgraph should strive to achieve CA.

### Consensus

Implications of the CAP theorem naturally lead us towards introducing a
cluster of machines with identical internal states. When the designated
machine for handling client requests fails, it can simply be replaced with
another.

Well... it turns out this is not as easy as it sounds :(

Keeping a cluster of machines in a consistent internal state is an
inherently difficult problem. More precisely, it is as hard as getting a
cluster of machines to agree on a single value, which is a heavily researched
area in distributed systems. Our survey of state-of-the-art consensus
algorithms led us to Diego Ongaro's
[Raft algorithm](https://raft.github.io/raft.pdf).

#### Raft

As you might have guessed, analyzing each subtle detail of Raft goes way
beyond the scope of this document. In the remainder of this chapter we will
outline only the most important ideas and implications, leaving any further
analysis to the reader. A detailed explanation can be found either in Diego's
[dissertation](https://ramcloud.stanford.edu/~ongaro/thesis.pdf) or in the
Raft [paper](https://raft.github.io/raft.pdf).

In essence, Raft allows us to implement the previously mentioned idea of
managing a cluster of machines with identical internal states. In other
words, the Raft protocol allows us to manage a cluster of replicated
state machines which remains fully functional as long as the *majority* of
the machines in the cluster operate correctly.

Another important fact is that those state machines must be *deterministic*:
the same command applied to two different machines with the same internal
state must yield the same result. This is important because Memgraph,
as a black box, is not entirely deterministic. Non-determinism can easily be
introduced by the user (e.g. by using the `rand` function) or by the algorithms
behind query execution (e.g. introducing fuzzy logic into the planner could
yield a different order of results). Luckily, once we enter the storage level,
everything should be fully deterministic.

To summarize, Raft is a protocol which achieves consensus in a cluster of
deterministic state machines via log replication. The cluster is fully
functional if the majority of the machines work correctly. The reader
is strongly encouraged to gain a deeper understanding of Raft (at least read
through the paper) before reading the rest of this document.
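
For readers who prefer code over prose, here is a minimal sketch of a
replicated, deterministic state machine. All type and method names are
invented for this illustration and are not part of Memgraph's codebase; the
point is only that every machine applies the same committed log entries, in
the same order, and therefore ends up in the same state.

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct LogEntry {
  uint64_t term;        // Raft term in which the entry was created.
  std::string command;  // Opaque, deterministic command.
};

class ReplicatedStateMachine {
 public:
  // Applies all newly committed entries, in log order. Because Apply is
  // deterministic, every machine in the cluster converges to the same state.
  void AdvanceTo(uint64_t commit_index, const std::vector<LogEntry> &log) {
    while (last_applied_ < commit_index) {
      Apply(log[last_applied_]);
      ++last_applied_;
    }
  }

 private:
  void Apply(const LogEntry &entry) { /* mutate local state deterministically */ }
  uint64_t last_applied_ = 0;
};
```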

## Integration with Memgraph

The first thing that should be defined is a single instruction within the
context of Raft (i.e. a single entry in the replicated log). As mentioned
before, these instructions should be completely deterministic when applied
to the state machine. We have therefore decided that the appropriate level
of abstraction within Memgraph corresponds to `StateDelta`s (data structures
which describe a single change to the Memgraph state, used for durability
in the WAL). Moreover, a single instruction in the replicated log will consist
of a batch of `StateDelta`s which correspond to a single **committed**
transaction. This decision both improves performance and, by leveraging the
knowledge that the transaction is going to be committed, handles some special
cases that would otherwise arise.
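
To make the chosen granularity concrete, a single replicated log entry could
look roughly like the sketch below. `RaftLogEntry`, its fields, and the
placeholder `StateDelta` are assumptions made for illustration, not the
actual Memgraph definitions.

```cpp
#include <cstdint>
#include <vector>

// Placeholder standing in for the real durability structure.
struct StateDelta { /* describes one change to Memgraph's state (as in the WAL) */ };

struct RaftLogEntry {
  uint64_t term;  // Raft term in which the leader created this entry.
  // All StateDeltas produced by a single transaction; an entry is only
  // appended for transactions that are going to be committed.
  std::vector<StateDelta> deltas;
};
```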
"What happens with aborted transactions?"
|
||||
|
||||
A great question, they are handled solely by the leader which is the only
|
||||
machine that communicates with the client. Aborted transactions do not alter
|
||||
the state of the database and there is no need to replicate it to other machines
|
||||
in the cluster. If, for instance, the leader dies before returning the result
|
||||
of some read operation in an aborted transaction, the client will notice that
|
||||
the leader has crashed. A new leader will be elected in the next term and the
|
||||
client should retry the transaction.
|
||||
|
||||
"OK, that makes sense! But, wait a minute, this is broken by design! Merely
|
||||
generating `StateDelta`s on the leader for any transaction will taint its
|
||||
internal storage before sending the first RPC to some follower. This deviates
|
||||
from Raft and will crash the universe!"
|
||||
|
||||
Another great observation. It is indeed true that applying `StateDelta`s makes
|
||||
changes to local storage, but only a single type of `StateDelta` makes that
|
||||
change durable. That `StateDelta` type is called `TRANSACTION_COMMIT` and we
|
||||
will change its behaviour when working as a HA instance. More precisely, we
|
||||
must not allow the transaction engine to modify the commit log saying that
|
||||
the transaction has been committed. That action should be delayed until those
|
||||
`StateDelta`s have been applied to the majority of the cluster. At that point
|
||||
the commit log can be safely modified leaving it up to Raft to ensure the
|
||||
durability of the transaction.
|
||||
|
||||
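
A rough sketch of that leader-side commit path follows. All helper and type
names are hypothetical, not taken from the codebase; what matters is the
ordering: replicate to a majority first, only then mark the transaction as
committed in the commit log.

```cpp
#include <cstdint>
#include <vector>

struct StateDelta { /* one change to Memgraph's state */ };

// Stub: append the deltas as one Raft log entry and block until a majority
// of the cluster has stored it (or leadership is lost).
bool ReplicateToMajority(uint64_t term, const std::vector<StateDelta> &deltas) {
  (void)term; (void)deltas;
  return true;
}

// Stub: the only place where the transaction engine's commit log is touched.
void MarkCommittedInCommitLog(uint64_t transaction_id) { (void)transaction_id; }

// Called instead of directly applying TRANSACTION_COMMIT on an HA instance.
bool CommitTransaction(uint64_t term, uint64_t transaction_id,
                       const std::vector<StateDelta> &deltas) {
  if (!ReplicateToMajority(term, deltas)) return false;  // e.g. lost leadership
  MarkCommittedInCommitLog(transaction_id);  // Raft now guarantees durability
  return true;
}
```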

We should also address one subtle detail that arises in this case. Consider
the following scenario:

* The leader starts working on a transaction which creates a new record in the
  database. Suppose that record is stored in the leader's internal storage
  but the transaction was not committed (i.e. there is no such entry in the
  commit log).
* The leader should start replicating those `StateDelta`s to its followers
  but, suddenly, it is cut off from the rest of the cluster.
* Due to a timeout, a new election is held and a new leader is elected.
* Our old leader comes back to life and becomes a follower.
* The new leader receives a transaction which creates that same record, but
  this transaction is successfully replicated and committed by the new leader.

The problem lies in the fact that there is still a record within the internal
storage of our old leader with the same transaction ID and GID as the record
recently committed by the new leader. Obviously, this is broken. As a solution,
we will prefix the transaction ID with the Raft term, thus ensuring that the
transaction IDs will differ.
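
A possible way to compose such a term-prefixed ID is sketched below. The exact
bit layout (16 bits of term, 48 bits of local counter) and the function name
are assumptions made purely for illustration.

```cpp
#include <cstdint>

constexpr int kTermBits = 16;
constexpr int kLocalBits = 64 - kTermBits;

// Transactions started in different Raft terms can never share an ID, even
// if their local counters collide.
constexpr uint64_t MakeTransactionId(uint64_t raft_term, uint64_t local_id) {
  return (raft_term << kLocalBits) | (local_id & ((1ULL << kLocalBits) - 1));
}

static_assert(MakeTransactionId(2, 42) != MakeTransactionId(3, 42),
              "same local counter, different terms -> different IDs");
```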
"When will followers append that transaction to their commit logs?"
|
||||
|
||||
When the leader deduces that the transaction is safe to commit, it will include
|
||||
the relevant information in all further heartbeats which will alert the
|
||||
followers that it is safe to commit those entries from their raft logs.
|
||||
Naturally, the followers need not to delay appending data to the commit log
|
||||
as they know that the transaction has already been committed (from the clusters
|
||||
point of view). If this sounds really messed up, seriously, read the Raft paper.
|
||||
|
||||
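
Seen from the follower's side, the idea looks roughly like this. The message
shape and all names are assumptions for illustration, not Memgraph's actual
RPC layer.

```cpp
#include <cstdint>

struct Heartbeat {
  uint64_t leader_term;
  uint64_t leader_commit_index;  // highest log index known to be committed
};

class Follower {
 public:
  void HandleHeartbeat(const Heartbeat &hb) {
    if (hb.leader_term < current_term_) return;  // stale leader, ignore
    current_term_ = hb.leader_term;
    // Commit (append to the local commit log) every entry the leader already
    // considers committed; there is no reason to delay this on the follower.
    while (commit_index_ < hb.leader_commit_index &&
           commit_index_ < raft_log_size_) {
      ++commit_index_;
      MarkEntryCommitted(commit_index_);
    }
  }

 private:
  void MarkEntryCommitted(uint64_t index) { /* update local commit log */ }
  uint64_t current_term_ = 0;
  uint64_t commit_index_ = 0;
  uint64_t raft_log_size_ = 0;
};
```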
"How does the raft log differ from WAL"
|
||||
|
||||
Conceptually, it doesn't. When operating in HA, we don't really need the
|
||||
recovery mechanisms implemented in Memgraph thus far. When a dead machine
|
||||
comes back to life, it will eventually come in sync with the rest of the
|
||||
cluster and everything will be done using the machine's raft log as well
|
||||
as the messages received from the cluster leader.
|
||||
|
||||
"Those logs will become huge, isn't that recovery going to be painfully slow?"
|
||||
|
||||
True, but there are mechanisms for making raft logs more compact. The most
|
||||
popular method is, wait for it, making snapshots :)
|
||||
Although the process of bringing an old machine back to life is a long one,
|
||||
it doesn't really affect the performance of the cluster in a great degree.
|
||||
The cluster will work perfectly fine with that machine being way out of sync.
|
||||
|
||||
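
A sketch of what such snapshot-based compaction could look like; the
threshold and every name below are assumptions for illustration only.

```cpp
#include <cstdint>

constexpr uint64_t kLogEntriesPerSnapshot = 100000;  // assumed threshold

void TakeSnapshot(uint64_t up_to_index) { /* persist full state */ (void)up_to_index; }
void DiscardLogPrefix(uint64_t up_to_index) { /* drop entries <= index */ (void)up_to_index; }

// Once enough entries have been applied, the applied prefix of the Raft log
// is replaced by a snapshot of the state, keeping the log (and the recovery
// time of machines that fall behind) bounded.
void MaybeCompactLog(uint64_t last_applied_index, uint64_t first_log_index) {
  if (last_applied_index - first_log_index < kLogEntriesPerSnapshot) return;
  TakeSnapshot(last_applied_index);
  DiscardLogPrefix(last_applied_index);
}
```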
"I don't know, everything seems to be a lot slower than before!"
|
||||
|
||||
Absolutely true, the user should be aware that they will suffer dire
|
||||
consequences on the performance side if they choose to be highly available.
|
||||
As Frankie says, "That's life!".
|
||||
|
||||
"Also, I didn't really care about most of the things you've said. I'm
|
||||
not a part of the storage team and couldn't care less about the issues you
|
||||
face, how does HA affect 'my part of the codebase'?"
|
||||
|
||||
Answer for query execution: That's ok, you'll be able to use the same beloved
|
||||
API (when we implement it, he he :) towards storage and continue to
|
||||
make fun of us when you find a bug.
|
||||
|
||||
Answer for infrastructure: We'll talk. Some changes will surely need to
|
||||
be made on the Memgraph client. There is a chapter in Diego's dissertation
|
||||
called 'Client interaction', but we'll cross that bridge when we get there.
|
||||
There will also be the whole 'integration with Jepsen tests' thing going on.
|
||||
|
||||
Answer for analytics: I'm astonished you've read this article. Wanna join
|
||||
storage?
|
||||
|
||||

## How do we test HA

[Check this out](https://jepsen.io/analyses/dgraph-1-0-2)