From 45098dd80baa77eb6eb8e2537845c92d3ec0b803 Mon Sep 17 00:00:00 2001 From: Ivan Paljak Date: Mon, 12 Nov 2018 10:17:20 +0100 Subject: [PATCH] Update Raft feature spec (no-op in reads) Reviewers: msantl, teon.banek, buda, mtomic, mferencevic Reviewed By: msantl Differential Revision: https://phabricator.memgraph.io/D1728 --- docs/feature_spec/high_availability.md | 38 ++++++++++++++++++++++---- 1 file changed, 33 insertions(+), 5 deletions(-) diff --git a/docs/feature_spec/high_availability.md b/docs/feature_spec/high_availability.md index f77551d10..3b56f6e63 100644 --- a/docs/feature_spec/high_availability.md +++ b/docs/feature_spec/high_availability.md @@ -51,8 +51,8 @@ As you might have guessed, analyzing each subtle detail of Raft goes way beyond the scope of this document. In the remainder of the chapter we will outline only the most important ideas and implications, leaving all further analysis to the reader. Detailed explanation can be found either in Diego's -[dissertation](https://ramcloud.stanford.edu/~ongaro/thesis.pdf) or the -Raft [paper](https://raft.github.io/raft.pdf). +[dissertation](https://ramcloud.stanford.edu/~ongaro/thesis.pdf) \[1\] or the +Raft [paper](https://raft.github.io/raft.pdf) \[2\]. In essence, Raft allows us to implement the previously mentioned idea of managing a cluster of machines with identical internal states. In other @@ -84,7 +84,7 @@ to the state machine. We have therefore decided that the appropriate level of abstraction within Memgraph corresponds to `StateDelta`-s (data structures which describe a single change to the Memgraph state, used for durability in WAL). Moreover, a single instruction in a replicated log will consist of a -batch of `StateDelta`s which correspond to a single **commited** transaction. +batch of `StateDelta`s which correspond to a single **committed** transaction. This decision both improves performance and handles some special cases that present themselves otherwise by leveraging the knowledge that the transaction should be committed. @@ -119,13 +119,13 @@ the following scenario: * The leader starts working on a transaction which creates a new record in the database. Suppose that record is stored in the leader's internal storage - but the transaction was not commited (i.e. no such entry in the commit log). + but the transaction was not committed (i.e. no such entry in the commit log). * The leader should start replicating those `StateDelta`s to its followers but, suddenly, it's cut off from the rest of the cluster. * Due to timeout, a new election is held and a new leader has been elected. * Our old leader comes back to life and becomes a follower. * The new leader receives a transaction which creates that same record, but - this transaction is successfully replicated and commited by the new leader. + this transaction is successfully replicated and committed by the new leader. The problem lies in the fact that there is still a record within the internal storage of our old leader with the same transaction ID and GID as the recently @@ -180,6 +180,34 @@ There will also be the whole 'integration with Jepsen tests' thing going on. Answer for analytics: I'm astonished you've read this article. Wanna join storage? +### Subtlety Regarding Reads + +As we have hinted in the previous chapter, we would like to bypass log +replication for operations which do not alter the internal state of Memgraph. +Those operations should therefore be handled only by the leader, which is not +as trivial as it seems. The subtlety arises from the fact that a (newly-elected) +leader can have an entry in its log which was committed by the previous leader +that has crashed but that entry is not yet committed in its internal storage +by the current leader. Moreover, the rule about safely committing logs that are +replicated on the majority of the cluster only applies for entries replicated in +the leaders current term. Therefore, we are faced with two issues: + + * We cannot simply perform read operations if the leader has a non-committed + entry in its log (breaks consistency). + * Replicating those entries onto the majority of the cluster is not enough + to guarantee that they can be safely committed. + +This can be solved by introducing a blank no-op operation which the new leader +will try to replicate at the start of its term. Once that operation is +replicated and committed, the leader can safely perform those non-altering +operations on its own. + +For further information about these issues, you should check out section +5.4.2 from the raft paper \[1\] which hints as to why its not safe to commit +entries from previous terms. Also, you should check out section 6.4 from +the thesis \[2\] which goes into more details around efficiently processing +read-only queries. + ## How do we test HA [Check this out](https://jepsen.io/analyses/dgraph-1-0-2)