Update Raft feature spec (no-op in reads)

Reviewers: msantl, teon.banek, buda, mtomic, mferencevic

Reviewed By: msantl

Differential Revision: https://phabricator.memgraph.io/D1728
Ivan Paljak 2018-11-12 10:17:20 +01:00
parent 20dcb05c50
commit 45098dd80b

@ -51,8 +51,8 @@ As you might have guessed, analyzing each subtle detail of Raft goes way
beyond the scope of this document. In the remainder of the chapter we will beyond the scope of this document. In the remainder of the chapter we will
outline only the most important ideas and implications, leaving all further outline only the most important ideas and implications, leaving all further
analysis to the reader. Detailed explanation can be found either in Diego's analysis to the reader. Detailed explanation can be found either in Diego's
[dissertation](https://ramcloud.stanford.edu/~ongaro/thesis.pdf) or the [dissertation](https://ramcloud.stanford.edu/~ongaro/thesis.pdf) \[1\] or the
Raft [paper](https://raft.github.io/raft.pdf). Raft [paper](https://raft.github.io/raft.pdf) \[2\].
In essence, Raft allows us to implement the previously mentioned idea of In essence, Raft allows us to implement the previously mentioned idea of
managing a cluster of machines with identical internal states. In other managing a cluster of machines with identical internal states. In other
@ -84,7 +84,7 @@ to the state machine. We have therefore decided that the appropriate level
of abstraction within Memgraph corresponds to `StateDelta`-s (data structures of abstraction within Memgraph corresponds to `StateDelta`-s (data structures
which describe a single change to the Memgraph state, used for durability which describe a single change to the Memgraph state, used for durability
in WAL). Moreover, a single instruction in a replicated log will consist of a in WAL). Moreover, a single instruction in a replicated log will consist of a
batch of `StateDelta`s which correspond to a single **commited** transaction. batch of `StateDelta`s which correspond to a single **committed** transaction.
This decision both improves performance and handles some special cases that This decision both improves performance and handles some special cases that
present themselves otherwise by leveraging the knowledge that the transaction present themselves otherwise by leveraging the knowledge that the transaction
should be committed. should be committed.
@ -119,13 +119,13 @@ the following scenario:
* The leader starts working on a transaction which creates a new record in the * The leader starts working on a transaction which creates a new record in the
database. Suppose that record is stored in the leader's internal storage database. Suppose that record is stored in the leader's internal storage
but the transaction was not commited (i.e. no such entry in the commit log). but the transaction was not committed (i.e. no such entry in the commit log).
* The leader should start replicating those `StateDelta`s to its followers * The leader should start replicating those `StateDelta`s to its followers
but, suddenly, it's cut off from the rest of the cluster. but, suddenly, it's cut off from the rest of the cluster.
* Due to timeout, a new election is held and a new leader has been elected. * Due to timeout, a new election is held and a new leader has been elected.
* Our old leader comes back to life and becomes a follower. * Our old leader comes back to life and becomes a follower.
* The new leader receives a transaction which creates that same record, but * The new leader receives a transaction which creates that same record, but
this transaction is successfully replicated and commited by the new leader. this transaction is successfully replicated and committed by the new leader.
The problem lies in the fact that there is still a record within the internal The problem lies in the fact that there is still a record within the internal
storage of our old leader with the same transaction ID and GID as the recently storage of our old leader with the same transaction ID and GID as the recently
@ -180,6 +180,34 @@ There will also be the whole 'integration with Jepsen tests' thing going on.
Answer for analytics: I'm astonished you've read this article. Wanna join Answer for analytics: I'm astonished you've read this article. Wanna join
storage? storage?
### Subtlety Regarding Reads
As we have hinted in the previous chapter, we would like to bypass log
replication for operations which do not alter the internal state of Memgraph.
Those operations should therefore be handled only by the leader, which is not
as trivial as it seems. The subtlety arises from the fact that a newly elected
leader can have an entry in its log which was committed by the previous
(crashed) leader, but which the new leader has not yet committed to its own
internal storage. Moreover, the rule about safely committing entries that are
replicated on the majority of the cluster only applies to entries replicated in
the leader's current term. Therefore, we are faced with two issues:
* We cannot simply perform read operations if the leader has a non-committed
entry in its log (breaks consistency).
* Replicating those entries onto the majority of the cluster is not enough
to guarantee that they can be safely committed.
This can be solved by introducing a blank no-op operation which the new leader
will try to replicate at the start of its term. Once that operation is
replicated and committed, the leader can safely perform those non-altering
operations on its own.
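
A minimal sketch of this idea follows, under stated assumptions. The names `LogEntry`, `Leader`, `start_term`, `on_committed` and `can_serve_reads` are hypothetical and are not part of Memgraph's code base. The sketch models only the no-op gate described above: reads are refused until the no-op appended at the start of the current term is committed.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LogEntry:
    term: int
    payload: Optional[str]  # None marks the blank no-op entry (assumption)


@dataclass
class Leader:
    current_term: int
    log: List[LogEntry] = field(default_factory=list)
    commit_index: int = 0  # 1-based index of the last committed entry, 0 if none
    noop_index: int = 0    # 1-based index of this term's no-op, 0 if not appended

    def start_term(self) -> None:
        """Append the blank no-op as the first entry of the new term."""
        self.log.append(LogEntry(self.current_term, None))
        self.noop_index = len(self.log)

    def on_committed(self, index: int) -> None:
        """Record that all entries up to `index` are known to be committed."""
        self.commit_index = max(self.commit_index, index)

    def can_serve_reads(self) -> bool:
        # Committing the no-op also commits every entry that precedes it,
        # including entries inherited from previous terms.
        return self.noop_index > 0 and self.commit_index >= self.noop_index

    def read(self, query: str) -> str:
        if not self.can_serve_reads():
            raise RuntimeError("no-op of the current term is not committed yet")
        return f"result of {query}"


# A new leader in term 3 inherits one entry from term 2.
leader = Leader(current_term=3, log=[LogEntry(2, "tx-1")])
leader.start_term()          # no-op lands at index 2
print(leader.can_serve_reads())
leader.on_committed(2)       # the no-op (and the inherited entry) are committed
print(leader.can_serve_reads())
```

In this sketch the inherited entry at index 1 becomes committed only as a side effect of committing the no-op at index 2, which is the point at which the leader starts answering reads on its own.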
For further information about these issues, you should check out section
5.4.2 of the Raft paper \[2\], which hints at why it's not safe to commit
entries from previous terms. Also, you should check out section 6.4 of
the thesis \[1\], which goes into more detail on efficiently processing
read-only queries.
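
The restriction on committing entries from previous terms can be illustrated with a small sketch. The function `advance_commit_index` and its parameters are hypothetical names chosen for this example, not an existing API. It assumes the leader tracks, for each follower, the highest log index known to be replicated there, and it advances the commit index only to an entry from the leader's current term.

```python
from typing import Dict, List


def advance_commit_index(
    log_terms: List[int],
    match_index: Dict[str, int],
    commit_index: int,
    current_term: int,
) -> int:
    """Return the new commit index for a leader.

    log_terms[i - 1] is the term of the entry at 1-based index i.
    match_index maps each follower to the highest index replicated on it.
    The leader itself always holds its whole log.
    """
    cluster_size = len(match_index) + 1
    majority = cluster_size // 2 + 1
    # Scan from the newest entry down to the first uncommitted one.
    for index in range(len(log_terms), commit_index, -1):
        replicas = 1 + sum(1 for m in match_index.values() if m >= index)
        # Majority replication alone is not enough: the entry must also
        # belong to the leader's current term.
        if replicas >= majority and log_terms[index - 1] == current_term:
            return index
    return commit_index


# Entries from term 2 sit on a majority, but the leader is in term 3.
print(advance_commit_index([2, 2], {"a": 2, "b": 0}, 0, 3))
# After a term-3 entry (e.g. the no-op) reaches a majority, everything commits.
print(advance_commit_index([2, 2, 3], {"a": 3, "b": 0}, 0, 3))
```

The first call leaves the commit index unchanged even though both old entries are on two of three machines. The second call commits up to index 3, which also covers the older entries.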
## How do we test HA ## How do we test HA
[Check this out](https://jepsen.io/analyses/dgraph-1-0-2) [Check this out](https://jepsen.io/analyses/dgraph-1-0-2)