Update HA feature spec
Reviewers: buda, teon.banek, mferencevic
Reviewed By: buda
Subscribers: dsantl, dlozic, tlastre
Differential Revision: https://phabricator.memgraph.io/D2626

# High Availability (abbr. HA)

## High Level Context

High availability is a characteristic of a system which aims to ensure a
certain level of operational performance for a higher-than-normal period.
Although there are multiple ways to design highly available systems, Memgraph
strives to achieve HA by eliminating single points of failure. In essence,
this implies adding redundancy to the system so that a failure of a component
does not imply the failure of the entire system. To ensure this, HA Memgraph
implements the [Raft consensus algorithm](https://raft.github.io/).

Correct implementation of the algorithm guarantees that the cluster will be
fully functional (available) as long as any strong majority of the servers are
operational and can communicate with each other and with clients. For example,
clusters of three or four machines can tolerate the failure of a single server,
clusters of five or six machines can tolerate the failure of any two servers,
and so on. Therefore, we strongly recommend a setup of an odd-sized cluster.

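The majority arithmetic above can be made concrete with a short sketch
(illustrative Python, not part of Memgraph):

```python
def majority(cluster_size: int) -> int:
    """Smallest number of servers that constitutes a strong majority."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """How many servers may fail while the cluster stays available."""
    return cluster_size - majority(cluster_size)


# An even-sized cluster tolerates no more failures than the odd-sized
# cluster one server smaller, which is why odd sizes are recommended.
assert tolerated_failures(3) == tolerated_failures(4) == 1
assert tolerated_failures(5) == tolerated_failures(6) == 2
```
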
### Performance Implications

Internally, Raft achieves high availability by keeping a consistent replicated
log on each server within the cluster. Therefore, we must successfully replicate
a transaction on the majority of servers within the cluster before we actually
commit it and report the result back to the client. This operation represents
a significant performance hit when compared with the single node version of
Memgraph.

Luckily, the algorithm can be tweaked in a way which allows read-only
transactions to perform significantly better than those which modify the
database state. That being said, the performance of read-only operations
is still not going to be on par with single node Memgraph.

This section will be updated with exact numbers once we integrate HA with
the new storage.

With the old storage, write throughput was almost five times lower than read
throughput (~30000 reads per second vs ~6000 writes per second).

## User Facing Setup

### How to Set Up an HA Memgraph Cluster?

First, the user needs to install the `memgraph_ha` package on each machine
in their cluster. HA Memgraph should be available as a Debian package,
so its installation on each machine should be as simple as:

```plaintext
dpkg -i /path/to/memgraph_ha_<version>.deb
```

After successful installation of the `memgraph_ha` package, the user should
finish its configuration before attempting to start the cluster.

There are two main things that need to be configured on every node in order for
the cluster to be able to run:

1. The user has to edit the main configuration file and specify a unique node
   ID for each server in the cluster
2. The user has to create a file that describes the IP addresses of all servers
   that will be used in the cluster

The `memgraph_ha` binary loads all main configuration parameters from
`/etc/memgraph/memgraph_ha.conf`. On each node of the cluster, the user should
uncomment the `--server-id=0` parameter and change its value to the `server_id`
of that node.

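The effect of this step can be illustrated with a sketch that reads the
`--server-id` flag back out of the configuration text. The one-flag-per-line
`--flag=value` layout with `#` comments is our assumption about the file
format, not a documented guarantee:

```python
def read_server_id(conf_text: str) -> int:
    """Return the value of the --server-id flag, assuming '--flag=value'
    lines and '#' comment lines (hypothetical memgraph_ha.conf layout)."""
    for line in conf_text.splitlines():
        line = line.strip()
        if line.startswith("--server-id="):
            return int(line.split("=", 1)[1])
    raise ValueError("--server-id is not set (still commented out?)")


example = """
# main configuration for this node
--server-id=2
"""
assert read_server_id(example) == 2
```
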
The last step before starting the server is to create a `coordination`
configuration file. That file is already present as an example in
`/etc/memgraph/coordination.json.example`; you have to copy it to
`/etc/memgraph/coordination.json` and edit it according to your cluster
configuration. The file contains the coordination info as a list of
`[server_id, ip_address, rpc_port]` entries. The assumed contents of the
`coordination.json` file are:

```plaintext
[
  [1, "192.168.0.1", 10000],
  [2, "192.168.0.2", 10000],
  [3, "192.168.0.3", 10000]
]
```

Here, each entry corresponds to the coordination info of one server. The first
element is that server's ID, the second is its IP address and the third is the
RPC port it listens on. This port should not be confused with the port used for
client interaction via the Bolt protocol.

The `ip_address` entered for each `server_id` *must* match the exact IP address
that belongs to that server and that will be used to communicate with other
nodes in the cluster. The coordination configuration file *must* be identical
on all nodes in the cluster.

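Since a malformed or inconsistent `coordination.json` will prevent the cluster
from forming, a small sanity check can save debugging time. This is our own
illustration of the format described above, not a shipped tool:

```python
import json


def validate_coordination(path: str) -> list:
    """Parse the coordination file and check that it is a non-empty list of
    [server_id, ip_address, rpc_port] entries with unique integer IDs."""
    with open(path) as f:
        coordination = json.load(f)
    assert isinstance(coordination, list) and coordination, \
        "expected a non-empty list"
    seen_ids = set()
    for entry in coordination:
        server_id, ip_address, rpc_port = entry  # each entry is a triplet
        assert isinstance(server_id, int) and server_id not in seen_ids, \
            "server IDs must be unique integers"
        assert isinstance(ip_address, str) and isinstance(rpc_port, int)
        seen_ids.add(server_id)
    return coordination
```

Running the same check on every node before `systemctl start` also catches the
case where the files accidentally diverge between nodes.
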
After the user has set the `server_id` on each node in
`/etc/memgraph/memgraph_ha.conf` and provided the same
`/etc/memgraph/coordination.json` file to each node in the cluster, they can
start the Memgraph HA service by issuing the following command on each node in
the cluster:

```plaintext
systemctl start memgraph_ha
```

### How to Configure Raft Parameters?

All Raft configuration parameters can be controlled by modifying
`/etc/memgraph/raft.json`. The assumed contents of the `raft.json` file are:

```plaintext
{
  "election_timeout_min": 750,
  "election_timeout_max": 1000,
  "heartbeat_interval": 100,
  "replication_timeout": 20000,
  "log_size_snapshot_threshold": 50000
}
```

The meaning behind each entry is demystified in the following table:

Flag                          | Description
------------------------------|------------
`election_timeout_min`        | Lower bound for the randomly sampled reelection timer, given in milliseconds
`election_timeout_max`        | Upper bound for the randomly sampled reelection timer, given in milliseconds
`heartbeat_interval`          | Time interval between consecutive heartbeats, given in milliseconds
`replication_timeout`         | Time interval allowed for data replication, given in milliseconds
`log_size_snapshot_threshold` | Allowed number of entries in the Raft log before its compaction

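These parameters interact: Raft samples each server's election timeout
uniformly from `[election_timeout_min, election_timeout_max]`, and heartbeats
must arrive well within that window or followers will needlessly start
elections. A sketch of the relationship (illustrative, not Memgraph code):

```python
import random


def sample_election_timeout(config: dict) -> int:
    """Each server draws its own timeout from the configured range; the
    randomness breaks ties so that split votes become unlikely."""
    return random.randint(config["election_timeout_min"],
                          config["election_timeout_max"])


def check_timing(config: dict) -> None:
    """Basic sanity constraints between the timing parameters."""
    assert config["election_timeout_min"] < config["election_timeout_max"]
    # Heartbeats must be much more frequent than election timeouts,
    # otherwise healthy leaders get deposed by spurious elections.
    assert config["heartbeat_interval"] * 2 <= config["election_timeout_min"]


config = {"election_timeout_min": 750, "election_timeout_max": 1000,
          "heartbeat_interval": 100, "replication_timeout": 20000,
          "log_size_snapshot_threshold": 50000}
check_timing(config)
```
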
### How to Query HA Memgraph via Proxy?

This chapter describes how to query HA Memgraph using our proxy server.
Note that this is not intended to be a long-term solution. Instead, we will
implement a proper Memgraph HA client which is capable of communicating with
the HA cluster. Once our own client is implemented, it will no longer be
possible to query HA Memgraph using other clients (such as the neo4j client).

The Bolt protocol that is exposed by each Memgraph HA node is an extended
version of the standard Bolt protocol. In order to be able to communicate with
the highly available cluster of Memgraph HA nodes, the client must implement
some additional logic so that it can communicate correctly with all nodes in
the cluster. To facilitate a faster start with the HA cluster, we will build
the Memgraph HA proxy binary, which communicates with all nodes in the HA
cluster using the extended Bolt protocol and itself exposes the standard Bolt
protocol to the user. All standard Bolt clients (libraries and custom systems)
can communicate with the Memgraph HA proxy without any code modifications.

The HA proxy should be deployed on each client machine that is used to
communicate with the cluster. It can't be deployed on the Memgraph HA nodes!

When using the Memgraph HA proxy, the communication flow is described in the
following diagram:

```plaintext
Memgraph HA node 1 -----+
                        |
Memgraph HA node 2 -----+ Memgraph HA proxy <---> any standard Bolt client (C, Java, PHP, Python, etc.)
                        |
Memgraph HA node 3 -----+
```

To set up the Memgraph HA proxy, the user should install the
`memgraph_ha_proxy` package.

After its successful installation, the user should enter all endpoints of the
HA Memgraph cluster servers into the configuration before attempting to start
the HA Memgraph proxy server.

The HA Memgraph proxy server loads all of its configuration from
`/etc/memgraph/memgraph_ha_proxy.conf`. Assuming that the cluster is set up
like in the previous examples, the user should uncomment and enter the
following value into the `--endpoints` parameter:

```plaintext
--endpoints=192.168.0.1:7687,192.168.0.2:7687,192.168.0.3:7687
```

Note that the IP addresses used in the example match the individual cluster
nodes' IP addresses, but the ports used are the Bolt server ports exposed by
each node (currently the default value of `7687`).

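For illustration, the comma-separated `--endpoints` value can be split into
`(host, port)` pairs like this. This is our own sketch of the format, not
proxy source code:

```python
def parse_endpoints(value: str) -> list:
    """Split a comma-separated 'host:port,host:port,...' string into pairs."""
    endpoints = []
    for item in value.split(","):
        host, port = item.rsplit(":", 1)  # split on the last ':' only
        endpoints.append((host, int(port)))
    return endpoints


pairs = parse_endpoints("192.168.0.1:7687,192.168.0.2:7687,192.168.0.3:7687")
assert len(pairs) == 3
```
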
The user can now start the proxy by using the following command:

```plaintext
systemctl start memgraph_ha_proxy
```

After the proxy has been started, the user can query the HA cluster by
connecting to the HA Memgraph proxy IP address using their favorite Bolt
client.

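Because the proxy speaks plain Bolt, any standard driver works out of the box;
under the hood every Bolt connection opens with a fixed magic preamble followed
by four proposed protocol versions. As a rough illustration of the first bytes
a client sends (based on the public Bolt specification, not Memgraph-specific
code):

```python
import struct

BOLT_MAGIC = b"\x60\x60\xb0\x17"  # fixed preamble identifying the Bolt protocol


def bolt_handshake(proposed_versions=(1, 0, 0, 0)) -> bytes:
    """First bytes a Bolt client writes: magic + four big-endian version
    slots; the server answers with the version it picked."""
    return BOLT_MAGIC + b"".join(struct.pack(">I", v)
                                 for v in proposed_versions)


payload = bolt_handshake()
assert len(payload) == 20  # 4 magic bytes + 4 * 4 version bytes
```
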
## Integration with Memgraph

The first thing that should be defined is a single instruction within the
context of Raft (i.e. a single entry in a replicated log).
These instructions should be completely deterministic when applied
to the state machine. We have therefore decided that the appropriate level
of abstraction within Memgraph corresponds to `Delta`s (data structures
which describe a single change to the Memgraph state, used for durability
in WAL). Moreover, a single instruction in a replicated log will consist of a
batch of `Delta`s which correspond to a single transaction that's about
to be **committed**.

"What happens with aborted transactions?"
|
Apart from `Delta`s, there are certain operations within the storage called
|
||||||
|
`StorageGlobalOperations` which do not conform to usual transactional workflow
|
||||||
|
(e.g. Creating indices). Since our storage engine implementation guarantees
|
||||||
|
that at the moment of their execution no other transactions are active, we can
|
||||||
|
safely replicate them as well. In other words, no additional logic needs to be
|
||||||
|
implemented because of them.
|
||||||
|
|
||||||
Therefore, we will introduce a new `RaftDelta` object which can be constructed
both from a storage `Delta` and a `StorageGlobalOperation`. Instead of
appending these to the WAL (as we do in single node), we will start to
replicate them across our cluster. Once we have replicated the corresponding
Raft log entry on a majority of the cluster, we are able to safely commit the
transaction or execute a global operation. If for any reason the replication
fails (leadership change, worker failures, etc.), the transaction will be
aborted.

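The commit rule described above (replicate on a majority, then commit or
abort) can be sketched as follows; the `acks` would come from the cluster's
AppendEntries responses, and all names here are illustrative, not Memgraph
internals:

```python
def can_commit(cluster_size: int, acks: int) -> bool:
    """A log entry is safe to commit once a strong majority of the cluster
    (including the leader itself) has persisted it."""
    return acks >= cluster_size // 2 + 1


def replicate_transaction(deltas, followers, cluster_size: int) -> bool:
    """Replicate a batch of deltas; commit on majority, abort otherwise."""
    acks = 1  # the leader already has the entry in its own log
    for follower in followers:
        if follower.append_entries(deltas):  # RPC; may fail or time out
            acks += 1
    return can_commit(cluster_size, acks)
```
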
In the follower mode, we need to be able to apply `RaftDelta`s we got from
the leader when the protocol allows us to do so. In that case, we will use the
same concepts from durability in storage v2, i.e., applying deltas maps
completely to recovery from WAL in storage v2.

## Test and Benchmark Strategy

We have already implemented some integration and stress tests. These are:

1. leader election -- Tests whether leader election works properly.
2. basic test -- Tests basic leader election and log replication.
3. term updates test -- Tests a specific corner case (which used to fail)
   regarding term updates.
4. log compaction test -- Tests whether log compaction works properly.
5. large log entries -- Tests whether we can successfully replicate relatively
   large log entries.
6. index test -- Tests whether index creation works in HA.
7. normal operation stress test -- Long running concurrent stress test under
   normal conditions (no failures).
8. read benchmark -- Measures read throughput in HA.
9. write benchmark -- Measures write throughput in HA.

At the moment, our main goal is to pass existing tests and have a stable
version on our stress test. We should also implement a stress test which
occasionally introduces different types of failures in our cluster (we did
this kind of testing manually thus far). Passing these tests should convince
us that we have a "stable enough" version which we can start pushing to our
customers.

"When will followers append that transaction to their commit logs?"
|
Additional (proper) testing should probably involve some ideas from
|
||||||
|
[here](https://jepsen.io/analyses/dgraph-1-0-2)
|
||||||
|
|
||||||
## Possible Future Changes/Improvements/Extensions

"How does the raft log differ from WAL"
|
There are two general directions in which we can alter HA Memgraph. The first
|
||||||
|
direction assumes we are going to stick with the Raft protocol. In that case
|
||||||
|
there are a few known ways to extend the basic algorithm in order to gain
|
||||||
|
better performance or achieve extra functionality. In no particular order,
|
||||||
|
these are:
|
||||||
|
|
||||||
1. Improving read performance using leader leases [Section 6.4 from Raft thesis]
2. Introducing cluster membership changes [Chapter 4 from Raft thesis]
3. Introducing a [learner mode](https://etcd.io/docs/v3.3.12/learning/learner/)
4. Considering different log compaction strategies [Chapter 5 from Raft thesis]
5. Removing the HA proxy and implementing our own HA Memgraph client

"Those logs will become huge, isn't that recovery going to be painfully slow?"
|
On the other hand, we might decide in the future to base our HA implementation
|
||||||
|
on a completely different protocol which might even offer different guarantees.
|
||||||
|
In that case we probably need to do a bit more of market research and weigh the
|
||||||
|
trade-offs of different solutions.
|
||||||
|
[This](https://www.postgresql.org/docs/9.5/different-replication-solutions.html)
|
||||||
|
might be a good starting point.
|
||||||
|
|
## Reading materials

"I don't know, everything seems to be a lot slower than before!"
|
1. [Raft paper](https://raft.github.io/raft.pdf)
|
||||||
|
2. [Raft thesis](https://github.com/ongardie/dissertation) (book.pdf)
|
||||||
Absolutely true, the user should be aware that they will suffer dire
|
3. [Raft playground](https://raft.github.io/)
|
||||||
consequences on the performance side if they choose to be highly available.
|
4. [Leader Leases](https://blog.yugabyte.com/low-latency-reads-in-geo-distributed-sql-with-raft-leader-leases/)
|
||||||
As Frankie says, "That's life!".
|
5. [Improving Raft ETH](https://pub.tik.ee.ethz.ch/students/2017-FS/SA-2017-80.pdf)
|
||||||