Summary:
Sometimes, when a leader resigns its leadership, the newly elected
leader would send the old leader an `AppendEntriesRPC` that caused
snapshot recovery to happen. This diff prevents that.
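A minimal sketch of the kind of guard involved, with hypothetical names and a
hypothetical condition (not the actual code): snapshot recovery should be a
fallback for a compacted log, not a side effect of every `AppendEntriesRPC`
that arrives after a leadership change.
```
#include <cstdint>

struct AppendEntriesReq {
  uint64_t term;
  uint64_t prev_log_index;
};

class RaftServer {
 public:
  bool OnAppendEntries(const AppendEntriesReq &req) {
    if (req.term < current_term_) return false;  // stale leader, reject
    // Recover from a snapshot only if the entries the leader wants to send
    // were already compacted out of our log; a routine heartbeat after a
    // leadership change must not trigger recovery.
    if (req.prev_log_index < snapshot_last_index_) RecoverFromSnapshot();
    return true;
  }

 private:
  uint64_t current_term_ = 0;
  uint64_t snapshot_last_index_ = 0;
  void RecoverFromSnapshot() { /* load snapshot, rebuild storage */ }
};
```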
Reviewers: mferencevic, ipaljak
Reviewed By: mferencevic
Subscribers: pullbot
Differential Revision: https://phabricator.memgraph.io/D1943
Summary:
Based on the failure manifested in
https://apollo.memgraph.io/runs/512803/, it seems that machines grant each
other votes for the same term.
Looking at the code, the `voted_for_` variable wasn't assigned on election
start, so the election starter could grant its vote to someone else while
still counting a vote for itself.
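A minimal sketch of the fix, with hypothetical names: the election starter
records its own vote in `voted_for_` the moment the election starts.
```
#include <cstdint>
#include <optional>

class RaftServer {
 public:
  void StartElection() {
    ++current_term_;
    voted_for_ = my_id_;  // previously left unassigned -- the bug
    granted_votes_ = 1;   // the vote we just cast for ourselves
    // ... send RequestVote RPCs to all peers ...
  }

  bool OnRequestVote(uint64_t term, uint16_t candidate_id) {
    if (term < current_term_) return false;  // stale candidate
    // With voted_for_ assigned above, a server that started an election can
    // no longer hand out a second vote in its own term.
    if (term == current_term_ && voted_for_ && *voted_for_ != candidate_id)
      return false;
    current_term_ = term;
    voted_for_ = candidate_id;
    return true;
  }

 private:
  uint64_t current_term_ = 0;
  uint16_t my_id_ = 0;
  int granted_votes_ = 0;
  std::optional<uint16_t> voted_for_;
};
```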
Reviewers: ipaljak, mferencevic
Reviewed By: mferencevic
Subscribers: pullbot, vkasljevic
Differential Revision: https://phabricator.memgraph.io/D1941
Summary:
For HA benchmarks, if any of the executables exits with a non-zero
status, the benchmark should fail.
This diff also removes `LOG(INFO)` calls, since a failing benchmark should
already flag where to look.
Reviewers: ipaljak
Reviewed By: ipaljak
Subscribers: pullbot
Differential Revision: https://phabricator.memgraph.io/D1921
Summary:
Concurrent access to the map that contains replication log timeouts
caused the HA version to frequently report replication log timeout errors.
Adding locks around each access prevents them from happening.
Performance on Apollo reports write speed around 8k/s.
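The shape of the fix, as a minimal sketch with invented names (the real map
lives inside the replication machinery):
```
#include <chrono>
#include <cstdint>
#include <map>
#include <mutex>

class ReplicationTimeoutMap {
 public:
  using Clock = std::chrono::steady_clock;

  void Insert(uint64_t tx_id, Clock::time_point deadline) {
    std::lock_guard<std::mutex> guard(lock_);
    deadlines_[tx_id] = deadline;
  }

  void Remove(uint64_t tx_id) {
    std::lock_guard<std::mutex> guard(lock_);
    deadlines_.erase(tx_id);
  }

  bool TimedOut(uint64_t tx_id) const {
    std::lock_guard<std::mutex> guard(lock_);
    auto it = deadlines_.find(tx_id);
    return it != deadlines_.end() && Clock::now() > it->second;
  }

 private:
  mutable std::mutex lock_;  // serializes all access to deadlines_
  std::map<uint64_t, Clock::time_point> deadlines_;
};
```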
Reviewers: ipaljak
Reviewed By: ipaljak
Subscribers: pullbot
Differential Revision: https://phabricator.memgraph.io/D1920
Summary:
`SHOW STORAGE STATS`, when executed in a Raft cluster, should return
stats for each member of the cluster.
`StorageStats` starts an RPC server on each member of the cluster that
answers queries about its local storage stats.
The query can be invoked only on the current leader: the leader sends a
request to each peer and shows the results it receives. If some peers don't
answer within 1 second, stats for those peers won't be shown.
The new output can be seen here: P27
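A sketch of the fan-out, assuming a hypothetical `RequestStats` RPC call
(note that `std::async` futures still join stragglers on destruction; the
real RPC layer's own timeout handles that):
```
#include <chrono>
#include <future>
#include <map>
#include <vector>

struct StorageStatsRes { /* vertex/edge counts, ... */ };

// Hypothetical synchronous RPC to one peer.
StorageStatsRes RequestStats(int peer_id) { return StorageStatsRes{}; }

std::map<int, StorageStatsRes> CollectClusterStats(
    const std::vector<int> &peer_ids) {
  std::map<int, std::future<StorageStatsRes>> pending;
  for (int id : peer_ids)
    pending[id] = std::async(std::launch::async, RequestStats, id);

  auto deadline = std::chrono::steady_clock::now() + std::chrono::seconds(1);
  std::map<int, StorageStatsRes> results;
  for (auto &[id, fut] : pending) {
    // Peers that don't answer within 1 second are simply left out.
    if (fut.wait_until(deadline) == std::future_status::ready)
      results[id] = fut.get();
  }
  return results;
}
```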
Reviewers: ipaljak, mferencevic
Reviewed By: ipaljak
Subscribers: pullbot
Differential Revision: https://phabricator.memgraph.io/D1907
Summary:
Added a new config parameter: the replication timeout. This parameter sets
an upper limit on the replication phase; once the timeout is exceeded, the
transaction engine stops accepting new transactions.
We can hit this timeout in two cases:
1. a network partition
2. a majority of the cluster stops working
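A rough sketch of the mechanism, with invented names:
```
#include <chrono>
#include <stdexcept>

class TransactionEngine {
 public:
  explicit TransactionEngine(std::chrono::milliseconds replication_timeout)
      : replication_timeout_(replication_timeout) {}

  void OnReplicationStarted() {
    replication_started_ = std::chrono::steady_clock::now();
    replicating_ = true;
  }

  void OnReplicationFinished() { replicating_ = false; }

  void Begin() {
    // Once the replication phase has outlived the configured timeout
    // (network partition, lost majority), refuse new transactions.
    if (replicating_ &&
        std::chrono::steady_clock::now() - replication_started_ >
            replication_timeout_)
      throw std::runtime_error("replication timed out");
    // ... start the transaction ...
  }

 private:
  std::chrono::milliseconds replication_timeout_;
  std::chrono::steady_clock::time_point replication_started_;
  bool replicating_ = false;
};
```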
Reviewers: ipaljak
Reviewed By: ipaljak
Subscribers: pullbot
Differential Revision: https://phabricator.memgraph.io/D1893
Summary:
Added index creation and deletion handling in StateDelta.
Also included an integration test that creates an index and makes sure it
gets replicated: the test kills each peer in turn, eventually causing a
leader re-election.
Reviewers: ipaljak
Reviewed By: ipaljak
Subscribers: pullbot
Differential Revision: https://phabricator.memgraph.io/D1886
Summary: Teon found this nit, so we might as well fix it.
Reviewers: ipaljak
Reviewed By: ipaljak
Subscribers: pullbot
Differential Revision: https://phabricator.memgraph.io/D1860
Summary:
During the following scenario:
- start a HA cluster with 3 machines
- find the leader and start sending queries
- SIGTERM the leader but leave other 2 machines untouched
The leader would get stuck in the shutdown phase.
This was happening because, during the shutdown phase of the Bolt server, a
`graph_db_accessor` would try to commit a transaction after we had already
shut down the Raft server. Raft, although not running, still thinks it's in
Leader mode. The transaction engine calls the `SafeToCommit` method to
commit transactions and ends up in an infinite loop.
Since Raft was shut down, it won't handle any of the incoming RPCs and won't
change its mode.
The fix here is to shut down the Bolt server before Raft, so we don't have
any pending commits once Raft is shut down.
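The ordering fix in a nutshell (names illustrative):
```
struct BoltServer { void Shutdown() { /* drain sessions, close sockets */ } };
struct RaftServer { void Shutdown() { /* stop timers and RPC handling */ } };

void ShutdownMemgraph(BoltServer &bolt, RaftServer &raft) {
  bolt.Shutdown();  // pending commits complete while Raft is still alive
  raft.Shutdown();  // safe: nothing can enter the SafeToCommit loop anymore
}
```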
Reviewers: ipaljak
Reviewed By: ipaljak
Subscribers: pullbot
Differential Revision: https://phabricator.memgraph.io/D1853
Summary:
In Raft, we often need to access persistent state of the server
without modifying it. In order to speed up such operations, we
keep an in-memory copy of that state.
In this diff we make a copy of all persistent state except for
the log itself. Running our feature benchmark locally, we managed
to increase throughput by roughly 750 queries/s.
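The idea in a small sketch, with hypothetical names: reads hit the in-memory
copy, while writes go to disk first and then refresh the copy, so the two
can't diverge.
```
#include <cstdint>
#include <string>

class KVStore {  // stand-in for the real durable store
 public:
  void Put(const std::string &key, uint64_t value) { /* write to disk */ }
};

class RaftPersistentState {
 public:
  uint64_t CurrentTerm() const { return current_term_; }  // no disk read

  void SetCurrentTerm(uint64_t term) {
    disk_.Put("current_term", term);  // durability first
    current_term_ = term;             // then refresh the cached copy
  }

 private:
  KVStore disk_;
  uint64_t current_term_ = 0;  // in-memory copy of the persisted value
};
```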
Reviewers: msantl
Reviewed By: msantl
Subscribers: pullbot
Differential Revision: https://phabricator.memgraph.io/D1843
Summary:
In this part of log compaction for Raft, I've implemented snapshotting
and snapshot recovery. I've also refactored the code a bit, so `RaftServer`
now has a pointer to the `GraphDb` and can do some things by itself.
Log compaction requires some further work. Since snapshotting isn't
synchronous between peers, and each peer can work at its own pace, once
we've compacted the log so that the next log entry to be sent to peer `x`
isn't available anymore, we need to send the snapshot over the wire. This
means that the next part will contain `InstallSnapshotRPC`, and then maybe
one more that will implement the logic of sending either a `LogEntry` or
the whole snapshot.
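A sketch of the planned dispatch, with hypothetical names:
```
#include <cstdint>

class RaftServer {
 public:
  void ReplicateTo(int peer) {
    // next_index_[peer]: first log entry the peer still needs.
    if (next_index_[peer] <= snapshot_last_index_)
      SendInstallSnapshot(peer);  // those entries were compacted away
    else
      SendAppendEntries(peer);    // normal log shipping
  }

 private:
  static constexpr int kMaxPeers = 8;
  uint64_t next_index_[kMaxPeers] = {};
  uint64_t snapshot_last_index_ = 0;
  void SendInstallSnapshot(int peer) { /* stream the snapshot */ }
  void SendAppendEntries(int peer) { /* send a LogEntry batch */ }
};
```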
Reviewers: ipaljak
Reviewed By: ipaljak
Subscribers: pullbot
Differential Revision: https://phabricator.memgraph.io/D1834
Summary:
`ReplicationLog` had a classic off-by-one bug: the `valid_prefix`
variable wasn't set properly.
This diff also includes a poor man's version of an HA client. This client
assumes that all the HA instances run on a single machine and that the
corresponding Bolt endpoints have open ports ranging from `7687` to
`7687 + num_machines - 1`.
This should make it easier to test certain things, e.g. disk usage, P25.
This test revealed the bug with `ReplicationLog`.
Reviewers: ipaljak
Reviewed By: ipaljak
Subscribers: pullbot
Differential Revision: https://phabricator.memgraph.io/D1813
Summary:
Once a leader loses its leadership, in order to handle hanging
transactions, we reset the storage and the transaction engine.
This requires re-applying all the committed entries from the log.
Once we add snapshots (log compaction), we will need to restore from the
snapshot as well.
One thing to keep in mind is the `election_timeout_min` parameter: if it's
set too low, it can trigger leader re-election too often.
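Roughly, with invented names, the leadership-loss path looks like this:
```
#include <cstdint>
#include <vector>

struct LogEntry { /* state deltas of one transaction */ };

class RaftServer {
 public:
  void OnLeadershipLost() {
    ResetStorageAndTxEngine();  // drops hanging transactions
    // Replay everything known to be committed; with log compaction this
    // will have to start from a snapshot restore instead.
    for (uint64_t i = 0; i < commit_index_ && i < log_.size(); ++i)
      Apply(log_[i]);
  }

 private:
  std::vector<LogEntry> log_;
  uint64_t commit_index_ = 0;  // number of committed entries
  void ResetStorageAndTxEngine() { /* clear storage + tx engine */ }
  void Apply(const LogEntry &entry) { /* re-apply committed deltas */ }
};
```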
Reviewers: ipaljak
Reviewed By: ipaljak
Subscribers: pullbot
Differential Revision: https://phabricator.memgraph.io/D1822
Summary:
Each `raft::LogEntry` is now persisted under its own key in our `KVStore`. Locally running our HA feature benchmark yields the following results:
```
duration 23.7
executed_writes: 15000
write_per_second: 632.888
```
This represents about a 5x increase in throughput.
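The layout change, sketched with assumed key names: appending entry N now
touches one small key instead of rewriting a single monolithic blob.
```
#include <cstdint>
#include <string>

class KVStore {  // stand-in for the real durable store
 public:
  void Put(const std::string &key, const std::string &value) { /* disk */ }
};

void PersistLogEntry(KVStore &store, uint64_t index,
                     const std::string &serialized_entry) {
  store.Put("log_entry_" + std::to_string(index), serialized_entry);
}
```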
Reviewers: msantl
Reviewed By: msantl
Subscribers: pullbot
Differential Revision: https://phabricator.memgraph.io/D1799
Summary:
Created a new integration test for Raft protocol.
The test iterates through the Raft cluster and does the following:
* kill machine `X`
* execute a query
* bring `X` back to life
The first step is to insert a vertex in the cluster, and last step is to check
if the cluster has all the data.
I also edited some of the Raft core files because this test surfaced some
bugs.
The `tester` binary is a hacked version of the HA client, as are the parts
of the code that refuse to execute a query if the machine is not in
`Leader` mode. Those parts will go away once we have a proper HA client.
I ran `runner.py` for a while (215 times)
```
while ./runner.py &> log.txt; do echo -n "."; done
```
and it didn't break.
Reviewers: ipaljak, mferencevic
Reviewed By: ipaljak
Subscribers: pullbot
Differential Revision: https://phabricator.memgraph.io/D1788
Summary:
Creating a Raft no-op log entry on leader change triggers the whole
log replication procedure, which ends up committing/applying state deltas
on a newly elected leader that didn't receive the last commit index from
the previous leader. A sketch of the idea follows below.
I also included a small tweak that avoids appending log entries when a
transaction contains only BEGIN and ABORT StateDeltas, because we don't
want to replicate read queries.
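The no-op trick is the standard Raft technique; sketched with invented names:
```
#include <cstdint>
#include <vector>

enum class EntryKind { kNoop, kStateDeltas };

struct LogEntry {
  uint64_t term;
  EntryKind kind;
};

class RaftServer {
 public:
  void OnBecameLeader() {
    // Appending and replicating a no-op entry from the new term lets the
    // new leader commit everything left over from previous terms, even if
    // no client write ever arrives.
    log_.push_back({current_term_, EntryKind::kNoop});
  }

 private:
  uint64_t current_term_ = 0;
  std::vector<LogEntry> log_;
};
```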
Reviewers: ipaljak
Reviewed By: ipaljak
Subscribers: pullbot
Differential Revision: https://phabricator.memgraph.io/D1785
Summary:
TransactionReplicator replicates transactions on follower machines in
HA Memgraph. Our DB accessor API doesn't provide the functionality to
begin transactions with non-increasing ids. This is why the
`TransactionReplicator` uses an internal map that maps tx ids from the
leader node to transactions on the follower node (whose ids don't have to
match the leader's tx ids).
If the leader has the following transaction timeline:
```
L
tx1
|
| tx2
| |
| |
| |
| |
| |
| tx2
|
|
|
|
tx1
```
`tx2` will commit first and will be replicated. When applying `tx2` on follower
nodes, they will start a new transaction with tx id `1`. When `tx1` starts
replicating, followers will start a new transaction with tx id `2`. And
this is where `TransactionReplicator` kicks in.
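A minimal sketch of that mapping, with invented types:
```
#include <cstdint>
#include <unordered_map>

struct Transaction { uint64_t local_id; };
struct StateDelta { uint64_t leader_tx_id; /* ... the actual change ... */ };

class TransactionReplicator {
 public:
  void Apply(const StateDelta &delta) {
    auto it = live_.find(delta.leader_tx_id);
    if (it == live_.end()) {
      // First delta of this leader transaction: begin a local transaction,
      // whose id need not match the leader's.
      it = live_.emplace(delta.leader_tx_id, Begin()).first;
    }
    ApplyToTransaction(it->second, delta);
  }

  void Commit(uint64_t leader_tx_id) { live_.erase(leader_tx_id); }

 private:
  std::unordered_map<uint64_t, Transaction> live_;
  uint64_t next_local_id_ = 0;
  Transaction Begin() { return Transaction{++next_local_id_}; }
  void ApplyToTransaction(Transaction &, const StateDelta &) { /* ... */ }
};
```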
Reviewers: ipaljak
Reviewed By: ipaljak
Subscribers: pullbot
Differential Revision: https://phabricator.memgraph.io/D1775
Summary:
Explicitly start and stop the Raft server.
This way we can be sure that Raft won't try to use coordination after it
has been shut down, and we can pinpoint the start of the Raft protocol more
easily.
Reviewers: ipaljak
Reviewed By: ipaljak
Subscribers: pullbot
Differential Revision: https://phabricator.memgraph.io/D1766
Summary:
This is just the first diff that tries to wire the Raft protocol into
Memgraph.
This diff introduces transaction engine reset functionality. It also
introduces `RaftInterface`, which should be used wherever someone wants to
access Raft from Memgraph.
For design decisions see the feature spec.
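A hypothetical sketch of what such an interface could look like (the exact
methods live in the spec, not here):
```
#include <vector>

struct StateDelta { /* one logical change */ };

class RaftInterface {
 public:
  virtual ~RaftInterface() = default;
  // Hand a committed transaction's deltas to Raft for replication.
  virtual void Emplace(const std::vector<StateDelta> &deltas) = 0;
  // True only on the current leader; used to refuse writes elsewhere.
  virtual bool IsLeader() = 0;
};
```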
Reviewers: ipaljak, teon.banek
Reviewed By: ipaljak
Subscribers: pullbot, teon.banek
Differential Revision: https://phabricator.memgraph.io/D1758
Summary: This diff contains a rough implementation of the Raft protocol which ends at leader election.
Reviewers: msantl
Reviewed By: msantl
Subscribers: teon.banek, pullbot
Differential Revision: https://phabricator.memgraph.io/D1744
Summary:
Classes marked with `:serialize (:slk)` will now generate SLK
serialization code. This diff also changes how the `:serialize` option
is parsed, so that multiple different serialization backends are
supported.
Reviewers: mtomic, llugovic, mferencevic
Reviewed By: mtomic
Subscribers: pullbot
Differential Revision: https://phabricator.memgraph.io/D1755
Summary:
Basic RPC messages for the Raft protocol. They will most likely be updated
as the implementation moves along.
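For orientation, these are the two core message pairs from the Raft paper,
sketched as plain structs (the field layout here is illustrative, not the
exact serialized form):
```
#include <cstdint>
#include <vector>

struct LogEntry { uint64_t term; /* command payload */ };

struct RequestVoteReq {
  uint64_t term, last_log_index, last_log_term;
  uint16_t candidate_id;
};
struct RequestVoteRes {
  uint64_t term;
  bool vote_granted;
};

struct AppendEntriesReq {
  uint64_t term, prev_log_index, prev_log_term, leader_commit;
  uint16_t leader_id;
  std::vector<LogEntry> entries;  // empty for heartbeats
};
struct AppendEntriesRes {
  uint64_t term;
  bool success;
};
```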
Reviewers: msantl, teon.banek, mferencevic
Reviewed By: msantl, teon.banek
Subscribers: pullbot
Differential Revision: https://phabricator.memgraph.io/D1726