RaftMerge: add the main body of the multi-raft region merge protocol, except rollback. (#10)

2024-12-27 04:50:15 +08:00 · 2018-03-21 15:10:16 +08:00 · 2018-03-21 15:10:16 +08:00 · b66286b0dd
commit b66286b0dd
parent 44b8611bc2
1 changed files with 223 additions and 16 deletions
--- a/RaftMerge/RaftMerge.tla
+++ b/RaftMerge/RaftMerge.tla
@@ -1,5 +1,22 @@
 ------------------------------ MODULE RaftMerge -------------------------------
 \* This is the formal specification for the multi-raft region merge algorithm
 \* of TiKV.
 \*
 \* The data set is divided into multiple shards called regions, and each
 \* region is replicated to several stores that form a Raft group. Two regions
 \* can merge into a larger one if one of them is reasonably small.
 \*
 \* This specification assumes that two regions, A and B, are replicated to the
 \* same set of stores. The leaders of A and B reside on stores LeaderA and
 \* LeaderB respectively.
 \*
 \* Note that TiKV uses a slightly different Raft model from Ongaro's original
 \* Raft implementation. A log is truly committed only once it has been applied
 \* to the state machine, after which the server returns the result to the
 \* client. commit_index is only a marker: a log may still be dropped even if
 \* commit_index has advanced beyond it.
 EXTENDS Integers, FiniteSets, Sequences, TLC
 CONSTANTS Store, Region
@@ -40,22 +57,44 @@ VARIABLES messages
 \* The data structures in C. MAXS = |Store|.
 \*
 \* enum Log { LogNormal, LogPreMerge, LogMerge };
 \*
 \* enum RegionState { RegionNormal, RegionTombStone, RegionMerging };
 \*
 \* struct Raft {
 \*   bool is_leader;
 \*   vector<Log> logs;
 \*   int commit_index;
 \*   int apply_index;
 \*   int num_applied;        // number of applied normal logs
 \*   int match_index[MAXS];  // leader only
 \* };
 \*
 \* struct Store {
-\*   Raft raft[2];  // 2 for two regions
+\*   Raft raft[2];           // 2 for two regions
 \*   RegionState region[2];  // 2 for two regions
 \* } stores[MAXS];
 \*
-\* Note for ease of implementation, we use two 2-dimension arrays raft[MAXS][2].
+\* Note for ease of implementation, we use two 2-dimension arrays raft[MAXS][2]
 \* and region[MAXS][2].
 \*
 \* Also note that, unlike a real-world implementation, we don't introduce the
 \* concept of `epoch` here, which is used to detect whether the configuration
 \* of a region has changed. Epoch matters when logs are applied to the state
 \* machine: if it is stale, all later non-admin logs are skipped. The epoch
 \* changes whenever an admin log is applied.
-\* Log.
+\* Log types.
-CONSTANTS Log
+\* The logs are divided into two categories, normal logs and admin logs.
 \* Logs apart from LogNormal are admin logs.
 CONSTANTS LogNormal,    \* RegionB only
          LogPreMerge,
          LogMerge
 \* Region state types.
 CONSTANTS RegionNormal,
          RegionTombStone,
          RegionMerging
 VARIABLES raft, region
@@ -164,6 +203,7 @@ Receive(m) ==
 \* Leader i of region r receives a client request to append a log.
 ClientRequest(i, r, log) ==
  /\ raft[i][r].is_leader
  /\ region[i][r] = RegionNormal
  /\ client_requests_index < MaxClientRequests
  /\ LET
       new_logs == Append(raft[i][r].logs, log)
@@ -175,21 +215,137 @@ ClientRequest(i, r, log) ==
  /\ UNCHANGED <<messages, region>>
 -------------------------------------------------------------------------------
 \* State transitions for Raft merge, and log applying.
 \* Internal requests for Raft merge.
 \* Assume raft[i][r].is_leader, i.e., only leader can handle internal requests.
 InternalRequest(i, r, log) ==
  LET
    new_logs == Append(raft[i][r].logs, log)
    new_match_index == [raft[i][r].match_index EXCEPT ![i] = @ + 1]
  IN
    [raft EXCEPT ![i][r].logs = new_logs,
                 ![i][r].match_index = new_match_index]
 \* Send merge request to the leader of Region B.
 ProposeMergeRequest(i) ==
  /\ raft[i][RegionB].is_leader
  /\ \* This request should be sent only once.
     Len(SelectSeq(raft[i][RegionB].logs, LAMBDA log : log.type = LogPreMerge)) = 0
  /\ raft' = InternalRequest(
               i, RegionB,
               [type      |-> LogPreMerge,
                min_index |-> 1 + Min({raft[i][RegionB].match_index[j] : j \in Store})]
             )
  /\ UNCHANGED <<messages, region, client_vars>>
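 \* Worked example for min_index above (illustrative values, not part of the
 \* original spec; assumes Min returns the least element of a set of naturals):
 \*   with Store = {s1, s2, s3} and
 \*   raft[i][RegionB].match_index = (s1 :> 5 @@ s2 :> 3 @@ s3 :> 4),
 \*   Min({5, 3, 4}) = 3, so min_index = 1 + 3 = 4.
 \* Index 4 is then the first log of region B that some store may still lack.
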
 \* Return TRUE if there is a log applicable to the state machine.
 \* A log is applicable if it is committed, and the target region is not in
 \* TombStone state.
 LogAppliable(i, r) ==
-  raft[i][r].apply_index < raft[i][r].commit_index
+  /\ raft[i][r].apply_index < raft[i][r].commit_index
  /\ region[i][r] /= RegionTombStone
-\* Apply Raft logs to make apply_index catch up with commit_index.
+\* Apply LogPreMerge.
-\* This simply increases apply_index.
+ApplyPreMergeLog(i) ==
-ApplyLog(i, r) ==
+  LET
    next_index == raft[i][RegionB].apply_index + 1
  IN
    /\ LogAppliable(i, RegionB)
    /\ raft[i][RegionB].logs[next_index].type = LogPreMerge
    /\ IF raft[i][RegionA].is_leader
       THEN
         \* If this store is the leader of regionA, make a merge proposal, and
         \* advance apply_index.
         LET
           min_index == raft[i][RegionB].logs[next_index].min_index
           commit_index == next_index
           fetch_logs == SubSeq(raft[i][RegionB].logs, min_index, commit_index)
         IN
           raft' = [InternalRequest(
                      i, RegionA,
                      [type         |-> LogMerge,
                       min_index    |-> min_index,
                       commit_index |-> commit_index,
                       entries      |-> fetch_logs]
                    )
                    EXCEPT ![i][RegionB].apply_index = next_index]
       ELSE
         \* Otherwise, only advance apply_index.
         raft' = [raft EXCEPT ![i][RegionB].apply_index = next_index]
    /\ region' = [region EXCEPT ![i][RegionB] = RegionMerging]
    /\ UNCHANGED <<messages, client_vars>>
 \* Apply LogMerge.
 \*
 \* This action is divided into two sub-actions that are executed separately.
 \* The first step copies the missing logs to region B on this store, so that
 \* it is in sync with leader B. The second step waits until the logs copied in
 \* the first step have been applied, then advances apply_index of region A and
 \* marks region B as tombstone.
 ApplyMergeLogStep1(i) ==
  LET
    next_index   == raft[i][RegionA].apply_index + 1
    min_index    == raft[i][RegionA].logs[next_index].min_index
    commit_index == raft[i][RegionA].logs[next_index].commit_index
    new_logs     ==
      LET
        old_logs == raft[i][RegionB].logs
        entries  == raft[i][RegionA].logs[next_index].entries
      IN
        IF commit_index <= Len(raft[i][RegionB].logs)
        THEN old_logs
        ELSE old_logs \o SubSeq(entries, Len(old_logs) - min_index + 2, Len(entries))
  IN
    /\ raft' = [raft EXCEPT ![i][RegionB].logs = new_logs,
                            ![i][RegionB].commit_index = Max({@, commit_index})]
    /\ UNCHANGED <<messages, region, client_vars>>
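 \* Worked example for the SubSeq offset in ApplyMergeLogStep1 (illustrative
 \* values, not part of the original spec). entries[k] holds the log of region
 \* B at index min_index + k - 1, so the first log missing on this store, at
 \* index Len(old_logs) + 1, is entries[Len(old_logs) - min_index + 2]:
 \*   min_index = 3, commit_index = 6, Len(entries) = 4, Len(old_logs) = 4
 \*   SubSeq(entries, 4 - 3 + 2, 4) = <<entries[3], entries[4]>>
 \*   i.e. the logs at indices 5 and 6, giving Len(new_logs) = 6.
 \* This sketch assumes Len(old_logs) >= min_index - 1, so that no gap is left.
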
 ApplyMergeLogStep2(i) ==
  LET
    next_index   == raft[i][RegionA].apply_index + 1
    commit_index == raft[i][RegionA].logs[next_index].commit_index
  IN
    /\ \* The lagging logs copied in step 1 have been applied.
       raft[i][RegionB].apply_index >= commit_index
    /\ raft' = [raft EXCEPT ![i][RegionA].apply_index = next_index]
    /\ region' = [region EXCEPT ![i][RegionB] = RegionTombStone]
    /\ UNCHANGED <<messages, client_vars>>
 ApplyMergeLog(i) ==
  LET
    next_index == raft[i][RegionA].apply_index + 1
  IN
    /\ LogAppliable(i, RegionA)
    /\ raft[i][RegionA].logs[next_index].type = LogMerge
    /\ \/ ApplyMergeLogStep1(i)
       \/ ApplyMergeLogStep2(i)
 \* Apply LogNormal.
 \* This log simply increases apply_index.
 ApplyNormalLog(i, r) ==
  LET
    next_index == raft[i][r].apply_index + 1
  IN
    /\ LogAppliable(i, r)
-    /\ raft' = [raft EXCEPT ![i][r].apply_index = next_index]
+    /\ raft[i][r].logs[next_index].type = LogNormal
    /\ LET
         \* Apply this log if the region is in normal state; otherwise skip it.
         \* Unlike the real-world implementation, we don't check the epoch
         \* here. Checking the region state is equivalent to checking whether
         \* PreMergeLog has been applied, since applying PreMergeLog moves the
         \* region out of the normal state.
         num_applied_delta == IF region[i][r] = RegionNormal THEN 1 ELSE 0
       IN
         raft' = [raft EXCEPT ![i][r].apply_index = next_index,
                              ![i][r].num_applied = @ + num_applied_delta]
    /\ UNCHANGED <<messages, region, client_vars>>
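 \* Worked example for num_applied (illustrative trace, not part of the
 \* original spec). Suppose region B on store i holds three committed logs
 \* <<LogNormal, LogPreMerge, LogNormal>>, written by type only, and starts
 \* with num_applied = 0:
 \*   apply index 1 (LogNormal, region is RegionNormal):    num_applied = 1
 \*   apply index 2 (LogPreMerge, region -> RegionMerging): num_applied = 1
 \*   apply index 3 (LogNormal, region is RegionMerging):   num_applied = 1
 \* apply_index ends at 3, but only the normal log before LogPreMerge counts.
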
 \* Apply Raft logs to make apply_index catch up with commit_index.
 ApplyLog(i) ==
  \/ \E r \in Region : ApplyNormalLog(i, r)
  \/ ApplyPreMergeLog(i)
  \/ ApplyMergeLog(i)
 -------------------------------------------------------------------------------
 \* Specification of Raft merge.
@@ -198,11 +354,14 @@ Next ==
  \/ \E i, j \in Store : \E r \in Region : AppendEntries(i, j, r)
  \/ \E i \in Store : \E r \in Region : AdvanceCommitIndex(i, r)
  \/ \E m \in messages : Receive(m)
  \/ \E i \in Store : \E r \in Region : ApplyLog(i, r)
-  \* External client can send requests to region leader.
+  \* External client can send requests to region B leader, to add a new log
-  \/ \E i \in Store : \E r \in Region :
+  \* entry in region B.
-        ClientRequest(i, r, Log)
+  \/ \E i \in Store : ClientRequest(i, RegionB, [type |-> LogNormal])
  \* Raft merge actions.
  \/ ProposeMergeRequest(LeaderB)
  \/ \E i \in Store : ApplyLog(i)
 Init ==
  /\ messages = {}
@@ -215,6 +374,7 @@ Init ==
                              logs         |-> << >>,
                              commit_index |-> 0,
                              apply_index  |-> 0,
                              num_applied  |-> 0,
                              match_index  |-> [j \in Store |-> 0]
                             ]
                           ]
@@ -223,7 +383,7 @@ Init ==
       raft = MarkLeader(MarkLeader(no_leader_raft, LeaderA, RegionA),
                         LeaderB,
                         RegionB)
-  /\ region = TRUE
+  /\ region = [i \in Store |-> [r \in Region |-> RegionNormal]]
  /\ client_requests_index = 0
 Spec ==
@@ -233,19 +393,33 @@ Spec ==
 \* Type invariants.
 LogType ==
-  {Log}
+  LET
    FlatLogType ==
           [type : {LogNormal}]
      \cup [type : {LogPreMerge}, min_index : Nat]
  IN
         FlatLogType
    \cup [type : {LogMerge},
          min_index : Nat,
          commit_index : Nat,
          entries : Seq(FlatLogType)]
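 \* Example members of LogType (illustrative values, not part of the original
 \* spec), one per log shape:
 \*   [type |-> LogNormal]
 \*   [type |-> LogPreMerge, min_index |-> 4]
 \*   [type |-> LogMerge, min_index |-> 4, commit_index |-> 5,
 \*    entries |-> <<[type |-> LogNormal],
 \*                  [type |-> LogPreMerge, min_index |-> 4]>>]
 \* entries ranges over FlatLogType only, so a LogMerge never nests another
 \* LogMerge.
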
 RaftType ==
  [ is_leader    : BOOLEAN,
    logs         : Seq(LogType),
    commit_index : Nat,
    apply_index  : Nat,
    num_applied  : Nat,
    match_index  : [Store -> Nat]  \* Only available on leader.
                                   \* Initialized to zeroes on followers.
  ]
 RegionType ==
  { RegionNormal, RegionTombStone, RegionMerging }
 TypeInvariant ==
-  /\ raft \in [Store -> [Region -> RaftType]]
+  /\ raft   \in [Store -> [Region -> RaftType]]
  /\ region \in [Store -> [Region -> RegionType]]
 -------------------------------------------------------------------------------
 \* Some invariants for our simplified Raft model.
@@ -294,4 +468,37 @@ SimpliedRaftInvariant ==
  /\ LogInvariant
  /\ ApplyIndexInvariant
 -------------------------------------------------------------------------------
 \* Some invariants for Raft region merge.
 \* If the replicas of every region on two different stores have applied the
 \* same logs, the two stores should also agree on every region state.
 RegionApplyInvariant ==
  \A i, j \in Store :
    (
      /\ i /= j
      /\ (\A r \in Region : raft[i][r].apply_index = raft[j][r].apply_index)
    ) =>
      \A r \in Region : region[i][r] = region[j][r]
 \* For any two stores on which region B has been tombstoned (i.e. the merge is
 \* done), region B should have applied the same number of normal logs.
 MergeLogInvariant ==
  \A i, j \in Store :
    (
      /\ i /= j
      /\ region[i][RegionB] = RegionTombStone
      /\ region[j][RegionB] = RegionTombStone
    ) =>
      LET
        applied_i == raft[i][RegionB].num_applied
        applied_j == raft[j][RegionB].num_applied
      IN
        applied_i = applied_j
 \* Combination of the above invariants.
 RaftMergeInvariant ==
  /\ RegionApplyInvariant
  /\ MergeLogInvariant
 ===============================================================================