mirror of
https://github.com/talent-plan/tinykv.git
synced 2024-12-25 20:30:30 +08:00
spell check using grammarly (#229)
This commit is contained in:
parent
5e493b7bb9
commit
87f6ce88e8
@ -1,6 +1,6 @@
|
||||
# Project1 StandaloneKV
|
||||
|
||||
In this project, you will build a standalone key/value storage [gRPC](https://grpc.io/docs/guides/) service with the support of column family. Standalone means only a single node, not a distributed system. Column family( <https://en.wikipedia.org/wiki/Standard_column_family> , it will abbreviate to CF below) is a term like key namespace, namely the values of the same key in different column families is not the same. You can simply regard multiple column families as separate mini databases. It’s used to support the transaction model in project4, you will know why TinyKV needs the support of CF then.
|
||||
In this project, you will build a standalone key/value storage [gRPC](https://grpc.io/docs/guides/) service with the support of the column family. Standalone means only a single node, not a distributed system. [Column family]( <https://en.wikipedia.org/wiki/Standard_column_family> ), it will abbreviate to CF below) is a term like key namespace, namely the values of the same key in different column families is not the same. You can simply regard multiple column families as separate mini databases. It’s used to support the transaction model in the project4, you will know why TinyKV needs the support of CF then.
|
||||
|
||||
The service supports four basic operations: Put/Delete/Get/Scan. It maintains a simple database of key/value pairs. Keys and values are strings. `Put` replaces the value for a particular key for the specified CF in the database, `Delete` deletes the key's value for the specified CF, `Get` fetches the current value for a key for the specified CF, and `Scan` fetches the current value for a series of keys for the specified CF.
|
||||
|
||||
@ -37,7 +37,7 @@ And you don’t need to consider the `kvrpcpb.Context` now, it’s used in the f
|
||||
|
||||
> Hints:
|
||||
>
|
||||
> - You should use [badger.Txn]( https://godoc.org/github.com/dgraph-io/badger#Txn ) to implement the `Reader` function, because the transcation handler provided by badger could provide a consistent snapshot of the keys and values.
|
||||
> - You should use [badger.Txn]( https://godoc.org/github.com/dgraph-io/badger#Txn ) to implement the `Reader` function because the transaction handler provided by badger could provide a consistent snapshot of the keys and values.
|
||||
> - Badger doesn’t give support for column families. engine_util package (`kv/util/engine_util`) simulates column families by adding a prefix to keys. For example, a key `key` that belongs to a specific column family `cf` is stored as `${cf}_${key}`. It wraps `badger` to provide operations with CFs, and also offers many useful helper functions. So you should do all read/write operations through `engine_util` provided methods. Please read `util/engine_util/doc.go` to learn more.
|
||||
> - TinyKV uses a fork of the original version of `badger` with some fix, so just use `github.com/Connor1996/badger` instead of `github.com/dgraph-io/badger`.
|
||||
> - Don’t forget to call `Discard()` for badger.Txn and close all iterators before discarding.
|
||||
|
@ -2,19 +2,19 @@
|
||||
|
||||
Raft is a consensus algorithm that is designed to be easy to understand. You can read material about the Raft itself at [the Raft site](https://raft.github.io/), an interactive visualization of the Raft, and other resources, including [the extended Raft paper](https://raft.github.io/raft.pdf).
|
||||
|
||||
In this project you will implement a high available kv server based on raft, which needs you not only to implement the Raft algorithm but also use it practically, and bring you more challenges like managing raft’s persisted state with `badger`, add flow control for snapshot message, etc.
|
||||
In this project, you will implement a high available kv server based on raft, which needs you not only to implement the Raft algorithm but also use it practically, and bring you more challenges like managing raft’s persisted state with `badger`, add flow control for snapshot message, etc.
|
||||
|
||||
The project have 3 parts you need to do, including:
|
||||
The project has 3 parts you need to do, including:
|
||||
|
||||
- Implement the basic Raft algorithm
|
||||
- Build a fault tolerant KV server on top of Raft
|
||||
- Build a fault-tolerant KV server on top of Raft
|
||||
- Add the support of raftlog GC and snapshot
|
||||
|
||||
## Part A
|
||||
|
||||
### The Code
|
||||
|
||||
In this part, you will implement the basic raft algorithm. The code you need to implement is under `raft/`. Inside `raft/`, there are some skeleton code and test cases waiting for you. The raft algorithm you're gonna implement here has a well designed interface with the upper application. Moreover, it uses a logical clock (named tick here) to measure the election and heartbeat timeout instead of a physical clock. That is to say, do not set a timer in the Raft module itself, and the upper application is responsible to advance the logical clock by calling `RawNode.Tick()`. Apart from that, messages sending and receiving along with other things are processed in an asynchronous way, it is also up to the upper application when to actually do these things (see below for more detail). For example, Raft will not block wait on the response of any request message.
|
||||
In this part, you will implement the basic raft algorithm. The code you need to implement is under `raft/`. Inside `raft/`, there are some skeleton code and test cases waiting for you. The raft algorithm you're gonna implement here has a well-designed interface with the upper application. Moreover, it uses a logical clock (named tick here) to measure the election and heartbeat timeout instead of a physical clock. That is to say, do not set a timer in the Raft module itself, and the upper application is responsible to advance the logical clock by calling `RawNode.Tick()`. Apart from that, messages sending and receiving along with other things are processed asynchronously, it is also up to the upper application when to actually do these things (see below for more detail). For example, Raft will not block waiting on the response of any request message.
|
||||
|
||||
Before implementing, please checkout the hints for this part first. Also, you should take a rough look at the proto file `proto/proto/eraftpb.proto`. Raft sends and receives messages and related structs are defined there, you’ll use them for the implementation. Note that, unlike Raft paper, it divides Heartbeat and AppendEntries into different messages to make the logic more clear.
|
||||
|
||||
@ -30,15 +30,15 @@ This part can be broken down into 3 steps, including:
|
||||
|
||||
#### Leader election
|
||||
|
||||
To implement leader election, you may want to start with `raft.Raft.tick()` which is used to advance the internal logical clock by a single tick and hence drive the election timeout or heartbeat timeout. You don’t need to care about the message sending and receiving logic now. If you need to send out a message, just push it to `raft.Raft.msgs` and all messages the raft received will be passed to `raft.Raft.Step()`. The test code will get the messages from `raft.Raft.msgs` and pass response messages through `raft.Raft.Step()`. The `raft.Raft.Step()` is the entrance of message handling, you should handle messages like `MsgRequestVote`, `MsgHeartbeat` and their response. And please also implement test stub functions and get them called properly like `raft.Raft.becomeXXX` which used to update the raft internal state when the raft’s role changes.
|
||||
To implement leader election, you may want to start with `raft.Raft.tick()` which is used to advance the internal logical clock by a single tick and hence drive the election timeout or heartbeat timeout. You don’t need to care about the message sending and receiving logic now. If you need to send out a message, just push it to `raft.Raft.msgs` and all messages the raft received will be passed to `raft.Raft.Step()`. The test code will get the messages from `raft.Raft.msgs` and pass response messages through `raft.Raft.Step()`. The `raft.Raft.Step()` is the entrance of message handling, you should handle messages like `MsgRequestVote`, `MsgHeartbeat` and their response. And please also implement test stub functions and get them called properly like `raft.Raft.becomeXXX` which is used to update the raft internal state when the raft’s role changes.
|
||||
|
||||
You can run `make project2aa` to test the implementation, and see some hints at the end of this part.
|
||||
You can run `make project2aa` to test the implementation and see some hints at the end of this part.
|
||||
|
||||
#### Log replication
|
||||
|
||||
To implement log replication, you may want to start with handling `MsgAppend` and `MsgAppendResponse` on both the sender and receiver side. Checkout `raft.RaftLog` in `raft/log.go` which is a helper struct that help you manage the raft log, in here you also need to interact with the upper application by the `Storage` interface defined in `raft/storage.go` to get the persisted data like log entries and snapshot.
|
||||
To implement log replication, you may want to start with handling `MsgAppend` and `MsgAppendResponse` on both the sender and receiver sides. Checkout `raft.RaftLog` in `raft/log.go` which is a helper struct that helps you manage the raft log, in here you also need to interact with the upper application by the `Storage` interface defined in `raft/storage.go` to get the persisted data like log entries and snapshot.
|
||||
|
||||
You can run `make project2ab` to test the implementation, and see some hints at the end of this part.
|
||||
You can run `make project2ab` to test the implementation and see some hints at the end of this part.
|
||||
|
||||
### Implement the raw node interface
|
||||
|
||||
@ -46,13 +46,13 @@ You can run `make project2ab` to test the implementation, and see some hints at
|
||||
|
||||
Another important struct `Ready` is also defined here. When handling messages or advancing the logical clock, the `raft.Raft` may need to interact with the upper application, like:
|
||||
|
||||
- send messages to other peer
|
||||
- send messages to other peers
|
||||
- save log entries to stable storage
|
||||
- save hard state like term, commit index and vote to stable storage
|
||||
- save hard state like the term, commit index, and vote to stable storage
|
||||
- apply committed log entries to the state machine
|
||||
- etc
|
||||
|
||||
But these interactions do not happen immediately, instead they are encapsulated in `Ready` and returned by `RawNode.Ready()` to the upper application. It is up to the upper application when to call `RawNode.Ready()` and handle it. After handling the returned `Ready`, the upper application also needs to call some functions like `RawNode.Advance()` to update `raft.Raft`'s internal state like the applied index, stabled log index, etc.
|
||||
But these interactions do not happen immediately, instead, they are encapsulated in `Ready` and returned by `RawNode.Ready()` to the upper application. It is up to the upper application when to call `RawNode.Ready()` and handle it. After handling the returned `Ready`, the upper application also needs to call some functions like `RawNode.Advance()` to update `raft.Raft`'s internal state like the applied index, stabled log index, etc.
|
||||
|
||||
You can run `make project2ac` to test the implementation and run `make project2a` to test the whole part A.
|
||||
|
||||
@ -60,16 +60,16 @@ You can run `make project2ac` to test the implementation and run `make project2a
|
||||
>
|
||||
> - Add any state you need to `raft.Raft`, `raft.RaftLog`, `raft.RawNode` and message on `eraftpb.proto`
|
||||
> - The tests assume that the first time start raft should have term 0
|
||||
> - The tests assume that the new elected leader should append a noop entry on its term
|
||||
> - The tests assume that the newly elected leader should append a noop entry on its term
|
||||
> - The tests doesn’t set term for the local messages, `MessageType_MsgHup`, `MessageType_MsgBeat` and `MessageType_MsgPropose`.
|
||||
> - The log entries append are quite different between leader and non-leader, there are different sources, checking and handling, be careful with that.
|
||||
> - Don’t forget the election timeout should be different between peers.
|
||||
> - Some wrapper functions in `rawnode.go` can implement with `raft.Step(local message)`
|
||||
> - When start a new raft, get the last stabled state from `Storage` to initialize `raft.Raft` and `raft.RaftLog`
|
||||
> - When starting a new raft, get the last stabled state from `Storage` to initialize `raft.Raft` and `raft.RaftLog`
|
||||
|
||||
## Part B
|
||||
|
||||
In this part you will build a fault-tolerant key-value storage service using the Raft module implemented in part A. Your key/value service will be a replicated state machine, consisting of several key/value servers that use Raft for replication. Your key/value service should continue to process client requests as long as a majority of the servers are alive and can communicate, in spite of other failures or network partitions.
|
||||
In this part, you will build a fault-tolerant key-value storage service using the Raft module implemented in part A. Your key/value service will be a replicated state machine, consisting of several key/value servers that use Raft for replication. Your key/value service should continue to process client requests as long as a majority of the servers are alive and can communicate, despite other failures or network partitions.
|
||||
|
||||
In project1 you have implemented a standalone kv server, so you should already be familiar with the kv server API and `Storage` interface.
|
||||
|
||||
@ -96,11 +96,11 @@ Then, here comes the core of TinyKV — raftstore. The structure is a little com
|
||||
|
||||
The entrance of raftstore is `Raftstore`, see `kv/raftstore/raftstore.go`. It starts some workers to handle specific tasks asynchronously, and most of them aren’t used now so you can just ignore them. All you need to focus on is `raftWorker`.(kv/raftstore/raft_worker.go)
|
||||
|
||||
The whole process is divided into two parts: raft worker polls `raftCh` to get the messages, the messages includes the base tick to drive Raft module and Raft commands to be proposed as Raft entries; it gets and handles ready from Raft module, including send raft messages, persist the state, apply the committed entries to the state machine. Once applied, return the response to clients.
|
||||
The whole process is divided into two parts: raft worker polls `raftCh` to get the messages, the messages include the base tick to drive Raft module and Raft commands to be proposed as Raft entries; it gets and handles ready from Raft module, including send raft messages, persist the state, apply the committed entries to the state machine. Once applied, return the response to clients.
|
||||
|
||||
### Implement peer storage
|
||||
|
||||
Peer storage is what you interact with through the `Storage` interface in part A, but in addition to the raft log, peer storage also manages other persisted metadata which is very important to restore the consistent state machine after restart. Moreover, there are three important states defined in `proto/proto/raft_serverpb.proto`:
|
||||
Peer storage is what you interact with through the `Storage` interface in part A, but in addition to the raft log, peer storage also manages other persisted metadata which is very important to restore the consistent state machine after a restart. Moreover, there are three important states defined in `proto/proto/raft_serverpb.proto`:
|
||||
|
||||
- RaftLocalState: Used to store HardState of the current Raft and the last Log Index.
|
||||
- RaftApplyState: Used to store the last Log index that Raft applies and some truncated Log information.
|
||||
@ -122,11 +122,11 @@ The format is as below and some helper functions are provided in `kv/raftstore/m
|
||||
|
||||
> You may wonder why TinyKV needs two badger instances. Actually, it can use only one badger to store both raft log and state machine data. Separating into two instances is just to be consistent with TiKV design.
|
||||
|
||||
These metadatas should be created and updated in `PeerStorage`. When creating PeerStorage, see `kv/raftstore/peer_storager.go`. It initializes RaftLocalState, RaftApplyState of this Peer or gets the previous value from the underlying engine in the case of restart. Note that the value of both RAFT_INIT_LOG_TERM and RAFT_INIT_LOG_INDEX is 5 (as long as it's larger than 1) but not 0. The reason why not set it to 0 is to distinguish with the case that peer created passively after conf change. You may not quite understand it now, so just keep it in mind and the detail will be described in project3b when you are implementing conf change.
|
||||
These metadatas should be created and updated in `PeerStorage`. When creating PeerStorage, see `kv/raftstore/peer_storager.go`. It initializes RaftLocalState, RaftApplyState of this Peer, or gets the previous value from the underlying engine in the case of restart. Note that the value of both RAFT_INIT_LOG_TERM and RAFT_INIT_LOG_INDEX is 5 (as long as it's larger than 1) but not 0. The reason why not set it to 0 is to distinguish with the case that peer created passively after conf change. You may not quite understand it now, so just keep it in mind and the detail will be described in project3b when you are implementing conf change.
|
||||
|
||||
The code you need to implement in this part is only one function: `PeerStorage.SaveReadyState`, what this function does is to save the data in `raft.Ready` to badger, including append log entries and save the Raft hard state.
|
||||
|
||||
To append log entries, simply save all log entries at `raft.Ready.Entries` to raftdb and delete any previously appended log entries which will never be committed. Also update the peer storage’s `RaftLocalState` and save it to raftdb.
|
||||
To append log entries, simply save all log entries at `raft.Ready.Entries` to raftdb and delete any previously appended log entries which will never be committed. Also, update the peer storage’s `RaftLocalState` and save it to raftdb.
|
||||
|
||||
To save the hard state is also very easy, just update peer storage’s `RaftLocalState.HardState` and save it to raftdb.
|
||||
|
||||
@ -137,7 +137,7 @@ To save the hard state is also very easy, just update peer storage’s `RaftLoca
|
||||
|
||||
### Implement Raft ready process
|
||||
|
||||
In project2 part A, you have built a tick based Raft module. Now you need to write the outer process to drive it. Most of the code is already implemented under `kv/raftstore/peer_msg_handler.go` and `kv/raftstore/peer.go`. So you need to learn the code and finish the logic of `proposeRaftCommand` and `HandleRaftReady`. Here are some interpretations of the framework.
|
||||
In project2 part A, you have built a tick-based Raft module. Now you need to write the outer process to drive it. Most of the code is already implemented under `kv/raftstore/peer_msg_handler.go` and `kv/raftstore/peer.go`. So you need to learn the code and finish the logic of `proposeRaftCommand` and `HandleRaftReady`. Here are some interpretations of the framework.
|
||||
|
||||
The Raft `RawNode` is already created with `PeerStorage` and stored in `peer`. In the raft worker, you can see that it takes the `peer` and wraps it by `peerMsgHandler`. The `peerMsgHandler` mainly has two functions: one is `HandleMsgs` and the other is `HandleRaftReady`.
|
||||
|
||||
@ -180,21 +180,21 @@ After this the whole process of a read or write would be like this:
|
||||
|
||||
You should run `make project2b` to pass all the tests. The whole test is running a mock cluster including multiple TinyKV instances with a mock network. It performs some read and write operations and checks whether the return values are as expected.
|
||||
|
||||
To be noted, error handling is an important part of passing the test. You may have already noticed that there are some errors defined in `proto/proto/errorpb.proto` and the error is a field of the gRPC response. Also, the corresponding errors which implements `error` interface are defined in `kv/raftstore/util/error.go`, so you can use them as a return value of functions.
|
||||
To be noted, error handling is an important part of passing the test. You may have already noticed that there are some errors defined in `proto/proto/errorpb.proto` and the error is a field of the gRPC response. Also, the corresponding errors which implement the`error` interface are defined in `kv/raftstore/util/error.go`, so you can use them as a return value of functions.
|
||||
|
||||
These error are mainly related with Region. So it is also a member of `RaftResponseHeader` of `RaftCmdResponse`. When proposaling a request or applying a command, there may be some errors. If that, you should return the raft command response with the error, then the error will be further passed to gRPC response. You can use `BindErrResp` provided in `kv/raftstore/cmd_resp.go` to convert these errors to errors defined in `errorpb.proto` when returning the response with error.
|
||||
These errors are mainly related to Region. So it is also a member of `RaftResponseHeader` of `RaftCmdResponse`. When proposing a request or applying a command, there may be some errors. If that, you should return the raft command response with the error, then the error will be further passed to gRPC response. You can use `BindErrResp` provided in `kv/raftstore/cmd_resp.go` to convert these errors to errors defined in `errorpb.proto` when returning the response with an error.
|
||||
|
||||
In this stage, you may consider these errors, and others will be processed in project3:
|
||||
|
||||
- ErrNotLeader: the raft command is proposed on a follower. so use it to let client try other peers.
|
||||
- ErrStaleCommand: It may due to leader changes that some logs are not committed and overrided with new leaders’ logs. But client doesn’t know that and is still waiting for the response. So you should return this to let client knows and retries the command again.
|
||||
- ErrNotLeader: the raft command is proposed on a follower. so use it to let the client try other peers.
|
||||
- ErrStaleCommand: It may due to leader changes that some logs are not committed and overrided with new leaders’ logs. But the client doesn’t know that and is still waiting for the response. So you should return this to let the client knows and retries the command again.
|
||||
|
||||
> Hints:
|
||||
>
|
||||
> - `PeerStorage` implements the `Storage` interface of Raft module, you should use the provided method `SaveRaftReady()` to persist the Raft related states.
|
||||
> - `PeerStorage` implements the `Storage` interface of the Raft module, you should use the provided method `SaveRaftReady()` to persist the Raft related states.
|
||||
> - Use `WriteBatch` in `engine_util` to make multiple writes atomically, for example, you need to make sure to apply the committed entries and update the applied index in one write batch.
|
||||
> - Use `Transport` to send raft messages to other peers, it’s in the `GlobalContext`,
|
||||
> - The server should not complete a get RPC if it is not part of a majority and do not has up-to-date data. You can just put the get operation into the raft log, or implement the optimization for read-only operations that is described in Section 8 in the Raft paper.
|
||||
> - The server should not complete a get RPC if it is not part of a majority and does not has up-to-date data. You can just put the get operation into the raft log, or implement the optimization for read-only operations that is described in Section 8 in the Raft paper.
|
||||
> - Do not forget to update and persist the apply state when applying the log entries.
|
||||
> - You can apply the committed Raft log entries in an asynchronous way just like TiKV does. It’s not necessary, though a big challenge to improve performance.
|
||||
> - Record the callback of the command when proposing, and return the callback after applying.
|
||||
@ -204,7 +204,7 @@ In this stage, you may consider these errors, and others will be processed in pr
|
||||
|
||||
As things stand now with your code, it's not practical for a long-running server to remember the complete Raft log forever. Instead, the server will check the number of Raft log, and discard log entries exceeding the threshold from time to time.
|
||||
|
||||
In this part, you will implement the Snapshot handling based on the above two part implementation. Generally, Snapshot is just a raft message like AppendEntries used to replicate data to follower, what make it different is its size, Snapshot contains the whole state machine data in some point of time, and to build and send such a big message at once will consume many resource and time, which may block the handling of other raft message, to amortize this problem, Snapshot message will use an independent connect, and split the data into chunks to transport. That’s the reason why there is a snapshot RPC API for TinyKV service. If you are interested in the detail of sending and receiving, check `snapRunner` and the reference <https://pingcap.com/blog-cn/tikv-source-code-reading-10/>
|
||||
In this part, you will implement the Snapshot handling based on the above two part implementation. Generally, Snapshot is just a raft message like AppendEntries used to replicate data to followers, what makes it different is its size, Snapshot contains the whole state machine data at some point of time, and to build and send such a big message at once will consume many resource and time, which may block the handling of other raft messages, to amortize this problem, Snapshot message will use an independent connect, and split the data into chunks to transport. That’s the reason why there is a snapshot RPC API for TinyKV service. If you are interested in the detail of sending and receiving, check `snapRunner` and the reference <https://pingcap.com/blog-cn/tikv-source-code-reading-10/>
|
||||
|
||||
### The Code
|
||||
|
||||
@ -212,18 +212,18 @@ All you need to change is based on the code written in part A and part B.
|
||||
|
||||
### Implement in Raft
|
||||
|
||||
Although we need some different handling for Snapshot messages, in the perspective of raft algorithm there should be no difference. See the definition of `eraftpb.Snapshot` in the proto file, the `data` field on `eraftpb.Snapshot` does not represent the actual state machine data but some metadata used for the upper application you can ignore it for now. When the leader needs to send a Snapshot message to a follower, it can call `Storage.Snapshot()` to get a `eraftpb.Snapshot`, then send the snapshot message like other raft messages. How the state machine data is actually built and sent are implemented by the raftstore, it will be introduced in the next step. You can assume that once `Storage.Snapshot()` returns successfully, it’s safe for Raft leader to the snapshot message to the follower, and follower should call `handleSnapshot` to handle it, which namely just restore the raft internal state like term, commit index and membership information, ect, from the
|
||||
Although we need some different handling for Snapshot messages, in the perspective of raft algorithm there should be no difference. See the definition of `eraftpb.Snapshot` in the proto file, the `data` field on `eraftpb.Snapshot` does not represent the actual state machine data but some metadata is used for the upper application you can ignore it for now. When the leader needs to send a Snapshot message to a follower, it can call `Storage.Snapshot()` to get a `eraftpb.Snapshot`, then send the snapshot message like other raft messages. How the state machine data is actually built and sent are implemented by the raftstore, it will be introduced in the next step. You can assume that once `Storage.Snapshot()` returns successfully, it’s safe for Raft leader to the snapshot message to the follower, and follower should call `handleSnapshot` to handle it, which namely just restore the raft internal state like the term, commit index and membership information, etc, from the
|
||||
`eraftpb.SnapshotMetadata` in the message, after that, the procedure of snapshot handling is finish.
|
||||
|
||||
#### Implement in raftstore
|
||||
|
||||
In this step, you need to learn two more workers of raftstore — raftlog-gc worker and region worker.
|
||||
|
||||
Raftstore checks whether it needs to gc log from time to time based on the config `RaftLogGcCountLimit`, see `onRaftGcLogTick()`. If yes, it will propose a raft admin command `CompactLogRequest` which is wrapped in `RaftCmdRequest` just like four basic command types(Get/Put/Delete/Snap) implemented in project2 part B. Then you need to process this admin command when it’s committed by Raft. But unlike Get/Put/Delete/Snap commands write or read state machine data, `CompactLogRequest` modifies metadata, namely updates the `RaftTruncatedState` which is in the `RaftApplyState`. After that, you should schedule a task to raftlog-gc worker by `ScheduleCompactLog`. Raftlog-gc worker will do the actual log deletion work asynchronously.
|
||||
Raftstore checks whether it needs to gc log from time to time based on the config `RaftLogGcCountLimit`, see `onRaftGcLogTick()`. If yes, it will propose a raft admin command `CompactLogRequest` which is wrapped in `RaftCmdRequest` just like four basic command types(Get/Put/Delete/Snap) implemented in project2 part B. Then you need to process this admin command when it’s committed by Raft. But unlike Get/Put/Delete/Snap commands write or read state machine data, `CompactLogRequest` modifies metadata, namely updates the `RaftTruncatedState` which is in the `RaftApplyState`. After that, you should schedule a task to raftlog-gc worker by `ScheduleCompactLog`. Raftlog-gc worker will do the actual log deletion work asynchronously.
|
||||
|
||||
Then due to the log compaction, Raft module maybe needs to send a snapshot. `PeerStorage` implements `Storage.Snapshot()`. TinyKV generates snapshots and applies snapshots in the region worker. When calling `Snapshot()`, it actually sends a task `RegionTaskGen` to the region worker. The message handler of the region worker is located in `kv/raftstore/runner/region_task.go`. It scans the underlying engines to generate a snapshot, and sends snapshot metadata by channel. At the next time of Raft calling `Snapshot`, it checks whether the snapshot generating is finished. If yes, Raft should send the snapshot message to other peers, and the snapshot sending and receiving work is handled by `kv/storage/raft_storage/snap_runner.go`. You don’t need to dive into the details, only should know the snapshot message will be handled by `onRaftMsg` after the snapshot is received.
|
||||
Then due to the log compaction, Raft module maybe needs to send a snapshot. `PeerStorage` implements `Storage.Snapshot()`. TinyKV generates snapshots and applies snapshots in the region worker. When calling `Snapshot()`, it actually sends a task `RegionTaskGen` to the region worker. The message handler of the region worker is located in `kv/raftstore/runner/region_task.go`. It scans the underlying engines to generate a snapshot, and sends snapshot metadata by channel. At the next time Raft calling `Snapshot`, it checks whether the snapshot generating is finished. If yes, Raft should send the snapshot message to other peers, and the snapshot sending and receiving work is handled by `kv/storage/raft_storage/snap_runner.go`. You don’t need to dive into the details, only should know the snapshot message will be handled by `onRaftMsg` after the snapshot is received.
|
||||
|
||||
Then the snapshot will reflect in the next Raft ready, so the task you should do is to modify the raft ready process to handle the case of snapshot. When you are sure to apply the snapshot, you can update the peer storage’s memory state like `RaftLocalState`, `RaftApplyState` and `RegionLocalState`. Also don’t forget to persist these states to kvdb and raftdb and remove stale state from kvdb and raftdb. Besides, you also need to update
|
||||
Then the snapshot will reflect in the next Raft ready, so the task you should do is to modify the raft ready process to handle the case of a snapshot. When you are sure to apply the snapshot, you can update the peer storage’s memory state like `RaftLocalState`, `RaftApplyState`, and `RegionLocalState`. Also, don’t forget to persist these states to kvdb and raftdb and remove stale state from kvdb and raftdb. Besides, you also need to update
|
||||
`PeerStorage.snapState` to `snap.SnapState_Applying` and send `runner.RegionTaskApply` task to region worker through `PeerStorage.regionSched` and wait until region worker finish.
|
||||
|
||||
You should run `make project2c` to pass all the tests.
|
||||
|
@ -4,9 +4,9 @@ In project2, you have built a high available kv server based on Raft, good work!
|
||||
|
||||
![multiraft](imgs/multiraft.png)
|
||||
|
||||
In this project you will implement a multi raft based kv server with balance scheduler, which consist of multiple raft groups, each raft group is responsible for a single key range which is named region here, the layout will be looked like the above diagram. Requests to a single region are handled just like before, yet multiple regions can handle requests concurrently which improve performance but also bring some new challenges like balancing the request to each region, etc.
|
||||
In this project you will implement a multi raft-based kv server with balance scheduler, which consist of multiple raft groups, each raft group is responsible for a single key range which is named region here, the layout will be looked like the above diagram. Requests to a single region are handled just like before, yet multiple regions can handle requests concurrently which improves performance but also bring some new challenges like balancing the request to each region, etc.
|
||||
|
||||
This project have 3 part, including:
|
||||
This project has 3 part, including:
|
||||
|
||||
1. Implement membership change and leadership change to Raft algorithm
|
||||
2. Implement conf change and region split on raftstore
|
||||
@ -22,11 +22,11 @@ The code you need to modify is all about `raft/raft.go` and `raft/rawnode.go`, a
|
||||
|
||||
### Implement leader transfer
|
||||
|
||||
To implement leader transfer, let’s introduce two new message types: `MsgTransferLeader` and `MsgTimeoutNow`. To transfer leadership you need to first call `raft.Raft.Step` with `MsgTransferLeader` message on the current leader, and to ensure the success of transfer, the current leader should first check the qualification of the transferee (namely transfer target) like: is the transferee’s log up to date, etc. If the transferee is not qualified, the current leader can choose to abort the transfer or help the transferee, since abort is not helping, let’s choose to help the transferee. If the transferee’s log is not up to date, the current leader should send a `MsgAppend` message to the transferee and stop accepting new proposes in case we end up cycling. So if the transferee is qualified (or after the current leader’s help), the leader should send a `MsgTimeoutNow` message to the transferee immediately, and after received a `MsgTimeoutNow` message the transferee should start a new election immediately regardless of its election timeout, with a higher term and up to date log, the transferee have great chance to step down the current leader and become the new leader.
|
||||
To implement leader transfer, let’s introduce two new message types: `MsgTransferLeader` and `MsgTimeoutNow`. To transfer leadership you need to first call `raft.Raft.Step` with `MsgTransferLeader` message on the current leader, and to ensure the success of the transfer, the current leader should first check the qualification of the transferee (namely transfer target) like: is the transferee’s log up to date, etc. If the transferee is not qualified, the current leader can choose to abort the transfer or help the transferee, since abort is not helping, let’s choose to help the transferee. If the transferee’s log is not up to date, the current leader should send a `MsgAppend` message to the transferee and stop accepting new proposals in case we end up cycling. So if the transferee is qualified (or after the current leader’s help), the leader should send a `MsgTimeoutNow` message to the transferee immediately, and after receiving a `MsgTimeoutNow` message the transferee should start a new election immediately regardless of its election timeout, with a higher term and up to date log, the transferee has great chance to step down the current leader and become the new leader.
|
||||
|
||||
### Implement conf change
|
||||
|
||||
Conf change algorithm you will implement here is not the joint consensus algorithm mentioned in the extended Raft paper that can add and/or remove arbitrary peers at once, instead it can only add or remove peers one by one, which is more simple and easy to reason about. Moreover, conf change start at calling leader’s `raft.RawNode.ProposeConfChange` which will propose an entry with `pb.Entry.EntryType` set to `EntryConfChange` and `pb.Entry.Data` set to the input `pb.ConfChange`. When entries with type `EntryConfChange` are committed, you must apply it through `RawNode.ApplyConfChange` with the `pb.ConfChange` in the entry, only then you can add or remove peer to this raft node through `raft.Raft.addNode` and `raft.Raft.removeNode` according to the `pb.ConfChange`.
|
||||
Conf change algorithm you will implement here is not the joint consensus algorithm mentioned in the extended Raft paper that can add and/or remove arbitrary peers at once, instead, it can only add or remove peers one by one, which is more simple and easy to reason about. Moreover, conf change start at calling leader’s `raft.RawNode.ProposeConfChange` which will propose an entry with `pb.Entry.EntryType` set to `EntryConfChange` and `pb.Entry.Data` set to the input `pb.ConfChange`. When entries with type `EntryConfChange` are committed, you must apply it through `RawNode.ApplyConfChange` with the `pb.ConfChange` in the entry, only then you can add or remove peer to this raft node through `raft.Raft.addNode` and `raft.Raft.removeNode` according to the `pb.ConfChange`.
|
||||
|
||||
> Hints:
|
||||
>
|
||||
@ -37,7 +37,7 @@ Conf change algorithm you will implement here is not the joint consensus algorit
|
||||
|
||||
## Part B
|
||||
|
||||
As Raft module supported membership change and leadership change now, in this part you need to make TinyKV support these admin commands based on part A. As you can see in `proto/proto/raft_cmdpb.proto`, there are four types of admin commands:
|
||||
As the Raft module supported membership change and leadership change now, in this part you need to make TinyKV support these admin commands based on part A. As you can see in `proto/proto/raft_cmdpb.proto`, there are four types of admin commands:
|
||||
|
||||
- CompactLog (Already implemented in project 2 part C)
|
||||
- TransferLeader
|
||||
@ -48,15 +48,15 @@ As Raft module supported membership change and leadership change now, in this pa
|
||||
|
||||
### The Code
|
||||
|
||||
All the changes are based on the implementation of project2, so the code you need to modify is all about `kv/raftstore/peer_msg_handler.go` and `kv/raftstore/peer.go`.
|
||||
All the changes are based on the implementation of the project2, so the code you need to modify is all about `kv/raftstore/peer_msg_handler.go` and `kv/raftstore/peer.go`.
|
||||
|
||||
### Propose transfer leader
|
||||
|
||||
This step is quite simple. As a raft command, `TransferLeader` will be proposed as a Raft entry. But `TransferLeader` actually is an action no need to replicate to other peers, so you just need to call the `TransferLeader()` method of `RawNode` instead of `Propose()` for `TransferLeader` command.
|
||||
This step is quite simple. As a raft command, `TransferLeader` will be proposed as a Raft entry. But `TransferLeader` actually is an action with no need to replicate to other peers, so you just need to call the `TransferLeader()` method of `RawNode` instead of `Propose()` for `TransferLeader` command.
|
||||
|
||||
### Implement conf change in raftstore
|
||||
|
||||
The conf change have two different types, `AddNode` and `RemoveNode`. Just as its name implies, it adds a Peer or removes a Peer from the Region. To implement conf change, you should learn the terminology of `RegionEpoch` first. `RegionEpoch` is a part of the meta information of `metapb.Region`. When a Region adds or removes Peer or splits, the Region’s epoch has changed. RegionEpoch’s `conf_ver` increases during ConfChange while `version` increases during split. It will be used to guarantee the latest region information under network isolation that two leaders in one Region.
|
||||
The conf change has two different types, `AddNode` and `RemoveNode`. Just as its name implies, it adds a Peer or removes a Peer from the Region. To implement conf change, you should learn the terminology of `RegionEpoch` first. `RegionEpoch` is a part of the meta-information of `metapb.Region`. When a Region adds or removes Peer or splits, the Region’s epoch has changed. RegionEpoch’s `conf_ver` increases during ConfChange while `version` increases during a split. It will be used to guarantee the latest region information under network isolation that two leaders in one Region.
|
||||
|
||||
You need to make raftstore support handling conf change commands. The process would be:
|
||||
|
||||
@ -66,10 +66,10 @@ You need to make raftstore support handling conf change commands. The process wo
|
||||
|
||||
> Hints:
|
||||
>
|
||||
> - For executing `AddNode`, the newly added Peer will be created by heartbeat from the leader, check `maybeCreatePeer()` of `storeWorker`. At that time, this Peer is uninitialized and any information of its Region is unknown to us, so we use 0 to initialize its `Log Term` and `Index`. Leader then will know this Follower has no data (there exists a Log gap from 0 to 5) and it will directly send a snapshot to this Follower.
|
||||
> - For executing `AddNode`, the newly added Peer will be created by heartbeat from the leader, check `maybeCreatePeer()` of `storeWorker`. At that time, this Peer is uninitialized and any information of its Region is unknown to us, so we use 0 to initialize its `Log Term` and `Index`. The leader then will know this Follower has no data (there exists a Log gap from 0 to 5) and it will directly send a snapshot to this Follower.
|
||||
> - For executing `RemoveNode`, you should call the `destroyPeer()` explicitly to stop the Raft module. The destroy logic is provided for you.
|
||||
> - Do not forget to update the region state in `storeMeta` of `GlobalContext`
|
||||
> - Test code schedules the command of one conf change multiple times until the conf change is applied, so you need to consider how to ignore the duplicate commands of same conf change.
|
||||
> - Test code schedules the command of one conf change multiple times until the conf change is applied, so you need to consider how to ignore the duplicate commands of the same conf change.
|
||||
|
||||
### Implement split region in raftstore
|
||||
|
||||
@ -88,10 +88,10 @@ message Region {
|
||||
}
|
||||
```
|
||||
|
||||
Let’s take a relook at Region definition, it includes two fields `start_key` and `end_key` to indicate the range of data which the Region is responsible for. So split is the key step to support multi-raft. At the beginning, there is only one Region with range [“”, “”). You can regard the key space as a loop, so [“”, “”) stands for the whole space. With the data written, the split checker will checks the region size every `cfg.SplitRegionCheckTickInterval`, and generates a split key if possible to cut the Region into two parts, you can check the logic in
|
||||
Let’s take a relook at Region definition, it includes two fields `start_key` and `end_key` to indicate the range of data which the Region is responsible for. So split is the key step to support multi-raft. In the beginning, there is only one Region with range [“”, “”). You can regard the key space as a loop, so [“”, “”) stands for the whole space. With the data written, the split checker will checks the region size every `cfg.SplitRegionCheckTickInterval`, and generates a split key if possible to cut the Region into two parts, you can check the logic in
|
||||
`kv/raftstore/runner/split_check.go`. The split key will be wrapped as a `MsgSplitRegion` handled by `onPrepareSplitRegion()`.
|
||||
|
||||
To make sure the ids of the newly created Region and Peers are unique, the ids are allocated by scheduler. It’s also provided, so you don’t have to implement it.
|
||||
To make sure the ids of the newly created Region and Peers are unique, the ids are allocated by the scheduler. It’s also provided, so you don’t have to implement it.
|
||||
`onPrepareSplitRegion()` actually schedules a task for the pd worker to ask the scheduler for the ids. And make a split admin command after receiving the response from scheduler, see `onAskBatchSplit()` in `kv/raftstore/runner/pd_task.go`.
|
||||
|
||||
So your task is to implement the process of handling split admin command, just like conf change does. The provided framework supports multiple raft, see `kv/raftstore/router.go`. When a Region splits into two Regions, one of the Regions will inherit the metadata before splitting and just modify its Range and RegionEpoch while the other will create relevant meta information.
|
||||
@ -108,7 +108,7 @@ So your task is to implement the process of handling split admin command, just l
|
||||
|
||||
As we have instructed above, all data in our kv store is split into several regions, and every region contains multiple replicas. A problem emerged: where should we place every replica? and how can we find the best place for a replica? Who sends former AddPeer and RemovePeer commands? The Scheduler takes on this responsibility.
|
||||
|
||||
In order to make informed decisions, the Scheduler should have some information about the whole cluster. It should know where every region is. It should know how many keys they have. It should know how big they are… To get related information, the Scheduler requires that every region should send a heartbeat request to the Scheduler periodically. You can find the heartbeat request structure `RegionHeartbeatRequest` in `/proto/proto/pdpb.proto`. After receiving a heartbeat, the scheduler will update local region information.
|
||||
To make informed decisions, the Scheduler should have some information about the whole cluster. It should know where every region is. It should know how many keys they have. It should know how big they are… To get related information, the Scheduler requires that every region should send a heartbeat request to the Scheduler periodically. You can find the heartbeat request structure `RegionHeartbeatRequest` in `/proto/proto/pdpb.proto`. After receiving a heartbeat, the scheduler will update local region information.
|
||||
|
||||
Meanwhile, the Scheduler checks region information periodically to find whether there is an imbalance in our TinyKV cluster. For example, if any store contains too many regions, regions should be moved to other stores from it. These commands will be picked up as the response for corresponding regions’ heartbeat requests.
|
||||
|
||||
@ -118,7 +118,7 @@ In this part, you will need to implement the above two functions for Scheduler.
|
||||
|
||||
The code you need to modify is all about `scheduler/server/cluster.go` and `scheduler/server/schedulers/balance_region.go`. As described above, when the Scheduler received a region heartbeat, it will update its local region information first. Then it will check whether there are pending commands for this region. If there is, it will be sent back as the response.
|
||||
|
||||
You only need to implement `processRegionHeartbeat` function, in which the Scheduler updates local information; and `Schedule` function for balance-region scheduler, in which the Scheduler scans stores and determines whether there is an imbalance and which region it should move.
|
||||
You only need to implement `processRegionHeartbeat` function, in which the Scheduler updates local information; and `Schedule` function for the balance-region scheduler, in which the Scheduler scans stores and determines whether there is an imbalance and which region it should move.
|
||||
|
||||
### Collect region heartbeat
|
||||
|
||||
@ -126,13 +126,13 @@ As you can see, the only argument of `processRegionHeartbeat` function is a regi
|
||||
|
||||
Definitely not! There are two reasons. One is that updates could be skipped when no changes have been made for this region. The more important one is that the Scheduler cannot trust every heartbeat. Particularly speaking, if the cluster has partitions in a certain section, the information about some nodes might be wrong.
|
||||
|
||||
For example, some Regions re-initiate elections and splits after they are split, but another isolated batch of nodes still sends the obsolete information to Scheduler through heartbeats. So for one Region, either of two nodes might say that it's the leader, which means the Scheduler cannot trust them both.
|
||||
For example, some Regions re-initiate elections and splits after they are split, but another isolated batch of nodes still sends the obsolete information to Scheduler through heartbeats. So for one Region, either of the two nodes might say that it's the leader, which means the Scheduler cannot trust them both.
|
||||
|
||||
Which one is more credible? The Scheduler should use `conf_ver` and `version` to determine it, namely `RegionEpcoh`. The Scheduler should first compare the values of the Region version of two nodes. If the values are the same, the Scheduler compares the values of the configuration change version. The node with a larger configuration change version must have newer information.
|
||||
|
||||
Simply speaking, you could organize the check routine in below way:
|
||||
Simply speaking, you could organize the check routine in the below way:
|
||||
|
||||
1. Check whether there is a region with the same Id in local storage. If there is and at least one of the heartbeats’ `conf_ver` and `version` are less than its, this heartbeat region is stale
|
||||
1. Check whether there is a region with the same Id in local storage. If there is and at least one of the heartbeats’ `conf_ver` and `version` is less than its, this heartbeat region is stale
|
||||
|
||||
2. If there isn’t, scan all regions that overlap with it. The heartbeats’ `conf_ver` and `version` should be greater or equal than all of them, or the region is stale.
|
||||
|
||||
@ -156,9 +156,9 @@ If the Scheduler determines to update local storage according to this heartbeat,
|
||||
|
||||
There can be many different types of schedulers running in the Scheduler, for example, balance-region scheduler and balance-leader scheduler. This learning material will focus on the balance-region scheduler.
|
||||
|
||||
Every schedulers should have implemented Scheduler interface, which you can find in `/scheduler/server/schedule/scheduler.go`. The Scheduler will use the return value of `GetMinInterval` as default interval to run `Schedule` method periodically. If it returns null (with several times retry), the Scheduler will use `GetNextInterval` to increase the interval. By defining `GetNextInterval` you can define how the interval increases. If it returns an operator, the Scheduler will dispatch these operators as the response of the next heartbeat of the related region.
|
||||
Every scheduler should have implemented the Scheduler interface, which you can find in `/scheduler/server/schedule/scheduler.go`. The Scheduler will use the return value of `GetMinInterval` as the default interval to run the `Schedule` method periodically. If it returns null (with several times retry), the Scheduler will use `GetNextInterval` to increase the interval. By defining `GetNextInterval` you can define how the interval increases. If it returns an operator, the Scheduler will dispatch these operators as the response of the next heartbeat of the related region.
|
||||
|
||||
The core part of the Scheduler interface is `Schedule` method. The return value of this method is `Operator`, which contains multiple steps such as `AddPeer` and `RemovePeer`. For example, `MovePeer` may contain `AddPeer`, `transferLeader` and `RemovePeer` which you have implemented in former part. Take the first RaftGroup in the diagram below as an example. The scheduler tries to move peer from the third store to the forth. First, it should `AddPeer` for the forth store. Then it checks whether the third is a leader, and find that no, it isn’t, so there is no need to `transferLeader`. Then it removes the peer in the third store.
|
||||
The core part of the Scheduler interface is `Schedule` method. The return value of this method is `Operator`, which contains multiple steps such as `AddPeer` and `RemovePeer`. For example, `MovePeer` may contain `AddPeer`, `transferLeader` and `RemovePeer` which you have implemented in former part. Take the first RaftGroup in the diagram below as an example. The scheduler tries to move peers from the third store to the fourth. First, it should `AddPeer` for the fourth store. Then it checks whether the third is a leader, and find that no, it isn’t, so there is no need to `transferLeader`. Then it removes the peer in the third store.
|
||||
|
||||
You can use the `CreateMovePeerOperator` function in `scheduler/server/schedule/operator` package to create a `MovePeer` operator.
|
||||
|
||||
@ -168,7 +168,7 @@ You can use the `CreateMovePeerOperator` function in `scheduler/server/schedule/
|
||||
|
||||
In this part, the only function you need to implement is the `Schedule` method in `scheduler/server/schedulers/balance_region.go`. This scheduler avoids too many regions in one store. First, the Scheduler will select all suitable stores. Then sort them according to their region size. Then the Scheduler tries to find regions to move from the store with the biggest region size.
|
||||
|
||||
The scheduler will try to find the region most suitable for moving in the store. First it will try to select a pending region because pending may mean the disk is overloaded. If there isn’t a pending region, it will try to find a follower region. If it still cannot pick out one region, it will try to pick leader regions. Finally it will select out the region to move, or the Scheduler will try the next store which has smaller region size until all stores will have been tried.
|
||||
The scheduler will try to find the region most suitable for moving in the store. First, it will try to select a pending region because pending may mean the disk is overloaded. If there isn’t a pending region, it will try to find a follower region. If it still cannot pick out one region, it will try to pick leader regions. Finally, it will select out the region to move, or the Scheduler will try the next store which has a smaller region size until all stores will have been tried.
|
||||
|
||||
After you pick up one region to move, the Scheduler will select a store as the target. Actually, the Scheduler will select the store with the smallest region size. Then the Scheduler will judge whether this movement is valuable, by checking the difference between region sizes of the original store and the target store. If the difference is big enough, the Scheduler should allocate a new peer on the target store and create a move peer operator.
|
||||
|
||||
@ -184,4 +184,4 @@ The Scheduler framework provides three methods to get regions. `GetPendingRegion
|
||||
|
||||
* How to judge whether this operation is valuable?
|
||||
|
||||
If the difference between the original and target stores’ region sizes are too small, after we move the region from original store to target store, the Scheduler may want to move back again next time. So we have to make sure that the difference has to be bigger than two times of the approximate size of the region, which ensures that after moving, the target store’s region size is still smaller than the original store.
|
||||
If the difference between the original and target stores’ region sizes is too small, after we move the region from the original store to the target store, the Scheduler may want to move back again next time. So we have to make sure that the difference has to be bigger than two times the approximate size of the region, which ensures that after moving, the target store’s region size is still smaller than the original store.
|
||||
|
@ -12,9 +12,9 @@ You'll implement MVCC in part A. In parts B and C you’ll implement the transac
|
||||
|
||||
## Transactions in TinyKV
|
||||
|
||||
TinyKV's transaction design follows [Percolator](https://storage.googleapis.com/pub-tools-public-publication-data/pdf/36726.pdf); it is a two phase commit protocol (2PC).
|
||||
TinyKV's transaction design follows [Percolator](https://storage.googleapis.com/pub-tools-public-publication-data/pdf/36726.pdf); it is a two-phase commit protocol (2PC).
|
||||
|
||||
A transaction is a list of reads and writes. A transaction has a start timestamp and, when a transaction is committed, it has a commit timestamp (which must be greater than the starting timestamp). The whole transaction reads from the version of a key that is valid at the start timestamp. After committing, all writes appear to have been written at the commit timestamp. Any key to be written must not be written by any other transaction between the start and commit timestamps, otherwise the whole transaction is cancelled (this is called a write conflict).
|
||||
A transaction is a list of reads and writes. A transaction has a start timestamp and, when a transaction is committed, it has a commit timestamp (which must be greater than the starting timestamp). The whole transaction reads from the version of a key that is valid at the start timestamp. After committing, all writes appear to have been written at the commit timestamp. Any key to be written must not be written by any other transaction between the start and commit timestamps, otherwise, the whole transaction is canceled (this is called a write conflict).
|
||||
|
||||
The protocol starts with the client getting a start timestamp from TinyScheduler. It then builds the transaction locally, reading from the database (using a `KvGet` or `KvScan` request which includes the start timestamp, in contrast to `RawGet` or `RawScan` requests), but only recording writes locally in memory. Once the transaction is built, the client will select one key as the *primary key* (note that this has nothing to do with an SQL primary key). The client sends `KvPrewrite` messages to TinyKV. A `KvPrewrite` message contains all the writes in the transaction. A TinyKV server will attempt to lock all keys required by the transaction. If locking any key fails, then TinyKV responds to the client that the transaction has failed. The client can retry the transaction later (i.e., with a different start timestamp). If all keys are locked, the prewrite succeeds. Each lock stores the primary key of the transaction and a time to live (TTL).
|
||||
|
||||
@ -24,7 +24,7 @@ If any prewrite fails, then the transaction is rolled back by the client by send
|
||||
|
||||
In TinyKV, TTL checking is not performed spontaneously. To initiate a timeout check, the client sends the current time to TinyKV in a `KvCheckTxnStatus` request. The request identifies the transaction by its primary key and start timestamp. The lock may be missing or already be committed; if not, TinyKV compares the TTL on the lock with the timestamp in the `KvCheckTxnStatus` request. If the lock has timed out, then TinyKV rolls back the lock. In any case, TinyKV responds with the status of the lock so that the client can take action by sending a `KvResolveLock` request. The client typically checks transaction status when it fails to prewrite a transaction due to another transaction's lock.
|
||||
|
||||
If the primary key commit succeeds, then the client will commit all other keys in the other regions. These requests should always succeed because by responding positively to a prewrite request, the server is promising that if it gets a commit request for that transaction then it will succeed. Once the client has all its prewrite responses, the only way for the transaction to fail is if it times out, and in that case committing the primary key should fail. Once the primary key is committed, then the other keys can no longer time out.
|
||||
If the primary key commit succeeds, then the client will commit all other keys in the other regions. These requests should always succeed because by responding positively to a prewrite request, the server is promising that if it gets a commit request for that transaction then it will succeed. Once the client has all its prewrite responses, the only way for the transaction to fail is if it times out, and in that case, committing the primary key should fail. Once the primary key is committed, then the other keys can no longer time out.
|
||||
|
||||
If the primary commit fails, then the client will rollback the transaction by sending `KvBatchRollback` requests.
|
||||
|
||||
@ -36,11 +36,11 @@ Implementing MVCC means representing the transactional API using a simple key/va
|
||||
|
||||
TinyKV uses three column families (CFs): `default` to hold user values, `lock` to store locks, and `write` to record changes. The `lock` CF is accessed using the user key; it stores a serialized `Lock` data structure (defined in [lock.go](/kv/transaction/mvcc/lock.go)). The `default` CF is accessed using the user key and the *start* timestamp of the transaction in which it was written; it stores the user value only. The `write` CF is accessed using the user key and the *commit* timestamp of the transaction in which it was written; it stores a `Write` data structure (defined in [write.go](/kv/transaction/mvcc/write.go)).
|
||||
|
||||
A user key and timestamp are combined into an *encoded key*. Keys are encoded in such a way that an ascending order of encoded keys orders first by user key (ascending), then by timestamp (descending). This ensures that iterating over encoded keys will give the most recent version first. Helper functions for encoding and decoding keys are defined in [transaction.go](/kv/transaction/mvcc/transaction.go).
|
||||
A user key and timestamp are combined into an *encoded key*. Keys are encoded in such a way that the ascending order of encoded keys orders first by user key (ascending), then by timestamp (descending). This ensures that iterating over encoded keys will give the most recent version first. Helper functions for encoding and decoding keys are defined in [transaction.go](/kv/transaction/mvcc/transaction.go).
|
||||
|
||||
This exercise requires implementing a single struct called `MvccTxn`. In parts B and C, you'll use the `MvccTxn` API to implement the transactional API. `MvccTxn` provides read and write operations based on the user key and logical representations of locks, writes, and values. Modifications are collected in `MvccTxn` and once all modifications for a command are collected, they will be written all at once to the underlying database. This ensures that commands succeed or fail atomically. Note that an MVCC transaction is not the same as a TinySQL transaction. An MVCC transaction contains the modifications for a single command, not a sequence of commands.
|
||||
|
||||
`MvccTxn` is defined in [transaction.go](/kv/transaction/mvcc/transaction.go). There is a stub implementation, and some helper functions for encoding and decoding keys. Tests are in [transaction_test.go](/kv/transaction/mvcc/transaction_test.go). For this exercise, you should implement each of the `MvccTxn` methods so that all tests pass. Each method is documented with its intended behaviour.
|
||||
`MvccTxn` is defined in [transaction.go](/kv/transaction/mvcc/transaction.go). There is a stub implementation, and some helper functions for encoding and decoding keys. Tests are in [transaction_test.go](/kv/transaction/mvcc/transaction_test.go). For this exercise, you should implement each of the `MvccTxn` methods so that all tests pass. Each method is documented with its intended behavior.
|
||||
|
||||
> Hints:
|
||||
>
|
||||
@ -59,7 +59,7 @@ In this part, you will use `MvccTxn` from part A to implement handling of `KvGet
|
||||
|
||||
You'll need to implement the `KvGet`, `KvPrewrite`, and `KvCommit` methods defined in [server.go](/kv/server/server.go). Each method takes a request object and returns a response object, you can see the contents of these objects by looking at the protocol definitions in [kvrpcpb.proto](/proto/kvrpcpb.proto) (you shouldn't need to change the protocol definitions).
|
||||
|
||||
TinyKV can handle multiple requests concurrently, so there is the possibility of local race conditions. For example, TinyKV might receive two requests from different clients at the same time, one of which commits a key and the other rolls back the same key. To avoid race conditions, you can *latch* any key in the database. This latch works much like a per-key mutex. One latch covers all CFs. [latches.go](/kv/transaction/latches/latches.go) defines a `Latches` object which provides API for this.
|
||||
TinyKV can handle multiple requests concurrently, so there is the possibility of local race conditions. For example, TinyKV might receive two requests from different clients at the same time, one of which commits a key, and the other rolls back the same key. To avoid race conditions, you can *latch* any key in the database. This latch works much like a per-key mutex. One latch covers all CFs. [latches.go](/kv/transaction/latches/latches.go) defines a `Latches` object which provides API for this.
|
||||
|
||||
> Hints:
|
||||
>
|
||||
@ -74,9 +74,9 @@ In this part, you will implement `KvScan`, `KvCheckTxnStatus`, `KvBatchRollback`
|
||||
|
||||
`KvCheckTxnStatus`, `KvBatchRollback`, and `KvResolveLock` are used by a client when it encounters some kind of conflict when trying to write a transaction. Each one involves changing the state of existing locks.
|
||||
|
||||
`KvCheckTxnStatus` checks for timeouts, removes expired locks, and returns the status of the lock.
|
||||
`KvCheckTxnStatus` checks for timeouts, removes expired locks and returns the status of the lock.
|
||||
|
||||
`KvBatchRollback` checks that a key is locked by the current transaction, and if so removes the lock, deletes any value, and leaves a rollback indicator as a write.
|
||||
`KvBatchRollback` checks that a key is locked by the current transaction, and if so removes the lock, deletes any value and leaves a rollback indicator as a write.
|
||||
|
||||
`KvResolveLock` inspects a batch of locked keys and either rolls them all back or commits them all.
|
||||
|
||||
@ -85,4 +85,4 @@ In this part, you will implement `KvScan`, `KvCheckTxnStatus`, `KvBatchRollback`
|
||||
> - For scanning, you might find it helpful to implement your own scanner (iterator) abstraction which iterates over logical values, rather than the raw values in underlying storage. `kv/transaction/mvcc/scanner.go` is a framework for you.
|
||||
> - When scanning, some errors can be recorded for an individual key and should not cause the whole scan to stop. For other commands, any single key causing an error should cause the whole operation to stop.
|
||||
> - Since `KvResolveLock` either commits or rolls back its keys, you should be able to share code with the `KvBatchRollback` and `KvCommit` implementations.
|
||||
> - A timestamp consists of a physical and a logical component. The physical part is roughly a monotonic version of wall-clock time. Usually we use the whole timestamp, for example when comparing timestamps for equality. However, when calculating timeouts, we must only use the physical part of the timestamp. To do this you may find the `PhysicalTime` function in [transaction.go](/kv/transaction/mvcc/transaction.go) useful.
|
||||
> - A timestamp consists of a physical and a logical component. The physical part is roughly a monotonic version of wall-clock time. Usually, we use the whole timestamp, for example when comparing timestamps for equality. However, when calculating timeouts, we must only use the physical part of the timestamp. To do this you may find the `PhysicalTime` function in [transaction.go](/kv/transaction/mvcc/transaction.go) useful.
|
||||
|
@ -42,7 +42,7 @@ type PeerStorage struct {
|
||||
snapState snap.SnapState
|
||||
// regionSched used to schedule task to region worker
|
||||
regionSched chan<- worker.Task
|
||||
// gennerate snapshot tried count
|
||||
// generate snapshot tried count
|
||||
snapTriedCnt int
|
||||
// Engine include two badger instance: Raft and Kv
|
||||
Engines *engine_util.Engines
|
||||
|
Loading…
Reference in New Issue
Block a user