mirror of
https://github.com/talent-plan/tinykv.git
synced 2025-01-13 22:00:07 +08:00
upload part of course material
Signed-off-by: Connor1996 <zbk602423539@gmail.com>
This commit is contained in:
parent
2dad6f81cb
commit
b8374cb7b6
76  README.md

@@ -1,25 +1,70 @@
# The TinyKV Course

This course is a series of labs on a key-value storage system built with the Raft consensus algorithm. These labs are inspired by the famous [MIT 6.824](http://nil.csail.mit.edu/6.824/2018/index.html) course, but aim to be closer to industry implementations. The whole course is pruned from [TiKV](https://github.com/tikv/tikv) and re-written in Go. After completing this course, you will have the knowledge to implement a basic key-value storage service with distributed transactions and fault tolerance, and a better understanding of the TiKV implementation.

This is a series of projects on a key-value storage system built with the Raft consensus algorithm. These projects are inspired by the famous [MIT 6.824](http://nil.csail.mit.edu/6.824/2018/index.html) course, but aim to be closer to industry implementations. The whole course is pruned from [TiKV](https://github.com/tikv/tikv) and re-written in Go. After completing this course, you will have the knowledge to implement a basic fault-tolerant key-value storage service with distributed transaction support, and a better understanding of the TiKV implementation.
The whole project is skeleton code for a kv server and a scheduler server initially, and you need to finish the core logic step by step:

- LAB1: build a standalone key-value server
- LAB2: build a fault-tolerant key-value server with Raft
- LAB3: support multiple Raft groups and balance scheduling on top of LAB2
- LAB4: support distributed transactions on top of LAB3

- Project1: build a standalone key-value server
- Project2: build a fault-tolerant key-value server with Raft
- Project3: support multiple Raft groups and balance scheduling on top of Project2
- Project4: support distributed transactions on top of Project3
**Important note: This course is still in development, and the documentation is incomplete.** Any feedback and contributions are greatly appreciated. Please see the help wanted issues if you want to join the development.

## Build

## Course

Before starting, you should read the overview designs of TiKV and PD to get a general impression of what you will build:

- TiKV
  - <https://pingcap.com/blog-cn/tidb-internal-1/> (Chinese Version)
  - <https://pingcap.com/blog/2017-07-11-tidbinternal1/> (English Version)
- PD
  - <https://pingcap.com/blog-cn/tidb-internal-3/> (Chinese Version)
  - <https://pingcap.com/blog/2017-07-20-tidbinternal3/> (English Version)
### Getting started

First, please clone the repository with git to get the source code of the project.

``` bash
git clone https://github.com/pingcap-incubator/tinykv.git
```

Then make sure you have installed the [Go](https://golang.org/doc/install) toolchain (>= 1.13). You should also have `make` installed.
Now you can run `make` to check that everything is working as expected. You should see it run successfully.
### Overview of the code

![overview](courses/imgs/overview.png)

Like the TiDB + TiKV + PD architecture, which separates storage from computation, TinyKV focuses only on the storage layer of a distributed database system. If you are also interested in the SQL layer, see [TinySQL](https://github.com/pingcap-incubator/tinysql). Besides that, there is a component called TinyScheduler that acts as the central controller of the whole TinyKV cluster; it collects information from the heartbeats of TinyKV. The TinyScheduler can then generate scheduling tasks and distribute them to the TinyKV instances. All of them communicate via RPC.

The whole project is organized into the following directories:

- `kv`: implementation of the TinyKV key/value store.
- `proto`: all communication between nodes and processes uses Protocol Buffers over gRPC. This package contains the protocol definitions used by TinyKV, and the generated Go code for using them.
- `raft`: implementation of the Raft distributed consensus algorithm, used in TinyKV.
- `scheduler`: implementation of the TinyScheduler, which is responsible for managing TinyKV nodes and for generating timestamps.
### Course material

- [Project1 - StandaloneKV](courses/project1-StandaloneKV.md)
- [Project2 - RaftKV](courses/project2-RaftKV.md)
- [Project3 - MultiRaftKV](courses/project3-MultiRaftKV.md)
- [Project4 - Transaction](courses/project4-Transaction.md)
## Deploy a cluster

Beyond taking the course, you can also try TinyKV by deploying a real cluster and interacting with it through TinySQL.

### Build

```
make
```

## Test

```
make test
```

It builds the binaries of `tinykv-server` and `tinyscheduler-server` into the `bin` dir.
## Run (not runnable now)

### Run

Put the binaries of `tinyscheduler-server`, `tinykv-server` and `tidb-server` into a single dir.

@@ -41,9 +86,8 @@ mkdir -p data

./tinysql-server --store=tikv --path="127.0.0.1:2379"
```
## Documentation

### Play

[Course material](https://docs.google.com/document/d/1Z3OrUOq28a44UaipQiagwfM_py6gS9nLz-znk6jtMpU/edit#) is in a Google doc for now. It will be moved to GitHub once the polish is finished.

This repo contains a single module: tinykv. Each package is documented either in a doc.go file or, if it is a single file package, in the single file.

```
mysql -u root -h 127.0.0.1 -P 4000
```
BIN  courses/imgs/keyspace.png   Normal file  (186 KiB)
BIN  courses/imgs/multiraft.png  Normal file  (175 KiB)
BIN  courses/imgs/overview.png   Normal file  (103 KiB)
0   courses/project1-StandaloneKV.md  Normal file
30  courses/project2-RaftKV.md        Normal file

@@ -0,0 +1,30 @@
## Part C

As things stand now with your code, it's not practical for a long-running server to remember the complete Raft log forever. Instead, the server will check the number of Raft log entries and discard log entries exceeding a threshold from time to time.

In this part, you will implement snapshot handling based on the above two parts. Generally, a snapshot is just a raft message like AppendEntries, used to replicate data to followers. What makes it different is its size: a snapshot contains the whole state machine data at some point in time, and building and sending such a big message at once would consume a lot of resources and time, which might block the handling of other raft messages. To amortize this problem, the snapshot message uses an independent connection and splits the data into chunks for transport. That's the reason why there is a snapshot RPC API for the TinyKV service. If you are interested in the details of sending and receiving, check `snapRunner` and the reference <https://pingcap.com/blog-cn/tikv-source-code-reading-10/>.
### The Code

All the changes you need are based on the code written in Part A and Part B.
### Implement in Raft

Although we need some different handling for snapshot messages, from the perspective of the raft algorithm there should be no difference. See the definition of `eraftpb.Snapshot` in the proto file; the `data` field of `eraftpb.Snapshot` does not represent the actual state machine data but some metadata used by the upper application, and you can ignore it for now. When the leader needs to send a snapshot message to a follower, it can call `Storage.Snapshot()` to get an `eraftpb.Snapshot`, then send the snapshot message like other raft messages. How the state machine data is actually built and sent is implemented by the raftstore; it will be introduced in the next step. You can assume that once `Storage.Snapshot()` returns successfully, it's safe for the Raft leader to send the snapshot message to the follower, and the follower should call `handleSnapshot` to handle it, which namely just restores the raft internal state such as the term, commit index and membership information, etc., from the `eraftpb.SnapshotMetadata` in the message. After that, the procedure of snapshot handling is finished.
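To make that last step concrete, here is a minimal, hedged sketch of how a follower might restore its in-memory state from the snapshot metadata. It assumes the skeleton's field names (`RaftLog.committed/applied/stabled`, `pendingSnapshot`, `Prs`) and the `pb` alias for `eraftpb` used in `raft/raft.go`; your local code may differ, so treat it as an outline rather than a drop-in implementation.

``` go
// Sketch only: restore in-memory Raft state from eraftpb.SnapshotMetadata.
// Field names below follow the course skeleton and may need adjusting.
func (r *Raft) handleSnapshot(m pb.Message) {
	meta := m.Snapshot.Metadata
	if meta.Index <= r.RaftLog.committed {
		// Stale snapshot: we already have everything it contains.
		return
	}
	// Jump the log forward to the snapshot point and remember the snapshot so
	// the upper application can apply the actual state machine data later.
	r.becomeFollower(m.Term, m.From)
	r.RaftLog.committed = meta.Index
	r.RaftLog.applied = meta.Index
	r.RaftLog.stabled = meta.Index
	r.RaftLog.entries = nil
	r.RaftLog.pendingSnapshot = m.Snapshot

	// Rebuild the membership from the snapshot's ConfState.
	r.Prs = make(map[uint64]*Progress)
	for _, id := range meta.ConfState.Nodes {
		r.Prs[id] = &Progress{}
	}
}
```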
### Implement in raftstore

In this step, you need to learn about two more raftstore workers: the raftlog-gc worker and the region worker.
Raftstore checks whether it needs to gc log from time to time based on the config `RaftLogGcCountLimit`, see `onRaftGcLogTick()`. If yes, it will propose a raft admin command `CompactLogRequest`, which is wrapped in `RaftCmdRequest` just like the four basic command types (Get/Put/Delete/Snap) implemented in project2 part B. Then you need to process this admin command when it's committed by Raft. But unlike the Get/Put/Delete/Snap commands, which write or read state machine data, `CompactLogRequest` modifies metadata, namely it updates the `RaftTruncatedState` which is in the `RaftApplyState`. After that, you should schedule a task to the raftlog-gc worker by `ScheduleCompactLog`; the raftlog-gc worker will do the actual log deletion work asynchronously, as in the sketch below.
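A hedged sketch of what applying a committed `CompactLogRequest` might look like. The proto types (`raft_cmdpb.CompactLogRequest`, `rspb.RaftApplyState`) are from the generated packages, while the shape of the handler and the `ScheduleCompactLog` signature are assumptions to check against your checkout.

``` go
// Sketch only: apply a committed CompactLogRequest in the peer's apply loop.
// It only touches metadata; the raftlog-gc worker deletes entries later.
func (d *peerMsgHandler) applyCompactLog(req *raft_cmdpb.CompactLogRequest, applyState *rspb.RaftApplyState) {
	compactIndex, compactTerm := req.CompactIndex, req.CompactTerm
	if compactIndex <= applyState.TruncatedState.Index {
		return // already truncated at least this far
	}
	// Update the truncated state; persist it together with the apply state
	// before the gc task runs.
	applyState.TruncatedState.Index = compactIndex
	applyState.TruncatedState.Term = compactTerm

	// Ask the raftlog-gc worker to delete stale entries asynchronously.
	// (The real helper may take additional arguments such as the first index.)
	d.ScheduleCompactLog(compactIndex)
}
```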
Then, due to log compaction, the Raft module may need to send a snapshot. `PeerStorage` implements `Storage.Snapshot()`. TinyKV generates snapshots and applies snapshots in the region worker. When calling `Snapshot()`, it actually sends a task `RegionTaskGen` to the region worker. The message handler of the region worker is located in `kv/raftstore/runner/region_task.go`. It scans the underlying engines to generate a snapshot, and sends the snapshot metadata through a channel. The next time Raft calls `Snapshot`, it checks whether the snapshot generation is finished. If yes, Raft should send the snapshot message to other peers, and the snapshot sending and receiving work is handled by `kv/storage/raft_storage/snap_runner.go`. You don't need to dive into the details; you only need to know that the snapshot message will be handled by `onRaftMsg` after the snapshot is received.
The snapshot will then be reflected in the next Raft ready, so your task is to modify the Raft ready process to handle the case of a snapshot. When you are sure to apply the snapshot, you can update the peer storage's memory state like `RaftLocalState`, `RaftApplyState` and `RegionLocalState`. Also, don't forget to persist these states to kvdb and raftdb and remove stale state from kvdb and raftdb. Besides, you also need to update `PeerStorage.snapState` to `snap.SnapState_Applying` and send the `runner.RegionTaskApply` task to the region worker through `PeerStorage.regionSched` and wait until the region worker finishes.
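For orientation, here is a minimal sketch of the apply side. The method name and the write-batch/persistence plumbing are assumptions (check `kv/raftstore/peer_storage.go` and `kv/raftstore/runner` for the real signatures), and the hints below still apply, in particular not applying a snapshot while committed entries are pending.

``` go
// Sketch only: apply a snapshot inside PeerStorage during ready handling.
// Persistence details and task fields are assumptions; verify them locally.
func (ps *PeerStorage) applySnapshotSketch(snapshot *eraftpb.Snapshot) error {
	meta := snapshot.Metadata

	// 1. Move the raft/apply state to the snapshot point.
	ps.raftState.LastIndex = meta.Index
	ps.raftState.LastTerm = meta.Term
	ps.applyState.AppliedIndex = meta.Index
	ps.applyState.TruncatedState.Index = meta.Index
	ps.applyState.TruncatedState.Term = meta.Term
	// (Persist raftState to raftdb and applyState/RegionLocalState to kvdb,
	//  and clear any stale entries and meta, before going further.)

	// 2. Hand the actual state-machine data to the region worker and wait.
	ps.snapState.StateType = snap.SnapState_Applying
	notifier := make(chan bool, 1)
	ps.regionSched <- &runner.RegionTaskApply{
		RegionId: ps.region.GetId(),
		Notifier: notifier,
		SnapMeta: meta,
		StartKey: ps.region.GetStartKey(),
		EndKey:   ps.region.GetEndKey(),
	}
	<-notifier // block until the region worker finishes applying
	return nil
}
```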
> Hints:
>
> - If there are committed entries still to be executed in the apply worker, do not apply the snapshot; see `ReadyToHandlePendingSnap`.
> - Do not handle the next Raft ready before finishing applying the snapshot.
105  courses/project3-MultiRaftKV.md  Normal file

@@ -0,0 +1,105 @@
# Project3 MultiRaftKV

In project2, you have built a highly available kv server based on Raft, good work! But it is not enough: such a kv server is backed by a single raft group, which does not scale without limit, and every write request waits until it is committed and then written to badger one by one. This is a key requirement to ensure consistency, but it also kills any concurrency.

![multiraft](imgs/multiraft.png)

In this project you will implement a multi-raft-based kv server with a balance scheduler, which consists of multiple raft groups; each raft group is responsible for a single key range, which is named a region here. The layout will look like the diagram above. Requests to a single region are handled just like before, yet multiple regions can handle requests concurrently, which improves performance but also brings some new challenges like balancing the requests to each region, etc.
This project has 3 parts, including:

1. Implement membership change and leadership change in the Raft algorithm
2. Implement conf change and region split on raftstore
3. Introduce scheduler
## Part A

In this part you will implement membership change and leadership change for the basic raft algorithm; these features are required by the next two parts. Membership change, namely conf change, is used to add or remove peers to the raft group, which can change the quorum of the raft group, so be careful. Leadership change, namely leader transfer, is used to transfer the leadership to another peer, which is very useful for balancing.
### The Code

The code you need to modify is all in `raft/raft.go` and `raft/rawnode.go`; also see `proto/proto/eraft.proto` for the new messages you need to handle. Both conf change and leader transfer are triggered by the upper application, so you may want to start at `raft/rawnode.go`.
### Implement leader transfer

To implement leader transfer, let's introduce two new message types: `MsgTransferLeader` and `MsgTimeoutNow`. To transfer leadership you need to first call `raft.Raft.Step` with a `MsgTransferLeader` message on the current leader. To ensure the success of the transfer, the current leader should first check the qualification of the transferee (namely the transfer target), for example: is the transferee's log up to date, etc. If the transferee is not qualified, the current leader can choose to abort the transfer or help the transferee; since aborting is not helping, let's choose to help the transferee. If the transferee's log is not up to date, the current leader should send a `MsgAppend` message to the transferee and stop accepting new proposals in case we end up cycling. So if the transferee is qualified (or after the current leader's help), the leader should send a `MsgTimeoutNow` message to the transferee immediately, and after receiving a `MsgTimeoutNow` message the transferee should start a new election immediately regardless of its election timeout. With a higher term and an up-to-date log, the transferee has a great chance to step down the current leader and become the new leader.
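A minimal sketch of how the leader side of this flow might look, assuming the skeleton's `Raft` fields (`Prs`, `leadTransferee`) and helper names (`sendAppend`; `sendTimeoutNow` is a hypothetical helper); adjust to your own code.

``` go
// Sketch only: leader handling of MsgTransferLeader. Field and helper names
// follow the course skeleton and may differ in your checkout.
func (r *Raft) handleTransferLeader(m pb.Message) {
	transferee := m.From // the transfer target
	if _, ok := r.Prs[transferee]; !ok {
		return // unknown peer, ignore
	}
	if transferee == r.id {
		return // already the leader
	}
	// Remember the pending transfer so new proposals can be rejected
	// until it finishes (or times out).
	r.leadTransferee = transferee

	if r.Prs[transferee].Match == r.RaftLog.LastIndex() {
		// Transferee's log is up to date: tell it to campaign right away.
		r.sendTimeoutNow(transferee) // hypothetical helper that sends MsgTimeoutNow
	} else {
		// Help it catch up first; send MsgTimeoutNow once it is caught up.
		r.sendAppend(transferee)
	}
}
```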
### Implement conf change

The conf change algorithm you will implement here is not the joint consensus algorithm mentioned in the extended Raft paper, which can add and/or remove arbitrary peers at once; instead it can only add or remove peers one by one, which is simpler and easier to reason about. Moreover, conf change starts with a call to the leader's `raft.RawNode.ProposeConfChange`, which will propose an entry with `pb.Entry.EntryType` set to `EntryConfChange` and `pb.Entry.Data` set to the input `pb.ConfChange`. When entries of type `EntryConfChange` are committed, you must apply them through `RawNode.ApplyConfChange` with the `pb.ConfChange` in the entry; only then can you add or remove a peer to this raft node through `raft.Raft.addNode` and `raft.Raft.removeNode` according to the `pb.ConfChange`.
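Here is a hedged sketch of that propose/apply round trip from the upper layer's point of view. The `RawNode` methods are the ones named above; the surrounding control flow is illustrative only, and note that on the apply side the entry's `Data` is unmarshalled back into a `pb.ConfChange`, the mirror image of the `Marshal` hint below.

``` go
// Sketch only: propose a conf change and apply it once committed.
func proposeAndApplyConfChange(rn *raft.RawNode, nodeID uint64) error {
	cc := pb.ConfChange{
		ChangeType: pb.ConfChangeType_AddNode,
		NodeId:     nodeID,
	}
	if err := rn.ProposeConfChange(cc); err != nil {
		return err
	}

	// Later, while draining the Ready's committed entries:
	rd := rn.Ready()
	for _, ent := range rd.CommittedEntries {
		if ent.EntryType != pb.EntryType_EntryConfChange {
			continue
		}
		var committed pb.ConfChange
		if err := committed.Unmarshal(ent.Data); err != nil {
			return err
		}
		// Only after the entry is committed do we change the membership.
		rn.ApplyConfChange(committed)
	}
	rn.Advance(rd)
	return nil
}
```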
> Hints:
>
> - The `MsgTransferLeader` message is a local message that does not come from the network
> - Set the `Message.from` of the `MsgTransferLeader` message to the transferee (namely the transfer target)
> - To start a new election immediately you can call `Raft.Step` with a `MsgHup` message
> - Call `pb.ConfChange.Marshal` to get the byte representation of `pb.ConfChange` and put it into `pb.Entry.Data`
## Part B

As the Raft module supports membership change and leadership change now, in this part you need to make TinyKV support these admin commands based on Part A. As you can see in `proto/proto/raft_cmdpb.proto`, there are four types of admin commands:

- CompactLog (already implemented in project2 part C)
- TransferLeader
- ChangePeer
- Split

`TransferLeader` and `ChangePeer` are the commands based on the Raft support of leadership change and membership change. They will be used as the basic operator steps for the balance scheduler. `Split` splits one Region into two Regions, which is the basis for multi-raft. You will implement them step by step.
### The Code

All the changes are based on the implementation of project2, so the code you need to modify is all in `kv/raftstore/peer_msg_handler.go` and `kv/raftstore/peer.go`.
### Propose transfer leader

This step is quite simple. As a raft command, `TransferLeader` will be proposed as a Raft entry. But `TransferLeader` is actually an action that does not need to be replicated to the other peers, so you just need to call the `TransferLeader()` method of `RawNode` instead of `Propose()` for the `TransferLeader` command.
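A short, hedged sketch of that handler. `RaftGroup.TransferLeader` is the `RawNode` method referred to above; the response construction is illustrative, and in practice you would reuse the response helpers already in `kv/raftstore/peer_msg_handler.go`.

``` go
// Sketch only: handle a TransferLeader admin command without proposing it.
func (d *peerMsgHandler) proposeTransferLeader(msg *raft_cmdpb.RaftCmdRequest, cb *message.Callback) {
	transferee := msg.AdminRequest.TransferLeader.Peer

	// No log replication needed: just ask the Raft module to transfer.
	d.RaftGroup.TransferLeader(transferee.Id)

	// Reply immediately; the transfer itself happens asynchronously.
	resp := &raft_cmdpb.RaftCmdResponse{Header: &raft_cmdpb.RaftResponseHeader{}}
	resp.AdminResponse = &raft_cmdpb.AdminResponse{
		CmdType:        raft_cmdpb.AdminCmdType_TransferLeader,
		TransferLeader: &raft_cmdpb.TransferLeaderResponse{},
	}
	cb.Done(resp)
}
```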
### Implement conf change in raftstore

The conf change has two different types, `AddNode` and `RemoveNode`. Just as the names imply, it adds a Peer to or removes a Peer from the Region. To implement conf change, you should learn the terminology of `RegionEpoch` first. `RegionEpoch` is part of the meta information of `metapb.Region`. When a Region adds or removes a Peer or splits, the Region's epoch changes. RegionEpoch's `conf_ver` increases during ConfChange while `version` increases during split. It is used to guarantee the latest region information when network isolation results in two leaders in one Region.
You need to make raftstore support handling conf change commands. The process would be as follows (a sketch is given after the list):

1. Propose the conf change admin command by `ProposeConfChange`
2. After the log is committed, change the `RegionLocalState`, including `RegionEpoch` and `Peers` in `Region`
3. Call `ApplyConfChange()` of `raft.RawNode`
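A hedged sketch of step 2 and step 3, applying a committed `ChangePeer` in the raftstore. `meta.WriteRegionState` and the way the new peer reaches this function are assumptions to verify against your code, and the remaining bookkeeping (router, `storeMeta`, destroying removed peers) is covered by the hints below.

``` go
// Sketch only: apply a committed ChangePeer conf change.
func (d *peerMsgHandler) applyConfChange(cc *eraftpb.ConfChange, newPeer *metapb.Peer, kvWB *engine_util.WriteBatch) {
	region := d.Region()

	switch cc.ChangeType {
	case eraftpb.ConfChangeType_AddNode:
		// Ignore duplicates: the test code may deliver the same conf change twice.
		exists := false
		for _, p := range region.Peers {
			if p.Id == newPeer.Id {
				exists = true
				break
			}
		}
		if !exists {
			region.Peers = append(region.Peers, newPeer)
			region.RegionEpoch.ConfVer++
		}
	case eraftpb.ConfChangeType_RemoveNode:
		for i, p := range region.Peers {
			if p.Id == cc.NodeId {
				region.Peers = append(region.Peers[:i], region.Peers[i+1:]...)
				region.RegionEpoch.ConfVer++
				break
			}
		}
	}

	// Persist the new region metadata (RegionLocalState) in the same write batch.
	meta.WriteRegionState(kvWB, region, rspb.PeerState_Normal)

	// Finally tell the Raft module about the membership change.
	d.RaftGroup.ApplyConfChange(*cc)
}
```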
> Hints:
>
> - For executing `AddNode`, the newly added Peer will be created by the heartbeat from the leader; check `maybeCreatePeer()` of `storeWorker`. At that time, this Peer is uninitialized and any information of its Region is unknown to us, so we use 0 to initialize its `Log Term` and `Index`. The leader will then know this Follower has no data (there exists a Log gap from 0 to 5) and it will directly send a snapshot to this Follower.
> - For executing `RemoveNode`, you should call `destroyPeer()` explicitly to stop the Raft module. The destroy logic is provided for you.
> - Do not forget to update the region state in `storeMeta` of `GlobalContext`
> - The test code schedules the command of one conf change multiple times until the conf change is applied, so you need to consider how to ignore duplicate commands of the same conf change.
### Implement split region in raftstore

![raft_group](imgs/keyspace.png)

To support multi-raft, the system performs data sharding and makes each Raft group store just a portion of the data. Hash and Range are commonly used for data sharding. TinyKV uses Range, and the main reason is that Range can better aggregate keys with the same prefix, which is convenient for operations like scan. Besides, Range outperforms Hash at splitting: usually, a split only involves metadata modification and there is no need to move data around.
``` protobuf
message Region {
    uint64 id = 1;
    // Region key range [start_key, end_key).
    bytes start_key = 2;
    bytes end_key = 3;
    RegionEpoch region_epoch = 4;
    repeated Peer peers = 5;
}
```
Let's take another look at the Region definition; it includes two fields, `start_key` and `end_key`, to indicate the range of data which the Region is responsible for. So split is the key step to support multi-raft. At the beginning, there is only one Region with range [“”, “”). You can regard the key space as a loop, so [“”, “”) stands for the whole space. As data is written, the split checker checks the region size every `cfg.SplitRegionCheckTickInterval`, and generates a split key if possible to cut the Region into two parts; you can check the logic in `kv/raftstore/runner/split_check.go`. The split key will be wrapped as a `MsgSplitRegion` handled by `onPrepareSplitRegion()`.
To make sure the ids of the newly created Region and Peers are unique, the ids are allocated by the scheduler. This is also provided, so you don't have to implement it. `onPrepareSplitRegion()` actually schedules a task for the pd worker to ask the scheduler for the ids, and makes a split admin command after receiving the response from the scheduler; see `onAskBatchSplit()` in `kv/raftstore/runner/pd_task.go`.
So your task is to implement the process of handling the split admin command, just like conf change does; a sketch follows below. The provided framework supports multiple raft groups, see `kv/raftstore/router.go`. When a Region splits into two Regions, one of the Regions will inherit the metadata before splitting and just modify its Range and RegionEpoch, while the other will create the relevant meta information.
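A hedged outline of applying a committed `Split` command. The epoch handling, the persistence helper (`meta.WriteRegionState`) and the final registration steps are assumptions drawn from the text and the hints below; take the exact signatures from your code.

``` go
// Sketch only: apply a committed Split admin command.
func (d *peerMsgHandler) applySplit(split *raft_cmdpb.SplitRequest, kvWB *engine_util.WriteBatch) {
	oldRegion := d.Region()

	// Build the new region: it takes [splitKey, oldEndKey) and the new ids.
	newRegion := &metapb.Region{
		Id:       split.NewRegionId,
		StartKey: split.SplitKey,
		EndKey:   oldRegion.EndKey,
		RegionEpoch: &metapb.RegionEpoch{
			ConfVer: oldRegion.RegionEpoch.ConfVer,
			Version: oldRegion.RegionEpoch.Version + 1,
		},
	}
	for i, peer := range oldRegion.Peers {
		newRegion.Peers = append(newRegion.Peers, &metapb.Peer{
			Id:      split.NewPeerIds[i],
			StoreId: peer.StoreId,
		})
	}

	// The old region keeps [oldStartKey, splitKey) and bumps its version.
	oldRegion.EndKey = split.SplitKey
	oldRegion.RegionEpoch.Version++

	// Persist both regions' RegionLocalState in the same write batch.
	meta.WriteRegionState(kvWB, oldRegion, rspb.PeerState_Normal)
	meta.WriteRegionState(kvWB, newRegion, rspb.PeerState_Normal)

	// Then create the new Region's peer via createPeer(), register it with the
	// router, and insert both regions into regionRanges in storeMeta
	// (see the hints below for these steps).
}
```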
> Hints:
>
> - The corresponding Peer of this newly-created Region should be created by `createPeer()` and registered to the router.regions. And the region's info should be inserted into `regionRanges` in ctx.StoreMeta.
> - For the case of a region split with network isolation, the snapshot to be applied may overlap with the existing region's range. The check logic is in `checkSnapshot()` in `kv/raftstore/peer_msg_handler.go`. Please keep it in mind when implementing and take care of that case.
> - Use `engine_util.ExceedEndKey()` to compare with the region's end key, because when the end key equals “”, any key will be equal to or greater than “”.
> - There are more errors that need to be considered: `ErrRegionNotFound`, `ErrKeyNotInRegion`, `ErrEpochNotMatch`.
0  courses/project4-Transaction.md  Normal file