TranslateProject/sources/tech/20230201.1 ⭐️⭐️⭐️ A guide to fuzzy queries with Apache ShardingSphere.md

424 lines
20 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

[#]: subject: "A guide to fuzzy queries with Apache ShardingSphere"
[#]: via: "https://opensource.com/article/23/2/fuzzy-query-apache-shardingsphere"
[#]: author: "Xiong Gaoxiang https://opensource.com/users/xionggaoxiang"
[#]: collector: "lkxed"
[#]: translator: " "
[#]: reviewer: " "
[#]: publisher: " "
[#]: url: " "
A guide to fuzzy queries with Apache ShardingSphere
======
[Apache ShardingSphere][1] is an open source distributed database and an ecosystem users and developers need for their databases to provide a customized and cloud-native experience. Its [latest release][2] contains many new features, including data encryption integrated with existing SQL workflows. Most importantly, it allows fuzzy queries of the encrypted data.
### The problem
By parsing a user's SQL input and rewriting the SQL according to the user's encryption rules, the original data is encrypted and stored with ciphertext data in the underlying database simultaneously.
When a user queries the data, it fetches the ciphertext data from the database, decrypts it, and returns the decrypted original data to the user. However, because the encryption algorithm encrypts the whole string, users cannot run fuzzy queries.
Nevertheless, many businesses need fuzzy queries after the data is encrypted. In version 5.3.0, Apache ShardingSphere provides users with a default fuzzy query algorithm that supports encrypted fields. The algorithm also supports hot plugging, which users can customize. The fuzzy query can be achieved through configuration.
### How to achieve fuzzy query in encrypted scenarios
#### Load data to the in-memory database (IMDB)
First, load all the data into the IMDB to decrypt it. Then, it'll be like querying the original data. This method can achieve fuzzy queries. If the amount of data is small, this method will prove simple and cost-effective. However, if the quantity of data is large, it'll be a disaster.
#### Implement encryption and decryption functions consistent with database programs
The second method is to modify fuzzy query conditions and use the database decryption function to decrypt data first and then implement fuzzy query. This method's advantage is the low cost of implementation, development, and use.
Users only need to modify the previous fuzzy query conditions slightly. However, the ciphertext and encryption functions are stored together in the database, which cannot cope with the problem of account data leaks.
Native SQL:
```
select * from user where name like "%xxx%"
```
After implementing the decryption function:
```
ѕеlесt * frоm uѕеr whеrе dесоdе(namе) lіkе "%ххх%"
```
#### Store after data masking
Implement data masking on ciphertext and then store it in a fuzzy query column. This method could lack precision.
For example, mobile number **13012345678** becomes **130****5678** after the masking algorithm is performed.
#### Perform encrypted storage after tokenization and combination
This method performs tokenization and combination on ciphertext data and then encrypts the resultset by grouping characters with fixed length and splitting a field into multiple ones. For example, we take four English characters and two Chinese characters as a query condition: **ningyu1** uses the four-character as a group to encrypt, so the first group is **ning**, the second group **ingy**, the third group **ngyu**, the fourth group **gyu1**, and so on. All the characters are encrypted and stored in the fuzzy query column. If you want to retrieve all data that contains four characters, such as **ingy**, encrypt the characters and use a key `like"%partial%"` to query.
Shortcomings:
- Increased storage costs: Free grouping will increase the amount of data, and the data length will increase after being encrypted.
- Limited length in fuzzy query: Due to security issues, the length of free grouping cannot be too short or the [rainbow table][3] will easily crack it. Like the example I mentioned above, the length of fuzzy query characters must be greater than or equal to four letters/digits or two Chinese characters.
#### Single-character digest algorithm (default fuzzy query algorithm provided in ShardingSphere version 5.3.0)
Although the above methods are all viable, it's only natural to wonder if there's a better alternative. In our community, we find that single-character encryption and storage can balance performance and query but fails to meet security requirements.
So what's the ideal solution? Inspired by masking algorithms and cryptographic hash functions, we find that data loss and one-way functions can be used.
The cryptographic hash function should have the following four features:
- It should be easy to calculate the hash value for any given message.
- It should be difficult to infer the original message from a known hash value.
- It should not be feasible to modify the message without changing the hash value.
- There should only be a very low chance that two different messages produce the same hash value.
Security: Because of the one-way function, it's impossible to infer the original message. To improve the accuracy of the fuzzy query, we want to encrypt a single character, but the rainbow table will crack it.
So we take a one-way function (to ensure every character is the same after encryption) and increase the frequency of collisions (to ensure every string is **1: N** backward), which greatly enhances security.
### Fuzzy query algorithm
Apache ShardingSphere implements a universal fuzzy query algorithm using the below single-character digest algorithm `org.apache.shardingsphere.encrypt.algorithm.like.CharDigestLikeEncryptAlgorithm`.
```
public final class CharDigestLikeEncryptAlgorithm implements LikeEncryptAlgorithm<Object, String> {
private static final String DELTA = "delta";
private static final String MASK = "mask";
private static final String START = "start";
private static final String DICT = "dict";
private static final int DEFAULT_DELTA = 1;
private static final int DEFAULT_MASK = 0b1111_0111_1101;
private static final int DEFAULT_START = 0x4e00;
private static final int MAX_NUMERIC_LETTER_CHAR = 255;
@Getter
private Properties props;
private int delta;
private int mask;
private int start;
private Map<Character, Integer> charIndexes;
@Override
public void init(final Properties props) {
this.props = props;
delta = createDelta(props);
mask = createMask(props);
start = createStart(props);
charIndexes = createCharIndexes(props);
}
private int createDelta(final Properties props) {
if (props.containsKey(DELTA)) {
String delta = props.getProperty(DELTA);
try {
return Integer.parseInt(delta);
} catch (NumberFormatException ex) {
throw new EncryptAlgorithmInitializationException("CHAR_DIGEST_LIKE", "delta can only be a decimal number");
}
}
return DEFAULT_DELTA;
}
private int createMask(final Properties props) {
if (props.containsKey(MASK)) {
String mask = props.getProperty(MASK);
try {
return Integer.parseInt(mask);
} catch (NumberFormatException ex) {
throw new EncryptAlgorithmInitializationException("CHAR_DIGEST_LIKE", "mask can only be a decimal number");
}
}
return DEFAULT_MASK;
}
private int createStart(final Properties props) {
if (props.containsKey(START)) {
String start = props.getProperty(START);
try {
return Integer.parseInt(start);
} catch (NumberFormatException ex) {
throw new EncryptAlgorithmInitializationException("CHAR_DIGEST_LIKE", "start can only be a decimal number");
}
}
return DEFAULT_START;
}
private Map<Character, Integer> createCharIndexes(final Properties props) {
String dictContent = props.containsKey(DICT) && !Strings.isNullOrEmpty(props.getProperty(DICT)) ? props.getProperty(DICT) : initDefaultDict();
Map<Character, Integer> result = new HashMap<>(dictContent.length(), 1);
for (int index = 0; index < dictContent.length(); index++) {
result.put(dictContent.charAt(index), index);
}
return result;
}
@SneakyThrows
private String initDefaultDict() {
InputStream inputStream = CharDigestLikeEncryptAlgorithm.class.getClassLoader().getResourceAsStream("algorithm/like/common_chinese_character.dict");
LineProcessor<String> lineProcessor = new LineProcessor<String>() {
private final StringBuilder builder = new StringBuilder();
@Override
public boolean processLine(final String line) {
if (line.startsWith("#") || 0 == line.length()) {
return true;
} else {
builder.append(line);
return false;
}
}
@Override
public String getResult() {
return builder.toString();
}
};
return CharStreams.readLines(new InputStreamReader(inputStream, Charsets.UTF_8), lineProcessor);
}
@Override
public String encrypt(final Object plainValue, final EncryptContext encryptContext) {
return null == plainValue ? null : digest(String.valueOf(plainValue));
}
private String digest(final String plainValue) {
StringBuilder result = new StringBuilder(plainValue.length());
for (char each : plainValue.toCharArray()) {
char maskedChar = getMaskedChar(each);
if ('%' == maskedChar) {
result.append(each);
} else {
result.append(maskedChar);
}
}
return result.toString();
}
private char getMaskedChar(final char originalChar) {
if ('%' == originalChar) {
return originalChar;
}
if (originalChar <= MAX_NUMERIC_LETTER_CHAR) {
return (char) ((originalChar + delta) & mask);
}
if (charIndexes.containsKey(originalChar)) {
return (char) (((charIndexes.get(originalChar) + delta) & mask) + start);
}
return (char) (((originalChar + delta) & mask) + start);
}
@Override
public String getType() {
return "CHAR_DIGEST_LIKE";
}
}
```
- Define the binary `mask` code to lose precision `0b1111_0111_1101` (mask).
- Save common Chinese characters with disrupted order like a `map` dictionary.
- Obtain a single string of `Unicode` for digits, English, and Latin.
- Obtain an `index` for a Chinese character belonging to a dictionary.
- Other characters fetch the `Unicode` of a single string.
- Add `1 (delta)`to the digits obtained by different types above to prevent any original text from appearing in the database.
- Then convert the offset `Unicode` into binary, perform the `AND` operation with `mask`, and carry out a two-bit digit loss.
- Directly output digits, English, and Latin after the loss of precision.
- The remaining characters are converted to decimal and output with the common character `start` code after the loss of precision.
### The fuzzy algorithm development progress
#### The first edition
Simply use `Unicode` and `mask` code of common characters to perform the `AND` operation.
```
Mask: 0b11111111111001111101
The original character: 0b1000101110101111讯
After encryption: 0b1000101000101101設
```
Assuming we know the key and encryption algorithm, the original string after a backward pass is:
```
1.0b1000101100101101 謭
2.0b1000101100101111 謯
3.0b1000101110101101 训
4.0b1000101110101111 讯
5.0b1000101010101101 読
6.0b1000101010101111 誯
7.0b1000101000101111 訯
8.0b1000101000101101 設
```
Based on the missing bits, we find that each string can be derived `2^n` Chinese characters backward. When the `Unicode` of common Chinese characters is decimal, their intervals are very large. Notice that the Chinese characters inferred backward are not common characters, and it's more likely to infer the original characters.
![Inference of Chinese characters][4]
#### The second edition
The interval of common Chinese characters in `Unicode` is irregular. We planned to leave the last few bits of Chinese characters in `Unicode` and convert them into decimal as an `index` to fetch some common Chinese characters. This way, when the algorithm is known, uncommon characters won't appear after a backward pass, and distractors are no longer easy to eliminate.
If we leave the last few bits of Chinese characters in `Unicode`, it has something to do with the relationship between the accuracy of fuzzy query and anti-decryption complexity. The higher the accuracy, the lower the decryption difficulty.
Let's take a look at the collision degree of common Chinese characters under our algorithm:
1. When `mask`=0b0011_1111_1111:
![Mask results][5]
2. When `mask`=0b0001_1111_1111:
![Mask results][6]
For the mantissa of Chinese characters, leave 10 and 9 digits. The 10-digit query is more accurate because its collision is much weaker. Nevertheless, if the algorithm and the key are known, the original text of the 1:1 character can be derived backward.
The nine-digit query is less accurate because nine-digit collisions are relatively stronger, but there are fewer 1:1 characters. Although we change the collisions regardless of whether we leave ten or nine digits, the distribution is unbalanced due to the irregular `Unicode` of Chinese characters. The overall collision probability cannot be controlled.
#### The third edition
In response to the unevenly distributed problem found in the second edition, we take common characters with disrupted order as the dictionary table.
1. The encrypted text first looks up the `index` in the out-of-order dictionary table. We use the `index` and subscript to replace the `Unicode` without rules. Use `Unicode` in case of uncommon characters. (Note: Evenly distribute the code to be calculated as far as possible.)
2. The next step is to perform the `AND` operation with a `mask` and lose two-bit precision to increase the frequency of collisions.
Let's take a look at the collision degree of common Chinese characters under our algorithm:
1. When `mask`=0b1111_1011_1101:
![Mask results][7]
2. When `mask`=0b0111_1011_1101:
![Mask results][8]
When the `mask` leaves 11 bits, you can see that the collision distribution is concentrated at 1:4. When the `mask` leaves ten bits, the number becomes 1:8. At this time, we only need to adjust the number of precision losses to control whether the collision is 1:2, 1:4 or 1:8.
If the `mask` is selected as 1, and the algorithm and key are known, there will be a 1:1 Chinese character because we calculate the collision degree of common characters at this time. If we add the missing four bits before the 16-bit binary of Chinese characters, the situation becomes `2^5=32` cases.
Since we encrypt the whole text, even if the individual character is inferred backward, there will be little impact on overall security and will not cause mass data leaks. At the same time, the premise of backward pass is to know the algorithm, key, `delta`, and dictionary, so it's impossible to achieve from the data in the database.
### How to use fuzzy query
Fuzzy query requires the configuration of `encryptors` (encryption algorithm configuration), `likeQueryColumn` (fuzzy query column name), and `likeQueryEncryptorName` (encryption algorithm name of fuzzy query column ) in the encryption configuration.
Please refer to the following configuration. Add your own sharding algorithm and data source.
```
dataSources:
ds_0:
dataSourceClassName: com.zaxxer.hikari.HikariDataSource
driverClassName: com.mysql.jdbc.Driver
jdbcUrl: jdbc:mysql://127.0.0.1:3306/test?allowPublicKeyRetrieval=true
username: root
password: root
rules:
- !ENCRYPT
encryptors:
like_encryptor:
type: CHAR_DIGEST_LIKE
aes_encryptor:
type: AES
props:
aes-key-value: 123456abc
tables:
user:
columns:
name:
cipherColumn: name
encryptorName: aes_encryptor
assistedQueryColumn: name_ext
assistedQueryEncryptorName: aes_encryptor
likeQueryColumn: name_like
likeQueryEncryptorName: like_encryptor
phone:
cipherColumn: phone
encryptorName: aes_encryptor
likeQueryColumn: phone_like
likeQueryEncryptorName: like_encryptor
queryWithCipherColumn: true
props:
sql-show: true
```
Insert
```
Logic SQL: insert into user ( id, name, phone, sex) values ( 1, '熊高祥', '13012345678', '男')
Actual SQL: ds_0 ::: insert into user ( id, name, name_ext, name_like, phone, phone_like, sex) values (1, 'gyVPLyhIzDIZaWDwTl3n4g==', 'gyVPLyhIzDIZaWDwTl3n4g==', '佹堝偀', 'qEmE7xRzW0d7EotlOAt6ww==', '04101454589', '男')
```
Update
```
Logic SQL: update user set name = '熊高祥123', sex = '男1' where sex ='男' and phone like '130%'
Actual SQL: ds_0 ::: update user set name = 'K22HjufsPPy4rrf4PD046A==', name_ext = 'K22HjufsPPy4rrf4PD046A==', name_like = '佹堝偀014', sex = '男1' where sex ='男' and phone_like like '041%'
```
Select
```
Logic SQL: select * from user where (id = 1 or phone = '13012345678') and name like '熊%'
Actual SQL: ds_0 ::: select `user`.`id`, `user`.`name` AS `name`, `user`.`sex`, `user`.`phone` AS `phone`, `user`.`create_time` from user where (id = 1 or phone = 'qEmE7xRzW0d7EotlOAt6ww==') and name_like like '佹%'
```
Select: federated table sub-query
```
Logic SQL: select * from user LEFT JOIN user_ext on user.id=user_ext.id where user.id in (select id from user where sex = '男' and name like '熊%')
Actual SQL: ds_0 ::: select `user`.`id`, `user`.`name` AS `name`, `user`.`sex`, `user`.`phone` AS `phone`, `user`.`create_time`, `user_ext`.`id`, `user_ext`.`address` from user LEFT JOIN user_ext on user.id=user_ext.id where user.id in (select id from user where sex = '男' and name_like like '佹%')
```
Delete
```
Logic SQL: delete from user where sex = '男' and name like '熊%'
Actual SQL: ds_0 ::: delete from user where sex = '男' and name_like like '佹%'
```
The above example demonstrates how fuzzy query columns rewrite SQL in different SQL syntaxes to support fuzzy queries.
### Wrap up
This article introduced you to the working principles of fuzzy query and used specific examples to demonstrate how to use it. I hope that through this article, you will have a basic understanding of fuzzy queries.
_This article was originally published on [Medium][9] and has been republished with the author's permission._
--------------------------------------------------------------------------------
via: https://opensource.com/article/23/2/fuzzy-query-apache-shardingsphere
作者:[Xiong Gaoxiang][a]
选题:[lkxed][b]
译者:[译者ID](https://github.com/译者ID)
校对:[校对者ID](https://github.com/校对者ID)
本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创编译,[Linux中国](https://linux.cn/) 荣誉推出
[a]: https://opensource.com/users/xionggaoxiang
[b]: https://github.com/lkxed
[1]: https://shardingsphere.apache.org/
[2]: https://opensource.com/article/23/1/apache-shardingsphere-new-features
[3]: https://www.techtarget.com/whatis/definition/rainbow-table
[4]: https://opensource.com/sites/default/files/2023-01/41chinesecharacters.jpg
[5]: https://opensource.com/sites/default/files/2023-01/42-1-characters.jpg
[6]: https://opensource.com/sites/default/files/2023-01/42-2-characters.jpg
[7]: https://opensource.com/sites/default/files/2023-01/43-1-characters.png
[8]: https://opensource.com/sites/default/files/2023-01/43-2-characters.png
[9]: https://medium.com/codex/fuzzy-query-for-ciphercolumn-shardingsphere-5-3-0-deep-dive-ad09faea67d3