20 KiB
A guide to fuzzy queries with Apache ShardingSphere
Apache ShardingSphere is an open source distributed database and an ecosystem users and developers need for their databases to provide a customized and cloud-native experience. Its latest release contains many new features, including data encryption integrated with existing SQL workflows. Most importantly, it allows fuzzy queries of the encrypted data.
The problem
By parsing a user's SQL input and rewriting the SQL according to the user's encryption rules, the original data is encrypted and stored with ciphertext data in the underlying database simultaneously.
When a user queries the data, it fetches the ciphertext data from the database, decrypts it, and returns the decrypted original data to the user. However, because the encryption algorithm encrypts the whole string, users cannot run fuzzy queries.
Nevertheless, many businesses need fuzzy queries after the data is encrypted. In version 5.3.0, Apache ShardingSphere provides users with a default fuzzy query algorithm that supports encrypted fields. The algorithm also supports hot plugging, which users can customize. The fuzzy query can be achieved through configuration.
How to achieve fuzzy query in encrypted scenarios
Load data to the in-memory database (IMDB)
First, load all the data into the IMDB to decrypt it. Then, it'll be like querying the original data. This method can achieve fuzzy queries. If the amount of data is small, this method will prove simple and cost-effective. However, if the quantity of data is large, it'll be a disaster.
Implement encryption and decryption functions consistent with database programs
The second method is to modify fuzzy query conditions and use the database decryption function to decrypt data first and then implement fuzzy query. This method's advantage is the low cost of implementation, development, and use.
Users only need to modify the previous fuzzy query conditions slightly. However, the ciphertext and encryption functions are stored together in the database, which cannot cope with the problem of account data leaks.
Native SQL:
select * from user where name like "%xxx%"
After implementing the decryption function:
ѕеlесt * frоm uѕеr whеrе dесоdе(namе) lіkе "%ххх%"
Store after data masking
Implement data masking on ciphertext and then store it in a fuzzy query column. This method could lack precision.
For example, mobile number 13012345678 becomes 130****5678 after the masking algorithm is performed.
Perform encrypted storage after tokenization and combination
This method performs tokenization and combination on ciphertext data and then encrypts the resultset by grouping characters with fixed length and splitting a field into multiple ones. For example, we take four English characters and two Chinese characters as a query condition: ningyu1 uses the four-character as a group to encrypt, so the first group is ning, the second group ingy, the third group ngyu, the fourth group gyu1, and so on. All the characters are encrypted and stored in the fuzzy query column. If you want to retrieve all data that contains four characters, such as ingy, encrypt the characters and use a key like"%partial%"
to query.
Shortcomings:
- Increased storage costs: Free grouping will increase the amount of data, and the data length will increase after being encrypted.
- Limited length in fuzzy query: Due to security issues, the length of free grouping cannot be too short or the rainbow table will easily crack it. Like the example I mentioned above, the length of fuzzy query characters must be greater than or equal to four letters/digits or two Chinese characters.
Single-character digest algorithm (default fuzzy query algorithm provided in ShardingSphere version 5.3.0)
Although the above methods are all viable, it's only natural to wonder if there's a better alternative. In our community, we find that single-character encryption and storage can balance performance and query but fails to meet security requirements.
So what's the ideal solution? Inspired by masking algorithms and cryptographic hash functions, we find that data loss and one-way functions can be used.
The cryptographic hash function should have the following four features:
- It should be easy to calculate the hash value for any given message.
- It should be difficult to infer the original message from a known hash value.
- It should not be feasible to modify the message without changing the hash value.
- There should only be a very low chance that two different messages produce the same hash value.
Security: Because of the one-way function, it's impossible to infer the original message. To improve the accuracy of the fuzzy query, we want to encrypt a single character, but the rainbow table will crack it.
So we take a one-way function (to ensure every character is the same after encryption) and increase the frequency of collisions (to ensure every string is 1: N backward), which greatly enhances security.
Fuzzy query algorithm
Apache ShardingSphere implements a universal fuzzy query algorithm using the below single-character digest algorithm org.apache.shardingsphere.encrypt.algorithm.like.CharDigestLikeEncryptAlgorithm
.
public final class CharDigestLikeEncryptAlgorithm implements LikeEncryptAlgorithm<Object, String> {
private static final String DELTA = "delta";
private static final String MASK = "mask";
private static final String START = "start";
private static final String DICT = "dict";
private static final int DEFAULT_DELTA = 1;
private static final int DEFAULT_MASK = 0b1111_0111_1101;
private static final int DEFAULT_START = 0x4e00;
private static final int MAX_NUMERIC_LETTER_CHAR = 255;
@Getter
private Properties props;
private int delta;
private int mask;
private int start;
private Map<Character, Integer> charIndexes;
@Override
public void init(final Properties props) {
this.props = props;
delta = createDelta(props);
mask = createMask(props);
start = createStart(props);
charIndexes = createCharIndexes(props);
}
private int createDelta(final Properties props) {
if (props.containsKey(DELTA)) {
String delta = props.getProperty(DELTA);
try {
return Integer.parseInt(delta);
} catch (NumberFormatException ex) {
throw new EncryptAlgorithmInitializationException("CHAR_DIGEST_LIKE", "delta can only be a decimal number");
}
}
return DEFAULT_DELTA;
}
private int createMask(final Properties props) {
if (props.containsKey(MASK)) {
String mask = props.getProperty(MASK);
try {
return Integer.parseInt(mask);
} catch (NumberFormatException ex) {
throw new EncryptAlgorithmInitializationException("CHAR_DIGEST_LIKE", "mask can only be a decimal number");
}
}
return DEFAULT_MASK;
}
private int createStart(final Properties props) {
if (props.containsKey(START)) {
String start = props.getProperty(START);
try {
return Integer.parseInt(start);
} catch (NumberFormatException ex) {
throw new EncryptAlgorithmInitializationException("CHAR_DIGEST_LIKE", "start can only be a decimal number");
}
}
return DEFAULT_START;
}
private Map<Character, Integer> createCharIndexes(final Properties props) {
String dictContent = props.containsKey(DICT) && !Strings.isNullOrEmpty(props.getProperty(DICT)) ? props.getProperty(DICT) : initDefaultDict();
Map<Character, Integer> result = new HashMap<>(dictContent.length(), 1);
for (int index = 0; index < dictContent.length(); index++) {
result.put(dictContent.charAt(index), index);
}
return result;
}
@SneakyThrows
private String initDefaultDict() {
InputStream inputStream = CharDigestLikeEncryptAlgorithm.class.getClassLoader().getResourceAsStream("algorithm/like/common_chinese_character.dict");
LineProcessor<String> lineProcessor = new LineProcessor<String>() {
private final StringBuilder builder = new StringBuilder();
@Override
public boolean processLine(final String line) {
if (line.startsWith("#") || 0 == line.length()) {
return true;
} else {
builder.append(line);
return false;
}
}
@Override
public String getResult() {
return builder.toString();
}
};
return CharStreams.readLines(new InputStreamReader(inputStream, Charsets.UTF_8), lineProcessor);
}
@Override
public String encrypt(final Object plainValue, final EncryptContext encryptContext) {
return null == plainValue ? null : digest(String.valueOf(plainValue));
}
private String digest(final String plainValue) {
StringBuilder result = new StringBuilder(plainValue.length());
for (char each : plainValue.toCharArray()) {
char maskedChar = getMaskedChar(each);
if ('%' == maskedChar) {
result.append(each);
} else {
result.append(maskedChar);
}
}
return result.toString();
}
private char getMaskedChar(final char originalChar) {
if ('%' == originalChar) {
return originalChar;
}
if (originalChar <= MAX_NUMERIC_LETTER_CHAR) {
return (char) ((originalChar + delta) & mask);
}
if (charIndexes.containsKey(originalChar)) {
return (char) (((charIndexes.get(originalChar) + delta) & mask) + start);
}
return (char) (((originalChar + delta) & mask) + start);
}
@Override
public String getType() {
return "CHAR_DIGEST_LIKE";
}
}
- Define the binary
mask
code to lose precision0b1111_0111_1101
(mask). - Save common Chinese characters with disrupted order like a
map
dictionary. - Obtain a single string of
Unicode
for digits, English, and Latin. - Obtain an
index
for a Chinese character belonging to a dictionary. - Other characters fetch the
Unicode
of a single string. - Add
1 (delta)
to the digits obtained by different types above to prevent any original text from appearing in the database. - Then convert the offset
Unicode
into binary, perform theAND
operation withmask
, and carry out a two-bit digit loss. - Directly output digits, English, and Latin after the loss of precision.
- The remaining characters are converted to decimal and output with the common character
start
code after the loss of precision.
The fuzzy algorithm development progress
The first edition
Simply use Unicode
and mask
code of common characters to perform the AND
operation.
Mask: 0b11111111111001111101
The original character: 0b1000101110101111讯
After encryption: 0b1000101000101101設
Assuming we know the key and encryption algorithm, the original string after a backward pass is:
1.0b1000101100101101 謭
2.0b1000101100101111 謯
3.0b1000101110101101 训
4.0b1000101110101111 讯
5.0b1000101010101101 読
6.0b1000101010101111 誯
7.0b1000101000101111 訯
8.0b1000101000101101 設
Based on the missing bits, we find that each string can be derived 2^n
Chinese characters backward. When the Unicode
of common Chinese characters is decimal, their intervals are very large. Notice that the Chinese characters inferred backward are not common characters, and it's more likely to infer the original characters.
The second edition
The interval of common Chinese characters in Unicode
is irregular. We planned to leave the last few bits of Chinese characters in Unicode
and convert them into decimal as an index
to fetch some common Chinese characters. This way, when the algorithm is known, uncommon characters won't appear after a backward pass, and distractors are no longer easy to eliminate.
If we leave the last few bits of Chinese characters in Unicode
, it has something to do with the relationship between the accuracy of fuzzy query and anti-decryption complexity. The higher the accuracy, the lower the decryption difficulty.
Let's take a look at the collision degree of common Chinese characters under our algorithm:
- When
mask
=0b0011_1111_1111:
- When
mask
=0b0001_1111_1111:
For the mantissa of Chinese characters, leave 10 and 9 digits. The 10-digit query is more accurate because its collision is much weaker. Nevertheless, if the algorithm and the key are known, the original text of the 1:1 character can be derived backward.
The nine-digit query is less accurate because nine-digit collisions are relatively stronger, but there are fewer 1:1 characters. Although we change the collisions regardless of whether we leave ten or nine digits, the distribution is unbalanced due to the irregular Unicode
of Chinese characters. The overall collision probability cannot be controlled.
The third edition
In response to the unevenly distributed problem found in the second edition, we take common characters with disrupted order as the dictionary table.
-
The encrypted text first looks up the
index
in the out-of-order dictionary table. We use theindex
and subscript to replace theUnicode
without rules. UseUnicode
in case of uncommon characters. (Note: Evenly distribute the code to be calculated as far as possible.) -
The next step is to perform the
AND
operation with amask
and lose two-bit precision to increase the frequency of collisions.
Let's take a look at the collision degree of common Chinese characters under our algorithm:
- When
mask
=0b1111_1011_1101:
- When
mask
=0b0111_1011_1101:
When the mask
leaves 11 bits, you can see that the collision distribution is concentrated at 1:4. When the mask
leaves ten bits, the number becomes 1:8. At this time, we only need to adjust the number of precision losses to control whether the collision is 1:2, 1:4 or 1:8.
If the mask
is selected as 1, and the algorithm and key are known, there will be a 1:1 Chinese character because we calculate the collision degree of common characters at this time. If we add the missing four bits before the 16-bit binary of Chinese characters, the situation becomes 2^5=32
cases.
Since we encrypt the whole text, even if the individual character is inferred backward, there will be little impact on overall security and will not cause mass data leaks. At the same time, the premise of backward pass is to know the algorithm, key, delta
, and dictionary, so it's impossible to achieve from the data in the database.
How to use fuzzy query
Fuzzy query requires the configuration of encryptors
(encryption algorithm configuration), likeQueryColumn
(fuzzy query column name), and likeQueryEncryptorName
(encryption algorithm name of fuzzy query column ) in the encryption configuration.
Please refer to the following configuration. Add your own sharding algorithm and data source.
dataSources:
ds_0:
dataSourceClassName: com.zaxxer.hikari.HikariDataSource
driverClassName: com.mysql.jdbc.Driver
jdbcUrl: jdbc:mysql://127.0.0.1:3306/test?allowPublicKeyRetrieval=true
username: root
password: root
rules:
- !ENCRYPT
encryptors:
like_encryptor:
type: CHAR_DIGEST_LIKE
aes_encryptor:
type: AES
props:
aes-key-value: 123456abc
tables:
user:
columns:
name:
cipherColumn: name
encryptorName: aes_encryptor
assistedQueryColumn: name_ext
assistedQueryEncryptorName: aes_encryptor
likeQueryColumn: name_like
likeQueryEncryptorName: like_encryptor
phone:
cipherColumn: phone
encryptorName: aes_encryptor
likeQueryColumn: phone_like
likeQueryEncryptorName: like_encryptor
queryWithCipherColumn: true
props:
sql-show: true
Insert
Logic SQL: insert into user ( id, name, phone, sex) values ( 1, '熊高祥', '13012345678', '男')
Actual SQL: ds_0 ::: insert into user ( id, name, name_ext, name_like, phone, phone_like, sex) values (1, 'gyVPLyhIzDIZaWDwTl3n4g==', 'gyVPLyhIzDIZaWDwTl3n4g==', '佹堝偀', 'qEmE7xRzW0d7EotlOAt6ww==', '04101454589', '男')
Update
Logic SQL: update user set name = '熊高祥123', sex = '男1' where sex ='男' and phone like '130%'
Actual SQL: ds_0 ::: update user set name = 'K22HjufsPPy4rrf4PD046A==', name_ext = 'K22HjufsPPy4rrf4PD046A==', name_like = '佹堝偀014', sex = '男1' where sex ='男' and phone_like like '041%'
Select
Logic SQL: select * from user where (id = 1 or phone = '13012345678') and name like '熊%'
Actual SQL: ds_0 ::: select `user`.`id`, `user`.`name` AS `name`, `user`.`sex`, `user`.`phone` AS `phone`, `user`.`create_time` from user where (id = 1 or phone = 'qEmE7xRzW0d7EotlOAt6ww==') and name_like like '佹%'
Select: federated table sub-query
Logic SQL: select * from user LEFT JOIN user_ext on user.id=user_ext.id where user.id in (select id from user where sex = '男' and name like '熊%')
Actual SQL: ds_0 ::: select `user`.`id`, `user`.`name` AS `name`, `user`.`sex`, `user`.`phone` AS `phone`, `user`.`create_time`, `user_ext`.`id`, `user_ext`.`address` from user LEFT JOIN user_ext on user.id=user_ext.id where user.id in (select id from user where sex = '男' and name_like like '佹%')
Delete
Logic SQL: delete from user where sex = '男' and name like '熊%'
Actual SQL: ds_0 ::: delete from user where sex = '男' and name_like like '佹%'
The above example demonstrates how fuzzy query columns rewrite SQL in different SQL syntaxes to support fuzzy queries.
Wrap up
This article introduced you to the working principles of fuzzy query and used specific examples to demonstrate how to use it. I hope that through this article, you will have a basic understanding of fuzzy queries.
This article was originally published on Medium and has been republished with the author's permission.
via: https://opensource.com/article/23/2/fuzzy-query-apache-shardingsphere
作者:Xiong Gaoxiang 选题:lkxed 译者:译者ID 校对:校对者ID