translated

feng lv 2017-08-28 13:50:32 +08:00
parent 12c68052a7
commit 56216cc1db
2 changed files with 115 additions and 119 deletions


@@ -1,119 +0,0 @@
ucasFL translating
Know your Times Tables, but... do you know your "Hash Tables"?
============================================================
Diving into the world of Hash Tables and understanding the underlying mechanics is  _extremely_  interesting, and very rewarding. So let's get into it and start from the beginning.
A Hash Table is a common data structure used in many modern day software applications. It provides dictionary-like functionality, giving you the ability to perform operations such as inserting, removing and retrieving items. Let's just say I want to find the definition of “Apple”, and I know the definition is stored in my Hash Table. I will query my Hash Table to give me a definition. The  _entry_  inside my Hash Table might look something like this: `"Apple" => "A green fruit of fruity goodness"`. So, “Apple” is my  _key_  and “A green fruit of fruity goodness” is my associated  _value_ .
One more example just so we're clear. Take the contents of the Hash Table below:
```
"bread" => "solid"
"water" => "liquid"
"soup" => "liquid"
"corn chips" => "solid"
```
I want to look up whether  _bread_  is a solid or a liquid, so I will query the Hash Table to give me the associated value, and the table will return “solid” to me. OK, so we've got the general gist of how it functions. Another important concept to note with Hash Tables is the fact that every key is unique. Let's say tomorrow I feel like having a bread milkshake (which is a  _liquid_ ); we now need to update the Hash Table to reflect its change from solid to liquid! So we add the entry into the dictionary: the key “bread” and the value “liquid”. Can you spot what has changed in the table below?
```
"bread" => "liquid"
"water" => "liquid"
"soup" => "liquid"
"corn chips" => "solid"
```
That's right, “bread” has been updated to have the value “liquid”.
**Keys are unique**; my bread can't be both a liquid and a solid. But what makes this data structure so special compared to the rest? Why not just use an [Array][1] instead? It depends on the nature of the problem. You may very well be better off using an Array for a particular problem, and that brings me to the point: **choose the data structure that is most suited to your problem**. For example, if all you need to do is store a simple grocery list, an Array will do just fine. Consider the two problems below; each problem is very different in nature.
1. I need a grocery list of fruit
2. I need a grocery list of fruit and how much each will cost me (per kilogram).
As you can see below, an Array might be a better choice for storing the fruit for the grocery list. But a Hash Table looks like a better choice for looking up the cost of each item.
```
// Example Array
["apple", "orange", "pear", "grape"]

// Example Hash Table
{ "apple" : 3.05, "orange" : 5.5, "pear" : 8.4, "grape" : 12.4 }
```
There are many opportunities to [use][2] Hash Tables.
### Time and what that means to you
[A brush up on time and space complexity][3].
On average, it takes O(1) time to search, insert and delete entries in a Hash Table. For the unaware, O(1) is spoken as “Big O of 1” and represents constant time, meaning that the running time to perform each operation does not depend on the amount of data in the dataset. We can also  _promise_  that searching, inserting and deleting items will take constant time “IF AND ONLY IF” the implementation of the Hash Table is done right. If it's not, then it can be really slow,  _O(n)_ , especially if everything hashes to the same position/slot in the Hash Table.
### Building a good Hash Table
So far we understand how to use a Hash Table, but what if we wanted to **build** one? Essentially what we need to do is map a string (e.g. “dog”) to a **hash code** (a generated number), which maps to an index of an Array. You might ask, why not just go straight to using indexes? Why bother? Well, this way we can find out immediately where “dog” is located by querying directly for “dog”: `String name = Array["dog"] //name is "Lassy"`. Using an index to look up the name, on the other hand, leaves us in the likely situation where we do not know the index at which the name is located. For example, `String name = Array[10] // name is now "Bob"` - that's not my dog's name! That is the benefit of mapping the string to a hash code (which corresponds to an index of an Array). We can get the index into the Array by using the modulo operator with the size of the Hash Table: `index = hash_code % table_size`.
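To make this concrete, below is a minimal sketch of that mapping in Java. The polynomial string hash and the helper names (`hashCode`, `indexFor`) are illustrative assumptions, not anything the article prescribes; `Math.floorMod` is used so that even a negative hash code still yields a valid index.
```
class HashDemo {
    // Illustrative only: a simple polynomial string hash.
    static int hashCode(String key) {
        int hash = 0;
        for (int i = 0; i < key.length(); i++) {
            hash = 31 * hash + key.charAt(i); // mix each character into the hash
        }
        return hash;
    }

    // Map the hash code onto a table slot: index = hash_code % table_size.
    static int indexFor(String key, int tableSize) {
        return Math.floorMod(hashCode(key), tableSize); // floorMod avoids a negative index
    }
}
```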
Another situation that we want to avoid is two keys mapping to the same index; this is called a **hash collision**, and collisions are very likely to happen if the hash function is not properly implemented. But the truth is that for every hash function  _with more inputs than outputs_  there is some chance of collision. To demonstrate a simple collision, take the following two function outputs:
`int cat_idx = hashCode("cat") % table_size; //cat_idx is now equal to 1`
`int dog_idx = hashCode("dog") % table_size; //dog_idx is now also equal to 1`
We can see that both Array indexes are now 1! As such, the values will overwrite each other because they are being written to the same index. If we tried to look up the value for “cat”, it would then return “Lassy” - not what we wanted at all. There are various methods of [resolving hash collisions][4]; the most popular one is called **Chaining**. The idea with chaining is that there is a Linked List for each index of the Array. If a collision occurs, the value is stored inside that Linked List. Thus, in the previous example we would get the value we requested, but we would need to search the Linked List attached to index 1 of the Array. Hashing with Chaining achieves O(1 + α) time, where α is the load factor, which can be represented as n/k, n being the number of entries in the Hash Table and k being the number of slots available in the Hash Table. But remember, this only holds true if the keys you are given are sufficiently random (relying on [SUHA][5]).
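As a rough illustration of Chaining, here is a minimal sketch; the class and method names are my own, not from the article. Each Array slot holds a Linked List, `put` walks the chain first so keys stay unique, and `get` only searches the one chain its key hashes to - with reasonably random keys each chain stays short, which is where the O(1 + α) bound comes from.
```
import java.util.LinkedList;

class ChainedHashTable {
    // One entry in a chain; keys are unique across the whole table.
    private static class Entry {
        final String key;
        String value;
        Entry(String key, String value) { this.key = key; this.value = value; }
    }

    private final LinkedList<Entry>[] slots;

    @SuppressWarnings("unchecked")
    ChainedHashTable(int size) {
        slots = new LinkedList[size];
        for (int i = 0; i < size; i++) slots[i] = new LinkedList<>();
    }

    private LinkedList<Entry> chainFor(String key) {
        return slots[Math.floorMod(key.hashCode(), slots.length)];
    }

    // Insert or update: if the key already exists in its chain, overwrite the value.
    void put(String key, String value) {
        for (Entry e : chainFor(key)) {
            if (e.key.equals(key)) { e.value = value; return; }
        }
        chainFor(key).add(new Entry(key, value));
    }

    // Search only the chain the key hashes to; null means "not found".
    String get(String key) {
        for (Entry e : chainFor(key)) {
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }
}
```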
This is a big assumption to make, as there is always the possibility that non-equal keys will hash to the same slot. One solution is to take the reliance on randomness away from the keys that are given to the Hash Table, and put the randomness into how the keys are hashed, to increase the likelihood of  _very few conflicts_  occurring. This is known as…
### Universal Hashing
The concept is pretty simple: select  _at random_  a hash function h from a universal hash family to compute the hash code. In other words, choose any random hash function to hash the key! Following this method, there is a  _very low_  probability that two distinct keys will hash to the same value. I will keep this one short, but if you don't trust me, then trust [Mathematics][6] instead. Another thing to watch out for when implementing this method is a badly chosen universal hash family: it can blow out the time and space complexity to O(U), where U is the size of the family. The challenge lies in finding a Hash family that takes neither too much time to compute nor too much space to store.
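For a taste of what such a family looks like, here is a sketch of the classic Carter-Wegman construction for integer keys, h(x) = ((ax + b) mod p) mod m. The prime and the parameter ranges are the standard ones, but the class itself is only an illustration under the assumption that keys are below 2^31 - 1; it is not the article's implementation.
```
import java.util.Random;

// Carter-Wegman universal family for integer keys:
// h(x) = ((a*x + b) mod p) mod m, with p prime and a, b drawn at random.
class UniversalHash {
    private static final long P = 2147483647L; // the Mersenne prime 2^31 - 1; keys assumed < P
    private final long a, b;
    private final int m; // table size

    UniversalHash(int tableSize, Random rng) {
        m = tableSize;
        a = 1 + rng.nextInt((int) (P - 1)); // a drawn from [1, p-1]
        b = rng.nextInt((int) P);           // b drawn from [0, p-1]
    }

    int hash(int key) {
        return (int) (Math.floorMod(a * key + b, P) % m); // outer mod maps onto the table
    }
}
```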
### A Hash function of the Gods
The search for perfection is inevitable. What if we could construct a  _Perfect hash function_ , where we could just map things to a set of integers with absolutely  _no collisions_ ? The good news is we can do this - well, kind of… but our data has to be static (meaning no insertions/deletions/updates) for constant time to be assured. One approach to achieving a perfect hash function is to use  _2-Level Hashing_ , which is basically a combination of the two ideas we previously discussed. It uses  _Universal Hashing_  to select which hash function to use, and then combines it with  _Chaining_ , but this time, instead of using a Linked List data structure, we use another Hash Table! Let's see how this looks visually below:
[![2-Level Hashing](http://www.zeroequalsfalse.press/2017/02/20/hashtables/Diagram.png "2-Level Hashing")][8]
**But how does this work and how can we ensure no lookup collisions?**
Well, it works in reverse to the [Birthday paradox][7]. The paradox states that in a set of N randomly chosen people, some pair will have the same birthday. But if the number of days in a year far outweighs the number of people (squared), then there is a damn good possibility that no pair of people will share the same birthday. The way this relates is that each chained Hash Table is sized as the  _square_  of the number of elements that hash to its slot in the first-level Hash Table. That is, if 2 elements happen to hash to the same slot, then that chained Hash Table will be of size 4. Most of the time the chained Tables will be very sparse/empty.
Repeat the following two steps to ensure no lookup collisions:
* Select a hash from the universal hash family
* If we get a collision, then select another hash from the universal hash family.
Literally, that is it (well… for an O(N^2) space solution, anyway). If space is a concern, then a different approach is obviously needed. But the great thing is that we will only ever have to do this process on average **twice**, as the sketch below shows.
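Here is a minimal sketch of those two steps for a single first-level slot, reusing the illustrative `UniversalHash` class from the earlier sketch. The method name and the use of `Integer[]` are my own choices, and the keys are assumed distinct.
```
import java.util.List;
import java.util.Random;

class PerfectHashDemo {
    // Build a collision-free second-level table for the (distinct) keys that
    // all hashed to one first-level slot; on average about two attempts suffice.
    static Integer[] buildSecondLevel(List<Integer> keys, Random rng) {
        int m = keys.size() * keys.size(); // second-level size: (keys in the slot) squared
        while (true) {
            UniversalHash h = new UniversalHash(m, rng); // step 1: draw a hash at random
            Integer[] table = new Integer[m];
            boolean collision = false;
            for (int key : keys) {
                int idx = h.hash(key);
                if (table[idx] != null) { collision = true; break; } // step 2: collision, retry
                table[idx] = key;
            }
            if (!collision) return table; // no lookup collisions for this slot
        }
    }
}
```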
### Summing up
A Hash Table is only as good as its  _Hash function_ . Deriving a  _Perfect hash function_  is much harder without losing out in particular areas such as functionality, time and space. I invite you to always consider Hash Tables when solving a problem, as they offer great performance benefits and can make a noticeable difference in the usability of your application. Hash Tables and Perfect hash functions are often used in real-time programming applications, and have been widely implemented in algorithms around the world. Hash Tables are here to stay.
--------------------------------------------------------------------------------
via: http://www.zeroequalsfalse.press/2017/02/20/hashtables/
Author: [Marty Jacobs][a]
Translator: [译者ID](https://github.com/译者ID)
Proofreader: [校对者ID](https://github.com/校对者ID)
This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](https://linux.cn/)
[a]:http://www.zeroequalsfalse.press/about
[1]:https://en.wikipedia.org/wiki/Array_data_type
[2]:https://en.wikipedia.org/wiki/Hash_table#Uses
[3]:https://www.hackerearth.com/practice/basic-programming/complexity-analysis/time-and-space-complexity/tutorial/
[4]:https://en.wikipedia.org/wiki/Hash_table#Collision_resolution
[5]:https://en.wikipedia.org/wiki/SUHA_(computer_science)
[6]:https://en.wikipedia.org/wiki/Universal_hashing#Mathematical_guarantees
[7]:https://en.wikipedia.org/wiki/Birthday_problem
[8]:http://www.zeroequalsfalse.press/2017/02/20/hashtables/Diagram.png


@@ -0,0 +1,115 @@
Know your Times Tables, but... do you know your "Hash Tables"?
============================================================
Exploring the world of Hash Tables and understanding the underlying mechanics is extremely interesting, and very rewarding. So let's get into it and start exploring.
A Hash Table is a common data structure in many modern software applications. It provides dictionary-like functionality, giving you the ability to perform operations such as inserting, removing and retrieving items. Let's say I want to find out the definition of “Apple”, and I know that definition is stored in my Hash Table. I query my Hash Table to get the definition. The _entry_ inside my Hash Table might look something like `"Apple" => "A green fruit of fruity goodness"`. So, “Apple” is my _key_ and “A green fruit of fruity goodness” is the associated _value_.
One more example just so we're clear; below are the contents of a Hash Table:
```
"bread" => "solid"
"water" => "liquid"
"soup" => "liquid"
"corn chips" => "solid"
```
I want to know whether _bread_ is a solid or a liquid, so I query the Hash Table for the associated value, and it returns “solid” to me. Now we roughly understand how a Hash Table works. Another important concept to note when using a Hash Table is that every key is unique. If tomorrow I fancy a bread milkshake (which is a _liquid_), we need to update the Hash Table, changing “solid” to “liquid” to reflect the change. So we add an entry to the dictionary: the key “bread” with the value “liquid”. Can you spot what has changed in the table below?
```
"bread" => "liquid"
"water" => "liquid"
"soup" => "liquid"
"corn chips" => "solid"
```
That's right, the value for “bread” has been updated to “liquid”.
**Keys are unique**; my bread can't be both a liquid and a solid. But what makes this data structure so special compared to the others? Why not use an [Array][1] instead? That depends on the nature of the problem. For a particular problem an Array may well describe things better, so the key point to note is: **we should choose the data structure best suited to the problem**. For example, if all you need is to store a simple grocery list, an Array is a good fit. Consider the two problems below; they are completely different in nature.
1. I need a list of fruit
2. I need a list of fruit and the price of each fruit (per kilogram)
As you can see below, an Array might be the better choice for storing the list of fruit, but a Hash Table looks like the better choice for storing the price of each fruit.
```
// Example Array
["apple", "orange", "pear", "grape"]

// Example Hash Table
{ "apple" : 3.05, "orange" : 5.5, "pear" : 8.4, "grape" : 12.4 }
```
In practice, there are many opportunities to [use][2] Hash Tables.
### Time and what it means to you
[A refresher on time and space complexity][3].
On average, searching, inserting and deleting entries in a Hash Table all take O(1) time. O(1) is read as “Big O of 1” and denotes constant time, meaning that the running time of each operation does not depend on the amount of data in the dataset. I can promise that lookups, insertions and deletions take only constant time “if and only if” the Hash Table is implemented correctly. If it is not, things can get really slow, O(n), especially when all of the data maps to the same position/slot in the Hash Table.
### Building a good Hash Table
So far we know how to use a Hash Table, but what if we want to **build** one? Essentially, what we need to do is map a string (say “dog”) to a **hash code** (a generated number), which in turn maps to an index of an Array. You might ask, why not just use the index directly? Why bother? Because this way we can query directly for “dog” and immediately find where it is: `String name = Array["dog"] //name is "Lassy"`. With a raw index lookup, on the other hand, we may well not know the index at which the name is located. For example, `String name = Array[10] // name is now "Bob"` - that is not my dog's name! That is the benefit of mapping a string to a hash code (which corresponds to an index of an Array). We can compute the index into the Array using the modulo operator and the size of the Hash Table: `index = hash_code % table_size`.
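To make this concrete, here is a minimal Java sketch of that mapping. The polynomial string hash and the helper names (`hashCode`, `indexFor`) are illustrative assumptions, not anything the article prescribes; `Math.floorMod` keeps the index valid even for a negative hash code.
```
class HashDemo {
    // Illustrative only: a simple polynomial string hash.
    static int hashCode(String key) {
        int hash = 0;
        for (int i = 0; i < key.length(); i++) {
            hash = 31 * hash + key.charAt(i); // mix each character into the hash
        }
        return hash;
    }

    // Map the hash code onto a table slot: index = hash_code % table_size.
    static int indexFor(String key, int tableSize) {
        return Math.floorMod(hashCode(key), tableSize); // floorMod avoids a negative index
    }
}
```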
Another situation we need to avoid is two keys mapping to the same index. This is called a **hash collision**, and it is very likely to happen if the hash function is not chosen properly. In fact, for every hash function with more inputs than outputs there is some chance of collision. The two outputs of the same function below demonstrate a simple collision:
`int cat_idx = hashCode("cat") % table_size; //cat_idx is now equal to 1`
`int dog_idx = hashCode("dog") % table_size; //dog_idx is now also equal to 1`
We can see that both Array indexes are now 1, so the two values will overwrite each other because they are written to the same index. If we look up the value for “cat”, it will return “Lassy”, which is not what we wanted. There are many ways of [resolving hash collisions][4]; the most popular one is called **Chaining**. The idea of chaining is that for each index of the Array there is a Linked List; if a collision occurs, the value is stored in that Linked List. Thus, in the previous example, we would get the value we asked for, but we would need to search the Linked List attached to index 1 of the Array. Hashing with chaining takes O(1 + α) time, where α is the load factor, which can be expressed as n/k, n being the number of entries in the Hash Table and k being the number of available slots. But remember, this conclusion only holds when the keys you are given are sufficiently random (relying on [SUHA][5]).
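As a rough illustration of Chaining, here is a minimal sketch (the class and method names are my own, not from the article): each Array slot holds a Linked List, `put` walks the chain first so keys stay unique, and `get` only searches the one chain its key hashes to.
```
import java.util.LinkedList;

class ChainedHashTable {
    // One entry in a chain; keys are unique across the whole table.
    private static class Entry {
        final String key;
        String value;
        Entry(String key, String value) { this.key = key; this.value = value; }
    }

    private final LinkedList<Entry>[] slots;

    @SuppressWarnings("unchecked")
    ChainedHashTable(int size) {
        slots = new LinkedList[size];
        for (int i = 0; i < size; i++) slots[i] = new LinkedList<>();
    }

    private LinkedList<Entry> chainFor(String key) {
        return slots[Math.floorMod(key.hashCode(), slots.length)];
    }

    // Insert or update: if the key already exists in its chain, overwrite the value.
    void put(String key, String value) {
        for (Entry e : chainFor(key)) {
            if (e.key.equals(key)) { e.value = value; return; }
        }
        chainFor(key).add(new Entry(key, value));
    }

    // Search only the chain the key hashes to; null means "not found".
    String get(String key) {
        for (Entry e : chainFor(key)) {
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }
}
```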
That is a big assumption to make, as there is always the possibility that non-equal keys will hash to the same slot. One solution is to take the reliance on randomness away from the keys given to the Hash Table, and put the randomness into how the keys are hashed, reducing the likelihood of conflicts. This is known as…
### Universal Hashing
The idea is pretty simple: select a hash function _at random_ from a universal hash family to compute the hash code. In other words, pick any random hash function to hash the key! With this method there is a very low probability that two distinct keys will hash to the same value. I will only mention this briefly; if you don't trust me, then trust the [Mathematics][6]. One other thing to watch out for when implementing this method is choosing a bad universal hash family: it can drag the time and space complexity up to O(U), where U is the size of the family. The challenge is finding a hash family that takes neither too much time to compute nor too much space to store.
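As an example of such a family, here is a sketch of the classic Carter-Wegman construction for integer keys, h(x) = ((ax + b) mod p) mod m; the prime and the parameter ranges are standard, but the class is only an illustration assuming keys below 2^31 - 1, not the article's implementation.
```
import java.util.Random;

// Carter-Wegman universal family for integer keys:
// h(x) = ((a*x + b) mod p) mod m, with p prime and a, b drawn at random.
class UniversalHash {
    private static final long P = 2147483647L; // the Mersenne prime 2^31 - 1; keys assumed < P
    private final long a, b;
    private final int m; // table size

    UniversalHash(int tableSize, Random rng) {
        m = tableSize;
        a = 1 + rng.nextInt((int) (P - 1)); // a drawn from [1, p-1]
        b = rng.nextInt((int) P);           // b drawn from [0, p-1]
    }

    int hash(int key) {
        return (int) (Math.floorMod(a * key + b, P) % m); // outer mod maps onto the table
    }
}
```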
### A Hash function of the Gods
The pursuit of perfection is human nature. Could we construct a _perfect hash function_ that maps keys to a set of integers with absolutely no collisions? The good news is that we can, to a degree, but our data must be static (meaning no insertions/deletions/updates for that period). One approach to a perfect hash function is _2-Level Hashing_, which is basically a combination of the two ideas we discussed above. It uses _Universal Hashing_ to select which hash function to use, and then combines it with _Chaining_, but this time, instead of a Linked List data structure, we use another Hash Table! Let's see how this looks below: [![2-Level Hashing](http://www.zeroequalsfalse.press/2017/02/20/hashtables/Diagram.png "2-Level Hashing")][8]
**But how does this work, and how can we ensure no lookup collisions?**
It works in reverse to the [Birthday paradox][7]. The paradox states that among a group of randomly chosen people, some pair will share a birthday. But if the number of days in a year far outweighs the number of people (squared), then there is a very good chance that no two people share a birthday. How does this relate? Each chained Hash Table is sized as the square of the number of elements hashed to its slot in the first-level table. That is, if 2 elements hash to the same slot, that chained Hash Table will be of size 4. Most of the time, the chained tables will be very sparse/empty.
Repeat the following two steps to ensure no lookup collisions:
* Select a hash function from the universal hash family
* If a collision occurs, select another hash function from the universal hash family
That is literally it (well, for an O(N^2) space solution anyway). If space is a concern, a different approach is obviously needed. But the great thing is that, on average, this process only needs to be done **twice**, as the sketch below shows.
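Below is a minimal sketch of those two steps for a single first-level slot, reusing the illustrative `UniversalHash` class from the sketch above; the method name and the use of `Integer[]` are my own choices, and the keys are assumed distinct.
```
import java.util.List;
import java.util.Random;

class PerfectHashDemo {
    // Build a collision-free second-level table for the (distinct) keys that
    // all hashed to one first-level slot; on average about two attempts suffice.
    static Integer[] buildSecondLevel(List<Integer> keys, Random rng) {
        int m = keys.size() * keys.size(); // second-level size: (keys in the slot) squared
        while (true) {
            UniversalHash h = new UniversalHash(m, rng); // step 1: draw a hash at random
            Integer[] table = new Integer[m];
            boolean collision = false;
            for (int key : keys) {
                int idx = h.hash(key);
                if (table[idx] != null) { collision = true; break; } // step 2: collision, retry
                table[idx] = key;
            }
            if (!collision) return table; // no lookup collisions for this slot
        }
    }
}
```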
### Summing up
A Hash Table is only as good as its _hash function_. Deriving a _perfect hash function_ without giving up functionality, time or space is very hard. I invite you to always consider Hash Tables when solving a problem, as they offer great performance benefits and can make a noticeable difference in the usability of your application. Hash Tables and perfect hash functions are often used in real-time programming applications, and have been widely implemented in algorithms around the world. Hash Tables are here to stay.
--------------------------------------------------------------------------------
via: http://www.zeroequalsfalse.press/2017/02/20/hashtables/
Author: [Marty Jacobs][a]
Translator: [ucasFL](https://github.com/ucasFL)
Proofreader: [校对者ID](https://github.com/校对者ID)
This article was originally compiled by [LCTT](https://github.com/LCTT/TranslateProject) and is proudly presented by [Linux中国](https://linux.cn/)
[a]:http://www.zeroequalsfalse.press/about
[1]:https://en.wikipedia.org/wiki/Array_data_type
[2]:https://en.wikipedia.org/wiki/Hash_table#Uses
[3]:https://www.hackerearth.com/practice/basic-programming/complexity-analysis/time-and-space-complexity/tutorial/
[4]:https://en.wikipedia.org/wiki/Hash_table#Collision_resolution
[5]:https://en.wikipedia.org/wiki/SUHA_(computer_science)
[6]:https://en.wikipedia.org/wiki/Universal_hashing#Mathematical_guarantees
[7]:https://en.wikipedia.org/wiki/Birthday_problem
[8]:http://www.zeroequalsfalse.press/2017/02/20/hashtables/Diagram.png