How to Demonstrate Hashing and It’s Application to Beginner using Structure and Array in C: The Easy Way-Part-3 (Hash Collision)
This is third article of four parts article series the first two parts is prerequisites for this part and can be read from following links.
Life is not free from problems so do the hashing. Hashing is very good at improving the searching but many of the hash function suffers from “collision”. Collision can be defined as “if a hash function give same hash value for more than one key value” i.e. the hash function takes different value for hash calculation but gives same value as output for all different values.
For example, suppose division hash function is used as h(X) = X % 10, now for different values of X i.e. 1, 11, 21, 31 etc. hash value will be the same i.e. 1. Here, hash value is colliding with other hash value. Below (in Figure 1) is the example of previous discussed hash function and the demo application, hash values for ten roll numbers (136001 to 136010) are calculated with all four hashing function and hash collision are highlighted.
In figure 1, it can be observed that multiplication method gives same hash value that is 8 for 136002 and 136010, similarly mid-square method gives same hash value i.e. 7 for 136002 and 136009, folding method also have same hash value i.e. 1 for 136001 and 136010. So these are the example for hash collision.
So, why hash collision is a problem? In previous demo application (part-2), hash value was used to search location in hash table to store student’s record. Now in case of hash collision, two roll number will be map to same location and so storing two records can’t be possible. Same problem will arise at the time of searching record from hash table.
If there is a problem then there also exist solution, so there are many methods for addressing hash collision those can be group into two main categories i.e.
- Open Addressing
- Chaining
Further, each of these two discussed in details with proper examples.
- Open Addressing
As discussed, Hash value is used to find location in hash table for storing and searching record. In general, hash table is initialized with some value such as -1 or any other value which is also known sentinel values. The sentinel value helps to know that a location is free or not in the hash table which is very useful in the case of hash collision.
When a hash collision occurs, open addressing also known as closed hashing computes or find a new location to store the record. Examining and finding new location to store the record in hash table after a collision is known as probing. In open addressing probing can be achieved by following ways:
a. Linear probing
b. Quadratic probing
c. Double hashing
d. Rehashing
In further section, each of these techniques of probing is explained with reference to previous demo application. To keep the discussion in context it is required to explain the problem of collision in storing the information. In figure 2, it is clear that when one starts storing student’s record using hash value based on roll number, one hash collision occurs i.e. for 136001 and 135009 the hash value is the same i.e. 2. Now, it is important to understand that when a hash collision occurred, what is the status of hash table and which probing is used will decide which location will be used to store the record. In case of aforementioned example, hash collision occurs when one want to store student information with roll number 136009 which hash value same as 136001, as the record for 136001 is already stored, the new free location have to find for record with roll number 136009 and new location depends upon type of open addressing probing. In figure 2, colored memory location indicates the occupied location and non-colored indicates free location. The colored arrow is for showing the collision for roll number 136009. Here hash table size is 11 i.e. one more than the total records (10 ) to store, this arrangement is made to just demonstrate the probing method.
a. Linear probing
In linear probing after a collision occurred next free location is search linearly i.e. from the location of collision (location 2 in aforementioned case) each next location is checked (sentinel value is used ) to see if that location is free or not. Once a free location (location 4) is found, the record is stored at newly found free location. Linear probing is very simple and can be express as following:
h(k, i) = [h`(k) + i] mod m, where i is the probe sequence and have range 0 to m-1. m is the size of hash table.
Continuing with aforementioned example, first hash value will be generated with i=0 for key 136009 i.e.
h(136009, 0) = ( 2 + 0) % 11, which will be 2 and that is already occupied so the value of i will be increase and i=1 will be use i.e.
h(136009, 1)= ( 2 + 1) % 11, which will give output 3 as hash value and that is also occupied, now i =2 will be used,
h(136009, 2)= ( 2 + 2) % 11, which is 4 and that location is free so information of student9 with roll 136009 will be stored at location 4 in the hash table instead of location 2.
Figure 3, shows the result after performing the linear probing on the aforementioned hash collision for roll 136009
After resolving the collision the next record will be stored as usual method, so the record with roll 136010 will be store at location 8 of hash table which is free.
Searching with Linear Probing
Searching record saved without collision is simple and just need to recompute the hash value and then access the record at the location with O(1). Collision even makes searching complicated and time-consuming. Due to collision resolving the colliding record get stored at a different location than its hash value, now during search the record can not access directly. Searching need to be done in sequential manner from the collision location i.e. from the location in hash table for which hash value is colliding. The sequential search will terminate in following conditions:
- The search key found
- A free location found in hash table (key not found.)
- Search came to end of hash table (key not found.)
Implementation
To implement linear probing in our demo application, we need to perform few changes in previous code. Following is the working demo with linear probing. From line 32–46 is the linear probing function which takes array of structure, location of collision and size of hash table as argument and if available return the next free location else return -1 to indicate no free location available. In main program few changes are made in line( 69 to 79 ) to adopt linear probing. Firstly, with new hash value as index the location in hash table is checked (here default initial value is used to check a location is free or not. For int in structure the initialization value is zero, so the roll no field is checked). If the location is not free (i.e. roll no has any value other than zero) then linear probing function is called (line 73) and a new free location is search and assigned to arrayIndex variable else new entry is made with the hash value as index of hash table.
Searching with linear probing changed totally (line 112 to 133). Unlike earlier now the hash value is used to access the location but the resulting value must be compared with the key because record can be in different location than its hash value. In case of collision, the record have to be search in sequential manner from the location of collision (line 120–132) and the sequential search will terminate in case the record is found or a free location is found or search reach the end of hash table.
b. Quadratic probing
Quadratic probing solve few issues of linear probing such as primary clustering issue. In this, collision is resolve with following equation:
Where, h() is hashing function, m is the size of hash table, i is probe sequence (0 to m-1).
Implementation
The quadratic probing can be represented in other form also such as :
Where, h() is hashing function, m is the size of hash table, i is probe sequence (0 to m-1) and C1 and C2 are two constants must not be equal to zero. Following coding sample demonstrate the quadratic probing:
c. Double hashing
In double hashing as name implies two independent hash function is used to resolve the collision.
Where, h1() and h2() are two hashing function, m is the size of hash table, i is probe sequence (0 to m-1).
Implementation
Double hashing minimizes repeated collisions and the effects of clustering.
d. Rehashing
In this size of hash table is increase normally double to its size. It is complex and resource consuming process so should be avoided.
The chaining method will be discussed in next article part 4.