
Evaluation of hash collision

How do we evaluate hash function collisions? This question came up while I was working on my paper. Here I’d like to share what I’ve collected and my own solution.

The hash function quality

While looking for good metrics to evaluate hash function collisions, I found a hash function quality metric from this blog, which is expressed as

$$\sum_{j=0}^{m-1}\frac{b_j(b_j+1)/2}{(n/2m)(n+2m-1)}$$

Here $b_j$ is the number of keys hashed into slot $j$, $m$ is the number of slots, and $n$ is the total number of keys. This expression is quoted in Wikipedia and in this paper [1] as well. The equation is from the red dragon book[2]. I cannot find the original equation, but since it has been referenced in academic papers, I think the source should be reliable.
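To make the metric concrete, here is a minimal Python sketch of it (my own illustration, not code from the blog or the dragon book); CRC-32 from zlib stands in as the hash under test, and the bucket count $m$ is an arbitrary choice:

```python
import zlib

def hash_quality(bucket_counts, n):
    """Red dragon book quality metric: sum_j b_j*(b_j+1)/2 over all
    buckets, divided by the value (n/2m)*(n+2m-1) expected for a
    perfectly uniform hash. Close to 1.0 means near-uniform."""
    m = len(bucket_counts)
    numerator = sum(b * (b + 1) / 2 for b in bucket_counts)
    expected = (n / (2 * m)) * (n + 2 * m - 1)
    return numerator / expected

# Hash 100,000 keys into m = 1024 buckets with CRC-32.
m = 1024
keys = [str(k).encode() for k in range(100_000)]
buckets = [0] * m
for key in keys:
    buckets[zlib.crc32(key) % m] += 1
print(hash_quality(buckets, len(keys)))  # ~1.0 for a uniform hash
```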

But I ran it on a dataset I created, a bunch of files of random numbers, to evaluate the collisions of Gear[3]. The metric is almost 4.8 for a 100 GB file, 1.8 for a 10 GB file, and 1.03 for a 1 GB file. The result seems to depend on the sample size, so I don’t think it’s a good solution for evaluating hash collisions.

Chi-squared dependence test

I’ve seen another way to evaluate the uniformity of a hash function: the chi-squared test[4]. The main idea is to split the range of the hash function into $b$ bins and check whether the counts falling into these bins deviate from a uniform distribution with a $\chi^2$ test. The equation is as below

$$T=\sum_{i=1}^{b}\frac{(n_i-\bar{n})^2}{\bar{n}}$$

Then just check $T$ against the $\chi^2$ distribution table with $b-1$ degrees of freedom.
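As a rough illustration (my own sketch, not the exact procedure from [4]), the statistic can be computed by binning fingerprints, with the tail probability read off via SciPy’s chi-squared survival function:

```python
import os
import zlib
from scipy.stats import chi2

def chi_squared_statistic(bin_counts):
    """T = sum over bins of (n_i - n_bar)^2 / n_bar."""
    n_bar = sum(bin_counts) / len(bin_counts)
    return sum((n_i - n_bar) ** 2 / n_bar for n_i in bin_counts)

# Bin 100,000 CRC-32 fingerprints of random 64-byte inputs into
# b = 256 equal-width ranges of the 32-bit output space.
b = 256
bins = [0] * b
for _ in range(100_000):
    fp = zlib.crc32(os.urandom(64))
    bins[fp * b // 2**32] += 1

T = chi_squared_statistic(bins)
print(T, chi2.sf(T, df=b - 1))  # a tiny p-value would reject uniformity
```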

My solution

I define a metric named collision probability, as in the equation below:

$$p_{collision}=\dfrac{\sum_{i=2}^{n} n_i C_i^2}{C_n^2}$$

This equation is the number of ways to select two colliding chunks (i.e., two chunks sharing the same hash fingerprint) divided by the number of ways to pick any two chunks from the universe. For example, suppose you have a large file, say 200 GB, and you split it into chunks and calculate the hash fingerprint of each chunk to evaluate the collisions of a hash function. If the chunk size is 8 KB, then you will have 26214400 chunks. Some chunks may share the same fingerprint: say fingerprint 0xff058b7a corresponds to 8 different chunks, and the test contains 100 such fingerprints. Then for the term $i=8$, $n_i$ equals 100 and $n$ equals 26214400. To calculate the hash collision probability, you sum the combinations over $i=2,3,4,5,\cdots$
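Here is a minimal Python sketch of this calculation (my own illustration, with Adler-32 over random chunks standing in for real file data, and the chunk count scaled down). Grouping fingerprints by multiplicity and summing $n_i C_i^2$ is the same as summing $C_i^2$ over every repeated fingerprint, which is what the code does:

```python
import math
import os
import zlib
from collections import Counter

def collision_probability(fingerprints):
    """p_collision: pairs of chunks sharing a fingerprint, C(i, 2)
    summed over every repeated fingerprint, divided by all pairs C(n, 2)."""
    n = len(fingerprints)
    counts = Counter(fingerprints)  # fingerprint -> number of chunks with it
    colliding_pairs = sum(math.comb(c, 2) for c in counts.values() if c > 1)
    return colliding_pairs / math.comb(n, 2)

# 1,000,000 random 8 KiB chunks fingerprinted with Adler-32.
fps = [zlib.adler32(os.urandom(8192)) for _ in range(1_000_000)]
print(f"p_collision = {collision_probability(fps):.3e}")
```

Note that with a 32-bit fingerprint you need on the order of a million chunks before colliding pairs become numerous enough to measure; far fewer and the estimate is dominated by noise.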

The above example shows how to calculate the collision probability; it also represents the probability that a chunk’s fingerprint collides with that of a given chunk. I’ve already tested it with Adler-32 and the fingerprint in my paper (pending). I found the collision probability of Adler-32 to be $2.3\times 10^{-10}$, which is almost $1/2^{32}$. I’ve also tested it on Gear, where the collision probability is on the order of $10^{-7}$, a much higher chance of collision.

As long as the collision probability is close to $1/2^{n}$ for an $n$-bit hash fingerprint, the hash function should be distributed uniformly enough.


  1. Grochol D, Sekanina L. Multi-objective evolution of hash functions for high speed networks[C]//2017 IEEE Congress on Evolutionary Computation (CEC). IEEE, 2017: 1533-1540. ↩︎

  2. Aho A V, Sethi R, Ullman J D. Compilers: Principles, Techniques, and Tools[M]. Addison-Wesley, 1986. ↩︎

  3. Xia W, Jiang H, Feng D, et al. Ddelta: A deduplication-inspired fast delta compression approach[J]. Performance Evaluation, 2014, 79: 258-272. (I take the fingerprint of the window before the boundary and use only 32 bits.) ↩︎

  4. Henke C, Schmoll C, Zseby T. Empirical evaluation of hash functions for PacketID generation in sampled multipoint measurements[C]//International Conference on Passive and Active Network Measurement. Springer, Berlin, Heidelberg, 2009: 197-206. ↩︎