📖The design of fast content-defined chunking for data deduplication based storage systems

authors
Xia, Wen and Zou, Xiangyu and Jiang, Hong and Zhou, Yukun and Liu, Chuanyi and Feng, Dan and Hua, Yu and Hu, Yuchong and Zhang, Yucheng
year
2020
  • Gear hash is much faster than Rabin: each byte costs only one shift, one add, and one table lookup
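    • Mask bits need to be shifted left (toward the high end of the hash): the lowest k bits of a Gear hash depend only on the last k bytes, so a mask on the rightmost bits effectively shrinks the sliding window to k bytes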
  • Enforcing a min chunk size helps with performance (the first N bytes of each chunk don't need to be hashed and can be skipped) but reduces deduplication ratio (because potential cut-points inside the skipped region are missed).
    • Adding normalization improves deduplication ratio.
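  • Normalization = using a stricter cut-point criterion (more mask bits) while the chunk is < avg size, and a more relaxed one (fewer mask bits) once it is > avg size. So cut-points are harder to hit early and easier to hit once the size exceeds the average, which pulls chunk sizes toward the average.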
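
A minimal Python sketch of how the points above fit together: Gear hashing, skipping the first min-size bytes, and normalization with two masks. It is an illustration under stated assumptions, not the paper's implementation. The Gear table is seeded pseudo-random data, the masks simply take the top bits of a 64-bit hash, and the sizes and bit counts (64 / 256 / 1024 bytes, 10 and 6 mask bits) are small made-up values chosen for quick experiments.

```python
import random

MASK64 = (1 << 64) - 1

# Assumption: any fixed table of 256 random 64-bit values works as a Gear table.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]

# Illustrative sizes, not the paper's parameters.
MIN_SIZE, AVG_SIZE, MAX_SIZE = 64, 256, 1024


def high_mask(bits):
    """Mask with `bits` ones at the high (left) end of a 64-bit word."""
    return ((1 << bits) - 1) << (64 - bits)


# Normalization: more bits (stricter) below the average size, fewer bits
# (looser) above it. AVG_SIZE = 256 would correspond to 8 bits.
MASK_STRICT = high_mask(10)
MASK_LOOSE = high_mask(6)


def gear_update(fp, byte):
    """One Gear step: shift, add a table entry, keep 64 bits."""
    return ((fp << 1) + GEAR[byte]) & MASK64


def gear_hash(data):
    """Gear hash of a whole byte string (helper for experiments)."""
    fp = 0
    for b in data:
        fp = gear_update(fp, b)
    return fp


def next_cut(data, start=0):
    """Return the end offset of the chunk that begins at `start`."""
    remaining = len(data) - start
    if remaining <= MIN_SIZE:
        return len(data)
    end = start + min(remaining, MAX_SIZE)
    mid = start + min(remaining, AVG_SIZE)
    fp = 0
    i = start + MIN_SIZE  # the first MIN_SIZE bytes are never hashed
    while i < mid:  # below average size: strict mask
        fp = gear_update(fp, data[i])
        if fp & MASK_STRICT == 0:
            return i + 1
        i += 1
    while i < end:  # above average size: loose mask
        fp = gear_update(fp, data[i])
        if fp & MASK_LOOSE == 0:
            return i + 1
        i += 1
    return end  # forced cut at MAX_SIZE or end of data


def chunks(data):
    """Split `data` into content-defined chunks."""
    out, start = [], 0
    while start < len(data):
        cut = next_cut(data, start)
        out.append(data[start:cut])
        start = cut
    return out
```

Calling `chunks(data)` on a `bytes` object returns pieces whose concatenation is the original input. The `gear_hash` helper also makes the window-size point easy to check: for two inputs that share their last k bytes, `gear_hash(x) & ((1 << k) - 1)` is identical, which is why this sketch puts the mask bits at the high end.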