📖Breaking and Fixing Content-Defined Chunking

authors
Kien Tuong Truong and Simon-Philipp Merz and Matteo Scarlata and Felix Günther and Kenneth G. Paterson
year
2025
url
https://eprint.iacr.org/2025/558
  • Many CDC applications use a secret “seed” to randomize chunk sizes. However, this is not secure, and the paper shows an attack on such algorithms.
  • Padding the output to hide its length prevented the attack, but it is not provably secure.
    • Perfect length-hiding padding/encryption to the max chunk size does mitigate the attack, at the cost of increasing storage requirements.
  • A way to fix CDC is to apply AES to the rolling hash, and then compare the result to decide whether to cut a chunk or not. This adds an overhead of 1 AES operation per input byte. Modern hardware has dedicated AES instructions, but the overhead is still 50–160%.
    • (Q: does AES need to be applied to the rolling hash, or can it be applied to the data bytes directly?)
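
To make the notes above concrete, here is a rough Python sketch of the three ideas: a seeded chunker, the same chunker with a keyed function applied to the rolling hash before the boundary test, and padding to the maximum chunk size. None of this is taken from the paper. The Gear-style rolling hash, the size bounds, the mask, the padding layout, and all function names are assumptions chosen for illustration. Python's standard library has no AES, so keyed BLAKE2b stands in for the AES call here; it is a substitute, not what the paper uses.

```python
import hashlib
from typing import List, Optional

MIN_SIZE, MAX_SIZE = 64, 1024   # hypothetical chunk-size bounds
MASK = (1 << 6) - 1             # cut when the low 6 bits are zero (chance 1/64 per byte)
MASK64 = (1 << 64) - 1


def gear_table(seed: bytes) -> List[int]:
    """Derive a 256-entry table of 64-bit values from the secret seed (illustrative)."""
    return [
        int.from_bytes(
            hashlib.blake2b(bytes([b]), key=seed, digest_size=8).digest(), "big"
        )
        for b in range(256)
    ]


def prf(key: bytes, h: int) -> int:
    """Keyed function applied to the rolling hash.

    Stand-in for the AES call described in the notes: keyed BLAKE2b, not AES.
    """
    d = hashlib.blake2b(h.to_bytes(8, "big"), key=key, digest_size=8).digest()
    return int.from_bytes(d, "big")


def chunk(data: bytes, seed: bytes, prf_key: Optional[bytes] = None) -> List[bytes]:
    """Split data into content-defined chunks.

    With prf_key=None the boundary test looks at the raw rolling hash
    (the seeded scheme). With a key, the test looks at prf(key, hash),
    which costs one keyed call per byte past the minimum size.
    """
    table = gear_table(seed)
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + table[b]) & MASK64      # Gear-style rolling hash update
        size = i - start + 1
        if size < MIN_SIZE:
            continue
        v = h if prf_key is None else prf(prf_key, h)
        if (v & MASK) == 0 or size >= MAX_SIZE:  # boundary found, or forced cut
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])              # trailing chunk may be short
    return chunks


def pad_to_max(chunk_bytes: bytes) -> bytes:
    """Pad a chunk to a fixed size: 4-byte length prefix, data, zero fill (hypothetical layout)."""
    prefix = len(chunk_bytes).to_bytes(4, "big")
    return prefix + chunk_bytes + bytes(MAX_SIZE - len(chunk_bytes))


if __name__ == "__main__":
    demo = hashlib.shake_256(b"demo input").digest(20000)
    plain = chunk(demo, seed=b"secret seed")
    keyed = chunk(demo, seed=b"secret seed", prf_key=b"secret prf key")
    print(len(plain), len(keyed), len(pad_to_max(plain[0])))
```

In this sketch every padded chunk has the same length (`4 + MAX_SIZE` bytes), which is where the extra storage cost of the padding approach comes from. The padded chunks would still need to be encrypted before storage, which is omitted here.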