Checksum#

ToDo:

  • Add illustration for computing checksums - Similar to this.

  • Add one example using pure bitwise operations.

  • Add appendix on error correction codes - For instance Hamming Codes.

  • Add relevant resources at the end.


Even though hashes provide cryptographic features, they are computationally intensive, and in some use cases only data integrity is desired, e.g. to check that a downloaded file is exactly the one hosted on a server. Checksums are non-reversible.

For that purpose there are functions called checksums. They are not cryptographic hashes and hence are not suitable for security purposes. That means that the sender should either make the checksum public or send it along with the original data. In the latter case, note that the checksum is vulnerable to man-in-the-middle attacks.

The most common methods are called CRC32 and Adler32. Both are similar in purpose, but the latter is faster. The Python standard library exposes functions for both under the zlib module.

One of the main differences is that with hashes the data is usually small, whereas checksums are used for whole files, which can be many gigabytes in size. The following examples illustrate how to work with Python objects as well as files.

import zlib

Working with Python Data#

Python data is what has been shown in the previous chapters: dictionaries, strings and the like. Like hashes, checksum functions expect a bytes object, so any data should be converted to bytes before it is checksummed and sent.
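The conversion to bytes can be sketched as follows. The UTF-8 encoding, the use of JSON for the dictionary, and the `record` data are assumptions made for illustration; any encoding and serialization agreed between sender and receiver would work.

```python
import json
import zlib

# A string must be encoded first; UTF-8 is an assumption here.
text_bytes = "Hello World!".encode("utf-8")

# A dictionary needs a serialization step. Sorting the keys keeps the
# byte representation stable regardless of insertion order.
record = {"name": "Alice", "id": 1}  # hypothetical example data
record_bytes = json.dumps(record, sort_keys=True).encode("utf-8")

print(f"Text checksum:   {zlib.crc32(text_bytes)}")
print(f"Record checksum: {zlib.crc32(record_bytes)}")
```

Both sides must serialize the data in exactly the same way, otherwise the checksums differ even when the underlying values are equal.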

CRC32#

message = b"Hello World!"

message_crc = zlib.crc32(message)

print(f" Message: {message}")
print(f"Checksum: {message_crc}")
 Message: b'Hello World!'
Checksum: 472456355
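To show what happens under the hood, the following is a minimal sketch of CRC32 using only bitwise operations. It uses the common bit-by-bit formulation with the reflected polynomial 0xEDB88320; the function name `crc32_bitwise` is hypothetical, and the real zlib implementation is table-driven and much faster.

```python
import zlib


def crc32_bitwise(data: bytes) -> int:
    """Bit-by-bit CRC32 (reflected polynomial 0xEDB88320)."""
    crc = 0xFFFFFFFF  # initial value: all bits set
    for byte in data:
        crc ^= byte  # mix the next byte into the low 8 bits
        for _ in range(8):
            if crc & 1:
                # Lowest bit set: shift right and apply the polynomial
                crc = (crc >> 1) ^ 0xEDB88320
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF  # final inversion


message = b"Hello World!"
print(crc32_bitwise(message) == zlib.crc32(message))
```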

Adler32#

message = b"Hello World!"

message_crc = zlib.adler32(message)

print(f" Message: {message}")
print(f"Checksum: {message_crc}")
 Message: b'Hello World!'
Checksum: 474547262
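Adler32 is simple enough to write out by hand. The sketch below keeps two running sums modulo 65521 (the largest prime below 2**16) and packs them into a 32-bit integer; the function name `adler32_pure` is hypothetical and the code is meant for illustration, not speed.

```python
import zlib

MOD_ADLER = 65521  # largest prime smaller than 2**16


def adler32_pure(data: bytes) -> int:
    """Pure-Python Adler32: two running sums packed into 32 bits."""
    a, b = 1, 0
    for byte in data:
        a = (a + byte) % MOD_ADLER  # sum of all bytes, plus one
        b = (b + a) % MOD_ADLER     # sum of the intermediate values of a
    return (b << 16) | a


message = b"Hello World!"
print(adler32_pure(message) == zlib.adler32(message))
```

The inner loop consists of two additions per byte, which is one reason Adler32 tends to be cheaper to compute than CRC32.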

Verification#

def verify(received_message, received_crc):
    """Recompute the Adler32 checksum and compare it with the received one."""
    received_message_crc = zlib.adler32(received_message)

    if received_message_crc != received_crc:
        return "Mismatch, either message or checksum is incorrect"

    return "Match, message and checksum are consistent"

Altered message#

received_message = b"Hello Wrold!"
received_crc = 474547262

print(verify(received_message, received_crc))
Mismatch, either message or checksum is incorrect

Altered Checksum#

received_message = b"Hello World!"
received_crc = 474547263

print(verify(received_message, received_crc))
Mismatch, either message or checksum is incorrect

Correct Message and Checksum#

received_message = b"Hello World!"
received_crc = 474547262

print(verify(received_message, received_crc))
Match, message and checksum are consistent

Simulated Man in the Middle#

If the data is intercepted and maliciously tampered with, and the attacker also replaces the checksum with one computed over the altered data, the receiver has no way to tell the difference. The checksum does not provide any protection against deliberate modification; that is, it is not an authentication mechanism.

original_message = b"Hello World!"
original_crc = 474547262

# Man in the middle Attack
altered_message = b"Hello World!" + b" Or that is what you think"
altered_crc = zlib.adler32(altered_message)

print(verify(altered_message, altered_crc))
Match, message and checksum are consistent
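For contrast, the sketch below shows what an authenticated check could look like using HMAC from the standard library. This is not a checksum and goes beyond what this chapter covers; the shared key, the attacker's key and the helper name `verify_tag` are all assumptions made for illustration.

```python
import hashlib
import hmac


def verify_tag(key, message, tag):
    """Recompute the HMAC-SHA256 tag and compare it in constant time."""
    expected = hmac.new(key, message, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)


# Assumption: this key is known only to the sender and the receiver.
shared_key = b"hypothetical-shared-secret"

original_message = b"Hello World!"
original_tag = hmac.new(shared_key, original_message, hashlib.sha256).digest()

# The attacker can alter the message, but without the key they cannot
# compute a tag that the receiver will accept.
altered_message = original_message + b" Or that is what you think"
forged_tag = hmac.new(b"attacker-guess", altered_message, hashlib.sha256).digest()

print(verify_tag(shared_key, original_message, original_tag))  # True
print(verify_tag(shared_key, altered_message, forged_tag))     # False
```

Unlike the Adler32 example, recomputing the tag requires a secret, so simply replacing both message and tag no longer goes unnoticed.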

Working with Files#

It is also desirable to work with files and check their integrity. This code snippet could also be applied to hashes, although it may not be as popular.

To work with files, the best way is to use the pathlib module from the standard library.

from pathlib import Path

Checksum Calculation#

readme_file_bytes = Path("../README.md").read_bytes()

message_crc = zlib.adler32(readme_file_bytes)

print(f"Checksum (Adler32): {message_crc}")
Checksum (Adler32): 3529090815
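Reading the whole file with `read_bytes()` loads it into memory at once, which may not be practical for files of many gigabytes. A minimal sketch of a chunked alternative follows, relying on the optional second argument of `zlib.adler32` (the running checksum). The helper name `adler32_of_file`, the chunk size and the temporary sample file are assumptions for illustration.

```python
import tempfile
import zlib
from pathlib import Path


def adler32_of_file(path, chunk_size=1 << 20):
    """Compute Adler32 over a file, reading chunk_size bytes at a time."""
    checksum = 1  # Adler32 starts from 1 (zlib's default starting value)
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            checksum = zlib.adler32(chunk, checksum)
    return checksum


# Demonstration on a throwaway file with hypothetical content.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.bin"
    sample_path.write_bytes(b"Hello World!" * 1000)
    chunked = adler32_of_file(sample_path, chunk_size=64)
    whole = zlib.adler32(sample_path.read_bytes())

print(chunked == whole)
```

The same pattern works for `zlib.crc32`, whose default starting value is 0 instead of 1.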

Hash as Checksums#

Some systems may use MD5, a legacy and vulnerable hash algorithm, or SHA1 to checksum files. Even though MD5 and SHA1 are cryptographic functions, they are still subject to the same man-in-the-middle attack. The advantage of using them lies in the far lower chance of accidental collisions and in the large changes in output caused by small changes in input.

When using hashes, the output is usually handled as a hexadecimal string instead of an integer. It can be converted to an integer, although this might produce overflow in legacy systems that cannot handle arbitrarily long integers, depending on the precise hash function used.
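The conversion from a hexadecimal digest to an integer can be sketched as follows. The input `b"Hello World!"` is only an example; Python integers have arbitrary precision, so no overflow occurs here, but a 128-bit MD5 value would not fit in a 64-bit integer type.

```python
import hashlib

digest = hashlib.md5(b"Hello World!")

as_hex = digest.hexdigest()  # 32 hexadecimal characters for MD5
as_int = int(as_hex, 16)     # interpret the hexadecimal string as base 16

print(f"Hexadecimal: {as_hex}")
print(f"    Integer: {as_int}")
print(f"  Bits used: {as_int.bit_length()} (at most 128 for MD5)")

# The same integer can be obtained directly from the raw digest bytes.
print(as_int == int.from_bytes(digest.digest(), "big"))
```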

import hashlib

readme_file_bytes = Path("../README.md").read_bytes()
message_crc = hashlib.md5(readme_file_bytes)

print(f"Checksum (MD5): {message_crc.hexdigest()}")
Checksum (MD5): a63e9794301a39aa9a412fd2f1a812fa

message_crc = hashlib.sha1(readme_file_bytes)

print(f"Checksum (SHA1): {message_crc.hexdigest()}")
Checksum (SHA1): 839289fd566b1fd11e6751dda679d00101af9403
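Hash objects can also be fed incrementally through their `update()` method, which avoids holding a large file in memory. The sketch below assumes a hypothetical helper `sha1_of_file` and a temporary sample file; it is an illustration, not the only way to do it.

```python
import hashlib
import tempfile
from pathlib import Path


def sha1_of_file(path, chunk_size=1 << 20):
    """Compute the SHA1 hex digest of a file, one chunk at a time."""
    hasher = hashlib.sha1()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)  # equivalent to hashing the concatenation
    return hasher.hexdigest()


# Demonstration on a throwaway file with hypothetical content.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.bin"
    sample_path.write_bytes(b"Hello World!" * 1000)
    chunked_digest = sha1_of_file(sample_path, chunk_size=64)
    whole_digest = hashlib.sha1(sample_path.read_bytes()).hexdigest()

print(chunked_digest == whole_digest)
```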

Speed Benchmarks#

This section compares the time needed to compute the checksum using the different methods shown above. The README file is used as an example, and it is artificially enlarged to showcase the efficiency of the methods when dealing with large files. The sys module is used to measure the size of the in-memory bytes object, which includes a small fixed overhead and so closely approximates the file size.

The exact results may vary depending on the hardware. This is not a rigorous benchmark, but it serves to give a general impression of the relative speed of the different methods.

Preparing the File#

import sys
readme_file_bytes = Path("../README.md").read_bytes()

# Make it artificially bigger
while sys.getsizeof(readme_file_bytes) < 500 * 2**20:
    readme_file_bytes = readme_file_bytes * 2

print(f"Checksum for a file of {sys.getsizeof(readme_file_bytes) / 2**20:.4g} MB")
Checksum for a file of 872 MB

Runs#

print("  Checksum CRC32:", end=" ")
%timeit zlib.crc32(readme_file_bytes)

print("Checksum Adler32:", end=" ")
%timeit zlib.adler32(readme_file_bytes)

print("    Checksum MD5:", end=" ")
%timeit hashlib.md5(readme_file_bytes)

print("   Checksum SHA1:", end=" ")
%timeit hashlib.sha1(readme_file_bytes)
  Checksum CRC32: 
998 ms ± 147 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Checksum Adler32: 
425 ms ± 61 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
    Checksum MD5: 
1.75 s ± 102 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
   Checksum SHA1: 
KeyboardInterrupt: the SHA1 run was interrupted before it completed, so no timing was recorded.
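The `%timeit` lines above are IPython magics and only work inside IPython or Jupyter. A rough equivalent with the standard `timeit` module is sketched below. The 8 MiB payload of zero bytes is an assumption chosen to keep the run short, so the absolute numbers are not comparable with the ones above.

```python
import hashlib
import timeit
import zlib

# Assumption: a small in-memory payload instead of the enlarged README.
payload = bytes(8 * 2**20)  # 8 MiB of zero bytes

candidates = {
    "CRC32": lambda: zlib.crc32(payload),
    "Adler32": lambda: zlib.adler32(payload),
    "MD5": lambda: hashlib.md5(payload).digest(),
    "SHA1": lambda: hashlib.sha1(payload).digest(),
}

timings = {}
for name, func in candidates.items():
    # Best of three single runs, converted to milliseconds.
    timings[name] = min(timeit.repeat(func, number=1, repeat=3)) * 1000
    print(f"{name:>8}: {timings[name]:.2f} ms")
```

Taking the minimum of several runs reduces the influence of other processes on the machine; the relative ordering of the methods is what matters, not the exact values.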

Conclusion#

Checksum functions provide fewer guarantees than hashes, but they are useful for checking integrity. In the benchmark above, Adler32 was a little more than twice as fast as CRC32 (425 ms versus 998 ms), so if cryptographic features are not needed and speed is paramount, using checksum functions over hashes seems to be a good option. Many systems use hashes as checksums for compatibility reasons. For the integrity check to be trustworthy, the checksum should be published or delivered through a secure channel to avoid man-in-the-middle attacks.