Base64 Encoding#
ToDo:
Add an example with temporary zip files using the tempfile module
Add relevant resources at the end
Bytes objects are commonly displayed in hexadecimal form (base 16), which uses the characters 0123456789ABCDEF. Each byte (8 bits) needs two hex characters. This can be derived as follows: 1 byte has 8 bits and each bit can be either 0 or 1, so there are \(2^8 = 256\) possible combinations. One hexadecimal character can take 16 different values, so representing 256 different values requires \(\log_{16}(256) = 2\) characters.
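The two-characters-per-byte rule can be checked directly in Python. The following is a minimal sketch; the variable name hex_forms is illustrative only.

```python
# Every possible byte value (0..255) rendered as hexadecimal text.
hex_forms = [bytes([value]).hex() for value in range(256)]

print(len(set(hex_forms)))           # 256 distinct representations
print({len(h) for h in hex_forms})   # {2}: every byte needs two characters
print(hex_forms[0], hex_forms[255])  # 00 ff
```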
Hexadecimal is used for simplicity, instead of long streams of 0s and 1s. However, bases larger than 16 can be used as well. This is where formats such as Base32 and Base64 come from: instead of using 2 (binary) or 16 (hexadecimal) different characters, they use 32 and 64 respectively.
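The effect of a larger base on the length of the representation can be estimated with a small helper. This is a sketch under the assumption that the base is a power of two; chars_needed is a hypothetical name and it ignores any padding characters.

```python
import math


def chars_needed(n_bytes: int, base: int) -> int:
    """Characters needed to write n_bytes in a power-of-two base, ignoring padding."""
    bits_per_char = (base - 1).bit_length()  # 1 for base 2, 4 for base 16, 6 for base 64
    return math.ceil(n_bytes * 8 / bits_per_char)


for base in (2, 16, 32, 64):
    print(f"base {base:>2}: {chars_needed(11, base)} characters for 11 bytes")
```

For 11 bytes this gives 88, 22, 18 and 15 characters. Real Base64 output is padded with = to a multiple of 4 characters, so 15 becomes 16 in practice.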
Python has the base64 module, which includes functions to encode to and decode from Base16, Base32, Base64 and Base85. The rest of this chapter will focus on Base64.
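As a quick illustration of the module's scope, the sketch below encodes the same bytes with each family of functions. The sample value b"hello" is an arbitrary choice for the example.

```python
import base64

data = b"hello"

print(base64.b16encode(data))  # b'68656C6C6F'
print(base64.b32encode(data))  # b'NBSWY3DP'
print(base64.b64encode(data))  # b'aGVsbG8='
print(base64.b85encode(data))

# Each encoder has a matching decoder that restores the original bytes.
round_trips = [
    base64.b16decode(base64.b16encode(data)),
    base64.b32decode(base64.b32encode(data)),
    base64.b64decode(base64.b64encode(data)),
    base64.b85decode(base64.b85encode(data)),
]
print(all(result == data for result in round_trips))  # True
```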
There are several variants of the Base64 encoding. Python’s standard library exposes two: Standard Base64 (RFC 4648 §4) and URL and Filename Safe Base64 (RFC 4648 §5).
The only difference between the standard and the URL-safe versions is the alphabet: the standard version uses the characters + and /, whereas the URL-safe version replaces them with - and _ respectively.
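The difference between the two alphabets can be made visible with input chosen so that the encoding contains only the two affected characters. This is a minimal sketch; the three sample bytes and the translation table are assumptions made for the illustration.

```python
import base64

# These three bytes split into the 6-bit values 62, 63, 62, 63,
# which are exactly the positions where the two alphabets differ.
data = bytes([0xFB, 0xFF, 0xBF])

standard = base64.standard_b64encode(data)
urlsafe = base64.urlsafe_b64encode(data)
print(standard)  # b'+/+/'
print(urlsafe)   # b'-_-_'

# Swapping the two characters converts one variant into the other.
to_urlsafe = bytes.maketrans(b"+/", b"-_")
print(standard.translate(to_urlsafe) == urlsafe)  # True
```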
Like the functions in hashlib, the base64 functions operate on bytes objects.
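In practice this means that text has to be converted to bytes before it is encoded. The sketch below shows the error raised for a str argument and a round trip through UTF-8; the sample text is arbitrary.

```python
import base64

text = "héllo wörld"

try:
    base64.b64encode(text)  # str is rejected: a bytes-like object is required
    error_message = None
except TypeError as exc:
    error_message = str(exc)

encoded = base64.b64encode(text.encode("utf-8"))  # str -> bytes -> Base64 bytes
encoded_text = encoded.decode("ascii")            # Base64 output is plain ASCII
decoded_text = base64.b64decode(encoded_text).decode("utf-8")

print(error_message)
print(encoded_text, decoded_text)
```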
Encoding != Compression#
Even though a Base64 string is shorter than the hexadecimal representation of the same bytes, because it uses a larger alphabet (more unique characters), it is not a compression algorithm. Base64 emits 4 characters for every 3 input bytes, so the encoded output is about 33% larger than the raw data. The easiest way to see this is to take a long repeated sequence of bytes like AAAAAAA: a good compression algorithm will exploit the redundancy and reduce the size. That is not what Base64 does.
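The repeated-bytes argument can be checked with a few lines. This sketch uses zlib as a stand-in for "a good compression algorithm"; that choice, and the length of 1000 bytes, are assumptions made for the example.

```python
import base64
import zlib

data = b"A" * 1000  # highly redundant input

encoded = base64.b64encode(data)   # encoding only
compressed = zlib.compress(data)   # actual compression

print(f"raw:        {len(data)} bytes")
print(f"base64:     {len(encoded)} characters")  # 1336, larger than the input
print(f"compressed: {len(compressed)} bytes")    # far smaller than the input
```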
That being said, a common pattern is to first zip the content of a file and then apply Base64 to the zipped bytes, so that the result is both compressed and safe to transport as text. The standard library has the zipfile module, which can compress the file; the resulting bytes can then be encoded in Base64.
Encoding/Decoding Example#
import base64
from pathlib import Path
Encoding#
data = Path("../_static/images/certificate_details_dns.png").read_bytes()
data = data[111:122] # To improve display
size = len(data)
data_base64 = base64.standard_b64encode(data)
data_base64_string = data_base64.decode("UTF-8")
data_base64_urlsafe = base64.urlsafe_b64encode(data)
data_base64_urlsafe_string = data_base64_urlsafe.decode("UTF-8")
data_base64_urlsafe_string
'wH-9SRDAgJ3xH_I='
Representation Comparison#
print(f"{'Comparing Representation':>40}")
print(f" Data (raw bytes): {data}")
print(f" Representation in bits (base 2): {bin(int(data.hex(), base=16))[2:]}")
print(f" Representation in hex (base 16): {data.hex()}")
print(f" Representation in base64: {data_base64_string}")
print(f"Representation in base64 URLSafe: {data_base64_urlsafe_string}")
Comparing Representation
Data (raw bytes): b'\xc0\x7f\xbdI\x10\xc0\x80\x9d\xf1\x1f\xf2'
Representation in bits (base 2): 1100000001111111101111010100100100010000110000001000000010011101111100010001111111110010
Representation in hex (base 16): c07fbd4910c0809df11ff2
Representation in base64: wH+9SRDAgJ3xH/I=
Representation in base64 URLSafe: wH-9SRDAgJ3xH_I=
Length Comparison#
print(f"{'Comparing Lengths':>40}")
print(f" Data size in bytes: {size} bytes")
print(f" Representation in bits (base 2): {size * 8} characters")
print(f" Representation in hex (base 16): {len(data.hex())} characters")
print(f" Representation in base64: {len(data_base64)} characters")
print(f"Representation in base64 URLSafe: {len(data_base64_urlsafe)} characters")
Comparing Lengths
Data size in bytes: 11 bytes
Representation in bits (base 2): 88 characters
Representation in hex (base 16): 22 characters
Representation in base64: 16 characters
Representation in base64 URLSafe: 16 characters
Decoding#
decoded_data_standard = base64.standard_b64decode(data_base64_string)
decoded_data_urlsafe = base64.urlsafe_b64decode(data_base64_urlsafe_string)
print(f" Original Data: {data.hex()}")
print(f"Decoded from Standard: {decoded_data_standard.hex()}")
print(f" Decoded from URLSafe: {decoded_data_urlsafe.hex()}")
Original Data: c07fbd4910c0809df11ff2
Decoded from Standard: c07fbd4910c0809df11ff2
Decoded from URLSafe: c07fbd4910c0809df11ff2
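The round trip can also be reproduced without the image file, starting from the 11 bytes whose hexadecimal form is printed above. This is a self-contained sketch; the variable names are illustrative.

```python
import base64

original = bytes.fromhex("c07fbd4910c0809df11ff2")  # the 11 bytes shown above

standard = base64.standard_b64encode(original)
urlsafe = base64.urlsafe_b64encode(original)
print(standard)  # b'wH+9SRDAgJ3xH/I='
print(urlsafe)   # b'wH-9SRDAgJ3xH_I='

# Each variant is decoded with its own decoder.
decoded_standard = base64.standard_b64decode(standard)
decoded_urlsafe = base64.urlsafe_b64decode(urlsafe)
print(decoded_standard == original and decoded_urlsafe == original)  # True
```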
Encoding/Decoding Compressed File#
A common use case is to combine Base64 with zip compression to get the advantages of both technologies. The official docs illustrate how to work with files in a filesystem. In this example everything is handled in memory, i.e. no temporary files or zip files are written to disk.
This all-in-memory approach is especially useful in circumstances where:
There is no filesystem per se (e.g. Lambda Architectures)
It is expensive to write to a file system (e.g. Cloud Solutions)
Writing to disk should be avoided for performance reasons (e.g. Real-Time APIs)
To work in memory, a BytesIO object is used instead of a filename. This object mimics the API of a file object and avoids any read/write operations on disk.
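A short sketch of the file-like behaviour of BytesIO, independent of zip files. The content written to the buffer is arbitrary.

```python
import io

buffer = io.BytesIO()        # empty in-memory "file"
buffer.write(b"hello ")
buffer.write(b"world")
position = buffer.tell()     # the cursor sits after the last written byte

buffer.seek(0)               # rewind, exactly as with a file on disk
content = buffer.read()

print(position)              # 11
print(content)               # b'hello world'
print(buffer.getvalue())     # full content, regardless of the cursor position
```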
import zipfile
import io
Encoding#
file_path = Path("00_Fundamentals.ipynb")
file_bytes = file_path.read_bytes()
file_base64_urlsafe = base64.urlsafe_b64encode(file_bytes)
memory_file = io.BytesIO() # Mimics the zip file in the filesystem
zip_handler = zipfile.ZipFile(memory_file, "w", zipfile.ZIP_DEFLATED)
with zip_handler:  # Context manager for automatic close
    zip_handler.writestr(file_path.name, file_bytes)
memory_file.seek(0) # Resets the pointer to the beginning of the file
zip_bytes = memory_file.read()
zip_base64_urlsafe = base64.urlsafe_b64encode(zip_bytes)
print(f"Size before compression: {len(file_base64_urlsafe)} characters")
print(f" Size after compression: {len(zip_base64_urlsafe)} characters")
print(f" Compressed size (% of original): {len(zip_base64_urlsafe) / len(file_base64_urlsafe) * 100:.2f}%")
Size before compression: 14156 characters
Size after compression: 5668 characters
Compressed size (% of original): 40.04%
Decoding#
The process for decoding is straightforward: instead of opening a zip file from the filesystem, the decoded bytes are wrapped in a BytesIO object, which allows the zipfile module to work with them as if they were a file.
zip_decoded = base64.urlsafe_b64decode(zip_base64_urlsafe)
zip_decoded_io = io.BytesIO(zip_decoded)
zip_decoded_io.seek(0)
zip_handler = zipfile.ZipFile(zip_decoded_io, "r")
decoded_unzipped_bytes = zip_handler.read(file_path.name)
if decoded_unzipped_bytes == file_path.read_bytes():
    print("The extracted file is the same")
else:
    print("The extracted file is NOT the same")
The extracted file is the same
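The encoding and decoding steps can be wrapped in two small helpers so that the round trip does not depend on a notebook file being present. This is a sketch only: zip_and_encode, decode_and_unzip, the archive member name and the sample payload are all hypothetical names and values chosen for the illustration.

```python
import base64
import io
import zipfile


def zip_and_encode(name: str, payload: bytes) -> bytes:
    """Zip payload in memory under the given member name, then Base64-encode it."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(name, payload)
    return base64.urlsafe_b64encode(buffer.getvalue())


def decode_and_unzip(encoded: bytes, name: str) -> bytes:
    """Reverse zip_and_encode: Base64-decode, then read the member from the archive."""
    buffer = io.BytesIO(base64.urlsafe_b64decode(encoded))
    with zipfile.ZipFile(buffer, "r") as archive:
        return archive.read(name)


payload = b"Base64 is an encoding, not a compression algorithm.\n" * 200
encoded = zip_and_encode("example.txt", payload)
restored = decode_and_unzip(encoded, "example.txt")
print(restored == payload)  # True
```

Using getvalue() returns the whole buffer regardless of the cursor position, so no seek(0) is needed before reading the zipped bytes.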
Conclusion#
Base64 is an encoding mechanism that converts long sequences of bytes into text. Python's base64 module exposes two variants of it: standard and URL-safe.
Base64 is not a compression algorithm, but it can be combined with the zipfile module to compress bytes or files and then encode them in Base64. The resulting bytes will be both compressed and encoded.
When working with compression, it is possible to use temporary zip files or to keep everything in memory.
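For the temporary-file variant, one possible approach is to use tempfile.TemporaryDirectory so that the zip files are removed automatically. This is a sketch under assumptions: the file names payload.zip, restored.zip and payload.txt, and the sample payload, are hypothetical.

```python
import base64
import tempfile
import zipfile
from pathlib import Path

payload = b"example payload\n" * 100

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write a real zip file inside the temporary directory.
    zip_path = Path(tmp_dir) / "payload.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("payload.txt", payload)
    encoded = base64.urlsafe_b64encode(zip_path.read_bytes())

    # Decode back to a second zip file and extract the member.
    restored_path = Path(tmp_dir) / "restored.zip"
    restored_path.write_bytes(base64.urlsafe_b64decode(encoded))
    with zipfile.ZipFile(restored_path, "r") as archive:
        restored = archive.read("payload.txt")

# The directory and both zip files are deleted when the block exits.
print(restored == payload)  # True
print(zip_path.exists())    # False
```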