# Salt

**ToDo**:
- Add illustration for the salting process - Similar to [this](https://www.php.net/manual/es/images/2a34c7f2e658f6ae74f3869f2aa5886f-crypt-text-rendered.svg) and [this](https://forum.huawei.com/enterprise/en/data/attachment/forum/201901/21/003213sdbhypkqzdbdplfp.png?cf490-password_hashing.png).
- Explain the concept of Key Derivation Function - Similar to [this](https://en.bitcoinwiki.org/upload/en/images/0/0e/Keyderivationfuncion.gif) and [this](https://i.stack.imgur.com/z37E4.png).
- Add resources to other salt compatible algorithms - Argon2 for instance.
- Add appendix regarding read/write hashes to database.
- Add relevant resources at the end
---

When the data being hashed tends to be the same for many users (e.g. commonly used passwords), a compromised database of hashes is easy to attack: the attacker only has to check whether any stolen hash appears in a pre-computed table. Examples of this are [Rainbow Table Attacks](https://en.wikipedia.org/wiki/Rainbow_table) and [Dictionary Attacks](https://en.wikipedia.org/wiki/Dictionary_attack).
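To make the threat concrete, here is a minimal sketch of a pre-computed table lookup against unsalted hashes. The password list, the user emails, and the use of plain SHA-256 are assumptions chosen for illustration; real attacks use far larger tables.

```python
import hashlib

# Hypothetical list of common passwords an attacker might pre-compute.
common_passwords = ["123456", "password", "qwerty", "password123"]

# Pre-computed lookup table: unsalted SHA-256 digest -> plaintext.
lookup_table = {
    hashlib.sha256(pw.encode("utf-8")).hexdigest(): pw
    for pw in common_passwords
}

# Hypothetical leaked database of unsalted hashes.
leaked_hashes = {
    "alice@example.com": hashlib.sha256(b"password123").hexdigest(),
    "bob@example.com": hashlib.sha256(b"correct horse battery staple").hexdigest(),
}

# One dictionary lookup per user is enough to recover weak passwords.
recovered = {
    user: lookup_table[digest]
    for user, digest in leaked_hashes.items()
    if digest in lookup_table
}
print(recovered)  # {'alice@example.com': 'password123'}
```

Only the password that appears in the table is recovered, and the table can be reused against every unsalted database the attacker obtains.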

If a random array of bytes is added to the data before hashing, the results will not match any pre-computed table an attacker might have. The same random bytes must be stored along with the hash so that it can be recomputed and validated later. These bytes are called **salt** and they are not secret (i.e. they can be stored in plain text). Even though one might think that salts should be kept secret, there are other vulnerabilities associated with *secret salts*, such as [Length Extension Attacks](https://www.wikiwand.com/en/Length_extension_attack).
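The effect of a salt can be sketched in a few lines. This example assumes `hashlib.scrypt` (the function this document goes on to describe) with small cost parameters, and a 16-byte salt per user; the variable names are illustrative only.

```python
import hashlib
import secrets

password = b"password123"

# Two users with the same password but independently generated salts.
salt_a = secrets.token_bytes(16)
salt_b = secrets.token_bytes(16)

hash_a = hashlib.scrypt(password, salt=salt_a, n=64, r=8, p=1)
hash_b = hashlib.scrypt(password, salt=salt_b, n=64, r=8, p=1)

# Different salts give unrelated hashes for the same password...
print(hash_a != hash_b)
# ...while reusing the stored salt reproduces exactly the same hash.
print(hashlib.scrypt(password, salt=salt_a, n=64, r=8, p=1) == hash_a)
```

Because each stored hash has its own salt, a table pre-computed without that salt is of no use, and the attacker has to attack every entry separately.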

## Scrypt

The `hashlib` module provides the function [`scrypt`](https://docs.python.org/3/library/hashlib.html#hashlib.scrypt), which offers a convenient interface for adding salts to hashes. It implements the algorithm specified in [RFC 7914](https://datatracker.ietf.org/doc/html/rfc7914.html), which was designed to be memory-intensive in order to make GPU, ASIC and FPGA attacks more expensive.

**Note:** The salt can be of any length; however, a minimum of 128 bits (16 bytes) is recommended.

In [1]:
import secrets
import hashlib

In [2]:
random_bytes = secrets.token_bytes(10)
data = b"Hello World!"
data_hashed = hashlib.scrypt(data, salt=random_bytes, n=64, r=8, p=1).hex()
salt_string = random_bytes.hex()

print(f"   Original: {data}")
print(f"     Hashed: {data_hashed}")
print(f"Salt+Hashed: {salt_string}:{data_hashed}")

   Original: b'Hello World!'
     Hashed: b968d05f9edd796a5db4647437d2c9a30a3c1fddb120c25f5218f3792dd7801ec89974f3e2bd1286c470a989d4797dc8939fadef138056cd9ce0e073071a569f
Salt+Hashed: 6a016b6916163532bcbf:b968d05f9edd796a5db4647437d2c9a30a3c1fddb120c25f5218f3792dd7801ec89974f3e2bd1286c470a989d4797dc8939fadef138056cd9ce0e073071a569f


## Scrypt Disadvantages

The hash depends on several parameters: **n**, **r**, **p** and the **salt**. Unless they are hardcoded in the source code, it is good practice to store them along with the hash, since the hash cannot be recomputed without them.

The delimiter character is usually either "**$**" or "**:**". Another option is to use separate columns in the database. Note that neither character is URL-safe; this is acceptable because hashes should never be part of a URL.
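A delimited record is only useful if it can be parsed back for verification. The following is a minimal sketch of both directions, assuming the field order `n$r$p$salt$hash` with hex-encoded salt and hash; the helper names `encode_record` and `verify_record` are hypothetical and not part of `hashlib`.

```python
import hashlib
import secrets


def encode_record(password: bytes, n: int = 64, r: int = 8, p: int = 1) -> str:
    """Hash the password and pack n, r, p, salt and hash into one string."""
    salt = secrets.token_bytes(16)
    hashed = hashlib.scrypt(password, salt=salt, n=n, r=r, p=p)
    return f"{n}${r}${p}${salt.hex()}${hashed.hex()}"


def verify_record(password: bytes, record: str) -> bool:
    """Recompute the hash with the stored parameters and compare."""
    n, r, p, salt_hex, hash_hex = record.split("$")
    expected = bytes.fromhex(hash_hex)
    candidate = hashlib.scrypt(
        password,
        salt=bytes.fromhex(salt_hex),
        n=int(n),
        r=int(r),
        p=int(p),
        dklen=len(expected),
    )
    # Constant-time comparison of the two byte strings.
    return secrets.compare_digest(candidate, expected)


record = encode_record(b"Hello World!")
print(verify_record(b"Hello World!", record))  # True
print(verify_record(b"hello world!", record))  # False
```

Storing the parameters in the record means they can be raised later for new hashes without breaking verification of the old ones.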

In [3]:
random_bytes = secrets.token_bytes(10)
data = b"Hello World!"

n = 2 ** 6
r = 8
p = 1
data_hashed = hashlib.scrypt(data, salt=random_bytes, n=n, r=r, p=p).hex()
salt_string = random_bytes.hex()

print(f"n+r+p+Salt+Hashed: {n}${r}${p}${salt_string}${data_hashed}")

n+r+p+Salt+Hashed: 64$8$1$1181b0a77a9c3075e8d3$20f5e4766227ee7c126177cb2cb44db13c34750209eb2ab7e004d29cfa1e306e3d5933148e72ba3ba5db6c052a6084aed7249dc1bbc5a9cf8683cfe23d90a245


## Scrypt Bonuses

Using the scrypt function also provides some additional advantages.

### Customizing Length

There is an additional parameter, **`dklen`**, which sets the length in bytes of the derived key, so hashes of arbitrary length can be generated. The default value is 64.

**Important Note:** Using short lengths will increase the risk of collisions.

In [4]:
data = b"Hello World!"

In [5]:
length = 2**2
random_bytes = secrets.token_bytes(10)

data_hashed = hashlib.scrypt(data, salt=random_bytes, n=64, r=8, p=1, dklen=length).hex()
salt_string = random_bytes.hex()

print(f"Salt+Hashed: {salt_string}:{data_hashed}")

Salt+Hashed: ab3e8d9ac23a68fdc48e:a3083313


In [6]:
length = 2**4
random_bytes = secrets.token_bytes(10)

data_hashed = hashlib.scrypt(data, salt=random_bytes, n=64, r=8, p=1, dklen=length).hex()
salt_string = random_bytes.hex()

print(f"Salt+Hashed: {salt_string}:{data_hashed}")

Salt+Hashed: 072da1f6a7cd465026b8:3abe1ef104f20e1a9796aa0ffcb865ee


In [7]:
length = 2**6
random_bytes = secrets.token_bytes(10)

data_hashed = hashlib.scrypt(data, salt=random_bytes, n=64, r=8, p=1, dklen=length).hex()
salt_string = random_bytes.hex()

print(f"Salt+Hashed: {salt_string}:{data_hashed}")

Salt+Hashed: 1fe6099ad8b92ac248a3:30e34385576d6f38cac26f291cbc0a2a2f82944630741dc1cea6ce9d12d56f2a3676400c3ab079b120533d6a34c0136128194fa7599cd96ab03072f8602686e6
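The relation between `dklen` and the printed output can be checked directly. This small sketch reuses the same three lengths as the cells above and assumes a single 16-byte salt shared by all three calls, purely to keep the example short.

```python
import hashlib
import secrets

data = b"Hello World!"
salt = secrets.token_bytes(16)

for dklen in (4, 16, 64):
    key = hashlib.scrypt(data, salt=salt, n=64, r=8, p=1, dklen=dklen)
    # dklen is measured in bytes; the hex string is twice as long.
    print(dklen, len(key), len(key.hex()))
```

This prints `4 4 8`, `16 16 32` and `64 64 128`, which matches the 8, 32 and 128 hexadecimal characters shown in the outputs above.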


### Customizing Execution Time

In some applications, making the hash slower increases security, because every guess costs an attacker more. The scrypt algorithm can be tuned to take different amounts of memory and processing time.

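The memory used is approximately 128 * n * r * p bytes. This is an upper bound that assumes all **p** lanes run in parallel; an implementation that processes the lanes one after another needs roughly 128 * n * r bytes.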

Examples:

- Low Memory Footprint: 128 * 64 * 8 * 1 = 64 KB
- Large Memory Footprint: 128 * 2^17 * 8 * 1 = 128 MB

In [8]:
n = 2 ** 20
r = 8
p = 1
memory_bytes = 128 * n * r * p
memory_kilo_bytes = memory_bytes / 1024
memory_mega_bytes = memory_kilo_bytes / 1024
print(f"Memory Consumed: {memory_bytes} bytes = {memory_kilo_bytes:.2f} KB = {memory_mega_bytes:.2f} MB")

Memory Consumed: 1073741824 bytes = 1048576.00 KB = 1024.00 MB
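The memory estimate matters in practice because `hashlib.scrypt` accepts a `maxmem` argument, and the Python documentation notes that OpenSSL 1.1.0 defaults to a 32 MiB limit. The sketch below assumes an OpenSSL-backed build, where parameters that exceed the limit are expected to be rejected with a `ValueError`; the fixed salt and the choice of twice the estimate for `maxmem` are illustrative assumptions.

```python
import hashlib

data = b"Hello World!"
salt = b"0123456789abcdef"  # fixed 16-byte salt, for illustration only

n, r, p = 2**15, 8, 1
required = 128 * n * r * p  # 32 MiB according to the formula above

try:
    # With the default maxmem, the library's built-in limit applies.
    hashlib.scrypt(data, salt=salt, n=n, r=r, p=p)
    print("default memory limit was sufficient")
except ValueError as error:
    print(f"rejected with default maxmem: {error}")

# Allow twice the estimated footprint to leave room for overhead.
key = hashlib.scrypt(data, salt=salt, n=n, r=r, p=p, maxmem=2 * required)
print(len(key))  # 64
```

For larger values of **n**, such as the 2^20 example above, `maxmem` has to be raised accordingly and the machine needs that much free memory.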


#### Code Examples

In [9]:
data = b"Hello World!"

In [10]:
%%timeit
n = 2**6  # 64
random_bytes = secrets.token_bytes(16)
data_hashed = hashlib.scrypt(data, salt=random_bytes, n=n, r=8, p=1).hex()

426 µs ± 9.73 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [11]:
%%timeit
n = 2**12
random_bytes = secrets.token_bytes(16)
data_hashed = hashlib.scrypt(data, salt=random_bytes, n=n, r=8, p=1).hex()

27 ms ± 949 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [12]:
%%timeit
n = 2**14
random_bytes = secrets.token_bytes(16)
data_hashed = hashlib.scrypt(data, salt=random_bytes, n=n, r=8, p=1).hex()

108 ms ± 2.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
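Since timings like the ones above depend on the hardware, one possible approach is to measure them at start-up and pick **n** accordingly. The following is a rough sketch under that assumption: `calibrate_n` is a hypothetical helper, the 50 ms target is arbitrary, and a single measurement per value is noisy, so results will vary between runs.

```python
import hashlib
import secrets
import time


def calibrate_n(target_seconds: float, max_exponent: int = 14) -> int:
    """Return the smallest power-of-two n whose hash takes at least target_seconds.

    The search stops at 2**max_exponent so that, with r=8 and p=1,
    the default memory limit is not exceeded.
    """
    salt = secrets.token_bytes(16)
    for exponent in range(6, max_exponent + 1):
        n = 2**exponent
        start = time.perf_counter()
        hashlib.scrypt(b"Hello World!", salt=salt, n=n, r=8, p=1)
        elapsed = time.perf_counter() - start
        if elapsed >= target_seconds:
            return n
    # No candidate was slow enough; fall back to the largest allowed value.
    return 2**max_exponent


chosen_n = calibrate_n(target_seconds=0.05)
print(f"n = {chosen_n}")
```

Whatever value is chosen should be stored with the hash, as discussed in the Scrypt Disadvantages section, so that existing hashes can still be verified after the parameters change.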


## Example: User Authentication

The following example showcases how scrypt can be used to securely store passwords.

**Note: Remember that this code is not suitable for any production use-case.**

### Auxiliary Functions

In [13]:
def generate_hash(data: str, salt: bytes) -> str:
    """Return "salt:hash" (both hex-encoded) for the given data and salt."""
    data_bytes = data.encode("utf-8")
    data_hashed = hashlib.scrypt(data_bytes, salt=salt, n=64, r=8, p=1)
    return f"{salt.hex()}:{data_hashed.hex()}"


def sign_up(email, password, database_):
    # Work on a copy so the caller's dictionary is not mutated.
    database = database_.copy()
    random_bytes = secrets.token_bytes(10)
    database[email] = generate_hash(password, random_bytes)
    print("Successfully Signed Up")
    return database


def login(email, password, database):
    if email not in database:
        print(f"ERROR: User {email} not in Database")
        return

    # The stored record has the form "salt:hash"; reuse its salt to rehash.
    expected_password = database[email]
    salt, hashed = expected_password.split(":")
    salt_bytes = bytes.fromhex(salt)
    calculated_hash = generate_hash(password, salt_bytes)
    # Constant-time comparison avoids leaking information through timing.
    passwords_matched = secrets.compare_digest(expected_password, calculated_hash)
    if passwords_matched:
        print(f"Successfully Signed in: {email}")
        return

    print(f"ERROR: Incorrect Password for: {email}")

### Sign Up

In [14]:
email = "johndoe@example.com"
password = "password123"
user_database = {}

user_database = sign_up(email, password, user_database)

Successfully Signed Up


### Wrong Email

In [15]:
email = "janedoe@example.com"
password = "password123"

login(email, password, user_database)

ERROR: User janedoe@example.com not in Database


### Wrong Password

In [16]:
email = "johndoe@example.com"
password = "password"

login(email, password, user_database)

ERROR: Incorrect Password for: johndoe@example.com


### Successful Login

In [17]:
email = "johndoe@example.com"
password = "password123"

login(email, password, user_database)

Successfully Signed in: johndoe@example.com


## Conclusion

Hash algorithms on their own have a drawback: anyone can pre-compute the hashes of the most commonly used values and simply compare them one by one against stolen hashes, searching for matches. However, because a small change in the input of a cryptographic hash produces a big change in its output, simply adding some random bytes to the data greatly mitigates this risk. These bytes are called **salt**; they are public and stored along with the hash. One of the algorithms that uses a salt is [`scrypt`](https://docs.python.org/3/library/hashlib.html#hashlib.scrypt), which is available in the standard library. This particular algorithm also allows customization of memory consumption and processing time.