Should I Hash Private Data?

How I learned to stop worrying about hashes and love PRFs

Sarai Rosenberg
10 min read · Aug 12, 2020


Let’s, uhhhhh, hash this out a bit, because the distinction between “when is a hash okay” and “when is a hash not okay” is very important for teaching software engineers who will often use hashes in their work.

As summarized by renowned security and privacy expert Dr. Lea Kissner,

Hash functions are tricky and I’ve seen a lot of people think they throw away all of the information you put in them. But they don’t, which is the precise reason why they’re useful.

A hash is a “digest”. Generally, hashes identify the data that was hashed in a one-to-one relationship. Hashes have subtle properties that make them useful to attackers, which undermines security and privacy. As Dr. Sophie Schmieg (Google Lead ISE Crypto) remarked, hashes are in a dangerous grey area between “don’t use hashes for anything you want to keep secret” and “a hash is a one-way function with meaningless output”.

To set expectations: the target audience is software and security engineers, and the goal is a gentle introduction to the reasoning and methods of applied cryptography that software and security engineers may use in their work. This is not a resource for enterprise organizations—in particular, I do not address risks for an attacker that has access to keys. This is not a resource for rigorous data anonymization (see, e.g., this anonymization bibliography).


Hashes as Identifiers

Using a hash on low-entropy data is very vulnerable to brute-force attack: rainbow tables cover common inputs, and anything under ~80 bits of entropy could be brute-forced by a determined attacker with resources and time. Beyond direct brute forcing: the whole point of a hash is that it is a deterministic identifier of data. Because it is deterministic, a hash can detect tampering: if the data is changed, the hash of that data will also change, which provides simple tamper detection — provided that the hash you use to verify is not under the attacker’s control.
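A minimal sketch of that tamper-detection property in Python (the record contents are hypothetical):

```python
import hashlib

# Hash a record once, store the digest, and recompute later to detect tampering.
record = b"alice@example.com,2020-08-12,premium"
stored_digest = hashlib.sha256(record).hexdigest()

# Later: any modification to the record changes the digest.
tampered = b"alice@example.com,2020-08-12,enterprise"
assert hashlib.sha256(record).hexdigest() == stored_digest    # unchanged data verifies
assert hashlib.sha256(tampered).hexdigest() != stored_digest  # tampering is detected
```

This only helps if `stored_digest` itself is kept out of the attacker's reach; an attacker who can rewrite both the record and the digest defeats the check.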

Let’s say we hashed every name, email, and phone number in a database of PII. It sounds like we could pass around the hash to safely use machine learning or identify “yes we’re talking about the same user”. The first issue is tracing: an attacker also identifies “ah, the user assigned here is the same user assigned here, which matches the hash I found in haveibeenpwned.com”. If your data is public, and especially if it is important to someone — you, the user, or the attacker! — it may have been hashed and can be identified by the hash. (Imagine a powerful data broker that collects and indexes large quantities of data. What would they be able to do with hashes of sensitive data?)
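Here is that matching attack as a minimal sketch (hypothetical email addresses): the attacker never reverses the hash; they hash their own candidate list and compare.

```python
import hashlib

# An attacker who obtains "anonymized" hashes can match them against a
# candidate list (a breach corpus, a data broker's index, a public dump).
leaked_hashes = {hashlib.sha256(b"alice@example.com").hexdigest()}

candidates = [b"bob@example.com", b"alice@example.com", b"carol@example.com"]
matches = [c for c in candidates if hashlib.sha256(c).hexdigest() in leaked_hashes]

# The hash identified the user even though it "can't be reversed".
assert matches == [b"alice@example.com"]
```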

If a hash is used to identify data, but the data should be private, why not use a UUID? A random UUID provides a unique identifier with no computable relationship to the underlying data. There’s still a tracking problem, but that’s an issue inherent to any system that uses identifiers.
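A sketch of the UUID alternative using Python's stdlib; unlike a hash, the identifier carries no information about the data it labels, so an attacker cannot regenerate it from a name, email, or phone number.

```python
import uuid

# A random (version 4) UUID is independent of the user's data:
# nothing about the name, email, or phone number can reproduce it.
user_id = uuid.uuid4()

# Store the {user_id: user_record} mapping under access control;
# share only user_id with downstream systems.
```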

So a system that hashes small pieces of low entropy data can have some risks, but what about large data blobs? Maybe you took a picture of a passport and hash that to identify or detect tampering. Maybe a software company asks users to share a screenshot of their car registration in the DMV portal. If there is only one copy of each large data blob that is hashed, this is relatively private and secure. However, if this is a picture that is uploaded to cloud providers, shared on social media, or published on a website, the hash now serves as an abusable identifier. Just as software engineers may use a hash digest as an identity to reference data, attackers can similarly query on a hash against public data or against their private data stockpile.


Hashes and Threat Models

“But our data and our hash is private!” — Ideally, yes! But sometimes hashes of data are exposed by vulnerabilities in system design or by attackers dumping data.

In general, assume that anyone who can access the hash of sensitive data should also have access to the sensitive data — in particular when the hash is stored as a non-ephemeral key or identifier. Stay with me for a sec, because that’s a bold and wide assertion. While it is, indeed, generally infeasible to reverse the one-to-one mapping from data to its hash output, the comments above show some of the dangers of creating an easily computable value that attackers could abuse through match detection and identifying data in large data sets, among other threats. There are plenty of scenarios where a hash is a solution that provides reasonable security and privacy for the intended threat model.

Hashes are used Extremely Frequently in tech, especially in startups, as a handy way to obscure data while providing a tag or an integrity check. That’s Probably Okay for your current threat model: it is genuinely difficult to find a preimage of a hash, the benefits of tracing data across a small system often outweigh the risks, and you’re probably not a high-value target for an APT (though some of your customers might be 👀).

I offer two concerns and an alternative for hashing private data even when it is a reasonable choice for your threat model:

1. Is this use of hashing a dangerous example?

While you may have carefully assessed the risks against your current threat model, software engineers are often rushed to deliver and they may latch onto an example that appears similar. Is hashing the go-to solution that you want software developers to turn to when handling private data?

2. What if your threat model changes?

Zoom was widely criticized for security and privacy flaws when their usage and publicity skyrocketed in March 2020. Many of these issues were acceptable risks for Zoom’s 2019 threat model (especially for the corporate market) — but unacceptable for a product with high, public, and sensitive usage by schools and governments.


Why not use a PRF?

Pseudorandom function families (PRFs) use a key to produce a function that maps data to an output that cannot be distinguished from random data. A PRF is by definition a deterministic method to hide data. PRFs are comparable to hashes in speed.

In general, a Message Authentication Code (MAC) is a method of generating a tag from data using a key, which provides verification of authenticity and integrity. An HMAC is a hash-based MAC: a construction that turns a hash algorithm, supplied as a parameter, into a MAC. Every HMAC is a type of PRF, and every PRF is a MAC.

If you need a deterministic identifier for private data, why not use a PRF? If you need tamper detection on private data, why not use a PRF? If a party who uses a hash has access to the data, you might as well use a PRF and use the same access control list (ACL) for the key used with the PRF. If a party who uses a hash does not have access to the data [currently], you still might as well use a PRF, because they can’t verify the hash without the data anyway.
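A minimal sketch of a PRF as a deterministic identifier, using HMAC-SHA256 from the Python stdlib. The key handling here is simplified for illustration; in practice the key would live in a KMS under the same ACL as the data.

```python
import hmac
import hashlib
import secrets

key = secrets.token_bytes(32)  # in practice: managed by your KMS, ACL'd like the data

def private_id(data: bytes) -> str:
    """Deterministic identifier for private data: same data -> same tag,
    but the tag is useless to anyone without the key."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()

tag1 = private_id(b"alice@example.com")
tag2 = private_id(b"alice@example.com")
assert tag1 == tag2  # deterministic, so it still works as a join key
```

Determinism is preserved, so everything a hash gave you as an identifier still works; what the attacker loses is the ability to compute the tag themselves.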

A PRF works with data that is high entropy, low entropy, or no entropy! PRFs are tasty, nutritious, and fun for the whole family. (PRF assertions void where prohibited by law.)


What is a MAC/PRF? What about salted hashes?

Like an HMAC, you can think of a PRF as a family of functions where choosing a key K from keyspace 𝒦 produces a function that maps data from message space 𝓜 to binary output of length k — in other words, PRF:𝓜⨯𝒦→{0,1}ᵏ and PRFᴷ:𝓜→{0,1}ᵏ. A hash function H similarly maps data from message space 𝓜 to binary output of length k (e.g., for SHA512, k=512) — H:𝓜→{0,1}ᵏ. Both a hash and a PRF are functions that are “hard” to reverse. A distinguishing characteristic of a PRF (and of MACs in general) is that it uses a key.
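In code, choosing a key selects one member of the family. A sketch with HMAC-SHA256 (toy hardcoded keys, for illustration only; real keys should be random and managed):

```python
import hmac
import hashlib

msg = b"555-0123"

# Each key K selects a different function PRF_K from the family.
out_k1 = hmac.new(b"key-one", msg, hashlib.sha256).digest()
out_k2 = hmac.new(b"key-two", msg, hashlib.sha256).digest()

assert out_k1 != out_k2   # different keys, unrelated-looking outputs
assert len(out_k1) == 32  # k = 256 bits for HMAC-SHA256
```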

For a hash h produced by hash function H, a first preimage attack is finding a message m where H(m)=h. The cost of brute-forcing a preimage is governed by the entropy of the message m: roughly 2ᵇ guesses for b bits of entropy.

N.b.: For practical purposes, both a hash and a PRF are effectively injective (sometimes “one-to-one function”), because the set of all messages that will ever be sent (past and future) is far smaller than the output space, e.g., |{0,1}¹²⁸|=2¹²⁸. In principle, neither a hash nor a PRF can be injective, because the set of all possible messages is countably infinite.

A cryptographic salt is a random string that is stored next to a message and concatenated with the message in a “salted hash”. Salting the hash adds entropy, but it mostly kicks the hashing problem further down the road — and remember that salts are not secrets. A salted hash may be a good solution for some systems — but do you see how the potential gotchas become harder to assess as we dive into nuances?
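A sketch of a salted hash with a hypothetical phone number. Note that the salt sits right next to the digest, so it defeats precomputed rainbow tables but buys nothing against an attacker who holds both and is willing to brute-force a low-entropy input:

```python
import hashlib
import os

# The salt is random, but it is stored alongside the hash and is NOT a secret.
salt = os.urandom(16)
phone = b"555-0123"
digest = hashlib.sha256(salt + phone).hexdigest()

# Verification recomputes with the stored salt -- and so can an attacker,
# enumerating every possible phone number until the digest matches.
assert hashlib.sha256(salt + phone).hexdigest() == digest
```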

Let’s go back to PRFs. The power and nature of a MAC is in its key, which we can see by looking at preimage attacks. A preimage attack on a PRF requires finding a key K and a message m where PRF(K,m)=h. The preimage space here is 𝓜⨯𝒦, which is astronomically vast and computationally extraordinarily far beyond feasibility. We’re talking “whoops I accidentally a whole cartesian product” measures of large, because that’s what happened.

I glossed over some subtleties here. Brute forcing isn’t the only weakness in hashes or MACs—but brute force problems demonstrate a, uh, “key” distinction between a hash and a MAC. Also, a preimage attack isn’t the same as finding the original message—but if you find a second message that produces the same hash as a specified message hash with any modern hash function, that is a second preimage attack, and it is more valuable than almost any message!

At this point, you might be thinking “ah, so if keys are the secret ingredient, why don’t I just concatenate a cryptographic pepper into my hash”—and I’m gonna stop you right there, because you’re rolling your own crypto. Use a PRF.


Why not use a signature, or a …?

Oof. This is getting into the weeds. I want to call out three properties that we need to meet, which are met by few hash alternatives:

1. HMAC is much faster than signing

While hashing, HMACs, and PRFs are fast (microseconds), signing with a digital signature is slow—on the order of milliseconds.

A signature combines a cryptographic hash of the data with a signing algorithm like ECDSA or RSA. Benchmarks vary by algorithm and key size, but a signature operation is typically 100–1000 times slower than a hash/HMAC/PRF. Compared with RSA, ECDSA is somewhat faster at signing but slower at verifying — and verifying is typically more frequent than signing.
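A rough timing sketch using only the stdlib; it covers the hash and HMAC side (signature timings would need a third-party library, so the 100–1000x gap quoted above is not re-measured here). The point is that the hash and the HMAC land in the same ballpark:

```python
import hashlib
import hmac
import time

msg = b"x" * 1024
key = b"k" * 32  # toy key for timing purposes only

def bench(fn, n=10_000):
    """Average seconds per call over n iterations."""
    start = time.perf_counter()
    for _ in range(n):
        fn()
    return (time.perf_counter() - start) / n

t_hash = bench(lambda: hashlib.sha256(msg).digest())
t_hmac = bench(lambda: hmac.new(key, msg, hashlib.sha256).digest())

# Both are microsecond-scale; a per-message signature would be
# orders of magnitude slower.
print(f"sha256: {t_hash * 1e6:.1f} us, hmac-sha256: {t_hmac * 1e6:.1f} us")
```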

2. HMAC provides confidentiality in addition to authenticity

By definition, a PRF such as an HMAC guarantees that the output is indistinguishable from random noise for anyone who doesn’t have the key. A digital signature does not necessarily provide that guarantee. For both ECDSA and RSA, recovering the message is hard but can be brute forced by a determined attacker, much like a hash.

Digital signatures are designed to provide authenticity. Signatures are typically sent with a message, and anyone with the public key can verify authenticity of the signature. An HMAC provides authenticity for anyone with the key, but the key must be shared in order to verify authenticity—and anyone with that single key can forge a message.
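A sketch of that symmetric verification with a shared key (toy key for illustration). `hmac.compare_digest` does the comparison without leaking timing information; and note that anyone holding `key` could equally well have produced the tag, which is exactly the forgery caveat above.

```python
import hmac
import hashlib

key = b"shared-secret-key"  # toy key; every holder can also forge tags

def tag(message: bytes) -> bytes:
    return hmac.new(key, message, hashlib.sha256).digest()

def verify(message: bytes, received_tag: bytes) -> bool:
    # compare_digest avoids timing side channels during comparison
    return hmac.compare_digest(tag(message), received_tag)

t = tag(b"transfer $10 to bob")
assert verify(b"transfer $10 to bob", t)
assert not verify(b"transfer $9999 to bob", t)
```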

3. HMAC is a relatively easy and widely supported choice for a software engineering or security engineering team

Again: nuances! As we collect nuances we have to navigate, a software engineer’s job becomes harder and harder in considering “when should I use a hash vs UUID vs signature vs MAC”. A PRF or a MAC is a safe tool to reach for from the shelf: few footguns, relatively easy to operate, and plenty of cross-language cryptographic library support. …sort of.


Conclusion and a growing list of caveats

TLDR*: If you want a secure and private identifier for a private object, use a UUID. If you want to verify authenticity or integrity for a private object, use a MAC/PRF.

*TLDR-FML: A general guide, with some exceptions and qualifiers (TYSM for calling out that every universal statement has qualifiers, even this one) — if you want a secure and private identifier for an object, use a UUID**. If you want tamper detection on private data, use a MAC/PRF. Sure, not perfect. Just a general guideline, with some exceptions††. Look it’s a TLDR, I don’t know what you want me to say here to address special cases that might be threatened by my general guidance summary statement. Choices and consequences change substantially when an active aggressor is involved in contrast with passive aggressors that aggregate data and attack when they hold a strong position.

**Or a pseudorandom function family (PRF). Yes, a UUID means storing that in a database. Yes, a PRF also means storing a key and handling key management, which turns a data privacy problem into a key management problem. Congratulations, we summarized cryptography: it’s all about PRFs and key management.***

***Not All Cryptography

Well, Actually, the definition of a MAC technically does not guarantee that it hides input, and a secure PRF asserts that it hides input but does not provide constant time verification in most (any?) cryptographic APIs.

††TLDR-FML-WTF: Yes there are a lot of exceptions. Hire a cryptography consultant, or††† Sophie Schmieg will yell at you about PRFs. (†††Intentionally not XOR.)

††††setting aside space for further caveats and addendums
