Mapping a Formal Definition of Collision Resistance to Existing Implementations of Hash Functions

This is part 3 of a series on omission-tolerant integrity protection and related topics. See also part 1 and part 2. A technical report on the topic is available on this site and in the IACR ePrint Archive.

When a cryptographic hash is used as a checksum of a bit string, security depends on the collision resistance of the hash function. When the root label of a typed hash tree is used as an omission-tolerant checksum of a set of key-value pairs, security also depends on the collision resistance of a hash function, in this case of the hash function that is used to construct the typed hash tree. (It may also depend on the preimage resistance of the hash function depending on how the set is encoded, as we shall see in another post of this series.) The formal proof of security in the technical report is therefore based on a formal definition of collision resistance. But, surprising as this may seem, there is no standard, widely used, formal definition of collision resistance for the keyless, or unkeyed hash functions such as SHA256 that are used in practice.

Collision resistance is hard to define

The concept of a hash function collision is simple and clear enough: it is a pair of distinct elements of the domain of the hash function that have the same image. The concept of collision resistance, on the other hand, is difficult to define. The reason for this, as explained for example by Katz and Lindell in their Introduction to Modern Cryptography (Chapman & Hall/CRC, 2nd edition, 2014, bottom of page 155), is that collisions exist, and for any collision there exists a trivial algorithm that hardcodes the collision and outputs it in constant time.

An informal definition of collision resistance is given by Menezes, van Oorschot and Vanstone in their Handbook of Applied Cryptography (CRC Press, 1997, Section 9.2.2, page 324), where a hash function is said to be collision resistant if it is “computationally infeasible” to find a collision. This definition has the problem that, by any definition of computational feasibility, it is feasible for an algorithm that hardcodes any of the existing collisions to output the collision.

Formal definitions of collision resistance in an asymptotic security setting, such as Definition 5.2 in Katz and Lindell's book, evade the difficulty by defining a hash function, as in their Definition 5.1, as a function that takes as input a key whose length grows with the security parameter in addition to the bit string to be hashed. Besides being applicable to keyed hash functions such as HMAC-SHA256 rather than to keyless hash functions such as SHA256, such definitions have the problem that they require a keyed hash function to accept keys of unbounded length, which is something that cannot be implemented in practice.

In the technical report we use instead a formal definition based on the system parameterization concept that Boneh and Shoup introduce in their Graduate Course in Applied Cryptography, a draft of which is available online.

Boneh and Shoup's System Parameterization

When cryptographic security is specified “asymptotically”, a cryptographic scheme is defined to be secure if an adversary has a “negligible probability” of breaching a security property specified by the definition. This requires a “security parameter”, so that a negligible probability can be defined as a probability that is a negligible function of the security parameter, i.e. a function that goes to zero faster than any polynomial function when the security parameter goes to infinity.

But it also requires a probability space. In a traditional asymptotic security setting the randomness of the probability space is inherent in the cryptographic scheme. Typically, a probabilistic algorithm is used to generate a parameter of the scheme, such as a cryptographic key, which is thus a random variable. For example, in their textbook, Katz and Lindell define an encryption scheme (Definition 3.7, page 52) as a triple of algorithms (Gen,Enc,Dec), where Gen is a probabilistic algorithm that takes as input the security parameter and outputs a random encryption key. This does not work, however, if the cryptographic scheme does not have a key that may be chosen at random or other inherent randomness.

In their online textbook, on the other hand, Boneh and Shoup introduce a novel asymptotic security setting that uses both a security parameter λ and a system parameter Λ. The latter is a random variable output by a probabilistic system parameterization algorithm P that takes the security parameter as input. The randomness provided by the system parameter can be used to define the probability that an adversary has of breaching a security property of a cryptographic scheme even if the scheme itself has no key and no inherent randomness.

Suprisingly, Boneh and Shoup view the system parameterization algorithm as being specific to a cryptographic scheme (Section 2.4.2: “Once λ is chosen, a system parameter Λ is generated using an algorithm specific to the cipher”). It is best viewed, however, as being separate from, and independent of, any particular cryptographic scheme, as suggested by the word “system” in the term “system parameter”, and as illustrated by the proofs of Theorem 2 and Theorem 3 in our technical report. In a security reduction proof that involves multiple cryptographic schemes and does not gloss over the use of the security and system parameters, there should be only one security parameterization algorithm P and one system parameter Λ. Keys used in the schemes may be generated at random or derived from other parameters, which themselves may or may not be random. Any random keys or parameters should be produced by random bit generators that are independent of P and contribute randomness to the overall probability space of the proof independently of the randomness contributed by P.

Boneh and Shoup use their system parameterization concept to define collision resistance for keyless hash functions.

Mapping Boneh and Shoup's Definition to Existing Implementations

In Section 8.1.1 of their online textbook, Boneh and Shoup define a hash function (Definition 8.1) as an efficient deterministic algorithm that takes as inputs the security parameter λ, the system parameter Λ and the message m to be hashed, and satisfies conditions that involve two “families of spaces” M and T; I will come back to these families of spaces in a moment. Then they define the collision resistance of such a hash function using Attack Game 8.1 as modified in Figure 8.3, where the challenger generates Λ from λ and gives Λ to the adversary. The hash function is collision resistant if the probability that the adversary has of winning the game by outputting a collision is a negligible function of λ.

At first glance this is not an improvement over the definition of Katz and Lindell. In footnote 1 of page 286 Boneh and Shoup say themselves that their approach is “really the same” as the approach where a hash function takes as input a random key k, as k may be viewed as a system parameter. At the end of Section 8.1 they also say that the security and system parameters are “artifacts of the formal framework that are needed to make sense of Definition 8.1”.

But a mere change of terminology makes the use of system parameterization to define collision resistance intuitive and close to actual practice. In our technical report, instead of referring to an algorithm that takes as inputs λ, Λ and a bit string to be hashed as a “hash function”, we refer to it as a “hashing scheme with system parameterization” (Definition 16); then we use such a hashing scheme to define a family of hash functions indexed by λ and Λ (Definition 17). A hash function is then an ordinary mathematical function that takes the bit string to be hashed as its only argument. This is very close to actual practice, since SHA-2 is a family of hash functions having different security strengths, which can be viewed as different values of a security parameter. There is only one function for each security strength, but that function is defined using constants, such as those specified in Sections 4.2.2 and 4.2.3 of the Secure Hash Standard, that are arbitrary and could very well have been been defined at random by some trusted process playing the role of a system parameterization algorithm.

Although Boneh and Shoup do not define a family of ordinary hash functions indexed by λ and Λ, their definition of a hash function as an algorithm that takes λ and Λ as inputs involves families of sets M and T indexed by λ and Λ that would be the domains and codomains of such ordinary hash functions. The definition specifies that M and T must be “families of spaces with system parameterization P” that are “efficiently recognizable”. These concepts are explained in their Definition 2.9 of Section 2.4. In our Definition 17, the domain and codomain of each element of the family of hash functions are defined more simply in terms of bit lengths, the domain being a set of bit strings of length less than an integer a, and the codomain a set of bit strings of length equal to an integer b. This agrees with actual practice: the domain of SHA256, for example, is the set of bit strings of length less than 2⁶⁴, since 64 bits are used to specify the length of the input in the padding; and the codomain is the set of 256-bit strings.