*This is part 3 of a series on omission-tolerant integrity protection and related topics. See also part 1 and part 2. A technical report on the topic is available on this site and in the IACR ePrint Archive.*

When a cryptographic hash is used as a checksum of a bit string,
security depends on the collision resistance of the hash function.
When the root label of a typed hash tree is used as an
omission-tolerant checksum of a set of key-value pairs, security also
depends on the collision resistance of a hash function, in this case
of the hash function that is used to construct the typed hash tree.
(Depending on how the set is encoded, it may also depend on the
preimage resistance of the hash function, as we shall see in a later
post of this series.) The formal proof of security in
the technical report is therefore
based on a formal definition of collision resistance. But, surprising
as this may seem, there is no standard, widely used, formal definition
of collision resistance for the *keyless*, or *unkeyed* hash
functions such as SHA256 that are used in practice.

### Collision resistance is hard to define

The concept of a *hash function collision* is simple and clear
enough: it is a pair of distinct elements of the domain of the hash
function that have the same image. The concept of *collision
resistance*, on the other hand, is difficult to define. The reason
for this, as explained for example by Katz and Lindell in
their *Introduction to Modern Cryptography* (Chapman & Hall/CRC,
2nd edition, 2014, bottom of page 155), is that collisions exist, and
for any collision there exists a trivial algorithm that hardcodes the
collision and outputs it in constant time.
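To make this concrete, here is a minimal Python sketch using a hypothetical toy 16-bit truncation of SHA-256 (no SHA-256 collision is publicly known, but collisions provably exist by the pigeonhole principle, and for the toy hash one can be found by brute force). Once any collision is known, the trivial adversary that hardcodes it wins in constant time:

```python
import hashlib

# Toy 16-bit hash: a truncation of SHA-256, used here only so that a
# collision can actually be found by brute force.
def toy_hash(m: bytes) -> bytes:
    return hashlib.sha256(m).digest()[:2]

# One-time search for a collision (not part of the "adversary").
def find_collision():
    seen = {}
    i = 0
    while True:
        m = i.to_bytes(8, "big")
        h = toy_hash(m)
        if h in seen:
            return seen[h], m
        seen[h] = m
        i += 1

m1, m2 = find_collision()

# The trivial algorithm described above: it hardcodes the collision
# and outputs it in constant time.
def hardcoded_adversary():
    return m1, m2

a, b = hardcoded_adversary()
assert a != b and toy_hash(a) == toy_hash(b)
```

The same argument applies verbatim to SHA-256 itself: a hardcoding adversary exists even though nobody can exhibit it.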

An informal definition of collision resistance is given by Menezes,
van Oorschot and Vanstone in their *Handbook of Applied
Cryptography* (CRC Press, 1997, Section 9.2.2, page 324), where a
hash function is said to be collision resistant if it is
“computationally infeasible” to find a collision. The
problem with this definition is that, under any notion of
computational feasibility, an algorithm that hardcodes one of the
existing collisions can feasibly output it.

Formal definitions of collision resistance in an asymptotic security setting, such as Definition 5.2 in Katz and Lindell’s book, evade the difficulty by defining a hash function (their Definition 5.1) as a function that takes as input, in addition to the bit string to be hashed, a key whose length grows with the security parameter. Besides applying to keyed hash functions such as HMAC-SHA256 rather than to keyless hash functions such as SHA256, such definitions have the problem that they require a keyed hash function to accept keys of unbounded length, which cannot be implemented in practice.
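The keyed setting can be sketched as follows (an illustration only, not Katz and Lindell’s exact construction; the prefix construction `SHA256(key || m)` is a stand-in for an abstract keyed hash):

```python
import hashlib
import secrets

# In the keyed setting, key generation takes the security parameter
# lam and outputs a key of lam bits -- so key length is unbounded as
# lam grows, which is the property that cannot be implemented in
# practice.
def gen_key(lam: int) -> bytes:
    return secrets.token_bytes(lam // 8)

# Illustrative keyed hash: the key is an input alongside the message.
def keyed_hash(key: bytes, m: bytes) -> bytes:
    return hashlib.sha256(key + m).digest()

k = gen_key(256)
d = keyed_hash(k, b"hello")
assert len(d) == 32  # 256-bit digest
```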

In the technical report we instead use a formal definition based on the *system parameterization* concept that Boneh and Shoup introduce in their *Graduate Course in Applied Cryptography*, a draft of which is available online.

### Boneh and Shoup’s System Parameterization

When cryptographic security is specified “asymptotically”, a cryptographic scheme is defined to be secure if an adversary has a “negligible probability” of breaching a security property specified by the definition. This requires a “security parameter”, so that a negligible probability can be defined as a probability that is a negligible function of the security parameter, i.e. a function that goes to zero faster than any polynomial function when the security parameter goes to infinity.
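A small numeric comparison illustrates what “negligible” means: an inverse exponential such as 2^−λ eventually falls below 1/λ^c for every fixed exponent c (the choice c = 10 below is arbitrary):

```python
# A canonical negligible function: inverse exponential in lam.
def negligible(lam: int) -> float:
    return 2.0 ** (-lam)

# An inverse polynomial with fixed exponent c.
def inverse_poly(lam: int, c: int = 10) -> float:
    return 1.0 / lam ** c

# For small lam the inverse polynomial is smaller ...
assert negligible(10) > inverse_poly(10)
# ... but the negligible function drops below it for every
# sufficiently large lam.
assert all(negligible(lam) < inverse_poly(lam) for lam in range(100, 200))
```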

But it also requires a *probability space*. In a traditional
asymptotic security setting the randomness of the probability space is
inherent in the cryptographic scheme. Typically, a probabilistic
algorithm is used to generate a parameter of the scheme, such as a
cryptographic key, which is thus a random variable. For example, in
their textbook, Katz and Lindell define an encryption scheme
(Definition 3.7, page 52) as a triple of algorithms (Gen,Enc,Dec),
where Gen is a probabilistic algorithm that takes as input the
security parameter and outputs a random encryption key.
This does not work, however, if the cryptographic scheme does not have
a key that may be chosen at random or other inherent randomness.

In their online textbook, on the other hand, Boneh and Shoup introduce a novel
asymptotic security setting that uses both a security parameter
λ and a *system parameter* Λ. The latter is a
random variable output by a probabilistic *system
parameterization* algorithm *P* that takes the security
parameter as input. The randomness provided by the system parameter
can be used to define the probability that an adversary has of
breaching a security property of a cryptographic scheme even if the
scheme itself has no key and no inherent randomness.
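A minimal sketch of this setting, with hypothetical names (`P`, `hash_with_params`) standing in for Boneh and Shoup’s abstract definitions:

```python
import hashlib
import secrets

# Probabilistic system parameterization algorithm P: takes the
# security parameter lam and outputs a random system parameter Lambda.
def P(lam: int) -> bytes:
    return secrets.token_bytes(lam // 8)

# A keyless hashing scheme defined relative to Lambda: the scheme
# itself has no key, but the randomness of Lambda provides the
# probability space over which security is defined.
def hash_with_params(lam: int, Lambda: bytes, m: bytes) -> bytes:
    return hashlib.sha256(Lambda + m).digest()[: lam // 8]

Lambda = P(128)
assert len(hash_with_params(128, Lambda, b"msg")) == 16
```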

Surprisingly, Boneh and Shoup view the system parameterization
algorithm as being specific to a cryptographic scheme (Section 2.4.2:
“Once λ is chosen, a system parameter Λ is
generated using an algorithm specific to the cipher”). It is
best viewed, however, as being separate from, and independent of, any
particular cryptographic scheme, as suggested by the word
“system” in the term “system parameter”, and
as illustrated by the proofs of Theorem 2 and Theorem 3 in
our technical report. In a
security reduction proof that involves multiple cryptographic schemes
and does not gloss over the use of the security and system parameters,
there should be only one system parameterization algorithm *P*
and one system parameter Λ. Keys used in the schemes may be
generated at random or derived from other parameters, which themselves
may or may not be random. Any random keys or parameters should be
produced by random bit generators that are independent of *P* and
contribute randomness to the overall probability space of the proof
independently of the randomness contributed by *P*.

Boneh and Shoup use their system parameterization concept to define collision resistance for keyless hash functions.

### Mapping Boneh and Shoup’s Definition to Existing Implementations

In Section 8.1.1 of
their online
textbook, Boneh and Shoup define a hash function (Definition 8.1)
as an efficient deterministic algorithm that takes as inputs the
security parameter λ, the system parameter Λ and the
message *m* to be hashed, and satisfies conditions that involve
two “families of spaces” **M**
and **T**; I will come back to these families of spaces
in a moment. Then they define the collision resistance of such a hash
function using Attack Game 8.1 as modified in Figure 8.3, where the
challenger generates Λ from λ and gives Λ to the
adversary. The hash function is collision resistant if the
probability that the adversary has of winning the game by outputting a
collision is a negligible function of λ.
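The attack game can be sketched as follows (a toy illustration with hypothetical names, run here at a 16-bit “security level” so that a brute-force adversary actually wins; at realistic parameters its success probability is negligible in λ):

```python
import hashlib
import secrets

# System parameterization: the challenger generates Lambda from lam.
def P(lam: int) -> bytes:
    return secrets.token_bytes(lam // 8)

# Toy keyless hash defined relative to lam and Lambda.
def H(lam: int, Lambda: bytes, m: bytes) -> bytes:
    return hashlib.sha256(Lambda + m).digest()[: lam // 8]

# Attack game: the challenger generates Lambda, gives it to the
# adversary, and the adversary wins by outputting a collision.
def attack_game(lam: int, adversary) -> bool:
    Lambda = P(lam)
    m1, m2 = adversary(lam, Lambda)
    return m1 != m2 and H(lam, Lambda, m1) == H(lam, Lambda, m2)

# A birthday-style brute-force adversary: feasible at 16 bits only.
def brute_force(lam: int, Lambda: bytes):
    seen = {}
    i = 0
    while True:
        m = i.to_bytes(8, "big")
        h = H(lam, Lambda, m)
        if h in seen:
            return seen[h], m
        seen[h] = m
        i += 1

assert attack_game(16, brute_force)
```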

At first glance this is not an improvement over the definition of Katz
and Lindell. In footnote 1 on page 286, Boneh and Shoup themselves say
that their approach is “really the same” as the approach
where a hash function takes as input a random key *k*,
as *k* may be viewed as a system parameter. At the end of
Section 8.1 they also say that the security and system parameters are
“artifacts of the formal framework that are needed to make sense
of Definition 8.1”.

But a mere change of terminology makes the use of system
parameterization to define collision resistance intuitive and close to
actual practice. In our technical
report, instead of referring to an algorithm that takes as inputs
λ, Λ and a bit string to be hashed as a “hash
function”, we refer to it as a “hashing scheme with system
parameterization” (Definition 16); then we use such a hashing
scheme to define a family of hash functions indexed by λ and
Λ (Definition 17). A hash function is then an ordinary
mathematical function that takes the bit string to be hashed as its
only argument. This is very close to actual practice, since SHA-2 is
a family of hash functions having different *security strengths*,
which can be viewed as different values of a security parameter.
There is only one function for each security strength, but that
function is defined using constants, such as those specified in
Sections 4.2.2 and 4.2.3 of the Secure Hash Standard, that are
arbitrary and could very well have been defined at random by some
trusted process playing the role of a system parameterization
algorithm.
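The family-of-functions view maps naturally onto a standard library, as in this sketch (the index values follow the usual collision-resistance strengths of the SHA-2 functions: 128 bits for SHA-256, 192 for SHA-384, 256 for SHA-512):

```python
import hashlib

# SHA-2 viewed as a family of ordinary one-argument hash functions
# indexed by security strength.
SHA2_FAMILY = {
    128: hashlib.sha256,
    192: hashlib.sha384,
    256: hashlib.sha512,
}

# Each strength selects one ordinary function that takes only the bit
# string to be hashed.
def hash_at_strength(strength: int, m: bytes) -> bytes:
    return SHA2_FAMILY[strength](m).digest()

assert len(hash_at_strength(128, b"abc")) == 32  # 256-bit digest
assert len(hash_at_strength(256, b"abc")) == 64  # 512-bit digest
```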

Although Boneh and Shoup do not define a family of ordinary hash
functions indexed by λ and Λ, their definition of a hash
function as an algorithm that takes λ and Λ as inputs
involves families of sets **M** and **T**
indexed by λ and Λ that would be the domains and
codomains of such ordinary hash functions. The definition specifies
that **M** and **T** must be “families
of spaces with system parameterization *P*” that are
“efficiently recognizable”. These concepts are explained
in their Definition 2.9 of Section 2.4. In our Definition 17, the
domain and codomain of each element of the family of hash functions
are defined more simply in terms of bit lengths, the domain being a
set of bit strings of length less than an integer
*a*, and the codomain a set of bit strings of length equal to an
integer *b*. This agrees with actual practice: the domain of
SHA256, for example, is the set of bit strings of length less than
2^64, since 64 bits are used to specify the length of the
input in the padding; and the codomain is the set of 256-bit strings.
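The codomain claim is easy to check directly (the 2^64 bound on the domain follows from the padding rule and cannot be checked this way):

```python
import hashlib

# SHA-256 digests are always 256 bits (32 bytes), whatever the length
# of the input.
for m in (b"", b"x", b"x" * 100_000):
    assert len(hashlib.sha256(m).digest()) * 8 == 256
```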