Security Software

Guidelines for Mitigating Timing Side Channels Against Cryptographic Implementations

The primary concern with side channels is the protection of secrets. Secrets are broadly defined as any data that should not be seen or known by other users, applications, or even other code modules. When using side channel methods, malicious actors most commonly seek API keys, user passwords, or cryptographic keys because these may allow malicious actors to decrypt or access other protected secrets.

Software side channels can be classified into two broad categories:

  • Traditional side channels, which take advantage of architecturally committed operations.
  • Speculative execution side channels, which have been shown in proof of concept to take advantage of operations that only execute speculatively and thus are not committed into the architectural state.

Traditional side channels typically take advantage of the differences in execution timing when the target process uses a secret in order to derive or infer the value of the secret in the spy process, though other techniques of observing execution exist1. These execution differences can be observed in the mutual access of shared system resources (for example, branch predictors, cache lines, translation lookaside buffers (TLB), and execution ports), instructions that have varying execution timing based on input, or observable differences in API execution time. The methods used to observe these timing differences vary between specific side channels, but developers can follow general best practices to defend against them.

Most traditional side channels—regardless of technique—can be mitigated by applying all three of the following general “constant time”2 principles, listed here at a high level. We discuss details and examples of these principles later.

  1. Ensure runtime is independent of secret values.
  2. Ensure code access patterns3 are independent of secret values.
  3. Ensure data access patterns4 are independent of secret values.

These principles are conceptually simple, but sometimes can be difficult to implement in practice, depending on the complexity of your algorithms. Luckily, most developers won’t need to work on the most complex situations.

Speculative execution side channel methods may warrant different mitigations than those discussed in this article. In particular, some speculative execution side channel methods have been shown in proof of concept to have the ability to generate a universal read gadget in either the cryptographic software or other software in the same address space as secret data. You can find additional resources and techniques for mitigating speculative execution side channels on the Security Findings and Mitigations site.

Constant Time Code and Compiler Optimizations

Before we describe constant-time principles, a word about compiler optimizations and their effects on generated code: In short, be wary of code generated from high-level language source code that appears to adhere to all of these recommendations. Compilers with optimizers will look for ways to make the code faster, and sometimes the changes they introduce are quite unexpected. For any code that you write—whether C or hand-coded assembly—you must verify constant-time operation in the environment where you expect to use it.

In the C language, keywords like volatile, register, inline, and others can change the way compiler optimizations are applied. Using compiler command line options, such as GCC’s -O0 through -O3, can also generate vastly different machine code for the same simple lines of source code. The C language keywords can successfully mitigate some unexpected compiler optimizations for some compilers, but you must not rely on these keywords since they are only considered to be hints to the code generators. Code generator implementations may change at any time.

Many languages do not have good support—or any support in the case of many managed and dynamic languages—for providing code generation hints to the compiler or just-in-time (JIT) compiler. When considering constant time coding principles in these environments, it’s important to be extra vigilant about what is actually executing on the processor.

Examples in This Document

Examples in this document start by showing the simplest—most naive—implementation of a solution in order to illustrate problematic situations. One or more iterations of the implementation will be shown, illustrating a possible solution for identified issues. It is important to understand what each change provides to understand the final recommended implementation.

Principles of Traditional Side Channel Resistant Coding

Let’s start by elaborating on the high level principles enumerated above.

  1. Ensure the runtime is independent of secret values.

    Timing side channels take advantage of the smallest differences in execution time, sometimes at the resolution of a single clock cycle. Ensure algorithms consistently process secret data, and only use instructions whose latency is invariant to the data values. This requirement applies to the primary operation as well as any subfunctions.

  2. Ensure code access patterns are independent of secret values.

    The value of a secret, or values derived from a secret, must not affect a conditional branch or the target of an indirect branch. The address sequence of executed instructions must be independent of secret values.

  3. Ensure data access patterns are independent of secret values.

    Secret values should not cause a change to the order of accessed addresses or the data size of loads/stores. For example, don’t use a secret value to load a specific location from a table representing an S-box without precautions.

The following sections describe steps to mitigate examples of code vulnerable to side channels. The principles above will be referenced as secret independent runtime (SIR), secret independent code access (SIC), and secret independent data access (SID), respectively, when a mitigation is applied.

Mitigating Information Leakage due to Conditional State Change

Some algorithms require conditional state changes and use simple structures. In the example below, an internal state is incremented and returned. If a bound is reached, an error state is returned instead (in a real situation, this code would likely be part of a larger procedure that runs the state machine):

int
cond_stateInc(int estate, int maxState)
{
    static int state = 0;


    if ( state >= maxState ) {
        state = estate; // Error state
    } else {
        state += 1; // state transition
    }


    return state;
}

Compiling this code with GCC without optimization yields the results shown below. Note that assembly code generated by GCC places the destination in the second argument, which is the opposite order for Intel assembly convention. We used GCC 7.4.0 under CYGWIN for this example. The position-independent offsets for the state variable have been simplified.

 
        ...       
01 cond_stateInc:
02        movl    state(%rip), %eax  ;; eax = state
03        cmpl    %edx, %eax         ;; compare maxState, state
04        jge     .L4                ;; if state >= maxstate goto .L4
05        addl    $1, %eax           ;; (compute state+1)
06        movl    %eax, state(%rip)  ;; state = state + 1
07 .L3:
08        movl    state(%rip), %eax  ;; return state
09        ret
10 .L4:                              ;; when state >= maxState...
11        movl    %ecx, state(%rip)  ;; state = estate
12        jmp     .L3                ;; goto return point

As we might expect from the C code, the code paths taken are different depending on the arguments. When the conditional JGE instruction in line 4 is not executed, the code is linear. Otherwise, a branch is taken and two extra instructions are executed (lines 11 and 12). These additional instructions take some small amount of time. This timing difference may be detectable to other programs on the same computer system (meaning it may represent a side channel), and might be used by a malicious actor to trace program execution. If maxState is connected to checking validity of a secret, a malicious actor could use this timing difference to help guess the secret.

In modern processors, the branch instructions change the shared state of the microarchitecture (this is known as branch prediction). The processor may speculatively execute some instructions in parallel to improve processor efficiency. Malicious actors may seek to use both of these effects as side channels.

Even in this small example, we can see that tests and branches in the program may cause side channels. When a program implements a cryptographic algorithm, a test that depends on each bit of the secret key may allow a malicious actor to recover the key using a side channel.

We can prevent these side channels if we make the code execution time independent of the values in the code (in accordance with the SIR principle). The Intel® Architecture instruction set designers have anticipated this need. Compiling the same C function with optimization enabled generates the following code, with no branch instructions:

        ...
01 cond_stateInc:
02        movl        state(%rip), %r8d ;; register r8d = state
03        leal        1(%r8), %eax      ;; eax = state + 1
04        cmpl        %edx, %r8d        ;; compare maxState, state
05        cmovge      %ecx, %eax        ;; state>=maxState => eax=estate
06        movl        %eax, state(%rip) ;; update ‘state’ variable
07        ret                           ;; return state (in eax)

In the above code, the two possible final values of state are placed into registers. %eax initially contains state+1. Then CMOVGE instruction in line 5 copies the value of maxState in %ecx into %eax, only if the greater than or equals condition holds based on the comparison instruction in line 4. The CMOVcc instruction runs in time independent of its arguments in all current x86 architecture processors. This includes variants that load from memory. The load is performed before the condition is tested. Future versions of the architecture may introduce new addressing modes that do not exhibit this property. Refer to the Intel® 64 and IA-32 Architectures Optimization Reference Manual, section 3.5 Optimizing the Execution Core as the architecture evolves.

Another important observation is that a change to compiler flags might undo all our efforts to create data independent execution! Until compiled languages create primitives for data-independent execution, this will always be a problem. Although it is inconvenient, the safest solution to this is to write the secret-dependent code in assembly language. Another option is to validate the compiled code as part of the build process and create an automated trigger if the compiler decides to change to the output code.

Compiler and application writers have also addressed this issue and have demonstrated how to support constant time code in higher level language, such as C. For example, the OpenSSL crypto package, part of most Linux* distributions, implements such constructs. Compiler support for constant time code is outside the scope of this paper.

Mitigating Information Leakage Based on Variable Timing

Operations such as message authentication code (MAC), RSA signature padding, and password processing are especially susceptible to timing side channel attacks. These operations include a step that compares two values. If the comparison time is dependent on the inputs, malicious actors can use the timing differences to learn valuable information. This type of attack, known as an oracle attack5, can target processes that are not vulnerable to speculative execution side channels and can operate at an API level.

Consider the following example of simple byte buffer comparison that is not side channel resistant. It first checks that the buffers are the same length and returns false if they don’t match. It then compares each byte, and returns false as soon as a mismatch is found.

bool equals(byte a[], size_t a_len,
            byte b[], size_t b_len) {
  if (a_len != b_len) // data dependent!
    return false;


  for (size_t i = 0; i < a_len; i++) {
    if (a[i] != b[i]) { // data dependent!
      return false;
    }
  }
 
  return true;
}

Functionally this is correct, but in a naive case where the byte strings contain raw passwords, with the expected password in a and the malicious actor’s guess in b, this function provides a couple of oracles6. For simplicity, let’s assume each comparison operation takes one unit of time to execute and other operations take zero time.

First, the malicious actor could determine the correct length of the password by repeatedly causing the function to be invoked with values of increasing length until a runtime measurement of more than one unit is found.

Second, the malicious actor can efficiently find the password one byte at a time by causing the function to be invoked with values of length a_len (discovered using the first oracle), and then increment through valid byte values (for example, for the first byte, the malicious actor sends aaaa, baaa, caaa, etc.) until an increase in execution time is detected. The increase in execution time means the byte being tested has matched, so the malicious actor can then send queries that increment the next byte.

To put the above mechanism into more concrete terms, let’s look at an example system that uses a password between 3 and 5 bytes long and only allows ASCII characters a through z. If a malicious actor wants to find a user's password, and the user's password is oops, the malicious actor uses the following invocations to gain information:

  • aaa (time=1), aaaa (time=2, secret length=4)
  • baaa (time=2), … tooaaa (time=3, first byte is o)
  • obaa (time=3), … toooaa (time=4, second byte is o)
  • ooba (time=4), … tooopa (time=5, third byte is p)
  • oopb (time=5), … tooops (login successful! User’s password is oops)

Let’s next consider a change to the equals() function to eliminate the second oracle. The malicious actor can still determine the length of the expected password, but cannot optimize the search for each character, so a brute force search is required. They will also make the timing property such that the runtime is always constant related to the smaller of a_len and b_len. The new function accumulates bit differences between a_len and b_len and corresponding members of a and b. If there are no bit differences, the inputs are the same.

bool equals(byte a[], size_t a_len,
            byte b[], size_t b_len) {
  volatile size_t x = a_len ^ b_len;


  for (size_t i = 0; ((i < a_len) & (i < b_len)); i++) {
    x |= a[i] ^ b[i];
  }
 
  return (x==0);
}

There are a few details to note. First, the volatile keyword is used for the accumulator variable x. This tells the compiler to not make any optimizing assumptions in the computation x. Without this keyword, an intelligent optimizer may inject an “early out” condition to the loop if x ever becomes non-zero (for example, GCC will do this). Second, differences in length are accumulated in x instead of existing immediately if lengths are different [principle: SIC]. Third, the loop limit expression is designed to exit the loop when i equals the minimum of a_len or b_len without triggering logical short circuit conditions [principles: SIC, SIA]. In C, relational operators are defined to return the integer 1 for true and 0 for false, and the bitwise AND operator (“&”) is used to check that both relational operators are true. If a logical AND operator (“&&”) were used instead, the expression would short circuit evaluate to true without evaluating (i < b_len) when (i < a_len) is false, resulting in a non-constant execution time.

Eliminating the first oracle to determine the expected password length requires the caller to use padded buffers such that a_len and b_len are always identical. Alternatively, the input values are processed using a message digest function and the results compared [principles: SIR, SIC, SID].

The example above uses raw passwords as an example only. In real applications, the recommendation for passwords is to use salted password hashes instead of raw passwords to prevent other types of attacks in addition to the example above, but the same advice on constant time comparison of the salted password hashes applies.

A more likely example where this type of issue is found is timing attacks against MAC values. In this scenario, a system will only accept a message if the sender can also supply a valid MAC value. The malicious actor is trying to get a system to accept their message, but doesn’t know the secret required to compute the valid MAC value. If non-constant time comparison is used, the malicious actor can easily forge the necessary MAC value. In this case, the malicious actor knows the expected MAC length since it’s a public parameter, so they launch the same attack pattern described for passwords with a value range of 0x00 to 0xFF for each byte of the MAC. For instance, if the MAC is four bytes long with value 0x562AC9F8, the guess pattern looks like the following.

  • 0x00000000 (time=1), … to … 0x56000000 (time=2, first byte is 0x56)
  • 0x56010000 (time=2), … to … 0x562A0000 (time=3, second byte is 0x2A)
  • 0x562A0100 (time=3), … to … 0x562AC900 (time=4, third byte is 0xC9)
  • 0x562AC901 (time=4), … to … 0x562AC9F8 (time=5, fourth byte is 0xF8, message accepted!)

Ensuring the comparison always iterates over all bytes of the MAC forgery candidate forces the attack to become a brute force search, which is orders of magnitude harder.

Mitigating Information Leakage Based on Secret Calculations

Some of the underlying mathematical primitives used to perform cryptographic operations can exhibit detectable timing differences depending on the key material being used when the implementation is simple or is heavily optimized for speed over security. An example of this is the mathematical primitive used to implement the RSA and Diffie-Hellman algorithms: modular exponentiation.

To most clearly illustrate the side channel, let’s consider the most naive implementation of modular exponentiation called left-to-right exponentiation7. We’ll then iteratively update the computation method to eliminate timing side channels while preserving high performance.

To compute the value c=ab mod n, the simplest implementation does the following, where all values are multiprecision integers with size equivalent to n.

  1. Set c to 1.
  2. Scanning from most significant to least significant bit of exponent b
    1. Square c (c=c*c mod n)
    2. Multiply c with a (c=c*a mod n) if the current bit is a 1

Assuming multiprecision multiplication is implemented using constant time methods—a critical requirement—and therefore all multiplies take the same time to execute, the amount of time the modular exponentiation computation takes is directly related to the number of 1 bits in the exponent. Therefore, this method causes the computation timing to vary depending on the key material, both for the overall computation and for each individual step involving a bit of b. Observing the time required for each step reveals the bits of b. When performing a private key operation using RSA, b is the private key, so discovering the bits of b is equivalent to stealing the private key.

To eliminate the timing side channel, we consider a simple implementation variation that always executes the same code path and has a constant execution time [principles: SIR, SIC]. The method uses an intermediate value r that will be either 1 or a if the current bit of b is 0 or 1, respectively. The code will always take the same execution path and will always execute the same number of multiplies.

  1. Set c to 1.
  2. Scanning from most significant to least significant bit of exponent b
    1. Square c (c=c*c mod n).
    2. Compute r=(a-1)*current_bit_of(b)+1
    3. Multiply c with r (c=c*r mod n)

The downside of this method is that performance is significantly slower than the original simple method. The simple method will execute an average of 1.5 multiplies per bit in the exponent, where the revised method will always perform 3 multiplies per bit.

A method to enhance the performance of RSA computations is to use a left-to-right k-ary exponentiation method of computation8. With this technique, multiple bits of the exponent are processed at the same time. In addition to achieving constant execution times, the fixed size sliding window method also achieves the goal of not branching based on the content of secret data, because the fixed size sliding window method follows the same code path regardless of the private key value.

  1. Chose a window size, k.
  2. 2k powers of a are precomputed as apre=[a0 mod n, a1 mod n, …, a2^k-1 mod n]
  3. Set c to 1
  4. The exponent b is now scanned from the most significant bit to the least significant bit; k bits are computed at a time.
    1. The value of c is squared k times—in other words, for each bit in the window, c is squared just as in the previous algorithm.
    2. Then c is multiplied by the precomputed power of a using the appropriate stored value in apre indexed by the current set of k bits treated as an integer index to apre.

Note that this implementation is still vulnerable to side channels based on the pattern of data access, so it should not be used alone. To satisfy the principle of ensuring data access patterns are secret independent, the code that selects the precomputed apre value must access all values of apre consistently. Each access must have identical size—for example, 4 bytes—the addresses accessed must be 4 byte aligned, and accessed in exactly the same order. This can be accomplished with a scan loop through apre using instructions such as the CMOV family discussed earlier using a test on the k bits of b and the index into apre.

Mitigating Observation of Lookup Table Access

Many block ciphers make use of lookup tables. For example, the AES algorithm defines a procedure called SubBytes() that performs a byte-for-byte substitution using a table called an S-box. The S-box is a 256-byte constant, which spans four cache lines on most processors. The table is public, but indexing into the table for each substitution is private. By watching which one of the four cache lines is accessed in a straightforward implementation of SubBytes(), a spy process can obtain 2 bits of information about a secret round key. If the spy process can determine the 32-bit word within a cache line where the lookup occurs—which is possible with methods such as CacheBleed—it can obtain 6 bits of information about a secret round key. Although the spy process cannot know the exact values of the S-box accessed, repeating this observation many times for many encryptions may reveal the entire secret key.

To ensure memory access patterns are the same independent of the input value, the code must access all table data with the same data width, the exact same address order, and with identical data alignment regardless of the table entry to lookup [principles: SIR, SIC, SID]. The required value can be extracted using instructions such as the CMOV family. Alternatively, if the table is small enough, it may be possible to pack the values into Intel® Advanced Vector Extensions (Intel® AVX) registers and avoid memory accesses altogether.

For algorithms like AES, where the S-box values are computed deterministically, it is also an option to compute the replacement value for each substitution instead of performing a table lookup.

Conclusion

Developers who wish to protect secret data against timing side channel methods should ensure that their code runtime, data access patterns, and code access patterns are identical independent of secret values. Even developing simple cryptographic libraries that are resistant to timing side channel methods requires adhering to these guidelines as well as having a thorough background in cryptographic principles, algorithms, and implementations. Most developers should use existing tools and libraries that already follow the best practices described here, instead of trying to develop their own routines. Developers can refer to Security Best Practices for Side Channel Resistance for more generic techniques. Additional precautions should be implemented to protect secret data from the risk of speculative execution side channel methods. Refer to the Software Security Findings and Mitigations site for specific mitigations for speculative execution side channel methods.

References

Footnotes

  1. For example, “Prime+Abort: A Timer-Free High-Precision L3 Cache Attack using Intel TSX”, Craig Disselkoen, et al, https://www.usenix.org/conference/usenixsecurity17/technical-sessions/presentation/disselkoen
  2. Use of the term “constant time” is a legacy term that is ingrained in literature and used here for consistency. In modern processors, the time to execute a given set of instructions may vary depending on many factors. The key is to ensure that none of these factors are related to the manipulation of secret data values. Modern algorithm research uses the more inclusive term “data oblivious algorithm."
  3. A program's code access pattern is the order and address of instructions that it executes.
  4. A program's data access pattern is the order and address of memory operands that it loads and stores.
  5. https://en.wikipedia.org/wiki/Oracle_attack
  6. https://security.stackexchange.com/questions/10617/what-is-a-cryptographic-oracle
  7. Alfred J. Menezes, Paul C. van Oorschot, and Scott A. Vanstone. The Handbook of Applied Cryptography, CRC Press, 1996. Algorithm 14.79, pg 615.
  8. Alfred J. Menezes, Paul C. van Oorschot, and Scott A. Vanstone. The Handbook of Applied Cryptography, CRC Press, 1996. Algorithm 14.82, pg 615.

Was this article helpful?YesNo
0% of users found this helpful

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer or learn more at www.intel.com.

All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.

Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available​ updates.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Intel provides these materials as-is, with no express or implied warranties.

No product or component can be absolutely secure.

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

Copyright Intel Corporation 2020.