Some notes on Secure Key performance and throughput

At IDF in September I led a technical session in the security track on developing applications that make use of Secure Key. In that presentation I put up the following chart:

It plots the maximum, total throughput of the RDRAND instruction in a multithreaded application for six different systems. The Y axis is the ratio to single threaded throughput, and the X axis is the number of threads executing RDRAND as rapidly as possible. What this chart says is that total RDRAND performance scales nearly linearly with the number of threads: if you have two threads executing RDRAND you get twice the throughput of one thread, if you use three threads you get three times the throughput, and so on. This scaling continues until you hit an overall hardware limit on the CPU, such as maxing out your hardware threads or saturating the bus.

After the talk was done, I had a session attendee come up to me and ask a question: Why is it that multiple threads give the DRNG (the digital random number generator) higher throughput? Why can't one thread simply pull as much random data via RDRAND as the DRNG is capable of generating?

The answer lies in the overall architecture of the CPU and the DRNG itself. Single-threaded performance is limited by the round-trip latency between when a RDRAND instruction is executed and when the random number is returned.

Round-trip latencies

The DRNG is connected to the cores via a bus. There are multiple busses within the CPU, and which bus is used for the DRNG is a decision that is made by the product groups when the CPU is designed. Those decisions are based on specific feature requirements and design constraints, and they can vary from generation to generation (3rd generation Core to 4th generation Core) as well as from family to family (Core to Xeon to Atom). No matter which bus is used, however, the process followed by the CPU when processing a RDRAND instruction is the same:

  1. The execution unit receives the RDRAND instruction
  2. A request for a random number is placed on the bus
  3. The DRNG receives the request from the bus
  4. The DRNG places a random number back on the bus
  5. The random number is returned to the execution unit that issued the instruction
  6. The random number is placed in the destination register

The time that elapses between steps 2 and 5 is the round-trip latency for the RDRAND instruction. The hardware thread cannot execute another RDRAND until the previous RDRAND has completed, and so this latency becomes the limit for single-thread throughput. It issues a request, waits for the answer, and then moves on.

With multiple threads executing RDRAND in parallel, however, each thread may place a request on the bus regardless of what the other threads are doing. While each thread still sees the same round-trip latency, and the same per-thread throughput, the total number of requests that are in flight on the bus increases. The end result is that the total throughput across all threads is roughly equal to the single-threaded throughput times the number of active threads. Hence, the throughput of the DRNG appears to increase.

§

Per informazioni più dettagliate sulle ottimizzazioni basate su compilatore, vedere il nostro Avviso sull'ottimizzazione.
Contrassegni: