On a 32-bit system, a program loop takes the same time to run using double-precision floating-point arithmetic as it does using single precision. The double calculations are done in hardware, as opposed to some form of software emulation, as on most GPUs. On the GPU, a loop of doubles takes more than twice as long as the same loop of single floats.
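To make the comparison concrete, here is a minimal sketch of the kind of loop I am timing (the constants and iteration count are illustrative, not my actual benchmark; compile for a 32-bit target with the x87 FPU, e.g. `gcc -m32 -mfpmath=387 -O1`, so that neither loop is vectorised):

```c
#include <stdio.h>
#include <time.h>

#define N 100000000L

/* volatile accumulators keep the compiler from optimising the loops away */
volatile float  f_acc = 0.0f;
volatile double d_acc = 0.0;

int main(void)
{
    clock_t t0 = clock();
    for (long i = 0; i < N; i++)
        f_acc += 1.000001f;          /* single-precision add */
    clock_t t1 = clock();

    for (long i = 0; i < N; i++)
        d_acc += 1.000001;           /* double-precision add */
    clock_t t2 = clock();

    printf("float : %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("double: %.3f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}
```

Both loops report essentially the same time on the CPU.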
Please set aside SSE and AVX registers and calculations for the moment.
I understand how the calculation of single-precision (32-bit) floating-point values is performed. How is it that using double-precision (64-bit) values takes no more time on the same hardware? Must the processor's ALU be based on a 64-bit architecture to achieve this, despite running under a 32-bit operating system?
What hardware mechanism is used to achieve this? Does anyone have a good description?