Simple question about single and double float terminology


Posted by magicfoot

On a 32-bit system, a program loop takes the same time to process double-precision floating-point arithmetic as it does single precision. The double-float calculations are done in hardware, as opposed to some sort of software emulation, as is done on most GPUs. The GPU takes more than twice as long to process a loop of double floats as it does a single-float loop.

Please exclude all thought of SSE or AVX registers or calculations for the moment.

I understand how the calculation of single-precision (32-bit) floating-point values is performed. How is it that the use of double-precision (64-bit) values takes no more time on the same hardware? Must the processor's ALU be based on a 64-bit architecture to achieve this, despite running a 32-bit operating system?

What hardware mechanism is used to achieve this? Does anyone have a good description?

Posted by jimdempseyatthecove

On Intel processors there are the following floating-point instruction sets: FPU (8087 emulation), SSE, and AVX. All three have access to a very fast internal floating-point processor (engine). The FPU supports 4-byte, 8-byte, and 10-byte floating-point formats as single elements (scalars). SSE and AVX support 4-byte and 8-byte floating-point formats as scalars (single variables) or as small vectors (2 or more elements). Ignoring the multiple-element formats in SSE and AVX, the latency of a floating-point multiply is on the order of 4 clock cycles (longer for memory references). Throughput can be as little as 1-2 clock cycles.

When the problem involves a large number of RAM reads and writes, the program is waiting on memory rather than on the floating-point operations.

Note that when small vectors can be used, the computation time can be significantly reduced (to 1/2, 1/4, or 1/8) and the memory-subsystem overhead per floating-point operation can drop, but the total demand on the memory subsystem may increase.

Jim Dempsey

www.quickthreadprogramming.com
Posted by iliyapolak

>>>The double float calculations are done in hardware, as opposed to using some sort of software emulation, as is done on most GPUs. The GPU takes more than twice as long to process a loop of double floats than it does a single float loop.>>>

IIRC, the Nvidia Kepler architecture has support for double-precision calculations. Not sure about the Fermi design.

Posted by Sergey Kostrov

I have a question regarding that statement:

>>...How is it that the use of double precision values(64 bit) does not use more time on the same hardware...

Do you have a test case which demonstrates that performance is the same for SP and DP floating-point types?

Posted by iliyapolak

>>>How is it that the use of double precision values(64 bit) does not use more time on the same hardware. Must the processor ALU be based on 64 bit architecture to achieve this>>>

I suppose that recent Intel processors use one or two execution ports for scalar integer ALU operations and for vector ALU operations, and that this data can be vectorized and sent to the SIMD execution engine. In the case of the vector ALU, up to four 32-bit integer scalar components are processed at once.

Posted by Tim Prince

When using vectorized SIMD instructions, single-precision throughput is roughly double that of double precision, just as on your GPU. This is because twice as many operations, using the same total number of bytes of data, may be performed per cycle.

When considering a single scalar operation, the performance of single and double precision may be similar. This may be true of the GPU as well. Some of the ads compare vector-parallel operation on a GPU against serial host-CPU operation. This is in line with your idea that SIMD parallelism should not be considered on the host, even though you are discussing the equivalent on the GPU.
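The lane-count difference can be seen directly in a sketch using SSE2 intrinsics (array lengths are assumed to be multiples of the vector width): a 128-bit register holds four floats but only two doubles, so the double version issues twice as many instructions for the same element count.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Elementwise a[i] *= b[i] for floats: 4 lanes per instruction. */
void mul_ps(float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(a + i, _mm_mul_ps(va, vb));
    }
}

/* Same for doubles: only 2 lanes per instruction, so twice as many
   instructions (and twice the bytes moved) for the same n. */
void mul_pd(double *a, const double *b, int n) {
    for (int i = 0; i < n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(a + i, _mm_mul_pd(va, vb));
    }
}
```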

Posted by iliyapolak

I think that there cannot be a direct comparison between CPU and GPU peak floating-point performance. I suppose that over a short interval (more than one CPU cycle) the peak performance will be a function of the floating-point code scheduled for execution, the interdependencies in that code, and the execution units available to that code per core. The GPU has many more resources available, albeit operating at a lower clock speed.

Posted by magicfoot

Thanks for all of those neat comments. Attached is single- and double-precision sample code, with a builder for VS2010, for Sergey.

This question originated because someone asked me why the single- and double-precision computational performance of a program on an i5 processor was the same, whereas on the GTX 480 GPU this is not the case. I glibly answered that the double and single times were the same because the i5 does the double scalar arithmetic in hardware. I thought about this afterwards and realised that I did not really understand how the processor hardware does this so efficiently. Thanks for the answer, Jim.

This question is not about SSE or AVX. I get very good performance with most of my code using those instruction sets: typically x2.5 with SSE and x5 with AVX, all single-precision implementations of course.


The focus of the question is how contemporary CISC processors handle double-precision computation. The answer is that the floating-point engine circuitry does the computation.


Regards.

Attachment: sample.zip (28.3 KB)
Posted by magicfoot

Hi, thanks for all the answers. Sample programs are attached for Sergey.

I am satisfied with the answer that the maths is done by the FP engine.


Regards.

Attachment: sample.zip (28.3 KB)
Posted by iliyapolak

>>>I am satisfied with answer that the maths is done by the fp engine>>>

Do you mean integer math?

Posted by magicfoot

>>>I am satisfied with answer that the maths is done by the fp engine>>>

>>>>Do you mean integer math?

By FP engine I meant the floating-point engine.

Posted by Sergey Kostrov

Hi,

>>...Sample programs attached...

I'll take a look at what it does. Thank you!

Posted by iliyapolak

>>>This question originated because someone asked me why the single and double computational performance of a program on an i5 processor was the same, whereas on the GTX 480 GPU this is not the case>>>

Probably because of either a lack of double-precision support, or double-precision support being locked on non-Tesla cards.

Posted by iliyapolak

Quote:

iliyapolak wrote:

>>>I am satisfied with answer that the maths is done by the fp engine>>>

Do you mean integer math?

I do not know whether the same engine processes integer math.

Posted by Sergey Kostrov

>>>>...Sample programs attached...
>>
>>I'll take a look at what it does.

I'll be able to do verifications on three systems, that is, Ivy Bridge, Atom and Pentium 4, and I'll report my results.

Posted by Sergey Kostrov

>>>>...I'll be able to do verifications on three systems, that is, Ivy Bridge, Atom and Pentium 4, and I'll report my results...

Hi Bob, I will follow up by Monday and sorry for the delay.
