unknown optimization on x64

krishnaraj's picture

I have written a benchmarking application for OpenCL: https://github.com/krrishnarraj/clpeak . One of the tests measures the compute capacity (GFLOPS) of the device. When run on 32-bit Windows, it gives the expected results on Sandy Bridge:

Platform: Intel(R) OpenCL
  Device:       Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
    Driver version: 1.2 (Win32)

    Single-precision compute (GFLOPS)
      float   : 25.19
      float2  : 50.48
      float4  : 50.37
      float8  : 51.75
      float16 : 51.85

The theoretical peak of this device is 76.8 GFLOPS.

But when the same code runs on 64-bit, it gives a different result:

Platform: Intel(R) OpenCL
  Device:       Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
    Driver version: 1.2 (Win64)

    Single-precision compute (GFLOPS)
      float   : 25.15
      float2  : 99.25
      float4  : 172.25
      float8  : 80.07
      float16 : 96.42

It looks like the vector code (float2, float4) has been optimized down to float, or some out-of-order optimization has happened. I am not sure what is going on.

The ASM output from kernel-analyzer shows all the fmad and fmul instructions properly generated. Is there any optimization that is specific to x64? Anything advanced?

krishnaraj's picture

Anyone there?

Hello Krishnaraj,

Please note that the OpenCL compiler implicitly vectorizes the kernel for you. It does so along dimension zero of the workgroup's work-items. During this vectorization, explicit user vectors (e.g. float2, float4) are broken into scalars and then re-vectorized across the work-item space.

With the above in mind, at the vector assembly level in the compute_sp_v1 kernel, each instruction is data dependent on the previous one. In the other kernels (the ones using vector types), after the operations are broken into scalars, we get two separate dependency chains. This allows the compiler to schedule independent instructions next to each other and benefit from the processor's instruction-level parallelism.

Having said that, in 64-bit mode our compiler manages to expose this parallelism, while in 32-bit mode it doesn't. I assume that this difference is due to the smaller number of registers available in 32-bit mode. This explains the higher performance that you observed in 64-bit mode.
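To make the dependency-chain argument concrete, here is a rough toy model in Python. It is only a sketch: the function name `min_cycles` is made up for illustration, and the latency of 3 cycles and the 2 execution units are assumed example values, not measurements of this CPU or of the Intel compiler's output. The model takes the larger of two lower bounds: the time one serial chain needs, and the time needed to issue all operations on the available units.

```python
def min_cycles(ops_per_chain, chains, latency, units):
    """Lower-bound estimate of cycles for `chains` independent dependency chains.

    Toy model only: every op has the same latency and every unit can start
    one op per cycle. Real hardware is more complicated.
    """
    # Inside one chain each op must wait for the previous result,
    # so a chain can never finish faster than ops * latency.
    latency_bound = ops_per_chain * latency
    # Across all chains, at most `units` ops can be issued per cycle.
    total_ops = ops_per_chain * chains
    throughput_bound = -(-total_ops // units)  # ceiling division
    return max(latency_bound, throughput_bound)


LATENCY = 3  # assumed example latency in cycles (hypothetical value)
UNITS = 2    # assumed number of execution units (hypothetical value)

for chains in (1, 2, 4, 8):
    cycles = min_cycles(1000, chains, LATENCY, UNITS)
    ops_per_cycle = round(1000 * chains / cycles, 2)
    print(f"{chains} chain(s): {cycles} cycles, {ops_per_cycle} ops/cycle")
```

In this model a single chain is limited by latency alone, and adding a second independent chain doubles the work done in the same number of cycles. That matches the direction of the effect described above, though it does not try to reproduce the measured numbers.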

The theoretical peak GFLOPS of your CPU is higher than 150.

Arik

 

krishnaraj's picture

Thank you for the reply.

A few questions:

http://download.intel.com/support/processors/corei7/sb/core_i7-3600_m.pdf says that the maximum CPU FLOPS of the 3630QM is 76 GFLOPS, so I am confused.

flops = 4 cores * 8 avx * 2.4 GHz * 1 mul/add per clock = 76.8 GFLOPS

Parallelism in this context means pipelining, right? In the float8 kernel you already have AVX instructions (ILP exploited), so you only get maximum throughput when the pipeline is kept busy. Is that right?

 

Thanks again

Hello Krishnaraj,

Sorry for the delay.

Unfortunately, there is a mistake in that document.

The correct peak GFLOPS calculation is:  4 cores * 2 AVX ALUs * 8 avx * 2.4 GHz * 1 mul/add per clock = 153.6 GFLOPS

The actual frequency might be higher with turbo, but I cannot calculate it, as it depends on too many factors.
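For anyone who wants to check the arithmetic, the two peak figures in this thread can be reproduced with a few lines of Python. This is just a sketch of the formula quoted above. The function and parameter names are made up for illustration, and the "implied clock" at the end is only a back-of-the-envelope division of the measured 172.25 GFLOPS by the per-cycle peak, not a measured frequency.

```python
def peak_gflops(cores, alus_per_core, lanes, ghz, flops_per_lane_per_clock=1):
    """Peak GFLOPS = cores * ALUs per core * SIMD lanes * GHz * flops per lane per clock."""
    return cores * alus_per_core * lanes * ghz * flops_per_lane_per_clock


# Figure from the PDF: counts only one AVX ALU per core.
pdf_peak = peak_gflops(cores=4, alus_per_core=1, lanes=8, ghz=2.4)

# Corrected figure: counts both AVX ALUs per core.
corrected_peak = peak_gflops(cores=4, alus_per_core=2, lanes=8, ghz=2.4)

# Clock that would be needed to reach the measured float4 result
# if every ALU lane were busy on every cycle.
flops_per_cycle = 4 * 2 * 8
implied_ghz = 172.25 / flops_per_cycle

print(f"PDF peak:       {pdf_peak:.1f} GFLOPS")
print(f"Corrected peak: {corrected_peak:.1f} GFLOPS")
print(f"Implied clock for 172.25 GFLOPS: {implied_ghz:.2f} GHz")
```

The measured 172.25 GFLOPS exceeds the corrected peak at the 2.4 GHz base clock, so under this formula the cores must have been running above base frequency during the test.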

Parallelism in this context is about using both of the AVX ALUs.

 

Arik

krishnaraj's picture

Thanks, Arik. I know it's Christmas Eve.

Well, that explains everything. So 172 GFLOPS is the effect of turbo mode.

                  
