Why is my application not able to reach core i7 920 peak FP performance

I have a question about the peak FP performance of my Core i7 920.

I have an application that does a lot of MAC operations (basically a convolution), and even when using multi-threading and SSE instructions I miss the peak FP performance of the CPU by a factor of ~8x.
While trying to find the reason for this, I ended up with a simplified code snippet, running on a single thread and not using SSE instructions, which performs equally badly:

for(i=0; i<49335264; i++)
{
data[i] += other_data[i] * other_data2[i];
}

If I'm correct (the data and other_data arrays are all FP), this piece of code requires:

49335264 * 2 = 98670528 FLOPs

It executes in ~150 ms (I'm very sure this timing is correct, since C timers and the Intel VTune Profiler give me the same result).

This means the performance of this code snippet is:

98670528 / (150 × 10^-3) / 10^9 = 0.66 GFLOPs/sec

Where the peak performance of this CPU should be 2 × 3.2 = 6.4 GFLOPs/sec (2 FP units at 3.2 GHz), right?
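(Or, since a packed SSE instruction works on 4 single-precision values at once, should the single-core peak rather be 2 × 4 × 3.2 = 25.6 GFLOPs/sec? Either way I'm nowhere near it.)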

Is there any explanation for this huge gap? I first thought it was because the application is memory-bound, but that would mean:

The peak STREAM bandwidth of my CPU is ~16.4 GB/s, right? Let's say every iteration requires 3 FP reads and 1 FP write, i.e. 16 bytes of traffic. That adds up to 49335264 × 16 = 789,364,224 bytes of main-memory traffic for the entire application (assuming nothing is cached), which runs in ~150 ms. That works out to 789,364,224 / (150 × 10^-3) / 10^9 = 5.26 GB/s, so I would say I don't hit the bandwidth ceiling?

I also tried changing the operation within the loop to " data[i] += 2.0 * 5.0 " to test whether this would improve the performance, but this yields the exact same performance.

Thanks a lot in advance, I could really use your help!


Hello Dan,
Let's look at your simplest case.

for(i=0; i<49335264; i++) { data[i] +=2.0 * 5.0; }

You could get the slow performance for the above loop if the data[] array is not initialized to valid values.
If data[] is not initialized then you could be incurring floating point exceptions.
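If the uninitialized values happen to be denormals, every operation on them can fall into a slow microcode assist. A quick experiment (just a sketch, not code taken from your application) is to turn on flush-to-zero / denormals-are-zero before the loop using the SSE control-register intrinsics:

[cpp]#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE / _MM_FLUSH_ZERO_ON
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE / _MM_DENORMALS_ZERO_ON (SSE3)

// Flush denormal results to zero and treat denormal inputs as zero,
// so the hardware doesn't take slow assists on them.
static void disable_denormals(void)
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}[/cpp]

Call it once at the start of every thread that does the FP work; if the timing doesn't change, denormals were not the problem.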

The i7 920 processor has 3 memory channels. Do you have only 1 memory chip in the system?
If so, the theoretical memory bandwidth is 1/3 * 25.6 GB/s = 8.5 GB/s.
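(For reference, the 25.6 GB/s figure assumes DDR3-1066 on all 3 channels: 1066 MT/s × 8 bytes per transfer ≈ 8.5 GB/s per channel, times 3 channels ≈ 25.6 GB/s.)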

Assuming the above loop is using single-precision floating point, then you are only getting 1.31 GB/sec.

Have you tried running the stream mem bw benchmark? I'm curious what stream reports for your system.
Stream source: http://www.cs.virginia.edu/stream/FTP/Code/ and website http://www.cs.virginia.edu/stream/
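Building the C version should just be something like 'gcc -O3 stream.c -o stream' (add -fopenmp if you want the threaded run), then execute ./stream.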

So... more questions than answers...
Pat

Thanks for your answer.
I'm sure the data array is initialised properly, and all data consists of single precision FPs.
I will run the benchmark tomorrow, thanks for the heads up.

Does anybody have any other ideas? Thanks!

Dear Pat,

I ran the benchmark and here is the result:

-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 2000000, Offset = 0
Total memory required = 45.8 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 2642 microseconds.
(= 2642 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:           9060.1950     0.0036     0.0035     0.0037
Scale:          8824.8884     0.0036     0.0036     0.0037
Add:            9179.5820     0.0053     0.0052     0.0053
Triad:          8966.9781     0.0054     0.0054     0.0054
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------

Hello Dan,
Thanks for your patience.
This analysis will follow closely the posting http://software.intel.com/en-us/forums/showpost.php?p=177199.

I have a small program which reproduces most of the things about which you have questions.

[cpp]#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define NUM 49335264
#define TIMEi 1000

// return epoch time in seconds (with usec accuracy)
double dclock(void)
{
    double xxx;
    struct timeval tv2;
    gettimeofday(&tv2, NULL);
    xxx = (double)(tv2.tv_sec) + 1.0e-6*(double)tv2.tv_usec;
    return xxx;
}

#ifndef MEM_TRAFFIC
#define MEM_TRAFFIC 2
#endif

#if (MEM_TRAFFIC != 2) && (MEM_TRAFFIC != 4)
#error "MEM_TRAFFIC must be 2 or 4"
#endif

float a[NUM], b[NUM], c[NUM];

int main(int argc, char **argv)
{
    int i, j, k, m;
    double dt, result, ops, tot_ops, tot_time, tm_beg, tm_end;

    printf("init data\n");
    for(i=0; i<NUM; i++) { a[i] = 1.0f; b[i] = 2.0f; c[i] = 3.0f; }

    printf("start FP op\n");
    tot_ops  = 0.0;
    tot_time = 0.0;
    for(m=0; m<10; m++)
    {
        tm_beg = dclock();
        for(i=0; i<NUM; i++)
        {
#if (MEM_TRAFFIC == 2)
            a[i] += 2.0 * 5.0;    // 1 flop, 2 4-byte accesses per iteration
#else
            a[i] += b[i] * c[i];  // 2 flops, 4 4-byte accesses per iteration
#endif
        }
        tm_end = dclock();
        dt  = tm_end - tm_beg;
        ops = (MEM_TRAFFIC == 2 ? 1.0 : 2.0) * (double)NUM;   // flops in this pass
        tot_ops  += ops;
        tot_time += dt;
        printf("m= %d, NUM= %d, MFlops= %f, Mop= %f, time= %f, MB/s= %f\n",
               m, NUM, 1.0e-6*ops/dt, 1.0e-6*ops, dt,
               1.0e-6*(double)MEM_TRAFFIC*4.0*(double)NUM/dt);
    }
    printf("tot_Mop= %f, tot_time= %f, overall Mops/sec= %f\n",
           1.0e-6*tot_ops, tot_time, 1.0e-6*tot_ops/tot_time);

    if(argc > 10) // just put this in so compiler doesn't optimize everything away.
    {
        float d=0;
        for(i=0; i<NUM; i++) { d += a[i]; }
        printf("d= %f\n", d);
    }
    return 0;
}[/cpp]
Let's compile with the command below:
snb-d2:/home/flops # gcc dan_fmac.c -o dan_fmac -g -DMEM_TRAFFIC=2

The '-DMEM_TRAFFIC=2' option makes the code just do 'a[i] += 2.0 * 5.0;'.
If we look at the assembly code (use the '-S -c -o dan_fmac.s' option to generate assembly file) then we see that the compiler precomputes the '2.0 * 5.0' so there is only 1 floating point operation (the add) per iteration of the loop.
There is 1 load and 1 store per iteration.
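In other words, after constant folding the loop body is effectively 'a[i] += 10.0;', i.e. one add plus one 4-byte load and one 4-byte store per iteration.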
I'll run the program under 'perf' with the command below.
'perf stat -e cycles -e r2010 ./dan_fmac' says to collect the 'cycles' (clockticks) event and the raw event 'r2010'.
The 'r2010' means collect the event number 0x10 with umask 0x20. On Sandybridge, this is the FP_COMP_OPS_EXE.SSE_FP_SCALAR_SINGLE event.
See the SDM Vol 3, Section 19.3, for Sandy Bridge events and their encodings.
See http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html for the SDM.

Below is the command to run it and the output:

snb-d2:/home/flops # perf stat -e cycles -e r2010 ./dan_fmac

init data

start FP op

m= 0, NUM= 49335264, MFlops= 384.267156, Mop= 49.335264, time= 0.128388, MB/s= 3074.137250

m= 1, NUM= 49335264, MFlops= 384.311404, Mop= 49.335264, time= 0.128373, MB/s= 3074.491232

m= 2, NUM= 49335264, MFlops= 384.272865, Mop= 49.335264, time= 0.128386, MB/s= 3074.182921

m= 3, NUM= 49335264, MFlops= 383.902145, Mop= 49.335264, time= 0.128510, MB/s= 3071.217159

m= 4, NUM= 49335264, MFlops= 384.387077, Mop= 49.335264, time= 0.128348, MB/s= 3075.096616

m= 5, NUM= 49335264, MFlops= 384.299984, Mop= 49.335264, time= 0.128377, MB/s= 3074.399874

m= 6, NUM= 49335264, MFlops= 384.425639, Mop= 49.335264, time= 0.128335, MB/s= 3075.405110

m= 7, NUM= 49335264, MFlops= 384.036805, Mop= 49.335264, time= 0.128465, MB/s= 3072.294437

m= 8, NUM= 49335264, MFlops= 384.359945, Mop= 49.335264, time= 0.128357, MB/s= 3074.879564

m= 9, NUM= 49335264, MFlops= 384.347809, Mop= 49.335264, time= 0.128361, MB/s= 3074.782472

tot_Mop= 493.352640, tot_time= 1.283900, overall Mops/sec= 384.261020
 Performance counter stats for './dan_fmac':
     5293118691  cycles

      528655357  raw 0x2010
    1.589042239  seconds time elapsed

So we see that I'm only getting 384 MFlops and about 3074 MB/sec of memory bandwidth.
Not good.
The 'perf' output shows that we are only getting about 1 add for every 10 cycles.
The gcc compiler, if you don't specify an optimization level, apparently doesn't optimize much.

If I compile with -O3 things are a lot better.
Using cmd: gcc dan_fmac.c -o dan_fmac -g -DMEM_TRAFFIC=2 -O3
I get output:

snb-d2:/home/flops # perf stat -e cycles -e r4010 ./dan_fmac

init data

start FP op

m= 0, NUM= 49335264, MFlops= 2169.911444, Mop= 49.335264, time= 0.022736, MB/s= 17359.291553

m= 1, NUM= 49335264, MFlops= 2171.619372, Mop= 49.335264, time= 0.022718, MB/s= 17372.954979

m= 2, NUM= 49335264, MFlops= 2169.638425, Mop= 49.335264, time= 0.022739, MB/s= 17357.107399

m= 3, NUM= 49335264, MFlops= 2166.390225, Mop= 49.335264, time= 0.022773, MB/s= 17331.121801

m= 4, NUM= 49335264, MFlops= 2170.025222, Mop= 49.335264, time= 0.022735, MB/s= 17360.201780

m= 5, NUM= 49335264, MFlops= 2166.390225, Mop= 49.335264, time= 0.022773, MB/s= 17331.121801

m= 6, NUM= 49335264, MFlops= 2169.456450, Mop= 49.335264, time= 0.022741, MB/s= 17355.651602

m= 7, NUM= 49335264, MFlops= 2167.048165, Mop= 49.335264, time= 0.022766, MB/s= 17336.385316

m= 8, NUM= 49335264, MFlops= 2169.638425, Mop= 49.335264, time= 0.022739, MB/s= 17357.107399

m= 9, NUM= 49335264, MFlops= 2167.048165, Mop= 49.335264, time= 0.022766, MB/s= 17336.385316

tot_Mop= 493.352640, tot_time= 0.227486, overall Mops/sec= 2168.715219
 Performance counter stats for './dan_fmac':
     1219427840  cycles

      222453725  raw 0x4010
    0.369353868  seconds time elapsed

Now the MFlops are about 5.6x faster. The memory bandwidth is about 5.6x higher too.
For the above runI used the raw event 'r4010' which is the FP_COMP_OPS_EXE.SSE_PACKED_SINGLE event.
FP_COMP_OPS_EXE.SSE_PACKED_SINGLE counts SSE packed single precision instructions.
Each packed SP instruction does 4 operations.
The compiler optimizations (at -O3) vectorize the code so that it uses packed SSE instructions.
So now, by the perf data, we are getting 0.72 floating point operations per cycle.
That is, flops/cycle = 0.72 ≈ 4 * 222453725 / 1219427840.
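Just to show where the '4 operations per instruction' comes from, the vectorized loop is roughly equivalent to the hand-written SSE intrinsics sketch below (only an illustration of what -O3 generates by itself, not code you need to write):

[cpp]#include <xmmintrin.h>   // SSE intrinsics: __m128, _mm_loadu_ps, _mm_mul_ps, ...

// Roughly what the vectorized 'a[i] += b[i] * c[i];' loop does.
// Unaligned loads/stores keep the sketch general; n is assumed to be a multiple of 4.
void fmac_sse(float *a, const float *b, const float *c, int n)
{
    int i;
    for (i = 0; i < n; i += 4) {
        __m128 va  = _mm_loadu_ps(&a[i]);                // 4 packed singles
        __m128 vbc = _mm_mul_ps(_mm_loadu_ps(&b[i]),     // 4 multiplies
                                _mm_loadu_ps(&c[i]));
        _mm_storeu_ps(&a[i], _mm_add_ps(va, vbc));       // 4 adds, then store back
    }
}[/cpp]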

If I compile it with -DMEM_TRAFFIC=4, the inner loop becomes a[i] += b[i]*c[i];

Without optimizations I get:

snb-d2:/home/flops # gcc dan_fmac.c -o dan_fmac -g -DMEM_TRAFFIC=4

snb-d2:/home/flops # perf stat -e cycles -e r2010 ./dan_fmac

init data

start FP op

m= 0, NUM= 49335264, MFlops= 673.620407, Mop= 98.670528, time= 0.146478, MB/s= 5388.963256

m= 1, NUM= 49335264, MFlops= 674.167973, Mop= 98.670528, time= 0.146359, MB/s= 5393.343784

m= 2, NUM= 49335264, MFlops= 674.176759, Mop= 98.670528, time= 0.146357, MB/s= 5393.414071

m= 3, NUM= 49335264, MFlops= 673.786009, Mop= 98.670528, time= 0.146442, MB/s= 5390.288075

m= 4, NUM= 49335264, MFlops= 674.176759, Mop= 98.670528, time= 0.146357, MB/s= 5393.414071

m= 5, NUM= 49335264, MFlops= 674.176759, Mop= 98.670528, time= 0.146357, MB/s= 5393.414071

m= 6, NUM= 49335264, MFlops= 673.514069, Mop= 98.670528, time= 0.146501, MB/s= 5388.112556

m= 7, NUM= 49335264, MFlops= 673.799173, Mop= 98.670528, time= 0.146439, MB/s= 5390.393387

m= 8, NUM= 49335264, MFlops= 674.039506, Mop= 98.670528, time= 0.146387, MB/s= 5392.316047

m= 9, NUM= 49335264, MFlops= 673.859515, Mop= 98.670528, time= 0.146426, MB/s= 5390.876118

tot_Mop= 986.705280, tot_time= 1.464103, overall Mops/sec= 673.931609
 Performance counter stats for './dan_fmac':
     5900500805  cycles

     1017177303  raw 0x2010
    1.769067328  seconds time elapsed
 

This is similar to your results. About 674 MFlops, 5390 MB/sec and 5.8 cycles/flop. Not too good.
Note that I use the raw event 'r2010' (the FP_COMP_OPS_EXE.SSE_FP_SCALAR_SINGLE event).

If I compile with optimizations (-O3) and run then I get:

snb-d2:/home/pfay/flops # gcc dan_fmac.c -o dan_fmac -g -DMEM_TRAFFIC=4  -O3

snb-d2:/home/pfay/flops # perf stat -e cycles -e r4010 ./dan_fmac

init data

start FP op

m= 0, NUM= 49335264, MFlops= 2247.973614, Mop= 98.670528, time= 0.043893, MB/s= 17983.788910

m= 1, NUM= 49335264, MFlops= 2260.016330, Mop= 98.670528, time= 0.043659, MB/s= 18080.130637

m= 2, NUM= 49335264, MFlops= 2263.811601, Mop= 98.670528, time= 0.043586, MB/s= 18110.492811

m= 3, NUM= 49335264, MFlops= 2258.154265, Mop= 98.670528, time= 0.043695, MB/s= 18065.234119

m= 4, NUM= 49335264, MFlops= 2262.104007, Mop= 98.670528, time= 0.043619, MB/s= 18096.832060

m= 5, NUM= 49335264, MFlops= 2260.954690, Mop= 98.670528, time= 0.043641, MB/s= 18087.637520

m= 6, NUM= 49335264, MFlops= 2261.832020, Mop= 98.670528, time= 0.043624, MB/s= 18094.656163

m= 7, NUM= 49335264, MFlops= 2259.251402, Mop= 98.670528, time= 0.043674, MB/s= 18074.011214

m= 8, NUM= 49335264, MFlops= 2261.572457, Mop= 98.670528, time= 0.043629, MB/s= 18092.579659

m= 9, NUM= 49335264, MFlops= 2259.769522, Mop= 98.670528, time= 0.043664, MB/s= 18078.156177

tot_Mop= 986.705280, tot_time= 0.436685, overall Mops/sec= 2259.536339
 Performance counter stats for './dan_fmac':
     1918680883  cycles

      364863566  raw 0x4010
    0.577368584  seconds time elapsed

Now we see a 3.35x increase in Flops, BW and flops/cycle.

So hopefully the explanation is that, when you compiled your program, optimizations were not enabled or the compiler was not able to vectorize the code.
You can compile with '-O3 -ftree-vectorizer-verbose=3' to see why the compiler is not able to vectorize.
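If the vectorizer report complains about possible aliasing between the arrays in your real convolution code, one thing that often helps (again, just a sketch) is to put the loop in a function with restrict-qualified pointers so the compiler knows the arrays don't overlap:

[cpp]// 'restrict' is C99, so compile with -std=c99 (or -std=gnu99).
// It tells the compiler the three arrays never overlap, which lets -O3
// vectorize the loop without runtime alias checks.
void fmac(float * restrict data, const float * restrict x,
          const float * restrict y, int n)
{
    int i;
    for (i = 0; i < n; i++)
        data[i] += x[i] * y[i];
}[/cpp]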
Does this help?
Pat
