Processing is ~3.8% slower when an executable is built with /Qopt-streaming-stores:always compiler option

Hi everybody,

I detected an issue ( I hope that this is not a problem ) when /Qopt-streaming-stores:always is used. Overall, processing of a large data set is ~3.8% slower when an executable is built with the /Qopt-streaming-stores:always compiler option.

I'll be glad to provide some additional details. Thanks in advance.

 

Here are test results:

...
Without /Qopt-streaming-stores:always
...
... - Pass 01 - Completed: 2.09400 secs - Excluded from calculation of average value
... - Pass 02 - Completed: 1.39000 secs
... - Pass 03 - Completed: 1.40700 secs
... - Pass 04 - Completed: 1.32800 secs
... - Pass 05 - Completed: 1.34300 secs
... - Pass 06 - Completed: 1.32900 secs
... - Pass 07 - Completed: 1.34300 secs
... - Pass 08 - Completed: 1.34400 secs
... - Pass 09 - Completed: 1.34400 secs
... - Pass 10 - Completed: 1.32800 secs
... - Pass 11 - Completed: 1.34400 secs
... - Pass 12 - Completed: 1.34300 secs
... - Pass 13 - Completed: 1.34400 secs
... - Pass 14 - Completed: 1.32800 secs
... - Pass 15 - Completed: 1.34400 secs
... - Average: 1.34707 secs
...
With /Qopt-streaming-stores:always
...
... - Pass 01 - Completed: 2.17100 secs - Excluded from calculation of average value
... - Pass 02 - Completed: 1.45400 secs
... - Pass 03 - Completed: 1.45300 secs
... - Pass 04 - Completed: 1.39000 secs
... - Pass 05 - Completed: 1.39100 secs
... - Pass 06 - Completed: 1.39100 secs
... - Pass 07 - Completed: 1.40600 secs
... - Pass 08 - Completed: 1.39000 secs
... - Pass 09 - Completed: 1.39100 secs
... - Pass 10 - Completed: 1.39100 secs
... - Pass 11 - Completed: 1.39000 secs
... - Pass 12 - Completed: 1.39100 secs
... - Pass 13 - Completed: 1.39100 secs
... - Pass 14 - Completed: 1.39000 secs
... - Pass 15 - Completed: 1.39100 secs
... - Average: 1.40071 secs
...

Note 1: Processing is ~3.8% slower in the 2nd case.

Note 2: Unfortunately, I can't provide sources to reproduce the issue. Any algorithm that processes a large data set, for example min/max reduction, matrix multiplication, matrix transpose, etc. ( at least one for-loop is needed ), could experience a performance slowdown.

>>...Overall, processing of a large data set is ~3.8% slower when an executable is built with the /Qopt-streaming-stores:always
>>compiler option...

I forgot to mention that I consider this a common issue, and it could apply to all Intel C++ compiler releases of versions 12 and 13.

Please take this as my humble guess, as I am not sure and I have not run benchmarks, but IMO "always" forces the compiler to always use these stores, even when they are not optimal. Please try to re-run the tests with "/Qopt-streaming-stores:auto", because I think letting the compiler decide whether or not to use streaming stores, based on the heuristics of Intel engineers, should be more optimal.

--
With best regards,
VooDooMan
-
If you find my post helpful, please rate it and/or select it as a best answer where applies. Thank you.

>>...Please take this as my humble guess, as I am not sure and I have not run benchmarks, but IMO "always" forces the compiler
>>to always use these stores, even when they are not optimal. Please try to re-run the tests with "/Qopt-streaming-stores:auto"...

I decided to use the "always" option because I have memory-bound processing ( up to several GBs of data ). I'll try the "auto" option to see if it makes a difference. Thanks.

Quote:

Sergey Kostrov wrote:

I decided to use the "always" option because I have memory-bound processing ( up to several GBs of data ). I'll try the "auto" option to see if it makes a difference. Thanks.

Just out of my curiosity, please report back here your findings. It may be useful not only for me, but for the whole community as well. TIA!

--
With best regards,
VooDooMan
-
If you find my post helpful, please rate it and/or select it as a best answer where applies. Thank you.

Sergey,

Why would you assume streaming stores always implies faster?

While your code may not be immediately re-referencing the data written, the data written may alias to addresses contained within the cache system, and thus cause "false" eviction.

3.8% is not much to worry about. You can observe a difference of this size depending simply on where code loops reside: e.g. a 1-byte movement in a loop could cause the ending branch back to the top of the loop (and possibly the prefetch of data following the branch) to fall across/into an additional cache line, thus slowing the instruction fetch time of the loop.

Jim Dempsey

www.quickthreadprogramming.com

Could the slowdown be due to the overhead ( more machine code ) of streaming-store loops?

Did you expect a larger effect?

Such a small change in performance probably indicates that some of your arrays gain from streaming stores and some lose. As mentioned earlier, with the automatic setting the compiler will attempt to discover which arrays are candidates for streaming stores (not read back, where that is visible to the compiler). Where it's not visible to the compiler, you may have a long job ahead with VTune if you want to discover the detailed cache behavior.

Situations where streaming stores run 20% faster on one target, with a specific organization of source code and number of threads, and several times longer on another aren't unusual. Where you optimize with tiling you can expect to need to remove streaming stores if you don't make the tiling visible to the compiler. The compiler doesn't attempt to guess, for this purpose, how many threads you will use.

If the expected array sizes aren't visible to the compiler, you may need to use the pragmas on the loops where you want to test streaming stores, rather than telling the compiler to use them whenever possible. Streaming-store compilation may have a greater effect if you prevent the compiler from making implicit fast_memcpy and memset substitutions, where you otherwise have little choice but to accept what is built into the run-time library.
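
For per-loop control, the Intel compiler's nontemporal pragma can be placed on just the loops you want to test; below is a minimal sketch (the function and array names are only illustrative, and the exact pragma spelling should be checked against your compiler version's documentation):

#include <stddef.h>

// Request streaming (non-temporal) stores for 'dst' in this loop only,
// instead of forcing them program-wide with /Qopt-streaming-stores:always.
void copy_large_array( float *dst, const float *src, size_t n )
{
    #pragma vector nontemporal ( dst )
    for( size_t i = 0; i < n; ++i )
        dst[ i ] = src[ i ];
}

Note that if the compiler recognizes such a loop and substitutes an intrinsic fast_memcpy, the pragma no longer controls the stores performed inside the library routine.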

>>...Just out of my curiosity, please report back here your findings...

Here is an update: with /Qopt-streaming-stores:auto there are no performance decreases. I would also say that the processing times "look better" compared to the 1st result in my 2nd post ( that is, without /Qopt-streaming-stores:always ).

My final complete set of Intel C++ compiler options is as follows ( for Release Configuration ):

/c
/O3
/Oi
/Ot
/Oy
/GF
/MT
/GS-
/fp:fast=2
/W5
/nologo
/Wp64
/Zi
/Gr
/TP
/Qopenmp
/Qfp-speculation:fast
/Qopt-matmul
/Qparallel
/Qstd=c99
/Qstd=c++0x
/Qrestrict
/Qunroll:4
/Qopt-block-factor:64
/Qopt-streaming-stores:auto

Here are a couple more notes:

>>Why would you assume streaming stores always implies faster?

I've looked at the Quick Reference Guide to Optimizations with Intel Compilers, and the description of the sub-option 'always' is as follows:
...
Encourages the compiler to generate streaming stores that bypass cache, assuming application is memory bound with little data reuse
...
This is exactly my situation at the moment ( very memory-bound processing due to a large data set ), and I really wanted to resolve some cache-related issues that currently exist during processing.
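
For context, "streaming stores that bypass cache" means the compiler emits non-temporal store instructions. A rough intrinsic-level illustration of the same idea ( my sketch, not actual compiler output; fill_nt is an illustrative name ):

#include <stddef.h>
#include <xmmintrin.h>

// Fill a large, 16-byte aligned float array using non-temporal stores.
// 'n' is assumed to be a multiple of 4 to keep the sketch short.
void fill_nt( float *dst, float value, size_t n )
{
    __m128 v = _mm_set1_ps( value );
    for( size_t i = 0; i < n; i += 4 )
        _mm_stream_ps( dst + i, v );   // store that bypasses the cache hierarchy
    _mm_sfence();                      // make the non-temporal stores globally visible
}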

>>While your code may not be re-referencing immediately the data written, the data written may alias to addresses contained
>>within the cache system, and thus cause "false" eviction.

I don't know at the moment whether this is the "false" eviction; as Tim suggested, VTune needs to be used to understand what is going on.

>>Could the slowdown be due to the overhead ( more machine code ) of streaming-store loops?

This is a cache-related issue, as I've already mentioned.

>>Did you expect a larger effect?

I expected a positive performance improvement ( at least a couple of percent ). I didn't expect to see a negative impact.

Thanks to everybody for comments!

>>My final complete set of Intel C++ compiler options is as follows ( for Release Configuration ):
>>
>>...
>>/Qunroll:4
>>/Qopt-block-factor:64
>>/Qopt-streaming-stores:auto

Here are some results regarding these compiler options:

/Qunroll:4 - OK
/Qunroll:{ 8, 16 } - Negative impact
/Qopt-block-factor:{ 32, 64, 128 } - OK
/Qopt-streaming-stores:always - Negative impact
/Qopt-streaming-stores:auto - OK ( auto is actually the default value for the option )

I'm currently investigating whether a couple of misaligned pointers ( not aligned on a 64-byte boundary ) created by the C++ new operator are related to this issue.

Quote:

Sergey Kostrov wrote:

>>...Just out of my curiosity, please report back here your findings...

Here is an update: with /Qopt-streaming-stores:auto there are no performance decreases. I would also say that the processing times "look better" compared to the 1st result in my 2nd post ( that is, without /Qopt-streaming-stores:always ).

Sergey, could you please do me a favor and post your percentage values? I am very curious about the percentages you mentioned before. I am bothering you because you can't provide your sources, and I just want to learn something about the ICC compiler from your percentage reports. Many thanks in advance!

--
With best regards,
VooDooMan
-
If you find my post helpful, please rate it and/or select it as a best answer where applies. Thank you.

I'll try to create a simple test case.

Sergey,

Another cache optimization technique for you to explore (in addition to streaming stores) is to incorporate CLFLUSH at appropriate places in the code. The purpose is to leave the cache system with fewer occasions on which it has to guess which cache line(s) to evict. A proper implementation will require careful study of your code to ensure that you are only CLFLUSH-ing data you are no longer interested in.

A second level of use of CLFLUSH is to flush the cache lines that are easily predictable for the cache system's prefetcher. IOW, prefer to keep in cache the data that is more expensive to fetch. In a C++ matrix multiply where you have pointers to rows, the row data is relatively inexpensive to fetch, and the prefetcher may fetch it ahead of your request. The column data, and more importantly the adjacent column data not immediately used in this DOT product, is more expensive to fetch, and its cache lines are more likely to be re-used for the next column. Thus the column data cache lines are more important to keep in cache than the row data.

// square matrix multiply
// SIZEOF_CACHE_LINE is assumed to be the cache line size in bytes (typically 64)
for(int i = 0; i < n; ++i) {
  for(int j = 0; j < n; ++j) {
    double sum = 0.0;
    for(int k = 0; k < n; ++k) {
      sum += A[i][k] * B[k][j]; // B is more expensive to fetch
      // after finishing a cache line of row A[i], evict it to favor keeping B in cache
      if(((k + 1) % (SIZEOF_CACHE_LINE / sizeof(sum))) == 0)
        clflush(&A[i][k]);
    }
    C[i][j] = sum;
  }
}

I will let you fix up the code for your purpose.

Note, the above may show improvements (on a very large matrix) when single-threaded.
For multi-threaded use it will be a little more difficult if the entire row of A is shared.

 Jim Dempsey

www.quickthreadprogramming.com

Sergey,

See: http://software.intel.com/en-us/forums/topic/397392

John D. McCalpin's response on 7/12/13 8:35

Jim

www.quickthreadprogramming.com

>>...could you please do me a favor and post your percentage values?..

Here they are:

With /Qopt-streaming-stores:auto

... - Pass 01 - Completed: 2.12500 secs - Excluded from calculation of average value
... - Pass 02 - Completed: 1.39100 secs
... - Pass 03 - Completed: 1.40600 secs
... - Pass 04 - Completed: 1.34400 secs
... - Pass 05 - Completed: 1.32800 secs
... - Pass 06 - Completed: 1.34400 secs
... - Pass 07 - Completed: 1.34400 secs
... - Pass 08 - Completed: 1.34300 secs
... - Pass 09 - Completed: 1.34400 secs
... - Pass 10 - Completed: 1.34400 secs
... - Pass 11 - Completed: 1.34400 secs
... - Pass 12 - Completed: 1.34300 secs
... - Pass 13 - Completed: 1.32900 secs
... - Pass 14 - Completed: 1.34300 secs
... - Pass 15 - Completed: 1.34400 secs
... - Average: 1.34936 secs

It is ~0.17% slower than the 1st test case:

...
... - Pass 14 - Completed: 1.32800 secs
... - Pass 15 - Completed: 1.34400 secs
... - Average: 1.34707 secs
...

and this is due to Windows multitasking ( both cases were executed with the priority set to High ).

>>...Another cache optimization technique for you to explore (in addition to streaming stores), is to incorporate CLFLUSH at
>>appropriate places in the code...

I'll investigate / test that piece of code. Thanks, Jim.

Do you know that the CRT function calloc doesn't align on a 64-byte boundary ( at least for 'float' data )? Here are some results:

[ Test 1 ]

...
float *pData = ( float * )calloc( 1, 1024 * sizeof( float ) );
...

Pointer 0x00393E20 is aligned on...........4
Pointer 0x00393E20 is aligned on...........8
Pointer 0x00393E20 is aligned on.........16
Pointer 0x00393E20 is aligned on.........32
Pointer 0x00393E20 is Not aligned on..64 ( 0x00393E20 -> ( 3751456 % 64 ) = 32 )
Pointer 0x00393E20 is Not aligned on.128 ( 0x00393E20 -> ( 3751456 % 128 ) = 32 )

[ Test 2 ]

...
float *pData = ( float * )_mm_malloc( 1024 * sizeof( float ), 64 );
...

Pointer 0x00C8FF40 is aligned on...........4
Pointer 0x00C8FF40 is aligned on...........8
Pointer 0x00C8FF40 is aligned on.........16
Pointer 0x00C8FF40 is aligned on.........32
Pointer 0x00C8FF40 is aligned on.........64
Pointer 0x00C8FF40 is Not aligned on.128 ( 0x00C8FF40 -> ( 13172544 % 128 ) = 64 )

[ Test 3 ]

...
float *pData = ( float * )_mm_malloc( 1024 * sizeof( float ), 128 );
...

Pointer 0x00C8FF80 is aligned on.........4
Pointer 0x00C8FF80 is aligned on.........8
Pointer 0x00C8FF80 is aligned on.......16
Pointer 0x00C8FF80 is aligned on.......32
Pointer 0x00C8FF80 is aligned on.......64
Pointer 0x00C8FF80 is aligned on.....128
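
For reference, the alignment report above can be produced with a small helper along these lines ( check_alignment is just an illustrative name ):

#include <stdio.h>
#include <stdint.h>

// Reports whether a pointer is aligned on the given boundary, e.g. check_alignment( pData, 64 )
void check_alignment( const void *p, unsigned int boundary )
{
    unsigned long long addr = ( unsigned long long )( uintptr_t )p;
    if( ( addr % boundary ) == 0 )
        printf( "Pointer %p is aligned on %u\n", p, boundary );
    else
        printf( "Pointer %p is Not aligned on %u ( %llu %% %u = %llu )\n",
                p, boundary, addr, boundary, addr % boundary );
}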

There is no requirement for malloc, calloc, ... to provide aligned allocations. Heap nodes are managed by a linked list of pointers, therefore the minimal alignment is sizeof(void*). Historically, the underlying raw allocations on MS-DOS were by the paragraph (16 bytes), but the base of the allocation would not be that of the first paragraph. This was due to at least one pointer remaining as a header, though typically it was two pointers' worth (one for the node link, one for the size). This practice carried over to 32-bit systems. You may still see 8-byte alignment on a 32-bit O/S, though I think most current CRTLs return 16-byte alignment. Also note what happens when you allocate an array:

char* foo1 = new char[3]; // use [] format
char* foo2 = new char[3];

Array allocations include a count. Thus new[] will typically return the raw allocation node (possibly 16-byte aligned) + link* + size + count (node + 12 bytes) on a 32-bit O/S. Newer CRTLs may round this up to +16 bytes because that keeps SSE happy.
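
If you need a 64-byte boundary from a plain allocator rather than from _mm_malloc, the usual trick is to over-allocate and round the pointer up. A sketch ( aligned_malloc64 / aligned_free64 are illustrative names ):

#include <stdint.h>
#include <stdlib.h>

// 64-byte aligned allocation built on top of malloc. The raw pointer is stored
// just below the aligned block so that aligned_free64 can recover and free it.
void* aligned_malloc64( size_t size )
{
    void *raw = malloc( size + 64 + sizeof( void* ) );
    if( raw == NULL ) return NULL;
    uintptr_t aligned = ( ( uintptr_t )raw + sizeof( void* ) + 63 ) & ~( uintptr_t )63;
    ( ( void** )aligned )[ -1 ] = raw;
    return ( void* )aligned;
}

void aligned_free64( void *p )
{
    if( p != NULL ) free( ( ( void** )p )[ -1 ] );
}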

If you can work in the _mm_clflush please report back if you have success or not.

Jim Dempsey

www.quickthreadprogramming.com

Quote:

Sergey Kostrov wrote:

>>...could you please do me a favor and post your percentage values?..

Here they are:

Sergey, thank you very much. This is valuable knowledge for me, and I hope for the whole community too, when someone runs into the same issue and searches this forum.

--
With best regards,
VooDooMan
-
If you find my post helpful, please rate it and/or select it as a best answer where applies. Thank you.

Quote:

Sergey Kostrov wrote:

>>...could you please do me a favor and post your percentage values?..

Here they are:

With /Qopt-streaming-stores:auto

Thank you very much, Sergey. It is very interesting, and it improves my knowledge and the community's.

--
With best regards,
VooDooMan
-
If you find my post helpful, please rate it and/or select it as a best answer where applies. Thank you.

Sergey,

Here is an additional optimization technique, to be used only where the problem size exceeds the physical RAM size. I will only sketch the code. Assume the problem is matrix multiplication. The technique I would suggest is:

// C = A * B (matrix multiplication)
// "DepthC" is the run length of the DOT products of the row of A and column of B
for(iDepth = 0; iDepth < DepthC; iDepth += DepthFitsInRAM) { // portion of C, A, B fits in RAM
  iDepthEnd = min(iDepth + DepthFitsInRAM, DepthC);
  BTranspose = Transpose(B, iDepth, iDepthEnd); // transpose only this section of B
  if(iDepth == 0)
    C = DOTproducts(Apart(iDepth, iDepthEnd), BTranspose(iDepth, iDepthEnd));
  else
    C += DOTproducts(Apart(iDepth, iDepthEnd), BTranspose(iDepth, iDepthEnd));
}

IOW, partition the workspace into work units such that the partial transposition of B and the subsequent DOT products do not cause page faults (i.e. they fit in available RAM). The first partial results are stored, and the subsequent results are accumulated. Properly tuned, this will reduce the frequency of paging.
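
To make the sketch concrete, here is a single-threaded C++ fill-in of the same scheme, assuming dense row-major n x n matrices stored as contiguous arrays ( matmul_depth_blocked and depth_block are illustrative names; depth_block plays the role of DepthFitsInRAM ):

#include <algorithm>
#include <cstddef>
#include <vector>

void matmul_depth_blocked( const double *A, const double *B, double *C,
                           std::size_t n, std::size_t depth_block )
{
    std::fill( C, C + n * n, 0.0 );
    std::vector<double> Bt( depth_block * n );          // transposed panel of B

    for( std::size_t d0 = 0; d0 < n; d0 += depth_block )
    {
        std::size_t d1 = std::min( d0 + depth_block, n );

        // Transpose only rows d0..d1 of B so the inner DOT product walks contiguous memory
        for( std::size_t k = d0; k < d1; ++k )
            for( std::size_t j = 0; j < n; ++j )
                Bt[ j * depth_block + ( k - d0 ) ] = B[ k * n + j ];

        // Accumulate the partial DOT products for this depth range into C
        for( std::size_t i = 0; i < n; ++i )
            for( std::size_t j = 0; j < n; ++j )
            {
                double sum = 0.0;
                for( std::size_t k = d0; k < d1; ++k )
                    sum += A[ i * n + k ] * Bt[ j * depth_block + ( k - d0 ) ];
                C[ i * n + j ] += sum;
            }
    }
}

Tune depth_block so that the A panel, the Bt panel and C fit in physical RAM (the same idea applies to cache blocking at a smaller scale).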

Jim Dempsey

www.quickthreadprogramming.com

>>Assume the problem is of matrix multiplication. The technique I would suggest then is:
>>
>>// C = A * B (matrix multiplication)
>>// "DepthC" is the run length of the DOT products of the row A and col B
>>for(iDepth = 0; iDepth < DepthC; iDepth += DepthFitsInRAM) { // portion of C, A, B fits in RAM
>> iDepthEnd = min( iDepth + DepthFitsInRAM, DepthC);
>>BTranspose = Transpose(B, iDepth, iDepthEnd); // Transpose only section of B

I use that transpose-based technique. In the case of large matrices it reduces the number of cache misses / page faults and speeds up calculations ( significantly ). Thanks, Jim.

Sergey,

Is this actually production code? Or is this for sharpening your skills for unusual circumstances?

I think developing the skill to produce a good, functioning program using limited resources is important to have. I am glad to see you are developing this skill. Some problems are simply too large for installed RAM. If you have the skill to sub-partition the work into efficient work units, an otherwise seemingly impossible programming problem is solvable with a little extra effort. My programming experience began in the days when data files were 100's to 1000's of times the size of RAM (actually of core memory). Also, what you learn to apply to reducing paging applies as well to using cache efficiently.

Jim Dempsey

www.quickthreadprogramming.com

>>...Is this actually production code?

This is a 2-year-old C/C++ software subsystem, and I'm currently optimizing some parts due to performance problems and excessive memory usage when processing a large data set. I'll provide some technical details later. Thanks, Jim, for your comments!

>>...Or is this for sharpening your skills for unusual circumstances?

No.

Sergey,

When you have idle time on your hands, there is one more (page file) optimization technique to explore, using the partial transpose of B technique as a basis. Here is a sketch using pseudo OpenMP-like syntax (you fix up private/shared):

#pragma omp parallel
{
  // all threads run through here; redundancy does not matter here
  nThreads = omp_get_num_threads();
  for(iDepth = 0; ...
  {
    if(iDepth == 0)
    {
      // first depth: all threads participate in the partial transpose of B for the first depth
      #pragma omp for
      for(iRowInB = 0; ...
      {
      }
      // note, implicit barrier here
    } // end if(iDepth == 0)
    if(notLastDepth)
    {
      // for the first through all but the last iDepth
      // launch two threads (assuming you have several more than 2 threads)
      // using NOWAIT
      // (num_threads is not a standard clause on worksharing constructs; it is shown only to convey intent)
      #pragma omp sections num_threads(2) nowait
      {
        partialTransposeNextBdepthUsingStreamingStores(0); // first half
        #pragma omp section
        partialTransposeNextBdepthUsingStreamingStores(1); // second half
      } // end sections NOWAIT
      // while two threads perform the above, the remainder perform the DOT products
      // of the currently paged-in data
      #pragma omp for num_threads(nThreads-2)
      for(i = 0; ...
      {
        // perform DOT products using nThreads-2 threads
      }
      // wait for the prefetching threads to complete
      #pragma omp barrier
      // end of not-last-depth section
    } else {
      // last depth section
      // all threads perform the final DOT products
      #pragma omp for
      for(i = 0; ...
      {
        // perform DOT products using all threads
      }
    } // end last section
  } // end for(iDepth = 0; ...
} // end #pragma omp parallel

Tests would have to be performed to assess the number of threads to dedicate to the prefetch.
Note that the prefetching threads are mostly going to be stalled waiting for page-in, so you could potentially oversubscribe by 2 threads (but then remember, for the last pass, to use nThreads-2 threads).

Jim Dempsey

www.quickthreadprogramming.com

>>...I'll provide some technical details later...

Here is a screenshot; as you can see, on a system with 32GB of physical memory almost ~94GB of memory is used during processing ( ~32GB RAM + ~62GB VM ):

Here are processing times after a very long test:

...
Calculating...
... - Pass 01 - Completed: 13766.30762 secs
... - Pass 02 - Completed: 9326.51953 secs
... - Pass 03 - Completed: 9418.59082 secs
... - Pass 04 - Completed: 9412.46094 secs
... - Pass 05 - Completed: 9410.07422 secs
...

Attachment: testprocessing.jpg ( 72.35 KB )

Sergey,

It may be beneficial to look at where the data originates, and how it flows through the system.

For example, should you have a program that produces the original data (as opposed to receiving it from a different source), you could consider the implications of storing (writing to file) the B matrix in BTranspose format. Meaning: the output of the data generation/collection program formats the file data in an efficient layout for the next process, the outputs of that process are formatted for best use by the next, and so on. All too often, when you have a series of processes, the focus is too narrow and you look at optimizing each process instead of optimizing the flow between the processes.

Also, say for the A matrix, consider not reading A into virtual memory up front. Instead, on first use of each row of A, read the whole row of A into virtual memory. This could also be partitioned by the depth, but I suspect you will want to make the number of cells read a multiple of the page file's page size (maybe 2KB/page - check your O/S); start with 64KB and work up or down from there.
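
A sketch of that idea ( load_row, the file handle, and the dense row-major on-disk layout of doubles are my assumptions; _fseeki64 is the MSVC/Intel CRT 64-bit seek ):

#include <stdio.h>
#include <stdlib.h>

#define READ_CHUNK ( 64 * 1024 )   /* start at 64KB and tune up or down */

// Reads row 'i' of an n x n matrix of doubles from 'file' on first use.
// Returns a heap buffer that the caller frees when the row is no longer needed.
double* load_row( FILE *file, size_t i, size_t n )
{
    size_t row_bytes = n * sizeof( double );
    double *row = ( double* )malloc( row_bytes );
    if( row == NULL ) return NULL;

    if( _fseeki64( file, ( long long )i * ( long long )row_bytes, SEEK_SET ) != 0 )
    {
        free( row );
        return NULL;
    }
    for( size_t done = 0; done < row_bytes; )
    {
        size_t want = row_bytes - done;
        if( want > READ_CHUNK ) want = READ_CHUNK;
        size_t got = fread( ( char* )row + done, 1, want, file );
        if( got == 0 ) { free( row ); return NULL; }
        done += got;
    }
    return row;
}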

Good luck in pruning time off this.

Have you considered using multiple workstations/servers with Intel Visual Fortran's coarrays?

Jim Dempsey

www.quickthreadprogramming.com

>>...For example, should you have a program that produces the original data (as opposed to from a different source), you could
>>consider the implications of storing (writing to file)...

Only data sets from external files are used. I'm considering some kind of memory release ( manual swapping ) with restoration some time later, but it is not easy to implement because the algorithm is recursive ( the implementation complexity is very high ), not iterative.

I mentioned coarrays earlier; you also have OpenMPI. If you can partition the work, OpenMPI might be a good option.

Jim

www.quickthreadprogramming.com

>>...I mentioned coarrays earlier, you also have OpenMPI...

I don't think it will work for the project, since this would be a significant scale-up and the application of a completely different technology. Thanks for the proposal.

So, let's go back to performance issues. After almost two weeks of investigation, tests and code changes I've managed to improve performance by ~29%, and I consider that very good progress. Overall, all these performance issues were due to inconsistent alignment of some pointers: one set of pointers was aligned on 32-byte boundaries ( memory was allocated with the calloc CRT function ) and another set was aligned on 64-byte boundaries ( memory was allocated with the _mm_malloc intrinsic function ). There is another set of pointers allocated with the C++ operator new, and it will be fixed soon.
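
For completeness, a calloc replacement of the kind described would look roughly like this ( aligned_calloc64 is an illustrative name; the memset restores calloc's zero-initialization, and the header that declares _mm_malloc may differ between toolchains ):

#include <stddef.h>
#include <string.h>
#include <xmmintrin.h>   /* _mm_malloc / _mm_free */

// calloc-style allocation with 64-byte alignment; release with _mm_free, not free.
void* aligned_calloc64( size_t num, size_t size )
{
    size_t bytes = num * size;   /* caller guards against overflow */
    void *p = _mm_malloc( bytes, 64 );
    if( p != NULL )
        memset( p, 0, bytes );
    return p;
}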

It would be nice if VTune had an easy-to-understand option to find the amount of time spent waiting on the paging system. This might be helpful for programmers in your position.

Jim Dempsey

www.quickthreadprogramming.com

>>...It would be nice if VTune had an easy-to-understand option to find the amount of time spent waiting on the paging system.
>>This might be helpful for programmers in your position...

By the way, that is a good idea and I will submit a New Feature Request on the VTune forum. Thanks.

Sergey,

When making the request, a similar functional feature might be a report of the time a thread is in run mode but unable to run due to being preempted. (A paging delay is a programmer issue; preemption may be caused by the programmer (oversubscription), the O/S, or other system activity.)

Jim Dempsey

www.quickthreadprogramming.com

>>...There is another set of pointers allocated with the C++ operator new, and it will be fixed soon...

I changed all memory allocation calls to the _mm_malloc intrinsic function, and here are the latest numbers:

... - Pass 01 - Completed: 1.70300 secs - Excluded from calculation of average value
... - Pass 02 - Completed: 1.12500 secs
... - Pass 03 - Completed: 1.14000 secs
... - Pass 04 - Completed: 1.14100 secs
... - Pass 05 - Completed: 1.14100 secs
... - Pass 06 - Completed: 1.12500 secs
... - Pass 07 - Completed: 1.14000 secs
... - Pass 08 - Completed: 1.14100 secs
... - Pass 09 - Completed: 1.14000 secs
... - Pass 10 - Completed: 1.12500 secs
... - Pass 11 - Completed: 1.12500 secs
... - Pass 12 - Completed: 1.14100 secs
... - Pass 13 - Completed: 1.14000 secs
... - Pass 14 - Completed: 1.12500 secs
... - Pass 15 - Completed: 1.14100 secs
... - Average: 1.13500 secs Note: Best time

[ Note: From the 2nd post ]

>>>>...
>>>>... - Pass 14 - Completed: 1.32800 secs
>>>>... - Pass 15 - Completed: 1.34400 secs
>>>>... - Average: 1.34707 secs
>>>>...

Overall, it is faster compared to processing with a mix of the malloc / calloc functions and the C++ new operator.

As I've mentioned, there are good performance improvements for data sets smaller than 48GB ( when they fit into physical RAM with only a small amount of Virtual Memory used ), but the situation is still not good for larger data sets up to 128GB ( when a lot of Virtual Memory is used ). Here is a new performance report:

>>...
>>Calculating...
>>... - Pass 01 - Completed: 13766.30762 secs
>>... - Pass 02 - Completed: 9326.51953 secs
>>... - Pass 03 - Completed: 9418.59082 secs
>>... - Pass 04 - Completed: 9412.46094 secs
>>... - Pass 05 - Completed: 9410.07422 secs
>>...

...
Calculating...
... - Pass 01 - Completed: 9657.35059 secs
... - Pass 02 - Completed: 9500.69531 secs
... - Pass 03 - Completed: 9509.91406 secs
... - Pass 04 - Completed: 9492.97266 secs
... - Pass 05 - Completed: 9517.10645 secs
...

>>...
>>if(((k + 1) % (SIZEOF_CACHE_LINE / sizeof(sum))) == 0)
>> clflush(&A[i][k]);
>>...

Jim,

Do you use that technique with clflush in production code? I did a quick verification and I don't see any performance gains for any matrix sizes.

Sergey,

I haven't used clflush in production code. In the sketch I provided, the intended purpose was limited use in situations where a cache is exceeded AND you have multiple streams being accessed (e.g. a row or column of an array), and where the expense of one stream is higher than the other. In this case, it may be preferable to cause eviction of the data that is least expensive to fetch. The cost of the test has to be factored in. It would likely not be efficient for L1 cache optimization; for L3, maybe. Depending on size, an alternative would be to remove the test and flush from inside the k loop and use an additional loop following the k loop to evict the same elements (eliminating the test while performing a strided walk-through):

for(int k = 0; k < n; ++k) {
  sum += A[i][k] * B[k][j]; // B is more expensive to fetch
}
// evict the just-used cache lines of row A[i] in a separate strided pass
for(int k = 0; k < n; k += SIZEOF_CACHE_LINE / sizeof(sum)) {
  clflush(&A[i][k]);
}

Jim Dempsey

www.quickthreadprogramming.com
