Latency and Throughput of Intel CPUs 'clflush' instruction

[ Abstract ]

Latency and throughput of the Intel CPU CLFLUSH instruction.

The instruction was introduced with SSE2 ( IRT-Domain ) and it interacts with speculative fetching. It is a real challenge to measure CLFLUSH latency because it is up to the CPU when the flush is actually carried out.

IRT-Domain - SSE2 - [ emmintrin.h ]

...
extern void __ICL_INTRINCC _mm_clflush( void const *p );
...

IRT - Intrinsics Run-Time

[ Here are notes related to objectives of an investigation ( a small R&D work ) ]

1. Intel does Not provide any numbers for the latency of the CLFLUSH instruction.

2. Discussions about the latency of the CLFLUSH instruction are highly speculative because it is
Not clear when the instruction is actually executed.

3. Some discussions about the latency of the CLFLUSH instruction do Not take into account that
it flushes data to main memory ( RAM ), whose latency is usually known. It is Not
clear when a cache line really becomes available for another hardware or software prefetch of
data or instructions, or whether it becomes available before (!) main memory is
updated with the modified data.

4. It is more important to understand how C++ compilers can generate binary code that is as
efficient as possible, in order to achieve the highest throughput for a sequence of CLFLUSH
instructions.

5. It is shown later that inefficient binary code generation by a C++ compiler can reduce
the throughput of a sequence of CLFLUSH instructions.

6. Three types of binary code generation are possible:

- Type-1: Based on the 'clflush [ebp-offset]' instruction using the general purpose register 'ebp'

- Type-2: Based on the 'clflush [eXx]' instruction using a general purpose register 'eXx'

- Type-3: Composite, when the 'clflush' instruction is generated inside a small non-inline function

[ Intel CLFLUSH instruction Opcodes ]

0F AE 38................clflush [eax]
0F AE 3B................clflush [ebx]
0F AE 39................clflush [ecx]
0F AE 3A................clflush [edx]
0F AE BD [offset]....clflush [ebp-offset]
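
The last byte of each encoding above is the ModRM byte of the 0F AE /7 opcode form. A minimal sketch of how it is built, assuming the standard IA-32 register numbering; the helper name ClflushModRm is hypothetical and is Not part of the test library:

```cpp
#include <cstdint>

// IA-32 register numbers as they appear in the ModRM r/m field.
enum Reg32 : std::uint8_t { EAX = 0, ECX = 1, EDX = 2, EBX = 3, EBP = 5 };

// Hypothetical helper: builds the ModRM byte that follows the 0F AE opcode.
// CLFLUSH is encoded as 0F AE /7, so the reg field is always 7.
// mod = 00b selects [reg], mod = 10b selects [reg + disp32].
// Note: 'ebp' can Not be used with mod = 00b ( that form means disp32 only ).
constexpr std::uint8_t ClflushModRm( Reg32 base, bool hasDisp32 )
{
    return static_cast< std::uint8_t >(
        ( ( hasDisp32 ? 2u : 0u ) << 6 ) | ( 7u << 3 ) | base );
}

static_assert( ClflushModRm( EAX, false ) == 0x38, "clflush [eax]" );
static_assert( ClflushModRm( EBX, false ) == 0x3B, "clflush [ebx]" );
static_assert( ClflushModRm( ECX, false ) == 0x39, "clflush [ecx]" );
static_assert( ClflushModRm( EDX, false ) == 0x3A, "clflush [edx]" );
static_assert( ClflushModRm( EBP, true ) == 0xBD, "clflush [ebp-offset]" );
```

The 'clflush [ebp-offset]' form is 7 bytes long ( 3 bytes plus a 32-bit displacement ), which matches the 7-byte spacing of the addresses in the Type-1 listings.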

[ Test Case - IrtClflush & CrtClflush ]

...
RTint piAddress[10][16] =
{
{ 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11 }, // 0
{ 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22 }, // 1
{ 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33 }, // 2
{ 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44 }, // 3
{ 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77, 0x77 }, // 4
{ 0x88, 0x88, 0x88, 0x88, 0x88, 0x88, 0x88, 0x88, 0x88, 0x88, 0x88, 0x88, 0x88, 0x88, 0x88, 0x88 }, // 5
{ 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44 }, // 6
{ 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33 }, // 7
{ 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22 }, // 8
{ 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11 }, // 9
};

IrtClflush( &piAddress[0][0] );
CrtClflush( &piAddress[1][0] );

CrtSetThreadPriority( THREADPRIORITY_REALTIME );

CrtPrefetchData( ( RTchar * )&piAddress[0][0] ); // All prefetches are T0-type
CrtPrefetchData( ( RTchar * )&piAddress[1][0] );
CrtPrefetchData( ( RTchar * )&piAddress[2][0] );
CrtPrefetchData( ( RTchar * )&piAddress[3][0] );
CrtPrefetchData( ( RTchar * )&piAddress[4][0] );
CrtPrefetchData( ( RTchar * )&piAddress[5][0] );
CrtPrefetchData( ( RTchar * )&piAddress[6][0] );
CrtPrefetchData( ( RTchar * )&piAddress[7][0] );
CrtPrefetchData( ( RTchar * )&piAddress[8][0] );
CrtPrefetchData( ( RTchar * )&piAddress[9][0] );

RTuint64 uiClock1 = CrtRdtsc();

CrtClflush( &piAddress[0][0] );
CrtClflush( &piAddress[1][0] );
CrtClflush( &piAddress[2][0] );
CrtClflush( &piAddress[3][0] );
CrtClflush( &piAddress[4][0] );
CrtClflush( &piAddress[5][0] );
CrtClflush( &piAddress[6][0] );
CrtClflush( &piAddress[7][0] );
CrtClflush( &piAddress[8][0] );
CrtClflush( &piAddress[9][0] );

RTuint64 uiClock2 = CrtRdtsc();

CrtPrintf( RTU("[ CrtClflush ] - Executed in %u clock cycles\n"),
( RTuint )( uiClock2 - uiClock1 ) / 10 );

CrtSetThreadPriority( THREADPRIORITY_NORMAL );

CrtPrintf( RTU("IrtClflush & CrtClflush\n") );
...
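
The test case above relies on library wrappers ( CrtClflush, CrtRdtsc, CrtPrefetchData ) whose sources are Not shown. A minimal self-contained sketch of the same measurement with compiler intrinsics, assuming an x86 target with SSE2, 64-byte cache lines and GCC, Clang or Microsoft headers. MeasureClflushCycles and g_piAddress are hypothetical names, and the thread priority change is left out:

```cpp
#include <cstdint>
#include <emmintrin.h>    // _mm_clflush ( SSE2 ); also brings in _mm_prefetch
#if defined( _MSC_VER )
#include <intrin.h>       // __rdtsc on Microsoft compilers
#else
#include <x86intrin.h>    // __rdtsc on GCC / Clang
#endif

// Assumption: 64-byte cache lines, so each row of 16 ints is one line.
alignas( 64 ) static int g_piAddress[10][16];

// Hypothetical stand-in for the test case: returns TSC ticks per CLFLUSH.
std::uint64_t MeasureClflushCycles()
{
    for( int i = 0; i < 10; i++ )
        for( int j = 0; j < 16; j++ )
            g_piAddress[i][j] = i + 1;       // write the lines so they are modified

    for( int i = 0; i < 10; i++ )            // T0-type prefetches, as in the post
        _mm_prefetch( reinterpret_cast< const char * >( &g_piAddress[i][0] ), _MM_HINT_T0 );

    const std::uint64_t uiClock1 = __rdtsc();

    _mm_clflush( &g_piAddress[0][0] );
    _mm_clflush( &g_piAddress[1][0] );
    _mm_clflush( &g_piAddress[2][0] );
    _mm_clflush( &g_piAddress[3][0] );
    _mm_clflush( &g_piAddress[4][0] );
    _mm_clflush( &g_piAddress[5][0] );
    _mm_clflush( &g_piAddress[6][0] );
    _mm_clflush( &g_piAddress[7][0] );
    _mm_clflush( &g_piAddress[8][0] );
    _mm_clflush( &g_piAddress[9][0] );

    const std::uint64_t uiClock2 = __rdtsc();

    return ( uiClock2 - uiClock1 ) / 10;     // same averaging as the test case
}
```

The value returned is in TSC ticks, and the compiler is still free to schedule instructions around the two __rdtsc calls, so the generated code should be inspected in the same way as in the listings of this thread.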

[ Watcom C++ compiler - Generated binary codes - No re-ordering of instructions ]

...
00403737 lea eax, [ebp-8AEh]
0040373D prefetcht0 [eax]
00403740 lea eax, [ebp-86Eh]
00403746 prefetcht0 [eax]
00403749 lea eax, [ebp-82Eh]
0040374F prefetcht0 [eax]
00403752 lea eax, [ebp-7EEh]
00403758 prefetcht0 [eax]
0040375B lea eax, [ebp-7AEh]
00403761 prefetcht0 [eax]
00403764 lea eax, [ebp-76Eh]
0040376A prefetcht0 [eax]
0040376D lea eax, [ebp-72Eh]
00403773 prefetcht0 [eax]
00403776 lea eax, [ebp-6EEh]
0040377C prefetcht0 [eax]
0040377F lea eax, [ebp-6AEh]
00403785 prefetcht0 [eax]
00403788 lea eax, [ebp-66Eh]
0040378E prefetcht0 [eax]
00403791 rdtsc
00403793 mov ecx, eax
00403795 lea eax, [ebp-8AEh]
0040379B clflush [eax]
0040379E lea eax, [ebp-86Eh]
004037A4 clflush [eax]
004037A7 lea eax, [ebp-82Eh]
004037AD clflush [eax]
004037B0 lea eax, [ebp-7EEh]
004037B6 clflush [eax]
004037B9 lea eax, [ebp-7AEh]
004037BF clflush [eax]
004037C2 lea eax, [ebp-76Eh]
004037C8 clflush [eax]
004037CB lea eax, [ebp-72Eh]
004037D1 clflush [eax]
004037D4 lea eax, [ebp-6EEh]
004037DA clflush [eax]
004037DD lea eax, [ebp-6AEh]
004037E3 clflush [eax]
004037E6 lea eax, [ebp-66Eh]
004037EC clflush [eax]
004037EF rdtsc
004037F1 xor edx, edx
004037F3 sub eax, ecx
...

[ C++ compilers generated binary codes - Short Summary ]

[ Microsoft C++ compiler ]

A - optimized

...
clflush [ebp-100h]
...

B - non-optimized

...
mov eax, dword ptr [ebp+8]
clflush [eax]
...

[ Borland C++ compiler ]

A - optimized

...
mov edx, dword ptr [ebp-3D0h]
clflush [edx]
...

B - non-optimized ( in a small Not inline function )

...
push ebp
mov ebp, esp
mov eax, dword ptr [ebp+8]
clflush [eax]
pop ebp
ret
...

[ Intel C++ compiler ]

A - optimized

...
clflush [ebp-638h]
...

B - non-optimized

N/A

[ MinGW C++ compiler ]

A - optimized

...
mov edx, dword ptr [ebp-338h]
clflush [edx]
...

B - non-optimized

N/A

[ Watcom C++ compiler ]

A - optimized

...
mov eax, dword ptr [ebp-194h]
clflush [eax]
...

B - non-optimized

N/A

[ Run-Time testing - Extended Tracing - No ]
[ Microsoft C++ compiler ]

...
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
...

Here are generated binary codes:

...
00244486 rdtsc
00244488 clflush [ebp-300h]
0024448F clflush [ebp-240h]
00244496 clflush [ebp-180h]
0024449D mov dword ptr [ebp-48h], eax
002444A0 clflush [ebp-340h]
002444A7 clflush [ebp-280h]
002444AE clflush [ebp-1C0h]
002444B5 clflush [ebp-100h]
002444BC mov dword ptr [ebp-44h], edx
002444BF clflush [ebp-2C0h]
002444C6 clflush [ebp-200h]
002444CD clflush [ebp-140h]
002444D4 rdtsc
...

[ Run-Time testing - Extended Tracing - No ]
[ Borland C++ compiler ]

...
[ CrtClflush ] - Executed in 96 clock cycles
[ CrtClflush ] - Executed in 91 clock cycles
[ CrtClflush ] - Executed in 93 clock cycles
[ CrtClflush ] - Executed in 96 clock cycles
[ CrtClflush ] - Executed in 96 clock cycles
[ CrtClflush ] - Executed in 96 clock cycles
[ CrtClflush ] - Executed in 90 clock cycles
[ CrtClflush ] - Executed in 91 clock cycles
[ CrtClflush ] - Executed in 94 clock cycles
[ CrtClflush ] - Executed in 84 clock cycles
...

Here are generated binary codes:

...
0040417A call CrtRdtsc (406D6Ch)
0040417F mov dword ptr [ebp-230h], eax
00404185 mov dword ptr [ebp-22Ch], edx
0040418B lea ecx, [ebp-0BD0h]
00404191 push ecx
00404192 call CrtClflush (40123Ch)
00404197 pop ecx
00404198 lea eax, [ebp-0B90h]
0040419E push eax
0040419F call CrtClflush (40123Ch)
004041A4 pop ecx
004041A5 lea edx, [ebp-0B50h]
004041AB push edx
004041AC call CrtClflush (40123Ch)
004041B1 pop ecx
004041B2 lea ecx, [ebp-0B10h]
004041B8 push ecx
004041B9 call CrtClflush (40123Ch)
004041BE pop ecx
004041BF lea eax, [ebp-0AD0h]
004041C5 push eax
004041C6 call CrtClflush (40123Ch)
004041CB pop ecx
004041CC lea edx, [ebp-0A90h]
004041D2 push edx
004041D3 call CrtClflush (40123Ch)
004041D8 pop ecx
004041D9 lea ecx, [ebp-0A50h]
004041DF push ecx
004041E0 call CrtClflush (40123Ch)
004041E5 pop ecx
004041E6 lea eax, [ebp-0A10h]
004041EC push eax
004041ED call CrtClflush (40123Ch)
004041F2 pop ecx
004041F3 lea edx, [ebp-9D0h]
004041F9 push edx
004041FA call CrtClflush (40123Ch)
004041FF pop ecx
00404200 lea ecx, [ebp-990h]
00404206 push ecx
00404207 call CrtClflush (40123Ch)
0040420C pop ecx
0040420D call CrtRdtsc (406D6Ch)
00404212 mov dword ptr [ebp-238h], eax
00404218 mov dword ptr [ebp-234h], edx
...

...
// CrtRdtsc (406D6Ch)
00406D6C rdtsc
00406D6E ret
...

...
// CrtClflush (40123Ch)
0040123C push ebp
0040123D mov ebp, esp
0040123F mov eax, dword ptr [ebp+8]
00401242 clflush [eax]
00401245 pop ebp
00401246 ret
...

Note: This is the worst case and related to how CLFLUSH and RDTSC instructions are implemented in software.

[ Run-Time testing - Extended Tracing - No ]
[ Intel C++ compiler ]

...
[ CrtClflush ] - Executed in 20 clock cycles
[ CrtClflush ] - Executed in 23 clock cycles
[ CrtClflush ] - Executed in 24 clock cycles
[ CrtClflush ] - Executed in 24 clock cycles
[ CrtClflush ] - Executed in 20 clock cycles
[ CrtClflush ] - Executed in 19 clock cycles
[ CrtClflush ] - Executed in 19 clock cycles
[ CrtClflush ] - Executed in 22 clock cycles
[ CrtClflush ] - Executed in 19 clock cycles
[ CrtClflush ] - Executed in 18 clock cycles
...

The question is: why is it slower than the Microsoft or Watcom C++ compilers?

Here are generated binary codes:

...
0040365C rdtsc
0040365E clflush [ebp-8B8h]
00403665 mov ecx, eax
00403667 clflush [ebp-878h]
0040366E clflush [ebp-838h]
00403675 clflush [ebp-7F8h]
0040367C clflush [ebp-7B8h]
00403683 clflush [ebp-778h]
0040368A clflush [ebp-738h]
00403691 clflush [ebp-6F8h]
00403698 clflush [ebp-6B8h]
0040369F clflush [ebp-678h]
004036A6 rdtsc
...

1. The Intel C++ compiler re-ordered the sequence of instructions.
2. 'mov ecx, eax' is placed after the 1st 'clflush [ebp-8B8h]' in order to save the value returned by 'RDTSC' in the 'eax' general purpose register.
3. It is possible that pipelining is affected ( Very Likely! ), or that an instruction stall is created ( Not proven and speculative! ).
4. Take a look at the perfectly generated binary code from the Watcom C++ compiler ( see Post #6 ).
5. Almost the same re-ordering is done by the Microsoft C++ compiler.

[ Run-Time testing - Extended Tracing - No ]
[ MinGW C++ compiler ]

...
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
...

Here are generated binary codes:

...
0040265B rdtsc
0040265D mov esi, eax
0040265F clflush [ebp-2B8h]
00402666 clflush [ebp-278h]
0040266D clflush [ebp-238h]
00402674 clflush [ebp-1F8h]
0040267B clflush [ebp-1B8h]
00402682 clflush [ebp-178h]
00402689 clflush [ebp-138h]
00402690 clflush [ebp-0F8h]
00402697 clflush [ebp-0B8h]
0040269E clflush [ebp-78h]
004026A2 rdtsc
...

Perfect binary codes generation.

[ Run-Time testing - Extended Tracing - No ]
[ Watcom C++ compiler ]

...
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
...

Here are generated binary codes:

...
00403791 rdtsc
00403793 mov ecx, eax
00403795 lea eax, [ebp-8AEh]
0040379B clflush [eax]
0040379E lea eax, [ebp-86Eh]
004037A4 clflush [eax]
004037A7 lea eax, [ebp-82Eh]
004037AD clflush [eax]
004037B0 lea eax, [ebp-7EEh]
004037B6 clflush [eax]
004037B9 lea eax, [ebp-7AEh]
004037BF clflush [eax]
004037C2 lea eax, [ebp-76Eh]
004037C8 clflush [eax]
004037CB lea eax, [ebp-72Eh]
004037D1 clflush [eax]
004037D4 lea eax, [ebp-6EEh]
004037DA clflush [eax]
004037DD lea eax, [ebp-6AEh]
004037E3 clflush [eax]
004037E6 lea eax, [ebp-66Eh]
004037EC clflush [eax]
004037EF rdtsc
...

Perfect binary codes generation.

[ Run-Time testing - Extended Tracing - No - Summary ]

Let's consider three cases for Intel CPUs 'clflush' instruction:

1. Perfect binary codes generation to achieve the highest throughput:

MinGW C++ compiler ( rating is 10 out of 10 )
Watcom C++ compiler ( rating is 9 out of 10 )
Microsoft C++ compiler ( rating is 8 out of 10 )

2. Very good binary codes generation to achieve very good throughput:

Intel C++ compiler ( rating is 5 out of 10 )

3. Good binary codes generation but poor throughput ( Not optimized implementation! ):

Borland C++ compiler ( rating is 3 out of 10 )

[ Run-Time testing - Extended Tracing - Yes ]
[ Microsoft C++ compiler ]

Application - ScaLibTestApp - WIN32_MSC ( 32-bit ) - Release
Tests: Start
> Test0001 Start <
**********************************************
Configuration - WIN32_MSC ( 32-bit ) - Release
CTestSet::InitTestEnv - Passed

* CRuntimeSet Start *
...
[ CrtSetThreadPriority ] - Executed in 2896 clock cycles
[ CrtClflush ] - Executed in 84 clock cycles
[ CrtClflush ] - Executed in 104 clock cycles
[ CrtClflush ] - Executed in 104 clock cycles
[ CrtClflush ] - Executed in 116 clock cycles
[ CrtClflush ] - Executed in 104 clock cycles
[ CrtClflush ] - Executed in 104 clock cycles
[ CrtClflush ] - Executed in 92 clock cycles
[ CrtClflush ] - Executed in 92 clock cycles
[ CrtClflush ] - Executed in 92 clock cycles
[ CrtClflush ] - Executed in 198725 clock cycles
[ CrtSetThreadPriority ] - Executed in 3280 clock cycles
IrtClflush & CrtClflush
...
* CRuntimeSet End *

Test Completed in 7140 ticks
> Test0001 End <
Tests: Completed

[ Run-Time testing - Extended Tracing - Yes ]
[ Borland C++ compiler ]

Application - BccTestApp - WIN32_BCC ( 32-bit ) - Release
Tests: Start
> Test0001 Start <
**********************************************
Configuration - WIN32_BCC ( 32-bit ) - Release
CTestSet::InitTestEnv - Passed

* CRuntimeSet Start *
...
[ CrtSetThreadPriority ] - Executed in 28364 clock cycles
[ CrtClflush ] - Executed in 120 clock cycles
[ CrtClflush ] - Executed in 368 clock cycles
[ CrtClflush ] - Executed in 100 clock cycles
[ CrtClflush ] - Executed in 368 clock cycles
[ CrtClflush ] - Executed in 100 clock cycles
[ CrtClflush ] - Executed in 308 clock cycles
[ CrtClflush ] - Executed in 376 clock cycles
[ CrtClflush ] - Executed in 372 clock cycles
[ CrtClflush ] - Executed in 112 clock cycles
[ CrtClflush ] - Executed in 156735 clock cycles
[ CrtSetThreadPriority ] - Executed in 11976 clock cycles
IrtClflush & CrtClflush
...
* CRuntimeSet End *

Test Completed in 9234 ticks
> Test0001 End <
Tests: Completed

[ Run-Time testing - Extended Tracing - Yes ]
[ Intel C++ compiler ]

Application - IccTestApp - WIN32_ICC ( 32-bit ) - Release
Tests: Start
> Test0001 Start <
**********************************************
Configuration - WIN32_ICC ( 32-bit ) - Release
CTestSet::InitTestEnv - Passed

* CRuntimeSet Start *
...
[ CrtSetThreadPriority ] - Executed in 2400 clock cycles
[ CrtClflush ] - Executed in 88 clock cycles
[ CrtClflush ] - Executed in 88 clock cycles
[ CrtClflush ] - Executed in 88 clock cycles
[ CrtClflush ] - Executed in 88 clock cycles
[ CrtClflush ] - Executed in 88 clock cycles
[ CrtClflush ] - Executed in 88 clock cycles
[ CrtClflush ] - Executed in 88 clock cycles
[ CrtClflush ] - Executed in 88 clock cycles
[ CrtClflush ] - Executed in 88 clock cycles
[ CrtClflush ] - Executed in 221809 clock cycles
[ CrtSetThreadPriority ] - Executed in 6548 clock cycles
IrtClflush & CrtClflush
...
* CRuntimeSet End *

Test Completed in 4516 ticks
> Test0001 End <
Tests: Completed

[ Run-Time testing - Extended Tracing - Yes ]
[ MinGW C++ compiler ]

Application - MgwTestApp - WIN32_MGW ( 32-bit ) - Release
Tests: Start
> Test0001 Start <
**********************************************
Configuration - WIN32_MGW ( 32-bit ) - Release
CTestSet::InitTestEnv - Passed

* CRuntimeSet Start *
...
[ CrtSetThreadPriority ] - Executed in 3128 clock cycles
[ CrtClflush ] - Executed in 88 clock cycles
[ CrtClflush ] - Executed in 88 clock cycles
[ CrtClflush ] - Executed in 88 clock cycles
[ CrtClflush ] - Executed in 88 clock cycles
[ CrtClflush ] - Executed in 88 clock cycles
[ CrtClflush ] - Executed in 88 clock cycles
[ CrtClflush ] - Executed in 88 clock cycles
[ CrtClflush ] - Executed in 88 clock cycles
[ CrtClflush ] - Executed in 88 clock cycles
[ CrtClflush ] - Executed in 171099 clock cycles
[ CrtSetThreadPriority ] - Executed in 4284 clock cycles
IrtClflush & CrtClflush
...
* CRuntimeSet End *

Test Completed in 3516 ticks
> Test0001 End <
Tests: Completed

[ Run-Time testing - Extended Tracing - Yes ]
[ Watcom C++ compiler ]

Application - WccTestApp - WIN32_WCC ( 32-bit ) - Release
Tests: Start
> Test0001 Start <
**********************************************
Configuration - WIN32_WCC ( 32-bit ) - Release
CTestSet::InitTestEnv - Passed

* CRuntimeSet Start *
...
[ CrtSetThreadPriority ] - Executed in 3776 clock cycles
[ CrtClflush ] - Executed in 88 clock cycles
[ CrtClflush ] - Executed in 88 clock cycles
[ CrtClflush ] - Executed in 88 clock cycles
[ CrtClflush ] - Executed in 88 clock cycles
[ CrtClflush ] - Executed in 88 clock cycles
[ CrtClflush ] - Executed in 88 clock cycles
[ CrtClflush ] - Executed in 88 clock cycles
[ CrtClflush ] - Executed in 88 clock cycles
[ CrtClflush ] - Executed in 88 clock cycles
[ CrtClflush ] - Executed in 173828 clock cycles
[ CrtSetThreadPriority ] - Executed in 4908 clock cycles
IrtClflush & CrtClflush
...
* CRuntimeSet End *

Test Completed in 8000 ticks
> Test0001 End <
Tests: Completed

[ Flush Cache Win32 API functions on Windows Desktop and Embedded OSs ]
[ Win32 API function - FlushInstructionCache ]

Windows Desktop - [ winbase.h ]

...
BOOL WINAPI FlushInstructionCache(
__in HANDLE hProcess,
__in_bcount_opt( dwSize ) LPCVOID lpBaseAddress,
__in SIZE_T dwSize );
...

Windows CE - [ winbase.h ]

...
BOOL WINAPI FlushInstructionCache(
HANDLE hProcess,
LPCVOID lpBaseAddress,
DWORD dwSize );
...

[ Flush Cache intrinsic on Windows Embedded OSs ]

Windows CE - [ cmnintrin.h ]

...
__CacheRelease( void *p );
...

When compiling with the Microsoft C++ compiler, warning C4732 is displayed when
the intrinsic '__CacheRelease' is not supported on an architecture.

[ Flush Cache intrinsic on Itanium IA64 architecture ]

Itanium IA64 Architecture

...
__fc( __int64 *p );
...

[ Flush Cache intrinsics on AMD CPU architectures ]

Not reviewed.

[ Intel References ]

Intel 64 and IA-32 Architectures Optimization Reference Manual
Order Number: 248966-033
June 2016

...
Chapter: INSTRUCTION LATENCY AND THROUGHPUT
Table C-17. General Purpose Instructions ( Page C-17 )

...
CLFLUSH throughputs for different CPUs are ~2 to 50, ~3 to 50, ~3 to 50 and ~5 to 50 clock cycles.
...

...
Note 13 ( Page C-19 ):

...
CLFLUSH throughput is representative from clean cache lines for a range of buffer sizes. CLFLUSH
throughput can decrease significantly by factors including: (a) the number of back-to-back CLFLUSH
being executed, (b) flushing modified cache lines incurs additional cost than cache lines in other
coherent state. See Section 7.4.6.
...

Section 7.4.6 CLFLUSH Instruction ( Page 7-9 ):

It provides additional information for the instruction.

Intel 64 and IA-32 Architectures Software Developer's Manual
Volume 2 ( 2A, 2B, 2C & 2D ):
Instruction Set Reference, A-Z
Order Number: 325383-059US
June 2016

...
Chapter: INSTRUCTION SET REFERENCE, A-L
CLFLUSH - Flush Cache Line ( Page 3-140 vol. 2A )

A very important note is as follows:

...
data can be speculatively loaded into a cache line just before, during, or after
the execution of a CLFLUSH instruction that references the cache line
...

[ External References ]

1. Coherence with Cached Memory-Mapped IO
John D. McCalpin, Ph.D, 2013
https://sites.utexas.edu/jdm4372/2013/05

[ Command Line Options of C++ compilers ]

The command line options of the C++ compilers used in these performance evaluations are listed below.

[ Borland C++ compiler v5.5.1 32-bit ]

-d -O2 -w -D_WIN32_BCC -DNDEBUG -5 -nRelease -eBccTestApp.exe -I"C:\WorkLib\MKL\Include" -L"C:\WorkLib\MKL\Lib\Ia32Bcc" -lS:33554432 BccTestApp.cpp HrtALLib.asm

[ MinGW C++ compiler v6.1.0 32-bit ]

MgwTestApp.cpp

-DNDEBUG

-O3

-msse2
-mprfchw

-ffast-math
-fpeel-loops
-ftree-vectorizer-verbose=0
-ftree-vectorize
-fvect-cost-model
-fomit-frame-pointer
-flto
-fwhole-program
-fopenmp

-w

-I "C:/WorkLib/ICC2011/Composer XE/Mkl/Include"
-B "../../AppsSca"

"C:/WorkLib/ICC2011/Composer XE/Mkl/Lib/Ia32/mkl_rt.lib"

-Xlinker
--stack=67108864

[ Microsoft C++ compiler ( VS2005 PE ) 32-bit ]

[ Compiler ]
/O2 /Ob1 /Oi /Ot /Oy /GL /I "..\..\Include" /D "WIN32" /D "_CONSOLE" /D "NDEBUG" /D "_WIN32_MSC" /D "_VC80_UPGRADE=0x0710" /D "_UNICODE" /D "UNICODE" /GF /Gm /MT /GS- /fp:fast /GR- /openmp /Yu"Stdphf.h" /Fp"Release\MscTestApp.pch" /Fo"Release/" /Fd"Release/" /W4 /nologo /c /Wp64 /Zi /Gd /TP /wd4005 /U "_WINCE_MSC" /U "WIN32_PLATFORM_PSPC" /U "WIN32_PLATFORM_WFSP" /U "WIN32_PLATFORM_WM50" /U "_WIN32_MGW" /U "_WIN32_BCC" /U "_COS16_TCC" /U "_WIN32_ICC" /U "_WIN32_WCC" /errorReport:prompt /arch:SSE2

[ Linker ]
/OUT:"Release/MscTestApp.exe" /INCREMENTAL:NO /NOLOGO /MANIFEST /MANIFESTFILE:"Release\MscTestApp.exe.intermediate.manifest" /NODEFAULTLIB:"../../Bin/Release/ScaLib.lib" /SUBSYSTEM:CONSOLE /STACK:268435456 /LARGEADDRESSAWARE /LTCG /MACHINE:X86 /ERRORREPORT:PROMPT kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib uuid.lib odbc32.lib odbccp32.lib "..\..\bin\release\scalib.lib"

[ Intel C++ compiler v12.1.7 ( u371 ) 32-bit ]

[ Compiler ]
/c /O3 /Ob1 /Oi /Ot /Oy /Qipo /I "..\..\Include" /D "WIN32" /D "_CONSOLE" /D "NDEBUG" /D "_WIN32_ICC" /D "INTEL_SUITE_VERSION=PE121_300" /D "_VC80_UPGRADE=0x0710" /D "_UNICODE" /D "UNICODE" /GF /MT /GS- /fp:fast=2 /GR- /Yu"Stdphf.h" /Fp"Release\IccTestApp.pch" /Fo"Release/" /W5 /nologo /Wp64 /Zi /Gd /TP /Qdiag-disable:2012 /Qdiag-disable:2013 /Qdiag-disable:2014 /Qdiag-disable:2015 /Qdiag-disable:2017 /Qdiag-disable:2021 /Qdiag-disable:2022 /Qdiag-disable:2304 /U "_WIN32_MSC" /U "_WINCE_MSC" /U "WIN32_PLATFORM_PSPC" /U "WIN32_PLATFORM_WFSP" /U "WIN32_PLATFORM_WM50" /U "_WIN32_MGW" /U "_WIN32_BCC" /U "_COS16_TCC" /U "_WIN32_WCC" /Qopenmp /Qfp-speculation:fast /Qopt-matmul /Qparallel /Qstd=c++0x /Qrestrict /Qdiag-disable:111,673,10121
/Wport /Qeffc++ /QxSSE2 /Qansi-alias /Qvec-report=0 /Qfma /Qunroll:8 /Qunroll-aggressive /Qopt-streaming-stores:always /Qopt-block-factor:128 /Qopt-mem-layout-trans:2 /Wport /Qeffc++ /QxSSE2 /Qansi-alias /Qvec-report=0 /Qfma /Qunroll:8 /Qunroll-aggressive /Qopt-streaming-stores:always /Qopt-block-factor:128 /Qopt-mem-layout-trans:2

[ Linker ]
kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib uuid.lib odbc32.lib odbccp32.lib /OUT:"Release/IccTestApp.exe" /INCREMENTAL:NO /nologo /MANIFEST /MANIFESTFILE:"Release\IccTestApp.exe.intermediate.manifest" /NODEFAULTLIB:"../../Bin/Release/ScaLib.lib" /TLBID:1 /SUBSYSTEM:CONSOLE /STACK:268435456 /LARGEADDRESSAWARE /MACHINE:X86 /qdiag-disable:111,673,10121

[ Watcom C++ compiler v2.0.0 32-bit ]

WccTestApp.cpp -5r -fp5 -fpi87 -wx -d0 -s -oabil+mprt -xd -D_WIN32_WCC -DNDEBUG -feWccTestApp.exe -k268435456 -i"C:\WorkLib\ICC2011\Compos~1\Mkl\Include" -"libpath C:\WorkLib\ICC2011\Compos~1\Mkl\Lib\Ia32Wcc" -wcd=007 -wcd=008 -wcd=013 -wcd=014 -wcd=086 -wcd=188 -wcd=367 -wcd=368 -wcd=369 -wcd=387 -wcd=389 -wcd=549 -wcd=601 -wcd=628 -wcd=689 -wcd=716 -wcd=725 -wcd=726 -wcd=735

Correction for Post #9

>>[ Run-Time testing - Extended Tracing - No ]
>>[ Borland C++ compiler ]
>>
>>...
>>...
>>
>>Note: This is the worst case and related to how CLFLUSH and RDTSC instructions are implemented in software.

A correct Note is:

This is the worst case and related to how CrtClflush and CrtRdtsc C-functions are implemented in software.

[ A workaround for Intel C++ compiler ]

The problem has two parts:

- The RDTSC instruction was Not aligned on a 16-byte boundary by the Intel C++ compiler

- Pipelining of a series of CLFLUSH instructions is affected when a MOV instruction is inserted after the 1st CLFLUSH instruction

By the way, the binary code generated by the Watcom C++ compiler is Not aligned on a 16-byte boundary and it doesn't have any problems!

So, I decided to use a workaround by forcing alignment on a 16-byte boundary ( _DEFAULT_CODEALIGN16 is a macro based on the _asm ALIGN 16 assembler directive ).

...
_DEFAULT_CODEALIGN16;

RTuint64 uiClock1 = CrtRdtsc();

CrtClflush( &piAddress[0][0] );
CrtClflush( &piAddress[1][0] );
CrtClflush( &piAddress[2][0] );
CrtClflush( &piAddress[3][0] );
CrtClflush( &piAddress[4][0] );
CrtClflush( &piAddress[5][0] );
CrtClflush( &piAddress[6][0] );
CrtClflush( &piAddress[7][0] );
CrtClflush( &piAddress[8][0] );
CrtClflush( &piAddress[9][0] );

RTuint64 uiClock2 = CrtRdtsc();
...

Here are the statistics for the memory address of the 1st RDTSC instruction:

MSC - 00244490 % 0x10 = 0 - Aligned on 16-byte boundary? - Yes
ICC - 00403660 % 0x10 = 0 - Aligned on 16-byte boundary? - Yes
MGW - 00402490 % 0x10 = 0 - Aligned on 16-byte boundary? - Yes
BCC - 0040417A % 0x10 = 10 - Aligned on 16-byte boundary? - No
WCC - 00403791 % 0x10 = 1 - Aligned on 16-byte boundary? - No
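
The remainder for each address is simply its lowest hex digit, which is easy to re-check. A minimal sketch; the helper names OffsetIn16 and IsAligned16 are hypothetical:

```cpp
#include <cstdint>

// Hypothetical helper: offset of a code address within its 16-byte block.
constexpr std::uint32_t OffsetIn16( std::uint32_t uiAddress )
{
    return uiAddress % 0x10u;
}

// Hypothetical helper: true when the address sits on a 16-byte boundary.
constexpr bool IsAligned16( std::uint32_t uiAddress )
{
    return OffsetIn16( uiAddress ) == 0u;
}

// Addresses of the 1st RDTSC instruction taken from the statistics in the post.
static_assert( IsAligned16( 0x00244490u ), "MSC" );
static_assert( IsAligned16( 0x00403660u ), "ICC" );
static_assert( IsAligned16( 0x00402490u ), "MGW" );
static_assert( !IsAligned16( 0x0040417Au ), "BCC" );
static_assert( !IsAligned16( 0x00403791u ), "WCC" );
```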

[ Run-Time testing - Extended Tracing - No ]
[ Intel C++ compiler ]

...
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
...

[ Intel C++ compiler New Features Request - Re-ordering statistics and control ]

- A Warning Message at the /W5 level ( Not at other levels ) needs to be displayed when there is a Re-ordering

- Introduction of a '#pragma no-reordering' directive for a piece of critical code to prevent any Re-ordering

- A command line compiler option to control Re-ordering of instructions ( similar to the Watcom C++ compiler option '-or', Re-order instructions to avoid stalls )

[ MinGW C++ compiler command line options ]

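For example, these are the command line options for the different types of Re-ordering supported by the MinGW C++ compiler ( they act on basic blocks, functions and top level declarations, and '-Wreorder' concerns the order of C++ member initializers; None of them acts on individual instructions ):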

...
-Wreorder - Warn when the compiler reorders code.
-freorder-blocks - Reorder basic blocks to improve code placement.
-freorder-blocks-algorithm= - -freorder-blocks-algorithm=[simple|stc] Set the used basic block reordering algorithm.
-freorder-blocks-and-partition - Reorder basic blocks and partition into hot and cold sections.
-freorder-functions - Reorder functions to improve code placement.
-fprofile-reorder-functions - Enable function reordering that improves code placement.
-ftoplevel-reorder - Reorder top level functions, variables, and asms.
...

[ A note from Intel software engineer ]

>>...
>>Compiler optimization may re-order instructions based on instruction latency/throughput targeting different micro-architecture.
>>...

I understand that, but my point is:

The Intel C++ compiler should give us greater control in cases similar to mine. If a Software Engineer has some specs and knows
how some processing needs to be done ( its order, number of instructions, estimated number of clock cycles to complete
the processing, etc ), then the Intel C++ compiler should Not interfere with the Software Engineer's code. Of course,
an implementation in assembler solves all these problems, but it is more time consuming and
it breaks the portability of C/C++ source code.

Here are performance results when the Serial-Test-Case was converted to a 10-iteration For-Loop-Test-Case.

Performance results from the best to the worst:

[ MinGW C++ compiler ]
...
[ Sub-Test002.21.B - Processing of 10 calls ] - Executed in 120 clock cycles
[ Sub-Test002.21.B - For-Loop Overhead ] - Executed in 84 clock cycles
[ Sub-Test002.21.B - CrtClflush ] - Executed in 3 clock cycles
...

[ Intel C++ compiler ]
...
[ Sub-Test002.21.B - Processing of 10 calls ] - Executed in 196 clock cycles
[ Sub-Test002.21.B - For-Loop Overhead ] - Executed in 152 clock cycles
[ Sub-Test002.21.B - CrtClflush ] - Executed in 4 clock cycles
...

[ Watcom C++ compiler ]
...
[ Sub-Test002.21.B - Processing of 10 calls ] - Executed in 212 clock cycles
[ Sub-Test002.21.B - For-Loop Overhead ] - Executed in 128 clock cycles
[ Sub-Test002.21.B - CrtClflush ] - Executed in 8 clock cycles
...

[ Microsoft C++ compiler ]
...
[ Sub-Test002.21.B - Processing of 10 calls ] - Executed in 192 clock cycles
[ Sub-Test002.21.B - For-Loop Overhead ] - Executed in 88 clock cycles
[ Sub-Test002.21.B - CrtClflush ] - Executed in 10 clock cycles
...

[ Borland C++ compiler ]
...
[ Sub-Test002.21.B - Processing of 10 calls ] - Executed in 964 clock cycles
[ Sub-Test002.21.B - For-Loop Overhead ] - Executed in 264 clock cycles
[ Sub-Test002.21.B - CrtClflush ] - Executed in 70 clock cycles
...
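
The three figures reported for each compiler above are consistent with the per-call number being ( total - for-loop overhead ) / 10 with integer division. This relationship is inferred from the numbers and is Not stated in the post; the helper name PerCallCycles is hypothetical:

```cpp
#include <cstdint>

// Hypothetical helper: per-call cost after subtracting the measured
// for-loop overhead. Integer division, as the reported values suggest.
constexpr std::uint32_t PerCallCycles( std::uint32_t uiTotal,
                                       std::uint32_t uiLoopOverhead,
                                       std::uint32_t uiCalls = 10u )
{
    return ( uiTotal - uiLoopOverhead ) / uiCalls;
}

// ( total, overhead ) pairs taken from the results in the post.
static_assert( PerCallCycles( 120u, 84u ) == 3u, "MinGW" );
static_assert( PerCallCycles( 196u, 152u ) == 4u, "Intel" );
static_assert( PerCallCycles( 212u, 128u ) == 8u, "Watcom" );
static_assert( PerCallCycles( 192u, 88u ) == 10u, "Microsoft" );
static_assert( PerCallCycles( 964u, 264u ) == 70u, "Borland" );
```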

Results are very reproducible, and in the case of the MinGW and Intel C++ compilers I was able to achieve the lower-bound numbers for the CLFLUSH instruction stated by Intel in:

Intel 64 and IA-32 Architectures Optimization Reference Manual
Order Number: 248966-033
June 2016

...
Chapter: INSTRUCTION LATENCY AND THROUGHPUT
Table C-17. General Purpose Instructions ( Page C-17 )

...
CLFLUSH throughputs for different CPUs are ~2 to 50, ~3 to 50, ~3 to 50 and ~5 to 50 clock cycles.
...

The latency of the RDTSC instruction should Not be taken into account when two RDTSC instructions are
executed one after another. This is because the RDTSC instruction latency is constant for a Processing Unit:
the same number of micro-ops is executed by the Processing Unit in both cases, so the RDTSC latency cancels out.

The RDTSC instruction, as you know, simply reads and returns the value of the Time Stamp Counter ( TSC ) of the Processing Unit.

In a general form, if
...
T1 = RDTSC()
Processing...
T2 = RDTSC()
...
then

Processing Completed in Clock Cycles = ( T2 + RDTSCoverhead ) - ( T1 + RDTSCoverhead ) = ( T2 - T1 ).

The latency of the RDTSC instruction is Not known since Intel does Not release any information about it;
take a look at the Intel 64 and IA-32 Architectures Optimization Reference Manual.
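
The cancellation argument can be written down directly. A minimal sketch, assuming ( as the post does ) that the RDTSC overhead is the same constant for both readings; the function name ElapsedCycles is hypothetical:

```cpp
#include <cstdint>

// Hypothetical model: each reading is the true timestamp plus a constant
// RDTSC overhead, so the overhead drops out of the difference.
// Unsigned 64-bit arithmetic also keeps the result correct if the counter
// wraps around between the two readings.
constexpr std::uint64_t ElapsedCycles( std::uint64_t uiT1,
                                       std::uint64_t uiT2,
                                       std::uint64_t uiRdtscOverhead )
{
    return ( uiT2 + uiRdtscOverhead ) - ( uiT1 + uiRdtscOverhead );
}

static_assert( ElapsedCycles( 1000u, 1120u, 0u ) == 120u, "no overhead" );
static_assert( ElapsedCycles( 1000u, 1120u, 25u ) == 120u, "overhead cancels" );
```

If the overhead is Not constant between the two readings, the difference of the two overheads stays in the result and the cancellation no longer holds.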

[ MinGW C++ compiler - 64-bit - Ivy Bridge ]
...
[ Sub-Test002.21.A - CrtClflush ] - Executed in 6 clock cycles
[ Sub-Test002.21.B - Processing of 10 calls ] - Executed in 44 clock cycles
[ Sub-Test002.21.B - For-Loop Overhead ] - Executed in 24 clock cycles
[ Sub-Test002.21.B - CrtClflush ] - Executed in 2 clock cycles
...

[ Intel C++ compiler - 64-bit - Ivy Bridge ]
...
[ Sub-Test002.21.A - CrtClflush ] - Executed in 6 clock cycles
[ Sub-Test002.21.B - Processing of 10 calls ] - Executed in 120 clock cycles
[ Sub-Test002.21.B - For-Loop Overhead ] - Executed in 100 clock cycles
[ Sub-Test002.21.B - CrtClflush ] - Executed in 2 clock cycles
...

[ Watcom C++ compiler - 32-bit - Ivy Bridge ]
...
[ Sub-Test002.21.A - CrtClflush ] - Executed in 7 clock cycles
[ Sub-Test002.21.B - Processing of 10 calls ] - Executed in 128 clock cycles
[ Sub-Test002.21.B - For-Loop Overhead ] - Executed in 92 clock cycles
[ Sub-Test002.21.B - CrtClflush ] - Executed in 3 clock cycles
...

[ Microsoft C++ compiler - 64-bit - Ivy Bridge ]
...
[ Sub-Test002.21.A - CrtClflush ] - Executed in 15 clock cycles
[ Sub-Test002.21.B - Processing of 10 calls ] - Executed in 108 clock cycles
[ Sub-Test002.21.B - For-Loop Overhead ] - Executed in 28 clock cycles
[ Sub-Test002.21.B - CrtClflush ] - Executed in 8 clock cycles
...

[ Borland C++ compiler - 64-bit - Ivy Bridge ]
...
[ Sub-Test002.21.A - CrtClflush ] - Executed in 85 clock cycles
[ Sub-Test002.21.B - Processing of 10 calls ] - Executed in 232 clock cycles
[ Sub-Test002.21.B - For-Loop Overhead ] - Executed in 144 clock cycles
[ Sub-Test002.21.B - CrtClflush ] - Executed in 8 clock cycles
...

Hi Sergey,

Thank you for the study ( measuring clflush latency ). It is really interesting. However, with HPCs available, have you tried other hardware events by utilising PMCs to make this measurement? If so, which events are more effective for measuring clflush latency?
