"Vectorization: Writing C/C++ code in VECTOR Format"

Vectorization: Writing C/C++ code in VECTOR Format

Mukkaysh Srivastav
Computational Research Laboratories (CRL) - Pune, India

 

1.0 Introduction: Vectorization has been key optimization principle over x87 stack more than a decade. But often C/C++ algorithmic source-code is written without adequate attention to vectorization concepts. Since the performance of many high performance computing (HPC) scientific-applications degrades on multi-core architecture (MCA) (E.g.: Intel® Xeon® 5355 processor) during execution,   writing most compute part of algorithm in vector-length format would enhance the performance of such scientific-applications by vectorizing the source code either in 2 Double-precision (8-byte) or 4 Single-precision (4-byte) or 16-int (1-byte) vector-length format within 128-bit SSE stacks. Optimizing programs for modern multiprocessor or vector platforms is a major important challenge for Compilers today. Writing C/C++ code in a form that will enable a good C/C++ compiler to vectorize it is likely to yield some results in performance improvement. In this article, focus has been made in re-designing C/C++ code in vector-length format to address SSE needs. The source code used to demonstrate re-designing of compute algorithm is from multi-file C/C++ scientific-applications HPC package namely, AutoDock (http://autodock.scripps.edu/).

1.1 Vectorization Overview: Vectorization is a special case of SIMD, a term defined in Flynn's architecture taxonomy to denote a single instruction stream capable of operating on multiple data elements in parallel. The number of elements which can be operated on in parallel range from four single-precision floating point data elements in Streaming SIMD Extensions and two double-precision floating-point data elements in Streaming SIMD Extensions 2 to sixteen byte operations in a 128-bit register in Streaming SIMD Extensions 2. Thus, vector-length ranges from 2 to 16, depending on the instruction extensions used and on the data type. 
Vectorization has been currently used both as auto-vectorization and explicit vectorization, namely Single Instructions Multiple Data (SIMD) or simdization. Producing SIMD codes is sometimes done manually for important specific application, but is often produced automatically by Compilers (referred to as auto-vectorization). Auto-vectorization is enabled through compiler flags. Compiler flags for auto-vectorization can be referred from the relevant manuals, like Intel C++  Compiler User & Reference Guides. An auto-vectorizing compiler parallelizes code to utilize streaming SIMD extensions (SSE) instruction set architectures (SSE, SSE2, SSE3, SSSE3, and SSE4) of latest processors. Explicit form of vectorization, SIMD is used either as Inline assembly, intrinsic or pure assembly .S representation.  Explicit vector programming is time consuming, error prone and also machine dependent, so objective should be in re-designing the code in vector-length format. Moreover, simdization is not trivial, it has been observed that difficulties in optimizing code for SIMD architecture stems from hardware constraints too. Intel C++ Compiler supports GNU*-Assembler (GAS) syntax on Linux* x86_64 as default assembler, so writing either GNU-syntax Inline assembly or pure .S assembly file addressing SSE would be a tough job rather re-designing C/C++ compute code. 

1.2 C/C++ Code Format
: The source code as demonstrated here has been taken from one of the C++ files of AutoDock (v-4.0.1) scientific-applications package as available under GPL. AutoDock is a suite of automated docking tools; it has been designed to predict how small molecules, such as substrates or drug candidates, bind to a receptor of known 3D structure. One of the AutoDock (v-4.0.1) C++ source code file torsion.cc file has been vectorized and described here. This torsion.cc file API was chosen because this API has high CPU processing time as generated through use of Intel VTune profiler. More details about AutoDock can be obtained from http://autodock.scripps.edu/.  The original compute algorithm is - 

crd[mvatm][X] = (double)crdtemp[X] + d[X] * k[X][X] + d[Y] * k[X][Y] + d[Z] * k[X][Z];
crd[mvatm][Y] = (double)crdtemp[Y] + d[X] * k[Y][X] + d[Y] * k[Y][Y] + d[Z] * k[Y][Z];
crd[mvatm][Z] = (double)crdtemp[Z] + d[X] * k[Z][X] + d[Y] * k[Z][Y] + d[Z] * k[Z][Z];

The above code can't be vectorized because the loop contains three parameters (X, Y & Z is defined in autocomm.h file of AutoDock v-4.0.1 package as macros for X, Y & Z co-ordinates) but double-precision (DP) floating-point (FP) SIMD elements wants data either in 2, or 4 vector-length format, and moreover the algorithm access pattern through k is non-unit stride. The original code if executed with Intel C++ Compiler (ICC-v11.0) with -vec-report3 flag generates lots of data dependencies hazards as - 

torsion.cc(106): (col. 15) remark: vector dependence: assumed ANTI dependence between crd line 106 and crd line 109.
torsion.cc(109): (col. 15) remark: vector dependence: assumed FLOW dependence between crd line 109 and crd line 106.
torsion.cc(106): (col. 15) remark: vector dependence: assumed ANTI dependence between crd line 106 and crd line 109.
torsion.cc(109): (col. 15) remark: vector dependence: assumed FLOW dependence between crd line 109 and crd line 106.
torsion.cc(106): (col. 15) remark: vector dependence: assumed ANTI dependence between crd line 106 and crd line 109.
.....
.....
torsion.cc(108): (col. 15) remark: vector dependence: assumed FLOW dependence between crd line 108 and crd line 105.
torsion.cc(104): (col. 15) remark: vector dependence: assumed ANTI dependence between crd line 104 and crd line 107.
torsion.cc(107): (col. 15) remark: vector dependence: assumed FLOW dependence between crd line 107 and crd line 104.
torsion.cc(104): (col. 15) remark: vector dependence: assumed ANTI dependence between crd line 104 and crd line 107.
torsion.cc(107): (col. 15) remark: vector dependence: assumed FLOW dependence between crd line 107 and crd line 104.
....
....
torsion.cc(109): (col. 15) remark: vector dependence: assumed FLOW dependence between crd line 109 and tlist line 103.
torsion.cc(103): (col. 15) remark: vector dependence: assumed ANTI dependence between tlist line 103 and crd line 109.
torsion.cc(109): (col. 15) remark: vector dependence: assumed FLOW dependence between crd line 109 and tlist line 103.
torsion.cc(103): (col. 15) remark: vector dependence: assumed ANTI dependence between tlist line 103 and crd line 109.
torsion.cc(109): (col. 15) remark: vector dependence: assumed FLOW dependence between crd line 109 and tlist line 103.

Since from above generated compiler compiled message we see that code has too many flow & false dependencies. Our objective is to demonstrate in getting rid of both flow & false dependencies and have unit-stride access pattern.

These problem of above code not been vectorized, aligned and having unit-stride access pattern has been demonstrated in section 1.4 by simply re-designing torsion.cc file of AutoDock by addressing vector-length format.

1.3 SIMD Non-vectorized Assembly Behavior
: The behavior of SSE assembly instructions for above algorithm has been -

movaps                 %xmm5, %xmm12       
mulsd                    %xmm15, %xmm12       
addsd                    %xmm2, %xmm12         
movaps                 %xmm9, %xmm0        
mulsd                    %xmm14, %xmm0       
addsd                    %xmm0, %xmm12       
movaps                 %xmm11, %xmm0       
mulsd                    %xmm13, %xmm0       
addsd                    %xmm0, %xmm12       
cvtsd2ss                %xmm12, %xmm12        
movss                   %xmm12, (%r10,%rdi)

movsd                   40(%rsp), %xmm0     
mulsd                    %xmm15, %xmm0       
addsd                    %xmm4, %xmm0       
movaps                 %xmm6, %xmm12       
mulsd                    %xmm14, %xmm12      
addsd                    %xmm12, %xmm0       
movaps                 %xmm7, %xmm12      
mulsd                    %xmm13, %xmm12     
addsd                    %xmm12, %xmm0  
cvtsd2ss                %xmm0, %xmm0
movss                   %xmm0, 4(%r10,%rdi)

mulsd                    %xmm8, %xmm15      
addsd                    %xmm3, %xmm15       
mulsd                    %xmm10, %xmm14      
addsd                    %xmm14, %xmm15      
mulsd                    %xmm1, %xmm13       
addsd                    %xmm13, %xmm15      
cvtsd2ss                %xmm15, %xmm13     
movss                   %xmm13, 8(%r10,%rdi)

From the above SSE assembly code, we see alignment didn't happen for crd[mvatm][Y] and crd[mvatm][Z] expressions. Also, it is seen that movsd for k[Y][X] is scalar code, so the alignment is a lot less important (since floats are usually 4-byte aligned; for vector data we want the more strict 16-byte alignment). Data dependencies could have been resolved if data rearrangement unpack (unpcklpd, etc.) instructions could have been invoked by the compiler along with mulpd, addpd, movhpd instructions. Moreover, due to associated overhead of data rearrangement, it is generally profitable to vectorize only those loops that have relatively few non-unit-stride memory references.

Understanding of above assembly (.S) content can be done either by using ICC (v-11.0.074) to generate above assembly code (.S) for torsion.cc, which does generate .S file with line number in aligned with source-code line number or by having objdump for AutoDock (v-4.0.1) executable. Interpreting further the SSE instructions can be done by referring to Intel 64 and IA-32 Architectures Optimization Reference Manual (Order Number: 248966-018) document.  

1.4 C/C++ Vectorized Code Format
: Having the problem as discussed in section 1.2 & 1.3, we re-design the torsion.cc API code such that torsion.cc compute algorithm becomes amicable to unit-stride and vectorization. In original code we have X, Y & Z as three parameters, function arguments being passed with array layout of crd [2048][3], and local variable array layout for k being [3][3]. Since, SSE are 128-bit width registers which means it should either have 4 single-precision (SP) floating-point (FP) or 2 double-precision (DP) floating-point (FP) to fit equally in 128-bit both for alignment and effective vectorization. Since this mis-alignment prohibits vectorization, we introduce one dummy "S" parameter which provides the correct alignment for the 128-bit SSE register and enables vectorization. To have unit-stride access of memory, we transpose original k matrix. Also, we increase the size of k matrix to ensure proper mapping of 4th. parameter. Since an array of layout of crd [2048][3] is passed as function argument, so to copy this crd array layout with local increased array layout of crdmod[2048][4], we use memcpy() operation than any other programming techniques of array copy operation as memcpy() is most efficient way to perform copy of arrays, we prefer it to use here than any other techniques. Having the array layout now in SSE-aligned format, the modified compute algorithm of torsion.cc is -

/* Incorporate S as fourth parameter along with transposed k in 2 DP FP format */
/* Below statements are vectorized */
crdmod[mvatm][X] = (double)crdtemp[X] + d[X] * k[X][X] + d[Y] * k[Y][X] + d[Z] * k[Z][X];
crdmod[mvatm][Y] = (double)crdtemp[Y] + d[X] * k[X][Y] + d[Y] * k[Y][Y] + d[Z] * k[Z][Y];
crdmod[mvatm][Z] = (double)crdtemp[Z] + d[X] * k[X][Z] + d[Y] * k[Y][Z] + d[Z] * k[Z][Z];
crdmod[mvatm][S] = (double)crdtemp[S] + d[X] * k[X][S] + d[Y] * k[Y][S] + d[Z] * k[Z][S];

Compiling above algorithm as - 

[@n1 autodock-4.0.1]$ icpc -g -O0 torsion.cc -S

[@n1 autodock-4.0.1]$ icpc -g -O1 torsion.cc -S

[@n1 autodock-4.0.1]$ icpc -g -O2 torsion.cc -S
torsion.cc(128): (col. 15) remark: BLOCK WAS VECTORIZED.

[@n1 autodock-4.0.1]$ icpc -g -O3 torsion.cc -S
torsion.cc(128): (col. 15) remark: BLOCK WAS VECTORIZED.

[@n1 autodock-4.0.1]$ icpc torsion.cc -S
torsion.cc(128): (col. 15) remark: BLOCK WAS VECTORIZED. 

We observe that, Intel C++ Compiler (ICC v-11.0.074) vectorizes both with -O2, -O3 and ICC default optimization (-O2) but not with -O1 & -O0 for torsion.cc modified code of AutoDock which was desired finally and moreover we don't see any compiler generated dependency messages as seen in section 1.2 for un-modified compute algorithm having X, Y & Z parameters. Current ICC-v11.0 supports vectorization at -O2 & aggressive vectorization at -O3 optimization levels.

Here, we have seen that ICC (v-11.0.074) successfully generates vectorization message as "BLOCK WAS VECTORIZED" but not as "LOOP WAS VECTORIZED". Sometime we get confused with ICC vectorization messages between "LOOP WAS VECTORIZED" or "BLOCK WAS VECTORIZED. The only difference between message of types "LOOP WAS VECTORIZED" or "BLOCK WAS VECTORIZED" can be understood with below example -

for (int i = 0; i < 2; i++)
a[i] = b[i] + 1;

gives LOOP WAS VECTORIZED at the line number of the FOR loop, while the equivalent

a[0] = b[0] + 1;
a[1] = b[1] + 1;

gives BLOCK WAS VECTORIZED at the line number of the first statement. This difference of messages shouldn't generate confusion in interpreting vectorization messages by ICC.

1.5 SIMD Vectorized Assembly Behavior
:
Below assembly is representation of being vectorized for modified algorithm with 4th. parameter "S" - 

movaps                 %xmm14, %xmm0                               
movhpd                 24(%rsp), %xmm0                              
movaps                 %xmm6, %xmm4                                 
unpcklpd               %xmm9, %xmm4                                 

movhpd                56(%rsp), %xmm13                             
unpcklpd               %xmm0, %xmm5                                 
movaps                %xmm3, %xmm14                                
movhpd                88(%rsp), %xmm12                             
unpcklpd               %xmm7, %xmm3                                 

unpcklpd               %xmm8, %xmm2                                  
movhpd                120(%rsp), %xmm1            

...B1.6:                                                                                                                                         // Level  

movaps                 %xmm5, %xmm9                                
mulpd                   %xmm8, %xmm9                                   
mulpd                   %xmm13, %xmm8                                 
addpd                   %xmm4, %xmm9                                   
unpcklpd               %xmm15, %xmm15                              
addpd                   %xmm0, %xmm8                                  

movaps                 %xmm3, %xmm14                                  
mulpd                    %xmm15, %xmm14                                
mulpd                    %xmm12, %xmm15                                
addpd                    %xmm14, %xmm9                                  
unpcklpd                %xmm7, %xmm7                                    
addpd                    %xmm15, %xmm8                                  

movaps                 %xmm2, %xmm14                                 
mulpd                    %xmm7, %xmm14                                
mulpd                    %xmm1, %xmm7                                    
addpd                    %xmm14, %xmm9                                  
cvtpd2ps                %xmm9, %xmm9                                   
addpd                    %xmm7, %xmm8                                   
cvtpd2ps                %xmm8, %xmm7                                    
movlhps                 %xmm7, %xmm9                                   
movaps                  %xmm9, 160(%rsp,%rdi)                       
unpcklpd                 %xmm8, %xmm8                                    

We see from above that unpcklpd instruction has been called which does resolves dependencies. Also, when using movlhps the data it takes from immediate "cvtpd2ps %xmm8, %xmm7" operations is from low quad-word of xmm7 thus movlhps uses only the lower part of xmm7; these two instructions are functionally independent. However cvtpd2ps will "lock" %xmm7 till it completion thus preventing movlhps. Moreover, addpd & mulpd has been used here than earlier non-vectorized assembly code of section 1.3 where we saw addsd and mulsd instructions calls.  We also see that movhpd instructions have been used to perform the load/store since compiler cannot provide alignment of data. The use of movhpd instruction has a significant impact on the performance finally. 

As, Intel Xeon processor has 128-bit-wide floating point unit (4 flops/cycle), it becomes more important to use full-width SSE2 instructions as opposed to the "upper/lower" half varieties if data alignment is known. In other words, instructions such as mulpd, addpd, etc. are preferable than instructions mulsd, addsd, etc., as obtained in section 1.3 because aligned data and vectorizable code are more beneficial and moreover these instructions has low latency than lower quad-word double-precision (DP)  floating-point (FP) operation.

1.6 Conclusion
:
We therefore conclude that if we know how to write any C/C++ source-code compute algorithm in vector-length format to address current SSE stacks of multi-core architectures, compiler for this algorithm can automatically vectorizes the code or in other way compiler job of performing vectorization becomes much easier. We also see that by having vector-length formatted C/C++ code, compiler generates and uses more efficient vectorizable instructions (like unpcklpd, addpd, mulpd, movlhps, movhpd, etc.) which do take care of dependencies.

The above modified C/C++ compute algorithm source-code for torsion.cc of AutoDock (v-4.0.1) package has been demonstrated only on Intel® Xeon® 5355 processor. This approach of writing C/C++ source-code in vector-length format can be true in any other processors which support SIMDization.

Somehow, not verified if Intel C++ Compiler (ICC v-11.0.074) can generate SHUFPD instructions which can yield better results in place of unpcklpd. Also, ICC generated assembly for modified code haven't been tried further either as optimized Inline assembly API nor as pure optimized .S assembly file and latter to use SHUFPD in place of unpcklpd instructions for having much better performance.

Above exercise of having vector-length format of C/C++ source-code can easily be executed on Linux x86_64 machine using torsion.cc C++ file of AutoDock (v-4.0.1) source-code as it is easily available under GPL. 

1.7 References
:
1. The Software Vectorization Handbook: By Aart J.C. Bik, Intel Press
2. Optimizing Compilers for Modern Architectures:  By Randy Allen and Ken Kennedy
3. Intel 64 and IA-32 Architectures Optimization Reference Manual (Order Number: 248966-018)
4. AutoDock Source code: http://autodock.scripps.edu/ 

1.8 System Configuration
:
•     Intel C++ Compiler:
Intel® C++ Intel® 64 Compiler Professional for applications running on Intel® 64, Version 11.0   
Build 20081105 Package ID: l_cproc_p_11.0.074

•     Operating System:
Linux n1 2.6.9-55.9hp.4sp.XCsmp #1 SMP Tue Nov 27 18:32:01 EST 2007 x86_64 x86_64 x86_64 GNU/Linux

•     CPU Information:
vendor_id              : GenuineIntel
model name           : Intel® Xeon® CPU           X5355 @ 2.66GHz
cache size              : 4096 KB
cpu cores               : 4
fpu                        : yes
fpu_exception         : yes
flags                      : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi
mmx fxsr sse sse2 ss ht tm syscall lm pni monitor ds_cpl est tm2 cx16 xtpr
bogomips               : 5337.26
clflush size             : 64
cache_alignment     : 64
address sizes          : 36 bits physical, 48 bits virtual

•     Memory Information:
Memory Total          : 16413224 kB
Page Tables            : 1428 kB
Vmalloc Total          : 536870911 kB
CPU                       : L1 I cache: 32K, L1 D cache: 32K
CPU                       : L2 cache: 4096K

•     Linux Development Tool Components:
GNU C  (gcc --version)
gcc (GCC) 3.4.6 20060404 (Red Hat 3.4.6-8)

GNU C++ (g++ --version)
g++ (GCC) 3.4.6 20060404 (Red Hat 3.4.6-8)

Binutils (as -v)
GNU assembler version 2.15.92.0.2 (x86_64-redhat-linux) using BFD version 2.15.92.0.2 20040927

GLIBC  (/lib/libc-2.3.4.so)
GNU C Library stable release version 2.3.4, by Roland McGrath et al.

LIBSTDC++ (/usr/lib/libstdc++.so.6.0.3)
libstdc++.so.6.0.3

1.9 Glossary
:
HPC High Performance Computing
Vector-Length The number of data elements packed together
Precision The number of bits per data element
SIMD Single Instruction Multiple Data
SSE Streaming SIMD Extension Intel instruction set multimedia extension. 
Versions: SSE (1999), SSE2 (2001), SSE3 (2004), SSSE3 (2005), SSE4 (2006).
ICC Intel C++ Compiler
GAS GNU Assembler

2.0 Acknowledgment
: I am thankful to Aart J.C. Bik of Google (USA) for bringing excellent book on Vectorization (The Software Vectorization Handbook), Dr. Parag Prasad (Life-Science Group, Computational Research Laboratories  - India) in allowing me to investigate and optimize AutoDock source code, Milind Athavale (System Software Group,  Computational Research Laboratories  - India) and Shreeniwas Sapre  (Library Group,  Computational Research Laboratories  - India) for reviewing this document.

2.1 About the author
: Author, Mukkaysh Srivastav works with Computational Research Laboratories (CRL) - Pune, India. His areas of interest are Compiler development, HPC Software Tools, Static-Analysis (mainly MPI), Performance & Optimization, SIMD x86_64 Assembly Programming and Public Speaking. With academic qualification, he has been affiliated with University of Massachusetts - Boston (USA); Indian Institute of Technology - Delhi (India) and Regional Engineering College - India. He can be reached through srimks@msn.com .                                          

Optimization Notice in English

Reportez-vous à notre Notice d'optimisation pour plus d'informations sur les choix et l'optimisation des performances dans les produits logiciels Intel.