Recent posts
https://software.intel.com/en-us/recent/790614
IPP MX special operation on array of matrices
https://software.intel.com/en-us/forums/intel-integrated-performance-primitives/topic/515132
<p>Hi,</p>
<p>Here is a simple question. I'm new to IPP and I'm trying to understand how to use it for solving the following problem:</p>
<p>A += B*C + D*E + F*G + ...</p>
<p>A, B, C, D, E, F, G, ... are all matrices of the same size, * represents standard matrix multiply. The sizes of the matrices are small, typically between 3x3 and 35x35. </p>
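<p>To make the operation concrete, this plain-C baseline is what I'd like to beat (a sketch with hypothetical names; matrices are square and stored contiguously in row-major order):</p>

```c
#include <stddef.h>

/* Hypothetical baseline: A += B*C for every (B, C) pair, with n x n
   row-major matrices stored contiguously. */
static void mma_accumulate(double *A, const double *const *Bs,
                           const double *const *Cs, size_t pairs, size_t n)
{
    for (size_t p = 0; p < pairs; p++)
        for (size_t i = 0; i < n; i++)
            for (size_t k = 0; k < n; k++) {
                double b = Bs[p][i * n + k];   /* reused across the j loop */
                for (size_t j = 0; j < n; j++)
                    A[i * n + j] += b * Cs[p][k * n + j];
            }
}
```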
<p>IPP provides a routine - I'm looking at the ippmMul_mama_64f function - that operates on two source arrays of matrices, in our case [B, D, F] and [C, E, G], producing, as far as I understand, three output matrices A1, A2, and A3 that store the results of B*C, D*E, and F*G, respectively. Now I have two related questions:</p>
<p>- In my problem, there's a single output matrix A. Is there a function in IPP, or a safe way of using ippmMul_mama_64f, such that the results are *accumulated* in a single output matrix A, rather than in three different matrices A1, A2, and A3?</p>
<p>- If this is not possible, how do I best combine the three temporaries A1, A2, and A3?</p>
<p>Ah, incidentally: is there any document I can look at that compares the performance of IPP MX to hand-crafted implementations? I've done a bit of research and couldn't find much.</p>
<p>Thanks</p>
<p>-- Fabio</p>
Tue, 13 May 14 02:58:14 -0700 · Fabio L. · 515132
Loads blocked due to store forwarding
https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/401511
<p>Hi all</p>
<p>I'm using vtune to spot bottlenecks in a piece of code that looks like the following:</p>
<p>for (int i = 0; i < X; i++) <br /> for (int j = 0; j < Y; j++) <br /> for (int k = 0; k < Z; k++) <br /> A[j][k] += (FE0[i][j]*FE0[i][k]*I[0] + B1[i][k]*C1[i][j]*I[1] + B2[i][k]*C2[i][j]*I[2] + ...);</p>
<p>The code is automatically generated by a high-level tool, which is why it looks "weird". I'm using the most recent Intel suite (compiler and tools).</p>
<p>In a specific run (there is no significant variation in results among different runs), VTune reports that loads blocked by store forwarding constitute a significant proportion of the execution time, roughly 0.160. If I look at the assembly/source view, it appears that the load-sum-store on A is mainly responsible for that:</p>
<p>1) I can't understand why. A is *never* used in the right-hand-side computation, so why should the problem occur?</p>
<p>In addition, most of the instructions (even, for example, those computing the other products/sums in the right-hand side) seem to be affected by "loads blocked after store forwarding", which is actually weird. Although they use only registers to perform the product/sum itself, the VTune analysis says they are "blocked due to store forwarding" (i.e. the column "loads blocked by store forwarding" is different from 0).</p>
<p>2) Could you explain this?</p>
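<p>For reference, the restructuring I am considering (a sketch, with the trip counts and operand names made up for illustration) keeps the running sum in a scalar, so that A[j][k] is loaded and stored once per (j, k) rather than once per i iteration:</p>

```c
/* Sketch with made-up sizes: move the i loop innermost and accumulate
   in a scalar, so the load-add-store on A[j][k] happens once per (j, k). */
enum { X = 4, Y = 4, Z = 4 };

static void kernel(double A[Y][Z], const double FE0[X][Y],
                   const double B1[X][Z], const double C1[X][Y],
                   const double I[2])
{
    for (int j = 0; j < Y; j++)
        for (int k = 0; k < Z; k++) {
            double acc = 0.0;
            for (int i = 0; i < X; i++)
                acc += FE0[i][j] * FE0[i][k] * I[0]
                     + B1[i][k] * C1[i][j] * I[1];
            A[j][k] += acc;   /* single store per (j, k) */
        }
}
```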
<p>Thank you very much for considering my request</p>
Tue, 16 Jul 13 09:03:16 -0700 · Fabio L. · 401511
Almost-unit-stride stores
https://software.intel.com/en-us/forums/intel-isa-extensions/topic/394833
<p>Hi all</p>
<p>I have an AVX vector register <em>reg</em> containing 4 double values; let's call them (in order): 0 - 2 - 3 - 4.<br />These values have to be added to distinct locations of an array A, namely to positions A[0], A[2], A[3], A[4].<br />In other words:</p>
<p>A[0] += reg[0], A[2] += reg[1], and so on.</p>
<p>This is quite a recurrent situation in my program, i.e. sequences of load-add-stores that are "almost" unit-stride - but not quite.</p>
<p>At first I thought I could use some sort of shuffle instruction to shift the values in reg, i.e. getting 0 - x - 2 - 3 (and maybe treating reg[4] as a scalar value), and then perform standard 256-bit instructions. However, as far as I know, I can't do that kind of shifting in a single instruction, right?</p>
<p>Related to this question, let's say that reg is now 0 - 2 - 3 - 5. Should I treat all 4 values as scalars, or is there a way of efficiently (1-2 instructions?) extracting the two middle values (i.e. those crossing the two 128-bit lanes) into a 128-bit register?</p>
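<p>To make the question concrete, here is the scalar fallback I'm currently using, plus the two-instruction extraction (vextractf128 + shufpd) I have in mind for the middle pair - a sketch with AVX intrinsics, not measured code:</p>

```c
#include <immintrin.h>

/* Scalar fallback: spill reg to an aligned temporary and add
   element-wise to the non-contiguous slots of A. */
static __attribute__((target("avx")))
void scatter_add(double *A, __m256d reg)
{
    double tmp[4] __attribute__((aligned(32)));
    _mm256_store_pd(tmp, reg);
    A[0] += tmp[0];
    A[2] += tmp[1];
    A[3] += tmp[2];
    A[4] += tmp[3];
}

/* Extract the lane-crossing middle pair {reg[1], reg[2]} into a
   128-bit register: one vextractf128 plus one shufpd. */
static __attribute__((target("avx")))
__m128d middle_pair(__m256d reg)
{
    __m128d lo = _mm256_castpd256_pd128(reg);    /* {reg[0], reg[1]} */
    __m128d hi = _mm256_extractf128_pd(reg, 1);  /* {reg[2], reg[3]} */
    return _mm_shuffle_pd(lo, hi, 0x1);          /* {reg[1], reg[2]} */
}

/* Tiny driver so all AVX code sits in target("avx") functions. */
static __attribute__((target("avx")))
void demo(double *A, double *mid)
{
    __m256d r = _mm256_set_pd(4.0, 3.0, 2.0, 1.0);  /* {1, 2, 3, 4} */
    scatter_add(A, r);
    _mm_storeu_pd(mid, middle_pair(r));
}
```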
<p>Thanks </p>
<p>-- Fabio</p>
Mon, 01 Jul 13 09:21:26 -0700 · Fabio L. · 394833
Padding does not help AVX
https://software.intel.com/en-us/forums/intel-isa-extensions/topic/362177
<p>Hi all</p>
<p>I have the following C function:</p>
<p><strong class="quote-header">Quote:</strong><blockquote class="quote-msg quote-nest-1 odd"><div class="quote-author"></div></p>
<p>void mass_ffc( double A[10][10], double x[3][2])<br />{<br /> // Compute Jacobian of affine map from reference cell<br /> const double J_00 = x[1][0] - x[0][0];<br />...<br /> const double J_11 = x[2][1] - x[0][1];</p>
<p> // Compute determinant of Jacobian<br /> double detJ = J_00*J_11 - J_01*J_10;<br /> const double det = fabs(detJ);</p>
<p> // Array of quadrature weights.<br /> const double W12[12] __attribute__((aligned(PADDING))) = { .... };</p>
<p> // Value of basis functions at quadrature points.<br /> const double FE0[12][10] __attribute__((aligned(PADDING))) = \<br />{{0.0463079953908666, 0.440268993398561, 0.0463079953908666, 0.402250914961474, -0.201125457480737, -0.0145210435563256, -0.0145210435563258, ...0.283453533784293}};</p>
<p>for (int ip = 0; ip < 12; ip++) {<br /> double tmp = W12[ip]*det; <br /> for (int j=0; j<10; ++j) {<br /> double tmp2 = FE0[ip][j]*tmp;</p>
<p> #pragma vector aligned<br /> for (int k=0; k<10; ++k) {<br /> A[j][k] += FE0[ip][k]*tmp2;<br /> }<br /> } // end loop over 'j'<br /> } // end loop over 'ip'</p>
<p>} // end function</p>
<p></blockquote></p>
<p>Compiling it with ICC 2013 (flags: -xAVX, -O3) I end up with a quite expected result: the innermost loop over k is fully unrolled, the first two iterations are peeled off, and the remaining 8 are performed with AVX instructions (mulpd, addpd). Then I padded the FE0 and A matrices to 12 elements and increased the k trip count to 12. The idea was that this way I would get a fully unrolled k loop carried out with just 3 "groups" of packed AVX instructions (mulpd, addpd), saving the time spent on peeling and, in general, on scalar instructions.</p>
<p>Now the point is that if I compile the function with trip count 12, the compiler inserts a long sequence of movupd instructions both <strong>before</strong> and <strong>after </strong>the piece of assembly code representing the <strong>full unrolling of the loops over j and k</strong>. These movupd instructions basically copy the elements of A to the stack (before) and from the stack back to A (after, just before the function returns). For example:</p>
<p><strong class="quote-header">Quote:</strong><blockquote class="quote-msg quote-nest-1 odd"><div class="quote-author"></div></p>
<p>...</p>
<p>vmovupd 32(%r15), %ymm2 <br /> vmovupd 96(%r15), %ymm14 <br /> vmovupd %ymm15, 1280(%rsp) <br /> vmovupd 608(%r15), %ymm15 <br /> vmovupd %ymm1, 1792(%rsp)<br /> vmovupd %ymm2, 1824(%rsp)</p>
<p>...</p>
<p># compilation of the loop nests</p>
<p>...</p>
<p>1760(%rsp), %ymm3 <br /> vmovupd %ymm15, 928(%r15)<br /> vmovupd 1600(%rsp), %ymm15 <br /> vmovupd %ymm0, 544(%r15) <br /> vmovupd %ymm1, 480(%r15)</p>
<p></blockquote></p>
<p>Of course, you might ask: why care about such a mild (potential?) optimization in such a small function? Because the function is invoked millions of times.</p>
<p>My questions are: what does that sequence of movupd instructions represent? And why is it inserted there with trip count 12?</p>
<p>In the end, the version with trip count 10 goes faster than that with trip count 12.</p>
<p>By the way, if I increase the trip count to, let's say, 16, I don't get this weird behaviour. </p>
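<p>In case it helps to reproduce: here is a self-contained reduction of the padded kernel I am compiling (the W12/FE0 contents are placeholders, not the real quadrature data; the restrict on A is an experiment of mine against possible aliasing assumptions, not a confirmed fix):</p>

```c
enum { NIP = 12, NDOF = 10, NPAD = 12 };   /* NPAD is the padded trip count */

/* Reduced version of the padded kernel; data is filled by the caller. */
static void mass_ffc_padded(double A[restrict NDOF][NPAD], double det,
                            double W12[NIP], double FE0[NIP][NPAD])
{
    for (int ip = 0; ip < NIP; ip++) {
        double tmp = W12[ip] * det;
        for (int j = 0; j < NDOF; j++) {
            double tmp2 = FE0[ip][j] * tmp;
            for (int k = 0; k < NPAD; k++)   /* 12 instead of 10 */
                A[j][k] += FE0[ip][k] * tmp2;
        }
    }
}
```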
<p>Thanks for considering my (long) request.</p>
<p>Fabio</p>
Fri, 25 Jan 13 01:54:35 -0800 · Fabio L. · 362177
Getting aligned accesses with AVX/SSE
https://software.intel.com/en-us/forums/intel-isa-extensions/topic/348806
<p>Hi all</p>
<p>Some preliminary information: I have the latest Intel suite (2013) on a Linux machine. In the code, PADDING is a macro that expands to either 16 or 32 depending on the vector instruction set used to compile the program (SSE4.2 or AVX).</p>
<p>I am struggling to get aligned accesses on this code:</p>
<blockquote><p>void f ( double A[restrict 3][4], double x[3][2] ) {</p>
<p>double det = ... ;</p>
<p>double W3[3] __attribute__((aligned(PADDING))) = {0.166666666666667, 0.166666666666667, 0.166666666666667};<br /> double FE0[3][4] __attribute__((aligned(PADDING))) = \<br /> {{0.666666666666667, 0.166666666666667, 0.166666666666667},<br /> {... }};</p>
<p>for ( int ip = 0; ip <3; ip++)<br /> {<br /> double tmp = W3[ip]*det; <br /> for ( int j = 0; j < 3; j++ )<br /> {<br /> #pragma vector aligned<br /> for ( int k = 0; k < 4; k++ )<br /> {<br /> A[j][k] += FE0[ip][j]*FE0[ip][k]*tmp;<br /> }<br /> }<br /> }<br />}</p>
</blockquote>
<p>In the caller, A is a static array (of size [3][4]) labelled with __attribute__((aligned(PADDING)))</p>
<p>Despite extensive use of __attribute__((aligned)) and pragmas, by looking at the assembly (generated by icc -O3 -ansi-alias nomefile.c -xSSE4.2 -restrict) it is clear that the innermost loop uses unaligned loads and stores (movupd). Shouldn't I expect movapd?</p>
<p>Notice that FE0 and A have been padded from 3 to 4 doubles (and the trip count of loop k extended to 4) so that it is possible to exploit SSE/AVX aligned instructions.</p>
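<p>One thing I plan to try (an assumption on my part, not yet verified): the aligned attribute on A in the caller may simply not propagate into the callee through the pointer parameter, so the callee has to be told explicitly. With icc that would be __assume_aligned(A, 16); the sketch below uses the GCC/Clang spelling __builtin_assume_aligned:</p>

```c
#define PADDING 16   /* SSE4.2 build */

/* Sketch: assert A's alignment inside the callee so aligned
   loads/stores (movapd) become legal for the compiler. */
static void f(double A[restrict 3][4], const double FE0[3][4],
              const double W3[3], double det)
{
    /* icc spelling would be: __assume_aligned(A, PADDING); */
    double (*Aa)[4] = __builtin_assume_aligned(A, PADDING);
    for (int ip = 0; ip < 3; ip++) {
        double tmp = W3[ip] * det;
        for (int j = 0; j < 3; j++)
            for (int k = 0; k < 4; k++)
                Aa[j][k] += FE0[ip][j] * FE0[ip][k] * tmp;
    }
}
```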
<p>Thanks for considering my request</p>
<p>Fabio </p>
Mon, 17 Dec 12 16:28:53 -0800 · Fabio L. · 348806