What is the fastest assembler code for this task?

void doUpdate(double *const pData, const std::size_t iColPos, const std::size_t _iMax, const unsigned int *const pFactorPositions, const std::size_t iOffset, const double d)
{
    for (std::size_t iSource = iColPos; iSource < _iMax; ++iSource)
        pData[pFactorPositions[iSource + iOffset]] -= pData[iSource]*d;
}

I should add that the LHS and RHS indexes are always distinct: no LHS index ever repeats, and no LHS index is ever identical to an RHS index.

Here is what g++ produces:

81088:  42 8d 14 00       lea    (%rax,%r8,1),%edx
8108c:  89 c6             mov    %eax,%esi
8108e:  ff c0             inc    %eax
81090:  66 0f 12 0c f1    movlpd (%rcx,%rsi,8),%xmm1
81095:  39 c7             cmp    %eax,%edi
81097:  41 8b 14 91       mov    (%r9,%rdx,4),%edx
8109b:  f2 0f 59 ca       mulsd  %xmm2,%xmm1
8109f:  48 8d 14 d1       lea    (%rcx,%rdx,8),%rdx
810a3:  66 0f 12 02       movlpd (%rdx),%xmm0
810a7:  f2 0f 5c c1       subsd  %xmm1,%xmm0
810ab:  f2 0f 11 02       movsd  %xmm0,(%rdx)
810af:  77 d7             ja     81088

I've tried to force unrolling of this loop, but the performance is the same.
Would it matter if different registers were used in each step of the unrolled loop?

Hello Foelsche,

Depending on the size of the arrays and on how random the accesses to pData become because of the indirect indexing through the pFactorPositions array, it may not matter very much what the assembly code looks like.

How big are the arrays involved? The worst case is when pData is very large (larger than the last-level cache) and pData[pFactorPositions[index]] results in more or less random memory accesses. Then it won't matter what the assembly looks like; you'll mostly be waiting on memory.
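
As a rough way to answer the size question, the footprint of the loop can be estimated from the element counts. The sketch below is only an illustration: the helper names workingSetBytes and fitsInCache are hypothetical, and the cache size has to be supplied by the caller for their own CPU, since the thread does not name one.

```cpp
#include <cstddef>

// Bytes touched by the loop: nData doubles plus nPositions unsigned int indexes.
constexpr std::size_t workingSetBytes(std::size_t nData, std::size_t nPositions)
{
    return nData * sizeof(double) + nPositions * sizeof(unsigned int);
}

// True if that footprint fits in a cache of cacheBytes.
// ASSUMPTION: cacheBytes is whatever the caller looks up for their own CPU.
constexpr bool fitsInCache(std::size_t nData, std::size_t nPositions,
                           std::size_t cacheBytes)
{
    return workingSetBytes(nData, nPositions) <= cacheBytes;
}
```

For example, with 8-byte doubles and 4-byte unsigned ints, 1000 elements of each come to 12000 bytes.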



The vector pFactorPositions is sorted, which means the addresses for the LHS are always increasing; each step may skip any number of elements, or none.



Array size should be limited to fewer than 1000 elements,
and sometimes there are only one or two elements.

Of course the routine will be inlined...


Assuming that the pData and pFactorPositions arrays hold only about 1000 elements each, they should all fit in cache. If you want to see whether the indirect addressing is causing havoc, you could try setting pFactorPositions to sequential values.
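
A minimal sketch of that experiment follows, assuming a plain std::chrono timer is good enough for a first look. All names here (scatterSub, makeSequentialPositions, makeStridedPositions, timeKernel) are hypothetical, and the fixed-stride pattern is only a stand-in for the real sorted index pattern, which the thread does not give.

```cpp
#include <chrono>
#include <cstddef>
#include <numeric>
#include <vector>

// The loop from the question (hypothetical name).
void scatterSub(double *pData, std::size_t iColPos, std::size_t iMax,
                const unsigned int *pFactorPositions, std::size_t iOffset, double d)
{
    for (std::size_t iSource = iColPos; iSource < iMax; ++iSource)
        pData[pFactorPositions[iSource + iOffset]] -= pData[iSource] * d;
}

// Target indexes first, first+1, first+2, ... (no gaps).
std::vector<unsigned int> makeSequentialPositions(std::size_t n, unsigned int first)
{
    std::vector<unsigned int> pos(n);
    std::iota(pos.begin(), pos.end(), first);
    return pos;
}

// Sorted target indexes with a fixed gap; an ASSUMED stand-in for the real pattern.
std::vector<unsigned int> makeStridedPositions(std::size_t n, unsigned int first,
                                               unsigned int stride)
{
    std::vector<unsigned int> pos(n);
    for (std::size_t i = 0; i < n; ++i)
        pos[i] = first + static_cast<unsigned int>(i) * stride;
    return pos;
}

// Wall-clock seconds for reps passes over sources 0..n-1.
// The caller must size data so that every index in pos is in range.
double timeKernel(std::vector<double> &data, const std::vector<unsigned int> &pos,
                  std::size_t n, int reps)
{
    const auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r)
        scatterSub(data.data(), 0, n, pos.data(), 0, 1e-9);
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Calling timeKernel once with the sequential positions and once with the strided ones, using a first index of at least n so that targets never overlap sources, gives two numbers to compare. If they are close, the indirection itself is probably not the bottleneck at this array size.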

But I'm guessing that you are getting one or more instructions executed per clock tick.

Do you have any CPI or IPC stats for the loop?

As Pat said, it possibly does not matter what the assembly looks like. There is a dependency on the indirect array indexing. Because of this, data prefetching cannot rely on spatial locality.

Sorry, my mistake: the source array pData[iSource] does have spatial locality because its index increases linearly.
