Will Intel Intrinsics really help here?

I have a short 64-bit comparison in the loop below: basically a bitwise AND accumulation, followed by a find-first-set-bit. Typically the outer loop runs for ~400 iterations, while the inner loop runs 12 times (_num_tables). If I replace the 64-bit operations with intrinsics for 128-bit operations (and halve the outer loop iteration count to ~200), performance drops by about 35% compared to the 64-bit case. This is all on the latest hardware, compiled with -O3 etc.

No single line appears to be the offender performance-wise in the intrinsic version. Is there anything obviously wrong in the 128-bit code that jumps out as a performance no-no?

Thanks for any advice!

        /* for each chunk of rules, i.e. 64 at a time */
        unsigned int end = conf->_num_chunks * 2;
        for (j = 0; j < end; j += 2) {
                long int rule_match = 0xFFFFFFFFFFFFFFFF;
                /* For each table */
                for (i = 0; i < conf->_num_tables; ++i) {
                        rule_match &= *((long int*)(conf->_match_table[i][ packet[i] ] + j));
                        if (!rule_match)
                                goto next; /* don't need to proceed, no match */
                }
                return ffsl(rule_match);
        next: ;
        }
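
One detail worth keeping in mind when comparing the two versions: ffsl() returns a 1-based bit position (and 0 when no bit is set), while the lookup-table loop in the 128-bit version below returns a 0-based index i (assuming lut[i] has only bit i set), so the two paths don't report the same value for the same match. A small standalone check of that behaviour, using GCC's __builtin_ffsl (same semantics as ffsl(), without having to worry about which header/feature-test macro declares it):

        /* __builtin_ffsl behaves like ffsl(): 1-based position of the least
         * significant set bit, or 0 if no bit is set. */
        #include <stdio.h>

        int main(void)
        {
                long x = 0x8;                           /* bit 3 set */
                printf("%d\n", __builtin_ffsl(x));      /* prints 4 (1-based) */
                printf("%d\n", __builtin_ffsl(0L));     /* prints 0 (no bits set) */
                return 0;
        }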

128 bit intrinsic version below:

        /* for each chunk of rules, i.e. 128 at a time */
        for (j = 0; j < conf->_num_chunks2; ++j) {
                /* initial 128 bit wide value */
                __m128i rule_match_128 = max;
                unsigned short jump = j * 4;
                /* For each table */
                for (i = 0; i < conf->_num_tables; ++i) {
                        uint8_t seg = packet[i];
                        /* copy 128 bit index into comparison */
                        __m128i *match_128 = (__m128i*)(conf->_match_table[i][seg] + jump);
                        /* perform &= on 128 bit wide comparison */
                        rule_match_128 = _mm_and_si128(rule_match_128, *match_128);
                        if (_mm_movemask_epi8(_mm_cmpeq_epi32(rule_match_128, zero)) == 65535)
                                goto next;
                }

                /* Only returning first match for now */
                for (i = 0; i < 128; ++i) {
                        __m128i cmp = _mm_and_si128(rule_match_128, lut[i]);
                        if (_mm_movemask_epi8(_mm_cmpeq_epi32(cmp, zero)) != 65535) {
                                return i;
                        }
                }
        next: ;
        }
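
As an aside, a sketch of how the same two operations might be expressed more directly, assuming SSE4.1 is available (the helper names all_zero_128 and first_set_bit_128 are mine, not from the posted code): the all-zero early exit can use _mm_testz_si128() instead of the cmpeq/movemask pair, and the first set bit can be found by pulling out the two 64-bit lanes and using a count-trailing-zeros builtin, instead of scanning a 128-entry lookup table.

        /* Needs -msse4.1 (or -march=native) when compiling. */
        #include <smmintrin.h>   /* SSE4.1: _mm_testz_si128, _mm_extract_epi64 */
        #include <stdint.h>

        /* Non-zero when every bit of v is zero (single PTEST instruction). */
        static inline int all_zero_128(__m128i v)
        {
                return _mm_testz_si128(v, v);
        }

        /* 0-based index of the least significant set bit in the 128-bit value,
         * or -1 if no bit is set. */
        static inline int first_set_bit_128(__m128i v)
        {
                uint64_t lo = (uint64_t)_mm_cvtsi128_si64(v);       /* low lane */
                uint64_t hi = (uint64_t)_mm_extract_epi64(v, 1);    /* high lane */
                if (lo)
                        return __builtin_ctzll(lo);
                if (hi)
                        return 64 + __builtin_ctzll(hi);
                return -1;
        }

In the loop above, the early exit would then read if (all_zero_128(rule_match_128)) goto next; and the final scan would become return first_set_bit_128(rule_match_128); (0-based, unlike ffsl() in the 64-bit version). Whether that actually closes the 35% gap is something only measurement can tell.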

iliyapolak replied:
Can you perform VTune analysis of both test cases and post the screenshots?
