The application I'm trying to optimise makes intensive use of shift-by-CL operations (variable-count shifts) to pack bits together, for example:
packed_bits = (packed_bits << n) | new_bits;
(where n is a variable specifying the size of new_bits)
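To make the pattern concrete, here is a minimal, self-contained sketch of the kind of packing loop I mean. The function name pack_fields, its parameters, and the 1..32 width limit are made up for illustration and are not the real application code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (illustration only): appends the low widths[i] bits of
 * values[i] to a 64-bit accumulator, first field ending up most significant.
 * Assumes every width is in 1..32 and the widths sum to at most 64. */
static uint64_t pack_fields(const uint32_t *values, const unsigned *widths,
                            size_t count)
{
    uint64_t packed_bits = 0;
    for (size_t i = 0; i < count; i++) {
        unsigned n = widths[i];
        assert(n >= 1 && n <= 32);

        /* Keep only the low n bits of the incoming value. */
        uint64_t new_bits = values[i] & ((UINT64_C(1) << n) - 1);

        /* Variable-count shift: without BMI2 this typically compiles to
         * a shift by CL (shl reg, cl) on x86. */
        packed_bits = (packed_bits << n) | new_bits;
    }
    return packed_bits;
}
```

The shift count is only known at run time, so the compiler cannot turn it into an immediate-count shift.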
The target platform is Sandy Bridge. VTune reports a high number of Flags Merge Stalls, which is consistent with the description on this page: http://software.intel.com/sites/products/documentation/doclib/iss/2013/a...
Not using any shift-by-CL operations at all sounds like quite a big limitation. I was wondering if anyone could suggest a way to work around this issue, considering that the flags are not actually used when packing bits (i.e. the bits that are shifted out are not relevant).