Rule 4 or 9, "Don't forget to use SIMD instructions", is interesting, but I have one small question.
I work on video, and I'm the code optimizer at my company.
I have written a lot of SIMD routines to optimize our watermarking system, but I have a concern, which can be expressed like this:
Suppose I have 3 or 4 threads on a single processor (maybe with HT), all executing the same job. These jobs have been really well optimized with SIMD (from 15 ms/frame without SIMD down to 5 ms/frame with the optimizations). There is only one SSE2 unit and one MMX unit, so how does the processor manage this amount of SIMD code?
Is there nothing like instruction-level parallelism for SIMD? Is there no risk of decreasing performance when we use lots of SIMD code?