I'm currently in the unusual position of writing a software rasterizer. At this point, I've vectorized the C code to use SSE2. So far, I've managed a 9x speed increase over the compiler's output through SSE vectorization and prefetching. However, I feel that I could probably speed this up further through better instruction scheduling, so I'd like to rewrite some of my assembly to better utilize the execution units within the CPU. My target CPUs are the Pentium 4 and up, which means I'm limited to SSE2.
With that in mind, is there a chart somewhere that lists instructions along with their latency, issue port, and execution unit?