Many thanks to all contributors to my past question.
Hello, I'm investigating conversion of a number of compute kernels from AVX 128 to AVX 256 and would appreciate any guidance which might be available on getting a small number of operations on port
I've started attempting to learn RTM extensions. The most common examples I can find online are using them to implement a mutex or concurrent lock. Often they are similar to:
This code scales poorly with AVX on my Sandy Bridge. How can I make it more vectorizer-friendly:
I would like to ask a question about parallelization and vectorization: