I would like to ask a question about parallelization and vectorization:
As processors are manufactured with higher core counts and ever wider SIMD registers and instruction sets, effective use of these components is essential to good application performance.
The attached presentation describes an SSE3/SSE4 implementation of 3D convolution for 16-bit original data.
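The presentation itself is not reproduced here, so the following is only a minimal sketch of the general idea, not the presentation's code. It assumes a much simpler case: a 1D convolution with a fixed 8-tap kernel over signed 16-bit samples, using the SSE2 intrinsic `_mm_madd_epi16` instead of the SSE3/SSE4 instructions mentioned above. The function names `conv1d_scalar` and `conv1d_simd` are hypothetical, and the sketch assumes the data range is small enough that each 8-term sum fits in a 32-bit integer. A 3D version would apply the same multiply-add pattern along each axis.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics */
#endif

/* Scalar reference: out[i] = sum over k of in[i + k] * kernel[k], 8 taps.
 * The caller must provide n_out + 7 readable input samples.
 * Assumption: the values are small enough that the sum fits in int32_t. */
void conv1d_scalar(const int16_t *in, size_t n_out,
                   const int16_t *kernel, int32_t *out)
{
    for (size_t i = 0; i < n_out; ++i) {
        int32_t acc = 0;
        for (size_t k = 0; k < 8; ++k)
            acc += (int32_t)in[i + k] * (int32_t)kernel[k];
        out[i] = acc;
    }
}

/* Same computation, one output per iteration, 8 taps per instruction.
 * Falls back to the scalar version when SSE2 is not available. */
void conv1d_simd(const int16_t *in, size_t n_out,
                 const int16_t *kernel, int32_t *out)
{
#if defined(__SSE2__)
    /* Load all 8 kernel taps once; unaligned load, no alignment assumed. */
    const __m128i k = _mm_loadu_si128((const __m128i *)kernel);
    for (size_t i = 0; i < n_out; ++i) {
        /* Load 8 consecutive 16-bit samples starting at in[i]. */
        __m128i x = _mm_loadu_si128((const __m128i *)(in + i));
        /* Multiply 16-bit pairs and add adjacent products: 4 x int32. */
        __m128i p = _mm_madd_epi16(x, k);
        /* Horizontal sum of the four 32-bit lanes. */
        p = _mm_add_epi32(p, _mm_srli_si128(p, 8));
        p = _mm_add_epi32(p, _mm_srli_si128(p, 4));
        out[i] = _mm_cvtsi128_si32(p);
    }
#else
    conv1d_scalar(in, n_out, kernel, out);
#endif
}
```

Computing one output per iteration keeps the sketch short but spends most of its time on the horizontal sum. An implementation tuned for throughput would more likely keep several outputs in separate lanes and accumulate across taps, and would split the outer loops of the 3D volume across cores.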
Programming for Multicore and Many-core Products including Intel® Xeon® processors and Intel® Xeon Phi™ X100 Product Family coprocessors
The programming models used every day for multicore processors are available for many-core coprocessors as well. Therefore, explaining how to program both Intel Xeon processors and Intel Xeon Phi coprocessors is best done by explaining the options for parallel programming. This paper provides the foundation for understanding how multicore processors and many-core coprocessors are...