16-bit 3D Convolution: SSE4+OpenMP Implementation on Penryn CPU

The attached presentation describes an SSE3/SSE4 implementation of 3D convolution for 16-bit input data.

The SSE speed-up (compared with serial code) is ~3x. OpenMP on a 2-way Harpertown (Penryn) machine raises it by a further ~6x, so the overall SSE+OpenMP speed-up is ~18x.
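As a rough illustration of how the two levels of parallelism can combine, here is a minimal sketch, not the code from the attached project. It assumes a cubic kernel of side k, "valid" output only (no border handling), 32-bit accumulators that do not overflow, and a kernel applied without flipping. For portability it uses only SSE2 intrinsics rather than the SSE3/SSE4 instructions covered in the presentation. All function names (conv_point, conv3d_scalar, conv3d_simd) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>   /* SSE2 intrinsics */
#define HAVE_SSE2 1
#endif

/* One output voxel, scalar. The kernel is applied without flipping. */
static int32_t conv_point(const int16_t *in, int nx, int ny,
                          const int16_t *ker, int k, int x, int y, int z)
{
    int32_t acc = 0;
    for (int c = 0; c < k; ++c)
        for (int b = 0; b < k; ++b)
            for (int a = 0; a < k; ++a)
                acc += (int32_t)in[((size_t)(z + c) * ny + (y + b)) * nx + (x + a)]
                     * ker[(c * k + b) * k + a];
    return acc;
}

/* Serial reference: every output voxel computed one at a time. */
static void conv3d_scalar(const int16_t *in, int nx, int ny, int nz,
                          const int16_t *ker, int k, int32_t *out)
{
    int ox = nx - k + 1, oy = ny - k + 1, oz = nz - k + 1;
    for (int z = 0; z < oz; ++z)
        for (int y = 0; y < oy; ++y)
            for (int x = 0; x < ox; ++x)
                out[((size_t)z * oy + y) * ox + x] =
                    conv_point(in, nx, ny, ker, k, x, y, z);
}

/* Output slices are split across threads with OpenMP; within a row,
   eight 16-bit voxels are processed per SSE2 iteration. */
static void conv3d_simd(const int16_t *in, int nx, int ny, int nz,
                        const int16_t *ker, int k, int32_t *out)
{
    int ox = nx - k + 1, oy = ny - k + 1, oz = nz - k + 1;
    #pragma omp parallel for
    for (int z = 0; z < oz; ++z)
        for (int y = 0; y < oy; ++y) {
            int32_t *dst = out + ((size_t)z * oy + y) * ox;
            int x = 0;
#ifdef HAVE_SSE2
            for (; x + 8 <= ox; x += 8) {
                __m128i acc_lo = _mm_setzero_si128();  /* outputs x..x+3   */
                __m128i acc_hi = _mm_setzero_si128();  /* outputs x+4..x+7 */
                for (int c = 0; c < k; ++c)
                    for (int b = 0; b < k; ++b) {
                        const int16_t *row =
                            in + ((size_t)(z + c) * ny + (y + b)) * nx + x;
                        for (int a = 0; a < k; ++a) {
                            __m128i v = _mm_loadu_si128((const __m128i *)(row + a));
                            __m128i w = _mm_set1_epi16(ker[(c * k + b) * k + a]);
                            /* low and high halves of the 16x16 -> 32-bit products */
                            __m128i lo = _mm_mullo_epi16(v, w);
                            __m128i hi = _mm_mulhi_epi16(v, w);
                            acc_lo = _mm_add_epi32(acc_lo, _mm_unpacklo_epi16(lo, hi));
                            acc_hi = _mm_add_epi32(acc_hi, _mm_unpackhi_epi16(lo, hi));
                        }
                    }
                _mm_storeu_si128((__m128i *)(dst + x), acc_lo);
                _mm_storeu_si128((__m128i *)(dst + x + 4), acc_hi);
            }
#endif
            for (; x < ox; ++x)  /* scalar tail (or whole row without SSE2) */
                dst[x] = conv_point(in, nx, ny, ker, k, x, y, z);
        }
}
```

The OpenMP pragma only takes effect when OpenMP is enabled in the compiler (for example `-fopenmp` with GCC or `/openmp` with MSVC). Without it the code still compiles and runs on a single thread.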

Please find attached:

  1. A PowerPoint presentation describing the algorithm.
  2. A ZIP file containing the C implementation, embedded in a simple benchmarking application. The project is built for MS Visual Studio 2005.

The command line has the form <appName XYSize Zsize NumberOfRunnings>, for example <Conv3D 512 512 3>.
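The following sketch shows one way such a command line could be read. It is not taken from the attached project: the struct and function names are hypothetical, and it assumes that XYSize is used for both the X and Y dimensions of each slice, so that the example above describes a 512x512x512 volume processed 3 times.

```c
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical container for the three command-line parameters. */
typedef struct {
    int xy_size;  /* XYSize: assumed to be both width and height of a slice */
    int z_size;   /* Zsize: number of slices */
    int runs;     /* NumberOfRunnings: benchmark repetitions */
} BenchArgs;

/* Parse a strictly positive decimal integer; returns 0 on success, -1 on error. */
static int parse_positive(const char *s, int *out)
{
    char *end;
    errno = 0;
    long v = strtol(s, &end, 10);
    if (errno != 0 || end == s || *end != '\0' || v <= 0 || v > INT_MAX)
        return -1;
    *out = (int)v;
    return 0;
}

/* Expects argv = { appName, XYSize, Zsize, NumberOfRunnings }. */
static int parse_bench_args(int argc, char **argv, BenchArgs *a)
{
    if (argc != 4)
        return -1;
    if (parse_positive(argv[1], &a->xy_size) != 0) return -1;
    if (parse_positive(argv[2], &a->z_size) != 0) return -1;
    if (parse_positive(argv[3], &a->runs) != 0) return -1;
    return 0;
}

/* Size in bytes of one 16-bit input volume under the assumption above. */
static size_t volume_bytes(const BenchArgs *a)
{
    return (size_t)a->xy_size * (size_t)a->xy_size * (size_t)a->z_size
           * sizeof(int16_t);
}
```

Under that assumption, the example arguments correspond to 512 * 512 * 512 * 2 bytes = 256 MiB of 16-bit input data per volume.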

