A BRIEF HISTORY OF DATA-PARALLEL CODING
Data-Parallelism Spanning From SSE to AVX to Larrabee to...
The 3D Running Average SSE algorithm is implemented for single-precision floating-point (SP FP) input data. The averaging window is fixed at 11, the value requested by the customer who initiated this work. Based on the ideas in the current implementation, it is straightforward to build versions for other averaging window sizes as well.
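As a rough illustration of how a running average with a window of 11 might map onto SSE, the sketch below keeps a sliding sum for four neighbouring lines at once, one line per SSE lane. It is not the attached implementation: the function name `running_avg4`, the `stride` parameter, and the memory layout (the four lines' values for sample i stored contiguously at `in + i * stride`) are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

#define WINDOW 11  /* averaging window size quoted in the text */

/* Running average along one axis for 4 neighbouring lines at once.
 * Hypothetical layout: sample i of the 4 lines is stored contiguously at
 * in[i * stride .. i * stride + 3]; out uses the same stride.
 * Writes n - WINDOW + 1 averaged samples per line. */
void running_avg4(const float *in, float *out, size_t n, size_t stride)
{
    assert(n >= WINDOW && stride >= 4);
    const __m128 scale = _mm_set1_ps(1.0f / WINDOW);
    __m128 sum = _mm_setzero_ps();

    /* Prime the window with the first WINDOW samples. */
    for (size_t i = 0; i < WINDOW; ++i)
        sum = _mm_add_ps(sum, _mm_loadu_ps(in + i * stride));
    _mm_storeu_ps(out, _mm_mul_ps(sum, scale));

    /* Slide the window: add the entering sample, subtract the leaving one. */
    for (size_t i = WINDOW; i < n; ++i) {
        sum = _mm_add_ps(sum, _mm_loadu_ps(in + i * stride));
        sum = _mm_sub_ps(sum, _mm_loadu_ps(in + (i - WINDOW) * stride));
        _mm_storeu_ps(out + (i - WINDOW + 1) * stride, _mm_mul_ps(sum, scale));
    }
}
```

In a volume stored with x as the fastest axis, passing the row length as `stride` would average along y for four adjacent x positions. Because the work per output sample is one add and one subtract regardless of the window size, changing `WINDOW` is enough to get other window sizes in this sketch.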
Please find attached:
The uploaded presentation describes the SSE implementation of a 2x image shrink in which each pixel contains 4 bytes: three color components (R, G, and B) and a fourth component, the weight A.
Speed-up (compared with serial code) is 4.6x on the Merom platform and ~7x on the Penryn platform.
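A minimal sketch of a 2x shrink for 4-byte pixels is shown below, using only SSE2 byte averaging. It is an assumption-laden illustration, not the code from the presentation: the function name `shrink2x_rgba` is hypothetical, rows are assumed tightly packed, and the A component is simply averaged like the color channels, which may differ from how the original treats the weight.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Shrink a w x h image with 4 bytes per pixel (R, G, B, A) to (w/2) x (h/2)
 * by averaging each 2x2 block per channel.
 * Assumptions: rows are tightly packed, w is a multiple of 4, h is even.
 * _mm_avg_epu8 rounds up, and it is applied twice, so a result can differ
 * by 1 from the exactly rounded mean of the four inputs. */
void shrink2x_rgba(const uint8_t *src, uint8_t *dst, size_t w, size_t h)
{
    assert(w % 4 == 0 && h % 2 == 0);
    for (size_t y = 0; y < h; y += 2) {
        const uint8_t *row0 = src + y * w * 4;
        const uint8_t *row1 = row0 + w * 4;
        uint8_t *out = dst + (y / 2) * (w / 2) * 4;
        for (size_t x = 0; x < w; x += 4) {
            /* Load 4 pixels from each of the two source rows. */
            __m128i a = _mm_loadu_si128((const __m128i *)(row0 + x * 4));
            __m128i b = _mm_loadu_si128((const __m128i *)(row1 + x * 4));
            __m128i v = _mm_avg_epu8(a, b);  /* vertical average */
            /* Reorder pixels to [p0, p2, p1, p3] so the horizontal
             * neighbours line up between the low and high halves. */
            __m128i s = _mm_shuffle_epi32(v, _MM_SHUFFLE(3, 1, 2, 0));
            __m128i r = _mm_avg_epu8(s, _mm_srli_si128(s, 8));
            /* The low 8 bytes hold the 2 output pixels. */
            _mm_storel_epi64((__m128i *)(out + (x / 2) * 4), r);
        }
    }
}
```

Each inner iteration consumes 8 input pixels and produces 2 output pixels with two averages and one shuffle, which is where the data-parallel gain over a per-byte serial loop would come from.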
Please find attached:
The command line does not take any arguments: only the application name is needed.
The attached presentation describes the SSE3/SSE4 implementation of 3D convolution for 16-bit original data.
SSE speed-up (compared with serial code) is ~3x; OpenMP on a 2-way Harpertown (Penryn) machine raises it by a further ~6x, so the overall SSE+OpenMP speed-up is ~18x.
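To make the idea concrete, here is a hedged sketch of the inner multiply-accumulate for a 3D convolution over signed 16-bit data. It uses only the SSE2 instruction `_mm_madd_epi16` rather than the SSE3/SSE4 instructions mentioned in the presentation, and it computes the correlation form (no kernel flip). The names `dot_i16`, `conv3d_at`, and `KW`, and the kernel layout padded to 8 coefficients along x, are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

#define KW 8  /* hypothetical kernel width along x, zero-padded to 8 */

/* Dot product of n signed 16-bit values; n must be a multiple of 8.
 * The 32-bit accumulators can overflow for large inputs; a real
 * implementation would need to bound the data or widen the sums. */
static int32_t dot_i16(const int16_t *a, const int16_t *b, size_t n)
{
    assert(n % 8 == 0);
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < n; i += 8) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* 8 products of 16-bit values, adjacent pairs summed into 4 x int32. */
        acc = _mm_add_epi32(acc, _mm_madd_epi16(va, vb));
    }
    int32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

/* One output voxel at (x, y, z) of a "valid" 3D correlation.
 * vol is nx * ny * nz with x fastest; ker is KW * ky_n * kz_n, x fastest. */
int32_t conv3d_at(const int16_t *vol, size_t nx, size_t ny,
                  const int16_t *ker, size_t ky_n, size_t kz_n,
                  size_t x, size_t y, size_t z)
{
    int32_t sum = 0;
    for (size_t kz = 0; kz < kz_n; ++kz)
        for (size_t ky = 0; ky < ky_n; ++ky) {
            const int16_t *row = vol + ((z + kz) * ny + (y + ky)) * nx + x;
            sum += dot_i16(row, ker + (kz * ky_n + ky) * KW, KW);
        }
    return sum;
}
```

Output voxels are independent of each other, so an outer loop over z slices is a natural candidate for distribution across threads with OpenMP; that is how the SSE and multi-threading gains quoted above (~3x and ~6x) can combine multiplicatively.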
Please find attached: