# Data-Parallelism Spanning From SSE to AVX to Larrabee to...

Greetings all, and thanks for reading my first Intel Software Network blog! I just took my wife to see the movie Julie and Julia, and was inspired to blog, and since the popcorn is still processing, I'm not yet asleep. I won't make a 365-day commitment to parallelize every numerical recipe or anything, but I will try to keep coming back, answer questions, follow up on new Intel technology developments, etc.

A BRIEF HISTORY OF DATA-PARALLEL CODING

# Parallelization And Optimization of The Line Segment Intersection Problem


Line Segment Intersection Problem

1. Problem Statement

Write threaded code to find the pairs of input line segments that intersect in three-dimensional space. Each line segment is defined by 6 integers representing its two (x,y,z) endpoints.
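As a rough illustration of the pairwise test this problem asks for, here is a minimal serial sketch in C. It is not taken from the post: the names `Vec3`, `Segment`, `segments_intersect` and `count_intersections` are hypothetical, coordinates are assumed small enough (magnitude up to a few thousand) that the 64-bit products cannot overflow, and parallel, collinear or zero-length segments are simply reported as non-intersecting, which a complete solution would have to handle separately.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A point or direction with integer coordinates (hypothetical helper type). */
typedef struct { int64_t x, y, z; } Vec3;

/* A line segment given by its two endpoints (6 integers in total). */
typedef struct { Vec3 p, q; } Segment;

static Vec3 sub(Vec3 a, Vec3 b) {
    return (Vec3){a.x - b.x, a.y - b.y, a.z - b.z};
}

static Vec3 cross(Vec3 a, Vec3 b) {
    return (Vec3){a.y * b.z - a.z * b.y,
                  a.z * b.x - a.x * b.z,
                  a.x * b.y - a.y * b.x};
}

static int64_t dot(Vec3 a, Vec3 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/* Returns true if the two segments share a point (endpoints included).
   Limitation of this sketch: parallel, collinear and zero-length
   segments are reported as non-intersecting. */
bool segments_intersect(Segment a, Segment b) {
    Vec3 d1 = sub(a.q, a.p);   /* direction of segment a */
    Vec3 d2 = sub(b.q, b.p);   /* direction of segment b */
    Vec3 r  = sub(b.p, a.p);   /* from a's start to b's start */
    Vec3 n  = cross(d1, d2);   /* normal of the plane spanned by d1, d2 */
    int64_t nn = dot(n, n);

    if (nn == 0) return false;          /* parallel or degenerate */
    if (dot(r, n) != 0) return false;   /* lines are skew (not coplanar) */

    /* Solve a.p + t*d1 == b.p + s*d2; t = t_num/nn, s = s_num/nn. */
    int64_t t_num = dot(cross(r, d2), n);
    int64_t s_num = dot(cross(r, d1), n);
    return t_num >= 0 && t_num <= nn && s_num >= 0 && s_num <= nn;
}

/* Counts intersecting pairs with a brute-force O(n^2) scan. */
size_t count_intersections(const Segment *segs, size_t n) {
    size_t count = 0;
    for (size_t i = 0; i < n; ++i)
        for (size_t j = i + 1; j < n; ++j)
            if (segments_intersect(segs[i], segs[j]))
                ++count;
    return count;
}
```

Every (i, j) pair is tested independently, so the outer loop is a natural place to split the work across threads, with each thread keeping its own count and the counts summed at the end.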

# High Clocks Per Instruction Retired when vectorizing the loop.

Sometimes vectorizing a loop results in a high Clocks Per Instruction Retired (CPI) value. This typically happens when bus utilization is high and the bus becomes saturated.
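The loop from the article is not shown here, so the following is only an assumed example of the kind of loop where this tends to occur: a streaming kernel (the function name `triad` is hypothetical) that does two floating-point operations for every 12 bytes it moves. Once vectorized, the arithmetic takes very few instructions, and most of the elapsed clocks are spent waiting on memory, which is what a high CPI reflects.

```c
#include <stddef.h>

/* Streaming "triad" kernel: a[i] = b[i] + s * c[i].
   Per element it loads 8 bytes, stores 4 bytes and performs only
   2 floating-point operations, so for arrays much larger than the
   caches its speed is limited by memory traffic rather than by
   the arithmetic. */
void triad(float *restrict a, const float *restrict b,
           const float *restrict c, float s, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        a[i] = b[i] + s * c[i];
    }
}
```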
• Developers
• Linux*
• Apple Mac OS X*
• Microsoft Windows* (XP, Vista, 7)
• Server
• C/C++
• Intel® C++ Compiler
• Intel® Parallel Composer
• Intel® Embedded Software Development Tool Suite for the Intel® Atom™ processor
• Intel® VTune™ Performance Analyzer
• simd
• SSE2
• SSE3
• SSE4
• SSE
• High CPI
• hardware prefetcher
• SSE1
• Memory latency
• Bus saturation
• Intel® Atom™ Processors
• Optimization
• Vectorization

# A Taste of Vectorization

After talking with people in the course of everyday work, I realized that something about vectorization is still not clear to everyone.

A lot may already have been written on the subject, but let's try to summarize what is known.

As is well known, in C/C++ we operate on operands that must have a type, and the type implicitly determines the size, that is, the number of bytes needed to store the operands/variables.
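To make the link between type size and vectorization concrete, here is a small sketch in C. It assumes the discussion is heading toward 128-bit (16-byte) SSE registers; the macro `SSE_REGISTER_BYTES` and the function `lanes_per_sse_register` are hypothetical names introduced only for this illustration.

```c
#include <stddef.h>

/* Width of one SSE (XMM) register in bytes: 128 bits. */
#define SSE_REGISTER_BYTES 16

/* How many operands of a given size fit in one SSE register.
   The size of the type therefore decides how many elements a
   single vector instruction can process at once. */
size_t lanes_per_sse_register(size_t element_size) {
    return SSE_REGISTER_BYTES / element_size;
}
```

On common platforms, where `float` is 4 bytes and `double` is 8 bytes, `lanes_per_sse_register(sizeof(float))` gives 4 and `lanes_per_sse_register(sizeof(double))` gives 2.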

# 3D Running Average SSE algorithm

The 3D Running Average SSE algorithm is implemented for single-precision floating-point (SP FP) input data. The averaging window is fixed at 11, the value requested by the customer who initiated this work. Based on the ideas in the current implementation, it is simple to build versions for other averaging windows as well.
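The SSE implementation itself is not reproduced in this excerpt. As a plain scalar reference for what a running average with window 11 computes, here is a one-dimensional sketch in C; the function name `running_average_11` and the choice to emit only fully covered windows are assumptions made for the illustration. If the 3D average is a box filter, a pass like this could be applied along each of the three axes in turn.

```c
#include <stddef.h>

#define WINDOW 11  /* averaging window, fixed as in the description */

/* Writes the average of every full window of WINDOW consecutive
   samples to out and returns the number of values written:
   n - WINDOW + 1, or 0 if the input is shorter than the window.
   A sliding sum is used, so each step adds the sample entering the
   window and subtracts the one leaving it. */
size_t running_average_11(const float *in, size_t n, float *out) {
    if (n < WINDOW) return 0;

    float sum = 0.0f;
    for (size_t i = 0; i < WINDOW; ++i) sum += in[i];
    out[0] = sum / WINDOW;

    for (size_t i = WINDOW; i < n; ++i) {
        sum += in[i] - in[i - WINDOW];
        out[i - WINDOW + 1] = sum / WINDOW;
    }
    return n - WINDOW + 1;
}
```

For long inputs the sliding sum accumulates rounding error in single precision; recomputing the sum periodically is one way to limit that.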

• SSE
• Parallel computing

# 2x Shrink SSE algorithm

The uploaded presentation describes the SSE implementation of a 2x image shrink in which each pixel contains 4 bytes: the 3 color components R, G and B, plus a 4th component, the weight A.

Speed-up (compared with serial code) is 4.6x on the Merom platform and ~7x on the Penryn platform.

1. PowerPoint presentation, describing this algorithm.
2. ZIP file containing the C code project implementation, included in a simple benchmarking application. The project is built for Microsoft Visual Studio 2005.

The application takes no command-line arguments; run it by name only.
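For readers without the attachments, here is a scalar reference sketch in C of what a 2x shrink of 4-byte pixels can look like. It is an assumption, not the code from the ZIP file: it treats all four components alike, averages each 2x2 block with rounding, and expects a tightly packed image with even width and height. The function name `shrink2x_rgba` is hypothetical, and the presentation may treat the weight A differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Shrinks an image by 2x in both dimensions.
   src: width x height pixels, 4 bytes each (R, G, B, A), row-major,
        no padding between rows; width and height assumed even.
   dst: (width / 2) x (height / 2) pixels in the same layout.
   Each output component is the rounded average of the matching
   component in the corresponding 2x2 block of input pixels. */
void shrink2x_rgba(const uint8_t *src, size_t width, size_t height,
                   uint8_t *dst) {
    size_t dw = width / 2;
    size_t dh = height / 2;

    for (size_t y = 0; y < dh; ++y) {
        const uint8_t *row0 = src + (2 * y) * width * 4;  /* upper row */
        const uint8_t *row1 = row0 + width * 4;           /* lower row */
        for (size_t x = 0; x < dw; ++x) {
            for (size_t c = 0; c < 4; ++c) {
                unsigned sum = row0[(2 * x) * 4 + c]
                             + row0[(2 * x + 1) * 4 + c]
                             + row1[(2 * x) * 4 + c]
                             + row1[(2 * x + 1) * 4 + c];
                dst[(y * dw + x) * 4 + c] = (uint8_t)((sum + 2) / 4);
            }
        }
    }
}
```

The 16 bytes of one 2x2 block's row pair line up with the width of an SSE register, which is presumably part of why this operation vectorizes well.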

• SSE
• Parallel computing

# 16bit 3D Convolution: SSE4+OpenMP implementation on Penryn CPU

The attached presentation describes an SSE3/SSE4 implementation of 3D convolution for 16-bit original data.

The SSE speed-up (compared with serial code) is ~3x; OpenMP on a 2-way Harpertown (Penryn) machine raises it by a further ~6x, so the overall SSE+OpenMP speed-up is ~18x.

• SSE
• Parallel computing

# Sun + Intel + OpenSolaris + 2 Years = The Year of Core

Today is the second anniversary of the Sun and Intel joint agreement to optimize the Solaris operating system for Intel Xeon processors. Like last year, when I wrote this summary of our work, I decided to recap where we are to date.

Like last year’s edition, this is pretty much off the top of my head.

# x87 and SSE Floating Point Assists in IA-32: Flush-To-Zero (FTZ) and Denormals-Are-Zero (DAZ)

### Introduction

This document details the difference between how assists are handled with x87 and Single Instruction Multiple Data (SIMD) instructions, and gives information on how to change their behavior when using Streaming SIMD Extensions (SSE) and SSE2.
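As a hedged sketch of what changing that behavior can look like from C, the functions below toggle the FTZ and DAZ bits in the MXCSR register through the standard `xmmintrin.h` intrinsics. The function names and the `MXCSR_DAZ_BIT` macro are hypothetical, and the sketch assumes an x86-64 build (where scalar `float` arithmetic uses SSE) on a processor that supports DAZ. It affects only SSE/SSE2 arithmetic, not x87 code, and only the calling thread.

```c
#include <float.h>
#include <xmmintrin.h>

/* DAZ is bit 6 of MXCSR; FTZ (bit 15) has its own macro in xmmintrin.h. */
#define MXCSR_DAZ_BIT 0x0040u

/* Flush denormal results to zero (FTZ) and treat denormal inputs
   as zero (DAZ) for SSE/SSE2 arithmetic in the calling thread. */
void enable_ftz_daz(void) {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _mm_setcsr(_mm_getcsr() | MXCSR_DAZ_BIT);
}

/* Restore the default, IEEE-compliant handling of denormals. */
void disable_ftz_daz(void) {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_OFF);
    _mm_setcsr(_mm_getcsr() & ~MXCSR_DAZ_BIT);
}

/* Divides the smallest normal float by 3, which gives a denormal
   result. volatile keeps the compiler from folding the division at
   compile time, so the result depends on the current MXCSR state. */
float tiny_quotient(void) {
    volatile float x = FLT_MIN;
    volatile float d = 3.0f;
    return x / d;
}
```

With the default settings `tiny_quotient()` returns a small nonzero denormal; after `enable_ftz_daz()` the same division is flushed to zero, which avoids the assist at the cost of strict IEEE behavior.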

• simd
• SSE2
• SSE

# Precision and the Politeness of the Compiler

In the search for higher truth, you sometimes stumble and have to go back and fully understand the basics.

Take, for example, the following code:

    #include <stdio.h>

    int main (void)
    {
        double a = 3.0, b = 7.0, c;

        c = a / b;

        if (c == a / b) {
            printf ("comparison succeeds\n");
        } else {
            printf ("unexpected result\n");
        }

        return 0;
    }

and it turns out that, for example, with gcc, and probably with other compilers too, it may well print "unexpected result".