SSE

Data-Parallelism Spanning From SSE to AVX to Larrabee to...

Greetings all, and thanks for reading my first Intel Software Network blog! I just took my wife to see the movie Julie and Julia, and was inspired to blog, and since the popcorn is still processing, I'm not yet asleep. I won't make a 365-day commitment to parallelize every numerical recipe or anything, but I will try to keep coming back, answer questions, follow up on new Intel technology developments, etc.

A BRIEF HISTORY OF DATA-PARALLEL CODING

Parallelization And Optimization of The Line Segment Intersection Problem


Line Segment Intersection Problem


1. Problem Statement

Write threaded code to find pairs of input line segments that intersect within three-dimensional space. Each line segment is defined by six integers representing its two (x,y,z) endpoints.

A Taste of Vectorization

After talking with people in the course of day-to-day work, I realized that something about the topic of vectorization is still not clear to everyone.

A lot may already have been written about it, but let us try to summarize what is known.

As is well known, in C/C++ we work with operands that must have a type, which implicitly determines their size, that is, the number of bytes needed to store the operands/variables themselves.
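To make the link between operand size and vectorization concrete, here is a minimal sketch, assuming an x86 target with SSE and 128-bit XMM registers. The names `lanes_per_xmm` and `add_floats_sse` are hypothetical and chosen only for illustration; they do not come from the original post.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_add_ps, ... */

/* Number of elements of the given size (in bytes) that fit into
   one 128-bit (16-byte) XMM register. */
size_t lanes_per_xmm(size_t elem_size)
{
    assert(elem_size > 0 && 16 % elem_size == 0);
    return 16 / elem_size;
}

/* Add two float arrays, four elements per SSE instruction.
   Assumption: n is a multiple of 4, to keep the sketch short. */
void add_floats_sse(const float *a, const float *b, float *out, size_t n)
{
    assert(n % lanes_per_xmm(sizeof(float)) == 0);
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
}
```

With 4-byte `float` operands one register holds four values, while 8-byte `double` operands leave room for only two, which is why the operand type bounds the speed-up that vectorization can deliver.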

2x Shrink SSE algorithm

The uploaded presentation describes an SSE implementation of a 2x image shrink in which each pixel occupies 4 bytes: three color components (R, G and B) and a fourth component, the weight A.

The speed-up over serial code is 4.6x on the Merom platform and ~7x on the Penryn platform.

Please find attached:

  1. A PowerPoint presentation describing this algorithm.
  2. A ZIP file containing the C project implementation, included in a simple benchmarking application. The project is built for MS Visual Studio 2005.

The command line takes no arguments, only the application name.
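The attached implementation is not reproduced here. As a rough illustration of the idea only, the following sketch averages each 2x2 block of RGBA pixels with SSE2 byte averages. The function name `shrink2x_rgba_sse2` is hypothetical, and the sketch assumes tightly packed rows, a source width that is a multiple of 4 pixels, and an even source height. Because `_mm_avg_epu8` rounds up at each step, results can differ by one unit from an exact four-way average.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 integer intrinsics */

/* Shrink a tightly packed RGBA image (4 bytes per pixel) by 2x in both
   directions. dst must hold (src_w / 2) * (src_h / 2) pixels. */
void shrink2x_rgba_sse2(const uint8_t *src, int src_w, int src_h, uint8_t *dst)
{
    assert(src_w % 4 == 0 && src_h % 2 == 0);
    const size_t src_stride = (size_t)src_w * 4;
    const size_t dst_stride = (size_t)(src_w / 2) * 4;

    for (int y = 0; y < src_h; y += 2) {
        const uint8_t *row0 = src + (size_t)y * src_stride;
        const uint8_t *row1 = row0 + src_stride;
        uint8_t *out = dst + (size_t)(y / 2) * dst_stride;

        for (int x = 0; x < src_w; x += 4) {
            /* Load 4 pixels from each of the two source rows. */
            __m128i top = _mm_loadu_si128((const __m128i *)(row0 + (size_t)x * 4));
            __m128i bot = _mm_loadu_si128((const __m128i *)(row1 + (size_t)x * 4));

            /* Vertical average, byte by byte. */
            __m128i vert = _mm_avg_epu8(top, bot);

            /* Gather even pixels (0, 2) and odd pixels (1, 3) into the
               low 64 bits, then average them horizontally. */
            __m128i even = _mm_shuffle_epi32(vert, _MM_SHUFFLE(2, 0, 2, 0));
            __m128i odd  = _mm_shuffle_epi32(vert, _MM_SHUFFLE(3, 1, 3, 1));
            __m128i avg  = _mm_avg_epu8(even, odd);

            /* Store the 2 resulting pixels (8 bytes). */
            _mm_storel_epi64((__m128i *)(out + (size_t)(x / 2) * 4), avg);
        }
    }
}
```

Each loop iteration consumes eight source pixels and produces two destination pixels, treating the weight A like any other byte channel.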

  • SSE
  • Parallel computing
  • 16bit 3D Convolution: SSE4+OpenMP implementation on Penryn CPU

    The attached presentation describes an SSE3/SSE4 implementation of 3D convolution for 16-bit source data.

    The SSE speed-up over serial code is ~3x; OpenMP on a two-way Harpertown (Penryn) machine adds a further ~6x, so the overall SSE+OpenMP speed-up is ~18x.

    Please find attached:

  • SSE
  • Parallel computing
  • Sun + Intel + OpenSolaris + 2 Years = The Year of Core

    Today is the second anniversary of the Sun and Intel joint agreement to optimize the Solaris operating system for Intel Xeon processors. Like last year, when I wrote this summary of our work, I decided to recap where we are to date.

    Like last year’s edition, this is pretty much off the top of my head.

    x87 and SSE Floating Point Assists in IA-32: Flush-To-Zero (FTZ) and Denormals-Are-Zero (DAZ)

    Introduction

    This document details the difference between how assists are handled with x87 and Single Instruction Multiple Data (SIMD) instructions, and explains how to change their behavior when using Streaming SIMD Extensions (SSE) and SSE2.
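As a hedged illustration of changing this behavior from C, the sketch below toggles the FTZ and DAZ bits of the MXCSR register using the macros from `<xmmintrin.h>` and `<pmmintrin.h>`. The function names are hypothetical. The sketch assumes a compiler that performs `float` arithmetic with SSE instructions (the default on x86-64) and a CPU that supports the DAZ bit. MXCSR is per-thread state, so each thread would need to set it separately.

```c
#include <assert.h>
#include <float.h>
#include <xmmintrin.h>  /* _MM_SET_FLUSH_ZERO_MODE, _MM_GET_FLUSH_ZERO_MODE */
#include <pmmintrin.h>  /* _MM_SET_DENORMALS_ZERO_MODE */

/* Turn on flush-to-zero (denormal results become 0) and
   denormals-are-zero (denormal inputs are read as 0). */
void enable_ftz_daz(void)
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    assert(_MM_GET_FLUSH_ZERO_MODE() == _MM_FLUSH_ZERO_ON);
}

/* Restore the default IEEE behavior for denormals. */
void disable_ftz_daz(void)
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_OFF);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_OFF);
    assert(_MM_GET_FLUSH_ZERO_MODE() == _MM_FLUSH_ZERO_OFF);
}

/* Divide the smallest normal float by 3 at run time. The exact result
   is below the normal range, so it is a denormal unless FTZ is on.
   volatile keeps the compiler from folding the division at compile time. */
float third_of_min_normal(void)
{
    volatile float x = FLT_MIN;
    volatile float three = 3.0f;
    return x / three;
}
```

Under these assumptions, `third_of_min_normal()` returns exactly zero after `enable_ftz_daz()` and a small positive denormal after `disable_ftz_daz()`.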

  • Developers
  • Intel® Streaming SIMD Extensions
  • SSE2
  • SSE
  • Intel® Pentium® Processors
  • Precision and the Politeness of the Compiler

    In the search for the higher truth, one sometimes has to stumble and fully come to grips with the basics.

    Take, for example, the following code:

    #include <stdio.h>

    int main (void)
    {
      double a = 3.0, b = 7.0, c;

      c = a / b;

      if (c == a / b) {
        printf ("comparison succeeds\n");
      } else {
        printf ("unexpected result\n");
      }

      return 0;
    }

    It turns out that with gcc, for example, and probably with other compilers too, this program may well print "unexpected result".
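The excerpt does not say why this happens. A commonly cited explanation, offered here as an assumption, is that on 32-bit x86 the compiler may evaluate the second `a / b` in an 80-bit x87 register while `c` has already been rounded to a 64-bit `double`. The sketch below shows one way to make the comparison compare like with like: both quotients are forced through 64-bit memory via `volatile`. The function name `quotients_equal_after_store` is hypothetical.

```c
#include <assert.h>

/* Compute a / b twice and compare the results only after both have
   been stored as 64-bit doubles. volatile forces each quotient to be
   written to memory and read back, which rounds it to double precision
   even when the division itself ran in an extended-precision register. */
int quotients_equal_after_store(double a, double b)
{
    assert(b != 0.0);
    volatile double c = a / b;
    volatile double d = a / b;
    return c == d;
}
```

With GCC, options such as `-ffloat-store` or `-mfpmath=sse -msse2` are other ways to avoid extended-precision intermediates, at some cost in speed or portability.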
