SSE/SSE2 Instructions and the easiest way to port.

I have legacy code that uses threads and SSE instructions heavily. I wanted to port it to the MIC cores but could not, because Intel® Xeon Phi™ cores do not support Intel® Advanced Vector Extensions (Intel® AVX) or any of the Intel® Streaming SIMD Extensions (Intel® SSE).

What is the easiest way to port my vector code to MIC so that I can keep writing vector instructions by hand? The obvious solution is to use the native SIMD of the MIC cores, the VPU and its zmm registers: go over the SSE/SSE2 instructions one by one and convert them all, which is quite a headache.

Are there any other solutions or workarounds for porting SSE instructions to MIC cores?

Why did Intel drop SSE and AVX in the Phi coprocessors and introduce a new set of zmm registers instead of the old ones? That causes a lot of problems for code that relies on manual SIMD optimization (using SSE/AVX) and now needs to run on Xeon Phi.

Thanks in advance.



I'm in the same boat. You can see this thread with some suggestions. So far I haven't been able to get acceptable performance on Phi. My program still runs much faster on a regular Xeon processor.

For the host, icc automatically generates AVX-128 (not AVX-256) from SSE intrinsics. You would not get the benefit (if any) of wider intrinsics. Even with gcc it's possible to get by with SSE intrinsics by appending vzeroupper.

For MIC, you need the widening to 512-bit parallel instructions to make it worth your effort. 

I get linear speedup with my code on the host but I also want to utilize it on Phi as it seems promising.

So what you are telling me is to re-engineer my code so that it uses the 512-bit wide registers? I would not be using their full length though, just a 128-bit portion. I know that would not use the SIMD hardware to its full extent, but that is OK for me because my main strength is scalability. The code scales very well, so I just want the same 128-bit wide SSE vector instructions on Phi that I can use on the host, so that I don't lose performance unnecessarily.

I would also like to know: on a single MIC core, which runs 4 threads, are the zmm registers replicated for each thread, or do the 4 threads share the same set of zmm registers?

Well, ideally, portable vector code would be written with at least a means to parameterize it on vector width, so that with no more than a recompilation you could take advantage of the increased vector length, though sometimes such generalization leaves the code more obfuscated in its intentions and operations.
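
As a minimal sketch of what parameterizing on vector width could look like, the C++ below makes the lane count a template parameter, so the same kernel source can be instantiated for 4 float lanes (128-bit), 8 lanes (256-bit) or 16 lanes (512-bit). The names `Vec`, `axpy` and `axpy_array` are hypothetical and not taken from any Intel library. The lane loops are plain C++ that leave vectorization to the compiler; a real port would more likely put intrinsics behind the same interface.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical fixed-width vector type: W float lanes, aligned to its own size.
// W = 4 mimics an SSE register, W = 8 an AVX register, W = 16 a 512-bit zmm register.
template <std::size_t W>
struct alignas(W * sizeof(float)) Vec {
    float lane[W];
};

// r = a * x + y on one vector; the fixed trip count makes the loop easy to vectorize.
template <std::size_t W>
inline Vec<W> axpy(float a, const Vec<W>& x, const Vec<W>& y) {
    Vec<W> r;
    for (std::size_t i = 0; i < W; ++i) {
        r.lane[i] = a * x.lane[i] + y.lane[i];
    }
    return r;
}

// Kernel written once against the vector width W: y[i] = a * x[i] + y[i].
// Assumption for this sketch: n is a multiple of W (no remainder loop).
template <std::size_t W>
void axpy_array(float a, const float* x, float* y, std::size_t n) {
    assert(n % W == 0);
    for (std::size_t base = 0; base < n; base += W) {
        Vec<W> vx, vy;
        for (std::size_t i = 0; i < W; ++i) {   // load one vector's worth of data
            vx.lane[i] = x[base + i];
            vy.lane[i] = y[base + i];
        }
        const Vec<W> r = axpy(a, vx, vy);
        for (std::size_t i = 0; i < W; ++i) {   // store the result back
            y[base + i] = r.lane[i];
        }
    }
}
```

Switching widths is then a one-token change at the call site, for example `axpy_array<4>(a, x, y, n)` on an SSE host versus `axpy_array<16>(a, x, y, n)` for a 512-bit target.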

Here's the problem with only actually using a quarter of each vector. Your scalable code might easily expand to use all the HW threads available on the coprocessor, but each of those vector operations will be operating on whole cache lines (64 bytes or 512 bits, the width of an Intel Xeon Phi vector), incurring all the overheads and memory pressure that those cache manipulations entail while achieving only a quarter of the potential computational intensity available on the coprocessor. Likewise, in order to ensure vector alignment, those 128-bit vectors will need 512-bit alignment so that the right sets of 128 bits meet each other in the vector ALU, so most of the coprocessor data allocation will be padding. You may be able to scale up on the coprocessor if your algorithm provides sufficient parallel slack, but you may be wasting too much of the machine with the 128-bit vector limit to be able to make up in parallelism what you'd be losing in vectorization. It's a cost tradeoff whose thresholds really depend on the nature of your kernel. It should scale a lot better, though, if you can parameterize the vector width, and you may even improve host performance by taking advantage of the 256-bit wide vectors available there. Or that may be too much work, depending on how the 128-bitness is wired into your code.
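
To put a number on the padding, here is a small sketch of the arithmetic, assuming each 16-byte (128-bit) vector is stored at the start of its own 64-byte (512-bit) slot as described above. The function names `padded_bytes` and `padding_percent` are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Bytes occupied by `count` payloads of `payload_bytes` each when every
// payload must start on an `align_bytes` boundary.
inline std::size_t padded_bytes(std::size_t count, std::size_t payload_bytes,
                                std::size_t align_bytes) {
    assert(align_bytes > 0);
    // Round one payload up to a whole number of alignment units.
    const std::size_t slots = (payload_bytes + align_bytes - 1) / align_bytes;
    return count * slots * align_bytes;
}

// Whole-number percentage of the allocation that is padding rather than data.
inline std::size_t padding_percent(std::size_t payload_bytes, std::size_t align_bytes) {
    const std::size_t total = padded_bytes(1, payload_bytes, align_bytes);
    return 100 * (total - payload_bytes) / total;
}
```

Under that assumption `padding_percent(16, 64)` comes out at 75, so three quarters of the allocation carries no data, which matches the "quarter of each vector" figure in the paragraph above.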

p.s. each Intel Xeon Phi coprocessor core contains sufficient resources to maintain simultaneously the architectural registers for four independent threads.  This is true for the zmm registers--there's no swapping of register contents to accompany the context switches that the core handles, just a big register file.  


TimP (Intel) wrote:


For MIC, you need the widening to 512-bit parallel instructions to make it worth your effort. 

I would like to learn how to take advantage of the Phi's 512-bit SIMD to speed up code that already uses x8 vectorization via SSE2. The original vectorized code relies heavily on the SSE2 _mm_madd_epi16 instruction to vectorize over 8 operands, and the rendering engine is heavily multi-threaded with perfect scaling across cores. As long as no bits are "wasted", a 16-bit mantissa is entirely sufficient for rendering, so porting the vectorization to x8 doubles offers no performance benefit in this case. I had hoped that the 512-bit SIMD could be used effectively as 4x128-bit SIMD units (4 threads per core); however, the porting approach Intel suggests indicates that the whole 512-bit register is used to perform the same x8 vectorization. New development with x8 double vectorization may well benefit from less tedious optimization, but it is hardly going to offer a speed advantage over code already vectorized x8 via SSE2. I'm sure my case is not unique; "millennia" of man-hours have been invested in SSE optimization, so suggesting that years of exceptionally expensive and tedious development be re-engineered effectively excludes that code from being ported to Xeon Phi. In fact, it reminds me of another not-so-recent history ;o( I sincerely hope that customers' wishes may affect Intel's plans for SSE on Xeon Phi...
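
For readers unfamiliar with the instruction, the sketch below is a scalar C++ reference for the multiply-add pattern that `_mm_madd_epi16` performs: adjacent pairs of signed 16-bit products are summed into 32-bit results. The name `madd_i16` and its `Pairs` parameter are hypothetical. It is only meant to show what a wider port would have to reproduce, not how a Xeon Phi implementation would be written; whether a 512-bit instruction with this behaviour exists on a given target should be checked in that target's instruction set reference.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar reference for the pairwise multiply-add pattern:
//   out[i] = a[2i] * b[2i] + a[2i+1] * b[2i+1]
// Pairs = 4 matches one 128-bit SSE2 register (8 x int16 in, 4 x int32 out);
// Pairs = 16 would be the 512-bit-wide equivalent (32 x int16 in, 16 x int32 out).
template <std::size_t Pairs>
void madd_i16(const std::int16_t* a, const std::int16_t* b, std::int32_t* out) {
    assert(a != nullptr && b != nullptr && out != nullptr);
    for (std::size_t i = 0; i < Pairs; ++i) {
        // Each 16 x 16 product fits in 32 bits.
        const std::int32_t lo = std::int32_t(a[2 * i]) * std::int32_t(b[2 * i]);
        const std::int32_t hi = std::int32_t(a[2 * i + 1]) * std::int32_t(b[2 * i + 1]);
        // Add as unsigned so the one overflowing case (all four inputs -32768)
        // wraps around instead of being undefined behaviour.
        out[i] = static_cast<std::int32_t>(static_cast<std::uint32_t>(lo) +
                                           static_cast<std::uint32_t>(hi));
    }
}
```

Writing the reference with the pair count as a parameter makes it usable as a correctness check for both the existing 128-bit path (`madd_i16<4>`) and any wider variant.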
