SSE execution units in Core Duo

I have read in various places that all Intel processors before Core 2 Duo (including Core Duo) have 64-bit floating-point execution units (I am not talking about the x87 FPU). Because of this, SSE instructions with 128-bit operands are split in two, with 64 bits handled at a time.

Regarding this, I have the following questions:

a. Is this true?

b. Assuming it is true, doesn't that mean there is no speed advantage for instructions like addpd over addsd, since the addpd instruction is split in two anyway?

Regards
Gautam

If you amend your statement to "all mobile CPUs before Core 2," I believe it's true. This meant only an increase in overall latency, typically 1 clock cycle, as the 2nd half operand follows the first half through the pipeline, for Intel CPUs.
For non-Intel brands of CPUs, it was technically possible for a pair of scalar double operations to run as fast as a parallel double operation, provided that bottlenecks in instruction decode, write combine buffering, etc, were avoided. In my experience, there was always at least a 5% advantage in vectorized code for large applications with 64-bit operands; a 20% advantage with 32-bit operands. Granted, you might loosely characterize a 5% gain as no gain, when a competing Intel CPU showed a 30% gain for vectorization in the same situation.
On the other side of the coin, the CPUs which were designed not to depend on vectorization did show an advantage when running vectorizable code which was not vectorized.
Given the wide availability of vectorizing compilers and CPUs which execute parallel instructions efficiently, I don't see that historical facts should influence your future development plans, unless you intend to use Microsoft C (the only remaining major non-vectorizing compiler) to develop for CPUs of the past.

Thanks Tim. I just wanted some clarifications:

a. Is it also true for all Intel non-mobile CPUs before Core 2 Duo? In other words, is Core 2 Duo the only Intel processor with a 128-bit-wide SSE execution unit?

b. When you say that you obtained a gain of 5% with 64-bit operands, are you talking about non-Intel CPUs or Intel CPUs with a 64-bit-wide SSE execution unit? Same question for the 30% gain you mention.

Unfortunately, I was hoping to roughly halve the run time of some of my numerical functions, so this is very pertinent to me.

Regards
Gautam

According to my information, the Pentium-M and its successors did split 128-bit operands, somewhat analogous to the way AMD CPUs did at the time. I'm not certain about the Core Solo and Duo. As I tried to explain, there is no way this splitting should discourage you from using a vectorizing compiler; you will still get big speedups with vectorization on past Intel mobile CPUs. If numerical performance is so important, of course you should consider a Core 2 Duo with sufficient RAM to run a 64-bit OS.
I was contrasting the 5% overall gain for vectorizing large double precision applications on a non-Intel CPU with a 30% gain on an Intel CPU with the same application. Such applications are not really feasible to run on Core Solo or Turion. I mentioned it only as an example of the CPUs which were designed to run nearly as well without vectorization as with it.

I am still a bit confused. Why would an addpd run faster than two addsd's on a processor which performs this splitting? Aren't we basically performing the same number of Uops in both the cases?

Any suggestions?

I don't know where you're going with this. Certainly, avoiding addpd and the like has been used as a strategy to make code run slower on platforms other than Opteron. In order to minimize the gain going from Pentium-m to Core 2 Duo, in view of the SSE instruction coding bottleneck of Pentium-m, you would have to use fadd instructions, in many cases, to get performance approaching addpd, and you would have to assure that there is no gain in better utilization of read/write combining buffers for addpd.
In any case, discussing how to minimize the gain of current CPUs over those no longer in production seems tangential to the purpose for which this forum branch was started.

I am writing code that will run on Pentium-M, Core Duo, Pentium 4, etc. I want to know whether using packed double-precision SSE2 instructions will give any speed benefit on these processors. If yes, then why? Packed double-precision SSE2 instructions are split into two or more micro-instructions on these processors, so they should not be faster than their scalar counterparts, which are not split.

If you use ADDSD instead of ADDPD you will process (at least) two times less data per loop iteration. Since loops are usually unrolled to consume one cache line of data per iteration that means you will end up with longer code and you will have to use more registers.

I am sure ADDPD will work at least a bit faster than ADDSD on any CPU that supports it — if you believe otherwise, then by all means benchmark your own code to see if vectorization is beneficial for your particular case.

--
Regards,
Igor Levicki

> somewhat analogous to the way AMD CPUs did at the time

This may explain why I was seeing less than a 2x speedup, rather than the expected 4x, when using SSE on my previous AMD 3500.
I always assumed it was an AMD thing, possibly related to the 3DNow legacy (which would have made sense).

Now that you mention this, wasn't the real benefit of SSE at that time simply that you were -not- using the FPU, which was slower (especially on the P4)?
I mean, at that time I was wondering why there were so many scalar SSE1 operations, since they were seriously inferior (32-bit vs 80-bit) to the FPU versions. I just didn't see the point, other than them being there because the old FPU started to suck on Intel CPUs (but was still fast on AMD's).

And then SSE3 added.. a new FPU instruction. So I don't think I will ever understand this schizophrenic situation:
-does Intel want us to use SSE or the FPU?
-why scalar SSE operations?
-why so many instructions that do the same thing, but differently?
Was there a plan/story behind all this? And changes in those plans because of failures?

I don't think you'll be able to get a complete explanation of the history of AMD or Intel strategies here, even if you state your question more precisely (and don't assume everyone knows exactly what each AMD model name stands for).

I believe that past AMD models were designed so that, for example, 2 ADDSD could be executed in the same or less time than a single ADDPD (with a number of qualifications). I suspect that AMD intended not to rely as much as Intel on vectorizing compilers.

The Pentium-M was designed likewise so that parallel SSE operations were performed in 2 parts. Also, it was designed so that x87 instructions could be issued at a higher rate than SSE instructions. I doubt that "Intel wants" people to continue writing code for Pentium-M, which came near the end of the line of CPUs without 64-bit mode option.

In fact, the option to generate code skewed toward Pentium-M is deprecated to the extent that it isn't mentioned in the basic documentation of the latest Intel compilers, and compilers issued in the last year have warned against its use. Of course, there never was an equivalent option for 64-bit mode.

If you are so interested in history, you could read up on how Prof. Kahan persuaded vendors to standardize and produce 80-bit format floating point, the ensuing controversies, and how performance came to be preferred over extra precision. Not to mention 64-bit mode.

Hi All,

Probably a late answer, but I decided to clarify this anyway, as I used to get similar questions regularly. Let's leave aside the legacy x87 80-bit floating-point (FP) instructions, as they have other differences, and for clarity speak only about SIMD FP operations. I'll refer to 64-bit FP (a.k.a. double precision, or DP) operations, included in SSE2, and to 32-bit (single precision, or SP) operations, available since SSE in the Pentium III.

Pentium III (SSE supports SP only) and Pentium M based CPUs (including Core Duo, not to be confused with Core2): these have 64-bit FP MUL and ADD execution units (EUs) located on two different dispatch ports. 128-bit SSE operations are split into two 64-bit parts in the front-end stages, before they go to the OOO engine for scheduling and dispatch. In addition, DP MUL executes at half throughput. Thus the peak throughput of FP operations you can get is 0.5 64-bit MUL + 1 64-bit ADD (4 SP or 1.5 DP FLOPS) per cycle, with either packed or scalar operations for DP; for SP, vectorization is required.

P4: this is a bit more complex. It has 128-bit FP MUL and FP ADD units, each of which can accept either a packed or a scalar operation every other cycle. Both the ADD and MUL EUs are located on the same port, which can dispatch just one packed or scalar operation (either ADD or MUL) per cycle. So peak FP throughput is 1 64-bit FP MUL + 1 64-bit FP ADD per cycle (4 SP or 2 DP FLOPS), but to achieve it _packed_ 128-bit instructions must be used. If the code is not vectorized, just one scalar FP operation (either ADD or MUL) can be dispatched per clock on P4. This may explain the relative weakness of P4 on non-vectorized FP code; on the other hand, for optimized vectorized code P4 offered very competitive FP performance.

Core2 (all current products, and Nehalem) doubled peak FP performance by adding 128-bit FP ADD and FP MUL EUs (on different ports, each with 1-cycle throughput), giving a peak FP throughput for vectorized code of 2 64-bit MUL and 2 64-bit ADD per cycle (8 SP or 4 DP FLOPS).

Upcoming products based on the Sandy Bridge microarchitecture will once again double peak FP throughput by introducing the 256-bit AVX instruction set, supported by the microarchitecture's ability to start 1 256-bit FP MUL and 1 256-bit FP ADD operation per cycle (16 SP or 8 DP FLOPS).
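
The per-cycle peaks quoted in this post can be reproduced with simple arithmetic: lanes per unit times the number of ADD and MUL operations started per cycle. The following is only a rough sketch under a simplified model; the function name `peak_flops` and its parameters are made up for illustration, and the widths and issue rates are just the ones stated above.

```c
#include <assert.h>

/* Simplified model (an assumption, not a description of real hardware):
   one FP ADD unit and one FP MUL unit, both `unit_bits` wide, starting
   `add_rate` and `mul_rate` operations per cycle respectively.
   Returns peak FLOPS per cycle for elements of `elem_bits` (32 = SP, 64 = DP). */
double peak_flops(int unit_bits, int elem_bits, double add_rate, double mul_rate)
{
    assert(elem_bits == 32 || elem_bits == 64);
    assert(unit_bits % elem_bits == 0);
    int lanes = unit_bits / elem_bits;   /* elements processed per operation */
    return lanes * (add_rate + mul_rate);
}
```

With the figures from this post: Pentium M gives `peak_flops(64, 32, 1.0, 1.0)` = 4 SP and `peak_flops(64, 64, 1.0, 0.5)` = 1.5 DP (half-rate DP MUL); P4 gives `peak_flops(128, 64, 0.5, 0.5)` = 2 DP (each unit accepts an operation every other cycle); Core2 gives `peak_flops(128, 64, 1.0, 1.0)` = 4 DP; Sandy Bridge gives `peak_flops(256, 64, 1.0, 1.0)` = 8 DP.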

Hope this helps,
-Max

Max gave an excellent summary; I have not seen this explained in one place.

Let me try again to follow up, since my original reply was deleted during restore.

Although the peak floating point arithmetic issue rate (per cycle) doubled from P4 to Core 2, the peak rate for use of data in cache did not increase, while the clock rate (per second) decreased. This was offset partly by improvements in buffering. For example, the number of write combine buffers went from 6 in P4, to 8 in Prescott, to 10 in Core 2. No indication has come of a further increase in write combine buffers per core for AVX, in spite of the restoration of HyperThreading.

AVX did not introduce wider move instructions. 256-bit registers are packed and unpacked by 128-bit moves. 2 128-bit loads are possible per cycle, but only 1 128-bit store. Add and multiply instructions support 1 possible 256-bit aligned memory operand. The increased rate of loads (and improved memory system performance) would help SSE2 code as well, returning to the balance between floating point and data load rate of P4. So, in practice, AVX instructions would not double performance.

gcc for avx, so far, doesn't pack 2 128-bit operands per register, so the potential increase in peak floating point rate doesn't apply to gcc.

Hi, Tim, thanks for reply.

Let me correct you though:

> AVX did not introduce wider move instructions ...
Please check AVX Programming Reference at http://software.intel.com/sites/avx/ - AVX indeed introduces 256-bit load/store instructions.

VMOVUPS/VMOVUPD/VMOVDQU should be used for 256-bit loads and stores in AVX (rather than their aligned counterparts such as VMOVAPS, which still exist if you need alignment exceptions).

AVX improves the programming paradigm: LOAD+OP operations (e.g. VADDPS ymm0, ymm1, [rsi + rax*8]), for both 128- and 256-bit instructions, no longer raise alignment exceptions; this is the new standard behavior. And taking into account that on Nehalem MOVUPS on data that is actually aligned has the same performance as MOVAPS (check Ronak Singhal's IDF slides https://intel.wingateweb.com/SHchina/published/NGMS001/SP08_NGMS001_100r..., slide 25: no reason to use aligned instructions on Nehalem!), this is going to continue further. So instructions that produce alignment exceptions ([V]MOVAPS etc.) should not be used, or generated by compilers, in AVX (at least to maintain consistency with LOAD+OP operations), which makes developers' lives much easier. Please continue to make an effort to align data, though, as cache line splits will still lower load throughput.

> Although the peak floating point arithmetic issue rate (per cycle) doubled from P4 to Core 2, the peak rate for use of data in cache did not increase ...
Yes, Core2 did not increase peak L1 read throughput (16 bytes/clock) vs. P4; however, as you mentioned, the Core2 u-arch actually allows a much wider range of code to achieve this high throughput than P4 did (I'd even say P4 hardly needed that much L1 throughput for the majority of FP code). Plenty of FP algorithms (including the most optimized ones) perform more calculations than the amount of data they read, so the doubled peak FP rate is perfectly achievable with 128-bit/clock L1 read throughput on Core2 (LINPACK is probably the most widely recognized example).

And as was said at past IDFs, Sandy Bridge will also double L1 load throughput to balance the doubled peak FP throughput.
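
The balance between FP peak and L1 load rate discussed here can be put into numbers with a roofline-style bound. This is only a sketch under a simplified model in which either compute or L1 loads is the sole bottleneck; the function name `attainable_flops` is made up, the 16 bytes/clock and FLOPS/cycle figures are the ones quoted in this thread, and the kernel intensities are illustrative assumptions.

```c
#include <assert.h>

/* Upper bound on FLOPS/cycle for a kernel performing `flops_per_byte`
   floating-point operations per byte loaded from L1, on a core with the
   given peak FP rate and L1 load bandwidth. Whichever limit is lower wins. */
double attainable_flops(double peak_flops_per_cycle,
                        double l1_bytes_per_cycle,
                        double flops_per_byte)
{
    assert(peak_flops_per_cycle > 0.0);
    assert(l1_bytes_per_cycle > 0.0);
    assert(flops_per_byte >= 0.0);
    double load_bound = l1_bytes_per_cycle * flops_per_byte;
    return load_bound < peak_flops_per_cycle ? load_bound : peak_flops_per_cycle;
}
```

For example, a DP vector add `c[i] = a[i] + b[i]` loads 16 bytes per flop (1/16 flop per byte), so at 16 bytes/clock it is capped at 1 FLOP/cycle on both P4 (peak 2 DP) and Core2 (peak 4 DP), which is Tim's point. A kernel doing 0.5 flops per loaded byte is compute-bound on both, so it can benefit from the doubled peak, which is Max's point.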

I also expect GCC to fully support 256-bit AVX FP vector width (for loads/stores and compute/shuffle operations) maybe soon or closer to Sandy Bridge appearance on the market.

Please let me know if I have still failed to clarify something,
Thank you,
-Max

Intel compilers continue to produce 2 versions of vectorized code so as to use more aligned loads, and gcc continues to use scalar loads to avoid unaligned loads, even though these tactics aren't optimum on Nehalem or Barcelona.

The greatly improved performance of unaligned loads also makes viable the consideration of loop reversal to allow effective vectorization with source and destination overlap. Both Intel and gnu compilers still avoid reversed loop vectorization with parallel loads and stores.
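
As an illustration of the overlap case (this is one reading of the remark above, not code from the post): updating `a[i]` from the old `a[i-1]` has to run from the top of the array downward, and a two-at-a-time version must load both sources before storing either result. The function names are made up; plain C stands in for what would be packed unaligned loads and a packed store.

```c
#include <assert.h>
#include <stddef.h>

/* Reference: a[i] = old a[i-1] + b[i] for i = n-1 down to 1.
   Walking downward guarantees a[i-1] has not been overwritten yet. */
void shift_add_reversed(double *a, const double *b, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL));
    for (size_t i = n; i-- > 1; )
        a[i] = a[i - 1] + b[i];
}

/* Same result, two elements per step, still walking downward.
   Both sums are computed before either store, mirroring a packed
   load / packed add / packed store sequence. */
void shift_add_reversed_by2(double *a, const double *b, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL));
    size_t i = n;
    while (i >= 3) {
        double s0 = a[i - 3] + b[i - 2];  /* new a[i-2] */
        double s1 = a[i - 2] + b[i - 1];  /* new a[i-1], uses old a[i-2] */
        a[i - 2] = s0;
        a[i - 1] = s1;
        i -= 2;
    }
    while (i >= 2) {                      /* leftover element, if any */
        a[i - 1] = a[i - 2] + b[i - 1];
        i -= 1;
    }
}
```

The source of each pair sits one element below its destination, so at most one of the two accesses can be 16-byte aligned, which is why cheap unaligned loads matter for this pattern.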

Quoting - Igor Levicki

If you use ADDSD instead of ADDPD you will process (at least) two times less data per loop iteration. Since loops are usually unrolled to consume one cache line of data per iteration that means you will end up with longer code and you will have to use more registers.

I am sure ADDPD will work at least a bit faster than ADDSD on any CPU that supports it - if you believe otherwise, then by all means benchmark your own code to see if vectorization is beneficial for your particular case.

Hi Igor.

Sorry for coming to this thread late.

I understand ADDSD & ADDPD for SSE2 as below -

-----

__m128d _mm_add_sd(__m128d a, __m128d b)
Adds the lower DP FP (double-precision, floating-point) values of a and b ; the
upper DP FP value is passed through from a.
r0 := a0 + b0
r1 := a1

&
__m128d _mm_add_pd(__m128d a, __m128d b)
Adds the two DP FP values of a and b.
r0 := a0 + b0
r1 := a1 + b1
----

Query:

(a) You commented "If you use ADDSD instead of ADDPD you will process (at least) two times less data per loop iteration." Could you elaborate w.r.t. the above definitions of ADDSD & ADDPD? I am simply asking in order to understand.

(b) You commented "I am sure ADDPD will work at least a bit faster than ADDSD on any CPU that supports it". Could you elaborate more w.r.t. the above definitions of ADDSD & ADDPD?

Thanks & BR.
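
To make question (a) concrete: with the definitions above, one ADDSD produces one sum and one ADDPD produces two, so a loop over n elements needs n iterations in scalar form and about n/2 in packed form. The sketch below is only an illustration, not code from anyone in this thread; the function names `add_scalar` and `add_packed` are made up, and the intrinsics path is used only when the compiler defines `__SSE2__` (otherwise plain C is used).

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* c[i] = a[i] + b[i], one double per iteration (ADDSD). */
void add_scalar(const double *a, const double *b, double *c, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL && c != NULL));
    for (size_t i = 0; i < n; ++i) {
#if defined(__SSE2__)
        __m128d va = _mm_load_sd(a + i);          /* low lane = a[i] */
        __m128d vb = _mm_load_sd(b + i);          /* low lane = b[i] */
        _mm_store_sd(c + i, _mm_add_sd(va, vb));  /* r0 = a0 + b0 */
#else
        c[i] = a[i] + b[i];
#endif
    }
}

/* Same result, two doubles per iteration (ADDPD), scalar tail for odd n. */
void add_packed(const double *a, const double *b, double *c, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL && c != NULL));
    size_t i = 0;
#if defined(__SSE2__)
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);          /* a[i], a[i+1] */
        __m128d vb = _mm_loadu_pd(b + i);          /* b[i], b[i+1] */
        _mm_storeu_pd(c + i, _mm_add_pd(va, vb));  /* r0 = a0+b0, r1 = a1+b1 */
    }
#endif
    for (; i < n; ++i)
        c[i] = a[i] + b[i];
}
```

On a core that splits ADDPD into two 64-bit halves, both loops do the same amount of FP work, but the packed loop has half as many instructions to fetch and decode and half the loop overhead, which is consistent with the "at least a bit faster" expectation in (b). As Igor says, benchmarking on the target CPU is the only way to be sure.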

Quoting - Max Locktyukhin

Hi Max.

The information above is very valuable regarding EUs & instruction flow.

Could you say something about SSE2 behaviour on the Clovertown 5300 series processors?

Do you have any idea where one can get information on EUs & instruction flow for SSE2 on Clovertown?

~BR

Sorry, I missed your question here. Clovertown is essentially the same as the Core2 described above, and Nehalem has the same peak FLOPS throughput: (4 ADD + 4 MUL)/cycle for SP, and (2 ADD + 2 MUL)/cycle for DP, for vectorized SSE/SSE2 code.

-Max

Hi Max!

Could you give some information about Intel Atom peak FP operations throughput (for single and double precision)?

Quoting - Max Locktyukhin (Intel)

>>> 0.5 64-bit MUL + 1 64-bit ADD (4 SP or 1.5 DP FLOPS) per cycle

Hmmm... How do you get 1.5 DP FLOPS per cycle on PIII?

0.5*2+1*2=3. How do you get 4 SP?

Why does this link: http://www.intel.com/support/processors/sb/CS-020868.htm#2
list PIII-1GHz performance as 2 GFLOPS? That's unrealistic. Real peak performance is 1 DP GFLOPS! How much in SP? 4 or 3 GFLOPS?

Hi,

> Could you give some information about Intel Atom peak FP operations throughput (for single and double precision)?

For the complete picture of operation throughput, please check the Optimization Reference Manual at http://www.intel.com/products/processor/manuals/ and look at pages 12-19 to 12-26 for Atom.

To summarize SIMD/SSE FP performance: the current Atom can do 1 128-bit or 64-bit SP ADD (ADDSS/ADDPS) on port1 and 1 128-bit or 64-bit SP MUL (MULSS/MULPS) on port0, which gives a quite high 8 SP FLOPS/cycle at _peak_, though practically achievable performance will be lower. Scalar DP performance is 1 DP ADD + 1 DP MUL per cycle, i.e. 2 DP FLOPS/cycle, and packed DP is quite slow, at about 1/5 of the throughput of SP or scalar DP.

>> 0.5 64-bit MUL + 1 64-bit ADD (4 SP or 1.5 DP FLOPS) per cycle
> Hmmm... How do you get 1.5 DP FLOPS per cycle on PIII?
I noted at the beginning that PIII only supports SP SIMD (SSE), so the DP performance was cited for Pentium M derivatives only.

> 0.5*2+1*2=3. How do you get 4 SP?
Half throughput for MULs applies only to DP, not to SP.

> Why does this link: http://www.intel.com/support/processors/sb/CS-020868.htm#2 list PIII-1GHz performance as 2 GFLOPS? That's unrealistic. Real peak performance is 1 DP GFLOPS! How much in SP? 4 or 3 GFLOPS?

cannot comment re that old discontinued processor page ...

-Max

Hi, Max!

>> please check Optimization Reference Manual at pages 12-19 - 12-26 for Atom

On pages 12-10 to 12-11 it says:

FP Multiplier --- Throughput
Scalar double (mulsd) --- 2
Packed single (mulps) --- 2
Packed double (mulpd) --- 9

On pages 12-19 to 12-26:

FP Multiplier --- Throughput
Scalar double (mulsd) --- 1
Packed single (mulps) --- 1
Packed double (mulpd) --- 8

Which is correct?

Also, the "Ports" column for the addpd/mulpd instructions says "Both".
Does this mean that these instructions cannot run simultaneously, and that peak performance for packed DP in this case = 2*(1/(5+8)) = 0.15 flops/cycle (thirteen times slower than scalar DP)?

Why is there no "Edit" button in this forum? The correct formula in the previous post: "...peak performance for packed DP in this case = 2*(2/(5+8)) = 0.31 flop/cycle?"

And how does one correctly calculate DP performance for PIII?
From the manual: FADD throughput=1, FMUL throughput=2 (FADD and FMUL both on Port0)
=1*(2/(1+2))=0.667 flop/cycle? Is that right?
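
For reference, the two formulas in this post share one form: lanes times 2 operations, divided by the cycles an ADD plus a MUL occupy. The sketch below only reproduces that arithmetic under the poster's own assumption that ADD and MUL cannot overlap; whether "Both" in the manual really implies this is not answered in the thread. The function name is made up.

```c
#include <assert.h>

/* FLOPS/cycle when one ADD and one MUL alternate with no overlap.
   `lanes` is the number of elements per instruction; `add_cycles` and
   `mul_cycles` are the throughput figures (cycles per instruction). */
double serialized_flops_per_cycle(int lanes, double add_cycles, double mul_cycles)
{
    assert(lanes > 0 && add_cycles > 0.0 && mul_cycles > 0.0);
    return lanes * 2.0 / (add_cycles + mul_cycles);
}
```

`serialized_flops_per_cycle(2, 5.0, 8.0)` gives 4/13, about 0.31, for Atom packed DP, and `serialized_flops_per_cycle(1, 1.0, 2.0)` gives 2/3, about 0.667, for the x87 FADD/FMUL case.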
