How many MMX/SSE units in Core-2 Quad

I have a powerful HP computer with a Q9550 (Core 2 Quad) CPU. It seems that there is only one MMX/SSE unit shared among all four cores.

The reason I think so is the following: I am running a simple program that uses SSE2.

  • Running 1 thread achieves 300MB/s.
  • Running 2 threads achieves 150MB/s per thread.
  • Running 4 threads achieves 75MB/s per thread.

My laptop with a T7250 (Core 2 Duo) CPU exhibits similar behavior.

Is it true that Core-2 CPUs contain only one MMX/SSE unit?

Thanks!


I don't think all the cores could share a single execution unit :) Quads are actually two dual cores. It is more likely that you are limited by system memory or the cache. You could upload your code if it is simple enough.

yes, it's probably the shared cache.

But I wonder which is worse: a shared cache that brings performance down, or a per-core cache that would still bring performance down because the OS scheduler places your threads on random cores, so they may not always run on the same core and find their cached data there.

I suspect my quad is two duals sharing two caches or something, because I get pretty weird results when I set thread affinity to specific cores (a lot better when it's on cores 1 & 2 or 3 & 4 than on cores 1 & 3 or 2 & 3).

I saw that Vista supports NUMA; maybe that is a solution?

Yes, a Core 2 Quad has two cores on each L2 cache. I have observed a 30% loss in performance when not setting affinity correctly for an MPI-funnelled application on a Core 2 Quad. And yes, each core has its own register set, in case that was the subject of your first post.
In a normal OpenMP application, it's important to get the physical ordering right, so that pairs of threads operating on contiguous data are on the same cache. At least, if you write a benchmark (intentionally or not) to measure TLB stalls, or false sharing, you must account for the mapping you choose, or the variability, should you leave the scheduling to Windows.
Availability of affinity tools is one of the advantages in using the Intel OpenMP library, which is available for VC9 as well as Intel compilers.
Improved scheduling hasn't been accepted for Vista, and is still in the proposal stage for Windows 7.

Thank you for the offer to examine the code. While creating a simple test I have realized that my performance was bound to memory accesses. So this solves the problem. Thanks again!

Availability of affinity tools is one of the advantages in using the Intel OpenMP library, which is available for VC9 as well as Intel compilers.

But you still have to know the details of the system's caches in the first place, no? I mean, is there a way to find out, other than by checking the CPUID? Otherwise you can only optimize for the CPUs you know, and it won't be future-proof.

Or is there really a way to detect shared caches through specific flags?

The Intel environment variables employ several strategies, including cpuid, to determine the cache topology, and provide diagnostic options to show you what decision was made, assuming correctness of the BIOS.
You have also the option of specifying a mapping of OpenMP threads to logical processors; as you say, this requires you to determine those details yourself, and they may change even with BIOS changes on the same platform.

It's true that those details will change with BIOS changes on the same platform.

Most likely your program is poorly optimized and the cores are competing for memory bandwidth. You should consider different data layout or different algorithm.

--
Regards,
Igor Levicki

If you find my post helpful, please rate it and/or select it as a best answer where it applies. Thank you.

Hello,

I just wanted to confirm the conclusion of this topic. Does each core on a quad-core processor have an SSE unit (four SSE units in total)? If so, performance should go up by a factor of four compared to sequential code, don't you think?

Also, if you could post some references on how to spawn threads on each core and control them, that would be awesome!

I'm a newbie; sorry if I have made any ill-considered assumptions.

Thanks a lot in advance!

Shiv

Each core has one SSE unit.

In the original post:

1 thread: 300 MB/s per thread
2 threads: 150 MB/s per thread
4 threads: 75 MB/s per thread

This indicates the test program is memory bound.

>> If so, the performance is supposed to go up by 4 times compared to sequential code, don't you think ?

Only when the four execution units can be kept fed (busy). As in the original post, an application might not be able to keep the execution units busy due to memory and/or cache latency issues. As Igor indicated, improved algorithms can reduce the number of memory and LLC accesses and thus improve overall performance. One usually observes linear scaling only when none of the memory subsystems (RAM, L3, L2, L1) reaches saturation as the number of threads increases (assuming no oversubscription and no preemption). In some cases, well-written multi-threaded algorithms can observe super-linear scaling. This occurs when the additional threads can take advantage of RAM/L3(/L2) fetches performed by a different thread.

Jim Dempsey

www.quickthreadprogramming.com

Hi Shiv,

Just to add my suggestion in terms of load balancing, where the worker threads are automatically created and assigned to cores: you could have a look at Cilk Plus and its array notation, which improve vectorization and do automatic load balancing.

https://www.cilkplus.org/

Regards,

Sukruth H V

As Jim said, each core has one vector SSE unit, probably composed of a floating-point adder and multiplier. I think the same unit also contains an integer adder and multiplier, which are hardwired to different execution ports. On newer microarchitectures, branch logic was added to the execution stack, but it is probably not tightly coupled to the arithmetic units.

I could be wrong in my assumption that the integer part of the SSE unit is used to calculate memory addresses.

The most detailed microarchitecture/implementation descriptions that I have seen for a broad range of processors are at Agner.org, in the documents called "microarchitecture.pdf" and "instruction_tables.pdf".   Based on a combination of vendor documentation and very careful microbenchmarking, the former describes the microarchitecture, while the latter shows how each instruction maps to the various execution pipelines.  I think that you have to look at the code at this level of detail to compute the minimum execution time of a piece of code accurately.  Of course even these tables only document the most common case(s) -- any implementation is going to have "corner cases" for which extra stalls occur in instructions that access memory (and sometimes in other classes of instructions as well).

For the Q9550 (Core 2 Quad) processor, Wikipedia says that this is made of two "Wolfdale" parts in one package.  Agner Fog's "microarchitecture.pdf" includes this in the chapter titled "Core 2 and Nehalem pipeline", while the "instruction_tables.pdf" file includes a chapter on Wolfdale that seems pretty complete.

In this particular case, the microarchitecture notes state that the Core 2 has separate functional units for integer multiplication and floating-point multiplication, with integer multiplication instructions issued on port 1 and floating-point multiplication instructions issued on port 0.  Memory reads (both scalar and aligned SSE) are issued to port 2, which is used only by memory read instructions, so there are no conflicts between these reads and any arithmetic instructions.  On the other hand, unaligned SSE read instructions issue to port 0, port 5, and twice to port 2, so they are capable of interfering with the many other instructions that need to issue on ports 0 and 5.

But certainly the short answer is that on the Q9550 processor none of the execution units are shared across cores, so the theoretical peak performance is linear in the number of cores used. Each of the dual-core chips inside the package shares its 6 MiB L2 across the two cores, so contention can begin when both cores on the same chip execute memory accesses that miss their private L1 caches. The two dual-core chips share a single front-side bus, so contention between cores on different chips can arise when two or more cores miss in the L2 cache.

John D. McCalpin, PhD
"Dr. Bandwidth"

So it seems that the Q9550 has different execution units for floating-point arithmetic and for integer arithmetic, each wired to a different port. I wonder whether legacy x87 floating-point arithmetic is executed by a different unit?

From the description provided by John, it seems that memory address calculations are performed by a separate integer unit, and thus do not conflict with the integer ALU.

Quote:

iliyapolak wrote:

So it seems that the Q9550 has different execution units for floating-point arithmetic and for integer arithmetic, each wired to a different port. I wonder whether legacy x87 floating-point arithmetic is executed by a different unit?

By looking at the description provided by John, it seems that memory address calculations are performed by a different integer unit, thus not conflicting with the integer unit.

At least for x87 I think I can make a statement: I assume it is a completely different unit. x87 works internally (also on new hardware) with an 80-bit data representation, not single or double precision, so I assume the hardware cannot be reused for legacy x87.

Quote:

Christian M. wrote:

Quote:

iliyapolak wrote:

So it seems that the Q9550 has different execution units for floating-point arithmetic and for integer arithmetic, each wired to a different port. I wonder whether legacy x87 floating-point arithmetic is executed by a different unit?

At least for x87 I think I can make a statement: I assume it is a completely different unit. x87 works internally (also on new hardware) with an 80-bit data representation, not single or double precision, so I assume the hardware cannot be reused for legacy x87.

I can only suppose that the FP execution stack of the Haswell CPU contains the legacy x87 circuitry, although this is not stated in the link pasted below.

http://www.realworldtech.com/haswell-cpu/4/

I think we won't get a clear statement on this. Isn't x87 marked as 'outdated'? So new reviews will not concentrate on it.

BTW, the posted link is great; you get quite good information.

>>>I think we won't get a clear statement on this. Isn't x87 marked as 'outdated'? So new reviews will not concentrate on it>>>

It is outdated, but it must be kept for compatibility and for scalar FP calculation with higher precision.

I suppose that x87 forms part of the FP execution stack which deals with scalar values.

Agner Fog has been keeping his tables up to date:

http://www.agner.org/optimize/instruction_tables.pdf

so maybe you could find some comparisons of the early core 2 quad vs. current CPU generations.

I'm still mystified as to what you were driving at; were you expecting that x87 instructions could execute in parallel with SSE instructions without requiring the same resources? I think we have a reasonable guarantee that there is no sharing of program-accessible registers, but it seems clear they do share micro-op execution pipelines. As to register sharing between instruction modes, that was tried when MMX was introduced and abandoned when compatible CPUs came on the market with independent register sets. My personal, barely educated guess is that most independent handling of x87 instructions would reside in microcode ROM rather than in dedicated circuitry.

According to my experience, compilers have given up attempting to cope with register pressure in 32-bit mode by using both x87 and simd registers.  Communication between the register sets is impossibly slow, and the design of Windows 64-bit ABI seemed to exclude attempts to do that in X64. Shortage of integer registers is an even worse bottleneck in 32-bit mode.  Intel compilers have dropped the support for combined x87 and SSE mode which was required for P-III and now support P-III and Athlon32 only in x87 mode; that stuff was already obsolescent when the core2 came out.  But I'm going out on a limb in guessing you might have something like this in mind.

It seems that there is no freely available, clearly stated information about the high-level implementation of the x87 unit. AFAIK the scheduler logic is wired to the execution ports, and probably one of those ports (port 1?) is responsible for x87 uops. In the register file they probably use different physical registers to hold temporary results and constants.

Just guessing.

Hello....

I have one basic kind of question...

Are XMM registers, say XMM0-XMM7, shared across cores, or does each core have its own bank of 8 registers on the Intel Sandy Bridge architecture?

Thanks,

Chaitali

 

Each logical core has its own set of architectural registers, including vector registers. It cannot be otherwise since concurrent threads would garble each other's state.

 

Quote:

Chaitali C. wrote:
Does XMM registers say XMM0-XMM7 are per core

Per thread (per logical processor in hardware; saved with the thread context's XSTATE in software). BTW, this is XMM0-XMM15 in 64-bit mode.

Quote:

Chaitali C. wrote:
each core has its own bank of 8 registers for Intel Sandy Bridge Architecture ?

Each Sandy Bridge core has 144 registers able to act as x87/MMX or XMM/YMM registers (more register-file entries than the architected state alone requires, because of temporary rename registers) for the two hardware thread contexts. A good source is here: http://www.realworldtech.com/sandy-bridge/5/

As the other posters said, each CPU core has its own set of physical registers to which the architectural (software-accessible) registers are mapped. The high number of physical registers is used for register renaming, temporary storage, and probably also used to store decomposed floating point values.

Quote:

iliyapolak wrote:
and probably also used to store decomposed floating point values.

What do you mean here?

Sorry, I made a mistake in my post. I meant components of various algorithms. For example, the constants of a Taylor approximation of sine would probably be kept in those registers.

Quote:

iliyapolak wrote:
I meant components of various algorithms. For example, the constants of a Taylor approximation of sine would probably be kept in those registers.

Indeed, though in case of high register pressure, load+op instructions will be almost as fast, since the constants will be kept live in the L1D cache when used in inner loops.

With AVX-512, broadcast load+op (aka "scalar memory mode") will be even more effective for this usage, though we lack timing comparisons at the moment.

On a side note: strangely, the Intel compiler uses a lot more load+op for Knights Landing targets than for Skylake Xeon targets when compiling the very same source code.

Do you mean physical register pressure or architectural register pressure?

Quote:

iliyapolak wrote:
Do you mean physical register pressure or architectural registers pressure?

I meant logical register pressure, i.e. it is generally a good idea to use load+op for constants in polynomial evaluation and to keep the registers for temporary variables, particularly with 32-bit code and only 8 XMM/YMM logical registers.

 

Yes, I agree with you; it is wise to do that even for short, few-term polynomial evaluations.
