__cpuid returns incorrect number of logical processors on Xeon X5660

Hello,

I have a HP Z800 with a dual Xeon X5660. 2 x 6 cores x Hyperthreading = 24 logical processors.

In Windows 7 x64, calling GetSystemInfo returns 24 logical processors. Calling GetLogicalProcessorInformation also returns 24 (2 processor packages, 24 logical processors, 12 processor cores).

But calling __cpuid returns 32. I also get 32 even with Hyper-Threading disabled in the BIOS.

CPU-Z reports the right information, and Task Manager shows 24 CPUs.

I am trying to determine the affinity of each core so that I only run 1 thread per core (no collision between 2 logical processors running on the same core).

I have another machine with a dual Xeon E5520 (2 x 4 cores x Hyperthreading = 16 logical processors).

On the Xeon E5520, setting the affinity to 1, 2, 4, or 8 makes the thread run on the first 4 cores, while setting the affinity to 16, 32, 64, or 128 makes the thread run on the next 4 cores.

On the Xeon X5660, setting the affinity from 1 to 32 makes the thread run on the 6 cores, while setting the affinity from 64 to 2048 makes the thread run on different logical processors but on the same cores. Performance is much lower in that case.

I compiled a small program that I found on the Microsoft web site that shows an example of __cpuid and __cpuidex.

I made sure that I have the latest version of the BIOS and that all the chipset drivers are up to date according to the Intel Update Tool.

Any suggestions are welcome and appreciated.

Here's the output that I got from the tool that I compiled:

For InfoType 0
CPUInfo[0] = 0xb
CPUInfo[1] = 0x756e6547
CPUInfo[2] = 0x6c65746e
CPUInfo[3] = 0x49656e69

For InfoType 1
CPUInfo[0] = 0x206c1
CPUInfo[1] = 0x200800
CPUInfo[2] = 0x29ee3ff
CPUInfo[3] = 0xbfebfbff

For InfoType 2
CPUInfo[0] = 0x55035a01
CPUInfo[1] = 0xf0b2ff
CPUInfo[2] = 0x0
CPUInfo[3] = 0xca0000

For InfoType 3
CPUInfo[0] = 0x0
CPUInfo[1] = 0x0
CPUInfo[2] = 0x0
CPUInfo[3] = 0x0

For InfoType 4
CPUInfo[0] = 0x3c004121
CPUInfo[1] = 0x1c0003f
CPUInfo[2] = 0x3f
CPUInfo[3] = 0x0

For InfoType 5
CPUInfo[0] = 0x40
CPUInfo[1] = 0x40
CPUInfo[2] = 0x3
CPUInfo[3] = 0x1120

For InfoType 6
CPUInfo[0] = 0x7
CPUInfo[1] = 0x2
CPUInfo[2] = 0x9
CPUInfo[3] = 0x0

For InfoType 7
CPUInfo[0] = 0x0
CPUInfo[1] = 0x0
CPUInfo[2] = 0x0
CPUInfo[3] = 0x0

For InfoType 8
CPUInfo[0] = 0x0
CPUInfo[1] = 0x0
CPUInfo[2] = 0x0
CPUInfo[3] = 0x0

For InfoType 9
CPUInfo[0] = 0x0
CPUInfo[1] = 0x0
CPUInfo[2] = 0x0
CPUInfo[3] = 0x0

For InfoType 10
CPUInfo[0] = 0x7300403
CPUInfo[1] = 0x4
CPUInfo[2] = 0x0
CPUInfo[3] = 0x603

For InfoType 11
CPUInfo[0] = 0x1
CPUInfo[1] = 0x2
CPUInfo[2] = 0x100
CPUInfo[3] = 0x0

For InfoType 80000000
CPUInfo[0] = 0x80000008
CPUInfo[1] = 0x0
CPUInfo[2] = 0x0
CPUInfo[3] = 0x0

For InfoType 80000001
CPUInfo[0] = 0x0
CPUInfo[1] = 0x0
CPUInfo[2] = 0x1
CPUInfo[3] = 0x2c100000

For InfoType 80000002
CPUInfo[0] = 0x65746e49
CPUInfo[1] = 0x2952286c
CPUInfo[2] = 0x6f655820
CPUInfo[3] = 0x2952286e

For InfoType 80000003
CPUInfo[0] = 0x55504320
CPUInfo[1] = 0x20202020
CPUInfo[2] = 0x20202020
CPUInfo[3] = 0x58202020

For InfoType 80000004
CPUInfo[0] = 0x30363635
CPUInfo[1] = 0x20402020
CPUInfo[2] = 0x30382e32
CPUInfo[3] = 0x7a4847

For InfoType 80000005
CPUInfo[0] = 0x0
CPUInfo[1] = 0x0
CPUInfo[2] = 0x0
CPUInfo[3] = 0x0

For InfoType 80000006
CPUInfo[0] = 0x0
CPUInfo[1] = 0x0
CPUInfo[2] = 0x1006040
CPUInfo[3] = 0x0

For InfoType 80000007
CPUInfo[0] = 0x0
CPUInfo[1] = 0x0
CPUInfo[2] = 0x0
CPUInfo[3] = 0x100

For InfoType 80000008
CPUInfo[0] = 0x3028
CPUInfo[1] = 0x0
CPUInfo[2] = 0x0
CPUInfo[3] = 0x0

CPU String: GenuineIntel
Stepping ID = 1
Model = 12
Family = 6
Extended model = 2
CLFLUSH cache line size = 64
Logical Processor Count = 32

The following features are supported:
SSE3
MONITOR/MWAIT
CPL Qualified Debug Store
Virtual Machine Extensions
Enhanced Intel SpeedStep Technology
Thermal Monitor 2
Supplemental Streaming SIMD Extensions 3
L1 Context ID
CMPXCHG16B Instruction
xTPR Update Control
Perf/Debug Capability MSR
SSE4.1 Extensions
SSE4.2 Extensions
POPCNT Instruction
x87 FPU On Chip
Virtual-8086 Mode Enhancement
Debugging Extensions
Page Size Extensions
Time Stamp Counter
RDMSR and WRMSR Support
Physical Address Extensions
Machine Check Exception
CMPXCHG8B Instruction
APIC On Chip
SYSENTER and SYSEXIT
Memory Type Range Registers
PTE Global Bit
Machine Check Architecture
Conditional Move/Compare Instruction
Page Attribute Table
36-bit Page Size Extension
CFLUSH Extension
Debug Store
Thermal Monitor and Clock Ctrl
MMX Technology
FXSAVE/FXRSTOR
SSE Extensions
SSE2 Extensions
Self Snoop
Multithreading Technology
Thermal Monitor
Pending Break Enable
LAHF/SAHF in 64-bit mode
RDTSCP instruction
64 bit Technology

CPU Brand String: Intel Xeon CPU X5660 @ 2.80GHz
Cache Line Size = 64
L2 Associativity = 6
Cache Size = 256K

Number of Cores = 16

ECX Index 0
Type: Data Cache
Level = 2
Self Initializing
Is Not Fully Associative
Max Threads = 2
System Line Size = 64
Physical Line Partitions = 1
Ways of Associativity = 8
Number of Sets = 64

ECX Index 1
Type: Instruction Cache
Level = 2
Self Initializing
Is Not Fully Associative
Max Threads = 2
System Line Size = 64
Physical Line Partitions = 1
Ways of Associativity = 4
Number of Sets = 128

ECX Index 2
Type: Unified Cache
Level = 3
Self Initializing
Is Not Fully Associative
Max Threads = 2
System Line Size = 64
Physical Line Partitions = 1
Ways of Associativity = 8
Number of Sets = 512

ECX Index 3
Type: Unified Cache
Level = 4
Self Initializing
Is Not Fully Associative
Max Threads = 32
System Line Size = 64
Physical Line Partitions = 1
Ways of Associativity = 16
Number of Sets = 12288

The CPUID function returns the number of core and thread slots consumed by the processor's APIC, not the number available (or used) within the processor. A 6-core processor will (or may) consume 8 core slots (a power of 2). Note, processor design is not bound by this rule; using a power of 2 simply makes internal addressing easier (a simple mask). The O/S generally squishes out the unpopulated core/thread slots when producing its logical processor numbering tables. The Intel system programmers' guide covers this (but not with bold and italics as to the distinction).

Jim Dempsey

www.quickthreadprogramming.com

I don't think the programmers' guide goes so far as to tell how to identify which of the 6 cores share 2 of the 4 paths to the 3rd-level cache.

Tim,

If there are 4 paths to L3 for 6 cores, then are there 4 L2 caches or 6?
The various docs and charts on Intel.com aren't quite clear on this.
In some places a total of 1 MB of L2 cache is stated; in others, 256 KB of L2 per core is indicated.
The former would indicate that two L2's are each shared by 2 cores (similar to some of your other processors without L3).

If there are 6 L2's, then I would imagine some bits in a cpuid/cpuidex leaf would indicate sharing of some sort of MUX between L2 and L3.

Jim Dempsey

www.quickthreadprogramming.com

As far as I know, all 6 cores are similar, each with its own L1 and L2, but with 2 pairs of cores sharing a path to L3. If those shared paths don't approach saturation, the 6 cores can show full throughput. We haven't come up with any way to identify the paired cores other than to run bandwidth benchmarks such as STREAM, pinned to all combinations of pairs of cores. The numbering as seen in software isn't likely to be the same as in the hardware. Some say it could change at reboot, although colleagues depend on it being the same on machines with identical BIOS and kernels.

Interesting. I would expect the cores within a processor to always have the same relative APIC number at every boot. The base processor APIC number might depend on other factors.

On an 8 core processor then would this mean there is a 2-way followed by a 4-way?

IOW four pairs of cores, each pair sharing one of four paths into L3. It would seem to make sense.
This might seem to indicate that on a 6-core system, 2 of the cores experience less latency to the L3 cache.

Back in the 1980's I helped (in a very small way) design a shared memory system for a cluster whereby the selector used a rotating priority. The switch would be free-running until a core (processor in this case) indicated it wanted access. The first (next) core encountered in the current rotating sequence with an access request would gain control of the switch. Subsequent accesses to the shared memory would remain locked to that core (processor) until the core microcode released the switch. With this scheme a core (processor) could perform multiple accesses through the switch while incurring the latency overhead only once. You could alternately set the switch to a fixed priority (e.g., like a SCSI bus).

The reason I bring this up is multiple-tiering could experience a similar benefit by having the switching logic be somewhat sticky.

Jim

www.quickthreadprogramming.com

The 6, 8, and 10 core Intel CPUs use the L3 ring cache, where access to L3 associated with a non-adjacent core is 1 or 2 steps additional. As far as I know, the 8 cores all have equal access to their own segment of L3.

Thanks for the replies! A power of 2 would explain the number that I get.

Maybe you can help me a bit more.

After setting the thread affinity mask, I get the APIC ID in order to retrieve the logical and physical ID of the logical processor.

On the Intel Xeon E5520 @ 2.227 GHz (as reported by CPU-Z):

Family: 6
Model: A
Stepping: 5
Ext. Family: 6
Ext. Model: 1A
Revision: D0

I get pairs like this:

Affinity mask    Logical ID    Physical ID
2^0              0             0
...              ...           0
2^7              7             0
2^8              0             1
...              ...           1
2^15             7             1

From the Xeon X5660 (Westmere-EP):

Family: 6
Model: C
Stepping: 1
Ext. Family: 6
Ext. Model: 2C
Revision: B0

Affinity mask    Logical ID    Physical ID
2^0              0             0
...              ...           0
2^5              5             0
2^6              0             2
...              ...           2
2^11             5             2
2^12             0             4
...              ...           4
2^17             5             4
2^18             0             6
...              ...           6
2^23             5             6

From the table, it is as if there are 4 different Physical IDs.

But, even for the E5520, the values that I get are not what I would have expected.

For the E5520, I would have expected this (not sure about the Logical ID, but since setting the affinity to 2^4 makes the thread run on the next physical CPU, I expected the Physical ID to be 1):

Affinity mask    Logical ID    Physical ID
2^0              0?            0
2^1              2?            0
2^2              4?            0
2^3              6?            0
2^4              0?            1
2^5              2?            1
2^6              4?            1
2^7              6?            1
2^8              1?            0
2^9              3?            0
2^10             5?            0
2^11             7?            0
2^12             1?            1
2^13             3?            1
2^14             5?            1
2^15             7?            1

Thanks for the help!

Depending on settings for your operating system (or it may have one default setting that you cannot change), the affinity mask "processor number" 0:n can map to any of the physical hardware threads. It could even have gaps (IOW, not necessarily packed into the lowest bits). The printout from your E5520 system indicates that the low thread of each HT pair from each physical package, in package ID order, gets assigned logical processor numbers first, followed by the high thread of each HT pair from each physical package, in package ID order. If this is not what you want, then check your O/S documentation to see if/how you can specify the sequencing you desire. Typical settings are:

1) All logical IDs in logical ID order, per physical ID in physical ID order.
2) Lowest logical ID per HT siblings in logical ID order per physical ID in physical ID order, followed by the next higher logical ID per HT siblings in logical ID order per physical ID in physical ID order, followed by the next higher ID per HT siblings (assuming more than 2 HT per core). This appears to be your setting.
3) Lowest logical ID per physical ID in physical ID order, followed by the next higher logical ID per physical ID in physical ID order, ... (without regard to HT siblings).
4) As in 3), but sequencing one thread per HT siblings.

The X5660 report indicates that each package consumes two physical IDs (one of the CPUID leaves should indicate the number of physical IDs per package).

In the QuickThread threading toolkit that I wrote (www.quickthreadprogramming.com), I perform affinity pinning by system logical processor number (affinity bit mask order), then I use CPUID and CPUIDEX to build a proximity bitmask table per thread, per cache level, per NUMA node, such that the threading scheduler is aware of localities amongst available threads. The user application can then specify enqueuing to:

self
within L1 distance (HT siblings)
within L2 distance
within L3 distance (usually equivalent to socket)
within same NUMA node
within one-hop NUMA distance
within two-hop NUMA distance
within three-hop NUMA distance

The above is an inclusion distance from the thread issuing the enqueue.
There are additional control bits for exclusion, dispersion and availability.

This all sounds complicated but it is relatively trivial to use and implement internally.

// slice rows by sockets
parallel_for(OneEach_L3$, doRows, 0, nRows, nCols, A, B, C);
...
void doRows(int RowN, int RowM, int nCols, double* A[], double* B[], double* C[])
{
    // slice this slice of rows by threads within the socket
    parallel_for(L3$, doTile, 0, nCols, RowN, RowM, A, B, C);
}
...
void doTile(int ColN, int ColM, int RowN, int RowM, double* A[], double* B[], double* C[])
{
    for (int Row = RowN; Row < RowM; ++Row)
        for (int Col = ColN; Col < ColM; ++Col)
            C[Row][Col] = doSomething(A[Row][Col], B[Row][Col]);
}

You can alternately use lambda functions if you want or traditional functions as above.

Jim Dempsey

www.quickthreadprogramming.com

Information worth its weight in gold!

- CCK

Thanks for the feedback! It is not the system/CPUs that are problematic. I believe that our current threading model (which is more than 10 years old) needs to be redesigned. It was scalable up to 16 logical processors; beyond that, it becomes a burden. If I get the OK from management, I will investigate your suggestions.
