i7-980x details on latencies of caches and TLB and buffers

Hi everyone,
I'm looking for information on the new i7-980X, such as:
the type of the three caches (L1, L2 and L3): line or block size, latencies, single or dual port.
Also, how many buffers are there inside this CPU, and what are their types and latencies?

And the last thing: what methods are used to improve the hit ratio of the caches?

Thank you very much to anyone who responds.

You might look at this:

http://software.intel.com/sites/products/collateral/hpc/vtune/performanc...

Core i7 Xeon 5500 Series
Data Source Latency (approximate)

L1 cache hit                               ~4 cycles
L2 cache hit                               ~10 cycles
L3 cache hit, line unshared                ~40 cycles
L3 cache hit, shared line in another core  ~65 cycles
L3 cache hit, modified in another core     ~75 cycles
Remote L3 cache                            ~100-300 cycles
Local DRAM                                 ~60 ns
Remote DRAM                                ~100 ns

There are additional issues to concern yourself with, in particular TLB misses, which are listed in the above guide under DTLB (TLB stands for Translation Lookaside Buffer).
The TLB is a separate, very small cache of virtual-to-physical address mappings. On the listed Core i7 Xeon series it has 64 entries for the primary and 512 entries for the secondary DTLB. I did not see any mention of the clock-cycle impact when a reference is not in the 64-entry primary DTLB. If a virtual memory reference is not mapped within the cached DTLBs, you might suffer an additional 1 or 2 DRAM latencies.
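To put the 64 and 512 entry counts in perspective, here is a minimal sketch of how much memory each DTLB level can map at once. It assumes ordinary 4 KiB pages (large pages would change the numbers), and the function name `tlb_reach` is just an illustrative choice.

```python
# Sketch only: memory covered by a TLB = number of entries * page size.
PAGE_SIZE = 4 * 1024  # bytes; ASSUMPTION: standard 4 KiB pages


def tlb_reach(entries, page_size=PAGE_SIZE):
    """Return the number of bytes a TLB with `entries` entries can map."""
    return entries * page_size


first_level = tlb_reach(64)    # primary DTLB, entry count from the post
second_level = tlb_reach(512)  # secondary DTLB, entry count from the post

print(f"primary DTLB reach:   {first_level // 1024} KiB")
print(f"secondary DTLB reach: {second_level // 1024} KiB")
```

Under that assumption, a working set that hops around more than about 2 MiB of distinct pages can no longer be fully covered by the cached mappings.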

Also, the above DRAM latencies are for the memory latency alone. Therefore, the latency to determine a cache miss may need to be added to the DRAM latency. DRAM access tends to be pipelined, so the cache miss latency might get hidden. However, when a specific thread experiences a cache miss, the DRAM request goes into a queue (16/12 deep, depending on other threads' memory requests). Therefore, a specific (worst-case) request could have a latency of up to (3+16)*DRAM latency.
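As a rough illustration of that worst-case bound, the sketch below plugs the approximate local and remote DRAM latencies from the table into (3+16)*DRAM latency. The constant 3 and the queue depth 16 are taken as-is from the estimate above; the function and parameter names are hypothetical.

```python
# Sketch only: worst-case bound quoted in the post, (3 + 16) * DRAM latency.
def worst_case_latency_ns(dram_latency_ns, queue_depth=16, extra=3):
    """Upper bound for one request stuck behind a full memory queue.

    `queue_depth` and `extra` mirror the 16 and 3 in the post's formula;
    they are not measured values.
    """
    return (extra + queue_depth) * dram_latency_ns


local_worst = worst_case_latency_ns(60)    # ~60 ns local DRAM from the table
remote_worst = worst_case_latency_ns(100)  # ~100 ns remote DRAM from the table

print(f"local DRAM worst case:  {local_worst} ns")
print(f"remote DRAM worst case: {remote_worst} ns")
```

With those inputs the bound comes out at 1140 ns for local DRAM and 1900 ns for remote DRAM, which shows how far an unlucky request can sit from the headline figure.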

>>And the last thing: what methods are used to improve the hit ratio of the caches?

a) Structure data such that computationally related information resides within the same cache line.
b) Write your algorithms such that the most frequent accesses occur on adjacent (or nearly adjacent) memory. This reduces TLB pressure.
c) Reduce the number of writes to RAM by using temporary variables that can be registerized (optimized into registers).
d) Structure data such that you can manipulate it using SSE (when possible).
e) For parallel programming, coordinate activities amongst threads sharing cache levels (HT siblings for L1 and L2, same die or half die for L3, same NUMA node for multiple nodes).
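Points a) and b) can be made concrete by counting how many cache lines and pages an access pattern touches. The sketch below is only a model, not a measurement: it assumes 64-byte cache lines and 4 KiB pages, and `footprint` is a hypothetical helper that compares 1024 tightly packed 8-byte elements with 1024 elements placed one per page.

```python
# Sketch only: count distinct cache lines and pages touched by a strided walk.
LINE_SIZE = 64    # bytes; ASSUMPTION: 64-byte cache lines
PAGE_SIZE = 4096  # bytes; ASSUMPTION: 4 KiB pages


def footprint(n_accesses, stride_bytes, elem_size=8, base=0):
    """Return (cache lines touched, pages touched) for a strided access pattern."""
    lines = set()
    pages = set()
    for i in range(n_accesses):
        addr = base + i * stride_bytes
        # Check first and last byte, in case an element straddles a boundary.
        for a in (addr, addr + elem_size - 1):
            lines.add(a // LINE_SIZE)
            pages.add(a // PAGE_SIZE)
    return len(lines), len(pages)


dense = footprint(1024, stride_bytes=8)      # elements back to back
sparse = footprint(1024, stride_bytes=4096)  # one element per page

print("dense  (lines, pages):", dense)
print("sparse (lines, pages):", sparse)
```

In this model the dense layout touches 128 lines on 2 pages, while the sparse one touches 1024 lines on 1024 pages for the same number of elements, which is more pages than the 512-entry secondary DTLB mentioned earlier can map.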

Jim Dempsey

www.quickthreadprogramming.com

Thank you very much Jim,
Just to be sure: this document is about the i7 processor, and I'm not sure that the i7 Extreme Edition has the same performance. Maybe I'm wrong, so I'm waiting for your answer.
Thanks again.

The stats I posted were for a Core i7 Xeon 5500 series. This is an MP (Multi-Processor) part, so you could strike the information relating to the "remote..." latencies. There may be some additional differences between the processor models (exact clock counts and memory read/write queue depth). Instead of using tables written in a specification document, I suggest you measure the performance of representative benchmarks or of your own application, as too many circumstances will affect the performance observed as compared to the performance expected from published statistics.

Make the pudding (the proof is in the pudding).

Jim Dempsey

www.quickthreadprogramming.com

Hi Jim,
Thanks again for your help.
Maybe there is a misunderstanding.
I don't have this processor.
I'm taking a course where the teacher asked us to present a computer, and I chose one with this CPU.
Now I must give information about everything I asked you for.
That's why I need information about the Westmere architecture and the performance of the i7-980x.
If you have any idea where to find documents, datasheets or benchmark results for this CPU, I will thank you all my life :)
Azzedine.

Not certain if this is what you need for your assignment, but a simple Google search on i7-980x generated multiple links, including: http://ark.intel.com/Product.aspx?id=47932

Hi Jim,
Before I asked you those questions,
I had already looked at most of the links that you found. None gave me a correct answer for the specific i7-980x CPU.
As it is a Westmere, I tried to find something on that, but with no luck.
Thanks anyway, I don't want to take up a lot of your time.
Azzedine :)

Azzedine,

From the spec sheet http://ark.intel.com/Product.aspx?id=47932 this processor has

Turbo Boost
Speed Step
Intel Smart Cache

Therefore determining latencies will be difficult. What you measure for a short run benchmark on a cold CPU will differ from a long run benchmark that causes the CPU to heat up and reduce the speed. IOW your latencies are not fixed. Any chart you produce needs to include a thermal curve as well as application thread to core mapping. So much qualification detail is on these charts as to make them meaningless for use in projecting performance estimations of an arbitrary application. Meaning you really need to run your specific application with actual data and not use best case performance latencies or industry benchmark applications.

Jim Dempsey

www.quickthreadprogramming.com

Hi Jim,
Thanks again,
it is a real pleasure to read you.
I agree with what you said earlier.
But my assignment does not need that much precision: just an average, or a best case or a worst case.
So my question is why Intel doesn't give us some data about that.
Jim, if you can give me 10 more minutes, I will not bore you. Here is an example from a website:

http://www.xbitlabs.com/articles/cpu/display/intel-core-i7-980x_3.html

The author reports latencies in two different ways. For example, for the L3 cache latency he gives two values:
1) L3 cache latency (for Gulftown = i7-980x): 44 clocks
2) L3 cache latency obtained from the screenshot of the Lavalys test bench: 4.8 ns

Could you tell me what the relationship is between the number of clocks and the nanoseconds? In other words (excuse my English), how can I convert this latency from ns to clocks or vice versa?

thanks again.

NB: My whole class is talking about an old ARM processor, because there is a lot of information about it. I am trying to talk about something new, but I was surprised by the Intel staff: for any question they sent me to the forum, to you Jim :)
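For reference, the general relationship behind the clocks-versus-nanoseconds question is cycles = nanoseconds x frequency in GHz, because a 1 GHz clock ticks once per nanosecond. A minimal sketch, assuming a 3.33 GHz core clock for the i7-980x; that value is an assumption here, Turbo Boost or a separate cache clock domain would change it, and the function names are illustrative.

```python
# Sketch only: convert a latency between clock cycles and nanoseconds.
FREQ_GHZ = 3.33  # ASSUMPTION: nominal core clock; not stated in the thread


def cycles_to_ns(cycles, freq_ghz=FREQ_GHZ):
    """One cycle lasts 1 / freq_ghz nanoseconds."""
    return cycles / freq_ghz


def ns_to_cycles(ns, freq_ghz=FREQ_GHZ):
    """A clock at freq_ghz GHz completes freq_ghz cycles per nanosecond."""
    return ns * freq_ghz


print(f"44 clocks = {cycles_to_ns(44):.1f} ns")
print(f"4.8 ns    = {ns_to_cycles(4.8):.1f} clocks")
```

At that assumed clock, 44 clocks is about 13.2 ns and 4.8 ns is about 16 clocks, so the two published figures do not convert into each other; they were presumably obtained with different access patterns or against a different clock, which the article would have to confirm.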