i3-2120 FSB Speed and Memory bandwidth

All,

The URL below says that the i3-2120 has a 3.3 GHz CPU clock and a bus/core ratio of 33.

http://ark.intel.com/products/53426/Intel-Core-i3-2120-Processor-%283M-C...

This means that the FSB base clock is 3.3 GHz / 33 = 100 MHz.
Since FSBs are quad-pumped, we can look at it as 100 * 4 MT/s = 400 MT/s.
Say each transaction transfers 8 bytes (64 bits); this leads to 3200 MB/s, or 3.2 GB/s.

The URL above says that there are 2 memory channels.
Assuming 2 CPUs can read simultaneously (not sure how), we can say that the max bandwidth to the CPU is around 6.4 GB/s.

However, the URL specifies that the supported RAM is DDR3-1066/1333 and gives the max memory bandwidth as 1333 * 8 * 2 MB/s = 21 GB/s.

My question is: what is the point of having super-fast memory when data can be transferred to the CPU only at a much lower rate? I am confused by all these numbers. Can someone lend me some help?

Thanks,
Best Regards,
Sarnath


Hello Sarnath,
The 2nd generation Core processors (formerly codenamed Sandy Bridge), such as the i3-2120, have integrated memory controllers and don't use FSB technology anymore.
On my 2.3 GHz Sandy Bridge-based system, the dual-channel integrated memory controller is able to hit 18.5 GB/sec using 1333 MT/s memory (with a read memory test running on all CPUs).
So Sandybridge (and Nehalem) can make good use of the high speed memory.
I hope this helps,
Pat

Hi Patrick,
Thanks for answering this over a weekend! Appreciate it very much.
This is a great piece of info.

A few more questions:
1. Can you tell me what exact purpose the FSB serves in these new chipsets?
What other clocks get derived from the FSB?
2. What does "2 memory channels" mean? Does it mean that 2 CPUs can read simultaneously?
Or does it mean that there are 2 paths every CPU can take to memory, depending on the load?
3. Is the dual channel a motivation for the Sandy Bridge micro-architecture to dedicate 2 ports to memory loads?
If so, does the Nehalem micro-architecture also dedicate 2 ports to loads? (I will look up the manual anyway for the last one.)

Thanks a lot,

Best Regards,
Sarnath

You're opening up a lot of questions which might require reading references; I'll try some over-simplified replies.
1. As there is no "FSB" on Nehalem and Sandy Bridge, FSB functions of prior architectures are taken over by QPI and "un-core" et al. As far as I know, the clocks you would be interested in are derived from QPI clock. Nehalem and Sandy Bridge differ in that I believe Nehalem has more possible model-dependent ratios of CPU clock to uncore clock.
2. You have only 1 CPU, it presumably interleaves access on the 2 memory channels (at least when fetching whole cache lines from memory to last level cache).
3. If you're talking about memory channels, Nehalem server and desktop usually had 3 per CPU. Sandy Bridge will soon add servers (2, and later, 4 CPUs) with 4 channels per CPU. Evidently, such expense (both money and power consumption) is outside the possibility for low end mobile.

Hi Tim,
Thanks for your answer. I am pleasantly surprised that Intel forum is so well attended and active. Thanks!

Coming back,

I just wrote a memcpy routine that is able to hit 7 GB/s on my Sandy Bridge system here, running on 1 CPU. I use non-temporal writes and aggressive software prefetch. This betters Intel's "ssse3_rep"-based memcpy routine, but I think that is expected; I don't think REP MOV instructions make use of non-temporal writes. I am not too happy with my performance, because going by Pat's answer, 18.2 GB/s should be reachable per CPU on the system. So I think I am at least 2.5x away from the best performance.
I think I am not using both memory channels effectively. I want to understand the correct way of reading memory so that I can load both memory controllers simultaneously. Pardon my ignorance. Thanks for all your time,

Best Regards,
Sarnath

Note that my 18+ GB/s result was running on all 8 CPUs (4 cores with HT enabled) of the processor.
If you run multiple instances (1 instance/CPU) of your memcpy, I'm pretty sure you will get into the 16-19 GB/s range.
I was not able to max out the memory bandwidth with just 1 CPU.

You don't have to do anything special to use both memory channels. The hardware will automatically use both memory channels.
Pat

Evidently, the bandwidth of a high end desktop (particularly if using DDR3-1600) with 4 channels, using multiple threads, is better than a low end laptop with 2 channels.

Hello Patrick,
Thanks for this useful piece of info. The takeaway is: one has to read from all CPUs and hyper-threaded CPUs to get this number.
If you don't mind, can you publish the benchmark numbers with:
1. Only 2 CPUs - no hyper-threading
2. Only 2 cores - with hyper-threading
I wonder what role hyper-threading plays here. Is it key to taking advantage of the dual memory channels?
Thanks, Best Regards, Sarnath

So... I had all those numbers this weekend and then deleted them.

BW (GB/sec), # CPUs (config)
15.5, 1 (1st thread on core 0)
18.9, 2 (1 thread on core 0 and a 2nd thread on core 1)
19.2, 4 (1 thread on each core)

16.8, 2 (both threads on core 0)
18.8, 4 (both threads on core 0 and both threads on core 1)

You see that you can hit close to the max bw with just 2 CPUs.
Hyper-threading plays no role in utilizing the dual-channel memory.
Just using the 2 HT threads on the 1st core probably can't quite keep the memory busy enough.

Now some comments about my memory bw benchmark: parameters, measuring bw, etc.
The memory test I used is a 'read' memory bw test. All this benchmark does is read memory.
I use this test as a sanity check of the memory system.
It also makes the bw easy to compute, since the bw is just '# of bytes read / elapsed_time'.
Also, each thread reads its own 40MB array.

Computing bandwidth for a memcpy is harder. The actual amount of memory moved may be 2 or 3 times the size of the dest array.
For a standard memcpy (no non-temporal stores), the total memory moved is 3 times the dest array size: you do an RFO (read for ownership) of the dest, a read of the source, and (eventually) a writeback of the dest.
For a non-temporal-store memcpy (if everything is done correctly), you do a read of the source and a store of the dest, so the total memory moved is 2x the dest.
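To illustrate the non-temporal case, here is a minimal SSE2 sketch of such a copy (a hypothetical `nt_memcpy` of my own, not Pat's or Intel's actual routine; it assumes 16-byte-aligned pointers and a size that is a multiple of 16 bytes):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical non-temporal copy: loads go through the cache, but
 * stores bypass it via movntdq, avoiding the RFO of the dest line.
 * Assumes dst/src are 16-byte aligned and n is a multiple of 16. */
static void nt_memcpy(void *dst, const void *src, size_t n)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < n / 16; i++)
        _mm_stream_si128(d + i, _mm_load_si128(s + i));
    _mm_sfence();  /* order the streamed stores before later loads/stores */
}
```

Because the streamed stores skip the RFO, the total traffic is roughly 2x the dest size, matching the accounting above.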

Usually I like to check the bandwidth that people quote for memcpy with VTune Amplifier counters.

Pat

Hello Patrick,

Thanks for this great deal of info. The "memcpy" 3x thing is news to me. I have been scratching my head over this for some time. I get it completely now. Thanks!

The numbers you get are pretty impressive compared to the 7 GB/s that I am getting for my version of memcpy. I was just using prefetch to accelerate the loads (keep the memory busy - the manual says 64 muops can be in flight in the DCU - I prefetch around 16 cache lines every 4th iteration and then copy 4 cache lines per iteration using non-temporal stores).

I want to profile this against Intel's fast memcpy, but my Intel compiler (trial version) is using only the "ssse3 rep movs" version of memcpy. How do I make the compiler use fast memcpy? I see that symbol defined in libirc.a; not sure how to use it.

Meanwhile, I'd appreciate any ideas to improve memcpy.

Intel arch is intriguing and I think it is going to take some time to master it.
I am sure I can do this with all the support you guys are giving,
Thanks a lot for your time on this!

Best Regards,
Sarnath

Hey Sarnath,
The first thing I'd recommend is making a simple loop, without prefetching, that just reads the memory, and seeing what sort of performance you get.
This is why I start simple and then get more complicated.
This will let you test your framework to make sure you are timing correctly, counting correctly, etc.
You can use a simple loop like:

#define BIG (40*1024*1024)
char array[BIG];
int i, j, loops;
double time_beg, time_end, bytes=0;
j = 0;
loops = 0;
memset(array, 1, sizeof(array));
time_beg = your_timer_routine(); /* replace with your timer routine */
while(((time_end = your_timer_routine()) - time_beg) < 10) /* spin for 10 seconds */
{
    loops++;
    bytes += BIG;
    for(i=0; i < BIG; i+=64)
    {
        j += array[i];
    }
}

/* print results. Print j just to make sure the compiler doesn't optimize everything away. */
printf("time= %f, MB/sec = %f j=%d \n", time_end - time_beg, 1.0e-6 * bytes/(time_end-time_beg), j);

If you only have 1 DIMM in your system, then I believe you will only exercise 1 channel.
How many DIMMs do you have?
Pat

Quoting k_sarnath ...
I want to profile this with Intel's fast memcpy. But my intel compiler (trial version) is using only "ssse3 rep movs" version of memcpy. How do I make the compiler use fast memcpy? I see that symbol defined in libirc.a. Not sure how to use it.
...

Hi everybody,

I'd like to get more information about Intel's fast 'memcpy' (SSE based):

- Is it available on Windows platforms?

- Is there source code for the function? If yes, how could I download it? If no, please provide details on which Intel library has that function.

Best regards,
Sergey

Hello Pat,

Thanks for the code and your time! I understand that you are going the extra mile by typing out the code. Appreciate it very much!

I hear from my colleague that the machine has only one 4GB RAM stick inserted in the slot, although the machine has 2 slots (I hope that is what DIMM means..). So that means I am limited to 10 GB/s :-(
btw, that means the RAM controller interleaves memory addresses among the DIMMs.... and the interleave granularity must be quite small (64 bytes or 128 bytes)... Any idea what this size could be?

I will check out your code (which looks cool) and post the results,
Thanks,
Best Regards,
Sarnath

Hello Pat,
The code that you gave me reaches 10 GB/s on my machine.....

sarnath@SandyBridge:~/intel_forums$ ./a.out

time= 0.406786, MB/sec = 10310.828824 j=65536000

That possibly confirms the fact that my system has only 1 DIMM.
btw, this is a "great" learning for me -- that DIMMs are tied to the channels. Thanks!

Best Regards,
Sarnath

This article gives a summary of some of the memcpy issues as of several years ago.
fast_memcpy from the Intel compiler library is set up for automatic substitution by the Intel compilers (all versions). Both explicit memcpy() and several versions of source code loops will be converted.
If you wish to see source code for linux, various versions of glibc (2.6 and later) may be your best choice. Several of those were designed to achieve nearly full performance on CPUs of more than one brand which were in service at the time of writing.

Hello Tim,

Thanks for the link. I just stumbled on it last weekend when I was browsing around for memcpy thingies...I did not read it fully though. I will check out..

Intel uses the "ssse3_rep_movsb" stuff as a replacement for my invocation of memcpy. I will check that again with "const restrict" pointers and see if that changes something....

This performance sounds reasonable considering you only have 1 DIMM (so you are only using 1 of the 2 channels).
You'll need to have both slots populated to get the dual-channel performance.
Maybe you can trade the one 4GB DIMM for two 2GB DIMMs (or just add another 4GB DIMM).
Pat

Hello Pat,

Thanks for all your help! The DIMM thing is definitely a new learning for me! I am glad I asked around...

Moreover,
I need to find out why ICPC is replacing my call to "memcpy" with calls to "__intel_ssse3_rep_memcpy" instead of "intel_fast_memcpy". I just tried casting the src pointer as "const void * restrict", but that does not change a thing... Any help?
The rep_memcpy is almost 2x slower than what I wrote.... I was expecting Intel's implementation to beat mine so that I could learn what Intel is doing differently... At least, I wanted to know my ceiling... but this "rep" thing is killing me... btw, I have only an eval copy of icc (Intel Composer Studio on 64-bit Linux). Does that matter?

Thanks,
Best Regards,
Sarnath

In general, if you are using a current glibc or the Intel compiler, it will be hard to beat the system memcpy performance.
I wrote a version of the memcpy & memset routines used by the Intel compiler at one point in time.
Usually the only way you can beat the system memcpy is if you know something about how you are going to use the memcpy... like you KNOW the source and dest are 16 byte aligned and you KNOW that the size is a multiple of 16 bytes, or something like that which the compiler can't figure out at compile time.
Also, for a general memcpy, the most common sizes are usually less than 256 bytes, and generally less than 64 bytes. At least, that was the case when I profiled memcpy usages a decade ago. These short cases are harder to optimize.
So, unless you have time to burn, I'd recommend making sure you have a current glibc and/or Intel compiler.
Or profile your application and check whether a significant amount of time is actually being spent in memcpy.
Pat

The evaluation copy of icc should behave the same as a fully licensed copy. I didn't see what architecture setting you tried; I guess it must be -xSSSE3 or later, so it might be of interest to see what other choices do. The SNB CPU was supposed to improve the rep memcpy performance, but not so much as to make that the preferred method, except possibly for short string lengths, depending on alignment.

Thank you, Tim! That thread gets "hot". :)

When you're speaking about Intel's 'memcpy', do you mean a version that uses 128-bit Streaming SIMD registers?

Best regards,
Sergey

Yes, memcpy should be using 128-bit SIMD for large enough transfers. Current compilers are capable of applying auto-vectorization to accomplish this without requiring SSE intrinsics or asm. Recent glibc memcpy with 64-bit SSE2 (including non-temporal) is reasonably competitive. 64-bit was chosen there to maximize performance on CPUs like Pentium-m, atom, Athlon,....
I haven't heard of an investigation of which AVX CPUs might benefit from 256-bit moves.

Quoting Patrick Fay (Intel) ...
loops = 0;
...
while((time_end = your_timer_routine()) < 10) # spin for 10 seconds
{
loops++;
...
}
...
printf("time= %f, MB/sec = %f j=%d \n", time_end - time_beg, 1.0e-6 * bytes/(time_end-time_beg)
...

Hi Patrick,

Why do you use 'loops++' in the 'while' loop?

It takes time to increment, and the variable is not used later for analysis in 'printf'.

Best regards,
Sergey

Usually the only way you can beat the system memcpy is if you know something about how you are going to use the memcpy... like you KNOW the source and dest are 16 byte aligned and you KNOW that the size is a multiple of 16 bytes, or something like that which the compiler can't figure out at compile time.

This is correct. I do know that everything is 16-byte aligned and the size is a multiple of 16 bytes. But the system memcpy should find this out at run time in probably 20 to 30 cycles and then choose an optimized path - no? I don't expect it to be 2x slower.... Most of that 2x comes from the non-temporal writes... Only 10% of the performance comes from prefetching. I use 4MB for the transfers. We work in image processing.

Also, for a general memcpy, the most common sizes are usually less than 256 bytes, and generally less than 64 bytes. At least, that was the case when I profiled memcpy usages a decade ago. These short cases are harder to optimize.

Thanks for sharing your experience. Maybe REP MOVS... works well for these sizes... But my question remains: how do I force the compiler to use "fast_memcpy"? I will try adding "__declspec(align(16))" and see if that helps.... I understand that the declspec directive helps in aligning the start address of data items... But how do I qualify a pointer as *pointing* to a 16-byte-aligned data structure? I will muck around with this for some time today and post if I find something interesting...

So, unless have time to burn, I'd recommend making sure you have a current glibc and/or Intel compiler. Or profile your application and check whether a significant amount of time is actually being spent in memcpy.

I tried "glibc" memcpy and I find that it comes very close to my performance. I am 99.99% sure that glibc memcpy is using non-temporal writes to accelerate, which is probably *not* being used by the "rep movs"-based Intel memcpy.... I tried encoding 4MB manually to hint the compiler about a big size at compile time, but that does not change a thing. The only thing remaining is to use pure global variables so that the compiler knows they are 16-byte aligned.. Let's see how that one goes.

Thanks for all your time and help! This thread has been immensely useful!
Best Regards,
Sarnath

The evaluation copy of icc should behave the same as a fully licensed copy.

This is great! Thanks!

I didn't see what architecture setting you tried; I guess it must be -xSSSE3 or later, so it might be of interest to see what other choices do.

It was set to -xSSE4.2

```
for(i=0; i < BIG; i+=64)
{
    j += array[i];
}
```

Since "j += .." introduces a dependence chain, I was just thinking about how the microarchitecture would handle it. Let me share what I think would happen. Please correct me if I am wrong:

1. The branch predictor would predict the control flow correctly most of the time, so branch misprediction is a non-issue for this loop.

2. Since the loop body is very small, it is possible that the LSD logic kicks in and the micro-op queue will be nailed to generate a stream of micro-ops to the renamer unit until a misprediction breaks the stream. So there are absolutely NO front-end bandwidth issues for this code.

3. If the compiler had generated an ADD REG, [MEMORY] type instruction, the micro-op queue would un-laminate it into two micro-ops - or, if the compiler had generated LD + ADD, it would also result in a similar muop sequence.

4. All the ADD muops that add to "j" will form a huge dependence chain. The renamer unit has no choice but to honour the dependency chain (so renaming cannot help). The performance of this loop is limited by the resources the renamer unit has for handling dependence chains.

5. As the "ADD to j" muops stack up on one another waiting for LD data to arrive, the LDs continue to execute in the execute pipeline. The only other ADDs that execute without getting stacked up are the ones that calculate the addresses of the LDs. These "address ADD" instructions are probably incrementing a register and will form a dependency chain among themselves; they will execute one after another, so addresses for LDs are not available every cycle. Due to latencies and dependence chains, it is quite possible that only 1 LD muop executes per cycle even though SNB provides 2 LD ports (can VTune confirm this?).

6. The DCU handles the LOADs with ease. Since there are no outstanding stores, memory disambiguation is not a factor here... LOADs just slide through the DCU. The hardware prefetcher will kick in and start prefetching more cache lines ahead of the execute pipeline. This can mitigate the effect of the "less than optimal" number of LOADs executed by the OOO execution engine.

7. As the loaded data becomes available, the stacked-up dependent ADDs enter the ALU pipe one after another. There is no real pipelining here; each has to wait for the previous muop to complete even if its LOAD data is available.

Is this understanding correct?

Well, after reading how much smartness goes into intel's chips, I do feel it is fully justified to say "Intel inside; Idiot outside" :-)

btw, Learning how to edit, quote and paste code in this forum is more difficult than learning intel micro-architecture. I toggle to HTML editor for the blockquote thing.. It just does not work as intended inside the normal text box

Best Regards,
Sarnath

The only thing remaining is to use pure global variables so that compiler knows that they are 16-byte aligned..

I did and it worked! The compiler uses 'intel_fast_memcpy' while copying global arrays.....

I just benchmarked. My fast implementation is as good as g++ libc, and both of us are ~2x faster than Intel's implementation..... But there's a catch here.
Since "memcpy"ed data are immediately re-used (temporal nature of the data), it is possible that Intel is using "cached" writes (instead of non-temporal). That is good for small and typical use. Good!

However, this makes no sense when the copy size is greater than the LLC size. For example, if I copy 5MB of data, libc and my implementation clock around 9.7 GB/s, while Intel memcpy clocks around 6.594 GB/s. The theoretical peak of my system is ~10 GB/s. Maybe the ICPC guys can look into this aspect and fix it.

Thanks!

intel_fast_memcpy() (as well as recent glibc implementations) should have a threshold where it chooses nontemporal over a certain size.

For 1), true, branch mispredict should not be a factor.

2) This is probably true, but the only way to know for sure is to look at the disassembly and at front-end events with VTune (or a similar tool). In general, if something is going to memory, then the memory latency is the bottleneck, not the front-end. Front-end issues usually cost up to a few cycles, or explain why you are not retiring more than 1 uop/cycle; if you are fetching from memory, a load can take 100s of cycles.
For instance, with the prefetchers disabled, and using a load-to-use, dependent-load latency test, a memory load takes 187 cycles on my system.
The prefetchers and out-of-order execution help keep multiple loads outstanding, so the effective latency is actually much less.

3) I don't know about this one. I'm not an expert on the uarch details. They keep changing the uarch on me.
4) Probably true.
5) For the case where the loads are coming from memory, most probably fewer than 1 load is executed per cycle. Note that if the data comes from anywhere besides the L1D, then a full cache line is moved to the L1D. If you run this simple read test with an array size that fits in L1, then you are just loading a register's worth of data per load (this messes up the bw calculation).
The CPU can speculatively execute 2 loads per cycle even with the dependency, and just be careful to update 'j' in order.
You can use VTune to count the # of loads executed per cycle.
6) Yes, the out-of-order engine and prefetchers kick in to keep multiple loads outstanding.
7) Yes. The updates to 'j' will execute in order.
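As an aside, a standard way to relax the `j += ...` dependence chain discussed above is to split the sum across several independent accumulators so the adds from different chains can overlap in the out-of-order core. This is my own sketch, not something from the thread:

```c
#include <stddef.h>

/* Four independent accumulators break the single j += chain,
 * letting loads and adds from different chains overlap.
 * Touches one byte per 64-byte cache line; assumes n is a
 * multiple of 256 (4 cache lines per iteration). */
long sum_strided(const char *a, size_t n)
{
    long j0 = 0, j1 = 0, j2 = 0, j3 = 0;
    for (size_t i = 0; i < n; i += 256) {
        j0 += a[i];
        j1 += a[i + 64];
        j2 += a[i + 128];
        j3 += a[i + 192];
    }
    return j0 + j1 + j2 + j3;
}
```

For a memory-bound read test the prefetchers usually hide most of the chain latency anyway, but for L1/L2-resident data the multi-accumulator version exposes more load parallelism.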

I know the 'Intel Inside' is a jest but I'd prefer 'Intel inside; genius outside'. The chips are tools. Without the great software creators and users of the software the tools are useless.

Pat

intel_fast_memcpy() (as well as recent glibc implementations) should have a threshold where it chooses nontemporal over a certain size.

Here is the sample test that transfers 40MB of data from src to dest. I memset both src and dest (which are global arrays) to 0 and 1 respectively before the test begins, so there is no loss of bandwidth due to page faults. Why is 40MB not enough for "intel_fast_memcpy" to use non-temporal stores? I think something is amiss here. We are using the latest eval copy from the Intel website, so I believe I have the latest version.

sarnath@SandyBridge:~/intel_forums$ cat Makefile
gcc:
	g++ -O3 -o memcpy memcpy.cpp -lrt
icc:
	icpc -O3 -xSSE4.2 -o memcpy memcpy.cpp -lrt
	icpc -O3 -xSSE4.2 -o memread memread.cpp -lrt
sarnath@SandyBridge:~/intel_forums$ make gcc && ./memcpy && make icc && ./memcpy
g++ -O3 -o memcpy memcpy.cpp -lrt
Data = 83.89MB, time= 0.009030, MB/sec (RW Bandwidth) = 9289.441357
icpc -O3 -xSSE4.2 -o memcpy memcpy.cpp -lrt
icpc -O3 -xSSE4.2 -o memread memread.cpp -lrt
Data = 83.89MB, time= 0.012870, MB/sec (RW Bandwidth) = 6518.093703
sarnath@SandyBridge:~/intel_forums$

Thanks for your time on this, Best Regards, Sarnath

Hello Pat,

Thanks for all your clarifications! It has been great to have your support in the forum!
It is a great help to developers,

btw, we are using Ubuntu 10.04 and Intel VTune (eval) will not even install; it takes an exception and stops..
Is Ubuntu supported?

Best Regards,
Sarnath

I've reported this to the compiler team.
They are looking into it. I'll let you know what becomes of it.
Thanks,
Pat

Thanks for that!

The fastMemcpy() code that I had written runs slightly faster when compiled with "g++" than with "icpc" :-(

Hi Sarnath,
Intel VTune Amplifier XE is supported on Ubuntu* 10.10, 11.04 and 11.10 currently.
Thanks,
Shannon