Q: What are the Clock speed and IPC improvements for IVB-EP?
A: There is no IPC change for IVB-EP. The performance improvements over SNB-EP, such as a greater number of cores, higher QPI speed, larger L3 cache, atomic locks on PCIe traffic, support for 1867 MHz memory, support for the Digital Random Number Generator (DRNG) and more, are listed in these slides. More architectural details can be found in the Intel Software Optimization Guides found here.
Q: Where can I find a link to the SW developer’s guide for Haswell?
Q: Is there compiler support for Auto-Vectorization via Phi?
A: Yes, various levels of auto-vectorization are supported for both Xeon and Phi. The vec-report switches report which loops were vectorized and which were not. Another great source of information is the set of optimization manuals at http://software.intel.com/mic-developer
Q: Interest in using/testing with Intel PCM tool – how to get started?
A: Here is a link to the PCM tool, which can be used to investigate the memory power savings issue on Nehalem/Westmere and Sandy Bridge (SNB) boxes: http://software.intel.com/en-us/articles/intel-performance-counter-monitor-a-better-way-to-measure-cpu-utilization This link describes the various ways to use the tool. The Download button is at the very end of the page, just before the comments. To build the tool, all you need to do is run make. "pcm.x 1" will update the L1 and L2 cache counters every second; pcm.x without arguments will show the help menu. "pcm-power.x 1" will show you the IMC stats; you can use the -m "arg" option for various ranks on the different sockets. pcm-power.x with no arguments will show the help menu. To confirm that memory power savings is turned off, we need to see that the memory in ranks 0 and 1 on both sockets is not in the CKE residency state, i.e. CKE residency states are at 0%.
Another useful tool, "pcm-memory.x", will give you the memory read and write stats. pcm-power and pcm-memory will only run on SNB, as NHM does not have these counters; pcm.x can be used on the NHM box. On SNB you may need to enable NON POR devices under PPM in order to see IMC and QPI stats. To see the IMC and QPI counters, you need to run as root to access the MSR counters.
Q: Intel DPDK interest, any opportunity to get that software?
A: The current stable release is a free download on intel.com (http://intel.com/go/dpdk)
Q: Interest in SSE 4.2 String and Text New Instructions?
A: SSE 4.2 instructions: Useful for XML, string and text processing, matching and hashing: http://software.intel.com/sites/default/files/m/0/3/c/d/4/18187-d9156103.pdf
Please contact Intel Engineer An Le for new string-to-double and string-to-float functions. These are valuable for FIX string and XML processing workloads.
http://software.intel.com/en-us/articles/xml-parsing-accelerator-with-intel-streaming-simd-extensions-4-intel-sse4
An example using a hash-like approach to find substrings in a string: http://software.intel.com/en-us/forums/topic/279387
There are some experimental new String instructions that may help improve performance. Details are found in the Compiler documents that come with Intel’s Compiler.
Q: Can we have some detailed doc about the Sandy/Ivy architecture (MMU, ALU, registers, bus, IO paths…)?
A: The best sources of information are the Intel Architecture and Software Optimization manuals for developers, found at these sites: http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html and http://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf
Other good sources are articles on the Intel® Developer Zone and the public forums that are regularly monitored by Intel folks:
Also, some site representatives attend talks given at Intel’s IDF forums and they write architectural reviews on their sites. Sometimes I find the explanations in their articles to be easier to comprehend, although they are not reference material. Here is such an example: http://www.realworldtech.com/sandy-bridge/
Q: What is the number of cycles taken to reach caches L1, L2, L3, memory for several sizes of blocks?
A: We do not publish cache latencies for various block sizes as this is dependent upon hardware and software prefetching and other optimizations. Typically, cache latencies for Sandy Bridge are 4 cycles for L1 instruction and data, approx. 12 cycles for L2 and about 24 cycles for L3.
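To put the typical cycle counts above into wall-clock terms, here is a minimal sketch that converts them to nanoseconds. The 2.0 GHz core clock is an assumption chosen only for illustration, and the names `TYPICAL_LATENCY_CYCLES` and `cycles_to_ns` are hypothetical; real latencies vary with prefetching and frequency, as the answer notes.

```python
# Typical Sandy Bridge latencies quoted in the answer above, in core cycles.
TYPICAL_LATENCY_CYCLES = {"L1": 4, "L2": 12, "L3": 24}


def cycles_to_ns(cycles, core_ghz):
    """Convert a cycle count to nanoseconds at a core frequency given in GHz."""
    if core_ghz <= 0:
        raise ValueError("core_ghz must be positive")
    # One cycle lasts 1 / core_ghz nanoseconds.
    return cycles / core_ghz


# Assumption: a 2.0 GHz core clock, picked only to make the arithmetic easy.
ASSUMED_CORE_GHZ = 2.0

for level, cycles in TYPICAL_LATENCY_CYCLES.items():
    ns = cycles_to_ns(cycles, ASSUMED_CORE_GHZ)
    print(f"{level}: {cycles} cycles is about {ns:.1f} ns at {ASSUMED_CORE_GHZ} GHz")
```

With Turbo enabled the core frequency changes, so the same cycle count maps to a different time; pass the frequency you actually observe.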
Q: How are the L2 and L3 caches divided between instructions and data? (Is it half/half like L1?)
A: L2 is a combined cache for instructions and data. L3 is shared per socket and inclusive: it contains the contents of both the L1 and L2 caches, which are kept coherent through snooping.
Q: How can we disable the branch prediction on a core or for a specific process?
A: There is no clear way to disable branch prediction on our later processors, either for a core or for a specific process. That said, branch prediction has improved considerably on Sandy Bridge thanks to the micro-op cache, loop stream detector, etc. One way to examine branch misprediction for a given core or process is to use VTune, our software profiler, and see whether the process is experiencing significant mispredictions. Low-latency customers do ask for such a switch to minimize latency spikes, and we will get you more information on that possibility. In the meantime, to reduce branch mispredicts, one may consider the use of conditional instructions.
Here is an article that discusses this: http://software.intel.com/en-us/articles/branch-and-loop-reorganization-to-prevent-mispredicts
Q: Recently there was some discussion on perceived slowdown when micro benchmarks were used, and on Sandy Bridge versus Westmere when locking was involved. Can you talk to that?
A: Yes, we published a paper on Lock Analysis on Xeon processors to address this issue: http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
Q: How can we maximize the frequency of a few cores, while greatly reducing that of the others, in a static way (no C/P state changes)?
A: C0 is the active state. The best current way to do this is to disable C-states in the BIOS (C1, C1E, C3, C6) and enable Turbo. Affinitize important threads to a few cores, and leave low-priority worker threads on the other cores. Then use the ondemand or performance scaling governor in Linux. Turbo 2.0 on SNB is very stable, and one should see few to no frequency transitions between the various 100 MHz bins.
Folks have enabled Turbo on machines in colos as well. Use a real-time tool with minimal overhead, such as Intel’s Performance Counter Monitor, to monitor cache misses per core, C-states, and core frequency transition counts, as well as memory residency states, memory bandwidth, etc. http://software.intel.com/en-us/articles/intel-performance-counter-monitor-a-better-way-to-measure-cpu-utilization
Turbostat is another tool from Intel that one can use. It is now part of the kernel git tree; just build it and install it. One can also set appropriate kernel boot parameters in /etc/grub. Speak to your OEM: vendors like Dell, HP and Red Hat have low-latency settings guides, and Dell has DPAT. One can also use Red Hat’s script to keep cores in the C0 state without requiring the idle=poll loop. https://access.redhat.com/knowledge/articles/221153
Here is a good introduction to processor power and CPU states: people.cs.pitt.edu/~kirk/cs3150spring2010/ShiminChen.pptx
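The advice above to affinitize important threads to a few cores can be sketched from user space as follows. This is a minimal illustration, not a tuning recipe: `pin_to_cores` is a hypothetical helper name, it relies on `os.sched_setaffinity`, which Python only provides on Linux, and the choice of core IDs is left to you.

```python
import os


def pin_to_cores(cores, pid=0):
    """Restrict `pid` (0 means the caller) to the given core IDs.

    Core IDs outside the currently allowed set are ignored, so the call
    still works inside a restricted cpuset. Returns the affinity set in
    effect afterwards.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise OSError("os.sched_setaffinity is not available on this platform")
    allowed = os.sched_getaffinity(pid)
    requested = set(cores) & allowed
    if not requested:
        raise ValueError(
            f"none of {sorted(cores)} is in the allowed set {sorted(allowed)}"
        )
    os.sched_setaffinity(pid, requested)
    return os.sched_getaffinity(pid)
```

A latency-critical process might call `pin_to_cores({2})` at startup, while background workers are pinned to the remaining cores; the same effect is available from the shell with taskset.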
Q: Is there a better clock on Sandy Bridge? Seems more accurate?
A: We believe this relates to the invariant TSC. Since Nehalem, the RDTSCP instruction gives you time in cycles that is invariant across cores and even sockets. Use this over gettimeofday() or clock_gettime(). RDTSCP is a semi-serializing instruction (unlike CPUID, which is fully serializing) and has a slightly higher overhead than RDTSC. Please refer to the Software Optimization Manuals for more information.
Q: Can you disable DDIO in the BIOS?
A: Some vendors do provide a switch in their BIOS to disable it. But we do not recommend it.
Q: Can you decide how much cache is ‘reserved’ for DDIO?
A: No. It’s fixed at 10%. However, there is a trick for going beyond 10%. If the destination address is already in L3, then a write to that address doesn’t "count" as part of the 10%. So if you "warm" the cache with addresses you know you will be using, you can go beyond 10%.
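To get a feel for what the fixed 10% means in practice, here is a rough back-of-the-envelope sketch. The 20 MB L3 size and 2 KB receive buffer size are assumptions chosen for illustration, the function names are hypothetical, and treating the 10% as a simple byte budget is itself a simplification of how the hardware allocates cache.

```python
DDIO_PERCENT = 10  # fixed share of L3 quoted in the answer above


def ddio_budget_bytes(l3_bytes):
    """Bytes of L3 that DDIO writes may allocate, treating 10% as a byte budget."""
    if l3_bytes < 0:
        raise ValueError("l3_bytes must be non-negative")
    return l3_bytes * DDIO_PERCENT // 100


def buffers_within_budget(l3_bytes, buffer_bytes):
    """How many I/O buffers of `buffer_bytes` fit inside the DDIO budget."""
    if buffer_bytes <= 0:
        raise ValueError("buffer_bytes must be positive")
    return ddio_budget_bytes(l3_bytes) // buffer_bytes


# Assumptions for illustration only: a 20 MB L3 and 2 KB receive buffers.
ASSUMED_L3_BYTES = 20 * 1024 * 1024
ASSUMED_BUFFER_BYTES = 2048

print("DDIO budget in bytes:", ddio_budget_bytes(ASSUMED_L3_BYTES))
print("2 KB buffers that fit:", buffers_within_budget(ASSUMED_L3_BYTES, ASSUMED_BUFFER_BYTES))
```

Buffers beyond this count can still benefit if their addresses are already resident in L3, which is the cache-warming trick described above.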
Q: What is the eviction algorithm if the cache size is insufficient?
A: It’s the standard Least-Recently Used algorithm. Nothing new or different here.
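For readers unfamiliar with Least-Recently Used replacement, here is a toy software model of the policy. It is only an illustration of the idea: `LRUSet` is a hypothetical class, it models a single fully associative set of line tags, and it makes no claim about how the hardware implements or approximates LRU.

```python
from collections import OrderedDict


class LRUSet:
    """Toy model of one cache set that evicts the least recently used tag."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._lines = OrderedDict()  # oldest entry first, newest last

    def access(self, tag):
        """Touch `tag`; return the evicted tag, or None if nothing was evicted."""
        if tag in self._lines:
            # Hit: mark the line as most recently used.
            self._lines.move_to_end(tag)
            return None
        evicted = None
        if len(self._lines) >= self.capacity:
            # Miss on a full set: drop the least recently used line.
            evicted, _ = self._lines.popitem(last=False)
        self._lines[tag] = True
        return evicted
```

For example, with a capacity of 2, accessing A, B, A and then C evicts B, because A was touched more recently than B.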
Q: There is no cross socket DDIO support today, but that will be added in a future version, correct?
A: Exact details are not available, but it is safe to say "beyond Ivy Bridge-EP."
Q: Will there ever be a version of the Xeon that is unlocked, allowing users to boost the frequency?
A: There is no plan in place to provide unlocked Server parts to customers.
Unlocked Xeon Desktop parts are freely available and HFTs often use them although they do not come with features important for Server use, such as ECC or IPMI, etc.
Q: Any QPI changes with Sandy Bridge? What about an FPGA with QPI? Will QPI be compatible with non-CPU devices?
A: Yes. The transfer rate has increased from 6.4 GT/s to 8 GT/s. Intel shares QPI technology with manufacturers so that they can attach FPGA-type devices to QPI, and a few manufacturers (Altera and Xilinx) are doing so. The right person for more details along these lines is firstname.lastname@example.org. Here is a paper on integrating QPI and FPGA:
Q: Block Diagram for the Haswell data paths?
A: Here is a link to Haswell architecture presentation at IDF 2012
Q: With DDR3, you have 8 bytes per clock, so you could push a whole cache line in 2 clocks. With 4 channels, you get 4x8=32 (64 in 2 clocks). Does this have to do with interleaving?
Q: Can a Phi card do a transfer via PCIe to another card on the PCIe bus?
A: Yes. More details are available at http://software.intel.com/mic-developer
Q: Details on bit manipulation and long integer support in Haswell?
A: The details are at these links. One is a talk at IDF 2012 and the other is an article on IDZ: http://intelstudios.edgesuite.net/idf/2012/sf/aep/ARCS005/ARCS005.html
Q: What would the Haswell microarchitecture look like for things like instruction dispatch, branch prediction algorithm, etc.
A: The two links given above explain the details. In particular: http://intelstudios.edgesuite.net/idf/2012/sf/aep/ARCS005/ARCS005.html
Q: What are Intel’s plans for support for the C++11 specification? Not supported in ICC 11, but curious to understand if that changes in ICC v12 and to what extent?
A: With the latest 13.0, we have added support for additional C++11 features. See the summary of what is supported here: http://software.intel.com/en-us/articles/c0x-features-supported-by-intel-c-compiler/
Q: Can the compiler deal with TSX; assuming hints via pragmas or do TSX (RTM) via assembly?
A: With 13.0, the compiler supports intrinsics and inline asm for TSX (RTM).
Q: In the brief there are references to "Memory bandwidth-bound applications" and "local memory." What are they referring to, processor cache?
A: The Phi card actually has its own local GDDR5 memory, so local memory likely refers to the RAM on the card. Memory bound applications are those that have a bias towards using memory bandwidth and may bottleneck at some point based on platform memory performance maximums.
Q: Can I use any compiler or must they be the Intel compilers? Specifically, we need C++. For one of our potential targets, ICC should be fine. For another, we know that 128-bit integer support is still problematic.
A: The Intel compiler is currently the only one that properly supports heterogeneous compilation for the Xeon and Phi processors. We have good support for C++, and if you are having an issue with the 128-bit integer support, I’d like to open a case to investigate it if one is not already in our system.
Q: What are the hardware counters to examine contested and false sharing on Xeon?
A: The formula for contested accesses on SNB is:
Formula: fraction of cycles spent accessing data modified by another core = (MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS * 60) / CPU_CLK_UNHALTED.THREAD
Thresholds: Investigate if this fraction is greater than 0.05 (i.e. more than 5% of cycles).
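The contested-access formula and threshold above can be applied to raw counter readings with a few lines of code. This is a minimal sketch: the function names are hypothetical, the counter values in the example are made up, and collecting the two counters (for instance with VTune or PCM) is left out.

```python
HITM_PENALTY_CYCLES = 60  # per-event cycle cost used in the formula above
INVESTIGATE_THRESHOLD = 0.05  # fraction of cycles, from the threshold above


def contested_access_fraction(xsnp_hitm, clk_unhalted):
    """Fraction of cycles spent accessing data modified by another core.

    xsnp_hitm:    count of MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS
    clk_unhalted: count of CPU_CLK_UNHALTED.THREAD
    """
    if clk_unhalted <= 0:
        raise ValueError("clk_unhalted must be positive")
    return xsnp_hitm * HITM_PENALTY_CYCLES / clk_unhalted


def needs_investigation(xsnp_hitm, clk_unhalted):
    """True when the fraction exceeds the 0.05 threshold."""
    return contested_access_fraction(xsnp_hitm, clk_unhalted) > INVESTIGATE_THRESHOLD


# Made-up sample readings, for illustration only.
sample_hitm = 1_000_000
sample_clk = 1_000_000_000
print("fraction:", contested_access_fraction(sample_hitm, sample_clk))
print("investigate:", needs_investigation(sample_hitm, sample_clk))
```

Both counters should be sampled over the same interval and for the same thread, otherwise the ratio is not meaningful.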
A link to performance analysis using the counters for SNB-EP using VTune. The counter names are given and can be used to query with PCM code.
A link to a general paper on false sharing for Core i7 processors is provided here.