Intel® Developer Zone:
Performance

Highlights

Just published! Intel® Xeon Phi™ Coprocessor High Performance Programming 
Learn the essentials of programming for this new architecture and these new products.
Intel® System Studio
Intel® System Studio is a comprehensive, integrated software development tool suite that can accelerate time to market, strengthen system reliability, and boost power efficiency and performance.
In case you missed it - 2-day Live Webinar Playback
Introduction to High Performance Application Development for Intel® Xeon & Intel® Xeon Phi™ Coprocessors.
Structured Parallel Programming
Authors Michael McCool, Arch D. Robison, and James Reinders use an approach based on structured patterns, which should make the subject accessible to every software developer.

Deliver your best application performance for your customers through parallel programming with the help of Intel’s innovative resources.

Development Resources


Development Tools


Intel® Parallel Studio XE ›

Bringing simplified, end-to-end parallelism to Microsoft Visual Studio* C/C++ developers, Intel® Parallel Studio XE provides advanced tools to optimize client applications for multicore and many-core systems.

Intel® Software Development Products

Explore all tools that help you optimize for Intel architecture. Select tools are available for a free 30-day evaluation period.

Tools Knowledge Base

Find guides and support information for Intel tools.

Advanced Optimizations for Intel® MIC Architecture, Low Precision Optimizations
By AmandaS (Intel) | Posted 11/25/2013 | 0 comments
Compiler Methodology for Intel® MIC Architecture: Advanced Optimizations for Intel® MIC Architecture, Low Precision Optimizations. Overview: The latest Intel Compilers (released after the 13.0.039 Beta Update 1 release) do not generate low-precision sequences unless low-precision options are adde...
OpenMP Related Tips
By AmandaS (Intel) | Posted 11/25/2013 | 0 comments
Compiler Methodology for Intel® MIC Architecture: OpenMP Related Tips. OpenMP* Loop Collapse Directive: Use the OpenMP collapse clause to increase the total number of iterations that will be partitioned across the available number of OMP threads by reducing the granularity of work to be done by e...
Profiling OpenMP* applications with Intel® VTune™ Amplifier XE
By Kirill Rogozhin (Intel) | Posted 11/13/2013 | 0 comments
Parallelism delivers the performance that High Performance Computing (HPC) requires. The parallelism runs across several layers: superscalar execution, vector instructions, threading, and distributed memory with message passing. OpenMP* is a commonly used threading abstraction, especially in HPC. Many HPC ...
Intel® SDK for OpenCL* Applications - Performance Debugging Intro
By Maxim Shevtsov (Intel) | Posted 11/08/2013 | 2 comments
To the Intel® OpenCL SDK page. Table of Contents: 1. Host-Side Timing; 2. Wrapping the Right Set of Operations; 3. Profiling Operations Using OpenCL Profiling Events; 4. Comparing OpenCL Kernel Performance with Performance of Native Code; 5. Getting Credible Performance Numbers; 6. Using Tools. Download...

Intel® Advisor XE 2013 Update 4 is now available as part of Studio XE 2013 SP1 packages
By RAVI (Intel) | Posted 09/04/2013 | 0 comments
Intel® Advisor XE 2013 Update 4 was released in late July 2013 and is now available as part of the following Studio XE 2013 SP1 packages: Intel® Parallel Studio XE 2013 SP1, Intel® C++ Studio XE 2013 SP1, Intel® Fortran Studio XE 2013 SP1, and Intel® Cluster Studio XE 2013 SP1. Please downlo...
Intel is Number 1 with a Milky Way
By Clay Breshears (Intel) | Posted 07/29/2013 | 2 comments
No, not the candy bar (though I could really go for a Milky Way Midnight bar right now). I'm thinking of the Milky Way 2 (Tianhe-2) computer system at the National Supercomputing Center in Guangzhou, China. This machine incorporates 32,000 12-core Intel Xeon processors (E5-2600 v2) and 48,000 Int...
Fun with Intel® Transactional Synchronization Extensions
By Wooyoung Kim (Intel) | Posted 07/25/2013 | 0 comments
By now, many of you have heard of Intel® Transactional Synchronization Extensions (Intel® TSX). If you have not, I encourage you to check out this page (http://www.intel.com/software/tsx) before you read further. In a nutshell, Intel TSX provides transactional memory support in hardware, making t...
AVX-512 instructions
By James Reinders (Intel) | Posted 07/23/2013 | 14 comments
Intel® Advanced Vector Extensions 512 (Intel® AVX-512) The latest Intel® Architecture Instruction Set Extensions Programming Reference includes the definition of Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions. These instructions represent a significant leap to 512-bit SIMD su...

Cache-coherence traffic
By aminer10 | 1 reply
Hello, I have come to an interesting subject. In computer science we calculate the complexity to give an idea of how good or bad an algorithm is; it's the same with locks: you have to do some calculation for lock algorithms to give an idea of how good or bad the lock is with respect to cache-coherence traffic. So follow with me please. If you take a look at the source code of my scalable distributed fair lock (you can download the source code at: http://pages.videotron.com/aminer/), inside LW_DFLOCK.pas, right in "procedure TDFLOCK.Enter;", you will read this: == if ((FCount3^.FCount3=2) and CAS(FCount2.FCount2,1,0)) then begin myobj^.myid:=-1; break; end; == As you may have noticed, if FCount2.FCount2 has changed, this will generate N (N is the number of threads) cache-line misses and cache-line transfers. But you have to be smart please: in a contention scenario, since we are looping around: "if ((FCount3^.FCount3=2) and (FCount1^[myid].FCount1=myobj^.count)) then break...
Lockfree or not to lockfree...
By aminer10 | 2 replies
Hello, Lockfree or not to lockfree, that's the question... I have read here and there that lockfree algorithms are hard, but that's not always the case. If you take a look at the following FIFO queue, which is waitfree on the push() and lockfree on the pop() (http://pastebin.com/f72cc3cc1), you will read this: === public bool pop(out T state) { node cmp, cmptmp = m_tail; do { cmp = cmptmp; node next = cmp.m_next; if (next == null) { state = default(T); return false; } state = next.m_state; cmptmp = System.Threading.Interlocked.CompareExchange(ref m_tail, next, cmp); } while (cmp != cmptmp); return true; } }; === Lockfree is easy. You will notice that between the "cmp = cmptmp;" and the "cmptmp = System.Threading.Interlocked.CompareExchange(ref m_tail, next, cmp);" it's as if we had a critical section, because if, for example, two threads crossed the "cmp = cmptmp;", one of them will succeed and ano...
omp_get_thread_num() returns random values in the parallel region
By MooN K. | 3 replies
Hello OpenMP professionals, I'm working in a parallel region with OpenMP, and I get the thread IDs in a random order. For example, with the number of threads = 4, I get Thread's ID = 1, Thread's ID = 3, Thread's ID = 2, Thread's ID = 0, and on another execution I get another order. How can I get the IDs in order, i.e. 0, 1, 2, 3? Any satisfactory answer would be welcome. Here is my code: int nThread = omp_get_max_threads(); #pragma omp parallel num_threads(nThread) { int myID = omp_get_thread_num(); printf("Thread's ID %d \n", myID); } Thanks for your reply
Here is my proof
By aminer10 | 7 replies
Hello, I have noticed that Robert Wessel didn't understand my ideas... So I will prove it to you right now; follow with me carefully please... Let's say we have 4 threads and 4 cores, and let's say that each thread is running the same parallel code, and that we also have some serial code inside a critical section. Let's say the serial fraction is 0.1 and the parallel fraction is 0.9, so the serial part takes 1 second and the parallel part takes 9 seconds. What I have tried to explain to you is that Amdahl's law or equation is not a correct law and it doesn't give correct results. Here is why: if the 4 threads are all looping again and again, running the same parallel code and the same serial code, and they are contending at the same time for the critical section, this is why I have called it an ideal contention scenario. So if they are contending AT THE SAME TIME for the critical section (th...
Amdahl equation and scalability
By aminer10 | 0 replies
Hello, I have come to another interesting subject: the Amdahl equation, which is equal to 1/(S + P/N) (S: the fraction of the serial part; P: the fraction of the parallel part; N: the number of cores). So we have to be smart, so follow with me. I have read in some documents something that looks like this: if the serial fraction is 0.1 and the parallel fraction is 0.9 and we have 4 cores, then the Amdahl equation will equal 1/(0.1 + 0.225) = 3X, so this will scale to 3X. But I don't agree with this, because I think this Amdahl equation does not give a correct picture. So imagine that the serial part takes 1 second and the parallel part 9 seconds, which means S = 0.1 and P = 0.9; with 4 cores you will say that this will run four P parts in 9 seconds and four S parts in 4 seconds, which equals 13 seconds, but the serial part will run in 4*10 seconds, so the scalability will equal 40 seconds divided by 13 seconds, so this will scale to 3X; this is exactly what I have found with the Amdahl's equation. But ...
Scalable Distributed Fair Lock 1.0 is here
By aminer10 | 0 replies
Hello, Scalable Distributed Fair Lock 1.0 is here. I have put it back on my website because, even though it uses a Ticket mechanism, the Ticket mechanism performs well up to a number of threads 3X the number of cores, and I have benchmarked it to confirm that, so the Ticket mechanism is still very useful. Author: Amine Moulay Ramdane. Description: A scalable distributed lock; it is as fast as the Windows critical section and it is fair when there is contention, so I think it avoids starvation and it reduces the cache-coherence traffic. My scalable and distributed lock now uses a Ticket mechanism, but it is more efficient than the Ticket Spinlock because it's distributed, and hence the threads spin on local variables, which reduces the cache misses and the cache-coherence traffic. I have not used a waitfree queue to implement my Scalable Distributed Lock, but you can use a Wai...
A test, please ignore
By aminer10 | 0 replies
A test, please ignore.
Power Consumption varies on Sandy-Bridge with SMT enabled/disabled
By Leonard P. | 3 replies
I've found a strange phenomenon with Sandy Bridge processors and am curious whether anyone can explain why it occurs. If you enable SMT in the BIOS, power consumption when idle (running sleep) increases by 0.3%-1% (depending on the system). These measurements come from running on dual-socket Sandy Bridge stand-alone servers and rack-based servers using a Watts Up Pro device. (Turbo is disabled in both cases.) Are additional resources being powered when SMT is enabled in the BIOS? When running parallel benchmarks (NAS Parallel OpenMP), but only using as many threads as physical cores (16 in this case: 2 sockets, each an 8-core Sandy Bridge), I see a power increase of 2-3% with SMT enabled (in the BIOS) over SMT disabled. Using Intel's performance monitoring tools (pcm.x), I see L2 and L3 misses increase with SMT enabled over SMT disabled for some benchmarks. Using a custom tool for reading performance counters with PAPI, I find icache misses also increase sizably for ...

