Intel® Developer Zone:


Just published! Intel® Xeon Phi™ Coprocessor High Performance Programming 
Learn the essentials of programming for this new architecture and new products. New!
Intel® System Studio
The Intel® System Studio is a comprehensive integrated software development tool suite solution that can Accelerate Time to Market, Strengthen System Reliability & Boost Power Efficiency and Performance. New!
In case you missed it - 2-day Live Webinar Playback
Introduction to High Performance Application Development for Intel® Xeon & Intel® Xeon Phi™ Coprocessors.
Structured Parallel Programming
Authors Michael McCool, Arch D. Robison, and James Reinders uses an approach based on structured patterns which should make the subject accessible to every software developer.

Deliver your best application performance for your customers through parallel programming with the help of Intel’s innovative resources.

Development Resources

Development Tools


Intel® Parallel Studio XE ›

Bringing simplified, end-to-end parallelism to Microsoft Visual Studio* C/C++ developers, Intel® Parallel Studio XE provides advanced tools to optimize client applications for multi-core and manycore.

Intel® Software Development Products

Explore all tools the help you optimize for Intel architecture. Select tools are available for a free 30-day evaluation period.

Tools Knowledge Base

Find guides and support information for Intel tools.

A Parallel Stable Sort Using C++11 for TBB, Cilk Plus, and OpenMP
By Arch D. Robison (Intel)Posted 04/11/20140
This article describes a parallel merge sort code, and why it is more scalable than parallel quicksort or parallel samplesort. The code relies on the C++11 “move” semantics. It also points out a scalability trap to watch out for with C++. The attached code has implementations in Intel® Threading ...
Intel® Software Development Tools 2015 Beta
By Gergana Slavova (Intel)Posted 03/27/20140
Intel® Software Development Tools 2015 Beta What's New in the 2015 Beta This suite of products brings together exciting new technologies along with improvements to Intel’s existing software development tools: Get guidance on how to boost performance safely without creating threading bugs using...
Resource Guide for People Investigating the Intel® Xeon Phi™ Coprocessor
By Taylor Kidd (Intel)Posted 03/25/20140
This article identifies resources for anyone investigating the value to their organization of the Intel® Xeon Phi™ coprocessor, which is based on the Intel® Many Integrated Core (Intel® MIC) architecture. It is one of three such guides, each for people in one of the following specific roles: Adm...
Resource Guide for Intel® Xeon Phi™ Coprocessor Administrators
By Taylor Kidd (Intel)Posted 03/25/20140
This article makes recommendations for how an administrator can get up to speed quickly on the Intel® Many Integrated Core (Intel® MIC) Architecture. This article is 1 of 3: For the Administrator, for the Developer, and for the Investigator. Someone who will administer and support a set of machi...


Subscribe to
Go Parallel 2
By Dmitry VyukovPosted 04/13/20140
Parallel programming with Go language (golang). The blog shows examples of parallel divide-and-conquer decomposition and parallel pipelines.
Performance BKMs: Introduction and Super-secret Intel Tools
By Taylor Kidd (Intel)Posted 03/27/20140
At SC13 (Super Computing 2013)*, someone commented that Intel seems to have some super-secret set of tricks in its pocket, allowing us to optimize “far beyond those of mortal man”+. We don’t really have any super-secret tricks. Even if we did, we wouldn’t use them. We want mortal man (you) to be ...
Notification: Update to Resource Guides for Developer and Administrator published
By Taylor Kidd (Intel)Posted 03/26/20140
Hi all, I just wanted to let whoever is listening that I just published updates to the Resource Guide for Intel® Xeon Phi™ Coprocessor Developers and Resource Guide for Intel® Xeon Phi™ Coprocessor Administrators documents. -- Taylor  
BKMs on the use of the SIMD directive
By Taylor Kidd (Intel)Posted 03/25/20140
We had an ask from one of the various “Birds of a Feather” meetings Intel® holds at venues such as at the Super Computing* (SC) and International Super Computing* (ISC) conferences. The customer wanted to know BKMs (Best Known Methods) on the proper usage of the new OpenMP* 4.0 / Intel® Cilk™ Plu...


Subscribe to Intel Developer Zone Blogs
Intel® Parallel Studio XE SP1 & Intel® Cluster Studio XE SP1
By kathy-farrel (Intel)0
Intel® Parallel Studio XE SP1 & Intel® Cluster Studio XE SP1 - What's New - Webinar Tuesday, September 17 9am PDT Please join us for a technical presentation on the new features found in the recently released Intel® Parallel Studio XE 2013 SP1 Intel® Cluster Studio XE SP1. This release includes support for compilers and performance analysis on Intel® Xeon Phi™ on Windows*. The technical presentation will briefly cover new features for both C++ and Fortran on Linux*, Windows*, and OS X* operating systems as well as error checking and performance profiling tools. Learn how to efficiently boost your application performance! Not too late! - Register Now  Learn about Upcoming Webinars
Responsive OpenMP Theads in Hybrid Parallel Environment
By Don K.1
I have a Fortran code that runs both MPI and OpenMP.  I have done some profiling of the code on an 8 core windows laptop varying the number of mpi  tasks vs. openmp threads and have some understanding of where some performance bottlenecks for each parallel method might surface.  The problem I am having is when I port over to a Linux cluster with several 8-core nodes.  Specifically, my openmp thread parallelism performance is very poor.  Running 8 mpi tasks per node is significantly faster than 8 openmp threads per node (1 mpi task), but even 2 omp threads + 4 mpi tasks runs was running very slowly, more so than I could solely attribute to a thread starvation issue.  I saw a few related posts in this area and am hoping for further insight and recommendations in to this issue.  What I have tried so far ... 1.  setenv OMP_WAIT_POLICY active      ## seems to make sense 2.  setenv KMP_BLOCKTIME 1          ## this is counter to what I have read but when I set this to a large number (2500...
Optimizing cilk with ternary conditional
By Fabio G.3
What is the best way to optimize the cycle cilk_for(i=0;i<n;i++){ x[i]=x[i]<0?0:x[i]; }or somethings like that? Thanks, Fabio
have asked them to
By Robert P.0
ICC t20 World Cup 2014 Live StreamIndia vs Pakistan Live Stream
Optimizing reduce_by_key implementation using TBB
By Shruti R.0
Hello Everyone, I'm quite new to TBB & have been trying to optimize reduce_by_key implementation using TBB constructs. However serial STL code is always outperforming the TBB code! It would be helpful if I'm given an idea about how reduce_by_key can be improvised using tbb::parallel_scan. Any help at the earliest would be much appreciated. Thanks.
reading a shared variable
hello everyone I am relatively new to parallel programming and have the following doubt:- is reading a shared variable(that is not modified by any thread) without using locks a good practice thanks for the help in advance  
Weird Openmp bug
By Cheng C.1
Dear all, I want to combine OpenMP and RSA_public_encrypt and RSA_private_decrypt routines. However, I was confused by a weird bug for a few days.    In the attached program, if I generated 2 threads for parallel encryption and decryption, everything works well. If I generated 3 or more threads, the RSA_public_encrypt routine works fine. All strings are successfully encrypted (encrypt_len=256). However, the RSA_private_decrypt routine went wrong, that is, only one thread works properly, all the other threads failed to decrypt some of the strings (decrypt_len=-1, rsa_eay_private_decrypt padding check failed). If there are 1000 strings and 4 threads, the total number of string failed to decrypt went around 710 (some times as low as around 200). So as expected, if I use 4 threads for parallel RSA_public_encrypt and one thread for RSA_private_decrypt, nothing went wrong.   It would be great if you could give some ideas. Thanks very much.    #include <openssl/rsa.h> #include <...
performance loss
By Bo W.8
Hi, some interesting performance loss happened with my measurements. I have a system with two sockets, each socket is a E5-2680 processor. Each processor has 8 cores and with hyper-threading. The hyper-threading was ignored.  On this system, I started a program 16 times at the same time and each time pinned the program to different cores. At first, i set all cores to 2.7GHz and saw : Program 0 Runtime 7.7s Program 8 Runtime 7.63s And then, i set  cores on the second socket  to 1.2GHz and saw: Program 0 Runtime 12.18s Program 8 Runtime 15.73s The program 8 ran slower. It is clear, because core 8 had lower frequency. But why was program 0 also slower? Its frequency wasn't touched.   Regards, Bo


Subscribe to Forums