Intel® Developer Zone:


Just published! Intel® Xeon Phi™ Coprocessor High Performance Programming 
Learn the essentials of programming for this new architecture and new products.
Intel® System Studio
The Intel® System Studio is a comprehensive integrated software development tool suite solution that can Accelerate Time to Market, Strengthen System Reliability & Boost Power Efficiency and Performance.
In case you missed it - 2-day Live Webinar Playback
Introduction to High Performance Application Development for Intel® Xeon & Intel® Xeon Phi™ Coprocessors.
Structured Parallel Programming
Authors Michael McCool, Arch D. Robison, and James Reinders use an approach based on structured patterns that should make the subject accessible to every software developer.

Deliver your best application performance for your customers through parallel programming with the help of Intel’s innovative resources.

Development Resources

Development Tools


Intel® Parallel Studio XE

Bringing simplified, end-to-end parallelism to Microsoft Visual Studio* C/C++ developers, Intel® Parallel Studio XE provides advanced tools to optimize client applications for multi-core and manycore.

Intel® Software Development Products

Explore all the tools that help you optimize for Intel architecture. Select tools are available for a free 30-day evaluation period.

Tools Knowledge Base

Find guides and support information for Intel tools.

Intel Cluster Ready FAQ: Software vendors (ISVs)
By Werner Krotz-vogel (Intel) Posted 03/23/2015 0
Q: Why should we join the Intel Cluster Ready program? A: By offering registered Intel Cluster Ready applications, you can provide the confidence that applications will run as they should, right away, on certified clusters. Participating in the program will help you increase application adoption, e...
Intel Cluster Ready FAQ: Hardware vendors, system integrators, platform suppliers
By Werner Krotz-vogel (Intel) Posted 03/23/2015 0
Q: Why should we join the Intel® Cluster Ready program? A: By offering certified Intel Cluster Ready systems and certified components, you can give customers greater confidence in deploying and running HPC systems. Participating in the program will help you drive HPC adoption, expand your custom...
Intel Cluster Ready FAQ: Customer benefits
By Werner Krotz-vogel (Intel) Posted 03/23/2015 0
Q: Why should we select a certified Intel Cluster Ready system and registered Intel Cluster Ready applications? A: Choosing certified systems and registered applications gives you the confidence that your cluster will work as it should, right away, so you can boost productivity and start solving n...
Dynamic allocator replacement on OS X* with Intel® TBB
By Kirill Rogozhin (Intel) Posted 03/23/2015 0
The Intel® Threading Building Blocks (Intel® TBB) library provides an alternative way to dynamically allocate memory - Intel TBB scalable allocator (tbbmalloc). Its purpose is to provide better performance and scalability for memory allocation/deallocation operations in multithreaded applications...
Web Resources about Intel® Transactional Synchronization Extensions
By Roman Dementiev (Intel) Posted on 07/28/14 3
Short URL for this page: In this blog I list useful technical resources related to Intel® Transactional Synchronization Extensions (Intel TSX). I will try to keep the list up-to-date as new material becomes available (subscribe to this page below to get update notifica...
Additional AVX-512 instructions
By James Reinders (Intel) Posted on 07/17/14 1
Additional Intel® Advanced Vector Extensions 512 (Intel® AVX-512) The Intel® Architecture Instruction Set Extensions Programming Reference includes the definition of additional Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions. As I discussed in my first blog about Intel® AVX-...
Using Intel® TSX with VTune(TM) Amplifier XE 2015 Beta to measure transaction time & abort in your code?
By Peter Wang (Intel) Posted on 07/12/14 2
When the user develops multithreaded applications, the user should protect critical (sensitive) code areas called by threads, so that threads access shared memory without data conflicts. Most of the time, the user might use critical_section, mutex, semaphore, atomic, events, or other “locks” to protect crit...
Compete And Win A Prize With The New Intel® CnC!
By Frank Schlimbach (Intel) Posted on 07/10/14 0
A new version of Intel® Concurrent Collections for C++ (CnC) has been released. We are celebrating its coming out to open source with a programming contest, which will have its showdown at the 6th annual CnC workshop. The organizers call on individuals and small teams to compete for a significant...
By Mayur B. 0
Hello everyone, I want to solve a sparse matrix (for solving linear equations) in minimum time. I am currently using the "pardiso" function from the Intel MKL library (version 10.3), but it takes too long. Is there any other function available in the latest version that is faster? Could you please help me? Thanks in advance. Mayur
My new invention: Scalable distributed sequential lock
By aminer102
Hello, Scalable Distributed Sequential lock version 1.01 Author: Amine Moulay Ramdane. Email: Description: This scalable distributed sequential lock was invented by Amine Moulay Ramdane, and it combines the characteristics of a distributed reader-writer lock with the characteristics of a sequential lock, so it is a clever hybrid reader-writer lock that is more powerful than Dmitry Vyukov's distributed reader-writer mutex, because Dmitry Vyukov's distributed reader-writer lock will become slower and slower on the writer side with more and more cores because it transfers too many cache-lines between cores on the writer side, so my invention, that is, my scalable distributed sequential lock, has eliminated this weakness of Dmitry Vyukov's distributed reader-writer mutex, so that the writers' throughput has become very fast, and my scalable distributed sequential lock eliminates the weaknesses of the Seqlock (sequential lock) that is "live...
interlocked or not interlocked?
By Rudolf M. 7
I'm using an InterlockedCompareExchange to set a variable to my id (something like "while(0 != InterlockedCompareExchange(&var, myId, 0)) ::Sleep(100);" ) now... no other thread will change this variable until it becomes 0 again... after using it, I could do an "InterlockedExchange(&var, 0);" or simply "var = 0;" ... I'm not sure, but I think this doesn't change much... which one is the better solution? which one is the faster? ... or is one even wrong? ... I thought the second one could be the faster one when I don't expect to see a lot of threads trying to "take" this variable at the same time... is that correct?
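The acquire/release pattern described in this question can be sketched in portable C++11 with `std::atomic` instead of the Win32 Interlocked functions. This is only an illustration under stated assumptions, not an answer taken from the thread: the names `owner`, `acquire`, `release`, and `run_demo` are hypothetical, and the loop yields rather than sleeping for 100 ms as the post does. It shows that the unlock step does not need a read-modify-write operation; a store with release ordering is enough, provided the variable is declared atomic so that the store is not a data race.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical stand-in for the post's `var`: 0 means free, otherwise the owner's id.
std::atomic<long> owner{0};
long shared_counter = 0;  // plain variable, only touched while `owner` is held

// Analogue of looping on InterlockedCompareExchange(&var, myId, 0).
void acquire(long my_id) {
    long expected = 0;
    while (!owner.compare_exchange_weak(expected, my_id,
                                        std::memory_order_acquire,
                                        std::memory_order_relaxed)) {
        expected = 0;               // a failed exchange overwrote it with the current value
        std::this_thread::yield();  // the post sleeps here instead
    }
}

// Unlock with a release store rather than an atomic exchange.
void release() {
    owner.store(0, std::memory_order_release);
}

// Each thread increments the shared counter `iterations` times under the lock.
long run_demo(int threads, int iterations) {
    shared_counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([t, iterations] {
            for (int i = 0; i < iterations; ++i) {
                acquire(t + 1);  // ids start at 1 because 0 means "free"
                ++shared_counter;
                release();
            }
        });
    }
    for (auto& th : pool) th.join();
    return shared_counter;
}
```

Calling `run_demo(4, 1000)` should return 4000 if the lock excludes concurrent increments. Whether a plain `var = 0;` on a non-atomic variable behaves the same way depends on the compiler and platform, so the sketch deliberately avoids relying on it.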
OpenMP Block gives false results
By Jack S. 1
Hi all, I would appreciate your point of view on what I might have done wrong using OpenMP. I parallelized this code in a pretty straightforward way, yet even with a single thread (i.e., call omp_set_num_threads(1)) I get wrong results. I have checked with Intel Inspector, and I do not have a race condition, yet the Inspector tool indicated (as a warning) that a thread might access another thread's stack (I have this warning in other code of mine, and it runs well with OpenMP). I'm pretty sure this is not related to the problem. Thanks, Jack. SUBROUTINE GR(NUMBER_D, RAD_D, RAD_CC, SPECT) use TERM,only: DENSITY, TEMPERATURE, VISCOSITY, WATER_DENSITY, & PRESSURE, D_HOR, D_VER, D_TEMP, QQQ, UMU use SATUR,only: FF, A1, A2, AAA, BBB, SAT use DELTA,only: DDM, DT use CONST,only: PI, G IMPLICIT NONE INTEGER,INTENT(IN) :: NUMBER_D DOUBLE PRECISION,INTENT(IN) :: RAD_CC(NUMBER_D), SPECT(NUMBER_D) DOUBLE PRECISION,INTENT(INOUT) :: RAD_D(NUMBER_D) DOUBLE PRECISION :: R3, ...
OpenMP 4.0 task depend too limited would TBB be better?
By Nicholas B. 0
Hello, I have been looking at task depend in OpenMP 4.0, but it looks like it is too limited for what I want to do. To do what I want, it would need to take a vector subscript in the array section in the depend clause. My code would look something like this: type cell_type ... contains procedure :: process end type cell_type type(cell_type), dimension(n) :: cells type edge_type integer, dimension(:), allocatable :: icells ... contains procedure :: process end type edge_type type(edge_type), dimension(m) :: edges ! a bit like a c++ std::vector<std::vector<int>> edges(1)%icells = [1, 5, 7, 8, 100] ! edge 1 depends on cells 1, 5, 7, 8 and 100 edges(2)%icells = [1, 2, 4] ! edge 2 depends on cells 1, 2 and 4 ... do i=1,n !$omp task depend(out:cells(i)) call cells(i)%process() !$omp end task end do do j=1,m ! next line not allowed !$omp task depend(out:edges(j)) depend(in:cells(edges(j)%icells)) call edges(j)%process(cells) !$omp end task end d...
Nested OMP on Xeon Phi using OMP4
By james B. 3
Xeon Phi has 60 cores and 4 threads per core. I am writing an experiment that will have 1 master thread on each core, and each of these will spawn 4 slave threads. Looking at the manual, it seems that I want to set the envars: MIC_OMP_NESTED=TRUE MIC_OMP_PROC_BIND="spread, close" MIC_OMP_NUM_THREADS=60 Is this correct? I've tested this and it doesn't die... Is there a way I can get the runtime to spit out affinity debug info about where it is actually placing things so I can be certain? Cheers, James
Slowdown with OpenMP
By Matt S. 11
I'm getting some pretty unusual results from using OpenMP on a fractional differential equations code written in Fortran. No matter where I use OpenMP in the code, whether it be on an initialization loop or on a computational loop, I get a slowdown across the entire code. I can put OpenMP in one loop and it will slow down an unrelated one (timed separately)! The code is a bit unusual, as it initializes arrays starting at 0 (and some even negative). For example, real*8 :: gx(0:Nx) real*8 :: AxLh(1-Nx:Nx-1), AxRh(1-Nx:Nx-1), AxL0(1-Nx:Nx-1), AxR0(1-Nx:Nx-1) Where Nx is, let's say, 512. Would that possibly have anything to do with the ubiquitous slowdown with OpenMP? Also, any ideas on reducing "pow" overhead in the following snippet would be greatly appreciated do k = 1, 5 hgck = foo_c(k) hgpk = foo_p(k) do j = 1, 100 vx = vx + hgck * ux(x, t, foo(j) + hgpk) end do end do where ux is a function defined by function ux(x,t,xi) impl...
web crawling through "Intel Xeon Phi Coprocessors"
By Sunil K. 1
I am new to this forum. I want to implement parallel crawling on "Intel Xeon Phi Coprocessors" for my project. Before buying equipment, installing software, and starting to learn about this platform, I want to know whether it is possible to somehow connect to the network and fetch web URLs in parallel using this technology. (I don't want to create a cluster of CPUs to do it. I want to do it using a single card.)