Intel® Developer Zone:
Performance

Highlights

Just published! Intel® Xeon Phi™ Coprocessor High Performance Programming 
Learn the essentials of programming for this new architecture and new products. New!
Intel® System Studio
Intel® System Studio is a comprehensive, integrated software development tool suite that can accelerate time to market, strengthen system reliability, and boost power efficiency and performance. New!
In case you missed it - 2-day Live Webinar Playback
Introduction to High Performance Application Development for Intel® Xeon & Intel® Xeon Phi™ Coprocessors.
Structured Parallel Programming
Authors Michael McCool, Arch D. Robison, and James Reinders use an approach based on structured patterns, which should make the subject accessible to every software developer.

Deliver your best application performance for your customers through parallel programming with the help of Intel’s innovative resources.

Development Resources


Development Tools


Intel® Parallel Studio XE ›

Bringing simplified, end-to-end parallelism to Microsoft Visual Studio* C/C++ developers, Intel® Parallel Studio XE provides advanced tools to optimize client applications for multi-core and manycore.

Intel® Software Development Products

Explore all tools that help you optimize for Intel architecture. Select tools are available for a free 30-day evaluation period.

Tools Knowledge Base

Find guides and support information for Intel tools.

Billion-Particle SIMD-Friendly Two-Point Correlation on Large-Scale HPC Cluster Systems
By BELINDA L. (Intel) | Posted 10/15/2013
By Jatin Chhugani, Changkyu Kim, Hemant Shukla, Jongsoo Park, Pradeep Dubey, John Shalf and Horst D. Simon Abstract Two-point Correlation Function (TPCF) is widely used in astronomy to characterize the distribution of matter/energy in the Universe, and help derive the physics that can trace...
Efficient Backprojection-based Synthetic Aperture Radar Computation with Many-core Processors
By BELINDA L. (Intel) | Posted 10/15/2013
By Jongsoo Park, Ping Tak Peter Tang, Mikhail Smelyanskiy, Daehyun Kim, Thomas Benson Abstract Tackling computationally challenging problems with high efficiency often requires the combination of algorithmic innovation, advanced architecture, and thorough exploitation of parallelism. We demonst...
Exploring SIMD for Molecular Dynamics, Using Intel® Xeon® Processors and Intel® Xeon Phi™ Coprocessors
By BELINDA L. (Intel) | Posted 10/15/2013
By S. J. Pennycook, C. J. Hughes, M. Smelyanskiy and S. A. Jarvis Abstract We analyse gather-scatter performance bottlenecks in molecular dynamics codes and the challenges that they pose for obtaining benefits from SIMD execution. This analysis informs a number of novel code-level and algorit...
Fast Construction of SAH BVHs on the Intel Many Integrated Core (MIC) Architecture
By BELINDA L. (Intel) | Posted 10/15/2013
By Ingo Wald Abstract We investigate how to efficiently build bounding volume hierarchies (BVHs) with the surface area heuristic (SAH) on the Intel® Many Integrated Core Architecture. To achieve maximum performance, we use four key concepts: progressive 10-bit quantization to reduce cache footpr...

Intel® Transactional Synchronization Extensions (Intel® TSX) profiling with Linux perf
By Andreas Kleen (Intel) | Posted 05/03/2013
Intel TSX exposes a speculative execution mode to the programmer to improve locking performance. Tuning speculation relies heavily on a PMU profiler. This document describes TSX profiling using the Linux perf (or "perf events") profiler, which comes integrated with newer Linux systems. More d...
Code Examples from Xeon Phi Book
By James Reinders (Intel) | Posted 05/01/2013
The code used in the examples (Chapters 2-4) in our book can be downloaded from the book's website. We appreciate attribution, but there are no restrictions on use of the code - please use and enjoy! You can use the step-by-step instructions in the book, or if you prefer, we've included a Makefile for ...
Modern locking
By Andreas Kleen (Intel) | Posted 04/29/2013
Most multi-threaded software uses locking. Lock optimization traditionally has aimed to reduce lock contention, that is, to make the critical regions smaller. In optimized software, this often results in a lot of very small critical regions, protected by many locks. Each critical regio...
Register for Intel® Software Tools Spring Technical Webinar Presentation "Design and prototype scalable threading using Intel® Advisor XE"
By RAVI (Intel) | Posted 04/24/2013
I will be presenting on May 14th at 11am PDT on the following topic: Design and prototype scalable threading using Intel® Advisor XE Please register for this presentation using the following link: https://www1.gotomeeting.com/register/849275177 Here is a short abstract of the presentation: Intel®...

algorithms
By lara h.
Hello, have a look at the following link about parallel partition: http://www.redgenes.com/Lecture-Sorting.pdf I have tried to simulate this parallel partition method, but I don't think it will scale, because we have to do a merge, which is essentially an array-copy operation. These array-copy operations will be expensive compared to the integer compare operations inside the partition function, and still expensive compared to the string compare operations inside the partition function. So, since it doesn't scale, I have abandoned the idea of implementing this parallel partition method in my parallel quicksort. I have also just read the following paper about parallel merging: http://www.economyinformatics.ase.ro/content/EN4/alecu.pdf I implemented this algorithm just to see its performance, and I noticed that the serial algorithm is 8 times slower than the merge function you find in the serial mergesort algorithm. So 8 times slow...
Complexity rank of cache locking
By Klara Z.
Welcome, I know the CPU cycles needed by locking vary, but I need a general picture of how heavy cache locking is. Specifically, for a P6+ chip, what order of magnitude would the number of cycles consumed by LOCK BTS / INC / DEC be, if the operand is in already-cached memory? By order of magnitude I mean: would it be more like 10, or more like 100?
Why does sequential consistency on x86/x86_64 use MOV [addr], reg + MFENCE instead of + SFENCE?
By AlexeyAB
Intel x86/x86_64 systems have three types of memory barriers: LFENCE, SFENCE and MFENCE. The question concerns their use. For sequential consistency (SC), it is sufficient to use MOV [addr], reg + MFENCE for all memory locations requiring SC semantics. However, you could also write it the other way around: MFENCE + MOV reg, [addr]. Apparently the thinking is that, since the number of stores to memory is usually smaller than the number of loads, putting the barrier on the write side costs less in total. On the basis that stores must be made sequential, another optimization was introduced: [LOCK] XCHG, which is probably cheaper because the "MFENCE inside XCHG" applies only to the cache line used by XCHG (a video where, at 0:28:20, it is said that MFENCE is more expensive than XCHG). GCC 4.8.2 uses this approach: LOAD (without fences) and STORE + MFENCE, as written here: http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html C/C++11 Operation x86 implementation Load Seq_Cst: MOV (from memory) ...
OpenMP does not like fmax/fabs
By Jon U.
We have a code that is exhibiting greatly different runtimes between a Fortran and C version. The problem has been isolated to one simple loop: #pragma omp parallel for reduction(max:dt) for(i = 1; i <= NR; i++){ for(j = 1; j <= NC; j++){ dt = fmax( fabs(t[i][j]-t_old[i][j]), dt); t_old[i][j] = t[i][j]; } } Which runs about 12 times slower than the equivalent Fortran loop: !$omp parallel do reduction(max:dt) Do j=1,NC Do i=1,NR dt = max( abs(t(i,j) - told(i,j)), dt ) Told(i,j) = T(i,j) Enddo Enddo !$omp end parallel do Removing the dt assignment eliminates the disparity. Also, running these as serial codes shows no disparity, so the problem is not that the actual C implementation is just so bad. Also, eliminating just the reduction does not close the gap, so it is not the reduction operation itself. All of those tests lead us to the conclusion that there is some terrible interaction between OpenMP and fmax/fabs. Any...
Parallelizing my existing code in TBB please help me with this errors
By Girija B.
Hi, I am new to TBB and working on parallelizing my existing code. I could easily parallelize it with OpenMP, but we need to compare the performance of our code in both TBB and OpenMP after parallelization, so I tried parallelizing the code with TBB. I am getting errors that I am not able to resolve; please help me with them. My code is below, just using a parallel for loop and a lambda function. I have made the serial, OpenMP, and TBB changes; please look at the code and tell me what else I should change for TBB to work. case openmp: { #pragma omp parallel for private (iter, currentDB, db) for (iter = 1; iter < numDB; iter++) { currentDB = this->associateDBs->GetAssociateDB(iter); db = this->dbGroup.getDatabase( currentDB ); GeoRanking::GeoVerifierResultVector resLocal; db->recog( fg, InternalName, resLocal ); LOG(info,omp_get_t...
Selecting custom victim in job scheduling on NUMA systems
By kadir.akbudak
I have a NUMA system. There is a thread for each core in the system. Threads that process similar data are assigned to the same node, to reuse the data in the node's large L3 cache. I want threads assigned to the same node to steal each other's jobs first; only when all jobs on a node have finished should its threads steal jobs assigned to threads on other nodes. How can I implement this via OpenMP?
cache topology
By Ilya Z.
Hi, I'm writing a cpuid program. I need help with getting the number of each type of cache - not its size, but the count. For example, I need info such as: L1 data cache = 2 x 64KB. CPUID will give me the size of each kind of cache, but not how many there are. On MSDN I found that the GetLogicalProcessorInformationEx function might help to get that number, but I'm not sure I understood it correctly. I guess that a member of the CACHE_RELATIONSHIP structure, GROUP_AFFINITY, is related to the quantity. Could someone give me some hints, explain what this function actually does, or tell me where else to find such info? Thanks in advance.
Poor openmp performance
By Ronglin J.
We have two E5-2670s, 16 cores in total. We get the following OpenMP performance (the code is also attached below):
NUM THREADS: 1   Time: 1.53331303596497
NUM THREADS: 2   Time: 0.793078899383545
NUM THREADS: 4   Time: 0.475617885589600
NUM THREADS: 8   Time: 0.478277921676636
NUM THREADS: 14  Time: 0.479882955551147
NUM THREADS: 16  Time: 0.499575138092041
OK, this scaling is very poor when the thread number is larger than 4. But if I uncomment lines 17 and 24, so that the initialization is also done by OpenMP, the results are different:
NUM THREADS: 1   Time: 1.41038393974304
NUM THREADS: 2   Time: 0.723496913909912
NUM THREADS: 4   Time: 0.386450052261353
NUM THREADS: 8   Time: 0.211269855499268
NUM THREADS: 14  Time: 0.185739994049072
NUM THREADS: 16  Time: 0.214301824569702
Why are the performances so different? Some information: ifort v...
