Intel® Developer Zone:
Performance

Featured

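Hot off the press! Intel® Xeon Phi™ Coprocessor High Performance Programming
Learn the essentials of programming for this new architecture and new products. New!
Intel® System Studio
Intel® System Studio is a comprehensive, integrated software development tool suite that shortens time to market, strengthens system reliability, and improves power efficiency and performance. New!
In case you missed it, you can still watch the replay of the two-day live webinar
An introduction to high-performance application development for Intel® Xeon™ processors and Intel® Xeon Phi™ coprocessors.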
Structured Parallel Programming
Authors Michael McCool, Arch D. Robison, and James Reinders take an approach based on structured patterns, making the subject accessible to every software developer.

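Achieve parallel programming with the help of Intel's innovation resources, and deliver the best possible application performance to your customers.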

Development Resources


Development Tools

 

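Intel® Parallel Studio

Intel® Parallel Studio brings simplified end-to-end parallelism to Microsoft Visual Studio* C/C++ developers and provides advanced tools to help them optimize client applications for multi-core and many-core.

Intel® Software Development Products

Explore all the tools that help you optimize for Intel architecture. Select tools offer a free 45-day evaluation period.

Tools Knowledge Base

Find guides and support information for Intel tools.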

Bitonic Sorting
Author: Vadim Kartoshkin (Intel) Posted: 02/12/2015 0
Demonstrates how to implement an efficient sorting routine with the OpenCL™ technology that operates on arbitrary input array of integer values. The sample uses properties of bitonic sequence and principles of sorting networks and enables efficient SIMD-style parallelism through OpenCL vector dat...
PinPlay: FAQ
Author: admin Posted: 02/10/2015 0
I. How long does record/replay take? Record/replay overhead is a function of number of memory accesses and the amount of sharing in the test program. 1. Time for recording/replaying a 'region':  Source : CGO2014 paper on DrDebug 2. Slow-down for whole-program recording. Source: Measured wi...
Understanding How General Exploration Works in Intel® VTune™ Amplifier XE
Author: Jackson Marusarz (Intel) Posted: 02/09/2015 0
The General Exploration Analysis Type in Intel® VTune™ Amplifier XE is used to detect microarchitectural hardware bottlenecks in an application or system. General Exploration uses hardware event counters to detect and locate issues and presents the data in a user-friendly and actionable format. T...
The Generic Address Space in OpenCL™ 2.0
Author: Adam Lake (Intel) Posted: 02/06/2015 0
Introduction What is the Generic Address Space? Enabling the Generic Address Space Why Would I Want to Use the Generic Address Space? Performing Some Operations in a Specific Address Space Address Space Casting Performance Implications and How to Address Them A Working Example Future Work...
Subscribe to Intel Developer Zone articles
Understanding the Android Service Component in Multithreaded Applications
Author: auspicious Posted: 2012/08/14 0
The official Google Android SDK site defines the Android Service component as follows: A Service is an application component representing either an application's desire to perform a longer-running operation while not interacting with the user or to supply functionality for other applications to use. Each service class must have a corr...
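Testing the Effect of Multithreading on Branch Prediction in Multi-core CPUs
Author: hengyunabc123 Posted: 2012/08/14 0
Preface: Modern CPUs have pipelines and branch prediction. Branch prediction accuracy can exceed 98%, but when a prediction fails the pipeline is invalidated and the performance loss is severe. For the branch prediction techniques CPUs use, see: 处理器分支预测研究的历史和现状.pdf (History and Current State of Processor Branch Prediction Research) and 同时多线程处理器上的动态分支预测器设计方案研究.pdf (Design of Dynamic Branch Predictors for Simultaneous Multithreading Processors). Using these features correctly lets you write efficient programs. For example, when writing if/else statements, put the high-probability case in the if branch and the low-probability case in the else branch. However, such reasoning usually assumes a single thread; under multithreading, unexpected situations can arise, such as several threads executing the same piece of code at the same time. Test: Below are some multithreaded branch prediction tests on an Intel Core i5. Test approach (real...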
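How to Make a Multithreaded DLL Exit Cleanly on Windows
Author: luansxx Posted: 2012/08/14 0
If you develop dynamic-link libraries on Windows and start internal threads inside the library, you will very likely find that the program loading your DLL deadlocks on exit; sometimes the main window is gone, but Task Manager shows the process is still running. Users may not notice anything wrong, but as a perfectionist you surely want the program to exit cleanly. Below I share what I learned from fighting this problem over the past few days. I have recently been developing player plug-ins, one each for the directshow, vlc, and mplayer frameworks. All three plug-ins use another media DLL (Mylib.dll), and all load it dynamically (LoadLibrary). That DLL is fairly complex and uses threads internally; in addition, the directshow and vlc plug-ins are themselves also a...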
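How to Quickly Estimate the Performance Improvement of a Hotspot Function?
Author: Peter Wang (Intel) Posted: 2012/08/12 0
Intel VTune Amplifier XE helps us quickly find hotspot functions, measure CPU consumption, and analyze parallelism, so that we can optimize the algorithm, for example by adjusting how tasks are distributed across threads, optimizing the use of synchronization locks, and reducing thread wait time. After optimizing, analyze the program again with VTune; the Elapsed Time metric in the Summary report shows the overall performance improvement. But how do you evaluate the improvement of one particular hotspot function? For a single-threaded application it is very simple: just compare the CPU time before and after. For a multithreaded program, some estimation technique is needed. Below are the reports before and after optimization for the product's bundled sample, tachyon_vtune_amp_xe.zip. The CPU Time in the report is, over all cores, ...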
Subscribe to the Intel® Developer Zone blog
Slowdown with OpenMP
Author: Matt S. 11
I'm getting some pretty unusual results from using OpenMP on a fractional differential equations code written in Fortran. No matter where I use OpenMP in the code, whether it be on an initialization loop or on a computational loop, I get a slowdown across the entire code. I can put OpenMP in one loop and it will slow down an unrelated one (timed separately)! The code is a bit unusual, as it initializes arrays with indices starting at 0 (and some even negative). For example,
    real*8 :: gx(0:Nx)
    real*8 :: AxLh(1-Nx:Nx-1), AxRh(1-Nx:Nx-1), AxL0(1-Nx:Nx-1), AxR0(1-Nx:Nx-1)
where Nx is, let's say, 512. Would that possibly have anything to do with the ubiquitous slowdown with OpenMP? Also, any ideas on reducing "pow" overhead in the following snippet would be greatly appreciated:
    do k = 1, 5
      hgck = foo_c(k)
      hgpk = foo_p(k)
      do j = 1, 100
        vx = vx + hgck * ux(x, t, foo(j) + hgpk)
      end do
    end do
where ux is a function defined by function ux(x,t,xi) impl...
web crawling through "Intel Xeon Phi Coprocessors"
Author: Sunil K. 1
I am new to this forum. I want to implement parallel crawling on "Intel Xeon Phi Coprocessors" for my project. Before buying equipment, installing software, and starting to learn about this platform, I want to know whether it is possible to somehow connect to the network and fetch web URLs in parallel using this technology. (I don't want to create a cluster of CPUs to do this. I want to do it using a single card.)
Intel MPI for Phi tuning tips?
Author: Ronald W Green (Intel) 3
Does setting I_MPI_MIC=enable change other MPI environment variables, particularly any that would tune MPI for the MIC system architecture? As a side question, has anyone written a Tuning and Tweaking guide for IMPI for Phi? For example, what I_MPI variables could one use to help tune an app targeting 480 ranks across 8 Phis? Thanks, Ron
Lock-free Java, or better scaling on multi-core systems
Author: William L. 0
Everyone these days has to address multi-core issues, or vertical scaling, at least on the server side of things. And there does not seem to be a general approach, so we end up re-architecting our applications every time we add cores. At the same time, the availability of many-core processors seems to be constrained by the lack of a reasonable software technology to make good use of them. Actors seem like a good approach, and allow you to write fast, lock-free code. But large actor-based systems are not robust. Most actor implementations require applications to implement a state machine per actor for determining what messages are to be processed, and maintaining a large number of interacting state machines is well beyond the abilities of most developers. Which is very sad, as throughput of actor-based applications typically scales with the number of cores. I've worked on this problem for a number of years now and have developed a simple variation on actors which support non-blockin...
igzip for VS10 C++?
Author: David L. 6
I was searching for a zlib-compatible but faster compressor, and came across the paper describing igzip -- High Performance DEFLATE Compression on Intel Architecture Processors. igzip looks like exactly (!) what I am looking for: compatible with zlib, but faster. However, the downloadable source was for Linux. I need it for a VS10 C++ project. I have successfully (I think) compiled and assembled the desired modules (common, crc, crc_utils, hufftables, hufftables_c.cpp, igzip0c_body, igzip0c_finish, init_stream) into a .lib. But when I attempt to link the library into my project, I get error LNK2019: unresolved external symbol fast_lz (and init_stream) from where they are called. I also have a "C" lz4 compression library linked into the project, and it works fine. I have spent 3 days playing with it, looking for the clue that will unlock the symbols, but no luck so far. I get no other warnings and/or errors during the compiling/assembling of the library or project. Any help (esp...
OpenCL vs Intel Cilk Plus Issues, Differences and Capabilities
Author: Yaknan G. 0
I am curious as to the differences between OpenCL and Intel Cilk Plus. They are both parallel programming paradigms that are receiving wide recognition, but technically speaking, is one better than the other, or are they simply different? Also, what yardstick do I use when choosing between the two when solving an embarrassingly parallel problem? Please, I need answers. Thanks! Yaknan
Thread complexion(Multi-threading)
Author: Masood Ali M. 4
Hello everyone, the other day I was trying to create a thread that could capture the working of an already existing (working) thread and copy its working. Also, setting the priority of threads so that they can capture the working of threads at the same priority level, and a dynamic increase in thread capacity to handle similar kinds of work. I would appreciate it if anybody could help with it. Thanks. -Ali
The list of out-of-order CPUs
Author: bp 1
Hi, I would like to know the list of commercial products (CPUs / SoCs) made by Intel that support out-of-order execution. I noticed that the new Baytrail architecture apparently should support this kind of execution, but I have no information about other architectures: Xeon, iCore, previous Atoms, Celerons and Pentiums. At this point I also have no specific information about the subsets of a given family; for example, Baytrail is usually split into Baytrail-M and Baytrail-T, and I can only speculate that this new out-of-order support applies to both. It would also be really nice if you could spend some time describing the support for this kind of memory model given by open source compilers such as gcc and clang. Thanks.
Subscribe to the forums
