Intel® Developer Zone:
Performance

Highlights

Hot off the press! Intel® Xeon Phi™ Coprocessor High Performance Programming
Learn the fundamentals of programming for this new architecture and these new products. New!
Intel® System Studio
Intel® System Studio is a comprehensive, integrated tool suite solution for software development. It helps you shorten time to market and make systems more reliable, more energy efficient, and higher performing.
New!
In case you missed it: a recording of the two-day live webinar
An introduction to developing high-performance applications for Intel® Xeon and Intel® Xeon Phi™ coprocessors.
Structured Parallel Programming
Authors Michael McCool, Arch D. Robison, and James Reinders make the subject accessible to every software developer through structured patterns.

Deliver the best possible applications to your customers through parallel programming, using Intel's innovative resources.

Development Resources


Development Tools

 

Intel® Parallel Studio

Intel® Parallel Studio brings simplified, end-to-end parallelism to Microsoft Visual Studio* C/C++ developers, with sophisticated tools for optimizing client applications for multicore and manycore.

Intel® Software Development Products

Explore all of the tools that can help you optimize for Intel architecture. Selected tools can be tried free of charge for 45 days.

Tools Knowledge Base

How-to guides and support information for Intel tools.

RoutePackets
Posted 06/13/2007
Problem Statement: We can think of a computer network as an undirected graph where nodes represent routers and edges represent connections between the routers. In this network, there are a number of packets, each of which has a source and a target. Over a number of discrete time steps, each pac…
Sudoku
Posted 06/13/2007
Problem Statement: Sudoku is a Japanese game which has become extremely popular recently. In the game, we try to fill an N^2 x N^2 grid with integers from 1 to N^2. The rules state that each row and column must contain exactly one occurrence of each of the N^2 integers. Additionally, each of the N^2 …
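A small sketch of just the row-and-column rule quoted above (the remainder of the statement is truncated here); names are illustrative only:

#include <vector>

// Checks that every row and every column of the N^2 x N^2 grid contains each
// value 1..N^2 exactly once; the truncated part of the statement is not covered.
bool rowsAndColsValid(const std::vector<std::vector<int>>& g, int N) {
    const int n2 = N * N;
    for (int r = 0; r < n2; ++r) {
        std::vector<int> rowSeen(n2 + 1, 0), colSeen(n2 + 1, 0);
        for (int c = 0; c < n2; ++c) {
            int rv = g[r][c], cv = g[c][r];
            if (rv < 1 || rv > n2 || cv < 1 || cv > n2) return false;
            if (++rowSeen[rv] > 1) return false;   // duplicate in row r
            if (++colSeen[cv] > 1) return false;   // duplicate in column r
        }
    }
    return true;
}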
PartitionGraph
Posted 06/13/2007
Problem Statement: Given a simple undirected graph G, your task is to partition it into C disjoint sets of nodes. Your goal is to do this in such a way that as few edges are cut as possible, while each of the sets is relatively large. More specifically, you want to minimize the ratio: (edges cut…
BoxPacking
Posted 06/13/2007
Problem Statement: You run a company that ships a lot of items to customers. The items are all rectangular and thus can be represented by their widths, heights, and depths (in inches). You often need to ship a lot of items to a single customer, and need to figure out the best way to pack all the…

Complexity rank of cache locking
By Klara Z.
Welcome, I know the CPU cycles needed by locking vary, but I need a general picture of how heavy cache locking is. In particular, for a P6+ chip, roughly what order of magnitude of cycles would LOCK BTS / INC / DEC consume if the operand is already in the cache? By order of magnitude I mean: would it be more like 10, or more like 100?
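One rough way to get that order of magnitude yourself is a tiny RDTSC loop over a locked increment whose operand stays in L1. This is only a sketch (not from the thread), and it measures back-to-back throughput rather than isolated latency; RDTSC overhead, out-of-order execution, and frequency scaling all blur the result.

#include <cstdint>
#include <cstdio>
#include <x86intrin.h>   // __rdtsc, assuming GCC/Clang on x86

int main() {
    long counter = 0;                 // stays cached after the first access
    const int iters = 1000000;

    uint64_t start = __rdtsc();
    for (int i = 0; i < iters; ++i)
        __sync_fetch_and_add(&counter, 1L);   // emits a LOCK-prefixed read-modify-write
    uint64_t end = __rdtsc();

    printf("~%.1f cycles per locked increment\n",
           (double)(end - start) / iters);
    return 0;
}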
Why is sequential consistency on x86/x86_64 implemented with MOV [addr], reg + MFENCE instead of + SFENCE?
By AlexeyAB
Intel x86/x86_64 systems have three types of memory barriers: LFENCE, SFENCE, and MFENCE. The question is about how they are used. For sequentially consistent (SC) semantics it is sufficient to use MOV [addr], reg + MFENCE for all memory locations that require SC semantics. However, one could also do it the other way around: MFENCE + MOV reg, [addr]. Apparently it was felt that, since stores to memory are usually less frequent than loads from it, putting the barrier on the write side costs less overall. Building on sequential stores, another optimization was made: [LOCK] XCHG, which is probably cheaper because the "MFENCE inside the XCHG" applies only to the cache line used by the XCHG (a video where, at 0:28:20, it is said that MFENCE is more expensive than XCHG). GCC 4.8.2 uses this mapping: LOAD (without fences) and STORE + MFENCE, as written here: http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html C/C++11 Operation x86 implementation Load Seq_Cst: MOV (from memory) St…
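For reference, the C++11 source-level picture behind that mapping looks roughly like this (a minimal sketch; whether the seq_cst store is lowered to MOV + MFENCE or to XCHG is a compiler choice, not something the source controls):

#include <atomic>
#include <cstdio>

std::atomic<int> flag{0};
int data = 0;

void writer() {
    data = 42;                                   // plain store
    flag.store(1, std::memory_order_seq_cst);    // MOV + MFENCE or XCHG, per compiler
}

int reader() {
    // Seq_cst loads need no fence on x86; they compile to a plain MOV.
    return flag.load(std::memory_order_seq_cst) ? data : -1;
}

int main() {
    writer();
    printf("%d\n", reader());
    return 0;
}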
OpenMP does not like fmax/fabs
By Jon U.
We have a code that exhibits greatly different runtimes between a Fortran and a C version. The problem has been isolated to one simple loop:
#pragma omp parallel for reduction(max:dt)
for(i = 1; i <= NR; i++){
  for(j = 1; j <= NC; j++){
    dt = fmax( fabs(t[i][j]-t_old[i][j]), dt);
    t_old[i][j] = t[i][j];
  }
}
which runs about 12 times slower than the equivalent Fortran loop:
!$omp parallel do reduction(max:dt)
Do j=1,NC
  Do i=1,NR
    dt = max( abs(t(i,j) - told(i,j)), dt )
    Told(i,j) = T(i,j)
  Enddo
Enddo
!$omp end parallel do
Removing the dt assignment eliminates the disparity. Also, running these as serial codes shows no disparity, so the problem is not that the actual C implementation is simply that bad. Also, eliminating just the reduction does not close the gap, so it is not the reduction operation itself. All of those tests lead us to the conclusion that there is some terrible interaction between OpenMP and fmax/fabs. Any h…
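One thing worth trying (a guess at a workaround, not a confirmed fix from the thread) is to replace the fmax/fabs library calls with explicit comparisons, so the max reduction can stay in registers instead of possibly calling into libm on every iteration:

#include <cstdio>

enum { NR = 500, NC = 500 };
static double t[NR + 2][NC + 2], t_old[NR + 2][NC + 2];

int main() {
    for (int i = 0; i <= NR + 1; i++)
        for (int j = 0; j <= NC + 1; j++) { t[i][j] = i + j; t_old[i][j] = 0.0; }

    double dt = 0.0;
    #pragma omp parallel for reduction(max:dt)
    for (int i = 1; i <= NR; i++) {
        for (int j = 1; j <= NC; j++) {
            double d = t[i][j] - t_old[i][j];
            if (d < 0.0) d = -d;          // inlined fabs
            if (d > dt) dt = d;           // inlined fmax
            t_old[i][j] = t[i][j];
        }
    }
    printf("dt = %f\n", dt);
    return 0;
}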
Parallelizing my existing code in TBB: please help me with these errors
By Girija B.
Hi, I am new to TBB and working on parallelizing my existing code. I could easily parallelize it with OpenMP, but we need to compare the performance of our code in both TBB and OpenMP after parallelization, so I tried parallelizing the code and am getting errors that I am not able to resolve. Please kindly help me with these errors. My code is below, just using a parallel for loop and a lambda function. I have made all of the serial, OpenMP, and TBB changes; please do look at the code and tell me what else I should change for TBB to work.
case openmp:
{
  #pragma omp parallel for private (iter, currentDB, db)
  for (iter = 1; iter < numDB; iter++)
  {
    currentDB = this->associateDBs->GetAssociateDB(iter);
    db = this->dbGroup.getDatabase( currentDB );
    GeoRanking::GeoVerifierResultVector  resLocal;
    db->recog( fg, InternalName, resLocal );
    LOG(info,omp_get_thr…
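For comparison, a minimal, self-contained TBB version of the same pattern (generic names, not the poster's classes): tbb::parallel_for over a blocked_range with a lambda, where anything declared inside the lambda body plays the role of OpenMP's private() variables.

#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <vector>
#include <cstdio>

int main() {
    const int numDB = 100;
    std::vector<double> result(numDB, 0.0);

    tbb::parallel_for(tbb::blocked_range<int>(1, numDB),
        [&](const tbb::blocked_range<int>& r) {
            for (int iter = r.begin(); iter != r.end(); ++iter) {
                // Locals declared here are private to this iteration, like
                // the private() clause in the OpenMP version above.
                double local = iter * 2.0;
                result[iter] = local;
            }
        });

    printf("result[10] = %f\n", result[10]);
    return 0;
}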
Selecting a custom victim in job scheduling on NUMA systems
By kadir.akbudak
I have a NUMA system. There is a thread for each core in the system. Threads that process similar data are assigned to the same node to reuse the data in the node's large L3 cache. I want threads that are assigned to the same node to steal each other's jobs first. Only when all jobs on a node have finished should those threads steal jobs assigned to threads on other nodes. How can I implement this via OpenMP?
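OpenMP itself does not expose victim selection, so one common approach (a sketch under an assumed node/thread mapping, not a standard OpenMP feature) is to keep one explicit job queue per NUMA node: a thread drains its own node's queue first and only steals from other nodes once it is empty.

#include <omp.h>
#include <deque>
#include <mutex>
#include <vector>
#include <cstdio>

struct NodeQueue {
    std::deque<int> jobs;   // job IDs; the real payload is up to the application
    std::mutex m;
};

static bool pop(NodeQueue& q, int& job) {
    std::lock_guard<std::mutex> g(q.m);
    if (q.jobs.empty()) return false;
    job = q.jobs.front();
    q.jobs.pop_front();
    return true;
}

int main() {
    const int numNodes = 2;                       // hypothetical two-node system
    std::vector<NodeQueue> queues(numNodes);
    for (int n = 0; n < numNodes; ++n)
        for (int j = 0; j < 1000; ++j)
            queues[n].jobs.push_back(n * 1000 + j);

    #pragma omp parallel
    {
        // Hypothetical mapping: the first half of the threads belong to node 0.
        int myNode = (omp_get_thread_num() < omp_get_num_threads() / 2) ? 0 : 1;
        int job;
        for (;;) {
            if (pop(queues[myNode], job)) continue;       // local work first
            bool stole = false;
            for (int n = 0; n < numNodes && !stole; ++n)  // then steal remotely
                if (n != myNode && pop(queues[n], job)) stole = true;
            if (!stole) break;                            // everything is empty
        }
    }
    printf("all queues drained\n");
    return 0;
}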
cache topology
By Ilya Z.
Hi, I'm writing a CPUID program. I need help with getting the number of caches of each type, not their sizes but how many there are. For example, I need to get information such as: L1 data cache = 2 x 64 KB. CPUID will give me the size of each kind of cache, but not how many of them there are. On MSDN I found that the GetLogicalProcessorInformationEx function might be helpful for getting that number, but I'm not sure I understood it correctly. I guess that a member of the CACHE_RELATIONSHIP structure, GROUP_AFFINITY, is related to the quantity. Could someone give me some hints, explain what this function actually does, or tell me where else to find such information? Thanks in advance.
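A sketch (assuming GCC/Clang's <cpuid.h>) of walking CPUID leaf 4, the deterministic cache parameters: it reports each cache's level, type, and size. Turning that into an instance count such as "L1 data = 2 x 64 KB" additionally requires the APIC-ID/topology information from leaf 0xB (or, on Windows, GetLogicalProcessorInformationEx), which this sketch does not cover.

#include <cpuid.h>     // __get_cpuid / __cpuid_count (GCC/Clang)
#include <cstdio>

int main() {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx) || eax < 4) return 1;  // leaf 4 unsupported

    for (unsigned idx = 0; ; ++idx) {
        __cpuid_count(4, idx, eax, ebx, ecx, edx);
        unsigned type = eax & 0x1F;                  // 0 means no more caches
        if (type == 0) break;
        unsigned level = (eax >> 5) & 0x7;
        unsigned ways  = ((ebx >> 22) & 0x3FF) + 1;
        unsigned parts = ((ebx >> 12) & 0x3FF) + 1;
        unsigned line  = (ebx & 0xFFF) + 1;
        unsigned sets  = ecx + 1;
        unsigned long long size =
            (unsigned long long)ways * parts * line * sets;
        printf("L%u %s cache: %llu KB\n", level,
               type == 1 ? "data" : type == 2 ? "instruction" : "unified",
               size / 1024);
    }
    return 0;
}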
Poor openmp performance
By Ronglin J.
We have E5-2670 * 2, 16 cores in total. We get the following OpenMP performance (the code is also attached below):
NUM THREADS:  1  Time: 1.53331303596497
NUM THREADS:  2  Time: 0.793078899383545
NUM THREADS:  4  Time: 0.475617885589600
NUM THREADS:  8  Time: 0.478277921676636
NUM THREADS: 14  Time: 0.479882955551147
NUM THREADS: 16  Time: 0.499575138092041
OK, this scaling is very poor once the thread count is larger than 4. But if I uncomment lines 17 and 24, so that the initialization is also done by OpenMP, the results are different:
NUM THREADS:  1  Time: 1.41038393974304
NUM THREADS:  2  Time: 0.723496913909912
NUM THREADS:  4  Time: 0.386450052261353
NUM THREADS:  8  Time: 0.211269855499268
NUM THREADS: 14  Time: 0.185739994049072
NUM THREADS: 16  Time: 0.214301824569702
Why are the performances so different? Some information: ifort ver…
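The usual suspect for this pattern is first-touch page placement on NUMA: memory pages land on the node of the thread that writes them first, so serial initialization puts all the data on one node and threads on the other socket work on remote memory. A minimal sketch of the idea (not the poster's code) initializes with the same parallel static schedule that the compute loop uses:

#include <cstdio>

int main() {
    const long n = 1L << 26;
    // new[] without () leaves the doubles uninitialized, so no page is touched yet.
    double *a = new double[n];
    double *b = new double[n];

    // First touch in parallel: each page is mapped on the node of the thread
    // that writes it first, matching the schedule of the compute loop below.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i) { a[i] = 0.0; b[i] = 1.0; }

    // The compute loop with the same schedule then reads mostly local memory.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i) a[i] += 2.0 * b[i];

    printf("a[0] = %f\n", a[0]);
    delete[] a;
    delete[] b;
    return 0;
}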
how to measure false sharing on intel xeon e7520 cpu
By Xingjing Lu
It is a four-socket platform, 32 cores in total with HT. Now, I measure false sharing with the Oprofile event MEM_UNCORE_RETIRED:6000:0x02:0:1 (MEM_UNCORE_RETIRED.OTHER_CORE_L2_HITM), but it seems this event only measures false sharing within a socket. Is it enough if I want to get the false sharing data for the entire system? Thanks! Eric
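Independently of which event is chosen, a tiny self-inflicted false-sharing workload (a sketch, not from the thread) can help verify what the counter actually captures: two threads write adjacent fields of the same 64-byte line, and with thread affinity set so that the two threads land on different sockets, any cross-socket HITM-style event should fire.

#include <omp.h>
#include <cstdio>

struct Shared {
    long a;   // written by thread 0
    long b;   // written by thread 1; shares a's 64-byte cache line -> false sharing
} shared;

int main() {
    #pragma omp parallel num_threads(2)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < 100000000L; ++i) {
            if (id == 0) shared.a++;
            else         shared.b++;
        }
    }
    printf("%ld %ld\n", shared.a, shared.b);
    return 0;
}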
