Matrix Multiplication: Performance and Scalability in OpenMP Lic. Nicols Wolovick Fa.M.A.F.  National University of Crdoba Argentina 

Idea Description  
I am the teacher ofHPCMMM10course atFaMAF,National University of Crdoba.
This course is satellite of Dr. Sterling'sCSC7600at LSU.
In our second year, we have gained momentum in HPC education and training as well as consulting within our University.
In thebeginningof this year a simple, widely known and studied problem was posed to the students: matrix multiplication.
We made an internal contest, to obtain the fastest serial code.
Many versions were submitted, and we finally obtained 20x of improvement over the most nave implementation.
The students learnt a lot about compiler optimizations, an above all the effect of the caches in the performance of the code.
This opportunity given by Intel is a next step in thisexercise. What to do with 32 cores, how to optimize the code, and above all how to lay out the data access in order to get the most of the memory bus.
I am quite confident that studying new Intel Quick Path architecture for memory communication will unleash the fastest code, and in turn this will help the students to leave all ingenuities about perfect scaling in SMP, due to memory bandwidth contention.
The contest will be held among the students still continuing the class (18 currently), so we need 18 accounts in a 32way SMP machine to conduct the internal contest, plus a C compiler supporting OpenMP, like gcc4.4 or icc11.
The result will be a technical report written by the students and my communicating the experience and conclusions obtained from this simple, yet extremely interesting example.
The Tech Brief for this project can be found here: http://software.intel.com/enus/articles/matrixmultiplicationperformanceandscalabilityinopenmpstudentchallenge/ 
Congratulations to the 32 Core Testing Plan Winners!
FAST SPECTRAL ESTIMATION OF GENETIC HOMOLOGY
Anton Pankratov and Ruslan Tetuev
1) Moscow State University, The Department of Computational Mathematics and Cybernetics, IntelMSU students Lab.
2) Institute of Mathematical Problems of Biology, Russian Academy of Sciences
Imagine what it would be if you want to view homology of all known genetic sequences at once as one single image, a huge one indeed. For example, we can sign pixel at (i,j) position of the image by bright red points when estimated similarity value of corresponding ith and jth DNA segments is over predefined threshold value; green points denote inverted identities and so on. The question is how to generate such a huge image quickly, and how to browse over its broad plane, and how to inquire interesting subsequences separately?
Here we describe a novel approach for such a tough problem, which could be resolved easily by meanings of highperformance computing on multicore systems. We call it N.A.S.C.A.: Numerical Analytical Spectral Comparing Approach. The main idea of this approach is to apply numerical analytical methods instead of using dynamic programming. Previously it was demonstrated that the comparison of DNA segments could be performed as comparison of spectral values of fragments of its GC content profile. These results have been obtained due to the advantage of Chebyshev approximation. Thus we can operate with extremely long DNA segments (up to 100000 base pairs) using only their spectral representations. The length of GC scanning scales usually as 256, 512, 1024, etc.
The next problem we had to resolve speed up! Fortunately, this approach shows its ability to be parallelized, mainly because of recurrent formulas of Chebyshev orthogonal polynomials. Current implementation of our algorithm parallelization based on using OpenMP pragmas and Intel IPP functions. And it scales linearly on 2 and 4cores systems! We have every reason to believe that the total acceleration of the algorithm will be achieved on a 32core system. In case it wouldn't scale yet so good, we'd like to improve it as soon as possible.
The Technical Brief for this is now posted: http://software.intel.com/enus/articles/fastspectralestimationofgenetichomology/
/*>*/
Parallel algorithms of Asynchronous Cellular Automata simulation.
Kalgin Konstantin.
Institute of Computational Mathematics and Mathematical Geophysics, Russian Academy of Sciences, Siberian Branch (ICM&MG SB RAS)
Novosibirsk State University (NSU)
Asynchronous
cellular automata (ACA) are used for simulation of physical and
chemical processes on molecular level, for example, to study
oscillatory chemical surface reactions, absorption, sublimation and
diffusion of atoms in the epitaxial growth processes. Simulation of
natural processes requires huge cellular space and millions of
iterative steps for obtaining the real scene of the process.
Therefore, it requires a lot of computing power. Unfortunately, ACA
can not be parallelized so easily as synchronous cellular automata
(SCA). As distinct to SCA, ACA functioning is a sequential
application of transition rule to randomly selected cells. The cells
are selected with equal probabilities and irrespective of the process
history.
There are several
sophisticated parallel algorithms of ACA simulation on computers with
different architectures. In [1] an algorithm suitable only for shared
memory computers is proposed. Parallel algorithm suitable for
distributed memory computers is presented in [2].
For testing we use
physicochemical process of surface reaction (CO+O_2) over the
supported palladium nanoparticles [3].
Testing have shown
perfect results for the first algorithm on modern Intel Core i7 (4
cores) with Hyperthreading support only 75% efficiency with 4
threads and 90%150% with 8 threads for cellular array more than
1000x1000 cells. Unfortunately, the second algorithm is implemented
by using MPI and multithreaded variant should have the same
efficiency for smaller arrays.
Results for 8core
machine (two Intel Xeon 5140) are not so good. It may be caused by
Hyperthreading absence, separated memory controller, or something
else. At worst it may be caused by poor scalability.
With Manycore
Testing Lab I am going to

implement the second algorithm for
manycores 
obtain scalability of these
algorithms on the modern manycore 
learn bottlenecks of these
algorithms on the modern manycore 
create a new algorithm that fully
utilizes benefits of manycores
[1] B.D.
Lubachevsky. Efficient parallel simulations of asynchronous cellular
arrays. Complex Systems 1, 6 (Dec. 1987), 10991123.
[2] K.V. Kalgin.
Parallel simulation of asynchronous Cellular Automata evolution.
Bull. Nov. Comp. Center, Comp. Science, 27(2008), 5563.
[3] Elokhin, V.I.,
Latkin, E.I., Matveev, A.V., and Gorodetskii, V.V. Application of
Statistical Lattice Models to the Analysis of Oscillatory and
Autowave Processes on the Reaction of Carbon Monoxide Oxidation over
Platinum and Palladium Surfaces. Kinetics and Catalysis, vol. 44, N5,
pp.672700, 2003.
Idea Description
Yekaterina Zhmud, Novosibirsk State University
Computation of automorphism groups of qary codes
The purpose of the work is to develop and implement the algorithm to find automorphism groups of qary codes in ndimensional vector space over the finite field (a space of ntuples with coordinates from this field). Automorphism (symmetry) of the code is isometry that maps the code into itself. It is known that all the symmetries are reduced to the combinations of permutations of coordinate positions and sets of transpositions of elements of the field.
The investigation of symmetry groups is important in classification of codes. Symmetry groups are applied widely in cryptography. For example, for codes with big symmetry group there are many equivalent codes, it makes unauthorized access to protected information more difficult.
Detailed information about code properties is known only for binary codes(q=2). But now technical abilities allow to use qary codes, where q>2. So, the investigation of qary codes turns out to be actual and useful.
Currently the algorithm of computation of symmetry groups of codes is developed and realized for code length equal to 13 and less. But for these codes many resources are necessary. After several partial tests it was computed that first (basic) part of the program will compute for about 21 days. It complicates testing of the whole program very much, because other part of the program depends on basic part. The other part and its methods can vary, but testing is impossible without this data.
Assumed that the usage of 32 core software development environment will significantly help in this investigation, because with basic data known (computed using 32 core DE) it is possible to change and to improve the other part of the program for the purpose of getting more efficiency.
Both parts of the program supposed to be realized parallel using OpenMP.
The Technical Brief for this is now posted: http://software.intel.com/enus/articles/computationofautomorphismgroupsofqarycodes/
Congratulations to the 32 Core Testing Plan Winners!
We had a fantastic selection of Testing Plan Entries last month. There were actually so many interesting requests that we decided to open the Lab a few weeks early to a select group of "Runnerups". This means we should be seeingsomereports out on the Lab very soon and some good examples of how this tool could be used in the classroom! Meanwhile, our 5 top candidates have all recieved their accounts and are getting to work on their problems. Thank you again for all the wonderful entries.
Winners(in no particular order):
Anton Pankratov, Lomonosov Moscow State University
Nicols Wolovick, Universidad Nacional de Cordoba
Miguel Montes, Instituto Universitario Aeronautico
Yekaterina Zhmud, Novosibirsk State University
Konstantine Kalgin, Novosibirsk State University
www.intel.com/software/academic