Congratulations to the 32 Core Testing Plan Winners!

Congratulations to the 32 Core Testing Plan Winners!

We had a fantastic selection of Testing Plan Entries last month. There were actually so many interesting requests that we decided to open the Lab a few weeks early to a select group of "Runner-ups". This means we should be seeingsomereports out on the Lab very soon and some good examples of how this tool could be used in the classroom! Meanwhile, our 5 top candidates have all recieved their accounts and are getting to work on their problems. Thank you again for all the wonderful entries.

Winners(in no particular order):

Anton Pankratov, Lomonosov Moscow State University

Nicols Wolovick, Universidad Nacional de Cordoba

Miguel Montes, Instituto Universitario Aeronautico

Yekaterina Zhmud, Novosibirsk State University

Konstantine Kalgin, Novosibirsk State University

Intel Academic Community Marketing
6 帖子 / 0 全新

Matrix Multiplication: Performance and Scalability in OpenMP
Lic. Nicols Wolovick
Fa.M.A.F. - National University of Crdoba

Idea Description
I am the teacher ofHPCMMM10course atFaMAF,National University of Crdoba.
This course is satellite of Dr. Sterling'sCSC7600at LSU.
In our second year, we have gained momentum in HPC education and training as well as consulting within our University.
In thebeginningof this year a simple, widely known and studied problem was posed to the students: matrix multiplication.
We made an internal contest, to obtain the fastest serial code.
Many versions were submitted, and we finally obtained 20x of improvement over the most nave implementation.
The students learnt a lot about compiler optimizations, an above all the effect of the caches in the performance of the code.
This opportunity given by Intel is a next step in thisexercise. What to do with 32 cores, how to optimize the code, and above all how to lay out the data access in order to get the most of the memory bus.
I am quite confident that studying new Intel Quick Path architecture for memory communication will unleash the fastest code, and in turn this will help the students to leave all ingenuities about perfect scaling in SMP, due to memory bandwidth contention.
The contest will be held among the students still continuing the class (18 currently), so we need 18 accounts in a 32-way SMP machine to conduct the internal contest, plus a C compiler supporting OpenMP, like gcc-4.4 or icc-11.
The result will be a technical report written by the students and my communicating the experience and conclusions obtained from this simple, yet extremely interesting example.

The Tech Brief for this project can be found here:


Anton Pankratov and Ruslan Tetuev

1) Moscow State University, The Department of Computational Mathematics and Cybernetics, Intel-MSU students Lab.

2) Institute of Mathematical Problems of Biology, Russian Academy of Sciences

Imagine what it would be if you want to view homology of all known genetic sequences at once as one single image, a huge one indeed. For example, we can sign pixel at (i,j) position of the image by bright red points when estimated similarity value of corresponding i-th and j-th DNA segments is over predefined threshold value; green points denote inverted identities and so on. The question is how to generate such a huge image quickly, and how to browse over its broad plane, and how to inquire interesting subsequences separately?

Here we describe a novel approach for such a tough problem, which could be resolved easily by meanings of high-performance computing on multi-core systems. We call it N.A.S.C.A.: Numerical Analytical Spectral Comparing Approach. The main idea of this approach is to apply numerical analytical methods instead of using dynamic programming. Previously it was demonstrated that the comparison of DNA segments could be performed as comparison of spectral values of fragments of its GC content profile. These results have been obtained due to the advantage of Chebyshev approximation. Thus we can operate with extremely long DNA segments (up to 100000 base pairs) using only their spectral representations. The length of GC scanning scales usually as 256, 512, 1024, etc.

The next problem we had to resolve speed up! Fortunately, this approach shows its ability to be parallelized, mainly because of recurrent formulas of Chebyshev orthogonal polynomials. Current implementation of our algorithm parallelization based on using OpenMP pragmas and Intel IPP functions. And it scales linearly on 2- and 4-cores systems! We have every reason to believe that the total acceleration of the algorithm will be achieved on a 32-core system. In case it wouldn't scale yet so good, we'd like to improve it as soon as possible.

The Technical Brief for this is now posted:


Parallel algorithms of Asynchronous Cellular Automata simulation.

Kalgin Konstantin.

Institute of Computational Mathematics and Mathematical Geophysics, Russian Academy of Sciences, Siberian Branch (ICM&MG SB RAS)
Novosibirsk State University (NSU)

cellular automata (ACA) are used for simulation of physical and
chemical processes on molecular level, for example, to study
oscillatory chemical surface reactions, absorption, sublimation and
diffusion of atoms in the epitaxial growth processes. Simulation of
natural processes requires huge cellular space and millions of
iterative steps for obtaining the real scene of the process.
Therefore, it requires a lot of computing power. Unfortunately, ACA
can not be parallelized so easily as synchronous cellular automata
(SCA). As distinct to SCA, ACA functioning is a sequential
application of transition rule to randomly selected cells. The cells
are selected with equal probabilities and irrespective of the process

There are several
sophisticated parallel algorithms of ACA simulation on computers with
different architectures. In [1] an algorithm suitable only for shared
memory computers is proposed. Parallel algorithm suitable for
distributed memory computers is presented in [2].

For testing we use
physicochemical process of surface reaction (CO+O_2) over the
supported palladium nanoparticles [3].

Testing have shown
perfect results for the first algorithm on modern Intel Core i7 (4
cores) with Hyper-threading support only 75% efficiency with 4
threads and 90%-150% with 8 threads for cellular array more than
1000x1000 cells. Unfortunately, the second algorithm is implemented
by using MPI and multi-threaded variant should have the same
efficiency for smaller arrays.

Results for 8-core
machine (two Intel Xeon 5140) are not so good. It may be caused by
Hyper-threading absence, separated memory controller, or something
else. At worst it may be caused by poor scalability.

With Manycore
Testing Lab I am going to

  • implement the second algorithm for

  • obtain scalability of these
    algorithms on the modern manycore

  • learn bottlenecks of these
    algorithms on the modern manycore

  • create a new algorithm that fully
    utilizes benefits of manycores

[1] B.D.
Lubachevsky. Efficient parallel simulations of asynchronous cellular
arrays. Complex Systems 1, 6 (Dec. 1987), 1099-1123.

[2] K.V. Kalgin.
Parallel simulation of asynchronous Cellular Automata evolution.
Bull. Nov. Comp. Center, Comp. Science, 27(2008), 55-63.

[3] Elokhin, V.I.,
Latkin, E.I., Matveev, A.V., and Gorodetskii, V.V. Application of
Statistical Lattice Models to the Analysis of Oscillatory and
Autowave Processes on the Reaction of Carbon Monoxide Oxidation over
Platinum and Palladium Surfaces. Kinetics and Catalysis, vol. 44, N5,
pp.672-700, 2003.

Parallel implementation of Lattice Boltzmann Methods

Miguel Montes - Carlos Sacco
Instituto Universitario Aeronutico - Facultad de Ingeniera

Idea Description

The main topic of this project is to test a parallel approach for a numerical formulation known as Lattice Boltzmann Method (LBM). This method is relatively new and contrasts with the traditional approach to CFD by adopting a bottom-up approach to fluid modeling. To achieve this, it describes the fluid at a molecular level and proposes models for the collision between molecules.
The full continuum-level physics of the fluid is implicitly contained in this model, and the conceptual work to be effectuated is opposite to the one in the top-down approach of classical CFD. Indeed, the various physical ingredients contained in the model need to be identified one by one and properly explicited in order to allow for a segregation between relevant and negligible properties.
Based on those considerations, the collision model is substantially simplified for the needs of the numerical treatment. As such, the Lattice Boltzmann Method is well adapted to the simulation of complex fluids, as the relevant physical ingredients are taken in charge very naturally by the method. In traditional CFD, a partial differential equation must be discretized, and the numerical instabilities may be treated carefully to avoid convergence problems. The implementation of such a model requires advanced software engineering techniques to implement its dynamics. Those problems dont appear in the LBM, due to the nature of this formulation.
The LBM has became a powerful tool for modeling complex flows since it first appeared in the 80s . The LBM is a discrete method, which incorporates the essential physics of microscopic processes so that the macroscopic averaged properties are consistent with the NavierStokes equations (NSEs).The LBM discrete formulation provide an explicit algorithm that can be parallelized easily.
In our institute, we have a Lattice Boltzmann code in Fortran 90 and C, in both cases the parallelization was performed using OpenMP libraries and Intel Compilers. In this project we can test different aspects about the parallel implementation of LBM, as the scalability of the implementation in a shared memory environment. The code so far has been tested in four core machines, and this project would be an opportunity to test its scalability and to address such issues as memory bandwitdth, cache utilization, contention, etc. In the team working in this project, we have researchers in the area of CFD, professors of computer science and undergraduate students taking HPC courses.

Yekaterina Zhmud, Novosibirsk State University

Computation of automorphism groups of q-ary codes

The purpose of the work is to develop and implement the algorithm to find automorphism groups of q-ary codes in n-dimensional vector space over the finite field (a space of n-tuples with coordinates from this field). Automorphism (symmetry) of the code is isometry that maps the code into itself. It is known that all the symmetries are reduced to the combinations of permutations of coordinate positions and sets of transpositions of elements of the field.

The investigation of symmetry groups is important in classification of codes. Symmetry groups are applied widely in cryptography. For example, for codes with big symmetry group there are many equivalent codes, it makes unauthorized access to protected information more difficult.

Detailed information about code properties is known only for binary codes(q=2). But now technical abilities allow to use q-ary codes, where q>2. So, the investigation of q-ary codes turns out to be actual and useful.

Currently the algorithm of computation of symmetry groups of codes is developed and realized for code length equal to 13 and less. But for these codes many resources are necessary. After several partial tests it was computed that first (basic) part of the program will compute for about 21 days. It complicates testing of the whole program very much, because other part of the program depends on basic part. The other part and its methods can vary, but testing is impossible without this data.

Assumed that the usage of 32 core software development environment will significantly help in this investigation, because with basic data known (computed using 32 core DE) it is possible to change and to improve the other part of the program for the purpose of getting more efficiency.

Both parts of the program supposed to be realized parallel using OpenMP.

The Technical Brief for this is now posted:


登录添加评论。 还不是成员? 立即加入