Real results for many-core processors illustrate the power of a familiar configuration (SMP) even when reduced to a single chip. SMP on-a-chip can use the same applications, same tools, offer the same flexibility and pose familiar challenges that are solved by familiar techniques and skills.
I recently attended a symposium, co-sponsored by TACC and Intel, at the Texas Advanced Computing Center (TACC) in Austin where the programming of two many-core devices was discussed. One was a research chip designed to push some limits and allow interesting research on a device that lacks many things a product would require. The research chip is known as Intel’s Single-Chip Cloud Computer (SCC). The other many-core device was a prototype of our new Intel Many Integrated Core (MIC) Architecture, the Knights Ferry co-processor. The deadline for papers precluded inclusion of results from pre-production Knights Corner co-processors, which will be the first Intel MIC co-processor products. There was a lot of whispering in the hallways about the excitement of starting work with Knights Corner co-processors.
The papers and the half-day tutorial at the “TACC-Intel Highly Parallel Computing Symposium” all had strong elements relating to familiar parallel programming challenges: scaling and vectorization. This is because both devices are built on Intel Pentium processor cores hooked together, each with its own design for a connection fabric, on the same piece of silicon.
Simply put, both are SMP (symmetric multiprocessor) on-a-chip devices, with somewhat different design goals.
At Intel, we have been convinced that putting a familiar, generally programmable SMP on a chip is a good idea. Its familiar programmability proves to have many benefits. SCC was built for research into many facets of highly parallel devices. Knights Corner is designed for production usage and is optimized for power and highly parallel workloads. Knights Corner is well suited for HPC applications that already run on SMP systems. Presenter after presenter who talked about using the prototype Knights Ferry mentioned how applications “just worked.”
I like to say, “Programming is hard, and so is parallel programming.” It follows that making an SMP or an SMP on-a-chip get maximum performance may not quite be rocket science, but it is no walk in the park. So, there was plenty of room for the papers to discuss the challenges of tuning for any SMP system.
What was really striking was how optimizations for Knights Ferry co-processors were applicable to SMP systems in general. Several authors commented on how their work to get better scaling or better vectorization for Knights Ferry also improved the performance of the same code compiled to run on an Intel Xeon processor-based SMP system. This performance reuse is very significant, and one presenter exclaimed, “Time spent optimizing for MIC is time well spent because it optimizes your code for non-MIC processors at the same time.”
All the papers and presentations (including my keynote) are available on-line now at http://www.tacc.utexas.edu/
Here are some notes from a few of the talks:
Dr. Robert Harkness gave an engaging talk entitled “Experiences with ENZO on the Intel Many Integrated Core Architecture.” I enjoyed his comment that “we [are] always programming for the future” because they “never have enough compute power.” He looked at multiple programming models, but had the best results using the “dusty” MPI-based program that he had running on an SMP before Knights Ferry. He did his work on MPICH 1.2.7p1 because Intel did not supply an MPI with the Knights Ferry systems. He said it was obsolete but very easy to build and use. He reported that one person (not a dedicated programmer) was able to build everything (a quarter million lines of code) in a single week without any application source code modifications at all. The week, it seems, was spent hunting down libraries and recompiling them, including MPICH. His results scaled very well.
His conclusions (from slide 30 of his presentation) were: “Intel MIC is the best way forward for large-scale codes which cannot use the existing GPGPU model (even with directives).”
A talk by Theron Voran, with the National Center for Atmospheric Research, looked at using Knights Ferry for Climate Science. He started by saying "We have large bodies of code laying around. We don't want to rewrite in new languages for restrictive architectures." He had several good introduction slides including a comparison of accelerators vs. multicore and many-core devices.
Here the challenges of vectorization offered opportunities for future work. Compiler hints, loop restructuring, and related activities should enhance performance on both Xeon-based and MIC-based SMP systems, as should work on improving scalability across more and more cores. Even with these challenges, the authors noted “Relative ease in porting codes” (recompiling) and the belief that the computational capabilities of MIC will be worthwhile.
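To make "compiler hints" and "loop restructuring" concrete, here is a minimal sketch in C. It is not code from the talk: the function names scaled_add and column_sums are hypothetical, and the `#pragma omp simd` hint comes from OpenMP 4.0, which I am assuming your compiler supports (a compiler without it should simply ignore the pragma). The first function uses `restrict` to promise the compiler that the arrays do not overlap; the second orders its loops so the inner loop walks memory with unit stride.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel: y[i] += a * x[i].
 * `restrict` is a hint that x and y never overlap, which removes a
 * common obstacle to automatic vectorization. */
void scaled_add(size_t n, float a, const float *restrict x, float *restrict y)
{
    assert(n == 0 || (x != NULL && y != NULL));

    /* OpenMP 4.0 SIMD hint; ignored when OpenMP support is not enabled. */
    #pragma omp simd
    for (size_t i = 0; i < n; ++i) {
        y[i] += a * x[i];
    }
}

/* Hypothetical kernel: per-column sums of a row-major matrix m.
 * Loop restructuring: j is the inner loop, so consecutive iterations
 * touch consecutive addresses in both m and sums. */
void column_sums(size_t rows, size_t cols,
                 const double *restrict m, double *restrict sums)
{
    assert(cols == 0 || sums != NULL);

    for (size_t j = 0; j < cols; ++j) {
        sums[j] = 0.0;
    }
    for (size_t i = 0; i < rows; ++i) {
        #pragma omp simd
        for (size_t j = 0; j < cols; ++j) {
            sums[j] += m[i * cols + j];
        }
    }
}
```

Nothing in this sketch is specific to one processor, which is the point the presenters kept making: the same source can be recompiled for a Xeon-based or a MIC-based system.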
Ryan Hulguin, with the University of Tennessee, looked at CFD solvers on Knights Ferry. He looked at two methods, one based on Euler equations (for inviscid fluid flows) and another based on the BGK model Boltzmann equation (for rarefied gas flows). Performance results showed OpenMP to be effective on Knights Ferry, and that the SMP programming challenges of vectorization and having good concurrency held true on Knights Ferry as well.
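As a rough illustration of the OpenMP style that carries over from an SMP system to a many-core device, here is a small sketch. It is not one of the CFD solvers from the talk; the function name dot is hypothetical, and the combined `parallel for simd` construct assumes an OpenMP 4.0 compiler. The loop asks for both kinds of parallelism discussed above: threads across cores (concurrency) and SIMD lanes within each core (vectorization).

```c
#include <stddef.h>

/* Hypothetical kernel: dot product of a and b.
 * The reduction clause gives each thread a private partial sum and
 * combines the partial sums when the loop ends. */
double dot(size_t n, const double *restrict a, const double *restrict b)
{
    double sum = 0.0;

    /* Without OpenMP enabled the pragma is ignored and the loop runs
     * serially, producing the same result for this input-independent
     * loop body (up to floating-point summation order). */
    #pragma omp parallel for simd reduction(+:sum)
    for (long i = 0; i < (long)n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Because the reduction may add terms in a different order on each run, floating-point results can differ in the last bits between thread counts.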
A talk on Dense Linear Algebra Factorization, from David Hudak at the Ohio Supercomputer Center, covered Heterogeneous Programming Challenges. David is a Wolverine working in a Buckeye world. My heart goes out to him. I really enjoyed how he separated the short-term issues that distract us from the real long-term challenges that will stay with us.
The talk compared a QR factorization implemented in OpenMP with a Cilk Plus implementation. Both performed well. The authors emphasized that the guidance to vectorize and use lots of tasks proved to work.
I’ve written more than I set out to write, so I’ll stop here. The SCC-related papers were very interesting as well, ranging from Tim Mattson’s overview of the program to papers showing research results from investigations using SCC. The other MIC-related papers are all worthy as well, including an excellent paper on early experiences with MVAPICH2 doing intra-MIC MPI communication. Amazing things you can do on an SMP on-a-chip… it runs a real Linux after all!
It is very common for demos to start with an ‘ssh’ (shell) to one of the Knights Ferry co-processors… and then run the application natively from the command line. SMP on-a-chip, indeed. Too bad I can’t convince Intel to name it that. Even if I did, it would probably be chipSMP™ model 8650plus XS. Never mind, Knights Corner is fine by me.
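The "use lots of tasks" advice can be sketched with OpenMP tasks. This is not the QR factorization from the talk, only a much simpler divide-and-conquer sum that shows the pattern; the names task_sum, sum_range, and GRAIN are hypothetical, and the grain size of 64 is an arbitrary assumption rather than a tuned value. The work is split recursively into many small tasks so the runtime has plenty to balance across cores, while each leaf remains a plain loop the compiler can vectorize.

```c
#include <stddef.h>

#define GRAIN 64  /* assumed leaf size; below this, sum serially */

/* Recursively sum v[0..n), spawning one task per half. */
static double sum_range(const double *v, size_t n)
{
    if (n <= GRAIN) {
        double s = 0.0;
        for (size_t i = 0; i < n; ++i) {
            s += v[i];   /* simple unit-stride loop, easy to vectorize */
        }
        return s;
    }

    size_t half = n / 2;
    double left = 0.0, right = 0.0;

    #pragma omp task shared(left)
    left = sum_range(v, half);

    #pragma omp task shared(right)
    right = sum_range(v + half, n - half);

    /* Wait for both child tasks before combining their results. */
    #pragma omp taskwait
    return left + right;
}

/* Entry point: one thread creates the root of the task tree and the
 * rest of the thread team executes tasks as they become available. */
double task_sum(const double *v, size_t n)
{
    double total = 0.0;

    #pragma omp parallel
    #pragma omp single
    total = sum_range(v, n);

    return total;
}
```

If OpenMP is not enabled, the pragmas are ignored and the same code runs as an ordinary serial recursion.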
The papers and talks can be found at http://www.tacc.utexas.edu/
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804