A first look at the Manycore Testing Lab

Here's a quick report of my initial reactions after spending a couple of hours getting oriented to the Manycore Testing Lab (MTL) through "VIP access", from my perspective as a CS prof at a small college.

First, to clarify just where I'm coming from, I've been directing students in a Beowulf cluster project at St. Olaf College over the past four years or so, and as you know, my primary project at the moment is to produce modular teaching materials for introducing concepts of parallelism and concurrency throughout the undergraduate CS curriculum, starting with the introductory course CS1, together with my collaborator Libby Shoop of Macalester College.  As much as possible, I've been delegating the technical work to my undergraduate research students, who have been doing a fantastic job.  It is a point of pride that I've never even booted or shut down one of our cluster machines, with one exception (A/C breakdown with flooding in our old makeshift cluster room, a year and a half ago on New Year's Eve).  Because of two unusually strong and highly motivated undergraduate researchers, our NSF-funded project has become a partnership with these students that has accelerated achievement of our grant goals.  The WebMapReduce software my students demonstrated in Intel's booth at SIGCSE 2010 is a case in point:  I initially specified and now direct the software project, and they have complete ownership and implementation responsibility, enabling our grant project to get way ahead of schedule on that software goal while Libby and I could focus on beginning to develop the teaching modules.

I received my MTL access last Thursday [May 6, 2010].  This was "crunch time" in St. Olaf's academic calendar, the big push leading to finals (which start 5/20) and on to graduation (5/30), when one of my three top researchers will be leaving.  I've been pushing them hard as undergraduate researchers throughout this term, and it's time for them to focus on being excellent students in their courses.  I'm grateful that Intel provided them with MTL access along with my own access, and they would really like to get started on some of the manycore projects we proposed for the recent MTL contest, but papers and exams need to take precedence for the rest of this month.

But I happen to be on a semester of sabbatical this term, so I started playing with the new technology first myself, for a change.  I wanted something quick, so I wrote up a little trapezoidal approximation loop in C, and thought I'd try parallelizing it with OpenMP to observe the effects of adding more and more cores.  Since we've been a cluster shop up until now, I hadn't actually run an OpenMP code in my life, although I knew the basic approach (this was helped a lot by attending Clay Breshears' workshop on OpenMP at the SIGCSE conference in March).

I had a copy of Chapman, Jost and van der Pas's Using OpenMP: Portable Shared Memory Parallel Programming (2007) handy (I had left Clay's book at home...), and got my OpenMP code running on my office computer (a Linux quad core machine) in short order, using the parallel for construct.  My first naive run uncovered an obvious (in retrospect) race condition in accessing my accumulator;  I found it easy to flip through the book and assess my options, settling on fixing it using the reduction clause.  I was ready to go manycore.







#include <stdio.h>
#include <math.h>

/* I set THREADCT externally in my test script; this is a default value */
#ifndef THREADCT
#define THREADCT 8
#endif

const double pi = 3.14159265358979323846264338327950;

double f(double x);  /* integrand, defined below */

int main(int argc, char** argv) {
    /* Variables */
    double a = 0.0, b = pi;  /* limits of integration */
    int n = 1048576;         /* number of subdivisions */
    double h;                /* width of subdivision */
    double integral;         /* accumulates answer */
    int i;                   /* loop control */

#ifdef _OPENMP
    printf("OMP defined\n");
    printf("threads\t%d\n", THREADCT);
#else
    printf("OMP not defined\n");
#endif

    h = (b - a) / n;
    integral = (f(a) + f(b))/2.0;  /* the two endpoints get half weight */

    /* each thread sums into its own private copy of integral;
       the copies are added together when the loop ends */
#pragma omp parallel for default(none) \
    shared(a,n,h) private(i) reduction(+:integral) num_threads(THREADCT)
    for(i = 1; i < n; i++) {
        integral += f(a+i*h);
    }

    integral = integral * h;
    printf("With n = %d trapezoids, our estimate\n", n);
    printf("of the integral from %g to %g = %g\n", a, b, integral);
    return 0;
}

double f(double x) {
    return sin(x);
}


My quick test code for OpenMP trapezoidal approximation, in C
(Hoping there aren't beginner's mistakes!)

I'll rewrite my code in C++ when I present it to my team of Beowulf students next week, since several of the less experienced ones are comfortable in C++ but not in C.

Downloading the Cisco VPN client to my laptop and using it to connect to the MTL was straightforward with the information given in my welcome emails and the Getting Started Guide.  (I have a couple of technical suggestions that I'll send separately.)  Once I figured out the network situation, I uploaded my little OpenMP code, and cobbled together a quick shell script to time a bunch of test runs with varying thread counts and observe the effects of increasing the degree of multi-core computing.  Of course, computing a 2^{20}-subdivision trapezoidal approximation of \int_0^\pi \sin x\, dx is a toy example, and I'm eager to explore something with more significance and more load for the threads, but there will be time for that later.  I just wanted to make sure I was getting the system to work correctly for me.  Averaging 60 runs for each value of threadct, and simply using the Linux /usr/bin/time utility to assess user, system, and real elapsed time, I soon had the following result table.




















































threadct    user      sys       real      real*threadct
1           0.0671    0.0015    0.0691    0.0691
2           0.0878    0.0015    0.0466    0.0932
4           0.1250    0.0016    0.0346    0.1384
8           0.1649    0.0026    0.0249    0.1992
16          0.1880    0.0052    0.0167    0.2672
32          0.2764    0.0162    0.0152    0.4864

Average time in seconds at various thread counts
(60 runs at each thread count)

My lessons


Of course, my overall objective is to teach undergraduates about parallelism, starting inexperienced undergraduates early in their CS coursework.  This brief foray into multi-core computing demonstrates how useful the MTL will be for introducing beginners to substantial issues of parallelism.  I could imagine students in, say, a second course in CS learning about the following:

  • OpenMP's simple approach to parallelizing for loops makes it easy to incrementally move from sequential to parallel coding in C or C++ for inexperienced programmers.

  • Explicitly managing the shared and private (thread-local) variables appearing in a parallel for construct provides an effective and accessible platform for young programmers to begin considering issues of shared memory and locality.

  • The value of this integral should be 2, and when the reduction clause is omitted from my OpenMP parallel for pragma with threadct > 1, the computed result is obviously different from 2.  This gives an especially natural introduction to the issues that arise when multiple threads have write access to a shared variable.  This is an example of a race condition, of course, since the correct computation depends on timing.  It's hard to think of a more accessible, predictable, and obvious race condition to demonstrate to students who are first learning about concurrency.

  • This reduction solution for avoiding that race condition provides a natural way to introduce the notion of a reduction in parallel computing.  Our students will have seen map-reduce algorithms in WebMapReduce, so we can compare and contrast the two appearances of the idea of reducing.

  • Considering the result table, we see that the real time elapsed when the program runs decreases as the number of threads increases, although doubling the number of threads results in something less than a two-fold speedup.  This gives a good opportunity to point out to students that only part of the code was parallelized, so increasing the number of threads should only improve that portion's performance.

  • That observation can become a launching point for defining speedup and introducing theoretical issues such as Amdahl's Law.


I could go on, but you get the picture:  this example gives an accessible opportunity for students to get acquainted with the effects of parallel computing early in their CS coursework, with minimal start up (assuming they know C or C++).  Followup discussions could explore other columns of the table, other programs, etc.

Of course, before those followup discussions, I will need to do some further explorations on my own.  For example, I might get some insight into the sudden jump in system time at 32 threads by seeing if that higher level is present with 30 or 31 threads;  finding out more about OpenMP and its internal operations would probably help.  One thing that troubles me is that the real time performance seems to follow a steady power law as the number of threads increases (halving the time each time I quadruple the number of threads, at least through 16 threads), which is not what I would expect from Amdahl's law.  I'm eager to carve out some time to explore this technology more.

Proper uses of the MTL


I must admit that I could illustrate most of these points with fewer than 32 cores.  For example, if we put together a "mere" 16-core system at St. Olaf (4 CPUs, 4 cores each), I could likely demonstrate the initial list of pedagogical points in the previous section quite well.  However, we don't have such a system on campus at present.  By using the MTL to explore manycore computing firsthand right now, I am already gaining the experience and building the academic case I need in order to get this kind of technology on campus.

But that would take months and local negotiations.  (At my small college, equipment requests are solicited only twice per year.)  The MTL is immediate:  assuming the MTL will have sufficient resources to support this, I'll start having our students use it in class next semester, long before I could obtain a local system from the College's resources.  (In fact, I'm hoping to run an extracurricular lab session for my cluster research students with it this week, even though it's happening at the end of our semester...)  It's invaluable that Intel is taking care of the maintenance, like a cloud service, especially in the case of a small school like ours.

Besides the classroom use I can envision, I'm definitely looking forward to using it for projects.  For example, I've been talking with my colleague about parallelizing his (C++) code for computer vision segmentation.  I don't know his code, but I anticipate that OpenMP will help to make it straightforward to parallelize.  (By the way, I'm also hoping to run my extracurricular lab for Libby and my faculty colleagues next week.)  If we can demonstrate manycore performance improvement with the MTL, it may well transform this element of his project, and open new doors for its applications.

The VPN/SSL strategy for connecting to the MTL makes this resource widely accessible from just about anywhere.  It's quite understandable that the MTL system is very "locked down" from a networking viewpoint:  a user's local computer and the remote MTL computer can only access each other across the network, thus precluding any potential for a rogue student to launch a DoS or other network-based attack from dozens of cores.  Of course, this is sometimes a little inconvenient.  For example, I needed to use my local machine as a go-between for uploads and downloads from MTL, and found myself manually shutting down/starting up VPN in order to get data or code between a target location off of my local machine and the MTL.  Cisco's VPN client makes this simple, but not as convenient as a direct network copy between target and MTL.  Of course, this connectivity restriction is a quite reasonable tradeoff for access to this unique resource internal to Intel.

So, I am delighted to have had a chance to get started exploring the MTL.  My initial experience has been very positive.  I found it very easy to get started, and after only one short initial experiment, it is already influencing my thinking about how to bring more parallelism into the classroom, even in early courses.  I am eager to get some research projects going on the MTL, too, although some of our team's particular work involving user interfaces to high-performance computing is precluded by the MTL's understandable network security policies.  All in all, this is a great tool for teaching, and I applaud Michael Wrinn's vision and Intel's generous support in making this happen.

Dick Brown
St. Olaf College

P.S. (6/4/10):  My demo lab with half a dozen students the following week (using a C++ rewrite) went very smoothly, without a hitch, even for those who had very little experience.  (My two star students each went off on their own to implement something more substantial, one in TBB and one in pthreads...)   I also ran a separate session for my CS colleagues, who do not have parallel computing backgrounds, leading to quite a provocative and productive discussion.  -- D