Here's a quick report on my initial reactions after spending a couple of hours getting oriented to the Manycore Testing Lab (MTL) through "VIP access", from my perspective as a CS prof at a small college.
First, to clarify just where I'm coming from: I've been directing students in a Beowulf cluster project at St. Olaf College over the past four years or so. As you know, my primary project at the moment is to produce modular teaching materials for introducing concepts of parallelism and concurrency throughout the undergraduate CS curriculum, starting with the introductory course CS1, together with my collaborator Libby Shoop of Macalester College. As much as possible, I've been delegating the technical work to my undergraduate research students, who have been doing a fantastic job. It is a point of pride that I've never even booted or shut down one of our cluster machines, with one exception (A/C breakdown with flooding in our old makeshift cluster room, a year and a half ago on New Year's Eve). Because of two unusually strong and highly motivated undergraduate researchers, our NSF-funded project has become a partnership with these students that has accelerated achievement of our grant goals. The WebMapReduce software my students demonstrated in Intel's booth at SIGCSE 2010 is a case in point: I initially specified and now direct the software project, and they have complete ownership and implementation responsibility, enabling our grant project to get way ahead of schedule on that software goal while Libby and I focus on beginning to develop the teaching modules.
I received my MTL access last Thursday [May 6, 2010]. This was "crunch time" in St. Olaf's academic calendar, the big push leading to finals (which start 5/20) and on to graduation (5/30), when one of my three top researchers will be leaving. I've been pushing them hard as undergraduate researchers throughout this term, and it's time for them to focus on being excellent students in their courses. I'm grateful that Intel provided them with MTL access along with my own access, and they would really like to get started on some of the manycore projects we proposed for the recent MTL contest, but papers and exams need to take precedence for the rest of this month.
But I happen to be on a semester of sabbatical this term, so I started playing with the new technology first myself, for a change. I wanted something quick, so I wrote up a little trapezoidal approximation loop in C, and thought I'd try parallelizing it with OpenMP to observe the effects of adding more and more cores. Since we've been a cluster shop up until now, I hadn't actually run an OpenMP code in my life, although I knew the basic approach (this was helped a lot by attending Clay Breshears' workshop on OpenMP at the SIGCSE conference in March).
I had a copy of Chapman, Jost and van der Pas's Using OpenMP: Portable Shared Memory Parallel Programming (2007) handy (I had left Clay's book at home...), and got my OpenMP code running on my office computer (a Linux quad core machine) in short order, using the parallel for construct. My first naive run uncovered an obvious (in retrospect) race condition in accessing my accumulator; I found it easy to flip through the book and assess my options, settling on fixing it using the reduction clause. I was ready to go manycore.
#include <stdio.h>
My quick test code for OpenMP trapezoidal approximation, in C
(Hoping there aren't beginner's mistakes!)
I'll rewrite my code in C++ when I present it to my team of Beowulf students next week, since several of the less experienced ones are comfortable in C++ but not in C.
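Downloading the Cisco VPN client to my laptop and using it to connect to the MTL was straightforward with the information given in my welcome emails and the Getting Started Guide. (I have a couple of technical suggestions that I'll send separately.) Once I figured out the network situation, I uploaded my little OpenMP code, and cobbled together a quick shell script to time a bunch of test runs with varying thread counts and observe the effects of increasing the degree of multi-core computing. Of course, computing a trapezoidal approximation of a single definite integral is a toy example, and I'm eager to explore something with more significance and more load for the threads, but there will be time for that later. I just wanted to make sure I was getting the system to work correctly for me. Averaging 60 runs for each value of threadct, and simply using the Linux /usr/bin/time to assess user, system, and real elapsed time, I soon had the following result table.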
| threadct | user | sys | real | real*threadct |
|---|---|---|---|---|
| 1 | 0.0671 | 0.0015 | 0.0691 | 0.0691 |
| 2 | 0.0878 | 0.0015 | 0.0466 | 0.0932 |
| 4 | 0.1250 | 0.0016 | 0.0346 | 0.1384 |
| 8 | 0.1649 | 0.0026 | 0.0249 | 0.1992 |
| 16 | 0.1880 | 0.0052 | 0.0167 | 0.2672 |
| 32 | 0.2764 | 0.0162 | 0.0152 | 0.4864 |
Average time in seconds at various thread counts
(60 runs at each thread count)
My lessons
Of course, my overall objective is to teach undergraduates about parallelism, starting inexperienced undergraduates early in their CS coursework. This brief foray into multi-core computing demonstrates how useful the MTL will be for introducing beginners to substantial issues of parallelism. I could imagine students in, say, a second course in CS learning about the following:
- OpenMP's simple approach to parallelizing for loops makes it easy to incrementally move from sequential to parallel coding in C or C++ for inexperienced programmers.
- Explicitly managing the shared and private (thread-local) variables appearing in a parallel for construct provides an effective and accessible platform for young programmers to begin considering issues of shared memory and locality.
- The value of this integral should be 2, and when the reduction clause is omitted from my OpenMP parallel for pragma with threadct > 1, the computed result is obviously different from 2. This gives an especially natural introduction to the issues that arise when multiple threads have write access to a shared variable. This is an example of a race condition, of course, since the correct computation depends on timing. It's hard to think of a more accessible, predictable, and obvious race condition to demonstrate to students who are first learning about concurrency.
- This reduction solution for avoiding that race condition provides a natural way to introduce the notion of a reduction in parallel computing. Our students will have seen map-reduce algorithms in WebMapReduce, so we can compare and contrast the two appearances of the idea of reducing.
- Considering the result table, we see that the real time elapsed when the program runs decreases as the number of threads increases, although doubling the number of threads results in something less than a two-fold speedup. This gives a good opportunity to point out to students that only part of the code was parallelized, so increasing the number of threads should only improve that portion's performance.
- That observation can become a launching point for defining speedup and introducing theoretical issues such as Amdahl's Law.
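For reference, the two definitions mentioned in the last bullets can be written down as follows. The notation is an assumption introduced here, not taken from the post: T(p) is the real elapsed time on p threads, and f is the fraction of the one-thread running time that can be parallelized.

```latex
S(p) = \frac{T(1)}{T(p)}, \qquad
S_{\mathrm{Amdahl}}(p) = \frac{1}{(1 - f) + \dfrac{f}{p}}, \qquad
\lim_{p \to \infty} S_{\mathrm{Amdahl}}(p) = \frac{1}{1 - f}
```

As a worked instance from the table, the measured speedup at 2 threads is S(2) = 0.0691 / 0.0466 ≈ 1.48, and at 32 threads it is S(32) = 0.0691 / 0.0152 ≈ 4.55.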
I could go on, but you get the picture: this example gives an accessible opportunity for students to get acquainted with the effects of parallel computing early in their CS coursework, with minimal start up (assuming they know C or C++). Followup discussions could explore other columns of the table, other programs, etc.
Of course, before those followup discussions, I will need to do some further explorations on my own. For example, I might get some insight into the sudden jump in system time at 32 threads by seeing if that higher level is present with 30 or 31 threads; finding out more about OpenMP and its internal operations would probably help. One thing that troubles me is that the real time performance seems to keep improving at a steady geometric rate as the number of threads increases (roughly halving the time each time I quadruple the number of threads), which is not what I would expect from Amdahl's law. I'm eager to carve out some time to explore this technology more.
Proper uses of the MTL
I must admit that I could illustrate most of these points with fewer than 32 cores. For example, if we put together a "mere" 16-core system at St. Olaf (4 CPUs, 4 cores each), I could likely demonstrate the initial list of pedagogical points in the previous section quite well. However, we don't have such a system on campus at present. Because I can use the MTL to explore manycore computing firsthand right now, I am already gaining the experience and building the academic case I need to make in order to get this kind of technology on campus.
But that would take months and local negotiations. (At my small college, equipment requests are solicited only twice per year.) The MTL is immediate: assuming the MTL will have sufficient resources to support this, I'll start having our students use it in class next semester, long before I could obtain a local system from the College's resources. (In fact, I'm hoping to run an extracurricular lab session for my cluster research students with it this week, even though it's happening at the end of our semester...) It's invaluable that Intel is taking care of the maintenance, like a cloud service, especially in the case of a small school like ours.
Besides the classroom use I can envision, I'm definitely looking forward to using it for projects. For example, I've been talking with my colleague about parallelizing his (C++) code for computer vision segmentation. I don't know his code, but I anticipate that OpenMP will help to make it straightforward to parallelize. (By the way, I'm also hoping to run my extracurricular lab for Libby and my faculty colleagues next week.) If we can demonstrate manycore performance improvement with the MTL, it may well transform this element of his project, and open new doors for its applications.
The VPN/SSL strategy for connecting to the MTL makes this resource accessible from just about anywhere. It's quite understandable that the MTL system is very "locked down" from a networking viewpoint: a user's local computer and the remote MTL computer can only access each other across the network, thus precluding any potential for a rogue student to launch a DoS or other network-based attack from dozens of cores. Of course, this is sometimes a little inconvenient. For example, I needed to use my local machine as a go-between for uploads and downloads from the MTL, and found myself manually shutting down and starting up the VPN in order to get data or code between a target location off of my local machine and the MTL. Cisco's VPN client makes this simple, but not as convenient as a direct network copy between target and MTL. Of course, this connectivity restriction is quite a reasonable tradeoff for access to this unique resource internal to Intel.
So, I am delighted to have had a chance to get started exploring the MTL. My initial experience has been very positive. I found it very easy to get started, and after only one short initial experiment, it is already influencing my thinking about how to bring more parallelism into the classroom, even in early courses. I am eager to get some research projects going on the MTL, too, although some of our team's particular work involving user interfaces to high-performance computing is precluded by the MTL's understandable network security policies. All in all, this is a great tool for teaching, and I applaud Michael Wrinn's vision and Intel's generous support in making this happen.
Dick Brown
St. Olaf College
P.S. (6/4/10): My demo lab with half a dozen students the following week (using a C++ rewrite) went very smoothly, without a hitch, even for those who had very little experience. (My two star students each went off on their own to implement something more substantial, one in TBB and one in pthreads...) I also ran a separate session for my CS colleagues, who do not have parallel computing backgrounds, leading to quite a provocative and productive discussion. -- D

Comments
Hi Dick,
Thank you so much for posting this amazingly thorough recounting of your early access to the Manycore Testing Lab. I look forward to hearing more from you about the testing you and your students are able to do on the MTL. Thanks for being a pioneer and let us know how you are integrating your results into your curriculum.
Best,
Paul
Hi Dick,
Thank you so much for posting this amazingly thorough recounting of your early access to the Manycore Testing Lab. I look forward to hearing more from you about the testing you and your students are able to do on the MTL. Thanks for being a pioneer and let us know how you are integrating your results into your curriculum.
Best,
Paul
Hmm. Looks like I was so excited I commented twice :-) Never mind. Your blog deserves it!
P
Great report, Dick. If you haven't done so already, be SURE to give your technical and other suggestions for improvement to the team at intel_mtl@intel.com.
Cheers,
jdg (only commenting once, but, agree with Paul's enthusiasm wholeheartedly...)
Actually, Michael Wrinn and Prof. Hui Yang (SFSU) did the OpenMP workshop at SIGCSE. My workshop was on Threaded Programming Methodology, but I used OpenMP for the example. Glad it was helpful (and thanks for the book plug).
Two other points that can be made from your timing results are granularity and overhead. The amount of work per thread goes down as you add more threads to work on a fixed size workload (coarse to fine) and the time needed to create and manage threads increases as you add more and more to the mix.
--clay
Dick,
This is a really great blog posting; thanks again for taking the time to clearly document your practical usage of the MTL. I think it will help others enormously who are thinking of applying for access but may be somewhat hesitant about the process and unsure of its immediate benefits. I'm very glad to read that your experiences using the MTL were successful. As Jeff has said, we really appreciate any feedback on how to improve the MTL experience.
Looking forward to working with you and your students again in the near future.
-Mike
Hello Dick,
Nice to see the details about manycore and how MTL provides a relatively easy way to get students involved, I will certainly consider this next time I teach HPSC here -- nice to complement the standard MPI and even the MapReduce approaches I've used in the past -- JD