Performance Presentation: Concepts Behind Parallel Computing and Extended CPU Instructions

As you may have already read in a previous post, Personal Review of Intel Under-NDA Sandy-Bridge Event, I held the last session at an Intel under-NDA event. The presentation was called Performance and covered the various aspects of parallel computing as well as the new Sandy Bridge AVX instructions. I introduced this new feature in a previous post called Visual Studio 2010 Built-in CPU Acceleration. The goal of the presentation is to provide a better perspective on all the new and advanced tools that Intel has provided over the last few years. The most important thing to do before you decide to use any of these new features is to understand the feature and how it applies to your application.
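To give a quick taste of what "extended CPU instructions" means in practice, here is a minimal AVX intrinsics sketch of mine (an illustration, not material from the session) that adds eight pairs of single-precision floats with a single instruction:

```c
#include <stdio.h>
#include <immintrin.h>   /* AVX intrinsics; compile with /arch:AVX or -mavx */

int main(void)
{
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8];

    __m256 va = _mm256_loadu_ps(a);      /* load 8 floats into a 256-bit register */
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_add_ps(va, vb);   /* one instruction adds all 8 pairs */
    _mm256_storeu_ps(c, vc);

    for (int i = 0; i < 8; ++i)
        printf("%.0f ", c[i]);           /* prints: 9 9 9 9 9 9 9 9 */
    printf("\n");
    return 0;
}
```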

Although with a slight delay, as promised during the session and in my last post, this post and the few that follow will contain selected slides along with what you would have heard during the session if you had been there. No NDA material is exposed.

Performance


As always, we should start with a few words about this presentation and its goals. The idea behind this presentation is to help you, the audience, understand the concepts behind parallel computing and extended CPU instructions. I know I have a good presentation in front of me when I notice that writing and reviewing it helps me organize my own thoughts. It means the material is edited correctly and carries a refined message that I myself had never seen before that point. Perspective is very important to me. It can be the difference between a good architecture and an architecture that will have to be modified before the first release of the product. Too often I see people avoid parallel programming because they cannot guarantee that they will pick the right path.

The presentation begins with the less technical slides so we get used to the graphics and the presenter's voice.

Why Parallel

Going back to the 1970s, there were IC (Integrated Circuit) chips that ran at a few MHz. This was new because it meant data could be dispatched very fast compared to physical switching. By dispatching data I mean a single bit and up to several bits. This technology, called TTL, demanded that several chips work in parallel in order to get any work done, because every chip had its own dedicated functionality. If RS232 communication required XOR operations for data integrity, there had to be a XOR TTL chip on the board. A diskette drive using XOR for data verification needed another XOR TTL chip. Processors were expensive at the time and did very little.

At the beginning of the 1980s Intel released a new generation of 16-bit CPU chips. The low price (and a few other factors) made it the main processor for the home PC. It was still very slow, so there were assisting chips on the board such as DMA controllers, communication chips, etc., but now the CPU could perform the integrity check for RS232 communication without the need for a dedicated chip. Slowly but surely the CPU became more and more powerful, adding floating point instructions and packed operations, first as an external co-processor and eventually as an internal component. With this, the need for 'smart' chips working side by side with the CPU was reduced. If we compare the complexity and power of the CPU vs. its peripherals from 1980 to 2000, there is a rapid decline: peripherals became much slower than the main CPU, whereas a 1980 PC might have had devices faster than the CPU helping it with the hard work.
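To make the "integrity check in software" point concrete, here is a tiny sketch of mine of the kind of XOR (longitudinal redundancy check) computation that once required a dedicated TTL chip and that any CPU now handles trivially:

```c
#include <stdio.h>
#include <stddef.h>

/* XOR all bytes together - the longitudinal redundancy check (LRC)
 * that a dedicated XOR TTL chip used to compute in hardware. */
static unsigned char xor_checksum(const unsigned char *data, size_t len)
{
    unsigned char lrc = 0;
    for (size_t i = 0; i < len; ++i)
        lrc ^= data[i];
    return lrc;
}

int main(void)
{
    unsigned char frame[] = { 0x01, 0x02, 0x03, 0x04 };
    printf("checksum: 0x%02X\n", xor_checksum(frame, sizeof frame));
    return 0;
}
```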

There are two main reasons why the CPU became much stronger than its peripherals. The first is that it is relatively simple to increase the CPU clock by improving silicon technology. The faster the CPU, the less help it needs from peripheral devices. This caused the main CPU to take over the roles of many devices and reduced peripheral functionality to OSI 'Layer 2' only. The second reason is that it is simpler to design and redesign software than hardware; whereas hardware may become obsolete, software maintains compatibility over the years and across different board manufacturers.

Then one day we woke up and found that if we keep increasing CPU speed, new cooling technology has to be invented. Such technology is possible, but eventually CPU speed reaches a critical point beyond which it is no longer practical to cool the chip. On a separate track, network cards began offering 'offloading' features, meaning the network card provides Layer 2 processing and also Layer 3 and even Layer 4 functionality. Graphics cards also started providing advanced hardware acceleration features, and the USB bus performs Layer 3 processing in hardware as well.

We find ourselves with many smart peripherals again. This is the result of technology advancing to the point where peripherals today may have more processing power than an old Pentium processor, combined with the fact that CPU speed has reached its critical point. We had CPUs at 4 GHz and then went back to 3 GHz and below.

All of this brings us back to the parallel world:
* We can no longer buy a new CPU and expect it to make our software work faster just because it is a newer CPU
* Peripherals today are more powerful than a CPU was 10 years ago
* The Internet has re-invented distributed systems and scalability does not stop with a single machine

There are new tools and new libraries, new design patterns, new programming models, and even new languages, all created or re-invented for parallel programming, all to help us programmers understand this problematic area of parallel programming and solve the riddle.

This is all very nice and interesting, but I have to tell you that a bigger question has been troubling me, and I would really like to hear the answer to it:

Why would the original system designers base their designs on parallel operations? Parallel hardware is not a good enough excuse to justify Fork. UNIX had a built-in system call named fork. This call splits a process into two separate processes. In fact, the entire system design was based on it: all processes were forked from the main system process, which automatically copied file handles, security attributes, etc. Why would the designers of UNIX support fork when the system was written in Assembly?! You don't add nice-to-have features in such scenarios. Moreover, did you notice that the original user interface was a 'DOS'-like text console that interacted with several processes in parallel? Why would users want that?
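In case fork is new to you, here is a minimal C sketch of what it does (a modern POSIX illustration of mine, not original 1970s UNIX code): the calling process is duplicated, and both copies continue from the same point with the same open file descriptors.

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    pid_t pid = fork();          /* split the current process into two */

    if (pid == 0) {
        /* child: inherits open file descriptors, environment, etc. */
        printf("child  pid=%d\n", (int)getpid());
    } else if (pid > 0) {
        /* parent: continues with its own copy of the state */
        printf("parent pid=%d, child=%d\n", (int)getpid(), (int)pid);
        wait(NULL);              /* join: wait for the child to finish */
    } else {
        perror("fork");
        return 1;
    }
    return 0;
}
```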

See? Something is wrong here. How is it possible that parallel design was a common practice back then if it is so impossible to understand? This just doesn't add up!

We can talk about how parallel programming is the future of computing, and that does make an interesting coffee-break conversation, but there is more to it. One of the most important aspects of parallel design is User Experience. User Experience is the product! It is a combination of two things: the User Interface, which is the graphics and animations, and the Business Logic, which is how the application behaves. Have you ever clicked Print instead of Save and then had to wait 10 seconds for the Printer Selection dialog to appear just so you could close it? That is bad User Experience. You will never find this in a good computer game. The difference between a User Interface and a User Experience is the result of parallel design.
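To illustrate, here is a minimal pthreads sketch of mine (the slow "printer enumeration" is a made-up stand-in, not a real API): the blocking work runs on a worker thread while the main loop keeps serving the user, which is exactly the kind of parallel design that turns a User Interface into a User Experience.

```c
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>

/* Hypothetical slow operation, e.g. enumerating printers over the network. */
static void *enumerate_printers(void *arg)
{
    (void)arg;
    sleep(10);                       /* simulate a 10-second blocking call */
    printf("printer list ready\n");
    return NULL;
}

int main(void)
{
    pthread_t worker;

    /* Run the slow work in parallel so the "UI loop" below stays responsive. */
    pthread_create(&worker, NULL, enumerate_printers, NULL);

    for (int i = 0; i < 10; ++i) {   /* stand-in for an event/render loop */
        printf("UI still responding (tick %d)\n", i);
        sleep(1);
    }

    pthread_join(worker, NULL);
    return 0;
}
```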

Learning that 1970s software designs were based on parallel methodologies still surprises me, even though I know that these 'new' concepts actually came from the 1970s: Services, Web-Server, Cluster, Terminal Services (Remote-Desktop), Transaction, Distributed Computing, Cloud, Fork, Join, and a few others...

I will try to look into it in the following blog posts.
