CPU Auxiliary Cores
By Asaf Shelly (31 posts) on February 4, 2009 at 5:11 pm
If you have ever attended my lectures then you already know that parallel computing is not an add-on or a library that we use. It goes much deeper into the system design and architecture. Many times when we go for lunch after a session about multicore programming, people identify parallel methodologies in the everyday environment. When you do this you know that you are starting to really understand parallel computing. A simple example is lunch itself: someone takes your order before or after the orders of the other people at your table. The order includes the main course and the salad. The main courses are different and have different preparation times. The salad is handled by someone completely different. Nevertheless all dishes for a given table come out at the same time. The system uses Queues to manage the inputs and Accumulation Chambers to make sure that each item waits for all the other items in its group before the group is sent out of the system. Many people notice this and start calling the patterns by name.
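The lunch example can be modeled in a few lines. This is only a sketch under stated assumptions: the function name `run_kitchen`, the table labels, and the use of Python threads as "cooks" are illustrative and not part of the original lecture material. A shared queue holds the individual dishes, and a per-table accumulation chamber holds finished dishes until the whole table is complete.

```python
import queue
import threading
from collections import defaultdict


def run_kitchen(orders, n_cooks=3):
    """orders is a list of (table, dish) pairs; returns {table: dishes served}."""
    work = queue.Queue()          # the input Queue of individual dishes
    expected = defaultdict(int)   # how many dishes each table ordered
    for table, dish in orders:
        expected[table] += 1
        work.put((table, dish))

    chamber = defaultdict(list)   # Accumulation Chamber: finished dishes per table
    served = {}                   # a table appears here only once it is complete
    lock = threading.Lock()

    def cook():
        while True:
            try:
                table, dish = work.get_nowait()
            except queue.Empty:
                return            # no more dishes to prepare
            with lock:
                chamber[table].append(dish)
                # Release the group only when every dish of the table is ready.
                if len(chamber[table]) == expected[table]:
                    served[table] = sorted(chamber.pop(table))

    cooks = [threading.Thread(target=cook) for _ in range(n_cooks)]
    for t in cooks:
        t.start()
    for t in cooks:
        t.join()
    return served


if __name__ == "__main__":
    print(run_kitchen([("A", "steak"), ("A", "salad"), ("B", "pasta"), ("B", "salad")]))
```

Whichever cook finishes a dish, nothing leaves the chamber until the last dish of that table arrives, which is the property the restaurant relies on.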
Today I wish to share with you a model of a system, a design pattern for parallel processing that you will probably not see in too many places. This model comes from a big and heavy organization that needs very fast responses, which in itself is a contradiction. The big organization from which I am taking this practice is the army. An army is a huge organization and there is a lot to learn from its internal structure, because an army does not produce hardware or software and does not provide services. An army does one thing: manage large-scale operations, where faults are paid for in human lives, meaning that the tolerance for faults is zero.
The army is divided into units and these units have war rooms. A small unit can be a tank, which can receive or detect targets and respond fast, while a large unit can have multiple targets flowing into a huge queue from multiple small units. These targets are collected as Tasks, sorted, and dispatched to the most appropriate unit. So far this makes a lot of sense. The advantage of a small unit is its ability to respond fast (they shoot at me, I shoot at them), and the advantage of the large unit is its ability to collect multiple targets and prioritize. We employ these methodologies today in computerized systems. The first is Hardware Acceleration and the second is a software application or service. As an example of the first: when you press the 'open' button on a CD-ROM drive the tray opens, and all the CD-ROM device needs is a connected power cable (no OS or BIOS has to communicate with it). As an example of the second: I press the 'play' key on a multimedia keyboard and a media player starts playing music from the CD-ROM device.
This concept already exists, as you can see. The interesting concept that I found takes the best of both worlds. On one hand there is a need to collect information and locate targets. On the other hand this has to be fast enough or the targets become irrelevant. The time scales leave no doubt: for example, the window for hitting the target before it is too late is 20 seconds, but the time it takes to filter the target out of the list and dispatch it is 3 to 5 minutes.
The solution that I found is an auxiliary unit. This unit is independent. It is 'initialized' with the filtering conditions (for example 'brown targets') and it has the ability to act immediately. Instead of boring you with the details I will provide the computerized pattern that I am looking for:
Here is an example of a problem: I have two processes. One is an application that is doing some work, and it calls the second, which is a service, to do some operation. After the operation is complete the application needs to do some cleanup. Let's add timing so it is clearer: the application works for 120 milliseconds, calls the service for a 5 millisecond operation, and then completes with a 5 microsecond cleanup. The problem is that the transition from the application to the service requires a heavy Context-Switch, when the service is done there is another Context-Switch back to the application, and then there is yet another Context-Switch to the next application because our application has completed its operations. For the sake of the exercise let's suppose that a Context-Switch costs 3 microseconds. In this case the switch from the application to the service adds less than 0.1% to the processing time; however, the switch from the service back to the application costs 3us for the Context-Switch in order to do only 5us of actual work, which degrades overall application performance and system performance. This is especially true when multiple threads are used or a thread pool is used for short tasks. Sometimes this cannot be avoided: for example, we wait on an event that indicates that a file is fully written, and when the event is signaled we close the file and exit.
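The numbers above are easy to check. The sketch below only restates the figures from the example (120 ms of work, 5 us of cleanup, an assumed 3 us per Context-Switch); the helper name `overhead` is illustrative.

```python
APP_WORK_US = 120 * 1000   # 120 ms of application work, in microseconds
CLEANUP_US = 5             # 5 us of cleanup after the service returns
SWITCH_US = 3              # assumed cost of one Context-Switch


def overhead(switches, useful_us, switch_us=SWITCH_US):
    """Context-Switch time as a fraction of the useful work it pays for."""
    return switches * switch_us / useful_us


# Switching into the service after 120 ms of work: about 0.0025%, far below 0.1%.
into_service = overhead(1, APP_WORK_US)

# Switching back just to run the cleanup: 3 us of switching for 5 us of work.
back_for_cleanup = overhead(1, CLEANUP_US)

# Counting the switch away to the next application as well: 6 us for 5 us of work.
cleanup_round_trip = overhead(2, CLEANUP_US)

if __name__ == "__main__":
    print(f"into service:       {into_service:.4%}")
    print(f"back for cleanup:   {back_for_cleanup:.0%}")
    print(f"cleanup round trip: {cleanup_round_trip:.0%}")
```

The same 3 us switch that is negligible next to 120 ms of work costs more than half of the cleanup it enables, and more than the cleanup itself once the switch away is counted.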
The solution offered here is a small auxiliary CPU Core that can handle such small tasks. This CPU Core will not have an interrupt handling mechanism, which means that it cannot support Page-Faults, and thus all the memory it touches MUST already be resident in physical memory. It also cannot handle Exceptions. CPU Exceptions are implemented as hardware Interrupts, so an Exception on an Auxiliary Core must be handled by a real CPU Core.
An appropriate configuration would probably be 4 CPU Cores and 256 Auxiliary Cores. The CPU Cores handle Context-Switches and so on; however, the implementation may very well allow multiple processes to be active at the same time. When a process enters a wait-state to wait for an event, the process information required after the event is signaled will be prepared on a single memory page. When the event is signaled, an Auxiliary Core can execute the code immediately until the next wait-state. An example of this is the code fragment that handles a window's input loop. For the example above, the application will work for 120ms, prepare the appropriate information, and wait for the service. This is a Context-Switch; however, when the service is done it will not Context-Switch back to the process, because the process is already active in memory and an Auxiliary Core will immediately start working. This pattern may be very effective in the case of two threads communicating with each other and blocking each other repeatedly.
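No such hardware exists to program against, so the following is only a software model of the control flow, under loud assumptions: a Python generator stands in for "the process information prepared on a single memory page", each `yield` marks a wait-state, and the hypothetical `AuxDispatcher` class plays the role of the Auxiliary Cores that resume parked work when an event is signaled.

```python
from collections import deque


def application(log):
    """The example process: long work, a wait-state, then a short cleanup."""
    log.append("app: 120 ms of work")   # would run on a full CPU Core
    yield "service_done"                # wait-state: the continuation is parked
    log.append("app: 5 us cleanup")     # short tail, suitable for an Auxiliary Core


class AuxDispatcher:
    """Hypothetical model: parks continuations per event and resumes them on signal."""

    def __init__(self):
        self.parked = {}                # event name -> deque of parked continuations

    def start(self, process):
        """Run a process until its first wait-state."""
        self._run_until_wait(process)

    def signal(self, event):
        """Resume everything parked on this event; returns how many were resumed."""
        waiting = self.parked.pop(event, deque())
        for process in waiting:
            self._run_until_wait(process)
        return len(waiting)

    def _run_until_wait(self, process):
        try:
            event = next(process)       # execute up to the next wait-state
        except StopIteration:
            return                      # the process finished its work
        self.parked.setdefault(event, deque()).append(process)


if __name__ == "__main__":
    log = []
    aux = AuxDispatcher()
    aux.start(application(log))
    aux.signal("service_done")          # the service reports completion
    print(log)
```

The point of the model is that signaling the event runs the cleanup directly from the parked state; nothing in it corresponds to switching back into the full process context.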
This sounds like a very good methodology, however it means employing Core Dedication which is something that we are still trying to avoid.
I will soon add this as one of the design patterns in the collection at http://www.AsyncOp.com. If you have any further ideas that are not there then feel free to communicate with me.
Categories: Academic, Parallel Programming, Software Tools
Tags: Guest Blog, Overview, Technical, www.AsyncOp.com
Comments (9)
| February 8, 2009 6:03 PM PST
Asaf Shelly
|
Hi Jim, Let me start by saying that I appreciate the long response and the time and effort put into it. You raise many interesting points and I will try to address them one by one. I can see your general point of view. It was actually clear to me that writing a demo for this design is not going to be simple. It is possible to use a regular core to simulate an Aux core, but as you mentioned the operating system is not equipped for this task. I agree that the CPU design that we have today and the OS above it are not ideal. Perhaps I am hoping that this will change. The 8086 and the architecture around it were designed before 1980. How many computer generations ago was that? The system is not designed for Core-Dedication models in which different cores have different capabilities. Another huge problem that I have with the system is that there is no such thing as an Event with Data or an Event with Priority. The priority of the Event is determined by the priority of the thread waiting for that Event. What about an Event with Data? The hardware supports only one type of flow control: the Stack. Splitting execution flow is done by using multiple Stacks. I am not sure that it is the only model that I want to see. To the point of your detailed response: what if instead of considering applications we consider IRPs on Windows NT? An IRP is an Event with Data; it has its own Stack as part of the IRP data buffer (Get Stack Location), and priority is based on the request because in this model every Event has a Stack. If IRP packets are locked in physical memory then it would be possible for one driver to start an operation, call the next driver, and when the operation is complete resume its operation on an Auxiliary core. Would that solve the problem and make it possible? Best Regards, Asaf |
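To make "an Event with Data" and "an Event with Priority" concrete, here is a minimal sketch. It is not how Windows events or IRPs are implemented; the `Event` and `EventQueue` names are hypothetical, and a binary heap stands in for whatever the system would use. The priority travels with the event itself rather than with the waiting thread, and the payload is delivered along with the wake-up.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import Any


@dataclass(order=True)
class Event:
    priority: int                        # lower number means more urgent
    seq: int                             # arrival order, breaks ties between equal priorities
    data: Any = field(compare=False)     # the payload carried by the event


class EventQueue:
    """Hypothetical queue in which the event, not the waiter, carries the priority."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()

    def signal(self, priority, data):
        heapq.heappush(self._heap, Event(priority, next(self._seq), data))

    def wait(self):
        """Return (priority, data) of the most urgent pending event."""
        event = heapq.heappop(self._heap)
        return event.priority, event.data

    def __len__(self):
        return len(self._heap)


if __name__ == "__main__":
    events = EventQueue()
    events.signal(5, "log flush")
    events.signal(1, "disk read complete")
    print(events.wait())
```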
| February 9, 2009 9:17 AM PST
jimdempseyatthecove
|
Asaf, Although the 8086 has been around since the 1970s, I would not equate the age of a technology with a lack of capability. The original 8086 had an instruction escape sequence that was used to extend the instruction set into a co-processor. The FPU (8087) was one such co-processor. But you could have used this “old” feature to extend the instruction set into other devices; Auxiliary cores could have been one such example. In a manner of thinking your new idea is really an old idea revisited. Co-processors have been around for a long time. The current style of GPU programming is an example of a Windows NT IRP style of programming for a hybrid programming environment between CPU and GPU (read co-processor). And shared memory techniques can be used after initialization to reduce the number of O/S IRP calls. The trick with any co-processor implementation is how to eliminate or greatly reduce the start/stop time and the communication overhead, reduce memory cycles, take advantage of cached data, not introduce interrupt latency, and not require changes in the operating system. And, not to mention, fit in with current multi-core designs. Extending SIMD width (e.g. AVX) is one example of increasing parallelization. At best, some limited sections of code may see a 2x performance boost. What you have been seeking is MIMD (Multiple Instructions Multiple Data), whereby an application hardware thread (using full CPU resources) can enlist the cooperation of auxiliary processors to perform multiple execution paths through the system, but not at the expense of an operating system context switch. The principal goal is to reduce or eliminate Auxiliary start/stop times as well as eliminate O/S context switching, and further, to do this without requiring an operating system change. One way to handle this would be to structure the Auxiliary processing units (APUs) to work off of one or more AVX 256-bit SSE registers, i.e. the working data set is never in memory (thus eliminating the page fault) and is limited to 32 bytes of data (or 64, 96, 128, … bytes of data). APUs could potentially read-share AVX SSE registers. To simplify the design, you might add the restriction that only one APU is permitted write access. What does this give you? Each of the hardware threads (e.g. 8 on a 4-core processor) has 16 xmm registers, and a 4-core system with HT could potentially enlist 128 APUs. Since this sounds like a project for a thesis you are interested in doing, see if you can get a copy of the AVX emulator from Intel. Extend the AVX to perform general programming in a MIMD format. You can experiment with your proposed architecture without expending too much time in writing the simulation software. Jim Dempsey |
| February 9, 2009 9:22 AM PST
jimdempseyatthecove
|
I forgot to mention that an extension of this APU instruction set might be to permit it to read/write to L1 and/or L2 cache but not main RAM. Jim |
| February 12, 2009 4:18 AM PST
Asaf Shelly
|
Hi Jim, I have to say that FPU and MMX have passed through my mind, but I dismissed the model immediately because it is used to extend the instruction set, which by definition blocks the CPU. However, now that you mention it, it does solve the problem of cache management and sharing issues. When you say MIMD I find myself wondering if the OS can live with instructions predefined by Intel. The answer is that such an external unit should have pre-loadable code, something like Stored Procedures. This way the application / driver can allocate code space in the APU's internal memory / APU-cluster memory and every application / driver can use it. For example malloc and free, or CloseHandle, etc. Such acceleration could also prove useful for basic OS operations such as Context-Switch related tasks. Generally speaking such CPU-installed procedures could reduce CPU complexity: for example, using 8 programmable APUs would eliminate the need for an MMX unit and for instructions such as STOS and SCAS (if I remember assembly correctly), and would also allow for an extended Atomic-Operations instruction set. The only problem that I have is that another thread cannot continue to run while the core is hosting another thread. The model you are raising with APUs might prove to be better than multi-core CPUs. This is what MMX and FPU came to solve: the need to perform single-byte operations faster than a single operation per CPU clock, and the need to perform floating point operations that take a single CPU cycle (or as close to it as possible). These are effective only when there are massive amounts of operations to perform and are irrelevant when it comes to a few operations. This is the same problem that multi-core CPUs come to solve. MMX and FPU work great but they are not general purpose. I would propose a different model from 4 cores with MMX and FPU. How about a single-core CPU with 32 such APUs that have a collection of stored procedures? This way, for example, a C++ 'for' loop would mean installing the same code on all APUs, each with a different iteration index. On the other hand it would be possible to run OS MIMD code at the same time that an application and a driver are running their own MIMD code. What do you think? Asaf Shelly |
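The 'for' loop idea, the same code installed on every APU with a different iteration index, has a familiar software analogue. The sketch below assumes nothing about real APU hardware: a thread pool of 32 workers stands in for the 32 APUs, and `parallel_for` and `loop_body` are illustrative names.

```python
from concurrent.futures import ThreadPoolExecutor


def loop_body(i, data):
    """The 'stored procedure': identical code, parameterized only by the index i."""
    return data[i] * data[i]


def parallel_for(body, n, data, n_apus=32):
    """Run body(i, data) for i in range(n), one iteration per worker at a time."""
    with ThreadPoolExecutor(max_workers=n_apus) as apus:
        # map() hands each worker a different index and keeps results in index order.
        return list(apus.map(lambda i: body(i, data), range(n)))


if __name__ == "__main__":
    print(parallel_for(loop_body, 5, [1, 2, 3, 4, 5]))
```

This only works as written when the iterations are independent of each other, which is the same restriction the hardware version would impose.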
| February 13, 2009 9:06 PM PST
jimdempseyatthecove
|
Look at gpgpu.org and the ATI and nVidia websites regarding programming a GPU for general use. You can already incorporate hundreds of processors in a PC to help run your application. Brook+ (ATI) and CUDA (nVidia) are popular programming tools for C++-like programming of GPUs. Intel Larrabee will have similar characteristics to a GPU and will potentially be better suited for use as an "APU". Jim Dempsey |
| February 19, 2009 2:02 PM PST
Aaron Tersteeg (Intel)
|
Jim and Asaf, Thank you for sharing your thoughts and staying involved with the Parallel Programming community. Cheers, Aaron |
| March 13, 2009 2:40 PM PDT
Gastón C. Hillar
|
I am adding some personal thoughts about this discussion, without getting involved in technical details, because the discussion has been very interesting indeed. Nowadays, I don't think that the problem is how to organize cores from a hardware perspective. I think the problem is how to convince developers to make the necessary efforts to use that huge processing power in the applications. More than 80% of the base software I am using doesn't use more than 1 core. Once we have 80% of desktop applications taking full advantage of multicore, there will be a great space to test new alternatives. However, let's be serious. Nowadays, most developers do not use more than 1 core in their applications. There is a lot of work to be done in the software land. Software engineers do have a great opportunity to exploit the hardware advancements created by hardware engineers. Cheers, Gastón Hillar |
| March 24, 2009 8:25 AM PDT
Asaf Shelly
|
Hi Gastón, As always we agree on most things :-) I do not believe that there is a reason for developers to start using the CPU more. First of all we need to accept that there are two basic types of operations for threads: one is work, done by a working thread, and the other is wait, done by a blocking thread. The biggest problem for me today is that the infrastructure is not good enough. Take for example Web Servers and Databases. Both are very good parallel infrastructures that can handle multiple requests at the same time, while the programmers who use these technologies, for example Asp.Net and PHP, write their code serially. We should expect more of this in the future. Best Regards, Asaf |
jimdempseyatthecove
The concept of the auxiliary cores is good; however, I believe the manner in which you describe “an example of a problem” concentrates on eliminating or reducing the context switch for application startup/suspension. Although this goal is commendable, it will require a change in the operating system whereby an event can signal that a process is to be started, or auxiliary core(s), or both. This is a chicken-and-egg situation, unless you are the provider of the operating system.
A second problem arises in a virtual memory system: unless an application has the ability to probe and lock pages (most user-mode applications do not have this privilege), then (assuming auxiliary cores must not take page faults) the only memory the auxiliary thread could reliably address is the process’s current instruction pointer page and stack pointer page (where stack pointer page residency is questionable at best).
The ability to probe and lock pages cannot be handed to applications willy-nilly, as an application could be written to abuse the authority and attempt to lock all of physical memory in an effort to purposely crash a system.
To correct for this, the operating system will have to be changed, whereby a process can request a residency requirement of this, that, and these other pages. The operating system would have permission to deny the request and/or delay the request by suspending the process. However, when the process continues, the requested pages are guaranteed to be resident. I do not believe this feature request would be hard to implement since a device driver can do this for you already. But for this purpose you would like to eliminate a device driver call if at all possible.
With the above residency requirement met, the application is free to “throw” fiber requests at the auxiliary cores (without knowing whether any are available at the time). The fiber throw stalls while no auxiliary core is available, and also stalls during the period between an auxiliary core becoming available and that core being mapped to the virtual address space of the process. Once an auxiliary core is available and mapped, the throw succeeds.
Once an auxiliary core starts processing it may have a restriction on the number of instruction cycles it is permitted to execute. Upon interrupt for context switch, when the process (in interrupt service) issues MFENCE it stalls until all current process bound auxiliary cores complete. If the MFENCE occurs late in the interrupt service the auxiliary core may have completed processing and no stall would be encountered.
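The throw-and-fence behaviour described in the last two paragraphs can be mimicked in software. This is a sketch under stated assumptions, not a description of any real processor feature: the `AuxCorePool` class and its `throw` and `fence` methods are hypothetical names, a counting semaphore models the limited number of auxiliary cores, and `fence` plays the role the MFENCE is given above by waiting for every outstanding throw to complete.

```python
import threading


class AuxCorePool:
    """Hypothetical model of a fixed set of auxiliary cores."""

    def __init__(self, n_cores):
        self._free = threading.Semaphore(n_cores)   # one permit per auxiliary core
        self._pending = []                          # threads standing in for busy cores
        self._lock = threading.Lock()

    def throw(self, fn, *args):
        """Hand fn to an auxiliary core; stalls while no core is available."""
        self._free.acquire()                        # the throw stalls here

        def run():
            try:
                fn(*args)
            finally:
                self._free.release()                # the core becomes available again

        worker = threading.Thread(target=run)
        with self._lock:
            self._pending.append(worker)
        worker.start()

    def fence(self):
        """Stall until every throw issued so far has completed."""
        with self._lock:
            pending, self._pending = self._pending, []
        for worker in pending:
            worker.join()


if __name__ == "__main__":
    results = []
    pool = AuxCorePool(n_cores=2)
    for i in range(6):
        pool.throw(results.append, i * i)           # at most 2 run at any moment
    pool.fence()
    print(sorted(results))
```

If the fence is reached after all thrown work has already finished, the joins return immediately, which corresponds to the case above where no stall is encountered.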
Note that with 256 auxiliary cores working for one process, the MFENCE could wait for thousands of memory cycles for them to complete. Interrupt latency could be a problem, but there are workarounds for this (e.g. not using this on certain cores, or restricting the number of pending throws on certain cores).
This would require the architecture of the processor to be aware of the presence of and status of the auxiliary cores… but would not require a change in the operating system to implement.
The initial implementation for page locking can be done on Windows using a device driver and IOCTL. Later implementations would have a proper user application level request to lock/unlock specific pages.
Jim Dempsey