Let me start with a brief introduction. I am a program/release manager coordinating release activities in the team developing Intel(R) Parallel Amplifier and Inspector. Both are components of the new Intel(R) Parallel Studio.
Before joining Intel I worked at the software company that develops Open CASCADE (www.opencascade.org), an open source 3D modeling kernel that is available for free download and can be used to model, analyze, visualize and exchange 3D shapes (such as car bodies or bridge elements). Multiple CAD/CAM/CAE applications use this modeling kernel, and if you are familiar with that industry, you have probably heard of it.
A few months ago, while we were preparing for the Beta release of Amplifier, virtually everyone in the team was mobilized to test the new product. I volunteered to participate, not only to eat our own dog food and understand how it tastes to others, but also to challenge it with the Open CASCADE software, something I knew really well from the inside. Somewhat surprisingly, it turned out to be a win-win: not only did Amplifier get better, but Open CASCADE did too. Let me show you how this worked out.
Any 3D modeling kernel must provide 3D Boolean Operations: algorithms to intersect, fuse and cut solid bodies. Due to their complexity they are often relatively slow, sometimes taking up to several minutes, which regularly drew criticism from users.
So I selected Boolean Operations as the first target for Amplifier, choosing its hotspot analysis mode (which lets you identify the parts of your code that take the most time to complete). As Open CASCADE is single-threaded so far, I had to leave aside the two other analysis types: concurrency analysis (which lets you estimate how efficiently your CPU cores are used by your multi-threaded application), and waits & locks analysis (which shows how your application's threads waited and were awakened during execution).
It can be amazing how many improvement opportunities your legacy code may hide: loop invariants recomputed on every iteration, excessive memory allocations, repeated recalculation of the same data, and what not! In about 4 hours, after several experiments, rewriting some Open CASCADE code, comparing test runs, and repeating this over and over again, I finally said "yes, I did it!" The results spoke for themselves: the achieved gain ranged from 2x to 40x (though I used a limited test set to measure it). For me, as a software developer, this was so inspiring that I sat down and immediately typed my story into a personal blog that I run at opencascade.blogspot.com. It is basically a step-by-step story, told as it evolved. You can read the 3 parts of "Why are Boolean Operations so slooo...ooow ?" by following these direct links: part1, part2, part3. The screenshots are slightly outdated (as the product was in active development) but you will easily match them with the current version of Amplifier.
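To make two of those categories concrete, here is a minimal sketch of what such fixes typically look like. It is not taken from Open CASCADE; the function names `scaleSlow` and `scaleFast` and the computation itself are hypothetical and chosen only for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical "before" version: the loop invariant is recomputed on every
// pass and the result vector grows one element at a time.
std::vector<double> scaleSlow(const std::vector<double>& values, double angle)
{
    std::vector<double> result;
    for (std::size_t i = 0; i < values.size(); ++i) {
        // std::cos(angle) does not depend on i, yet it is evaluated each time.
        result.push_back(values[i] * std::cos(angle));
    }
    return result;
}

// Hypothetical "after" version: the invariant is hoisted out of the loop and
// the output buffer is allocated once up front.
std::vector<double> scaleFast(const std::vector<double>& values, double angle)
{
    const double factor = std::cos(angle);  // computed once
    std::vector<double> result;
    result.reserve(values.size());          // single allocation
    for (double v : values) {
        result.push_back(v * factor);
    }
    return result;
}
```

Both functions return the same values; only the amount of repeated work differs. In a hotspot profile, the first version would tend to show time attributed to the cosine call and to vector reallocation inside the loop.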
My blog describes another 'love' story with Amplifier, and there are many more that I have not written up. I have been posting my fixes on SourceForge to give the Open Source community early access to them while the company's developers validate them for inclusion in future Open CASCADE releases.
Today Amplifier is a tool in my toolbox. I hope it will become one in yours as well. Just go to http://www.intel.com/go/parallel and try its Beta for yourself today, and see what it can do for *you*.
| March 5, 2009 8:15 AM PST
Dmitry Oganezov (Intel)
|
Great story, Roman! I know that you write code from time to time, despite the fact that as a Program Manager you're supposed to write angry e-mails and draw roadmaps :). So, let me challenge you with a question: are you being honest when you say "In about 4 hours, after several experiments"? Perhaps you had solid experience with Amplifier before your experiment with Open CASCADE, and that's probably why you spent only 4 hours to get a "40x" improvement. How much time will a newbie spend? 4 weeks? |
| March 5, 2009 9:21 PM PST
Roman Lygin (Intel)
|
Thanks for the welcome, colleagues. To Alexey: TBB is the next step I was going to take in pursuing parallelization. Despite my Qt addiction, for this particular project I believe TBB is the more appropriate choice. For a 3D modeling library it makes more sense to depend on a solution like TBB than on a library which is mainly GUI-focused. Nonetheless, I am thinking of an abstraction layer inside Open CASCADE itself, subclassed with a TBB-based implementation. To Dmitry: No, I'm not joking. Though I did have some experience with Amplifier, it's just plain intuitive to use. I bet an average user who understands what he intends to achieve with it will be familiar with the tool in less than 20-30 minutes, at least with hotspot analysis (provided, of course, that he's familiar enough with the code he is testing to understand what Amplifier tells him). What took me 4 hours was basically understanding particular algorithmic details of the Open CASCADE Boolean Operations and rewriting them over several iterations. What also saved me time was the comparatively low overhead of Amplifier itself: running the Booleans under Amplifier was almost as fast as running them without it. |
| March 6, 2009 11:13 AM PST
Alexey Kukanov (Intel)
|
While I understand the reasons behind the desire to provide an abstraction layer for parallelism, I have also seen a few attempts to do so which, in my opinion, lost most of TBB's benefits, such as good support for nested parallelism, dynamic load balancing, automatic adjustment to the number of processors, etc. To keep most of those TBB advantages you would need to design the abstraction layer over the high-level TBB algorithms (parallel_for), and even in this case some benefits, such as the availability of different work partitioners, might be lost. TBB containers, sync primitives, and the memory allocator could also be useful, and wrapping those (except perhaps for the allocator) could be a problem. If/when you start the effort, I am ready to provide help and assistance. In the past (though rather long ago), I dealt with computational geometry, and it is still of interest to me :) |
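As a rough illustration of an abstraction layer designed at the level of a high-level algorithm rather than at the level of threads, here is a minimal sketch. The names `ParallelBackend`, `SerialBackend` and `squareAll` are hypothetical and are not part of Open CASCADE or TBB; only a serial backend is implemented so that the sketch has no external dependencies.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical interface: the library codes against a "parallel for" concept
// and never touches threads directly.
class ParallelBackend {
public:
    virtual ~ParallelBackend() = default;
    // Calls body(i) once for every i in [first, last); order is unspecified.
    virtual void parallelFor(std::size_t first, std::size_t last,
                             const std::function<void(std::size_t)>& body) const = 0;
};

// Fallback backend that simply runs the iterations in order on one thread.
class SerialBackend : public ParallelBackend {
public:
    void parallelFor(std::size_t first, std::size_t last,
                     const std::function<void(std::size_t)>& body) const override
    {
        for (std::size_t i = first; i < last; ++i) {
            body(i);
        }
    }
};

// Assumption: a TBB-backed subclass could override parallelFor and forward to
// tbb::parallel_for(first, last, body), leaving scheduling and load balancing
// to TBB. It is omitted here to keep the sketch dependency-free.

// Example client code: iterations are independent, so any backend may run them.
void squareAll(const ParallelBackend& backend, std::vector<double>& data)
{
    backend.parallelFor(0, data.size(), [&data](std::size_t i) {
        data[i] *= data[i];
    });
}
```

An interface this narrow hides the choice of partitioner, which is one of the benefits mentioned above as likely to be lost, and the per-index `std::function` call adds overhead of its own.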

Alexey Kukanov (Intel)
Since you said that this heavy code is not yet parallelized, would it be interesting for you to "eat more of our own dog food" by trying to make it parallel with TBB? It would definitely be interesting for me :) Or do you think that Qt Concurrent is the way to go (a wild guess after reading some of the posts you referenced)?