Optimizations for MSC.Software SimXpert* using Intel® Threading Building Blocks (Intel® TBB)

Executive Summary

MSC.Software SimXpert* is a fully integrated simulation environment for performing multidiscipline-based analysis. It provides an interactive, graphical interface designed to facilitate the engineer's job of performing end-to-end simulations, including native computer-aided design (CAD) import, pre-processing, model setup, solving, post-processing, and reporting.

Intel and MSC.Software collaborated on threading of SimXpert, resulting in a significant performance improvement in the tool and increased productivity for users. Intel® TBB was selected because it suits SimXpert, a multi-platform application written in C++. SimXpert has many features beyond typical high performance computing (HPC) number-crunching applications, including complex database-style manipulation of geometry data and simulation results, complex memory allocation operations, reliance on extensive enterprise-class infrastructure C++ code, an overall visualization pipeline architecture that fits the Intel TBB pipeline parallel pattern, and OpenGL* rendering.

Threading with Intel TBB was added incrementally, starting with an initial 72 key engineering operations, followed by the code responsible for producing graphical primitives for fringe plots. Measurements for seven very large customer simulation models on a two socket (2S) Intel® Xeon® processor 5100 series platform (4 threads) showed scaling between 3.8X and 3.9X for the engineering calculations. For the fringe plot optimizations, a speedup ranging from 3% to 44% was achieved.

Going forward, MSC.Software will continue the incremental threading approach; next steps will be threading the remaining plot types with possible future implementation of Intel TBB pipeline to overlap processing with I/O.

Introduction

To address increasing customer model sizes and align with the multi-core processor roadmaps for hardware vendors, MSC.Software engaged with Intel to thread SimXpert. The Intel Academic Community provided training for a group of MSC.Software engineers on threading for multi-processor architectures and Intel® Threading Tools (Intel® Thread Checker, Intel® Thread Profiler, and Intel TBB). A multi-phased, incremental threading approach was defined for the project.

For Phase One, MSC.Software identified 72 engineering operations in the post-processing portion of SimXpert that are responsible for the calculation of various engineering quantities, e.g., von Mises, Principal, Tresca, and Maximum Shear stresses. Intel prototyped the engineering operations and investigated both Intel TBB and OpenMP* for the threading implementation. Intel TBB was selected as the best method due to its compatibility with all supported platforms. Its performance was also slightly faster than OpenMP.

For Phase Two, code responsible for producing graphical primitives was threaded, which improved performance for fringe plots. This whitepaper discusses the details of these threading implementation phases, the results achieved, and plans for additional threading for SimXpert in future phases.

 

 

Background/Workloads measured

Once the finite element model has been analyzed, the results can be accessed by SimXpert for post-processing. It was the Post-Processing Component (PPC) of SimXpert that Intel and MSC.Software targeted for threading. This "module" allows the expert analyst to view selected results in a variety of ways such as fringe, deformation, contour, vector, and tensor plots, identify problems, and redesign areas of a structure if necessary. Performance for both threading phases was measured for fringe plots using large simulation models provided by MSC.Software customers. These models represent typical use cases from customers in the Aerospace, Automotive, and General Manufacturing industries. The numerical and graphical loading that occurs is due to several critical factors.

 

 

  • Free faces (Figure 1) are the internal and external faces of the model's finite elements where a fringe plot is rendered.
  • The clustering of the finite element IDs for the elements whose free faces are being rendered directly affects the result data retrieval time.
  • The dimensionality of the data (i.e. scalar, vector, tensor data type) directly affects the number of data values that are retrieved for post-processing.
  • The complexity of the engineering derivation that is applied to the initial analysis data to transform it from either a vector or tensor data type to a scalar data type for fringe plot rendering also plays a role.
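To make the last two factors concrete, the following is a minimal C++ sketch of one such derivation: reducing a six-component symmetric stress tensor to a single von Mises value. The names `StressTensor` and `vonMises` and the component layout are illustrative assumptions, not SimXpert code; the formula is the standard von Mises definition.

```cpp
#include <cmath>

// Hypothetical layout: the six independent components of a symmetric
// stress tensor retrieved for one node or element.
struct StressTensor {
    double sxx, syy, szz;  // normal components
    double sxy, syz, szx;  // shear components
};

// Standard von Mises equivalent stress: six input values per entity are
// reduced to one scalar that a fringe plot can map to a color.
double vonMises(const StressTensor& s) {
    const double dxy = s.sxx - s.syy;
    const double dyz = s.syy - s.szz;
    const double dzx = s.szz - s.sxx;
    const double shear = s.sxy * s.sxy + s.syz * s.syz + s.szx * s.szx;
    return std::sqrt(0.5 * (dxy * dxy + dyz * dyz + dzx * dzx) + 3.0 * shear);
}
```

A tensor result therefore requires six values to be retrieved per entity, versus one for a scalar result, and adds a square root and several multiplications per entity before rendering.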

 




Figure 1 - Free Face Rendering on the Model's Finite Elements

 

 

 

Threading SimXpert - Phase One

The initial targets for threading SimXpert were 72 engineering calculations in the Post-Processing Component (PPC) portion of SimXpert. Transformations were required in the original serial code before it could be parallelized with tbb::parallel_for.

Original Serial Code

 

 

  for (size_t i = 0; i < Size; ++i) {
         deriveFunc(ptr_inArray, ptr_outArray);
         ptr_inArray += inStride;    // advance to the next input entity
         ptr_outArray += outStride;  // advance to the next output slot
  }

 


Transformation to make arrays random access containers

 

 

 

for (size_t i = 0; i < Size; ++i) {
     // Addresses are computed from i alone, so iterations are independent
     deriveFunc(&ptr_inArray[i * inStride],
                &ptr_outArray[i * outStride]);
}

 


After completing the transformations, tbb::parallel_for was integrated into the application. MSC.Software relied heavily on other threading tools such as Intel® Thread Checker and Intel® Thread Profiler to ensure correctness and optimum performance. This code represented only 7.4% of the total runtime for SimXpert, but threading resulted in an average of 4.9% improvement in overall performance. Table 1 shows the scaling that was achieved on a 2S 3.0GHz Intel® Xeon® processor 5100 series platform/8GB with Red Hat Linux* 4 update 3.

 

 

 

Plot | File Name/Entity Count | Serial Time (sec) | Parallel Time (sec) | Speedup Factor (Serial Time/Parallel Time) | Serial Process Time (sec) | Parallel Process Time (sec) | %Process Speedup (s-p)/s | %Time spent in numeric operations
Fringe - Stress, Max Princ Avg Meth=Avg/Derive, Extrap Meth=Avg | xx0kst0.xdb/624924 | 0.765 | 0.196 | 3.903 | 10.22 | 9.65 | 5.579 | 7.48
Fringe - Stress, Mid Princ Avg Meth=Avg/Derive, Extrap Meth=Avg | xx0kst0.xdb/624924 | 0.763 | 0.195 | 3.904 | 10.209 | 9.635 | 5.623 | 7.47
Fringe - Stress, Min Princ Avg Meth=Avg/Derive, Extrap Meth=Avg | xx0kst0.xdb/624924 | 0.762 | 0.197 | 3.873 | 10.208 | 9.636 | 5.604 | 7.46
Fringe - Stress, Tresca Avg Meth=Avg/Derive, Extrap Meth=Avg | xx0kst0.xdb/624924 | 0.767 | 0.196 | 3.905 | 10.228 | 9.675 | 5.410 | 7.50
Fringe - Stress, Max Princ Avg Meth=Avg/Derive, Extrap Meth=Avg | xx0ust0.xdb/605288 | 0.696 | 0.180 | 3.874 | 9.573 | 9.152 | 4.401 | 7.27
Fringe - Stress, Mid Princ Avg Meth=Avg/Derive, Extrap Meth=Avg | xx0ust0.xdb/605288 | 0.691 | 0.181 | 3.820 | 9.553 | 9.110 | 4.641 | 7.24
Fringe - Stress, Min Princ Avg Meth=Avg/Derive, Extrap Meth=Avg | xx0ust0.xdb/605288 | 0.693 | 0.179 | 3.879 | 9.556 | 9.114 | 4.626 | 7.25
Fringe - Stress, Tresca Avg Meth=Avg/Derive, Extrap Meth=Avg | xx0ust0.xdb/605288 | 0.693 | 0.178 | 3.886 | 9.584 | 9.105 | 4.998 | 7.23
Fringe - Stress, Max Shear Avg Meth=Avg/Derive, Extrap Meth=Avg | xx0ust0.xdb/605288 | 0.695 | 0.180 | 3.861 | 9.554 | 9.099 | 4.766 | 7.27
Fringe - Stress, Max Princ Avg Meth=Avg/Derive, Extrap Meth=Avg | xx0jst0.xdb/2394421 | 2.883 | 0.731 | 3.942 | 39.068 | 37.007 | 5.275 | 7.38
Fringe - Stress, Mid Princ Avg Meth=Avg/Derive, Extrap Meth=Avg | xx0jst0.xdb/2394421 | 2.888 | 0.730 | 3.956 | 39.090 | 36.945 | 5.486 | 7.39
Fringe - Stress, Min Princ Avg Meth=Avg/Derive, Extrap Meth=Avg | xx0jst0.xdb/2394421 | 2.880 | 0.730 | 3.947 | 39.086 | 36.816 | 5.808 | 7.37
Fringe - Stress, Tresca Avg Meth=Avg/Derive, Extrap Meth=Avg | xx0jst0.xdb/2394421 | 2.874 | 0.730 | 3.937 | 37.996 | 36.833 | 3.061 | 7.56
Fringe - Stress, Max Shear Avg Meth=Avg/Derive, Extrap Meth=Avg | xx0jst0.xdb/2394421 | 2.894 | 0.732 | 3.952 | 39.433 | 38.277 | 2.932 | 7.34
Average | | | | 3.90 | | | 4.872 | 7.37
Minimum | | | | 3.82 | | | 2.932 | 7.23
Maximum | | | | 3.96 | | | 5.808 | 7.56


Table 1 - Summary for Plots Where Serial Time in Numeric Operations Was Greater than 0.5 Seconds
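The relationship between the roughly 3.9X scaling of the numeric operations and the roughly 5% overall process speedup follows from Amdahl's law: only the threaded fraction of the run gets faster. The sketch below is a back-of-the-envelope check using the first row of Table 1; the function names are illustrative, and it assumes the non-numeric part of the run time is unchanged by threading.

```cpp
// Fraction of total run time saved when a portion `parallelFraction` of the
// run is sped up by `speedup` and the rest is unchanged (Amdahl's law).
double overallImprovement(double parallelFraction, double speedup) {
    return parallelFraction * (1.0 - 1.0 / speedup);
}

// Estimate for the first row of Table 1 (xx0kst0.xdb, Max Princ).
double firstRowEstimate() {
    const double serialNumeric = 0.765;    // sec in numeric operations, serial
    const double parallelNumeric = 0.196;  // sec in numeric operations, parallel
    const double serialProcess = 10.22;    // sec for the whole process, serial
    return overallImprovement(serialNumeric / serialProcess,
                              serialNumeric / parallelNumeric);
}
```

For that row the estimate is about 5.6%, in line with the 5.579% process speedup reported in the table. Because the numeric operations are only about 7.4% of the run, even perfect scaling on any number of cores could not improve the overall time by more than that fraction.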

 

 

Threading SimXpert - Phase Two

A key goal for the user experience with SimXpert is quick post-processing of analysis result data. Post-processing analysis involves transforming the initial analysis data to the final numerical form specified by the engineer, then mapping it to its graphical primitive representation. For example, an engineer may want to direct SimXpert to render color fringe plots of von Mises, Maximum Principal, and Maximum Shear stress to investigate the performance of the simulation model relative to its applied loading. Figure 2 demonstrates a fringe plot of the von Mises stress distribution across a simple connecting rod model.



Figure 2 - Fringe Plot of von Mises stress

Phase Two for SimXpert applied threading to the portion of the code responsible for graphical primitive production for fringe plots. This code accounted for approximately 35% of the total plot time. As a proof of concept, MSC.Software and Intel prototyped the threaded code and saw scaling up to 3.2X on 4 cores. The method used involved the production and packaging of graphics primitives into containers. The program flow was modified as follows:

 

 

Serial | Parallel
Iterate over all elements/faces | Divide face/element iteration over multiple threads with tbb::parallel_for
Allocate (or reallocate) memory as needed for containers | Local storage holds elements in each Intel TBB task
Do calculations on each element and produce graphical primitives | Serial code works on the local containers without modification
Copy primitives into container (flat array) using memcpy | Partial results in each local container safely get combined into tbb::concurrent_vector
Sequentially bump container pointer, stored in a member variable |


Table 2 - Serial versus Parallel program flow

Performance improvements were observed when the models ran on a 2S 2.66GHz Intel® Xeon® processor 5100 series platform/8GB Memory/Windows* XP Professional X64 Edition Version 2003 SP2 (specifics in Table 3):

 

 

  • 3D solid finite element simulation model representing the casting of a V6 engine block (modelsec.xdb) with 98,814 free faces and a 358.7MB file size achieved a 28% performance improvement
  • 3D solid finite element simulation model representing a turbine blade (xx0kst0.xdb) with 65,416 free faces and 513.3MB file size achieved between 3 and 10% performance improvement for various plots
  • 3D solid finite element simulation model representing a casting of kitchen appliance housing (xx0ust0.xdb) with 90,460 free faces and a 281.7MB file size achieved between 6 and 26% performance improvement for various plots
  • 2D and 3D finite element simulation model representing a car chassis (xx0o.xdb) with 1,209,323 free faces and a 438.8MB file size achieved between 19 and 27% performance improvement for various plots
  • 3D solid finite element simulation model representing the central hub of an aircraft propeller (xx0fst0.xdb) with 89,935 free faces and a 165.2MB file size achieved between 10 and 30% performance improvement for various plots
  • 3D solid finite element simulation model representing the casting of a straight 6 engine block (xx0jst0.xdb) with 461,808 free faces and a 1028.5MB file size achieved between 15 and 44% performance improvement for various plots

 

 

Workload/Description | File size/# free faces | Fringe - Eigen Vectors, Translational - % speedup | Fringe - Stress, Von Mises Avg Meth=Avg/Derive, Extrap Meth=Avg - % speedup | Fringe - Stress, Max Princ Avg Meth=Avg/Derive, Extrap Meth=Avg - % speedup | Fringe - Stress, Tresca Avg Meth=Avg/Derive, Extrap Meth=Avg - % speedup | Fringe - Stress, Octal Avg Meth=Avg/Derive, Extrap Meth=Avg - % speedup | Fringe - Stress, Inv 1 Avg Meth=Avg/Derive, Extrap Meth=Avg - % speedup | Fringe - Stress, Max Shear Avg Meth=Avg/Derive, Extrap Meth=Avg - % speedup | Fringe - Disp Trans, Mag - % speedup
modelsec (engine block) | 358.7 MB/98,814 | 28.092 | | | | | | |
xx0kst0 (turbine blade) | 513.3 MB/65,416 | | 3.729 | 8.103 | 8.968 | 3.373 | 2.926 | 8.199 | 10.334
xx0ust0 (housing) | 281.7 MB/90,460 | | 8.422 | 13.03 | 10.733 | 6.024 | 6.143 | 10.628 | 26.203
xx0o (car chassis) | 438.8 MB/1,309,323 | | 19.437 | 19.791 | 20.03 | 19.647 | 19.671 | 19.902 | 27.543
xx0fst0 (propeller hub) | 165.2 MB/83,935 | | 10.013 | 10.863 | 10.73 | 10.472 | 10.867 | 11.007 | 29.824
xx0jst0 (straight 6-cyl engine block) | 1028.5 MB/461,808 | | 15.061 | 18.856 | 18.988 | 15.636 | 15.463 | 19.585 | 43.807



Table 3 - Speedup for fringe plot optimization (average of three runs)

 

MSC.Software Testimonial

"We are very pleased with the progress we have made in a very short time incorporating parallel algorithms from Intel® TBB in our SimXpert code. Intel's technical leadership and assistance facilitated our decision to introduce multithreaded versions of SimXpert moving forward, and we're pleased to see the breadth of TBB parallel algorithms, such as Parallel For and Parallel Pipeline, that are available for our future consideration.

"We were very excited to see the near theoretical performance scaling that was achieved by applying multithreading to the post processing portion of SimXpert that is responsible for calculation of various engineering quantities such as von Mises, Principal, Tresca, and Maximum Shear stresses. Performing tests with seven very large customer simulation models, on a machine equipped with two Intel® Core™ 2 Duo processors, we found that the scaling of the engineering calculations ranged from 3.86 to 3.9. Achieving these near perfect results has increased the excitement within SimXpert development team to expand the use of multi-threading throughout the product."

George Truesdell
Manager, Product Development

 

 

Next Steps

In future releases (following SimXpert R4), the remaining plot types will be threaded. The Intel TBB pipeline will also be evaluated for overlapping processing with buffered I/O. Intel engineers have prototyped an Intel TBB pipeline that uses the engineering calculations from Phase One. Intel Thread Profiler identified buffer thrashing in this initial implementation. Matching the pipeline token count to the hardware thread count produced "laminar" scheduling and eliminated the thrashing, resulting in 3.9X scaling on 4 cores and 7.5X to 7.8X scaling on 8 cores.

Concept



 

 

Conclusion

The MSC.Software project to add threading to SimXpert was successful, resulting in a significant performance improvement in SimXpert and a faster turnaround time for end-users, leading to increased productivity. SimXpert was one of the first commercial applications to release with Intel TBB. Intel TBB was an ideal tool for this project since SimXpert is a multi-platform application written in C++ that has many features beyond typical HPC number crunching applications. In addition, the code of SimXpert was well suited to the incremental threading approach that MSC.Software chose.

For Phase One, measurements for seven very large customer simulation models on a 2S Intel® Xeon® processor 5100 series platform (4 threads) showed scaling between 3.8X and 3.9X for the engineering calculations.

For Phase Two, optimizations for fringe plots resulted in a speedup ranging from 3% to 44% for measured workloads. MSC.Software plans to continue with the incremental threading approach for the remaining plot types, and to investigate the Intel TBB pipeline for overlapping processing and I/O.

 

 

MSC.Software SimXpert

MSC.Software's SimXpert* is a fully integrated simulation environment for performing multidiscipline-based analysis. It provides an interactive, graphical interface designed to facilitate the engineer's job of performing end-to-end simulations, including native CAD import, pre-processing, model setup, solving, post-processing, and reporting. Designed for both analysts and design engineers, SimXpert scales across multiple engineering physics, offering a state-of-the-art, easy-to-use graphical interface for performing multidiscipline simulations, either coupled or chained, at any stage of the design process. The SimXpert scalable workspaces include a variety of discipline solutions, including structural linear and nonlinear analysis, thermal, multibody dynamics, and explicit crash simulation. With built-in, bi-directional CAD associativity, engineers gain superior interoperability between SimXpert and multiple CAD systems including V5, Pro/Engineer, and Unigraphics. Additionally, SimXpert provides a unique CAE graphical template builder and runner that lets analysts quickly automate a variety of steps during analysis, such as model setup, pre-processing, post-processing, or other mundane, time-consuming tasks.

 

 

About the Authors

Kathy Carver joined Intel in 1992 and is currently an application engineer in Intel's Software and Services Group (SSG) working on optimizing CAE applications to take advantage of Intel's latest hardware and software innovations. Previously at Intel, she worked on pre-silicon validation of the first Intel® Itanium® processor and on development tools for Intel's Supercomputer Systems Division (SSD). She holds a BS in Computer Science from Western Kentucky University, Bowling Green, KY.

Mark Lubin is a Parallel Applications Engineer at Intel within the Software and Services Group (SSG), where he works on optimizing HPC applications. Prior to joining Intel, Mark did his postdoctoral research at UCSD, where he developed quantum molecular dynamics computer models and software for parallel computers. He has published over 15 peer-reviewed publications. Mark received his M.S. in EE from Moscow Institute of Electronic Technology, Russia. He received his Ph.D. in physics from the University of Central Florida.

Bonnie Aona is a software engineer in the Intel Compilers and Languages Group within the Software and Services Group (SSG), working on optimizing and testing applications to take advantage of the latest Intel software and hardware innovations to achieve high performance and parallelism. Bonnie's career combines software quality assurance and program management with software design for complex high performance applications in computer graphics, real-time systems, scientific research, manufacturing, e-commerce, aerospace, and healthcare. She holds master's degrees in Electrical and Computer Engineering from the University of California at Davis.

 

 

 

 

For more complete information about compiler optimizations, see our Optimization Notice.