TBB adoption in CAD: Technical Insights

In software engineering, it always pays to stay connected to your customers. Seeing how your product helps them is motivating, but more importantly, it lets you better understand their usage models, environments, and potential issues, and bring that feedback back to the team to improve the product.

In this regard, I was especially glad to work with the engineering team of OPEN CASCADE, the company that develops Open CASCADE Technology (OCCT), an open source 3D modeling kernel for CAD/CAM/CAE applications. A while ago OPEN CASCADE started using Intel® Parallel Studio, adopting Amplifier and Inspector. Last year the team started investigating Intel TBB and adopted it in the public release of OCCT 6.5, announced in March.

This post gives some technical background on how TBB is used in Open CASCADE.

OCCT originates from the mid-1990s, and until now its huge legacy code base was single-threaded. To break into multi-threading, it was important to focus on the right directions and achieve impactful results with limited investment. Two core algorithms were addressed first:
1. Memory management
2. 3D model tessellation

Memory management
OCCT has its own memory manager based on caching and reusing memory chunks. However, its original design from the 1990s was tailored to single-threaded use. In a recent version, it was made thread-safe by introducing a coarse-grained lock that protects the chunk list. This design could not scale in multi-threaded scenarios, because every allocation and deallocation contends for the same lock.
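
To make the bottleneck concrete, here is a minimal sketch of a caching allocator guarded by one coarse-grained lock. It is not OCCT's actual memory manager: the class name `ChunkCache`, its single fixed chunk size, and its methods are assumptions made purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <vector>

// Hypothetical caching allocator: freed chunks are kept in a list and reused.
// One mutex guards the whole list, so threads that allocate or free
// at the same time have to take turns.
class ChunkCache {
public:
    explicit ChunkCache(std::size_t chunkSize) : myChunkSize(chunkSize) {}

    ~ChunkCache() {
        for (void* p : myFree) std::free(p);
    }

    void* Allocate() {
        std::lock_guard<std::mutex> lock(myMutex);  // coarse-grained lock
        if (!myFree.empty()) {
            void* p = myFree.back();  // reuse a cached chunk
            myFree.pop_back();
            return p;
        }
        return std::malloc(myChunkSize);
    }

    void Free(void* p) {
        assert(p != nullptr);
        std::lock_guard<std::mutex> lock(myMutex);  // same lock again
        myFree.push_back(p);  // cache the chunk instead of releasing it
    }

    std::size_t CachedCount() {
        std::lock_guard<std::mutex> lock(myMutex);
        return myFree.size();
    }

private:
    std::size_t myChunkSize;
    std::mutex myMutex;
    std::vector<void*> myFree;
};
```

Because `Allocate` and `Free` both take `myMutex`, adding threads mostly adds waiting. A scalable allocator avoids this by giving threads their own pools instead of one shared list.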

The charts below compare the scalability of a few allocators (Standard (OS), OCCT, and TBB) on different Windows versions (and different hardware). The workload was a surface meshing (tessellation) algorithm from OCCT 6.3.1 executed on a complex 3D model. Although the workload itself was imbalanced (e.g. planar vs. NURBS surfaces) and linear scalability was not expected, the results were convincing enough to draw conclusions about the potential benefits for other workloads.

The coarse-grained lock in the OCCT allocator virtually serialized execution and killed scalability, so adopting the scalable TBB memory allocator was a reasonable choice. Moreover, since OCCT already supported switching the memory allocator at load time, integrating the TBB allocator was very simple: it only required subclassing an abstract adaptor class. Here is a code excerpt from OCCT 6.5:

Standard_Address Standard_MMgrTBBalloc::Allocate(const Standard_Size aSize)
{
  // the size is rounded up to 4 since some OCC classes
  // assume memory to be double word-aligned
  const Standard_Size aRoundSize = (aSize + 3) & ~0x3;
  Standard_Address aPtr = ( myClear ? scalable_calloc(aRoundSize, sizeof(char)) :
                                      scalable_malloc(aRoundSize) );
  if ( ! aPtr )
    Standard_OutOfMemory::Raise("Standard_MMgrTBBalloc::Allocate(): malloc failed");
  return aPtr;
}
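
The adaptor approach behind this excerpt can be sketched as follows. This is an illustration, not OCCT code: the names `MemoryManager`, `MallocManager`, and `RoundUpTo4` are hypothetical, and `std::malloc`/`std::calloc` stand in for `scalable_malloc`/`scalable_calloc` so that the example builds without TBB.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Round a size up to the next multiple of 4, as the excerpt above does
// with (aSize + 3) & ~0x3.
inline std::size_t RoundUpTo4(std::size_t size) {
    return (size + 3) & ~static_cast<std::size_t>(3);
}

// Hypothetical abstract adaptor: callers only ever see this interface,
// so the concrete allocator can be swapped without touching them.
class MemoryManager {
public:
    virtual ~MemoryManager() = default;
    virtual void* Allocate(std::size_t size) = 0;
    virtual void Free(void* ptr) = 0;
};

// One concrete adaptor. A TBB-backed adaptor would have the same shape
// and call scalable_calloc / scalable_malloc instead.
class MallocManager : public MemoryManager {
public:
    explicit MallocManager(bool clear) : myClear(clear) {}

    void* Allocate(std::size_t size) override {
        assert(size > 0);  // zero-size requests are not handled in this sketch
        const std::size_t rounded = RoundUpTo4(size);
        void* ptr = myClear ? std::calloc(rounded, sizeof(char))
                            : std::malloc(rounded);
        if (!ptr) throw std::bad_alloc();
        return ptr;
    }

    void Free(void* ptr) override { std::free(ptr); }

private:
    bool myClear;  // when true, returned memory is zero-filled
};
```

With this shape, choosing an allocator comes down to deciding which subclass to instantiate at startup. The rest of the code keeps calling `Allocate` and `Free` through the base class.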

3D model tessellation
Tessellation approximates a shape's surfaces with triangles and is used in the visualization component of OCCT. The image below shows a sample 3D model and its underlying triangles:

The visualization component is used in virtually every OCCT-based application, so making this algorithm concurrent would naturally bring the greatest impact.

Due to the specifics of the data model, sub-shapes are often interdependent. For instance, an edge of a box is shared by the two faces that own it, so those faces cannot be tessellated independently. This may significantly limit scalability. Nonetheless, the first version already takes advantage of concurrent execution of data-independent iterations.

As with the TBB allocator, the integration was very compact and required changing just a few lines of code:

void BRepMesh_FastDiscret::Perform(const TopoDS_Shape& shape)
{
  std::vector<TopoDS_Face> aFaces;
  . . .

  // mesh faces in parallel threads using TBB
  if (Standard::IsReentrant())
    tbb::parallel_for_each (aFaces.begin(), aFaces.end(), *this);
  else
    // otherwise mesh the faces one by one
    for (std::vector<TopoDS_Face>::iterator it(aFaces.begin()); it != aFaces.end(); it++)
      Process (*it);
}

/* Function call operator, so that *this can be passed to parallel_for_each.*/
void BRepMesh_FastDiscret::operator ()(const TopoDS_Face& face) const
{
  Process (face);
}

/* Processes the given face.*/
void BRepMesh_FastDiscret::Process(const TopoDS_Face& theFace) const
{
  . . .
}
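
The pattern in this excerpt is that the mesher object is itself the callable handed to the algorithm: `operator()` simply forwards to `Process`. The sketch below shows that shape with `std::for_each` standing in for `tbb::parallel_for_each`, so it runs serially and needs no TBB. `Face`, `Mesher`, and the triangle-count bookkeeping are hypothetical stand-ins, not OCCT types.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for TopoDS_Face: a polygon with a slot index.
struct Face {
    std::size_t id;   // index of this face's result slot
    int edgeCount;    // number of edges of the (convex) polygon
};

class Mesher {
public:
    explicit Mesher(std::vector<int>& results) : myResults(&results) {}

    void Perform(const std::vector<Face>& faces) const {
        // With TBB this call would be
        // tbb::parallel_for_each(faces.begin(), faces.end(), *this);
        std::for_each(faces.begin(), faces.end(), *this);
    }

    // Makes the object callable, so it can be passed to
    // for_each-style algorithms as the per-element function.
    void operator()(const Face& face) const { Process(face); }

private:
    void Process(const Face& face) const {
        assert(face.id < myResults->size());
        // Placeholder for real meshing: a convex polygon with n edges
        // splits into n - 2 triangles. Each face writes only its own
        // slot, so the iterations are data-independent.
        (*myResults)[face.id] = face.edgeCount - 2;
    }

    std::vector<int>* myResults;
};
```

Since every call touches a distinct result slot, the iterations could run in any order or concurrently. That data independence is the property a parallel loop relies on.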

The following chart shows the scalability of the TBB-based tessellation algorithm, measured on a couple of relatively large models:

The limited scalability is mainly due to the fact that only a part of the entire algorithm runs concurrently (an effect of Amdahl's law).
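
For reference, Amdahl's law bounds the speedup at 1 / ((1 - p) + p / n), where p is the fraction of run time that is parallelized and n is the number of threads. A small helper makes it easy to experiment with. The function name `AmdahlSpeedup` is just an illustrative choice, and no value of p was measured for OCCT here.

```cpp
#include <cassert>

// Amdahl's law: upper bound on speedup when a fraction `parallelFraction`
// of the run time is spread evenly over `threads` workers and the rest
// stays serial.
inline double AmdahlSpeedup(double parallelFraction, int threads) {
    assert(parallelFraction >= 0.0 && parallelFraction <= 1.0);
    assert(threads >= 1);
    return 1.0 / ((1.0 - parallelFraction) + parallelFraction / threads);
}
```

For example, if half of the run time were parallelized (a made-up figure), four threads would give at most 1 / (0.5 + 0.125) = 1.6x, and no number of threads could exceed 2x.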

The initial integration of TBB required very limited changes to the OCCT sources but should benefit virtually all users of OCCT-based applications.
Of course, we look forward to extending the presence of TBB in OCCT!
