The hidden performance cost of accessing thread-local variables

Ever finished parallelizing code and discovered that the performance was not what you were expecting? I think that has happened to everyone. One of the tricks I’ve recently learned is to start code optimization by running the Intel® VTune™ Amplifier XE Lightweight Hotspots analysis, which shows the hot spot functions of an application in terms of clock ticks and instructions retired. Unlike precise call graph analysis, Lightweight Hotspots analysis is very fast and does not instrument your application. Even if you think you know your application very well, it may show you something surprising about its behavior.

That is exactly what happened when I finished parallelizing the fur shader for DreamWorks Animation. The fur shader generates millions of fur strands that are then tessellated and rasterized. It also uses a geometry library to evaluate fur geometry and calculate normals, derivatives, etc. This geometry library was developed over many years. When I profiled this code with Intel® VTune™ Amplifier XE, I saw a fairly flat profile in which the top hotspot, “__tls_get_addr”, took only 5% of the total CPU clock ticks. Since scenes with fur are common, and generating millions of fur strands consumes a large percentage of the render time, even small optimizations have a noticeable impact on render time. So, what is “__tls_get_addr” in the profile? It is not a function in my code: on Linux it is a runtime function, provided by the dynamic linker that ships with the C library, that returns the address of a thread-local variable. The cost of this function should be small. What happened here?

To find the call sites of “__tls_get_addr”, I next used the Hotspots analysis in Intel® VTune™ Amplifier XE, which provides fast stack and call tree information. The call tree showed that most of the calls to “__tls_get_addr” came from the DreamWorks Animation geometry library. They were accesses to a few of the global variables in that library, which had recently been converted into thread-local variables when the library was made thread safe.

Converting global variables to thread-local variables is very easy: just add the keyword "__thread" before the variable declaration, for example "__thread int count;". This solution is very attractive to developers who are writing parallel code that calls into legacy libraries and who do not have the time to re-architect those libraries for thread safety.
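As a minimal sketch of what the conversion buys you, the following shows two threads updating the same "__thread" variable without interfering with each other. The names call_count, bump and run_two_threads are hypothetical and chosen only for this illustration; they are not taken from the geometry library.

```cpp
#include <cassert>
#include <thread>

// Hypothetical counter, for illustration only. Without __thread, both worker
// threads below would race on one shared variable.
static __thread int call_count = 0;

// Increments the calling thread's own copy of call_count n times and
// reports the final value of that copy through result.
static void bump(int n, int *result)
{
    for (int i = 0; i < n; i++)
        call_count++;
    *result = call_count;
}

// Runs bump() on two threads and returns the sum of the per-thread totals.
int run_two_threads()
{
    int a = 0, b = 0;
    std::thread t1(bump, 1000, &a);
    std::thread t2(bump, 2000, &b);
    t1.join();
    t2.join();
    // The calling thread has its own copy too, and nobody touched it.
    assert(call_count == 0);
    return a + b;
}
```

Each worker sees only its own copy, so the two totals are 1000 and 2000 regardless of how the threads interleave. Depending on the toolchain, the build may need the -pthread option.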

The cost of using thread-local variables

Usually the cost of using thread-local variables is very small and a programmer may not even notice it. However, if the thread-local variable is accessed very frequently, the cost may become an issue.

To understand where the cost of accessing a thread-local variable comes from, I needed to understand how it is implemented. For code built into a shared library, the toolchain assigns each thread-local variable an identifier that is the same in every thread (in the ELF implementation used on Linux, a module ID plus an offset), and the runtime maintains a Thread Local Storage (TLS) lookup table for each thread. The identifier is then used to find the address of the calling thread's copy of the variable. So the cost of using a thread-local variable includes the cost of a function call and a lookup in an indexed table.
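The lookup described above can be pictured with a deliberately simplified toy model. Everything here (ToyThread, toy_tls_get_addr, toy_demo) is a hypothetical name invented for this sketch, and the real runtime's data structures are more involved. The point is only that every access pays for an out-of-line call and an indexed table lookup.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a thread: entry i of the table holds the address of
// this thread's private copy of the thread-local variable with ID i.
struct ToyThread {
    std::vector<void *> table;
};

// Toy stand-in for the runtime lookup: a real function call (noinline)
// followed by an indexed access into the per-thread table.
__attribute__((noinline))
void *toy_tls_get_addr(const ToyThread &self, std::size_t id)
{
    return self.table[id];
}

// Two toy threads resolve the same ID to different addresses.
double toy_demo()
{
    const std::size_t tlvar_id = 0;  // same ID in every thread
    double copy_a = 1.0;             // "thread a" storage for the variable
    double copy_b = 2.0;             // "thread b" storage for the variable

    ToyThread a, b;
    a.table.push_back(&copy_a);
    b.table.push_back(&copy_b);

    double *pa = static_cast<double *>(toy_tls_get_addr(a, tlvar_id));
    double *pb = static_cast<double *>(toy_tls_get_addr(b, tlvar_id));
    assert(pa != pb);                // same ID, different per-thread copies
    return *pa + *pb;
}
```

In this model a single access is cheap, but a loop that goes through toy_tls_get_addr a billion times pays for the call and the lookup a billion times.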

A simple example showing the cost of accessing a thread-local variable

Let us look at the following example (tlb.cpp).

#include <stdio.h>
#include <math.h>

// Each thread gets its own copy of tlvar.
__thread double tlvar;

// noinline keeps get_value() out of line, so every call looks up tlvar again.
double get_value() __attribute__ ((noinline));
double get_value()
{
    return tlvar;
}

int test()
{
    double f = 0.0;
    tlvar = 1.0;
    for (int i = 0; i < 1000000000; i++)
    {
        f += sqrt(get_value());
    }
    printf("f = %f\n", f);
    return 1;
}


In order to simulate the development environment at DreamWorks Animation, let us create a shared library with the following ICC commands:
icpc tlb.cpp -c -o tlb.o -fPIC -g
icpc -shared -o tlb.so tlb.o

The file main.cpp calls the function test() in the module tlb.so:

// The declaration must match the definition in tlb.cpp, which returns int.
int test();

int main()
{
    test();
    return 0;
}


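Build the executable with the following command: "icpc main.cpp tlb.so -o tlb-no-inline". At run time the loader must be able to find tlb.so, for example by adding the current directory to LD_LIBRARY_PATH.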

When I ran this program, Intel® VTune™ Amplifier XE showed that “__tls_get_addr” was the second-highest hot spot function, accounting for 25% of the total CPU time (see table below).



Minimizing the cost of accessing thread-local variables

If you inspect the file tlb.cpp carefully, you will see the attribute "noinline" on the function “get_value”. I added it to simulate the behavior of a particular function in DreamWorks Animation’s application that accesses a thread-local variable and could not be inlined automatically by the Intel® C++ Compiler. In my example, since “get_value” is not inlined, the function “__tls_get_addr” is called every time “get_value” accesses the thread-local variable "tlvar".

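Reducing the cost of the function “__tls_get_addr” is actually not difficult. With the Intel® C++ Compiler you can use the keyword “__forceinline” to ask the compiler to inline the function “get_value” (GCC-compatible compilers offer "__attribute__((always_inline))" for the same purpose). The new code looks like the following: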

#include <stdio.h>
#include <math.h>

__thread double tlvar;

// Forced inlining lets the compiler see the tlvar access inside the loop.
__forceinline double get_value()
{
    return tlvar;
}

int test()
{
    double f = 0.0;
    tlvar = 1.0;
    for (int i = 0; i < 1000000000; i++)
    {
        f += sqrt(get_value());
    }
    printf("f = %f\n", f);
    return 1;
}


Since the function “get_value” is inlined, the Intel® C++ Compiler is smart enough to notice that there is a call to access the thread-local variable inside the “for” loop, and it always returns the same value. As a result, the compiler will move this call outside of the “for” loop. Now there is only one call to function “__tls_get_addr”.
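The effect of that hoisting can also be written by hand. The following is a sketch of the idea, not the code the compiler actually emits; sum_sqrt and its iterations parameter are hypothetical names introduced here. The address of the thread-local variable is taken once before the loop, and the loop then works through the pointer, so it also sees later writes to the variable made by the same thread.

```cpp
#include <cassert>
#include <cmath>

__thread double tlvar;

// Resolves the address of this thread's tlvar once, then reuses it in the
// loop instead of looking the variable up on every iteration.
double sum_sqrt(int iterations)
{
    assert(iterations >= 0);
    tlvar = 4.0;
    const double *p = &tlvar;   // single address lookup for this thread
    double f = 0.0;
    for (int i = 0; i < iterations; i++)
    {
        f += std::sqrt(*p);
    }
    return f;
}
```

The pointer refers to the calling thread's copy only, so it should not be handed to another thread that expects to see its own value.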

The following table shows the new Intel® VTune™ Amplifier XE Lightweight Hotspots data after this change. You can see that the functions “__tls_get_addr” and “get_value” are no longer in the hotspot list.



You may ask: if the function “get_value” cannot be inlined for whatever reason, is it still possible to reduce the cost of accessing a thread-local variable? The answer is “yes”. Since the thread-local variable in this example is only read inside the loop, you can copy it into a local variable before the “for” loop and then use the local variable inside the loop, as shown below.

#include <stdio.h>
#include <math.h>

__thread double tlvar;

double get_value() __attribute__ ((noinline));
double get_value()
{
    return tlvar;
}

int test()
{
    double tmp, f = 0.0;
    tlvar = 1.0;
    // Read the thread-local variable once, outside the loop.
    tmp = get_value();
    for (int i = 0; i < 1000000000; i++)
    {
        f += sqrt(tmp);
    }
    printf("f = %f\n", f);
    return 1;
}


Conclusions

This simple example shows that frequently accessing a thread-local variable has a hidden cost, and that in certain situations the cost may be significant. To reduce it, minimize the number of calls to “__tls_get_addr” that the compiler introduces, as discussed in the examples above. So, if an Intel® VTune™ Amplifier XE profile of your application shows significant time in the “__tls_get_addr” function, check which functions access thread-local variables frequently, and then follow the recommendations discussed in this blog.