How to Separate Closure Compilation and Execution -- An Introduction to closure::compile()

A closure is the basic unit of execution in Intel® ArBB. As described elsewhere, an Intel ArBB function (a computation performed on Intel ArBB data types) is first captured into a closure object at run time. The closure is then compiled once and can be executed repeatedly. Because capturing and compilation take place at run time, Intel ArBB is able to optimize a closure for the exact machine configuration that it runs on.

Closure capturing can be implicit or explicit:

  • Implicit: The first invocation of a function using the call() operator creates a closure, which is compiled and cached before it is executed. Subsequent invocations of the same function using call() reuse and execute the cached closure.
  • Explicit: The arbb::capture() function creates a closure for a function without executing it. A function can be captured multiple times, creating multiple distinct closure objects, which is useful for run-time specialization of the closures created from the function.

Keen users have noticed that no matter how a closure is created, the first execution of the closure is always more expensive than subsequent executions. This extra cost of first-time execution is the overhead of compiling and optimizing the closure. The current version of ArBB specializes compiled closures for two kinds of argument attributes:

  • For container arguments, whether the argument is bound or not
  • For 2D and 3D bound container arguments, whether the argument is strided or not

If these attributes change from execution to execution, the closure is implicitly recompiled for each different set of attributes. Recompilation means extra performance costs due to compilation overhead even after the first execution of the closure.

The Intel® ArBB 1.0 Beta 5 release introduces a new member function, compile(), for the closure class template and the auto_closure class. It explicitly compiles a closure for a given set of arguments, which avoids the implicit recompilation mentioned above.

template<typename FunctionType>
void arbb::closure<FunctionType>::compile(FunctionParams... arguments) const;

void arbb::auto_closure::compile(FunctionParams... arguments) const;

The benefits of the compile() API can be summarized as follows:

  • Users can better gauge the performance behavior of their code as compilation overhead and actual execution cost can now be measured separately.
  • compile() can be called multiple times to prepare a closure for different sets of argument attributes. Users can control when to incur the overhead of compilation during the whole course of execution.
  • Using compile() can avoid implicit dynamic re-compilation, leading to more predictable performance.

Call compile() with arguments that have the same attributes as the arguments that will be used to execute the closure. Once a closure has been compiled for arguments with particular attributes, it can be executed repeatedly on arguments with the same attributes without incurring compilation cost. If the closure is to be executed on arguments with different attributes, compile() must be called again with those arguments to avoid implicit dynamic re-compilation.

The following example shows the proper usage of compile() and its effects.

#include <arbb.hpp>
#include <iostream>
#include <vector>

using namespace arbb;

// result[i] = sum over j of matrix(i, j) * vector[j]
template<typename T>
void matvec_product(dense<T>& result, const dense<T, 2>& matrix, const dense<T>& vector)
{
 result = add_reduce(matrix * repeat_row(vector, matrix.num_rows()));
}

int main()
{
 typedef f32 T;
 typedef uncaptured<T>::type UT;

 const std::size_t size = 4096;
 std::vector<UT> vdata(size);
 std::vector<UT> mdata(size * size);

 dense<f32> vector_bound, result;
 dense<f32, 2> matrix_bound;

 bind(vector_bound, &vdata[0], size);
 bind(matrix_bound, &mdata[0], size, size);

 // capture a closure
 closure<void(dense<T>&, const dense<T, 2>&, const dense<T>&)> c = capture(matvec_product<T>); 

 // compile the closure for bound data
 double time_compile_1;
 {
 const scoped_timer timer(time_compile_1);
 c.compile(result, matrix_bound, vector_bound);
 }
 std::cout << "Compile time " << time_compile_1 << " ms\n";

 // execute the closure on bound data and take measurements
 for (std::size_t i = 0; i != 5; ++i) {
 double time_i;
 {
 const scoped_timer timer(time_i);
 c(result, matrix_bound, vector_bound);
 // make sure computation is completed
 result.read_only_range();
 }
 std::cout << "Time " << i << " (bound): " << time_i << " ms\n";
 }

 // argument attribute changes from bound to non-bound
 dense<f32> vector_nonbound(4096);
 dense<f32, 2> matrix_nonbound(4096, 4096);

 // compile the closure for non-bound data
 double time_compile_2;
 {
 const scoped_timer timer(time_compile_2);
 c.compile(result, matrix_nonbound, vector_nonbound);
 }
 std::cout << "Compile time " << time_compile_2 << " ms\n";

 // execute the closure on non-bound data and take measurements
 for (std::size_t i = 0; i != 5; ++i) {
 double time_i;
 {
 const scoped_timer timer(time_i);
 c(result, matrix_nonbound, vector_nonbound);
 // make sure computation is completed
 result.read_only_range();
 }
 std::cout << "Time " << i << " (non-bound): " << time_i << " ms\n";
 }

 return 0;
}

In this example, a closure is captured for the function "matvec_product". The closure is then executed on two kinds of input arguments: bound containers first, then non-bound containers. The effect of compile() can be seen by commenting out or restoring the two compile() blocks, which follow the comments "compile the closure for bound data" and "compile the closure for non-bound data".

Here is the output when compile() is used:

Compile time 4.446 ms
Time 0 (bound): 9.367 ms
Time 1 (bound): 9.136 ms
Time 2 (bound): 9.202 ms
Time 3 (bound): 9.217 ms
Time 4 (bound): 9.175 ms
Compile time 4.315 ms
Time 0 (non-bound): 8.859 ms
Time 1 (non-bound): 8.879 ms
Time 2 (non-bound): 8.876 ms
Time 3 (non-bound): 8.868 ms
Time 4 (non-bound): 8.904 ms


Here is the output when the two compile() calls are commented out:

Time 0 (bound): 13.818 ms
Time 1 (bound): 9.323 ms
Time 2 (bound): 9.379 ms
Time 3 (bound): 9.319 ms
Time 4 (bound): 9.191 ms
Time 0 (non-bound): 13.209 ms
Time 1 (non-bound): 9.025 ms
Time 2 (non-bound): 8.963 ms
Time 3 (non-bound): 9.01 ms
Time 4 (non-bound): 8.983 ms

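The actual run times vary between systems, but the pattern should always be the same: when argument attributes change, the closure is re-compiled, and without compile() that overhead is folded into the first execution that uses the new attributes. In the runs above, the first execution of each group is roughly 4 to 4.5 ms slower when compile() is not used, which is close to the compile times measured separately. Calling compile() moves that overhead out of the execution and makes it measurable on its own.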