Late-initialization of frame descriptors in Cilk Plus/LLVM

The Intel® Cilk™ Plus C/C++ language extensions support the expression of portable and efficient task and vector parallel programs. Cilk Plus/LLVM is an implementation of these extensions in the Clang frontend for LLVM. Today, the Cilk Plus/LLVM implementation supports all of the Intel® Cilk™ keywords for tasking: cilk_spawn, cilk_sync, and cilk_for.

In this article we explain one of the optimizations that we have implemented in Cilk Plus/LLVM: late-initialization of frame descriptors[1]. With this explanation, we provide a view into how one of the Intel® Cilk™ keywords - cilk_spawn - is implemented under the hood. Let’s start with an explanation of cilk_spawn and why late-initialization helps make it more efficient.

Spawn and the Frame Descriptor

Spawning is a core operation in Intel® Cilk™ Plus that identifies function calls at which a parallel task may fork from a parent task.  To implement spawn, the Cilk Plus/LLVM compiler communicates to the runtime that what follows a spawn statement can be stolen by another worker to execute in parallel. Following the Intel® Cilk™ Plus ABI specification, this communication is performed by creating a frame descriptor object (struct __cilkrts_stack_frame).  This frame descriptor contains information about the current stack frame and execution context, which is then passed to the runtime.

The most straightforward place to initialize the frame descriptor is at the beginning of the spawning function.  If there are multiple spawns, some of which may be conditionally executed, this ensures that the frame descriptor is always initialized by the time a spawn occurs.  Although initializing the frame descriptor at the beginning is simple and correct, it is usually not optimal.  To understand how this can cause performance problems, consider the following toy example, which calculates Fibonacci numbers.

int fib(int n) {
  __cilkrts_stack_frame sf;    // implicitly inserted by the compiler
  __cilkrts_enter_frame(&sf);  // implicitly inserted by the compiler
  if (n < 2)
    return n;
  int a = cilk_spawn fib(n-1); // uses the __cilkrts_stack_frame
  int b = fib(n-2);
  cilk_sync;                   // uses the __cilkrts_stack_frame
  return a + b;
}

If fib() is called with a value of n smaller than 2, it immediately returns the number passed in. Since it does not actually perform any spawns in this base case, the call to __cilkrts_enter_frame() is unnecessary. Since this implementation of fib() calls itself twice recursively, there will be many calls to such a base case. In fact, when executing fib(30), there will be over a million unnecessary calls to __cilkrts_enter_frame()! Although each unnecessary call to __cilkrts_enter_frame() consumes only a few cycles - roughly equivalent to making three function calls - they can add up to a significant slowdown overall.

The Optimization

Applying C++’s "only pay for what you use" principle, we can make it so that the initialization cost for frame descriptors is only paid when a spawn is actually executed.

The optimization is simple: move the initialization of the frame descriptor as late as possible before the spawns themselves, and make the initialization conditional if the spawns are conditional. This optimization is explicitly suggested by the Intel® Cilk™ Plus ABI. In general, frame descriptor initialization may need to be duplicated for each spawn if the compiler cannot prove that one spawn dominates another.  Then, because the __cilkrts_stack_frame may not be initialized, each operation on it must be conditional, as in the following pseudo code.

int fib(int n) {
  __cilkrts_stack_frame sf;     // implicitly inserted by the compiler
  sf.worker = 0;                // mark as uninitialized
  if (n < 2)
    return n;
  if (sf.worker == 0)           // initialize the frame if it has not been already;
    __cilkrts_enter_frame(&sf); // the compiler will remove the condition after optimizations
  int a = cilk_spawn fib(n-1);  // uses the __cilkrts_stack_frame
  int b = fib(n-2);
  if (sf.worker != 0)           // sync, if the frame is initialized
    cilk_sync;                  // uses the __cilkrts_stack_frame
  return a + b;
}

Now that the __cilkrts_enter_frame() call has been moved past the "return n" statement, it will no longer incur overhead in the base case. In practice, the added control flow to check the value of sf.worker is either optimized away by the compiler, or simply does not make a difference to overall performance on modern CPUs thanks to branch prediction.

Performance

In Cilk Plus/LLVM, we implemented this optimization as a separate pass. That is, the frontend always inserts the call to __cilkrts_enter_frame() at the beginning of a function using a spawn. Later when optimizing the generated LLVM IR, the late-initialization optimization pass moves the call to one or more spawn sites. Having a separate pass allows the optimization to be disabled easily for testing. By default, this optimization is enabled at -O1 and above in Cilk Plus/LLVM.  To see the effects of this optimization, you can add the flags '-mllvm -disable-cilk-sf-late-init' to the clang command line to disable the pass when building an Intel® Cilk™ Plus application.

The following table shows the improvement from the late-initialization optimization on the 'fib' and 'knapsack' workloads. The ‘fib’ example is as seen above, and was run with an input of 30.  Knapsack is based on a sample that was distributed with the Cilk-5 release from MIT that computes solutions to the 0-1 Knapsack problem using dynamic programming. Knapsack was tested on the input ‘knapsack-example3.input’. The source code for both workloads, as well as example input to knapsack is attached to this article.  The times below are the average of 100 runs using 8 workers on an Intel® Core™ i7-3770K.  Configuration details are at the end of the article[2].

Workload Time
-disable-cilk-sf-late-init 
Time Improvement
Fib 12.1 ± 0.5 ms  10.8 ± 0.5 ms  10.3% [2]
Knapsack 1.01 ± 0.02 s 0.912 ± 0.02 s 9.3% [2]

We picked these workloads specifically to illustrate the frame descriptor late-initialization optimization because we expected to see a significant speedup. Both of them have very simple base cases that occur often, and therefore the overhead of doing the unnecessary frame initialization is comparatively high when not optimized away.

In "real world" examples, we usually see speedups that are more marginal than these. Those real world cases typically have more complicated base cases that are not so sensitive to the overhead of an unnecessary __cilkrts_enter_frame(). Some workloads simply don’t call the base case as often. Regardless, for highly recursive workloads, the optimization makes performance less dependent on the size of the base case.

Learning More

If you want to see all the details of our implementation, head on over to github and check out the source code. Chances are, though, that you are more interested in trying out Intel® Cilk™ Plus in Clang/LLVM. In that case, follow the "Try it!" section of the Cilk Plus/LLVM homepage. We’d love to hear your feedback!

 

 

Footnotes

[1] This optimization is also implemented in Intel® Composer XE
[2] The source code for workloads ‘fib’ and ‘knapsack’ are attached to this article.  These results collected 2013-05-15 by Ben Langmuir.

Configurations

The compiler used was Cilk Plus/LLVM, which can be built from source using the instructions at http://cilkplus.github.com. The exact revisions used to collect the above results are as follows:

Clang: 42592f97c5a77d06f650b31a3a24cb43eccd8208
LLVM: e8aff964b91ae635d121c1e2d1ddfe9a5fb4bf26
Compiler-rt: 7281db10727b47270af32c3cb109e54e92d86f27
 
Hardware information:
CPU: Intel® Core™ i7-3770K @ 3.50 GHz
Memory: 16 GB
OS: Ubuntu* 12.04 64 bit

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Disclaimers and Notices

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel.

Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Intel and Cilk are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.
Copyright © 2013 Intel Corporation. All rights reserved.

有关编译器优化的更完整信息,请参阅优化通知