cilk_for "loop iteration count cannot be computed"


Apparently, Advisor cannot analyze results for display in the Survey Report where cilk_for is in use. The "Why no vectorization?" field shows the comment "loop iteration count cannot be computed". The times are reported as 0 in the summary, even though they show up with reasonable values ascribed to cilkrts_cilk_for in the source and assembly views. I have built with -debug:inline-debug-info -Qipo-.




Hey Tim,

If each "iteration" of the cilk_for loop is a separate thread then it is possible that the loop is not being vectorized.

I would need your example to say for sure, but there may be some vector instructions in the loop even though, from our analysis, the loop itself is executed in parallel.

Looking at the loop analytics in Advisor would show how many vector instructions are executed.

You could also try adding an inner loop (i.e. strip mining) for vectorization and using the cilk_for for parallelism.

I'd be happy to look at your example.
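
To make the strip-mining suggestion concrete, here is a minimal sketch, not taken from loopdcp.c. The function name `saxpy_strip_mined` and the strip length `STRIP` are made up for illustration, and the `__cilk` macro test assumes a compiler that defines it when Cilk Plus is enabled. Where it is not defined, the macro fallback turns cilk_for into a plain for loop, which is roughly what serializing cilk_for does. The outer loop over strips carries the parallelism, and the inner unit-stride loop is the one the compiler can try to vectorize.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: __cilk is defined when Cilk Plus is enabled. Otherwise fall
   back to a plain for loop, i.e. a serialized cilk_for. */
#if defined(__cilk)
#include <cilk/cilk.h>
#else
#define cilk_for for
#endif

#define STRIP 1024 /* hypothetical strip length; tune per platform */

/* Computes y[i] += a * x[i] for i in [0, n). */
void saxpy_strip_mined(size_t n, float a,
                       const float *restrict x, float *restrict y)
{
    assert(n == 0 || (x != NULL && y != NULL));

    /* Number of strips, rounding up so the last partial strip is covered. */
    size_t nstrips = (n + STRIP - 1) / STRIP;

    /* Outer loop: one parallel unit of work per strip. */
    cilk_for (size_t s = 0; s < nstrips; ++s) {
        size_t lo = s * STRIP;
        size_t hi = (lo + STRIP < n) ? lo + STRIP : n;

        /* Inner loop: unit stride, no aliasing, a vectorization candidate. */
        for (size_t i = lo; i < hi; ++i)
            y[i] += a * x[i];
    }
}
```

With this shape, any vectorization remark in the opt-report or in Advisor should attach to the inner loop rather than to the cilk_for itself.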




I'm not expecting the top-level cilk_for loop to vectorize, except in the cases where the _Simd clause is included (with a demonstrated benefit). In most cases, on a dual-core HSW platform, Cilk(tm) Plus performance is good enough to convince me of effective vectorization. Even on MIC KNC, where Cilk(tm) Plus typically reaches only about 30% of the performance of plain C99 (after the numbers of threads and workers are optimized by trial and error), vectorization typically shows a 3x speedup.

icl -O3 -Qipo- -debug:inline-debug-info -QxHost -Qunroll:4 -Qopt-report:4 -c loopdcp.c

ifort -O3 -Qipo- -Qopenmp -QxHost -debug:inline-debug-info -fpp -Qopt-report:4 -assume:underscore -names:lowercase loopdcp.obj maind.F forttime.f90



set OMP_PLACES=cores

I don't know of any way to force use of 2+ cores other than to set CILK_NWORKERS=3.

I suppose it may be possible to run on 1 core with an option that serializes cilk_for.




Attachment: forttime.F90 (930 bytes)
Attachment: loopdcp.c (33.39 KB)
Attachment: maind.F (36.7 KB)


Comparing against the serial performance would give you the precise speedup, but with the loop analytics tab you can make a ballpark estimate based on our static analysis of the instructions in the loop. Running a trip count/FLOPS analysis would also give you another metric to compare.


Hi Tim,

Can you send g2c.h?


Sorry, still some of those f2c translation relics.

ifort -O3 -Qipo- -Qopenmp -QxHost -debug:inline-debug-info -fpp -Qopt-report:4 -assume:underscore -names:lowercase loopdcp.obj maind.F forttime.f90

When /Qcilk-serialize is set in the ICL compilation, a majority of the "iteration count cannot be computed" notations change to "vector dependence prevents vectorization," but 4 more cases are displayed as implementing AVX or AVX2 vectorization. Run time barely increased with cilk_for parallelization suppressed.

In function s2102, the (int) cast, which is necessary for performance, apparently leads Advisor to assume VL=8, which cuts the reported "efficiency" in half (the same effect as in the C or Fortran code).


Attachment: g2c.h (5.23 KB)

With the corrected version of function s176 (attached), execution under cilk-serialize is confined to the non-vectorized remainder loop. The function doesn't show up in the Advisor summary either with or without cilk-serialize.

If the operands of reduce_add are reversed (so as to align one of them), the remainder loop is vectorized. The cilk-serialize build then executes the primary loop version, which shows up in the Advisor summary as 47% efficient.

The sampling interval should be reduced to about 2 ms, as the total run time is not much over 1 second.

C and Fortran versions of s125 and s2102 take advantage of pragma nontemporal, but this appears to be excluded by Cilk(tm) Plus.


Attachment: loopdcp.c (33.45 KB)
