Memory Limitation in Cilk?

I just want to use Cilk Plus with implicit shared memory, but I am getting this error:

HOST--ERROR:myoiExPLExtendVSM: VSM size exceeds the limitation (4294967296) now!
HOST--ERROR:myoiExMalloc:662 Fail to get a new memory chunk!
HOST--ERROR:myoArenaMalloc1: Fail to get free memory space!
HOST--ERROR:myoArenaAlignedMalloc1: No enough memory space!

There is still enough space. It also works with OpenMP and TBB, but with Cilk the program crashes right after start with this message. Are there any parameters to use bigger memory spaces? I didn't find any manuals on this.

best regards

As with most programs or libraries, the Cilk Plus runtime does use some amount of shared memory for its scheduler. But without some more specifics about the platform you are running on, or how your application is using Cilk Plus, it is hard to tell.

Is this running on an offload device of some kind, or a more traditional desktop/server environment? 32-bit or 64-bit OS? How close to the limit of available physical memory does the application get in its normal (serial) execution? What happens if you vary the number of workers used (e.g., CILK_NWORKERS=1, CILK_NWORKERS=2, etc.)? As the number of workers increases, the memory usage also increases.

Cheers,

Jim

Well, I just used the same code with other APIs (OpenMP and TBB) and it worked fine with the same problem size. I think the memory usage was 4 GB, maybe a lot more, but with Cilk I got this error when offloading to the MIC. The only thing I did was change the variables from local to global, using _Cilk_shared float * _Cilk_shared var, and add _Cilk_offload before the _Cilk_for.

Yeah, well, I just want to use Cilk in offload mode with the Intel MIC. It works with OpenMP and TBB but not with Cilk at the same problem size.

I also computed the same problem on a Xeon processor with the same problem size using Cilk, and it worked, but without using

#define malloc(x) _Offload_shared_aligned_malloc(x, ALIGN)  /* ALIGN = 64 */
#define free(x) _Offload_shared_aligned_free(x)

and transferring the data to the Intel MIC.

Are there differences in the libraries that Cilk uses?

Should I change the number of workers? With OpenMP and TBB I used 240 threads.

Cheers

It's certainly worthwhile to try a smaller value of MIC_CILK_NWORKERS, since the optimum value of NWORKERS on MIC frequently comes out at half or less of the optimum number of OpenMP threads for the same task.

I wouldn't be surprised if you encountered a limit of 4GB for offload of such a data region, or if it depended on your coprocessor model or even stack setting. How much free memory is visible? Mine will not reach 8GB virtual available, and has less than 4GB physical.

OK, thanks, I will try; I am in a hurry at the moment. I have a question: is there any difference in using _Offload_shared_aligned_malloc even without using offload mode? With OpenMP it's possible to use _mm_malloc, which aligns to the memory banks, so can I also use _mm_malloc?

Does the stack size of the workers influence the speedup?

Nevertheless, Cilk Plus and OpenMP achieve nearly the same speedup.

OK, I did this:

export MIC_PREFIX=MIC ; MIC_CILK_NWORKERS=60

and

MIC_CILK_NWORKERS=1, and it didn't work.

 

Cheers

 

Just like computing on your Xeon host, if your stack size is too small, you're likely to get a segfault. Reproducing errors like this will be hard because there's no way to predict the scheduling; it's deliberately randomized.

The stack size probably doesn't affect speedup. It's simply the size of the stacks that the Cilk runtime will allocate to steal work on. However, there is one way in which it can affect parallelism (and therefore speedup). If the Cilk runtime attempts to allocate a stack and fails, the worker that tried to allocate the stack will stall for a bit in the hope that a stack has been deallocated and the next allocation attempt will succeed. You'd need to be at the limits of the address space for that to occur. But in that sense, if you crank down the stack size, you may be able to allocate more (smaller) stacks. If your application is being stalled for lack of a stack, that might help.

You can also control the number of stacks that will be allocated.  That will also cause stalling, if all of the stacks have been used.

But both of these are fringe cases.

   - Barry

If you have set MIC_CILK_NWORKERS=1, and the program still doesn't work, then something more fundamental, apart from the scheduling done by the Cilk Plus runtime, may be going on.   With 1 worker, I don't think the runtime does anything interesting.

For the TBB or OpenMP versions that work, are you offloading the computation onto the device, or running on the host machine?   I am a bit confused by the statement "Cilk Plus and OpenMP achieves nearly the same speedup," because that implies you got both versions to run correctly.

Perhaps it might help if you could post a sample program or stripped-down version of your program which still triggers the error you are seeing, and the similar one that does work? 
Cheers,

Jim

Yeah, well, I just computed the same problem on the Xeon processor and compared the performance there, not on the MIC.

The attempt was with just one worker.

I think the most important parts are:

#define real float
#define SQRT sqrtf
#define ALIGN 64

#include <cilk/cilk.h>
#include <cilk/reducer_opadd.h>
#include <cilk/cilk_api.h>

.....

#define ALINGED 1
#define malloc(x) _Offload_shared_aligned_malloc(x,ALIGN)
#define free(x) _Offload_shared_aligned_free(x)

....

void init_Atoms(dim3 grid,int nAtoms);
void save_Atoms(int nAtoms,real gridspace,char* out_path);
void load_Atoms(int* nAtoms,real *gridspace,char* in_path);
void calc_energy(dim3 grid,int z_s,real gridspace,int nAtoms);

void write_to_CSV(real *f,dim3 grid,char *out_path,real gridspace);

static double dtime(void);
static double cur_second(void);

_Cilk_shared real *_Cilk_shared energygrid;
_Cilk_shared real *_Cilk_shared atoms;

int main(int argc, char* argv[]){

.....

energygrid=(_Cilk_shared real*) malloc(grid.x*grid.y*grid.z*sizeof(real));

}

void init_Atoms(dim3 grid,int nAtoms){
        atoms=(_Cilk_shared real *)malloc(4*nAtoms*sizeof(real));

.....

}

The error occurs at the allocation of energygrid, where grid.x=grid.y=grid.z=1000 and real=float.
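
As a rough sanity check on these numbers, here is a minimal sketch in plain C that computes the size of the energygrid request and compares it with the limit quoted in the HOST--ERROR message (4294967296 bytes). The names grid_bytes, vsm_headroom and VSM_LIMIT_BYTES are hypothetical helpers made up for this illustration; they are not part of Cilk Plus or the offload runtime, and the sketch assumes nothing about how the shared-memory allocator works internally.

```c
#include <stddef.h>
#include <stdint.h>

/* Limit taken from the error message in this thread: 4294967296 bytes (4 GiB).
   The macro name is hypothetical. */
#define VSM_LIMIT_BYTES UINT64_C(4294967296)

/* Bytes needed for an nx * ny * nz grid of elem_size-byte elements.
   Uses 64-bit arithmetic so the product cannot wrap a 32-bit int.
   Returns 0 if the product would overflow uint64_t (or if any factor is 0). */
uint64_t grid_bytes(uint64_t nx, uint64_t ny, uint64_t nz, uint64_t elem_size)
{
    uint64_t factors[3] = { ny, nz, elem_size };
    uint64_t total = nx;
    for (int i = 0; i < 3; ++i) {
        if (factors[i] != 0 && total > UINT64_MAX / factors[i]) {
            return 0; /* multiplication would overflow */
        }
        total *= factors[i];
    }
    return total;
}

/* Bytes left under the limit after a request of the given size,
   or 0 if the request alone already exceeds the limit. */
uint64_t vsm_headroom(uint64_t requested)
{
    return requested <= VSM_LIMIT_BYTES ? VSM_LIMIT_BYTES - requested : 0;
}
```

With grid.x=grid.y=grid.z=1000 and a 4-byte float, grid_bytes(1000, 1000, 1000, 4) gives 4000000000 bytes, which leaves only 294967296 bytes (about 281 MiB) below the quoted limit. So the request by itself fits, but only just; the atoms array and whatever bookkeeping or alignment padding the allocator may add would have to fit into that remainder, which might explain the failure.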

Cheers.

 

Sorry ,...

The attempt with just one worker was only to figure out whether something more general is wrong.

Yeah, well, I just computed the same problem on the Xeon processor with OpenMP, TBB and Cilk and compared the performance there, not on the MIC.

Cheers

The full code can be downloaded from here. There is a problem with the alignment of the text, but it's still readable.

https://www.dropbox.com/home/3_vdo_scaled_simd_opt_xeon_phi?select=vdo_s...

best regards

Dropbox tells me that the folder doesn't exist.

    - Barry
