opencl_snprintf gobbling CPU time as shown by VTune Amplifier

opencl_snprintf gobbling CPU time as shown by VTune Amplifier

Аватар пользователя sharpneli

When OpenCL application is profiled with VTune Amplifier (Excellent product. I'm in love with it ) it shows parts of the intel OpenCL internals. It seems that the function opencl_snprintf takes quite a lot of cpu time to run (no printf is used on our kernels).

As an example our most used kernel takes 39 seconds of CPU time to run (very, very many invocations within a program with new arguments set for every call) and opencl_snprintf uses 28 seconds on a simple test run. I'd wish to know why snprintf is used when spawning tasks or in other internal operation. Is there internal logging going on or what is the purpose of it? Or is it just VTune misreporting the CPU time/instructions depening on measurement mode? If snprintf actually takes this much time to run can we expected that it will be optimized away eventually?

9 сообщений / 0 новое
Последнее сообщение
Пожалуйста, обратитесь к странице Уведомление об оптимизации для более подробной информации относительно производительности и оптимизации в программных продуктах компании Intel.
Аватар пользователя Evgeny Fiksman (Intel)

Hi,

Thanks for the report we will investigate this issue.

Do your kernel include barrier() instrunction? If so, it might be VTune issue.

Any way, I'll very appriciate it, if you could share code example that represents one of your kernels.

Evgeny

Аватар пользователя sharpneli

The kernel itself is very trivial.

#define SIDE 256
#define INDEX(i,j) ((i) + (SIDE)*(j))

/* Invocated with global size = {SIDE-2, SIDE-2} and global offset of 1 on both dimensions */
__kernel void propagate(__global const float * restrict curx, __global const float * restrict oldx, __global float * restrict newx, float a, float c)
{
int i = get_global_id(0);
int j = get_global_id(1);
newx[INDEX(i,j)] = (oldx[INDEX(i,j)] + a*(curx[INDEX(i-1,j)] + curx[INDEX(i+1,j)] + curx[INDEX(i,j-1)] + curx[INDEX(i,j+1)]))/c;
}

Even though there is cast to int on get_global_id the compiler produces identical code to the case where size_t is explicitly used. The time usage of opencl_snprintf does not depend on the SIDE parameter, it seems to scale with the amount of kernels enqueued to run.
All the other kernels in the program are written in similar way. No barriers and no atomic operations nor any local memory usage. Just few read only buffers and one write only buffer and arithmetic operations performed on them.
The only fancy thing we have is write_imagef on a kernel which visualises the data. But the time usage ought to scale with the SIDE parameter if snprintf would've been used inside.

We use OpenGL interoperability to draw into a single texture with write_imagef. Time usage attributed to snprintf does not scale with amount of frames. If we increase the amount of kernel calls between frames thus having less drawing for the same amount of profiling time the time usage of snprintf remains constant.
When invocating a kernel our utility library calls getKernelInfo to get the amount of arguments it has, then calling setarg for every argument and then calling enqueueNDRangeKernel.

Call stack grouping of VTune shows that snprintf is called by clDevCreateDeviceInstance which is called by "Intel::OpenCL::TaskExecutor::CreateThreadPartitioner" and going from there we end up at our original function which created the command queue at the start of the program. When grouped by threads it is clear that the snprintf is equally used in the working threads (as spawned by the opencl implementation itself), and that the clDevCreateDeviceInstance is constantly called inside the working threads.

I hope this was useful.

Аватар пользователя Evgeny Fiksman (Intel)

Hi,

Thanks a lot for the deep investigation.

We are continuingour analysisinorder to better understand functional and performance implications.

Could you alsoshare you host code toidentefy your specific issue?

Thanks,
Evgeny

Аватар пользователя Evgeny Fiksman (Intel)

Hi again,

I can't reproduce you issue on top of source.

Besideyou host code sources, could you also share the VTune screen shot?

Thanks,
Evgeny

Аватар пользователя sharpneli

I hope this shot contains the relevant information. The reporting period was filtered to contain only the actual processing time. No loading and OpenCL initialization is included there. This could easily be just a VTune issue so that it reports CPU time from the kernels to those functions, but I have no methods of finding that out. As an additional information the host code is 32bit and the intel OpenCL sdk has the 64bit version installed.

Аватар пользователя Evgeny Fiksman (Intel)

Hi,

I assume you have 64 bit windwows machine, right?

Our 64bit version also includes 32bit binaries, therefore you can link 32bit host with our SDK.
It looks like it's VTune issue since the shown call stack is not reasonable.

Please provide VTune version information and your sources host + kenrel (if you can) and we will validate this issue.

Thanks,
Evgeny

Аватар пользователя sharpneli

The version of vtune is build 176374 and yes the windows machine is 64bit. The processor is Nehalem based i5. Unfortunately I'm unable to provide our host code as of this time. You are most likely correct that the issue is with VTune itself. We do not have any problems with this issue as vtune profiles the kernels correctly as far as we know. I just wondered which product has the issue (vtune or opencl) so that relevant bug report could be filed.

Regards,
Teemu

Аватар пользователя Evgeny Fiksman (Intel)

Hi Teemu,

Thanks again for your efforts. Finnaly, I have reproduced the issue and it looks like our tools problem.

We will work with VTune team and fill find right solution for this issue.

Evgeny

Зарегистрируйтесь, чтобы оставить комментарий.