Would be fantastic if you could develop some kind of Thrust/CUDPP-style library optimized for your OpenCL implementation. Ideally, I would like to have sort, reduction, parallel scan, matrix multiplication and FFT algorithms there.
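For reference, the "parallel scan" primitive requested above is the (inclusive) prefix sum. A minimal serial sketch of what such a library call computes follows; this only shows the result contract, not how a GPU implementation would parallelize it:

```c
#include <stddef.h>

/* Inclusive prefix sum: out[i] = in[0] + in[1] + ... + in[i].
   Thrust exposes this operation as thrust::inclusive_scan;
   this serial loop is just an illustration of the semantics. */
void inclusive_scan(const float *in, float *out, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        acc += in[i];
        out[i] = acc;
    }
}
```

For example, scanning {1, 2, 3} yields {1, 3, 6}.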
Hi,

I cannot comment on our future plans in this forum. Yet, I would like to understand your vision more. Do you expect these functions/libraries to be available:

1. From your host C/C++ code
2. As an OpenCL kernel to be enqueued
3. As a function inside the kernel code

You may also want to look into the new compile and link options in OpenCL specification version 1.2.

Regards,
- Arnon
Ideally I would like this from the C++ host code, but also as a function inside the kernel code, yep.
Hi,

Just a long-term idea... I don't expect it to be implemented this year, but with the ever-increasing power of Intel IGPs, scientific math libraries would be good. Nvidia has BLAS and FFT libraries for CUDA, and AMD has BLAS and FFT libraries for OpenCL. It would be good if you could build some BLAS and FFT libraries optimized for your GPUs and expose them in OpenCL as host code functions, but taking device buffers as I/O and disallowing host transfers (see the new CUBLAS in CUDA 4.0, where even scalar parameters like the alpha and beta in BLAS functions are taken from the device, to avoid host-device transfers, which can take even more time than the function itself).
Ok, great inputs. So far I see 2 usages:

1. (jogshy) I just want to accelerate my C/C++ code.
2. (rtfss1) I want BLAS/FFT that interact with my OpenCL code on the target device.

In both solutions, what you expect is that the BLAS/FFT (MKL/IPP libraries) from Intel will provide the best performance for your usages on Intel Core Processors with HD Graphics. In both cases I assume that using the HD Graphics is not a must-have if the most optimized specific algorithms run better on the processor itself within the boundaries of your workloads. Right?

Arnon
>> In both cases I assume that using the HD Graphics is not a must-have if the most optimized specific algorithms run better on the processor itself within the boundaries of your workloads. Right?
Anyway, I would prefer to rely on Intel's optimized sort/reductions instead of implementing my own, which is tedious, and I don't really know your HW/implementation well enough to optimize it as it should be.
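To illustrate what the requested reduction primitive computes, here is a serial sketch of the pairwise (tree) combination order that parallel reduction implementations typically use; this is only an illustration of the algorithm shape, not Intel's (or anyone's) actual implementation:

```c
#include <stddef.h>

/* Pairwise (tree) sum reduction: in each pass, elements that are
   `stride` apart are combined, halving the number of live partial
   sums; after ceil(log2(n)) passes the total is in data[0].
   On a GPU each inner-loop iteration would run in parallel.
   Note: this modifies `data` in place. */
float tree_reduce(float *data, size_t n)
{
    for (size_t stride = 1; stride < n; stride *= 2)
        for (size_t i = 0; i + stride < n; i += 2 * stride)
            data[i] += data[i + stride];
    return data[0];
}
```

For example, reducing {1, 2, 3, 4, 5, 6, 7, 8} takes 3 passes and returns 36. The tree order also has better floating-point error behavior than a naive left-to-right sum.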
Well... not exactly. Today the Intel IGP has a peak of 256 GFLOPS in single precision, which is a little more than the quad-core CPU it accompanies. Assuming we get a 2x-3x faster IGP next year, I'm saying that next year Intel IGPs could perhaps do a single-precision matrix multiplication (the BLAS3 sgemm routine) say 2-3x faster than the CPU. Then an optimized BLAS library running on the GPU could outperform even Intel's optimized MKL libraries by that factor, and the difference will only become more pronounced, as GPU GFLOPS seem to evolve faster than CPU GFLOPS.
As said, it's a long-term idea, but it seems such a library could take a good amount of time to implement, seeing that the AMD and CUDA BLAS implementations needed some years to achieve full BLAS2 and BLAS3 compliance. Thanks