It is often convenient to keep a single kernel source for different devices. At the same time, it is often important to apply device-specific optimizations.
If you need separate versions of a kernel, one way to keep a single source code base is to use the preprocessor to create CPU-specific and GPU-specific optimized versions. You can run clBuildProgram twice on the same program object: once for the CPU with a flag (compiler input) that indicates the CPU version, and a second time for the GPU with the corresponding compiler flags. Then, when you create kernels with clCreateKernel, the runtime has two different versions of each kernel.
To maintain different versions of a kernel, consider using preprocessor directives rather than regular control flow, as explained in the “Using Specialization in Branching” section. The kernel prototype (the number of arguments and their types) should be the same for all kernels across all devices; otherwise, you might get a CL_INVALID_KERNEL_DEFINITION error.