For custom algorithms, the built-in functionality of ViennaCL may be insufficient or not fast enough. In such cases it can be desirable to write a custom OpenCL compute kernel, which is explained in this chapter. The necessary steps are explained one after another: setting up the OpenCL source code, compiling it, configuring the kernel, and launching it.
A tutorial on this topic can be found at examples/tutorial/custom-kernels.cpp.
The OpenCL source code has to be provided as a string. One can either write the source code directly into a string within C++ files, or read the OpenCL source from a file. For demonstration purposes, we write the source directly as a string constant named my_compute_kernel.
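A minimal sketch of such a string constant is given below. The kernel body is an assumption that merely matches the description in this chapter (an entry-wise product computed in a strided loop). The helper read_kernel_source is a hypothetical name, not part of ViennaCL, and illustrates the alternative of reading the source from a file.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>

// OpenCL source kept as a C++ string constant. Each work item starts at its
// global id and advances by the global size, so any work size covers all entries.
const char * my_compute_kernel =
"__kernel void elementwise_prod(\n"
"          __global const float * vec1,\n"
"          __global const float * vec2,\n"
"          __global float * result,\n"
"          unsigned int size)\n"
"{\n"
"  for (unsigned int i = get_global_id(0); i < size; i += get_global_size(0))\n"
"    result[i] = vec1[i] * vec2[i];\n"
"}\n";

// Hypothetical helper (not part of ViennaCL): read OpenCL source from a file.
// Returns an empty string if the file cannot be opened.
std::string read_kernel_source(const std::string & filename)
{
  assert(!filename.empty());

  std::ifstream file(filename.c_str());
  if (!file)
    return std::string();

  std::ostringstream buffer;
  buffer << file.rdbuf();   // copy the whole file into the string buffer
  return buffer.str();
}
```

Keeping the source in a string constant avoids any dependency on file paths at run time, whereas a separate file is easier to edit and can be changed without recompiling the host application.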
The kernel takes three vector arguments vec1, vec2 and result and the vector length variable size. It computes the entry-wise product of the vectors vec1 and vec2 and writes the result to the vector result. For a more detailed explanation of the OpenCL source code, please refer to the specification available at the Khronos group webpage [18].
The source code in the string constant my_compute_kernel has to be compiled to an OpenCL program. An OpenCL program is a compilation unit and may contain several different compute kernels. For example, one could include in the same program another kernel function inplace_elementwise_prod, which writes the result directly to one of the two operands vec1 or vec2.
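The following sketch shows one source string holding two kernels. Both kernel bodies are assumptions based on the description above, and list_kernel_names is a hypothetical helper (not part of ViennaCL) that only makes the "several kernels per program" point visible on the host. In the tutorial, a string like this is handed to viennacl::ocl::current_context().add_program() together with a program name.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One OpenCL program source containing two kernels. The second kernel,
// inplace_elementwise_prod, overwrites vec1 with the entry-wise product.
const char * my_compute_kernel =
"__kernel void elementwise_prod(\n"
"          __global const float * vec1,\n"
"          __global const float * vec2,\n"
"          __global float * result,\n"
"          unsigned int size)\n"
"{\n"
"  for (unsigned int i = get_global_id(0); i < size; i += get_global_size(0))\n"
"    result[i] = vec1[i] * vec2[i];\n"
"}\n"
"\n"
"__kernel void inplace_elementwise_prod(\n"
"          __global float * vec1,\n"
"          __global const float * vec2,\n"
"          unsigned int size)\n"
"{\n"
"  for (unsigned int i = get_global_id(0); i < size; i += get_global_size(0))\n"
"    vec1[i] *= vec2[i];\n"
"}\n";

// Hypothetical helper (not part of ViennaCL): collect the identifiers that
// follow "__kernel void " in a source string. This is a plain text scan,
// not an OpenCL parser.
std::vector<std::string> list_kernel_names(const std::string & source)
{
  const std::string marker = "__kernel void ";
  std::vector<std::string> names;

  std::string::size_type pos = source.find(marker);
  while (pos != std::string::npos)
  {
    std::string::size_type begin = pos + marker.size();
    std::string::size_type end   = source.find('(', begin);
    assert(end != std::string::npos && "kernel name must be followed by '('");
    names.push_back(source.substr(begin, end - begin));
    pos = source.find(marker, end);
  }
  return names;
}
```

Calling list_kernel_names(my_compute_kernel) yields the two names elementwise_prod and inplace_elementwise_prod, each of which could later be requested from the compiled program.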
The next step is to extract the kernel object my_kernel from the compiled program. An explicit kernel registration was needed prior to ViennaCL 1.5.0, but is no longer required.
Now, the kernel is set up to use the function elementwise_prod compiled into the program my_prog.
Instead of keeping the references to programs and kernels obtained at program compilation, one can retrieve them by name at other places within the application source code.
This simplifies application development considerably, since no program and kernel objects need to be passed around.
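The benefit of name-based lookup can be illustrated with a toy model. All types below (toy_kernel, toy_program, toy_context) are hypothetical stand-ins and not the ViennaCL classes. They only mimic the idea that a context owns programs, a program owns kernels, and both can be found again by name. In ViennaCL 1.5.0 and later, the corresponding lookups are assumed to be viennacl::ocl::current_context().get_program("my_compute_kernel") followed by get_kernel("elementwise_prod") on the returned program.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy stand-in for a kernel object. NOT the ViennaCL class.
struct toy_kernel
{
  std::string name;
};

// Toy stand-in for a compiled program: owns its kernels, keyed by name.
class toy_program
{
public:
  void add_kernel(const std::string & name) { kernels_[name].name = name; }

  // Returns a reference, so settings made on the kernel persist across lookups.
  toy_kernel & get_kernel(const std::string & name)
  {
    std::map<std::string, toy_kernel>::iterator it = kernels_.find(name);
    assert(it != kernels_.end() && "kernel not found in program");
    return it->second;
  }

private:
  std::map<std::string, toy_kernel> kernels_;
};

// Toy stand-in for a context: owns the programs, keyed by name.
class toy_context
{
public:
  toy_program & add_program(const std::string & name) { return programs_[name]; }

  toy_program & get_program(const std::string & name)
  {
    std::map<std::string, toy_program>::iterator it = programs_.find(name);
    assert(it != programs_.end() && "program not found in context");
    return it->second;
  }

private:
  std::map<std::string, toy_program> programs_;
};

// Function-local static: every caller in the application sees the same context.
inline toy_context & current_toy_context()
{
  static toy_context ctx;
  return ctx;
}
```

Because the context is globally reachable, a function deep inside the application only needs the two names, not a program or kernel object threaded through every call.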
Before launching the kernel, one may adjust the global and local work sizes (readers not familiar with these terms are encouraged to read the OpenCL standard [18]). For example, a one-dimensional execution model may be specified with 16 local workers and 128 global workers.
For a two-dimensional execution model, the corresponding parameters for the second dimension are set in addition.
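The relation between local and global work sizes can be checked with a little arithmetic on the host. The struct work_sizes and both helpers below are hypothetical names introduced for illustration and are not part of ViennaCL. On a ViennaCL kernel object, the sizes themselves are set through the member functions local_work_size(index, size) and global_work_size(index, size), with index 0 for the first and index 1 for the second dimension.

```cpp
#include <cassert>
#include <cstddef>

// Plain holder for the work sizes of up to two dimensions (hypothetical type).
struct work_sizes
{
  std::size_t local[2];
  std::size_t global[2];
};

// Number of work groups in dimension 'dim'. OpenCL 1.x requires the global
// work size to be a multiple of the local work size, which is asserted here.
std::size_t num_work_groups(const work_sizes & ws, int dim)
{
  assert(dim == 0 || dim == 1);
  assert(ws.local[dim] > 0 && ws.global[dim] % ws.local[dim] == 0);
  return ws.global[dim] / ws.local[dim];
}

// Round n up to the next multiple of 'local', e.g. to derive a valid global
// work size from a problem size.
std::size_t round_up_to_multiple(std::size_t n, std::size_t local)
{
  assert(local > 0);
  return ((n + local - 1) / local) * local;
}
```

With 16 local and 128 global workers per dimension, num_work_groups returns 8 for each dimension, so a two-dimensional launch consists of 64 work groups in total.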
However, for the simple kernel in this example it is not necessary to specify any work sizes at all. The default work sizes (which can be found in viennacl/ocl/kernel.hpp) suffice for most cases. We recommend writing kernels that do NOT depend on a particular thread configuration, since such a dependence usually makes performance non-portable.
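Why a kernel can be independent of the thread configuration is easiest to see by emulating a strided loop on the host. The sketch below assumes that the kernel iterates from get_global_id(0) in steps of get_global_size(0). The function emulate_elementwise_prod is a hypothetical name and runs entirely on the CPU without OpenCL.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of a strided entry-wise product.
// 'global_size' plays the role of get_global_size(0), and each value of
// 'global_id' plays the role of one work item's get_global_id(0).
// Work items run one after another here; on a device they run concurrently,
// which is safe because each index i is written by exactly one work item.
std::vector<float> emulate_elementwise_prod(const std::vector<float> & vec1,
                                            const std::vector<float> & vec2,
                                            std::size_t global_size)
{
  assert(vec1.size() == vec2.size());
  assert(global_size > 0);

  std::vector<float> result(vec1.size(), 0.0f);
  for (std::size_t global_id = 0; global_id < global_size; ++global_id)  // one pass per work item
    for (std::size_t i = global_id; i < vec1.size(); i += global_size)   // strided loop of the kernel
      result[i] = vec1[i] * vec2[i];
  return result;
}
```

The result is identical whether global_size is smaller than, equal to, or larger than the vector length. Only the distribution of entries over work items changes, so the work sizes affect speed but not correctness.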
Kernel arguments are passed in the same way as for ordinary functions, and the resulting call is then enqueued for execution. We assume that three ViennaCL vectors vec1, vec2 and result have already been set up.
By default, the kernel is enqueued in the first queue of the currently active device. A custom queue can be specified as an optional second argument.
When passing integer arguments to a kernel, use the OpenCL types cl_int, cl_uint, etc. instead of size_t, because size_t might differ on the host and the compute device.
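A checked conversion from size_t to a fixed-width integer might look as follows. To keep the sketch compilable without OpenCL headers, the typedef kernel_uint is used as a stand-in for cl_uint, which is a 32-bit unsigned integer. Both helper functions are hypothetical names and not part of ViennaCL. In the tutorial, the size argument is converted in a similar spirit via static_cast<cl_uint>(vec1.size()) when the kernel is called.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Stand-in for cl_uint (assumption: used only so that no OpenCL header is needed).
typedef std::uint32_t kernel_uint;

// True if n can be represented by the 32-bit kernel argument type.
bool fits_in_kernel_uint(std::size_t n)
{
  return n <= std::numeric_limits<kernel_uint>::max();
}

// Convert a host-side size to the fixed-width kernel argument type.
// The assertion catches sizes that would silently wrap around.
kernel_uint to_kernel_uint(std::size_t n)
{
  assert(fits_in_kernel_uint(n) && "size too large for a 32-bit kernel argument");
  return static_cast<kernel_uint>(n);
}
```

The kernel parameter declared as unsigned int in OpenCL C is always 32 bits wide, so passing a 64-bit size_t from the host would hand over an argument of the wrong size. The explicit conversion keeps both sides in agreement.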