Auxiliary Functions
Why are they fast?
The kernel string handed to the OpenCL compiler can contain not only kernel functions but also other functions that the kernels call. This makes writing something like a ray tracer much easier, because auxiliary functions avoid code duplication.
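As a minimal sketch of that idea, the host-side C below builds one program string that holds an auxiliary function followed by two kernels that both call it. The names `reflect_dir`, `bounce_primary`, `bounce_shadow` and `build_program_source` are hypothetical and chosen only for illustration; they do not come from this library.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical auxiliary function: plain OpenCL C, no __kernel qualifier. */
static const char *AUX_SOURCE =
    "float4 reflect_dir(float4 d, float4 n) {\n"
    "    return d - 2.0f * dot(d, n) * n;\n"
    "}\n";

/* Two hypothetical kernels that both call the same auxiliary function. */
static const char *KERNEL_SOURCE =
    "__kernel void bounce_primary(__global float4 *dir,\n"
    "                             __global const float4 *normal) {\n"
    "    size_t i = get_global_id(0);\n"
    "    dir[i] = reflect_dir(dir[i], normal[i]);\n"
    "}\n"
    "__kernel void bounce_shadow(__global float4 *dir,\n"
    "                            __global const float4 *normal) {\n"
    "    size_t i = get_global_id(0);\n"
    "    dir[i] = reflect_dir(-dir[i], normal[i]);\n"
    "}\n";

/* Joins the auxiliary source and the kernel source into one program string.
   The auxiliary function comes first so it is declared before the kernels
   use it. Returns the length written, excluding the terminating NUL. */
static size_t build_program_source(char *out, size_t cap) {
    size_t need = strlen(AUX_SOURCE) + strlen(KERNEL_SOURCE);
    assert(need < cap); /* the buffer must hold the whole string plus NUL */
    strcpy(out, AUX_SOURCE);
    strcat(out, KERNEL_SOURCE);
    return need;
}
```

The resulting string is what would be passed to `clCreateProgramWithSource` and then compiled with `clBuildProgram`; the reflection math is written once and shared by both kernels.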
OpenCL auxiliary functions do not behave like ordinary function calls: the compiler typically inlines them as instructions at each call site. Without jumps for calls and returns, the program flow keeps the ALU pipelines filled. The disadvantage is that recursive functions are not possible (OpenCL C does not support recursion). This is why ray-tracers are adapted to service "queues" of rays instead of recursively computing all reflections and refractions. Here is an example of auxiliary function usage: http://stackoverflow.com/questions/37774463/how-to-reduce-code-duplication-between-opencl-kernels/37774810#37774810
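The queue idea can be sketched in plain C as a stand-in for the kernel-side logic. Everything here is an assumption made for illustration: the `Ray` and `RayQueue` types, the fixed `MAX_RAYS` and `MAX_DEPTH` limits, and the even 50/50 split between a reflected and a refracted ray. The point is only the shape of the loop: instead of a function calling itself for each bounce, every hit appends its secondary rays to a queue, and a single loop drains that queue.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RAYS  64  /* queue capacity (assumed) */
#define MAX_DEPTH 3   /* bounce limit (assumed) */

typedef struct {
    int   depth;   /* number of bounces so far */
    float weight;  /* fraction of the pixel's energy carried by this ray */
} Ray;

typedef struct {
    Ray    items[MAX_RAYS];
    size_t head;   /* next ray to process */
    size_t tail;   /* next free slot */
} RayQueue;

static void push(RayQueue *q, Ray r) {
    assert(q->tail < MAX_RAYS); /* the queue must not overflow */
    q->items[q->tail++] = r;
}

/* Processes one pixel iteratively. Each ray below MAX_DEPTH spawns a
   reflected and a refracted ray carrying half its weight each; rays at
   MAX_DEPTH deposit their weight. Returns the number of rays processed. */
static size_t trace_pixel(float *energy_out) {
    RayQueue q = { .head = 0, .tail = 0 };
    float energy = 0.0f;

    push(&q, (Ray){ .depth = 0, .weight = 1.0f }); /* primary ray */
    while (q.head < q.tail) {
        Ray r = q.items[q.head++];
        if (r.depth == MAX_DEPTH) {
            energy += r.weight;
            continue;
        }
        push(&q, (Ray){ .depth = r.depth + 1, .weight = r.weight * 0.5f }); /* reflection */
        push(&q, (Ray){ .depth = r.depth + 1, .weight = r.weight * 0.5f }); /* refraction */
    }

    *energy_out = energy;
    return q.head;
}
```

Because the loop has no self-call, the same structure can be expressed in OpenCL C, either inside one kernel or as repeated kernel launches that each consume the rays queued by the previous pass.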
Note: Adding many differently named copies of the same auxiliary function to fake recursion is, in fact, a lot slower than running multiple kernels.