Atomic Functions
Parallelization is one of the important steps in getting an algorithm to run on a multi-core system. When this step is done successfully, the speed gain (or latency reduction) depends on the number of cores and on the pattern of accesses between the cores and memory.

In some algorithms, however, parts cannot be parallelized, or the memory access pattern contains collisions: multiple work-items reading and writing the same address at the same time. An unmanaged collision is undefined behavior. This is the main reason atomic functions exist: they serialize colliding accesses so that the result is well defined.
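For illustration (the kernel and buffer names here are hypothetical, not taken from the source), consider summing an array into a single cell. The plain version collides on sum[0]; the atomic version makes the same collisions safe:

```c
// RACY: every work-item performs a read-modify-write on sum[0];
// these unmanaged collisions are undefined behavior.
__kernel void sumRacy(__global const int *data, __global int *sum)
{
    sum[0] += data[get_global_id(0)];   // data race, result is undefined
}

// SAFE: atomic_add serializes the colliding updates on sum[0].
__kernel void sumAtomic(__global const int *data, __global volatile int *sum)
{
    atomic_add(&sum[0], data[get_global_id(0)]);
}
```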
OpenCL is built on a memory hierarchy that lets vendors cache data at every level. Data can be cached per work-item, per work-group, per compute unit, per kernel, and even per command queue. This can easily mislead OpenCL developers: some algorithms "look" like they are working on one device, yet the caching behavior (and with it the algorithm's result) can change across vendors and device types. Atomic functions, on the other hand, work around those caches (except at the pipeline level) and make sure the data has reached its target, whether that is a global memory area or a local one, so that other work-items in the same kernel can read the updated data with another atomic function.
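Stream compaction is a sketch of exactly this property (the names below are illustrative): each work-item receives the counter value left by the previous atomic call, never a stale cached copy, so every work-item gets a unique output slot:

```c
// Hypothetical kernel: packs the positive values of `in` into `out`.
// atomic_inc returns the counter's previous value, so each work-item
// that takes the branch reserves a unique index; the updated counter is
// seen by other work-items through their own atomic calls.
__kernel void compactPositives(__global const int *in,
                               __global int *out,
                               __global volatile int *count)
{
    int v = in[get_global_id(0)];
    if (v > 0)
    {
        int slot = atomic_inc(count);   // reserve a unique output index
        out[slot] = v;
    }
}
```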
In an OpenCL kernel, atomic functions work on pointers with the __global or __local memory qualifier. If the pointer is __global, the function makes sure global memory is updated; if it is __local, it updates local memory. Detailed information about atomic functions is documented here:
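As a sketch of the two address spaces (the kernel and parameter names are assumptions for illustration), the kernel below first counts matches in a __local counter shared by the work-group, then folds each group's result into __global memory with a second atomic:

```c
// Hypothetical kernel: counts how many elements equal `key`.
// atomic_inc on a __local int synchronizes work-items within a group;
// atomic_add on a __global int synchronizes across groups.
__kernel void countMatches(__global const int *data, int key,
                           __global volatile int *result)
{
    __local int groupCount;
    if (get_local_id(0) == 0)
        groupCount = 0;                  // one work-item initializes
    barrier(CLK_LOCAL_MEM_FENCE);

    if (data[get_global_id(0)] == key)
        atomic_inc(&groupCount);         // __local atomic

    barrier(CLK_LOCAL_MEM_FENCE);
    if (get_local_id(0) == 0)
        atomic_add(result, groupCount);  // __global atomic
}
```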
Note: Atomic functions shouldn't be used in pipelined or load-balanced kernels within the Cekirdekler API. An atomic function's scope ends with its kernel. Using multiple GPUs or enabling pipelining causes the Cekirdekler API to execute multiple kernels instead of one, which renders atomic functions futile for synchronizing across the whole workload.