I was trying to understand the advantages of changing the local work group size. Here are the results for a simple program that takes a random array, doubles each element, and returns the result. It also records the time needed for the whole operation.
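To make the setup concrete, here is a minimal sketch of what such a test does. It is plain Python that only emulates how a global range is split into work groups; it does not call OpenCL. The names `double_kernel` and `run_doubling` are made up for illustration, and the sketch assumes the array length is a multiple of the local size. The timings it prints say nothing about how a real device behaves.

```python
import random
import time


def double_kernel(global_id, src, dst):
    # Body of the hypothetical kernel: one work-item doubles one element.
    dst[global_id] = 2 * src[global_id]


def run_doubling(src, local_size):
    """Emulate a 1D launch with len(src) work-items in groups of local_size."""
    n = len(src)
    if local_size <= 0 or n % local_size != 0:
        # Assumption of this sketch: the global size must be a multiple
        # of the local size.
        raise ValueError("global size must be a multiple of local size")

    dst = [0.0] * n
    start = time.perf_counter()
    for group_id in range(n // local_size):
        for local_id in range(local_size):
            # A work-item's global index from its group and local index.
            global_id = group_id * local_size + local_id
            double_kernel(global_id, src, dst)
    elapsed = time.perf_counter() - start
    return dst, elapsed


if __name__ == "__main__":
    data = [random.random() for _ in range(1 << 12)]
    for local_size in (1, 4, 16, 64, 256):
        result, seconds = run_doubling(data, local_size)
        assert result == [2 * x for x in data]
        print(f"local_size={local_size:4d}  time={seconds:.6f} s")
```

On a real device the work items of a group may run in parallel, which is what changing the local size is meant to exploit; the sequential loops above only show the indexing.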
The following plot shows the results for the various devices.
I don't get a few things:
- What exactly a thread is in this context.
- Why does the speedup increase as the number of threads increases on the CPU?
- What is the difference between PoCL and Intel? I think that one of them is an implementation by Intel itself and that PoCL is an implementation of OpenCL for CPUs.
Finally, I don't understand how to abort a program that is running on a GPU.