@jeffzyliu
Last active May 17, 2020 04:13

I was huddled over my computer, eleven hours into the assignment. I was building an n-body planetary solver on the powerful GPUs of the Dartmouth high-performance research computing cluster, and there was a bug in my code. GPU programming entails sending a small but complex set of instructions to many thousands of computing cores to execute in parallel, and there is no margin for error.

But this bug was not like the others. I had become accustomed to the usual problems: multi-dimensional indexing errors, race conditions, and the like. This one was far more elusive. Seven out of ten runs, 2,496 CUDA cores whirred in perfect unison, and the planets arrived at their rightful locations. The other three runs, the planets scattered everywhere, and the runtime fluctuated erratically. Stranger still, when I added theoretically harmless print statements to debug the code, the output changed even more. I profiled and checked my code again and again, to no avail.

Hours and hours later, as I continued to pore over my code, my tinkering finally started to return insight. When I ran a test case of just 64 planets but organized the GPU execution in a contrived, inefficient way, I could replicate the bug completely and consistently. As my bug went from random to systematic, my attention drifted from the systematic machine instructions to the random scheduling of parallel computation.

This contrived test case demonstrated what regular test cases could not: by overloading the GPU's streaming multiprocessors, I could force the blocks to run concurrently, context-switching unpredictably as if they were threads in a CPU (much like the traditional race conditions à la CS10). Each of my blocks was perfectly synchronized internally, but my code was not synchronized between blocks. CUDA doesn't natively support true block-wise synchronization within a kernel, since the overhead would be far too high, and as such, Nvidia's profilers would never catch the error.
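In code, the failure mode looked roughly like the sketch below. This is a simplified illustration with made-up names, not my actual submission; I'm assuming a standard tiled n-body kernel with each body's mass packed into the `w` component of its position. The point is the structure of the race: `__syncthreads()` keeps each block internally consistent, but nothing stops one block from writing its results back to global memory while another block is still reading them.

```cuda
#include <cuda_runtime.h>

// Simplified sketch of the buggy in-place kernel (illustrative names, not
// the actual assignment code). Each block is internally ordered by
// __syncthreads(), but nothing orders one block against another: a block
// that finishes its timestep writes back to pos[] while slower blocks are
// still reading it. Assumes n is a multiple of blockDim.x; mass is in pos.w.
__global__ void stepInPlace(float4 *pos, float4 *vel, int n, float dt) {
    extern __shared__ float4 tile[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    float4 pi = pos[i];
    float3 acc = make_float3(0.0f, 0.0f, 0.0f);

    for (int base = 0; base < n; base += blockDim.x) {
        // RACE: pos[base + threadIdx.x] may already hold another block's
        // next-timestep values by the time this block loads it.
        tile[threadIdx.x] = pos[base + threadIdx.x];
        __syncthreads();                    // synchronizes this block only
        for (int j = 0; j < blockDim.x; ++j) {
            float4 pj = tile[j];
            float dx = pj.x - pi.x, dy = pj.y - pi.y, dz = pj.z - pi.z;
            float invDist = rsqrtf(dx*dx + dy*dy + dz*dz + 1e-9f); // softened
            float s = pj.w * invDist * invDist * invDist;
            acc.x += s * dx; acc.y += s * dy; acc.z += s * dz;
        }
        __syncthreads();
    }

    // vel[i] is touched only by thread i, so updating it in place is safe;
    // the dangerous write is the one to the shared pos[] array below.
    vel[i].x += acc.x * dt; vel[i].y += acc.y * dt; vel[i].z += acc.z * dt;
    pos[i] = make_float4(pi.x + vel[i].x * dt, pi.y + vel[i].y * dt,
                         pi.z + vel[i].z * dt, pi.w);
}

// The contrived reproduction: 64 bodies spread over 64 one-thread blocks
// instead of one 64-thread block, forcing many blocks in flight at once
// so the race fires on every run instead of occasionally.
void reproduce(float4 *pos, float4 *vel, float dt) {
    stepInPlace<<<64, 1, sizeof(float4)>>>(pos, vel, 64, dt);         // fails consistently
    // stepInPlace<<<1, 64, 64 * sizeof(float4)>>>(pos, vel, 64, dt); // one block: race hidden
}
```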

Finally understanding this idea of hierarchical synchronization at multiple levels (you could call it sync-ception), I devised a strategy to sidestep the problem. By keeping two copies of the data, I could read from one copy and write to the other, eliminating every read/write conflict in the code that could not be synchronized. On the next iteration, I would read from the second copy and write back to the first. This way, each step always worked with fully updated data. I implemented the change quickly and was ecstatic to find the bug completely fixed.
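In sketch form (again with illustrative names, not the actual assignment code), the fix looks like this: the kernel treats one buffer as read-only input and the other as write-only output, and the host swaps the two pointers between launches. The launch boundary itself becomes the grid-wide synchronization point that the kernel cannot express internally. This is the classic double-buffering, or ping-pong, pattern.

```cuda
#include <cuda_runtime.h>

// Double-buffered version: posIn is read-only and posOut is write-only for
// the whole launch, so no block can ever observe another block's partial
// updates, no matter how the scheduler interleaves them. Mass is in pos.w.
__global__ void stepBuffered(const float4 *posIn, float4 *posOut,
                             float4 *vel, int n, float dt) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float4 pi = posIn[i];
    float3 acc = make_float3(0.0f, 0.0f, 0.0f);
    for (int j = 0; j < n; ++j) {
        float4 pj = posIn[j];               // always this timestep's input
        float dx = pj.x - pi.x, dy = pj.y - pi.y, dz = pj.z - pi.z;
        float invDist = rsqrtf(dx*dx + dy*dy + dz*dz + 1e-9f); // softened
        float s = pj.w * invDist * invDist * invDist;
        acc.x += s * dx; acc.y += s * dy; acc.z += s * dz;
    }
    vel[i].x += acc.x * dt; vel[i].y += acc.y * dt; vel[i].z += acc.z * dt;
    posOut[i] = make_float4(pi.x + vel[i].x * dt, pi.y + vel[i].y * dt,
                            pi.z + vel[i].z * dt, pi.w);
}

// Host side: ping-pong the two buffers. Each launch reads the previous
// step's output, so every iteration works with fully updated data.
void simulate(float4 *posA, float4 *posB, float4 *vel,
              int n, float dt, int steps) {
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    for (int s = 0; s < steps; ++s) {
        stepBuffered<<<blocks, threads>>>(posA, posB, vel, n, dt);
        float4 *tmp = posA; posA = posB; posB = tmp;  // swap read/write roles
    }
    cudaDeviceSynchronize();                          // wait for the final step
}
```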

Excited about my finding, I emailed Professor Zhu, detailing the error and my solution. It turned out that everyone else had faced the same issue, and he had been getting many confused emails about what was going on. He hadn't anticipated the problem himself and was unsure why it happened until my email clarified it for him. With my permission, he forwarded my email to the entire class, and he later told me that many other students implemented the technique I suggested and emailed him their thanks for the write-up.

CS89.25 GPU Programming and High-Performance Computing has been by far the most technically difficult course I have ever taken, but it has also been by far the most rewarding. It's the most cathartic feeling in the world to finally fix a difficult technical bug, and it's always amazing to see my code run hundreds or thousands of times faster than an equivalent CPU algorithm.

This code is included in my second code submission. :)
