Not so long ago, I wanted to make an algorithm run on a particular GPU. So I spent about a month setting up a special C++ compiler, tuning its special flags, debugging, and optimizing the GPU code. I was very pleased in the end: not only did I manage to make it work, but the resulting implementation, thanks to all the optimizations, boiled down to ~400 lines of the GPU’s native machine code.


Except, when you look at it from another angle, if I had been writing in assembly from the start, 400 lines of code in one…

Oleksandr Kaleniuk

