Not so long ago, I wanted to make an algorithm run on a particular GPU. So I spent about a month setting up a special C++ compiler, tuning its special flags, debugging, and optimizing the GPU code. I was very pleased in the end, since not only did I manage to make it work, but the resulting implementation, thanks to all the optimizations, boiled down to ~400 lines of the GPU's native machine code.


Except, when you look at it from another angle, if I had been writing in assembly from the start, 400 lines of code in one…

Oleksandr Kaleniuk
