When I was working as a deep learning software engineer at Intel on an AI chip project, I came across an assembler called Maxas (https://github.com/NervanaSystems/maxas) that can generate GPU machine code that outperforms NVIDIA's official GEMM library, and I became interested in it. The author…