There is an alternate reality where Cray took their vector supercomputers, ditched FP64 calculations, and went with one FP32 pipe and a BF16 tensor core pipe. The same instruction set, memory architecture, and vector registers would have made a sweet deep learning machine, in many ways nicer than SIMT CUDA programming on GPUs. A Y-MP class machine like that could have delivered the AlexNet and DQN moments two decades earlier.
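To make the programming-model contrast concrete, here is a minimal sketch (my own illustration; VLEN and dense_matvec are invented names, not Cray library code, and float stands in for the post's hypothetical FP32 pipe) of a dense-layer matrix-vector product shaped the way a Cray-style vectorizing compiler would consume it. The inner loop is strip-mined into 64-element slices to match the machine's vector register length, and each slice becomes one chained multiply-add pass through the vector pipes, with no thread blocks, shared-memory tiling, or warp divergence to reason about.

    #include <stddef.h>

    #define VLEN 64  /* Cray vector registers hold 64 elements */

    /* Hypothetical sketch, not real Cray code: y = W x for one dense layer.
     * The compiler turns each VLEN-sized slice of the inner loop into a
     * single pass through the chained multiply and add vector pipes. */
    void dense_matvec(const float *w, const float *x, float *y,
                      size_t rows, size_t cols)
    {
        for (size_t r = 0; r < rows; r++) {
            float acc = 0.0f;
            /* Strip-mine the dot product into vector-register-sized slices */
            for (size_t c0 = 0; c0 < cols; c0 += VLEN) {
                size_t n = (cols - c0 < VLEN) ? cols - c0 : VLEN;
                for (size_t c = 0; c < n; c++)   /* the vectorizable unit */
                    acc += w[r * cols + c0 + c] * x[c0 + c];
            }
            y[r] = acc;
        }
    }

The point is that the scalar-looking loop is the whole programming model; on the real hardware the compiler would also overlap the memory loads with the arithmetic.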
Even doing everything in FP64 with no architectural changes, a Cray-1 would have been the best machine in the world for neural networks. If @geoffreyhinton had had access to one for his early research, the case could have been made for architectural modifications that would 10x the performance.
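For a rough sense of scale (a back-of-envelope I am adding from published Cray-1 specs, not numbers from the post): with the add and multiply pipes chained, the Cray-1 retired about 2 FLOP per cycle at its 80 MHz clock, so

    2 FLOP/cycle x 80 MHz = ~160 MFLOPS peak in FP64

versus well under 1 MFLOPS for the minicomputers most labs actually ran on. Even unmodified, that is two to three orders of magnitude more arithmetic for the dense multiply-accumulate loops at the heart of backprop.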