Here's a simple model I wrote yesterday (it learns to estimate a similarity metric between pairs of sequences). It runs with all backends -- no code changes. Tried PT -> trains at 24ms/step on V100. Tried JAX -> trains at 10ms/step on V100.
JAX is fast ⏩
Benchmarking Mistral 7B inference on a V100 (float16, batch_size 10): the throughput of the KerasNLP implementation with JAX is over 2x higher than the Hugging Face PyTorch one (compiled).
Worth noting that this is "out of the box" performance: the KerasNLP model is not optimized for performance. It's written the way anyone would naively write a Keras 3 LLM.
A major advantage of using Keras 3 with the JAX backend: it's fast. And it's fast out of the box, without any need for careful performance optimization.
Gemma 2B inference for a single prompt runs at 116 token/s on a V100, a 3.3x increase over the HF/PT implementation.
Added a new demo to minGPT that trains a GPT on pixels of CIFAR-10 images instead of text. Quite powerful that one can run the same training code/model on both domains. Notebook: . Produced reasonable samples after ~only 30 minutes on an 8-GPU V100 node: