@bil16238531 GPT-3 model is GPT-2 but trained for longer (300B) tokens and yes on a better dataset. FineWeb is a good dataset, so you can train your own like this. It will cost ~$500. Use -b 32 -t 2048 instead to use the 2048 GPT-3 context length to be accurate.