Example here is the llm.c GPT-3 (124M) training on FineWeb (figure cropped at 250B tokens), we seem to surpass GPT-3 HellaSwag (green line) at ~150B tokens, per paper expected this to be at 300B tokens. Will re-run with FineWeb-Edu.
I do want to be a bit careful on conclusions though because HellaSwag is just one eval, mostly targeting English sentences and a multiple choice of their likely continuations in "tricky" settings. It may be that the GPT-2/3 datasets were a lot broader (e.g. more multilingual than FineWeb, or a lot more math/code than FineWeb, etc.). So it's likely we want to expand the set of evals to make more confident statements and comparisons.
显示更多