Day 135/365 of GPU Programming
Learning more about benchmarks and evals today.
E.g. going through one of the lectures from last year's Transformers & Language Models course at Stanford and getting a better understanding of structured outputs, LLM-as-a-Judge, position/verbosity bias, quantifying factuality, tool use, failure modes, MMLU, AIME, PIQA, SWE-bench, HarmBench, etc.
Day 134/365 of GPU Programming
Spending the day reading the papers behind the benchmarks I keep seeing.
Starting with MMLU, GPQA, LongBench and NoLiMa and their different iterations (v1 vs v2, standard vs pro, etc).
Working on inference optimization the past few days made me realize I don't really know anything about benchmarks, so trying to become more aware of various benchmarks, their strengths and limitations.
Any other benchmarks I should look into more deeply?