2/📊 Key Results (held-out test set, n=300)
✅ DeBERTa Judge: Pearson 0.747 (95% CI [0.663, 0.816]) → outperforms all reference-based evaluators in our prior framework (best: 0.629)
✅ Reference-free composite score: Pearson 0.645 → matches the best reference-based single evaluator, with zero reference answers
✅ Cascade + online weight calibration: saves 72.7% of evaluation cost
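
For anyone who wants to sanity-check numbers like the Pearson + 95% CI above on their own data, here is a minimal sketch. Assumptions: the thread does not say how the interval was computed, so the percentile bootstrap here is a stand-in; `pearson`, `bootstrap_ci`, and the synthetic `human` / `judge` scores are hypothetical names and data for illustration, not our actual pipeline or results.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def bootstrap_ci(xs, ys, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for Pearson r (assumed method, see lead-in)."""
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    for _ in range(n_boot):
        # Resample (x, y) pairs with replacement, keeping pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(pearson([xs[i] for i in idx], [ys[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Synthetic stand-in for a held-out set of 300 (human score, judge score) pairs.
data_rng = random.Random(42)
human = [data_rng.gauss(0, 1) for _ in range(300)]
judge = [h + data_rng.gauss(0, 0.9) for h in human]

r = pearson(human, judge)
lo, hi = bootstrap_ci(human, judge)
print(f"Pearson r = {r:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Resampling whole pairs (rather than each list independently) is what preserves the correlation structure; a Fisher z interval would be a cheaper alternative if the scores are roughly bivariate normal.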