Stop testing and rewriting prompts manually!
Most teams run evals, look at failures, guess what's wrong, rewrite the prompt, then repeat. It's slow and you never know if your rewrite actually fixes the root issue.
The better way is evolutionary optimization.
Instead of manual rewrites, an evolutionary algorithm analyzes eval feedback and rewrites prompts automatically. It maintains a diverse pool of prompt candidates that excel at different problem types, not just one "best" version.
DeepEval does this using GEPA (Genetic-Pareto): genetic prompt evolution guided by Pareto-based selection.
You provide a prompt template, test cases, and metrics to optimize for. The optimizer handles the rest.
Here's how it works:
It splits your test cases into a validation set and a feedback set. The validation set scores every candidate on the same cases, so comparisons are fair. The feedback set supplies the training signal that drives mutations.
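In plain Python, that split might look like this. A minimal sketch only: the 30% feedback fraction and the fixed seed are my assumptions for illustration, not DeepEval's defaults.

```python
import random

def split_test_cases(test_cases, feedback_fraction=0.3, seed=0):
    """Split eval cases: validation scores every candidate; feedback drives mutations."""
    cases = list(test_cases)
    random.Random(seed).shuffle(cases)        # deterministic shuffle for a fair split
    cut = int(len(cases) * feedback_fraction)
    return cases[cut:], cases[:cut]           # (validation_set, feedback_set)

validation_set, feedback_set = split_test_cases(range(10))
```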
Then it starts evolving. It selects a parent prompt, runs it on a minibatch of test cases, collects metric feedback on what failed, and uses an LLM to rewrite the prompt addressing those issues.
If the rewritten prompt scores better, it gets added to the candidate pool. After several iterations, it returns the highest-scoring prompt.
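The whole loop above can be sketched in a few lines. This is my illustration of the described algorithm, not DeepEval's implementation: `score` and `rewrite` are hypothetical stand-ins (a real run would call your metrics and an LLM mutator).

```python
import random

def evolve_prompt(seed_prompt, validation_set, feedback_set,
                  score, rewrite, iterations=10, minibatch_size=4, seed=0):
    # score(prompt, cases) -> float in [0, 1]; rewrite(prompt, failures) -> new prompt
    rng = random.Random(seed)
    pool = [(score(seed_prompt, validation_set), seed_prompt)]
    for _ in range(iterations):
        parent_score, parent = max(pool)              # select a strong parent prompt
        batch = rng.sample(list(feedback_set),
                           min(minibatch_size, len(feedback_set)))
        failures = [c for c in batch if score(parent, [c]) < 1.0]  # metric feedback
        child = rewrite(parent, failures)             # an LLM rewrite stands in here
        child_score = score(child, validation_set)    # fair scoring on validation set
        if child_score > parent_score:                # keep only improvements
            pool.append((child_score, child))
    return max(pool)[1]                               # highest-scoring prompt wins

# Toy stand-ins: score = fraction of case keywords present in the prompt;
# rewrite = append the first failing keyword (a real run would call an LLM).
cases = ["be concise", "cite sources", "use json"]
toy_score = lambda prompt, cs: sum(c in prompt for c in cs) / len(cs)
toy_rewrite = lambda prompt, fails: prompt + " " + fails[0] if fails else prompt
best = evolve_prompt("Answer the question.", cases, cases, toy_score, toy_rewrite)
```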
Key capabilities:
• Works with 50+ built-in metrics - answer relevancy, hallucination, bias, task completion, and more.
• Supports multi-objective optimization - optimize for multiple metrics simultaneously without forcing tradeoffs.
• Configurable iterations and minibatch sizes - control search thoroughness and compute cost.
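The multi-objective part rests on Pareto dominance: a candidate survives unless some other candidate beats it on every metric at once, so specialists on different metrics coexist instead of being averaged away. A minimal sketch of that selection idea (not DeepEval's code):

```python
def dominates(a, b):
    """a Pareto-dominates b: >= on every metric and strictly > on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates):
    """Keep candidates no other candidate dominates; each wins on some metric mix."""
    return [c for c in candidates
            if not any(dominates(o, c) for o in candidates if o is not c)]

# Scores as (relevancy, faithfulness): the first two trade off, so both survive;
# the third is beaten on both metrics by (0.6, 0.8) and is dropped.
front = pareto_front([(0.9, 0.4), (0.6, 0.8), (0.5, 0.5)])
# -> [(0.9, 0.4), (0.6, 0.8)]
```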
The best part?
It's 100% open source.
Link to DeepEval in the comments!