Evaluation for generative systems is different from evaluation for traditional machine learning. Outputs are flexible, and more than one answer can be acceptable. That makes evaluation more nuanced, especially when the system is user-facing and the cost of a “confident but wrong” answer is high.
The goal is not to find a single perfect metric. The goal is to build a practical way to detect when quality is drifting, understand why it is happening, and decide what to change.
What makes evaluation hard
Evaluation is harder for a few common reasons.
- Answers are not deterministic. The same prompt can produce different wording and different levels of detail.
- “Ground truth” can be ambiguous. For many real questions, there is no single short, universally correct answer.
- Hallucinations can sound convincing. A fluent answer can still be unsupported by the source.
- Retrieval quality (in RAG) can make a good model look bad. If the context is incomplete or noisy, the best possible answer will still be limited.
What to evaluate
In practice, evaluation should cover both:
- the generation quality, and
- the retrieval quality (when context is pulled from a knowledge base).
Keeping these separate helps teams avoid guessing. If results are poor, you can determine whether the issue is mainly the answer, the retrieved context, or both.
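To make this concrete, here is a minimal sketch of scoring the two stages separately so a poor result can be traced to the right one. The function names, the recall-based retrieval score, and the 0.7 threshold are illustrative assumptions, not from any specific evaluation library.

```python
# Hypothetical sketch: score retrieval and generation separately so a bad
# result can be attributed to the right stage. Names and the threshold
# are illustrative, not from any specific library.

def retrieval_recall(retrieved_ids, relevant_ids):
    """Fraction of known-relevant documents that were actually retrieved."""
    if not relevant_ids:
        return 1.0
    hits = len(set(retrieved_ids) & set(relevant_ids))
    return hits / len(relevant_ids)

def diagnose(retrieval_score, generation_score, threshold=0.7):
    """Map the two scores to a coarse diagnosis of where quality is lost."""
    if retrieval_score < threshold and generation_score < threshold:
        return "both"
    if retrieval_score < threshold:
        return "retrieval"
    if generation_score < threshold:
        return "generation"
    return "ok"

print(retrieval_recall(["d1", "d3"], ["d1", "d2"]))  # 0.5
print(diagnose(0.5, 0.9))  # retrieval
```

In a real harness, the two scores would come from an evaluation dataset with labeled relevant documents and a judged generation metric; the point of the sketch is only the separation of concerns.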
Generation evaluation
Generation checks focus on whether the answer is supported and useful.
- Faithfulness: Does the answer match the retrieved source information, without adding unsupported claims?
- Answer relevancy: Does the answer address the question directly, without irrelevant filler or missing the point?
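The two checks above can be sketched with a crude token-overlap proxy. Production evaluations typically use an LLM judge or an entailment model instead; the helper names and the overlap heuristic here are illustrative assumptions.

```python
# Minimal sketch of generation checks using token overlap as a crude proxy
# for faithfulness and answer relevancy. Real evaluations usually rely on
# an LLM judge or NLI model; these helpers are illustrative only.

def _tokens(text):
    return set(text.lower().split())

def faithfulness_proxy(answer, context):
    """Share of answer tokens that also appear in the retrieved context.
    A low value suggests the answer adds unsupported material."""
    ans = _tokens(answer)
    if not ans:
        return 1.0
    return len(ans & _tokens(context)) / len(ans)

def relevancy_proxy(answer, question):
    """Share of question tokens echoed in the answer. A low value suggests
    the answer may be missing the point of the question."""
    q = _tokens(question)
    if not q:
        return 1.0
    return len(q & _tokens(answer)) / len(q)

context = "the invoice is due 30 days after delivery"
answer = "the invoice is due 30 days after delivery"
question = "when is the invoice due"
print(faithfulness_proxy(answer, context))  # 1.0
print(relevancy_proxy(answer, question))
```

Token overlap is intentionally simple: it will miss paraphrases and reward verbatim copying, which is exactly why stronger judges are preferred once the pipeline is in place.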
Retrieval evaluation