Evaluation for generative systems is different from evaluation for traditional machine learning. Outputs are flexible, and more than one answer can be acceptable. That makes evaluation more nuanced, especially when the system is user-facing and the cost of a “confident but wrong” answer is high.

The goal is not to find a single perfect metric. The goal is to build a practical way to detect when quality is drifting, understand why it is happening, and decide what to change.
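One practical shape for this is a recurring check over a fixed set of examples whose mean score is compared against a stored baseline. The sketch below is a minimal illustration of that idea, not a real framework: `score_fn`, the example set, and the tolerance are all hypothetical placeholders for whatever scoring a team actually uses.

```python
# Minimal sketch of a drift check over a fixed evaluation set.
# All names here are illustrative placeholders, not a real API.

def detect_drift(examples, score_fn, baseline_mean, tolerance=0.05):
    """Score each example and flag if the mean drops below baseline."""
    scores = [score_fn(ex) for ex in examples]
    mean_score = sum(scores) / len(scores)
    drifted = mean_score < baseline_mean - tolerance
    return {"mean": mean_score, "baseline": baseline_mean, "drifted": drifted}

# Usage with a toy scorer that only checks an answer is present.
examples = [
    {"question": "What is the refund window?", "answer": "30 days"},
    {"question": "Is shipping free?", "answer": "Yes, over $50"},
]
result = detect_drift(examples,
                      lambda ex: 1.0 if ex["answer"] else 0.0,
                      baseline_mean=1.0)
```

The value of the harness is less the specific scorer than the fixed example set and baseline, which make "quality is drifting" a concrete, checkable claim rather than a hunch.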

What makes evaluation hard

Evaluation is harder than in traditional supervised settings for a few common reasons:

- Outputs are open-ended, so there is rarely a single reference answer to compare against.
- More than one answer can be acceptable, which makes exact-match metrics misleading.
- Quality can drift over time as models, prompts, or data change, so a one-off score is not enough.

What to evaluate

In practice, evaluation should cover both:

- Generation quality: whether the final answer is supported by the retrieved context and useful to the user.
- Retrieval quality: whether the retrieved context actually contains the information needed to answer.

Keeping these separate helps teams avoid guessing. If results are poor, you can determine whether the issue is mainly the answer, the retrieved context, or both.
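The separation above can be sketched as two independent scores, one per stage. This is a hedged illustration with hypothetical function names and toy data: a real pipeline would use proper relevance labels and a stronger faithfulness check than token overlap.

```python
# Hypothetical sketch: score retrieval and generation separately so a
# poor result can be attributed to one stage or the other.

def retrieval_hit(retrieved_ids, relevant_ids):
    """1.0 if any known-relevant document was retrieved, else 0.0."""
    return 1.0 if set(retrieved_ids) & set(relevant_ids) else 0.0

def generation_support(answer, context):
    """Fraction of answer tokens found in the retrieved context --
    a crude lexical stand-in for a real faithfulness check."""
    normalize = lambda text: [t.strip(".,!?") for t in text.lower().split()]
    context_tokens = set(normalize(context))
    answer_tokens = normalize(answer)
    if not answer_tokens:
        return 0.0
    return sum(t in context_tokens for t in answer_tokens) / len(answer_tokens)

# Example: retrieval found the right document, but the answer is only
# half grounded in it -- the problem here is generation, not retrieval.
r = retrieval_hit(["doc1", "doc7"], ["doc1"])
g = generation_support("refunds take 90 days",
                       "Refunds are issued within 30 days.")
```

In this example retrieval scores perfectly while generation support is low, which localizes the failure to the answering stage rather than the search stage.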

Generation evaluation

Generation checks focus on whether the answer is supported and useful.
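One simple way to operationalize "supported" is to check each sentence of the answer against the retrieved context and flag sentences with little overlap. The sketch below uses lexical overlap purely for illustration; a production system might use an NLI model or an LLM judge instead. The threshold and examples are assumptions, not recommendations.

```python
# Hedged sketch of a per-sentence support check: flag any answer
# sentence with low lexical overlap against the retrieved context.
import re

def unsupported_sentences(answer, context, threshold=0.5):
    context_tokens = set(re.findall(r"\w+", context.lower()))
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        tokens = re.findall(r"\w+", sentence.lower())
        if not tokens:
            continue
        overlap = sum(t in context_tokens for t in tokens) / len(tokens)
        if overlap < threshold:
            flagged.append(sentence)
    return flagged

context = "Refunds are issued within 30 days of purchase."
answer = "Refunds are issued within 30 days. Shipping is always free."
flagged = unsupported_sentences(answer, context)
```

Here the first sentence is fully grounded in the context, while the second introduces a claim the context never made, so it is the one flagged for review.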

Retrieval evaluation