The future of LLM evaluations resembles software testing more than benchmarks. Real-world testing looks like asking an LLM to produce Dad jokes, like this zinger: I'm reading a book about gravity & it's impossible to put down.
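Here is a minimal sketch of what such a test might look like, written like a unit test. The `generate_dad_joke` function is a hypothetical stand-in for the product's model call, & the assertions are placeholder quality checks, not a real rubric.

```python
# A hypothetical eval written like a unit test (e.g., run under pytest).
# generate_dad_joke is a placeholder for a real call into the product's LLM.

def generate_dad_joke(topic: str) -> str:
    # In a real product this would call the model; hard-coded here for illustration.
    return "I'm reading a book about gravity & it's impossible to put down."

def test_dad_joke_about_gravity():
    joke = generate_dad_joke("gravity")
    # Rule-based checks, just like asserting on a function's return value.
    assert "gravity" in joke.lower()      # stays on topic
    assert len(joke) < 200                # short enough to be a one-liner
    assert joke.rstrip().endswith(".")    # reads like a complete sentence

if __name__ == "__main__":
    test_dad_joke_about_gravity()
    print("Dad joke eval passed.")
```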
Machine learning benchmarks, like those published by Google for Gemini 2 last week, or precision and recall for classifying dog & cat photos, or the BLEU score for measuring machine translation, provide a high-level comparison of relative model performance.
But this isn’t enough for a product team to be satisfied that their LLM-enabled product will perform well in the wild.
LLMs are tricky. They don't always provide an identical answer to the same or similar input: sometimes 1 can be greater than 4. This is called non-determinism.
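A minimal sketch of non-determinism, assuming the OpenAI Python client: the same prompt is sampled several times, & the answers may differ. The model name, temperature, and prompt are placeholders.

```python
# The same prompt, sampled several times, can return different answers.
# Assumes the openai package and an API key in OPENAI_API_KEY.
from collections import Counter
from openai import OpenAI

client = OpenAI()
prompt = "Which is greater, 1 or 4? Answer with a single number."

answers = []
for _ in range(5):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",                       # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,                           # sampling on, so outputs can vary
    )
    answers.append((resp.choices[0].message.content or "").strip())

# If the model were deterministic, this Counter would have exactly one key.
print(Counter(answers))
```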
How to solve this problem?
To produce high-quality LLM products, teams will need to combine analytics with evaluation.
That combination is the key to improving performance. Analytics surface the questions users ask when using the model.
Those questions become the evaluations product teams use to determine performance. Teams then gather additional data, retrain or fine-tune the model, & release it again.
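A minimal sketch of that loop, with hypothetical names: questions surfaced by analytics become the eval set, & a pass rate tells the team whether the next release clears the bar.

```python
# Questions pulled from product analytics become the eval set the next
# model version must pass. ask_model and the logged questions are placeholders.

def ask_model(question: str) -> str:
    # Placeholder for a call into the current production model.
    return "I'm reading a book about gravity & it's impossible to put down."

# Questions surfaced by analytics (hard-coded here; in practice, queried from logs).
logged_questions = [
    "tell me a dad joke about gravity",
    "give me a pun about books",
    "make me laugh about physics",
]

def run_evals() -> float:
    passed = 0
    for question in logged_questions:
        answer = ask_model(question)
        # A simple pass/fail rule per question; real evals would be richer.
        if answer and len(answer) < 200:
            passed += 1
    return passed / len(logged_questions)

print(f"Eval pass rate: {run_evals():.0%}")
# If the rate regresses, gather more data, retrain or fine-tune, & release again.
```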
Today, evaluations are rule-based or human-in-the-loop. But in the future, other models will judge the output to ensure consistency over time.
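A minimal sketch of a model-as-judge eval, with `judge_model` standing in for a real call to a grading LLM & an illustrative rubric.

```python
# One model grades another model's output against a rubric.
# judge_model is a placeholder for a real LLM call.

JUDGE_PROMPT = (
    "You are grading a Dad joke. Score it 1-5 for groan-worthiness and "
    "reply with only the number.\n\nJoke: {joke}"
)

def judge_model(prompt: str) -> str:
    # Placeholder for a call to a judge LLM; hard-coded here for illustration.
    return "5"

def judge_joke(joke: str) -> int:
    return int(judge_model(JUDGE_PROMPT.format(joke=joke)).strip())

joke = "I'm reading a book about gravity & it's impossible to put down."
assert judge_joke(joke) >= 4, "Joke fell below the quality bar"
print("Judge approved the joke.")
```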
And the iteration wheel keeps turning, ensuring that the Dad jokes from a model really are the best.