
How to evaluate a model for your use case: Emmanuel Turlay


Transcript

Hi everyone, I'm Emmanuel, CEO of Sematic, the company behind Airtrain. Today, I want to talk about a difficult problem in the language modeling space, and that is evaluation. Unlike in other areas of machine learning, it is not so straightforward to evaluate language models for a specific use case. There are metrics and benchmarks, but they mostly apply to generic tasks, and there is no one-size-fits-all process to evaluate the performance of a model for a particular use case.

So first, let's get the basics out of the way. What is model evaluation? Model evaluation is the statistical measurement of the performance of a machine learning model. How well does a model perform on a particular use case, measured on a large dataset independent from the training dataset? Model evaluation usually comes right after training or fine-tuning and is a crucial part of model development.

All ML teams dedicate significant resources to establishing rigorous evaluation procedures. You need to set up a solid evaluation process as part of your development workflow to guarantee performance and safety. You can compare evaluation to running a test suite in your continuous integration pipeline. In traditional supervised machine learning, there is a whole host of well-defined metrics to clearly grade a model's performance.

For example, for regressions, we have the root mean squared error or the mean absolute error. For classifiers, people usually use precision, recall, or F1 score, and so on. In computer vision, a popular metric is the intersection over union. So what metrics are available to score language models? Well, unlike other types of models that return structured outputs such as a number, a class, or a bounding box, language models generate text, which is very unstructured.
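For contrast with the structured case, here is what a couple of those traditional metrics look like in practice: a minimal sketch using scikit-learn with toy values chosen purely for illustration.

```python
# Toy illustration of classic supervised-learning metrics with scikit-learn.
from sklearn.metrics import (
    mean_squared_error,
    mean_absolute_error,
    precision_score,
    recall_score,
    f1_score,
)

# Regression: ground-truth values vs. model predictions (made-up numbers).
y_true_reg = [3.0, 2.5, 4.0, 5.1]
y_pred_reg = [2.8, 2.7, 3.6, 5.4]
rmse = mean_squared_error(y_true_reg, y_pred_reg) ** 0.5
mae = mean_absolute_error(y_true_reg, y_pred_reg)

# Binary classification: ground-truth classes vs. predicted classes (made-up).
y_true_cls = [1, 0, 1, 1, 0, 1]
y_pred_cls = [1, 0, 0, 1, 0, 1]
precision = precision_score(y_true_cls, y_pred_cls)
recall = recall_score(y_true_cls, y_pred_cls)
f1 = f1_score(y_true_cls, y_pred_cls)

print(f"RMSE={rmse:.3f}  MAE={mae:.3f}")
print(f"precision={precision:.2f}  recall={recall:.2f}  F1={f1:.2f}")
```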

An inference that is different from the ground-truth reference is not necessarily incorrect. Depending on whether you have access to labeled references, there are a number of metrics you can use. For example, BLEU is a precision-based metric. It measures the overlap of n-grams, that is, sequences of tokens, between the generated text and the reference.

It's a common metric to evaluate translation between two languages and can also be used to score summarization. It can definitely serve as a good benchmark, but it is not a safe indicator of how a model will perform on your particular task. For example, it does not take into account intelligibility or grammatical correctness.
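To make the n-gram overlap idea concrete, here is a minimal BLEU sketch; the sacrebleu package is one possible choice of library, and the sentences are toy examples.

```python
# Toy BLEU computation: n-gram overlap between generated text and a reference.
import sacrebleu

hypotheses = ["the model generated this short summary of the document"]
references = [["this is the reference summary of the document"]]  # one reference per hypothesis

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}")  # 0-100; higher means more n-gram overlap
```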

ROUGE is a set of evaluation metrics that focuses on measuring the recall of sequences of tokens between the reference and the inference. It is mostly useful for evaluating summarization. If you don't have access to labeled references, you can use other standalone metrics. For example, density quantifies how well the summary represents pulled fragments from the text, and coverage quantifies the extent to which a summary is derivative of a text.
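Similarly, here is a minimal ROUGE sketch; the rouge_score package is again one possible choice of library, and the sentences are toy examples.

```python
# Toy ROUGE computation: recall-oriented overlap between a reference and a summary.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
reference = "the quarterly report shows revenue grew ten percent year over year"
summary = "revenue grew ten percent according to the quarterly report"

scores = scorer.score(reference, summary)
for name, score in scores.items():
    print(f"{name}: precision={score.precision:.2f} "
          f"recall={score.recall:.2f} f1={score.fmeasure:.2f}")
```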

As you can see, these metrics are only useful to score certain high-level tasks such as translation and summarization. There are also a number of benchmarks and leaderboards that rank various models. Benchmarks are standardized tests that score model performance on certain tasks. For example, GLUE, or General Language Understanding Evaluation, is a common benchmark to evaluate how well a model understands language through a series of nine tasks.

For example, paraphrase detection and sentiment analysis. HellaSwag measures natural language inference, which is the ability of a model to apply common sense and find the most plausible ending to a sentence. In this case, answer C is the most reasonable choice. There are other benchmarks such as TriviaQA, which asks almost a million trivia questions from Wikipedia and other sources and tests the knowledge of the model.
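If you want to poke at these benchmarks yourself, datasets like HellaSwag are available on the Hugging Face Hub. This is a minimal sketch assuming the datasets library and the "hellaswag" dataset name; field names may differ between dataset versions.

```python
# Minimal sketch: load the HellaSwag benchmark and inspect one multiple-choice item.
from datasets import load_dataset

hellaswag = load_dataset("hellaswag", split="validation")
example = hellaswag[0]

print(example["ctx"])  # the context the model must complete
for i, ending in enumerate(example["endings"]):
    print(f"  ({chr(65 + i)}) {ending}")
print("gold answer:", example["label"])  # index of the most plausible ending
```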

Also, ARC tests models' ability to reason about grade-school-level science questions. And there are dozens more benchmarks out there. All these metrics and benchmarks are very useful to draw a landscape of how LLMs compare to one another. But they do not tell you how they perform for your particular task, on the type of input data that will be fed to them by your application.

For example, if you're trying to extract symptoms from a doctor's notes, or extract ingredients from a recipe, or form a JSON payload to query an API, these metrics will not tell you how each model performs. So each application needs to come up with its own evaluation procedure, which is a lot of work.

There is one magic trick, though: you can use another model to grade the output of your model. You can describe to an LLM what you're trying to accomplish and what the grading criteria are, and ask it to grade the output of another LLM on a numerical scale. Essentially, you are crafting your own specialized metrics for your own application.

Here's an example of how it works. You can feed your evaluation data set to the model you want to evaluate, which is going to generate the inferences that you want to score. Then, you can include those inferences inside a broader scoring prompt in which you've described the task you're trying to accomplish and the properties you're trying to grade.

And also, you describe the scale across which it should be graded. For example, from 1 to 10. Then, you pass this scoring prompt to a scoring model, which is going to generate a number - a score - to score the actual inference. If you do this on all the inferences generated from your evaluation data set, you can draw a distribution of that particular metric.
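Here is a minimal sketch of that scoring loop, using the OpenAI chat completions API as the scoring model; the prompt wording, the example task, and the score_inference helper are purely illustrative assumptions, not a reference implementation.

```python
# Minimal sketch of model-graded evaluation: embed each inference in a scoring
# prompt and ask a grading model for a score on a 1-to-10 scale.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical task description and grading criteria -- adapt them to your use case.
SCORING_PROMPT = """You are grading the output of another language model.
The task was: reply to a customer support ticket.
Grade the HELPFULNESS of the output below on a scale from 1 (useless)
to 10 (fully resolves the issue). Answer with a single integer only.

Output to grade:
{inference}
"""

def score_inference(inference: str) -> int:
    """Ask the grading model (GPT-4 here) for a 1-to-10 score of one inference."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": SCORING_PROMPT.format(inference=inference)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())

# Score every inference generated from the evaluation dataset.
inferences = [
    "Sure, I have reset your password and emailed you a new login link.",
    "Not my problem.",
]
scores = [score_inference(text) for text in inferences]
print(scores)  # collect these across the whole evaluation set to draw a distribution
```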

For example, here is a small set of closing words generated for professional emails. We want to evaluate their politeness. We can prompt a model to score the politeness of each statement from 1 to 10. For example, "Please let us know at your earliest convenience" scores highly, while "Tell me ASAP" will score poorly. We found that the best grading model at this time is still GPT-4, but it can be quite costly to use to score large datasets.
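To turn the per-example scores into the distribution mentioned above, a short aggregation step is enough; this sketch uses made-up politeness scores and matplotlib for the histogram.

```python
# Aggregate per-example politeness scores (1-10) into a distribution.
import matplotlib.pyplot as plt

# Hypothetical scores returned by the grading model for a small evaluation set.
politeness_scores = [9, 3, 8, 7, 10, 2, 6, 9, 8, 5]

print(f"mean politeness: {sum(politeness_scores) / len(politeness_scores):.1f}")

plt.hist(politeness_scores, bins=range(1, 12), align="left", rwidth=0.8)
plt.xlabel("politeness score (1-10)")
plt.ylabel("number of inferences")
plt.title("Politeness distribution across the evaluation set")
plt.show()
```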

We have found that FLAN-T5 offers a good trade-off between speed and correctness. Airtrain was designed specifically for this purpose. With Airtrain, you can upload your dataset, select the models you want to compare, describe the properties you want to measure, and visualize metric distributions across your entire dataset. You can compare Llama 2 with Falcon, FLAN-T5, or even your own model.

Then, you can make an educated decision based on statistical evidence. Sign up today for early access at Airtrain.ai and start making data-driven decisions about your choice of LLM. Thanks. Goodbye.