How to evaluate a model for your use case: Emmanuel Turlay

00:00:00.000 |
Hi everyone, I'm Emmanuel, CEO of Sematic, the company behind Airtrain. Today, I want to talk 00:00:24.800 |
about a difficult problem in the language modeling space, and that is evaluation. Unlike in other 00:00:31.040 |
areas of machine learning, it is not so straightforward to evaluate language models 00:00:35.680 |
for a specific use case. There are metrics and benchmarks, but they mostly apply to generic 00:00:42.000 |
tasks, and there is no one-size-fits-all process to evaluate the performance of a model for a 00:00:47.040 |
particular use case. So first, let's get the basics out of the way. What is model evaluation? 00:00:54.800 |
Model evaluation is the statistical measurement of the performance of a machine learning model. 00:01:00.160 |
How well does a model perform on a particular use case, measured on a large dataset independent 00:01:05.760 |
from the training dataset? Model evaluation usually comes right after training or fine-tuning and is a 00:01:12.880 |
crucial part of model development. All ML teams dedicate large resources to establish rigorous 00:01:18.800 |
evaluation procedures. You need to set up a solid evaluation process as part of your development 00:01:24.640 |
workflow to guarantee performance and safety. You can compare evaluation to running a test suite in your 00:01:30.880 |
continuous integration pipeline. In traditional supervised machine learning, there is a whole host 00:01:36.720 |
of well-defined metrics to clearly grade a model's performance. For example, for regression, we have 00:01:44.000 |
the root mean squared error or the mean absolute error. For classifiers, people usually use precision, recall, 00:01:53.120 |
or the F1 score, and so on. In computer vision, a popular metric is the intersection over union (IoU). 00:02:01.120 |
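As a quick illustration (not from the talk), these traditional metrics are readily available in scikit-learn; the tiny hard-coded labels below are invented for the example:

```python
# Illustrative sketch: classic supervised-learning metrics with
# scikit-learn. The hard-coded labels are invented for the example.
from sklearn.metrics import (
    mean_squared_error,
    mean_absolute_error,
    precision_score,
    recall_score,
    f1_score,
)

# Regression: compare predictions against ground-truth values.
y_true_reg = [3.0, 5.0, 2.5, 7.0]
y_pred_reg = [2.8, 5.4, 2.1, 7.3]
rmse = mean_squared_error(y_true_reg, y_pred_reg) ** 0.5
mae = mean_absolute_error(y_true_reg, y_pred_reg)

# Binary classification: precision, recall, and F1.
y_true_cls = [1, 0, 1, 1, 0, 1]
y_pred_cls = [1, 0, 0, 1, 1, 1]
precision = precision_score(y_true_cls, y_pred_cls)
recall = recall_score(y_true_cls, y_pred_cls)
f1 = f1_score(y_true_cls, y_pred_cls)

print(f"RMSE={rmse:.3f}  MAE={mae:.3f}")
print(f"precision={precision:.3f}  recall={recall:.3f}  F1={f1:.3f}")
```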
So what metrics are available to score language models? Well, unlike other types of models returning 00:02:08.000 |
structured outputs such as a number, a class, or a bounding box, language models generate text, 00:02:14.640 |
which is very unstructured. An inference that is different from the ground truth reference is not 00:02:19.920 |
necessarily incorrect. Depending on whether you have access to labeled references, there are a number of 00:02:26.160 |
metrics you can use. For example, BLEU is a precision-based metric. It measures the overlap 00:02:32.720 |
of n-grams, that is, sequences of tokens, between the generated text and the reference. 00:02:38.400 |
It's a common metric to evaluate translation between two languages and can also be used to score 00:02:44.480 |
summarization. It can definitely serve as a good benchmark, but it is not a safe indicator of how a 00:02:50.800 |
model will perform on your particular task. For example, it does not take into account intelligibility 00:02:56.960 |
or grammatical correctness. ROUGE is a set of evaluation metrics that focuses on measuring the 00:03:03.200 |
recall of sequences of tokens between the reference and the inference. It is mostly useful for evaluating 00:03:11.120 |
summarization. If you don't have access to labeled references, you can use other standalone metrics. 00:03:18.560 |
For example, density quantifies how well the summary can be described as a series of fragments pulled from the text, 00:03:24.000 |
and coverage quantifies the extent to which a summary is derivative of a text. As you can see, 00:03:30.960 |
these metrics are only useful to score certain high-level tasks such as translation and summarization. 00:03:37.440 |
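To make the reference-based metrics concrete, here is a minimal sketch assuming the nltk and rouge-score packages; the two example sentences are invented:

```python
# Illustrative sketch: BLEU and ROUGE against a labeled reference,
# assuming nltk and rouge-score are installed (pip install nltk rouge-score).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "the quick brown fox jumps over the lazy dog"
candidate = "a quick brown fox jumped over the lazy dog"

# BLEU: n-gram precision of the candidate against the reference(s).
# Smoothing avoids a zero score when a higher-order n-gram never matches.
bleu = sentence_bleu(
    [reference.split()],
    candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE: how much of the reference is recalled by the candidate.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU: {bleu:.3f}")
print(f"ROUGE-1 recall: {rouge['rouge1'].recall:.3f}")
print(f"ROUGE-L recall: {rouge['rougeL'].recall:.3f}")
```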
There are also a number of benchmarks and leaderboards that rank various models. Benchmarks are standardized 00:03:45.600 |
tests that score model performance for certain tasks. For example, GLUE, or General Language Understanding 00:03:53.200 |
Evaluation, is a common benchmark to evaluate how well a model understands language through a series of nine 00:04:00.000 |
tasks, for example, paraphrase detection and sentiment analysis. HellaSwag measures natural language inference, 00:04:09.760 |
which is the ability of a model to apply common sense and find the most plausible ending to a sentence. 00:04:15.520 |
In the example shown, answer C is the most reasonable choice. There are other benchmarks, such as TriviaQA, 00:04:23.680 |
which asks hundreds of thousands of trivia questions sourced from Wikipedia and the web, and tests the knowledge 00:04:29.360 |
of the model. ARC, the AI2 Reasoning Challenge, tests a model's ability to reason about grade-school science questions. 00:04:35.760 |
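As an aside (not shown in the talk), you can peek at what a HellaSwag item looks like with the Hugging Face datasets package; the Rowan/hellaswag dataset name is an assumption about where the data is mirrored:

```python
# Illustrative sketch: inspecting one HellaSwag item with the Hugging Face
# datasets package (pip install datasets). The "Rowan/hellaswag" dataset
# name is an assumption about the hosted mirror.
from datasets import load_dataset

item = load_dataset("Rowan/hellaswag", split="validation")[0]

print(item["ctx"])                    # the context to complete
for i, ending in enumerate(item["endings"]):
    print(i, ending)                  # the four candidate endings
print("label:", item["label"])        # index of the most plausible ending
```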
And there are dozens more benchmarks out there. All these metrics and benchmarks are very useful to draw a 00:04:42.800 |
landscape of how LLMs compare to one another. But they do not tell you how they perform for your particular 00:04:49.360 |
task on the type of input data that your application will feed them. For example, if you're trying to extract symptoms 00:04:57.120 |
from a doctor's notes, or extract ingredients from a recipe, or form a JSON payload to query an API, 00:05:04.560 |
these metrics will not tell you how each model performs. So each application needs to come up with 00:05:11.120 |
its own evaluation procedure, which is a lot of work. There is one magic trick, though: you can use 00:05:18.880 |
another model to grade the output of your model. You can describe to an LLM what you're trying to accomplish 00:05:25.760 |
and what the grading criteria are, and ask it to grade the output of another LLM on a numerical scale. 00:05:32.160 |
Essentially, you are crafting your own specialized metrics for your own application. 00:05:38.640 |
Here's an example of how it works. You can feed your evaluation data set to the model you want 00:05:44.000 |
to evaluate, which is going to generate the inferences that you want to score. 00:05:47.120 |
Then, you can include those inferences inside a broader scoring prompt in which you've described 00:05:54.720 |
the task you're trying to accomplish and the properties you're trying to grade. And also, 00:05:58.800 |
you describe the scale across which it should be graded. For example, from 1 to 10. Then, 00:06:04.400 |
you pass this scoring prompt to a scoring model, which is going to generate a number, a score, for 00:06:10.240 |
the actual inference. If you do this on all the inferences generated from your evaluation dataset, 00:06:16.160 |
you can draw a distribution of that particular metric. 00:06:22.000 |
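Here is a minimal sketch of that judge pattern (not Airtrain's implementation), assuming the openai Python package (v1+) with an API key in the environment; the prompt wording, task description, and helper name are illustrative:

```python
# Illustrative sketch of LLM-as-judge scoring, assuming the openai
# package (>= 1.0) and OPENAI_API_KEY set in the environment.
# The prompt wording and the score_inference helper are hypothetical.
from openai import OpenAI

client = OpenAI()

SCORING_PROMPT = """\
You are grading the outputs of another language model.
Task being evaluated: {task}
Property to grade: {prop}
Rate the output below on a scale from 1 (worst) to 10 (best).
Reply with a single integer and nothing else.

Output to grade:
{inference}
"""

def score_inference(inference: str) -> int:
    prompt = SCORING_PROMPT.format(
        task="write closing words for a professional email",
        prop="politeness",
        inference=inference,
    )
    response = client.chat.completions.create(
        model="gpt-4",   # the judge model; GPT-4 per the talk
        messages=[{"role": "user", "content": prompt}],
        temperature=0,   # deterministic grading
    )
    return int(response.choices[0].message.content.strip())

# Scoring every inference in the evaluation set yields a distribution.
scores = [score_inference(s) for s in
          ["Please let us know at your earliest convenience.",
           "Tell me ASAP."]]
print(scores)
```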
For example, here is a small set of closing words generated for professional emails. We want to evaluate their politeness. We can prompt a model to 00:06:28.400 |
score the politeness of each statement from 1 to 10. For example, "Please let us know at your earliest 00:06:34.800 |
convenience" scores highly, while "Tell me ASAP" scores poorly. We found that the best grading model at 00:06:42.400 |
this time is still GPT-4, but it can be quite costly to use to score large datasets. We have found that 00:06:48.880 |
FLAN-T5 offers a good trade-off of speed and correctness. Airtrain was designed specifically 00:06:55.200 |
for this purpose. With Airtrain, you can upload your dataset, select the models you want to compare, 00:07:01.040 |
describe the properties you want to measure, and visualize metric distribution across your entire 00:07:06.400 |
dataset. You can compare Llama 2 with Falcon, FLAN-T5, or even your own model. Then, you can make an 00:07:13.440 |
educated decision based on statistical evidence. Sign up today for early access at Airtrain.ai and start 00:07:20.240 |
making data-driven decisions about your choice of LLM. Thanks. Goodbye.