
How to evaluate a model for your use case: Emmanuel Turlay



00:00:00.000 | Hi everyone, I'm Emmanuel, CEO of Sematic, the company behind Airtrain. Today, I want to talk
00:00:24.800 | about a difficult problem in the language modeling space, and that is evaluation. Unlike in other
00:00:31.040 | areas of machine learning, it is not so straightforward to evaluate language models
00:00:35.680 | for a specific use case. There are metrics and benchmarks, but they mostly apply to generic
00:00:42.000 | tasks, and there is no one-size-fits-all process to evaluate the performance of a model for a
00:00:47.040 | particular use case. So first, let's get the basics out of the way. What is model evaluation?
00:00:54.800 | Model evaluation is the statistical measurement of the performance of a machine learning model.
00:01:00.160 | How well does a model perform on a particular use case, measured on a large dataset independent
00:01:05.760 | from the training dataset? Model evaluation usually comes right after training or fine-tuning and is a
00:01:12.880 | crucial part of model development. All ML teams dedicate large resources to establish rigorous
00:01:18.800 | evaluation procedures. You need to set up a solid evaluation process as part of your development
00:01:24.640 | workflow to guarantee performance and safety. You can compare evaluation to running a test suite in your
00:01:30.880 | continuous integration pipeline. In traditional supervised machine learning, there is a whole host
00:01:36.720 | of well-defined metrics to clearly grade a model's performance. For example, for regressions, we have
00:01:44.000 | the root mean squared error or the mean absolute error. For classifiers, people usually use precision, recall,
00:01:53.120 | or F1 score, and so on. In computer vision, a popular metric is the intersection over union.
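
As a quick illustration of these classic metrics, here is a minimal sketch using scikit-learn on toy, made-up predictions:

```python
# Toy illustration of the classic supervised-learning metrics mentioned above.
from sklearn.metrics import (
    mean_squared_error,
    mean_absolute_error,
    precision_score,
    recall_score,
    f1_score,
)

# Regression: ground truth vs. model predictions (made-up numbers).
y_true_reg = [3.0, 5.0, 2.5, 7.0]
y_pred_reg = [2.8, 5.4, 2.9, 6.1]
rmse = mean_squared_error(y_true_reg, y_pred_reg) ** 0.5
mae = mean_absolute_error(y_true_reg, y_pred_reg)

# Binary classification: ground truth vs. predicted labels (made-up labels).
y_true_cls = [1, 0, 1, 1, 0, 1]
y_pred_cls = [1, 0, 0, 1, 0, 1]
precision = precision_score(y_true_cls, y_pred_cls)
recall = recall_score(y_true_cls, y_pred_cls)
f1 = f1_score(y_true_cls, y_pred_cls)

print(f"RMSE={rmse:.3f} MAE={mae:.3f}")
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```
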
00:02:01.120 | So what metrics are available to score language models? Well, unlike other types of models returning
00:02:08.000 | structured outputs such as a number, a class, or a bounding box, language models generate text,
00:02:14.640 | which is very unstructured. An inference that is different from the ground truth reference is not
00:02:19.920 | necessarily incorrect. Depending on whether you have access to labeled references, there are a number of
00:02:26.160 | metrics you can use. For example, BLEU is a precision-based metric. It measures the overlap
00:02:32.720 | between n-grams, that is, sequences of tokens, in the generated text and the reference.
00:02:38.400 | It's a common metric to evaluate translation between two languages and can also be used to score
00:02:44.480 | summarization. It can definitely serve as a good benchmark, but it is not a safe indicator of how a
00:02:50.800 | model will perform on your particular task. For example, it does not take into account intelligibility
00:02:56.960 | or grammatical correctness. ROUGE is a set of evaluation metrics that focuses on measuring the
00:03:03.200 | recall of sequences of tokens between the references and the inference. It is mostly useful for evaluating
00:03:11.120 | summarization. If you don't have access to labeled references, you can use other standalone metrics.
00:03:18.560 | For example, density quantifies how well the summary represents pulled fragments from the text,
00:03:24.000 | and coverage quantifies the extent to which a summary is derivative of a text. As you can see,
00:03:30.960 | these metrics are only useful to score certain high-level tasks such as translation and summarization.
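
To make this concrete, here is a rough sketch of computing BLEU and ROUGE for one generated summary against a labeled reference, assuming the third-party sacrebleu and rouge-score packages; the texts are made up:

```python
# Rough sketch: scoring a generated summary against a labeled reference.
import sacrebleu
from rouge_score import rouge_scorer

reference = "The cat sat on the mat and slept all afternoon."
generated = "The cat slept on the mat for the whole afternoon."

# BLEU: n-gram precision of the generated text against the reference.
bleu = sacrebleu.corpus_bleu([generated], [[reference]])
print(f"BLEU: {bleu.score:.1f}")

# ROUGE: n-gram / longest-common-subsequence recall against the reference.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, generated)
print(f"ROUGE-1 recall: {rouge['rouge1'].recall:.2f}")
print(f"ROUGE-L recall: {rouge['rougeL'].recall:.2f}")
```
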
00:03:37.440 | There are also a number of benchmarks and leaderboards that rank various models. Benchmarks are standardized
00:03:45.600 | tests that score model performance for certain tasks. For example, GLUE, or General Language Understanding
00:03:53.200 | Evaluation, is a common benchmark to evaluate how well a model understands language through a series of nine
00:04:00.000 | tasks, for example paraphrase detection and sentiment analysis. HellaSwag measures natural language inference,
00:04:09.760 | which is the ability for a model to have common sense and find the most plausible end to a sentence.
00:04:15.520 | In this case, answer C is the most reasonable choice. There are other benchmarks such as TriviaQA,
00:04:23.680 | which asks almost a million trivia questions from Wikipedia and other sources and tests the knowledge
00:04:29.360 | of the model. Also, ARC tests a model's ability to reason about grade-school-level science questions.
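
To give a feel for how a HellaSwag-style benchmark item gets scored, here is a rough sketch using a small Hugging Face causal language model; the context, candidate endings, and the choice of gpt2 are made-up assumptions, and real harnesses score only the ending tokens rather than the full sequence:

```python
# Rough sketch of HellaSwag-style scoring: the model "chooses" the ending to
# which it assigns the lowest language-modeling loss (highest likelihood).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any causal LM works; gpt2 is just a small example
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

context = "She put the kettle on the stove and waited for the water to"
endings = ["boil.", "fly away.", "apologize."]

losses = []
for ending in endings:
    inputs = tokenizer(context + " " + ending, return_tensors="pt")
    with torch.no_grad():
        # Passing labels=input_ids makes the model return its own LM loss.
        outputs = model(**inputs, labels=inputs["input_ids"])
    losses.append(outputs.loss.item())

best = min(range(len(endings)), key=lambda i: losses[i])
print(f"Most plausible ending: {endings[best]!r}")
```
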
00:04:35.760 | And there are dozens more benchmarks out there. All these metrics and benchmarks are very useful to draw a
00:04:42.800 | landscape of how LLMs compare to one another. But they do not tell you how they perform for your particular
00:04:49.360 | task on the type of input data that your application will feed them. For example, if you're trying to extract symptoms
00:04:57.120 | from a doctor's notes, or extract ingredients from a recipe, or form a JSON payload to query an API,
00:05:04.560 | these metrics will not tell you how each model performs. So each application needs to come up with
00:05:11.120 | its own evaluation procedure, which is a lot of work. There is one magic trick though. You can use
00:05:18.880 | another model to grade the output of your model. You can describe to an LLM what you're trying to accomplish
00:05:25.760 | and what the grading criteria are, and ask it to grade the output of another LLM on a numerical scale.
00:05:32.160 | Essentially, you are crafting your own specialized metrics for your own application.
00:05:38.640 | Here's an example of how it works. You can feed your evaluation data set to the model you want
00:05:44.000 | to evaluate, which is going to generate the inferences that you want to score.
00:05:47.120 | Then, you can include those inferences inside a broader scoring prompt in which you've described
00:05:54.720 | the task you're trying to accomplish and the properties you're trying to grade. And also,
00:05:58.800 | you describe the scale across which it should be graded. For example, from 1 to 10. Then,
00:06:04.400 | you pass this scoring prompt to a scoring model, which is going to generate a number - a score - to
00:06:10.240 | score the actual inference. If you do this on all the inferences generated from your evaluation data set,
00:06:16.160 | you can draw a distribution of that particular metric. For example, here is a small set of closing
00:06:22.000 | words generated for professional emails. We want to evaluate their politeness. We can prompt a model to
00:06:28.400 | score the politeness of each statement from 1 to 10. For example, "Please let us know at your earliest
00:06:34.800 | convenience" scores highly, while "Tell me ASAP" will score poorly.
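
Here is a minimal sketch of that grading loop, assuming the openai Python package with GPT-4 as the scoring model; the prompt wording, the two example inferences, and the numeric parsing are illustrative choices, not Airtrain's implementation:

```python
# Illustrative sketch of LLM-as-judge grading (not Airtrain's implementation).
# Assumes the `openai` package and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

SCORING_PROMPT = """You are grading closing lines of professional emails.
Rate how polite the statement below is on a scale from 1 (rude) to 10
(very polite). Reply with the number only.

Statement: {inference}
"""

# Inferences previously generated by the model under evaluation.
inferences = [
    "Please let us know at your earliest convenience.",
    "Tell me ASAP.",
]

scores = []
for inference in inferences:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "user", "content": SCORING_PROMPT.format(inference=inference)},
        ],
        temperature=0,
    )
    # For this sketch we assume the reply is just the number, as requested.
    scores.append(int(response.choices[0].message.content.strip()))

# Over a full evaluation dataset, `scores` gives the distribution of the
# politeness metric described in the talk.
print(scores)
```
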
00:06:42.400 | We found that the best grading model at this time is still GPT-4, but it can be quite costly to use to score large datasets. We have found that
00:06:48.880 | FLAN-T5 offers a good trade-off of speed and correctness. Airtrain was designed specifically
00:06:55.200 | for this purpose. With Airtrain, you can upload your dataset, select the models you want to compare,
00:07:01.040 | describe the properties you want to measure, and visualize metric distribution across your entire
00:07:06.400 | dataset. You can compare Llama 2 with Falcon, FLAN-T5, or even your own model. Then, you can make an
00:07:13.440 | educated decision based on statistical evidence. Sign up today for early access at Airtrain.ai and start
00:07:20.240 | making data-driven decisions about your choice of LLM. Thanks. Goodbye.