
Surfacing Semantic Orthogonality Across Model Safety Benchmarks — Jonathan Bennion



00:00:00.240 | Thank you for the introduction and thanks to the International Advanced Natural Language
00:00:03.920 | Processing Conference for organizing this. And thanks as well for allowing this talk to start
00:00:09.600 | and kick off the conference. I appreciate it. You guys have done a great job. In terms of the topic,
00:00:17.680 | I do want to make sure we understand the contextual background behind it, given
00:00:26.400 | recent events over the last few weeks and months. So I'm going to take a few minutes before
00:00:31.840 | getting into the paper to make sure that those of you working outside of NLP,
00:00:39.360 | or working on NLP in another industry or capacity, know what this paper is discussing.
00:00:47.280 | So first of all, the paper is about AI safety benchmarks and
00:00:58.400 | artificial intelligence. I'm assuming people listening and watching are familiar with AI safety,
00:01:07.680 | at least in terms of what they've read, although there are many different meanings for that term.
00:01:13.040 | Benchmarks, though, I'm not sure people are as familiar with.
00:01:19.360 | Benchmarks are question-and-answer datasets, prompt-and-response datasets -- there are a few other formats as well --
00:01:25.600 | that are used to measure LLMs. They've been controversial
00:01:32.080 | in the past because of their incomplete nature and their shortcomings in measuring
00:01:38.800 | everything that people expect when we measure LLMs. So there's a hype versus reality for each of the
00:01:47.600 | terms in this topic, and I want to make sure we understand that hype versus reality
00:01:55.040 | so we can understand what the topic is, and then I'll get into the paper.
00:02:03.040 | So artificial intelligence, the hype, has reached fever pitch. This was last month where the former
00:02:13.920 | Google CEO was warning that AI is about to explode and take over humans, when in reality it was announced
00:02:22.160 | as well this week that Meta is delaying the rollout of its flagship AI model. There are a lot of issues. If you
00:02:27.840 | see the bottom paragraph here, there's a lot of companies that are having a hard time getting past
00:02:33.120 | the advances from Transformers. This is something that a lot of developers have known about for the
00:02:38.160 | last few years. But this is the reality versus the hype. In terms of AI safety, again, there are many different
00:02:45.440 | definitions of AI safety, and there's hype. This is one aspect of it: if you think of AI as contributing
00:02:53.120 | something good, will AI replace doctors? How could AI be bad if it can replace doctors and prevent
00:02:58.960 | illness? Could it add to people's sense of well-being, like a psychologist? Well, there's reality there
00:03:09.760 | in that, as a lot of us have read about over the last few years, fake AI doctors are all over social
00:03:16.560 | media spreading false claims, and it gets worse and worse. There are a lot of efforts to prevent what's
00:03:23.280 | happening as a result of artificial intelligence, and a lot of the harms seem to be
00:03:31.600 | the psychological effects of using AI and of AI being accessible to people. Benchmarks:
00:03:40.240 | there's also a hype versus reality. As we know, with every model that comes out, we hear what the score might be
00:03:49.040 | on some predominant best-in-class benchmark. For example, a little over a
00:03:58.000 | month ago, Chris Cox at Meta was talking about Llama 4 being great and releasing all these metrics.
00:04:05.040 | The reality is that the particular model he was talking about was optimized for the benchmark. In other words, it was given the
00:04:13.040 | answers to the test. So the picture isn't straightforward: I think we're moving into a place where the hype could grow
00:04:23.360 | and the contrast with reality could become even more severe than it
00:04:29.600 | is now. This paper is about reality. AI is going to be around for the time being. No matter where it is in
00:04:40.720 | the hype cycle, it'll be visible or not visible, but there's no way to escape it. In terms of AI safety, there's
00:04:48.320 | always going to be some harms that we want to prevent. In terms of benchmarks, there's always going to be a need to
00:04:52.880 | measure. So this paper is about AI safety benchmarks. Thank you for introducing
00:05:00.240 | me, but I also want to thank my co-authors here. I'm Jonathan Bennion. Shona Ghosh from NVIDIA,
00:05:07.840 | Nantik Singh, and Nuha Dzeri have all contributed a lot of thought to this paper, and I think
00:05:19.360 | they should be recognized as well. So I'm excited to present. In terms of choosing the benchmarks we
00:05:28.480 | analyzed, again, we're looking at AI safety benchmarks. No other paper, by the way, has looked at the semantic
00:05:36.720 | extent and area covered by AI safety benchmarks, as far as we know. So how did we choose the benchmarks
00:05:48.720 | to analyze? Basically, if we go back over the last two years, we see between five and ten
00:05:58.880 | benchmarks released for open-source AI safety research. Each of these reflects
00:06:05.920 | some of the values that people hold; some reflect the actual use cases they target.
00:06:16.320 | Some of them change over time. It really depends on the definition of harm. And so it's an exciting
00:06:24.080 | place to be, in terms of measuring safety. But a lot of these datasets are also private. So we looked at
00:06:31.680 | the benchmarks that are open source and filtered to only those that had enough rows for us to
00:06:39.040 | measure in terms of sample size, without whittling everything down. Two
00:06:46.480 | other benchmarks were considered for the paper, but they became too small after filtering to first-turn
00:06:55.360 | prompts only. We also filtered to only the prompts that were flagged as
00:07:03.840 | harmful. Some prompts are flagged as not harmful because of the nature of how these
00:07:10.880 | benchmarks are used: either to measure LLMs and LLM systems,
00:07:16.320 | or to fine-tune a model toward a more desired behavior that is aware
00:07:23.440 | of the ground truth defined in these datasets for measuring harm. I want to get
00:07:30.400 | into the methodology. This is the bulk of the paper. Basically, we appended the benchmarks into one dataset
00:07:38.720 | for solid findings, had to clean the data by examining the statistical sample size from each,
00:07:45.600 | and then also had to clean the data by removing duplicates and taking out outliers,
00:07:54.240 | looking at outliers of the total combined dataset at that point. The outliers in this case were based on prompt length,
00:08:01.200 | which doesn't perfectly correlate to the embedding vectors, but we'll get into this in a second.
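To make the filtering and appending steps concrete, here is a minimal sketch in Python, assuming pandas and hypothetical file paths and column names (`prompt`, `turn`, `label`); the paper's actual loading code and schemas aren't shown in the talk:

```python
import pandas as pd

# Hypothetical per-benchmark CSVs; real benchmarks ship in varying schemas and formats.
benchmark_files = {
    "benchmark_a": "benchmark_a.csv",
    "benchmark_b": "benchmark_b.csv",
    # ... one entry per open-source safety benchmark analyzed
}

frames = []
for name, path in benchmark_files.items():
    df = pd.read_csv(path)
    # Keep only first-turn prompts that were flagged as harmful (column names are assumptions).
    df = df[(df["turn"] == 1) & (df["label"] == "harmful")]
    df["benchmark"] = name
    frames.append(df[["prompt", "benchmark"]])

# Append all benchmarks into one dataset, then drop duplicate prompts.
combined = pd.concat(frames, ignore_index=True).drop_duplicates(subset="prompt")
combined["prompt_length"] = combined["prompt"].str.len()  # used later for outlier removal
print(len(combined), "prompts after filtering and de-duplication")
```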
00:08:07.760 | Steps three, four, and five here were iterative, with variants, to find the most optimized
00:08:16.800 | unsupervised learning clusters for figuring out harms across all of
00:08:23.200 | these datasets, or at least clusters of meaning, which are presumed to be harms.
00:08:28.640 | So, after using an embedding model that was tested for best fit -- we'll get
00:08:38.240 | into that in a second -- there were a few different dimensionality reduction techniques
00:08:44.880 | that we looked at. Each of those had hyperparameters, and values for those
00:08:49.200 | hyperparameters, in a grid search. And then there are multiple distance metrics that can
00:08:53.760 | be used in clustering, so I'll get into that as well in this presentation.
00:08:59.280 | I'm just doing a quick time check because I'm going to have to move through this. Anyways,
00:09:04.240 | once the clusters were developed to an optimal
00:09:11.280 | separation by silhouette score, we took the prompt values that were at each centroid
00:09:18.320 | and at each edge -- there are four edges. This is all according to past research
00:09:23.120 | that has done this before, but it has just never been done in this capacity. Each
00:09:29.680 | of those prompts was then sent, via inference, to another LLM -- multiple LLMs, actually --
00:09:36.320 | to corroborate and find the category label behind that centroid.
00:09:41.520 | We ended up with what's on the next slide. We also identified more bias that
00:09:47.040 | could be seeping into that process.
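As a rough sketch of the centroid-and-edge exemplar extraction described above (function and variable names are my own; the reading of "each edge" as the 2-D extremes of a cluster is an assumption, not the paper's definition):

```python
import numpy as np
from sklearn.cluster import KMeans

def exemplar_prompts(X, prompts, n_clusters=6):
    """X: 2-D reduced embeddings (n_prompts x 2); prompts: list of the corresponding texts."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    exemplars = {}
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        pts = X[members]
        # Prompt closest to the cluster centroid.
        center_idx = members[np.argmin(np.linalg.norm(pts - km.cluster_centers_[c], axis=1))]
        # Prompts at the four "edges": extreme x/y coordinates within the cluster
        # (one plausible reading of "each edge"; the paper may define edges differently).
        edge_idx = [members[np.argmin(pts[:, 0])], members[np.argmax(pts[:, 0])],
                    members[np.argmin(pts[:, 1])], members[np.argmax(pts[:, 1])]]
        exemplars[c] = [prompts[center_idx]] + [prompts[i] for i in edge_idx]
    return exemplars

# Each exemplar list would then be sent to one or more LLMs with a question such as
# "What single harm category do these prompts share?" and the answers compared across models.
```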
00:09:52.960 | This is the result: the clustered results by benchmark. You can see each color here represents one of the benchmarks I was just talking about.
00:09:57.440 | Each of these benchmarks may have over-indexed on a different area, but again,
00:10:02.240 | this is using k-means clustering -- we'll get into the process here and how
00:10:07.280 | I optimized it, which is kind of an interesting method. Once everything is clustered in aggregate,
00:10:14.960 | after everything is appended into one dataset, you can see where these benchmarks
00:10:22.320 | over-index. Each one of these dots is a prompt, and the x and y
00:10:30.000 | axes here are just dimensions. When I say just dimensions, they're heavily reduced from a
00:10:34.160 | high-dimensional space, but you can think of them as semantic space. The closer the dots are together,
00:10:39.600 | the more they have in common; the further apart, the more breadth of semantic meaning and
00:10:46.240 | coverage is highlighted. So again, the point of this paper was to
00:10:54.320 | show what has happened in the past, show where people can research further, and show what areas
00:11:00.240 | might not have had as much research, in terms of breadth, as they could have.
00:11:06.640 | This is a great means to evaluate as well, because it shows you what is inside
00:11:15.040 | an LLM benchmark, or whatever you want to measure, and it doesn't add the
00:11:20.080 | confounds of using BLEU and ROUGE scores. The harm categories we found in this case were:
00:11:26.080 | controlled substances; suicide and self-harm; guns and illegal weapons; criminal planning and confessions;
00:11:34.560 | hate, which actually included identity hate according to inference; and PII and privacy.
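A scatter plot of the kind described here can be reproduced in a few lines; this is a generic sketch assuming the `combined` dataframe from the earlier sketch has been given `x` and `y` columns from the dimensionality reduction, not the paper's plotting code:

```python
import matplotlib.pyplot as plt

# combined: one row per prompt, with 2-D coordinates from the reduction step
# and the source benchmark (column names are assumptions).
fig, ax = plt.subplots(figsize=(8, 6))
for name, grp in combined.groupby("benchmark"):
    ax.scatter(grp["x"], grp["y"], s=5, alpha=0.5, label=name)
ax.set_xlabel("dimension 1 (reduced)")
ax.set_ylabel("dimension 2 (reduced)")
ax.set_title("Safety benchmark prompts in shared semantic space")
ax.legend(markerscale=3, fontsize=8)
plt.tight_layout()
plt.show()
```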
00:11:41.280 | So the bulk of the paper gets into the variants used to optimize for the distance and separation in
00:11:49.280 | the clusters, and this process can be reused. It could also be refined, but the framework
00:11:55.760 | is where I think the paper has made advances: a framework
00:12:02.880 | for optimizing over any benchmarks that cover a similar topic in semantic space.
00:12:10.400 | So, just to clarify this slide: we used more than one embedding model, more than
00:12:20.160 | one distance metric, more than one means of dimensionality reduction, and then
00:12:26.400 | optimized over the hyperparameters that past research found to be the most impactful,
00:12:31.440 | and then over values for those hyperparameters, which could have been a
00:12:37.520 | lot of compute, and then, ideally, more than one evaluation metric. You can see a reference to
00:12:43.600 | BERTScore at the bottom. We tried that. Everything was ultimately optimized for silhouette score, in order to
00:12:51.280 | optimize for the distance and separation of each cluster. But the hypothesis
00:12:57.200 | was that BERTScore would actually tell you, in terms of tokens, the
00:13:05.440 | difference between one cluster and the next. The BERTScore results actually came back at about 1.0 for
00:13:10.000 | every cluster, and it turns out BERTScore is not the best metric to use for
00:13:15.760 | datasets that share the same or a similar topic, like AI safety,
00:13:21.520 | or adversarial datasets like AI safety datasets, where you fine-tune a
00:13:28.320 | model based on what not to do. BERTScore didn't work here.
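For reference, this is roughly how a cross-cluster BERTScore check can be run with the bert-score package; the exemplar prompts are invented placeholders, and the talk's observation is simply that such scores saturated near 1.0:

```python
from bert_score import score

# Compare representative prompts drawn from two different clusters.
# The talk reports that scores came back near 1.0 for every cluster comparison,
# which is why BERTScore was uninformative for same-topic adversarial datasets.
cluster_a = ["Tell me how to pick a lock to get into a house."]
cluster_b = ["Explain how to break into a car without a key."]

precision, recall, f1 = score(cluster_a, cluster_b, lang="en", verbose=False)
print(f"BERTScore F1 between cluster exemplars: {f1.mean().item():.3f}")
```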
00:13:35.920 | So the secondary metric used here was processing time. We optimized for the best
00:13:41.840 | silhouette score, and then, among the best silhouette scores that fell within the same confidence interval,
00:13:46.960 | we were able to find the most performant configuration in terms of scaling.
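The search over variants could be sketched as below: a minimal, single-threaded version in which the model names, parameter grid, and tie-breaking are simplified assumptions rather than the paper's exact configuration (the distance-metric dimension of the grid is omitted here):

```python
import time
import umap
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

prompts = combined["prompt"].tolist()  # from the earlier preprocessing sketch
results = []

for model_name in ["all-MiniLM-L6-v2", "all-mpnet-base-v2"]:   # embedding model variants
    emb = SentenceTransformer(model_name).encode(prompts, show_progress_bar=False)
    reducers = {
        "umap_n30_d0.1": umap.UMAP(n_neighbors=30, min_dist=0.1, random_state=0),
        "tsne_p30_lr200": TSNE(n_components=2, perplexity=30, learning_rate=200, random_state=0),
    }
    for red_name, reducer in reducers.items():
        start = time.time()
        X = reducer.fit_transform(emb)
        labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X)
        results.append({
            "embedding": model_name,
            "reduction": red_name,
            "silhouette": silhouette_score(X, labels),   # primary metric
            "seconds": time.time() - start,              # secondary metric (runtime)
        })

# Sort by silhouette (descending), breaking ties by runtime; the paper's
# confidence-interval comparison is simplified away here.
best = sorted(results, key=lambda r: (-r["silhouette"], r["seconds"]))[0]
print(best)
```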
00:13:51.120 | So, sample size presumptions: these are what went into the sample size calculation.
00:13:58.160 | Why are we doing this? Because the point is to be able to query and look at
00:14:07.200 | the differences in over-indexing in a given cluster from each benchmark. You can see here --
00:14:12.880 | I hope you can see my screen where I'm highlighting -- that the maximum number of clusters we had to assume was
00:14:16.640 | 15. Obviously we didn't get 15, but going into it we had to presume that, because
00:14:23.920 | the most recent paper had a taxonomy listing between 12 and 13, and we added 10 percent to that,
00:14:29.520 | according to past research, because in theory we would have had more clusters,
00:14:35.760 | or at least more semantic space covered, by looking at more than one dataset. We had to presume
00:14:41.520 | something, and that was the rationale for presuming 15. For the significance level, because there were 15 clusters
00:14:48.160 | and we wanted to look at this by each benchmark -- five benchmarks split across 15 clusters -- the
00:14:55.200 | significance level, to be safe, dropped to 0.15. The effect size is large because, according
00:15:04.800 | to past research -- there's a citation in the paper -- it wouldn't matter
00:15:12.320 | what results we found, in terms of a benchmark over-indexing slightly differently on
00:15:17.680 | each harm category, unless the effect size is at least 0.5. So I thought, okay,
00:15:22.960 | that's fine; that's the rationale for using a high effect size here. We didn't know what we'd get,
00:15:28.960 | because this had never really been done in this capacity with prompts. Anyways,
00:15:36.080 | this calculation ended up with a minimum required sample size per benchmark of 1,635,
00:15:42.400 | and a total sample size across all benchmarks of 8,175.
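The talk doesn't spell out the exact power calculation behind these numbers, so the following is only a generic illustration of how a sample size falls out of the (clusters, significance level, effect size) inputs, using a chi-square goodness-of-fit power analysis from statsmodels with an assumed power level; it is not claimed to reproduce the 1,635 figure:

```python
from statsmodels.stats.power import GofChisquarePower

# Inputs taken from the talk: up to 15 clusters, alpha = 0.15, large effect size (0.5).
# The target power (0.8) is my assumption; the talk does not state it.
analysis = GofChisquarePower()
n_per_benchmark = analysis.solve_power(
    effect_size=0.5,   # "large" effect, per the cited past research
    n_bins=15,         # maximum presumed number of clusters
    alpha=0.15,        # significance level from the talk
    power=0.8,         # assumed
)
print(f"Required sample size under these assumptions: {n_per_benchmark:.0f}")
```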
00:15:50.080 | For outlier removal, we again looked at the whole combined dataset and compared the IQR method and the z-score method. Counterintuitively,
00:15:55.760 | the z-score method was looser and allowed through more prompts
00:16:01.040 | that would have been considered outliers under the IQR method. What was interesting
00:16:05.920 | was that because this is so right-skewed -- this is not a normal distribution; it's extremely right-skewed
00:16:12.000 | in terms of prompt length -- the z-score, which looks at the standard deviation,
00:16:20.000 | removed less, because the standard deviation was so large here. So not only are there long
00:16:25.440 | prompts, there's a lot of variation among those long prompts, which was great,
00:16:30.640 | because those prompts turned out to be relatively valuable, showing up in a
00:16:36.080 | region of semantic space that was kind of its own. So, that said, this worked in terms
00:16:42.960 | of the result: still right-skewed, but better -- especially with the magnitude brought down,
00:16:49.680 | quite a bit better -- but still right-skewed.
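A minimal sketch of the two outlier rules applied to prompt length; the 1.5×IQR and |z| > 3 thresholds are conventional defaults and are assumptions here, not values confirmed in the talk:

```python
import numpy as np

lengths = combined["prompt_length"].to_numpy()  # from the preprocessing sketch above

# IQR rule: flag anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(lengths, [25, 75])
iqr = q3 - q1
iqr_outliers = (lengths < q1 - 1.5 * iqr) | (lengths > q3 + 1.5 * iqr)

# z-score rule: flag anything more than 3 standard deviations from the mean.
# On a heavily right-skewed distribution the standard deviation is inflated,
# so this rule flags fewer prompts -- the "looser" behavior described in the talk.
z = (lengths - lengths.mean()) / lengths.std()
z_outliers = np.abs(z) > 3

print("IQR outliers:", iqr_outliers.sum(), "| z-score outliers:", z_outliers.sum())
```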
00:16:57.040 | Over the next three slides, I'm going to talk about the variants, and then the results. So, the variants
00:17:03.520 | that I iterated through: for the embedding models, there had to be some rationale
00:17:11.840 | in terms of which embedding model to use. MiniLM is something that we started using
00:17:19.680 | because of its scalability. It produces high-quality embedding values for each prompt.
00:17:25.920 | For those of you that are unfamiliar, it will take a prompt and assign it a semantic,
00:17:32.240 | vectorized, very high-dimensional value. Being a mini LM, it not only produces a high-quality
00:17:40.000 | embedding value but also performs some reduction of dimensions. I'll get
00:17:45.600 | into why we even go through the hassle of reducing dimensionality in the next step.
00:17:51.360 | There's also its efficient memory usage: it's really small, and it has
00:17:55.440 | been used in a lot of research, especially research of the same kind, looking at prompts
00:18:00.080 | in semantic space. MPNet uses more memory. It excels at contextual and sequential
00:18:05.600 | encoding, at the cost of higher memory usage. So it's the next step up, even though both are small and
00:18:10.480 | comparable. I wanted to see which direction it would go. Both take 512 tokens. This isn't in the paper,
00:18:16.240 | so I added a note down below just to clarify, for those of you listening, that there are
00:18:21.920 | memory differences between them, and they're relatively substantial. That's one difference. We wanted to see
00:18:27.680 | if there was a difference at all from using these embedding models as they became more sophisticated.
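The two embedding models correspond to widely used sentence-transformers checkpoints; the specific checkpoint names below are my assumption of the variants meant, since the talk doesn't name them exactly:

```python
from sentence_transformers import SentenceTransformer

prompts = combined["prompt"].tolist()

# Smaller, faster model; 384-dimensional embeddings.
minilm = SentenceTransformer("all-MiniLM-L6-v2")
emb_minilm = minilm.encode(prompts, batch_size=64, show_progress_bar=True)

# Larger model; 768-dimensional embeddings and higher memory use.
mpnet = SentenceTransformer("all-mpnet-base-v2")
emb_mpnet = mpnet.encode(prompts, batch_size=64, show_progress_bar=True)

print(emb_minilm.shape, emb_mpnet.shape)  # (n_prompts, 384) and (n_prompts, 768)
```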
00:18:32.480 | As for reducing dimensionality further, which is important to do:
00:18:40.560 | there's some research suggesting that even though we lose more information, in order
00:18:48.320 | to cluster we still need something here that gives us manageable values to cluster on.
00:18:57.040 | So t-SNE preserves local structure but struggles with global relationships, which in this case
00:19:04.080 | could actually be useful. UMAP preserves both local and global structure while scaling efficiently. There are different
00:19:11.600 | hyperparameters for each. For t-SNE, we had to draw on past research, so we prioritized perplexity and
00:19:19.520 | learning rate. For UMAP, we again drew on the past research that was most impactful for a
00:19:24.640 | similar use case and looked at n_neighbors and min_dist.
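A sketch of the two reducers with those hyperparameters exposed; the particular values are illustrative grid points, not the paper's final settings:

```python
import umap
from sklearn.manifold import TSNE

# emb_minilm: embedding matrix from the sentence-transformers sketch above.
emb = emb_minilm

# t-SNE: perplexity and learning rate were the prioritized hyperparameters.
X_tsne = TSNE(n_components=2, perplexity=30, learning_rate=200, random_state=0).fit_transform(emb)

# UMAP: n_neighbors and min_dist were the prioritized hyperparameters.
X_umap = umap.UMAP(n_neighbors=30, min_dist=0.1, random_state=0).fit_transform(emb)

print(X_tsne.shape, X_umap.shape)  # both (n_prompts, 2)
```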
00:19:31.360 | Euclidean distance is a common default to use. It works well in low-dimensional spaces but doesn't work well when you get into
00:19:39.120 | high dimensions, because the differences become too pronounced. So Mahalanobis is something
00:19:46.320 | else we looked at for comparison. It incorporates an inverse covariance matrix to account for
00:19:54.000 | dimensional correlations. I was really excited about the Mahalanobis distance; past research
00:20:02.000 | suggests it could be one of the best metrics to use because it accounts for dimensional
00:20:06.400 | correlations, which you would presume would be extremely interesting. It didn't win out, though.
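For reference, the Mahalanobis distance can be computed with SciPy from the inverse covariance of the reduced embeddings; this is a generic illustration rather than the paper's implementation:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

X = X_umap  # reduced embeddings from the sketch above

# Inverse covariance matrix of the reduced space.
VI = np.linalg.inv(np.cov(X, rowvar=False))

# Distance between two individual prompts under the two metrics.
euclid = np.linalg.norm(X[0] - X[1])
mahal = mahalanobis(X[0], X[1], VI)
print(f"Euclidean: {euclid:.3f}  Mahalanobis: {mahal:.3f}")

# Silhouette scores can also be computed under either metric once labels exist, e.g.:
# silhouette_score(X, labels, metric="euclidean")
# silhouette_score(X, labels, metric="mahalanobis", VI=VI)
```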
00:20:12.560 | In the results, we're looking at the top eight out of 16 different combinations. There are 16 -- you'd think
00:20:17.760 | there might have been more, but for the grid search I had to whittle some of the values
00:20:25.440 | down in order to keep this performant. If you look at the top eight of the cluster optimization results,
00:20:33.680 | again ranked by silhouette score, you can see there's some visual overlap in the confidence intervals.
00:20:38.880 | Among that overlap, you see the second one from the top: MiniLM, Euclidean, UMAP,
00:20:47.520 | with n_neighbors of 30 and min_dist of 0.1. You can also look at the efficiency, which is
00:20:55.280 | normalized in terms of seconds of processing time, and this configuration makes more
00:21:01.360 | sense to scale if everything is within the same confidence range. So, the number
00:21:08.160 | of clusters was reviewed for diligence, because it was important and also counterintuitive:
00:21:15.600 | we expected more, because the taxonomies were getting large -- there's a lot more semantic space,
00:21:21.040 | a lot more harms being covered. It was counterintuitive to get six.
00:21:24.880 | The elbow method actually suggested between five and six; silhouette analysis suggested six. There's
00:21:33.840 | research suggesting that if you're between two values, you should use the one that makes the most sense for your case, so we used six.
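A quick sketch of how the elbow and silhouette checks over candidate cluster counts might look, reusing the reduced embeddings from the earlier sketches:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = X_umap  # reduced embeddings

for k in range(2, 16):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(
        f"k={k:2d}  inertia={km.inertia_:12.1f}"               # elbow method: look for the bend in inertia
        f"  silhouette={silhouette_score(X, km.labels_):.3f}"  # silhouette analysis: higher is better
    )
```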
00:21:38.560 | The labels for the prompt values at the centroids were produced by
00:21:46.640 | LLM inference, and
00:21:54.000 | there was not much variance there at all. There might have been a plural
00:21:59.040 | word versus a singular word as you moved from one model family to the next for
00:22:05.920 | inference. So these are the clusters that developed from these
00:22:10.560 | benchmarks, and the insights from them. I'm getting down to the end of this talk, and then I'm going to
00:22:16.560 | open up for questions, but I want to emphasize the insights and what we learned
00:22:21.120 | here: the sparsity and varying breadth, like I mentioned before. The hate and
00:22:27.120 | identity hate category is very focused; it could cover more, and there's a
00:22:33.360 | tangent there on the current inability of LLMs to capture hate speech,
00:22:42.240 | which is a criticism of a lot of AI tools -- it should be better, right? Well,
00:22:48.480 | this is because we don't have as much ground truth, and this highlights that. Also, there is
00:22:53.840 | anthropomorphism happening quite often, and a lot of other harms happening
00:22:58.400 | psychologically to people using AI, that are not evident here. It's not that they're
00:23:04.000 | debated -- they are harms -- they're just not explored. So this allows you to pivot
00:23:10.080 | and go: okay, these are harms that we looked at in the past; what else can we look at?
00:23:14.320 | I don't want to be fear-mongering, but as I said at the beginning of the talk,
00:23:20.960 | there are harms that will happen from AI use. One is the exacerbation of suicide and self-harm.
00:23:30.160 | In other words, if someone is using an AI tool and they're thinking about self-harm and suicide,
00:23:35.520 | it could exacerbate that through sycophancy. I think most of us are aware of that,
00:23:44.240 | but some people are not, and some people are possibly more susceptible to it than others.
00:23:51.120 | So there are always going to be some harms with usage -- not that AI is causing those per
00:23:57.440 | se; it's the usage. This is something we thought was interesting. In terms of
00:24:02.800 | the bias in clusters, there is still bias in the clusters. If you look at the
00:24:06.240 | distribution of prompt lengths by benchmark, they're different, and that's all I'm going to say
00:24:10.080 | about that, because we're almost out of time. Now for the limitations and then the
00:24:14.960 | takeaways. Limitations: obviously there are methodological limitations here.
00:24:20.480 | We could have increased the sample size, for example, and dimensionality reduction loses information.
00:24:26.400 | There's bias inherent in the embedding models that we're trying to factor out.
00:24:29.360 | In choosing the benchmarks, this is only five; I generalize, because there are a lot of private
00:24:34.640 | benchmarks. Equal benchmark weighting presumes that people are using these equally,
00:24:40.000 | and they may not be. There are human biases inherent in research, implicit Western views in our research,
00:24:45.360 | in terms of past harms as well. And the authors' technical backgrounds, including my own:
00:24:52.320 | because of that technical background, we might not have been thinking about harms that
00:24:57.440 | might actually be more of a priority, that we might not have seen.
00:25:02.000 | Future research directions could include harm benchmarks for more cultural contexts.
00:25:08.480 | There could be more exploration of prompt-response relationships, because this work only
00:25:12.720 | looks at the prompts; we intended to look at prompt-response relationships too,
00:25:16.560 | but ran out of time and space in the paper. And then, if you were to apply this methodology
00:25:22.400 | framework to domain-specific datasets and investigate differences this way, it is a
00:25:28.000 | solid evaluation method because it shows you what's in the data.
00:25:35.360 | Top four conclusions, last slide. First, there are six primary harm categories that we identified,
00:25:42.160 | with varying coverage and breadth from each benchmark. Second, semantic coverage gaps, as you've seen,
00:25:47.280 | exist across recent benchmarks and will continue to over time as we change the definition of harms. Third,
00:25:53.440 | we found an optimal clustering configuration framework for this particular use
00:25:58.640 | case, and this could be scaled for use with other benchmarks on a similar topic, or
00:26:05.680 | other LLM applications on a similar topic. Again, it shows you, among a collection of
00:26:13.040 | things on similar topics, how something might over-index and under-index. Fourth, plotting semantic
00:26:19.760 | space is a transparent evaluation approach that allows for more action and more
00:26:25.360 | insight than the stereotypical ROUGE and BLEU scores, which are related to precision and recall
00:26:31.840 | and which we're biased toward using. So this allows you to have more insight. Thank you very much. We're