Surfacing Semantic Orthogonality Across Model Safety Benchmarks — Jonathan Bennion


Transcript

Thank you for the introduction, and thanks to the International Advanced Natural Language Processing Conference for organizing this and for letting this talk kick off the conference. I appreciate it; you've done a great job. Before getting into the topic, I want to make sure we understand the context behind it, given recent events over the last few weeks and months.

So I'm going to take a few minutes before getting into the paper, for those of you outside of NLP or working on NLP in another industry or capacity, to make sure we all know what this paper is discussing.

First of all, the paper is about AI safety benchmarks. I'm assuming people listening and watching are familiar with artificial intelligence, and relatively familiar with AI safety from what they've read, although that term has many different meanings. Benchmarks, though, I'm not sure everyone is as familiar with.

Benchmarks are question-and-answer datasets, prompt-and-response datasets (there are a few other formats as well) that are used to measure LLMs. They've been controversial in the past because of their incomplete nature and their shortcomings in measuring everything people expect when we measure LLMs.

So there's a hype versus a reality for each of these terms, and I want to make sure we understand both so we can understand the topic; then I'll get into the paper. For artificial intelligence, the hype has reached fever pitch.

Just last month the former Google CEO was warning that AI is about to explode and take over humans, while in reality it was announced this week that Meta is delaying the rollout of its flagship AI model. There are a lot of issues: as the bottom paragraph here shows, many companies are having a hard time getting past the advances that came from Transformers.

That's something a lot of developers have known about for the last few years, but this is the reality versus the hype. In terms of AI safety, again, there are many different definitions, and there's hype there too. Here's one aspect: if you think of AI as contributing something good, will AI replace doctors?

How could AI be bad if it can replace doctors and prevent harm? Can it add to the sense of well-being you'd get from a psychologist? Well, the reality, as many of us have read over the last few years, is that AI doctors are all over social media spreading fake claims, and it gets worse and worse.

There are a lot of efforts to prevent what's happening as a result of artificial intelligence, and much of the harm seems to be in the psychological effects of using AI and of AI being so accessible to people. For benchmarks, there's also a hype versus reality.

As we know, every time a model comes out we hear its score on some predominant benchmark and that it's best in class. For example, a little over a month ago, Chris Cox at Meta was talking about Llama 4 being great and releasing all these metrics.

The reality is that the particular model he's talking about was optimized for the benchmark; in other words, it was given the answers to the test. So it's not straightforward, and I think we're moving into a place where the hype could grow and the contrast with reality could become even more severe than it is now.

This paper is about reality. AI is going to be around for the time being; no matter where it is in the hype cycle, whether it's visible or not, there's no way to escape it. In terms of AI safety, there are always going to be harms that we want to prevent.

In terms of benchmarks, there's always going to be a need to measure. So this paper is about AI safety benchmarks. Thank you for introducing me; I'm Jonathan Bennion, but I also want to thank my co-authors. Shona Ghosh from NVIDIA, Nantik Singh, and Nuha Dzeri all contributed a lot of the thinking that went into this paper, and they should be recognized as well.

So I'm excited to present. In terms of choosing the benchmarks we analyzed: again, we're looking at AI safety benchmarks, and as far as we know, no other paper has looked at the semantic extent of the area covered by AI safety benchmarks. So how did we choose the benchmarks to analyze?

Basically, if we go back over the last two years, we see between five and ten benchmarks released as open source for AI safety research. Each of these reflects some of the values people hold, some reflect the actual use cases they target, and some change over time.

It really depends on the definition of harm, so it's an exciting place to be in terms of measuring safety. But a lot of these datasets are private. So we looked at the benchmarks that are open source and filtered them to only those with enough rows to meet our sample-size requirements without whittling everything down.

Two other benchmarks were considered for the paper, but they became too small after we filtered to first-turn prompts only. We also filtered to only the prompts flagged as harmful. Some prompts are flagged as not harmful because of how these benchmarks are used: either to measure LLMs and LLM systems, or to fine-tune a model toward more desired behavior that is aware of the ground truth for harm defined in these datasets.

I want to get into the methodology; this is the bulk of the paper. Basically, we appended the benchmarks into one dataset so the findings would be solid, cleaned the data by checking the statistical sample size from each benchmark, and then cleaned it further by removing duplicates and taking out outliers across the combined dataset.
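
As a rough sketch of this combining and filtering step (not the authors' code; the column names, file paths, and flag values here are assumptions for illustration):

```python
# Illustrative sketch only: append open-source safety benchmarks into one
# dataframe, keep first-turn prompts flagged as harmful, and drop duplicates.
import pandas as pd

benchmark_files = {                     # placeholder names and paths
    "benchmark_a": "benchmark_a.csv",
    "benchmark_b": "benchmark_b.csv",
}

frames = []
for name, path in benchmark_files.items():
    df = pd.read_csv(path)
    df = df[df["turn"] == 1]            # assumed column: keep first-turn prompts only
    df = df[df["label"] == "harmful"]   # assumed column: keep prompts flagged harmful
    df["benchmark"] = name
    frames.append(df[["prompt", "benchmark"]])

combined = pd.concat(frames, ignore_index=True).drop_duplicates(subset="prompt")
```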

The outliers in this case were based on prompt length, which doesn't perfectly correlate with the embedding vectors, but we'll get to that in a second. Steps three, four, and five were iterated with different variants to find the best, most optimized unsupervised clusters: clusters of meaning across all of these datasets, which are presumed to be harms.

So after using an embedding model that was tested for best fit (we'll get into that in a second), there were a few different dimensionality reduction techniques we looked at, and each of those had hyperparameters, with values for those hyperparameters explored in a grid search.

There are also multiple distance metrics that can be used in clustering, and I'll get into that as well (just doing a quick time check because I have to move through this). Once the clusters were developed to an optimal separation by silhouette score, we took the prompt values at each centroid and at each edge.
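
A hedged sketch of that kind of grid search (the hyperparameter values and the fixed cluster count below are illustrative, not the paper's exact grid):

```python
# Sketch: iterate over UMAP settings, cluster with k-means, and rank the
# combinations by silhouette score, recording runtime as a secondary metric.
import time
import numpy as np
import umap
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def grid_search(embeddings, n_clusters=6):
    results = []
    for n_neighbors in (15, 30, 50):        # illustrative values only
        for min_dist in (0.0, 0.1, 0.5):
            start = time.time()
            reduced = umap.UMAP(n_neighbors=n_neighbors, min_dist=min_dist,
                                random_state=42).fit_transform(embeddings)
            labels = KMeans(n_clusters=n_clusters, n_init=10,
                            random_state=42).fit_predict(reduced)
            results.append({
                "n_neighbors": n_neighbors,
                "min_dist": min_dist,
                "silhouette": silhouette_score(reduced, labels),
                "seconds": time.time() - start,
            })
    return sorted(results, key=lambda r: r["silhouette"], reverse=True)
```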

There are four edges. This all follows past research that has done this before, but it has never been done in this capacity. Each of those prompts was then passed through inference to another LLM (multiple LLMs, actually, to corroborate) to find the category label behind that centroid.
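
One way to pull those representative prompts might look like the following sketch; interpreting "each edge" as the prompts farthest from the centroid is an assumption on my part, and the variable names follow the earlier sketches.

```python
# Sketch: nearest-to-centroid prompt plus the four farthest ("edge") prompts
# per cluster, collected for later LLM labeling. `reduced`, `labels`, and
# `prompts` come from the earlier sketches; centroids = kmeans.cluster_centers_.
import numpy as np

def representative_prompts(reduced, labels, centroids, prompts, n_edges=4):
    reps = {}
    for c, centroid in enumerate(centroids):
        idx = np.where(labels == c)[0]
        dists = np.linalg.norm(reduced[idx] - centroid, axis=1)
        order = np.argsort(dists)
        reps[c] = {
            "centroid_prompt": prompts[idx[order[0]]],
            "edge_prompts": [prompts[i] for i in idx[order[-n_edges:]]],
        }
    return reps
```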

We ended up with what's on the next slide. We also identified more bias that could be seeping into that process, but this is the result: the clustered results by benchmark. You can see each color here represents one of the benchmarks I was just talking about.

Each of these benchmarks might over-index on a different area, but again, this is using k-means clustering (we'll get into the process and how I optimized it, which is kind of an interesting method). Once everything is appended into one dataset and clustered in aggregate, you can see where these benchmarks over-index; each one of these dots here is a prompt.

The x and y axes here are just dimensions. When I say just dimensions, they're reduced down from a high-dimensional space, but you can think of this as semantic space: the closer prompts are together, the more they have in common; the further apart, the more breadth of semantic meaning and coverage is highlighted.

So again, the point of this paper was to show what has happened in the past, to show where people can research further, and to show which areas might not have had as much research breadth as they could have. It's also a great means of evaluation, because it shows you what is inside an LLM benchmark, or whatever you want to measure, without the confounds of using BLEU and ROUGE scores.

The harm categories we found in this case were controlled substances; suicide and self-harm; guns and illegal weapons; criminal planning and confessions; hate, which actually included identity hate according to inference; and PII and privacy. The bulk of the paper gets into the variants used to optimize the distance in the clusters, and this process could be reused.

It could also be refined, but the framework is where I think the paper has made advances: a framework for optimizing over any collection of benchmarks around a similar topic in semantic space. Just to clarify this slide: we used more than one embedding model, more than one distance metric, and more than one means of dimensionality reduction; we then optimized over the hyperparameters that past research found most impactful and over values for those hyperparameters, which could have been a lot of compute, and ideally more than one evaluation metric.

You can see a reference to BERTScore at the bottom. We tried that. Everything was ultimately optimized for silhouette score, to optimize the distance and separation of each cluster. But the hypothesis was that BERTScore would tell you, in terms of tokens, the difference between one cluster and the next.

The BERTScore results actually came back at roughly 1.0 for every cluster. It turns out BERTScore is not the best metric to use for datasets that share the same or a similar topic, like AI safety, or adversarial datasets like AI safety datasets, where you fine-tune a model based on what not to do.
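
A small illustration of that check (placeholder strings; the near-1.0 figure is what the talk reports, not something this snippet guarantees):

```python
# Sketch: BERTScore between representative prompts from two clusters that
# share the same broad topic tends to come back uniformly high.
from bert_score import score  # pip install bert-score

cands = ["<representative prompt from cluster A>"]
refs = ["<representative prompt from cluster B>"]
P, R, F1 = score(cands, refs, lang="en")
print(F1.mean().item())  # the talk reports values around 1.0 across clusters
```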

BERTScore didn't work here, so the secondary metric used was processing time. We optimized for the best silhouette score, and then among the best silhouette scores within the same confidence interval, we picked the most performant configuration in terms of performance at scale.

Now, sample size presumptions: these are what went into the sample size calculation. Why are we doing this? Because the whole point is to query and look at the differences in how each benchmark over-indexes in a given cluster.

You can see here (I hope you can see where I'm highlighting on my screen) that the maximum number of clusters we had to assume was 15. Obviously we didn't end up with 15, but going in we had to presume that, because the most recent paper had a taxonomy listing between 12 and 13 categories, and we added roughly 10 percent to that, following past research, since in theory we would have more clusters, or at least more semantic space covered, by looking at more than one dataset.

We had to presume something, and that was the rationale for presuming 15. For the significance level: because there were 15 clusters and we wanted to look at this by benchmark (15 clusters split across five benchmarks), the significance level, to be safe, was set at 0.15.

And the effect size is large because, according to past research cited in the paper, a benchmark over-indexing slightly differently on a harm category wouldn't matter unless the effect size was at least 0.5.

So I thought, okay, that's the rationale for using a large effect size here. We didn't know what we'd get, because this had never really been done in this capacity with prompts. In any case, the calculation ended up with a minimum required sample size per benchmark of 1,635 and a total sample size across all benchmarks of 8,175.
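
For reference, a chi-square-style power calculation with the stated inputs could be set up roughly as follows; the power value is an assumption, and this shows the general form of such a calculation rather than reproducing the paper's 1,635-per-benchmark figure, whose exact parameterization isn't spelled out in the talk.

```python
# Sketch only: sample-size calculation for a chi-square goodness-of-fit test.
from statsmodels.stats.power import GofChisquarePower

n = GofChisquarePower().solve_power(
    effect_size=0.5,   # "large" effect size, per the talk
    n_bins=15,         # assumed maximum number of clusters
    alpha=0.15,        # significance level stated in the talk
    power=0.8,         # assumed; not stated in the talk
)
print(round(n))        # minimum sample size under these assumptions
```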

For outlier removal, we did this on the entire combined dataset, comparing the IQR method and the z-score method. Counterintuitively, the z-score method was looser and kept more prompts that would have been considered outliers under the IQR method.

What was interesting is that this is not a normal distribution; it's extremely right-skewed in terms of prompt length. The z-score method, looking at the standard deviation as it does, removed less because the standard deviation was so large.

So not only are there long prompts, there's also a lot of variance among those long prompts, which was great, because those prompts turned out to be relatively valuable, showing up in a region of semantic space all their own.
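
A minimal sketch of the two filters on prompt length (the cut-offs are conventional defaults and an assumption; `combined` is from the earlier sketch):

```python
# Sketch: compare IQR and z-score outlier filters on prompt length.
import numpy as np

def iqr_mask(lengths, k=1.5):
    q1, q3 = np.percentile(lengths, [25, 75])
    iqr = q3 - q1
    return (lengths >= q1 - k * iqr) & (lengths <= q3 + k * iqr)

def zscore_mask(lengths, z=3.0):
    return np.abs((lengths - lengths.mean()) / lengths.std()) <= z

lengths = combined["prompt"].str.len().to_numpy()
print(iqr_mask(lengths).sum(), zscore_mask(lengths).sum())
# On a heavily right-skewed distribution the z-score filter removes fewer
# rows, because the large standard deviation widens its cut-off.
```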

That said, this worked: the result is still right-skewed, but quite a bit better, especially in magnitude. Over the next three slides I'm going to talk about the variants, and then I'll talk about the results.

So, the variants I iterated through: for the embedding models, there had to be some rationale for which embedding model to use. MiniLM is something we started with because of its scalability; it produces high-quality embedding values for each prompt.

For those of you who are unfamiliar, it takes a prompt and assigns a semantic, vectorized, high-dimensional value. MiniLM not only produces a high-quality embedding but is also relatively compact in its dimensionality, and I'll get into why we even go through the hassle of reducing dimensionality in the next step.

There's also its efficient memory usage: it's really small, and it's been used in a lot of research of the same kind, looking at prompts and semantic space. MPNet uses more memory; it excels at contextual and sequential encoding, at the cost of higher memory usage.

So it's the next step up, even though they're both small and comparable, and I wanted to see which direction it would go. Both have a 512-token limit. This isn't in the paper, so I added a note below just to clarify for those of you listening that there are memory differences, and they're relatively substantial.

That's one difference, and we wanted to see whether there was any difference at all as these embedding models became more sophisticated. Then there's reducing dimensionality further, which is important: some research suggests that even though we lose more information, we still need it in order to cluster on manageable values.
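
A sketch of producing both sets of embeddings with sentence-transformers (the public checkpoints named below are the common ones and are my assumption, not confirmed by the paper):

```python
# Sketch: embed all prompts with MiniLM (384-dim, low memory) and MPNet
# (768-dim, higher memory) for comparison downstream.
from sentence_transformers import SentenceTransformer

prompts = combined["prompt"].tolist()             # `combined` from the earlier sketch
minilm = SentenceTransformer("all-MiniLM-L6-v2")
mpnet = SentenceTransformer("all-mpnet-base-v2")

emb_minilm = minilm.encode(prompts, show_progress_bar=True)
emb_mpnet = mpnet.encode(prompts, show_progress_bar=True)
```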

t-SNE preserves local structure but struggles with the global relationships that, in this case, could be useful. UMAP preserves both local and global structure while scaling efficiently. Each has different hyperparameters: for t-SNE we drew on past research and prioritized perplexity and learning rate.

For UMAP, again drawing on past research that was most impactful for a similar use case, we looked at n_neighbors and min_dist. Euclidean distance is a common default; it works well in low-dimensional spaces but not when you get into high dimensions, because the differences become too extreme.

Mahalanobis distance is something else we looked at for comparison. It incorporates an inverse covariance matrix to account for correlations between dimensions. I was really excited about Mahalanobis distance; past research suggests it could be one of the best metrics to look at precisely because it accounts for those correlations, which you would presume would be extremely interesting.
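
A sketch of how the two metrics can be compared on the same clustering (variable names follow the earlier sketches):

```python
# Sketch: silhouette score under Euclidean vs. Mahalanobis distance, with the
# inverse covariance matrix supplied for the latter.
import numpy as np
from sklearn.metrics import silhouette_score

VI = np.linalg.inv(np.cov(reduced.T))  # inverse covariance of the reduced space
sil_euclidean = silhouette_score(reduced, labels, metric="euclidean")
sil_mahalanobis = silhouette_score(reduced, labels, metric="mahalanobis", VI=VI)
print(sil_euclidean, sil_mahalanobis)
```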

It didn't win out. In the results, we're looking at the top eight out of 16 different combinations. There are 16; you'd think there might have been more, but for the grid search I had to whittle some of the values down to keep it tractable.

If you look at the top eight cluster optimization results, again by silhouette score, you can see there's some visual overlap in the confidence intervals. Among that overlap, the second one from the top is MiniLM with Euclidean distance and UMAP, with n_neighbors of 30 and min_dist of 0.1.

You can also look at efficiency, which is normalized processing time in seconds, and that configuration makes more sense to scale if everything is within the same confidence range. The number of clusters was then reviewed for diligence, which was important because the result was counterintuitive: we expected more clusters because the taxonomies were getting large.

There's a lot more semantic space and a lot more harms covered now, so it was counterintuitive to get six. The elbow method actually suggested between five and six; silhouette analysis suggested six. There's research suggesting that if you're between two values, you should use the one that makes the most sense for your case.
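
That diligence check can be sketched as a simple sweep over candidate cluster counts:

```python
# Sketch: inertia ("elbow") and silhouette score across candidate values of k.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 16):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(reduced)
    print(k, round(km.inertia_, 1), round(silhouette_score(reduced, km.labels_), 3))
# Per the talk, the elbow fell between five and six and silhouette favored six.
```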

So we used six. The labels inferred for the prompt values at the centroids came from LLMs, but there was not much variance here at all; there might have been a plural word versus a singular word as you moved from one model family to the next for inference.
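
A hedged sketch of that labeling step (not the authors' prompt or model choices; the model name below is a placeholder):

```python
# Sketch: ask an LLM for a short harm-category label for a cluster's
# representative prompts; repeating this across model families lets you check
# that the labels agree.
from openai import OpenAI

client = OpenAI()

def label_cluster(rep_prompts, model="gpt-4o-mini"):  # placeholder model name
    text = "\n".join(rep_prompts)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": "Give a short harm-category label for these prompts:\n" + text}],
    )
    return response.choices[0].message.content.strip()
```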

So these are the clusters that developed from these benchmarks. I'm getting toward the end of this talk, and then I'll open up for questions, but I want to emphasize the insights and what we learned from them: the sparsity and varying breadth I mentioned before. The hate and identity hate category is very focused; it could cover more, and there's actually a tangent there about the current inability of LLMs to capture hate speech, which is a criticism of a lot of AI tools. It should be better, right?

Well, part of the reason is that we don't have as much ground truth, and this highlights that. Also, anthropomorphism is happening quite often, along with a lot of other psychological harms to people using AI, and those are not evident here. It's not that they're debated; they are harms, they're just not explored.

So this allows you to pivot and say, okay, these are the harms we've looked at in the past; what else can we look at? I don't want to be fear-mongering, but like I said at the beginning of the talk, there are harms that will happen from AI use.

One is the exacerbation of suicide and self-harm: if someone is using an AI tool while thinking about self-harm or suicide, the tool could exacerbate that through sycophancy. I think most of us are aware of that, but some people are not, and some people are possibly more susceptible to it than others.

So there are always going to be some harms with usage; it's not that AI is causing them per se, it's the usage. That's something we thought was interesting. In terms of bias in the clusters, there is still bias present.

If you look at the distribution of prompt lengths by benchmark, they differ, and that's all I'm going to say about that because we're almost out of time and I still want to cover the limitations and the takeaways. Limitations: obviously there are methodological limitations here. We could have increased the sample size, for example, and dimensionality reduction loses information.

There's also bias inherent in the embedding models that we're trying to surface. In choosing the benchmarks, this is only five; I'm generalizing, because there are a lot of private benchmarks. Equal benchmark weighting presumes people use these benchmarks equally, and they may not. There are human biases inherent in research, including implicit Western views in our research and in how past harm has been defined.

And the authors' technical backgrounds, including my own, mean we might not have been thinking about harms that should actually be more of a priority and that we simply didn't see. Future research directions could include harm benchmarks for more cultural contexts.

There could also be more exploration of prompt-response relationships; this work only looks at the prompts. We intended to look at prompt-response relationships but ran out of time and space in the paper. And if you were to apply this methodological framework to domain-specific datasets and investigate differences that way, it's a solid evaluation method because it shows you what's in the data.

Top four conclusions, last slide. First, there are six primary harm categories that we identified, with varying coverage and breadth from each benchmark. Second, semantic coverage gaps, as you've seen, exist across recent benchmarks and will persist over time as we change the definition of harm. Third, we found an optimal clustering configuration framework for this particular use case.

This could be scaled for use with other benchmarks on a similar topic, or other LLM applications with similar topical use; again, it shows you, among a collection of datasets on similar topics, how each might over-index and under-index. Fourth, plotting semantic space.

Plotting semantic space is a transparent evaluation approach that allows for more action and more insight than the stereotypical ROUGE and BLEU scores, which are tied to precision and recall and which we're biased toward using. So this gives you more insight. Thank you very much.

Uh, we're