Perceptual Evaluations: Evals for Aesthetics — Diego Rodriguez, Krea.ai

Chapters
0:15 Introduction to Perceptual Evaluations
0:50 The Problem with Current AI Evaluations
2:16 Historical Context and Compression
5:14 Limitations in AI and Human-centric Metrics
8:00 Rethinking Evaluation and the Future of AI
12:44 Evaluating Our Evaluations
13:32 Krea's Role and Call to Action
Okay, so hello everyone. My name is Diego Rodriguez. I'm the co-founder of Krea, a startup in the AI space like many others, in particular generative media, multimedia, multimodal, and all the buzzwords. But I come here mainly to tell a story about how we think about evaluations when we have to take human perception, human opinion, and aesthetics into the mix, right?
So I'm going to start with a very simple story. I put up an AI-generated image of a hand. Obviously, it looks horrible. And then I ask o3: what do you think of this image? It thinks for 17 seconds, obviously, with tool calling, runs Python, does analysis with OpenCV, goes crazy. And after it charges me a few cents, the answer comes back as something like: it's mostly natural, but... And it's like, okay, we have what many people claim is basically AGI, and it is completely unable to answer a very simple question.
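For context, the kind of probe I mean is just this: hand a vision-capable model the image and ask for an opinion. A minimal sketch, assuming the openai Python client; the model name and file path are placeholders rather than the exact setup from the slide:

```python
# A minimal sketch of "ask the model what it thinks of this image".
# Assumes the openai Python client (>= 1.0); model name and image path are placeholders.
import base64
from openai import OpenAI

client = OpenAI()
with open("ai_generated_hand.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What do you think of this image? Does the hand look natural?"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```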
And that's a surprising thing if you think about it, because we as humans, when we see that image, we just react so naturally against it: ugh, what is that? That's not natural. And I feel like that's precisely what AI models are being trained on: first, human data; second, human preference data; and third, in a way, they are limited by the data that we humans made based on our preconceived notions and perception and all of that. So that's what this talk is about: what can we do better? And honestly, it's here to ask ourselves some questions that I think are not being asked enough in the field.
Cool. So, a tiny, tiny bit of history. We all know about Claude, Claude Shannon, that is, the father of information theory. And according to many, his master's thesis is one of the most important master's theses in the world, where he laid the foundations for digital circuits, and then eventually communication and everything else. To a degree, we can say that even LLMs nowadays trace back to it, if we fast forward. And I want to focus on the fact that we call his work foundational in information theory, yet when he published it, it was actually called "A Mathematical Theory of Communication". He was always focused on communication. There's a diagram that appears in that work, and it's all about: this is the source, this is the channel, this is the destination, and there can be some noise in between.
And as a founder of a company that is focused on media, to me it's interesting to realize these parallels between classic information theory and communication. Let me see, did I put the image? Let's see. Well, I didn't put the image, but if you have any context around variational autoencoders or neural networks, you can squint and be like, oh, is that a neural network? Right?
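If it helps to make the squint concrete, here is a minimal, purely illustrative sketch in PyTorch of that parallel: the encoder plays Shannon's transmitter, the narrow (optionally noisy) bottleneck plays the channel, and the decoder plays the receiver. None of this is from the talk; it is just the diagram restated as code.

```python
# Shannon's source -> transmitter -> channel -> receiver -> destination, squinted at as an autoencoder.
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    def __init__(self, dim=784, bottleneck=32):
        super().__init__()
        # "transmitter": encodes the source into a compact code
        self.encoder = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, bottleneck))
        # "receiver": reconstructs the message for the destination
        self.decoder = nn.Sequential(nn.Linear(bottleneck, 128), nn.ReLU(), nn.Linear(128, dim))

    def forward(self, x, noise_std=0.1):
        code = self.encoder(x)
        # the "channel": a narrow, possibly noisy bottleneck
        noisy_code = code + noise_std * torch.randn_like(code)
        return self.decoder(noisy_code)

x = torch.rand(8, 784)            # a batch of flattened images (the "source")
recon = TinyAutoencoder()(x)      # what arrives at the "destination"
print(recon.shape)                # torch.Size([8, 784])
```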
And in the context of information and communication, I want to talk about how compression relates to how we think about evaluation. I'm going to use JPEG as the example.
JPEG exploits human nature, in the sense that we are very sensitive to brightness but not so much to color. And there's an illusion that also illustrates this, where A and B are actually the same color, but we are basically unable to perceive it until we do this. And then, suddenly, it's like, oh, really? What's going on there? So JPEG does the same thing. Okay, we have the RGB color space to represent images with computers. We notice there's a diagonal that represents the brightness of the image. We can change into a different color space that separates color from brightness. And then we can downsample the channels that carry color, because we are actually not that sensitive to them, so we can remove them, or parts of them. Once we do that, this is an image where we can see the brightness and color components separated. Once we downsample, we can try to recreate the image. And this is an example where the original image and the image with downsampled color look the same to us, yet the second one carries something like 50% less information. There's more, too, Huffman coding and other steps, but the point is the same.
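The chroma-subsampling step he is describing can be sketched in a few lines. This is an illustration of the idea, not the real JPEG codec (which also does DCT, quantization, and Huffman coding); Pillow and NumPy are assumed, and the file name is a placeholder.

```python
# A minimal sketch of chroma subsampling: keep brightness (Y) at full resolution,
# throw away most of the color (Cb, Cr) detail, and most viewers can't tell.
import numpy as np
from PIL import Image

img = Image.open("photo.jpg").convert("YCbCr")   # move from RGB to a luma/chroma space
y, cb, cr = [np.asarray(c, dtype=np.float32) for c in img.split()]

def downsample_2x(channel):
    # keep every other sample, then stretch back up: chroma detail is discarded
    small = channel[::2, ::2]
    return np.kron(small, np.ones((2, 2)))[: channel.shape[0], : channel.shape[1]]

cb_ds, cr_ds = downsample_2x(cb), downsample_2x(cr)   # luma (Y) stays untouched

recon = Image.merge(
    "YCbCr", [Image.fromarray(c.astype(np.uint8)) for c in (y, cb_ds, cr_ds)]
).convert("RGB")
recon.save("photo_subsampled.png")   # to most eyes this looks identical to the original
```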
And then, the thing is, if you exploit the same idea for audio, what can we hear, what can we not hear, you do the same thing and you have MP3. And if you do the exact same thing across time, well, congrats, now you have MP4. It's all this one principle: let's exploit how we humans perceive the world.
But this made me think about myself, because I studied audiovisual systems engineering, which is engineering around all of these things: how microphones work, how speakers work. And it was just interesting to me that I was coding, I start deleting information, I know for a fact that I'm deleting information, and then I re-render the image and I see the same thing. Philosophers always tell you, oh, we are limited by our senses, but this was the first time that I was really seeing it: I am not seeing the difference. But then, if a lot of our data is the internet, right, we're scraping data from the internet, and a bunch of those images are also compressed, are we taking into account that perhaps our AIs are limited too? It's like we have some sort of contagion going on, of our flaws into the AI.
And then it gets trickier because, for instance, this is just a screenshot I took from a paper, I think it's called Clean-FID. FID, for those of you without context, is one of the standard metrics used to judge how well, for instance, diffusion models are reproducing images. But then you start adding JPEG artifacts, and the score is like, oh no, no, no, this is a horrible, horrible image. And perceptually, the four images are basically the same, yet the FID score says, no, no, this is really bad. So then, why are we using FID scores, or metrics along those lines, to decide whether a generative AI model is good or bad, right?
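A rough sketch of the experiment being described: re-encode the same images at decreasing JPEG quality and watch the score degrade, even though the images look the same to us. This assumes the clean-fid and Pillow packages; folder names are placeholders.

```python
# Re-encode a reference folder at several JPEG qualities and compare FID against the originals.
import os
from PIL import Image
from cleanfid import fid

src = "reference_images"
for quality in (95, 75, 50, 30):
    out = f"jpeg_q{quality}"
    os.makedirs(out, exist_ok=True)
    for name in os.listdir(src):
        Image.open(os.path.join(src, name)).convert("RGB").save(
            os.path.join(out, name.rsplit(".", 1)[0] + ".jpg"), quality=quality)
    # perceptually near-identical folders, yet the score blows up as quality drops
    print(quality, fid.compute_fid(src, out))
```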
So the thing is, sometimes I feel like we are focused on measuring just the things that are easy to measure. Prompt adherence with CLIP: how many objects are there? Is this blue? Is this red? Et cetera. But what about here? Oh, it's like, really bad, really bad generator, because that's not how a clock looks, and the sky makes no sense.
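For reference, the "easy to measure" check he means, CLIP-based prompt adherence, looks roughly like this. The checkpoint, prompts, and file name are illustrative.

```python
# CLIP prompt adherence: does the image match the text? Says nothing about whether it's any good.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("generated.png")
prompts = ["a blue cat", "a red cat"]
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**inputs)
# higher probability means the image "adheres" to that prompt
print(dict(zip(prompts, out.logits_per_image.softmax(dim=-1)[0].tolist())))
```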
And it's like, okay: not only are we limiting our AIs by our human perception; on top of that, we forget about the relativity of metrics, right? Like, no, actually, this is art, and this is great, and there's sometimes meaning behind the work, something the image is conveying that you only get if you're human. Like, oh, this is what the author is trying to tell me. But I feel like the metrics don't show that. And commercially and professionally, my job is, okay, how can we make a company that allows creatives, artists of all sorts, starting with imagery, then video, to better express themselves? But how are we supposed to do that if this is the state of the art, right?
Then a friend of mine, Cheng Lu, actually, he works at Midjourney, and he has great talks that you should all check out, told me, a little over a year ago, a quote that I just can't stop thinking about, which goes something like: hey man, if you think about it, predicting the car back when everything was horses is not that hard. And I was like, well, yeah, it's not that hard to say, oh, cars are the future. We have a thing that goes like this, we have horses that provide the energy, so you swap the horse for an engine, and that's essentially a car. Come on, how hard is that? You know what's hard to predict? Traffic. And I just kept thinking about it: as engineers, as researchers, as founders, what are the traffics that we're missing now? Because I feel like everyone's focused on, yeah, but you can, I don't know, transform from JSON to YAML. And I'm like, who cares? Dude, who cares? Or yes, it's important, but what kind of big picture are we all missing, right?
Then there's the myth of the Tower of Babel, where, in a nutshell, we want to go up and meet God, and God is like, no, I don't want that. So instead, I'm just going to confuse all of you, and then you're not going to be able to coordinate. Each one is going to speak a different language, and it's basically going to be impossible to keep the thing going. Which reminds me of standard infrastructure meetings with backend engineers: it's like, no, we should use Kubernetes, and it's all fighting and whatever, and nothing gets built. And I'm like, dude, God is winning. God damn it.
But then this makes me think: we just entered the age where you can have models that essentially solve translation, or solve it to a very high degree. So what happens now that we can all speak our own languages and yet communicate with each other? I'm already doing it. For instance, I sometimes do customer support manually for Krea, and I literally speak Japanese with some of my users, and I don't speak Japanese. I've learned a little, but I don't speak it. And I'm now able to provide an excellent founder-led (whatever that means) customer support experience to a country that I otherwise would be unable to serve. So I invite us all to think about what that really means, because it means, for instance, that we can now understand others better and transmit our own opinion to others better.
And the previous point I was making about the art, that's kind of an opinion, right? Evals are not just about "are there four cats here?". It's about "this cat is blue", and yeah, but is it blue or is it teal? What kind of blue? And I don't like this blue, and all of that. So, in a nutshell: how do we evolve our evals? If, in my opinion, this is bad, then I want metrics that take my opinion into account too. And then, okay, consider that I may be a visual learner. What that means is, maybe your evals should take into account how we humans perceive images. And also the nature of the data: it's all trained on JPEGs from the internet, so take the artifacts into account, take all of this into account while training on your data.
Okay, I guess the mandatory slide before the thank-you: a bunch of users, a bunch of money, we did all of that. We were eight people, now we're twelve. And this is an email that I set up today for high-priority applications, for anyone who wants to work on research around aesthetics, hyper-personalization, and scaling generative AI models in real time for multimedia (image, video, audio, 3D) across the globe. We have customers like those. That's it. Thank you. Oh, Q&A. Okay, perfect. Any questions?
Yeah. Okay, there are many points there. Can you reframe the question?
So the question, in a nutshell, is: are there perceptually aware metrics? Okay, I showed an example where the FID score changes a lot with JPEG artifacts. Are there metrics where it's almost the opposite, where the score barely changes and the metric is still good? There are some, and many of them are also used in traditional encoding techniques.
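Two metrics in that family, as a minimal sketch: SSIM, long used in codec evaluation, and the learned LPIPS distance. This assumes the scikit-image, lpips, and torch packages; file names are placeholders and the two images must be the same size.

```python
# SSIM and LPIPS: metrics that stay calm under mild, perceptually invisible degradation.
import numpy as np
import torch
import lpips
from PIL import Image
from skimage.metrics import structural_similarity as ssim

a = np.array(Image.open("original.png").convert("RGB"))
b = np.array(Image.open("jpeg_compressed.png").convert("RGB"))

# SSIM: closer to 1.0 means "looks the same"
print("SSIM:", ssim(a, b, channel_axis=2))

def to_tensor(x):
    # LPIPS expects float tensors shaped (N, 3, H, W) scaled to [-1, 1]
    return torch.from_numpy(x).permute(2, 0, 1).unsqueeze(0).float() / 127.5 - 1.0

loss_fn = lpips.LPIPS(net="alex")
# LPIPS: closer to 0 means perceptually identical, as judged by deep features
print("LPIPS:", loss_fn(to_tensor(a), to_tensor(b)).item())
```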
But in a way, I'm here to invite us all to start thinking about those. We can actually train one; I mean, it's called a classifier, right? Or a continuous one, a regressor. We can train it so that it understands what we mean: hey, I showed you these five images, and these five images are actually all good. And they can have all sorts of artifacts, not just JPEG artifacts. This is exactly where machine learning excels, when it's all about opinions. It's like: you'll know it when you see it. That's precisely the type of question that AI is amazing at.
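A minimal sketch of that idea: fit a tiny "taste" head on top of frozen image embeddings using your own good/bad labels. Everything here, the embedding model, the file names, the labels, is a placeholder assumption, not Krea's actual setup.

```python
# Train a small preference head on frozen CLIP features: "these five are good, these are not".
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(paths):
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        return clip.get_image_features(**inputs)   # frozen perceptual features

good = embed(["good_1.png", "good_2.png", "good_3.png", "good_4.png", "good_5.png"])
bad = embed(["bad_1.png", "bad_2.png", "bad_3.png"])
x = torch.cat([good, bad])
y = torch.cat([torch.ones(len(good)), torch.zeros(len(bad))])

head = nn.Linear(x.shape[1], 1)        # tiny "taste" head on top of the frozen embeddings
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.binary_cross_entropy_with_logits(head(x).squeeze(1), y)
    loss.backward()
    opt.step()

# score a new image: "you'll know it when you see it", learned from examples
print(torch.sigmoid(head(embed(["new_image.png"]))))
```

With more labeled data you would swap the binary labels for pairwise preferences or a continuous score, which is the "continuous classifier", i.e. regressor, mentioned in the answer.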