Perceptual Evaluations: Evals for Aesthetics — Diego Rodriguez, Krea.ai


Chapters

0:15 Introduction to Perceptual Evaluations
0:50 The Problem with Current AI Evaluations
2:16 Historical Context and Compression
5:14 Limitations in AI and Human-centric Metrics
8:00 Rethinking Evaluation and the Future of AI
12:44 Evaluating Our Evaluations
13:32 Krea's Role and Call to Action

Transcript

Okay, so hello everyone. My name is Diego Rodriguez. I'm the co-founder of Krea, a startup in the AI space like many others, in particular generative media, multimedia, multimodal, and all the buzzwords. But I come here mainly to tell a story about how we think about evaluations when we have to take human perception, human opinion, and aesthetics into the mix, right?

So I'm going to start with a very simple story. I put in an AI-generated image of a hand, and obviously it looks horrible. And then I ask o3: what do you think of this image? It thought for 17 seconds, obviously, tool calling, runs Python, analysis, OpenCV, goes crazy.

And then, after it charges me a few cents, it says something like, oh, it's mostly natural, but... And it's like, okay, we have what many people claim is basically AGI, and it is completely unable to answer a very simple question. And that's a surprising thing if you think about it, because we as humans, when we see that image, we just react so naturally, right?

We look at it and go, ugh, what is that? That's not natural. And I feel like that's precisely what AI models are being trained on: first, on human data; second, on human preference data; and third, in a way, limited by the data that we humans made based on our preconceived notions and perception and all of that.

So that's what this talk is about: what can we do better? And honestly, to ask ourselves some questions that I think are not being asked enough in the field. Cool. So, a tiny bit of history. We all know about Claude, or Claude Shannon, that is, the father of information theory.

And according to many, his master's thesis is one of the most important master's theses ever written; in it he laid the foundations for digital circuits, and then eventually communication and everything after. And to a degree, we can say even LLMs nowadays, right? If we fast forward. And I want to focus on the fact that we call his work foundational in information theory, but when he published it, it was actually called A Mathematical Theory of Communication.

He was always focused on communication. There's this diagram that appears in that work, and it's all about: this is the source, this is the channel, this is the destination, and there can be some noise in between. And as a founder of a company that focuses on media, it's interesting to realize these parallels between classic information theory and communication.

Let me see, did I put the image? Let's see. Well, I didn't put the image, but if you have any context around variational autoencoders or neural networks, you can squint at that diagram and be like, oh, is that a neural network? Right? And in the context of information and communication, I want to talk about how compression relates to how we think about evaluation, right?

And I'm going to talk, for example, about JPEG. JPEG exploits human nature in the sense that we are very sensitive to brightness, but not so much to color. And there's a famous illusion that speaks to this, where squares A and B are actually the same color, but we are basically unable to perceive it until we connect the two squares.

And then, suddenly, it's like, oh, really? What's going on there? Right? So JPEG does the same thing. Okay, we have the RGB color space to represent images with computers. We notice that there's a diagonal in that space, from black to white, that represents the brightness of the image. We can change into a different color space that separates color from brightness.

And then we can downsample the color channels, because we are actually not that sensitive to them. So we can remove that information, or parts of it. Once we do that, we have an image where the brightness and color components are separated. And once we downsample, we can try to recreate the image.

And this is an example of, basically, the original image next to the image with the downsampled color, and it looks the same to us. And the image is, like, 50% less information. Right? And there's other stuff, Huffman coding and more, but the point is the same. Right? And then, the thing is, if you exploit the same idea for audio: what can we hear?
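To make that pipeline concrete, here's a minimal sketch of the chroma-subsampling idea, assuming Pillow and NumPy are installed (the file names are placeholders): convert to YCbCr, downsample the two color channels 2x in each direction, upsample them back, and compare. The brightness channel, which carries almost all of what we notice, stays untouched.

```python
import numpy as np
from PIL import Image

# Load an image and move to YCbCr: Y = brightness, Cb/Cr = color.
img = Image.open("photo.png").convert("YCbCr")
y, cb, cr = img.split()

# Downsample the two chroma channels 2x in each direction (the "4:2:0"
# idea), then upsample back to full size. We throw away 75% of the
# color samples while keeping brightness at full resolution.
w, h = img.size
cb_low = cb.resize((w // 2, h // 2)).resize((w, h))
cr_low = cr.resize((w // 2, h // 2)).resize((w, h))

# Recombine and convert back to RGB for viewing.
degraded = Image.merge("YCbCr", (y, cb_low, cr_low)).convert("RGB")
degraded.save("photo_subsampled.png")

# Pixel-wise the images differ, but side by side they look the same.
diff = np.abs(np.asarray(img.convert("RGB"), dtype=int)
              - np.asarray(degraded, dtype=int))
print("max per-pixel difference:", diff.max())
```

If memory serves, Pillow's own JPEG encoder exposes the same knob through the `subsampling` argument to `save`, so real JPEG files do this for you.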

What can we not hear? Well, you do the same thing and you have MP3. And if you do the exact same thing across time, well, congrats, now you have MP4. It's all one principle: let's exploit how we humans perceive the world. Right? But this made me think about myself, because I studied audiovisual systems engineering, which is engineering around all of these things: how microphones work, how speakers work.
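As a crude illustration of that same principle in audio (this is nothing like real MP3 psychoacoustics, just the "delete what we can't perceive" idea, with made-up test tones): most adults hear little above roughly 16 kHz, so zeroing those frequency bins barely changes what we hear.

```python
import numpy as np

def drop_inaudible(signal: np.ndarray, sample_rate: int,
                   cutoff_hz: float = 16_000.0) -> np.ndarray:
    """Zero out frequency content above cutoff_hz and resynthesize."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    spectrum[freqs > cutoff_hz] = 0.0  # delete what most ears miss
    return np.fft.irfft(spectrum, n=len(signal))

# A 1-second test mix: 440 Hz (clearly audible) + 18 kHz (barely audible).
sr = 44_100
t = np.linspace(0.0, 1.0, sr, endpoint=False)
audio = np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 18_000 * t)

filtered = drop_inaudible(audio, sr)
print("fraction of energy removed:",
      1 - np.sum(filtered**2) / np.sum(audio**2))
```

A real codec uses masking models far subtler than a hard cutoff, but the punchline is the same: measurably different signal, perceptually the same experience.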

And it was just interesting to me: I was coding, I start deleting information, I know for a fact that I'm deleting information, and then I re-render the image, and I see the same thing. Philosophers always tell you, oh, we are limited by our senses.

But this is the first time that I was like, I'm seeing it. Right? Or rather, I am not seeing the difference. But then, if a lot of our data is the internet, right? Like, we're scraping data from the internet, and a bunch of those images are also compressed, right?

Like, are we taking into account that perhaps our AIs are limited too? Because we have some sort of contagion going on, of our flaws spreading into the AI. And then it gets more tricky, because, for instance, this is just a screenshot I took from a paper.

I think it's called Clean-FID. And the FID score, for all of you who don't have the context, is one of the standard metrics used to judge how well, for instance, diffusion models are reproducing a distribution of images. But then you start adding JPEG artifacts, and the score is like, oh, no, no, no, this is a horrible, horrible image.

And perceptually, the four images are basically the same. Yet the FID score is like, no, no, no, this is really bad. So then it's like, why are we using FID scores, or metrics along those lines, to decide whether a generative AI model is good or bad, right?
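You can reproduce that sensitivity yourself. A minimal sketch, assuming the `clean-fid` package (`pip install clean-fid`) and a folder of reference images; the folder names `real/` and `jpeg_q75/` are placeholders:

```python
import os
from PIL import Image
from cleanfid import fid

# Re-save the reference images as JPEGs with compression that is
# nearly invisible to a human viewer.
os.makedirs("jpeg_q75", exist_ok=True)
for name in os.listdir("real"):
    img = Image.open(os.path.join("real", name)).convert("RGB")
    img.save(os.path.join("jpeg_q75", name + ".jpg"), quality=75)

# FID between the originals and their JPEG copies. A perceptually
# faithful metric "should" be near zero here; in practice FID jumps.
# (FID is only meaningful with a reasonably large set of images.)
score = fid.compute_fid("real", "jpeg_q75")
print("FID(real, jpeg-compressed real):", score)
```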

So the thing is, sometimes I feel like we are focused on measuring just the things that are easy to measure, right? Prompt adherence with CLIP: how many objects are there? Is this blue? Is this red? Et cetera. But what about here? Oh, it's like, oh, no, really bad, really bad generator, because that's not how a clock looks, and the sky makes no sense. And it's like, okay, not only are we limiting our AIs by our human perception; on top of that, we forget about the relativity of metrics, right?

Like, no, actually, this is art, and this is great, and sometimes there's meaning behind the work, something the image is conveying, that you only get if you're human, right? Like, oh, this is what the author is trying to tell me. But I feel like the metrics don't show that. And commercially and professionally, my job is kind of like: how can we make a company that allows creatives, artists of all sorts, to better express themselves? We can start with imagery, we can start with video. But how are we supposed to do that if this is the state of the art, right?
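As an aside, this is roughly what "prompt adherence with CLIP" looks like in code. A minimal sketch, assuming the Hugging Face `transformers` package and a local `image.png` (both placeholders): it scores how well an image matches each caption, and it will happily rank literal correctness over artistic intent.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("image.png").convert("RGB")
captions = [
    "a melting clock in a desert",            # literal description
    "a meditation on the fluidity of time",   # the artist's intent
]

inputs = processor(text=captions, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Similarity-based logits: higher means "better adherence" per CLIP.
for caption, score in zip(captions, outputs.logits_per_image[0]):
    print(f"{score.item():6.2f}  {caption}")
```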

Then a friend of mine, Cheng Lu, he works at Midjourney, he has great talks that you should all check out, told me once, a little over a year ago, a quote that I just can't stop thinking about, which goes something like: hey, man, if you think about it, predicting the car back when everything was horses is not that hard.

And I was like, well, yeah, it's not that hard to say, oh, cars are the future: we have a carriage that rolls along, we have horses that provide the power, so you swap the horse for an engine, and that's essentially a car, right?

It's like, come on, how hard is that? You know what's hard to predict? Traffic. Right? And then I just kept thinking about it: oh, man, as engineers, as researchers, as founders, what are the traffics that we're missing now? Because I feel like everyone's focused on, yeah, but you can, I don't know, transform from JSON to YAML.

And I'm like, who cares? Dude, who cares? Or, okay, yes, it's important, right? But what kind of big picture are we all missing? Then there's the myth of the Tower of Babel, where, in a nutshell, humanity wants to build a tower and go meet God, and God is like, no, I don't want that.

So instead, I'm just going to confuse all of you, and then you're not going to be able to coordinate. Each one is going to speak a different language, and it's basically going to be impossible to keep the thing going. Which reminds me of standard infrastructure meetings with backend engineers.

It's like, no, we should use Kubernetes, and it's just all fighting and whatever, and nothing gets built. And I'm like, dude, God is winning. God damn it. But then this makes me think: we just entered the age where you can have models.

Essentially, they solve translation, right? Or they solve it to a very high degree. So what happens now that we can all speak our own languages, yet at the same time communicate with each other? I'm already doing it. For instance, I sometimes do customer support manually for Krea, and I literally speak Japanese with some of my users, and I don't speak Japanese.

I've learned a little bit, but I don't speak it. And I'm now able to provide an excellent founder-led, whatever that means, level of customer support to a country that I would otherwise be unable to serve, right? And so I invite us all to think about what that really means. Because this, for instance, means that we can now understand others better, or transmit our own opinion to them better.

And on the previous point I was making about art, that's kind of like an opinion, right? Evals are not just about, are there four cats here? It's about, this cat is blue. And it's like, yeah, but is it blue or is it teal? What kind of blue?

And I don't like this blue, and all of that. So, in a nutshell, it's like: how do we evolve our evals, right? If, in my opinion, this output is bad, then I want metrics that take my opinion into account too. And then, okay, consider that I may be a visual learner.

What that means is, maybe your evals should take into account how we humans perceive images, right? And also the nature of the data, such as: oh, it's all trained on JPEGs from the internet. So take the artifacts into account, take all of this into account while training and evaluating your models.
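One concrete direction, as a minimal sketch: compare a pixel-level metric (PSNR) with a perceptually motivated one (SSIM) on a JPEG round trip. This assumes scikit-image and Pillow, and `photo.png` is a placeholder; perceptual-similarity models such as LPIPS push the same idea further.

```python
import io
import numpy as np
from PIL import Image
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

original = np.asarray(Image.open("photo.png").convert("RGB"))

# Round-trip through JPEG in memory at a quality where artifacts are
# nearly invisible to a human viewer.
buf = io.BytesIO()
Image.fromarray(original).save(buf, format="JPEG", quality=85)
compressed = np.asarray(Image.open(buf).convert("RGB"))

# PSNR reacts to every changed pixel; SSIM is built around local
# structure, closer to (though still far from) how we actually look.
print("PSNR:", peak_signal_noise_ratio(original, compressed))
print("SSIM:", structural_similarity(original, compressed, channel_axis=-1))
```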

Okay. I guess the mandatory slide before the thank you: a bunch of users, a bunch of money, we did all of that. We were eight people; now we're twelve. And this is an email address that I set up today for high-priority applications, for anyone who wants to work on research around aesthetics, hyper-personalization, and scaling generative AI models in real time for multimedia: image, video, audio, 3D, across the globe.

We have customers like those. That's it. Thank you. Oh, Q&A. Okay. Perfect. Any questions? [Audience question.] Okay, there are many points there. Can you reframe the question? Yeah. So the question, in a nutshell, is: are there perceptually aware metrics, right?

Like, okay, I showed an example where the FID score changes a lot with JPEG artifacts. Are there metrics where it's almost the opposite, where the score barely changes and stays good as long as the image still looks good? There are some, and many of these are also used in traditional encoding techniques.

But in a way, I'm here to invite us all to start thinking about those, because we can actually train one. I mean, it's called a classifier, right? Or a regressor, for the continuous case. We can train it so that it understands what we mean.

And it's like, hey, I showed you these five images; these five images are actually all good. And they can have all sorts of artifacts, not just JPEG artifacts. And this is exactly where machine learning excels, right? When it's all about opinions. It's like, I can't define it, but you will know it when you see it.

That's precisely the type of question that AI is amazing at.
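To make that last point concrete, here's a minimal sketch of such a "taste model", assuming the `transformers` and scikit-learn packages: embed images with CLIP, then fit a linear probe on good/bad labels you provide. The file names are hypothetical, and this mirrors the common aesthetic-predictor recipe (a linear head on CLIP embeddings), not any specific Krea system.

```python
import numpy as np
import torch
from PIL import Image
from sklearn.linear_model import LogisticRegression
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(paths: list[str]) -> np.ndarray:
    """L2-normalized CLIP image embeddings, one row per image."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return (feats / feats.norm(dim=-1, keepdim=True)).numpy()

# Hypothetical labeled data: images you consider good vs. bad,
# regardless of which artifacts they happen to contain.
good = ["good_1.png", "good_2.png", "good_3.png"]
bad = ["bad_1.png", "bad_2.png", "bad_3.png"]

X = embed(good + bad)
y = np.array([1] * len(good) + [0] * len(bad))

taste = LogisticRegression(max_iter=1000).fit(X, y)

# Score a new image: probability that *you* would call it good.
print("p(good):", taste.predict_proba(embed(["new.png"]))[0, 1])
```

With a handful of labeled examples per person, a probe like this starts to turn "you will know it when you see it" into a number you can actually evaluate against.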