
Perceptual Evaluations: Evals for Aesthetics — Diego Rodriguez, Krea.ai


Chapters

0:15 Introduction to Perceptual Evaluations
0:50 The Problem with Current AI Evaluations
2:16 Historical Context and Compression
5:14 Limitations in AI and Human-centric Metrics
8:00 Rethinking Evaluation and the Future of AI
12:44 Evaluating Our Evaluations
13:32 Krea's Role and Call to Action

Whisper Transcript | Transcript Only Page

00:00:00.000 | Okay, so hello everyone. My name is Diego Rodriguez. I'm the co-founder of Krea, a startup
00:00:21.920 | in the AI space like many others, in particular generative media, multimedia, multimodal, and all
00:00:27.920 | the buzzwords. But I come here mainly to tell a story about how we think about evaluations when
00:00:40.800 | we have to take into account human perception and human opinion and aesthetics into the mix, right?
00:00:48.320 | So I'm going to start with a very simple story. It's like I put an AI-generated image of a hand.
00:00:54.240 | Obviously, it looks horrible. And then I ask o3, what do you think of this image? Then it thought
00:01:00.240 | for 17 seconds, obviously, tool calling, runs Python, analysis, OpenCV, goes crazy. And then, after it
00:01:06.640 | charges me a few cents, it comes back with something like, oh, it's mostly natural, but, like...
00:01:12.960 | And it's like, okay, we have like what many people claim is basically AGI, and it is completely
00:01:20.640 | unable to answer a very simple question. And, like, that's a surprising thing if you think about it,
00:01:31.200 | because we as humans, when people see that image, it's like we just react so naturally, right?
00:01:36.640 | Against that, it's like, ugh, what is that? Like, that's not natural. And I feel like that's precisely
00:01:36.640 | what AI models are being trained on. First, on human data, right? Second, human preference data. And third,
00:01:51.440 | like, in a way limited by the data that we humans made based on our preconceived notions and perception
00:02:00.000 | and all of that. So that's what this talk is about, about what can we do better? And honestly,
00:02:09.120 | to ask ourselves some questions that I think are not being asked enough in the field.
00:02:13.840 | Cool. So tiny, tiny bit of history. There's a... We all know about Claude, or Claude Shannon, that is,
00:02:24.720 | the father of information theory. And according to many, his master's thesis is one of the most
00:02:32.080 | important master's theses in the world, where he laid the foundations for digital circuits, and then
00:02:37.600 | eventually communication and all. And to a degree, we can say that even LLMs nowadays, right? If we fast forward.
00:02:47.280 | And I want to focus on the fact that we call his work foundational in information theory, well,
00:02:59.120 | when he published it, it was actually called 'A Mathematical Theory of Communication.'
00:03:06.880 | And he was always focused on communication. There's this image that appears in that work, and it's all about,
00:03:15.360 | okay, this is the source, this is the channel, this is the destination, there can be some noise there.
00:03:19.680 | And as a founder of a company that is focusing on media, to me, it's interesting to realize, like,
00:03:33.120 | these parallels between classic information theory and communication. Let me see, did I put the image?
00:03:40.880 | Let's see. Well, if you, I didn't put the image, but if you have any context around variational
00:03:47.120 | autoencoders or neural networks and whatever, you can squint and be like, oh, is that a neural network?
00:03:51.680 | Right? And in the context of information and communication, I want to talk about how compression is
00:04:03.360 | going to be related to how we think about evaluation, right? And I'm going to talk, for example, about JPEG.
00:04:12.560 | JPEG exploits, like, human nature in the sense that we are very sensitive to brightness, but not so much
00:04:22.880 | to color. And this is an illusion that also talks about that, where A and B are actually the same color,
00:04:27.760 | but we are basically unable to perceive it until we do this. And then, suddenly, it's like, oh,
00:04:32.320 | really? And it's kind of like, what's going on there? Right? And so, JPEG just does the same thing,
00:04:39.520 | where, okay, we have RGB color space to represent images with computers. We notice that there's a
00:04:46.640 | diagonal that represents the brightness of the images. We can change into a different color space
00:04:52.000 | that separates color versus brightness. And then, we can downsample the channels around color,
00:04:58.480 | because we are actually not that sensitive to it. So, we can remove that, or parts of it. And then,
00:05:04.560 | once we do that, this is an image where we can see the brightness and color components separated.
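As a rough, minimal sketch of the chroma-subsampling idea being described here (not the full JPEG pipeline), assuming Pillow is available and "photo.jpg" is just a placeholder path, the core move looks roughly like this:

```python
# Separate brightness (luma) from color (chroma), throw away chroma resolution,
# and rebuild the image. This is only the chroma-subsampling step, not all of JPEG.
from PIL import Image

img = Image.open("photo.jpg").convert("YCbCr")   # Y = brightness, Cb/Cr = color
y, cb, cr = img.split()

# Keep full-resolution luma, but store the two chroma channels at half the width
# and height, i.e. a quarter of the original samples.
half = (img.width // 2, img.height // 2)
cb_small = cb.resize(half)
cr_small = cr.resize(half)

# Upsample the degraded chroma back to full size and recombine with the luma.
rebuilt = Image.merge(
    "YCbCr",
    (y, cb_small.resize(img.size), cr_small.resize(img.size)),
).convert("RGB")
rebuilt.save("photo_chroma_subsampled.png")
```

For most photos the rebuilt image is very hard to tell apart from the original, even though most of the color samples were discarded.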
00:05:11.440 | Once we downsample, we can try to recreate the image. And this is an example of, like, basically,
00:05:17.600 | the original image and then the image with the downsampled color look the same to us. And the image has, like,
00:05:23.600 | 50% less information. Right? And there's other stuff, Huffman coding and more steps, but, like,
00:05:33.280 | the point is the same. Right? And then, the thing is, if you exploit the same for audio, like, what can we
00:05:41.680 | hear? What can we not hear? Well, you do the same and we have MP3. And then, if you do the exact same
00:05:47.680 | thing across time, well, congrats, now you have MP4. It's, like, it's all this principle of, like, let's
00:05:53.760 | exploit how we humans perceive the world. Right? But this made me think about myself because I studied
00:06:02.000 | artificial systems engineering, which is engineering around all of these, how microphones work, how speakers
00:06:07.920 | work. And it was just interesting to me that I was coding, I start deleting information, I know for a
00:06:15.760 | fact that I'm deleting information, and yet, when I re-render the image, I see the same thing. It's, like,
00:06:21.360 | like, philosophers always tell you about, like, oh, we are limited by our senses. But, like, this is the
00:06:26.080 | first time that I was, like, I'm seeing it. Right? Like, I am not seeing the difference. Um, but then,
00:06:33.680 | if a lot of our data is the internet, right? Like, we're scraping data from the internet,
00:06:42.000 | a bunch of those images are also compressed, right? Like, are we taking into account that perhaps our AIs
00:06:47.600 | are limited too? Because we have, like, some sort of contagion going on of our flaws into the AI.
00:06:57.360 | Um, and then it gets more tricky because, for instance, uh, this is a, just a screenshot I took
00:07:05.440 | from a paper. I think it's called Clean-FID. And the FID score, for all of you who don't have context, is
00:07:13.200 | one of the standard metrics used for, uh, how well, for instance, diffusion models are reproducing an
00:07:19.280 | image. But then you start adding JPEG artifacts, and the score is, like, oh, no, no, no, this is horrible,
00:07:23.600 | horrible image. And it's, like, perceptually, the four images are basically the same. Yet, the FID
00:07:30.160 | score is, like, no, no, no, this is really bad. So then it's, like, why are we using FID scores or metrics
00:07:36.080 | along those lines to decide, oh, this generative AI model is good or bad, right?
00:07:42.640 | Um, so the thing is, sometimes I feel like we are focused on measuring just things that are easy
00:07:49.280 | to measure, right? Like, prompt adherence with CLIP, how many objects are there? Is this blue? Is this
00:07:56.880 | red? Et cetera. But what about here? Oh, it's, like, oh, no, really bad, really bad generator, because
00:08:06.000 | that's not how a clock looks, and the sky, that makes no sense, and it's, like, okay,
00:08:10.880 | not only are we limiting our AIs by our human perceptions, uh, on top of that, we forget about
00:08:21.920 | the relativity of metrics, right? Like, no, actually, this is art, and this is great, and, and, and there's
00:08:30.240 | sometimes meaning behind the work, like, the image is conveying it, but only if you're human do
00:08:38.480 | you get it, right? Like, oh, this is what the author is trying to tell me, but I feel like the metrics
00:08:43.120 | don't show that, and kind of, like, commercially and professionally, my job is kind of, like, okay,
00:08:48.640 | how can we make a company that allows creatives, artists of all sorts, uh, we can start with imagery,
00:08:58.080 | we can start with video, but to better express themselves, but how are we supposed to do that if
00:09:03.040 | this is kind of, like, the state of the art, right? Um, then, a friend of mine,
00:09:10.160 | uh, Cheng Lu, actually, he works at Midjourney, and he has, like, great talks
00:09:19.840 | that you should all check out, but he told me once, a little bit over a year ago, a quote that I just can't
00:09:26.480 | stop thinking about, which goes something, like, hey, man, if you think about it, like,
00:09:32.960 | predicting the car back when everything was horses, it's not that hard. I was like, well, like, yeah,
00:09:39.760 | it's not that hard to do, like, oh, cars are the future, and whatever, it's like, we have,
00:09:43.120 | we have a thing that goes like this, we have horses that make energy, so you swap the thing for the
00:09:48.240 | engine, that's essentially a car, right? It's like, come on, how hard is that? It's like, you know,
00:09:52.560 | what's hard to predict? Traffic. Right? And, and then I just kept thinking about it, I was like, oh, man,
00:09:59.600 | like, as engineers, as researchers, as founders, what are the traffics that we're missing now?
00:10:06.320 | Because I feel like everyone's focused on, like, yeah, but you can, you can, I don't know, transform from
00:10:10.800 | JSON to YAML. I'm like, who cares? Like, dude, who cares? Or yes, it's important, right? Like, but
00:10:16.800 | what kind of big picture are we all missing, right? Then he talks about, well, you know, the myth of the
00:10:27.600 | Tower of Babel, where in a nutshell, it's like,
00:10:36.480 | humanity says, we want to go up and meet God, and then God is like, no, I don't want that. So instead,
00:10:45.200 | I'm just going to confuse all of you, and then you're not going to be able to coordinate. And then
00:10:51.440 | each one is going to speak a different language, and then it's just basically going to be impossible to
00:10:56.240 | keep the thing going, which, like, reminds me of, like, standard infrastructure meetings with backend
00:11:02.400 | engineers. It's like, no, we should use Kubernetes. And it's like, it's just all fighting and whatever,
00:11:07.040 | and nothing gets built. And I'm like, dude, God is winning. God damn it. But then this makes me think
00:11:15.200 | about, like, we just entered the age where you can have models that essentially solve
00:11:26.800 | translation, right? Or they solve it to a very high degree. So what happens now that we can all speak
00:11:35.360 | our own languages, yet at the same time communicate with each other? I'm already doing it. For instance,
00:11:40.240 | I sometimes do customer support manually for Krea, and I literally speak Japanese with some of my users,
00:11:47.120 | and I don't speak Japanese. I learn a little bit, but I don't speak it. And I'm now able to provide an
00:11:53.920 | excellent founder-led, whatever that means, level of customer support to a country that I otherwise would
00:12:01.360 | be unable to serve, right? And so I invite us all to think about what that really means. Because this,
00:12:16.160 | for instance, means that we can now understand better or transmit our own opinion better to others.
00:12:24.160 | And on the previous point that I was talking about with the art, that's kind of like an opinion, right?
00:12:31.520 | Like, evals are not just about, are there four cats here. It's about, this cat is blue. And it's like, yeah,
00:12:38.960 | but is it blue or is it teal? What kind of blue? And I don't like this blue and all of that. So
00:12:46.400 | like in a nutshell, it's like, how do we evolve our evals, right? Like, if in my
00:12:54.560 | opinion this is bad, then I want metrics that take into account my opinion too. And then it's like, okay,
00:13:02.720 | consider that I may be a visual learner. What that means is, like, maybe your evals should take into
00:13:12.160 | account how we humans perceive images, right? And also the nature of the data, such as, oh,
00:13:21.840 | it's all trained on JPEG on the internet. So take into account the artifacts, take into account,
00:13:27.280 | uh, like, all of these when training on your data.
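One small, concrete way to read "take the artifacts into account" is to add JPEG round-trips as an augmentation, so whatever you train (a model, or a learned metric) sees the same compression the scraped data already carries. A sketch, assuming Pillow, with the quality range as an arbitrary choice:

```python
# Randomly re-encode an image through JPEG at a varied quality and decode it back,
# so downstream models/metrics become robust to compression artifacts.
import io
import random

from PIL import Image

def random_jpeg_roundtrip(img: Image.Image, quality_range=(30, 95)) -> Image.Image:
    """Re-encode a PIL image as JPEG at a random quality and decode it back."""
    quality = random.randint(*quality_range)
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

# Usage: drop it into an existing augmentation pipeline ("sample.png" is a placeholder).
augmented = random_jpeg_roundtrip(Image.open("sample.png"))
```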
00:13:35.360 | Um, okay, I guess the mandatory slide before the thank you. A bunch of users, a bunch of money. We did all of that. We're eight people. Now we're 12.
00:13:41.040 | And this is an email that I set up today for high priority applications, uh, for anyone who wants to
00:13:47.200 | work on research around, uh, aesthetics, uh, hyper-personalization, scaling generative AI
00:13:53.840 | models in real time for multimedia, image, video, audio, 3D, uh, across the globe. Uh, we have customers
00:14:00.960 | like those, um, that's it. Thank you. Oh, Q and A. Okay. Perfect. Any questions? Uh,
00:14:13.360 | Yeah. Okay. There are many points there. Can you, like, reframe the question? Like...
00:14:38.720 | Yeah. Yeah. Yeah. Yeah.
00:15:08.560 | So, so the question, like, in a nutshell is like, are there, uh, perceptually aware metrics,
00:15:18.240 | right? Like, okay, I showed an example of the FID score. It changes a lot with JPEG artifacts.
00:15:24.480 | Are there metrics that are almost the opposite, that barely change, and the metric is still good?
00:15:29.360 | Like, there are some, and many of these are also used in traditional, uh, encoding techniques.
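There are indeed a few such metrics floating around (SSIM and its variants, and learned ones like LPIPS). A rough sketch of how you might compare them against a plain pixel error, assuming scikit-image and the lpips package are installed and "image.png" is a placeholder path:

```python
# Compare a plain pixel error (MSE) with SSIM and LPIPS on an image vs. its
# JPEG round-trip; the perceptual scores typically stay high/low where MSE jumps.
import io

import numpy as np
import torch
import lpips
from PIL import Image
from skimage.metrics import structural_similarity

original = Image.open("image.png").convert("RGB")
buf = io.BytesIO()
original.save(buf, format="JPEG", quality=40)
compressed = Image.open(buf).convert("RGB")

a = np.asarray(original, dtype=np.float32) / 255.0
b = np.asarray(compressed, dtype=np.float32) / 255.0

mse = float(np.mean((a - b) ** 2))                               # pixel-level error
ssim = structural_similarity(a, b, channel_axis=-1, data_range=1.0)

to_tensor = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None] * 2 - 1  # [-1, 1], (1, 3, H, W)
perceptual = lpips.LPIPS(net="alex")                             # learned perceptual metric
lpips_dist = perceptual(to_tensor(a), to_tensor(b)).item()

print(f"MSE={mse:.5f}  SSIM={ssim:.3f}  LPIPS={lpips_dist:.3f}")
```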
00:15:37.840 | But in a way I'm here to invite us all to start thinking about those, like,
00:15:43.680 | we can actually train, uh, I mean, it's called a classifier, right?
00:15:56.320 | Or a continuous classifier. We can train one so that it understands what we mean. And it's like,
00:16:01.440 | hey, I showed you these five images. These five images are actually all good. And then they can have
00:16:05.600 | all sorts of artifacts, not just JPEG artifacts. And this is exactly where machine learning excels,
00:16:11.200 | right? When it's all about opinions. And it's like, you know what,
00:16:16.000 | you will know it when you see it. That's precisely the type of question that AI is amazing at.
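To make that last idea concrete, here is a minimal sketch of what such a preference classifier could look like, assuming CLIP image embeddings via Hugging Face transformers plus scikit-learn; all file names are placeholders, and a real aesthetic model would of course need far more labeled examples:

```python
# Embed images with CLIP and fit a tiny "I like it" model on a handful of
# good/bad examples, then score new images with it.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from sklearn.linear_model import LogisticRegression

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(paths):
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return (feats / feats.norm(dim=-1, keepdim=True)).numpy()   # unit-norm features

good = ["good_1.png", "good_2.png", "good_3.png", "good_4.png", "good_5.png"]
bad = ["bad_1.png", "bad_2.png", "bad_3.png", "bad_4.png", "bad_5.png"]

X = embed(good + bad)
y = [1] * len(good) + [0] * len(bad)

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict_proba(embed(["new_image.png"]))[:, 1])        # closer to 1 = "good"
```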