Qwen Image: SOTA text rendering + 4o-imagegen-level Editing Open Weights MMDiT

Chapters
0:00 Introduction to the Qwen Image Paper Review
0:40 Introduction to Qwen Image
1:27 Progressive Training Strategy
1:48 Model Architecture
10:43 Diffusion Process
12:14 Data Collection and Curation
14:51 Seven-Stage Filtering Process
24:27 Data Annotation with Qwen
27:51 Data Synthesis
34:03 Pre-training and Optimization
40:46 Post-training and Reinforcement Learning (RL)
43:38 Multimodal Image Generation
44:06 Performance and Benchmarks
49:19 Live Testing in Other Languages
57:43 Conclusion and Practical Applications
00:00:00.960 |
Yeah, okay. So, hello everyone. This is the Qwen Image paper, or technical report rather. 00:00:06.640 |
So, let's get started. So, wait, wait, wait. We have a volunteer for next week, Venki. 00:00:13.600 |
Yeah, Venki. Awesome. Let's go. Are you on the Discord? 00:00:20.480 |
I think I'm on the Discord. If not, I'll ask my friend. 00:00:25.840 |
Yeah, I mean, just announce what paper you're going through next week, and just drop the link 00:00:30.800 |
to the paper so that people can pre-read this. Super excited for you to share, Venki. Looking 00:00:36.720 |
forward to it. Thank you. Thank you, you too. Yeah, cool. Okay. So, yeah. So, we have Qwen Image. 00:00:45.680 |
So, it's a diffusion image generation model. That's pretty nice. And throughout the paper, 00:00:55.120 |
they talk about how it's a diffusion model, obviously, but they kind of put special emphasis 00:00:58.640 |
on how the model does well in Chinese. Or, like, logographic languages, including Chinese and others. 00:01:06.960 |
And also, we kind of get to see how they curate their data, and also how they label it, 00:01:12.720 |
and the different techniques that they use. Yeah, I find that pretty interesting. So, let's see. 00:01:17.680 |
So, it's Qwen Image. So, it achieves significant advances in complex text rendering and precise image 00:01:23.680 |
editing. That's pretty cool. They use a progressive training strategy. So, yeah. So, with this, they use, 00:01:28.640 |
essentially, curriculum learning. Like, for a stage, they train the model on, like, very, very low 00:01:33.440 |
resolution/quality images, like 256 pixel, or, like, 256 by 256 images. And they kind of, like, 00:01:40.240 |
progress. Like, they make the images, like, sharper and sharper, which I find interesting. So, 00:01:46.800 |
that's a form of curriculum learning. Let's see. English, Chinese. And they also use VAE, like, 00:01:53.280 |
a variational autoencoder, and also a language model, or, like, a visual language model, to, 00:01:58.800 |
like, to kind of have, like, a system one, system two thing, where, like, they use the VAE encoder to 00:02:03.520 |
encode low-level, like, physical or, like, spatial details within the image. And they use their language 00:02:09.600 |
model, like, Qwen2.5-VL to encode, like, the more, like, semantic part of the image. So, I kind of 00:02:17.600 |
found that interesting also. So, let's see. There's nothing new here. These are a bunch of benchmarks. 00:02:22.720 |
You guys can read it if you want. But I'll kind of skip them. 00:02:31.520 |
Yeah, the images are worth looking at at some point later. You can refer to them, 00:02:39.040 |
or pull it up yourself separately, you know? But, like, it's editing and generation that, 00:02:45.760 |
and both are, like, to a level I've never seen before. 00:02:49.360 |
Yeah, and if people are interested, I can just, like, go back after. 00:02:53.120 |
I think editing and stuff, it's been done before just in separate models, right? Like, 00:03:00.560 |
you need, like, a context consistency model. But, yes, very cool. I thought the first page of images 00:03:08.240 |
was very cool. Like, this one with the whiteboard, you see how, like, the whiteboard is at an angle, 00:03:14.000 |
and then it has reflections, and the text is at an angle and not directly straight? I was like, 00:03:21.520 |
Qwen dropped a teaser about edit today. This is an editing model, by the way. It's very heavy on handwriting. 00:03:28.800 |
Okay, let's just continue to. I think you should read the descriptions of all these as well, 00:03:36.240 |
for what they show, by the way, for people following along. Oh, shit. I'll drop the paper as well. 00:03:40.400 |
Yeah, so let's see. Intro. So here they just talk about the challenges. They talk about, like, 00:03:50.560 |
aligning model outputs with complex prompts, and also, like, just essentially making the text look 00:03:58.160 |
nice. That's kind of what they're talking about here. And, like, sometimes they talk about how people 00:04:03.600 |
have difficulty modifying images. Like, if someone has, like, a pose, or if they want someone to strike, 00:04:09.120 |
like, a different pose, they want to edit the image. Like, the person will kind of strike the pose, 00:04:13.520 |
but the background will kind of, like, lose coherence. So they kind of, like, they talk about 00:04:17.600 |
that as, like, some of the problems that they're trying to solve with the making of Qwen Image. So they also 00:04:23.200 |
put, like, a ton of effort into, like, data engineering right here, which we'll kind of, 00:04:27.600 |
like, see in the paper. So progressive learning, I already talked about that. Multitask training 00:04:32.880 |
paradigms, we'll talk about that. And they also talk about, like, how they use GPUs. They have, like, 00:04:37.200 |
a really interesting structure there. So let's see. Yeah, we'll get into all of this. Yeah. Okay. So 00:04:47.200 |
here's the architecture. So for this, this is a diffusion model. So, like, they'll inject noise 00:04:53.200 |
into it, like, when they're training. When they're training, they'll, like, they'll patch some of the 00:04:57.280 |
images, and they'll train the model to, like, essentially reconstruct it. And what they do here 00:05:05.440 |
is they use, like, well, we'll talk about it here, but they'll use, like, the language model 00:05:11.360 |
to maintain, like, global coherence. Like, they'll say, like, oh, this image is supposed to be about 00:05:16.080 |
this image. And their autoencoder kind of encodes the lower-level details. So they have, like, 00:05:23.600 |
it's composed of 60 of these, like, transformer blocks. I mean, not transformer. Yeah, the diffusion 00:05:28.080 |
transformer blocks. And let's see. So there's nothing special here. Or you can look at this. 00:05:32.400 |
But they use QKNorm. And they also, like, they made their own positional encoding, like, mechanism, 00:05:40.720 |
which is MSRoPE, which we'll also get into. Okay, so let's see. So yeah, so they have three 00:05:46.960 |
components. So they have, like, a multimodal language model. That serves as the condition encoder. And it 00:05:51.360 |
extracts the features from the textual inputs. This is, like, the Qwen2.5-VL part that I was talking about. 00:05:56.080 |
They also have, like, the VAE. So it compresses images. Like, it just extracts the physical features, 00:06:02.240 |
quote-unquote, of the image. And they'll also have, like, the multimodal diffusion transformer, 00:06:06.080 |
which is the actual diffusion part of the diffusion model. Let's see. So they use, 00:06:11.440 |
they talk about how they use Qwen2.5-VL. Let's see. So they say that it has, like, 00:06:17.440 |
they have three key reasons. So the language and visual spaces of Qwen2.5-VL have already been aligned. 00:06:22.000 |
And Qwen2.5-VL retains strong language modeling capabilities without significant degradation 00:06:27.680 |
compared to other language models. So I tried to, like, look up and, like, I tried to see what they're 00:06:32.240 |
talking about with this part. I didn't really, I wasn't really able to find what they were talking about. 00:06:35.760 |
So if anyone knows, like, I would, I would definitely like to know that. But yeah, 00:06:40.160 |
so the third reason is that Qwen2.5-VL supports multimodal inputs, so both image and text. Let's see. 00:06:46.400 |
Oh, yeah. And they use, they use a latent space of, or latent of the last layer's hidden state from 00:06:54.800 |
Qwen2.5-VL as the backbone. So they use that to, like, represent the user's input. So that's pretty 00:07:01.120 |
interesting. Let's see. So as for the VAE, let's see. So they train an image VAE with 2D convolutions, 00:07:08.400 |
or that's usually how this works. They usually train a VAE with 2D convolutions on a massive image 00:07:13.200 |
dataset to obtain a high quality image representation. But what they do differently is they use a single 00:07:19.040 |
encoder, dual decoder architecture. So they use a shared encoder. Oh, no. Shared encoder compatible with 00:07:24.320 |
images and videos alongside separate specialized decoders for each modality. So that's what they do 00:07:29.600 |
differently. And let's see what they do. Oh, they also, like, they also collected a lot of text-rich 00:07:36.160 |
images. Like, they talk about here how it's, like, PDFs, PowerPoints, alongside, like, synthetic graphic, 00:07:41.840 |
or synthetic paragraphs, and, like, covering different languages. And they use this to essentially to 00:07:48.800 |
train the model, like, later with post-training on RL. They use that to, like, train the model on 00:07:55.280 |
how to actually follow instructions. Like, oh, if I want to make a PowerPoint with, like, 00:07:58.160 |
this property and this property, they'll, like, use some of those, like, some of those documents. 00:08:03.440 |
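As a rough illustration of the single-encoder, dual-decoder VAE idea described a moment ago, here is a minimal PyTorch sketch. The layer choices, channel sizes, and module names are invented for clarity; the real VAE is much deeper, and the video path would use temporal layers rather than this frame-wise toy.

```python
# Hypothetical sketch of a single-encoder, dual-decoder VAE (sizes/names invented).
import torch
import torch.nn as nn

class SharedEncoderVAE(nn.Module):
    def __init__(self, latent_channels: int = 16):
        super().__init__()
        # One encoder shared by images and videos (here: a toy 2D conv stack;
        # videos would be handled frame-wise or with temporal layers in practice).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 2 * latent_channels, 3, stride=2, padding=1),  # mean + logvar
        )
        # Separate specialized decoders per modality.
        self.image_decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
        )
        self.video_decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
        )

    def encode(self, x):
        mean, logvar = self.encoder(x).chunk(2, dim=1)
        return mean + torch.exp(0.5 * logvar) * torch.randn_like(mean)  # reparameterize

    def forward(self, x, modality: str = "image"):
        z = self.encode(x)                               # shared latent space
        decoder = self.image_decoder if modality == "image" else self.video_decoder
        return decoder(z)
```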
Let's see. They do that. Let's see. Yeah. And for the diffusion part, they, let's see, 00:08:12.160 |
they use a multimodal diffusion transformer. So they talk about how they have, like, their, 00:08:16.320 |
like, their MSRoPE, which is Multimodal Scalable RoPE. So, like, the reason, well, I'll just read it 00:08:23.920 |
first. But let's see. So in the traditional, like, diffusion block, text tokens are directly concatenated 00:08:29.120 |
after the flattened image positional embedding. So they're talking about, like, this. So, like, 00:08:34.800 |
if your image is, like, if your image is split into nine pieces, then they'll, like, they'll just, 00:08:39.040 |
like, concatenate the text after. But what they do is, I'm reading from the blue text now. So they, 00:08:46.640 |
text inputs are treated as 2D tensors with identical position IDs applied across both dimensions, 00:08:52.400 |
and they're concatenated along the diagonal of the image. What that actually means is that they have 00:08:56.800 |
their image here. And they essentially just pretend that the text, like, the text after is, like, 00:09:02.720 |
concatenated along this dimension. So they pretend the image is, like, in this case, they pretend the image 00:09:06.720 |
is, like, a six, a six by six image. So that's what they do. And they say, like, the reason that they use 00:09:16.080 |
this is because, like, previous 2D ropes, like, with other implementations of their positional embedding, 00:09:23.280 |
certain rows of positional encodings for text and image, the 0th middle row in figure 8b becomes 00:09:28.560 |
isomorphic. So essentially, the model becomes confused. They're talking about this part right 00:09:32.160 |
here. They say, like, with previous positional encodings, the model kind of becomes confused, 00:09:39.520 |
and it can, like, confuse these, like, this middle row. So it can become confused as to which, like, 00:09:44.960 |
which parts of the image correspond to text and which are actual, like, pieces of the image. 00:09:49.760 |
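To make the "concatenate along the diagonal" idea concrete, here is a tiny illustrative sketch of how the position IDs could be laid out. This is just one reading of the description in the report, not Qwen's actual MSRoPE implementation.

```python
# Illustrative sketch of MSRoPE-style 2D position IDs: image patches get grid
# coordinates, and text tokens continue along the diagonal past the image corner.
def msrope_position_ids(img_h: int, img_w: int, num_text_tokens: int):
    positions = []
    # Image patches: ordinary 2D grid coordinates (row, col).
    for row in range(img_h):
        for col in range(img_w):
            positions.append((row, col))
    # Text tokens: identical IDs on both axes, starting past the image,
    # i.e. placed along the diagonal as if the image were a larger square.
    start = max(img_h, img_w)
    for k in range(num_text_tokens):
        positions.append((start + k, start + k))
    return positions

# Example: a 3x3 patch grid with 3 text tokens puts the text at (3,3), (4,4), (5,5),
# which is why it's described as pretending the image is a 6x6 grid.
print(msrope_position_ids(3, 3, 3))
```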
Let's see. So this is where they talk about the actual, like, the layers. 00:09:56.800 |
They talk about, like, the VLM, the VAE, and the transformer. So there's that. So, 20 billion parameters. 00:10:07.920 |
Yeah. So does anyone have any questions so far? I realize I didn't pause for input. 00:10:16.160 |
No, we're looking at chat. It seems good. You're doing good, Mark. 00:10:23.920 |
I actually have a question. In the architecture diagram, up a little above, I don't see any 00:10:32.000 |
mention of... Maybe I just didn't read it properly. No, sorry. Up in the architectural diagram part. 00:10:38.320 |
Oh, yeah. Yeah. Like, up in the previous page, I think. Yeah. So I don't... If I'm not... I don't see... 00:10:45.280 |
So is the diffusion process... Like, how does diffusion come in here? Is it just that they're adding 00:10:54.720 |
noise in the autoencoder and then removing the noise progressively on the unpatch, like, 00:11:06.160 |
using the transformer blocks? Or is there something more complicated there? 00:11:09.520 |
Yeah. So someone correct me if I'm wrong here, but, like, I don't see how they actually, like... I don't 00:11:16.240 |
see which, like, which diffusion objective they use. But, like, to my knowledge, a common diffusion objective 00:11:23.040 |
is, like, not to reconstruct the image, but usually they'll estimate how much noise was added. Like, 00:11:28.400 |
first they'll... Like, with the forward pass, they'll, like, corrupt the data. Like, they'll gradually corrupt the data 00:11:32.800 |
using a specific amount of noise at each step. And, like, with a backward pass, instead of, like, 00:11:38.560 |
actually trying to reconstruct the data or reconstruct the image or whatever modality it is, they'll try to 00:11:43.680 |
estimate how much noise was added at that step. So I don't know if they put it here, but I wasn't able to find it. 00:11:50.400 |
Yeah, okay. So it seems like it's just in the training regime and adding the noise. Okay. Understood. Thank you. That's helpful. 00:12:00.000 |
Yeah. Any other questions? All right. Cool. Yeah. So now we go on to the data collection. So in here, 00:12:13.120 |
they actually put a lot of effort into data collection and curation. Well, I'm sure all, like, I'm sure all 00:12:18.880 |
companies/labs put effort into this, but, like, we're able to actually see what they do to actually 00:12:24.800 |
curate their data. So they collected and annotated billions of text-image pairs. That's cool. 00:12:30.480 |
And they prioritize data quality and balanced data distribution. So they try to create, like, 00:12:35.040 |
a well-balanced and representative data set that closely mirrors real-world scenarios. So that's 00:12:39.600 |
interesting. And they categorize it into, like, four categories. So nature, design, people, and synthetic 00:12:44.160 |
data. And, like, it's important to note, they say it, like, somewhere here, but it's very important to note 00:12:48.480 |
that when they say synthetic data, they mean, like, PowerPoints and stuff like that. They do not mean AI-generated content. 00:12:54.240 |
They take extra care to not include AI-generated content, like, in their data mix. 00:13:00.880 |
Yeah. So we'll just go through each of these. So they have their nature category. So it says, 00:13:07.120 |
like, 55% of the data set. They have, like, their objects, landscape, cityscape, plants, animals, 00:13:12.160 |
indoor, and food. So also it has, like, content that doesn't clearly belong to the people or design 00:13:16.960 |
categories. So that's cool. So with the design category, oh, they also have, like, a graphic here, 00:13:23.200 |
but I'll look at that later. So they also have, like, their design category. So it's 27% of the data set. 00:13:28.720 |
So it's usually, like, posters, UIs, presentation slides, and, like, paintings, sculptures, arts and crafts, 00:13:34.880 |
and digital arts. So, like, this, they say that this helps the model to form, like, to 00:13:41.840 |
emulate/replicate different art styles. That was interesting. So they also have the people data set. 00:13:47.520 |
That's 13% of the data set. So they pay special attention to this, because that's, like, portraits, 00:13:54.400 |
sports and activities, and just, like, humans doing different things. So they say that it helps the 00:14:01.520 |
model to generate realistic and diverse human images. And finally, the synthetic data set. So it's around 00:14:06.320 |
5% of the data set. So, again, like I said before, it does not include images generated by other AI 00:14:11.600 |
models. But it's, like, data synthesized with controlled text rendering techniques. So it includes, like, let's 00:14:17.440 |
see. Let's see. Oh, yeah. So they adopt a conservative stance towards such data as training on low fidelity 00:14:23.520 |
or misleading images may weaken the model's generalization capabilities. So, yeah. So let's go to the graphic. 00:14:29.920 |
So this is, like, a visual representation of, like, of the proportion of, like, the nature of the 00:14:35.520 |
different classes and the different, like, subclasses, like objects, cityscape, et cetera. 00:14:39.280 |
Yeah. So that is that. So they have data filtering. They have, like, I think, like, seven to ten stages. 00:14:48.880 |
How many stages? They have a lot of stages of, like, filtering and pre-training. So this is kind of, like, 00:14:56.480 |
I get the gist from reading this section that they had, like, a seven to ten stage filtering process, 00:15:04.000 |
but they also started doing curriculum learning at the same time while they were filtering the data. 00:15:08.400 |
I'm not sure if that's actually the case, but that's kind of what it seems like from here. So 00:15:12.400 |
let's see. Oh, yeah. They have seven sequential stages. So, yeah, synthetic data is introduced from 00:15:20.480 |
Yeah. They mention it in the abstract that it's kind of important, right? So they have a, 00:15:25.600 |
they adopt a progressive training strategy and then they kind of go through their stages, right? So 00:15:31.760 |
text rendering evolves from non-text to text, from simple to complex textual inputs, and gradually scales up to 00:15:37.520 |
paragraph level descriptions. So they do have, like, this curriculum learning and it's kind of, 00:15:44.960 |
um, it builds from simplicity to more advanced stuff to, like, the most niche little synthetic data. 00:15:51.920 |
Yeah. They cover it later too, yeah. Okay. Thanks. I appreciate that. Yeah. So 00:15:59.200 |
I was like, I'm not going to go, like, super in-depth for each stage, but, like, so let's see, 00:16:04.320 |
they have the initial pre-training data. So, like, this is what I talked about or reference earlier, 00:16:08.800 |
where they trained on, like, very small images, like 256 by 256 pixels, like, various aspect ratios. 00:16:14.240 |
So they kind of list them there. Uh, so they also, like, remove low-quality and irrelevant images. 00:16:19.520 |
So they make sure they remove duplicates and they remove, like, really, really low quality images 00:16:26.320 |
right here and also NSFW stuff. Yeah. Let's see. They also, let's see. So onto stage two, they focus on 00:16:34.720 |
improving the image quality. So they remove images, like, with significant rotation or flipping. 00:16:40.240 |
That's this part. Uh, let's see. They describe, like, blurry, out-of-focus images, excessively bright 00:16:48.880 |
or dark images, or images, like, with unnaturally high color saturation. They also remove images with, like, 00:16:55.120 |
low entropy. So that's just, like, only black or only white images or whatever. And they discard images 00:17:01.360 |
with overly complex textures or, yeah, which is associated with noise and non-semantic patterns. 00:17:06.720 |
So this is also a graphic of, um, like, of their filtering process. 00:17:11.440 |
Yeah. So stage three. Let's see. So this is actually where they do, uh, or where they talk 00:17:19.120 |
about some of their annotation. So here's where they start to focus on text. So they start to 00:17:24.000 |
focus on improving the alignment between textual description and visual content. 00:17:28.480 |
So they, I'm reading the blue part here, but they, uh, they have, like, two splits. So they have, 00:17:35.520 |
like, captions provided by websites as well as metadata, such as titles or tags originally associated 00:17:40.000 |
with the images. Uh, so they also have, let's see, captions generated by Qwen. So, like, 00:17:45.920 |
they also use Qwen to help in their, like, data annotation process. Um, let's see. 00:17:52.320 |
So this is just talking about the, how they also combine raw captions and synthesized captions. 00:17:56.400 |
They also discard, like, trash captions, like, really long captions or, like, generic ones. That's 00:18:03.040 |
like, sorry, I cannot provide a caption. I mean, indicating that the caption is broken or something, 00:18:06.800 |
something else is wrong. So they also talk about, let's see, text rendering. So this is stage four, 00:18:11.760 |
text rendering. They, interestingly, they split their languages into English, Chinese, or other, 00:18:17.200 |
which I did not expect. But apparently it worked for them. Uh, let's see. Yeah, they address 00:18:24.720 |
challenges such as low frequency characters, mixed language scenarios, and font diversity. 00:18:29.280 |
So they incorporate synthetic text rendering data. And let's see, they also remove images with overly 00:18:35.920 |
dense or excessively small text. So they do that to increase their text quality. 00:18:44.240 |
Oh, and this is also an interesting graphic where they show, uh, like, they show kind of the 00:18:47.920 |
distributions on some of their, like, their filters. And you can kind of see the examples of some of them. 00:18:54.160 |
Yeah, so stage five. So they talk about how the model transitions to training with, uh, training with 00:19:04.000 |
images at 640p resolution. So they're increasing the, like, they're making their images sharper and just 00:19:08.880 |
increasing the resolution, uh, presumably to make the training more stable. So they also apply more 00:19:15.440 |
filters. So let's see, they, they try to remove images that have, like, overexposure, underexposure, 00:19:21.520 |
blur, or compression artifacts, or poor composition or visual appeal. They also remove 00:19:27.600 |
images containing watermarks, QR codes, and stuff like that. So stage six, they kind of focus more on 00:19:32.960 |
portraits. Uh, let's see. Yeah, so they categorize their data set into three categories. So general 00:19:40.880 |
portrait and text rendering. So that's what it sounds like. Stage seven, this is balanced multi-scale 00:19:48.160 |
training. So again, they're increasing their resolution. And they, interestingly, they design a 00:19:54.000 |
hierarchical taxonomy system for image categorization. So within each category, they retain only images with the 00:20:00.560 |
highest quality. So here they're, this is essentially, like, data QA. So just making sure that their, that 00:20:06.400 |
their model, um, doesn't generate really bad images, like, of a certain, like, of a certain type. So, like, 00:20:13.440 |
they build a tree of all, you know, of all, like, objects that they have. And they just essentially check, 00:20:18.640 |
like, oh, are, like, like, when the model generates buildings, do the buildings look good? Etc. Like, 00:20:24.560 |
like, do the landscapes look good? Do the parks look good? Etc. They do that. They also, uh, while 00:20:32.640 |
they're, like, within each category, they retain images with the highest quality. And they also allow, 00:20:38.480 |
like, they make sure to balance their data so that they allow the model to retain previously learned 00:20:42.640 |
general knowledge and ensure stable convergence while adapting to higher resolution images. So presumably, 00:20:47.920 |
this is to combat, like, catastrophic forgetting, where your model trains on a specific, uh, subset of 00:20:52.800 |
data, but, like, kind of loses its generality. So that's what they do in stage. Yeah, stage seven? 00:20:58.240 |
Yeah, I think stage one to seven really shows how much care they put to cleaning the data, 00:21:04.560 |
whereby they had so many simple filters just to check for resolution, check for cropping, check for flipping, 00:21:11.920 |
and everything. Um, and that's what led to this strong model. Actually, then, then the question becomes, hey, 00:21:19.040 |
if we had left, left that data in, would the model be just as good? It's unknown. I don't know if we would 00:21:24.400 |
actually deliberately train a model on bad data. Um, but, I mean, their pipeline is actually 00:21:31.920 |
I think I might agree. Um, they, they share some stuff with, like, why they don't do synthetic data from, 00:21:40.480 |
uh, you know, images generated by other models and noisy stuff. And they, they say that that would harm 00:21:47.440 |
the quality, right? Yeah, because there's a lot of artifacts, right? Yeah. Exactly. So if you train on 00:21:52.960 |
poor quality data, you will not get a good model, but that's, that's, you know, it seems necessary to 00:21:59.600 |
do all this then. Yeah, that seems so. And if we look at, um, figure 10, right, I think like 00:22:06.080 |
they shrank their data set maybe by, by two thirds. So a lot of it is, it's filtering and I, I, I really 00:22:15.120 |
enjoyed figure 10. So it really just shows you how much care needs to be taken for doing this. 00:22:20.640 |
Yeah. Yeah. So does anyone else have any other questions or comments? 00:22:31.840 |
Did they mention, uh, if this is a semi-automatic, uh, filtering because, uh, you know, it's, uh, billions 00:22:42.400 |
of images and, uh, it's a huge effort to annotate and filter, uh, based on quality and, uh, all of this. 00:22:51.920 |
Yeah. So they like in the, in the technical, in the technical report itself, they just talk about, 00:23:00.480 |
they say like, oh, we applied an entropy filter. So I'm assuming that they programmatically do that. 00:23:05.760 |
Uh, maybe there's more data in the, what's it called in the appendix. I didn't read the appendix or I 00:23:10.240 |
don't even know if there is an appendix. Yeah. It just sounds like they, yeah. 00:23:14.000 |
No, go for it. Oh yeah. No, I was also, I was just saying like, yeah, I apply, like, I imagine that 00:23:19.920 |
they just apply, um, they just say like, oh, if the entropy is greater than this in this image, then they 00:23:24.160 |
just like discard it and do something similar or attempt to do something similar with the other 00:23:28.000 |
filters. Yeah. I imagine that pipeline is completely automatic. Developing the filters is not automatic 00:23:34.400 |
in a sense. They probably need to figure out what the, what, what the right threshold is so that we 00:23:38.960 |
exclude most of the defects without losing out too much, too much good data. But you can imagine that, 00:23:44.240 |
you know, you have a team of 20, everyone just takes one of these filters and then you build your 00:23:48.960 |
evals and figure out how to cut it, get high precision and recall. And then once it's there, 00:23:53.200 |
it's just, it's just a very simple, uh, CPU-intensive task. Right. OpenCV probably 00:23:58.880 |
has quite a few of these as well. And then once it, I just pass everything through. And I think that 00:24:04.160 |
will work very well. Yeah. But I'm sure that they also did some quality, uh, control checks on a small 00:24:11.200 |
sample subsets. Oh yeah. Yeah. I'm also sure. Yeah. Okay. So let's see. 00:24:22.240 |
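For what it's worth, a toy version of the kind of programmatic low-entropy filter being speculated about here might look like the sketch below; the threshold is made up and the real pipeline is certainly more involved.

```python
# Toy low-entropy image filter of the kind discussed above: nearly uniform
# images (all-black, all-white, flat color) get discarded. Threshold is invented.
import numpy as np
from PIL import Image

def grayscale_entropy(path: str) -> float:
    img = np.asarray(Image.open(path).convert("L"))
    hist, _ = np.histogram(img, bins=256, range=(0, 256), density=True)
    hist = hist[hist > 0]                       # drop empty bins before the log
    return float(-(hist * np.log2(hist)).sum()) # Shannon entropy in bits

def keep_image(path: str, min_entropy: float = 2.0) -> bool:
    return grayscale_entropy(path) >= min_entropy
```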
Data annotation. Oh yeah. So they go on to talk about their data annotation, which I think is also 00:24:26.480 |
really interesting because they, they essentially have Qwen generate, uh, a JSON. So I'll actually 00:24:32.160 |
read some of this. So they say they use a capable image captioner. So they're talking about Qwen2.5-VL 00:24:36.480 |
that generate, uh, comprehensive image descriptions, but also structured metadata that captures essential 00:24:41.920 |
image properties and attributes. So like, instead of treating captioning and metadata extraction as 00:24:47.200 |
independent tasks, the captioner concurrently describes visual content and generates detailed 00:24:52.320 |
information in a structured format, such as JSON. So critical details such as like object attributes, 00:24:57.440 |
spatial relationships, environmental context, and verbatim, uh, translations of visible text are 00:25:01.920 |
captured. So they capture key image properties and report it in a structured format. So I think that 00:25:08.080 |
that's really interesting. I don't know how many other like labs do this, but yeah, I just think it's 00:25:14.560 |
really interesting that, like, they, they treated them both as the same task, like image 00:25:19.200 |
captioning and like metadata extraction to like capture a bunch of different relationships from the image 00:25:25.280 |
that might not just be captured with the image caption. Yeah. I also, I thought this was really 00:25:33.120 |
interesting and that they're actually using the, the vision language model in two ways. One is they're 00:25:38.960 |
doing it, using it to annotate like this, but then they're also using it to embed the language into the 00:25:45.840 |
vision, uh, the vision, uh, the vision embedding space, right. And then there, and then like sort of using 00:25:53.440 |
cross, uh, cross-attention to align. I thought that was really interesting that they were sort of using it in two ways. 00:26:02.560 |
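Here's a hypothetical sketch of the kind of single-call caption-plus-metadata annotation described above. The prompt wording, the JSON field names, and the `vlm_client.generate` interface are all assumptions for illustration, not the report's actual setup.

```python
# Hypothetical captioning prompt in the spirit of the paper's description:
# one VLM call returns both a free-form caption and structured metadata.
import json

ANNOTATION_PROMPT = """Describe this image in detail, then return a JSON object with:
- "caption": a comprehensive natural-language description
- "objects": list of objects with attributes and spatial relationships
- "environment": scene/context description
- "visible_text": verbatim transcription of any text in the image, with its language
Return only the JSON object."""

def annotate(vlm_client, image):
    """vlm_client is assumed to expose a generate(image=..., prompt=...) method."""
    raw = vlm_client.generate(image=image, prompt=ANNOTATION_PROMPT)
    return json.loads(raw)  # downstream filters can key off these structured fields
```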
Yeah, I agree. I thought it was pretty cool. The two papers they reference on this, um, SigLIP 2 and Chinese-CLIP, uh, 00:26:15.280 |
the SigLIP 2, I looked into that paper. It's kind of, it's kind of interesting. It's like a, you know, 00:26:19.760 |
a very recent 2025 iteration of CLIP, but it's cool to see how much stuff keeps pushing there. 00:26:27.600 |
And then, uh, you know, they even mentioned like in that stage three, after you do captioning, how do 00:26:35.680 |
they, you know, how do they still filter itself? So like there's token length, you gotta remove stuff, 00:26:41.280 |
you gotta filter stuff that says I can't caption this image. Um, you know, 00:26:47.280 |
Yeah. Wait, do you know what Chinese-CLIP did differently than like regular CLIP? 00:26:52.640 |
I'm pretty sure that the captions are in Chinese. 00:26:55.920 |
I haven't checked, but you know, I would assume Chinese-CLIP is CLIP in Chinese. 00:27:04.080 |
I meant like they do anything differently in terms of like diffusion. 00:27:09.600 |
So they, they have two, two, two things that they reference. Uh, SigLIP 2 is not Chinese-CLIP. It's, 00:27:15.040 |
it's, uh, it's a variation that builds on top of CLIP and that's not specific. That's not like specific to 00:27:22.320 |
Chinese. That's from DeepMind in like February, 2025. It's, it's just an improvement on CLIP, 00:27:28.800 |
but you know, they probably merged that with the Chinese-CLIP. 00:27:45.360 |
Uh, oh yeah. So they talk about how apparently this is a problem with, uh, like with Chinese where 00:27:54.480 |
like there are some characters that are just really, really rare, but are still important. 00:27:58.080 |
So it says, given the long-tail distribution of textual content, relying solely on, uh, 00:28:02.640 |
naturally occurring text is insufficient to ensure adequate exposure to these rare characters during 00:28:06.720 |
model training. So to address this, they use like a multi-stage text-aware image synthesis pipeline. 00:28:12.800 |
It's like, they have three stages and they kind of describe it. Uh, so the most straightforward 00:28:19.520 |
way is, like, to train the model to recognize and generate characters. So like they make text 00:28:23.920 |
or they extract text paragraphs from large scale, high quality corpora, and they render it onto clean 00:28:28.320 |
backgrounds. And so they also like implement QA or quality control. So if any character within a 00:28:34.800 |
paragraph cannot be rendered due to limitations, the entire paragraph is discarded. So again, like they just 00:28:39.760 |
really, really care about having like clean data. Uh, yeah. So they maintain a high fidelity. Let's see. 00:28:47.360 |
So this is an example of the first one, the paragraph that I just talked about. So they do that. 00:28:53.360 |
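A minimal sketch of that "pure rendering" stage with PIL, including a crude stand-in for the discard-if-unrenderable check, is below. The font path, sizes, layout values, and the glyph check itself are placeholders rather than the paper's pipeline.

```python
# Minimal "pure rendering" sketch: draw a real text paragraph onto a clean
# background, and discard the whole paragraph if the font has no glyph for
# any character. Font path, size, and layout values are placeholders.
from fontTools.ttLib import TTFont
from PIL import Image, ImageDraw, ImageFont

def render_paragraph(text: str, font_path: str, canvas=(1024, 1024)):
    cmap = TTFont(font_path)["cmap"].getBestCmap()
    if any(ord(ch) not in cmap for ch in text if not ch.isspace()):
        return None  # quality control: one missing character discards the sample
    font = ImageFont.truetype(font_path, 40)
    img = Image.new("RGB", canvas, "white")            # clean background
    ImageDraw.Draw(img).multiline_text((40, 40), text, font=font, fill="black")
    return img
```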
So they also do compositional rendering in contextual scenes. So they just embed synthetic text into 00:29:01.520 |
realistic visual context. So they use the Qwen-VL captioner to generate descriptive captions for each synthesized image, 00:29:07.840 |
capturing contextual relationships between the text and surrounding visual elements. 00:29:10.880 |
So an example of this is like this right here. And the third is, let's see, complex rendering and 00:29:17.920 |
structured templates. So they follow complex structured prompts involving layout sensitive content. 00:29:23.280 |
So they propose a synthesis strategy based on programmatic editing of predefined templates, 00:29:27.520 |
such as PowerPoint slides or user interface mockups. This is kind of what I was talking about at the very 00:29:30.960 |
beginning where they, like, they kind of use some of their PowerPoint such like synthetic, quote unquote, synthetic images. 00:29:37.840 |
Uh, that kind of like teach it to follow instructions or teach it to like how to place different, like how to use graphics or like manipulate graphics. 00:29:50.160 |
I thought this section was very interesting, right? Because what they're doing here is they're, 00:30:00.080 |
they're generating a very different form of synthetic data, right? This is not synthetic data from a diffusion model or like an LLM that's just added. Like you don't, you're not really doing distillation. 00:30:12.480 |
Uh, you know, the first one, uh, you know, the first one, the pure rendering is very interesting, right? 00:30:16.720 |
They're, they're basically writing paragraphs, like they're extracting text and just rendering it onto clean backgrounds. 00:30:25.680 |
Like, you know, in Photoshop where you can like have text and just paste it into different backgrounds. 00:30:30.400 |
Uh, that's, that's a form of a synthetic image, but it's not a generated image, right? 00:30:36.080 |
So they mentioned this more earlier in the paper as well, where their synthetic data gen is like very different, it's, it's not generated images, it's text that's written and then rendered in synthetic sense. 00:30:52.080 |
And then the thing that you mentioned with the random characters. 00:30:56.960 |
So the, the, the part of that is actually that in, in languages like Chinese, there's tail end characters that don't show up a lot, right? 00:31:05.920 |
Like you, you just won't see these that often, but you still need to be able to understand them. 00:31:12.000 |
So this synthetic text properly brings back in the tail end of distribution that you don't see. 00:31:18.560 |
So it's like, you know, in English, we have like only so many characters, right? 00:31:22.080 |
26 letters, they're pretty distributed, but in, in Chinese there are characters that don't show up and they, they render them and then they synthetically add them back. 00:31:31.600 |
But it's a very different type of, uh, synthetic data gen. 00:31:41.840 |
Like, yeah, they just have templates of PowerPoints and then they just add in words, you know? 00:31:49.120 |
I'm also kind of surprised because like, I thought that if there were rare characters and they would rarely show up in, um, like in these like pure rendering, like these types of images. 00:31:58.720 |
But I mean, maybe they artificially, like maybe they just like manually found which characters are rare and just like ensured that there are more of those available. 00:32:14.480 |
That's a paragraph of text that exists, and then they just paste that onto a green background. 00:32:24.160 |
It's not telling, it's not telling a model to generate something with this. 00:32:27.760 |
They actually just purely printed out this paragraph and then overlay it on a green background. 00:32:34.720 |
That's what makes this unique form of synthetic data gen, right? 00:32:37.680 |
Because it's, yeah, it's just image editing, but it's, it's not generated per se. 00:32:44.320 |
And then they do this, they adjust font, sizing, spacing, all that stuff. 00:32:48.880 |
And then, you know, when you type out this paragraph, if one character is off, they discard it. 00:32:58.080 |
And this helps with like tiny text too, I think they said. 00:33:03.200 |
And as long as it's correct, you know, synthetic small text. 00:33:09.600 |
But the, the, the cool thing is after you do this, like in the second blob there of text being 00:33:19.840 |
overlaid, sorry, if you go up a little bit, the second one, like, you know, this, I love you too. 00:33:25.280 |
So this is text that was thrown on a piece of paper in a background. 00:33:28.720 |
They still have to pass this back in through their captioner, right? 00:33:32.800 |
So even though it's like fake image, not synthetically generated, they, they have to 00:33:38.960 |
still throw it in the captioner and have descriptions without metadata and stuff. 00:33:57.680 |
So, so like, first I'll just read some of it. 00:34:00.880 |
So they adopt a flow matching training objective to pre-train Qwen Image. 00:34:04.320 |
So it facilitates stable learning dynamics via ordinary differential equations while preserving 00:34:08.640 |
equivalence to the maximum likelihood objective. 00:34:10.400 |
Uh, so this is, this is essentially like the diffusion part of their model. 00:34:20.400 |
So yeah, you can just, you can throw it into ChatGPT if you want, but like essentially they 00:34:25.120 |
train the data to point towards, I think, like that, or they start with noise and they train 00:34:31.360 |
like the noise to point towards, like, the actual real data. 00:34:39.280 |
Then the model is trained to predict the target velocity, and the loss function is defined as 00:34:42.400 |
the mean squared error between the predicted output and the ground truth velocity. 00:34:52.320 |
So essentially you get it to like point in the direction of the real data, uh, like in this distribution. 00:35:00.560 |
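Since the velocity-prediction objective just read out maps pretty directly to code, here is a generic rectified-flow-style training step as a sketch; this is the standard formulation, not necessarily Qwen Image's exact implementation.

```python
# Generic flow-matching training step: interpolate between noise and data, and
# train the model to predict the velocity pointing from noise toward the real latent.
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x0, cond):
    """x0: clean image latents [B, C, H, W]; cond: text conditioning."""
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], device=x0.device).view(-1, 1, 1, 1)  # timestep in (0, 1)
    x_t = (1 - t) * x0 + t * noise      # linear interpolation path
    target_velocity = noise - x0        # d x_t / d t along that path
    pred_velocity = model(x_t, t.flatten(), cond)
    return F.mse_loss(pred_velocity, target_velocity)  # MSE against ground-truth velocity
```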
So they also talk about how they optimize, uh, their models for like GPU usage. 00:35:06.560 |
So they use something called like what they call a producer consumer framework. 00:35:10.320 |
It decouples data pre-processing from model training. 00:35:12.800 |
So this design enables both stages to operate asynchronously and add optimal efficiency. 00:35:17.280 |
So on the producer side, the selected data is encoded into latent representations using the MLLM and VAE. 00:35:27.760 |
Like they dedicate some GPUs to like do the producer, like do the producer's work. 00:35:32.400 |
And they dedicate some GPUs to do the consumer's work. 00:35:34.800 |
So the consumer GPUs are dedicated exclusively to model training and every data parallel group 00:35:39.680 |
asynchronously pulls pre-processed batches directly from the producer. 00:35:43.920 |
It's like, again, like they have, uh, let's see. 00:35:48.720 |
They encode, like, the data into latent representations. 00:35:51.840 |
And they just like stack them up somewhere in memory or in storage. 00:35:55.280 |
And when, like whenever the consumer GPUs, uh, like whenever they finish with their previous batch and 00:36:00.560 |
they're ready, they just like asynchronously pull the data. 00:36:06.480 |
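A toy, single-process sketch of that producer/consumer decoupling using a bounded queue is below; the real setup dedicates separate GPUs to each role, and the `vae.encode` / `text_encoder` calls here are assumed interfaces.

```python
# Toy producer/consumer decoupling with a bounded queue: one thread encodes
# batches into latents ahead of time, and the training loop just pulls them.
import queue
import threading

latent_queue = queue.Queue(maxsize=8)   # buffer of pre-encoded batches

def producer(raw_batches, text_encoder, vae):
    for batch in raw_batches:
        latents = vae.encode(batch["images"])           # assumed APIs
        text_emb = text_encoder(batch["captions"])
        latent_queue.put({"latents": latents, "text": text_emb})  # blocks if full
    latent_queue.put(None)                              # sentinel: no more data

def consumer(train_step):
    while (item := latent_queue.get()) is not None:
        train_step(item)                                # GPU stays busy training

# Usage sketch:
# threading.Thread(target=producer, args=(loader, encoder, vae)).start()
# consumer(train_step)
```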
So let's see, so distributed training optimization. 00:36:09.440 |
So they use like hybrid parallelism strategy. 00:36:11.440 |
So they combined data parallelism and tensor parallelism to efficiently scale training across 00:36:17.920 |
So this is, like, less, it's like, I'm less confident, uh, of, like, what this is, or, like, of the specifics of it. 00:36:24.320 |
So if anyone wants to like jump in, they're like more than welcome to, uh, 00:36:28.240 |
they also talk about like distributed optimizer and activation checkpointing. 00:36:31.360 |
Like to alleviate GPU memory pressure with minimal overhead. 00:36:36.560 |
So we experimented with both distributed optimizers and activation checkpointing. 00:36:41.040 |
However, activation checkpointing introduces substantial computational overhead and backward 00:36:44.800 |
paths, which can significantly degrade training speed. 00:36:47.040 |
So they observed that enabling activation checkpointing reduced per GPU memory consumption by 11%, 00:36:53.200 |
but like at the cost of increasing per iteration time by essentially almost four times. 00:36:58.480 |
So from two to seven and a half seconds per iteration. 00:37:01.200 |
And like they say that based on the trade-off, they ultimately opted to disable activation 00:37:07.840 |
I think there's an important point here, right? 00:37:10.400 |
In the sense that, um, with activation checkpointing, you're like 75% slower. 00:37:15.840 |
Um, but you only save like 11% GPU memory in, in a sense you could sort of reduce your batch size 00:37:21.520 |
and actually go away faster and you would actually make up for it. 00:37:23.840 |
Um, in some of my experiments, I mean, the default, if you ask Claude to write some code, 00:37:29.040 |
the default is to enable activation checkpointing. 00:37:30.880 |
I think because everyone's was fine tuning on very small LLMs. 00:37:34.160 |
Uh, and you know, I think in Hugging Face, the default is just to enable activation checkpointing. 00:37:39.600 |
But when you turn it off, uh, at least I was able to see a 20% speed up with not very much increase in memory either. 00:37:45.920 |
So I think it's something to observe, uh, something, if you do train your own models or finding your own models, 00:37:50.880 |
like do consider not using activation checkpointing, uh, just for your, uh, training loop to run faster. 00:37:58.240 |
Um, so it's, it's nice to see another data point here in this paper. 00:38:04.320 |
Would it like, would a middle point in that trade off just be to have activation checkpointing, 00:38:08.800 |
but just to like checkpoint it less frequently? 00:38:14.800 |
I think activation checkpointing is needed every time you do a backward pass and optimization. 00:38:23.280 |
So, uh, I mean, try it for what's worth if you have extra memory on your GPU, 00:38:31.040 |
or you can afford to go over a smaller batch size and just do gradient accumulation. 00:38:34.480 |
I think it's definitely, I mean, in this case, it's definitely worth the trade off not 00:38:40.640 |
Uh, I found the same thing in my own use case as well. 00:38:52.640 |
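The toggle being discussed looks roughly like this with PyTorch's built-in checkpoint utility; whether to enable it is exactly the memory-versus-speed trade-off above.

```python
# The trade-off being discussed: with checkpointing, activations inside `block`
# are recomputed during the backward pass (less memory, slower); without it,
# they are kept in memory (more memory, faster).
import torch
from torch.utils.checkpoint import checkpoint

def run_blocks(blocks, x, use_activation_checkpointing: bool):
    for block in blocks:
        if use_activation_checkpointing:
            x = checkpoint(block, x, use_reentrant=False)  # recompute in backward
        else:
            x = block(x)                                   # keep activations
    return x
```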
So we talked about this kind of earlier with, oh no, there we go. 00:38:57.200 |
So I kind of talked about this earlier, uh, with their data synthesis or their data curation pipeline. 00:39:03.280 |
But so onto the training strategy, they adopt a multi-stage pre-training strategy aimed at 00:39:07.520 |
progressively enhancing data quality, image resolution, and model performance. 00:39:17.120 |
So they go from 256 by 256 pixels up to 1328 by 1328. 00:39:29.040 |
They let the model capture more detailed features leading to better performance. 00:39:33.680 |
Richer feature spaces facilitate improved generalization to unseen data. 00:39:38.720 |
So transitioning from low-resolution to higher-resolution images allows the model to discern finer 00:39:50.000 |
So they progressively introduced images containing, like, rendered text. 00:39:52.960 |
So the model can like learn visual representations and subsequently acquire text rendering capability. 00:39:59.760 |
So they also go from massive to refined data. 00:40:01.760 |
They gradually employ increasingly stringent data filtering mechanisms to select higher quality data. 00:40:07.040 |
So it ensures that only the most relevant and high quality samples are leveraged to ensure training efficiency. 00:40:17.280 |
So it mitigates the risk of the model overfitting to particular domains or resolutions. 00:40:21.440 |
And like just let the model generalize better. 00:40:28.640 |
So here they generate supplementary samples enriching the data set and ensuring more comprehensive coverage of diverse visual domains. 00:40:37.280 |
They say it enhances the model's ability to generalize and perform robustly across a wider range of scenarios. 00:40:46.800 |
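One hypothetical way to express the staged schedule described above is as a simple config; the resolutions come from the talk, while the stage descriptions are illustrative only.

```python
# Hypothetical curriculum config for the staged pre-training described above.
# Resolutions are the ones mentioned in the talk; everything else is invented.
TRAINING_STAGES = [
    {"resolution": (256, 256),   "data": "broad, lightly filtered",   "text_rendering": "none"},
    {"resolution": (640, 640),   "data": "stricter quality filters",  "text_rendering": "simple"},
    {"resolution": (1328, 1328), "data": "highest quality, balanced", "text_rendering": "paragraph-level"},
]

for stage in TRAINING_STAGES:
    print(f"train at {stage['resolution']}: {stage['data']} data, "
          f"{stage['text_rendering']} text rendering")
```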
So they have supervised fine-tuning and RL, with DPO and GRPO. 00:40:56.320 |
So they use human annotations to address specific shortcomings. 00:41:04.240 |
So they make the selected images clear, rich in detail, bright, and photorealistic, and like all that good stuff. 00:41:09.440 |
And they guide the model towards producing content with creative realism. 00:41:15.040 |
DPO, like it excels at the flow matching, which is the diffusion part of the image generation. 00:41:20.400 |
And GRPO performs on-policy sampling during training and evaluates each trajectory with a reward model. 00:41:30.560 |
So for DPO, it says given the same prompt, multiple images are generated with different random initialization seeds. 00:41:36.240 |
So like with prompts without reference images, annotators are asked to select the best and worst 00:41:45.040 |
So I don't know if selecting the best and worst samples, I don't know if this is new or if people have 00:41:52.400 |
Because like usually with DPO, I just hear people selecting the best images or like the best, 00:42:00.080 |
So DPO, they use that for flow matching and for GRPO. 00:42:12.000 |
But they use it for like, I'm pretty sure they use it for reconstructing the image. 00:42:17.040 |
So if anyone wants to like comment, they can jump in here too. 00:42:22.480 |
But if no one has anything to say, then I'll go on to the next section. 00:42:37.280 |
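Before moving on, here's a sketch of how the best/worst selection just described could be turned into preference pairs for DPO-style training; the generation and annotation interfaces are assumptions, not the report's code.

```python
# Sketch of turning best/worst annotations into DPO preference pairs:
# generate several images per prompt with different seeds, have annotators
# mark the best and worst, and keep (chosen, rejected) pairs for training.
def build_preference_pairs(prompts, generate_fn, annotate_fn, num_seeds: int = 4):
    pairs = []
    for prompt in prompts:
        candidates = [generate_fn(prompt, seed=s) for s in range(num_seeds)]
        best_idx, worst_idx = annotate_fn(prompt, candidates)  # human choice
        pairs.append({
            "prompt": prompt,
            "chosen": candidates[best_idx],
            "rejected": candidates[worst_idx],
        })
    return pairs
```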
So in addition to text-to-image, they let the model explore, like, multimodal image generation tasks. 00:42:42.960 |
So it's not only like giving the model a prompt, which is text only. 00:42:46.480 |
They also let the user, like, give it a prompt and an image. 00:42:50.640 |
So they also find that providing the visual semantic embeddings from the MLLM enables better instruction 00:42:58.320 |
So this is kind of like what they talked about like at the very beginning where, or not at the very 00:43:02.080 |
beginning, but like what they talked about previously where they, they took the last latent 00:43:06.400 |
representation of Qwen, like of their, like their multimodal language model. 00:43:11.600 |
They took that last hidden state and they kind of, like, fed that into the model. 00:43:17.680 |
And they also use like the, like they talked about how pixel level VAE embeddings further 00:43:23.760 |
enhances the model's ability to preserve visual fidelity and maintain structural consistency. 00:43:28.240 |
So I was kind of talking about that earlier too. 00:43:29.920 |
So yeah, so I think like the, I don't know, this really reminds you of like system one and 00:43:34.080 |
system two and how like different labs have been using that. 00:43:36.960 |
Like I remember DeepMind used system one and system two for robotics. 00:43:39.680 |
And I think like Figure AI use that for like, they also use that for robotics. 00:43:43.840 |
So I don't really have a lot of notes on this. 00:43:54.720 |
I kind of like put more emphasis on how the model was trained in like that data creation 00:44:16.880 |
I really just like left the benchmarks alone. 00:44:18.800 |
And like if people want to read them, I just let, like you can read them if you want. 00:44:29.600 |
They also have like image editing benchmarks. 00:44:31.280 |
And they also have like, so in some of their editing benchmarks, or in one of them rather. 00:44:40.640 |
So they have like very dense PDFs and they like essentially just let the model see. 00:44:46.240 |
Or like they let the model like try to reconstruct some of the images to like see 00:44:57.920 |
So like, again, they focus on like English, Chinese, and like multi-object generation 00:45:09.840 |
That's, that's most of the thing that I, or that's most of what I focused on like in my annotation. 00:45:16.160 |
So does anyone have any, any questions or anything else or any comments? 00:45:19.280 |
I think the conclusion is also quite eye opening in the sense that they try to make, they make this 00:45:32.720 |
claim that a generative model can effectively perform classical understanding tasks. 00:45:38.400 |
And they say that, uh, the Qwen Image model deliberately does not optimize for photorealism or aesthetic quality, 00:45:48.000 |
but really tries to optimize more for aligning text and image. 00:45:52.000 |
I think you sort of tells you where they are trying to bring this model towards, right? 00:45:58.000 |
Like essentially like creating posters, creating PowerPoints. 00:46:00.560 |
Essentially, it's more practical: instead of just generating images, it's images with text. 00:46:05.120 |
Uh, so I thought the conclusion was quite worth reading and understanding what they mean, 00:46:13.280 |
what they mean by it as well, uh, especially the last paragraph. 00:46:27.680 |
I'll just like read it, but well, like other people can ask questions. 00:46:49.440 |
So they mentioned that, uh, the, uh, model's training data, uh, pipeline, uh, does include other languages, 00:46:56.960 |
but, uh, I see that, uh, that most of the benchmarks, uh, were focusing on, uh, English and Chinese. 00:47:08.160 |
So I don't know if, uh, uh, uh, in the conclusion that they, 00:47:14.720 |
they, uh, specified that they can, uh, handle other languages. 00:47:26.000 |
So actually, I'm not sure they don't really, yeah. 00:47:28.400 |
They don't really talk about like other languages besides English and Chinese. 00:47:40.560 |
They had, uh, English, Chinese, and other, but I don't think this thing will do, do other languages. 00:47:52.240 |
Maybe others to exclude them or to further them. 00:47:58.560 |
But, um, yeah, it's, it's primarily English and Chinese. 00:48:03.520 |
So I don't know if they did evals on the other texts, maybe they're just not equipped to do evals on other languages. 00:48:20.240 |
They had English, Chinese, other language, and no text. 00:48:23.360 |
They definitely had an "other language" category, but I don't think it's significant, you know? 00:48:28.560 |
And then some of this is also like, um, you know, there's already a vision encoder in there that can 00:48:34.080 |
do multilingual, but I don't think it can specialize in output for this. 00:48:39.360 |
Uh, uh, I mean, actually, I tested, like, I tested some other prompts, but I didn't test 00:48:51.840 |
any, like, non-English or Chinese languages. 00:48:54.880 |
Yeah, I mean, like, Qwen, I tested them in Sora, like, Qwen and Sora. 00:48:58.800 |
And like, actually, I think the quality was, like, pretty similar. 00:49:01.280 |
Or Qwen's, like, Qwen's text generation was, like, uh, it was a little more clear. 00:49:08.720 |
It was a lot easier to look at them and just, like, see the background, 00:49:11.360 |
or, like, distinguish between the characters and the background. 00:49:44.000 |
I don't know if we would even be able to evaluate it. 00:50:08.240 |
It's not correct, but it has Korean-looking stuff. 00:50:22.080 |
I went to Google Translate, and I translated the sentence. 00:50:30.160 |
You want to share the images I shared in Zoom chat? 00:50:37.200 |
Well, I mean, it's off, but it does look a little more Korean. 00:50:54.640 |
Oh, I guess the second line is kind of there. 00:51:39.040 |
I'm now trying to translate the text it generated. 00:51:45.680 |
I showed another image if you want to swap to it. 00:51:51.120 |
And then I'm translating this with Korean model. 00:51:55.440 |
And see if it's coherent or if it's just random stuff. 00:52:28.480 |
Oh yeah, as long as the fuzzy background characters don't look Korean. 00:52:41.120 |
So the second image is nonsense, gibberish, according to translations. 00:52:56.720 |
Yeah, but the font of the main text and the background text do look different. 00:53:16.880 |
I think next week we have, was it Venki again? 00:53:23.760 |
Wow, this chalk text actually looks so much like chalk. 00:53:49.680 |
I'm impressed for the first character at least. 00:54:00.080 |
I was going over that the flow and diffusion model that they had in 4.1. 00:54:06.640 |
And I just want to make sure that I understand. 00:54:09.920 |
So I think that they have a joint latent space of image and text, correct? 00:54:15.760 |
And so they're applying a noise model on top of that space, right? 00:54:23.760 |
And I was wondering, typically when you add noise to say latent space of text, 00:54:31.040 |
That's why we don't have very good diffusion models for text. 00:54:34.160 |
So I'm kind of surprised that this kind of works. 00:54:38.000 |
So I'm not sure what you guys feel about that. 00:54:40.560 |
So what I'm saying is that their latent space is kind of like a multimodal one, right? 00:54:46.480 |
It contains both image and text related stuff together. 00:54:51.200 |
And typically diffusion, as you have here, which is adding noise to that, normally doesn't work well 00:55:01.360 |
So I was wondering why this would do any better. 00:55:08.160 |
So one thing that I noticed was that if you look at the architectural diagram, the noise is only added on the image side. 00:55:17.440 |
So if you scroll up to the architectural diagram, I might be wrong about that. 00:55:24.400 |
It's just from memory, but I think that's what I saw. 00:55:29.040 |
So the noise is only added on the image side. 00:55:32.240 |
And then you have this cross-attention thing that combines them. 00:55:38.400 |
So it looks like it's more like conditioning on the text, building the common latent space inside of the transformer blocks. 00:55:59.280 |
I also think it's because the resulting product of the model will be an image, which is continuous in diffusion space. 00:56:07.520 |
If the resulting product of the model would be text or something, or something that's inherently discrete, then I feel it could be a lot different. 00:56:15.200 |
Like you might have to use a decoder or something. 00:56:18.400 |
But like, yeah, like with the prompt and the image, both of those get like injected into latent space. 00:56:25.760 |
And then you can like perform diffusion on that. 00:56:27.840 |
And like the resulting process will be an image. 00:56:30.080 |
Well, I think RJ said that it doesn't get injected right on the combined embedding. 00:56:42.400 |
The noise only added to the image and image latent space, not to the prompt. 00:56:47.760 |
They get combined later, but the noise is only added to the image side. 00:57:26.880 |
But there is like the same number of letters and going from right to left.