OpenAI’s New ImageGen is Unexpectedly Epic … (ft. Reve, Imagen 3, Midjourney etc)


Chapters

0:00 Intro
1:07 Prompt Adherence vs Reve, Midjourney, Imagen 3 + one other
3:39 Idioms
4:20 Thumbnails?
5:56 Captions / Infographics
7:20 Filters and Public Figures + Gray Swan
8:30 Sora?
8:49 Ethnicities/hands
9:09 Where's Waldo?
10:33 Selfies and Photorealism

Transcript

I have spent quite a while testing the new 4o image gen from OpenAI, comparing it to models released just yesterday, for example, as well as models that aren't even publicly out yet. Rarely for me in AI is one model so much better than the rest. Now, of course, it's still not perfect, and don't even think about showing the model a mirror, because it will frankly have a breakdown.

But the word that comes to mind for me about this new image gen, and I know calling it new is a bit of a stretch because it's been in the works for more than two years, is obedient. "Depict six people of six completely different ethnicities doing jazz hands."

That was my prompt here. Okay, you could quibble that you can't quite see the hands of the people in the back, but this is not bad. Think of just how recently it was that hands were such a problem for AI. This new tool, which Sam Altman is calling Images in ChatGPT, will apparently be available to everyone, even free-tier users.

And it will also be coming to the API, so I thought it deserved its own video featuring comparisons with Reve, Midjourney, and I might even sneak another model in there. I'll also cover image editing, which I know is not unique to this model. You can do it with Gemini in Google's AI Studio, but still, it's a notch above.
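Since the API wasn't out at the time of recording, here is only a hedged sketch of what a call might look like, assuming the new model slots into the shape of OpenAI's existing Images API in the official Python SDK; the model name below is a placeholder I've made up, not a confirmed identifier.

```python
# Minimal sketch: generating an image, assuming the new model follows the
# existing Images API shape. "gpt-4o-image" is a hypothetical placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.images.generate(
    model="gpt-4o-image",  # hypothetical model name, not confirmed by OpenAI
    prompt=(
        "Three apples balanced on the trunk of a blue elephant with three "
        "legs, standing beside five weeping willow trees in El Jem, Tunisia"
    ),
    n=1,
    size="1024x1024",
)

print(response.data[0].url)  # URL of the generated image
```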

The first comparison I think is pretty illuminating, and the prompt I used was: three apples balanced on the trunk of a blue elephant with three legs, standing beside five weeping willow trees in El Jem, Tunisia. Obviously that is an incredibly difficult prompt to adhere to, but I think the model did insanely well.

It captured the Colosseum in El Jem and the blue elephant. You've got three apples on the trunk in every image, and kind of five trees in every image, depending on how you count it. I know we've got quite a few more in the background, far into the background, but if you're being generous, some of these count one, two, three, four, five: pretty accurate.

I'm also just noticing that the shadows are fairly consistent, which I think is pretty impressive, but obviously not the three legs on the elephant. It's a little bit like my common-sense reasoning benchmark SimpleBench, in that having three legs here for an elephant is a twist on a common scenario, and the model just doesn't expect it and can't really do it.

It's just been trained on too many images of elephants with the normal four legs. Imagen 3, which is Google's best text-to-image model, struggles somewhat. Again, no three legs on the elephant, but this time the apples are kind of wrong in number, not all of them are on the trunk, and you're not getting much of a sense of location.

Then of course I wanted to test Reve, previously codenamed Halfmoon, which that company claims is the best image model in the world, and the way I'd phrase it is that it's very, very good. If it weren't for 4o image gen, I'd say it probably is the best image model in the world, but for now I'm going to say second.

On this particular prompt you may even prefer it, even though there are only four trees that I can see, but it's a really good image with a great sense of location. It will slightly more often get the number of apples wrong, but overall, despite occasional shadow issues, the images are pretty vivid and engaging.

So massive credit to Reve here. I'm now going to show you a sneak peek of a model releasing tomorrow, and I think this is a brilliant image. Not quite what I was going for, but nevertheless very interesting. All of the images from this model were fairly similar: engaging, but not quite what I was looking for.

Okay, this next one you might like, because I'm sure you guys are going to see plenty of comparisons online, but I wanted to go one meta layer higher. I asked all the models to illustrate the idiom "hold your horses." That's a pretty tough test, because it's not just about visually depicting someone literally holding a horse.

It's also the idiom "hold your horses", as in slow down. Only OpenAI's 4o image gen understood the metaphor and conveyed it appropriately in every image. Plus, it gave some really great text too, of course. Reve, as well as having some slightly dodgy image details, just didn't really understand the metaphor in any of the images.

Imagen 3 from Google couldn't do this at all. And as you can see at the top, nor could Midjourney. Okay, this next one's not going to be a comparison, but I think it shows off the capabilities of 4o image gen really quite well. Here is one of my classic thumbnails; I gave it to 4o image gen and said: make it 3D.

I think you have to admit that, with the slight exception of Anthropic's logo down here, the overall results are darn impressive. I mean, just for a moment, let's focus on the fact that, aside from possibly a little line here next to "stumbles" in one of the images, the text is incredibly accurate.

Then look at this one in the top right, and I'm actually going to zoom in. The effect of the whale coming out from the water, drawn from my thumbnail as inspiration, is pretty darn impressive. Now, I'm not saying that I'm going to immediately drop my traditional thumbnail approach, but for my just-released new Patreon video, which was about Claude 3.7 having theory of mind and knowing it's being tested,

I did want to try it out, so I got my existing thumbnail and ran it through 4o image gen to see what it would come up with, and as you can see, you have this lab-like image with the thumbnail being projected onto the wall. I don't normally like AI thumbnails, but this is probably the first tool that has tempted me.

The next test is what I can see being the most common use case for images in ChatGPT. You could call it images with captions, or basic infographics, and it does really quite well. Here I asked: depict a four-panel journey showing the stages of a human life. Not only did I get that journey in each one, but I got these labels that I didn't even ask for, which, I've just noticed, aren't quite perfect.

You can see "elderly" spelt wrong in the top right, but again, you'd be hard pressed to say there are any clear mistakes in some of these. Now, because I love the UI, all of these tests were done on Sora, but of course we can't forget image editing. That is either unavailable or a whole set of extra steps with other image generators, but not so with ChatGPT with images.

So I picked out one of these images and said: add glasses to each character. That got me this image, where you can see the original image is preserved, just that they now have glasses. All of the other image generators had problems with the four stages of life, although Reve came the closest with this image.

I mean, it kind of skips everything from the age of 21 to 81, but not bad. Midjourney went super metaphorical and artistic, but I did say human life, and I can't really see humans here. The unreleased model went in a completely different direction, which I kind of like but am a bit confused by.
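Going back to that add-glasses edit for a moment: once the API arrives, a prompt-based edit might look something like this. This is a hedged sketch assuming the new model is exposed through OpenAI's existing images.edit endpoint in the Python SDK; the model name and the filename are placeholders of mine.

```python
# Minimal sketch: prompt-based image editing via the existing images.edit
# endpoint. The model name and input file are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.images.edit(
    model="gpt-4o-image",  # hypothetical model name, not confirmed by OpenAI
    image=open("stages_of_life.png", "rb"),  # placeholder input image
    prompt="Add glasses to each character",
    n=1,
    size="1024x1024",
)

print(response.data[0].url)  # URL of the edited image
```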

Now, I did miss my opportunity to talk about native image generation and editing in Google AI Studio with Gemini 2.0 Flash, but now that I have got the chance, the comparison isn't quite as favourable. I again said depict a four-panel journey showing the stages of a human life, got this, and it makes me wonder: before I was born, was I a robo-dog with a stick in my back quarter?
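If you'd rather script this than use AI Studio, here's a minimal sketch with Google's google-genai Python SDK, assuming the experimental Gemini 2.0 Flash model with image output; the exact model string is my assumption and may differ.

```python
# Minimal sketch: native image generation with Gemini 2.0 Flash via the
# google-genai SDK. The model string is an assumption and may differ.
from google import genai
from google.genai import types

client = genai.Client()  # reads GOOGLE_API_KEY from the environment

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",  # assumed experimental image-output model
    contents="Depict a four-panel journey showing the stages of a human life",
    config=types.GenerateContentConfig(response_modalities=["TEXT", "IMAGE"]),
)

# The response interleaves text and image parts; save any image bytes.
for part in response.candidates[0].content.parts:
    if part.inline_data is not None:
        with open("stages_of_life.png", "wb") as f:
            f.write(part.inline_data.data)
```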

Anyway, we can edit the images, so I said: change the baby on the right to an old man, and as you can see, I got this. Okay, now for some disclaimers: a few times I was denied permission for an image, so there are filters on the new image gen.

It did allow me to submit a photo of the Google CEO and Sam Altman, the CEO of OpenAI, and I said: make these two people arm wrestle. Even though the fidelity to how they look isn't perfect, this image in the top left isn't bad, and I thought I would be denied this generation, but I wasn't.

You can let me know in the comments whether you think slightly less filtering is a good thing, but for me, true safety is about things like bioweapons and cyberweapons. That's why, through my new link in the description, you can and possibly should enter the Gray Swan arena. If you have any interest or aptitude in jailbreaking models, testing whether they can do these kinds of things (and yes, appropriately, that now includes visual vulnerabilities: breaking models through the images you submit to them), or you're just interested in big prize pools, do check out the link in the description.

And yes, as you may have noticed, the prize pools are getting a bit out of control. In case you're wondering, because I'm doing this in Sora, I can turn any image into a video, but honestly I wouldn't quite recommend it. Even when you're using storyboards, the results aren't exactly lifelike.

The six different people of different ethnicities doing jazz hands was probably one of the most impressive outputs I saw from image gen, mainly because that was a stark weakness of image gen models going back to last year and the year before, and also because it was so much better than other models at this particular prompt.

Midjourney struggled hard, Google's Imagen 3 denied me entirely, and Reve wasn't bad. We do have the six different people; I wouldn't exactly call this jazz hands, though. One thing I do have to mention, of course, is that when you're using ChatGPT to generate images, it is typically going to be slower than the other models.

But here was another test that I hope you guys like. Just so you can see it, I said: create a difficult, what we would call in Britain, Where's Wally or Where's Waldo style image, with an italic caption telling the viewer what to look for; it should take at least 10 seconds to solve.

Now I'm gonna scroll through the images, and you can of course pause the video, but the generations, while artistically very interesting, for me all suffered from the same problem, which is that they didn't actually display the thing they told you to look for. Unless you want to be very generous and count this thing here as a tiger, but that is a big stretch.

You know what, in this one I'm actually going to give it to Imagen 3, because I can see that it's saying find the time traveler in the medieval marketplace, and even though the text is kind of screwed up and it's very easy to spot, at least it's there, and it's kind of cool.

Reve created very beautiful images, but again I think they suffered from that same problem of not actually containing the thing that you're supposed to look for. Honestly, don't waste too much of your time, but if you do spot them, let me know. I'm going to give Reve a pass on this image, because they said find the pirate hiding among the beachgoers, and I'm gonna say this is the pirate, not really hiding, but there we go.

I guess this serves to illustrate the point, which is that the logic, the kind of brains, of 4o image gen is just noticeably better than the others'. The artistry is probably similar. Obviously most people will just use this to turn their selfies into charcoal sketches or Dragon Ball Z characters, that's pretty obvious, but the fact that we now have AI models capable of producing an image like this one, with just incredibly accurate text and genuine logic behind what it's portraying, that is a true moment in AI.

It's worth a dedicated video because sometimes incremental change can add up to big change. So is this a storm in a teacup or a true moment in AI? Will you never use this tool or use it hundreds of times like I'm expecting to? Let me know. Thank you so much for watching.

See you in the next video which should be coming very soon and have a wonderful day.