OpenAI’s New ImageGen is Unexpectedly Epic … (ft. Reve, Imagen 3, Midjourney etc)

00:00:00.000 | I have spent quite a while testing the new 4o image gen from OpenAI and comparing it to models

00:00:06.860 | released just yesterday for example as well as models that aren't even publicly out yet. Rarely

00:00:12.440 | for me in AI is one model so much better than the rest. No of course it's still not perfect and

00:00:19.100 | don't even think about showing the model a mirror because it will frankly have a breakdown. But the

00:00:25.140 | word that comes to mind for me about this new image gen and I know calling it new is a little

00:00:29.580 | bit of a stretch because it's been being worked on for more than two years. The word that comes to mind

00:00:34.220 | is obedient. Depict six people of six completely different ethnicities doing jazz hands. That was

00:00:41.040 | my prompt here. Okay you could quibble that the people in the back you can't quite see their hands

00:00:45.500 | but this is not bad. Think of just how recently it was that hands were such a problem for AI. This new

00:00:53.240 | tool which Sam Altman is calling Images in ChatGPT will be available to everyone apparently,

00:00:59.160 | even free tier users. And it will also be coming to the API so I thought it deserved its own video

00:01:05.160 | featuring comparisons with Reve, Midjourney and I might even sneak another model in there.

00:01:10.480 | I'll also cover image editing which I know is not unique to this model. You can do it with Gemini

00:01:14.480 | in Google's AI studio but still it's a notch above. The first comparison I think is pretty illuminating

00:01:21.120 | and the prompt I used was three apples balanced on the trunk of a blue elephant with three legs

00:01:28.520 | standing beside five weeping willow trees in El Jem, Tunisia. Obviously that is an incredibly

00:01:35.280 | difficult prompt to adhere to but I think the model did insanely well. It captured the Colosseum in El Jem,

00:01:43.180 | the blue elephant. You've got three apples on the trunk in every image and kind of five trees in every image

00:01:50.920 | depending on how you count it. I know you've got quite a few more in the background, far into the background

00:01:55.160 | but if you're being generous some of these one, two, three, four, five, pretty accurate. I'm also just

00:02:00.160 | noticing that the shadows are fairly consistent which I think is pretty impressive but obviously not the

00:02:07.020 | three legs on the elephant. It's a little bit like my common sense reasoning benchmark SimpleBench

00:02:12.140 | in that having three legs here for an elephant is a twist on a common scenario and the model just doesn't

00:02:18.160 | expect it and can't really do it. It's just been trained on too many images with elephants having

00:02:22.920 | the normal four legs. Imagen 3, which is Google's best text-to-image model, struggles somewhat. Again

00:02:29.960 | no three legs on the elephant but this time the apples are kind of wrong in number. Not all of them

00:02:35.460 | are on the trunk and you're not getting much of a sense of location. Then of course I wanted to test

00:02:40.680 | Reve, which was codenamed Half Moon previously, which is claimed by that company to be the best image model

00:02:46.640 | in the world and the way I'd phrase it is that it's very very good. If it wasn't for 4o image gen I'd say it

00:02:54.180 | probably is the best image model in the world but for now I'm going to say second. On this particular prompt

00:02:59.040 | you may even prefer it even though there are only four trees that I can see but it's a really good image

00:03:05.080 | and great sense of location. It will slightly more often get the number of apples wrong but overall

00:03:11.460 | despite occasional shadow issues the images are pretty vivid and engaging. So massive credit to Reve here.

00:03:19.120 | I'm now going to show you a sneak peek of a model releasing tomorrow and I think this is a brilliant

00:03:25.000 | image. Not quite what I was going for but nevertheless very interesting. All of the images from this model

00:03:31.080 | were fairly similar: engaging but not quite what I was looking for. Okay this next one you might like

00:03:36.380 | because I'm sure you guys are going to see plenty of comparisons online but I wanted to go one meta layer

00:03:41.160 | higher. I asked all the models to illustrate the idiom hold your horses. That's a pretty tough test

00:03:47.380 | because it's not just about the visual of literally holding a horse. It's also the idiom hold your horses,

00:03:53.780 | slow down. Only OpenAI's 4o image gen understood the metaphor and in every image conveyed it

00:04:00.740 | appropriately. Plus it gave some really great text too of course. Reve, as well as having some slightly

00:04:06.820 | dodgy image details, just didn't really understand the metaphor in any of the images. Imagen 3 from Google

00:04:13.580 | couldn't do this at all. And as you can see at the top nor could Midjourney. Okay this next one's not

00:04:19.160 | going to be a comparison but I think it shows off the capabilities of 4o image gen really quite well.

00:04:24.520 | Here is one of my classic thumbnails and I gave it to 4o image gen and said make it 3D. I think you have

00:04:30.300 | to admit that with the slight exception of Anthropic's logo down here the overall results are darn impressive.

00:04:37.100 | I mean just for a moment let's just focus on the fact that aside from possibly a little line here

00:04:42.260 | next to stumbles in one of the images the text is incredibly accurate. Then look at this one in the

00:04:48.720 | top right and I'm actually going to zoom in. The effect of the whale coming out from the water

00:04:53.120 | drawn from my thumbnail as inspiration is pretty darn impressive. Now I'm not saying that I'm going to

00:04:59.640 | immediately drop my traditional thumbnail approach but for my just released new Patreon video which was

00:05:05.860 | about Claude 3.7 having theory of mind and knowing it's being tested, I did want to try it out so I

00:05:11.600 | got my existing thumbnail and ran it through 4o image gen to see what it would come up with and as you

00:05:16.640 | can see you have this lab like image with this being projected onto the wall. I don't normally like AI

00:05:22.360 | thumbnails but this is probably the first tool that has tempted me. The next test I can see being the

00:05:28.280 | most common use case for image gen with ChatGPT. You could call it images with captions or basic

00:05:34.380 | infographics but it does really quite well. Here I asked depict a four panel journey showing the stages

00:05:41.540 | of a human life. Not only did I get that journey for each one but I got these labels that I didn't even

00:05:47.960 | ask for which I've just noticed aren't quite perfect. You can see the elderly spelt wrong top right

00:05:53.440 | but again you'd be hard pressed to say for some of these that there are any clear mistakes. Now because

00:05:59.880 | I love the UI all of these tests were done on Sora but of course we can't forget image editing. That is

00:06:05.400 | either unavailable or a whole set of extra steps with other image generators but not so with ChatGPT

00:06:11.540 | with images. So for one of these images I picked it out and said add glasses to each character. That got

00:06:17.340 | me this image where you can see the original image is preserved just they now have glasses. All of the

00:06:22.500 | other image generators had problems with the four stages of life although Reve came the closest with

00:06:28.140 | this image. I mean it kind of skips out everything from the age of 21 to 81 but not bad. Midjourney

00:06:34.880 | went super metaphorical and artistic but I did say human life and I can't really see humans here. The

00:06:40.460 | unreleased model went in a completely different direction which I kind of like but I'm a bit confused

00:06:46.280 | by. Now I did miss my opportunity to talk about native image generation and editing in Google AI

00:06:51.880 | Studio with Gemini 2.0 Flash, but now that I have got a chance, the comparison isn't quite as favourable.

00:06:57.480 | I said depict a four panel journey again showing the stages of a human life and got this and it makes

00:07:02.940 | me wonder before I was born was I a robo dog with a stick in my back quarter. Anyway we can edit the

00:07:10.280 | images and I said change the baby on the right to being an old man and as you can see I got this.

00:07:17.800 | Okay now for some disclaimers: a few times I was denied permission for an image so there are filters

00:07:24.760 | for the new image gen. It did allow me to submit a photo of the Google CEO and Sam Altman the CEO of

00:07:31.360 | OpenAI and I said make these two people arm wrestle and even though the fidelity to how they look isn't

00:07:38.020 | perfect this image in the top left isn't bad and I thought I would be denied this generation but I

00:07:44.020 | wasn't. You can let me know in the comments whether you think slightly less filtering is a good thing but for me

00:07:50.180 | true safety is about things like bioweapons and cyber weapons. That's why you, through my new link in the

00:07:57.440 | description, can and possibly should enter the Gray Swan Arena. If you have any interest or aptitude in

00:08:04.660 | jailbreaking models, testing whether they can do these kinds of things, and yes, appropriately, that now includes visual

00:08:10.340 | vulnerabilities breaking models through the images you submit to them or you're just interested

00:08:14.980 | in big prize pools do check out the link in the description. And yes as you may have noticed the prize pools

00:08:20.500 | are getting a bit out of control. In case you're wondering because I'm doing this in Sora I can turn any image

00:08:26.340 | into a video but honestly I wouldn't quite recommend it. Even when you're using storyboards

00:08:32.340 | the results aren't exactly life-like. The six different people with different ethnicities doing

00:08:39.620 | jazz hands was probably one of the most impressive outputs I saw from image gen mainly for the reason

00:08:45.620 | that that was a stark weakness of image gen models going back last year and the year before and also

00:08:51.460 | just that it was so much better than other models at this particular prompt. Midjourney struggled hard,

00:08:57.220 | Google's Imagen 3 denied me entirely and Reve wasn't bad. We do have the six different people.

00:09:03.380 | I wouldn't exactly call this jazz hands though. One thing I do have to mention of course is that when you're

00:09:07.860 | using ChatGPT to generate images it is going to be slower typically than the other models. But here was another

00:09:13.620 | test that I hope you guys like. Just so you can see it I said create a difficult what we would call in Britain

00:09:18.660 | Where's Wally or Where's Waldo style image with an italic caption telling the viewer what to look for it

00:09:24.340 | should take at least 10 seconds to solve. Now I'm gonna scroll through the images and you can of course

00:09:28.900 | pause the video, but the generations, while artistically very interesting, for me all suffered from the

00:09:34.900 | same problem which is that they didn't actually display the thing that they told you to look for.

00:09:39.460 | Unless you want to be very generous and count this thing here as a tiger but that is a big stretch.

00:09:46.260 | You know what, in this one I'm actually going to give it to Imagen 3 because

00:09:50.580 | I can see that it's saying find the time traveler in the medieval marketplace and even though the text

00:09:55.700 | is kind of screwed up and it's very easy to spot at least it's there and it's kind of cool. Reve created

00:10:00.900 | very beautiful images but again I think they suffered from that same problem of not actually having the

00:10:05.620 | thing that you're supposed to look for. Honestly don't waste too much of your time

00:10:08.660 | but if you do see them let me know. I'm going to give Reve a pass on this image because they said

00:10:13.140 | find the pirate hiding among the beach goers and I'm gonna say this is the pirate not really hiding

00:10:19.860 | but there we go. I guess this serves to illustrate the point which is the logic the kind of brains

00:10:25.300 | of 4o image gen is just noticeably better than the others. The artisticness is probably similar.

00:10:32.020 | Obviously most people will just use this to turn their selfies into charcoal sketches or Dragon Ball Z

00:10:37.460 | characters that's pretty obvious but the fact that we now have AI models capable of producing an image

00:10:43.940 | like this one with just incredibly accurate text and genuine logic behind what it's portraying that is

00:10:51.540 | a true moment in AI. It's worth a dedicated video because sometimes incremental change can add up to

00:10:58.180 | big change. So is this a storm in a teacup or a true moment in AI? Will you never use this tool

00:11:05.780 | or use it hundreds of times like I'm expecting to? Let me know. Thank you so much for watching.

00:11:11.060 | See you in the next video which should be coming very soon and have a wonderful day.
