
OpenAI’s New ImageGen is Unexpectedly Epic … (ft. Reve, Imagen 3, Midjourney etc)


Chapters

0:00 Intro
1:07 Prompt Adherence, vs Reve, Midjourney, Imagen 3 + one other
3:39 Idioms
4:20 Thumbnails?
5:56 Captions / Infographics
7:20 Filters and Public Figures + Gray Swan
8:30 Sora?
8:49 Ethnicities/hands
9:09 Where’s Waldo?
10:33 Selfies and Photorealism

Whisper Transcript

00:00:00.000 | I have spent quite a while testing the new 4o image gen from OpenAI and comparing it to models
00:00:06.860 | released just yesterday, for example, as well as models that aren't even publicly out yet. Rarely
00:00:12.440 | for me in AI is one model so much better than the rest. No, of course, it's still not perfect, and
00:00:19.100 | don't even think about showing the model a mirror because it will frankly have a breakdown. But the
00:00:25.140 | word that comes to mind for me about this new image gen and I know calling it new is a little
00:00:29.580 | bit of a stretch because it has been in the works for more than two years. The word that comes to mind
00:00:34.220 | is obedient. "Depict six people of six completely different ethnicities doing jazz hands." That was
00:00:41.040 | my prompt here. Okay, you could quibble that you can't quite see the hands of the people in the back,
00:00:45.500 | but this is not bad. Think of just how recently it was that hands were such a problem for AI. This new
00:00:53.240 | tool, which Sam Altman is calling Images in ChatGPT, will be available to everyone, apparently,
00:00:59.160 | even free-tier users. And it will also be coming to the API, so I thought it deserved its own video
00:01:05.160 | featuring comparisons with Reve, Midjourney, and I might even sneak another model in there.
00:01:10.480 | I'll also cover image editing, which I know is not unique to this model. You can do it with Gemini
00:01:14.480 | in Google's AI Studio, but still, it's a notch above. The first comparison I think is pretty illuminating,
00:01:21.120 | and the prompt I used was "three apples balanced on the trunk of a blue elephant with three legs
00:01:28.520 | standing beside five weeping willow trees in El Jem, Tunisia." Obviously that is an incredibly
00:01:35.280 | difficult prompt to adhere to, but I think the model did insanely well. It captured the Colosseum in El Jem and the
00:01:43.180 | blue elephant. You've got three apples on the trunk in every image and kind of five trees in every image,
00:01:50.920 | depending on how you count it. I know you've got quite a few more in the background, far into the background,
00:01:55.160 | but if you're being generous, some of these, one, two, three, four, five, are pretty accurate. I'm also just
00:02:00.160 | noticing that the shadows are fairly consistent which I think is pretty impressive but obviously not the
00:02:07.020 | three legs on the elephant. It's a little bit like my common sense reasoning benchmark, SimpleBench,
00:02:12.140 | in that having three legs here for an elephant is a twist on a common scenario, and the model just doesn't
00:02:18.160 | expect it and can't really do it. It's just been trained on too many images with elephants having
00:02:22.920 | the normal four legs. Imagen 3, which is Google's best text-to-image model, struggles somewhat. Again,
00:02:29.960 | no three legs on the elephant but this time the apples are kind of wrong in number. Not all of them
00:02:35.460 | are on the trunk, and you're not getting much of a sense of location. Then of course I wanted to test
00:02:40.680 | Reve, which was codenamed Halfmoon previously and which is claimed by that company to be the best image model
00:02:46.640 | in the world, and the way I'd phrase it is that it's very, very good. If it wasn't for 4o image gen, I'd say it
00:02:54.180 | probably is the best image model in the world, but for now I'm going to say second. On this particular prompt
00:02:59.040 | you may even prefer it, even though there are only four trees that I can see, but it's a really good image
00:03:05.080 | and great sense of location. It'll slightly more often get the number of apples wrong, but overall,
00:03:11.460 | despite occasional shadow issues, the images are pretty vivid and engaging. So massive credit to Reve here.
00:03:19.120 | I'm now going to show you a sneak peek of a model releasing tomorrow and I think this is a brilliant
00:03:25.000 | image. Not quite what I was going for but nevertheless very interesting. All of the images from this model
00:03:31.080 | were fairly similar: engaging, but not quite what I was looking for. Okay, this next one you might like
00:03:36.380 | because I'm sure you guys are going to see plenty of comparisons online, but I wanted to go one meta layer
00:03:41.160 | higher. I asked all the models to illustrate the idiom "hold your horses". That's a pretty tough test
00:03:47.380 | because it's not just about the visuals, literally holding a horse. It's also the idiom "hold your horses",
00:03:53.780 | meaning slow down. Only OpenAI's 4o image gen understood the metaphor and in every image conveyed it
00:04:00.740 | appropriately. Plus it gave some really great text too, of course. Reve, as well as having some slightly
00:04:06.820 | dodgy image details, just didn't really understand the metaphor in any of the images. Imagen 3 from Google
00:04:13.580 | couldn't do this at all. And as you can see at the top, nor could Midjourney. Okay, this next one's not
00:04:19.160 | going to be a comparison, but I think it shows off the capabilities of 4o image gen really quite well.
00:04:24.520 | Here is one of my classic thumbnails, and I gave it to 4o image gen and said "make it 3D". I think you have
00:04:30.300 | to admit that, with the slight exception of Anthropic's logo down here, the overall results are darn impressive.
00:04:37.100 | I mean, just for a moment, let's just focus on the fact that, aside from possibly a little line here
00:04:42.260 | next to "stumbles" in one of the images, the text is incredibly accurate. Then look at this one in the
00:04:48.720 | top right and I'm actually going to zoom in. The effect of the whale coming out from the water
00:04:53.120 | drawn from my thumbnail as inspiration is pretty darn impressive. Now I'm not saying that I'm going to
00:04:59.640 | immediately drop my traditional thumbnail approach, but for my just-released new Patreon video, which was
00:05:05.860 | about Claude 3.7 having theory of mind and knowing it's being tested, I did want to try it out. So I
00:05:11.600 | got my existing thumbnail and ran it through 4o image gen to see what it would come up with, and as you
00:05:16.640 | can see, you have this lab-like image with this being projected onto the wall. I don't normally like AI
00:05:22.360 | thumbnails but this is probably the first tool that has tempted me. The next test I can see being the
00:05:28.280 | most common use case for image gen with ChatGPT. You could call it images with captions or basic
00:05:34.380 | infographics, but it does really quite well. Here I asked, "depict a four panel journey showing the stages
00:05:41.540 | of a human life". Not only did I get that journey for each one, but I got these labels that I didn't even
00:05:47.960 | ask for, which I've just noticed aren't quite perfect. You can see "elderly" spelt wrong, top right,
00:05:53.440 | but again, you'd be hard pressed to say for some of these that there are any clear mistakes. Now because
00:05:59.880 | I love the UI, all of these tests were done on Sora, but of course we can't forget image editing. That is
00:06:05.400 | either unavailable or a whole set of extra steps with other image generators, but not so with ChatGPT
00:06:11.540 | with images. So for one of these images, I picked it out and said "add glasses to each character". That got
00:06:17.340 | me this image, where you can see the original image is preserved, just they now have glasses. All of the
00:06:22.500 | other image generators had problems with the four stages of life, although Reve came the closest with
00:06:28.140 | this image. I mean, it kind of skips out everything from the age of 21 to 81, but not bad. Midjourney
00:06:34.880 | went super metaphorical and artistic but I did say human life and I can't really see humans here. The
00:06:40.460 | unreleased model went in a completely different direction which I kind of like but I'm a bit confused
00:06:46.280 | by. Now I did miss my opportunity to talk about native image generation and editing in Google AI
00:06:51.880 | Studio with Gemini 2.0 Flash, but now that I have got a chance, the comparison isn't quite as favourable.
00:06:57.480 | I said "depict a four panel journey", again showing the stages of a human life, and got this, and it makes
00:07:02.940 | me wonder: before I was born, was I a robo dog with a stick in my back quarter? Anyway, we can edit the
00:07:10.280 | images, and I said "change the baby on the right to being an old man", and as you can see, I got this.
00:07:17.800 | Okay, now for some disclaimers: a few times I was denied permission for an image, so there are filters
00:07:24.760 | for the new image gen. It did allow me to submit a photo of the Google CEO and Sam Altman, the CEO of
00:07:31.360 | OpenAI, and I said "make these two people arm wrestle", and even though the fidelity to how they look isn't
00:07:38.020 | perfect, this image in the top left isn't bad. I thought I would be denied this generation, but I
00:07:44.020 | wasn't. You can let me know in the comments whether you think slightly less filtering is a good thing, but for me
00:07:50.180 | true safety is about things like bioweapons and cyber weapons. That's why you, through my new link in the
00:07:57.440 | description, can and possibly should enter the Gray Swan Arena, if you have any interest or aptitude in
00:08:04.660 | jailbreaking models, testing whether they can do these kinds of things. And yes, appropriately, that now includes visual
00:08:10.340 | vulnerabilities, breaking models through the images you submit to them. Or if you're just interested
00:08:14.980 | in big prize pools, do check out the link in the description. And yes, as you may have noticed, the prize pools
00:08:20.500 | are getting a bit out of control. In case you're wondering, because I'm doing this in Sora, I can turn any image
00:08:26.340 | into a video, but honestly I wouldn't quite recommend it. Even when you're using storyboards,
00:08:32.340 | the results aren't exactly lifelike. The six different people with different ethnicities doing
00:08:39.620 | jazz hands was probably one of the most impressive outputs I saw from image gen, mainly for the reason
00:08:45.620 | that that was a stark weakness of image gen models going back to last year and the year before, and also
00:08:51.460 | just that it was so much better than other models at this particular prompt. Midjourney struggled hard,
00:08:57.220 | Google's Imagen 3 denied me entirely, and Reve wasn't bad. We do have the six different people.
00:09:03.380 | I wouldn't exactly call this jazz hands though. One thing I do have to mention of course is that when you're
00:09:07.860 | using ChatGPT to generate images it is going to be slower typically than the other models. But here was another
00:09:13.620 | test that I hope you guys like. Just so you can see it, I said: create a difficult, what we would call in Britain
00:09:18.660 | "Where's Wally" or "Where's Waldo" style image with an italic caption telling the viewer what to look for; it
00:09:24.340 | should take at least 10 seconds to solve. Now I'm gonna scroll through the images and you can of course
00:09:28.900 | pause the video, but the generations, while artistically very interesting, for me all suffered from the
00:09:34.900 | same problem which is that they didn't actually display the thing that they told you to look for.
00:09:39.460 | Unless you want to be very generous and count this thing here as a tiger but that is a big stretch.
00:09:46.260 | You know what, in this one I'm actually going to give it to Imagen 3, because
00:09:50.580 | I can see that it's saying "find the time traveler in the medieval marketplace", and even though the text
00:09:55.700 | is kind of screwed up and it's very easy to spot, at least it's there and it's kind of cool. Reve created
00:10:00.900 | very beautiful images but again I think they suffered from that same problem of not actually having the
00:10:05.620 | thing that you're supposed to look for. Honestly don't waste too much of your time
00:10:08.660 | but if you do see them, let me know. I'm going to give Reve a pass on this image because they said
00:10:13.140 | "find the pirate hiding among the beach goers", and I'm gonna say this is the pirate. Not really hiding,
00:10:19.860 | but there we go. I guess this serves to illustrate the point, which is that the logic, the kind of brains,
00:10:25.300 | of 4o image gen is just noticeably better than the others. The artisticness is probably similar.
00:10:32.020 | Obviously most people will just use this to turn their selfies into charcoal sketches or Dragon Ball Z
00:10:37.460 | characters, that's pretty obvious, but the fact that we now have AI models capable of producing an image
00:10:43.940 | like this one, with just incredibly accurate text and genuine logic behind what it's portraying, that is
00:10:51.540 | a true moment in AI. It's worth a dedicated video because sometimes incremental change can add up to
00:10:58.180 | big change. So is this a storm in a teacup or a true moment in AI? Will you never use this tool
00:11:05.780 | or use it hundreds of times like I'm expecting to? Let me know. Thank you so much for watching.
00:11:11.060 | See you in the next video which should be coming very soon and have a wonderful day.