OpenAI’s New ImageGen is Unexpectedly Epic … (ft. Reve, Imagen 3, Midjourney etc)

Chapters
0:00 Intro
1:07 Prompt Adherence, vs Reve, Midjourney, Imagen 3 + one other
3:39 Idioms
4:20 Thumbnails?
5:56 Captions / Infographics
7:20 Filters and Public Figures + Gray Swan
8:30 Sora?
8:49 Ethnicities/hands
9:09 Where’s Waldo?
10:33 Selfies and Photorealism
00:00:00.000 |
I have spent quite a while testing the new 4o image gen from OpenAI and comparing it to models 00:00:06.860 |
released just yesterday for example as well as models that aren't even publicly out yet. Rarely 00:00:12.440 |
for me in AI is one model so much better than the rest. No of course it's still not perfect and 00:00:19.100 |
don't even think about showing the model a mirror because it will frankly have a breakdown. But the 00:00:25.140 |
word that comes to mind for me about this new image gen and I know calling it new is a little 00:00:29.580 |
bit of a stretch because it's been being worked on for more than two years. The word that comes to mind 00:00:34.220 |
is obedient. Depict six people of six completely different ethnicities doing jazz hands. That was 00:00:41.040 |
my prompt here. Okay you could quibble that the people in the back you can't quite see their hands 00:00:45.500 |
but this is not bad. Think of just how recently it was that hands were such a problem for AI. This new 00:00:53.240 |
tool which Sam Altman is calling Images in ChatGPT will be available to everyone apparently, 00:00:59.160 |
even free tier users. And it will also be coming to the API so I thought it deserved its own video 00:01:05.160 |
featuring comparisons with Reve, Midjourney and I might even sneak another model in there. 00:01:10.480 |
I'll also cover image editing which I know is not unique to this model. You can do it with Gemini 00:01:14.480 |
in Google's AI Studio but still it's a notch above. The first comparison I think is pretty illuminating 00:01:21.120 |
and the prompt I used was three apples balanced on the trunk of a blue elephant with three legs 00:01:28.520 |
standing beside five weeping willow trees in El Jem, Tunisia. Obviously that is an incredibly 00:01:35.280 |
difficult prompt to adhere to but I think the model did insanely well. It captured the Colosseum in El Jem, 00:01:43.180 |
blue elephant. You've got three apples on the trunk in every image and kind of five trees in every image 00:01:50.920 |
depending on how you count it. I know you've got quite a few more in the background, far into the background 00:01:55.160 |
but if you're being generous some of these one, two, three, four, five, pretty accurate. I'm also just 00:02:00.160 |
noticing that the shadows are fairly consistent which I think is pretty impressive but obviously not the 00:02:07.020 |
three legs on the elephant. It's a little bit like my common sense reasoning benchmark SimpleBench 00:02:12.140 |
in that having three legs here for an elephant is a twist on a common scenario and the model just doesn't 00:02:18.160 |
expect it and can't really do it. It's just been trained on too many images with elephants having 00:02:22.920 |
the normal four legs. Imagen 3 which is Google's best text-to-image model struggles somewhat. Again 00:02:29.960 |
no three legs on the elephant but this time the apples are kind of wrong in number. Not all of them 00:02:35.460 |
are on the trunk and you're not getting much of a sense of location. Then of course I wanted to test 00:02:40.680 |
Reve which was codenamed Half Moon previously which is claimed by that company to be the best image model 00:02:46.640 |
in the world and the way I'd phrase it is that it's very very good. If it wasn't for 4o image gen I'd say it 00:02:54.180 |
probably is the best image model in the world but for now I'm going to say second. On this particular prompt 00:02:59.040 |
you may even prefer it even though there are only four trees that I can see but it's a really good image 00:03:05.080 |
and great sense of location. We'll slightly more often get the number of apples wrong but overall 00:03:11.460 |
despite occasional shadow issues the images are pretty vivid and engaging. So massive credit to Reve here. 00:03:19.120 |
I'm now going to show you a sneak peek of a model releasing tomorrow and I think this is a brilliant 00:03:25.000 |
image. Not quite what I was going for but nevertheless very interesting. All of the images from this model 00:03:31.080 |
were fairly similar engaging but not quite what I was looking for. Okay this next one you might like 00:03:36.380 |
because I'm sure you guys are going to see plenty of comparisons online but I wanted to go one meta layer 00:03:41.160 |
higher. I asked all the models to illustrate the idiom hold your horses. That's a pretty tough test 00:03:47.380 |
because it's not just about visuals, literally holding a horse. It's also the idiom hold your horses, 00:03:53.780 |
slow down. Only OpenAI's 4o image gen understood the metaphor and in every image conveyed it 00:04:00.740 |
appropriately. Plus it gave some really great text too of course. Reve as well as having some slightly 00:04:06.820 |
dodgy image details just didn't really understand the metaphor in any of the images. Imagen 3 from Google 00:04:13.580 |
couldn't do this at all. And as you can see at the top nor could Midjourney. Okay this next one's not 00:04:19.160 |
going to be a comparison but I think it shows off the capabilities of 4o image gen really quite well. 00:04:24.520 |
Here is one of my classic thumbnails and I gave it to 4o image gen and said make it 3D. I think you have 00:04:30.300 |
to admit that with the slight exception of Anthropic's logo down here the overall results are darn impressive. 00:04:37.100 |
I mean just for a moment let's just focus on the fact that aside from possibly a little line here 00:04:42.260 |
next to stumbles in one of the images the text is incredibly accurate. Then look at this one in the 00:04:48.720 |
top right and I'm actually going to zoom in. The effect of the whale coming out from the water 00:04:53.120 |
drawn from my thumbnail as inspiration is pretty darn impressive. Now I'm not saying that I'm going to 00:04:59.640 |
immediately drop my traditional thumbnail approach but for my just released new Patreon video which was 00:05:05.860 |
about Claude 3.7 having theory of mind and knowing it's being tested, I did want to try it out so I 00:05:11.600 |
got my existing thumbnail and ran it through 4o image gen to see what it would come up with and as you 00:05:16.640 |
can see you have this lab like image with this being projected onto the wall. I don't normally like AI 00:05:22.360 |
thumbnails but this is probably the first tool that has tempted me. The next test I can see being the 00:05:28.280 |
most common use case for image gen with ChatGPT. You could call it images with captions or basic 00:05:34.380 |
infographics but it does really quite well. Here I asked depict a four panel journey showing the stages 00:05:41.540 |
of a human life. Not only did I get that journey for each one but I got these labels that I didn't even 00:05:47.960 |
ask for which I've just noticed aren't quite perfect. You can see the elderly spelt wrong top right 00:05:53.440 |
but again you'd be hard pressed to say for some of these that there's any clear mistakes. Now because 00:05:59.880 |
I love the UI all of these tests were done on Sora but of course we can't forget image editing. That is 00:06:05.400 |
either unavailable or a whole set of extra steps with other image generators but not so with ChatGPT 00:06:11.540 |
with images. So for one of these images I picked it out and said add glasses to each character. That got 00:06:17.340 |
me this image where you can see the original image is preserved just they now have glasses. All of the 00:06:22.500 |
other image generators had problems with the four stages of life although Reve came the closest with 00:06:28.140 |
this image. I mean it kind of skips out everything from the age of 21 to 81 but not bad. Midjourney 00:06:34.880 |
went super metaphorical and artistic but I did say human life and I can't really see humans here. The 00:06:40.460 |
unreleased model went in a completely different direction which I kind of like but I'm a bit confused 00:06:46.280 |
by. Now I did miss my opportunity to talk about native image generation and editing in Google AI 00:06:51.880 |
Studio with Gemini 2.0 Flash but now that I have got a chance, the comparison isn't quite as favourable. 00:06:57.480 |
I said depict a four panel journey again showing the stages of a human life and got this and it makes 00:07:02.940 |
me wonder, before I was born was I a robo dog with a stick in my back quarter? Anyway we can edit the 00:07:10.280 |
images and I said change the baby on the right to being an old man and as you can see I got this. 00:07:17.800 |
Okay now for some disclaimers and a few times I was denied permission for an image so there are filters 00:07:24.760 |
for the new image gen. It did allow me to submit a photo of the Google CEO and Sam Altman the CEO of 00:07:31.360 |
OpenAI and I said make these two people arm wrestle and even though the fidelity to how they look isn't 00:07:38.020 |
perfect this image in the top left isn't bad and I thought I would be denied this generation but I 00:07:44.020 |
wasn't. You can let me know in the comments whether you think slightly less filtering is a good thing but for me 00:07:50.180 |
true safety is about things like bioweapons and cyber weapons. That's why you through my new link in the 00:07:57.440 |
description can and possibly should enter the Gray Swan Arena. If you have any interest or aptitude in 00:08:04.660 |
jailbreaking models testing whether they can do these kind of things and yes appropriately that now includes visual 00:08:10.340 |
vulnerabilities breaking models through the images you submit to them or you're just interested 00:08:14.980 |
in big prize pools do check out the link in the description. And yes as you may have noticed the prize pools 00:08:20.500 |
are getting a bit out of control. In case you're wondering because I'm doing this in Sora I can turn any image 00:08:26.340 |
into a video but honestly I wouldn't quite recommend it. Even when you're using storyboards 00:08:32.340 |
the results aren't exactly life-like. The six different people with different ethnicities doing 00:08:39.620 |
jazz hands was probably one of the most impressive outputs I saw from image gen mainly for the reason 00:08:45.620 |
that that was a stark weakness of image gen models going back last year and the year before and also 00:08:51.460 |
just that it was so much better than other models at this particular prompt. Midjourney struggled hard, 00:08:57.220 |
Google's Imagen 3 denied me entirely and Reve wasn't bad. We do have the six different people. 00:09:03.380 |
I wouldn't exactly call this jazz hands though. One thing I do have to mention of course is that when you're 00:09:07.860 |
using ChatGPT to generate images it is going to be slower typically than the other models. But here was another 00:09:13.620 |
test that I hope you guys like. Just so you can see it I said create a difficult what we would call in Britain 00:09:18.660 |
Where's Wally or Where's Waldo style image with an italic caption telling the viewer what to look for. It 00:09:24.340 |
should take at least 10 seconds to solve. Now I'm gonna scroll through the images and you can of course 00:09:28.900 |
pause the video but the generations, while artistically very interesting, for me all suffered from the 00:09:34.900 |
same problem which is that they didn't actually display the thing that they told you to look for. 00:09:39.460 |
Unless you want to be very generous and count this thing here as a tiger but that is a big stretch. 00:09:46.260 |
You know what in this one I'm actually going to give it to Imagen 3 because 00:09:50.580 |
I can see that it's saying find the time traveler in the medieval marketplace and even though the text 00:09:55.700 |
is kind of screwed up and it's very easy to spot at least it's there and it's kind of cool. Reve created 00:10:00.900 |
very beautiful images but again I think they suffered from that same problem of not actually having the 00:10:05.620 |
thing that you're supposed to look for. Honestly don't waste too much of your time 00:10:08.660 |
but if you do see them let me know. I'm going to give Reve a pass on this image because they said 00:10:13.140 |
find the pirate hiding among the beach goers and I'm gonna say this is the pirate not really hiding 00:10:19.860 |
but there we go. I guess this serves to illustrate the point which is the logic the kind of brains 00:10:25.300 |
of 4o image gen is just noticeably better than the others. The artisticness is probably similar. 00:10:32.020 |
Obviously most people will just use this to turn their selfies into charcoal sketches or Dragon Ball Z 00:10:37.460 |
characters that's pretty obvious but the fact that we now have AI models capable of producing an image 00:10:43.940 |
like this one with just incredibly accurate text and genuine logic behind what it's portraying that is 00:10:51.540 |
a true moment in AI. It's worth a dedicated video because sometimes incremental change can add up to 00:10:58.180 |
big change. So is this a storm in a teacup or a true moment in AI? Will you never use this tool 00:11:05.780 |
or use it hundreds of times like I'm expecting to? Let me know. Thank you so much for watching. 00:11:11.060 |
See you in the next video which should be coming very soon and have a wonderful day.