Latent Space LIVE! - Best of 2024: Startups, Vision, Open Src, Reasoning, & The Great Scaling Debate

Chapters
20:52 Conviction Startups Overview
78:44 Best of Vision 2024 (Roboflow x Moondream)
322:59 Loubna (HF) synthetic data and smol models
457:40 The scaling/wall debate w/ Dylan Patel
465:26 Opening statements
00:03:35.700 |
i need to share audio yeah because because i'm not sharing my screen right 00:03:42.700 |
but so for the mic she's going to do it oh you don't need to 00:04:08.620 |
so all the mics and like the audio from this room we're going to zoom yeah 00:04:26.620 |
set up okay um can you take them yeah you just have to mute your yeah 00:04:37.260 |
i might need to share your audio like if i present her yeah you can go but i'm just 00:04:47.340 |
muting the music yeah we just need that yeah okay over there yeah yes um 00:05:15.980 |
actually i don't know what else um i guess um yeah 00:06:45.820 |
i mean the same pattern we're gonna we're gonna sleep 00:07:13.100 |
um um yeah no you can't make it either two investors yeah 00:07:43.420 |
but uh yeah that's what the north is um they have a great terms um i didn't know 00:11:07.900 |
made plans last night it's great yeah i actually realized we should probably hire a designer 00:11:12.620 |
the weird thing is you have no idea like how many people are having trouble finding this place 00:11:25.820 |
versus so many people like this like waking up late yes well it's okay but we're recording the 00:11:31.980 |
whole thing when you said 500 i was imagining exactly just while they're going uh 00:12:07.820 |
okay um you can just plug in here and i'll drop you the zoom link 00:12:21.820 |
so we stream from zoom straight to youtube but we're also recording separately for the podcast 00:12:48.700 |
no that's not the link ignore the thing i just said 00:12:57.020 |
yeah we need we need to show them a little harder ah 00:13:04.780 |
um so yeah it should be good for zoom yeah um do we need to send you a laptop there no that's 00:13:13.340 |
great i i use my boots from pisa okay awesome 00:13:25.580 |
we think so can you can you hear anything i'm not sure 00:13:29.180 |
well there's like a slight delay but if i'm talking here it should show up there in like 00:13:36.940 |
10 seconds yeah okay oh one more thing for these mics yeah just make sure they use it so that it 00:13:44.860 |
goes into zoom yeah this is on and then yeah we're using this manually but also 00:14:01.840 |
you take it as well yeah uh all right i'm gonna wire you both up and now by the way oh i'm sorry 00:14:13.580 |
i feel like for sarah we need to give her a laugh 00:14:49.420 |
oh is that it that's it there's a puppy thing but we're indoors so we don't need that 00:14:55.900 |
wait that's so good i like your shirt oh thank you 00:15:06.060 |
yes he sent me a photo of this place and i was like we have to do this 00:15:20.060 |
i wanted to talk where the way they set up the conference there's a rotating platform 00:15:33.260 |
the center it's like a stadium like thing it's like i don't know look at this is not intense 00:15:38.860 |
yeah that's a really good bit yeah it's terrible but you should think about doing it next time 00:15:48.460 |
yeah you must be having a dinner office we just need a platform and then like 00:16:02.620 |
like we have a lot of people on youtube i don't know how many people 00:16:29.740 |
i i just have all the um openai jokes that i've warmed in my head 00:16:37.180 |
um like how does uh how does rudolf update like yeah exactly i know i know i like that one too 00:16:53.580 |
with 40 people online all right i am ready to transfer over to you 00:17:57.420 |
this sounds good i think that looks good actually yeah um and then on the top of it there's a button 00:18:06.780 |
if you press it now um then it'll start flashing red and that's the record and it's not broadcasting 00:18:11.820 |
it's just recording like that so if you push then okay there you go 00:18:19.180 |
okay they're figuring out the suit okay so you want me to dial it into the sim uh yeah i think 00:18:41.100 |
i sent it to you yeah i'll check text or email i'll text 00:19:10.140 |
screen so that plugs in yeah oh share screen too 00:19:14.060 |
so right now it's just pinning wow not camera yeah so we can tweak it 00:19:27.980 |
messing with your computer settings there we go no it's just you know 00:19:36.300 |
standard presenter issues uh this goes into stream and you're also mic'd up 00:19:42.780 |
do you have the is it on yeah mine's recording yeah nice 00:20:03.820 |
do we get computer audio as well and do we get audio from the computer too 00:20:08.540 |
okay uh we just have the hn demo but you can do a really good impression of 00:20:16.220 |
i mean well um i think we can just run it in worst case we'll um 00:20:25.740 |
we'll put it in the show notes it's fine yeah okay 00:20:32.940 |
yeah do you do you want to start by saying anything yeah i think you should probably yeah 00:20:37.740 |
okay i've been so busy with logistics and stuff that um i haven't done okay um i think we're 00:20:46.460 |
going to kick this off um thanks to everyone who made it early morning um it's like really 00:20:51.580 |
weird experiments that we wanted to try because one we saw this space uh and but two also i've 00:20:56.620 |
been to a number of these things now and um i always felt like there was not enough like 00:21:01.100 |
industry content for for people and we wanted an opportunity while everyone is in town in like one 00:21:06.780 |
central spot to get everyone together um to talk about the best stuff of the year review the year 00:21:11.980 |
it's very nice that neurips is always at the end of the year um and so i'm very honored that uh 00:21:17.420 |
sarah and pranav have agreed to help us kick this off um sarah i've known for i was actually 00:21:23.100 |
counting 17 years um and but she's she's gone she's uh been enormously successful as an ai 00:21:32.060 |
investor um even uh even back in your greylock days i was tracking your your investing 00:21:37.020 |
and it's uh it's come a long way since then um and pranav uh i i've known i've known uh shorter 00:21:42.940 |
but he's also starting to write uh really incredible posts and opinions about what he's 00:21:47.100 |
seeing as an investor so i wanted to kick this off at the industry session um we have a great day of 00:21:52.380 |
sort of like best of year recaps uh lined up i think vik is here as well um and uh and the 00:21:59.580 |
roboflow guys so uh i would just let you kick it off thank you hi everyone uh my name is 00:22:09.180 |
sarah guo and thanks to uh sean and friends here for having me and pranav so um i'd start by just 00:22:16.860 |
giving 30 seconds of intro i promise this isn't an ad uh we started a venture fund called conviction 00:22:22.140 |
about two years ago here is a set of the investments we've made uh they range from 00:22:27.580 |
companies at the infrastructure level in terms of feeding the revolution to foundation model 00:22:34.380 |
companies alternative architectures domain specific training efforts and of course applications 00:22:39.420 |
and the premise of the fund sean mentioned i worked at greylock for about a decade before 00:22:45.100 |
that and came from the product engineering side was that uh we we thought that there was a really 00:22:50.700 |
interesting technical revolution happening uh that it would probably be the biggest change in 00:22:55.580 |
how people use technology in our lifetimes and that represented huge economic opportunity 00:23:00.380 |
and and maybe that there would be an advantage versus the incumbent venture firms in that when 00:23:06.060 |
the floor is lava the dynamics of the markets change the types of products and founders that 00:23:11.020 |
you back change uh it's a lot for existing firms to ingest and a lot of their mental models may not 00:23:17.580 |
apply in the same way uh and so there was an opportunity for first principles thinking and 00:23:22.060 |
if we were right would we do really well and get to work with amazing people and so we are 00:23:25.980 |
two years into that journey and we can share some of the opinions and predictions we have with all 00:23:29.820 |
of you um sorry i'm just making sure that isn't actually blocking the whole presentation uh i'm 00:23:38.380 |
proud it's going to start us off um so quick agenda for today we'll cover some of the model 00:23:43.580 |
landscapes and themes that we've seen in 2024 uh what we think is happening in ai startups and then 00:23:48.220 |
some of our latent priors uh on what we think is working in investing so the um i thought it'd be 00:23:54.780 |
useful to start from like what was happening at neurips last year in december 2023 so in october 00:24:00.540 |
2023 openai had just launched the ability to upload images to chat gpt which means up until 00:24:05.100 |
that moment it's hard to believe but like roughly a year ago you could only input text and get text 00:24:08.940 |
out of chat gpt um the mistral folks had just launched the mixtral model right before the 00:24:14.380 |
beginning of neurips google had just announced gemini i very genuinely forgot about the existence 00:24:19.260 |
of bard before making these slides and europe had just announced that they were doing their 00:24:24.140 |
first round of ai regulation but not to be their last and when we were thinking about like what's 00:24:29.500 |
changed in 2024 there's at least five themes that we could come up with that feel like they 00:24:33.980 |
were descriptive of of what 2024 has meant for ai and for startups and so we'd start with um first 00:24:40.540 |
it's a much closer race on the foundation model side than it was in 2023 so this is lm arena 00:24:45.900 |
they ask users to rate uh generations from specific prompts so you 00:24:52.140 |
get two responses from two language models answer which one of them is better the way to interpret 00:24:55.820 |
this is like roughly 100 elo difference means that you're preferred two-thirds of the time 00:25:00.140 |
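As a reference for the claim above, the standard Elo expected-score formula is enough to check the "two-thirds" figure; a minimal sketch in Python (this is the textbook Elo formula, not something from the talk):

```python
# Standard Elo expected-score formula: probability that the higher-rated model
# is preferred, given the rating gap between the two models.
def win_probability(elo_gap: float) -> float:
    return 1.0 / (1.0 + 10 ** (-elo_gap / 400.0))

print(round(win_probability(100), 2))  # 0.64 -> preferred roughly two-thirds of the time
```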
and a year ago every open ai model was like more than 100 points better than anything else 00:25:05.020 |
and the view from the ground was roughly like open ai is the ibm there is no point in competing 00:25:09.740 |
everyone should just give up go work at open ai or attempt to use open ai models and i think the 00:25:15.020 |
story today is not that i think it would have been unbelievable a year ago if you told people that a 00:25:20.780 |
the best model today on this at least on this eval is not open ai and b that it was google 00:25:26.220 |
would have been pretty unimaginable to the majority of researchers but actually there are a variety of 00:25:32.060 |
of proprietary language model options and some set of open source options that are increasingly 00:25:36.060 |
competitive and this seems true not just on the eval side but also in actual spend so this is 00:25:41.260 |
ramp data there's a bunch of colors but it's actually just open ai and anthropic spend and the 00:25:45.900 |
open ai spend at the end of last year in november of 23 was close to 90 percent of 00:25:50.780 |
total volume and today less than a year later it's closer to 60 percent of total volume which i think 00:25:56.700 |
is indicative both that language models are pretty easy apis to switch out and people are trialing 00:26:01.420 |
a variety of different options to figure out what works best for them related second trend that 00:26:06.700 |
we've noticed is that open source is increasingly competitive so this is from the scale 00:26:11.740 |
leaderboards which is a set of independent evals that are not contaminated and on a number of topics 00:26:17.660 |
that actually the the foundation models clearly care a great deal about open source models are 00:26:21.740 |
pretty good on math instruction following and adversarial robustness the llama model is amongst 00:26:26.620 |
the top three of evaluated models i included the agentic tool use here just to point out that this 00:26:32.060 |
isn't true across the board there are clearly some areas where foundation model companies have 00:26:36.140 |
had more data or more expertise in training against these use cases but models are surprisingly an 00:26:40.940 |
increasing open source models are surprisingly increasingly effective this feels true across 00:26:45.100 |
evals this is the mmlu eval i want to call out two things here one is that it's pretty remarkable 00:26:51.420 |
that the ninth best model just two points behind the best state-of-the-art models is 00:26:56.540 |
actually a 70 billion parameter model i think this would have been surprising to a bunch of people 00:27:00.940 |
where the belief was largely that most intelligence is just an emergent property 00:27:04.860 |
and there's a limit to how much intelligence you can push into smaller form factors in fact a year 00:27:09.340 |
ago the the best small model or under 10 billion parameter model would have been mistral 7b which 00:27:14.140 |
on this eval if memory serves is somewhere around 60 and today that's the llama 8b model which is 00:27:19.660 |
more than 10 points better the the gap between what is state-of-the-art and what you can fit 00:27:23.980 |
into a fairly small uh form factor is actually actually shrinking um and again related the we 00:27:31.340 |
think the price of intelligence has come down substantially this is this is a graph of flagship 00:27:35.180 |
open ai model costs where the cost of the api has come down roughly 80 to 85 percent in call it the last year 00:27:42.300 |
year and a half which is pretty remarkable this isn't just openai too this is also like the full 00:27:47.100 |
set of models this is from artificial analysis which tracks cost per token across a variety of 00:27:51.340 |
different apis and public inference options and like we were doing some math on this if you wanted 00:27:56.140 |
to recreate like the kind of data that a text editor had or that like something like notion 00:28:01.260 |
or coda that's somewhere in the volume of a couple thousand dollars to create that volume of tokens 00:28:06.060 |
that's pretty remarkable and impressive it's clearly not the same distribution of data but 00:28:10.940 |
just as like a sense of scope the there's an enormous volume of data that you can create 00:28:14.940 |
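As a rough sketch of that back-of-envelope claim (the per-token price, budget, and document size below are illustrative assumptions, not figures from the talk):

```python
# Back-of-envelope: how many tokens "a couple thousand dollars" buys at a
# flagship-tier API price, and roughly how many documents that represents.
# All three constants are assumptions for illustration only.
price_per_million_output_tokens_usd = 10.0   # assumed flagship-tier output price
budget_usd = 2_000                           # "a couple thousand dollars"
avg_tokens_per_document = 1_000              # roughly 750 words per document

tokens = budget_usd / price_per_million_output_tokens_usd * 1_000_000
print(f"{tokens:,.0f} tokens")                                 # 200,000,000 tokens
print(f"~{tokens / avg_tokens_per_document:,.0f} documents")   # ~200,000 documents
```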
and then fourth we think new modalities are beginning to work to start quickly with biology 00:28:21.180 |
we're lucky to work with the folks at chai discovery who just released chai-1 which is an 00:28:25.420 |
open source model that outperforms alphafold 3 it's impressive that this is like roughly a year 00:28:29.980 |
of work with a pretty specific data set and then pretty specific technical beliefs but 00:28:33.980 |
models in domains like biology are beginning to work we think that's true on the voice side as 00:28:38.300 |
well point out that there were voice models before things like elevenlabs have existed for a while but 00:28:43.420 |
we think low latency voice is more than just a feature it's actually a net new experience 00:28:47.900 |
interaction using voice mode feels very different than the historical transcription first models 00:28:52.780 |
same thing with many of the cartesia models and then a new nascent use case is execution so claude 00:28:59.340 |
launched computer use openai launched code execution inside of canvas yesterday and then i think devin 00:29:03.980 |
just announced that you can all try it for 500 a month which is pretty remarkable it's a set of 00:29:09.100 |
capabilities that have historically never been available to vast majority of population and i 00:29:13.020 |
think we're still in early innings cognition the company was founded under a year ago first product 00:29:17.340 |
was roughly nine months ago which is pretty impressive if you recall like a year ago the 00:29:23.020 |
point of view on swebench was like it was impossible to surpass what 15 percent or so 00:29:28.780 |
and i think the the whole industry now considers that if not trivial accessible yeah 00:29:34.460 |
um last new modality we wanted to call out although there are many more is video um i took 00:29:40.860 |
the liberty i got early access to sora and managed to sign up before they cut off access so um here 00:29:46.220 |
is my favorite joke in the form of a video hopefully someone here can guess it 00:29:49.740 |
yeah you're telling me a shrimp fried this rice it's a pretty bad joke but i really like it 00:29:58.940 |
and i think this one the next video here is uh one of our portfolio companies heygen that 00:30:05.180 |
translated and does the dubbing for or lip sync and dubbing for live speeches so this is javier 00:30:12.460 |
milei who speaks in spanish but here you will hear him in english if this if this plays um 00:30:18.460 |
and you can see that you can capture the original tonality of of his speech and performance i think 00:30:23.500 |
audio here doesn't work but we'll we'll push something publicly sure um let's give it a shot 00:30:29.260 |
yeah excellent of the western world yeah and you can hear that this captures like his original 00:30:36.700 |
tone uh and like the emotion in his speech which is definitely new and pretty impressive 00:30:41.900 |
from from new models um so the last uh the yeah that makes sense um the last point that we wanted 00:30:50.860 |
to call out is uh the much purported end of scaling i think there is a great debate happening 00:30:55.180 |
here later today on the question of this but we think at minimum it's hard to deny that there are 00:30:59.820 |
at least some limits to the the clear benefits to increasing scale um but there also seems like 00:31:06.220 |
there are new scaling paradigms so the question of test time compute scaling is a pretty interesting 00:31:10.220 |
one it seems like openai has cracked a version of this that works and we think a foundation model 00:31:14.540 |
labs will come up with better ways of doing this and b so far it largely works for very verifiable 00:31:20.700 |
domains things that look like math and physics and maybe secondarily software engineering where 00:31:24.300 |
we can get an objective value function and i think an open question for the next year is going to be 00:31:28.380 |
how we generate those value functions for spaces that are not as well constrained or well defined 00:31:32.220 |
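A minimal sketch of why verifiable domains suit test-time compute scaling: spend more samples and keep whatever an objective checker accepts. This is only an illustration of the general idea, not any lab's actual method; `generate` stands in for a model call and `verify` for the objective value function.

```python
import random
from typing import Callable, Optional

def best_of_n(generate: Callable[[str], str],
              verify: Callable[[str], bool],
              prompt: str,
              n: int = 32) -> Optional[str]:
    """Sample up to n candidate answers; return the first one the verifier accepts."""
    for _ in range(n):
        candidate = generate(prompt)
        if verify(candidate):   # objective check, e.g. unit tests or exact-match answer
            return candidate
    return None                 # raising n (more test-time compute) raises the hit rate

# Toy usage: the "verifier" is an exact-answer check for a math question.
answer = best_of_n(
    generate=lambda p: str(random.randint(40, 45)),  # stand-in for sampling a model
    verify=lambda a: a == "42",
    prompt="what is 6 * 7",
)
print(answer)  # usually "42" with these toy numbers; None if no sample passed
```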
so the question that this leaves us in is like well what does that mean for startups 00:31:37.180 |
and i think a prevailing view has been that we live in an ai bubble there's an enormous amount 00:31:43.260 |
of funding that goes towards ai companies and startups that is largely unjustified based on 00:31:47.100 |
outcomes and what's actually working on the ground and startups are largely raising money 00:31:51.820 |
on hype and so we pulled some pitchbook data and the 2024 number is like probably incomplete since 00:31:56.940 |
not all rounds are being reported and largely suggests like actually there is a substantial 00:32:01.100 |
recovery in funding and maybe 2025 looks something like 2021 but if you break out the numbers here a 00:32:07.260 |
bit more the red is actually just a small number of foundation model labs like what you would think 00:32:11.900 |
of as the largest labs raising money which is upwards of 30 to 40 billion dollars this year 00:32:16.700 |
and so the reality of the funding environment actually seems like much more sane and rational 00:32:21.260 |
it doesn't look like we're headed to a version of 2021 in fact the the foundation model labs 00:32:25.420 |
account for an outsized amount of money being raised but the the set of money going to companies 00:32:31.260 |
that are working seems much more rational and we wanted to give you we can't share numbers for 00:32:35.900 |
every company but this is one of our portfolio companies growing really really quickly um we 00:32:41.100 |
think 0 to 20 and just plg style spending is pretty impressive if any of you are doing better than that 00:32:45.980 |
you should come find us we'd love to chat and so what we wanted to try and center a discussion on 00:32:53.740 |
this is certainly not all of the companies that are making 10 million or more of revenue and growing 00:32:57.500 |
but we took a selection of them and wanted to give you a couple ideas of patterns that we've noticed 00:33:02.300 |
that seem to be working across the board um the first one that we've noticed is like first wave 00:33:07.180 |
service automation so we think there's a large amount of work that doesn't get done at companies 00:33:12.460 |
today either because it is too expensive to hire someone to do it it's too expensive to provide 00:33:17.020 |
them context and enable them to be successful uh at uh at whatever the specific role is or 00:33:22.300 |
it's too hard to manage um those set of people so prescribing it's too expensive to hire those 00:33:26.860 |
specific set of people for sierra and decagon for customer support style companies it's really 00:33:30.620 |
useful to do like next level automation and then there's obviously growth in that and for harvey 00:33:34.380 |
and even up the story is um you can do first wave professional services and then grow beyond that 00:33:41.740 |
second trend that we've noticed is better search new friends so we think that there is a it's pretty 00:33:47.340 |
impressive like how effective text modalities have been so character and replica have been 00:33:51.180 |
remarkably successful companies and there's a whole host of not safe for work chatbots as well 00:33:55.100 |
that are pretty effective at just text generation they're pretty compelling mechanisms and on the 00:33:59.980 |
productivity side perplexity and glean have demonstrated this as well i worked at a search 00:34:03.020 |
company for a while i think the changing paradigms of how people capture and learn information is 00:34:08.220 |
pretty interesting we think it's likely text isn't the last medium their infographics or sets of 00:34:13.020 |
information that seem more useful or sets of engagement that are more engaging um but this 00:34:16.940 |
feels like a pretty interesting place to start oh yeah okay mike so one thing that i've worked on 00:34:26.940 |
investing in in a long time is democratization of different skills be they creative or technical 00:34:32.140 |
this has been an amazing few years for that across different modalities audio video general image 00:34:39.900 |
media text and now code and and really fully functioning applications um one thing that's 00:34:46.620 |
really interesting about the growth driver for all of these companies is the the end users in 00:34:52.220 |
large part are not people that we thought of as we the venture industry you know the royal we 00:34:57.980 |
thought of as important markets before um and so a premise we have as a fund is that there's 00:35:03.660 |
actually much more instinct for creativity visual creativity audio creativity technical creativity 00:35:09.660 |
than like there's latent demand for it and ai applications can really serve that i think in 00:35:15.980 |
particular mid journey was a company that is in the vanguard here and nobody understood for a long 00:35:20.380 |
time because the perhaps outside view is like how many people want to generate images that are not 00:35:27.260 |
easily you know the raster they're not easily editable they can't be used in these professional 00:35:31.180 |
contexts in a complete way and the answer is like an awful lot right for a whole range of use cases 00:35:36.220 |
and i think we'll continue to find that especially as the capabilities improve and we think the the 00:35:41.420 |
range of um uh quality and uh controllability that you can get in these different domains is still 00:35:49.020 |
it's very deep and we're still very early um and then i i think as if if we're in the first or 00:35:55.180 |
second inning of this ai wave one obvious place to go invest and to go build companies is the 00:36:02.220 |
enabling layers right um shorthand for this is obviously compute and data i think the the needs 00:36:08.540 |
for uh data have largely changed now as well you need more expert data you need different forms 00:36:15.340 |
of data we'll talk about that later in terms of who has like let's say reasoning traces in different 00:36:20.380 |
domains that are interesting to companies doing their own training but this is this is an area 00:36:25.980 |
that has seen explosive growth and we continue to invest here um okay so maybe time for some opinions 00:36:32.860 |
there was a prevailing narrative that um you know some part from companies some part from investors 00:36:42.700 |
it's a fun debate uh as to where is the value in the ecosystem and can there be 00:36:47.180 |
opportunities for startups um if you guys remember the phrase gpt rapper it was like the dominant 00:36:52.620 |
phrase in the tech ecosystem for a while of and what it what it represented with this idea that 00:36:58.540 |
there was no value at the application layer you had to do pre-training and then like nobody's 00:37:02.940 |
going to catch open ai in pre-training and you know this isn't this isn't like a a knock on 00:37:08.620 |
open ai at all these these labs have done amazing work enabling the ecosystem and we continue to 00:37:13.420 |
partner with them and and others but um but it's simply untrue as a narrative right the odds are 00:37:21.500 |
clearly in favor of a very rich ecosystem of innovation you have a bunch of choices of models 00:37:27.420 |
that are good at different things you have price competition you have open source uh i think an 00:37:33.340 |
underappreciated impact of test time scaling is you're going to better match user value with your 00:37:39.420 |
spend on compute and so if you are a new company that can figure out how to make these models 00:37:44.460 |
useful to somebody the customer can pay for the compute instead of you taking as a as a startup 00:37:49.420 |
the capex for pre-training or um or rl up front uh and um uh as pranav mentioned you know small 00:37:58.540 |
models especially if you know the domain can be unreasonably effective uh and the product layer 00:38:03.500 |
has if we look at the sort of cluster of companies that we described shown that it is creating and 00:38:09.100 |
capturing value and that it's actually a pretty hard thing to build great products that leverage 00:38:13.020 |
ai um so so broadly like we have a point of view that i think is actually shared by many of the 00:38:19.180 |
labs that the world is full of problems in the last mile to go take even agi into all of those 00:38:26.220 |
use cases is quite long okay another prevailing belief is that um or you know another great debate 00:38:34.060 |
that sean could host is like does the value go to startups or incumbents uh we must admit some 00:38:38.700 |
bias here even though we have you know friends and portfolio former portfolio companies that would 00:38:42.940 |
be considered incumbents now but um uh oh sorry swap swap uh swap views sorry uh you know there 00:38:51.740 |
are there are markets in venture that have been considered traditionally like too hard right like 00:38:57.740 |
just bad markets for the the venture capital spec which is capital efficient rapid growth that's a 00:39:03.900 |
venture backable company um where the end output is a you know a tens of billions of dollars of 00:39:09.900 |
enterprise value company um and and these included areas like legal health care defense pharma 00:39:16.140 |
education um you know any traditional venture firm would say like bad market nobody makes money 00:39:22.300 |
there it's really hard to sell there's no budget etc and and one of the things that's interesting 00:39:26.460 |
is if you look at the cluster of companies that has actually been effective over the past year 00:39:30.700 |
some of them are in these markets that were traditionally non-obvious right and so perhaps 00:39:35.340 |
one of our more optimistic views is that ai is really useful and if you make a capability that 00:39:42.300 |
is novel that is several magnitudes um orders of magnitude cheaper then actually you can change the 00:39:48.620 |
buying pattern and the structure of these markets and maybe the legal industry didn't buy anything 00:39:53.500 |
because it wasn't anything worth buying for a really long time that's one example um we we 00:39:57.660 |
also think that like what was the last great consumer company um maybe it was discord or 00:40:02.620 |
roblox in terms of things that started that have just like really um enormous user basis and 00:40:07.820 |
engagement uh until you know we had these consumer chatbots of different kinds and and like the next 00:40:13.900 |
perhaps the next generation of search as Pranav mentioned we think that the um opportunity for 00:40:20.860 |
social and media generation and games is uh large and new in a totally different way um and and 00:40:27.900 |
finally uh in terms of the markets that we look at uh i think there's broad recognition now that 00:40:33.980 |
you can sell against outcomes and services rather than software spend with ai because you're doing 00:40:39.740 |
work versus just giving people the ability to do a workflow but um if you take that one step further 00:40:45.340 |
we think there's elastic demand for many services right uh our classic example is um there's on 00:40:52.780 |
order of 20 to 25 million professional software developers in the world uh you know i imagine much 00:40:58.700 |
of this audience is technical uh demand for software is not being met right if we take the 00:41:05.740 |
cost of software and high quality software down two orders of magnitude we're just going to end 00:41:10.540 |
up with more software in the world we're not going to end up with fewer people doing development 00:41:14.940 |
at least that's what we would argue um and then finally on the incumbent versus uh startup 00:41:21.820 |
question uh the prevailing narrative is incumbents have the distribution the product surfaces and the 00:41:27.180 |
data don't bother competing with them they're going to create and capture the value and share 00:41:30.860 |
some of it back with their customers i think this is only partially true um they incumbents have the 00:41:35.820 |
distribution they have always had the distribution like the point of the startup is you have to go 00:41:40.060 |
fight with a better product or a more clever product um and maybe a different business model 00:41:45.340 |
to go get new distribution but the specifics around the product surface and the data i think 00:41:50.940 |
are actually worth understanding there's a really strong innovators dilemma if you look at the sas 00:41:55.740 |
companies that are dominant they sell by seat and if i'm doing the work for you i don't necessarily 00:42:01.020 |
want to sell you seats i might actually decrease the number of seats um the decades of 00:42:07.660 |
years and millions of man and woman hours of code that have been written to uh enable a particular 00:42:16.860 |
workflow in crm for example may not matter if i don't want people to do that workflow of filling 00:42:21.900 |
out the database every friday anymore and so i i do think that this sunk cost or the incumbent 00:42:28.060 |
advantage gets highly challenged by new ux and code generation as well and then one disappointing 00:42:34.620 |
learning that we found in our own portfolio is no one has the data we want in many cases 00:42:40.540 |
right so imagine you are trying to automate a specific type of knowledge work uh and what you 00:42:48.380 |
want is the reasoning trace um all of the inputs and the output decision um like that sounds like 00:42:56.220 |
a very useful set of data and the incumbent companies in any given domain they never save 00:43:00.620 |
that data right like they have a database with the outputs some of the time and so i i would say uh 00:43:06.700 |
one of the things that is worth thinking through as a startup is um when an incumbent says they 00:43:12.540 |
have the data like what is the data you actually need to make your product higher quality 00:43:15.660 |
okay so in in summary um you know our shorthand for the set of changes that are happening is 00:43:23.180 |
software 3.0 we think it is a full stack rethinking and it enables um a new generation of 00:43:29.660 |
companies to have a huge advantage the speed of change um favors startups if the floor is lava 00:43:35.500 |
it's really hard to turn a really big ship uh i think that some of the ceos of large companies 00:43:40.460 |
now are incredibly capable but they're still trying to make a hundred thousand people move 00:43:44.380 |
very quickly in a new paradigm um the market opportunities are different right these markets 00:43:49.260 |
that we think are interesting and very large like represent a trillion dollars of value 00:43:53.420 |
are not just the replacement software markets of the last two decades um it's not clear what 00:43:59.500 |
the business model for many of these companies should be uh sierra just started talking about 00:44:03.500 |
charging for outcomes um outcomes based pricing has been this holy grail idea in software and 00:44:08.780 |
it's been very hard but now we do more work um uh there are other business model challenges um 00:44:15.660 |
and so you know our companies they spend a lot more on compute than they have in the past they 00:44:21.020 |
spend a lot with the foundation model providers they think about gross margin uh they think about 00:44:25.660 |
where to get the data uh it's a time where you need to be really creative about product um 00:44:30.220 |
versus just replace the workflows of the past uh and it might require ripping out those workflows 00:44:36.140 |
entirely it's a different development cycle i bet most of the people in this room have written 00:44:41.260 |
evals um and like compared to you know the academic benchmark to a real world eval and said like 00:44:46.860 |
you know that's not it and how do i make a user um understand uh the um non-deterministic nature 00:44:55.580 |
of these outputs or gracefully fail i think that's like a different way to think about product than 00:45:00.220 |
in the past um and we we need to think about infrastructure again right um there was this 00:45:05.420 |
middle period where the cloud providers the hyperscalers took this problem away from software 00:45:10.380 |
developers and it was all just going to be like i don't know front end people at some point and it's 00:45:14.060 |
like we are not there anymore we're back in the hardware era where people are um acquiring and 00:45:18.780 |
managing and optimizing compute and i think that will really matter in terms of capability and 00:45:22.380 |
companies um so uh i guess we'll end with a call to action here and and encourage all of you to 00:45:30.140 |
seize the opportunity um it is the greatest technical and economic opportunity that we've 00:45:35.340 |
ever seen like we made a decade plus career type bet on it and um uh we do a lot of work 00:45:43.580 |
with the foundation model companies uh we think they are doing amazing work and they're great 00:45:48.540 |
partners and even co-investors in some of our efforts but uh i think all of the focus on their 00:45:54.620 |
interesting missions around agi and safety um do not mean that there are not opportunities in other 00:46:00.940 |
parts of the economy the world is very large and we think much of the value will be distributed in 00:46:05.820 |
the world through an unbundling and eventually a re-bundling uh as often happens in technology 00:46:10.940 |
cycles um so we think this is a market that is structurally supportive of startups we're really 00:46:16.060 |
excited to try to work with the more ambitious ones and the theme of 2024 um to us has been like 00:46:23.100 |
well thank goodness this is a this is an ecosystem that is much friendlier to startups than 2023 it 00:46:29.500 |
is what we hoped um and and so uh you know please uh ask those questions and take advantage of the 00:46:35.420 |
opportunity do those things work yeah hello they do work i can kick us off okay so if some of these 00:46:56.860 |
companies um can go from you know 1 to 20 in such a short amount of time do you think that they can 00:47:02.300 |
also disappear in a short amount of time uh i can i can take this one i mean uh i think you've seen 00:47:10.140 |
companies go from zero to 80 million and stall out pretty badly actually um so your data is correct 00:47:17.100 |
um there's gonna be uh there's a set of challenges that um are just the challenges of scale right 00:47:26.060 |
like i think sometimes the revenue numbers in these companies can overstate the maturity of 00:47:30.140 |
the businesses themselves right they need to figure out how to serve customers they need to 00:47:33.580 |
scale their leadership um they need to uh prepare to uh service these customers um with the right 00:47:41.820 |
quality level and you know like the company that we showed that went zero to 20 that company has 00:47:46.540 |
20 people right and they have you know x hundred thousand users is yeah it's very challenging um 00:47:52.300 |
and and so i think there there's a set of good hard problems that these companies will have 00:47:57.340 |
i think part of the like most catchphrases or memes they don't catch on unless there's some 00:48:03.660 |
seed of truth and so there was a set of companies that were described by this term gpt wrapper that 00:48:09.500 |
were not more than a somewhat trivial set of prompts and seo pages that directed people to 00:48:17.660 |
a particular use case and i think that's not uh that's like likely not a durable position as a 00:48:24.140 |
technology company um and and so it's not a very clean answer for you it's a it's a nuanced one but 00:48:30.700 |
some of the value that is represented by this um i'm going to scroll back to it some of this value 00:48:37.660 |
that is represented by this cluster is durable and that's the thing that we are interested in 00:48:42.300 |
um uh the the zero to 20 and the zero to 80 and then collapse it's actually valuable it's just 00:48:50.140 |
not durable right users are voting for it and other people can compete and so you know we kind 00:48:54.780 |
of separate these two questions of like you know which of these companies is defensible um and 00:49:00.220 |
where is the revenue or the usage not a novelty but something that's really important to like 00:49:05.660 |
work or play or communication sean do you want me to take questions or do you want to do it 00:49:14.060 |
yeah well yeah you can do it hi hi um i think my mic oh here it goes so if all of these companies 00:49:22.460 |
need a lot more money and this is the greatest economic opportunity ever uh don't we need much 00:49:28.860 |
bigger venture funds like orders of magnitude bigger and won't the economics of those funds 00:49:33.900 |
be really broken if they're still raising 40 million dollar like gonna invest in a bunch 00:49:37.820 |
of seed company funds okay uh this is a bit of a triggering question for me because i take a 00:49:43.820 |
particular point of view on it um uh hopefully without arrogance we've chosen to raise 00:49:48.540 |
funds that are relatively small um as early stage investors uh and part of it is the the view of um 00:49:55.980 |
like this company that you know this company uh i think they've spent like maybe seven million 00:50:04.460 |
dollars to date right um and so the view that all ai product companies or all ai companies in general 00:50:12.140 |
are very expensive is not true objectively we have we have several companies that are 00:50:16.940 |
um expensive in the traditional sense of saas like we got to go hire a lot of go-to-market people 00:50:22.460 |
and we have to pay them and there's a j curve of that investment before it comes back in 00:50:26.540 |
repeatable saas revenue um uh and you know i think um inference revenue uh we have companies that are 00:50:35.100 |
profitable or break even and have been incredibly efficient and we have companies that spend a lot 00:50:39.580 |
up front and so i think there's a an entire range um our view as a firm is uh that you know very 00:50:48.060 |
early on um my friend elad has a a funny phrase here which is um no gpu before product market fit 00:50:56.060 |
i think that is not always true we have given people gpus before anything right but but there's 00:51:01.980 |
there's a a shred of truth in this which is you can experiment like thank you to the open ai and 00:51:09.180 |
anthropics and um other companies of the world that allow uh great product people to experiment 00:51:14.620 |
at very low cost very incrementally and so i i think much of our portfolio looks like those 00:51:20.060 |
companies where you're going to see what kind of value you can bring to users without spending a 00:51:24.940 |
ton up front um as one example like we just saw um uh new fine tuning interfaces for o1 come out 00:51:33.260 |
the amount of data that you need to in theory improve um those models for a particular domain 00:51:40.300 |
is very small if that pans out like that's incredibly encouraging as well so so i would 00:51:46.780 |
say like i our goal is to work with the most important companies in ai with a relatively 00:51:52.860 |
small fund and i think that um most companies don't actually they don't benefit from a huge 00:51:59.100 |
amount of capital up front um the only thing i would add to that is uh i i think an interesting 00:52:05.740 |
trend is that we work with a number of second time founders whose point of view this time around is 00:52:09.740 |
like we're never going to make the company that big again i think it's not a surprise actually i 00:52:14.540 |
was doing the math in my head and um this rough ratio of a million dollars of revenue per 00:52:19.340 |
employee for an early stage company holds true for like a remarkable number of our companies like 00:52:23.420 |
a number of our companies have more millions in revenue than they do employees and the point of 00:52:28.060 |
view of a bunch of this is like we're going to keep it that way like we're we're not going to 00:52:31.020 |
grow into a giant team uh ai will make us much more efficient and if you believe in the grand 00:52:35.660 |
vision of much of the intellectual labor that we do should actually just be captured by some 00:52:39.980 |
number of models and we can build much more long-term efficient businesses than we have been 00:52:44.060 |
able to historically i do think it's an interesting question because um if we think 00:52:49.180 |
there is this much opportunity like your opportunity doesn't come evenly right and so 00:52:54.460 |
i'd say our investment pacing is higher than i guess mine has been traditionally and uh another 00:53:01.420 |
part of our view is like okay well we want to offer and we want to offer founders a certain 00:53:05.980 |
service level um and you know founders can decide if they want that or not but it is it's very time 00:53:12.140 |
expensive to us we can only work with that many companies we think many more are really interesting 00:53:17.580 |
and that is one of the reasons that pranav and i did this program for the ecosystem called embed 00:53:21.980 |
where we can work with a larger set of companies we own less but we give them you know uh a network 00:53:27.340 |
and some guidance and and it is genuinely because there are more interesting things that we think 00:53:31.420 |
are going to work than we can work on in a traditional um like artisanal venture sense 00:53:36.620 |
and shameless plug applications will open in january 00:53:38.940 |
i think if i press a button so fast oh so fancy cool uh hi thanks for the talk it was awesome 00:53:53.500 |
so i work for a series c enterprise focused company called writer and one of the interesting 00:53:58.380 |
things about the multi-modality thing that we're seeing in the enterprises beyond vision we're not 00:54:03.500 |
actually seeing a lot of like demand for multi-modality like we'll get asked about um audio 00:54:09.500 |
and video stuff but then when we ask like sort of what's the use case it's sort of like i don't know 00:54:14.860 |
and so i'm curious if if you and your um like portfolio companies are are seeing that in the 00:54:21.020 |
enterprise space and if so like what use cases it seems very focused like the multi-modality stuff 00:54:25.260 |
seems great for the consumer level i'm curious if you're seeing anything on the enterprise side 00:54:30.300 |
i think it's a good call out um enterprises the data they have is mostly like it's text it's like 00:54:36.700 |
structured data and some sql data like it's uh um i don't think your average enterprise has that much 00:54:43.020 |
vision video audio data that is that interesting um but i think that will change um like 00:54:50.940 |
maybe it's because i'm like lazy and disorganized but humans are very unstructured like they don't 00:54:57.260 |
want they don't necessarily think in terms of like relational database schema and like hierarchical 00:55:02.700 |
management of their own information uh and i i think there's a future where we take that away 00:55:07.900 |
from people um and um the capture of information that you're going to use for different enterprise 00:55:13.260 |
workflows um uh enables more multi-modal use if that makes sense and so like the sort of obvious 00:55:20.060 |
example would be there are companies from like perhaps a half generation ago like the gongs of 00:55:24.940 |
the world that captured video and found some um keywords and initial insights uh for sales reps 00:55:31.820 |
but the communications within an organization the decisions made um the uh things that people 00:55:40.460 |
create i think there will be much more capture especially of video but um uh making use of it 00:55:48.460 |
requires companies to do that capture um so we kind of require this intermediate step i think 00:55:54.140 |
there's a company in our uh and this is still a prosumer company today as well to your point of 00:55:59.100 |
like you know the consumer prosumer side is ahead of the enterprise but there's a company in our 00:56:03.340 |
last embed batch called highlight that kind of has this premise that like okay well you know 00:56:08.060 |
we're going to use the multi-modality by using on-screen capture that's what this little like 00:56:12.460 |
bubble is on screen and audio capture and i think that um i think it's a powerful idea 00:56:22.140 |
by the way just a quick check uh peter isaac are you here 00:56:28.860 |
uh hi thanks yeah there's sort of like a meme going around that the the price of intelligence 00:56:38.940 |
is going to go to zero um and you can kind of see this with gpt-4o and and with gemini flash 00:56:45.100 |
you can get a million tokens a day which is probably enough for a small company right like 00:56:51.260 |
so i'm curious how as these large companies lose tons of money for market share like how are 00:56:58.540 |
startups going to respond to this like how does that change the market um i think it is impossible 00:57:04.300 |
for anything to be too cheap so i'll start with that um i would also say this company 00:57:09.020 |
with this like awesome revenue chart like i'm pretty sure we paid like five to seven million 00:57:14.460 |
dollars to a uh foundation model provider in this period of time right and so um uh demand is 00:57:21.900 |
like if there was like a secondary theme to this talk demand is elastic in so many ways especially 00:57:26.540 |
for technology and when you make things cheaper we want things to be more intelligent right um and so 00:57:32.940 |
if you make hundreds of calls in order to deliver an output um then suddenly like the fact that the 00:57:39.660 |
cost of calls come down 85% doesn't do you enough uh and so yes it's like an incredibly compelling 00:57:46.380 |
idea of like having intelligence too cheap to meter i'm like maybe this is really old school 00:57:51.340 |
of me but for the last two decades like the internet and compute and software and data 00:57:56.300 |
pipeline like they it still hasn't been cheap enough actually we would do more if it was free 00:58:02.140 |
so uh the other like uh physical barrier that we've run into is um when models are really large 00:58:11.420 |
if you're not going to like quantize and distill and do domain specific things like it's hard to 00:58:16.300 |
run you need a lot of compute just to state the very basics and even with the foundation model 00:58:21.420 |
providers we are seeing people run into inference capacity issues and so um i do not know if this 00:58:27.420 |
is true but uh like one way to read anthropic pricing change is there's not enough capacity 00:58:34.300 |
right uh and so i think like um incredible kudos to the open source ecosystem incredible kudos to 00:58:40.860 |
open ai for like staying on this drumbeat of offering cheaper and cheaper intelligence in 00:58:45.980 |
every generation but uh like we have a companies that are spending a lot of money on um you know 00:58:53.980 |
let's say um search and validation systems with many calls and we think that will continue 00:58:58.940 |
i think you can see that as well in like the the price charts that we had before 00:59:03.580 |
the like o1 pricing is still absurd um it it seems like it actually is gpt-3 pricing 00:59:11.180 |
right yeah but i mean volume of tokens i think um like it is really interesting that 00:59:18.940 |
if you believe like the i mean the the other part of this is like if you look at the test 00:59:22.700 |
time compute scaling um this is it's a log scale like uh it's easy to forget that like that's a lot 00:59:30.060 |
of like historically um like as a result of overtraining a small set of companies took on 00:59:35.180 |
the majority of financial burden for generating high quality models which is you just overtrain 00:59:39.580 |
the shit out of your model and then it's useful for everyone else um if the customer has to pay 00:59:43.740 |
this like that's a lot of money um if you want high quality generation and that means that i pay 00:59:48.940 |
on the order of like thousands of attempts um that's that ends up being pretty expensive 00:59:53.660 |
um question from youtube uh so hi to the youtube audience 00:59:59.100 |
um so we you know you talked about price right price going down uh there's also the other 01:00:05.420 |
dimension of capabilities going up and people always getting steamrolled by open ai so the 01:00:10.380 |
question is what are some specific ways that you've seen companies build to prepare for better models 01:00:14.860 |
like gpt5 or o2 like how do you future proof that um so i i think the like the most common refrain 01:00:22.940 |
from at least openai but i think the the model companies is you should build a company where 01:00:27.340 |
you're excited when you hear that a new model is coming out not anxious um i would have like one 01:00:33.260 |
edit to this which is like in the limit it seems like the majority of things that are worth building 01:00:37.100 |
today are actually i don't know should you hire a sales team at all if if you think that models 01:00:40.540 |
would be perfectly capable um like one framing that i've thought about on this is um you should 01:00:45.500 |
decide like uh how much you believe uh foundation models will improve on like some core learning or 01:00:53.100 |
intelligence capability um and then build your company imagining that on that prediction so 01:00:59.020 |
the like an example here would be um like if you take like i think there's a generation of these 01:01:04.300 |
like copywriting companies that uh were largely subsumed by chat gpt and the the story for many 01:01:10.300 |
of them was the original usage was they understood better than other people how to get the model to 01:01:16.060 |
like learn what my intent was in generating some piece of content some piece of seo content or they 01:01:19.980 |
understood how to ingest information about my business and it's not hard to imagine like the 01:01:23.900 |
next generation of models are just natively better at this like the context length gets longer you can 01:01:28.140 |
stuff more into the context length you can crawl and like learn more about external websites like 01:01:32.860 |
all that is like relatively cheap and so if the the core thesis the company looks like we don't 01:01:37.580 |
think models will be capable of doing that that feels uh likely short-sighted on the other hand 01:01:42.940 |
like there are a number of delivery mechanisms that are like far out of range of what what models 01:01:48.380 |
will do like sarah had a a good example of this which is like there are some businesses where the 01:01:52.940 |
limiting factor is like not actually intelligence like the the limiting factor for a number of 01:01:57.100 |
businesses is like access to a specific set of people or um like i don't know we work with a 01:02:01.740 |
pharmacy services company where like a core question is like long term can you negotiate 01:02:05.340 |
pricing contracts the core issue there isn't intelligence you need some amount of scale and 01:02:08.940 |
then the ability to negotiate contracts um so i think i think many businesses are not exactly 01:02:13.820 |
just a function of your ability to efficiently compute some small set of things i gave this 01:02:18.780 |
presentation um with pranav and i'm like oh i'm so biased it just sounds like startups are gonna 01:02:22.860 |
win everything and i'm um we still there i like to play this game which is what investment decision 01:02:28.860 |
do you regret from the past year it's a really fun game i'm super fun yes um but one of the one of 01:02:34.140 |
the decisions that i regretted was actually um a company that operates in uh uh a space that feels 01:02:43.420 |
very core to perhaps foundation model companies and to hyper scale software players where there's 01:02:50.860 |
tons of ecosystem risk around the company and by the way the people are amazing the metrics were 01:02:56.060 |
amazing we're just like oh they're gonna get crushed and so with everything i said i still 01:03:00.780 |
like overestimated the incumbents like ability to compete and make aggressive strategic decisions 01:03:07.020 |
and so um i i think it's like really hard to overstate how important it is to understand um 01:03:14.460 |
somebody can steamroll you if they focused all of their effort and all their best people 01:03:21.500 |
on a particular area um are they going to right the copywriting example is illustrative because 01:03:28.700 |
it's just not hard to see that understanding the context of a business from its website and from a 01:03:36.460 |
couple documents and by making prompting a little bit easier and adding like some buttons that 01:03:40.540 |
replace some prompts or doing suggested queries like it's just not a lot of work right but there 01:03:46.460 |
are things that are a lot of work like having taste in developer products and distributing 01:03:51.340 |
something amazing and so uh i i i actually think that um uh it's if you ask me like we have to make 01:04:00.300 |
predictions in this business i worry more about under projecting capability than i worry about 01:04:05.500 |
over projecting at least in the short term and then i worry more about um expecting too much 01:04:11.820 |
from the incumbents and being too afraid of them than uh being not afraid enough maybe it's just 01:04:18.940 |
one investment regret either one of you yeah we have one more from online oh okay you can do the 01:04:28.700 |
online one uh how do you see ai changing hardware or in what ways and for example do you see a new 01:04:39.100 |
apple coming out transforming hardware to that level not specifically the humane situation 01:04:45.900 |
they're trying to ask very generally how ai changes hardware uh i'm sorry okay i i'd approach this from um uh 01:04:55.980 |
two dimensions um uh everybody every investor wants a like a new consumer hardware platform 01:05:04.700 |
to exist because it's so valuable and the question is like why why should it um i can think of two 01:05:10.380 |
very good reasons one is that the usage pattern that you can imagine for ai applications actually 01:05:16.460 |
requires you to um like the specs you'd want are different right like what if i want to capture 01:05:22.060 |
image or video 100 of the time and um that's like a determinant of my battery life of my 01:05:29.740 |
sensors of how i manage my network etc what if i want to run local models all the time like maybe 01:05:35.500 |
like most of the phone should be a gpu right um i don't uh i i think that the usage patterns are 01:05:42.700 |
perhaps very different for the next generation of you know the the intelligence in your hand 01:05:48.460 |
um i think it's a hard thing to pull off another reason that you could believe in a new hardware 01:05:54.700 |
device is that the advantages of the existing consumer platforms go away right and so at the 01:06:01.260 |
extreme like should you have individual applications that track a single habit like drink water today 01:06:11.740 |
sarah like i don't know like i can generate that pretty easily now and like maybe the single 01:06:17.420 |
function applications that live in the mobile phone ecosystems are um part of uh a more general 01:06:24.780 |
intelligence and um they like that ecosystem is less important um and so i i think there are 01:06:30.700 |
different arguments for this uh and like we continually look for uh opportunities to invest 01:06:37.660 |
here i don't think this is exactly what you asked but i also think the um like there are 01:06:43.500 |
we invested in a company this past year um that is doing uh robotics um i for many years at greylock 01:06:52.700 |
my prior firm like thought of robotics as an easy way to lose a lot of money over a long period of 01:06:57.180 |
time um and and like i think that is true when you look at the outcome set for classical robotics 01:07:03.100 |
even for the companies that got to scale of distribution for an industrial robot or a single 01:07:07.740 |
use consumer robot um but like it's really cool that algorithms and generalization from um the 01:07:15.180 |
broader machine learning field seem to apply here as well uh and so i think being imaginative about 01:07:22.220 |
what physical intelligence looks like is also something we're excited about 01:07:26.460 |
yeah okay okay okay so related to agents i think everyone has been chatting about agents you're 01:07:39.900 |
seeing more like agent usefulness and production but i'm more curious like at the infrastructure 01:07:44.860 |
layer what agent what infrastructure primitives do you think are required for agents to actually work 01:07:50.620 |
and continue to work in production um okay i uh i don't know we talked about this a little bit i'm 01:07:59.980 |
not sure if our points of view in this are the same i think it is um i think it's really hard 01:08:03.580 |
to tell um my suspicion is that um like if you look at the number of like true agents that work 01:08:11.420 |
like the number roughly rounds to zero maybe it's like low single digits or low double digits now 01:08:17.180 |
um double double yeah and uh like they're all like relatively recent i would say like beginning 01:08:21.900 |
of this year um we saw like a bunch of agent framework companies and um like i uh like i 01:08:27.820 |
empathize with like the the root of the question which is it's just really hard to tell what any 01:08:31.340 |
of these companies need especially when like this set of companies that works really well is unclear 01:08:34.780 |
and um i i think there's a lot of valid anxiety on what foundation model companies want the 01:08:39.900 |
interface to be like the computer use interface is a pretty low level one like the anthropic 01:08:44.700 |
version is like actually just make specific clicks and you know like rumors of other interfaces are 01:08:49.820 |
like much more general like they're take actions on a specific web page um or like entire browser 01:08:54.620 |
environments and so um like at a high level like i imagine that there are sets of like there's like 01:08:59.660 |
the full scope of tools which is like i worked in a search engine for a while like crawl seems 01:09:02.940 |
pretty useful live data seems pretty useful like an api that looks something like here's a url give 01:09:08.140 |
me the set of data that's available or here's a url and a user login let me take some action 01:09:13.420 |
on this page seems pretty useful um and then i don't know what the right place to operationalize 01:09:18.300 |
this and commercially develop a product are um if i had like uh if i was building a company here 01:09:23.980 |
like one thing that i think is useful is to just remain agile like the core set of infrastructure 01:09:28.940 |
is consistently useful like a crawler is consistently useful and then one day you can 01:09:33.020 |
figure out how to expose this better um but i i like empathize with the difficulty of like it's 01:09:39.500 |
really hard to know what works for a bunch of agent companies and my suspicion is like the 01:09:43.980 |
most successful agent frameworks will come from the most successful of these agent companies that 01:09:48.220 |
solve these problems in-house for themselves and then operationalize this externally like it's 01:09:52.220 |
some version of like react is really useful because react was like well adopted at facebook for a 01:09:56.540 |
while um i think we can say that there are like missing components in the ecosystem where that 01:10:05.180 |
if there was a default lots of agent developers would use it right um and so like identity and 01:10:13.020 |
access management is a big problem um uh like if you could make agent development feel more like 01:10:21.500 |
traditional software development i think a lot of people would use that and be like oh like 01:10:25.260 |
you know it magically retries until it gets something and then it gives me like data back 01:10:29.180 |
about how well it's working like things that like it's i think it's pretty easy to actually imagine 01:10:33.420 |
the utilities in the abstract that would be useful to the ecosystem and then um the entire environment 01:10:39.980 |
is fluid right and so um uh do you need like if you think about other things in infrastructure 01:10:46.300 |
like will more workloads need vector indices yes like what is the shape of company that gets to be 01:10:52.140 |
durable here like we don't know yet um and we'll keep looking at it but as pranav said i think we 01:10:57.260 |
look to the handful of companies in our portfolio that are agents working at some scale and um and 01:11:05.260 |
like look for the patterns there versus try to intuit it right now my cached answer was wrong i should 01:11:10.860 |
have updated it's a it's a dozen not a small number it's been a long six months guys yeah 01:11:17.260 |
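A rough illustration of the kind of infrastructure primitives discussed in this answer, crawl a URL, list the data available behind it, and take an authenticated action on a page; every name and signature here is hypothetical, sketched from the conversation rather than taken from any real product.

```python
# Hypothetical agent-infrastructure primitives sketched from the discussion above.
# None of these names correspond to a real product; they only illustrate the shape
# of the "crawl / list data / take action" interface described in the talk.
from dataclasses import dataclass
from typing import Protocol


@dataclass
class PageData:
    url: str
    text: str                 # extracted page text
    fields: dict[str, str]    # structured data found on the page


class AgentInfra(Protocol):
    def crawl(self, url: str) -> PageData:
        """Fetch a URL and return the data available on it."""
        ...

    def act(self, url: str, session_token: str, action: str) -> PageData:
        """Given a logged-in session, perform a named action on the page
        (e.g. 'add_to_cart') and return the resulting page state."""
        ...
```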
uh i think one last question and there's a whole bunch of online stuff you won't get to but um yeah 01:11:23.340 |
mark okay um it seems like there should be more consumer companies 01:11:29.180 |
like why why aren't there or is it just a matter of time 01:11:38.860 |
we keep bringing people into embed we keep looking i i think the uh i genuinely this is not 01:11:46.060 |
a um a a knock on the research community or the really young set of founders that like 01:11:52.460 |
i think focused on ai companies um first but the diffusion of innovation curve that applies to 01:11:58.940 |
customers i think also applies to entrepreneurs um researchers saw the capability first and they're 01:12:06.300 |
like like we should do something with this this is going to be amazing and it's like that will 01:12:10.540 |
continue to happen like our portfolio is heavily overrepresented with with people from the research 01:12:14.860 |
community pushing the pushing the state of the art with creative technical ideas um uh i think young 01:12:21.180 |
very young people also were quite early to ai because they're like oh of course like this 01:12:25.900 |
makes sense i've never seen other technology like chatgpt all the way um and their opportunity 01:12:31.660 |
cost is lower than like you're the best product person at an amazing product organization like 01:12:37.740 |
you have to leave your job to start a new company uh and it's been a really long two years like i 01:12:43.900 |
feel like that's just started to happen where some of the talent that has the and you know maybe 01:12:51.660 |
maybe it's just like the next zuck you know there's some dropout that figures out like the 01:12:55.900 |
pattern of social interaction and it's like really ai native about this stuff i also think there's a 01:13:00.460 |
chance that um some of the people who have built intuition for um consumer adoption and consumer 01:13:08.780 |
interfaces they're just taking a little bit to also build intuition for ai products and now 01:13:13.180 |
they're showing up and starting companies and experimenting and so um we have a lot of confidence 01:13:18.540 |
like it is going to happen over the next few years and just a matter of time okay i think we're i 01:13:24.860 |
think we're out of time i'm just trying to defer to sean here but thank you so much um you know 01:13:29.420 |
please call us yeah i'm sure sarah for now we'll be sticking around uh so you can you can uh sort 01:13:38.140 |
of ask some questions outside or you know whatever you want to do in networking wise but we're going 01:13:42.860 |
to move on in our schedule uh we have a ton of papers that we want to cover this is basically 01:13:47.740 |
paper club live um and i think isaac and peter uh you guys uh so the top um the top uh when people 01:14:00.380 |
signed up we actually asked people like what you wanted to cover and the top um votes were vision 01:14:06.300 |
open models um post transformers and all the other stuff that's coming later we also added reasoning 01:14:12.780 |
because i didn't even have the option there and i was like what am i doing doing a sort of paper 01:14:17.980 |
review uh session this year without uh talking about reasoning and um you know test time compute 01:14:24.220 |
so uh but first we're gonna have uh vision uh roboflow has been uh really uh great friends with 01:14:29.340 |
latent space we've we've had um joseph nelson on twice um with facebook talking about all the 01:14:35.820 |
innovations in vision but um it's not you know only about segmentation there's a lot of foundation 01:14:41.100 |
model uh progress that happened this year in both the large space and the very small space so we're 01:14:45.820 |
also very proud to have vick um to update us on moondream which he's been hacking away at 01:14:51.820 |
for the past year yeah very very short amount of time um are you guys ready are you plugged in 01:14:57.020 |
uh sarah pranav do you do you guys want to take questions like 01:15:05.580 |
i don't know if people want to like there are people that want to talk to you 01:15:18.380 |
awesome yeah just plug in on yeah the white thing exactly do you have sound the stuff 01:15:25.900 |
no sound listen do you have any audio things all right cool cool 01:15:29.500 |
stay close to the mic uh hi hey are they mic'd up nice yeah 01:15:44.700 |
uh man i was hoping to use speaker notes that's not gonna work 01:15:51.580 |
um you could do like a mirroring thing yeah yeah uh so there's settings display yeah yeah 01:16:53.420 |
email yeah both of us relied on your vision capabilities um yeah so this is for the screen 01:17:01.020 |
share is for the live stream and also the editing or recording that we're doing later 01:17:04.780 |
okay so you just share your screen and mute yourself um we got we got the audio you just 01:17:12.140 |
want to capture the screen video and share the um share the green share share the screen that 01:17:17.980 |
you're actually want people to see yeah that one the the the one with with the image that one but 01:17:25.900 |
this is the speaker view yeah you don't want to share the speaker yeah so so you want to share 01:17:30.460 |
this out too that's right double click on it you're good okay all right all right figuring 01:17:37.900 |
things out like yeah now where'd the presentation go uh you can you can do the triple yeah triple 01:17:49.260 |
slide there you go let's pick pick the thing and it is it up there are you just struggling no 01:17:54.140 |
let's uh kill this how do i exit out of this apologies technical difficulties 01:18:02.140 |
nice okay let's okay we're going to drag this up yeah perfect 01:18:30.300 |
we're just gonna make this full screen and call it good 01:18:32.460 |
okay yay okay um hi we're isaac and peter from roboflow and we're going to talk about the best 01:18:46.700 |
papers of 2024 in computer vision um so for us we define best as what made the biggest shifts 01:18:56.780 |
in the space uh and to determine that we looked at what are some major trends that happened uh 01:19:03.740 |
and what papers most contributed to those trends so i'm going to talk about a couple 01:19:06.620 |
trends peter's going to talk about a trend and then uh we're going to hand it off to moon dream 01:19:10.860 |
so the trends that i'm interested in talking about are a major transition from models that 01:19:19.260 |
run on per image basis to models that run using the same basic ideas on video and then also how 01:19:26.700 |
detrs are starting to take over the uh real-time uh object detection scene from 01:19:33.420 |
the yolos which have been dominant for years uh so as a highlight um we're going to talk about 01:19:40.540 |
sora which from my perspective is the biggest paper of 2024 even though it came out in february 01:19:45.980 |
um yeah sora doesn't have a paper it's just a uh a blog post um so i'm going to fill it in 01:19:55.900 |
with details from replication efforts including open sora and related work such as stable 01:20:01.020 |
video diffusion um and then we're also going to talk about sam2 which applies the sam strategy to 01:20:08.780 |
video and then the improvements in 2024 to detrs that are making them a Pareto 01:20:15.580 |
improvement to yolo based models um so to start this off we're going to talk about uh the state 01:20:23.260 |
of the art of video generation at the end of 2023 magvit uh magvit is a discrete token video 01:20:32.940 |
tokenizer akin to vq-gan but applied to video sequences and it actually outperforms uh state 01:20:40.940 |
of the art uh handcrafted video compression frameworks uh in terms of the uh bit rate 01:20:48.380 |
versus human preference for quality and video is generated by autoregressing on these discrete 01:20:53.020 |
tokens um generates some pretty nice stuff but up to like five seconds length and you know not 01:20:59.100 |
super detailed and then suddenly a few uh months later we have this which when i saw it was totally 01:21:06.700 |
mind-blowing to me um 1080p a whole minute long we've got light reflecting and puddles that's 01:21:13.020 |
reflective uh reminds me of those rtx demonstrations for next generation video games such as cyberpunk 01:21:20.860 |
but with better graphics you can see some issues in the background if you look closely but they're 01:21:26.300 |
kind of with a lot as with a lot of these models the issues tend to be things that people aren't 01:21:31.660 |
going to pay attention to unless they're looking for in the same way that like six fingers on a 01:21:35.340 |
hand you're not going to notice is a uh giveaway unless you're looking for it um so yeah as we 01:21:43.500 |
said sore does not have a paper so we're going to be filling it in with uh context from the rest of 01:21:48.140 |
the uh computer vision scene attempting to replicate these efforts um so the first step 01:21:56.300 |
you have an llm caption a huge amount of videos um this this is a trick that they introduced in 01:22:04.220 |
dolly 3 where they train a uh image captioning model to just generate very high quality captions 01:22:10.940 |
for a huge corpus and then train a diffusion model on that their uh sora and the replication 01:22:18.700 |
efforts also show a bunch of other steps that are necessary for good video generation including uh 01:22:24.860 |
filtering by aesthetic score and filtering by making sure the videos have enough motion so 01:22:30.060 |
they're not just like kind of the generator is not learning to just generate static frames um 01:22:35.580 |
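A minimal sketch of the two filters just described, dropping clips with a low aesthetic score and clips with too little motion; the thresholds and the frame-difference motion proxy are assumptions, not the actual Sora or Open-Sora recipe.

```python
import numpy as np

def mean_frame_difference(frames: np.ndarray) -> float:
    """Crude motion proxy: average absolute pixel change between consecutive frames.
    frames has shape (T, H, W, C) with values in [0, 1]."""
    return float(np.abs(np.diff(frames, axis=0)).mean())

def keep_clip(frames: np.ndarray, aesthetic_score: float,
              min_aesthetic: float = 5.0, min_motion: float = 0.01) -> bool:
    """Keep a clip only if it looks good enough and actually moves
    (so the generator doesn't learn to emit static frames)."""
    return aesthetic_score >= min_aesthetic and mean_frame_difference(frames) >= min_motion

# toy usage: a 16-frame random clip easily passes the motion check
clip = np.random.rand(16, 64, 64, 3)
print(keep_clip(clip, aesthetic_score=5.5))
```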
so then we encode our video into a series of space-time latents once again they were very 01:22:45.100 |
sparse on details so um the replication related works uh open sora actually uses a magvit v2 01:22:52.540 |
itself to do this but swapping out the uh discretization step with a classic vae 01:23:00.140 |
auto encoder framework and they show that there's a lot of benefit from getting the temporal 01:23:07.740 |
compression which makes a lot of sense as uh the each sequential frames and videos have mostly 01:23:13.500 |
redundant information um so by compressing against compressing in the temporal space you allow the 01:23:21.500 |
latent to hold a lot more semantic information while uh avoiding that duplicate um 01:23:28.300 |
so we've got our space-time latents possibly via some 3d vae presumably a magvit v2 01:23:39.020 |
um and then you throw it into a diffusion transformer so um i i think it's personally 01:23:47.740 |
interesting to note that open sora is using a magvit v2 which originally used an autoregressive 01:23:53.980 |
transformer decoder to model the latent space but uh is now using a diffusion uh diffusion 01:24:01.740 |
transformer so it's still a transformer happening just the question is like is it parameterizing 01:24:06.060 |
the stochastic uh differential equation or parameterizing a uh conditional distribution 01:24:11.100 |
via autoregression um it's also um it's also worth noting that most diffusion models today 01:24:21.100 |
the the very high performance ones are switching away from the classic like ddpm 01:24:24.380 |
denoising diffusion probabilistic modeling framework to rectified flows um rectified 01:24:31.260 |
flows have a very interesting property that as they converge they actually get closer to 01:24:36.940 |
being able to be sampled with a single step which means that uh in practice you can actually 01:24:42.460 |
generate high quality samples much faster um major problem of ddpm and related models for 01:24:50.380 |
the past four years is just that they require many many steps to generate high quality samples 01:24:56.380 |
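To make the few-step sampling point concrete, here is a toy Euler sampler for a rectified-flow-style velocity field; because the learned trajectories are close to straight, the same update can be run with very few steps. The velocity function below is a stand-in, not a trained model.

```python
import numpy as np

def sample_rectified_flow(velocity_fn, x0: np.ndarray, num_steps: int) -> np.ndarray:
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (data) with Euler steps.
    With near-straight trajectories, even num_steps=1 lands close to the target."""
    x, dt = x0.copy(), 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x

# Toy "trained" velocity field: it points straight from the current sample toward a
# fixed target, so the trajectory is exactly straight and one step already suffices.
target = np.ones(4)
v = lambda x, t: (target - x) / max(1.0 - t, 1e-6)

noise = np.zeros(4)
print(sample_rectified_flow(v, noise, num_steps=1))    # ~[1, 1, 1, 1]
print(sample_rectified_flow(v, noise, num_steps=50))   # same endpoint, much more work
```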
so and naturally the third step is throwing lots of compute at the problem 01:25:02.540 |
so uh i didn't i never figured out how to manage to get this video to loop 01:25:08.620 |
but we see very little compute medium compute lots of compute this is so interesting because the uh 01:25:17.500 |
the original diffusion transformer paper from facebook actually showed that in fact the specific 01:25:22.460 |
hyperparameters of the transformer didn't really matter that much what mattered was that you were 01:25:27.660 |
just increasing the amount of compute that the model had so i love how in the you know once again 01:25:35.340 |
little blog posts they don't even talk about like the specific hyperparameters they say we're using 01:25:38.540 |
a diffusion transformer and we're just throwing more compute at it and this is what happens 01:25:41.900 |
um open sora shows similar results the uh primary issue i think here is that 01:25:49.660 |
no one else has 32x compute budget so we end up with these uh uh we end up in the middle of the 01:25:58.620 |
domain in most of the uh uh related work which is still super super cool it's just a little 01:26:05.260 |
disappointing considering the context um so i think this is a beautiful extension of the 01:26:11.660 |
uh framework that was introduced in 22 and 23 for these very high quality per image generation 01:26:19.900 |
and then extending that to videos it's awesome and it's ga as of monday except no one can seem 01:26:27.020 |
to get access to it because they keep shutting down the login uh the next so next paper i wanted 01:26:33.900 |
to talk about is sam so we at roboflow allow users to label data and train models on that data sam 01:26:41.980 |
for us has saved our users 75 years of labeling time um we are the the best of my knowledge the 01:26:48.620 |
largest uh sam api that exists we also sam also allows us to have our users train just pure uh 01:26:57.660 |
bounding box regression models and use those to generate high quality masks um which has the great 01:27:05.660 |
side effect of requiring less training data to have a meaningful convergence so most people are 01:27:11.020 |
data limited in the real world so anything that requires less data to get to a useful thing is 01:27:15.100 |
super useful um most of our users actually run their object uh per frame object detectors on 01:27:22.860 |
every frame in a video or maybe not most but many many and so uh 01:27:22.860 |
sam2 falls into this category of taking something that really really works and applying 01:27:31.900 |
it to a video which has the wonderful benefit of being plug and play with most of our many of our 01:27:43.420 |
users use cases um we're we're still building out a sufficiently mature pipeline to take advantage 01:27:49.980 |
of that but it's it's in the works um so here we've got a great example we can click on cells 01:27:58.780 |
and then follow them you even notice the cell goes away and comes back and we can still uh 01:28:02.940 |
keep track of it um which is very challenging for uh existing object trackers um high level 01:28:15.580 |
overview of how sam2 works we uh uh there's a simple pipeline here where we 01:28:24.460 |
can give provide some type of prompt and it fills out the rest of the likely masks for that object 01:28:33.260 |
throughout the rest of the video so here we're giving a bounding box in the first frame a set 01:28:37.500 |
of positive negative points or even just a simple mask um i'm gonna assume people are somewhat 01:28:45.580 |
familiar with sam so i'm gonna just give a high level overview of how sam works you have an image 01:28:51.820 |
encoder that runs on every frame um sam2 can be used on a single image in which case the only 01:28:58.780 |
difference between sam2 and sam is that image encoder which sam used a standard vit um sam2 01:29:08.940 |
replaced that with a uh hiera hierarchical encoder which gets approximately the same 01:29:15.580 |
results but leads to a six times faster inference which is excellent especially considering how in 01:29:22.460 |
a trend of 23 was replacing the vit with more efficient backbones um in the case where you're 01:29:31.180 |
doing video segmentation the the difference is that you actually create a memory bank and you 01:29:35.820 |
cross attend the features from the image encoder based on the memory bank so the uh feature set 01:29:44.780 |
that is created is essentially uh well i'll go more into it in a couple slides but we take the 01:29:52.860 |
features from the past couple frames plus a set of object pointers and the set of prompts and 01:30:01.500 |
use that to uh generate our new masks then we then fuse the new masks for this frame 01:30:07.660 |
with the um image features and add that to the memory bank it's well i'll say more in a minute 01:30:16.620 |
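A stripped-down sketch of the memory-bank loop just outlined: current-frame features cross-attend to a FIFO bank of recent frame features, and the fused features are pushed back into the bank for later frames. The shapes, the single attention head, and the residual fusion are simplifications, not the actual SAM 2 implementation.

```python
from collections import deque
import numpy as np

D = 64                      # feature dimension
memory = deque(maxlen=6)    # FIFO bank of recent frame features (+ object pointers)

def cross_attend(queries: np.ndarray, bank: np.ndarray) -> np.ndarray:
    """Single-head attention: current-frame tokens (queries) attend to memory tokens."""
    scores = queries @ bank.T / np.sqrt(D)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ bank

def process_frame(frame_feats: np.ndarray) -> np.ndarray:
    """Condition the current frame on the memory bank, then push its fused
    features back into the bank for future frames."""
    if memory:
        bank = np.concatenate(list(memory), axis=0)
        frame_feats = frame_feats + cross_attend(frame_feats, bank)  # memory-conditioned
    memory.append(frame_feats)
    return frame_feats

for _ in range(10):                         # toy video: 10 frames of 16 tokens each
    out = process_frame(np.random.randn(16, D))
print(out.shape)                            # (16, 64)
```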
the um just like sam that sam2 actually uses a data engine to create its uh data set in that 01:30:23.180 |
people are they assembled a huge amount of reference data used people to label some of it 01:30:28.780 |
and train the model uh use the model to label more of it and ask people to refine the predictions of 01:30:35.820 |
the model and then ultimately the data set is just uh created from the final output of the model 01:30:41.660 |
on the uh reference data it's very interesting this paradigm is so interesting to me because it 01:30:47.100 |
uh it uh unifies a model in a data set in a way that is very unique it seems unlikely that another 01:30:55.180 |
model could come in and have such a tight relationship with the training set um yeah 01:31:02.460 |
so brief overview of how the memory bank works the paper did not have a great visual so i'm just i'm 01:31:11.020 |
going to fill in a bit more um so we take the last couple frames from our video and 01:31:20.940 |
uh we take the last couple frames from our video uh attend that along with the set of prompts 01:31:29.500 |
that we provided they could come from the future they could come from anywhere in the video 01:31:34.780 |
as well as reference object pointers saying by the way here's what we've found so far 01:31:40.620 |
uh attending to the last few frames has the interesting benefit of 01:31:44.460 |
allowing it to model complex object motion uh without actually 01:31:50.300 |
uh you by limiting the amount of frames that you attend to you manage to keep the model running in 01:31:58.300 |
real time this is such an interesting topic topic for me because one would assume that attending to 01:32:04.620 |
all of the frames is super essential having some type of summarization of all the frames 01:32:08.540 |
is super essential for high performance um but we see in their later ablation that that actually is 01:32:14.700 |
not the case so here just to make sure that there is some benchmarking happening we just compare to 01:32:22.380 |
some of the stuff that's came out prior and indeed the sam2 strategy does improve on the state of the 01:32:29.980 |
art um this ablation deep in their dependencies was super interesting to me uh we see in section 01:32:40.140 |
c the number of memories um one would assume that increasing the count of memories would 01:32:45.580 |
meaningfully increase performance and we see that it has some impact but not the type that 01:32:50.780 |
you'd expect and that it meaningfully decreases speed which justifies in my mind just having this 01:32:56.540 |
fifo queue of memories um although in the future i'm super interested to see a more dedicated 01:33:05.980 |
summarization of all of the last video not just a stacking of the last frames so 01:33:13.660 |
that another extension of beautiful per frame work into the uh video domain the next trend i'm 01:33:25.180 |
interested in talking about is uh this interesting at roboflow we're super interested in training 01:33:31.260 |
real-time object detectors those are bread and butter and so we're doing a lot to keep track of 01:33:35.660 |
what is actually happening in that space uh we are finally starting to see something change 01:33:42.940 |
so for years yolos have been the dominant way of doing real-time object detection and we can see 01:33:50.300 |
here that they've essentially stagnated the the performance between 10 and 11 is not meaningfully 01:33:56.700 |
different at least you know in in this type of high-level chart and even from the last couple 01:34:03.100 |
series there's not a major change uh so yolos hit a plateau detrs have not so we can look here 01:34:14.700 |
and see the yolo series has this plateau and then these are rt-detr lw-detr and d-fine have 01:34:22.300 |
meaningfully changed that plateau so that in fact the best d-fine models are plus 4.6 ap on coco at 01:34:29.580 |
the same latency so three major steps to accomplish this uh the first rt-detr which is technically 01:34:38.460 |
a 2023 paper pre-print but published officially in 24 so i'm going to include that i hope that's 01:34:44.460 |
okay um rt-detr showed that we could actually match or outspeed yolos 01:34:50.940 |
um then lw-detr showed that pre-training is hugely effective on detrs and much less so 01:34:58.060 |
on yolos and then d-fine adds the types of bells and whistles that we expect from uh 01:35:02.060 |
this uh arena so the major improvement that rt-detr showed was uh taking 01:35:11.820 |
the multi-scale features that debtors typically pass into their encoder and decoupling them into 01:35:17.820 |
a much more efficient uh transformer encoder uh the transformer is of course quadratic complexity 01:35:25.180 |
so decreasing the amount of stuff that you pass in at once is super helpful for increasing your 01:35:31.580 |
runtime or uh increasing your throughput so that change basically brought us up to yolo speed 01:35:38.700 |
and then they do a hardcore analysis on uh benchmarking yolos including the nms step 01:35:46.620 |
once you uh once you include the nms in the latency calculation you see that in fact these 01:35:52.380 |
detrs are outperforming at least this time the uh the yolos that existed 01:35:59.420 |
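The benchmarking point is simply that NMS post-processing has to sit inside the timed region when measuring a YOLO-style detector. A minimal sketch with a greedy NMS and a stand-in model; a real benchmark would of course time the actual network on the target hardware.

```python
import time
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.5) -> list[int]:
    """Greedy non-maximum suppression. boxes are (N, 4) as x1, y1, x2, y2."""
    order, keep = scores.argsort()[::-1], []
    while order.size:
        i = order[0]
        keep.append(int(i))
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area = lambda b: (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
        iou = inter / (area(boxes[i : i + 1]) + area(boxes[order[1:]]) - inter)
        order = order[1:][iou < iou_thresh]
    return keep

def fake_model(image):
    """Stand-in for a detector forward pass: emits random boxes and scores."""
    boxes = np.sort(np.random.rand(300, 4), axis=1) * 640
    return boxes, np.random.rand(300)

start = time.perf_counter()
boxes, scores = fake_model(None)
kept = nms(boxes, scores)              # end-to-end latency must include this step
print(f"latency incl. NMS: {(time.perf_counter() - start) * 1e3:.2f} ms, kept {len(kept)}")
```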
then lw-detr goes in and suggests that in fact uh the huge boost here is from 01:36:09.980 |
pre-training so this uh is the d-fine line and this is the d-fine line without pre-training 01:36:16.860 |
it's within range it's still an improvement over the uh yolos but the really huge boost comes 01:36:21.980 |
from the benefit of pre-training uh when yolox came out in 2021 they showed that they got much 01:36:29.820 |
better results by having a much much longer training time but they found that when they 01:36:36.780 |
did that they actually did not benefit from pre-training so you see in this graph from lw 01:36:43.180 |
detr in fact yolos do have a real benefit from pre-training but it goes away as we increase the 01:36:49.180 |
training time then the detrs converge much faster lw-detr trains for only 50 epochs rt 01:36:55.420 |
detr 60 epochs so one could assume that in fact the entire extra gain from pre-training is that 01:37:03.820 |
you're not destroying your original weights by relying on this long training cycle 01:37:07.820 |
um and then lw-detr also shows superior performance to our favorite data set roboflow 100 01:37:17.420 |
which means that they do better on the real world not just on coco 01:37:20.380 |
then d-fine throws all the bells and whistles at it uh yolo models tend to have a lot of 01:37:29.340 |
very specific uh complicated loss functions this uh d-fine brings that into the detr 01:37:36.300 |
world and shows consistent improvement on a variety of detr based frameworks 01:37:41.100 |
so bring these all together and we see that suddenly we have almost 60 ap on coco while 01:37:47.900 |
running in like 10 milliseconds huge huge stuff so we're spending a lot of time trying to build 01:37:56.220 |
models that work better with less data and detrs are clearly becoming a promising step in that 01:38:01.660 |
direction what we're interested in seeing next from the detrs in this trend is co-detr and 01:38:11.660 |
the the the models that are currently sitting on the top of the uh leaderboard for large scale 01:38:17.820 |
inference scale really well as you switch out the backbone we're very interested in seeing and and 01:38:25.020 |
having people publish a paper potentially us on what happens if you take these real-time ones 01:38:29.980 |
and then throw a swin-g at it like do we have a Pareto curve that extends from the real-time 01:38:34.780 |
domain all the way up to the uh uh super super slow but high performance domain we also want 01:38:41.260 |
to see people benchmarking on rf100 more because that type of data is what's relevant for most 01:38:46.860 |
users um and we want to see more pre-training because pre-training works now it's super cool 01:38:57.500 |
all right so yeah so in that theme uh one of the big things that we're focusing on 01:39:03.180 |
is how do we get more out of our pre-trained models um and one of the lenses to look at this 01:39:08.540 |
is through sort of this this new requirement for like fine-grained visual details and your 01:39:14.860 |
representations that are extracted from your foundation model so it's sort of a hook for this 01:39:19.820 |
um oh yeah this is just a list of all the the papers that i'm going to mention i just want to 01:39:24.940 |
make sure i set up actual papers so you can find it later um yeah so sort of the big hook here is 01:39:30.620 |
that i make the claim that llms can't see if you go to if you go to claude or um chat gpt you ask 01:39:38.860 |
it to to see this uh uh watch and tell me what time it is it fails right and so you could say 01:39:45.820 |
like maybe maybe the um like this is like a very classic uh test of an llm but you could say okay 01:39:53.260 |
maybe this this image is like too zoomed out and it just like it'll do better if we increase the 01:39:58.700 |
resolution and it has easier time finding these fine fine-grained features like where the watch 01:40:02.780 |
hands are pointing no dice and you can say okay well maybe uh the model just doesn't know how to 01:40:07.660 |
tell time from knowing the position of the hands but if you actually prompt it textually it's very 01:40:12.220 |
easy for it to tell the time so this to me is proof that these llms literally cannot see the 01:40:17.180 |
position of the watch hands and it can't see those details so the question is sort of why and uh for 01:40:22.380 |
you anthropic heads out there claude fails too um so the the my first pick for best paper of 2024 01:40:30.620 |
in vision is this mmvp paper which tries to investigate why do llms not have the ability 01:40:35.900 |
to see fine-grained details and so for instance it it comes up with a lot of images like this 01:40:40.860 |
where you ask it a question that seems very visually apparent to us like which way is the 01:40:44.540 |
school bus facing and it gets it wrong and then of course it makes up details to support its wrong 01:40:48.620 |
claim um and so the process by which it finds these images is sort of contained in its hypothesis for 01:40:55.740 |
why it can't uh see these details so it hypothesizes that models that have been initialized with with 01:41:03.260 |
clip as their vision encoder they don't have fine-grained details and the features extracted 01:41:09.180 |
using clip because um clip sort of doesn't need to find these fine-grained details to do its job 01:41:15.100 |
correctly which is just to match um captions and images right um and sort of at a high level even 01:41:21.340 |
if chat gpt wasn't initialized with clip um and wasn't trained contrastively at the vision encoder 01:41:26.780 |
wasn't trained contrastively at all still in order to do its job of capturing the image uh it could 01:41:32.140 |
do a pretty good job without actually finding the exact position of all the objects and visual 01:41:37.020 |
features in the image right so this paper finds a set of difficult images for these types of models 01:41:44.540 |
and the way it does it is it looks for embeddings that are similar in clip space but far 01:41:48.540 |
in dinov2 space so dinov2 is a foundation model that was trained um self-supervised purely 01:42:55.020 |
on image data um and it kind of uses like some complex student teacher framework but essentially 01:42:01.340 |
and like it patches out like certain areas of the image or like crops with certain areas of 01:42:06.220 |
the image and tries to make sure that those have consistent representations which is a way for it 01:42:09.740 |
to learn very fine-grained visual uh features and so if you take things that are very close in clip 01:42:15.660 |
space and very far in dinov2 space you get a set of images um basically pairs of images that 01:43:22.620 |
are hard for chat gpt and other big language models to distinguish so if you then ask it 01:42:27.900 |
questions about this image well as you can see from this chart it's going to answer the same way 01:42:33.420 |
um for both images right because to to from the perspectives of vision encoder they're the same 01:42:38.780 |
image and so if you ask a question like how many eyes does this animal have it answers the same for 01:42:43.340 |
both and like all these other models including llava um do the same thing right and so this is 01:43:49.260 |
the the benchmark that they create which is like finding clip like clip blind pairs which is pairs 01:42:54.860 |
of images that are similar in clip space and creating a data set of multiple choice questions 01:42:59.820 |
based off of those um and so how do these models do well really bad um llava i think so so chatgpt 01:43:08.620 |
and gemini do a little bit better than random guessing but like half of the performance of 01:43:12.460 |
humans who find these problems to be very easy uh llava is interestingly extremely negatively 01:43:19.740 |
correlated with this data set it does much much much much worse than random guessing which means 01:43:24.780 |
that this process has done a very good job of identifying hard images for for lava specifically 01:43:30.780 |
and that's because lava is basically not trained for very long and is initialized from clip and so 01:43:37.020 |
you would expect it to do poorly on this data set so one of the proposed solutions that this paper 01:43:44.140 |
attempts is by basically saying okay well if clip features aren't enough what if we train 01:43:48.380 |
the visual encoder of the language model also on dinov2 features and so it um proposes two different 01:43:54.540 |
ways of doing this one out of additively um which is basically interpolating between the two features 01:44:00.460 |
and then one is interleaving which is just kind of like training one on the combination of 01:44:05.340 |
both features so there's this really interesting trend when you do the additive mixture of features 01:44:10.620 |
so zero is all um clip features and one is all dinov2 features so i think it's 01:44:21.100 |
helpful to look at the rightmost chart first which is as you increase the number of dinov2 features 01:44:25.500 |
your model does worse and worse and worse on the actual language modeling task and that's 01:44:29.420 |
because dinov2 features were trained in a completely self-supervised manner and completely in 01:44:34.620 |
image space it knows nothing about text these features aren't really compatible with these text 01:44:38.940 |
models and so you can train an adapter all you want but it seems that it's in such an alien 01:44:43.580 |
language that it's like a very hard optimization for this these models to solve and so that kind 01:44:49.420 |
of supports what's happening on the left which is that yeah it gets better at answering these 01:44:55.260 |
questions as you include more dinov2 features up to a point but then when you oversaturate it 01:45:01.500 |
completely loses its ability to like answer language and and do language tasks um so uh 01:45:10.140 |
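The additive mixture described above is essentially a convex combination of the two encoders' projected features before they enter the language model; a one-line sketch, with the projection layers omitted.

```python
import numpy as np

def mix_features(clip_feats: np.ndarray, dino_feats: np.ndarray, alpha: float) -> np.ndarray:
    """alpha = 0 -> all CLIP features, alpha = 1 -> all DINOv2 features.
    Assumes both have already been projected to the same (num_tokens, dim) shape."""
    return (1.0 - alpha) * clip_feats + alpha * dino_feats

print(mix_features(np.zeros((4, 8)), np.ones((4, 8)), alpha=0.25).mean())  # 0.25
```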
you can also see with the interleaving like they essentially double the number of tokens that are 01:45:14.860 |
going into these models um and just train on both and it still doesn't really solve the mmvp task 01:45:20.620 |
it gets llava 1.5 above random guessing by a little bit but still not close to um chatgpt or any 01:45:28.460 |
like human performance obviously um so clearly this proposed solution of just using dinov2 01:45:34.460 |
features directly isn't going to work and basically what that means is that as a um 01:45:39.660 |
as a vision foundation model dinov2 is going to be insufficient for language tasks right 01:45:45.340 |
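For reference, the clip-blind pair selection described earlier reduces to a simple rule over precomputed, L2-normalized embeddings: keep pairs with high CLIP similarity and low DINOv2 similarity. The thresholds below are illustrative, not the paper's exact values.

```python
import numpy as np

def find_clip_blind_pairs(clip_emb: np.ndarray, dino_emb: np.ndarray,
                          clip_min: float = 0.95, dino_max: float = 0.6):
    """Return index pairs (i, j) whose CLIP cosine similarity is high but whose
    DINOv2 cosine similarity is low. Embeddings are (N, D) and L2-normalized,
    so the dot product is the cosine similarity."""
    clip_sim = clip_emb @ clip_emb.T
    dino_sim = dino_emb @ dino_emb.T
    pairs = []
    n = len(clip_emb)
    for i in range(n):
        for j in range(i + 1, n):
            if clip_sim[i, j] > clip_min and dino_sim[i, j] < dino_max:
                pairs.append((i, j))
    return pairs

# toy usage with random unit vectors standing in for real image embeddings
rng = np.random.default_rng(0)
clip = rng.normal(size=(100, 512)); clip /= np.linalg.norm(clip, axis=1, keepdims=True)
dino = rng.normal(size=(100, 768)); dino /= np.linalg.norm(dino, axis=1, keepdims=True)
print(len(find_clip_blind_pairs(clip, dino)))
```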
so my next pick for best paper of 2024 um would be florence 2 which tries to solve this problem 01:45:52.700 |
by incorporating not only this dimension of spatial hierarchy which is to say pixel level 01:45:58.940 |
understanding but also in making sure to include what they call semantic granularity which ends up 01:46:05.020 |
the goal is basically to have features that are sufficient for finding objects in the image so 01:46:10.860 |
they're they're they have enough pixel information but also can be talked about and can be reasoned 01:46:16.780 |
about um and that's on the semantic granularity axis so here's an example of um basically three 01:46:25.500 |
different paradigms of labeling that they do um so they create a big data set um one is text 01:46:32.060 |
which is just captioning and you would expect a model that's trained only on captioning to 01:46:35.900 |
have similar performance like chatgpt and like not have uh spatial hierarchy not have 01:46:41.660 |
features that are meaningful at the pixel level and so they add another type which is 01:46:46.220 |
region text pairs which is essentially either classifying a region or um 01:46:51.900 |
doing object detection or doing instance segmentation on that region or captioning that 01:46:59.500 |
region and then they have text phrase region annotations which is essentially a triple um 01:47:05.580 |
and basically not only do you have a region that you've described you also find it's like 01:47:10.860 |
its place in a descriptive paragraph about the image which is basically trying to introduce even 01:47:16.700 |
more like semantic understanding of these regions and so like for instance if you're saying a woman 01:47:21.260 |
riding on the road right you have to know what a woman is and what the road is and that she's on 01:47:25.340 |
top of it and that's that's basically composing a bunch of objects in this visual space but also 01:47:30.300 |
thinking about it semantically right um and so the way that they do this is they take um basically 01:47:36.860 |
they just dump uh features from a vision encoder straight into a uh encoder decoder transformer 01:47:44.860 |
um and then they train a bunch of different tasks like object detection and so on uh as a language 01:47:52.540 |
task and i think that's one of the big things that we saw in 2024 is these these um vision 01:47:59.260 |
language models operating in on pixel space linguistically so they introduce a bunch of 01:48:04.380 |
new tokens to point to locations and um in pixel space so how does it work how does it actually do 01:48:13.180 |
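A sketch of the location-token idea: box coordinates are normalized, quantized into a fixed number of bins, and emitted as extra vocabulary tokens. The 1000-bin size and token spelling here are illustrative, not the exact Florence-2 or PaliGemma vocabulary.

```python
def box_to_location_tokens(box, image_w, image_h, num_bins=1000):
    """Quantize an (x1, y1, x2, y2) pixel box into discrete location tokens."""
    x1, y1, x2, y2 = box
    norm = [x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h]
    bins = [min(int(v * num_bins), num_bins - 1) for v in norm]
    return [f"<loc_{b}>" for b in bins]

print(box_to_location_tokens((64, 32, 320, 240), image_w=640, image_h=480))
# ['<loc_100>', '<loc_66>', '<loc_500>', '<loc_500>']
```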
we can see uh if you look at the graph on the right which is using the uh the dino 01:48:20.300 |
framework um your pre-trained florence 2 models transfer very very well they get 60 01:48:28.540 |
percent map on coco which is like approaching state-of-the-art and they train 01:48:34.540 |
much more efficiently so they converge a lot faster 01:48:41.020 |
which both of these things are pointing to the fact that they're actually leveraging 01:48:44.940 |
their pre-trained weights effectively um so where is it falling short so these models i forgot to 01:48:52.380 |
mention florence is a 0.2 billion and a 0.7 billion parameter count so they're very very 01:48:57.820 |
small in terms of being a language model um and i think that this framework you can see saturation 01:49:04.460 |
so what this graph is showing is that if you train a florence 2 model purely on the image 01:49:10.460 |
level and region level annotations and not including the pixel level annotations like 01:49:14.860 |
segmentation it actually performs better as an object detector and what that means is that 01:49:21.660 |
it's not able to actually learn all the visual tasks that it's trying to learn because it doesn't 01:49:26.940 |
have enough capacity so i'd like to see this paper explore larger model sizes which brings us 01:49:31.660 |
to our next big paper of 2024 um or two papers so paligemma came out earlier this year paligemma 2 was 01:49:39.580 |
released i think like a week or two ago um oh i forgot to mention you can actually train like 01:49:45.340 |
label text data sets on roboflow and you can train a florence 2 model and you can actually train a 01:49:49.980 |
train a paligemma 2 model on roboflow which we got into the platform within like 14 hours of release 01:49:54.780 |
which i was really excited about so anyway so paligemma 2 and so paligemma is essentially doing 01:50:00.620 |
the same thing but instead of doing an encoder decoder it just dumps everything into a decoder 01:50:04.460 |
only transformer model um but it also introduced the concept of location tokens to point to 01:50:08.940 |
objects in pixel space paligemma 2 so paligemma uses gemma as the language encoder and it uses 01:50:15.820 |
gemma 2b paligemma 2 introduces using multiple different sizes of language encoders um so the 01:50:23.260 |
way that they sort of get around having to do encoder decoder is they use the concept of prefix 01:50:28.460 |
loss which basically means that when it's generating tokens um autoregressively it's 01:50:35.660 |
all those uh tokens in the prefix which is like the image that it's looking at and like a 01:50:40.540 |
description of the task that it's trying to do they're attending to each other fully full attention 01:50:45.420 |
um which means that you know it can sort of find high level uh it's easier for the the prefix to 01:50:52.060 |
color to color the output of the suffix and also to just find like features uh easily so 01:51:00.460 |
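A minimal sketch of the prefix-loss attention pattern just described: the prefix (image tokens plus the task description) attends to itself bidirectionally, while the generated suffix is causal and can also see the whole prefix.

```python
import numpy as np

def prefix_lm_mask(prefix_len: int, suffix_len: int) -> np.ndarray:
    """Return a (T, T) mask where entry [i, j] = 1 if token i may attend to token j."""
    T = prefix_len + suffix_len
    mask = np.tril(np.ones((T, T), dtype=int))      # causal everywhere by default
    mask[:prefix_len, :prefix_len] = 1              # full attention inside the prefix
    return mask

print(prefix_lm_mask(prefix_len=3, suffix_len=2))
# [[1 1 1 0 0]
#  [1 1 1 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]
```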
this is sort of an example of like one of the tasks that was trained on which is like you 01:51:04.700 |
describe the task in english um and then you give it all these like you're asking for it to segment 01:51:12.860 |
these two classes um of objects and then it finds like their locations using these loc tokens and 01:51:19.740 |
it finds their masks using uh some encoding of the masks into tokens and yeah so one of my critiques 01:51:30.780 |
i guess of paligemma one at least is that um you find that performance saturates as a pre-trained 01:51:36.380 |
model after only 300 million examples seen um so what this graph is representing is each blue dot 01:51:43.660 |
is a performance on some downstream task you can see that after seeing 300 million examples 01:51:49.260 |
it sort of does equally well on all of the downstream tasks that they tried it on which 01:51:55.340 |
was a lot as 1 billion examples which to me also kind of suggests a lack of capacity for this model 01:52:02.060 |
paligemma 2 you can see the results on object detection so these were transferred to um 01:52:10.460 |
to coco um and you can see that this sort of also points to an increase in capacity being 01:52:17.180 |
helpful to the model you can see as both the resolution increases and the parameter count 01:52:23.020 |
of the language model increases performance increases so resolution makes sense obviously 01:52:26.780 |
it helps to find small images or small objects in the image but also makes sense from another reason 01:52:31.820 |
which is that it kind of gives the model a thinking register and it gives it more tokens to 01:52:35.900 |
like process when making its predictions um but yeah you could you could say oh 43.6 that's not 01:52:42.860 |
that great like um Florence 2 got 60 but this is not training a dino or a detr on top of this 01:52:50.140 |
language or this image encoder it's doing the raw language modeling task on coco um so it doesn't 01:52:57.660 |
have any of the bells whistles it doesn't have any of the fancy losses it doesn't even have 01:53:01.260 |
bipartite graph matching or anything like that okay the big result and one of the reasons that 01:53:07.580 |
I was really excited about this paper is that they blow everything else away on mmvp I mean 47.3 01:53:13.980 |
sure that's nowhere near human accuracy which again is 94 but for a you know a two billion 01:53:19.500 |
parameter language model to beat chatgpt that's quite the achievement 01:53:23.820 |
um and that sort of brings us to our final pick for paper of the year which um is aimv2 so 01:53:34.380 |
aimv2 sort of says okay maybe this language model like maybe coming up with all these specific 01:53:40.780 |
annotations to find features and with high fidelity and pixel space isn't actually necessary 01:53:47.420 |
and we can come up with an even simpler more beautiful idea for combining um you know image 01:53:53.580 |
tokens and pixel tokens in a way that's interfaceable for language tasks um and this 01:53:59.020 |
is nice because it can scale you can come up with lots more data if you don't have to come up with 01:54:03.260 |
all these annotations right so the way that it works is it does something very very similar to 01:54:07.900 |
paligemma where you have a vision encoder that dumps image tokens into a decoder only transformer 01:54:13.420 |
but the interesting thing is that it also autoregressively tries to learn 01:54:19.580 |
the mean squared error of the image tokens so instead of having to come up with fancy object 01:54:24.940 |
detection or semantic or segment or segmentation labels you can just try to reconstruct the image 01:54:30.060 |
and have it learn fine-grained features that way um and it does this in kind of i think a beautiful 01:54:35.580 |
way that's kind of compatible with the paligemma line of thinking which is randomly sampling a 01:54:39.820 |
prefix length and using only this number of image tokens as the prefix um and so doing a 01:54:47.580 |
similar thing with the uh causal so the causal prefix is the the attention mask on the right so 01:54:53.340 |
it's doing full block attention with some randomly sampled number of image tokens to then reconstruct 01:54:58.700 |
the rest of the image and the downstream caption for that image and so this is the data set that 01:55:06.380 |
they train on it's image or internet scale data very high quality data created by the 01:55:11.500 |
data filtering networks paper essentially which is maybe the best clip data that exists 01:55:18.700 |
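A toy sketch of the objective described above: sample a random prefix over the image tokens, regress the remaining image tokens with mean squared error, and keep the usual cross-entropy on the caption tokens. Everything here is schematic; the real model's heads, normalization, and loss weighting are not spelled out in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def aim_style_losses(pred_img, true_img, caption_logits, caption_ids):
    """pred_img / true_img: (num_img_tokens, dim) predicted vs. target image tokens.
    caption_logits: (num_text_tokens, vocab); caption_ids: (num_text_tokens,)."""
    # Random prefix: those image tokens are pure conditioning, only the rest are regressed.
    prefix_len = rng.integers(1, len(true_img))
    mse = np.mean((pred_img[prefix_len:] - true_img[prefix_len:]) ** 2)
    # Standard next-token cross-entropy on the caption.
    probs = np.exp(caption_logits - caption_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    ce = -np.mean(np.log(probs[np.arange(len(caption_ids)), caption_ids] + 1e-9))
    return mse, ce

mse, ce = aim_style_losses(
    pred_img=rng.normal(size=(256, 64)), true_img=rng.normal(size=(256, 64)),
    caption_logits=rng.normal(size=(20, 1000)), caption_ids=rng.integers(0, 1000, 20),
)
print(f"image MSE {mse:.3f}, caption CE {ce:.3f}")
```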
and we can see that this is finally a model that doesn't saturate even at the highest 01:55:27.020 |
parameter count it appears to be 01:55:34.140 |
improving in performance with more and more samples seen and so you can sort of think that 01:55:39.100 |
uh you know if we just keep bumping the parameter count and increasing the examples seen which is 01:55:44.380 |
the the line of thinking for language models then it'll keep getting better so how does it actually 01:55:49.900 |
do at finding oh it also improves with resolution which you would expect for a model that um 01:55:57.100 |
this is the image net classification accuracy but yeah it does better if you increase the 01:56:01.740 |
resolution which means that it's actually leveraging and finding fine-grained visual 01:56:05.820 |
features um and so how does that actually do compared to clip on coco well you can see that 01:56:12.620 |
if you slap a transformer uh detection head on it and train on coco it's just 60.2 which is also 01:56:18.780 |
within spitting distance of sota which means that it does a very good job of finding um visual 01:56:24.300 |
features but you could say okay well wait a second uh clip got to 59.1 so like how does this prove 01:56:33.100 |
your claim at all because doesn't that mean like clip which is known to be clip blind and do badly 01:56:38.300 |
on mmvp it's able to achieve a very high performance on fine on this fine-grained visual 01:56:43.660 |
features task of object detection well they train on like tons of data they train on like objects 01:56:49.740 |
365 coco flicker and everything else and so i think this benchmark doesn't do a great job of 01:56:56.300 |
selling how good of a pre-trained model aimv2 is and we would like to see uh performance on 01:57:02.060 |
fewer data examples and not trained to convergence on object detection so 01:57:07.100 |
seeing it in the real world on like a data set like roboflow 100 i think would be 01:57:11.100 |
quite interesting and our i guess our final final pick for paper of 2024 would be moondream so 01:57:21.260 |
uh but overall that was exactly what i was looking for like best of 2024 amazing job 01:57:28.540 |
um uh yeah are there any other questions while vick gets set up like vision stuff 01:57:42.540 |
hi well while we're getting set up hi over here thanks for the really awesome talk one of the 01:57:48.940 |
things that's been weird and surprising is um that the foundation model companies uh 01:57:56.460 |
even these mllms they're just like worse than rt-detr at detection still like if you wanted to 01:58:05.180 |
pay a bunch of money uh to auto label your detection data set if you gave it to openai 01:58:10.060 |
or claude that would be like a big waste um so i'm curious just like even polygema 2 like uh 01:58:16.700 |
is worse so so i'm curious to hear your thoughts on like how come nobody's cracked the code on like 01:58:22.700 |
a generalist that really uh you know beats a specialist model in computer vision like they 01:58:30.380 |
have in uh in lm land i can can you hear me okay oh yeah um it's very very interesting question 01:58:46.380 |
i think um it depends on the specific domain uh for image classification it's basically there 01:58:53.260 |
in that aimv2 showed a simple attentional probe on the pre-trained features gets like 90 which is 01:59:00.380 |
as well as anyone does um the the the bigger question like why isn't it transferring to 01:59:06.860 |
uh uh object detection especially like real-time object detection um i think in my mind there are 01:59:15.100 |
two answers one is object detection is really really really uh the architectures are super 01:59:21.980 |
domain specific you know we see these all these super super complicated things and it's not 01:59:26.700 |
super easy to to to build something that just transfers naturally like that whereas 01:59:31.740 |
image classification you know clip pre-training transfers super super 01:59:34.860 |
easily um and the other thing is until recently the real-time object detectors didn't even really 01:59:43.340 |
benefit from pre-training like you see the yolos that are like essentially saturated showing very 01:59:48.540 |
little difference with uh pre-training improvements uh with using pre-trained model at all it's not 01:59:54.700 |
surprising necessarily that people aren't looking at the effects of better and better pre-training 02:00:01.420 |
on real-time detection maybe that'll change in the next year does that answer your question 02:00:05.260 |
cool uh can you guys hear me uh yeah one thing i want to add is just like or just to summarize 02:00:12.860 |
basically is that like until 2024 you know we haven't really seen a combination of transformer 02:00:19.340 |
based uh object detectors and uh fancy losses and paligemma suffers from the same problem which 02:00:25.900 |
is basically to say that um these resnets like the convolutional models they have all these like 02:00:32.940 |
extreme optimizations for for doing object detection but essentially i think it's kind of 02:00:38.940 |
been shown now that convolution models like just don't benefit from pre-training and just don't 02:00:42.780 |
like have the level of intelligence of transformer models awesome hi can you hear me cool sure you 02:00:54.780 |
see you are you sharing your screen i might have forgotten to do that let me do that sorry 02:01:09.260 |
oh here's your screen uh-oh classic um you might have to quit zoom and restart what um 02:01:18.140 |
it's fine yeah it's like we we have we have a capture of your screen i'll just make sure it's 02:01:34.860 |
but soon no yeah yeah there you go perfect all right hi everyone my name is vic um i've been 02:01:46.460 |
working on moondream for almost a year now like sean mentioned i just went and looked and it turns 02:01:51.580 |
out the first version i released december 29 2023 um it's been a fascinating journey so moondream 02:01:58.940 |
um started off as a tiny vision language model since then we've expanded scope a little bit to 02:02:04.300 |
also try and build some tooling client libraries etc to help people really deploy it 02:02:09.020 |
um unlike traditional large models that are focused at assistant type use cases we're 02:02:16.700 |
laser focused on building um capabilities that developers can sorry it's uh 02:02:27.100 |
yeah we're laser focused on building capabilities that developers can use to build vision applications 02:02:32.060 |
uh that can run anywhere so in a lot of cases for vision more so than for text you really care about 02:02:37.580 |
being able to run on the edge run in real time etc so um it's really important we have um we have 02:02:44.540 |
different output modalities that we support there's query where you can ask general english 02:02:48.380 |
questions about an image and get back human-like answers there's captioning which 02:02:53.660 |
a lot of our users use for generating 02:02:59.340 |
synthetic data sets to then train diffusion models and whatnot um we've done a lot of work to minimize 02:03:04.140 |
the hallucinations there so that's um used a lot we have open vocabulary object detection built-in 02:03:09.900 |
similar to a couple more recent models like paligemma etc where rather than having to train a dedicated 02:03:14.540 |
model you can just say show me soccer balls in this image or show me there any deer in this image 02:03:19.820 |
detected uh more recently earlier this month we released pointing capability where if all 02:03:26.860 |
you're interested in is the center of an object um you can just ask it to point out where that 02:03:32.940 |
is. this is very useful when you're doing ui automation type stuff. 02:03:38.860 |
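For readers who want to try the query modality, here is a rough sketch using the Hugging Face transformers interface that the Moondream repo documents; the helper method names (encode_image, answer_question) come from the model's remote code and may differ between revisions, so treat them as assumptions to check against the current README.

```python
# Rough sketch: asking a small vision-language model a question about an image.
# The model id is real; the helper methods below come from the repo's custom
# code (trust_remote_code) and may change between revisions -- an assumption,
# not a pinned API.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "vikhyatk/moondream2"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

image = Image.open("factory_floor.jpg")       # any local image
enc = model.encode_image(image)               # image -> vision embeddings
print(model.answer_question(enc, "Are there any deer in this image?", tokenizer))
```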
um let's see, we have two models out right now there's a general purpose 2B parameter model which um 02:03:48.300 |
runs fine if you're running on a server it's uh good for our LocalLLaMA 02:03:53.260 |
desktop friends and you can run it on flagship mobile phones but it never really 02:03:58.300 |
fulfilled the promise of being able to run anywhere uh last week we released a new 0.5B parameter model 02:04:03.500 |
which should be seen more as a distillation target as opposed to a general purpose model 02:04:08.780 |
uh it's very good if you're running on like older mobile phones or edge devices uses less memory 02:04:15.980 |
even with our not yet fully optimized inference client um so the way we built our 0.5b model was 02:04:24.780 |
to start with the two billion parameter model um and prune it while doing continual training to 02:04:32.620 |
retain performance. our objective during pruning was to preserve accuracy across a broad 02:04:40.140 |
set of benchmarks so the way we went about it was to estimate the importance of different 02:04:44.380 |
components of the model like attention heads channels um mlp rows and whatnot um using 02:04:51.500 |
basically a technique based on the gradient i'm not sure how much people want to know details 02:04:55.900 |
we'll be writing a paper about this but uh feel free to grab me if you have more questions 02:04:59.660 |
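The paper isn't out yet, so the snippet below is only a minimal sketch of what a gradient-based importance score for structured pruning can look like, assuming a first-order Taylor criterion accumulated over a calibration set; it illustrates the general technique, not Moondream's actual recipe.

```python
# Minimal sketch of gradient-based importance scoring for structured pruning
# (first-order Taylor criterion: |weight * grad| summed per structure).
# This is NOT the Moondream recipe -- their paper isn't out -- just an illustration.
import torch

def head_importance(model, calib_loader, num_heads, head_dim):
    """Score attention heads over a small calibration set."""
    scores = torch.zeros(num_heads)
    for batch in calib_loader:
        model.zero_grad()
        out = model(**batch)                  # assumes the batch carries labels
        out.loss.backward()
        # `model.attn.out_proj` is a placeholder for wherever your attention
        # output projection lives; its input columns are grouped per head.
        W = model.attn.out_proj.weight        # (d_model, num_heads * head_dim)
        G = model.attn.out_proj.weight.grad
        per_col = (W * G).abs().sum(dim=0)    # one score per input column
        scores += per_col.view(num_heads, head_dim).sum(dim=1).detach()
    return scores

# Iterate: score -> prune the lowest-scoring chunk -> continue training to
# recover performance -> repeat, until the target size is reached (e.g. 2B -> 0.5B).
```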
uh then we iteratively prune a small chunk that will minimize loss in performance uh retrain the 02:05:05.500 |
model to recover performance and bring it back um the 0.5b we release is more of a proof of concept 02:05:11.660 |
that this is possible i think the thing that's really exciting about this is it makes it possible 02:05:15.180 |
for um for developers to build using the 2b parameter model and just explore build their 02:05:24.540 |
application and then once they're ready to deploy uh figure out what exactly they need out of the 02:05:28.940 |
model and prune those capabilities into a smaller form factor that makes sense for their deployment 02:05:33.100 |
target um so yeah very excited about that let me talk to you folks a little bit about uh another 02:05:40.540 |
problem i've been working on recently which is similar to the clocks example we've been talking 02:05:44.140 |
about we had a customer reach out who had a bunch of gauges 02:05:50.300 |
out in the field this is very common in manufacturing and oil and gas where you 02:05:54.140 |
have a bunch of analog devices that you need to monitor it's expensive to have humans look at that 02:06:00.620 |
and monitor stuff and make sure that uh the system gets shut down when the temperature goes over 80 02:06:06.060 |
or something so i was like yeah this seems easy enough happy to help you distill that 02:06:11.020 |
uh let's let's get it going turns out our model couldn't do it at all uh i went and looked at 02:06:15.900 |
other open source models to see if i could just generate a bunch of data and learn from that that 02:06:20.940 |
did not work either so i was like let's look at what the folks with hundreds of billions of dollars 02:06:25.580 |
in market cap have to offer and yeah that doesn't work either um my hypothesis is that like the 02:06:35.100 |
way these models are trained is using a large amount of image-text data scraped from 02:06:40.220 |
the internet and that can be biased in the case of gauges most gauge images aren't gauges in the 02:06:45.740 |
wild they're product detail images like these where it's always set to zero it's paired with 02:06:51.420 |
an alt text that says something like givto pressure sensor psi zero to 30 or something 02:06:58.620 |
and so the models are fairly good at picking up those details it'll tell you that it's a 02:07:01.980 |
pressure gauge it'll tell you what the brand is but it doesn't really learn to pay attention to 02:07:05.420 |
the needle over there um and so yeah that's a gap we need to address so naturally my mind goes to 02:07:16.220 |
like let's use synthetic data to solve this problem um that works but it's problematic because it 02:07:23.180 |
turned out we needed millions of synthetic gauge images to get to reasonable performance. 02:07:27.660 |
and thinking about it, reading a gauge is not a zero-shot process in our 02:07:33.660 |
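For intuition, synthetic gauge data can be produced by rendering dials procedurally and keeping the true reading as the label; a minimal matplotlib sketch is below, with the dial style, ranges, and sweep angle chosen arbitrarily rather than taken from Moondream's pipeline.

```python
# Minimal sketch of procedural synthetic gauge generation (not Moondream's pipeline).
# Render a dial at a random angle, save the image, and keep the true reading as label.
import math
import random
import matplotlib.pyplot as plt

def render_gauge(path, lo=0, hi=100):
    value = random.uniform(lo, hi)
    fig, ax = plt.subplots(figsize=(2, 2), dpi=112)
    ax.add_patch(plt.Circle((0, 0), 1.0, fill=False, lw=2))
    # tick marks and labels every 10 units over a 270-degree sweep
    for v in range(lo, hi + 1, 10):
        theta = math.radians(225 - 270 * (v - lo) / (hi - lo))
        ax.plot([0.85 * math.cos(theta), math.cos(theta)],
                [0.85 * math.sin(theta), math.sin(theta)], 'k-', lw=1)
        ax.text(0.7 * math.cos(theta), 0.7 * math.sin(theta), str(v),
                ha='center', va='center', fontsize=5)
    # needle at the sampled value
    theta = math.radians(225 - 270 * (value - lo) / (hi - lo))
    ax.plot([0, 0.8 * math.cos(theta)], [0, 0.8 * math.sin(theta)], 'r-', lw=2)
    ax.set_xlim(-1.1, 1.1); ax.set_ylim(-1.1, 1.1); ax.axis('off')
    fig.savefig(path); plt.close(fig)
    return value  # ground-truth reading to pair with the image

# scale this up (the talk mentions needing millions of images)
labels = {f"gauge_{i}.png": render_gauge(f"gauge_{i}.png") for i in range(1000)}
```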
minds right like if you had to tell me the reading in celsius for this real world gauge 02:07:38.860 |
there's two dials on there so first you have to figure out which one you have to be paying 02:07:42.300 |
attention to like the inner one or the outer one um you look at the tip of the needle you look at 02:07:48.220 |
what labels it's between and you count how many and do some math to figure out what that probably 02:07:55.340 |
is so what happens if we just add that as chain of thought um to give the model a better understanding 02:08:04.300 |
of the different steps, to allow the model to better learn the subtasks it needs to perform to accomplish 02:08:09.580 |
this goal um so you can see in this example this was actually generated by the latest version of 02:08:15.100 |
our model uh it's like okay celsius is the inner scale it's between 50 and 60 there's 10 ticks 02:08:22.060 |
it's at the second tick it's a little debatable here like there's a weird shadow situation going 02:08:25.900 |
on the dial is off so i i don't know what the ground truth is but it works okay um there's 02:08:33.020 |
points on there — the points over there are actually grounded — i don't know if this is easy 02:08:38.140 |
to see but when i click on those there's a little red dot that moves around on the image the model 02:08:42.780 |
actually has to predict where uh those points are i was already trying to do this with bounding boxes 02:08:48.620 |
but then Molmo came out with pointing capabilities and it's like pointing is a much better paradigm to 02:08:54.620 |
uh to represent this we see pretty good results this one's actually for clock reading i 02:09:01.900 |
couldn't find our chart for gauge reading at the last minute so um the light blue chart is 02:09:09.980 |
with uh our grounded chain of thought um we built a clock reading 02:09:16.620 |
benchmark of about 500 images and this measures accuracy on that um you can see it's a lot more sample 02:09:23.020 |
efficient uh when you're using the chain of thought to help the model um yep another big benefit 02:09:34.300 |
from this approach is like you can kind of understand how the model is doing it and how 02:09:40.300 |
it's failing so in this example the actual correct reading is 54 celsius the model output 56 02:09:46.620 |
not too bad um but you can actually go and see where it messed up like it got a lot of these 02:09:53.660 |
right except uh instead of saying it was on the seventh tick it actually predicted it was 02:10:00.300 |
the eighth tick and that's why it went with 56 so now that you know that this is failing in 02:10:07.340 |
this way you can adjust how you're doing the chain of thought to maybe say like actually count out 02:10:10.940 |
each tick from 40 instead of just trying to say it's the eighth tick or you might say like okay 02:10:15.660 |
i see that there's that middle thing, i'll count from there instead of all the way from 40. 02:10:20.780 |
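To make the grounded chain-of-thought idea concrete, a training target for one gauge image might look roughly like the sketch below; the field names and the normalized point format are invented for illustration and are not Moondream's actual schema.

```python
# Hypothetical grounded chain-of-thought training sample for gauge reading.
# Field names and the (x, y) point convention are illustrative only.
sample = {
    "image": "gauge_0042.png",
    "question": "What is the reading in celsius?",
    "chain_of_thought": [
        {"step": "Celsius is the inner scale.",          "point": (0.52, 0.61)},
        {"step": "The needle tip is between 50 and 60.", "point": (0.41, 0.37)},
        {"step": "There are 10 ticks between the labels; the needle is on the 4th tick.",
         "point": (0.43, 0.35)},
        {"step": "So the reading is 50 + 4 = 54."},
    ],
    "answer": "54",
}
# At training time the points are supervised too, so the rationale stays grounded
# in the pixels instead of being free-floating text the model can hallucinate.
```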
so that helps a ton. the other thing i'm excited about is few-shot prompting or test-time 02:10:26.540 |
training with this like if a customer has a specific gauge that uh like we're seeing minor 02:10:31.340 |
errors on they can give us a couple of examples where like if it's misdetecting the needle they 02:10:37.340 |
can go in and correct that in the chain of thought and hopefully that works the next time um 02:10:41.820 |
now, it's an exciting approach but we've only applied it to clocks and gauges the real question is is it going to 02:10:48.380 |
generalize um probably like there's some signs from text models that when you train on a broad 02:10:53.500 |
number of tasks it does generalize and um i'm seeing some signs with our model as well um 02:10:59.580 |
so in addition to the image-based chain of thought stuff i also added some spelling-based 02:11:03.820 |
chain of thought uh to help it better understand ocr i guess um i don't understand 02:11:11.740 |
why everyone doesn't do this by the way like it's a trivial benchmark question that's very very easy 02:11:16.860 |
to nail um but i also wanted to support it for stuff like license plate partial matching like 02:11:23.580 |
hey does any license plate in this image start with wha or whatever um so yeah that sort of worked 02:11:30.700 |
um all right that that ends my story about the gauges if you think about what's going on over 02:11:39.020 |
here um it's interesting that like llms are showing enormous progress in reasoning especially 02:11:48.540 |
with the latest set of models that we've seen but we're not really seeing that — i have a feeling that 02:11:54.620 |
vlms are lagging behind as we can see with these tasks that should be very simple for a human to 02:12:01.660 |
do that are very easy to find um vlms failing at uh my hypothesis on why this is the case is because 02:12:08.460 |
on the internet there's a ton of data that talks about how to reason there's books about how to 02:12:14.780 |
solve problems there's books critiquing the books about how to solve problems but humans are just so 02:12:19.260 |
good at perception that we never really talk about it like maybe in art books where it's like hey to 02:12:24.540 |
show that that mountain is further away you need to desaturate it a bit or whatever but um the 02:12:31.740 |
actual data on how to like look at images isn't really present also the data we have is kind of 02:12:37.500 |
sketchy — the best source of data we have is like image alt-text pairs on the internet and that's 02:12:41.500 |
pretty low quality um so yeah i i think our solution here is really just we need to teach 02:12:47.180 |
them how to operate on individual tasks and figure out how to scale that out um all right yep so 02:12:56.780 |
conclusion uh at Moondream we're trying to build amazing VLMs that run everywhere very hard 02:13:02.780 |
problem much work ahead but uh we're making a ton of progress and i'm really excited about 02:13:07.340 |
um if anyone wants to chat about more um technical details about how we're doing 02:13:12.620 |
this or interested in collaborating please please hit me up 02:13:15.260 |
yeah like whenever people say multi-modality like you know i always think 02:13:26.460 |
about vision as the first among equals in all the modalities so i really appreciate 02:13:31.260 |
having the experts um okay we are a little bit out of time so we're going to move on to luca 02:13:36.940 |
um and talk about open models but if anyone wants to talk to the vision guys i think there's like 02:13:42.700 |
coffee and tea outside we're going to have lunch in an hour as well um so you can ask follow-up 02:13:48.620 |
questions uh outside if you if you wish but yeah luca you go you get set up with uh your mic okay 02:13:56.860 |
we sent you a zoom okay uh it's on it's on the calendar and then 02:14:33.740 |
they just screen share for here no audio no audio no yeah speecher uh plus plug-in 02:14:39.340 |
oh yeah you gotta stick around people you stick around people for sure 02:14:50.300 |
so i didn't know what you're because you're you're coming later yeah i don't really know either 02:14:59.100 |
how was your session yesterday for the tutorial yeah 02:15:07.580 |
yeah it's just good um definitely polish the slides 02:15:28.540 |
cool yeah i think you're set um so as you speak into that mic but any of your 02:15:42.700 |
nathan's microphone no you want me to be we'll just put this on yeah 02:15:51.340 |
i have the same thing yeah so these two mics they're good all right all right cool um yeah 02:16:01.980 |
thanks for having me over um i'm Luca i'm a research scientist at the Allen Institute for AI 02:16:07.980 |
i threw together a few slides on sort of like a recap of like interesting themes in open models 02:16:15.740 |
for for 2024 um have about maybe 20-25 minutes of slides and then we can chat if there are any 02:16:22.940 |
questions if i can advance to the next slide okay cool um so um i did the quick check of like 02:16:33.340 |
to sort of get a sense of like how much 2024 was different from 2023 um so i went on Hugging Face 02:16:39.580 |
and sort of tried to get a picture of what kind of models were released in 2023 and like what do 02:16:45.100 |
we get in 2024 um in 2023 we got things like uh both Llama 1 and 2 we got Mistral we got MPT 02:16:53.020 |
Falcon models i think the Yi model came at the tail end of the year it was a pretty good year 02:16:58.460 |
but then i did the same for 2024 um and it's actually quite stark difference um you have 02:17:08.860 |
models that are you know rivaling the frontier-level performance of what you can get from closed models 02:17:15.420 |
from like Qwen from DeepSeek we got Llama 3 we got all sorts of different models um i added 02:17:23.260 |
our own uh OLMo at the bottom uh there's this uh growing group of like fully open models that i'm 02:17:29.260 |
going to touch on a little bit later um but you know just looking at the slides it feels like 02:17:35.500 |
2024 was just smooth sailing happy news much better than previous year um and you know you 02:17:42.940 |
can plot um you can pick your favorite benchmark or least favorite i don't know depending on what 02:17:50.460 |
point you're trying to make um and plot you know your closed model your open model um and sort of 02:17:58.220 |
spin it in ways that show that oh you know open models are much closer to where closed models 02:18:04.860 |
are today versus last year where the gap was fairly significant um so one thing that 02:18:14.860 |
i think i don't know if i have to convince people in this room but usually when i give these talks 02:18:21.500 |
about like open models there is always like this background question in in in people's mind of like 02:18:27.180 |
why should we use open models um there is the just-use-model-APIs argument you know it's 02:18:33.820 |
just an HTTP request to get output from one of the best models out there why do i have to set 02:18:39.500 |
up infra and use local models um and there are really like two answers um there is the more researchy 02:18:47.820 |
answer for this which is where my background lies which is um just research if you want to do 02:18:55.180 |
research on language models research thrives on open models there is a large body of research 02:19:01.580 |
on modeling on how these models behave on evaluation and inference on uh mechanistic 02:19:08.300 |
interpretability that could not happen at all if you didn't have open models um they're also um 02:19:16.140 |
for ai builders there are also like good use cases for using um local models um you know you have 02:19:24.940 |
some — this is a very non-comprehensive slide — but you have things like there are some 02:19:29.660 |
applications where local models just blow closed models out of the water um so like retrieval it's 02:19:37.020 |
a very clear example um you might have like constraints like edge ai applications where it 02:19:42.860 |
makes sense but even just like in terms of like stability being able to say this model is not 02:19:47.980 |
changing under the hood um there's plenty of good cases for um open models um and the 02:19:56.860 |
community is not just the models um i stole this slide from uh one of the Qwen2 announcement blog 02:20:04.860 |
posts uh but it's super cool to see like how much um tech exists around um open models on serving 02:20:13.660 |
them on making them efficient and hosting them it's pretty cool um and um it's um if you think 02:20:23.820 |
about like where the term 'open' comes from — it comes from open source — um really open models 02:20:29.740 |
meet the core tenets of open source uh specifically when it comes to 02:20:37.900 |
collaboration there is truly a spirit that through these open models you can build on top of other 02:20:44.060 |
people's innovation um we see a lot of this even in our own work of like you know as we iterate 02:20:50.860 |
on the various versions of OLMo um it's not like every time we collect all the 02:20:57.900 |
data from scratch no the first step is like okay what are the cool data sources and datasets people have put 02:21:04.060 |
together for language model for training um or when it comes to like our post-training pipeline 02:21:11.820 |
one of the steps is um you want to do some DPO and use a lot of outputs of other models 02:21:21.100 |
to improve your preference model so really um having like an open sort of ecosystem 02:21:28.140 |
benefits and accelerates the development of open models um one thing that um we got in 2024 which 02:21:37.420 |
is not a specific model but i thought it was really significant is we first got uh we got our 02:21:42.780 |
first open source ai definition um so this is from the open source initiative um they've been 02:21:50.220 |
generally the steward of a lot of the open source licenses when it comes to software 02:21:55.100 |
and so they embarked on this journey and trying to figure out okay 02:22:00.060 |
how does a license an open source license for a model look like 02:22:03.740 |
um majority of the work is very dry because licenses are dry so i'm not gonna walk through 02:22:11.500 |
the license step by step but um i'm just gonna pick out uh one aspect that is very good uh and 02:22:19.820 |
then one aspect that personally feels like it needs improvement on the good side um this um 02:22:26.780 |
this open source ai license actually this is very intuitive if you ever build open source software 02:22:33.420 |
and you have some expectation around like what open source uh looks like for software uh for 02:22:41.260 |
for ai sort of matches your intuition so the weights need to be freely available uh the code 02:22:49.020 |
must be released with an open source license uh and there shouldn't be like license clauses that 02:22:56.380 |
block specific use cases so under this definition for example Llama or some of the Qwen models are 02:23:03.580 |
not open source because the license says you can't use this model for this 02:23:09.340 |
or it says if you use this model you have to name the output this way or derivatives need to be uh 02:23:15.660 |
named that way those clauses don't meet the open source definition um and so they will not be 02:23:20.780 |
covered — the Llama license will not be covered under the open source definition um it's not perfect um 02:23:30.300 |
one of the things that um um internally you know in discussion with with osi we were sort of 02:23:38.700 |
disappointed is around um the language for data um so you might imagine that an open source 02:23:47.980 |
ai model means a model where the data is freely available uh there were discussion around that 02:23:53.420 |
but at the end of the day they decided to go with a softened stance where they say um a model is open 02:24:00.860 |
source if you provide sufficiently detailed information on how to sort of replicate the 02:24:06.780 |
data pipeline so you have an equivalent system — 'sufficiently detailed' uh it's very 02:24:14.300 |
fuzzy i don't like that and 'an equivalent system' is also very fuzzy um and this doesn't 02:24:21.500 |
take into account the accessibility of the process right it might be that you provide enough 02:24:26.700 |
information but this process costs I don't know 10 million dollars to do um now the open source 02:24:33.580 |
definition like any open source license has never been about accessibility so it's never been a factor 02:24:40.140 |
in open source software how accessible the software is um I can make a piece of open source software put it on 02:24:46.540 |
my hard drive and never access it that software is still open source the fact that it's not widely 02:24:51.340 |
distributed doesn't change the license but practically there are right expectations of what 02:24:57.020 |
we want good open source to be so it's kind of sad to see that um the data component 02:25:04.220 |
in this license is not as open as some of us would like it to be and I linked 02:25:11.500 |
the blog post that Nathan wrote on the topic which is less rambly and easier to follow through 02:25:18.460 |
um one thing that in general I think it's fair to say about the state of open models in 2024 is that 02:25:28.780 |
we know a lot more than what we knew in in 2023 um like um both on the training data like the 02:25:37.260 |
pre-training data you curate um on like how to do like all the post-training especially like on the 02:25:43.580 |
RL side um you know 2023 was a lot of like throwing random darts at the board uh I think 2024 we have 02:25:51.900 |
clear recipes that okay don't get the same results as a closed lab because there is a cost 02:25:57.260 |
in in actually matching what they do um but at least we have a good sense of like okay this is 02:26:03.020 |
this is the path to get state-of-the-art language model um I think that one thing that it's a 02:26:09.900 |
downside of 2024 is that I think we are more research constrained than 2023 it feels that 02:26:18.220 |
like you know the barrier for compute that you need to move innovation along is just 02:26:24.940 |
rising and rising um so like if you go back to this slide there is now this 02:26:31.660 |
cluster of models that are sort of released by the compute rich club um membership is hotly debated 02:26:39.980 |
um you know some people don't want to be called rich because it comes to expectations some people 02:26:45.740 |
want to be called rich but I don't know there's debate but like these are players that have you 02:26:50.380 |
know 10,000 50,000 GPUs at minimum um and so they can do a lot of work um and a lot of exploration 02:26:58.620 |
in improving models that it's not very accessible um to give you a sense of like how I personally 02:27:06.300 |
think about research budgets um for each part of the of the language model pipeline is like on the 02:27:15.340 |
pre-training side you can maybe do something with a thousand GPUs really you want 10,000 and like if 02:27:21.660 |
you want real state of the art you know your DeepSeek level the minimum is like 50,000 um and you 02:27:27.180 |
can scale to infinity the more you have the better it gets um everyone on that side still complains 02:27:32.140 |
that they don't have enough GPUs uh post-training is a super wide um sort of uh spectrum you can do 02:27:40.780 |
as little with like eight GPUs um as long as you're able to um run you know a a good version 02:27:51.100 |
of say a llama model you can do a lot of work there um you can scale a lot of the methodology 02:27:57.420 |
just like scales with compute right if you're interested in um you know your open replication 02:28:05.100 |
of what OpenAI's o1 is um you're going to be on the 10k end of the spectrum of GPUs um inference you can 02:28:12.780 |
do a lot with very few resources evaluation you can do a lot with well I should say at least one 02:28:19.020 |
GPU if you want to evaluate um open models but um in general like if you care a lot 02:28:27.660 |
about interventions to do on these models which is my uh preferred area of research then you know the 02:28:35.500 |
resources that you need um are quite quite significant um one of the trends um that has 02:28:43.340 |
emerged in 2024 is this cluster of um fully open models um so OLMo the model that we built at AI2 02:28:53.100 |
being one of them um and you know it's nice that it's not just us there's like a cluster of other 02:28:59.820 |
mostly research um efforts who are working on this um and so it's good to um to give you a primer 02:29:10.860 |
of what like fully open means um so fully open the easy way to think about it is instead of just 02:29:18.380 |
releasing a model checkpoint that you run you release a full recipe so that um other people 02:29:25.180 |
working on it uh working on that space can pick and choose whatever they want from your recipe 02:29:31.660 |
and create their own model or improve on top of your model um you're giving out the full pipeline 02:29:37.180 |
and all the details there um instead of just like the end output um so I pull up the screenshot from 02:29:44.380 |
our recent um MOE model um and like for this model for example we released the model itself 02:29:51.340 |
data that was trained on the code both for training and inference um all the logs that 02:29:57.500 |
we got through um the training run as well as um every intermediate checkpoint 02:30:03.020 |
um and like the fact that you release different part of the pipeline allows others to do really 02:30:10.060 |
cool things um so for example this tweet from early this year from uh folks at Nous Research 02:30:17.020 |
um they used our pre-training data uh to do a replication of the BitNet paper in the open um 02:30:24.220 |
so they took just a really like the initial part of a pipeline um and then did the thing on top of 02:30:31.340 |
it um it goes both ways so for example for the OLMo 2 model um a lot of our pre-training data for 02:30:39.820 |
the first stage of pre-training um was from this DCLM uh initiative uh that was led by folks at 02:30:48.220 |
a variety of institutions it was a really nice group effort but um you know for us it was nice 02:30:57.580 |
to be able to say okay you know the state of the art in terms of like what is done in the open has 02:31:01.660 |
improved we don't have to like do all this work from scratch to catch up the state of the art 02:31:07.740 |
we can just take it directly and integrate it and do our own improvements on top of that 02:31:13.660 |
um i'm gonna spend a few minutes uh doing like a shameless plug for 02:31:21.900 |
um so indulge me in this um so a few things that we released this year was as i was mentioning 02:31:30.220 |
this OLMoE model um which i think is still the state-of-the-art MoE model in its size class 02:31:38.780 |
and it's also fully open so every component of this model is available um we release 02:31:46.060 |
a multi-modal model called Molmo um Molmo is not just a model but it's a full recipe of how you go 02:31:52.460 |
from a text-only model to a multi-modal model and we applied this recipe on top of 02:31:58.940 |
Qwen checkpoints on top of OLMo checkpoints as well as on top of OLMoE um and i think there have 02:32:04.380 |
been replications doing that on top of Mistral as well um on the post-training side 02:32:14.940 |
we recently released Tulu 3 um same story this is a recipe on how you go from a base model 02:32:20.780 |
to a state-of-the-art post-trained model we used the Tulu recipe on top of OLMo on top of Llama and 02:32:28.540 |
then there's been an open replication effort to do that on top of Qwen as well uh it's really nice 02:32:34.220 |
to see like you know when your recipe sort of it's kind of turnkey you can apply it to different 02:32:39.340 |
models and it kind of just works um and finally the last thing we released this year was OLMo 2 02:32:45.260 |
which so far is the best state-of-the-art fully open language model um it sort of combines aspects 02:32:52.860 |
from all three of these previous models um what we learned on the data side from OLMoE 02:32:57.580 |
and what we learned on like making models that are easy to adapt from the Molmo project 02:33:02.700 |
and the Tulu project um i will close with a little bit of reflection on the ways this 02:33:10.380 |
ecosystem of open models um like it's not all roses it's not all happy uh it feels like day 02:33:18.060 |
to day it's always in peril um and you know i talked a little bit about like the compute issues 02:33:24.300 |
that come with it uh but it's really not just compute um one thing that is on top of my mind 02:33:30.860 |
is due to like the environment and how um you know growing feelings about like how AI is treated 02:33:39.020 |
it's actually harder to get access to a lot of the data that was used to train a lot of the 02:33:45.020 |
models up to last year so this is a screenshot from really fabulous work from Shane Longpre 02:33:50.860 |
who i think is in europe um about um just the diminishing access to data 02:34:00.140 |
for language model pre-training so what they did is they um went through every snapshot 02:34:07.260 |
of common crawl uh common crawl is this publicly available scrape of a subset of the 02:34:12.860 |
internet and they looked at um for any given website that was 02:34:19.980 |
accessible in say 2017 whether it was still accessible or not in 2024 and what they found is 02:34:26.860 |
as a reaction to the existence of closed models like OpenAI's 02:34:36.860 |
ChatGPT or Claude a lot of content owners have blanket blocked any type of crawling to their website 02:34:44.380 |
and this is something that we see also internally at AI2 um like one project that we started this 02:34:50.620 |
year is um we wanted to we want to understand like if you're a good citizen of the internet 02:34:57.980 |
and you crawl uh following sort of norms and policy that have been established in the last 25 years 02:35:05.740 |
what can you crawl and we found that there's a lot of websites where um the norms of how you 02:35:13.180 |
express preference of whether to crawl or not are broken a lot of people would block a lot 02:35:18.220 |
of crawling but do not advertise that in robots.txt you can only tell that they're blocking 02:35:24.060 |
your crawling when you try doing it and sometimes you can't even fetch their robots.txt to check whether you're allowed or not. 02:35:28.860 |
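The "good citizen" check being described is essentially the robots.txt protocol; a minimal sketch with the Python standard library is below (the user agent string is a placeholder, and this is not AI2's actual crawler).

```python
# Minimal sketch of "good citizen" crawling: check robots.txt before fetching a page.
# Just the standard-library mechanics, not AI2's crawler.
import urllib.robotparser
from urllib.parse import urljoin

def allowed_to_crawl(site: str, path: str, user_agent: str = "my-research-bot") -> bool:
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(urljoin(site, "/robots.txt"))
    try:
        rp.read()                      # may itself be blocked or unreachable
    except OSError:
        return False                   # can't even fetch robots.txt -> back off
    return rp.can_fetch(user_agent, urljoin(site, path))

print(allowed_to_crawl("https://example.com", "/some/article"))
```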
and then for a lot of um websites um there's like 02:35:37.340 |
all these technologies that historically have been have existed to make websites serving easier 02:35:42.780 |
um such as um cloudflare or dns they're now being repurposed for um blocking ai or any type of 02:35:52.300 |
crawling in a way that is very opaque to the content owners themselves um so you know you go 02:35:59.420 |
to these websites you try to access them and they're not available you get a feeling it's like 02:36:06.220 |
oh someone changed something changed on the on the dns side that it's blocking this and likely the 02:36:13.180 |
content owner has no idea they're just using uh cloudflare for better you know load balancing and 02:36:19.180 |
this is something that was sort of sprung on them uh with very little notice um and i think the 02:36:26.220 |
problem is that this um blocking really impacts people in different ways um it 02:36:35.100 |
disproportionately helps um companies that have a head start which are usually the closed labs 02:36:41.980 |
and it hurts uh newcomer players um where you either now have to do things in a sketchy 02:36:49.660 |
way um or you're never gonna get that content uh that the closed labs might have so there's been a lot 02:36:56.620 |
of coverage of this i'm gonna plug nathan's blog post again um i think the 02:37:04.140 |
title of this one is very succinct uh which is like before thinking 02:37:09.260 |
about running out of training data we're actually running out of open training data and so if one cares about 02:37:14.540 |
better open models um this should be top of mind um the other thing that has emerged is that 02:37:23.340 |
there's strong lobbying efforts on trying to define any kind of open source ai as like a new um 02:37:34.220 |
extremely risky danger um and i want to be precise here 02:37:40.380 |
um the problem is not with considering the risks of this technology every technology has risks 02:37:46.380 |
that should always be considered the thing that to me is um sort of disingenuous 02:37:52.940 |
is like just putting this ai on a pedestal um and calling it like an unknown alien technology 02:38:00.780 |
that has like new and undiscovered potentials to destroy um humanity when in reality all the 02:38:09.260 |
dangers i think are rooted in dangers that we know from existing software industry or existing 02:38:17.740 |
issues that come with when using software on um on a lot of sensitive domains like medical 02:38:25.980 |
areas and i also noticed a lot of efforts that have actually been going on and trying to make 02:38:31.500 |
these open models safe um i pasted one here uh from ai2 but there's actually like a lot of work 02:38:38.940 |
that has been going on on like okay how do you make if you're distributing this model openly 02:38:44.700 |
how do you make it safe um how what's the right balance between accessibility on open models and 02:38:50.300 |
safety um and then there's also this annoying uh brushing under the rug of concerns that are then proved to be 02:38:59.820 |
unfounded you know if you remember the beginning of this year it was all about 02:39:04.140 |
the bio risk of these open models uh the whole thing fizzled out because finally there's 02:39:11.820 |
been like rigorous research not just this paper from the Cohere folks but there's been rigorous follow-up 02:39:18.300 |
research showing that this is really not a concern that we should be worried about again there is 02:39:23.340 |
a lot of dangerous use of ai application but this one was just like a lobbying ploy to just make 02:39:30.860 |
things sound scarier uh than they actually are so i gotta preface this part by saying this is my 02:39:38.060 |
personal opinion not my employer's but i look at things like uh SB 1047 from california 02:39:45.500 |
and i think we kind of dodged a bullet on this legislation you know the open source 02:39:52.460 |
community a lot of the community came together at the last sort of the last minute um and did a 02:39:59.340 |
very good effort trying to explain all the negative impact of this bill um but um there's like 02:40:07.260 |
i feel like there's a lot of excitement on building these open models uh or like researching on these 02:40:12.860 |
open models and lobbying is not sexy uh it's kind of boring uh but um it's sort of necessary to make 02:40:20.940 |
sure that this ecosystem can can really thrive um this end of presentation i have some links 02:40:29.500 |
emails sort of standard thing in case anybody wants to reach out and if folks have questions 02:40:37.260 |
or anything they wanted to discuss it's our open floor 02:40:40.940 |
here's Sophia um one very important open model that we haven't covered 02:40:52.540 |
is Mistral so yeah yeah well it's nice to have the Mistral person yes uh to come talk and recap the year 02:40:59.900 |
for Mistral but uh while Sophia gets set up does anyone have like just thoughts or questions about 02:41:04.460 |
the progress in this space do you always have questions always i'm very curious how we should 02:41:10.140 |
build incentives to build open models things like François Chollet's uh ARC Prize and other 02:41:16.300 |
initiatives like that what is your opinion on how we should better align incentives in the community 02:41:20.940 |
so that open models stay open i think you can tap in there nice the incentive bit is like really hard 02:41:32.300 |
um it's something that we actually think a lot about internally um because 02:41:39.660 |
like building open models is risky it's very expensive um and so people don't want to take 02:41:45.340 |
risky bets um i think definitely like the challenges um like the ARC challenge i think those 02:41:54.060 |
are like very valid approaches for it um and then i think in general promoting building um any 02:42:03.740 |
kind of effort to participate in those challenges and if we can promoting doing that 02:42:09.180 |
on top of open models um and sort of really leaning into this multiplier effect um i think that 02:42:17.580 |
is a good way to go um if there were more money for um efforts um like research efforts around 02:42:27.340 |
open models there's a lot of i think there's a lot of investments in companies that at the moment 02:42:33.500 |
are releasing their model in the open which is really cool um but um it's usually more because 02:42:39.580 |
of commercial interest and not about wanting to support um these open models in the long term 02:42:46.380 |
it's a really hard problem because i think everyone is operating sort of in what everyone 02:42:52.700 |
is at their local maximum right in ways that really optimize their position on the market 02:43:02.460 |
okay somehow it's not being shared on the screen 02:43:28.140 |
uh can i ask one question yeah uh so i think one of the gaps between the closed and 02:43:34.140 |
open source models is multilinguality so the closed source models like ChatGPT were pretty 02:43:39.660 |
good on low resource languages which is not the same for the open source models right 02:43:45.020 |
so is it in your plan to improve on that space um i think in general yes is 02:43:56.220 |
here yeah just just use your natural voice yeah um i think we'll see a lot 02:44:02.460 |
of improvements there um like on the chinese side there are groups that are 02:44:08.700 |
already working on like better coverage for multilingual um support i think what our 02:44:18.140 |
challenge there is um you really want the experts who are actually in those countries 02:44:26.620 |
that use those languages to participate in the effort to give you like a very easy example 02:44:33.740 |
i'm originally from italy i think i'm terribly equipped to build a model that works well in 02:44:42.140 |
italian because one of the things you need to be able to do is have that knowledge of like okay 02:44:47.500 |
how do i access you know libraries or content that is from this region and i've been in 02:44:54.620 |
the u.s long enough that i no longer know that um so i think that the efforts that folks in 02:45:01.900 |
central europe for example are doing around like okay let's let's tap into regional communities 02:45:08.300 |
um to get access uh to bring in collaborators from those areas i think it's going to be like 02:45:15.180 |
very crucial for getting out of this area yes let me close it up 02:45:56.060 |
it's fine she's not playing any audio that's weird okay okay okay cool 02:46:06.860 |
um yeah i'm super excited to be here to talk to you guys uh about mistral uh a really short 02:46:15.260 |
and quick recap of what we have done what kind of models and products we have released in the past 02:46:21.900 |
a year and a half so um most of you already know that we are a small startup 02:46:29.420 |
founded about a year and a half ago in paris in may 2023 it was founded by three of our co-founders 02:46:36.540 |
and in september 2023 we released our first open source model Mistral 7B um yeah how many of you 02:46:44.780 |
have used or heard about Mistral 7B hey pretty much everyone thank you uh yeah it's 02:46:52.620 |
pretty popular and uh our community really loves this model and in december 2023 02:46:59.500 |
we released another popular model with the moe architecture um Mixtral 8x7B and 02:47:07.100 |
oh going into this year you can see we have released a lot of things this year 02:47:12.620 |
um first of all in february 2024 we released uh Mistral Small Mistral Large uh Le Chat which is our 02:47:20.140 |
chat interface i will show you in a little bit we released an embedding model for you know converting 02:47:28.140 |
your text into embedding vectors and all of our models are available on um the big cloud providers 02:47:37.820 |
so you can use our models on google cloud aws azure snowflake ibm so very useful for enterprises who 02:47:46.380 |
want to use our models through the cloud and in april and may this year we released another powerful 02:47:53.500 |
open source um moe model Mixtral 8x22B and we also released our first code model Codestral which is 02:48:01.820 |
amazing at 80 plus programming languages and then we provided a fine tuning service for customization 02:48:09.340 |
so because we know the community love to fine tune our models so we provide you a very nice 02:48:15.180 |
and easy option for you to fine tune our model on our platform and also we released our fine 02:48:21.020 |
tuning code base called mistral-finetune it's open source so feel free to take a look and 02:48:27.180 |
more models from july to november this year we released many many other models uh first of all 02:48:37.180 |
are the two new best small models we have Ministral 3B great for deploying on edge devices 02:48:45.340 |
we have Ministral 8B if you used to use Mistral 7B Ministral 8B is a great replacement with much 02:48:53.900 |
stronger performance than Mistral 7B we also collaborated with nvidia and open sourced 02:49:00.140 |
another model Mistral Nemo 12B another great model and just a few weeks ago we updated Mistral Large to 02:49:08.460 |
version 2 with updated state-of-the-art features and really great function calling 02:49:14.940 |
capabilities it supports function calling natively and we released two multi-modal models 02:49:21.180 |
Pixtral 12B which is open source and Pixtral Large just amazing models for not just understanding 02:49:29.980 |
images but also great at text understanding so yeah a lot of the image models are not so 02:49:36.620 |
good at text understanding but Pixtral Large and Pixtral 12B are good at both image understanding 02:49:42.540 |
and text understanding and of course we have models for research Codestral Mamba is built on 02:49:49.500 |
the mamba architecture and Mathstral is great for working with math problems so yeah those are other 02:49:57.580 |
models uh here's another view of our model lineup we have several premier models which 02:50:09.820 |
means these models are mostly available through our api i mean all of the models are available 02:50:17.020 |
throughout our api except for minister 7 3b but for the premium model they have a special license 02:50:25.660 |
the Mistral Research License you can use it for free for exploration but if you want to use it for 02:50:30.940 |
enterprise or production use you will need to purchase a license from us so on the top row here 02:50:37.580 |
we have Ministral 3B and 8B as our premier models Mistral Small for the best low latency use cases 02:50:45.820 |
Mistral Large is great for your most sophisticated use cases Pixtral Large is the frontier class 02:50:52.300 |
multimodal model and we have Codestral great for coding and then again the Mistral embedding model 02:50:58.540 |
and at the bottom of the slide here we have several apache 2.0 licensed open-weight models 02:51:06.380 |
free for the community to use and also if you want to fine tune them use them for customization or 02:51:12.460 |
production feel free to do so the latest we have is Pixtral 12B we also have Mistral Nemo 02:51:21.580 |
Codestral Mamba and Mathstral as i mentioned and we have three legacy models that we don't 02:51:28.460 |
update anymore so we recommend you move to our newer models if you are still using them 02:51:35.900 |
and then just a few weeks ago we made a lot of improvements to our chat interface Le Chat 02:51:46.300 |
how many of you have used Le Chat oh no only a few okay i highly recommend Le Chat it's 02:51:54.060 |
chat.mistral.ai it's free to use it has all the amazing capabilities i'm going to show you right now 02:52:01.180 |
but before that le chat in french means the cat so this is actually a cat logo 02:52:08.860 |
yeah if you can tell this is the cat eyes yeah so first of all i want to show you 02:52:17.020 |
something maybe let's let's take a look at image understanding 02:52:31.100 |
so here i have a receipt and i want to ask — i'm just going to get the prompt 02:52:56.460 |
yeah i had an issue with wi-fi here so hopefully it would work 02:53:03.580 |
cool so basically i have a receipt and i said i ordered a coffee and a sausage how much do i owe 02:53:17.020 |
with an 18% tip so hopefully it was able to get the cost of the coffee and the sausage 02:53:23.820 |
and ignore the other things and um yeah i don't really understand this but i think this is coffee 02:53:30.700 |
uh it's yeah nine yep and then cost of the sausage we have 22 here 02:53:38.060 |
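As a sanity check on the demo, the arithmetic the model has to perform with the prices as narrated (coffee 9, sausage 22, 18% tip) works out as follows; the real receipt amounts may of course differ.

```python
# Quick check of the receipt math as narrated (coffee 9, sausage 22, 18% tip).
# The actual receipt amounts may differ; this just shows the arithmetic involved.
coffee, sausage, tip_rate = 9.00, 22.00, 0.18
subtotal = coffee + sausage          # 31.00
tip = subtotal * tip_rate            # 5.58
total = subtotal + tip               # 36.58
print(f"subtotal={subtotal:.2f} tip={tip:.2f} total={total:.2f}")
```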
yep and then it was able to add the cost calculate the tip and all that uh great so it's great at 02:53:47.260 |
image understanding and great at uh ocr tasks so if you have ocr tasks please use it for free on 02:53:54.140 |
Le Chat and it's also available through our api. 02:54:00.380 |
also i'm going to show you a canvas example a lot of you may have used canvas with other tools before but uh 02:54:08.620 |
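For anyone who wants to reproduce the receipt question programmatically, a rough sketch against the Mistral API is below; the client method names and the image-message format reflect one version of the Python SDK and should be checked against the current docs, and the model identifier is just one published Pixtral name.

```python
# Rough sketch of asking a Mistral multimodal model the same receipt question via the API.
# SDK surface changes between versions -- treat method names and the image-message
# format as assumptions to verify against the current docs.
import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])
response = client.chat.complete(
    model="pixtral-12b-2409",  # one published Pixtral identifier
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "I ordered a coffee and a sausage. How much do I owe with an 18% tip?"},
            {"type": "image_url",
             "image_url": "https://example.com/receipt.jpg"},  # placeholder URL
        ],
    }],
)
print(response.choices[0].message.content)
```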
with Le Chat it's completely free again here i'm asking it to create a canvas that uses 02:54:15.420 |
PyScript to execute python in my browser so oh what's going on 02:54:30.700 |
yep okay so yeah so basically it's executing python uh here exactly what we wanted uh 02:54:43.180 |
and the other day i was trying to ask lachette to create a game for me let's see if we can 02:55:15.500 |
okay all right you get the idea i failed my mission um 02:55:31.580 |
uh cool yeah so uh as you can see Le Chat can write code for a simple game pretty 02:55:41.420 |
easily and you can ask Le Chat to explain the code make updates however you like um 02:55:49.100 |
another example there is a bar here i want to move okay right okay and uh let's go back 02:56:00.780 |
another one uh yeah we also have web search capabilities like you can ask what's the latest 02:56:10.540 |
ai news uh image generation is pretty cool generate an image about researchers in vancouver 02:56:21.500 |
uh yeah it's Black Forest Labs' FLUX Pro uh again this is free so 02:56:31.020 |
oh cool i guess researchers here are mostly from university of british columbia 02:56:39.820 |
uh that's smart uh yeah so this is Le Chat please feel free to use it uh and let me know 02:56:48.380 |
if you have any feedback we're always looking for improvement and we're going to release 02:56:52.460 |
a lot more powerful features in the coming years thank you 02:56:55.740 |
yeah i think we can open up the questions there's lunch also outside but uh if anyone 02:57:06.300 |
thought i don't think we have a youtube entry but if anyone has any thoughts on 02:57:10.700 |
mistral or omo or any of the others the open models 02:57:15.340 |
um yeah no i think we can just break for lunch and uh have a chat but thanks thanks so much to 02:57:23.020 |
the speakers thank you again we'll be back here what we're gonna have like some people presenting 02:57:28.620 |
during lunch um i i think i think basically just go grab lunch you can come back in and eat and 02:57:34.060 |
chat uh we'll have some people presenting as well right so unless you want to say you see material 02:57:39.580 |
okay maybe maybe maybe you get something off now 02:57:45.020 |
yeah hi everyone thank you so much for coming today um huge shout out to SWIX and the latent 02:57:55.180 |
space team i think it's been a great yeah let's just give it up for SWIX just real quick um i 02:58:02.220 |
did a little bit in terms of helping with the planning but i work at Notable Capital some of you 02:58:07.100 |
may have heard of GGV which was our former name um i'm on the cloud infrastructure team so basically 02:58:12.300 |
anything data dev tools um ai infrastructure as well as ai applications um and so we like to stay 02:58:19.260 |
close to those that are smarter than us which is all of you in this room um so if anyone ever wants 02:58:23.580 |
to you know brainstorm or thinking about starting a company um we're happy to collaborate we've had 02:58:28.380 |
the opportunity to partner with like amazing companies such as HashiCorp Vercel Neon 02:58:32.780 |
and many others over the years um and we're based in san francisco and new york so yeah feel free 02:58:38.380 |
to find me — Laura Hamilton — on X LinkedIn um you know if we become friends instagram yeah um thank you 02:58:45.740 |
all for coming and then we'll kick off some of the chats with aws after everyone gets lunch all right 02:59:15.420 |
hi these are up here too this is not mine although i did almost take it yeah it's not like everyone's 03:34:19.120 |
Like in my view, I don't know if I would do a traditional. 03:34:38.120 |
Well, hey everyone. Hope you enjoyed lunch. Thanks for thanks for dialing in here. 03:34:44.620 |
My name is Aaron. Wanted to give a quick shout out to the Latent Space team, Notable Capital, and swyx for organizing. 03:34:44.620 |
I've been in the role for about three years now. 03:34:59.120 |
I was a founding product hire at a series a company had a great exit there did machine learning for a while. 03:35:06.620 |
Did some strategy consulting with Google for a while and then joined AWS actually got this job on Twitter of all places. 03:35:15.120 |
I liked a tweet that was like, hey, I think more VC meetings should be over surf lessons. 03:35:21.620 |
And I got a DM back saying hey, do you want to come work at AWS, and it was off to the races from there. 03:35:31.620 |
Basically just wanted to kind of chat about how AWS works with founders, right? 03:35:37.620 |
I think everyone's aware compute and credits are kind of like the name of the game at this point. 03:35:43.620 |
I like to I like to think about ways to go deeper than that and figure out how we can add value beyond just like here's some GPUs. 03:35:55.120 |
So I wrote the PR FAQ for an accelerator program that is a 10 week program. 03:36:02.120 |
It just wrapped up at re:Invent last week where we take a couple companies from around the world and really lean in and try to co-build with them. 03:36:14.120 |
We do like product strategy, help them with fundraising. 03:36:21.120 |
There's like, you know, 700 people in the audience. 03:36:25.120 |
And that's just kind of like, you know, putting what we do on a day to day on the world stage because our whole team is dedicated to figuring out ways to, again, go beyond beyond credits, beyond compute and support. 03:36:38.620 |
Right. So we worked with founders from like day zero, haven't even incorporated. 03:36:43.120 |
We're still like bouncing ideas off of off of each other, thinking about ways to go to market. 03:36:48.120 |
And then, you know, beyond that, like as you're scaling, finding design partners and then getting you listed on marketplace and really co-selling together. 03:36:57.120 |
And we'd love to be a small part of the journey as you're considering entrepreneurship. 03:37:02.620 |
So if you want to chat about all things entrepreneurship, please please reach out. 03:37:12.120 |
If you do just want GPUs and compute and credits, happy to chat about that as well. 03:37:18.120 |
But but great to be here. And again, thanks to SWIX for hosting and to the notable capital team for having us and organizing. 03:37:25.120 |
So thanks, everyone. Enjoy the rest of the talks today. 03:37:31.620 |
Also, we have them to thank for lunch. So all the amazing lunch that we got. 03:37:36.120 |
This whole thing is like self-funded, community funded. So we're very much flying by the seat of our pants. 03:37:41.120 |
And also thank you to Laura for making all this happen. 03:37:44.120 |
OK, so we have a couple more presentations from folks, just people like launching things. 03:37:50.120 |
We got Drew, you're next, but Ben, I'm going to I'm going to call you up first. 03:37:54.120 |
Ben, are you ready? I can get Drew to go first. 03:37:58.620 |
Drew, Drew — you go, Drew. The amazing thing about what Drew's demoing is, well, by definition, it works offline. 03:38:06.120 |
And it's very, very viral. We're just so lucky to have him — I mean, just for me to be friends with him 03:38:15.120 |
and to invite him here to to show off the best way you can be reading papers. 03:38:20.120 |
So usually we we come here, we do we demo B2B SaaS and infrastructure as a service. 03:38:25.620 |
This is none of that. You want consumer hardware. We got consumer hardware. OK, go. 03:38:30.120 |
Oh, all right. I have to still hype him up a little bit. What else? 03:38:34.120 |
What else can I say about you? Drew's an insane violinist. 03:38:37.120 |
If you if you like visit his house, like he lives in the House of Musicians 03:38:42.120 |
and they just have classical music live all the time. It's insane. All right. 03:38:48.120 |
Cool. Yeah. Sean is a is a very flattering hype man. Really incredible. 03:39:04.620 |
Just a quick thanks to to latent space for for hosting this and for Sean, like being in. 03:39:11.620 |
I think we met almost two years ago at a replica thing. 03:39:16.120 |
And he's just like organized the entire scene in a way that makes it digestible for me 03:39:20.620 |
and everyone else. Thanks to latent space. So I work for a company called Daylight Computer 03:39:26.620 |
and we're making computers for serious people is one way that I put it. 03:39:32.620 |
But we want to make a better reading experience for researchers specifically 03:39:38.620 |
and a new surface for A.I. in our real lives. 03:39:44.120 |
So how do we we haven't heard a whole lot about consumer applications of A.I. today, 03:39:48.620 |
but I just want to show a demo, some demos we've been working on for how to integrate A.I. 03:39:55.620 |
more comfortably into research workflows, especially reading papers. 03:40:00.620 |
So I'll just quickly go over kind of what what is daylight. 03:40:05.620 |
We invented a new screen technology that works just with the light in the room 03:40:11.120 |
and has no blue light, better for eye strain, better for focus. 03:40:14.620 |
And we wrote an operating system to run this screen on our first product, 03:40:19.620 |
this tablet computer, the DC one, and it allows you to read outside 03:40:32.620 |
So we've kind of made it impossible to get interrupted by notifications and other distractions. 03:40:39.120 |
It's kind of like a Kindle and an iPad had a baby. 03:40:42.620 |
So the kinds of things we're doing with A.I. are to kind of integrate directly 03:40:51.620 |
And I just have a quick demo that I can give here. 03:40:56.620 |
It looks like we don't have sound, but I'll just narrate. 03:41:11.620 |
So the voice is going to be Anjan Kata, our founder, who invented the screen technology. 03:41:21.620 |
It's really all right. I can just talk through it. 03:41:24.620 |
So this is a poem, and often we want to go deeper into the text. 03:41:38.620 |
So this is the Daylight reading Rilke's "The Man Watching" poem. 03:41:42.620 |
One cool feature we have is we have a little button on the side 03:41:45.620 |
that you can press at any time and then you can talk to an A.I. 03:41:48.620 |
So I was a little bit confused by certain parts of the poem. 03:41:53.620 |
"What do they mean by 'we would become strong too' and 'not need names'?" 03:42:01.120 |
The phrase "not need names" suggests transcending individual identity. 03:42:06.620 |
And so we just kind of, as we were going through the entire poem, 03:42:09.620 |
we read it once and we kind of went back through. 03:42:13.620 |
and came away feeling like we understood it so much more. 03:42:19.620 |
"Can you tell us more about what they mean by these wrestlers in the Old Testament 03:42:30.620 |
It's referencing the biblical story of Jacob wrestling with an angel found in Genesis 32. 03:42:36.120 |
And that's just, like, incredibly cool that we're able to do this. 03:42:43.120 |
"Could you recommend a few other poems that mirror the themes of this one?" 03:42:55.120 |
"Resilient, Struggled, Personal, Gravitational, Challenged, and Simultaneous." 03:43:00.120 |
I'm gonna go a little bit in here and add these poems to your device and read them. 03:43:03.620 |
Yeah. So we want to bring that to research and to the entire device. 03:43:08.620 |
So one thing that's an advantage of owning the entire stack, 03:43:15.620 |
is we can tailor the workflows across any apps. 03:43:19.620 |
So this will work in a web browser, in your messages, email. 03:43:27.120 |
there's a sort of central AI that is running on your device. 03:43:29.620 |
It can capture that, put it in a knowledge graph, 03:43:35.620 |
And it's just available everywhere and with a hardware button. 03:43:43.620 |
if you're interested in knowledge graphs on device or quantized models, 03:43:49.620 |
And I actually have a couple of these here if people want to play with them. 03:44:02.120 |
We're sold out online probably until the beginning of next year, 03:44:35.120 |
There are like six patents on top of essentially a Game Boy screen. 03:44:44.120 |
and six patents on top of what's called RLCD or TLCD, that is, reflective or transflective LCD. 03:44:55.120 |
So it's liquid crystal, but it has no backlight required. 03:44:59.120 |
The sort of innovation is the reflective and transflective films 03:45:05.120 |
and like stack, you know, black magic to reflect the light back through the LCD. 03:45:21.120 |
And then the transflective part is 03:45:25.120 |
how you enable a backlight for nighttime use. 03:45:29.120 |
And we developed a layer that allows us to put a blue light free LED 03:45:34.120 |
that's like safe for your circadian health and suprachiasmatic nucleus and so on. 03:45:38.120 |
So you're not like burning your eyes out at midnight reading. 03:45:42.120 |
But it can come through similar to a normal computer screen. 03:45:46.620 |
So that's more or less the secret sauce here. 03:45:56.120 |
But we're going to release it with a, yeah, we're building it. 03:46:00.120 |
Yeah, it's going to be great. It's fun to play with. 03:46:04.120 |
And if you want to, you know, come by and try writing on it or reading on it 03:46:08.120 |
or like watching a video on it, just seeing how it feels, looks, 03:46:15.620 |
There will be a phone, you know, laptop, monitor, all those things. 03:46:36.120 |
We have Ben from StrongCompute, founder of StrongCompute. 03:46:40.120 |
I would say like one of those weird things where even though they're mostly a compute shop, 03:46:44.620 |
they also do a fair amount of like deep research. 03:46:48.120 |
This year, Ring Attention got a lot of attention from people for like scaling. 03:46:56.120 |
And we host a paper club. Like this is basically the in-person version 03:47:00.120 |
of the online paper club that we've been running for two years. 03:47:03.120 |
And the single best presentation, one of my favorite presentations of the year, 03:47:06.120 |
was from StrongCompute. So really grateful for you guys. 03:47:22.620 |
Did I get you to Zoom? I didn't. I don't think I did. 03:47:46.620 |
Zoom. This is mostly just because I want to capture your screen for the recording. 03:47:55.620 |
This is for the swag swap. The swag table is back there. 03:48:25.620 |
So what we're trying to do is make clusters a lot easier to use. 03:48:29.620 |
So anyone who's tried accessing clusters for training, 03:48:33.120 |
we're trying to be what you'd expect an elite DevOps team to be. 03:48:36.620 |
So here's kind of a feature list of some of the stuff we're going for. 03:48:44.620 |
So most people like, we actually started out optimizing, 03:48:48.620 |
well, we started out building compute hardware, 03:48:57.620 |
So we're messing around with CUDA kernels and data loading and that kind of stuff. 03:49:05.620 |
like getting these much greater efficiencies on the GPU. 03:49:08.620 |
Surely the easy part is just taking our work once it's done 03:49:11.620 |
and just putting it on a cloud platform and having it go. 03:49:14.620 |
And it turned out to be the complete opposite. 03:49:16.620 |
We got a whole bunch of awesome optimizations done in a few months, 03:49:21.620 |
but it took much longer to actually build a GPU orchestration platform that we wanted to use. 03:49:25.620 |
And what we realized was that there were just a lot of things missing. 03:49:33.120 |
What I'll show you is something we've been working on for a year, 03:49:35.120 |
which is a new UI for how you work with clusters. 03:49:47.120 |
maybe AWS has given you some credits, that's really nice of them. 03:49:55.120 |
Maybe you've already got some stuff with GCP. 03:50:03.120 |
And in each of those regions, you've got some number of GPUs. 03:50:09.120 |
And then you want to go and do things with them. 03:50:23.120 |
And that job's going to start on some cluster somewhere. 03:50:26.120 |
So with our system, it's pretty much that easy. 03:50:29.120 |
You don't need to worry about Linux, NVIDIA drivers, anything like that. 03:50:31.620 |
You get a Docker, you get root in the Docker. 03:50:33.620 |
It can just jump on the cluster straight away. 03:50:35.620 |
And so then a bunch of other people want to run jobs. 03:50:38.620 |
And so then you end up backed up in a Slurm queue. 03:50:41.620 |
And you go, "Hang on, don't we have all this other compute 03:50:45.620 |
What's it going to take to actually migrate our workloads 03:50:49.620 |
Well, what we've built is the ability to migrate jobs between clusters. 03:51:16.620 |
I've got a little bit of a video showing some of our speed demos. 03:51:31.620 |
And we have the world's fastest container registry as well. 03:51:34.620 |
So yeah, there's no more slow stuff on the cluster. 03:51:39.620 |
It's also a lot cheaper than what you'd be used to paying for egress. 03:51:42.620 |
So the vision for this is if you imagine a CPU 03:51:46.620 |
and there's a scheduler and it's sending workloads to the cores. 03:51:52.620 |
What if each core was actually a cluster in a data center? 03:52:00.620 |
And obviously, theoretically that's possible, 03:52:03.620 |
but it's all about the APIs and the interface work. 03:52:05.620 |
Normally, if you want to go and start sending workloads like that, you go 03:52:12.620 |
to the dozen DevOps people and they'll get started 03:52:16.620 |
on a multi-week, multi-month project to do that. With us it just works, 03:52:21.620 |
and without the need for any DevOps work. 03:52:24.620 |
We've also got a few other features here as well. 03:52:26.620 |
So you'll have some data sets and they might be quite large. 03:52:56.620 |
So I can go pull that down to any cluster I want. 03:53:00.620 |
And then I can go and I can do training on that data set, 03:53:07.620 |
So one of the issues that we've seen people encounter is, I can get a dev machine, 03:53:12.620 |
but it's not going to have high-speed access to my data set. 03:53:17.620 |
So here we can just set up as many nodes as we want with workstations. 03:53:21.120 |
You can carve off as many GPUs as you'd like. 03:53:23.120 |
And that'll be your container that you can work out of 03:53:29.120 |
That's that 90 gigabyte a second access to that data set. 03:53:33.120 |
And that way you can go, so this is the entire dev cycle. 03:53:38.120 |
We're not doing any production inference hosting right now, 03:53:40.120 |
but you can have fast access to your data sets from your dev container. 03:53:44.620 |
You can train on the clusters very, very easily. 03:53:47.120 |
What we want to do is eliminate any developer waiting time 03:53:53.120 |
So what does that look like for some other examples? 03:53:57.120 |
We can also, because we're able to save the state of a cluster, 03:54:06.120 |
And here we can actually just go and pack as many jobs as we want 03:54:12.620 |
and choose how often they're going to rotate. 03:54:21.120 |
Maybe this job's so important and our cluster is so backed up 03:54:24.120 |
that we just actually want to get this job some dedicated space. 03:54:30.120 |
And now we've actually gone and found resources on the cloud. 03:54:33.120 |
So we're integrated with six clouds and that's growing. 03:54:57.120 |
So if you want to launch a cluster-scale workload, 03:55:08.120 |
You'll see if your stuff works at cluster scale. 03:55:12.620 |
All the other jobs will keep going with that state saving. 03:55:32.620 |
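The job packing and rotation described here depend on being able to save and restore training state at arbitrary points. Below is a minimal sketch of that general checkpoint/resume pattern in PyTorch; it is not StrongCompute's actual tooling, and the checkpoint path is an assumed shared location.

```python
# Minimal checkpoint/resume pattern that makes job rotation possible:
# if state is saved regularly, a scheduler can pause a job and resume it
# later, possibly on a different cluster. Not StrongCompute's code; the
# checkpoint path is an arbitrary assumption.
import os
import torch
import torch.nn as nn

CKPT = "/shared/checkpoints/job_123.pt"  # hypothetical shared location

model = nn.Linear(128, 10)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
start_step = 0

# Resume if a previous rotation left a checkpoint behind.
if os.path.exists(CKPT):
    state = torch.load(CKPT, map_location="cpu")
    model.load_state_dict(state["model"])
    opt.load_state_dict(state["opt"])
    start_step = state["step"]

for step in range(start_step, 1_000):
    loss = model(torch.randn(32, 128)).square().mean()  # stand-in loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 100 == 0:  # save often enough that preemption loses little work
        torch.save({"model": model.state_dict(),
                    "opt": opt.state_dict(),
                    "step": step + 1}, CKPT)
```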
for any folks who are heading home early from Europe. 03:55:39.120 |
Just like people would train chess models on it. 03:55:44.620 |
People were really having a lot of fun with it. 03:56:01.620 |
Yeah, I saw this hackathon and I was wondering 03:56:28.120 |
at figuring out how to do this stuff in a weekend. 03:56:35.120 |
and right now the models will quite easily defeat me. 03:56:52.620 |
We've given out about a dozen of these so far. 03:56:52.620 |
And you just have to be doing something cool, 03:56:58.620 |
something that swyx would want to host at an event. 03:57:02.620 |
Yeah, some of the stuff I didn't show you today. 03:57:56.120 |
You go, "All right, this developer is allowed 03:58:16.120 |
Yeah, so the question was how we compare this 03:59:01.120 |
We haven't applied a bunch of our speed tooling to it, 03:59:02.620 |
and it's only about a 5% overhead at the moment. 03:59:11.620 |
you can do anything except install kernel modules. 04:19:16.120 |
I'm going to ask Gemini real quick a couple of questions. 04:19:28.620 |
Could you explain to me how I might go about changing my runtime type in Colab? 04:19:28.620 |
I understand that you can see the Colab interface. 04:19:43.620 |
Is there anything specific you would like to discuss or ask about it? 04:20:06.620 |
So, Gemini, could you explain to me how I would go about changing my runtime type? 04:20:16.620 |
you need to go to the Runtime menu at the top of the page. 04:20:22.620 |
A window will open where you can select the hardware accelerator you want. 04:20:28.620 |
So which one should I select if I wanted to train a model 04:20:39.620 |
If you're looking to train a model and the options shown in the hardware accelerator section 04:20:43.620 |
of the Change Runtime Type pop-up menu are the ones you have to choose from, 04:20:48.120 |
the T4 GPU is a good choice for general model training. 04:20:50.620 |
However, if your model training requires a lot of memory, 04:21:00.620 |
So, Gemini, how about this key section here off to the left? 04:21:07.620 |
How would I go about adding a new key or a new secret? 04:21:36.620 |
Yeah, you can play the audio from the computer. 04:22:08.120 |
So when you play, is it coming from the laptop now? 04:23:05.120 |
Can you help me understand how to add a new secret? 04:23:10.620 |
Looking at the screen, you will need to click the add new secret button 04:23:13.620 |
on the left side panel of the screen below the other secrets. 04:23:17.620 |
After that, you can fill out the information for the new secret. 04:23:23.620 |
All right. So for Colab, how would I go about changing my runtime type? 04:23:33.120 |
To change your runtime type in Colab, you can click on the runtime menu 04:23:35.620 |
at the top of the page, and then select change runtime type. 04:23:39.620 |
From there, you can choose between different options, such as GPU or TPU. 04:23:45.620 |
Excellent. Explain to me what the difference between GPUs and TPUs are. 04:23:51.620 |
Certainly. GPUs, or graphics processing units, are designed to handle 04:23:56.620 |
parallel processing, which is ideal for tasks like rendering images and videos, 04:24:02.120 |
and also machine learning. TPUs, or tensor processing units, 04:24:05.620 |
are custom built by Google and specifically designed for deep learning tasks, 04:24:09.620 |
making them often more efficient than GPUs for these tasks. 04:24:16.620 |
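If you want to confirm which accelerator your Colab runtime actually ended up with, one quick check from a code cell is to ask PyTorch, which Colab ships by default. This is a generic sketch, not part of the demo.

```python
# Quick check of which accelerator the current Colab runtime is using.
import torch

if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("No CUDA GPU attached; the runtime may be CPU-only or a TPU.")
```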
How would I go about accessing the terminal in Colab? 04:24:21.620 |
To access the terminal in your Colab notebook, you can click on the tools menu 04:24:25.620 |
at the top of the page, and then select command palette. 04:24:29.120 |
Then, type terminal in the search bar, and click on open a terminal. 04:24:33.620 |
This will open a terminal within your Colab environment. 04:24:38.620 |
And what command would I need to type in order to list all of the files in my current directory? 04:24:46.620 |
To list all the files in your current directory, you can use the command ls. 04:24:56.120 |
Are there any other config options that I should add for ls 04:24:59.620 |
to list all of the files, not just the directories? 04:25:03.620 |
Certainly. If you want to see all files, including hidden files, you can use ls -a. 04:25:10.620 |
This will show all files in the current directory. 04:25:13.620 |
If you would like to list all files in a detailed view, 04:25:16.620 |
including their permissions, owner, size, and modification time, you can use ls -l. 04:25:23.120 |
Also, combining both, ls -al will show a detailed list of all files, 04:25:41.620 |
Yeah, you know, Gemini, I would really prefer if you addressed me 04:25:44.620 |
and started talking like you're a pirate, then. 04:25:48.620 |
Ahoy, matey. What be your next question, then? 04:25:54.620 |
Amazing. Help me understand what the tools command palette 04:26:05.620 |
Aye, matey. The command palette be a handy tool for you. 04:26:09.620 |
It be a way to quickly access a wide range of commands and actions in Colab. 04:26:13.620 |
You can use it to add a Gemini prompt cell, add a code cell, 04:26:17.620 |
change your runtime type, and many other things. 04:26:33.620 |
I want you to have a Texas accent like John Lee. 04:26:38.620 |
Alrighty then, partner. What can I do for ya now? 04:26:41.620 |
Let's talk about that there, text like a true Texan. 04:26:52.620 |
Let me know if you need anything else, you hear? 04:26:55.620 |
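For reference, the voice and screen interactions in this demo go through the Multimodal Live API. Here is a rough sketch of opening a live session with the google-genai SDK, based on the quickstart published around the Gemini 2.0 launch; the exact method names, model id, and config keys may have changed since, so treat it as an approximation rather than current reference code.

```python
# Rough sketch of a Multimodal Live session with the google-genai SDK,
# following the early quickstart; verify names against current docs.
import asyncio
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

async def main():
    config = {"response_modalities": ["TEXT"]}  # or ["AUDIO"] for spoken replies
    async with client.aio.live.connect(model="gemini-2.0-flash-exp",
                                       config=config) as session:
        await session.send(input="Talk like a pirate from now on.",
                           end_of_turn=True)
        async for response in session.receive():
            if response.text:
                print(response.text, end="")

asyncio.run(main())
```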
So in addition to being able to interact with screens, 04:27:02.620 |
Gemini is also capable of natively outputting image tokens, 04:27:06.620 |
though we haven't released that via AI Studio today. 04:27:10.620 |
We can also couple Gemini together with code execution, 04:27:14.620 |
function calling, so you can do things like say, 04:27:18.120 |
Gemini, watch my screen, and if you see anything 04:27:21.620 |
that looks like a particular kind of notification 04:27:25.620 |
or a particular kind of flag, set off a call to a specific tool. 04:27:32.620 |
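The "watch for a flag, then call a tool" behaviour is function calling. A minimal sketch with the google-generativeai Python SDK is below; the notify_me tool and the prompt are invented for illustration, and the demo's live video stream is simplified here to a single text turn.

```python
# Minimal function-calling sketch with the google-generativeai SDK.
# The notify_me tool and the trigger prompt are invented for illustration.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

def notify_me(message: str) -> str:
    """Send an alert to the user (stand-in for a real notification hook)."""
    print("ALERT:", message)
    return "delivered"

model = genai.GenerativeModel("gemini-1.5-flash", tools=[notify_me])
chat = model.start_chat(enable_automatic_function_calling=True)
resp = chat.send_message(
    "If the following screen text contains a build failure, call notify_me: "
    "'CI: job 42 FAILED at step deploy'"
)
print(resp.text)
```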
I also like using it to help me as kind of like a posture corrector. 04:27:38.620 |
So if I give Gemini access to my camera, which you can see here, 04:27:48.620 |
It can tell me if my posture is correct or incorrect. 04:27:56.620 |
Should I stand up straighter? Do I have good posture? 04:28:03.620 |
Okay. Looking at the video, your posture does seem slightly slumped. 04:28:08.620 |
To improve it, you might try standing up straighter, 04:28:22.620 |
as you're kind of sitting at your desk to see. 04:28:26.620 |
It supports different kinds of system instructions, 04:28:26.620 |
so you can add things like the "speak like a pirate" instruction. 04:28:29.620 |
And then there are also a few different voice options. 04:28:44.620 |
that you can play around with to test out some of your favorites. 04:28:52.620 |
So if you don't want to have audio out responses, 04:29:02.620 |
to help experiment with things like bounding boxes. 04:29:05.620 |
So you can see Gemini kind of identify bounding box locations 04:29:17.120 |
and then the armadillo and the fox off to the side. 04:29:21.120 |
It's also capable of doing this for things like socks. 04:29:24.120 |
So being able to sort and filter between different kinds of socks. 04:29:28.120 |
And then also for different kinds of desserts. 04:29:33.120 |
So if you want to have bounding boxes natively output, 04:29:36.120 |
this is something that's supported not just with images, 04:29:44.120 |
you can get started with it today at aistudio.google.com. 04:29:50.120 |
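For the bounding-box output shown above, here is a small sketch of how you might request boxes over a local image via the google-generativeai SDK. The image path is a placeholder, and the [ymin, xmin, ymax, xmax] coordinates normalized to 0-1000 follow Gemini's documented convention at the time; verify against the current docs.

```python
# Asking Gemini for bounding boxes over a local image.
# Placeholder image path and API key; box format per the prompt below.
import json
import PIL.Image
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-1.5-flash")

img = PIL.Image.open("desserts.jpg")  # placeholder image
prompt = ("Return a JSON list of objects in this image, each with a 'label' and "
          "a 'box_2d' of [ymin, xmin, ymax, xmax] normalized to 0-1000.")
resp = model.generate_content([img, prompt])

text = resp.text.strip()
if text.startswith("`"):  # the model may wrap the JSON in a code fence
    text = text.strip("`").removeprefix("json").strip()
boxes = json.loads(text)

w, h = img.size
for item in boxes:
    ymin, xmin, ymax, xmax = item["box_2d"]
    # Convert back to pixel coordinates for drawing or cropping.
    print(item["label"], (xmin * w // 1000, ymin * h // 1000,
                          xmax * w // 1000, ymax * h // 1000))
```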
So all of this is freely available for you to use today to try out 04:29:54.120 |
and to also use if you want to create API keys 04:30:09.120 |
So the question was, can you speak to the agentic research? 04:30:15.620 |
how much I can speak to without getting fired. 04:30:22.620 |
we did release a couple of different options, 04:30:27.620 |
which is just using kind of commodity Gemini available APIs. 04:30:31.620 |
So you could test them out, use them for your projects today. 04:30:35.120 |
We also released something called Project Mariner, 04:30:39.620 |
which is able to interact with websites directly within the browser. 04:30:43.620 |
Again, strongly encourage you to try out multimodal streaming API 04:30:49.620 |
And you'll probably be able to get very close 04:30:51.620 |
to the kinds of experiments that you saw released just via video. 04:30:55.620 |
But those are the kinds of things that we're focusing on for agents, 04:31:00.120 |
not just being able to understand and generate text and code, 04:31:06.620 |
with these multimodal and streaming experiences. 04:31:49.620 |
I've tried Colab before, but I've never tried the cloud interface. 04:32:05.120 |
how I would use one of the cloud models on this interface? 04:32:11.120 |
Oh, I had switched it to only output just text, I think. 04:32:54.620 |
or one of the cloud models within the screen? 04:32:59.620 |
For some reason, it's not wanting to have the audio outputs anymore. 04:33:40.120 |
For some reason, the audio isn't wanting to work for me anymore. 04:34:08.120 |
it's not wanting to understand the audio anymore. 04:34:34.620 |
No, no, I think the next speaker doesn't have audio. 04:34:45.620 |
And also, I encourage you all to go try it out 04:35:36.620 |
Eugene is a member of our paper club every week, 04:35:42.120 |
He's got a whole article about hardware scaling