Back to Index

Latent Space LIVE! - Best of 2024: Startups, Vision, Open Src, Reasoning, & The Great Scaling Debate


Chapters

0:00
20:52 Conviction Startups Overview
78:44 Best of Vision 2024 (Roboflow x Moondream)
322:59 Loubna (HF) synthetic data and smol models
457:40 The scaling/wall debate w/ Dylan Patel
465:26 Opening statements

Transcript

okay um i think we're going to kick this off um thanks to everyone who made it early this morning um this is like a really weird experiment that we wanted to try because one we saw this space uh but two also i've been to a number of these things now and um i always felt like there was not enough like industry content for people and we wanted an opportunity while everyone is in town in like one central spot to get everyone together um to talk about the best stuff of the year review the year it's very nice that neurips is always at the end of the year um and so i'm very honored that uh sarah and pranav have agreed to help us kick this off um sarah i've known for i was actually counting 17 years um and she's gone on to be uh enormously successful as an ai investor um even uh even in your greylock days i was tracking your investing and it's uh it's come a long way since then um and pranav uh i've known for a shorter time but he's also starting to write uh really incredible posts and opinions about what he's seeing as an investor so i wanted to kick this off with the industry session um we have a great day of sort of like best of year recaps uh lined up i think vik is here as well um and uh and the roboflow guys so uh i would just let you kick it off thank you hi everyone uh my name is sarah guo and thanks to uh sean and friends here for having me and pranav so um i'd start by just giving 30 seconds of intro i promise this isn't an ad uh we started a venture fund called conviction about two years ago here is a set of the investments we've made uh they range from companies at the infrastructure level in terms of feeding the revolution to foundation model companies alternative architectures domain specific training efforts and of course applications and the premise of the fund sean mentioned i worked at greylock for about a decade before that and came from the product engineering side was that uh we thought that there was a
really interesting technical revolution happening uh that it would probably be the biggest change in how people use technology in our lifetimes and that represented huge economic opportunity and maybe that there would be an advantage versus the incumbent venture firms in that when the floor is lava the dynamics of the markets change the types of products and founders that you back change uh it's a lot for existing firms to ingest and a lot of their mental models may not apply in the same way uh and so there was an opportunity for first principles thinking and if we were right we would do really well and get to work with amazing people and so we are two years into that journey and we can share some of the opinions and predictions we have with all of you um sorry i'm just making sure that isn't actually blocking the whole presentation uh pranav is going to start us off um so quick agenda for today we'll cover some of the model landscape and themes that we've seen in 2024 uh what we think is happening in ai startups and then some of our latent priors uh on what we think is working in investing so um i thought it'd be useful to start from like what was happening at neurips last year in december 2023 so in october 2023 openai had just launched the ability to upload images to chatgpt which means up until that moment it's hard to believe but like roughly a year ago you could only input text and get text out of chatgpt um the mistral folks had just launched the mixtral model right before the beginning of neurips google had just announced gemini i very genuinely forgot about the existence of bard before making these slides and europe had just announced that they were doing their first round of ai regulation but not to be their last and when we were thinking about like what's changed in 2024 there are at least five themes that we could come up with that feel like they were descriptive of what 2024 has meant for ai and for startups and so we'd start with um first it's a much closer race on the foundation model side than it was in 2023 so this is the lmsys chatbot arena where they ask users to rate uh generations from specific prompts so you get two responses from two language models and answer which one of them is better the way to interpret this is like roughly 100 elo difference means that you're preferred two-thirds of the time and a year ago every openai model was like more than 100 points better than anything else and the view from the ground was roughly like openai is the ibm there is no point in competing everyone should just give up go work at openai or attempt to use openai models and i think the story today is not that i think it would have been unbelievable a year ago if you told people that a the best model today at least on this eval is not openai and b that it was google it would have been pretty unimaginable to the majority of researchers but actually there are a variety of proprietary language model options and some set of open source options that are increasingly competitive and this seems true not just on the eval side but also in actual spend so this is ramp data there's a bunch of colors but it's actually just openai and anthropic spend and the openai spend at the end of last year in november of 23 was close to 90 percent of total volume and today less than a year later it's closer to 60 percent of total volume which i think is indicative both that language models are pretty easy apis to switch out and people are
trialing a variety of different options to figure out what works best for them related second trend that we've noticed is that open source is increasingly competitive so this is from the scale leader boards which is a set of independent evals that are not contaminated and on a number of topics that actually the the foundation models clearly care a great deal about open source models are pretty good on math instruction following and adversarial robustness the llama model is amongst the top three of evaluated models i included the agentic tool use here just to point out that this isn't true across the board there are clearly some areas where foundation model companies have had more data or more expertise in training against these use cases but models are surprisingly an increasing open source models are surprisingly increasingly effective this feels true across evals this is the mmlu eval i want to call it two things here one is that it's pretty remarkable that the ninth best model and two points behind and uh the the best state-of-the-art models is actually a 70 billion parameter model i think this would have been surprising to a bunch of people who were the belief was largely that most intelligence is just an emergent property and there's a limit to how much intelligence you can push into smaller form factors in fact a year ago the the best small model or under 10 billion parameter model would have been mistral 7b which on this eval if memory service is somewhere around is 60 and today that's the llama 8b model which is more than 10 points better the the gap between what is state-of-the-art and what you can fit into a fairly small uh form factor is actually actually shrinking um and again related the we think the price of intelligence has come down substantially this is this is a graph of flagship open ai model costs where the cost of the api has come down roughly 80 85 and call it the last year year and a half which is pretty remarkable this isn't just open ai2 this is also like the full set of models this is from artificial analysis which tracks cost per token across a variety of different apis and public inference options and like we were doing some math on this if you wanted to recreate like what a the kind of data that a text editor had or that like something like notion or coda that's somewhere in the volume of a couple thousand dollars to create that volume of tokens that's pretty remarkable and impressive it's clearly not the same distribution of data but just as like a sense of scope the there's an enormous volume of data that you can create and then fourth we think new modalities are beginning to work start quickly with biology we're lucky to work with the folks at chi discovery who just released chi1 which is open source model that outperforms alpha fold 3 it's impressive that this is like roughly a year of work with a pretty specific data set and then pretty specific technical beliefs but models in domains like biology are beginning to work we think that's true on the voice side as well point out that there were voice models before things like 11 labs have existed for a while but we think low latency voice is more than just a feature it's actually a net new experience interaction using voice mode feels very different than the historical transcription first models same thing with many of the cartesian models and then a new nascent use case is execution so cloud launch computer use openai launched code execution inside of canvas yesterday and then i think devon just announced 
that you can all try it for 500 a month which is pretty remarkable it's a set of capabilities that have historically never been available to vast majority of population and i think we're still in early innings cognition the company was founded under a year ago first product was roughly nine months ago which is pretty impressive if you recall like a year ago the point of view on swebench was like it was impossible to surpass what 15 percent or so and i think the the whole industry now considers that if not trivial accessible yeah um last new modality we wanted to call out although there are many more is video um i took the liberty i got early access to sora and managed to sign up before they cut off accesses so um here is my favorite joke in the form of a video hopefully someone here can guess it yeah you're telling me a shrimp fried this rice it's a pretty bad joke but i really like it and i think this one the next video here is uh one of our portfolio companies hey jen that translated and does the dubbing for or lip sync and dubbing for live speeches so this is javier mille who speaks in spanish but here you will hear him in english if this if this plays um and you can see that you can capture the original tonality of of his speech and performance i think auto here doesn't work but we'll we'll push something publicly sure um let's give it a shot yeah excellent of the western world yeah and you can hear that this captures like his original tone uh and like the emotion in his speech which is definitely new and pretty impressive from from new models um so the last uh the yeah that makes sense um the last point that we wanted to call out is uh the much purported end of scaling i think there is a great debate happening here later today on the question of this but we think at minimum it's hard to deny that there are at least some limits to the the clear benefits to increasing scale um but there also seems like there are new scaling paradigms so the question of test time compute scaling is a pretty interesting one it seems like openai has cracked a version of this that works and we think a foundation model labs will come up with better ways of doing this and b so far it largely works for very verifiable domains things that look like math and physics and maybe secondarily software engineering where we can get an objective value function and i think an open question for the next year is going to be how we generate those value functions for spaces that are not as well constrained or well defined so the question that this leaves us in is like well what does that mean for startups and i think a prevailing view has been that we live in an ai bubble there's an enormous amount of funding that goes towards ai companies and startups that is largely unjustified based on outcomes and what's actually working on the ground and startups are largely raising money on hype and so we pulled some pitch book data and the 2024 number is like probably incomplete since not all rounds are being reported and largely suggests like actually there is a substantial recovery in funding and maybe 2025 looks something like 2021 but if you break out the numbers here a bit more the red is actually just a small number of foundation model labs like what you would think of as the largest labs raising money which is upwards of 30 to 40 billion dollars this year and so the reality of the funding environment actually seems like much more sane and rational it doesn't look like we're headed to a version of 2021 in fact the the foundation 
model labs account for an outsized amount of money being raised but the the set of money going to companies that are working seems much more rational and we wanted to give you we can't share numbers for every company but this is one of our portfolio companies growing really really quickly um we think 0 to 20 and just plg style spending is pretty impressive if any of you are doing better than that you should come find us we'd love to chat and so what we wanted to try and center a discussion on this is certainly not all of the companies that are making 10 million more or revenue and growing but we took a selection of them and wanted to give you a couple ideas of patterns that we've noticed that seem to be working across the board um the first one that we've noticed is like first wave service automation so we think there's a large amount of work that doesn't get done at companies today either because it is too expensive to hire someone to do it it's too expensive to provide them context and enable them to be successful uh at uh at whatever the specific role is or it's too hard to manage um those set of people so prescribing it's too expensive to hire those specific set of people for sierra and decagon for customer support style companies it's really useful to do like next level automation and then there's obviously growth in that and for harvey and even up the story is um you can do first wave professional services and then grow beyond that second trend that we've noticed is better search new friends so we think that there is a it's pretty impressive like how effective text modalities have been so character and replica have been remarkably successful companies and there's a whole host of not safe for work chatbots as well that are pretty effective at just text generation they're pretty compelling mechanisms and on the productivity side perplexity and glean have demonstrated this as well i worked at a search company for a while i think the changing paradigms of how people capture and learn information is pretty interesting we think it's likely text isn't the last medium their infographics or sets of information that seem more useful or sets of engagement that are more engaging um but this feels like a pretty interesting place to start oh yeah okay mike so one thing that i've worked on investing in in a long time is democratization of different skills be they creative or technical this has been an amazing few years for that across different modalities audio video general image media text and now code and and really fully functioning applications um one thing that's really interesting about the growth driver for all of these companies is the the end users in large part are not people that we thought of as we the venture industry you know the royal we thought of as important markets before um and so a premise we have as a fund is that there's actually much more instinct for creativity visual creativity audio creativity technical creativity than like there's latent demand for it and ai applications can really serve that i think in particular mid journey was a company that is in the vanguard here and nobody understood for a long time because the perhaps outside view is like how many people want to generate images that are not easily you know the raster they're not easily editable they can't be using these professional context in a complete way and the answer is like an awful lot right for a whole range of use cases and i think we'll continue to find that especially as the capabilities improve and 
we think the the range of um uh quality and uh controllability that you can get in these different domains is still it's very deep and we're still very early um and then i i think as if if we're in the first or second inning of this ai wave one obvious place to go invest and to go build companies is the enabling layers right um shorthand for this is obviously compute and data i think the the needs for uh data are largely changed now as well you need more expert data you need different forms of table talk about that later in terms of who has like let's say reasoning traces in different domains that are interesting to companies doing their own training but this is this is an area that has seen explosive growth and we continue to invest here um okay so maybe time for some opinions there was a prevailing narrative that um you know some part from companies some part from investors it's a fun debate uh as to where is the value in the ecosystem and can there be opportunities for startups um if you guys remember the phrase gpt rapper it was like the dominant phrase in the tech ecosystem for a while of and what it what it represented with this idea that there was no value at the application layer you had to do pre-training and then like nobody's going to catch open ai and pre-training and you know this isn't this isn't like a a knock on open ai at all these these labs have done amazing work enabling the ecosystem and we continue to partner with them and and others but um but it's simply untrue as a narrative right the odds are clearly in favor of a very rich ecosystem of innovation you have a bunch of choices of models that are good at different things you have price competition you have open source uh i think an underappreciated impact of test time scaling is you're going to better match user value with your spend on compute and so if you are a new company that can figure out how to make these models useful to somebody the customer can pay for the compute instead of you taking as a as a startup the capex for pre-training or um or rl up front uh and um uh as pranav mentioned you know small models especially if you know the domain can be unreasonably effective uh and the product layer has if we look at the sort of cluster of companies that we described shown that it is creating and capturing value and that it's actually a pretty hard thing to build great products that leverage ai um so so broadly like we have a point of view that i think is actually shared by many of the labs that the world is full of problems in the last mile to go take even agi into all of those use cases is quite long okay another prevailing belief is that um or you know another great debate that sean could host is like does the value go to startups or incumbents uh we must admit some bias here even though we have you know friends and portfolio former portfolio companies that would be considered incumbents now but um uh oh sorry swap swap uh swap views sorry uh you know there are there are markets in venture that have been considered traditionally like too hard right like just bad markets for the the venture capital spec which is capital efficient rapid growth that's a venture backable company um where the end output is a you know a tens of billions of dollars of enterprise value company um and and these included areas like legal health care defense pharma education um you know any traditional venture firm would say like bad market nobody makes money there it's really hard to sell there's no budget etc and and one of the things 
that's interesting is if you look at the cluster of companies that has actually been effective over the past year some of them are in these markets that were traditionally non-obvious right and so perhaps one of our more optimistic views is that ai is really useful and if you make a capability that is novel that is several magnitudes um orders of magnitude cheaper then actually you can change the buying pattern and the structure of these markets and maybe the legal industry didn't buy anything because it wasn't anything worth buying for a really long time that's one example um we we also think that like what was the last great consumer company um maybe it was discord or roblox in terms of things that started that have just like really um enormous user basis and engagement uh until you know we had these consumer chatbots of different kinds and and like the next perhaps the next generation of search as Pranav mentioned we think that the um opportunity for social and media generation and games is uh large and new in a totally different way um and and finally uh in terms of the markets that we look at uh i think there's broad recognition now that you can sell against outcomes and services rather than software spend with ai because you're doing work versus just giving people the ability to do a workflow but um if you take that one step further we think there's elastic demand for many services right uh our classic example is um there's on order of 20 to 25 million professional software developers in the world uh you know i imagine much of this audience is technical uh demand for software is not being met right if we take the cost of software and high quality software down two orders of magnitude we're just going to end up with more software in the world we're not going to end up with fewer people doing development at least that's what we would argue um and then finally on the incumbent versus uh startup question uh the prevailing narrative is incumbents have the distribution the product surfaces and the data don't bother competing with them they're going to create and capture the value and share some of it back with their customers i think this is only partially true um they incumbents have the distribution they have always had the distribution like the point of the startup is you have to go fight with a better product or a more clever product um and maybe a different business model to go get new distribution but the specifics around the product surface and the data i think are actually worth understanding there's a really strong innovators dilemma if you look at the sas companies that are dominant they sell by seat and if i'm doing the work for you i don't necessarily want to sell you seats i might actually decrease the number of seats um the tens of the decades of years and millions of man and woman hours of code that have been written to uh enable a particular workflow in crm for example may not matter if i don't want people to do that workflow of filling out the database every friday anymore and so i i do think that this sunk cost or the incumbent advantage gets highly challenged by new ux and code generation as well and then one disappointing learning that we found in our own portfolio is no one has the data we want in many cases right so imagine you are trying to automate a specific type of knowledge work uh and what you want is the reasoning trace um all of the inputs and the output decision um like that sounds like a very useful set of data and the incumbent companies in any given domain 
they never save that data right like they have a database with the outputs some of the time and so i i would say uh one of the things that is worth thinking through as a startup is um when an incumbent says they have the data like what is the data you actually need to make your product higher quality okay so in in summary um you know our shorthand for the set of changes that are happening is software 3.0 we think it is a full stack rethinking and it enables um in a a new generation of companies to have a huge advantage the speed of change um favors startups if the floor is lava it's really hard to turn a really big ship uh i think that some of the ceos of large companies now are incredibly capable but they're still trying to make a hundred thousand people move very quickly in a new paradigm um the market opportunities are different right these markets that we think are interesting and very large like represent a trillion dollars of value are not just the replacement software markets of the last two decades um it's not clear what the business model for many of these companies should be uh sierra just started talking about charging for outcomes um outcomes based pricing has been this holy grail idea in software and it's been very hard but now we do more work um uh there are other business model challenges um and so you know our companies they spend a lot more on compute than they have in the past they spend a lot with the foundation model providers they think about gross margin uh they think about where to get the data uh it's a time where you need to be really creative about product um versus just replace the workflows of the past uh and it might require ripping out those workflows entirely it's a different development cycle i bet most of the people in this room have written evals um and like compared to you know the academic benchmark to a real world eval and said like you know that's not it and how do i make a user um understand uh the um non-deterministic nature of these outputs or gracefully fail i think that's like a different way to think about product than in the past um and we we need to think about infrastructure again right um there was this middle period where the cloud providers the hyperscalers took this problem away from software developers and it was all just going to be like i don't front end people at some point and it's like we are not there anymore we're back in the hardware era where people are um acquiring and managing and optimizing compute and i think that will really matter in terms of capability and companies um so uh i guess we'll end with a call to action here and and encourage all of you to seize the opportunity um it is the greatest technical and economic opportunity that we've ever seen like we made a decade plus career type bet on it and um uh we do a lot of work with the foundation model companies uh we think they are doing amazing work and they're great partners and even co-investors in some of our efforts but uh i think all of the focus on their interesting missions around agi and safety um do not mean that there are not opportunities in other parts of the economy the world is very large and we think much of the value will be distributed in the world through an unbundling and eventually a re-bundling uh as often happens in technology cycles um so we think this is a market that is structurally supportive of startups we're really excited to try to work with the more ambitious ones and the theme of 2024 um to us has been like well thank goodness this is a this 
is an ecosystem that is much friendlier to startups than 2023 it is what we hoped um and and so uh you know please uh ask those questions and take advantage of the opportunity do those things work yeah hello they do work i can kick us off okay so if some of these companies um can go from you know 1 to 20 in such a short amount of time do you think that they can also disappear in a short amount of time uh i can i can take this one i mean uh i think you've seen companies go from zero to 80 million and stall out pretty badly actually um so your data is correct um there's gonna be uh there's a set of challenges that um are just the challenges of scale right like i think sometimes the revenue numbers in these companies can overstate the maturity of the businesses themselves right they need to figure out how to serve customers they need to scale their leadership um they need to uh prepare to uh service these customers um with the right quality level and you know like the company that we showed that went zero to 20 that company has 20 people right and they have you know x hundred thousand users is yeah it's very challenging um and and so i think there there's a set of good hard problems that these companies will have i think part of the like most catchphrases or memes they don't catch on unless there's some seat of truth and so there was a set of companies that were described by this term gpt wrapper that were not more than a somewhat trivial set of prompts and seo pages that directed people to our particular use case and i think that's not uh that's like likely not a durable position as a technology company um and and so it's not a very clean answer for you it's a it's a nuanced one but some of the value that is represented by this um i'm going to scroll back to it some of this value that is represented by this cluster is durable and that's the thing that we are interested in um uh the the zero to 20 and the zero to 80 and then collapse it's actually valuable it's just not durable right users are voting for it and other people can compete and so you know we kind of separate these two questions of like you know which of these companies is defensible um and where is the revenue or the usage not a novelty but something that's really important to like work or player communication sean do you want me to take questions or do you want to do it yeah well yeah you can do it hi hi um i think my mic oh here it goes so if all of these companies need a lot more money and this is the greatest economic opportunity ever uh don't we need much bigger venture funds like orders of magnitude bigger and won't the economics of those funds be really broken if they're still raising 40 million dollar like gonna invest in a bunch of seed company funds okay uh this is a bit of a triggering question for me because i take a particular point of view on it um uh hopefully without arrogance we've chosen to raise funds that are relatively small um as early stage investors uh and part of it is the the view of um like this company that you know this company uh i think they've spent like maybe seven million dollars to date right um and so the view that all ai product companies or all ai companies in general are very expensive is not true objectively we have we have several companies that are um expensive in the traditional sense of sass like we got to go hire a lot of go-to-market people and we have to pay them and there's a j curve of that investment before it comes back in repeatable sass revenue um uh and you know i think um 
inference revenue uh we have companies that are profitable or break even and have been incredibly efficient and we have companies that spend a lot up front and so i think there's a an entire range um our view as a firm is uh that you know very early on um my friend a lot has a a funny phrase here which is um no gpu before product market fit i think that is not always true we have given people gpus before anything right but but there's there's a a shred of truth in this which is you can experiment like thank you to the open ai and anthropics and um other companies of the world that allow uh great product people to experiment at very low cost very incrementally and so i i think much of our portfolio looks like those companies where you're going to see what kind of value you can bring to users without spending a ton up front um as one example like we just saw um uh new fine tuning interfaces for a one come out the amount of data that you need to in theory improve um those models for a particular domain is very small if that pans out like that's incredibly encouraging as well so so i would say like i our goal is to work with the most important companies in ai with a relatively small fund and i think that um most companies don't actually they don't benefit from a huge amount of capital up front um the only thing i would add to that is uh i i think an interesting trend is that we work with a number of second time founders whose point of view this time around is like we're never going to make the company that big again i think it's not a surprise actually i was doing the math in my head and um this rough ratio of a million dollars of revenue for per employee of early stage company holds true for like a remarkable number of our companies like a number of our companies have more millions in revenue than they do employees and the point of view of a bunch of this is like we're going to keep it that way like we're we're not going to grow into a giant team uh ai will make us much more efficient and if you believe in the grand vision of much of the intellectual labor that we do should actually just be captured by some number of models and we can build much more long-term efficient businesses than we have been able to historically i do think it's an interesting question because um if we think there is this much opportunity like your opportunity doesn't come evenly right and so i'd say our investment pacing is higher than i guess mine has been traditionally and uh another part of our view is like okay well we want to offer and we want to offer founders a certain service level um and you know founders can decide if they want that or not but it is it's very time expensive to us we can only work with that many companies we think many more are really interesting and that is one of the reasons that pranav and i did this program for the ecosystem called embed where we can work with a larger set of companies we own less but we give them you know uh a network and some guidance and and it is genuinely because there are more interesting things that we think are going to work than we can work on in a traditional um like artisanal venture sense and shameless plug applications will open in january i think if i press a button so fast oh so fancy cool uh hi thanks for the talk it was awesome so i work for a series c enterprise focused company called writer and one of the interesting things about the multi-modality thing that we're seeing in the enterprises beyond vision we're not actually seeing a lot of like demand for 
multi-modality like we'll get asked about um audio and video stuff but then when we ask like sort of what's the use case it's sort of like i don't know and so i'm curious if if you and your um like portfolio companies are are seeing that in the enterprise space and if so like what use cases it seems very focused like the multi-modality stuff seems great for the consumer level i'm curious if you're seeing anything on the enterprise side i think it's a good call out um enterprises the data they have is mostly like it's text it's like structured data and some sql data like it's uh um i don't think your average enterprise has that much vision video audio data that is that interesting um but i think that will change um like maybe it's because i'm like lazy and disorganized but humans are very unstructured like they don't want they don't necessarily think in terms of like relational database schema and like hierarchical management of their own information uh and i i think there's a future where we take that away from people um and um the capture of information that you're going to use for different enterprise workflows um uh enables more multi-modal use if that makes sense and so like the sort of obvious example would be there are companies from like perhaps a half generation ago like the gongs of the world that captured video and found some um keywords and initial insights uh for sales reps but the communications within an organization the decisions made um the uh things that people create i think there will be much more capture especially of video but um uh making use of it requires companies to do that capture um so we kind of require this intermediate step i think there's a company in our uh and this is still a prosumer company today as well to your point of like you know the consumer prosumer side is ahead of the enterprise but there's a company in our last embed batch called highlight that kind of has this premise that like okay well you know we're going to use the multi-modality by using on-screen capture that's what this little like bubble is on screen and audio capture and i think that um i think it's a powerful idea uh hi by the way just a quick check uh peter isaac are you here uh hi thanks yeah there's sort of like a meme going around that the the price of intelligence is going to go to zero um and you can kind of see this with gpt40 and and with gemini flash you can get a million tokens a day which is probably enough for a small company right like so i'm curious how as these large companies lose tons of money for market share like how are startups going to respond to this like how does that change the market um i think it is impossible for anything to be too cheap so i'll start with that um i would also say this company with this like awesome revenue chart like i'm pretty sure we paid like five to seven million dollars to a uh foundation model provider in this period of time right and so um uh demand is like if there was like a secondary theme to this talk demand is elastic in so many ways especially for technology and when you make things cheaper we want things to be more intelligent right um and so if you make hundreds of calls in order to deliver an output um then suddenly like the fact that the cost of calls come down 85% doesn't do you enough uh and so yes it's like an incredibly compelling idea of like having intelligence too cheap to meter i'm like maybe this is really old school of me but for the last two decades like the internet and compute and software and data pipeline 
like they it still hasn't been cheap enough actually we would do more if it was free so uh the other like uh physical barrier that we've run into is um when models are really large if you're not going to like quantize and distill and do domain specific things like it's hard to run you need a lot of compute just to state the very basics and even with the foundation model providers we are seeing people run into inference capacity issues and so um i do not know if this is true but uh like one way to read anthropic pricing change is there's not enough capacity right uh and so i think like um incredible kudos to the open source ecosystem incredible kudos to open ai for like staying on this drumbeat of offering cheaper and cheaper intelligence in every generation but uh like we have a companies that are spending a lot of money on um you know let's say um search and validation systems with many calls and we think that will continue i think you can see that as well in like the the price charts that we had before the like one pricing is still absurd um it it seems like it actually is gpt3 pricing right yeah but i mean volume of tokens i think um like it is really interesting that if you believe like the i mean the the other part of this is like if you look at the test time compute scaling um this is it's a log scale like uh it's easy to forget that like that's a lot of like historically um like as a result of overtraining a small set of companies took on the majority of financial burden for generating high quality models which is you just overtrain the shit out of your model and then it's useful for everyone else um if the customer has to pay this like that's a lot of money um if you want high quality generation and that means that i pay on the order of like thousands of attempts um that's that ends up being pretty expensive um question from youtube uh so hi to the youtube audience um so we you know you talked about price right price going down uh there's also the other dimension of capabilities going up and people always getting steamrolled by open ai so the question is what are some specific ways that you've seen companies build to prepare for better models like gpt5 or o2 like how do you future proof that um so i i think the like the most common refrain from at least opening i but i think the the model companies is you should build a company where you're excited when you hear that a new model is coming out not anxious um i would have like one edit to this which is like in the limit it seems like the majority of things that are worth building today are actually i don't know should you hire a sales team at all if if you think that models would be perfectly capable um like one framing that i've thought about on this is um you should decide like uh how much you believe uh foundation models will improve on like some core learning or intelligence capability um and then build your company imagining that on that prediction so the like an example here would be um like if you take like i think there's a generation of these like copywriting companies that uh were largely subsumed by chat gpt and the the story for many of them was the original usage was they understood better than other people how to get the model to like learn what my intent was in generating some piece of content some piece of seo content or they understood how to ingest information about my business and it's not hard to imagine like the next generation of models are just natively better at this like the context length gets longer you can 
stuff more into the context length you can crawl and like learn more about external websites like all that is like relatively cheap and so if the the core thesis the company looks like we don't think models will be capable of doing that that feels uh likely short-sighted on the other hand like there are a number of delivery mechanisms that are like far out of range of what what models will do like sarah had a a good example of this which is like there are some businesses where the limiting factor is like not actually intelligence like the the limiting factor for a number of businesses is like access to a specific set of people or um like i don't know we work with a pharmacy services company where like a core question is like long term can you negotiate pricing contracts the core issue there is on intelligence you need some amount of scale and then the ability to negotiate contracts um so i think i think many businesses are not exactly just a function of your ability to efficiently compute some small set of things i gave this presentation um with pranav and i'm like oh i'm so biased it just sounds like startups are gonna win everything and i'm um we still there i like to play this game which is what investment decision do you regret from the past year it's a really fun game i'm super fun yes um but one of the one of the decisions that i regretted was actually um a company that operates in uh uh a space that feels very core to perhaps foundation model companies and to hyper scale software players where there's tons of ecosystem risk around the company and by the way the people are amazing the metrics were amazing we're just like oh they're gonna get crushed and so with everything i said i still like overestimated the incumbents like ability to compete and make aggressive strategic decisions and so um i i think it's like really hard to overstate how important it is to understand um somebody can steamroll you if they focused all of their effort and all their best people on a particular area um are they going to right the copywriting example is illustrative because it's just not hard to see that understanding the context of a business from its website and from a couple documents and by making prompting a little bit easier and adding like some buttons that replace some prompts or doing suggested queries like it's just not a lot of work right but there are things that are a lot of work like having taste in developer products and distributing something amazing and so uh i i i actually think that um uh it's if you ask me like we have to make predictions in this business i worry more about under projecting capability than i worry about over projecting at least in the short term and then i worry more about um expecting too much from the incumbents and being too afraid of them than uh being not afraid enough maybe it's just one investment regret either one of you yeah we have one more from online oh okay you can do the online one uh how do you see ai changing hardware or in what ways and for example do you see a new apple coming out transforming hardware to that level not specifically the humane situation they're trying to ask very general how ai interview uh i'm sorry okay i i'd approach this from um uh two dimensions um uh everybody every investor wants a like a new consumer hardware platform to exist because it's so valuable and the question is like why why should it um i can think of two very good reasons one is that the usage pattern that you can imagine for ai applications actually requires you 
to um like the specs you'd want are different right like what if i want to capture image or video 100 of the time and um that's like a determinant of my battery life of my sensors of how i manage my network etc what if i want to run local models all the time like maybe like most of the phone should be a gpu right um i don't uh i i think that the usage patterns are perhaps very different for the next generation of you know the the intelligence in your hand um i think it's a hard thing to pull off another reason that you could believe in a new hardware device is that the advantages of the existing consumer platforms go away right and so at the extreme like should you have individual applications that track a single habit like drink water today sarah like i don't know like i can generate that pretty easily now and like maybe the single function applications that live in the mobile phone ecosystems are um part of uh a more general intelligence and um they like that ecosystem is less important um and so i i think there are different arguments for this uh and like we continually look for uh opportunities to invest here i don't think this is exactly what you asked but i also think the um like there are we invested in a company this past year um that is doing uh robotics um i for many years at graylock my prior firm like thought of robotics as an easy way to lose a lot of money over a long period of time um and and like i think that is true when you look at the outcome set for classical robotics even for the companies that got to scale of distribution for an industrial robot or a single use consumer robot um but like it's really cool that algorithms and generalization from um the broader machine learning field seem to apply here as well uh and so i think being imaginative about what physical intelligence looks like is also something we're excited about yeah okay okay okay so related to agents i think everyone has been chatting about agents you're seeing more like agent usefulness and production but i'm more curious like at the infrastructure layer what agent what infrastructure primitives do you think are required for agents to actually work and continue to work in production um okay i uh i don't know we talked about this a little bit i'm not sure if our points of view in this are the same i think it is um i think it's really hard to tell um my suspicion is that um like if you look at the number of like true agents that work like the number roughly rounds to zero maybe it's like low single digits or low double digits now um double double yeah and uh like they're all like relatively recent i would say like beginning of this year um we saw like a bunch of agent framework companies and um like i uh like i empathize with like the the root of the question which is it's just really hard to tell what any of these companies need especially when like this set of companies that works really well is unclear and um i i think there's a lot of valid anxiety on what foundation model companies want the interface to be like the computer's interface is a pretty low level one like you the anthropic version is like actually just make specific clicks and you know like rumors of other interfaces are like much more general like they're take actions on a specific web page um or like entire browser environments and so um like at a high level like i imagine that there are sets of like there's like the full scope of tools which is like i worked in a search engine for a while like crawl seems pretty useful live data seems 
pretty useful like an api that looks something like here's a url give me the set of data that's available or here's a url and a user login let me take some action on this page seems pretty useful um and then i don't know what the right place to operationalize this and commercially develop a product are um if i had like uh if i was building a company here like one thing that i think it's useful to just remain agile like the corset of infrastructure is consistently useful like a crawler is consistently useful and then one day you can figure out how to expose this better um but i i like empathize with the difficulty of like it's really hard to know what works for a bunch of agent companies and my suspicion is like the most successful agent frameworks will come from the most successful of these agent companies that solve these problems in-house for themselves and then operationalize this externally like it's some version of like react is really useful because react was like well adopted at facebook for a while um i think we can say that there are like missing components in the ecosystem where that if there was a default lots of agent developers would use it right um and so like identity and access management is a big problem um uh like if you could make agent development feel more like traditional software development i think a lot of people would use that and be like oh like you know it magically retries until it gets something and then it gives me like data back about how well it's working like things that like it's i think it's pretty easy to actually imagine the utilities in the abstract that would be useful to the ecosystem and then um the entire environment is fluid right and so um uh do you need like if you think about other things in infrastructure like will more workloads need vector indices yes like what is the shape of company that gets to be durable here like we don't know yet um and we'll keep looking at it but as pranav said i think we look to the handful of companies in our portfolio that are agents working at some scale and um and like look for the patterns there versus try to intuit it right now my cash hit was wrong i should have updated it's a it's a dozen not a small number it's been a long six months guys yeah uh i think one last question and there's a whole bunch of online stuff you won't get to but um yeah mark okay um it seems like there should be more consumer companies like why why aren't there or is it just a matter of time i think simply matter of time like we uh we keep bringing people into embed we keep looking i i think the uh i genuinely this is not a um a a knock on the research community or the really young set of founders that like i think focused on ai companies um first but the diffusion of innovation curve that applies to customers i think also applies to entrepreneurs um researchers saw the capability first and they're like like we should do something with this this is going to be amazing and it's like that will continue to happen like our portfolio is heavily overrepresented with with people from the research community pushing the pushing the state of the art with creative technical ideas um uh i think young very young people also were quite early to ai because they're like oh of course like this makes sense i've never seen other technology like chachi pt all the way um and their opportunity cost is lower than like you're the best product person at an amazing product organization like you have to leave your job to start a new company uh and it's been a 
And it's been a really long two years; I feel like that's just starting to happen, where some of the talent that has that intuition shows up. Maybe it's just the next Zuck, some dropout who figures out the pattern of social interaction and is really AI-native about this stuff. I also think there's a chance that some of the people who have built intuition for consumer adoption and consumer interfaces are just taking a little while to also build intuition for AI products, and now they're showing up, starting companies, and experimenting. So we have a lot of confidence it's going to happen over the next few years; it's just a matter of time.

Okay, I think we're out of time, I'm just trying to defer to Sean here, but thank you so much, and please call us. Sarah and Pranav will be sticking around, so you can ask questions outside or do whatever you want networking-wise, but we're going to move on in our schedule. We have a ton of papers we want to cover; this is basically Paper Club live, and I think Isaac and Peter are up next. When people signed up we actually asked what you wanted to cover, and the top votes were vision, open models, and post-transformers, which are all coming later. We also added reasoning, because I didn't even have it as an option, and I thought: what am I doing running a paper review session this year without talking about reasoning and test-time compute? But first we're going to have vision. Roboflow has been really great friends with Latent Space; we've had Joseph Nelson on twice, along with Facebook, talking about all the innovations in vision. But it's not only about segmentation: there's been a lot of foundation model progress this year in both the large space and the very small space, so we're also very proud to have Vik update us on Moondream, which he's been hacking away on for the past year, a very short amount of time. Are you guys ready? Are you plugged in? Sarah, Pranav, do you want to take questions? There are people who want to talk to you, what's your availability? Okay, good, awesome.

Just plug in on the white thing, exactly. Do you have sound? No sound needed. Stay close to the mic. Are they mic'd up? Nice. Man, I was hoping to use speaker notes, that's not going to work. You could do a mirroring thing: Settings, Display, there you go, thank you. Are you on Zoom? I'm sending an email. Super pumped to be here, this is so cool. Both of us relied on your vision capabilities. So the screen share is for the live stream and also the recording we're doing later; you just share your screen and mute yourself, we've got the audio, you just want to capture and share the screen you actually want people to see. That one, the one with the image. That's the speaker view, you don't want to share the speaker view, you want to share this one. Double click on it, you're good. Okay, all right, figuring things out. Now where did the presentation go?
You can do the triple display, there you go. Let's pick the thing. Is it up there? No, let's kill this. How do I exit out of this? Apologies, technical difficulties. Okay, we're going to drag this up, make it full screen and call it good. Okay.

Hi, we're Isaac and Peter from Roboflow, and we're going to talk about the best papers of 2024 in computer vision. We define "best" as what made the biggest shifts in the space, and to determine that we looked at the major trends that happened and which papers most contributed to them. I'm going to talk about a couple of trends, Peter's going to talk about a trend, and then we're going to hand it off to Moondream. The trends I'm interested in are the major transition from models that run on a per-image basis to models that apply the same basic ideas to video, and how DETRs are starting to take over the real-time object detection scene from the YOLOs, which have been dominant for years. As a highlight, we're going to talk about Sora, which from my perspective is the biggest paper of 2024 even though it came out in February. Sora is just a blog post, so I'm going to fill it in with details from replication efforts, including Open-Sora, and related work such as Stable Video Diffusion. Then we'll talk about SAM 2, which applies the SAM strategy to video, and then the improvements to DETRs in 2024 that are making them a Pareto improvement over YOLO-based models.

To start, here's the state of the art of video generation at the end of 2023: MAGVIT. MAGVIT is a discrete-token video tokenizer, akin to VQ-GAN but applied to video sequences, and it actually outperforms state-of-the-art handcrafted video compression frameworks in terms of bit rate versus human preference for quality; video is generated by autoregressing on these discrete tokens. It generates some pretty nice stuff, but only up to about five seconds and not super detailed. And then suddenly, a few months later, we have this, which when I saw it was totally mind-blowing: 1080p, a whole minute long, light reflecting in puddles. It reminds me of those RTX demonstrations for next-generation video games like Cyberpunk, but with better graphics. You can see some issues in the background if you look closely, but as with a lot of these models, the issues tend to be things people won't notice unless they're looking for them, the same way six fingers on a hand isn't a giveaway unless you're looking for it. So, as we said, Sora does not have a paper, so we're going to fill in the picture with context from the rest of the computer vision scene attempting to replicate it. The first step: you have an LLM caption a huge amount of videos. This is a trick introduced in DALL-E 3, where they train an image captioning model to generate very high-quality captions for a huge corpus and then train a diffusion model on that. Sora and the replication efforts also show a bunch of other steps that are necessary for good video generation, including filtering by aesthetic score and filtering to make sure the videos have enough motion, so the generator doesn't just learn to produce static frames.
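Neither Sora nor the talk gives code for that filtering step, but a minimal sketch of the kind of filter the Open-Sora-style pipelines describe might look like the following. The `aesthetic_score` callable, the frame-differencing motion measure, and both thresholds are illustrative assumptions, not anything from Sora or Open-Sora itself:

```python
# Hypothetical sketch of the "filter by aesthetic score and motion" step described
# above. Thresholds and the aesthetic_score() callable are illustrative assumptions.
import cv2
import numpy as np

def mean_frame_motion(video_path: str, stride: int = 5) -> float:
    """Average absolute pixel difference between frames `stride` apart."""
    cap = cv2.VideoCapture(video_path)
    prev, diffs, idx = None, [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
            if prev is not None:
                diffs.append(np.abs(gray - prev).mean())
            prev = gray
        idx += 1
    cap.release()
    return float(np.mean(diffs)) if diffs else 0.0

def keep_clip(video_path: str, aesthetic_score, min_aesthetic=5.0, min_motion=2.0) -> bool:
    """Keep clips that look good AND actually move (drop near-static clips)."""
    if aesthetic_score(video_path) < min_aesthetic:  # e.g. a LAION-style aesthetic predictor
        return False
    return mean_frame_motion(video_path) >= min_motion
```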
Then we encode the video into a series of space-time latents. Once again, Sora is very sparse on details, so the replication work fills things in: Open-Sora actually uses a MAGVIT-v2 itself to do this, but swaps the discretization step for a classic VAE autoencoder framework, and they show a lot of benefit from getting the temporal compression, which makes sense since sequential video frames contain mostly redundant information. By compressing in the temporal dimension, you allow the latent to hold a lot more semantic information while avoiding that duplication. So we've got space-time latents, possibly via some 3D VAE, presumably a MAGVIT-v2, and then you throw them into a diffusion transformer. I think it's interesting that Open-Sora uses MAGVIT-v2, which originally used an autoregressive transformer decoder to model the latent space, but now uses a diffusion transformer; it's still a transformer either way, the question is just whether it's parameterizing the stochastic differential equation or parameterizing a conditional distribution via autoregression. It's also worth noting that most of the very high-performance diffusion models today are switching away from the classic DDPM (denoising diffusion probabilistic models) framework to rectified flows. Rectified flows have a very interesting property: as they converge, they get closer to being sampleable in a single step, which means in practice you can generate high-quality samples much faster. A major problem with DDPM and related models over the past four years is that they require many, many steps to generate high-quality samples. And naturally the third step is throwing lots of compute at the problem. I never figured out how to get this video to loop, but we see very little compute, medium compute, lots of compute. This is interesting because the original diffusion transformer paper from Facebook showed that the specific hyperparameters of the transformer didn't matter much; what mattered was increasing the amount of compute the model had. I love how, once again in a little blog post, they don't even talk about specific hyperparameters: they say we're using a diffusion transformer, we're throwing more compute at it, and this is what happens. Open-Sora shows similar results. The primary issue is that no one else has a 32x compute budget, so most of the related work ends up in the middle of the domain, which is still super cool, just a little disappointing considering the context. So I think this is a beautiful extension of the framework introduced in 2022 and 2023 for very high-quality per-image generation, extended to video. It's awesome, and it's GA as of Monday, except no one can seem to get access because they keep shutting down the login.
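To make the rectified-flow point concrete: the appeal is that the model learns a roughly straight velocity field from noise to data, so a handful of Euler steps can give a usable sample. A minimal, generic sketch of the training target and sampler, assuming a stand-in `model(x_t, t, cond)` that predicts velocity (this is the standard rectified-flow formulation, not Sora's or Open-Sora's actual code):

```python
# Minimal rectified-flow sketch (illustrative, not any specific model's recipe).
# `model(x_t, t, cond)` is assumed to predict the velocity (x1 - x0).
import torch

def rectified_flow_loss(model, x1, cond):
    """x1: clean latents; x0: Gaussian noise. Train the model to predict x1 - x0
    at a random point on the straight line between them."""
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)), device=x1.device)
    x_t = (1 - t) * x0 + t * x1          # linear interpolation path
    v_target = x1 - x0                   # constant velocity along that path
    v_pred = model(x_t, t, cond)
    return torch.mean((v_pred - v_target) ** 2)

@torch.no_grad()
def sample(model, shape, cond, steps=4, device="cuda"):
    """Few-step Euler integration; the straighter the learned flow, the fewer steps needed."""
    x = torch.randn(shape, device=device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0], *([1] * (len(shape) - 1))), i * dt, device=device)
        x = x + model(x, t, cond) * dt
    return x
```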
The next paper I want to talk about is SAM. At Roboflow we let users label data and train models on that data, and SAM has saved our users 75 years of labeling time; to the best of my knowledge, we're the largest SAM API that exists. SAM also lets our users train pure bounding-box regression models and use those to generate high-quality masks, which has the great side effect of requiring less training data to reach meaningful convergence. Most people are data-limited in the real world, so anything that needs less data to get to something useful is super useful. Many of our users run their per-frame object detectors on every frame in a video, and SAM 2 falls into this category of taking something that really works and applying it to video, which has the wonderful benefit of being plug-and-play with many of our users' use cases. We're still building out a sufficiently mature pipeline to take advantage of that, but it's in the works. Here's a great example: we can click on cells and then follow them; you'll notice a cell goes away and comes back and we can still keep track of it, which is very challenging for existing object trackers. High-level overview of how SAM 2 works: there's a simple pipeline where we provide some type of prompt and it fills out the likely masks for that object throughout the rest of the video. Here we're giving a bounding box in the first frame, a set of positive/negative points, or even just a simple mask. I'll assume people are somewhat familiar with SAM, so just a high-level overview: you have an image encoder that runs on every frame. SAM 2 can be used on a single image, in which case the only difference between SAM 2 and SAM is the image encoder: SAM used a standard ViT, and SAM 2 replaced it with a Hiera hierarchical encoder, which gets approximately the same results but about six times faster inference, which is excellent, especially considering how much of the trend in 2023 was replacing the ViT with more efficient backbones. In the video segmentation case, the difference is that you create a memory bank and cross-attend the features from the image encoder against that memory bank. The feature set that's created takes the features from the past couple of frames, plus a set of object pointers and the set of prompts, and uses that to generate the new masks; then we fuse the new masks for this frame with the image features and add that to the memory bank (I'll say more in a minute). Just like SAM, SAM 2 uses a data engine to create its dataset: they assembled a huge amount of reference data, used people to label some of it, trained the model, used the model to label more, asked people to refine the model's predictions, and ultimately the dataset is created from the final output of the model on the reference data. This paradigm is so interesting to me because it unifies a model and a dataset in a way that is very unique; it seems unlikely that another model could come in and have such a tight relationship with the training set.
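Here's a rough pseudocode sketch of the per-frame loop just described: memory-conditioned mask prediction plus a memory-bank update. All names (encoder, memory_attention, mask_decoder, memory_encoder) are hypothetical stand-ins for illustration, not the actual SAM 2 API:

```python
# Rough sketch of a SAM 2-style video loop as described above. All component
# names are hypothetical stand-ins, not the real SAM 2 code.
from collections import deque

def track_object(frames, prompts, encoder, memory_attention, mask_decoder,
                 memory_encoder, max_memories=7):
    """prompts: dict of frame_idx -> user prompt (box / points / mask), may be sparse."""
    memory_bank = deque(maxlen=max_memories)   # FIFO of recent frame memories
    object_pointers = []                       # compact summaries of the object so far
    masks = []
    for t, frame in enumerate(frames):
        feats = encoder(frame)                 # image encoder runs on every frame
        # Cross-attend current features to recent memories, pointers, and any prompt
        conditioned = memory_attention(feats, list(memory_bank),
                                       object_pointers, prompts.get(t))
        mask, pointer = mask_decoder(conditioned)
        masks.append(mask)
        object_pointers.append(pointer)
        # Fuse the new mask with the image features and push into the bank
        memory_bank.append(memory_encoder(feats, mask))
    return masks
```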
Brief overview of how the memory bank works; the paper didn't have a great visual, so I'm going to fill in a bit. We take the last couple of frames from the video and attend to them, along with the set of prompts we provided, which could come from the future or from anywhere in the video, as well as reference object pointers saying "here's what we've found so far." Attending to the last few frames has the interesting benefit of letting the model handle complex object motion, and by limiting the number of frames you attend to, you keep the model running in real time. This is such an interesting topic for me, because one would assume that attending to all of the frames, or some summarization of all the frames, is essential for high performance, but we see in their later ablation that that is actually not the case. Just to make sure there's some benchmarking happening: comparing against prior work, the SAM 2 strategy does indeed improve on the state of the art. This ablation deep in the appendix was super interesting to me: section (c), the number of memories. One would assume that increasing the count of memories would meaningfully increase performance, and it has some impact, but not the kind you'd expect, and it meaningfully decreases speed, which in my mind justifies just having this FIFO queue of memories. Although in the future I'm super interested to see a more dedicated summarization of the whole video, not just a stacking of the last frames. So that's another extension of beautiful per-frame work into the video domain.

The next trend I'm interested in: at Roboflow we're super interested in training real-time object detectors, they're our bread and butter, so we do a lot to keep track of what's actually happening in that space, and we're finally starting to see something change. For years, YOLOs have been the dominant way of doing real-time object detection, and you can see here that they've essentially stagnated: the performance between YOLOv10 and v11 is not meaningfully different, at least in this kind of high-level chart, and even across the last couple of series there's no major change. YOLOs hit a plateau; DETRs have not. You can see the YOLO series has this plateau, and then RT-DETR, LW-DETR, and D-FINE have meaningfully moved it, so that the best D-FINE models are +4.6 AP on COCO at the same latency. Three major steps got us here. First, RT-DETR, which is technically a 2023 preprint but was published officially in 2024, so I'm going to include it (I hope that's okay), showed that we could actually match or out-speed YOLOs. Then LW-DETR showed that pre-training is hugely effective on DETRs and much less so on YOLOs. And then D-FINE added the kinds of bells and whistles we expect from this arena. The major improvement RT-DETR introduced was taking the multi-scale features that DETRs typically pass into their encoder and decoupling them into a much more efficient transformer encoder; the transformer is of course quadratic complexity, so decreasing the amount of stuff you pass in at once is super helpful for increasing throughput. That change basically brought us up to YOLO speed. They also did a hardcore analysis of benchmarking YOLOs including the NMS step: once you include NMS in the latency calculation, you see that these DETRs were in fact outperforming the YOLOs that existed at the time.
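One way to reproduce that "include NMS in the latency" comparison is to time the post-processing inside the same loop as the forward pass. The harness below is a generic sketch, not RT-DETR's actual benchmarking code; `yolo_model` is a placeholder that returns raw per-image boxes and scores still needing NMS, while `torchvision.ops.nms` is a real call:

```python
# Sketch of end-to-end latency measurement that counts NMS, in the spirit of the
# RT-DETR benchmarking discussed above. Single-image batches for simplicity.
import time
import torch
from torchvision.ops import nms

@torch.no_grad()
def end_to_end_latency_ms(yolo_model, image, iou_thresh=0.65, warmup=10, iters=100):
    device = next(yolo_model.parameters()).device
    image = image.to(device)
    for _ in range(warmup):
        boxes, scores = yolo_model(image)
        nms(boxes, scores, iou_thresh)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        boxes, scores = yolo_model(image)
        keep = nms(boxes, scores, iou_thresh)   # post-processing counted too
        _ = boxes[keep]
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1000
```

A DETR-style model with one-to-one matching would simply skip the `nms` call, which is exactly why the comparison changes once post-processing is included.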
Then LW-DETR comes in and suggests that the huge boost here is from pre-training: this is the D-FINE line, and this is the D-FINE line without pre-training; it's within range, still an improvement over the YOLOs, but the really huge boost comes from pre-training. When YOLOX came out in 2021, they showed much better results with a much longer training time, but they found that when they did that, they no longer benefited from pre-training. You can see in this graph from LW-DETR that YOLOs do get a real benefit from pre-training, but it goes away as training time increases, whereas the DETRs converge much faster: LW-DETR trains for only 50 epochs, RT-DETR for 60. So one could assume that the extra gain from pre-training comes from not destroying your original weights with a long training cycle. LW-DETR also shows superior performance on our favorite dataset, Roboflow 100, which means it does better in the real world, not just on COCO. Then D-FINE throws all the bells and whistles at it: YOLO models tend to have a lot of very specific, complicated loss functions, and D-FINE brings that into the DETR world and shows consistent improvement across a variety of DETR-based frameworks. Bring these all together, and suddenly we have almost 60 AP on COCO while running in about 10 milliseconds. Huge stuff. We're spending a lot of time trying to build models that work better with less data, and DETRs are clearly becoming a promising step in that direction. What we're interested in seeing next: Co-DETR and the models currently sitting at the top of the leaderboard for large-scale inference scale really well as you switch out the backbone, so we're very interested in seeing someone publish a paper, potentially us, on what happens if you take these real-time models and throw a Swin-G at them: is there a Pareto curve that extends from the real-time domain all the way up to the super slow but high-performance domain? We also want to see people benchmarking on RF100 more, because that kind of data is what's relevant for most users, and we want to see more pre-training, because pre-training works now. It's super cool.

All right, in that theme, one of the big things we're focusing on is how to get more out of our pre-trained models, and one lens to look at this through is the new requirement for fine-grained visual detail in the representations extracted from your foundation model. As a hook for this (and this is just a list of all the papers I'm going to mention, so you can find them later): the big hook is that I claim LLMs can't see. If you go to Claude or ChatGPT and ask it to look at this watch and tell me what time it is, it fails. You could say this is a very classic LLM gotcha, and maybe the image is just too zoomed out, and it would do better if we increased the resolution so it has an easier time finding fine-grained features like where the watch hands are pointing. No dice. And you could say, okay, maybe the model just doesn't know how to tell time from the position of the hands.
But if you prompt it textually, it's very easy for it to tell the time, so to me this is proof that these LLMs literally cannot see the position of the watch hands. The question is why. And for the Anthropic heads out there: Claude fails too. So my first pick for best paper of 2024 in vision is this MMVP paper, which investigates why LLMs can't see fine-grained details. For instance, it comes up with a lot of images like this where you ask a question that seems visually obvious to us, like which way is the school bus facing, and the model gets it wrong, and then of course makes up details to support its wrong claim. The process by which the paper finds these images is contained in its hypothesis for why the models can't see these details: it hypothesizes that models initialized with CLIP as their vision encoder lack fine-grained detail in the features extracted by CLIP, because CLIP doesn't need fine-grained details to do its job correctly, which is just to match captions and images. And at a high level, even if ChatGPT wasn't initialized with CLIP and the vision encoder wasn't trained contrastively at all, in order to do its job of captioning the image it could still do a pretty good job without actually finding the exact position of every object and visual feature in the image. So this paper finds a set of difficult images for these models, and the way it does it is by looking for embeddings that are similar in CLIP space but far apart in DINOv2 space. DINOv2 is a foundation model trained self-supervised purely on image data; it uses a somewhat complex student-teacher framework, but essentially it masks or crops certain areas of the image and tries to make sure those have consistent representations, which is a way to learn very fine-grained visual features. So if you take things that are very close in CLIP space and very far in DINOv2 space, you get pairs of images that are hard for ChatGPT and other big language models to distinguish. If you then ask questions about these images, as you can see from this chart, the model answers the same way for both images, because from the perspective of the vision encoder they're the same image. So if you ask "how many eyes does this animal have," it answers the same for both, and all these other models, including LLaVA, do the same thing. That's the benchmark they create: finding CLIP-blind pairs, pairs of images that are similar in CLIP space, and building a dataset of multiple-choice questions from them. And how do these models do? Really badly. ChatGPT and Gemini do a little better than random guessing, but at about half the performance of humans, who find these problems very easy. LLaVA is, interestingly, extremely negatively correlated with this dataset: it does much, much worse than random guessing, which means this process has done a very good job of identifying hard images for LLaVA specifically, and that's because LLaVA is basically not trained for very long and is initialized from CLIP, so you would expect it to do poorly on this dataset.
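A minimal sketch of that pair-mining idea, high CLIP similarity plus low DINOv2 similarity, might look like the following. The thresholds and the assumption that you already have L2-normalized embeddings for a pool of images are mine, not the paper's exact procedure:

```python
# Sketch of mining "CLIP-blind" pairs as described above: images whose CLIP
# embeddings are nearly identical but whose DINOv2 embeddings are far apart.
# Thresholds are illustrative; embeddings are assumed L2-normalized, shape [N, D].
import torch

def find_clip_blind_pairs(clip_embs, dino_embs, clip_min=0.95, dino_max=0.6, top_k=100):
    clip_sim = clip_embs @ clip_embs.T          # cosine similarity (normalized inputs)
    dino_sim = dino_embs @ dino_embs.T
    n = clip_sim.shape[0]
    iu = torch.triu_indices(n, n, offset=1)     # unique unordered pairs
    cs, ds = clip_sim[iu[0], iu[1]], dino_sim[iu[0], iu[1]]
    mask = (cs >= clip_min) & (ds <= dino_max)
    gap = (cs - ds)[mask]                       # how "blind" CLIP is relative to DINOv2
    order = torch.argsort(gap, descending=True)[:top_k]
    pairs = torch.stack([iu[0][mask], iu[1][mask]], dim=1)[order]
    return pairs  # indices of image pairs to turn into multiple-choice questions
```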
One of the proposed solutions this paper attempts is basically: okay, if CLIP features aren't enough, what if we also train the visual encoder of the language model on DINO features? It proposes two ways of doing this: one additive, which is basically interpolating between the two feature sets, and one interleaving, which is training on the combination of both. There's a really interesting trend with the additive mixture of features, where zero is all CLIP features and one is all DINOv2 features. I think it's helpful to look at the rightmost chart first: as you increase the share of DINOv2 features, the model does worse and worse on the actual language modeling task, and that's because DINOv2 features were trained completely self-supervised, completely in image space, and know nothing about text. These features aren't really compatible with text models; you can train an adapter all you want, but it seems to be such an alien language that it's a very hard optimization problem. That supports what's happening on the left: the model gets better at answering these questions as you include more DINOv2 features, up to a point, but once you oversaturate, it completely loses its ability to do language tasks. With interleaving, they essentially double the number of tokens going into the model and train on both, and it still doesn't really solve the MMVP task: it gets LLaVA-1.5 a little above random guessing, but still not close to ChatGPT or human performance. So clearly this proposed solution of just using DINOv2 features directly isn't going to work, and basically that means DINOv2, as a vision foundation model, is going to be insufficient for language tasks.

So my next pick for best paper of 2024 is Florence-2, which tries to solve this problem by incorporating not only this dimension of spatial hierarchy, which is to say pixel-level understanding, but also what they call semantic granularity. The goal is to have features that are sufficient for finding objects in the image, so they carry enough pixel information, but that can also be talked about and reasoned about; that's the semantic granularity axis. Here's an example of the three different labeling paradigms they use to create a big dataset. One is text, which is just captioning, and you'd expect a model trained only on captioning to behave like ChatGPT: no spatial hierarchy, no features that are meaningful at the pixel level. So they add another type, region-text pairs, which is essentially either classifying a region, doing object detection or instance segmentation on that region, or captioning that region. And then they have text-phrase-region annotations, which are essentially triples: not only do you have a region you've described, you also find its place in a descriptive paragraph about the image, which tries to introduce even more semantic understanding of these regions.
For instance, if you say "a woman riding on the road," you have to know what a woman is, what the road is, and that she's on top of it; you're composing a bunch of objects in visual space but also thinking about them semantically. The way they do this is they dump features from a vision encoder straight into an encoder-decoder transformer and then train a bunch of different tasks, like object detection and so on, as language tasks. I think that's one of the big things we saw in 2024: these vision-language models operating on pixel space linguistically, so they introduce a bunch of new tokens to point to locations in pixel space. How does it actually do? If you look at the graph on the right, which uses the DINO detection framework, pre-trained Florence-2 models transfer very well: they get 60% mAP on COCO, which is approaching state of the art, and they train much more efficiently, converging a lot faster. Both of those things point to the fact that they're actually leveraging their pre-trained weights effectively. So where does it fall short? I forgot to mention, Florence-2 comes in 0.2-billion and 0.7-billion parameter versions, so these are very, very small language models, and I think you can see saturation in this framework. What this graph shows is that if you train a Florence-2 model purely on the image-level and region-level annotations, without the pixel-level annotations like segmentation, it actually performs better as an object detector, and what that means is that it can't learn all the visual tasks it's trying to learn, because it doesn't have enough capacity. I'd like to see this paper explore larger model sizes, which brings us to our next big papers of 2024: PaliGemma, which came out earlier this year, and PaliGemma 2, released I think a week or two ago. (I forgot to mention: you can label datasets on Roboflow and train a Florence-2 model, and you can now train a PaliGemma 2 model on Roboflow too, which we got into the platform within about 14 hours of release, which I was really excited about.) PaliGemma is essentially doing the same thing, but instead of an encoder-decoder it dumps everything into a decoder-only transformer, and it also introduced the concept of location tokens to point to objects in pixel space. PaliGemma uses Gemma as the language model, specifically Gemma 2B; PaliGemma 2 introduces multiple sizes of language model. The way they get around having to do encoder-decoder is the concept of prefix loss, which basically means that when the model generates tokens autoregressively, all the tokens in the prefix, which is the image it's looking at plus a description of the task it's trying to do, attend to each other with full attention, which makes it easier for the prefix to color the output of the suffix and to find features easily.
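Here's a small sketch of that prefix-LM attention idea: image and task-prompt tokens attend bidirectionally, while generated suffix tokens attend causally. This is the generic prefix-LM mask construction, sketched for illustration rather than lifted from the PaliGemma codebase:

```python
# Sketch of a prefix-LM attention mask as described above: full (bidirectional)
# attention within the prefix (image tokens + task prompt), causal attention for
# the generated suffix. Generic construction, not PaliGemma's actual code.
import torch

def prefix_lm_mask(prefix_len: int, total_len: int) -> torch.Tensor:
    """Return a [total_len, total_len] boolean mask; True = may attend."""
    mask = torch.tril(torch.ones(total_len, total_len, dtype=torch.bool))  # causal base
    mask[:prefix_len, :prefix_len] = True   # prefix tokens see each other fully
    return mask

# e.g. 256 image tokens + 16 prompt tokens as the prefix, 32 generated tokens
m = prefix_lm_mask(prefix_len=272, total_len=304)
```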
Here's an example of one of the tasks it was trained on: you describe the task in English, you ask it to segment two classes of objects, and it finds their locations using these location tokens and their masks using an encoding of the masks into tokens. One of my critiques of PaliGemma 1, at least, is that performance saturates as a pre-trained model after only 300 million examples seen. Each blue dot in this graph is performance on some downstream task, and you can see that after 300 million examples it does roughly as well across all the downstream tasks they tried (which was a lot) as it does at 1 billion examples, which to me also suggests a lack of capacity. For PaliGemma 2, you can see the results on object detection, transferred to COCO, and this also points to increased capacity being helpful: as both the resolution and the parameter count of the language model increase, performance increases. Resolution makes sense, obviously, it helps to find small objects in the image, but it also makes sense for another reason: it kind of gives the model a thinking register, more tokens to process when making its predictions. You could say 43.6 isn't that great, Florence-2 got 60, but this is not training a DINO or a DETR head on top of the image encoder; it's doing the raw language modeling task on COCO, so it doesn't have any of the bells and whistles, any of the fancy losses, it doesn't even have bipartite graph matching or anything like that. The big result, and one of the reasons I was really excited about this paper, is that they blow everything else away on MMVP: 47.3. Sure, that's nowhere near human accuracy, which again is 94%, but for a two-billion-parameter language model to beat ChatGPT, that's quite the achievement. And that brings us to our final pick for paper of the year, which is AIMv2. AIMv2 says: maybe coming up with all these specific annotations to find features with high fidelity in pixel space isn't actually necessary, and we can come up with an even simpler, more beautiful idea for combining image tokens and text tokens in a way that's interfaceable for language tasks. This is nice because it scales: you can get lots more data if you don't have to come up with all these annotations. The way it works is very similar to PaliGemma: you have a vision encoder that dumps image tokens into a decoder-only transformer, but the interesting thing is that it also autoregressively tries to predict the image tokens under a mean-squared-error loss. So instead of fancy object detection or segmentation labels, you just try to reconstruct the image and have it learn fine-grained features that way. And it does this in, I think, a beautiful way that's compatible with the PaliGemma line of thinking: randomly sampling a prefix length and using only that number of image tokens as the prefix. The causal prefix is the attention mask on the right: it does full block attention over some randomly sampled number of image tokens, and then reconstructs the rest of the image and the downstream caption for that image.
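A compact sketch of that objective, sample a random image-token prefix with full attention, then regress the remaining image tokens with MSE and predict the caption with cross-entropy, might look like this. The decoder interface, names, and loss weighting are assumptions for illustration, not the paper's implementation:

```python
# Sketch of an AIMv2-style objective as described above: random image-token prefix
# with full attention, MSE on the remaining image tokens, cross-entropy on the caption.
import torch
import torch.nn.functional as F

def aimv2_style_loss(decoder, image_tokens, text_ids, mse_weight=1.0):
    """image_tokens: [B, N, D] continuous patch embeddings; text_ids: [B, T].
    `decoder` is assumed to return per-position patch predictions and text logits,
    building its prefix-style attention mask internally from prefix_len."""
    B, N, _ = image_tokens.shape
    prefix_len = int(torch.randint(1, N, (1,)))             # random prefix length
    patch_pred, text_logits = decoder(image_tokens, text_ids, prefix_len=prefix_len)
    # Regress the image tokens that come *after* the prefix
    mse = F.mse_loss(patch_pred[:, prefix_len:], image_tokens[:, prefix_len:])
    # Next-token prediction on the caption
    ce = F.cross_entropy(text_logits[:, :-1].reshape(-1, text_logits.size(-1)),
                         text_ids[:, 1:].reshape(-1))
    return mse_weight * mse + ce
```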
The dataset they train on is internet-scale, very high-quality data created essentially by the Data Filtering Networks paper, which is maybe the best CLIP-style data that exists. And we can see this is finally a model that doesn't saturate: even at the highest parameter count it appears to keep improving with more samples seen, so you can imagine that if we just keep bumping the parameter count and increasing the examples seen, which is the line of thinking for language models, it'll keep getting better. It also improves with resolution (this is ImageNet classification accuracy), which you'd expect from a model that's actually leveraging and finding fine-grained visual features. How does it do compared to CLIP on COCO? If you slap a transformer detection head on it and train on COCO, it gets 60.2, which is within spitting distance of SOTA, so it does a very good job of finding visual features. But you could say: wait a second, CLIP got 59.1, so how does this prove your claim at all? Doesn't that mean CLIP, which is known to be CLIP-blind and do badly on MMVP, is able to achieve very high performance on this fine-grained visual-feature task of object detection? Well, they train on tons of data: Objects365, COCO, Flickr, and everything else. So I think this benchmark doesn't do a great job of selling how good a pre-trained model AIMv2 is, and we'd like to see performance with fewer examples, not trained to convergence on object detection; seeing it in the real world on a dataset like Roboflow 100 would be quite interesting. And our final, final pick for paper of 2024 would be Moondream, so let me introduce Vik to talk about that.

That was exactly what I was looking for, best of 2024, amazing job. Are there any other questions on vision while Vik gets set up?

Hi, over here. Thanks for the really awesome talk. One of the things that's been weird and surprising is that the foundation model companies, even these multimodal LLMs, are still just worse than RT-DETR at detection. If you wanted to pay a bunch of money to auto-label your detection dataset, giving it to OpenAI or Claude would be a big waste, and even PaliGemma 2 is worse. So I'm curious to hear your thoughts on why nobody's cracked the code on a generalist that really beats a specialist model in computer vision, like they have in LLM land.

Can you hear me okay? Very interesting question. I think it depends on the specific domain. For image classification it's basically there: AIMv2 showed a simple attention probe on the pre-trained features gets around 90%, which is as good as anyone does. The bigger question is why it isn't transferring to object detection, especially real-time object detection. In my mind there are two answers. One is that object detection architectures are really, really domain-specific.
We see all these super complicated architectures, and it's not easy to build something that just transfers naturally, whereas for image classification, CLIP pre-training transfers super easily. The other thing is that until recently, real-time object detectors didn't really benefit from pre-training: you see the YOLOs, which are essentially saturated, showing very little difference from using a pre-trained model at all. So it's not surprising that people haven't been looking at the effects of better and better pre-training on real-time detection; maybe that'll change in the next year. Does that answer your question? One thing I want to add, just to summarize: until 2024 we hadn't really seen a combination of transformer-based object detectors and fancy losses, and PaliGemma suffers from the same problem. The ResNets, the convolutional models, have all these extreme optimizations for doing object detection, but I think it's been shown now that convolutional models just don't benefit from pre-training and just don't have the level of intelligence that transformer models do.

Hi, can you hear me? Cool. Are you sharing your screen? I might have forgotten to do that, let me do that, sorry. Uh oh, classic. You might have to quit Zoom and restart. It's fine, we have a capture of your screen, I'll just make sure it's visible. There you go, perfect.

Hi everyone, my name is Vik. I've been working on Moondream for almost a year now. Like Sean mentioned, I just went and looked, and it turns out the first version I released was December 29, 2023. It's been a fascinating journey. Moondream started off as a tiny vision language model; since then we've expanded scope a little bit to also build tooling, client libraries, and so on, to help people really deploy it. Unlike traditional large models that are focused on assistant-type use cases, we're laser-focused on building capabilities that developers can use to build vision applications that can run anywhere. In a lot of cases for vision, more so than for text, you really care about being able to run on the edge, run in real time, and so on. We support different output modalities. There's query, where you can ask general English questions about an image and get back human-like answers. There's captioning, which a lot of our users use for generating synthetic datasets to then train diffusion models and whatnot; we've done a lot of work to minimize hallucinations there, so that gets used a lot. We have open-vocabulary object detection built in, similar to a couple of more recent models like PaliGemma, where rather than having to train a dedicated model you can just say "show me soccer balls in this image" or "are there any deer in this image." And more recently, earlier this month, we released pointing capability, where if all you're interested in is the center of an object, you can just ask it to point out where that is, which is very useful when you're doing UI automation type stuff.
We have two models out right now. There's a general-purpose 2B parameter model, which is fine if you're running on a server, good for our LocalLLaMA desktop friends, and you can run it on flagship mobile phones, but it never really fulfilled the promise of being able to run anywhere. Last week we released a new 0.5B parameter model, which should be seen more as a distillation target than a general-purpose model. It's very good if you're running on older mobile phones or edge devices; it uses less memory, even with our not-yet-fully-optimized inference client. The way we built the 0.5B model was to start with the two-billion-parameter model and prune it while doing continual training to retain performance. Our objective during pruning was to preserve accuracy across a broad set of benchmarks. The way we went about it was to estimate the importance of different components of the model, like attention heads, channels, MLP rows and so on, using a technique based on the gradient (I'm not sure how much detail people want; we'll be writing a paper about this, but feel free to grab me if you have questions). Then we iteratively prune a small chunk that minimizes the loss in performance, retrain the model to recover performance, and repeat. The 0.5B we released is more of a proof of concept that this is possible. The thing that's really exciting about it is that it lets developers build using the 2B parameter model, explore, build their application, and then, once they're ready to deploy, figure out exactly what they need out of the model and prune those capabilities into a smaller form factor that makes sense for their deployment target. So yeah, very excited about that.
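Vik says the details are coming in a paper, so the following is only a generic sketch of one common gradient-based (first-order Taylor) importance score plus an iterative prune-and-retrain loop; it is not Moondream's actual procedure, and all names are hypothetical:

```python
# Generic sketch of gradient-based pruning in the spirit described above:
# estimate importance, prune a small chunk, retrain, repeat. This is a common
# first-order Taylor heuristic, NOT Moondream's actual method.
import torch

def taylor_importance(model, calib_batches, loss_fn):
    """Score each parameter by |weight * grad| accumulated over a small
    calibration set; higher score = more important."""
    scores = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for batch in calib_batches:
        model.zero_grad()
        loss_fn(model, batch).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                scores[n] += (p.detach() * p.grad.detach()).abs()
    return scores

def prune_step(model, scores, frac=0.05):
    """Zero out the least-important `frac` of weights in each tensor. A structured
    variant would drop whole attention heads / MLP rows instead of single weights."""
    with torch.no_grad():
        for n, p in model.named_parameters():
            s = scores[n].flatten().float().sort().values
            thr = s[int(frac * s.numel())]
            p.mul_((scores[n] > thr).to(p.dtype))
# After each prune_step, briefly retrain ("recover") before pruning again.
```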
Let me talk a little bit about another problem I've been working on recently, which is similar to the clocks example we've been discussing. We had a customer reach out who had a bunch of gauges out in the field. This is very common in manufacturing and oil and gas, where you have a bunch of analog devices you need to monitor; it's expensive to have humans watch them and make sure the system gets shut down when the temperature goes over 80 or whatever. I was like, this seems easy enough, happy to help you distill that, let's get it going. Turns out our model couldn't do it at all. I went and looked at other open source models to see if I could just generate a bunch of data and learn from that; that didn't work either. So I looked at what the folks with hundreds of billions of dollars in market cap have to offer, and that doesn't work either. My hypothesis is that the way these models are trained, on a large amount of image-text data scraped from the internet, can be biased: in the case of gauges, most gauge images online aren't gauges in the wild, they're product detail images like these, where the needle is always set to zero, paired with alt text like "givto pressure sensor PSI 0 to 30" or something. So the models are fairly good at picking up those details: they'll tell you it's a pressure gauge, they'll tell you the brand, but they never learn to pay attention to the needle. That's a gap we need to address. Naturally my mind goes to: let's use synthetic data to solve this problem. That works, but it's problematic, because it turned out we needed millions of synthetic gauge images to get to reasonable performance. And thinking about it, reading a gauge is not a zero-shot process in our minds, right? If you had to tell me the reading in Celsius for this real-world gauge: there are two dials on there, so first you figure out which one to pay attention to, the inner or the outer; you look at the tip of the needle, you look at which labels it's between, you count the ticks and do some math to figure out what the reading probably is. So what happens if we just add that as chain of thought, to allow the model to better learn the subtasks it needs to perform to accomplish this goal? You can see in this example, which was actually generated by the latest version of our model: okay, Celsius is the inner scale, it's between 50 and 60, there are 10 ticks, it's at the second tick. It's a little debatable here, there's a weird shadow situation going on and the dial is off, so I don't know what the ground truth is, but it works okay. The points on there are actually grounded: I don't know if it's easy to see, but when I click on them there's a little red dot that moves around on the image, and the model actually has to predict where those points are. I was originally trying to do this with bounding boxes, but then Molmo came out with pointing capabilities, and pointing is a much better paradigm to represent this. We see pretty good results. This chart is actually for clock reading, I couldn't find our gauge-reading chart at the last minute: the light blue line is with our grounded chain of thought. We built a clock-reading benchmark of about 500 images, and this measures accuracy on it; you can see it's a lot more sample-efficient when you use the chain of thought to help the model. Another big benefit of this approach is that you can understand how the model is doing it and where it's failing. In this example the correct reading is 54 Celsius and the model output 56, not too bad, but you can actually go and see where it messed up: it got most of the steps right, except instead of saying the needle was on the seventh tick, it predicted the eighth, and that's why it went with 56. Now that you know it's failing in this way, you can adjust the chain of thought, maybe say "actually count out each tick from 40" instead of jumping straight to "it's the eighth tick," or "I see there's a midpoint marker, count from there instead of all the way from 40." So it helps a ton. The other thing I'm excited about is few-shot prompting, or test-time training, with this: if a customer has a specific gauge we're seeing minor errors on, they can give us a couple of examples where, if the model is misdetecting the needle, they correct that in the chain of thought, and hopefully it works the next time. Now, we've only applied this approach to clocks and gauges, and the real question is whether it generalizes. Probably: there are signs from text models that when you train on a broad number of tasks it does generalize, and I'm seeing some signs with our model as well.
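To make the "grounded chain of thought" idea concrete, here's a hypothetical illustration of what one training record might look like. The field names, the normalized (x, y) point convention, and the wording are invented for illustration; this is not Moondream's actual data format:

```python
# Hypothetical grounded chain-of-thought record for gauge reading, in the spirit
# of the example described above. Field names and coordinates are invented.
example = {
    "image": "gauge_0147.jpg",
    "question": "What is the temperature reading in Celsius?",
    "chain_of_thought": [
        {"step": "Celsius is the inner scale.",            "point": (0.52, 0.61)},
        {"step": "The needle tip sits between 50 and 60.", "point": (0.47, 0.38)},
        {"step": "There are 10 ticks between the labels; the needle is on the 2nd tick.",
         "point": (0.46, 0.40)},
        {"step": "50 + 2 * (10 / 10) = 52."},
    ],
    "answer": "52 C",
}
```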
In addition to the image-based chain-of-thought stuff, I also added some spelling-based chain of thought to help it better understand OCR. I don't understand why everyone doesn't do this, by the way; it's a trivial benchmark question that's very easy to nail. But I also wanted to support it for things like license plate partial matching: hey, does any license plate in this image start with "WHA," or whatever. So yeah, that sort of worked. All right, that ends my story about the gauges. If you think about what's going on here, it's interesting that LLMs are showing enormous progress in reasoning, especially with the latest set of models we've seen, but I have a feeling that VLMs are lagging behind, as we can see with these tasks that should be very simple for a human and that are very easy to find VLMs failing at. My hypothesis for why is that on the internet there's a ton of data that talks about how to reason: there are books about how to solve problems, there are books critiquing the books about how to solve problems. But humans are just so good at perception that we never really talk about it, except maybe in art books, where it's like, hey, to show that the mountain is further away you need to desaturate it a bit; the actual data on how to look at images just isn't there. And the data we do have is kind of sketchy: the best source is image and alt-text pairs on the internet, and that's pretty low quality. So I think our solution is really that we need to teach these models how to operate on individual tasks and figure out how to scale that out. All right, conclusion: at Moondream we're trying to build amazing VLMs that run everywhere. Very hard problem, much work ahead, but we're making a ton of progress and I'm really excited about it. If anyone wants to chat about more technical details of how we're doing this, or is interested in collaborating, please hit me up.

Yeah, when people say multi-modality I always think of vision as the first among equals of all the modalities, so I really appreciate having the experts. Okay, we're a little bit out of time, so we're going to move on to Luca and talk about open models, but if anyone wants to talk to the vision guys, there's coffee and tea outside and we're having lunch in an hour, so you can ask follow-up questions outside if you wish. Luca, let's get you set up with your mic. We sent you a Zoom, it's on the calendar. You just screen share here, no audio. Are you also presenting? I'm backup. How was your session yesterday, the tutorial, your masterclass? It was good, definitely polished the slides. So share your screen. Cool, I think you're set; just speak into that mic. These two mics are good.

All right, cool. Thanks for having me over. I'm Luca, I'm a research scientist at the Allen Institute for AI. I threw together a few slides as a recap of interesting themes in open models for 2024.
I have maybe 20-25 minutes of slides, and then we can chat if there are any questions. If I can advance to the next slide... okay, cool. I did a quick check to get a sense of how different 2024 was from 2023: I went on Hugging Face and tried to get a picture of what kind of models were released in 2023 versus 2024. In 2023 we got things like Llama 1 and 2, Mistral, MPT, the Falcon models, and I think the Yi model came at the tail end of the year; it was a pretty good year. Then I did the same for 2024, and it's a quite stark difference. You have models rivaling the frontier-level performance of what you can get from closed models, from Qwen, from DeepSeek; we got Llama 3; we got all sorts of different models. I added our own OLMo at the bottom, and there's this growing group of fully open models that I'll touch on a little later. Just looking at the slide, it feels like 2024 was smooth sailing, happy news, much better than the previous year. And you can pick your favorite benchmark, or least favorite, depending on what point you're trying to make, plot your closed model and your open model, and spin it in ways that show open models are much closer to where closed models are today than they were last year, when the gap was fairly significant.

One thing I don't know if I need to convince people in this room of, but usually when I give these talks about open models there's always this background question in people's minds: why should we use open models? Why not just use model APIs? It's just an HTTP request to get output from one of the best models out there; why do I have to set up infra and use local models? There's the more researchy answer, which is where my background lies: research on language models thrives on open models. There's a large body of research on modeling, on how these models behave, on evaluation and inference, on mechanistic interpretability, that could not happen at all without open models. For AI builders there are also good use cases for local models. This is not a comprehensive slide, but there are some applications where local models just blow closed models out of the water, retrieval being a very clear example; you might have constraints like edge AI applications where it makes sense; and even just in terms of stability, being able to say this model is not changing under the hood. There are plenty of good cases for open models. And the community is not just models: I stole this slide from one of the Qwen2 announcement blog posts, but it's super cool to see how much tech exists around open models, for serving them, making them efficient, hosting them. And if you think about where the term "open" comes from, it comes from open source, and open models really do meet the core tenets of open source, specifically when it comes to collaboration: through these open models you can build on top of other people's innovation. We see a lot of this even in our own work.
As we iterate on the various versions of OLMo, it's not like we collect all the data from scratch every time; the first step is always, okay, what are the cool data sources and datasets people have put together for language model training? Or, when it comes to our post-training pipeline, one of the steps is DPO, where you use a lot of outputs from other models to improve your preference model. So having an open ecosystem really benefits and accelerates the development of open models.

One thing we got in 2024 that is not a specific model but that I thought was really significant: we got our first open source AI definition. This is from the Open Source Initiative, who have generally been the steward of a lot of the open source licenses for software, and they embarked on this journey of figuring out what an open source license for a model should look like. The majority of the work is very dry, because licenses are dry, so I'm not going to walk through it step by step, but I'll pick out one aspect that is very good and one that personally feels like it needs improvement. On the good side, this open source AI definition is very intuitive: if you've ever built open source software and have expectations around what open source looks like there, it matches your intuition for AI. The weights need to be freely available, the code must be released under an open source license, and there shouldn't be license clauses that block specific use cases. Under this definition, for example, Llama, or some of the Qwen models, are not open source, because the license says you can't use the model for certain things, or that if you use the model you have to name the output a certain way, or that derivatives have to be named a certain way; those clauses don't meet the open source definition, so the Llama license will not be covered by it. It's not perfect, though. One of the things that, in discussions with OSI, we were somewhat disappointed by is the language around data. You might imagine that an open source AI model means a model where the data is freely available; there were discussions around that, but at the end of the day they decided to go with a softer stance, where a model is open source if you provide sufficiently detailed information on how to replicate the data pipeline, so that you could build an equivalent system. "Sufficiently detailed" is very fuzzy, "an equivalent system" is also very fuzzy, and it doesn't take into account the accessibility of the process: you might provide enough information, but the process could cost, I don't know, ten million dollars to run. Now, the open source definition, like any open source license, has never been about accessibility; that's never been a factor in open source software, how accessible the software is. I can write a piece of open source software, put it on my hard drive, and never share it; that software is still open source, and the fact that it's not widely distributed doesn't change the license. But practically, there's a reasonable expectation of what we want good open source to be, so it's kind of sad to see that the data component of this definition is not as open as some of us would like it to be.
I've linked the blog post Nathan wrote on the topic, which is less rambly and easier to follow.

One thing that I think is fair to say about the state of open models in 2024 is that we know a lot more than we knew in 2023, both about the pre-training data you curate and about how to do post-training, especially on the RL side. 2023 was a lot of throwing random darts at the board; in 2024 we have clear recipes that, okay, don't get the same results as a closed lab, because there is a cost to actually matching what they do, but at least we have a good sense of the path to a state-of-the-art language model.

One downside of 2024 is that I think we are more research-constrained than in 2023. The compute barrier you need to clear to move innovation along just keeps rising. If you go back to this slide, there is now this cluster of models released by the compute-rich club. Membership is hotly debated; some people don't want to be called rich because it comes with expectations, some people want to be called rich, but roughly these are players that have 10,000 to 50,000 GPUs at minimum, so they can do a lot of work and a lot of exploration in improving models that is not very accessible to everyone else.

To give you a sense of how I personally think about research budgets for each part of the language model pipeline: on the pre-training side, you can maybe do something with a thousand GPUs, really you want 10,000, and if you want real state of the art, your DeepSeek level, the minimum is more like 50,000, and you can scale to infinity; the more you have, the better it gets, and everyone on that side still complains they don't have enough GPUs. Post-training is a super wide spectrum: you can do something with as little as eight GPUs, and as long as you're able to run, say, a good version of a Llama model, you can do a lot of work there. But a lot of the methodology just scales with compute; if you're interested in an open replication of OpenAI's o1, you're going to be in the 10,000-GPU range. For inference you can do a lot with very few resources, and for evaluation you can do a lot with, well, I should say at least one GPU if you want to evaluate open models. In general, though, if you care a lot about interventions on these models, which is my preferred area of research, the resources you need are quite significant.

One of the trends that emerged in 2024 is this cluster of fully open models, OLMo, the model we build at AI2, being one of them. And it's nice that it's not just us; there's a cluster of other, mostly research, efforts working on this. So it's worth giving you a primer on what fully open means. The easy way to think about it is that instead of just releasing a model checkpoint that you can run, you release the full recipe, so that other people working in that space can pick and choose whatever they want from your recipe and create their own model, or improve on top of your model. You're giving out the full pipeline and all the details, instead of just the end output.
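Concretely, releases like this usually end up on the Hugging Face Hub, so here is a minimal, hedged sketch of pulling down a fully open checkpoint with the transformers library. The repo id and the idea of selecting an intermediate training checkpoint via a revision tag are illustrative assumptions; check the actual model card for the exact artifact names.

```python
# Sketch only: downloads a multi-GB model; repo id and revision are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "allenai/OLMo-2-1124-7B"   # assumed/illustrative repo id
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    revision="main",  # swap in an intermediate-checkpoint tag if one is published
)

prompt = "Open language models are useful because"
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```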
So I pulled up a screenshot from our recent MoE model. For this model, for example, we released the model itself, the data it was trained on, the code for both training and inference, all the logs we got through the training run, and every intermediate checkpoint. The fact that you release different parts of the pipeline allows others to do really cool things. For example, this tweet from early this year from folks at Nous Research: they used our pre-training data to do a replication of the BitNet paper in the open. They took just the initial part of the pipeline and did their thing on top of it. It goes both ways, too: for the OLMo 2 model, a lot of our pre-training data for the first stage of pre-training came from the DCLM initiative, which was led by folks from a variety of institutions; it was a really nice group effort. For us it was nice to be able to say, okay, the state of the art of what is done in the open has improved, we don't have to redo all this work from scratch to catch up; we can just take it directly, integrate it, and do our own improvements on top of that.

I'm going to spend a few minutes doing a shameless plug for some of our fully open recipes, so indulge me. A few things we released this year: as I was mentioning, there's OLMoE, which I think is still the state-of-the-art MoE model in its size class, and it's also fully open, so every component of the model is available. We released a multimodal model called Molmo. Molmo is not just a model but a full recipe for how you go from a text-only model to a multimodal model, and we applied this recipe on top of Qwen checkpoints, on top of OLMo checkpoints, as well as on top of OLMoE, and I think there have been replications doing that on top of Mistral as well. On the post-training side, we recently released Tulu 3, same story: this is a recipe for how you go from a base model to a state-of-the-art post-trained model. We used the Tulu recipe on top of OLMo and on top of Llama, and there have been open replication efforts doing it on top of Qwen as well. It's really nice to see that when your recipe is kind of turnkey, you can apply it to different models and it just works. And finally, the last thing we released this year was OLMo 2, which so far is the best state-of-the-art fully open language model. It combines aspects of all three of these previous projects: what we learned on the data side from OLMoE, and what we learned about making models that are easy to adapt from the Molmo project and the Tulu project.

I will close with a little bit of reflection on the ways this ecosystem of open models is not all roses, not all happy; it feels like, day to day, it's always in peril. I talked a little bit about the compute issues that come with it, but it's really not just compute. One thing that is on top of my mind is that, due to the environment and growing feelings about how AI is treated, it's actually harder to get access to a lot of the data that was used to train a lot of the models up to last year. This is a screenshot from really fabulous work by Shayne Longpre, who I think is in Europe right now, about diminishing access to data for language model pre-training.
What they did is go through every snapshot of Common Crawl. Common Crawl is a publicly available scrape of a subset of the internet, and they looked at, for any given website that was accessible in, say, 2017, whether it was still accessible to crawlers in 2024. What they found is that, as a reaction to the existence of closed models like OpenAI's ChatGPT or Claude, a lot of content owners have blanket-blocked any type of crawling to their websites.

This is something we also see internally at AI2. One project we started this year is that we wanted to understand: if you're a good citizen of the internet, and you crawl following the norms and policies that have been established over the last 25 years, what can you actually crawl? We found that there are a lot of websites where the norms for expressing a preference about crawling are broken. A lot of sites block crawling but do not advertise that in robots.txt; you can only tell they're blocking you when you try. Sometimes you can't even fetch their robots.txt to check whether you're allowed. And then there are all these technologies that have historically existed to make serving websites easier, such as Cloudflare or DNS services, that are now being repurposed to block AI, or any type of crawling, in a way that is very opaque to the content owners themselves. You go to these websites, you try to access them, they're not available, and you get the feeling that something changed on the DNS side that is blocking this, and the content owner likely has no idea; they're just using Cloudflare for better load balancing, and this was sprung on them with very little notice.

I think the problem is that this blocking impacts people in different ways. It disproportionately helps companies that have a head start, which are usually the closed labs, and it hurts newcomers, who either have to do things in a sketchy way or will never get content that the closed labs might already have. There was a lot of coverage of this; I'm going to plug Nathan's blog post again, because I think its title is very succinct: before worrying about running out of training data, we're actually running out of open training data. So if we want better open models, this should be on top of our minds.
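A small sketch of the "good citizen" check described above, using only the Python standard library: consult a site's robots.txt before crawling. The URL and user-agent string are placeholders, and as the talk points out, many sites now block crawlers without ever expressing that preference here.

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # this fetch can itself fail if the site blocks automated access

user_agent = "my-research-crawler"  # hypothetical crawler name
allowed = rp.can_fetch(user_agent, "https://example.com/some/page.html")
print("allowed to fetch:", allowed)
```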
The other thing that has emerged is that there are strong lobbying efforts trying to define any kind of open source AI as a new, extremely risky danger. I want to be precise here: the problem is not considering the risks of this technology; every technology has risks, and those should always be considered. The thing that, to me, is, sorry, disingenuous, is putting this AI on a pedestal and calling it an unknown alien technology with new and undiscovered potential to destroy humanity, when in reality all the dangers I think are rooted in dangers we know from the existing software industry, or in existing issues that come with using software in a lot of sensitive domains, like medical areas. I've also noticed a lot of efforts that have actually been going on to make these open models safe. I pasted one here from AI2, but there's actually a lot of work on questions like: if you're distributing this model openly, how do you make it safe, and what's the right balance between accessibility of open models and safety?

And then there's this annoying brushing under the rug of concerns that are later proven to be unfounded. If you remember the beginning of this year, it was all about the bio-risk of these open models. The whole thing fizzled out because there has finally been rigorous research, not just the paper from the Cohere folks but further rigorous research, showing that this is really not a concern we should be worried about. Again, there are dangerous uses of AI applications, but this one was just a lobbying ploy to make things sound scarier than they actually are.

I've got to preface this part by saying this is my personal opinion, not my employer's, but I look at things like SB 1047 from California and I think we kind of dodged a bullet on that legislation. The open source community, a lot of the community, came together at the last minute and did a very good job explaining all the negative impacts of this bill. There's a lot of excitement about building these open models, or researching them, and lobbying is not sexy, it's kind of boring, but it's necessary to make sure this ecosystem can really thrive.

That's the end of the presentation. I have some links and emails here, the standard thing, in case anybody wants to reach out, and if folks have questions or anything they want to discuss, it's an open floor. Here's Sophia: one very important open model we haven't covered is Mistral, so it's nice to have the Mistral person recap Mistral's year. While Sophia gets set up, does anyone have thoughts or questions about the progress in this space? You always have questions.

Q: I'm very curious how we should build incentives to build open models, things like François Chollet's ARC Prize and other initiatives like that. What is your opinion on how we should better align incentives in the community so that open models stay open?

A: (I think you can tap the mic there. Nice.) The incentive bit is really hard. It's something we think a lot about internally, because building open models is risky and very expensive, so people don't want to take risky bets. I think challenges like that are definitely very valid approaches. And in general, promoting participation in those challenges on top of open models, and really leaning into that multiplier effect, is a good way to go. It would also help if there were more money for research efforts around open models. There's a lot of investment in companies that, at the moment, are releasing their models in the open, which is really cool, but it's usually driven more by commercial interest than by wanting to support open models in the long term. It's a really hard problem, because everyone is operating at their local maximum, in ways that really optimize their position in the market; the global maximum is harder to achieve.
(Okay, somehow it's not being shared on the screen.)

Q: Can I ask one question? I think one of the gaps between the closed and open source models is multilinguality. The closed models like ChatGPT are pretty good on low-resource languages, which is not the same for the open source models. Is it in your plans to improve on that?

A: In general, yes. (Just use your natural voice, yeah.) I think we'll see a lot of improvements there; on the Chinese side, for example, there are groups already working on better multilingual support. I think the challenge is that you really want experts who are actually in those countries and use those languages to participate in the effort. To give you a very easy example: I'm originally from Italy, and I think I'm terribly equipped to build a model that works well for Italian, because one of the things you need is knowledge of how to access libraries or content from that region, and I've been in the U.S. long enough that I no longer have it. So I think the efforts that folks in central Europe, for example, are making to tap into regional communities and bring in collaborators from those areas are going to be crucial for getting out of this situation. Yes, let me close it up.

Hello everyone. (What's that? It's fine, she's not playing any audio. That's weird. Okay, cool.) I'm super excited to be here to talk to you about Mistral: a really short and quick recap of what we have done, and what kinds of models and products we have released in the past year and a half. Most of you already know that we are a small startup, founded about a year and a half ago in Paris, in May 2023, by our three co-founders, and in September 2023 we released our first open source model, Mistral 7B. How many of you have used or heard of Mistral 7B? Pretty much everyone, thank you. It's pretty popular, and our community really loves this model. In December 2023 we released another popular model with the MoE architecture, Mixtral 8x7B.

Going into this year, you can see we have released a lot of things. First of all, in February 2024 we released Mistral Small, Mistral Large, and Le Chat, which is our chat interface; I'll show you it in a little bit. We released an embedding model for converting your text into embedding vectors, and all of our models are available on the big cloud providers, so you can use our models on Google Cloud, AWS, Azure, Snowflake, IBM, which is very useful for enterprises who want to use our models through the cloud. In April and May this year we released another powerful open source MoE model, Mixtral 8x22B, and we also released our first code model, Codestral, which is amazing at 80-plus programming languages. Then we provided a fine-tuning service for customization, because we know the community loves to fine-tune our models, so we give you a very nice and easy option to fine-tune our models on our platform, and we also released our fine-tuning codebase, called mistral-finetune. It's open source, so feel free to take a look. And there were many more models from July to November this year.
First of all, two new small models: we have Ministral 3B, great for deploying on edge devices, and we have Ministral 8B; if you used to use Mistral 7B, Ministral 8B is a great replacement with much stronger performance. We also collaborated with NVIDIA and open sourced another model, Mistral NeMo 12B, another great model. And just a few weeks ago we updated Mistral Large to version 2, with updated state-of-the-art features and really great function calling capabilities; it supports function calling natively. We released two multimodal models: Pixtral 12B, which is open source, and Pixtral Large, both amazing models, not only for understanding images but also great at text understanding. A lot of image models are not so good at text understanding, but Pixtral Large and Pixtral 12B are good at both. And of course we have models for research: Codestral Mamba, which is built on the Mamba architecture, and Mathstral, which is great for working with math problems.

Here's another view of our model lineup. We have several premier models, which means these models are mostly available through our API; all of the models are available through our API except for Ministral 3B. The premier models have a special license, the Mistral Research License: you can use them for free for exploration, but if you want to use them for enterprise or production use, you will need to purchase a license from us. On the top row here we have Ministral 3B and 8B as premier models, Mistral Small for the best low-latency use cases, Mistral Large for your most sophisticated use cases, Pixtral Large as the frontier-class multimodal model, Codestral, which is great for coding, and again the Mistral embedding model. On the bottom of the slide we have several Apache 2.0 licensed open-weight models, free for the community to use, and also free to fine-tune and use for customization and production if you want to: the latest is Pixtral 12B, and we also have Mistral NeMo, Codestral Mamba, and Mathstral, as I mentioned. And we have three legacy models that we don't update anymore, so we recommend moving to our newer models if you are still using them.

Just a few weeks ago we also made a lot of improvements to our chat interface, Le Chat. How many of you have used Le Chat? Oh no, only a few. I highly recommend Le Chat; it's chat.mistral.ai, it's free to use, and it has all the amazing capabilities I'm going to show you right now. But before that: "le chat" in French means cat, so this is actually a cat logo; you can tell from the cat eyes.

First of all, I want to show you image understanding. Here I have a receipt, and I asked: I ordered a coffee and a sausage, how much do I owe with an 18% tip? (I had an issue with Wi-Fi here, so hopefully it works.) It was able to get the cost of the coffee and the sausage and ignore the other items; the coffee is nine, and the cost of the sausage is 22, and then it added the costs and calculated the tip and all that. So it's great at image understanding and great at OCR tasks; if you have OCR tasks, please use it, it's free on Le Chat, and it's also available through our API.
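For the API route mentioned above, here is a rough, hedged sketch of calling Mistral's chat-completions endpoint with an image, in the spirit of the receipt demo. The model id, the exact payload shape for image parts, and the image URL are assumptions to verify against Mistral's current API documentation; only the general pattern of "it's also available through our API" comes from the talk.

```python
import os
import requests

resp = requests.post(
    "https://api.mistral.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"},
    json={
        "model": "pixtral-12b-latest",  # assumed model id
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "I ordered a coffee and a sausage. "
                         "How much do I owe with an 18% tip?"},
                # Field shape for image parts is an assumption; check the docs.
                {"type": "image_url",
                 "image_url": "https://example.com/receipt.jpg"},
            ],
        }],
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```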
And I'm also going to show you a canvas example. A lot of you may have used canvas with other tools before, but with Le Chat it's completely free. Here I'm asking it to create a canvas that uses PyScript to execute Python in my browser. Okay, let's see if it works... import this... yep, okay, so it's executing Python here, exactly what we wanted. The other day I was also asking Le Chat to create a game for me; let's see if we can make it work. A Tetris game... let's just get one row maybe... ah, oh no, okay, you get the idea, I failed my mission... okay, here we go, yay. Cool. So as you can see, Le Chat can write the code for a simple game pretty easily, and you can ask Le Chat to explain the code or make updates however you like. (Another example: there's a bar here I want to move... okay, let's go back.) We also have web search capabilities, so you can ask for the latest AI news. Image generation is pretty cool: generate an image of researchers in Vancouver. That's Black Forest Labs' Flux Pro, and again, this is free. Oh cool, I guess the researchers here are mostly from the University of British Columbia; that's smart. So this is Le Chat. Please feel free to use it, and let me know if you have any feedback; we're always looking for improvements, and we're going to release a lot more powerful features in the coming years. Thank you.

I think we can open up for questions; there's also lunch outside. I don't think we have a YouTube entry, but if anyone has any thoughts on Mistral or OLMo or any of the other open models... No, I think we can just break for lunch and have a chat. Thanks so much to the speakers. We'll be back here; we're going to have some people presenting during lunch, so basically just go grab lunch, and you can come back in, eat, and chat while we have some people presenting as well.

Hi everyone, thank you so much for coming today. Huge shout-out to swyx and the Latent Space team; it's been great, so let's give it up for swyx real quick. I did a little bit of the planning, but I work at Notable Capital (some of you may have heard of GGV, which was our former name) on the cloud infrastructure team, so basically anything data, dev tools, AI infrastructure, as well as AI applications. We like to stay close to those who are smarter than us, which is all of you in this room, so if anyone ever wants to brainstorm or is thinking about starting a company, we're happy to collaborate. We've had the opportunity to partner with amazing companies such as HashiCorp, Vercel, Neon, and many others over the years. We're based in San Francisco and New York, so feel free to find me, Laura Hamilton, on X or LinkedIn, or, you know, Instagram if we become friends. Thank you all for coming, and we'll kick off some of the chats with AWS after everyone gets lunch.

Hello. Oh great. Awesome. Yeah, sure. Well, hey everyone. Hope you enjoyed lunch, and thanks for dialing in here. My name is Aaron. I wanted to give a quick shout-out to the Latent Space team, Notable Capital, and swyx for organizing. I'm with the AWS AI startups team; I've been in the role for about three years now.

I was a founding product hire at a Series A company and had a great exit there, did machine learning for a while, did some strategy consulting with Google for a while, and then joined AWS. I actually got this job on Twitter, of all places. I liked a tweet that was like, hey, I think more VC meetings should be over surf lessons.

And I got a DM back saying, hey, do you kind of want to come work at AWS? And it was off to the races from there. So keep your DMs open. I'll keep this short here. Basically I just wanted to chat about how AWS works with founders, right?

I think everyone's aware compute and credits are kind of the name of the game at this point. I like to think about ways to go deeper than that and figure out how we can add value beyond just "here's some GPUs, here's some credits, run with it," right?

Like, that's kind of table stakes at this point. So I wrote the PR FAQ for an accelerator program, a 10-week program that just wrapped up at re:Invent last week, where we take a couple of companies from around the world and really lean in and try to co-build with them.

We find design partners. We do product strategy, help them with fundraising. We just put them on stage at re:Invent, with something like 700 people in the audience. It's a really fun experience. And that's just kind of putting what we do day to day on the world stage, because our whole team is dedicated to figuring out ways to, again, go beyond credits and compute in our support.

Right. So we work with founders from day zero, before they've even incorporated, when we're still bouncing ideas off of each other and thinking about ways to go to market. And then, beyond that, as you're scaling, finding design partners, getting you listed on Marketplace, and really co-selling together.

And we'd love to be a small part of the journey as you're considering entrepreneurship. So if you want to chat about all things entrepreneurship, please reach out; I'm on LinkedIn, Aaron A. Melgar. If you do just want GPUs and compute and credits, happy to chat about that as well.

But great to be here. And again, thanks to swyx for hosting and to the Notable Capital team for having us and organizing. So thanks, everyone, and enjoy the rest of the talks today. Also, we have them to thank for lunch, all the amazing lunch that we got. This whole thing is self-funded, community funded.

So we're very much flying by the seat of our pants. And also thank you to Laura for making all this happen. OK, so we have a couple more presentations from folks, just people launching things. We got Drew, you're next, but Ben, I'm going to call you up first.

Ben, are you ready? I can get Drew to go first. Drew, you got it. The amazing thing about what Drew's demoing is, well, by definition, it works offline. And it's very, very viral. We're just so lucky, I mean, just for me to be friends with him and to invite him here to show off the best way you can be reading papers.

So usually we come here and we demo B2B SaaS and infrastructure as a service. This is none of that. You want consumer hardware? We got consumer hardware. OK, go. Oh, all right, I still have to hype him up a little bit. What else can I say about you?

Drew's an insane violinist. If you visit his house, he lives in a house of musicians and they just have classical music live all the time. It's insane. All right. Cool. Yeah. Sean is a very flattering hype man. Really incredible. Just a quick thanks to Latent Space for hosting this and to Sean for being in.

I think we met almost two years ago at a replica thing. And he's just like organized the entire scene in a way that makes it digestible for me and everyone else. Thanks to latent space. So I work for a company called Daylight Computer and we're making computers for serious people is one way that I put it.

But we want to make a better reading experience for researchers specifically, and a new surface for A.I. in our real lives. So how do we do that? We haven't heard a whole lot about consumer applications of A.I. today, but I just want to show some demos we've been working on for how to integrate A.I.

more comfortably into research workflows, especially reading papers. So I'll just quickly go over what Daylight is. We invented a new screen technology that works just with the light in the room and has no blue light, better for eye strain, better for focus. And we wrote an operating system to run this screen on our first product, this tablet computer, the DC-1, and it allows you to read outside, or just read without distractions.

So we've kind of made it impossible to get interrupted by notifications and other distractions. It's kind of like a Kindle and an iPad had a baby. So the kinds of things we're doing with A.I. are to kind of integrate directly with paper reading workflows. And I just have a quick demo that I can give here.

It looks like we don't have sound, but I'll just narrate. Here it is. So the voice is going to be Anjan Katta, our founder, who invented the screen technology. That's all right, I can just talk through it. So this is a poem, and often we want to go deeper into the text.

So let's just -- there we go. So this is the Daylight reading Rilke's "The Man Watching" poem. One cool feature we have is a little button on the side that you can press at any time, and then you can talk to an A.I. So I was a little bit confused by certain parts of the poem.

For example, when it said, "What do they mean by 'we would become strong too' and 'not need names'?" Like, what does "not need names" mean? The phrase "not need names" suggests transcending individual identity. And so we just kind of, as we were going through the entire poem, we read it once and we kind of went back through.

We were able to ask all these questions, and came away feeling like we understood it so much more. Another question we asked was, "Can you tell us more about what they mean by these wrestlers in the Old Testament and who the angel is? What biblical story is this referencing?" So we can go deeper into it.

It's referencing the biblical story of Jacob wrestling with an angel found in Genesis 32. And that's just, like, incredibly cool that we're able to do this. I felt very touched by this poem. And so I could say something like, "Could you recommend a few other poems that mirror the themes of this one?" There we go.

It just gave us five poems, with themes like resilience, struggle, and personal challenge. This just feels so cool. I can go in here, add these poems to the device, and read them. Yeah. So we want to bring that to research and to the entire device.

So one thing that's an advantage of owning the entire stack, operating system, hardware, everything, is we can tailor the workflows across any apps. So this will work in a web browser, in your messages, email. No matter what apps you're running, there's a sort of central AI that is running on your device.

It can capture that, put it in a knowledge graph, develop a model of you and your interests. And it's just available everywhere and with a hardware button. So that's what we're working on. If you're interested in these things, if you're interested in knowledge graphs on device or quantized models, come talk to us.

And I actually have a couple of these here if people want to play with them. Thank you. Actually, they're sold out online, right? Yeah, we are. But we have just a couple. We're sold out online probably until the beginning of next year, like March, maybe. But we have like three here today.

You want to buy them off of it? Yeah. $729, U.S. (U.S., not Canadian, money). But yeah, try it. I mean, it's 60 FPS E-Ink. This stuff is not cheap. Give us a tech talk. I can do a quick Q&A. So it's a new twist. There are like six patents on top of essentially a Game Boy screen.

So this is like 20 years of extra research and six patents on top of what's called RLCD or TLCD. So it's transflective or reflective LCD. So it's liquid crystal, but it has no backlight required. The sort of innovation is the reflectance and transflectance films and like stack, you know, black magic to reflect the light back through the LCD.

So it's as fast as any computer monitor. It can run, you know, 120 frames per second, and in broad sunlight. And then, for the transflective part, the question is how you enable a backlight for nighttime use. We developed a layer that allows us to put in a blue-light-free LED that's safe for your circadian health, your suprachiasmatic nucleus, and so on.

So you're not like burning your eyes out at midnight reading. But it can come through similar to a normal computer screen. So that's more or less the secret sauce here. Sorry? No, it's very beta right now. But we're going to release it with a, yeah, we're building it. Yeah, it's going to be great.

It's fun to play with. And if you want to, you know, come by and try writing on it or reading on it or like watching a video on it, just seeing how it feels, looks, I'm in the back there. So see you. There will be a phone, you know, laptop, monitor, all those things.

Yeah, so last speaker. We have Ben from StrongCompute, founder of StrongCompute. I would say it's one of those weird things where even though they're mostly a compute shop, they also do a fair amount of deep research. This year, Ring Attention got a lot of attention from people for scaling, I guess, distributed training.

And we host a paper club. This is basically the in-person version of the online paper club that we've been running for two years. And the single best presentation, one of my favorite presentations of the year, was from StrongCompute. So really grateful for you guys. All right. Hey, folks.

Let's just get on the screen. Yeah. Allow. Did I get you to zoom? I don't think I did. You didn't get me to zoom. Show me the zoom. Are you in the Discord? I can pop that up. Discord. Go to the Latent Space Discord live. Yeah, I'll navigate faster.

So where's the zoom here? Am I going down or up? I'm going down. There we go. Zoom. This is mostly just because I want to capture your screen for the recording. Oh, also I'm wearing this thing. This is for the swag swap. The swag table is back there. Yeah.

There's a lot more swag people want. Okay. I think we're good. That's it. Cool. All right. All right. So we'll keep this informal. So I'll just leave this up here like this. So what we're trying to do is make clusters a lot easier to use. So anyone who's tried accessing clusters for training, we're trying to be what you'd feel an elite DevOps team would be.

So here's kind of a feature list of some of the stuff we're going for, and this is the vibe we're going for. We actually started out building compute hardware, cooling, our own power systems, things like that. Then we got into optimizing training time.

So we're messing around with CUDA kernels and data loading and that kind of stuff. But all right, this is the hard part. This is the rocket science, like getting these much greater efficiencies on the GPU. Surely the easy part is just taking our work once it's done and just putting it on a cloud platform and having it go.

And it turned out to be the complete opposite. We got a whole bunch of awesome optimizations done in a few months. And then it took us a couple of years to actually build a GPU orchestration platform that we wanted to use. And what we realized was that there was just a lot of things that you needed to solve.

What I want to share with you today is something we've been working on for a year, which is a new UI for how you work with clusters. And so this is it here. So if you've got some compute, maybe AWS has given you some credits, that's really nice of them.

So you've got a cluster over in US East. Maybe you've already got some stuff with GCP. Maybe you've got some on-prem as well. And in each of those regions, you've got some number of GPUs. And then you want to go and do things with them. You want to go train models on them.

So the dataset stuff I'll get to in a sec, but you go and run a job. And that job's going to start on some cluster somewhere. So with our system, it's pretty much that easy. You don't need to worry about Linux, NVIDIA drivers, anything like that. You get a Docker, you get root in the Docker.

It can just jump on the cluster straight away. And so then a bunch of other people want to run jobs. And so then you end up backed up in a Slurm queue. And you go, "Hang on, don't we have all this other compute somewhere else in the world? What's it going to take to actually migrate our workloads to those other resources?" Well, what we've built is the ability to migrate those workloads like this.

And what that means is that we can snapshot the state of a cluster live. We can move the entire cluster workload to another cloud provider or another region or another cluster on that cloud provider. We can do that incredibly quickly as well. I've got a little bit of a video showing some of our speed demos.

We rewrote file transfer, block storage, like absolutely everything. That number is going to be a lot higher, that 90 gigabytes a second soon. And we have the world's fastest container registry as well. So yeah, there's no more slow stuff on the cluster. It's also a lot cheaper than what you'd be used to paying for egress.

So the vision for this is if you imagine a CPU and there's a scheduler and it's sending workloads out to cores on that CPU. Well, what if the whole world was the CPU? What if each core was actually a cluster in a data center? And you just felt like your computer had incredible amounts of power.

And obviously, theoretically that's possible, but it's all about the APIs and the interface work. Normally, you want to go and start sending workloads around the world, you've got to go and talk to the dozen DevOps people and they'll get started on a multi-week, multi-month project to do that. These transfers happen incredibly quickly and without the need for any DevOps work to do.

We've also got a few other features here as well. So you'll have some data sets and they might be quite large. So maybe you've got a 10 terabyte data set and we'll pop that in the US to start with. We can't, oh, name's required. All right, 10. Just lose the screen.

Open there, I think it's still on. Yeah, there we go. So I'll just pop a 10 terabyte data set in. So I can go pull that down to any cluster I want. And then I can go and I can do training on that data set, but I can also set up workstations.

So one of the issues that we've seen people encounter is, yeah, I can get a GPU reasonably easily, but it's not going to have high-speed access to my data set. So here we can just set up as many nodes as we want with workstations. You can carve off as many GPUs as you'd like.

And that'll be your container that you can work out of and have high-speed access. That's that 90 gigabyte a second access to that data set. And that way you can go, so this is the entire dev cycle. We're not doing any production inference hosting right now, but you can have fast access to your data sets from your dev container.

You can train on the clusters very, very easily. What we want to do is eliminate any developer waiting time or any work that wasn't PyTorch. So what does that look like for some other examples? We can also, because we're able to save the state of a cluster, we're resilient to things like GPU failure, but we can also do this trick where we go and build a time-cycled space.

And here we can actually just go and pack as many jobs as we want into the same space on a cluster and choose how often they're going to rotate. So maybe it's every three hours. Every job gets an hour every three hours. Maybe this job's more important. Maybe this job's so important and our cluster is so backed up that we just actually want to get this job some dedicated space.

So we hit burst and we spin it up. And now we've actually gone and found resources on the cloud. So we're integrated with six clouds and that's growing. And you can plug your own cloud accounts in. You can use ours, whatever you like. It's a great management platform. And those time-cycled spaces, one of the really cool things about them is that what they can let you do is you can interrupt them anytime you want.

So if you want to launch a cluster-scale workload, it's usually about 15 seconds to start one. So you dev on a GPU. You want a cluster to work on. We'll get you on in 15 seconds. You'll see if your stuff works at cluster scale. You'll get off. All the other jobs will keep going with that state saving.

So we're trying to make GPU feel very fluid. So this is our platform. We're running events about every month. There's one coming up this... Let me jump to it. Yeah, this Friday and Saturday for any folks who are heading home early from Europe. You get to play with our platform there.

Just like people would train chess models on it. Last event, we got up to 1,600 GPUs. People were really having a lot of fun with it. And so that's what's happening this Friday and every month or so. I don't know if I can... Yeah, I saw this hackathon and I was wondering what the expected ELO of a chess bot that you can train in a day is.

We were curious about that as well. So this is the sixth time we've run this. The first time your child who's just learned the rules of chess could probably defeat most of the models. But around the fourth time we'd run it, I think they were probably getting up around 1,500 ELO.

People were really getting a lot better at figuring out how to do this stuff in a weekend. And yeah, I think... I'm not the world's greatest chess player and right now the models will quite easily defeat me. And yeah, we're excited to see how much further this will go.

So this is like a short-form way to access the platform. We also have research grants. So if you want access to compute, we've given out about a dozen of these so far, and we'll give you like 10 to 100k worth of compute. You just have to be doing something cool, something interesting, something that swyx would want to host at an event.

And yeah, you can come and get some Compute. So we'll give you like about three months on the systems with that as well. We have AWS, GCP, Azure, Oracle, Oblivious, Lambda Labs, our own Metal, and we've got some more on the way. It's the same. Like the way our systems work is we've just got a Ubuntu image that has our agent in it.

It's called Agent Smith. And it has our fast data puller, our sort of node clustering stuff, a whole bunch of health monitoring. Yeah, some of the stuff I didn't show you today. You get total financial control. You can set budgets across all providers. You go, "All right, this developer is allowed 10k a month or this project's allowed 100k." It will just tally up all the usage and then tell them when they need to stop.

So, yeah, it's very, very multi-cloud. Yes. Can you repeat the question? Yeah, so the question was how we compare this with something like Modal. So Modal is a great environment where you want, you know, probably not cluster scale, so a smaller number of GPUs, and where you're willing to sort of convert a lot of your workflow to the way that Modal operates.

What we've tried to do is make it so it's-- we're sort of clusters first and foremost. Like, yes, you can have workstations, but it's really about larger clusters and making the data extremely fast so that you're not sitting around waiting for days for that. The other thing is we've tried to make it as close to how you operate now as possible.

So, you know, Docker is, you know, it is the standard. We didn't use Docker to start with because we were concerned about the speed. We haven't applied a bunch of our speed tooling to it, and it's only about a 5% overhead at the moment. And so you get on our system, you get a Docker container, you get root in the container, you can do anything except install kernel modules.

So you get a lot of freedom, and you don't have to convert to another way of doing things. Cool. Well, thank you so much. We'll chat a little bit for lunch, and then we'll come back at 1 p.m. Thank you.

Okay, now you can share your screen. I understand. And yes, the Corgi extension. And I'm going to ask Gemini a couple of questions real quick.

So let's go ahead and get started. Hey, Gemini, could you explain to me how I might go about changing my runtime type in Colab? Yes, I can see the Colab. Okay. I understand that you can see the Colab interface. Is there anything specific you would like to discuss or ask about it?

Yep, okay. Beautiful. Thank you. Cool. So, Gemini, could you explain to me how I would go about changing my runtime type? Sure, I can help with that. To change your runtime type in Google Colab, you need to go to the Runtime menu at the top of the page.

Then you can select Change Runtime Type. A window will open where you can select the hardware accelerator you want. Amazing. This is great. So which one should I select if I wanted to train a model of the ones that are in white over here? Okay, I can help you with that.

If you're looking to train a model and the options shown in the hardware accelerator section of the Change Runtime Type pop-up menu are the ones you have to choose from, the A4 GPU is a good choice for general model training. However, if your model training requires a lot of memory, you might want to try the A100 GPU instead.
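After switching the runtime as described above, a quick way to confirm from a notebook cell which accelerator is actually attached is a sketch like this (it works in any Python notebook, not just Colab, and only detects CUDA GPUs, not TPUs):

```python
import torch

if torch.cuda.is_available():
    # Report the GPU Colab assigned to this runtime.
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("No CUDA GPU visible; the runtime may be CPU-only or a TPU.")
```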

Amazing. Thank you. So, Gemini, how about this key section here off to the left? How would I go about adding a new key or a new secret? Certainly, I can assist with that. Just play the audio from this. Yeah, you can play the audio from the computer. So confusing.

Any more questions? Is it this? That's just pure recording. He's here. Sorry about this. So when you play, is it coming from the laptop now? It's coming from the laptop. This is why we have to rehearse. Hey, Gemini. Can you help me understand how to add a new secret?

Sure, I can help with that. Looking at the screen, you will need to click the add new secret button on the left side panel of the screen below the other secrets. After that, you can fill out the information for the new secret. Beautiful. Thank you so much. All right.

So for Colab, how would I go about changing my runtime type? To change your runtime type in Colab, you can click on the runtime menu at the top of the page, and then select change runtime type. From there, you can choose between different options, such as GPU or TPU.

Excellent. Explain to me what the difference between GPUs and TPUs are. Certainly. GPUs, or graphics processing units, are designed to handle parallel processing, which is ideal for tasks like rendering images and videos, and also machine learning. TPUs, or tensor processing units, are custom built by Google and specifically designed for deep learning tasks, making them often more efficient than GPUs for these tasks.

Excellent. Thank you so much. How would I go about accessing the terminal in Colab? To access the terminal in your Colab notebook, you can click on the tools menu at the top of the page, and then select command palette. Then, type terminal in the search bar, and click on open a terminal.

This will open a terminal within your Colab environment. Excellent. Very cool. And what command would I need to type in order to list all of the files in my current directory? To list all the files in your current directory, you can use the command ls. Type ls into the terminal and press enter.

Are there any other config options that I should add for ls to list all of the files, not just the directories? Certainly. If you want to see all files, including hidden files, you can use the command ls -a. This will show all files in the current directory. If you would like to list all files in a detailed view, including their permissions, owner, size, and modification time, you can use ls -l.

Also, combining both, ls -la will show a detailed list of all files, including hidden ones. That's cool. You're welcome. Awesome. Is there anything else I can help you with? Yeah, you know, Gemini, I would really prefer if you addressed me and started talking like you're a pirate, then. Ahoy, matey.

What be your next question, then? Shiver me timbers, I'm ready for it. Amazing. Help me understand what the tools command palette helps me do within Colab. Aye, matey. The command palette be a handy tool for you. It be a way to quickly access a wide range of commands and actions in Colab.

You can use it to add a Gemini prompt cell, add a code cell, change your runtime type, and many other things. Now I want you to talk like a Texan. Aye, matey. I'll gladly talk about the text. What be your question about it? No, I said please talk like a Texan.

I want you to have a Texas accent like John Lee. Alrighty then, partner. What can I do for yay now? Let's talk about that there, text like a true Texan. What do you want to discuss? Cool. Thank you, Gemini. You're welcome, partner. Happy to help. Let me know if you need anything else, you hear?

So in addition to being able to interact with screens, interact with video, interact with audio, Gemini is also capable of natively outputting image tokens, so we haven't released that via AI Studio today. We can also couple Gemini together with code execution, function calling, so you can do things like say, Gemini, watch my screen, and if you see anything that looks like a particular kind of notification or a particular kind of flag, set off a call to a specific tool.

I also like using it to help me as kind of like a posture corrector. So if I give Gemini access to my camera, which you can see here, so it just sees me. Hello. It can tell me if my posture is correct or incorrect. So Gemini, I'm taking a look at my posture.

Should I stand up straighter? Do I have good posture? What's your assessment? Okay. Looking at the video, your posture does seem slightly slumped. To improve it, you might try standing up straighter, pulling your shoulders back gently, keeping them aligned with your spine. Well, thank you. I appreciate it. But anyhow, it's helpful to have that open as you're kind of sitting at your desk to see.

It supports different kind of system instructions, so you can add things like the speak like a pirate or speak like a Texan up there. And then there are also a few different voice options. So if I go back to the main menu, you can see kind of a few different voices that you can play around with to test out some of your favorites.

We also have kind of an audio and text. So if you don't want to have audio out responses, you can swap to being just text. We released some starter apps as well that I encourage you to take a look at to help experiment with things like bounding boxes. So you can see Gemini kind of identify bounding box locations for different things.

So like the shadow of an origami fox, the shadow of an origami armadillo, and then the armadillo and the fox off to the side. It's also capable of doing this for things like socks. So being able to sort and filter between different kinds of socks. And then also for different kinds of desserts.

So if you want to have bounding boxes natively output, this is something that's supported not just with images, but also screen by screen for videos. Cool. So as I mentioned, you can get started with it today at aistudio.google.com. We also have a very generous free tier. So all of this is freely available for you to use today to try out and to also use if you want to create API keys as part of your applications.

Awesome. Thank you. Yes. So the question was, can you speak to the agentic research? And I'm trying to debate now how much I can speak to without getting fired. But the agentic research, we did release a couple of different options, including a data science agent within Colab, which is just using kind of commodity Gemini available APIs.

So you could test them out, use them for your projects today. We also released something called Project Mariner, which is focused around computer use, being able to interact with websites directly within the browser. Again, strongly encourage you to try out multimodal streaming API coupled with tools use. And you'll probably be able to get very close to the kinds of experiments that you saw released just via video.

But those are the kinds of things that we're focusing on for agents, not just being able to understand and generate text and code, but also to be able to interact with these multimodal and streaming experiences. Yep. Yeah, let's do it. So let's see. Yes. AWS would be interesting. I will pull up the cloud interface, though, and we'll see how well it does.

I've never tried this either. I've tried Colab before, but I've never tried the cloud interface. So let's see. Hey, Gemini, could you help me understand how I would use one of the cloud models on this interface? Oh, I had switched it to only output just text, I think. Okay.

Oh. Audio. Hey, Gemini, could you help me understand how I would use one of the cloud interfaces or one of the cloud models within the screen? For some reason, it's not wanting to have the audio outputs anymore. Let me try again and refresh. Yeah. Yeah, definitely yes. Oh. So it's recognizing this.

But, hey, Gemini? Hmm. For some reason, the audio isn't wanting to work for me anymore. And I'm curious why that would be. I don't think it's because of the website. Oh, it might be on here. Yeah, because if you see -- I think that's live demos for us. No, no, no.

It's saying that, for whatever reason, it's not wanting to understand the audio anymore. Yeah. Wow, this is very strange. Yeah, it's not wanting to output the volume. Let me check. Do you want me to check? No, no, I think the next speaker doesn't have audio.

Yeah. Cool, cool. Try it out. Thanks for showing. Thanks for the Q&A. Let me -- And also, I encourage you all to go try it out with websites that are not Colab. And do it, like, right now with your laptop. So it should be fun. Awesome. Cool. Thank you.

Now I'm trying to understand -- Oh, yes. I'm trying to understand why the -- Oh, it's working. So let me -- I'm going to try it. Okay. Next speaker session. So we've got some pretty fun ones. The next two speakers, we've got Eugene and Dan from Together and Featherless.

They're going to talk about alternative transformer architecture, so what goes on outside of the transformer. Eugene is a member of our paper club every week, core member there. He's got a whole article about hardware scaling and how it's VC subsidized. Hot Take, he's launched, like, one of the best 34B models.

They're dropping it today. So their company has just trained some of the best models. Dan is with Together, same situation. Super smart people. They've also shipped a lot of this stuff. So we have a little bit of an overview of states-based models, overview of what's going on outside of transformers.

They're going to share their hot takes, do a bit of a Q&A. So give it up for them. >> The one that says test only? >> Yep. >> All right, cool. Yeah, so thanks so much for having us. So this is going to be a little bit of a two-part presentation.

My name is Dan. I'm at Together AI, and I'll be joining UCSD as faculty in about a year. Eugene, you want to introduce yourself? >> Yeah, I'm Eugene. I lead the RWKV team, and I'm CEO and co-founder of Featherless, and we both work on this new post-transformer architecture space.

>> Yeah, so today we're really excited to talk to you a little bit about that. So first I'm going to give a broad overview of post-transformer architectures, and then afterwards Eugene will tell us a little bit about the latest and the greatest and the latest frontier models in this space.

So the story starts with scaling. So this is probably a figure or something like this that you've seen very recently. Over the last five to six years, we've seen models really scale up in parameter size, and that's brought with it a bunch of new capabilities.

But another place where we've seen scaling, especially recently, is scaling in context length. So this can mean just having more text inputs for your models, but it can also mean things like taking a lot of visual token inputs, image inputs to your models, or generating lots of outputs. And one thing that's been really exciting over the last few months or so is that we're seeing scaling not only during training time, but this is one of the -- this is the iconic image from the OpenAI o1 release.

Not only are we starting to scale train time compute, but we're also starting to scale test time compute. Now, if you're familiar with our attention and our transformer architectures today, this graph on the right might look a little bit scary. And one of the reasons is that the implications are a little bit interesting.

So what does it mean if we want to continue having smarter and smarter models? Do we need bigger and bigger data centers, spending more flops? Is this -- this little dolly three, we need more flops guy, is this going to be the future of all of AI? Or is there a better way, another path forward?

Maybe we can get the same capabilities that we've gotten used to, but for a lot less compute, a lot less flops. And one of the things that we're going to talk about today is specifically looking at that core attention operator in some of these examples. The reason is that -- so this is just some basic, you know, scaling curves, but attention has compute that scales quadratically in the context length.

So that means that if you're doing something like test time compute and you want to spend a bunch of tokens thinking about what comes next, the longer that goes, the more tokens you spend on that, that compute grows quadratically in that. One of the questions that we're interested in is can we scale from the bottom and get it to scale better?

Can we scale in, let's say, n to the three halves or n log n? And so in the first part of the talk, so we just went over the introduction, what I'm going to do over the next few slides is just talk about some of the key advances and ideas that have shown over the past few years since maybe early 2020 to now that shown promise that this might actually be possible, that you can actually get potentially the same results.

So to do that, and basically the story that we're going to look is we're going to start to see how, so this is a basic graph of just the past couple years of progress of perplexity where that blue line, that dotted blue line is attention, it's your basic transformer, full dense attention, and then the dots coming down are some of the methods that you'll see in this presentation today.

We're going to turn the clock back all the way to around 2020, and we're going to look at how to make attention subquadratic. Basically, as soon as we said attention is all you need, people started asking this question. So we have this quadratic attention operator, can we do better? I'll briefly talk about why attention is quadratic, and the basic thing that happens, if you're not familiar, is that you have these inputs, these keys and queries, and what you do in this attention matrix, this S matrix over here -- say you upload a book, maybe not to Gemini, because we don't necessarily know what its architecture is, but let's say to Llama -- what happens behind the scenes is that it's going to take every single word in that book and compare it to every other word.

And this has led to some pretty impressive things, but it's kind of a brute forcing of the way that you would try to interpret something. What attention does afterwards is that instead of always operating in this quadratic thing, it takes a row-wise softmax over this matrix and then multiplies it by this values matrix.

So one of the key points to notice is that the output size is always going to be the same as the inputs, at least in standard self-attention. So one of the first things that folks tried to do around 2020 is this thing called linear attention, which is just noticing that you can take out this softmax from the middle of the attention operation.

And then if you compute the keys and the values operation first, you actually never hit this quadratic bottleneck. So that's potentially a way to get a lot more computationally efficient. And there are various ways to do this, basically by using feature maps or trying to approximate this overall attention computation.
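
To make the reordering trick concrete, here is a minimal, non-causal sketch of linear attention in PyTorch. This is illustrative only; the feature map, normalization, and causal masking all vary across the actual papers.

```python
import torch

def softmax_attention(q, k, v):
    # Standard attention: materializes an (n x n) score matrix -> O(n^2) compute and memory.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v, feature_map=torch.nn.functional.elu):
    # Drop the softmax, apply a simple positive feature map, and reassociate:
    # (phi(Q) phi(K)^T) V == phi(Q) (phi(K)^T V); the right-hand side never
    # forms an (n x n) matrix, so compute is linear in sequence length n.
    # Note: this is the bidirectional (non-causal) version, for clarity.
    q, k = feature_map(q) + 1, feature_map(k) + 1
    kv = k.transpose(-2, -1) @ v                            # (d x d) summary of keys and values
    z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1)   # per-token normalizer
    return (q @ kv) / (z + 1e-6)

n, d = 2048, 64
q, k, v = (torch.randn(1, n, d) for _ in range(3))
print(linear_attention(q, k, v).shape)   # torch.Size([1, 2048, 64])
```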

But some of this work sort of started to hit a wall in 2020. And the basic challenges were two. So one was quality. Back then, it was kind of hard to get good quality with these linear attention operators. The other one was hardware efficiency. So this feature map that's shown simplified here actually ends up being quite computationally expensive if you just implement it naively.

So you started having these operators where not only were you not really sure if they had the same quality, but they were also actually just wall-clock slower. So that kind of sets the stage for four years ago. Keep this in mind, because linear attention is actually going to come back in a few years once we have a better understanding.

But one of the works that started kicking off this mini revolution in post-transformer architectures was this idea called state-space models. So here the seminal work is the S4 work from around 2022. And this piece of work really brought together a few ideas from some long-running research lines of work.

The first one was -- and this is really one of the keys to closing the gap in quality -- was just using things that if you talk to an electrical engineer off the street, they might know off the back of their hand. But taking some of those properties with how we model dynamical systems in signal processing and then using those ideas to model the inputs, the text tokens in, for example, a transformer-like next token prediction architecture.

So some of those early state-space model papers were looking at this relatively simple recurrent update model that comes from maybe chapter one of a signal processing class, but then using some principle theory about how you should do that recurrent update in order to really get the most that you can out of your hidden state out of your sequence.

So that was one key idea for quality. And when this was eventually realized, you started to see a bunch of benchmarks that were pretty sticky for a few years, things like long-range arena, some long-sequence evaluation benchmarks, there was stuff in time-series analysis. You started to see the quality tick up in meaningful ways.

But the other key thing that was so influential about these state-space models is that they also had a key idea about how you can compute these things efficiently. So if you go back to your machine learning 101 class where you learned about RNNs, one thing that you may have learned is that they don't parallelize as well as attention, because if you just run them naively, you have to do this kind of sequential update to process new tokens, whereas in attention, you can process all the tokens in parallel.

One of the key insights behind the S4 paper was that these recurrent models could also be formulated as a convolution. And in particular, with a convolution, instead of using a PyTorch conv1d operation, you can compute that with the FFT. And that would give you n log n compute in the sequence length n with an operator that was relatively well optimized for modern hardware.

So those are really, I'd say, the two key ideas that started allowing these breakthroughs to happen in these non-transformer architectures. So these ideas about how to principally model -- sorry, how to model the recurrent updates of a sequence in a principled way, and also these key ideas in how you can compute it efficiently by turning it into a convolution and then scaling it up with the FFT.
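
As a rough sketch of that second idea, here is how a long convolution can be computed with the FFT in O(n log n). The kernel here is random, whereas in S4 it would be derived from the learned state-space parameters.

```python
import torch

def fft_long_conv(u, kernel):
    # u: (batch, n) input sequence; kernel: (n,) SSM-derived convolution filter.
    # Direct convolution is O(n^2); doing it in frequency space is O(n log n).
    n = u.shape[-1]
    fft_size = 2 * n                          # zero-pad to avoid circular wrap-around
    U = torch.fft.rfft(u, n=fft_size)
    K = torch.fft.rfft(kernel, n=fft_size)
    y = torch.fft.irfft(U * K, n=fft_size)[..., :n]   # keep the causal part
    return y

u = torch.randn(4, 8192)
kernel = torch.randn(8192)                    # in S4 this comes from the A, B, C matrices
print(fft_long_conv(u, kernel).shape)         # torch.Size([4, 8192])
```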

Along those same lines, so afterwards, we started putting out some work on specialized kernels. So just like we have Flash attention for transformers, we also have works like Flash FFT conv. And if you look at these lines of work, oftentimes whenever you see a new architecture, you see a new primitive.

One of the table stakes now is, do you have an efficient kernel so that you can actually get wall clock speed up? So by 2022, 2023, we were starting to have these models that had promising quality primitives and also promising wall clocks. So we were better than transformers in meaningful ways.

That being said, there were still sometimes a quality gap, particularly for language modeling. And because language is so core to what we do in sequence modeling these days, the next key idea that I'm going to talk about is this idea of selection mechanisms. And this is basically an idea of -- so you have this recurrent state that you're keeping around that just summarizes everything that came before.

And to get a good sequence model, the best thing you're able to do is have the model learn what's the best way to pick out pieces from that recurrent state. So one of the major ideas here in a line of work called H3, Hungry Hungry Hippos, and also these hyena models were -- one way you can do this is by just adding some simple element-wise gates.

So versions of these ideas have been around for decades. If you squint at the LSTM paper, you can probably find this gating mechanism. But it turns out you can take those old ideas, add them into these new state-space models, and then you can see quality start to pick up. If you've heard of the Mamba model, this also takes the selection to the next level by actually making some changes in that fundamental recurrent state-space update.

So it's not only just this gating that happens around the SSM layer, but also you can actually make the A, B, C, D matrices of your state-space model data dependent, which will allow you to even better select out different pieces from your hidden state depending on what you're seeing.
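
A toy, sequential sketch of those two selection ideas, an element-wise gate plus a data-dependent decay, is below. Real Mamba uses a structured state matrix and a hardware-aware parallel scan, so treat this purely as an illustration.

```python
import torch

def selective_scan(x, W_delta, W_gate):
    # x: (batch, n, d). A toy, sequential version of a "selective" recurrence:
    # the per-step decay depends on the input (data-dependent), and an
    # element-wise gate wraps the output -- the two selection ideas in the talk.
    b, n, d = x.shape
    h = torch.zeros(b, d)
    outs = []
    for t in range(n):
        decay = torch.sigmoid(x[:, t] @ W_delta)   # data-dependent forget rate
        h = decay * h + (1 - decay) * x[:, t]      # recurrent state update
        outs.append(h)
    y = torch.stack(outs, dim=1)
    gate = torch.sigmoid(x @ W_gate)               # element-wise gate (H3/Hyena style)
    return gate * y

d = 32
x = torch.randn(2, 128, d)
W_delta, W_gate = torch.randn(d, d) * 0.1, torch.randn(d, d) * 0.1
print(selective_scan(x, W_delta, W_gate).shape)    # torch.Size([2, 128, 32])
```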

I'll also point out if you look at the bottom right of this figure, there's this little triangle with a GPU SRAM, GPU HBM, and this is just continuing that trend of when you have a new architecture, you also release it with a kernel to show that it is hardware efficient, that it can be hardware efficient everywhere.

One of the next cool things that happened is once we had this understanding of these are the basic pieces, these are the basic principles behind some of the sequence models, linear attention actually started to come back. So earlier this year, there was a model called BASED from Simran Arora and some other folks that combined a more principled version of linear attention -- the two-second summary is that it used a Taylor approximation of the softmax attention -- with a simple sliding window attention, and it was starting to be able to expand the Pareto frontier of how much data you can recall from your sequence versus how small your recurrent state size is.

So those orange dots at the top there are just showing smaller recurrent states that can recall more. And the last major idea, I think, that has been influential in this line of work, and is relatively late-breaking, just a few months ago, is just the basic idea that when you have these models that are fundamentally more efficient in the sequence length, you maybe don't want to prompt them or use them in exactly the same way.

So this was a really cool paper called Just Read Twice, also from Simran, that basically said, hey, all these efficient models can process tokens so cheaply that they have unfair advantages compared to a simple transformer model. So take, for example, the standard use case of you have some long document, you're going to pass it in as input, and then you're going to ask some question about it.

One problem you might imagine for a recurrent model where you have a fixed state size is, let's say that your article is very long and you're trying to ask about some really niche thing, it can be really hard for the model to know ahead of time what information to put into the hidden state.

But these models are so much more efficient that you can do something really stupid: you can just write down the document, write down the question, write down the document again, and then write down the question again. And then this time, the second time that you go over that document, you know exactly what to look for.
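
In prompt terms, the trick is as simple as it sounds; something along these lines (a hypothetical template, not the paper's exact prompt):

```python
def read_twice_prompt(document: str, question: str) -> str:
    # For a fixed-state recurrent model, repeating the document after the question
    # means that on the second pass the model already knows what to look for.
    return (
        f"Document:\n{document}\n\n"
        f"Question: {question}\n\n"
        f"Document (again):\n{document}\n\n"
        f"Question: {question}\nAnswer:"
    )

print(read_twice_prompt("The armadillo shipped v2 of the kernel on Tuesday.",
                        "When did v2 ship?"))
```

The extra tokens are cheap precisely because these architectures do not pay a quadratic price for them.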

And the cool thing about this is that it really takes advantage of the more efficient architectures that we're having here. So one of the other, I think, influential ideas in this line of work is if you change the fundamental compute capabilities of your model and the way that it scales, you can actually start to query it at test time differently.

And this actually, of course, goes back to those slides on test time compute. So while everybody's looking at, say, test time compute for big transformer models, we're looking at how you can take those ideas and how they change with this new next generation of models. So I'll just briefly summarize what some of those key ideas were and then show you briefly kind of what the state of the art is today.

So the four key ideas are instead of just doing a simple linear attention approximation, instead take ideas that we know from other fields like signal processing, do a more principled approach to your modeling of the sequence. Another key idea throughout all these lines of work is you really want hardware and kernel support from day one.

So even if your model is theoretically more efficient, if somebody goes and runs it and it's two times slower, one of the things that we've learned is that if you're in that situation, it's just going to be dead on arrival. So you want to be designing your architectures with the hardware in mind.

One of the key machine learning ideas that has been important for the quality is just making sure that you encode different ways that you can select from your hidden state and really focus on that as a key decider of quality. And finally, I think one of the emerging new things for this line of work and something that's quite interesting is what are the right test time paradigms for these models?

How do they change relative to what you might do for a standard transformer? I'll briefly end this section. So I've labeled this slide where we are yesterday, because Eugene is going to talk about some new models that he released literally this morning. But as of yesterday, some of the really cool results out of these efficient alternative models were -- so AI21 trained this hybrid MoE called Jamba that is currently the state-of-the-art for these non-transformer architectures.

There's this -- NVIDIA and MIT put out this new diffusion model called SANA recently that one of their key observations is that you can take a standard diffusion -- transformer diffusion model, replace the layers with linear attention, and then that lets you scale to much larger images, much larger sequences more efficiently.

And one thing that I don't think anybody would have called a few years ago is that one of those gated SSMs, gated state-space models, ended up on the cover of Science because a great group of folks went and trained some DNA models. So that's Michael Poli, Eric Nguyen from Stanford and the Arc Institute.

So we're really at an exciting time in 2024 where these non-transformer, post-transformer architectures are showing promise across a wide range of modalities, of applications, and of tasks. And with that, I'll pass it on to Eugene, who can tell you a little bit about the latest and greatest with RWKV.

>> Yeah. So -- >> You're talking to here. >> Oh, I'm talking to here. Okay. So, yeah, two streams. Yeah. So I think one common question that we tend to get asked is what's the difference between RWKV and state space. So I think one of the key things to really understand, right, the difference between the two groups, right, is that we are actually more like an open-source, random-internet-meets-academia kind of situation.

Like most of us never wrote any paper, but we basically looked at RNNs and linear attention when Attention Is All You Need came out, and then we decided to, like, hey, there is a quadratic scaling problem. Why don't we try fixing that instead? So we ended up developing our own branch, but we ended up sharing ideas back and forth.

And we do all this actively in Discord, GitHub, et cetera. This was so bad for a few years, right, that basically the average group's H-index was so close to zero, right, that EleutherAI actually came in and helped us write our first paper. Great. Now our H-index is three, apparently.

But the thing is, like, a lot of these experiments led to results, and essentially we took the same ideas from linear attention and we built on it. So to take a step back into, like, how does RWKV handle its own attention mechanic and achieve the same goal of, like, O(n) compute, respectively, in service of our overall goal to make AI accessible to everyone, regardless of language, nation, or compute -- that's our goal.

We actually train our models primarily on over 100 languages, which is another topic altogether, and our goal is to train to even 200 languages to cover all languages in the world. But at the same time, we work on this compute cost so that people can run it on Raspberry Pis and on anything.

So, how did RWKV break the dependency of LSTM token flow? Because I think to understand the architecture, right, it's probably easier to understand it from the RNN lens, because that's what we built on. Whereas state space kind of, like, tried to start anew and took lessons from that, so there's a little bit of divergence there -- aka this is our version of linear architecture.

But if you step back, all foundation models, be it transformers or non-transformers, at a very high level, right, take in a token, turn it into embeddings, go through a lot of layers, generate a lot of internal states, whether QKV cache or RNN states or RWKV states, and output an embedding that you then sample from. And we just add more layers and more embeddings, and somehow that magically works.

So, if you remember your ancient RNN lessons, which we call "blessed learning" these days, the general idea is that you have the embedding information flowing all the way up, and you take that information, and you flow it back down, and then you process it as part of your LSTM layers.

So, this is how it generally works. Karpathy is quoted saying that RNNs are actually unreasonably effective. The problem is this is not scalable. To start doing work on the second token, you need to wait for the first token, yada yada. That is CPU land, not GPU land. So you can have an H100 and you can't even use 1% of it.

So that's kind of why RNNs didn't really take off in the direction that we wanted, like, billions of parameters when it comes to training. So, what did RWKV version 0 do? We just did the dumbest, lamest thing. Sorry, this is the bottleneck for RNNs. We did the dumb thing of removing that line.

And it kind of worked. It trained, it sucked, but it kind of worked. Then no one cared, because the loss was crap. But how do we improve that? And that's essentially where we moved forward. Because if you see this kind of flow, right, you can actually get your GPU saturated quickly, where it essentially cascades respectively.

So, I'm just waiting for this to loop again. So, it's like once your first layer finishes computing its token, you start to cascade your compute all the way down your GPU. So, we worked on it, and we started going along the principle that as long as we keep this general architecture, where we can cascade and be highly efficient with our architecture, nothing is sacred in our architecture.

And we have done some crazy ideas. In fact, you ask us, if you ask me to explain some things in the paper, right, officially in the paper, I'll say we had this idea and we wrote it this way. The reality is someone came with a code, we tested it, it worked, and then we rationalized later.

So, the general idea behind RWKV is that we generally have two major blocks that we do. We call it time mix and channel mix. And time mix generally handles long-term memory states, where essentially we apply matrix multiplications and SiLU activation functions to process the input embedding and then output an embedding.

I'm oversimplifying it, because this calculation changed every version, and we have, like, version 7 right now. Channel mix is similar in the sense that it does shorter-term attention, where you just look at the sibling token, or the token before it, because there's a shift in the token shift matrix.
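
For intuition only, here is a rough sketch of a channel-mix-style block with token shift. As noted above, the exact formulation changes between RWKV versions; the dimensions and weight names here are made up.

```python
import torch

def channel_mix(x, mu_k, mu_r, W_k, W_v, W_r):
    # x: (batch, n, d). Token shift: blend each token with the one before it,
    # which is the "look at the previous token" short-range attention described above.
    x_prev = torch.nn.functional.pad(x, (0, 0, 1, -1))     # shift sequence right by one
    xk = x * mu_k + x_prev * (1 - mu_k)
    xr = x * mu_r + x_prev * (1 - mu_r)
    r = torch.sigmoid(xr @ W_r)                            # receptance gate
    k = torch.relu(xk @ W_k) ** 2                          # squared-ReLU "key"
    return r * (k @ W_v)

d = 64
x = torch.randn(2, 16, d)
mu_k, mu_r = torch.rand(d), torch.rand(d)
W_k, W_v, W_r = (torch.randn(d, d) * 0.05 for _ in range(3))
print(channel_mix(x, mu_k, mu_r, W_k, W_v, W_r).shape)     # torch.Size([2, 16, 64])
```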

I don't really want to go too much into the papers themselves, because we do have three papers on this: basically "RWKV: Reinventing RNNs for the Transformer Era"; "Eagle and Finch: RWKV with Matrix-Valued States", which covers the updated version 5 and version 6; and GoldFinch, our hybrid model, respectively. We are already writing the paper for V7, which is RWKV-7, codenamed Goose.

All our architectures are codenamed after a bird. And I'm going to cover as well QRWKV, Mamba-RWKV, and RWKV MoE. So where did that lead to? Because we were all GPU poor. And to be clear, most of this research is done on only a handful of H100s, which, as one Google researcher told me, was his experiment budget for a single researcher.

So our entire organization has less compute than a single researcher at Google. One of the things that we explored was how do we convert transformer models instead? Because someone already paid that million dollars on training, so why don't we take advantage of those weights? And I believe Together AI worked on a similar conversion for the Mamba side of things, and we took some ideas from there as well, and we essentially did that for RWKV.

And that led to QRWKV6, which we just dropped today, a 32B instruct preview model, where we took the Qwen 32B Instruct model, froze the feed-forward layer, removed the QKV attention layers, and replaced them with RWKV linear layers. So to be clear, this means we do not have the RWKV channel mix layer, we only have the time mix layer.

But once we do that, we train the RWKV layers. Importantly, the feed-forward layer needs to be frozen, so the new attention can be learned. And then we unfreeze the feed-forward layer and train all the layers together with a custom learning rate schedule, so that they can learn how to work together.
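
The two-stage recipe boils down to toggling requires_grad on the right parameter groups. Here is a minimal sketch, assuming hypothetical module names like feed_forward and time_mix rather than the real conversion code:

```python
import torch.nn as nn

class TinyBlock(nn.Module):
    # Stand-in for one converted transformer block: the pretrained feed-forward
    # is kept, and the attention has been swapped for an RWKV-style "time mix" layer.
    def __init__(self, d=64):
        super().__init__()
        self.feed_forward = nn.Linear(d, d)   # pretrained weights (kept)
        self.time_mix = nn.Linear(d, d)       # new replacement layer (learned)

def set_stage(model: nn.Module, stage: int):
    # Stage 1: freeze the pretrained feed-forward so only the new layers learn
    # to mimic the removed attention. Stage 2: unfreeze everything and train
    # jointly with a gentler learning-rate schedule.
    for name, param in model.named_parameters():
        param.requires_grad = (stage == 2) if "feed_forward" in name else True

block = TinyBlock()
set_stage(block, stage=1)
print([(n, p.requires_grad) for n, p in block.named_parameters()])
```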

The end result, surprisingly -- and to be honest, to the frustration of the RWKV MoE team, which ended up releasing on the same day -- was that with just a few hours of training on two nodes, we managed to get it to be on par, kind of, with the original Qwen 32B model.

In fact, the first run completely confused us. And I was telling Daniel Goldstein, who kind of leads most of our research coordination, "When you pitched me this idea, you told me at best you'll get the same level of performance. You didn't tell me that some of the benchmark scores would shoot up." I don't know what's happening there, but it did.

The MMLU score dropping, that was expected, because if you think about it, when we were training all the layers, right, we were essentially, like, Frankensteining this thing, and we did brain damage to the feed-forward network layer with the new RWKV layers. But 76%, hey, somehow it's retained, and we can probably further train this.

We didn't even spend, like, three days training this, so there's a lot more that can be done, hence the preview. But this brings up a big question, because we are already now in the process of converting the 70B. This is actually an extremely compute-efficient way to test our attention mechanic.

It's like, it becomes a shortcut. We are already planning to do our version 7 and our hybrid architecture for it, because we don't train from scratch, and we get a really good model out of it. And the other thing that is uncomfortable to say, because we are doing 70B right now, is that if this scales correctly to 128K context length -- I'm not even talking about a million, just 128K -- the majority of enterprise workload today is just on 70B at under 32K context length.

That means if this works and the benchmarks match it, it means we can replace the vast majority of current AI workloads, unless you want super long context. And then, sorry, can someone give us more GPUs, because we don't have the VRAM for super long context, sadly. So yeah, that's what we are working on, and essentially we are excited about this and want to just push it further.

And this conversion process, to be clear, I don't think is going to be exclusive to RWKV. It probably will work for Mamba as well. I don't see why not. And we will probably see more ideas, or more experiments, or more hybrids. Like, yeah, one of the weirdest things that I wanted to say outright, and I confirmed this with the Black Mamba team and the Jamba team, because we did the GoldFinch hybrid model, is that none of us understand why a hybrid of a state-space model, be it RWKV or state space, and a transformer performs better than the baseline of both.

It's like, when you train one and then you replace parts of it, you expect the same results. That's our pitch. That's our claim. But somehow when we jam both together, it outperforms both. And that's one area of evaluation where, like, we only have four experiments, plus four hybrids -- a lot more needs to be done.

But these are things that excite me, essentially, because that is potentially where we can move ahead. Which brings us to what comes next. So this part is kind of just some -- we'll talk a little bit about stuff that we're excited about, maybe have a little wild speculation on what's coming next.

And of course, this is also the part that will be more open to questions. So a couple things that I'm excited about is continued hardware model co-design for these models. So one of the things that we've put out recently is this library called Thunder Kittens. It's a CUDA library.

And one of the things that we found frustrating is every time that we built something in PyTorch -- and I'm sure you've had the exact same experience -- we'd have to go and spend two months in CUDA land writing these new efficient things. And if we decided to change one thing in PyTorch, like, one line of PyTorch code is like a week of CUDA code, at least.

So one of our goals with a library like Thunder Kittens, we just broke down what are the key principles, what are the key hardware things, what are the key compute pieces that you get from the hardware. So for example, on H100, everything really revolves around a warp group matrix multiply operation.

So you really want your operation to be able to split into relatively small matrix matrix multiply operations. So like multiplying two 64 by 64 matrices, for example. And so if you know that ahead of time when you're designing your model, that probably gives you some information about how you set the state sizes, how you set the update, how you set the run time.

So with Thunder Kittens, we basically built a whole library just around this basic idea that all your basic compute primitives should not be a float, but it should be a matrix and everything should just be matrix compute. And we've been using that to try to both re-implement some existing architectures and also start to design some new ones that are really designed with this core, with a tensor core primitive in mind.
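
To illustrate the "everything is a matrix tile" framing, purely conceptually, here is a matmul decomposed into 64-by-64 tile products. ThunderKittens does this at the CUDA level with tensor-core instructions; this Python loop only shows the decomposition.

```python
import torch

TILE = 64   # H100 tensor-core-friendly tile size mentioned in the talk

def tiled_matmul(a, b):
    # Break a matmul into 64x64 tile products, the shape of compute that
    # warp-group MMA instructions want.
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and m % TILE == 0 and n % TILE == 0 and k % TILE == 0
    out = torch.zeros(m, n)
    for i in range(0, m, TILE):
        for j in range(0, n, TILE):
            for p in range(0, k, TILE):
                out[i:i+TILE, j:j+TILE] += a[i:i+TILE, p:p+TILE] @ b[p:p+TILE, j:j+TILE]
    return out

a, b = torch.randn(128, 128), torch.randn(128, 128)
print(torch.allclose(tiled_matmul(a, b), a @ b, atol=1e-4))   # True
```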

Another thing that we're, at least I'm excited about is we, over the last four or five years, we've really been looking at language models as the next thing. But if you've been paying attention to Twitter, there's been a bunch of new next generation models that are coming out. So there are video generation models that can run real time that are supported by your mouse and your keyboard that I'm told if you play with them, they only have a few seconds of memory.

Can we take that model? Can we do a super long context length so that you could actually maybe generate an entire game state at a time? What does that look like for the model? You're certainly not going to do a giant quadratic attention computation to try to run that.

Maybe use some of these new models or some of these new video generation models that came out. So Sora came out two days ago now, but with super long queue times and super long generation times. So that's probably a quadratic attention operation at the bottom of it. What if I could do that and get the same quality, but a lot faster generation time or some of the demos that we saw from Paige earlier today?

If I have a super long conversation with my Gemini bot, what if I wanted to remember everything that I've seen in the last week? I mean, maybe you don't for personal reasons, but what if I did? What does that mean for the architecture? And I think that's certainly something I'm pretty excited about it too.

I think we were supposed to have some hot takes, but I honestly don't remember what our hot takes were. Yeah. Hot takes. Yes, these are our hot takes. I think the big one on Twitter that we saw, that we shared, was the question is like, is RAG relevant in the case of the future of state-based models?

Is what relevant? RAG. Right? RAG. Oh, sorry. I was like right or left? RAG. Hmm. Okay. We are live, so maybe I'll... Let's see. I haven't played too much with RAG, but when I have, I'll say I found it was a little bit challenging to do research on it because we had this experience over and over again where you could have an embedding model of any quality.

So you could have a really, really bad embedding model, or you could have a really, really good one by any measure of good. And for the final RAG application, it kind of didn't matter. That's what I'll say about RAG while I'm being recorded. I know it doesn't actually answer the question, but...

Yeah. So I think a lot of folks are like extremely excited of the idea of RWKV or state space potentially having infinite context. But I think the reality is that when we say infinite context, we just mean a different kind of infinite context, or as it's previously covered, you need to test the model differently.

So think of it more along the lines of the human. Like, I don't remember what I ate for breakfast yesterday. Yeah, that's the statement that I'll say. And we humans are not quadratic transformers. If we were -- if we increased our brain size for every second we live -- we would have exploded by the time we were five years old or something like that.

And I think basically, fundamentally for us, right, regardless of whether it's RWKV, state space, xLSTM, et cetera, our general idea is that instead of that expanding state, that increase in computational cost, what if you have a fixed state size? And information theory dictates that that fixed state size will have a limit.

Just how big of a limit is a question. Like, RWKV is running at 40 megabytes for a state. Its future version might run into 400 megabytes. That is, like, millions of tokens if we're talking about the mathematical maximum possibility. It's just that I guess we were all more inefficient about it, so maybe we hit 100,000.
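
As a back-of-envelope version of that claim, with made-up numbers for how much useful information the state keeps per token:

```python
# Back-of-envelope for the "fixed state" claim (illustrative assumptions only):
state_bytes = 40e6            # ~40 MB recurrent state, as quoted in the talk
bits_per_token = 16           # assume a handful of bits of useful info kept per token
theoretical_tokens = state_bytes * 8 / bits_per_token
print(f"~{theoretical_tokens/1e6:.0f}M tokens if the state were used perfectly")  # ~20M

# In practice the compression is far from perfect, which is why the speaker
# guesses something closer to ~100K usable tokens today.
```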

And that's kind of like the work we're doing, trying to push it and maximize it. And that's where the models will start differing, because they will choose to forget things, they will choose to remember things. And that's why I think that there might be some element of RAG, but it may not be the same RAG.

Maybe the model learned things, and it's like, hmm, I can't remember that article. Let me do a database search to search. Just like us humans, when we can't remember an article in the company, we do a search on Notion. Yeah. I think something that would be really interesting is if you could have facts that are -- so right now the one thing we're learning about language models is that all those parameters are around just to store random facts about the world.

And this intuition comes from the observation that if you take a really small language model, it can do things like talk to you or kind of has like the style of conversation it can learn that. But where it will usually fall over compared to a much larger one is it will just be a lot less factual about things that it knows or that it can do.

But that points to the fact that all those weights we're spending, all that SGD we're spending to train these models, are just being used to store facts. And we have things like databases that are pretty good at storing facts. So I think one thing that would be really interesting is if we could actually have some sort of outside data store that a language model can look at -- maybe one that, you know, has some sort of gradient descent in it -- that would be quite interesting.

And then maybe you could edit it, delete facts, you know, change who's president so that it doesn't get lost. Can we open up Q&A and hot takes? Sure. I have hot take Q&A. Do these scale? Will a 405B state-space model exist? Nobody does long context -- who's throwing in 2-million-token questions? Hot takes?

The "who's throwing in 2 million token questions" I think is a really good question. So actually, I was going to offer that as a hot take. I mean, my hot take was going to be that long context doesn't matter. I know I just gave a whole talk about it.

You know, what's the point of doing research if you can't, you know, play both sides. But I think for both of us, the reason that we first got into this was just from the first-principles question of: there's this quadratic thing; clearly intelligence doesn't need to be quadratic.

What is going on? Can we understand it better? You know, since then it's kind of turned into a race which has been exciting to watch like how much context you can take in. But I think it's right. Nobody is actually putting 2 million context prompt into these models. And, you know, if they are maybe we can go, you know, design a better model to do that particular thing.

Yeah, what do you think about that? So you've also been working on this. Do you think long context matters? So I'm going to burn a bit. How many of you remember the news of Google Gemini is supporting 3 million context? Raise your hand. Yeah, 2 million. Oh, it's 2 million.

Yeah, how many of you actually tried that? I use it a lot. You use it a lot for Mind's TV. I use it a lot. I throw his podcast in there. All right. So for some people there is use. And I think that's might be like this where my opinion starts to differ because I think the big labs may have a bigger role in this.

Because even for RWKV, even when we train long context, the reason why I say VRAM is a problem is that we need to backprop against the states, so we actually need to maintain the state in between the tokens across the token length. So that means we need to actually roll out the whole 1 million context if we are actually training 1 million, which is the same for transformers, actually.

But it just means we don't magically reduce the VRAM consumption in the training time space. So that is the one that VRAM bottlenecks. And I'm neither OpenAI nor Google. So donate GPUs if you have too much of them. But then putting it back to another paradigm, right, is that I think O1 style reasoning might be actually pushing that direction downward.

In my opinion, this is my partial hot take, is that if let's say you have a super big 400B model and let's say you have a 70B model that may take double the tokens but gets the same result. Strictly speaking, a 70B, and this is even for transformer or non-transformer, right, will take less resources than that 400B model even if it did double the amount of thinking.

And if that's the case, and we are still all trying to figure this out, maybe the direction for us is really getting the sub-200B models to be as fast and efficient as possible, with a very efficient architecture that some folks happen to be working on, to just reason it out over larger and larger context.

Yeah. More hot takes. Why not throw GPUs at the problem? Cerebras, Groq, fast inference. I will accept those GPUs. Please send them over. I'm on Twitter, recently on Bluesky. I'm @realDanFu, so you can follow there. Same thing, Twitter, Bluesky, I guess GitHub, Discord: Pico Creator, P-I-C-O Creator. Can you all hear me?

Yeah. One thing I'm super interested in is models that can watch forever. Obviously you cannot train something on infinite context length. How are you all thinking about that where you run on a much longer context length than is possible to train on? It's a great question. I think you guys probably had tweets along these lines, too.

When we first started doing these things, because these are all recurrent models, in theory you could just run it forever. You could just run it forever, and at the very least it won't error out on you or crash. There's another question of whether it can actually use what it's seen in that infinite context.

One place where probably the research in architectures ran faster than other research is actually the benchmarks for long context. You turn it on forever, you want to do everything or watch everything. What is it that you actually wanted to do? Can we build some benchmarks for that, then measure what's happening, and then ask the question, can the models do it?

Is there something else that they need? I think that if I were to turn back the clock to 2022, that's probably one of the things I would have done differently, which would have been actually get some long context benchmarks out at the same time as we started pushing context length on all these models.

>> I will also say the use case. I think we both agree that there's no infinite memory, and the model needs to be able to learn and decide. I think what we have observed is that one of the key advantages of this alternate attention mechanic that is not based on token position is that when you go past the 8K training context length, or a million context length, it doesn't suddenly become crazy -- it's actually still stable.

It's still able to run. It's still able to rationalize. It just starts forgetting things. But some of these things are still there in latent memory. Some of these things are still somewhat there. That's the whole point of why reading twice works, things like that. One of the biggest pushes in this direction is from both state-space and other researchers who use this architecture for time-series data, weather modeling.

You're not asking what was the weather five days ago. You're asking what's the weather tomorrow, based on the effectively infinite stream that we on this earth, and the computer, will keep running. And they found that it is better than existing transformers or existing architectures in modeling this weather data.

Controlled for the param size and stuff. I'm quite sure there are people working on that. So there are future applications in this case, right, if your question is just what's next and not what's 10 years ago. Thanks so much for having us. There's a video of us.

It's going to come out on YouTube. All right. Thank you, Eugene and Dan. So we've talked about efficiency in large-scale models. We've talked about grassroots models that come from a couple of GPUs -- no, there was more than that. It wouldn't be complete to not then talk about on-device-sized models.

For that, synthetic data is also very relevant. And joining us to talk about it will be Loubna Ben Allal. She is a research engineer at Hugging Face. And she's worked on pre-training datasets like Cosmopedia, which is going to be showcased in our event pamphlet towards the end of the day.

If you stay long enough, you will get one of them. It's really artfully done. And Cosmopedia is featured on there. It's Loubna's work. And she's also the lead on training small language models, SmolLM and SmolLM2, at Hugging Face. The TLDR on those models: they have very, very high performance without all the high cost associated with large models.

So Loubna's going to talk about how she did that, as well as synthetic data's role in that. Welcome. Thanks for the introduction. Let me see if you can see my screen. Okay, perfect. Okay, cool. We can get started. I'm very happy to be here. Thank you for the invitation.

So I'm going to be talking about synthetic data in 2024, and then I'm going to be talking about small on-device models. So I think the most interesting thing about synthetic data this year is that like now we have it everywhere in the large language models pipeline. I think initially synthetic data was mainly used just for post-training, because naturally that's the part where we needed human annotators to show the models how they should answer instructions, how they should be useful and not toxic.

And when we had LLMs that were really performant, we replaced the human annotators just with the synthetic data. And then after that, we realized that we don't really have good benchmarks to measure if models follow instructions well, if they are creative enough, or if they are chatty enough. So we also started using LLMs as judges.

And I think this year and towards the end of last year, we also went to the pre-training parts, generating synthetic data for pre-training to kind of replace some parts of the web. And the motivation behind that is that you have a lot of control over synthetic data. You can control your prompt and basically also the kind of data that you generate.

So instead of just trying to filter the web, you could try to get the LLM to generate what you think the best web pages could look like, and then train your models on that. So this is how we went from not having synthetic data at all in the LLM pipeline to having it on the web.

And so the cool thing is today you can train an LLM with an entirely synthetic pipeline. For example, you can use our Cosmopedia data sets and you can train a 1B model on 150 billion tokens that are 100% synthetic, and those are also of good quality. And then you can instruction tune the model on a synthetic SFT data set.

You can also do DPO on a synthetic data set. And then to evaluate if the model is good, you can use a benchmark that uses LLMs as a judge, for example AlpacaEval. So I think this is really mind-blowing, because just a few years ago we wouldn't think this is possible.

And I think there's a lot of concerns about model collapse, and I'm going to talk about that later. But we will see that if we use synthetic data properly and we curate it carefully, that shouldn't happen. And the reason synthetic data is very popular right now is that we have really strong models, both open and closed.

It is really cheap and fast to use compared to human annotations, which cost a lot and take a lot of time. And also for open models, right now we have some really good inference frameworks. So if you have enough GPUs, it's really easy to spawn these GPUs and generate a lot of synthetic data.

Some examples are vLLM, TGI, and TensorRT-LLM. Now let's talk about the elephant in the room, model collapse. Is this the end? If you look at the media and, for example, some papers in Nature, it's really scary, because there's a lot of synthetic data out there on the web, and naturally we train on the web.

So we're going to be training a lot of synthetic data. And if model collapse is going to happen, we should really try to take that seriously. And the other issue is that, as I said, a lot of people think the web is polluted because there's a lot of synthetic data.

And, for example, when we were building the FineWeb datasets here with Guilherme and Hynek, we were interested in how much synthetic data is there in the web. So there isn't really a method to properly measure the amount of synthetic data, or to say if a web page is synthetic or not.

But one thing we can do is to try to look for, like, proxy words. For example, expressions like "as a large language model" or words like "delve" that we know are actually generated by ChatGPT. We could try to measure the amount of these words in our datasets and compare them to the previous years.

For example, here we measured, like, the ratio of these words in different dumps of Common Crawl. And we can see that, like, the ratio really increased after ChatGPT's release. So if we were to say the synthetic data amount didn't change, you would expect this ratio to stay constant, which is not the case.
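
A crude version of that proxy-word measurement looks something like this; the pattern list and documents are illustrative, and as noted above it cannot label any individual page as synthetic:

```python
import re

PROXY_PATTERNS = [r"\bas a large language model\b", r"\bdelve\b"]  # illustrative list

def proxy_word_ratio(documents):
    # Fraction of documents containing at least one ChatGPT-flavored phrase.
    hits = sum(
        any(re.search(p, doc, flags=re.IGNORECASE) for p in PROXY_PATTERNS)
        for doc in documents
    )
    return hits / max(len(documents), 1)

dump_2021 = ["The weather report for tomorrow...", "Recipe: add two eggs and whisk."]
dump_2024 = ["As a large language model, I cannot...", "Let's delve into the details."]
print(proxy_word_ratio(dump_2021), proxy_word_ratio(dump_2024))   # 0.0 1.0
```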

So there's a lot of synthetic data probably on the web. But does this really make models worse? So what we did is we trained different models on these different dumps. And we then computed their performance on popular, like, NLP benchmarks. And then we computed the aggregated score. And surprisingly, you can see that the latest dumps are actually even better than the dumps that are before.

So if there's some synthetic data there, at least it did not make the models worse. Yeah, which is really encouraging. So personally, I wouldn't say the web is polluted with synthetic data. Maybe it's even making it more rich. And the issue with, like, model collapse is that, for example, those studies, they were done at, like, a small scale.

And you would ask the model to complete, for example, a Wikipedia paragraph. And then you would train it on these new generations. And you would do that iteratively. I think if you do that approach, it's going to have the same kind of behavior because the quality is going to be worse because the model is already small.

And then if you train it just on these generations, you shouldn't expect it to become better. But what we're really doing here is that we take a model that is very large and we try to distill its knowledge into a model that is smaller. And in this way, you can expect to get, like, better performance for your small model.

And using synthetic data for pre-training has become really popular after the Textbooks Are All You Need paper, which trained the Phi series of small models on textbooks that were generated using a large LLM. And then they found that these models were actually better than models that are much larger. So this was really interesting.

It was, like, first of its time. But it was also met with a lot of skepticism, which is a good thing in research. It pushes you to question things. Because the data set that they trained on was not public. So people were not really sure if these models are really good or maybe there's just some data contamination.

So it was really hard to check if you just have the weights of the models. And at Hugging Face, because we're, like, open source, we try to reproduce what they did. So this is our Cosmopedia data set. We basically tried to follow a similar approach to what they documented in the paper.

And we created a synthetic data set of textbooks and blog posts and stories that had almost 30 billion tokens. And we trained some models on that. And we found that, like, the key ingredient to getting a good data set that is synthetic is trying as much as possible to keep it diverse.

Because if you just throw the same prompt at your model, like "generate a textbook about linear algebra", even if you change the temperature, the textbooks are going to look alike. So there's no way you could scale to millions of samples. The way you do that is by creating prompts that have some seeds that make them diverse.

In our case, we would ask the model to generate a textbook, but make it related to an existing web page. And we also try to frame it to stay within a topic. For example, here we put an extract about cardiovascular bioimaging, and then we ask the model to generate a textbook related to medicine that is also related to this web page.
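
As a rough illustration of this seeding idea, here is a minimal sketch of how such prompts could be assembled. The template, topics, and extracts are invented for the example; they are not the actual Cosmopedia prompts.

```python
# Minimal sketch of seeded prompt construction in the spirit of Cosmopedia.
# Template, topics, and extracts are illustrative; the real pipeline retrieves
# seed pages from FineWeb per topic.
import random

TEMPLATE = (
    "Write a {audience} textbook section about {topic}.\n"
    "It should be related to the following web extract:\n\n{extract}\n"
)

def build_prompt(topic, extract, audience="college-level"):
    return TEMPLATE.format(audience=audience, topic=topic, extract=extract)

seed_pages = [
    ("medicine", "An overview of cardiovascular bioimaging techniques..."),
    ("mathematics", "A forum post explaining eigenvalues with examples..."),
]
audiences = ["middle-school", "college-level"]

# Varying the seed extract and the audience is what keeps generations diverse,
# even when the underlying topic repeats.
topic, extract = random.choice(seed_pages)
print(build_prompt(topic, extract, random.choice(audiences)))
```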

And this is a really nice approach, because there are so many web pages out there, so you can be sure that your generations are going to be diverse when you change the seed example. One thing that's challenging with this is that you want the seed samples to be related to your topics.

So we use a search tool to go through the FineWeb datasets and find the pages that are related to the topics we're interested in. And then we also do a lot of experiments with the type of generations we want the model to produce. For example, we ask it for textbooks for middle school students or textbooks for college students.

And we found that some generation styles help on some specific benchmarks while others help on other benchmarks. For example, college textbooks are good for MMLU, while middle school textbooks are good for benchmarks like OpenBookQA and PIQA. This is a sample from our search tool: you have a top category, which is a topic, then some subtopics, and then the topic hits, which are basically the web pages in FineWeb that belong to these topics.

And here you can see the comparison between Cosmopedia, where we had two versions, V1 and V2 in blue and red, and FineWeb. And as you can see, throughout the training, training on Cosmopedia was consistently better. So we managed to get a dataset that was actually good to train these models on.

It's, of course, much smaller than FineWeb, only 30 billion tokens, but that's roughly the scale of the Microsoft datasets. So we kind of managed to reproduce a bit of what they did. And the dataset is public, so everyone can go there and check if everything is all right. And this is a recent paper from NVIDIA, Nemotron-CC.

They took things a bit further and generated not a few billion tokens but 1.9 trillion tokens, which is huge. And we can see later how they did that; it's more of a rephrasing of the web. So today there are some really huge synthetic datasets out there, and they're public, so you can try to filter them even further if you want to get more high-quality corpora.

So for this rephrasing of the web, the approach was suggested in a paper by Pratyush, where basically they take some samples from the C4 dataset and then use an LLM to rewrite these samples into a better format. For example, they ask an LLM to rewrite the sample into a Wikipedia passage or into a Q&A page.

And the interesting thing in this approach is that you can use a model that is small, because rewriting doesn't require much knowledge. It's just rewriting a page into a different style, so the model doesn't need the kind of extensive knowledge it would need if you asked it to generate a new textbook from scratch without any ground truth.
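
A minimal sketch of what this rephrasing loop can look like, assuming some generate_fn backed by whichever inference backend you use (vLLM, TGI, etc.); the prompts are illustrative, not the ones from the paper.

```python
# Minimal sketch of "rephrasing the web": rewrite raw web text into cleaner
# styles with a (small) LLM. `generate_fn` stands in for your inference backend.
STYLES = {
    "wikipedia": "Rewrite the following text as a clear encyclopedia passage:\n\n{text}",
    "qa": "Rewrite the following text as a question-and-answer exchange:\n\n{text}",
}

def rephrase(text, style, generate_fn):
    prompt = STYLES[style].format(text=text)
    return generate_fn(prompt)

# Toy usage with a dummy generator so the sketch runs without a model.
dummy_generate = lambda prompt: "<model output for: " + prompt[:40] + "...>"
sample = "buy cheap widgets!!! our widgets are the best widgets since 1995 ..."
print(rephrase(sample, "wikipedia", dummy_generate))
print(rephrase(sample, "qa", dummy_generate))
```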

So here they rewrite some samples from C4 into Q&A, into Wikipedia, and they find that doing this works better than training just on C4. And so what they did in Nemotron CC is a similar approach. They rewrite some pages from Common Crawl for two reasons. One is to, like, improve pages that are low quality.

So they rewrite them into, for example, a Wikipedia-style page so they look better. And another reason is to create more diverse datasets. So they have a dataset that they already heavily filtered, and then they take these pages that are already high quality and ask the model to rewrite them into Q&A format, like open-ended questions or multiple-choice questions.

So this way they can reuse the same page multiple times without fearing, like, having multiple duplicates because it's the same information but it's going to be rewritten differently. So I think that's also a really interesting approach for, like, generating synthetic data just by rephrasing the pages that you already have.

There's also this approach called Prox where they try to start from a web page and then they generate a program which finds how to rewrite that page to make it better and less noisy. For example, here you can see there's some leftover metadata in the web page and you don't necessarily want to keep that for training your model.
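
For illustration only, here is the kind of tiny cleanup routine such a generated program might amount to for one page; the patterns below are made up, and the real method produces page-specific programs.

```python
# Illustrative only: a small cleanup "program" that drops lines which look
# like leftover metadata. The noise patterns are invented for this sketch.
import re

NOISE_PATTERNS = [
    re.compile(r"^(published|updated|posted) on", re.IGNORECASE),
    re.compile(r"^(share|tweet|print) this", re.IGNORECASE),
    re.compile(r"^\d+ comments?$", re.IGNORECASE),
]

def clean_page(text):
    kept = []
    for line in text.splitlines():
        if any(p.search(line.strip()) for p in NOISE_PATTERNS):
            continue  # drop noisy metadata lines
        kept.append(line)
    return "\n".join(kept)

page = "Published on 2021-03-01\nShare this article\nThe actual content starts here."
print(clean_page(page))
```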

So they train a model that can generate programs that can, like, normalize and remove lines that are extra. So I think this approach is also interesting but it's maybe less scalable than the approaches that I presented before. So that was it for, like, rephrasing and generating new textbooks. Another approach that I think is really good and becoming really popular for using synthetic data for pre-training is basically building better classifiers for filtering the web.

For example, here we released a dataset called FineWeb-Edu, and the way we built it is by taking Llama 3 and asking it to rate the educational content of web pages from 0 to 5. So, for example, if a page is a really good textbook that could be useful in a school setting, it would get a really high score.

And if a page is just an advertisement or promotional material, it would get a lower score. And then we take these synthetic annotations and train a classifier on them, a small BERT-like model. And then we run this classifier on all of FineWeb, which is a 15 trillion token dataset, and we only keep the pages that have a score higher than 3.
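
Here is a minimal sketch of that filtering pipeline. The real classifier is a BERT-style model trained on LLM scores over millions of pages; a TF-IDF plus logistic regression stand-in and fake data keep the sketch small and runnable.

```python
# Minimal sketch of classifier-based filtering in the spirit of FineWeb-Edu.
# A TF-IDF + logistic regression model stands in for the BERT-style classifier,
# and all data below is fake.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# (text, llm_score) pairs: the scores would come from prompting an LLM
# to rate educational value from 0 to 5.
annotated = [
    ("Introduction to photosynthesis for biology students ...", 5),
    ("Chapter 3: solving linear equations step by step ...", 4),
    ("LIMITED OFFER!!! buy our miracle supplement today ...", 0),
    ("celebrity gossip roundup of the week ...", 1),
]
texts, scores = zip(*annotated)
labels = [1 if s >= 3 else 0 for s in scores]  # keep-vs-drop threshold at 3

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

# Filtering step: run the classifier over the big corpus and keep positives.
corpus = ["A gentle tutorial on matrix multiplication ...",
          "CLICK HERE to win a free cruise ..."]
kept = [doc for doc, keep in zip(corpus, clf.predict(corpus)) if keep == 1]
print(kept)
```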

So, for example, in our case we went from 15 trillion tokens to just 1.5 trillion tokens that are really highly educational. And as you can see here, FineWeb-Edu outperforms all the other public web datasets by a large margin on a couple of benchmarks. Here I show the aggregated score.

And you can see that this approach is really effective for filtering web datasets to get better corpora for training your LLMs. Others have also tried this approach. There's, for example, the DCLM dataset, where they also train a classifier, but not to detect educational content. Instead, they trained it on the OpenHermes dataset, which is a dataset for instruction tuning.

And also on the ELI5 subreddit. And they also get a really high-quality dataset, which is very information-dense and can help you train some really good LLMs. And then Nemotron-CC also did this approach, but instead of using one classifier, they used an ensemble of classifiers.

So they used, for example, the DCLM classifier and also classifiers like the ones we used in FineWeb-Edu, and then they combined these scores with an ensemble method to only retain the best high-quality pages. And they get a dataset that works even better than the ones we developed. So that was it for synthetic data for pre-training.
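
Before moving on to post-training, here is a minimal sketch of what ensembling several quality classifiers can look like; the scoring functions and the threshold are placeholders, not the actual Nemotron-CC classifiers.

```python
# Minimal sketch of an ensemble of quality classifiers. The scorers below
# are dummies standing in for real FineWeb-Edu / DCLM style classifiers.
def edu_score(doc):        # stand-in for an educational-quality classifier
    return 0.9 if "tutorial" in doc.lower() else 0.2

def dclm_score(doc):       # stand-in for a DCLM-style quality classifier
    return 0.8 if "explain" in doc.lower() else 0.3

def ensemble_keep(doc, threshold=0.5):
    scores = [edu_score(doc), dclm_score(doc)]
    return sum(scores) / len(scores) >= threshold

docs = ["A tutorial that explains gradient descent", "one weird trick doctors hate"]
print([d for d in docs if ensemble_keep(d)])
```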

Now, we can go back to post-training. I think there's a lot of interesting post-training datasets out there. One that was released recently is Agent Instruct by Microsoft, where they basically try to target some specific skills and improve the performance of models on them. For example, here you can see code, brain teasers, open domain QA.

And they managed to get a dataset such that fine-tuning Mistral 7B on it outperforms the original Instruct model that was released by Mistral. And as I said, to get good synthetic data, you really have to have a framework to make sure that your data is diverse.

So, for example, for them, they always seed the generation of instructions on either source code or raw text documents, and then they rewrite them to make sure they're easier to generate instructions from, and use that for their instruction data generation. There's also the Tulu 3 SFT mixture, which was released recently by Allen AI.

It's also really good quality, and it covers a wide range of tasks. And the way they make sure that this dataset is diverse is by using personas from the Persona Hub dataset, which is basically a dataset of, I think, over a million personas. And, for example, in the Tulu mixture, to generate a new code snippet they would give the model a persona, for example a machine learning researcher interested in neural networks, and then ask it to generate a coding problem.
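
A minimal sketch of persona-seeded generation along these lines; the personas and prompt template are illustrative, and generate_fn stands in for whatever LLM backend you use.

```python
# Minimal sketch of persona-seeded instruction generation (Persona Hub style).
# Personas and the template are invented; `generate_fn` is your LLM backend.
import random

personas = [
    "a machine learning researcher interested in neural networks",
    "a high-school teacher preparing a programming club session",
    "an embedded-systems engineer who cares about memory usage",
]

def coding_problem_prompt(persona):
    return (
        f"You are {persona}.\n"
        "Write one self-contained coding problem, then its Python solution."
    )

def sample_instruction(generate_fn):
    # Sampling a different persona per call is what keeps the dataset diverse.
    return generate_fn(coding_problem_prompt(random.choice(personas)))

# Dummy backend so the sketch runs without a model.
print(sample_instruction(lambda p: "<generated problem for prompt: " + p[:30] + "...>"))
```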

This way you make sure that your dataset is really diverse, and then you can further filter it, for example using reward models. We also released a dataset called SmolTalk, where we also tried to cover a wide range of tasks. For example, when fine-tuning Mistral 7B on this dataset, we also outperformed the original Mistral Instruct on a number of benchmarks, notably on mathematics and on instruction following with IFEval.

Another paper that's really interesting I wanted to mention is this one called Multilingual Data Arbitrage by Cohere. And, basically, they want to generate a dataset for post-training that is multilingual, and they have a really interesting problem. It's the fact that there isn't a single model that's really good at all the languages they wanted.

So what they do is they use not just one teacher model, but multiple teachers. And then they have a router, which basically sends the prompts they have to all these models, and then they get the completions, and they have a reward model that rates all these generations and only keeps the best one.

And this is like arbitrage in finance. So I think what's interesting here is that it shows that synthetic data doesn't have to come from a single model. You can pool these models together and get a dataset that's really high quality, that's diverse, and that covers all your needs.
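
A minimal sketch of this arbitrage setup, with dummy teachers, a trivial router, and a toy reward function standing in for the real models.

```python
# Minimal sketch of "data arbitrage": query several teacher models and keep
# the completion a reward model scores highest. Everything here is a dummy.
def route(prompt, teachers):
    # A real router might pick teachers per language; here we query all of them.
    return [(name, teach(prompt)) for name, teach in teachers.items()]

def best_completion(prompt, teachers, reward_fn):
    candidates = route(prompt, teachers)
    return max(candidates, key=lambda item: reward_fn(prompt, item[1]))

teachers = {
    "teacher_a": lambda p: f"[A] answer to: {p}",
    "teacher_b": lambda p: f"[B] longer, more detailed answer to: {p}",
}
reward_fn = lambda prompt, completion: len(completion)  # toy reward: prefer longer
print(best_completion("Explique la photosynthèse.", teachers, reward_fn))
```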

I was supposed to put a meme there, but I ran out of time. Yeah, so that was it for synthetic data. Now we can see what's happening in the small models field in 2024. I don't know if you know, but now we have some models, for example Llama 3.2 1B, that match Llama 2 13B, which was released last year, on the LMSYS Arena, which is basically the default go-to leaderboard for evaluating models using human evaluation.

And as you can see here, the scores of the models are really close. So I think we've made a huge leap forward in terms of small models. Of course, that's just one data point, but there's more. For example, if you look at this chart from the Qwen 2.5 blog post, it shows that today we have some really good models that are only 3 or 4 billion parameters and score really high on MMLU, which is a really popular benchmark for evaluating models.

And you can see here that the blue dots have more than 65 on MMLU, and the gray ones have less. And, for example, Llama 33B had less, so now we have a 3B model that outperforms a 33B model that was released earlier on the MMLU benchmark. So I think people are starting to realize that we shouldn't just scale and scale models, but we should try to make them more efficient.

I don't know if you knew, but you can also chat with a 3B+ model on your iPhone. For example, here, this is an app called PocketPal, where you can go and select a model from Hugging Face; it has a large choice. For example, here we loaded Phi-3.5, which is 3.8 billion parameters, on this iPhone, and we can chat with it, and you can see that even the latency is acceptable.

For example, here, I asked it to give me a joke about NeurIPS, so let's see what it has to say. Okay, why did the neural network attend NeurIPS? Because it heard there would be a lot of layers and fun, and it wanted to train its sense of humor. So, not very funny, but at least it can run on-device.

Yeah, so I think now we have good small models, but we also have, like, good frameworks and tools to use these small models. So, I think we're really close to having, like, really on-edge and on-device models that are really good. And I think for a while, we've had this narrative that just training larger models is better.

Of course, this is supported by the scaling laws. As you can see here, for example, when we scale the model size, the loss is lower, and obviously you get a better model. And we can see this, for example, in the GPT family of models, how we went from just 100 million parameters to more than a trillion parameters, and of course we all observed the performance improvement when using the latest models.
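
For reference, a common parametric form of these scaling laws (Chinchilla-style) looks like the following; the constants depend on the fit and are not taken from the talk.

```latex
% Chinchilla-style parametric scaling law.
% N: number of parameters, D: number of training tokens,
% E: irreducible loss; A, B, alpha, beta: fitted constants.
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```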

But one thing that we shouldn't forget is that when we scale the model, we also scale the inference cost and time, and so the largest models are going to cost so much more. So, I think now, instead of just building larger models, we should be focusing on building more efficient models.

It's no longer a race for the largest models, since these models are really expensive to run, and they require, like, really good infrastructure to do that, and they cannot run on, for example, consumer hardware. And when you try to build more efficient models that match larger models, that's when you can really unlock some really interesting on-device use cases.

And I think a trend that we're noticing now is the trend of training smaller models for longer. For example, if you compare how long Llama 1 was trained compared to Llama 3, there is a huge increase in the pre-training length. Llama 1 was trained on one trillion tokens, but Llama 3 8B was trained on 15 trillion tokens.

So Meta managed to get a model that's the same size but performs so much better by choosing to spend more during training, because as we know, training is a one-time cost, but inference is ongoing. If you want to know what the interesting small-model reads of 2024 are, I think this MobileLLM paper by Meta is one.

They study different models that have less than one billion parameters and try to find which architecture makes the most sense for these models. For example, they find that depth is more important than width, so it's more important to have models with more layers rather than just making them wider.

They also find that GQA helps and that tying the embeddings helps. So I think it's a nice study overall for models that are just a few hundred million parameters. There's also the Apple Intelligence tech report, which is interesting. For Apple Intelligence, they had two models: one on the server and another model on device.

It had three billion parameters, and I think the interesting part is that they trained this model using pruning and then distillation, and for example, they have this table where they show that, like, pruning and distillation works much better than training from scratch, and they also have some interesting insights about, like, how they specialize their models on specific tasks, like, for example, summarization and rewriting.

There's also this paper by NVIDIA that was released recently. I think you've already had a talk about hybrid models, which was all interesting, and for this model they used a hybrid architecture between state space models and transformers, and they managed to train a 1B model that's really performant without needing to train it on a lot of tokens.

And regarding our work, we just recently released SmolLM2, which is a series of three models that are the best in class at each model size. For example, our 1.7B model outperforms Llama 3.2 1B and also Qwen 2.5 1.5B, and the way we managed to train this model is that we spent a lot of time trying to curate the pre-training dataset.

We did a lot of ablations trying to find which datasets are good and also how to mix them. We also created some new math and code datasets that we're releasing soon, but we basically really spent a lot of time trying to find the best mixture that you can train these models on, and then we also trained these models for very long.

For example, SmolLM1 was trained only on 1 trillion tokens, but this model is trained on 11 trillion tokens, and we saw that the performance kept improving. The model didn't really plateau during training, which I think is great, and it shows that you can train such small models for very long and keep getting performance gains.

What's interesting about SmolLM2 is that it's fully open: we also released the pre-training code base, the fine-tuning code, the datasets, and also the evaluation in this repository. Also, there are really interesting small models not just for text but also for vision. For example, here you can see SmolVLM, which is a 2B model that's really efficient.

It doesn't consume a lot of RAM, and it also has a good performance. There's also Moondream 0.5B, which was released recently. It's, like, the smallest vision language model, and as you can see, there isn't, like, a big trade-off compared to Moondream 2B. So now I showed you that we have some really good small models.

We also have the tools to use them, but why should you consider using small models, and when? I think small models are really interesting because of on-device inference. Because these models are small and they can run fast, you can basically run them on your laptop but also on your mobile phone, and this means that your data stays local.

You don't have to send your queries to third parties, and this really enhances privacy; this was, for example, one of the big selling points for Apple Intelligence. Also, right now we have so many frameworks to do on-device inference, for example MLX, MLC, llama.cpp, Transformers.js. So we have a lot of options, and each of them has great features.
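
As one example of running a small model locally (here with the transformers library, since each on-device framework above has its own API), here is a minimal sketch; the model ID is an assumption, and any small instruct model would work.

```python
# Minimal local-inference sketch with the transformers library.
# The model ID below is an assumption (a SmolLM2 instruct checkpoint);
# swap in any small model, or use llama.cpp / MLX / MLC for tighter on-device setups.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="HuggingFaceTB/SmolLM2-1.7B-Instruct",  # assumed model id
)
prompt = "Tell me a short joke about NeurIPS."
print(generator(prompt, max_new_tokens=64)[0]["generated_text"])
```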

Small models are also really powerful if you choose to specialize them. For example, here there's a startup called NuMind, which took a small LLM and fine-tuned it on text extraction datasets, and they managed to get a model that's not very far from models that are much larger. So I think text extraction is one use case where small models can be really performant, and it makes sense to use them instead of much larger models.

You can also chat with these models in the browser. For example, here, you can go there, load the model, even turn off your internet, and just start chatting with the model locally. Speaking of text extraction, if you don't want to fine-tune the models, there's a really good method called structured generation, where you can basically force the models to follow a JSON schema that you defined.

For example, here, we tried to force the model to follow a schema for extracting key information from GitHub issues. So you can input free text, which is a complaint about a GitHub repository, something not working, and then you can run it there, and the model can extract anything that is relevant for your GitHub issue creation.

For example, the priority. For example, here, priority is high, the type of the issue, bug, and then a title and the estimation of how long this will take to fix. And you can just, like, do this in the browser. You can transform your text into a GitHub issue that's properly formatted.
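
A minimal sketch of what that schema-constrained extraction can look like. The schema is illustrative, and constrained_generate is a hypothetical stand-in for a structured-generation library such as Outlines, which enforces the schema during decoding.

```python
# Minimal sketch of structured generation for the GitHub-issue example above.
# `constrained_generate` is a hypothetical stand-in, not a real library API.
import json
from jsonschema import validate  # pip install jsonschema

ISSUE_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "type": {"type": "string", "enum": ["bug", "feature", "question"]},
        "priority": {"type": "string", "enum": ["low", "medium", "high"]},
        "estimate_hours": {"type": "number"},
    },
    "required": ["title", "type", "priority"],
}

def constrained_generate(free_text, schema):
    # Stand-in: a real structured-generation backend would force the model's
    # output to match `schema` token by token. Here we return a fixed example.
    return {"title": "App crashes on startup", "type": "bug",
            "priority": "high", "estimate_hours": 4}

complaint = "The app crashes every time I open it, please fix ASAP!"
issue = constrained_generate(complaint, ISSUE_SCHEMA)
validate(instance=issue, schema=ISSUE_SCHEMA)  # raises if it doesn't conform
print(json.dumps(issue, indent=2))
```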

So what's next for synthetic data and small models? I think that domain-specific synthetic data is going to be -- it's already important. It's going to be even more important. For example, generating synthetic data for math. I think this really would help improve the reasoning of a lot of models, and a lot of people are doing it.

For example, Qwen 2.5 Math, and everyone's trying to reproduce o1. So I think for synthetic data, trying to specialize it on some domains is going to be really important. And then for small models, I think specializing them through fine-tuning is also going to be really important, because a lot of companies are just trying to use these large models because they are better.

But on some tasks, I think you can already get decent performance with small models, so you don't need to pay, like, a cost that's much larger just to make your model better at your task by a few percent. And this is not just for text. I think it also applies for other modalities, like vision and audio.

And I think you should also watch out for on-device frameworks and applications. For example, the app I showed, PocketPal, Ollama, all these frameworks are becoming really popular, and I'm pretty sure we're going to get more of them in 2025, and users really like that. Maybe, to wrap up, I should also share a hot take.

I think that, like, in AI, we just started, like, with fine-tuning, for example, trying to make BERT work on some specific use cases and really struggling to do that. And then we had some models that are larger, so we just switched to, like, prompt engineering to get the models to solve our tasks.

And I think we're going back to fine-tuning, where we realize these models are really costly and it's better to just use a small model and try to specialize it. So I think it's a bit of a cycle, and we're going to start to see more fine-tuning and less of just prompt engineering the models.

So that was my talk. Thank you for following, and if you have any questions, we can take them now. >> Open to questions. Hot takes. Go for it. >> Yeah. My talk only covers text because that's what I work on, but I think it also makes a lot of sense for the other modalities, and a lot of people are already doing it, like for building their audio models, they generate synthetic audio data.

I think that's also the case for vision. I think the same applies, like, if you want to avoid model collapse, you should make sure to, like, make your data diverse and also filter it. So I think everything I said also applies for the other modalities. >> Can I jump in with a question?

>> Yeah. >> Talking about modalities, there are more physical-world modalities, like the bio modality or the action modality. Synthetic data there is quite hard to do, or at least to do well. For example, synthetic data for molecular structures or genomic sequences is really hard, but those are the areas where, to your point earlier, specialized synthetic data is very needed, and they are very data-poor right now as modalities.

What are your take on that? >> Yeah. I think, like, for you to get good synthetic data, you need to have a model that's already good at the tasks you want to generate synthetic data from. So, for example, for these use cases, the issue is that, like, we don't have, like, models that you could leverage for generating synthetic data from.

So maybe if we solve that issue of having good models first, just by training on data that's not synthetic, I think we're probably going to get there eventually, and then synthetic data is going to be a natural follow-up. >> Any other questions, maybe on device models? >> Apple Intelligence, I think, got a nod today from OpenAI, so there's that.

All right. Well, thank you, Loubna. >> Thank you. >> Thank you. >> Okay. So now we have time for a little bit of a break. We were supposed to get these pamphlets at the start, but now we have the schedules halfway through the day. So really, thanks to Toph1943 and the Singapore friends who helped us print this.

It's super nice. It's like a little take-home thing that at the back has visualizations of all the papers, the top one from each segment that we had to pick, because we only got space for one. And so, yeah, we have Loubna's SmolLM paper in here. Yeah, that's her paper right there.

So really, really cute. Thanks to Toph and the Singapore crew for doing this. We'll be back at 3, 3 p.m. All right. Bit of a break.