AI Engineer Summit 2023 — DAY 1 Livestream

00:00:00.800 |
[Pre-show music plays.] 00:01:23.200 |
Ladies and gentlemen, the opening keynote presentations will commence in the ballroom starting in 15 minutes. 00:01:30.300 |
Please make your way to the ballroom and find your seats. 00:04:22.580 |
Ladies and gentlemen, the opening keynote presentations will commence in the ballroom. 00:04:54.000 |
Please make your way to the ballroom and find your seats. 00:12:31.400 |
Ladies and gentlemen, the opening keynote presentations are starting now. 00:17:22.400 |
Ladies and gentlemen, please welcome to the stage 00:18:09.400 |
I am deeply honored to have the opportunity to host you all at this event with my co-host, Swix. 00:18:17.400 |
And I am especially delighted to kick off the presentation portion of the summit. 00:18:24.400 |
We have curated two days of stage talk content for you from some of the top founders and engineers in this burgeoning new industry of AI engineering. 00:18:35.400 |
In just a few moments, Swix is going to come on stage to help set the context for those talks. 00:18:42.400 |
But for my time on stage, I really want to know who is here. 00:18:47.400 |
A few months ago, they were just an open source project. 00:18:57.400 |
Mind you, the fastest growing open source project in history, but just an open source project. 00:19:04.400 |
Now, they are the presenting sponsor of the AI engineer summit. 00:19:09.400 |
So needless to say, we can expect some big announcements when Toran takes the stage in a few moments. 00:19:20.400 |
The social media experts who make product launches fun, and who make databases both easy and trustworthy, given that they are open source. 00:19:33.400 |
They are a diamond sponsor this year, and we are honored that they broke their no event sponsorships policy for this event. 00:19:40.400 |
So Paul, Ant, let me know how that policy change works out for you. 00:19:51.400 |
I am looking forward to asking their booth sidekick if they can see what I am wearing yet, and if they can recommend any updates to my wardrobe. 00:20:02.400 |
And if vision is not in the cards yet, Matt and Ben, I can make an intro to Logan Kilpatrick, and perhaps we can get a preview of that vision API. 00:20:13.400 |
The company that helped to commercialize this movement by pumping $13 billion into OpenAI. 00:20:20.400 |
And many of you might be thinking, well, how can I get some of that gravy for my startup? 00:20:26.400 |
Well, head over to their booth in the Carmel room, just past the elevators behind you, and have a chat. 00:20:32.400 |
They have representatives from Microsoft for Startups who can help to open those doors for you. 00:20:38.400 |
And if you have any questions for the co-pilot team, well, Microsoft's booth is right next to GitHub. 00:20:44.400 |
Oh, and happy belated birthday, Cloudflare, who is here. 00:20:52.400 |
Cloudflare, you are now 13, so that means you're going to be going through some changes. 00:20:58.400 |
But don't worry, there's plenty of folks who will be delighted to talk to you about those changes at your booth, 00:21:03.400 |
just next to Microsoft and Carmel across the elevators. 00:21:10.400 |
The VC half of the Latent Space podcast, but no less technical than Swix. 00:21:17.400 |
Perhaps after watching some of today's talks, we can convince him to come back to the founder world. 00:21:24.400 |
But this time, Alessio, let's get you a software startup and leave the hardware to the experts. 00:21:40.400 |
You're one of 500 people who were selected to attend the inaugural AI Engineer Summit. 00:21:47.400 |
This means that you're not only an experienced software engineer, 00:21:54.400 |
but that you're actively experimenting with, shipping to production, 00:21:58.400 |
or have founded an AI-enhanced or AI-native apps and companies. 00:22:03.400 |
So I want you to remember this when you're interacting with one another. 00:22:06.400 |
But we also want to recognize the many people who are not here today. 00:22:12.400 |
We thank you and appreciate you, and we hope that you enjoy the content on the live stream. 00:22:18.400 |
And be sure to watch tomorrow's opening keynote address for some exciting announcements for 2024. 00:22:29.400 |
We hope that you've been enjoying your conversations with other attendees so far. 00:22:33.400 |
One of the reasons that I really love conferences is that they bring together experts from around the world, 00:22:40.400 |
who are passionate about a singular niche subject. 00:22:44.400 |
And so most of your conversations are already high signal. 00:22:48.400 |
But is there a way we can actually improve this? 00:22:51.400 |
Many of you have downloaded our conference mobile app, Network. 00:22:59.400 |
It includes the conference schedule, a full list of all of our expo partners, 00:23:03.400 |
recommended restaurants, bars, and cafes within walking distance of the venue, 00:23:11.400 |
But in addition to these features, we're pleased to announce AI-enhanced matching. 00:23:18.400 |
Using our generative matching algorithm, which is a fancy way of saying, 00:23:24.400 |
we use LLMs to match you with the right people, 00:23:27.400 |
we can match you with people who help to solve your stated problems. 00:23:38.400 |
you can literally tell us what problem you're looking to have solved, 00:23:42.400 |
and our GMA will connect you with the right people. 00:23:51.400 |
You can download the app today at ai.engineer/network. 00:23:56.400 |
The matching is currently limited to in-person attendees of this summit only, 00:24:01.400 |
but we have one more big announcement that we'd like to make. 00:24:06.400 |
We are pleased to announce that we are open sourcing Network, 00:24:09.400 |
both the event app and the matching algorithm. 00:24:13.400 |
Our goal is to provide a simple, yet powerful, mobile event experience 00:24:20.400 |
And we hope that the matching algorithm can better assist 00:24:26.400 |
And we'd like to thank our lead engineer, Simon Stermer, for the mobile app, 00:24:30.400 |
and Sweezetz Teller for our identity matching algorithm, 00:24:33.400 |
and our infrastructure partners, Descope for auth, and Supabase for database and pgvector. 00:24:41.400 |
So a round of applause for all of our developer partners for this app. 00:24:48.400 |
So you can access the repo today at ai.engineer/network or github.com/ai.engineer/network. 00:25:04.400 |
So with that, I'd love to welcome our first speaker to the stage. 00:25:07.400 |
He's the co-host of the Latent Space Podcast and the co-founder of this very summit. 00:25:12.400 |
Please join me in welcoming to the stage, SWIX. 00:25:28.400 |
This conference would not be happening without him. 00:25:47.400 |
One, I'm carrying a magic trackpad because everyone has clickers. 00:25:53.400 |
So we're going to experiment with this today. 00:25:56.400 |
And two, I'm using like AI, like fancy new everything, right? 00:26:00.400 |
So this is Tome, and we're going to go two-dimensional with our slides as well. 00:26:07.400 |
You're all here because you believe that there's some value to this idea. 00:26:10.400 |
And then I just put like a ridiculous 1000x on this. 00:26:13.400 |
But I do think there is some meaning towards thinking about higher orders of magnitude towards 00:26:20.400 |
And that's what I would like all of you to do today and to do with your friends back home. 00:26:25.400 |
So, and obviously a lot of AI-generated art because, I mean, it's an AI conference. 00:26:32.400 |
First of all, I want to congratulate you on being here. 00:26:35.400 |
I'm not talking about here location-wise, physically. 00:26:39.400 |
I'm talking about here in terms of the point in time. 00:26:46.400 |
I would propose, around about 600 AD, this dude, Brahmagupta, he invented zero. 00:26:59.400 |
But there's certain times where if you're in that field, you have to be there. 00:27:04.400 |
If you're alive during that time, you have to be doing that thing. 00:27:12.400 |
And this conference kind of is inspired by the Solvay conference. 00:27:17.400 |
That's Albert Einstein, Marie Curie, and a lot of people that you just saw in the Oppenheimer movie. 00:27:25.400 |
If you made cars, there was a right time, 1900 to 1930. 00:27:29.400 |
If you made personal computing products, 1980 to 2010. 00:27:33.400 |
If you're a millennial, if you're very online, you ever get these memes like you're born too 00:27:38.400 |
late to explore the earth, born too early to explore the stars. 00:27:45.400 |
This is, based on demographics and history, the approximate timeline of all of humanity. 00:27:50.400 |
We know that we're roughly about 73% of all concurrent intelligences if we don't expand our own 00:27:59.400 |
So my argument and my message to you today is that you are just in time. 00:28:06.400 |
I think a lot of my technology and industrial organization thinking is informed by Carlotta Perez, one of the most influential thinkers on tech revolutions. 00:28:17.400 |
So she wrote this book about the installation and deployment periods of tech cycles. 00:28:22.400 |
And we're definitely going through one today. 00:28:24.400 |
A lot is on your mind here; I know you're here, but mentally you're also back home thinking, how much of this is a fad? 00:28:37.400 |
The historians greater than us have explored this over the industrial revolution, the age of railways, the age of heavy engineering and steel, oil, and most recently the tech revolution. 00:28:48.400 |
Funny enough, I recently put a -- they all roughly span between 50 to 70 years, and if you're around in that time, that's the field to be pursuing. 00:29:02.400 |
It's very hard historically to place a start point on something that changes human civilization. 00:29:15.400 |
So most of the time these curves are sort of theoretical. 00:29:23.400 |
Here we can actually just put the amount of compute we're using towards training models. 00:29:29.400 |
That's AlexNet right on the blue dot over there. 00:29:31.400 |
That's a huge inflection where we realized, gradually realized, it took too long to realize, but scale is starting to work. 00:29:38.400 |
And if you actually take this out, a lot of people have been taking this out, and I want you to take scaling seriously. 00:29:45.400 |
There's three reasons why six is a magic number. 00:29:47.400 |
There's a very famous investor who I shall not name. 00:29:51.400 |
Imagine roughly six orders of magnitude more compute by the end of the decade and plan for that. 00:29:55.400 |
So there is more of this coming, linear projection-wise. 00:29:58.400 |
And you can plan on a lot more investment in language models. 00:30:02.400 |
John Carmack says there's six key insights towards AGI. 00:30:05.400 |
And lastly, George Hotz has these really nice analogies. 00:30:13.400 |
GPT-4 took about 100 person years of compute. 00:30:15.400 |
You stretch it out to GPT-10, the difference between GPT-4 and GPT-10, again those six increments in GPT advancements. 00:30:24.400 |
And that would be more compute than the equivalent compute of every human who ever lived. 00:30:29.400 |
So just being in the right moment, you will get to live on top of these mega, mega trends that is greater than any single one of us. 00:30:37.400 |
And I think you're all here thinking about the AI engineer. 00:30:40.400 |
And I put it in a very, very small local context of, hey, what's the org chart? 00:30:55.400 |
It's very much of a demand and supply argument. 00:30:57.400 |
There's something like 100,000 card-carrying data science and machine learning engineers. 00:31:01.400 |
And GitHub claims to have 100 million registered developers. 00:31:07.400 |
You can debate 40 or 50 million to 100 million. 00:31:12.400 |
So we think there's going to be much more AI engineers than ML engineers. 00:31:18.400 |
Same reasons that I mentioned in the blog post that you've all read. 00:31:22.400 |
And also, why engineering and not just prompting is because LLMs themselves are not AGIs yet. 00:31:29.400 |
We actually have to coordinate them in systems of software. 00:31:32.400 |
We have to write code around them and orchestrate them with code in order to do something useful. 00:31:39.400 |
So I want to spread it out a little bit more. 00:31:41.400 |
I think that the conversation on the AI engineer has some vague discrepancies. 00:31:48.400 |
And I want to basically split it out into three areas of AI engineer. 00:31:51.400 |
Software engineer enhanced by AI tooling, like Copilot. 00:31:54.400 |
Software engineer building AI products, like Midjourney. 00:31:58.400 |
AI product that replaces a human engineer, potentially like AutoGPT and maybe Replit Ghostwriter. 00:32:06.400 |
And in case you're wondering like enhanced by versus replaces, I think about it very much like the self-driving car terms. 00:32:13.400 |
There's a difference between whether humans are in the loop or the human is the fallback. 00:32:22.400 |
The AI enhanced engineer, for people who are enhanced by AI. 00:32:25.400 |
People who build AI products, AI products engineer. 00:32:28.400 |
And then the AI engineer agent, who is not human. 00:32:32.400 |
And naturally, of course, if you're interested in sort of progressing up the career ladder, there's AI enhanced engineer, then products engineer, and engineer agent. 00:32:40.400 |
So this talk was really inspired by Amjad, who's speaking next and who gave a recent talk on the AXNCD podcast, and by Sam Altman, who actually sees 1,000x engineers in OpenAI every day. 00:32:53.400 |
And it's really a set of stackable 10 by 10 by 10 improvements. 00:32:57.400 |
Over the course of the next two days, I think you'll be seeing a lot of the speakers will be working on different parts of the stack. 00:33:04.400 |
So I really encourage you to think about where in your life this AI movement can improve and increase your productivity. 00:33:14.400 |
I'm very, very honored to have drawn from all over the world the leading lights of the AI engineering movement. 00:33:27.400 |
And it's not just about tools and speakers, it's also about you. 00:33:30.400 |
So I highly encourage you to take part in all the opportunities that we have for you to mix and mingle with each other, with the speakers, and with the sponsors as well. 00:33:40.400 |
The final word I do want to offer you is effectively what I think, in non-technical terms, the 1,000x engineer could offer. 00:33:50.400 |
My favorite advice for what a 10x engineer could look like is an engineer that teaches 10 other people what they know. 00:33:57.400 |
That's not a technical term, but it is very useful. 00:34:01.400 |
And there's all these scaling laws for networks, which I really keep in mind. 00:34:05.400 |
So you can go from O(N) to O(N²) to O(2^N). 00:34:10.400 |
But really what O(N) is, is you attending all the talks and consuming all the content and letting people in with your Pac-Man rule. 00:34:21.400 |
My very first blog post was at exactly a conference like this where I was encouraged to write something. 00:34:28.400 |
And finally, going home and then building your own networks of AI engineers and helping to grow networks of learning as well. 00:34:37.400 |
So I hope you take that with you in your AI engineer journey. 00:34:40.400 |
I hope that over the next few days you get a sense of what it's like to be at the start of an industry. 00:35:12.400 |
I agree with Swix and Ben that it feels like a moment. 00:35:21.400 |
I'm the co-founder of Replit, where we aspire to be the fastest way to get from an idea to a deployed software that you can scale. 00:35:28.400 |
So I'm going to take you back a little bit, not like Swix to the 600 AD, but perhaps to the start of computing. 00:35:50.400 |
We're going to get AGI before we get good presentation software. 00:36:02.400 |
The ENIAC was the first Turing-complete, programmable von Neumann machine. 00:36:08.400 |
The way you programmed it is like you literally punched cards. 00:36:12.400 |
Not physically, but you had a machine that sort of punched these cards. 00:36:16.400 |
These are sort of binary code for the machine to interpret. 00:36:20.400 |
There wasn't really a software industry because this was really difficult. 00:36:23.400 |
It automated some tasks that human computers did at the time. 00:36:27.400 |
But it didn't create the software industry yet. 00:36:38.400 |
And then we had compilers and higher level languages such as C. 00:36:49.400 |
But text editors were really -- or like text-based programming was at minimum a 10x improvement. 00:36:58.400 |
So we've had these moments where we've had orders of magnitude improvements in programming before. 00:37:05.400 |
And then, you know, the IDE became a thing because, you know, we had large-scale software. 00:37:10.400 |
This is a screenshot from like 2017 or '18 when we added LSP to every programming environment on Replit. 00:37:18.400 |
So anyone with an account can get IntelliSense. 00:37:22.400 |
And we're really proud about that at the time. 00:37:24.400 |
We're burning a lot of CPU doing sort of inference. 00:37:28.400 |
And, you know, if you've run TypeScript server, that's like a lot of RAM. 00:37:32.400 |
But we're really proud that we're giving everyone in the world tools to create professional-grade software. 00:37:41.400 |
About three, four years ago, we started kind of thinking about how AI could change software. 00:37:49.400 |
But with GPT-2, you know, you could sort of kind of, you know, give it some code and kind of complete part of it. 00:37:57.400 |
And we're like, okay, this thing is actually happening, and we better be part of it. 00:38:01.400 |
And so we started building, and we built this product called Ghostwriter, which does auto-complete, chat, and all sorts of things inside the IDE. 00:38:14.400 |
And in just those two years, I mean, the pace of progress across the industry, the tools, basically AI, you know, was deployed, and a lot of different engineers were using it. 00:38:26.400 |
The AI-enhanced engineer, as Swix kind of called it. 00:38:32.400 |
And so we have a world now where a lot of people are gaining a huge amount of productivity improvement. 00:38:38.400 |
I don't think we're at an order of magnitude improvement yet. 00:38:42.400 |
We're probably in the 50, 80, perhaps 100% improvement for some people. 00:38:49.400 |
And we think that's going to be 10x, 100x, perhaps 1,000x over the next decade. 00:38:55.400 |
The problem, however: Replit's mission has always been about access. 00:38:59.400 |
Our mission is to empower the next billion developers. 00:39:03.400 |
And so we really didn't want to create this world where some people have access to Ghostwriter and other people don't have access to it. 00:39:12.400 |
And we started thinking about, okay, what is it -- if you really take into heart everything that the AI engineer conference is about, 00:39:19.400 |
that we're at a moment where software is changing, where AI is going to be part of the software stack, 00:39:23.400 |
then you have to really step back a little bit and try to rethink how programming changes. 00:39:28.400 |
So our view is that these programming add-ons, such as Copilot and Ghostwriter and all these things, only go part of the way. 00:39:39.400 |
We think that AI needs to be really infused in every programming interaction that you have. 00:39:44.400 |
And it needs to be part of the default experience of Replit and, I'm sure, other products in the future. 00:39:49.400 |
That's why we're announcing today that we're giving AI to all of our millions of users that are coding on Replit. 00:39:55.400 |
And so we think this is going to be the biggest deployment of AI-enhanced coding in the world. 00:40:03.400 |
We're going to be burning as much GPU as we're burning CPU. 00:40:08.400 |
We have people all over the world coding on all sorts of devices. 00:40:28.400 |
So they're all going to be AI-enhanced engineers. 00:40:31.400 |
But, you know, as we showed, it's not just about AI-enhanced engineering. 00:40:38.400 |
So AI being part of the software creation stack makes sense. 00:40:41.400 |
But AI part of the call stack is also where a lot of value is created. 00:40:46.400 |
So that's why we're also -- we have this new product called Model Farm. 00:40:54.400 |
And Model Farm basically gives you access to models right into your IDE. 00:41:01.400 |
So all it takes is three lines of code to start doing inference. 00:41:04.400 |
We launched with Google Cloud LLMs, but we're adding Llama pretty soon. 00:41:13.400 |
And if you're an LLM provider and want to work with us and provide this on our platform, 00:41:18.400 |
But basically, everyone will get -- there's some free tier here. 00:41:23.400 |
Everyone will get free access at least until the end of the year to Model Farm 00:41:28.400 |
so you can start doing inference and start building AI-based products. 00:41:32.400 |
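As an editor's illustration of what "three lines" of Model Farm inference could look like from inside a Repl: the import path, class name, method signature, and model id below are assumptions based on the description in this talk, not a confirmed API, so check Replit's Model Farm docs for the real interface.

```python
# Hypothetical sketch of Model Farm-style inference from inside a Repl.
# The import path, class, and model id are illustrative assumptions,
# not a confirmed API; consult Replit's Model Farm docs for the real one.
from replit.ai.modelfarm import ChatModel  # assumed client in the Repl environment

model = ChatModel("chat-bison")  # one of the Google Cloud LLMs mentioned above
response = model.chat([{"author": "user", "content": "Summarize what pgvector does."}])
print(response)
```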
So next up, I'm going to bring up my colleague, the head of AI, 00:41:39.400 |
Michele Catasta, to talk about how we train our own AI models. 00:41:43.400 |
And we have one more announcement for you coming up. 00:42:08.400 |
So today I'm going to be talking about how we're training LLMs for code at Replit. 00:42:15.400 |
If you've been around Twitter, I think a bit more than a month ago, you must have read this study from SemiAnalysis. 00:42:22.400 |
And their point was, it's meaningless to work on small models trained on a limited number of GPUs. 00:42:30.400 |
And that came as a shock to us, because we had a very good success story back in May, where we started to train our models from scratch. 00:42:37.400 |
And then, you know, Amjad and I, and the AI team, started to think, are we really wasting our time here? 00:42:43.400 |
I'm going to try to convince you this actually is not the case. 00:42:46.400 |
So our code completion feature on Replit is powered by our own bespoke large language model. 00:42:53.400 |
We trained it on open source code, both published on GitHub and developed by the Replit user base. 00:43:02.400 |
So we try to find a different sweet spot compared to what you might be using with other plugins. 00:43:06.400 |
We try to keep our P95 latency below 250 milliseconds, such that the developer experience feels almost instantaneous. 00:43:13.400 |
You don't even have to think about it, and the code is going to be completed for you. 00:43:17.400 |
At the model size that we were using, we have been state of the art across the past few months. 00:43:26.400 |
Who has heard about our V1 model back in May? 00:43:35.400 |
Jokes aside, so we released Replit Code V1 3B back in May. 00:43:39.400 |
We got a lot of adoption, a lot of love, and also a lot of contribution. 00:43:43.400 |
And that's one of the key reasons why we decided to give it back. 00:43:47.400 |
Replit history has been built on the shoulders of giants, of all the people contributing to the open source space. 00:43:54.400 |
So we thought we should do exactly the same here. 00:43:57.400 |
And today, I'm going to be announcing Replit Code V1.5 3B. 00:44:03.400 |
So the evolution of the model that we released back in May. 00:44:10.400 |
So the next 10 minutes, we're going to do a technical deep dive. 00:44:12.400 |
And I'm going to tell you how we built it and why it's so powerful. 00:44:15.400 |
So first of all, we followed a slightly different recipe compared to the last time. 00:44:20.400 |
If you recall, back in May, our V1 was a Llama-style code model, which means we followed a lot of the best recipes that Meta pioneered. 00:44:29.400 |
Now we went, you know, one level up, and we are training up to 300 tokens per parameter. 00:44:35.400 |
So if you have been following the history of LLMs, even two years ago, most of the models were under-trained. 00:44:44.400 |
It's not exactly -- technically speaking, it's not correct. 00:44:47.400 |
But the truth is, you know, mid-2022, the Chinchilla paper from DeepMind came out, and it was like a big warning for the whole field. 00:44:54.400 |
Basically, what the paper tells us is that we were under-training our models. 00:44:58.400 |
We should give them way more high-quality data. 00:45:01.400 |
And in exchange, we could train smaller models. 00:45:03.400 |
So in a sense, we're amortizing training time for inference time. 00:45:07.400 |
Spending more compute to train a smaller, more powerful model. 00:45:10.400 |
And then at inference time, the latency will be lower. 00:45:13.400 |
And that's the key insight that we're going to be carrying along, you know, this old keynote today. 00:45:18.400 |
Now, differently from the V1, this time we also doubled the amount of high-quality data. 00:45:24.400 |
So we trained it on up to one trillion tokens of code. 00:45:27.400 |
It's -- the data mixture is roughly 200 billion tokens across five epochs, plus a linear cooldown at the end 00:45:34.400 |
that really allows us to squeeze the best possible performance for the model. 00:45:37.400 |
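As a quick sanity check of the numbers quoted above (roughly 200 billion unique tokens, five epochs, a 3B-parameter model), the arithmetic works out like this:

```python
# Back-of-the-envelope check of the figures quoted in the talk.
params = 3e9            # Replit Code V1.5 3B
unique_tokens = 200e9   # ~200B tokens of unique data in the mixture
epochs = 5              # repeated across five epochs (plus a linear cooldown)

total_tokens = unique_tokens * epochs       # 1e12, the "one trillion tokens"
tokens_per_param = total_tokens / params    # ~333 tokens per parameter
print(f"{total_tokens:.0e} total tokens, ~{tokens_per_param:.0f} tokens/parameter")

# Chinchilla's compute-optimal ratio was roughly 20 tokens/parameter, so
# 300+ tokens/parameter is deliberate over-training: spend more at training
# time to get a smaller, faster model at inference time.
```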
And Replit Code V1.5 this time supports 30 programming languages. 00:45:42.400 |
And we also added a mixture coming from Stack Exchange, posts that are oriented towards developers. 00:45:49.400 |
So questions about coding, questions about software engineering, and so forth. 00:45:56.400 |
Now let's go ahead and take a look inside of the dataset that we used. 00:46:00.400 |
So we started from The Stack, which is an initiative led by BigCode. 00:46:04.400 |
It's a group, you know, under the Hugging Face umbrella. 00:46:07.400 |
Very grateful for the work that these people have been doing. 00:46:10.400 |
Basically, they have built a big pipeline, getting data from GitHub, selecting top repositories, 00:46:17.400 |
cleaning up parts of the data, and then especially leaving only code that is licensed under permissive licenses, 00:46:27.400 |
Out of this mixture, we selected 30 top languages. 00:46:31.400 |
And then, really, the key secret ingredient here is how much time we spent working on the data. 00:46:39.400 |
You must have been hearing this again and again. 00:46:41.400 |
And every time you go to an LLM talk, there is someone on stage saying, "Hey, you should pay attention to data quality." 00:46:46.400 |
I'm here to tell you exactly the same once again. 00:46:48.400 |
That's probably the most important thing that you could be spending your time on. 00:46:52.400 |
Especially because the model I'm talking about today is trained from scratch. 00:46:57.400 |
All the models that we released have been trained from the very first token prepared by us. 00:47:01.400 |
So it's extremely important to have high data quality. 00:47:05.400 |
So we took inspiration from the initial quality pipelines built by Codex and by the PaLM paper. 00:47:12.400 |
And then we applied way more heuristics there. 00:47:14.400 |
So we're filtering for code that has been auto-generated, minified, non-parseable. 00:47:19.400 |
Basically, all the code that you wouldn't want your model to recommend back to you. 00:47:23.400 |
Because it's not something that you would be writing yourself. 00:47:28.400 |
And this whole pipeline has been built on Spark. 00:47:31.400 |
So I'm trying to encourage you to also think of working on your own models. 00:47:34.400 |
Because pretty much a lot of the base components are out there, available open source. 00:47:39.400 |
So you could really build the whole pipeline to train and serve an LLM with a lot of open source components. 00:47:46.400 |
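Here is a minimal PySpark sketch of the kind of heuristic filtering described above; the dataset paths, column names, and thresholds are illustrative assumptions, not Replit's actual pipeline.

```python
# Illustrative PySpark sketch of heuristic code filtering, in the spirit of
# the pipeline described above. Paths, columns, and thresholds are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("code-quality-filter").getOrCreate()
files = spark.read.parquet("s3://bucket/the-stack-subset/")  # hypothetical input

filtered = (
    files
    # Drop likely auto-generated files (simple marker heuristic).
    .filter(~F.col("content").contains("@generated"))
    # Drop likely minified files: very long average line length.
    .filter(F.col("avg_line_length") < 100)
    # Keep only files that parsed successfully in an earlier AST pass.
    .filter(F.col("is_parseable"))
    # Keep permissively licensed code only.
    .filter(F.col("license").isin("mit", "apache-2.0", "bsd-3-clause"))
)
filtered.write.parquet("s3://bucket/training-data/")  # hypothetical output
```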
And as Swix was saying, we have seen this crazy acceleration in the last nine months. 00:47:50.400 |
If you wanted to do this in 2022, good luck with that. 00:47:53.400 |
It feels like we're a decade ahead compared to last year. 00:47:57.400 |
And I didn't even expect in myself the speed to move this fast. 00:48:02.400 |
The other insight that we kind of pioneered for our V1 model turns out to be very powerful also for this new one. 00:48:10.400 |
So when we released the V1, a few weeks after, coincidentally, a very interesting paper was published called "Scaling Data-Constrained Language Models." 00:48:20.400 |
It's a great read, and it's probably one of the most interesting results in LLM, in my opinion. 00:48:26.400 |
And this intuition allowed us to basically train the model to completion, rather than making trade-offs on the data quality. 00:48:33.400 |
It allowed us to select a small, high-quality subset of data, and then repeat it several times. 00:48:38.400 |
The key finding of this paper is basically in these two plots. 00:48:41.400 |
I'm going to be sharing the slides so you can go and check the links. 00:48:44.400 |
And the idea is your loss curve, after you repeat data four or five times, is going to be comparable to training on a novel dataset. 00:48:52.400 |
Now, not only this is very useful, because it allows us to work only on high-quality data. 00:48:57.400 |
It also allows us to work with data that is exclusively released under permissive license. 00:49:02.400 |
Therefore, once again, for our 1.5 model, we're going to be releasing it open source, and it's going to be released with a commercially permissive license. 00:49:13.400 |
Just shoot us an email when you use it, because I'm very curious. 00:49:33.400 |
We trained a new domain-specific vocabulary, 32K tokens, so a small one. 00:49:37.400 |
It helps us to achieve even higher compression on the data. 00:49:42.400 |
If you've been reading, again, about LLMs, you know that from a simplistic point of view, they are data compressors. 00:49:49.400 |
So if your vocabulary allows you to pack even more data on fewer tokens, then you're basically bringing more signals to the model while you're training. 00:49:58.400 |
And with this new vocabulary, we're squeezing a few percent extra, and it's a better vocabulary for code compared to what StarCoder or Code Llama are using today. 00:50:06.400 |
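For readers who want to see the technique, here is a small sketch of training a 32K BPE vocabulary on code with the Hugging Face tokenizers library. The file paths and output directory are hypothetical; this shows the general approach, not Replit's exact recipe.

```python
# Minimal sketch: training a 32K domain-specific BPE vocabulary on code with
# the Hugging Face `tokenizers` library. Shows the general technique only.
import os
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus/python.txt", "corpus/javascript.txt"],  # hypothetical code dumps
    vocab_size=32_000,                                     # the "32K, so a small one"
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)

os.makedirs("code-tokenizer", exist_ok=True)
tokenizer.save_model("code-tokenizer")

# Fewer tokens per snippet means more code packed into the same context window.
enc = tokenizer.encode("def fib(n):\n    return n if n < 2 else fib(n-1) + fib(n-2)")
print(len(enc.tokens))
```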
We trained on 128 H100 80GB GPUs, which are as rare as, you know, gold at this point. 00:50:13.400 |
We have been on the Mosaic ML platform for a week. 00:50:16.400 |
And to our knowledge, this is the first model officially announced to be trained on H100s and released open source. 00:50:30.400 |
We have grouped-query attention, which allows us to achieve better inference performance. 00:50:35.400 |
ALiBi position embeddings, the latest optimizers in the game. 00:50:39.400 |
And that, you know, is really the reason why, at the end, you will see very exciting numbers that I don't want to spoil right away. 00:50:44.400 |
So let's start from the base model, and then there is surprise coming. 00:50:50.400 |
This is the pass@1 evaluation on HumanEval. 00:50:53.400 |
For those of you who have never heard about it, HumanEval is a benchmark released back in 2021 by OpenAI, if I recall correctly. 00:51:01.400 |
You have a natural language description of a task in English, and then you expect the model to generate a self-contained Python snippet that is then going to be tested with a test harness. 00:51:12.400 |
So you generate code, and then you execute it, and you see if the values in output are exactly what you expect. 00:51:20.400 |
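For flavor, here is a simplified sketch of how a HumanEval-style harness scores a completion. The task below is illustrative, not an actual benchmark problem.

```python
# Simplified sketch of HumanEval-style pass@1 scoring:
# the model sees a prompt like this docstring'd signature...
PROMPT = '''
def add(a: int, b: int) -> int:
    """Return the sum of a and b."""
'''

# ...generates a completion...
completion = "    return a + b\n"

# ...and the harness executes prompt + completion against hidden tests.
namespace: dict = {}
exec(PROMPT + completion, namespace)
assert namespace["add"](2, 3) == 5
print("candidate passed")  # pass@1 = fraction of problems solved on the first sample
```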
Now, an interesting evolution in the last few months in the field is we were not content with benchmarking exclusively on Python. 00:51:27.400 |
So we're also doing that across several different programming languages. 00:51:31.400 |
And this is coming from the multilingual code eval harness, again, built by BigCode. 00:51:36.400 |
And they also maintain a very interesting leaderboard. 00:51:39.400 |
So what they do is they take models across, you know, several companies and several open-source contributors. 00:51:44.400 |
They run EvalHarness themselves, and then they compile these very interesting leaderboards. 00:51:48.400 |
So you will find us there, I guess, in a few days. 00:51:52.400 |
So from the left column, we have StarCoder 3B, which, as of yesterday, was the state-of-the-art model at the 3B parameter size across languages. 00:52:03.400 |
And today, our V1.5 basically comes out on top across every single language that you see on the list. 00:52:09.400 |
But what gets me excited is not so much the fact that we are more powerful than StarCoder, which was released a few months ago. 00:52:16.400 |
What got me hyped, you know, when we were training it is that we're very, very close to Code Llama 7B. 00:52:22.400 |
So as a reminder, Code Llama 7B is a Llama 2 model from Meta, the 7B version, which has been trained on 2 trillion tokens of natural language. 00:52:31.400 |
And then it has an additional pre-training phase of 500 billion tokens exclusively on code. 00:52:36.400 |
So it's a model that is twice the size, with 2.5x more data and way more GPU compute. 00:52:44.400 |
So you see where I'm going, you know, we're getting very close. 00:52:51.400 |
This is the other model that we have been training in parallel, and this is the Replit-tuned version. 00:52:56.400 |
And it means the following: we further pre-trained it on 200 billion tokens of code, this time coming from our own developers. 00:53:04.400 |
So on Replit, when you create a public Repl, it's automatically published under the MIT license, so we use this code to further pre-train our model. 00:53:14.400 |
And we extract, again, 30 billion tokens of code, same languages, same data filtering pipeline to retain only the top-quality ones. 00:53:23.400 |
We do these three epochs, then we do also linear cooldown, and we are using, basically, the languages that are predominantly popular for REPLIT users. 00:53:35.400 |
If you go on Replit, I would say 95% of the people are mostly writing Python and JavaScript. 00:53:43.400 |
Another key insight is our cutoff for this model is literally a few weeks ago. 00:53:49.400 |
So if there is a cool new library that everyone is writing software for in the last month, our model is going to be capable of generating code that follows that library. 00:53:59.400 |
And we are going to keep, basically, these models up to date so that we can follow the trends and we can make our developers more happy. 00:54:11.400 |
So we are back to this back-to-back comparison. 00:54:15.400 |
And on the very left, we have our base model. 00:54:18.400 |
We didn't add StarCoder here for the sake of space. 00:54:21.400 |
And also, the base model is already topping it on every other language, so it didn't make sense. 00:54:27.400 |
Now we have Code Llama in between, and you can see why. 00:54:30.400 |
We are, on pretty much every language, substantially better. 00:54:34.400 |
So we have 36% on the OpenAI HumanEval benchmark. 00:54:39.400 |
As a reminder, when I was working on PaLM-Coder, for example, that was the pass@1 result that we published in early 2022. 00:54:55.400 |
And it achieves exactly the same HumanEval pass@1 performance. 00:54:58.400 |
Similarly, code-davinci-001, if you go back to the paper, gets exactly 36%. 00:55:05.400 |
So we were pretty much amazed when this happened. 00:55:08.400 |
Now, why do we go through all this struggle of training our models? 00:55:14.400 |
Not only because it's cool, you know, we love to do this stuff. 00:55:20.400 |
So we really want to go as fast as possible with the most powerful small model we could train. 00:55:27.400 |
And the reason is, all of our models are actually optimized for inference, rather than for being awesome at benchmarks. 00:55:35.400 |
The fact that that happens gives us a lot of pride, and also makes us feel good when we do a vibe check with the model. 00:55:41.400 |
And it performs as we expect, or even better. 00:55:44.400 |
But it turns out that our key result is, on a single model, with no batching, we're generating above 200 tokens per second. 00:55:52.400 |
And we tune the architecture for speed in every possible way. 00:55:56.400 |
We trained a smaller vocabulary, as I was saying before. 00:55:59.400 |
We're using Flash Attention with a Triton kernel. 00:56:03.400 |
So every single aspect is there to make sure that we can go as fast as we can. 00:56:07.400 |
And we optimize, basically, for usage on the Triton Inference Server and its acceleration framework, 00:56:15.400 |
They really squeeze, you know, the last drop out of NVIDIA GPUs. 00:56:19.400 |
But the other very interesting insight is, we work very hard, also, to make the model deployment go much faster. 00:56:26.400 |
So if you ever, you know, had the bad luck to work with Kubernetes in your life, you know, you know how painful it can get, you know, to get your pod, 00:56:36.400 |
and, you know, download all the dependencies, and build it, and yada, yada, you know. 00:56:39.400 |
So the very first time we brought this infrastructure up, it took 18 minutes to go, you know, from clicking until the model was deployed. 00:56:46.400 |
Now, if you want to, you know, adapt to the load that the application is receiving, 18 minutes, you know, looks like an eternity. 00:56:52.400 |
Like, if there is a traffic spike, good luck with that. 00:56:55.400 |
So one of our awesome engineers, Bradley, you're going to find him at the booth later today, brought this number from 18 minutes to just two minutes. 00:57:06.400 |
I'm not going to go through them, just talk to Brad. 00:57:08.400 |
The cool insight here is the fact, now, whenever we get more load, we can react very quickly. 00:57:14.400 |
And that's how we serve a very large user base. 00:57:17.400 |
So the moment that Amjad announced AI4ALL literally 10 minutes ago, we flipped the switch, and now code completion is in front of our users. 00:57:27.400 |
Now, I've been asked several times, guys, why are you releasing your model open source? 00:57:37.400 |
It turns out that the moment we did it, we got a lot of adoption. 00:57:41.400 |
And apart from a lot of love, which always feels good, it feels good to chat with other people in AI that are using what we build. 00:57:49.400 |
We also started to get fine-tuned versions, instruct-tuned versions of that. 00:57:53.400 |
And we have seen a lot of people using our small model deployed locally, say with GGML, which goes super fast on Apple Silicon. 00:58:01.400 |
And they built their own custom privacy-aware GitHub Copilot alternative with Replit Code V1. 00:58:09.400 |
So we expect the same to happen with V1.5 in the next few days. 00:58:13.400 |
As we speak also, if you go on Hugging Face, the model is available. 00:58:21.400 |
The mastermind behind it is here, and he's going to tell you every single detail on how to make it run in production. 00:58:25.400 |
And we're going to be here until tonight, so more than happy to play with the model together. 00:58:29.400 |
Now, in the last minute that I have left, I want to give a teaser of what we're going to be doing in the next few weeks. 00:58:36.400 |
So we've aligned a few very exciting collaborations. 00:58:39.400 |
The first one is with Glaive AI, and it's a company that is building synthetic datasets. 00:58:44.400 |
And we're working on an IFT version of our model, so an instruct-fine-tuned version, over 210,000 coding instructions. 00:58:57.400 |
We want to triple-check it first, so, you know, follow our Twitters. 00:59:01.400 |
And the moment that we're sure that this is performing as we expect, it's going to be out there, and you're going to be able to play with it. 00:59:11.400 |
I think Jesse is here today, and he's going to run a session later explaining exactly what this new format does. 00:59:18.400 |
I'm going to give you a teaser, and then, you know, go to Jesse's talk, and he's going to explain all the details. 00:59:22.400 |
So we are design partners on the FIST format, which is fill-in-the-syntax-tree. 00:59:28.400 |
You might have heard of fill in the middle, this concept where you can take your file, split it in a half, and then, basically, if you're writing code in between, 00:59:37.400 |
you can tell the LLM that the top of the file is your prefix, the bottom of the file is your suffix, and you give this context to the model so that it knows which part it should fill. 00:59:47.400 |
Now, we found that this format is even more powerful; it is aware of the abstract syntax tree underlying the source code. 00:59:56.400 |
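To make fill-in-the-middle concrete, here is a rough sketch of how a FIM prompt is assembled. The sentinel token names follow the common convention in the FIM literature and are assumptions; actual tokens vary per model.

```python
# Rough sketch of a fill-in-the-middle (FIM) prompt. The sentinel token
# names below follow the common convention; actual tokens vary per model.
prefix = "def greet(name):\n"
suffix = "\n    return message\n"

fim_prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"
print(fim_prompt)
# The model is trained to emit the missing middle, e.g.:
#     message = f"Hello, {name}!"
# FIST, as described above, goes further by cutting along abstract
# syntax tree boundaries instead of raw text positions.
```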
And, again, this will be out, you know, in just a matter of, like, a few days or weeks. 01:00:02.400 |
We have collaborations with the Perplexity AI guys. 01:00:07.400 |
So it's a place where they host models incredibly fast, and Replit Code V1.5 will appear there, and you can start to play with it and get a vibe check by tonight. 01:00:19.400 |
Ladies and gentlemen, please welcome to the stage the inventor of AutoGPT and his team, Toran Bruce Richards. 01:00:42.400 |
Thank you, San Francisco, for the warm welcome. 01:01:03.400 |
I'm Toran, the creator of AutoGPT, and I'm excited to show you all what the brilliant minds at AutoGPT have been working on. 01:01:13.400 |
I'm going to hand off the stage now to Silen, one of our founding AI engineers. 01:01:26.400 |
I want to talk about something that I think not many of you realize. 01:01:33.400 |
We're not achieving the peak of our potential. 01:01:36.400 |
We can all work faster, we can work better, and we can do more with less time and less stress. 01:01:46.400 |
I don't know about you, but I've stared at this interface for hours on end. 01:01:54.400 |
How would you go about filling out this spreadsheet? 01:01:56.400 |
It's lead generation: you've got the name of the company, you've got the links. 01:02:00.400 |
What you probably do is go on Google, you search, copy/paste, maybe go on LinkedIn, copy/paste, back to Google, over and over and over again for hours, going back to the same interface, going back to the same websites. 01:02:14.400 |
But what if instead of all that, you can just chat? 01:02:20.400 |
And you get the same end result, a filled out spreadsheet with all the leads. 01:02:28.400 |
Not because we're lazy, allegedly, but because we're overwhelmed. 01:02:32.400 |
Now, how would you go about cleaning out your inbox? 01:02:35.400 |
You'd sit there for hours and hours, sending the same variation of the same email, but what if you could just chat? 01:02:49.400 |
Actually, these emails will now be leads in your inbox instead of just unread emails. 01:03:02.400 |
You spend millions of dollars developing apps that take weeks, months, sometimes even years, sitting there, copy/pasting anyways, because you're probably using ChatGPT. 01:03:14.400 |
But what if, instead of all that effort, you just chat? 01:03:28.400 |
It gave hope to what a world could look like where we all reach our full potential. 01:03:35.400 |
You could see the sparks of digital artificial intelligence. 01:03:39.400 |
And in this world, everyone goes from using their minds mostly to execute menial tasks with only 10% of their brains being used for creative work, 01:03:51.400 |
to becoming creative masterminds, orchestrating the peak potential of their lives. 01:03:56.400 |
And in this world, we're all AI engineers, whether you know it or not. 01:04:02.400 |
And people have noticed, AutoGPT was the fastest repository to 100,000 stars. 01:04:12.400 |
Everyone understands what the potential of this is. 01:04:17.400 |
And it kicked off a whole new field of development. 01:04:20.400 |
A whole new paradigm of augmenting humans to give them time back and live a more stress-free life. 01:04:27.400 |
And even the major players in the space all realized how big of a deal it is. 01:04:34.400 |
And so, I want to hand off to the primary open source developer at AutoGPT to talk a little bit more about the open source repo. 01:04:55.400 |
I think all of us being here is a real testament to the power of open source. 01:05:04.400 |
And on that note, we have some really exciting news to share. 01:05:11.400 |
Because just last week, our open source repo, AutoGPT, hit 150,000 stars on GitHub. 01:05:34.400 |
But to me, it is so much more than just a number. 01:05:38.400 |
It is the 150,000 people who took an interest in what we're doing and decided to click that button. 01:05:50.400 |
It is also the 460+ contributors who took their time and effort submitting thousands of pull requests and issues. 01:06:00.400 |
In the process, and to all of them as well, thank you so much. 01:06:07.400 |
It is also the 47,000 members of our online community. 01:06:14.400 |
And all of the interesting and insightful interactions that they have given us. 01:06:19.400 |
It has been a wild ride at times, but it has allowed us to do and learn so much in the past six months. 01:06:27.400 |
And I am extremely excited for what is to come based on that. 01:06:39.400 |
Now, I have already said it, but we could not have done this without our community. 01:06:45.400 |
So, we are committed to fostering, to growing, and to empowering this community, and to building the future together. 01:06:58.400 |
And I will hand it back to Silen to tell you what that means. 01:07:07.400 |
And we haven't stayed stagnant since the open source agent originally came out. 01:07:11.400 |
We have continued to work on it, and we have continued to improve its capabilities and implement the latest cutting edge research. 01:07:16.400 |
But we have also been working on some other things. 01:07:19.400 |
To show our commitment to the agent space and the open source ecosystem, we built Forge, which is a template for any agent creator to have an easier time developing their agents from a standardized starting point. 01:07:31.400 |
We also built a dev tool UI to easily interact with and iteratively improve your agent using an intuitive interface. 01:07:40.400 |
All of these tools are built on top of the agent protocol from the AI engineer foundation and other industry standards to maximize compatibility and interoperability. 01:07:49.400 |
Anyone who implements this protocol can use our benchmark, front-end dev tool, and other offerings built on top of this protocol. 01:07:58.400 |
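For a sense of what "implements this protocol" means in practice, here is a minimal sketch of driving an Agent Protocol-compliant agent over REST. The endpoint paths follow the published spec as I understand it; the base URL and field names should be treated as assumptions and checked against the protocol docs.

```python
# Minimal sketch of talking to an Agent Protocol-compliant agent over REST.
# Endpoint paths follow the spec as published; treat exact fields as
# assumptions and verify against the Agent Protocol documentation.
import requests

BASE = "http://localhost:8000/ap/v1"  # hypothetical locally running agent

# Create a task for the agent.
task = requests.post(
    f"{BASE}/agent/tasks",
    json={"input": "Write a hello-world script"},
).json()

# Drive the task forward one step at a time until the agent reports it's done.
while True:
    step = requests.post(f"{BASE}/agent/tasks/{task['task_id']}/steps", json={}).json()
    print(step.get("output"))
    if step.get("is_last"):
        break
```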
And while this dev tool template is in beta or in alpha, it has served our participants of the current hackathon we're running, where we have $30,000 in cash on the line. 01:08:11.400 |
We've received a lot of bug fixes and insights that we're going to take into the future. 01:08:20.400 |
We've realized that coding agents are the fundamental agents of the world. 01:08:27.400 |
Let me tell you, the digital fabric, the fundamental digital fabric is code. 01:08:35.400 |
Our goal is to build a generalist agent, yes. 01:08:41.400 |
A motivated coder can get anything done, except, perhaps, build a bed frame. 01:08:50.400 |
Another thing that we've learned over time is that without a compass, you don't know where you're going. 01:08:55.400 |
You know, at the start of the repo, we were getting thousands of pull requests, and that's a pull request every two hours. 01:09:02.400 |
We had no way to know whether the pull requests were good, or how to even test these pull requests. 01:09:11.400 |
It took time to test these, and it was unnecessarily costly. 01:09:16.400 |
We created a benchmark to direct the development of the open source repo and quantitatively know if we were improving. 01:09:22.400 |
It's an easy way to know if your agents are improving down different categories, and people are currently benefiting from this for the virtual hackathon. 01:09:33.400 |
We've been running this in our CI pipeline for the past couple months on different open source agents within the ecosystem. 01:09:39.400 |
And what the tests have shown is that we're on the brink of something special. 01:09:43.400 |
These agents have shown continual improvement, and don't worry, I wouldn't put this in a research paper. 01:09:50.400 |
But there is a continuous trend from 35 to 55%. 01:09:54.400 |
This is just a graph of the success rate on the benchmark over time, over the month of August. 01:10:01.400 |
Another thing that we're committed to is safety. 01:10:06.400 |
As the ecosystem grows, and as the capabilities of agents increase, there's always questions of trust and reliability. 01:10:14.400 |
And these are problems that AutoGPT is committed to. 01:10:17.400 |
One of these problems is prompt injection, which will always be there. 01:10:21.400 |
OWASP, one of the big security organizations, has talked about this and said this is one of the big problems that not just language models face, but also agents. 01:10:31.400 |
It's essentially when agents visit a website, which all agents need to do, and the website has something malicious. 01:10:38.400 |
And then the LLM is like, "Alright, I need to be doing that now." 01:10:45.400 |
Then there's this other category that I like to call innocently malicious, where agents are just bad sometimes. 01:10:56.400 |
And in this example behind me, this person asked an open source agent to delete all the JSON files within a directory, a specific one. 01:11:06.400 |
And the agent ended up deleting all the JSON files on a laptop. 01:11:09.400 |
And this is going to continue to be a problem. 01:11:12.400 |
If we want agents to do the things that humans can do, they will need root access. 01:11:16.400 |
And so within AutoGPT, we're committed to and think about these problems extensively. 01:11:20.400 |
And we've been working on a research paper to solve some of these issues. 01:11:25.400 |
And in order for agents to be commercially viable and trusted, these safety problems need to be solved. 01:11:36.400 |
That one email that's sent could be a lost contract or a lost lead. 01:11:40.400 |
And so this is fundamental, not just to the development of the open source agent, but to all agents out there. 01:11:46.400 |
After all, our end goal is a digital AGI to augment all of humanity. 01:11:52.400 |
And I'm going to invite Craig to announce some exciting news regarding some developments with AutoGPT. 01:12:00.400 |
So, it's been a wild journey from zero to here in six months. 01:12:13.400 |
And we keep stressing this because it's so important to us. 01:12:19.400 |
Because of that shared passion in pushing the frontier of what AI agents can do. 01:12:25.400 |
So, we're really excited to announce that Redpoint Ventures has invested $12 million in turning this vision to a reality. 01:12:45.400 |
This is them showing their deep belief in our mission and their dedication to open source. 01:12:54.400 |
That's why we went with them because they are so dedicated to staying open source. 01:12:58.400 |
And that's really important to every single one of us working on this project. 01:13:06.400 |
With this funding, we want to grow our team and add more passionate individuals. 01:13:18.400 |
And let's all help make the world's best open source generalist agent. 01:13:23.400 |
Together, we can redefine the future of work. 01:13:38.400 |
Ladies and gentlemen, please welcome to the stage our next speakers. 01:13:55.400 |
Applied AI engineer at OpenAI, Simone Fishman. 01:13:59.400 |
And member of Developer Relations at OpenAI, Logan Kilpatrick. 01:14:32.400 |
So, you can think of OpenAI as a product and research company. 01:14:42.400 |
And then we think about what are some of the best ways to apply them to solve the biggest 01:14:51.400 |
Logan and I sit at the end of this deployment pipeline. 01:14:53.400 |
We work with people in the real world that are using OpenAI's models. 01:14:58.400 |
We spend our time thinking about what are some of the best ways to use our models? 01:15:02.400 |
What are some of the hardest problems that haven't been solved yet? 01:15:05.400 |
And how can we apply OpenAI technology to solve these? 01:15:13.400 |
And my name's Logan Kilpatrick, and I do developer relations stuff. 01:15:17.400 |
So, helping people build fun and exciting products and services using our API. 01:15:27.400 |
As folks can tell from the title of the talk, we'll talk about multimodal stuff. 01:15:32.400 |
But I think it's important to start off with where we are today. 01:15:35.400 |
And I think, you know, as we all know, people who have been building in the AI space for the 01:15:39.400 |
last 6, 12, 18 months, 2023 has really been the year of chatbots. 01:15:43.400 |
And I think it's been incredible to see how much people have actually been able to do, 01:15:48.400 |
how much value you can create in the world with, like, just a simple chatbot. 01:15:52.400 |
And it still blows my mind to think about how rudimentary these systems are and how much 01:15:57.400 |
more value there's going to be created in the next year, in the next decade. 01:16:02.400 |
And that's why I'm excited for 2024, which I think is really going to be the, I don't know 01:16:06.400 |
if I can trademark this, but the year of multimodal models. 01:16:10.400 |
It's a tongue twister, but also hopefully the domain is available, yearofmultimodals.com. 01:16:22.400 |
OpenAI has a ton of multimodal capabilities that are in the works. 01:16:26.400 |
Some folks might have already tried some of these in ChatGPT and the iOS app or the web 01:16:32.400 |
Things like vision, taking in images, describing them. 01:16:39.400 |
We've had this historically with DALL-E 2, but DALL-E 3, really, if folks have tried it, 01:16:45.400 |
So excited to show some of that today as well. 01:16:51.400 |
So if you think of the way that multimodal capabilities are working right now, it's a 01:16:56.400 |
little bit of a setup of islands where we have DALL-E that takes text and generates images. 01:17:04.400 |
We have Whisper that takes audio and generates text transcripts. 01:17:11.400 |
And GPT-4 with vision capabilities (GPT-4V), which takes images and text and can reason over both at the same time. 01:17:18.400 |
But right now, these are all very disparate things. 01:17:21.400 |
However, you can think of text as a connective tissue between all of these models. 01:17:29.400 |
And there's a lot of interesting things that we can build right now using that paradigm. 01:17:38.400 |
But what we're actually really excited for is a future in which there's unity between all of these models. 01:17:50.400 |
You can think of it in the same way that GPT-4 can consume images and text together. 01:17:56.400 |
Maybe in the future models will consume even more modalities and output even more 01:17:59.400 |
modalities, and will be able to reason about them at the same time. 01:18:03.400 |
So today, Logan and I are going to show you just like some architecture patterns and some 01:18:09.400 |
ways in which you can mimic this kind of situation with what we have available today and some of 01:18:17.400 |
the patterns that you can start to think about as we move towards this future in which models can do all of this natively. 01:18:25.400 |
As Simone and I were making these demos today, waiting until the last minute as always, it was really 01:18:34.400 |
interesting to see that much of the work of making multimodal systems today is how do you 01:18:38.400 |
hook everything up together and connect the different modalities. 01:18:41.400 |
And again, as Simone said, using text as sort of the bridge between different modalities. 01:18:45.400 |
But it's going to be super interesting to see like how much developer efficiency gains there are when you no 01:18:50.400 |
longer have to do that and you really just have like a single model that can do text in, text out, 01:18:55.400 |
video at some point, you know, speech in, speech out at some point. 01:18:58.400 |
So it'll be super cool to see when that's possible and make making demos even easier and simpler. 01:19:07.400 |
Alright, well, we'll show you guys two demos today. 01:19:10.400 |
And we'll talk about like some high level ideas and some high level concepts. 01:19:15.400 |
And hopefully at the end of it, you'll be inspired to think about like what are some of the things that 01:19:19.400 |
maybe you're not able to build today, but you'll be able to build six months or a year from now. 01:19:25.400 |
And how you should start thinking about your products as they are able to incorporate more modalities. 01:19:36.400 |
This is a very, very simple DALL-E plus vision loop. 01:19:42.400 |
Yeah, excited to look at this demo. 01:19:48.400 |
So Simone will pull up the demo and I'll sort of just walk through it. 01:19:51.400 |
But the basic idea is let's take a real image. 01:19:53.400 |
Let's use GPT-4V, or GPT-4 with image inputs, to essentially create a nice human-readable, understandable description of that image. 01:20:03.400 |
And then we'll put that into DALL-E 3 and actually go and generate a synthetic version of that image. 01:20:09.400 |
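A minimal sketch of that describe-then-regenerate hop, assuming the OpenAI Python SDK; the model names, prompts, and file path are illustrative rather than the demo's actual code:

```python
# A sketch of the two hops: GPT-4 with vision describes a local photo, and
# DALL-E 3 regenerates a synthetic version from that description.
import base64
from openai import OpenAI

client = OpenAI()

def describe_image(image_path: str) -> str:
    # Base64-encode the photo so it can be passed inline as an image input.
    b64 = base64.b64encode(open(image_path, "rb").read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this image in detail, as a prompt for "
                         "an image generator."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        max_tokens=500,
    )
    return resp.choices[0].message.content

def regenerate(description: str) -> str:
    # Feed the description straight into DALL-E 3; returns the image URL.
    img = client.images.generate(model="dall-e-3", prompt=description, n=1)
    return img.data[0].url

# synthetic_url = regenerate(describe_image("lobby_photo.jpg"))
```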
So this whole pipeline takes a little bit to run, because it's not a production system at the moment. 01:20:17.400 |
But the nice part is we've got a couple of examples ready, and we can kick one off live as well. 01:20:27.400 |
So this is a fun, simple idea, but this is a photo that I took in the lobby downstairs. 01:20:34.400 |
Just when you walk into the hotel, there are these kind of Halloween-themed painted ladies. 01:20:42.400 |
And so what we did here is that we asked GPT-4 with vision to describe this image in detail. 01:20:49.400 |
And then we asked it to generate a description for DALL-E to generate a new image based on this. 01:21:04.400 |
It picks up on a lot of details, like the RIP on the tombstone and the old dogs. 01:21:12.400 |
And then it generates a whole new image, but there's a lot of details that are off. 01:21:15.400 |
You know, like the marble is black and the spiders are white. 01:21:23.400 |
And so, yeah, it's close enough. 01:21:28.400 |
What we do next is we give the two images to GPT-4 with vision again. 01:21:32.400 |
And we ask it to compare them and see what some of the differences are. 01:21:36.400 |
And it picks up on a lot of the different details. 01:21:41.400 |
And then we ask it to create a new image based on these differences. 01:21:54.400 |
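A matching sketch of that comparison step, under the same assumptions (the prompt wording here is illustrative):

```python
# Hand GPT-4 with vision both images and ask for a corrected DALL-E prompt.
from openai import OpenAI

client = OpenAI()

def revise_prompt(original_url: str, generated_url: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text":
                    "The first image is the target; the second is a synthetic "
                    "attempt. List the differences, then write a new "
                    "image-generation prompt that corrects them."},
                {"type": "image_url", "image_url": {"url": original_url}},
                {"type": "image_url", "image_url": {"url": generated_url}},
            ],
        }],
        max_tokens=500,
    )
    return resp.choices[0].message.content
```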
But, you know, it matches it fairly closely. 01:21:57.400 |
And this is just to illustrate that, while there's a long way to go, there are plenty of tasks that we do right now in AI 01:22:05.400 |
where we need a human in the loop to evaluate a visual output that a model produces and compare it with something else, 01:22:14.400 |
then iterate on the instructions and pass that again to another model. 01:22:17.400 |
That's a pipeline where we thought humans were essential, and would probably continue to be essential for some time. 01:22:23.400 |
And now that's something that the models can do by themselves. 01:22:26.400 |
And there are a couple of interesting patterns here. 01:22:31.400 |
I think one of them is describing images. 01:22:33.400 |
That's powerful because now you have an image. 01:22:35.400 |
Now you have text, and you can reason about that text. 01:22:38.400 |
But another really powerful element is comparing images 01:22:42.400 |
and spotting differences: having a final destination that you want to get to, and a current state. 01:22:50.400 |
And that pattern of comparing things, you can apply to a lot of things. 01:22:54.400 |
So imagine: just before walking out here, Logan and I were chatting about some other ways that you can apply this. 01:22:58.400 |
And Logan's idea was, imagine you are curating your room, and you just moved to a new place. 01:23:05.400 |
And on Instagram, you find some images of a vibe that you like, and maybe some objects. 01:23:13.400 |
You can give that to GPT-4 with vision and you can tell it, okay, now crawl through Amazon and find all the lamps that match. 01:23:19.400 |
All the lamps that match this vibe that I want for my room. 01:23:27.400 |
So I would love to be able to just say, get me all the stuff that matches this specific vibe. 01:23:35.400 |
Simone, can I make one other quick comment? Folks were laughing in good jest when this third image came up. 01:23:44.400 |
I think it's important to know that there's no prompt engineering or anything like that here. 01:23:49.400 |
This is the rawest output that you can get. 01:23:53.400 |
So people will hopefully go wild with this once it's available through the API, and ideally get much better results than we're seeing today. 01:24:03.400 |
Probably using a bunch of techniques that other people talked about at the conference so far. 01:24:06.400 |
So this is the very basic version of this demo. 01:24:10.400 |
And we wanted to keep it simple and minimal, just to illustrate the power of the models. 01:24:14.400 |
This is as raw as you can get with these models; it's almost all just the raw completion output 01:24:20.400 |
going straight into the next model, and I think there's like 50 lines of code. 01:24:23.400 |
So the majority of the heavy lifting is being done by the models here. 01:24:26.400 |
And another quick example that I'll show you, and then I'll try to do one live, which will probably be tragic. So this is the backstage right here. 01:24:42.400 |
I just took this photo right before walking on stage. 01:24:46.400 |
You can see that GPT-4 with vision does a really good job, actually, of describing it. 01:24:51.400 |
There are the monitors, and there are boxes and cables. 01:24:56.400 |
And then this is the image that DALL-E 3 generates. 01:25:01.400 |
So you can see blue carpet, cables, boxes, all the elements. 01:25:05.400 |
And then it goes on to spot the differences, and it notices, for example, that in this image there are all these vertical lights. 01:25:13.400 |
It says that here: the lighting, all these vertical lights on the walls and ceilings. But then it rewrites the prompt and gets rid of all the vertical lights. 01:25:26.400 |
And it adds the curtain in the back, the black curtain, which wasn't present in the generated image but is present here. 01:25:33.400 |
So, just little interesting things. 01:25:35.400 |
There's still a long way to go, but this opens a whole new box of interaction patterns. 01:25:40.400 |
The fact that now you can reason visually. 01:25:45.400 |
And let's give a shot to a live example. 01:25:49.400 |
So this was a trail run that I did over the weekend, up in Purisima Woods. 01:26:09.400 |
So: the image depicts a serene and picturesque woodland setting. 01:26:13.400 |
The focus of the image is a wooden boardwalk or a footbridge that winds through the dense forest. 01:26:19.400 |
A very detailed description: light filters through the trees. 01:26:24.400 |
And I'm just passing that raw, straight to DALL-E. 01:26:28.400 |
Yeah, and if folks have seen what happens in the DALL-E mode in the ChatGPT iOS app, for example, it's actually doing a little bit of this. 01:26:40.400 |
I don't know off the top of my head what the prompt is for that, but it's doing some amount of prompt engineering. 01:26:46.400 |
If folks have actually tried to use our Labs product before to make DALL-E images, you had to do that prompt engineering yourself. 01:26:52.400 |
And I think that's been one of the limitations. 01:26:55.400 |
If people used Midjourney or other image models in the past, it's just kind of hard to make good prompts that work well for these systems. 01:27:03.400 |
So it's nice that the model can take a stab at doing it for you. 01:27:08.400 |
It's telling us how the second image is a lot more beautiful and more detailed, which checks out. 01:27:19.400 |
It's also interesting to see, just for folks to think about: 01:27:36.400 |
with these image models, 01:27:39.400 |
the main limitation, as we're seeing in this demo in real time, is actually the generation speed. 01:27:49.400 |
And if we have time, it'll probably work the second time around. 01:28:10.400 |
For the second demo, we're going to take it a little bit further, and we're going to work with video. 01:28:21.400 |
And the idea here is that there are a lot of video summarization demos out there that we've seen. 01:28:27.400 |
The majority of them just take a transcript and then ask GPT-4 to summarize that transcript. 01:28:32.400 |
However, videos have a lot of information in them that is conveyed visually. 01:28:36.400 |
And so what we're doing here is taking frames from the video. 01:28:41.400 |
And then we're asking GPT-4 with vision to describe all the frames. 01:28:47.400 |
And then we are asking Whisper to transcribe the video. 01:28:50.400 |
And now we have this long textual representation of the video that not only includes all the audio information, 01:28:56.400 |
but also includes visual information from the video. 01:28:58.400 |
And then we're doing some exciting mixes on top of that, which Logan will tell you about. 01:29:08.400 |
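A rough sketch of that frames-plus-transcript pipeline, assuming OpenCV for frame sampling and the OpenAI Python SDK; the sampling interval and prompts are illustrative, not the demo's actual code:

```python
# Sample frames, transcribe the audio, then fold both into one article.
import cv2
from openai import OpenAI

client = OpenAI()

def sample_frames(video_path: str, every_n_seconds: float = 30.0) -> list[str]:
    """Save one frame every N seconds, returning the saved file paths."""
    cap = cv2.VideoCapture(video_path)
    step = int(cap.get(cv2.CAP_PROP_FPS) * every_n_seconds)
    paths, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % step == 0:
            paths.append(f"frame_{i}.jpg")
            cv2.imwrite(paths[-1], frame)
        i += 1
    cap.release()
    return paths

def transcribe(audio_path: str) -> str:
    """Whisper turns the audio track into a transcript."""
    with open(audio_path, "rb") as f:
        return client.audio.transcriptions.create(model="whisper-1", file=f).text

def write_article(frame_descriptions: list[str], transcript: str) -> str:
    """Combine the visual and audio information into one blog-post draft."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content":
            "Write a blog post about this video.\n\nTranscript:\n" + transcript
            + "\n\nWhat is shown on screen:\n" + "\n".join(frame_descriptions)}],
    )
    return resp.choices[0].message.content
```

Each sampled frame would be described with a GPT-4-with-vision call like the one sketched earlier, and those descriptions feed `write_article` alongside the transcript.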
So for, for this demo, we're literally just taking the GPT-4 introduction, uh, video. 01:29:13.400 |
If folks have seen on YouTube, it's a good video if you haven't seen it before. 01:29:19.400 |
Taking the raw video from YouTube, again, like Simone said, cutting up the different frames from the video, 01:29:27.400 |
putting those into GPT-4 with image input, and getting the summaries, which you can see. 01:29:33.400 |
But it's literally just saying what's in the frames; these are simple images, 01:29:37.400 |
so it's easy to capture the depth of what's shown here. 01:29:40.400 |
We take those images and then go to the next piece, which is essentially another wonderful DALL-E image 01:29:50.400 |
plus a big description built from the transcript and all of the frame descriptions; essentially, image embeddings is the easiest way of thinking about it. 01:30:00.400 |
So if you want to actually see the results of this QR code, bottom right hand corner is real. 01:30:05.400 |
Um, you can scan it and see the resulting article. 01:30:15.400 |
And for me, why this is exciting is because you can, again, capture the depth of what happens in a video. 01:30:25.400 |
So a DALL-E image to start, and then a bunch of actual frames that match up with the contextual representation of what's being talked about in the blog post. 01:30:37.400 |
I couldn't open-source the code because it has a bunch of unreleased APIs, but there's no sort of magic behind-the-scenes stuff happening. 01:30:45.400 |
This is a raw, crappy prompt to generate this blog post, which I think is really cool; it takes videos and makes them more accessible in text form. 01:31:36.400 |
That's something new that's happening these days. 01:31:40.400 |
And if you have any crazy ideas where you think, wow, it would be really cool if technology could do this, we'll probably be able to get there. 01:31:48.400 |
And the products that you'll be able to build six months from now, a year from now, are going to be incredible. 01:31:53.400 |
So start having this in mind, as people who are building AI products and people who are building companies. 01:31:59.400 |
Think of text as the connective tissue right now. 01:32:04.400 |
And I think this is a very powerful concept. 01:32:06.400 |
And that's gonna continue to be the case for the near future. 01:32:09.400 |
And there are many powerful patterns that are yet to be explored when it comes to multimodal stuff, especially when it comes to doing things with images. 01:32:18.400 |
So really excited to soon get this into the hands of all of you, and to see what you all build with this. 01:32:24.400 |
I think it's really exciting to see AI start to venture into the visual world. 01:32:31.400 |
Yeah, agents with image input is going to be sick. 01:32:36.400 |
I feel like so much of the internet requires that. 01:32:40.400 |
I think there's, there's a lot of stuff that's going to happen in the, in the near future. 01:32:43.400 |
And, um, I think it's cool to be able to hopefully get a glimpse of, of what some of those use cases look like. 01:32:55.400 |
And now, please welcome the founder and CEO of Lindy, Flo Crivello. 01:33:20.400 |
I'm going to talk about the future that awaits us and not the super distant future either. 01:33:46.400 |
Like I'm talking about five, ten years, certainly within our lifetimes. 01:33:51.400 |
The future that awaits us once agents have fully realized their potential. 01:33:58.400 |
If I had to describe it in one sentence, I'd say that it's a world where a 25-year-old can have the same or more business impact as the Coca-Cola company. 01:34:10.400 |
It sounds insane when you say it this way, but there's actually a precedent to it. 01:34:18.400 |
Consider what Oprah had to do to build her media empire. 01:34:22.400 |
She had to go and pitch a bunch of CNN executives in some stuffy room, and raise money, hire a crew, find cameramen. 01:34:29.400 |
And obviously, the internet and apps like YouTube have brought that friction down to zero. 01:34:40.400 |
And it's brought that friction down so much that it actually sounds like a joke. 01:34:51.400 |
He actually has more of a reach than the Super Bowl. 01:34:55.400 |
And it's just him and his laptop is how he got started. 01:35:03.400 |
He started when he was three years old, making videos on YouTube of him unboxing and reviewing toys. 01:35:10.400 |
Today, he's 12 years old and he's worth $100 million. 01:35:19.400 |
My point is that once you bring the friction down to zero, and once you remove the gatekeepers, you don't just get the same kind of content except cheaper. 01:35:31.400 |
The nature of the content changes when you remove the gatekeepers. 01:35:46.400 |
And so my point is that we're about to see the exact same transformation happen to the world of business at large. 01:35:53.400 |
We're going to take the friction down to zero. 01:35:56.400 |
And as a result, we are going to see much weirder, more creative ideas come to life. 01:36:08.400 |
Before it, I had another one called TeamFlow. 01:36:10.400 |
And I remember when I started it, I had a perhaps naive understanding of what starting a business entailed. 01:36:15.400 |
I thought it was all about building a cool product and bringing it to market. 01:36:19.400 |
And then I found out that's actually the fun part. 01:36:23.400 |
Before you get there, people -- I see the founders laughing in the audience. 01:36:26.400 |
Before you get there, you've got to meet with lawyers and incorporate and meet with bankers and open a bank account and meet with VCs and raise money and meet with recruiters and hire a team. 01:36:36.400 |
And I mean, you guys know, once you have a business, it's not much easier. 01:36:40.400 |
So when that wave of generative AI came about, all these amazing products that you're seeing that generate copywriting for you, generate images, I was like, that's awesome. 01:36:51.400 |
But it doesn't solve my problem of it's just too darn painful to start a business. 01:36:56.400 |
And also, the GDP isn't made of copywriters or illustrators. 01:37:05.400 |
So that's when I got interested in agentic AI. 01:37:08.400 |
AI that can actually automate the menial parts of your life. 01:37:15.400 |
There's this amazing movie, Office Space, by the same guy who made Silicon Valley. 01:37:21.400 |
And there's this awful, depressing character in there named Milton. 01:37:26.400 |
He spends his life in some basement doing God knows what. 01:37:32.400 |
And in the end, Milton does the most productive thing of his career, 01:37:38.400 |
which is that he burns the building to the ground. 01:37:52.400 |
You know, I take this as a symbol that no one is happy with this status quo. 01:37:59.400 |
People are always worried about, oh, robots are stealing people's jobs. 01:38:02.400 |
I think it's people who have been stealing robots' jobs. 01:38:11.400 |
I think it's a huge problem when you look at the data. 01:38:15.400 |
The average manager in the U.S. spends 15 hours every week on this kind of administrative task. 01:38:22.400 |
That's $459 billion every year just in the U.S. 01:38:36.400 |
So we built Lindy, and the first thing she does very well is act as your personal assistant. 01:38:47.400 |
The good news is as we've dug into that problem space, we found out there are three big time wasters. 01:38:58.400 |
This is where you spend your life at work and you hate it. 01:39:02.400 |
So the product we built, those are actual screenshots of the product. 01:39:07.400 |
You can ask it to schedule your meetings for you. 01:39:12.400 |
This example here is actually pretty cool because it demonstrates another ability of Lindy, 01:39:16.400 |
which is she continuously learns from her interactions with you. 01:39:23.400 |
So here I was like, help me find half an hour every week with Eric. 01:39:27.400 |
And she called it Flo Eric because previously I had asked her, find 30 minutes with Eric tomorrow. 01:39:33.400 |
And she did that, but she named the meeting, meeting with Flo. 01:39:40.400 |
And I was like, no, I gave her a little bit of feedback. 01:39:50.400 |
And she saved the preference for future instances. 01:39:55.400 |
And generally, I can give any arbitrary preference of any complexity that I want to Lindy. 01:40:07.400 |
I can CC Lindy to my emails so that she helps me schedule them. 01:40:14.400 |
And when you use Lindy, she can pre-draft your replies for you in your inbox, in your voice, for each individual recipient. 01:40:24.400 |
Because you don't talk the same way to your partner as you do to your investors, hopefully. 01:40:31.400 |
So every morning I wake up, I open my Gmail, and I just have all the drafts ready for me to review. 01:40:41.400 |
So I've asked her, hey, five minutes before every meeting, send me the Zoom link, the LinkedIn's of the people I'm meeting with, and the summary of my last few emails with them. 01:40:57.400 |
The really crazy thing is that we ourselves didn't actually build any of these features. 01:41:03.400 |
What we did is we built a universal framework, allowing an AI to pursue any arbitrary goal using any arbitrary tool. 01:41:12.400 |
And some very complex and sophisticated behaviors come out of that, as we'll see later. 01:41:22.400 |
Now, my pet peeve, every time people go on stage and they talk about their AI products, they always talk about the good part. 01:41:30.400 |
And so I'm going to break that pattern a little bit today, and I'm going to talk about a time when it didn't work. 01:41:37.400 |
A few weeks ago, I asked Lindy to help me work on my vocabulary, and every morning to send me a new interesting word. 01:41:46.400 |
Every morning I wake up, I have a new word in my inbox. 01:41:53.400 |
A captivating term, denoting the act of meandering through a conversation with no fixed direction. 01:42:02.400 |
And I was like, huh, I hadn't heard of this one before. 01:42:09.400 |
So then I went back and re-Googled every word that she sent me, and none of them existed. 01:42:22.400 |
So if I've used any word that doesn't exist today, that's why. 01:42:34.400 |
And what it means for you when it works is that your computing experience of the future isn't one when you're in a basement filing TPS reports all day. 01:42:47.400 |
It's not one where you're working on your computer. 01:42:52.400 |
It's one where you're having a conversation with your computer. 01:43:03.400 |
And all the menial, awful parts of your work that you hate arrange themselves automatically for you. 01:43:14.400 |
I cannot wait for this to fully come to fruition. 01:43:16.400 |
But it doesn't yet get you to the stage that I was talking about where a 25-year-old has more business impact than the Coca-Cola company. 01:43:27.400 |
In order to get there, you have to go one step further. 01:43:31.400 |
And instead of having just one Lindy work for you as your assistant, you can have an entire society of Lindys working together on your business to pursue your goals. 01:43:48.400 |
If you want to realize how powerful that can be, consider the fact that every single item around you in the room right now was made by a group of people. 01:43:58.400 |
Even the simplest of items, not one person can do it. 01:44:02.400 |
In fact, there's this guy who ran this project called the toaster project. 01:44:06.400 |
He wanted to see, can a single human make a very simple item like a toaster? 01:44:13.400 |
It cost him $2,000, probably more like 50K if you include the value of his time. 01:44:19.400 |
And this is what he ended up with in the end. 01:44:21.400 |
Or, he could have gone to Amazon and bought a perfectly fine toaster for $25. 01:44:29.400 |
I think this contrast is a good illustration of the difference in abilities between one person, six months, 50K, pretty bad looking toaster. 01:44:41.400 |
And a group of people, $25, perfectly fine toaster. 01:44:49.400 |
You go to GPT-4, you ask it to do something like, hey, build an entire iOS app for me. 01:44:55.400 |
Design it, publish it to the App Store, do everything. 01:45:00.400 |
And then people conclude, oh, GPT-4 can't do it. 01:45:07.400 |
I think it's the same thing as if you went to some guy and you asked him, make a rocket for me. 01:45:12.400 |
And then you're like, oh, humans can't make rockets. 01:45:18.400 |
So that's exactly what we built is a framework for multiple agents to work together in pursuit of your goals. 01:45:31.400 |
The most awesome example that I know of is that we have created a society of Lindys for Lindy to build herself. 01:45:45.400 |
We need to build a lot of integrations for Lindy to work well with Slack, Twilio, Google Sheets, and so on and so forth. 01:45:51.400 |
Instead of building them by hand, we are building this society of Lindys. 01:45:55.400 |
At the top level, there's this tool creation Lindy that takes an instruction like, hey, build a Slack integration. 01:46:01.400 |
It talks to this Lindy that goes online and finds the OpenAPI spec and the online web documentation. 01:46:08.400 |
That one talks to this manager Lindy, which splits up the task across many engineer Lindys. 01:46:16.400 |
There's a specific engineer for the authentication code because there's a few gotchas here. 01:46:20.400 |
And then they pass the work to a QA engineer Lindy that tests the work. 01:46:27.400 |
And if it doesn't work, sends it back to the software engineer. 01:46:34.400 |
This, for the record, is 70% or 80% of the way there. 01:46:43.400 |
So this is how you get to that future where a 25-year-old in his San Francisco studio can have more of a business impact than the Coca-Cola company. 01:46:58.400 |
I think this is going to be the greatest equalizer of human history. 01:47:03.400 |
Today, the best CMO in the world probably works for Apple or Nike or Coca-Cola. 01:47:09.400 |
Not too long from now, the best CMO in the world is going to be an AI CMO. 01:47:15.400 |
The same goes for the best designer in the world, the best engineer in the world. 01:47:19.400 |
They're all going to be AI designers, AI engineers. 01:47:26.400 |
We're all going to have the same lever of infinite strength to make change happen in the world. 01:47:35.400 |
And the only question is going to be, can you use that lever? 01:47:37.400 |
That's the only skill that's going to matter in the future. 01:47:43.400 |
Imagine if you weren't constrained anymore by time, by money, by your team, by your network. 01:47:53.400 |
Imagine if you could build anything, and it was just you, your laptop, and your Lindys. 01:49:06.380 |
Very, very excited for everything that's already happened and about to come. 01:49:10.380 |
There's obviously so much going on in the AI field and AI engineering fields. 01:49:16.380 |
But I feel like, oftentimes... I've always wanted to do this at a conference, and now I help to run one. 01:50:27.380 |
Marvel's a little bit past its prime, but I think the model kind of works. 01:50:32.380 |
And most of you are familiar with the first two that we have. 01:49:44.380 |
We have a few other things that are planned and we'll be talking about them over the next few days. 01:49:50.380 |
I think what's interesting about the AI field is so much stuff is being built. 01:49:54.380 |
Some stuff is much further along and some stuff is in infancy. 01:49:58.380 |
And we always want to try to encourage people to join in. 01:50:07.380 |
So with that in mind, I want to introduce the next set of speakers. 01:50:11.380 |
And the Marvel analogy has kind of been blown up by Ant, CTO of Supabase, who's in the audience, 01:50:18.380 |
because he said we're going to make some iPhone-level announcement. 01:50:21.380 |
And I really liked the way that he presented it. 01:50:25.380 |
So we're going to present three small things. 01:50:30.380 |
So first up, we have Barr talking about the state of AI engineering. 01:50:54.380 |
Oh, I see some friends in the audience and a lot of new faces. 01:51:02.380 |
I'm an investor at Amplify Partners, where we invest in very technical founders and platforms. 01:51:11.380 |
And so we're going to talk about the state of AI engineering. 01:51:19.380 |
And even from today, you see how quickly the field is moving. 01:51:23.380 |
I definitely don't need to tell this audience that. 01:51:26.380 |
Everything from new state of the art models to changing and very rapidly changing tooling. 01:51:32.380 |
And so we had a conversation and thought it would be helpful to take a step backwards and ask: 01:51:39.380 |
How do we get a good sense of the state of AI engineering? 01:51:45.380 |
How do we get a good sense of what tools people are using? 01:51:51.380 |
Especially given the really, really, really rapid advancements. 01:52:00.380 |
And you all are going to get the very first alpha view at the results of the survey. 01:52:05.380 |
841 people have filled it out about how they're using AI at work. 01:52:10.380 |
As far as I'm aware, that's the largest survey of AI engineers that is out there. 01:52:17.380 |
But after the inaugural one, we can track these over time. 01:52:24.380 |
And I don't have time to cover all the things. 01:52:26.380 |
But we go over everything from demographics to use cases to what are people actually using 01:52:32.380 |
in their stack, to some questions that people care about that are fun rapid fire, to who 01:52:37.380 |
should we really celebrate in the community that is doing a really good job bringing people 01:52:41.380 |
together and educating them through newsletters and podcasts. 01:52:46.380 |
On the demographics front, before I was an investor, I was a data scientist for most of my career. 01:52:51.380 |
Then I worked a little bit in data infrastructure. 01:52:53.380 |
Are there any other data scientists in the house by show of hands? 01:53:00.380 |
How about software engineers as your formal job title? 01:53:05.380 |
How about AI engineers as your formal job title? 01:53:09.380 |
So software engineers actually beat out AI engineers. 01:53:18.380 |
We think that we're going to see a lot more of the title AI engineering. 01:53:21.380 |
But it's both a job function and a skill set that we see across a bunch of different functions. 01:53:28.380 |
So, SWIX has talked about AI becoming more ubiquitous. 01:53:32.380 |
Of the folks that we talked to who have over 10 years of software experience, 20% of them 01:53:38.380 |
have less than one year of AI/ML experience, but they're getting into the field now. 01:53:42.380 |
If I remember correctly, something like 40% have less than three years of AI/ML experience. 01:53:48.380 |
So, we're seeing this flood into combination of AI skill set and the AI engineering role. 01:53:57.380 |
There are a lot of use cases we could talk about. 01:53:59.380 |
But just to give some highlights, most people are using LLMs for more than one use case. 01:54:07.380 |
I think if we did this survey a few months ago, there would not have been as many folks with use cases in production. 01:54:12.380 |
So, congrats to everyone that's doing the hard work to make that happen. 01:54:17.380 |
Accuracy and cost are the most important when choosing a model. 01:54:20.380 |
And serving cost and evaluation are what people said are the most challenging. 01:54:24.380 |
There's a whole section on evaluation, and reading through the comments were pretty funny, 01:54:28.380 |
because while folks had the opportunity to vote on human review and academic benchmarks, 01:54:35.380 |
there were some write-ins like, "I evaluate based on vibes," or "based on my eyeballs." 01:54:40.380 |
And so, it's just a commentary on how far we have to go there. 01:54:44.380 |
And finally, OpenAI's models are the most popular. 01:54:50.380 |
But 80% are experimenting with more than one provider, and I'm including open source in that. 01:54:56.380 |
We're not going to go through all of this, but we will share a survey that goes through all of this. 01:55:06.380 |
I don't know what happened there, but as the cliffhanger, most AI engineers are using a vector database. 01:55:13.380 |
And what I'm saying here that you can't really see is that there's a near-even split between folks using third-party and self-hosted for VectorDBs. 01:55:23.380 |
So, it will be interesting to see where this goes over time. 01:55:26.380 |
For prompt management, I thought this comment was hilarious. 01:55:32.380 |
I prototype using OpenAI Playground and then hard-code prompts into source code. 01:55:39.380 |
And so, we're seeing a combination of folks using external tools, building internal tools -- that's the most popular thing -- and using internal spreadsheets. 01:55:49.380 |
But I think this is also an interesting question around who in the stack owns this, and does it become more important or less important over time as the models get better, 01:55:59.380 |
but as we're doing more AI and potentially have more prompts? 01:56:05.380 |
And finally, we have a whole section of pretty fun questions around is the future open-source or third-party? 01:56:11.380 |
It was actually pretty even with open-source narrowly winning. 01:56:16.380 |
And there aren't too many doomers in the AI engineering crowd. 01:56:28.380 |
I think 12% of folks confidently said there's a 0% probability of doom. 01:56:36.380 |
But you see that there's an interesting distribution there. 01:56:41.380 |
And I just want to shout out -- these were the top 10 of each category of newsletters, podcasts, and communities. 01:56:49.380 |
And the way that folks voted on this was whether they felt like they've learned something from one of these in the past months. 01:56:55.380 |
So, major shout out to this -- yeah, it's getting a lot of photos -- major shout out to the folks who are putting in a lot of work to educate and help build learning and space for AI engineering. 01:57:10.380 |
If you want to take a look at all of this more in-depth, got a QR code for you and a link. 01:57:19.380 |
You're the first people to see some early results of the AI engineering survey. 01:57:24.380 |
I'm always open to feedback, to discussion, to questions you want to see next. 01:57:29.380 |
Some things -- like there were more people pre-training models than we expected -- are going to dictate what we continue to look at and what we continue to survey and share out. 01:57:37.380 |
But I hope you get value out of the transparency and don't be a stranger. 01:57:53.380 |
So, that's the first small launch that we're going to do, which is the definitive industry survey. 01:57:59.380 |
The next thing I think about when building an industry -- 01:58:09.380 |
So, today I'm very excited to stand here in front of you to announce that we are starting a new organization called the AI Engineer Foundation. 01:58:28.380 |
Before I start, I just want to note that I have been an engineer for the last year. 01:58:36.380 |
So, standing in front of this audience is deeply uncomfortable for me. 01:58:45.380 |
So, for AI Engineer Foundation, we exist to solve problems for AI engineers. 01:58:54.380 |
And today, we're going to enumerate three problems to start with. 01:58:59.380 |
First, every project is reinventing slightly different interfaces. 01:59:12.380 |
Interfaces to popular LLMs have been implemented differently by different libraries. 01:59:17.380 |
And this is a problem for AI engineers, because we have to learn each interface independently. 01:59:29.380 |
Development and monitoring tools lack interoperability. 01:59:35.380 |
Without standards, people build end-to-end apps that integrate with a limited set of tools. 01:59:42.380 |
This creates an ecosystem which is basically a bunch of verticalized silos that encourage churn. 01:59:49.380 |
However, with mutually agreed-upon standards, 02:00:02.380 |
we find common points of shared interest, so that AI engineers can find the best-in-class tools. 02:00:04.380 |
And this creates an ecosystem that essentially encourages more stable infrastructure, as well as a more 02:00:14.380 |
modularized framework, as well as more collaboration. 02:00:17.380 |
Problem three: venture-backed open-source lock-in, 02:00:23.380 |
as well as having to navigate relicensing challenges. 02:00:29.380 |
So, some of you might know, in August we had this announcement by HashiCorp: 02:00:36.380 |
after nine years of Terraform being open-source, it was suddenly relicensed as a non-open-source-compliant project. 02:00:46.380 |
And this created a lot of panic in the industry. 02:00:55.380 |
And luckily, we have the Linux Foundation, who acted very quickly to create OpenTofu to ensure that Terraform stays open-source. 02:01:06.380 |
So, who we are as AI Engineer Foundation is -- everything we do is open-source. 02:01:15.380 |
And we are building a strong AI engineer community. 02:01:24.380 |
Our first project is the Agent Protocol, and as of yesterday it has about 300 stars on GitHub. 02:01:28.380 |
It's a unified interface for AI agent developers, exposed as a REST API. 02:01:35.380 |
And by REST API, I really mean these nine endpoints, as well as a well-defined schema for data types. 02:01:41.380 |
And you can check out Agent Protocol at agentprotocol.ai. 02:01:46.380 |
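To make the endpoint idea concrete, here is an illustrative client interaction following the spec's task/step pattern; treat the exact paths and fields as assumptions and check agentprotocol.ai for the authoritative schema:

```python
# Illustrative calls against an Agent Protocol server running locally.
import requests

BASE = "http://localhost:8000/ap/v1"  # assumed base path per the spec's pattern

# Create a task for the agent to work on.
task = requests.post(
    f"{BASE}/agent/tasks",
    json={"input": "Summarize the latest AI Engineer Summit keynote"},
).json()

# Drive the task forward one step at a time until the agent says it is done.
while True:
    step = requests.post(
        f"{BASE}/agent/tasks/{task['task_id']}/steps", json={}
    ).json()
    print(step.get("output"))
    if step.get("is_last"):
        break
```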
With Agent Protocol, new tools are suddenly available. 02:01:50.380 |
So, the AutoGPT team recently launched the Arena Hacks Hackathon that is built on top of Agent Protocol. 02:01:57.380 |
So now we have evaluation benchmarks, and if you're an agent developer, you should totally still participate in this hackathon. 02:02:05.380 |
It's still ongoing; submit your agent to the leaderboard to see how it performs. 02:02:09.380 |
There are three ways for you to stay engaged with us. 02:02:15.380 |
If you have an open-source project that you would like other AI engineers to benefit from, you can contribute it to the foundation. 02:02:24.380 |
And if you are a developer, you can stay engaged with us through our Discord community. 02:02:30.380 |
And lastly, if you really resonate with the problems that we're trying to solve, 02:02:36.380 |
and if you are financially able to do so, please sponsor us. 02:02:41.380 |
More information can be found on the website, AIE.foundation. 02:02:48.380 |
I don't know which we're using now, but I'm just going to stick to this for now. 02:03:03.380 |
I do think that we have to set these things in motion so that there's a place for you if you want to come join and collaborate in open-source community. 02:03:17.380 |
I'm just, you know, moving people around or, you know, suggesting projects and promoting people. 02:03:23.380 |
I'm not super ready to talk about it, but, you know, there's no other place to talk about it. 02:03:42.380 |
And, essentially, it is what I've been pitching: an API gateway that actually helps AI engineers to code faster, make their codebase simpler, and do a lot of things that would otherwise take a lot of specialist knowledge, right? 02:04:04.380 |
I think a lot of AI engineering is about access from machine learning to product. 02:04:09.380 |
And so that's the website. 02:04:14.380 |
If you were on the conference website yesterday (thanks to Sean Oliver for coding this up), you've actually seen the Summit AI bot, which presents the information about the conference in a way that's more personal to me than the website, because sometimes I just want to see details about the speakers, the talks, whatever. 02:04:32.380 |
And the question is, like, why don't we do this more often, right? 02:04:37.380 |
The LangChain docs kind of launched a chatbot, 02:04:39.380 |
but it's on a different domain; you have to find it somewhere. 02:04:41.380 |
It's not ubiquitous, because it's a lot of work and a lot of code to write. 02:04:47.380 |
What we have found is that we've been able to fine-tune models on production traffic. 02:04:54.380 |
And that's effectively the kind of stuff I've been working on. 02:04:58.380 |
That's a screenshot of the fine-tuning UI that OpenAI has launched. 02:05:01.380 |
And this is an example of the kind of stuff that a fine-tuned smaller model can do that would eliminate a whole raft of sort of glue code that most people would be writing. 02:05:16.380 |
This conference is definitely not, you know, just a small conference. 02:05:20.380 |
Each of these things is a new project in its infancy that we're presenting alongside the keynotes. 02:05:26.380 |
And the image that I want to give is that this is a very new field. 02:05:36.380 |
I never asked permission from OpenAI to get started on this crazy journey. 02:05:44.380 |
Please welcome back to the stage the co-founders and hosts of the AIE Summit, SWIX and Benjamin Dunphy. 02:06:25.380 |
But particularly, I'm especially excited about the announcement of the AI Engineer Foundation. 02:06:31.380 |
You know, when SWIX and I were first talking about making this conference back in February, 02:06:37.380 |
I initially proposed potentially we could do a foundation because, you know, this AI engineering phenomenon is going to change engineering. 02:06:45.380 |
It's going to change it, we think, for the better. 02:06:47.380 |
But a lot of people are going to struggle with it. 02:06:54.380 |
And we're going to help make it a much smoother transition. 02:06:59.380 |
If you like the mission that was just announced, you can support it by getting a special edition tee. 02:07:06.380 |
I don't know if we can zoom in on that on the camera or not. 02:07:31.380 |
And that takes you to a Stripe checkout link. 02:07:39.380 |
I think there's only like 100 t-shirts or something. 02:07:50.380 |
And then when that break is done, we'll come back for our second block of opening keynote presentations. 02:07:55.380 |
And then after that, we're going to have the topic tables. 02:08:01.380 |
I know it says on the schedule, food and beverage. 02:08:05.380 |
I think it makes more sense since we just had lunch at like 1:00. 02:08:11.380 |
Just not until about 7:00, 7:30 after the next block of sessions. 02:08:21.380 |
I mean, we had so many amazing announcements on the stage. 02:08:36.380 |
Swix, did you want to say a few words before we break? 02:08:41.380 |
I mean, I'm sure these people are hungry and thirsty and want to chat. 02:09:40.140 |
Gave it all we got and I know we did the best we could. 02:09:45.740 |
If I could go back under the mess, I would memorize your face before I go. 02:10:32.100 |
Additionally, as users, now that we all have access to ChatGPT and can really easily access these models, we have very high expectations when we're using AI features inside of products. 02:10:43.680 |
We expect outputs to be crisp, exactly what we wanted. 02:10:47.080 |
We expect to never see hallucinations. 02:10:49.960 |
And in general, it should be fast and accurate. 02:10:52.520 |
And so I want to go over three easy-to-implement tactics to get better and safer responses. 02:11:00.120 |
And like I said, these can be used in your everyday work when you're just using ChatGPT. 02:11:03.960 |
Or if you're integrating AI into your product, these will help go a long way to making sure that your outputs are better and that users are happier. 02:11:13.320 |
The first is called multi-persona prompting. 02:11:15.440 |
This comes out of a research study from the University of Illinois. 02:11:18.800 |
Essentially, what this method does is it calls on various agents to work on a specific task when you prompt it. 02:11:26.360 |
And those agents are designed for that specific task. 02:11:30.420 |
So, for example, if I were to prompt a model to help me write a book, multi-persona prompting would lead the model to bring in a publicist, an author, maybe the intended target audience of my book. 02:11:43.460 |
And they would work hand-in-hand in kind of a brainstorm mechanism with the AI leading this brainstorm. 02:11:50.780 |
They'd go back and forth, throwing ideas off the wall, collaborating until they came to a final answer. 02:11:58.040 |
Part of why this works so well is that you get to see the whole collaboration process. 02:12:00.520 |
And so it's very helpful in cases where you have a complex task at hand or it requires additional logic. 02:12:06.500 |
I personally like using it for generative tasks. 02:12:14.100 |
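A minimal sketch of multi-persona prompting as a single prompt, assuming the OpenAI Python SDK; the persona instructions and task are illustrative:

```python
# One prompt that asks the model to simulate a panel of personas.
from openai import OpenAI

client = OpenAI()

prompt = (
    "When given my task, first identify 3-5 personas best suited to solve it "
    "(for example: a publicist, an author, the book's target reader). Have "
    "them brainstorm together, critiquing and building on each other's ideas "
    "in rounds, and then present the final answer they converge on.\n\n"
    "Task: Help me outline a book about learning to cook."
)

resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)  # shows the whole collaboration
```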
Next up is "according to" prompting. What this does is it grounds prompts to a specific source. 02:12:19.260 |
So, instead of just asking, you know, what part of the digestive tube you expect starch to be digested in, you can say that and then just add "according to Wikipedia" at the end. 02:12:28.700 |
So, adding "according to" plus a specified source will increase the chance that the model retrieves the information from that specific source. 02:12:36.700 |
And this can help reduce hallucinations by up to 20%. 02:12:39.800 |
So, this is really good if you have a fine-tuned model, or a general model where you know you're reaching for a very consistent data source. 02:12:48.680 |
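In its simplest form, this is just a suffix on the question; a small illustrative sketch:

```python
# "According to" prompting: append the grounding phrase to a normal question.
from openai import OpenAI

client = OpenAI()

question = ("In what part of the digestive tube do you expect starch "
            "to be digested?")
grounded = question + " Answer according to Wikipedia."

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": grounded}],
)
print(resp.choices[0].message.content)
```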
And last up, and arguably my favorite, is called EmotionPrompt. 02:12:59.120 |
This research was done by Microsoft and a few universities. 02:13:03.120 |
And what it basically looked at was how LLMs would react to emotional stimuli at the end of prompts. 02:13:10.120 |
So, for example, if your boss tells you that this project is really important for your career or for a big client, you're probably going to take it much more seriously. 02:13:19.120 |
And this prompting method tries to tie into that cognitive behavior of humans. 02:13:25.120 |
All you have to do is add one of these emotional stimuli to the end of your normal prompt. 02:13:29.120 |
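A tiny sketch of that in practice; "This is very important to my career." is one of the stimuli from the EmotionPrompt paper, while the base prompt here is illustrative:

```python
# EmotionPrompt: append an emotional stimulus to an ordinary prompt.
base_prompt = "Write a changelog entry for our new CSV export feature."
stimulus = "This is very important to my career."  # stimulus from the paper

emotional_prompt = f"{base_prompt} {stimulus}"
# emotional_prompt is then sent to the model exactly like any other prompt.
```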
And I'm sure you'll actually get better outputs. 02:13:31.120 |
I've seen it done time and time again from everything from cover letters to generating change logs. 02:13:38.120 |
The outputs just seem to get better and more accurate. 02:13:41.120 |
And the experiments show that this can lead to anywhere from an 8% increase to 115% increase, depending on the task at hand. 02:13:51.120 |
And so, those are three really quick, easy-hit methods that you can use in ChatGPT or in the AI features in your product. 02:13:59.120 |
We have all these available as templates in PromptHub. 02:14:06.120 |
You can use them there, run them through our playground, share them with your team, or you can have them via the links. 02:14:13.120 |
And so, thanks for taking the time to watch this. 02:14:15.120 |
I hope that you've walked away with a couple of new methods that you can try out in your everyday. 02:14:19.120 |
If you have any questions, feel free to reach out and be happy to chat about this stuff. 02:15:27.120 |
and the server is a custom Fastify implementation. 02:15:32.120 |
The main challenge is responsiveness, meaning getting results to the user as quickly as possible. 02:15:42.120 |
The UI is just a small screen that has a "record topic" button. 02:15:50.120 |
When you release the button, the audio gets sent to the server as a buffer. 02:16:00.120 |
It is really quick for a short topic, 1.5 seconds. 02:16:08.120 |
So the client-server communication works through an event stream, 02:16:21.120 |
with React state updates updating the screen. 02:16:23.120 |
Okay, so then the user knows something is going on. 02:16:27.120 |
In parallel, I start generating the Story Outline. 02:16:35.120 |
So it can generate a Story Outline in about 4 seconds. 02:16:39.120 |
And once we have that, we can start a bunch of other tasks in parallel. 02:16:45.120 |
Generating the title, generating the image, and generating and narrating the audio story all happen in parallel. 02:16:59.120 |
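Before walking through each step, here is a minimal sketch of that fan-out using Python's asyncio (the actual project is TypeScript on Fastify); the three helpers are stand-ins for the real model calls:

```python
# Run the three post-outline tasks concurrently and wait for all of them.
import asyncio

async def generate_title(outline: str) -> str:
    return "A Title"            # really: a fast instruct-model call

async def generate_image(outline: str) -> str:
    return "image.png"          # really: prompt extraction + Stable Diffusion

async def narrate_story(outline: str) -> list[str]:
    return ["part-1.mp3"]       # really: GPT-4 structuring + speech synthesis

async def produce_story(outline: str):
    # asyncio.gather starts all three tasks at once and waits for all.
    return await asyncio.gather(
        generate_title(outline),
        generate_image(outline),
        narrate_story(outline),
    )

title, image, parts = asyncio.run(produce_story("a story outline"))
```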
For this, OpenAI's GPT-3.5 Turbo Instruct is used again, giving a really quick result. 02:17:05.120 |
Once the title is available, it's being sent to the client again as an event and rendered there. 02:17:17.120 |
First, there needs to be a prompt to actually generate the image. 02:17:23.120 |
So we pass the whole story into a GPT-4 prompt 02:17:27.120 |
that then extracts relevant, representative keywords for an image prompt from the story. 02:17:33.120 |
That image prompt is passed into Stability AI's Stable Diffusion XL, which generates the image. 02:17:42.120 |
The generated image is stored as a virtual file in the server. 02:17:48.120 |
And then, an event is sent to the client with a path to that file. 02:17:53.120 |
The client can then, through a regular URL request, just retrieve the image as part of an image tag. 02:18:05.120 |
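A sketch of that image step, assuming the OpenAI Python SDK for the keyword extraction; `stability_generate` is a hypothetical stand-in for the Stability AI call, whose real API differs:

```python
# GPT-4 compresses the story into visual keywords for the image model.
from openai import OpenAI

client = OpenAI()

def image_prompt_from_story(story: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Extract 8-12 visual, representative keywords from "
                        "the story, suitable as an image-generation prompt."},
            {"role": "user", "content": story},
        ],
    )
    return resp.choices[0].message.content

def stability_generate(prompt: str) -> bytes:
    """Hypothetical stand-in for the Stable Diffusion XL image call."""
    raise NotImplementedError

# image_bytes = stability_generate(image_prompt_from_story(story))
# The bytes are stored as a virtual file; the path is sent to the client.
```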
Generating the full audio story is the most time-consuming piece of the puzzle. 02:18:10.120 |
Here, we have a complex prompt that takes in the story 02:18:14.120 |
and creates a structure with dialogue and speakers and extends the story. 02:18:21.120 |
We use GPT-4 here with a low temperature to retain the story. 02:18:26.120 |
And the problem is it takes one and a half minutes, 02:18:29.120 |
which is unacceptably long for an interactive client. 02:18:39.120 |
So we stream the structured output as it's generated, which is a little bit more difficult than just streaming characters token by token. 02:18:44.120 |
We need to always partially parse the structure and then determine if there is a new passage 02:18:51.120 |
that we can actually narrate and synthesize speech for. 02:18:56.120 |
ModelFusion takes care of the partial parsing and returns an iterable over fragments of partially parsed results. 02:19:04.120 |
But the application needs to decide what to do with them. 02:19:07.120 |
Here, we determine which story part is finished so we can actually narrate it. 02:19:14.120 |
So we narrate each story part as it's getting finished. 02:19:19.120 |
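The real implementation streams and partially parses JSON through ModelFusion (TypeScript); this Python sketch shows the pattern, where `repair()` is a naive, illustrative fixer and the `{"parts": [...]}` shape is an assumption:

```python
# Emit each story part exactly once, as soon as it is provably complete.
import json

def repair(buf: str) -> str:
    """Naively close a dangling string and any open arrays/objects so that
    json.loads can attempt a parse. Ignores escaped quotes; good enough here."""
    stack, in_str = [], False
    for ch in buf:
        if in_str:
            in_str = ch != '"'
        elif ch == '"':
            in_str = True
        elif ch in "[{":
            stack.append("]" if ch == "[" else "}")
        elif ch in "]}":
            stack.pop()
    return buf + ('"' if in_str else "") + "".join(reversed(stack))

def finished_parts(text_deltas):
    """Yield each story part once a later part has started arriving,
    which proves the earlier part is finished and safe to narrate."""
    buffer, emitted = "", 0
    for delta in text_deltas:            # streamed chunks from the model
        buffer += delta
        try:
            doc = json.loads(repair(buffer))
        except ValueError:
            continue                     # not parseable yet; keep buffering
        parts = doc.get("parts", [])
        while emitted < len(parts) - 1:  # the last part may still be growing
            yield parts[emitted]
            emitted += 1
    # After the stream ends, flush whatever remains.
    try:
        for part in json.loads(buffer).get("parts", [])[emitted:]:
            yield part
    except ValueError:
        pass

# Example: part one is emitted before the stream has finished.
chunks = ['{"parts": [{"speaker": "Narrator", "text": "Once upon',
          ' a time..."}, {"speaker": "Fox", "text": "Hel',
          'lo!"}]}']
for part in finished_parts(chunks):
    print(part["speaker"], part["text"])
```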
For each story part, we need to determine which voice we use to narrate it. 02:19:26.120 |
The narrator has a predefined voice, and for all the speakers where we already have voices, we just reuse them. 02:19:33.120 |
However, when there's a new speaker, we need to figure out which voice to give it. 02:19:37.120 |
The first step for this is to generate a voice description for the speaker. 02:19:45.120 |
Here's a GPT-3.5 Turbo prompt that gives us a structured result with a gender and a voice description. 02:19:51.120 |
And we then use that for retrieval where we beforehand embedded all the voices based on their descriptions 02:19:58.120 |
and now can just retrieve them filtered by gender. 02:20:02.120 |
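A sketch of that retrieval step, assuming OpenAI embeddings; the real project does this through ModelFusion in TypeScript, and the voice catalogue fields here are illustrative:

```python
# Pick the catalogue voice whose description best matches the new speaker.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)

# Embed every catalogue voice's description once, ahead of time.
catalogue = [
    {"id": "warm-male-1", "gender": "male",
     "embedding": embed("A warm, deep male voice with a slow cadence.")},
    {"id": "bright-female-1", "gender": "female",
     "embedding": embed("A bright, energetic female voice.")},
]

def pick_voice(description: str, gender: str) -> dict:
    """Nearest voice of the right gender by cosine similarity (ada-002
    embeddings are unit-length, so a dot product suffices)."""
    query = embed(description)
    candidates = [v for v in catalogue if v["gender"] == gender]
    return max(candidates, key=lambda v: float(np.dot(query, v["embedding"])))
```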
Here, for the speech synthesis, LMNT and ElevenLabs are supported. 02:20:16.120 |
Based on the voices that have been chosen, one of those providers is picked and the audio is synthesized. 02:20:23.120 |
Similar to the images, we generate an audio file and we store it virtually in the server 02:20:29.120 |
and then send the path to the client, which reconstructs the URL and just retrieves it as a media element. 02:20:37.120 |
Once the first audio is completed, the client can then start playing. 02:20:42.120 |
And while this is ongoing, you're listening, and in the background 02:20:47.120 |
the server continues to generate more and more parts. 02:20:52.120 |
And that's it. So let's recap how the main challenge of responsiveness is addressed here. 02:20:58.120 |
We have a loading state that has multiple parts that are updated as more results become available. 02:21:04.120 |
We use streaming and parallel processing in the backend to make results available as quickly as possible 02:21:11.120 |
and you can start listening while the processing is still going on. 02:21:14.120 |
And finally, models are chosen such that the processing time for each generation step, say the story outline, is minimized. 02:21:28.120 |
And if you want to find out more, you can find Storyteller and also ModelFusion on GitHub, at github.com/lgrammel/storyteller and github.com/lgrammel/modelfusion. 02:23:53.120 |
I want to share with you an interesting generative AI project that I recently did. 02:23:59.120 |
Not too long ago, I made a game with 100% AI-generated content. 02:24:06.120 |
It's a simple game where you're wandering around lost in the forest and you go from scene to 02:24:13.120 |
scene, having encounters that impact your vigor and your courage. 02:24:18.120 |
The idea is that you want to find your home before you run out of courage. 02:24:28.120 |
The scenes were all pre-generated, so if you play a few times, you will have seen them all. 02:24:29.120 |
Now, my favorite part of making this game was generating each scene and just seeing what the AI 02:24:40.120 |
would come up with. And I thought, wouldn't it be cool to share that experience with the player? 02:24:45.120 |
What if every time they went to a new scene, it was generated fresh for them? 02:24:51.120 |
And every game would be unique and different this way. 02:25:00.120 |
That sounded so cool that I wanted to try to do it. 02:25:03.120 |
Now, the first thing that I would need to do is to generate each scene and have a consistent way of doing that. 02:25:10.120 |
My scene definitions are JSON objects that describe what the scene is when you first find it, as well as when you come back to it later, and how that impacts your stats. 02:25:23.120 |
So, I started out by using OpenAI's completion endpoint and doing some prompt engineering. 02:25:44.120 |
Most of the time, I would get scenes that had the right JSON format and the content was good. 02:26:15.120 |
I generated 50 examples, just like these, and used them to fine-tune. 02:26:30.120 |
In the prompt, I took out any mention of the JSON and just generally described what I wanted, hoping that that information would be embedded in the training data. 02:26:52.120 |
That includes generating all the examples and doing the fine-tuning. 02:26:57.120 |
And when I tried it, I was very happy to find that it worked perfectly. 02:27:02.120 |
Even though I didn't mention the JSON at all, it came out perfect because of what was in the examples. 02:27:11.120 |
And that meant I had less tokens in the prompt, which is faster and cheaper and just easier to work with. 02:27:20.120 |
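A sketch of what one training example might look like in the chat-style JSONL that OpenAI fine-tuning expects; the scene fields and wording are illustrative, not the game's actual schema:

```python
# One fine-tuning example: a short instruction paired with a full scene JSON,
# so the output format lives in the examples rather than in the prompt.
import json

example = {
    "messages": [
        {"role": "system",
         "content": "Generate a scene for a lost-in-the-forest game."},
        {"role": "user", "content": "A new scene, please."},
        {"role": "assistant", "content": json.dumps({
            "description": "You stumble into a moonlit glade...",
            "returnDescription": "The glade is quiet now.",
            "vigor": 1,
            "courage": -1,
        })},
    ]
}

# ~50 lines like this go into a .jsonl file, which is then uploaded and
# trained, e.g. via client.files.create(..., purpose="fine-tune") and
# client.fine_tuning.jobs.create(training_file=..., model="gpt-3.5-turbo").
with open("scenes.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```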
So I was really pleased with how this worked. 02:27:31.120 |
Leonardo not only lets you generate images, they also let you create your own image models. 02:27:38.120 |
And this is great for a game because it means that you can have stylistically consistent images, which is exactly what I needed. 02:27:48.120 |
So I spent a while using all the different parameters that Leonardo offers and working with the prompt to try and find an image that looks right and that I liked. 02:28:00.120 |
It turned out that using the description directly from the scene as the prompt made nice pictures, which I was surprised about, since it was written in the second person and said things beyond what was in the image. 02:28:15.120 |
Now, the tricky part with fine-tuning an image model is that you need consistent images: the parts that should be the same need to be the same across all of your training data, 02:28:30.120 |
but the parts that you want to vary need to be varied. 02:28:34.120 |
Otherwise, it will overfit and all of your images will look the same. 02:28:37.120 |
But if you don't have that consistency between them, then it won't really know what you want and you won't get that good stylistic consistency. 02:28:47.120 |
This was really tricky, especially in my case, I needed the perspective and the scale to be consistent from scene to scene. 02:28:56.120 |
Obviously, I needed them all to be set in the forest, and I wanted to have this overall tone and texture that looked the same. 02:29:05.120 |
Some of my scenes have people in them, some have animals, some have buildings, some have nothing, and so it was hard to get that variety. 02:29:14.120 |
I ended up having to train a couple of models with different parameters, different sets of images, but I eventually found one that worked out. 02:29:23.120 |
And to test it out, I generated a lot of images. 02:29:32.120 |
And you can see they all have similar features like the zigzag path down the middle. 02:29:39.120 |
Obviously, the trees and the look and everything looks the same. 02:29:47.120 |
Each one is unique and different, but still feels cohesive, which I am very pleased about. 02:29:54.120 |
So now I had everything I needed to put it together and make the game. 02:29:58.120 |
I made a simple asset server that had an AI pipeline starting by requesting a new scene from OpenAI's endpoint using my custom model. 02:30:11.120 |
Once I get that, I validate the JSON to make sure that it's got all the keys it needs. 02:30:17.120 |
If it's good, I take the description and I send that to Leonardo. 02:30:22.120 |
Leonardo makes an image from my custom model, gives it back to me. 02:30:34.120 |
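A minimal sketch of that pipeline, assuming the 2023-era openai SDK (pre-1.0); the model name and required keys are placeholders, and generate_image stands in for the Leonardo API call rather than reproducing its real endpoints:

```python
import json
import openai  # 2023-era SDK (openai<1.0), matching when this was built

REQUIRED_KEYS = {"description", "return_description", "effects"}  # assumed keys

def generate_image(description: str) -> str:
    """Stand-in for the call to Leonardo's API with the custom image model."""
    raise NotImplementedError

def generate_scene() -> dict:
    # Step 1: request a new scene from the fine-tuned completion model.
    resp = openai.Completion.create(
        model="ft-my-scene-model",  # placeholder for the custom model name
        prompt="Create a new forest scene.\n\n###\n\n",  # illustrative prompt
        max_tokens=300,
        stop=[" END"],
    )
    scene = json.loads(resp["choices"][0]["text"])
    # Step 2: validate that the JSON has every key the game needs.
    missing = REQUIRED_KEYS - scene.keys()
    if missing:
        raise ValueError(f"scene is missing keys: {missing}")
    # Step 3: send the description to the image model and attach the result.
    scene["image_url"] = generate_image(scene["description"])
    return scene
```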
Here is an example scene that was created, and I'm very happy with it. 02:30:42.120 |
I made a simple preview server so that I could scroll through a bunch of these scenes that I generated to make sure they worked. 02:30:54.120 |
So I made some changes to the game to request images each time the player went to a new scene. 02:31:07.120 |
It takes 10, 20, sometimes 30 seconds to do this. 02:31:12.120 |
And that wouldn't be good for the play experience. 02:31:23.120 |
So instead, I pre-generate scenes into a buffer, and as scenes are taken out of it, I fill it back up again once it gets below a certain threshold. 02:31:30.120 |
And that way, there's always a scene that's ready to go. 02:31:38.120 |
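One way to sketch that buffering; the threshold and threading approach are my assumptions, and generate_scene is the pipeline sketched above:

```python
import queue
import threading

SCENE_BUFFER: "queue.Queue[dict]" = queue.Queue()
LOW_WATER_MARK = 5  # refill threshold -- the actual number is an assumption

def next_scene() -> dict:
    scene = SCENE_BUFFER.get()  # always ready if the buffer stays topped up
    if SCENE_BUFFER.qsize() < LOW_WATER_MARK:
        # Refill in the background so the player never waits 10-30 seconds.
        threading.Thread(target=refill, daemon=True).start()
    return scene

def refill() -> None:
    while SCENE_BUFFER.qsize() < LOW_WATER_MARK:
        SCENE_BUFFER.put(generate_scene())  # the pipeline sketched earlier
```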
And I'm going to share it with you right now. 02:31:41.120 |
Now, keep in mind, everything that we see has never been seen before and will never be seen again. 02:31:53.120 |
You always start out at this lamppost and you have to wander around and find your way home. 02:32:01.120 |
As your vigor goes down, your speed goes down as well. 02:32:04.120 |
And as your courage goes down, the viewport will get smaller and smaller. 02:32:18.120 |
Here, for example, you encounter a soft blue pulsating light coming from the organic formations scattered around the glade. 02:32:27.120 |
Your fear and tiredness lift, you feel rejuvenated, and your vigor goes up, but I'm already at full. 02:32:36.120 |
I won't read all of these, but this looks like a cool campfire scene, which is really neat. 02:32:49.120 |
There's a large dark cave over here at the end of the path somewhere. 02:32:58.120 |
And now we've gotten into some fog, foggy trees. 02:33:10.120 |
This is like a really windy road that we're going through. 02:33:21.120 |
Well, this is the game and it would continue on and on and on until you find your way home. 02:33:27.120 |
And then you can just play again and it would be different every time. 02:33:34.120 |
One thing is that these images are low resolution. 02:33:41.120 |
And I could make them a higher resolution by adding an AI upscaler to my pipeline. 02:33:53.120 |
Also, I could get more creative with adding something to the prompt to make a scene. 02:34:00.120 |
For example, I could let the user select a theme, or maybe even get the time of day or the current weather at the user's location. 02:34:12.120 |
And then the scenes could be generated to match where they are for a very immersive experience. 02:34:18.120 |
And of course, I can use this same process on other projects. 02:34:28.120 |
I hope that you found this interesting and enjoyed watching it as much as I enjoyed putting it all together. 02:35:38.120 |
And welcome to my talk on how we're thinking about the levels of code AI. 02:35:43.120 |
My name is Ado Kukic and I am the director of DevRel at Sourcegraph. 02:35:48.120 |
At Sourcegraph, we're building Cody, the only AI coding assistant that knows your entire codebase. 02:35:55.120 |
To help educate our customers and users, as well as shape our thinking of code AI, we've been using a concept that we call levels of code AI internally. 02:36:06.120 |
These levels have really resonated with our community, so we wanted to publicize them and start a conversation with the broader developer community, and where better to do it than at the AI Engineer Summit. 02:36:19.120 |
When we talk about code AI, we refer to software that builds software. 02:36:24.120 |
Today, 92% of developers are using code AI tools, whereas this number was just 1% a year ago. 02:36:33.120 |
Our founder and CEO, Quinn Slack, has shared a bold prediction that in five years, 99% of code will be written by AI. 02:36:43.120 |
While we await that future, let's talk about how we see the levels of code AI today. 02:36:49.120 |
We see six distinct levels across three different categories. 02:36:53.120 |
Human-initiated, where humans are the primary coders. 02:36:58.120 |
AI-initiated, where AI starts to take a proactive role in software development. 02:37:03.120 |
AI-led, where AI has full autonomy over a codebase. 02:37:10.120 |
We'll contrast these levels of code with the SAE levels of autonomy for vehicles. 02:37:16.120 |
At level zero, the developer writes all code manually without any AI assistance. 02:37:23.120 |
The developer is responsible for writing, testing, and debugging a code base. 02:37:28.120 |
AI does not generate or modify any part of the code base, but IDE features like symbol name completion can provide a bit of assistance. 02:37:38.120 |
This level reflects the traditional software development process before introducing any AI assistance into the development workflow. 02:37:47.120 |
A vehicle operating at level zero is fully reliant on the human driver for acceleration, steering, braking, and everything in between. 02:37:56.120 |
At level one, the developer begins to use AI that can generate single lines or whole blocks of code based on developer intent. 02:38:05.120 |
For example, a developer might write the signature of a function, and the AI will infer the context and generate the implementation details for said function. 02:38:15.120 |
At level one, the AI assistant has been trained on millions of lines of open source code and can leverage this to provide superior completions based on the developer's guidance. 02:38:27.120 |
SAE level one vehicles still require the full attention of the human driver, but offer features such as cruise control or lane centering that make driving an easier, safer, and more comfortable experience. 02:38:42.120 |
At level two, the AI coding assistant has superior understanding and context of the code base it is interacting with. 02:38:50.120 |
Where at level one, the context is broad and general, a level two AI coding assistant has specific context about the code base that it is working in. 02:39:00.120 |
This allows the AI assistant to make better suggestions for code completions. 02:39:05.120 |
For example, if you were working in a Node.js code base and were using the Axios library to handle HTTP requests, a level two AI assistant would provide autocomplete suggestions based on the Axios library as opposed to a different Node HTTP library like fetch or superagent. 02:39:28.120 |
At SAE level two, the human driver is still in control and can override anything the car does at any time, but features such as traffic-aware cruise control or automatic lane changes can make driving a much smoother experience. 02:39:41.120 |
At level three, the developer provides high level requirements and the AI assistant delivers a code based solution. 02:39:49.120 |
The AI coding assistant goes beyond generating singular snippets of code to building out full components and even integrations with other pieces of software. 02:39:59.120 |
Rather than writing the code themselves, a developer could instruct a level three code AI assistant to add user authentication to an application that they are building, and the coding assistant would generate all of the code required. 02:40:12.120 |
The coding assistant could then explain to the developer the code it wrote, how it works, and how it integrates with the rest of the application. 02:40:19.120 |
SAE level three is also the first level where the vehicle itself takes on the primary role of driving, with the human driver being a fallback in case the vehicle cannot drive itself safely. 02:40:31.120 |
The vehicle can perform most of the driving tasks, but may encounter situations where it cannot adequately perform these tasks, so it's forced to give control back to the human driver. 02:40:43.120 |
At level four, the code AI assistant can proactively handle coding tasks without developer oversight. 02:40:50.120 |
Let's imagine a few scenarios where a level four code AI assistant would play a role. 02:40:55.120 |
A level four capable code AI assistant could continuously monitor your code changes and autonomously submit PRs to ensure your documentation stays up to date. 02:41:07.120 |
Even better, the coding assistant could monitor bug reports from customers and submit PRs to fix the issues. 02:41:14.120 |
The human developer could then simply review the pull requests and merge them. 02:41:19.120 |
Level four SAE vehicles can perform virtually all driving tasks under specific conditions. 02:41:26.120 |
For example, Waymo operates a fleet of fully automated self-driving taxis in cities where they have high-quality mapping data and can provide a safe driving experience for passengers without human drivers. 02:41:40.120 |
A customer simply hails a Waymo taxi using a mobile app, provides a destination, and the vehicle is responsible for taking the passenger to their final destination without any additional human input. 02:41:53.120 |
At level five, the AI assistant requires minimal human guidance on code generation and is capable of handling the entire software development lifecycle. 02:42:04.120 |
The developer provides high-level requirements and specifications. 02:42:08.120 |
The AI then designs the architecture, writes production quality code, handles deployment, and continuously improves the code base. 02:42:17.120 |
The developer's role is to validate that the end product meets the stated requirements, 02:42:22.120 |
but the developer does not necessarily look at the generated code. 02:42:27.120 |
The code AI assistant has complete autonomy to take code from concept to production. 02:42:34.120 |
A self-driving car capable of level five driving automation can perform all driving tasks under all conditions, humans optional. 02:42:43.120 |
The car is responsible for making all the decisions. 02:42:47.120 |
At this level, a steering wheel or any ability for a human to override the car is unnecessary. 02:42:53.120 |
So there you have it, the six levels of code AI, or at least how we're thinking about them at Sourcegraph. 02:43:06.120 |
And if you'd like to try Cody for yourself, get it for your IDE of choice at cody.dev. 02:43:12.120 |
Thank you and I'll see you on the show floor. 02:43:28.120 |
Hey, I'm Matija and I'll show you how we created a GPT-powered full-stack web app generator 02:43:34.120 |
and how it was used to create over 10,000 applications in one month. 02:43:39.120 |
So first, we'll see what it is and then secondly, we'll check out how it works under the hood. 02:43:45.120 |
So everything happens on this web page and it's super simple. 02:43:49.120 |
First, we have to enter the name of our application. 02:43:52.120 |
Let's say we are building a simple to-do app. 02:43:55.120 |
Second part is describe how it works in a couple of sentences. 02:43:58.120 |
So we have a simple to-do app with one page listing all the tasks. 02:44:02.120 |
User can create tasks, toggle them, and edit them. 02:44:06.120 |
Creativity level corresponds to GPT temperature. 02:44:09.120 |
So we can go on the safe side and get less features 02:44:12.120 |
or we can go a little bit crazy but also have more mistakes. 02:44:18.120 |
And the last thing left to do is just to hit this generate button. 02:44:23.120 |
Here we can see the result of the generation. 02:44:26.120 |
So we got a full-stack app in React, Node.js, Prisma 02:44:30.120 |
and it's all glued together with a full-stack framework Wasp. 02:44:33.120 |
So the secret of Wasp is that it relies on this single configuration file 02:44:37.120 |
which describes your app in a high-level declarative manner. 02:44:41.120 |
So here, for example, we can see our auth in just a couple of lines. 02:44:53.120 |
And here we have our Node.js functions which are being executed on the backend. 02:44:58.120 |
So the last thing to do is just to download this app locally and run it with Wasp. 02:45:08.120 |
And now we just have to run it via wasp start. 02:45:27.120 |
So we have a database inspector that also comes with Wasp. 02:45:43.120 |
And let's check it out in the database again. 02:46:00.120 |
You can also now deploy this app with a single CLI command. 02:46:09.120 |
But we have a CLI helper in Wasp that makes it super easy to deploy to fly.io. 02:46:18.120 |
When we got Mage out, it was hardly the first AI coding agent. 02:46:22.120 |
But it was among the first ones that could generate a full stack web app with almost no errors. 02:46:27.120 |
When we released this and people started using it, we were getting two main questions: how does it work so well, and how much does it cost? 02:46:41.120 |
There are three main reasons for Mage's performance. 02:46:44.120 |
First, it is specialized only for full stack web apps and nothing else. 02:46:51.120 |
That allows us to assume a lot upfront and makes everything easier and faster. 02:46:56.120 |
Second, it makes use of a high level web framework, Wasp. 02:47:00.120 |
That takes away a ton of boilerplate and makes it much easier for GPT to do its job. 02:47:05.120 |
And lastly, Mage fixes the errors before it gives you the final result. 02:47:09.120 |
Again, because of the two points I mentioned previously, this is also a simpler problem than for the general AI coding agents. 02:47:20.120 |
Since Mage knows we are building a full stack web app and it's using Wasp for it, we can produce a lot of code upfront without even touching OpenAI's API and asking GPT any questions. 02:47:32.120 |
For example, some of the config files, then also some of the authentication logic, which we can see right here, and global CSS and similar. 02:47:47.120 |
The code agent's work consists of three main phases: planning, generating the code, and fixing the errors. 02:47:54.120 |
So let's expand the generation log and explore each of the phases. 02:47:58.120 |
Here, following the step zero, we can see the planning phase. 02:48:02.120 |
Given our app description, Mage decided it needs to generate the following: queries and actions, entities for data models, and one page. 02:48:13.120 |
Mage is actually implementing everything it planned for above. 02:48:16.120 |
And finally, here comes the error fixing phase. 02:48:19.120 |
Mage can detect some of the common errors and fix them by itself. 02:48:23.120 |
Here it failed to fix an error on the first attempt, so it had to try again. 02:48:27.120 |
And finally, when it cannot detect any more errors, we are done. 02:48:31.120 |
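Sketched as Python, the three phases might look like this; every helper here is a hypothetical stand-in, not Mage's actual internals, and the retry limit is an assumption:

```python
def plan_app(description: str) -> dict:
    """Stand-in for the planning call (queries, actions, entities, pages)."""
    raise NotImplementedError

def generate_code(plan: dict) -> dict:
    """Stand-in for the implementation calls that write each planned file."""
    raise NotImplementedError

def find_errors(files: dict) -> list:
    """Stand-in for error detection, e.g. checking the generated Wasp project."""
    raise NotImplementedError

def fix_errors(files: dict, errors: list) -> dict:
    """Stand-in for the GPT call that patches the reported errors."""
    raise NotImplementedError

MAX_FIX_ATTEMPTS = 3  # assumed retry limit

def build_app(description: str) -> dict:
    plan = plan_app(description)       # phase 1: planning
    files = generate_code(plan)        # phase 2: generating the code
    for _ in range(MAX_FIX_ATTEMPTS):  # phase 3: fixing the errors
        errors = find_errors(files)
        if not errors:
            break                      # no detectable errors left: done
        files = fix_errors(files, errors)
    return files
```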
We can also see that all this took about 27,000 tokens. 02:48:35.120 |
The cool thing is that, while developing Mage, we identified the most common errors it consistently kept making. 02:48:41.120 |
Like mixing up the default and named imports. 02:48:43.120 |
Some of them we even ended up fixing with a simple heuristic, without involving GPT. 02:48:51.120 |
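As a flavor of what such a heuristic could look like (my sketch, not Mage's real code), a default-versus-named-import mix-up in generated JavaScript can be rewritten with plain string matching, no GPT involved; the export table is illustrative:

```python
import re

# Named exports we know about up front; illustrative, not Mage's real table.
NAMED_EXPORTS = {"useQuery": "@wasp/queries", "useAction": "@wasp/actions"}

def fix_default_import(line: str) -> str:
    """Rewrite `import useQuery from 'x'` to `import { useQuery } from 'x'`
    when we know the symbol is actually a named export."""
    m = re.match(r"import\s+(\w+)\s+from\s+(['\"].+['\"])", line)
    if m and m.group(1) in NAMED_EXPORTS:
        return f"import {{ {m.group(1)} }} from {m.group(2)}"
    return line
```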
Again, the Wasp framework with its high-level configuration was of great help here, 02:48:56.120 |
since it removed a ton of code and reduced the space for errors significantly. 02:49:00.120 |
Now, let's take a look at another question we had. 02:49:05.120 |
A typical app we created with Mage took about 2-3 minutes and 25-60,000 tokens. 02:49:17.120 |
We used GPT-3.5 and GPT-4 interchangeably for different stages. 02:49:24.120 |
If we used only GPT-4 for everything, the cost would have been 10x more. 02:49:33.120 |
What we did is we used GPT-4 only for the planning stage. 02:49:36.120 |
Which is the most complex step and one that requires the most creativity. 02:49:40.120 |
For the actual implementation, we could comfortably use GPT-3.5, which is both faster and cheaper. 02:49:48.120 |
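A minimal sketch of that per-stage routing, using the 2023-era openai SDK; the stage names and prompt handling are illustrative, only the GPT-4-for-planning split comes from the talk:

```python
import openai  # 2023-era SDK (openai<1.0)

# Route each stage to the cheapest model that can handle it.
MODEL_BY_STAGE = {
    "plan": "gpt-4",               # most complex, most creative step
    "implement": "gpt-3.5-turbo",  # faster and cheaper
    "fix": "gpt-3.5-turbo",
}

def llm(stage: str, prompt: str) -> str:
    resp = openai.ChatCompletion.create(
        model=MODEL_BY_STAGE[stage],
        messages=[{"role": "user", "content": prompt}],
    )
    return resp["choices"][0]["message"]["content"]
```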
Again, the key here is that we provided a highly guided environment for the coding agent. 02:49:57.120 |
This is also the main difference between Mage and the other coding agents. 02:50:01.120 |
We tried another popular agent that uses a more free-form approach and relies more on GPT itself. 02:50:08.120 |
And the cost to make an app similar to the ones we made with Mage was between 80 cents and 10 dollars. 02:50:19.120 |
Is it going to magically produce any app you imagine, or do you still have to put some work in? 02:51:24.120 |
At the current stage, Mage serves as a really good and highly customized kickstarter for full-stack web apps. 02:50:31.120 |
At that level, it can operate with almost no errors, or very few that you can easily detect and fix. 02:50:36.120 |
Most of the people that tried it found it a super easy way to get their app kickstarted with mainstream pieces of the stack such as React, Node, and Tailwind. 02:50:47.120 |
I personally believe this is what the future of SaaS starters looks like. 02:50:51.120 |
Tailored to your app instead of starting out with a generic boilerplate. 02:50:55.120 |
As you would expect, the more you push it, the more errors it starts making. 02:50:59.120 |
On the other hand, not giving enough information and just saying something like "make Facebook but yellow" can also be counterproductive. 02:51:10.120 |
We created Mage as an experiment to see how well it can produce full-stack web apps with Wasp. 02:51:18.120 |
The current main limitation of Mage comes from its simplicity. 02:51:21.120 |
And the fact that there is no interaction with the user beyond the initial prompt. 02:51:25.120 |
So, that's something we are looking to add next: 02:51:29.120 |
the ability to, while still on the web page, interact with the agent and request changes and error fixes. 02:51:35.120 |
Another thing that would be interesting to explore would be using an LLM that is fine-tuned for Wasp and web development. 02:51:41.120 |
Although, that would also make it more expensive. 02:51:44.120 |
Also, since Wasp has such simple and human-readable syntax, it's hard to predict how much benefit fine-tuning would bring. 02:51:54.120 |
We saw what Mage was, how it works, and what is the secret sauce that made it both fast and affordable to create web apps. 02:52:04.120 |
I had a lot of fun making this video with my helper. 02:52:09.120 |
Please give Mage a try and let us know how it went. 02:52:12.120 |
We are the same team that created Wasp, which is a fully open-source web framework that makes it super easy to develop with React and Node.js. 02:52:20.120 |
Also, check out our repo and join our Discord for any questions and comments. 02:54:55.100 |
Ladies and gentlemen, we're starting now, please take your seats! 02:57:56.980 |
Please welcome our next speaker, the founder of LangChain, Harrison Chase. Thank you guys for having me and thank you guys for being here. This is 02:58:16.720 |
maybe one of the most famous screens of 2023 and yet I believe and I think we all believe and that's 02:58:25.600 |
why we're all here that this is just the beginning of a lot of amazing things that we're all going 02:58:31.700 |
to create. Because as good as ChatGPT is and as good as the language models that underlie 02:58:37.920 |
them are, by themselves they're just the start. By themselves they don't know about current 02:58:43.420 |
events, they cannot run the code that you write and they don't remember their previous interactions 02:58:48.260 |
with you. In order to get to a future where we have truly personalized and actually helpful 02:58:54.220 |
AI assistants, we're going to need to take these language models and use them as one 02:58:58.540 |
part of a larger system and that's what I think a lot of us in here are trying to do. These 02:59:04.860 |
systems will be able to produce seemingly amazing and magical experiences, they'll understand 02:59:11.340 |
the appropriate context and they'll be able to reason about it and respond appropriately. 02:59:17.820 |
At Langchain we're trying to help teams close that gap between these magical experiences and 02:59:24.300 |
the work that's actually required to get there and we believe that behind all of these seemingly 02:59:29.100 |
magical product moments, there is an extraordinary feat of engineering and that's why it's awesome 02:59:34.360 |
to be here at the AI engineering summit. I'm going to talk a little bit about some of the approaches 02:59:39.140 |
that we see work for developers when they're building these context aware reasoning applications 02:59:45.720 |
that are going to power the future. First I'm going to talk about context. When I say context, 02:59:52.260 |
I mean bringing relevant context to the language model so it can reason about what to do. Bringing 02:59:57.020 |
that context is really, really important because if you don't provide that context, no matter how 03:00:00.820 |
good the language model is, it won't be able to figure out what to do. The first type of context 03:00:08.420 |
and probably the most common type of context that we see people bringing to the language model, 03:00:12.660 |
we see them bringing through this instruction prompting type of approach where they basically 03:00:16.780 |
tell the language model how to respond to specific scenarios or specific inputs. This is pretty 03:00:23.500 |
straightforward and I think the way to think about it is if you have a new employee who shows 03:00:27.720 |
up on the first day of work, you give them an employee handbook and it tells them how they 03:00:30.800 |
should behave in certain scenarios. You can equate that to kind of like this instruction prompting 03:00:35.800 |
technique. It's pretty straightforward, I think that's why people start with it, and as the 03:00:40.880 |
models get better and better, this zero shot type of prompting is going to be able to carry 03:00:45.100 |
a lot of the relevant context for how you expect the language model to behave. There are some cases 03:00:50.860 |
where telling the language model is actually quite hard and it becomes better to give it some few-shot 03:00:56.100 |
examples. It becomes better to give it examples where you show the language model how to behave 03:01:00.780 |
rather than just tell it how to behave. So I think a few concrete places where this works 03:01:06.100 |
is where it's actually a little bit difficult to describe how exactly the language model should 03:01:10.720 |
respond. So tone, I think, is a good use case for this, and then also structured output is 03:01:15.960 |
a good use case for this. You can give examples of the structured output format, you can give examples 03:01:21.860 |
of the output tone a little bit more easily than you could describe a particular tone in 03:01:26.540 |
language. Structured output is a little bit easier to describe, but I think as it starts to get more and more 03:01:31.220 |
complicated, giving these really specific examples can help. The next type of context is maybe the 03:01:37.180 |
one that, you know, pops to mind most when you hear of context and when you hear about bringing 03:01:42.220 |
context to the language model. Contrasting this with the first two, retrieval augmented generation 03:01:46.660 |
context uses context not to decide how to respond, but to decide what to base its response on. So, the 03:01:54.340 |
kind of like canonical thing is you have a user question, you do some retrieval strategy, you get 03:01:58.340 |
back some context, you pass that to the language model, and you say answer this question based on the 03:02:02.300 |
context that's provided to you. So this is a little bit different from the instructions. It's maybe the 03:02:07.220 |
same as asking someone to take a test with like an open book test. You can look at the book, you can look at the 03:02:12.580 |
answers, and in this case the answers are the text that you pass in to this context. And then the fourth 03:02:18.980 |
way that we see people providing context to language models is through fine-tuning, so updating the 03:02:24.100 |
actual weights of the language model. This is still kind of like in its infancy, and I think we're 03:02:29.620 |
starting to figure out how best to do this and what scenarios this is good to do in. One of the things 03:02:35.300 |
that we've seen is that this is good for the same use cases where few-shot examples are kind of good. 03:02:39.940 |
It takes it to another extreme. And so for tone and structured data parsing, these are two use 03:02:45.620 |
cases where we've seen it pretty beneficial to start doing some fine-tuning. And the idea here is that, 03:02:50.500 |
yeah, it can be helpful to have three examples of how your model should respond and what the tone 03:02:54.660 |
there should be, but what if you could give it 10,000 examples and it updates its weights accordingly? 03:02:59.220 |
And so I think for those where the output is in a specific format, and again, you need more examples, 03:03:04.740 |
you need to show it a lot more than you can tell it, this is where we see fine-tuning starting to 03:03:08.660 |
become helpful, and I think we'll see that grow more and more over time. 03:03:11.140 |
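Before moving on: to make the retrieval-augmented pattern from a moment ago concrete, here is a minimal sketch; both helpers are stand-ins for whatever retriever and language model you actually use, and the prompt wording is illustrative:

```python
def retrieve(question: str, k: int = 4) -> list:
    """Stand-in for a vector-store similarity search returning text chunks."""
    raise NotImplementedError

def llm(prompt: str) -> str:
    """Stand-in for a call to whichever language model you're using."""
    raise NotImplementedError

def answer(question: str) -> str:
    context = "\n\n".join(retrieve(question))
    # The retrieved text decides what to base the response on,
    # not how to respond.
    prompt = (
        "Answer the question based only on the context provided.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)
```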
So we've talked about context, and now I want to talk a little bit about the reasoning bit, 03:03:16.820 |
and I think this is the most exciting and the most new bit of it as well. And so we've tried to think 03:03:20.900 |
and categorize some of the approaches that we've seen to allow these applications to do this reasoning 03:03:27.380 |
component. And so we've listed a few of them out here and tried to discern a few different axes along 03:03:32.980 |
which they kind of vary. So if we think about kind of like just plain old code, this is kind of like 03:03:37.540 |
the way things were, you know, like a year ago, so a long, long time ago. And so in code you kind 03:03:44.100 |
of like -- it's all there, it's declarative: it says what to run, it says what the outputs are, 03:03:49.540 |
what steps to take, things like that. We start adding in a language model call, and so this is like 03:03:54.100 |
the simplest form of these reasoning applications, and here you're using the language model to determine 03:03:59.140 |
what the output should be, but that's it. You're not using it to take actions yet, nothing fancy, 03:04:03.620 |
you're just using it to determine what the output should be, and it's just a single language model 03:04:06.900 |
call. So you're providing the context, and then you're returning the output to the user. If we take 03:04:12.820 |
it up a little bit, then we start to get into a chain of language model calls, or a chain of language 03:04:18.580 |
model call to API back to language model. And so this can be -- this is again used to decide the steps of 03:04:26.340 |
the output. And here there's multiple calls that are happening, and this can be used to break down 03:04:32.740 |
more complex tasks into individual components. It can be used to insert knowledge dynamically in the 03:04:38.820 |
middle of kind of like one language model call, then you go fetch some knowledge based on that language 03:04:42.820 |
model call, and then you do another one. But importantly here, the steps are known. You do this, 03:04:47.460 |
and then you do this, and then you do this. And so it's a chain of events, and that starts to change a 03:04:51.300 |
little bit when you use a router. And so in here, you're now using the language model call to start 03:04:57.140 |
determining which steps to take. So that's the big difference here. It's no longer just determining 03:05:00.980 |
the output of the system, but it's determining which steps to take. And so you can use it to determine 03:05:05.060 |
which prompts to use. So route between a prompt that's really good at math problems versus a prompt 03:05:10.740 |
that's really good at writing English essays. You can use it to route between language models. So one model 03:05:16.820 |
might be better than another. You might want to use Claude because of its long context window. Or you 03:05:20.820 |
might want to use GPT-4 because it's really good at reasoning. And so having the language model look 03:05:25.140 |
at the question and decide whether it needs to reason or whether it wants to respond in a long-form 03:05:28.740 |
fashion, you can determine which branches to go down. Or I think another common use case is using it to 03:05:34.100 |
determine which of several tools to take. So do I want to call this tool or do I want to call this tool? And 03:05:38.340 |
what should the input to that tools be? And so we have this router here, and I think before going on to the next 03:05:44.260 |
step, the main thing here that distinguishes it from that step is there's no kind of like cycles. You 03:05:49.780 |
don't kind of get these loops. You're just choosing kind of like which branch to go down. Once you start 03:05:56.020 |
adding in these loops, this is where we see a lot more complex applications. These are things that we 03:06:03.620 |
often see being called agents, kind of like out in the wild, and it's essentially kind of like a while loop. 03:06:08.580 |
And then in that loop, you're doing a series of steps. And the language model is determining which 03:06:13.380 |
steps to do. And then at some point, there's a point where it can choose whether to end the loop or 03:06:17.380 |
not. And if it ends the loop, then you finish and return to the user. Otherwise, you go back and 03:06:21.540 |
continue the loop. And so here you get the language model deciding what the outputs are. It decides what 03:06:26.500 |
steps to take. And you do have these cycles. The last thing, and I think this is largely what we would 03:06:36.020 |
describe as kind of like what AutoGPT did that took the world by storm, is this idea of an agent 03:06:40.820 |
where you kind of like remove a lot of the kind of like guardrails around what steps to take. So here, 03:06:50.420 |
the sequences of steps that are available are almost like determined by the LLM. And what I mean by this 03:06:55.780 |
is that here is where you can start doing things like adding in tools that the language model can take. 03:07:01.140 |
So if you guys are familiar with the Voyager paper, it starts adding in tools and building up a skill 03:07:06.020 |
set of tools over time. And so some of the actions that the language model can take are dynamically 03:07:10.580 |
created. And then I think the other big thing here is that you remove some of the scaffolding from the 03:07:15.300 |
state machines. So some of the -- if I go back a little bit -- so a lot of these kind of like cycles 03:07:22.020 |
that we see in the wild break things down into discrete states. The most common one that we see are kind of 03:07:27.540 |
like plan, execute, and validate. So you ask the language model to plan what to do, 03:07:31.540 |
then it goes and does it, and then you validate it, often with a language model call or something like that. 03:07:35.060 |
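Sketched as code, that plan/execute/validate cycle might look like this; each helper is a stand-in for a language model call or tool execution, and the step cap is an assumption:

```python
def plan_step(state: str) -> str:
    """Stand-in for the LLM call that decides the next action."""
    raise NotImplementedError

def execute(plan: str) -> str:
    """Stand-in for actually running the chosen tool or action."""
    raise NotImplementedError

def validate(state: str, result: str) -> str:
    """Stand-in for the LLM check; returns 'done' or 'continue'."""
    raise NotImplementedError

def run_agent(task: str, max_steps: int = 10) -> str:
    state = task
    for _ in range(max_steps):      # the loop, capped so it terminates
        plan = plan_step(state)     # plan
        result = execute(plan)      # execute
        if validate(state, result) == "done":  # validate
            return result
        state = f"{state}\nObservation: {result}"
    return state
```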
And I think the big difference between that and then the autonomous agent style thing is that here, 03:07:40.900 |
you are implicitly asking the agent to do all of those things in one go. It should know when it 03:07:45.700 |
should plan, it should know when it should validate, and it should know when it should kind of like 03:07:49.220 |
determine what action to take. And you are asking it all to do that implicitly. You don't have these 03:07:54.020 |
kind of like distinct sequences of steps laid out in the code. And so this is a little bit about how 03:08:01.060 |
we're thinking about it. I think the thing to -- the thing that I like to say when saying this as well, 03:08:07.460 |
which goes back to the beginning, is that the main thing that we think is it's still extremely early 03:08:11.620 |
on in the space. We still think it's the beginning. And this could, you know, in three months be kind 03:08:16.260 |
of irrelevant as the space progresses. So I would just keep that in mind. If we think about kind of 03:08:22.260 |
like some of the magical experiences like this where it can reason over the relevant context, what is it 03:08:29.460 |
going to take to kind of like build it under the hood? What is the engineering that's going to go in 03:08:34.580 |
to all these seemingly magical experiences? And so this is an example of what could be going under the 03:08:41.300 |
hood of something like this. It's going to be a challenging experience to build these complex systems, 03:08:47.140 |
and that's why we're building some of the tooling like this, what you see here, to help debug, 03:08:51.860 |
understand, and iterate on these systems of the future. And so what exactly are the challenges 03:08:57.860 |
associated with building these complex context-aware reasoning applications? The first is kind of just the 03:09:05.060 |
orchestration layer. So figuring out which of the different reasoning kind of like cognitive 03:09:10.660 |
architectures you should be using. Should you be using a simple chain? Should you be using a router, 03:09:15.940 |
a more complex agent? And I think the thing to remember here is that it's not necessarily that one is 03:09:20.980 |
better than the other or superior to the other. They all have kind of like their pros and cons and 03:09:24.980 |
strengths and weaknesses. So chains are really good because you have more control over the sequence of 03:09:29.620 |
steps that are taken. Agents are better because they can more dynamically react to unexpected inputs and 03:09:34.580 |
handle edge cases. And so being able to choose the right cognitive architecture that you want and 03:09:40.100 |
being able to quickly experiment with a bunch of other ones are part of what inspired the initial release 03:09:45.380 |
of LangChain and kind of how we aim to help people prototype these types of applications. And then LangSmith, 03:09:52.820 |
which is this thing here, provides a lot of visibility into what actually is going on. As these applications 03:09:59.300 |
start to get more and more complex, understanding what exact sequences of tools are being used, what 03:10:05.300 |
exact sequences of language model calls are being made becomes increasingly important. Another big thing 03:10:11.380 |
that we see people struggling with and spending a lot of time on is good old-fashioned data engineering. 03:10:16.020 |
A lot of this comes down to providing the right context to language models, and the right context is 03:10:22.020 |
often data. So you need to have ways to load that data, you need to have ways to transform that data, 03:10:26.420 |
to transport that data, and then you often want to have observability into what exact data is getting 03:10:30.580 |
passed around and where. And so LangChain itself has a lot of open source kind of like modules for 03:10:35.860 |
loading that data and transforming that data. And then LangSmith, we often see being really useful for 03:10:41.300 |
debugging what exactly does that data look like by the time it's getting to the language model. Have you 03:10:46.020 |
extracted the right documents from your vector store? Have you transformed them and formatted in the right way 03:10:50.980 |
where it's clear to the language model what's actually in them? These are all things that you're going to 03:10:54.980 |
want to be able to debug so there's no little small errors or small issues that pop up. 03:10:59.940 |
And then the third thing that we see a lot of people spending time on when building these applications 03:11:05.940 |
is good old-fashioned prompt engineering. So the main new thing here is language models. And the main way of 03:11:11.620 |
interacting with language models is through prompts. And so being able to understand what exactly does the 03:11:16.980 |
fully formatted prompt look like by the time it's going into the language model is really important. 03:11:22.100 |
How are you combining the system instructions with maybe the few shot examples, any retrieved context, 03:11:29.700 |
the chat history that you've got going on, any previous steps that the agent took, what does this 03:11:33.940 |
all look like by the time it gets to the language model? And what does this look like in the middle of this 03:11:38.980 |
complex application? It's easy enough to kind of like test and debug this if it's the first call, 03:11:44.020 |
the first part of the system. But after it's already done three of these steps, if you want to kind of 03:11:48.100 |
like debug what that prompt looks like, what that fully formatted prompt looks like, being able to do 03:11:52.340 |
that becomes increasingly difficult as the systems kind of scale up in their entangledness. And so we've 03:11:58.260 |
tried to make it really easy to hop into any kind of like particular language model call at any point in 03:12:03.860 |
time, open it up in a playground like this so you can edit it directly and experiment with that prompt 03:12:09.780 |
engineering and go kind of like change some of the instructions and see how it responds or swap out model 03:12:16.580 |
providers so that you can see if another model provider does better. Another big challenge with 03:12:23.460 |
these language model applications and is probably worth a talk on its own is evaluation of them. And 03:12:28.820 |
so I think evaluation is really hard for a few reasons. I think the two primary ones are a lack of data 03:12:34.340 |
and a lack of good metrics. So comparing to traditional kind of like data science and machine learning with 03:12:39.620 |
those, you generally started with a data set. You needed that to build your model, and so then when it 03:12:44.180 |
came time to evaluate it, you at least had those data points that you could look at and evaluate on. 03:12:48.100 |
And I think that's a little bit different with a lot of these LLM applications because these models 03:12:53.460 |
are fantastic zero-shot kind of like learners. That's kind of like the whole exciting bit of them. And so 03:12:59.860 |
you can get to a working MVP without building up kind of like any data set at all. And that's awesome, 03:13:05.060 |
but that does make it a little bit of a challenge when it comes to evaluating them because you don't have 03:13:09.620 |
these data points. And so one of the things that we often encourage a lot of people to do and try to 03:13:15.380 |
help them do as well is build up these data sets and iterate on those. And those can come from either 03:13:20.420 |
labeling data points by hand or looking at production traffic and pulling things in or auto-generating 03:13:26.420 |
things with LLMs. The second big challenge in evaluation is lack of metrics. I think most 03:13:33.220 |
traditional kind of like quantitative metrics don't perform super well for large unstructured outputs. 03:13:38.980 |
A lot of what we see people doing is still doing a kind of like vibe check to kind of like see how the 03:13:44.980 |
model is performing. And as unsatisfying as that is, I still think that's probably the best way to gain 03:13:52.020 |
kind of like intuition as to what's going on. And so a lot of what we try to do is make it really 03:13:56.900 |
easy to observe the outputs and the inputs of the language models so that you can build up that 03:14:01.380 |
intuition. In terms of more quantitative and systematic metrics, we're very bullish on LLM-assisted 03:14:08.660 |
evaluation, so using LLMs to evaluate the outputs. And then I think maybe the biggest thing that we see 03:14:15.460 |
people doing in production is just keeping track of feedback, whether it be direct or indirect feedback. 03:14:20.660 |
So do they leave kind of like a thumbs up or a thumbs down on your application? That's an example of 03:14:24.580 |
direct feedback where you're gathering that. An example of indirect feedback might be if they click 03:14:28.340 |
on a link; that might be a sign that you provided a good suggestion. Or if they respond 03:14:33.460 |
really confused to your chatbot, that might be a good indication that your chatbot actually did not 03:14:38.500 |
perform well. And so tracking these over time and doing A/B testing with that using kind of like 03:14:43.060 |
traditional A/B testing software can be pretty impactful for gathering a sense online of how your model is 03:14:49.940 |
doing. And then the last interesting thing that we're spending a lot of time thinking about is 03:14:54.660 |
collaboration. So as these systems get bigger and bigger, there's doubtless going to be collaboration 03:14:58.820 |
among a lot of people. And so who exactly is working on these systems? Is it all AI engineers? As we're 03:15:07.140 |
here today, is it a combination of AI engineers and data engineers and data scientists and product managers? 03:15:12.900 |
And I think one of the interesting trends that we're seeing is it's still a little bit unclear what the best 03:15:17.860 |
skill sets for this new AI engineer type role is. And there could very well be a bunch of different 03:15:23.940 |
skill sets that are valuable. So going back to kind of like the two things that we see making up a lot 03:15:28.980 |
of these applications, the context awareness and the reasoning bit. The context awareness is bringing the 03:15:33.300 |
right context to these applications. You often need kind of like a data engineering team to get in there and 03:15:38.340 |
assist with that. The reasoning bit is often done through prompting, and oftentimes that's best done by 03:15:43.060 |
non-technical people who can really outline the exact specification of the app that they're building, 03:15:48.020 |
whether they be product managers or subject matter experts. And so how do you enable collaboration 03:15:53.220 |
between these two different types of folks? And what exactly does that look like? I don't think that's 03:15:57.300 |
something that anyone kind of knows or definitely hasn't solved, but I think that's a really interesting 03:16:02.020 |
trend that we're thinking a lot about going forward. And so I think the main thing that I want to leave 03:16:11.460 |
you all with is that the big thing that we believe is that it's still really, really early on in this 03:16:17.220 |
journey. It's just the beginning. As crazy as things have been over the past year, they're hopefully 03:16:22.180 |
going to get even crazier. You saw an incredible demo of GPT-4V. Things like that are going to change it. 03:16:27.460 |
And so we think behind all of these things, it's going to take a lot of engineering. And we're 03:16:31.860 |
trying to build some of the tooling to help enable that. And I think you guys are all on the right 03:16:36.020 |
track towards becoming those types of engineers by being at a conference like this. So thank you, 03:16:40.900 |
swyx, for having me. Thank you guys for being here. Have a good rest of your day. 03:16:52.420 |
Please welcome our next speaker, the founder of 567, Jason Liu. 03:17:08.820 |
Hey guys. So I didn't know I was going to be one of the keynote speakers. So this is probably going to be the most 03:17:18.500 |
reduced scope talk of today. I'm talking about type hints. And in particular, I'm talking about 03:17:25.220 |
how Pydantic might be all you need to build with language models. In particular, I want to talk about 03:17:30.180 |
structured prompting, which is the idea that we can use objects to define what we want back out, rather 03:17:34.900 |
than kind of praying to the LLM gods that the comma is in the right place and the bracket was closed. 03:17:41.060 |
So everyone here basically kind of knows or at least agrees that large language models are kind of 03:17:45.860 |
eating software. But what this really means in production is 90% of the applications you build are 03:17:51.460 |
just ones where you're asking the language model to output JSON or some structured output that you're 03:17:56.500 |
parsing with a regular expression. And that experience is pretty terrible. And the reason this is the case 03:18:01.780 |
is because we really want language models to be backwards compatible with the existing software that we 03:18:06.660 |
have. You know, code gen works. But a lot of the systems we have today are systems that we can't 03:18:11.060 |
change. And so, yeah, the idea is that although language models were introduced to us through ChatGPT, 03:18:17.620 |
most of us are actually building systems and not chatbots. We want to process input data, integrate with 03:18:23.140 |
existing systems via APIs or schemas that we might not have control over. And so the goal for today is 03:18:29.220 |
effectively introduce OpenAI function calling, introduce PyDantic, then introduce Instructor and 03:18:35.220 |
Marvin as a library to make using PyDantic to prompt language models much easier. And what this gets us 03:18:41.620 |
is, you know, better validation, makes your code a little bit cleaner, and then afterwards I'll talk 03:18:46.980 |
over some design patterns that I've uncovered and some of the applications that we have. 03:18:53.300 |
This is basically almost everyone's experience here, right? Like, you know, 03:18:56.340 |
Riley Goodside had a tweet about asking to get JSON out of Bard, and the only way you could do it was 03:19:01.620 |
to threaten to take a human life. And that's not code I really want to commit into my repos. And then when 03:19:06.740 |
you do ask for JSON, you know, maybe it works today, but maybe tomorrow, instead of getting JSON, you're 03:19:11.780 |
going to get, like, okay, here you go, here's some JSON. And then again, you kind of pray that the JSON's 03:19:16.580 |
parsed correctly. And I don't know if you noticed, but here, user is a key for one query, and username is a key for 03:19:22.260 |
another, and you would not really notice this unless you had, like, good logging in place. But really, 03:19:26.580 |
this should not happen to begin with, right? Like, you shouldn't have to, like, read the logs to figure 03:19:31.380 |
out that the passwords didn't match when you're signing up for an account. And so what this means is 03:19:36.100 |
our prompts and our schemas and our outputs are all strings. We're kind of writing code in a text editor, 03:19:41.460 |
rather than an IDE where you could, you know, get linting or type checking or syntax highlighting. And so OpenAI 03:19:50.180 |
function calls somewhat fix this, right? We get to define JSON schema of the output that we want, 03:19:56.580 |
and OpenAI will do a better job in placing the JSON somewhere that you can reliably parse out. 03:20:01.620 |
So instead of going from string to string to string, you get string to dict to string, and then you still 03:20:09.220 |
have to call JSON loads. And again, you're kind of praying that everything is in there. And a lot of 03:20:12.900 |
this is kind of praying to the LLM gods. On top of that, like, if this code was committed to any 03:20:18.980 |
repo I was managing, like, I would be pissed, right? Complex data structures are already difficult to 03:20:25.060 |
define, and now you're working with the dictionary of JSON loads, and that also feels very unsafe because 03:20:30.740 |
you get missing keys, missing values, and you get hallucinations, and maybe the keys are spelled wrong, 03:20:36.420 |
you're missing an underscore, and you get all these issues. And then you end up writing code like this. 03:20:41.220 |
And this works for, like, name and age and email, and then you're checking if something is a bool by 03:20:46.260 |
parsing a string, and it gets really messy. And what Python has done to solve this is use Pydantic. 03:20:51.780 |
Pydantic is a library that does data model validation, very similar to data classes. It is powered by type 03:20:58.740 |
hints. It has really great model and field validation. It has 70 million downloads a month, 03:21:05.460 |
which means it's a library that everyone can trust and use and know that it's going to be maintained 03:21:09.060 |
for a long period of time. And more importantly, it outputs JSON schema, which is how you communicate 03:21:13.620 |
with open AI function calling. And so the general idea is that we can define an object like delivery, 03:21:19.860 |
say that the timestamp is a datetime, and the dimensions is a tuple of ints. And even if you pass in a 03:21:24.820 |
string as the timestamp and a list of strings as the tuple, everything is parsed out correctly. This 03:21:30.420 |
is all code we don't want to write, right? This is why there's 70 million downloads. More interestingly, 03:21:35.540 |
timestamp and dimensions are now things that your IDE is aware of. They know the type of that. You get 03:21:40.260 |
autocomplete and spell checking. Again, just more bug-free code. And so this really brings me to the idea 03:21:47.300 |
of structured prompting, because now your prompt isn't a triple quoted string. Your prompt is actual code 03:21:54.340 |
that you can look at. You can review. And everyone has written a function that returns 03:21:59.860 |
a data structure. Everyone knows how to manage code like this. Instead of doing the migration of JSON 03:22:05.300 |
schemas in the one-shot examples, you know, I've done database migrations. I know how some of these 03:22:09.940 |
things work. And more importantly, we can program this way. And so that's why I built a library called 03:22:14.980 |
Instructor a while ago. And the idea here is just to make open AI function calling super useful. 03:22:20.500 |
So the idea is you import Instructor. You patch the completion API. Debatable if this is the best 03:22:26.980 |
idea. But ultimately, you define your pydantic object. You set that as the response model of that 03:22:32.100 |
create call. And now you're guaranteed that that response model is the type of the entity that you 03:22:38.180 |
extract. So again, you get a nice autocomplete. You get type safety. Really great. 03:22:44.980 |
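Following the steps described here, a minimal version looks roughly like this; it assumes the pre-1.0 openai SDK and an early Instructor release, so the exact patch call may differ in newer versions:

```python
import instructor
import openai  # pre-1.0 SDK, as used when this talk was given
from pydantic import BaseModel

instructor.patch()  # patch the completion API to accept response_model

class UserDetails(BaseModel):
    name: str
    age: int

user = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    response_model=UserDetails,  # the return type is now guaranteed
    messages=[{"role": "user", "content": "Extract: Jason is 25 years old"}],
)
assert isinstance(user, UserDetails)  # autocomplete and type safety follow
```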
I would also want to mention that this only works for OpenAI function calling. If you want to use a more 03:22:49.460 |
comprehensive framework to do some of this pydantic work, I think Marvin is a really great library to 03:22:54.500 |
try out. They give you access to more language models and more capabilities above this response. 03:23:02.900 |
But the general idea here isn't that this is going to make your JSON come out better. The idea is that 03:23:07.860 |
when you define objects, you can define nested references. You can define methods for the behavior of 03:23:12.820 |
that object. You can return instances of that object instead of dictionaries. And you're going to write 03:23:17.700 |
cleaner code and code that's going to be easier to maintain as they're passed through different systems. 03:23:24.420 |
So here you have, for example, a base model, but you can add a method if you want to. You can define 03:23:28.820 |
the same class but with an address key. You can then define new classes like best friend and friends, 03:23:34.180 |
which is a list of user details. If I was to write this in JSON schema to make a post request, it would be 03:23:39.460 |
very unmanageable. But this makes it a lot easier. On top of that, when you have doc strings, the doc 03:23:44.500 |
strings are now a part of that JSON schema that is sent to OpenAI. And this is because the model now 03:23:50.420 |
represents both the prompt, the data, and the behavior all in one. You want good doc strings, 03:23:56.500 |
you want good field descriptors, and it's all part of the JSON schema that you send. And now your code 03:24:02.900 |
quality, your prompt quality, your data quality are all in sync. There's this one thing you want to 03:24:07.380 |
manage and one thing you want to review. And what that really means is that you need to have good variable 03:24:11.940 |
names, good descriptions, and good documentation. And this is something we should have anyways. 03:24:18.660 |
You can also do some really cool things with Pydantic without language models. For example, 03:24:22.660 |
you can define a validator. Here I define a function that takes in a value, I check that there is a 03:24:27.780 |
string in that value, and if it's not, I return a lowercase version of it, because that might be 03:24:31.540 |
how I want to parse my data. And when you construct this object, you get an error back out. We're not 03:24:37.540 |
going to fix it, but we get a validation error, something that we can catch reliably and understand. 03:24:42.500 |
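A minimal Pydantic v1-style validator in the spirit of that example (the exact rule on the slide may differ):

```python
from pydantic import BaseModel, validator

class UserDetails(BaseModel):
    name: str

    @validator("name")
    def name_must_contain_space(cls, v: str) -> str:
        # Classical validation logic -- no language model involved.
        if " " not in v:
            raise ValueError("name must contain a first and last name")
        return v.lower()  # normalize to lowercase as part of parsing

try:
    UserDetails(name="Jason")
except Exception as e:
    print(e)  # a ValidationError we can catch and handle reliably
```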
But then if you introduce language models, you can just import the LLM validator. And now you can have 03:24:47.220 |
something that says, like, don't say mean things. And then when you construct an object that has 03:24:51.860 |
something that says that the meaning of life is the evil and steal things, you're going to get a 03:24:55.940 |
validation error and an error message. And this error message, the statement is objectionable, is actually 03:25:01.300 |
coming out of a language model API call. It's using Instructor under the hood to define that. 03:25:06.740 |
But it's not enough to actually just point out these errors. You also want to fix that. And so the 03:25:11.300 |
easy way of doing that in Instructor is to just add max retries. Now what we do is we'll append the 03:25:18.260 |
message that you had before, but then we can also capture all the validations in one shot, send it 03:25:23.380 |
back to the language model, and try again. But the idea here is that this isn't prompt chaining, this isn't 03:25:29.620 |
constitutional AI, here we just have validation, error handling, and then re-asking. And these are just 03:25:35.300 |
separate systems in code that we can manage. If you want something to be less than 10 characters, 03:25:39.620 |
there's a character count validator. If you want to make sure that a name is in a database, you can 03:25:44.420 |
just add a post request if you want to. But this is just classical code again. This is the backwards 03:25:48.580 |
compatibility of language models. But we can also do a lot more, right? Structured prompts, get you 03:25:54.660 |
structured outputs. But ideally, the structure actually helps you structure your thoughts. So here's 03:25:59.460 |
another example. It's really important for us to give language models the ability to have an escape hatch and 03:26:04.980 |
say that it doesn't know something or can't find something. And right now, most people will say 03:26:09.620 |
something like: return I DON'T KNOW in all caps, then check if I DON'T KNOW in all caps is in the string. Right? 03:26:16.660 |
Sometimes it doesn't say that. It's very difficult to manage. But here, you see that I've defined user 03:26:22.340 |
details with an optional role that could be None. But the entity I want to extract is just a MaybeUser. 03:26:28.100 |
It has a result that's maybe a user, and then an error flag and an error message. And so I can write code 03:26:33.860 |
that looks like this. I get this object back out. It's a little bit more complicated. But now I can 03:26:39.540 |
kind of program with language models in a way that feels more like programming and less like chaining, 03:26:44.340 |
for example. Right? We can also define reusable components. Here I've defined a work time and a 03:26:53.060 |
leisure time as both a time range. And the time range has a start time and an end time. If I find 03:26:58.980 |
that this is not being parsed correctly, what I could do is actually add chain of thought directly in the 03:27:04.900 |
time range component. And now I have modularity in some of these features. And you can imagine having 03:27:12.420 |
a system where, in production, you disable that chain of thought field. And then in testing, you add that 03:27:18.660 |
to figure out what's the latency or performance trade-offs. You could also extract arbitrary values. 03:27:25.220 |
Here I define a property called key and value. And then I want to extract a list of properties. 03:27:29.220 |
You might want to add a prompt that says make sure the keys are consistent over those properties. 03:27:33.700 |
We can also add validators to make sure that's the case. And then re-ask when that's not the case. 03:27:37.940 |
If I want, you know, only five properties, I could add an index to the property key and just say, 03:27:43.060 |
well, now count them out. And when you count to five, stop. And you're going to get much more reliable 03:27:47.380 |
outputs. Some of the things that I find really interesting with this kind of method is prompting 03:27:52.100 |
data structures. Here I have user details. Age name as before. But now I define an ID and a friends array, 03:27:59.380 |
which is a list of IDs. And if you prompt that well enough, you can basically extract, 03:28:03.380 |
like a network out of your data. So, you know, we've seen that structured prompting kind of gives 03:28:09.780 |
you really useful components that you can reuse and make modular. And the idea again here is that we 03:28:15.700 |
want to model both the prompt, the data, and the behavior. Here I haven't mentioned too many methods 03:28:20.100 |
that you could act on this object. But the idea is almost like, you know, when we go from C to C++, 03:28:25.300 |
the thing we get is object-oriented programming, and that makes a lot of things easier. And we've learned our 03:28:29.540 |
lessons with object-oriented programming. And so if we do the right track, I think we're going to get a 03:28:33.620 |
lot more productive development out of these language models. And the second thing is that these language 03:28:37.940 |
models now can output data structures. You can pull up your old LeetCode textbooks or whatever and 03:28:43.140 |
actually figure out how to traverse these graphs, for example, process this data in a useful way. 03:28:47.460 |
And so now they can represent knowledge, workflows, and even plans that you can just dispatch to a 03:28:53.380 |
classical computer system. You can create the data that you want to send to Airflow rather than doing 03:28:59.220 |
this for loop, hoping it terminates. And so now I think I have about six minutes, so I'll go over some advanced 03:29:06.020 |
applications. These are actually fairly simple. I have some more documentation if you want to see that 03:29:10.340 |
later on. But let's go over some of these examples. So the first one is RAG. I think when we first started 03:29:16.900 |
out, a lot of these systems end up being systems where we embed the user query, make a vector database 03:29:21.700 |
search, return the results, and then hope that those are good enough. But in practice, you might have 03:29:25.620 |
multiple backends to search from. Maybe you want to rewrite the user query. Maybe you want to decompose 03:29:30.180 |
that user query. If you want to ask something like what was something that was recent, you need to have 03:29:35.540 |
time filters. And so you could define that as a data structure. Right? The search type is email or video. 03:29:41.460 |
Search has a title, a query, a before date, and a type. And then you can just implement the execute method 03:29:46.580 |
that says, you know, if type is video, do this. If email, do that. Really simple. And then what you want to 03:29:52.500 |
extract back out is multiple searches. Like, give me a list of search queries. 03:29:57.380 |
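A minimal sketch of that pattern, assuming Pydantic models and asyncio; the two backend functions are placeholders.

```python
import asyncio
import enum

from pydantic import BaseModel


async def search_videos(query: str, before: str | None = None) -> str:
    return f"video results for {query!r}"  # placeholder backend


async def search_emails(query: str, before: str | None = None) -> str:
    return f"email results for {query!r}"  # placeholder backend


class SearchType(str, enum.Enum):
    VIDEO = "video"
    EMAIL = "email"


class Search(BaseModel):
    title: str
    query: str
    before_date: str | None = None
    type: SearchType

    async def execute(self) -> str:
        # "if type is video, do this; if email, do that"
        if self.type is SearchType.VIDEO:
            return await search_videos(self.query, self.before_date)
        return await search_emails(self.query, self.before_date)


class MultiSearch(BaseModel):
    searches: list[Search]

    async def execute_all(self) -> list[str]:
        # fan out over every segmented query concurrently
        return await asyncio.gather(*(s.execute() for s in self.searches))
```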
And then you can write some asyncio code to map across these things. And now, because all that prompting is embedded in the data 03:30:04.260 |
structure, your prompt that you send to OpenAI is very simple: you're a helpful assistant; segment the 03:30:09.220 |
search queries. And then what you get back out is this ability to just have an object that you can 03:30:13.620 |
program with in the way you've programmed all your life. Right? Something very 03:30:19.060 |
straightforward. But you can also do something more interesting. You can then plan. Right? Before we 03:30:24.500 |
talked about like extracting a social network, but you can actually just produce the entire DAG. 03:30:29.620 |
Here, I had the same graph structure. All right? It's an ID, a question, and a list of dependencies, 03:30:34.820 |
where I have a lot of information in the description here. And that's basically the prompt. 03:30:38.340 |
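A minimal sketch of that DAG as data, assuming Pydantic models; the descriptions carry the prompting, and a classical executor can run independent nodes in parallel.

```python
from pydantic import BaseModel, Field


class Query(BaseModel):
    id: int
    question: str = Field(
        ..., description="A subquestion answerable once its dependencies are answered."
    )
    dependencies: list[int] = Field(
        default_factory=list,
        description="IDs of subquestions that must be answered before this one.",
    )


class QueryPlan(BaseModel):
    plan: list[Query]

    def runnable(self, answered: set[int]) -> list[Query]:
        # nodes whose dependencies are all satisfied can run now, in parallel
        return [
            q for q in self.plan
            if q.id not in answered and set(q.dependencies) <= answered
        ]
```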
And what I want back out is a query plan. So now, if you send it to a query planner that says, like, 03:30:44.740 |
you're a helpful query planner, like, build out this query, you can ask something like, what is the 03:30:48.340 |
difference in populations of Canada and Texas's home country? And then what you can see is, you know what, 03:30:53.220 |
like, if I'm good at LeetCode, I could query the first two in parallel because there are no 03:30:58.180 |
dependencies, and then wait for dependency three to merge, and then wait for four to merge those two. 03:31:04.340 |
But this requires one language model call, and now it's just traditional RAG. And if you have an IR 03:31:09.140 |
system, you get to skip this for loop of agent queries. You know, an example that was really popular 03:31:15.460 |
on Twitter recently was extracting knowledge graphs. You know, same thing here. Here, what I've done is I've 03:31:20.260 |
made sure that the data structure I model is as close as possible to the graph visualization API. 03:31:27.140 |
What that gets me is really, really simple code that does, basically, the creation and visualization 03:31:34.260 |
of a graph. I've just defined things one-to-one to the API, and now what I can do is if I ask for something 03:31:39.620 |
that's very simple, like, you know, give me the description of quantum mechanics, you can get a graph out. 03:31:46.580 |
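A minimal sketch of that one-to-one modeling, assuming Pydantic and the graphviz Python package; the field names deliberately mirror the visualization API.

```python
from graphviz import Digraph
from pydantic import BaseModel


class Node(BaseModel):
    id: int
    label: str
    color: str = "lightblue"


class Edge(BaseModel):
    source: int
    target: int
    label: str


class KnowledgeGraph(BaseModel):
    nodes: list[Node]
    edges: list[Edge]

    def visualize(self) -> Digraph:
        dot = Digraph()
        for n in self.nodes:
            dot.node(str(n.id), n.label, color=n.color)
        for e in self.edges:
            dot.edge(str(e.source), str(e.target), label=e.label)
        return dot  # dot.render("graph", format="png") writes the image
```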
That's basically in, like, 40 lines of code, because what you've done is you've modeled the data 03:31:50.500 |
structure the graph needs to make the visualization. And we're kind of trying to couple 03:31:55.540 |
that a lot more. This is a more advanced example, so don't feel bad if you can't follow this one. 03:32:01.140 |
But here, what I've done is define a question-answer model: a question and an answer, where the answer is 03:32:06.420 |
a list of facts. And a fact is a statement plus a substring quote from the original 03:32:12.980 |
text. I want multiple quotes, each a substring of the original text. And then what my validators do is 03:32:19.540 |
it says, you know what, for every quote you give me, validate that it exists in the text chunk. If it's 03:32:24.980 |
not there, throw out the fact. And then the validator for question and answer says, only show me facts 03:32:30.820 |
that have at least one substring quote from the original document. 03:32:35.300 |
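A minimal sketch of that citation pattern, assuming Pydantic models; the checks are written as plain filtering methods rather than instructor's validation-context hooks.

```python
from pydantic import BaseModel


class Fact(BaseModel):
    statement: str
    substring_quotes: list[str]

    def valid_quotes(self, context: str) -> list[str]:
        # keep only quotes that literally appear in the source chunk
        return [q for q in self.substring_quotes if q in context]


class QuestionAnswer(BaseModel):
    question: str
    answer: list[Fact]

    def grounded(self, context: str) -> "QuestionAnswer":
        # only show facts backed by at least one verbatim quote
        kept = [f for f in self.answer if f.valid_quotes(context)]
        return QuestionAnswer(question=self.question, answer=kept)
```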
So now I'm trying to encapsulate some of the business logic of not hallucinating, not by asking it not to hallucinate, but by actually 03:32:40.580 |
trying to figure out what is the paraphrasing detection algorithm to identify what the quotes 03:32:47.860 |
were. And what this means is instead of being able to say that the answer was in page seven, 03:32:52.420 |
you can say the answer was this sentence, that sentence, and something else. And I know they exist in the text 03:32:57.940 |
chunks. And so I think what we end up finding is that as language models get more interesting and 03:33:06.420 |
more capable, we're only going to be limited by the creativity we bring to prompting 03:33:10.900 |
these things, right? Like you can have instructions per object, you can have like recursive structures, 03:33:17.780 |
right? It goes into domain modeling more than it goes to prompt engineering. And again, 03:33:23.860 |
now we can use the code that we've always used. If you want more examples, I have a bunch of examples 03:33:29.060 |
here on different kinds of applications that I've had with some of my consulting clients. 03:33:32.420 |
Yeah, I think these are some really useful ones. And I'll go to the next slide, which is... 03:33:37.780 |
This doesn't have the QR code. That's fine. The updated slide has a QR code, but instead you can just 03:33:46.420 |
visit jxnl.github.io/instructor. I also want to call out that we're also experimenting with a lot of 03:33:52.580 |
different UIs to do this structured evaluation, right? Where you might want to figure out whether 03:33:58.820 |
or not one response was mean, but you also want to figure out what the distribution of floats was for 03:34:04.020 |
a different attribute and be able to write evals against that. And I think there's a lot of really 03:34:08.260 |
interesting open work to be done, right? Like right now we're doing very simple things around 03:34:11.780 |
extracting graphs out of documents. You can imagine a world where we have multimodal, in which case you 03:34:17.380 |
could be extracting bounding boxes, right? Like one application I'm really excited about is being able to 03:34:22.020 |
say, give an image, draw the bounding box for every image, and the search query I would need to go on 03:34:27.300 |
Amazon to buy this product. And then you can really instantly build a UI that just says, you know, for 03:34:32.660 |
every bounding box, render a modal, right? You can have like generative UI over images, over audio. I think in 03:34:38.820 |
general it's going to be a very exciting space to play more with structured outputs. Thank you. 03:34:43.140 |
Ladies and gentlemen, please join me in welcoming our next guest, Senior Applied Scientist at Amazon, 03:35:07.780 |
Eugene Yan. Thank you. Thank you everyone. I'm Eugene Yan and today I want to share with you some 03:35:24.580 |
building blocks for LLM systems and products. Like many of you here, I'm trying to figure out how to 03:35:30.500 |
effectively use these LLMs in production. So a few months ago, to clarify my thinking, I wrote some 03:35:37.460 |
patterns about building LLM systems and products, and the community seemed to like it. There's Jason 03:35:43.140 |
asking for this to be a seminar. So here you go, Jason. Today, I'm going to focus on four of those patterns: 03:35:50.020 |
evaluations, retrieval-augmented generation, guardrails, and collecting feedback. All the slides will be 03:35:56.340 |
made available after this talk. So I ask you to just focus. Buckle up, hang on tight, because we'll be 03:36:02.660 |
going really fast. All right, let's start with evals, or what I really consider the foundation of it all. 03:36:09.060 |
Why do we need evals? Well, evals help us understand if our prompt engineering, our retrieval augmentation, 03:36:15.380 |
our fine-tuning, is it doing anything at all? Right? Consider eval-driven development, where evals guide 03:36:22.420 |
how you build your system and product. We can also think of evals as test cases, right, where we run 03:36:27.460 |
these evals before deploying any new changes. It makes us feel safe. And finally, if managers at OpenAI take 03:36:35.060 |
the time to write evals or give feedback on them, you know it's pretty important. 03:36:42.980 |
But building evals is hard. Here are some things I've seen folks trip up on. Firstly, 03:36:48.580 |
we don't have a consistent approach to evals. If you think about more conventional machine learning, 03:36:53.780 |
for regression, we have root mean squared error; for classification, precision and recall; even for ranking, 03:36:58.420 |
NDCG. All these metrics are pretty straightforward, and there's usually only one way to compute them. 03:37:03.300 |
But what about for LLMs? Well, we have this benchmark whereby we write a prompt, there's a multiple choice 03:37:09.540 |
question, we evaluate the model's ability to get it right. MMLU is an example that's widely used where 03:37:15.940 |
it assesses LLMs on knowledge and reasoning ability, you know, computer science questions, math, US history, 03:37:20.500 |
et cetera. But there's no consistent way to run MMLU. Less than a week ago, Arvind and Sayash from 03:37:27.140 |
Princeton wrote that evaluating LLMs is a minefield. They ask, are we assessing prompt sensitivity? Are we 03:37:35.620 |
assessing the LLM? Or are we assessing our prompt to get the LLM to give us what we want? On the same day, 03:37:41.700 |
Anthropic noted that the simple MCQ may not be as simple as it seems. Simple formatting changes, 03:37:48.100 |
such as different parentheses, lead to changes in accuracy. And there's no 03:37:53.060 |
consistent way to do this. As a result, it makes it really difficult to compare models based on 03:37:58.020 |
these academic benchmarks. Now, speaking of academic benchmarks, we may have outgrown some of them. 03:38:04.500 |
For example, this task of summarization. On the top, you see the human evaluation scores on the reference 03:38:10.340 |
summaries. And on the bottom, you see the evaluation scores for the automated summaries. You don't have to 03:38:15.460 |
go through all the numbers there, but the point is that all the numbers on the bottom are already higher 03:38:21.300 |
than the numbers on top. Here's another one that's more recent on the XSUM dataset, extreme summarization, 03:38:27.220 |
where you see that all the human evaluation scores are lower than InstructGPT's. And that's not even GPT-4. 03:38:33.860 |
Now, finally, with all these benchmarks being so easily available, we sometimes forget to ask ourselves, 03:38:40.260 |
hey, is it a fit for our task? If you think about it, does MMLU really apply to your task? Maybe, 03:38:48.180 |
if you're building a college-level chatbot, right? But here's Linus reminding us that we should be 03:38:53.380 |
measuring our apps on our task and not just rely on academic evals. So how do we do evals? Well, I think 03:39:01.700 |
as an industry, we're still figuring it out. Bar pointed out it's the number one challenge out there, 03:39:05.300 |
and we hear so many people talk about evals. I think there are some tenets emerging. Firstly, 03:39:10.660 |
I think we should build evals for our specific task. And it's okay to start small. It may seem daunting, 03:39:16.340 |
but it's okay to start small. How small? Well, here's Teknium. You know, he releases a lot of open 03:39:21.300 |
source models. He starts with an eval set of 40 questions for his domain expert task. 40 evals. 03:39:27.540 |
That's all it takes, and it can go very far. Second, we should try to simplify the task as much 03:39:33.540 |
as we can. You know, while LLMs are very flexible, I think we have a better chance if we try to make it 03:39:38.420 |
more specific. For example, if you're using an LLM for a content-moderation task, you can fall back to 03:39:44.100 |
simple precision and recall. How often is it catching toxicity? How often is it catching bias? How often is 03:39:49.060 |
it catching hallucination? Next, if it's something broader like writing SQL or extracting JSON, you 03:39:55.620 |
know, you can try to run the SQL and see if it returns the expected result. That's very deterministic. 03:40:00.820 |
Or you can check the extracted JSON keys and check if the JSON keys and the values match what you expect. 03:40:06.180 |
These are still fairly easy to evaluate because we have expected answers. 03:40:11.940 |
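A minimal sketch of such a deterministic check for JSON extraction; the expected answers come from a small hand-labeled set.

```python
import json


def eval_extraction(model_output: str, expected: dict) -> bool:
    try:
        got = json.loads(model_output)
    except json.JSONDecodeError:
        return False  # invalid JSON counts as a failure
    if set(got) != set(expected):
        return False  # missing or extra keys
    return all(got[k] == v for k, v in expected.items())


def run_evals(cases: list[tuple[str, dict]]) -> float:
    # accuracy over a tiny annotated eval set; even 40 examples go far
    return sum(eval_extraction(out, exp) for out, exp in cases) / len(cases)
```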
But if your task is more open-ended, such as dialogue, you may have to rely on a strong LLM to evaluate the output. 03:40:18.980 |
However, this can be really expensive. Here's Jerry saying, you know, 60 evals, GPT-4, it costs him a lot. 03:40:25.140 |
Finally, even if you have automated evals, I think we shouldn't discount the value of eyeballing the output. 03:40:33.060 |
Here's Jonathan from Mosaic. I don't believe that any of these evals capture what we care about. 03:40:38.580 |
They have a prompt to generate games for a three-year-old and a seven-year-old, and it was more 03:40:43.620 |
effective for them to actually just eyeball the output as it trains throughout the epochs. 03:40:47.460 |
Okay, that's it for evals. Now, retrieval-augmented generation. I don't think I have to convince 03:40:54.020 |
you all here why we need retrieval-augmented generation, but, you know, it lets us add knowledge to our model 03:41:00.340 |
as input context where we don't have to rely solely on the model's knowledge. And second, it's far 03:41:05.380 |
more practical, right? It's cheaper and more precise than continuously fine-tuning on our new knowledge. 03:41:09.860 |
But retrieving the right documents is really hard. Nonetheless, we have great speakers, 03:41:15.700 |
Jerry and Anton, sharing about this topic tomorrow, so I won't go into the challenges of retrieval here. 03:41:20.500 |
Instead, I'd like to focus on the LLM side of things, right, and discuss some of the challenges that 03:41:25.860 |
remain even if we have retrieval-augmented generation. The first is that LLMs can't really see all the 03:41:33.620 |
documents you retrieve. Here's an interesting experiment, right? The task is retrieval-augmented 03:41:38.740 |
question-answering, you know, historical queries on Google, and hand-annotated answers from Wikipedia. 03:41:44.740 |
As part of the context, they provide 20 documents. Each of these documents is at most 100 tokens long, 03:41:50.820 |
so that means 2,000 tokens maximum. And one of these documents contains the answer, and the rest 03:41:56.100 |
are simply distractors. So the question they had was this: How would the position of the document 03:42:02.180 |
containing the answer affect question-answering? Now, some of you may have seen this before, 03:42:06.420 |
don't spoil it for the rest. If the answer is in the first retrieved document, accuracy is the highest. 03:42:13.300 |
If it's in the last, accuracy is decent. But if it's somewhere in the middle, it's actually worse accuracy 03:42:21.300 |
than having no retrieval-augmented generation at all. So what does this mean? It means that even if context window 03:42:29.140 |
sizes are growing, we shouldn't allow our retrieval to get worse. Getting the most relevant documents to rank 03:42:37.380 |
highly still matters, regardless of how big the context size is. And also, even if the answer 03:42:43.700 |
is in the context and in the top position, accuracy is only 75%. So that means even with perfect retrieval, you're still capped well below 100%. 03:42:53.620 |
So another gotcha is that LLMs can't really tell if the retrieved context is irrelevant. Here's a simple 03:43:01.380 |
example. So here are 20 top sci-fi movies, and you can think of this as movies that I like. 03:43:07.780 |
And I asked the LLM if I would like Twilight. So for folks not familiar with Twilight, you know, 03:43:13.460 |
it's romantic fantasy, girl, vampire, werewolf, something like that. But I think I've never 03:43:20.580 |
watched it before. But I have a really important instruction. If it doesn't think I would like 03:43:26.580 |
Twilight because I've watched all these sci-fi movies, it should reply with not applicable. And 03:43:31.940 |
this is pretty important in recommendations. We don't want to make bad recommendations. So here's what 03:43:36.980 |
happened. First, it notes that Twilight is a different genre and not quite sci-fi, which is fantastic, 03:43:42.580 |
right? But then it suggests E.T. because of interspecies relationships. I mean, I'm not sure 03:43:53.060 |
how I feel about that. Yeah, I mean, how would you feel if you got this for a movie recommendation? 03:44:00.340 |
The point is, these LLMs are so fine-tuned to be helpful, and it's really smart. And they try their 03:44:06.420 |
best to give an answer, but sometimes it's really hard to get them to say something that's not relevant, 03:44:10.580 |
especially something that's fuzzy like this, right? So, how do we best address these limitations in RAG? 03:44:17.620 |
Well, I think that there are a lot of great ideas in the field of information retrieval. 03:44:21.060 |
Search and recommendations have been trying to figure out how to show the most relevant documents on top, 03:44:25.940 |
and I think it worked really well. And there's a lot that we can learn from them. Second, LLMs may not know that the 03:44:32.980 |
retrieved document is irrelevant. I think it helps to include a threshold to exclude irrelevant documents. So, in the 03:44:39.780 |
Twilight and sci-fi movie example, I bet we could do something like just measuring item distance between those two, and if it's too far, we don't go to the next step. 03:44:47.860 |
Next, guardrails. So, guardrails are really important in production. We want to make sure what we deploy is safe. 03:44:54.900 |
What's safe? We can look at OpenAI's moderation API, hate, harassment, self-harm, 03:45:01.940 |
all that good stuff. But another thing that I also think about a lot is guardrails on factual consistency, 03:45:08.660 |
or we call that hallucinations. I think it's really important so that you don't have trust-busting 03:45:14.100 |
experiences. You can also think of these as evals for hallucination. Fortunately, or unfortunately, 03:45:21.380 |
the field of summarization has been trying to tackle this for a very long time, and we can take a leaf 03:45:25.460 |
from that playbook. So, one approach to this is via the natural language inference task. In a nutshell, 03:45:33.460 |
given a premise and a hypothesis, we classify if the hypothesis is true or false. So, given a premise, 03:45:39.220 |
John likes all fruits, the hypothesis that John likes apples is true, therefore it's entailment. 03:45:44.180 |
Because there's not enough information to confirm whether John eats apples daily, that one is neutral. And finally, 03:45:50.820 |
John dislikes apples, it's clearly false, therefore contradiction. Do you see how we can apply this to 03:45:57.380 |
document summarization? The premise is the document, and the hypothesis is the summary. 03:46:04.180 |
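A minimal sketch of that check, assuming an off-the-shelf MNLI model from Hugging Face; running it per summary sentence anticipates the sentence-level point made next.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")


def entailment_prob(premise: str, hypothesis: str) -> float:
    inputs = tok(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = nli(**inputs).logits.softmax(-1)[0]
    # roberta-large-mnli label order: contradiction, neutral, entailment
    return probs[2].item()


def check_summary(document: str, summary_sentences: list[str], thresh: float = 0.5):
    # flag each summary sentence that the document does not entail
    return [(s, entailment_prob(document, s) >= thresh) for s in summary_sentences]
```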
And it just works. Now, when doing this, though, it helps to apply it at the sentence level instead of the 03:46:09.220 |
entire document level. So, in this example here, the last sentence in the summary is incorrect. So, 03:46:15.060 |
if we run the NLI task on the entire document and summary, it's going to say that the entire summary 03:46:19.300 |
is correct. But if you run it at the sentence level, it's able to tell you that the last sentence 03:46:24.740 |
in the summary is incorrect. And they include a really nice ablation study, right, where they check the 03:46:30.580 |
granularity of the document. As we got finer and finer, from document to paragraph to sentence, 03:46:36.260 |
the accuracy of detecting factual inconsistency goes up. That's pretty amazing. Now, another approach is 03:46:42.260 |
sampling, right? And here's an example from SelfCheckGPT. Given an input document, we generate a 03:46:47.940 |
summary multiple times. Now, we check if those summaries are similar to each other: n-gram overlap, 03:46:54.260 |
BERTScore, et cetera. The assumption is that if the summaries are very different, it probably means 03:46:59.860 |
that they're not grounded on the context document and therefore likely hallucinating. But if they're 03:47:04.660 |
quite similar, you can assume that they're grounded effectively and therefore factual. 03:47:09.860 |
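A minimal sketch of that sampling check with simple n-gram overlap; generate_summary is a placeholder for the LLM call.

```python
from itertools import combinations


def ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def agreement(samples: list[str]) -> float:
    # mean pairwise Jaccard overlap; low agreement suggests hallucination
    scores = []
    for a, b in combinations(samples, 2):
        sa, sb = ngrams(a), ngrams(b)
        if sa or sb:
            scores.append(len(sa & sb) / len(sa | sb))
    return sum(scores) / len(scores) if scores else 0.0


# samples = [generate_summary(doc) for _ in range(5)]  # hypothetical LLM call
# if agreement(samples) is low, treat the summary as ungrounded
```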
And the final approach is asking a strong LLM. You know, conceptually, it's simple. Given an input document and summary, 03:47:15.540 |
they get the LLM to return a summary score. And this LLM has to be pretty strong. And we have seen that 03:47:20.660 |
strong LLMs are actually quite expensive. But in the case of factual consistency, I've seen 03:47:26.980 |
simpler methods outperform LLM-based approaches at a far lower cost. So, 03:47:34.420 |
try to keep things simple if you can. Okay, now to close the loop, let's touch briefly on 03:47:40.180 |
collecting feedback. And I'm going to need audience help here. So, why is collecting feedback important? 03:47:46.340 |
Because we want to understand what our customers like and don't like. And then the magic thing here 03:47:51.860 |
is that collecting feedback helps you build your evals and fine-tuning dataset. New models come and go 03:47:58.420 |
every day, but your evals and fine-tuning dataset, that's your transferable asset that you can always use. 03:48:04.900 |
So, but collecting feedback from users is not as easy as it seems. So, explicit feedback can be sparse. 03:48:11.140 |
Sparse means very low in number. And explicit feedback is feedback we ask users for. So, 03:48:15.860 |
here's a quick thought experiment. How many of you here use ChatGPT? Okay, I see a lot of you. How many of 03:48:21.700 |
you here actually click the thumbs up and thumbs down button? Accidentally. Okay, but these are the beta 03:48:28.260 |
testers, right? But you can see it's very small in number. So, even if you include this thumbs up, 03:48:32.980 |
thumbs down button, you may not be getting the feedback you expect. Now, if the issue with 03:48:38.900 |
explicit feedback is sparsity, then the issue with implicit feedback is noise. So, implicit feedback 03:48:44.580 |
is the feedback you get as users organically use your product, right? You don't have to ask them for 03:48:48.900 |
feedback, but you get this feedback. So, here's the same example. How often do you click the copy code 03:48:54.740 |
button? The rest of you just type it out like a madman? Okay. But does clicking the copy code 03:49:03.300 |
button mean that the code is correct? In this case, no. nrows is not a valid argument for pandas' 03:49:09.940 |
read_parquet. But if we were to consider all code snippets that were copied as positive feedback, we would 03:49:16.580 |
have a lot of bad data in our training. So, think about that. So, how do we collect feedback? I don't 03:49:22.340 |
have any good answers, but here are two apps I've seen do it really well. First one, GitHub Copilot, 03:49:26.980 |
or any kind of coding assistant, right? For people not familiar with it, you type some function 03:49:30.740 |
signature, some comments, and it suggests code. You can either accept the code, reject the code, 03:49:36.500 |
move on to the next suggestion. We do this dozens of times a day. Imagine how much feedback they get 03:49:43.300 |
from this, right? Here's a golden dataset. Another example is Midjourney. For folks not familiar with 03:49:48.180 |
Midjourney, you write a prompt, it suggests four images. And then, based on those images, 03:49:53.700 |
you can either rerun the prompt, you can either vary the prompt, that's what the V stands for, 03:49:58.900 |
or you can either upscale the image, that's what the U stands for. But do you know what an AI engineer 03:50:05.140 |
sees? Rerunning the prompt is negative reward, where the user doesn't like any of the images. 03:50:11.780 |
Varying the image is a small positive reward, where the user is saying, "This one has potential, 03:50:17.860 |
but tweak it slightly." And choosing the upscale image is a large positive reward, where the user 03:50:24.340 |
likes it and just wants to use it. So, think about this. Think about how you can build in this implicit 03:50:28.500 |
feedback data flywheel into your products so that you quickly understand what users like and don't like. 03:50:34.900 |
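A minimal sketch of logging those implicit signals; the reward values are illustrative design choices, not anything prescribed in the talk.

```python
# Map Midjourney-style UI actions to reward labels for later evals
# and fine-tuning data; the exact values are a design choice.
REWARDS = {
    "rerun": -1.0,    # user liked none of the images
    "vary": 0.5,      # "this one has potential, but tweak it slightly"
    "upscale": 1.0,   # user likes it and wants to use it
}


def log_feedback(prompt: str, image_id: str, action: str, store: list) -> None:
    store.append({"prompt": prompt, "image": image_id, "reward": REWARDS[action]})
```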
Oh, sorry. You can take your phone out. All slides available after the talk. 03:50:39.860 |
So, that's all I wanted to share. If you remember anything from this talk, I hope it's these three 03:50:46.340 |
things. You need automated evals. You need automated evals. Just annotate 30 or 100 examples and start from 03:50:56.580 |
there, right? And then figure out how to automate it. It will help you iterate faster, right? On your prompt 03:51:01.140 |
engineering, on your retrieval augmentation, on your fine-tuning, help you deploy safer. I mean, this 03:51:06.660 |
is a huge conference of engineers. I don't think I have to explain to you the need for testing. Eyeballing 03:51:12.580 |
doesn't scale. It's good as a final vibe check, but it just doesn't scale. Every time you update the prompt, 03:51:17.620 |
you just want to run your evals immediately, right? I run hundreds of... I run tens of experiments every day, 03:51:22.820 |
and the only way I can do this is with automated evals. Second, reuse your existing systems as much 03:51:29.780 |
as you can. There's no need to reinvent the wheel. BM25, metadata matching can get you pretty far, 03:51:36.900 |
and so do the techniques from recommendation systems, right? Two-stage retrieval and ranking, 03:51:41.380 |
filtering, etc. All these information retrieval techniques are optimised to rank the most relevant 03:51:48.180 |
items on top, so don't forget about them. And finally, UX plays a large role in LLM products. 03:51:56.180 |
I think that a big chunk of GitHub Copilot and ChatGPT is UX. It allows you to use the LLMs in your context 03:52:04.180 |
without calling an API. You can use an IDE using a chat window. Similarly, UX makes it far more effective 03:52:11.300 |
for you to collect user feedback. Okay, that's all I had. Thank you, and keep on building. 03:52:17.620 |
Our next speaker is AI lead at Notion. Please welcome Linus Lee. 03:52:35.620 |
Hi, everyone. I'm Linus. I'm here to talk about embeddings. I'm grateful to be here at the inaugural 03:52:51.940 |
AI engineer conference. Who learned something new today? Yeah. Before I talk about that, a little bit 03:52:58.820 |
about myself. If you don't know me already, I am Linus. I work on AI at Notion for the last year or so. 03:53:05.220 |
Before that, I did a lot of independent work, prototyping, experimenting with, 03:53:08.260 |
trying out different things with language models, with traditional NLP, things like TF-IDF, BM25, 03:53:13.220 |
to build interesting interfaces for reading and writing. In particular, I worked a lot with embedding 03:53:17.060 |
models and latent spaces of models, which is what I'll be talking about today. But before I do that, 03:53:22.900 |
I want to take a moment to say it's been almost a year since Notion launched Notion AI. Our public 03:53:28.260 |
beta was first announced in around November 2022. So as we get close to a year, we've been steadily launching new 03:53:34.820 |
and interesting features inside Notion AI. From November, we have AI autofill inside databases, 03:53:40.020 |
translation, and things coming soon, though not today, so keep an eye on the space. And obviously, 03:53:46.260 |
we're hiring, just like everybody else here. We're looking for AI engineers, product engineers, 03:53:50.020 |
machine learning engineers to tackle the full gamut of problems that people have been talking about 03:53:53.940 |
today. Agents, tool use, evaluations, data, training, and all the interface stuff that we'll see today 03:54:01.140 |
and tomorrow. So if you're interested, please grab me and we'll have a little chat. 03:54:06.180 |
Now, it wouldn't be a Linus talk without talking about latent spaces, so let's talk about it. 03:54:09.620 |
One of the problems that I always find myself motivated by is the problem of steering language models. And I 03:54:15.060 |
always say that prompting language models feels a lot like you're steering a car from the backseat 03:54:18.900 |
with a pool noodle. Like, yes, technically, you have some control over the motion of the vehicle. 03:54:23.380 |
It's like there's some connection. But like, you're not really in the driver's seat. The control 03:54:26.980 |
isn't really there. It's not really direct. There's like three layers of indirection between 03:54:30.180 |
you and what the vehicle's doing. And that, to me, trying to prompt a model, especially smaller, 03:54:35.940 |
more efficient models that we can use for production with just tokens, just prompts, 03:54:40.500 |
feels a lot like there's too many layers of indirection. And even though models are getting 03:54:44.740 |
better at understanding prompts, I think there's always going to be this fundamental barrier between 03:54:48.420 |
indirect control of models with just prompts and getting the model to do what we want them to do. 03:54:54.340 |
And so perhaps we can get a closer layer of control, a more direct layer of control, 03:54:58.340 |
by looking inside the model, which is where we look at latent spaces. 03:55:01.380 |
Latent spaces arise, I think, most famously inside embedding models. If you embed some piece of text, 03:55:09.540 |
that vector of 1536 numbers or 1024 numbers is inside a high-dimensional vector space. That's a 03:55:15.300 |
latent space. But also you can look at latent spaces inside activation spaces of models, inside token 03:55:20.580 |
embeddings, inside image models, and obviously other model architectures like autoencoders and 03:55:24.260 |
adapters. Today, we're going to be looking at embedding models, but I think a lot of the general 03:55:28.580 |
takeaways apply to other models, and I think there's a lot of fascinating research work happening inside 03:55:32.340 |
other models as well. When you look at an embedding, you kind of see this, right? You see like rows and 03:55:38.180 |
rows of numbers. If you ever debug some kind of an embedding pipeline and you print out the embedding, 03:55:42.340 |
you can kind of tell it has like a thousand numbers, but it's like looking at a Matrix screen of 03:55:46.740 |
numbers raining down. But in theory, there's a lot of information actually packed inside those embeddings. If you get an embedding of a piece of text or image, 03:55:53.860 |
these latent spaces, these embeddings represent, in theory, the most salient features of a text or 03:55:59.060 |
the image that the model is using to lower its loss or do its task. And so maybe if we can disentangle 03:56:04.180 |
some meaningful attributes or features out of these embeddings, if we can look at them a little more 03:56:08.340 |
closely and interpret them a little better, maybe we can build more expressive interfaces that let us 03:56:12.740 |
control the model by interfering or intervening inside the model. Another way to say that is that embeddings show us 03:56:19.940 |
what the model sees in a sample of input. So maybe we can read out what it sees and try to 03:56:25.140 |
understand better what the model's doing. And maybe we can even control the embedding, 03:56:30.340 |
intermediate activations to see what the model can generate. So let's see some of that. 03:56:36.340 |
Some of this, some of you might have seen before, but I promise there's some new stuff at the end, so hang tight. 03:56:49.540 |
So here's some sentence that I have. It's a sentence about this novel, one of my favorite novels, 03:56:53.540 |
named Diaspora. It's a science fiction novel by Greg Egan that explores evolution and existence, 03:56:58.900 |
post-human artificial intelligences, something to do with alien civilizations and the questioning the 03:57:03.300 |
nature of reality and consciousness, which you might be doing a lot given all the things that are happening. 03:57:07.780 |
And so I have trained this model that can generate some embeddings out of this text. So if I hit 03:57:15.300 |
enter, it's going to give us an embedding. But it's an embedding of length 2048, and so it's quite large. 03:57:20.740 |
But it's just a row of numbers, right? But then I have a decoder half of this model that can take this 03:57:25.540 |
embedding and try to reconstruct the original input that may have produced this embedding. So in this case, 03:57:30.740 |
it took the original sentence. There's some variation. You can tell it's not exactly the same length, maybe. 03:57:34.940 |
But it's mostly reconstructed the original sentence, including the specific details like the title of the book and so on. 03:57:39.940 |
So we have an encoder that's going from text to embedding, and a decoder that's going from embedding back to text. 03:57:46.940 |
And now we can start to do things with the embedding to vary it a little bit. 03:57:49.940 |
And see what the decoder might see if we make some modifications to the embedding. 03:57:54.740 |
So here, I've tried to kind of blur the embedding and sample some points around the embedding with this blur radius. 03:58:02.740 |
And you can see the text that's generated from those blurry embeddings, they're a little off. 03:58:06.340 |
Like, this is not the correct title. The title's kind of gone here. 03:58:09.540 |
It still kept the name Greg, but it's a different person, and so there's kind of a semantic blur that's happened here. 03:58:18.340 |
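A minimal sketch of that semantic blur; encode and decode are placeholders standing in for the encoder and decoder halves of the model.

```python
import numpy as np

rng = np.random.default_rng(0)
def encode(text: str) -> np.ndarray: return rng.normal(size=2048)  # placeholder encoder half
def decode(emb: np.ndarray) -> str: return "<decoded text>"        # placeholder decoder half


def blur_samples(text: str, radius: float = 0.1, k: int = 4) -> list[str]:
    emb = encode(text)
    noisy = [emb + rng.normal(0.0, radius, emb.shape) for _ in range(k)]
    return [decode(e) for e in noisy]  # slightly-off reconstructions
```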
But this is kind of boring. This is not really useful. 03:58:20.340 |
What's a little more useful is trying to actually manipulate things in more meaningful directions. 03:58:24.340 |
So now we have the same piece of text. And now here I have a bunch of controls. 03:58:28.340 |
So maybe I want to find a direction in this embedding space. 03:58:32.340 |
Here I've computed a direction where if you push an embedding in that direction, that's going to represent a shorter piece of text of roughly the same topic. 03:58:39.140 |
And so I pick this direction, and I hit go, and it'll try to push the embedding of this text in that direction and decode them out. 03:58:47.940 |
And you can tell they're a little bit shorter if I push it a little bit further, even. 03:58:51.940 |
So now I'm taking that shorter direction and moving a little farther along it and sampling, generating text out of those embeddings again. 03:58:59.140 |
And they're even a little bit shorter. But they've still kept the general kind of idea, general topic. 03:59:03.940 |
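A minimal sketch of steering along such a direction, computed as a difference of centroids (the approach described later in the talk); encode and decode are placeholders for the model halves.

```python
import numpy as np

rng = np.random.default_rng(0)
def encode(text: str) -> np.ndarray: return rng.normal(size=2048)  # placeholder encoder half
def decode(emb: np.ndarray) -> str: return "<decoded text>"        # placeholder decoder half


def direction(short_texts: list[str], long_texts: list[str]) -> np.ndarray:
    # centroid difference points from "long" examples toward "short" ones
    short_c = np.mean([encode(t) for t in short_texts], axis=0)
    long_c = np.mean([encode(t) for t in long_texts], axis=0)
    return short_c - long_c


def push(text: str, dir_vec: np.ndarray, strength: float = 1.0) -> str:
    return decode(encode(text) + strength * dir_vec)  # steer, then decode
```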
And with that kind of building block, you can build really interesting interfaces. 03:59:09.940 |
For example, I can plop this piece of text down here, and maybe I want to generate a couple of shorter versions. 03:59:15.940 |
So this is a little bit shorter. This is even more short. 03:59:21.940 |
But maybe I like this version. So I'm going to clone this over here. 03:59:25.940 |
And I'm going to make the sentiment of the sentence a little more negative. 03:59:29.940 |
And you can start to explore the latent space of this embedding model, this language model, 03:59:35.940 |
by actually moving around in a kind of spatial canvas interface, which is just kind of interesting. 03:59:41.940 |
Another thing you can do with this kind of embedding model is, 03:59:43.940 |
now that we have a vague sense that there are specific directions in this space that mean specific things, 03:59:48.940 |
we can start to more directly look at a text and ask the model, 03:59:52.940 |
hey, where does this piece of text lie along your length direction or along your negative sentiment direction? 03:59:58.940 |
So this is the original text that we've been playing with. 04:00:01.940 |
It's pretty objective, like a Wikipedia-style piece of text. 04:00:04.940 |
Here I've asked ChatGPT to take the original text and make it sound a lot more pessimistic. 04:00:08.940 |
So things like the futile quest for meaning and plunging deeper into the abyss of nihilism. 04:00:13.940 |
And if I embed both of these, what I'm asking the model to do here is embed both of these things in the embedding space of the model 04:00:20.940 |
and then project those embeddings down onto each of these directions. 04:00:25.940 |
So one way to read this table is that this default piece of text is at this point in this negative direction, 04:00:31.940 |
which by itself doesn't mean anything, but it's clearly less than this. 04:00:34.940 |
So this piece of text is much further along the negative sentiment axis inside this model. 04:00:38.940 |
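A minimal sketch of that projection: a dot product against a unit direction gives a scalar position along, say, the negative-sentiment axis, meaningful only in comparison.

```python
import numpy as np


def project(emb: np.ndarray, dir_vec: np.ndarray) -> float:
    unit = dir_vec / np.linalg.norm(dir_vec)
    return float(emb @ unit)  # comparable across texts, not meaningful alone

# project(encode(default_text), neg_dir) < project(encode(pessimistic_text), neg_dir)
```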
When you look at other properties, like how much of an artistic kind of topic it talks about, 04:00:47.940 |
maybe the negative sentiment text is a bit more elaborate in its vocabulary. 04:00:52.940 |
And so you can start to project these things into these meaningful directions and say, 04:00:55.940 |
what are the features of the models, what are the attributes of the models finding in the text that we're feeding it? 04:01:02.940 |
Another way you could test out some of these ideas is by mixing embeddings. 04:01:06.940 |
And so here I'm going to embed both of these pieces of text. 04:01:09.940 |
This one's the one that we've been playing with. 04:01:11.940 |
This one is the beginning of a short story that I wrote once. 04:01:13.940 |
It's about this town on the Mediterranean coast that's calm and a little bit old. 04:01:22.940 |
And so I'm going to say, this is a 2,000-dimensional embedding. 04:01:25.940 |
I'm going to say, give me a new embedding that's just the first 1,000 or so dimensions from the one embedding, 04:01:30.940 |
and then take the last 1,000 dimensions of the second embedding and just slam them together and have this new embedding. 04:01:35.940 |
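A minimal sketch of that dimension splice; encode and decode are the hypothetical model halves as before.

```python
import numpy as np


def splice(emb_a: np.ndarray, emb_b: np.ndarray) -> np.ndarray:
    mid = emb_a.shape[0] // 2  # e.g., 1024 of a 2048-dimensional embedding
    return np.concatenate([emb_a[:mid], emb_b[mid:]])

# mixed_text = decode(splice(encode(text_a), encode(text_b)))  # hypothetical halves
```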
And naively, you wouldn't really think that that would amount to much. 04:01:39.940 |
But actually, if you generate some samples from it, you can see in a bit, you get a sentence that's kind of a semantic mix of both. 04:01:48.940 |
You have structural similarities to both of those things. 04:01:51.940 |
Like you have this structure where there's a quoted kind of title of a book in the beginning. 04:01:56.940 |
There's punctuation similarities, tone similarities. 04:01:59.940 |
And so this is an example of interpolating in latent space. 04:02:03.940 |
The last thing you may have seen on Twitter is about, okay, I have this embedding model and I have kind of an un-embedding model. 04:02:13.940 |
Can I use this un-embedding model and somehow fine-tune it or otherwise adapt it so we can read out text from other kinds of embedding spaces? 04:02:20.940 |
So this is the same sentence we've been using, but now when I hit this run button, it's going to embed this text not using my embedding model, but using OpenAI's text-embedding-ada-002. 04:02:31.940 |
And then there's a linear adapter that I've trained so that my decoder model can read out not from my embedding model but from OpenAI's embedding space. 04:02:41.940 |
It's going to try to decode out the text given just the OpenAI embedding. 04:02:45.940 |
And you can see, okay, it's not as perfect, but there's a surprising amount of detail that we've recovered out of just the embedding with no reference to the source text. 04:02:56.940 |
So you can see this proper noun, diaspora, it's surprisingly still in there. 04:03:00.940 |
This feature where there's a quoted title of a book is in there. 04:03:04.940 |
It's roughly about the same topic, things like the rogue AI. 04:03:07.940 |
Sometimes when I rerun this, there's also references to the author where the name is roughly correct. 04:03:12.940 |
So even surprising features like proper nouns, punctuation, things like the quotes, general structure and topic, obviously, 04:03:20.940 |
those are recoverable given just the embedding because of the amount of detail that these high-capacity embedding spaces have. 04:03:27.940 |
But not only can you do this in the text space, you can also do this in image space. 04:03:39.940 |
And for dumb technical reasons, I have to put two of them in. 04:03:42.940 |
And then let's try to interpolate in this image space. 04:03:47.940 |
I'm going to try to generate, say, like six images in between me and the Notion avatar version of me, the cartoon version of me. 04:03:56.940 |
If the back end will warm up, cold starting models is sometimes difficult. 04:04:05.940 |
So now it's generating six images, bridging, kind of interpolating between the photographic version of me and the cartoon version of me. 04:04:11.940 |
And again, it's not perfect, but you can see here, on the left, it's quite photographic. 04:04:17.940 |
And then as you move further down this interpolation, you're seeing more kind of cartoony features appear. 04:04:24.940 |
And it's actually quite a surprisingly smooth transition. 04:04:27.940 |
Another thing you can do on top of this is you can do text manipulations as well because clip is a multimodal text and image model. 04:04:39.940 |
I'm going to subtract the vector for a photo of a smiling man. 04:04:45.940 |
And instead, I'm going to add the vector for a photo of a very sad, crying man. 04:04:53.940 |
And empirically, I find that for text I have to be a little more careful, so I'm going to dial down how much of those vectors I'm adding and subtracting. 04:05:22.940 |
Like, you can try to add -- like, here's a photo of a beach. 04:05:28.940 |
This time maybe just generate 4 for the sake of time. 04:05:31.940 |
Or maybe there's a bug and it won't let me generate. 04:05:46.940 |
So in all these demos that I've done, both in the text and image domain -- okay, the beach didn't quite survive the latent space arithmetic. 04:05:52.940 |
But in all these demos, the only thing I'm doing is calculating vectors, calculating embeddings for examples. 04:05:58.940 |
And embedding them and just adding them together with some normalization. 04:06:03.940 |
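A minimal sketch of that arithmetic, assuming Hugging Face's CLIP; the scale keeps the text vectors from overwhelming the image embedding, and the result would be handed to an unCLIP decoder.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def text_vec(prompt: str) -> torch.Tensor:
    inputs = proc(text=[prompt], return_tensors="pt", padding=True)
    with torch.no_grad():
        v = model.get_text_features(**inputs)[0]
    return v / v.norm()


def edit(image_emb: torch.Tensor, remove: str, add: str, scale: float = 0.3) -> torch.Tensor:
    # subtract one concept, add another, renormalize for the decoder
    out = image_emb - scale * text_vec(remove) + scale * text_vec(add)
    return out / out.norm()
```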
And it's surprising that just by doing that, you can try to manipulate interesting features in text and images. 04:06:08.940 |
And with this, you can also do things like add style and subject at the same time. 04:06:13.940 |
You can -- this is a cool image that I thought I generated when I made my first demo. 04:06:16.940 |
And then you can also do some pretty smooth transitions between landscape imagery. 04:06:26.940 |
In all these prototypes, one principle that I've tried to reiterate to myself is that oftentimes when you're studying these very complex, sophisticated models, 04:06:36.940 |
you don't necessarily have the ability to look inside and say, "Okay, what's happening?" 04:06:39.940 |
Even getting an intuitive understanding of what the model is thinking, what the model is looking at, can be difficult. 04:06:45.940 |
And I think these are some of the ways that I've tried to render these invisible parts of the model a little bit more visible, 04:06:50.940 |
to let you a little bit more directly observe exactly what the model is -- the representations the model is operating in. 04:06:57.940 |
And sometimes you can also take those and directly interact or let humans directly interact with the representations to explore what these spaces represent. 04:07:06.940 |
And I think there's a ton of interesting, pretty groundbreaking research that's happening here. 04:07:11.940 |
On the left here is the Othello world model paper, which is fascinating. 04:07:16.940 |
And then on the right is a very, very recent -- I had to add this in last minute because it's super relevant. 04:07:21.940 |
In a lot of these examples, I've calculated these feature dimensions by just giving examples and calculating centroids between them. 04:07:27.940 |
But here, Anthropic's new work, along with other work from Conjecture and other labs, has found unsupervised ways to try to automatically discover these dimensions inside models. 04:07:35.940 |
And in general, I'm really excited to see latent spaces that appear to encode, you know, by some definition, interpretable, controllable representations of the model's input and output. 04:07:45.940 |
I want to talk a little bit in the last few minutes about the models that I'm using. 04:07:50.940 |
I won't go into too much detail, but it's fine-tuned from a T5 checkpoint as a denoising autoencoder. 04:07:56.940 |
It's an encoder/decoder transformer with some modifications that you can see in the code. 04:08:04.940 |
I have some pooling layers to get an embedding. 04:08:06.940 |
This is like a normal T5 embedding model stack. 04:08:09.940 |
And then on the right, I have this special kind of gated layer that pulls from the embedding to decode from it. 04:08:19.940 |
But we take this model, and we can adapt it to other models as well, as you saw with the OpenAI embedding recovery. 04:08:25.940 |
And so on the left is the normal training regime where you have an encoder, you get an embedding, and you try to reconstruct the text. 04:08:30.940 |
On the right, we just train this linear adapter layer to go from embedding of a different model to then reconstruct the text with a normal decoder. 04:08:38.940 |
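A minimal sketch of that adapter training loop in PyTorch; the dimensions and the frozen-decoder loss are placeholders, not the actual model code.

```python
import torch
import torch.nn as nn

adapter = nn.Linear(1536, 2048)  # e.g., ada-002 dims -> native embedding dims
opt = torch.optim.Adam(adapter.parameters(), lr=1e-4)


def decoder_reconstruction_loss(emb: torch.Tensor, text: str) -> torch.Tensor:
    return emb.pow(2).mean()  # dummy stand-in for the frozen decoder's loss


def train_step(foreign_emb: torch.Tensor, target_text: str) -> float:
    loss = decoder_reconstruction_loss(adapter(foreign_emb), target_text)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()  # only the adapter's weights are updated
```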
And today, I'm excited to share that these models that I've been dealing with, that you may have asked about before, are open on Hugging Face. 04:08:46.940 |
So you can go download them and try them out now. 04:08:49.940 |
On the left is the Hugging Face models, and then there's a Colab notebook that lets you get started really quickly and try to do things like interpolation and interpretation of these features. 04:08:56.940 |
And so if you find any interesting results with these, please let me know. 04:09:00.940 |
And if you have any questions, also reach out, and I'll be able to help you out. 04:09:04.940 |
The image model that I was using at the end was KakaoBrain's Karlo. 04:09:10.940 |
This model is an unCLIP model, which is trained kind of like the way that DALL-E 2 was trained: as a diffusion model that's trained to invert CLIP embeddings. 04:09:17.940 |
So it goes from CLIP embeddings of images back to images. 04:09:20.940 |
And that lets us do similar things as the text model that we used. 04:09:25.940 |
In all this prototyping, I think a general principle, if you have one takeaway from this talk, it's that when you're working with these really complex models and kind of inscrutable pieces of data, 04:09:34.940 |
if you can get something into a thing that feels like it can fit in your hand, that you can play with, that you can concretely see and observe and interact with, 04:09:41.940 |
can be directly manipulated, visualized, all these things, all the tools and prototypes that you can build around these things, 04:09:46.940 |
I think help us get a deeper understanding of how these models work and how we can improve them. 04:09:53.940 |
And in that way I think models, language models and image models, generative models are a really interesting laboratory for knowledge, 04:09:59.940 |
for studying how these different kinds of modalities can be represented. 04:10:04.940 |
And Bret Victor said, "The purpose of a thinking medium is to bring thought outside the head to represent these concepts in a form that can be seen with the senses and manipulated with the body. 04:10:14.940 |
In this way the medium is literally an extension of the mind." 04:10:17.940 |
And I think that's a great poetic way to kind of describe the philosophy that I've approached a lot of my prototyping with. 04:10:24.940 |
So, if you follow some of these principles and try to dig deeper in what the models are actually looking at, 04:10:29.940 |
build interfaces around them, I think more humane interfaces to knowledge are possible. 04:11:47.900 |
And now we welcome Dr. Bryan Bischof, head of AI at Hex, and Dr. Chris White, CTO at Prefect, in a fireside chat moderated by Brittany Walker, principal at CRV. 04:12:22.900 |
I'm really excited to be moderating this panel between two of my favorite people working in AI. 04:12:30.200 |
I'm a principal at CRV, which is an early-stage venture capital firm, investing primarily in seed and Series A startups. 04:12:37.260 |
Chris, why don't you give us a little bit about yourself? 04:12:47.720 |
We build a workflow orchestration dev tool and sell remote orchestration as a service. 04:12:55.740 |
So I started kind of my journey into startup land and eventually AI and data, got a Ph.D. in math focused on non-convex optimization, which I'm sure a lot of people here are into. 04:13:06.520 |
And then eventually, you know, data science and then into the kind of dev tool space, which is where I'm at now. 04:13:18.640 |
Hex is a data science notebook platform, sort of like the best place to do data science workflows. 04:13:24.040 |
I was going to say I started my journey by getting a math Ph.D., but he kind of already took that one. 04:13:32.680 |
I've been doing data science and machine learning for about a decade and, yeah, currently find myself doing AI, as they call it these days. 04:13:40.660 |
So both of you are at relatively early stage startups. 04:13:44.540 |
And as we all know, early stage startups have a number of competing priorities, everything from hiring to fundraising to building products. 04:13:53.060 |
And one might say it would be a lot to kind of take a moment and just say, what is this AI thing? 04:14:01.580 |
And so I'm wondering, how did you decide that AI was something that you really needed to invest in when you already had, you know, established business, growing well, lots of users, lots of customers, presumably placing a lot of demands on your time? 04:14:15.360 |
So, Chris, I would love to hear from you on how you guys thought about that choice. 04:14:18.900 |
Yeah, so there are a couple of different dimensions to it for us. 04:14:22.140 |
So we are, you know, a workflow orchestration company, and our main user persona are data engineers and data scientists, but there's nothing inherent about our tool that requires you to have that type of use case. 04:14:33.180 |
And so one thing, one dimension for us is, right, we assumed that a big component of AI use cases were going to be data driven, right? 04:14:41.520 |
Like semantic search or, like, retrieval, summarization, these sorts of things. 04:14:46.280 |
So just we wanted to make sure that, you know, we had a seat at the table to understand how people were productionizing these things and, like, were there any new ETL considerations when you're, you know, moving data between maybe vector databases or something? 04:14:59.680 |
Another one that I think is interesting is when I look at AI going into production, I see, basically, a remote API that is expensive, brittle, and non-deterministic, and that's just a data API to me. 04:15:12.840 |
And so, right, if we can orchestrate these workflows that are building applications for data engineers, presumably a lot of that's going to translate over. 04:15:21.500 |
And so, and I mean, last, like, you know, I'm sure the reason most people are here now is, you know, it was fun, and so we just wanted to learn in the open. 04:15:28.360 |
So we did end up just kind of creating a new repo called Marvin that I think Jason mentioned in his last talk, just to kind of keep up, you know, be incentivized to keep up. 04:15:39.040 |
And, Brian, you were literally brought on board to Hex to focus on this stuff. 04:15:43.880 |
Would love to hear more about how that decision was made and how you've spent your time on it. 04:15:50.300 |
One is that data science is this unique interface between sort of, like, business acumen, creativity, and, like, pretty, like, difficult sometimes programming. 04:16:02.400 |
And it turns out that, like, the opportunity to unlock more creativity and more business acumen as part of that workflow is a really unique opportunity. 04:16:10.580 |
I think a lot of data people, the favorite part of their job is not remembering Matplotlib syntax. 04:16:16.860 |
And so the opportunity to sort of, like, take away that tedium is a really exciting place to be. 04:16:21.780 |
Also, realistically, any data platform that isn't integrating AI is almost certainly dooming itself. 04:16:30.560 |
And sort of it will be table stakes pretty soon. 04:16:33.120 |
And so I think missing that opportunity would be pretty criminal. 04:16:38.760 |
So you decided that you were going to go ahead and do this. 04:16:44.200 |
What criteria did you evaluate when you were determining how you were going to build out these features or products? 04:16:50.380 |
Did you optimize for how quickly you could get to market, how hard it would be to build, ability to work within your existing resources? 04:16:57.740 |
What criteria did you consider when you were saying, okay, this is how we're actually going to take hold of this thing? 04:17:03.320 |
So for us, I guess there's two different angles. 04:17:06.360 |
There's the kind of just pure open source Marvin project. 04:17:10.060 |
It is a product, but not one that we sell, just one that we maintain. 04:17:13.760 |
And then we do have some AI features built into our actual core product. 04:17:18.140 |
And I think they have slightly different success criteria. 04:17:21.640 |
So for Marvin, it's mainly just getting to see how people are experimenting with LLMs and just talking to users directly. 04:17:30.280 |
It just kind of gives us that avenue and that audience. 04:17:32.560 |
And so that's just been really useful and insightful for us. 04:17:37.140 |
I mean, our head of AI, you know, talks to users at least a couple times a day. 04:17:42.360 |
And then for our core product, so one way that I love to think about DevTools and think about what we build is failure mode. 04:17:50.560 |
So, like, I like to think of choosing tools for what happens when they fail. 04:17:54.620 |
Can I quickly recover from that failure and understand it? 04:17:57.780 |
And so a lot of our features are geared towards that sort of kind of discoverability. 04:18:05.580 |
It's like quick error summaries shown on the dashboard for quick triage. 04:18:09.760 |
And then measuring success there is, like, relatively straightforward, right? 04:18:12.960 |
It's, like, how quickly are users kind of getting to the pages they want and how quickly are they debugging their workflows? 04:18:23.860 |
Yeah, my team's charter is relatively simple. 04:18:29.860 |
And so ultimately we're constantly thinking about sort of what the user is trying to do in Hex during their data science workflow 04:18:37.700 |
and making that as low friction as absolutely possible and giving them more enhancement opportunities. 04:18:43.440 |
So a simple example is I don't know how many times you all have had a very long SQL query that's made up of a bunch of CTEs 04:18:49.540 |
and it's a giant pain in the ass to work with. 04:18:53.000 |
We built a feature that takes a single SQL query, breaks it into a bunch of different cells, and chains them together in our platform. 04:18:58.120 |
This is, like, such a trivial thing to build, but it's something that I've wanted for eight years. 04:19:05.740 |
And so thinking like that makes it really easy to make tradeoffs in terms of what is important and what we should focus on. 04:19:12.800 |
And so in terms of, like, how we think about, yeah, like, where our positioning is, 04:19:17.620 |
it's really just how do we make things feel magical and feel smooth and comfortable. 04:19:21.140 |
And how did you reallocate resources beyond, you know, they obviously hired you. 04:19:26.580 |
That was a great step in the right direction. 04:19:28.560 |
But what else did you do to actually get up and running in terms of operationalizing some of this stuff? 04:19:33.800 |
Yeah, I think we kept things pretty slim and we continue to keep things pretty slim. 04:19:40.480 |
He built out a very simple prototype that seemed like it showed promise. 04:19:45.400 |
We scaled the team to a couple people and we've always remained as slim as possible while building out new features. 04:19:50.700 |
These days, I have a roadmap long enough for 20 engineers and we continue to stay around five. 04:19:58.860 |
Basically, like, ruthless prioritization is definitely an advantage. 04:20:02.740 |
And, Chris, you guys wound up hiring a guy as well, right? 04:20:08.300 |
So he definitely owns most of our AI, but also anyone at the company that wants to participate can. 04:20:15.480 |
And so there was one engineer that got really into it and, for all intents and purposes, has effectively switched to Adam's team and is now doing AI full-time. 04:20:24.040 |
Yeah, so you guys are really dedicating a lot to solving this problem, including the hiring of two people on your side and one on Brian's. 04:20:33.080 |
So presumably, you're going to be looking for a return on that investment. 04:20:36.640 |
So how do you think about what a successful implementation of an AI-based feature or product looks like? 04:20:42.920 |
For us, I would say that already we've hit that success criterion. 04:20:48.440 |
So now the question is, like, further investment or just kind of keep going with the way that we're doing it. 04:20:53.440 |
But the big thing was time to value in the core product. 04:20:57.780 |
That we can just easily see has definitely happened with just the few sprinkles of AI that we put in. 04:21:04.820 |
And then, kind of like I said in the beginning, just getting involved in those conversations, those really early conversations about companies looking to put AI in production. 04:21:14.900 |
And we've been having those on the regular now. 04:21:18.200 |
So I would say, like, already feels like it was well worth the investment. 04:21:23.640 |
You obviously just had a big launch the other day, too. 04:21:25.940 |
Curious how you thought about success for that. 04:21:28.700 |
Once again, it's sort of like how frequently do our users reach for this tool? 04:21:34.000 |
Ultimately, Magic is a tool that we've given to our users to try to make them more efficient and have a better experience using Hex. 04:21:40.340 |
And so if they're constantly interacting with Magic, if they're using it in every cell and every project, then that's a good sign that we're succeeding. 04:21:47.280 |
And so to make that possible, we really have to make sure that Magic has something to help with all parts of our platform. 04:21:53.700 |
We have a pretty complicated platform that can do a lot. 04:21:57.060 |
And so finding applications of AI in every single aspect of that platform has been one of our sort of, like, you know, north stars. 04:22:05.240 |
And very intentionally so to make sure that we're, you know, making our platform feel smooth at all times. 04:22:16.080 |
We're going to talk about how you guys actually built some of these features and products. 04:22:20.760 |
Since we're all here at the AI Engineer Summit, I assume we all have an interest in actually getting stuff done and putting it into prod. 04:22:27.400 |
So when you were making some of these initial determinations, Brian, how did you guys determine what to build versus buy? 04:22:34.480 |
So from day one, I think one of the first questions I asked when I joined is what they were doing for evaluation. 04:22:40.180 |
And you might say, like, okay, yeah, we've heard a lot about evaluation today. 04:22:43.320 |
But I would like to remind everyone here that that was February. 04:22:46.960 |
And the reason that I was asking that question already in February is because I've been working in machine learning for a long time where evaluation sort of, like, gives you the opportunity to do a good job. 04:22:57.780 |
And if you've done a poor job of objective framing and done a poor job of evaluation, you don't have much hope. 04:23:03.600 |
And so I think the first thing that we really looked into is evals. 04:23:06.960 |
And back then there were not 25 companies starting evals, so we decided to build evals ourselves. 04:23:16.260 |
And I'm very confident that that was the right call for a few reasons. 04:23:19.780 |
One, evals should be as close to production as possible, which means literally using prod when possible. 04:23:26.440 |
And so to do that, you have to have very deep hooks into your platform when you're moving at the speed that we try to move. 04:23:34.320 |
On the flip side, we chose to not build our own vector database. 04:23:39.720 |
I've been, you know, doing semantic search with vectors for six, seven years now. 04:23:44.360 |
And I've used open source tools like Faiss, and Pinecone back when it was more primitive. 04:23:51.020 |
Unfortunately, a lot of those tools are very complicated. 04:23:55.860 |
And so having set up vector databases before, I didn't want to go down that journey. 04:23:59.760 |
So we ended up working with LanceDB and sort of built a very custom implementation of vector retrieval that really fits our use case. 04:24:08.260 |
That was highly nuanced and highly complicated. 04:24:12.220 |
But it's what we needed to make our RAG pipeline really effective. 04:24:19.160 |
So ultimately, just sort of, where is the complexity worth the squeeze? 04:24:26.840 |
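For illustration only, here is a minimal sketch of the kind of LanceDB-backed retrieval being described; the table name, sample data, and embedding model are invented for this sketch, not Hex's actual implementation:

    import lancedb
    from sentence_transformers import SentenceTransformer

    # Hypothetical names throughout; any embedding model would do.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    db = lancedb.connect("./vectors")  # LanceDB persists tables to local disk

    snippets = ["SELECT region, SUM(amount) FROM orders GROUP BY region",
                "df.groupby('region')['amount'].sum()"]
    table = db.create_table(
        "snippets",
        data=[{"text": s, "vector": model.encode(s).tolist()} for s in snippets],
    )

    # Embed the user's question and pull back the nearest snippets.
    query_vec = model.encode("total sales by region").tolist()
    hits = table.search(query_vec).limit(2).to_pandas()
    print(hits["text"].tolist())

The custom part Brian is alluding to would live in how you construct, filter, and rank what goes into a table like this, not in the half-dozen lines above.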
So I have a couple of different kind of things that we decided on here, and some of which are still in the works. 04:24:34.540 |
Vector databases, a million percent agree with that. 04:24:39.020 |
We haven't had as much need of one, I think, as Hex. 04:24:42.760 |
But we've done a lot with both Chroma and with Lance, but neither in production yet. 04:24:51.440 |
And so the exposure that I've seen of people actually integrating AI into, you know, their workflows and things is, 04:24:59.640 |
there's a lot of experimentation that happens. 04:25:01.840 |
And then you kind of want to get out of that experimental framework maybe by just looking at all of the prompts that you were using 04:25:09.180 |
and then just using those directly yourself with no framework in the middle. 04:25:13.100 |
And then once you're kind of in that mode, like I was saying before, like, you're just, at the end of the day, 04:25:18.360 |
you're interacting with an API, and there's lots of tooling for that. 04:25:22.880 |
And so I kind of see a lot of the decisions, at least, that we had to confront on build versus buy is, like, 04:25:31.740 |
do we already have sufficient dev tooling to support it and make sure it's observable, monitorable, and all this? 04:25:42.400 |
And as you mentioned, I think, you know, you talked about many, many eval startups. 04:25:46.920 |
I think we're all familiar with the broad landscape of vector databases as well. 04:25:51.400 |
Are there any pieces of your infrastructure stack that you wish people were building or you wish people were kind of tackling in a different way 04:26:00.640 |
than what you've seen out there so far, either one of you? 04:26:05.800 |
I mean, I think it would have been a hard sell to sell me on an eval platform. 04:26:12.500 |
I think there was some opportunity to sell me on, like, an observability platform for LLMs. 04:26:17.440 |
I've looked at quite a few, and I will admit to being an alumnus of Weights & Biases, so I have some bias. 04:26:25.220 |
But that being said, I think there is still a golden opportunity for a really fantastic, like, experimentation plus observability platform. 04:26:36.660 |
One thing that I'm watching quite carefully is Rivet by Ironclad. 04:26:40.820 |
It's an open source library, and I think the way that they have approached the experimentation and iteration is really fantastic. 04:26:50.000 |
If I see something like that get laced really well into observability, that's something that I'd be excited about. 04:26:57.900 |
I think, like, small addition to what Brian said, which is just more focus on kind of the machine-to-machine layer of the tooling, 04:27:09.300 |
and so I think a lot, you know, right at the end of the day, the input is always kind of this natural language string, 04:27:15.340 |
and that makes a lot of sense, but the output, making it more of a guaranteed-typed output, 04:27:23.360 |
like with function calling and other things, I think is one step in the journey of integrating AI actually into back-end processes 04:27:32.060 |
and machine-to-machine processes, and so any focus in that area is where, you know, my interest gets piqued for sure. 04:27:40.720 |
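As a hedged sketch of that guaranteed-typed-output idea, this is roughly what OpenAI-style function calling can look like; the function name and schema here are invented for illustration:

    import json
    import openai  # pre-1.0 SDK style, contemporaneous with this panel

    # A made-up schema; it forces structured output instead of free-form prose.
    functions = [{
        "name": "record_sentiment",
        "description": "Report the sentiment of a support ticket.",
        "parameters": {
            "type": "object",
            "properties": {
                "sentiment": {"type": "string",
                              "enum": ["positive", "negative", "neutral"]},
                "urgency": {"type": "integer", "minimum": 1, "maximum": 5},
            },
            "required": ["sentiment", "urgency"],
        },
    }]

    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": "The app deleted my dashboard again!"}],
        functions=functions,
        function_call={"name": "record_sentiment"},
    )

    # The arguments come back as a JSON string matching the schema above.
    args = json.loads(resp["choices"][0]["message"]["function_call"]["arguments"])
    assert args["sentiment"] in {"positive", "negative", "neutral"}

Because the arguments must conform to the declared JSON schema, a downstream machine-to-machine process can consume them without parsing prose.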
Okay, so you have all your people, and you have all your tools, and then you're obviously completely good to go and fully in production. 04:27:51.100 |
What challenges did you run into along the way, maybe ones that you didn't expect or that were larger obstacles than you would have thought? 04:27:59.280 |
So, integrating AI into our core product, I would say from a tooling and developer perspective and, you know, a productionizing perspective: none. 04:28:10.020 |
Culturally, though, I would say we definitely hit, you know, some challenges, which is that when we first, we're like, 04:28:17.420 |
all right, let's start to incorporate some AI and do some ideation here, right? 04:28:21.480 |
A lot of engineers just started to throw everything at it, like, we should, it should do everything. 04:28:28.720 |
It can monitor itself for, like, all of this stuff, and it was like, all right, all right, everyone needs to kind of, like, backtrack, 04:28:34.700 |
and so just that internal conversation of, like, you know, getting buy-in on, like, very specific focus areas, 04:28:41.980 |
which, you know, at the end of the day, where we are focused is that just removal of user friction, 04:28:47.620 |
whether it's through design or just through, like, quicker surfacing of information that AI just, like, lets you do in a more guaranteed way. 04:28:54.500 |
But, yeah, restraining the enthusiasm was the biggest challenge for sure, and it still exists to this day. 04:29:10.580 |
It's a little different instantiation, which is to say that, like, you know, I've never met an engineer that's good at estimating how long things take, 04:29:17.040 |
and I would say that, like, that is somehow exacerbated with AI features because then you, your first few experiments show such great promise so quickly, 04:29:28.920 |
but then the long tail feels even longer than most engineering corner-case triage. The journey from "we got this to work for a few cases and we think we can make it work" to "it's bulletproof" is even more of a challenge. 04:29:45.400 |
And I, yeah, this, like, over enthusiasm, I think, yeah, slightly different instantiation, but similar flavor. 04:29:52.740 |
Whenever you're on that journey, how are you testing and tracking along the way, if at all? 04:30:00.040 |
Yeah, I mean, to be a broken record, a lot of robust evals, trying really hard to codify things into evaluations. Like, if someone comes to us and says, 04:30:13.480 |
wouldn't it be great if Magic could do X, we sort of pursue that conversation a little bit further and say, like, okay, what would you expect Magic to do with this prompt? 04:30:22.500 |
What would you expect Magic to do in this case? 04:30:24.780 |
And we get them to kind of vet that out, and then we use this barometer of: could a new data scientist at your company with very little context do that? 04:30:36.440 |
That keeps us right at the cutting edge of what's feasible and what's possible. 04:30:44.520 |
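A minimal sketch of what codifying those expectations might look like, with invented cases and a generate stand-in rather than anything from Hex's actual harness:

    # Hypothetical eval harness: each case pairs a user prompt with a
    # binary check that a new data scientist could verify by hand.
    EVAL_CASES = [
        {
            "prompt": "plot monthly revenue as a bar chart",
            "passes": lambda code: "bar" in code and "revenue" in code,
        },
        {
            "prompt": "drop rows with null emails",
            "passes": lambda code: "dropna" in code or "notnull" in code,
        },
    ]

    def run_evals(generate):
        """generate(prompt) -> code string; stands in for the real system."""
        results = [(case["prompt"], case["passes"](generate(case["prompt"])))
                   for case in EVAL_CASES]
        passed = sum(ok for _, ok in results)
        print(f"{passed}/{len(results)} evals passed")
        for prompt, ok in results:
            if not ok:
                print(f"  FAILED: {prompt}")
        return passed == len(results)

The checks are deliberately crude; the point is that each expectation from a user conversation gets frozen into something that can pass or fail on every run.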
And, you know, one of the reasons I was excited to have the two of you up here together is because, you know, while Prefect has some elements of AI in the core product, as you mentioned, 04:30:53.640 |
probably you're best known for the Marvin project that you guys have put out there, which is kind of a standalone project, 04:31:00.420 |
which is a really interesting phenomenon that I'll say that I've kind of observed in this current wave of AI, which is, you know, 04:31:06.760 |
companies that maybe weren't doing AI previously launching entirely separate brands, essentially, alongside the core product. 04:31:13.940 |
So we'd love to understand more about what your user experience considerations were when you were building out, you know, Marvin as a separate product versus Prefect. 04:31:32.460 |
I think one kind of philosophical angle is, you know, we try to do things that maximize our ability to learn without having to go full commitment. 04:31:43.420 |
And so I think starting a new open source repo, like, right, we definitely have some ties to it now. 04:31:49.480 |
But past that, it's not all that high of a cost. 04:31:52.960 |
But, like, you know, it's all upside, basically. 04:31:57.860 |
We learned a little bit more about how to, you know, write APIs that, you know, interface with all of the different LLMs, for example, or something like that. 04:32:06.040 |
Or if it does take off, which, you know, it basically did for us, 04:32:09.820 |
we got to meet all these new people who are working on interesting things in AI and data-adjacent areas. 04:32:16.820 |
But for the core product, this was maybe more, I guess, kind of interesting. 04:32:24.240 |
And, Brian, I'd be curious to hear about how much you had to, like, really focus some of your prompts to the use case that you cared about. 04:32:32.020 |
So Prefect is a general purpose orchestrator. 04:32:34.680 |
And so the reason I say that again is our use case scope is, like, technically infinite. 04:32:41.320 |
And so helping people write code to do completely arbitrary things is definitely not a value add we're going to have over the engineers at OpenAI or at GitHub or something else. 04:32:52.080 |
So we knew that we couldn't invest in, like, that way of integrating AI. 04:32:57.220 |
And so then the next question was, like, okay, so then what are just the marginal adds? 04:33:02.600 |
And that's kind of where we landed, you know, where we are today. 04:33:05.280 |
But there was, we did put energy initially, like, can we put this directly in, like, the SDK or something like that? 04:33:12.800 |
And just very quickly realized that it was just too large of scope. 04:33:17.500 |
And at that point, you might as well just have the user do it themselves. 04:33:20.100 |
And, like, we're not adding anything to that workflow. 04:33:22.420 |
Yeah, and on the flip side, you know, Magic has been kind of a part of Hex, basically, it seems like, since inception from the outside. 04:33:30.160 |
Obviously, we've all seen, again, a number of text-to-SQL players out there. 04:33:34.200 |
We can make arguments about whether or not those should exist as standalone companies. 04:33:38.240 |
But I'm curious, you know, how you guys had to think about UX considerations when you were building out Magic in the context of the existing Hex product? 04:33:46.580 |
Ultimately, I've been really fortunate to work with a great design team; they're just excellent. 04:33:53.460 |
But the question about, like, how does Magic feel? 04:34:00.000 |
I think that's one thing that's been important from early on. 04:34:07.540 |
So it is a collection of features that makes the product easier and more comfortable to use. 04:34:12.860 |
That is an easy sort of thing to keep in mind when deciding how to design because it allows us to say, okay, like, we don't want this to distract from the core product experience. 04:34:24.460 |
We had one sprint where we designed something called Crystal Ball. 04:34:31.540 |
It did exactly what we wanted it to do, and it felt wonderful. 04:34:36.640 |
However, ultimately, it drew the user away from the core Hex experience. 04:34:42.680 |
And very quickly, our CEO rightly was like, I feel like this is kind of splitting Magic out into its own little ecosystem. 04:34:49.960 |
And that made it kind of clear that that might be the wrong direction to go. 04:34:54.520 |
So even though Crystal Ball did feel really good, had a really incredible capability behind it, and, frankly, a beautiful design, the problem was that it pulled us away from what we were really trying to do, which was make Hex better for all of our users. 04:35:10.700 |
Every Hex consumer should be able to benefit from Magic features, and Crystal Ball was starting to split that. 04:35:17.400 |
And so we literally killed Crystal Ball despite it being a really cool experience for that reason. 04:35:23.600 |
So genuinely, we've really stuck to the, like, it's one platform, and Magic augments it. 04:35:31.780 |
And obviously, you know, Hex already had a relatively sizable user base at the time you guys launched this. 04:35:37.400 |
So I'm curious, how did you think about the rollout, like, just in terms of what users you gave it to and what timeline, what marketing did you do, all of those types of considerations? 04:35:47.260 |
Yeah, generally, we start with a private beta, and then we, as quickly as possible, expand that to a sort of, like, public beta. 04:35:53.080 |
Our goal is to find people that are, like, engaged with the product and are prepared for some of the limitations of AI tools. 04:36:01.360 |
Stochasticity has come up many times, and ultimately, we're expecting the user to work with a stochastic thing. 04:36:08.440 |
Also, they're working with something very complex, which is data science workflows. 04:36:13.460 |
So we're looking for people that are pretty technical in the early days. 04:36:17.160 |
Then we want to keep scaling and scaling to include the rest of the distribution in terms of technical capabilities so that we can make sure that it's really serving all of our users. 04:36:27.300 |
And on the flip side, again, you had a little bit maybe more flexibility with the rollout, just given it was a new repo. 04:36:34.200 |
I'm curious if that was different, similar to what Brian's talked about. 04:36:40.240 |
It was, we hacked on it, you know, we had fun with it. 04:36:42.860 |
We got it to a place where we felt proud of it, and then we clicked make public and then tweeted about it, and that was, like, the end of that. 04:36:51.400 |
But for integrating AI into our core product, I mean, this isn't particularly deep, but, you know, it's one of those things that I'm sure everyone here is thinking about and we'll continue to talk about, which is, for us, a large part of our customer base are, like, large enterprises and financial services and also healthcare. 04:37:10.020 |
And so, like, very, very security conscious, and so we definitely had to make sure that this was, like, a very opt-in type of feature. 04:37:17.980 |
But, like, you know, we still want to have little, like, tool tips, like, hey, if you click this, but also if you click this, we will send a couple of bits of data, you know, to a third-party provider, so. 04:37:28.980 |
And post-rollout, just to go to kind of the last logical part of the conversation here, how have you guys thought about continuing to kind of measure the outputs? 04:37:39.400 |
I mean, Brian, you're the big evals guy up here, so I'm sure that'll be the answer, but I would love to hear more about how you think about that measurement and in terms of both the model itself, but also in terms of, you know, the model in the context of the product, which I think is also something that people, you know, need to think about. 04:37:56.600 |
So I recently learned that there's a more friendly term than dogfooding, which is drinking your own champagne. 04:38:10.800 |
One of the fun things about trying to analyze product performance is that you normally do that via data science. 04:38:19.680 |
And so I have this fun thing where I'm using Magic to analyze Magic, and I put a lot of effort into trying to understand where it's succeeding and where it's failing, both through traditional product analytics, guided by using the product itself. 04:38:32.400 |
And so there's a very Ouroboros feeling, but ultimately good old-fashioned data science. 04:38:37.780 |
Love to hear it, and appropriate with where you've come from. 04:38:44.080 |
For us, it's, you know, I definitely don't have as much experience as Brian on that side of it. 04:38:49.760 |
But for a while, when it was pure prompt input, string output with no typing interface whatsoever, one thing we were doing was writing tests that, again, used an LLM to do comparisons and semantic comparisons. 04:39:04.220 |
And, like, right, there's obviously problems with that, but, like, it also kind of works. 04:39:08.260 |
But then when we moved into kind of the typed world, where, like, Marvin is for guaranteed typed outputs, essentially, it definitely becomes a lot easier to test. Which is, you know, one reason that's kind of the soapbox I get on when I talk about LLM tooling: bringing it into the back end is just about having these typed handshakes. 04:39:28.540 |
Because, you know, you can write prompts where you know what the output should be, and it should have a certain type, and that's a very easy thing to test most of the time. 04:39:36.360 |
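A small sketch of why those typed handshakes make testing easier; Invoice and extract_invoice are hypothetical stand-ins for whatever typed LLM call you have, Marvin-flavored or otherwise:

    from pydantic import BaseModel

    class Invoice(BaseModel):
        vendor: str
        total_usd: float

    def extract_invoice(text: str) -> Invoice:
        # Stand-in for a typed LLM call (function calling, Marvin-style
        # decorators, etc.); the real version would prompt a model and
        # validate its output against the Invoice schema.
        return Invoice(vendor="ACME Corp", total_usd=1200.0)

    def test_invoice_extraction():
        result = extract_invoice("ACME Corp billed us $1,200 on March 1.")
        # Pydantic already rejected malformed output at construction time,
        # so the test reduces to ordinary typed assertions.
        assert isinstance(result, Invoice)
        assert result.vendor == "ACME Corp"
        assert abs(result.total_usd - 1200.0) < 1e-6

Once the output is a validated type rather than a string, the fuzzy LLM-as-judge comparison is only needed for genuinely free-form fields.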
And one of the things I think has been, you know, most fascinating about this wave of software, and, Brian, you alluded to this a little bit earlier with your comments around, you know, being stochastic, essentially, is that it's not deterministic, right? 04:39:49.240 |
And also, I think that AI-based software doesn't have to be static, either. 04:39:53.880 |
It can be, you know, dynamic in a way that maybe traditional software isn't quite as much, and there's, you know, improvements that come along maybe on the UX side of things, but also the model. 04:40:03.300 |
We've heard a lot of people talk about techniques like fine-tuning, techniques like RLHF, RLAIF, all sorts of, you know, approaches to kind of continuing to improve the model itself in the context of the product over time. 04:40:16.220 |
So, I'm curious about how you think about measuring that improvement as you continue to hopefully, you know, collect data and refine your understanding of the end user. 04:40:25.240 |
There was a paper that came out in, like, June-ish or something that was, like, kind of splashy. 04:40:30.480 |
It was from Matei Zaharia of Spark fame, and it was like, oh, like, the models are degrading over time, even when they say they're not. 04:40:37.460 |
And, like, what I thought was interesting was for, like, the people that are doing this stuff in prod, we already knew that. 04:40:44.200 |
Like, my evals failed the first day they switched to the new endpoint. 04:40:47.340 |
I didn't even switch the endpoint over, and suddenly my evals were failing. 04:40:50.940 |
So, I think there is a certain amount of, like, when you're building these things in a production environment, you're keeping a very close eye on the performance over time, and you're building evals in this very robust way. 04:41:03.460 |
And I've said evals enough times for this conversation already, but I think the thing that I keep coming back to is, what do you care about in terms of your performance? 04:41:13.120 |
Boil your cases down to traditional methods of evaluation. 04:41:17.960 |
We don't need latent distance distributions and KL divergence between those distributions. 04:41:25.780 |
It turns out, like, BLEU scores of similarity aren't very good for LLM outputs. 04:41:29.940 |
This has been known for three, four years now. 04:41:34.880 |
Understand what it means in a very clear, you know, human way. 04:41:43.460 |
And to the people that say, like, my task is too complicated, I can't tell if it's right or wrong, I have to use something more latent, I would challenge you to try harder. 04:41:52.760 |
The tasks that I'm evaluating are quite nuanced and quite complicated, and it hasn't always been easy for me to come up with binary evaluations, but you keep hunting and you eventually find things. 04:42:04.660 |
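One way to read "try harder": even a nuanced task like text-to-SQL usually boils down to binary checks, for example by executing the generated query against a small fixture (a hedged sketch, with sqlite standing in for a real warehouse):

    import sqlite3

    def sql_eval_passes(generated_sql: str, expected_rows: list) -> bool:
        """Binary eval: the query either returns the expected rows or it doesn't."""
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)",
                         [("west", 10.0), ("west", 5.0), ("east", 7.0)])
        try:
            rows = conn.execute(generated_sql).fetchall()
        except sqlite3.Error:
            return False  # a query that doesn't run is a failure, not a judgment call
        finally:
            conn.close()
        return rows == expected_rows

    # e.g. expect [("east", 7.0), ("west", 15.0)] for
    # "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"

The eval either passes or it doesn't; no latent distances required.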
You talked about type checking and, like, type handshakes, and a lot of people in ML have been preaching that gospel of composability for five years now. 04:42:16.580 |
Those ideas are just maybe new to some of the people that are thinking about evals today. 04:42:21.180 |
Well, so, moral of the story is try harder, essentially. 04:42:28.560 |
I think the only thing I'd add is I don't have much take on actually how someone should do it or what they could consider, but I think, you know, you just described a highly non-deterministic, very dynamic experimentation workflow, and, like, those are the sorts of things that just, like, our core product is meant for. 04:42:47.140 |
And so, like, experimenting with those, like, just knowing the structure of them is maybe the best way to say it is what fascinates me more than the actual, like, details of what metrics you might be using. 04:42:58.580 |
Well, you know, I think the other reason I was really excited to do this panel is because we have kind of maybe two sides of the same coin as it relates to being an AI engineer here, right? 04:43:08.060 |
One person coming from more of a traditional ML background, one person coming from more of a traditional engineering background, and both of you building these AI-based products. 04:43:15.660 |
So I wanted to give you a second, if you have any last questions to ask of each other. 04:43:21.880 |
So you work in this, like, data workflow space, and, like, I've thought a lot about composability and, like, data workflows, and I've long been a fan of sort of, like, workflow-centric ML. 04:43:30.440 |
And so what I'd love to hear is sort of, like, when you think about building these agent pipelines, which are starting to get more into the, like, DAGs and the sort of, like, structured chains of response and request, what is the, like, one thing that, like, every AI engineer building agents should know from your sphere that will make it easier for them to build agents? 04:43:59.980 |
I think the main thing is something that I kind of alluded to earlier, which is: think about failure modes. 04:44:07.740 |
So, like, runaway processes, capturing potential oddities in outputs or inputs as early as possible with some observability layer. 04:44:18.580 |
And so the earlier you can get that wiring in, I think, the better. 04:44:24.440 |
And then caching is, like, this is the only time I will ever say this. 04:44:29.300 |
It's definitely your friend in some of these situations. 04:44:31.420 |
But it's also the root of all evils, so you've got to kind of, you know, balance that. 04:44:37.040 |
But, yeah, I think just thinking about the observability and debuggability layer, especially with some of the kind of black-boxy stuff, like people who are pushing it and actually having immediate evaluation of the returned code or something. Having that monitoring layer, I think, is just key. 04:44:53.060 |
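A minimal sketch of the caching idea for agent development, with all names invented; keying on the full request means a changed prompt naturally busts the cache:

    import hashlib
    import json

    _cache: dict[str, str] = {}

    def cached_llm_call(llm, model: str, prompt: str) -> str:
        """Memoize identical requests so reruns of an agent loop are cheap
        and repeatable; llm is any callable client you supply."""
        key = hashlib.sha256(
            json.dumps({"model": model, "prompt": prompt}, sort_keys=True).encode()
        ).hexdigest()
        if key not in _cache:
            _cache[key] = llm(model=model, prompt=prompt)
        return _cache[key]

The "root of all evils" caveat applies: clear the cache before anything eval-like, or you will be measuring yesterday's model.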
Chris, I know you've asked Brian a bunch during this panel, but anything else you want to ask? 04:44:56.440 |
Yeah, I mean, I'm just really curious, you know, I'm sure everybody asks you this, but the hallucination problem. 04:45:00.900 |
Like, how, you know, obviously your users can just confront it directly if it looks weird. 04:45:06.320 |
They can see that it looks weird or it errors out. 04:45:08.480 |
But just how do you think about it as the person building that interface for your users? 04:45:12.600 |
Someone recently asked me for, like, references on hallucination. 04:45:16.460 |
And I was like, what are some good references on hallucination? 04:45:18.380 |
And I Googled around, and I found that generally the advice that people are giving is to fix hallucination, basically RAG harder, just, like, make a better retrieval-augmented pipeline. 04:45:29.380 |
And when I said that and I looked at myself, I was like, honestly, that's, like, kind of how we solved it. 04:45:33.400 |
Like, our reduction in hallucination for Magic, which is not an easy problem, was that we had to think a little bit more carefully about retrieval augmented generation. 04:45:44.560 |
And in particular, the retrieval is not something that you'll find in any book, even the book that I just published. 04:45:50.700 |
Like, even in there, I don't talk about this particular retrieval mechanism. It took us some additional thinking, but we got there. 04:46:04.280 |
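Reduced to its skeleton, "RAG harder" means retrieving relevant context and pinning the model to it; a generic sketch, where retrieve is whatever your vector store provides (like the LanceDB example earlier):

    def build_grounded_prompt(question: str, retrieve) -> str:
        """retrieve(question, k) -> list of context strings from a vector store."""
        context = "\n\n".join(retrieve(question, k=5))
        return (
            "Answer using ONLY the context below. "
            "If the context is insufficient, say so instead of guessing.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}"
        )

The nuance Brian describes lives almost entirely inside retrieve; the prompt scaffolding is the easy part.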
Last thing, just to wrap up, what is your hot take of the day for closing out the AI Engineer Summit today? 04:46:10.560 |
Definitely stop building chat interfaces. 04:46:14.200 |
I think chat is a product, AI is a tool, and so finding ways to, once again, I know that I've said this before, but, like, 04:46:23.740 |
improve on the machine-to-machine interfaces so that developers can actually benefit and use AI more directly, as opposed to building chat everywhere. 04:46:31.540 |
Mine is a little bit mean-spirited, so I apologize in advance. 04:46:35.400 |
I think a lot of the work that's in front of you as you're building out AI capabilities is going to be incredibly boring. 04:46:57.820 |
It's so fun, but there's a lot of data engineering work in front of you, and I think people haven't yet appreciated how important that is. 04:47:06.560 |
No, I think it's very real and very fair take, as all of us try to start, hopefully, moving into production with a bunch of this stuff. 04:47:13.200 |
That's where the rubber meets the road, right? 04:47:16.540 |
Thank you so much, the two of you, for coming up here with me. 04:47:28.600 |
I'm just going to say a few words of closing. 04:47:35.820 |
So we have placed a bunch of flag signs out in the ballroom. 04:47:42.320 |
So gravitate towards the topics that you want to discuss. 04:47:47.540 |
These are multi-modality, AI UX, agents, prompt engineering, code generation, LLM tooling, 04:48:00.840 |
retrieval augmented generation, OSS models, hire and pitch, LangChain and LlamaIndex, and GPUs and infra. 04:48:11.880 |
The LLM tooling, LangChain and LlamaIndex, and AI UX tables will be found in Carmel; well, not the LlamaIndex one, that's across by the Microsoft, Cloudflare, GitHub, and Weights & Biases booths. 04:48:25.020 |
And there's additional food in Carmel as well with these topics. 04:48:28.840 |
There's also Vector Village, a little further past Carmel. 04:48:35.440 |
You can connect your laptop, make a presentation. 04:48:37.820 |
There's also whiteboards for you to brainstorm. 04:48:40.920 |
And then obviously there's more lounge and table seating. 04:48:47.360 |
But don't worry, tomorrow we have a hosted bar courtesy of Decibel VC. 04:48:51.760 |
So thanks for making this an incredible opening day. 04:48:54.540 |
Please enjoy the topic tables until 9.30 p.m. 04:48:57.600 |
And then we'll see you tomorrow morning at 9 a.m. for Breakfast and Expo with the opening keynote from Mario Rodriguez at GitHub starting at 9.45.