Back to Index

Beating GPT-4 with Open Source Models - with Michael Royzen of Phind


Chapters

0:00 Introductions
1:02 Founding SmartLens in High School (2017)
3:44 Shifting to NLP
5:10 Sparking Interest in Long-Form Q&A (HuggingFace Demo)
8:32 Creating a Search Engine (Common Crawl, 2020)
11:29 Early Days: Hello Cognition to Phind
13:35 Phind Launch & In-Depth Look
20:58 Envisioning Phind: Integrating Reasoning with Code & Web
23:26 Exploring the Developer Productivity Landscape
26:28 Phind's Top Use Cases & Early Adoption
30:00 Behind Phind’s Rebranding (Advice from Paul Graham)
39:40 Crafting a Custom Model (Code Llama & Expanded Data)
44:34 Phind's Model: Evaluation Tactics & Metrics
47:00 Enhancing Accuracy with Reinforcement Learning
51:18 Running Models Locally: Interest & Techniques (Quantization)
67:13 Michael’s Autodidact Journey in AI Research
72:00 Lightning Round

Transcript

(upbeat music) - Hey everyone, welcome to the Latent Space Podcast. This is Alessio, Partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host, Swyx, founder of Smol AI. - Hey, and today we have in the studio Michael Royzen from Phind, welcome. - Thank you so much, it's great to be here.

- Yeah, we are recording this in a surprisingly hot October in San Francisco, and I mean, sometimes the studio works, but-- - The blue angels are flying by right now. - And the blue angels are flying by. (laughing) - Sorry about the noise. - I don't think they can hear it.

We have enough damping. Anyway, so welcome. I've seen Phind blow up this year, mostly I think since your launch in Feb, and V2, and then your Hacker News post. We tend to like to introduce our guests, but then obviously you can fill in the blanks with the origin story.

So you actually were a high school entrepreneur. You started SmartLens, which is a computer vision startup in 2017. - That's right, yeah. So I remember when TensorFlow came out and people started talking about, oh, obviously at the time, after AlexNet, the deep learning revolution was already in flow, and good computer vision models were a thing.

And what really made me interested in deep learning was I got invited to go to Apple's WWDC conference as a student scholar, 'cause I was really into making iOS apps at the time. And so I go there and I go to this talk where they added an API that let people run computer vision models on the device using far more efficient GPU primitives.

And after seeing that, I was like, oh, this is cool. This is gonna have a big explosion of different computer vision models running locally on the iPhone. And so I had this crazy idea where it was like, what if I could just make this model that could recognize just about anything and have it run on the device?

And that was the genesis for what eventually became SmartLens. I took this data set called ImageNet 22K. So most people, when they think of ImageNet, think of ImageNet 1K. But the full ImageNet actually has, I think, 22,000 different categories. Yeah, so I took that, filtered it, pre-processed it, and then did a massive fine tune on Inception V3, which was, I think, the state-of-the-art deep convolutional computer vision model at the time.

And to my surprise, it actually worked insanely well. I had no idea what would happen if I gave a single model, I think it ended up being approximately 17,000 categories, that I collapsed them into. It ended up working so well that it actually worked better than Google Lens, which released its V1 around the same time.
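
For a concrete sense of what a fine-tune like this involves, here's a minimal sketch in Keras, assuming an ImageNet-1K-pretrained Inception V3 and a collapsed ~17,000-class label set; the dataset and class count are illustrative, not SmartLens's actual code.

```python
# Rough sketch of fine-tuning Inception V3 on a large, collapsed label set
# (illustrative; not SmartLens's actual pipeline).
import tensorflow as tf
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras import layers, models

NUM_CLASSES = 17_000  # approximate count after collapsing ImageNet-22K categories

# Start from ImageNet-1K pretrained weights and drop the original classifier head.
base = InceptionV3(weights="imagenet", include_top=False, pooling="avg",
                   input_shape=(299, 299, 3))

# New classification head sized for the collapsed label set.
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(base.output)
model = models.Model(inputs=base.input, outputs=outputs)

model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# `train_ds` is assumed: a tf.data.Dataset of (image, label) pairs built from
# the filtered ImageNet-22K images, resized to 299x299.
# model.fit(train_ds, epochs=5)
```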

And so, and on top of this, the model ran on the device. So it didn't need an internet connection. A big part of the issue with Google Lens at the time was that connections were slower. 4G was around, but it wasn't nearly as fast. And so there was a noticeable lag having to upload an image to a server and get it back.

But just processing it locally, even on the iPhones of the day in 2017, was much faster. And so it was a cool little project. It got some traction. TechCrunch wrote about it. And there was kind of one big spike in usage, and then over time it tapered off. But people still pay for it, which is wild.

- That's awesome. Oh, it's like a monthly or annual subscription? - Yeah, it's like a monthly subscription. - Even though you don't actually have any servers. - Even though we don't have any servers. That's right, I was in high school. I wanted to make a little bit of money.

I was like, yeah. - That's awesome. The modern equivalent is kind of Be My Eyes. And they actually disclosed in the GPT-4 Vision system card recently that the usage was surprisingly not that frequent. The extent to which all three of us have a sense of sight, I would think that if I lost my sense of sight, I would use Be My Eyes all the time.

The average usage of Be My Eyes per day is 1.5 times. - Exactly. And I was thinking about this as well, where I was also looking into image captioning, where you give a model an image, and then it tells you what's in the image. But it turns out that what people want is the exact opposite.

People want to give you a description, well, people want to give a description of an image, and then have the AI generate the image. - Oh, the other way. - Exactly. And so, at the time, I think there were some GANs, NVIDIA was working on this back in 2019, 2020.

They had some impressive, I think, face GANs, where they had this model that would produce these really high quality portraits. But it wasn't able to take a natural language description the way Midjourney or DALL-E 3 can, and just generate you an image with exactly what you described in it.

- Awesome. And how'd that get into NLP? - I released the Smart Lens app, and that was around the time, I was a senior in high school, I was applying to college. College rolls around, I'm still sort of working on updating the app in college. But I start thinking like, hey, what if I make an enterprise version of this as well?

At the time, there was Clarifai that provided some computer vision APIs. But I thought, this massive classification model works so well, and it's so small, and so fast, might as well build an enterprise product. And I didn't even talk to users, or do any of those things that you're supposed to do.

I was just mainly interested in building a type of backend I've never built before. So I was mainly just doing it for myself, just to learn. And so I built this enterprise classification product, and as part of it, I'm also building an invoice processing product, where using some of the aspects that I built previously, although obviously it's very different from classification, I wanted to be able to just extract a bunch of structured data from an unstructured invoice through our API.

And that's what led me to Hugging Face for the first time, 'cause that involves some natural language components. And so I go to Hugging Face, and with the various encoder models that were around at the time, I think I used the standard BERT, and also Longformer, which came out around the same time.

And Longformer was interesting because it had a much bigger context window than those models at the time. Like BERT, all of the first-gen encoder-only models, they only had a context window of 512 tokens. And it's fixed. There's none of this ALiBi or RoPE that we have now, where we can basically massage it to be longer.

They were fixed, 512 absolute position encodings. And so Longformer at the time was the only way that you could fit a longer sequence, or ask a question about like 4,000 tokens worth of text. And so I implemented Longformer, and it worked super well. But nobody really used the enterprise product.
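
To make that concrete, here's a minimal sketch of extractive question answering over a few thousand tokens with a Longformer checkpoint from the Hugging Face hub; the checkpoint name and input file are assumptions for illustration, not what was actually used.

```python
# Minimal sketch: extractive QA over a ~4,000-token document with Longformer
# (checkpoint and input file are illustrative assumptions).
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

checkpoint = "allenai/longformer-large-4096-finetuned-triviaqa"  # assumed public QA fine-tune
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(checkpoint)

question = "What does the invoice say about late payment?"
document = open("invoice.txt").read()  # up to ~4,096 tokens, vs BERT's fixed 512

inputs = tokenizer(question, document, return_tensors="pt",
                   truncation=True, max_length=4096)
with torch.no_grad():
    outputs = model(**inputs)

# Take the highest-scoring start/end positions as the answer span.
start = outputs.start_logits.argmax()
end = outputs.end_logits.argmax() + 1
print(tokenizer.decode(inputs["input_ids"][0][start:end]))
```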

And that's kind of what I expected, 'cause at the end of the day, it was COVID. I was building this kind of mostly for me, mostly just kind of to learn. And so nobody really used it, and my heart wasn't in it, and I kind of just shelved it.

But a little later, I went back to Hugging Face, and I saw this demo that they had, and this is in the summer of 2020. They had this demo made by this researcher, Yacine Jernite. And he called it Long-Form Question Answering. And basically, it was this self-contained notebook demo where you can ask a model a question, the way that we do now with ChatGPT.

It would do a lookup into some database, and it would give you an answer. And it absolutely blew my mind. The demo itself, it used, I think, BART as the model. And in the notebook, it had support for both an Elasticsearch index of Wikipedia, as well as a dense index powered by Facebook's FAISS, I think that's how you pronounce it.

It had both, and it was very iffy. But when it worked, I think the question in the demo was, why are all boats white? When it worked, it blew my mind that instead of doing this few-shot thing, like people were doing with GPT-3 at the time, which was all the rage, you could just ask a model a question, provide no extra context, and it would know what to do and just give you the answer.
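
The pattern that demo used, retrieve passages from an index and then hand them to a seq2seq model, looks roughly like the sketch below; the embedding model, the passages, and the plain BART checkpoint here are stand-ins, not the original notebook's code.

```python
# Sketch of the retrieve-then-generate loop in a long-form QA demo
# (models, passages, and prompt format are illustrative stand-ins).
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import BartTokenizer, BartForConditionalGeneration

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in dense encoder
passages = [
    "Boats are often painted white because white reflects sunlight and keeps the hull cooler.",
    "White gelcoat also hides oxidation and is cheaper to repair and match.",
]

# Build a small dense index (the real demo used a FAISS index over Wikipedia).
emb = embedder.encode(passages, normalize_embeddings=True)
index = faiss.IndexFlatIP(emb.shape[1])
index.add(np.asarray(emb, dtype="float32"))

question = "Why are all boats white?"
q = embedder.encode([question], normalize_embeddings=True)
_, ids = index.search(np.asarray(q, dtype="float32"), 2)
context = " ".join(passages[i] for i in ids[0])

# Feed question + retrieved context to a seq2seq model (the demo used an ELI5-tuned BART).
tok = BartTokenizer.from_pretrained("facebook/bart-large")
bart = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
prompt = f"question: {question} context: {context}"
out = bart.generate(**tok(prompt, return_tensors="pt", truncation=True), max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))
```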

It blew my mind to such an extent that I couldn't stop thinking about that. And I started thinking about ways to make it better. I tried training or doing the fine-tune with a larger BART model. And this BART model, yeah, it was fine-tuned on this Reddit dataset called ELI5.

So basically- - The subreddit. - Yeah, the subreddit, yeah. Someone had scraped, I think, I forget who did it, but someone had scraped the subreddit. And put it into a well-formatted, relatively clean dataset of human questions and human answers. So we're bootstrapping this model from ELI5, and that made it pretty good at at least getting the right format when doing this RAG retrieval from these databases and then generating the final answer.

And so ELI5 actually turned out to be a good dataset for training these types of question-answering models because the question's written by a human. The answer's written by a human, and it at least helps the model get the format right. Even if the model is still very small and it can't really think super well, at least it gets the format right.

And so it ends up acting as kind of a glorified summarization model where if it's fed in high-quality context from the retrieval system, it's able to have a reasonably high-quality output. And so once I made the model as big as I could, just fine-tuning on BART-large, I started looking for ways to improve the index.

So in the demo, in the notebook, there were instructions for how to make an Elasticsearch index just for Wikipedia. And I was like, "Why not do all of Common Crawl?" So I downloaded Common Crawl, and thankfully I had like $10,000 or $15,000 worth of AWS credits left over from the SmartLens project.

That's what really allowed me to do this 'cause there's no other funding. I was still in college. Not a lot of money. And so I was able to spin up a bunch of instances and just process all of Common Crawl, which is massive. So it's roughly like, it's terabytes of text.

And so I whitelisted. I went to Alexa to get like the top 1,000 websites or 10,000 websites in the world, and then filtered only by those websites, and then indexed those websites 'cause the webpages were already included in the dump. So I just- - You mean to supplement Common Crawl or to filter Common Crawl?

- Filter Common Crawl. - Oh, okay. - Yeah. So we filtered Common Crawl just by, yeah, the top, I think, 10,000. Just to limit this, because obviously there's this massive long tail of small sites that are really cool, actually. And there's other projects like, shout out to Marginalia Nu, which is a search engine specialized in the long tail.

I think they actually exclude like the top 10,000. - That's what they do. - 10,000, yeah. - I've seen them around and just don't really know what their pitch is. - Yeah, yeah, yeah. So they exclude all the top stuff. So the long tail is cool, but for this, that was kind of out of the question, and that was most of the data anyway.

So we've removed that. And then I indexed the remaining approximately 350 million webpages through Elasticsearch. So I built this index running on AWS with these webpages, and it actually worked quite well. Like you can ask it like general common knowledge, history, politics, current events, questions, and it would be able to do a fast lookup in the index, feed it into the model, and it would give like a surprisingly good result.
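
The domain-whitelist-then-index step he's describing looks roughly like this sketch, using warcio to read a Common Crawl WARC file and the Elasticsearch bulk helper; the file names, index name, and whitelist file are assumptions for illustration, not the actual pipeline.

```python
# Sketch: filter a Common Crawl WARC dump to a domain whitelist and index the
# pages into Elasticsearch (paths and names are illustrative assumptions).
from urllib.parse import urlparse
from warcio.archiveiterator import ArchiveIterator
from elasticsearch import Elasticsearch, helpers

WHITELIST = set(open("alexa_top_10k.txt").read().split())  # assumed: one domain per line
es = Elasticsearch("http://localhost:9200")

def docs(warc_path):
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            domain = urlparse(url).netloc.removeprefix("www.")
            if domain not in WHITELIST:
                continue  # drop the long tail, keep only whitelisted sites
            html = record.content_stream().read().decode("utf-8", errors="ignore")
            yield {"_index": "commoncrawl", "_source": {"url": url, "html": html}}

helpers.bulk(es, docs("CC-MAIN-example.warc.gz"))

# At question time: a BM25 lookup, whose top hits get fed to the answering model.
hits = es.search(index="commoncrawl",
                 query={"match": {"html": "why are all boats white"}}, size=5)
```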

And so when I saw that, I thought that this is definitely doable. And like, it kind of shocked me that like no one else was doing this. And so this was now the fall of 2020. And yeah, I was kind of shocked no one was doing this, but it costs a lot of money to keep it up.

I was still in college. There were things going on. I got bogged down by classes. And so I ended up shelving this for almost a full year, actually. And I returned to it in fall of 2021, when BigScience released the T0 models, and that was a massive jump in the reasoning ability of the model.

And it was better at reasoning, it was better at summarization. It was still a glorified summarizer, basically. - Was this a precursor to BLOOM? Because BLOOM's the one that I know. - I think BLOOM ended up actually coming out in 2022, but BLOOM had other problems where I think for whatever reason, the BLOOM models just were never really that good, which is so sad 'cause I really wanted to use them.

But I think they didn't train on that much data. I think they used the original, they were trying to replicate GPT-3. So they just used those numbers, which we now know are far below Chinchilla optimal. And even Chinchilla optimal, which we can talk about later, what we're currently doing with the Phind model goes way beyond that.

But they weren't using enough data. I'm not sure how clean that data was, but it probably wasn't super clean. And then they didn't really do any fine tuning until much later. So T0 worked well because they took the T5 models, which were closer to Chinchilla optimal. 'Cause I think they were trained on also like 300 something billion tokens, similar to GPT-3, but the models were much smaller.

So the models, yeah, they were pre-trained better. And then they were fine-tuned on this. I think T0 is the first model that did large-scale instruction tuning from diverse data sources in the fall of 2021. This is before InstructGPT. This is before Flan-T5, which came out in 2022.

This is the very, very first, at least well-known example of that. And so it came out and then, on top of T0, I also did the Reddit ELI5 fine-tune. And that was the first model and system that actually worked well enough to where I didn't get discouraged like I did previously.

'Cause the failure cases of the BART-based system were so egregious. Sometimes it would just misinterpret your questions so horribly that it was just extremely discouraging. But for the first time, it was working reasonably well. I'm also using a much bigger model. I think the BART model is like 800 million parameters, but T0, we were using 3B.

So it was T0 3B, a bigger model. And that was the very first iteration of Hello. So I ended up doing a Show HN on Hacker News in January 2022 of that system: our fine-tuned T0 model connected to our Elasticsearch index of those 350 million webpages from the top 10,000 Common Crawl websites.

And to the best of my knowledge, I think that's the first example that I'm aware of of an LLM search engine model that's effectively connected to a large enough index that I would consider internet scale. So I think we were the first to release an internet-scale, LLM-powered RAG search system in January 2022.

And around the time me and my future co-founder, Justin, we were like, you know, we really, why not do this full time? Like this seems like the future. This is really cool. I couldn't really sleep even. Like I was going to bed and I was like, I was thinking about it.

Like I would stay up until like 2:30 AM, like reading papers on my phone in bed, go to sleep, wake up the next morning at like eight and just be super excited to keep working. And I was also doing my thesis at the same time, my senior honors thesis at UT Austin about something very similar.

We were researching factuality in abstractive question answering systems. So a lot of overlap with this project. And the conclusions of my research actually kind of helped guide the development path of Hello. In the research we found that LLMs don't, they don't know what they don't know. So the conclusion was that you always have to do a search to ensure that the model actually knows what it's talking about.

And my favorite example of this even today is kind of with ChatGPT Browsing, where you can ask ChatGPT Browsing, how do I run llama.cpp? And ChatGPT Browsing will think that llama.cpp is some file on your computer that you can just compile with GCC and you're all good.

It won't even bother doing a lookup, even though I'm sure somewhere in their internal prompts, they have something like, if you're not sure, do a lookup. Like that's not good enough. So models don't know what they don't know. You always have to do a search. And so we approached LLM powered question answering from the search angle.

We pivoted to make this for programmers in June of 2022, around the time that we were getting into YC. We realized that like, what we're really interested in, is the case where the models actually have to think. 'Cause up until then, the models were kind of more glorified summarization models.

Like we really thought of them like the Google featured snippets, but on steroids. And so we saw a future where the simpler questions would get commoditized. And I still think that's going to happen with like Google SGE, and nowadays it's really not that hard to answer the more basic kind of summarization, current events questions with lightweight models.

That'll only continue to get cheaper over time. And so we kind of started thinking about this trade-off where LLM models are going to get both better and cheaper over time. And that's going to force people who run them to make a choice. Either you can run a model of the same intelligence that you could previously for cheaper, or you can run a better model for the same price.

And so someone like Google, once the price kind of falls low enough, they're going to deploy, and they're already doing this with SGE, they're going to deploy a relatively basic kind of glorified summarizer model that can answer very basic questions about like current events, like who won the Superbowl, like what's going on on Capitol Hill, like those types of things.

And the flip side of that is like more complex questions where like you have to reason and you have to solve problems and like debug code. And we realized like we were much more interested in kind of going along the bleeding edge of that frontier case. And so we've optimized everything that we do for that.

And that's a big reason of why we've built Phind specifically for programmers, as opposed to saying like, we're kind of a search engine for everyone because as these models get more capable, we're very interested in seeing kind of what the emergent properties are in terms of reasoning, in terms of being able to solve complex multi-step problems.

And I think that some of those emergent capabilities, like we're starting to see, but we don't even fully understand. So as I think there's always an opportunity for us to become more general if we wanted, but we've been along this path of like, what is the best, most advanced reasoning engine that's connected to your code base, that's connected to the internet that we can just provide?

- What is Phind today, pragmatically, from a product perspective? How do people interact with it? How does it plug into your workflow? - Yeah, so Phind is really a system. Phind is a system for programmers when they have a question or when they're frustrated or when something's not working.

- You're frustrated. - Yeah, for them to get unblocked. The most abstract pitch for Phind is like, if you're experiencing really any kind of issue as a programmer, we'll solve that issue for you in 15 seconds as opposed to 15 minutes or longer. And so, Phind has an interface on the web.

It has an interface in VS Code and more IDEs to come. But ultimately, it's just a system where a developer can paste in a question or paste in code that's not working. And Phind will do a search on the internet, or it will find other code in your code base, perhaps, that's relevant.

Phind will find the context that it needs to answer your question and then feed it to a reasoning engine powerful enough to actually answer it. So, that's really the philosophy behind Phind. It's a system for getting developers the answers that they're looking for. And so, right now from a product perspective, this means that we're really all about getting the right context.

So, the VS Code extension that we launched recently is a big part of this 'cause you can just ask a question and it knows where to find the right code context in your code. It can do an internet search as well. So, it's up to date. And it's not just reliant on what the model knows.

And it's able to figure out what it needs by itself and answer your question based on that. And if it needs some help, there are also opportunities for you yourself to put all that context in. But the issue is also not everyone wants to use VS Code.

Some people are real Neovim sticklers or they're using PyCharm or other IDEs, JetBrains. And so, for those people, they're actually okay with switching tabs, at least for now, if it means them getting their answer. 'Cause really, there's been an explosion of all these startups doing code, doing search, et cetera.

But really, who everyone's competing with is ChatGPT, which only has that one web interface. And ChatGPT is really the bar. And so, that's what we're up against. - And so, your idea, we have Aman from Cursor on the podcast and they've gone through the, we need to own the IDE thing.

Yours is more like, in order to get the right answer, people are happy to go somewhere else, basically. They're happy to get out of their IDE. - That was a great podcast, by the way. But yeah, so part of it is that people sometimes perhaps aren't even in an IDE.

So, the whole task of software engineering goes way beyond just running code, right? There's also a design stage. There's a planning stage. A lot of this happens on whiteboards. It happens in notebooks. And so, the web part of it also exists for that, where you're not even coding it and you're just trying to get a more conceptual understanding of what you're trying to build first.

But the podcast with Aman was great, but somewhere where I disagree with him is that you actually need to own the IDE.

I think he made kind of some good points about not having platform risk in the longterm, but some of the features that were mentioned, like suggesting diffs, for example, those are all doable with an extension. We haven't yet seen, with VS Code in particular, any functionality that we'd like to do yet in the IDE that we can't either do through directly supported VS Code functionality or something that we kind of hack into there, which we've also done a fair bit of.

And so I think it remains to be seen where that goes. But I think what we're looking to be is we're not trying to just be in an IDE or be an IDE. Phind is a system that goes beyond the IDE and is really meant to cover the entire lifecycle of a developer's thought process in going about like, hey, I have this idea and I want to get from that idea to a working product.

And so then that's what the long-term vision of Phind is really about, is starting with that, where in the future, I think programming is just going to be really just the problem solving. Like you come up with an idea, you come up with the basic design for the algorithm in your head, and you just tell the AI, hey, just do it.

Just make it work. And that's what we're building towards. - Fantastic. I think we might want to give people some impression of the type of traffic that you have, because when you're presented with a text box, you could type in anything. And I don't know if you have some mental categorization of what are the top three use cases that people tend to come to Phind for.

- Yeah, that's a great question. So the two main types of searches that we see are how-to questions, like how to do X using Y tool. And this historically has been our bread and butter, because with our embeddings, like we're really, really good at just going over a bunch of developer documentation and figuring out exactly the part that's relevant and just telling you, okay, like you can use this method.
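
The "figure out exactly the part of the docs that's relevant" step he's describing is, at its simplest, an embedding similarity lookup; here's a minimal sketch of that idea with a placeholder embedding model and toy doc chunks, not Phind's actual embeddings.

```python
# Sketch: pick the most relevant documentation chunk for a how-to question
# via embedding similarity (model and chunks are illustrative placeholders).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

doc_chunks = [
    "requests.get(url, timeout=5) sends a GET request with a 5 second timeout.",
    "Session objects let you persist cookies and headers across requests.",
    "To stream a large download, pass stream=True and iterate over iter_content().",
]
question = "how to download a large file with requests without loading it all into memory"

chunk_emb = model.encode(doc_chunks, convert_to_tensor=True)
q_emb = model.encode(question, convert_to_tensor=True)

# Cosine similarity picks the chunk that actually answers the how-to question.
scores = util.cos_sim(q_emb, chunk_emb)[0]
print(doc_chunks[int(scores.argmax())])
```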

But as LLMs have gotten better, and as we've really transitioned to using GPT-4 a lot in our product, people organically just started pasting in code that's not working and just said, fix it for me. - Fix this. - Yeah. And what really shocks us is that a lot of the people who do that, they're coming from ChatGPT.

So they tried it in ChatGPT, with GPT-4. It didn't work. Maybe it required like some multi-step reasoning. Maybe it required like some internet context or something found in either a Stack Overflow post or some documentation to solve it. And so then they paste it into Phind and then Phind works.

So those are really those two different cases. Like, how can I build this conceptually or like remind me of this one detail that I need to build this thing, or just like, here's this code, fix it. And so that's what a big part of our VS Code extension is, is like enabling a much smoother, here, just like fix it for me type of workflow.

That's really its main benefits. Like it's in your code base, it's in the IDE. It knows how to find the relevant context to answer that question. But at the end of the day, like I said previously, that's still a relatively, not to say it's a small part, but it's a limited part of the entire kind of mental lifecycle of a programmer.

- Yeah. When you launched in, so you launched in Feb and then you launched V2 in August, you had a couple other pretty impactful posts/feature launches. The web search one was massive. And so you were mostly a GPT-4 wrapper. - We were for a long time. - For a long time, until recently.

- Yeah, until recently. - So like, people coming over from ChatGPT were getting the same model. - Yep. - So what would be the draw, your web search? Would that be the primary value proposition? - Basically, yeah. And so what we've seen is that any model plus web search is just significantly better than that model itself.

- Do you think that's what you got right in April? Like, so you got 1500 points on Hacker News in April, which is like, if you live on Hacker News a lot, that is unheard of for someone so early on in your journey. - Yeah, super, super grateful for that.

Definitely was not expecting it. So what we've done with Hacker News is we've just kept launching. - Yeah. - Like, what they don't tell you is like, you can just keep launching. So that's what we've been doing. So we launched the very first version of Phind in its current incarnation after the previous demo that was connected to our own index.

Like once we got into YC, we scrapped our own index 'cause it was too cumbersome at the time. We moved over to using Bing as kind of just the raw source data. And we launched as Hello Cognition. And over time, every time we like added some intelligence to the product, a better model, we just keep launching.

And every additional time we launched, we got way more traffic. So we actually silently rebranded to Phind in late December of last year. But like, we didn't have that much traffic. Like nobody really knew who we were. - How'd you pick the name of it? - Paul Graham actually picked it for us.

- All right, tell the story. - Yeah, so, oh boy. Yeah, where do I start? So this is a big aside. Should we go for like the full Paul Graham story or just the name? - Do you wanna do it now or you wanna do it later? I'll give you a choice.

(laughs) - I think, okay, let's just start with the name for now and then we can do the full Paul Graham story later. But basically, Paul Graham, when we were lucky enough to meet him, he saw our name and our domain was at the time, sayhello.so. And he's just like, "Guys, like, come on.

Like, what is this?" You know, like, and we were like, "Yeah." But like when we bought it, you know, we just kind of broke college students. Like we didn't have that much money. And like, we really liked "hello" as a name because it was the first like conversational search engine.

And that's kind of, that's the angle that we were approaching it from. And so we had sayhello.so and he's like, "There's so many problems with that." Like the sayhello, like what does that even mean? And like .so, like, it's gotta be like a .com. We spent some time just like with Paul Graham in the room.

We just like looked at different domain names, like different things that like popped into our head. And one of the things that popped up, that Paul Graham said, was Phind. Like with the P-H-I-N-D spelling in particular. - Yeah, which is not typical naming advice, right? - Yes. - Because it's not, when people hear it, they don't spell it that way.

- Exactly. It's hard to spell. And also it's like very nineties. And so at first, like, we didn't like it. I was like, I don't know. But over time, like it kind of, it kept growing on us. And eventually we're like, okay, you know, we like the name.

It's owned by this elderly Canadian gentleman who we got to know, and he was willing to sell it to us. And so we bought it and we changed the name. Yeah. But anyways, where were we? - I had to ask. I mean, you know, everyone who looks at you is wondering.

- A lot of people, and a lot of people actually pronounce it finned, which, you know, by now is kind of, you know, it's part of the game, but eventually we want to buy F-I-N-D.com and then just have that redirect to P-H-I-N-D. So P-H-I-N-D is like definitely the right spelling.

But like, we'll just, yeah, we'll have all the cases addressed. - So Bing web search, and then in August you launched V2. Could you, is V2 the Phind-as-a-system pitch? Or have you moved, evolved since then? - Yeah, so I don't, like the V2 moniker, like I don't really think of it that way in my mind.

There's like, there's the version we launched during, last summer during YC, which was the Bing version directed towards programmers. And that's kind of like, that's why I call it like the first incarnation of what we currently are. 'Cause it was already directed towards programmers. We had like a code snippet search built in as well.

'Cause at the time, you know, the models we were using weren't good enough to generate code snippets. Even GPT, like text-davinci-002, which was available at the time, wasn't that good at generating code. And it would generate like very, very short, very incomplete code snippets. And so we launched that last summer.

Got some traction, but really like we were only doing like, I don't know, maybe like 10,000 searches a day. Like some people knew about it. Some people use it, which is impressive. 'Cause looking back, the product like was not that good. And yeah, every time we've like made an improvement to the way that we retrieve context through better embeddings, more intelligent, like HTML parsers, and importantly, like better underlying models.

Yeah, I would really consider every kind of iteration after that when we, every major version after that was when we introduced the better underlying answering model. Like in February, we launched, we had to swallow a bit of our pride when we were like, okay, our own models aren't good enough.

We have to go to OpenAI. And that actually, that did lead to kind of like our first like decent bump of traffic in February. And people kept using it. Like our retention was way better too. But we were still kind of running into problems of like more advanced reasoning.

Some people tried it, but people were leaving because even like GPT-3.5, both turbo and non-turbo, was still not that great at doing like code-related reasoning beyond like the how do you do X, like documentation search type of use case. And so it was really only when GPT-4 came around in April that we were like, okay, like this is like our first real opportunity to really make this thing like the way that it should have been all along.

And having GPT-4 as the brain is what led to that Hacker News post. And so what we did was we just let anyone use GPT-4 on Phind for free without a login, which I actually don't regret. So it was very expensive obviously, but like at that stage, all we needed to do was show like, we just needed to like show people, here's what Phind can do.

That was the main thing. And so that worked, that worked. Like we got a lot of users. Do you know Fireship? - Yeah, the YouTuber, Jeff Delaney. - Yeah, he made a short about Phind. And that's on top of the Hacker News post. And that's what like really, really made it blow up.

It got millions of views in days. And he's just funny. Like what I love about Fireship is like he, like you guys, yeah. Yeah, like humor goes a long way towards like really grabbing people's attention. And so that blew up. - So something I would be anxious about as a founder during that period.

So obviously we all remember that pretty closely. There were a couple of people who had access to the GPT-4 API doing this, which is unrestricted access to GPT-4. And I have to imagine OpenAI wasn't that happy about that. Because it was like kind of de facto access to GPT-4 before they released it.

- GPT-4 was in ChatGPT from day one, I think. OpenAI actually came to our support because what happened was we had people building unofficial APIs around Phind. Yeah, to try to get free access to it. And I think OpenAI actually has the right perspective on this where they're like, "Okay, people can do whatever they want with the API.

If they're paying for it, they can do whatever they want. But it's not okay if paying customers are being exploited by these other actors." So they actually got in touch with us and they helped us set up better Cloudflare bot monitoring controls to effectively crack down on those unofficial APIs, which we're very happy about.

But yeah, so we launched GPT-4. A lot of people come to the product. And yeah, for a long time we're just, we're figuring out like, how do we, like, what do we make of this, right? Like, how do we, A, make it better, but also deal with like our costs, which have just like massively, massively ballooned.

And I think over time it's, I think it's become more clear with the release of Llama 2 and Llama 3 on the horizon that we will once again see a return to vertical applications running their own models. As was true last year and before, I think that GPT-4, my hypothesis is that the jump from 4 to 4.5 or 4 to 5 will be smaller than the jump from 3 to 4.

And the reason why is because there were a lot of different things. Like there was two plus, effectively two, two and a half years of research that went into going from 3 to 4. Like more data, bigger model, all of like the instruction tuning techniques, RLHF, all of that is known.

And like Meta, for example, and now there's all these other startups like Mistral too. Like there's a bunch of very well-funded open source players that are now working on just like taking the recipe that's now known and scaling it up. So I think that even if a Delta exists in 2024, the Delta between proprietary and open source won't be large enough that a startup like us with a lot of data that we've collected can take the data that we have, fine tune an open source model and like be able to have it be better than whatever the proprietary model is at the time.

That's my hypothesis. That we'll once again see a return to these verticalized models. And that's something that we're super excited about 'cause yeah, that brings us to kind of the Phind model, because the plan from kind of the start was to be able to return to that if that makes sense.

And I think now we're definitely at a point where it does make sense because we have requests from users who like, they want longer context in the model basically. Like they want to be able to ask questions about their entire code base without, you know, chunking and retrieval and taking a chance on that. Like I think it's generally been shown that if you have the space to just put the raw files inside of a big context window, that is still better than chunking and retrieval.

It just is. So there's various things that we could do with longer context, faster speed, lower cost. Super excited about that. And that's the direction that we're going with the Phind model. And our big hypothesis there is precisely that we can take a really good open source model and then just train it on absolutely all of the high quality data that we can find.

And there's a lot of various, you know, interesting ideas for this. We have our own techniques that we're kind of playing with internally. One of the very interesting ideas that I've seen is OctoPack from BigCode. I don't think that it made that big of waves when it came out, I think in August, but the idea is that they have this dataset that maps GitHub commits to a change.

So basically there's all this really high quality, like human-made, human-written diff data out there on every time someone makes a commit in some repo. And you can use that to train models. You take the file state before and like given a commit message, what should that code look like in the future?
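
Turning that commit data into supervised examples is conceptually simple: pair the file's state before the commit plus the commit message as the prompt, with the file's state after as the target. A minimal sketch is below; the JSONL field names are illustrative, not the actual CommitPack/OctoPack schema.

```python
# Sketch: build instruction-tuning examples from commit data, in the spirit of
# OctoPack/CommitPack (field names are illustrative, not the real dataset schema).
import json

def commit_to_example(commit):
    """Map (file before, commit message) -> file after, as a prompt/target pair."""
    prompt = (
        "Here is a file before a change:\n"
        f"{commit['old_contents']}\n\n"
        f"Commit message: {commit['message']}\n\n"
        "Write the full file after this change is applied:"
    )
    return {"prompt": prompt, "completion": commit["new_contents"]}

with open("commits.jsonl") as f, open("train.jsonl", "w") as out:
    for line in f:
        example = commit_to_example(json.loads(line))
        out.write(json.dumps(example) + "\n")
```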

- Got it. - You can-- - HumanEval though, is it any good? - No, unfortunately. So we ran this experiment, we trained the Phind model. And if you go to the BigCode leaderboard as of today, October 5th, all of our models are at the top of the BigCode leaderboard by far, it's not close, particularly in languages other than Python.

We have a 10 point gap between us and the next best model on Java, JavaScript, I think C#, multilingual. And what we kind of learned from that whole experience releasing those models is that HumanEval doesn't really matter. Not just that, but GPT-4 itself has been trained on HumanEval.

And we know this because GPT-4 is able to predict the exact docstring in many of the problems. I've seen it predict like the specific example values in the docstring, which is extremely improbable for it to just, you know, know. So I think there's a lot of dataset contamination, and it only captures a very limited subset of what programmers are actually doing.

What we do internally for evaluations is we have GPT-4 score answers. GPT-4 is a really good evaluator. I mean, obviously by really good, I mean it's the best that we have. I'm sure that, you know, a couple of months from now next year, we'll be like, oh, you know, like GPT-4.5, GPT-5, it's so much better, like GPT-4 is terrible.

But like right now it's the best that we have short of humans. And what we found is that when doing like temperature-zero evals, GPT-4 is actually mostly deterministic across runs in assigning scores to two different answers. So we found it to be a very useful tool in comparing our model to, say, GPT-4.

Yeah, on our like internal, like real world, here's what people will be asking this model dataset. And the other thing that we're running is just like releasing the model to our users and just seeing what they think. 'Cause that's like the only thing that really matters is like releasing it for the application that it's intended for and then seeing how people react.
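
In practice, a pairwise GPT-4 judge at temperature zero looks roughly like the sketch below; the prompt, the 1-10 scale, and the older openai 0.x-style client are illustrative assumptions, not Phind's actual eval harness.

```python
# Sketch: temperature-zero GPT-4 grading of two answers to the same question
# (prompt, scale, and openai 0.x-style client are illustrative assumptions).
import openai

def judge(question, answer_a, answer_b):
    prompt = (
        f"Question:\n{question}\n\n"
        f"Answer A:\n{answer_a}\n\n"
        f"Answer B:\n{answer_b}\n\n"
        "Score each answer from 1-10 for correctness and helpfulness, "
        "then reply with exactly: A=<score> B=<score>"
    )
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # temperature-zero scoring is close to deterministic across runs
    )
    return resp["choices"][0]["message"]["content"]

answer_a = "Iterate with three pointers (prev, cur, next), reversing each link in place."
answer_b = "Recursively reverse the rest of the list, then append the head at the end."
print(judge("How do I reverse a singly linked list?", answer_a, answer_b))
```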

And for the most part, the incredible thing is that people don't notice a difference between our model and GPT-4 for the vast majority of searches. There's some reasoning problems that GPT-4 can still do better. We're working on addressing that. But in terms of like the types of questions that people are asking on Phind, yeah, like there's not that much difference.

And in fact, like I've been running my own kind of side-by-side comparisons. Shout out to Godmode by the way. And I've like myself, I've kind of confirmed this to be the case. And even sometimes it gives a better answer, perhaps like more concise or just like better implementation than GPT-4, which that's what surprises me.

And so by now we kind of have like this reasoning-is-all-you-need kind of hypothesis, where we've seen emergent capabilities in the Phind model whereby, training it on high quality code, it can actually reason better. It went from not being able to solve word problems, like riddles with temporal reasoning and placement of objects and moving things around and stuff like that, that GPT-4 can do pretty well.

We went from not being able to do those at all to being able to do them just by training on more code, which is wild. So we're already like starting to see like these emerging capabilities. - Yeah, so I just wanted to make sure that we have the, I guess like the model card in our heads.

So you started from Code Llama? - Yes. - 65, 34? - 34. So unfortunately there's no Code Llama 70B. If there was, that would be super cool, but there's not. - 34, and then, which in itself was Llama 2, which was trained on two trillion tokens, and they added 500 billion code tokens.

- Yes. - And you just added a bunch more. - Yeah, and they also did a couple of things. So they did, I think, 500 billion of general code pre-training and then they did an extra 20 billion of long-context pre-training. So they actually increased the max position tokens to 16K, up from 8K.

And then they changed the theta parameter for the RoPE embeddings as well to give it theoretically better long-context support up to 100K tokens. But yeah, but otherwise it's like basically Llama 2. - So you just took that and just added data? - Exactly. - You didn't do any other fundamental?

- Yeah, so we didn't actually, we haven't yet done anything with the model architecture, and we just trained it on like many, many more billions of tokens on our own infrastructure. And something else that we're taking a look at now is using reinforcement learning for correctness. One of the interesting pitfalls that we've noticed with the Phind model is that in cases where it gets stuff wrong, it sometimes is capable of getting the right answer.

It's just, there's a big variance problem. It's wildly inconsistent. Like there are cases when it is able to get the right chain of thought and able to arrive at the right answer, but not always. And so one of our hypotheses, something that we're gonna try, is that we can actually do reinforcement learning where, for a given problem, we generate a bunch of completions and then use the correct answer as like a loss, basically, to try to get it to be more correct.

And I think there's a high chance I think of this working because it's very similar to the like RLHF method where you basically show pairs of completions for a given question, except the criteria is like, which one is like, you know, less harmful. But here, you know, we have a different criteria, but if the model's already capable of getting the right answer, which it is, we just need to cajole it into being more consistent.
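
The simplest way to operationalize "generate a bunch of completions and use the correct answer as the signal" is rejection sampling: sample several completions per problem, keep the ones that pass a correctness check, and fine-tune on those. The sketch below is one illustration of that idea, with stand-in generation and checking functions; it is not necessarily how Phind will implement it.

```python
# Sketch: rejection sampling on correctness — sample several completions per problem,
# keep the ones that pass a check, and fine-tune on those. Stand-in functions; this is
# one way to operationalize the idea, not necessarily Phind's method.
import random

def generate_completions(model, problem, n=8):
    # Stand-in for sampling n chains of thought / solutions from the model.
    return [model(problem["question"], seed=random.random()) for _ in range(n)]

def is_correct(problem, completion):
    # Stand-in correctness check, e.g. run unit tests or compare to a known answer.
    return problem["expected_answer"] in completion

def build_training_set(model, problems):
    kept = []
    for problem in problems:
        for completion in generate_completions(model, problem):
            if is_correct(problem, completion):
                kept.append({"prompt": problem["question"], "completion": completion})
                break  # one verified-correct sample per problem is enough here
    return kept

# `kept` then becomes the data for another round of fine-tuning, nudging the model
# toward the chains of thought that actually land on the right answer.
```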

- There were a couple of things that I noticed in the product that were not strange, but unique. So first of all, the model can talk multiple times in a row, while most other applications are like human, model, human, model. And then, outside of the thumbs up, thumbs down, you have things like have the LLM prioritize this message and its answers, or then continue from this message to like go back.

How does that change the flow of the user? And like in terms of like prompting it, yeah, what are like some tricks or learnings to that? - Yeah, that's a good question. So yeah, that's specifically in our pair programmer mode, which is a more conversational mode that also like asks you clarifying questions back if it doesn't fully understand what you're doing and it kind of, it holds your hand a bit more.

And so from user feedback, we had requests to make it more of an AutoGPT, where you can kind of give it this problem that might take multiple searches or multiple different steps, like multiple reasoning steps to solve. And so that's the impetus behind building that product, being able to do multiple steps and also be able to handle really long conversations.

Like people are really trying to use the pair programmer to go from like, sometimes really from like basic idea to like complete working code. And so what we noticed was that we were having like these very, very long threads, sometimes with like 60 messages, like a hundred messages. And like those become really, really challenging to manage, like the appropriate context window of what should go inside of the context and how to preserve the context so that the model can continue, or the product can continue giving good responses, even if you're like 60 messages deep in a conversation.

So that's where the prioritized user messages feature comes from, is like, people have asked us to just like let them pin messages that they want to be left in the conversation. And yeah, and then that seems to have like really gone a long way towards solving that problem. - Yeah, and then you have a Run in Replit thing.

Are you planning to build your own REPL, like learning from people trying to run the wrong code, unsafe code? - Yes, yes. So I think like in the long-term vision of like being a place where people can go from like idea to like fully working code, having a code sandbox, like a natively integrated code sandbox, makes a lot of sense.

And Replit is great and people use that feature. But yeah, I think there's more we can do in terms of like having something a bit closer to Code Interpreter, where it's able to run the code and then like recursively iterate on it, exactly. - I think Replit is working on APIs to enable you to do that.

- Yep. - So Amjad has specifically told me in person that he wants to enable that for people. At the same time, he's also working on his own models. - Right. - And Ghostwriter and all the other stuff. - Yeah. - So it's gonna get interesting. Like he wants to power you, but also compete with you.

- Yeah. And like, and we love Replit. I think that a lot of these, like a lot of the companies in our space, like we're all going to converge to solving a very similar problem, but from a different angle. So like Replit approaches this problem from the IDE side.

Like they started as like this IDE that you can run in the browser. And they started for like from that side, making coding just like more accessible. And we're approaching it from the side of like an LLM that's just like connected to everything that it needs to be connected to, which includes your code context.

So that's why like we're kind of making, you know, inroads into IDEs. But we're kind of, we're approaching this problem from different sides. And I think it will be interesting to see where things end up. But I think that, you know, in the long, long term, we have an opportunity to also just have like this general kind of like technical reasoning engine product that's, you know, potentially also not just for programmers and it's also powered in this web interface.

Like where there's potential, I think other things that we will build that eventually might go beyond like our current scope. - Exciting, we'll look forward to that. - Thank you. - We're gonna zoom out a little bit into sort of AI ecosystem stories, but first we gotta get the Paul Graham, Ron Conway story.

- Yeah, so flashback to last summer, we're in the YC batch. And we're doing the summer batch, summer 22. So the summer batch runs from June to September, approximately. This was late July, early August, right around the time that many like YC startups start like going out, like gearing up, here's how we're gonna pitch investors and everything.

And at the same time, me and my co-founder, Justin, we were planning on moving to New York. So for a long time, actually, we were thinking about building this company in New York, mainly for personal reasons, actually. 'Cause like during the pandemic, pre-ChatGPT, pre last year, pre the AI boom, SF unfortunately really kind of like-- - So did.

- Lost its luster, yeah, like no one was here. It was far from clear, like if there would be an AI boom, if like SF would be like the AI-- - Back. - Yeah, exactly. If SF would be so back, as everyone is saying these days, it was far from clear.

And so, and all of our friends, we were graduating college, 'cause like we happened to just graduate college and immediately start YC. Like we didn't even have, I think we had a week in between. So it was just-- - You didn't bother looking for jobs, you were just like, this is all good.

- Well, actually, both me and my co-founder, we had jobs that we secured in 2021 from previous internships, but we both, like we, funny enough, when I spoke to my boss's boss at the company at which, like where I reneged my offer, I told him we got into YC.

They actually said, yeah, you should do YC. - Wow, that's very selfless, that's great. - Yeah, that was really great that they did that. - In San Francisco, they would have offered to invest as well. - Yes, yes, they would have. But yeah, we were both planning to be in New York.

And all of our friends were there from college. And so like at this point, like we have this whole plan, we're like on August 1st, we're gonna move to New York. And we had like this Airbnb for the month of New York, we're gonna stay there and we're gonna work and like all of that.

The day before we go to New York, I call Justin and I just, I tell him like, why are we doing this? Like, why are we doing this? 'Cause in our batch, like by the time that August 1st rolled around, all of our mentors at YC were saying like, hey, like, you should really consider staying in SF.

- It's the hybrid batch, right? - Yeah, it was the hybrid batch. But like there were already signs that like something was kind of like afoot in SF, even if like we didn't fully wanna admit it yet. And so we were like, no, I don't know. And so the day before, like, I don't know, something kind of clicked when the rubber met the road and it was time to go to New York.

We were like, why are we doing this? And like, we didn't have any good reasons for staying in New York at that point beyond like our friends are there. So we still go to New York 'cause like we have the Airbnb, like we don't have any other kind of place to go for the next few weeks.

We're in New York. And New York is just unfortunately too much fun. Like all of my other friends from college who are just, you know, like basically starting their jobs, starting their lives as adults, you know, they just got stuck into these jobs. They're making all this money and they're like partying and like all these things are happening.

And like, yeah, it's just a very distracting place to be. And so we were just like sitting in this like small, you know, like cramped apartment, terrible posture, trying to get as much work done as we can. Too many distractions. And then we get this email from YC saying that Paul Graham is in town, in SF, and he is doing office hours with a certain number of startups in the current batch.

And whoever signs up first gets it. And I happened to be super lucky. I was about to go for a run, but I just, I saw the email notification come across the screen. I immediately clicked on the link. And like immediately, like half the spots were gone, but somehow the very last spot was still available.

And so I picked the very, very last time slot at 7 p.m. semi-strategically, you know, so we would have like time to go over. And also because like, I didn't really know how we're going to get to SF yet. And so we made a plan that we're going to fly from New York to SF and back to New York in one day and do like the full round trip.

And we're going to meet with PG at the YC Mountain View office. And so we go there, we do that. We meet PG, you know, we tell him about the startup. And one thing I love about PG is that he gets like, he gets so excited. Like when he gets excited about something, like you can see his eyes like really light up.

And he'll just start asking you questions. In fact, it's a little challenging sometimes to like finish kind of like the rest of like the description of your pitch. 'Cause like, he'll just like start like, you know, asking all these questions about how it works and like, you know, what's going on.

- And what was the most challenging question that he asked you? - I think that like, he was asking us a lot of questions about like, like really how it worked. 'Cause like, as soon as like we told him like, hey, like we think that the future of search is answers, not links.

Like, we could really see like the gears turning in his head. I think we were like the first demo of that, that he saw. - And you're like 10 minutes with him, right? - We had like 45, yeah, we had a decent chunk of time. Yeah. And so we tell him how it works.

Like, he's very excited about it. And I just like, I just blurt it out. I just like ask him to invest. And he hasn't even seen the product yet. Oh, we just asked him to invest. And he says, yeah. And like, we're super excited about that. - And you're like, you haven't even finished your batch.

- No, no, no, this is like... - This is after your batch. - Yeah, this is about halfway through the batch. Or two, two, no, two thirds of the batch. - Which when you're like not technically fundraising yet. - Or about to start fundraising. Yeah, so we have like this demo and like we showed him and like, there was still a lot of issues with the product.

But I think like, it must have like still kind of like blown his mind in some way. And so, yeah, so like we're having fun. He's having fun. We have this dinner planned with this other friend that we had in SF. 'Cause we were only there for that one day.

So we thought, okay, after an hour, we'll be done. We'll grab dinner with our friend and we'll fly back to New York. But PG was like, I'm having so much fun. Like, do you wanna... - Have dinner? - Yeah, come to my house. Or he's like, I gotta go have dinner with my wife, Jessica.

Who's also awesome, by the way. - She's like the heart of YC. - Yeah, yeah, like Jessica does not get enough credit as an aside for her role. - He tries, he tries. - He tries, but like, yeah, Jessica really deserves a lot of credit. 'Cause she like, he understands like the technical side and she understands people and together, they're just like a phenomenal team.

But he's like, yeah, I gotta go see Jessica. But you guys are welcome to come with. Do you wanna come with? And we're like, we have this friend who's like right now outside of, like we're literally outside the door who like we also promised to get dinner with. So like, we'd love to, but like, I don't know if we can.

He's like, oh, he's welcome to come too. So like, yeah, so all of us just like hop in his car and we go to his house and then we just like have this, like we have dinner and we have this, like just chat about the future of search. Like I remember him telling Jessica distinctly, like our kids and our kids' kids are like, are not gonna know what like a search result is.

Like they're just gonna like have answers. So that was really like a mind blowing, like inflection point moment for sure. - Wow, that email changed your life. - Absolutely. - And you also just spoiled the booking system for PG. 'Cause now everyone's just gonna go after the last slot.

- Oh man, yeah, but like, I don't know if he even does that anymore. - He does, he does. Yeah, I've met other founders that he did it this year. - This year, gotcha. But when we told him about how we did it, he was like, I am like frankly shocked that like YC just did like a random like scheduling system.

They didn't like do anything else, but. - Okay, and then he introduces Duron Conway. - Yes. - Who is one of the most legendary angels in Silicon Valley. - Yes, so after PG invested, the rest of our round came together pretty quickly. And so like-- - By the way, I'm surprised, like it might feel like playing favorites, right?

Within the current batch to be like, yo, PG invested in this one. - Right, and like, yes. - Too bad for the others. - Too bad for the others, I guess. I think this is a bigger point about YC and like these accelerators in general is like, YC gets like a lot of criticism from founders who feel like they didn't get value out of it.

But like, in my view, YC is what you make of it. Like, and YC tells you this, they're like, you really got to grab this opportunity like by the balls and make the most of it. And if you do, then it could be the best thing in the world.

And if you don't, and if you're just kind of like a passive, even like an average founder in YC, you're still gonna fail. And they tell you that, they're like, if you're average in your batch, you're gonna fail. Like you have to just be exceptional in every way. And so yeah, after PG invested, the rest of our round came together pretty quickly, which I'm very fortunate for.

And yeah, he introduces us to Ron. And after he did, I get a call from Ron. And Ron says like, hey, like, you know, PG tells me what you're working on, I'd love to come meet you guys. And I'm like, wait, no way. And we're just holed up in this like little house in San Mateo, which is a little small, but you know, it had a nice patio.

In fact, we had like our monitors set up outside on the deck out there. And so Ron Conway comes over, we go over to the patio, where like our workstation is. And Ron Conway, he's known for having like this notebook that he goes around with, where he like sits down with the notebook and like takes very, very detailed notes.

So he never like forgets anything. So he sits down with his notebook and he asks us like, hey guys, like, what do you need? And we're like, oh, we need GPUs. Like back then the GPU shortage wasn't even nearly as bad as it is now. But like, even then it was still challenging to get like the quota that we needed.

And he's like, okay, no problem. And then like he leaves, and a couple hours later we get an email, and we're CC'd on an email that Ron wrote to Jensen, the CEO of NVIDIA, saying like, hey, like these guys need GPUs. - You didn't say how much? It was just like, just give them GPUs.

- Basically, yeah. Ron is known for writing these like one-liner emails that are like very short, but very to the point. And I think that's why like everyone responds to Ron. Everyone loves Ron. And so Jensen responds. He responds quickly, like tagging this VP of AI at NVIDIA. And we start working with NVIDIA, which is great.

And something that I love about NVIDIA, by the way, is that after that intro, we got matched with like a dedicated team. And at NVIDIA, they know that they're gonna win regardless. So they don't care where you get the GPUs from. They're like, they're truly neutral, unlike various sales reps that you might encounter at various like clouds and, you know, hardware companies, et cetera.

Like they actually just wanna help you 'cause, like, regardless, they know that if you're getting NVIDIA GPUs, they're still winning. So I guess that's a tip: if you're looking for GPUs, NVIDIA, yeah, they'll help you do it. - So like, so, okay, and then just to tie up this thing, because it, so first of all, that's a fantastic story.

And like, you know, I just wanted to let you tell that 'cause it's special. That is a strategic shift, right? That you already decided to make by the time you met Ron, which is we are going to have our own hardware. We're gonna rack them in a data center somewhere.

- Not even that we need our own hardware, 'cause actually we don't, but we just need GPUs, period. And like, every cloud has their own sales tactics, and like they wanna make you commit to long-term and like very non-flexible terms. And like, there's all these, there's a web of different things that you kind of have to navigate.

NVIDIA will kind of be to the point and be like, okay, you can do this on this cloud, this on this cloud. If like, this is your budget, maybe you wanna consider buying as well. Like they'll help you walk through what the options are. And the reason why they're helpful is 'cause like they look at the full picture.

So they'll help you with the hardware. And in terms of software, they actually implemented a custom feature for us in FasterTransformer, which is one of their libraries. - For you? - For us, yeah. Which is wild. Yeah, I don't think they would have done it otherwise. They implemented streaming generation for T5-based models.

Which we were running at the time, up until we switched to GPT in February, March of this year. So they implemented that just for us actually, in FasterTransformer. And so like, they'll help you look at the complete picture and then just help you get done what you need to get done.

- And I know one of your interests is also local models, open source models, and hardware kind of goes hand in hand with that. Any fun projects, explorations in the space that you wanna share with the local LLaMA crowd? - Yeah, so it's something that we're very interested in. Because something that we're hearing a lot about is like people want something like Phind, especially companies, but they wanna have it like within like their own sandbox.

They wanna have it like on hardware that they control. And so I'm super, super interested in how we can get big models to run efficiently on local hardware. And so like, Llama is great. llama.cpp is great. Very interested in like where the whole quantization thing is going. 'Cause like, obviously there are all these like great quantization libraries now that go to four bit, eight bit, but specifically int8 and int4.

- Which is the lowest it can go, right? - Right, but with int8, there's not necessarily a speed increase. It's just a storage optimization. Yeah, so we have these great quantization libraries that for the most part are able to get the size down with not that much quality loss.

But there is some, like the quantized models currently are actually worse than the non-quantized ones. And so I'm very curious if the future is something like what NVIDIA is doing with their implementation of FP8, which they're implementing in their Transformer Engine library. Where basically, once FP8 support is more widespread and hardware can support it efficiently, you can switch between the two different FP8 formats, one with greater precision and one with greater range, and then combine that with not doing FP8 on every layer, doing like a mixed precision with FP32 on some layers.

And like NVIDIA claims that this strategy that they're kind of demoing with the H100 has no degradation. And so it remains to be seen whether that is really true in practice, but that's something that we're excited about and whether that can be applied to like Macs and other hardware once they get FP8 support as well.
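For readers who want to poke at this themselves, here is a minimal sketch of loading a model with 4-bit (or 8-bit) quantized weights via Hugging Face transformers and bitsandbytes. The model name, quantization settings, and prompt are illustrative assumptions, not what Phind actually runs, and FP8 as discussed above would need different tooling (NVIDIA's Transformer Engine) plus supporting hardware.

```python
# Minimal sketch: run a causal LM locally with 4-bit quantized weights.
# Assumptions: model id is a placeholder; requires transformers, bitsandbytes,
# accelerate, and a GPU with enough memory for the quantized weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "codellama/CodeLlama-7b-hf"  # placeholder: any causal LM works here

# int8 mostly saves memory; int4 (nf4) trades a bit more quality for size.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # compute still happens in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

prompt = "Write a function that reverses a linked list in Python."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Swapping `load_in_4bit=True` for `load_in_8bit=True` gives the int8 variant, which, as noted above, is mostly a storage optimization rather than a guaranteed speedup.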

- Oh, we should also talk about hiring. But first, how do you get your info, right? Like, you seem to know a lot; you seem self-taught. - Yeah, so I've always just, well, I'm fortunate to have like a decent systems background from UT Austin and somewhat of a research background, even though like I didn't publish any papers, but like I went through all the motions.

Like I didn't publish the thesis that I wrote, mainly out of a lack of time, because I was doing both that and the startup at the same time. And then I graduated and then it was YC and then everything was kind of one after another. But like I'm very fortunate to kind of have like the systems and like a bit of like a research background.

But yeah, for the most part, outside of that foundation, like I've always just, whenever I've been interested in something, I just like, I go deep. - Give people tips, right? Like where do you, what fire hose do you drink from? - Yeah, exactly. So like whenever I see something that blows my mind, the way that that initial Hugging Face demo did, that was like the start of everything.

I'll just, yeah, I'll just like, I'll start from the beginning. If I don't know anything, then like I'll start by just trying to get a mental model of what is happening. Like first I need to understand the what, so I can understand the how and the why.

And once I can understand that, then I can make my own hypotheses about like, okay, here are the assumptions that the authors of this made. And here's why maybe they're correct, maybe they're wrong. And here's how like I can improve on it and iterate on it. And I guess that's the mindset that I approach it from is like once I understand something, like how can it be better?

How can it be faster? How can it be like more accurate? And so I guess for anyone starting now, like I would have used Phind. If I was starting now, 'cause like I would have loved to just have been able to say like, hey, like I have no idea what I'm doing.

Can you just like be this like technical research assistant and kind of hold my hand and like ask me clarifying questions and like help me formalize my assumptions along the way? I would have loved that. But yeah, I just kind of did that myself. - Yeah. Recording Looms of yourself using Phind actually would be pretty interesting.

- Yeah. - Because I think you would use Phind differently than people would by themselves. - I think so, yeah. - Unprompted. - I generally use Phind for everything, which is definitely, yeah. It's like, no, no, even like non-technical questions as well. 'Cause that's just something I'm curious about.

But that's generally like, that's less of a usage pattern nowadays. Like most people generally for the most part do technical questions on Phind. And that is completely understandable because of very deliberate decisions that we've made in how we've optimized the product. Like we've optimized the product very much in a quality-first manner, as opposed to a speed-first manner or some balance of the two.

So we're like, we have to run GPT-4 or some GPT-4 equivalent by default. And it has to give like a good answer to like a very demanding technical audience, or people will leave. So that's just the trade-off. So like sometimes it's slower for like simple questions, but like we did that on purpose, so.

- Awesome. Before we do a lightning round, a call for hiring: any roles you're looking for? What should people know about working at Phind? - Yeah. So we really straddle the line between product and research at Phind. Like for the past little while, a lot of the work that we've done has been solely product.

But we also do, especially now with the Phind model, a very particular kind of applied research, in trying to apply the very latest techniques, and techniques that have not even been proven yet, to training the very, very best model for our vertical. And the two go hand in hand, because the product, the UI, the UX is kind of model agnostic, but when it has a better kind of kernel, as Andrej Karpathy put it, plugged into it, it gets so much better.

So we're doing really kind of both at the same time. And so someone who like enjoys seeing both of those sides, like doing something very tangible that affects the user, high quality, reliable code that runs in production, but also having that chance to experiment with like building these models.

Yeah, we'd love to talk to you. - And the title is Applied AI Engineer. - I don't know what the title is. Like that is one title, but I don't know if like this really exists 'cause I feel like we're too rigid about like bucketing people into categories. - Yeah, founding engineer is fine.

- Yeah, well, we already have a founding engineer technically. - Well, for what it's worth, OpenAI is adopting Applied AI Engineer. - Really? - So it's becoming a thing. - It's becoming a thing. - All right. - We'll see. - We'll see. Lightning round. - Lightning round. - Yeah, we have two questions, acceleration, exploration, and then a takeaway.

So the acceleration one is what's something that already happened in AI that you thought would take much longer? - Yeah, the jump from these like models being glorified summarization models to actual powerful reasoning engines happened much faster than we thought. 'Cause like our product itself transitioned from being kind of, you know, this glorified summarization product to now like mostly a reasoning heavy product.

And we had no idea that this would happen this fast. Like we thought that like there'd be a lot more time and like many more things that needed to happen before we could do some level of like intelligent reasoning on a low level about people's code. But it's already happened and it happened much faster than we could have thought.

But I think that leads into your next point. - Which is exploration. - Exploration, yeah. - What do you think is the most interesting unsolved question in AI? - Yes. I think solving hallucinations, being able to guarantee that the answer will be correct is I think super interesting. And it's particularly relevant to us 'cause like we operate in a space where like everything needs to be correct.

Like the code, like not just the logic, but like the implementation, everything has to be completely correct. And there's a lot of very interesting work that's going on in this space. Some of it is approaching it from the angle of formal grammars. There's a very interesting paper that came out recently.

I forget where it came out of, but the paper is basically, you can define a grammar that restricts and modifies the model's... - Logprobs. - Exactly, like decoding strategy, to only conform to that grammar. And that helps it. - Is this LMQL? Because I feel like LMQL is a little bit too structured, if the goal is avoiding hallucination.

That's such a vague goal. - Yeah. - Yeah, I haven't seen it. - This is only something we've begun to take a look at. I haven't fully read the paper yet. Like I've only kind of skimmed the abstract, but it's something that like, we're definitely interested in exploring further.
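To make the idea concrete, here is a toy sketch of grammar-constrained decoding with Hugging Face transformers: a logits processor adds -inf to every token the "grammar" disallows before sampling, so the model can only emit allowed continuations. This is not the specific paper's method; the digit-only whitelist, the `WhitelistProcessor` name, and the `gpt2` placeholder model are all illustrative assumptions.

```python
# Toy grammar-constrained decoding: mask disallowed tokens at each step.
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          LogitsProcessor, LogitsProcessorList)

class WhitelistProcessor(LogitsProcessor):
    """Set the logits of every token outside the allowed set to -inf."""
    def __init__(self, allowed_token_ids):
        self.allowed = torch.tensor(sorted(allowed_token_ids), dtype=torch.long)

    def __call__(self, input_ids, scores):
        mask = torch.full_like(scores, float("-inf"))
        mask[:, self.allowed.to(scores.device)] = 0.0  # keep only allowed tokens
        return scores + mask

model_id = "gpt2"  # small placeholder model for the sketch
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Pretend the grammar only allows digit tokens plus end-of-sequence.
allowed = {tid for tok, tid in tokenizer.get_vocab().items() if tok.isdigit()}
allowed.add(tokenizer.eos_token_id)

inputs = tokenizer("The answer is", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=8,
    logits_processor=LogitsProcessorList([WhitelistProcessor(allowed)]),
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

A production version would advance a parser state (for a CFG, regex, or JSON schema) as tokens are emitted and recompute the allowed set at every step, which is roughly what grammar-constrained decoding libraries do.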

But something that we are like a bit further along on is also like exploring reinforcement learning for correctness, as opposed to only harmlessness, the way it has typically been used in the research. - We just did a CEO paper on that. - Yeah. - Just a quick follow-up. Do you have internal evals for what the hallucination rate is on stock GPT-4, and then maybe what yours is after fine-tuning?

- We, yeah. So we don't measure hallucination directly in our internal benchmarks. We more measure like was the answer right or was it wrong? We measure hallucination indirectly by evaluating the context, like the RAG context fed into the model as well. So basically, if the context was bad and the answer was bad, then chances are like, it's the context, but if the context was good, and it just like misinterpreted that or had the wrong conclusion, then like we can take different steps there.
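As a rough illustration of that attribution logic (not Phind's internal pipeline), one can grade the retrieved context and the final answer separately, then bucket each failure by its likely source. The judge functions below are naive containment checks standing in for what would realistically be LLM-as-judge calls; all names here are hypothetical.

```python
# Sketch of two-by-two error attribution for a RAG pipeline:
# (context sufficient? yes/no) x (answer correct? yes/no).
from dataclasses import dataclass

@dataclass
class EvalCase:
    question: str
    retrieved_context: str
    answer: str
    reference_answer: str

def context_is_sufficient(case: EvalCase) -> bool:
    """Hypothetical judge: does the context contain what's needed?"""
    return case.reference_answer.lower() in case.retrieved_context.lower()

def answer_is_correct(case: EvalCase) -> bool:
    """Hypothetical judge: containment match against a reference answer."""
    return case.reference_answer.lower() in case.answer.lower()

def attribute_error(case: EvalCase) -> str:
    ctx_ok, ans_ok = context_is_sufficient(case), answer_is_correct(case)
    if ans_ok:
        return "correct" if ctx_ok else "correct despite weak context"
    # Wrong answer: was it retrieval's fault or the model's?
    return "model error (misread good context)" if ctx_ok else "retrieval error (bad context)"

cases = [
    EvalCase("What port does Redis use by default?",
             "Redis listens on TCP port 6379 by default.",
             "Redis uses port 6379.", "6379"),
]
for c in cases:
    print(attribute_error(c))
```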

- Harrison from LangChain has been talking about this sort of two-by-two matrix with the RAG people. It's a pretty simple concept. What's the source of error? - Exactly. And I've been talking to Harrison actually about like a more structured way, perhaps within LangChain, to do evals.

'Cause I think that's a massive problem. Like every single eval is different for these big large language models, and doing them in a quantitative way is really hard, but it's possible with like a platform that I think harnesses GPT-4 in the right way. That, and also perhaps a stricter prompting language, like a prompting markup language for prompting models, is something I'm also very interested in.

'Cause we've written some very, very complex prompts, particularly for our VS Code extension, to like do like very fancy things with people's code. And like, I wish there was a way that you could have like a more formal way, like a Python for LLM prompting, where you could activate desired things within like the model's execution flow through some other abstraction above language that has been like tested to do that some of the time, perhaps like combined with like formal grammar limitations and stuff like that.

- Interesting. I have no idea what that looks like. - These are all things that have kind of emerged directly from the issues we're facing ourselves. But yeah, definitely very abstract so far. - Awesome. And yeah, just to wrap, what's one message, idea you want people to remember and think about?

- Yeah, I think pay attention to those moments that like really jump out at you. Like when you see like a crazy demo that you can't like forget about or like something that you just think is really, really cool. Yeah, don't let that go. 'Cause I see a lot of people trying to start startups from the angle of like, hey, I just wanna start a startup or I'm just like bored at my job or like I'm like generally interested in the space.

And I personally disagree with that. My take is that like, it's much easier, having been on both sides of that coin now, it's much easier to stay like obsessed every single day when the genesis of your startup is like something that really spoke to you in an incredibly meaningful way beyond just being kind of some insight that you've noticed.

And I guess that's, I think like what we're discovering now is that like in the long, long term, like what you're really building is like you're building a group of people that believe this thing, that believe that like the future of solving problems and making things will be just like focused more on like the human thought process as opposed to the implementation part.

And it's like, it's that belief that I think is what really gets you through the tough times and hopefully gets you to the other side someday. - Awesome. I kinda wanna play "Lose Yourself" as the outro music. - Then we'll get a DMCA strike. - That'd be great though. - Thank you so much for coming on.

- Yeah, thank you so much for having me. This was really fun. (upbeat music)