Back to Index

Is finetuning GPT4o worth it?


Chapters

0:00 Alistair and Cosine intro
11:34 GPT4o finetuning
15:18 Genie Data Mix
18:09 Customizing for Customers
20:37 Genie Workflow
22:41 Code Retrieval
30:20 Planning
37:29 Language Mix
38:46 Running Code
41:19 Finetuning with OpenAI
44:32 Synthetic Code Data
47:54 SynData in Llama 3
48:33 SWE-Bench Submission Process
53:20 Future Plans
54:36 Ecosystem Trends
55:55 Founder Lessons
57:58 CTA: Hiring & Customers

Transcript

(upbeat music) - Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO in Residence at Decibel Partners. And I'm joined by my co-host Swyx, founder of Smol.ai. - Hey, and today we're back in the studio, in person, after about three to four months in visa jail and travels and all other fun stuff that we talked about in the previous episode.

But today with a special guest, Alistair Pullen from Cosine. Welcome. - Hi, thanks for having me. - Very lucky to have you because you're on a two day trip to San Francisco. - Yeah, I wouldn't recommend it. I would not recommend it. Don't fly from London to San Francisco for two days.

- And you launched Genie on a plane, on plane WiFi, claiming state-of-the-art on SWE-Bench, which we're all gonna talk about. I'm excited to dive into your whole journey because it has been a journey. I've been lucky to be a small angel in part of that journey. And it's exciting to see that you're launching to such acclaim and such results.

So I'll go over your brief background and then you can fill in the blanks on what else people should know about you. You did your bachelor's in computer science at Exeter, and then you worked at a startup that got acquired into GoPuff. And round about 2022, you started working on a stealth startup that became a YC startup.

What's that overall story? - Yeah, so basically when I left university, I met my now co-founder, Sam. At the time, we were both mobile devs. He was an Android developer, I was an iOS developer. And whilst at university, we built this sort of small consultancy, sort of we'd be approached to build projects for people.

And we would just take them up and start with their student projects. They weren't anything crazy or anything big. We started with those. And over time, we started doing larger and larger projects, more interesting things. And actually when we left university, we just kept doing that. We didn't really get jobs, traditional jobs.

It was also like in the middle of COVID, middle of lockdown. So we were like, this is a pretty good gig. We'll just keep like writing code in our bedrooms. And we did that for a while. And then a friend of ours that we went to Exeter with started a YC startup during COVID.

And it was one of these fast grocery delivery companies. At the time, I was living in the deepest, darkest countryside in England, where fast grocery companies are still not a thing. So he sort of pitched me this idea and was like, listen, like I need an iOS dev, do you fancy coming along?

And I thought, absolutely. It was a chance to get out of my parents' house, chance to move to London, you know, do interesting things. And at the time, truthfully, I had no idea what YC was. I had no idea. I wasn't in the startup space. I knew I liked coding and building apps and stuff, but I'd never really done anything in that area.

So I said, yes, absolutely. I moved to London just sort of as COVID was ending and yeah, worked at what was Fancy for about a year and a half. Then we brought Sam along as well. So Sam and I were the two engineers at Fancy for basically its entire life.

And we built literally everything. So like the client mobile apps, the backends, the internal like stock management system, the driver routing algorithms, all those things, literally like everything. It was my first, you know, both of us were super inexperienced. We didn't have like proper engineering experience. There were definitely decisions we'd do differently now.

We'd definitely buy a lot of stuff off the shelf, stuff like that. But it was the initial dip of the toe into like the world of startups. And we were both like hooked immediately. We were like, this is so cool. This sounds so much better than all our friends who were like consultants and doing like normal jobs, right?

We did that and it ran its course. And after, I want to say 18 months or so, GoPuff came and acquired us. And there was obviously a transitionary period and integration period, like with all acquisitions. And we did that. And as soon as we'd vested what we wanted to vest and as soon as we thought, okay, this chapter is sort of done in about 2022, we left and we knew that we wanted to go alone and try something like we'd had this taste.

Now we knew we'd seen how like a YC startup was managed like up close. And we knew that we wanted to do something similar ourselves. We had no idea what it was at the time. We just knew we wanted to do something. So we tried some small projects in various different areas.

But then Sam talked to me about GPT-3. He'd seen it on Reddit. - The source of all knowledge. - The source of all knowledge, absolutely. Sam loves Reddit. I'd actually heard of GPT-2 and obviously had like loosely followed what OpenAI had done with, what was the game they trained a model to play?

- Dota. - Was it Dota, yeah. So I'd followed that and knew loosely what GPT-2 was. I knew what BERT was. So I was like, okay, this GPT-3 thing sounds interesting. And he just mentioned it to me on a walk. And I then went home and like Googled GPT-3 and there was the playground.

And the model was Davinci 2 at the time. And it was just the old school playground, completions, nothing crazy, no chat, no nothing. - I miss completions though. - Yeah, oh, completions. Honestly, I had this conversation at OpenAI's offices yesterday. I was like, I just, I know.

But yeah, so we, I started playing around with the playground and the first thing I ever wrote into it was like, hello world. And it gave me some sort of like fairly generic response back and I was like, okay, that looks pretty cool. The next thing was, I looked through the docs, where they had a lot of example prompts, 'cause I had no idea.

I didn't know if the, if you could put anything in, I didn't know if you had to structure in a certain way or whatever. And I saw that it could start writing like tables and JSON and stuff like that. So I was like, okay, can you write me something in JSON?

And it did. And I was like, oh wow, this is pretty cool. Can it just write arbitrary JSON for me? And immediately, as soon as I realized that, my mind was racing and I like got Sam in and we just started messing around in the playground, like fairly innocently to start with.

And then of course, both being mobile devs and also seeing, at that point, we'd learned about what the Codex model was. It was like, this thing's trained to write code. It sounds awesome. And Copilot was starting, I think. I can't actually remember if Copilot had come out yet or came later. Yeah, it might've done.

- It's round about the same time as Codex. - Round about the same time, yeah. And we were like, okay, as mobile devs, let's see what we can do. So the initial thing was like, okay, let's see if we can get this AI to build us a mobile app from scratch.

We eventually built the world's most flimsy system, which was back in the day, with like 4,000-token context windows, chaining prompts, trying to keep as much context from one to the other, all these different things, where essentially you'd put in an app idea in a box, and then we'd do like very high level stuff, figuring out what the stack should be, figuring out what the front end should be written in, back end should be written in, all these different things.

And then we'd go through like for each thing, more and more levels of detail until the point that you actually got Codex to write the code for each thing. And we didn't do any templating or anything. We were like, no, we're gonna write all the code from scratch every time, which is basically why it barely worked.
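As an aside, the loop described above roughly amounts to hierarchical prompt chaining: decompose the app idea, then each layer, then generate code per component, carrying only a summary forward because of the small context window. Here is a minimal sketch of that shape, using today's OpenAI client as a stand-in for the old Codex completions; the prompts, model name, and structure are illustrative, not the original system:

```python
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    # Each step is its own completion; only the previous step's output is
    # carried forward, since the original context window was ~4,000 tokens.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in; the original system used Codex completions
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

idea = "a to-do list app with user accounts"
stack = ask(f"Choose a frontend and backend stack for: {idea}. Be brief.")
components = ask(f"Given this stack:\n{stack}\nList the main components to build, one per line.")
for component in components.splitlines():
    if component.strip():
        code = ask(f"Stack:\n{stack}\nWrite the code for this component: {component}")
        print(f"### {component}\n{code}\n")
```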

But there were like occasions where you could put in something and it would build something that did actually run, the back end would run, the database would work. And we were like, oh my God, this is insane. This is so cool. And that's what we showed to our co-founder, Yang.

I met my co-founder, Yang, through Fancy, 'cause his wife was their first employee. And we showed him, and he was like, you've discovered fire, what is this? Like, this is insane. He has a lot more startup experience. Historically, he's had a few exits in the past and has been through all different industries.

He's like our dad, he's a bit older. He hates me saying that, but he's a bit older. - He's your COO now? - He's our COO, yeah. And we showed him and he was like, this is absolutely amazing, let's just do something. 'Cause he, at the time, was just about to have a child, so he didn't have anything going on either.

So we applied to YC, got an interview. The interview was, as most YC interviews are, short, curt, and pretty brutal. They told us they hated the idea. They didn't think it would work. And that's when we started brainstorming. It was almost like the interview was like an office hours kind of thing.

And we were like, okay, given what you know about the space now and how to build things with these LLMs, what can you bring out of what you've learned in building that thing into something that might be a bit more useful to people on the daily? And also, YC obviously likes B2B startups a little bit more, at least at the time they did back then.

So we were like, okay, maybe we could build something that helps you with existing codebases, like can sort of automate development stuff with existing codebases, not knowing at all what that would look like or how you would build it or any of these things. And they were like, yeah, that sounds interesting.

You should probably go ahead and do that. You're in, you've got two weeks to build us an MVP. And we were like, okay, okay. We did our best. The MVP was absolutely horrendous. It was a CLI tool, it sucked. And at the time we were like, we don't even know how to build what we want to build.

And we didn't really know what we wanted to build, to be honest. Like we knew we wanted to try to help automate dev work, but back then we just didn't know enough about how LLM apps were built, the intricacies and all those things. And also like the LLMs themselves, like 4,000 tokens, you're not going very far.

They're extremely expensive. So we ended up building a code-based retrieval tool originally. Our thought process originally was, we want to build something that can do our jobs for us. That is like the gold star, we know that. We've seen like there are glimpses of it happening with our initial demo that we did, but we don't see the path of how to do that at the moment.

Like the tech just wasn't there. So we were like, well, there are going to be some things that you need to build this when the tech does catch up. So retrieval being one of the most important things, like the model's going to have to build like pull code out of a code base somehow.

So we were like, well, let's just build the tooling around it. And eventually when the tech comes, then we'll be able to just like plug it into our tooling and then it should work basically. And to be fair, that's basically what we've done. And that's basically what's happened, which is very fortunate.

But in the meantime, whilst we were waiting for everything to sort of become available, we built this code-based retrieval tool. That was the first thing we ever launched when we were in YC, and it didn't work. It was really frustrating for us 'cause it was just me and Sam like working like all hours trying to get this thing to work.

It was quite a big task in and of itself, trying to get like a good semantic search engine working that could run locally on your machine. We were trying to avoid sending code to the cloud as much as possible. And then for very large code bases, you're like, you know, millions of lines of code.

You're trying to do some sort of like local HNSW thing that runs inside your VS Code instance that like eats all your RAM as you've seen in the past, all those different things. - Yep. - Yeah. - My first call with you, I think I had trouble. - You were like, "Yeah, it sucks, man." I was like, "Yeah, I know, I know, I know it sucks.

I'm sorry." But building all that stuff was essentially the first six to eight months of what at the time was built. - Which by the way, "Bildt." - "Bildt," yeah, it was a terrible, terrible name. - It was the worst part of trying to think about whether I would invest is whether or not people could pronounce it.

- No, so when we went on our first ever YC like retreat, no one got the name right. They were like, "Buildt, Built, what?" And then we actually changed the name to Cosine. Although some people would spell it as if you're cosigning for an apartment or something. Like, you can't win.

Yeah, that was what Buildt was back then. But the ambition, and I did a talk on this back in the end of 2022, the ambition to like build something that essentially automated our jobs was still very much like core to what we were doing. But for a very long time, it was just never apparent to us like, how would you go about doing these things?

Even when like you had 3.5, 16K, 16K suddenly felt huge 'cause you've gone from four to 16, but even then 16K is like, a lot of Python files are longer than 16K. So you can't, you know, before you even start doing a completion, even then we were like, "Eh, yeah, it looks like we're still waiting." And then like towards the end of last year, you then start, you see 32K, 32K was really smart.

It was really expensive, but also like, you could fit a decent amount of stuff in it. 32K felt enormous. And then finally 128K came along and we were like, "Right, this is like, this is what we can actually deal with because fundamentally to build a product like this, you need to get as much information in front of the model as possible and make sure that everything it ever writes in output can be traced back to something in the context window so it's not hallucinating it." As soon as that model existed, I was like, "Okay, I know that this is now gonna be feasible in some way." We'd done early sort of dev work on Genie using 3.5, 16K.

And that was a very, very like crude way of proving that this loop that we were after and the way we were generating the data actually had signal and worked and could do something. But the model itself was not useful because you couldn't ever fit enough information into it for it to be able to do the task competently and also the base intelligence of the model.

I mean, 3.5, anyone who's used 3.5 knows the base intelligence of the model is lacking, especially when you're asking it to do software engineering, which is quite involved. So we saw the 128K context model and at that point we'd been in touch with OpenAI about our ambitions and like how we wanted to build it.

Essentially, I just took a punt. I was like, "I'm just gonna ask to see, can we like train this thing?" 'Cause at the time GPT-4 Turbo had just come out and back then there was still a decent amount of lag time between like OpenAI releasing a model and then allowing you to fine tune it in some way.

They've gotten much better about that recently. Like GPT-4o fine tuning came out, I think, a day after, and GPT-4o mini fine tuning came out like the day after the model did. And I know that's something they're definitely like optimizing for super heavily inside, which is great to see. - Which is a little bit, for a year or so, YC companies had like a direct Slack channel to OpenAI.

- We still do. - Yeah. - Yeah. - So it's a little bit of that diminishing of the YC advantage there. - Yeah. - If they're releasing this fine tuning ability like a day after. - Yeah, no, no, absolutely. But like you can't build a startup on the YC advantage.

It's obviously nice, it makes you feel warm and fuzzy inside but like at the end of the day, it's not that that's gonna make you win. - Yeah. So like we'd spoken to Shyamal there, that DevRel guy, I'm sure you know him. - I think he's head of solutions or something.

- He is in their applied team, yeah. We'd been talking to him from the very beginning when we got into YC and he's been absolutely fantastic throughout. I basically had pitched him this idea back when we were doing it on 3.5, 16K. And I was like, this is my crazy thesis.

I wanna see if this can work. And as soon as like that 128K model came out, I started like laying the groundwork. I was like, I know this definitely isn't possible 'cause he released it like yesterday, but know that I want it. And in the interim, like GPT-4, like 8K fine tuning came out.

We tried that, it's obviously even fewer tokens, but the intelligence helped. And I was like, if we can marry the intelligence and the context window length, then we're gonna have something special. And eventually we were able to get on the experimental access program and we got access to GPT-4 Turbo fine tuning.

As soon as we did that, because in the entire run up to that, we'd built the data pipeline. We already had all that set up. So we were like, right, we have the data. Now we have the model. Let's put it through and iterate essentially. And that's where like Genie as we know it today really was born.
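For context, kicking off a fine-tune through OpenAI's public API looks roughly like the sketch below: upload a JSONL of chat-formatted examples, then create a job. The file name and model are illustrative, and the long-context, larger-adapter access discussed in this episode was gated experimental access rather than this standard self-serve flow:

```python
from openai import OpenAI

client = OpenAI()

# Each line of the JSONL is one training example, e.g.
# {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
training_file = client.files.create(
    file=open("genie_train.jsonl", "rb"),  # hypothetical file name
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-2024-08-06",  # any fine-tunable base model
)
print(job.id, job.status)
```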

I won't pretend like the first version of Genie that we trained was good. It was a disaster. That's where you realize all the implicit biases in your data set. And you realize that, oh, actually this decision you made that was fairly arbitrary was the wrong one. You have to do it a different way.

Other subtle things like, you know, how you write Git diffs and you're using LLMs and how you can best optimize that to make sure they actually apply and work and loads of different little edge cases. But as soon as we had access to the underlying tool, we were like, right, we can actually do this.

And I was, I breathed a sigh of relief 'cause I didn't know it was like, it wasn't a done deal, but I knew that we could build something useful. I mean, I knew that we could build something that would be measurably good on whatever eval at the time that you wanted to use.

Like at the time, back then, we weren't actually that familiar with SWE-Bench. But once Devin came out and they announced their SWE-Bench score, like that's when my life took a turn. - Challenge accepted. - Yeah, challenge accepted. And that's where like, yes, that's where my friendships have gone. My sleep has gone, my weight, everything.

Got into SWE-Bench and yeah, it was actually a very useful tool in building Genie 'cause beforehand it was like, "Yes, vibe check this thing and see if it's useful." And then all of a sudden you have an actual measure to see like, could it do software engineering? Not the best measure, obviously, but it's the best that we've got now.

We would just iterate and build. And eventually we got it to the point where it is now, and a little bit beyond, since we actually got that score a couple of weeks ago. And yeah, it's been a hell of a journey from the beginning all the way to now.

That was a very rambling answer to your question about how we got here, but that's essentially a potted answer to how we got here. - Got the full origin story. - Yeah, no, totally. You mentioned bias in the data and some of these things. In your announcement video, you called Genie the world's first AI software engineering colleague.

And you kind of highlighted how the data needed to train it needs to show how a human engineer works. I think maybe you're contrasting that to just putting code in it. There's kind of like a lot more than code that goes into software engineering. How do you think about the data mixture?

You know, and like there's this kind of known truth that code makes models better when you put it in the pre-training data. But since we put so much in the pre-training data, what else do you add when you train Genie? - Yeah, I think that sort of boils down fundamentally to the difference between a model writing code and a model doing software engineering.

Because the software engineering sort of discipline goes wider, because if you look at something like a PR, that is obviously an artifact of some thought and some work that has happened and has eventually been squashed into some diffs, right? What the, very crudely, what the pre-trained models are reading is they're reading those final diffs and they're emulating that and they're being able to output it, right?

But of course, it's a super lossy thing, a PR. You have no idea why or how, for the most part, unless there are some comments, which, you know, anyone who's worked in a company realizes PR reviews can be a bit dodgy at times. But you see that you lose so much information at the end.

And that's perfectly fine because PRs aren't designed to be something that perfectly preserves everything that happened. But what we realized was if you want something that's a software engineer, and very crudely, we started with something that can do PRs for you, essentially, you need to be able to figure out why those things happened.

Otherwise, you're just gonna rely, essentially, you just have a code writing model. You have something that's good at HumanEval, but not very good at SWE-Bench, essentially. That realization was part of the kernel of the idea of the approach that we took to design the agent that is Genie.

The way that we decided we want to try to extract what happened in the past, like as forensically as possible, has been and is currently like one of the main things that we focus all our time on. Because doing that, getting as much signal out as possible, doing that as well as possible, is the biggest thing that we've seen that determines how well we do on that benchmark at the end of the day.

Once you've sorted things out, like output structure, how to get it consistently writing diffs, and all the stuff that is sort of ancillary to the model actually figuring out how to solve a problem, the core bit of solving the problem is how did the human solve this problem? And how can we best come up with how the human solved these problems?

So all the effort went in on that pipeline. And the mix that we ended up with was, as you've probably seen in the technical report and so on, all of those different languages and different combinations of different task types, all of that has run through that pipeline and we've extracted all that information out.

- How does that differ when you work with customers that have private workflows? Like, do you think, is there usually a big delta between what you get in open source and maybe public data versus like-- - Yeah, yeah, yeah. When you scrape enough of it, most of open source is updating readmes and docs.

It's hilarious, like we had to filter out so much of that stuff because when we first did the 3.5, 16K model, like the amount of readme updating that went in, we did like no data cleaning, no real like, we just sort of threw it in and saw what happened.

And it was just like, it was really good at updating readmes, really good at writing some comments, really good at complaining in Git reviews, in PR reviews rather. And it was, again, like we didn't clean the data. So you'd like give it some feedback and it would just like reply and like, it would just be quite insubordinate when it was getting back to you like, no, I don't think you're right.

And it would just sort of argue with you. So the process of doing all that was super interesting 'cause we realized from the beginning, okay, there's a huge amount of work that needs to go into like cleaning this, getting it aligned with what we want the model to do to be able to get the model to be useful in some way.

- I'm curious, like, how do you think about the customer willingness to share all of this historical data? I've done a lot of developer tools investing in my career and getting access to the code base is always one of the hard things. Are people getting more cautious about sharing this information?

In the past, it was maybe like, you know, you're using static analysis tool, like whatever else you need to plug into the code base, fine. Now you're building a model based on it. Like, what's the discussion going into these companies? Are most people comfortable with like letting you see how to work and sharing everything or?

- It depends on the sector mostly. We've actually seen, I'd say, people becoming more amenable to the idea over time, actually, rather than more skeptical, 'cause I think they can see the upside. If this thing does what they say it does, it's gonna be more help to us than it is a risk to our infosec.

And of course, like companies building in this space, we're all gonna end up, you know, complying with the same rules and there are gonna be new rules that come out to make sure that we're looking at your code, that everything is safe and so on. So from what we've seen so far, we've spoken to some very large companies that you've definitely heard of and all of them obviously have stipulations and many of them want it to be sandboxed to start with and all the like very obvious things that I, you know, I would say as well.

But they're all super keen to have a go and see because like, despite all those things, if we can genuinely make them go faster, allow them to build more in a given time period and stuff, it's super worth it to them. - Okay, I'm gonna dive in a little bit on the process that you have created.

You showed the demo on your video and by the time that we release this, you should be taking people off the wait list and launching people so people can see this themselves. There's four main parts of the workflow, which is finding files, planning action, writing code and running tests.

And controversially, you have set yourself apart from the Devins of the world by saying that things like having access to a browser is not that important for you. Is that an accurate reading of what you wrote? - I don't remember saying that, but at least with what we've seen, the browser is helpful, but it's not as helpful as like, ragging the correct files, if that makes sense.

Like, it is still helpful, but obviously there are more fundamental things you have to get right before you get to like, oh yeah, you can read some docs or you can read a stack overflow article and stuff like that. - Yeah, the phrase I was indexing on was the other software tools are wrappers around foundational models with a few additional tools, such as a web browser or code interpreter.

- Oh, I see. No, I mean, no, I'm deriding the approach there, not the tools. - Yeah, exactly. So like, I would say in my standard model of what a code agent should look like, Devon has been very influential, obviously, because you could just add the docs of something and now I have, now when I'm installing a new library, I can just add docs.

Cursor also does this, right? And then obviously having a code interpreter does help. I guess you have that in the form of running tests. - I mean, Genie has both of those tools available to it as well. So yeah, yeah, yeah. So we have a tool where you can like put in URLs and it will just read the URLs and it also uses Perplexity's API under the hood as well to be able to actually ask questions if it wants to.

- Okay. - So now we use both of those tools as well. Like those tools are super important and super key. I think obviously the most important tools to these agents are like being able to retrieve code from a code base, being able to read Stack Overflow articles and what have you and just be able to essentially be able to Google like we do is definitely super useful.

- Yeah. I thought maybe we could just kind of dive into each of those actions. Code retrieval, one of the core problems, you had an indexer that you've worked on even back at Buildt. What makes it hard? What approaches did you think would work but didn't? Anything like that. - It's funny, I had a similar conversation to this when I was chatting to the guys from OpenAI yesterday.

The thing is that searching for code, specifically semantically, at least to start with, I mean like keyword search and stuff like that is a solved problem, it's been around for ages, but at least being able to, the phrase we always used back in the day was searching for what code does rather than what code is, like searching for functionality is really hard, really hard.

The way that we approached that problem was that obviously like a very basic and easy approach is right, let's just embed the code base, we'll chunk it up in some arbitrary way, maybe using an AST, maybe using number of lines, maybe using whatever, like some overlapping, just chunk it up and embed it.

And once you've done that, I will write a query saying like, find me some authentication code or something, embed it, and then do the cosine similarity and get the top K, right? That doesn't work, and I wish it did work, don't get me wrong. It doesn't work well at all because fundamentally, if you think about it, semantically how code looks is very different to how English looks, and there's like not a huge amount of signal that's carried between the two.

So what we ended up, the first approach we took and that kind of did well enough for a long time was, okay, let's train a model to be able to take in English code queries and then produce a hypothetical code snippet that might look like the answer, embed that, and then do the cosine similarity.

And that process, although very simple, gets you so much more performance out of the retrieval accuracy. And that was kind of like the start of our engine, as we called it, which is essentially like the aggregation of all these different heuristics, like semantic, keyword, LSP, and so on. And then we essentially had like a model that would, given an input, choose which ones it thought were most appropriate given the type of requests you had.
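Here is a minimal sketch of that "hypothetical code snippet" trick (often called HyDE-style retrieval): instead of embedding the English query directly, generate code that might answer it and embed that. The model names, embedding model, and chunk store are stand-ins, not Cosine's engine:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def hypothetical_snippet(query: str) -> str:
    # Generate plausible code for the query so we compare code to code.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Write a plausible code snippet that would implement: {query}"}],
    )
    return resp.choices[0].message.content

def search(query: str, chunks: list[str], chunk_vecs: np.ndarray, k: int = 5) -> list[str]:
    q = embed(hypothetical_snippet(query))  # embed the hypothetical code, not the English
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(-sims)[:k]]
```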

So the whole code search thing was a really hard problem. And actually what we ended up doing with Genie is we let the model through self-play figure out how to retrieve code. So actually we don't use our engine for Genie. So instead of like a request coming in and then like say GPT-4 with some JSON output being like, well, I think here we should use a keyword with these inputs and then we should use semantic and then we should like pick these results.

It's actually like a question comes in and Genie has self-played in its training data to be able to be like, okay, this is how I'm going to approach finding this information. Much more akin to how a developer would do it. 'Cause if I was like, Sean, go into this new code base you've never seen before and find me the code that does this, you're gonna probably, you might do some keywords.

You're gonna look over the file system. You're gonna try to figure out from the directories and the file names where it might be. You're gonna like jump in one and then once you're in there, you're probably gonna be doing the go to definition stuff to like jump from file to file and try to use the graph to like get closer and closer.

And that is exactly what Genie does. Starts on the file system, looks at the file system, picks some candidate files. Is this what I'm looking for, yes or no? If there's something that's interesting, like an import or something, it can command click on that thing, go to definition, go to references and so on.

And it can traverse the code base that way. - Are you using the VS Code LSP or? - No, that's no, we're not doing this in VS Code. We're just using the language servers running. But we really wanted to try to mimic the way we do it as best as possible.

And we did that during the self-play process when we were generating the data set. So although we did all that work originally, and although like Genie still has access to these tools, so it can do keyword searches and it can do basic semantic searches and it can use the graph.

It uses them through this process and figures out, okay, I've learned from data how to find stuff in code bases, and I think in our technical report, I can't remember the exact number, but I think it was around 65 or 66% retrieval accuracy overall, measured against the lines we know need to be found for the task to actually be able to be completed.

And we found about 66% of all those lines, which is one of the biggest areas of free performance that we can get hold of because when we were building Genie truthfully, like a lot more focus went on assuming you found the right information, you've been able to reproduce the issue, assuming that's true, how do you then go about solving it?

And the bulk of the work we did was on the solving. But when you go higher up the funnel, obviously like the funnel looks like, have you found everything you need for the task? Are you able to reproduce the problem that's seen in the issue? Are you then able to solve it?

And the funnel gets narrower as you go down. And at the top of the funnel, of course, is RAG. So I'm actually quite happy with that score. I think it's still pretty impressive considering the size of some of the code bases we're using for this. But if that number becomes 80, think how many more tasks we'd get right. That's one of the key areas we're gonna focus on when we continue working on Genie.

That's one of the key areas we're gonna focus on when we continue working on Genie. - Be interesting to break out a benchmark just for that. - Yeah, I mean, it's super easy. - 'Cause I don't know what state of the art is. - Yeah, I mean, like for a, it's super easy 'cause like for a given PR, you know what lines are edited.

- Oh, okay. - Yeah, you know what lines are edited. - So you can just, you can source it from SWE-Bench actually. - Yeah, you can do it super easily. And that's how we got that figure out at the other end. For us, being able to see it against our historic models was super useful.
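As an aside, here is a rough sketch of how such a retrieval-accuracy number could be computed from what's described here: take the lines touched by the gold patch as the target set and measure what fraction of them the retrieval step surfaced. The diff bookkeeping is simplified and the data plumbing is illustrative:

```python
import re

def gold_lines_from_patch(patch: str) -> set[tuple[str, int]]:
    """Return (file, new_line_number) pairs added or changed by a unified diff (roughly)."""
    gold, current_file, lineno = set(), None, 0
    for line in patch.splitlines():
        if line.startswith("+++ b/"):
            current_file = line[6:]
        elif line.startswith("@@"):
            m = re.search(r"\+(\d+)", line)
            lineno = int(m.group(1)) if m else 0
        elif line.startswith("+") and not line.startswith("+++"):
            gold.add((current_file, lineno))
            lineno += 1
        elif not line.startswith("-"):
            lineno += 1  # context line in the new file
    return gold

def retrieval_recall(retrieved: set[tuple[str, int]], patch: str) -> float:
    gold = gold_lines_from_patch(patch)
    return len(gold & retrieved) / len(gold) if gold else 1.0
```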

So we could see if we were, you know, actually helping ourselves or not. And initially, one of the biggest performance gains that we saw when we did work on the rag a bit was giving it the ability to use the LSP to like go to definition and really try to get it to emulate how we do that.

Because I'm sure when you go into an editor where like the LSP is not working or whatever, you suddenly feel really like disarmed and naked. You're like, oh my God, I didn't realize how much I actually use this to get about rather than just find stuff. So we really tried to get it to do that.

And that gave us a big jump in performance. So we went from like 54% up to like the 60s, but just by adding, focusing on that. - That's one weird trick. - Yes. - I'll briefly comment here. So this is the standard approach I would say most code tooling startups are pursuing.

The one company that's not doing this is magic.dev. - Yes. - So would you do things differently if you have a 10 million token context window? - If I had a 10 million context window and hundreds of millions of dollars, I wouldn't have gone and built, it's an LTM, it's not a transformer they're using, right?

If I'm not mistaken, I believe it's not a transformer. - Yeah. - Eric's gonna come on at some point. - I'm just, listen, they obviously know a lot more about their product than I do. I don't know a great deal about how magic works. - Nobody knows anything yet.

- Yeah, so I'm not gonna speculate. Would I do it the same way as them? I like the way we've done it because fundamentally, like we focus on the act of software engineering and what that looks like. And showing models how to do that. Fundamentally, the underlying model that we use is kind of null to us.

Like so long as it's the best one, I don't mind. And the context windows we've already seen, like you can get transformers to have like million, one and a half million token context windows. And that works perfectly well. So like as soon as you can fine tune Gemini 1.5, then you best be sure that Genie will run on Gemini 1.5 and like we'll probably get very good performance out of that.

I like our approach 'cause we can be super agile and be like, "Oh, well, Anthropic have just released "whatever and it might have half a million tokens "and it might be really smart." And I can just immediately take my JSONL file and just dump it in there and suddenly Genie works on there and it can do all the new things.

- Does Anthropic have the same fine tuning support as OpenAI? I actually haven't heard anyone do it. - They are working on it. They are partnered with AWS and it's gonna be in Bedrock. As far as I know, I think that's true. - Cool. We have to keep moving on to the other segments.

Planning. The second piece of your four-step grandmaster plan. That is the frontier right now. A lot of people are talking about Strawberry, Q*, whatever that is. Monte Carlo Tree Search. Is current state-of-the-art planning good enough? What prompts have worked? I don't even know what questions to ask. Like, what is the state of planning?

- I think it's fairly obvious that with the foundational models, like you can ask them to think step by step and ask them to plan and stuff, but that isn't enough because if you look at how those models score on these benchmarks, then they're not even close to state-of-the-art.

- Which ones are you referencing? - So like just SWE-Bench and so on, right? And like even the things that get really good scores on HumanEval, agents as well, 'cause they have these loops, right? Obviously these things can reason, quote unquote, but the reasoning is constrained by the model's intelligence, I'd say, very crudely.

And what we essentially wanted to do was we still thought, obviously reasoning is super important. We need it to get the performance we have, but we wanted the reasoning to emulate how we think about problems when we're solving them, as opposed to how a model thinks about a problem when we're solving it.

And that's obviously part of like the derivation pipeline that we have when we design our data. But the reasoning that the models do right now, and who knows what Q*, whatever it ends up being called, looks like, but certainly what I'm excited, on a small tangent to that, like what I'm really excited about is when models like that come out, obviously the signal in my data, when I regenerate it, goes up.

And then I can then train that model that's already better at reasoning with improved reasoning data and just like I can keep bootstrapping and keep leapfrogging every single time. And that is like super exciting to me 'cause I welcome like new models so much because immediately it just floats me up without having to do much work, which is always nice.

But at the state of reasoning generally, I don't see it going away anytime soon. I mean, that's like an autoregressive model doesn't think per se. And in the absence of having any thought, maybe an energy-based model or something like that, maybe that's what Q* is, who knows, some sort of like high level abstract space where thought happens before tokens get produced.

In the absence of that for the moment, I think it's all we have and it's gonna have to be the way it works. For what happens in the future, we'll have to see, but I think certainly it's never going to hinder performance to do it. And certainly the reasoning that we see Genie do when you compare it to like, if you ask GPT-4 to break down step-by-step and approach for the same problem, at least just on a vibe check alone looks far better.

- Two elements that I like that I didn't see in your initial video, we'll see when this Genie launches, is a planner chat, which is I can modify the plan while it's executing. And then the other thing is playbooks, which also from Devin, where here's how I like to do a thing and I'll use Markdown to specify how I do it.

I'm just curious if like, you know, those things help. - Yeah, no, absolutely. We're a hundred percent. We want everything to be editable, not least because it's really frustrating when it's not. Like if you're ever in a situation where like there's the one thing I just wish I could, and you'd be right if that one thing was right and you can't change it.

So we're going to make everything editable, including the code it writes. Like you can, if it makes a small error in a patch, you can just change it yourself and let it continue and it will be fine. So yeah, like those things are super important. We'll be doing those too.

- I'm curious, once you get to writing code, is most of the job done? I feel like the models are so good at writing code when they're like in small chunks that are like very well-instructed. What's kind of the drop off in the funnel? Like once you get to like, you got the right files and you got the right plan.

- That's a great question because by the time this is out, there'll be another blog post. Yeah, there'll be another blog post, which contains all the learnings that I delivered to OpenAI's fine-tuning team when we finally got the score. - Oh, that's good. Go for it, it's already out.

- Yeah, I don't have it on my phone, but basically I broke down the log probs. I basically got the average log prob for a token at every token position in the context window. So imagine an X-axis from zero to 128K, and then the average log prob for each index in there.
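A minimal sketch of that analysis: bucket each token's log probability by its absolute position in the context window and average per bucket, which gives the 0-to-128K curve being described. How the per-token logprobs are obtained upstream is assumed here:

```python
from collections import defaultdict

def avg_logprob_by_position(examples: list[list[float]], bucket: int = 1024) -> dict[int, float]:
    # examples: one list of per-token logprobs per training/eval sample (assumed input)
    sums, counts = defaultdict(float), defaultdict(int)
    for token_logprobs in examples:
        for pos, lp in enumerate(token_logprobs):
            b = pos // bucket
            sums[b] += lp
            counts[b] += 1
    # Keyed by bucket start position, e.g. {0: -0.8, 1024: -0.9, ...}
    return {b * bucket: sums[b] / counts[b] for b in sorted(sums)}
```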

As we discussed, like the way Genie works normally is, you know, at the beginning you do your rag and then you do your planning and then you do your coding and that sort of cycle continues. The certainty of code writing is so much more certain than every other aspect of Genie's loop.

So whatever's going on under the hood, the model is really comfortable with writing code. There is no doubt and it's like in the token probabilities. One slightly different thing, I think, to how most of these models work is, at least for the most part, if you ask GPT-4 in ChatGPT to edit some code for you, it's going to rewrite the entire snippet for you with the changes in place.

We train Genie to write diffs and, you know, essentially patches, right? Because it's more token efficient and that is also fundamentally, we don't write patches as humans, but it's like the result of what we do is a patch, right? When Genie writes code, I don't know how much it's leaning on the pre-training like code writing corpus, because obviously it's just read code files there.

It's obviously probably read a lot of patches, but I would wager it's probably read more code files than it has patches. So it's probably leaning on a different part of its brain is my speculation. I have no proof for this. So I think the discipline of writing code is slightly different, but certainly is its most comfortable state when it's writing code.
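To make the diff-versus-rewrite point concrete, this is the kind of patch format being talked about; difflib is only used here to show the shape of a unified diff, it is not Cosine's patch representation:

```python
import difflib

old = ["def greet(name):\n", "    return 'hi ' + name\n"]
new = ["def greet(name):\n", "    return f'hello, {name}'\n"]

# A patch only encodes the change, which is far more token efficient than
# rewriting the whole file.
patch = "".join(difflib.unified_diff(old, new, fromfile="a/greet.py", tofile="b/greet.py"))
print(patch)
# --- a/greet.py
# +++ b/greet.py
# @@ -1,2 +1,2 @@
#  def greet(name):
# -    return 'hi ' + name
# +    return f'hello, {name}'
```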

So once you get to that point, so long as you're not too deep into the context window, another thing that I'll bring up in that blog post is the performance of Genie over the length of the context window. It degrades fairly linearly. So I actually broke it down by probability of solving a SWE-Bench issue, given the number of tokens in the context window.

At 60K, it's basically 0.5. So if you go over 60K in context length, you are more likely to fail than you are to succeed, just based on the amount of tokens you have in the context window. And when I presented that to the fine tuning team at OpenAI, that was super interesting to them as well.

And that is more of a foundational model attribute than it is an us attribute. However the attention mechanism works in GPT-4, however, you know, they deal with the context window, at that point is, you know, influencing how Genie is able to perform. Even though obviously all our training data is perfect, right, so even if stuff is being solved at 110,000 tokens, sort of that area, the training data still shows it being solved there, it's just that in practice, the model is finding it much harder to solve stuff down that end of the context window.

- Does that scale with the context? So for a 200K context size, is 100K tokens like the 0.5 point? - I don't know. - Yeah, yeah, yeah. - Yeah, but I hope not. I hope you don't just take the context length and halve it and then say, "Oh, this is the usable context length." But what's been interesting is knowing that, actually really digging into the data, looking at the log probs, looking at how it performs over the entire window, it's influenced the short-term improvements we've made to Genie since we got that score.

So we actually made some small optimizations to try to make sure as best we can without overdoing it, trying to make sure that we can artificially make sure stuff sits within that sort of range because we know that's our sort of battle zone. And if we go outside of that, we're starting to push the limits, we're more likely to fail.

So just doing that sort of analysis has been super useful without actually messing with anything more structural and getting more performance out of it. - What about different languages? So in your technical report, the data mix is 21% JavaScript, 21% Python, 14% TypeScript, 14% TSX. - Which is JavaScript, JavaScript, JavaScript.

- Yeah, yeah, yeah. - Yes, yeah, yeah, that's true. - It's like 29% JavaScript. - That's true, that's true. Although TypeScript is so much superior, but anyway. - Do you see, how good is it at just generalizing? If you're writing Rust or C++ or whatever else, it's quite different?

- It's pretty good at generalizing. Obviously, though, I think there's 15 languages in that technical report, I think, that we've covered. The ones that we picked in the highest mix were the ones that, selfishly, we internally use the most, and also that are, I'd argue, some of the most popular ones.

When we have more resource as a company and more time, and once all the craziness that has just happened sort of dies down a bit, we are going to work on that mix. I'd love to see everything ideally be represented in a similar level as it is. If you took GitHub as a data set, if you took how are the languages broken down in terms of popularity, that would be my ideal data mix to start.

It's just that it's not cheap doing this. So, yeah, trying to have an equal amount of Ruby and Rust and all these different things at our current state is not really what we're looking for. - There's a lot of good Ruby in my GitHub profile. You can have it all.

- Well, okay, perfect, we'll just train on that. - For running tests, it sounds easy, but it isn't, especially when you're working in enterprise codebases that are kind of very hard to spin up. How do you set that up? It's like, how do you make a model actually understand how to run a codebase, which is different than writing code for a codebase?

- The model itself is not in charge of setting up the codebase and running it. So Genie sits on top of GitHub, and if you have CI running on GitHub, you have GitHub Actions and stuff like that, then Genie essentially makes a call out to that, runs your CI, sees the outputs, and then moves on.

Making a model itself set up a repo wasn't scoped in what we wanted Genie to be able to do, because for the most part, at least most enterprises have some sort of CI pipeline running, and a lot of, if you're doing some, even a lot of hobbyist software development has some sort of basic CI running as well.

And that was the lowest hanging fruit approach that we took. So when Genie ships, the way it will run its own code is it will basically run your CI, and it will take the, I'm not in charge of writing this, the rest of the team is, but I think it's the Checks API on GitHub that allows you to grab that information and throw it in the context window.
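A rough sketch of that CI hand-off: after the model's commit is pushed, fetch the check runs for the ref from GitHub's Checks API and feed the conclusions and summaries back into the context window. The repo, branch, and token handling are illustrative:

```python
import os
import requests

def check_runs(owner: str, repo: str, ref: str) -> list[dict]:
    # GET /repos/{owner}/{repo}/commits/{ref}/check-runs
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/commits/{ref}/check-runs",
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["check_runs"]

# Hypothetical repo and branch names; the name/conclusion/summary of each run
# is the kind of output that would go back into the model's context.
for run in check_runs("someorg", "somerepo", "genie-work-branch"):
    print(run["name"], run.get("conclusion"), run["output"].get("summary"))
```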

- What's the handoff like with the person? So Genie, you give it a task, and then how long are you supposed to supervise it for? Or are you just waiting for the checks to eventually run, and then you see how it goes? Like, what does it feel like? - There are a couple of modes that it can run in.

Essentially, it can run in a fully headless autonomous mode. So say you assign it a ticket in Linear or something, then it won't ask you for anything. It will just go ahead and try. Or if you're in the GUI on the website and you're using it, then you can give it a task, and it might choose to ask you a clarifying question.

So if you ask it something super broad, it might just come back to you and say, what does that actually mean? Or can you point me in the right direction for this? Because our decision internally was it's gonna piss people off way more if it just goes off and makes a completely ruined attempt at it, because it just, from day one, got the wrong idea.

So it can ask you a lot of questions. And once it's going, much like a regular PR, you can leave review comments, issue comments, all these different things. And it, because it's been trained to be a software engineering colleague, responds in actually a better way than a real colleague, because it's less snarky and less high and mighty.

And also the amount of filtering you have to do for LGTMs. When you train a model to be a software engineer, essentially, it's like, it could just do anything. It's like, yeah, it looks good to me, bro. - Sure. (laughs) I just wanted to dive in a little bit more on your experience with the fine-tuning team.

John Allard was publicly very supportive in his commentary and, you know, was part of it. Like, what is it like working with them? I also picked up that you initially started to fine-tune what was publicly available, the 16 to 32K range. You got access to do more than that. You've also trained on billions of tokens instead of the usual millions range. Just like, take us through that fine-tuning journey and any advice that you may have.

You've also trained on billions of tokens instead of the usual millions range. Just like, take us through that fine-tuning journey and any advice that you may have. - It's been so cool. And this will be public by the time this goes out. Like, OpenAI themselves have said, we are pushing the boundaries of what is possible with fine-tuning.

Like, we are right on the edge. And like, we are working, genuinely working with them in figuring out how stuff works, what works, what doesn't work, because no one's doing, no one else is doing what we're doing. They have found what we've been working on super interesting, which is why they've allowed us to do so much, like, interesting stuff.

Working with John, I mean, I had a really good conversation with John yesterday. We had a little brainstorm after the video we shot. And one of the things, you mentioned the billions of tokens. One of the things we've noticed, and it's actually a very interesting problem for them as well, when you're building like a self-serve fine-tuning API, they have to decide how big your PEFT adapter, your LoRA adapter, is going to be in some way.

And like, figuring that out is actually a really interesting problem. Because if you make it too big, because they support data sets that are so small, you can put like 20 examples through it or something like that. Like, if you had a really sparse, large adapter, you're not going to get any signal in that at all.

So they have to dynamically size these things. And there is an upper bound. And actually, we use models that are larger than what's publicly available. It's not even publicly available yet, but when this goes out, it will be. But we have larger LoRA adapters available to us, just because of the amount of data that we're pumping through it.

And at that point, you start seeing really interesting other things, like you have to change your learning rate schedule and do all these different things that you don't have to do when you're on the smaller end of things. So working with that team is such a privilege, because obviously they're like at the top of their field in the fine-tuning space.
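The adapter-sizing trade-off is easier to see with open-source LoRA tooling: the rank bounds how much the adapter can absorb, so tiny datasets want small ranks while billions of tokens justify larger ones (and different learning-rate schedules). This peft sketch is only an analogy for the idea; it says nothing about how OpenAI actually sizes its adapters, and the base model is a stand-in:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # stand-in base model
config = LoraConfig(
    r=64,                 # adapter rank: more capacity, but needs much more data to get signal
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()
```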

So as we learn stuff, they're learning stuff. And one of the things that I think really catalyzed this relationship is when we first started working on Genie, like I delivered them a presentation, which will eventually become the blog post that you'll love to read soon. The information I gave them there, I think is what showed them like, "Oh, wow, okay, these guys are really like pushing the boundaries of what we can do here." And truthfully, our data set, we view our data set right now as very small.

It's like the minimum that we're able to afford, literally afford right now to be able to produce a product like this. And it's only gonna get bigger. So yesterday while I was in their offices, I was basically, so we were planning, we were like, okay, how, this is where we're going in the next six to 12 months.

Like we're putting our foot on the gas here, 'cause this clearly works. Like I've demonstrated this is a good, you know, the best approach so far. And I wanna see where it can go. I wanna see what the scaling was like for the data. And at the moment, like it's hard to figure that out because you don't know when you're running into like saturating a PEFT adapter, as opposed to actually like, is this the model's limit?

Like, where is that? So finding all that stuff out is the work we're actively doing with them. And yeah, it's gonna get more and more collaborative over the next few weeks as we explore like larger adapters, pre-training extension, different things like that. - Awesome. I also wanted to talk briefly about the synthetic data process.

One of your core insights was that the vast majority of the time, the code that is published by a human is in a working state. And actually you need to fine tune on non-working code. - Yes. - So just, yeah, take us through that inspiration. How many rounds did you do?

- Yeah, I mean, it might be generous to say that the vast majority of code is in a working state. I don't know if I believe that. - Yeah, I don't know if I believe that. - I was like, that's very nice of you to say that my code works.

- Certainly, it's not true for me. No, I think that, so yeah, no, but it was, you're right. It's an interesting problem. And what we saw was when we didn't do that, obviously you have to basically like one-shot the answer. 'Cause after that it's like, well, I've never seen iteration before.

How am I supposed to figure out how this works? So what you're alluding to there is like the self-improvement loop that we started working on. And that was in sort of two parts. We synthetically generated runtime errors where we would intentionally mess with the AST to make stuff not work or index out of bounds or refer to a variable that doesn't exist or errors that the foundational models just make sometimes that you can't really avoid.

You can't expect it to be perfect. So we threw some of those in with a probability of happening. And on the self-improvement side, I spoke about this in the blog post, essentially the idea is that you generate your data in sort of batches. First batch is like perfect, like one example, like here's the problem, here's the answer, go train the model on it.

And then for the second batch, you then take the model that you trained before that can look like one commit into the future. And then you let it have the first attempt at solving the problem. And hopefully it gets it wrong. And if it gets it wrong, then you have like, okay, now the code base is in this incorrect state, but I know what the correct state is.

So I can do some diffing essentially to figure out how do I get the state that it's in now to the state that I want it in. And then you can train the model to then produce that diff next and so on and so on and so on. So the model can then learn and also reason as to why it needs to make these changes to be able to learn how to like learn, like solve problems iteratively and learn from its mistakes and stuff like that.
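Here is a minimal sketch of both ideas, with the obvious caveat that this is illustrative and not Cosine's pipeline: (1) corrupt working code at the AST level so training examples include broken states, and (2) diff the model's wrong attempt against the known-good solution so the next round of data contains the corrective patch:

```python
import ast
import difflib
import random

class BreakSomething(ast.NodeTransformer):
    """Randomly rename a variable reference so the code raises NameError at runtime."""
    def visit_Name(self, node: ast.Name) -> ast.Name:
        if isinstance(node.ctx, ast.Load) and random.random() < 0.1:
            node.id = node.id + "_undefined"
        return node

def corrupt(source: str) -> str:
    # Synthetic runtime-error injection: parse, perturb the AST, unparse.
    tree = BreakSomething().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)

def corrective_patch(model_attempt: str, known_good: str) -> str:
    # This diff becomes the next-round training target: "given this wrong
    # state, produce the patch that reaches the correct state".
    return "".join(difflib.unified_diff(
        model_attempt.splitlines(keepends=True),
        known_good.splitlines(keepends=True),
        fromfile="a/attempt.py", tofile="b/solution.py",
    ))
```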

- And you pick the size of the data set just based on how much money you could spend generating it. Maybe you think you could just make more and get better. - Multiples of my monthly burn, let's not dwell on how much we spent doing this. Yeah, basically it was very much related to, yeah, just like capital.

And yes, with any luck that will be alleviated soon. - Very soon. I like drawing references to other things that are happening in the wild. So, 'cause we only get to release this podcast once a week, the Llama 3 paper also had some really interesting thoughts on synthetic data for code.

I don't know if you have reviewed that. I'll highlight the back translation section because one of your data set focuses is updating documentation. I think that translation between natural language, English versus code and back and forth, I think is actually a really ripe source of synthetic data. And Llama 3 specifically called out that they trained on that.

We should have gone more into that in our podcast with them, but we didn't know. But there's a lot of interesting work on synthetic data stuff. We do have to wrap up soon, but I'm going to briefly touch on the submission process for SWE-Bench. So, you have a 30% state-of-the-art SWE-Bench result, but it's not on the leaderboard because of submission issues.

I don't know if you want to comment on that stuff versus, we also want to talk about SWE-Bench Verified. Yeah, just anything on the benchmarking side. - The potted history of this is quite simple actually. SWE-Bench up until, I want to say two weeks ago, but it might be less than that or more than that.

But I think two weeks ago, they suddenly started mandating what they call trajectories when you submit. But prior to this, essentially when you run SWE-Bench, you run it through their harness, and out the other end you get a report.json, which is like, here's how many I resolved, here's how many I didn't resolve.

These are the IDs of the ones I did resolve, and these are the IDs of the ones I didn't. And it flags any that might have errored or something like that. And what you would submit would be all of your model patches that you outputted and that report. And then you would PR that into the SWE-Bench repo and that would be it.

That was still the case when we made our submission on whatever day it was. They look at them every Monday. We submitted it at some point during the week. I want to say it was four days before that. And I sort of like sat back and waited. I assumed it would be fine.

When it came to Monday, they then said, actually, no, we want model trajectories. And I was like, okay, let me see what this is. And so I sort of dug into it. And model trajectories are essentially the context window, or like the reasoning process, of show your working.

How did you get here? If you do a math exam, show me your working. Whereas before they were like, just give me the final answer. Now they want to see the working, which I completely understand. Like, SWE-Bench fundamentally is an academic research project, and they want all the stuff to be open source and public so people can learn from each other and improve and so on.

That's very good. I completely agree. However, at least for us, and the reason that we're not on the leaderboard is that obviously the model outputs that we generate are sort of a mirror of our training data set, right? Like you train the model to do a certain thing and output a certain way.

Whatever you output looks like your training data. For the moment, as a closed source company, like, fighting for an edge, we've decided not to publish that information for that exact reason. I don't want someone basically taking my trajectories and then taking a model that's soon to be GA and just distilling it immediately and then having Genie for themselves.

And, you know, as a business owner, that's the decision I've had to make. The patches are still public. So the, dare I say, traditional SWE-Bench submission, you can go to our GitHub repo and see it, run the patches for yourself and verify that the numbers come out correctly.

Like that is all, that is the potted reason. - That's the story. - That's the story. - SWE-Bench Verified? You have a score? - I do have a score. I do have a score, 43.8%. It's one of those things where there aren't that many people on the leaderboard yet.

So you don't know how good or bad that is. - It's a smaller data set, right? - Oh, it's great. So on a tangent, the original SWE-Bench was 2,294 instances. - Which is expensive. It's like $8,000 to run. - Oh, that's cheap. - That's cheap? What are you talking about? - I don't know.

At least for us, I don't even want to say publicly how much it cost us to run that thing. Expensive, slow, really crap for iteration, because, you know, you make a change to your model, how does it do on SWE-Bench? I guess that's why SWE-Bench Lite existed, but SWE-Bench Lite was just the easy stuff, right?

It wasn't a comprehensive measure of the overall thing. So we actually had the idea a month ago to build what we were going to call SWE-Bench Small, where we were going to try to map out the distribution of problem difficulty across SWE-Bench and all these different things, and try to come up with like 300 examples that mapped that distribution, where given a score on SWE-Bench Small, you could then predict your full SWE-Bench score and sort of go from there.
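As a rough illustration of what that sampling could have looked like, here's a small sketch that draws roughly 300 instances whose difficulty mix mirrors the full benchmark. The difficulty label is an assumption you'd have to estimate yourself (patch size, files touched, historical solve rates), and this is not how OpenAI actually built SWE-Bench Verified.

```python
import random
from collections import defaultdict


def sample_small_benchmark(instances, n_total=300, seed=0):
    """Proportionally sample instances so the subset's difficulty distribution
    mirrors the full benchmark's.

    instances: list of dicts like {"instance_id": ..., "difficulty": ...};
    the "difficulty" field is assumed to have been estimated beforehand.
    """
    rng = random.Random(seed)
    by_bucket = defaultdict(list)
    for inst in instances:
        by_bucket[inst["difficulty"]].append(inst)

    total = len(instances)
    subset = []
    for members in by_bucket.values():
        k = round(n_total * len(members) / total)  # proportional allocation
        subset.extend(rng.sample(members, min(k, len(members))))
    return subset
```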

Fortunately, OpenAI did that for us, and probably much better than we would have done. They used some human labelers, and as we're obviously working with OpenAI quite closely, they talked to us about it and were able to let us know what the instance IDs were in the new SWE-Bench version.

And then as soon as I had that, I could just take the report from the one that I'd run and just diff them. And I was like, oh, we got 219 out of 500, which is 43.8%, which is to my knowledge, at least right now, state-of-the-art also, which makes sense.
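That re-scoring step is easy to reproduce for any existing run. Here's a minimal sketch, with the caveat that the report.json shape and the verified ID list file are assumptions made for illustration, not the official harness format.

```python
import json


def verified_score(report_path: str, verified_ids_path: str) -> float:
    """Recompute a score on the Verified subset from a full SWE-Bench run.

    Assumes report_path points at a JSON file with a "resolved" list of
    instance IDs, and verified_ids_path is a newline-separated list of the
    500 Verified IDs. Both shapes are illustrative assumptions.
    """
    with open(report_path) as f:
        resolved = set(json.load(f)["resolved"])
    with open(verified_ids_path) as f:
        verified = {line.strip() for line in f if line.strip()}

    hits = resolved & verified
    return len(hits) / len(verified)  # e.g. 219 / 500 = 0.438
```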

But also GPT-4o gets, I believe, 33%, which is like, I double-checked that, but I believe-- - The August one, the new one. - Yeah, it's in their blog post. I can't remember which one it was. I don't know what the model version was, but GPT-4o, I believe, gets 33%, which is obviously significantly better than what it got on the original SWE-Bench.

- 2%. - Yeah, yeah, yeah, exactly, exactly. - That's ridiculously low. - But no, SWE-Bench Verified, like, it's so good. It's smaller, we know that the problems are solvable, it's not gonna cost me a lot of money to run it, and it keeps my iteration time, you know, lower.

And there are also some things that we're gonna start to do internally when we run SWE-Bench to have more of an idea of how right our model is. So one of the things I was talking to John about yesterday was that SWE-Bench is pass or fail, right? Like you either have solved the problem or you haven't.

That is quite sparse. It doesn't give you a huge amount of information, 'cause your model could have got a lot of it right. Like when you do a math paper, you could have got your working right up until the penultimate step and then got it wrong.

So we're gonna look into ways of measuring, okay, well, your model got it right up to this line and then it diverged. And that's super easy to do, because obviously you know the correct state of all of those questions. So I think one of the ways we're gonna keep improving Genie is by going more in depth and saying, okay, for the ones that failed, was it right at any point? Where did it go wrong? How did it go wrong? And then sort of trying to triage those sorts of issues.
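One crude way to get that kind of partial-credit signal is sketched below: measure how many leading lines of the model's patch match the known-correct patch before they first diverge. It's only a proxy for the line-by-line state comparison being described, and the patch-prefix idea is a simplification, not Cosine's actual metric.

```python
def divergence_point(model_patch: str, gold_patch: str) -> tuple[int, float]:
    """Return (matching prefix length, fraction of the gold patch covered).

    A coarse stand-in for "your model got it right up to this line and then
    diverged": compare the model's patch against the known-correct patch
    line by line until they first disagree.
    """
    model_lines = model_patch.splitlines()
    gold_lines = gold_patch.splitlines()

    prefix = 0
    for m, g in zip(model_lines, gold_lines):
        if m != g:
            break
        prefix += 1

    coverage = prefix / max(len(gold_lines), 1)
    return prefix, coverage
```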

- So future plans, you've mentioned possibly moving to an open source model. But basically I think, you know, what Genie is, is basically this proprietary fine-tuning data set and process and software that you can add onto any model.

Is that the plan? That's the next year? It's gonna just be doing that? - We're gonna get really, we're gonna be the best in the world at doing that and continue being the best in the world at doing that and throwing in as many models as we can, seeing what the performance is like and seeing what things improve performance in what places.

And also making the data set larger is like one of the biggest things we're gonna be working on. - I think one of the decisions before you as a CEO is how much you have like the house model be like the one true thing. And then how much you spend time working on customer models.

- That's the thing that really gets me so excited. Genuinely, like we have a version of Genie that we named after one of our employees. (all laughing) It's called the John. We have a version of Genie that is fine-tuned on our code base. So basically, it's the base Genie, and then we run the same data pipeline that we used to generate the main data set, but on our own repo.

And then all of a sudden you have something that is both very good at software engineering but is also extremely good at your repo. And that is phenomenal to use. Like it's really cool. - More broadly, outside of Cosine, what are you seeing? What trends are you seeing that you're really excited by?

Who's doing great work that you wanna call out? - One of the ones that, I mean, it's not an original choice, but Cursor are absolutely killing it. All the employees at Cosine love using it. And it's a really, really good example of just getting the UX right, basically.

Like putting the LLM in the right place, letting it help you, and getting out of the way when you don't want it there, and making it familiar 'cause it's still VS Code, and all these things. They've, yeah, they've done an amazing job. And I think they just raised a round.

So congrats on that to them. So like they're doing amazing work. - The decision to fork VS Code, I think, was controversial. You guys started as a VS Code extension. - We did, yeah. - Many, many, many people did that. And they did the one thing that no one wanted to do.

- I commend the bravery, honestly. Like I commend the bravery. 'Cause like in hindsight, obviously it's paid off. But at least for me in the moment, I was one of those people being like, is that gonna, are people gonna do that? Are people gonna download that? And yes, obviously they are.

Like, sure, do the hard thing. Having worked on Genie for the past eight months or whatever, as taxing as it's been on us, one of the main things I have learned from this is, no matter how small you are, how much resource you have, just try to do the hard thing.

'Cause I think it has the biggest payoff. - More broadly, just like lessons that you've learned running your company. - Oh. - It's been a two year journey. - Two year journey. I mean, it's better than any real job we could ever get. Like, I feel so lucky to be working in this area.

Like, especially, you know, it was so validating to hear it from the guys at OpenAI as well, telling us, like, we're on the cutting edge, we're pushing the boundaries of what's possible with what we're doing. Because, like, I get to be paid to do this.

You know, I have briefly, as you heard at the beginning, done real jobs and normal stuff. And just being able to do this on the daily, it's so interesting and so cool. I pinch myself a lot, genuinely, about the fact that I can do this. And also that, not only can I do this, but fortunately, being a co-founder of the company, I have a huge amount of say as to where we go next.

And that is a big responsibility, but it's also so exciting to me. 'Cause I'm like, you know, steering the ship has been really interesting so far. And I like to think that we've got it right, you know, in the last sort of eight months or so. And that this is like, really the starting point of something massive to come.

- Awesome. Calls to action. I assume you're hiring. I assume you're also looking for customers. What's the ideal customer, ideal employee? - On the customer side, honestly, people who are just willing to try something new, like the Genie UX is different to a conventional IDE. Give it a chance.

Like, we really do believe in this whole idea that developers' work is going to be abstracted, you know, levels higher than just the code. We still let you touch the code. We still want you to dive into the code if you need to. But fundamentally we think that if you're trying to offload the coding to a model, the model should do the coding and you should be in charge of guiding the model.

So people who are willing to give something new a chance. And honestly, preferably people using the languages that are most represented in our training data, so if you're doing TypeScript, JavaScript, Python, Java, that sort of thing. And in terms of size of company, so long as you're willing to try it and there aren't any massive infosec things that get in the way, it doesn't really matter.

Like code base size can be arbitrary for us. We can deal with any code base size and essentially any language, but your mileage may vary. But for the most part, anyone who's willing to give it a try is the ideal customer. And on the employee side, honestly, we're gonna be hiring both on what we call the traditional tech side.

So like building the product, essentially, and also hiring really heavily on the AI, machine learning and data set side as well. And in both cases, essentially what we want are really passionate people who are obsessed with something and are willing to, it sounds so corny, but join us in what we're trying to do.

Like we have a very big ambition and we're biting off a very large problem here. And people who can look at what we've done so far and be like, wow, that's really impressive. I want to do that kind of work. I want to be pushing the boundaries. I want to be dealing with experimental stuff all the time.

But at the same time, you're putting it in people's hands and shipping it to people and so on. So if that sounds, you know, amenable to anyone, that's the kind of person we're looking for, so please apply. - Excellent. Any last words? Any Trump impressions that you... (laughs) Did you like the Trump impression?

- Yeah, everyone loved the Trump impression. - Yeah, I mean, it's funny 'cause I have some bloopers. I'll show you the bloopers after we finish recording. I'll probably tweet them at some point. The initial cut of that video had me doing a Trump impression. I sort of sat down in the chair and was like, Cosine is the most tremendous AI lab in the world.

Unbelievable. I walked in here and I said, well, this is an amazing lab. And like, we sent it to some of our friends. They were like, nah, you can't cold open with Trump, man. You just can't. Like, no one knows who you are. - You can end with it.

- But you can end with it. Now that that has gone out, we can now post the rest of the bloopers, which are essentially me just like fluffing my lines the entire time and screaming at my co-founder out of frustration. So, yeah. - Well, it was very well executed.

Actually, very few people do the kind of video that you did. I'm, as a sort of developer relations person, I'm actually excited by that stuff, but well, thank you for coming on. Very, very short notice. I hope you have a safe flight back and excited to see the full launch.

I think this is a super fruitful area and congrats on your launch. - Thank you so much for having me. Cheers. (upbeat music)