
Is finetuning GPT4o worth it?


Chapters

0:00 Alistair and Cosine intro
11:34 GPT4o finetuning
15:18 Genie Data Mix
18:09 Customizing for Customers
20:37 Genie Workflow
22:41 Code Retrieval
30:20 Planning
37:29 Language Mix
38:46 Running Code
41:19 Finetuning with OpenAI
44:32 Synthetic Code Data
47:54 SynData in Llama 3
48:33 SWE-Bench Submission Process
53:20 Future Plans
54:36 Ecosystem Trends
55:55 Founder Lessons
57:58 CTA: Hiring & Customers

Whisper Transcript

00:00:00.000 | (upbeat music)
00:00:02.580 | - Hey everyone, welcome to the Latent Space Podcast.
00:00:06.880 | This is Alessio, partner and CTO
00:00:08.600 | in Residence at Decibel Partners.
00:00:10.040 | And I'm joined by my co-host Swyx, founder of Smol.ai.
00:00:13.040 | - Hey, and today we're back in the studio, in person,
00:00:17.200 | after about three to four months in visa jail and travels
00:00:21.000 | and all other fun stuff that we talked about
00:00:23.080 | in the previous episode.
00:00:24.440 | But today with special guest, Alistair Pullen from Cosine.
00:00:27.640 | Welcome.
00:00:28.480 | - Hi, thanks for having me.
00:00:29.320 | - Very lucky to have you
00:00:30.960 | because you're on a two day trip to San Francisco.
00:00:33.000 | - Yeah, I wouldn't recommend it.
00:00:33.840 | I would not recommend it.
00:00:35.160 | Don't fly from London to San Francisco for two days.
00:00:37.080 | - And you launched Genie on a plane, on plane WiFi,
00:00:41.560 | claiming state-of-the-art on SWE-Bench,
00:00:44.000 | which we're all gonna talk about.
00:00:45.400 | I'm excited to dive into your whole journey
00:00:47.800 | because it has been a journey.
00:00:48.920 | I've been lucky to be a small angel in part of that journey.
00:00:52.760 | And it's exciting to see that you're launching
00:00:54.800 | to such acclaim and such results.
00:00:58.680 | So I'll go over your brief background
00:01:00.680 | and then you can fill in the blanks
00:01:02.000 | on what else people should know about you.
00:01:03.960 | You did your bachelor's in computer science in Exeter,
00:01:07.160 | and then you worked at a startup
00:01:08.640 | that got acquired into GoPuff.
00:01:10.360 | And roundabout 2022, you started working on a stealth startup
00:01:14.560 | that became a YC startup.
00:01:15.880 | What's that overall story?
00:01:17.040 | - Yeah, so basically when I left university,
00:01:20.160 | I met my now co-founder, Sam.
00:01:23.080 | At the time, we were both mobile devs.
00:01:24.960 | He was an Android developer, I was an iOS developer.
00:01:27.200 | And whilst at university,
00:01:29.200 | we built this sort of small consultancy,
00:01:31.960 | sort of we'd be approached to build projects for people.
00:01:35.240 | And we would just take them up
00:01:37.080 | and start with their student projects.
00:01:38.440 | They weren't anything crazy or anything big.
00:01:40.240 | We started with those.
00:01:41.360 | And over time, we started doing larger and larger projects,
00:01:44.080 | more interesting things.
00:01:45.520 | And actually when we left university,
00:01:47.240 | we just kept doing that.
00:01:48.960 | We didn't really get jobs, traditional jobs.
00:01:51.920 | It was also like in the middle of COVID,
00:01:53.600 | middle of lockdown.
00:01:54.600 | So we were like, this is a pretty good gig.
00:01:56.040 | We'll just keep like writing code in our bedrooms.
00:01:58.040 | And we did that for a while.
00:02:00.320 | And then a friend of ours that we went to Exeter with
00:02:03.720 | started a YC startup during COVID.
00:02:07.360 | And it was one of these fast grocery delivery companies.
00:02:10.560 | At the time, I was living
00:02:11.920 | in the deepest, darkest countryside in England,
00:02:14.200 | where fast grocery companies are still not a thing.
00:02:16.600 | So he sort of pitched me this idea and was like,
00:02:19.000 | listen, like I need an iOS dev, do you fancy coming along?
00:02:22.320 | And I thought, absolutely.
00:02:23.560 | It was a chance to get out of my parents' house,
00:02:24.920 | chance to move to London, you know, do interesting things.
00:02:27.560 | And at the time, truthfully, I had no idea what YC was.
00:02:29.960 | I had no idea.
00:02:31.040 | I wasn't in the startup space.
00:02:32.560 | I knew I liked coding and building apps and stuff,
00:02:35.000 | but I'd never really done anything in that area.
00:02:38.360 | So I said, yes, absolutely.
00:02:39.960 | I moved to London just sort of as COVID was ending
00:02:43.240 | and yeah, worked at what was Fancy
00:02:46.160 | for about a year and a half.
00:02:47.520 | Then we brought Sam along as well.
00:02:49.720 | So Sam and I were the two engineers at Fancy
00:02:52.200 | for basically its entire life.
00:02:53.920 | And we built literally everything.
00:02:56.000 | So like the client mobile apps, the backends,
00:02:59.880 | the internal like stock management system,
00:03:02.680 | the driver routing algorithms,
00:03:04.920 | all those things, literally like everything.
00:03:07.120 | It was my first, you know,
00:03:09.000 | both of us were super inexperienced.
00:03:10.320 | We didn't have like proper engineering experience.
00:03:11.880 | There were definitely decisions we'd do differently now.
00:03:13.920 | We'd definitely buy a lot of stuff off the shelf,
00:03:15.640 | stuff like that.
00:03:16.480 | But it was the initial dip of the toe
00:03:19.760 | into like the world of startups.
00:03:21.720 | And we were both like hooked immediately.
00:03:23.320 | We were like, this is so cool.
00:03:24.520 | This sounds so much better than all our friends
00:03:26.080 | who were like consultants and doing like normal jobs, right?
00:03:28.720 | We did that and it ran its course.
00:03:30.400 | And after, I want to say 18 months or so,
00:03:32.640 | GoPuff came and acquired us.
00:03:34.600 | And there was obviously a transitionary period
00:03:36.240 | and integration period, like with all acquisitions.
00:03:38.320 | And we did that.
00:03:39.760 | And as soon as we'd vested what we wanted to vest
00:03:42.360 | and as soon as we thought, okay,
00:03:43.600 | this chapter is sort of done in about 2022,
00:03:46.880 | we left and we knew that we wanted to go alone
00:03:49.560 | and try something like we'd had this taste.
00:03:51.360 | Now we knew we'd seen how like a YC startup
00:03:53.880 | was managed like up close.
00:03:55.440 | And we knew that we wanted to do something similar ourselves.
00:03:57.600 | We had no idea what it was at the time.
00:03:59.800 | We just knew we wanted to do something.
00:04:01.120 | So we tried some small projects in various different areas.
00:04:05.440 | But then Sam talked to me about GPT-3.
00:04:09.400 | He'd seen it on Reddit.
00:04:10.640 | - The source of all knowledge.
00:04:12.080 | - The source of all knowledge, absolutely.
00:04:13.640 | Sam loves Reddit.
00:04:14.560 | I'd actually heard of GPT-2
00:04:17.120 | and obviously had like loosely followed
00:04:18.680 | what OpenAI had done with,
00:04:20.880 | what was the game they trained a model to play?
00:04:23.080 | - Dota.
00:04:23.920 | - Was it Dota, yeah.
00:04:24.920 | So I'd followed that and knew loosely what GPT-2 was.
00:04:29.200 | I knew what BERT was.
00:04:30.040 | So I was like, okay, this GPT-3 thing sounds interesting.
00:04:32.240 | And he just mentioned it to me on a walk.
00:04:34.320 | And I then went home and like Googled GPT-3
00:04:38.000 | and there was the playground.
00:04:38.960 | It was the, and the model was DaVinci 2 at the time.
00:04:41.840 | And it was just the old school playground, completions,
00:04:44.800 | nothing crazy, no chat, no nothing.
00:04:46.880 | - I miss completions though.
00:04:48.160 | - Yeah, oh, completions.
00:04:49.000 | Honestly, I had this conversation at OpenAI's yesterday.
00:04:51.160 | I was like, I just, I know.
00:04:53.200 | But yeah, so we,
00:04:55.360 | I started playing around with the playground
00:04:58.080 | and the first thing I ever wrote into it
00:05:00.280 | was like, hello world.
00:05:01.320 | And it gave me some sort of like fairly generic response
00:05:03.520 | back and I was like, okay, that looks pretty cool.
00:05:05.440 | The next thing was, I looked through the docs
00:05:08.680 | or they had a lot of example prompts
00:05:10.240 | 'cause I had no idea.
00:05:11.120 | I didn't know if the, if you could put anything in,
00:05:13.560 | I didn't know if you had to structure in a certain way
00:05:15.120 | or whatever.
00:05:15.960 | And I saw that it could start writing like tables
00:05:18.080 | and JSON and stuff like that.
00:05:19.600 | So I was like, okay,
00:05:20.440 | can you write me something in JSON?
00:05:21.640 | And it did.
00:05:22.680 | And I was like, oh wow, this is pretty cool.
00:05:25.680 | Can it just write arbitrary JSON for me?
00:05:28.240 | And immediately, as soon as I realized that,
00:05:31.120 | my mind was racing and I like got Sam in
00:05:34.040 | and we just started messing around in the playground,
00:05:37.080 | like fairly innocently to start with.
00:05:39.080 | And then of course, both being mobile devs
00:05:41.440 | and also seeing, at that point,
00:05:43.080 | we'd learned about what the Codex model was.
00:05:45.480 | It was like, this thing's trained to write code.
00:05:47.040 | It sounds awesome.
00:05:48.280 | And Copilot was start,
00:05:49.520 | I think, I can't actually remember if Copilot
00:05:51.520 | had come out later.
00:05:52.360 | Yeah, it might've done.
00:05:53.800 | - It's round about the same time as Codex.
00:05:54.640 | - Round about the same time, yeah.
00:05:56.040 | And we were like, okay, as mobile devs,
00:05:58.120 | let's see what we can do.
00:05:59.040 | So the initial thing was like, okay,
00:06:00.880 | let's see if we can get this AI
00:06:03.920 | to build us a mobile app from scratch.
00:06:06.280 | We eventually built the world's most flimsy system,
00:06:10.120 | which was back in the day,
00:06:11.120 | we're like 4,000 token context windows,
00:06:12.640 | like chaining prompts,
00:06:13.760 | trying to keep as much context from one to the other,
00:06:16.680 | all these different things,
00:06:17.560 | where essentially you'd put in an app idea in a box,
00:06:20.280 | and then we'd do like very high level stuff,
00:06:23.080 | figuring out what the stack should be,
00:06:24.400 | figuring out what the front end should be written in,
00:06:27.200 | back end should be written in,
00:06:28.080 | all these different things.
00:06:29.040 | And then we'd go through like for each thing,
00:06:32.240 | more and more levels of detail
00:06:33.800 | until the point that you actually got Codex
00:06:36.000 | to write the code for each thing.
00:06:37.920 | And we didn't do any templating or anything.
00:06:39.520 | We were like, no, we're gonna write all the code
00:06:40.800 | from scratch every time,
00:06:41.760 | which is basically why it barely worked.
00:06:44.080 | But there were like occasions
00:06:45.720 | where you could put in something
00:06:46.960 | and it would build something that did actually run,
00:06:49.640 | the back end would run, the database would work.
00:06:51.480 | And we were like, oh my God, this is insane.
00:06:53.280 | This is so cool.
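
For a sense of what that prompt-chaining setup looked like in practice, here is a minimal sketch under stated assumptions: it uses today's OpenAI chat completions client rather than the Codex-era completions API, and the stages, prompts, and model name are illustrative, not the original system's.

```python
# Sketch of the prompt-chaining idea described above: break an app idea into stages
# and carry forward only a compressed summary, because a ~4K-token context window
# can't hold everything. Model name and prompts are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(prompt: str, context: str = "") -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in; the original system used Codex-era completion models
        messages=[{"role": "user", "content": f"{context}\n\n{prompt}".strip()}],
    )
    return resp.choices[0].message.content

def generate_app(idea: str) -> dict[str, str]:
    # Stage 1: high-level decisions (stack, frontend, backend).
    stack = ask(f"App idea: {idea}\nPropose a stack (frontend, backend, database). Be brief.")
    # Stage 2: break the app into pieces, carrying only the stack summary forward.
    screens = ask("List the screens/endpoints needed, one per line.", context=stack)
    # Stage 3: generate code piece by piece; each call sees only the stack plus one item.
    code = {}
    for item in screens.splitlines():
        if item.strip():
            code[item] = ask(f"Write the code for: {item}", context=stack)
    return code
```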
00:06:54.440 | And that's what we showed to our co-founder, Yang.
00:06:58.280 | I met my co-founder, Yang, through Fancy,
00:07:00.400 | 'cause his wife was their first employee.
00:07:02.280 | And we showed him, and he was like,
00:07:04.520 | you've discovered fire, what is this?
00:07:06.080 | Like, this is insane.
00:07:07.240 | He has a lot more startup experience.
00:07:09.160 | Historically, he's had a few exits in the past
00:07:11.240 | and has been through all different industries.
00:07:13.880 | He's like our dad, he's a bit older.
00:07:15.240 | He hates me saying that, but he's a bit older.
00:07:16.840 | - He's your COO now?
00:07:17.960 | - He's our COO, yeah.
00:07:19.000 | And we showed him and he was like,
00:07:20.280 | this is absolutely amazing, let's just do something.
00:07:21.880 | 'Cause he, at the time, was just about to have a child,
00:07:24.640 | so he didn't have anything going on either.
00:07:26.640 | So we applied to YC, got an interview.
00:07:29.400 | The interview was, as most YC interviews are,
00:07:31.920 | short, curt, and pretty brutal.
00:07:33.640 | They told us they hated the idea.
00:07:35.080 | They didn't think it would work.
00:07:36.680 | And that's when we started brainstorming.
00:07:38.640 | It was almost like the interview
00:07:39.720 | was like an office hours kind of thing.
00:07:41.040 | And we were like, okay, given what you know
00:07:44.040 | about the space now and how to build things
00:07:45.960 | with these LLMs, what can you bring out
00:07:48.280 | of what you've learned in building that thing
00:07:49.920 | into something that might be a bit more useful
00:07:52.640 | to people on the daily?
00:07:53.480 | And also, YC obviously likes B2B startups
00:07:55.240 | a little bit more, at least at the time they did back then.
00:07:57.760 | So we were like, okay, maybe we could build something
00:08:00.080 | that helps you with existing codebases,
00:08:01.600 | like can sort of automate development stuff
00:08:03.000 | with existing codebases, not knowing at all
00:08:05.280 | what that would look like or how you would build it
00:08:07.480 | or any of these things.
00:08:09.040 | And they were like, yeah, that sounds interesting.
00:08:11.880 | You should probably go ahead and do that.
00:08:13.560 | You're in, you've got two weeks to build us an MVP.
00:08:16.080 | And we were like, okay, okay.
00:08:18.520 | We did our best.
00:08:19.360 | The MVP was absolutely horrendous.
00:08:20.480 | It was a CLI tool, it sucked.
00:08:22.360 | And at the time we were like, we don't even know
00:08:26.600 | how to build what we want to build.
00:08:28.480 | And we didn't really know what we wanted to build,
00:08:30.120 | to be honest.
00:08:30.960 | Like we knew we wanted to try to help automate dev work,
00:08:34.120 | but back then we just didn't know enough
00:08:35.600 | about how LLM apps were built,
00:08:37.920 | the intricacies and all those things.
00:08:39.120 | And also like the LLMs themselves,
00:08:40.560 | like 4,000 tokens, you're not going very far.
00:08:42.400 | They're extremely expensive.
00:08:43.880 | So we ended up building a code-based retrieval tool
00:08:46.920 | originally.
00:08:47.840 | Our thought process originally was,
00:08:49.560 | we want to build something that can do our jobs for us.
00:08:51.840 | That is like the gold star, we know that.
00:08:53.320 | We've seen like there are glimpses of it happening
00:08:55.520 | with our initial demo that we did,
00:08:57.560 | but we don't see the path of how to do that at the moment.
00:09:00.480 | Like the tech just wasn't there.
00:09:02.360 | So we were like, well, there are going to be some things
00:09:04.040 | that you need to build this when the tech does catch up.
00:09:06.560 | So retrieval being one of the most important things,
00:09:09.360 | like the model's going to have to build like pull code
00:09:11.000 | out of a code base somehow.
00:09:12.400 | So we were like, well,
00:09:13.240 | let's just build the tooling around it.
00:09:14.200 | And eventually when the tech comes,
00:09:15.440 | then we'll be able to just like plug it into our tooling
00:09:18.240 | and then it should work basically.
00:09:20.440 | And to be fair, that's basically what we've done.
00:09:22.840 | And that's basically what's happened,
00:09:23.920 | which is very fortunate.
00:09:25.240 | But in the meantime,
00:09:26.480 | whilst we were waiting for everything
00:09:27.720 | to sort of become available,
00:09:29.400 | we built this code-based retrieval tool.
00:09:31.320 | That was the first thing we ever launched
00:09:32.520 | when we were in YC, and it didn't work.
00:09:35.280 | It was really frustrating for us
00:09:36.520 | 'cause it was just me and Sam like working like all hours
00:09:39.080 | trying to get this thing to work.
00:09:40.560 | It was quite a big task in and of itself,
00:09:42.240 | trying to get like a good semantic search engine working
00:09:46.200 | that could run locally on your machine.
00:09:48.200 | We were trying to avoid sending code to the cloud
00:09:49.880 | as much as possible.
00:09:51.320 | And then for very large code bases,
00:09:52.760 | you're like, you know, millions of lines of code.
00:09:55.080 | You're trying to do some sort of like local HNSW thing
00:09:57.400 | that runs inside your VS Code instance
00:09:59.120 | that like eats all your RAM as you've seen in the past,
00:10:01.960 | all those different things.
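
A minimal sketch of that kind of local, in-editor HNSW index, assuming the hnswlib library; the naive line-based chunking and the embed() stand-in are illustrative, not the actual extension.

```python
# Sketch of a local HNSW semantic index over a code base, as described above.
# Assumes `hnswlib` and `numpy`; embed() is a placeholder for a real local embedding model.
import hnswlib
import numpy as np

def chunk_file(path: str, lines_per_chunk: int = 40) -> list[str]:
    """Split a source file into fixed-size line chunks (no AST awareness)."""
    lines = open(path, encoding="utf-8", errors="ignore").read().splitlines()
    return ["\n".join(lines[i:i + lines_per_chunk])
            for i in range(0, len(lines), lines_per_chunk)]

def embed(text: str, dim: int = 384) -> np.ndarray:
    """Placeholder embedding; swap in a real local model (e.g. a sentence-transformer)."""
    rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))
    v = rng.standard_normal(dim).astype(np.float32)
    return v / np.linalg.norm(v)

def build_index(chunks: list[str], dim: int = 384) -> hnswlib.Index:
    index = hnswlib.Index(space="cosine", dim=dim)
    index.init_index(max_elements=len(chunks), ef_construction=200, M=16)
    index.add_items(np.stack([embed(c, dim) for c in chunks]), np.arange(len(chunks)))
    index.set_ef(50)
    return index

# chunks = [c for f in repo_files for c in chunk_file(f)]
# index = build_index(chunks)
# labels, distances = index.knn_query(embed("authentication middleware"), k=5)
```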
00:10:02.800 | - Yep. - Yeah.
00:10:03.760 | - My first call with you, I think I had trouble.
00:10:06.000 | - You were like, "Yeah, it sucks, man."
00:10:06.840 | I was like, "Yeah, I know, I know, I know it sucks.
00:10:08.640 | I'm sorry."
00:10:10.360 | But building all that stuff was essentially
00:10:13.640 | the first six to eight months of what at the time was Buildt.
00:10:18.480 | - Which by the way, "Buildt."
00:10:20.200 | - "Buildt," yeah, it was a terrible, terrible name.
00:10:22.640 | - It was the worst part of trying to think about
00:10:26.000 | whether I would invest is whether or not
00:10:27.520 | people could pronounce it.
00:10:29.000 | - No, so when we went on our first ever YC like retreat,
00:10:33.440 | no one got the name right.
00:10:34.520 | They were like, "Bildt, Bill, what?"
00:10:37.160 | And then we actually changed the name to Cosine.
00:10:39.560 | Like, although some people would spell it
00:10:42.120 | as if you're cosigning for an apartment or something.
00:10:44.480 | Like, that's like, can't win.
00:10:46.480 | Yeah, that was what "Buildt" was back then.
00:10:47.880 | But the ambition, and I did a talk on this
00:10:49.840 | back in the end of 2022, the ambition to like build
00:10:52.560 | something that essentially automated our jobs
00:10:54.440 | was still very much like core to what we were doing.
00:10:58.480 | But for a very long time, it was just never apparent to us
00:11:01.080 | like, how would you go about doing these things?
00:11:03.440 | Even when like you had 3.5, 16K,
00:11:06.080 | 16K suddenly felt huge 'cause you've gone from four to 16,
00:11:09.080 | but even then 16K is like,
00:11:10.960 | a lot of Python files are longer than 16K.
00:11:13.240 | So you can't, you know,
00:11:14.440 | before you even start doing a completion,
00:11:16.400 | even then we were like, "Eh, yeah,
00:11:18.400 | it looks like we're still waiting."
00:11:19.480 | And then like towards the end of last year,
00:11:22.800 | you then start, you see 32K, 32K was really smart.
00:11:26.520 | It was really expensive, but also like,
00:11:29.280 | you could fit a decent amount of stuff in it.
00:11:30.640 | 32K felt enormous.
00:11:32.320 | And then finally 128K came along and we were like,
00:11:34.280 | "Right, this is like, this is what we can actually deal with
00:11:37.520 | because fundamentally to build a product like this,
00:11:39.840 | you need to get as much information
00:11:41.080 | in front of the model as possible
00:11:42.440 | and make sure that everything it ever writes in output
00:11:45.080 | can be traced back to something in the context window
00:11:48.200 | so it's not hallucinating it."
00:11:49.520 | As soon as that model existed, I was like,
00:11:52.200 | "Okay, I know that this is now gonna be feasible
00:11:54.640 | in some way."
00:11:55.480 | We'd done early sort of dev work on Genie using 3.5, 16K.
00:12:00.480 | And that was a very, very like crude way of proving
00:12:05.800 | that this loop that we were after
00:12:07.440 | and the way we were generating the data
00:12:09.960 | actually had signal and worked and could do something.
00:12:13.520 | But the model itself was not useful
00:12:15.240 | because you couldn't ever fit enough information into it
00:12:18.800 | for it to be able to do the task competently
00:12:20.920 | and also the base intelligence of the model.
00:12:23.120 | I mean, 3.5, anyone who's used 3.5 knows
00:12:25.240 | the base intelligence of the model is lacking,
00:12:27.520 | especially when you're asking it
00:12:28.360 | to like do software engineering is quite involved.
00:12:31.040 | So we saw the 128K context model
00:12:34.520 | and at that point we'd been in touch with OpenAI
00:12:38.760 | about our ambitions and like how we wanted to build it.
00:12:41.440 | We essentially are, I just took a punt.
00:12:43.000 | I was like, "I'm just gonna ask to see,
00:12:44.360 | can we like train this thing?"
00:12:45.680 | 'Cause at the time 4Turbo had just come out
00:12:48.160 | and back then there was still a decent amount of lag time
00:12:50.840 | between like OpenAI releasing a model
00:12:53.160 | and then allowing you to fine tune it in some way.
00:12:56.280 | They've gotten much better about that recently.
00:12:57.800 | Like 4o fine-tuning came out,
00:12:59.520 | I think, a day after. 4o Mini fine-tuning came out
00:13:01.640 | like the day after the model did.
00:13:03.520 | And I know that's something they're definitely
00:13:04.600 | like optimizing for super heavily inside,
00:13:06.680 | which is great to see.
00:13:07.720 | - Which is a little bit, for a year or so,
00:13:10.520 | YC companies had like a direct Slack channel to OpenAI.
00:13:14.000 | - We still do.
00:13:14.840 | - Yeah. - Yeah.
00:13:15.680 | - So it's a little bit of that diminishing
00:13:17.400 | of the YC advantage there.
00:13:18.760 | - Yeah.
00:13:19.600 | - If they're releasing this fine tuning ability
00:13:20.880 | like a day after.
00:13:21.840 | - Yeah, no, no, absolutely.
00:13:22.680 | But like you can't build a startup on the YC advantage.
00:13:25.360 | It's obviously nice, it makes you feel warm and fuzzy inside
00:13:27.520 | but like at the end of the day,
00:13:28.560 | it's not that that's gonna make you win.
00:13:31.040 | - Yeah.
00:13:31.880 | So like we'd spoken to Shyamal there,
00:13:34.520 | that DevRel guy, I'm sure you know him.
00:13:36.440 | - I think he's head of solutions or something.
00:13:38.120 | - He is in their applied team, yeah.
00:13:41.000 | We'd been talking to him from the very beginning
00:13:42.600 | when we got into YC
00:13:43.440 | and he's been absolutely fantastic throughout.
00:13:46.120 | I basically had pitched him this idea
00:13:47.880 | back when we were doing it on 3.5, 16K.
00:13:50.720 | And I was like, this is my crazy thesis.
00:13:53.000 | I wanna see if this can work.
00:13:54.400 | And as soon as like that 128K model came out,
00:13:57.240 | I started like laying the groundwork.
00:13:58.520 | I was like, I know this definitely isn't possible
00:14:00.440 | 'cause he released it like yesterday,
00:14:01.680 | but know that I want it.
00:14:03.840 | And in the interim, like GPT-4,
00:14:06.080 | like 8K fine tuning came out.
00:14:07.760 | We tried that, it's obviously even fewer tokens,
00:14:09.600 | but the intelligence helped.
00:14:10.960 | And I was like, if we can marry the intelligence
00:14:12.480 | and the context window length,
00:14:13.360 | then we're gonna have something special.
00:14:14.360 | And eventually we were able to get
00:14:16.400 | on the experimental access program
00:14:18.440 | and we got access to 4 Turbo fine-tuning.
00:14:22.040 | As soon as we did that,
00:14:23.600 | because in the entire run up to that,
00:14:25.080 | we'd built the data pipeline.
00:14:26.200 | We already had all that set up.
00:14:27.520 | So we were like, right, we have the data.
00:14:29.520 | Now we have the model.
00:14:30.520 | Let's put it through and iterate essentially.
00:14:33.960 | And that's where like Genie as we know it today
00:14:38.400 | really was born.
00:14:39.640 | I won't pretend like the first version of Genie
00:14:41.040 | that we trained was good.
00:14:41.880 | It was a disaster.
00:14:43.160 | That's where you realize all the implicit biases
00:14:45.200 | in your data set.
00:14:46.040 | And you realize that, oh, actually this decision you made
00:14:47.800 | that was fairly arbitrary was the wrong one.
00:14:49.640 | You have to do it a different way.
00:14:51.200 | Other subtle things like, you know,
00:14:52.880 | how you write Git diffs and you're using LLMs
00:14:55.680 | and how you can best optimize that
00:14:57.160 | to make sure they actually apply and work
00:14:58.520 | and loads of different little edge cases.
00:15:00.360 | But as soon as we had access to the underlying tool,
00:15:02.400 | we were like, right, we can actually do this.
00:15:04.600 | And I was, I breathed a sigh of relief
00:15:07.760 | 'cause I didn't know it was like, it wasn't a done deal,
00:15:09.960 | but I knew that we could build something useful.
00:15:12.080 | I mean, I knew that we could build something
00:15:13.480 | that would be measurably good on whatever eval at the time
00:15:18.480 | that you wanted to use.
00:15:20.080 | Like at the time, back then,
00:15:21.960 | we weren't actually that familiar with SWE-Bench.
00:15:23.240 | But once Devin came out and they announced
00:15:26.160 | their SWE-Bench results,
00:15:27.000 | like that's when my life took a turn.
00:15:29.640 | - Challenge accepted.
00:15:30.480 | - Yeah, challenge accepted.
00:15:31.640 | And that's where like, yes,
00:15:32.920 | that's where my friendships have gone.
00:15:34.640 | My sleep has gone, my weight, everything.
00:15:38.000 | Got into SWE-Bench and yeah,
00:15:40.400 | it was actually a very useful tool in building Genie
00:15:42.680 | 'cause beforehand it was like,
00:15:43.520 | "Yes, vibe check this thing and see if it's useful."
00:15:45.920 | And then all of a sudden you have an actual measure
00:15:48.120 | to see like, could it do software engineering?
00:15:50.800 | Not the best measure, obviously,
00:15:52.280 | but like it's the best that we've got now.
00:15:54.400 | We would just iterate and build.
00:15:56.240 | And eventually we got it to the point where it is now.
00:15:59.440 | And a little bit beyond since we actually got that score
00:16:03.440 | a couple of weeks ago.
00:16:04.800 | And yeah, it's been a hell of a journey
00:16:06.600 | from the beginning all the way now.
00:16:07.600 | That was a very rambling answer
00:16:08.800 | to your question about how we got here,
00:16:10.120 | but that's essentially a potted answer how we got here.
00:16:13.160 | - Got the full origin story.
00:16:14.240 | - Yeah, no, totally.
00:16:15.440 | You mentioned bias in the data and some of these things.
00:16:17.960 | In your announcement video,
00:16:19.440 | you called Genie the world's first
00:16:21.080 | AI software engineering colleague.
00:16:23.040 | And you kind of highlighted how the data needed to train it
00:16:27.000 | needs to show how a human engineer works.
00:16:30.240 | I think maybe you're contrasting that
00:16:32.480 | to just putting code in it.
00:16:34.040 | There's kind of like a lot more than code
00:16:35.760 | that goes into software engineering.
00:16:37.600 | How do you think about the data mixture?
00:16:39.440 | You know, and like there's this kind of known truth
00:16:42.680 | that code makes models better
00:16:44.880 | when you put in the pre-training data.
00:16:46.200 | But since we put so much in the pre-training data,
00:16:48.560 | what else do you add when you train Genie?
00:16:51.000 | - Yeah, I think that sort of boils down fundamentally
00:16:54.120 | to the difference between a model writing code
00:16:56.600 | and a model doing software engineering.
00:16:58.520 | Because the software engineering sort of discipline
00:17:01.680 | goes wider because if you look at something like a PR,
00:17:05.640 | that is obviously a artifact of some thought
00:17:09.080 | and some work that has happened
00:17:10.720 | and has eventually been squashed into some diffs, right?
00:17:13.680 | What the, very crudely,
00:17:15.320 | what the pre-trained models are reading
00:17:18.040 | is they're reading those final diffs
00:17:19.320 | and they're emulating that
00:17:20.920 | and they're being able to output it, right?
00:17:22.680 | But of course, it's a super lossy thing, a PR.
00:17:25.200 | You have no idea why or how, for the most part,
00:17:27.600 | unless there are some comments,
00:17:28.560 | which, you know, anyone who's worked in a company
00:17:30.240 | realizes PR reviews can be a bit dodgy at times.
00:17:33.360 | But you see that you lose so much information at the end.
00:17:37.120 | And that's perfectly fine because PRs aren't designed
00:17:39.600 | to be something that perfectly preserves
00:17:41.440 | everything that happened.
00:17:42.880 | But what we realized was if you want something
00:17:45.720 | that's a software engineer, and very crudely,
00:17:47.760 | we started with something that can do PRs for you,
00:17:50.120 | essentially, you need to be able to figure out
00:17:53.320 | why those things happened.
00:17:54.920 | Otherwise, you're just gonna rely,
00:17:56.680 | essentially, you just have a code writing model.
00:17:58.000 | You have something that's good at HumanEval,
00:17:59.560 | but not very good at SWE-Bench, essentially.
00:18:01.960 | That realization was part of the kernel of the idea
00:18:05.680 | of the approach that we took to design the agent
00:18:08.680 | that is Genie.
00:18:10.200 | The way that we decided we want to try to extract
00:18:14.200 | what happened in the past, like as forensically as possible,
00:18:17.600 | has been and is currently like one of the main things
00:18:20.680 | that we focus all our time on.
00:18:22.440 | Because doing that, getting as much signal out as possible,
00:18:24.720 | doing that as well as possible,
00:18:26.320 | is the biggest thing that we've seen
00:18:29.240 | that determines how well we do on that benchmark
00:18:31.120 | at the end of the day.
00:18:32.080 | Once you've sorted things out, like output structure,
00:18:35.480 | how to get it consistently writing diffs,
00:18:37.800 | and all the stuff that is sort of ancillary
00:18:40.320 | to the model actually figuring out how to solve a problem,
00:18:43.200 | the core bit of solving the problem is
00:18:45.040 | how did the human solve this problem?
00:18:46.600 | And how can we best come up with
00:18:48.960 | how the human solved these problems?
00:18:50.960 | So all the effort went in on that pipeline.
00:18:53.440 | And the mix that we ended up with was,
00:18:56.720 | as you've probably seen in the technical report and so on,
00:18:59.360 | all of those different languages and different combinations
00:19:01.480 | of different task types,
00:19:02.680 | all of that has run through that pipeline
00:19:04.440 | and we've extracted all that information out.
00:19:06.040 | - How does that differ when you work with customers
00:19:08.480 | that have private workflows?
00:19:10.000 | Like, do you think, is there usually a big delta
00:19:12.600 | between what you get in open source
00:19:14.280 | and maybe public data versus like--
00:19:15.920 | - Yeah, yeah, yeah.
00:19:16.760 | When you scrape enough of it,
00:19:17.640 | most of open source is updating readmes and docs.
00:19:19.880 | It's hilarious, like we had to filter out
00:19:21.680 | so much of that stuff because
00:19:23.160 | when we first did the 3.5, 16K model,
00:19:27.400 | like the amount of readme updating that went in,
00:19:30.160 | we did like no data cleaning, no real like,
00:19:33.080 | we just sort of threw it in and saw what happened.
00:19:35.040 | And it was just like,
00:19:37.000 | it was really good at updating readmes,
00:19:38.760 | really good at writing some comments,
00:19:40.480 | really good at complaining in Git reviews,
00:19:43.520 | in PR reviews rather.
00:19:44.880 | And it was, again, like we didn't clean the data.
00:19:46.880 | So you'd like give it some feedback
00:19:48.240 | and it would just like reply and like,
00:19:50.080 | it would just be quite insubordinate
00:19:51.680 | when it was getting back to you like,
00:19:52.680 | no, I don't think you're right.
00:19:53.720 | And it would just sort of argue with you.
00:19:55.520 | So the process of doing all that was super interesting
00:19:58.920 | 'cause we realized from the beginning,
00:20:00.160 | okay, there's a huge amount of work
00:20:01.560 | that needs to go into like cleaning this,
00:20:03.600 | getting it aligned with what we want the model to do
00:20:06.120 | to be able to get the model to be useful in some way.
00:20:09.080 | - I'm curious, like, how do you think about
00:20:11.040 | the customer willingness to share
00:20:13.720 | all of this historical data?
00:20:14.880 | I've done a lot of developer tools investing in my career
00:20:17.960 | and getting access to the code base
00:20:19.840 | is always one of the hard things.
00:20:21.720 | Are people getting more cautious
00:20:24.440 | about sharing this information?
00:20:26.080 | In the past, it was maybe like, you know,
00:20:27.520 | you're using static analysis tool,
00:20:29.640 | like whatever else you need to plug into the code base, fine.
00:20:32.360 | Now you're building a model based on it.
00:20:34.800 | Like, what's the discussion going into these companies?
00:20:37.280 | Are most people comfortable with like letting you see
00:20:39.400 | how to work and sharing everything or?
00:20:41.400 | - It depends on the sector mostly.
00:20:44.120 | We've actually seen, I'd say,
00:20:45.720 | people becoming more amenable to the idea over time,
00:20:48.200 | rather than more skeptical,
00:20:49.800 | 'cause I think they can see the upside.
00:20:52.360 | If this thing does what they say it does,
00:20:54.760 | it's gonna be more help to us
00:20:56.360 | than it is a risk to our infosec.
00:20:58.520 | And of course, like companies building in this space,
00:21:01.240 | we're all gonna end up, you know,
00:21:02.360 | complying with the same rules
00:21:03.640 | and there are gonna be new rules that come out
00:21:04.960 | to make sure that when we're looking at your code,
00:21:07.120 | everything is safe and so on.
00:21:08.680 | So from what we've seen so far,
00:21:10.520 | we've spoken to some very large companies
00:21:12.120 | that you've definitely heard of
00:21:13.640 | and all of them obviously have stipulations
00:21:16.360 | and many of them want it to be sandboxed to start with
00:21:18.640 | and all the like very obvious things
00:21:20.280 | that I, you know, I would say as well.
00:21:22.280 | But they're all super keen to have a go
00:21:24.880 | and see because like, despite all those things,
00:21:27.320 | if we can genuinely make them go faster,
00:21:30.240 | allow them to build more in a given time period and stuff,
00:21:32.360 | it's super worth it to them.
00:21:33.840 | - Okay, I'm gonna dive in a little bit
00:21:35.600 | on the process that you have created.
00:21:38.720 | You showed the demo on your video
00:21:40.680 | and by the time that we release this,
00:21:42.480 | you should be taking people off the wait list
00:21:44.160 | and launching people so people can see this themselves.
00:21:47.000 | There's four main parts of the workflow,
00:21:50.120 | which is finding files, planning action,
00:21:53.000 | writing code and running tests.
00:21:55.160 | And controversially, you have set yourself apart
00:21:58.680 | from the Devins of the world
00:22:00.520 | by saying that things like having access to a browser
00:22:03.600 | is not that important for you.
00:22:04.960 | Is that an accurate reading of what you wrote?
00:22:07.240 | - I don't remember saying that,
00:22:08.640 | but at least with what we've seen,
00:22:11.640 | the browser is helpful,
00:22:13.320 | but it's not as helpful as like,
00:22:14.760 | ragging the correct files, if that makes sense.
00:22:17.280 | Like, it is still helpful,
00:22:18.640 | but obviously there are more fundamental things
00:22:21.240 | you have to get right before you get to like,
00:22:23.280 | oh yeah, you can read some docs
00:22:24.480 | or you can read a stack overflow article
00:22:26.120 | and stuff like that.
00:22:26.960 | - Yeah, the phrase I was indexing on
00:22:28.880 | was the other software tools
00:22:30.840 | are wrappers around foundational models
00:22:32.280 | with a few additional tools,
00:22:33.320 | such as a web browser or code interpreter.
00:22:35.120 | - Oh, I see.
00:22:35.960 | No, I mean, no, I'm deriding the approach there,
00:22:39.000 | not the tools.
00:22:39.920 | - Yeah, exactly.
00:22:40.760 | So like, I would say in my standard model
00:22:43.680 | of what a code agent should look like,
00:22:45.560 | Devon has been very influential, obviously,
00:22:47.640 | because you could just add the docs of something
00:22:51.000 | and now I have, now when I'm installing a new library,
00:22:53.920 | I can just add docs.
00:22:55.360 | Cursor also does this, right?
00:22:56.720 | And then obviously having a code interpreter does help.
00:22:59.160 | I guess you have that in the form of running tests.
00:23:01.880 | - I mean, the Genie has both of those tools
00:23:03.880 | available to it as well.
00:23:04.760 | So yeah, yeah, yeah.
00:23:05.800 | So we have a tool where you can like put in URLs
00:23:09.160 | and it will just read the URLs
00:23:10.240 | and it also uses Perplexity's API under the hood as well
00:23:12.920 | to be able to actually ask questions if it wants to.
00:23:14.960 | - Okay.
00:23:15.800 | - So now we use both of those tools as well.
00:23:16.960 | Like those tools are super important and super key.
00:23:20.720 | I think obviously the most important tools to these agents
00:23:24.440 | are like being able to retrieve code from a code base,
00:23:27.680 | being able to read Stack Overflow articles and what have you
00:23:30.640 | and just be able to essentially be able to Google like we do
00:23:32.960 | is definitely super useful.
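
For concreteness, a rough sketch of that pair of tools: one fetches and strips a URL, the other asks a question through Perplexity's OpenAI-compatible chat completions endpoint. The endpoint and model name here come from Perplexity's public API and are assumptions about the shape of such a tool, not Cosine's implementation.

```python
# Sketch of the two agent tools described above: fetch-a-URL and ask-the-web.
# Assumes the `requests` and `beautifulsoup4` packages; the Perplexity model name
# ("sonar") and endpoint follow their public OpenAI-compatible API and may differ.
import os
import requests
from bs4 import BeautifulSoup

def read_url(url: str, max_chars: int = 8000) -> str:
    """Fetch a page and return its visible text, truncated to fit in a context window."""
    html = requests.get(url, timeout=15).text
    text = BeautifulSoup(html, "html.parser").get_text(separator="\n", strip=True)
    return text[:max_chars]

def ask_web(question: str) -> str:
    """Ask a question through Perplexity's chat completions API (model name is an assumption)."""
    resp = requests.post(
        "https://api.perplexity.ai/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['PERPLEXITY_API_KEY']}"},
        json={"model": "sonar", "messages": [{"role": "user", "content": question}]},
        timeout=60,
    )
    return resp.json()["choices"][0]["message"]["content"]
```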
00:23:35.000 | - Yeah.
00:23:35.840 | I thought maybe we could just kind of dive
00:23:36.800 | into each of those actions.
00:23:38.600 | Code retrieval, one of the core problems,
00:23:40.840 | you had an indexer that you've worked on,
00:23:43.640 | Even S has built.
00:23:45.080 | What makes it hard?
00:23:46.240 | What approach you thought would work, didn't work?
00:23:48.760 | Anything like that.
00:23:49.600 | - It's funny, I had a similar conversation to this
00:23:51.760 | when I was chatting to the guys from OpenAI yesterday.
00:23:54.680 | The thing is that searching for code,
00:23:57.680 | specifically semantically, at least to start with,
00:24:00.000 | I mean like keyword search and stuff like that
00:24:01.600 | is a solved problem, it's been around for ages,
00:24:04.120 | but at least being able to,
00:24:06.120 | the phrase we always used back in the day
00:24:07.760 | was searching for what code does rather than what code is,
00:24:11.200 | like searching for functionality is really hard, really hard.
00:24:16.200 | The way that we approached that problem
00:24:18.240 | was that obviously like a very basic and easy approach
00:24:22.320 | is right, let's just embed the code base,
00:24:23.800 | we'll chunk it up in some arbitrary way,
00:24:26.120 | maybe using an AST, maybe using number of lines,
00:24:28.440 | maybe using whatever, like some overlapping,
00:24:30.440 | just chunk it up and embed it.
00:24:31.920 | And once you've done that, I will write a query saying like,
00:24:34.800 | find me some authentication code or something,
00:24:36.920 | embed it, and then do the cosine similarity
00:24:39.040 | and get the top K, right?
00:24:40.360 | That doesn't work, and I wish it did work,
00:24:42.600 | don't get me wrong.
00:24:43.640 | It doesn't work well at all because fundamentally,
00:24:47.360 | if you think about like semantically how code looks
00:24:50.000 | is very different to how English looks,
00:24:51.600 | and there's like not a huge amount of signal
00:24:53.680 | that's carried between the two.
00:24:55.000 | So what we ended up, the first approach we took
00:24:57.280 | and that kind of did well enough for a long time was,
00:25:01.080 | okay, let's train a model
00:25:03.680 | to be able to take in English code queries
00:25:06.800 | and then produce a hypothetical code snippet
00:25:09.840 | that might look like the answer,
00:25:12.320 | embed that, and then do the cosine similarity.
00:25:15.360 | And that process, although very simple,
00:25:17.520 | gets you so much more performance
00:25:19.920 | out of the retrieval accuracy.
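
That "generate a hypothetical snippet, then embed it" approach is close in spirit to HyDE-style retrieval. Here is a minimal sketch, assuming the OpenAI client for both the snippet generation and the embeddings; the model names are stand-ins and this is an illustration of the idea, not Cosine's engine.

```python
# Sketch of the retrieval idea described above: instead of embedding the English query
# directly, generate a hypothetical code snippet that might answer it, embed that,
# and rank chunks by cosine similarity.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    vecs = np.array([d.embedding for d in resp.data], dtype=np.float32)
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize

def hypothetical_snippet(query: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in for a small model trained/prompted for this
        messages=[{"role": "user",
                   "content": f"Write a short code snippet that would plausibly answer: {query}"}],
    )
    return resp.choices[0].message.content

def search(query: str, chunks: list[str], k: int = 5) -> list[str]:
    chunk_vecs = embed(chunks)
    q_vec = embed([hypothetical_snippet(query)])[0]   # embed code, not English
    scores = chunk_vecs @ q_vec                       # cosine similarity on unit vectors
    return [chunks[i] for i in np.argsort(-scores)[:k]]
```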
00:25:21.920 | And that was kind of like the start of our engine,
00:25:25.080 | as we called it, which is essentially like the aggregation
00:25:28.080 | of all these different heuristics,
00:25:29.200 | like semantic, keyword, LSP, and so on.
00:25:33.200 | And then we essentially had like a model
00:25:36.040 | that would, given an input,
00:25:37.720 | choose which ones it thought were most appropriate
00:25:39.840 | given the type of requests you had.
00:25:41.640 | So the whole code search thing was a really hard problem.
00:25:46.360 | And actually what we ended up doing with Genie
00:25:48.160 | is we let the model through self-play
00:25:52.320 | figure out how to retrieve code.
00:25:53.720 | So actually we don't use our engine for Genie.
00:25:56.680 | So instead of like a request coming in
00:25:59.520 | and then like say GPT-4 with some JSON output being like,
00:26:02.720 | well, I think here we should use a keyword
00:26:04.360 | with these inputs and then we should use semantic
00:26:06.320 | and then we should like pick these results.
00:26:08.360 | It's actually like a question comes in
00:26:10.520 | and Genie has self-played in its training data
00:26:14.040 | to be able to be like,
00:26:14.880 | okay, this is how I'm going to approach
00:26:16.200 | finding this information.
00:26:17.320 | Much more akin to how a developer would do it.
00:26:19.880 | 'Cause if I was like,
00:26:20.960 | Sean, go into this new code base you've never seen before
00:26:23.600 | and find me the code that does this,
00:26:26.600 | you're gonna probably, you might do some keywords.
00:26:28.640 | You're gonna look over the file system.
00:26:30.080 | You're gonna try to figure out from the directories
00:26:32.000 | and the file names where it might be.
00:26:33.400 | You're gonna like jump in one
00:26:35.040 | and then once you're in there,
00:26:35.880 | you're probably gonna be doing the go to definition stuff
00:26:38.880 | to like jump from file to file
00:26:40.240 | and try to use the graph to like get closer and closer.
00:26:43.440 | And that is exactly what Genie does.
00:26:45.200 | Starts on the file system, looks at the file system,
00:26:47.320 | picks some candidate files.
00:26:48.960 | Is this what I'm looking for, yes or no?
00:26:51.000 | If there's something that's interesting,
00:26:52.280 | like an import or something,
00:26:53.320 | it can command click on that thing,
00:26:55.200 | go to definition, go to references and so on.
00:26:57.320 | And it can traverse the code base that way.
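
As an illustration of what that workflow looks like as an agent tool set, here are hypothetical function-calling schemas for file-system browsing and LSP-style navigation. The names and parameters are assumptions for illustration only, not Genie's actual tools.

```python
# Illustrative tool definitions (OpenAI function-calling schema format) for the kind of
# code-base navigation described above: browse the file tree, read files, and use a
# language server for go-to-definition / find-references. Hypothetical, not Genie's.
NAVIGATION_TOOLS = [
    {"type": "function", "function": {
        "name": "list_directory",
        "description": "List files and folders under a path in the repository.",
        "parameters": {"type": "object",
                       "properties": {"path": {"type": "string"}},
                       "required": ["path"]}}},
    {"type": "function", "function": {
        "name": "read_file",
        "description": "Return the contents of a file, optionally a line range.",
        "parameters": {"type": "object",
                       "properties": {"path": {"type": "string"},
                                      "start_line": {"type": "integer"},
                                      "end_line": {"type": "integer"}},
                       "required": ["path"]}}},
    {"type": "function", "function": {
        "name": "go_to_definition",
        "description": "Ask the language server for the definition of the symbol at a position.",
        "parameters": {"type": "object",
                       "properties": {"path": {"type": "string"},
                                      "line": {"type": "integer"},
                                      "column": {"type": "integer"}},
                       "required": ["path", "line", "column"]}}},
    {"type": "function", "function": {
        "name": "find_references",
        "description": "Ask the language server for all references to the symbol at a position.",
        "parameters": {"type": "object",
                       "properties": {"path": {"type": "string"},
                                      "line": {"type": "integer"},
                                      "column": {"type": "integer"}},
                       "required": ["path", "line", "column"]}}},
]
```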
00:26:59.240 | - Are you using the VS Code LSP or?
00:27:01.560 | - No, that's no, we're not doing this in VS Code.
00:27:04.240 | We're just using the language servers running.
00:27:06.440 | But we really wanted to try to mimic
00:27:09.120 | the way we do it as best as possible.
00:27:11.360 | And we did that during the self-play process
00:27:13.680 | when we were generating the data set.
00:27:14.880 | So although we did all that work originally,
00:27:17.360 | and although like Genie still has access to these tools,
00:27:19.840 | so it can do keyword searches
00:27:21.040 | and it can do basic semantic searches
00:27:23.480 | and it can use the graph.
00:27:24.360 | It uses them through this process and figures out,
00:27:27.840 | okay, I've learned from data how to find stuff in code bases
00:27:31.120 | and I think in our technical report,
00:27:32.400 | I can't remember the exact number,
00:27:33.520 | but I think it was around 65 or 66% retrieval accuracy
00:27:36.680 | overall measured on,
00:27:38.520 | we know what lines we need for these tasks to find
00:27:42.080 | for the task to actually be able to be completed.
00:27:44.640 | And we found about 66% of all those lines,
00:27:47.640 | which is one of the biggest areas of free performance
00:27:51.160 | that we can get hold of
00:27:52.000 | because when we were building Genie truthfully,
00:27:54.200 | like a lot more focus went on
00:27:56.800 | assuming you found the right information,
00:27:59.000 | you've been able to reproduce the issue,
00:28:01.000 | assuming that's true,
00:28:02.560 | how do you then go about solving it?
00:28:04.800 | And the bulk of the work we did was on the solving.
00:28:08.240 | But when you go higher up the funnel,
00:28:09.720 | obviously like the funnel looks like,
00:28:11.240 | have you found everything you need for the task?
00:28:13.400 | Are you able to reproduce the problem
00:28:14.800 | that's seen in the issue?
00:28:16.040 | Are you then able to solve it?
00:28:17.240 | And the funnel gets narrower as you go down.
00:28:19.440 | And at the top of the funnel, of course, is rank.
00:28:20.880 | So I'm actually quite happy with that score.
00:28:22.760 | I think it's still pretty impressive
00:28:23.840 | considering the size of some of the code bases
00:28:25.360 | we're using for this.
00:28:27.640 | But as soon as that, if that number becomes 80,
00:28:29.880 | I think how many more tasks we get right.
00:28:31.360 | That's one of the key areas we're gonna focus on
00:28:33.560 | when we continue working on Genie.
00:28:35.120 | - Be interesting to break out a benchmark just for that.
00:28:37.720 | - Yeah, I mean, it's super easy.
00:28:39.200 | - 'Cause I don't know what state of the art is.
00:28:40.600 | - Yeah, I mean, like for a, it's super easy
00:28:42.800 | 'cause like for a given PR, you know what lines are edited.
00:28:45.840 | - Oh, okay.
00:28:46.680 | - Yeah, you know what lines are edited.
00:28:47.520 | - So you can just, you can source it
00:28:48.360 | from SWE-Bench actually.
00:28:49.200 | - Yeah, you can do it, you can do it super easily.
00:28:50.600 | And that's how we got that figure out at the other end.
00:28:53.000 | For us being able to see it against,
00:28:54.600 | our historic models were super useful.
00:28:56.440 | So we could see if we were, you know,
00:28:58.040 | actually helping ourselves or not.
00:28:59.680 | And initially, one of the biggest performance gains
00:29:02.440 | that we saw when we did work on the rag a bit
00:29:04.720 | was giving it the ability to use the LSP
00:29:06.720 | to like go to definition and really try to get it
00:29:08.560 | to emulate how we do that.
00:29:10.560 | Because I'm sure when you go into an editor
00:29:12.960 | where like the LSP is not working or whatever,
00:29:15.480 | you suddenly feel really like disarmed and naked.
00:29:17.520 | You're like, oh my God, I didn't realize
00:29:19.360 | how much I actually use this to get about
00:29:21.040 | rather than just find stuff.
00:29:23.120 | So we really tried to get it to do that.
00:29:24.520 | And that gave us a big jump in performance.
00:29:26.400 | So we went from like 54% up to like the 60s,
00:29:28.880 | but just by adding, focusing on that.
00:29:31.120 | - That's one weird trick.
00:29:32.120 | - Yes.
00:29:33.640 | - I'll briefly comment here.
00:29:35.280 | So this is the standard approach I would say
00:29:37.160 | most code tooling startups are pursuing.
00:29:40.680 | The one company that's not doing this is magic.dev.
00:29:44.000 | - Yes.
00:29:44.840 | - So would you do things differently
00:29:46.760 | if you have a 10 million token context window?
00:29:49.160 | - If I had a 10 million context window
00:29:50.880 | and hundreds of millions of dollars,
00:29:53.400 | I wouldn't have gone and built,
00:29:56.960 | it's an LTM, it's not a transformer they're using, right?
00:30:00.120 | If I'm not mistaken, I believe it's not a transformer.
00:30:01.960 | - Yeah.
00:30:02.800 | - Eric's gonna come on at some point.
00:30:03.640 | - I'm just, listen, they obviously know a lot more
00:30:05.720 | about their product than I do.
00:30:06.600 | I don't know a great deal about how magic works.
00:30:08.280 | - Nobody knows anything yet.
00:30:09.120 | - Yeah, so I'm not gonna speculate.
00:30:12.600 | Would I do it the same way as them?
00:30:14.480 | I like the way we've done it because fundamentally,
00:30:17.120 | like we focus on the act of software engineering
00:30:22.120 | and what that looks like.
00:30:23.320 | And showing models how to do that.
00:30:25.360 | Fundamentally, the underlying model that we use
00:30:28.280 | is kind of null to us.
00:30:30.320 | Like so long as it's the best one, I don't mind.
00:30:32.560 | And the context windows we've already seen,
00:30:34.760 | like you can get transformers to have like million,
00:30:37.320 | one and a half million token context windows.
00:30:40.120 | And that works perfectly well.
00:30:41.520 | So like as soon as you can fine tune Gemini 1.5,
00:30:45.040 | then you best be sure that Genie will run on Gemini 1.5
00:30:48.600 | and like we'll probably get very good performance
00:30:50.160 | out of that.
00:30:51.000 | I like our approach 'cause we can be super agile
00:30:52.760 | and be like, "Oh, well, Anthropic have just released
00:30:54.480 | "whatever and it might have half a million tokens
00:30:57.080 | "and it might be really smart."
00:30:58.280 | And I can just immediately take my JSONL file
00:31:00.520 | and just dump it in there and suddenly Genie works on there
00:31:02.480 | and it can do all the new things.
00:31:04.040 | - Does Anthropic have the same fine tuning support
00:31:06.320 | as OpenAI?
00:31:07.160 | I actually haven't heard anyone do it.
00:31:08.840 | - They are working on it.
00:31:09.960 | They are partnered with AWS and it's gonna be in Bedrock.
00:31:13.080 | As far as I know, I think that's true.
00:31:16.960 | - Cool.
00:31:17.800 | We have to keep moving on to the other segments.
00:31:19.640 | Planning.
00:31:20.480 | The second piece of your four-step grandmaster plan.
00:31:23.800 | That is the frontier right now.
00:31:25.560 | A lot of people are talking about Strawberry,
00:31:27.520 | Q*, whatever that is.
00:31:29.280 | Monte Carlo Tree Search.
00:31:30.880 | Is current state-of-the-art planning good enough?
00:31:33.320 | What prompts have worked?
00:31:35.120 | I don't even know what questions to ask.
00:31:36.320 | Like, what is the state of planning?
00:31:37.920 | - I think it's fairly obvious
00:31:38.760 | that with the foundational models,
00:31:40.400 | like you can ask them to think by step by step
00:31:42.000 | and ask them to plan and stuff,
00:31:43.480 | but that isn't enough
00:31:44.440 | because if you look at how those models score
00:31:46.000 | on these benchmarks,
00:31:46.840 | then they're not even close to state-of-the-art.
00:31:48.920 | - Which ones are you referencing?
00:31:50.080 | - So like just like SWE-Bench and so on, right?
00:31:52.840 | And like even the things that get really good scores
00:31:55.040 | on HumanEval, agents as well,
00:31:56.320 | 'cause they have these loops, right?
00:31:57.840 | Obviously these things can reason, quote unquote,
00:32:00.560 | but the reasoning is the model,
00:32:03.680 | it's constrained by the model's intelligence,
00:32:05.640 | I'd say, very crudely.
00:32:07.320 | And what we essentially wanted to do
00:32:08.920 | was we still thought,
00:32:09.760 | obviously reasoning is super important.
00:32:11.040 | We need it to get the performance we have,
00:32:13.200 | but we wanted the reasoning to emulate
00:32:15.160 | how we think about problems when we're solving them,
00:32:17.240 | as opposed to how a model thinks about a problem
00:32:19.240 | when it's solving it.
00:32:20.200 | And that's obviously part of like the derivation pipeline
00:32:23.120 | that we have when we design our data.
00:32:25.880 | But the reasoning that the models do right now,
00:32:28.520 | and who knows what Q*,
00:32:30.280 | whatever it ends up being called, looks like,
00:32:32.760 | but certainly what I'm excited,
00:32:34.040 | on a small tangent to that,
00:32:35.440 | like what I'm really excited about
00:32:36.760 | is when models like that come out,
00:32:38.200 | obviously the signal in my data,
00:32:39.560 | when I regenerate it, goes up.
00:32:41.440 | And then I can then train that model
00:32:43.040 | that's already better at reasoning
00:32:44.280 | with improved reasoning data
00:32:46.280 | and just like I can keep bootstrapping
00:32:47.760 | and keep leapfrogging every single time.
00:32:49.520 | And that is like super exciting to me
00:32:51.600 | 'cause I welcome like new models so much
00:32:53.960 | because immediately it just floats me up
00:32:56.480 | without having to do much work, which is always nice.
00:32:58.720 | But at the state of reasoning generally,
00:33:00.840 | I don't see it going away anytime soon.
00:33:02.960 | I mean, that's like an autoregressive model
00:33:04.640 | doesn't think per se.
00:33:06.480 | And in the absence of having any thought,
00:33:08.840 | maybe an energy-based model or something like that,
00:33:11.360 | maybe that's what Q* is, who knows,
00:33:13.080 | some sort of like high level abstract space
00:33:16.120 | where thought happens before tokens get produced.
00:33:19.040 | In the absence of that for the moment,
00:33:20.680 | I think it's all we have
00:33:22.000 | and it's gonna have to be the way it works.
00:33:23.840 | For what happens in the future, we'll have to see,
00:33:26.360 | but I think certainly it's never going
00:33:27.800 | to hinder performance to do it.
00:33:29.200 | And certainly the reasoning that we see Genie do
00:33:33.160 | when you compare it to like,
00:33:34.520 | if you ask GPT-4 to break down a step-by-step
00:33:38.040 | approach for the same problem,
00:33:39.680 | at least just on a vibe check alone looks far better.
00:33:42.920 | - Two elements that I like
00:33:45.200 | that I didn't see in your initial video,
00:33:47.000 | we'll see when this Genie launches,
00:33:49.880 | is a planner chat,
00:33:51.520 | which is I can modify the plan while it's executing.
00:33:54.360 | And then the other thing is playbooks,
00:33:55.760 | which also from Devin,
00:33:57.240 | where here's how I like to do a thing
00:33:59.880 | and I'll use Markdown to specify how I do it.
00:34:02.920 | I'm just curious if like, you know, those things help.
00:34:05.640 | - Yeah, no, absolutely.
00:34:06.480 | We're a hundred percent.
00:34:07.320 | We want everything to be editable,
00:34:09.120 | not least because it's really frustrating when it's not.
00:34:11.200 | Like if you're ever in a situation
00:34:12.840 | where like there's the one thing I just wish I could,
00:34:15.720 | and you'd be right if that one thing was right
00:34:17.480 | and you can't change it.
00:34:18.520 | So we're going to make everything editable,
00:34:19.520 | including the code it writes.
00:34:20.560 | Like you can, if it makes a small error in a patch,
00:34:22.960 | you can just change it yourself and let it continue
00:34:24.680 | and it will be fine.
00:34:25.760 | So yeah, like those things are super important.
00:34:27.560 | We'll be doing those too.
00:34:28.640 | - I'm curious, once you get to writing code,
00:34:30.960 | is most of the job done?
00:34:32.640 | I feel like the models are so good at writing code
00:34:34.720 | when they're like in small chunks
00:34:36.560 | that are like very well-instructed.
00:34:38.320 | What's kind of the drop off in the funnel?
00:34:40.160 | Like once you get to like,
00:34:41.360 | you got the right files and you got the right plan.
00:34:43.680 | - That's a great question because by the time this is out,
00:34:46.480 | there'll be another blog post.
00:34:47.960 | Yeah, there'll be another blog post,
00:34:49.920 | which contains all the learnings that I delivered
00:34:52.880 | to OpenAI's fine-tuning team
00:34:54.160 | when we finally got the score.
00:34:55.600 | - Oh, that's good.
00:34:56.640 | Go for it, it's already out.
00:34:58.480 | - Yeah, I don't have it on my phone,
00:34:59.800 | but basically I broke down the log probs.
00:35:04.800 | I basically got the average log prob for a token
00:35:08.400 | at every token position in the context window.
00:35:10.320 | So imagine an X-axis from zero to 128K,
00:35:13.160 | and then the average log prob for each index in there.
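
A small sketch of that analysis, assuming you have collected per-token log probs from many runs (for example via the logprobs option on the completions API); the bucketing here is illustrative.

```python
# Sketch of the analysis described above: average token log probs by absolute position
# in the context window to see where the model is confident and where it isn't.
from collections import defaultdict

def average_logprob_by_position(runs: list[list[float]], bucket: int = 1000) -> dict[int, float]:
    """runs: one list of per-token log probs per completion, indexed from the start of the context."""
    sums, counts = defaultdict(float), defaultdict(int)
    for logprobs in runs:
        for pos, lp in enumerate(logprobs):
            b = (pos // bucket) * bucket
            sums[b] += lp
            counts[b] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}

# curve = average_logprob_by_position(all_runs)
# -> {0: -0.9, 1000: -0.7, ...}; values closer to 0 mean the model is more certain.
```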
00:35:16.160 | As we discussed, like the way Genie works normally is,
00:35:18.640 | you know, at the beginning you do your rag
00:35:20.120 | and then you do your planning and then you do your coding
00:35:21.680 | and that sort of cycle continues.
00:35:23.280 | The certainty of code writing is so much more certain
00:35:26.840 | than every other aspect of Genie's loop.
00:35:29.240 | So whatever's going on under the hood,
00:35:30.720 | the model is really comfortable with writing code.
00:35:32.520 | There is no doubt and it's like in the token probabilities.
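A minimal sketch of the kind of analysis described here, assuming you have per-token log probs saved from evaluation runs; the file layout and the `token_logprobs` field are assumptions for illustration, not Cosine's actual tooling:

```python
import json
from collections import defaultdict

BUCKET = 1024  # group positions into 1K-token buckets for readability

def avg_logprob_by_position(eval_files):
    """Average token log prob at each context-window position across many runs."""
    totals, counts = defaultdict(float), defaultdict(int)
    for path in eval_files:
        with open(path) as f:
            run = json.load(f)  # assumed shape: {"token_logprobs": [float, ...]}
        for pos, lp in enumerate(run["token_logprobs"]):
            bucket = pos // BUCKET
            totals[bucket] += lp
            counts[bucket] += 1
    # x-axis: bucket start index (0..128K), y-axis: mean log prob in that bucket
    return {b * BUCKET: totals[b] / counts[b] for b in sorted(totals)}
```

Buckets where the mean log prob sits noticeably higher would correspond to the code-writing phase, the part of the loop the model is most "certain" about.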
00:35:35.680 | One slightly different thing, I think,
00:35:37.400 | to how most of these models work is,
00:35:40.480 | at least for the most part,
00:35:41.840 | if you ask GPT4 in chat GPT to edit some code for you,
00:35:45.440 | it's going to rewrite the entire snippet for you
00:35:47.360 | with the changes in place.
00:35:48.800 | We train Genie to write diffs and, you know,
00:35:51.280 | essentially patches, right?
00:35:52.320 | Because it's more token efficient
00:35:53.960 | and that is also fundamentally,
00:35:56.880 | we don't write patches as humans,
00:35:58.480 | but it's like the result of what we do is a patch, right?
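To make the diff-over-rewrite point concrete, here is a toy illustration using Python's standard `difflib` (not Genie's actual patch format): a unified diff carries only the changed hunks plus a little context, so it costs far fewer tokens than re-emitting the whole file the way a chat model normally would.

```python
import difflib

# Hypothetical "before" and "after" versions of a file being edited.
original = """def total(prices):
    return sum(prices)
""".splitlines(keepends=True)

edited = """def total(prices, tax=0.0):
    return sum(prices) * (1 + tax)
""".splitlines(keepends=True)

# Only the changed hunk (plus context) is emitted, not the whole file.
patch = "".join(difflib.unified_diff(original, edited,
                                     fromfile="a/pricing.py",
                                     tofile="b/pricing.py"))
print(patch)
```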
00:36:01.440 | When Genie writes code,
00:36:04.200 | I don't know how much it's leaning
00:36:05.920 | on the pre-training like code writing corpus,
00:36:08.280 | because obviously it's just read code files there.
00:36:10.680 | It's obviously probably read a lot of patches,
00:36:12.240 | but I would wager it's probably read more code files
00:36:14.080 | than it has patches.
00:36:14.920 | So it's probably leaning on a different part of its brain
00:36:16.840 | is my speculation.
00:36:17.680 | I have no proof for this.
00:36:18.920 | So I think the discipline of writing code
00:36:20.840 | is slightly different,
00:36:21.680 | but certainly is its most comfortable state
00:36:24.200 | when it's writing code.
00:36:25.760 | So once you get to that point,
00:36:27.640 | so long as you're not too deep into the context window,
00:36:29.600 | another thing that I'll bring up in that blog post
00:36:31.600 | is performance of Genie
00:36:33.680 | over the length of the context window,
00:36:35.840 | which degrades fairly linearly.
00:36:38.680 | So actually, I broke it down
00:36:41.120 | by probability of solving a SweeBench issue,
00:36:44.360 | given the number of tokens of the context window.
00:36:46.400 | At 60K, it's basically 0.5.
00:36:48.920 | So if you go over 60K in context length,
00:36:51.600 | you are more likely to fail than you are to succeed
00:36:53.760 | just based on the amount of tokens
00:36:55.360 | you have on the context window.
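A rough sketch of how such a curve could be computed from evaluation records; the record shape and bucket size are assumptions:

```python
def solve_rate_by_context_length(records, bucket=10_000):
    """records: [{"context_tokens": int, "resolved": bool}, ...] (assumed shape)."""
    buckets = {}
    for r in records:
        b = r["context_tokens"] // bucket * bucket
        wins, total = buckets.get(b, (0, 0))
        buckets[b] = (wins + r["resolved"], total + 1)
    # Solve rate per context-length bucket; where this crosses 0.5 is the
    # point past which a run is more likely to fail than succeed.
    return {b: wins / total for b, (wins, total) in sorted(buckets.items())}
```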
00:36:56.680 | And when I presented that to the fine tuning team
00:36:59.000 | at OpenAI, that was super interesting to them as well.
00:37:01.560 | And that is more of a foundational model attribute
00:37:05.840 | than it is an us attribute.
00:37:07.320 | However the attention mechanism works in GPT-4,
00:37:10.480 | however, you know, they deal with the context window
00:37:12.880 | at that point, is, you know,
00:37:14.800 | influencing how Genie is able to perform.
00:37:17.040 | Even though obviously all our training data is perfect,
00:37:19.560 | right, so even if like stuff is being solved
00:37:21.440 | in 110,000 tokens, sort of that area,
00:37:24.800 | the training data still shows it being solved there,
00:37:27.040 | but it's just in practice, the model is finding it
00:37:29.000 | much harder to solve stuff
00:37:30.040 | down that end of the context window.
00:37:31.800 | - Does that scale with the context,
00:37:33.240 | so for a 200K context size, is 100K tokens like the 0.5 point?
00:37:38.240 | - I don't know.
00:37:39.520 | - Yeah, yeah, yeah.
00:37:40.360 | - Yeah, but I hope not.
00:37:42.720 | I hope you don't just take the context length
00:37:44.320 | and halve it and then say,
00:37:45.160 | "Oh, this is the usable context length."
00:37:46.960 | But what's been interesting is knowing that,
00:37:49.720 | actually really digging into the data,
00:37:51.200 | looking at the log probs,
00:37:52.160 | looking at how it performs over the entire window,
00:37:54.920 | it's influenced the short-term improvements
00:37:57.720 | we've made to Genie since we got that score.
00:38:01.200 | So we actually made some small optimizations
00:38:03.640 | to try to make sure as best we can without overdoing it,
00:38:08.400 | trying to make sure that we can artificially
00:38:10.240 | make sure stuff sits within that sort of range
00:38:12.400 | because we know that's our sort of battle zone.
00:38:14.360 | And if we go outside of that,
00:38:15.440 | we're starting to push the limits,
00:38:16.720 | we're more likely to fail.
00:38:18.160 | So just doing that sort of analysis has been super useful
00:38:20.680 | without actually messing with anything more structural
00:38:24.600 | and getting more performance out of it.
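One plausible, purely hypothetical version of that optimization is a simple token budget on retrieved context, so runs tend to stay inside the range where the solve rate is above 0.5; the budget value, `count_tokens`, and the chunk format are all assumptions:

```python
TOKEN_BUDGET = 60_000  # empirically where the solve rate crossed 0.5

def fit_to_budget(fixed_parts, retrieved_chunks, count_tokens):
    """Keep the highest-ranked retrieved chunks that still fit the budget.

    fixed_parts: strings that must be present (issue, plan, current diff).
    retrieved_chunks: ranked list of (score, text) from code retrieval.
    count_tokens: tokenizer-specific counting function (assumed provided).
    """
    used = sum(count_tokens(p) for p in fixed_parts)
    kept = []
    for _, text in sorted(retrieved_chunks, reverse=True):
        cost = count_tokens(text)
        if used + cost > TOKEN_BUDGET:
            continue  # skip chunks that would push us past the battle zone
        kept.append(text)
        used += cost
    return kept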
00:38:26.200 | - What about different languages?
00:38:28.200 | So in your technical report,
00:38:29.720 | the data makes this 21% JavaScript, 21% Python,
00:38:33.480 | 14% TypeScript, 14% TSX.
00:38:36.960 | - Which is JavaScript, JavaScript, JavaScript.
00:38:38.720 | - Yeah, yeah, yeah.
00:38:39.560 | - Yes, yeah, yeah, that's true.
00:38:40.400 | - It's like 49% JavaScript.
00:38:41.240 | - That's true, that's true.
00:38:42.080 | Although TypeScript is so much superior, but anyway.
00:38:43.600 | - Do you see, how good is it at just generalizing?
00:38:46.400 | If you're writing Rust or C++ or whatever else,
00:38:50.200 | it's quite different?
00:38:51.640 | - It's pretty good at generalizing.
00:38:53.800 | Obviously, though, I think there's 15 languages
00:38:55.640 | in that technical report, I think, that we've covered.
00:38:58.440 | The ones that we picked in the highest mix
00:39:00.920 | were the ones that, selfishly, we internally use the most,
00:39:05.000 | and also that are, I'd argue,
00:39:06.960 | some of the most popular ones.
00:39:08.320 | When we have more resource as a company and more time,
00:39:12.640 | and once all the craziness that has just happened
00:39:14.840 | sort of dies down a bit,
00:39:15.680 | we are going to work on that mix.
00:39:17.360 | I'd love to see everything ideally be represented
00:39:20.600 | in a similar level as it is.
00:39:22.400 | If you took GitHub as a data set,
00:39:25.280 | if you took how are the languages broken down
00:39:27.480 | in terms of popularity,
00:39:28.400 | that would be my ideal data mix to start.
00:39:30.840 | It's just that it's not cheap doing this.
00:39:32.840 | So, yeah, trying to have an equal amount of Ruby and Rust
00:39:37.840 | and all these different things at our current state
00:39:41.000 | is not really what we're looking for.
00:39:43.080 | - There's a lot of good Ruby in my GitHub profile.
00:39:45.240 | You can have it all.
00:39:46.080 | - Well, okay, perfect, we'll just train on that.
00:39:48.240 | - For running tests, it sounds easy, but it isn't,
00:39:51.280 | especially when you're working in enterprise codebases
00:39:53.960 | that are kind of very hard to spin up.
00:39:56.080 | How do you set that up?
00:39:57.200 | It's like, how do you make a model
00:39:58.840 | actually understand how to run a codebase,
00:40:00.960 | which is different than writing code for a codebase?
00:40:03.680 | - The model itself is not in charge
00:40:05.800 | of setting up the codebase and running it.
00:40:07.840 | So Genie sits on top of GitHub,
00:40:09.480 | and if you have CI running GitHub,
00:40:11.880 | you have GitHub actions and stuff like that,
00:40:13.480 | then Genie essentially makes a call out to that,
00:40:16.040 | runs your CI, sees the outputs, and then moves on.
00:40:19.680 | Making a model itself set up a repo
00:40:23.280 | wasn't scoped in what we wanted Genie to be able to do,
00:40:26.160 | because for the most part, at least most enterprises
00:40:29.280 | have some sort of CI pipeline running,
00:40:31.040 | and a lot of, if you're doing some,
00:40:32.840 | even a lot of hobbyist software development
00:40:35.080 | has some sort of basic CI running as well.
00:40:37.240 | And that was the lowest hanging fruit approach that we took.
00:40:39.840 | So when Genie ships, the way it will run its own code
00:40:42.440 | is it will basically run your CI,
00:40:43.720 | and it will take the, I'm not in charge of writing this,
00:40:47.640 | the rest of the team is,
00:40:48.480 | but I think it's the Checks API on GitHub that
00:40:50.440 | allows you to grab that information
00:40:52.000 | and throw it in the context window.
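For reference, the GitHub Checks API does expose check-run results per commit; a hedged sketch of pulling and condensing them for a context window (not necessarily how the Genie team implemented it) might look like:

```python
import requests

def ci_results_for_commit(owner, repo, sha, token):
    """Fetch check runs for a commit via the GitHub Checks API and condense them."""
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/commits/{sha}/check-runs",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        timeout=30,
    )
    resp.raise_for_status()
    runs = resp.json()["check_runs"]
    # Keep just enough to be useful inside a context window.
    return [
        {"name": r["name"], "conclusion": r["conclusion"],
         "summary": (r["output"]["summary"] or "")[:500]}
        for r in runs
    ]
```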
00:40:53.480 | - What's the handoff like with the person?
00:40:56.360 | So Genie, you give it a task,
00:40:58.800 | and then how long are you supposed to supervise it for?
00:41:02.400 | Or are you just waiting for the checks to eventually run,
00:41:05.560 | and then you see how it goes?
00:41:06.800 | Like, what does it feel like?
00:41:08.280 | - There are a couple of modes that it can run in.
00:41:10.480 | Essentially, it can run in fully headless autonomous modes.
00:41:13.080 | So say you assign it a ticket in linear or something,
00:41:15.920 | then it won't ask you for anything.
00:41:17.960 | It will just go ahead and try.
00:41:19.960 | Or if you're in the GUI on the website and you're using it,
00:41:22.960 | then you can give it a task,
00:41:24.240 | and it might choose to ask you a clarifying question.
00:41:26.960 | So if you ask it something super broad,
00:41:29.320 | it might just come back to you and say,
00:41:30.800 | what does that actually mean?
00:41:31.840 | Or can you point me in the right direction for this?
00:41:33.360 | Because our decision internally
00:41:36.040 | was it's gonna piss people off way more
00:41:38.720 | if it just goes off and makes a completely
00:41:41.680 | ruined attempt at it,
00:41:42.680 | because it just, from day one, got the wrong idea.
00:41:45.600 | So it can ask you a lot of questions.
00:41:48.320 | And once it's going, much like a regular PR,
00:41:51.400 | you can leave review comments, issue comments,
00:41:54.400 | all these different things.
00:41:55.400 | And it, because it's been trained
00:41:57.360 | to be a software engineering colleague,
00:41:58.640 | responds in actually a better way than a real colleague,
00:42:01.160 | because it's less snarky and less high and mighty.
00:42:04.800 | And also the amount of filtering it has had to do for LGTMs.
00:42:07.520 | When you train a model to be a software engineer,
00:42:11.120 | essentially, you can't just use anything.
00:42:12.480 | It's like, yeah, it looks good to me, bro.
00:42:13.840 | - Sure. (laughs)
00:42:15.640 | I just wanted to dive in a little bit more
00:42:17.120 | on your experience with the fine-tuning team.
00:42:19.280 | John Allard was publicly very complimentary and supportive
00:42:22.840 | and, you know, was part of it.
00:42:24.240 | Like, what is it like working with them?
00:42:25.720 | I also picked up that you initially started to fine-tune
00:42:29.600 | what was publicly available, the 16 to 32K range.
00:42:32.960 | You got access to do more than that.
00:42:35.080 | You've also trained on billions of tokens
00:42:37.320 | instead of the usual millions range.
00:42:40.000 | Just like, take us through that fine-tuning journey
00:42:42.400 | and any advice that you may have.
00:42:43.840 | - It's been so cool.
00:42:45.720 | And this will be public by the time this goes out.
00:42:47.760 | Like, OpenAI themselves have said,
00:42:49.520 | we are pushing the boundaries
00:42:50.680 | of what is possible with fine-tuning.
00:42:52.480 | Like, we are right on the edge.
00:42:53.640 | And like, we are working, genuinely working with them
00:42:57.200 | in figuring out how stuff works, what works,
00:42:59.160 | what doesn't work, because no one's doing,
00:43:01.200 | no one else is doing what we're doing.
00:43:03.120 | They have found what we've been working on
00:43:04.640 | super interesting, which is why they've allowed us
00:43:06.680 | to do so much, like, interesting stuff.
00:43:09.080 | Working with John, I mean,
00:43:09.920 | I had a really good conversation with John yesterday.
00:43:11.560 | We had a little brainstorm after the video we shot.
00:43:14.120 | And one of the things,
00:43:15.880 | you mentioned the billions of tokens.
00:43:17.400 | One of the things we've noticed,
00:43:18.400 | and it's actually a very interesting problem
00:43:19.600 | for them as well,
00:43:20.440 | when you're building like a self-serve fine-tuning API,
00:43:22.720 | they have to decide how big your PEFT adapter,
00:43:26.840 | your LoRA adapter, is going to be in some way.
00:43:28.520 | And like, figuring that out
00:43:29.440 | is actually a really interesting problem.
00:43:31.000 | Because if you make it too big,
00:43:33.240 | because they support data sets that are so small,
00:43:34.840 | you can put like 20 examples through it
00:43:36.080 | or something like that.
00:43:36.920 | Like, if you had a really sparse, large adapter,
00:43:39.000 | you're not going to get any signal in that at all.
00:43:40.840 | So they have to dynamically size these things.
00:43:42.600 | And there is an upper bound.
00:43:43.520 | And actually, we use models that are larger
00:43:47.480 | than what's publicly available.
00:43:48.680 | It's not even publicly available yet,
00:43:49.680 | but when this goes out, it will be.
00:43:51.720 | But we have larger LoRA adapters available to us,
00:43:56.200 | just because the amount of data
00:43:57.400 | that we're pumping through it.
00:43:58.640 | And at that point,
00:43:59.560 | you start seeing really interesting other things,
00:44:02.640 | like you have to change your learning rate schedule
00:44:05.000 | and do all these different things
00:44:06.160 | that you don't have to do
00:44:07.440 | when you're on the smaller end of things.
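As an illustration only, dynamically sizing a LoRA adapter with data volume could be sketched with the `peft` library roughly like this; the rank thresholds and target modules are made up and are not OpenAI's internal sizing logic:

```python
from peft import LoraConfig

def lora_config_for_dataset(num_training_tokens: int) -> LoraConfig:
    """Illustrative heuristic: scale adapter rank with data volume so a tiny
    dataset doesn't get a sparse, under-trained adapter and a huge one isn't
    capacity-starved. Thresholds and ranks here are invented for the sketch."""
    if num_training_tokens < 1_000_000:
        rank = 8
    elif num_training_tokens < 100_000_000:
        rank = 32
    else:
        # Billions of tokens: bigger adapter, and you likely also want a
        # different learning-rate schedule at this scale.
        rank = 128
    return LoraConfig(r=rank, lora_alpha=2 * rank,
                      target_modules=["q_proj", "v_proj"])
```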
00:44:08.760 | So working with that team is such a privilege,
00:44:11.560 | because obviously they're like at the top of their field
00:44:13.880 | in the fine-tuning space.
00:44:15.520 | So as we learn stuff, they're learning stuff.
00:44:19.320 | And one of the things that I think
00:44:20.600 | really catalyzed this relationship
00:44:22.000 | is when we first started working on Genie,
00:44:23.720 | like I delivered them a presentation,
00:44:25.200 | which will eventually become the blog post
00:44:26.640 | that you'll love to read soon.
00:44:28.160 | The information I gave them there,
00:44:29.400 | I think is what showed them like,
00:44:30.240 | "Oh, wow, okay, these guys are really like
00:44:32.720 | pushing the boundaries of what we can do here."
00:44:35.120 | And truthfully, our data set,
00:44:37.440 | we view our data set right now as very small.
00:44:39.640 | It's like the minimum that we're able to afford,
00:44:42.400 | literally afford right now
00:44:43.560 | to be able to produce a product like this.
00:44:45.480 | And it's only gonna get bigger.
00:44:46.760 | So yesterday while I was in their offices,
00:44:48.440 | I was basically, so we were planning,
00:44:50.080 | we were like, okay, how,
00:44:51.480 | this is where we're going in the next six to 12 months.
00:44:53.880 | Like we're putting our foot on the gas here,
00:44:55.880 | 'cause this clearly works.
00:44:56.840 | Like I've demonstrated this is a good,
00:44:58.640 | you know, the best approach so far.
00:45:00.840 | And I wanna see where it can go.
00:45:01.840 | I wanna see what the scaling was like for the data.
00:45:03.720 | And at the moment, like it's hard to figure that out
00:45:05.440 | because you don't know when you're running into like
00:45:08.040 | saturating a PEFT adapter,
00:45:09.360 | as opposed to actually like, is this the model's limit?
00:45:11.680 | Like, where is that?
00:45:12.520 | So finding all that stuff out
00:45:13.720 | is the work we're actively doing with them.
00:45:16.200 | And yeah, it's gonna get more and more collaborative
00:45:18.960 | over the next few weeks as we explore like larger adapters,
00:45:22.320 | pre-training extension, different things like that.
00:45:24.640 | - Awesome.
00:45:25.480 | I also wanted to talk briefly
00:45:26.360 | about the synthetic data process.
00:45:29.160 | One of your core insights was that
00:45:30.800 | the vast majority of the time,
00:45:32.040 | the code that is published by a human is in a working state.
00:45:35.480 | And actually you need to fine tune on non-working code.
00:45:37.920 | - Yes.
00:45:38.760 | - So just, yeah, take us through that inspiration.
00:45:40.760 | How many rounds did you do?
00:45:43.480 | - Yeah, I mean, it might be generous to say
00:45:45.960 | that the vast majority of code is in a working state.
00:45:47.840 | I don't know if I believe that.
00:45:48.680 | - Yeah, I don't know if I believe that.
00:45:49.520 | - I was like, that's very nice of you to say
00:45:51.280 | that my code works.
00:45:52.120 | - Certainly, it's not true for me.
00:45:54.680 | No, I think that, so yeah, no, but it was, you're right.
00:45:57.560 | It's an interesting problem.
00:45:58.400 | And what we saw was when we didn't do that,
00:46:01.600 | obviously you have to basically like one-shot the answer.
00:46:04.600 | 'Cause after that it's like,
00:46:05.520 | well, I've never seen iteration before.
00:46:07.120 | How am I supposed to figure out how this works?
00:46:08.760 | So what you're alluding to there
00:46:11.960 | is like the self-improvement loop
00:46:13.280 | that we started working on.
00:46:14.840 | And that was in sort of two parts.
00:46:16.040 | We synthetically generated runtime errors
00:46:19.440 | where we would intentionally mess with the AST
00:46:23.400 | to make stuff not work or index out of bounds
00:46:26.840 | or refer to a variable that doesn't exist
00:46:28.800 | or errors that the foundational models
00:46:31.920 | just make sometimes that you can't really avoid.
00:46:34.080 | You can't expect it to be perfect.
00:46:36.040 | So we threw some of those in
00:46:37.200 | with a probability of happening.
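A toy example of that AST-mutation idea, assuming Python source: it breaks one variable reference so the code fails at runtime with a NameError. It only illustrates the approach, it is not Cosine's pipeline.

```python
import ast
import random

class BreakOneName(ast.NodeTransformer):
    """Rename one loaded variable so the code raises NameError at runtime."""
    def __init__(self):
        self.done = False

    def visit_Name(self, node):
        if not self.done and isinstance(node.ctx, ast.Load) and random.random() < 0.3:
            self.done = True
            broken = ast.Name(id=node.id + "_undefined", ctx=ast.Load())
            return ast.copy_location(broken, node)
        return node

def inject_runtime_error(source: str, probability: float = 0.2) -> str:
    """With some probability, return a subtly broken version of the source."""
    if random.random() > probability:
        return source
    tree = BreakOneName().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # requires Python 3.9+
```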
00:46:39.520 | And on the self-improvement side,
00:46:41.480 | I spoke about this in the blog post,
00:46:43.400 | essentially the idea is that
00:46:45.320 | you generate your data in sort of batches.
00:46:48.480 | First batch is like perfect, like one example,
00:46:50.680 | like here's the problem, here's the answer,
00:46:52.160 | go train the model on it.
00:46:53.840 | And then for the second batch,
00:46:55.560 | you then take the model that you trained before
00:46:58.240 | that can look like one commit into the future.
00:47:00.600 | And then you let it have the first attempt
00:47:02.560 | at solving the problem.
00:47:03.760 | And hopefully it gets it wrong.
00:47:05.640 | And if it gets it wrong,
00:47:06.720 | then you have like, okay,
00:47:08.080 | now the code base is in this incorrect state,
00:47:09.760 | but I know what the correct state is.
00:47:11.040 | So I can do some diffing essentially
00:47:13.400 | to figure out how do I get the state that it's in now
00:47:16.160 | to the state that I want it in.
00:47:17.560 | And then you can train the model
00:47:18.720 | to then produce that diff next
00:47:20.520 | and so on and so on and so on.
00:47:21.920 | So the model can then learn
00:47:23.760 | and also reason as to why it needs to make these changes
00:47:26.760 | to be able to learn how to like learn,
00:47:28.480 | like solve problems iteratively
00:47:30.440 | and learn from its mistakes and stuff like that.
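A hedged pseudocode sketch of that self-improvement loop, with `attempt` and `train` as placeholders for the actual inference and fine-tuning calls; the data shapes are assumptions:

```python
import difflib

def correction_example(problem, wrong_files, correct_files):
    """Build a training example that teaches the model to recover from its own mistake."""
    # Diff each file the model got wrong against the known-correct version.
    fixes = []
    for path in wrong_files:
        fixes.append("".join(difflib.unified_diff(
            wrong_files[path].splitlines(keepends=True),
            correct_files[path].splitlines(keepends=True),
            fromfile=f"a/{path}", tofile=f"b/{path}")))
    return {"prompt": problem + "\n\n[previous attempt applied]",
            "target": "\n".join(fixes)}

def self_improvement_round(prev_model, problems, gold_states, train, attempt):
    """One batch: attempt with the last checkpoint, keep failures as correction data.

    `train` and `attempt` are placeholders for your fine-tuning and inference calls.
    """
    batch = []
    for problem, gold in zip(problems, gold_states):
        attempted = attempt(prev_model, problem)   # files after the model's first try
        if attempted != gold:                      # it got it wrong: that's the signal
            batch.append(correction_example(problem, attempted, gold))
    return train(prev_model, batch)                # next checkpoint
```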
00:47:32.360 | - And you pick the size of the data set
00:47:34.200 | just based on how much money you could spend generating it.
00:47:36.880 | Maybe you think you could just make more and get better.
00:47:39.400 | - A multiple of my monthly burn,
00:47:40.640 | I don't want to say how much, was spent doing this.
00:47:42.200 | Yeah, basically it was very much related to,
00:47:45.040 | yeah, just like capital.
00:47:46.160 | And yes, with any luck that will be alleviated soon.
00:47:50.040 | - Very soon.
00:47:51.120 | I like drawing references to other things
00:47:52.880 | that are happening in the wild.
00:47:54.520 | So, 'cause we only get to release this podcast
00:47:56.600 | once a week, the Lama 3 paper
00:47:58.440 | also had some really interesting thoughts
00:48:00.800 | on synthetic data for code.
00:48:02.400 | I don't know if you have reviewed that.
00:48:05.080 | I'll highlight the back translation section
00:48:07.880 | because one of your data set focuses
00:48:10.400 | is updating documentation.
00:48:12.120 | I think that translation between natural language,
00:48:14.240 | English versus code and back and forth,
00:48:16.120 | I think is actually a really ripe source of synthetic data.
00:48:19.760 | And Lama 3 specifically called out
00:48:21.560 | that they trained on that.
00:48:23.360 | We should have gone more into that
00:48:24.320 | in our podcast with them,
00:48:25.160 | but we didn't know.
00:48:27.280 | But there's a lot of interesting work
00:48:28.800 | on synthetic data stuff.
00:48:30.280 | We do have to wrap up soon,
00:48:31.200 | but I'm going to briefly touch
00:48:33.160 | on the submission process for SweeBench.
00:48:35.320 | So, you have a 30% state-of-the-art SweeBench result,
00:48:39.200 | but it's not on the leaderboard
00:48:40.440 | because of submission issues.
00:48:41.960 | I don't know if you want to comment on that stuff
00:48:44.680 | versus, we also want to talk about SweeBench verified.
00:48:49.000 | Yeah, just anything on the benchmarking side.
00:48:51.280 | - The potted history of this is quite simple actually.
00:48:54.520 | SweeBench up until, I want to say two weeks ago,
00:48:57.960 | but it might be less than that or more than that.
00:49:00.200 | But I think two weeks ago,
00:49:01.720 | suddenly started mandating what they call trajectories
00:49:04.040 | when you submit.
00:49:04.880 | So, but prior to this,
00:49:06.200 | essentially when you run SweeBench,
00:49:08.200 | you run it through their harness
00:49:09.280 | and out the other end,
00:49:10.120 | you get a report.json,
00:49:11.240 | which is like, here's how many I resolved.
00:49:13.280 | Here's how many I didn't resolve.
00:49:14.360 | These are the IDs, the ones I did.
00:49:15.560 | These ones, the IDs I didn't.
00:49:16.520 | And it gives you any ones that might have errored
00:49:18.160 | or something like that.
00:49:19.600 | And what you would submit would be
00:49:22.480 | all of your model patches that you outputted
00:49:25.040 | and that report.
00:49:26.360 | And then you would like PR that into the SweeBench repo
00:49:28.640 | and that would be it.
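A small sketch of consuming that kind of report; the exact field names in `report.json` vary between harness versions, so treat them as assumptions:

```python
import json

def summarize_report(path="report.json"):
    """Read the harness output and print resolve stats (field names assumed)."""
    with open(path) as f:
        report = json.load(f)
    resolved = report.get("resolved_ids", [])
    unresolved = report.get("unresolved_ids", [])
    errored = report.get("error_ids", [])
    total = len(resolved) + len(unresolved) + len(errored)
    print(f"resolved {len(resolved)}/{total} = {len(resolved)/total:.1%}")
    return set(resolved)
```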
00:49:30.000 | That was still the case
00:49:31.800 | when we made our submission on whatever day it was.
00:49:34.200 | They look at them every Monday.
00:49:35.760 | We submitted it at some point during the week.
00:49:37.640 | I want to say it was four days before that.
00:49:39.920 | And I sort of like sat back and waited.
00:49:43.240 | I assumed it would be fine.
00:49:44.480 | When it came to Monday,
00:49:46.360 | they then said, actually, no, we want model trajectories.
00:49:49.160 | And I was like, okay, let me see what this is.
00:49:51.440 | And so on, I sort of dug into it.
00:49:53.400 | And like model trajectories are essentially
00:49:57.360 | the context window or like the reasoning process
00:49:59.480 | of like show you're working.
00:50:00.560 | How did you get here?
00:50:01.400 | If you do a math exam, show me you're working.
00:50:03.800 | Whereas before they were like,
00:50:04.760 | just give me the final answer.
00:50:05.880 | Now they want to see the working,
00:50:06.720 | which I completely understand why they want to see that.
00:50:08.960 | Like the SweeBench fundamentally
00:50:10.760 | is an academic research project
00:50:12.360 | and they want all the stuff to be open source and public
00:50:14.840 | so people can learn from each other and improve
00:50:16.520 | and so on and on.
00:50:17.360 | That's very good.
00:50:18.200 | I completely agree.
00:50:19.280 | However, at least for us,
00:50:20.720 | and the reason that we're not on the leaderboard
00:50:22.400 | is that obviously the model outputs that we generate
00:50:26.400 | are sort of a mirror of our training data set, right?
00:50:29.280 | Like you train the model to do a certain thing
00:50:31.080 | and output a certain way.
00:50:32.000 | Whatever you output looks like your training data.
00:50:34.560 | For the moment as a closed source company,
00:50:36.400 | like fighting for an edge,
00:50:38.440 | we've decided not to publish that information
00:50:40.760 | for that exact reason.
00:50:41.600 | I don't want someone basically taking my trajectories
00:50:44.480 | and then taking a model that's soon to be GA
00:50:46.400 | and just distilling it immediately
00:50:47.760 | and then having genie for themselves.
00:50:49.360 | And, you know, as a business owner,
00:50:51.760 | that's the decision I've had to make.
00:50:53.520 | The patches are still public.
00:50:55.200 | So like the, dare I say, traditional SweeBench submission,
00:50:58.880 | you can go to our GitHub repo and see it
00:51:00.320 | and run them for yourself
00:51:01.640 | and verify that the numbers come out correctly.
00:51:03.520 | Like that is all, that is the potted reason.
00:51:05.560 | - That's the story.
00:51:06.400 | - That's the story.
00:51:07.240 | - SweeBench verified?
00:51:08.320 | You have a score?
00:51:09.200 | - I do have a score.
00:51:10.320 | I do have a score, 43.8%.
00:51:13.120 | It's one of those things
00:51:13.960 | where like there aren't that many people
00:51:14.920 | on the leaderboard yet.
00:51:15.760 | So you don't know how good or bad that is.
00:51:17.600 | - It's a smaller data set, right?
00:51:19.600 | - Oh, it's great.
00:51:21.160 | So on a tangent, original SweeBench was 2,294.
00:51:25.840 | - Which is expensive.
00:51:26.680 | It's like $8,000 to run.
00:51:29.360 | - Oh, that's cheap.
00:51:30.880 | - That's cheap?
00:51:31.720 | What are you talking about?
00:51:32.560 | - I don't know.
00:51:33.400 | At least for us, I don't even want to say it publicly.
00:51:35.520 | How much it cost us to run that thing.
00:51:39.040 | Expensive, slow, really like crap for iteration
00:51:42.400 | because like, you know, you make a change to your model.
00:51:45.120 | How does it do on SweeBench?
00:51:46.320 | I guess that's why SweeBench Lite existed,
00:51:47.840 | but SweeBench Lite was not a,
00:51:50.120 | it was easy stuff, right?
00:51:51.520 | It wasn't a comprehensive measure of the overall thing.
00:51:53.800 | So we actually had the idea a month ago
00:51:56.320 | to what we were going to call SweeBench Small,
00:51:58.800 | where we were going to try to map out across SweeBench,
00:52:01.320 | like what is the distribution of like problem difficulty
00:52:03.280 | and all these different things,
00:52:04.520 | and try to come up with like 300 examples
00:52:07.080 | that sort of mapped that,
00:52:08.160 | where given a score on SweeBench Small,
00:52:10.840 | you could then predict your SweeBench large score
00:52:13.320 | and sort of go from there.
00:52:14.400 | Fortunately, OpenAI did that for us
00:52:17.040 | and probably much better than we would have done.
00:52:18.720 | They use some human labelers,
00:52:19.920 | and as obviously we're working with OpenAI quite closely,
00:52:24.120 | they talked to us about it
00:52:25.200 | and they, you know, were able to let us know
00:52:28.360 | what the instance IDs were
00:52:29.640 | that were in the new SweeBench version.
00:52:32.960 | And then as soon as I had that,
00:52:34.600 | I could just take the report from the one
00:52:36.600 | that I'd run and just diff them.
00:52:38.280 | And I was like, oh, we got 219 out of 500, which is 43.8%,
00:52:42.400 | which is to my knowledge, at least right now,
00:52:45.560 | state-of-the-art also, which makes sense.
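The "just diff them" step can be sketched as a simple set intersection between the resolved IDs from the full run and the Verified instance IDs; the filenames and fields here are assumptions:

```python
import json

def verified_score(report_path="report.json", verified_ids_path="verified_ids.json"):
    """Score an existing full-benchmark run against the Verified subset."""
    with open(report_path) as f:
        resolved = set(json.load(f).get("resolved_ids", []))  # assumed field name
    with open(verified_ids_path) as f:
        verified = set(json.load(f))                           # list of 500 instance IDs
    hits = resolved & verified
    print(f"{len(hits)} / {len(verified)} = {len(hits) / len(verified):.1%}")
    return hits
```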
00:52:48.160 | But also GPT-4o gets, I believe, 33%,
00:52:51.760 | which is like, I double-checked that, but I believe--
00:52:55.000 | - The August one, the new one.
00:52:56.400 | - Yeah, it's in their blog post.
00:52:58.720 | I can't remember which one it was.
00:52:59.840 | I don't know what the model version was,
00:53:01.040 | but GPT-4o, I believe, gets 33%,
00:53:03.600 | which is obviously like significantly better
00:53:05.720 | than what it got on the original,
00:53:08.760 | like SweeBench, SweeBench, SweeBench.
00:53:09.920 | - 2%.
00:53:10.760 | - Yeah, yeah, yeah, exactly, exactly.
00:53:12.120 | - It was scoring ridiculously low.
00:53:13.240 | - But no, SweeBench verified, like, it's so good.
00:53:16.320 | It's like, it's smaller.
00:53:17.920 | We know that the problems are solvable.
00:53:20.040 | It's not gonna cost me a lot of money to run it.
00:53:23.200 | It keeps my iteration time, you know, lower.
00:53:26.480 | And there are also some things
00:53:28.760 | that we're gonna start to do internally
00:53:30.320 | when we run SweeBench to have more of an idea
00:53:33.320 | of how right our model is.
00:53:34.600 | So one of the things I was talking to John about yesterday
00:53:36.840 | was SweeBench is a pass or fail, right?
00:53:39.200 | Like you either have solved the problem or you haven't.
00:53:41.560 | That is quite sparse.
00:53:43.480 | Like it doesn't give you a huge amount of information
00:53:45.080 | 'cause your model could have got a lot of it right.
00:53:46.680 | Like looking through when you do a math paper,
00:53:48.360 | you could have got the reasoning, you know,
00:53:49.600 | your working, right until like the penultimate step,
00:53:51.400 | and then you get it wrong.
00:53:52.680 | So we're gonna look into ways of measuring,
00:53:56.200 | okay, well, your model got it right up to this line
00:53:58.840 | and then it diverged.
00:54:00.440 | And that's super easy to do
00:54:01.480 | because obviously you know the correct state
00:54:02.840 | of all of those questions.
00:54:04.040 | So I think one of the ways
00:54:05.800 | we're gonna keep improving Genie
00:54:07.080 | is by going more in depth and saying,
00:54:09.800 | okay, for the ones that failed, was it right at any point?
00:54:12.000 | Where did it go wrong?
00:54:13.040 | How did it go wrong?
00:54:14.040 | And then sort of trying to triage those sorts of issues.
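A minimal, hedged sketch of that kind of partial-credit measurement: find where the model's patch first diverges from the gold patch. Real triage would need something smarter than line-by-line comparison, but it shows the idea.

```python
def divergence_point(model_patch: str, gold_patch: str):
    """Return the index of the first differing line, plus the fraction of the
    gold patch that was matched before the model diverged."""
    model_lines = model_patch.splitlines()
    gold_lines = gold_patch.splitlines()
    for i, (m, g) in enumerate(zip(model_lines, gold_lines)):
        if m != g:
            return i, i / max(len(gold_lines), 1)
    # No mismatch in the overlap: one patch is a prefix of the other.
    shorter = min(len(model_lines), len(gold_lines))
    return shorter, shorter / max(len(gold_lines), 1)
```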
00:54:16.720 | - So, future plans: you have mentioned
00:54:18.400 | possibly extending this to an open source model.
00:54:20.240 | But basically I think, you know, what the Genie is
00:54:22.280 | is basically this like proprietary fine tune data set
00:54:24.440 | and process and software that you can add onto any model.
00:54:28.400 | Is that the plan?
00:54:29.240 | That's the next year?
00:54:30.360 | It's gonna just be doing that?
00:54:31.360 | - We're gonna get really,
00:54:33.000 | we're gonna be the best in the world at doing that
00:54:34.680 | and continue being the best in the world at doing that
00:54:36.760 | and throwing in as many models as we can,
00:54:39.960 | seeing what the performance is like
00:54:41.280 | and seeing what things improve performance in what places.
00:54:45.120 | And also making the data set larger
00:54:46.520 | is like one of the biggest things
00:54:47.840 | we're gonna be working on.
00:54:48.960 | - I think one of the decisions before you as a CEO
00:54:51.600 | is how much you have like the house model
00:54:54.240 | be like the one true thing.
00:54:55.640 | And then how much you spend time working on customer models.
00:54:59.920 | - That's the thing that really gets me so excited.
00:55:02.960 | Genuinely, like we have a version of Genie
00:55:06.680 | that we named after one of our employees.
00:55:08.360 | (all laughing)
00:55:09.760 | It's called the John.
00:55:11.000 | We have a version of Genie
00:55:13.880 | that is fine tuned on our code base.
00:55:15.680 | So we basically, it's the base Genie
00:55:17.600 | and then we run the same data pipeline
00:55:19.760 | that we run on like all the stuff that we did
00:55:21.400 | to generate the main data set on our repo.
00:55:24.080 | And then all of a sudden you have like something
00:55:25.880 | that is both very good at software engineering
00:55:27.240 | but is also extremely good at your repo.
00:55:29.600 | And that is phenomenal to use.
00:55:32.280 | Like it's really cool.
00:55:33.560 | - More broadly outside of Cosine,
00:55:35.120 | what are you seeing?
00:55:35.960 | What trends are you seeing that you're really excited by?
00:55:39.320 | Who's doing great work that you wanna call out?
00:55:41.440 | - The one of the ones that,
00:55:43.120 | I mean, it's not an original choice
00:55:44.560 | but Cursor are absolutely killing it.
00:55:46.320 | All the employees at Cosine love using it.
00:55:48.800 | And it's a really, really good example
00:55:52.440 | of like just getting like UX right, basically.
00:55:55.760 | Like putting the LLM in the right place
00:55:58.960 | and letting it allow you
00:56:00.080 | and getting out of the way when you don't want it there
00:56:02.120 | and making it familiar 'cause it's still VS code
00:56:04.360 | and all these things.
00:56:05.640 | They've, yeah, they've done an amazing job.
00:56:07.200 | And I think they just raised a round.
00:56:08.200 | So congrats on that to them.
00:56:09.280 | So like they're doing amazing work.
00:56:10.960 | - The decision to fork VS code, I think was controversial.
00:56:13.440 | You guys started as a VS code extension.
00:56:15.080 | - We did, yeah.
00:56:15.920 | - Many, many, many people did that.
00:56:16.760 | And they did the one thing that no one wanted to do.
00:56:19.040 | - I commend the bravery, honestly.
00:56:20.640 | Like I commend the bravery.
00:56:21.920 | 'Cause like in hindsight, obviously it's paid off.
00:56:25.040 | But at least for me in the moment,
00:56:27.560 | I was one of those people being like,
00:56:29.120 | is that gonna, are people gonna do that?
00:56:30.920 | Are people gonna download that?
00:56:31.960 | And yes, obviously they are.
00:56:32.960 | Like sure, doing the hard thing,
00:56:35.240 | which is, having worked on Genie recently,
00:56:38.000 | for the past eight months or whatever,
00:56:40.080 | as taxing as it's been on us,
00:56:41.800 | like one of the main things I have learned from this
00:56:44.400 | is like, no matter how small you are,
00:56:46.680 | how much resource you have,
00:56:47.760 | just like try to do the hard thing.
00:56:49.400 | 'Cause I think it has the biggest payoff.
00:56:51.720 | - More broadly, just like lessons that you've learned
00:56:54.800 | running your company.
00:56:56.080 | - Oh.
00:56:57.960 | - It's been a two year journey.
00:56:59.320 | - Two year journey.
00:57:00.240 | I mean, it's better than any real job
00:57:02.560 | we could ever get.
00:57:03.440 | Like, I feel so lucky to be working in this area.
00:57:07.880 | Like, especially, you know,
00:57:09.440 | it was so validating to hear it
00:57:10.720 | from the guys at OpenAI as well,
00:57:11.880 | telling us like, we're on the cutting edge, that
00:57:14.160 | we're pushing the boundaries of what's possible
00:57:15.480 | with what we're doing.
00:57:16.640 | Because like, I get to do, I get to be paid to do this.
00:57:19.240 | You know, I have briefly, as you heard at the beginning,
00:57:21.920 | done real jobs and normal stuff.
00:57:24.360 | And like, just being able to do this on the daily,
00:57:26.600 | it's so interesting and so cool.
00:57:28.400 | It's like, I pinch myself a lot, genuinely,
00:57:31.360 | about the fact that I can do this.
00:57:33.200 | And also that, not only I can do this,
00:57:34.960 | but fortunately being a co-founder of the company,
00:57:37.440 | I have a huge amount of say as to where we go next.
00:57:39.520 | And that is a big responsibility,
00:57:41.600 | but it's also so exciting to me.
00:57:42.920 | 'Cause I'm like, you know,
00:57:44.000 | steering the ship has been really interesting so far.
00:57:46.800 | And I like to think that we've got it right,
00:57:48.560 | you know, in the last sort of eight months or so.
00:57:51.400 | And that this is like, really the starting point
00:57:53.760 | of something massive to come.
00:57:55.480 | - Awesome.
00:57:56.320 | Calls to action.
00:57:57.160 | I assume you're hiring.
00:57:59.400 | I assume you're also looking for customers.
00:58:00.920 | What's the ideal customer, ideal employee?
00:58:04.120 | - On the customer side,
00:58:05.640 | honestly, people who are just willing to try something new,
00:58:07.680 | like the Genie UX is different to a conventional IDE.
00:58:12.120 | Give it a chance.
00:58:13.160 | Like that what we really do believe in this whole idea
00:58:15.400 | of like developers work is going to be abstracted,
00:58:18.120 | you know, levels higher than just the code.
00:58:20.240 | We still let you touch the code.
00:58:21.680 | We still want you to dive into the code if you need to.
00:58:23.960 | But fundamentally we think that
00:58:25.720 | if you're trying to offload the coding to a model,
00:58:28.040 | the model should do the coding
00:58:29.160 | and you should be in charge of guiding the model.
00:58:31.200 | So people who are willing to give something new a chance.
00:58:34.000 | Size of company.
00:58:34.840 | And honestly, well, preferably the languages
00:58:37.640 | that are the most represented in our training data.
00:58:40.000 | So like anyway, if you're like doing TypeScript,
00:58:41.760 | JavaScript, Python, Java, that sort of thing.
00:58:45.640 | And in terms of size of company,
00:58:47.400 | like so long as you're willing to try it
00:58:49.360 | and there aren't any massive like infosec things
00:58:51.880 | that get in the way, like it doesn't really matter.
00:58:53.760 | Like code base size can be arbitrary for us.
00:58:55.600 | We can deal with any code base size
00:58:57.480 | and essentially any language, but your mileage may vary.
00:58:59.920 | But for the most part, like anyone who's willing
00:59:02.240 | to give it a try is the ideal customer.
00:59:04.000 | And on the employee, honestly,
00:59:06.320 | we just want people who we're gonna be hiring both
00:59:09.480 | on like what we call like the traditional tech side.
00:59:12.960 | So like building the product essentially
00:59:14.880 | and also hiring really heavily
00:59:16.200 | on the AI machine learning data set side as well.
00:59:21.000 | And in both cases, essentially what we just wanted
00:59:24.480 | were like really passionate people
00:59:26.480 | who are obsessed with something
00:59:28.440 | and are really passionate about something
00:59:30.240 | and are willing to, it sounds so corny,
00:59:33.760 | but like join us in what we're trying to do.
00:59:35.720 | Like we have a very big ambition
00:59:37.080 | and we're biting off a very large problem here.
00:59:40.160 | And people who can look at what we've done so far
00:59:42.760 | and be like, wow, that's really impressive.
00:59:44.320 | I want to do that kind of work.
00:59:46.000 | I want to be pushing the boundaries.
00:59:47.240 | I want to be dealing with experimental stuff all the time.
00:59:51.040 | But at the same time, you're putting it in people's hands
00:59:53.280 | and shipping it to people and so on.
00:59:55.040 | So if that sounds, you know, amenable to anyone,
00:59:57.480 | that's the kind of person we're hoping will apply.
00:59:59.560 | - Excellent.
01:00:00.400 | Any last words?
01:00:01.440 | Any Trump impressions that you (laughs)
01:00:04.200 | Did you like the Trump impression?
01:00:05.280 | - Yeah, everyone loved the Trump impression.
01:00:06.480 | - Yeah, I mean, it's funny 'cause like I have some bloopers.
01:00:10.360 | I'll show you the bloopers after we finish recording.
01:00:12.000 | I'll probably tweet them at some point.
01:00:13.520 | The initial cut of that video had me doing a Trump impression.
01:00:17.160 | I sort of sat down into the chair
01:00:18.960 | and be like Cosine is the most tremendous AI lab
01:00:21.720 | in the world.
01:00:23.240 | Unbelievable.
01:00:24.080 | I walked in here and I said, well, this is an amazing lab.
01:00:26.880 | And like, we sent it to some of our friends.
01:00:28.440 | They were like, nah, you can't cold open with Trump, man.
01:00:31.320 | You just can't.
01:00:32.160 | Like, no one knows who you are.
01:00:33.160 | - You can end with it.
01:00:34.000 | - But you can end with it.
01:00:35.000 | Now that that has gone out,
01:00:36.400 | we can now post the rest of the bloopers,
01:00:39.400 | which are essentially me just like fluffing my lines
01:00:42.560 | the entire time and screaming at my co-founder
01:00:44.600 | out of frustration.
01:00:45.440 | So, yeah.
01:00:46.280 | - Well, it was very well executed.
01:00:47.880 | Actually, very few people do the kind of video that you did.
01:00:50.120 | I'm, as a sort of developer relations person,
01:00:52.600 | I'm actually excited by that stuff,
01:00:53.880 | but well, thank you for coming on.
01:00:55.640 | Very, very short notice.
01:00:56.520 | I hope you have a safe flight back
01:00:57.600 | and excited to see the full launch.
01:01:00.080 | I think this is a super fruitful area
01:01:01.920 | and congrats on your launch.
01:01:03.720 | - Thank you so much for having me.
01:01:04.760 | Cheers.
01:01:05.760 | (upbeat music)
01:01:08.360 | (upbeat music)
01:01:10.940 | (upbeat music)
01:01:13.520 | (upbeat music)