
Is finetuning GPT4o worth it?


Chapters

0:00 Alistair and Cosine intro
11:34 GPT4o finetuning
15:18 Genie Data Mix
18:09 Customizing for Customers
20:37 Genie Workflow
22:41 Code Retrieval
30:20 Planning
37:29 Language Mix
38:46 Running Code
41:19 Finetuning with OpenAI
44:32 Synthetic Code Data
47:54 SynData in Llama 3
48:33 SWE-Bench Submission Process
53:20 Future Plans
54:36 Ecosystem Trends
55:55 Founder Lessons
57:58 CTA: Hiring & Customers

Whisper Transcript

00:00:00.000 | (upbeat music)
00:00:02.580 | - Hey everyone, welcome to the Latent Space Podcast.
00:00:06.880 | This is Alessio, partner and CTO
00:00:08.600 | in Residence at Decibel Partners.
00:00:10.040 | And I'm joined by my co-host Swyx, founder of Smol.ai.
00:00:13.040 | - Hey, and today we're back in the studio, in person,
00:00:17.200 | after about three to four months in visa jail and travels
00:00:21.000 | and all other fun stuff that we talked about
00:00:23.080 | in the previous episode.
00:00:24.440 | But today with special guest, Alistair Pullen from Cosine.
00:00:27.640 | Welcome.
00:00:28.480 | - Hi, thanks for having me.
00:00:29.320 | - Very lucky to have you
00:00:30.960 | because you're on a two day trip to San Francisco.
00:00:33.000 | - Yeah, I wouldn't recommend it.
00:00:33.840 | I would not recommend it.
00:00:35.160 | Don't fly from London to San Francisco for two days.
00:00:37.080 | - And you launched Genie on a plane, on plane WiFi,
00:00:41.560 | claiming state-of-the-art on SWE-Bench,
00:00:44.000 | which we're all gonna talk about.
00:00:45.400 | I'm excited to dive into your whole journey
00:00:47.800 | because it has been a journey.
00:00:48.920 | I've been lucky to be a small angel in part of that journey.
00:00:52.760 | And it's exciting to see that you're launching
00:00:54.800 | to such acclaim and such results.
00:00:58.680 | So I'll go over your brief background
00:01:00.680 | and then you can fill in the blanks
00:01:02.000 | on what else people should know about you.
00:01:03.960 | You did your bachelor's in computer science in Exeter,
00:01:07.160 | and then you worked at a startup
00:01:08.640 | that got acquired into GoPuff.
00:01:10.360 | And roundabout 2022, you started working on a stealth startup
00:01:14.560 | that became a YC startup.
00:01:15.880 | What's that overall story?
00:01:17.040 | - Yeah, so basically when I left university,
00:01:20.160 | I met my now co-founder, Sam.
00:01:23.080 | At the time, we were both mobile devs.
00:01:24.960 | He was an Android developer, I was an iOS developer.
00:01:27.200 | And whilst at university,
00:01:29.200 | we built this sort of small consultancy,
00:01:31.960 | sort of we'd be approached to build projects for people.
00:01:35.240 | And we would just take them up
00:01:37.080 | and start with their student projects.
00:01:38.440 | They weren't anything crazy or anything big.
00:01:40.240 | We started with those.
00:01:41.360 | And over time, we started doing larger and larger projects,
00:01:44.080 | more interesting things.
00:01:45.520 | And actually when we left university,
00:01:47.240 | we just kept doing that.
00:01:48.960 | We didn't really get jobs, traditional jobs.
00:01:51.920 | It was also like in the middle of COVID,
00:01:53.600 | middle of lockdown.
00:01:54.600 | So we were like, this is a pretty good gig.
00:01:56.040 | We'll just keep like writing code in our bedrooms.
00:01:58.040 | And we did that for a while.
00:02:00.320 | And then a friend of ours that we went to Exeter with
00:02:03.720 | started a YC startup during COVID.
00:02:07.360 | And it was one of these fast grocery delivery companies.
00:02:10.560 | At the time, I was living
00:02:11.920 | in the deepest, darkest countryside in England,
00:02:14.200 | where fast grocery companies are still not a thing.
00:02:16.600 | So he sort of pitched me this idea and was like,
00:02:19.000 | listen, like I need an iOS dev, do you fancy coming along?
00:02:22.320 | And I thought, absolutely.
00:02:23.560 | It was a chance to get out of my parents' house,
00:02:24.920 | chance to move to London, you know, do interesting things.
00:02:27.560 | And at the time, truthfully, I had no idea what YC was.
00:02:29.960 | I had no idea.
00:02:31.040 | I wasn't in the startup space.
00:02:32.560 | I knew I liked coding and building apps and stuff,
00:02:35.000 | but I'd never really done anything in that area.
00:02:38.360 | So I said, yes, absolutely.
00:02:39.960 | I moved to London just sort of as COVID was ending
00:02:43.240 | and yeah, worked at what was Fancy
00:02:46.160 | for about a year and a half.
00:02:47.520 | Then we brought Sam along as well.
00:02:49.720 | So Sam and I were the two engineers at Fancy
00:02:52.200 | for basically its entire life.
00:02:53.920 | And we built literally everything.
00:02:56.000 | So like the client mobile apps, the backends,
00:02:59.880 | the internal like stock management system,
00:03:02.680 | the driver routing algorithms,
00:03:04.920 | all those things, literally like everything.
00:03:07.120 | It was my first, you know,
00:03:09.000 | both of us were super inexperienced.
00:03:10.320 | We didn't have like proper engineering experience.
00:03:11.880 | There were definitely decisions we'd do differently now.
00:03:13.920 | We'd definitely buy a lot of stuff off the shelf,
00:03:15.640 | stuff like that.
00:03:16.480 | But it was the initial dip of the toe
00:03:19.760 | into like the world of startups.
00:03:21.720 | And we were both like hooked immediately.
00:03:23.320 | We were like, this is so cool.
00:03:24.520 | This sounds so much better than all our friends
00:03:26.080 | who were like consultants and doing like normal jobs, right?
00:03:28.720 | We did that and it ran its course.
00:03:30.400 | And after, I want to say 18 months or so,
00:03:32.640 | GoPuff came and acquired us.
00:03:34.600 | And there was obviously a transitionary period
00:03:36.240 | and integration period, like with all acquisitions.
00:03:38.320 | And we did that.
00:03:39.760 | And as soon as we'd vested what we wanted to vest
00:03:42.360 | and as soon as we thought, okay,
00:03:43.600 | this chapter is sort of done in about 2022,
00:03:46.880 | we left and we knew that we wanted to go alone
00:03:49.560 | and try something like we'd had this taste.
00:03:51.360 | Now we knew we'd seen how like a YC startup
00:03:53.880 | was managed like up close.
00:03:55.440 | And we knew that we wanted to do something similar ourselves.
00:03:57.600 | We had no idea what it was at the time.
00:03:59.800 | We just knew we wanted to do something.
00:04:01.120 | So we tried some small projects in various different areas.
00:04:05.440 | But then Sam talked to me about GPT-3.
00:04:09.400 | He'd seen it on Reddit.
00:04:10.640 | - The source of all knowledge.
00:04:12.080 | - The source of all knowledge, absolutely.
00:04:13.640 | Sam loves Reddit.
00:04:14.560 | I'd actually heard of GPT-2
00:04:17.120 | and obviously had like loosely followed
00:04:18.680 | what OpenAI had done with,
00:04:20.880 | what was the game they trained a model to play?
00:04:23.080 | - Dota.
00:04:23.920 | - Was it Dota, yeah.
00:04:24.920 | So I'd followed that and knew loosely what GPT-2 was.
00:04:29.200 | I knew what BERT was.
00:04:30.040 | So I was like, okay, this GPT-3 thing sounds interesting.
00:04:32.240 | And he just mentioned it to me on a walk.
00:04:34.320 | And I then went home and like Googled GPT-3
00:04:38.000 | and there was the playground.
00:04:38.960 | It was the, and the model was DaVinci 2 at the time.
00:04:41.840 | And it was just the old school playground, completions,
00:04:44.800 | nothing crazy, no chat, no nothing.
00:04:46.880 | - I miss completions though.
00:04:48.160 | - Yeah, oh, completions.
00:04:49.000 | Honestly, I had this conversation at OpenAI's yesterday.
00:04:51.160 | I was like, I just, I know.
00:04:53.200 | But yeah, so we,
00:04:55.360 | I started playing around with the playground
00:04:58.080 | and the first thing I ever wrote into it
00:05:00.280 | was like, hello world.
00:05:01.320 | And it gave me some sort of like fairly generic response
00:05:03.520 | back and I was like, okay, that looks pretty cool.
00:05:05.440 | The next thing was, I looked through the docs
00:05:08.680 | or they had a lot of example prompts
00:05:10.240 | 'cause I had no idea.
00:05:11.120 | I didn't know if the, if you could put anything in,
00:05:13.560 | I didn't know if you had to structure in a certain way
00:05:15.120 | or whatever.
00:05:15.960 | And I saw that it could start writing like tables
00:05:18.080 | and JSON and stuff like that.
00:05:19.600 | So I was like, okay,
00:05:20.440 | can you write me something in JSON?
00:05:21.640 | And it did.
00:05:22.680 | And I was like, oh wow, this is pretty cool.
00:05:25.680 | Can it just write arbitrary JSON for me?
00:05:28.240 | And immediately, as soon as I realized that,
00:05:31.120 | my mind was racing and I like got Sam in
00:05:34.040 | and we just started messing around in the playground,
00:05:37.080 | like fairly innocently to start with.
00:05:39.080 | And then of course, both being mobile devs
00:05:41.440 | and also seeing, at that point,
00:05:43.080 | we'd learned about what the Codex model was.
00:05:45.480 | It was like, this thing's trained to write code.
00:05:47.040 | It sounds awesome.
00:05:48.280 | And Copilot was start,
00:05:49.520 | I think, I can't actually remember if Copilot
00:05:51.520 | had come out later.
00:05:52.360 | Yeah, it might've done.
00:05:53.800 | - It's round about the same time as Codex.
00:05:54.640 | - Round about the same time, yeah.
00:05:56.040 | And we were like, okay, as mobile devs,
00:05:58.120 | let's see what we can do.
00:05:59.040 | So the initial thing was like, okay,
00:06:00.880 | let's see if we can get this AI
00:06:03.920 | to build us a mobile app from scratch.
00:06:06.280 | We eventually built the world's most flimsy system,
00:06:10.120 | which was back in the day,
00:06:11.120 | we're like 4,000 token context windows,
00:06:12.640 | like chaining prompts,
00:06:13.760 | trying to keep as much context from one to the other,
00:06:16.680 | all these different things,
00:06:17.560 | where essentially you'd put in an app idea in a box,
00:06:20.280 | and then we'd do like very high level stuff,
00:06:23.080 | figuring out what the stack should be,
00:06:24.400 | figuring out what the front end should be written in,
00:06:27.200 | back end should be written in,
00:06:28.080 | all these different things.
00:06:29.040 | And then we'd go through like for each thing,
00:06:32.240 | more and more levels of detail
00:06:33.800 | until the point that you actually got Codex
00:06:36.000 | to write the code for each thing.
00:06:37.920 | And we didn't do any templating or anything.
00:06:39.520 | We were like, no, we're gonna write all the code
00:06:40.800 | from scratch every time,
00:06:41.760 | which is basically why it barely worked.
00:06:44.080 | But there were like occasions
00:06:45.720 | where you could put in something
00:06:46.960 | and it would build something that did actually run,
00:06:49.640 | the back end would run, the database would work.
00:06:51.480 | And we were like, oh my God, this is insane.
00:06:53.280 | This is so cool.
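
For a sense of what that prompt-chaining setup looked like in practice, here is a minimal sketch under stated assumptions: it uses today's OpenAI chat completions client rather than the Codex-era completions API, and the stages, prompts, and model name are illustrative, not the original system's.

```python
# Sketch of the prompt-chaining idea described above: break an app idea into stages
# and carry forward only a compressed summary, because a ~4K-token context window
# can't hold everything. Model name and prompts are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(prompt: str, context: str = "") -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in; the original system used Codex-era completion models
        messages=[{"role": "user", "content": f"{context}\n\n{prompt}".strip()}],
    )
    return resp.choices[0].message.content

def generate_app(idea: str) -> dict[str, str]:
    # Stage 1: high-level decisions (stack, frontend, backend).
    stack = ask(f"App idea: {idea}\nPropose a stack (frontend, backend, database). Be brief.")
    # Stage 2: break the app into pieces, carrying only the stack summary forward.
    screens = ask("List the screens/endpoints needed, one per line.", context=stack)
    # Stage 3: generate code piece by piece; each call sees only the stack plus one item.
    code = {}
    for item in screens.splitlines():
        if item.strip():
            code[item] = ask(f"Write the code for: {item}", context=stack)
    return code
```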
00:06:54.440 | And that's what we showed to our co-founder, Yang.
00:06:58.280 | I met my co-founder, Yang, through Fancy,
00:07:00.400 | 'cause his wife was their first employee.
00:07:02.280 | And we showed him, and he was like,
00:07:04.520 | you've discovered fire, what is this?
00:07:06.080 | Like, this is insane.
00:07:07.240 | He has a lot more startup experience.
00:07:09.160 | Historically, he's had a few exits in the past
00:07:11.240 | and has been through all different industries.
00:07:13.880 | He's like our dad, he's a bit older.
00:07:15.240 | He hates me saying that, but he's a bit older.
00:07:16.840 | - He's your COO now?
00:07:17.960 | - He's our COO, yeah.
00:07:19.000 | And we showed him and he was like,
00:07:20.280 | this is absolutely amazing, let's just do something.
00:07:21.880 | 'Cause he, at the time, was just about to have a child,
00:07:24.640 | so he didn't have anything going on either.
00:07:26.640 | So we applied to YC, got an interview.
00:07:29.400 | The interview was, as most YC interviews are,
00:07:31.920 | short, curt, and pretty brutal.
00:07:33.640 | They told us they hated the idea.
00:07:35.080 | They didn't think it would work.
00:07:36.680 | And that's when we started brainstorming.
00:07:38.640 | It was almost like the interview
00:07:39.720 | was like an office hours kind of thing.
00:07:41.040 | And we were like, okay, given what you know
00:07:44.040 | about the space now and how to build things
00:07:45.960 | with these LLMs, what can you bring out
00:07:48.280 | of what you've learned in building that thing
00:07:49.920 | into something that might be a bit more useful
00:07:52.640 | to people on the daily?
00:07:53.480 | And also, YC obviously likes B2B startups
00:07:55.240 | a little bit more, at least at the time they did back then.
00:07:57.760 | So we were like, okay, maybe we could build something
00:08:00.080 | that helps you with existing codebases,
00:08:01.600 | like can sort of automate development stuff
00:08:03.000 | with existing codebases, not knowing at all
00:08:05.280 | what that would look like or how you would build it
00:08:07.480 | or any of these things.
00:08:09.040 | And they were like, yeah, that sounds interesting.
00:08:11.880 | You should probably go ahead and do that.
00:08:13.560 | You're in, you've got two weeks to build us an MVP.
00:08:16.080 | And we were like, okay, okay.
00:08:18.520 | We did our best.
00:08:19.360 | The MVP was absolutely horrendous.
00:08:20.480 | It was a CLI tool, it sucked.
00:08:22.360 | And at the time we were like, we don't even know
00:08:26.600 | how to build what we want to build.
00:08:28.480 | And we didn't really know what we wanted to build,
00:08:30.120 | to be honest.
00:08:30.960 | Like we knew we wanted to try to help automate dev work,
00:08:34.120 | but back then we just didn't know enough
00:08:35.600 | about how LLM apps were built,
00:08:37.920 | the intricacies and all those things.
00:08:39.120 | And also like the LLMs themselves,
00:08:40.560 | like 4,000 tokens, you're not going very far.
00:08:42.400 | They're extremely expensive.
00:08:43.880 | So we ended up building a code-based retrieval tool
00:08:46.920 | originally.
00:08:47.840 | Our thought process originally was,
00:08:49.560 | we want to build something that can do our jobs for us.
00:08:51.840 | That is like the gold star, we know that.
00:08:53.320 | We've seen like there are glimpses of it happening
00:08:55.520 | with our initial demo that we did,
00:08:57.560 | but we don't see the path of how to do that at the moment.
00:09:00.480 | Like the tech just wasn't there.
00:09:02.360 | So we were like, well, there are going to be some things
00:09:04.040 | that you need to build this when the tech does catch up.
00:09:06.560 | So retrieval being one of the most important things,
00:09:09.360 | like the model's going to have to build like pull code
00:09:11.000 | out of a code base somehow.
00:09:12.400 | So we were like, well,
00:09:13.240 | let's just build the tooling around it.
00:09:14.200 | And eventually when the tech comes,
00:09:15.440 | then we'll be able to just like plug it into our tooling
00:09:18.240 | and then it should work basically.
00:09:20.440 | And to be fair, that's basically what we've done.
00:09:22.840 | And that's basically what's happened,
00:09:23.920 | which is very fortunate.
00:09:25.240 | But in the meantime,
00:09:26.480 | whilst we were waiting for everything
00:09:27.720 | to sort of become available,
00:09:29.400 | we built this code-based retrieval tool.
00:09:31.320 | That was the first thing we ever launched
00:09:32.520 | when we were in YC, and it didn't work.
00:09:35.280 | It was really frustrating for us
00:09:36.520 | 'cause it was just me and Sam like working like all hours
00:09:39.080 | trying to get this thing to work.
00:09:40.560 | It was quite a big task in and of itself,
00:09:42.240 | trying to get like a good semantic search engine working
00:09:46.200 | that could run locally on your machine.
00:09:48.200 | We were trying to avoid sending code to the cloud
00:09:49.880 | as much as possible.
00:09:51.320 | And then for very large code bases,
00:09:52.760 | you're like, you know, millions of lines of code.
00:09:55.080 | You're trying to do some sort of like local HNSW thing
00:09:57.400 | that runs inside your VS Code instance
00:09:59.120 | that like eats all your RAM as you've seen in the past,
00:10:01.960 | all those different things.
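
A minimal sketch of that kind of local, in-editor HNSW index, assuming the hnswlib library; the naive line-based chunking and the embed() stand-in are illustrative, not the actual extension.

```python
# Sketch of a local HNSW semantic index over a code base, as described above.
# Assumes `hnswlib` and `numpy`; embed() is a placeholder for a real local embedding model.
import hnswlib
import numpy as np

def chunk_file(path: str, lines_per_chunk: int = 40) -> list[str]:
    """Split a source file into fixed-size line chunks (no AST awareness)."""
    lines = open(path, encoding="utf-8", errors="ignore").read().splitlines()
    return ["\n".join(lines[i:i + lines_per_chunk])
            for i in range(0, len(lines), lines_per_chunk)]

def embed(text: str, dim: int = 384) -> np.ndarray:
    """Placeholder embedding; swap in a real local model (e.g. a sentence-transformer)."""
    rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))
    v = rng.standard_normal(dim).astype(np.float32)
    return v / np.linalg.norm(v)

def build_index(chunks: list[str], dim: int = 384) -> hnswlib.Index:
    index = hnswlib.Index(space="cosine", dim=dim)
    index.init_index(max_elements=len(chunks), ef_construction=200, M=16)
    index.add_items(np.stack([embed(c, dim) for c in chunks]), np.arange(len(chunks)))
    index.set_ef(50)
    return index

# chunks = [c for f in repo_files for c in chunk_file(f)]
# index = build_index(chunks)
# labels, distances = index.knn_query(embed("authentication middleware"), k=5)
```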
00:10:02.800 | - Yep. - Yeah.
00:10:03.760 | - My first call with you, I think I had trouble.
00:10:06.000 | - You were like, "Yeah, it sucks, man."
00:10:06.840 | I was like, "Yeah, I know, I know, I know it sucks.
00:10:08.640 | I'm sorry."
00:10:10.360 | But building all that stuff was essentially
00:10:13.640 | the first six to eight months of what at the time was Buildt.
00:10:18.480 | - Which by the way, "Buildt."
00:10:20.200 | - "Buildt," yeah, it was a terrible, terrible name.
00:10:22.640 | - It was the worst part of trying to think about
00:10:26.000 | whether I would invest is whether or not
00:10:27.520 | people could pronounce it.
00:10:29.000 | - No, so when we went on our first ever YC like retreat,
00:10:33.440 | no one got the name right.
00:10:34.520 | They were like, "Bildt, Bill, what?"
00:10:37.160 | And then we actually changed the name to Cosine.
00:10:39.560 | Like, although some people would spell it
00:10:42.120 | as if you're cosigning for an apartment or something.
00:10:44.480 | Like, that's like, can't win.
00:10:46.480 | Yeah, that was what "Buildt" was back then.
00:10:47.880 | But the ambition, and I did a talk on this
00:10:49.840 | back in the end of 2022, the ambition to like build
00:10:52.560 | something that essentially automated our jobs
00:10:54.440 | was still very much like core to what we were doing.
00:10:58.480 | But for a very long time, it was just never apparent to us
00:11:01.080 | like, how would you go about doing these things?
00:11:03.440 | Even when like you had 3.5, 16K,
00:11:06.080 | 16K suddenly felt huge 'cause you've gone from four to 16,
00:11:09.080 | but even then 16K is like,
00:11:10.960 | a lot of Python files are longer than 16K.
00:11:13.240 | So you can't, you know,
00:11:14.440 | before you even start doing a completion,
00:11:16.400 | even then we were like, "Eh, yeah,
00:11:18.400 | it looks like we're still waiting."
00:11:19.480 | And then like towards the end of last year,
00:11:22.800 | you then start, you see 32K, 32K was really smart.
00:11:26.520 | It was really expensive, but also like,
00:11:29.280 | you could fit a decent amount of stuff in it.
00:11:30.640 | 32K felt enormous.
00:11:32.320 | And then finally 128K came along and we were like,
00:11:34.280 | "Right, this is like, this is what we can actually deal with
00:11:37.520 | because fundamentally to build a product like this,
00:11:39.840 | you need to get as much information
00:11:41.080 | in front of the model as possible
00:11:42.440 | and make sure that everything it ever writes in output
00:11:45.080 | can be traced back to something in the context window
00:11:48.200 | so it's not hallucinating it."
00:11:49.520 | As soon as that model existed, I was like,
00:11:52.200 | "Okay, I know that this is now gonna be feasible
00:11:54.640 | in some way."
00:11:55.480 | We'd done early sort of dev work on Genie using 3.5, 16K.
00:12:00.480 | And that was a very, very like crude way of proving
00:12:05.800 | that this loop that we were after
00:12:07.440 | and the way we were generating the data
00:12:09.960 | actually had signal and worked and could do something.
00:12:13.520 | But the model itself was not useful
00:12:15.240 | because you couldn't ever fit enough information into it
00:12:18.800 | for it to be able to do the task competently
00:12:20.920 | and also the base intelligence of the model.
00:12:23.120 | I mean, 3.5, anyone who's used 3.5 knows
00:12:25.240 | the base intelligence of the model is lacking,
00:12:27.520 | especially when you're asking it
00:12:28.360 | to like do software engineering is quite involved.
00:12:31.040 | So we saw the 128K context model
00:12:34.520 | and at that point we'd been in touch with OpenAI
00:12:38.760 | about our ambitions and like how we wanted to build it.
00:12:41.440 | We essentially are, I just took a punt.
00:12:43.000 | I was like, "I'm just gonna ask to see,
00:12:44.360 | can we like train this thing?"
00:12:45.680 | 'Cause at the time 4Turbo had just come out
00:12:48.160 | and back then there was still a decent amount of lag time
00:12:50.840 | between like OpenAI releasing a model
00:12:53.160 | and then allowing you to fine tune it in some way.
00:12:56.280 | They've gotten much better about that recently.
00:12:57.800 | Like 4o fine-tuning came out,
00:12:59.520 | I think, a day after. 4o Mini fine-tuning came out
00:13:01.640 | like the day after the model did.
00:13:03.520 | And I know that's something they're definitely
00:13:04.600 | like optimizing for super heavily inside,
00:13:06.680 | which is great to see.
00:13:07.720 | - Which is a little bit, for a year or so,
00:13:10.520 | YC companies had like a direct Slack channel to OpenAI.
00:13:14.000 | - We still do.
00:13:14.840 | - Yeah. - Yeah.
00:13:15.680 | - So it's a little bit of that diminishing
00:13:17.400 | of the YC advantage there.
00:13:18.760 | - Yeah.
00:13:19.600 | - If they're releasing this fine tuning ability
00:13:20.880 | like a day after.
00:13:21.840 | - Yeah, no, no, absolutely.
00:13:22.680 | But like you can't build a startup on the YC advantage.
00:13:25.360 | It's obviously nice, it makes you feel warm and fuzzy inside
00:13:27.520 | but like at the end of the day,
00:13:28.560 | it's not that that's gonna make you win.
00:13:31.040 | - Yeah.
00:13:31.880 | So like we'd spoken to Shyamal there,
00:13:34.520 | that DevRel guy, I'm sure you know him.
00:13:36.440 | - I think he's head of solutions or something.
00:13:38.120 | - He is in their applied team, yeah.
00:13:41.000 | We'd been talking to him from the very beginning
00:13:42.600 | when we got into YC
00:13:43.440 | and he's been absolutely fantastic throughout.
00:13:46.120 | I basically had pitched him this idea
00:13:47.880 | back when we were doing it on 3.5, 16K.
00:13:50.720 | And I was like, this is my crazy thesis.
00:13:53.000 | I wanna see if this can work.
00:13:54.400 | And as soon as like that 128K model came out,
00:13:57.240 | I started like laying the groundwork.
00:13:58.520 | I was like, I know this definitely isn't possible
00:14:00.440 | 'cause he released it like yesterday,
00:14:01.680 | but know that I want it.
00:14:03.840 | And in the interim, like GPT-4,
00:14:06.080 | like 8K fine tuning came out.
00:14:07.760 | We tried that, it's obviously even fewer tokens,
00:14:09.600 | but the intelligence helped.
00:14:10.960 | And I was like, if we can marry the intelligence
00:14:12.480 | and the context window length,
00:14:13.360 | then we're gonna have something special.
00:14:14.360 | And eventually we were able to get
00:14:16.400 | on the experimental access program
00:14:18.440 | and we got access to 4 Turbo fine-tuning.
00:14:22.040 | As soon as we did that,
00:14:23.600 | because in the entire run up to that,
00:14:25.080 | we'd built the data pipeline.
00:14:26.200 | We already had all that set up.
00:14:27.520 | So we were like, right, we have the data.
00:14:29.520 | Now we have the model.
00:14:30.520 | Let's put it through and iterate essentially.
00:14:33.960 | And that's where like Genie as we know it today
00:14:38.400 | really was born.
00:14:39.640 | I won't pretend like the first version of Genie
00:14:41.040 | that we trained was good.
00:14:41.880 | It was a disaster.
00:14:43.160 | That's where you realize all the implicit biases
00:14:45.200 | in your data set.
00:14:46.040 | And you realize that, oh, actually this decision you made
00:14:47.800 | that was fairly arbitrary was the wrong one.
00:14:49.640 | You have to do it a different way.
00:14:51.200 | Other subtle things like, you know,
00:14:52.880 | how you write Git diffs and you're using LLMs
00:14:55.680 | and how you can best optimize that
00:14:57.160 | to make sure they actually apply and work
00:14:58.520 | and loads of different little edge cases.
00:15:00.360 | But as soon as we had access to the underlying tool,
00:15:02.400 | we were like, right, we can actually do this.
00:15:04.600 | And I was, I breathed a sigh of relief
00:15:07.760 | 'cause I didn't know it was like, it wasn't a done deal,
00:15:09.960 | but I knew that we could build something useful.
00:15:12.080 | I mean, I knew that we could build something
00:15:13.480 | that would be measurably good on whatever eval at the time
00:15:18.480 | that you wanted to use.
00:15:20.080 | Like at the time, back then,
00:15:21.960 | we weren't actually that familiar with SWE-Bench.
00:15:23.240 | But once Devin came out and they announced
00:15:26.160 | their SWE-Bench results,
00:15:27.000 | like that's when my life took a turn.
00:15:29.640 | - Challenge accepted.
00:15:30.480 | - Yeah, challenge accepted.
00:15:31.640 | And that's where like, yes,
00:15:32.920 | that's where my friendships have gone.
00:15:34.640 | My sleep has gone, my weight, everything.
00:15:38.000 | Got into SWE-Bench and yeah,
00:15:40.400 | it was actually a very useful tool in building Genie
00:15:42.680 | 'cause beforehand it was like,
00:15:43.520 | "Yes, vibe check this thing and see if it's useful."
00:15:45.920 | And then all of a sudden you have an actual measure
00:15:48.120 | to see like, could it do software engineering?
00:15:50.800 | Not the best measure, obviously,
00:15:52.280 | but like it's the best that we've got now.
00:15:54.400 | We would just iterate and build.
00:15:56.240 | And eventually we got it to the point where it is now.
00:15:59.440 | And a little bit beyond since we actually got that score
00:16:03.440 | a couple of weeks ago.
00:16:04.800 | And yeah, it's been a hell of a journey
00:16:06.600 | from the beginning all the way now.
00:16:07.600 | That was a very rambling answer
00:16:08.800 | to your question about how we got here,
00:16:10.120 | but that's essentially a potted answer how we got here.
00:16:13.160 | - Got the full origin story.
00:16:14.240 | - Yeah, no, totally.
00:16:15.440 | You mentioned bias in the data and some of these things.
00:16:17.960 | In your announcement video,
00:16:19.440 | you called Genie the world's first
00:16:21.080 | AI software engineering colleague.
00:16:23.040 | And you kind of highlighted how the data needed to train it
00:16:27.000 | needs to show how a human engineer works.
00:16:30.240 | I think maybe you're contrasting that
00:16:32.480 | to just putting code in it.
00:16:34.040 | There's kind of like a lot more than code
00:16:35.760 | that goes into software engineering.
00:16:37.600 | How do you think about the data mixture?
00:16:39.440 | You know, and like there's this kind of known truth
00:16:42.680 | that code makes models better
00:16:44.880 | when you put in the pre-training data.
00:16:46.200 | But since we put so much in the pre-training data,
00:16:48.560 | what else do you add when you train Genie?
00:16:51.000 | - Yeah, I think that sort of boils down fundamentally
00:16:54.120 | to the difference between a model writing code
00:16:56.600 | and a model doing software engineering.
00:16:58.520 | Because the software engineering sort of discipline
00:17:01.680 | goes wider because if you look at something like a PR,
00:17:05.640 | that is obviously a artifact of some thought
00:17:09.080 | and some work that has happened
00:17:10.720 | and has eventually been squashed into some diffs, right?
00:17:13.680 | What the, very crudely,
00:17:15.320 | what the pre-trained models are reading
00:17:18.040 | is they're reading those final diffs
00:17:19.320 | and they're emulating that
00:17:20.920 | and they're being able to output it, right?
00:17:22.680 | But of course, it's a super lossy thing, a PR.
00:17:25.200 | You have no idea why or how, for the most part,
00:17:27.600 | unless there are some comments,
00:17:28.560 | which, you know, anyone who's worked in a company
00:17:30.240 | realizes PR reviews can be a bit dodgy at times.
00:17:33.360 | But you see that you lose so much information at the end.
00:17:37.120 | And that's perfectly fine because PRs aren't designed
00:17:39.600 | to be something that perfectly preserves
00:17:41.440 | everything that happened.
00:17:42.880 | But what we realized was if you want something
00:17:45.720 | that's a software engineer, and very crudely,
00:17:47.760 | we started with something that can do PRs for you,
00:17:50.120 | essentially, you need to be able to figure out
00:17:53.320 | why those things happened.
00:17:54.920 | Otherwise, you're just gonna rely,
00:17:56.680 | essentially, you just have a code writing model.
00:17:58.000 | You have something that's good at HumanEval,
00:17:59.560 | but not very good at SWE-Bench, essentially.
00:18:01.960 | That realization was part of the kernel of the idea
00:18:05.680 | of the approach that we took to design the agent
00:18:08.680 | that is Genie.
00:18:10.200 | The way that we decided we want to try to extract
00:18:14.200 | what happened in the past, like as forensically as possible,
00:18:17.600 | has been and is currently like one of the main things
00:18:20.680 | that we focus all our time on.
00:18:22.440 | Because doing that, getting as much signal out as possible,
00:18:24.720 | doing that as well as possible,
00:18:26.320 | is the biggest thing that we've seen
00:18:29.240 | that determines how well we do on that benchmark
00:18:31.120 | at the end of the day.
00:18:32.080 | Once you've sorted things out, like output structure,
00:18:35.480 | how to get it consistently writing diffs,
00:18:37.800 | and all the stuff that is sort of ancillary
00:18:40.320 | to the model actually figuring out how to solve a problem,
00:18:43.200 | the core bit of solving the problem is
00:18:45.040 | how did the human solve this problem?
00:18:46.600 | And how can we best come up with
00:18:48.960 | how the human solved these problems?
00:18:50.960 | So all the effort went in on that pipeline.
00:18:53.440 | And the mix that we ended up with was,
00:18:56.720 | as you've probably seen in the technical report and so on,
00:18:59.360 | all of those different languages and different combinations
00:19:01.480 | of different task types,
00:19:02.680 | all of that has run through that pipeline
00:19:04.440 | and we've extracted all that information out.
00:19:06.040 | - How does that differ when you work with customers
00:19:08.480 | that have private workflows?
00:19:10.000 | Like, do you think, is there usually a big delta
00:19:12.600 | between what you get in open source
00:19:14.280 | and maybe public data versus like--
00:19:15.920 | - Yeah, yeah, yeah.
00:19:16.760 | When you scrape enough of it,
00:19:17.640 | most of open source is updating readmes and docs.
00:19:19.880 | It's hilarious, like we had to filter out
00:19:21.680 | so much of that stuff because
00:19:23.160 | when we first did the 3.5, 16K model,
00:19:27.400 | like the amount of readme updating that went in,
00:19:30.160 | we did like no data cleaning, no real like,
00:19:33.080 | we just sort of threw it in and saw what happened.
00:19:35.040 | And it was just like,
00:19:37.000 | it was really good at updating readmes,
00:19:38.760 | really good at writing some comments,
00:19:40.480 | really good at complaining in Git reviews,
00:19:43.520 | in PR reviews rather.
00:19:44.880 | And it was, again, like we didn't clean the data.
00:19:46.880 | So you'd like give it some feedback
00:19:48.240 | and it would just like reply and like,
00:19:50.080 | it would just be quite insubordinate
00:19:51.680 | when it was getting back to you like,
00:19:52.680 | no, I don't think you're right.
00:19:53.720 | And it would just sort of argue with you.
00:19:55.520 | So the process of doing all that was super interesting
00:19:58.920 | 'cause we realized from the beginning,
00:20:00.160 | okay, there's a huge amount of work
00:20:01.560 | that needs to go into like cleaning this,
00:20:03.600 | getting it aligned with what we want the model to do
00:20:06.120 | to be able to get the model to be useful in some way.
00:20:09.080 | - I'm curious, like, how do you think about
00:20:11.040 | the customer willingness to share
00:20:13.720 | all of this historical data?
00:20:14.880 | I've done a lot of developer tools investing in my career
00:20:17.960 | and getting access to the code base
00:20:19.840 | is always one of the hard things.
00:20:21.720 | Are people getting more cautious
00:20:24.440 | about sharing this information?
00:20:26.080 | In the past, it was maybe like, you know,
00:20:27.520 | you're using static analysis tool,
00:20:29.640 | like whatever else you need to plug into the code base, fine.
00:20:32.360 | Now you're building a model based on it.
00:20:34.800 | Like, what's the discussion going into these companies?
00:20:37.280 | Are most people comfortable with like letting you see
00:20:39.400 | how to work and sharing everything or?
00:20:41.400 | - It depends on the sector mostly.
00:20:44.120 | We've actually seen, I'd say,
00:20:45.720 | people becoming more amenable to the idea over time,
00:20:48.200 | rather than more skeptical,
00:20:49.800 | 'cause I think they can see the upside.
00:20:52.360 | If this thing does what they say it does,
00:20:54.760 | it's gonna be more help to us
00:20:56.360 | than it is a risk to our infosec.
00:20:58.520 | And of course, like companies building in this space,
00:21:01.240 | we're all gonna end up, you know,
00:21:02.360 | complying with the same rules
00:21:03.640 | and there are gonna be new rules that come out
00:21:04.960 | to make sure that when we're looking at your code,
00:21:07.120 | everything is safe and so on.
00:21:08.680 | So from what we've seen so far,
00:21:10.520 | we've spoken to some very large companies
00:21:12.120 | that you've definitely heard of
00:21:13.640 | and all of them obviously have stipulations
00:21:16.360 | and many of them want it to be sandboxed to start with
00:21:18.640 | and all the like very obvious things
00:21:20.280 | that I, you know, I would say as well.
00:21:22.280 | But they're all super keen to have a go
00:21:24.880 | and see because like, despite all those things,
00:21:27.320 | if we can genuinely make them go faster,
00:21:30.240 | allow them to build more in a given time period and stuff,
00:21:32.360 | it's super worth it to them.
00:21:33.840 | - Okay, I'm gonna dive in a little bit
00:21:35.600 | on the process that you have created.
00:21:38.720 | You showed the demo on your video
00:21:40.680 | and by the time that we release this,
00:21:42.480 | you should be taking people off the wait list
00:21:44.160 | and launching people so people can see this themselves.
00:21:47.000 | There's four main parts of the workflow,
00:21:50.120 | which is finding files, planning action,
00:21:53.000 | writing code and running tests.
00:21:55.160 | And controversially, you have set yourself apart
00:21:58.680 | from the Devins of the world
00:22:00.520 | by saying that things like having access to a browser
00:22:03.600 | is not that important for you.
00:22:04.960 | Is that an accurate reading of what you wrote?
00:22:07.240 | - I don't remember saying that,
00:22:08.640 | but at least with what we've seen,
00:22:11.640 | the browser is helpful,
00:22:13.320 | but it's not as helpful as like,
00:22:14.760 | ragging the correct files, if that makes sense.
00:22:17.280 | Like, it is still helpful,
00:22:18.640 | but obviously there are more fundamental things
00:22:21.240 | you have to get right before you get to like,
00:22:23.280 | oh yeah, you can read some docs
00:22:24.480 | or you can read a stack overflow article
00:22:26.120 | and stuff like that.
00:22:26.960 | - Yeah, the phrase I was indexing on
00:22:28.880 | was the other software tools
00:22:30.840 | are wrappers around foundational models
00:22:32.280 | with a few additional tools,
00:22:33.320 | such as a web browser or code interpreter.
00:22:35.120 | - Oh, I see.
00:22:35.960 | No, I mean, no, I'm deriding the approach there,
00:22:39.000 | not the tools.
00:22:39.920 | - Yeah, exactly.
00:22:40.760 | So like, I would say in my standard model
00:22:43.680 | of what a code agent should look like,
00:22:45.560 | Devon has been very influential, obviously,
00:22:47.640 | because you could just add the docs of something
00:22:51.000 | and now I have, now when I'm installing a new library,
00:22:53.920 | I can just add docs.
00:22:55.360 | Cursor also does this, right?
00:22:56.720 | And then obviously having a code interpreter does help.
00:22:59.160 | I guess you have that in the form of running tests.
00:23:01.880 | - I mean, the Genie has both of those tools
00:23:03.880 | available to it as well.
00:23:04.760 | So yeah, yeah, yeah.
00:23:05.800 | So we have a tool where you can like put in URLs
00:23:09.160 | and it will just read the URLs
00:23:10.240 | and it also uses Perplexity's API under the hood as well
00:23:12.920 | to be able to actually ask questions if it wants to.
00:23:14.960 | - Okay.
00:23:15.800 | - So now we use both of those tools as well.
00:23:16.960 | Like those tools are super important and super key.
00:23:20.720 | I think obviously the most important tools to these agents
00:23:24.440 | are like being able to retrieve code from a code base,
00:23:27.680 | being able to read Stack Overflow articles and what have you
00:23:30.640 | and just be able to essentially be able to Google like we do
00:23:32.960 | is definitely super useful.
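
For concreteness, a rough sketch of that pair of tools: one fetches and strips a URL, the other asks a question through Perplexity's OpenAI-compatible chat completions endpoint. The endpoint and model name here come from Perplexity's public API and are assumptions about the shape of such a tool, not Cosine's implementation.

```python
# Sketch of the two agent tools described above: fetch-a-URL and ask-the-web.
# Assumes the `requests` and `beautifulsoup4` packages; the Perplexity model name
# ("sonar") and endpoint follow their public OpenAI-compatible API and may differ.
import os
import requests
from bs4 import BeautifulSoup

def read_url(url: str, max_chars: int = 8000) -> str:
    """Fetch a page and return its visible text, truncated to fit in a context window."""
    html = requests.get(url, timeout=15).text
    text = BeautifulSoup(html, "html.parser").get_text(separator="\n", strip=True)
    return text[:max_chars]

def ask_web(question: str) -> str:
    """Ask a question through Perplexity's chat completions API (model name is an assumption)."""
    resp = requests.post(
        "https://api.perplexity.ai/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['PERPLEXITY_API_KEY']}"},
        json={"model": "sonar", "messages": [{"role": "user", "content": question}]},
        timeout=60,
    )
    return resp.json()["choices"][0]["message"]["content"]
```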
00:23:35.000 | - Yeah.
00:23:35.840 | I thought maybe we could just kind of dive
00:23:36.800 | into each of those actions.
00:23:38.600 | Code retrieval, one of the core problems,
00:23:40.840 | you had an indexer that you've worked on,
00:23:43.640 | Even S has built.
00:23:45.080 | What makes it hard?
00:23:46.240 | What approach you thought would work, didn't work?
00:23:48.760 | Anything like that.
00:23:49.600 | - It's funny, I had a similar conversation to this
00:23:51.760 | when I was chatting to the guys from OpenAI yesterday.
00:23:54.680 | The thing is that searching for code,
00:23:57.680 | specifically semantically, at least to start with,
00:24:00.000 | I mean like keyword search and stuff like that
00:24:01.600 | is a solved problem, it's been around for ages,
00:24:04.120 | but at least being able to,
00:24:06.120 | the phrase we always used back in the day
00:24:07.760 | was searching for what code does rather than what code is,
00:24:11.200 | like searching for functionality is really hard, really hard.
00:24:16.200 | The way that we approached that problem
00:24:18.240 | was that obviously like a very basic and easy approach
00:24:22.320 | is right, let's just embed the code base,
00:24:23.800 | we'll chunk it up in some arbitrary way,
00:24:26.120 | maybe using an AST, maybe using number of lines,
00:24:28.440 | maybe using whatever, like some overlapping,
00:24:30.440 | just chunk it up and embed it.
00:24:31.920 | And once you've done that, I will write a query saying like,
00:24:34.800 | find me some authentication code or something,
00:24:36.920 | embed it, and then do the cosine similarity
00:24:39.040 | and get the top K, right?
00:24:40.360 | That doesn't work, and I wish it did work,
00:24:42.600 | don't get me wrong.
00:24:43.640 | It doesn't work well at all because fundamentally,
00:24:47.360 | if you think about like semantically how code looks
00:24:50.000 | is very different to how English looks,
00:24:51.600 | and there's like not a huge amount of signal
00:24:53.680 | that's carried between the two.
00:24:55.000 | So what we ended up, the first approach we took
00:24:57.280 | and that kind of did well enough for a long time was,
00:25:01.080 | okay, let's train a model
00:25:03.680 | to be able to take in English code queries
00:25:06.800 | and then produce a hypothetical code snippet
00:25:09.840 | that might look like the answer,
00:25:12.320 | embed that, and then do the cosine similarity.
00:25:15.360 | And that process, although very simple,
00:25:17.520 | gets you so much more performance
00:25:19.920 | out of the retrieval accuracy.
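
That "generate a hypothetical snippet, then embed it" approach is close in spirit to HyDE-style retrieval. Here is a minimal sketch, assuming the OpenAI client for both the snippet generation and the embeddings; the model names are stand-ins and this is an illustration of the idea, not Cosine's engine.

```python
# Sketch of the retrieval idea described above: instead of embedding the English query
# directly, generate a hypothetical code snippet that might answer it, embed that,
# and rank chunks by cosine similarity.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    vecs = np.array([d.embedding for d in resp.data], dtype=np.float32)
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize

def hypothetical_snippet(query: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in for a small model trained/prompted for this
        messages=[{"role": "user",
                   "content": f"Write a short code snippet that would plausibly answer: {query}"}],
    )
    return resp.choices[0].message.content

def search(query: str, chunks: list[str], k: int = 5) -> list[str]:
    chunk_vecs = embed(chunks)
    q_vec = embed([hypothetical_snippet(query)])[0]   # embed code, not English
    scores = chunk_vecs @ q_vec                       # cosine similarity on unit vectors
    return [chunks[i] for i in np.argsort(-scores)[:k]]
```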
00:25:21.920 | And that was kind of like the start of our engine,
00:25:25.080 | as we called it, which is essentially like the aggregation
00:25:28.080 | of all these different heuristics,
00:25:29.200 | like semantic, keyword, LSP, and so on.
00:25:33.200 | And then we essentially had like a model
00:25:36.040 | that would, given an input,
00:25:37.720 | choose which ones it thought were most appropriate
00:25:39.840 | given the type of requests you had.
00:25:41.640 | So the whole code search thing was a really hard problem.
00:25:46.360 | And actually what we ended up doing with Genie
00:25:48.160 | is we let the model through self-play
00:25:52.320 | figure out how to retrieve code.
00:25:53.720 | So actually we don't use our engine for Genie.
00:25:56.680 | So instead of like a request coming in
00:25:59.520 | and then like say GPT-4 with some JSON output being like,
00:26:02.720 | well, I think here we should use a keyword
00:26:04.360 | with these inputs and then we should use semantic
00:26:06.320 | and then we should like pick these results.
00:26:08.360 | It's actually like a question comes in
00:26:10.520 | and Genie has self-played in its training data
00:26:14.040 | to be able to be like,
00:26:14.880 | okay, this is how I'm going to approach
00:26:16.200 | finding this information.
00:26:17.320 | Much more akin to how a developer would do it.
00:26:19.880 | 'Cause if I was like,
00:26:20.960 | Sean, go into this new code base you've never seen before
00:26:23.600 | and find me the code that does this,
00:26:26.600 | you're gonna probably, you might do some keywords.
00:26:28.640 | You're gonna look over the file system.
00:26:30.080 | You're gonna try to figure out from the directories
00:26:32.000 | and the file names where it might be.
00:26:33.400 | You're gonna like jump in one
00:26:35.040 | and then once you're in there,
00:26:35.880 | you're probably gonna be doing the go to definition stuff
00:26:38.880 | to like jump from file to file
00:26:40.240 | and try to use the graph to like get closer and closer.
00:26:43.440 | And that is exactly what Genie does.
00:26:45.200 | Starts on the file system, looks at the file system,
00:26:47.320 | picks some candidate files.
00:26:48.960 | Is this what I'm looking for, yes or no?
00:26:51.000 | If there's something that's interesting,
00:26:52.280 | like an import or something,
00:26:53.320 | it can command click on that thing,
00:26:55.200 | go to definition, go to references and so on.
00:26:57.320 | And it can traverse the code base that way.
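
As an illustration of what that workflow looks like as an agent tool set, here are hypothetical function-calling schemas for file-system browsing and LSP-style navigation. The names and parameters are assumptions for illustration only, not Genie's actual tools.

```python
# Illustrative tool definitions (OpenAI function-calling schema format) for the kind of
# code-base navigation described above: browse the file tree, read files, and use a
# language server for go-to-definition / find-references. Hypothetical, not Genie's.
NAVIGATION_TOOLS = [
    {"type": "function", "function": {
        "name": "list_directory",
        "description": "List files and folders under a path in the repository.",
        "parameters": {"type": "object",
                       "properties": {"path": {"type": "string"}},
                       "required": ["path"]}}},
    {"type": "function", "function": {
        "name": "read_file",
        "description": "Return the contents of a file, optionally a line range.",
        "parameters": {"type": "object",
                       "properties": {"path": {"type": "string"},
                                      "start_line": {"type": "integer"},
                                      "end_line": {"type": "integer"}},
                       "required": ["path"]}}},
    {"type": "function", "function": {
        "name": "go_to_definition",
        "description": "Ask the language server for the definition of the symbol at a position.",
        "parameters": {"type": "object",
                       "properties": {"path": {"type": "string"},
                                      "line": {"type": "integer"},
                                      "column": {"type": "integer"}},
                       "required": ["path", "line", "column"]}}},
    {"type": "function", "function": {
        "name": "find_references",
        "description": "Ask the language server for all references to the symbol at a position.",
        "parameters": {"type": "object",
                       "properties": {"path": {"type": "string"},
                                      "line": {"type": "integer"},
                                      "column": {"type": "integer"}},
                       "required": ["path", "line", "column"]}}},
]
```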
00:26:59.240 | - Are you using the VS Code LSP or?
00:27:01.560 | - No, that's no, we're not doing this in VS Code.
00:27:04.240 | We're just using the language servers running.
00:27:06.440 | But we really wanted to try to mimic
00:27:09.120 | the way we do it as best as possible.
00:27:11.360 | And we did that during the self-play process
00:27:13.680 | when we were generating the data set.
00:27:14.880 | So although we did all that work originally,
00:27:17.360 | and although like Genie still has access to these tools,
00:27:19.840 | so it can do keyword searches
00:27:21.040 | and it can do basic semantic searches
00:27:23.480 | and it can use the graph.
00:27:24.360 | It uses them through this process and figures out,
00:27:27.840 | okay, I've learned from data how to find stuff in code bases
00:27:31.120 | and I think in our technical report,
00:27:32.400 | I can't remember the exact number,
00:27:33.520 | but I think it was around 65 or 66% retrieval accuracy
00:27:36.680 | overall measured on,
00:27:38.520 | we know what lines we need for these tasks to find
00:27:42.080 | for the task to actually be able to be completed.
00:27:44.640 | And we found about 66% of all those lines,
00:27:47.640 | which is one of the biggest areas of free performance
00:27:51.160 | that we can get hold of
00:27:52.000 | because when we were building Genie truthfully,
00:27:54.200 | like a lot more focus went on
00:27:56.800 | assuming you found the right information,
00:27:59.000 | you've been able to reproduce the issue,
00:28:01.000 | assuming that's true,
00:28:02.560 | how do you then go about solving it?
00:28:04.800 | And the bulk of the work we did was on the solving.
00:28:08.240 | But when you go higher up the funnel,
00:28:09.720 | obviously like the funnel looks like,
00:28:11.240 | have you found everything you need for the task?
00:28:13.400 | Are you able to reproduce the problem
00:28:14.800 | that's seen in the issue?
00:28:16.040 | Are you then able to solve it?
00:28:17.240 | And the funnel gets narrower as you go down.
00:28:19.440 | And at the top of the funnel, of course, is rank.
00:28:20.880 | So I'm actually quite happy with that score.
00:28:22.760 | I think it's still pretty impressive
00:28:23.840 | considering the size of some of the code bases
00:28:25.360 | we're using for this.
00:28:27.640 | But as soon as that, if that number becomes 80,
00:28:29.880 | I think how many more tasks we get right.
00:28:31.360 | That's one of the key areas we're gonna focus on
00:28:33.560 | when we continue working on Genie.
00:28:35.120 | - Be interesting to break out a benchmark just for that.
00:28:37.720 | - Yeah, I mean, it's super easy.
00:28:39.200 | - 'Cause I don't know what state of the art is.
00:28:40.600 | - Yeah, I mean, like for a, it's super easy
00:28:42.800 | 'cause like for a given PR, you know what lines are edited.
00:28:45.840 | - Oh, okay.
00:28:46.680 | - Yeah, you know what lines are edited.
00:28:47.520 | - So you can just, you can source it
00:28:48.360 | from SWE-Bench actually.
00:28:49.200 | - Yeah, you can do it, you can do it super easily.
00:28:50.600 | And that's how we got that figure out at the other end.
00:28:53.000 | For us being able to see it against,
00:28:54.600 | our historic models were super useful.
00:28:56.440 | So we could see if we were, you know,
00:28:58.040 | actually helping ourselves or not.
00:28:59.680 | And initially, one of the biggest performance gains
00:29:02.440 | that we saw when we did work on the rag a bit
00:29:04.720 | was giving it the ability to use the LSP
00:29:06.720 | to like go to definition and really try to get it
00:29:08.560 | to emulate how we do that.
00:29:10.560 | Because I'm sure when you go into an editor
00:29:12.960 | where like the LSP is not working or whatever,
00:29:15.480 | you suddenly feel really like disarmed and naked.
00:29:17.520 | You're like, oh my God, I didn't realize
00:29:19.360 | how much I actually use this to get about
00:29:21.040 | rather than just find stuff.
00:29:23.120 | So we really tried to get it to do that.
00:29:24.520 | And that gave us a big jump in performance.
00:29:26.400 | So we went from like 54% up to like the 60s,
00:29:28.880 | but just by adding, focusing on that.
00:29:31.120 | - That's one weird trick.
00:29:32.120 | - Yes.
00:29:33.640 | - I'll briefly comment here.
00:29:35.280 | So this is the standard approach I would say
00:29:37.160 | most code tooling startups are pursuing.
00:29:40.680 | The one company that's not doing this is magic.dev.
00:29:44.000 | - Yes.
00:29:44.840 | - So would you do things differently
00:29:46.760 | if you have a 10 million token context window?
00:29:49.160 | - If I had a 10 million context window
00:29:50.880 | and hundreds of millions of dollars,
00:29:53.400 | I wouldn't have gone and built,
00:29:56.960 | it's an LTM, it's not a transformer they're using, right?
00:30:00.120 | If I'm not mistaken, I believe it's not a transformer.
00:30:01.960 | - Yeah.
00:30:02.800 | - Eric's gonna come on at some point.
00:30:03.640 | - I'm just, listen, they obviously know a lot more
00:30:05.720 | about their product than I do.
00:30:06.600 | I don't know a great deal about how magic works.
00:30:08.280 | - Nobody knows anything yet.
00:30:09.120 | - Yeah, so I'm not gonna speculate.
00:30:12.600 | Would I do it the same way as them?
00:30:14.480 | I like the way we've done it because fundamentally,
00:30:17.120 | like we focus on the act of software engineering
00:30:22.120 | and what that looks like.
00:30:23.320 | And showing models how to do that.
00:30:25.360 | Fundamentally, the underlying model that we use
00:30:28.280 | is kind of null to us.
00:30:30.320 | Like so long as it's the best one, I don't mind.
00:30:32.560 | And the context windows we've already seen,
00:30:34.760 | like you can get transformers to have like million,
00:30:37.320 | one and a half million token context windows.
00:30:40.120 | And that works perfectly well.
00:30:41.520 | So like as soon as you can fine tune Gemini 1.5,
00:30:45.040 | then you best be sure that Genie will run on Gemini 1.5
00:30:48.600 | and like we'll probably get very good performance
00:30:50.160 | out of that.
00:30:51.000 | I like our approach 'cause we can be super agile
00:30:52.760 | and be like, "Oh, well, Anthropic have just released
00:30:54.480 | "whatever and it might have half a million tokens
00:30:57.080 | "and it might be really smart."
00:30:58.280 | And I can just immediately take my JSONL file
00:31:00.520 | and just dump it in there and suddenly Genie works on there
00:31:02.480 | and it can do all the new things.
00:31:04.040 | - Does Anthropic have the same fine tuning support
00:31:06.320 | as OpenAI?
00:31:07.160 | I actually haven't heard anyone do it.
00:31:08.840 | - They are working on it.
00:31:09.960 | They are partnered with AWS and it's gonna be in Bedrock.
00:31:13.080 | As far as I know, I think that's true.
00:31:16.960 | - Cool.
00:31:17.800 | We have to keep moving on to the other segments.
00:31:19.640 | Planning.
00:31:20.480 | The second piece of your four-step grandmaster plan.
00:31:23.800 | That is the frontier right now.
00:31:25.560 | A lot of people are talking about Strawberry,
00:31:27.520 | Q*, whatever that is.
00:31:29.280 | Monte Carlo Tree Search.
00:31:30.880 | Is current state-of-the-art planning good enough?
00:31:33.320 | What prompts have worked?
00:31:35.120 | I don't even know what questions to ask.
00:31:36.320 | Like, what is the state of planning?
00:31:37.920 | - I think it's fairly obvious
00:31:38.760 | that with the foundational models,
00:31:40.400 | like you can ask them to think by step by step
00:31:42.000 | and ask them to plan and stuff,
00:31:43.480 | but that isn't enough
00:31:44.440 | because if you look at how those models score
00:31:46.000 | on these benchmarks,
00:31:46.840 | then they're not even close to state-of-the-art.
00:31:48.920 | - Which ones are you referencing?
00:31:50.080 | - So like just like SWE-Bench and so on, right?
00:31:52.840 | And like even the things that get really good scores
00:31:55.040 | on HumanEval, agents as well,
00:31:56.320 | 'cause they have these loops, right?
00:31:57.840 | Obviously these things can reason, quote unquote,
00:32:00.560 | but the reasoning is the model,
00:32:03.680 | it's constrained by the model's intelligence,
00:32:05.640 | I'd say, very crudely.
00:32:07.320 | And what we essentially wanted to do
00:32:08.920 | was we still thought,
00:32:09.760 | obviously reasoning is super important.
00:32:11.040 | We need it to get the performance we have,
00:32:13.200 | but we wanted the reasoning to emulate
00:32:15.160 | how we think about problems when we're solving them,
00:32:17.240 | as opposed to how a model thinks about a problem
00:32:19.240 | when it's solving it.
00:32:20.200 | And that's obviously part of like the derivation pipeline
00:32:23.120 | that we have when we design our data.
00:32:25.880 | But the reasoning that the models do right now,
00:32:28.520 | and who knows what Q*,
00:32:30.280 | whatever it ends up being called, looks like,
00:32:32.760 | but certainly what I'm excited,
00:32:34.040 | on a small tangent to that,
00:32:35.440 | like what I'm really excited about
00:32:36.760 | is when models like that come out,
00:32:38.200 | obviously the signal in my data,
00:32:39.560 | when I regenerate it, goes up.
00:32:41.440 | And then I can then train that model
00:32:43.040 | that's already better at reasoning
00:32:44.280 | with improved reasoning data
00:32:46.280 | and just like I can keep bootstrapping
00:32:47.760 | and keep leapfrogging every single time.
00:32:49.520 | And that is like super exciting to me
00:32:51.600 | 'cause I welcome like new models so much
00:32:53.960 | because immediately it just floats me up
00:32:56.480 | without having to do much work, which is always nice.
00:32:58.720 | But at the state of reasoning generally,
00:33:00.840 | I don't see it going away anytime soon.
00:33:02.960 | I mean, that's like an autoregressive model
00:33:04.640 | doesn't think per se.
00:33:06.480 | And in the absence of having any thought,
00:33:08.840 | maybe an energy-based model or something like that,
00:33:11.360 | maybe that's what Q* is, who knows,
00:33:13.080 | some sort of like high level abstract space
00:33:16.120 | where thought happens before tokens get produced.
00:33:19.040 | In the absence of that for the moment,
00:33:20.680 | I think it's all we have
00:33:22.000 | and it's gonna have to be the way it works.
00:33:23.840 | For what happens in the future, we'll have to see,
00:33:26.360 | but I think certainly it's never going
00:33:27.800 | to hinder performance to do it.
00:33:29.200 | And certainly the reasoning that we see Genie do
00:33:33.160 | when you compare it to like,
00:33:34.520 | if you ask GPT-4 to break down a step-by-step
00:33:38.040 | approach for the same problem,
00:33:39.680 | at least just on a vibe check alone looks far better.
00:33:42.920 | - Two elements that I like
00:33:45.200 | that I didn't see in your initial video,
00:33:47.000 | we'll see when this Genie launches,
00:33:49.880 | is a planner chat,
00:33:51.520 | which is I can modify the plan while it's executing.
00:33:54.360 | And then the other thing is playbooks,
00:33:55.760 | which also from Devin,
00:33:57.240 | where here's how I like to do a thing
00:33:59.880 | and I'll use Markdown to specify how I do it.
00:34:02.920 | I'm just curious if like, you know, those things help.
00:34:05.640 | - Yeah, no, absolutely.
00:34:06.480 | We're a hundred percent.
00:34:07.320 | We want everything to be editable,
00:34:09.120 | not least because it's really frustrating when it's not.
00:34:11.200 | Like if you're ever in a situation
00:34:12.840 | where like there's the one thing I just wish I could,
00:34:15.720 | and you'd be right if that one thing was right
00:34:17.480 | and you can't change it.
00:34:18.520 | So we're going to make everything editable,
00:34:19.520 | including the code it writes.
00:34:20.560 | Like you can, if it makes a small error in a patch,
00:34:22.960 | you can just change it yourself and let it continue
00:34:24.680 | and it will be fine.
00:34:25.760 | So yeah, like those things are super important.
00:34:27.560 | We'll be doing those too.
00:34:28.640 | - I'm curious, once you get to writing code,
00:34:30.960 | is most of the job done?
00:34:32.640 | I feel like the models are so good at writing code
00:34:34.720 | when they're like in small chunks
00:34:36.560 | that are like very well-instructed.
00:34:38.320 | What's kind of the drop off in the funnel?
00:34:40.160 | Like once you get to like,
00:34:41.360 | you got the right files and you got the right plan.
00:34:43.680 | - That's a great question because by the time this is out,
00:34:46.480 | there'll be another blog post.
00:34:47.960 | Yeah, there'll be another blog post,
00:34:49.920 | which contains all the learnings that I delivered
00:34:52.880 | to OpenAI's fine-tuning team
00:34:54.160 | when we finally got the score.
00:34:55.600 | - Oh, that's good.
00:34:56.640 | Go for it, it's already out.
00:34:58.480 | - Yeah, I don't have it on my phone,
00:34:59.800 | but basically I broke down the log probs.
00:35:04.800 | I basically got the average log prob for a token
00:35:08.400 | at every token position in the context window.
00:35:10.320 | So imagine an X-axis from zero to 128K,
00:35:13.160 | and then the average log prob for each index in there.
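
A small sketch of that analysis, assuming you have collected per-token log probs from many runs (for example via the logprobs option on the completions API); the bucketing here is illustrative.

```python
# Sketch of the analysis described above: average token log probs by absolute position
# in the context window to see where the model is confident and where it isn't.
from collections import defaultdict

def average_logprob_by_position(runs: list[list[float]], bucket: int = 1000) -> dict[int, float]:
    """runs: one list of per-token log probs per completion, indexed from the start of the context."""
    sums, counts = defaultdict(float), defaultdict(int)
    for logprobs in runs:
        for pos, lp in enumerate(logprobs):
            b = (pos // bucket) * bucket
            sums[b] += lp
            counts[b] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}

# curve = average_logprob_by_position(all_runs)
# -> {0: -0.9, 1000: -0.7, ...}; values closer to 0 mean the model is more certain.
```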
00:35:16.160 | As we discussed, like the way Genie works normally is,
00:35:18.640 | you know, at the beginning you do your rag
00:35:20.120 | and then you do your planning and then you do your coding
00:35:21.680 | and that sort of cycle continues.
00:35:23.280 | The certainty of code writing is so much more certain
00:35:26.840 | than every other aspect of Genie's loop.
00:35:29.240 | So whatever's going on under the hood,
00:35:30.720 | the model is really comfortable with writing code.
00:35:32.520 | There is no doubt and it's like in the token probabilities.
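A minimal sketch of the kind of analysis described here, assuming you have per-token log probs saved from evaluation runs; the file layout and the `token_logprobs` field are assumptions for illustration, not Cosine's actual tooling:

```python
import json
from collections import defaultdict

BUCKET = 1024  # group positions into 1K-token buckets for readability

def avg_logprob_by_position(eval_files):
    """Average token log prob at each context-window position across many runs."""
    totals, counts = defaultdict(float), defaultdict(int)
    for path in eval_files:
        with open(path) as f:
            run = json.load(f)  # assumed shape: {"token_logprobs": [float, ...]}
        for pos, lp in enumerate(run["token_logprobs"]):
            bucket = pos // BUCKET
            totals[bucket] += lp
            counts[bucket] += 1
    # x-axis: bucket start index (0..128K), y-axis: mean log prob in that bucket
    return {b * BUCKET: totals[b] / counts[b] for b in sorted(totals)}
```

Buckets where the mean log prob sits noticeably higher would correspond to the code-writing phase, the part of the loop the model is most "certain" about.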
00:35:35.680 | One slightly different thing, I think,
00:35:37.400 | to how most of these models work is,
00:35:40.480 | at least for the most part,
00:35:41.840 | if you ask GPT4 in chat GPT to edit some code for you,
00:35:45.440 | it's going to rewrite the entire snippet for you
00:35:47.360 | with the changes in place.
00:35:48.800 | We train Genie to write diffs and, you know,
00:35:51.280 | essentially patches, right?
00:35:52.320 | Because it's more token efficient
00:35:53.960 | and that is also fundamentally,
00:35:56.880 | we don't write patches as humans,
00:35:58.480 | but it's like the result of what we do is a patch, right?
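To make the diff-over-rewrite point concrete, here is a toy illustration using Python's standard `difflib` (not Genie's actual patch format): a unified diff carries only the changed hunks plus a little context, so it costs far fewer tokens than re-emitting the whole file the way a chat model normally would.

```python
import difflib

# Hypothetical "before" and "after" versions of a file being edited.
original = """def total(prices):
    return sum(prices)
""".splitlines(keepends=True)

edited = """def total(prices, tax=0.0):
    return sum(prices) * (1 + tax)
""".splitlines(keepends=True)

# Only the changed hunk (plus context) is emitted, not the whole file.
patch = "".join(difflib.unified_diff(original, edited,
                                     fromfile="a/pricing.py",
                                     tofile="b/pricing.py"))
print(patch)
```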
00:36:01.440 | When Genie writes code,
00:36:04.200 | I don't know how much it's leaning
00:36:05.920 | on the pre-training like code writing corpus,
00:36:08.280 | because obviously it's just read code files there.
00:36:10.680 | It's obviously probably read a lot of patches,
00:36:12.240 | but I would wager it's probably read more code files
00:36:14.080 | than it has patches.
00:36:14.920 | So it's probably leaning on a different part of its brain
00:36:16.840 | is my speculation.
00:36:17.680 | I have no proof for this.
00:36:18.920 | So I think the discipline of writing code
00:36:20.840 | is slightly different,
00:36:21.680 | but certainly is its most comfortable state
00:36:24.200 | when it's writing code.
00:36:25.760 | So once you get to that point,
00:36:27.640 | so long as you're not too deep into the context window,
00:36:29.600 | another thing that I'll bring up in that blog post
00:36:31.600 | is performance of Genie
00:36:33.680 | over the length of the context window,
00:36:35.840 | which degrades fairly linearly.
00:36:38.680 | So actually, I broke it down
00:36:41.120 | by probability of solving a SweeBench issue,
00:36:44.360 | given the number of tokens of the context window.
00:36:46.400 | At 60K, it's basically 0.5.
00:36:48.920 | So if you go over 60K in context length,
00:36:51.600 | you are more likely to fail than you are to succeed
00:36:53.760 | just based on the amount of tokens
00:36:55.360 | you have on the context window.
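A rough sketch of how such a curve could be computed from evaluation records; the record shape and bucket size are assumptions:

```python
def solve_rate_by_context_length(records, bucket=10_000):
    """records: [{"context_tokens": int, "resolved": bool}, ...] (assumed shape)."""
    buckets = {}
    for r in records:
        b = r["context_tokens"] // bucket * bucket
        wins, total = buckets.get(b, (0, 0))
        buckets[b] = (wins + r["resolved"], total + 1)
    # Solve rate per context-length bucket; where this crosses 0.5 is the
    # point past which a run is more likely to fail than succeed.
    return {b: wins / total for b, (wins, total) in sorted(buckets.items())}
```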
00:36:56.680 | And when I presented that to the fine tuning team
00:36:59.000 | at OpenAI, that was super interesting to them as well.
00:37:01.560 | And that is more of a foundational model attribute
00:37:05.840 | than it is an us attribute.
00:37:07.320 | However the attention mechanism works in GPT-4,
00:37:10.480 | however, you know, they deal with the context window
00:37:12.880 | at that point, is, you know,
00:37:14.800 | influencing how Genie is able to perform.
00:37:17.040 | Even though obviously all our training data is perfect,
00:37:19.560 | right, so even if like stuff is being solved
00:37:21.440 | in 110,000 tokens, sort of that area,
00:37:24.800 | the training data still shows it being solved there,
00:37:27.040 | but it's just in practice, the model is finding it
00:37:29.000 | much harder to solve stuff
00:37:30.040 | down that end of the context window.
00:37:31.800 | - Does that scale with the context,
00:37:33.240 | so for a 200K context size, is 100K tokens like the 0.5 point?
00:37:38.240 | - I don't know.
00:37:39.520 | - Yeah, yeah, yeah.
00:37:40.360 | - Yeah, but I hope not.
00:37:42.720 | I hope you don't just take the context length
00:37:44.320 | and halve it and then say,
00:37:45.160 | "Oh, this is the usable context length."
00:37:46.960 | But what's been interesting is knowing that,
00:37:49.720 | actually really digging into the data,
00:37:51.200 | looking at the log probs,
00:37:52.160 | looking at how it performs over the entire window,
00:37:54.920 | it's influenced the short-term improvements
00:37:57.720 | we've made to Genie since we got that score.
00:38:01.200 | So we actually made some small optimizations
00:38:03.640 | to try to make sure as best we can without overdoing it,
00:38:08.400 | trying to make sure that we can artificially
00:38:10.240 | make sure stuff sits within that sort of range
00:38:12.400 | because we know that's our sort of battle zone.
00:38:14.360 | And if we go outside of that,
00:38:15.440 | we're starting to push the limits,
00:38:16.720 | we're more likely to fail.
00:38:18.160 | So just doing that sort of analysis has been super useful
00:38:20.680 | without actually messing with anything more structural
00:38:24.600 | and getting more performance out of it.
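One plausible, purely hypothetical version of that optimization is a simple token budget on retrieved context, so runs tend to stay inside the range where the solve rate is above 0.5; the budget value, `count_tokens`, and the chunk format are all assumptions:

```python
TOKEN_BUDGET = 60_000  # empirically where the solve rate crossed 0.5

def fit_to_budget(fixed_parts, retrieved_chunks, count_tokens):
    """Keep the highest-ranked retrieved chunks that still fit the budget.

    fixed_parts: strings that must be present (issue, plan, current diff).
    retrieved_chunks: ranked list of (score, text) from code retrieval.
    count_tokens: tokenizer-specific counting function (assumed provided).
    """
    used = sum(count_tokens(p) for p in fixed_parts)
    kept = []
    for _, text in sorted(retrieved_chunks, reverse=True):
        cost = count_tokens(text)
        if used + cost > TOKEN_BUDGET:
            continue  # skip chunks that would push us past the battle zone
        kept.append(text)
        used += cost
    return kept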
00:38:26.200 | - What about different languages?
00:38:28.200 | So in your technical report,
00:38:29.720 | the data makes this 21% JavaScript, 21% Python,
00:38:33.480 | 14% TypeScript, 14% TSX.
00:38:36.960 | - Which is JavaScript, JavaScript, JavaScript.
00:38:38.720 | - Yeah, yeah, yeah.
00:38:39.560 | - Yes, yeah, yeah, that's true.
00:38:40.400 | - It's like 49% JavaScript.
00:38:41.240 | - That's true, that's true.
00:38:42.080 | Although TypeScript is so much superior, but anyway.
00:38:43.600 | - Do you see, how good is it at just generalizing?
00:38:46.400 | If you're writing Rust or C++ or whatever else,
00:38:50.200 | it's quite different?
00:38:51.640 | - It's pretty good at generalizing.
00:38:53.800 | Obviously, though, I think there's 15 languages
00:38:55.640 | in that technical report, I think, that we've covered.
00:38:58.440 | The ones that we picked in the highest mix
00:39:00.920 | were the ones that, selfishly, we internally use the most,
00:39:05.000 | and also that are, I'd argue,
00:39:06.960 | some of the most popular ones.
00:39:08.320 | When we have more resource as a company and more time,
00:39:12.640 | and once all the craziness that has just happened
00:39:14.840 | sort of dies down a bit,
00:39:15.680 | we are going to work on that mix.
00:39:17.360 | I'd love to see everything ideally be represented
00:39:20.600 | in a similar level as it is.
00:39:22.400 | If you took GitHub as a data set,
00:39:25.280 | if you took how are the languages broken down
00:39:27.480 | in terms of popularity,
00:39:28.400 | that would be my ideal data mix to start.
00:39:30.840 | It's just that it's not cheap doing this.
00:39:32.840 | So, yeah, trying to have an equal amount of Ruby and Rust
00:39:37.840 | and all these different things at our current state
00:39:41.000 | is not really what we're looking for.
00:39:43.080 | - There's a lot of good Ruby in my GitHub profile.
00:39:45.240 | You can have it all.
00:39:46.080 | - Well, okay, perfect, we'll just train on that.
00:39:48.240 | - For running tests, it sounds easy, but it isn't,
00:39:51.280 | especially when you're working in enterprise codebases
00:39:53.960 | that are kind of very hard to spin up.
00:39:56.080 | How do you set that up?
00:39:57.200 | It's like, how do you make a model
00:39:58.840 | actually understand how to run a codebase,
00:40:00.960 | which is different than writing code for a codebase?
00:40:03.680 | - The model itself is not in charge
00:40:05.800 | of setting up the codebase and running it.
00:40:07.840 | So Genie sits on top of GitHub,
00:40:09.480 | and if you have CI running GitHub,
00:40:11.880 | you have GitHub actions and stuff like that,
00:40:13.480 | then Genie essentially makes a call out to that,
00:40:16.040 | runs your CI, sees the outputs, and then moves on.
00:40:19.680 | Making a model itself set up a repo
00:40:23.280 | wasn't scoped in what we wanted Genie to be able to do,
00:40:26.160 | because for the most part, at least most enterprises
00:40:29.280 | have some sort of CI pipeline running,
00:40:31.040 | and a lot of, if you're doing some,
00:40:32.840 | even a lot of hobbyist software development
00:40:35.080 | has some sort of basic CI running as well.
00:40:37.240 | And that was the lowest hanging fruit approach that we took.
00:40:39.840 | So when Genie ships, the way it will run its own code
00:40:42.440 | is it will basically run your CI,
00:40:43.720 | and it will take the, I'm not in charge of writing this,
00:40:47.640 | the rest of the team is,
00:40:48.480 | but I think it's the Checks API on GitHub that
00:40:50.440 | allows you to grab that information
00:40:52.000 | and throw it in the context window.
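For reference, the GitHub Checks API does expose check-run results per commit; a hedged sketch of pulling and condensing them for a context window (not necessarily how the Genie team implemented it) might look like:

```python
import requests

def ci_results_for_commit(owner, repo, sha, token):
    """Fetch check runs for a commit via the GitHub Checks API and condense them."""
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/commits/{sha}/check-runs",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        timeout=30,
    )
    resp.raise_for_status()
    runs = resp.json()["check_runs"]
    # Keep just enough to be useful inside a context window.
    return [
        {"name": r["name"], "conclusion": r["conclusion"],
         "summary": (r["output"]["summary"] or "")[:500]}
        for r in runs
    ]
```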
00:40:53.480 | - What's the handoff like with the person?
00:40:56.360 | So Genie, you give it a task,
00:40:58.800 | and then how long are you supposed to supervise it for?
00:41:02.400 | Or are you just waiting for the checks to eventually run,
00:41:05.560 | and then you see how it goes?
00:41:06.800 | Like, what does it feel like?
00:41:08.280 | - There are a couple of modes that it can run in.
00:41:10.480 | Essentially, it can run in fully headless autonomous modes.
00:41:13.080 | So say you assign it a ticket in linear or something,
00:41:15.920 | then it won't ask you for anything.
00:41:17.960 | It will just go ahead and try.
00:41:19.960 | Or if you're in the GUI on the website and you're using it,
00:41:22.960 | then you can give it a task,
00:41:24.240 | and it might choose to ask you a clarifying question.
00:41:26.960 | So if you ask it something super broad,
00:41:29.320 | it might just come back to you and say,
00:41:30.800 | what does that actually mean?
00:41:31.840 | Or can you point me in the right direction for this?
00:41:33.360 | Because our decision internally
00:41:36.040 | was it's gonna piss people off way more
00:41:38.720 | if it just goes off and makes a completely
00:41:41.680 | ruined attempt at it,
00:41:42.680 | because it just, from day one, got the wrong idea.
00:41:45.600 | So it can ask you a lot of questions.
00:41:48.320 | And once it's going, much like a regular PR,
00:41:51.400 | you can leave review comments, issue comments,
00:41:54.400 | all these different things.
00:41:55.400 | And it, because it's been trained
00:41:57.360 | to be a software engineering colleague,
00:41:58.640 | responds in actually a better way than a real colleague,
00:42:01.160 | because it's less snarky and less high and mighty.
00:42:04.800 | And also the amount of filtering it has had to do for LGTMs.
00:42:07.520 | When you train a model to be a software engineer,
00:42:11.120 | essentially, you can't just use anything.
00:42:12.480 | It's like, yeah, it looks good to me, bro.
00:42:13.840 | - Sure. (laughs)
00:42:15.640 | I just wanted to dive in a little bit more
00:42:17.120 | on your experience with the fine-tuning team.
00:42:19.280 | John Allard was publicly very complimentary and supportive
00:42:22.840 | and, you know, was part of it.
00:42:24.240 | Like, what is it like working with them?
00:42:25.720 | I also picked up that you initially started to fine-tune
00:42:29.600 | what was publicly available, the 16 to 32K range.
00:42:32.960 | You got access to do more than that.
00:42:35.080 | You've also trained on billions of tokens
00:42:37.320 | instead of the usual millions range.
00:42:40.000 | Just like, take us through that fine-tuning journey
00:42:42.400 | and any advice that you may have.
00:42:43.840 | - It's been so cool.
00:42:45.720 | And this will be public by the time this goes out.
00:42:47.760 | Like, OpenAI themselves have said,
00:42:49.520 | we are pushing the boundaries
00:42:50.680 | of what is possible with fine-tuning.
00:42:52.480 | Like, we are right on the edge.
00:42:53.640 | And like, we are working, genuinely working with them
00:42:57.200 | in figuring out how stuff works, what works,
00:42:59.160 | what doesn't work, because no one's doing,
00:43:01.200 | no one else is doing what we're doing.
00:43:03.120 | They have found what we've been working on
00:43:04.640 | super interesting, which is why they've allowed us
00:43:06.680 | to do so much, like, interesting stuff.
00:43:09.080 | Working with John, I mean,
00:43:09.920 | I had a really good conversation with John yesterday.
00:43:11.560 | We had a little brainstorm after the video we shot.
00:43:14.120 | And one of the things,
00:43:15.880 | you mentioned the billions of tokens.
00:43:17.400 | One of the things we've noticed,
00:43:18.400 | and it's actually a very interesting problem
00:43:19.600 | for them as well,
00:43:20.440 | when you're building like a self-serve fine-tuning API,
00:43:22.720 | they have to decide how big your PEFT adapter,
00:43:26.840 | your LoRA adapter, is going to be in some way.
00:43:28.520 | And like, figuring that out
00:43:29.440 | is actually a really interesting problem.
00:43:31.000 | Because if you make it too big,
00:43:33.240 | because they support data sets that are so small,
00:43:34.840 | you can put like 20 examples through it
00:43:36.080 | or something like that.
00:43:36.920 | Like, if you had a really sparse, large adapter,
00:43:39.000 | you're not going to get any signal in that at all.
00:43:40.840 | So they have to dynamically size these things.
00:43:42.600 | And there is an upper bound.
00:43:43.520 | And actually, we use models that are larger
00:43:47.480 | than what's publicly available.
00:43:48.680 | It's not even publicly available yet,
00:43:49.680 | but when this goes out, it will be.
00:43:51.720 | But we have larger LoRA adapters available to us,
00:43:56.200 | just because the amount of data
00:43:57.400 | that we're pumping through it.
00:43:58.640 | And at that point,
00:43:59.560 | you start seeing really interesting other things,
00:44:02.640 | like you have to change your learning rate schedule
00:44:05.000 | and do all these different things
00:44:06.160 | that you don't have to do
00:44:07.440 | when you're on the smaller end of things.
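As an illustration only, dynamically sizing a LoRA adapter with data volume could be sketched with the `peft` library roughly like this; the rank thresholds and target modules are made up and are not OpenAI's internal sizing logic:

```python
from peft import LoraConfig

def lora_config_for_dataset(num_training_tokens: int) -> LoraConfig:
    """Illustrative heuristic: scale adapter rank with data volume so a tiny
    dataset doesn't get a sparse, under-trained adapter and a huge one isn't
    capacity-starved. Thresholds and ranks here are invented for the sketch."""
    if num_training_tokens < 1_000_000:
        rank = 8
    elif num_training_tokens < 100_000_000:
        rank = 32
    else:
        # Billions of tokens: bigger adapter, and you likely also want a
        # different learning-rate schedule at this scale.
        rank = 128
    return LoraConfig(r=rank, lora_alpha=2 * rank,
                      target_modules=["q_proj", "v_proj"])
```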
00:44:08.760 | So working with that team is such a privilege,
00:44:11.560 | because obviously they're like at the top of their field
00:44:13.880 | in the fine-tuning space.
00:44:15.520 | So as we learn stuff, they're learning stuff.
00:44:19.320 | And one of the things that I think
00:44:20.600 | really catalyzed this relationship
00:44:22.000 | is when we first started working on Genie,
00:44:23.720 | like I delivered them a presentation,
00:44:25.200 | which will eventually become the blog post
00:44:26.640 | that you'll love to read soon.
00:44:28.160 | The information I gave them there,
00:44:29.400 | I think is what showed them like,
00:44:30.240 | "Oh, wow, okay, these guys are really like
00:44:32.720 | pushing the boundaries of what we can do here."
00:44:35.120 | And truthfully, our data set,
00:44:37.440 | we view our data set right now as very small.
00:44:39.640 | It's like the minimum that we're able to afford,
00:44:42.400 | literally afford right now
00:44:43.560 | to be able to produce a product like this.
00:44:45.480 | And it's only gonna get bigger.
00:44:46.760 | So yesterday while I was in their offices,
00:44:48.440 | I was basically, so we were planning,
00:44:50.080 | we were like, okay, how,
00:44:51.480 | this is where we're going in the next six to 12 months.
00:44:53.880 | Like we're putting our foot on the gas here,
00:44:55.880 | 'cause this clearly works.
00:44:56.840 | Like I've demonstrated this is a good,
00:44:58.640 | you know, the best approach so far.
00:45:00.840 | And I wanna see where it can go.
00:45:01.840 | I wanna see what the scaling was like for the data.
00:45:03.720 | And at the moment, like it's hard to figure that out
00:45:05.440 | because you don't know when you're running into like
00:45:08.040 | saturating a PEFT adapter,
00:45:09.360 | as opposed to actually like, is this the model's limit?
00:45:11.680 | Like, where is that?
00:45:12.520 | So finding all that stuff out
00:45:13.720 | is the work we're actively doing with them.
00:45:16.200 | And yeah, it's gonna get more and more collaborative
00:45:18.960 | over the next few weeks as we explore like larger adapters,
00:45:22.320 | pre-training extension, different things like that.
00:45:24.640 | - Awesome.
00:45:25.480 | I also wanted to talk briefly
00:45:26.360 | about the synthetic data process.
00:45:29.160 | One of your core insights was that
00:45:30.800 | the vast majority of the time,
00:45:32.040 | the code that is published by a human is in a working state.
00:45:35.480 | And actually you need to fine tune on non-working code.
00:45:37.920 | - Yes.
00:45:38.760 | - So just, yeah, take us through that inspiration.
00:45:40.760 | How many rounds did you do?
00:45:43.480 | - Yeah, I mean, it might be generous to say
00:45:45.960 | that the vast majority of code is in a working state.
00:45:47.840 | I don't know if I believe that.
00:45:48.680 | - Yeah, I don't know if I believe that.
00:45:49.520 | - I was like, that's very nice of you to say
00:45:51.280 | that my code works.
00:45:52.120 | - Certainly, it's not true for me.
00:45:54.680 | No, I think that, so yeah, no, but it was, you're right.
00:45:57.560 | It's an interesting problem.
00:45:58.400 | And what we saw was when we didn't do that,
00:46:01.600 | obviously you have to basically like one-shot the answer.
00:46:04.600 | 'Cause after that it's like,
00:46:05.520 | well, I've never seen iteration before.
00:46:07.120 | How am I supposed to figure out how this works?
00:46:08.760 | So what you're alluding to there
00:46:11.960 | is like the self-improvement loop
00:46:13.280 | that we started working on.
00:46:14.840 | And that was in sort of two parts.
00:46:16.040 | We synthetically generated runtime errors
00:46:19.440 | where we would intentionally mess with the AST
00:46:23.400 | to make stuff not work or index out of bounds
00:46:26.840 | or refer to a variable that doesn't exist
00:46:28.800 | or errors that the foundational models
00:46:31.920 | just make sometimes that you can't really avoid.
00:46:34.080 | You can't expect it to be perfect.
00:46:36.040 | So we threw some of those in
00:46:37.200 | with a probability of happening.
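A toy example of that AST-mutation idea, assuming Python source: it breaks one variable reference so the code fails at runtime with a NameError. It only illustrates the approach, it is not Cosine's pipeline.

```python
import ast
import random

class BreakOneName(ast.NodeTransformer):
    """Rename one loaded variable so the code raises NameError at runtime."""
    def __init__(self):
        self.done = False

    def visit_Name(self, node):
        if not self.done and isinstance(node.ctx, ast.Load) and random.random() < 0.3:
            self.done = True
            broken = ast.Name(id=node.id + "_undefined", ctx=ast.Load())
            return ast.copy_location(broken, node)
        return node

def inject_runtime_error(source: str, probability: float = 0.2) -> str:
    """With some probability, return a subtly broken version of the source."""
    if random.random() > probability:
        return source
    tree = BreakOneName().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # requires Python 3.9+
```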
00:46:39.520 | And on the self-improvement side,
00:46:41.480 | I spoke about this in the blog post,
00:46:43.400 | essentially the idea is that
00:46:45.320 | you generate your data in sort of batches.
00:46:48.480 | First batch is like perfect, like one example,
00:46:50.680 | like here's the problem, here's the answer,
00:46:52.160 | go train the model on it.
00:46:53.840 | And then for the second batch,
00:46:55.560 | you then take the model that you trained before
00:46:58.240 | that can look like one commit into the future.
00:47:00.600 | And then you let it have the first attempt
00:47:02.560 | at solving the problem.
00:47:03.760 | And hopefully it gets it wrong.
00:47:05.640 | And if it gets it wrong,
00:47:06.720 | then you have like, okay,
00:47:08.080 | now the code base is in this incorrect state,
00:47:09.760 | but I know what the correct state is.
00:47:11.040 | So I can do some diffing essentially
00:47:13.400 | to figure out how do I get the state that it's in now
00:47:16.160 | to the state that I want it in.
00:47:17.560 | And then you can train the model
00:47:18.720 | to then produce that diff next
00:47:20.520 | and so on and so on and so on.
00:47:21.920 | So the model can then learn
00:47:23.760 | and also reason as to why it needs to make these changes
00:47:26.760 | to be able to learn how to like learn,
00:47:28.480 | like solve problems iteratively
00:47:30.440 | and learn from its mistakes and stuff like that.
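A hedged pseudocode sketch of that self-improvement loop, with `attempt` and `train` as placeholders for the actual inference and fine-tuning calls; the data shapes are assumptions:

```python
import difflib

def correction_example(problem, wrong_files, correct_files):
    """Build a training example that teaches the model to recover from its own mistake."""
    # Diff each file the model got wrong against the known-correct version.
    fixes = []
    for path in wrong_files:
        fixes.append("".join(difflib.unified_diff(
            wrong_files[path].splitlines(keepends=True),
            correct_files[path].splitlines(keepends=True),
            fromfile=f"a/{path}", tofile=f"b/{path}")))
    return {"prompt": problem + "\n\n[previous attempt applied]",
            "target": "\n".join(fixes)}

def self_improvement_round(prev_model, problems, gold_states, train, attempt):
    """One batch: attempt with the last checkpoint, keep failures as correction data.

    `train` and `attempt` are placeholders for your fine-tuning and inference calls.
    """
    batch = []
    for problem, gold in zip(problems, gold_states):
        attempted = attempt(prev_model, problem)   # files after the model's first try
        if attempted != gold:                      # it got it wrong: that's the signal
            batch.append(correction_example(problem, attempted, gold))
    return train(prev_model, batch)                # next checkpoint
```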
00:47:32.360 | - And you pick the size of the data set
00:47:34.200 | just based on how much money you could spend generating it.
00:47:36.880 | Maybe you think you could just make more and get better.
00:47:39.400 | - A multiple of my monthly burn,
00:47:40.640 | I don't want to say how much, was spent doing this.
00:47:42.200 | Yeah, basically it was very much related to,
00:47:45.040 | yeah, just like capital.
00:47:46.160 | And yes, with any luck that will be alleviated soon.
00:47:50.040 | - Very soon.
00:47:51.120 | I like drawing references to other things
00:47:52.880 | that are happening in the wild.
00:47:54.520 | So, 'cause we only get to release this podcast
00:47:56.600 | once a week, the Lama 3 paper
00:47:58.440 | also had some really interesting thoughts
00:48:00.800 | on synthetic data for code.
00:48:02.400 | I don't know if you have reviewed that.
00:48:05.080 | I'll highlight the back translation section
00:48:07.880 | because one of your data set focuses
00:48:10.400 | is updating documentation.
00:48:12.120 | I think that translation between natural language,
00:48:14.240 | English versus code and back and forth,
00:48:16.120 | I think is actually a really ripe source of synthetic data.
00:48:19.760 | And Lama 3 specifically called out
00:48:21.560 | that they trained on that.
00:48:23.360 | We should have gone more into that
00:48:24.320 | in our podcast with them,
00:48:25.160 | but we didn't know.
00:48:27.280 | But there's a lot of interesting work
00:48:28.800 | on synthetic data stuff.
00:48:30.280 | We do have to wrap up soon,
00:48:31.200 | but I'm going to briefly touch
00:48:33.160 | on the submission process for SweeBench.
00:48:35.320 | So, you have a 30% state-of-the-art SweeBench result,
00:48:39.200 | but it's not on the leaderboard
00:48:40.440 | because of submission issues.
00:48:41.960 | I don't know if you want to comment on that stuff
00:48:44.680 | versus, we also want to talk about SweeBench verified.
00:48:49.000 | Yeah, just anything on the benchmarking side.
00:48:51.280 | - The potted history of this is quite simple actually.
00:48:54.520 | SweeBench up until, I want to say two weeks ago,
00:48:57.960 | but it might be less than that or more than that.
00:49:00.200 | But I think two weeks ago,
00:49:01.720 | suddenly started mandating what they call trajectories
00:49:04.040 | when you submit.
00:49:04.880 | So, but prior to this,
00:49:06.200 | essentially when you run SweeBench,
00:49:08.200 | you run it through their harness
00:49:09.280 | and out the other end,
00:49:10.120 | you get a report.json,
00:49:11.240 | which is like, here's how many I resolved.
00:49:13.280 | Here's how many I didn't resolve.
00:49:14.360 | These are the IDs, the ones I did.
00:49:15.560 | These ones, the IDs I didn't.
00:49:16.520 | And it gives you any ones that might have errored
00:49:18.160 | or something like that.
00:49:19.600 | And what you would submit would be
00:49:22.480 | all of your model patches that you outputted
00:49:25.040 | and that report.
00:49:26.360 | And then you would like PR that into the SweeBench repo
00:49:28.640 | and that would be it.
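A small sketch of consuming that kind of report; the exact field names in `report.json` vary between harness versions, so treat them as assumptions:

```python
import json

def summarize_report(path="report.json"):
    """Read the harness output and print resolve stats (field names assumed)."""
    with open(path) as f:
        report = json.load(f)
    resolved = report.get("resolved_ids", [])
    unresolved = report.get("unresolved_ids", [])
    errored = report.get("error_ids", [])
    total = len(resolved) + len(unresolved) + len(errored)
    print(f"resolved {len(resolved)}/{total} = {len(resolved)/total:.1%}")
    return set(resolved)
```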
00:49:30.000 | That was still the case
00:49:31.800 | when we made our submission on whatever day it was.
00:49:34.200 | They look at them every Monday.
00:49:35.760 | We submitted it at some point during the week.
00:49:37.640 | I want to say it was four days before that.
00:49:39.920 | And I sort of like sat back and waited.
00:49:43.240 | I assumed it would be fine.
00:49:44.480 | When it came to Monday,
00:49:46.360 | they then said, actually, no, we want model trajectories.
00:49:49.160 | And I was like, okay, let me see what this is.
00:49:51.440 | And so on, I sort of dug into it.
00:49:53.400 | And like model trajectories are essentially
00:49:57.360 | the context window or like the reasoning process
00:49:59.480 | of like show you're working.
00:50:00.560 | How did you get here?
00:50:01.400 | If you do a math exam, show me you're working.
00:50:03.800 | Whereas before they were like,
00:50:04.760 | just give me the final answer.
00:50:05.880 | Now they want to see the working,
00:50:06.720 | which I completely understand why they want to see that.
00:50:08.960 | Like the SweeBench fundamentally
00:50:10.760 | is an academic research project
00:50:12.360 | and they want all the stuff to be open source and public
00:50:14.840 | so people can learn from each other and improve
00:50:16.520 | and so on and on.
00:50:17.360 | That's very good.
00:50:18.200 | I completely agree.
00:50:19.280 | However, at least for us,
00:50:20.720 | and the reason that we're not on the leaderboard
00:50:22.400 | is that obviously the model outputs that we generate
00:50:26.400 | are sort of a mirror of our training data set, right?
00:50:29.280 | Like you train the model to do a certain thing
00:50:31.080 | and output a certain way.
00:50:32.000 | Whatever you output looks like your training data.
00:50:34.560 | For the moment as a closed source company,
00:50:36.400 | like fighting for an edge,
00:50:38.440 | we've decided not to publish that information
00:50:40.760 | for that exact reason.
00:50:41.600 | I don't want someone basically taking my trajectories
00:50:44.480 | and then taking a model that's soon to be GA
00:50:46.400 | and just distilling it immediately
00:50:47.760 | and then having genie for themselves.
00:50:49.360 | And, you know, as a business owner,
00:50:51.760 | that's the decision I've had to make.
00:50:53.520 | The patches are still public.
00:50:55.200 | So like the, dare I say, traditional SweeBench submission,
00:50:58.880 | you can go to our GitHub repo and see it
00:51:00.320 | and run them for yourself
00:51:01.640 | and verify that the numbers come out correctly.
00:51:03.520 | Like that is all, that is the potted reason.
00:51:05.560 | - That's the story.
00:51:06.400 | - That's the story.
00:51:07.240 | - SweeBench verified?
00:51:08.320 | You have a score?
00:51:09.200 | - I do have a score.
00:51:10.320 | I do have a score, 43.8%.
00:51:13.120 | It's one of those things
00:51:13.960 | where like there aren't that many people
00:51:14.920 | on the leaderboard yet.
00:51:15.760 | So you don't know how good or bad that is.
00:51:17.600 | - It's a smaller data set, right?
00:51:19.600 | - Oh, it's great.
00:51:21.160 | So on a tangent, original SweeBench was 2,294.
00:51:25.840 | - Which is expensive.
00:51:26.680 | It's like $8,000 to run.
00:51:29.360 | - Oh, that's cheap.
00:51:30.880 | - That's cheap?
00:51:31.720 | What are you talking about?
00:51:32.560 | - I don't know.
00:51:33.400 | At least for us, I don't even want to say it publicly.
00:51:35.520 | How much it cost us to run that thing.
00:51:39.040 | Expensive, slow, really like crap for iteration
00:51:42.400 | because like, you know, you make a change to your model.
00:51:45.120 | How does it do on SweeBench?
00:51:46.320 | I guess that's why SweeBench Lite existed,
00:51:47.840 | but SweeBench Lite was not a,
00:51:50.120 | it was easy stuff, right?
00:51:51.520 | It wasn't a comprehensive measure of the overall thing.
00:51:53.800 | So we actually had the idea a month ago
00:51:56.320 | to what we were going to call SweeBench Small,
00:51:58.800 | where we were going to try to map out across SweeBench,
00:52:01.320 | like what is the distribution of like problem difficulty
00:52:03.280 | and all these different things,
00:52:04.520 | and try to come up with like 300 examples
00:52:07.080 | that sort of mapped that,
00:52:08.160 | where given a score on SweeBench Small,
00:52:10.840 | you could then predict your SweeBench large score
00:52:13.320 | and sort of go from there.
00:52:14.400 | Fortunately, OpenAI did that for us
00:52:17.040 | and probably much better than we would have done.
00:52:18.720 | They use some human labelers,
00:52:19.920 | and as obviously we're working with OpenAI quite closely,
00:52:24.120 | they talked to us about it
00:52:25.200 | and they, you know, were able to let us know
00:52:28.360 | what the instance IDs were
00:52:29.640 | that were in the new SweeBench version.
00:52:32.960 | And then as soon as I had that,
00:52:34.600 | I could just take the report from the one
00:52:36.600 | that I'd run and just diff them.
00:52:38.280 | And I was like, oh, we got 219 out of 500, which is 43.8%,
00:52:42.400 | which is to my knowledge, at least right now,
00:52:45.560 | state-of-the-art also, which makes sense.
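The "just diff them" step can be sketched as a simple set intersection between the resolved IDs from the full run and the Verified instance IDs; the filenames and fields here are assumptions:

```python
import json

def verified_score(report_path="report.json", verified_ids_path="verified_ids.json"):
    """Score an existing full-benchmark run against the Verified subset."""
    with open(report_path) as f:
        resolved = set(json.load(f).get("resolved_ids", []))  # assumed field name
    with open(verified_ids_path) as f:
        verified = set(json.load(f))                           # list of 500 instance IDs
    hits = resolved & verified
    print(f"{len(hits)} / {len(verified)} = {len(hits) / len(verified):.1%}")
    return hits
```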
00:52:48.160 | But also GPT-4o gets, I believe, 33%,
00:52:51.760 | which is like, I double-checked that, but I believe--
00:52:55.000 | - The August one, the new one.
00:52:56.400 | - Yeah, it's in their blog post.
00:52:58.720 | I can't remember which one it was.
00:52:59.840 | I don't know what the model version was,
00:53:01.040 | but GPT-4o, I believe, gets 33%,
00:53:03.600 | which is obviously like significantly better
00:53:05.720 | than what it got on the original,
00:53:08.760 | like SweeBench, SweeBench, SweeBench.
00:53:09.920 | - 2%.
00:53:10.760 | - Yeah, yeah, yeah, exactly, exactly.
00:53:12.120 | - It was scoring ridiculously low.
00:53:13.240 | - But no, SweeBench verified, like, it's so good.
00:53:16.320 | It's like, it's smaller.
00:53:17.920 | We know that the problems are solvable.
00:53:20.040 | It's not gonna cost me a lot of money to run it.
00:53:23.200 | It keeps my iteration time, you know, lower.
00:53:26.480 | And there are also some things
00:53:28.760 | that we're gonna start to do internally
00:53:30.320 | when we run SweeBench to have more of an idea
00:53:33.320 | of how right our model is.
00:53:34.600 | So one of the things I was talking to John about yesterday
00:53:36.840 | was SweeBench is a pass or fail, right?
00:53:39.200 | Like you either have solved the problem or you haven't.
00:53:41.560 | That is quite sparse.
00:53:43.480 | Like it doesn't give you a huge amount of information
00:53:45.080 | 'cause your model could have got a lot of it right.
00:53:46.680 | Like looking through when you do a math paper,
00:53:48.360 | you could have got the reasoning, you know,
00:53:49.600 | your working, right until like the penultimate step,
00:53:51.400 | and then you get it wrong.
00:53:52.680 | So we're gonna look into ways of measuring,
00:53:56.200 | okay, well, your model got it right up to this line
00:53:58.840 | and then it diverged.
00:54:00.440 | And that's super easy to do
00:54:01.480 | because obviously you know the correct state
00:54:02.840 | of all of those questions.
00:54:04.040 | So I think one of the ways
00:54:05.800 | we're gonna keep improving Genie
00:54:07.080 | is by going more in depth and saying,
00:54:09.800 | okay, for the ones that failed, was it right at any point?
00:54:12.000 | Where did it go wrong?
00:54:13.040 | How did it go wrong?
00:54:14.040 | And then sort of trying to triage those sorts of issues.
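A minimal, hedged sketch of that kind of partial-credit measurement: find where the model's patch first diverges from the gold patch. Real triage would need something smarter than line-by-line comparison, but it shows the idea.

```python
def divergence_point(model_patch: str, gold_patch: str):
    """Return the index of the first differing line, plus the fraction of the
    gold patch that was matched before the model diverged."""
    model_lines = model_patch.splitlines()
    gold_lines = gold_patch.splitlines()
    for i, (m, g) in enumerate(zip(model_lines, gold_lines)):
        if m != g:
            return i, i / max(len(gold_lines), 1)
    # No mismatch in the overlap: one patch is a prefix of the other.
    shorter = min(len(model_lines), len(gold_lines))
    return shorter, shorter / max(len(gold_lines), 1)
```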
00:54:16.720 | - So, future plans: you have mentioned
00:54:18.400 | possibly extending this to an open source model.
00:54:20.240 | But basically I think, you know, what the Genie is
00:54:22.280 | is basically this like proprietary fine tune data set
00:54:24.440 | and process and software that you can add onto any model.
00:54:28.400 | Is that the plan?
00:54:29.240 | That's the next year?
00:54:30.360 | It's gonna just be doing that?
00:54:31.360 | - We're gonna get really,
00:54:33.000 | we're gonna be the best in the world at doing that
00:54:34.680 | and continue being the best in the world at doing that
00:54:36.760 | and throwing in as many models as we can,
00:54:39.960 | seeing what the performance is like
00:54:41.280 | and seeing what things improve performance in what places.
00:54:45.120 | And also making the data set larger
00:54:46.520 | is like one of the biggest things
00:54:47.840 | we're gonna be working on.
00:54:48.960 | - I think one of the decisions before you as a CEO
00:54:51.600 | is how much you have like the house model
00:54:54.240 | be like the one true thing.
00:54:55.640 | And then how much you spend time working on customer models.
00:54:59.920 | - That's the thing that really gets me so excited.
00:55:02.960 | Genuinely, like we have a version of Genie
00:55:06.680 | that we named after one of our employees.
00:55:08.360 | (all laughing)
00:55:09.760 | It's called the John.
00:55:11.000 | We have a version of Genie
00:55:13.880 | that is fine tuned on our code base.
00:55:15.680 | So we basically, it's the base Genie
00:55:17.600 | and then we run the same data pipeline
00:55:19.760 | that we run on like all the stuff that we did
00:55:21.400 | to generate the main data set on our repo.
00:55:24.080 | And then all of a sudden you have like something
00:55:25.880 | that is both very good at software engineering
00:55:27.240 | but is also extremely good at your repo.
00:55:29.600 | And that is phenomenal to use.
00:55:32.280 | Like it's really cool.
00:55:33.560 | - More broadly outside of Cosine,
00:55:35.120 | what are you seeing?
00:55:35.960 | What trends are you seeing that you're really excited by?
00:55:39.320 | Who's doing great work that you wanna call out?
00:55:41.440 | - The one of the ones that,
00:55:43.120 | I mean, it's not an original choice
00:55:44.560 | but Cursor are absolutely killing it.
00:55:46.320 | All the employees at Cosine love using it.
00:55:48.800 | And it's a really, really good example
00:55:52.440 | of like just getting like UX right, basically.
00:55:55.760 | Like putting the LLM in the right place
00:55:58.960 | and letting it allow you
00:56:00.080 | and getting out of the way when you don't want it there
00:56:02.120 | and making it familiar 'cause it's still VS code
00:56:04.360 | and all these things.
00:56:05.640 | They've, yeah, they've done an amazing job.
00:56:07.200 | And I think they just raised a round.
00:56:08.200 | So congrats on that to them.
00:56:09.280 | So like they're doing amazing work.
00:56:10.960 | - The decision to fork VS code, I think was controversial.
00:56:13.440 | You guys started as a VS code extension.
00:56:15.080 | - We did, yeah.
00:56:15.920 | - Many, many, many people did that.
00:56:16.760 | And they did the one thing that no one wanted to do.
00:56:19.040 | - I commend the bravery, honestly.
00:56:20.640 | Like I commend the bravery.
00:56:21.920 | 'Cause like in hindsight, obviously it's paid off.
00:56:25.040 | But at least for me in the moment,
00:56:27.560 | I was one of those people being like,
00:56:29.120 | is that gonna, are people gonna do that?
00:56:30.920 | Are people gonna download that?
00:56:31.960 | And yes, obviously they are.
00:56:32.960 | Like sure, doing the hard thing,
00:56:35.240 | which is, having worked on Genie recently,
00:56:38.000 | for the past eight months or whatever,
00:56:40.080 | as taxing as it's been on us,
00:56:41.800 | like one of the main things I have learned from this
00:56:44.400 | is like, no matter how small you are,
00:56:46.680 | how much resource you have,
00:56:47.760 | just like try to do the hard thing.
00:56:49.400 | 'Cause I think it has the biggest payoff.
00:56:51.720 | - More broadly, just like lessons that you've learned
00:56:54.800 | running your company.
00:56:56.080 | - Oh.
00:56:57.960 | - It's been a two year journey.
00:56:59.320 | - Two year journey.
00:57:00.240 | I mean, it's better than any real job
00:57:02.560 | we could ever get.
00:57:03.440 | Like, I feel so lucky to be working in this area.
00:57:07.880 | Like, especially, you know,
00:57:09.440 | it was so validating to hear it
00:57:10.720 | from the guys at OpenAI as well,
00:57:11.880 | telling us like, we're on the cutting edge, that
00:57:14.160 | we're pushing the boundaries of what's possible
00:57:15.480 | with what we're doing.
00:57:16.640 | Because like, I get to do, I get to be paid to do this.
00:57:19.240 | You know, I have briefly, as you heard at the beginning,
00:57:21.920 | done real jobs and normal stuff.
00:57:24.360 | And like, just being able to do this on the daily,
00:57:26.600 | it's so interesting and so cool.
00:57:28.400 | It's like, I pinch myself a lot, genuinely,
00:57:31.360 | about the fact that I can do this.
00:57:33.200 | And also that, not only I can do this,
00:57:34.960 | but fortunately being a co-founder of the company,
00:57:37.440 | I have a huge amount of say as to where we go next.
00:57:39.520 | And that is a big responsibility,
00:57:41.600 | but it's also so exciting to me.
00:57:42.920 | 'Cause I'm like, you know,
00:57:44.000 | steering the ship has been really interesting so far.
00:57:46.800 | And I like to think that we've got it right,
00:57:48.560 | you know, in the last sort of eight months or so.
00:57:51.400 | And that this is like, really the starting point
00:57:53.760 | of something massive to come.
00:57:55.480 | - Awesome.
00:57:56.320 | Calls to action.
00:57:57.160 | I assume you're hiring.
00:57:59.400 | I assume you're also looking for customers.
00:58:00.920 | What's the ideal customer, ideal employee?
00:58:04.120 | - On the customer side,
00:58:05.640 | honestly, people who are just willing to try something new,
00:58:07.680 | like the Genie UX is different to a conventional IDE.
00:58:12.120 | Give it a chance.
00:58:13.160 | Like that what we really do believe in this whole idea
00:58:15.400 | of like developers work is going to be abstracted,
00:58:18.120 | you know, levels higher than just the code.
00:58:20.240 | We still let you touch the code.
00:58:21.680 | We still want you to dive into the code if you need to.
00:58:23.960 | But fundamentally we think that
00:58:25.720 | if you're trying to offload the coding to a model,
00:58:28.040 | the model should do the coding
00:58:29.160 | and you should be in charge of guiding the model.
00:58:31.200 | So people who are willing to give something new a chance.
00:58:34.000 | Size of company.
00:58:34.840 | And honestly, well, preferably the languages
00:58:37.640 | that are the most represented in our training data.
00:58:40.000 | So like anyway, if you're like doing TypeScript,
00:58:41.760 | JavaScript, Python, Java, that sort of thing.
00:58:45.640 | And in terms of size of company,
00:58:47.400 | like so long as you're willing to try it
00:58:49.360 | and there aren't any massive like infosec things
00:58:51.880 | that get in the way, like it doesn't really matter.
00:58:53.760 | Like code base size can be arbitrary for us.
00:58:55.600 | We can deal with any code base size
00:58:57.480 | and essentially any language, but your mileage may vary.
00:58:59.920 | But for the most part, like anyone who's willing
00:59:02.240 | to give it a try is the ideal customer.
00:59:04.000 | And on the employee, honestly,
00:59:06.320 | we just want people who we're gonna be hiring both
00:59:09.480 | on like what we call like the traditional tech side.
00:59:12.960 | So like building the product essentially
00:59:14.880 | and also hiring really heavily
00:59:16.200 | on the AI machine learning data set side as well.
00:59:21.000 | And in both cases, essentially what we just wanted
00:59:24.480 | were like really passionate people
00:59:26.480 | who are obsessed with something
00:59:28.440 | and are really passionate about something
00:59:30.240 | and are willing to, it sounds so corny,
00:59:33.760 | but like join us in what we're trying to do.
00:59:35.720 | Like we have a very big ambition
00:59:37.080 | and we're biting off a very large problem here.
00:59:40.160 | And people who can look at what we've done so far
00:59:42.760 | and be like, wow, that's really impressive.
00:59:44.320 | I want to do that kind of work.
00:59:46.000 | I want to be pushing the boundaries.
00:59:47.240 | I want to be dealing with experimental stuff all the time.
00:59:51.040 | But at the same time, you're putting it in people's hands
00:59:53.280 | and shipping it to people and so on.
00:59:55.040 | So if that sounds, you know, amenable to anyone,
00:59:57.480 | that's the kind of person we're hoping will apply.
00:59:59.560 | - Excellent.
01:00:00.400 | Any last words?
01:00:01.440 | Any Trump impressions that you (laughs)
01:00:04.200 | Did you like the Trump impression?
01:00:05.280 | - Yeah, everyone loved the Trump impression.
01:00:06.480 | - Yeah, I mean, it's funny 'cause like I have some bloopers.
01:00:10.360 | I'll show you the bloopers after we finish recording.
01:00:12.000 | I'll probably tweet them at some point.
01:00:13.520 | The initial cut of that video had me doing a Trump impression.
01:00:17.160 | I sort of sat down into the chair
01:00:18.960 | and be like Cosine is the most tremendous AI lab
01:00:21.720 | in the world.
01:00:23.240 | Unbelievable.
01:00:24.080 | I walked in here and I said, well, this is an amazing lab.
01:00:26.880 | And like, we sent it to some of our friends.
01:00:28.440 | They were like, nah, you can't cold open with Trump, man.
01:00:31.320 | You just can't.
01:00:32.160 | Like, no one knows who you are.
01:00:33.160 | - You can end with it.
01:00:34.000 | - But you can end with it.
01:00:35.000 | Now that that has gone out,
01:00:36.400 | we can now post the rest of the bloopers,
01:00:39.400 | which are essentially me just like fluffing my lines
01:00:42.560 | the entire time and screaming at my co-founder
01:00:44.600 | out of frustration.
01:00:45.440 | So, yeah.
01:00:46.280 | - Well, it was very well executed.
01:00:47.880 | Actually, very few people do the kind of video that you did.
01:00:50.120 | I'm, as a sort of developer relations person,
01:00:52.600 | I'm actually excited by that stuff,
01:00:53.880 | but well, thank you for coming on.
01:00:55.640 | Very, very short notice.
01:00:56.520 | I hope you have a safe flight back
01:00:57.600 | and excited to see the full launch.
01:01:00.080 | I think this is a super fruitful area
01:01:01.920 | and congrats on your launch.
01:01:03.720 | - Thank you so much for having me.
01:01:04.760 | Cheers.
01:01:05.760 | (upbeat music)
01:01:08.360 | (upbeat music)
01:01:10.940 | (upbeat music)
01:01:13.520 | (upbeat music)