Is finetuning GPT4o worth it?
Chapters
0:00 Alistair and Cosine intro
11:34 GPT4o finetuning
15:18 Genie Data Mix
18:09 Customizing for Customers
20:37 Genie Workflow
22:41 Code Retrieval
30:20 Planning
37:29 Language Mix
38:46 Running Code
41:19 Finetuning with OpenAI
44:32 Synthetic Code Data
47:54 SynData in Llama 3
48:33 SWE-Bench Submission Process
53:20 Future Plans
54:36 Ecosystem Trends
55:55 Founder Lessons
57:58 CTA: Hiring & Customers
- Hey everyone, welcome to the Latent Space Podcast. 00:00:10.040 |
And I'm joined by my co-host swyx, founder of Smol.ai. 00:00:13.040 |
- Hey, and today we're back in the studio, in person, 00:00:17.200 |
after about three to four months in visa jail and travels 00:00:24.440 |
But today with special guest, Alistair Pullen from Cosine. 00:00:30.960 |
because you're on a two day trip to San Francisco. 00:00:35.160 |
Don't fly from London to San Francisco for two days. 00:00:37.080 |
- And you launched Genie on a plane, on plane WiFi, 00:00:48.920 |
I've been lucky to be a small angel in part of that journey. 00:00:52.760 |
And it's exciting to see that you're launching 00:01:03.960 |
You did your bachelor's in computer science in Exeter, 00:01:10.360 |
And roundabout 2022, you started working on a stealth startup 00:01:24.960 |
He was an Android developer, I was an iOS developer. 00:01:31.960 |
sort of we'd be approached to build projects for people. 00:01:41.360 |
And over time, we started doing larger and larger projects, 00:01:56.040 |
We'll just keep like writing code in our bedrooms. 00:02:00.320 |
And then a friend of ours that we went to Exeter with 00:02:07.360 |
And it was one of these fast grocery delivery companies. 00:02:11.920 |
in the deepest, darkest countryside in England, 00:02:14.200 |
where fast grocery companies are still not a thing. 00:02:16.600 |
So he sort of pitched me this idea and was like, 00:02:19.000 |
listen, like I need an iOS dev, do you fancy coming along? 00:02:23.560 |
It was a chance to get out of my parents' house, 00:02:24.920 |
chance to move to London, you know, do interesting things. 00:02:27.560 |
And at the time, truthfully, I had no idea what YC was. 00:02:32.560 |
I knew I liked coding and building apps and stuff, 00:02:35.000 |
but I'd never really done anything in that area. 00:02:39.960 |
I moved to London just sort of as COVID was ending 00:02:56.000 |
So like the client mobile apps, the backends, 00:03:10.320 |
We didn't have like proper engineering experience. 00:03:11.880 |
There were definitely decisions we'd do differently now. 00:03:13.920 |
We'd definitely buy a lot of stuff off the shelf, 00:03:24.520 |
This sounds so much better than all our friends 00:03:26.080 |
who were like consultants and doing like normal jobs, right? 00:03:34.600 |
And there was obviously a transitionary period 00:03:36.240 |
and integration period, like with all acquisitions. 00:03:39.760 |
And as soon as we'd vested what we wanted to vest 00:03:46.880 |
we left and we knew that we wanted to go alone 00:03:55.440 |
And we knew that we wanted to do something similar ourselves. 00:04:01.120 |
So we tried some small projects in various different areas. 00:04:20.880 |
what was the game they trained a model to play? 00:04:24.920 |
So I'd followed that and knew loosely what GPT-2 was. 00:04:30.040 |
So I was like, okay, this GPT-3 thing sounds interesting. 00:04:38.960 |
And the model was DaVinci 2 at the time. 00:04:41.840 |
And it was just the old school playground, completions, 00:04:49.000 |
Honestly, I had this conversation in OpenAI's office yesterday. 00:05:01.320 |
And it gave me some sort of like fairly generic response 00:05:03.520 |
back and I was like, okay, that looks pretty cool. 00:05:05.440 |
The next thing was, I looked through the docs 00:05:11.120 |
I didn't know if you could put anything in, 00:05:13.560 |
or if you had to structure it in a certain way 00:05:15.960 |
And I saw that it could start writing like tables 00:05:34.040 |
and we just started messing around in the playground, 00:05:45.480 |
It was like, this thing's trained to write code. 00:05:49.520 |
I think, I can't actually remember if Copilot 00:06:06.280 |
We eventually built the world's most flimsy system, 00:06:13.760 |
trying to keep as much context from one to the other, 00:06:17.560 |
where essentially you'd put in an app idea in a box, 00:06:24.400 |
figuring out what the front end should be written in, 00:06:29.040 |
And then we'd go through like for each thing, 00:06:39.520 |
We were like, no, we're gonna write all the code 00:06:46.960 |
and it would build something that did actually run, 00:06:49.640 |
the back end would run, the database would work. 00:06:54.440 |
And that's what we showed to our co-founder, Yang. 00:07:09.160 |
Historically, he's had a few exits in the past 00:07:11.240 |
and has been through all different industries. 00:07:15.240 |
He hates me saying that, but he's a bit older. 00:07:20.280 |
this is absolutely amazing, let's just do something. 00:07:21.880 |
'Cause he, at the time, was just about to have a child, 00:07:29.400 |
The interview was, as most YC interviews are, 00:07:48.280 |
of what you've learned in building that thing 00:07:49.920 |
into something that might be a bit more useful 00:07:55.240 |
a little bit more, at least at the time they did back then. 00:07:57.760 |
So we were like, okay, maybe we could build something 00:08:05.280 |
what that would look like or how you would build it 00:08:09.040 |
And they were like, yeah, that sounds interesting. 00:08:13.560 |
You're in, you've got two weeks to build us an MVP. 00:08:22.360 |
And at the time we were like, we don't even know 00:08:28.480 |
And we didn't really know what we wanted to build, 00:08:30.960 |
Like we knew we wanted to try to help automate dev work, 00:08:40.560 |
like 4,000 tokens, you're not going very far. 00:08:43.880 |
So we ended up building a code-based retrieval tool 00:08:49.560 |
we want to build something that can do our jobs for us. 00:08:53.320 |
We've seen like there are glimpses of it happening 00:08:57.560 |
but we don't see the path of how to do that at the moment. 00:09:02.360 |
So we were like, well, there are going to be some things 00:09:04.040 |
that you need to build this when the tech does catch up. 00:09:06.560 |
So retrieval being one of the most important things, 00:09:09.360 |
like the model's going to have to be able to pull code 00:09:15.440 |
then we'll be able to just like plug it into our tooling 00:09:20.440 |
And to be fair, that's basically what we've done. 00:09:36.520 |
'cause it was just me and Sam like working like all hours 00:09:42.240 |
trying to get like a good semantic search engine working 00:09:48.200 |
We were trying to avoid sending code to the cloud 00:09:52.760 |
you're like, you know, millions of lines of code. 00:09:55.080 |
You're trying to do some sort of like local HNSW thing 00:09:59.120 |
that like eats all your RAM as you've seen in the past, 00:10:03.760 |
- My first call with you, I think I had trouble. 00:10:06.840 |
I was like, "Yeah, I know, I know, I know it sucks. 00:10:13.640 |
the first six to eight months of what at the time was Buildt. 00:10:20.200 |
- "Buildt," yeah, it was a terrible, terrible name. 00:10:22.640 |
- It was the worst part of trying to think about 00:10:29.000 |
- No, so when we went on our first ever YC like retreat, 00:10:37.160 |
And then we actually changed the name to Cosine. 00:10:42.120 |
as if you're cosigning for an apartment or something. 00:10:49.840 |
back in the end of 2022, the ambition to like build 00:10:52.560 |
something that essentially automated our jobs 00:10:54.440 |
was still very much like core to what we were doing. 00:10:58.480 |
But for a very long time, it was just never apparent to us 00:11:01.080 |
like, how would you go about doing these things? 00:11:06.080 |
16K suddenly felt huge 'cause you've gone from four to 16, 00:11:22.800 |
you then start, you see 32K, 32K was really smart. 00:11:29.280 |
you could fit a decent amount of stuff in it. 00:11:32.320 |
And then finally 128K came along and we were like, 00:11:34.280 |
"Right, this is like, this is what we can actually deal with 00:11:37.520 |
because fundamentally to build a product like this, 00:11:42.440 |
and make sure that everything it ever writes in output 00:11:45.080 |
can be traced back to something in the context window 00:11:52.200 |
"Okay, I know that this is now gonna be feasible 00:11:55.480 |
We'd done early sort of dev work on Genie using 3.5, 16K. 00:12:00.480 |
And that was a very, very like crude way of proving 00:12:09.960 |
actually had signal and worked and could do something. 00:12:15.240 |
because you couldn't ever fit enough information into it 00:12:25.240 |
the base intelligence of the model is lacking, 00:12:28.360 |
to like do software engineering is quite involved. 00:12:34.520 |
and at that point we'd been in touch with OpenAI 00:12:38.760 |
about our ambitions and like how we wanted to build it. 00:12:48.160 |
and back then there was still a decent amount of lag time 00:12:53.160 |
and then allowing you to fine tune it in some way. 00:12:56.280 |
They've gotten much better about that recently. 00:13:03.520 |
And I know that's something they're definitely 00:13:10.520 |
YC companies had like a direct Slack channel to OpenAI. 00:13:19.600 |
- If they're releasing this fine tuning ability 00:13:22.680 |
But like you can't build a startup on the YC advantage. 00:13:25.360 |
It's obviously nice, it makes you feel warm and fuzzy inside 00:13:36.440 |
- I think he's head of solutions or something. 00:13:41.000 |
We'd been talking to him from the very beginning 00:13:43.440 |
and he's been absolutely fantastic throughout. 00:13:54.400 |
And as soon as like that 128K model came out, 00:13:58.520 |
I was like, I know this definitely isn't possible 00:14:07.760 |
We tried that, it's obviously even fewer tokens, 00:14:10.960 |
And I was like, if we can marry the intelligence 00:14:30.520 |
Let's put it through and iterate essentially. 00:14:33.960 |
And that's where like Genie as we know it today 00:14:39.640 |
I won't pretend like the first version of Genie 00:14:43.160 |
That's where you realize all the implicit biases 00:14:46.040 |
And you realize that, oh, actually this decision you made 00:14:52.880 |
how you write Git diffs and you're using LLMs 00:15:00.360 |
But as soon as we had access to the underlying tool, 00:15:02.400 |
we were like, right, we can actually do this. 00:15:07.760 |
'cause I didn't know it was like, it wasn't a done deal, 00:15:09.960 |
but I knew that we could build something useful. 00:15:13.480 |
that would be measurably good on whatever eval at the time 00:15:21.960 |
we weren't actually that familiar with SWE-Bench. 00:15:40.400 |
it was actually a very useful tool in building Genie 00:15:43.520 |
"Yes, vibe check this thing and see if it's useful." 00:15:45.920 |
And then all of a sudden you have an actual measure 00:15:48.120 |
to see like, could it do software engineering? 00:15:56.240 |
And eventually we got it to the point where it is now. 00:15:59.440 |
And a little bit beyond since we actually got that score 00:16:10.120 |
but that's essentially a potted answer of how we got here. 00:16:15.440 |
You mentioned bias in the data and some of these things. 00:16:23.040 |
And you kind of highlighted how the data needed to train it 00:16:39.440 |
You know, and like there's this kind of known truth 00:16:46.200 |
But since we put so much in the pre-training data, 00:16:48.560 |
what else do you add when you then train Genie? 00:16:51.000 |
- Yeah, I think that sort of boils down fundamentally 00:16:54.120 |
to the difference between a model writing code 00:16:58.520 |
Because the software engineering sort of discipline 00:17:01.680 |
goes wider because if you look at something like a PR, 00:17:10.720 |
and has eventually been squashed into some diffs, right? 00:17:22.680 |
But of course, it's a super lossy thing, a PR. 00:17:25.200 |
You have no idea why or how, for the most part, 00:17:28.560 |
which, you know, anyone who's worked in a company 00:17:30.240 |
realizes PR reviews can be a bit dodgy at times. 00:17:33.360 |
But you see that you lose so much information at the end. 00:17:37.120 |
And that's perfectly fine because PRs aren't designed 00:17:42.880 |
But what we realized was if you want something 00:17:45.720 |
that's a software engineer, and very crudely, 00:17:47.760 |
we started with something that can do PRs for you, 00:17:50.120 |
essentially, you need to be able to figure out 00:17:56.680 |
essentially, you just have a code writing model. 00:17:58.000 |
You have something that's good at HumanEval, 00:17:59.560 |
but not very good at SWE-Bench, essentially. 00:18:01.960 |
That realization was part of the kernel of the idea 00:18:05.680 |
of the approach that we took to design the agent 00:18:10.200 |
The way that we decided we want to try to extract 00:18:14.200 |
what happened in the past, like as forensically as possible, 00:18:17.600 |
has been and is currently like one of the main things 00:18:22.440 |
Because doing that, getting as much signal out as possible, 00:18:29.240 |
that determines how well we do on that benchmark 00:18:32.080 |
Once you've sorted things out, like output structure, 00:18:40.320 |
to the model actually figuring out how to solve a problem, 00:18:56.720 |
as you've probably seen in the technical report and so on, 00:18:59.360 |
all of those different languages and different combinations 00:19:04.440 |
and we've extracted all that information out. 00:19:06.040 |
- How does that differ when you work with customers 00:19:10.000 |
Like, do you think, is there usually a big delta 00:19:17.640 |
most of open source is updating readmes and docs. 00:19:27.400 |
like the amount of readme updating that went in, 00:19:33.080 |
we just sort of threw it in and saw what happened. 00:19:44.880 |
And it was, again, like we didn't clean the data. 00:19:55.520 |
So the process of doing all that was super interesting 00:20:03.600 |
getting it aligned with what we want the model to do 00:20:06.120 |
to be able to get the model to be useful in some way. 00:20:14.880 |
I've done a lot of developer tools investing in my career 00:20:29.640 |
like whatever else you need to plug into the code base, fine. 00:20:34.800 |
Like, what's the discussion going into these companies? 00:20:37.280 |
Are most people comfortable with like letting you see 00:20:45.720 |
people becoming more amenable to the idea over time, 00:20:58.520 |
And of course, like companies building in this space, 00:21:03.640 |
and there are gonna be new rules that come out 00:21:04.960 |
to make sure that we're looking at your code, 00:21:16.360 |
and many of them want it to be sandboxed to start with 00:21:24.880 |
and see because like, despite all those things, 00:21:30.240 |
allow them to build more in a given time period and stuff, 00:21:42.480 |
you should be taking people off the wait list 00:21:44.160 |
and launching people so people can see this themselves. 00:21:55.160 |
And controversially, you have set yourself apart 00:22:00.520 |
by saying that things like having access to a browser 00:22:04.960 |
Is that an accurate reading of what you wrote? 00:22:14.760 |
ragging the correct files, if that makes sense. 00:22:18.640 |
but obviously there are more fundamental things 00:22:21.240 |
you have to get right before you get to like, 00:22:35.960 |
No, I mean, no, I'm not deriding the approach there, 00:22:47.640 |
because you could just add the docs of something 00:22:51.000 |
and now I have, now when I'm installing a new library, 00:22:56.720 |
And then obviously having a code interpreter does help. 00:22:59.160 |
I guess you have that in the form of running tests. 00:23:05.800 |
So we have a tool where you can like put in URLs 00:23:10.240 |
and it also uses Perplexity's API under the hood as well 00:23:12.920 |
to be able to actually ask questions if it wants to. 00:23:16.960 |
Like those tools are super important and super key. 00:23:20.720 |
I think obviously the most important tools to these agents 00:23:24.440 |
are like being able to retrieve code from a code base, 00:23:27.680 |
being able to read Stack Overflow articles and what have you 00:23:30.640 |
and just be able to essentially be able to Google like we do 00:23:46.240 |
What approach you thought would work, didn't work? 00:23:49.600 |
- It's funny, I had a similar conversation to this 00:23:51.760 |
when I was chatting to the guys from OpenAI yesterday. 00:23:57.680 |
specifically semantically, at least to start with, 00:24:00.000 |
I mean like keyword search and stuff like that 00:24:01.600 |
is a solved problem, it's been around for ages, 00:24:07.760 |
was searching for what code does rather than what code is, 00:24:11.200 |
like searching for functionality is really hard, really hard. 00:24:18.240 |
was that obviously like a very basic and easy approach 00:24:26.120 |
maybe using an AST, maybe using number of lines, 00:24:31.920 |
And once you've done that, I will write a query saying like, 00:24:34.800 |
find me some authentication code or something, 00:24:43.640 |
It doesn't work well at all because fundamentally, 00:24:47.360 |
if you think about like semantically how code looks 00:24:55.000 |
So what we ended up, the first approach we took 00:24:57.280 |
and that kind of did well enough for a long time was, 00:25:12.320 |
embed that, and then do the cosine similarity. 00:25:21.920 |
And that was kind of like the start of our engine, 00:25:25.080 |
as we called it, which is essentially like the aggregation 00:25:37.720 |
choose which ones it thought were most appropriate 00:25:41.640 |
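A minimal sketch of that chunk, embed, and cosine-similarity engine; the chunking heuristic and embedding model here are illustrative assumptions, not Cosine's actual implementation:

```python
# Minimal sketch of chunk -> embed -> cosine-similarity retrieval.
import numpy as np
from openai import OpenAI

client = OpenAI()

def chunk_file(source: str, window: int = 40) -> list[str]:
    """Naively split a source file into fixed-size windows of lines."""
    lines = source.splitlines()
    return ["\n".join(lines[i:i + window]) for i in range(0, len(lines), window)]

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def top_k(query: str, chunks: list[str], k: int = 5) -> list[str]:
    """Rank chunks by cosine similarity between query and chunk embeddings."""
    vecs = embed(chunks)
    q = embed([query])[0]
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    q = q / np.linalg.norm(q)
    scores = vecs @ q
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]
```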
So the whole code search thing was a really hard problem. 00:25:46.360 |
And actually what we ended up doing with Genie 00:25:53.720 |
So actually we don't use our engine for Genie. 00:25:59.520 |
and then like say GPT-4 with some JSON output being like, 00:26:04.360 |
with these inputs and then we should use semantic 00:26:10.520 |
and Genie has self-played in its training data 00:26:17.320 |
Much more akin to how a developer would do it. 00:26:20.960 |
Sean, go into this new code base you've never seen before 00:26:26.600 |
you're gonna probably, you might do some keywords. 00:26:30.080 |
You're gonna try to figure out from the directories 00:26:35.880 |
you're probably gonna be doing the go to definition stuff 00:26:40.240 |
and try to use the graph to like get closer and closer. 00:26:45.200 |
Starts on the file system, looks at the file system, 00:26:55.200 |
go to definition, go to references and so on. 00:27:01.560 |
- No, that's no, we're not doing this in VS Code. 00:27:04.240 |
We're just using the language servers running. 00:27:17.360 |
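For the curious, talking to a language server directly rather than through an editor looks roughly like this; textDocument/definition is the standard LSP method, framed with the Content-Length header LSP uses over stdio, and the file path is an invented example:

```python
# Sketch: a raw LSP go-to-definition request, no editor involved.
import json

def definition_request(request_id: int, file_uri: str, line: int, character: int) -> bytes:
    body = json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "textDocument/definition",
        "params": {
            "textDocument": {"uri": file_uri},
            "position": {"line": line, "character": character},  # zero-based
        },
    })
    # LSP frames each message with a Content-Length header over stdio.
    return f"Content-Length: {len(body)}\r\n\r\n{body}".encode()

# e.g. write to a running language server's stdin:
# proc.stdin.write(definition_request(1, "file:///repo/src/auth.py", 41, 8))
```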
and although like Genie still has access to these tools, 00:27:24.360 |
It uses them through this process and figures out, 00:27:27.840 |
okay, I've learned from data how to find stuff in code bases 00:27:33.520 |
but I think it was around 65 or 66% retrieval accuracy 00:27:38.520 |
we know what lines we need for these tasks to find 00:27:42.080 |
for the task to actually be able to be completed. 00:27:47.640 |
which is one of the biggest areas of free performance 00:27:52.000 |
because when we were building Genie truthfully, 00:28:04.800 |
And the bulk of the work we did was on the solving. 00:28:11.240 |
have you found everything you need for the task? 00:28:19.440 |
And at the top of the funnel, of course, is RAG. 00:28:23.840 |
considering the size of some of the code bases 00:28:27.640 |
But as soon as that, if that number becomes 80, 00:28:31.360 |
That's one of the key areas we're gonna focus on 00:28:35.120 |
- Be interesting to break out a benchmark just for that. 00:28:39.200 |
- 'Cause I don't know what state of the art is. 00:28:42.800 |
'cause like for a given PR, you know what lines are edited. 00:28:49.200 |
- Yeah, you can do it, you can do it super easily. 00:28:50.600 |
And that's how we got that figure out at the other end. 00:28:59.680 |
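A sketch of how that retrieval benchmark can be scored: the lines touched by the PR's gold patch act as labels, and accuracy is the fraction the retriever surfaced. The unified-diff parsing below is simplified for illustration:

```python
# Score retrieval against the lines a PR's patch actually edits.

def gold_lines(patch: str) -> set[tuple[str, int]]:
    """Collect (file, line-number) pairs added by a unified diff (simplified)."""
    gold, current_file, lineno = set(), None, 0
    for raw in patch.splitlines():
        if raw.startswith("+++ b/"):
            current_file = raw[len("+++ b/"):]
        elif raw.startswith("@@"):
            # "@@ -10,6 +12,8 @@" -> this hunk starts at line 12 in the new file
            lineno = int(raw.split("+")[1].split(",")[0].split()[0])
        elif raw.startswith("+") and not raw.startswith("+++"):
            gold.add((current_file, lineno))
            lineno += 1
        elif not raw.startswith("-"):
            lineno += 1  # context line
    return gold

def retrieval_accuracy(retrieved: set[tuple[str, int]], patch: str) -> float:
    targets = gold_lines(patch)
    return len(targets & retrieved) / len(targets) if targets else 1.0
```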
And initially, one of the biggest performance gains 00:29:02.440 |
that we saw when we did work on the rag a bit 00:29:06.720 |
to like go to definition and really try to get it 00:29:12.960 |
where like the LSP is not working or whatever, 00:29:15.480 |
you suddenly feel really like disarmed and naked. 00:29:40.680 |
The one company that's not doing this is magic.dev. 00:29:46.760 |
if you have a 10 million token context window? 00:29:56.960 |
it's an LTM, it's not a transformer they're using, right? 00:30:00.120 |
If I'm not mistaken, I believe it's not a transformer. 00:30:03.640 |
- I'm just, listen, they obviously know a lot more 00:30:06.600 |
I don't know a great deal about how magic works. 00:30:14.480 |
I like the way we've done it because fundamentally, 00:30:17.120 |
like we focus on the act of software engineering 00:30:25.360 |
Fundamentally, the underlying model that we use 00:30:30.320 |
Like so long as it's the best one, I don't mind. 00:30:34.760 |
like you can get transformers to have like million, 00:30:37.320 |
one and a half million token context windows. 00:30:41.520 |
So like as soon as you can fine tune Gemini 1.5, 00:30:45.040 |
then you best be sure that Genie will run on Gemini 1.5 00:30:48.600 |
and like we'll probably get very good performance 00:30:51.000 |
I like our approach 'cause we can be super agile 00:30:52.760 |
and be like, "Oh, well, Anthropic have just released 00:30:54.480 |
"whatever and it might have half a million tokens 00:30:58.280 |
And I can just immediately take my JSONL file 00:31:00.520 |
and just dump it in there and suddenly Genie works on there 00:31:04.040 |
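The portability being described comes from fine-tuning data just being a JSONL file of chat-format examples; the field names below follow OpenAI's documented format, and the contents are invented placeholders:

```python
# One chat-format training example per line of a JSONL file.
import json

example = {
    "messages": [
        {"role": "system", "content": "You are a software engineering agent."},
        {"role": "user", "content": "Fix the failing test in auth.py.\n<retrieved code...>"},
        {"role": "assistant", "content": "<plan, then the patch...>"},
    ]
}

with open("train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")  # one example per line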
- Does Anthropic have the same fine tuning support 00:31:09.960 |
They are partnered with AWS and it's gonna be in Bedrock. 00:31:17.800 |
We have to keep moving on to the other segments. 00:31:20.480 |
The second piece of your four-step grandmaster plan. 00:31:25.560 |
A lot of people are talking about Strawberry, 00:31:30.880 |
Is current state-of-the-art planning good enough? 00:31:40.400 |
like you can ask them to think step by step 00:31:44.440 |
because if you look at how those models score 00:31:46.840 |
then they're not even close to state-of-the-art. 00:31:50.080 |
- So like just like SWE-Bench and so on, right? 00:31:52.840 |
And like even the things that get really good scores 00:31:57.840 |
Obviously these things can reason, quote unquote, 00:32:03.680 |
it's constrained by the model's intelligence, 00:32:15.160 |
how we think about problems when we're solving them, 00:32:17.240 |
as opposed to how a model thinks about a problem 00:32:20.200 |
And that's obviously part of like the derivation pipeline 00:32:25.880 |
But the reasoning that the models do right now, 00:32:30.280 |
whatever it ends up being called, looks like, 00:32:56.480 |
without having to do much work, which is always nice. 00:33:08.840 |
maybe an energy-based model or something like that, 00:33:16.120 |
where thought happens before tokens get produced. 00:33:23.840 |
For what happens in the future, we'll have to see, 00:33:29.200 |
And certainly the reasoning that we see Genie do 00:33:39.680 |
at least just on a vibe check alone looks far better. 00:33:51.520 |
which is I can modify the plan while it's executing. 00:33:59.880 |
and I'll use Markdown to specify how I do it. 00:34:02.920 |
I'm just curious if like, you know, those things help. 00:34:09.120 |
not least because it's really frustrating when it's not. 00:34:12.840 |
where like there's the one thing I just wish I could, 00:34:15.720 |
and you'd be right if that one thing was right 00:34:20.560 |
Like you can, if it makes a small error in a patch, 00:34:22.960 |
you can just change it yourself and let it continue 00:34:25.760 |
So yeah, like those things are super important. 00:34:32.640 |
I feel like the models are so good at writing code 00:34:41.360 |
you got the right files and you got the right plan. 00:34:43.680 |
- That's a great question because by the time this is out, 00:34:49.920 |
which contains all the learnings that I delivered 00:35:04.800 |
I basically got the average log prob for a token 00:35:08.400 |
at every token position in the context window. 00:35:13.160 |
and then the average log prob for each index in there. 00:35:16.160 |
As we discussed, like the way Genie works normally is, 00:35:20.120 |
and then you do your planning and then you do your coding 00:35:23.280 |
The certainty of code writing is so much more certain 00:35:30.720 |
the model is really comfortable with writing code. 00:35:32.520 |
There is no doubt and it's like in the token probabilities. 00:35:41.840 |
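A sketch of that analysis, assuming you already have per-token logprobs for a batch of sampled completions (e.g. from an API that returns them): bucket positions in the context window and average within each bucket:

```python
# Average token log-prob by position, to compare planning vs. code-writing.

def avg_logprob_by_position(samples: list[list[float]], bucket: int = 1024) -> dict[int, float]:
    """samples[i] is the ordered list of token logprobs for one completion."""
    sums: dict[int, float] = {}
    counts: dict[int, int] = {}
    for logprobs in samples:
        for pos, lp in enumerate(logprobs):
            b = pos // bucket  # group positions into buckets of `bucket` tokens
            sums[b] = sums.get(b, 0.0) + lp
            counts[b] = counts.get(b, 0) + 1
    return {b * bucket: sums[b] / counts[b] for b in sorted(sums)}
```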
if you ask GPT-4 in ChatGPT to edit some code for you, 00:35:45.440 |
it's going to rewrite the entire snippet for you 00:35:58.480 |
but it's like the result of what we do is a patch, right? 00:36:05.920 |
on the pre-training like code writing corpus, 00:36:08.280 |
because obviously it's just read code files there. 00:36:10.680 |
It's obviously probably read a lot of patches, 00:36:12.240 |
but I would wager it's probably read more code files 00:36:14.920 |
So it's probably leaning on a different part of its brain 00:36:27.640 |
so long as you're not too deep into the context window, 00:36:29.600 |
another thing that I'll bring up in that blog post 00:36:41.120 |
by probability of solving a SWE-Bench issue, 00:36:44.360 |
given the number of tokens of the context window. 00:36:51.600 |
you are more likely to fail than you are to succeed 00:36:56.680 |
And when I presented that to the fine tuning team 00:36:59.000 |
at OpenAI, that was super interesting to them as well. 00:37:01.560 |
And that is more of a foundational model attribute 00:37:07.320 |
However the attention mechanism works in GPT-4, 00:37:10.480 |
however, you know, they deal with the context window. 00:37:17.040 |
Even though obviously all our training data is perfect, 00:37:24.800 |
the training data still shows it being solved there, 00:37:27.040 |
but it's just in practice, the model is finding it 00:37:33.240 |
so for a 200K context size, is 100K tokens like the 0.5? 00:37:42.720 |
I hope you don't just take the context length 00:37:52.160 |
looking at how it performs over the entire window, 00:38:03.640 |
to try to make sure as best we can without overdoing it, 00:38:10.240 |
make sure stuff sits within that sort of range 00:38:12.400 |
because we know that's our sort of battle zone. 00:38:18.160 |
So just doing that sort of analysis has been super useful 00:38:20.680 |
without actually messing with anything more structural 00:38:29.720 |
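A sketch of that filtering step: keep training examples inside the token range where the model still performs well. tiktoken is OpenAI's tokenizer library (o200k_base is the GPT-4o encoding); the 100K cutoff echoes the figure discussed above and is not a real constant from Cosine's pipeline:

```python
# Filter a JSONL training set to the "battle zone" token range.
import json
import tiktoken

enc = tiktoken.get_encoding("o200k_base")
MAX_USEFUL_TOKENS = 100_000  # illustrative cutoff, per the discussion above

def within_battle_zone(example: dict) -> bool:
    text = "".join(m["content"] for m in example["messages"])
    return len(enc.encode(text)) <= MAX_USEFUL_TOKENS

with open("train.jsonl") as src, open("train.filtered.jsonl", "w") as dst:
    for line in src:
        if within_battle_zone(json.loads(line)):
            dst.write(line)
```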
the data mix is 21% JavaScript, 21% Python, 00:38:36.960 |
- Which is JavaScript, JavaScript, JavaScript. 00:38:42.080 |
Although TypeScript is so much superior, but anyway. 00:38:43.600 |
- Do you see, how good is it at just generalizing? 00:38:46.400 |
If you're writing Rust or C++ or whatever else, 00:38:53.800 |
Obviously, though, I think there's 15 languages 00:38:55.640 |
in that technical report, I think, that we've covered. 00:39:00.920 |
were the ones that, selfishly, we internally use the most, 00:39:08.320 |
When we have more resource as a company and more time, 00:39:12.640 |
and once all the craziness that has just happened 00:39:17.360 |
I'd love to see everything ideally be represented 00:39:25.280 |
if you took how are the languages broken down 00:39:32.840 |
So, yeah, trying to have an equal amount of Ruby and Rust 00:39:37.840 |
and all these different things at our current state 00:39:43.080 |
- There's a lot of good Ruby in my GitHub profile. 00:39:46.080 |
- Well, okay, perfect, we'll just train on that. 00:39:48.240 |
- For running tests, it sounds easy, but it isn't, 00:39:51.280 |
especially when you're working in enterprise codebases 00:40:00.960 |
which is different than writing code for a codebase? 00:40:13.480 |
then Genie essentially makes a call out to that, 00:40:16.040 |
runs your CI, sees the outputs, and then moves on. 00:40:23.280 |
wasn't scoped in what we wanted Genie to be able to do, 00:40:26.160 |
because for the most part, at least most enterprises 00:40:37.240 |
And that was the lowest hanging fruit approach that we took. 00:40:39.840 |
So when Genie ships, the way it will run its own code 00:40:43.720 |
and it will take the, I'm not in charge of writing this, 00:40:58.800 |
and then how long are you supposed to supervise it for? 00:41:02.400 |
Or are you just waiting for the checks to eventually run, 00:41:08.280 |
- There are a couple of modes that it can run in. 00:41:10.480 |
Essentially, it can run in fully headless autonomous modes. 00:41:13.080 |
So say you assign it a ticket in linear or something, 00:41:19.960 |
Or if you're in the GUI on the website and you're using it, 00:41:24.240 |
and it might choose to ask you a clarifying question. 00:41:31.840 |
Or can you point me in the right direction for this? 00:41:42.680 |
because it just, from day one, got the wrong idea. 00:41:51.400 |
you can leave review comments, issue comments, 00:41:58.640 |
responds in actually a better way than a real colleague, 00:42:01.160 |
because it's less snarky and less high and mighty. 00:42:04.800 |
And also the amount of filtering it has to do for LGTM. 00:42:07.520 |
When you train a model to be a software engineer, 00:42:11.120 |
essentially, it's like, you can just do anything. 00:42:17.120 |
on your experience with the fine-tuning team. 00:42:19.280 |
John Allard was publicly very complimentary and supportive 00:42:25.720 |
I also picked up that you initially started to fine-tune 00:42:29.600 |
what was publicly available, the 16 to 32K range. 00:42:40.000 |
Just like, take us through that fine-tuning journey 00:42:45.720 |
And this will be public by the time this goes out. 00:42:53.640 |
And like, we are working, genuinely working with them 00:43:04.640 |
super interesting, which is why they've allowed us 00:43:09.920 |
I had a really good conversation with John yesterday. 00:43:11.560 |
We had a little brainstorm after the video we shot. 00:43:20.440 |
when you're building like a self-serve fine-tuning API, 00:43:22.720 |
they have to decide how big your PEFT adapter, 00:43:26.840 |
your LoRA adapter, is going to be in some way. 00:43:33.240 |
because they support data sets that are so small, 00:43:36.920 |
Like, if you had a really sparse, large adapter, 00:43:39.000 |
you're not going to get any signal in that at all. 00:43:40.840 |
So they have to dynamically size these things. 00:43:51.720 |
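To illustrate the sizing problem, here is what dynamically picking a LoRA rank from dataset size could look like using Hugging Face's peft library; the thresholds are invented for illustration, and OpenAI's actual sizing policy is not public:

```python
# Toy heuristic: more data supports a higher-rank (larger) adapter.
from peft import LoraConfig

def lora_config_for(num_examples: int) -> LoraConfig:
    rank = 8 if num_examples < 1_000 else 64 if num_examples < 100_000 else 256
    return LoraConfig(
        r=rank,
        lora_alpha=2 * rank,                  # common heuristic: alpha = 2r
        target_modules=["q_proj", "v_proj"],  # attention projections only
        lora_dropout=0.05,
    )
```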
But we have larger LoRA adapters available to us, 00:43:59.560 |
you start seeing really interesting other things, 00:44:02.640 |
like you have to change your learning rate schedule 00:44:08.760 |
So working with that team is such a privilege, 00:44:11.560 |
because obviously they're like at the top of their field 00:44:15.520 |
So as we learn stuff, they're learning stuff. 00:44:32.720 |
pushing the boundaries of what we can do here." 00:44:37.440 |
we view our data set right now as very small. 00:44:39.640 |
It's like the minimum that we're able to afford, 00:44:51.480 |
this is where we're going in the next six to 12 months. 00:45:01.840 |
I wanna see what the scaling was like for the data. 00:45:03.720 |
And at the moment, like it's hard to figure that out 00:45:05.440 |
because you don't know when you're running into like 00:45:09.360 |
as opposed to actually like, is this the model's limit? 00:45:16.200 |
And yeah, it's gonna get more and more collaborative 00:45:18.960 |
over the next few weeks as we explore like larger adapters, 00:45:22.320 |
pre-training extension, different things like that. 00:45:32.040 |
the code that is published by a human is in a working state. 00:45:35.480 |
And actually you need to fine tune on non-working code. 00:45:38.760 |
- So just, yeah, take us through that inspiration. 00:45:45.960 |
that the vast majority of code is in a working state. 00:45:54.680 |
No, I think that, so yeah, no, but it was, you're right. 00:46:01.600 |
obviously you have to basically like one-shot the answer. 00:46:07.120 |
How am I supposed to figure out how this works? 00:46:19.440 |
where we would intentionally mess with the AST 00:46:23.400 |
to make stuff not work or index out of bounds 00:46:31.920 |
just mistakes that you sometimes can't really avoid. 00:46:48.480 |
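A toy example of that AST-mutation idea, using Python's ast module to introduce a realistic off-by-one indexing bug into otherwise working code; the specific mutation is illustrative:

```python
# Corrupt working code to create "broken" states for training pairs.
import ast

class OffByOne(ast.NodeTransformer):
    """Rewrite `xs[i]` as `xs[i + 1]`, a classic index-out-of-bounds bug."""
    def visit_Subscript(self, node: ast.Subscript) -> ast.Subscript:
        self.generic_visit(node)
        if not isinstance(node.slice, ast.Slice):  # skip xs[a:b] slices
            node.slice = ast.BinOp(left=node.slice, op=ast.Add(),
                                   right=ast.Constant(value=1))
        return node

def corrupt(source: str) -> str:
    tree = OffByOne().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # Python 3.9+

print(corrupt("last = xs[len(xs) - 1]"))  # -> last = xs[len(xs) - 1 + 1]
```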
First batch is like perfect, like one example, 00:46:55.560 |
you then take the model that you trained before 00:46:58.240 |
that can look like one commit into the future. 00:47:08.080 |
now the code base is in this incorrect state, 00:47:13.400 |
to figure out how do I get the state that it's in now 00:47:23.760 |
and also reason as to why it needs to make these changes 00:47:30.440 |
and learn from its mistakes and stuff like that. 00:47:34.200 |
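One simplified way to approximate the "one commit into the future" setup from plain git history; the real pipeline described above uses the model's own intermediate states via self-play, so this sketch only shows the shape of the data:

```python
# Mine (prior state, target diff) pairs from a repo's commit history.
import subprocess

def git(*args: str, cwd: str) -> str:
    return subprocess.run(["git", *args], cwd=cwd, capture_output=True,
                          text=True, check=True).stdout

def fix_pairs(repo: str, limit: int = 100):
    for child in git("log", f"-{limit}", "--format=%H", cwd=repo).split():
        parents = git("log", "-1", "--format=%P", child, cwd=repo).split()
        if len(parents) != 1:
            continue  # skip merge and root commits
        yield {
            "before_sha": parents[0],  # code base in its prior state
            "target_diff": git("diff", parents[0], child, cwd=repo),
            "message": git("log", "-1", "--format=%B", child, cwd=repo),
        }
```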
just based on how much money you could spend generating it. 00:47:36.880 |
Maybe you think you could just make more and get better. 00:47:46.160 |
And yes, with any luck that will be alleviated soon. 00:47:54.520 |
So, 'cause we only get to release this podcast 00:48:12.120 |
I think that translation between natural language, 00:48:16.120 |
I think is actually a really ripe source of synthetic data. 00:48:35.320 |
So, you have a 30% state-of-the-art SWE-Bench result, 00:48:41.960 |
I don't know if you want to comment on that stuff 00:48:44.680 |
versus, we also want to talk about SWE-Bench Verified. 00:48:49.000 |
Yeah, just anything on the benchmarking side. 00:48:51.280 |
- The potted history of this is quite simple actually. 00:48:54.520 |
SWE-Bench up until, I want to say two weeks ago, 00:48:57.960 |
but it might be less than that or more than that. 00:49:01.720 |
suddenly started mandating what they call trajectories 00:49:16.520 |
And it gives you any ones that might have errored 00:49:26.360 |
And then you would like PR that into the SWE-Bench repo 00:49:31.800 |
when we made our submission on whatever day it was. 00:49:35.760 |
We submitted it at some point during the week. 00:49:46.360 |
they then said, actually, no, we want model trajectories. 00:49:49.160 |
And I was like, okay, let me see what this is. 00:49:57.360 |
the context window or like the reasoning process 00:50:01.400 |
If you do a math exam, show me your working. 00:50:06.720 |
which I completely understand why they want to see that. 00:50:12.360 |
and they want all the stuff to be open source and public 00:50:14.840 |
so people can learn from each other and improve 00:50:20.720 |
and the reason that we're not on the leaderboard 00:50:22.400 |
is that obviously the model outputs that we generate 00:50:26.400 |
are sort of a mirror of our training data set, right? 00:50:29.280 |
Like you train the model to do a certain thing 00:50:32.000 |
Whatever you output looks like your training data. 00:50:38.440 |
we've decided not to publish that information 00:50:41.600 |
I don't want someone basically taking my trajectories 00:50:55.200 |
So like the, dare I say, traditional SWE-Bench submission, 00:51:01.640 |
and verify that the numbers come out correctly. 00:51:21.160 |
So on a tangent, original SWE-Bench was 2,294 problems. 00:51:33.400 |
At least for us, I don't even want to say it publicly. 00:51:39.040 |
Expensive, slow, really like crap for iteration 00:51:42.400 |
because like, you know, you make a change to your model. 00:51:51.520 |
It wasn't a comprehensive measure of the overall thing. 00:51:56.320 |
to what we were going to call SWE-Bench Small, 00:51:58.800 |
where we were going to try to map out across SWE-Bench, 00:52:01.320 |
like what is the distribution of like problem difficulty 00:52:10.840 |
you could then predict your SWE-Bench Large score 00:52:17.040 |
and probably much better than we would have done. 00:52:19.920 |
and as obviously we're working with OpenAI quite closely, 00:52:29.640 |
IDs were that were in the new SWE-Bench version. 00:52:38.280 |
And I was like, oh, we got 219 out of 500, which is 43.8%, 00:52:42.400 |
which is to my knowledge, at least right now, 00:52:51.760 |
which is like, I double-checked that, but I believe-- 00:53:13.240 |
- But no, SWE-Bench Verified, like, it's so good. 00:53:20.040 |
It's not gonna cost me a lot of money to run it. 00:53:30.320 |
when we run SWE-Bench to have more of an idea 00:53:34.600 |
So one of the things I was talking to John about yesterday 00:53:39.200 |
Like you either have solved the problem or you haven't. 00:53:43.480 |
Like it doesn't give you a huge amount of information 00:53:45.080 |
'cause your model could have got a lot of it right. 00:53:46.680 |
Like looking through when you do a math paper, 00:53:49.600 |
your working is right until like the penultimate step 00:53:56.200 |
okay, well, your model got it right up to this line 00:54:09.800 |
okay, for the ones that failed, was it right at any point? 00:54:14.040 |
And then sort of trying to triage those sorts of issues. 00:54:20.240 |
But basically I think, you know, what the Genie is 00:54:22.280 |
is basically this like proprietary fine tune data set 00:54:24.440 |
and process and software that you can add onto any model. 00:54:33.000 |
we're gonna be the best in the world at doing that 00:54:34.680 |
and continue being the best in the world at doing that 00:54:41.280 |
and seeing what things improve performance in what places. 00:54:48.960 |
- I think one of the decisions before you as a CEO 00:54:55.640 |
And then how much you spend time working on customer models. 00:54:59.920 |
- That's the thing that really gets me so excited. 00:55:19.760 |
that we run on like all the stuff that we did 00:55:24.080 |
And then all of a sudden you have like something 00:55:25.880 |
that is both very good at software engineering 00:55:35.960 |
What trends are you seeing that you're really excited by? 00:55:39.320 |
Who's doing great work that you wanna call out? 00:55:52.440 |
of like just getting like UX right, basically. 00:56:00.080 |
and getting out of the way when you don't want it there 00:56:02.120 |
and making it familiar 'cause it's still VS code 00:56:10.960 |
- The decision to fork VS code, I think was controversial. 00:56:16.760 |
And they did the one thing that no one wanted to do. 00:56:21.920 |
'Cause like in hindsight, obviously it's paid off. 00:56:41.800 |
like one of the main things I have learned from this 00:56:51.720 |
- More broadly, just like lessons that you've learned 00:57:03.440 |
Like, I feel so lucky to be working in this area. 00:57:11.880 |
telling us like, we're on the cutting edge, 00:57:14.160 |
we're pushing the boundaries of what's possible 00:57:16.640 |
Because like, I get to do, I get to be paid to do this. 00:57:19.240 |
You know, I have briefly, as you heard at the beginning, 00:57:24.360 |
And like, just being able to do this on the daily, 00:57:34.960 |
but fortunately being a co-founder of the company, 00:57:37.440 |
I have a huge amount of say as to where we go next. 00:57:44.000 |
steering the ship has been really interesting so far. 00:57:48.560 |
you know, in the last sort of eight months or so. 00:57:51.400 |
And that this is like, really the starting point 00:58:05.640 |
honestly, people who are just willing to try something new, 00:58:07.680 |
like the Genie UX is different to a conventional IDE. 00:58:13.160 |
Like that what we really do believe in this whole idea 00:58:15.400 |
of like developers work is going to be abstracted, 00:58:21.680 |
We still want you to dive into the code if you need to. 00:58:25.720 |
if you're trying to offload the coding to a model, 00:58:29.160 |
and you should be in charge of guiding the model. 00:58:31.200 |
So people who are willing to give something new a chance. 00:58:37.640 |
that are the most represented in our training data. 00:58:40.000 |
So like anyway, if you're like doing TypeScript, 00:58:41.760 |
JavaScript, Python, Java, that sort of thing. 00:58:49.360 |
and there aren't any massive like infosec things 00:58:51.880 |
that get in the way, like it doesn't really matter. 00:58:57.480 |
and essentially any language, but your mileage may vary. 00:58:59.920 |
But for the most part, like anyone who's willing 00:59:06.320 |
we just want people who we're gonna be hiring both 00:59:09.480 |
on like what we call like the traditional tech side. 00:59:16.200 |
on the AI machine learning data set side as well. 00:59:21.000 |
And in both cases, essentially what we just wanted 00:59:37.080 |
and we're biting off a very large problem here. 00:59:40.160 |
And people who can look at what we've done so far 00:59:47.240 |
I want to be dealing with experimental stuff all the time. 00:59:51.040 |
But at the same time, you're putting it in people's hands 00:59:55.040 |
So if that sounds, you know, amenable to anyone, 00:59:57.480 |
that's the kind of person we're looking to apply. 01:00:06.480 |
- Yeah, I mean, it's funny 'cause like I have some bloopers. 01:00:10.360 |
I'll show you the bloopers after we finish recording. 01:00:13.520 |
The initial cut of that video had me doing a Trump impression. 01:00:18.960 |
and be like Cosine is the most tremendous AI lab 01:00:24.080 |
I walked in here and I said, well, this is an amazing lab. 01:00:28.440 |
They were like, nah, you can't cold open with Trump, man. 01:00:39.400 |
which are essentially me just like fluffing my lines 01:00:42.560 |
the entire time and screaming at my co-founder 01:00:47.880 |
Actually, very few people do the kind of video that you did. 01:00:50.120 |
I'm, as a sort of developer relations person,