Windsurf everywhere, doing everything, all at once - Kevin Hou, Windsurf

Chapters
0:00 Introduction to Windsurf: Discover the rapid growth and key features of Windsurf, presented at AI Engineer World's Fair, including web search, MCP support, auto-generated memories, and parallel agents.
2:18 The Core Philosophy: Learn about the "secret sauce" behind Windsurf's intuitive, mind-reading AI, which creates a shared timeline between humans and AI.
3:46 Windsurf Everywhere: See the vision for Windsurf to ingest context from all developer tools, including Google Docs, Figma, GitHub, Notion, and Linear.
6:21 Windsurf Doing Everything: Explore how the AI will expand beyond coding to interact with third-party services, write design documents, and more.
8:40 Windsurf On All the Time: Understand the goal of creating a nearly autonomous AI that works in the background to assist developers.
11:17 Introducing SWE-1: Get a first look at the new software engineering model trained for entire workflows.
11:36 Benchmarking Success: Learn about the End-to-End Task Benchmark and Conversational SWE Task Benchmark, showcasing Windsurf's impressive results.
13:32 The Data Flywheel: Understand the feedback loop that drives Windsurf's continuous improvement.
14:49 The Future of AI Products: Hear Kevin's thoughts on the harmony of model, data, and application needed to build successful AI products in 2025.
00:00:00.000 |
Hello. How we doing? All right, how's the energy level? Good, good, yes, let's go, let's go. Two more, 00:00:27.600 |
two more. My name is Kevin, I lead product at Windsurf, and I'm super excited to be back here. 00:00:33.760 |
Thank you so much, swyx and Ben, it's always a pleasure to come back to AI Engineer World's Fair. 00:00:38.400 |
The velocity of our industry right now is incredible. It's like being on a kite on the ocean, 00:00:45.840 |
and we're really excited to see where the winds are taking us. A year ago, we didn't have Windsurf. 00:00:53.280 |
People were coding with autocomplete, no one had heard of an agent, and now the Windsurf editor is 00:00:59.200 |
being used by millions and millions of people all around the world. And hopefully this is a larger 00:01:05.280 |
number than last time. How many people have heard of Windsurf? And how many people have used Windsurf? 00:01:12.880 |
Good numbers, good numbers. We've got to improve that. And Windsurf itself has changed immensely in 00:01:19.600 |
the last six months since its launch in November. We retired the name Codeium because we decided to catch 00:01:25.280 |
this new wave, which is, by the way, what we call our next generation innovations in the product. We call 00:01:31.840 |
them waves. And in case you missed it, we are now 10 waves in. And some of the key waves we've been really 00:01:37.840 |
excited about: web search, MCP support, auto-generated memories - oh, I was supposed to do that - auto-generated 00:01:45.120 |
memories, deploys, and parallel agents, to name just a few. And as the waves keep growing, so does the number of 00:01:54.560 |
people that have discovered and integrated Windsurf into their daily workflows. To this day we are 00:02:00.480 |
generating about 90 million lines of code every single day. And that equates to around a thousand, 00:02:06.400 |
over a thousand messages sent every single minute. But today is not about growth. I'm not going to sit 00:02:12.320 |
here and tell you about the numbers. I'm going to tell you about the why. Why do people feel connected to 00:02:17.120 |
the Windsurf editor? And I know no AI company really wants to disclose its secrets, but I had to come up with 00:02:24.160 |
some content. So today I'm going to let you in on one of ours. Our secret sauce is a shared timeline 00:02:32.960 |
between the human and the AI. And this is what makes people feel like we're reading their minds. 00:02:38.000 |
And now everything you do as a software engineer can be thought of on this shared timeline. So if we 00:02:45.280 |
rewind way back to the dark days - this is pre-autocomplete when everyone knew how to write a for loop - you had to do 00:02:51.040 |
everything. You had to edit files. You had to type every single character. Imagine that. 00:02:55.280 |
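One way to picture the shared timeline is as a single ordered log of actions, each tagged with who performed it. The sketch below is illustrative only: `TimelineEvent`, `agent_share`, and the action names are hypothetical and are not Windsurf's internal representation, which the talk does not describe.

```python
from dataclasses import dataclass
from typing import Literal

Actor = Literal["human", "agent"]


@dataclass(frozen=True)
class TimelineEvent:
    """One action on the shared timeline (hypothetical schema)."""
    actor: Actor   # who performed the action
    action: str    # e.g. "edit_file", "run_command" (illustrative names)
    target: str    # file, command, or resource the action touched


def agent_share(timeline: list[TimelineEvent]) -> float:
    """Return the fraction of timeline events performed by the agent."""
    if not timeline:
        return 0.0
    agent_events = sum(1 for event in timeline if event.actor == "agent")
    return agent_events / len(timeline)


# A toy session: the human opens a file and approves, the agent does the rest.
timeline = [
    TimelineEvent("human", "open_file", "dashboard.py"),
    TimelineEvent("agent", "read_file", "dashboard.py"),
    TimelineEvent("agent", "edit_file", "dashboard.py"),
    TimelineEvent("agent", "run_command", "pytest"),
    TimelineEvent("human", "approve", "dashboard.py"),
]

print(agent_share(timeline))  # 0.6
```

Keeping both actors in one ordered log is what lets either side pick up where the other left off; the share metric is just one simple way to summarize how much of the work the agent handled.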
But then once services like Copilot, like Codeium, they launched, then devs got really excited. They 00:03:02.800 |
started seeing a small percentage of their code being written by AI, and we started to abstract and 00:03:07.280 |
accelerate the number of small edits, small actions that we would do for a user. And in late 2024, 00:03:13.840 |
with the advent of Windsurf's agent and the launch of the Windsurf editor, we saw that we could do 00:03:18.560 |
more and more for the user. We started being able to edit multiple files at once, perform background 00:03:24.640 |
research across thousands and thousands of files, and execute terminal commands directly inside the 00:03:30.320 |
editor. But at Windsurf, we're in the business of trying to change how software gets created. And this 00:03:38.080 |
means that the timeline is actually a little bit more complicated. It needs to handle actions taken outside 00:03:43.440 |
of just the IDE. And so given how much of a developer's workflow happens outside of the editor, 00:03:51.440 |
what does this mean for Windsurf? First, Windsurf is going to be everywhere. 00:03:57.440 |
Specifically, Windsurf will need to be able to read and ingest context from every single source that a 00:04:06.240 |
developer uses. And if we zoom out and think about what makes you all, software engineers, successful, 00:04:12.480 |
there are a couple of different categories. The first of which, coding related. File reads, running 00:04:18.960 |
terminal commands, seeing your history, even which tabs you have open inside of your editor. This all informs 00:04:24.320 |
how to generate the correct code. But it goes beyond that. There's external sources. Things like going 00:04:30.640 |
onto GitHub and viewing a past history of commits. Maybe looking at a PR that is doing something similar to 00:04:35.600 |
the feature you're about to implement. Doing online searches, web searches, looking at documentation. 00:04:40.720 |
And then there's the third category, and this is where it gets a little bit interesting. 00:04:44.400 |
It's called meta-learning. It's the idea of what separates a junior engineer from a senior engineer from 00:04:51.280 |
a staff engineer. These are the organizational best practices, the engineering preferences that 00:04:56.880 |
all get encoded into what makes good code. And so if we think about what this means in practice, 00:05:04.080 |
let's say that we are going to build a new page on a data viz dashboard. Let's walk through step by 00:05:08.800 |
step. So first, you would probably start in Slack, as all good things start from Slack. You'll build context, 00:05:14.160 |
looking at a bunch of maybe customer requests. Maybe you'll have some internal messages. You'll collect that 00:05:18.560 |
context, and you'll start planning. And this means you're going to be in Google Docs. You're going to 00:05:23.040 |
be writing design docs, probably working on some infrastructure designs. You're going to be tracking 00:05:27.840 |
tickets inside of Jira. And then you might have a designer who's actually working in Figma in parallel, 00:05:33.040 |
putting together all this material. And then finally the fun part, or at least this is my favorite part, 00:05:37.600 |
which is the actual writing of the code, and hopefully you use something like Windsurf to do so. 00:05:41.040 |
But you're not done from there. Once you're code complete, you still have to open the PR. You've got to get reviews, 00:05:46.800 |
you've got to merge into main, you've got to deploy, SEO, analytics, the list goes on and on and on. 00:05:51.760 |
And this is really why we've built what we've built. Because we know that for you, 00:05:57.680 |
it's extremely important that we can fetch context from your Google Docs, that we can read your Figma 00:06:03.440 |
files, and that we can one-click connect to any MCP service so that you can access your information in 00:06:09.680 |
things like Notion, Linear, Stripe, and countless others. And we've spent the last 10 waves making sure 00:06:16.640 |
that Windsurf can be ubiquitous. But we know that's also not enough. We know it's not enough just to read. 00:06:24.480 |
We need to be able to do and write everything. We need to be able to do it all for you. 00:06:28.560 |
And so the AI has to take action on a wide variety of surfaces beyond just the coding surface in order to 00:06:37.440 |
accomplish what a human software engineer would do. And so this doesn't mean just write code. This means 00:06:42.160 |
interacting with third-party services, provisioning API keys, writing design docs, PRDs, wireframing, 00:06:49.200 |
testing, and the list could go on and on and on. And so for the last six months we've oriented ourselves 00:06:55.280 |
around how do we do everything. And if we go back to this concrete example of building a new web app, 00:07:02.880 |
where do we start? We start by running codebase-relevant terminal commands. This is something 00:07:09.040 |
that we launched right at the advent of Windsurf. And what's really cool about what we can do here is 00:07:13.760 |
that we can intelligently decide which commands we want to run automatically and which ones we want to wait 00:07:19.040 |
and ask for explicit user approval. Next, you'll open up Windsurf browser previews, which allows you 00:07:26.480 |
to iterate from there. It allows you to visually iterate with the agent so that Windsurf can take 00:07:30.560 |
control of Chrome just like you would, inspecting DOM elements, looking at your JS console, being able to 00:07:36.000 |
do what a web developer would do. And so now you could say our app is code complete. We'll use the GitHub MCP 00:07:43.760 |
to open up a pull request. And we can use context from your other PRs to be able to inform the description. 00:07:48.880 |
And inform the test plan. And code review is still a necessary part of any software company. 00:07:56.560 |
And so we launched Windsurf reviews, which can automatically leave comments and suggest changes 00:08:02.000 |
asynchronously so that you can be confident that the code that hits main is production ready. 00:08:08.320 |
So now that your code is merged, you'll want to be able to deploy. And so we also released a one-click service 00:08:15.760 |
to Netlify so that, using Windsurf's custom tool integrations, in just one click the agent 00:08:21.600 |
will deploy what you have to the live web. And so as you can see, we've really built the ability for 00:08:29.200 |
Windsurf to read everything that you can and do everything or almost everything that a software engineer can. 00:08:38.880 |
It's only inevitable that Windsurf will be on all the time, working for you even when you don't know it. 00:08:46.880 |
We pioneered the agentic human-in-the-loop synchronous workflows back when we released Windsurf in 2024. 00:08:54.560 |
And today, timelines are 80 to 90% agent, 10 to 20% human. But we're trying to build towards a future that gets 00:09:04.240 |
to 99% agent and 1% human. We only want to ask the user for final approval. And as more and more of 00:09:12.800 |
these timelines and workflows become AI-powered, it becomes possible to have Windsurf working for you at 00:09:17.920 |
all times. Not only as you type and use autocomplete and tab, but also in the background, researching when 00:09:24.960 |
you're working, fully in parallel, only asking you to approve. And we want to build this future where you 00:09:32.160 |
can code anytime. You can write software at any time. This includes your bed, this includes the toilet, when 00:09:39.360 |
you're on the bus, voice-activated Alexa, the possibilities are endless. And so now that we've defined the 00:09:47.120 |
problem, it's a little bit more structured. You could say, all right, we'll throw GPT, we'll throw 00:09:51.360 |
Gemini at this timeline problem. But then from there, where do we go? How do we improve? And specifically, 00:09:58.080 |
how is Windsurf able to tackle this problem of the timeline? And if we take a step back, this really 00:10:05.840 |
doesn't look like we're writing code anymore. This looks significantly more complicated than your average 00:10:10.800 |
competitive programming question. Windsurf wants to revolutionize the way that software gets 00:10:17.040 |
built. It's not just how code gets written. We are solving a broader set of tasks than just code. And while the 00:10:24.480 |
industry focuses heavily on things like SWE-bench, we know that the future is not going to be tokens 00:10:30.080 |
in, tokens out. Software engineering workflows are going to be much messier than this. It means that you have to be able to 00:10:35.280 |
pick up tasks mid-workflow. You have to be able to deal with messy codebase states mid-commit. And you will have to work with tools that are outside of the editor. 00:10:44.720 |
And so we have to be able to ingest and perform over this broad set of actions on this timeline to keep 00:10:52.160 |
our users in the flow. We have to be able to open up PRs. We have to know when to access analytics. We need to know 00:10:58.160 |
how to debug your CI/CD all by itself. And this problem starts to look really, really different from what 00:11:04.160 |
people are evaling on. And because we have our own representation of this timeline we needed a different 00:11:11.600 |
system to be able to handle these types of actions than what the off-the-shelf frontier models could give us. 00:11:15.600 |
And so where are we going with this? The realization of this is our brand new software engineering model 00:11:24.000 |
called SWE-1. We realized ourselves that we could actually dream bigger and build the best software 00:11:30.000 |
engineering model that we could. SWE-1 is trained to handle software engineering workflows, not just 00:11:37.360 |
purely code generation. And we use two main offline eval benchmarks. The first one, end-to-end task 00:11:45.040 |
benchmark. This is basically tackling pull requests. This is saying, given an intent, given the starting point of a 00:11:50.320 |
code base, how do we get to the end and pass all the unit tests? Familiar paradigm. The second one is where it gets 00:11:56.960 |
a little bit more interesting. This is what we call the Conversational SWE Task Benchmark. And this is how well the 00:12:03.120 |
model can assist when it's being dropped into an existing user conversation or a partially completed task. And so this 00:12:10.960 |
actually lends itself very nicely to the Windsurf paradigm, right? Because we're not going 00:12:14.480 |
cleanly from start to end. We're assisting in helping you along the way, mid-timeline. And so it results in this 00:12:20.880 |
blended score of helpfulness, efficiency, and correctness, and really tests the model's ability to 00:12:26.720 |
seamlessly integrate into the Windsurf style of working. And this initial performance really gives us a lot of 00:12:32.880 |
confidence in SWE-1's architecture. Specifically, how we've been able to train for software engineering workflows. 00:12:40.080 |
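The two offline evals described above can be sketched in a few lines. This is a hedged illustration, not the actual benchmark code: the talk does not say how helpfulness, efficiency, and correctness are combined, so the equal weighting, the 0-to-1 scale, and the names `ConversationalResult`, `blended_score`, and `end_to_end_pass_rate` are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConversationalResult:
    """Scores for one mid-conversation task, each assumed to lie in [0, 1]."""
    helpfulness: float
    efficiency: float
    correctness: float


def blended_score(
    result: ConversationalResult,
    weights: tuple[float, float, float] = (1.0, 1.0, 1.0),  # assumed equal weights
) -> float:
    """Weighted average of helpfulness, efficiency, and correctness."""
    w_help, w_eff, w_corr = weights
    total_weight = w_help + w_eff + w_corr
    weighted_sum = (
        w_help * result.helpfulness
        + w_eff * result.efficiency
        + w_corr * result.correctness
    )
    return weighted_sum / total_weight


def end_to_end_pass_rate(all_tests_passed: list[bool]) -> float:
    """Fraction of end-to-end tasks where every unit test passed."""
    if not all_tests_passed:
        return 0.0
    return sum(all_tests_passed) / len(all_tests_passed)


result = ConversationalResult(helpfulness=0.9, efficiency=0.6, correctness=0.75)
print(round(blended_score(result), 2))                    # 0.75
print(end_to_end_pass_rate([True, False, True, True]))    # 0.75
```

The end-to-end number is a strict pass/fail measure, while the blended score rewards a model for being useful when dropped into a half-finished task, which is the case the talk emphasizes.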
And we've been able to achieve near-frontier model results at a fraction of the cost. 00:12:49.040 |
And one of Windsurf's greatest strengths, of course, is in the value of community. Real software engineers 00:12:56.800 |
doing real work, giving real feedback. And what we found is that SWE-1, it's in the little drop-down for the models, 00:13:03.680 |
it's right up there with the rest of the frontier models. People are choosing SWE-1 because it 00:13:08.400 |
recognizes how they do work, not necessarily how to generate code. And it's contributing, actually, 00:13:14.640 |
at an even higher frequency than models like 3.7 and 3.5. 00:13:19.280 |
Windsurf builds at the frontier so that our users can build more with the best technology. 00:13:27.200 |
We learn from our failure modes so that we can iterate from there. And what does this start to look 00:13:33.680 |
like? Dare I say it? A data flywheel? We ship the best product. Devs and non-devs use that product to 00:13:41.920 |
level up as a skill multiplier or as a skill enabler. Users then help us find the frontier. They use things 00:13:50.240 |
like thumbs up, thumbs down, accept, reject, constantly informing us not of what the SWE-bench frontier is, 00:13:56.800 |
but what is the software engineering frontier. What tools are missing? Which workflows are repeated? Where does the product fall short? 00:14:06.000 |
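As a rough sketch of how signals like these could point at where the product falls short, the snippet below aggregates feedback events into an approval rate per workflow. The signal names mirror the ones mentioned in the talk, but the event format, the workflow labels, and the function `approval_rate_by_workflow` are hypothetical; nothing here reflects Windsurf's actual telemetry.

```python
from collections import Counter

# Signal names follow the talk; grouping them into two buckets is an assumption.
POSITIVE_SIGNALS = {"thumbs_up", "accept"}
NEGATIVE_SIGNALS = {"thumbs_down", "reject"}


def approval_rate_by_workflow(events: list[tuple[str, str]]) -> dict[str, float]:
    """Map each workflow to its share of positive feedback signals.

    `events` is a list of (workflow, signal) pairs; unknown signals are ignored.
    """
    positives: Counter[str] = Counter()
    totals: Counter[str] = Counter()
    for workflow, signal in events:
        if signal in POSITIVE_SIGNALS:
            positives[workflow] += 1
            totals[workflow] += 1
        elif signal in NEGATIVE_SIGNALS:
            totals[workflow] += 1
    return {workflow: positives[workflow] / totals[workflow] for workflow in totals}


# Toy feedback log with made-up workflow labels.
events = [
    ("open_pr", "accept"),
    ("open_pr", "accept"),
    ("open_pr", "reject"),
    ("debug_ci", "thumbs_down"),
    ("debug_ci", "reject"),
    ("debug_ci", "thumbs_up"),
    ("debug_ci", "reject"),
]

rates = approval_rate_by_workflow(events)
weakest = min(rates, key=rates.get)
print(weakest, rates[weakest])  # debug_ci 0.25
```

The workflow with the lowest rate is a candidate for the next round of tools or training, which is the "find the frontier, then build at it" loop in miniature.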
And we take those insights and we build at this frontier. We train a better model. We build more tools. 00:14:15.600 |
We improve our agentic harness. We improve our memories, our checkpointing, with the goal of being everywhere, 00:14:22.240 |
doing anything. And we will repeat this cycle. We will be shipping, finding the frontier, building at the margin, 00:14:31.360 |
and repeating. And what gets me really personally excited about this is SWE-1 is really an example of this in action. 00:14:39.360 |
We have a very small team, significantly fewer resources than the larger companies, and we were able to 00:14:45.600 |
achieve near-frontier model quality results with SWE-1. And even more so, this is really a demonstration 00:14:53.440 |
of what it means to build AI products in 2025. It demands this harmony of model, data, and application, 00:15:03.520 |
where the application is actually mimicking the user behavior that you want to replicate inside of your model. 00:15:08.720 |
And this is how Windsurf will be everywhere, doing everything all at once. 00:15:20.160 |
I won't give you any promises, but someone made a profit. But in all seriousness, thank you so much for 00:15:40.320 |
listening. I want to make sure that every engineer out there is using the best possible tools. So please 00:15:45.200 |
give Windsurf a try today, and we are also hiring across a number of different roles. We have a booth 00:15:49.280 |
downstairs, so please come join us. Help make this future a reality. Thank you.