Spotlight on Manus | Code w/ Claude

My name is Tao and my nickname is HighCloud — you can find me on pretty much any social network under that nickname. Right now I'm the co-founder of Manus AI, acting as our chief product officer. But I'm not just a product guy: I've been coding for more than twenty years, from a very early age — since I was nine, I think. For AI, though, I'm very much a newbie; I've only been in this industry for two years. What I want to achieve in the AI industry is to build a product that can influence 24 hours of a single user's day. That was my dream two years ago, and now, with Manus AI, I think we can actually reach it by the end of this year. Right now our heaviest single user consumes about two hours of GPU power per day, just for himself. So I think we can hit this goal by the end of the year.

Today I want to talk about two things: one is Manus, and the other is Claude models. So the first part is about what Manus is.
Manus isn't really an English word. The name comes from MIT's motto, "Mens et Manus" — old Latin for "mind and hand." Why did we choose the name Manus? Manus is the hand. It's because we think that over the past two years all these frontier models have become very smart, just like a human brain. They're super smart, capable of all kinds of tasks. But even with this very, very smart brain, we can't make a real impact on the real world, because we haven't built hands for it.

It's like when I was nine — that was around 1996, back in China. At that time our families weren't rich enough to have a computer at home, so I only got to go to the computer room at my primary school twice a week. Even though I was the best coder in my class, without going to the computer room to write and debug the code on a real machine, I couldn't get the code right in one shot with just pen and paper. And that's exactly what we've been doing for the past two years: we have a very smart mind, but we only give it pen and paper, and we ask it to solve very complex problems for us. We think that's the problem. So at Manus we don't train models — we don't even post-train or fine-tune models. What we do is build hands for these models. That is the concept behind our name, Manus.
I think most of you have seen Manus on social networks, or you're already a Manus user, and you've probably seen some use cases on our website or around social media. But today I want to share two new use cases: one from our internal usage, and another from a user.

This first one is from our internal usage. We're expanding globally — we opened our Singapore office three weeks ago, our Tokyo office two weeks ago, and we'll open our San Francisco office tomorrow. When we were choosing our Tokyo office, we asked Manus to help. The prompt was basically: we have about 40 people relocating to Tokyo, so find an office that can fit 40 people, and also solve their accommodation problem. After we gave this problem to Manus, it made its own plan and executed it — searching, browsing websites all over the internet, doing a lot of browsing and research. And after 24 minutes, Manus delivered this website for us: Tokyo office and accommodation recommendations.

It starts with an interactive map showing all ten options Manus found for us. The blue marker is the office location, and the green markers are the accommodation options near that office — all ten options on one interactive map. If you keep scrolling down, you can see the first pair is in Shibuya: it chose Shibuya Scramble Square, which is exactly the office we visited when we were looking around there. That office is very fancy but too expensive for a startup like us, so we chose an office very close to it, maybe 200 meters away. For each option it gives the price, why we should choose that office, what accommodation options we'd get with it, and the distance to the office, which is great. There are ten pairs like this, and at the bottom there's an overview table of all the options. All of this was done in those 24 minutes. Can you imagine your intern or your assistant gathering this much detailed information in such a short time?
Another thing I want to demo is that you can send Manus an image of an empty room and ask it to analyze the room's style and go to the IKEA website to find some furniture for the room — and then you see the final result. First, Manus analyzes the image of the empty room and works out the style of the room, the layout, and what kind of furniture it should look up on the IKEA website. Then it goes to the IKEA website and starts browsing for all this furniture and saving the furniture images. At the end, Manus returns an image of the room with the real IKEA furniture in it, and you can actually click the links to buy the items. Right now we can't buy things for you — but who knows, maybe in three months we can do payments. So that's the kind of thing Manus can do. It's a general agent: it can handle a very long tail of different types of tasks for you. So let's jump back.
So that's roughly what Manus is. Today, what I most want to share is how we built Manus.

Cursor inspired us a lot — we got a lot of inspiration from Cursor. That may sound weird, because Cursor is a code editor. All three founders of Manus — Peak, Red, and I — have been coders for many years. When we use Cursor, it's a genuinely great product: it helps us write in languages we don't even know how to code in, and it's very efficient. But the most interesting part isn't the way we coders use Cursor. The most interesting part is watching our friends and colleagues who are non-coders use Cursor to solve their daily tasks — data visualization, batch file processing, converting a video file into an audio file. It's fascinating because of what the Cursor interface looks like: when these non-coders use Cursor, they don't care about the left side at all, because they can't evaluate the code. All they can do is keep hitting accept, accept, accept.

Watching those friends use Cursor, we came up with an idea: these non-programmers are just using Cursor to deal with their daily tasks. Not like us — when we use Cursor, we really want to write code that will run many times in the future. These friends just want Cursor to solve their task; they don't care about the code. Next time they have the same task, they won't run that code again — they'll just ask Cursor again, and maybe it generates new code for them. So code is maybe not the ultimate goal. It's just an intermediate step for solving problems.
So we came up with the idea that maybe we should build the opposite: we should build the right panel of Cursor. And the other thing we wanted was for that right panel to run in the cloud. Why? Because when Cursor runs on your computer, it has to ask your permission to continue, since any action it performs on your machine could be dangerous — it might install dependencies or software and break your computer. If it runs in the cloud, it's much safer. And when it runs in the cloud, you don't have to keep paying attention to it: you assign a task to Manus, close your laptop or put your phone back in your pocket, and go do something else. When the task is done, we send you a notification and you get the result.

That was the original idea behind Manus, and it happened last October. From last October to this March — five months of work — came what you see today: Manus AI, launched on the fifth of March. So that's the original idea. Thank you, guys.
I also want to share some details about what we had in mind while building Manus. The first key component of Manus is that we give Manus a computer, which is super important when you compare it with typical chatbot usage. Each Manus task is assigned a fully functional virtual machine. Manus can use the file system, the terminal, VS Code, and a real Chromium browser — not a headless one. All these applications in the virtual machine create a lot of opportunities for Manus to solve different kinds of problems. For example, you can send it a zip file containing hundreds of PDFs and ask Manus to unzip it and extract all the unstructured data from those hundreds of PDFs into a structured spreadsheet. So we think giving it a computer is what really makes Manus different from other agents and chatbots.
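As a rough sketch of what "giving the agent a computer" can look like in code — this is an illustration under my own assumptions, not Manus's actual implementation; the tool names, schemas, and the `Sandbox` class are hypothetical — the sandbox's capabilities might be exposed to the model as function-calling tools backed by a per-task environment:

```python
# Hypothetical sketch: exposing a per-task sandbox to the model as tools.
# Tool names and schemas are illustrative, not Manus's real API.
import subprocess
from pathlib import Path

SANDBOX_TOOLS = [
    {
        "name": "shell",
        "description": "Run a command inside the task's virtual machine and return its output.",
        "input_schema": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
    {
        "name": "write_file",
        "description": "Write text to a file inside the task's virtual machine.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}, "content": {"type": "string"}},
            "required": ["path", "content"],
        },
    },
]

class Sandbox:
    """One isolated working directory per task (a stand-in for a full VM)."""

    def __init__(self, root: str):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def run_tool(self, name: str, args: dict) -> str:
        if name == "shell":
            result = subprocess.run(
                args["command"], shell=True, cwd=self.root,
                capture_output=True, text=True, timeout=120,
            )
            return result.stdout + result.stderr
        if name == "write_file":
            target = self.root / args["path"]
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(args["content"])
            return f"wrote {target}"
        return f"unknown tool: {name}"
```

A task like the PDFs-to-spreadsheet example above then becomes a sequence of such tool calls that the model decides on by itself.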
The second thing is that, nowadays, not every piece of information or data is on the public internet. There's so much behind paywalls or in private databases. Manus targets the consumer market, not the enterprise market, and an average user isn't familiar with how to call an API or write code to access a database. So we think it's better for us to pre-pay for these private databases and APIs on behalf of our users. Then users don't have to worry about how to get, say, real-time financial data — we've already paid for those private data sources for them.
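As a loose illustration of that idea — with a made-up provider, endpoint, and environment variable, since the talk doesn't name Manus's actual data sources — a platform-paid data tool might look like this:

```python
# Hypothetical sketch: a data tool whose API access is paid for by the platform.
# The provider, endpoint, and environment variable are illustrative only.
import json
import os
import urllib.parse
import urllib.request

PLATFORM_API_KEY = os.environ.get("DATA_PROVIDER_API_KEY", "")  # the platform's key, not the user's

def get_stock_quote(symbol: str) -> dict:
    """Fetch a real-time quote from a (hypothetical) paid financial-data API."""
    query = urllib.parse.urlencode({"symbol": symbol, "apikey": PLATFORM_API_KEY})
    url = f"https://api.example-market-data.com/v1/quote?{query}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)

# Exposed to the model as just another tool, so a user can ask for
# "this week's NVDA numbers" without ever handling an API key or a bill.
```

From the user's point of view this is simply another capability the agent has; the key, the billing, and the provider are the platform's problem.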
The third thing is that in Manus you can teach Manus how to solve problems. About a month before we released Manus, ChatGPT released deep research, and there's one part of that experience I actually don't like: you ask deep research a question, and it comes back with five or six questions for you. I don't think that's a great experience — I want you to solve tasks for me, not ask me more questions. But some colleagues on our team do like it. So we had an internal discussion about whether we should build something like a workflow, or hard-code things, to deliver that experience. Instead, we didn't do that. We implemented a personal knowledge system, where you can simply teach Manus: "Next time, before you start any research, confirm all the details with me and then execute." Once you accept that knowledge into your personal knowledge system, Manus remembers it, and from then on, every time you ask it to do research, it will confirm with you first and then execute.
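A minimal sketch of what such a knowledge system could look like — assuming a simple design in which accepted entries are stored per user and prepended to the system prompt; the structure and names here are my assumptions, not Manus's actual implementation:

```python
# Hypothetical sketch of a personal knowledge system: accepted entries are
# stored per user and injected into the agent's system prompt on every task.
from dataclasses import dataclass, field

@dataclass
class KnowledgeEntry:
    text: str          # e.g. "Before starting any research task, confirm the details with me."
    accepted: bool = False

@dataclass
class UserKnowledge:
    entries: list[KnowledgeEntry] = field(default_factory=list)

    def accept(self, text: str) -> None:
        self.entries.append(KnowledgeEntry(text=text, accepted=True))

    def render(self) -> str:
        lines = [e.text for e in self.entries if e.accepted]
        if not lines:
            return ""
        return "User-provided knowledge (follow these preferences):\n- " + "\n- ".join(lines)

def build_system_prompt(base_prompt: str, knowledge: UserKnowledge) -> str:
    """Prepend accepted user knowledge so the model applies it without any hard-coded workflow."""
    rendered = knowledge.render()
    return f"{base_prompt}\n\n{rendered}" if rendered else base_prompt
```

Because the knowledge rides along in the prompt rather than in a predefined workflow, the model decides how to apply it on each new task.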
I think these three things together make Manus really powerful. But if you want to build an agent, you'll probably try all three of them anyway. So why does the magic happen? Why is Manus such a great experience? We think the most important part is not these three components. The most important part is the fundamental concept of Manus: less structure, more intelligence. You can find that sentence at the bottom of our official website. We truly believe in it; we put a lot of faith in it.

How do we define "less structure, more intelligence"? When we released Manus, we put 42 use cases on our website, and some people said, "Manus is nothing special — you've just predefined some workflows; maybe you have 42 predefined workflows." But inside Manus, at its core, we have zero predefined workflows. It's just a very simple but very robust structure, and we leave all the intelligence to the foundation model — at that time Claude Sonnet 3.7, and now Claude 4. Because we're building the hands, what we do is compose all the context into a better structure, provide more context to the model, and exercise less control. By less control, I mean things like multi-role agent systems, where you have to specify that this is the coding agent, this is the search agent, this is some other agent. We think all of those are forms of control, and they just limit the real potential of LLMs. For us, we provide more context to the model and let the model improvise by itself. All the magic you get from Manus comes from this very simple but very strong ideology: less structure, more intelligence. That is the secret behind Manus.
And why did we choose Claude in the first place? Three reasons. The first is long-horizon planning. Before the Claude Sonnet models, because chatbots were so successful, most of the models out there — at least before this March — were post-trained and aligned for the chatbot scenario. In the chatbot scenario, the model tends to answer your question in one turn: you ask the question, you get the answer. But in agentic scenarios, like in Manus, an average task takes maybe 30 to 50 steps before you get the final answer. When we were building Manus, we tried every model we could get our hands on, and it turned out only Sonnet understood: okay, I'm inside a very long agentic loop, so I should perform an action, watch the observation, decide the next action, and keep going — action, observation, action, observation. Only Sonnet knew it was in this loop and that it had to gather more information before delivering the final result. Over those five months we tried all the other models, and all of them failed: after maybe one, two, or three iterations, those models would decide, okay, that's enough, I'll answer your question right now. That's a problem. Right now, the best models we've found for very long-horizon planning are the Claude Sonnet models. That is the first reason we chose Claude.
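In code, the loop described above is roughly the following — a hedged sketch using the Anthropic Python SDK; the model id, system prompt, step limit, and tool plumbing are placeholders rather than Manus's actual framework:

```python
# Hedged sketch of the agent loop: act, observe, and keep going until the
# model decides it has enough information to deliver the final result.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-sonnet-4-20250514"  # substitute whichever Claude model you use

def run_agent(task: str, tools: list, execute_tool, max_steps: int = 50) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        response = client.messages.create(
            model=MODEL,
            max_tokens=4096,
            system="You are an agent in a long loop: act, observe, and continue "
                   "until you have enough information to deliver the final result.",
            tools=tools,
            messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason != "tool_use":
            # The model decided it is done: return its final text.
            return "".join(b.text for b in response.content if b.type == "text")
        # Execute every requested tool call and feed the observations back.
        results = []
        for block in response.content:
            if block.type == "tool_use":
                observation = execute_tool(block.name, block.input)
                results.append({"type": "tool_result",
                                "tool_use_id": block.id,
                                "content": observation})
        messages.append({"role": "user", "content": results})
    return "Stopped: step limit reached."
```

The point of the sketch is the shape of the loop: the model keeps requesting tool calls, observations flow back in, and only when it decides it has enough does it produce the final answer.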
The second part is about tool use. Agent products rely heavily on tool use and function calling. For us, we abstracted all the tools in that virtual machine — there are about 27 tools in it — and the agent framework has to decide which action to perform in the virtual machine. So it's very important to call the right tool and to get the tool's parameters right. At the time we were building Manus, Claude models didn't have the think tool, so we implemented a mechanism of our own that we call CoT injection. Before every function call, we use a separate agent — we call it the planner agent — to do the reasoning. At that time there was no reasoning mode either, because that was 3.5, so we had to do the reasoning ourselves. We then inject that reasoning, that chain of thought, into the main agent and perform the function call. What we found is that it simply boosts function-calling performance.
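One way the mechanism might be wired up — a hedged sketch of the idea only, with placeholder prompts and injection format; the talk doesn't specify Manus's actual code:

```python
# Hedged sketch of "CoT injection": before each tool call, a separate planner
# pass writes out explicit reasoning, and that reasoning is injected into the
# main agent's context so the next function call is better grounded.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-5-sonnet-latest"  # stand-in for the pre-think-tool era described above

def plan_next_step(context_summary: str) -> str:
    """Planner pass: given a plain-text summary of the task so far, write the
    reasoning for what single action should happen next (it never calls tools)."""
    response = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        system="You are a planner. Think step by step and explain what single action "
               "the agent should take next and why. Output only the reasoning.",
        messages=[{"role": "user", "content": context_summary}],
    )
    return "".join(b.text for b in response.content if b.type == "text")

def inject_cot(messages: list, reasoning: str) -> list:
    """Fold the planner's chain of thought into the latest user turn as extra context.
    Assumes the last message is the user turn carrying the most recent tool results."""
    note = {"type": "text", "text": f"<planner_note>\n{reasoning}\n</planner_note>"}
    last = messages[-1]
    content = last["content"] if isinstance(last["content"], list) else [
        {"type": "text", "text": last["content"]}
    ]
    return messages[:-1] + [{"role": last["role"], "content": content + [note]}]

# In the main loop (see the sketch above), each iteration becomes roughly:
#   reasoning = plan_next_step(summarize(messages))  # summarize() is elided here
#   response = client.messages.create(model=MODEL, max_tokens=4096, tools=tools,
#                                     messages=inject_cot(messages, reasoning))
```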
In the article Anthropic released at the end of March about the think tool, they found the same thing. And Claude 4 has some native support for thinking during tool use — I think you all saw that in this morning's session. So that's the second reason. The third one is that we think Anthropic's models have the best alignment with agentic usage. In agentic usage we have to deal with the browser and the computer, and Anthropic has put a lot of resources into alignment for computer use and browser use, which makes Claude maybe the best model for building agents. That's why we chose Anthropic's models for Manus.

And we do spend a lot of tokens on Claude. This is the t-shirt I wore at the GTC event — we spent about one million dollars on Claude models in the first 14 days. I think maybe that's why they invited us here; it cost us a lot to be on this stage. So that's the whole story of Manus so far. If you have questions, you can line up at the two mics; we'll start from right to left, one by one.
Hello, thanks for the presentation — very interesting. I don't know how much you can share, but I was curious about your agentic workflow, especially the browser part: how much of it is vision, and how much is actually parsing the code of the web page? In other words, how much of it is the model looking at the browser the way a person would, and how much is more of a text-based interaction, if that makes sense?

Sorry, let me make sure I have the question: when Manus uses the web browser and goes to a website, how much of its understanding of the page is based on vision, and how much on text? Yeah, I got your question — that's a very good question, and I can share it; it's not a secret. When we were building Manus from last October, there was an open-source project called browser-use — I think most of you know it. We took a look at it and thought their way of talking to browsers was very useful, so we use that part. Two months later browser-use shipped its own agent framework, but we don't use that part — we only use the protocol they use to talk to browsers. The context we send to the foundation model when Manus is browsing the internet is three things. The first is the text in the current viewport, which we send to Claude. The second is a screenshot of that viewport. And the third is the same screenshot but with bounding boxes, so the model can decide which area it should click. Right now we're just sending these three things to the model. I don't know if that answers the question — okay, cool.
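To sketch how those three pieces might be packaged for the model — an illustration only; how the text, screenshots, and bounding boxes are actually captured, and the exact message layout Manus uses, are not specified in the talk:

```python
# Hedged sketch of the three pieces of browser context described above:
# viewport text, a clean screenshot, and a screenshot annotated with bounding
# boxes. Capturing them (e.g. via a browser-automation layer) is elided; this
# only shows the shape of the multimodal message sent to the model.
import base64

def image_block(png_bytes: bytes) -> dict:
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": "image/png",
            "data": base64.b64encode(png_bytes).decode(),
        },
    }

def browser_observation(viewport_text: str, screenshot: bytes, annotated_screenshot: bytes) -> dict:
    """Compose one user message carrying the current browser state."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": "Visible text in the current viewport:\n" + viewport_text},
            {"type": "text", "text": "Screenshot of the viewport:"},
            image_block(screenshot),
            {"type": "text", "text": "Same screenshot with clickable elements marked by bounding boxes:"},
            image_block(annotated_screenshot),
        ],
    }
```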
Oh yeah, another one.

Hi, I have a slightly controversial question I want to put to you. With the advancement of all these foundation models and their deep-research capabilities, how do you see a wrapper company keeping its edge? What are the research areas, the areas you're concerned about, and what are you focused on so you can actually grow in parallel with these foundation models as their capabilities increase?

Definitely — I get your question, because we've been answering it for the past two months with a lot of investors; they all ask, "What's your moat, being a wrapper?" We think there will be no moat in any specific technology or framework, because technology goes out of date very, very fast — even for models. So the only way we can win this competition is through the pace of innovation. Take the use case you mentioned, deep research. Deep research is one of the main use cases in Manus — maybe 20% of our users do deep research on Manus every day. But actually we put zero effort into the deep-research use case. As I mentioned, we just build a very simple structure, a very simple agent framework, and the deep-research capability simply emerges from that framework. It might take OpenAI half a year of end-to-end training for that one specific use case, but for us the capability just emerges from the overall structure, which means in six months we may have hundreds of use cases like deep research. And the other thing we can do is leverage the best model in the world. So I think it's two things: the flexibility of your whole agent framework, and being very flexible about choosing the best model in the world. All right, thank you.
Hi, hello, a question — oh, this may be the last question, sorry. Right now you run the browser in a virtual environment. Are you planning to put the browser in a Docker container on the local computer, so it can use the local cookies and access all of the user's accounts?

Actually, we don't have a plan for a local environment, because as I just mentioned, we think it's very important to give users their attention back. We don't want users to keep paying attention to their phone or their computer — we want everything to run in the cloud. And we don't just have a Linux virtual machine; we also plan to have a virtual Windows and a virtual Android for Manus to run on in the future. So it's all about running in the cloud. Okay, I think my time is up.