Spotlight on Manus | Code w/ Claude

My name is Tao and my nickname is HighCloud. Actually, you can just find me on any social network with my nickname HighCloud. And right now, I'm the co-founder of Manus AI and I'm acting our chief product officer. But I'm not just a product guy. I'm coding for like 20, 80 years from my very early age, like when I was nine, I can't remember.

But for AI, actually, I'm very newbie. I've just been into this industry for only two years. And what I want to achieve in the AI industry is I want to build a product that can influence 24 hours for a single user. That is two years ago. It is my dream.

But right now, I think with Manus AI, actually, I can achieve that dream at end of this year. Right now, the maximum usage for our user, there is one single user. He can just consume the GPU power two hours per day just for himself. So I think we can achieve this goal at end of this year.

And today, what I want to talk about is about two things. One is about Manus. Another thing is about cloud models. So the first talk is about what is Manus. Yeah, Manus actually, you know, it's not like an English word. Because really, it isn't. Yeah. Manus, this just comes from MIT's motto, which is Manus at Manus.

It's an old Latin word, which means man and hand. Why we choose this name Manus. Manus is the hand. It's because we think for like all the past two years, all these frontier models, they already like very smart, just like a human's brain. Yeah, they're super smart. They are capable of like doing different kind of tasks.

But you know, even with this very, very smart brain, we can make a real impact into a real world because we don't build hands for things. You know, just like when I was nine, at that time, that is like 1996, back in China. And at that time, our families are not, you know, we're not rich enough to have a computer in each of our family.

So I only have like two times a week to go to the computer room in my primary school. And at that time, even I was the best in the class of coding. But without going to the computer room to like debug, to write the code on the real machine, I can't get the code right just in one time with a pen and the paper.

But you know, that's exactly what we did for the past two years. We have a very smart man, but we only give it a pen and a paper. And we ask them to solve very complex problems for us. So we think that's our problem. So in Manus, we don't train models.

We don't even like post-train fine-tuning models. What we do is like we are building hands for these models. Yeah, that is, you know, the concept behind our lane, Manus. Yeah. I think like most of you maybe saw our product Manus on social net or you've already a Manus user and you must like saw some use cases on our website, any like social network.

But today, I also want to like just share like two new use cases, one from our internal usage, another is from our user. Yeah. So this one is actually from our internal usage. Yeah. Because we are expanding globally. We just opened our Singapore office three weeks ago and opened our Tokyo office two weeks ago and we will open our San Francisco office tomorrow.

Yeah. So when we are choosing our Tokyo office, we are asking Manus to help us. It's like we're going to say, okay, we have like 40 people. We will be relocating to Tokyo. So just find the office which can fit 40 people. And also, we have to solve their accommodation problem.

So this is a prompt we give to Manus. And after we give this problem to Manus, Manus just have its own plan and then executes the plan. It's like search, browse all these websites around the internet, doing a lot of browsing, browsing, browsing, research, research, research. And after 24 minutes, Manus just delivers this website for us.

Yeah, this website. This is a Tokyo office accommodation recommendations. It first that it comes with a very interactive map, which comes with all the 10 options Manus found for us. The blue marker is for the office's location. And the green marker is the accommodation near that office. All these 10 options just on this interactive map.

Yeah. And if you keep scrolling down, you can see the first pair is a Shibuya one. It choose the Shibuya Scramble Square. That's exactly the office we went to when we are looking at the office there. But you know, this office is kind of like very fancy, but too expensive for startup like us.

So we just choose our office near this one, very, very close, like maybe 200 meters away. And it has the price and why we should choose this office. And if you choose this office, what is accommodation options we could get? Also has the distance to the office, which is very great.

Comes with like 10 pairs of the, and at the bottom, there's like an overview table for all these options. Yeah. So all these have been done just under 20 minutes. So you can imagine maybe like your intern or your assistant can achieve such detailed quantity of information in such a short time.

Another thing I want to demo here is like, yeah, you can just maybe send an image to Manus, an empty rule, and ask Manus to analyze this rule's style and go to EKS website to find some furnitures for your rule. And then you can see the final result. Yeah.

So look, first Manus will just analyze this empty rule's image and come up with the idea, okay, this, what style of this rule and the layout and what kind of furnitures I should look up in the EKS website. And then Manus just go to EKS website and start browsing, browsing for all these furnitures and save these furnitures images.

And then at last Manus will just return an image just with the real EKS So actually you can just click the link to buy it. Yeah. Right now we can't buy things for you. But who knows, after three months, yeah, maybe we can do payment. Yeah. So that is kind of the things Manus can do.

Yeah. It's like a general agent. Yeah. They can solve a very long tail of different type of tasks for you. Yeah. So we can jump back to the slides. Yeah. So that is like kind of like what is Manus. And today, I think the most I want to share is like how we built Manus.

Yeah. Cursor just inspired us a lot. We got a lot of inspirations from Cursor. I think you may sound weird, because Cursor is kind of like a code editor, you know. Yeah. So all three founders of Manus, like Pig, Red, and I, were all coders for a lot of years.

So when we are using Cursor, I think Cursor is actually a very great product. It can help us to write different language even we don't know how to code in this language. And it's very like efficiency. But the most interesting part is not way coders using Cursor. The most interesting part is when we watch our friends, our colleagues, which are long coders, watching they using Cursor to solve their daily tasks like doing data visualization, batch file processing, convert a video file into an audio file.

It's very fascinating because, you know, this is the interface of Cursor, right? When these long coders using Cursor, they don't care about the left side because they they can't evaluate the code at all. All they can do is just to keep the accept, accept, accept, accept, accept, accept, you know.

Yeah. So it's very interesting when watching those friends, you know, using Cursor. So we just come up with the idea, which is that those long programmers just using Cursor to deal with their daily tasks. Yeah, not like us. You know, when we are using Cursor, we really want to write some code, you know, that can run multiple times for the future.

But for these friends, they just want Cursor to solve their task. They don't care about the code. Next time, when they have the same task, they will not run that code again. They will just ask Cursor to do that, maybe generate a new code for them. So code is like maybe not the ultimate goal.

It's just, you know, just an intermediate step for solving programs. So we come up with the idea that maybe we should build the opposite. Yeah, we should build the right panel of Cursor. And another thing we want to build is that we want the right panel to run in the cloud.

Why is that? Because, you know, when we are using Cursor on our computer, there's, it has to ask your permission to continue because it is running on your computer. Any action it performs on your computer is like, well, maybe dangerous, right? After it, maybe it will install some dependencies, install some software.

It may break your computer. But we think if it's running in the cloud, it's much safer. And also, when it's running in the cloud, you don't have to pay your attention to it. You just, you know, assign a task to Manus. You can just close your laptop or just put your phone back to your pocket.

You can do other thing. And after the task is done, we will give you a notification and you will get the result. Yeah. So that's the original idea of how we come up with the idea of the Manus that is happening last October. So from last October to this March, five months of work, this is what you saw today, which is the Manus AI in the fifth of March.

Yeah. So that's the original idea. Yeah. Thank you, guys. And also, I want to share some details. When we're building Manus, what kind of thoughts we have in our mind? The first thing is that the first key component of Manus is that we give Manus a computer, which is super important when you compare it to other chatbot usage.

Because, you know, you Manus, each Manus tag will be assigned a fully functioned virtual machine. Manus can use the file system, terminal, VSCO, and a real Chromium browser. You know, it's not a headless browser. Yeah. All these apps in the virtual machine, which, you know, it creates a lot of different opportunity for Manus to solve different kinds of problems.

Like, you can just send a zip compress file, contains maybe hundreds of PDFs, and ask Manus to unzip it, and then extract all these unstructured data from these hundreds of PDFs into a structured spreadsheet. Yeah. That's the thing you can do. So we think, give it a computer is actually what makes Manus really different compared to other agents or other chatbots.

And the second thing is that we think, you know, nowadays, it's not every information or data is on the public internet. There are so many things behind the pay or in these private databases. But users, because Manus, you know, we are targeting to the consumer market. We're not targeting to the enterprise market.

So for an average user, who is not very familiar of how to call an API, you know, how to write code to access databases, we think it's better for us to prepay all these, like, private databases, APIs for the users. Then users won't care how they can get maybe some real-time financial data.

Yeah, things like that. So it's like, we prepaid these private databases for our users. And the third thing is that, in Manus actually, you can just teach Manus how to solve programs. Like, one month before we released Manus, ChatGPT just released their deep research. So there's an experience, actually, I don't like.

It's like, you ask a question to deep research, it will return five or six questions for you. You know, I don't think that's a very good experience because I want you to solve tasks for me, not asking me more questions. But you know, some of our colleagues, someone in our team, they like this experience.

So we have an internal discussion about whether we should do something, maybe like a workflow or maybe like a hard code things to deliver such experience. But instead, we didn't do that. We implemented a personal knowledge system, which is like, you can just teach Manus. Next time, when you go out to do some research, before you do, before you start, just confirm all the details with me and then execute it.

Then Manus will, once you accept that knowledge into your personal knowledge system, Manus will remember it. And it will just act like every time when you want to do some research, it will confer with you first and then exit. I think all these three things just makes Manus really powerful.

But I think the most important part, because if you want to build an agent, all these three things, I think maybe you definitely will give it a try. But we think, why the magic happens? Why Manus is such a great experience? The most important part is not about these three components.

We think the most important part is that as the fundamental concept of Manus, we have this less structure, more intelligence. You can find this sentence at the bottom of our official website. We just, you know, we just believe in this. We took so much faith in it. And how do we define less structure, more intelligence is that when we release Manus, actually, we put 42 use cases on our website.

And someone said, "Okay, the message is not the same. You just predefined some workflows." Maybe you have like 42 predefined workflows. But actually, you know, inside Manus, in the core of Manus, we have zero predefined workflows. It's just a very simple but a very robust structure. Very simple structure.

But we just left all the intelligence part to the foundational model. At that time, it is Cloud Solar 3.7. And now we have Cloud 4, right? So what we are doing is like, because we are building the hands, right? It's like, we just compose all these contexts to build them into like a more good structure.

And then we provide more context to the model. And we have less control. When I say less control, I mean multi-role agent system, right? You have to specify, this is a coding agent, this is a search agent, this is a ba-ba-ba-ba-ba agent. We think all these kinds of forms are a control.

It just limits the real potential of LMS. So for us, it's like we just provide more context to the model and then let the model to like improvise by itself. And all this magic you get from MNAS just come from this very simple but very strong ideology. It's less structure, more intelligence.

Yeah. So that is the secret behind MNAS. And why we choose Cloud in the first place is for three reasons. The first one is about the long-horizon planning. You know, I think before Cloud Solent's models, because chatbot is just so successful. So all these models are out there, maybe before this March, are post-trained and their alignments is for chatbot scenario.

And in chatbot scenario, the model intends to answer your question in one term. You ask the question and you get the answer. But, you know, in agentic scenarios, like in MNAS, an average task will take maybe 30 or 50 steps and then get the final answer. So when we are building MNAS, actually we try every model we can get our hands on.

But it finds out only Solent can know, okay, I am in a very giant agentic loop. So I should perform the action, watch the observation, and decide what's the action, observation, absolute observation. It's just a very giant agentic loop. So just Solent know, okay, I am in this loop.

So I have to gather more information before I deliver the final result. But, you know, for these five months, we tried all other models. All of them failed. After maybe only one, two, three iterations, those models will think, okay, I think it's enough. I will answer your questions right now.

So I think that's a problem. And right now what we found the best model to run a very long horizon planning scenes is cloud Solent models. That is the first thing why we choose cloud. And the first second part is about the tool use. Because, you know, agent product just heavily relied on the tool use and the function calling.

Yeah, like for us, we just abstract all these tools in that virtual machine. We have like 27 tools in that virtual machine. So the agent framework has to decide what the action should be performed into that virtual machine. So it's very important to call the right tool and write the tools parameters right.

So at that time when we are building managers, we don't have the sync tool in cloud models. So we kind of like implemented some mechanism by our own. We call it a COT injection, which is like before every function call. We just will use another specific, we call it a planner agent, to do the reasoning.

At that time, we don't have reasoning too. Yeah, because that's the 3.5. Yeah. So we have to do the reasoning by ourselves. And then we inject that reasoning, that COT, into the main agent. And then we perform that function call. And what we found out, it just boosts the performance of function calling.

And also, in an article, a Serapical release at the end of March, with the sync tool, they also found out this. And in cloud4, it kind of has some like native support for the thinking in the tool use. I think you guys all say in this morning's session. Yeah.

So that's the second one. And the first, third one is that we think Serapical's model may be the best, they have the best alignment with agentic usage. Because, you know, in agentic usage, we have to deal with the browser, the computer. I think Serapical has put so much resources on the alignment with the computer use, the browser use thing.

So which makes cloud's model maybe the best model to build agents. That's why we choose Serapical's models for madness. And also, you know, we just spend a lot of tokens on cloud. And also, this is the t-shirt I wear in the GTC event. We just wear this shirt in the GTC event.

Yeah. We spend like one million dollars on cloud model in the first of 14 days. I think maybe that's why they invited us here to get a spin. Yeah. It cost us a lot to be on the stage, you know. Yeah. So that's the whole story about the madness so far.

Yeah. So if you have questions, you can just line up at the two mics. We're gonna start from right to left. Yeah, each by each. Yeah. Any questions? Yeah. Cool. Oh, yeah. There's one. Hello. Thanks for the presentation. Very interesting. So I don't know how much you can share, but I was just curious about in your agentic workflow, especially the browser part, how much is that vision and how much is that actually, you know, parsing the code of the web page, right?

So how much of it is like the model looking at the browser like a person would and how much of that is that is more like a text type of interaction, if that makes sense? Uh, sorry, I may get the question in a second. So when you, when the minus uses the web browser, right?

Oh, yeah, yeah, yeah. It goes on a website, right? Yeah. And how much of the understanding of the web page is based on the vision? Yeah, I got your question. And how much is based on that? Yeah, yeah, that's a very good question. Actually, I can share it. It's not a secret because, you know, yeah.

Uh, um, when we are building master from last October, there is the, um, open, open source project called the browser use. I think most of you guys may know that project. So we just take a look at that project. We think their way of how to talk to the browsers is actually very useful.

So we just use that part because, you know, two months later, browser use has its own agent framework. We don't use that part. We don't use that agent framework part. We only use the protocol, they talk to the browsers. So the thing we are sending to the, uh, the context we're sending to the foundational model when we are, when medicine is browsing internet are three things.

The first is the text in this viewport. We will send it to, to, uh, cloud. And the second thing is a screenshot. Yeah. In this viewport. And the third thing is a screenshot, but with bounding boxes. You know, then the model can decide which area he should click. Yeah.

So right now we are just sharing this reason to the model. Yeah. I don't know if I answer the question. Yeah. Yeah. Yeah. Okay. Yeah. Cool. Oh yeah. Another one. Hi, I have a slightly controversial question, but I want to kind of put, put this out to you. Yeah.

Uh, with the advancement of a lot of these foundational model and the deep research capability, how do you see a wrapper company kind of keeping their edge? Like what, how do you see it as sort of, what are the research area? What are the areas that you're concerned about?

Yeah. What are you focused so that you can actually grow in parallel with these foundational model as their capability increases? Definitely. I, I kind of get your question because, you know, we answer this question for the past two months into a lot of investors, you know, they are all asking, okay, what's your motor being as a wrapper, you know, yeah, things like that.

Um, we think, you know, there will be no mode just for some very specific technology or framework because technology will be out of data very, very, very fast, even for models, right? So we think the only thing we can win this competition is about the pace of innovation. Like what you mentioned, there's like a use case like deep research.

Deep research is one of the main use cases in Manus. We have like maybe 20% of users are doing deep research every day on Manus. But you know, actually, we put zero effort on deep research use cases. Like I just mentioned, we just build a very simple structure, a very simple agent framework.

The deep research capability is just the emergence from this framework, which means, you know, maybe if OpenAI is maybe half a year to do the end-to-end training just for these specific use cases. But this capability just, you know, just emerges from the whole structure. So in six months, maybe we will have like hundreds of use cases like deep research.

And also, what we can do is like, we can leverage the best model in the world. Yeah, you know, I think these are two things. It's about the flexibility of your whole agent framework. And the second thing is like, we are very flexible of choosing the best model in the world.

Yeah. All right. Thank you. Yeah. Hi. Hi. Hi. Hello. Question. So, right now- Oh, maybe the last question. Sorry. Uh, done wrong. Yeah. Okay. Um, so right now you build the, uh, browser in a virtual environment. Yeah. Are you planning to put, uh, the browser in a Docker in the local computer and later they can use the local cookie and access all the account?

Yeah. Actually, we, we don't have a plan for local environment because as I just mentioned, we think it's very important to like give the users attention back to users. We don't want to like user to pay attention still to his phone or to his computer. We want everything to run in the cloud.

And we, we, we don't just have a, a Linux virtual machine. We still, we, we also have plan to have a virtual Windows, virtual Android for Manus to, to run in the future. So it's all about running in the cloud. Yeah. Okay. I think my time is up. Um, thanks for guys for today.

Yeah. Um, Thank you. Thank you. Bye. Thank you.

Spotlight on Manus | Code w/ Claude

Transcript