
Spotlight on Manus | Code w/ Claude


Whisper Transcript

00:00:00.000 | My name is Tao and my nickname is HighCloud. Actually, you can just find me on any social
00:00:18.620 | network with my nickname HighCloud. And right now, I'm a co-founder of Manus AI and I'm acting as
00:00:24.180 | our chief product officer. But I'm not just a product guy. I've been coding for like 28 years,
00:00:30.740 | from a very early age, like when I was nine. But for AI, actually, I'm
00:00:36.680 | a newbie. I've only been in this industry for two years. And what I want to achieve
00:00:43.820 | in the AI industry is to build a product that can influence 24 hours of a single user's day.
00:00:51.360 | That was two years ago; it was my dream. But right now, I think with Manus AI, I can actually achieve
00:00:56.700 | that dream by the end of this year. Right now, our heaviest usage comes from a single
00:01:03.240 | user who consumes two hours of GPU power per day just for himself. So I think we can achieve
00:01:11.720 | this goal by the end of this year. And today, I want to talk about two things. One is
00:01:19.180 | about Manus. The other is about Claude models. So the first part is about what Manus is. Yeah,
00:01:24.920 | Manus, actually, you know, is not an English word. Because really, it isn't. Yeah. Manus
00:01:31.260 | comes from MIT's motto, which is Mens et Manus. It's an old Latin phrase, which means
00:01:37.660 | mind and hand. Why did we choose the name Manus? Manus is the hand. It's because we think, for the past
00:01:45.780 | two years, all these frontier models have already been very smart, just like a human brain. Yeah,
00:01:52.740 | they're super smart. They are capable of doing different kinds of tasks. But you know, even with
00:01:59.300 | this very, very smart brain, we can't make a real impact on the real world because we haven't built hands for
00:02:05.940 | them. You know, just like when I was nine, back in 1996 in China. At
00:02:13.140 | that time, our families were not, you know, rich enough to have a computer in every
00:02:18.180 | home. So I only got to go to the computer room in my primary school like two times a week. And at
00:02:25.460 | that time, even though I was the best at coding in my class, without going to the computer room to debug,
00:02:33.940 | to write the code on a real machine, I couldn't get the code right in one try with a pen and
00:02:41.060 | paper. But you know, that's exactly what we did for the past two years. We have a very smart mind,
00:02:47.940 | but we only give it a pen and paper, and we ask it to solve very complex problems for us. So we think
00:02:55.380 | that's the problem. So at Manus, we don't train models. We don't even post-train or fine-tune
00:03:01.380 | models. What we do is build hands for these models. Yeah, that is, you know, the
00:03:08.340 | concept behind our name, Manus. Yeah. I think most of you maybe saw our product Manus on
00:03:17.620 | social networks, or you're already a Manus user, and you must have seen some use cases on our website
00:03:23.540 | or some social network. But today, I also want to share two new use cases, one from our
00:03:31.140 | internal usage, another from our users. Yeah. So this one is actually from our internal usage. Yeah.
00:03:37.300 | Because we are expanding globally. We just opened our Singapore office three weeks ago, opened our
00:03:42.820 | Tokyo office two weeks ago, and we will open our San Francisco office tomorrow. Yeah. So when we were
00:03:48.820 | choosing our Tokyo office, we asked Manus to help us. It's like we say, okay, we have
00:03:55.380 | like 40 people relocating to Tokyo, so just find an office which can fit 40 people. And also,
00:04:02.900 | we have to solve their accommodation problem. So this is the prompt we gave to Manus. And after we give this
00:04:10.420 | problem to Manus, Manus just makes its own plan and then executes the plan. It's like search,
00:04:17.140 | browse all these websites around the internet, doing a lot of browsing, browsing, browsing, research,
00:04:23.060 | research, research. And after 24 minutes, Manus just delivers this website for us. Yeah, this website.
00:04:31.620 | These are Tokyo office and accommodation recommendations. First, it comes with an interactive map,
00:04:38.500 | which shows all 10 options Manus found for us. The blue marker is the office's location, and the
00:04:45.220 | green marker is the accommodation near that office. All 10 options are on this interactive map. Yeah. And if
00:04:53.220 | you keep scrolling down, you can see the first pair is a Shibuya one. It chose Shibuya Scramble Square. That's
00:05:00.020 | exactly the office we went to when we were looking at offices there. But you know,
00:05:04.660 | this office is very fancy, but too expensive for a startup like us. So we just chose an
00:05:10.500 | office near this one, very, very close, maybe 200 meters away. And it has the price and why we should
00:05:18.580 | choose this office, and if you choose this office, what accommodation options we could get. It also has the
00:05:26.020 | distance to the office, which is great. It comes with 10 such pairs, and at the bottom, there's
00:05:32.740 | an overview table for all these options. Yeah. So all of this was done in just under 20 minutes.
00:05:39.860 | So you can imagine whether your intern or your assistant could gather such a quantity of detailed
00:05:46.740 | information in such a short time. Another thing I want to demo here is, yeah, you can just
00:05:53.700 | send an image to Manus, an empty room, and ask Manus to analyze the room's style and go to the IKEA
00:06:00.660 | website to find some furniture for your room. And then you can see the final result. Yeah. So look,
00:06:07.460 | first Manus will just analyze this empty room's image and come up with the idea, okay,
00:06:13.540 | what is the style of this room and the layout, and what kind of furniture should I look up on the IKEA
00:06:22.260 | website. And then Manus just goes to the IKEA website and starts browsing, browsing for all this furniture
00:06:29.140 | and saving the furniture images. And at last Manus will just return an image with the real IKEA furniture.
00:06:38.180 | So actually you can just click the link to buy it. Yeah. Right now we can't buy things for you. But who knows,
00:06:53.620 | in three months, yeah, maybe we can do payments. Yeah. So that is the kind of thing Manus can do. Yeah. It's like a
00:07:00.900 | general agent. Yeah. It can solve a very long tail of different types of tasks for you. Yeah. So we can jump back to the
00:07:08.580 | slides. Yeah.
00:07:10.100 | So that is kind of what Manus is. And today, I think what I most want to share is how we built Manus.
00:07:17.620 | Yeah. Cursor inspired us a lot. We got a lot of inspiration from Cursor. I think this may sound weird,
00:07:24.340 | because Cursor is, you know, a code editor. Yeah. So all three founders of Manus,
00:07:29.700 | Peak, Red, and I, have all been coders for a lot of years. So when we use Cursor, I think Cursor is
00:07:34.900 | actually a great product. It can help us write in different languages even if we don't know how to code in
00:07:41.300 | that language. And it's very efficient. But the most interesting part is not the way coders use Cursor.
00:07:50.580 | The most interesting part is when we watch our friends, our colleagues, who are non-coders,
00:07:57.060 | using Cursor to solve their daily tasks like data visualization, batch file processing,
00:08:04.100 | converting a video file into an audio file. It's fascinating because, you know, this is the interface
00:08:10.340 | of Cursor, right? When these non-coders use Cursor, they don't care about the left side because
00:08:17.220 | they can't evaluate the code at all. All they can do is just keep hitting accept, accept, accept,
00:08:23.700 | accept, accept, accept, you know. Yeah. So it's very interesting watching those friends, you know,
00:08:29.940 | using Cursor. So we came up with the observation that those non-programmers just use Cursor to
00:08:37.140 | deal with their daily tasks. Yeah, not like us. You know, when we use Cursor, we really want to
00:08:43.620 | write some code, you know, that can run multiple times in the future. But these friends just
00:08:49.940 | want Cursor to solve their task. They don't care about the code. Next time, when they have the same task,
00:08:55.140 | they will not run that code again. They will just ask Cursor to do it again, maybe generate new code for them. So code is
00:09:02.340 | maybe not the ultimate goal. It's just, you know, an intermediate step for solving problems.
00:09:08.740 | So we came up with the idea that maybe we should build the opposite. Yeah, we should build just the right
00:09:14.900 | panel of Cursor. And another thing we wanted is for that right panel to run in the cloud.
00:09:21.860 | Why is that? Because, you know, when we use Cursor on our computer, it has to ask your
00:09:28.820 | permission to continue because it is running on your computer. Any action it performs on your computer
00:09:36.020 | is, well, maybe dangerous, right? Maybe it will install some dependencies, install some
00:09:42.500 | software. It may break your computer. But we think if it's running in the cloud, it's much safer.
00:09:50.500 | And also, when it's running in the cloud, you don't have to pay attention to it. You just, you know,
00:09:56.820 | assign a task to Manus. You can just close your laptop or put your phone back in your pocket.
00:10:02.500 | You can do other things. And after the task is done, we will give you a notification and you will get
00:10:07.540 | the result. Yeah. So that's how we originally came up with the idea of Manus. That
00:10:16.020 | happened last October. So from last October to this March, five months of work, that is what you saw
00:10:24.340 | today, which is the Manus AI released on the fifth of March. Yeah. So that's the original idea. Yeah. Thank you,
00:10:31.140 | guys. And also, I want to share some details. When we were building Manus, what kind of thoughts did we have
00:10:36.740 | in mind? The first key component of Manus is that we give Manus a computer,
00:10:43.460 | which is super important when you compare it to other chatbot usage. Because, you know, in Manus,
00:10:49.380 | each Manus task is assigned a fully functional virtual machine. Manus can use the file system,
00:10:55.540 | terminal, VS Code, and a real Chromium browser. You know, it's not a headless browser. Yeah. All these apps are in the
00:11:03.220 | virtual machine, which, you know, creates a lot of different opportunities for Manus to solve different
00:11:09.140 | kinds of problems. Like, you can just send a compressed zip file containing maybe hundreds of PDFs,
00:11:17.540 | and ask Manus to unzip it and then extract all the unstructured data from these hundreds of PDFs
00:11:26.260 | into a structured spreadsheet. Yeah. That's the kind of thing you can do. So we think giving it a computer is
00:11:31.940 | actually what makes Manus really different compared to other agents or other chatbots.
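To make the "give it a computer" idea concrete, here is a rough sketch of how a few sandbox tools could be declared for a tool-calling model using the Anthropic tool-schema format. The tool names and fields are illustrative assumptions; Manus's actual tool set is not public.

```python
# Illustrative only: a few sandbox tools (shell, file read, browser navigation)
# declared in the Anthropic Messages API tool-schema format. These names and
# schemas are assumptions, not Manus's real ~27 tools.
SANDBOX_TOOLS = [
    {
        "name": "shell",
        "description": "Run a shell command inside the task's virtual machine.",
        "input_schema": {
            "type": "object",
            "properties": {"command": {"type": "string", "description": "Command to run."}},
            "required": ["command"],
        },
    },
    {
        "name": "read_file",
        "description": "Read a file from the virtual machine's file system.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string", "description": "Absolute file path."}},
            "required": ["path"],
        },
    },
    {
        "name": "browser_goto",
        "description": "Navigate the virtual machine's Chromium browser to a URL.",
        "input_schema": {
            "type": "object",
            "properties": {"url": {"type": "string", "description": "Page to open."}},
            "required": ["url"],
        },
    },
]
```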
00:11:38.660 | And the second thing is that we think, you know, nowadays not all information or data is on the public internet.
00:11:44.740 | There are so many things behind paywalls or in private databases. But because Manus, you know,
00:11:51.380 | is targeting the consumer market, not the enterprise market, for an
00:11:56.420 | average user who is not very familiar with how to call an API, you know, how to write code to access
00:12:02.820 | databases, we think it's better for us to pre-pay for all these private databases and APIs for the users.
00:12:10.340 | Then users don't have to worry about how they can get, maybe, some real-time financial data. Yeah, things like that. So it's like
00:12:16.660 | we pre-paid for these private databases for our users. And the third thing is that, in Manus,
00:12:23.300 | you can just teach Manus how to solve problems. Like, one month before we released Manus, ChatGPT
00:12:31.140 | released their deep research. There's an experience in it, actually, that I don't like. It's like, you ask a question
00:12:36.660 | of deep research, and it returns five or six questions back to you. You know, I don't think that's a very good
00:12:42.660 | experience, because I want you to solve tasks for me, not ask me more questions. But you know,
00:12:47.780 | some of our colleagues, some people on our team, like this experience. So we had an internal discussion
00:12:53.860 | about whether we should do something, maybe like a workflow or some hard-coded thing, to deliver
00:13:00.500 | such an experience. But instead, we didn't do that. We implemented a personal knowledge system, which is like,
00:13:07.140 | you can just teach Manus: next time, when you go out to do some research, before you start,
00:13:13.940 | confirm all the details with me and then execute. Once you accept that knowledge
00:13:20.580 | into your personal knowledge system, Manus will remember it. And every time
00:13:28.260 | you want to do some research, it will confirm with you first and then execute.
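As a rough illustration of the personal knowledge system described here, an accepted knowledge entry can simply be stored per user and prepended to the system prompt of later tasks. This is a minimal sketch under that assumption; the storage format and function names are hypothetical, not Manus's implementation.

```python
# Minimal sketch: store accepted knowledge entries and inject them into the
# system prompt of future tasks. File layout and names are hypothetical.
import json
from pathlib import Path

KNOWLEDGE_FILE = Path("knowledge.json")  # hypothetical per-user store

def accept_knowledge(entry: str) -> None:
    """Save an entry the user accepted, e.g. 'Before any research task,
    confirm the details with me first, then execute.'"""
    entries = json.loads(KNOWLEDGE_FILE.read_text()) if KNOWLEDGE_FILE.exists() else []
    entries.append(entry)
    KNOWLEDGE_FILE.write_text(json.dumps(entries, indent=2))

def build_system_prompt(base_prompt: str) -> str:
    """Prepend every remembered entry so the agent applies it on later tasks."""
    entries = json.loads(KNOWLEDGE_FILE.read_text()) if KNOWLEDGE_FILE.exists() else []
    if not entries:
        return base_prompt
    rules = "\n".join(f"- {e}" for e in entries)
    return f"{base_prompt}\n\nUser-taught knowledge to always follow:\n{rules}"
```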
00:13:34.180 | I think all these three things make Manus really powerful. But if you want to build
00:13:40.980 | an agent, I think you will definitely give all three of these a try. So why does the
00:13:46.980 | magic happen? Why is Manus such a great experience? The most important part is not these three
00:13:54.740 | components. We think the most important part is the fundamental concept of Manus:
00:14:01.700 | less structure, more intelligence. You can find this sentence at the bottom of our official
00:14:06.900 | website. We just, you know, believe in this. We put so much faith in it. And how do we define
00:14:13.460 | less structure, more intelligence? When we released Manus, we actually put 42 use cases on our website.
00:14:20.740 | And someone said, "Okay, there's nothing special here. You just predefined some workflows."
00:14:27.700 | Maybe you have like 42 predefined workflows. But actually, you know, inside Manus, at the core of Manus,
00:14:33.620 | we have zero predefined workflows. It's just a very simple but very robust structure. A very simple structure.
00:14:41.700 | We just left all the intelligence to the foundation model. At that time, it was Claude
00:14:47.460 | Sonnet 3.7. And now we have Claude 4, right? So what we are doing, because we are building the hands,
00:14:55.300 | right, is that we just compose all this context into a better structure. And then we
00:15:04.020 | provide more context to the model, and we have less control. When I say less control, I mean multi-role agent systems,
00:15:10.660 | right? You have to specify, this is a coding agent, this is a search agent, this is a whatever agent. We think
00:15:17.620 | all these kinds of frameworks are a form of control. It just limits the real potential of LLMs. So for us, we just
00:15:25.060 | provide more context to the model and then let the model improvise by itself. And all the
00:15:32.500 | magic you get from Manus just comes from this very simple but very strong ideology: less structure,
00:15:40.340 | more intelligence. Yeah. So that is the secret behind Manus. And why we chose Claude in the first place comes down to
00:15:50.100 | three reasons. The first one is long-horizon planning. You know, I think before the Claude Sonnet models,
00:15:57.860 | because chatbots were just so successful, all the models out there, maybe before this March,
00:16:06.020 | were post-trained and aligned for the chatbot scenario. And in the chatbot scenario,
00:16:15.220 | the model tends to answer your question in one turn. You ask the question and you get the answer. But,
00:16:21.940 | you know, in agentic scenarios, like in Manus, an average task will take maybe 30 or 50 steps before it gets to the
00:16:30.660 | final answer. So when we were building Manus, we actually tried every model we could get our hands on. And it
00:16:37.060 | turns out only Sonnet knows, okay, I am in a very giant agentic loop. So I should perform an action,
00:16:47.460 | watch the observation, and decide the next action: action, observation, action, observation. It's just a very
00:16:52.820 | giant agentic loop. Only Sonnet knows, okay, I am in this loop, so I have to gather more information
00:16:59.620 | before I deliver the final result. But, you know, over those five months, we tried all the other models.
00:17:05.220 | All of them failed. After maybe only one, two, three iterations, those models would think, okay,
00:17:11.860 | I think it's enough, I will answer your question right now. So I think that's a problem. And right
00:17:18.420 | now, the best models we have found for running very long-horizon planning scenarios are the Claude Sonnet models.
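For readers who want to see what such an action/observation loop looks like in code, here is a minimal sketch over the Anthropic Messages API. It is not Manus's framework: the single `shell` tool, the local subprocess execution, and the step budget are stand-ins for the real sandbox VM and tool set.

```python
# Minimal sketch of the "giant agentic loop": the model acts, we execute the tool,
# feed back the observation, and repeat until the model stops asking for tools or
# the step budget runs out. Illustrative only, not Manus's implementation.
import subprocess
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

TOOLS = [{
    "name": "shell",
    "description": "Run a shell command and return its output.",
    "input_schema": {
        "type": "object",
        "properties": {"command": {"type": "string"}},
        "required": ["command"],
    },
}]

def run_tool(name: str, args: dict) -> str:
    # Stand-in for dispatching into the task's virtual machine.
    if name == "shell":
        out = subprocess.run(args["command"], shell=True, capture_output=True, text=True)
        return (out.stdout + out.stderr)[:10_000]
    return f"unknown tool: {name}"

def agent_loop(task: str, max_steps: int = 50) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            tools=TOOLS,
            messages=messages,
        )
        # Keep the assistant turn (including its tool_use blocks) in context.
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason != "tool_use":
            # The model decided it has gathered enough and produced a final answer.
            return "".join(b.text for b in response.content if b.type == "text")
        # Execute every requested tool and feed the observations back.
        results = [{
            "type": "tool_result",
            "tool_use_id": b.id,
            "content": run_tool(b.name, b.input),
        } for b in response.content if b.type == "tool_use"]
        messages.append({"role": "user", "content": results})
    return "Stopped: step budget exhausted."
```

The property the talk calls out is that the model, not the scaffold, decides when it has gathered enough observations to stop calling tools and deliver the final answer.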
00:17:25.780 | That is the first reason why we chose Claude. And the second part is about tool use. Because,
00:17:31.140 | you know, agent products rely heavily on tool use and function calling. Yeah, like for us,
00:17:38.740 | we just abstract all the tools in that virtual machine. We have like 27 tools in that virtual
00:17:45.140 | machine. So the agent framework has to decide what action should be performed in that virtual
00:17:50.500 | machine. So it's very important to call the right tool and get the tool's parameters right. At the
00:18:00.020 | time we were building Manus, we didn't have the think tool in Claude models. So we kind of
00:18:06.820 | implemented a mechanism of our own. We call it CoT injection, which is: before every function
00:18:15.140 | call, we use another specific agent, we call it a planner agent, to do the reasoning. At that time,
00:18:21.380 | we didn't have reasoning either. Yeah, because that was 3.5. Yeah. So we had to do the reasoning
00:18:26.820 | ourselves. And then we inject that reasoning, that CoT, into the main agent. And then we perform the
00:18:34.820 | function call. And what we found out is that it just boosts the performance of function calling. In an
00:18:41.380 | article Anthropic released at the end of March about the think tool, they found the same thing. And in
00:18:47.780 | Claude 4, there is some native support for thinking during tool use. I think you guys all saw that in
00:18:54.820 | this morning's session. Yeah. So that's the second reason.
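A rough sketch of the CoT-injection idea described here, assuming the planner's reasoning is passed back as one extra context message before the tool-enabled call. The prompts, the `<planner_reasoning>` tag, and the model id are illustrative assumptions, not the actual Manus mechanism.

```python
# Sketch of "CoT injection": a separate planner pass produces free-text reasoning,
# which is injected into the main agent's context right before the tool-enabled
# request. All prompts and names here are assumptions.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-5-sonnet-20241022"  # the pre-think-tool era mentioned in the talk

def plan_next_step(messages: list) -> str:
    """Planner pass: reason about the next tool call, but do not make one."""
    planner = client.messages.create(
        model=MODEL,
        max_tokens=512,
        system=("Think step by step about which tool should be called next and with "
                "what parameters. Reply with your reasoning only."),
        messages=messages,
    )
    return "".join(b.text for b in planner.content if b.type == "text")

def act_with_injected_cot(messages: list, tools: list):
    """Main pass: the planner's chain of thought is appended as context so the
    tool-calling model is nudged toward the right tool and parameters."""
    cot = plan_next_step(messages)
    injected = messages + [{
        "role": "user",
        "content": f"<planner_reasoning>\n{cot}\n</planner_reasoning>",
    }]
    return client.messages.create(
        model=MODEL, max_tokens=2048, tools=tools, messages=injected
    )
```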
00:19:00.820 | And the third reason is that we think Anthropic's models have the best alignment with agentic usage. Because, you know,
00:19:06.500 | in agentic usage, we have to deal with the browser and the computer. I think Anthropic has put so many
00:19:12.740 | resources into alignment for computer use and browser use, which makes Claude maybe the
00:19:20.740 | best model for building agents. That's why we chose Anthropic's models for Manus. And also, you know,
00:19:28.100 | we just spend a lot of tokens on Claude. And also, this is the t-shirt I wore at the GTC event. We just
00:19:36.660 | wore this shirt at the GTC event. Yeah. We spent like one million dollars on Claude models in the first
00:19:43.620 | 14 days. I think maybe that's why they invited us here to speak. Yeah. It cost us a lot to be on
00:19:50.180 | the stage, you know. Yeah. So that's the whole story about Manus so far. Yeah. So if you have
00:19:57.380 | questions, you can just line up at the two mics. We're going to start from right to left. Yeah, one by one.
00:20:03.220 | Yeah. Any questions? Yeah.
00:20:07.860 | Cool. Oh, yeah. There's one.
00:20:16.580 | Hello. Thanks for the presentation. Very interesting. So I don't know how much you can share, but I was just
00:20:27.940 | curious: in your agentic workflow, especially the browser part, how much of that is vision and how
00:20:37.620 | much of that is actually, you know, parsing the code of the web page, right? So how much of it is like
00:20:43.860 | the model looking at the browser like a person would, and how much of it is more like a text type
00:20:51.620 | of interaction, if that makes sense? Uh, sorry, let me make sure I get the question. So when you,
00:20:59.460 | when Manus uses the web browser, right? Oh, yeah, yeah, yeah. It goes on a website, right? Yeah. And how
00:21:05.620 | much of the understanding of the web page is based on the vision? Yeah, I got your question. And how much
00:21:10.660 | is based on that? Yeah, yeah, that's a very good question. Actually, I can share it. It's not a
00:21:14.580 | secret because, you know, yeah. When we were building Manus from last October, there was an
00:21:20.820 | open source project called browser-use. I think most of you guys may know that project.
00:21:25.380 | So we took a look at that project. We think their way of talking to the browsers is actually
00:21:31.540 | very useful. So we just use that part. You know, two months later, browser-use had its own
00:21:37.620 | agent framework. We don't use that part, the agent framework part. We only use the
00:21:42.660 | protocol they use to talk to the browsers. So the things we are sending, the context we're sending
00:21:49.460 | to the foundation model when Manus is browsing the internet, are three things. The first
00:21:54.900 | is the text in the viewport. We send that to Claude. And the second thing is a screenshot
00:22:03.380 | of the viewport. And the third thing is a screenshot, but with bounding boxes. You know,
00:22:10.420 | so the model can decide which area it should click. Yeah. So right now we are just sending these
00:22:15.780 | three things to the model. Yeah. I don't know if that answers the question. Yeah. Yeah. Okay. Yeah. Cool.
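A small sketch of what that three-part browser observation could look like as a single message to the model: viewport text, a raw screenshot, and the same screenshot annotated with bounding boxes. Only the Anthropic image/text block format is real; the helper names and capture pipeline are assumptions.

```python
# Sketch of the three-part browser context described above. How the screenshots
# and viewport text are captured is left out; only message assembly is shown.
import base64

def image_block(png_bytes: bytes) -> dict:
    return {"type": "image",
            "source": {"type": "base64",
                       "media_type": "image/png",
                       "data": base64.b64encode(png_bytes).decode()}}

def browser_observation(viewport_text: str,
                        screenshot_png: bytes,
                        annotated_png: bytes) -> dict:
    """One user turn containing everything the model needs to pick its next click."""
    return {"role": "user", "content": [
        {"type": "text", "text": f"Visible text in the viewport:\n{viewport_text}"},
        image_block(screenshot_png),   # what the page actually looks like
        image_block(annotated_png),    # same view with numbered bounding boxes
        {"type": "text", "text": "Choose the element (by box number) to interact with."},
    ]}
```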
00:22:22.020 | Oh yeah. Another one. Hi, I have a slightly controversial question, but I want to kind of put
00:22:28.980 | this out to you. Yeah. With the advancement of a lot of these foundation models and the deep
00:22:34.420 | research capability, how do you see a wrapper company kind of keeping their edge? Like what,
00:22:42.980 | how do you see it? What are the research areas, the areas that you're concerned about?
00:22:47.460 | Yeah. What are you focused on so that you can actually grow in parallel with these foundation models as
00:22:53.140 | their capabilities increase? Definitely. I kind of get your question because,
00:22:56.580 | you know, we've been answering this question for the past two months to a lot of investors, you know,
00:23:01.220 | they are all asking, okay, what's your moat, being a wrapper, you know, yeah, things like that. Um,
00:23:07.940 | we think, you know, there will be no moat from some very specific technology or framework,
00:23:14.500 | because technology goes out of date very, very fast, even for models, right? So we think the
00:23:20.340 | only way we can win this competition is the pace of innovation. Like what you mentioned,
00:23:27.140 | there's a use case like deep research. Deep research is one of the main use cases in Manus.
00:23:33.220 | Maybe 20% of our users are doing deep research every day on Manus. But you know, actually,
00:23:39.300 | we put zero effort into deep research use cases. Like I just mentioned, we just built a very simple
00:23:47.060 | structure, a very simple agent framework. The deep research capability just emerges from this
00:23:54.420 | framework, whereas, you know, OpenAI maybe spent half a year doing end-to-end training just
00:24:00.900 | for that specific use case. For us, this capability just, you know, emerges from the whole structure.
00:24:07.380 | So in six months, maybe we will have hundreds of use cases like deep research. And also,
00:24:14.580 | what we can do is leverage the best model in the world. Yeah, you know, I think those
00:24:21.060 | are the two things: the flexibility of your whole agent framework, and the second thing is that
00:24:27.700 | we are very flexible in choosing the best model in the world. Yeah. All right. Thank you. Yeah.
00:24:33.940 | Hi. Hello. Question. So, right now- Oh, maybe the last question. Sorry. Go ahead. Yeah.
00:24:40.100 | Okay. Um, so right now you run the browser in a virtual environment. Yeah. Are you planning to put
00:24:47.780 | the browser in a Docker container on the local computer, so later it can use the local cookies and access all the
00:24:54.260 | accounts? Yeah. Actually, we don't have a plan for a local environment because, as I just mentioned, we think it's very
00:25:00.180 | important to give users their attention back. We don't want the user to still have to pay attention
00:25:07.380 | to his phone or his computer. We want everything to run in the cloud. And we don't
00:25:12.420 | just have a Linux virtual machine. We also have plans for a virtual Windows and a virtual
00:25:18.020 | Android for Manus to run on in the future. So it's all about running in the cloud. Yeah. Okay. I think my time is up.
00:25:24.900 | Um, thanks, guys, for today. Yeah.
00:25:29.580 | Thank you.