Building Agents with Amazon Nova Act and MCP - Du'An Lightfoot, Amazon (Full Workshop)

Building agents with Amazon Nova Act and MCP. I'm excited today because we're going to build intelligent, autonomous AI systems that can help you build, scale, and improve your applications and business. My name is Du'An Lightfoot, and I'm joined by...

Hey, I'm Banjo Obayomi. I'm a solutions architect here at AWS.

Now, this is the AI Engineer World's Fair. I've been in tech over 15 years, and right now is the most exciting time for me in my entire career. One of the reasons for this excitement is agents. How many of you right now are building agentic systems? I love it.

So when we talk about agentic AI, I think it's important that we level set. From an AWS perspective, there are three key terms we need to think about. First, the ability to plan: the agent gets a prompt, it gets an objective, and it determines the actions that need to be taken, so it creates the plan. Then it acts on that plan by using things like tools. The third piece, and probably the most interesting, is the reasoning, where the agent is able to evaluate the results and determine if it needs to update the plan and take additional actions until the objective is complete. This is an agent.

Now, when we actually break down the architecture, I think it's important to take a look at this, because we have the user input, we have the agentic system, we have the possibility of some type of human in the loop, and then we have the generated response. When we dive a little deeper, there are some components of this agentic system. We have the LLM. We have a knowledge base with external information that we may want to provide. We have guardrails to say to the model "don't do this," or to ground the model with the truth from our knowledge base, to ask: is this actually relevant information? Is this accurate to the information we're receiving from the knowledge base? And then we have access to additional tools, memory, or we may need to talk to
additional agents or LLMs, like Amazon Nova Act, through something like MCP, and we have the ability to design our own flows for these systems. Now, the most interesting piece, which I think a lot of us are probably focused on when we're building these systems, is the continuous evaluation framework. How do we know if we're using the right LLM? How do we know if our prompt is consistent, accurate, or even optimized for the performance we're expecting? And then how do we even judge our system? How do we rate it and determine that it's actually solving the problems that we intend? Once we have this, we need to log this information and then have some type of subject matter expert determine how we can improve this system. This is the iterative approach: we're always trying to improve and optimize our agentic system.

Continuing on with this story, there are some use cases that we should be building these systems for. If it's a complex task and we don't know which tool should be used or how many tools should be used, and we want the model to leverage its reasoning capabilities, this is a great use case for an agentic system. But if it's something that is just one step, our traditional "if this, then that" approach is probably the best solution. We don't always need to provide some type of agentic system for something that can be done with a traditional solution.

Now, when we talk about agents on AWS, there are three approaches we should think about. First is the specialized approach, using something like Amazon Q. How many of you have used Amazon Q? There's Amazon Q in the console to help solve your problems on AWS, there's Amazon Q Developer inside of your IDE, and the one that I'm most excited about right now is the Amazon Q CLI agent. How many of you have used that? If you are into increasing your productivity, using a CLI agent has helped me tremendously, from editing the
video (it can do that), to summarizing a document, to reading my entire code base. Today, for one of my demos, I had some code and I was trying to figure out why it wasn't working. I said, "Analyze this code and tell me what you see. Let me know the APIs that it's calling." I looked at the APIs, and they didn't match my APIs in API Gateway. So when the code was deployed, it wasn't deployed with the right APIs. The agent was able to save me a ton of time just by analyzing the code and telling me what it saw, because I had never seen the code before. That's what these tools are able to help us do.

The next is fully managed. If you're using Amazon Bedrock, you're able to leverage Amazon Bedrock Agents to build and manage agents inside AWS. And what we're going to be focused on today is the DIY, do-it-yourself approach, using Strands Agents. This allows you to not just leverage Amazon Bedrock but also leverage models from other providers using LiteLLM.

Strands Agents was announced about a month ago. It's open source and extremely lightweight. If you've used other agent frameworks, it's like that, but you'll see in the code how easy it is to build an agentic system, or an agent itself, in a few lines of code and get started. I built a multi-agent solution in under 50 lines of code. When we break down Strands Agents, there are three components: we have a prompt, we have an LLM, and we have tools. So you create a function called, let's say, a get-weather tool, you define your agent, you give it a prompt, and it's already implemented. You'll see it in the code as Banjo goes through it in a moment.

Now, taking it a step further: as Danielle presented today on Amazon Nova Act, these models are able to do some really cool things, and this is another thing that I'm excited about. Amazon Nova Act is a research preview model, and the
capabilities of this allow you to use a prompt, or give instructions, and take on complex tasks like browsing the internet to do research, or searching on amazon.com to find the top list of widgets, return them, and then add them to your cart. You'll see how we can leverage this not just using the SDK for Amazon Nova Act but also by leveraging MCP.

Which leads us into the last piece. When we're talking about agents, I don't think we would be here today, moving as fast as we have, if it wasn't for MCP. How many of you are leveraging MCP today, the Model Context Protocol? How many of you have built your own MCP servers? I've built several. I've got two that I use all the time. How many of you use Obsidian? Okay. For my documentation, I built an Obsidian MCP server. This allows me to save all my documents and reference all my documents, and my entire workflow is streamlined because of this MCP server. I also use one for my bookmarks. I built a bookmark manager because every Friday I'm restarting my computer and I lose my bookmarks, or I save them and forget about them. But now I can just say "save this bookmark," and it gives it a description, a title, and a date, and I can even add notes so I can remember what this bookmark was for. So now when I open up Q CLI, I can say, "Hey, I'm looking for some information on MCP. Can you tell me all the bookmarks that I have?" and it'll find them. "Can you tell me the ones I saved last week?" This is the power that we have today.

With that being said, I think it's time that we all start building. Banjo is going to take over. If you open your laptops and go to this link, it's going to take you to a workshop environment where you have access to an AWS account, and Banjo is going to walk you through building out today's workshop. Thank you for your time.

Cool. All right, so this is
going to be a hands-on workshop. We've provisioned an AWS account for everybody here, so you don't have to install anything on your computer; everything's going to be done through the browser. I always say the hardest part of a workshop is just getting started. Some of my colleagues are also here, so raise your hand, AWS folks who are here to support. We're going to take some time to get logged into an environment, set up a VS Code server, enable models, and get the Nova Act API key. Again, this is the hardest part of the workshop, getting started, so let's take some time to get into the environment, and I'll follow along as well. Again, you don't have to install anything on your computer and you don't have to use your own AWS account. Everything is provisioned for you.

While that's loading, I'm going to briefly walk through the three modules of the workshop. The workshop is really about how you can use Nova Act. The first module is just getting started with Nova Act: we're going to make an API call. In the second module we're going to make an MCP server that can leverage Nova Act. And finally, we're going to use Strands Agents to hook everything together. Those are the three steps we'll go through in this workshop, and all the code is available via the link on GitHub, so you can try it on your own as well. If you can't follow along, I'm going to be doing it up here, so don't worry too much. Again, all the code is available, so you can try it out offline.

Okay, first things first. If you're following along, make sure to click this "Open AWS console" button. Again, we provisioned the AWS account, so don't log into your own AWS account and don't try to create a new one. Everything is provisioned here already. So I'm clicking that button to open up the AWS account, and now I'm logged into my AWS account. The first thing we do in the AWS
account is enable Amazon Bedrock models. Think of Amazon Bedrock as a serverless API to access different foundation models, and you can build lots of agentic applications on it. It has capabilities like knowledge bases and guardrails, and you can build agents on top of it. For anything you need to build AI agents or agentic applications, Amazon Bedrock has capabilities for that. For this workshop, we're just going to enable specific models. You can click the Amazon models, and then we'll use Claude 3.7, 3.5 Haiku, and 3.5 Sonnet. Those are the ones we're going to use for this workshop, so I'm going to request access there. Again, all the instructions are in the workshop as well, so you can follow along, but I'm just going to go through it for the sake of time.

The next part: once we get the model access, there's a VS Code server that has everything set up already, so I'm just going to go in there. The URL and password are there, and you can log into your VS Code server with everything installed. I'm also going to log into Amazon Q. Amazon Q is an IDE extension to help you write code. If we have time, you can sign up through a Builder ID, which is completely free. You don't need an AWS account and you don't need to put in your credit card; you can just log in through there. I already have an account, so that sped it up. It puts a nice little AI agent there, so you can ask questions, update code, et cetera. I'll show you some examples as I go through some of the code.

So who's gotten to this point, setting up all the models? Because once you get all this done, that's when the real fun begins. I'm just getting a pulse on whether I need to speed up or slow down a bit. Okay, I'll wait a bit. Again, raise your hand if you're stuck anywhere or have questions. We have agents that can come around and support you. So I'm going to pause for a little bit. Any general questions while we're waiting?

Oh, yeah. So this workshop, again,
all the code is available online, and this workshop is available as well, so you can also look through that. There's a website called workshops.aws, and when you go there you can search for something like "Nova Act," and it's the only workshop that shows up. So you can always go to workshops.aws, search "Nova Act," and this workshop will show up, so you can see all the instructions and all the code and run this on your own.

Okay. Then the last thing: because we're going to use Nova Act, we actually need to get a Nova Act API key. If you go to nova.amazon.com, this is a website where you can use the Amazon Nova models. You can do things like chatting, generating images, speaking with Nova, and generating videos, but this is also where the Act API key is generated. If you're following along and you want to generate your key, again, it's free to log in. You can use your amazon.com account, the one you use when you order something on amazon.com, to log in, and then you can just generate a key here and you'll be able to access that.

Okay, so I'm going to walk through what module one is. Has anybody gotten in here? Just a quick pulse check. If not, I'll continue. I know the Wi-Fi is slow, so it might be hard, so I'll just continue on. In the first one we're going to see how Nova Act works and what the actual code looks like. I generated the key, I need to export the key, and then I'm running the first script, which is actually going to open amazon.com and look for the first coffee maker. So let me show what that code looks like. Let's go here and make this bigger.

It's very simple code with Nova Act. Again, it's all in a Python SDK. I decide what page to go to, so go to amazon.com. I say I want you to search for a coffee maker, I say select the first result, and I say get the title of that product page. Very simple. If you've ever done web automation before with something like Selenium or Playwright, you probably have
to look for this div tag, look at this h1 tag, grab this information. There's a lot of manual work in actually inspecting the website. Here I'm just saying "click the search bar, find something." I don't have to specify "click this tag, do that." It makes it much easier to engage with the website as a human naturally would, instead of looking through divs and trying to find this p tag specifically. So this is a great way to just use Nova Act right out of the box. I'm going to run this so you can see an example.

All right, I added my key. Let's see what happens when I run this file. Give it a second. Oops, we failed. All right, let's start over. Ah, okay, I know, I've got to run it with this. Let's start that over. Yeah, question?

Just to explain why Banjo is running that command: it's running xvfb. It's a virtual framebuffer that runs your X11 system. What happens is that Nova Act actually goes and clicks a mouse in a browser, and that's why it needs to be run like that; otherwise it has no GUI. So this is just a way to emulate a graphical user interface on this Linux box.

Thank you, Daco. Yeah, since we're running everything in the cloud in a browser, I'm saying "open a browser" again, but it's already in a browser, so that's why it crashed. So I have to use that framebuffer command, and the workshop walks through why we did that. But you can see what is going on. Nova Act says: "I'm going to search for a coffee maker. I'm at the Amazon home page. My task is to search for this." So it's understanding what it's doing. "I see the search bar has 'coffee maker.'" I'm at the search bar here, and now it actually outputs the log as an HTML file. It's taking screenshots, so you can see what it looks like. It got the first results: "I'm on the coffee maker page." It selected it, and now it's asking, "What's the title of this product page?" All right, it got this Black+Decker 12-cup coffee
maker. "My task is to return the title of the product page." It got the product title and it ended the session. It also creates a video log, this WebM file, that I can look at to see everything it did. Yeah, question? We'll repeat that for the microphone.

So the question was: does it reason about the page in terms of pixels or in terms of text? It's actually looking at the page itself. You see in this video that it looks at the page and can see what's in the page. It's a trained large language model, so it can actually see what the page is doing. It's not looking at the h1 tag or whatnot; it understands the context of that particular page. It can see that's a search box: "Okay, I'm going to go click that search box." So yes, it understands what's on that page at the pixel level.

This is the video. It's hard to see, so let me make it bigger. It's sped up. It opens the page, it's able to type in "coffee maker" there, it gets that information, and it clicks the button. Even with all the ads and everything, the model can understand the task, click that, and get the information back. And that was a couple of lines of code, so you can extrapolate to other types of workflows for searching through things.

Sorry, I have another question. What I've experienced with these kinds of frameworks is that when you run this in a server environment, services like Cloudflare will block the access and maybe do a CAPTCHA challenge. How do we solve that using Q?

Yeah, so Amazon Nova Act doesn't do CAPTCHAs or things of that nature. It's meant for workflows you understand, but it's not going to bypass CAPTCHAs and other things of that nature. It's made for going to amazon.com or looking through a booking site. If something requires a human, again, you can't bypass that. You
wouldn't use Nova Act for that use case. If you need to pass a CAPTCHA or something else like that, use another technology. This is not meant to overtake humans; it's more about helping them and augmenting things. If there's a CAPTCHA involved, you have to use a different technology for that.

It's also a preview?

Yes, this is a research preview as well, so if that's a very good use case, leave feedback on the Nova website.

Is human in the loop possible at all with it yet?

With this one, no, because I'm writing all the code here. But again, this is Python code, so I could probably put something in here, like ask something or make an API call. Since it's Python code, you might be able to create some type of workflow that waits for a human response or whatnot.

Because the browser is running in headless mode. But could you make it work with a browser that the human is also seeing at the same time?

Yes. It can pause and wait for somebody to put in a password or credentials or do a CAPTCHA, and once it receives that, it continues on with the workflow. You could do that. Right now I ran it in headless mode, but it can also open up the browser. If I ran this on my MacBook, it would open up a Chrome browser and go through that session. Also, if you're running it and you want to get past something that's two-factor: if you're already logged into, say, amazon.com and then you run the code, it's going to use your credentials in that browser session to continue on and perform that task. So that's something you can do as well.

Cool. One other thing: you can also do parallel execution. My next example is one where I'm trying to find multiple monitors and I want to compare them all at once, so I'll show you what that code looks like. I can check for the monitor and extract information.
I'm defining what I want: I'm saying I want to find the price, the rating, and the size. Go to amazon.com. I set it to headless mode this time, so I don't need the framebuffer. I start multiple threads, and it looks for each monitor simultaneously, because each of these is an individual task, so I can parallelize them instead of waiting for it to go through them one by one. I define the list of monitors I want to go through, start the threads, and then it starts executing and finds the results for the monitors. So I can run that in the background. It says "starting with three parallel threads," and it's open. Again, it's running in headless mode, so it's going to be able to do this in the background, where we can see what the model is thinking and how it navigates through the web page. Yep?

I tried to use Nova Act this past April. It worked the first time, but once I did it again, it triggered the CAPTCHA. Is this something that has already been resolved? I think the website, and it was Amazon in this case, was detecting that it was a bot. Is there something like an llms.txt or robots.txt that can declare it?

So Nova Act has a GitHub repo, so you could go there and grab that. But it's working now. I'm running it; this is live code I'm doing right now. I just exported my API key and started running it, so you can try it in the workshop. It's ready to go. We're building right now, and you can see what's going on in the background, what it's doing: "I've looked at this monitor, the Dell monitor." "I'm at the Amazon home page." It's going through, looking through the search results, and saving things. You can see it's running in parallel. It got the information for one of the first ones. So I just set that up and it can execute it. If you have some type of, I don't know, daily news thing where you need to go to a website, get the news, and put together a report,
and there's no API for that, this is one way you can codify how to do that kind of search and get the information. Do you have a question?

Yeah. I'm wondering how successful this is with more ambiguous tasks. I ran the Amazon demo and that worked, but could I just put Google there? And how much does it know when it's navigating? I was thinking, if I wanted to return a pair of sunglasses that broke, would I be able to just say, "Start in Google, then find this company's website, find a way to engage support, and open a ticket"? How vague can you be, and how smart is it currently?

The more instruction you give, obviously, the better, but it's able to understand how to navigate a website. That's what the model is trained on. If you say "go to this sunglasses website," it probably wasn't trained on that specific sunglasses website, but it can understand "that button is support," or "click this to open a ticket." So it has general knowledge of how to navigate a website. But if there's something very intricate about that website, you're going to have to encode it in the text, like "make sure you click button X first," or whatever.

Got it. And does it understand when it's failed?

Sometimes. I've seen it sometimes get stuck in a loop, like "I keep scrolling, I keep scrolling, I keep scrolling," and it doesn't know when to stop. Again, this is in research preview, so things are getting better and the model is getting updated behind the scenes, but it's not AGI.

Got it. And one last question: how is it at navigating distrustful parts of the internet? There's a lot on the internet that we see and know is not to be trusted, or is something not to be followed. How have you worked around
that problem?

Again, it is a model in the background, so it's going to understand. If you're doing something like that, it's not going to want to click that, or there might be safeguards in place, so that's built into the model. But again, it is in research preview. You still have to explicitly say what buttons to press for certain actions. It is a trained LLM, so it's going to be able to understand the nuances and say if it can't take this action or can't do that. That could happen, but I haven't seen that use case. If you keep pushing it, maybe you'll find those things.

The thing I had in mind is, if you go to a site where you have to click a download link, sometimes there's an ad that says "download," and you know that's just an ad trying to get your attention. Would the model know?

Yeah. For example, amazon.com showed an ad for something, but I said "find the first result," and it was able to scroll past that ad and click something. The model understands the task you give it, so yes, it can understand that.

Thank you.

All right, so this is finished. That was really quick. It was able to find all the monitors and give me the size, the rating, and the price for each of the monitors. It executed that in parallel and got me the information. That's the idea: it can do parallel execution in the background, so you don't have to wait for it or watch it actually clicking through the task, and you get your information. All right, one more question, then we'll move on to the MCP part.

So Nova is specifically meant to be used with the browser, correct?

Nova Act is. Amazon Nova is a family of models from Amazon. If you go to this website, nova.amazon.com, you see there are different foundation models like Nova Pro, Premier, Lite, and Micro. These are the text understanding models, your typical LLM calls. There's also an image model called Nova Canvas, which can
generate images; there's a video model called Nova Reel, which can generate videos; and there's also a speech-to-speech model called Nova Sonic. So Nova is a family of foundation models by Amazon for all these types of tasks, and Act is just another one, for browser automation.

Are there plans to expand this beyond the browser, so that we can someday take actions in Slack or an IDE or anything outside of the browser?

Some of the team is here, so maybe talk with them later.

Thank you.

All right, so I'm going to move on to the MCP part.

Banjo? Nova Act is only available in the U.S.?

Yes, right now Nova Act is only available in the U.S. It's in preview, so it's just getting started. If you log in from an account with a different address, like the UK or something, it won't work. It only works in the U.S. at the moment. All right, one more question over there, and then I'm going to move on.

I ran the three-monitor example and got the same results as you did, but I actually got a different price for the Samsung Odyssey. Why do you think that might be?

Maybe your amazon.com is different, I don't know. It is opening up a different browser, so it could have clicked something differently. We can actually look at the video playback to see what the results were like. Okay, one more quick one.

Are there plans to support persisting browsing data, such as cookies, in the cloud browser?

Right now it's opening up its own browser, but you can also set your own Chromium profile and open up that browser, so you have all the things you have saved there. If you want to be logged into your stuff, you can set your own custom browser. But by default it opens up a completely new browser without anything saved.

All right, so I actually made an MCP server for Nova Act. Module two goes through MCP, and I can show you what I did for the MCP server. In fact, we can use
Amazon Q here. I'm going to ask it, "Can you tell me about the Nova Act MCP server? Can you tell me what it does?" You can see it's going through it: it integrates the Nova Act browser automation with MCP, and it has tools for browser session, browser action, execute parallel tasks, take screenshots, close browser, and list results. I created these different pieces of the MCP server so I could use something like Claude Desktop, Cursor, or Amazon Q CLI to say "open amazon.com and find information for me." So it's portable. I don't have to actually write code. I can say "go to amazon.com and find me the first coffee maker," and it'll actually write all the code I did in the initial example to do that. Or take the multi-monitor example, where I wrote a bunch of code: if I just said "get me the price for these three monitors," it would write all the Nova Act code it needs to do that using the MCP server. That's the power of MCP: I just describe a task, and it can encode the actual browser actions it needs.

I also made an MCP client that can interpret that. It connects to the MCP server, runs the code, and queries Bedrock. I'm using Claude 3.5 Sonnet here, because an MCP client needs to have an LLM behind it, and then it's able to understand which tools to use, run the code, open up the browser, and whatnot. So let me run the example here, module two. We opened the file, we asked Amazon Q to explain the file to us, and now we're actually going to run it. So python3, and then I can open this up.

Okay, let's be adventurous. Can somebody give me a query to try? If anyone has an idea, I'm going to just ask it and do something. Someone give me an idea of what to run with Nova Act.

Fix the Wi-Fi.

How would you fix the... "Can you find a website to fix"... fine,
let's see: "website to fix Wi-Fi, use headless mode." I spelled it wrong, but let's see. All right, it goes to google.com, types "how to fix Wi-Fi problems troubleshooting guide" in the box, presses Enter, and returns a list of the website titles and descriptions. It's going through that. It opened google.com: "How to fix Wi-Fi problems. I see an empty search bar where I can type queries to search for information. I should type 'how to fix Wi-Fi problems.'" So you can see it's understanding what to do. Oh, it hit a reCAPTCHA page: "The search results are not available," blah, blah. So it looks like it got stuck on a reCAPTCHA page. This is a headless agent, and someone asked a question about getting past CAPTCHAs and whatnot. You see that it got stuck doing that. It looks like it's stuck in a loop now. It sees the CAPTCHA again: "I should click the skip button to skip the CAPTCHA window." The CAPTCHA is still open, so it's probably going to be stuck here unless I close it. So you can see there are limitations; it's not going to pass CAPTCHAs and whatnot. That was a good query to show that. Oh, did it fill it in? It's still open, so it's going to be stuck here. I'm just going to close it out. But you can see it can't get past everything and it can't navigate through every website, so something like that will not work. That was a great test example.

If I use the baked-in one, "find a coffee maker under 50 dollars," it'll be able to go through that using headless mode. Any questions on that, seeing how the MCP server is working? I didn't have to write code. I just said "do something," and it actually wrote the code to do it for me. Question over here?

Yeah, so the question is about whether I can actually go into the browser and do it myself. If I ran this locally on my machine, it would actually open up the browser, I could click the button myself, and it would continue from there. Right now I'm running it within the browser, so I'm
running everything in headless mode, so we can't interact with that. You can see it's able to search for coffee makers under 50 dollars; it can look at the website, and it found search results on amazon.com. So for that use case, where we're not hitting CAPTCHAs, it's able to continue and find the information.

There was a question about whether I can actually order something. If I used my own browser session, logged in to my amazon.com account, and said "yes, order this for me, click through," it would be able to understand that. But I would have to use my own browser session, and I wouldn't want to log in here myself.

Another question: if you give Nova Act the authentication for Amazon, for example your login details, can it log in and complete that action for you?

Yes. If I say "this is my username, this is my password, enter them into that field," it will understand that this is a sign-in button and that it has this information. But again, this is all Python code, so you can encode it: you can make it an environment variable so it won't read it directly. There are a lot of ways to do that.

Does it also understand 2FA? Let's say it asks you to go to your Gmail. Will it open the Gmail website, check the email if you're logged in on your session, and then input the code?

Well, if there's no CAPTCHA like the one we just hit, there's nothing blocking it. But again, Nova Act is free to use, and there's a lot of creativity in this room, so I think we should have a Nova Act hackathon: do something crazy with Nova Act.

All right, one more question. Can I book a flight when my price alert is less than a hundred dollars, so it continuously checks?

You'd probably use something else for that, but yes, Nova Act can open up that website. You can just have a query run every day: open Google Flights, look at the quickest
thing, and if something is below this threshold, send me an email. Again, this is all a Python script, so you can set up something that triggers once a day, like in a Lambda function. So yes, totally possible. Nova Act is very flexible, and because it can run in headless mode, you don't need to have that UI. That's really what makes it helpful for interacting with websites that don't have a native API.

Thanks, this is pretty cool. I'm a little bit confused: we have the Nova Act SDK API key, and we're also doing some stuff in Bedrock. How does this actually work?

Yes, the Nova Act API key is separate. The MCP client I built needs a large language model to understand what's happening, so I'm using Claude 3.5 Sonnet for my MCP client. When I just ask it, "find that website for me," how does it know what code to run? It's using a large language model under the hood to work that out, and that's where we use Bedrock. It's in the code: I set the model ID, and the system prompt says, "you're an AI assistant, you have tools." So we're using Claude 3.5 Sonnet and making an API call to Bedrock whenever something happens. That's the LLM we're using, but Nova Act is separate from that. It's the same idea as using Claude Desktop, which runs an LLM inside that understands the MCP server.

A question here: does it integrate with browser plugins as well? Could it integrate with LastPass, if you have the LastPass plugin, fill in the credentials through LastPass, and then continue?

I haven't tried that, but you can set it up to use your own browser, so if the plugin is integrated there, it might be able to click through that. I have not tested it, but it's something to try out. Thank you.
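The once-a-day price check described above can be sketched in plain Python. This is only an illustration under stated assumptions: `check_price_alert`, `fetch_price`, and `notify` are hypothetical names, not part of the Nova Act SDK. In practice `fetch_price` would wrap whatever browser automation reads the price, `notify` would send the email, and a scheduler such as a daily Lambda trigger would call the function.

```python
from typing import Callable, Optional


def check_price_alert(
    fetch_price: Callable[[], Optional[float]],
    threshold: float,
    notify: Callable[[str], None],
) -> bool:
    """Fetch the current price once and notify if it is below the threshold.

    fetch_price: hypothetical callable that returns the price, or None if the
                 page could not be read (for example, a CAPTCHA blocked it).
    notify:      hypothetical callable that delivers the alert message.
    Returns True only when an alert was sent.
    """
    price = fetch_price()
    if price is None:
        # Nothing could be read this run; try again on the next schedule.
        return False
    if price < threshold:
        notify(f"Price {price:.2f} is below the {threshold:.2f} threshold")
        return True
    return False
```

Keeping the browser step and the notification step as injected callables means the decision logic can be tested without opening a browser or sending real email.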
The biggest problem you will face is two-factor authentication. Even if you gave it a password, something like Google Authenticator would be the biggest problem, along with CAPTCHA. Other than that, if you provide an environment variable, or give it instructions on how to access LastPass in the browser, you should be able to do it.

All right, one more question, then we'll go on to the last module.

Clearly there are a lot of different agent architectures you could use. What I can imagine is a coordinator agent running in the overall app, and when something pops up that says "you need to go and look this up online," it calls this. So my question is how modular it is. It's just Python, so it should be pretty modular, right? If I was coding a coordinator agent in LangChain or LangGraph, for example, it would call your sub-agent, run it, and get a text-based output that I throw into my message queue. Is that how you're imagining the architecture?

Yes, that's one way you can do it. Nova Act is Python, so it could be a tool, it could be an API call, and in the next module we're going to show you how to make an agent from that. So that's a good tee-up.

Du'An talked about Strands at the beginning. Strands is a new agentic framework launched by AWS. Let me open up the link. It's as easy as pip install strands, and the first agent is just a single assignment. It's a model-first way of interacting with agents. If you've used agent frameworks in the past, there's a lot of bootstrapping and making sure everything is correct, but that was necessary for the older models. If you think back to Llama 2, for example, and how far our models
have evolved since then. Now we can skip a lot of the bootstrapping we did previously, because the agent can figure that out. We don't need these very heavy approaches where you make sure everything is typed and so on. So here's a very simple example of what I spun up.

Strands also has native MCP support. In this example I have two MCP servers: the AWS documentation MCP server and the AWS diagrams MCP server. If you go to AWS Labs MCP, these are the official AWS MCP servers, and there's a bunch of different ones: cost analysis, Nova Canvas, diagramming, CloudFormation, lots of different ones. Again, it's all on GitHub under awslabs/mcp.

In the example here, I made a solutions architect agent: "your role is to help customers understand building on AWS." I define the two MCP servers, I give it a prompt, and I say this agent has all the tools from the MCP servers. It has a Bedrock model; I'm using Claude Haiku here. What's cool about Strands is it can also use LiteLLM and Ollama, so it has access to lots of different models, or you can run it locally, and of course it has access to Amazon Bedrock, which is what we're using here. Those three things make the agent: the tools, the model, and the system prompt. Then I can say, "get the documentation for AWS Lambda and create a diagram of a website that uses Lambda."

So let me run this code. It uses uv to install the MCP server locally. A lot of people ask where MCP runs. Here it's running locally, but there are other ways to run it, like in a Lambda function. For just testing it out, it pulls down the MCP server and runs it locally. You can see it's already executing, so let me make this a bit bigger. It says, "I'm going to help you with that. First I'm going to search the AWS Lambda documentation, I'll read the documentation, then I'll create a diagram illustrating a static site." So you
can see it does a POST request to do the search. The MCP server defines where everything is, so I don't have to feed in the Well-Architected Framework myself, and the AWS documentation is always up to date. It just knows to call the search function. It got the Lambda welcome page, it put that in context, and it's able to generate the diagram. It tells us what is going on and what the workflow looks like, and it tells me it saved the diagram to this location. I can open it up under generated diagrams. It's very small; let me see if I can make this bigger. There we go. So it was able to generate the diagram for me.

All of that in about 40 lines of code: I have two MCP servers, I have my prompt, and it's able to understand that and generate something for me. So it's very easy to get started building agentic workflows with Strands. I know "agent" means a lot of different things to different people, but if you have tools, a model, and a system prompt, and it takes some type of action, Strands makes it extremely easy to do that. With other frameworks it could be a lot more code, especially for integrating MCP natively like that. I'm going to pause here for any Strands questions. There's one coming.

I know Bedrock already had its own agents SDK. Is Strands replacing that, or is it supposed to complement it? Is this the preferred way of creating agents with models in Bedrock?

When it comes to the preferred way, it always comes down to your use case. Bedrock Agents is more opinionated about how to do things; you can do it through the console, and it has built-in support right there in AWS. Strands is an open source framework, so you can download the code, and you can use other models through it, like LiteLLM or Ollama. If you use Bedrock Agents, you can't run that offline. So there are different use cases and different developer tooling. I mean, me, the
software engineer, I like doing things code-first. So it does depend on your use case and what you're trying to do.

Can you show the code real quick?

Yeah, this is the code. Just show the agent. This is an open source framework. Where it says Agent, and it says model, right now we're using a Bedrock model, but you can use another model with LiteLLM. So you don't need AWS at all in that instance; you can use Ollama, you can use OpenAI. There's documentation for a lot of different model providers: Anthropic, LiteLLM, Ollama, OpenAI. It's an open source framework, so you can use it however you want. That's the idea of Strands: an open source agent development kit.

One question. Suppose I want to build a text-to-SQL agent, and I have, say, 15 tools already built that I want this agent to be able to use. If I use this framework, how can I make sure the agent knows when to use the right tool, and in what sequence?

Great question. In this example I have a weather agent. You said you already have tools. What I like a lot about Strands is that I can take a Python function I already have, put this tool decorator on it, and that's it. You don't have to add anything else; it understands what the function does. Then when I define the agent, I pass in the tools, including the native tools we're going to be using; http_request is a standard tool in the Strands framework. In this example I'm asking, "what is the weather in Seattle, and how many words are in this response?" This uses an open API, api.weather.gov, where you don't need an API key, and it can find the information for you. I'm going to update this to San Francisco. It started off wrong, but it's able to figure it out. It's a weather example with the word count, and I was very specific: find the weather first, and then count how many words are in the response. So it's able to use that tool.
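To make the decorator idea above concrete, here is a minimal sketch written without the Strands package so that it runs on its own. The `tool` decorator and the `TOOLS` registry below are hypothetical stand-ins, not Strands internals; they only illustrate how a function's name, docstring, and type hints can be turned into a description that a model could use to decide when to call it.

```python
import inspect
from typing import Any, Callable, Dict

# Hypothetical registry: tool name -> description the model would be shown.
TOOLS: Dict[str, Dict[str, Any]] = {}


def tool(func: Callable) -> Callable:
    """Hypothetical stand-in for a framework tool decorator.

    Records the function's docstring and parameter types, then returns the
    function unchanged so it can still be called like normal Python.
    """
    signature = inspect.signature(func)
    TOOLS[func.__name__] = {
        "description": inspect.getdoc(func) or "",
        "parameters": {
            name: getattr(param.annotation, "__name__", str(param.annotation))
            for name, param in signature.parameters.items()
        },
        "callable": func,
    }
    return func


@tool
def word_count(text: str) -> int:
    """Count the words in a piece of text."""
    return len(text.split())
```

With a real framework the decorator comes from the library, but the point is the same: an existing function needs a clear docstring and type hints, and little else, for the agent to pick it up after fetching the forecast.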
It gets the forecast, and then it knows to use that word count tool next. We're passing a lot of the responsibility to the model. The models are very smart now, so we don't have to say "do this, do this, do this." Let the agent figure it out; that's the role of the agent. You give it the context and the tools necessary, and it figures out the best way to solve the problem.

But then wouldn't it be prone to hallucination when you give it 20 tools? We've tried similar things, and when you bind more than, I think, 10 tools, it's going to...

Sure, it's always a balance. But again, the models are much better. Try using Claude 4 Sonnet: is it hallucinating as much? These newer models are much better at understanding the context and which tools to use when. The older models, sure, they get confused when there are so many things. But I'm very confident that these newer models can understand your use case and what tools are available, and figure out the best way to solve the problem.

So with this framework there wouldn't be a way to orchestrate a customized flow? It's more that you give control to the agent?

You could, if you want things done in a specific way. There are different ways in Strands, including something called workflow mode, where you say, "this is the workflow I want: do the research, analyze the results, write the final report." If you have to do something very sequential, Strands has that. I won't have time to go through all the different ways to do multi-agent collaboration, but if you specifically want to do X, Y, Z in order, the workflow approach can do that.

Then is it possible, say I don't have a predefined workflow, but I know it needs to figure out the right workflow?

That's what I just did there: I gave it a sentence, and it figured it out.

I see. Okay, perfect, thank you.

So Claude 4 has something called interleaved
thinking, I believe that's what it's called, where it can handle processing multiple tools much better than most models today. If you're passing in 20 tools, it's able to work through the agentic loop to figure out which tool to run, and it's also able to run parallel tool calls. Rather than saying, "here's the objective, let me run this tool," it can say, "here's the objective, let me run this tool, this tool, and this tool," and then process the results and determine what needs to happen next. So I would try Claude 4, like Banjo mentioned.

Then the last example, really quick. With Strands, I took my Nova Act MCP server, and it can run that. I just define: this is the MCP server, use the Nova Act MCP tools, use Claude. Same type of thing; I could have another agent use Nova Act as well. Strands makes it very easy to build these agentic workflows, and I really enjoyed the developer experience. I already have the MCP server, and we see the exact same example as before. Once you have the MCP server, it's very easy to plug it into different architectures, and Strands makes it very easy to accept that.

Those were the three modules: how to use Strands, MCP, and Amazon Nova Act. Again, Strands is open source; you can download it with pip install strands, and if you go to strandsagents.com, it will take you to the documentation. Also Nova Act: nova.amazon.com, it's free to log in. I think that's all the time we have, but we do have a survey, and you can get an AWS credit code by filling it out.

Banjo, I have a question about that workflow feature in Strands. When you create these individual agents, can you define which tools are passed on to each agent?

Yes, it's a great question. Talking about different agents, we're running out of time, but I'll quickly show it. I have a multi-agent example, I believe. Oh, I think you had it in the docs. Yeah, it's in the
docs. Each of these is a different agent. This is an agent: you can have a different system prompt, different tools, different models, and then the workflow just calls them. So yes, it's completely customizable. That's the good thing about Strands: it's very easy to customize and build scalable solutions like that. Thank you.

And again, here's the survey. You can get AWS credits for filling it out. Tell us how we did, what you liked, and what you want to learn more about. And now go build. Any other questions? I think we have a minute.

Thanks for the presentation. As these systems develop, I think it's reasonable to assume they will emerge as an increasingly effective vehicle for committing fraud online at scale, which would push businesses to implement more things like CAPTCHA, which in turn decreases the surface area where tools like this are applicable. So what is the long-term strategy for that?

Well, you already saw we failed with CAPTCHA today. We're not trying to beat CAPTCHA; we're not trying to break things. Responsible AI is very important to Amazon, so no, we're not trying to let this tool commit fraud. You have to have an API key, so usage can be monitored, and use cases like that will be shut down.

I think we're done, so thank you all. Oh, we can keep going? We have more time? The clock ran out, so I thought we were being kicked out. All right, more questions then.

Another question, regarding Nova Act. Let's say I have a headless browser in the cloud. Is there a way to connect Nova Act to my custom browser instance in the cloud?

Yes, there's a way to bring your own browser instance, so Nova Act supports that. It's possible. Thanks.

All right, let me go to the Nova Act GitHub page, and it's just
some examples there, I think.

Right, because it says start at one and then you have 120 minutes. I know, I just told him. Okay, no, you guys can keep rocking it.

So yes, there's a way to set up your own user agent for Nova Act, so it's definitely possible. There's a lot of time left, and I don't know if everyone actually got into the workshop, so we can still build some stuff, or I can try some other examples. We can try to make a Streamlit app with Nova Act and see if that works.

One example I tried was a Streamlit app that looks for the top five PlayStation games on GameFAQs and then creates a nice graph for me. But it can fail. Let's see. Oh, Nova Act got an error: it couldn't navigate gamefaqs.com. So it does fail at some things. Again, it's a research preview, and you have to be more specific about how it goes through things. Let me show you where the code is, just so you have an example. Let me set up my local machine so we can see how it works. Oh, go for it.

Does Nova Act depend on semantic HTML and good web design to actually work?

It understands the actual page, so it can click through those things. But if the page doesn't have, say, a search box or a button, it will not be able to navigate. As long as it can see the page, it can understand where to click and then click the correct buttons.

Maybe a follow-up: are there any efforts to do experimental engagement on the
page? If it comes to a page that it's not familiar with, maybe it would act like a human would and click on things or try things out.

It depends on what you put in the prompt, because you're the one creating the workflow. If you say "explore this website and find this thing," it will try to click through that. But it's up to what that initial prompt is.

When you're using the Nova Act SDK, you're giving it step-by-step instructions. So if you know it's an obscure website, you can give it the exact instructions it needs to perform. The MCP server, by contrast, uses natural language to infer what needs to be done, so there are no specific instructions coming from you unless you provide them.

So I'm going to run it locally on my machine, just to show an example. Let me hide my key for a second, because this is being recorded. All right: python, get coffee. Thank you for coming.

So I'm running it locally on my machine without headless mode. You can see it opens up the browser and it's able to type "coffee maker." What we're looking at now is not headless; this is Nova Act actually performing the task in the browser. There were a lot of questions about how it works, and we can try more complicated examples; I just wanted to show that it can work on your machine and that you can see the log. If I change the page while it's doing something, it's going to mess up. Someone asked about clicking on things, so I'm going to click on the page and see what it does. See, it crashed, because I changed to a different page and it didn't know what to do. So you can interact with it while it's going through the motions as well.

For the MCP server, I set up a Claude instance, and I have my Nova Act MCP server
there. If I click this, you can see all the tools it has available, so I can ask it to navigate a website. I know some people have been asking about complex examples, so go ahead and give me one.

One question I had: can Nova Act support drag-and-drop functionality?

You can try it. Do you have a specific website that has drag and drop? draw.io? "Let's go to draw.io and make a cool diagram, use Nova Act." I want to see what happens.

All right, it opened the page. Do I have to accept something? Nope, it's going. Open draw.io, wait for the page to load, look at the initial setup for template selections. Oh, it crashed. What happened? Do I have to click "allow always"? "I took a screenshot, I need to continue the browser session to see what's available." Let's look at the screenshot. All right, it's opening up again and going to draw.io. If I keep clicking away, it switches back to the browser session, so I need two monitors. Let's see if it's going to figure out how to use draw.io: wait for page, take screenshot, look for template options, open a blank page.

I didn't give it any specific instructions; I just said "make something cool," so maybe that's too hard to interpret for this website. Maybe I have to say "click the square button and then drag the square to the center." I might have to be more explicit. It seems to have frozen. All right, it's clicking something: "click new." Okay, it's doing stuff again. It's not instantaneous, but it is clicking through the buttons. Did it do anything? Oh, it looks like it failed. So yeah, it looks like Claude failed
that one, so I won't blame Nova Act for that. But that's the idea. Thanks for trying something hard.

Another question back there? Can we try another one?

Sure, let's try another one.

Can we do, on Google Maps, find the top three rated coffee shops within a mile radius of this hotel?

"Top three coffee shops near the Marriott Marquis in San Francisco." It'll figure it out. All right: open Google Maps, search for Marriott Marquis San Francisco, wait for results to load. So it has a plan. It opened Google Maps, and it's able to type "Marriott Marquis San Francisco." It searched and found the Marriott Marquis. There's a coffee button; let's see if it clicks that. I'm curious. It looks like it's frozen; give it a couple more seconds. It was trying to type in that box. All right, it's typing "coffee shops." It's going. It will open the coffee shops, and let's see if we can get the top three. There's a 4.8, a 4.7, another 4.7; let's see if it can get those.

Did it crash? I think Nova Act did it, but I'm going to blame Claude. Claude Desktop might not be the right MCP client for this. But because it's an MCP server, I can open up a different MCP client. I can open Cursor, for example, and ask it questions through that. You see it has the MCP tools. Let me open a new chat. I can do the same thing, "use Nova Act," and it calls the MCP tool again. That's the beauty of MCP: I already have this server, so I can just use a different client, and it can understand all the information it needs and run the exact same command. Cursor might be smarter than Claude, but it's able to do the exact same type of thing. Another question over here?
I just have a question on the Nova Act model. Is that model running in the cloud?

The question was where Nova Act is running, and yes, it's running in the cloud. You get that API key, and it's doing the call behind the scenes to the AWS cloud.

So then what does it upload to the cloud?

It's sending the instruction, like "go to Google Maps," and the model says "I understand that," and then it's clicking those buttons and doing the actions. So it's the intent of what you're trying to do and the specific action. You can't use Nova Act purely locally; it has to be connected to the internet.

Okay, but if, for example, I wanted it to look at my Gmail?

Ah, I see what you're saying. It is an API endpoint, and it's being passed to AWS, so only pass information that you feel comfortable with. It's not going to be training on the data or anything of that nature, but it is going to the AWS cloud to process what to click, and then the click happens locally in your browser.

See, now it's even sorting by rating; it actually knows which rating to press.

In this MCP server example, I say, "find the top three coffee shops near the Marriott Marquis," and I'm passing that to the LLM. The LLM comes up with the plan, and then it uses Nova Act to interact with the browser, because Cursor, Claude Code, or Amazon Q can't interact with the website by themselves.

Right, but how does it come up with the plan?

That's the MCP client. In the example with the MCP client, we showed the model, the Claude 3.5 model, right? Yeah, that's coming up
with the plan. Same thing here: I asked, "help me find the top three coffee shops near the Marriott Marquis," and the model that Cursor is using comes up with the plan, and then I'm using the Nova Act MCP server to act on it. So this is the plan: search for Marriott Marquis, click the Marriott Marquis, search for coffee shops. And you see all the information Nova Act returned. It actually returned this time, so I think the problem was with Claude Desktop. It got the top three coffee shops there.

What are all the tools that Nova Act has today?

The MCP server is what I wrote. But the idea behind Nova Act is that it can interface with the web browser. The browser is the tool, and anything on the website, it can click through.

You've got the repo, and you've got an architecture diagram that shows the MCP pieces, just so they can see it.

Yeah. I mentioned there are official AWS MCP servers at AWS Labs MCP, with a lot of different MCP servers there. The Nova Act one I created myself. Let me go back to the Nova Act examples. Here: I used Amazon Q to explain the MCP server, what's going on, and what each tool does, like the browser session, or performing an action in the browser. This is a good thing to talk about, so: "can you dive deeper on the browser action function?" This is how it's actually acting. Amazon Q says browser action is designed to perform actions. It's going through the code: it performs a single action in the Nova Act browser. The act.act call is how you tell Nova Act what to do, like "click the search bar," do this, do that. The MCP client understands how to use that act.act call, and it passes the correct action. We saw it in the example here: one of the actions was go to Google Maps, or click this button, or do that
search. That's how it's able to run these actions, and the Nova Act MCP server translates that into actually clicking the button. So the MCP server provides all the interfaces it needs, and the MCP clients can then interact and perform actions. Nova Act is just the model in the background that's able to click those buttons.

Extending this question: Claude or Cursor is running locally, and it's calling your MCP server, which is also running locally. Your MCP server is the one that spun up the Chromium instance? Is your MCP server taking screenshots of what you see in Chromium and shipping them to Nova Act?

The screenshots are local. You can see it's getting the final page information, so it's not storing your screenshot data and sending it off. Everything is running locally; it's clicking those buttons based on what's in the browser session.

Got it. But does any of the information in Chromium need to be sent anywhere?

No, everything is running locally.

Okay, perfect, thank you.

Let me open up... where was I looking? One of the things about making MCP servers is that you have to provide a lot of context. For Nova Act, I say: when writing actions for Nova Act, be descriptive about what to do. "Click the hamburger menu icon, go to order history," not "find my order." The more concise and prescriptive you are about what you want to do, the better: "search for hotels in Houston, sort by average customer review." The more specific it is, the better the MCP client is able to make good requests and find the information: "type coffee maker in the search box, press Enter." So the more prescriptive you are with Nova Act, the better your results are going to be, and I encoded all of that into this MCP server so the clients can leverage that.
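The "one prescriptive action per call" pattern described above can be sketched as a small loop. This is an illustration, not the speaker's MCP server: `run_steps`, `COFFEE_MAKER_STEPS`, and `fake_act` are hypothetical names. The `act` argument stands in for whatever function actually drives the browser (for example, the act call on an open Nova Act session), and `fake_act` exists only so the sketch runs without a browser or an API key.

```python
from typing import Callable, Dict, List


def run_steps(act: Callable[[str], str], steps: List[str]) -> List[Dict[str, object]]:
    """Run short, prescriptive instructions one at a time and keep a log.

    act:   callable that performs a single browser action and returns a
           result summary (an assumption of this sketch).
    steps: ordered instructions, each describing exactly one action.
    """
    log: List[Dict[str, object]] = []
    for number, instruction in enumerate(steps, start=1):
        result = act(instruction)
        log.append({"step": number, "instruction": instruction, "result": result})
    return log


# Prescriptive steps in the spirit of the talk, rather than one vague
# request such as "find me a coffee maker".
COFFEE_MAKER_STEPS = [
    "click the search box",
    "type 'coffee maker' in the search box",
    "press Enter",
]


def fake_act(instruction: str) -> str:
    """Stand-in for a real browser action; just echoes the instruction."""
    return f"done: {instruction}"
```

Calling `run_steps(fake_act, COFFEE_MAKER_STEPS)` returns one log entry per step, which mirrors the idea of the server handing back a record of each action so the client can decide what to do next.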
So I think that's probably one of the hardest things about making MCP servers: making sure you provide the right context about when to use the tool, how to use the tool, and the inputs and outputs. Once you solve all that, it's very easy to plug and play with different MCP clients, like we've done here. Question?

Right, so when Nova Act is doing something, it's passing back a log of everything it's doing, so you know what steps it took: the starting page, the action, the results, the action result ID. It's keeping a log of everything it did, so the client is able to get that JSON, understand what the ID and the result are, and move on to the next step. Another question?

A quick question: is this able to do automated UI testing?

With Nova Act you can define what you want it to do: go to this button, click this, does this work? So you can define that workflow. I mentioned before that back in the day, if I was writing Selenium code, I had to say "click this h1 tag, do this." Now you can just write natural language: click this button, click that button. So yes, it can handle that use case of opening the browser and checking these things. But as with the "search for coffee maker" example, you specifically have to write which buttons to press. Thank you.

I guess we have time, so I can show some multi-agent collaboration with Strands; that could be something cool. I think I have a repo for that. Let me go to the AWS Labs page. Cool. I'm just going to copy this code and put it into our environment.

In this example I'm going to show how Strands does multi-agent collaboration. I'm going to create a PowerPoint presentation based on a cloud migration request: I want to move my
infrastructure from on premises to the cloud; give me a presentation of how I would do that.

For this I created three different agents. I created a cost analysis agent, so I have a system prompt there. A solutions architect agent maps out what you're going to be doing. And each of these tools is an actual agent. The cost analysis agent has the docs MCP server and the cost analysis MCP server, and it has its own prompt. The presentation agent has its own system prompt, and it has a tool from a PowerPoint MCP server that I'm using. And there's an architecture agent, which also has its own specific tools, system prompt, etc. So different agents for different things in the workflow.

Then I have this orchestrator agent, which we'll call the migration orchestration agent. It has a prompt, and I tell it what tools it has access to. The cool thing with Strands is that I make this orchestrator agent and the tools are just other agents, so it knows when to call this agent for this particular task and when to call that one. And I say: I want to migrate my workload, so use the right tools for that.

So I made a fictional company called ShopEasy E-commerce. They're on premises with Java and a MySQL database, and I want a zero-downtime migration; all these little constraints are in there. I want it to make a migration plan and a PowerPoint presentation that I can present to my executives showing how this would work. I just define that, and the orchestrator agent will figure out what to do. I don't specifically say do this one first, do that one first; we let the agent figure that out.

So let me run that with Strands; it should be the multi-agent one. All right, so cloud migration, agents as tools. Again, all the MCP servers are running locally; it downloads them using uvx. It starts with the architecture design first and generates a diagram. That takes some time. It might fail, but it will just update itself and make another attempt. All right, I think
it couldn't generate the diagram there, but it's saying: all right, this is what the diagram should have, this is what we're going to be doing, and now I'm just going to do a cost analysis on basically the things we did there.

This workflow takes a couple of minutes to run, but you can see it's calling all these different agents. It understands what to do and what actions to take first. It's finding pricing for EKS, because it has a cost analysis tool and knows where to find that information, so it has up-to-date pricing all the time. It's finding pricing for Aurora for its database. So it's able to understand all that information and get real-time, up-to-date information, just because we have that pricing MCP server from AWS Labs. It's called cost analysis, yeah, the Cost Analysis MCP server. Documentation, all the stuff you need for finding the right price on AWS: it has all that information, and the agent was able to just use that.

Now it's going to generate a report. It's still running. Again, this does take a while, because I'm asking a very complex question with a lot of things going on, so it takes a couple of minutes to run through all that. It gets its monthly spend predictions, monthly savings, etc., so it can understand all the information and get up-to-date information based on the plan we provided.

And the last thing: now it wants to create an executive presentation, so it downloads the PowerPoint MCP server, and now it's going to make a PowerPoint presentation based on that. It's adding the title slide, adding a placeholder. Generating PowerPoint is a very popular use case, and there's an MCP server that can go ahead and just do that: add bullet points, etc. So give it another minute or two.
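
The "agents as tools" pattern described in this demo can be sketched in plain Python. This is only an illustration under stated assumptions, not the Strands API and not the code shown in the workshop: `SpecialistAgent`, `Orchestrator`, the `needs` field, and all handler functions are hypothetical names. In the real demo a model decides which agent to call next; here a simple dependency check stands in for that reasoning, and string-returning stubs stand in for the model calls and MCP servers.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class SpecialistAgent:
    """Hypothetical stand-in for an agent with its own system prompt and tools."""
    name: str
    system_prompt: str
    needs: Tuple[str, ...]  # results this agent requires before it can run
    run: Callable[[str, Dict[str, str]], str]  # stub replacing model + MCP tools


@dataclass
class Orchestrator:
    """Treats other agents as tools and works out the call order itself."""
    tools: List[SpecialistAgent]
    trace: List[str] = field(default_factory=list)

    def __call__(self, request: str) -> Dict[str, str]:
        results: Dict[str, str] = {}
        pending = list(self.tools)
        while pending:
            # Assumption: a dependency check replaces the model's reasoning.
            ready = [a for a in pending if all(n in results for n in a.needs)]
            if not ready:
                raise RuntimeError("no specialist agent can make progress")
            agent = ready[0]
            results[agent.name] = agent.run(request, results)
            self.trace.append(agent.name)  # log of which agent ran, in order
            pending.remove(agent)
        return results


# Stub handlers: placeholders only, no real pricing or diagrams are produced.
def design(request: str, results: Dict[str, str]) -> str:
    return "target architecture: EKS for the app, Aurora for the database"


def estimate(request: str, results: Dict[str, str]) -> str:
    return "cost estimate for " + results["architecture"]


def present(request: str, results: Dict[str, str]) -> str:
    return "slides covering: " + results["architecture"] + " | " + results["cost"]


# The tools are registered in an arbitrary order on purpose: nothing here
# says "do this one first".
orchestrator = Orchestrator(tools=[
    SpecialistAgent("presentation", "You build executive slides.",
                    ("architecture", "cost"), present),
    SpecialistAgent("cost", "You estimate AWS costs.",
                    ("architecture",), estimate),
    SpecialistAgent("architecture", "You are a solutions architect.",
                    (), design),
])

results = orchestrator("Migrate the on-premises Java/MySQL workload with zero downtime")
print(orchestrator.trace)  # ['architecture', 'cost', 'presentation']
```

Even though the presentation agent is listed first, the orchestrator runs architecture, then cost, then the presentation, which mirrors the order seen in the demo run. In a real system the ordering would come from the orchestrator model reading each tool's description rather than from a hard-coded `needs` field.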
