Build an AI Research Agent: Apoorva Joshi

00:00:15.480 |
And welcome to this workshop I like to call the A to Z of building AI agents. 00:00:21.200 |
So during the workshop today, we'll spend about 20 to 30 minutes talking about the basic 00:00:25.780 |
concepts of what AI agents are, when to use them, the different components of agents and 00:00:31.600 |
concepts that you'll find helpful during the hands-on portions of the workshop. 00:00:37.240 |
And then you will spend the rest of the time building an AI agent of your own with help 00:00:42.440 |
and assistance from me, and I have my awesome team back there. 00:00:48.900 |
So if you run into issues, call upon one of us and we'll figure it out. 00:00:56.080 |
I'm Apoorva, and I'll be your lead instructor for today. 00:01:00.420 |
Five months ago, I stepped into my first ever developer advocacy role at MongoDB, and prior 00:01:05.860 |
to that, I spent about six years as a data scientist in the cybersecurity space, applying machine 00:01:11.400 |
learning to problems like phishing detection, malware and ransomware detection, that kind of thing. 00:01:16.380 |
Outside of work, I read a lot, try to do yoga kind of regularly, and I'm always on a mission 00:01:30.800 |
We are all here to learn, so ask as many questions as you'd like. 00:01:35.300 |
We'll go over key concepts before getting into the hands-on labs, so during these exercises, 00:01:40.720 |
we definitely encourage you to form groups and work together where you can. 00:01:47.340 |
Here's a link to the slides and also the hands-on lab that you'll be working through today, and 00:01:52.440 |
I'll leave this here for a few minutes for you all to scan. 00:01:56.400 |
So link and QR code should also be on these, like, postcards that were just handed out, 00:02:01.220 |
and if you didn't receive one, then raise your hand and we'll get you one. 00:02:34.760 |
So the goal of the workshop is to introduce you to the basic concepts of AI agents and also 00:02:40.160 |
get hands-on experience with building an agent end-to-end. 00:02:43.680 |
So, yeah, I'm going to start off by talking about what agents are, what are the AI agent 00:02:48.840 |
use cases, components of an agent, and then we'll build an AI research agent together, and 00:02:54.800 |
depending on how long it takes us, we may or may not have time for Q&A, but I'll be around afterwards if you have questions. 00:03:03.580 |
So let's start by talking about what AI agents are. 00:03:09.060 |
So an AI agent is a system that uses a large language model or LLM to reason through a problem, 00:03:16.000 |
create a plan to solve the problem, and also execute the plan with the help of a set of tools. 00:03:22.720 |
So let's see how agents are different from other techniques for interacting with LLMs, because 00:03:28.340 |
this will kind of help us build an intuition for when to use agents. 00:03:32.460 |
So let's take the example of simple prompting, where you simply prompt an LLM to generate an 00:03:37.260 |
answer based on its pre-trained parametric knowledge. 00:03:40.460 |
So as you can imagine, this is good for point-in-time general knowledge kind of questions, but probably not for complex, multi-step tasks. 00:03:46.920 |
Because even if you manage to prompt the LLM to perform really complex tasks, it might 00:03:52.080 |
not have the means or information to execute on the task. 00:03:56.580 |
The LLM in this situation also can't self-revise and refine responses based on either previous 00:04:01.700 |
or new information, and it definitely doesn't have a means to learn preferences and provide 00:04:07.300 |
personalized responses over time, which sometimes is a requirement. 00:04:14.140 |
Moving on to retrieval augmented generation, aka RAG, with RAG, you can broaden the scope 00:04:19.300 |
of the LLM by augmenting its knowledge with information retrieved from a knowledge base. 00:04:24.580 |
So that way you can be somewhat confident that the LLM at least has information required to 00:04:29.860 |
perform tasks that you wanted to perform, but it doesn't quite solve for some of these other 00:04:34.820 |
requirements, such as handling complex tasks, self-refinement, or personalization. 00:04:42.400 |
Coming to agents, with agents, you can give the LLM access to external tools and past interactions 00:04:48.520 |
which act as the memory of the agent, and then you can prompt it to go through multiple iterations 00:04:53.500 |
of reasoning and action-taking to finally arrive at the final answer. 00:04:58.580 |
So tools are how agents are able to execute complex multi-step tasks, and LLMs can also 00:05:05.120 |
be prompted to incorporate the feedback or output from tools into the reasoning process to say, 00:05:11.580 |
repeat steps if necessary, or call additional tools as follow-up tasks. 00:05:16.900 |
Coming to past interactions, past interactions can be persisted and updated, which means the LLM 00:05:22.160 |
agent can now learn from these to provide personalized responses over a period of time. 00:05:26.900 |
So as you can imagine, tools, memory, and iterative prompts can solve a lot of problems, but there's 00:05:32.700 |
obviously some known challenges at the moment, such as long-term planning, where the agent is 00:05:37.800 |
expected to execute complex tasks based on a lot of information, or information it has learned over time. 00:05:47.840 |
There's also a high cost and latency associated with agents, because they typically make many LLM calls per task. 00:05:55.700 |
But despite all of these challenges, I think we can agree that agents are how we get the most out of LLMs. 00:06:05.040 |
So let's take some example tasks or questions and try to answer whether or not the task really requires an AI agent. 00:06:13.160 |
So this one, for example, like who was the first President of the United States? 00:06:17.780 |
Does it require an AI agent to complete this task? 00:06:25.720 |
But I would say no, because the information required to answer this question is very likely 00:06:30.620 |
present in the parametric knowledge of most LLMs that we know today. 00:06:41.820 |
What's the travel reimbursement policy for my company, MongoDB, or for your company? 00:07:13.860 |
So I would say it's a pretty straightforward task, provided the LLM has access to the right information. 00:07:18.860 |
So to me it sounds like a better fit for retrieval augmented generation where the LLM has access 00:07:25.420 |
to the right knowledge base than something complex like an AI agent. 00:07:32.480 |
How has the average daily calorie intake among adults trended over the years? (The question is already getting long.) 00:07:39.120 |
And what impact might it have on obesity rates? 00:07:41.860 |
Additionally, can you provide a graphical representation of the trend? 00:07:52.420 |
Like, I think this task looks like it involves multiple subtasks such as at least data aggregation, 00:07:58.640 |
visualization, and also reasoning through the results that it's obtained from these various subtasks. 00:08:04.680 |
So I think it sounds like a good fit for agents. 00:08:09.840 |
What about a personalized learning assistant that can adjust its language, examples, and methods to the learner? 00:08:22.140 |
I think this is another example of a complex task which requires also long-term personalization. 00:08:27.320 |
So again, I think it's a good use case for agents. 00:08:30.320 |
So the TL;DR is: use agents for complex, multi-step tasks that require integration of multiple capabilities such as question answering, task execution, analysis, that kind of thing. 00:08:42.760 |
And using all of these to arrive at a final answer or outcome. 00:08:46.580 |
And also if there is a need for personalization or adaptive responses. 00:08:51.000 |
So as we saw, memory, tools, and being able to reason is what really makes AI agents so powerful. 00:08:58.620 |
So let's dig a little bit deeper into each of these components, starting with planning and reasoning. 00:09:06.280 |
So the simplest way to imbue planning and reasoning capabilities into agents is via, believe it or not, user prompts. 00:09:13.380 |
You can start super simple by prompting the agent to create a plan of action based on its initial understanding 00:09:19.780 |
of the problem and this is what we call planning without feedback since the agent does not modify 00:09:25.700 |
its execution plan based on any new information that it's gathering from tools that it's executing. 00:09:31.620 |
It's just in the beginning it creates an execution plan and runs with it. 00:09:36.280 |
So common design patterns for this kind of planning are chain of thought and tree of thoughts. 00:09:41.920 |
Then there's planning with feedback where you can prompt the agent to adjust and refine 00:09:45.620 |
its responses based on tool outcomes, or even asking it to critique and reflect upon its own responses. 00:09:53.640 |
And common design patterns in this regard are ReAct and reflection, and we'll experiment with some of these today. 00:10:04.260 |
So chain of thought is as simple as prompting an LLM to think through a problem step by step instead of answering in one go. 00:10:12.520 |
You can do this either in a zero-shot manner, by literally saying, hey, let's think step by 00:10:18.060 |
step, or in a few-shot manner, where you show it how to work through a complex problem using examples. 00:10:27.620 |
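As a rough sketch of the difference between the two styles (the prompt wording below is illustrative, not tied to any particular model or library):

```python
# Zero-shot vs. few-shot chain-of-thought prompting, as plain prompt
# construction. A real application would send these strings to an LLM.

def zero_shot_cot(question: str) -> str:
    """Zero-shot CoT: just append a 'think step by step' instruction."""
    return f"{question}\nLet's think step by step."

def few_shot_cot(question: str, examples: list[tuple[str, str]]) -> str:
    """Few-shot CoT: prepend worked examples that show the reasoning."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {question}\nA:"

prompt = zero_shot_cot("If I have 3 apples and buy 2 more, how many do I have?")
```

The few-shot variant is exactly the same idea, except the "how to reason" instruction is conveyed through the worked examples rather than stated explicitly.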
Then we have tree of thoughts which takes the idea of chain of thought up a notch. 00:10:31.840 |
So tree of thoughts allows the LLM to perform deliberate decision making by considering multiple different 00:10:38.440 |
reasoning paths and having it self-evaluate choices to decide the next course of action. 00:10:44.780 |
So it kind of combines this LLM's ability to generate and evaluate thoughts with search 00:10:51.020 |
algorithms, because it can also look ahead and backtrack when necessary to make more globally informed decisions. 00:10:59.620 |
Then we have patterns for reasoning with feedback starting with react. 00:11:03.960 |
So what we do here is we prompt LLMs to generate verbal reasoning traces and also tell us the 00:11:09.920 |
actions that it will take to solve a particular problem. 00:11:12.900 |
So after each action we ask the LLM to make an observation based on information or feedback 00:11:18.560 |
obtained from the previous action and plan what action to take next. 00:11:22.340 |
And then this kind of process continues until the LLM decides it's done, or you can intervene and say that you've reached the answer. 00:11:32.600 |
So in this example here, as you can see, the first thing that the LLM does is generate a 00:11:37.000 |
thought saying like okay this is how I need to solve this problem. 00:11:41.000 |
Then the second is an action step where in this case it's determined that it needs to call 00:11:47.020 |
the search tool with arguments that it's determined. 00:11:50.380 |
And then it makes an observation, saying okay, I don't think I have an answer yet. 00:11:54.740 |
This is what I'm going to do next and does that till it reaches the final answer. 00:12:01.860 |
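The thought, action, observation cycle just described can be sketched as a toy loop. The "LLM" here is scripted and there is only one tool; the `Action: tool[input]` format is a simplified assumption for illustration, not LangChain's actual implementation:

```python
# A toy ReAct loop: the model emits thoughts and actions, we run the tool,
# feed the observation back, and stop at "Final Answer:".

def search_tool(query: str) -> str:
    # Stand-in for a real search API.
    return f"Results for '{query}': George Washington."

# Scripted model turns: what the "LLM" emits on each step.
scripted = iter([
    "Thought: I should look this up.\nAction: search[first US president]",
    "Thought: I have the answer now.\nFinal Answer: George Washington",
])

def fake_llm(transcript: str) -> str:
    return next(scripted)

def run_react(question: str) -> str:
    transcript = f"Question: {question}"
    while True:
        step = fake_llm(transcript)
        transcript += "\n" + step
        if "Final Answer:" in step:
            return step.split("Final Answer:")[1].strip()
        # Parse the action; only one tool here, so call it directly.
        action = step.split("Action:")[1].strip()       # e.g. search[...]
        tool_input = action[action.index("[") + 1:action.rindex("]")]
        transcript += f"\nObservation: {search_tool(tool_input)}"

answer = run_react("Who was the first President of the United States?")
```

The important part is the shape of the loop: generate, act, observe, repeat until the model signals it is finished.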
Another technique for incorporating feedback into the planning process is via reflection. 00:12:07.040 |
And this involves prompting LLMs to reflect on and critique past actions to decide what action 00:12:13.480 |
to take next. You can either prompt the same LLM to generate and critique, 00:12:20.200 |
use different LLMs, or even use multiple agents, where one agent generates responses and another critiques them. 00:12:27.960 |
But yeah whatever the architecture the goal is to run the generation reflection loop several 00:12:33.360 |
times before the LLM arrives at a final answer. 00:12:37.160 |
So essentially trading compute for a better shot at accuracy. 00:12:43.520 |
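A minimal sketch of that generation-reflection loop, with stubbed stand-ins for the generator and critic models (a real setup would make LLM calls in both roles):

```python
# Generate -> critique -> revise, repeated for a fixed number of rounds:
# extra compute in exchange for a better shot at a good final answer.

def generator(prompt, feedback):
    # Stand-in: a real LLM would revise its draft based on the feedback.
    return "draft" if feedback is None else f"draft revised after: {feedback}"

def critic(draft):
    # Stand-in: a real LLM (or a second agent) would critique the draft.
    return "OK" if "revised" in draft else "too vague, add detail"

def reflect(prompt, max_rounds=3):
    feedback = None
    for _ in range(max_rounds):
        draft = generator(prompt, feedback)
        feedback = critic(draft)
        if feedback == "OK":        # critic is satisfied, stop early
            break
    return draft

final = reflect("Explain AI agents")
```

Capping the number of rounds matters in practice; without it, a picky critic can keep the loop running (and the bill growing) indefinitely.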
The next component we want to talk about is memory. 00:12:47.300 |
And this component allows AI agents to store and recall past conversations and enables them 00:12:54.300 |
And as you can imagine memory is a pretty complex and nebulous concept. 00:12:59.780 |
And you could break it down into several categories but broadly when I think of memory it's two 00:13:05.380 |
main types of memory, much like for humans: short-term and long-term memory. 00:13:10.340 |
So short-term memory in the case of agents deals with storing and retrieving information from a single conversation. 00:13:17.500 |
And long-term memory deals with storing, updating, and retrieving information from multiple conversations over time. 00:13:25.160 |
And this is what really helps agents personalise their responses over a longish period of time. 00:13:32.760 |
So short-term memory is relatively easy to implement. 00:13:35.880 |
Like how hard can it be to store a single conversation, right? 00:13:39.680 |
In most cases, not that hard, unless the conversation gets too long, in which case you 00:13:44.060 |
need to start considering how to condense that list so you aren't overwhelming the LLM with 00:13:50.360 |
too much information. Some solutions for that are things like retrieving only the N most 00:13:56.180 |
recent messages, or summarizing the conversation at the cost of some information loss. 00:14:03.040 |
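Both strategies are a few lines each; the summarizer below is a stub (in practice you would ask an LLM to write the summary):

```python
# Two simple ways to keep a single conversation from growing unbounded.

def trim_history(messages, n=4):
    """Keep only the n most recent messages."""
    return messages[-n:]

def summarize_history(messages, keep_last=2):
    """Collapse older messages into a summary, keeping the latest ones."""
    old, recent = messages[:-keep_last], messages[-keep_last:]
    summary = f"[summary of {len(old)} earlier messages]"  # stub summarizer
    return [summary] + recent

msgs = [f"message {i}" for i in range(10)]
```

Trimming is lossless for the recent context but forgets everything older; summarizing keeps a compressed trace of the whole conversation at the cost of detail.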
Long-term memory, on the other hand, is a largely unexplored area so far since it's non-trivial 00:14:08.540 |
to decide and implement what states to track and how to track them and when to update them. 00:14:16.800 |
But I think some patterns are emerging in the sense that the best way to go about implementing 00:14:22.500 |
long-term memory is to design application-specific agents. 00:14:26.460 |
That way you're able to narrow down the number of states you want to track and just focus on those. 00:14:37.100 |
So tools are interfaces for agents to interact with the external world in order to achieve their 00:14:43.360 |
objectives and these can range from simple APIs such as search weather APIs to complex things 00:14:50.140 |
like vector stores or even specialized machine learning or deep learning models. 00:14:55.200 |
So tools for LLMs are typically defined as functions and most recent LLMs have been trained to identify 00:15:03.380 |
when a function should be called and they'll respond with a function signature that you can then 00:15:08.740 |
use to call a particular function in your code. 00:15:12.120 |
And frameworks like LangChain handle the function calling for you, but the basic concept still remains. 00:15:18.140 |
And to help the LLM identify which function to use, you typically give the tool a descriptive name, 00:15:24.160 |
provide a pretty detailed description of what exactly the function 00:15:29.440 |
does, and specify the types of its arguments. 00:15:34.940 |
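Concretely, a tool is usually a plain function plus a schema the LLM sees when deciding what to call. The schema below follows the common OpenAI-style function-calling shape; the exact format varies by provider, and the weather function itself is a made-up stand-in:

```python
# A tool definition: the function, and the name/description/argument-type
# schema that helps the LLM decide when and how to call it.

def get_weather(city: str, unit: str = "celsius") -> str:
    """Stand-in for a real weather API call."""
    return f"22 degrees {unit} in {city}"

get_weather_schema = {
    "name": "get_weather",                      # descriptive tool name
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

# The LLM responds with a function signature; your code dispatches it:
tools = {"get_weather": get_weather}
call = {"name": "get_weather", "arguments": {"city": "Paris"}}
result = tools[call["name"]](**call["arguments"])
```

The model never executes anything itself; it only names a function and fills in arguments, and your code (or a framework like LangChain) does the actual call.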
So finally the fun part, you're not here to listen to me ramble on about agents. 00:15:40.940 |
So in today's workshop, we'll be building an AI research agent. 00:15:46.560 |
And the agent's primary objective is to provide research assistance by supplying a list of papers 00:15:52.520 |
to read, summarizing research papers and answering questions about research topics. 00:15:59.760 |
And this is kind of how the workflow of our agent is going to look like. 00:16:03.420 |
We will use a free and open-source model from Fireworks called FireFunction V1. 00:16:07.900 |
They just released a V2, but I had already prepared my workshop by then. 00:16:12.840 |
So today we'll use V1 as the brain of our agent. 00:16:17.200 |
We will also try out some of the reasoning design patterns that we were just talking about, like chain of thought and ReAct. 00:16:22.660 |
We will also give the agent access to three tools. 00:16:25.960 |
One for getting paper summaries, one for getting a list of papers to read, and the third one for 00:16:34.120 |
answering questions using a MongoDB knowledge base. 00:16:38.620 |
And finally, we will also explore adding short-term memory to the agent and persisting it to a database 00:16:46.760 |
But yeah, very soon we are going to break for our first hands-on portion, but just some things to note first. 00:16:56.760 |
Each time we break for a hands-on section, you'll navigate to the hands-on lab at the QR code 00:17:02.180 |
that you have at your tables or that you just scanned, and you'll work through one or more sections. 00:17:08.340 |
And you'll see these emojis sprinkled all over the place. 00:17:10.720 |
So this, like, open-hands emoji and the superhero emoji indicate hands-on sections, except I 00:17:17.340 |
would highly advise doing the open-hands ones first, and only if you have time, moving on to the superhero ones. 00:17:24.720 |
You'll also be filling code into a Jupyter notebook, and the places where you need to fill in code 00:17:30.400 |
are indicated by these code underscore block placeholders. 00:17:34.100 |
So those are the ones you need to fill in with your code. 00:17:38.040 |
And before any cell in the notebook that requires you to fill in code, you'll also see this books 00:17:42.280 |
emoji indicating documentation that you need to reference for that particular piece of code. 00:17:48.520 |
And finally, you'll find solutions to all the hands-on pieces at the QR code link, but I highly 00:17:56.820 |
encourage you to try working through stuff on your own before you look at the solutions. 00:18:01.400 |
And even if you do, then try to understand what's really going on. 00:18:06.360 |
With that, let's go ahead and break for our first hands-on section, which is just setting 00:18:12.260 |
up the development environment and prerequisites for the workshop. 00:18:16.140 |
So yeah, let's take about 15 to 20 minutes to work through this section. 00:18:20.820 |
So if you go to that link, you want to start at the section titled MongoDB Atlas, and work 00:18:25.140 |
all the way through to the dev environment section. 00:18:45.940 |
I think I'm going to move on just in the interest of time, but it's a self-paced lab and you'll 00:19:01.620 |
have access to all the material after the fact, so feel free to move at your own pace. 00:19:08.620 |
So let's move on to some libraries, tools, and general concepts that you'll come across during the hands-on labs. 00:19:15.620 |
So the first thing you'll run into is this library called datasets, which we are going 00:19:21.300 |
to use to download a dataset of archive papers from Hugging Face. 00:19:24.620 |
We're going to use the load_dataset function to load a dataset from the MongoDB educational AI Hugging Face org. 00:19:36.180 |
And then you'll run into something called ArxivLoader, which is a document loader class in LangChain. 00:19:42.620 |
We are going to be using this to load research papers from arxiv.org as LangChain Document 00:19:48.120 |
objects, and an example of what a Document in LangChain looks like is shown here. 00:19:53.840 |
So it essentially has the raw text under the page_content attribute, and some automatically extracted 00:20:01.040 |
metadata, in this case the published date, title, authors, and summary, under the metadata attribute. 00:20:07.400 |
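As a rough stand-in for that shape (sketched as a plain dataclass so it runs standalone; the real class is LangChain's Document, and the metadata field names below are illustrative):

```python
# What an ArxivLoader result roughly looks like: text under page_content,
# extracted fields under metadata.

from dataclasses import dataclass, field

@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)

doc = Document(
    page_content="Full text of the paper ...",
    metadata={
        "Published": "2023-05-01",     # field names here are illustrative
        "Title": "Some arXiv Paper",
        "Authors": "A. Author, B. Author",
        "Summary": "Abstract of the paper ...",
    },
)
```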
So we're going to be using ArxivLoader in two of our agent tools. 00:20:12.180 |
One tool is already done for you, and that's the tool to get relevant papers from arXiv, 00:20:17.000 |
and you'll also use the same document loader for the summary tool as well. 00:20:22.540 |
So the simplest way to create tools in LangChain is using the tool decorator, which makes creating a tool as simple as decorating a function. 00:20:28.780 |
So for this tool, we have used the load method of ArxivLoader to load data into Document 00:20:34.100 |
objects; the query argument takes a topic or paper ID, and load_max_docs indicates 00:20:39.900 |
how many documents to download from arXiv. And finally, we are only extracting the metadata, 00:20:45.120 |
because we want to only provide a list of papers and not the full paper content. 00:20:50.600 |
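The metadata-only behavior of that tool can be sketched as below. The loader here is a stub so the example runs offline; the real tool would call LangChain's ArxivLoader (wrapped with the @tool decorator) instead:

```python
# Sketch of the "get papers" tool: load documents for a query, but return
# only their metadata, not the full paper text.

def fake_arxiv_load(query, load_max_docs):
    # Stand-in for ArxivLoader(query=query, load_max_docs=load_max_docs).load()
    return [
        {"page_content": f"text {i}",
         "metadata": {"Title": f"Paper {i} on {query}"}}
        for i in range(load_max_docs)
    ]

def get_paper_list(query, load_max_docs=3):
    """Return just the metadata of matching papers, not their content."""
    docs = fake_arxiv_load(query, load_max_docs)
    return [d["metadata"] for d in docs]

papers = get_paper_list("AI agents", load_max_docs=2)
```

Dropping page_content keeps the tool's output small, which matters because everything a tool returns ends up back in the LLM's context window.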
We will also be using PyMongo, which is the Python driver for MongoDB. 00:20:55.580 |
We will use it to connect to MongoDB databases and collections, and also delete and insert 00:21:00.460 |
documents from and to MongoDB to build the knowledge base for our agent. 00:21:06.240 |
We will also be using a few LangChain integrations, which are essentially stand-alone packages for 00:21:12.720 |
third-party providers such as MongoDB in LangChain, to make things like versioning and dependency management easier. 00:21:21.180 |
So we will use the LangChain MongoDB integration to use MongoDB Atlas as a vector store and also 00:21:27.780 |
to store and retrieve chat history for the agent. 00:21:31.140 |
We will also use LangChain Hugging Face to access open-source embedding models from Hugging Face, 00:21:37.620 |
and finally, we will use LangChain Fireworks to access chat completion models from Fireworks AI. 00:21:45.560 |
And you will be using the LangChain Expression Language, or LCEL, to create RAG and agent workflows 00:21:51.480 |
using LangChain; it is essentially a declarative way to chain together prompts, data processing 00:21:57.040 |
steps, LLMs, and tools in LangChain. 00:22:02.380 |
And each unit in the chain is called a runnable, and the way to chain them together is using 00:22:07.880 |
the pipe operator, which takes the output from the left of the pipe and passes it as input to the runnable on the right. 00:22:14.840 |
And here's a simple example of just passing a prompt to an LLM, generating an answer, and formatting the output. 00:22:21.600 |
And finally, if you want to call the chain, then you use the invoke method on it, and you'll be 00:22:25.720 |
using this to test out some of the things that you're building during the workshop. 00:22:31.320 |
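To build intuition for the pipe idea, here is a toy re-implementation: each unit is a "runnable" with an invoke method, and | feeds the left side's output into the right side. This is a simplified illustration, not LangChain's actual classes:

```python
# A toy LCEL-style chain: prompt | llm | parser, where | wires the units
# together and invoke runs the whole chain end to end.

class Runnable:
    def __init__(self, fn):
        self.fn = fn

    def invoke(self, x):
        return self.fn(x)

    def __or__(self, other):  # the pipe operator
        return Runnable(lambda x: other.invoke(self.invoke(x)))

prompt = Runnable(lambda topic: f"Tell me a fact about {topic}.")
llm = Runnable(lambda p: f"LLM answer to: {p}")   # stand-in for a model
parser = Runnable(lambda out: out.upper())        # output formatting step

chain = prompt | llm | parser
result = chain.invoke("agents")
```

Wrapping an arbitrary function into a runnable, as the lambdas above do, is also essentially what LangChain's RunnableLambda does for you.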
And finally, you have this thing called RunnableLambda, and this is a runnable that converts 00:22:36.700 |
any arbitrary Python function into a LangChain runnable, and it's as simple as defining the 00:22:41.900 |
function and then wrapping the function into a RunnableLambda. 00:22:46.160 |
So, yeah, let's take another 20 minutes to now create the tools for your research agent. 00:22:52.720 |
So, yeah, just work through the create agent tools section of the lab that you were just in. 00:23:01.460 |
So, hopefully we are kind of at least midway through creating tools for our agent, but in 00:23:08.780 |
the next section, we are going to be creating the agent itself and experiment with the different 00:23:14.820 |
reasoning design patterns that we were talking about, like chain of thought and react. 00:23:19.720 |
So, to create the agent, we are going to start with the simplest way of creating a tool-calling 00:23:24.780 |
agent in LangChain, which is using the create_tool_calling_agent constructor. 00:23:30.800 |
And you're going to be starting with that abstraction, but let's try to understand what's happening 00:23:35.620 |
behind the scenes of that abstraction, right? 00:23:37.460 |
So, it's essentially creating a runnable sequence consisting of a prompt template which has a 00:23:43.200 |
placeholder for the agent's scratch pad, which is the agent's intermediate steps as it's taking 00:23:48.120 |
different actions and making observations, an LLM with knowledge of the tools that we were 00:23:53.840 |
just creating, and an output parser for formatting the agent's response. 00:24:00.580 |
And then we'll also be exploring a react agent that uses react prompting to guide the agent 00:24:05.260 |
to take a series of reasoning and action-taking steps to arrive at the final answer. 00:24:11.540 |
And for this, we'll use the create_react_agent constructor, which follows a similar series 00:24:17.300 |
of steps as the tool calling agent, except it uses a react prompt template, and the LLM has 00:24:22.960 |
knowledge of when to stop the reasoning and action-taking sequence, using a stop sequence. 00:24:30.000 |
And then the output parser has logic to parse these ReAct-style LLM calls, and you can see an example here. 00:24:36.480 |
So it has a thought, an action, an action input, and an observation, and the parser just makes sense of all of that. 00:24:47.760 |
And finally, you'll come across the agent executor, which is the runtime for the agent. 00:24:52.800 |
This is what actually calls the agent, executes the action that the agent is choosing, passes 00:24:57.800 |
the action outputs back to the agent, and repeats any steps as the agent decides what to do next. 00:25:04.220 |
And that's the pseudocode for what the agent executor is essentially doing. 00:25:08.980 |
So as long as the agent thinks that it hasn't finished its task, which is the while loop there, 00:25:13.760 |
the agent determines and runs a series of actions until it finally finishes the task. 00:25:19.420 |
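That while-loop pseudocode can be sketched as runnable code. The "agent" below is scripted (it calls one toy tool twice, then finishes); a real executor would invoke the LLM-backed agent runnable instead:

```python
# A toy agent executor: keep asking the agent what to do, run the chosen
# tool, feed the observation back, and stop when the agent says it's done.

def tool_add_one(x):
    return x + 1

TOOLS = {"add_one": tool_add_one}

def scripted_agent(steps):
    # Stand-in for the LLM: call add_one twice, then finish.
    if len(steps) < 2:
        return {"action": "add_one", "input": steps[-1][1] if steps else 0}
    return {"finish": f"final value: {steps[-1][1]}"}

def agent_executor():
    steps = []                              # (action, observation) pairs
    while True:                             # "agent hasn't finished" loop
        decision = scripted_agent(steps)
        if "finish" in decision:            # agent thinks it's done
            return decision["finish"]
        observation = TOOLS[decision["action"]](decision["input"])
        steps.append((decision["action"], observation))

result = agent_executor()
```

The steps list plays the role of the agent scratchpad: it is the record of intermediate actions and observations that gets fed back to the agent on every iteration.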
So yeah, let's take another 20 minutes to complete the create agent section and any other sections you haven't finished yet. 00:25:30.420 |
We have one last thing to do with our research agent, which is to give it memory or add short-term memory to it. 00:25:42.420 |
And in this case, we are going to do that by giving it access to its chat message history. 00:25:49.420 |
So in LangChain, the way to do this is by wrapping the agent runnable that you created using the create_tool_calling_agent 00:25:55.420 |
or create_react_agent constructor, wrapping that runnable inside another runnable called RunnableWithMessageHistory, 00:26:01.420 |
which is specifically designed to manage the memory of other runnables. 00:26:07.420 |
So essentially, this runnable can take a function that persists the chat message history for your agent. 00:26:17.420 |
And by default, it organizes the chat history using a session ID that you pass in along with each invocation. 00:26:27.420 |
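The session-scoping idea can be sketched with an in-memory store: messages are kept per session ID, so each conversation has its own memory. LangChain's RunnableWithMessageHistory does this bookkeeping around a real runnable (and can persist to a database such as MongoDB); this is just the core mechanism:

```python
# Session-scoped chat history: each session ID gets its own message list.

histories = {}                              # session_id -> list of messages

def get_history(session_id):
    return histories.setdefault(session_id, [])

def chat(session_id, user_msg):
    history = get_history(session_id)
    history.append(f"user: {user_msg}")
    reply = f"echo: {user_msg}"             # stand-in for the agent's answer
    history.append(f"ai: {reply}")
    return reply

chat("session-a", "hello")
chat("session-a", "again")
chat("session-b", "hi")
```

Because "session-a" and "session-b" map to separate lists, one user's conversation never leaks into another's context.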
Let's play around with that for the remainder of the time. 00:26:31.420 |
And if you have any more questions or are stuck at something, we can talk through that, too. 00:26:39.420 |
One last thing I would request once you're done with all your stuff: if you want 00:26:45.420 |
to connect, feel free, but that's not a mandatory thing. 00:26:50.420 |
But, yeah, I'd really appreciate it if you could fill out a short survey that's at the QR code link. 00:26:57.420 |
This is the first time I'm doing this workshop. 00:26:59.420 |
So, any feedback you have will only help me make this better in the future. 00:27:07.420 |
Other than that, this is it from me for today.