Hello, everyone, and welcome to this workshop, which I like to call the A to Z of Building AI Agents. During the workshop today, we'll spend about 20 to 30 minutes on the basic concepts: what AI agents are, when to use them, the different components of agents, and concepts you'll find helpful during the hands-on portions of the workshop.
Then you'll spend the rest of the time building an AI agent of your own, with help from me and my awesome team back there: Tom, Ben, and Fabian. So if you run into issues, call on one of us and we'll figure it out.
Here's a little bit about me. I'm Apoorva, and I'll be your lead instructor today. Five months ago, I stepped into my first-ever developer advocacy role at MongoDB, and prior to that, I spent about six years as a data scientist in the cybersecurity space, applying machine learning to problems like phishing, malware, and ransomware detection.
Outside of work, I read a lot, try to do yoga kind of regularly, and I'm always on a mission to hit as many local coffee shops as I can. A few ground rules before we begin. No stupid questions here today. We are all here to learn, so ask as many questions as you'd like.
We'll go over key concepts before getting into the hands-on labs, so during these exercises, we definitely encourage you to form groups and work together where you can. Here's a link to the slides and also the hands-on lab that you'll be working through today, and I'll leave this here for a few minutes for you all to scan.
The link and QR code should also be on the postcards that were just handed out, and if you didn't receive one, raise your hand and we'll get you one. Anyone need a postcard? Okay, I see some hands there. Tom? Right here. All right, moving on. So the goal of the workshop is to introduce you to the basic concepts of AI agents and to get hands-on experience building an agent end to end.
I'm going to start off by talking about what agents are, AI agent use cases, and the components of an agent, and then we'll build an AI research agent together. Depending on how long that takes, we may or may not have time for Q&A, but I'll be around to answer questions afterwards.
So let's start by talking about what AI agents are. An AI agent is a system that uses a large language model, or LLM, to reason through a problem, create a plan to solve it, and execute the plan with the help of a set of tools. Let's see how agents differ from other techniques for interacting with LLMs, because that will help us build an intuition for when to use agents.
Let's take the example of simple prompting, where you simply prompt an LLM to generate an answer based on its pre-trained parametric knowledge. As you can imagine, this is good for point-in-time, general-knowledge questions, but probably not much more, because even if you manage to prompt the LLM to perform really complex tasks, it might not have the means or the information to execute them.
The LLM in this situation also can't self-revise and refine responses based on previous or new information, and it definitely doesn't have a means to learn preferences and provide personalized responses over time, which is sometimes a requirement. Moving on to retrieval-augmented generation, aka RAG: with RAG, you can broaden the scope of the LLM by augmenting its knowledge with information retrieved from a knowledge base.
That way you can be somewhat confident that the LLM at least has the information required to perform the tasks you want it to perform, but it doesn't quite solve for some of these other requirements, such as handling complex tasks, self-refinement, or personalization. Coming to agents: with agents, you give the LLM access to external tools and past interactions, which act as the memory of the agent, and then you prompt it to go through multiple iterations of reasoning and action-taking to arrive at a final answer.
Tools are how agents execute complex, multi-step tasks, and LLMs can also be prompted to incorporate the feedback or output from tools into the reasoning process, to repeat steps if necessary or call additional tools as follow-up tasks. As for past interactions, these can be persisted and updated, which means the LLM agent can learn from them to provide personalized responses over time.
As you can imagine, tools, memory, and iterative prompting can solve a lot of problems, but there are obviously some known challenges at the moment, such as long-term planning, where the agent is expected to execute complex tasks based on a lot of information, or on information it has learned over a longish period of time.
There's also high cost and latency associated with agents, because they typically trade these for a shot at higher accuracy. But despite all of these challenges, I think we can agree that agents are how we get the most out of large language models as of today. So let's take some example tasks or questions and try to answer whether each one really requires an AI agent.
Take this one, for example: who was the first President of the United States? Does it require an AI agent to complete this task? I see some people nodding yes, mostly no. I would say no, because the information required to answer this question is very likely present in the parametric knowledge of most LLMs we know today.
So I don't think it requires an AI agent. How about this one? What's the travel reimbursement policy for my company, MongoDB, or your company? Do you think this task requires an AI agent? What's that? Yes? Two steps? First you need to disambiguate which company. That's a good point.
All right. So I would say it's a pretty straightforward task, provided the LLM has access to the right information. To me it sounds like a better fit for retrieval-augmented generation, where the LLM has access to the right knowledge base, than for something complex like an AI agent.
How about this one? How has the trend in average daily calorie intake among adults changed over the last decade (I know, it's already too long), and what impact might it have on obesity rates? Additionally, can you provide a graphical representation of the trend? Do we think this requires an AI agent?
I would think so. This task involves multiple subtasks, at least data aggregation, visualization, and reasoning through the results obtained from those subtasks, so it sounds like a good fit for agents. How about this one: a personalized learning assistant that can adjust its language, examples, and methods based on the student's responses?
I see some nods, and I agree. I think this is another example of a complex task, one that also requires long-term personalization, so again, a good use case for agents. So the TL;DR: use agents for complex, multi-step tasks that require integrating multiple capabilities, such as question answering, task execution, and analysis, using all of these to arrive at a final answer or outcome, and also when there's a need for personalization or adaptive responses. As we saw, memory, tools, and the ability to reason are what make AI agents so powerful. So let's dig a little deeper into each of these components, starting with planning and reasoning.
The simplest way to imbue planning and reasoning capabilities into agents is, believe it or not, via user prompts. You can start super simple by prompting the agent to create a plan of action based on its initial understanding of the problem. This is what we call planning without feedback, since the agent does not modify its execution plan based on any new information it gathers from the tools it executes.
It just creates an execution plan at the beginning and runs with it. Common design patterns for this kind of planning are chain of thought and tree of thoughts. Then there's planning with feedback, where you prompt the agent to adjust and refine its responses based on tool outcomes, or even ask it to critique and reflect on its own responses.
Common design patterns here are ReAct and reflection, and we'll experiment with some of these in today's workshop. So let's first understand chain of thought. Chain of thought is as simple as prompting an LLM to think through a problem step by step instead of directly providing an answer.
You can do this either in a zero-shot manner, by literally saying "hey, let's think step by step," or in a few-shot manner, where you show it how to work through a complex problem using one or more examples.
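To make that concrete, here's a minimal sketch of zero-shot chain of thought. The model ID assumes the Fireworks FireFunction V1 model we'll use later in the workshop; any chat model would work here.

```python
from langchain_fireworks import ChatFireworks

# Assumes the Fireworks model used later in the workshop.
llm = ChatFireworks(model="accounts/fireworks/models/firefunction-v1")

question = "If a train covers 60 miles in 1.5 hours, what is its average speed?"

# Zero-shot CoT: just append the magic phrase to the prompt.
response = llm.invoke(f"{question}\n\nLet's think step by step.")
print(response.content)
```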
Then we have tree of thoughts, which takes the idea of chain of thought up a notch. Tree of thoughts allows the LLM to perform deliberate decision-making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action. It combines the LLM's ability to generate and evaluate thoughts with search algorithms, since it can also look ahead and backtrack when necessary to make more global choices.
Then we have patterns for reasoning with feedback, starting with ReAct. What we do here is prompt LLMs to generate verbal reasoning traces and state the actions they will take to solve a particular problem. After each action, we ask the LLM to make an observation based on the information or feedback obtained from the previous action, and to plan what action to take next.
This process continues until the LLM determines, or you intervene to say, that it has reached the final answer, and it exits the loop. In this example, the first thing the LLM does is generate a thought: okay, this is how I need to solve this problem.
Then comes an action step, where in this case it has determined that it needs to call the search tool with arguments it has chosen. Then it makes an observation: okay, I don't think I have the answer yet, so here's what I'll do next. And it keeps going like that until it reaches the final answer.
Another technique for incorporating feedback into the planning process is reflection. This involves prompting LLMs to reflect on and critique past actions to decide what action to take next. You can prompt the same LLM to both generate and critique, use different LLMs, or even use multiple agents, where one agent generates responses and the other critiques them.
But whatever the architecture, the goal is to run the generation-reflection loop several times before the LLM arrives at a final answer, essentially trading compute for a better shot at accuracy. The next component is memory. This component allows AI agents to store and recall past conversations, enabling them to learn from those interactions.
As you can imagine, memory is a pretty complex and nebulous concept, and you could break it down into several categories. But broadly, I think of two main types, much like humans have: short-term and long-term memory. Short-term memory, in the case of agents, deals with storing and retrieving information from a single conversation.
Long-term memory deals with storing, updating, and retrieving information from multiple conversations over a period of time, and this is what really helps agents personalize their responses in the long run. Short-term memory is relatively easy to implement. How hard can it be to store a single conversation, right?
In most cases, not that hard, unless the conversation gets too long, in which case you need to start considering how to condense that message list so you aren't overwhelming the LLM with too much information. Some solutions are retrieving only the N most recent messages, or summarizing the conversation at the cost of some information loss.
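As a minimal sketch of the first approach, keeping only the N most recent messages might look like this. The helper function here is just illustrative, not part of the lab.

```python
from langchain_core.messages import AIMessage, BaseMessage, HumanMessage

def trim_history(messages: list[BaseMessage], n: int = 6) -> list[BaseMessage]:
    """Keep only the n most recent messages so the LLM isn't overwhelmed."""
    return messages[-n:]

history = [
    HumanMessage(content="What is an AI agent?"),
    AIMessage(content="A system that uses an LLM to reason, plan, and act..."),
    # ...imagine many more turns here...
]
recent = trim_history(history)
```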
Long-term memory, on the other hand, is a largely unexplored area so far, since it's non-trivial to decide what state to track, how to track it, and when to update it. But some patterns are emerging, in the sense that the best way to implement long-term memory is to design application-specific agents.
That way you can narrow down the number of states you want to track and just focus on those, and on how to update them. And finally, we have tools. Tools are interfaces for agents to interact with the external world in order to achieve their objectives, and these can range from simple APIs, such as search or weather APIs, to complex things like vector stores, or even specialized machine learning or deep learning models.
Tools for LLMs are typically defined as functions, and most recent LLMs have been trained to identify when a function should be called; they respond with a function signature that you can then use to call the function in your code. Frameworks like LangChain handle the function calling for you, but the basic concept remains the same.
To help the LLM identify which function to use, you typically give the tool a descriptive name, provide a detailed description of what exactly the function does, and specify the types of its arguments. So finally, the fun part: you're not here to listen to me ramble on about agents.
In today's workshop, we'll be building an AI research agent. The agent's primary objective is to provide research assistance by supplying a list of papers to read, summarizing research papers, and answering questions about research topics. And this is roughly how the workflow of our agent is going to look.
We will use a free and open-source model from Fireworks called FireFunction V1. They just released a V2, but I had already prepared this workshop by then, so today we'll use V1 as the brain of our agent. We'll also try out some of the reasoning design patterns we were just talking about, like chain of thought and ReAct.
We will also give the agent access to three tools: one for getting paper summaries, one for getting a list of papers to read, and a third for answering questions using a MongoDB knowledge base. And finally, we'll explore adding short-term memory to the agent and persisting it to a database in MongoDB.
But yeah, very soon we're going to break for our first hands-on portion, so just some things to keep in mind. Each time we break for a hands-on section, you'll navigate to the hands-on lab via the QR code on your tables, and you'll work through one or more sections at a time.
And you'll see these emojis sprinkled all over the place. The open-hands emoji and the superhero emoji both indicate hands-on sections, but I'd highly advise doing the open-hands ones first and only going to the superhero sections if you have time. You'll also be filling code into a Jupyter notebook, and the places where you need to fill in code are indicated by CODE_BLOCK placeholders.
Those are the ones you need to fill in with your code. Before any cell in the notebook that requires you to fill in code, you'll also see a books emoji indicating documentation you need to reference for that particular piece of code. And finally, you'll find solutions to all the hands-on pieces at the QR code link, but I highly encourage you to try working through things on your own before you look at the solutions.
And even if you do, then try to understand what's really going on. With that, let's go ahead and break for our first hands-on section, which is just setting up the development environment and prerequisites for the workshop. So yeah, let's take about 15 to 20 minutes to work through this section.
So if you go to that link, you want to start at the section titled MongoDB Atlas and work all the way through to the dev environment section. Let's go. How are we feeling? Mostly done? Not done? Okay, five more minutes? Yeah, let's do five more minutes.
All right. I think I'm going to move on just in the interest of time, but it's a self-paced lab and you'll have access to all the material after the fact, so feel free to move at your own pace. Cool. So let's move on to some libraries, tools, and general concepts that you'll come across in the next hands-on portion.
The first thing you'll run into is a library called datasets, which we're going to use to download a dataset of arXiv papers from Hugging Face, using the load_dataset function with a dataset from the MongoDB educational AI Hugging Face org.
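Here's a rough sketch of what that looks like. The dataset path below is a hypothetical placeholder, so substitute the dataset named in the lab instructions.

```python
from datasets import load_dataset

# Placeholder path -- substitute the arXiv dataset named in the lab.
data = load_dataset("mongodb-eai/arxiv-papers", split="train")
print(data[0])  # one paper record
```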
Then you'll run into ArxivLoader, which is a document loader class in LangChain. We're going to use it to load research papers from arxiv.org as LangChain Document objects. A Document essentially has the raw text of the paper under the page_content attribute, and some automatically extracted metadata, in this case the published date, title, authors, and summary, under the metadata attribute.
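A quick sketch of the loader in action, with an example query:

```python
from langchain_community.document_loaders import ArxivLoader

# Load up to two papers matching a topic (or a specific arXiv paper ID).
loader = ArxivLoader(query="retrieval augmented generation", load_max_docs=2)
docs = loader.load()

print(docs[0].metadata)            # Published, Title, Authors, Summary
print(docs[0].page_content[:300])  # the raw text of the paper
```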
We're going to be using ArxivLoader in two of our agent tools. One tool is already done for you, and that's the tool to get relevant papers from arXiv; you'll also use the same document loader for the summary tool. The simplest way to create tools in LangChain is using the tool decorator, which makes tools out of functions.
For this tool, we've used the load method of ArxivLoader to load data into Document objects. The query argument takes a topic or paper ID, and load_max_docs indicates how many documents to download from arXiv. Finally, we extract only the metadata, because we want to provide just a list of papers and not the full paper content.
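A sketch of what such a tool might look like; the function name here is illustrative, and the lab's version may differ slightly:

```python
from langchain_community.document_loaders import ArxivLoader
from langchain_core.tools import tool

@tool
def fetch_arxiv_papers(query: str) -> list:
    """Gets a list of relevant papers from arXiv for a research topic
    or paper ID. Returns metadata only, not the full paper content."""
    docs = ArxivLoader(query=query, load_max_docs=10).load()
    return [doc.metadata for doc in docs]
```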
We will also be using PyMongo, which is the Python driver for MongoDB. We'll use it to connect to MongoDB databases and collections, and to delete and insert documents in order to build the knowledge base for our agent.
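A minimal sketch of that flow; the connection string, database, and collection names are placeholders, not the lab's exact values:

```python
from pymongo import MongoClient

# Placeholder connection details.
client = MongoClient("<your-atlas-connection-string>")
collection = client["agent_db"]["knowledge_base"]

records = [{"text": "Example chunk", "embedding": [0.1, 0.2, 0.3]}]
collection.delete_many({})       # start with a clean collection
collection.insert_many(records)  # ingest documents for the knowledge base
```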
We'll also be using a few LangChain integrations, which are essentially standalone packages for third-party providers, such as MongoDB, that make things like versioning, dependency management, and testing easier. We'll use the langchain-mongodb integration to use MongoDB Atlas as a vector store and also to store and retrieve chat history for the agent, langchain-huggingface to access open-source embedding models from Hugging Face, and langchain-fireworks to access chat completion models from Fireworks AI.
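A rough sketch of setting these up; the embedding model, index name, and connection string are illustrative placeholders rather than the lab's exact picks:

```python
from langchain_fireworks import ChatFireworks
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_mongodb import MongoDBAtlasVectorSearch

# Embedding model and index name are illustrative.
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

vector_store = MongoDBAtlasVectorSearch.from_connection_string(
    "<your-atlas-connection-string>",
    namespace="agent_db.knowledge_base",  # "<database>.<collection>"
    embedding=embeddings,
    index_name="vector_index",
)

llm = ChatFireworks(model="accounts/fireworks/models/firefunction-v1")
```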
You'll also be using the LangChain Expression Language, or LCEL, to create RAG and agent workflows in LangChain. It's essentially a declarative way to chain together prompts, data processing steps, LLMs, and tools. Each unit in the chain is called a runnable, and the way to chain them together is the pipe operator, which takes the output from the left of the pipe and passes it as input to the right of the pipe.
Here's a simple example of passing a prompt to an LLM, generating an answer, and formatting its output. And if you want to call the chain, you use the invoke method on it; you'll be using this to test out some of the things you build during the workshop.
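In code, that simple chain might look like this; the prompt text is just an example:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_fireworks import ChatFireworks

prompt = ChatPromptTemplate.from_template("Answer concisely: {question}")
llm = ChatFireworks(model="accounts/fireworks/models/firefunction-v1")

# Each pipe passes the output on its left as input to the runnable on its right.
chain = prompt | llm | StrOutputParser()

print(chain.invoke({"question": "What is an AI agent?"}))
```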
And finally, there's RunnableLambda, a runnable that converts any arbitrary Python function into a LangChain runnable. It's as simple as defining the function and then wrapping it in a RunnableLambda.
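For example, with a toy function of our own:

```python
from langchain_core.runnables import RunnableLambda

def shout(text: str) -> str:
    return text.upper()

# Wrap an arbitrary Python function so it can be piped into a chain.
runnable = RunnableLambda(shout)
print(runnable.invoke("hello agents"))  # HELLO AGENTS
```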
So, yeah, let's take another 20 minutes to create the tools for your research agent; just work through the Create Agent Tools section of the lab you were working through. Okay. So hopefully we're at least midway through creating the tools for our agent. In the next section, we're going to create the agent itself and experiment with the different reasoning design patterns we were talking about, like chain of thought and ReAct.
To create the agent, we're going to start with the simplest way of creating a tool-calling agent in LangChain, which is the create_tool_calling_agent constructor. You'll start with that abstraction, but let's try to understand what's happening behind the scenes.
It's essentially creating a runnable sequence consisting of a prompt template with a placeholder for the agent scratchpad, which holds the agent's intermediate steps as it takes actions and makes observations; an LLM with knowledge of the tools we just created; and an output parser for formatting the agent's response.
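Here's roughly what using the constructor looks like, reusing the llm and tool from the earlier sketches; the system message is just illustrative:

```python
from langchain.agents import create_tool_calling_agent
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

# Assumes `llm` and `fetch_arxiv_papers` from the earlier sketches.
tools = [fetch_arxiv_papers]

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful AI research assistant."),
    ("human", "{input}"),
    MessagesPlaceholder("agent_scratchpad"),  # the agent's intermediate steps
])

agent = create_tool_calling_agent(llm, tools, prompt)
```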
Then we'll also explore a ReAct agent, which uses ReAct prompting to guide the agent through a series of reasoning and action-taking steps to arrive at the final answer. For this, we'll use the create_react_agent constructor, which follows a similar series of steps as the tool-calling agent, except that it uses a ReAct prompt template and the LLM knows when to stop the reasoning and action-taking sequence, using a stop sequence.
And the output parser has logic to parse these ReAct-style LLM outputs, which you can see right there: a thought, an action, an action input, and an observation. It just parses all of that to make it more readable to the user in the end.
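A sketch of creating the ReAct agent; this pulls a standard ReAct prompt from the LangChain Hub, which may differ from the template the lab provides:

```python
from langchain import hub
from langchain.agents import create_react_agent

# A standard ReAct prompt template from the LangChain Hub.
react_prompt = hub.pull("hwchase17/react")

# Same llm and tools as before; the constructor binds the stop sequence
# and wires in a ReAct-style output parser behind the scenes.
react_agent = create_react_agent(llm, tools, react_prompt)
```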
Finally, you'll come across the AgentExecutor, which is the runtime for the agent. This is what actually calls the agent, executes the actions the agent chooses, passes the action outputs back to the agent, and repeats as the agent decides what to do next.
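A rough sketch of the executor in action; the input query is just an example, and the commented loop is a paraphrase of what the executor is essentially doing:

```python
from langchain.agents import AgentExecutor

agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)
result = agent_executor.invoke({"input": "Get me some papers on AI agents."})

# Conceptually, the executor runs a loop along these lines:
#   next_action = agent.get_action(...)
#   while next_action != AgentFinish:
#       observation = run(next_action)
#       next_action = agent.get_action(..., next_action, observation)
#   return next_action
```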
So as long as the agent thinks it hasn't finished its task, which is the while loop there, it determines and runs a series of actions until it finally finishes. So yeah, let's take another 20 minutes to complete the Create Agent section, plus anything else you were working on previously.
All right. We have one last thing to do with our research agent, which is to give it memory, or more specifically, add short-term memory. In this case, we're going to do that by giving it access to its chat message history. In LangChain, the way to do this is to wrap the agent runnable you created with create_tool_calling_agent or create_react_agent inside another runnable called RunnableWithMessageHistory, which is specifically designed to manage the memory of other runnables.
Essentially, this runnable takes a function that persists your agent's chat message history to a database, MongoDB in our case, and by default it organizes the chat history by a session ID that you pass in along with your input query or prompt.
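Here's a rough sketch, with placeholder connection details; note that for the history to be injected, the agent's prompt needs a chat_history placeholder:

```python
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_mongodb.chat_message_histories import MongoDBChatMessageHistory

def get_session_history(session_id: str) -> MongoDBChatMessageHistory:
    # Connection string, database, and collection names are placeholders.
    return MongoDBChatMessageHistory(
        connection_string="<your-atlas-connection-string>",
        session_id=session_id,
        database_name="agent_db",
        collection_name="chat_history",
    )

# Wrap the AgentExecutor from before.
agent_with_memory = RunnableWithMessageHistory(
    agent_executor,
    get_session_history,
    input_messages_key="input",
    history_messages_key="chat_history",
)

agent_with_memory.invoke(
    {"input": "What was the first paper you recommended?"},
    config={"configurable": {"session_id": "my-session"}},
)
```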
Let's play around with that for the remainder of the time, and if you have any more questions or are stuck on something, we can talk through that too. One last thing I would request once you're done with all your stuff: yeah, if you want to connect, feel free, but that's not the mandatory thing.
Nothing is mandatory. But I'd really appreciate it if you could fill out the short survey at the QR code link you scanned in the beginning. This is the first time I'm doing this workshop, so any feedback you have will only help me make it better in the future.
So, yeah, that would be much appreciated. Other than that, this is it from me for today. Thanks for being here. Thank you. We'll see you next time.