
Building Multimodal AI Agents From Scratch — Apoorva Joshi, MongoDB


Transcript

Hi, everyone. Thanks for taking the time to be here today. Welcome to this workshop where you will learn about agents, multimodality, and hopefully get to build -- not hopefully, you will build a multimodal agent of your own from scratch. Whose first workshop of the day? How many? Show of hands.

Okay. First workshop. How many people have already been in a workshop? Oh, wow. Like, really proactive, overachieving crowd. Great to see that. All right. With that, here's a little bit about me. I'm Apoorva. I'll be your lead instructor for today. I'm also joined by my awesome team here. We have Richmond.

He's waving at you. There's Rafa back there. He's waving at you as well. There's Tebow. And we have Mikiko. But yeah, here's a bit about me. I'm Apoorva. I'm currently an AI-focused developer advocate at MongoDB, which means I spend a lot of my time building workshops like this one for AI developers like you to help them build AI applications of their own.

Prior to this role, I spent about six years working as a data scientist in the cybersecurity space. And outside of work, I like to read a lot. I try to do yoga pretty regularly, or used to until I busted my knee a while ago. And I'm always on a mission to visit as many local coffee shops as I can in whichever city I'm in.

So this is what the next hour and 20 minutes is going to look like. We'll be going over key concepts of AI agents, discuss what multimodality is, and finally, we are going to put the two together and build a multimodal AI agent from scratch using good old Python. And since this is a relatively short workshop session, this is what you can expect.

We are going to focus on getting concepts down, and we'll see what we have time for. Based on my practice sessions, I believe you get about 55 minutes to an hour to actually take your time to write code or just, like, get a really good understanding of how multimodal agents work in code.

So just wanted to set that expectation there. And with that, let's get started. So let's first answer two questions. What are AI agents and why do we need them? But even before I get into that, how many of you already have some experience with AI agents? Okay, there's some of you.

Hopefully, there'll be something new for you at this workshop. And for those of you who are new to the concept, there's a lot of stuff coming your way. So apologies for that. Or not. Okay, so here's my definition of an AI agent. I like to define an AI agent as a system that uses a large language model.

I'll be referring to this as an LLM. The system uses the LLM to reason through a problem, create a plan to solve it, and execute and iterate on that plan with the help of a set of tools. So in the past two years or so, we have seen three main paradigms for interacting with LLMs.

There's simple prompting, RAG, and agents. So let's briefly talk about each of these because this will help us build an intuition for when you want to use an AI agent versus something else. So with simple prompting, you're simply asking the LLM questions and expecting the LLM to rely on its pre-trained, or as we call it, parametric knowledge to answer these questions.

So this means the LLM cannot answer questions if the information required to answer them is not present in its pre-trained knowledge. It cannot really handle complex queries, and it cannot provide personalized responses or even refine its responses. With RAG, how many of you have built a RAG application? Okay, some of you.

So you know, with RAG, you take this up a notch, you augment the LLM's knowledge with information from external data sources. And as you can imagine, this solves some problems where you can now be reasonably certain that the LLM has the information required to answer user questions and also incorporate some basic light personalization if it's given access to the right sources of information.

But this still doesn't equip the LLM with the ability to handle complex multi-step tasks or to self-refine its responses. But that's okay, because not all tasks might require this capability. And finally, 2025 is the year of agents. So if you have complex multi-step tasks, or need deep personalization or any sort of adaptive learning in your applications, then you'd want to use AI agents.

So with AI agents, what we've done is given the LLM the agency to determine the sequence of steps required to complete a particular task. And they do this by taking actions with the help of tools that you provide them, and reasoning through the results of these tool executions and also their past interactions to inform what to do next.

And this is what makes agents extremely flexible and capable of handling a wide variety of complex tasks. But the, yeah? Sorry, can we share these presentations? Yes, it will be shared. The recordings will be up on YouTube as well. So, yeah. The one thing to note here, though, is that Agents come with a higher cost and latency.

After all, you're expecting LLMs to do all the heavy lifting of thinking through the problem, coming up with a plan of action, executing the actions, rectifying its responses. So my word of caution here is only use Agents if you need to. Don't complicate whatever it is you're trying to build.

But we are building an AI Agent today. So for today, let's just throw an agent at the problem, a simple problem. Okay. So to summarize, use Agents for complex tasks that don't have a structured workflow or where the series of steps required to solve the problem is hard to predict.

Or tasks that have a high latency tolerance, as I was just mentioning. Or for tasks where it's acceptable for your application or system to return non-deterministic outputs, which means the same result is not guaranteed for the same inputs. And this is true for any application that uses LLMs. But this effect is especially amplified in agentic workflows.

And finally, tasks that might benefit from any sort of personalization or adaptive behavior over a long period of time, all of these are fair game for AI agents. Now let's talk about the different components of AI agents, just to get a better understanding of how these systems work. So an agent typically has four main components.

There's perception, which is how agents gather information about their environment. Planning and reasoning, which helps the agent reason through a problem and come up with a plan to solve it. Then there's tools, which are external interfaces that help the agent act upon and solve a problem. And memory, which helps agents learn from past interactions.

And if you are passionate about memory, we have Richmond, who knows a lot about this topic. So definitely catch him after or during the presentation. So all of this sounds a bit like a human, doesn't it? But that's the whole goal of agents. The goal with LLM-based agents is to give these systems the autonomy to carry out complex tasks, much like we humans do.

So it's not a surprise that the components kind of resemble how we think through problems and go about the world. So let's dive a bit deeper into each of these components. Let's talk about perception first. So perception, as I said, is the mechanism by which agents gather information about their environment.

And this happens via some form of inputs, whether it's a user like you interacting with the agent or triggered by something else, like an email or a Slack message. And text inputs have been the most common form of interacting with LLMs and agents so far. But over the past few months, we've seen images, voice, video also being part of this perception mechanism for agents.

And in today's workshop, we'll be working with two of these, which is text and images. The next component we have is planning and reasoning. And shocker, the component that helps agents plan and reason is LLMs. So given a user query, it's the LLM's job to determine how to go about solving the problem.

But they can't do all of this on their own. They need some guidance. And the way to provide guidance at this point is to prompt the LLM. And you can start simple by prompting the LLM to create a plan of action based on its initial understanding of the problem.

And this is what we call planning without feedback, since the LLM doesn't really modify its initial plan of action based on information gathered from tool outcomes or its own reasoning traces. And a common design pattern for this kind of planning is chain of thought. And chain of thought is as simple as prompting the LLM to think through a problem step by step without directly jumping to giving the user an answer.

And you can do this in two ways. In a zero-shot manner, where you literally prompt the LLM, like tell it, let's think step by step. Or you can do this in a few-shot manner, where you're providing examples of how the LLM should go about thinking through a problem so that the next time you give it another problem, it'll use your examples to guide its reasoning process.
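
Just as an illustrative sketch, zero-shot chain of thought can be as simple as the prompt below. The client setup and model name here are placeholders, not the exact ones we'll use in the lab.

```python
# Minimal sketch of zero-shot chain-of-thought prompting.
# The client and model name are placeholders, not the ones used in the lab.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[
        # Zero-shot CoT: just tell the model to think step by step.
        {"role": "system", "content": "Think through the problem step by step before giving your final answer."},
        {"role": "user", "content": "I have 3 packs of 12 pens and give away 7 pens. How many are left?"},
    ],
)
print(response.choices[0].message.content)
```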

Then there's planning with feedback, where you can prompt the LLM to adjust and refine its initial plan based on new information, again obtained from tool outcomes or based on its own previous reasoning. And a common design pattern that you will implement today is ReAct, which is short for reasoning and acting.

And what we do in this pattern is you prompt the LLM to generate verbal reasoning traces and also tell you the actions that it will take to solve the task. And then after each action, we ask the LLM to make an observation based on the information that it gathered from that tool execution and think through that and plan what to do next.

And this continues until the LLM determines that it has the final answer, and that's when it will exit that execution loop and provide the answer to the user.
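
Here's a rough sketch of what that ReAct loop might look like in Python. The call_llm and execute_tool helpers are hypothetical stand-ins, not part of any particular library, and the prompt is just one way to phrase it.

```python
# Rough sketch of a ReAct-style loop. `call_llm` and `execute_tool` are
# hypothetical helpers standing in for the real LLM call and tool execution.
REACT_PROMPT = """Answer the question by iterating through this loop:
Thought: reason about what to do next
Action: the tool to call and its input, or "Final Answer: <answer>" when done
Observation: the tool result (provided back to you by the system)"""

def run_react(question: str, max_steps: int = 5) -> str:
    messages = [
        {"role": "system", "content": REACT_PROMPT},
        {"role": "user", "content": question},
    ]
    for _ in range(max_steps):
        reply = call_llm(messages)  # returns the LLM's next Thought/Action text
        messages.append({"role": "assistant", "content": reply})
        if "Final Answer:" in reply:
            # The LLM decided it has enough information; exit the loop.
            return reply.split("Final Answer:", 1)[1].strip()
        observation = execute_tool(reply)  # parse the Action and run the tool
        messages.append({"role": "user", "content": f"Observation: {observation}"})
    return "Reached the step limit without a final answer."
```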

Then the next thing we have is tools. And tools are essentially interfaces for agents to interact with their external world in order to achieve their objectives. These tools can range from simple APIs, such as the weather and search APIs I'm sure you've seen examples of, to vector stores, to even specialized machine learning models. And tools for LLMs are typically defined as functions. Most recent LLMs have been trained to identify when a function should be called and also the arguments for a function call.

But the one thing to note is the LLM doesn't actually execute the function. This is something we will have to implement in our code. And in addition to actually defining the function, you typically also need to provide the LLM a function or tool schema. This is basically just a JSON object, although with MCP servers you might have seen a different way of defining tools.

But essentially, you're providing the name of the tool to call, a description of what the tool does, and the parameters that the tool takes, along with their types and descriptions. So for example, I have a weather tool here, and in my tool schema, I'm specifying the name of the tool, its description, and the input that it takes: a city, of type string, with a description of that parameter.
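
To make that concrete, here's one way the weather tool's schema could be written, shown in an OpenAI-style function-calling format. The exact schema format you'll use in the lab may look different.

```python
# Example tool schema for the weather tool, in an OpenAI-style
# function-calling format; the lab's exact format may differ.
get_weather_schema = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a given city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {
                    "type": "string",
                    "description": "The city to get weather for, e.g. 'San Francisco'.",
                }
            },
            "required": ["city"],
        },
    },
}

# Remember: the LLM only picks this tool and fills in the arguments.
# Your agent code still has to define and execute the actual function.
def get_weather(city: str) -> str:
    # Placeholder implementation; in practice this would call a real weather API.
    return f"It is 18 degrees Celsius and sunny in {city}."
```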

And finally, the last component is memory. This component is what allows AI agents to store and recall past conversations and enables them to learn from these interactions. And memory, if you think of human memory, it's a pretty nebulous concept. There's so many different types of memory. If you ask a psychologist, they'll tell you all about that.

But I'm not a psychologist, so I think of it in pretty primitive terms. I think of it as two broad categories. One is short term, which deals with storing and retrieving information from a single conversation. And then there's long term, which deals with storing, updating, and retrieving information obtained over multiple conversations had with the agent over a longer period of time.

And this is really what enables agents to personalize their responses over a long period of time. But in today's lab, we'll implement short term memory for our multimodal agent. And again, if you want to learn more about this nebulous and extensive topic, then here's a talk that I gave a few months ago.

So I'll leave this here for a few seconds. But if you want to talk to someone live, then talk to me, Richmond, Mikiko, anyone from our team. Moving on. Okay, so let's take an example and understand how all of these components work together, right? So the first thing that happens is a query comes in to the agent here.

I'm asking this agent, what's the weather in San Francisco today? The agent forwards the query to an LLM. So think of agents as a software application or system with different components, one of them being one or more LLMs. And the LLM in this case has access to a set of tools.

In this case, it has access to a weather and search API and also its past interactions or memories. So based on the tools it has access to, in this case for this query, the LLM might decide that the weather API would be the most suitable to get information about this query.

And it will also parse out the arguments for this tool from the user query. And like I mentioned, your agent also needs to have code to actually execute the tools. So we have that in our agent. It's going to make a call to the weather API with the arguments extracted by the LLM, get a response back from the API, forward that to the LLM.

And at this point, the LLM has two options. It can either decide that it needs more information and decide to call more tools, or it can be like, I have the final answer. I'm going to generate that now. So in this case, it has the temperature in San Francisco.

So it might be like, I know the answer. It generates a natural language response. And that gets forwarded to the user. So that's kind of the full flow of how a simple tool-calling agent works.
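
As a condensed sketch of that flow, reusing the get_weather function and schema from the earlier example, with a hypothetical llm_decide helper that wraps the LLM call:

```python
# Condensed sketch of the tool-calling flow just described. `llm_decide` is a
# hypothetical helper that asks the LLM whether to call a tool or answer
# directly; `get_weather` and its schema come from the earlier example.
def answer(query: str) -> str:
    messages = [{"role": "user", "content": query}]
    while True:
        decision = llm_decide(messages, tools=[get_weather_schema])
        if decision["type"] == "tool_call":
            # The LLM picked the tool and extracted the arguments
            # (e.g. city="San Francisco"); the agent code actually executes it.
            result = get_weather(**decision["arguments"])
            messages.append({"role": "tool", "content": result})
        else:
            # The LLM decided it has the final answer.
            return decision["content"]
```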

The other thing we need to talk about today, which is the more interesting part, I believe, is multimodality, because we are, after all, building a multimodal agent. So what is multimodality? Multimodality in the context of machine learning or AI is the ability of machine learning models to process, understand, and at this point, even generate different types of data, such as text, images, audio, video, etc. And like I mentioned, in today's lab, we'll be working with two of these modalities, which are text and images.

So here are some real-world examples of data that contains a combination of images and text, just to give you some inspiration for the kind of problems and domains you can apply your learnings from today to. So there's graphs, tables, and then there's these types of data interleaved with text.

So think of research papers, financial reports, any sort of organizational reporting, which typically has some graphs, analysis, and text all combined together, or healthcare documents. The list is virtually endless. A lot of real-world data looks something like this. So to make sense of this type of data, we currently have two classes of multimodal machine learning models.

And the first type of models we see are multimodal embedding models. And the job of these models is essentially to take multiple types of data as input and generate embeddings for them so that all of these diverse data types can be searched and retrieved together using techniques like vector search, hybrid search, graph-based retrieval, whatever retrieval mechanism you want.

And the other class of models is multimodal LLMs. DeepSeek does that at this point, Claude, OpenAI; ChatGPT has a voice mode, for example. So the job of these LLMs is to take all of these different data types as input and also generate outputs in these different data formats.

Now, if you give a multimodal LLM tools to search through multimodal data and use its reasoning capabilities to make sense of this information and to solve complex problems, what you have at your hands is a multimodal agent. So let's build that. Enough talk. Let's actually talk about the agent that we are going to build today.

So we are going to start with something simple. We're going to remove as many abstractions as possible, start with very simple objectives, and build an agent from scratch, so you get a really good understanding of what it really takes to build a multimodal agent in practice. So our agent has two simple objectives: the first one being to answer questions about a large corpus of documents, and then also, given a chart or diagram, to help the user make sense of it by explaining and analyzing that figure.

I do this all the time. When I'm reading research papers, I'll just take a screenshot, pass it to Claude, and be like, explain that equation, especially with all those like mathematical symbols and whatnot. Sounds pretty reasonable? Easy? Not quite. There's a small catch. And the catch is we want to search over documents with mixed modalities.

So in our case, our corpus is going to be documents that have text interleaved with things like images and tables. And that complicates things because retrieving the right information from mixed modality documents is not a trivial problem. The challenge lies in actually preparing the corpus of documents for search and retrieval.

So typically, for text-based documents, if you've built a RAG application, you chunk up those documents, embed those chunks, and then retrieve relevant chunks to pass as context to an LLM. But you can't really do this when you have images and tables in your documents. One way to handle this: there are so many tools out in the market, like LlamaParse and Unstructured, that use vision transformers or object recognition models to first identify and extract the different elements.

They'll extract text, images, and tables separately; then you chunk the text as usual, but you summarize the images and tables instead, and then basically convert everything to the text domain by creating embeddings of the text chunks and summaries using a text embedding model.

I know that's already a mouthful, but you'll see how to simplify this process using a new type of model. I'll get to that in just a little bit. Another technique, similar to the previous one, is that you still extract the text and non-text elements. You chunk the text, but instead of summarizing the images and tables, you embed all of these, the text chunks, images, and tables, using a multimodal embedding model, because it has the capacity to understand and embed all of these different data types.

I can already see I'm losing some of you, because these data processing pipelines are pretty complex, and they come with their own limitations, right? They sound promising, but they have mainly these two limitations. The first one is that they face the same drawbacks that you see with chunking. To me, the biggest problem with chunking is the loss of context at the chunk boundaries, which is why techniques like parent document retrieval and metadata pre-filtering are becoming popular, where you add back context that was lost during chunking at either retrieval or generation time.

Also notice how complex these processing pipelines were, right? You need an object recognition model to extract the elements, potentially another LLM call to actually summarize these elements, in addition to chunking and embedding, which is already one too many steps. Another limitation with a lot of multimodal embedding models lies in the architecture of the models themselves.

So until recently, the architecture of most multimodal embedding models, at least for text and images, was based on OpenAI's CLIP model. And what happens in this architecture is text and images are passed through separate networks for generating the embeddings of these data types. And this results in something we call a modality gap, where irrelevant items of the same modality end up close to each other rather than relevant items of different modalities.

So in a CLIP model, for example, text vectors of irrelevant text might appear close together in vector space rather than text and images corresponding to related subjects. And that's a problem. But this has changed with the advent of vision language model, or VLM-based, architectures. So in this architecture, both modalities are vectorized using the same encoder.

And this ensures that both text and visual features are treated as part of a more unified representation rather than as distinct components. So with these models, all you really need is a screenshot of the document, whether it contains purely images, purely text, or a combination of text, images, tables, etc. And because of that unified architecture, the contextual relationships between text and visual data are preserved.

So as you can imagine, this greatly simplifies the data processing pipeline for multimodal data and also ensures that you get better retrieval quality because you're no longer separating these texts and images. So basically, given a document containing a combination of text and images, you simply take a screenshot of it, pass it through a multimodal embedding model, and the embedding that you get from that makes this data ready for retrieval.
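
As a small sketch of that, assuming pdf2image for rendering pages and a hypothetical embed_image function standing in for whichever VLM-based embedding model you use:

```python
# Minimal sketch: render each PDF page as a screenshot and embed it with a
# VLM-based multimodal embedding model. `embed_image` is a hypothetical
# stand-in for your embedding client; pdf2image requires poppler installed.
from pdf2image import convert_from_path

pages = convert_from_path("report.pdf", dpi=150)   # one PIL image per page
page_embeddings = [embed_image(page) for page in pages]
```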

Pretty straightforward process there. So let's quickly look at how some of the key features of the agent that we are going to build work. And then we can go implement these in code. So let's talk about the data preparation pipeline for the corpus of documents that our agent is going to use to answer questions.

So like I mentioned, the first thing we are going to do is for each document in our corpus, we are going to convert that into a set of screenshots. And in our case, each screenshot is going to represent a page in the document. We'll then store the screenshots locally, but if you were to do this in production, then you might want to store them to some form of blob storage like S3, Google Cloud Storage, whatever your preferred cloud provider is.

And we'll also note the path to where the image is stored, and then store that as metadata, along with the embeddings of the screenshots generated using a multimodal embedding model, into a vector database. So in our lab, we'll use the latest multimodal embedding model from Voyage AI, and we'll use MongoDB as the vector database, because I work at MongoDB.
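
Here's roughly what that ingestion step could look like, assuming the voyageai Python client's multimodal_embed method and pymongo. The collection and field names are made up for illustration and won't match the lab exactly.

```python
# Sketch of the ingestion pipeline: save each page screenshot, embed it with
# Voyage AI's multimodal model, and store the embedding plus the image path
# (not the raw image) in MongoDB. Names here are illustrative, not the lab's
# schema, and the call assumes the voyageai client's multimodal_embed method.
import os

import voyageai
from pdf2image import convert_from_path
from pymongo import MongoClient

vo = voyageai.Client()  # assumes VOYAGE_API_KEY is set in the environment
collection = MongoClient("<your-connection-string>")["mm_agent"]["pages"]

os.makedirs("screenshots", exist_ok=True)
pages = convert_from_path("report.pdf", dpi=150)  # placeholder document

docs = []
for i, page in enumerate(pages):
    path = f"screenshots/report_page_{i}.png"
    page.save(path)  # local storage here; use S3/GCS or similar in production
    result = vo.multimodal_embed(inputs=[[page]], model="voyage-multimodal-3")
    docs.append({"image_path": path, "page_number": i, "embedding": result.embeddings[0]})

collection.insert_many(docs)
```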

So one important thing to note here is that we are not storing the raw screenshots. Yes? So are you splitting the documents into several screenshots, or are you picking up the screenshots from the document and sending them to the -- So, okay. So say I have a PDF containing multiple pages.

Each page in the PDF, I'm going to take a screenshot of it. And each screenshot is going to be saved separately in blob storage. And references will be stored as metadata along with embeddings in the vector database. And why do you have to take the screenshot of each page?

Is it to save the space? So, like, I showed you two methods before, right? Like, to be able to use the image and table data for reasoning, you need to be able to retrieve that document that contains all of that together. And the reason I'm taking a screenshot is to preserve, like, the continuity between the text elements and the image elements.

So all of them can be retrieved together as context. But isn't this worse than chunks? What's that? I mean, with a screenshot for each page, you lose the context of the whole document. But then with chunking, it's worse, because usually you're only keeping, like, two paragraphs together or something.

So, like, it's slightly better. You would always, of course, maybe want to augment this method with metadata pre-filtering and other methods that you use for traditional chunking. But I still think one page is better than two paragraphs or a small paragraph of text. Do you want overlapping segments here as well, just like you would with chunking?

Yes. Yeah. So here we'll take screenshots of, like, distinct pages. But if you want that continuity, you might want to, like, keep some overlap as well. Yeah. Good point. Is it better than giving it the full PDF at once? Yes, because LLMs have, like, that lost-in-the-middle problem. Just because you have a large context window doesn't mean you should flood it.

Because they still have the problem of searching through, like, a large document to find the right information. So you're trying to -- So the relation between these different pages should be maintained? Yeah. I think, as she was pointing out, either you'd have to, like, maybe structure it a little differently to have some overlap, or store some additional metadata to maintain that continuity, like page numbers, and whenever you're retrieving a page, you could do something like retrieve the previous two and next two pages, things like that.

Yeah. Any more questions? Yeah. You talked about Voyage multimodal 3, like, why that one versus others? Any VLM-based model, really. Like, the whole point is to show you that CLIP-based models have that -- I'm not using a CLIP-based model because it has a modality gap, but any VLM-based model is very good.

Yeah? So this works for text and images. Yeah. And are there more, like, video? Right. Yeah. Yep, 100%. So this doesn't really deal with that. But essentially, you could extend this concept to different modalities. It's just that I typically don't see, like, video or audio occurring with text. I just chose this because images and figures typically occur with text.

But so, yeah, screenshots might not apply to other modalities. There are different ways to handle it. Today, we are only focusing on images and text. Okay. Two more questions. Okay. Yeah. So, a VLM, I guess this is different from a large language model, right? So it has, I guess, fewer parameters than a large language model.

But then the performance is a bit less, I guess. It's just, it's still a large -- For example, say we make an embedding from the VLM. Mm-hmm. And then if you make an embedding from, let's say, ada-002, which has bigger parameters, then the performance probably -- is actually probably less.

VLMs tend to get pretty big, too. So they're basically still large models. They just can handle, like, images and text. But they're still pretty sizable models, yeah. And I can point you to some benchmarks that show that they're still good at even purely text or purely image data and then a combination of both.

How do you run it locally? What's that? How do you run it on a local machine, I guess. Like, you'd find a model that works with your hardware specifications, just like with an LLM, right? Not all LLMs can be run on your machine. So it would be similar to those, yeah.

You know, there was one. No, there was one. Let's skip one. Okay, one more. Yeah, so in this case, you're using multimodal, right? Mm-hmm. It's true that your image and text modalities are strongly aligned. But what if you have a modality that's weakly aligned, like time series, right?

Which means in your embedding space, they're not really close. How do you handle those? Sorry, can you say that again? If you have a modality that's not strongly aligned with the rest of the modalities, like, for example, time series, right? If you embed it into the same embedding space, they are not, like, really close to each other.

Yeah. So in those situations, how do you handle that? So, like, time series data with text? Like, I'm trying to understand, like, a situation where you would have totally disparate modalities. So you have, like, text, and you may have time series, too, right? Mm-hmm. And that time series data may not be really aligned with the text.

Yeah. And that means when you embed them together, they don't really align very well. So how do you handle those? Yeah, I think for time series data, typically, you don't even, like, use embeddings for it. You just treat it like any other features, like you would for traditional ML models.

You definitely want, like, a different retrieval strategy for those. It would be hard to put them in the same, yeah, vector space as text and images. So you might need to work with, like, different retrieval methodologies. Yeah. Yeah. All right. I'm going to move forward here very-- like, in a few minutes, we will hit the hands-on portion.

So if we have more questions, just call out to our team, and we'll take more questions then. Cool. All right. OK. OK. Let's quickly talk about the workflow of our agent. We looked at a random example before, but let's talk about the agent you're going to build. So query comes in, agent forwards the query to a multimodal LLM.

So note, we are going to use a multimodal embedding model for retrieval, but we also need a multimodal LLM. We are going to use, I think, Gemini 2.0 Flash Experimental, some long name. But yeah, basically, we need that LLM because once we give it, like, that interleaved document with text and images, we need an LLM that can make sense of both these modalities.

So that's why I'm using that LLM. It has just one tool, which is a vector search tool to retrieve those multimodal documents, and also access to its past interactions, or memory. So based on the query, the LLM can decide to call the vector search tool. And if it does that, it'll return the name of the tool and the arguments to use to call the tool.

Again, the agent has code to actually call the tool, so it calls the vector search tool. And typically, if you're working with text-based data, you get the documents back directly from vector search. But in this case, what we are going to get back is references to the screenshots. Remember, we didn't store those in the vector database.

Those are in our local or blob storage. So then our agent needs to have that additional step of using those image references to actually get the screenshots from blob storage. And then it's going to pass those images along with the original user query and any past conversational history to the multimodal LLM.

So each time an LLM call is made, whether it's to determine what tools to call or generate the final answer, the images are also going to be passed along with the query and conversation history to the LLM. Then it generates an answer and that gets returned back to the user.

And finally, depending on the query, the LLM might also decide that it doesn't need to call a tool. So, for example, if the user is simply asking, like, "Hey, summarize this image," it might not need to call a tool. So in that case, it'll say, "I don't need to call tools." It'll simply generate an answer, and that gets forwarded to the user.
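
Put together, the core of that workflow could be sketched like this, with llm_decide, vector_search, load_image, and generate_answer as hypothetical stand-ins for the tool-selection LLM call, the MongoDB vector search, loading screenshots from storage, and the final multimodal generation step:

```python
# Condensed sketch of the agent workflow just described. All four helpers are
# hypothetical stand-ins: llm_decide (tool-selection LLM call), vector_search
# (MongoDB vector search returning image references), load_image (fetch the
# screenshot from local/blob storage), and generate_answer (multimodal LLM call).
def handle_query(query: str, chat_history: list) -> str:
    decision = llm_decide(query, chat_history, tools=["vector_search"])
    images = []
    if decision.get("tool") == "vector_search":
        # Vector search returns references to the screenshots, not the images themselves.
        hits = vector_search(decision["arguments"]["query"])
        images = [load_image(hit["image_path"]) for hit in hits]
    # Every LLM call gets the query, the conversation history, and any retrieved images.
    return generate_answer(query, chat_history, images)
```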

And the final thing, let's talk about the memory management mechanism for our agent, because this is important for it to actually have coherent multi-turn conversations with the user. So, like I mentioned before, we'll be implementing short-term memory for the agent. And the way this works is each user query is associated with a session ID, just some identifier to distinguish between different conversations.

So, given a user query, we obtain its session or conversation ID, and we query a database consisting of previous turns in the conversation to get that chat history for that session. And each time, again, in addition to the context, we also pass in that chat history, just so the LLM can use that as additional context to determine if it even needs to call tools or not.

And then, when the LLM generates a response, the other thing that happens is we add this current response and the current query back to the database to add on to the history for that session. Now, you can also log tool calls, their outcomes, and any reasoning traces from the LLM, but at a minimum, you at least want to be logging the LLM's responses and the user queries themselves.
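
As a sketch of that short-term memory mechanism, using a MongoDB collection keyed by session ID. The collection and field names here are illustrative, not the lab's exact schema.

```python
# Sketch of short-term memory backed by MongoDB: fetch prior turns for a
# session, and append the new user query and LLM response after each turn.
# Collection and field names are illustrative, not the lab's exact schema.
from datetime import datetime, timezone

from pymongo import MongoClient

history = MongoClient("<your-connection-string>")["mm_agent"]["chat_history"]

def get_chat_history(session_id: str) -> list:
    turns = history.find({"session_id": session_id}).sort("timestamp", 1)
    return [{"role": t["role"], "content": t["content"]} for t in turns]

def save_turn(session_id: str, role: str, content: str) -> None:
    history.insert_one({
        "session_id": session_id,
        "role": role,  # "user" or "assistant"
        "content": content,
        "timestamp": datetime.now(timezone.utc),
    })
```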

And finally, that is enough talking from me. You will be quiet for the rest of the workshop. We have about 45-ish minutes. So, head over to that link to access the hands-on lab. Recommend running that on your laptop. So, instead of that QR code, actually type in that URL.

This should take you to a GitHub repo. Follow the instructions in the readme to get set up for the lab. That should take about 10 minutes. And then, you have two options. You can either -- there are two notebooks in there. One is called lab.ipynb. And if you actually want to -- if you're in the mood to actually write code right now, that's the notebook you'll be using.

You'll have -- you'll see reference documentation inline in the notebook, indicated by that books emoji, that tells you: use this documentation, fill in your code. You can do that. If that sounds too daunting, there's also a notebook called solutions.ipynb that has all the code pre-filled. So, you can just run through that notebook and read the comments to get an understanding of how the agent works.

But whichever option you use, I'm here. My team is here. Just call on us if you have any questions. And yeah, for anyone actually filling in the code, you can also refer to the solutions. Don't get too frustrated if you get stuck. We don't want that. All right.

So, I'm shutting up now. Let's go ahead and build that agent.