
Today, we're going to be building an AI agent that allows us to chat with videos. This AI agent will be fully conversational. We'll be using Mistral's embed and LEM endpoints to build it. We'll also be using the Aurelio platform's video processing and chunking endpoints to prepare our data. Our agent will also use async and streaming so that we get a more scalable application that also has a nice user-friendly design of seeing streaming of tokens.
And towards the end of the video, we're going to be bringing all this together and seeing how we can build a significantly optimized agent that will allow us to reduce our costs pretty dramatically. So let's jump straight into it. When we're working through this example notebook, we first install our prerequisites.
We have the Aurelio SDK for the video transcriptions and chunking. We have YouTube DLP here, which is going to download a YouTube video for us. And Mr. AI, of course, for the LEM and embed endpoints. So the first thing we're going to do is download a YouTube video here.
This YouTube video is one that I did before, and it's essentially a video version of this article here. So going through this, we see that I'm talking about AI agents as neuro-symbolic systems and talking about the neuro side of AI versus the symbolic side of AI. A little bit of history there, what they both mean, so on and so on.
So we'll be able to ask a few questions around all of that. Now, we're going to go ahead, and first thing we need to do is actually get the Aurelio API key. So for that, we need to go to platform.aurelio.ai. We log in at the top here, and I'm going to use Google.
Okay, and because it's my first time logging with this account, it's going to take me through this little guide. We can, you can go through it, or you can skip it by pressing the X here. So I'm going to come over to here, and we do need to add credits before using the API, but you can also use coupon codes to get free credits, essentially.
So, I'm going to show you how to do that. The coupon code is for $5, so you'd go to here, click purchase, and what you need to do is come over to here where it says add promotion code and type in JB video agent. Okay, and you can apply that, and you'll get $5 in credits.
Okay, so now we can go over to API keys, we create a new API key, you can call it whatever you want. I'm going to call it video agent, and I'm just going to copy that, bring it over into my notebook here, and there we go. So I've now authenticated with the Aurelio client here, and then what I'm going to do is send the video to the platform for processing.
So I'm just going to process it as a single video right now. I'm not going to chunk anything yet, we're going to do chunking a little bit later, because I want to show you how we use chunking to optimize our costs, which are, we'll see, it's very significant. So, that will take a little while to process.
So while that's processing, let me talk you through what the first pipeline is that we're building. Now, to begin with, our pipeline is going to be pretty simple. So we're going to have our video coming in from YouTube, we're going to be passing that into the system prompt of the agent.
Now, the system prompt is going to include some additional context. So it's going to include some instructions on how it should be used initially, but then we're going to insert the additional context, which is our transcribed video, into the system prompt here. Then, this is going to come down here, and it's going to be fed into an LLM alongside a user query, which is going to bring that in from over here.
Okay. So our user will probably ask the question, and that will feed in just here. Now, our LLM will then produce an answer based on both the context that we provided up here, and our user query, and it will return it back to us. So it's pretty straightforward. This is the initial pipeline we're going to be building, but we are going to make this slightly more sophisticated.
So what we will be doing is removing this context from over here, and also this video component. So those will now exist over here in a separate part, which is actually our tool. Okay. So this is our tool here. Now, what that tool is going to do is given a LLM generated query.
So the LLM will generate a query, like, okay, I need to search based on the user's query. I need to search for this, and it might even do this multiple times. It might try and answer multiple questions. So based on that, it's going to come over to here, and an embedding model, again, from Mistral, is going to convert that into what we'd call a query vector, which I would usually write as XQ.
All right, so that embed endpoint takes that query and turns it into XQ. Now, what we've also done in the middle of all this is we've taken our article here, and we've actually chunked it into many smaller parts. Now, by chunking these into many smaller parts, we allow ourselves to perform what is essentially RAG across them.
So retrieval augments generation. And the way that we do that, again, this is before we come to the LLM inference time. The way that we will have done that, or set that up, is that all of these chunks here will have actually been passed through our embed endpoint, and been used to generate a Numpy array.
Now, once we have that Numpy array over here, and then we have our query vector, all we need to do is perform a simple NP dot across both our query vector and that array. And what we'll get from that is a similarity rate that tells us which of those values are the most relevant or the most similar to our query vector.
So we can essentially cancel out a few of those records and only return the ones that are the most relevant. So in that whole process, although it seems like a more complicated system, and it is, to some degree, a more complicated system, we actually massively reduce the number of tokens that we're sending to our LLM.
And this is with just one transcribed video. With more, that would be even greater. Now, once that has finished processing, we can come down to here and check the content. So you see there is essentially the full transcribed video in there. We can also count the number of words from that transcribed video.
So it's just over 4,000 words. Now we can move on to connecting that to our LLM for that first version of a simple video plus LLM pipeline. To do that, we will need a Mistral API key. We get those from console.mistral.ai API keys. So you'd come over here, create new key, and I'm going to call this video agent.
You then just copy your key and throw it into your notebook here. Now, if you run this straight away, occasionally, you might see this error, unauthorized. Now, if we go back to the API key created box here. It does note that sometimes it can take a few minutes to be usable.
So we can just try again. Okay. So it seems like we're good now. And there we go. So now we need to get our message content. So this is what it was returned based on this chat complete from Mistral here. So first things first, we are using the Mistral large latest model.
And I just want to point out for the system message here is a pretty generic system message. But then we're adding in the transcription and the content of that transcription, which we got earlier on from the video file content there. And then as a user message, I said, hi, can you summarize this for me?
Okay. So pretty straightforward. And you can see here that we have the assistant message with some content. Okay. So that's cool. One thing that we can also see in here. So if we go to response usage, we can see how many tokens we used, which would be pretty useful later on.
Now, already looking at this, see that we use a ton of prompt tokens. So prompt tokens are the input tokens. Completion tokens are what the LLM generates back to us. So feeding the whole transcription in with every single query, that is just naturally going to lead to a lot of prompt tokens being used, which of course does add up to the cost.
And we'll see the cost for that relatively soon. But for now, let's just work on making this pipeline that we have a little more conversational. So to do that, we're going to define this agent class. Now, I wouldn't necessarily define this as an agent just yet, but later this class will become our agent class.
So right now, I'm just wrapping what we did already. So we have our prompt and then we'll be feeding in our user query when we hit the chat method. Then we are hitting client chat complete with those messages. And all I'm doing here is keeping these self.messages. So self.messages is a simple list where we're storing the messages as we create more and more interactions.
So let's see how that works. I'm going to display everything in Markdown, by the way. As we saw just up here, the agent is using Markdown. So it's always nicer to see the responses in Markdown. So we have our response. You can see that the query here is, can you summarize the meaning of symbolic in this article?
And in the context of this article, symbolic refers to the traditional approach to artificial intelligence that involves using handwritten rules, anthologies, and logical functions, so on and so on. Okay, cool. Now let's check that our conversational features, i.e. the self.messages attribute is working by asking a follow-up question. Can you give me that, but in short bullet points?
If the chat history is not being stored and being sent to our LLM, the LLM will have no idea how to answer this. It will just see that we're asking, can you give me that? It has no idea what that is, but in short bullet points. Okay, cool. And we get some nice bullet points.
Okay. So we have that built a LLM pipeline. It is conversational and it has that input for video. And we're able to ask it questions about video and speak with it in a conversational way. Now, what I want to talk about next is using asynchronous code and implementing streaming.
Now, why do those two things matter? Well, async code is actually incredibly important, especially for AI applications. The reason I say that is because with AI applications, we tend to use a lot of API calls, more so than many other applications. And those API calls, not only do we have a lot of them, but they also take a long time because they're waiting for an LLM to respond.
Now, if we're writing synchronous code, whilst we are waiting for a response from our LLM API, our Python code is doing nothing. It is just waiting. It's doing nothing. If we write asynchronous code, our Python code can be going and doing other things whilst it is waiting. And then once the response is received from the API, our Python code will see that and it can then jump back into that task and continue processing from there.
Okay, so it frees up a ton of compute time if we write our code asynchronously, especially for AI applications. Then we have streaming. Now, streaming is more of a preference. So, with streaming, let's say we get quite a long response from our LLM. Without streaming, that can feel quite painful to a user.
The user experience is that you're just waiting and waiting and waiting. And then you just get this massive chunk of text. And that's okay. But it's not a great experience. And there are so many applications out there that use streaming that most users of AI systems now kind of expect streaming as a standard.
So, it's almost an essential feature for most, particularly chat, interfaces. And another thing that we can do really nicely, which I'll use Perplexity as an example here. When we use the Perplexity interface and Perplexity goes and searches for something, the LLM is generating that search query. And it's deciding what tool to use, which is, in that case, a web search.
If we stream our tokens, we can actually implement that in our own applications as well. So, whilst, in this case, as we'll see later, whilst our agent is going and looking at our video transcription and performing a search, we can actually see that our agent is doing that and can see how it's doing that.
We can see the search queries that are happening, which allows us to build far more interesting experiences for users, in my opinion. Okay, so, let's go ahead and see how we can rewrite this agent class to use async and also implement streaming. So, we're not changing anything about how we're initializing our agent here.
The only thing we're really changing is we're now defining our chat method here with async. So, this makes it an asynchronous method. And we're switching from Mr. All's synchronous chat method to Mr. All's streaming and async chat method. Okay? So, async, because of this, this means we will now need to await the response.
And because we're streaming, that means that we will see just here, we're actually iterating through our response and getting the tokens out from there. Okay? So, the chunks are coming through from our streaming response description in a slightly different way to what they were before, which was the, we just got a single assistant message block and that just had everything inside it.
Instead, now, we're getting messages, but for each chunk. So, we still need to go through, we need to go through into the chunks of data, we need to extract from the choices. And we're now looking at the delta, which contains any changes. And inside the delta, we have our content.
So, we're saying if that content is not none, okay? If that content is not none, we are going to go into this if statement here. And we're also just using the walrus operator here to pipe this information into the token variable here. Because that means we don't have to, well, we don't have to take this and put it into here, which would be just not as clean.
So, we do that. Then, the final thing that we do is we also create a list of all of our tokens that we're receiving. So, what that allows us to do is, on the next line down here, it allows us to take all of those tokens, join them together, and then use them to actually output a final assistant message object, which contains everything.
So, we do that. There's also a slight difference to our usage attribute here. So, we have to go chunk data usage. All right. So, that is it. We extract our assistant message from the messages that we added here. And let's go ahead and try that. So, I'm going to use the same query again.
Can you summarize the meaning of symbolic in this article? And we should see a streamed output. Okay. Cool. So, we're now seeing a streamed output there. And one other thing that I should point out is that we have to await our agent.chat method now. Okay. So, that all looks pretty good.
We can continue our conversation a little bit here. And I want to do this so that we can see the, essentially, the usage of our conversation over time. Great. So, plenty of tokens output there. But then, of course, also plenty of tokens input. So, we can see that even though we output a ton of text here, now, they're pretty lengthy responses, that output is still not even comparable to the inputs that we have, which are huge.
So, the number of inputs here is, of course, all made up by us feeding that article in every single time. So, the question now is, one, how much does that cost? And then, two, can we optimize it? So, we define a cost calculator here. And we need to calculate based on these prices here.
So, input tokens, which is our prompt tokens, and output tokens, which is our completion tokens. So, I'm going to run that, and let's just see how much that costs us. So, looking at this, we're paying, these seem like pretty small numbers, but this is for, each one of these is for a single interaction.
So, that actually will add up quite quickly. After just 100 interactions, you're paying a dollar. Or even, sorry, even more than a dollar. Probably, you're hitting a dollar within maybe 70 interactions. So, depending on if this is a single conversation that's just getting bigger and bigger, or whether it's multiple conversations over time with a small number of interactions in there.
But you can see how that would add up very quickly. Now, to optimize this, what we're going to do is what I described here. So, this component here, where we're breaking up our transcribed document into smaller pieces, and then only using what we need from those chunks. So, how do we implement that?
There are a few steps. Like I said, it's not necessarily a simpler system, although it's not complicated. But it is a far more efficient, and scalable, and cost-effective system. So, what will we do? We're going to break our transcribed documents into smaller chunks. Embed those chunks into vector embeddings.
Sore those vector embeddings in a numpy array. Then, when querying, our LLM will transform our question into a small query. We embed that query into a vector embedding that creates the XQ query vector that I mentioned earlier. Then, we compare the semantic similarity between our query vector and chunk vectors to find the most similar chunks, and return only those most relevant chunks to our LLM for the final response.
So, let's start by chunking our document. Now that we're using async everything, we can go ahead and use the async Aurelio chunking endpoint as well. Although, given that this is preprocessing, we don't necessarily need to. But in any case, then we'll come down to here. And we first want to set up our chunking options for our semantic chunker.
So, we're using a semantic chunker, which means that our chunks are going to be produced by looking at the semantic similarities between parallel components in the text. Next, we say that we want a maximum chunk length of up to 500 tokens, which is quite big to be honest. I might even suggest going lower, but it's okay.
And we use a window size of 5. That is essentially what is the rolling window in which we are comparing the similarities between our parallel components within the text. So, then we get our chunks. You can see in the first chunk, I'm talking about the Apple Remote program and the Front Row program.
And then there's a couple of chunks in the middle here that are a little bit disjointed from the other parts. It's mostly spoken audio, so I'm rambling a little bit. And then we can see down here that this final chunk here is focusing on the React agent and how it relates to a broader definition of agents.
So, we have those. And what we can do now is actually take those chunks and we're going to embed them using Mistral again. Again, to keep everything in line with the async approach, we're using the async method for embeddings here. And we do actually need to use this async method later on when we're producing our query embedding.
And we use the Mistral embed model. Okay. So, we're just taking the content of our chunks out here and inputting them to create our embeddings. And if we look in our embeddings response data, we'll see that we have 35 chunks there. We can also go in and see the length of each one of those embeddings.
So, the embedding dimensionality of this Mistral embed model is 1024, which is a very typical dimensionality for most embedding models. Now, we take all those and we're just going to combine them all into a single NumPy array. This will allow us to perform the dot product comparison later against it.
And let's see how we would do that. So, I'm going to ask the first question. Of course, later on, I'm going to want to embed all of this within our search tool. But first, I want to ask this first question of, what is the relationship between AI agents and good old-fashioned AI?
I'm going to embed that query using Mistral embed again. And I'm going to convert it into an array and create my query vector. The query vector shape gives us the dimensionality of the model. And with that, we're actually ready to calculate the similarity, specifically dot product similarity, between our query vector, XQ, and the pre-computed document chunk vectors.
And you can see in here, these are all of the similarity scores. So, if we're looking through this, we would see, okay, these values here, these ones, these are all kind of higher ones around there. And they all tend to correlate into a certain segment of the transcribed document.
And that is generally expected because when you're talking through something, you're going to be switching from one topic to another topic to another topic over time. And those topics that share the highest proximity within your, you know, your story or your video, they're probably going to be more related than a random chunk from, you know, let's say the end of the video or the start of the video.
Okay, so we can use np.org.sort with a top K value of three to return the top three most similar chunks. Okay, so we see that that is 10, 12, and 13, which is probably, if we look at these, I think it is this value, 10, 12, and 13, so the ones we pointed out earlier.
Great, so we have those. And then what we're going to do is get those index values, and we're going to use them to get our most relevant chunks or the chunk content from those. Like so. So looking at these, we can see in here, I'm talking about the symbolic AI stuff in the middle here.
We're talking about rules, ontologies, and other logical functions, okay, which is the good old-fashioned AI. Then here, I actually mentioned good old-fashioned AI directly at the end now, but I think I'm generally talking about the same thing. And then here, I think, oh, go-fi, probably I'm saying good old-fashioned AI, i.e.
go-fi there. And, yeah, I'm talking about how that compares to connectionism. So all those seem relevant to the question. Now, what we want to do is take what we've just done and compress all that into a single function. Okay, that function, again, will be async. So, in this function, we are providing a .string, which we're going to use this .string in order to understand how it should use this function and when it should use this function.
So, I want to give it as much information as possible here. That's why I'm saying use this tool to search a relevant chunk of information from the provided video. And I'm saying how the LM should use this tool as well. So, I'm saying provide as much context as possible to the query parameter, ensuring to write your search query in natural language.
That's how you would get the best results here. Then, I also want to say, okay, if there are multiple questions being asked, to just use this tool for one of those questions at a time. And the reason we do that is it can improve the retrieval quality. Because if you imagine asking five questions all at once and your LM sends all that to your embedding model, your embedding model is essentially taking a query and it's placing it at one point in vector space.
And if you have five different meanings to your query and you're trying to compress those five different meanings into one point in vector space, you are essentially diluting the quality of your embedding. You're almost averaging out your embedding between those five different meanings that your query actually has. So, although it can still give you the results you need, it's generally better to try and separate out your query that can be more concise when you're doing a vector search like this.
So, all we're doing here is what we just did. Okay. We're creating our query vector, using it to search across our chunks, getting the highest scoring similarity scores from that, and then using that to retrieve the most relevant chunks. And then I also merge those into a single string for our LM to use.
Okay. So let's try that quickly. Our query here is, I think it's still what we asked before, which is what is the relationship between AI agents and GoFi? So we should see similar results, well, the same results, even, and yes, that is what we can see there. Cool. So we now need to redefine our agent and plug that new function or tool into it.
Now, to do that with the Mr API, we actually need to create a function schema object, which is what we're doing here. So I am getting the doc string with this from our search function, which we can see here. I'm placing that doc string into the description here of our function schema.
Then I'm taking the name of our, let me even just copy this, just the name of our search function, which is obviously search. Then I'm also just initializing these parameters. Now the reason I initialize these and don't fill them out directly is because we're going to iterate through all of the parameters of our function dynamically and set those.
To do that, we also need to be able to map from Python types to the data types that Mistral understands and uses. And these are, as far as I'm aware, exactly the same as the OpenAI mappings as well. Okay. So we get the signature here. Let me just show you what that looks like.
So we get the signature. And then we're just going to go through those. And let me, again, just show you what those look like very quickly. I'll show you the name and D type. So what do we have here? We have even, let me print D type. There we are.
So we have the name, which is query, and then we have D type, which actually contains both the name and also what type it is. The way that we would get the type only from that is we actually do D type annotation, which you can see here. Okay, cool.
So that is what we're processing. And we just do that quickly with this loop, which is pretty straightforward. And that creates our function schema that we can, you know, we can apply this to any function. And then Mistral will be able to use this function schema and use that when it's defining the tools.
And then all we need to do is transform that into a MistralAI function object. We could have done this before, but this is, this is a general format. So I do like to show this, but you can view this function as being the same as this function schema that we defined up here.
Cool. So now what do we want to work through here? We can actually remove this callable. We don't use that. And we can go ahead and just work through, okay, what are we, what are we doing? So first we're adding in our tool signatures. So why do we add tool signatures?
Well, that is so our LLM knows what tools it has access to and also how to use them. So we pass those in for that. You can see that we also remove from here our transcribed content. We don't need any more. We're going to use that or we're going to provide that via our search tool and then make some modifications to our chat method as well.
So we're still using stream async, but now we've added the tool signatures to the tools parameter. And we're also, we don't need to do this. This is the default value, but we set tool choice equal to auto. So we, yeah, we could just comment that out. It doesn't matter.
We'll do the same thing, but I like to be explicit. What you can do if you want to force tool calling is you can set any, and that will essentially tell Mr. All you have to use a tool call. We're not doing that here because we don't want to force that all the time.
Okay. Then this bit here is kind of the same, but also slightly different. So we have two conditions here. Whereas before we just had one, which was essentially this bit here, which is saying, okay, if there's some content, we're going to stream that content directly. But now we also are getting tool calls and tool calls are sent to a different part of our chunk objects.
As you can see here, we're going Delta content for the content that we stream, and we're doing Delta tool calls for tool calls. So slightly different, you know, we're seeing those being returned in slightly different ways. So let's run this and see what happens. I am removing these for now because we're, we're not quite done with cleaning up our agent function yet.
So I'm just removing those for now. And instead, I want to return that tool call so we can just have a deeper look at it in a moment. So tool call, what do we have? This is a tool call. It's a tool call object. Inside there, we have a function attribute, which contains the function call.
Inside that function call is the actual instructions from the LM on what tool or function to use. So it's saying we need to use search tool and these are the arguments. So I want you to provide this string into the query parameter. Okay. We also importantly need to pull out the tool call ID later.
So just be aware of that. So what we've done here is our LLM is now able to generate the instructions for what tool to use, but it can't use them. And that is because we haven't written the code that allows the execution of our tools. So we need to do that.
To do so, we're going to create a tool execution function. That tool execution function is going to take our tool call. So that is this object here. So you're going to take that tool call. We're going to get the tool name out from it, which is search. Then we're going to load the tool parameters, which are from here.
So it's just a query and that string. And we're also going to get the tool call ID, which is exactly what we saw here. So pulling all of that information out, then what we're doing is we're using this tool map dictionary here. So what is tool map? Let me run this and I will just show you very quickly.
So tool map is just mapping us from a string, which is our tool name to the actual function that it refers to. Okay. That's all. That's what we've done here. So that means that we can access our tool map as a dictionary and actually use it to execute the chosen tool.
In this case, we just have one chosen tool, but you could imagine if we had multiple tools here, we could have like calculator, code execution, so on and so on. This execute tool function would work also for the multi-tool use scenario. And from that, we're going to return tool message containing all of that useful information, including the output from us executing our function or tool.
Great. So we're going to take that tool call and we're going to just plug it into the execute tool and see what happens. And you see that we get this, so we get this tool message and we get all this content, right? So this content is exactly what we wrote before, where we are providing the most relevant chunks based on our user query.
So that's what we're getting in the content there. We're also getting, okay, what tool was that? And we're also getting the tool call ID. Cool. So we've done that. Now what I want to do is I want to extend the agent messages. I'm doing this like from outside the agent class for now.
In a moment, we're going to implement this all within the agent class. But for now, outside, I'm going to go to agent, send those messages and add an assistant message, which is essentially the LLM saying, I want to use this tool. And this is how we're going to use it.
Followed by the tool output here. Okay. And then if we look at our agent messages after this, we see that we have a system message. That's the first one that we've predefined. We have the user message, which is, can you summarize the meaning of symbolic in this article? Then we have our assistant message, which is the LLM saying, hey, I want you to use this query for this tool.
And then we get the response from that tool. With all that information, our LLM now has everything it needs in order to answer our original question. So let's take what we would usually have in that agent chat method. Let's extract that out and just run it against our new agent messages that we have, which include the assistant message and tool response.
Okay. So we can see that it's streaming everything here and saying, okay, symbolic AI refers to those written rules, intelligence, logical functions, so on and so on. All right. So definitely stuff that is coming from that transcribed document. Now, how do we take all that and refactor our agent class for the final time to include all that additional tool execution logic?
Well, we add one more attribute, which is max steps. This is more of a precaution than anything else. So if our agent gets stuck in this loop of iterating again and again, and we're hitting the LLM APIs again and again, that can, of course, drive up costs pretty quickly.
So to put a limit on that, we say, I don't want to go above three iterations of the agent saying, I want to use this tool and this tool and then this tool and then respond, right? It will just have three opportunities. And that should be all it needs.
So let's go through our new logic. We have a while loop here, which says, okay, we're going to go through a maximum of those max steps. We, this bit is pretty similar. So we generate our response asynchronously. We say, okay, if the tool calls object is a list, we are going to print sort of clean this up a little bit.
So it's a bit nicer to read. We're going to print. The function name or the name of the tool that we are calling and the arguments being passed that, then we're going to execute that tool here. That will give us our tool message. And then as we did before, we're just extending our chat history with the assistant message that told us what tools to use, how to use it and the tool response.
And that is if we have a tool call, otherwise the logic is exactly the same for if we have just normal tokens being streamed back to us. Finally, if we see that the length after going through this async for loop here, if we see that we have all tokens here, that means that our agent has responded to us directly using that content field.
And we should break out of the loop and which is exactly what we do. So let's run that. And now let's try again. So one thing that we're doing here, just to try or test this out, is I'm actually asking two questions or I'm asking the two questions that should require the LM to search or use the search tool.
And then I'm asking a final question, which is saying, okay, bring those both together and explain them to me. Now also throwing in a little bit of a spanner here by asking about DeepSeq, the transcribed document doesn't mention anything about DeepSeq. So let's just see how our agent tackles this problem.
Okay. So we have our outputs here. And you'll see it. Sometimes this will work. Sometimes it will hallucinate, particularly with the DeepSeq part. And you see here that the search tool was used twice. Once to look for good old fashioned AI. And another time search for DeepSeq. Then it just explained, okay, yes, the document mentions good old fashioned AI.
It does not explicitly mention DeepSeq. And then, yeah, it just continues answering our questions. So we can see that, yeah, that agentic flow does work. Now let's look at the usage for our query. Now we can see that despite us having two tools being used, right, which is probably the worst case scenario for this sort of question.
Despite that, which we can see with the usage info for these first two components here, it didn't necessarily add that many tokens, especially when compared to our original costs. So if we look at the original costs from here, so this is from the earlier execution that we performed, we spent this much across this is across three interactions.
So we could say, okay, it was only this much for that one single interaction. Now, if we look at the cost for each one of these queries, that is pretty significantly lower. Looking at roughly half the price for that first query, despite asking a far more complicated question as well.
Now, I think this message, this note was from my previous testing, and I probably got lucky one time. But this gives you a good idea of just how cheap it can get. So I think the, in this scenario, the completion tokens were probably quite low for my final response.
But, yeah, that is it. So we've been through, we've built a fully functional conversational agent with async and streaming using Mistral. We've implemented video with that using the Aurelio SDK. And we've also seen how we can then optimize further using chunking to essentially just reduce our costs pretty significantly.
And I believe probably if we optimize this further, we can definitely get that down even further. So that's it for this video. I hope all this has been useful and interesting. For now, I'll leave it there. So thank you very much for watching, and I will see you again in the next one.
Bye. I'll see you again in the next one.