Large language models and conversational agents are some of the most interesting technologies to have proliferated recently, specifically agents and their use of tools. With agents, we essentially give a large language model access to different tools, and tools can mean a lot of things, as I'll explain in a moment.
By using agents and tools we have pretty much infinite potential in what we can actually do with large language models. They allow us to search the web, run code, do maths and a lot more. Now, we can get pretty far with some of the pre-built tools that are available through LangChain, but in a lot of use cases we will probably need to either modify those existing tools in some way or build completely custom new tools to fit our particular use case.
So that's exactly what we're going to talk about in this video. We're going to take a look at a few pretty simple examples of tools that we can build to get started, before moving on to what I think is a more interesting example: a tool that takes inspiration from the recent HuggingGPT paper and essentially gives our GPT-3.5 large language model access to another deep learning model, which will allow it to actually look at images and caption them.
So that should be pretty interesting, but before we get into that, let's just quickly define what I mean by a tool. We have our large language models, and by default they just kind of exist in isolation: you put in some text and they generate some text.
Tools can be a lot of different things, but the core idea is that they are something (this box in the middle here) that takes in something like a string or a couple of parameters, and outputs a string or a couple of parameters.
Now, the input to this tool is going to be controlled by our large language model, and the output goes back to our large language model, so it's just a way of giving new abilities to our large language models. Take maths, for example. Large language models are very bad at maths, but we can just tell the model: okay, you're bad at maths, use this tool instead and you can calculate whatever you want, right?
That is basically what tools are. They are anything that we can plug into our large language model and that anything is pretty flexible on what we're using. We just need to figure out a way of inputting and outputting relevant information that our large language model can understand. So let's get started with our first example.
We're going to be using this notebook here, so there will be a link to it and you can follow along with me if you want. As I mentioned, later on we're going to be using another ML model as one of our tools, so we're going to be using transformers.
If you have a CUDA-enabled GPU, this part will run a lot faster if you use that GPU, or if you're on Colab you can go to Runtime at the top, Change runtime type, Hardware accelerator, and make sure you have GPU selected. Okay, so we're going to run those prerequisites; it'll just take a moment for those to install.
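That prerequisites cell looks roughly like this; the exact package list and versions are an assumption on my part, so check the notebook for the real ones.

```python
# install LangChain, the OpenAI client, and Hugging Face transformers
# (packages and versions are illustrative, not the ones pinned in the notebook)
!pip install -qU langchain openai transformers
```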
Okay, once they are installed we come to our first example. We're going to build a simple circle circumference calculator tool. One thing that large language models are bad at is maths, even when it's something pretty simple like this. So what we're doing is importing the BaseTool class from LangChain's tools and using that to initialize our circumference tool class.
Okay, so we're going to inherit the main methods from that BaseTool. Now, there are two important things that we define here: the name, which is just the name of our tool, "Circumference calculator", and, more importantly, the description. The description is literally a natural language description that is going to be used by our large language model to decide which tool to use.
So we need it to describe when to use this tool. We just say: use this tool when you need to calculate a circumference using the radius of a circle. It's really clear and straightforward. Then we need to define these two functions: _run and _arun.
_arun is for when you would run this tool asynchronously. I'm not going to talk about that; it's kind of beyond the scope of this video. I will talk about it at some point in the future, but for now we're going to leave it. So we're just going to focus on the standard _run function.
The _run function in this example takes a single input, radius, which is going to be either an int or a float, and we just return the circumference based on the radius of the circle. That's all; this is super simple. I just want to show you the structure of one of these tools.
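For reference, here's a sketch of roughly what that tool class looks like, written against the LangChain BaseTool interface used in this series; exact import paths may differ in newer LangChain versions.

```python
from math import pi
from typing import Union

from langchain.tools import BaseTool


class CircumferenceTool(BaseTool):
    # the name and description are what the agent sees when deciding which tool to use
    name = "Circumference calculator"
    description = (
        "use this tool when you need to calculate a circumference "
        "using the radius of a circle"
    )

    def _run(self, radius: Union[int, float]):
        # circumference = 2 * pi * radius
        return float(radius) * 2.0 * pi

    def _arun(self, radius: Union[int, float]):
        # async execution is out of scope for this video
        raise NotImplementedError("This tool does not support async")
```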
Okay, so we run that, and when our large language model decides to run this tool, it is going to call that _run method. Now that we've defined the tool, let's go ahead and initialize our large language model and the conversational agent that we're going to use. To do that we first need an OpenAI API key.
For that you need to go to platform.openai.com and grab your API key; you either put it in here or set it as an environment variable and this will grab it for you. Then we initialize the large language model. We're going to be using ChatOpenAI because we're using a chat model here, gpt-3.5-turbo.
Okay, temperature: because we're doing some maths, this is pretty important, and it is very good to keep your temperature very low. You can think of it as basically randomness, so when you're doing maths or writing code or anything like that, it's a very good idea to keep the temperature, the randomness of the model, as low as possible, because you don't want it to be random with code; it's more likely to make mistakes.
When you do want it to be more creative, like with creative writing, then you would set the temperature to one or higher. Then we initialize the conversational memory; this is just going to allow us to remember the previous five interactions between the AI and us in this conversation. The memory key here we set to chat_history, because that is what will be used by the agent down here.
So we just align those two things. I've spoken about that in a previous video in this LangChain series, so if you want to look more into it I would recommend you take a look at the chat video of this series. So let's run that.
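Put together, that initialization looks something like the sketch below. The model name and the window size of five come from the video; treat the rest (import paths in particular) as something you may need to adjust for your LangChain version.

```python
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferWindowMemory

# assumes OPENAI_API_KEY is already set, e.g. as an environment variable
llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    temperature=0  # keep randomness low since we're doing maths
)

# remember the previous five interactions between us and the AI
conversational_memory = ConversationBufferWindowMemory(
    memory_key="chat_history",  # must match the key the agent expects
    k=5,
    return_messages=True
)
```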
Okay, cool, and now what we want to do is initialize the agent. Let's run this. We're going to be using the tools here; the circumference tool is the only tool we're using, just one, and we need to pass it as a list into the tools parameter of initialize_agent.
The agent we're going to be using is chat-conversational-react-description. Chat means we're using a chat model. Conversational means it has conversational memory. React refers to the ReAct framework, a reasoning and action framework; with this, the model is able to reason about what steps to take and also take actions, like using a tool, based on those thoughts. And description refers to the fact that the large language model decides which tool to use based on the descriptions of those tools.
Alright, we pass in our large language model and set verbose to true because we're developing this, so we want to see everything that is happening. If you are pushing this out to customers you don't necessarily need it. Then max_iterations: this is saying how many steps of reasoning, action, and observation the agent is allowed to take before we say stop.
This is important because it can get stuck in an infinite loop of just trying tool after tool after tool, and we don't want it to do that; it's going to cost money, for starters. Then we also have the early stopping method: generate means that when the agent does have to stop, the model generates a final answer rather than just cutting off.
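All of that comes together in the agent initialization, roughly like this; the max_iterations value here is just illustrative.

```python
from langchain.agents import initialize_agent

tools = [CircumferenceTool()]

agent = initialize_agent(
    agent="chat-conversational-react-description",
    tools=tools,
    llm=llm,
    verbose=True,                      # print every thought / action / observation
    max_iterations=3,                  # cap the reasoning-action loop (illustrative value)
    early_stopping_method="generate",  # generate a final answer when forced to stop
    memory=conversational_memory
)
```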
Okay, cool, so I have already run that, and then we can go to our first question, which is: can you calculate the circumference of a circle that has a radius of 7.81 millimeters? Let's see what it comes up with. Okay, so it gets 49.03 millimeters; let's take a look at what the actual answer is.
Okay, 49.07, so it got close, but it's not really accurate. And if we take a look at why this isn't accurate, even though we've already passed in that tool, it's actually because it didn't use the tool. It jumped straight to the final answer, and the reason it's doing that is because this model is overly confident in its own ability to do maths, so to fix that we basically need to tell it that it's terrible at maths.
So we need to update the system message, the initial message that the model is given, to tell it that it can't do maths. Let's have a look at the existing prompt. Okay, so it's just "Assistant is a large language model trained by OpenAI, designed to assist with a wide range of tasks", so on and so on.
Basically it's telling the model it can do anything, but we don't want it to do everything; we want it to not do maths. So I'm taking that same prompt but adding a new line saying: unfortunately, Assistant is terrible at maths. When provided with math questions, no matter how simple, Assistant always refers to its trusty tools and absolutely does not try to answer math questions by itself.
So we just add this in, and this is enough for the model to decide: okay, actually I shouldn't just try and guess the answer, I should use my circumference calculator tool. So we have our new system message here and we need to update our prompt. The prompt actually contains the system message, some other things, and also the tools that the agent has available, so we need to create a new prompt using both of those things and then set our prompt to that new prompt.
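In code, that update looks roughly like the sketch below. The attribute path into the agent's prompt is how it worked in the LangChain version used here and may differ in newer releases, and the system message is abbreviated.

```python
sys_msg = """Assistant is a large language model trained by OpenAI.
...
Unfortunately, Assistant is terrible at maths. When provided with math questions, no matter how simple, Assistant always refers to its trusty tools and absolutely does NOT try to answer math questions by itself.
...
"""  # abbreviated; the full text follows the original system message

# rebuild the prompt from the new system message plus the available tools
new_prompt = agent.agent.create_prompt(
    system_message=sys_msg,
    tools=tools
)
# swap the agent's prompt for the new one
agent.agent.llm_chain.prompt = new_prompt
```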
Okay, now let's try asking the exact same question again and see what happens. Great, so it's using the circumference calculator this time; it inputs 7.81, and the output that it returns is 49.07, which is accurate because it is literally running some Python code to calculate that.
So then the output is: the circumference of a circle with a radius of 7.81 millimeters is approximately 49.07 millimeters, which is much more accurate. That's pretty cool. Now, in that example we just used one single input and returned a single output. What if we would like to use a tool with multiple inputs?
Okay, we can also do that. The inputs here are just specific to the tool we're using, so let me describe the tool. What we have here is a hypotenuse calculator. We can calculate the hypotenuse of a triangle in different ways depending on what we have: if we have the adjacent side and the opposite side, we take the square root of the adjacent side squared plus the opposite side squared. If we have an angle and one of those sides, we can use the adjacent side over the cosine of the angle, or the opposite side over the sine of the angle. So there are multiple options here, multiple ways we can go about this.
So because of this we have multiple inputs, and we actually want the large language model to decide which of these inputs to use. So we're telling it: use this tool when you need to calculate the length of a hypotenuse given one or two sides of a triangle and/or an angle in degrees.
To use the tool, you must provide at least two of the following parameters: adjacent side, opposite side, or angle. With that, that's actually all we need, and this should work. Again, we're not using the async _arun, we're just using _run. So we run that and set our new tools list. Here we could also include the previous circumference calculator tool, but I just want to show you how to build custom tools, so we're going to include the single tool.
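Here's a sketch of roughly what that hypotenuse tool looks like, with all three parameters optional so the model can supply whichever combination it has. The class name is my own, and I've used math.radians for the angle conversion, so double-check the trigonometry against your own notebook.

```python
from math import sqrt, cos, sin, radians
from typing import Optional, Union

from langchain.tools import BaseTool


class PythagorasTool(BaseTool):
    name = "Hypotenuse calculator"
    description = (
        "use this tool when you need to calculate the length of a hypotenuse "
        "given one or two sides of a triangle and/or an angle (in degrees). "
        "To use the tool you must provide at least two of the following "
        "parameters: 'adjacent_side', 'opposite_side', 'angle'."
    )

    def _run(
        self,
        adjacent_side: Optional[Union[int, float]] = None,
        opposite_side: Optional[Union[int, float]] = None,
        angle: Optional[Union[int, float]] = None,
    ):
        # pick the formula based on which parameters the model actually provided
        if adjacent_side and opposite_side:
            return sqrt(float(adjacent_side) ** 2 + float(opposite_side) ** 2)
        elif adjacent_side and angle:
            return float(adjacent_side) / cos(radians(float(angle)))
        elif opposite_side and angle:
            return float(opposite_side) / sin(radians(float(angle)))
        else:
            return (
                "Could not calculate the hypotenuse. Provide at least two of "
                "'adjacent_side', 'opposite_side', or 'angle'."
            )

    def _arun(self, *args, **kwargs):
        raise NotImplementedError("This tool does not support async")


tools = [PythagorasTool()]
```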
Then we need to create a new prompt. We create our new prompt with this code here, and then, because we just changed the tools that the agent has access to, we also need to update those in the agent itself. So we also update the agent's tools with our new tools list.
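The prompt and tool update is then just a couple of lines, along the same lines as before (again, the attribute paths reflect the LangChain version used in this series):

```python
# rebuild the prompt so it lists the new tool, then update the agent's own tool list
new_prompt = agent.agent.create_prompt(
    system_message=sys_msg,
    tools=tools
)
agent.agent.llm_chain.prompt = new_prompt
agent.tools = tools
```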
And now we can ask the question: if I have a triangle with two sides of length 51 and 34 centimeters, what is the length of the hypotenuse? Let's try. It goes into the agent executor chain, goes to the correct action, inputs the correct items, both of the sides, and there we go, we get our answer: the length of the hypotenuse is approximately 61.29 centimeters.
Now let's try this: rather than giving both sides, we're now going to give the opposite side and an angle. Let's see if that works. I haven't double-checked the logic here, so I'm not actually sure if the result is correct, but all we really care about is showing that the inputs and the outputs are being used correctly by the model.
Yeah, it uses the opposite side and angle, and the observation, the calculated output, is 55.86, which we get down here. So that seems to be working, which is pretty cool. Now, both of these are pretty simple examples, very simple tools. We can do a lot more, and what I want to show you now is actually using tools to give your GPT model, or any large language model, access to other deep learning models.
Because large language models can't do everything; we still use other deep learning models for other things. What I want to do is take inspiration from the HuggingGPT paper. The idea there is that you use a large language model as a controller, and that controller refers to other models, which are all open-source models from Hugging Face.
One tool they have in the paper is an image captioning model, so I thought, okay, we can try the same. We're using a different model, though: I found that the one they use in the paper, at least in that visual, wasn't the best, so we're going to use what I think is a more recent model called BLIP, and we'll use that for image captioning. Actually, I didn't even update this here, so initially I tried with that other model; let's update it. Cool. So what we do is, from transformers, we import the processor and the actual model itself, and here we're just saying: if we have a CUDA-enabled GPU available, use that rather than the CPU; it just means things will be a bit faster.
Now, the processor will process any text we pass in (in this example we don't, but it does do that, tokenizing the text), and it will also pre-process any images. BLIP expects a specific dimensionality for images and expects the image pixels to be normalized, so all of that is handled by the processor. After that we pass everything to the model, which is BLIP for conditional generation.
Conditional generation just means that, given some input, the model will start generating text. I don't know exactly how it works internally, but I think it takes whatever it sees in the images, converts them into an array or a single vector, and uses that to condition the text generation. So we initialize that and move it to our CUDA-enabled GPU if it's available.
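That initialization looks something like this; the exact checkpoint name is an assumption on my part, and any BLIP captioning checkpoint on the Hugging Face Hub should slot in the same way.

```python
import torch
from transformers import BlipProcessor, BlipForConditionalGeneration

# use a CUDA-enabled GPU if one is available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# checkpoint name assumed here; swap in whichever BLIP captioning model you're using
model_id = "Salesforce/blip-image-captioning-large"

processor = BlipProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id).to(device)
```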
Okay, cool. To generate these captions we actually need an image, so I'm going to start with this image here: I'm going to download it with requests and then open it with PIL. That will just create a Python image object that we can view, and it is also the format that the BLIP processor expects to be fed.
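Downloading and opening the image is just requests plus PIL; the URL below is a placeholder, so use the one from the notebook.

```python
import requests
from PIL import Image

# placeholder URL; substitute the orangutan image URL from the notebook
img_url = "https://example.com/orangutan.jpg"

# stream the bytes down and open them as a PIL image (the format BLIP's processor expects)
image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
```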
Okay, so that is loaded; it takes a little bit of time, and it's just a picture of an orangutan in a tree. Let's go ahead and pass that to our processor, and then we're going to generate some text based on the inputs from that. We also limit the generation to 20 tokens. I think that is already the default, but the default value is going to be deprecated at some point, so you should set the number of tokens you want here. This outputs an array of tokens, so we then use the processor's decode method to convert those tokens into human-readable text. Let's run that and see what it thinks we have. Okay, processor is not defined, I need to run this cell first. So then we come down here, run this, and now we should get our output: there is a monkey that is sitting in a tree, which is accurate.

So now what we want to do is use that same processing logic in our tool, and that's what we're doing here. All this code is basically just what we went through, so I'm not going to go through it again. All we do is pass in the URL, a string, which is where we're going to be downloading the image from (this should actually say URL). The description that we're using here is: use this tool when given the URL of an image that you'd like to be described; it will return a simple caption describing the image. So we're just telling it when to use the tool and what it does. Then we reinitialize our tools list and run this.

The system message we created before mentioned that thing about, you know, unfortunately you're terrible at maths. We're not using any math tools now, so we can remove that; that's all I've done here. Then we reinitialize our prompt and also reinitialize our tools.

And now we can ask about this image URL here. It enters the agent executor chain, goes straight to the image captioner, passes in the URL (this is the same image that we saw before), and the observation is that there is a monkey that is sitting in the tree, so the final answer, if we come down here, is: the image shows a monkey sitting in a tree. So now our large language model agent can actually tell us what is in an image, which is pretty cool, and obviously we can pair this with multiple different deep learning models like they did in the HuggingGPT paper.

Let's try a few more. I'm going to load this image, which is just some guy surfing, and if we ask the agent what is in this image, let's see what it comes up with. Okay, we get: a surfer riding a wave in the ocean on a clear day. Again, pretty accurate; I probably wouldn't bet on it being a clear day, but maybe it is.

So let's come down to the final one; this one's a little more difficult, so let's see what it says. In this image you can see there's a baby crocodile, I think, with a narrow snout, and it's in a river, sitting on a log. We ask what is in this image, and the observation we get is that this is a lizard that is sitting on a tree branch in the water. So it hasn't been specific about whether it's a crocodile or an alligator, it's just said a lizard, so, you know, not perfect, but I think it's still pretty good.
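A lot happened there, so here's a rough sketch tying it together: the manual captioning call first, then the same logic wrapped up as a LangChain tool the agent can call with an image URL. The class name and exact preprocessing arguments are my own; the description is the one from the video.

```python
# manual captioning: preprocess the image, generate up to 20 new tokens, decode to text
inputs = processor(image, return_tensors="pt").to(device)
out = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(out[0], skip_special_tokens=True))
# e.g. "there is a monkey that is sitting in a tree"


class ImageCaptionTool(BaseTool):
    name = "Image captioner"
    description = (
        "use this tool when given the URL of an image that you'd like to be "
        "described. It will return a simple caption describing the image."
    )

    def _run(self, url: str):
        # download the image and open it as a PIL object
        image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
        # preprocess, generate, and decode exactly as above
        inputs = processor(image, return_tensors="pt").to(device)
        out = model.generate(**inputs, max_new_tokens=20)
        return processor.decode(out[0], skip_special_tokens=True)

    def _arun(self, url: str):
        raise NotImplementedError("This tool does not support async")


tools = [ImageCaptionTool()]
```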
Naturally, I think you can already see there's quite a lot that we can do with tools, and we don't have to restrict ourselves to just a single tool per agent either; we can add a ton of different tools and give our agent a ton of options on what it can do. That's actually what we see in the recent Auto-GPT and BabyAGI projects that have kind of gone crazy recently: there's just a ton of different tools that those models can use. They have an internal dialogue that they go through, but a big part of it is the number of tools they can use. So you can really do quite a lot with these, and I think anyone who's actually working with large language models really does need to get familiar with building tools, because they are a key component of being able to do cool stuff with these models. Now, this is pretty introductory; there is a ton of other stuff that we can cover with tools, and for sure there will be future videos going into more detail, building more advanced conversational agents, and plenty more. So that should be pretty interesting, but for now we're going to leave it there. I hope this video has been interesting and useful. Thank you very much for watching, and I will see you again in the next one.
Bye!