Today we're going to be taking a look at how we can build a fully local agent using the Cogito v1 LLM, which according to their benchmarks is on par with, and sometimes better than, even Llama 4. So this is a really high-quality LLM that we can run fully locally, whatever our hardware limitations, as they've given us a whole suite of different model sizes to choose from.
Now in this video we're going to be specifically using this model here, which is the 32-billion-parameter Cogito v1 preview, and we're going to be using LM Studio to download and run our LLM locally. To download LM Studio you just go to the LM Studio download page, download it, and once it has been installed you should be able to see something like this.
You may need to switch between User and Developer over here to give you access to this page. Now, in here you would load your models so that you can use them locally, but of course if this is your first time using LM Studio you probably don't have any models preloaded, so you'll need to go to Discover.
Let's find Cogito, or another LLM if you prefer, but I'm going to use Cogito. In here we're going to go with this 32-billion-parameter model. To download these models (let me go into one that I haven't downloaded yet) you simply click the green download button.
That will take a little bit of time, so whilst we're waiting for that to download let me show you how to get the code that we're going to be running through. We're going to go into the Aurelio Labs cookbook, into gen-ai, then local, then LM Studio, and there is this Cogito v1 notebook.
You can download this directly, but because we have all of the requirements within this pyproject.toml file here, what I would recommend is that you actually just clone the repo. You just write git clone and then the repo URL. I've already downloaded it, so I don't need to do that again.
Instead I'm just going to navigate into the repo, and in here we have the same structure that we saw in the repo, so I'm going to go into gen-ai, then local, then LM Studio. Okay, so now we are in the right place to begin running everything.
I am going to be setting up my Python virtual environment using uv. On a Mac you can install uv with brew install uv if you don't have it already, and once you have uv you would just run uv venv --python 3.12.7. You can use whatever Python version you would like, but if you want an exact replica of what I am doing I would recommend going with that.
Once you run that, it is going to install the Python version for you and create the virtual environment, which you can then activate with source .venv/bin/activate. Okay, so now you'll see that we have our local LM Studio environment set up and activated. You will also need to run uv sync to install everything, and we can quickly check that we do have everything installed with uv pip list. You should see in here that we have litellm, google-search-results, graphai-lib and a few other things, but those are the main ones that we care about right now.
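If you prefer checking from Python, here's a quick sanity check you could run as well (this snippet is mine, not part of the notebook); it assumes the distribution names match those listed by uv pip list:

```python
from importlib.metadata import version

# Check that the main dependencies from pyproject.toml are installed.
for pkg in ("litellm", "google-search-results", "graphai-lib"):
    print(pkg, version(pkg))
```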
Now let's switch back to LM Studio, and hopefully by this point your model download has finished, although some of them, especially the 32-billion-parameter or larger models, can take a little while. Once it has finished downloading you'll be able to start the LM Studio server, which you can reach here on localhost port 1234, and you just want to load the model that you'd like to use. I'm going to be using the Cogito v1 32-billion-parameter model. That will take a moment to load, and whilst it is loading we can go over to our notebook. This notebook is within the cookbook repo under local, LM Studio, and it's the Cogito v1 notebook there.

The first thing I want to do is confirm that I can connect to the LM Studio server, which I do by curling localhost:1234/v1/models. That hits the models endpoint, which will list all the models, assuming that my server is running. So if I run that, I can see that the model I want is available. What we would typically do is copy this model id and bring it down here to use as our model. Because we're using LiteLLM, which is another library I'll talk about in a moment, we need to specify to LiteLLM that we are using LM Studio, so it's lm_studio, a forward slash, and then the model name as listed up here; for example, for Llama 4 you would go with that model id instead. Great, so we can run that.

There are a few things to explain here. LiteLLM is essentially a library that allows you to easily interface with various LLM providers, so Mistral, Cohere, OpenAI, Anthropic, all of those providers you can connect to via the same syntax, which is nice, and it also includes local LLM providers like LM Studio or Ollama. To connect to LM Studio we specify in the prefix here that we are connecting to LM Studio, and we do need to set the LM Studio API base. I think this is actually the default value, although I could be wrong, but in any case we want to be explicit here, so I'm just setting it to the v1 of the API there. Then we are hitting the completion endpoint with our model defined here. We do have to set a dummy API key, so that's what I'm doing there, and it's just a single message that I'm sending to my LLM: "hello, how are you". That's all I'm asking, and we'll just see what happens.

We can also switch over to LM Studio, and you can see that something has been happening here: in this green area it generated this prediction. So we can see something is happening in LM Studio, and we can switch back to our notebook, where we have this model response. We can scroll all the way over to the right here and see that the response is "Hi, I'm doing well, thank you for asking. How can I help you today?". So that is the response that we got from our fully local LLM, the Cogito v1 model.

Now, here I'm showing you how to do a synchronous completion. In a lot of use cases you typically end up doing streaming or async, so I'm just going to show you very quickly how we do that. Async is handled with acompletion, and streaming is turned on by setting stream equal to True, and that's actually it. So we await our async completion endpoint here, and we simply do async for chunk in response, and as the chunks of the response come in, we print them.
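To make that concrete, here's a minimal sketch of both calls with LiteLLM. The model id is a placeholder, so copy the exact id from your own /v1/models listing, and here I pass the LM Studio base URL explicitly via the api_base argument rather than relying on any default:

```python
import asyncio

from litellm import acompletion, completion

MODEL = "lm_studio/cogito-v1-preview-qwen-32b"  # placeholder; use your model id
API_BASE = "http://localhost:1234/v1"           # the local LM Studio server

# Synchronous completion.
response = completion(
    model=MODEL,
    api_base=API_BASE,
    api_key="not-needed",  # dummy key, LM Studio doesn't check it
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(response.choices[0].message.content)


# Async + streaming: chunks arrive as the model generates them.
async def stream_demo() -> None:
    stream = await acompletion(
        model=MODEL,
        api_base=API_BASE,
        api_key="not-needed",
        messages=[{"role": "user", "content": "Hello, how are you?"}],
        stream=True,
    )
    async for chunk in stream:
        print(chunk)  # raw chunks, quite verbose; we parse the tokens out next


asyncio.run(stream_demo())
```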
So we can see there's quite a lot of stuff coming through here, and it's pretty hard to read, so what we can do is parse it with code. When I say hard to parse, I'm referring to being able to read it easily as a person, so I'm going to parse it with code and make it easier for us to follow. We do that here, and we can see that this is much easier to read, and it's what we're expecting from streaming. So what did I do? I just said, if there is a token in the chunk's choices[0].delta.content, then print that token. That's all we're doing; it's super simple.

Now, what I've just shown you are the basics. We have seen how to set everything up, download the model, set up and run our LM Studio server, and then begin hitting that with completion and async completion requests. That is super nice and helpful, but most of the things that at least I'm building, and I think a lot of people are building at the moment, have something to do with agents in some way or another, so that's really what I want to focus on for the remainder of this video. We're going to first jump into building a tool and tool use with our local LLM, and then we're going to construct an agent using what we've seen so far plus the tool calling.

Jumping into tool calls: one thing that we can do in some cases with LiteLLM is check whether our current model supports function calling. However, this isn't always accurate; you can see here it's actually False, and that is not true, this model does support function calling. I'm not entirely sure why, but I've generally found this check to be pretty unreliable in telling me whether an LLM actually supports function calling or not. So you can use it, but take what it says with a grain of salt.

So we can do function calling, or tool calling, whatever you want to call it, but there is a small catch, which is that in order to do so we actually need to pretend that this is an OpenAI model. To do that we replace lm_studio in our model prefix with openai. Then, because LiteLLM now thinks that we're hitting OpenAI, by default it's going to hit OpenAI's base URL, which is api.openai.com or something along those lines. We don't want to be hitting that endpoint, because we're actually using LM Studio locally, so we just change the base URL here to that localhost:1234/v1 that we saw before. So yes, this is how we pretend that this is actually an OpenAI endpoint while really hitting our LM Studio server.

Let's see how that works. You can see that the model is correct, and if we scroll across a little we will see the completion: "I'm doing well, thank you, how are you today", and so on. Cool. We can also do the same with streaming, and we'll see the same thing again, which is great.
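Here's a rough sketch of that trick, again with a placeholder model id. The openai/ prefix makes LiteLLM treat the server as OpenAI-compatible, and api_base redirects it to the local LM Studio server instead of api.openai.com:

```python
from litellm import completion

MODEL_ID = "cogito-v1-preview-qwen-32b"  # placeholder; use your model id

response = completion(
    model=f"openai/{MODEL_ID}",           # pretend it's an OpenAI model...
    api_base="http://localhost:1234/v1",  # ...but point it at LM Studio
    api_key="not-needed",                 # dummy key
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(response.choices[0].message.content)
```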
Now that we have that set up with this pretend OpenAI model, we can use function calling. For function calling, or tool calling, we need a tool or function to call, so let's go ahead and set one up. I'm going to be using SerpAPI here; this is a very simple web search API. There are many of them out there, and I wouldn't necessarily even recommend SerpAPI, but it comes with very easy setup and I think it's 100 free queries every month, so it's a nice tool to use in these tutorials. To get your SerpAPI API key, go to the serpapi.com dashboard; you will need to create an account, but once you have one you just copy your API key.

Once you've copied your API key, run this cell here, and at the top you'll get a little message box saying enter your SerpAPI key; enter it and press enter. Then what we're doing here is using aiohttp to make an asynchronous request, because we're building an async agent and we need everything to be async, including our function calls. So I am making an async request to SerpAPI with a search query of "latest world news", and we can see that we get world news back here. We could expand this; I think there are 10 records or responses here. That is one of the limitations of SerpAPI: it's just giving us snippets rather than a big page of information, so it is pretty limited in what you're going to get. But it's okay; this is enough for us to get a good idea of what is happening, especially once our LLM takes all of this and uses it to produce a more structured explanation of what we're asking for.

Now the structure of all this is not bad, but let's clean it up a little before we start throwing it into our LLM. I'm going to use a Pydantic BaseModel, and what we're going to say is: for each of these articles, or I suppose responses or links, we are going to extract the title, the source of the article, the link where it's coming from, and also the snippet, which is the information that we have here. So we're going to get all that information, and we're going to use this class method to build our Pydantic model from the results directly. We do that, and using our list comprehension here we're just iterating through the organic results and using that from-SerpAPI-result method to generate one of these objects for each result. Once we have this model we also define this string method here, which allows us to build essentially a markdown-formatted world news summary that we can then feed into our LLM. I can show you very quickly: if we print the string of the first article we'll see that we get this here, and to make it a little cleaner we can do from IPython.display import display, Markdown and render it as markdown. Great, so we have that.

Now what I want to do is compress all of that into a single function that will become our function call, or tool, which is what we have here. The only thing that we then need to do is take our function and turn it into an OpenAI-readable schema, and the way that we do that is with the graphai library. It's a very lightweight AI framework, and it includes a nice utility for getting schemas: you pass it a set of callables, or functions, and it gives you back a list of function schemas. With that list of function schemas, we just pass it into the tools parameter of the completion or acompletion endpoint, we pass in our messages, which is "tell me about the latest world news", and we specify to our LLM whether it can use the tool. That's what tool_choice="auto" is; this is actually the default value, but I just want to be explicit and show you everything here so that you know how to change it. If you want to force tool use, you can use "required".
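Stepping back, this is a minimal sketch of how the search tool and Article model can fit together. The field names come from SerpAPI's organic results as we saw above, but the function name, the environment variable, and the exact formatting here are my own, and the notebook's version will differ in detail:

```python
import os

import aiohttp
from pydantic import BaseModel

SERPAPI_KEY = os.environ["SERPAPI_API_KEY"]  # however you choose to store the key


class Article(BaseModel):
    title: str
    source: str
    link: str
    snippet: str

    @classmethod
    def from_serpapi_result(cls, result: dict) -> "Article":
        # SerpAPI organic results usually carry these fields.
        return cls(
            title=result["title"],
            source=result.get("source", ""),
            link=result["link"],
            snippet=result.get("snippet", ""),
        )

    def __str__(self) -> str:
        # Markdown-ish formatting that we can feed straight to the LLM.
        return (
            f"### {self.title}\n"
            f"**Source:** {self.source}\n"
            f"**Link:** {self.link}\n\n"
            f"{self.snippet}\n"
        )


async def web_search(query: str) -> str:
    """Search the web and return a markdown summary of the top results."""
    async with aiohttp.ClientSession() as session:
        async with session.get(
            "https://serpapi.com/search.json",
            params={"engine": "google", "q": query, "api_key": SERPAPI_KEY},
        ) as resp:
            data = await resp.json()
    articles = [Article.from_serpapi_result(r) for r in data.get("organic_results", [])]
    return "\n".join(str(a) for a in articles)
```

In the notebook the OpenAI-readable schema for this function is generated with graphai's schema utility; you could equally write the JSON schema out by hand as a plain dict.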
Cool, so let's run that and see what we get. Here I'm parsing the output just to make it a bit more readable for us. I'm looking at whether a function has been called, and if so what the name of that function is; we can see it's web_search. Then what are the arguments to that function? They actually come back as a string, so we'll have to parse them into JSON, but you can see here the query is "latest world news". Cool.

Now, in this case we don't necessarily need this mapping dictionary here, but if we have many tools it can be pretty useful. It's essentially just mapping from a string, the name of our function, to the function itself. We use it like this: we have our tool map, and we pass the function name from our response into that tool map; what that is doing is basically this, the cleaner version of the same thing, but using the mapping dictionary instead. The reason we have this mapping dictionary, in case it isn't clear, is that our LLM provides us with a string, which is the name of the function to call, so we take that string, map it to the function, and then call it with those arguments. We can see here that the output from our tool is the list of these various news headlines and snippets.

What we can do now is take our tool call, which was generated by our LLM, so the function name and arguments that go in, and format that into an assistant message, and then take the response from our tool and format that into a tool message. Both of these are added to our message history so that on the next iteration the LLM sees: the user made this query, I as the assistant then decided to use the web search tool with these particular arguments, and I got this response with all of this information. Then we pass that back to our LLM, and the LLM sees all of that and says, okay, I've done what I needed to do, I found this information for the user, now I'm going to respond directly to the user.

So let's see what it does. This can take a little bit of time; as you can see here we're up to more than 20 seconds already. That's mostly because of all of the text that we're inputting from the tool, because there is quite a lot going in, and that increases the call time quite a bit, especially when you're running locally. That is something you would of course optimize, simply by reducing the number of tokens that you're feeding into or pulling out of your LLM, for example by prompting it to keep things short. Okay, so we can see that we got our response here: "here are some of the latest world headlines from various sources", talking about tariffs and various other things. Great.

That's good, but now what I want to do is take all of what we just did and compress it into this agent class here. This agent class is just doing what we've just done. We can see the initialization here: we're initializing the tool, getting the function schemas, creating that tool mapping, and setting up an initial system message for our agent. Then the real meat of this is inside the async call, which says: while I am under the maximum number of iterations, keep going. We set this maximum number of iterations to avoid looping indefinitely and racking up a big bill. Well, in this case we wouldn't rack up a big OpenAI bill; if you were using OpenAI you would, so here we're safe, but we still want to avoid unlimited iterations of our LLM if it gets stuck in a loop, so we just set this max iterations.
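Here's a minimal sketch of what an agent class along those lines can look like. The class and parameter names are mine, and it assumes the openai/ prefix plus local api_base trick from earlier, along with a tool_map like the one we built above:

```python
import json

from litellm import acompletion


class Agent:
    def __init__(self, tool_schemas: list, tool_map: dict, model: str,
                 api_base: str = "http://localhost:1234/v1", max_iterations: int = 3):
        self.tool_schemas = tool_schemas      # OpenAI-style function schemas
        self.tool_map = tool_map              # tool name -> async callable
        self.model = model                    # e.g. "openai/<your-model-id>"
        self.api_base = api_base              # local LM Studio server
        self.max_iterations = max_iterations  # stop runaway loops
        self.system = {"role": "system", "content": "You are a helpful assistant."}

    async def __call__(self, query: str) -> dict:
        messages = [self.system, {"role": "user", "content": query}]
        for _ in range(self.max_iterations):
            response = await acompletion(
                model=self.model,
                api_base=self.api_base,
                api_key="not-needed",  # dummy key
                messages=messages,
                tools=self.tool_schemas,
                tool_choice="auto",
            )
            message = response.choices[0].message
            tool_calls = message.tool_calls or []
            # Always append the assistant turn to the history.
            assistant_msg = {"role": "assistant", "content": message.content}
            if tool_calls:
                assistant_msg["tool_calls"] = [
                    {"id": c.id, "type": "function",
                     "function": {"name": c.function.name,
                                  "arguments": c.function.arguments}}
                    for c in tool_calls
                ]
            messages.append(assistant_msg)
            if not tool_calls:
                # No tool call means the assistant answered the user directly.
                return assistant_msg
            # Otherwise execute the requested tool and append its output.
            call = tool_calls[0]
            args = json.loads(call.function.arguments)
            result = await self.tool_map[call.function.name](**args)
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": str(result)})
        return messages[-1]
```

You would use it roughly as agent = Agent(schemas, {"web_search": web_search}, model="openai/<your-model-id>"), then await agent("tell me about the latest world news").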
By default I'm setting that maximum to three here. So what we're doing is: we have our user query that comes in, we pass that to our LLM with the async completion, and then we check whether the assistant generated a tool call. If it did, we pull out that tool call information: the tool name, the tool arguments, and the tool call id, which is important. Otherwise, if there wasn't one, we just set those to None. Then, no matter what happened, we append our assistant message, and finally we say: if there was a tool call, I'm going to call that tool and also append the resulting tool message to the chat history. Otherwise, if there was not a tool call, that means the assistant answered directly, which probably means the assistant wants to send this response back to the user, so we just break out of our loop and return the final message, which should in theory be an assistant message.

So we can initialize that, and now let's use our agent with our web search tool to tell us about the latest world news and see what happens. Again this might take a little moment, so I will just jump ahead. Okay, it took a little bit of time to run, almost a minute, but we can see that we got this response, and we have a pretty good summary of what seems to be happening in the world right now. All of this, or at least everything except the web search component, has been executed and generated using our fully local LLM with the Cogito v1 model.

Of course, you can also switch this out for Llama 4, the new Mistral models, or basically any LLM that is released and is popular enough to be quantized; you will be able to use it. So yeah, this is in my opinion pretty exciting. There's a lot of cool stuff you can do with these LLMs, and the coolest part here is just the fact that we can run all of this locally without needing to hit APIs. That's all I really wanted to cover in this video, so I will leave it there for now. Thank you very much for watching, I hope all this has been useful and interesting, and I will see you again in the next one. Bye.