Cogito v1 Outperforms Llama 4 | Full Tutorial with LM Studio and LiteLLM

Chapters
0:00 Cogito v1
0:26 Local LLMs with LM Studio
2:38 Python Setup with uv
3:49 Loading our LLM in LM Studio
4:31 Using LM Studio with Python
5:17 Using LiteLLM with LM Studio
9:14 Tools and Agents
12:29 Create a Web Search Tool
16:19 Tool Calling with LiteLLM
20:54 Building a Local Agent
Today we're going to be taking a look at how we can build a fully local agent using the Cogito v1 LLM, which, according to its benchmarks, is on par with, and sometimes better than, even Llama 4. So this is a really high-quality LLM that we can run fully locally, even with various hardware limitations, as they've given us a whole suite of different model sizes to use. In this video we're going to be using the 32-billion-parameter Cogito v1 preview specifically, and we're going to be using LM Studio to download and run our LLM locally.

To download LM Studio you just go to the LM Studio download page and download it, and once it has installed you should see something like this. You may need to switch between User and Developer mode to get access to this page. This is where you load your models so that you can use them locally, but of course if this is your first time using LM Studio you probably don't have any models preloaded, so you'll need to go to Discover and find Cogito (or another LLM if you prefer). In here we're going to go with the 32-billion-parameter model. To download one of these models (let me go into one that I haven't downloaded), you simply click the green download button. That will take a little bit of time.
So whilst we're waiting for that to download, let me show you how to get the code that we're going to be running through. We're going to go into the Aurelio Labs cookbook, then into gen-ai, local, LM Studio, and there is this Cogito v1 notebook. You can download this directly, but because we have all of the requirements within this pyproject.toml file, what I would recommend is that you actually just clone the repo (git clone and then the repo URL). I've already downloaded it, so I don't need to do that again. Instead I'm just going to navigate into the repo, and in here we have the same structure that we saw: local and then LM Studio. So now we are in the right place to begin running everything.
I'm going to be setting up my Python environment and virtual environment using uv. On Mac you can install uv with brew install uv if you don't have it already. Once you have uv installed, you just run uv venv with a Python version; I'm going to be using 3.12.7 here. You can use whatever version you'd like, but if you want an exact replica of what I am doing, I would recommend going with that. When you run that, uv will install the Python version for you and create the virtual environment, which you can then activate with source .venv/bin/activate. So now you'll see that we have our local LM Studio environment set up and activated. You will also need to run uv sync to install everything, and we can quickly check that we have everything installed with uv pip list. You should see in here that we have litellm, google-search-results, graphai-lib, and a few other things, but those are the main ones that we care about right now.
Now let's switch back to LM Studio, and hopefully by this point your model has downloaded, although some of them, especially the 32-billion-parameter and larger models, can take a little while to do so. Once it has finished downloading you'll be able to start the LM Studio server, which you can reach on localhost, port 1234, and you just want to load the LLM that you'd like to use. I'm going to be using the Cogito v1 32-billion-parameter model. That will take a moment for everything to load, and whilst it is loading we can go over to our notebook. This notebook is within the cookbook repo, under local, LM Studio, and it's the Cogito v1 notebook there.

The first thing I want to do is just confirm that I can connect to the LM Studio server, which I do by curling localhost:1234/v1 and hitting the models endpoint, which will list all of the models, assuming that my server is running. If I run that, we get this response, and I can see that the model I want is available.
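If you prefer to run that check from Python rather than curl, a minimal equivalent (assuming LM Studio's default local address of localhost:1234) looks something like this:

```python
import requests

# LM Studio exposes an OpenAI-compatible API; /v1/models lists every model
# the server can currently serve, so we can confirm Cogito v1 is there.
resp = requests.get("http://localhost:1234/v1/models")
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])
```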
What we would typically do now is copy this model ID and bring it down to our notebook, where we enter it as our model. Because we're using LiteLLM, which is another library I'll talk about in a moment, we need to tell LiteLLM that we are using LM Studio, so the model string is lm_studio, a forward slash, and then the model name exactly as listed above. If you were using Llama 4, for example, you would use that model ID instead. So we can run that.

There are a few things to explain here. LiteLLM is essentially a library that allows you to easily interface with various LLM providers, so Mistral, Cohere, OpenAI, Anthropic, all of those providers you can connect to via the same syntax, which is nice, and it also includes local LLM providers like LM Studio or Ollama. To connect to LM Studio we specify in the prefix that we are connecting to LM Studio, and we also need to set the LM Studio API base. I think this is actually the default value, although I could be wrong, but in any case we want to be explicit here, so I'm setting it to the /v1 of the API. Then we hit the completion endpoint with our model defined, and we do have to set a dummy API key, so that's what I'm doing there. It's just a single message that I'm sending to my LLM, "Hello, how are you?", and we'll just see what happens.

We can switch over to LM Studio and see that something has been happening: in this green area it generated a prediction, so something is happening on the server side. Switching back to our notebook, we'll see that we have this model response, and if we scroll over to the right we can see that the response is "Hi, I'm doing well, thank you for asking. How can I help you today?" So that is the response that we got from our fully local LLM, the Cogito v1 model. What I've shown here is a synchronous completion.
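As a sketch, that synchronous call might look like the following; the model ID is a placeholder (copy yours from the /v1/models output) and I'm assuming LM Studio's default local address:

```python
import os
from litellm import completion

# LiteLLM's "lm_studio/" prefix routes the request to a local LM Studio server.
os.environ["LM_STUDIO_API_BASE"] = "http://localhost:1234/v1"  # be explicit about the local server
os.environ["LM_STUDIO_API_KEY"] = "dummy-key"  # LM Studio ignores the key, but one must be set

response = completion(
    model="lm_studio/cogito-v1-preview-qwen-32b",  # placeholder ID; use the one listed by /v1/models
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(response.choices[0].message.content)
```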
Most of the time, in pretty much every use case, you typically end up doing streaming or async, at least in a lot of use cases, so I'm just going to show you very quickly how we do that. Async is handled with acompletion, and streaming is turned on by setting stream equal to True, and that's actually it. We await our async completion call, and then we simply do async for chunk in response, and as the response receives the tokens, or chunks, we print them. There's quite a lot of stuff coming through here, so the raw chunks are pretty hard to parse, and what we can do is parse them with code. When I say hard to parse, I'm referring to being able to read this easily, so I'm going to parse it with code and make it easier for us as people to read. We do that here, and we can see that this is much easier to read, and it's what we're expecting from streaming. So what did I do? I just said: if we have a token from the chunk's choices, delta, content, then we print that token. That's all we're doing; it's super simple.
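Here's a small sketch of that async streaming loop, again using the same placeholder model ID and local server settings:

```python
import asyncio
from litellm import acompletion

async def stream_chat() -> None:
    # acompletion is LiteLLM's async counterpart to completion; stream=True
    # returns an async iterator of chunks instead of a single response.
    response = await acompletion(
        model="lm_studio/cogito-v1-preview-qwen-32b",  # placeholder ID from /v1/models
        messages=[{"role": "user", "content": "Hello, how are you?"}],
        stream=True,
    )
    async for chunk in response:
        token = chunk.choices[0].delta.content
        if token:  # some chunks carry no content (e.g. the final one), so skip those
            print(token, end="", flush=True)

asyncio.run(stream_chat())
```

In a notebook you would simply await stream_chat() rather than wrapping it in asyncio.run.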
Now, what I've just shown you are the basics: we've seen how to set everything up, download the model, set up and run our LM Studio server, and then start hitting it with completion and async completion requests. That is super nice and helpful, but most of the things that at least I'm building, and I think a lot of people are building at the moment, have something to do with agents in some way or another, so that's really what I want to focus on for the remainder of this video. We're going to first jump into building a tool and tool use with our local LLM, and then we're going to construct an agent using what we've seen so far plus the tool calling.

Jumping into tool calls, one thing that we can do in some cases with LiteLLM is check whether our current model supports function calling. However, this isn't always accurate; you can see here it's actually False, and that is not true, this model does support function calling. I'm not entirely sure why this is, and I've generally found this check to be pretty unreliable in telling me whether an LLM actually supports function calling or not, so you can use it, but take what it says with a grain of salt.

So we can do function calling, or tool calling, whatever you want to call it, but there is a small catch, which is that in order to do so we actually need to pretend that this is an OpenAI model. To do that we replace lm_studio in our model prefix with openai. But because LiteLLM now thinks that we're hitting OpenAI, by default it's going to hit OpenAI's base URL, which is api.openai.com or something along those lines. We don't want to be hitting that endpoint because we're actually using LM Studio locally, so we just change the base URL to that localhost:1234/v1 address we saw before. So yes, this is how we pretend that this is actually an OpenAI endpoint and instead hit our LM Studio server.
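A sketch of that "pretend OpenAI" call, with the same placeholder model ID and the base URL pointed back at LM Studio:

```python
from litellm import completion

response = completion(
    model="openai/cogito-v1-preview-qwen-32b",  # openai/ prefix so LiteLLM treats this as an OpenAI-style endpoint
    api_base="http://localhost:1234/v1",        # override the default api.openai.com with LM Studio
    api_key="dummy-key",                        # required by the client, ignored by LM Studio
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(response.choices[0].message.content)
```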
Let's see how that works. You can see that the model is correct, and if we scroll across a little we'll see the completion: "I'm doing well, thank you, how are you today?" and so on. We can also do the same with streaming, and we'll see the same thing again, which is great.

So now that we have this set up with our pretend-OpenAI LLM, we can use function calling. For function calling, or tool calling, we need a tool or function to call, so let's go ahead and set one up. I'm going to be using SerpAPI here; this is a very simple web search API. There are many of them out there, and I wouldn't necessarily even recommend SerpAPI, but it comes with a very easy setup and I think it gives you 100 queries for free every month, so it's a nice tool to use in these tutorials. To get your SerpAPI API key, go to the serpapi.com dashboard; you will need to create an account, but once you've done that you just copy your API key. Once you've copied it, run this cell, and at the top you'll get a little message box saying "enter your SerpAPI key"; enter it and press enter. What we're doing here is using aiohttp to make an asynchronous request, because we're building an async agent and we need everything to be async, including our function calls. So I'm making an async request to SerpAPI with a search query of "latest world news".
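A minimal sketch of that async search function, assuming SerpAPI's standard Google search endpoint and that the key you entered is available in an environment variable:

```python
import os
import aiohttp

async def web_search(query: str) -> dict:
    """Asynchronously query SerpAPI's Google engine and return the raw JSON response."""
    params = {
        "engine": "google",
        "q": query,
        "api_key": os.environ["SERPAPI_API_KEY"],  # assumed env var holding your SerpAPI key
    }
    async with aiohttp.ClientSession() as session:
        async with session.get("https://serpapi.com/search", params=params) as resp:
            resp.raise_for_status()
            return await resp.json()
```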
We can see that we have world news results here. We could expand this; I think there are ten records, or responses, here. That's one of the limitations with SerpAPI: it's just giving us snippets, it's not giving us a big page of information, so it is pretty limited in what you're going to get from it. But that's okay; this is enough for us to get a good idea of what is happening, especially once our LLM takes all of this and uses it to produce a more structured explanation of what we're asking for.

Now, the structure of all this isn't bad, but let's clean it up a little before we start throwing it into our LLM. I'm going to use a Pydantic BaseModel, and for each of these articles, or I suppose responses or links, we're going to extract the title, the source of the article, the link it's coming from, and also the snippet, which is the information that we have here. We're going to get all of that information, and we're going to use a class method to build our Pydantic model from the SerpAPI results directly. We do that, and using a list comprehension we iterate through the organic results, using that class method to generate each of these objects. Once we have this model, we also define a string method, which more easily allows us to build an essentially markdown-formatted view of the world news that we can then feed into our LLM.
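A sketch of what such a model might look like; the exact SerpAPI field names below are assumptions based on the fields described above:

```python
from pydantic import BaseModel

class Article(BaseModel):
    title: str
    source: str
    link: str
    snippet: str

    @classmethod
    def from_serpapi_result(cls, result: dict) -> "Article":
        # Each entry in SerpAPI's "organic_results" list carries roughly these keys;
        # .get() guards against results where a field is missing.
        return cls(
            title=result.get("title", ""),
            source=result.get("source", ""),
            link=result.get("link", ""),
            snippet=result.get("snippet", ""),
        )

    def __str__(self) -> str:
        # Markdown-style formatting so the articles read nicely and can be fed to the LLM
        return f"**{self.title}** ({self.source})\n{self.link}\n\n{self.snippet}"

# Build one Article per organic result returned by web_search:
# articles = [Article.from_serpapi_result(r) for r in results["organic_results"]]
```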
I can show that quickly: if we print str of articles zero, we get this output, and to make it a little cleaner we can also do from IPython.display import Markdown and display it that way, and we get this. Great, so we have that.

Now what I want to do is compress all of that into a single function, which will become our function call, or tool, and that's what we have here. (There's a little bit here we don't actually need; we could keep it, it's not a big deal, but let's leave it out for now.) The only thing we need to do is take our function and turn it into an OpenAI-readable schema, and the way we do that is with the graphai library. It's a very lightweight AI framework, and it includes a nice utility for getting schemas: you pass it a set of callables, or functions, and it gives you back a list of function schemas. With that list of function schemas, we pass it into the tools parameter of the completion or acompletion endpoint, we pass in our messages, which is "tell me about the latest world news", and we specify to our LLM that it can, but doesn't have to, use the tool. That's what tool_choice equals auto means; it's actually the default value, but I want to be explicit and show you everything here so that you know how to change it. If you want to force tool use, you can use "required".
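A sketch of that tool-enabled call, building on the web_search function above; get_schemas stands in for whichever graphai utility produces the OpenAI-style schemas, so treat that name (and the model ID) as assumptions:

```python
import json
from litellm import acompletion

# get_schemas is assumed to be the graphai utility that converts callables
# into OpenAI-style function schemas (the exact name may differ).
schemas = get_schemas([web_search])

response = await acompletion(
    model="openai/cogito-v1-preview-qwen-32b",  # placeholder model ID
    api_base="http://localhost:1234/v1",
    api_key="dummy-key",
    messages=[{"role": "user", "content": "Tell me about the latest world news"}],
    tools=schemas,
    tool_choice="auto",  # the default; "required" would force a tool call
)

tool_call = response.choices[0].message.tool_calls[0]
print(tool_call.function.name)                   # e.g. "web_search"
print(json.loads(tool_call.function.arguments))  # arguments arrive as a JSON string
```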
So let's run that and see what we get. Here I'm parsing the output just to make it a bit more readable for us. I'm checking whether a function has been called, which it will be; what the name of that function is, and we can see it's web_search; and then what the arguments to that function are, which is actually a string, so we'll have to parse it into JSON, but you can see here it's query: "latest world news". Cool.

Now, in this case we don't necessarily need this mapping dictionary here, but if we have many tools it can be pretty useful. It's essentially just mapping from a string, the name of our function, to the function itself; that's all that's happening. We use it like this: we have our tool map, and we pass the function name from our response into that tool map. What that is basically doing is this, the slightly cleaner version of the same thing, except we're using the mapping dictionary to do it. The reason we have this mapping dictionary, in case it isn't clear, is that our LLM is going to provide us with a string, the name of the function to call, so we take that string, map it from the string to the function, and then call it with those arguments.
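Concretely, that mapping and execution step looks something like this sketch:

```python
import json

# Map the tool names the LLM will return (strings) to the actual callables
tool_map = {"web_search": web_search}

tool_name = tool_call.function.name
tool_args = json.loads(tool_call.function.arguments)  # the arguments come back as a JSON string
tool_output = await tool_map[tool_name](**tool_args)  # e.g. web_search(query="latest world news")
```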
We can see here that the output from our tool is the list of these various news headlines and snippets. What we do now is take our tool call, which was generated by our LLM, so the function name and arguments that went in, and format it into an assistant message; then we take the response from our tool and format that into a tool message. Both of these get added to our message history, so that on the next iteration the LLM sees: the user made this query; I, as the assistant, then decided to go and use the web search tool with some particular arguments; and I got this response, which was all of this information. Then we pass that back to our LLM, and the LLM sees all of that and says, okay, I've done what I needed to do, I found this information for the user, and now I'm going to respond directly to the user.
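In code, that bookkeeping is roughly the following; the exact shape accepted for the assistant turn can vary between LiteLLM versions, so this is a sketch rather than the definitive form:

```python
# Record the assistant's decision to call the tool...
messages.append({
    "role": "assistant",
    "content": None,
    "tool_calls": [tool_call],
})

# ...and the tool's output, linked back to the call via its ID,
# so the next completion sees the full query -> tool call -> tool result history.
messages.append({
    "role": "tool",
    "tool_call_id": tool_call.id,
    "content": str(tool_output),
})
```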
So let's see what it does. This can take a little bit of time; as you can see, we're already past 20 seconds. That's mostly because of all the text that we're feeding in from the tool here; there's quite a lot going in, and that increases the call time quite a bit, especially when you're running locally. That's something you would of course optimize, by simply reducing the number of tokens you're feeding in, or pulling out of your LLM via prompting, asking it to be short, for example. We can see that we got our response here: "Here are some of the latest world headlines from various sources", talking about tariffs and various other things. Great, so that's good.
But now what I want to do is take all of what we just did and compress it into this agent class. This agent class is just doing what we've just done: you can see the initialization here, where we're initializing the tool, getting the function schemas, creating that tool mapping, and setting up an initial system message for our agent. Then the real meat of it is inside the async call. The async call says: while I am under the maximum number of iterations, keep going. We set this maximum number of iterations to avoid looping indefinitely and racking up a big bill; well, in this case we wouldn't rack up a big OpenAI bill, but if you were using OpenAI you would, so here we're safe, but to avoid unlimited iterations of our LLM if it gets stuck in a loop, we set this max iterations value, which by default I'm setting to three.

So what we're doing is: the user query comes in, we pass it to our LLM with the async completion, and then we check whether the assistant generated a tool call. If it did, we pull out the tool call information, the tool name, tool arguments, and tool call ID, which is important; otherwise, if there wasn't one, we just set those to None. Then, no matter what happened, we append our assistant message. Finally, if there was a tool call, we call that tool and also append the resulting tool message to the chat history. Otherwise, if there was not a tool call, that means the assistant answered directly, and that probably means it wants to send this response back to the user, so we just break out of our loop. Then we return the final message, which should, in theory, be an assistant message.
So we can initialize that, and now let's use our agent with our web search tool to tell us about the latest world news, and see what happens. Again, this might take a little moment, so I'll just jump ahead. Okay, it took a little bit of time to run, almost a minute, but we can see that we got this response, and it's a pretty good summary of what seems to be happening in the world right now. All of this, or at least everything except the web search component, has been executed and generated using our fully local LLM, the Cogito v1 model.

Of course, you can also switch this out for Llama 4, the new Mistral models, and basically any LLM that is released and is popular enough to be quantized; you will be able to use it. So yeah, this is, in my opinion, pretty exciting. There's a lot of cool stuff you can do with these LLMs, and the coolest part here is simply the fact that we can run all of this locally without needing to hit APIs. That's all I really wanted to cover in this video, so I will leave it there for now. Thank you very much for watching; I hope all of this has been useful and interesting.