
Cogito v1 Outperforms Llama 4 | Full Tutorial with LM Studio and LiteLLM


Chapters

0:00 Cogito v1
0:26 Local LLMs with LM Studio
2:38 Python Setup with uv
3:49 Loading our LLM in LM Studio
4:31 Using LM Studio with Python
5:17 Using LiteLLM with LM Studio
9:14 Tools and Agents
12:29 Create a Web Search Tool
16:19 Tool Calling with LiteLLM
20:54 Building a Local Agent

Whisper Transcript

00:00:00.000 | Today we're going to be taking a look at how we can build a fully local agent using the Cogito
00:00:06.820 | v1 LLM, which according to their benchmarks is on par with, and sometimes better than, even Llama 4.
00:00:13.920 | So this is a really high-quality LLM that we can run fully locally even with various hardware limitations,
00:00:20.640 | as they've given us a whole suite of different model sizes to go ahead and use. Now in this
00:00:27.020 | video we're going to be specifically using this model here which is the 32 billion parameter Cogito
00:00:32.180 | v1 preview, and we're going to be using LM Studio to download and run our LLM locally. So to download
00:00:42.300 | LM Studio you will just go to the LM Studio download page, download it, and once it has been downloaded
00:00:49.280 | you should be able to see something like this. You may need to switch between user and developer
00:00:56.180 | over here to give you access to this page here. Now in here you would load your models so that you can
00:01:03.960 | use them locally, but of course if this is your first time using LM Studio you probably don't have any
00:01:09.380 | models preloaded, so you'll need to go to Discover. We'll find Cogito, or another LLM if you prefer, but
00:01:18.400 | we're going to use Cogito. Now in here we're going to go ahead and go with this 32 billion parameter model. Now to download these models
00:01:26.240 | let me go into one that I haven't downloaded. You would simply click the green download button. That will take a little bit of time
00:01:34.260 | So whilst we're waiting for that to download let me show you how to get the code that we're going to be running through.
00:01:42.040 | So we're going to go into the Aurelio Labs cookbook. We are going into gen-ai,
00:01:48.820 | local, LM Studio, and there is this Cogito v1 notebook. You can download this directly,
00:01:56.680 | but because we have all of the requirements within this pyproject.toml file
00:02:02.980 | here, what I would recommend is that you actually just clone the repo
00:02:06.200 | with this here. You just write git clone
00:02:12.520 | and then the repo name. I've already downloaded it so I don't need to go ahead and do that again.
00:02:18.360 | Instead I'm just going to navigate into the repo and then in here we have the same structure that we saw
00:02:24.840 | in the repo, so I'm going to go into gen-ai
00:02:28.040 | and then local and LM Studio. Okay, so now in here we are in the right place to
00:02:37.160 | begin running everything. I am going to be setting up my Python environment and virtual environment
00:02:41.560 | using uv. On Mac you can use brew install uv, like so, if you don't have uv already, and once you have
00:02:49.160 | installed uv you would just do uv venv with a Python version; I'm going to be using 3.12.7 here. You can use
00:02:57.480 | whatever version you would like but if you want an exact replica of what I am doing I would recommend
00:03:03.880 | going with that. Once you run that, it is going to install the Python version for you and create the
00:03:10.760 | virtual environment, which you can then activate with source .venv/bin/activate. Okay, so now you'll see that
00:03:19.560 | we have our local LM Studio environment set up and activated. Now you will also need to do uv sync
00:03:29.480 | to install everything, and we can just quickly check that we do have everything installed with uv pip list,
00:03:35.880 | and you should see in here that we have LiteLLM, and we should see google-search-results, graphai-lib,
00:03:43.880 | and a few other things but they're the main ones that we care about right now. Now let's switch back to
00:03:50.600 | LM Studio, and hopefully by this point your model download may have finished, although some of them, especially
00:03:56.840 | those 32 billion parameter or larger models, can take a little while to do so. But once it has
00:04:04.840 | finished downloading you'll be able to, one, begin the LM Studio server, which you can reach here. Okay,
00:04:13.320 | this is your localhost port 1234, and you just want to load the model that you'd like to use.
00:04:22.520 | I'm going to be using the Cogito v1 32 billion parameter model. Okay, and that will just take a
00:04:28.440 | moment for everything to load, and whilst it is loading we can go over to our notebook. Now this notebook is
00:04:37.480 | within the cookbook repo, under local, LM Studio, and it's the Cogito v1 notebook there. Now the first thing I want to
00:04:45.960 | do is actually just confirm that I can connect to the LM Studio server, which I do by just curling localhost
00:04:57.160 | 1234/v1, which is the v1 API, and I'm hitting the models endpoint, which will list all the models, assuming
00:05:08.440 | that my server is running. So if I run that we will get this, and I can see in here that the model I want is available.
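For reference, here is a minimal sketch of the same check done from Python instead of curl, assuming the requests package is available and the LM Studio server is on its default port; the model IDs listed will depend on what you have downloaded:

```python
import requests

# Ask the local LM Studio server (OpenAI-compatible API) which models it is serving.
resp = requests.get("http://localhost:1234/v1/models")
resp.raise_for_status()

for model in resp.json()["data"]:
    print(model["id"])
```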
00:05:16.680 | So what we would typically do is actually copy this model ID and bring it down
00:05:24.600 | to here and just enter it here as our model. Because we're using LiteLLM, which is another
00:05:34.120 | library I'll talk about in a moment, we need to specify to LiteLLM that we are using LM Studio,
00:05:40.360 | so it's lm_studio forward slash and then the model name, as per the defined model name up here. So for
00:05:49.080 | example, for Llama 4 you would be going with this model here. Great, so we can run that. There are a few things
00:05:54.520 | that we need to do here. Okay, so LiteLLM is essentially a library that allows you
00:06:01.240 | to easily interface with various LLM providers, so Mistral, Cohere, OpenAI, Anthropic, all those providers
00:06:11.000 | you can connect to via the same syntax, which is nice, and it also includes local LLM
00:06:18.280 | providers like LM Studio or Ollama. So to connect to LM Studio we specify in the prefix here that we are
00:06:25.800 | connecting to LM Studio. We do need to set the LM Studio API base; this I think is actually
00:06:34.600 | what it uses by default, although I could be wrong, but in any case we do want to be explicit here,
00:06:39.800 | so I'm just setting the v1 of the API there. Then we are hitting the completion endpoint with our model
00:06:47.160 | defined here. We do have to set a dummy API key, so that's what I'm doing there, and it's just a single
00:06:53.480 | message that I'm sending to my LLM here: hello, how are you. That's all I'm asking and we'll just see what
00:07:00.120 | happens. Okay, so we can also switch over to LM Studio here and you can see that something has been happening
00:07:09.080 | here; you can see in this green area it generated this prediction, so we see something is happening in
00:07:15.960 | LM Studio. We can switch back to our notebook and we'll see that we have this model response, and we
00:07:23.080 | can go all the way over to right here and see that the response is: hi, I'm doing well, thank you for
00:07:29.080 | asking, how can I help you today. Okay, so that is the response that we got from our fully local LLM,
00:07:36.280 | which is the Cogito v1 model. Now, here I'm showing you how to do synchronous completion.
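As a minimal sketch of that synchronous call, assuming a model ID along the lines of what LM Studio reports for the 32B Cogito preview (yours may be named slightly differently):

```python
import litellm

# The "lm_studio/" prefix tells LiteLLM to route the request to a local
# LM Studio server. The model ID after the slash is whatever /v1/models
# reported; the one below is an assumption for illustration.
response = litellm.completion(
    model="lm_studio/cogito-v1-preview-qwen-32b",
    api_base="http://localhost:1234/v1",   # be explicit about the local server
    api_key="not-needed-locally",          # dummy key, LM Studio ignores it
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)

print(response.choices[0].message.content)
```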
00:07:43.240 | Most of the time, in pretty much every use case, you typically end up doing streaming or async, at least in a
00:07:54.440 | lot of use cases, so I'm just going to show you very quickly how we do that. Okay, so async is handled with
00:08:00.280 | acompletion, and streaming is turned on by setting stream equal to True, and that's actually it. So we
00:08:07.240 | await our async completion endpoint here and we simply do async for chunk in response. Okay, we should
00:08:14.840 | see here that as the response receives the tokens or chunks we are going to print them. Okay, so we can see
00:08:25.160 | there's quite a lot of stuff coming through here, so this is pretty hard to parse, and what we can do is
00:08:38.040 | parse it with code. Okay, so when I say hard to parse I'm referring to being able to read this easily,
00:08:44.360 | so I'm going to parse this with code and make it easier for us as people to read.
00:08:50.680 | So we'll do that here, and we can see that, okay, this is much easier to read and it's kind of what
00:08:57.400 | we're expecting from streaming. So what did I do? I just said, okay, if we have a token from the chunk
00:09:05.800 | choices delta content, then we're going to print that token. That's all we're doing; it's super simple.
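A rough sketch of that async streaming loop, under the same assumed model ID and local API base as above:

```python
import asyncio
import litellm

async def stream_demo() -> None:
    # acompletion is LiteLLM's async counterpart to completion;
    # stream=True makes it yield chunks as tokens are generated.
    response = await litellm.acompletion(
        model="lm_studio/cogito-v1-preview-qwen-32b",  # assumed model ID
        api_base="http://localhost:1234/v1",
        api_key="not-needed-locally",
        messages=[{"role": "user", "content": "Hello, how are you?"}],
        stream=True,
    )
    async for chunk in response:
        # Each chunk carries a small delta; content can be None on some chunks.
        token = chunk.choices[0].delta.content
        if token:
            print(token, end="", flush=True)

asyncio.run(stream_demo())
```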
00:09:15.000 | Now, what I've just shown you are the basics. We have seen how to set everything up, download the model,
00:09:23.480 | set up and run our LM Studio server, and then begin hitting that with completion and
00:09:29.560 | async completion requests. Now, that is super nice and helpful, but most of the things that at least I'm
00:09:39.720 | building, and I think a lot of people are building at the moment, have something to do with agents in some
00:09:44.840 | way or another, so that's really what I want to focus on and look at for the remainder of this video.
00:09:52.360 | So we're going to first jump into building a tool and tool use with our local LLM, and then we're
00:10:00.040 | going to construct an agent using what we've seen so far and the tool calling as well. So, jumping into tool
00:10:08.680 | calls, one thing that we can do in some cases with LiteLLM is check if our current model supports
00:10:17.000 | function calling. However, this isn't always accurate, so you can see here it's actually False, and that is
00:10:23.720 | not true; this model does support function calling. I'm not entirely sure why this is, and I've generally
00:10:30.680 | found this to be pretty unreliable in telling me whether an LLM does actually support function calling or not,
00:10:36.440 | so yeah, you can use this, but just take what it says with a grain of salt.
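The check being described is roughly the one below; depending on your LiteLLM version it may report False (or complain) for local models that are missing from LiteLLM's capability registry, which is exactly why it should be taken with a grain of salt:

```python
import litellm

# LiteLLM keeps a registry of model capabilities; models served locally through
# LM Studio are often missing from it, so this can report False incorrectly.
print(litellm.supports_function_calling(model="lm_studio/cogito-v1-preview-qwen-32b"))
```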
00:10:46.280 | So we can do function calling, or tool calling, whatever you want to call it,
00:10:52.600 | but there is a small catch, which is that in order to do so we actually need to pretend that this is an
00:11:02.840 | OpenAI model, and to do that we need to replace lm_studio in our model prefix with openai. Then, because
00:11:13.880 | LiteLLM now thinks that we're hitting OpenAI, by default it's going to hit the base URL
00:11:20.840 | for OpenAI, which is api.openai.com or something along those lines.
00:11:25.880 | We don't want to be hitting that endpoint because we're actually using LM Studio locally, so we just
00:11:33.400 | change the base URL here; it's that localhost:1234/v1 that we saw before.
00:11:40.360 | So yes, this is how we would pretend that this is actually an OpenAI endpoint and instead
00:11:48.200 | hit our LM Studio server, so let's see how that works.
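A small sketch of that switch, with the same assumed model ID; only the prefix and the base URL change:

```python
import litellm

# Route through LiteLLM's OpenAI provider (so tool calling uses the OpenAI
# request format), but point the base URL at the local LM Studio server.
response = litellm.completion(
    model="openai/cogito-v1-preview-qwen-32b",   # assumed model ID
    api_base="http://localhost:1234/v1",
    api_key="not-needed-locally",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)

print(response.choices[0].message.content)
```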
00:11:56.680 | Okay, so you can see that we have the model correct, and if we come across a little bit here we will see the completion: I'm doing well, thank you,
00:12:05.480 | how are you today, and so on. Cool, so we have that. Now, we can also do the same with streaming,
00:12:13.800 | and we'll see the same thing again, which is great. So now that we have that set up with this, you
00:12:24.200 | know, pretend OpenAI LLM, we can use function calling. Now, for function calling or tool calling we need a tool
00:12:33.240 | or function to call, so let's go ahead and set one up. I'm going to be using SerpAPI here; this is a very
00:12:40.520 | simple web search API. There are many of them out there and I wouldn't necessarily even recommend
00:12:46.760 | SerpAPI, but it comes with very easy setup, and I think it's 100 queries for free every month,
00:12:56.920 | so it's a nice tool to use in these tutorials. To get your SerpAPI API key we're going to go to
00:13:07.000 | the serpapi.com dashboard; you will need to create an account, but once you've created an account you just
00:13:14.280 | copy your API key. Now, once you've copied your API key, run this cell here and at the top you'll
00:13:20.040 | get this little message box saying enter your SerpAPI key; enter it and just press enter. Then what
00:13:26.680 | we're doing here is using aiohttp to make an asynchronous request because, you know, we're
00:13:35.960 | building an async agent, so we need everything to be async, including our function calls. So I am making an
00:13:45.320 | async request to the SerpAPI with a search query of latest world news.
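A minimal sketch of that async call; the endpoint and parameter names follow SerpAPI's Google search API, and reading the key from an environment variable is an assumption for illustration (the notebook prompts for it instead):

```python
import asyncio
import os

import aiohttp

# Assumed for illustration; the notebook collects the key via an input prompt.
SERPAPI_API_KEY = os.environ.get("SERPAPI_API_KEY", "")

async def web_search(query: str) -> dict:
    # SerpAPI's Google search endpoint; the JSON response contains an
    # "organic_results" list with fields like title, source, link, and snippet.
    params = {"engine": "google", "q": query, "api_key": SERPAPI_API_KEY}
    async with aiohttp.ClientSession() as session:
        async with session.get("https://serpapi.com/search.json", params=params) as resp:
            resp.raise_for_status()
            return await resp.json()

# Example usage:
# results = asyncio.run(web_search("latest world news"))
```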
00:13:53.720 | Okay, and we can see that we have world news here. We could expand this; I think there are 10, yeah,
00:14:03.320 | we have 10 records or responses here. One of the limitations with SerpAPI is that it's just giving us
00:14:10.200 | snippets here; it's not giving us a big page of information, so it is pretty limited in what you're
00:14:15.080 | going to get from this. But it's okay; this is enough for us to get a good idea of
00:14:23.240 | what is happening, especially when our LLM takes all of this and uses it to produce a more structured
00:14:31.080 | explanation of what we're asking for. Now, the structure of all this is, you know, not bad, but
00:14:38.440 | let's clean it up a little bit before we start throwing it into our LLM. So I'm going to use a Pydantic base model,
00:14:45.000 | and what we're going to do is say, okay, for each of these articles here,
00:14:49.800 | or I suppose responses or links, we are going to extract the title, the source of that article,
00:14:58.440 | the link where it's coming from, and also the snippet, which is the information that we have
00:15:04.040 | here. So we're going to get all that information, and we're going to use this class method to actually
00:15:09.960 | build our Pydantic base model from the results here directly. So we do that, and using our list
00:15:19.880 | comprehension here we're just iterating through the organic results, and we're using that article-from-
00:15:26.520 | SerpAPI-result method to generate this object, this base model. Once we have this base model we also define this
00:15:33.640 | string method here, which is going to more easily allow us to build essentially a markdown-formatted
00:15:40.680 | world news summary that we would then feed into our LLM, which I can show you very quickly: if we go str(articles[0])
00:15:50.360 | we'll see that we get this here. And we can also, just to make this a little cleaner,
00:15:56.600 | do from IPython.display import Markdown, display,
00:16:04.520 | and we get this. Great, so we have that.
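Here is a rough sketch of what that Pydantic model might look like; the field names follow what's described above, but the exact class and method names in the notebook may differ:

```python
from pydantic import BaseModel

class Article(BaseModel):
    title: str
    source: str
    link: str
    snippet: str

    @classmethod
    def from_serpapi_result(cls, result: dict) -> "Article":
        # Build an Article directly from one entry of SerpAPI's "organic_results".
        return cls(
            title=result["title"],
            source=result.get("source", ""),
            link=result["link"],
            snippet=result.get("snippet", ""),
        )

    def __str__(self) -> str:
        # Markdown-ish formatting that is easy for both people and the LLM to read.
        return f"**{self.title}** ({self.source})\n{self.link}\n\n{self.snippet}"

# articles = [Article.from_serpapi_result(r) for r in results["organic_results"]]
```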
00:16:12.040 | Now what I want to do is just compress all of that into a single function that will become our function call, or tool, which is what we have here.
00:16:18.440 | And the only thing that we need to add here, and we actually don't even need this little bit here,
00:16:23.320 | we can keep it, it's not a big deal, but let's not for now. So the only thing that we need to do is
00:16:29.480 | take our function here and turn it into an OpenAI-readable schema, and the way that we do that is,
00:16:38.120 | well, I'm using the graphai library; it's a very lightweight AI framework and it just includes this
00:16:45.400 | nice get-schemas utility, so you pass it a set of callables or functions and it's going to give you
00:16:51.400 | this list of function schemas. So with that list of function schemas, we just pass that into the tools
00:16:58.280 | parameter here of the completion or acompletion endpoints, we pass in our messages, which is this
00:17:04.520 | "tell me about the latest world news", and we're going to specify to our LLM that it can or cannot use the tool;
00:17:12.120 | that's what this tool_choice equals auto is, and this is actually the default value, but I just want to
00:17:18.680 | be explicit and show you everything here so that you know how to change those. If you want to force tool use
00:17:25.400 | you can use required. Cool, so let's run that and see what we get.
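For illustration, here is an OpenAI-style schema for the web_search tool written out by hand (in the video this list is generated by graphai's schema utility rather than typed manually), passed into the completion call along with tool_choice:

```python
import litellm

# Hand-written OpenAI-style function schema for the web_search tool defined earlier.
tools = [
    {
        "type": "function",
        "function": {
            "name": "web_search",
            "description": "Search the web via SerpAPI and return news results.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "The search query."}
                },
                "required": ["query"],
            },
        },
    }
]

messages = [{"role": "user", "content": "Tell me about the latest world news"}]

response = litellm.completion(
    model="openai/cogito-v1-preview-qwen-32b",   # assumed model ID
    api_base="http://localhost:1234/v1",
    api_key="not-needed-locally",
    messages=messages,
    tools=tools,
    tool_choice="auto",   # "required" would force a tool call
)
```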
00:17:35.800 | Just here I'm parsing the output to make it a bit more readable for us, so I'm looking at, okay, if a function has been called, which
00:17:40.280 | it will be, what is the name of that function, and we can see it's web search, and then what are the arguments
00:17:45.480 | to that function, which is actually a string, so we'll have to parse that into JSON, but you can see here:
00:17:52.200 | query, latest world news. Cool. Now, in this case we don't necessarily need this mapping dictionary
00:18:01.480 | here, but if we have many tools this can be pretty useful. It's essentially just mapping from a string,
00:18:11.400 | the name of our function, to the function itself; that's all that's happening here, and we use it like
00:18:18.440 | this. So we have our tool map here, and we're just passing the function name from our response here
00:18:23.960 | into that tool map, and what that is basically doing is this. Okay, that is the sort of
00:18:30.680 | cleaner version of that, but we're using this mapping dictionary to do it. And the reason, just if it isn't
00:18:38.600 | clear, that we have this mapping dictionary is because our LLM is going to be providing us with
00:18:44.680 | a string, which is the name of the function to call, so we're taking that string, mapping it from the string
00:18:52.280 | to the function, and then calling it with those arguments.
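Sketched out with the names assumed in the earlier snippets (a notebook cell can use top-level await; otherwise wrap this in an async function):

```python
import json

# Map from the tool name the LLM returns (a string) to the actual coroutine.
tool_map = {"web_search": web_search}

tool_call = response.choices[0].message.tool_calls[0]
tool_name = tool_call.function.name                    # e.g. "web_search"
tool_args = json.loads(tool_call.function.arguments)   # arguments arrive as a JSON string

# Equivalent to calling web_search(query="latest world news") directly.
tool_output = await tool_map[tool_name](**tool_args)
```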
00:18:59.320 | Okay, so we can see here that the output from our tool is the list of these various news headlines and snippets,
00:19:07.880 | and what we can do now is take our tool call, which is generated by our LLM, so that's what goes in,
00:19:15.560 | the function arguments and name, and that is being formatted into an assistant message. Then we take
00:19:22.200 | the response from our tool and we format that into a tool message. And basically both of these
00:19:28.760 | are going to be added to our message history, so that the LLM on the next iteration is going to
00:19:36.040 | see: the user made this query, I as an assistant then decided to go and use the web search tool with some
00:19:44.600 | particular arguments, and I got this response, which was all this information. Then we pass
00:19:51.560 | that back to our LLM, and our LLM is going to see all that and say, okay, I've done what I needed to do,
00:19:56.360 | I found this information for the user, now I'm going to respond directly to the user.
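In OpenAI-style message terms, that bookkeeping looks roughly like this, continuing from the previous snippets (the exact dictionaries in the notebook may be built slightly differently):

```python
# Record the assistant's tool call, then the tool's output, in the chat history.
messages.append({
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": tool_call.id,
        "type": "function",
        "function": {
            "name": tool_name,
            "arguments": tool_call.function.arguments,
        },
    }],
})
messages.append({
    "role": "tool",
    "tool_call_id": tool_call.id,
    "content": str(tool_output),
})

# A second completion call with these messages lets the LLM answer the user
# using the search results it just received.
```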
00:20:05.480 | Okay, so let's see what it does. This can take a little bit of time; as you can see here, we're up to more than 20 seconds
00:20:13.560 | already. That's mostly because of all of the text that we're inputting from the tool here, because there
00:20:20.040 | is quite a lot going in, so that increases the call time quite a bit, especially when you're running locally.
00:20:26.200 | That is something, of course, you would optimize by simply reducing the number of tokens that you're
00:20:32.680 | feeding in or pulling out of your LLM via prompting, for example by asking it to keep things short.
00:20:38.840 | Okay, so we can see that we got our response here: here are some of the latest world headlines
00:20:46.360 | from various sources, talking about tariffs and various other things. Great, so that's good,
00:20:53.880 | but now what I want to do is basically take all of what we just did and compress it all into this
00:20:59.880 | agent class here. Okay, so this agent class is just doing what we've just done, but wrapped up into a class.
00:21:08.040 | We can see the initialization stuff here: we're initializing the tool, getting the function schemas,
00:21:12.920 | creating that tool mapping, and just setting up an initial system message for our agent. Then the
00:21:20.200 | real meat of this is inside the async call here. So the async call is saying, okay, while I am under the
00:21:28.680 | maximum number of iterations, and we just set this maximum of iterations to avoid looping indefinitely and
00:21:34.840 | racking up a big bill. Well, in this case we wouldn't rack up a big OpenAI bill; if you were using OpenAI you would,
00:21:42.200 | but in this case we're safe. Still, to avoid unlimited iterations of our LLM if it gets stuck in a loop, we
00:21:50.520 | just set this max iterations; by default I'm saying three here. So what we're doing is: we have our user
00:21:58.280 | query that comes in, we pass that to our LLM with the async completion, we come through, we check if the
00:22:07.960 | assistant generated a tool call. If there was one, we're just going to pull out that tool call information,
00:22:13.880 | the tool name, tool arguments, and tool call ID, which is important; otherwise, if there wasn't anything, I
00:22:19.320 | just say None. Then, no matter what happened here, we append our assistant message,
00:22:30.440 | and then finally we say, okay, if there was a tool call I'm going to call that tool, and then I'm also
00:22:36.120 | going to append that tool message to the chat history. Otherwise, if there was not a tool call, that
00:22:42.760 | means that the assistant answered directly, and that probably means the assistant wants to send this
00:22:47.560 | response back to the user, so we just break out of our loop, and then we return the final message, which
00:22:54.520 | should in theory be an assistant message.
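As a rough sketch of such an agent class, reusing the assumed names from the earlier snippets (web_search, tools, tool_map, and the local model ID); the class in the notebook will differ in its details:

```python
import json
import litellm

class Agent:
    def __init__(self, tools: list, tool_map: dict, max_iterations: int = 3):
        self.tools = tools            # OpenAI-style schemas, as built earlier
        self.tool_map = tool_map      # e.g. {"web_search": web_search}
        self.max_iterations = max_iterations
        self.messages = [{"role": "system", "content": "You are a helpful assistant."}]

    async def __call__(self, query: str) -> str:
        self.messages.append({"role": "user", "content": query})
        for _ in range(self.max_iterations):
            response = await litellm.acompletion(
                model="openai/cogito-v1-preview-qwen-32b",  # assumed model ID
                api_base="http://localhost:1234/v1",
                api_key="not-needed-locally",
                messages=self.messages,
                tools=self.tools,
                tool_choice="auto",
            )
            message = response.choices[0].message
            # Always append the assistant turn, whether or not it called a tool.
            self.messages.append(message.model_dump())
            if not message.tool_calls:
                break  # the assistant answered the user directly
            for tool_call in message.tool_calls:
                args = json.loads(tool_call.function.arguments)
                output = await self.tool_map[tool_call.function.name](**args)
                self.messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": str(output),
                })
        return self.messages[-1]["content"]
```

Initializing it and then awaiting agent("tell me about the latest world news") drives the loop described above.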
00:23:02.920 | So we can initialize that, and now let's use our agent with our web search tool to tell us about the latest world news, and let's see what happens. Again, this might take
00:23:09.480 | a little moment, so I will just jump ahead very quickly. Okay, so it took a little bit of time to run there,
00:23:16.760 | almost a minute, but we can see that we actually got this response, so we have a pretty good summary of
00:23:26.360 | what seems to probably be happening in the world right now. And all of this, or at least everything
00:23:32.680 | except the web search component, has been executed and generated using our fully local LLM
00:23:41.000 | with the Cogito v1 model. Now, of course, you can also switch this out for Llama 4, the new Mistral models,
00:23:51.000 | and basically any LLM that is released and is popular enough to be quantized; you will be able to use it.
00:23:59.160 | So yeah, this is in my opinion pretty exciting; there's a lot of cool stuff you can do with these LLMs, and the
00:24:07.560 | coolest part here is just the fact that we can run all of these locally
00:24:12.920 | without needing to hit external APIs. So yeah, that's all I really wanted to cover in this video, so I will
00:24:19.480 | leave it there for now. Thank you very much for watching, I hope all this has been useful and interesting,
00:24:25.640 | and I will see you again in the next one. Bye!