NEW Pinecone Assistant Features + GA Release!

Chapters
0:00 New AI Assistant
0:40 Pinecone Assistant Release
2:38 Pinecone Assistant Python Client
3:39 Assistant Custom Instructions
7:05 Pinecone Assistant APIs
9:42 Assistant Chat API
19:37 Context API
21:37 Pinecone Chat Completions
24:42 Deleting our Assistant and Concluding
00:00:00.000 |
Today we're going to be trying out the new Pinecone Assistant, which has just been made 00:00:04.240 |
generally available. Now for those of you that don't know, Pinecone Assistant is an API service 00:00:11.600 |
that provides you with an agent that comes with essentially best-in-class RAG capabilities. 00:00:18.400 |
It focuses on making your agent as grounded in truth as possible. Of course, being Pinecone, 00:00:28.080 |
their whole thing is vector databases and retrieval, so they're pretty good at that sort of thing. 00:00:33.760 |
Now with Assistant being made generally available, that comes with a few new features. 00:00:41.760 |
There's a great summary of everything in this article here as well, but these are probably the 00:00:47.680 |
most exciting, in my opinion: the custom instructions, which are obviously pretty useful 00:00:52.960 |
when you're building your own agents, and this, which is very cool, the new input and output 00:00:57.360 |
formats. The input formats, I mean, that's nice, so we have markdown and docx files now, 00:01:02.880 |
which is cool, in addition to PDF and text, but then what I do like here is the output format. 00:01:10.960 |
There is essentially a JSON mode, so your agent can output structured output now, and generally 00:01:18.960 |
speaking, I'm always a big fan of that, because for basically everything that I've had to build with 00:01:24.640 |
agents, especially more recently, but for quite a long time now, the ability to 00:01:31.360 |
output structured text reliably is just incredibly useful. So that's pretty 00:01:39.360 |
big, in my opinion. Region control, so EU or US, which obviously matters for GDPR if you need that. And 00:01:46.000 |
I shouldn't say finally, because these are also pretty important, but 00:01:51.360 |
there's also a new chat API and a new context API, so we'll take a look at all of those. 00:01:57.680 |
Now, before we jump in, I will just say, this visualizes pretty well why you might 00:02:04.720 |
actually want to use Pinecone Assistant over something like OpenAI Assistants, which would be the 00:02:08.640 |
most similar thing out there. Generally speaking, and we will see this in the example that we walk 00:02:14.400 |
through, Pinecone Assistant is just very good at grounding everything it tells you with 00:02:20.640 |
sources. So it's much more trustworthy, much less likely to hallucinate and make stuff up 00:02:28.080 |
than OpenAI Assistants, and you can see some metrics here, but yeah, we'll see 00:02:34.960 |
in the example that we walk through that this is the case as well. So for the example we're 00:02:39.840 |
going to walk through, there will be a link in the video description and also in 00:02:44.880 |
the top comment, I will make sure that is in there, but we're going 00:02:50.480 |
to build something very simple. So we're going to build an assistant that's going to help us 00:02:56.080 |
understand an AI paper, and it's going to be from Yorkshire, 00:03:04.800 |
so, well, we'll see what it will be like. So first thing is, of course, the 00:03:10.320 |
API key, and this will be free. So you can go and get an API key at app.pinecone.io. I did include a 00:03:20.160 |
link here. So you can go there, you get your API key, and then you just enter it. So I'm 00:03:28.320 |
going to be creating my new assistant called the Yorkshire Assistant. Right now, I don't have any 00:03:34.080 |
assistants, okay, so you can just list those like that. So I'm going to go ahead and create him. 00:03:38.720 |
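For reference, here's a minimal sketch of that setup with the Pinecone Python client, assuming you have the SDK and the assistant plugin installed (`pip install pinecone pinecone-plugin-assistant`):

```python
from pinecone import Pinecone

# Initialize the client with the API key from app.pinecone.io
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")

# List any assistants that already exist on the account
print(pc.assistant.list_assistants())
```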
So my instructions, of course, one of the key features here is those custom instructions. 00:03:43.760 |
I'm going to go with the typical "you are a helpful assistant that must help with some queries", 00:03:48.320 |
and then we're going to modify it a little bit. So he's going to be helpful, and he's also going 00:03:52.000 |
to be from the Yorkshire countryside and will always respond with heavy Yorkshire slang, 00:03:57.760 |
colloquialisms, and references to that great county. He will 00:04:05.760 |
try to use relevant metaphors to explain concepts to the user. And then one thing 00:04:14.640 |
that I really like about more recent models is that they do tend to go with markdown output. 00:04:20.720 |
And I found that with Assistant, it didn't go with markdown output by default, which is actually 00:04:26.240 |
probably not a bad thing, but here I would like it. So I'm 00:04:32.720 |
just telling it here: you know, format your answers in markdown whilst maintaining the 00:04:39.760 |
Yorkshire accent. So we create our assistant. Here I'm doing another quick check, okay, 00:04:46.480 |
just making sure we don't actually have that assistant already. Then we call create assistant, 00:04:55.200 |
passing our name and instructions, and then timeout here is just how long we're going to wait 00:04:59.360 |
before we return. It doesn't really matter much, but it's there. 00:05:06.240 |
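As a rough sketch, that check-then-create flow looks something like this; the `create_assistant` call mirrors the Pinecone client, while the guard logic here is just one way of doing the existence check:

```python
assistant_name = "yorkshire-assistant"

# Only create the assistant if it doesn't already exist
existing = [a.name for a in pc.assistant.list_assistants()]
if assistant_name not in existing:
    assistant = pc.assistant.create_assistant(
        assistant_name=assistant_name,
        instructions=(
            "You are a helpful assistant that must help with user queries. You are "
            "from the Yorkshire countryside and will always respond with heavy "
            "Yorkshire slang, colloquialisms, and references to that great county. "
            "Try to use relevant metaphors to explain concepts to the user. Format "
            "your answers in markdown whilst maintaining the Yorkshire accent."
        ),
        timeout=30,  # seconds to wait for the assistant to become ready
    )
else:
    # Fetch a handle to the existing assistant instead
    assistant = pc.assistant.Assistant(assistant_name=assistant_name)
```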
Now we're going to download a kind of interesting paper. I haven't been through the whole thing, to be honest, 00:05:12.640 |
but it looks pretty interesting from the abstract. So there is this paper on reasoning language models, or 00:05:18.000 |
large reasoning models: "A Blueprint". So it's just kind of talking about these new models, 00:05:24.160 |
which are getting a lot of attention. Basically, people are very 00:05:29.040 |
interested in these. It's models like o1, o3, and a few of those other ones. So there are quite a 00:05:33.920 |
few of these out there and a lot of people are very interested in them and they are pretty cool. 00:05:39.360 |
So we have this paper and I want to learn a little bit about it. So we go on, we download it here. 00:05:46.960 |
That's just going to download it. If you're in Colab, it's going to download it here for you. 00:05:52.640 |
If you are, of course, local or somewhere else, it's just going to put it in 00:05:57.760 |
the same folder that you are working in right now. Okay. We're going to go ahead and I'm going 00:06:03.920 |
to take my file name, which is this, I saved it right here, and I'm going to upload that file. 00:06:12.320 |
I had this metadata here; you actually don't even need it and I don't even use it, but you 00:06:19.680 |
can just put some stuff in here, for example that the type of this file is a paper. Okay. So, you know, 00:06:27.040 |
you can put that in there if you like, you don't have to, it doesn't really matter. But what this 00:06:33.120 |
will do, this is now going to go ahead and it's just going to upload our file. It does take a 00:06:37.920 |
moment because it is waiting not just for the upload, but also for the file to finish 00:06:44.080 |
processing and also being made available. So you can modify that with, I think it's a wait 00:06:50.560 |
parameter, if I'm not wrong. And that will basically just say, okay, wait a little bit or 00:06:55.120 |
not. It might also be the timeout, but anyway, you can use that if you want. We're just going to wait. 00:07:01.600 |
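For reference, a minimal sketch of the upload step; the metadata keys are arbitrary, and `timeout=None` is my reading of the "block until processed" behaviour, so double-check it against the docs:

```python
# Upload the downloaded PDF; metadata is optional and entirely free-form
response = assistant.upload_file(
    file_path="rlm_blueprint.pdf",  # hypothetical local filename
    metadata={"type": "paper"},
    timeout=None,  # wait until the file is processed and available
)
print(response)
```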
Okay. So that is now uploaded and we're going to go through and just try out these various 00:07:09.600 |
APIs that are available. So very quickly, let's just summarize what those are. There are three, I think 00:07:16.880 |
it's just three. So there's a chat API. This is like the standard API, which we use to interact 00:07:23.440 |
with our assistant. Think of this like chat completions from OpenAI, but it is Pinecone's 00:07:30.000 |
version of that, which obviously has the, you know, the other fields and stuff that are relevant 00:07:34.640 |
to what you're doing here. Okay. So this is the one that you're probably going to want to use if 00:07:40.640 |
you're using this entire system as essentially a chatbot or like a full-on agent. This is what you 00:07:48.160 |
would use. Okay. And we'll use it in a moment. The other one is the context API. So that 00:07:54.160 |
is, okay, in some cases, maybe you don't want to use the agent part of what Pinecone's assistant is doing here. 00:08:01.920 |
Instead, maybe you just want to, you know, take advantage of Pinecone's 00:08:08.640 |
retrieval component, right? So the retrieval of Pinecone is obviously pretty good. So 00:08:15.600 |
the context API just allows you to retrieve from the documents that you have uploaded, the files 00:08:21.840 |
that you've uploaded. They obviously get processed, they get stored, indexed, you know, whatever 00:08:26.800 |
they're doing there. And then the context API basically allows you to retrieve over all of that, 00:08:34.400 |
right? It doesn't go through the whole generation. There's no agent that's deciding, 00:08:38.480 |
oh, I need to search for this, I'm going to retrieve these, now I'm going to generate 00:08:42.720 |
an answer. There's none of that. It's just the retrieval component, right? So if you 00:08:48.720 |
look, it's basically, in retrieval augmented generation that I mentioned here, 00:08:54.640 |
it's like the retrieval and augmentation parts. So then what you get is all of your citations. They 00:09:02.240 |
don't call them citations. They're called snippets. And those snippets are, you know, 00:09:06.960 |
it's basically your context or your chunks of a document that you can then take and send to a 00:09:12.800 |
downstream agent, LLM, you know, whatever you're doing. Okay. So it's up to you to define that 00:09:18.880 |
generation part in the case of using the context API. I like that they're breaking 00:09:24.560 |
this apart a little bit, because I don't always want to use the full thing. Then finally, the 00:09:30.320 |
chat completions API. This is the chat API, but it's just an OpenAI-compatible one. So we'll 00:09:36.240 |
talk about this later. Yeah, I'll leave it for later, but that can be useful in some scenarios 00:09:41.600 |
as well. Cool. So let's see how the chat API works first. So chat API, we're going to go in, 00:09:49.600 |
I'm going to create our list of messages, where each message is just a role and content, 00:09:54.480 |
pretty familiar format there. And of course, this is a list. So you can have many messages in here. 00:10:02.080 |
You might, you know, when you're obviously using this as a chatbot, you're going to be adding 00:10:06.960 |
multiple messages. It's going to be a conversation. So you would be appending those to your messages 00:10:13.200 |
object as you go, or messages list, I should say. So we create our messages. Then we're going to 00:10:21.200 |
call assistant chat. Okay. That's it. It's just assistant.chat, super easy. And we just pass 00:10:26.720 |
our messages into the messages parameter there. And then we get a response. 00:10:32.800 |
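A minimal sketch of that call; the `Message` import path for the assistant plugin is written from memory, so treat it as an assumption:

```python
from pinecone_plugins.assistant.models.chat import Message

# Build the conversation as a list of role/content messages
messages = [
    Message(role="user", content="What is a reasoning language model, or RLM?"),
]

# Send the conversation to the assistant
response = assistant.chat(messages=messages)

# The generated answer lives in response.message.content
print(response.message.content)
```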
Okay. The direct content generated by our assistant is going to be in response.message.content. So I'm 00:10:40.080 |
going to ask "what is a reasoning language model, or RLM?" and we'll see what happens. I'm not using 00:10:49.840 |
print here, because I've asked it to output things in markdown, so I'm just rendering 00:10:56.560 |
that as markdown. And you can see here, there isn't actually any markdown necessarily being used. 00:11:01.760 |
But occasionally there will be some. So it depends on, you know, what is it answering? 00:11:06.080 |
It doesn't always need to. So it says "ay up". An RLM, or reasoning language model, is a type of AI 00:11:13.840 |
model that extends, you know, so on and so on, with advanced reasoning mechanisms. Okay. They're built 00:11:19.920 |
on these three main pillars, so on and so on. Right. So it's just explaining what an RLM is. 00:11:27.040 |
And then we have the traditional sort of Yorkshire "ay up" there as well. We can run that again, see what 00:11:32.320 |
we get. Okay. It's being a little more Yorkshire here. So: "ay up, an RLM or reasoning language model 00:11:40.800 |
is a reyt fancy type of AI model", which is great. "Reyt". This sort of stuff, I don't even know how 00:11:48.720 |
you can pronounce. Large language models with advanced reasoning mechanisms, complex 00:12:01.120 |
problem-solving tasks, integrating structured reasoning processes like Monte Carlo tree search. Okay. So 00:12:11.200 |
we have that. In essence, RLMs are the next step in AI evolution, combining the best of language 00:12:18.960 |
understanding and advanced reasoning to solve complex problems more effectively. So it's all 00:12:23.280 |
right. I mean, I would like more Yorkshire in there, but it was not bad. And it was definitely 00:12:28.400 |
accurate. So cool. Chat response. We have this. So I just want to show you like the full response, 00:12:34.880 |
what we have in there. So what we just extracted out here is this. Okay. So we have this bit here, 00:12:39.440 |
but you see, there's a ton of other stuff in here as well. And this is useful. This is incredibly 00:12:42.960 |
useful. You know, the whole point of grounding our LLM or agent responses, assistant 00:12:49.920 |
responses, whatever you want to call them, with citations and truth 00:12:56.240 |
is for the most part to provide users with a little more trust in the system. And of course, 00:13:03.920 |
I mean, there's a whole thing of, you know, it wouldn't be able to answer the question if you 00:13:06.960 |
didn't give it this information. That's a big part of it, of course. But part of, you know, 00:13:13.440 |
returning citations is that trust component, which can be very useful. But of course, reading this, 00:13:19.360 |
I don't necessarily know where this information is coming from as a user. So one thing that I 00:13:24.960 |
think a lot of us would like to do downstream is modify this to actually include the citations, 00:13:32.000 |
or at least include the sources somewhere in our responses or in our, you know, interfaces, 00:13:38.000 |
whatever we're doing. So let's go through this and just see how we'd actually do that. 00:13:43.680 |
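Before walking through it, here's a rough sketch of pulling those fields out of the response; the field names (`position`, `references`, `pages`, `file.signed_url`) mirror what we see in the response object below:

```python
# Each citation marks a character position in the generated text,
# plus one or more references into the uploaded files
for citation in response.citations:
    print("position:", citation.position)
    for ref in citation.references:
        print("  pages:", ref.pages)
        print("  file:", ref.file.name)
        print("  signed url:", ref.file.signed_url)
```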
So Pinecone does return these citations, okay, which is great. So we have citations. The 00:13:52.320 |
citation, it shows the position in the text. This is the character position where that citation 00:13:58.160 |
applies, okay. And then it mentions the references. And references is actually a list 00:14:02.240 |
here. So we have a list of these reference objects. Most importantly, we have the pages, 00:14:08.080 |
right. We have the file that it comes from, which is useful. But I think more useful 00:14:17.200 |
is that we get the signed URL, right. So the signed URL is just like a private URL that 00:14:22.240 |
we can access to go ahead and actually see our PDF, right. And this is the PDF, the reasoning language models paper. 00:14:30.640 |
So this is being stored by Pinecone. And then they've given us this link to go and access it. 00:14:36.080 |
And of course, we can then go and share that link. So in our interfaces, 00:14:40.560 |
we'd be taking that signed URL, pushing it forward into like a UI or whatever else. And, yeah, 00:14:49.680 |
we're basically able to show users, okay, this is exactly where this information 00:14:54.880 |
is coming from, which is pretty cool. So we have that. I'm not going to go through this right now. 00:15:01.360 |
Close this. But we have those two bits of information, which are pretty important, 00:15:08.240 |
in my opinion, and we can do a lot with them. And, yeah, you can see here within the same 00:15:13.680 |
references list, I think we have another reference. Yes. So references is actually a list 00:15:20.880 |
of references. And we can handle those as we wish. But in this example, we're just going to 00:15:25.920 |
use the first reference for each citation that we get here. We have another citation here, 00:15:30.720 |
position, you know, a little bit later in the text. That's coming from page three, you know, 00:15:35.920 |
so on and so on. You can see it continues going. So we have a few citations in there. 00:15:43.360 |
And let's see how we might want to integrate that nicely with our response. So within, 00:15:49.680 |
as I mentioned, the citation object, we have that character position. We have our references, 00:15:54.240 |
which include pages and a signed URL. We're going to use all of that. So what I'm doing here, 00:15:59.200 |
I'm creating a citation in markdown. So in markdown, I want to show that pages text, or, 00:16:08.320 |
you know, list of pages, and have it link to where that citation comes from. And let's see what that looks like. 00:16:15.760 |
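Something like this tiny helper, where the square-bracketed text is the page list and the link target is the signed URL (the helper name and exact formatting here are my own):

```python
def build_citation_md(reference) -> str:
    # e.g. "[p. 2, 3](https://storage.googleapis.com/...signed-url...)"
    pages = ", ".join(str(p) for p in reference.pages)
    return f"[p. {pages}]({reference.file.signed_url})"
```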
Okay. So it looks like this. Pretty messy. But you can see that this is actually just 00:16:19.680 |
markdown format. So we have square brackets followed by the link within standard parentheses. 00:16:24.880 |
And if we display that in markdown, it's going to look like this. Okay. So we have this nice 00:16:30.080 |
little citation. We can click on this. And it'll take us back through to that PDF, which is great. 00:16:36.640 |
So we can actually go ahead and insert those into our response. So let's go ahead and do that. 00:16:43.200 |
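Here's a sketch of that insertion loop, reusing the helper above and the same assumed citation fields:

```python
# Work on a copy so we don't overwrite the original response text
content = str(response.message.content)

# Insert from the last citation to the first, so earlier character
# positions stay valid as the string grows
for citation in sorted(response.citations, key=lambda c: c.position, reverse=True):
    citation_md = " " + build_citation_md(citation.references[0])
    content = content[:citation.position] + citation_md + content[citation.position:]

print(content)
```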
So what we're doing, I'm just taking out the response content. I'm converting it into a 00:16:50.800 |
string. So I'm not like overwriting the original. Then I'm going to loop through the citations that 00:16:56.480 |
we have in reverse order. So we're going to be inserting them. You know, if we insert them in 00:17:02.640 |
the original order, basically we would have to modify the position count for every new insertion. 00:17:10.160 |
So we don't want to do that because that's complicated. So we'll just do it in the reverse 00:17:14.480 |
order. Okay. So reverse order, we build that citation, which is what I just showed you here. 00:17:21.760 |
So we're going to build this. That's what we do here. Then I'm going to just insert it, right? 00:17:27.920 |
So we have our content. We're going to take the content up to that position, insert our citation, 00:17:34.160 |
and then content, you know, following that position. This will insert it right after a 00:17:44.240 |
word. So we could also add a little bit of like a space here as well, if we wanted, 00:17:49.600 |
and that should look a little nicer. It's up to what we would like. Okay. We can see this. 00:17:58.400 |
Right? So now this is our text, and we have those citations right in the middle of that. 00:18:04.080 |
Okay. Which is pretty cool. Great. So we have that. Now, the one thing I did mention before 00:18:15.200 |
is that using Pinecone Assistant, you generally get better grounding of knowledge than you would 00:18:21.120 |
with like OpenAI Assistant. So let's ask a, you know, a kind of relevant question, 00:18:28.640 |
but this information is just not contained within the paper. Okay. So how many organizations are 00:18:34.160 |
adopting RLMs? That is just not mentioned. So let's see what it comes up with. 00:18:40.240 |
Now, this is the sort of question where typically you're probably fairly likely to end up getting a 00:18:46.560 |
hallucination. And yeah, here we avoid that, fortunately, which is, you know, one of the 00:18:54.640 |
pros of Pinecone Assistant. So we get a lot. I can't give you an exact number of organizations 00:19:03.520 |
adopting RLMs, but I can tell you that these models are garnering a lot of interest in various 00:19:11.280 |
sectors. RLMs are being used in various fields like healthcare, so on and so on. Yeah. There's 00:19:20.320 |
a lot more Yorkshire in here. And I think that all looks pretty good. So we have that, right? 00:19:27.760 |
So it's saying, okay, I can't give you an exact answer, but, you know, what follows is probably more of 00:19:34.560 |
an opinion than anything. So that's pretty cool. Now, the context API. This is what I 00:19:40.240 |
mentioned before. So basically what we have with the chat API is like agent, RAG, and document processing 00:19:49.440 |
all kind of wrapped into one thing. This is breaking up part of that pipeline. 00:19:53.680 |
So with the context API, you're still doing the document processing, right? That's all handled when 00:19:58.160 |
you upload your files, but then it's extracting out 00:20:03.920 |
the retrieval component, or the retrieval augmentation of RAG, and, you know, getting rid 00:20:09.600 |
of the sort of chat or LLM agent component. So let's have a look at what this looks 00:20:16.800 |
like. So we're going to say, "what is an RLM?", and just see what we get. Okay, cool. So that gives 00:20:27.200 |
you a lot. So let's try and parse through this a little bit. So we get this context response 00:20:35.600 |
object. Inside the context response, we have snippets. That is a list of snippet objects, 00:20:42.000 |
which contain mainly this content. That's probably what we're most interested in there. 00:20:47.840 |
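A minimal sketch of that call; the snippet fields (`content`, `reference.file`, `reference.pages`) are my reading of the response object, so verify them against what you get back:

```python
# Ask for raw retrieval results instead of a generated answer
context = assistant.context(query="What is an RLM?")

# Each snippet is a retrieved chunk plus its source file and pages
for snippet in context.snippets:
    print(snippet.content[:200])        # the chunk text
    print(snippet.reference.file.name)  # source file
    print(snippet.reference.pages)      # page numbers
```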
So yeah, you have all this content and these are basically like chunks of document that are 00:20:52.960 |
relevant to whatever you've asked, right? So "what is an RLM?" probably has quite a few relevant chunks 00:20:58.560 |
in there. It also includes the file with the link, as before. So you can pull that 00:21:04.800 |
information through if you like, which is useful. And also pretty useful here is basically the 00:21:11.760 |
pages, or the page number, where this information is coming from. So, you know, if we want 00:21:17.680 |
a little more control over what we're building, this is pretty good. It means 00:21:23.040 |
that we can use a part of, you know, Pinecone's assistant without using the full thing, which is 00:21:29.440 |
nice. So yeah, we would take those snippets, feed them into some downstream LLM or agent, you know, 00:21:36.160 |
whatever it is that you're doing. Now onto the final API, the chat completions API. Now this isn't, 00:21:44.240 |
there's no new functionality with this API beyond what we have in the chat API. However, what it 00:21:50.480 |
does do is copy the OpenAI standard format for chat completions, 00:21:59.600 |
which allows us to take OpenAI, or another LLM provider that uses the same format as OpenAI, 00:22:09.040 |
and swap out that API endpoint for Pinecone's assistant, which is pretty useful. 00:22:16.080 |
So let's see how that works. We go here. We need to get our assistant host and 00:22:24.800 |
assistant name, right? So we're basically constructing a URL here, 00:22:31.440 |
because we're going to be essentially replacing where OpenAI points with this. Okay. So we 00:22:39.040 |
have our host, and then we also have in here the assistant name, which would be the Yorkshire assistant. Okay. 00:22:46.640 |
So we put those together to get our base URL, which is this. And then what we're 00:22:55.440 |
going to do is initialize the OpenAI client. For the API key, we're actually passing in our Pinecone 00:23:00.560 |
API key, and for the base URL, we're swapping the default URL that OpenAI would 00:23:08.240 |
use, which would point to OpenAI, with this one. Okay. So we run that. And now we can actually interact 00:23:19.280 |
with Pinecone Assistant as we would with, you know, chat completions through OpenAI. 00:23:25.360 |
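Roughly, that swap looks like this; the host placeholder below is illustrative, as the real value comes from your own assistant's details:

```python
from openai import OpenAI

# Point the standard OpenAI client at Pinecone's OpenAI-compatible endpoint
client = OpenAI(
    api_key="YOUR_PINECONE_API_KEY",  # Pinecone key, not an OpenAI key
    base_url="https://<assistant-host>/assistant/chat/yorkshire-assistant",
)

# Exactly the same chat completions format we'd use with OpenAI
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "How many organizations are adopting RLMs?"}],
)
print(completion.choices[0].message.content)
```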
So we can run this. This is exactly the same format that we would use with OpenAI. 00:23:33.760 |
And we can see here that we are asking the question from before, where we're asking 00:23:41.120 |
about organizations using the new RLMs, and we get our "reyt, let's have a gander at what we've got 00:23:48.800 |
here. From the search results, it ain't exactly clear how many organizations are adopting RLMs." 00:23:55.840 |
Right. So there we go. We've got our answer. It doesn't know, again. Right. Which 00:24:01.120 |
is what we want it to be saying. We don't want it to be making things up. 00:24:04.720 |
That's kind of, you know, just exactly what we don't want it to do. 00:24:08.560 |
So that is good. And okay, why would we do that? You know, why would we use the 00:24:15.840 |
chat completions API? Well, it's basically just so that, if we're using OpenAI or other 00:24:20.960 |
providers, we can quickly swap out and sort of test and demo with Pinecone. Or even if, 00:24:27.600 |
you know, we've built something and we're offering multiple LLM 00:24:32.160 |
providers, we can go and just swap that out in our code very easily, which, you know, when you need 00:24:40.400 |
to move fast, is incredibly useful. Okay. So with that, we're actually done. So 00:24:45.760 |
that's everything, well, not everything. That's probably the main features of Assistant 00:24:50.240 |
that we, you know, should be aware of. Now, finally, when you are done with your assistant, 00:24:56.800 |
you might want to delete it, so you can go ahead and just run delete assistant. Easy. 00:25:02.480 |
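That cleanup is a one-liner with the Python client:

```python
# Remove the assistant (and its uploaded files) when you're done
pc.assistant.delete_assistant(assistant_name="yorkshire-assistant")
```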
Now, I would recommend this article from Pinecone, the sort of release blog post 00:25:10.240 |
on this, which I think is super helpful in just understanding what is actually going on here. 00:25:15.360 |
So I'll leave it there for now. I hope all this has been useful 00:25:22.320 |
and interesting. So thank you very much for watching and I'll see you again in the next one.