NEW Pinecone Assistant Features + GA Release!

Chapters
0:00 New AI Assistant
0:40 Pinecone Assistant Release
2:38 Pinecone Assistant Python Client
3:39 Assistant Custom Instructions
7:05 Pinecone Assistant APIs
9:42 Assistant Chat API
19:37 Context API
21:37 Pinecone Chat Completions
24:42 Deleting our Assistant and Concluding
00:00:00.000 |
Today we're going to be trying out the new Pinecone Assistant, which has just been made 00:00:04.240 |
generally available. Now for those of you that don't know, Pinecone Assistant is an API service 00:00:11.600 |
that provides you with an agent that comes with essentially best-in-class RAG capabilities. 00:00:18.400 |
It focuses on making your agent as grounded in truth as possible. Of course, being Pinecone, 00:00:28.080 |
their whole thing is vector databases and retrieval, so they're pretty good at that sort of thing. 00:00:33.760 |
Now with Assistant being made generally available, that comes with a few new features. 00:00:41.760 |
There's a great summary of everything in this article here as well, but these are probably the 00:00:47.680 |
most exciting, in my opinion: the custom instructions, which are obviously pretty useful 00:00:52.960 |
when you're building your own agents, and this, which is very cool, the new input and output 00:00:57.360 |
formats. The input formats, I mean, that's nice, so we have markdown and docx files now, 00:01:02.880 |
which is cool, in addition to PDF and text, but then what I do like here is the output format. 00:01:10.960 |
There is essentially a JSON mode, so your agent can output structured output now, and generally 00:01:18.960 |
speaking, I'm always a big fan of that, because for basically everything that I've had to build with 00:01:24.640 |
agents, especially more recently, but for quite a long time now, the ability to 00:01:31.360 |
output structured text reliably is just incredibly useful. So that's pretty 00:01:39.360 |
big, in my opinion. Region control, so EU or US, which obviously matters for GDPR if you need that. And 00:01:46.000 |
I shouldn't say finally, because these are also pretty important, but 00:01:51.360 |
there's also a new chat API and a new context API, so we'll take a look at all of those. 00:01:57.680 |
Now, before we jump in, I will just say, this visualizes pretty well why you might 00:02:04.720 |
actually want to use Pinecone Assistant over something like OpenAI Assistants, which would be the 00:02:08.640 |
most similar thing out there. Generally speaking, and we will see this in the example that we walk 00:02:14.400 |
through, Pinecone Assistant is just very good at grounding everything it tells you with 00:02:20.640 |
sources. So it's much more trustworthy, much less likely to hallucinate and make stuff up 00:02:28.080 |
than OpenAI Assistants, and you can see some metrics here, but yeah, we'll see 00:02:34.960 |
in the example that we walk through that this is the case as well. So for the example we're 00:02:39.840 |
going to walk through, there will be a link in the video description and also in 00:02:44.880 |
the top comment, I will make sure that is in there, but we're going 00:02:50.480 |
to build something very simple. So we're going to build an assistant that's going to help us 00:02:56.080 |
understand an AI paper, and it's going to be from Yorkshire, 00:03:04.800 |
so, well, we'll see what it will be like. So first thing is, of course, the 00:03:10.320 |
API key, and this will be free. So you can go and get an API key at app.pinecone.io. I did include a 00:03:20.160 |
link here. So you can go there, you get your API key, and then you just enter it. So I'm 00:03:28.320 |
going to be creating my new assistant called the Yorkshire Assistant. Right now, I don't have any 00:03:34.080 |
assistants, okay, so you can just list those like that. So I'm going to go ahead and create him. 00:03:38.720 |
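For reference, here's a minimal sketch of that setup with the Pinecone Python client, assuming you have the SDK and the assistant plugin installed (`pip install pinecone pinecone-plugin-assistant`):

```python
from pinecone import Pinecone

# Initialize the client with the API key from app.pinecone.io
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")

# List any assistants that already exist on the account
print(pc.assistant.list_assistants())
```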
So my instructions, of course, one of the key features here is those custom instructions. 00:03:43.760 |
I'm going to go with the typical "you are a helpful assistant that must help with some queries", 00:03:48.320 |
and then we're going to modify it a little bit. So he's going to be helpful, and he's also going 00:03:52.000 |
to be from the Yorkshire countryside and will always respond with heavy Yorkshire slang, 00:03:57.760 |
colloquialisms, and references to that great county. He will 00:04:05.760 |
try to use relevant metaphors to explain concepts to the user. And then one thing 00:04:14.640 |
that I really like about more recent models is that they do tend to go with markdown output. 00:04:20.720 |
And I found that with Assistant, it didn't go with markdown output by default, which is actually 00:04:26.240 |
probably not a bad thing, but here I would like it. So I'm 00:04:32.720 |
just telling it here: you know, format your answers in markdown whilst maintaining the 00:04:39.760 |
Yorkshire accent. So we create our assistant. Here I'm doing another quick check, okay, 00:04:46.480 |
just making sure we don't actually have that assistant already. Then we call create assistant, 00:04:55.200 |
passing our name and instructions, and then timeout here is just how long we're going to wait 00:04:59.360 |
before we return. It doesn't really matter much, but it's there. 00:05:06.240 |
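As a rough sketch, that check-then-create flow looks something like this; the `create_assistant` call mirrors the Pinecone client, while the guard logic here is just one way of doing the existence check:

```python
assistant_name = "yorkshire-assistant"

# Only create the assistant if it doesn't already exist
existing = [a.name for a in pc.assistant.list_assistants()]
if assistant_name not in existing:
    assistant = pc.assistant.create_assistant(
        assistant_name=assistant_name,
        instructions=(
            "You are a helpful assistant that must help with user queries. You are "
            "from the Yorkshire countryside and will always respond with heavy "
            "Yorkshire slang, colloquialisms, and references to that great county. "
            "Try to use relevant metaphors to explain concepts to the user. Format "
            "your answers in markdown whilst maintaining the Yorkshire accent."
        ),
        timeout=30,  # seconds to wait for the assistant to become ready
    )
else:
    # Fetch a handle to the existing assistant instead
    assistant = pc.assistant.Assistant(assistant_name=assistant_name)
```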
Now we're going to download a kind of interesting paper. I haven't been through the whole thing, to be honest, 00:05:12.640 |
but it looks pretty interesting from the abstract. So there is this paper on reasoning language models, or 00:05:18.000 |
large reasoning models: "A Blueprint". So it's just kind of talking about these new models, 00:05:24.160 |
which are getting a lot of attention. Basically, people are very 00:05:29.040 |
interested in these. It's models like o1, o3, and a few of those other ones. So there are quite a 00:05:33.920 |
few of these out there and a lot of people are very interested in them and they are pretty cool. 00:05:39.360 |
So we have this paper and I want to learn a little bit about it. So we go on, we download it here. 00:05:46.960 |
That's just going to download it. If you're in Colab, it's going to download it here for you. 00:05:52.640 |
If you are, of course, local or somewhere else, it's just going to put it in 00:05:57.760 |
the same folder that you are working in right now. Okay. We're going to go ahead and I'm going 00:06:03.920 |
to take my file name, which is this, I saved it right here, and I'm going to upload that file. 00:06:12.320 |
I had this metadata here; you actually don't even need it and I don't even use it, but you 00:06:19.680 |
can just put some stuff in here, for example that the type of this file is a paper. Okay. So, you know, 00:06:27.040 |
you can put that in there if you like, you don't have to, it doesn't really matter. But what this 00:06:33.120 |
will do, this is now going to go ahead and it's just going to upload our file. It does take a 00:06:37.920 |
moment because it is waiting not just for the upload, but also for the file to finish 00:06:44.080 |
processing and also being made available. So you can modify that with, I think it's a wait 00:06:50.560 |
parameter, if I'm not wrong. And that will basically just say, okay, wait a little bit or 00:06:55.120 |
not. It might also be the timeout, but anyway, you can use that if you want. We're just going to wait. 00:07:01.600 |
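For reference, a minimal sketch of the upload step; the metadata keys are arbitrary, and `timeout=None` is my reading of the "block until processed" behaviour, so double-check it against the docs:

```python
# Upload the downloaded PDF; metadata is optional and entirely free-form
response = assistant.upload_file(
    file_path="rlm_blueprint.pdf",  # hypothetical local filename
    metadata={"type": "paper"},
    timeout=None,  # wait until the file is processed and available
)
print(response)
```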
Okay. So that is now uploaded and we're going to go through and just try out these various 00:07:09.600 |
APIs that are available. So very quickly, let's just summarize what those are. There are three, I think 00:07:16.880 |
it's just three. So there's a chat API. This is like the standard API, which we use to interact 00:07:23.440 |
with our assistant. Think of this like chat completions from OpenAI, but it is Pinecone's 00:07:30.000 |
version of that, which obviously has the, you know, the other fields and stuff that are relevant 00:07:34.640 |
to what you're doing here. Okay. So this is the one that you're probably going to want to use if 00:07:40.640 |
you're using this entire system as essentially a chatbot or like a full-on agent. This is what you 00:07:48.160 |
would use. Okay. And we'll use it in a moment. The other one is the context API. So that 00:07:54.160 |
is, okay, in some cases, maybe you don't want to use the agent part of what Pinecone's assistant is doing here. 00:08:01.920 |
Instead, maybe you just want to, you know, take advantage of Pinecone's 00:08:08.640 |
retrieval component, right? So the retrieval of Pinecone is obviously pretty good. So 00:08:15.600 |
the context API just allows you to retrieve from the documents that you have uploaded, the files 00:08:21.840 |
that you've uploaded. They obviously get processed, they get stored, indexed, you know, whatever 00:08:26.800 |
they're doing there. And then the context API basically allows you to retrieve over all of that, 00:08:34.400 |
right? It doesn't go through the whole generation. There's no agent that's deciding, 00:08:38.480 |
oh, I need to search for this, I'm going to retrieve these, now I'm going to generate 00:08:42.720 |
an answer. There's none of that. It's just the retrieval component, right? So if you 00:08:48.720 |
look, it's basically, in retrieval augmented generation that I mentioned here, 00:08:54.640 |
it's like the retrieval and augmentation parts. So then what you get is all of your citations. They 00:09:02.240 |
don't call them citations. They're called snippets. And those snippets are, you know, 00:09:06.960 |
it's basically your context or your chunks of a document that you can then take and send to a 00:09:12.800 |
downstream agent, LLM, you know, whatever you're doing. Okay. So it's up to you to define that 00:09:18.880 |
generation part in the case of using the context API. I like that they're breaking 00:09:24.560 |
this apart a little bit, because I don't always want to use the full thing. Then finally, the 00:09:30.320 |
chat completions API. This is the chat API, but it's just an OpenAI-compatible one. So we'll 00:09:36.240 |
talk about this later. Yeah, I'll leave it for later, but that can be useful in some scenarios 00:09:41.600 |
as well. Cool. So let's see how the chat API works first. So chat API, we're going to go in, 00:09:49.600 |
I'm going to create our list of messages, where each message is just a role and content, 00:09:54.480 |
pretty familiar format there. And of course, this is a list. So you can have many messages in here. 00:10:02.080 |
You might, you know, when you're obviously using this as a chatbot, you're going to be adding 00:10:06.960 |
multiple messages. It's going to be a conversation. So you would be appending those to your messages 00:10:13.200 |
object as you go, or messages list, I should say. So we create our messages. Then we're going to 00:10:21.200 |
call assistant chat. Okay. That's it. It's just assistant.chat, super easy. And we just pass 00:10:26.720 |
our messages into the messages parameter there. And then we get a response. 00:10:32.800 |
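A minimal sketch of that call; the `Message` import path for the assistant plugin is written from memory, so treat it as an assumption:

```python
from pinecone_plugins.assistant.models.chat import Message

# Build the conversation as a list of role/content messages
messages = [
    Message(role="user", content="What is a reasoning language model, or RLM?"),
]

# Send the conversation to the assistant
response = assistant.chat(messages=messages)

# The generated answer lives in response.message.content
print(response.message.content)
```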
Okay. The direct content generated by our assistant is going to be in response.message.content. So I'm 00:10:40.080 |
going to ask "what is a reasoning language model, or RLM?" and we'll see what happens. I'm not using 00:10:49.840 |
print here, because I've asked it to output things in markdown, so I'm just rendering 00:10:56.560 |
that as markdown. And you can see here, there isn't actually any markdown necessarily being used. 00:11:01.760 |
But occasionally there will be some. So it depends on, you know, what is it answering? 00:11:06.080 |
It doesn't always need to. So it says "ay up". An RLM, or reasoning language model, is a type of AI 00:11:13.840 |
model that extends, you know, so on and so on, with advanced reasoning mechanisms. Okay. They're built 00:11:19.920 |
on these three main pillars, so on and so on. Right. So it's just explaining what an RLM is. 00:11:27.040 |
And then we have the traditional sort of Yorkshire "ay up" there as well. We can run that again, see what 00:11:32.320 |
we get. Okay. It's being a little more Yorkshire here. So: "ay up, an RLM or reasoning language model 00:11:40.800 |
is a reyt fancy type of AI model", which is great. "Reyt". This sort of stuff, I don't even know how 00:11:48.720 |
you can pronounce. Large language models with advanced reasoning mechanisms, complex 00:12:01.120 |
problem-solving tasks, integrating structured reasoning processes like Monte Carlo tree search. Okay. So 00:12:11.200 |
we have that. In essence, RLMs are the next step in AI evolution, combining the best of language 00:12:18.960 |
understanding and advanced reasoning to solve complex problems more effectively. So it's all 00:12:23.280 |
right. I mean, I would like more Yorkshire in there, but it was not bad. And it was definitely 00:12:28.400 |
accurate. So cool. Chat response. We have this. So I just want to show you like the full response, 00:12:34.880 |
what we have in there. So what we just extracted out here is this. Okay. So we have this bit here, 00:12:39.440 |
but you see, there's a ton of other stuff in here as well. And this is useful. This is incredibly 00:12:42.960 |
useful. You know, the whole point of grounding our LLM or agent responses, assistant 00:12:49.920 |
responses, whatever you want to call them, with citations and truth 00:12:56.240 |
is for the most part to provide users with a little more trust in the system. And of course, 00:13:03.920 |
I mean, there's a whole thing of, you know, it wouldn't be able to answer the question if you 00:13:06.960 |
didn't give it this information. That's a big part of it, of course. But part of, you know, 00:13:13.440 |
returning citations is that trust component, which can be very useful. But of course, reading this, 00:13:19.360 |
I don't necessarily know where this information is coming from as a user. So one thing that I 00:13:24.960 |
think a lot of us would like to do downstream is modify this to actually include the citations, 00:13:32.000 |
or at least include the sources somewhere in our responses or in our, you know, interfaces, 00:13:38.000 |
whatever we're doing. So let's go through this and just see how we'd actually do that. 00:13:43.680 |
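Before walking through it, here's a rough sketch of pulling those fields out of the response; the field names (`position`, `references`, `pages`, `file.signed_url`) mirror what we see in the response object below:

```python
# Each citation marks a character position in the generated text,
# plus one or more references into the uploaded files
for citation in response.citations:
    print("position:", citation.position)
    for ref in citation.references:
        print("  pages:", ref.pages)
        print("  file:", ref.file.name)
        print("  signed url:", ref.file.signed_url)
```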
So Pinecone does return these citations, okay, which is great. So we have citations. The 00:13:52.320 |
citation, it shows the position in the text. This is the character position where that citation 00:13:58.160 |
applies, okay. And then it mentions the references. And references is actually a list 00:14:02.240 |
here. So we have a list of these reference objects. Most importantly, we have the pages, 00:14:08.080 |
right. We have the file that it comes from, which is useful. But I think more useful 00:14:17.200 |
is that we get the signed URL, right. So the signed URL is just like a private URL that 00:14:22.240 |
we can access to go ahead and actually see our PDF, right. And this is the PDF, the reasoning language models paper. 00:14:30.640 |
So this is being stored by Pinecone. And then they've given us this link to go and access it. 00:14:36.080 |
And of course, we can then go and share that link. So in our interfaces, 00:14:40.560 |
we'd be taking that signed URL, pushing it forward into like a UI or whatever else. And, yeah, 00:14:49.680 |
we're basically able to show users, okay, this is exactly where this information 00:14:54.880 |
is coming from, which is pretty cool. So we have that. I'm not going to go through this right now. 00:15:01.360 |
Close this. But we have those two bits of information, which are pretty important, 00:15:08.240 |
in my opinion, and we can do a lot with them. And, yeah, you can see here within the same 00:15:13.680 |
references list, I think we have another reference. Yes. So references is actually a list 00:15:20.880 |
of references. And we can handle those as we wish. But in this example, we're just going to 00:15:25.920 |
use the first reference for each citation that we get here. We have another citation here, 00:15:30.720 |
position, you know, a little bit later in the text. That's coming from page three, you know, 00:15:35.920 |
so on and so on. You can see it continues going. So we have a few citations in there. 00:15:43.360 |
And let's see how we might want to integrate that nicely with our response. So within, 00:15:49.680 |
as I mentioned, the citation object, we have that character position. We have our references, 00:15:54.240 |
which include pages and a signed URL. We're going to use all of that. So what I'm doing here, 00:15:59.200 |
I'm creating a citation in markdown. So in markdown, I want to show that pages text, or, 00:16:08.320 |
you know, list of pages, and have it link to where that citation comes from. And let's see what that looks like. 00:16:15.760 |
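Something like this tiny helper, where the square-bracketed text is the page list and the link target is the signed URL (the helper name and exact formatting here are my own):

```python
def build_citation_md(reference) -> str:
    # e.g. "[p. 2, 3](https://storage.googleapis.com/...signed-url...)"
    pages = ", ".join(str(p) for p in reference.pages)
    return f"[p. {pages}]({reference.file.signed_url})"
```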
Okay. So it looks like this. Pretty messy. But you can see that this is actually just 00:16:19.680 |
markdown format. So we have square brackets followed by the link within standard parentheses. 00:16:24.880 |
And if we display that in markdown, it's going to look like this. Okay. So we have this nice 00:16:30.080 |
little citation. We can click on this. And it'll take us back through to that PDF, which is great. 00:16:36.640 |
So we can actually go ahead and insert those into our response. So let's go ahead and do that. 00:16:43.200 |
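Here's a sketch of that insertion loop, reusing the helper above and the same assumed citation fields:

```python
# Work on a copy so we don't overwrite the original response text
content = str(response.message.content)

# Insert from the last citation to the first, so earlier character
# positions stay valid as the string grows
for citation in sorted(response.citations, key=lambda c: c.position, reverse=True):
    citation_md = " " + build_citation_md(citation.references[0])
    content = content[:citation.position] + citation_md + content[citation.position:]

print(content)
```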
So what we're doing, I'm just taking out the response content. I'm converting it into a 00:16:50.800 |
string. So I'm not like overwriting the original. Then I'm going to loop through the citations that 00:16:56.480 |
we have in reverse order. So we're going to be inserting them. You know, if we insert them in 00:17:02.640 |
the original order, basically we would have to modify the position count for every new insertion. 00:17:10.160 |
So we don't want to do that because that's complicated. So we'll just do it in the reverse 00:17:14.480 |
order. Okay. So reverse order, we build that citation, which is what I just showed you here. 00:17:21.760 |
So we're going to build this. That's what we do here. Then I'm going to just insert it, right? 00:17:27.920 |
So we have our content. We're going to take the content up to that position, insert our citation, 00:17:34.160 |
and then content, you know, following that position. This will insert it right after a 00:17:44.240 |
word. So we could also add a little bit of like a space here as well, if we wanted, 00:17:49.600 |
and that should look a little nicer. It's up to what we would like. Okay. We can see this. 00:17:58.400 |
Right? So now this is our text, and we have those citations right in the middle of that. 00:18:04.080 |
Okay. Which is pretty cool. Great. So we have that. Now, the one thing I did mention before 00:18:15.200 |
is that using Pinecone Assistant, you generally get better grounding of knowledge than you would 00:18:21.120 |
with like OpenAI Assistant. So let's ask a, you know, a kind of relevant question, 00:18:28.640 |
but this information is just not contained within the paper. Okay. So how many organizations are 00:18:34.160 |
adopting RLMs? That is just not mentioned. So let's see what it comes up with. 00:18:40.240 |
Now, this is the sort of question where typically you're probably fairly likely to end up getting a 00:18:46.560 |
hallucination. And yeah, here we avoid that, fortunately, which is, you know, one of the 00:18:54.640 |
pros of Pinecone Assistant. So we get a lot. I can't give you an exact number of organizations 00:19:03.520 |
adopting RLMs, but I can tell you that these models are garnering a lot of interest in various 00:19:11.280 |
sectors. RLMs are being used in various fields like healthcare, so on and so on. Yeah. There's 00:19:20.320 |
a lot more Yorkshire in here. And I think that all looks pretty good. So we have that, right? 00:19:27.760 |
So it's saying, okay, I can't give you an exact answer, but, you know, what follows is probably more of 00:19:34.560 |
an opinion than anything. So that's pretty cool. Now, the context API. This is what I 00:19:40.240 |
mentioned before. So basically what we have with the chat API is like agent, RAG, and document processing 00:19:49.440 |
all kind of wrapped into one thing. This is breaking up part of that pipeline. 00:19:53.680 |
So with the context API, you're still doing the document processing, right? That's all handled when 00:19:58.160 |
you upload your files, but then it's extracting out 00:20:03.920 |
the retrieval component, or the retrieval augmentation of RAG, and, you know, getting rid 00:20:09.600 |
of the sort of chat or LLM agent component. So let's have a look at what this looks 00:20:16.800 |
like. So we're going to say, "what is an RLM?", and just see what we get. Okay, cool. So that gives 00:20:27.200 |
you a lot. So let's try and parse through this a little bit. So we get this context response 00:20:35.600 |
object. Inside the context response, we have snippets. That is a list of snippet objects, 00:20:42.000 |
which contain mainly this content. That's probably what we're most interested in there. 00:20:47.840 |
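A minimal sketch of that call; the snippet fields (`content`, `reference.file`, `reference.pages`) are my reading of the response object, so verify them against what you get back:

```python
# Ask for raw retrieval results instead of a generated answer
context = assistant.context(query="What is an RLM?")

# Each snippet is a retrieved chunk plus its source file and pages
for snippet in context.snippets:
    print(snippet.content[:200])        # the chunk text
    print(snippet.reference.file.name)  # source file
    print(snippet.reference.pages)      # page numbers
```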
So yeah, you have all this content and these are basically like chunks of document that are 00:20:52.960 |
relevant to whatever you've asked, right? So "what is an RLM?" probably has quite a few relevant chunks 00:20:58.560 |
in there. It also includes the file with the link, as before. So you can pull that 00:21:04.800 |
information through if you like, which is useful. And also pretty useful here is basically the 00:21:11.760 |
pages, or the page number, where this information is coming from. So, you know, if we want 00:21:17.680 |
a little more control over what we're building, this is pretty good. It means 00:21:23.040 |
that we can use a part of, you know, Pinecone's assistant without using the full thing, which is 00:21:29.440 |
nice. So yeah, we would take those snippets, feed them into some downstream LLM or agent, you know, 00:21:36.160 |
whatever it is that you're doing. Now onto the final API, the chat completions API. Now this isn't, 00:21:44.240 |
there's no new functionality with this API beyond what we have in the chat API. However, what it 00:21:50.480 |
does do is copy the OpenAI standard format for chat completions, 00:21:59.600 |
which allows us to take OpenAI, or another LLM provider that uses the same format as OpenAI, 00:22:09.040 |
and swap out that API endpoint for Pinecone's assistant, which is pretty useful. 00:22:16.080 |
So let's see how that works. We go here. We need to get our assistant host and 00:22:24.800 |
assistant name, right? So we're basically constructing a URL here, 00:22:31.440 |
because we're going to be essentially replacing where OpenAI points with this. Okay. So we 00:22:39.040 |
have our host, and then we also have in here the assistant name, which would be the Yorkshire assistant. Okay. 00:22:46.640 |
So we put those together to get our base URL, which is this. And then what we're 00:22:55.440 |
going to do is initialize the OpenAI client. For the API key, we're actually passing in our Pinecone 00:23:00.560 |
API key, and for the base URL, we're swapping the default URL that OpenAI would 00:23:08.240 |
use, which would point to OpenAI, with this one. Okay. So we run that. And now we can actually interact 00:23:19.280 |
with Pinecone Assistant as we would with, you know, chat completions through OpenAI. 00:23:25.360 |
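Roughly, that swap looks like this; the host placeholder below is illustrative, as the real value comes from your own assistant's details:

```python
from openai import OpenAI

# Point the standard OpenAI client at Pinecone's OpenAI-compatible endpoint
client = OpenAI(
    api_key="YOUR_PINECONE_API_KEY",  # Pinecone key, not an OpenAI key
    base_url="https://<assistant-host>/assistant/chat/yorkshire-assistant",
)

# Exactly the same chat completions format we'd use with OpenAI
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "How many organizations are adopting RLMs?"}],
)
print(completion.choices[0].message.content)
```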
So we can run this. This is exactly the same format that we would use with OpenAI. 00:23:33.760 |
And we can see here that we are asking the question from before, where we're asking 00:23:41.120 |
about organizations using the new RLMs, and we get our "reyt, let's have a gander at what we've got 00:23:48.800 |
here. From the search results, it ain't exactly clear how many organizations are adopting RLMs." 00:23:55.840 |
Right. So there we go. We've got our answer. It doesn't know, again. Right. Which 00:24:01.120 |
is what we want it to be saying. We don't want it to be making things up. 00:24:04.720 |
That's kind of, you know, just exactly what we don't want it to do. 00:24:08.560 |
So that is good. And okay, why would we do that? You know, why would we use the 00:24:15.840 |
chat completions API? Well, it's basically just so that, if we're using OpenAI or other 00:24:20.960 |
providers, we can quickly swap out and sort of test and demo with Pinecone. Or even if, 00:24:27.600 |
you know, we've built something and we're offering multiple LLM 00:24:32.160 |
providers, we can go and just swap that out in our code very easily, which, you know, when you need 00:24:40.400 |
to move fast, is incredibly useful. Okay. So with that, we're actually done. So 00:24:45.760 |
that's everything, well, not everything. That's probably the main features of Assistant 00:24:50.240 |
that we, you know, should be aware of. Now, finally, when you are done with your assistant, 00:24:56.800 |
you might want to delete it, so you can go ahead and just run delete assistant. Easy. 00:25:02.480 |
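That cleanup is a one-liner with the Python client:

```python
# Remove the assistant (and its uploaded files) when you're done
pc.assistant.delete_assistant(assistant_name="yorkshire-assistant")
```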
Now, I would recommend this article from Pinecone, the sort of release blog post 00:25:10.240 |
on this, which I think is super helpful in just understanding what is actually going on here. 00:25:15.360 |
So I'll leave it there for now. I hope all this has been useful 00:25:22.320 |
and interesting. So thank you very much for watching and I'll see you again in the next one.