Back to Index

RAG is a hack - with Jerry Liu of LlamaIndex


Chapters

0:00 Introductions and Jerry’s background
4:38 Starting LlamaIndex as a side project
5:27 Evolution from tree-index to current LlamaIndex and LlamaHub architecture
11:35 Deciding to leave Robust Intelligence to start the LlamaIndex company and raising funding
21:37 Context window size and information capacity for LLMs
23:09 Minimum viable context and maximum context for RAG
24:27 Fine-tuning vs RAG - current limitations and future potential
25:29 RAG as a hack but good hack for now
28:09 RAG benefits - transparency and access control
29:40 Potential for fine-tuning to take over some RAG capabilities
32:05 Baking everything into an end-to-end trained LLM
35:39 Similarities between iterating on ML models and LLM apps
37:6 Modularity and customization options in LlamaIndex: data loading, retrieval, synthesis, reasoning
43:10 Evaluating and optimizing each component of a LlamaIndex system
49:13 Building retrieval benchmarks to evaluate RAG
50:38 SEC Insights - open source full stack LLM app using LlamaIndex
53:07 Enterprise platform to complement LlamaIndex open source
54:33 Community contributions for LlamaHub data loaders
57:21 LLM engine usage - majority OpenAI but options expanding
60:43 Vector store landscape
64:33 Exploring relationships and graphs within data
68:29 Additional complexity of evaluating agent loops
69:20 Lightning Round

Transcript

(upbeat music) - Hey, everyone. Welcome to the Latent Space Podcast. This is Alessio, partner and CTO-in-Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. - And today we finally have Jerry Liu on the podcast. Hey, Jerry. - Hey, guys. Hey, Swyx. Hey, Alessio.

Thanks for having me. - It's so weird because we keep running into each other in San Francisco AI events, so it's kind of weird to finally just have a conversation recorded for everybody else. - Yeah, I know. I'm really looking forward to this. I guess I have further questions.

- So I tend to introduce people on their formal background and then ask something on the more personal side. So you are part of the Princeton gang. - Yeah. I don't know if there is like an official Princeton gang. - There is a Princeton gang. I attended your meetup.

There was like four of you. - Oh, cool. Okay, nice. - With Prem and the others. - Oh, yeah, yeah, yeah, yeah. - Where you did bachelor's in CS and certificate in finance. That's also fun. I also did finance. And I think I saw that you also interned at Two Sigma where I worked in New York.

You were a machine learning engineer. - You were at Two Sigma? - Yeah, very briefly. - Oh, cool. All right, I didn't know that. Okay. - That was my first like proper engineering job before I went into DevRel. - Oh, okay. Oh, wow. Nice. - And then you were a machine learning engineer at Quora, AI research scientist at Uber for three years, and then two years machine learning engineer at Robust Intelligence before starting Llama Index.

So that's your LinkedIn. What's not on your LinkedIn that people should know about you? - I think back during my Quora days, I had this like three month phase where I just wrote like a ton of Quora answers. And so I think if you look at my tweets nowadays, you can basically see that as like the V2 of my three month like Quora stint where I just like went ham on Quora for a bit.

I actually, I think I was back then, actually, when I was working on Quora, I think the thing that everybody was fascinated in was just like general like deep learning advancements and stuff like GANs and generative like images and just like new architectures that were evolving. And it was a pretty exciting time to be a researcher actually, 'cause you were going in like really understanding some of the new techniques.

So I kind of use that as like a learning opportunity to basically just like read a bunch of papers and then answer questions on Quora. And so you can kind of see traces of that basically in my current Twitter where it's just like really about kind of like framing concepts and trying to make it understandable and educate other users on it.

- Yeah, I've said, so a lot of people come to me for my Twitter advice, but like I think you are doing one of the best jobs in AI Twitter, just explaining concepts and just consistently getting hits out. - Thank you. (laughing) - And I didn't know it was due to the Quora training.

Let's stay on Quora for a bit. A lot of people, including myself, like kind of wrote off Quora as like one of the web 1.0 like sort of question answer forums. But now I think it's seeing a resurgence, obviously due to Poe. And obviously Adam D'Angelo has always been a leading tech figure, but what do you think is like kind of underrated about Quora?

- I really like the mission of Quora when I joined. In fact, I think when I interned there like in 2015 and I joined full time in 2017, one is like they had and they have like a very talented engineering team and just like really, really smart people. And the other part is the whole mission of the company is to just like spread knowledge and to educate people.

Right, and to me that really resonated. I really liked the idea of just like education and democratizing the flow of information. And if you imagine like kind of back then it was like, okay, you have Google, which is like for search, but then you have Quora, which is just like user generated like grassroots type content.

And I really liked that concept because it's just like, okay, there's certain types of information that aren't accessible to people, but you can make accessible by just like surfacing it. And so actually, I don't know if like most people know that about like Quora, like if they've used the product, whether through like SEO, right, or kind of like actively, but that really was what drew me to it.

- Yeah, I think most people's challenge with it is that sometimes you don't know if it's like a veiled product pitch, right? - Yeah. - It's like, you know. - Of course, like quality of the answer matters quite a bit. And then-- - It's like five alternatives and then here's the one I work on.

- Yeah, like recommendation issues and all that stuff. I actually worked on recsys at Quora. So, I got a taste of it. - So how do you solve stuff like that? - Well, I mean, I kind of more approached it from machine learning techniques, which might be a nice segue into RAG actually.

A lot of it was just information retrieval. We weren't like solving anything that was like super different than what was standard in the industry at the time, but just like ranking based on user preferences. I think a lot of Quora was very metrics driven. So just like trying to maximize like, you know, daily active hours, like, you know, time spent on site, those types of things.

And all the machine learning algorithms were really just based on embeddings. You know, you have a user embedding and you have like item embeddings and you try to train the models to try to maximize the similarity of these. And it's basically a retrieval problem. - Okay, so you've been working on RAG for longer than most people think?

- Well, kind of. So I worked there for like a year, right? - Yeah. - Just transparently. And then I worked at Uber where I was not working on ranking. It was more like kind of deep learning training for self-driving and computer vision and that type of stuff. But I think, yeah, I mean, I think in the LLM world, it's kind of just like a combination of like everything these days.

I mean, retrieval is not really LLMs, but like it fits within the space of like LLM apps. And then obviously like having knowledge of the underlying deep learning architecture helps, having knowledge of basic software engineering principles helps too. And so I think it's kind of nice that like this whole LLM space is basically just a combination of just like a bunch of stuff that you probably like people have done in the past.

- It's good. It's like a summary capstone project. - Yeah, exactly. Yeah. - Yeah. - And before we dive into LlamaIndex, what do they feed you at Robust Intelligence that both you and Harrison from LangChain came out of it at the same time? Was there like, yeah, is there any fun story of like how both of you kind of came out with kind of like core infrastructure for LLM workflows today?

Or how close were you at Robust? Like any fun behind the scenes? - Yeah. Yeah. We worked pretty closely. I mean, we were on the same team for like two years. I got to know Harrison and the rest of the team pretty well. I mean, I have a lot of respect for the people there.

The people there were very driven, very passionate. And it definitely pushed me to be a better engineer and leader and those types of things. Yeah, I don't really have a concrete explanation for this. I think it's more just, we had like an LLM hackathon around like September, or it was October actually, that was just like exploring GPT-3.

And then the day after I went on vacation for a week and a half. And so I just didn't track Slack or anything. Came back, saw that Harrison started LangChain. I was like, oh, that's cool. I was like, oh, I'll play around with LLMs a bit and then hacked around on stuff.

And I think I've told the story a few times, but you know, I was like trying to feed information into GPT-3. And then you deal with like context window limitations and there was no tooling or really practices to try to understand how do you, you know, get GPT-3 to navigate large amounts of data.

And that's kind of how the project started. Really was just one of those things where early days, like we were just trying to build something that was interesting and not really, like I wanted to start a company. I had other ideas actually of what I wanted to start. And I was very interested in, for instance, like multi-modal data, like video data and that type of stuff.

And then this just kind of grew and eventually took over the other idea. - Text is the universal interface. - I think so. I think so. I actually think once the multi-modal models come out, I think there's just like mathematically nicer properties of you can just like join multi-modal embeddings, CLIP style.

But for now, like text is really nice because from a software engineering principle, it just makes things way more modular. You just convert everything into text and then you just represent everything as text. - Yeah. I'm just explaining retroactively why working on LlamaIndex took off, versus if you had chosen to spend your time on multi-modal, we probably wouldn't be talking about whatever you ended up working on.

- Yeah, that's true. - It's troubled. Yeah, I think so. So interesting. So November 9th. So that was a very productive month, I guess. So October, November. November 9th, you announced GPT Tree Index and you picked the tree logo. Very, very, very cool. Everyone, every project must have an emoji.

- Yeah. Yeah. That probably was somewhat inspired by LangChain, I will admit, yeah. - It uses GPT to build a knowledge tree in a bottoms-up fashion by applying a summarization prompt for each node. - Yep. - Which, I like that original vision. Your messaging around it back then was also that you're creating optimized data structures.

What's the journey to that and how does that contrast with LLM Index today? - Yeah, so, okay. Maybe I can tell a little bit about the beginning intuitions. I think when I first started, this really wasn't supposed to be something that was like a toolkit that people use. It was more just like a system.

And the way I wanted to think about the system was more a thought exercise of how language models with their reasoning capabilities, if you just treat them as like brains, can organize information and then traverse it. So I didn't want to think about embeddings, right? To me, embeddings just felt like it was just an external thing that was like, well, it was just external to try and actually tap into the capabilities of language models themselves, right?

I really wanted to see, you know, just as a human brain could synthesize stuff, could we create some sort of structure where there's this neural CPU, if you will, can organize a bunch of information, auto-summarize a bunch of stuff, and then also traverse the structure that I created. That was the inspiration for this initial tree index.
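
As a rough illustration of that idea (a sketch, not the actual LlamaIndex implementation), a bottoms-up summary tree can be built by summarizing chunks in groups and then summarizing the summaries, with the LLM later choosing which branch to descend at query time. The `summarize` and `choose_child` callables here are hypothetical stand-ins for LLM calls:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str                       # raw chunk at the leaves, summary above
    children: list = field(default_factory=list)

def build_tree(chunks, summarize, fanout=4):
    """Group nodes `fanout` at a time and summarize upward until one root remains."""
    nodes = [Node(text=c) for c in chunks]
    while len(nodes) > 1:
        parents = []
        for i in range(0, len(nodes), fanout):
            group = nodes[i:i + fanout]
            summary = summarize("\n".join(n.text for n in group))
            parents.append(Node(text=summary, children=group))
        nodes = parents
    return nodes[0]

def query_tree(root, question, choose_child):
    """Walk down the tree, letting the LLM pick the most relevant child at each level."""
    node = root
    while node.children:
        node = choose_child(question, node.children)
    return node.text
```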

It didn't actually, to be honest, and I think I said this in the first tweet, it didn't actually work super well, right? The GPT-3 at the time- - You're very honest about that. - Yeah, I know, I mean, it was just like, GPT-4 obviously is much better at reasoning.

I'm one of the first to say, you shouldn't use anything pre-GPT-4 for anything that requires complex reasoning because it's just gonna be unreliable. Okay, disregarding stuff like fine-tuning, but it worked okay, but I think it definitely struck a chord with kind of like the Twitter crowd, which is just like looking for kind of, just like new ideas at the time, I guess just like thinking about how you can actually bake this into some sort of application because I think what I also ended up discovering was the fact that basically everybody, there was starting to become a wave of developers building on top of GPT-3 and people were starting to realize that what makes them really useful is to apply them on top of your personal data.

And so even if the solution itself was kind of like primitive at the time, like the problem statement itself was very powerful. And so I think being motivated by the problem statement, right, like this broad mission of how do I unlock LLMs on top of the data, also contributed to the development of LlamaIndex to the state it is today.

And so I think part of the reason our toolkit has evolved beyond the like just existing set of like data structures is we really tried to take a step back and think, okay, what exactly are the tools that would actually make this useful for a developer? And then, you know, somewhere around December, we made an active effort to basically like push towards that direction, make the code base more modular, right, more friendly as an open source library.

And then also start adding in like embeddings, start thinking about practical considerations like latency, cost, performance, those types of things. And then really motivated by that mission, like start expanding the scope of the toolkit towards like covering the life cycle of like data ingestion and querying. - Yeah, where you also added LlamaHub and-- - Yeah, yeah, so I think that was in like January on the data loading side.

And so we started adding like some data loaders, saw an opportunity there, started adding more stuff on the retrieval querying side, right, we still have like the core data structures, but how do you actually make them more modular and kind of like decouple storing state from the types of like queries that you could run on top of this a little bit.

And then starting to get into more complex interactions like chain-of-thought reasoning, routing, and you know, like agent loops. - Yeah, yeah, very cool. - And then you and I spent a bunch of time earlier this year talking about LlamaHub, what that might become. You were still at Robust.

When did you decide it was time to start the company and then start to think about what Lama Index is today? - Probably December, yeah. And so it was clear that, you know, it was kind of interesting. I was getting some inbound from initial VCs. I was talking about this project.

And then in the beginning, I was like, oh yeah, you know, this is just like a side project, but you know, what about my other idea on like video data? Right, and I was trying to like get their thoughts on that. And then everybody was just like, oh yeah, whatever.

Like that part's like a crowded market. And then it became clear that, you know, this was actually a pretty big opportunity. And like coincidentally, right, like this actually did relate to, like my interests have always been at the intersection of AI data and kind of like building practical applications.

And it was clear that this was evolving into a much bigger opportunity than the previous idea was. So around December. And then I think I gave a pretty long notice, but I left officially like early March. - What was your thinking in terms of like moats and, you know, founders kind of like overthink it sometimes.

You obviously had like a lot of open source love and like a lot of community. And yeah, like, were you ever thinking, okay, I don't know, this is maybe not enough to start a company or did you always have conviction about it? - Oh no, I mean, a hundred percent.

I felt like I did this exercise, like honestly, probably more late December and then early January, 'cause I was just existentially worried about whether or not this would actually be a company at all. And okay, what were the key questions I was thinking about? And these were the same things that like other founders, investors, and also like friends would ask me is just like, okay, what happens if context windows get much bigger?

What's the point of actually structuring data, right, in the right way, right? Why don't you just dump everything into the prompt? Fine-tuning, like what if you just train the model over this data? And then, you know, what's the point of doing this stuff? And then some other idea is what if like OpenAI actually just like takes this, like, you know, builds upwards on top of their existing like foundation models and starts building in some like built-in orchestration capabilities around stuff like RAG and agents and those types of things.

And so I basically ran through this mental exercise and, you know, I'm happy to talk a little bit more about those thoughts as well, but at a high level, well, context windows have gotten bigger, but there's obviously still a need for RAG. I think RAG is just like one of those things that like, in general, what people care about is yes, they do care about performance, but they also care about stuff like latency and costs.

And my entire reasoning at the time was just like, okay, like, yes, maybe we'll have like much bigger context windows as we've seen with like 100K context windows, but for enterprises like, you know, data, which is not in just like the scale of like a few documents, it's usually in like gigabytes, terabytes, petabytes, like how do you actually just unlock language models over that data, right?

And so it was clear there was just like, whether it's RAG or some other paradigm, no one really knew what that answer was. And so there was clearly like technical opportunity here. Like there were just stacks that needed to be invented to actually solve this type of problem because language models themselves didn't have access to this data.

And so if like you just dumped all this data into, let's say a model had like hypothetically an infinite context window, right? And you just dump like 50 gigabytes of data into the context window. That just seemed very inefficient to me because you have these network transfer costs of uploading 50 gigabytes of data to get back a single response.

And so I kind of realized, you know, there's always gonna be some curve, regardless of like the performance of the best performing models, of like cost versus performance. And so what RAG does is it does provide extra data points along that axis, because you can kind of control the amount of context you actually want it to retrieve.

And of course, like RAG as a term was still evolving back then, but it was just this whole idea of like, how do you just fetch a bunch of information to actually, you know, like stuff into the prompt. And so people, even back then, were kind of thinking about some of those considerations.

- And then you fundraised in June, or you announced your fundraise in June. - Yeah. - With Greylock. How was that process? Just like take us through that process of thinking about the fundraise and your plans for the company, you know, at the time. - Yeah, definitely. I mean, I think we knew we wanted to, I mean, obviously we knew we wanted to fundraise.

I think there was also a bunch of like investor interest and it was probably pretty unusual given the, you know, like hype wave of generative AI. So like a lot of investors were kind of reaching out around like December, January, February. In the end, we went with Greylock. Greylock's great.

You know, they've been great partners so far. And like, to be honest, like there's a lot of like great VCs out there. And a lot of them who are specialized on like open source, data, infra, and that type of stuff. What we really wanted to do was, because for us, like time was of the essence, like we wanted to ship very quickly and still kind of build Mindshare in this space.

We just kept the fundraising process very efficient. I think we basically did it in like a week or like three days, I think so. - Yeah. - Just like front loaded it. And then, and then just like-- - We picked the one named Jerry. - Hey, yeah, exactly. (both laughing) - I'm kidding.

Guys, I mean, he's obviously great and Greylock's fantastic for him. - Yeah, I know. And embedding some of my research. So yeah, just we picked Greylock. They've been great partners. I think in general, when I talk to founders about like the fundraise process, it's never like the most fun period, I think.

Because it's always just like, you know, there's a lot of logistics, there's lawyers, you have to, you know, get in the loop. And then you, and like a lot of founders just want to go back to building. And so I think in the end, we're happy that we kept it to a pretty efficient process.

- Cool. And so you fundraise with Simon, your co-founder. And how do you split things with him? How big is your team now? - The team is growing. By the time this podcast is released, we'll probably have had one more person join the team. And so basically, it's between, we're rapidly getting to like eight or nine people.

At the current moment, we're around like six. And so just like, there'll be some exciting developments in the next few weeks. So I'm excited to kind of, to announce that. We've been pretty selective in terms of like how we like grow the team. Obviously, like we look for people that are really active in terms of contributions to LlamaIndex, people that have like very strong engineering backgrounds.

And primarily, we've been kind of just looking for builders, people that kind of like grow the open source and also eventually this like managed like enterprise platform as well with us. In terms of like Simon, yeah, I've known Simon for a few years now. I knew him back at Uber ATG in Toronto.

He's one of the smartest people I knew. Like has a sense of both like a deep understanding of ML, but also just like first principles thinking about like engineering and technical concepts in general. And I think one of my criteria is when I was like looking for a co-founder for this project was someone that was like technically better than me, 'cause I knew I wanted like a CTO.

And so honestly, like there weren't a lot of people that, I mean, I know a lot of people that are smarter than me, but like that fit that bill, were willing to do a startup, and also just had the same like values that I shared, right? And just, I think doing a startup is very hard work, right?

It's not like, I'm sure like you guys all know this. It's a lot of hours, a lot of late nights, and you want to be like in the same place together and just like being willing to hash out stuff and have that grit basically. And I really looked for that.

And so Simon really fit that bill. And I think I convinced him to jump on board. - Yeah, yeah, nice job. And obviously I've had the pleasure of chatting and working with a little bit with both of you. What would you say those like your top like one or two values are when thinking about that or the culture of the company and that kind of stuff?

- Yeah, well, I think in terms of the culture of the company it's really like, I mean, there's a few things I can name off the top of my head. One is just like passion, integrity. I think that's very important for us. We want to be honest. We don't want to like obviously like copy code or kind of like, you know, just like, you know not give attribution, those types of things.

And just like be true to ourselves. I think we're all very like down to earth, like humble people. But obviously I think just willingness to just like own stuff and dive right in. And I think grit comes with that. I think in the end, like this is a very fast moving space and we want to just like be one of the, you know, like dominant forces in helping to provide like production quality LLM applications.

- Yeah. So I promise we'll get to the more technical questions. But I also want to impress on the audience that this is a very conscious and intentional company building. And since your fundraising post, which was in June and now it's September, so it's been about three months. You've actually gained 50% in terms of stars and followers.

You've 3X your download count to 600,000 a month and your Discord membership has reached 10,000. So like a lot of ongoing growth. - Yeah, definitely. And obviously there's a lot of room to expand there too. And so open source growth is gonna continue to be one of our core goals.

'Cause in the end, it's just like, we want this thing to be, well, one big, right? We all have like big ambitions, but to just like really provide value to developers and helping them in prototyping and also productionization of their apps. And I think it turns out we're in the fortunate circumstance where a lot of different companies and individuals, right?

Are in that phase of like, you know maybe they've hacked around on some initial LLM applications, but they're also looking to, you know, start to think about what are the production grade challenges necessary to actually, you know, that to solve to actually make this thing robust and reliable in the real world.

And so we want to basically provide the tooling to do that. And to do that, we need to both spread awareness and education of a lot of the key practices of what's going on. And so a lot of this is going to be continued growth, expansion and education. And we do prioritize that very heavily.

- Awesome. Let's dive into some of the questions you were asking yourself initially around fine-tuning and RAG, how these things play together. You mentioned context. What is the minimum viable context for RAG? So what's like a context window that's too small? And at the same time, maybe what's like a maximum context window?

We talked before about the LLMs are U-shaped reasoners. So as the context got larger, like it really only focuses on the end and the start of the prompt and then it kind of peters down. Any learnings, any kind of like tips you want to give people as they think about it?

- So this is a great question. And I think part of what I wanted to kind of like talk about a conceptual level, especially with the idea of like thinking about what is the minimum context? Like, okay, what if the minimum context was like 10 tokens versus like, you know, 2K tokens versus like a million tokens, right?

Like, and what does that really give you? And what are the limitations if it's like 10 tokens? It's kind of like, like eight bit, 16 bit games, right? Like back in the day, like if you play Mario and you have like the initial Mario where the graphics were very blocky and now obviously it's like full HD, 3D, just the resolution of the context and the output will change depending on how much context you can actually fit in.

The way I kind of think about this from a more principled manner is like, there's this concept of like information capacity, just this idea of like entropy, like given any fixed amount of like storage space, like how much information can you actually compact in there? And so basically a context window length is just like some fixed amount of storage space, right?

And so there's some theoretical limit to the maximum amount of information you can compact into like a 4,000 token storage space. And what is that storage space used for these days with LLMs? It's for inputs and also outputs. And so this really controls the maximum amount of information you can feed in, in terms of the prompt, plus the granularity of the output.

If you had an infinite context window, you could have an infinitely detailed response and also infinitely detailed memory. But if you don't, you can only kind of represent stuff in more quantized bits, right? And so the smaller the context window, just generally speaking, the fewer details and the less specific, precise information you're gonna be able to surface at any given point in time.

- And when you have short context, is the answer just like get a better model or is the answer maybe, hey, there needs to be a balance between fine tuning and RAG to make sure you're gonna like leverage the context, but at the same time, don't keep it too low resolution?

- Yeah, yeah. Well, there's probably some minimum threshold. I don't think anyone wants to work with like a 10, I mean, that's just a thought exercise anyways, a 10 token context window. I think nowadays the modern context window of like 2K, 4K is enough for just like doing some sort of retrieval on granular context and being able to synthesize information.

I think for most intents and purposes, that level of resolution is probably fine for most people, for most use cases. I think the limitation is actually more on, okay, if you're gonna actually combine this thing with some sort of retrieval data structure mechanism, there's just limitations on the retrieval side because maybe you're not actually fetching the most relevant context to actually answer this question, right?

Like, yes, like given the right context, 4,000 tokens is enough, but if you're just doing like top-k similarity, like you might not be fetching the right information from the documents. - Yeah, so how should people think about when to stick with RAG versus when to even entertain fine-tuning?

And also in terms of what's like the threshold of data that you need to actually worry about fine tuning versus like just stick with RAG. Obviously you're biased because you're building a RAG company, but- - No, no, actually, I think I have like a few hot takes in here, some of which sound like a little bit contradictory to what we're actually building.

To be honest, I don't think anyone knows the right answer. I think this is just- - We're pursuing the truth. - Yeah, exactly. This is just like thought exercise towards like understanding the truth, right? So I think, okay, I have a few hot takes. One is like RAG is basically just a hack, but it turns out it's a very good hack because what is RAG?

RAG is you keep the model fixed and you just figure out a good way to like stuff stuff into the prompt of the language model. Everything that we're doing nowadays in terms of like stuffing stuff into the prompt is just algorithmic. We're just figuring out nice algorithms to like retrieve the right information with top-k similarity, do some sort of like hybrid search, some sort of like chain-of-thought decomposition, and then just like stuff stuff into the prompt.
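
For listeners who want the gist in code, here is a minimal sketch of that loop; `embed`, `vector_store`, and `llm` are hypothetical stand-ins for whatever embedding model, vector database, and completion API you happen to use:

```python
def answer_with_rag(question, embed, vector_store, llm, top_k=3):
    # 1. Embed the question and fetch the top-k most similar chunks.
    query_embedding = embed(question)
    chunks = vector_store.similarity_search(query_embedding, k=top_k)

    # 2. Stuff the retrieved chunks into the prompt. The model stays fixed;
    #    all of the "optimization" lives in how this context is chosen and arranged.
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

    # 3. One call to the frozen LLM.
    return llm(prompt)
```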

So it's all like algorithmic, and it's more like just software engineering to try to make the most out of these like existing APIs. The reason I say it's a hack is just like, from a pure like optimization standpoint, if you think about this from like the machine learning lens versus the software engineering lens, there's pieces in here that are gonna be like suboptimal, right?

Like, obviously, like the thing about machine learning is when you optimize like some system that can be optimized within machine learning, like the set of parameters, you're really like changing like the entire system's weights to try to optimize the objective function. And if you just cobble a bunch of stuff together, you can't really optimize it; the pieces are inefficient, right?

And so like a retrieval interface, like doing top-k embedding lookup, that part is inefficient, because there might potentially be a better, more learned retrieval algorithm. If you kind of do stuff like some sort of, I know nowadays there's this concept of how do you do like short-term or long-term memory, right?

Like represent stuff in some sort of vector embedding, do chunk sizes, all that stuff. It's all just like decisions that you make that aren't really optimized, right? It's more, and it's not really automatically learned, it's more just things that you set beforehand to actually feed into the system. There's a lot of room to actually optimize the performance of an entire LLM system, potentially in a more like machine learning base way, right?

And I will leave room for that. And this is also why I think like in the long-term, like I do think fine-tuning will probably have like greater importance, and just like, there will probably be new architectures invented where you can actually kind of like include a lot of this under the black box, as opposed to like cobbling together a bunch of components outside the black box.

That said, just very practically, given the current state of things, like even if I said RAG is a hack, it's a very good hack and it's also very easy to use, right? And so just like for kind of like the AI engineer persona, that like, which to be fair is kind of one of the reasons generative AI has gotten so big, is because it's way more accessible for everybody to get into, as opposed to just like traditional machine learning.

It tends to be good enough, right? And if we can basically provide these existing techniques to help people really optimize how to use existing systems without having to really deeply understand machine learning, I still think that's a huge value add. And so there's very much like a UX and ease of use problem here, which is just like RAG is way easier to onboard and use.

And that's probably like the primary reason why everyone should do RAG instead of fine-tuning to begin with. If you think about like the 80/20 rule, like RAG very much fits within that and fine-tuning doesn't really right now. And then I'm just kind of like leaving room for the future that, you know, like in the end, fine-tuning can probably take over some of the aspects of like what RAG does.

- I don't know if this is mentioned in your recap there, but explainability also allows for sourcing. And like at the end of the day, like to increase trust, we have to source documents. - Yeah, so I think what RAG does is it increases like transparency, visibility into the actual documents that are getting fed into the context.

- Here's where they got it from. - Exactly, and so that's definitely an advantage. I think the other piece that I think is an advantage, and I think that's something that someone actually brought up, is just you can do access control with RAG if you have an external source system.

You can't really do that with large language models, which would be like gating information within the neural net weights, like depending on the type of user. For the first point, you could technically, right, you could technically have the language model, like if it memorized enough information, just like cite sources, but there's a question of just trust.
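
In practice, that access control usually shows up as permission metadata enforced at the retrieval layer, so the model never sees documents a user cannot read. A generic sketch (not a specific LlamaIndex feature), where `embed` is a hypothetical embedding function and each document carries an `acl` set of group names:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-9)

def retrieve_for_user(question, user, embed, documents, top_k=3):
    """`documents` are dicts with 'text', 'embedding', and an 'acl' set of allowed groups."""
    # Gate at retrieval time: only documents whose ACL intersects the user's
    # groups are candidates; everything else stays invisible to the LLM.
    allowed = [d for d in documents if d["acl"] & set(user["groups"])]
    query_embedding = embed(question)
    allowed.sort(key=lambda d: cosine_similarity(query_embedding, d["embedding"]), reverse=True)
    return allowed[:top_k]
```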

Whether or not it's accurate. - Yeah, well, but like it makes it up right now 'cause it's like not good enough, but imagine a world where it is good enough and it does give accurate citations. - Yeah, no, I think to establish trust, you just need a direct connection.

So it's kind of weird, it's this melding of, you know, deep learning systems versus very traditional information retrieval. - Yeah, exactly. So I think, I mean, I kind of think about it as analogous to like humans, right? Like we as humans, obviously we use the internet, we use tools.

These tools have API interfaces that are well-defined. And obviously we're not, like the tools aren't part of us. And so we're not like back-propping or optimizing over these tools. And so kind of when you think about like RAG, it's basically the LLM learning how to use like a vector database to look up information that it doesn't know.

And so then there's just a question of like how much information is inherent within the network itself and how much does it need to do some sort of like tool use to look up stuff that it doesn't know. And I do think there'll probably be more and more of that interplay as time goes on.

- Yeah. Some follow-ups on discussions that we've had. So, you know, we discussed fine tuning a bit and what's your current take on whether you can fine tune new knowledge into LLMs? - That's one of those things where I think long-term you definitely can. I think some people say you can't, I disagree.

I think you definitely can. Just right now I haven't gotten it to work yet. So, so I think like- - You've tried. - Yeah, well, not in a very principled way, right? Like this is something that requires like an actual research scientist and not someone that has like, you know, an hour or two per night to actually get this.

- Like you were a research scientist at Uber. - Yeah, yeah, but it's like full-time, full-time work. So I think what I specifically concretely did was I took OpenAI's fine tuning endpoints and then tried to, you know, it's in like a chat message interface. And so there's like a user assistant message format.

And so what I did was I tried to take just some piece of text and have the LLM memorize it by just asking it a bunch of questions about the text. So given a bunch of contexts, I would generate some questions and then generate some response and just fine tune over the question responses.
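
Concretely, that experiment looks something like the sketch below: generate question/answer pairs per chunk of text and write them out in the chat-format JSONL that OpenAI's fine-tuning endpoint expects. The `generate_qa_pairs` helper is hypothetical; you would implement it with another LLM call:

```python
import json

def build_finetune_file(chunks, generate_qa_pairs, path="train.jsonl"):
    """Turn raw text chunks into chat-format fine-tuning examples."""
    with open(path, "w") as f:
        for chunk in chunks:
            # e.g. prompt an LLM: "Write a few question/answer pairs about this text."
            for question, answer in generate_qa_pairs(chunk):
                example = {
                    "messages": [
                        {"role": "system", "content": "You are a helpful assistant."},
                        {"role": "user", "content": question},
                        {"role": "assistant", "content": answer},
                    ]
                }
                f.write(json.dumps(example) + "\n")
```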

That hasn't really worked super well. But that's also because I'm just like trying to use OpenAI's endpoints as is. If you just think about like traditional, like how you train a Transformer model, there's kind of like the instruction fine-tuning aspect, right? You like kind of ask it stuff and guide it with correct responses, but then there's also just like next-token prediction.

And that's something that you can't really do with the OpenAI API, but you can do if you just trained it yourself. And that's probably possible if you just like train it over some corpus of data. I think Shishir from Berkeley said like, you know, when they trained Gorilla, they were like, oh, you know, a lot of these LLMs are actually pretty good at memorizing information.

Just the way the API interface is exposed is just no one knows how to use them right now, right? And so I think that's probably one of the issues. - Just to clue people in who haven't read the paper, Gorilla is the one where they train it to use specific APIs?

- Yeah, yeah. And I think they also did something where like the model itself could learn to, yeah, I think this was on the Gorilla paper. Like the model itself could try to learn some prior over the data to decide like what tool to pick. But there's also, it's also augmented with retrieval that helps supplement it in case like the prior doesn't actually work.

- Is that something that you'd be interested in supporting? - I mean, I think in the long term, like if this is kind of how fine-tuning plus RAG evolves, like I do think there'll be some aspect where fine-tuning will probably memorize some high-level concepts of knowledge, but then like RAG will just be there to supplement like aspects that it doesn't know, yeah.

- Yeah, awesome. - Obviously RAG is the default way. Like to be clear, RAG right now is the default way to actually augment stuff with knowledge. I think it's just an open question of how much the LLM can actually internalize both high-level concepts, but also details as you can like train stuff over it.

And coming from an ML background, like there is a certain beauty in just baking everything into some training process of a language model. Like if you just take raw ChatGPT or ChatGPT Code Interpreter, right, like GPT-4, it's not like you do RAG with it. You just ask it questions about like, “Hey, how do I like define a Pydantic model in Python?” And then like, “Can you give me an example?

Can you visualize a graph?” It just does it, right? And it'll run it through Code Interpreter as a tool, but that's not like a source for knowledge. It's just an execution environment. And so there is some beauty in just like having the model itself, like just, you know, instead of you kind of defining the algorithm for what the data structure should look like, the model just learns it under the hood.

That said, I think the reason it's not a thing right now is just like, no one knows how to do it. It probably costs too much money. And then also like the API interfaces and just like the actual like ability to kind of evaluate and improve on performance, like isn't known to most people.

- Yeah. It also would be better with browsing. (laughs) - Yeah. - I wonder when they're going to put that back. - Okay. - Okay, cool. Yeah, so, and then one more follow-up before we go into RAG for AI engineers is on your brief mention about security or auth.

And how many of the people that you talk to, you know, you talk to a lot of people putting LlamaIndex into production, how many people actually are there versus just like, let's just dump the whole company Notion into this thing. - Wait, you're talking about from like the security auth standpoint?

- Yeah, how big a need is that? Because I talked to some people who are thinking about building tools in that domain, but I don't know if people want it. I mean, I think bigger companies, like just bigger companies, like banks, consulting firms, like they all want this. - Yes, it's a requirement, right?

- The way they're using LlamaIndex is not with this, obviously, 'cause I don't think we have support for like access control or auth or that type of stuff under the hood, 'cause we're more just like an orchestration framework. And so the way they do it, they build these initial apps is more kind of like prototype, like let's kind of, yeah, like, you know, use some publicly available data that's not super sensitive.

Let's like, you know, assume that every user is going to be able to have access to the same amount of knowledge, those types of things. I think users have asked for it, but I don't think that's like a P0. Like, I think the P0 is more on like, can we get this thing working before we expand this to like more users within the org?

- Yep. - Cool. So there's a bunch of pieces to RAG, obviously. It's not just an acronym. And you tweeted recently, you think every AI engineer should build it from scratch at least once. Why is that? - I think so. I'm actually kind of curious to hear your thoughts about this, but this kind of relates to the initial like AI engineering posts that you put out.

And then also just like the role of an AI engineer and the skills that they're going to have to learn to truly succeed. 'Cause there's an entire spectrum. On one end, you have people that don't really like understand the fundamentals and just want to use this to like cobble something together to build something.

And I think there is a beauty in that for what it's worth. Like it's just one of those things. And Gen AI has made it so that you can just use these models in inference only mode, cobble something together, use it to power your app experiences. On the other end, what we're increasingly seeing is that like more and more developers building with these apps start running into honestly like pretty similar issues that plague just a standard ML engineer building like a classifier model, which is just like accuracy problems, like, and hallucinations, which is basically just an accuracy problem, right?

Like it's not giving you the right results. So what do you do? You have to iterate on the model itself. You have to figure out what parameters you tweak. You have to gain some intuition about this entire process. That workflow is pretty similar, honestly, even if you're not training the model, to just like tuning an ML model with like hyperparameters and learning like proper ML practices of like, okay, how do I define a good evaluation benchmark?

How do I define like the right set of metrics to use, right? How do I actually iterate and improve the performance of this pipeline for production? What tools do I use, right? Like every ML engineer uses like some form of Weights & Biases, TensorBoard, or like some other experiment tracking tool.

Like what tools should I use to actually help build like LLM applications and optimize it for production? There's like a certain amount of just like LLM ops, like tooling and concepts and just like practices that people will kind of have to internalize if they want to optimize these. And so I think that the reason I think like being able to build like RAG from scratch is important is it really gives you a sense of like how things are working to help you build intuition about like what parameters are within a RAG system and which ones actually tweak to make them better.

One of the advantages of LlamaIndex is the LlamaIndex quickstart is three lines of code. The downside of that is you have zero visibility into what's actually going on under the hood. And I think this is something that we've kind of been thinking about for a while. And I'm like, okay, let's just release like a new tutorial series.

That's just like, no three lines of code. We're just gonna go in and actually show you how the thing actually works under the hood, right? And so like, does everybody need this? Like probably not. Like for some people, the three lines of code might work. But I think increasingly, like honestly, 90% of the users I talk to have questions about how to improve the performance of their app.
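
For reference, the quickstart he is describing looked roughly like this at the time (module paths have shifted across llama_index versions, so treat the exact imports as approximate):

```python
from llama_index import SimpleDirectoryReader, VectorStoreIndex

# The "three lines": load documents, build an index, ask a question.
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()
print(query_engine.query("What did the author do growing up?"))
```

Everything the conversation goes on to cover - chunking, embeddings, retrieval, synthesis - is hidden behind those few calls, which is exactly the visibility trade-off being described.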

And so just like, given this is just like one of those things that's like better for the understanding. - Yeah, I'd say it is one of the most useful tools of any sort of developer education toolkit to write things yourself from scratch. So Kelsey Hightower famously wrote Kubernetes the hard way, which is don't use Kubernetes.

Just like do everything. Here's everything that you would have to do by yourself. And you should be able to put all these things together yourself to understand the value of Kubernetes. And the same thing for LlamaIndex. I was the guy who did the same for React. And yeah, it's pretty, well, it's a pretty good exercise for you to just fully understand everything that's going on under the hood.

And I was actually gonna suggest, well, in one of the previous conversations, you know, there's all these like hyperparameters, like the size of the chunks and all that. And I was thinking like, you know, what would hyperparameter optimization for RAG look like? - Yeah, definitely. I mean, so absolutely.

I think that's gonna be an increasing thing. I think that's something we're kind of looking at. - I think someone should just do like some large-scale study and then just ablate everything. And just, you tell us. - I think it's gonna be hard to find a universal default that works for everybody.

I think it's gonna be somewhat- - Are you telling me it depends? - Boo! - I do think it's gonna be somewhat dependent on the data and use case. I think if there was a universal default, that'd be amazing. But I think increasingly we found, you know, people are just defining their own like custom parsers for like PDFs, markdown files for like, you know, SCC filings versus like, you know, Slack conversations.

And then like the use case too, like, do you want like a summarization, like the granularity of the response? Like it really affects the parameters that you wanna pick. And so I do like the idea of hyperparameter optimization though. But it's kind of like one of those things where you are kind of like training the model basically, kind of on your own data domain.
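
A naive version of that kind of search, with hypothetical `build_index` and `evaluate` helpers standing in for your own ingestion pipeline and a small QA eval set:

```python
import itertools

def tune_rag(documents, eval_questions, build_index, evaluate):
    """Grid-search a couple of RAG knobs against an eval set; returns the best config."""
    chunk_sizes = [256, 512, 1024]
    top_ks = [2, 4, 8]
    best_score, best_config = None, None
    for chunk_size, top_k in itertools.product(chunk_sizes, top_ks):
        index = build_index(documents, chunk_size=chunk_size)
        # `evaluate` might measure retrieval hit rate or answer correctness.
        score = evaluate(index, eval_questions, top_k=top_k)
        if best_score is None or score > best_score:
            best_score, best_config = score, {"chunk_size": chunk_size, "top_k": top_k}
    return best_config, best_score
```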

- Yeah. You mentioned custom parsers. You've designed LlamaIndex. Maybe we can talk about like the surface area of the framework. You designed LlamaIndex in a way that it's more modular. Yeah, like you mentioned. How would you describe the different components and what's customizable in each? - Yeah, I think they're all customizable.

And I think that there is a certain burden on us to make that more clear through the docs. - Well, number four is customization tutorials. - Yeah, yeah. But I think like just in general, I think we do try to make it so that you can plug in the out of the box stuff.

But like if you want to kind of customize more lower level components, like we definitely encourage you to do that and plug it into the rest of our abstractions. So let me just walk through like maybe some of the basic components of LlamaIndex. There's data loaders. You can load data from different data sources.

We have LlamaHub, which you guys brought up, which is a collection of different data loaders for like structured and unstructured data, like PDFs, file types, like Slack, Notion, all that stuff. Now you load in this data. We have a bunch of like parsers and transformers. You can split the text.

You can add metadata to the text and then basically figure out a way to load it into like a vector store. So, I mean, you worked at like Airbyte, right? It's kind of like there is some aspect of like the E and T, right, in terms of like extracting and transforming this data.

And then the L, right? Loading it into some storage abstraction, we have like a bunch of integrations with different document storage systems. So that's data. And then the second piece really is about like, how do you retrieve this data? How do you like synthesize this data? And how do you like do some sort of higher level reasoning over this data?

So retrieval is one of the core abstractions that we have. We do encourage people to like customize, define your own retrievers. That's why we have that section on kind of like how do you define your own like custom retriever, but also we have like out of the box ones.

The retrieval algorithm kind of depends on how you structure the data, obviously. Like if you just flat index everything with like chunks with like embeddings, then you can really only do like top K like lookup plus maybe like keyword search or something. But if you can index it in some sort of like hierarchy, like defined relationships, you can do more interesting things, like actually traverse relationships between nodes.

Then after you have this data, how do you like synthesize the data, right? And this is the part where you feed it into the language model. There's some response abstraction that can abstract away over like long context to actually still give you a response even if the context overflows the context window.
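
Spelled out in code, that load, transform, index, retrieve, synthesize pipeline looks something like the sketch below. The imports reflect the 2023-era llama_index layout and may differ in other versions; each piece can be swapped for a custom implementation:

```python
from llama_index import SimpleDirectoryReader, VectorStoreIndex, get_response_synthesizer
from llama_index.node_parser import SimpleNodeParser
from llama_index.retrievers import VectorIndexRetriever
from llama_index.query_engine import RetrieverQueryEngine

# Load: pull in documents (LlamaHub loaders plug in at this step).
documents = SimpleDirectoryReader("data").load_data()

# Transform: split into nodes/chunks, optionally attaching metadata.
nodes = SimpleNodeParser.from_defaults(chunk_size=512).get_nodes_from_documents(documents)

# Index/store: embed the nodes into a vector store abstraction.
index = VectorStoreIndex(nodes)

# Retrieve + synthesize: both components are customizable.
retriever = VectorIndexRetriever(index=index, similarity_top_k=4)
synthesizer = get_response_synthesizer(response_mode="compact")
query_engine = RetrieverQueryEngine(retriever=retriever, response_synthesizer=synthesizer)

print(query_engine.query("Summarize the key risks mentioned in these filings."))
```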

And then there's kind of these like higher level like reasoning primitives that I'm gonna define broadly. And I'm just gonna call them in some general bucket of like agents, even though everybody has different definitions of agents. And agents- - But you were the first to data agents, which I was very excited about.

- Yeah, we kind of like coined that term. And the way we thought about it was, we wanted to think about how to use agents for like data workflows basically. And so what are the reasoning primitives that you wanna do? So the most simple reasoning primitive you can do is some sort of routing module.

Like you can just, it's a classifier. Like given a query, just make some automated decision on what choice to pick, right? You could use LLMs. You don't have to use LLMs. You could just train a classifier basically. That's something that we might actually explore. And then the next piece is, okay, what are some higher level things?

You can have the LLM like define like a query plan, right? To actually execute over the data. You can do some sort of while loop, right? That's basically what an agent loop is, which is like ReAct, tree of thoughts, like chain of thought, like the OpenAI function calling while loop, to try to take a question and try to break it down into some series of steps to actually execute to get back a response.

And so there's a range in complexity from like simple reasoning primitives to more advanced ones. And I think that's the way we kind of think about it is like which ones should we implement and how do they work well? Like, do they work well over like the types of like data tasks that we give them?
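
As a concrete example of the simplest primitive, a routing module can be a single LLM call that picks among described choices. This is an illustrative sketch rather than LlamaIndex's own router abstraction; `llm` is a hypothetical completion function:

```python
def route_query(question, engines, llm):
    """`engines` maps a short description to a query function; the LLM acts as the classifier."""
    choices = list(engines.items())
    menu = "\n".join(f"{i}: {desc}" for i, (desc, _) in enumerate(choices))
    decision = llm(
        "Pick the single best tool for the question. Reply with the number only.\n"
        f"Tools:\n{menu}\n\nQuestion: {question}"
    )
    index = int(decision.strip().split()[0])  # naive parse of the model's reply
    _, engine = choices[index]
    return engine(question)
```

The same decision could be made by a small trained classifier instead of an LLM call, which is the cheaper option Jerry alludes to.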

- How do you think about optimizing each piece? So take embedding models as one piece of it. You offer fine-tuning embedding models. And I saw it was like fine-tuning gives you like a 5, 10% increase. What's kind of like the delta left on the embedding side? Do you think we can get models that are like a lot better?

Do you think like that's one piece where people should really not spend too much time? - I mean, I think they should. I just think it's not the only parameter 'cause I think in the end, if you think about everything that goes into retrieval, the chunking algorithm, how you define like metadata, right?

That will bias your embedding representations. Then there's the actual embedding model itself, which is something that you can try optimizing. And then there's like the retrieval algorithm. Are you gonna just do top-k? Are you gonna do like hybrid search? Are you gonna do auto-retrieval? Like there's a bunch of parameters.

And so I do think it's something everybody should try. I think by default, we use like OpenAI's embedding model. A lot of people these days use like sentence transformers because it's just like free open source and you can actually optimize, directly optimize it. This is an active area of exploration.

I do think one of our goals is it should ideally be relatively free for every developer to just run some fine-tuning process over their data to squeeze out some more points of performance. And if it's relatively free and there's no downsides, everybody should basically do it. There's just some complexities in terms of optimizing your embedding model, especially in a production-grade data pipeline.

If you actually fine-tune the embedding model and the embedding space changes, you're gonna have to re-index all your documents. And for a lot of people, that's not feasible. And so I think, like Jo from Vespa brought up on one of our webinars, there's this idea that, if you're just using like document and query embeddings, you could keep the document embeddings frozen and just train a linear transform on the query, or any sort of transform on the query, right?

So therefore it's just a query side transformation instead of actually having to re-index all the document embeddings. The other piece is- - Wow, that's pretty smart. - Yeah, yeah, so I think we weren't able to get like huge performance gains there, but it does like improve performance a little bit.
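
That query-side trick is cheap to sketch: keep document embeddings frozen and learn a small matrix applied only to query embeddings, trained on (query, relevant document) pairs. A rough PyTorch version, where `train_pairs` is assumed to be a list of (query_embedding, positive_doc_embedding) tensor pairs; a real setup would also use negatives:

```python
import torch

def train_query_transform(train_pairs, dim, epochs=50, lr=1e-3):
    """Learn W so that W @ query_embedding moves queries toward their relevant docs.
    Document embeddings stay frozen, so no re-indexing is needed."""
    W = torch.eye(dim, requires_grad=True)  # start from the identity transform
    optimizer = torch.optim.Adam([W], lr=lr)
    for _ in range(epochs):
        for query_emb, doc_emb in train_pairs:
            pred = W @ query_emb
            loss = 1 - torch.nn.functional.cosine_similarity(pred, doc_emb, dim=0)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return W.detach()

# At query time: search the existing index with (W @ embed(query)) instead of embed(query).
```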

And that's something that basically everybody should be able to kick off. You can actually do that in LlamaIndex too. - OpenAI has a cookbook on adding bias to the embeddings too, right? - Yeah, yeah, I think so. Yeah, there's just like different parameters that you can try adding to try to like optimize the retrieval process.

And the idea is just like, okay, by default, you have all this text, it kind of lives in some latent space, right? - Shout out, shout out Latent Space. You should take a drink every time. - Yeah, but it lives in some latent space. But like depending on the specific types of questions that the user might wanna ask, the latent space might not be optimized, right?

To actually retrieve the relevant piece of context for what the user wants to ask. So can you shift the embedding points a little bit, right? And how do we do that, basically? That's really a key question here. So optimizing the embedding model, even changing the way you like chunk things, these all shift the embeddings.

- So the retrieval is interesting. I got a bunch of startup pitches that are like, like RAG is cool, but like there's a lot of stuff in terms of ranking that could be better. There's a lot of stuff in terms of sunsetting data once it starts to become stale that could be better.

Are you gonna move into that part too? So like you have SEC Insights as one of kind of like your demos and that's like a great example of, hey, I don't wanna embed all the historical documents because a lot of them are outdated and I don't want them to be in the context.

What's that problem space like? How much of it are you gonna also help with and versus how much you expect others to take care of? - Yeah, I'm happy to talk about SEC Insights in just a bit. I think more broadly about the like overall retrieval space, we're very interested in it because a lot of these are very practical problems that people have asked us.

So the idea of outdated data, like how do you deprecate or time-weight data and do that in a reliable manner, so you don't just set some parameter and all of a sudden it affects all your retrieval algorithms, is pretty important because people have started bringing that up.

Like I have a bunch of duplicate documents, things get out of date, how do I like sunset documents? And then ranking, right? Yeah, so I think this space is not new. I think like rather than inventing like new retriever techniques for the sake of like just inventing better ranking, we wanna take existing ranking techniques and kind of like package it in a way that's like intuitive and easy for people to understand.
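
As one illustration of the "sunsetting stale data" problem, a common approach is to blend similarity scores with a recency decay at rerank time rather than hard-deleting documents. The sketch below shows that idea in plain Python; the half-life and blend weight are arbitrary knobs you would tune for your data, not anything LlamaIndex prescribes.

```python
# Sketch: rerank retrieved results by blending similarity with an exponential recency decay.
from datetime import datetime, timezone

HALF_LIFE_DAYS = 180   # after this many days, a document's recency weight halves
BLEND = 0.3            # how much recency matters relative to similarity

def recency_weight(doc_date: datetime, now: datetime) -> float:
    age_days = (now - doc_date).total_seconds() / 86400
    return 0.5 ** (age_days / HALF_LIFE_DAYS)

def rerank(results, now=None):
    """results: list of (similarity, doc_date, doc_id) tuples from the retriever."""
    now = now or datetime.now(timezone.utc)
    scored = [
        ((1 - BLEND) * sim + BLEND * recency_weight(date, now), doc_id)
        for sim, date, doc_id in results
    ]
    return sorted(scored, reverse=True)
```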

That said, I think there are interesting and new retrieval techniques that can be done when you tie it into some downstream RAG system. I mean, the reason for this is just, if you think about the idea of chunking text, right, that just really wasn't a thing before, or at least not for this specific purpose. The reason chunking is a thing in RAG right now is because you wanna fit within the context window of an LLM, right?

Like why do you wanna chunk a document? That just was less of a thing, I think back then. If you wanted to like transform a document, it was more for like structured data extraction or something in the past. And so there's kind of like certain new concepts that you gotta play with that you can use to invent kind of more interesting retrieval techniques.
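
Since chunking comes up so often, here is what a bare-bones fixed-size chunker with overlap looks like. Real splitters are sentence- and token-aware, so treat this as a sketch of the concept only.

```python
# Sketch: naive fixed-size chunking with overlap, so each chunk fits in an LLM context window.
def chunk_text(text: str, chunk_size: int = 1024, overlap: int = 128) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back so adjacent chunks share some context
    return chunks
```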

Another example here is actually LLM based reasoning, like LLM based chain of thought reasoning. You can take a question, break it down into smaller components and use that to actually send to your retrieval system. And that gives you better results than kind of like sending the full question to a retrieval system.
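
The query-decomposition idea can be sketched like this: ask the LLM to break a question into sub-questions, retrieve for each one, and pool the results before synthesis. The `llm` and `retriever` callables below are placeholders for whatever client and index you use, not a specific LlamaIndex API.

```python
# Sketch: decompose a complex question into sub-questions before retrieval.
# `llm(prompt) -> str` and `retriever(query, k) -> list[str]` are assumed to exist.
def decompose_and_retrieve(question: str, llm, retriever, k: int = 3) -> list[str]:
    prompt = (
        "Break the following question into 2-4 simpler sub-questions, "
        "one per line, that together answer it:\n" + question
    )
    sub_questions = [q.strip() for q in llm(prompt).splitlines() if q.strip()]

    contexts: list[str] = []
    for sub_q in sub_questions:
        contexts.extend(retriever(sub_q, k))
    # Deduplicate while preserving order before handing the context to the synthesizer.
    return list(dict.fromkeys(contexts))
```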

That also wasn't really a thing back then, but then you can kind of figure out an interesting way of like blending old and the new, right, with LLMs and data. - Yeah. There's a lot of ideas that you come across. Do you have a store of them? - So, okay, I think that the, yeah, I think sometimes I get like inspiration.

There's like some problem statement and I'm just like, oh, let's hack this out. - Following you is very hard because it's just a lot of homework. - So I think I've started to like step on the brakes just a little bit. - No, no, no, keep going, keep going.

- No, no, no. Well, the reason is just like, okay, if I just have, invent like a hundred more retrieval techniques, like sure, but like how do people know which one is good and which one's like bad, right? And so- - Have a librarian, right? Like it's gonna catalog it and go- - You're gonna need some like benchmarks.

And so I think that's probably the focus for the next few weeks is actually like properly kind of like having an understanding of like, oh, you know, when should you do this? Or like, does this actually work well? - Yeah, some kind of like maybe like a flow chart, decision tree type of thing.

- Yeah, exactly. - When this, do that, you know, something like that that would be really helpful for me. Thank you. (both laughing) Do you want to talk about SEC Insights? - Sure, yeah. - You had a question. - Yeah, yeah, just, I mean, that's kind of like a good- - It seems like your most successful side project.

- Yeah, okay. So what is SEC Insights, for our listeners? SEC Insights is a full stack LLM chatbot application that does analysis over SEC 10-K and 10-Q filings. And so the goal for building this project is really twofold. The reason we started building this was, one, it was a great way to dogfood the production readiness of our library.

We actually ended up adding a bunch of stuff and fixing a ton of bugs because of this. And I think it was great because, you know, thinking about how we handle callbacks, streaming, actually generating reliable sub-responses and bubbling up source citations, these are all things that, if you're just building the library in isolation, you don't really think about.

But if you're trying to tie this into a downstream application, it really starts mattering. - Is this for your error messages? What do you mean when you talk about bubbling up stuff? For observability? - So, like sources. If you go into SEC Insights and you type something, you can actually see the highlights on the right side.

And so, yeah, that was something that took a little bit of understanding to figure out how to build well. And so it was great for dogfooding and improving the library itself. And then as we were building the app, the second thing was we were starting to talk to users, and just trying to showcase to kind of bigger companies the potential of LlamaIndex as a framework.

Because these days, obviously building a chatbot, right, with Streamlit or something, it'll take you like 30 minutes or an hour. There's plenty of templates out there in LlamaIndex, LangChain, like you can just build a chatbot. But how do you build something that satisfies some of these criteria of surfacing citations, being transparent, having a good UX, and then also being able to handle different types of questions, right?

Like more complex questions that compare different documents. That's something that I think people are still trying to explore. And so what we did was, we showed organizations the possibilities of what you can do when you actually build something like this. And then after that, you know, we kind of stealth launched this for fun, just as a separate project, just to see if we could get feedback from users who are using it in the real world, to see how we can improve stuff.

And then we thought like, ah, you know, we built this, right? Obviously, we're not gonna sell like a financial app, like that's not really in our wheelhouse, but we're just gonna open source the entire thing. And so that now is basically just like a really nice, like full stack app template you can use and customize on your own, right?

To build your own chatbot, whether it is over financial documents or other types of documents. And it provides a nice template for basically anybody to go in and get started. There are certain components, though, that aren't released yet that we're going to release in the next few weeks.

Like one is just more detailed guides on the different modular components within it. So if you're a full stack developer, you can go in and actually take the pieces that you want and build your own custom flows. The second piece is, there are certain components in there that might not be directly related to the LLM app that would be nice to just have people use.

An example is the PDF viewer, like the PDF viewer with like citations. I think we're just gonna give that, right? So, you know, you could be using any library you want, but then you can just, you know, just drop in a PDF viewer, right? So that it's just like a fun little module that you can view.

- Nice, nice. Yeah, that's a really good community service right there. Well, so I want to talk a little bit about your cloud offering. 'Cause you mentioned, I forget the name that you had for it, enterprise something. - Well, one, we haven't come up with a name. We're kind of calling it LlamaIndex Platform, or LlamaIndex Enterprise.

I'm open to suggestions here. So at a high level, what I can probably say is, yeah, I think we're looking at ways of actively complementing the developer experience of building with LlamaIndex. You know, we've always been very focused on stuff around plugging your data into the language model.

And so can we build tools that help augment that experience beyond the open source library, right? And so I think what we're gonna do is build an experience where it's very seamless to transition from the open source library with a one-line toggle. You can basically get this complementary service, and then we'll figure out a way to monetize in a bit.

I think our revenue focus this year is less emphasized. It's more just about, can we build some managed offering that provides complementary value to what the open source library provides? - Yeah, I think it's the classic thing about all open source: you want to start by building the most popular open source project in your category to own that category.

You're gonna make it very easy to host. Therefore, you've just built your biggest competitor, which is you. Yeah, it'll be fun. - I think it'll be complementary, 'cause I think it'll be like, you know, use the open source library and then you have a toggle, and all of a sudden you can see basically a pipeline-ish thing pop up, and then you'll have a UI, there'll be some enterprise guarantees, and the end goal would be to help you build a production RAG app more easily.

- Yeah, great, awesome. Should we go on to like ecosystem and other stuff? - Yeah. - Go ahead. - Data loaders, there's a lot of them. What are maybe some of the most popular, maybe under, not underrated, but like underexpected, you know, and how has the open source side of it helped with like getting a lot more connectors?

You only have six people on the team today, so you couldn't have done it all yourself. - Oh, for sure. Yeah, I think the nice thing about LlamaHub itself is just, it's supposed to be a community-driven hub. And so actually the bulk of the loaders are completely community contributed.

And so we haven't written that many like first party connectors actually for this. It's more just like kind of encouraging people to contribute to the community. In terms of the most popular tools or the data loaders, I think we have Google Analytics on this and I forgot the specifics.

It's some mix of the PDF loaders. We have like 10 of them, but there's some subset of them that are popular. And then there's Google, like I think Gmail and G-Drive. And then I think maybe it's one of Slack or Notion. One thing I will say though, and I think Swix probably knows this better than I do, given that you used to work at Airbyte, is that it's very hard to build, especially for a full-on service like Notion, Slack or Salesforce, a really, really high quality loader that extracts all the information that people want, right?

And so I think the thing is, when people start out, they will probably use these loaders, and it's a great tool to get started. And for a lot of people it's good enough, and they submit PRs if they want additional features. If you get to a point where you actually wanna call an API that hasn't been supported yet, or you want to load in metadata or something that hasn't been directly baked into the logic of the loader itself, people end up writing their own custom loaders.

And that is a thing that we're seeing. And that's something that we're okay with, right? 'Cause like a lot of this is more just like community driven and if you wanna submit a PR to improve the existing one, you can, otherwise you can create your own custom ones. - Yeah.
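
For the "create your own custom ones" point (and the "just define your own subclass" answer that follows), a custom loader is essentially any class that returns the library's Document objects. Import paths and keyword names have shifted across LlamaIndex versions, so treat the snippet below as a sketch against a recent llama-index-core layout with a hypothetical data source, rather than a pinned recipe.

```python
# Sketch: a custom loader that wraps any data source in LlamaIndex Document objects.
# Import paths vary by llama-index version; this assumes the llama-index-core layout.
import json
from pathlib import Path

from llama_index.core import Document
from llama_index.core.readers.base import BaseReader

class JSONLNotesReader(BaseReader):
    """Hypothetical reader for a .jsonl file of {"title": ..., "body": ...} records."""

    def load_data(self, path: str) -> list[Document]:
        docs = []
        for line in Path(path).read_text().splitlines():
            record = json.loads(line)
            docs.append(
                Document(
                    text=record["body"],
                    metadata={"title": record["title"], "source": path},
                )
            )
        return docs
```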

And are all the custom loaders supported within Llama Index, or do you pair it with something else? - Oh, it's just like, I mean, you just define your own subclass. I think that's it. Yeah, yeah. - 'Cause typically in the data ecosystem with Airbyte, you know, Airbyte has its own strategies with custom loaders, but also you could write your own with Dagster or Prefect or one of those tools.

- Yeah, yeah, exactly. So I think for us it's more, we just have a very flexible like document abstraction that you can fill in with any content that you want. - Okay. Are people really dumping all their Gmail into these things? You said Gmail is number two. - Yeah, it's like one of Google, some Google product.

I think it's Gmail. - Oh, it's not Gmail. - I think it might be. Yeah. - Oh, wow. - I'm not sure actually. - I mean, that's the most private data source. - That's true. - So I'm surprised that people are dumping it in. I mean, I'm sure some people are, but I'm surprised it's popular.

- Yeah. Let me revisit the Google Analytics. - Okay. - I wanna try and give you the accurate response, yeah. - Yeah. Well, and then, so the LLM engine, I assume OpenAI is gonna be a majority. Is it an overwhelming majority? What's the market share between like OpenAI, Cohere, Anthropic, you know, whatever you're seeing.

Open source too. - OpenAI has a majority, but then there's Anthropic and there's also open source. I think there are a lot of people trying out Llama 2 and some variant of a top open source model. - Side note, any confusion there? Llama 2 versus Llama? - Yeah, I think whenever I go to these talks, I always open it up with like, we started before.

- We are not. - Yeah, exactly. We started before Meta, right? I wanna point that out. But no, props to them. We try to use it for like branding. We just add two Llamas when we have like a Llama 2 integration instead of one Llama. Anyways. Yeah, so I think a lot of people are trying out the popular open source models.

And these days there's a lot of toolkits and open source projects that allow you to self-host and deploy Llama 2. - Yes. - Right. And Ollama is just a very recent example that we had an integration with. And so, by virtue of having more of these services, I think more and more people are trying it out.

- Yeah. Do you think there's potential there? Is that gonna be an increasing trend? - Open source? - Yeah. - Yeah, definitely. I think in general, people hate monopolies. And so whenever OpenAI has something really cool, or any company has something really cool, even Meta, there's just gonna be huge competitive pressure from other people to do something that's more open and better.

And so I do think just market pressures will improve open source adoption. - Last thing I'll say about this, which is just that it gets clicks. People psychologically want that. But then at the end of the day, they fall for brand name, popularity, and performance benchmarks, you know?

And at the end of the day, OpenAI still wins on that. - I think that's true. But I just think, unless you were an active employee at OpenAI, right? All these research labs with ML PhDs, and other companies too, they're investing a lot of dollars.

There's gonna be a lot of like competitive pressures to develop like better models. So is it gonna be like all fully open source with like a permissive license? Like, I'm not completely sure, but like there's just a lot of just incentive for people to develop their stuff here. - Have you looked at like rag specific models, like contextual?

- No, is it public or? - No, they literally just, so Douwe Kiela, I think is his name, you probably came across him. He wrote the RAG paper at Meta and just started Contextual AI to create a RAG-specific model. I don't know what that means. I was hoping that you do, 'cause it's your business.

- If I had inside information. I mean, you know, to be honest, I think this kind of relates to my previous point on RAG and fine tuning. A RAG-specific model is a model architecture that's designed for better RAG. And it's less the software engineering principle of, how can I take existing stuff and just plug and play different components into it?

And there's a beauty in that from an ease of use and modularity standpoint. But when you wanna end-to-end optimize the thing, you might want a more specific model. I just, yeah, I don't know. I think building your own models is honestly pretty hard. And I think the issue is, if you build your own models, you're also just gonna have to keep up with the rate of LLM advances.

Basically the question is, when GPT-5 and 6 and whatever, or Anthropic's Claude 3, comes out, how can you prove that you're actually better than software developers cobbling together their own components on top of a base model, right? Even if, conceptually, this is better than maybe GPT-3 or GPT-4.

- Yeah, yeah. The base model game is expensive. - Yeah. - What about vector stores? I know Swix is wearing a Chroma sweatshirt. - Yeah, because this is a swag game. - I have the mug from Chroma, it's been great. - What do you think, what do you think there?

Like there's a lot of them. Are they pretty interchangeable for your users' use cases? Is HNSW all we need? Is there room for improvement there? - Is np.array all we need? - Yeah, yeah. - I think, yeah, we try to remain unopinionated about storage providers. So it's not like, we don't try to play favorites.

So we have like a bunch of integrations, obviously. And the way we try to do is we just try to find like some standard interfaces, but obviously like different vector stores will support kind of like slightly additional things like metadata filters and those things. And the goal is to have our users basically leave it up to them to try to figure out like what makes sense for their use case.

In terms of like the algorithm itself, I don't think the Delta on like improving the vector store, like embedding lookup algorithm is that high. I think the stuff has been mostly solved or at least there's just a lot of other stuff you can do to try to improve the performance.

No, I mean everything else that we just talked about, in terms of accuracy, right? To improve RAG, everything that we talked about, like chunking, like metadata. - Yeah, well, I mean, I was just thinking, maybe for me the interesting question is, there are like eight, it's kind of a Game of Thrones.

There's like eight, the war of the eight databases right now. - Oh, oh, I see, I see. - How do they stand out, and how do they become very good partners with Llama Index? - Oh, I mean, I think we're, yeah, we're pretty good partners with most of them. Let's see.

- Well, like, so if you're a vector database founder, like what do you work on? - That's a good question. I think one thing I'm very interested in is, and this is something I think I've started to see a general trend towards, is combining structured data querying with unstructured data querying.

And I think that will probably just expand the query sophistication of these vector stores and basically make it so that users don't have to think about whether they-- - Would you call this like hybrid querying? Is that what Weaviate's doing? - Yeah, I mean, I think like, if you think about metadata filters, that's basically a structured filter.

It's like a select star or select where, right? Something equals something. And then you combine that with semantic search. I know, I think like LanceDB or something was like trying to do some like joint interface. The reason is like most data is semi-structured. There's some structured annotations and there's some like unstructured texts.

And so like somehow combining all the expressivity of like SQL with like the flexibility of semantic search is something that I think is gonna be really important. And we have some basic hacks right now that allow you to jointly query both a SQL database, like a separate SQL database and a vector store to like combine the information.

That's obviously gonna be less efficient than if you just combined it into one system, yeah. And so I think pgvector, you know, that type of stuff, is starting to get there. But in general, how do you have an expressive query language to actually do structured querying along with all the capabilities of semantic search?
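
A toy version of "structured filters plus semantic search" looks like the sketch below: apply a metadata/SQL-style predicate first, then rank the survivors by embedding similarity. Real vector stores push both steps into the engine; this is just the concept in numpy, with hypothetical field names.

```python
# Sketch: combine a structured metadata filter with semantic (embedding) ranking.
import numpy as np

def hybrid_query(query_emb, docs, embeddings, where, k=5):
    """
    docs:       list of dicts with a 'metadata' field, e.g. {'year': 2023, 'ticker': 'UBER'}
    embeddings: np.ndarray of shape (len(docs), dim), unit-normalized
    where:      predicate over metadata, e.g. lambda m: m['year'] >= 2022
    """
    keep = [i for i, d in enumerate(docs) if where(d["metadata"])]
    if not keep:
        return []
    sims = embeddings[keep] @ query_emb          # cosine similarity if normalized
    order = np.argsort(-sims)[:k]
    return [(float(sims[j]), docs[keep[j]]) for j in order]
```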

- So your current favorite is just put it into Postgres? - No, no, no, we don't play-- - Postgres language, the query language. - I actually don't know what the best language would be for this. 'Cause I think it will be something that like the model hasn't been fine-tuned over.

And so you might wanna train the model over this, but some way of like expressing structured data filters. And this could be include time too, right? It doesn't have to just be like a where clause with this idea of like a semantic search. - Yeah, yeah. And we talked about graph representations.

- Yeah, oh yeah, that's another thing too. And there's like, yeah, so that's actually something I didn't even bring up yet. Like there's this interesting idea of like, can you actually have the language model, like explore like relationships within the data too, right? And somehow combine that information with stuff that's like more structured within the DB.

- Awesome. - What else is left in the stack? - Oh, evals. - Yeah. - What are your current strong beliefs about how to evaluate RAG? - I think I have thoughts. I think we're trying to curate this into some like more opinionated principles because there are some like open questions here.

I think one question I had to think about is whether you should do evals component by component first, or just do the end-to-end thing. I think you might actually just want to do the end-to-end thing first, just to do a sanity check of whether, given a query and the final response, it even makes sense.

Like you eyeball it, right? And then you try to do some basic evals. And then once you diagnose what the issue is, you go into the specific area, find some more solid benchmarks, and try to improve stuff. So what are end-to-end evals?

Like it's, you have a query, it goes in through a retrieval system, you get back something, you synthesize response, and that's your final thing. And you evaluate the quality of the final response. And these days there's plenty of projects, like startups, like companies, research, doing stuff around like GPT-4, right?

As like a human judge to basically kind of like synthetically generate a data set. - Do you think those will do well? - I mean, I think- - It's too easy. - Well, I think, oh, you're talking about like the startups? - Yeah. - I don't know. I don't know from the startup side.

I just know from a technical side, I think people are gonna do more of it. The main issue right now is just, it's really unreliable. Like it's just, like there's like variance in the response when you wanna be- - Yeah, then they won't do more of it. I mean, 'cause it's bad.

- No, but these models will get better and you'll probably fine tune a model to be a better judge. I think that's probably what's gonna happen. So I'm like reasonably bullish on this because I don't think there's really a good alternative beyond you just human annotating a bunch of data sets and then trying to like just manually go through and curating, like evaluating eval metrics.

And so this is just gonna be a more scalable solution. In terms of the startups, yeah, I mean, I think there's a bunch of companies doing this. In the end, it probably comes down to some aspect of like UX speed and then whether you can like fine tune a model.
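
An end-to-end eval with an LLM as judge is conceptually just a scoring prompt over (query, response) pairs. The sketch below shows the shape of it with a placeholder `llm` callable; as noted above, the judgments are still noisy, so you would want to calibrate or fine-tune the judge rather than trust a single raw score.

```python
# Sketch: LLM-as-judge for end-to-end RAG responses. `llm(prompt) -> str` is a placeholder.
def judge_response(query: str, response: str, llm) -> int:
    prompt = (
        "You are grading a RAG system's answer.\n"
        f"Question: {query}\n"
        f"Answer: {response}\n"
        "Rate the answer's relevance and faithfulness from 1 (bad) to 5 (great). "
        "Reply with only the number."
    )
    raw = llm(prompt).strip()
    try:
        return max(1, min(5, int(raw)))
    except ValueError:
        return 1  # unparseable judgments happen; treat them as a failure case
```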

And then, so that's end-to-end evals. And then I think like what we found is for RAG, a lot of times like what ends up affecting this like end response is retrieval. You're just not able to retrieve the right response. I think having proper retrieval benchmarks, especially if you wanna do production RAG is actually quite important.

What does having good retrieval metrics tell you? It tells you that at least the retrieval is good. It doesn't necessarily guarantee the end generation is good, but at least it gives you some sort of sanity check, right, so you can fix one component while optimizing the rest.

Retrieval evaluation is pretty standard and has been around for a while. It's just an IR problem, basically. You have some input query, you get back some retrieved set of context, and then there's some ground truth in that ranked set. And then you try to measure it based on ranking metrics.

So the closer that ground truth is to the top, the more you reward the evals. And the closer it is to the bottom, or if it's not in the retrieved set at all, the more you penalize the evals. And so that's just a classic ranking problem. Most people starting out probably don't know how to do this.
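
Those ranking metrics are easy to make concrete. Hit rate and mean reciprocal rank (MRR) over a labeled set of (query, relevant-doc-id) pairs are a reasonable starting point, shown below as plain Python independent of any particular library.

```python
# Sketch: hit rate and MRR over retrieval results.
# retrieved: dict of query -> ranked list of doc ids; ground_truth: dict of query -> relevant doc id.
def retrieval_metrics(retrieved: dict, ground_truth: dict, k: int = 10):
    hits, rr = 0, 0.0
    for query, relevant_id in ground_truth.items():
        ranked = retrieved.get(query, [])[:k]
        if relevant_id in ranked:
            hits += 1
            rr += 1.0 / (ranked.index(relevant_id) + 1)  # higher ranks earn more credit
    n = len(ground_truth)
    return {"hit_rate": hits / n, "mrr": rr / n}
```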

Right now, we just launched some basic retrieval evaluation modules to help users do this. One part is just curating this data set in the first place. And one thing that we're very interested in is this idea of synthetic data set generation for evals. So how can you, given some context, generate a set of questions with GPT-4, and then all of a sudden you have question and context pairs, and that becomes your ground truth.
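
Synthetic eval-set generation is essentially the inverse of retrieval: for each chunk, ask a strong LLM to write questions that the chunk answers, and those (question, chunk) pairs become ground truth. A hedged sketch with a placeholder `llm` callable:

```python
# Sketch: generate (question, context) ground-truth pairs from existing chunks.
# `llm(prompt) -> str` is a placeholder for your model client (e.g. a GPT-4 wrapper).
def generate_eval_pairs(chunks: list[str], llm, questions_per_chunk: int = 2):
    pairs = []
    for chunk_id, chunk in enumerate(chunks):
        prompt = (
            f"Write {questions_per_chunk} questions, one per line, that can be "
            f"answered using only the following passage:\n\n{chunk}"
        )
        for question in llm(prompt).splitlines():
            if question.strip():
                pairs.append({"question": question.strip(), "context_id": chunk_id})
    return pairs
```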

- Yeah. Are data agent evals the same thing or is there a separate set of stuff for agents that you think is relevant here? - Data agents add like another layer of complexity 'cause then it's just like you have just more loops in the system. Like you can evaluate like each chain of thought loop itself like every LLM call to see whether or not the input to that specific step in the chain of thought process actually works or is correct.

Or you could evaluate like the final response to see if that's correct. This gets even more complicated when you do like multi-agent stuff because now you have like some communication between like different agents. Like you have a top level orchestration agent passing it on to some low level stuff.
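
Evaluating each loop iteration basically means logging every intermediate (input, output) step and judging those separately from the final answer. A minimal sketch, with `judge` standing in for any scoring function such as the LLM-as-judge above:

```python
# Sketch: evaluate an agent trace step by step, in addition to the final response.
# `trace` is a list of {"input": ..., "output": ...} dicts; `judge(q, a) -> int` is assumed.
def evaluate_agent_trace(trace, final_query, final_answer, judge):
    step_scores = [judge(step["input"], step["output"]) for step in trace]
    weakest = step_scores.index(min(step_scores)) if step_scores else None
    return {
        "per_step": step_scores,          # pinpoints which loop iteration went wrong
        "final": judge(final_query, final_answer),
        "weakest_step": weakest,
    }
```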

I'm probably less familiar with agent eval frameworks. I know they're starting to become a thing. I was talking to Joon from the Generative Agents paper, which is pretty unrelated to what we're doing now, but it's very interesting, where you can evaluate overall agent simulations by understanding whether or not they modeled this distribution of human behavior, but that's a very macro principle, right?

And that's very much to evaluate stuff to kind of like model the distribution of things. And I think that works well when you're trying to like generate something for like creative purposes, but for stuff where you really want the agent to like achieve a certain task, it really is like whether or not it achieved the task or not, right?

'Cause then it's not like, oh, does it generally mimic human behavior? It's like, no, did you send this email or not? Right, 'cause otherwise this thing didn't work. Yeah. - Makes sense. Awesome. Yeah, let's jump into the Lightning Round. So we have two questions, acceleration and exploration, and then one final takeaway.

The acceleration question is, what's something that already happened in AI that you thought would take much longer to get here? - I think just the ability of LLMs to generate believable outputs, and both for texts and also for images. And I think just the whole reason I started hacking around with LLMs, honestly, I felt like I got into it pretty late.

I should've gone into it like early 2022, because GPT-3 had been out for a while. Just the fact that there was this engine that was capable of reasoning and no one was really tapping into it. And then the fact that, you know, I used to work in image generation for a while.

Like I did GANs and stuff back in the day, and that was pretty hard to train. You would generate these 32 by 32 images, and now taking a look at some of the stuff by DALL-E and, you know, Midjourney and those things. It's just, it's very good.

Yeah. - Exploration. What do you think is the most interesting unsolved question in AI? - Yeah, I'd probably work on some aspect of personalization of memory. I think a lot of people have thoughts about that, but for what it's worth, I don't think the final state will be RAG.

I think it will be some like fancy algorithm or architecture where you like bake it into like the architecture of the model itself. Like if you have like a personalized assistant that you can talk to, that will like learn behaviors over time, right? And kind of like learn stuff through like conversation history, what exactly is the right architecture there?

I do think that will be part of like- - Continuous fine tuning? - Yeah, like some aspect of that, right? Like these are like, I don't actually know the specific technique, but I don't think it's just gonna be something where you have like a fixed vector store and that thing will be like the thing that restores all your memories.

- Yeah, it's interesting because I feel like using model weights for memory, it's just such an unreliable storage device. - I know, but like, I just think from like the AGI, like, you know, just modeling like the human brain perspective, I think that there is something nice about just like being able to optimize that system, right?

And to optimize a system, you need parameters, and that's where you get into the neural net piece. - Cool, cool. And yeah, takeaway: you've got the audience's ear, what's something you want everyone to think about or take away from this conversation and your thinking? - I think there were a few key things.

So we talked about two of them already. One was SEC Insights, which, if you guys haven't checked it out, I'd definitely encourage you to do so, because it's not just a random SEC app, it's a full stack thing that we open sourced, right? And so if you guys wanna check it out, I would definitely do that.

It provides a template for you to build kind of like production grade rag apps and we're gonna open source like and modularize more components of that soon. - Into a workshop. - Yeah, and the second piece is we are thinking a lot about like retrieval and evals. I think right now we're kind of exploring integrations with like a few different partners and so hopefully some of that will be released soon.

And so, how do you basically have an experience where you just write LlamaIndex code, and all of a sudden you can easily run retrievals, evals, and traces, all that stuff, in like a service? And so I think we're working with a few providers on that.

And then the other piece, which we did talk about already, is this idea of building RAG from scratch. I mean, I think everybody should do it. I would check out the guide if you guys haven't already, I think it's in our docs. But instead of just using, you know, either the retriever query engine in LlamaIndex or the conversational QA chain in LangChain, I would take a look at how you actually chunk and parse data and do top-k embedding retrieval.

'Cause I really think that by doing that process, it helps you understand the decisions, the prompts, the language models to use. - That's it. - Yeah. - Thank you so much. - Thank you, Jerry. - Yeah, thank you. (upbeat music)