RAG is a hack - with Jerry Liu of LlamaIndex
Chapters
0:00 Introductions and Jerry's background
4:38 Starting LlamaIndex as a side project
5:27 Evolution from tree-index to current LlamaIndex and LlamaHub architecture
11:35 Deciding to leave Robust Intelligence to start the LlamaIndex company and raising funding
21:37 Context window size and information capacity for LLMs
23:09 Minimum viable context and maximum context for RAG
24:27 Fine-tuning vs RAG - current limitations and future potential
25:29 RAG as a hack, but a good hack for now
28:09 RAG benefits - transparency and access control
29:40 Potential for fine-tuning to take over some RAG capabilities
32:05 Baking everything into an end-to-end trained LLM
35:39 Similarities between iterating on ML models and LLM apps
37:06 Modularity and customization options in LlamaIndex: data loading, retrieval, synthesis, reasoning
43:10 Evaluating and optimizing each component of the LlamaIndex system
49:13 Building retrieval benchmarks to evaluate RAG
50:38 SEC Insights - open source full stack LLM app using LlamaIndex
53:07 Enterprise platform to complement LlamaIndex open source
54:33 Community contributions for LlamaHub data loaders
57:21 LLM engine usage - majority OpenAI but options expanding
1:00:43 Vector store landscape
1:04:33 Exploring relationships and graphs within data
1:08:29 Additional complexity of evaluating agent loops
1:09:20 Lightning Round
00:00:13.680 |
and I'm joined by my co-host Swyx, founder of Smol AI. 00:00:17.240 |
- And today we finally have Jerry Liu on the podcast. 00:00:23.720 |
- It's so weird because we keep running into each other 00:00:27.640 |
so it's kind of weird to finally just have a conversation 00:00:35.760 |
- So I tend to introduce people on their formal background 00:00:38.220 |
and then ask something on the more personal side. 00:00:46.720 |
I don't know if there is like an official Princeton gang. 00:01:02.280 |
And I think I saw that you also interned at Two Sigma 00:01:12.340 |
- That was my first like proper engineering job 00:01:17.020 |
- And then you were a machine learning engineer at Quora, 00:01:19.960 |
AI research scientist at Uber for three years, 00:01:24.600 |
at Robust Intelligence before starting LlamaIndex. 00:01:35.420 |
where I just wrote like a ton of Quora answers. 00:01:37.460 |
And so I think if you look at my tweets nowadays, 00:01:44.820 |
where I just like went ham on Quora for a bit. 00:01:51.340 |
I think the thing that everybody was fascinated by 00:01:53.700 |
was just like general like deep learning advancements 00:01:57.240 |
and stuff like GANs and generative like images 00:01:59.900 |
and just like new architectures that were evolving. 00:02:03.920 |
'cause you were going in like really understanding 00:02:06.760 |
So I kind of use that as like a learning opportunity 00:02:08.360 |
to basically just like read a bunch of papers 00:02:15.760 |
where it's just like really about kind of like 00:02:17.320 |
framing concepts and trying to make it understandable 00:02:21.160 |
- Yeah, I've said, so a lot of people come to me 00:02:23.900 |
but like I think you are doing one of the best jobs 00:02:32.440 |
- And I didn't know it was due to the Quora training. 00:02:38.080 |
like kind of wrote on Quora as like one of the web 1.0 00:02:42.440 |
But now I think it's seeing a resurgence 00:02:49.840 |
but what do you think is like kind of underrated about Quora? 00:02:52.120 |
- I really like the mission of Quora when I joined. 00:02:54.640 |
In fact, I think when I interned there like in 2015 00:03:02.200 |
and they have like a very talented engineering team 00:03:07.720 |
And the other part is the whole mission of the company 00:03:10.120 |
is to just like spread knowledge and to educate people. 00:03:15.200 |
I really liked the idea of just like education 00:03:32.640 |
but you can make accessible by just like surfacing it. 00:03:34.680 |
And so actually, I don't know if like most people 00:03:45.360 |
- Yeah, I think most people's challenges with it 00:03:53.120 |
- Of course, like quality of the answer matters quite a bit. 00:03:57.400 |
- Yeah, like recommendation issues and all that stuff. 00:04:07.600 |
which might be a nice segue into RAG actually. 00:04:13.880 |
than what was standard in the industry at the time, 00:04:15.640 |
but just like ranking based on user preferences. 00:04:18.360 |
I think a lot of Quora was very metrics driven. 00:04:20.320 |
So just like trying to maximize like, you know, 00:04:49.920 |
It was more like kind of deep learning training 00:04:52.160 |
for self-driving and computer vision and that type of stuff. 00:04:55.520 |
But I think, yeah, I mean, I think in the LLM world, 00:05:05.200 |
but like it fits within the space of like LLM apps. 00:05:10.440 |
of the underlying deep learning architecture helps, 00:05:12.600 |
having knowledge of basic software engineering principles 00:05:18.360 |
this whole LLM space is basically just a combination 00:05:21.240 |
that you probably like people have done in the past. 00:05:56.040 |
I mean, we were on the same team for like two years. 00:05:57.440 |
I got to know Harrison and the rest of the team pretty well. 00:06:00.880 |
The people there were very driven, very passionate. 00:06:02.480 |
And it definitely pushed me to be a better engineer 00:06:06.880 |
Yeah, I don't really have a concrete explanation for this. 00:06:11.720 |
we have like an LLM hackathon around like September. 00:06:21.040 |
And so I just didn't track Slack or anything. 00:06:24.000 |
Came back, saw that Harrison started LangChain. 00:06:27.120 |
I was like, oh, I'll play around with LLMs a bit 00:06:32.440 |
but you know, I was like trying to feed in information 00:06:36.800 |
And then you deal with like context window limitations 00:06:47.840 |
Really was just one of those things where early days, 00:06:58.280 |
I had other ideas actually of what I wanted to start. 00:07:14.480 |
I actually think once the multi-modal models come out, 00:07:16.600 |
I think there's just like mathematically nicer properties 00:07:19.520 |
of you can just get like joint multi-modal embeddings 00:07:25.800 |
because from a software engineering principle, 00:07:29.920 |
and then you just represent everything as text. 00:07:35.400 |
versus if you had chosen to spend your time on multi-modal, 00:07:49.280 |
So that was a very productive month, I guess. 00:08:02.520 |
That probably was somewhat inspired by LangChain. 00:08:09.200 |
by applying a summarization prompt for each node. 00:08:18.680 |
was also that you're creating optimized data structures. 00:08:26.160 |
and how does that contrast with LlamaIndex today? 00:08:42.560 |
And the way I wanted to think about the system 00:08:45.480 |
of how language models with their reasoning capabilities, 00:08:49.760 |
can organize information and then traverse it. 00:08:52.160 |
So I didn't want to think about embeddings, right? 00:08:58.720 |
to try and actually tap into the capabilities 00:09:03.800 |
just as a human brain could synthesize stuff, 00:09:14.200 |
and then also traverse the structure that I created. 00:09:16.920 |
That was the inspiration for this initial tree index. 00:09:54.760 |
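To make the tree-index idea concrete, here's a minimal sketch of the pattern described: summaries are built bottom-up with an LLM, then traversed top-down at query time, with no embeddings involved. `complete()` is a stand-in for any LLM completion call, not a real API:

```python
from typing import List

def complete(prompt: str) -> str:
    """Stand-in for any LLM completion call (hypothetical)."""
    raise NotImplementedError

def build_tree(chunks: List[str], fanout: int = 4) -> List[List[str]]:
    # Level 0 is the raw chunks; each higher level summarizes
    # groups of `fanout` nodes from the level below.
    levels = [chunks]
    while len(levels[-1]) > 1:
        below = levels[-1]
        levels.append([
            complete("Summarize the following:\n\n" + "\n\n".join(below[i:i + fanout]))
            for i in range(0, len(below), fanout)
        ])
    return levels

def traverse(levels: List[List[str]], question: str, fanout: int = 4) -> str:
    # Walk down from the root, asking the LLM at each level
    # which child node looks most relevant to the question.
    idx = 0
    for depth in range(len(levels) - 1, 0, -1):
        children = levels[depth - 1][idx * fanout:(idx + 1) * fanout]
        numbered = "\n".join(f"{i}: {c}" for i, c in enumerate(children))
        choice = complete(
            f"Question: {question}\n"
            f"Which numbered node below is most relevant? Answer with one number.\n{numbered}"
        )
        idx = idx * fanout + int(choice.strip())
    return complete(f"Context: {levels[0][idx]}\n\nAnswer the question: {question}")
```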
because I think what I also ended up discovering 00:10:00.840 |
there was starting to become a wave of developers 00:10:07.480 |
is to apply them on top of your personal data. 00:10:13.640 |
like the problem statement itself was very powerful. 00:10:16.200 |
And so I think being motivated by the problem statement, 00:10:19.520 |
of how do I unlock LLMs on top of the data 00:10:21.840 |
also contributed to the development of LlamaIndex 00:10:30.680 |
the like just existing set of like data structures 00:10:33.120 |
is we really tried to take a step back and think, 00:10:36.920 |
that would actually make this useful for a developer? 00:10:39.440 |
And then, you know, somewhere around December, 00:10:42.600 |
to basically like push towards that direction, 00:10:48.160 |
And then also start adding in like embeddings, 00:10:53.840 |
like latency, cost, performance, those types of things. 00:10:58.680 |
like start expanding the scope of the toolkit 00:11:08.240 |
- Yeah, yeah, so I think that was in like January 00:11:12.040 |
And so we started adding like some data loaders, 00:11:15.880 |
started adding more stuff on the retrieval querying side, 00:11:18.600 |
right, we still have like the core data structures, 00:11:20.640 |
but how do you actually make them more modular 00:11:26.840 |
that you could run on top of this a little bit. 00:11:28.920 |
And then starting to get into more complex interactions 00:11:36.360 |
- And then you and I spent a bunch of time earlier this year 00:11:39.960 |
talking about LlamaHub, what that might become. 00:11:45.440 |
When did you decide it was time to start the company 00:11:48.600 |
and then start to think about what LlamaIndex is today? 00:12:01.960 |
oh yeah, you know, this is just like a design project, 00:12:03.920 |
but you know, what about my other idea on like video data? 00:12:06.320 |
Right, and I was trying to like get their thoughts on that. 00:12:09.800 |
And then everybody was just like, oh yeah, whatever. 00:12:25.080 |
and kind of like building practical applications. 00:12:28.240 |
into a much bigger opportunity than the previous idea was. 00:12:31.520 |
And then I think I gave a pretty long notice, 00:12:35.600 |
- What was your thinking in terms of like moats and, 00:12:40.600 |
you know, founders kind of like overthink it sometimes. 00:12:43.200 |
You obviously had like a lot of open source love 00:12:47.120 |
And yeah, like, were you ever thinking, okay, 00:12:50.200 |
I don't know, this is maybe not enough to start a company 00:12:56.760 |
I felt like I did this exercise, like honestly, 00:12:59.600 |
probably more late December and then early January, 00:13:03.040 |
'cause I was just existentially worried about 00:13:05.360 |
whether or not this would actually be a company at all. 00:13:08.160 |
And okay, what were the key questions I was thinking about? 00:13:11.360 |
And these were the same things that like other founders, 00:13:14.400 |
investors, and also like friends would ask me is just like, 00:13:17.000 |
okay, what happens if context windows get much bigger? 00:13:20.520 |
What's the point of actually structuring data, right, 00:13:24.920 |
Why don't you just dump everything into the prompt? 00:13:27.040 |
Fine tuning, like what if you just train the model 00:13:29.680 |
And then, you know, what's the point of doing this stuff? 00:13:32.880 |
And then some other ideas is what if like open AI 00:13:36.200 |
actually just like takes this, like, you know, 00:13:43.920 |
and starts building in some like built-in orchestration 00:13:46.280 |
capabilities around stuff like RAG and agents 00:13:49.160 |
And so I basically ran through this mental exercise 00:13:51.200 |
and, you know, I'm happy to talk a little bit more 00:14:00.840 |
I think RAG is just like one of those things that like, 00:14:07.040 |
but they also care about stuff like latency and costs. 00:14:09.280 |
And my entire reasoning at the time was just like, okay, 00:14:12.320 |
like, yes, maybe we'll have like much bigger context windows 00:14:15.760 |
as we've seen with like 100K context windows, 00:14:20.280 |
which is not in just like the scale of like a few documents, 00:14:23.640 |
it's usually in like gigabytes, terabytes, petabytes, 00:14:26.360 |
like how do you actually just unlock language models 00:14:36.040 |
And so there was clearly like technical opportunity here. 00:14:38.080 |
Like there was just stacks that needed to be invented 00:14:44.360 |
And so if like you just dumped all this data into, 00:14:55.760 |
because you have these network transfer costs 00:15:10.120 |
And so what RAG does is it does provide extra data points 00:15:14.200 |
along that axis because you can kind of control 00:15:16.000 |
the amount of context you actually want it to retrieve. 00:15:25.040 |
to actually, you know, like stuff into the prompt. 00:15:28.880 |
were kind of thinking about some of those considerations. 00:15:42.080 |
and your plans for the company, you know, at the time. 00:15:48.760 |
I mean, obviously we knew we wanted to fundraise. 00:15:50.360 |
I think there was also a bunch of like investor interest 00:15:54.520 |
given the, you know, like hype wave of generative AI. 00:15:56.880 |
So like a lot of investors were kind of reaching out 00:16:02.880 |
You know, they've been great partners so far. 00:16:06.160 |
like there's a lot of like great VCs out there. 00:16:09.720 |
on like open source, data, infra, and that type of stuff. 00:16:15.280 |
because for us, like time was of the essence, 00:16:19.000 |
and still kind of build mindshare in this space. 00:16:21.040 |
We just kept the fundraising process very efficient. 00:16:51.880 |
it's never like the most fun period, I think. 00:17:04.200 |
we're happy that we kept it to a pretty efficient process. 00:17:08.120 |
And so you fundraise with Simon, your co-founder. 00:17:17.280 |
we'll probably have had one more person join the team. 00:17:22.840 |
we're rapidly getting to like eight or nine people. 00:17:25.000 |
At the current moment, we're around like six. 00:17:26.680 |
And so just like, there'll be some exciting developments 00:17:37.880 |
Obviously, like we look for people that are really active 00:17:41.880 |
people that have like very strong engineering backgrounds. 00:17:44.240 |
And primarily, we've been kind of just looking for builders, 00:17:46.360 |
people that kind of like grow the open source 00:17:59.600 |
Like has a sense of both like a deep understanding of ML, 00:18:06.440 |
about like engineering and technical concepts in general. 00:18:09.120 |
And I think one of my criteria is when I was like 00:18:12.880 |
was someone that was like technically better than me, 00:18:16.280 |
And so honestly, like there weren't a lot of people that, 00:18:19.080 |
I mean, I know a lot of people that are smarter than me, 00:18:23.440 |
and also just have the same like values that I shared, right? 00:18:26.200 |
And just, I think doing a startup is very hard work, right? 00:18:28.760 |
It's not like, I'm sure like you guys all know this. 00:18:33.800 |
and you want to be like in the same place together 00:18:36.360 |
and just like being willing to hash out stuff 00:18:42.440 |
And I think I convinced him to jump on board. 00:18:46.240 |
And obviously I've had the pleasure of chatting 00:18:48.320 |
and working with a little bit with both of you. 00:18:50.960 |
What would you say those like your top like one 00:18:55.440 |
or the culture of the company and that kind of stuff? 00:18:58.200 |
- Yeah, well, I think in terms of the culture of the company 00:19:01.680 |
it's really like, I mean, there's a few things 00:19:10.600 |
We don't want to like obviously like copy code 00:19:12.880 |
or kind of like, you know, just like, you know 00:19:26.720 |
I think in the end, like this is a very fast moving space 00:19:29.560 |
and we want to just like be one of the, you know 00:19:33.920 |
like production-quality LLM applications. 00:19:37.000 |
So I promise we'll get to the more technical questions. 00:19:46.000 |
And since your fundraising post, which was in June 00:19:51.000 |
and now it's September, so it's been about three months. 00:19:53.760 |
You've actually gained 50% in terms of stars and followers. 00:19:58.480 |
You've 3x'd your download count to 600,000 a month 00:20:01.480 |
and your Discord membership has reached 10,000. 00:20:07.040 |
And obviously there's a lot of room to expand there too. 00:20:15.240 |
we want this thing to be, well, one big, right? 00:20:18.960 |
but to just like really provide value to developers 00:20:25.720 |
And I think it turns out we're in the fortunate circumstance 00:20:28.200 |
where a lot of different companies and individuals, right? 00:20:37.100 |
start to think about what are the production grade 00:20:41.520 |
that to solve to actually make this thing robust 00:20:45.680 |
And so we want to basically provide the tooling to do that. 00:20:49.120 |
And to do that, we need to both spread awareness 00:20:53.800 |
And so a lot of this is going to be continued growth, 00:21:02.920 |
you were asking yourself initially around fine tuning 00:21:21.280 |
We talked before about how LLMs are U-shaped reasoners. 00:21:36.800 |
you want to give people as they think about it? 00:21:41.160 |
And I think part of what I wanted to kind of like 00:21:46.600 |
especially with the idea of like thinking about 00:21:49.640 |
Like, okay, what if the minimum context was like 10 tokens 00:21:56.400 |
And what are the limitations if it's like 10 tokens? 00:21:58.760 |
It's kind of like, like eight bit, 16 bit games, right? 00:22:10.120 |
just the resolution of the context and the output will change 00:22:13.320 |
depending on how much context you can actually fit in. 00:22:18.560 |
there's this concept of like information capacity, 00:22:23.720 |
like given any fixed amount of like storage space, 00:22:27.080 |
like how much information can you actually compact in there? 00:22:32.000 |
is just like some fixed amount of storage space, right? 00:22:38.120 |
you can compact into like a 4,000 token storage space. 00:22:40.920 |
And what is that storage space used for these days 00:22:46.080 |
And so this really controls a maximum amount of information 00:22:53.480 |
you could have an infinitely detailed response 00:22:56.960 |
But if you don't, you can only kind of represent stuff 00:23:08.640 |
are gonna be able to surface at any given point in time. 00:23:18.840 |
there needs to be a balance between fine tuning and RAG 00:23:21.600 |
to make sure you're gonna like leverage the context, 00:23:24.040 |
but at the same time, don't keep it too low resolution? 00:23:29.400 |
I don't think anyone wants to work with like a 10, 00:23:31.280 |
I mean, that's just a thought exercise anyways, 00:23:44.000 |
that level of resolution is probably fine for most people, 00:23:50.480 |
okay, if you're gonna actually combine this thing 00:23:52.520 |
with some sort of retrieval data structure mechanism, 00:23:55.240 |
there's just limitations on the retrieval side 00:24:05.880 |
but if you're just doing like top-k similarity, 00:24:07.720 |
like you might not be fetching the right information 00:24:18.720 |
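For reference, the top-k similarity lookup being discussed here is just nearest-neighbor search over embeddings; a minimal dense-retrieval sketch in NumPy:

```python
import numpy as np

def top_k(query_emb: np.ndarray, doc_embs: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k documents most cosine-similar to the query."""
    sims = doc_embs @ query_emb / (
        np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(query_emb)
    )
    return np.argsort(-sims)[:k]  # best first
```

If the answer spans more than k chunks, or the query embedding lands far from the relevant chunks, this is exactly the failure mode being pointed at.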
And also in terms of what's like the threshold 00:24:22.000 |
of data that you need to actually worry about fine tuning 00:24:32.600 |
some of which sound like a little bit contradictory 00:24:35.640 |
To be honest, I don't think anyone knows the right answer. 00:24:37.400 |
I think this is just- - We're pursuing the truth. 00:24:53.320 |
to like stuff stuff into the prompt of the language model. 00:24:58.000 |
in terms of like stuffing stuff into the prompt 00:25:02.720 |
to like retrieve the right information with top-k similarity, 00:25:10.600 |
and then just like stuff stuff into the prompt. 00:25:17.320 |
to try to make the most out of these like existing APIs. 00:25:21.160 |
is just like from a pure like optimization standpoint, 00:25:23.680 |
if you think about this from like the machine learning lens, 00:25:29.600 |
Like, obviously, like the thing about machine learning 00:25:34.760 |
that can be optimized within machine learning, 00:25:38.360 |
you're really like changing like the entire system's weights 00:25:44.160 |
And if you just cobble a bunch of stuff together, 00:25:46.480 |
you can't really optimize the pieces that are inefficient, right? 00:25:56.880 |
more learned retrieval algorithm that's better. 00:26:05.800 |
of how do you do like short-term or long-term memory, right? 00:26:08.120 |
Like represent stuff in some sort of vector embedding, 00:26:15.920 |
It's more, and it's not really automatically learned, 00:26:18.520 |
it's more just things that you set beforehand 00:26:27.680 |
potentially in a more like machine learning base way, right? 00:26:31.920 |
And this is also why I think like in the long-term, 00:26:34.880 |
like I do think fine tuning will probably have 00:26:41.080 |
there will probably be new architectures invented 00:26:43.920 |
that where you can actually kind of like include 00:26:59.320 |
And so just like for kind of like the AI engineer persona, 00:27:02.440 |
that like, which to be fair is kind of one of the reasons 00:27:08.040 |
is because it's way more accessible for everybody 00:27:16.040 |
And if we can basically provide these existing techniques 00:27:18.440 |
to help people really optimize how to use existing systems 00:27:21.720 |
without having to really deeply understand machine learning, 00:27:28.880 |
which is just like RAG is way easier to onboard and use. 00:27:41.640 |
And then I'm just kind of like leaving room for the future 00:27:46.000 |
fine tuning can probably take over some of the aspects 00:27:50.680 |
- I don't know if this is mentioned in your recap there, 00:28:00.040 |
like to increase trust, we have to source documents. 00:28:11.840 |
- Exactly, and so that's definitely an advantage. 00:28:14.040 |
I think the other piece that I think is an advantage, 00:28:23.320 |
You can't really do that with large language models, 00:28:26.400 |
which is like gate information to the neural net weights, 00:28:31.240 |
For the first point, you could technically, right, 00:28:35.120 |
you could technically have the language model, 00:28:44.160 |
- Yeah, well, but like it makes it up right now 00:29:00.480 |
versus very traditional information retrieval. 00:29:08.280 |
Like we as humans, obviously we use the internet, 00:29:11.960 |
These tools have API interfaces that are well-defined. 00:29:14.400 |
And obviously we're not, like the tools aren't part of us. 00:29:20.600 |
And so kind of when you think about like RAG, 00:29:26.120 |
like a vector database to look up information 00:29:30.720 |
how much information is inherent within the network itself 00:29:33.280 |
and how much does it need to do some sort of like tool 00:29:36.840 |
And I do think there'll probably be more and more 00:29:41.960 |
Some follow-ups on discussions that we've had. 00:29:47.880 |
and what's your current take on whether you can fine tune 00:29:55.400 |
I think some people say you can't, I disagree. 00:29:58.640 |
Just right now I haven't gotten it to work yet. 00:30:01.480 |
- Yeah, well, not in a very principled way, right? 00:30:07.780 |
an hour or two per night to actually get this. 00:30:09.760 |
- Like you were a research scientist at Uber. 00:30:11.400 |
- Yeah, yeah, but it's like full-time, full-time work. 00:30:14.000 |
So I think what I specifically concretely did 00:30:21.880 |
And so there's like a user assistant message format. 00:30:24.440 |
And so what I did was I tried to take just some piece 00:30:28.480 |
by just asking it a bunch of questions about the text. 00:30:34.560 |
and just fine tune over the question responses. 00:30:56.360 |
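Concretely, OpenAI's chat fine-tuning format is one JSON object per line with user/assistant messages; a sketch of writing such a file (the QA pair is made up for illustration):

```python
import json

# Hypothetical question/answer pairs, e.g. generated by prompting a stronger
# model (GPT-4) over each chunk of the source text.
qa_pairs = [
    ("What does the filing say about revenue?",
     "Revenue grew 12% year over year, driven by subscriptions."),
]

with open("train.jsonl", "w") as f:
    for question, answer in qa_pairs:
        f.write(json.dumps({
            "messages": [
                {"role": "user", "content": question},
                {"role": "assistant", "content": answer},
            ]
        }) + "\n")
```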
but then there's also just like next token prediction. 00:30:58.940 |
And that's something that you can't really do 00:31:02.600 |
but you can do with if you just trained it yourself. 00:31:06.280 |
if you just like train it over some corpus of data. 00:31:18.800 |
is just no one knows how to use them right now, right? 00:31:21.160 |
And so I think that's probably one of the issues. 00:31:23.560 |
- Just to clue people in who haven't read the paper, 00:31:25.580 |
Gorilla is the one where they train it to use specific APIs? 00:31:37.600 |
Like the model itself could try to learn some prior 00:31:41.680 |
over the data to decide like what tool to pick. 00:31:44.080 |
But there's also, it's also augmented with retrieval 00:31:47.640 |
in case like the prior doesn't actually work. 00:31:51.640 |
- Is that something that you'd be interested in supporting? 00:31:55.680 |
like if like this is kind of how fine-tuning like RAG 00:31:58.880 |
evolves, like I do think there'll be some aspect 00:32:04.160 |
but then like RAG will just be there to supplement 00:32:11.680 |
Like to be clear, RAG right now is the default way 00:32:26.680 |
like there is a certain beauty in just baking everything 00:32:29.560 |
into some training process of a language model. 00:32:35.800 |
or chat GPT code interpreter, right, like GPT-4, 00:32:42.020 |
"Hey, how do I like define a pedantic model in Python?" 00:32:47.820 |
And we'll run it through code interpreters as a tool, 00:32:54.420 |
having the model itself, like just, you know, 00:32:56.980 |
instead of you kind of defining the algorithm 00:32:58.700 |
for what the data structure should look like, 00:33:02.440 |
That said, I think the reason it's not a thing right now 00:33:14.100 |
to kind of evaluate and improve on performance, 00:33:24.380 |
- I wonder when they're going to put that back. 00:33:32.100 |
is on your brief mention about security or auth. 00:33:47.140 |
let's just dump a whole company's Notion into this thing. 00:33:56.300 |
who are thinking about building tools in that domain, 00:34:02.380 |
like just bigger companies, like banks, consulting firms, 00:34:16.060 |
'cause we're more just like an orchestration framework. 00:34:24.060 |
like, you know, use some publicly available data 00:34:38.500 |
before we expand this to like more users within the work? 00:34:43.420 |
So there's a bunch of pieces to RAG, obviously. 00:35:04.380 |
And then also just like the role of an AI engineer 00:35:06.940 |
and the skills that they're going to have to learn 00:35:12.540 |
that don't really like understand the fundamentals 00:35:16.460 |
to like cobble something together to build something. 00:35:19.100 |
And I think there is a beauty in that for what it's worth. 00:35:28.940 |
On the other end, what we're increasingly seeing 00:35:34.940 |
start running into honestly like pretty similar issues 00:35:37.900 |
that like plague just a standard ML engineer 00:35:49.860 |
You have to figure out what parameters you tweak. 00:35:51.420 |
You have to gain some intuition about this entire process. 00:35:58.020 |
to just like tuning an ML model with like hyperparameters 00:36:00.940 |
and learning like proper ML practices of like, 00:36:03.860 |
okay, how do I have like define a good evaluation benchmark? 00:36:06.940 |
How do I define like the right set of metrics to use, right? 00:36:10.540 |
and improve the performance of this pipeline for production? 00:36:14.420 |
Like every ML engineer uses some form of Weights & Biases 00:36:18.020 |
or like some other experimentation tracking tool. 00:36:26.420 |
There's like a certain amount of just like LLM ops, 00:36:29.260 |
like tooling and concepts and just like practices 00:36:34.340 |
And so I think that the reason I think like being able 00:36:39.860 |
is it really gives you a sense of like how things are working 00:36:44.060 |
about like what parameters are within a RAG system 00:36:46.780 |
and which ones actually tweak to make them better. 00:36:50.380 |
the LlamaIndex quickstart is three lines of code. 00:36:53.820 |
The downside of that is you have zero visibility 00:36:56.180 |
into what's actually going on under the hood. 00:36:58.700 |
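For context, the quickstart being referred to looked roughly like this in the llama_index API of the time (mid-2023):

```python
from llama_index import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("data").load_data()  # read files from ./data
index = VectorStoreIndex.from_documents(documents)     # chunk, embed, and index
print(index.as_query_engine().query("What did the author do growing up?"))
```

Chunk size, embedding model, top-k, and the synthesis prompt are all defaulted away in those lines, which is the visibility tradeoff being described.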
that we've kind of been thinking about for a while. 00:37:06.420 |
how the thing actually works under the hood, right? 00:37:11.980 |
Like as for some people, the three lines of code might work. 00:37:19.580 |
about how to improve the performance of their app. 00:37:21.100 |
And so just like, given this is just like one of those things 00:37:24.860 |
- Yeah, I'd say it is one of the most useful tools 00:37:38.100 |
Kubernetes the hard way, which is don't use Kubernetes. 00:37:42.180 |
Here's everything that you would have to do by yourself. 00:37:44.940 |
And you should be able to put all these things together 00:37:47.220 |
yourself to understand the value of Kubernetes. 00:37:51.620 |
I've done, I was the guy who did the same for React. 00:37:54.820 |
And yeah, it's pretty, well, it's pretty good exercise 00:38:05.700 |
you know, there's all these like hyperparameters, 00:38:12.860 |
what would hyperparameter optimization for RAG look like? 00:38:22.060 |
I think that's something we're kind of looking at. 00:38:29.380 |
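A hedged sketch of what such a sweep could look like; `build_pipeline` and `evaluate` are hypothetical stand-ins for your own indexing and eval code:

```python
from itertools import product

def build_pipeline(documents, chunk_size: int, top_k: int):
    raise NotImplementedError  # hypothetical: chunk, embed, and index the docs

def evaluate(pipeline, eval_questions) -> float:
    raise NotImplementedError  # hypothetical: mean score over a QA eval set

def sweep(documents, eval_questions):
    # Grid-search two of the most impactful RAG knobs: chunk size and top-k.
    best = None
    for chunk_size, top_k in product([256, 512, 1024], [2, 5, 10]):
        score = evaluate(build_pipeline(documents, chunk_size, top_k), eval_questions)
        if best is None or score > best[0]:
            best = (score, chunk_size, top_k)
    return best  # (score, chunk_size, top_k)
```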
- I think it's gonna be hard to find a universal default 00:38:37.100 |
- I do think it's gonna be somewhat dependent 00:38:47.380 |
people are just defining their own like custom parsers 00:38:50.100 |
for like PDFs, markdown files for like, you know, 00:38:52.580 |
SEC filings versus like, you know, Slack conversations. 00:39:01.180 |
Like it really affects the parameters that you wanna pick. 00:39:07.860 |
where you are kind of like training the model basically, 00:39:16.300 |
Maybe we can talk about like the surface area 00:39:19.140 |
You designed LlamaIndex in a way that it's more modular. 00:39:23.340 |
How would you describe the different components 00:39:30.740 |
And I think that there is a certain burden on us 00:39:35.860 |
- Well, number four is customization tutorials. 00:39:50.380 |
and plug it into the rest of our abstractions. 00:39:52.860 |
like maybe some of the basic components of LlamaIndex. 00:39:55.380 |
You can load data from different data sources. 00:39:58.660 |
which is a collection of different data loaders 00:40:04.180 |
like PDFs, file types, like Slack, Notion, all that stuff. 00:40:10.380 |
We have a bunch of like parsers and transformers. 00:40:15.260 |
and then basically figure out a way to load it 00:40:19.220 |
So, I mean, you worked at like Airbyte, right? 00:40:20.940 |
It's kind of like there is some aspect like E and T, right? 00:40:34.060 |
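A typical LlamaHub flow, roughly as the 2023-era API worked; the loader name and arguments below follow the Notion example on LlamaHub, so check the hub for specifics:

```python
from llama_index import download_loader

# Fetch a community loader from LlamaHub at runtime, then load documents
# into LlamaIndex's generic Document abstraction.
NotionPageReader = download_loader("NotionPageReader")
reader = NotionPageReader(integration_token="<NOTION_TOKEN>")
documents = reader.load_data(page_ids=["<PAGE_ID>"])
```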
And then the second piece really is about like, 00:40:44.220 |
So retrieval is one of the core abstractions that we have. 00:40:49.940 |
That's why we have that section on kind of like 00:40:51.460 |
how do you define your own like custom retriever, 00:41:03.140 |
then you can really only do like top K like lookup 00:41:08.540 |
But if you can index it in some sort of like hierarchy, 00:41:12.780 |
like actually traverse relationships between nodes. 00:41:22.700 |
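A sketch of defining your own retriever, modeled on the subclassing pattern in the docs of that era; this hypothetical example merges vector and keyword results:

```python
from typing import List

from llama_index import QueryBundle
from llama_index.retrievers import BaseRetriever
from llama_index.schema import NodeWithScore

class HybridRetriever(BaseRetriever):
    """Union the results of a vector retriever and a keyword retriever."""

    def __init__(self, vector_retriever: BaseRetriever,
                 keyword_retriever: BaseRetriever):
        self._vector = vector_retriever
        self._keyword = keyword_retriever

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        vector_nodes = self._vector.retrieve(query_bundle)
        keyword_nodes = self._keyword.retrieve(query_bundle)
        # Deduplicate by node id; later entries (vector results) win ties.
        combined = {n.node.node_id: n for n in keyword_nodes + vector_nodes}
        return list(combined.values())
```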
There's some response abstraction that can abstract away 00:41:25.260 |
over like long context to actually still give you a response 00:41:28.100 |
even if the context overflows the context window. 00:41:30.420 |
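One common form of that response abstraction is "refine"-style synthesis (a response mode LlamaIndex supports): feed chunks in sequentially and have the LLM improve its running answer, so the total context never has to fit in one window. A sketch, with `complete` standing in for any LLM call:

```python
def synthesize(question: str, chunks: list, complete) -> str:
    # Seed an answer from the first chunk, then refine it with each
    # subsequent chunk; only one chunk is in the prompt at a time.
    answer = complete(f"Context: {chunks[0]}\nQuestion: {question}\nAnswer:")
    for chunk in chunks[1:]:
        answer = complete(
            f"Existing answer: {answer}\n"
            f"New context: {chunk}\n"
            f"Refine the existing answer to the question: {question}"
        )
    return answer
```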
And then there's kind of these like higher level 00:41:32.340 |
like reasoning primitives that I'm gonna define broadly. 00:41:35.820 |
And I'm just gonna call them in some general bucket 00:41:39.340 |
even though everybody has different definitions of agents. 00:41:55.580 |
So the most simple reasoning primitive you can do 00:42:08.660 |
That's something that we might actually explore. 00:42:14.620 |
You can have the LLM like define like a query plan, right? 00:42:27.380 |
like the open AI function calling like while loop 00:42:31.460 |
and try to break it down into some series of steps 00:42:34.340 |
to actually try to execute to get back a response. 00:42:38.340 |
from like simple reasoning primitives to more advanced ones. 00:42:40.620 |
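That OpenAI function-calling while loop, sketched against the 2023 chat completions API; `functions` are JSON-schema tool specs and `impls` maps tool names to Python callables:

```python
import json
import openai

def agent_loop(user_msg: str, functions: list, impls: dict) -> str:
    # Let the model call tools until it answers in plain text.
    messages = [{"role": "user", "content": user_msg}]
    while True:
        msg = openai.ChatCompletion.create(
            model="gpt-3.5-turbo-0613", messages=messages, functions=functions
        )["choices"][0]["message"]
        if not msg.get("function_call"):
            return msg["content"]  # final answer
        name = msg["function_call"]["name"]
        args = json.loads(msg["function_call"]["arguments"])
        messages.append(msg)  # keep the tool call in the history
        messages.append(
            {"role": "function", "name": name, "content": str(impls[name](**args))}
        )
```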
And I think that's the way we kind of think about it 00:42:45.980 |
Like, do they work well over like the types of like data 00:42:49.460 |
- How do you think about optimizing each piece? 00:43:02.180 |
What's kind of like the Delta left on the embedding side? 00:43:05.900 |
Do you think we can get models that are like a lot better? 00:43:09.620 |
where people should really not spend too much time? 00:43:18.340 |
if you think about everything that goes into retrieval, 00:43:28.180 |
Then there's the actual embedding model itself, 00:43:30.020 |
which is something that you can try optimizing. 00:43:31.900 |
And then there's like the retrieval algorithm. 00:43:37.900 |
And so I do think it's something everybody should try. 00:43:40.900 |
I think by default, we use like OpenAI's embedding model. 00:43:44.740 |
A lot of people these days use like sentence transformers 00:43:48.780 |
and you can actually optimize, directly optimize it. 00:43:56.420 |
it should ideally be relatively free for every developer 00:44:00.580 |
to just run some fine tuning process over their data 00:44:03.060 |
to squeeze out some more points and performance. 00:44:04.860 |
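A sketch of that fine-tuning process with sentence-transformers, trained on (question, relevant chunk) pairs with in-batch negatives; the model name and example pair are just illustrative:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("BAAI/bge-small-en")  # any ST-compatible model
train_examples = [
    InputExample(texts=["What was Q2 revenue?", "Revenue for Q2 was $10M..."]),
    # ... more (query, relevant passage) pairs, possibly synthetically generated
]
loader = DataLoader(train_examples, shuffle=True, batch_size=16)
# Other passages in the batch act as negatives for each query.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=50)
model.save("finetuned-embeddings")
```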
And if it's that relatively free and there's no downsides, 00:44:12.220 |
especially in a production grade data pipeline. 00:44:14.260 |
If you actually fine tune the embedding model 00:44:18.380 |
you're gonna have to re-index all your documents. 00:44:20.300 |
And for a lot of people, that's not feasible. 00:44:22.300 |
And so I think like Joe from Vespa on our webinars, 00:44:25.460 |
there's this idea that depending on kind of like, 00:44:29.060 |
if you're just using like document and query embeddings, 00:44:32.220 |
you could keep the document embeddings frozen 00:44:34.700 |
and just train a linear transform on the query 00:44:36.660 |
or any sort of transform on the query, right? 00:44:38.780 |
So therefore it's just a query side transformation 00:44:44.300 |
The other piece is- - Wow, that's pretty smart. 00:44:50.340 |
but it does like improve performance a little bit. 00:45:05.380 |
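A minimal sketch of that query-side idea: document embeddings stay frozen (no re-indexing), and a linear map fitted on (query, relevant-doc) embedding pairs is applied to queries at retrieval time. The least-squares objective here is just one simple choice, not necessarily what Vespa does:

```python
import numpy as np

def fit_query_transform(q: np.ndarray, d: np.ndarray,
                        lr: float = 0.01, steps: int = 200) -> np.ndarray:
    """q, d: (n, dim) arrays of paired query / relevant-doc embeddings."""
    W = np.eye(q.shape[1])
    for _ in range(steps):
        err = q @ W.T - d             # residual between mapped query and doc
        W -= lr * err.T @ q / len(q)  # gradient step on 0.5 * ||qW^T - d||^2
    return W  # at query time, retrieve with W @ query_emb
```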
to try to like optimize the retrieval process. 00:45:11.660 |
it kind of lives in some latent space, right? 00:45:22.500 |
But like depending on the specific types of questions 00:45:26.860 |
the latent space might not be optimized, right? 00:45:32.980 |
the relevant piece of context that the user wanna ask. 00:45:34.740 |
So can you shift the embedding points a little bit, right? 00:45:46.340 |
I got a bunch of startup pitches that are like, 00:45:48.580 |
like RAG is cool, but like there's a lot of stuff 00:45:54.300 |
There's a lot of stuff in terms of sunsetting data 00:45:57.980 |
once it starts to become stale, that could be better. 00:46:03.740 |
So like you have SEC Insights as one of kind of like 00:46:06.260 |
your demos and that's like a great example of, 00:46:08.860 |
hey, I don't wanna embed all the historical documents 00:46:19.980 |
and versus how much you expect others to take care of? 00:46:23.220 |
- Yeah, I'm happy to talk about SEC Insights in just a bit. 00:46:25.660 |
I think more broadly about the like overall retrieval space, 00:46:28.260 |
we're very interested in it because a lot of these 00:46:29.940 |
are very practical problems that people have asked us. 00:46:33.300 |
I think how do you like deprecate or time-weight data 00:46:38.580 |
so you don't just like kind of set some parameter 00:46:41.620 |
all your retrieval algorithms is pretty important 00:46:43.740 |
because people have started bringing that up. 00:46:46.940 |
things get out of date, how do I like sunset documents? 00:46:56.180 |
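One hedged way to time-weight retrieval is to decay each chunk's similarity score by its age, so stale documents gradually sink in the ranking:

```python
import time

def decayed_score(similarity: float, doc_timestamp: float,
                  half_life_days: float = 30.0) -> float:
    age_days = (time.time() - doc_timestamp) / 86400
    return similarity * 0.5 ** (age_days / half_life_days)  # halves per half-life
```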
like new retriever techniques for the sake of like 00:47:04.180 |
that's like intuitive and easy for people to understand. 00:47:09.980 |
and new retrieval techniques that are kind of in place 00:47:18.220 |
I mean, like the reason for this is just like, 00:47:20.140 |
if you think about how like the idea of like chunking text, 00:47:24.540 |
right, like that really, that just really wasn't a thing 00:47:28.780 |
or at least for this specific purpose of like, 00:47:31.500 |
like the reason chunking is a thing in RAG right now 00:47:38.220 |
That just was less of a thing, I think back then. 00:47:42.900 |
it was more for like structured data extraction 00:47:45.540 |
And so there's kind of like certain new concepts 00:47:47.540 |
that you gotta play with that you can use to invent 00:47:50.740 |
kind of more interesting retrieval techniques. 00:47:52.740 |
Another example here is actually LLM based reasoning, 00:48:00.700 |
and use that to actually send to your retrieval system. 00:48:09.500 |
but then you can kind of figure out an interesting way 00:48:32.540 |
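A tiny sketch of that kind of LLM-based query transformation, rewriting the user's question before the embedding lookup; `complete` stands in for any LLM call:

```python
def rewrite_query(user_question: str, complete) -> str:
    # Rewrite a conversational question into a keyword-dense search query
    # before sending it to the retrieval system.
    return complete(
        "Rewrite the following question as a short, keyword-dense search "
        f"query for a document index:\n{user_question}"
    )
```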
- So I think I've started to like step on the brakes 00:48:44.940 |
but like how do people know which one is good 00:48:54.380 |
for the next few weeks is actually like properly 00:48:56.780 |
kind of like having an understanding of like, 00:49:01.260 |
- Yeah, some kind of like maybe like a flow chart, 00:49:06.020 |
- When this, do that, you know, something like that 00:49:17.020 |
- Yeah, yeah, just, I mean, that's kind of like a good- 00:49:19.780 |
- It seems like your most successful side project. 00:49:26.940 |
Our SEC Insights is a full stack LLM chatbot application 00:49:31.660 |
that does analysis over your SEC 10-K and 10-Q filings, 00:49:47.820 |
We actually ended up like adding a bunch of stuff 00:49:51.900 |
And I think it was great because like, you know, 00:49:53.900 |
thinking about how we handle like callbacks, streaming, 00:49:57.820 |
actually generating like reliable sub-responses 00:50:03.740 |
if you're just building the library in isolation, 00:50:13.860 |
Like if you go into SEC Insights and you type something, 00:50:16.180 |
you can actually see the highlights in the right side. 00:50:20.340 |
that like took a little bit of like understanding 00:50:23.820 |
And so it was great for dogfooding improvement 00:50:28.260 |
the second thing was we're starting to talk to users 00:50:33.820 |
like the potential of LlamaIndex as a framework. 00:50:36.740 |
Because these days, obviously building a chatbot, right, 00:50:45.580 |
But how do you build something that kind of like satisfies 00:50:48.020 |
some of this like criteria of surfacing like citations, 00:50:51.580 |
being transparent, seeing like having a good UX, 00:51:03.100 |
we showed both like, well, first like organizations 00:51:11.740 |
we kind of like stealth launched this for fun, 00:51:15.500 |
just to see if we could get feedback from users 00:51:17.180 |
who are using this world to see like, you know, 00:51:23.780 |
Obviously, we're not gonna sell like a financial app, 00:51:28.660 |
but we're just gonna open source the entire thing. 00:51:30.100 |
And so that now is basically just like a really nice, 00:51:46.540 |
that like aren't released yet that we're going to, 00:51:51.740 |
Like one is just like kind of more detailed guides 00:51:54.220 |
on like different modular components within it. 00:51:57.940 |
you can go in and actually take the pieces that you want 00:52:00.220 |
and actually kind of build your own custom flows. 00:52:03.660 |
take there's like certain components in there 00:52:05.500 |
that might not be directly related to the LLM app 00:52:07.620 |
that would be nice to just like have people use. 00:52:15.820 |
So, you know, you could be using any library you want, 00:52:24.300 |
Yeah, that's a really good community service right there. 00:52:45.180 |
So I think the high level of what I can probably say 00:52:50.180 |
is just like, yeah, I think we're looking at ways 00:52:55.420 |
the developer experience of building with LlamaIndex. 00:52:55.420 |
And so can we build tools that help like augment 00:53:07.220 |
that experience beyond the open source library, right? 00:53:14.900 |
from the open source library with like a one line toggle. 00:53:18.740 |
You can basically get this like complimentary service 00:53:20.980 |
and then figure out a way to like monetize in a bit. 00:53:37.140 |
about all open source is you want to start building 00:53:47.220 |
you've just built your biggest competitor, which is you. 00:53:55.300 |
use the open source library and then you have a toggle 00:53:57.780 |
and all of a sudden, you know, you can see this 00:54:03.580 |
and then it'll be able to kind of like, you'll have a UI, 00:54:14.900 |
Should we go on to like ecosystem and other stuff? 00:54:24.540 |
maybe under, not underrated, but like underexpected, 00:54:27.940 |
you know, and how has the open source side of it helped 00:54:37.980 |
Yeah, I think the nice thing about like LlamaHub itself 00:54:40.820 |
is just, it's supposed to be a community-driven hub. 00:54:49.340 |
like first party connectors actually for this. 00:54:51.180 |
It's more just like kind of encouraging people 00:54:56.100 |
In terms of the most popular tools or the data loaders, 00:55:06.020 |
but there's some subset of them that are popular. 00:55:07.820 |
And then there's Google, like I think Gmail and like G-Drive. 00:55:12.260 |
And then I think maybe it's like one of Slack or Notion. 00:55:17.820 |
and I think like Swyx probably knows this better 00:55:20.260 |
than I do, given that you used to work at Airbyte, 00:55:24.580 |
especially for a full-on service like Notion, Slack 00:55:29.260 |
really high quality loader that really extracts 00:55:33.220 |
And so I think the thing is when people start out, 00:55:41.140 |
And for a lot of people it's like good enough 00:55:42.820 |
and they submit PRs if they want more additional features. 00:55:45.260 |
If like you get to a point where you actually wanna call 00:55:49.820 |
or, you know, you want to kind of load in stuff 00:55:58.660 |
people just start writing their own custom loaders. 00:56:02.300 |
And that's something that we're okay with, right? 00:56:03.980 |
'Cause like a lot of this is more just like community driven 00:56:08.740 |
otherwise you can create your own custom ones. 00:56:13.060 |
within LlamaIndex or do you pair it with something else? 00:56:20.060 |
- 'Cause typically in the data ecosystem with Airbyte, 00:56:23.580 |
you know, Airbyte has its own strategies with custom loaders, 00:56:26.100 |
but also you could write your own with like Dagster 00:56:33.180 |
we just have a very flexible like document abstraction 00:56:35.140 |
that you can fill in with any content that you want. 00:56:37.980 |
Are people really dumping all their Gmail into these things? 00:56:44.100 |
- Yeah, it's like one of Google, some Google product. 00:56:52.620 |
- I mean, that's the most private data source. 00:56:57.580 |
- So I'm surprised that people don't meet you. 00:57:01.500 |
but like I'm sure, I'm surprised it's popular. 00:57:22.020 |
Cohere, Anthropic, you know, whatever you're seeing. 00:57:29.060 |
I think there is a lot of people trying out like Llama 2 00:57:32.060 |
and some variant of like a top open-source model. 00:57:38.020 |
- Yeah, I think whenever I go to these talks, 00:57:53.140 |
Yeah, so I think a lot of people are trying out 00:58:01.420 |
there's a lot of toolkits and open-source projects 00:58:04.460 |
that allow you to self-host and deploy Llama 2. 00:58:08.380 |
And like, Llama is just a very recent example, 00:58:12.380 |
And so we just, by virtue of having more of these services, 00:58:14.940 |
I think more and more people are trying it out. 00:58:18.820 |
Is like, is that gonna be an increasing trend? 00:58:27.500 |
whenever like OpenAI has something really cool 00:58:30.020 |
or like any company has something really cool, even Meta, 00:58:33.220 |
like there's just gonna be a huge competitive pressure 00:58:46.980 |
People like, are like, psychologically want that. 00:58:52.580 |
and popular and performance benchmarks, you know? 00:58:56.500 |
And at the end of the day, OpenAI still wins on that. 00:59:02.300 |
unless you were like an active employee at OpenAI, right? 00:59:04.660 |
Like all these research labs are putting out like ML, 00:59:07.540 |
like PhDs or kind of like other companies too, 00:59:11.900 |
There's gonna be a lot of like competitive pressures 00:59:19.500 |
but like there's just a lot of just incentive 00:59:23.340 |
- Have you looked at like RAG-specific models, 00:59:32.940 |
I think is his name, you probably came across him. 00:59:44.540 |
I was hoping that you do, 'cause it's your business. 00:59:51.980 |
I think this kind of relates to my previous point 00:59:56.020 |
Like a RAG-specific model is a model architecture 01:00:00.020 |
And it's less the software engineering principle 01:00:03.860 |
and just plug and play different components into it? 01:00:08.940 |
But like when you wanna end to end optimize the thing, 01:00:15.900 |
I think building your own models is honestly pretty hard. 01:00:20.220 |
And I think the issue is if you also build your own models, 01:00:25.660 |
Like basically the question is when GPT-5 and six 01:00:29.420 |
and whatever, like Anthropic Claude 3 comes out, 01:00:31.860 |
like how can you prove that you're actually better 01:00:40.780 |
this is better than maybe like GPT-3 or GPT-4. 01:00:49.340 |
I know Swyx is wearing a Chroma sweatshirt. 01:00:53.900 |
- I have the mug from Chroma, it's been great. 01:00:57.300 |
- What do you think, what do you think there? 01:01:11.380 |
- I think, yeah, we try to remain unopinionated 01:01:15.460 |
So it's not like, we don't try to like play favorites. 01:01:17.380 |
So we have like a bunch of integrations, obviously. 01:01:19.020 |
And the way we try to do is we just try to find 01:01:23.700 |
will support kind of like slightly additional things 01:01:27.940 |
And the goal is to have our users basically leave it up 01:01:30.860 |
to them to try to figure out like what makes sense 01:01:39.580 |
like embedding lookup algorithm is that high. 01:01:44.300 |
or at least there's just a lot of other stuff you can do 01:01:48.900 |
No, I mean like everything else that we just talked about, 01:01:52.020 |
To improve RAG, like everything that we talked about, 01:01:56.140 |
- Yeah, well, I mean, I was just thinking like, 01:02:00.620 |
there are like eight, it's kind of a Game of Thrones. 01:02:02.580 |
There's like the war of the eight databases right now. 01:02:13.060 |
we're pretty good partners with most of them. 01:02:16.380 |
- Well, like, so if you're a vector database founder, 01:02:25.860 |
and this is something I think I've started to see 01:02:37.020 |
the query sophistication of these vector stores 01:02:39.420 |
and basically make it so that users don't have to think 01:02:50.420 |
It's like a select star or select where, right? 01:02:55.180 |
And then you combine that with semantic search. 01:02:59.140 |
was like trying to do some like joint interface. 01:03:02.420 |
The reason is like most data is semi-structured. 01:03:07.900 |
And so like somehow combining all the expressivity 01:03:12.260 |
of like SQL with like the flexibility of semantic search 01:03:14.820 |
is something that I think is gonna be really important. 01:03:18.860 |
that allow you to jointly query both a SQL database, 01:03:22.540 |
like a separate SQL database and a vector store 01:03:27.220 |
than if you just combined it into one system, yeah. 01:03:29.420 |
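A self-contained sketch of that joint structured-plus-semantic query: SQL handles the structured filter and NumPy handles the semantic ranking over the surviving rows. The `docs` table schema (with an `embedding` BLOB of float32s) is assumed for illustration:

```python
import sqlite3
import numpy as np

def hybrid_query(conn: sqlite3.Connection, query_emb: np.ndarray,
                 year: int, k: int = 5):
    # Structured filter first (SQL WHERE), semantic ranking second.
    rows = conn.execute(
        "SELECT id, text, embedding FROM docs WHERE year = ?", (year,)
    ).fetchall()
    if not rows:
        return []
    embs = np.stack([np.frombuffer(e, dtype=np.float32) for _, _, e in rows])
    sims = embs @ query_emb / (
        np.linalg.norm(embs, axis=1) * np.linalg.norm(query_emb)
    )
    order = np.argsort(-sims)[:k]
    return [(rows[i][0], rows[i][1], float(sims[i])) for i in order]
```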
And so I think like pgvector, like, you know, 01:03:31.380 |
that type of stuff, I think it's starting to get there. 01:03:34.020 |
like how do you have an expressive query language 01:03:37.620 |
along with like all the capabilities of semantic search? 01:03:40.260 |
- So your current favorite is just put it into Postgres? 01:03:49.300 |
- I actually don't know what the best language 01:03:55.180 |
that like the model hasn't been fine-tuned over. 01:03:57.340 |
And so you might wanna train the model over this, 01:04:00.100 |
but some way of like expressing structured data filters. 01:04:06.580 |
It doesn't have to just be like a where clause 01:04:17.340 |
so that's actually something I didn't even bring up yet. 01:04:23.020 |
like explore like relationships within the data too, right? 01:04:25.860 |
And somehow combine that information with stuff 01:04:45.860 |
because there are some like open questions here. 01:04:55.700 |
you might actually just want to do the end-to-end thing first 01:04:57.620 |
just to do a sanity check of whether or not like this, 01:05:04.220 |
And then you only try to do some basic evals. 01:05:06.340 |
And then once you like diagnose what the issue is, 01:05:08.700 |
then you go into the kind of like specific area 01:05:21.820 |
you get back something, you synthesize response, 01:05:24.420 |
And you evaluate the quality of the final response. 01:05:35.180 |
As like a human judge to basically kind of like 01:05:41.300 |
- Well, I think, oh, you're talking about like the startups? 01:05:50.580 |
The main issue right now is just, it's really unreliable. 01:05:53.420 |
Like it's just, like there's like variance in the response 01:05:56.780 |
when you wanna be- - Yeah, then they won't do 01:06:00.820 |
and you'll probably fine tune a model to be a better judge. 01:06:07.260 |
because I don't think there's really a good alternative 01:06:09.740 |
beyond you just human annotating a bunch of data sets 01:06:12.500 |
and then trying to like just manually go through 01:06:17.460 |
And so this is just gonna be a more scalable solution. 01:06:21.140 |
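A minimal LLM-as-judge sketch; the 1-5 rubric is an illustrative choice and `complete` stands in for any strong model's completion call:

```python
def judge(question: str, answer: str, complete) -> int:
    # Ask a strong model to grade a generated answer; unreliable run-to-run,
    # as noted above, so it's usually averaged over samples or calibrated.
    verdict = complete(
        "You are grading an answer for correctness and relevance.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        "Reply with a single integer from 1 (bad) to 5 (excellent)."
    )
    return int(verdict.strip())
```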
I think there's a bunch of companies doing this. 01:06:22.660 |
In the end, it probably comes down to some aspect 01:06:31.860 |
And then I think like what we found is for RAG, 01:06:39.420 |
You're just not able to retrieve the right response. 01:06:46.260 |
I think, what does having good retrieval metrics tell you? 01:06:49.540 |
It tells you that at least like the retrieval is good. 01:06:54.740 |
but at least it gives you some sort of like sanity track, 01:07:00.980 |
Well, retrieval evaluation is pretty standard 01:07:12.500 |
and then there's some ground truth in that ranked set. 01:07:15.420 |
And then you try to measure it based on ranking metrics. 01:07:17.580 |
So the closer that ground truth is to the top, 01:07:27.140 |
And so that's just like a classic ranking problem. 01:07:38.620 |
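Those ranking metrics are straightforward to compute once you have each query's ranked results and its ground-truth document; for example, hit rate and MRR:

```python
def hit_rate_and_mrr(ranked_ids_per_query, ground_truth_ids, k: int = 10):
    # ranked_ids_per_query: list of ranked doc-id lists, one per query.
    # ground_truth_ids: the relevant doc id for each query.
    hits, rr = 0, 0.0
    for ranked, truth in zip(ranked_ids_per_query, ground_truth_ids):
        if truth in ranked[:k]:
            hits += 1
            rr += 1.0 / (ranked.index(truth) + 1)  # reciprocal of 1-based rank
    n = len(ground_truth_ids)
    return hits / n, rr / n  # (hit rate @ k, mean reciprocal rank)
```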
One is just like curating this data set in the first place. 01:07:43.300 |
is this idea of like synthetic data set generation 01:07:49.820 |
and then all of a sudden you have like question 01:07:51.300 |
and then context pairs and that becomes your ground truth. 01:07:56.700 |
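A sketch of that synthetic generation step: each chunk yields a question it answers, and the (question, chunk) pair becomes retrieval ground truth. `complete` stands in for any LLM call:

```python
def generate_eval_set(chunks, complete):
    return [
        (complete(f"Write one question that is answered by this text:\n{chunk}"),
         chunk_id)  # ground truth: this chunk should be retrieved
        for chunk_id, chunk in enumerate(chunks)
    ]
```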
or is there a separate set of stuff for agents 01:08:01.700 |
- Data agents add like another layer of complexity 01:08:03.900 |
'cause then it's just like you have just more loops 01:08:07.060 |
Like you can evaluate like each chain of thought loop itself 01:08:10.740 |
like every LLM call to see whether or not the input 01:08:14.300 |
to that specific step in the chain of thought process 01:08:20.420 |
Or you could evaluate like the final response 01:08:28.700 |
Like you have a top level orchestration agent 01:08:43.660 |
which is pretty unrelated to what we're doing now, 01:08:47.260 |
so you can kind of evaluate like overall agent simulations 01:08:55.180 |
but that's like a very macro principle, right? 01:08:59.220 |
to kind of like model the distribution of things. 01:09:03.980 |
when you're trying to like generate something 01:09:07.300 |
but for stuff where you really want the agent 01:09:16.380 |
It's like, no, like did you like send this email or not? 01:09:18.540 |
Right, like, 'cause otherwise like this thing didn't work. 01:09:26.340 |
So we have two questions, acceleration and exploration, 01:09:35.060 |
that you thought would take much longer to get here? 01:09:48.380 |
honestly, I felt like I got into it pretty late. 01:09:53.580 |
Like just the fact that there was this engine 01:10:01.900 |
I used to work in image generation for a while. 01:10:07.500 |
You would generate these like 32 by 32 images, 01:10:10.420 |
and then now taking a look at some of the stuff 01:10:12.180 |
by like DALL-E and, you know, Midjourney and those things. 01:10:28.340 |
I think a lot of people have thoughts about that, 01:10:37.220 |
into like the architecture of the model itself. 01:10:39.580 |
Like if you have like a personalized assistant 01:10:43.820 |
that will like learn behaviors over time, right? 01:10:48.660 |
what exactly is the right architecture there? 01:10:56.540 |
I don't actually know the specific technique, 01:10:57.940 |
but I don't think it's just gonna be something 01:11:14.260 |
- I know, but like, I just think from like the AGI, 01:11:23.220 |
about just like being able to optimize that system, right? 01:11:26.380 |
And to optimize a system, you need parameters 01:11:35.740 |
what's something you want everyone to think about 01:11:49.940 |
because it's not just like a random like SEC app, 01:11:52.500 |
it's like a full stack thing that we open source, right? 01:12:05.340 |
- Yeah, and the second piece is we are thinking a lot 01:12:10.380 |
I think right now we're kind of exploring integrations 01:12:14.180 |
and so hopefully some of that will be released soon. 01:12:16.820 |
And so just like, how do you basically have an experience 01:12:23.140 |
all of a sudden you can easily run like retrievals, 01:12:25.660 |
evals and like traces, all that stuff in like a service. 01:12:32.540 |
which we did talk about already is this idea of like, 01:12:40.940 |
if you guys haven't already, I think it's in our docs, 01:12:45.180 |
either the kind of like the retriever query engine 01:12:48.860 |
in LlamaIndex or like the conversational QA chain 01:12:57.700 |
'Cause I really think that by doing that process,