
RAG is a hack - with Jerry Liu of LlamaIndex


Chapters

0:00 Introductions and Jerry’s background
4:38 Starting LlamaIndex as a side project
5:27 Evolution from tree-index to current LlamaIndex and LlamaHub architecture
11:35 Deciding to leave Robust to start the LlamaIndex company and raising funding
21:37 Context window size and information capacity for LLMs
23:09 Minimum viable context and maximum context for RAG
24:27 Fine-tuning vs RAG - current limitations and future potential
25:29 RAG as a hack but good hack for now
28:09 RAG benefits - transparency and access control
29:40 Potential for fine-tuning to take over some RAG capabilities
32:05 Baking everything into an end-to-end trained LLM
35:39 Similarities between iterating on ML models and LLM apps
37:6 Modularity and customization options in LlamaIndex: data loading, retrieval, synthesis, reasoning
43:10 Evaluating and optimizing each component of the LlamaIndex system
49:13 Building retrieval benchmarks to evaluate RAG
50:38 SEC Insights - open source full stack LLM app using LlamaIndex
53:7 Enterprise platform to complement LlamaIndex open source
54:33 Community contributions for LlamaHub data loaders
57:21 LLM engine usage - majority OpenAI but options expanding
60:43 Vector store landscape
64:33 Exploring relationships and graphs within data
68:29 Additional complexity of evaluating agent loops
69:20 Lightning Round

Whisper Transcript

00:00:00.000 | (upbeat music)
00:00:02.580 | - Hey, everyone.
00:00:07.940 | Welcome to the Latent Space Podcast.
00:00:09.960 | This is Alessio, partner and CTO
00:00:11.960 | in residence at Decibel Partners,
00:00:13.680 | and I'm joined by my co-host Swyx, founder of Smol AI.
00:00:17.240 | - And today we finally have Jerry Liu on the podcast.
00:00:19.960 | Hey, Jerry.
00:00:20.880 | - Hey, guys.
00:00:21.720 | Hey, Swyx. Hey, Alessio.
00:00:22.600 | Thanks for having me.
00:00:23.720 | - It's so weird because we keep running into each other
00:00:26.160 | in San Francisco AI events,
00:00:27.640 | so it's kind of weird to finally just have a conversation
00:00:30.320 | recorded for everybody else.
00:00:31.680 | - Yeah, I know.
00:00:32.500 | I'm really looking forward to this.
00:00:33.340 | I guess I have further questions.
00:00:35.760 | - So I tend to introduce people on their formal background
00:00:38.220 | and then ask something on the more personal side.
00:00:41.240 | So you are part of the Princeton gang.
00:00:44.240 | - Yeah.
00:00:46.720 | I don't know if there is like an official Princeton gang.
00:00:49.200 | - There is more Princeton gang.
00:00:50.040 | I attended your meeting.
00:00:51.240 | There was like four of you.
00:00:52.440 | - Oh, cool.
00:00:53.280 | Okay, nice.
00:00:54.100 | - With Prem and the others.
00:00:55.200 | - Oh, yeah, yeah, yeah, yeah.
00:00:57.500 | - Where you did bachelor's in CS
00:00:58.840 | and certificate in finance.
00:01:00.020 | That's also fun.
00:01:01.000 | I also did finance.
00:01:02.280 | And I think I saw that you also interned at Two Sigma
00:01:04.720 | where I worked in New York.
00:01:06.780 | You were a machine learning engineer.
00:01:07.620 | - You were at Two Sigma?
00:01:08.720 | - Yeah, very briefly.
00:01:09.840 | - Oh, cool.
00:01:10.680 | All right, I didn't know that.
00:01:11.520 | Okay.
00:01:12.340 | - That was my first like proper engineering job
00:01:13.180 | before I went into DevRel.
00:01:14.520 | - Oh, okay.
00:01:15.360 | Oh, wow.
00:01:16.180 | Nice.
00:01:17.020 | - And then you were a machine learning engineer at Quora,
00:01:19.960 | AI research scientist at Uber for three years,
00:01:22.540 | and then two years machine learning engineer
00:01:24.600 | at Robust Intelligence before starting LlamaIndex.
00:01:27.500 | So that's your LinkedIn.
00:01:28.380 | What's not on your LinkedIn
00:01:29.860 | that people should know about you?
00:01:31.300 | - I think back during my Quora days,
00:01:33.180 | I had this like three month phase
00:01:35.420 | where I just wrote like a ton of Quora answers.
00:01:37.460 | And so I think if you look at my tweets nowadays,
00:01:40.060 | you can basically see that as like the V2
00:01:42.780 | of my three month like Quora stint
00:01:44.820 | where I just like went ham on Quora for a bit.
00:01:47.500 | I actually, I think I was back then,
00:01:49.500 | actually, when I was working on Quora,
00:01:51.340 | I think the thing that everybody was fascinated in
00:01:53.700 | was just like general like deep learning advancements
00:01:57.240 | and stuff like GANs and generative like images
00:01:59.900 | and just like new architectures that were evolving.
00:02:01.720 | And it was a pretty exciting time
00:02:02.820 | to be a researcher actually,
00:02:03.920 | 'cause you were going in like really understanding
00:02:05.600 | some of the new techniques.
00:02:06.760 | So I kind of use that as like a learning opportunity
00:02:08.360 | to basically just like read a bunch of papers
00:02:10.200 | and then answer questions on Quora.
00:02:12.040 | And so you can kind of see traces of that
00:02:14.480 | basically in my current Twitter
00:02:15.760 | where it's just like really about kind of like
00:02:17.320 | framing concepts and trying to make it understandable
00:02:19.800 | and educate other users on it.
00:02:21.160 | - Yeah, I've said, so a lot of people come to me
00:02:23.080 | for my Twitter advice,
00:02:23.900 | but like I think you are doing one of the best jobs
00:02:26.120 | in AI Twitter,
00:02:27.360 | just explaining concepts
00:02:28.360 | and just consistently getting hits out.
00:02:30.760 | - Thank you.
00:02:31.600 | (laughing)
00:02:32.440 | - And I didn't know it was due to the Quora training.
00:02:34.880 | Let's stay on Quora for a bit.
00:02:36.660 | A lot of people, including myself,
00:02:38.080 | like kind of wrote off Quora as like one of the web 1.0
00:02:40.660 | like sort of question answer forums.
00:02:42.440 | But now I think it's seeing a resurgence,
00:02:45.140 | obviously due to Poe.
00:02:46.520 | And obviously Adam D'Angelo
00:02:48.180 | has always been a leading tech figure,
00:02:49.840 | but what do you think is like kind of underrated about Quora?
00:02:52.120 | - I really like the mission of Quora when I joined.
00:02:54.640 | In fact, I think when I interned there like in 2015
00:02:58.200 | and I joined full time in 2017,
00:02:59.960 | one is like they had
00:03:02.200 | and they have like a very talented engineering team
00:03:05.120 | and just like really, really smart people.
00:03:07.720 | And the other part is the whole mission of the company
00:03:10.120 | is to just like spread knowledge and to educate people.
00:03:13.200 | Right, and to me that really resonated.
00:03:15.200 | I really liked the idea of just like education
00:03:17.560 | and democratizing the flow of information.
00:03:19.440 | And if you imagine like kind of back then
00:03:22.040 | it was like, okay, you have Google,
00:03:23.240 | which is like for search,
00:03:24.320 | but then you have Quora,
00:03:25.160 | which is just like user generated
00:03:26.800 | like grassroots type content.
00:03:28.240 | And I really liked that concept
00:03:29.440 | because it's just like, okay,
00:03:30.440 | there's certain types of information
00:03:31.560 | that aren't accessible to people,
00:03:32.640 | but you can make accessible by just like surfacing it.
00:03:34.680 | And so actually, I don't know if like most people
00:03:37.120 | know that about like Quora,
00:03:38.200 | like if they've used the product,
00:03:40.400 | whether through like SEO, right,
00:03:41.920 | or kind of like actively,
00:03:43.440 | but that really was what drew me to it.
00:03:45.360 | - Yeah, I think most people's challenge with it
00:03:48.160 | is that sometimes you don't know
00:03:49.120 | if it's like a veiled product pitch, right?
00:03:51.440 | - Yeah.
00:03:52.280 | - It's like, you know.
00:03:53.120 | - Of course, like quality of the answer matters quite a bit.
00:03:54.880 | And then--
00:03:55.720 | - It's like five alternatives
00:03:56.560 | and then here's the one I work on.
00:03:57.400 | - Yeah, like recommendation issues and all that stuff.
00:03:59.320 | I used, I worked on RecSys at Quora actually.
00:04:01.560 | So, I got a taste of it.
00:04:02.400 | - So how do you solve stuff like that?
00:04:03.800 | - Well, I mean, I kind of more approached it
00:04:05.440 | from machine learning techniques,
00:04:07.600 | which might be a nice segue into RAG actually.
00:04:09.880 | A lot of it was just information retrieval.
00:04:11.240 | We weren't like solving anything
00:04:12.520 | that was like super different
00:04:13.880 | than what was standard in the industry at the time,
00:04:15.640 | but just like ranking based on user preferences.
00:04:18.360 | I think a lot of Quora was very metrics driven.
00:04:20.320 | So just like trying to maximize like, you know,
00:04:22.240 | daily active hours, like, you know,
00:04:24.440 | time spent on site, those types of things.
00:04:27.200 | And all the machine learning algorithms
00:04:29.040 | were really just based on embeddings.
00:04:31.280 | You know, you have a user embedding
00:04:33.000 | and you have like item embeddings
00:04:34.520 | and you try to train the models
00:04:35.640 | to try to maximize the similarity of these.
00:04:37.720 | And it's basically a retrieval problem.
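
For readers who want to see the idea concretely, here is a minimal sketch of that embedding-based retrieval setup. It is not Quora's actual system; the user and item vectors below are random stand-ins for embeddings that would normally be learned.

```python
# Toy sketch of embedding-based recommendation as retrieval: score items
# for a user by embedding similarity and take the top-k. Random vectors
# stand in for embeddings that would normally be learned from user data.
import numpy as np

rng = np.random.default_rng(0)
dim = 64
user_embedding = rng.normal(size=dim)           # stand-in for a learned user vector
item_embeddings = rng.normal(size=(1000, dim))  # stand-ins for learned item vectors

def top_k_items(user_vec, item_matrix, k=5):
    # Cosine similarity between the user vector and every item vector.
    user_norm = user_vec / np.linalg.norm(user_vec)
    item_norms = item_matrix / np.linalg.norm(item_matrix, axis=1, keepdims=True)
    scores = item_norms @ user_norm
    # Indices of the k highest-scoring items, i.e. the retrieval step.
    return np.argsort(scores)[::-1][:k]

print(top_k_items(user_embedding, item_embeddings))
```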
00:04:39.680 | - Okay, so you've been working on RAG
00:04:41.640 | for longer than most people think?
00:04:43.040 | - Well, kind of.
00:04:43.880 | So I worked there for like a year, right?
00:04:45.760 | - Yeah.
00:04:46.600 | - Just transparently.
00:04:47.440 | And then I worked at Uber
00:04:48.640 | where I was not working on ranking.
00:04:49.920 | It was more like kind of deep learning training
00:04:52.160 | for self-driving and computer vision and that type of stuff.
00:04:55.520 | But I think, yeah, I mean, I think in the LLM world,
00:04:59.360 | it's kind of just like a combination
00:05:01.200 | of like everything these days.
00:05:03.320 | I mean, retrieval is not really LLMs,
00:05:05.200 | but like it fits within the space of like LLM apps.
00:05:08.960 | And then obviously like having knowledge
00:05:10.440 | of the underlying deep learning architecture helps,
00:05:12.600 | having knowledge of basic software engineering principles
00:05:14.720 | helps too.
00:05:15.560 | And so I think it's kind of nice that like
00:05:18.360 | this whole LLM space is basically just a combination
00:05:20.320 | of just like a bunch of stuff
00:05:21.240 | that you probably like people have done in the past.
00:05:24.240 | - It's good.
00:05:25.080 | It's like a summary capstone project.
00:05:26.600 | - Yeah, exactly.
00:05:27.440 | Yeah.
00:05:28.800 | - Yeah.
00:05:29.640 | - And before we dive into LlamaIndex,
00:05:34.160 | what do they feed you at Robust Intelligence
00:05:36.240 | that both you and Harrison from LangChain
00:05:38.480 | came out of it at the same time?
00:05:40.200 | Was there like, yeah, is there any fun story
00:05:43.160 | of like how both of you kind of came out
00:05:44.880 | with kind of like core infrastructure
00:05:46.920 | to LLM workflows today?
00:05:48.400 | Or how close were you at robust?
00:05:50.520 | Like any fun behind the scenes?
00:05:52.680 | - Yeah.
00:05:53.520 | Yeah.
00:05:54.360 | We worked pretty closely.
00:05:56.040 | I mean, we were on the same team for like two years.
00:05:57.440 | I got to know Harrison and the rest of the team pretty well.
00:05:59.200 | I mean, I have a respect the people there.
00:06:00.880 | The people there were very driven, very passionate.
00:06:02.480 | And it definitely pushed me to be a better engineer
00:06:04.520 | and leader and those types of things.
00:06:06.880 | Yeah, I don't really have a concrete explanation for this.
00:06:10.600 | I think it's more just,
00:06:11.720 | we had like an LLM hackathon around like September.
00:06:15.160 | This was just like exploring GPT-3
00:06:16.920 | or it was October actually.
00:06:18.560 | And then the day after I went on vacation
00:06:20.160 | for a week and a half.
00:06:21.040 | And so I just didn't track Slack or anything.
00:06:24.000 | Came back, saw that Harrison started LangChain.
00:06:26.280 | I was like, oh, that's cool.
00:06:27.120 | I was like, oh, I'll play around with LLMs a bit
00:06:29.840 | and then hacked around on stuff.
00:06:31.000 | And I think I've told the story a few times,
00:06:32.440 | but you know, I was like trying to feed in information
00:06:34.760 | into the GPT-3.
00:06:36.800 | And then you deal with like context window limitations
00:06:39.160 | and there was no tooling or really practices
00:06:41.320 | to try to understand how do you, you know,
00:06:43.320 | get GPT-3 to navigate large amounts of data.
00:06:45.880 | And that's kind of how the project started.
00:06:47.840 | Really was just one of those things where early days,
00:06:51.400 | like we were just trying to build something
00:06:52.880 | that was interesting and not really,
00:06:55.520 | like I wanted to start a company.
00:06:58.280 | I had other ideas actually of what I wanted to start.
00:07:01.600 | And I was very interested in, for instance,
00:07:03.400 | like multi-modal data, like video data
00:07:05.200 | and that type of stuff.
00:07:06.320 | And then this just kind of grew
00:07:07.720 | and eventually took over the other idea.
00:07:10.080 | Text is the universal interface.
00:07:12.640 | - I think so.
00:07:13.640 | I think so.
00:07:14.480 | I actually think once the multi-modal models come out,
00:07:16.600 | I think there's just like mathematically nicer properties
00:07:19.520 | of you can just get like joint multi-modal embeddings,
00:07:21.720 | like CLIP-style.
00:07:23.600 | But how, like text is really nice
00:07:25.800 | because from a software engineering principle,
00:07:27.360 | it just makes things way more modular.
00:07:28.560 | You just convert everything into text
00:07:29.920 | and then you just represent everything as text.
00:07:31.400 | - Yeah.
00:07:32.240 | I'm just explaining retroactively
00:07:33.400 | why working on LlamaIndex took off
00:07:35.400 | versus if you had chose to spend your time on multi-modal,
00:07:38.240 | we probably wouldn't be talking about
00:07:40.160 | whatever you ended up working on.
00:07:41.480 | - Yeah, that's true.
00:07:43.400 | - It's troubled.
00:07:44.600 | Yeah, I think so.
00:07:46.600 | So interesting.
00:07:47.440 | So November 9th.
00:07:49.280 | So that was a very productive month, I guess.
00:07:51.560 | So October, November.
00:07:53.200 | November 9th, you announced GPT Tree Index
00:07:56.040 | and you picked the tree logo.
00:07:57.280 | Very, very, very cool.
00:07:58.520 | Everyone, every project must have an emoji.
00:08:00.640 | - Yeah.
00:08:01.480 | Yeah.
00:08:02.520 | That probably was somewhat inspired by LangChain.
00:08:05.400 | I will admit, yeah.
00:08:06.680 | - It uses GPT to build a knowledge tree
00:08:08.360 | in a bottoms-up fashion
00:08:09.200 | by applying a summarization prompt for each node.
00:08:11.800 | - Yep.
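
To make the tree-index idea concrete, here is a toy sketch of the bottom-up construction being described: leaf nodes hold raw text chunks, and each parent stores a summary of its children. The `summarize` function is a hypothetical stand-in for the per-node summarization prompt sent to the LLM.

```python
# Toy sketch of a bottom-up tree index: group chunks, summarize each group
# into a parent node, and repeat until a single root summary remains.
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str
    children: list = field(default_factory=list)

def summarize(texts):
    # Hypothetical stand-in for an LLM summarization prompt over the children.
    return "Summary of: " + " | ".join(t[:30] for t in texts)

def build_tree(chunks, fanout=2):
    nodes = [Node(c) for c in chunks]
    while len(nodes) > 1:
        parents = []
        for i in range(0, len(nodes), fanout):
            group = nodes[i:i + fanout]
            parents.append(Node(summarize([n.text for n in group]), group))
        nodes = parents
    return nodes[0]  # root node summarizing the whole corpus

root = build_tree(["chunk one ...", "chunk two ...", "chunk three ..."])
print(root.text)
```

Answering a query then traverses the structure from the root, with the model deciding at each level which child summaries look relevant.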
00:08:12.640 | - Which I like that original vision.
00:08:16.280 | Your messaging around it back then
00:08:18.680 | was also that you're creating optimized data structures.
00:08:21.760 | What's the journey to that
00:08:26.160 | and how does that contrast with LlamaIndex today?
00:08:29.360 | - Yeah, so, okay.
00:08:30.800 | Maybe I can tell a little bit
00:08:31.880 | about the beginning intuitions.
00:08:34.040 | I think when I first started,
00:08:36.040 | this really wasn't supposed to be something
00:08:37.760 | that was like a toolkit that people use.
00:08:40.480 | It was more just like a system.
00:08:42.560 | And the way I wanted to think about the system
00:08:44.280 | was more a thought exercise
00:08:45.480 | of how language models with their reasoning capabilities,
00:08:48.200 | if you just treat them as like brains,
00:08:49.760 | can organize information and then traverse it.
00:08:52.160 | So I didn't want to think about embeddings, right?
00:08:54.200 | To me, embeddings just felt like
00:08:55.440 | it was just an external thing that was like,
00:08:57.320 | well, it was just external
00:08:58.720 | to try and actually tap into the capabilities
00:09:00.760 | of language models themselves, right?
00:09:02.080 | I really wanted to see, you know,
00:09:03.800 | just as a human brain could synthesize stuff,
00:09:06.200 | could we create some sort of structure
00:09:07.840 | where there's this neural CPU, if you will,
00:09:10.640 | can organize a bunch of information,
00:09:12.560 | auto-summarize a bunch of stuff,
00:09:14.200 | and then also traverse the structure that I created.
00:09:16.920 | That was the inspiration for this initial tree index.
00:09:20.280 | It didn't actually, to be honest,
00:09:22.480 | and I think I said this in the first tweet,
00:09:23.840 | it didn't actually work super well, right?
00:09:25.840 | The GPT-3 at the time-
00:09:26.680 | - You're very honest about that.
00:09:27.720 | - Yeah, I know, I mean, it was just like,
00:09:29.760 | GPT-4 obviously is much better at reasoning.
00:09:31.880 | I'm one of the first to say,
00:09:33.280 | you shouldn't use anything pre-GPT-4
00:09:35.240 | for anything that requires complex reasoning
00:09:37.520 | because it's just gonna be unreliable.
00:09:39.000 | Okay, disregarding stuff like fine-tuning,
00:09:40.800 | but it worked okay,
00:09:42.280 | but I think it definitely struck a chord
00:09:44.200 | with kind of like the Twitter crowd,
00:09:46.000 | which is just like looking for kind of,
00:09:48.520 | just like new ideas at the time,
00:09:51.080 | I guess just like thinking about
00:09:52.240 | how you can actually bake this
00:09:53.520 | into some sort of application
00:09:54.760 | because I think what I also ended up discovering
00:09:57.200 | was the fact that basically everybody,
00:10:00.840 | there was starting to become a wave of developers
00:10:02.800 | building on top of GPT-3
00:10:04.400 | and people were starting to realize
00:10:05.760 | that what makes them really useful
00:10:07.480 | is to apply them on top of your personal data.
00:10:09.640 | And so even if the solution itself
00:10:11.960 | was kind of like primitive at the time,
00:10:13.640 | like the problem statement itself was very powerful.
00:10:16.200 | And so I think being motivated by the problem statement,
00:10:18.400 | right, like this broad mission
00:10:19.520 | of how do I unlock LLMs on top of the data
00:10:21.840 | also contributed to the development of LlamaIndex
00:10:24.560 | to the state it is today.
00:10:26.200 | And so I think part of the reason
00:10:28.320 | our toolkit has evolved beyond
00:10:30.680 | the like just existing set of like data structures
00:10:33.120 | is we really tried to take a step back and think,
00:10:35.360 | okay, what exactly are the tools
00:10:36.920 | that would actually make this useful for a developer?
00:10:39.440 | And then, you know, somewhere around December,
00:10:41.200 | we made an active effort
00:10:42.600 | to basically like push towards that direction,
00:10:44.720 | make the code base more modular, right,
00:10:46.360 | more friendly as an open source library.
00:10:48.160 | And then also start adding in like embeddings,
00:10:51.200 | start thinking about practical considerations
00:10:53.840 | like latency, cost, performance, those types of things.
00:10:56.240 | And then really motivated by that mission,
00:10:58.680 | like start expanding the scope of the toolkit
00:11:00.400 | towards like covering the life cycle
00:11:02.160 | of like data ingestion and querying.
00:11:04.680 | - Yeah, where you also added LlamaHub and--
00:11:08.240 | - Yeah, yeah, so I think that was in like January
00:11:10.920 | on the data loading side.
00:11:12.040 | And so we started adding like some data loaders,
00:11:14.120 | saw an opportunity there,
00:11:15.880 | started adding more stuff on the retrieval querying side,
00:11:18.600 | right, we still have like the core data structures,
00:11:20.640 | but how do you actually make them more modular
00:11:22.840 | and kind of like decouple storing state
00:11:25.400 | from the types of like queries
00:11:26.840 | that you could run on top of this a little bit.
00:11:28.920 | And then starting to get into more complex interactions
00:11:31.520 | like chain-of-thought reasoning, routing,
00:11:33.120 | and you know, like agent loops.
00:11:34.920 | - Yeah, yeah, very cool.
00:11:36.360 | - And then you and I spent a bunch of time earlier this year
00:11:39.960 | talking about LlamaHub, what that might become.
00:11:42.920 | You were still at Robust.
00:11:45.440 | When did you decide it was time to start the company
00:11:48.600 | and then start to think about what LlamaIndex is today?
00:11:52.560 | - Probably December, yeah.
00:11:54.080 | And so it was clear that, you know,
00:11:56.520 | it was kind of interesting.
00:11:57.360 | I was getting some inbound from initial VCs.
00:11:59.160 | I was talking about this project.
00:12:00.440 | And then in the beginning, I was like,
00:12:01.960 | oh yeah, you know, this is just like a side project,
00:12:03.920 | but you know, what about my other idea on like video data?
00:12:06.320 | Right, and I was trying to like get their thoughts on that.
00:12:09.800 | And then everybody was just like, oh yeah, whatever.
00:12:12.440 | Like that part's like a crowded market.
00:12:14.800 | And then it became clear that, you know,
00:12:16.440 | this was actually a pretty big opportunity.
00:12:18.320 | And like coincidentally, right,
00:12:20.120 | like this actually did relate to,
00:12:22.040 | like my interests have always been
00:12:23.080 | at the intersection of AI data
00:12:25.080 | and kind of like building practical applications.
00:12:26.600 | And it was clear that this was evolving
00:12:28.240 | into a much bigger opportunity than the previous idea was.
00:12:30.680 | So around December.
00:12:31.520 | And then I think I gave a pretty long notice,
00:12:33.120 | but I left officially like early March.
00:12:35.600 | - What were your thinkings in terms of like moats and,
00:12:40.600 | you know, founders kind of like overthink it sometimes.
00:12:43.200 | You obviously had like a lot of open source love
00:12:45.520 | and like a lot of community.
00:12:47.120 | And yeah, like, were you ever thinking, okay,
00:12:50.200 | I don't know, this is maybe not enough to start a company
00:12:52.800 | or did you always have conviction about it?
00:12:55.680 | - Oh no, I mean, a hundred percent.
00:12:56.760 | I felt like I did this exercise, like honestly,
00:12:59.600 | probably more late December and then early January,
00:13:03.040 | 'cause I was just existentially worried about
00:13:05.360 | whether or not this would actually be a company at all.
00:13:08.160 | And okay, what were the key questions I was thinking about?
00:13:11.360 | And these were the same things that like other founders,
00:13:14.400 | investors, and also like friends would ask me is just like,
00:13:17.000 | okay, what happens if context windows get much bigger?
00:13:20.520 | What's the point of actually structuring data, right,
00:13:22.880 | in the right way, right?
00:13:24.920 | Why don't you just dump everything into the prompt?
00:13:27.040 | Fine tuning, like what if you just train the model
00:13:28.840 | over this data?
00:13:29.680 | And then, you know, what's the point of doing this stuff?
00:13:32.880 | And then some other ideas is what if like open AI
00:13:36.200 | actually just like takes this, like, you know,
00:13:39.880 | builds upwards on top of the,
00:13:42.480 | their existing like foundation models
00:13:43.920 | and starts building in some like built-in orchestration
00:13:46.280 | capabilities around stuff like RAG and agents
00:13:48.200 | and those types of things.
00:13:49.160 | And so I basically ran through this mental exercise
00:13:51.200 | and, you know, I'm happy to talk a little bit more
00:13:53.520 | about those thoughts as well,
00:13:54.480 | but at a high level, well, context windows
00:13:57.400 | have gotten bigger,
00:13:58.240 | but there's obviously still a need for RAG.
00:14:00.840 | I think RAG is just like one of those things that like,
00:14:03.480 | in general, what people care about is yes,
00:14:05.480 | they do care about performance,
00:14:07.040 | but they also care about stuff like latency and costs.
00:14:09.280 | And my entire reasoning at the time was just like, okay,
00:14:12.320 | like, yes, maybe we'll have like much bigger context windows
00:14:15.760 | as we've seen with like 100K context windows,
00:14:17.760 | but for enterprises like, you know, data,
00:14:20.280 | which is not in just like the scale of like a few documents,
00:14:23.640 | it's usually in like gigabytes, terabytes, petabytes,
00:14:26.360 | like how do you actually just unlock language models
00:14:28.840 | over that data, right?
00:14:30.120 | And so it was clear there was just like,
00:14:32.480 | whether it's RAG or some other paradigm,
00:14:34.480 | no one really knew what that answer was.
00:14:36.040 | And so there was clearly like technical opportunity here.
00:14:38.080 | Like there was just stacks that needed to be invented
00:14:40.160 | to actually solve this type of problem
00:14:41.920 | because language models themselves
00:14:43.000 | didn't have access to this data.
00:14:44.360 | And so if like you just dumped all this data into,
00:14:47.440 | let's say a model had like hypothetically
00:14:49.360 | an infinite context window, right?
00:14:50.800 | And you just dump like 50 gigabytes of data
00:14:52.960 | into the context window.
00:14:54.400 | That just seemed very inefficient to me
00:14:55.760 | because you have these network transfer costs
00:14:57.440 | of uploading 50 gigabytes of data
00:14:59.360 | to get back a single response.
00:15:01.040 | And so I kind of realized, you know,
00:15:03.000 | there's always gonna be some curve,
00:15:04.520 | regardless of like the performance
00:15:05.760 | of the best performing models,
00:15:07.120 | of like cost versus performance.
00:15:10.120 | And so what RAG does is it does provide extra data points
00:15:14.200 | along that axis because you can kind of control
00:15:16.000 | the amount of context you actually want it to retrieve.
00:15:18.600 | And of course, like RAG as a term
00:15:20.640 | was still evolving back then,
00:15:21.720 | but it was just this whole idea of like,
00:15:23.200 | how do you just fetch a bunch of information
00:15:25.040 | to actually, you know, like stuff into the prompt.
00:15:27.360 | And so people, even back then,
00:15:28.880 | were kind of thinking about some of those considerations.
00:15:30.720 | - And then you fundraised in June,
00:15:33.000 | or you announced your fundraise in June.
00:15:34.320 | - Yeah.
00:15:35.160 | - With Greylock.
00:15:36.480 | How was that process?
00:15:37.800 | Just like take us through that process
00:15:40.720 | of thinking about the fundraise
00:15:42.080 | and your plans for the company, you know, at the time.
00:15:46.360 | - Yeah, definitely.
00:15:47.200 | I mean, I think we knew we wanted to,
00:15:48.760 | I mean, obviously we knew we wanted to fundraise.
00:15:50.360 | I think there was also a bunch of like investor interest
00:15:53.160 | and it was probably pretty unusual
00:15:54.520 | given the, you know, like hype wave of generative AI.
00:15:56.880 | So like a lot of investors were kind of reaching out
00:15:58.920 | around like December, January, February.
00:16:00.920 | In the end, we went with Greylock.
00:16:02.040 | Greylock's great.
00:16:02.880 | You know, they've been great partners so far.
00:16:04.640 | And like, to be honest,
00:16:06.160 | like there's a lot of like great VCs out there.
00:16:08.200 | And a lot of them who are specialized
00:16:09.720 | on like open source, data, infra, and that type of stuff.
00:16:13.200 | What we really wanted to do was,
00:16:15.280 | because for us, like time was of the essence,
00:16:17.280 | like we wanted to ship very quickly
00:16:19.000 | and still kind of build Mindshare in this space.
00:16:21.040 | We just kept the fundraising process very efficient.
00:16:23.000 | I think we basically did it in like a week
00:16:25.160 | or like three days, I think so.
00:16:27.200 | - Yeah.
00:16:28.040 | - Just like front loaded it.
00:16:29.400 | And then, and then just like--
00:16:31.400 | - We picked the one named Jerry.
00:16:32.720 | - Hey, yeah, exactly.
00:16:34.160 | (both laughing)
00:16:36.960 | - I'm kidding.
00:16:37.800 | Guys, I mean, he's obviously great
00:16:39.360 | and Greylock's fantastic for him.
00:16:41.280 | - Yeah, I know.
00:16:42.120 | And embedding some of my research.
00:16:43.480 | So yeah, just we picked Greylock.
00:16:46.560 | They've been great partners.
00:16:48.240 | I think in general, when I talk to founders
00:16:49.960 | about like the fundraise process,
00:16:51.880 | it's never like the most fun period, I think.
00:16:54.360 | Because it's always just like, you know,
00:16:56.000 | there's a lot of logistics, there's lawyers,
00:16:57.800 | you have to, you know, get in the loop.
00:16:59.720 | And then you, and like a lot of founders
00:17:01.520 | just want to go back to building.
00:17:02.760 | And so I think in the end,
00:17:04.200 | we're happy that we kept it to a pretty efficient process.
00:17:07.280 | - Cool.
00:17:08.120 | And so you fundraise with Simon, your co-founder.
00:17:10.240 | And how do you split things with him?
00:17:11.920 | How big is your team now?
00:17:13.120 | - The team is growing.
00:17:14.360 | By the time this podcast is released,
00:17:17.280 | we'll probably have had one more person join the team.
00:17:19.680 | And so basically, it's between,
00:17:22.840 | we're rapidly getting to like eight or nine people.
00:17:25.000 | At the current moment, we're around like six.
00:17:26.680 | And so just like, there'll be some exciting developments
00:17:29.720 | in the next few weeks.
00:17:30.560 | So I'm excited to kind of, to announce that.
00:17:34.040 | We've been pretty selective
00:17:36.280 | in terms of like how we like grow the team.
00:17:37.880 | Obviously, like we look for people that are really active
00:17:40.000 | in terms of contributions to LlamaIndex,
00:17:41.880 | people that have like very strong engineering backgrounds.
00:17:44.240 | And primarily, we've been kind of just looking for builders,
00:17:46.360 | people that kind of like grow the open source
00:17:47.800 | and also eventually this like managed
00:17:49.720 | like enterprise platform as well with us.
00:17:52.160 | In terms of like Simon,
00:17:53.480 | yeah, I've known Simon for a few years now.
00:17:55.080 | I knew him back at Uber ATG in Toronto.
00:17:57.640 | He's one of the smartest people I knew.
00:17:59.600 | Like has a sense of both like a deep understanding of ML,
00:18:04.600 | but also just like first principles thinking
00:18:06.440 | about like engineering and technical concepts in general.
00:18:09.120 | And I think one of my criteria is when I was like
00:18:10.920 | looking for a co-founder for this project
00:18:12.880 | was someone that was like technically better than me,
00:18:14.560 | 'cause I knew I wanted like a CTO.
00:18:16.280 | And so honestly, like there weren't a lot of people that,
00:18:19.080 | I mean, I know a lot of people that are smarter than me,
00:18:21.200 | but like that fit that bill,
00:18:22.400 | were willing to do a startup
00:18:23.440 | and also just have the same like values that I shared, right?
00:18:26.200 | And just, I think doing a startup is very hard work, right?
00:18:28.760 | It's not like, I'm sure like you guys all know this.
00:18:31.240 | It's a lot of hours, a lot of late nights,
00:18:33.800 | and you want to be like in the same place together
00:18:36.360 | and just like being willing to hash out stuff
00:18:38.000 | and have that grit basically.
00:18:39.280 | And I really looked for that.
00:18:40.640 | And so Simon really fit that bill.
00:18:42.440 | And I think I convinced him to jump on board.
00:18:44.800 | - Yeah, yeah, nice job.
00:18:46.240 | And obviously I've had the pleasure of chatting
00:18:48.320 | and working with a little bit with both of you.
00:18:50.960 | What would you say those like your top like one
00:18:53.000 | or two values are when thinking about that
00:18:55.440 | or the culture of the company and that kind of stuff?
00:18:58.200 | - Yeah, well, I think in terms of the culture of the company
00:19:01.680 | it's really like, I mean, there's a few things
00:19:04.400 | I can name off the top of my head.
00:19:05.800 | One is just like passion, integrity.
00:19:08.720 | I think that's very important for us.
00:19:09.760 | We want to be honest.
00:19:10.600 | We don't want to like obviously like copy code
00:19:12.880 | or kind of like, you know, just like, you know
00:19:15.000 | not give attribution, those types of things.
00:19:16.600 | And just like be true to ourselves.
00:19:18.520 | I think we're all very like down to earth,
00:19:19.960 | like humble people.
00:19:20.800 | But obviously I think just willingness
00:19:22.840 | to just like own stuff and dive right in.
00:19:25.480 | And I think grit comes with that.
00:19:26.720 | I think in the end, like this is a very fast moving space
00:19:29.560 | and we want to just like be one of the, you know
00:19:32.200 | like dominant forces in helping to provide
00:19:33.920 | like production-quality LLM applications.
00:19:35.960 | - Yeah.
00:19:37.000 | So I promise we'll get to the more technical questions.
00:19:39.440 | But I also want to impress on the audience
00:19:42.160 | that this is a very conscious
00:19:43.600 | and intentional company building.
00:19:46.000 | And since your fundraising post, which was in June
00:19:51.000 | and now it's September, so it's been about three months.
00:19:53.760 | You've actually gained 50% in terms of stars and followers.
00:19:58.480 | You've 3X your download count to 600,000 a month
00:20:01.480 | and your Discord membership has reached 10,000.
00:20:03.600 | So like a lot of ongoing growth.
00:20:06.200 | - Yeah, definitely.
00:20:07.040 | And obviously there's a lot of room to expand there too.
00:20:09.600 | And so open source growth is gonna continue
00:20:12.280 | to be one of our core goals.
00:20:14.200 | 'Cause in the end, it's just like,
00:20:15.240 | we want this thing to be, well, one big, right?
00:20:17.480 | We all have like big ambitions,
00:20:18.960 | but to just like really provide value to developers
00:20:21.600 | and helping them in prototyping
00:20:23.840 | and also productionization of their apps.
00:20:25.720 | And I think it turns out we're in the fortunate circumstance
00:20:28.200 | where a lot of different companies and individuals, right?
00:20:31.120 | Are in that phase of like, you know
00:20:32.560 | maybe they've hacked around
00:20:33.600 | on some initial LLM applications,
00:20:35.560 | but they're also looking to, you know,
00:20:37.100 | start to think about what are the production grade
00:20:39.000 | challenges they need, you know,
00:20:41.520 | to solve to actually make this thing robust
00:20:44.520 | and reliable in the real world.
00:20:45.680 | And so we want to basically provide the tooling to do that.
00:20:49.120 | And to do that, we need to both spread awareness
00:20:51.160 | and education of a lot of the key practices
00:20:52.800 | of what's going on.
00:20:53.800 | And so a lot of this is going to be continued growth,
00:20:55.960 | expansion and education.
00:20:56.960 | And we do prioritize that very heavily.
00:20:59.060 | - Awesome.
00:21:01.420 | Let's dive into some of the questions
00:21:02.920 | you were asking yourself initially around fine tuning
00:21:06.320 | and rag, how these things play together.
00:21:09.280 | You mentioned context.
00:21:11.320 | What is the minimum viable context for rag?
00:21:14.840 | So what's like a context window too small.
00:21:17.440 | And at the same time,
00:21:18.280 | maybe what's like a maximum context window.
00:21:21.280 | We talked before about the LLMs are U-shaped reasoners.
00:21:25.320 | So as the context got larger,
00:21:27.520 | like it really only focuses on the end
00:21:29.680 | and the start of the prompt
00:21:30.840 | and then it kind of peters down.
00:21:33.520 | Any learnings, any kind of like tips
00:21:36.800 | you want to give people as they think about it?
00:21:39.800 | - So this is a great question.
00:21:41.160 | And I think part of what I wanted to kind of like
00:21:45.280 | talk about a conceptual level,
00:21:46.600 | especially with the idea of like thinking about
00:21:48.360 | what is the minimum context?
00:21:49.640 | Like, okay, what if the minimum context was like 10 tokens
00:21:51.880 | versus like, you know, 2K tokens
00:21:53.520 | versus like a million tokens, right?
00:21:54.840 | Like, and what does that really give you?
00:21:56.400 | And what are the limitations if it's like 10 tokens?
00:21:58.760 | It's kind of like, like eight bit, 16 bit games, right?
00:22:02.560 | Like back in the day, like if you play Mario
00:22:04.640 | and you have like the initial Mario
00:22:06.440 | where the graphics were very blocky
00:22:07.840 | and now obviously it's like full HD, 3D,
00:22:10.120 | just the resolution of the context and the output will change
00:22:13.320 | depending on how much context you can actually fit in.
00:22:16.000 | The way I kind of think about this
00:22:17.240 | from a more principled manner is like,
00:22:18.560 | there's this concept of like information capacity,
00:22:22.320 | just this idea of like entropy,
00:22:23.720 | like given any fixed amount of like storage space,
00:22:27.080 | like how much information can you actually compact in there?
00:22:29.960 | And so basically a context window length
00:22:32.000 | is just like some fixed amount of storage space, right?
00:22:34.720 | And so there's some theoretical limit
00:22:36.760 | to the maximum amount of information
00:22:38.120 | you can compact into like a 4,000 token storage space.
00:22:40.920 | And what does that storage space use for these days
00:22:42.840 | with LLMs?
00:22:43.680 | It's for inputs and also outputs.
00:22:46.080 | And so this really controls a maximum amount of information
00:22:48.560 | you can feed in terms of the prompt
00:22:50.280 | plus the granularity of the output.
00:22:52.080 | If you had an infinite context window,
00:22:53.480 | you could have an infinitely detailed response
00:22:55.240 | and also infinitely detailed memory.
00:22:56.960 | But if you don't, you can only kind of represent stuff
00:22:58.960 | in more quantized bits, right?
00:23:01.040 | And so the smaller the context window,
00:23:03.200 | just generally speaking, the fewer details
00:23:05.320 | and the less specific,
00:23:06.800 | precise information
00:23:08.640 | you're gonna be able to surface at any given point in time.
00:23:11.760 | - And when you have short context,
00:23:13.920 | is the answer just like get a better model
00:23:16.680 | or is the answer maybe, hey,
00:23:18.840 | there needs to be a balance between fine tuning and RAG
00:23:21.600 | to make sure you're gonna like leverage the context,
00:23:24.040 | but at the same time, don't keep it too low resolution?
00:23:27.240 | - Yeah, yeah.
00:23:28.080 | Well, there's probably some minimum threshold.
00:23:29.400 | I don't think anyone wants to work with like a 10,
00:23:31.280 | I mean, that's just a thought exercise anyways,
00:23:33.120 | a 10 token context window.
00:23:34.480 | I think nowadays the modern context window
00:23:36.200 | is like 2K, 4K is enough for just like doing
00:23:38.640 | some sort of retrieval on granular context
00:23:40.400 | and be able to synthesize information.
00:23:42.120 | I think for most intents and purposes,
00:23:44.000 | that level of resolution is probably fine for most people,
00:23:46.160 | for most use cases.
00:23:48.040 | I think the limitation is actually more on,
00:23:50.480 | okay, if you're gonna actually combine this thing
00:23:52.520 | with some sort of retrieval data structure mechanism,
00:23:55.240 | there's just limitations on the retrieval side
00:23:58.000 | because maybe you're not actually fetching
00:23:59.600 | the most relevant context
00:24:00.800 | to actually answer this question, right?
00:24:02.360 | Like, yes, like given the right context,
00:24:04.560 | 4,000 tokens is enough,
00:24:05.880 | but if you're just doing like top-k similarity,
00:24:07.720 | like you might not be fetching the right information
00:24:10.000 | from the documents.
00:24:11.160 | - Yeah, so how should people think about
00:24:13.160 | when to stick with RAG
00:24:15.560 | versus when to even entertain fine tuning?
00:24:18.720 | And also in terms of what's like the threshold
00:24:22.000 | of data that you need to actually worry about fine tuning
00:24:25.040 | versus like just stick with RAG.
00:24:26.320 | Obviously you're biased
00:24:27.320 | because you're building a RAG company, but-
00:24:28.880 | - No, no, actually,
00:24:31.000 | I think I have like a few hot takes in here,
00:24:32.600 | some of which sound like a little bit contradictory
00:24:34.400 | to what we're actually building.
00:24:35.640 | To be honest, I don't think anyone knows the right answer.
00:24:37.400 | I think this is just- - We're pursuing the truth.
00:24:38.840 | - Yeah, exactly.
00:24:39.680 | This is just like thought exercise
00:24:40.840 | towards like understanding the truth, right?
00:24:42.120 | So I think, okay, I have a few hot takes.
00:24:45.160 | One is like RAG is basically just a hack,
00:24:47.320 | but it turns out it's a very good hack
00:24:49.240 | because what is RAG?
00:24:51.080 | RAG is you keep the model fixed
00:24:52.400 | and you just figure out a good way
00:24:53.320 | to like stuff stuff into the prompt of the language model.
00:24:56.440 | Everything that we're doing nowadays
00:24:58.000 | in terms of like stuffing stuff into the prompt
00:24:59.960 | is just algorithmic.
00:25:01.000 | We're just figuring out nice algorithms
00:25:02.720 | to like retrieve the right information with top-k similarity,
00:25:06.920 | do some sort of like hybrid search,
00:25:08.880 | some sort of like a chain of thought decomp,
00:25:10.600 | and then just like stuff stuff into the prompt.
00:25:12.520 | So it's all like algorithmic,
00:25:13.800 | and it's more like just software engineering
00:25:17.320 | to try to make the most out of these like existing APIs.
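
As a concrete picture of that loop, here is a bare-bones sketch: embed the query, fetch the top-k most similar chunks, and stuff them into the prompt. The `embed` and `complete` functions are hypothetical placeholders for an embedding model and an LLM call; a real system would use actual model APIs.

```python
# Minimal sketch of the RAG loop: top-k similarity retrieval followed by
# prompt stuffing. Placeholder functions stand in for real model calls.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder embedding derived from a hash; a real system calls an embedding model.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=32)

def complete(prompt: str) -> str:
    # Placeholder for an LLM completion call.
    return f"[LLM answer given a prompt of {len(prompt)} characters]"

def rag_answer(query: str, chunks: list, k: int = 3) -> str:
    chunk_vecs = np.stack([embed(c) for c in chunks])
    q = embed(query)
    scores = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    top = np.argsort(scores)[::-1][:k]
    context = "\n\n".join(chunks[i] for i in top)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return complete(prompt)

print(rag_answer("What is LlamaIndex?", ["doc chunk A", "doc chunk B", "doc chunk C"]))
```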
00:25:19.960 | The reason I say it's a hack
00:25:21.160 | is just like from a pure like optimization standpoint,
00:25:23.680 | if you think about this from like the machine learning lens,
00:25:25.640 | versus the software engineering lens,
00:25:27.280 | there's pieces in here
00:25:28.120 | that are gonna be like suboptimal, right?
00:25:29.600 | Like, obviously, like the thing about machine learning
00:25:32.160 | is when you optimize like some system
00:25:34.760 | that can be optimized within machine learning,
00:25:36.680 | like the set of parameters,
00:25:38.360 | you're really like changing like the entire system's weights
00:25:41.600 | to try to optimize the objective function.
00:25:44.160 | And if you just cobble a bunch of stuff together,
00:25:46.480 | you can't really optimize it; the pieces are inefficient, right?
00:25:49.360 | And so like a retrieval interface,
00:25:51.000 | like doing top-k embedding lookup,
00:25:52.920 | that part is inefficient,
00:25:54.800 | because there might be potentially a better,
00:25:56.880 | more learned retrieval algorithm that's better.
00:25:59.680 | If you kind of do stuff like some sort of,
00:26:04.240 | I know nowadays there's this concept
00:26:05.800 | of how do you do like short-term or long-term memory, right?
00:26:08.120 | Like represent stuff in some sort of vector embedding,
00:26:10.440 | do chunk sizes, all that stuff.
00:26:12.320 | It's all just like decisions that you make
00:26:14.040 | that aren't really optimized, right?
00:26:15.920 | It's more, and it's not really automatically learned,
00:26:18.520 | it's more just things that you set beforehand
00:26:20.440 | to actually feed into the system.
00:26:22.160 | There's a lot of room to actually optimize
00:26:24.760 | the performance of an entire LLM system,
00:26:27.680 | potentially in a more like machine learning base way, right?
00:26:30.080 | And I will leave room for that.
00:26:31.920 | And this is also why I think like in the long-term,
00:26:34.880 | like I do think fine tuning will probably have
00:26:37.240 | like greater importance and just like,
00:26:41.080 | there will probably be new architectures invented
00:26:43.920 | that where you can actually kind of like include
00:26:46.360 | a lot of this under the black box,
00:26:48.000 | as opposed to like hobbling together
00:26:50.200 | a bunch of components outside the black box.
00:26:52.040 | That said, just very practically,
00:26:54.280 | given the current state of things,
00:26:55.600 | like even if I said RAG is a hack,
00:26:57.160 | it's a very good hack
00:26:58.000 | and it's also very easy to use, right?
00:26:59.320 | And so just like for kind of like the AI engineer persona,
00:27:02.440 | that like, which to be fair is kind of one of the reasons
00:27:06.400 | generative AI has gotten so big,
00:27:08.040 | is because it's way more accessible for everybody
00:27:10.440 | to get into, as opposed to just like
00:27:12.520 | traditional machine learning.
00:27:14.640 | It tends to be good enough, right?
00:27:16.040 | And if we can basically provide these existing techniques
00:27:18.440 | to help people really optimize how to use existing systems
00:27:21.720 | without having to really deeply understand machine learning,
00:27:24.040 | I still think that's a huge value add.
00:27:25.680 | And so there's very much like a UX
00:27:27.560 | and ease of use problem here,
00:27:28.880 | which is just like RAG is way easier to onboard and use.
00:27:31.960 | And that's probably like the primary reason
00:27:34.120 | why everyone should do RAG
00:27:35.160 | instead of fine tuning to begin with.
00:27:37.000 | If you think about like the 80/20 rule,
00:27:38.520 | like RAG very much fits within that
00:27:39.880 | and fine tuning doesn't really right now.
00:27:41.640 | And then I'm just kind of like leaving room for the future
00:27:44.400 | that, you know, like in the end,
00:27:46.000 | fine tuning can probably take over some of the aspects
00:27:48.680 | of like what RAG does.
00:27:50.680 | - I don't know if this is mentioned in your recap there,
00:27:54.200 | but explainability also allows for sourcing.
00:27:58.280 | And like at the end of the day,
00:28:00.040 | like to increase trust, we have to source documents.
00:28:03.320 | - Yeah, so I think what RAG does
00:28:05.680 | is it increases like transparency,
00:28:07.560 | visibility into the actual documents
00:28:09.480 | that are getting fed into the context.
00:28:10.920 | - Here's where they got it from.
00:28:11.840 | - Exactly, and so that's definitely an advantage.
00:28:14.040 | I think the other piece that I think is an advantage,
00:28:15.840 | and I think that's something
00:28:16.680 | that someone actually brought up,
00:28:18.000 | is just you can do access control with RAG
00:28:21.760 | if you have an external source system.
00:28:23.320 | You can't really do that with large language models,
00:28:26.400 | where you'd have to gate information within the neural net weights,
00:28:28.880 | like depending on the type of user.
00:28:31.240 | For the first point, you could technically, right,
00:28:35.120 | you could technically have the language model,
00:28:37.800 | like if it memorized enough information,
00:28:39.680 | just like cite sources,
00:28:41.240 | but there's a question of just trust.
00:28:42.600 | Whether or not you're accurate.
00:28:44.160 | - Yeah, well, but like it makes it up right now
00:28:46.640 | 'cause it's like not good enough,
00:28:47.720 | but imagine a world where it is good enough
00:28:49.200 | and it does give accurate citations.
00:28:51.400 | - Yeah, no, I think to establish trust,
00:28:53.040 | you just need a direct connection.
00:28:54.840 | So it's kind of weird, it's this melding of,
00:28:58.280 | you know, deep learning systems
00:29:00.480 | versus very traditional information retrieval.
00:29:03.960 | - Yeah, exactly.
00:29:05.200 | So I think, I mean, I kind of think about it
00:29:06.760 | as analogous to like humans, right?
00:29:08.280 | Like we as humans, obviously we use the internet,
00:29:10.960 | we use tools.
00:29:11.960 | These tools have API interfaces are well-defined.
00:29:14.400 | And obviously we're not, like the tools aren't part of us.
00:29:17.560 | And so we're not like back propping
00:29:19.000 | or optimizing over these tools.
00:29:20.600 | And so kind of when you think about like RAG,
00:29:22.920 | it's basically LLM is learning how to use
00:29:26.120 | like a vector database to look up information
00:29:28.160 | that it doesn't know.
00:29:29.200 | And so then there's just a question of like
00:29:30.720 | how much information is inherent within the network itself
00:29:33.280 | and how much does it need to do some sort of like tool
00:29:35.280 | used to look up stuff that it doesn't know.
00:29:36.840 | And I do think there'll probably be more and more
00:29:38.840 | of that interplay as time goes on.
00:29:40.640 | - Yeah.
00:29:41.960 | Some follow-ups on discussions that we've had.
00:29:45.200 | So, you know, we discussed fine tuning a bit
00:29:47.880 | and what's your current take on whether you can fine tune
00:29:50.840 | new knowledge into LLMs?
00:29:52.440 | - That's one of those things
00:29:53.280 | where I think long-term you definitely can.
00:29:55.400 | I think some people say you can't, I disagree.
00:29:57.520 | I think you definitely can.
00:29:58.640 | Just right now I haven't gotten it to work yet.
00:30:00.000 | So, so I think like- - You've tried.
00:30:01.480 | - Yeah, well, not in a very principled way, right?
00:30:04.120 | Like this is something that requires
00:30:05.400 | like an actual research scientist
00:30:06.600 | and not someone that has like, you know,
00:30:07.780 | an hour or two per night to actually get this.
00:30:09.760 | - Like you were a research scientist at Uber.
00:30:11.400 | - Yeah, yeah, but it's like full-time, full-time work.
00:30:14.000 | So I think what I specifically concretely did
00:30:16.880 | was I took OpenAI's fine tuning endpoints
00:30:18.800 | and then tried to, you know,
00:30:20.160 | it's in like a chat message interface.
00:30:21.880 | And so there's like a user assistant message format.
00:30:24.440 | And so what I did was I tried to take just some piece
00:30:26.400 | of text and have the LLM memorize it
00:30:28.480 | by just asking it a bunch of questions about the text.
00:30:30.680 | So given a bunch of contexts,
00:30:31.920 | I would generate some questions
00:30:33.400 | and then generate some response
00:30:34.560 | and just fine tune over the question responses.
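
Here is a rough sketch of that experiment as described: turn a passage into synthetic question/answer pairs and write them out in the chat-style JSONL format used for fine-tuning. The `ask_llm` helper is a hypothetical stand-in for the calls that generate the questions and responses, and the output file name is made up for illustration.

```python
# Sketch of building a question/answer fine-tuning set from a passage,
# in chat-message JSONL format. ask_llm is a placeholder for real LLM calls.
import json

def ask_llm(prompt: str) -> str:
    # Placeholder for an actual LLM call.
    return "placeholder response"

def make_finetune_examples(passage: str, num_questions: int = 3):
    examples = []
    for i in range(num_questions):
        question = ask_llm(f"Write question #{i + 1} about this passage:\n{passage}")
        answer = ask_llm(f"Answer '{question}' using only this passage:\n{passage}")
        examples.append({
            "messages": [
                {"role": "user", "content": question},
                {"role": "assistant", "content": answer},
            ]
        })
    return examples

# Hypothetical output file; each line is one chat-format training example.
with open("memorization_finetune.jsonl", "w") as f:
    for ex in make_finetune_examples("Some passage the model should memorize."):
        f.write(json.dumps(ex) + "\n")
```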
00:30:37.200 | That hasn't really worked super well.
00:30:39.640 | But that's also because I'm just like trying
00:30:41.760 | to like use OpenAI's endpoints as is.
00:30:44.640 | If you just think about like traditional,
00:30:46.000 | like how you train a Transformers model,
00:30:48.640 | there's kind of like the instruction
00:30:51.120 | like fine tuning aspect, right?
00:30:52.640 | You like kind of ask it stuff
00:30:55.040 | and guide it with correct responses,
00:30:56.360 | but then there's also just like next token production.
00:30:58.940 | And that's something that you can't really do
00:31:01.360 | with the OpenAI API,
00:31:02.600 | but you can do with if you just trained it yourself.
00:31:04.880 | And that's probably possible
00:31:06.280 | if you just like train it over some corpus of data.
00:31:08.520 | I think Shishir from Berkeley said like,
00:31:10.360 | you know, when they trained Gorilla,
00:31:11.360 | they were like, oh, you know this,
00:31:12.640 | a lot of these LLMs are actually pretty good
00:31:14.280 | at memorizing information.
00:31:16.060 | Just the way the API interface is exposed
00:31:18.800 | is just no one knows how to use them right now, right?
00:31:21.160 | And so I think that's probably one of the issues.
00:31:23.560 | - Just to clue people in who haven't read the paper,
00:31:25.580 | Gorilla is the one where they train it to use specific APIs?
00:31:29.700 | - Yeah, yeah.
00:31:30.540 | And I think they also did something
00:31:31.680 | where like the model itself could learn to,
00:31:36.040 | yeah, I think this was on the Gorilla paper.
00:31:37.600 | Like the model itself could try to learn some prior
00:31:41.680 | over the data to decide like what tool to pick.
00:31:44.080 | But there's also, it's also augmented with retrieval
00:31:46.480 | that helps supplement it
00:31:47.640 | in case like the prior doesn't actually work.
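
As a loose illustration of that interplay (not the Gorilla paper's actual method), the sketch below retrieves candidate API descriptions by similarity and then lets a model-style scorer pick among them, so retrieval supplements whatever prior the model has. All the names and the `score_with_model` heuristic are made up for illustration.

```python
# Illustrative sketch: retrieval narrows the tool choices, then a model
# prior (here a crude word-overlap stand-in) picks the final tool.
import numpy as np

API_DOCS = {
    "weather_api": "get current weather for a city",
    "stock_api": "get the latest price for a stock ticker",
    "search_api": "search the web for a query string",
}

def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=16)

def retrieve_candidates(query: str, k: int = 2):
    # Retrieval step: surface the k API docs most similar to the query.
    names = list(API_DOCS)
    vecs = np.stack([embed(API_DOCS[n]) for n in names])
    scores = vecs @ embed(query)
    return [names[i] for i in np.argsort(scores)[::-1][:k]]

def score_with_model(query: str, name: str) -> float:
    # Stand-in for the model's learned prior over which tool fits the query.
    return float(len(set(query.lower().split()) & set(API_DOCS[name].split())))

def pick_tool(query: str) -> str:
    return max(retrieve_candidates(query), key=lambda n: score_with_model(query, n))

print(pick_tool("what is the latest price for this stock ticker"))
```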
00:31:51.640 | - Is that something that you'd be interested in supporting?
00:31:54.520 | - I mean, I think in the longterm,
00:31:55.680 | like if like this is kind of how fine-tuning like RAG
00:31:58.880 | evolves, like I do think there'll be some aspect
00:32:01.200 | where fine-tuning will probably memorize
00:32:02.600 | some high-level concepts of knowledge,
00:32:04.160 | but then like RAG will just be there to supplement
00:32:06.460 | like aspects that it doesn't know, yeah.
00:32:08.500 | - Yeah, awesome.
00:32:09.880 | - Obviously RAG is the default way.
00:32:11.680 | Like to be clear, RAG right now is the default way
00:32:13.360 | to actually augment stuff with knowledge.
00:32:15.600 | I think it's just an open question
00:32:16.720 | of how much the LLM can actually internalize
00:32:19.560 | both high-level concepts, but also details
00:32:22.280 | as you can like train stuff over it.
00:32:24.540 | And coming from an ML background,
00:32:26.680 | like there is a certain beauty in just baking everything
00:32:29.560 | into some training process of a language model.
00:32:33.120 | Like if you just take raw ChatGPT
00:32:35.800 | or ChatGPT Code Interpreter, right, like GPT-4,
00:32:38.460 | it's not like you do RAG with it.
00:32:40.340 | You just ask it questions about like,
00:32:42.020 | "Hey, how do I like define a pedantic model in Python?"
00:32:44.300 | And then like, "Can you give me an example?
00:32:45.660 | Can you visualize a graph?"
00:32:46.500 | It just does it, right?
00:32:47.820 | And it'll run it through code interpreter as a tool,
00:32:49.860 | but that's not like a source for knowledge.
00:32:50.980 | It's just an execution environment.
00:32:52.620 | And so there is some beauty in just like
00:32:54.420 | having the model itself, like just, you know,
00:32:56.980 | instead of you kind of defining the algorithm
00:32:58.700 | for what the data structure should look like,
00:33:00.260 | the model just learns it under the hood.
00:33:02.440 | That said, I think the reason it's not a thing right now
00:33:04.860 | is just like, no one knows how to do it.
00:33:06.660 | It probably costs too much money.
00:33:07.900 | And then also like the API interfaces
00:33:10.900 | and just like the actual like ability
00:33:14.100 | to kind of evaluate and improve on performance,
00:33:17.220 | like isn't known to most people.
00:33:18.960 | - Yeah.
00:33:20.420 | It also would be better with browsing.
00:33:22.180 | (laughs)
00:33:23.540 | - Yeah.
00:33:24.380 | - I wonder when they're going to put that back.
00:33:25.820 | - Okay.
00:33:26.660 | - Okay, cool.
00:33:28.860 | Yeah, so, and then one more follow-up
00:33:30.140 | before we go into RAG for AI engineers
00:33:32.100 | is on your brief mention about security or auth.
00:33:36.660 | And how many of the people that you talk to,
00:33:40.220 | you know, you talk to a lot of people
00:33:41.820 | putting LlamaIndex into production,
00:33:44.260 | how many people actually are there
00:33:46.300 | versus just like,
00:33:47.140 | let's just dump the whole company Notion into this thing.
00:33:49.500 | - Wait, you're talking about
00:33:50.340 | from like the security auth standpoint?
00:33:51.900 | - Yeah, how big a need is that?
00:33:54.140 | Because I talked to some people
00:33:56.300 | who are thinking about building tools in that domain,
00:33:59.540 | but I don't know if people want it.
00:34:01.180 | I mean, I think bigger companies,
00:34:02.380 | like just bigger companies, like banks, consulting firms,
00:34:05.900 | like they all want this.
00:34:07.020 | - Yes, it's a requirement, right?
00:34:08.900 | - The way they're using LlamaIndex
00:34:10.540 | is not with this, obviously,
00:34:12.540 | 'cause I don't think we have support
00:34:13.580 | for like access control or auth
00:34:15.060 | or that type of stuff like under the hood,
00:34:16.060 | 'cause we're more just like an orchestration framework.
00:34:18.580 | And so the way they do it,
00:34:19.740 | they build these initial apps
00:34:21.220 | is more kind of like prototype,
00:34:22.900 | like let's kind of, yeah,
00:34:24.060 | like, you know, use some publicly available data
00:34:25.940 | that's not super sensitive.
00:34:27.060 | Let's like, you know, assume that every user
00:34:28.700 | is going to be able to have access
00:34:29.660 | to the same amount of knowledge,
00:34:30.780 | those types of things.
00:34:32.180 | I think users have asked for it,
00:34:33.620 | but I don't think that's like a P zero.
00:34:35.300 | Like, I think the P zero is more on like,
00:34:37.460 | can we get this thing working
00:34:38.500 | before we expand this to like more users within the workforce?
00:34:41.340 | - Yep.
00:34:42.340 | - Cool.
00:34:43.420 | So there's a bunch of pieces to RAG, obviously.
00:34:46.580 | It's not just an acronym.
00:34:48.420 | And you tweeted recently,
00:34:49.820 | you think every AI engineer
00:34:51.540 | should build it from scratch at least once.
00:34:53.820 | Why is that?
00:34:56.020 | - I think so.
00:34:57.340 | I'm actually kind of curious
00:34:58.620 | to hear your thoughts about this,
00:35:00.100 | but this kind of relates to the initial
00:35:02.060 | like AI engineering posts that you put out.
00:35:04.380 | And then also just like the role of an AI engineer
00:35:06.940 | and the skills that they're going to have to learn
00:35:08.500 | to truly succeed.
00:35:10.140 | 'Cause there's an entire spectrum.
00:35:11.500 | On one end, you have people
00:35:12.540 | that don't really like understand the fundamentals
00:35:15.620 | and just want to use this
00:35:16.460 | to like cobble something together to build something.
00:35:19.100 | And I think there is a beauty in that for what it's worth.
00:35:20.780 | Like it's just one of those things.
00:35:21.820 | And Gen AI has made it
00:35:23.580 | so that you can just use these models
00:35:25.260 | in inference only mode,
00:35:26.260 | cobble something together,
00:35:27.220 | use it to power your app experiences.
00:35:28.940 | On the other end, what we're increasingly seeing
00:35:32.020 | is that like more and more developers
00:35:33.900 | building with these apps
00:35:34.940 | start running into honestly like pretty similar issues
00:35:37.900 | that would plague just a standard ML engineer
00:35:40.380 | building like a classifier model,
00:35:41.540 | which is just like accuracy problems,
00:35:43.460 | like and hallucinations,
00:35:44.460 | basically just an accuracy problem, right?
00:35:45.740 | Like it's not giving you the right results.
00:35:47.020 | So what do you do?
00:35:47.860 | You have to iterate on the model itself.
00:35:49.860 | You have to figure out what parameters you tweak.
00:35:51.420 | You have to gain some intuition about this entire process.
00:35:53.980 | That workflow is pretty similar, honestly,
00:35:56.500 | like even if you're not training the model
00:35:58.020 | to just like tuning an ML model with like hyperparameters
00:36:00.940 | and learning like proper ML practices of like,
00:36:03.860 | okay, how do I like define a good evaluation benchmark?
00:36:06.940 | How do I define like the right set of metrics to use, right?
00:36:09.340 | How do I actually iterate
00:36:10.540 | and improve the performance of this pipeline for production?
00:36:13.260 | What tools do I use, right?
00:36:14.420 | Like every ML engineer uses like some form of Weights
00:36:16.900 | & Biases, TensorBoard,
00:36:18.020 | or like some other experimentation tracking tool.
00:36:20.500 | Like what tools should I use
00:36:22.620 | to actually help build like LLM applications
00:36:24.780 | and optimize it for production?
00:36:26.420 | There's like a certain amount of just like LLM ops,
00:36:29.260 | like tooling and concepts and just like practices
00:36:31.900 | that people will kind of have to internalize
00:36:33.220 | if they want to optimize these.
00:36:34.340 | And so I think that the reason I think like being able
00:36:37.580 | to build like RAG from scratch is important
00:36:39.860 | is it really gives you a sense of like how things are working
00:36:42.660 | to help you build intuition
00:36:44.060 | about like what parameters are within a RAG system
00:36:46.780 | and which ones actually tweak to make them better.
00:36:48.540 | One of the advantages of LlamaIndex,
00:36:50.380 | the LlamaIndex quickstart is it's three lines of code.
00:36:53.820 | The downside of that is you have zero visibility
00:36:56.180 | into what's actually going on under the hood.
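
For readers who haven't seen it, the quickstart being referred to really is only a few lines. This is roughly what it looked like at the time; exact import paths and method names vary across llama_index versions and the directory path here is a placeholder, so treat it as a sketch rather than the canonical API.

```python
# A rough sketch of the LlamaIndex "few lines of code" quickstart of this era.
# Import paths and method names shift between versions; illustrative only.
from llama_index import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()   # load files from ./data
index = VectorStoreIndex.from_documents(documents)      # chunk, embed, and index
response = index.as_query_engine().query("What does this corpus say about X?")
print(response)
```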
00:36:57.820 | And I think this is something
00:36:58.700 | that we've kind of been thinking about for a while.
00:37:00.300 | And I'm like, okay, let's just release
00:37:01.580 | like a new tutorial series.
00:37:02.740 | That's just like, no three lines of code.
00:37:05.180 | We're just gonna go in and actually show you
00:37:06.420 | how the thing actually works under the hood, right?
00:37:08.300 | And so like, does everybody need this?
00:37:10.780 | Like probably not.
00:37:11.980 | Like as for some people, the three lines of code might work.
00:37:14.940 | But I think increasingly, like honestly,
00:37:17.540 | 90% of the users I talk to have questions
00:37:19.580 | about how to improve the performance of their app.
00:37:21.100 | And so just like, given this is just like one of those things
00:37:23.220 | that's like better for the understanding.
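
To make the "build it from scratch" point concrete, here is a minimal RAG pipeline written with no framework at all, so the moving parts (chunking, embedding, top-k cosine retrieval, prompt stuffing) are visible. It assumes the pre-1.0 openai Python SDK; the chunk size, k, model names, and prompt wording are arbitrary placeholder choices, not recommendations.

```python
# Minimal RAG from scratch: chunk -> embed -> top-k cosine retrieval -> prompt.
# Assumes the pre-1.0 openai SDK; swap in whatever embedding/LLM client you use.
import numpy as np
import openai

def chunk(text: str, size: int = 512) -> list[str]:
    # naive fixed-size character chunking; a real parser would respect structure
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(texts: list[str]) -> np.ndarray:
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([d["embedding"] for d in resp["data"]])

def answer(question: str, corpus: str, k: int = 3) -> str:
    chunks = chunk(corpus)
    doc_emb = embed(chunks)                          # (num_chunks, dim)
    q_emb = embed([question])[0]                     # (dim,)
    sims = doc_emb @ q_emb / (
        np.linalg.norm(doc_emb, axis=1) * np.linalg.norm(q_emb))
    top = [chunks[i] for i in np.argsort(sims)[::-1][:k]]
    prompt = ("Answer using only this context:\n" + "\n---\n".join(top)
              + f"\n\nQuestion: {question}")
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo", messages=[{"role": "user", "content": prompt}])
    return resp["choices"][0]["message"]["content"]
```

Every parameter in that sketch (chunk size, k, the embedding model, the prompt) is one of the knobs discussed above.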
00:37:24.860 | - Yeah, I'd say it is one of the most useful tools
00:37:28.740 | of any sort of developer education toolkit
00:37:31.820 | to write things yourself from scratch.
00:37:35.460 | So Kelsey Hightower famously wrote
00:37:38.100 | Kubernetes the hard way, which is don't use Kubernetes.
00:37:40.980 | Just like do everything.
00:37:42.180 | Here's everything that you would have to do by yourself.
00:37:44.940 | And you should be able to put all these things together
00:37:47.220 | yourself to understand the value of Kubernetes.
00:37:50.300 | And the same thing for LlamaIndex.
00:37:51.620 | I've done, I was the guy who did the same for React.
00:37:54.820 | And yeah, it's pretty, well, it's pretty good exercise
00:37:57.780 | for you to just fully understand everything
00:37:59.180 | that's going on under the hood.
00:38:00.780 | And I was actually gonna suggest,
00:38:03.180 | well, in one of the previous conversations,
00:38:05.700 | you know, there's all these like hyperparameters,
00:38:07.260 | like the size of the chunks and all that.
00:38:09.420 | And I was thinking like, you know,
00:38:12.860 | what would hyperparameter optimization for RAG look like?
00:38:17.860 | - Yeah, definitely.
00:38:18.780 | I mean, so absolutely.
00:38:20.060 | I think that's gonna be an increasing thing.
00:38:22.060 | I think that's something we're kind of looking at.
00:38:23.380 | - I think someone should just put,
00:38:24.780 | do like some large scale study
00:38:26.420 | and then just ablate everything.
00:38:27.980 | And just, you tell us.
00:38:29.380 | - I think it's gonna be hard to find a universal default
00:38:33.580 | that works for everybody.
00:38:34.580 | I think it's gonna be somewhat-
00:38:35.420 | - Are you telling me it depends?
00:38:36.260 | - Boo!
00:38:37.100 | - I do think it's gonna be somewhat dependent
00:38:41.860 | on the data and use case.
00:38:42.860 | I think if there was a universal default,
00:38:44.460 | that'd be amazing.
00:38:45.500 | But I think increasingly we found, you know,
00:38:47.380 | people are just defining their own like custom parsers
00:38:50.100 | for like PDFs, markdown files for like, you know,
00:38:52.580 | SEC filings versus like, you know, Slack conversations.
00:38:56.220 | And then like the use case too,
00:38:57.940 | like, do you want like a summarization,
00:38:59.660 | like the granularity of the response?
00:39:01.180 | Like it really affects the parameters that you wanna pick.
00:39:03.420 | And so I do like the idea
00:39:05.620 | of hyperparameter optimization though.
00:39:06.900 | But it's kind of like one of those things
00:39:07.860 | where you are kind of like training the model basically,
00:39:10.820 | kind of on your own data domain.
00:39:12.580 | - Yeah.
00:39:13.540 | You mentioned custom parsers.
00:39:14.940 | You've designed LlamaIndex.
00:39:16.300 | Maybe we can talk about like the surface area
00:39:17.900 | of the framework.
00:39:19.140 | You designed LlamaIndex in a way that it's more modular.
00:39:21.660 | Yeah, like you mentioned.
00:39:23.340 | How would you describe the different components
00:39:26.420 | and what's customizable in each?
00:39:29.260 | - Yeah, I think they're all customizable.
00:39:30.740 | And I think that there is a certain burden on us
00:39:33.580 | to make that more clear through the docs.
00:39:35.860 | - Well, number four is customization tutorials.
00:39:38.020 | - Yeah, yeah.
00:39:38.860 | But I think like just in general,
00:39:40.100 | I think we do try to make it so that
00:39:42.180 | you can plug in the out of the box stuff.
00:39:43.780 | But like if you want to kind of customize
00:39:47.340 | more lower level components,
00:39:48.620 | like we definitely encourage you to do that
00:39:50.380 | and plug it into the rest of our abstractions.
00:39:52.020 | So let me just walk through
00:39:52.860 | like maybe some of the basic components of LlamaIndex.
00:39:54.500 | There's data loaders.
00:39:55.380 | You can load data from different data sources.
00:39:57.100 | We have LlamaHub, which you guys brought up,
00:39:58.660 | which is a collection of different data loaders
00:40:01.420 | of like unstructured and structured data,
00:40:04.180 | like PDFs, file types, like Slack, Notion, all that stuff.
00:40:08.420 | Now you load in this data.
00:40:10.380 | We have a bunch of like parsers and transformers.
00:40:12.500 | You can split the text.
00:40:13.420 | You can add metadata to the text
00:40:15.260 | and then basically figure out a way to load it
00:40:17.460 | into like a vector store.
00:40:19.220 | So, I mean, you worked at like Airbyte, right?
00:40:20.940 | It's kind of like there is some aspect like E and T, right?
00:40:23.380 | And in terms of like transforming this data.
00:40:25.500 | And then the L, right?
00:40:26.900 | Loading it into some storage abstraction,
00:40:28.420 | we have like a bunch of integrations
00:40:29.740 | with different document storage systems.
00:40:31.700 | So that's data.
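
A hedged sketch of that extract/transform/load flow in LlamaIndex terms, using the abstractions as they existed around this time (class names and import paths have since shifted, and the directory path and metadata key are placeholders): a loader extracts documents, a node parser chunks and attaches metadata, and the nodes are loaded into a vector index.

```python
# Sketch of the E/T/L flow described above; API names reflect llama_index
# releases of this era and may differ in current versions.
from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.node_parser import SimpleNodeParser

documents = SimpleDirectoryReader("./filings").load_data()       # E: extract

parser = SimpleNodeParser.from_defaults(chunk_size=512, chunk_overlap=64)
nodes = parser.get_nodes_from_documents(documents)               # T: split into nodes
for node in nodes:
    node.metadata["doc_type"] = "10-K"   # metadata that will bias retrieval later

index = VectorStoreIndex(nodes)                                  # L: embed and store
```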
00:40:34.060 | And then the second piece really is about like,
00:40:36.660 | how do you retrieve this data?
00:40:38.940 | How do you like synthesize this data?
00:40:41.020 | And how do you like do some sort of
00:40:42.380 | higher level reasoning over this data?
00:40:44.220 | So retrieval is one of the core abstractions that we have.
00:40:46.900 | We do encourage people to like customize,
00:40:48.580 | find your own retrievers.
00:40:49.940 | That's why we have that section on kind of like
00:40:51.460 | how do you define your own like customer retriever,
00:40:53.140 | but also we have like out of the box ones.
00:40:55.300 | The retrieval algorithm kind of depends
00:40:58.100 | on how you structure the data, obviously.
00:40:59.700 | Like if you just flat index everything
00:41:01.300 | with like chunks with like embeddings,
00:41:03.140 | then you can really only do like top K like lookup
00:41:05.980 | plus maybe like keyword search or something.
00:41:08.540 | But if you can index it in some sort of like hierarchy,
00:41:10.660 | like defined relationships,
00:41:11.660 | you can do more interesting things,
00:41:12.780 | like actually traverse relationships between nodes.
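
A sketch of what "define your own retriever" tends to look like: subclass the framework's retriever base class and implement the retrieve hook. The base-class import path and the attribute names on the returned nodes are recalled from that era's API and may differ by version, and the fusion logic is purely illustrative.

```python
# Illustrative custom retriever: fuse a vector retriever and a keyword retriever.
# Base class location and node attributes vary across llama_index versions.
from llama_index.retrievers import BaseRetriever

class FusionRetriever(BaseRetriever):
    def __init__(self, vector_retriever, keyword_retriever):
        self._vector = vector_retriever
        self._keyword = keyword_retriever
        super().__init__()

    def _retrieve(self, query_bundle):
        # run both retrievers, dedupe by node id, keep the best score per node
        results = (self._vector.retrieve(query_bundle)
                   + self._keyword.retrieve(query_bundle))
        best = {}
        for nws in results:
            node_id = nws.node.node_id
            if node_id not in best or (nws.score or 0) > (best[node_id].score or 0):
                best[node_id] = nws
        return sorted(best.values(), key=lambda n: n.score or 0, reverse=True)
```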
00:41:16.140 | Then after you have this data,
00:41:17.580 | how do you like synthesize the data, right?
00:41:19.580 | And this is the part where you feed it
00:41:21.060 | into the language model.
00:41:22.700 | There's some response abstraction that can abstract away
00:41:25.260 | over like long context to actually still give you a response
00:41:28.100 | even if the context overflows the context window.
00:41:30.420 | And then there's kind of these like higher level
00:41:32.340 | like reasoning primitives that I'm gonna define broadly.
00:41:35.820 | And I'm just gonna call them in some general bucket
00:41:38.220 | of like agents,
00:41:39.340 | even though everybody has different definitions of agents.
00:41:41.820 | And agents-
00:41:42.700 | - But you're the first to data agents,
00:41:44.300 | which I was very excited about.
00:41:45.300 | - Yeah, we kind of like coined that term.
00:41:47.060 | And the way we thought about it was,
00:41:49.020 | we wanted to think about how to use agents
00:41:51.060 | for like data workflows basically.
00:41:53.140 | And so what are the reasoning primitives
00:41:54.700 | that you wanna do?
00:41:55.580 | So the most simple reasoning primitive you can do
00:41:57.260 | is some sort of routing module.
00:41:58.540 | Like you can just, it's a classifier.
00:42:00.500 | Like given a query,
00:42:01.620 | just make some automated decision
00:42:02.980 | on what choice to pick, right?
00:42:04.940 | You could use LLMs.
00:42:05.780 | You don't have to use LLMs.
00:42:06.740 | You could just train a classifier basically.
00:42:08.660 | That's something that we might actually explore.
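
A routing module really can be this small. Below is a framework-free sketch of the idea: one LLM call that picks among a few named query engines given the user's question, using the pre-1.0 openai SDK. The choice names, prompt wording, and fallback are placeholders; as noted above, a trained classifier could replace the LLM call entirely.

```python
# Minimal LLM-based router: classify a query into one of several tools/indexes.
# Uses the pre-1.0 openai SDK; choices and fallback behavior are illustrative.
import openai

CHOICES = {
    "summary": "use for questions that need a holistic summary of a document",
    "lookup": "use for questions about a specific fact or passage",
}

def route(query: str) -> str:
    menu = "\n".join(f"- {name}: {desc}" for name, desc in CHOICES.items())
    prompt = (f"Pick exactly one option for this query.\nOptions:\n{menu}\n"
              f"Query: {query}\nAnswer with just the option name.")
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo", messages=[{"role": "user", "content": prompt}])
    choice = resp["choices"][0]["message"]["content"].strip().lower()
    return choice if choice in CHOICES else "lookup"   # fall back to a default
```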
00:42:10.820 | And then the next piece is,
00:42:12.780 | okay, what are some higher level things?
00:42:14.620 | You can have the LLM like define like a query plan, right?
00:42:17.820 | To actually execute over the data.
00:42:19.940 | You can do some sort of while loop, right?
00:42:21.740 | That's basically what an agent loop is,
00:42:23.260 | which is like ReAct, tree of thoughts,
00:42:26.540 | like chain of thought,
00:42:27.380 | like the OpenAI function calling like while loop
00:42:30.140 | to try to like take a question
00:42:31.460 | and try to break it down into some series of steps
00:42:34.340 | to actually try to execute to get back a response.
00:42:36.660 | And so there's a range in complexity
00:42:38.340 | from like simple reasoning primitives to more advanced ones.
00:42:40.620 | And I think that's the way we kind of think about it
00:42:42.620 | is like which ones should we implement
00:42:44.980 | and how do they work well?
00:42:45.980 | Like, do they work well over like the types of like data
00:42:48.500 | tasks that we give them?
00:42:49.460 | - How do you think about optimizing each piece?
00:42:51.940 | So take embedding models as one piece of it.
00:42:55.580 | You offer fine tuning embedding models.
00:42:58.620 | And I saw it was like fine tuning
00:43:00.100 | gives you like 5, 10% increase.
00:43:02.180 | What's kind of like the Delta left on the embedding side?
00:43:05.900 | Do you think we can get models that are like a lot better?
00:43:08.220 | Do you think like that's one piece
00:43:09.620 | where people should really not spend too much time?
00:43:13.300 | - I mean, I think they should.
00:43:14.860 | I just think it's not the only parameter
00:43:17.020 | 'cause I think in the end,
00:43:18.340 | if you think about everything that goes into retrieval,
00:43:21.900 | the chunking algorithm,
00:43:23.180 | how you define like metadata, right?
00:43:26.300 | We'll bias your embedding representations.
00:43:28.180 | Then there's the actual embedding model itself,
00:43:30.020 | which is something that you can try optimizing.
00:43:31.900 | And then there's like the retrieval algorithm.
00:43:33.420 | Are you gonna just do top K?
00:43:34.500 | Are you gonna do like hybrid search?
00:43:35.620 | Are you gonna do auto retrieval?
00:43:36.660 | Like there's a bunch of parameters.
00:43:37.900 | And so I do think it's something everybody should try.
00:43:40.900 | I think by default, we use like OpenAI's embedding model.
00:43:44.740 | A lot of people these days use like sentence transformers
00:43:46.940 | because it's just like free open source
00:43:48.780 | and you can actually optimize, directly optimize it.
00:43:51.340 | This is an active area of exploration.
00:43:54.540 | I do think one of our goals is
00:43:56.420 | it should ideally be relatively free for every developer
00:44:00.580 | to just run some fine tuning process over their data
00:44:03.060 | to squeeze out some more points and performance.
00:44:04.860 | And if it's that relatively free and there's no downsides,
00:44:06.980 | everybody should basically do it.
00:44:08.900 | There's just some complexities
00:44:10.460 | in terms of optimizing your embedding model,
00:44:12.220 | especially in a production grade data pipeline.
00:44:14.260 | If you actually fine tune the embedding model
00:44:17.220 | and the embedding space changes,
00:44:18.380 | you're gonna have to re-index all your documents.
00:44:20.300 | And for a lot of people, that's not feasible.
00:44:22.300 | And so I think like Joe from Vespa on our webinars,
00:44:25.460 | there's this idea that depending on kind of like,
00:44:29.060 | if you're just using like document and query embeddings,
00:44:32.220 | you could keep the document embeddings frozen
00:44:34.700 | and just train a linear transform on the query
00:44:36.660 | or any sort of transform on the query, right?
00:44:38.780 | So therefore it's just a query side transformation
00:44:40.900 | instead of actually having to re-index
00:44:42.540 | all the document embeddings.
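
A sketch of that query-side-only idea: keep the document embeddings frozen and fit a linear transform that maps query embeddings toward the document embeddings they should have retrieved. Here the fit is a plain least-squares solve over paired (query, relevant-document) embeddings; this is one simple way to do it under stated assumptions, not the specific method discussed on the webinar.

```python
# Fit a linear map W on query embeddings only; document embeddings stay frozen,
# so nothing has to be re-indexed. Q and D are paired (query, relevant doc) rows.
import numpy as np

def fit_query_transform(Q: np.ndarray, D: np.ndarray) -> np.ndarray:
    # least-squares solution to Q @ W ~= D, shape (dim, dim)
    W, *_ = np.linalg.lstsq(Q, D, rcond=None)
    return W

def transform_query(q: np.ndarray, W: np.ndarray) -> np.ndarray:
    return q @ W   # apply at query time, then search the frozen doc index as usual

# toy usage with random vectors standing in for real embeddings
Q = np.random.randn(200, 768)
D = np.random.randn(200, 768)
W = fit_query_transform(Q, D)
q_new = transform_query(np.random.randn(768), W)
```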
00:44:44.300 | The other piece is- - Wow, that's pretty smart.
00:44:46.100 | - Yeah, yeah, so I think we weren't able
00:44:48.940 | to get like huge performance gains there,
00:44:50.340 | but it does like improve performance a little bit.
00:44:52.260 | And that's something that basically,
00:44:53.900 | everybody should be able to kick off.
00:44:55.020 | You can actually do that on LlamaIndex too.
00:44:56.660 | - OpenAI has a cookbook on adding bias
00:44:59.060 | to the embeddings too, right?
00:45:00.940 | - Yeah, yeah, I think so.
00:45:02.540 | Yeah, there's just like different parameters
00:45:03.940 | that you can try adding
00:45:05.380 | to try to like optimize the retrieval process.
00:45:07.580 | And the idea is just like, okay, by default,
00:45:10.620 | you have all this text,
00:45:11.660 | it kind of lives in some latent space, right?
00:45:15.380 | - Shout out, shout out Latent Space.
00:45:17.220 | You should take a drink every time.
00:45:18.460 | - Yeah, but it lives in some latent space.
00:45:22.500 | But like depending on the specific types of questions
00:45:25.420 | that the user might wanna ask,
00:45:26.860 | the latent space might not be optimized, right?
00:45:29.180 | For actual, like to actually retrieve
00:45:32.980 | the relevant piece of context that the user wanna ask.
00:45:34.740 | So can you shift the embedding points a little bit, right?
00:45:37.220 | And how do we do that basically?
00:45:38.420 | That's really a key question here.
00:45:39.900 | So optimizing the embedding model,
00:45:41.820 | even changing the way you like chunk things,
00:45:43.500 | these all shift the embeddings.
00:45:44.620 | - So the retrieval is interesting.
00:45:46.340 | I got a bunch of startup pitches that are like,
00:45:48.580 | like RAG is cool, but like there's a lot of stuff
00:45:52.180 | in terms of ranking that could be better.
00:45:54.300 | There's a lot of stuff in terms of sunsetting data
00:45:57.980 | once it starts to become stale, that could be better.
00:46:00.620 | Are you gonna move into that part too?
00:46:03.740 | So like you have SEC Insights as one of kind of like
00:46:06.260 | your demos and that's like a great example of,
00:46:08.860 | hey, I don't wanna embed all the historical documents
00:46:12.020 | because a lot of them are outdated
00:46:13.500 | and I don't want them to be in the context.
00:46:15.820 | What's that problem space like?
00:46:17.260 | How much of it are you gonna also help with
00:46:19.980 | and versus how much you expect others to take care of?
00:46:23.220 | - Yeah, I'm happy to talk about SEC Insights in just a bit.
00:46:25.660 | I think more broadly about the like overall retrieval space,
00:46:28.260 | we're very interested in it because a lot of these
00:46:29.940 | are very practical problems that people have asked us.
00:46:31.940 | So the idea of outdated data,
00:46:33.300 | I think how do you like deprecate or time wait data
00:46:36.180 | and do that in a reliable manner, I guess,
00:46:38.580 | so you don't just like kind of set some parameter
00:46:40.460 | and all of a sudden that affects
00:46:41.620 | all your retrieval algorithms is pretty important
00:46:43.740 | because people have started bringing that up.
00:46:45.500 | Like I have a bunch of duplicate documents,
00:46:46.940 | things get out of date, how do I like sunset documents?
00:46:49.220 | And then ranking, right?
00:46:50.700 | Yeah, so I think this space is not new.
00:46:53.580 | I think like rather than inventing
00:46:56.180 | like new retriever techniques for the sake of like
00:46:58.380 | just inventing better ranking,
00:47:00.540 | we wanna take existing ranking techniques
00:47:02.700 | and kind of like package it in a way
00:47:04.180 | that's like intuitive and easy for people to understand.
00:47:06.900 | That said, I think there are interesting
00:47:09.980 | and new retrieval techniques that are kind of in place
00:47:13.660 | that can be done with when you tie it
00:47:16.180 | into some downstream RAG system.
00:47:18.220 | I mean, like the reason for this is just like,
00:47:20.140 | if you think about how like the idea of like chunking text,
00:47:24.540 | right, like that really, that just really wasn't a thing
00:47:28.780 | or at least for this specific purpose of like,
00:47:31.500 | like the reason chunking is a thing in RAG right now
00:47:33.660 | is because like you wanna fit
00:47:34.900 | within the context of an LLM, right?
00:47:37.020 | Like why do you wanna chunk a document?
00:47:38.220 | That just was less of a thing, I think back then.
00:47:40.340 | If you wanted to like transform a document,
00:47:42.900 | it was more for like structured data extraction
00:47:44.380 | or something in the past.
00:47:45.540 | And so there's kind of like certain new concepts
00:47:47.540 | that you gotta play with that you can use to invent
00:47:50.740 | kind of more interesting retrieval techniques.
00:47:52.740 | Another example here is actually LLM based reasoning,
00:47:55.700 | like LLM based chain of thought reasoning.
00:47:57.940 | You can take a question,
00:47:59.020 | break it down into smaller components
00:48:00.700 | and use that to actually send to your retrieval system.
00:48:03.740 | And that gives you better results
00:48:04.900 | than kind of like sending the full question
00:48:06.500 | to a retrieval system.
00:48:07.740 | That also wasn't really a thing back then,
00:48:09.500 | but then you can kind of figure out an interesting way
00:48:11.140 | of like blending old and the new, right,
00:48:13.060 | with LLMs and data.
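
To make the decomposition idea concrete, here is a hedged sketch: ask the LLM to split the question into simpler sub-questions, retrieve for each one, and pool the contexts for the final synthesis step. It uses the pre-1.0 openai SDK, and the retriever is taken as a plain callable since the retrieval backend is whatever you already have; prompt wording and limits are placeholders.

```python
# Break a complex question into sub-questions, retrieve per sub-question,
# and pool the contexts. The retriever is passed in as a callable.
import openai

def decompose(question: str, n: int = 3) -> list[str]:
    prompt = (f"Break this question into at most {n} simpler sub-questions, "
              f"one per line, no numbering:\n{question}")
    resp = openai.ChatCompletion.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}])
    lines = resp["choices"][0]["message"]["content"].splitlines()
    return [line.strip() for line in lines if line.strip()]

def retrieve_with_decomposition(question, retrieve_fn, k_per_subquestion: int = 2):
    contexts = []
    for sub_question in decompose(question):
        # retrieve_fn(query, k) -> list of context strings, however you implement it
        contexts.extend(retrieve_fn(sub_question, k_per_subquestion))
    return contexts   # feed these into your synthesis prompt as usual
```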
00:48:14.380 | - Yeah.
00:48:16.740 | There's a lot of ideas that you come across.
00:48:19.980 | Do you have a store of them?
00:48:22.580 | - So, okay, I think that the, yeah,
00:48:25.780 | I think sometimes I get like inspiration.
00:48:27.460 | There's like some problem statement
00:48:28.620 | and I'm just like, oh, let's hack this out.
00:48:29.900 | - Following you is very hard
00:48:30.740 | because it's just a lot of homework.
00:48:32.540 | - So I think I've started to like step on the brakes
00:48:37.540 | just a little bit.
00:48:38.460 | - No, no, no, keep going, keep going.
00:48:39.540 | - No, no, no.
00:48:40.380 | Well, the reason is just like, okay,
00:48:41.460 | if I just have, invent like a hundred
00:48:42.940 | more retrieval techniques, like sure,
00:48:44.940 | but like how do people know which one is good
00:48:46.780 | and which one's like bad, right?
00:48:47.860 | And so-
00:48:48.700 | - Have a librarian, right?
00:48:49.540 | Like it's gonna catalog it and go-
00:48:51.180 | - You're gonna need some like benchmarks.
00:48:52.580 | And so I think that's probably the focus
00:48:54.380 | for the next few weeks is actually like properly
00:48:56.780 | kind of like having an understanding of like,
00:48:58.380 | oh, you know, when should you do this?
00:48:59.660 | Or like, does this actually work well?
00:49:01.260 | - Yeah, some kind of like maybe like a flow chart,
00:49:03.820 | decision tree type of thing.
00:49:05.180 | - Yeah, exactly.
00:49:06.020 | - When this, do that, you know, something like that
00:49:07.340 | that would be really helpful for me.
00:49:08.740 | Thank you.
00:49:09.580 | (both laughing)
00:49:11.740 | Do you want to talk about SEC Insights?
00:49:13.620 | - Sure, yeah.
00:49:15.580 | - You had a question.
00:49:17.020 | - Yeah, yeah, just, I mean, that's kind of like a good-
00:49:19.780 | - It seems like your most successful side project.
00:49:22.460 | - Yeah, okay.
00:49:23.460 | So what is SEC Insights for our listeners?
00:49:26.940 | So SEC Insights is a full stack LLM chatbot application
00:49:31.660 | that does analysis over your SEC 10-K and 10-Q filings,
00:49:36.660 | I think.
00:49:37.500 | And so the goal for building this project
00:49:40.340 | is really twofold.
00:49:41.700 | The reason we started building this was one,
00:49:44.180 | it was a great way to dogfood
00:49:45.780 | the production readiness for our library.
00:49:47.820 | We actually ended up like adding a bunch of stuff
00:49:50.180 | and fixing a ton of bugs because of this.
00:49:51.900 | And I think it was great because like, you know,
00:49:53.900 | thinking about how we handle like callbacks, streaming,
00:49:57.820 | actually generating like reliable sub-responses
00:50:00.180 | and bubbling up sources citations.
00:50:01.900 | These are all things that like, you know,
00:50:03.740 | if you're just building the library in isolation,
00:50:05.420 | you don't really think about it.
00:50:06.260 | But if you're trying to tie this
00:50:07.140 | into a downstream application,
00:50:08.220 | like it really starts mattering.
00:50:09.580 | - Is this for your error messages?
00:50:10.940 | What do you mean?
00:50:11.780 | You talk about bubbling up stuff.
00:50:12.620 | For observability. - So like sources.
00:50:13.860 | Like if you go into SEC Insights and you type something,
00:50:16.180 | you can actually see the highlights in the right side.
00:50:18.180 | And so like, yeah, that was something
00:50:20.340 | that like took a little bit of like understanding
00:50:22.740 | to figure out how to build well.
00:50:23.820 | And so it was great for dogfooding improvement
00:50:25.620 | of the library itself.
00:50:26.700 | And then as we're building the app,
00:50:28.260 | the second thing was we're starting to talk to users
00:50:30.180 | and just like trying to showcase
00:50:32.140 | like kind of bigger companies,
00:50:33.820 | like the potential of LlamaIndex as a framework.
00:50:36.740 | Because these days, obviously building a chatbot, right,
00:50:39.340 | with Streamlit or something,
00:50:40.260 | it'll take you like 30 minutes or an hour.
00:50:41.740 | Like there's plenty of templates out there
00:50:43.060 | on LlamaIndex, LangChain,
00:50:44.100 | like you can just build a chatbot.
00:50:45.580 | But how do you build something that kind of like satisfies
00:50:48.020 | some of this like criteria of surfacing like citations,
00:50:51.580 | being transparent, seeing like having a good UX,
00:50:54.220 | and then also being able to handle
00:50:55.420 | different types of questions, right?
00:50:56.580 | Like more complex questions
00:50:57.780 | that compare different documents.
00:50:59.460 | That's something that I think people
00:51:00.580 | are still trying to explore.
00:51:01.420 | And so what we did was like,
00:51:03.100 | we showed both like, well, first like organizations
00:51:07.220 | and possibilities of like what you can do
00:51:08.780 | when you actually build something like this.
00:51:10.420 | And then after like, you know,
00:51:11.740 | we kind of like stealth launched this for fun,
00:51:14.180 | just as a separate project,
00:51:15.500 | just to see if we could get feedback from users
00:51:17.180 | who are using this world to see like, you know,
00:51:18.620 | how we can improve stuff.
00:51:19.940 | And then we thought like, ah, you know,
00:51:22.660 | we built this, right?
00:51:23.780 | Obviously, we're not gonna sell like a financial app,
00:51:26.260 | like that's not really in our wheelhouse,
00:51:28.660 | but we're just gonna open source the entire thing.
00:51:30.100 | And so that now is basically just like a really nice,
00:51:33.060 | like full stack app template you can use
00:51:34.780 | and customize on your own, right?
00:51:35.940 | To build your own chatbot,
00:51:37.220 | whether it is over like financial documents
00:51:38.900 | or over like other types of documents.
00:51:40.620 | And it provides like a nice template
00:51:41.980 | for basically anybody to kind of like go in
00:51:43.500 | and get started.
00:51:45.180 | There's certain components though,
00:51:46.540 | that like aren't released yet that we're going to,
00:51:49.060 | in the next few weeks.
00:51:51.740 | Like one is just like kind of more detailed guides
00:51:54.220 | on like different modular components within it.
00:51:56.300 | So if you're like a full stack developer,
00:51:57.940 | you can go in and actually take the pieces that you want
00:52:00.220 | and actually kind of build your own custom flows.
00:52:02.260 | The second piece is like,
00:52:03.660 | take there's like certain components in there
00:52:05.500 | that might not be directly related to the LLM app
00:52:07.620 | that would be nice to just like have people use.
00:52:10.220 | An example is the PDF viewer,
00:52:12.020 | like the PDF viewer with like citations.
00:52:14.540 | I think we're just gonna give that, right?
00:52:15.820 | So, you know, you could be using any library you want,
00:52:17.900 | but then you can just, you know,
00:52:19.060 | just drop in a PDF viewer, right?
00:52:20.620 | So that it's just like a fun little module
00:52:22.460 | that you can view.
00:52:23.300 | - Nice, nice.
00:52:24.300 | Yeah, that's a really good community service right there.
00:52:27.300 | Well, so I want to talk a little bit
00:52:28.980 | about your cloud offering.
00:52:31.020 | 'Cause you mentioned, I forget the name
00:52:32.660 | that you had for it, enterprise something.
00:52:35.300 | - Well, one, we haven't come up with a name.
00:52:37.060 | We're kind of calling it LlamaIndex platform,
00:52:41.060 | LlamaIndex enterprise.
00:52:42.780 | I'm open to suggestions here.
00:52:45.180 | So I think the high level of what I can probably say
00:52:50.180 | is just like, yeah, I think we're looking at ways
00:52:53.380 | of like actively kind of complementing
00:52:55.420 | the developer experience, like building LlamaIndex.
00:52:57.940 | You know, we've always been very focused
00:53:00.740 | on stuff around like plugging in your data
00:53:03.580 | into the language model.
00:53:04.900 | And so can we build tools that help like augment
00:53:07.220 | that experience beyond the open source library, right?
00:53:10.020 | And so I think what we're gonna do
00:53:11.500 | is like make a build an experience
00:53:13.340 | where it's very seamless to transition
00:53:14.900 | from the open source library with like a one line toggle.
00:53:18.740 | You can basically get this like complementary service
00:53:20.980 | and then figure out a way to like monetize in a bit.
00:53:23.420 | I think our revenue focus this year
00:53:25.500 | is kind of is less emphasized.
00:53:28.460 | Like it's more just about like,
00:53:29.540 | can we build some managed offering
00:53:30.780 | that like provides complementary value
00:53:32.260 | to what the open source library provides?
00:53:34.460 | - Yeah, I think it's the classic thing
00:53:37.140 | about all open source is you want to start building
00:53:40.260 | the most popular open source projects
00:53:41.580 | in your category to own that category.
00:53:44.420 | You're gonna make it very easy to host.
00:53:46.180 | Therefore, then you have to,
00:53:47.220 | you've just built your biggest competitor, which is you.
00:53:50.020 | Yeah, it'll be fun.
00:53:52.140 | - I think it'll be like complementary
00:53:53.500 | 'cause I think it'll be like, you know,
00:53:55.300 | use the open source library and then you have a toggle
00:53:57.780 | and all of a sudden, you know, you can see this
00:54:00.220 | basically like a pipeline-ish thing pop up
00:54:03.580 | and then it'll be able to kind of like, you'll have a UI,
00:54:07.380 | there'll be some enterprise guarantees
00:54:09.380 | and the end goal would be to help you build
00:54:11.020 | like a production RAG app more easily.
00:54:12.660 | - Yeah, great, awesome.
00:54:14.900 | Should we go on to like ecosystem and other stuff?
00:54:17.460 | - Yeah. - Go ahead.
00:54:19.300 | - Data loaders, there's a lot of them.
00:54:21.700 | What are maybe some of the most popular,
00:54:24.540 | maybe under, not underrated, but like underexpected,
00:54:27.940 | you know, and how has the open source side of it helped
00:54:31.620 | with like getting a lot more connectors?
00:54:33.460 | You only have six people on the team today,
00:54:35.140 | so you couldn't have done it all yourself.
00:54:37.020 | - Oh, for sure.
00:54:37.980 | Yeah, I think the nice thing about like LlamaHub itself
00:54:40.820 | is just, it's supposed to be a community-driven hub.
00:54:43.020 | And so actually the bulk of the loaders
00:54:44.860 | are completely community contributed.
00:54:46.700 | And so we haven't written that many
00:54:49.340 | like first party connectors actually for this.
00:54:51.180 | It's more just like kind of encouraging people
00:54:53.180 | to contribute to the community.
00:54:56.100 | In terms of the most popular tools or the data loaders,
00:54:59.900 | I think we have Google Analytics on this
00:55:01.500 | and I forgot the specifics.
00:55:02.540 | It's some mix of like the PDF loaders.
00:55:05.180 | We have like 10 of them,
00:55:06.020 | but there's some subset of them that are popular.
00:55:07.820 | And then there's Google, like I think Gmail and like G-Drive.
00:55:12.260 | And then I think maybe it's like one of Slack or Notion.
00:55:15.260 | One thing I will say though,
00:55:17.820 | and I think like Swix probably knows this better
00:55:20.260 | than I do, given that you used to work at Airbyte,
00:55:22.300 | is like, it's very hard to build like,
00:55:24.580 | especially for a full-on service like Notion, Slack
00:55:27.020 | or like Salesforce to build like a really,
00:55:29.260 | really high quality loader that really extracts
00:55:31.300 | all the information that people want, right?
00:55:33.220 | And so I think the thing is when people start out,
00:55:37.700 | like they will probably use these loaders
00:55:39.820 | and it's a great tool to get started.
00:55:41.140 | And for a lot of people it's like good enough
00:55:42.820 | and they submit PRs if they want more additional features.
00:55:45.260 | If like you get to a point where you actually wanna call
00:55:47.700 | like an API that hasn't been supported yet,
00:55:49.820 | or, you know, you want to kind of load in stuff
00:55:53.660 | that like in metadata or something
00:55:55.260 | that hasn't been directly baked
00:55:56.620 | into the logic of the loader itself,
00:55:58.660 | people end up like writing their own custom loaders.
00:56:00.900 | And that is a thing that we're seeing.
00:56:02.300 | And that's something that we're okay with, right?
00:56:03.980 | 'Cause like a lot of this is more just like community driven
00:56:06.380 | and if you wanna submit a PR
00:56:07.620 | to improve the existing one, you can,
00:56:08.740 | otherwise you can create your own custom ones.
00:56:10.300 | - Yeah.
00:56:11.140 | And all that is custom loaders all supported
00:56:13.060 | within LlamaIndex or do you pair it with something else?
00:56:15.580 | - Oh, it's just like, I mean,
00:56:17.300 | you just define your own subclass.
00:56:18.380 | I think that's it.
00:56:19.220 | Yeah, yeah.
00:56:20.060 | - 'Cause typically in the data ecosystem with Airbyte,
00:56:23.580 | you know, Airbyte has its own strategies with custom loaders,
00:56:26.100 | but also you could write your own with like DAGster
00:56:28.460 | or like Prefect or one of those tools.
00:56:30.700 | - Yeah, yeah, exactly.
00:56:31.780 | So I think for us it's more,
00:56:33.180 | we just have a very flexible like document abstraction
00:56:35.140 | that you can fill in with any content that you want.
00:56:37.140 | - Okay.
00:56:37.980 | Are people really dumping all their Gmail into these things?
00:56:40.940 | You said Gmail is number two.
00:56:44.100 | - Yeah, it's like one of Google, some Google product.
00:56:47.460 | I think it's Gmail. - Oh, it's not Gmail.
00:56:48.860 | - I think it might be.
00:56:49.900 | Yeah. - Oh, wow.
00:56:50.860 | - I'm not sure actually.
00:56:52.620 | - I mean, that's the most private data source.
00:56:56.740 | - That's true.
00:56:57.580 | - So I'm surprised that people don't meet you.
00:56:59.980 | I mean, I'm sure some people are,
00:57:01.500 | but like I'm sure, I'm surprised it's popular.
00:57:04.100 | - Yeah.
00:57:05.020 | Let me revisit the Google Analytics.
00:57:06.300 | - Okay. - I wanna try
00:57:07.140 | and give you the accurate response, yeah.
00:57:09.060 | - Yeah.
00:57:10.460 | Well, and then, so the LLM engine,
00:57:13.900 | I assume OpenAI is gonna be a majority.
00:57:16.780 | Is it an overwhelming majority?
00:57:19.620 | What's the market share between like OpenAI,
00:57:22.020 | Cohere, Anthropic, you know, whatever you're seeing.
00:57:24.460 | OpenSource too.
00:57:25.300 | - OpenAI has a majority,
00:57:26.140 | but then like there's Anthropic
00:57:27.580 | and there's also OpenSource.
00:57:29.060 | I think there is a lot of people trying out like Llama 2
00:57:32.060 | and some variant of like a top OpenSource model.
00:57:34.900 | - Side note, any confusion there?
00:57:36.300 | Llama 2 versus LlamaIndex?
00:57:38.020 | - Yeah, I think whenever I go to these talks,
00:57:40.540 | I always open it up with like,
00:57:41.820 | we started before. - We are not.
00:57:42.660 | - Yeah, exactly.
00:57:43.500 | We started before Meta, right?
00:57:44.660 | I wanna point that out.
00:57:46.220 | But no, props to them.
00:57:47.580 | We try to use it for like branding.
00:57:49.060 | We just add two Llamas
00:57:50.060 | when we have like a Llama 2 integration
00:57:51.460 | instead of one Llama.
00:57:52.300 | Anyways.
00:57:53.140 | Yeah, so I think a lot of people are trying out
00:57:57.140 | the popular OpenSource models.
00:57:58.580 | And we have, these days we have like,
00:58:01.420 | there's a lot of toolkits and OpenSource projects
00:58:04.460 | that allow you to self-host and deploy Llama 2.
00:58:07.540 | - Yes. - Right.
00:58:08.380 | And like, Ollama is just a very recent example,
00:58:10.540 | I think that we had an integration with.
00:58:12.380 | And so we just, by virtue of having more of these services,
00:58:14.940 | I think more and more people are trying it out.
00:58:16.620 | - Yeah.
00:58:17.460 | Do you think there's potential there?
00:58:18.820 | Is like, is that gonna be an increasing trend?
00:58:21.900 | - OpenSource? - Yeah.
00:58:22.860 | - Yeah, definitely.
00:58:23.700 | I think in general, people hate monopolies.
00:58:25.740 | And so like there's a,
00:58:27.500 | whenever like OpenAI has something really cool
00:58:30.020 | or like any company has something really cool, even Meta,
00:58:33.220 | like there's just gonna be a huge competitive pressure
00:58:35.300 | from other people to do something
00:58:36.500 | that's more open and better.
00:58:38.060 | And so I do think just market pressures
00:58:39.780 | will improve like OpenSource adoption.
00:58:42.660 | - Last thing I'll say about this,
00:58:43.740 | which is just really like, it gets clicks.
00:58:46.980 | People like, are like, psychologically want that.
00:58:50.340 | But then at the end of the day,
00:58:51.180 | they want, they fall for brand name
00:58:52.580 | and popular and performance benchmarks, you know?
00:58:56.500 | And at the end of the day, OpenAI still wins on that.
00:58:59.740 | - I think that's true.
00:59:00.820 | But I just think like,
00:59:02.300 | unless you were like an active employee at OpenAI, right?
00:59:04.660 | Like all these research labs are putting out like ML,
00:59:07.540 | like PhDs or kind of like other companies too,
00:59:10.500 | they're investing a lot of dollars.
00:59:11.900 | There's gonna be a lot of like competitive pressures
00:59:13.500 | to develop like better models.
00:59:14.700 | So is it gonna be like all fully open source
00:59:17.220 | with like a permissive license?
00:59:18.260 | Like, I'm not completely sure,
00:59:19.500 | but like there's just a lot of just incentive
00:59:21.460 | for people to develop their stuff here.
00:59:23.340 | - Have you looked at like RAG-specific models,
00:59:25.420 | like contextual?
00:59:26.460 | - No, is it public or?
00:59:29.180 | - No, they literally just, so Douwe Kiela,
00:59:32.940 | I think is his name, you probably came across him.
00:59:35.900 | He wrote the RAG paper at Meta
00:59:37.820 | and just started Contextual AI
00:59:40.900 | to create a RAG-specific model.
00:59:42.540 | I don't know what that means.
00:59:44.540 | I was hoping that you do, 'cause it's your business.
00:59:47.060 | - If I had inside information.
00:59:50.580 | I mean, you know, to be honest,
00:59:51.980 | I think this kind of relates to my previous point
00:59:54.580 | on like RAG and fine-tuning.
00:59:56.020 | Like a RAG-specific model is a model architecture
00:59:58.340 | that's designed for better RAG.
01:00:00.020 | And it's less the software engineering principle
01:00:01.940 | of like, how can I take existing stuff
01:00:03.860 | and just plug and play different components into it?
01:00:05.660 | And there's a beauty in that
01:00:07.340 | from ease of use and modularity.
01:00:08.940 | But like when you wanna end to end optimize the thing,
01:00:12.060 | you might want a more specific model.
01:00:13.900 | I just, yeah, I don't know.
01:00:15.900 | I think building your own models is honestly pretty hard.
01:00:20.220 | And I think the issue is if you also build your own models,
01:00:22.740 | like you're also just gonna have to keep up
01:00:24.140 | with like the rate of LLM advances.
01:00:25.660 | Like basically the question is when GPT-5 and six
01:00:29.420 | and whatever, like Anthropic Claude 3 comes out,
01:00:31.860 | like how can you prove that you're actually better
01:00:34.900 | than software developers
01:00:36.460 | cobbling together their own components
01:00:37.860 | on top of a base model, right?
01:00:39.420 | Even if it's just like conceptually,
01:00:40.780 | this is better than maybe like GPT-3 or GPT-4.
01:00:43.820 | - Yeah, yeah.
01:00:45.340 | Base model game is expensive.
01:00:46.820 | - Yeah.
01:00:47.820 | - What about vector stores?
01:00:49.340 | I know Swix is in a Chroma sweatshirt.
01:00:51.900 | - Yeah, because this is a swag game.
01:00:53.900 | - I have the mug from Chroma, it's been great.
01:00:57.300 | - What do you think, what do you think there?
01:00:59.420 | Like there's a lot of them.
01:01:00.580 | Are they pretty interchangeable
01:01:02.380 | for like your users use case?
01:01:04.820 | Is HNSW all we need?
01:01:07.300 | Is there room for improvements there?
01:01:09.220 | - Is MPRA all we need?
01:01:10.540 | - Yeah, yeah.
01:01:11.380 | - I think, yeah, we try to remain unopinionated
01:01:14.500 | about storage providers.
01:01:15.460 | So it's not like, we don't try to like play favorites.
01:01:17.380 | So we have like a bunch of integrations, obviously.
01:01:19.020 | And the way we try to do is we just try to find
01:01:20.980 | like some standard interfaces,
01:01:22.180 | but obviously like different vector stores
01:01:23.700 | will support kind of like slightly additional things
01:01:26.060 | like metadata filters and those things.
01:01:27.940 | And the goal is to have our users basically leave it up
01:01:30.860 | to them to try to figure out like what makes sense
01:01:32.540 | for their use case.
01:01:33.620 | In terms of like the algorithm itself,
01:01:35.660 | I don't think the Delta
01:01:37.660 | on like improving the vector store,
01:01:39.580 | like embedding lookup algorithm is that high.
01:01:42.020 | I think the stuff has been mostly solved
01:01:44.300 | or at least there's just a lot of other stuff you can do
01:01:46.740 | to try to improve the performance.
01:01:48.900 | No, I mean like everything else that we just talked about,
01:01:50.540 | like in terms of like accuracy, right?
01:01:52.020 | To improve RAG, like everything that we talked about,
01:01:53.700 | like chunking, like metadata, like.
01:01:56.140 | - Yeah, well, I mean, I was just thinking like,
01:01:58.020 | maybe for me, the interesting question is,
01:02:00.620 | there are like eight, it's a kind of game of thrones.
01:02:02.580 | There's like eight, the war of eight databases right now.
01:02:05.460 | - Oh, oh, I see, I see.
01:02:06.820 | - How do they stand out
01:02:07.860 | and how did they become very good partners
01:02:09.220 | with LlamaIndex?
01:02:10.060 | - Oh, I mean, I think we're, yeah,
01:02:13.060 | we're pretty good partners with most of them.
01:02:15.060 | Let's see.
01:02:16.380 | - Well, like, so if you're a vector database founder,
01:02:19.140 | like what do you work on?
01:02:21.380 | - That's a good question.
01:02:23.060 | I think one thing I'm very interested in is,
01:02:25.860 | and this is something I think I've started to see
01:02:27.940 | a general trend towards,
01:02:29.060 | is combining structured data querying
01:02:31.740 | with unstructured data querying.
01:02:33.420 | And I think that will probably just expand
01:02:37.020 | the query sophistication of these vector stores
01:02:39.420 | and basically make it so that users don't have to think
01:02:41.580 | about whether they--
01:02:42.420 | - Would you call this like hybrid querying?
01:02:44.380 | Is that what Weaviate's doing?
01:02:46.540 | - Yeah, I mean, I think like,
01:02:47.660 | if you think about metadata filters,
01:02:48.900 | that's basically a structured filter.
01:02:50.420 | It's like a select star or select where, right?
01:02:54.060 | Something equals something.
01:02:55.180 | And then you combine that with semantic search.
01:02:57.060 | I know, I think like LanceDB or something
01:02:59.140 | was like trying to do some like joint interface.
01:03:02.420 | The reason is like most data is semi-structured.
01:03:05.260 | There's some structured annotations
01:03:06.500 | and there's some like unstructured texts.
01:03:07.900 | And so like somehow combining all the expressivity
01:03:12.260 | of like SQL with like the flexibility of semantic search
01:03:14.820 | is something that I think is gonna be really important.
01:03:17.300 | And we have some basic hacks right now
01:03:18.860 | that allow you to jointly query both a SQL database,
01:03:22.540 | like a separate SQL database and a vector store
01:03:24.300 | to like combine the information.
01:03:25.940 | That's obviously gonna be less efficient
01:03:27.220 | than if you just combined it into one system, yeah.
01:03:29.420 | And so I think like pgvector, like, you know,
01:03:31.380 | that type of stuff, I think it's starting to get there.
01:03:33.180 | But like in general,
01:03:34.020 | like how do you have an expressive query language
01:03:35.660 | to actually do like structured querying
01:03:37.620 | along with like all the capabilities of semantic search?
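
A small, self-contained sketch of the structured-plus-semantic combination being described: apply a metadata predicate first (the "select where" part), then do vector similarity only over the rows that survive the filter. Real vector stores push both steps into one engine; this only shows the semantics, and the row layout and example filter are placeholders.

```python
# Hybrid query over an in-memory corpus: structured metadata filter + semantic top-k.
import numpy as np

def hybrid_query(query_emb, rows, predicate, k=3):
    """rows: list of dicts like {"text": ..., "embedding": np.ndarray, "metadata": {...}}
    predicate: function over the metadata dict (the structured 'where' clause)."""
    candidates = [r for r in rows if predicate(r["metadata"])]   # structured filter
    if not candidates:
        return []
    embs = np.stack([r["embedding"] for r in candidates])
    sims = embs @ query_emb / (
        np.linalg.norm(embs, axis=1) * np.linalg.norm(query_emb))
    order = np.argsort(sims)[::-1][:k]                           # semantic top-k
    return [candidates[i]["text"] for i in order]

# e.g. only search 2023 10-K chunks:
# hybrid_query(q, rows, lambda m: m.get("year") == 2023 and m.get("doc_type") == "10-K")
```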
01:03:40.260 | - So your current favorite is just put it into Postgres?
01:03:44.220 | - No, no, no, we don't play--
01:03:45.980 | - Postgres language, the query language.
01:03:49.300 | - I actually don't know what the best language
01:03:52.780 | would be for this.
01:03:53.620 | 'Cause I think it will be something
01:03:55.180 | that like the model hasn't been fine-tuned over.
01:03:57.340 | And so you might wanna train the model over this,
01:04:00.100 | but some way of like expressing structured data filters.
01:04:04.500 | And this could be include time too, right?
01:04:06.580 | It doesn't have to just be like a where clause
01:04:08.940 | with this idea of like a semantic search.
01:04:11.420 | - Yeah, yeah.
01:04:12.260 | And we talked about graph representations.
01:04:14.700 | - Yeah, oh yeah, that's another thing too.
01:04:16.220 | And there's like, yeah,
01:04:17.340 | so that's actually something I didn't even bring up yet.
01:04:19.860 | Like there's this interesting idea of like,
01:04:21.300 | can you actually have the language model,
01:04:23.020 | like explore like relationships within the data too, right?
01:04:25.860 | And somehow combine that information with stuff
01:04:28.020 | that's like more structured within the DB.
01:04:30.860 | - Awesome.
01:04:31.700 | - What else is left in the stack?
01:04:34.620 | - Oh, evals.
01:04:35.620 | - Yeah.
01:04:36.460 | - What are your current strong beliefs
01:04:39.180 | about how to evaluate RAG?
01:04:40.860 | - I think I have thoughts.
01:04:41.820 | I think we're trying to curate this
01:04:42.980 | into some like more opinionated principles
01:04:45.860 | because there are some like open questions here.
01:04:47.540 | I think one question I had to think about
01:04:48.900 | is whether you should do like evals
01:04:50.860 | like component by component first
01:04:52.340 | or should you just do the end-to-end thing?
01:04:54.580 | I think you should,
01:04:55.700 | you might actually just want to do the end-to-end thing first
01:04:57.620 | just to do a sanity check of whether or not like this,
01:05:00.340 | given a query and the final response,
01:05:01.820 | whether or not it even makes sense.
01:05:03.220 | Like you eyeball it, right?
01:05:04.220 | And then you only try to do some basic evals.
01:05:06.340 | And then once you like diagnose what the issue is,
01:05:08.700 | then you go into the kind of like specific area
01:05:11.340 | to find some more solid benchmarks
01:05:13.420 | and try to like improve stuff.
01:05:14.940 | So what is end-to-end evals?
01:05:17.020 | Like it's, you have a query,
01:05:19.180 | it goes in through a retrieval system,
01:05:21.820 | you get back something, you synthesize response,
01:05:23.540 | and that's your final thing.
01:05:24.420 | And you evaluate the quality of the final response.
01:05:27.060 | And these days there's plenty of projects,
01:05:30.300 | like startups, like companies, research,
01:05:33.020 | doing stuff around like GPT-4, right?
01:05:35.180 | As like a human judge to basically kind of like
01:05:37.140 | synthetically generate a data set.
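
The end-to-end eval loop being described can be sketched very simply: hand GPT-4 the question, the retrieved context, and the final answer, and ask for a score plus a short rationale. The prompt wording, the 1-to-5 scale, and the crude parsing below are all placeholder choices, and as noted just after this, the variance of this kind of judge is the practical problem.

```python
# Minimal LLM-as-judge for end-to-end RAG evaluation (pre-1.0 openai SDK).
import openai

def judge(question: str, context: str, answer: str) -> tuple[int, str]:
    prompt = (
        "You are grading a RAG system's answer.\n"
        f"Question: {question}\nRetrieved context: {context}\nAnswer: {answer}\n"
        "On the first line give a score from 1 to 5 for faithfulness and relevance, "
        "then a one-sentence justification."
    )
    resp = openai.ChatCompletion.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}], temperature=0)
    text = resp["choices"][0]["message"]["content"].strip()
    first_line, _, rest = text.partition("\n")
    digits = [c for c in first_line if c.isdigit()]
    score = int(digits[0]) if digits else 0   # crude parsing; structure this in practice
    return score, rest.strip()
```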
01:05:37.980 | - Do you think those will do well?
01:05:39.380 | - I mean, I think- - It's too easy.
01:05:41.300 | - Well, I think, oh, you're talking about like the startups?
01:05:44.860 | - Yeah.
01:05:45.700 | - I don't know.
01:05:46.540 | I don't know from the startup side.
01:05:47.380 | I just know from a technical side,
01:05:48.220 | I think people are gonna do more of it.
01:05:50.580 | The main issue right now is just, it's really unreliable.
01:05:53.420 | Like it's just, like there's like variance in the response
01:05:56.780 | when you wanna be- - Yeah, then they won't do
01:05:57.820 | more of it.
01:05:58.660 | I mean, 'cause it's bad.
01:05:59.500 | - No, but these models will get better
01:06:00.820 | and you'll probably fine tune a model to be a better judge.
01:06:03.260 | I think that's probably what's gonna happen.
01:06:04.500 | So I'm like reasonably bullish on this
01:06:07.260 | because I don't think there's really a good alternative
01:06:09.740 | beyond you just human annotating a bunch of data sets
01:06:12.500 | and then trying to like just manually go through
01:06:14.700 | and curating, like evaluating eval metrics.
01:06:17.460 | And so this is just gonna be a more scalable solution.
01:06:19.860 | In terms of the startups, yeah, I mean,
01:06:21.140 | I think there's a bunch of companies doing this.
01:06:22.660 | In the end, it probably comes down to some aspect
01:06:24.340 | of like UX speed and then whether you can
01:06:27.620 | like fine tune a model.
01:06:29.140 | And then, so that's end-to-end evals.
01:06:31.860 | And then I think like what we found is for RAG,
01:06:34.340 | a lot of times like what ends up affecting
01:06:37.500 | this like end response is retrieval.
01:06:39.420 | You're just not able to retrieve the right response.
01:06:41.300 | I think having proper retrieval benchmarks,
01:06:43.260 | especially if you wanna do production RAG
01:06:44.820 | is actually quite important.
01:06:46.260 | I think, what does having good retrieval metrics tell you?
01:06:49.540 | It tells you that at least like the retrieval is good.
01:06:52.180 | It doesn't necessarily guarantee
01:06:53.260 | the end generation is good,
01:06:54.740 | but at least it gives you some sort of like sanity track,
01:06:58.260 | right, so you can like fix one component
01:06:59.780 | while optimizing the rest.
01:07:00.980 | Well, retrieval like evaluation is pretty standard
01:07:04.340 | and it's been around for a while.
01:07:05.860 | It's just like an IR problem basically.
01:07:07.940 | You have some like input query,
01:07:10.460 | you get back some retrieved set of context
01:07:12.500 | and then there's some ground truth in that ranked set.
01:07:15.420 | And then you try to measure it based on ranking metrics.
01:07:17.580 | So the closer that ground truth is to the top,
01:07:20.420 | the more you reward the evals.
01:07:22.380 | And then the closer it is to the bottom
01:07:23.900 | or if it's not in the retrieved set at all,
01:07:25.580 | then you penalize the evals.
01:07:27.140 | And so that's just like a classic ranking problem.
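
For concreteness, the two ranking-style metrics most people reach for first are hit rate (did the ground-truth chunk show up in the top k at all?) and MRR (how close to the top was it?). A few lines of plain Python cover both; the example data at the bottom is made up.

```python
# Classic retrieval metrics over an eval set of (retrieved_ids, ground_truth_id) pairs.
def hit_rate(results, k=5):
    hits = sum(1 for retrieved, truth in results if truth in retrieved[:k])
    return hits / len(results)

def mean_reciprocal_rank(results):
    total = 0.0
    for retrieved, truth in results:
        if truth in retrieved:
            total += 1.0 / (retrieved.index(truth) + 1)   # reward items nearer the top
    return total / len(results)

# results = [(["n3", "n7", "n1"], "n7"), (["n2", "n9", "n4"], "n8"), ...]
```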
01:07:29.580 | Most people starting out
01:07:30.900 | probably don't know how to do this.
01:07:32.740 | Right now, we just launched
01:07:33.780 | some like basic retrieval evaluation modules
01:07:37.220 | to help users do this.
01:07:38.620 | One is just like curating this data set in the first place.
01:07:41.140 | And one thing that we're very interested in
01:07:43.300 | is this idea of like synthetic data set generation
01:07:45.260 | for evals.
01:07:46.100 | So how can you, given some context,
01:07:47.820 | generate a set of questions with GPT-4
01:07:49.820 | and then all of a sudden you have like question
01:07:51.300 | and then context pairs and that becomes your ground truth.
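
A hedged sketch of that synthetic-dataset idea: for each context chunk, ask GPT-4 to write a couple of questions that only that chunk answers, and record (question, source chunk id) pairs as the ground truth for the retrieval metrics above. Prompting and parsing details are placeholders.

```python
# Generate a synthetic retrieval eval set: (question, source_chunk_id) pairs.
import openai

def generate_eval_pairs(chunks: dict[str, str], questions_per_chunk: int = 2):
    pairs = []
    for chunk_id, text in chunks.items():
        prompt = (f"Write {questions_per_chunk} questions that can be answered "
                  f"only from the following passage, one per line:\n{text}")
        resp = openai.ChatCompletion.create(
            model="gpt-4", messages=[{"role": "user", "content": prompt}])
        for line in resp["choices"][0]["message"]["content"].splitlines():
            if line.strip():
                pairs.append((line.strip(), chunk_id))   # ground truth = source chunk
    return pairs
```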
01:07:53.980 | - Yeah.
01:07:55.020 | Are data agent evals the same thing
01:07:56.700 | or is there a separate set of stuff for agents
01:07:59.940 | that you think is relevant here?
01:08:01.700 | - Data agents add like another layer of complexity
01:08:03.900 | 'cause then it's just like you have just more loops
01:08:06.220 | in the system.
01:08:07.060 | Like you can evaluate like each chain of thought loop itself
01:08:10.740 | like every LLM call to see whether or not the input
01:08:14.300 | to that specific step in the chain of thought process
01:08:16.580 | actually works or is correct.
01:08:20.420 | Or you could evaluate like the final response
01:08:22.220 | to see if that's correct.
01:08:23.220 | This gets even more complicated
01:08:24.460 | when you do like multi-agent stuff
01:08:25.820 | because now you have like some communication
01:08:27.420 | between like different agents.
01:08:28.700 | Like you have a top level orchestration agent
01:08:30.500 | passing it on to some low level stuff.
01:08:33.740 | I'm probably less familiar
01:08:35.460 | with kind of like agent eval frameworks.
01:08:36.980 | I know they're starting to become a thing.
01:08:39.620 | I know I was talking to like Joon
01:08:42.020 | from the Generative Agents paper,
01:08:43.660 | which is pretty unrelated to what we're doing now,
01:08:45.780 | but it's very interesting where it's like,
01:08:47.260 | so you can kind of evaluate like overall agent simulations
01:08:50.460 | by just like kind of understanding
01:08:52.020 | whether or not they like modeled
01:08:53.660 | this distribution of human behavior,
01:08:55.180 | but that's like a very macro principle, right?
01:08:57.220 | And that's very much to evaluate stuff
01:08:59.220 | to kind of like model the distribution of things.
01:09:02.700 | And I think that works well
01:09:03.980 | when you're trying to like generate something
01:09:05.620 | for like creative purposes,
01:09:07.300 | but for stuff where you really want the agent
01:09:09.100 | to like achieve a certain task,
01:09:10.460 | it really is like whether or not
01:09:11.700 | it achieved the task or not, right?
01:09:13.060 | 'Cause then it's not like,
01:09:14.500 | oh, does it generally mimic human behavior?
01:09:16.380 | It's like, no, like did you like send this email or not?
01:09:18.540 | Right, like, 'cause otherwise like this thing didn't work.
01:09:21.260 | Yeah.
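
A loose sketch of the two levels of agent evaluation described above: scoring each intermediate step with an LLM judge, plus a binary did-it-achieve-the-task check ("did you send this email or not"). `AgentStep`, `judge_step`, and the judge prompt are hypothetical placeholders, not any existing framework's API:

```python
from dataclasses import dataclass

@dataclass
class AgentStep:
    thought: str      # the model's reasoning at this step
    tool_call: str    # e.g. "send_email(to=..., subject=...)"
    observation: str  # what the tool returned

def judge_step(step: AgentStep, judge) -> bool:
    """Ask an LLM judge whether this single step was a reasonable action."""
    prompt = (
        f"Thought: {step.thought}\nAction: {step.tool_call}\n"
        f"Observation: {step.observation}\n"
        "Was this a reasonable step toward the user's goal? Answer yes or no."
    )
    return judge(prompt).strip().lower().startswith("yes")

def evaluate_trajectory(steps: list[AgentStep], task_succeeded: bool, judge) -> dict:
    """Step-level accuracy plus the end-to-end pass/fail outcome."""
    step_scores = [judge_step(s, judge) for s in steps]
    return {
        "step_accuracy": sum(step_scores) / len(step_scores) if steps else 0.0,
        "task_success": task_succeeded,
    }
```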
01:09:22.100 | - Makes sense.
01:09:22.940 | Awesome.
01:09:23.900 | Yeah, let's jump into Lightning Round.
01:09:26.340 | So we have two questions, acceleration, exploration,
01:09:29.340 | and then one final takeaway.
01:09:31.500 | The acceleration question is,
01:09:32.740 | what's something that already happened in AI
01:09:35.060 | that you thought would take much longer to get here?
01:09:37.780 | - I think just the ability of LLMs
01:09:39.540 | to generate believable outputs,
01:09:41.300 | and both for texts and also for images.
01:09:44.900 | And I think just the whole reason
01:09:47.140 | I started hacking around with LLMs,
01:09:48.380 | honestly, I felt like I got into it pretty late.
01:09:50.060 | I should've gone into it like early 2022
01:09:51.940 | because GPT-3 had been out for a while.
01:09:53.580 | Like just the fact that there was this engine
01:09:56.900 | that was capable of like reasoning
01:09:58.140 | and no one was really like tapping into it.
01:10:00.420 | And then the fact that, you know,
01:10:01.900 | I used to work in image generation for a while.
01:10:03.540 | Like I did GANs and stuff back in the day,
01:10:05.620 | and that was like pretty hard to train.
01:10:07.500 | You would generate these like 32 by 32 images,
01:10:10.420 | and then now taking a look at some of the stuff
01:10:12.180 | by like DALL-E and, you know, Midjourney and those things.
01:10:14.780 | So it's just, it's very good.
01:10:16.540 | Yeah.
01:10:17.380 | - Exploration.
01:10:18.860 | What do you think is the most interesting
01:10:20.180 | unsolved question in AI?
01:10:21.660 | - Yeah, I'd probably work on some aspect
01:10:24.020 | of like personalization of memory.
01:10:28.340 | I think a lot of people have thoughts about that,
01:10:30.020 | but like for what it's worth,
01:10:31.260 | I don't think the final state will be RAG.
01:10:32.740 | I think it will be some like fancy algorithm
01:10:35.540 | or architecture where you like bake it
01:10:37.220 | into like the architecture of the model itself.
01:10:39.580 | Like if you have like a personalized assistant
01:10:41.540 | that you can talk to,
01:10:43.820 | that will like learn behaviors over time, right?
01:10:45.860 | And kind of like learn stuff
01:10:47.260 | through like conversation history,
01:10:48.660 | what exactly is the right architecture there?
01:10:50.340 | I do think that will be part of like-
01:10:52.780 | - Continuous fine tuning?
01:10:54.020 | - Yeah, like some aspect of that, right?
01:10:55.700 | Like these are like,
01:10:56.540 | I don't actually know the specific technique,
01:10:57.940 | but I don't think it's just gonna be something
01:10:59.460 | where you have like a fixed vector store
01:11:00.700 | and that thing will be like the thing
01:11:02.220 | that stores all your memories.
01:11:03.780 | - Yeah, it's interesting because I feel
01:11:07.060 | like using model weights for memory,
01:11:11.340 | it's just such an unreliable storage device.
01:11:14.260 | - I know, but like, I just think from like the AGI,
01:11:18.580 | like, you know, just modeling
01:11:20.500 | like the human brain perspective,
01:11:21.660 | I think that there is something nice
01:11:23.220 | about just like being able to optimize that system, right?
01:11:26.380 | And to optimize a system, you need parameters
01:11:28.380 | and then that's where you just get
01:11:29.220 | into the neural net piece.
01:11:30.660 | - Cool, cool, and yeah, take away,
01:11:33.780 | you've got the audience's ear,
01:11:35.740 | what's something you want everyone to think about
01:11:38.220 | or yeah, take away from this conversation
01:11:41.020 | and your thinking.
01:11:42.300 | - I think there were a few key things.
01:11:44.460 | So we talked about two of them already,
01:11:46.180 | which was SEC insights,
01:11:47.460 | which if you guys haven't checked it out,
01:11:48.740 | I'd definitely encourage you to do so,
01:11:49.940 | because it's not just like a random like SEC app,
01:11:52.500 | it's like a full stack thing that we open source, right?
01:11:54.700 | And so if you guys wanna check it out,
01:11:55.980 | I would definitely do that.
01:11:57.580 | It provides a template for you to build
01:11:59.020 | kind of like production grade rag apps
01:12:00.700 | and we're gonna open source like
01:12:02.260 | and modularize more components of that soon.
01:12:04.180 | - Into a workshop.
01:12:05.340 | - Yeah, and the second piece is we are thinking a lot
01:12:08.540 | about like retrieval and evals.
01:12:10.380 | I think right now we're kind of exploring integrations
01:12:12.900 | with like a few different partners
01:12:14.180 | and so hopefully some of that will be released soon.
01:12:16.820 | And so just like, how do you basically have an experience
01:12:20.420 | where you just like write LlamaIndex code,
01:12:23.140 | all of a sudden you can easily run like retrievals,
01:12:25.660 | evals and like traces, all that stuff in like a service.
01:12:28.860 | And so I think we're working with like
01:12:29.980 | a few providers on that.
01:12:31.460 | And then the other piece,
01:12:32.540 | which we did talk about already is this idea of like,
01:12:34.940 | yeah, building like RAG from scratch.
01:12:36.620 | I mean, I think everybody should do it.
01:12:37.980 | I think like I would check out the guide
01:12:40.940 | if you guys haven't already, I think it's in our docs,
01:12:42.820 | but instead of just using, you know,
01:12:45.180 | either the kind of like the retriever query engine
01:12:48.860 | in LlamaIndex or like the conversational QA chain
01:12:51.180 | in LangChain, I would take a look at
01:12:53.780 | how do you actually chunk parse data
01:12:55.860 | and do like top-k embedding retrieval.
01:12:57.700 | 'Cause I really think that by doing that process,
01:12:59.780 | it helps you understand the decisions,
01:13:01.740 | the prompts, the language models to use.
01:13:04.420 | - That's it.
01:13:05.260 | - Yeah. - Thank you so much.
01:13:06.100 | - Thank you, Jerry.
01:13:06.920 | - Yeah, thank you.
01:13:07.760 | (upbeat music)