RAG is a hack - with Jerry Liu of LlamaIndex

00:00:00.000 | (upbeat music)

00:00:02.580 | - Hey, everyone.

00:00:07.940 | Welcome to the Latent Space Podcast.

00:00:09.960 | This is Alessio, partner and CTO

00:00:11.960 | of Residence and Decibel Partners,

00:00:13.680 | and I'm joined by my co-host Swix, founder of Small AI.

00:00:17.240 | - And today we finally have Jerry Liu on the podcast.

00:00:19.960 | Hey, Jerry.

00:00:20.880 | - Hey, guys.

00:00:21.720 | Hey, Swix, I'm Alessio.

00:00:22.600 | Thanks for having me.

00:00:23.720 | - It's so weird because we keep running into each other

00:00:26.160 | in San Francisco AI events,

00:00:27.640 | so it's kind of weird to finally just have a conversation

00:00:30.320 | recorded for everybody else.

00:00:31.680 | - Yeah, I know.

00:00:32.500 | I'm really looking forward to this.

00:00:33.340 | I guess I have further questions.

00:00:35.760 | - So I tend to introduce people on their formal background

00:00:38.220 | and then ask something on the more personal side.

00:00:41.240 | So you are part of the Princeton gang.

00:00:44.240 | - Yeah.

00:00:46.720 | I don't know if there is like an official Princeton gang.

00:00:49.200 | - There is more Princeton gang.

00:00:50.040 | I attended your meeting.

00:00:51.240 | There was like four of you.

00:00:52.440 | - Oh, cool.

00:00:53.280 | Okay, nice.

00:00:54.100 | - With Prem and the others.

00:00:55.200 | - Oh, yeah, yeah, yeah, yeah.

00:00:57.500 | - Where you did bachelor's in CS

00:00:58.840 | and certificate in finance.

00:01:00.020 | That's also fun.

00:01:01.000 | I also did finance.

00:01:02.280 | And I think I saw that you also interned at Two Sigma

00:01:04.720 | where I worked in New York.

00:01:06.780 | You were a machine learning engineer.

00:01:07.620 | - You were at Two Sigma?

00:01:08.720 | - Yeah, very briefly.

00:01:09.840 | - Oh, cool.

00:01:10.680 | All right, I didn't know that.

00:01:11.520 | Okay.

00:01:12.340 | - That was my first like proper engineering job

00:01:13.180 | before I went into DevRel.

00:01:14.520 | - Oh, okay.

00:01:15.360 | Oh, wow.

00:01:16.180 | Nice.

00:01:17.020 | - And then you were a machine learning engineer at Quora,

00:01:19.960 | AI research scientist at Uber for three years,

00:01:22.540 | and then two years machine learning engineer

00:01:24.600 | at Robust Intelligence before starting Llama Index.

00:01:27.500 | So that's your LinkedIn.

00:01:28.380 | What's not on your LinkedIn

00:01:29.860 | that people should know about you?

00:01:31.300 | - I think back during my Quora days,

00:01:33.180 | I had this like three month phase

00:01:35.420 | where I just wrote like a ton of Quora answers.

00:01:37.460 | And so I think if you look at my tweets nowadays,

00:01:40.060 | you can basically see that as like the V2

00:01:42.780 | of my three month like Quora stint

00:01:44.820 | where I just like went ham on Quora for a bit.

00:01:47.500 | I actually, I think I was back then,

00:01:49.500 | actually, when I was working on Quora,

00:01:51.340 | I think the thing that everybody was fascinated in

00:01:53.700 | was just like general like deep learning advancements

00:01:57.240 | and stuff like GANs and generative like images

00:01:59.900 | and just like new architectures that were evolving.

00:02:01.720 | And it was a pretty exciting time

00:02:02.820 | to be a researcher actually,

00:02:03.920 | 'cause you were going in like really understanding

00:02:05.600 | some of the new techniques.

00:02:06.760 | So I kind of use that as like a learning opportunity

00:02:08.360 | to basically just like read a bunch of papers

00:02:10.200 | and then answer questions on Quora.

00:02:12.040 | And so you can kind of see traces of that

00:02:14.480 | basically in my current Twitter

00:02:15.760 | where it's just like really about kind of like

00:02:17.320 | framing concepts and trying to make it understandable

00:02:19.800 | and educate other users on it.

00:02:21.160 | - Yeah, I've said, so a lot of people come to me

00:02:23.080 | for my Twitter advice,

00:02:23.900 | but like I think you are doing one of the best jobs

00:02:26.120 | in AI Twitter,

00:02:27.360 | just explaining concepts

00:02:28.360 | and just consistently getting hits out.

00:02:30.760 | - Thank you.

00:02:31.600 | (laughing)

00:02:32.440 | - And I didn't know it was due to the Quora training.

00:02:34.880 | Let's just sign on on Quora.

00:02:36.660 | A lot of people, including myself,

00:02:38.080 | like kind of wrote on Quora as like one of the web 1.0

00:02:40.660 | like sort of question answer forums.

00:02:42.440 | But now I think it's becoming a senior resurgence

00:02:45.140 | obviously due to Poe.

00:02:46.520 | And obviously Adam D'Angelo

00:02:48.180 | has always been a leading tech figure,

00:02:49.840 | but what do you think is like kind of underrated about Quora?

00:02:52.120 | - I really like the mission of Quora when I joined.

00:02:54.640 | In fact, I think when I interned there like in 2015

00:02:58.200 | and I joined full time in 2017,

00:02:59.960 | one is like they had

00:03:02.200 | and they have like a very talented engineering team

00:03:05.120 | and just like really, really smart people.

00:03:07.720 | And the other part is the whole mission of the company

00:03:10.120 | is to just like spread knowledge and to educate people.

00:03:13.200 | Right, and to me that really resonated.

00:03:15.200 | I really liked the idea of just like education

00:03:17.560 | and democratizing the flow of information.

00:03:19.440 | And if you imagine like kind of back then

00:03:22.040 | it was like, okay, you have Google,

00:03:23.240 | which is like for search,

00:03:24.320 | but then you have Quora,

00:03:25.160 | which is just like user generated

00:03:26.800 | like grassroots type content.

00:03:28.240 | And I really liked that concept

00:03:29.440 | because it's just like, okay,

00:03:30.440 | there's certain types of information

00:03:31.560 | that aren't accessible to people,

00:03:32.640 | but you can make accessible by just like surfacing it.

00:03:34.680 | And so actually, I don't know if like most people

00:03:37.120 | know that about like Quora,

00:03:38.200 | like if they've used the product,

00:03:40.400 | whether through like SEO, right,

00:03:41.920 | or kind of like actively,

00:03:43.440 | but that really was what drew me to it.

00:03:45.360 | - Yeah, I think most people challenges with it

00:03:48.160 | is that sometimes you don't know

00:03:49.120 | if it's like a veiled product pitch, right?

00:03:51.440 | - Yeah.

00:03:52.280 | - It's like, you know.

00:03:53.120 | - Of course, like quality of the answer matters quite a bit.

00:03:54.880 | And then--

00:03:55.720 | - It's like five alternatives

00:03:56.560 | and then here's the one I work on.

00:03:57.400 | - Yeah, like recommendation issues and all that stuff.

00:03:59.320 | I used, I worked on Rexis at Quora actually.

00:04:01.560 | So, I got a taste of it.

00:04:02.400 | - So how do you solve stuff like that?

00:04:03.800 | - Well, I mean, I kind of more approached it

00:04:05.440 | from machine learning techniques,

00:04:07.600 | which might be a nice segue into rag actually.

00:04:09.880 | A lot of it was just information retrieval.

00:04:11.240 | We weren't like solving anything

00:04:12.520 | that was like super different

00:04:13.880 | than what was standard in the industry at the time,

00:04:15.640 | but just like ranking based on user preferences.

00:04:18.360 | I think a lot of Quora was very metrics driven.

00:04:20.320 | So just like trying to maximize like, you know,

00:04:22.240 | daily active hours, like, you know,

00:04:24.440 | time spent on site, those types of things.

00:04:27.200 | And all the machine learning algorithms

00:04:29.040 | were really just based on embeddings.

00:04:31.280 | You know, you have a user embedding

00:04:33.000 | and you have like item embeddings

00:04:34.520 | and you try to train the models

00:04:35.640 | to try to maximize the similarity of these.

00:04:37.720 | And it's basically a retrieval problem.

00:04:39.680 | - Okay, so you've been working on rag

00:04:41.640 | for longer than most people think?

00:04:43.040 | - Well, kind of.

00:04:43.880 | So I worked there for like a year, right?

00:04:45.760 | - Yeah.

00:04:46.600 | - Just transparently.

00:04:47.440 | And then I worked at Uber

00:04:48.640 | where I was not working on ranking.

00:04:49.920 | It was more like kind of deep learning training

00:04:52.160 | for self-driving and computer vision and that type of stuff.

00:04:55.520 | But I think, yeah, I mean, I think in the LLM world,

00:04:59.360 | it's kind of just like a combination

00:05:01.200 | of like everything these days.

00:05:03.320 | I mean, retrieval is not really LLMs,

00:05:05.200 | but like it fits within the space of like LLM apps.

00:05:08.960 | And then obviously like having knowledge

00:05:10.440 | of the underlying deep learning architecture helps,

00:05:12.600 | having knowledge of basic software engineering principles

00:05:14.720 | helps too.

00:05:15.560 | And so I think it's kind of nice that like

00:05:18.360 | this whole LLM space is basically just a combination

00:05:20.320 | of just like a bunch of stuff

00:05:21.240 | that you probably like people have done in the past.

00:05:24.240 | - It's good.

00:05:25.080 | It's like a summary capstone project.

00:05:26.600 | - Yeah, exactly.

00:05:27.440 | Yeah.

00:05:28.800 | - Yeah.

00:05:29.640 | - And before we dive into LLMA Index,

00:05:34.160 | what do they feed you a robust intelligence

00:05:36.240 | that both you and Harrison from Lanchain

00:05:38.480 | came out of it at the same time?

00:05:40.200 | Was there like, yeah, is there any fun story

00:05:43.160 | of like how both of you kind of came out

00:05:44.880 | with kind of like core infrastructure

00:05:46.920 | to LLM workflows today?

00:05:48.400 | Or how close were you at robust?

00:05:50.520 | Like any fun behind the scenes?

00:05:52.680 | - Yeah.

00:05:53.520 | Yeah.

00:05:54.360 | We worked pretty closely.

00:05:56.040 | I mean, we were on the same team for like two years.

00:05:57.440 | I got to know Harrison and the rest of the team pretty well.

00:05:59.200 | I mean, I have a respect the people there.

00:06:00.880 | The people there were very driven, very passionate.

00:06:02.480 | And it definitely pushed me to be a better engineer

00:06:04.520 | and leader and those types of things.

00:06:06.880 | Yeah, I don't really have a concrete explanation for this.

00:06:10.600 | I think it's more just,

00:06:11.720 | we have like an LLM hackathon around like September.

00:06:15.160 | This was just like exploring GPT-3

00:06:16.920 | or it was October actually.

00:06:18.560 | And then the day after I went on vacation

00:06:20.160 | for a week and a half.

00:06:21.040 | And so I just didn't track Slack or anything.

00:06:24.000 | Came back, saw that Harrison started Lanchain.

00:06:26.280 | I was like, oh, that's cool.

00:06:27.120 | I was like, oh, I'll play around with LLMs a bit

00:06:29.840 | and then hacked around on stuff.

00:06:31.000 | And I think I've told the story a few times,

00:06:32.440 | but you know, I was like trying to feed in information

00:06:34.760 | into the GPT-3.

00:06:36.800 | And then you deal with like context window limitations

00:06:39.160 | and there was no tooling or really practices

00:06:41.320 | to try to understand how do you, you know,

00:06:43.320 | get GPT-3 to navigate large amounts of data.

00:06:45.880 | And that's kind of how the project started.

00:06:47.840 | Really was just one of those things where early days,

00:06:51.400 | like we were just trying to build something

00:06:52.880 | that was interesting and not really,

00:06:55.520 | like I wanted to start a company.

00:06:58.280 | I had other ideas actually of what I wanted to start.

00:07:01.600 | And I was very interested in, for instance,

00:07:03.400 | like multi-modal data, like video data

00:07:05.200 | and that type of stuff.

00:07:06.320 | And then this just kind of grew

00:07:07.720 | and eventually took over the other idea.

00:07:10.080 | Text is the universal interface.

00:07:12.640 | - I think so.

00:07:13.640 | I think so.

00:07:14.480 | I actually think once the multi-modal models come out,

00:07:16.600 | I think there's just like mathematically nicer properties

00:07:19.520 | of you can just get like join multi-modal embeddings

00:07:21.720 | like clip style.

00:07:23.600 | But how, like text is really nice

00:07:25.800 | because from a software engineering principle,

00:07:27.360 | it just makes things way more modular.

00:07:28.560 | You just convert everything into text

00:07:29.920 | and then you just represent everything as text.

00:07:31.400 | - Yeah.

00:07:32.240 | I'm just explaining retroactively

00:07:33.400 | why working on LLM Index took off

00:07:35.400 | versus if you had chose to spend your time on multi-modal,

00:07:38.240 | we probably wouldn't be talking about

00:07:40.160 | whatever you ended up working on.

00:07:41.480 | - Yeah, that's true.

00:07:43.400 | - It's troubled.

00:07:44.600 | Yeah, I think so.

00:07:46.600 | So interesting.

00:07:47.440 | So November 9th.

00:07:49.280 | So that was a very productive month, I guess.

00:07:51.560 | So October, November.

00:07:53.200 | November 9th, you announced GPT Tree Index

00:07:56.040 | and you picked the tree logo.

00:07:57.280 | Very, very, very cool.

00:07:58.520 | Everyone, every project must have an emoji.

00:08:00.640 | - Yeah.

00:08:01.480 | Yeah.

00:08:02.520 | That probably was somewhat inspired by a light rain.

00:08:05.400 | I will admit, yeah.

00:08:06.680 | - It uses GPT to build a knowledge tree

00:08:08.360 | in a bottoms-up fashion

00:08:09.200 | by applying a summarization prompt for each node.

00:08:11.800 | - Yep.

00:08:12.640 | - Which I like that original vision.

00:08:16.280 | Your messaging around about them

00:08:18.680 | was also that you're creating optimized data structures.

00:08:21.760 | What's the journey to that

00:08:26.160 | and how does that contrast with LLM Index today?

00:08:29.360 | - Yeah, so, okay.

00:08:30.800 | Maybe I can tell a little bit

00:08:31.880 | about the beginning intuitions.

00:08:34.040 | I think when I first started,

00:08:36.040 | this really wasn't supposed to be something

00:08:37.760 | that was like a toolkit that people use.

00:08:40.480 | It was more just like a system.

00:08:42.560 | And the way I wanted to think about the system

00:08:44.280 | was more a thought exercise

00:08:45.480 | of how language models with their reasoning capabilities,

00:08:48.200 | if you just treat them as like brains,

00:08:49.760 | can organize information and then traverse it.

00:08:52.160 | So I didn't want to think about embeddings, right?

00:08:54.200 | To me, embeddings just felt like

00:08:55.440 | it was just an external thing that was like,

00:08:57.320 | well, it was just external

00:08:58.720 | to try and actually tap into the capabilities

00:09:00.760 | of language models themselves, right?

00:09:02.080 | I really wanted to see, you know,

00:09:03.800 | just as a human brain could synthesize stuff,

00:09:06.200 | could we create some sort of structure

00:09:07.840 | where there's this neural CPU, if you will,

00:09:10.640 | can organize a bunch of information,

00:09:12.560 | auto-summarize a bunch of stuff,

00:09:14.200 | and then also traverse the structure that I created.

00:09:16.920 | That was the inspiration for this initial tree index.

00:09:20.280 | It didn't actually, to be honest,

00:09:22.480 | and I think I said this in the first tweet,

00:09:23.840 | it didn't actually work super well, right?

00:09:25.840 | The GPT-3 at the time-

00:09:26.680 | - You're very honest about that.

00:09:27.720 | - Yeah, I know, I mean, it was just like,

00:09:29.760 | GPT-4 obviously is much better at reasoning.

00:09:31.880 | I'm one of the first to say,

00:09:33.280 | you shouldn't use anything pre-GPT-4

00:09:35.240 | for anything that requires complex reasoning

00:09:37.520 | because it's just gonna be unreliable.

00:09:39.000 | Okay, disregarding stuff like fine-tuning,

00:09:40.800 | but it worked okay,

00:09:42.280 | but I think it definitely struck a chord

00:09:44.200 | with kind of like the Twitter crowd,

00:09:46.000 | which is just like looking for kind of,

00:09:48.520 | just like new ideas at the time,

00:09:51.080 | I guess just like thinking about

00:09:52.240 | how you can actually bake this

00:09:53.520 | into some sort of application

00:09:54.760 | because I think what I also ended up discovering

00:09:57.200 | was the fact that basically everybody,

00:10:00.840 | there was starting to become a wave of developers

00:10:02.800 | building on top of GPT-3

00:10:04.400 | and people were starting to realize

00:10:05.760 | that what makes them really useful

00:10:07.480 | is to apply them on top of your personal data.

00:10:09.640 | And so even if the solution itself

00:10:11.960 | was kind of like primitive at the time,

00:10:13.640 | like the problem statement itself was very powerful.

00:10:16.200 | And so I think being motivated by the problem statement,

00:10:18.400 | right, like this broad mission

00:10:19.520 | of how do I unlock elements on top of the data

00:10:21.840 | also contributed to the development of Lama Index

00:10:24.560 | to the state it is today.

00:10:26.200 | And so I think part of the reason

00:10:28.320 | our toolkit has evolved beyond

00:10:30.680 | the like just existing set of like data structures

00:10:33.120 | is we really tried to take a step back and think,

00:10:35.360 | okay, what exactly are the tools

00:10:36.920 | that would actually make this useful for a developer?

00:10:39.440 | And then, you know, somewhere around December,

00:10:41.200 | we made an active effort

00:10:42.600 | to basically like push towards that direction,

00:10:44.720 | make the code base more modular, right,

00:10:46.360 | more friendly as an open source library.

00:10:48.160 | And then also start adding in like embeddings,

00:10:51.200 | start thinking into practical considerations

00:10:53.840 | like latency, cost, performance, those types of things.

00:10:56.240 | And then really motivated by that mission,

00:10:58.680 | like start expanding the scope of the toolkit

00:11:00.400 | towards like covering the life cycle

00:11:02.160 | of like data ingestion and querying.

00:11:04.680 | - Yeah, where you also added Lama Hub and--

00:11:08.240 | - Yeah, yeah, so I think that was in like January

00:11:10.920 | on the data loading side.

00:11:12.040 | And so we started adding like some data loaders,

00:11:14.120 | saw an opportunity there,

00:11:15.880 | started adding more stuff on the retrieval querying side,

00:11:18.600 | right, we still have like the core data structures,

00:11:20.640 | but how do you actually make them more modular

00:11:22.840 | and kind of like decouple storing state

00:11:25.400 | from the types of like queries

00:11:26.840 | that you could run on top of this a little bit.

00:11:28.920 | And then starting to get into more complex interactions

00:11:31.520 | like train of thought, reasoning, routing,

00:11:33.120 | and you know, like agent loops.

00:11:34.920 | - Yeah, yeah, very cool.

00:11:36.360 | - And then you and I spent a bunch of time earlier this year

00:11:39.960 | talking about Lama Hub, what that might become.

00:11:42.920 | You were still at Robust.

00:11:45.440 | When did you decide it was time to start the company

00:11:48.600 | and then start to think about what Lama Index is today?

00:11:52.560 | - Probably December, yeah.

00:11:54.080 | And so it was clear that, you know,

00:11:56.520 | it was kind of interesting.

00:11:57.360 | I was getting some inbound from initial VCs.

00:11:59.160 | I was talking about this project.

00:12:00.440 | And then in the beginning, I was like,

00:12:01.960 | oh yeah, you know, this is just like a design project,

00:12:03.920 | but you know, what about my other idea on like video data?

00:12:06.320 | Right, and I was trying to like get their thoughts on that.

00:12:09.800 | And then everybody was just like, oh yeah, whatever.

00:12:12.440 | Like that part's like a crowded market.

00:12:14.800 | And then it became clear that, you know,

00:12:16.440 | this was actually a pretty big opportunity.

00:12:18.320 | And like coincidentally, right,

00:12:20.120 | like this actually did relate to,

00:12:22.040 | like my interests have always been

00:12:23.080 | at the intersection of AI data

00:12:25.080 | and kind of like building practical applications.

00:12:26.600 | And it was clear that this was evolving

00:12:28.240 | into a much bigger opportunity than the previous idea was.

00:12:30.680 | So around December.

00:12:31.520 | And then I think I gave a pretty long notice,

00:12:33.120 | but I left officially like early March.

00:12:35.600 | - What were your thinkings in terms of like moats and,

00:12:40.600 | you know, founders kind of like overthink it sometimes.

00:12:43.200 | You obviously had like a lot of open source love

00:12:45.520 | and like a lot of community.

00:12:47.120 | And yeah, like, were you ever thinking, okay,

00:12:50.200 | I don't know, this is maybe not enough to start a company

00:12:52.800 | or did you always have conviction about it?

00:12:55.680 | - Oh no, I mean, a hundred percent.

00:12:56.760 | I felt like I did this exercise, like honestly,

00:12:59.600 | probably more late December and then early January,

00:13:03.040 | 'cause I was just existentially worried about

00:13:05.360 | whether or not this would actually be a company at all.

00:13:08.160 | And okay, what were the key questions I was thinking about?

00:13:11.360 | And these were the same things that like other founders,

00:13:14.400 | investors, and also like friends would ask me is just like,

00:13:17.000 | okay, what happens if context windows get much bigger?

00:13:20.520 | What's the point of actually structuring data, right,

00:13:22.880 | in the right way, right?

00:13:24.920 | Why don't you just dump everything into the prompt?

00:13:27.040 | Fine tuning, like what if you just train the model

00:13:28.840 | over this data?

00:13:29.680 | And then, you know, what's the point of doing this stuff?

00:13:32.880 | And then some other ideas is what if like open AI

00:13:36.200 | actually just like takes this, like, you know,

00:13:39.880 | builds upwards on top of the,

00:13:42.480 | their existing like foundation models

00:13:43.920 | and starts building in some like built-in orchestration

00:13:46.280 | capabilities around stuff like rag and agents

00:13:48.200 | and those types of things.

00:13:49.160 | And so I basically ran through this mental exercise

00:13:51.200 | and, you know, I'm happy to talk a little bit more

00:13:53.520 | about those thoughts as well,

00:13:54.480 | but at a high level, well, context windows

00:13:57.400 | have gotten bigger,

00:13:58.240 | but there's obviously still a need for rag.

00:14:00.840 | I think rag is just like one of those things that like,

00:14:03.480 | in general, what people care about is yes,

00:14:05.480 | they do care about performance,

00:14:07.040 | but they also care about stuff like latency and costs.

00:14:09.280 | And my entire reasoning at the time was just like, okay,

00:14:12.320 | like, yes, maybe we'll have like much bigger context windows

00:14:15.760 | as we've seen with like 100K context windows,

00:14:17.760 | but for enterprises like, you know, data,

00:14:20.280 | which is not in just like the scale of like a few documents,

00:14:23.640 | it's usually in like gigabytes, terabytes, petabytes,

00:14:26.360 | like how do you actually just unlock language models

00:14:28.840 | over that data, right?

00:14:30.120 | And so it was clear there was just like,

00:14:32.480 | whether it's rag or some other paradigm,

00:14:34.480 | no one really knew what that answer was.

00:14:36.040 | And so there was clearly like technical opportunity here.

00:14:38.080 | Like there was just stacks that needed to be invented

00:14:40.160 | to actually solve this type of problem

00:14:41.920 | because language models themselves

00:14:43.000 | didn't have access to this data.

00:14:44.360 | And so if like you just dumped all this data into,

00:14:47.440 | let's say a model had like hypothetically

00:14:49.360 | an infinite context window, right?

00:14:50.800 | And you just dump like 50 gigabytes of data

00:14:52.960 | into the context window.

00:14:54.400 | That just seemed very inefficient to me

00:14:55.760 | because you have these network transfer costs

00:14:57.440 | of uploading 50 gigabytes of data

00:14:59.360 | to get back a single response.

00:15:01.040 | And so I kind of realized, you know,

00:15:03.000 | there's always gonna be some curve,

00:15:04.520 | regardless of like the performance

00:15:05.760 | of the best performing models,

00:15:07.120 | of like cost versus performance.

00:15:10.120 | And so what RAG does is it does provide extra data points

00:15:14.200 | along that access because you can kind of control

00:15:16.000 | the amount of context you actually want it to retrieve.

00:15:18.600 | And of course, like RAG as a term

00:15:20.640 | was still evolving back then,

00:15:21.720 | but it was just this whole idea of like,

00:15:23.200 | how do you just fetch a bunch of information

00:15:25.040 | to actually, you know, like stuff into the prompt.

00:15:27.360 | And so people, even back then,

00:15:28.880 | were kind of thinking about some of those considerations.

00:15:30.720 | - And then you fundraised in June,

00:15:33.000 | or you announced your fundraise in June.

00:15:34.320 | - Yeah.

00:15:35.160 | - With Greylock.

00:15:36.480 | How was that process?

00:15:37.800 | Just like take us through that process

00:15:40.720 | of thinking about the fundraise

00:15:42.080 | and your plans for the company, you know, at the time.

00:15:46.360 | - Yeah, definitely.

00:15:47.200 | I mean, I think we knew we wanted to,

00:15:48.760 | I mean, obviously we knew we wanted to fundraise.

00:15:50.360 | I think there was also a bunch of like investor interest

00:15:53.160 | and it was probably pretty unusual

00:15:54.520 | given the, you know, like hype wave of generative AI.

00:15:56.880 | So like a lot of investors were kind of reaching out

00:15:58.920 | around like December, January, February.

00:16:00.920 | In the end, we went with Greylock.

00:16:02.040 | Greylock's great.

00:16:02.880 | You know, they've been great partners so far.

00:16:04.640 | And like, to be honest,

00:16:06.160 | like there's a lot of like great VCs out there.

00:16:08.200 | And a lot of them who are specialized

00:16:09.720 | on like open source, data, infra, and that type of stuff.

00:16:13.200 | What we really wanted to do was,

00:16:15.280 | because for us, like time was of the essence,

00:16:17.280 | like we wanted to ship very quickly

00:16:19.000 | and still kind of build Mindshare in this space.

00:16:21.040 | We just kept the fundraising process very efficient.

00:16:23.000 | I think we basically did it in like a week

00:16:25.160 | or like three days, I think so.

00:16:27.200 | - Yeah.

00:16:28.040 | - Just like front loaded it.

00:16:29.400 | And then, and then just like--

00:16:31.400 | - We picked the one named Jerry.

00:16:32.720 | - Hey, yeah, exactly.

00:16:34.160 | (both laughing)

00:16:36.960 | - I'm kidding.

00:16:37.800 | Guys, I mean, he's obviously great

00:16:39.360 | and Greylock's fantastic for him.

00:16:41.280 | - Yeah, I know.

00:16:42.120 | And embedding some of my research.

00:16:43.480 | So yeah, just we picked Greylock.

00:16:46.560 | They've been great partners.

00:16:48.240 | I think in general, when I talk to founders

00:16:49.960 | about like the fundraise process,

00:16:51.880 | it's never like the most fun period, I think.

00:16:54.360 | Because it's always just like, you know,

00:16:56.000 | there's a lot of logistics, there's lawyers,

00:16:57.800 | you have to, you know, get in the loop.

00:16:59.720 | And then you, and like a lot of founders

00:17:01.520 | just want to go back to building.

00:17:02.760 | And so I think in the end,

00:17:04.200 | we're happy that we kept it to a pretty efficient process.

00:17:07.280 | - Cool.

00:17:08.120 | And so you fundraise with Simon, your co-founder.

00:17:10.240 | And how do you split things with him?

00:17:11.920 | How big is your team now?

00:17:13.120 | - The team is growing.

00:17:14.360 | By the time this podcast is released,

00:17:17.280 | we'll probably have had one more person join the team.

00:17:19.680 | And so basically, it's between,

00:17:22.840 | we're rapidly getting to like eight or nine people.

00:17:25.000 | At the current moment, we're around like six.

00:17:26.680 | And so just like, there'll be some exciting developments

00:17:29.720 | in the next few weeks.

00:17:30.560 | So I'm excited to kind of, to announce that.

00:17:34.040 | We've been pretty selective

00:17:36.280 | in terms of like how we like grow the team.

00:17:37.880 | Obviously, like we look for people that are really active

00:17:40.000 | in terms of contributions to Lum Index,

00:17:41.880 | people that have like very strong engineering backgrounds.

00:17:44.240 | And primarily, we've been kind of just looking for builders,

00:17:46.360 | people that kind of like grow the open source

00:17:47.800 | and also eventually this like managed

00:17:49.720 | like enterprise platform as well with us.

00:17:52.160 | In terms of like Simon,

00:17:53.480 | yeah, I've known Simon for a few years now.

00:17:55.080 | I knew him back at Uber ATG in Toronto.

00:17:57.640 | He's one of the smartest people I knew.

00:17:59.600 | Like has a sense of both like a deep understanding of ML,

00:18:04.600 | but also just like first principles thinking

00:18:06.440 | about like engineering and technical concepts in general.

00:18:09.120 | And I think one of my criteria is when I was like

00:18:10.920 | looking for a co-founder for this project

00:18:12.880 | was someone that was like technically better than me,

00:18:14.560 | 'cause I knew I wanted like a CTO.

00:18:16.280 | And so honestly, like there weren't a lot of people that,

00:18:19.080 | I mean, I know a lot of people that are smarter than me,

00:18:21.200 | but like that fit that bill.

00:18:22.400 | We're willing to do a startup

00:18:23.440 | and also just have the same like values that I shared, right?

00:18:26.200 | And just, I think doing a startup is very hard work, right?

00:18:28.760 | It's not like, I'm sure like you guys all know this.

00:18:31.240 | It's a lot of hours, a lot of late nights,

00:18:33.800 | and you want to be like in the same place together

00:18:36.360 | and just like being willing to hash out stuff

00:18:38.000 | and have that grit basically.

00:18:39.280 | And I really looked for that.

00:18:40.640 | And so Simon really fit that bill.

00:18:42.440 | And I think I convinced him to jump on board.

00:18:44.800 | - Yeah, yeah, nice job.

00:18:46.240 | And obviously I've had the pleasure of chatting

00:18:48.320 | and working with a little bit with both of you.

00:18:50.960 | What would you say those like your top like one

00:18:53.000 | or two values are when thinking about that

00:18:55.440 | or the culture of the company and that kind of stuff?

00:18:58.200 | - Yeah, well, I think in terms of the culture of the company

00:19:01.680 | it's really like, I mean, there's a few things

00:19:04.400 | I can name off the top of my head.

00:19:05.800 | One is just like passion, integrity.

00:19:08.720 | I think that's very important for us.

00:19:09.760 | We want to be honest.

00:19:10.600 | We don't want to like obviously like copy code

00:19:12.880 | or kind of like, you know, just like, you know

00:19:15.000 | not give attribution, those types of things.

00:19:16.600 | And just like be true to ourselves.

00:19:18.520 | I think we're all very like down to earth,

00:19:19.960 | like humble people.

00:19:20.800 | But obviously I think just willingness

00:19:22.840 | to just like own stuff and dive right in.

00:19:25.480 | And I think grit comes with that.

00:19:26.720 | I think in the end, like this is a very fast moving space

00:19:29.560 | and we want to just like be one of the, you know

00:19:32.200 | like dominant forces in helping to provide

00:19:33.920 | like production quality outline applications.

00:19:35.960 | - Yeah.

00:19:37.000 | So I promise we'll get to the more technical questions.

00:19:39.440 | But I also want to impress on the audience

00:19:42.160 | that this is a very conscious

00:19:43.600 | and intentional company building.

00:19:46.000 | And since your fundraising post, which was in June

00:19:51.000 | and now it's September, so it's been about three months.

00:19:53.760 | You've actually gained 50% in terms of stars and followers.

00:19:58.480 | You've 3X your download count to 600,000 a month

00:20:01.480 | and your Discord membership has reached 10,000.

00:20:03.600 | So like a lot of ongoing growth.

00:20:06.200 | - Yeah, definitely.

00:20:07.040 | And obviously there's a lot of room to expand there too.

00:20:09.600 | And so open source growth is gonna continue

00:20:12.280 | to be one of our core goals.

00:20:14.200 | 'Cause in the end, it's just like,

00:20:15.240 | we want this thing to be, well, one big, right?

00:20:17.480 | We all have like big ambitions,

00:20:18.960 | but to just like really provide value to developers

00:20:21.600 | and helping them in prototyping

00:20:23.840 | and also productionization of their apps.

00:20:25.720 | And I think it turns out we're in the fortunate circumstance

00:20:28.200 | where a lot of different companies and individuals, right?

00:20:31.120 | Are in that phase of like, you know

00:20:32.560 | maybe they've hacked around

00:20:33.600 | on some initial LLM applications,

00:20:35.560 | but they're also looking to, you know,

00:20:37.100 | start to think about what are the production grade

00:20:39.000 | challenges necessary to actually, you know,

00:20:41.520 | that to solve to actually make this thing robust

00:20:44.520 | and reliable in the real world.

00:20:45.680 | And so we want to basically provide the tooling to do that.

00:20:49.120 | And to do that, we need to both spread awareness

00:20:51.160 | and education of a lot of the key practices

00:20:52.800 | of what's going on.

00:20:53.800 | And so a lot of this is going to be continued growth,

00:20:55.960 | expansion and education.

00:20:56.960 | And we do prioritize that very heavily.

00:20:59.060 | - Awesome.

00:21:01.420 | Let's dive into some of the questions

00:21:02.920 | you were asking yourself initially around fine tuning

00:21:06.320 | and rag, how these things play together.

00:21:09.280 | You mentioned context.

00:21:11.320 | What is the minimum viable context for rag?

00:21:14.840 | So what's like a context window too small.

00:21:17.440 | And at the same time,

00:21:18.280 | maybe what's like a maximum context window.

00:21:21.280 | We talked before about the LLMs are U-shaped reasoners.

00:21:25.320 | So as the context got larger,

00:21:27.520 | like it really only focuses on the end

00:21:29.680 | and the start of the prompt

00:21:30.840 | and then it kind of peters down.

00:21:33.520 | Any learnings, any kind of like tips

00:21:36.800 | you want to give people as they think about it?

00:21:39.800 | - So this is a great question.

00:21:41.160 | And I think part of what I wanted to kind of like

00:21:45.280 | talk about a conceptual level,

00:21:46.600 | especially with the idea of like thinking about

00:21:48.360 | what is the minimum context?

00:21:49.640 | Like, okay, what if the minimum context was like 10 tokens

00:21:51.880 | versus like, you know, 2K tokens

00:21:53.520 | versus like a million tokens, right?

00:21:54.840 | Like, and what does that really give you?

00:21:56.400 | And what are the limitations if it's like 10 tokens?

00:21:58.760 | It's kind of like, like eight bit, 16 bit games, right?

00:22:02.560 | Like back in the day, like if you play Mario

00:22:04.640 | and you have like the initial Mario

00:22:06.440 | where the graphics were very blocky

00:22:07.840 | and now obviously it's like full HD, 3D,

00:22:10.120 | just the resolution of the context and the output will change

00:22:13.320 | depending on how much context you can actually fit in.

00:22:16.000 | The way I kind of think about this

00:22:17.240 | from a more principled manner is like,

00:22:18.560 | there's this concept of like information capacity,

00:22:22.320 | just this idea of like entropy,

00:22:23.720 | like given any fixed amount of like storage space,

00:22:27.080 | like how much information can you actually compact in there?

00:22:29.960 | And so basically a context window length

00:22:32.000 | is just like some fixed amount of storage space, right?

00:22:34.720 | And so there's some theoretical limit

00:22:36.760 | to the maximum amount of information

00:22:38.120 | you can compact into like a 4,000 token storage space.

00:22:40.920 | And what does that storage space use for these days

00:22:42.840 | with LLMs?

00:22:43.680 | It's for inputs and also outputs.

00:22:46.080 | And so this really controls a maximum amount of information

00:22:48.560 | you can feed in terms of the prompt

00:22:50.280 | plus the granularity of the output.

00:22:52.080 | If you had an infinite context window,

00:22:53.480 | you could have an infinitely detailed response

00:22:55.240 | and also infinitely detailed memory.

00:22:56.960 | But if you don't, you can only kind of represent stuff

00:22:58.960 | in more quantized bits, right?

00:23:01.040 | And so the smaller the context window,

00:23:03.200 | just generally speaking, the less details

00:23:05.320 | and maybe the less,

00:23:06.800 | and for like specific precise information

00:23:08.640 | are gonna be able to surface at any given point in time.

00:23:11.760 | - And when you have short context,

00:23:13.920 | is the answer just like get a better model

00:23:16.680 | or is the answer maybe, hey,

00:23:18.840 | there needs to be a balance between fine tuning and RAG

00:23:21.600 | to make sure you're gonna like leverage the context,

00:23:24.040 | but at the same time, don't keep it too low resolution?

00:23:27.240 | - Yeah, yeah.

00:23:28.080 | Well, there's probably some minimum threat.

00:23:29.400 | I don't think anyone wants to work with like a 10,

00:23:31.280 | I mean, that's just a thought exercise anyways,

00:23:33.120 | a 10 token context window.

00:23:34.480 | I think nowadays the modern context window

00:23:36.200 | is like 2K, 4K is enough for just like doing

00:23:38.640 | some sort of retrieval on granular context

00:23:40.400 | and be able to synthesize information.

00:23:42.120 | I think for most intents and purposes,

00:23:44.000 | that level of resolution is probably fine for most people,

00:23:46.160 | for most use cases.

00:23:48.040 | I think the limitation is actually more on,

00:23:50.480 | okay, if you're gonna actually combine this thing

00:23:52.520 | with some sort of retrieval data structure mechanism,

00:23:55.240 | there's just limitations on the retrieval side

00:23:58.000 | because maybe you're not actually fetching

00:23:59.600 | the most relevant context

00:24:00.800 | to actually answer this question, right?

00:24:02.360 | Like, yes, like given the right context,

00:24:04.560 | 4,000 tokens is enough,

00:24:05.880 | but if you're just doing like top case similarity,

00:24:07.720 | like you might not be fetching the right information

00:24:10.000 | from the documents.

00:24:11.160 | - Yeah, so how should people think about

00:24:13.160 | when to stick with RAG

00:24:15.560 | versus when to even entertain fine tuning?

00:24:18.720 | And also in terms of what's like the threshold

00:24:22.000 | of data that you need to actually worry about fine tuning

00:24:25.040 | versus like just stick with RAG.

00:24:26.320 | Obviously you're biased

00:24:27.320 | because you're building a RAG company, but-

00:24:28.880 | - No, no, actually,

00:24:31.000 | I think I have like a few hot takes in here,

00:24:32.600 | some of which sound like a little bit contradictory

00:24:34.400 | to what we're actually building.

00:24:35.640 | To be honest, I don't think anyone knows the right answer.

00:24:37.400 | I think this is just- - We're pursuing the truth.

00:24:38.840 | - Yeah, exactly.

00:24:39.680 | This is just like thought exercise

00:24:40.840 | towards like understanding the truth, right?

00:24:42.120 | So I think, okay, I have a few hot takes.

00:24:45.160 | One is like RAG is basically just a hack,

00:24:47.320 | but it turns out it's a very good hack

00:24:49.240 | because what is RAG?

00:24:51.080 | RAG is you keep the model fixed

00:24:52.400 | and you just figure out a good way

00:24:53.320 | to like stuff stuff into the prompt of the language model.

00:24:56.440 | Everything that we're doing nowadays

00:24:58.000 | in terms of like stuffing stuff into the prompt

00:24:59.960 | is just algorithmic.

00:25:01.000 | We're just figuring out nice algorithms

00:25:02.720 | to like retrieve right information with top case similarity,

00:25:06.920 | do some sort of like hybrid search,

00:25:08.880 | some sort of like a chain of thought decomp,

00:25:10.600 | and then just like stuff stuff into the prompt.

00:25:12.520 | So it's all like algorithmic,

00:25:13.800 | and it's more like just software engineering

00:25:17.320 | to try to make the most out of these like existing APIs.

00:25:19.960 | The reason I say it's a hack

00:25:21.160 | is just like from a pure like optimization standpoint,

00:25:23.680 | if you think about this from like the machine learning lens,

00:25:25.640 | unless the software engineering lens,

00:25:27.280 | there's pieces in here

00:25:28.120 | that are gonna be like suboptimal, right?

00:25:29.600 | Like, obviously, like the thing about machine learning

00:25:32.160 | is when you optimize like some system

00:25:34.760 | that can be optimized within machine learning,

00:25:36.680 | like the set of parameters,

00:25:38.360 | you're really like changing like the entire system's weights

00:25:41.600 | to try to optimize the subjective function.

00:25:44.160 | And if you just cobble a bunch of stuff together,

00:25:46.480 | you can't really optimize the pieces are inefficient, right?

00:25:49.360 | And so like a retrieval interface,

00:25:51.000 | like doing top can embedding lookup,

00:25:52.920 | that part is inefficient,

00:25:54.800 | because there might be potentially a better,

00:25:56.880 | more learned retrieval algorithm that's better.

00:25:59.680 | If you kind of do stuff like some sort of,

00:26:04.240 | I know nowadays there's this concept

00:26:05.800 | of how do you do like short-term or long-term memory, right?

00:26:08.120 | Like represent stuff in some sort of vector embedding,

00:26:10.440 | do chunk sizes, all that stuff.

00:26:12.320 | It's all just like decisions that you make

00:26:14.040 | that aren't really optimized, right?

00:26:15.920 | It's more, and it's not really automatically learned,

00:26:18.520 | it's more just things that you set beforehand

00:26:20.440 | to actually feed into the system.

00:26:22.160 | There's a lot of room to actually optimize

00:26:24.760 | the performance of an entire LLM system,

00:26:27.680 | potentially in a more like machine learning base way, right?

00:26:30.080 | And I will leave room for that.

00:26:31.920 | And this is also why I think like in the long-term,

00:26:34.880 | like I do think fine tuning will probably have

00:26:37.240 | like greater importance and just like,

00:26:41.080 | there will probably be new architectures invented

00:26:43.920 | that where you can actually kind of like include

00:26:46.360 | a lot of this under the black box,

00:26:48.000 | as opposed to having like hobbling together

00:26:50.200 | a bunch of components outside the black box.

00:26:52.040 | That said, just very practically,

00:26:54.280 | given the current state of things,

00:26:55.600 | like even if I said RAG is a hack,

00:26:57.160 | it's a very good hack

00:26:58.000 | and it's also very easy to use, right?

00:26:59.320 | And so just like for kind of like the AI engineer persona,

00:27:02.440 | that like, which to be fair is kind of one of the reasons

00:27:06.400 | generative AI has gotten so big,

00:27:08.040 | is because it's way more accessible for everybody

00:27:10.440 | to get into, as opposed to just like

00:27:12.520 | traditional machine learning.

00:27:14.640 | It tends to be good enough, right?

00:27:16.040 | And if we can basically provide these existing techniques

00:27:18.440 | to help people really optimize how to use existing systems

00:27:21.720 | without having to really deeply understand machine learning,

00:27:24.040 | I still think that's a huge value add.

00:27:25.680 | And so there's very much like a UX

00:27:27.560 | and ease of use problem here,

00:27:28.880 | which is just like RAG is way easier to onboard and use.

00:27:31.960 | And that's probably like the primary reason

00:27:34.120 | why everyone shouldn't do RAG

00:27:35.160 | instead of fine tuning to begin with.

00:27:37.000 | If you think about like the 80/20 rule,

00:27:38.520 | like RAG very much fits within that

00:27:39.880 | and fine tuning doesn't really right now.

00:27:41.640 | And then I'm just kind of like leaving room for the future

00:27:44.400 | that, you know, like in the end,

00:27:46.000 | fine tuning can probably take over some of the aspects

00:27:48.680 | of like what RAG does.

00:27:50.680 | - I don't know if this is mentioned in your recap there,

00:27:54.200 | but explainability also allows for sourcing.

00:27:58.280 | And like at the end of the day,

00:28:00.040 | like to increase trust, we have to source documents.

00:28:03.320 | - Yeah, so I think what RAG does

00:28:05.680 | is it increases like transparency,

00:28:07.560 | visibility into the actual documents

00:28:09.480 | that are getting fed into their contacts.

00:28:10.920 | - Here's where they got it from.

00:28:11.840 | - Exactly, and so that's definitely an advantage.

00:28:14.040 | I think the other piece that I think is an advantage,

00:28:15.840 | and I think that's something

00:28:16.680 | that someone actually brought up,

00:28:18.000 | is just you can do access control with RAG

00:28:21.760 | if you have an external source system.

00:28:23.320 | You can't really do that with large language models,

00:28:26.400 | which is like gate information to the neural net weights,

00:28:28.880 | like depending on the type of user.

00:28:31.240 | For the first point, you could technically, right,

00:28:35.120 | you could technically have the language model,

00:28:37.800 | like if it memorized enough information,

00:28:39.680 | just like a site sources,

00:28:41.240 | but there's a question of just trust.

00:28:42.600 | Whether or not you're accurate.

00:28:44.160 | - Yeah, well, but like it makes it up right now

00:28:46.640 | 'cause it's like not good enough,

00:28:47.720 | but imagine a world where it is good enough

00:28:49.200 | and it does give accurate citations.

00:28:51.400 | - Yeah, no, I think to establish trust,

00:28:53.040 | you just need a direct connection.

00:28:54.840 | So it's kind of weird, it's this melding of,

00:28:58.280 | you know, deep learning systems

00:29:00.480 | versus very traditional information retrieval.

00:29:03.960 | - Yeah, exactly.

00:29:05.200 | So I think, I mean, I kind of think about it

00:29:06.760 | as analogous to like humans, right?

00:29:08.280 | Like we as humans, obviously we use the internet,

00:29:10.960 | we use tools.

00:29:11.960 | These tools have API interfaces are well-defined.

00:29:14.400 | And obviously we're not, like the tools aren't part of us.

00:29:17.560 | And so we're not like back propping

00:29:19.000 | or optimizing over these tools.

00:29:20.600 | And so kind of when you think about like RAG,

00:29:22.920 | it's basically LLM is learning how to use

00:29:26.120 | like a vector database to look up information

00:29:28.160 | that it doesn't know.

00:29:29.200 | And so then there's just a question of like

00:29:30.720 | how much information is inherent within the network itself

00:29:33.280 | and how much does it need to do some sort of like tool

00:29:35.280 | used to look up stuff that it doesn't know.

00:29:36.840 | And I do think there'll probably be more and more

00:29:38.840 | of that interplay as time goes on.

00:29:40.640 | - Yeah.

00:29:41.960 | Some follow-ups on discussions that we've had.

00:29:45.200 | So, you know, we discussed fine tuning a bit

00:29:47.880 | and what's your current take on whether you can fine tune

00:29:50.840 | new knowledge into LLMs?

00:29:52.440 | - That's one of those things

00:29:53.280 | where I think long-term you definitely can.

00:29:55.400 | I think some people say you can't, I disagree.

00:29:57.520 | I think you definitely can.

00:29:58.640 | Just right now I haven't gotten it to work yet.

00:30:00.000 | So, so I think like- - You've tried.

00:30:01.480 | - Yeah, well, not in a very principled way, right?

00:30:04.120 | Like this is something that requires

00:30:05.400 | like an actual research scientist

00:30:06.600 | and not someone that has like, you know,

00:30:07.780 | an hour or two per night to actually get this.

00:30:09.760 | - Like you were a research scientist at Uber.

00:30:11.400 | - Yeah, yeah, but it's like full-time, full-time work.

00:30:14.000 | So I think what I specifically concretely did

00:30:16.880 | was I took OpenAI's fine tuning endpoints

00:30:18.800 | and then tried to, you know,

00:30:20.160 | it's in like a chat message interface.

00:30:21.880 | And so there's like a user assistant message format.

00:30:24.440 | And so what I did was I tried to take just some piece

00:30:26.400 | of text and have the LLM memorize it

00:30:28.480 | by just asking it a bunch of questions about the text.

00:30:30.680 | So given a bunch of contexts,

00:30:31.920 | I would generate some questions

00:30:33.400 | and then generate some response

00:30:34.560 | and just fine tune over the question responses.

00:30:37.200 | That hasn't really worked super well.

00:30:39.640 | But that's also because I'm just like trying

00:30:41.760 | to like use OpenAI's endpoints as is.

00:30:44.640 | If you just think about like traditional,

00:30:46.000 | like how you train a Transformers model,

00:30:48.640 | there's kind of like the instruction

00:30:51.120 | like fine tuning aspect, right?

00:30:52.640 | You like kind of ask it stuff

00:30:55.040 | and guide it with correct responses,

00:30:56.360 | but then there's also just like next token production.

00:30:58.940 | And that's something that you can't really do

00:31:01.360 | with the OpenAI API,

00:31:02.600 | but you can do with if you just trained it yourself.

00:31:04.880 | And that's probably possible

00:31:06.280 | if you just like train it over some corpus of data.

00:31:08.520 | I think Shashira from Berkeley said like,

00:31:10.360 | you know, when they trained Gorilla,

00:31:11.360 | they were like, oh, you know this,

00:31:12.640 | a lot of these LLMs are actually pretty good

00:31:14.280 | at memorizing information.

00:31:16.060 | Just the way the API interface is exposed

00:31:18.800 | is just no one knows how to use them right now, right?

00:31:21.160 | And so I think that's probably one of the issues.

00:31:23.560 | - Just to clue people in who haven't read the paper,

00:31:25.580 | Gorilla is the one where they train to use specific APIs?

00:31:29.700 | - Yeah, yeah.

00:31:30.540 | And I think they also did something

00:31:31.680 | where like the model itself could learn to,

00:31:36.040 | yeah, I think this was on the Gorilla paper.

00:31:37.600 | Like the model itself could try to learn some prior

00:31:41.680 | over the data to decide like what tool to pick.

00:31:44.080 | But there's also, it's also augmented with retrieval

00:31:46.480 | that helps supplement it

00:31:47.640 | in case like the prior doesn't actually work.

00:31:51.640 | - Is that something that you'd be interested in supporting?

00:31:54.520 | - I mean, I think in the longterm,

00:31:55.680 | like if like this is kind of how fine-tuning like RAG

00:31:58.880 | evolves, like I do think there'll be some aspect

00:32:01.200 | where fine-tuning will probably memorize

00:32:02.600 | some high-level concepts of knowledge,

00:32:04.160 | but then like RAG will just be there to supplement

00:32:06.460 | like aspects that it doesn't know, yeah.

00:32:08.500 | - Yeah, awesome.

00:32:09.880 | - Obviously RAG is the default way.

00:32:11.680 | Like to be clear, RAG right now is the default way

00:32:13.360 | to actually augment stuff with knowledge.

00:32:15.600 | I think it's just an open question

00:32:16.720 | of how much the LLM can actually internalize

00:32:19.560 | both high-level concepts, but also details

00:32:22.280 | as you can like train stuff over it.

00:32:24.540 | And coming from an ML background,

00:32:26.680 | like there is a certain beauty in just baking everything

00:32:29.560 | into some training process of a language model.

00:32:33.120 | Like if you just take raw chat GPT

00:32:35.800 | or chat GPT code interpreter, right, like GPT-4,

00:32:38.460 | it's not like you do RAG with it.

00:32:40.340 | You just ask it questions about like,

00:32:42.020 | "Hey, how do I like define a pedantic model in Python?"

00:32:44.300 | And then like, "Can you give me an example?

00:32:45.660 | Can you visualize a graph?"

00:32:46.500 | It just does it, right?

00:32:47.820 | And we'll run it through code interpreters as a tool,

00:32:49.860 | but that's not like a source for knowledge.

00:32:50.980 | It's just an execution environment.

00:32:52.620 | And so there is some beauty in just like

00:32:54.420 | having the model itself, like just, you know,

00:32:56.980 | instead of you kind of defining the algorithm

00:32:58.700 | for what the data structure should look like,

00:33:00.260 | the model just learns it under the hood.

00:33:02.440 | That said, I think the reason it's not a thing right now

00:33:04.860 | is just like, no one knows how to do it.

00:33:06.660 | It probably costs too much money.

00:33:07.900 | And then also like the API interfaces

00:33:10.900 | and just like the actual like ability

00:33:14.100 | to kind of evaluate and improve on performance,

00:33:17.220 | like isn't known to most people.

00:33:18.960 | - Yeah.

00:33:20.420 | It also would be better with browsing.

00:33:22.180 | (laughs)

00:33:23.540 | - Yeah.

00:33:24.380 | - I wonder when they're going to put that back.

00:33:25.820 | - Okay.

00:33:26.660 | - Okay, cool.

00:33:28.860 | Yeah, so, and then one more follow-up

00:33:30.140 | before we go into RAG for AI engineers

00:33:32.100 | is on your brief mention about security or auth.

00:33:36.660 | And how many of the people that you talk to,

00:33:40.220 | you know, you talk to a lot of people

00:33:41.820 | putting Lama Index into production,

00:33:44.260 | how many people actually are there

00:33:46.300 | versus just like,

00:33:47.140 | let's just dump a whole company notion into this thing.

00:33:49.500 | - Wait, you're talking about

00:33:50.340 | from like the security auth standpoint?

00:33:51.900 | - Yeah, how big a need is that?

00:33:54.140 | Because I talked to some people

00:33:56.300 | who are thinking about building tools in that domain,

00:33:59.540 | but I don't know if people want it.

00:34:01.180 | I mean, I think bigger companies,

00:34:02.380 | like just bigger companies, like banks, consulting firms,

00:34:05.900 | like they all want this.

00:34:07.020 | - Yes, it's a requirement, right?

00:34:08.900 | - The way they're using Lama Index

00:34:10.540 | is not with this, obviously,

00:34:12.540 | 'cause I don't think we have support

00:34:13.580 | for like access control or author

00:34:15.060 | or that type of stuff like on a hood,

00:34:16.060 | 'cause we're more just like an orchestration framework.

00:34:18.580 | And so the way they do it,

00:34:19.740 | they build these initial apps

00:34:21.220 | is more kind of like prototype,

00:34:22.900 | like let's kind of, yeah,

00:34:24.060 | like, you know, use some publicly available data

00:34:25.940 | that's not super sensitive.

00:34:27.060 | Let's like, you know, assume that every user

00:34:28.700 | is going to be able to have access

00:34:29.660 | to the same amount of knowledge,

00:34:30.780 | those types of things.

00:34:32.180 | I think users have asked for it,

00:34:33.620 | but I don't think that's like a P zero.

00:34:35.300 | Like, I think the P zero is more on like,

00:34:37.460 | can we get this thing working

00:34:38.500 | before we expand this to like more users within the work?

00:34:41.340 | - Yep.

00:34:42.340 | - Cool.

00:34:43.420 | So there's a bunch of pieces to RAG, obviously.

00:34:46.580 | It's not just an acronym.

00:34:48.420 | And you tweeted recently,

00:34:49.820 | you think every AI engineer

00:34:51.540 | should build it from scratch at least once.

00:34:53.820 | Why is that?

00:34:56.020 | - I think so.

00:34:57.340 | I'm actually kind of curious

00:34:58.620 | to hear your thoughts about this,

00:35:00.100 | but this kind of relates to the initial

00:35:02.060 | like AI engineering posts that you put out.

00:35:04.380 | And then also just like the role of an AI engineer

00:35:06.940 | and the skills that they're going to have to learn

00:35:08.500 | to truly succeed.

00:35:10.140 | 'Cause there's an entire spectrum.

00:35:11.500 | On one end, you have people

00:35:12.540 | that don't really like understand the fundamentals

00:35:15.620 | and just want to use this

00:35:16.460 | to like cobble something together to build something.

00:35:19.100 | And I think there is a beauty in that for what it's worth.

00:35:20.780 | Like it's just one of those things.

00:35:21.820 | And Gen AI has made it

00:35:23.580 | so that you can just use these models

00:35:25.260 | in inference only mode,

00:35:26.260 | cobble something together,

00:35:27.220 | use it to power your app experiences.

00:35:28.940 | On the other end, what we're increasingly seeing

00:35:32.020 | is that like more and more developers

00:35:33.900 | building with these apps

00:35:34.940 | start running into honestly like pretty similar issues

00:35:37.900 | that like will play just a standard ML engineer

00:35:40.380 | building like a classifier model,

00:35:41.540 | which is just like accuracy problems,

00:35:43.460 | like and hallucinations,

00:35:44.460 | basically just an accuracy problem, right?

00:35:45.740 | Like it's not giving you the right results.

00:35:47.020 | So what do you do?

00:35:47.860 | You have to iterate on the model itself.

00:35:49.860 | You have to figure out what parameters you tweak.

00:35:51.420 | You have to gain some intuition about this entire process.

00:35:53.980 | That workflow is pretty similar, honestly,

00:35:56.500 | like even if you're not training the model

00:35:58.020 | to just like tuning an ML model with like hyper parameters

00:36:00.940 | and learning like proper ML practices of like,

00:36:03.860 | okay, how do I have like define a good evaluation benchmark?

00:36:06.940 | How do I define like the right set of metrics to use, right?

00:36:09.340 | How do I actually iterate

00:36:10.540 | and improve the performance of this pipeline for production?

00:36:13.260 | What tools do I use, right?

00:36:14.420 | Like every ML engineer use like some form of weights

00:36:16.900 | and biases, TensorBoards,

00:36:18.020 | or like some other experimentation tracking tool.

00:36:20.500 | Like what tools should I use

00:36:22.620 | to actually help build like LLM applications

00:36:24.780 | and optimize it for production?

00:36:26.420 | There's like a certain amount of just like LLM ops,

00:36:29.260 | like tooling and concepts and just like practices

00:36:31.900 | that people will kind of have to internalize

00:36:33.220 | if they want to optimize these.

00:36:34.340 | And so I think that the reason I think like being able

00:36:37.580 | to build like RAG from scratch is important

00:36:39.860 | is it really gives you a sense of like how things are working

00:36:42.660 | to help you build intuition

00:36:44.060 | about like what parameters are within a RAG system

00:36:46.780 | and which ones actually tweak to make them better.

00:36:48.540 | One of the advantages of Lomindex,

00:36:50.380 | the Lomindex quick start is it's three lines of code.

00:36:53.820 | The downside of that is you have zero visibility

00:36:56.180 | into what's actually going on under the hood.

00:36:57.820 | And I think this is something

00:36:58.700 | that we've kind of been thinking about for a while.

00:37:00.300 | And I'm like, okay, let's just release

00:37:01.580 | like a new tutorial series.

00:37:02.740 | That's just like, no three lines of code.

00:37:05.180 | We're just gonna go in and actually show you

00:37:06.420 | how the thing actually works under the hood, right?

00:37:08.300 | And so like, does everybody need this?

00:37:10.780 | Like probably not.

00:37:11.980 | Like as for some people, the three lines of code might work.

00:37:14.940 | But I think increasingly, like honestly,

00:37:17.540 | 90% of the users I talk to have questions

00:37:19.580 | about how to improve the performance of their app.

00:37:21.100 | And so just like, given this is just like one of those things

00:37:23.220 | that's like better for the understanding.

00:37:24.860 | - Yeah, I'd say it is one of the most useful tools

00:37:28.740 | of any sort of developer education toolkit

00:37:31.820 | to write things yourself from scratch.

00:37:35.460 | So Kelsey Hightower famously wrote

00:37:38.100 | Kubernetes the hard way, which is don't use Kubernetes.

00:37:40.980 | Just like do everything.

00:37:42.180 | Here's everything that you would have to do by yourself.

00:37:44.940 | And you should be able to put all these things together

00:37:47.220 | yourself to understand the value of Kubernetes.

00:37:50.300 | And the same thing for Lomindex.

00:37:51.620 | I've done, I was the guy who did the same for React.

00:37:54.820 | And yeah, it's pretty, well, it's pretty good exercise

00:37:57.780 | for you to just fully understand everything

00:37:59.180 | that's going on under the hood.

00:38:00.780 | And I was actually gonna suggest,

00:38:03.180 | well, in one of the previous conversations,

00:38:05.700 | you know, there's all these like hyperparameters,

00:38:07.260 | like the size of the chunks and all that.

00:38:09.420 | And I was thinking like, you know,

00:38:12.860 | what would hyperparameter optimization for RAG look like?

00:38:17.860 | - Yeah, definitely.

00:38:18.780 | I mean, so absolutely.

00:38:20.060 | I think that's gonna be an increasing thing.

00:38:22.060 | I think that's something we're kind of looking at.

00:38:23.380 | - I think someone should just put,

00:38:24.780 | do like some large scale study

00:38:26.420 | and then just ablate everything.

00:38:27.980 | And just, you tell us.

00:38:29.380 | - I think it's gonna be hard to find a universal default

00:38:33.580 | that works for everybody.

00:38:34.580 | I think it's gonna be somewhat-

00:38:35.420 | - Are you telling me it depends?

00:38:36.260 | - Boo!

00:38:37.100 | - I do think it's gonna be somewhat dependent

00:38:41.860 | on the data and use case.

00:38:42.860 | I think if there was a universal default,

00:38:44.460 | that'd be amazing.

00:38:45.500 | But I think increasingly we found, you know,

00:38:47.380 | people are just defining their own like custom parsers

00:38:50.100 | for like PDFs, markdown files for like, you know,

00:38:52.580 | SCC filings versus like, you know, Slack conversations.

00:38:56.220 | And then like the use case too,

00:38:57.940 | like, do you want like a summarization,

00:38:59.660 | like the granularity of the response?

00:39:01.180 | Like it really affects the parameters that you wanna pick.

00:39:03.420 | And so I do like the idea

00:39:05.620 | of hyperparameter optimization though.

00:39:06.900 | But it's kind of like one of those things

00:39:07.860 | where you are kind of like training the model basically,

00:39:10.820 | kind of on your own data domain.

00:39:12.580 | - Yeah.

00:39:13.540 | You mentioned custom parsers.

00:39:14.940 | You've designed LamaIndex.

00:39:16.300 | Maybe we can talk about like the surface area

00:39:17.900 | of the framework.

00:39:19.140 | You designed LamaIndex in a way that it's more modular.

00:39:21.660 | Yeah, like you mentioned.

00:39:23.340 | How would you describe the different components

00:39:26.420 | and what's customizable in each?

00:39:29.260 | - Yeah, I think they're all customizable.

00:39:30.740 | And I think that there is a certain burden on us

00:39:33.580 | to make that more clear through the docs.

00:39:35.860 | - Well, number four is customization tutorials.

00:39:38.020 | - Yeah, yeah.

00:39:38.860 | But I think like just in general,

00:39:40.100 | I think we do try to make it so that

00:39:42.180 | you can plug in the out of the box stuff.

00:39:43.780 | But like if you want to kind of customize

00:39:47.340 | more lower level components,

00:39:48.620 | like we definitely encourage you to do that

00:39:50.380 | and plug it into the rest of our abstractions.

00:39:52.020 | So let me just walk through

00:39:52.860 | like maybe some of the basic components of LamaIndex.

00:39:54.500 | There's data loaders.

00:39:55.380 | You can load data from different data sources.

00:39:57.100 | We have LamaHub, which you guys brought up,

00:39:58.660 | which is a collection of different data loaders

00:40:01.420 | of like unstructured and unstructured data,

00:40:04.180 | like PDFs, file types, like Slack, Notion, all that stuff.

00:40:08.420 | Now you load in this data.

00:40:10.380 | We have a bunch of like parsers and transformers.

00:40:12.500 | You can split the text.

00:40:13.420 | You can add metadata to the text

00:40:15.260 | and then basically figure out a way to load it

00:40:17.460 | into like a vector store.

00:40:19.220 | So, I mean, you worked at like Airbrite, right?

00:40:20.940 | It's kind of like there is some aspect like E and T, right?

00:40:23.380 | And in terms of like transforming this data.

00:40:25.500 | And then the L, right?

00:40:26.900 | Loading it into some storage abstraction,

00:40:28.420 | we have like a bunch of integrations

00:40:29.740 | with different document storage systems.

00:40:31.700 | So that's data.

00:40:34.060 | And then the second piece really is about like,

00:40:36.660 | how do you retrieve this data?

00:40:38.940 | How do you like synthesize this data?

00:40:41.020 | And how do you like do some sort of

00:40:42.380 | higher level reasoning over this data?

00:40:44.220 | So retrieval is one of the core abstractions that we have.

00:40:46.900 | We do encourage people to like customize,

00:40:48.580 | find your own retrievers.

00:40:49.940 | That's why we have that section on kind of like

00:40:51.460 | how do you define your own like customer retriever,

00:40:53.140 | but also we have like out of the box ones.

00:40:55.300 | The retrieval algorithm kind of depends

00:40:58.100 | on how you structure the data, obviously.

00:40:59.700 | Like if you just flat index everything

00:41:01.300 | with like chunks with like embeddings,

00:41:03.140 | then you can really only do like top K like lookup

00:41:05.980 | plus maybe like keyword search or something.

00:41:08.540 | But if you can index it in some sort of like hierarchy,

00:41:10.660 | like defined relationships,

00:41:11.660 | you can do more interesting things,

00:41:12.780 | like actually traverse relationships between nodes.

00:41:16.140 | Then after you have this data,

00:41:17.580 | how do you like synthesize the data, right?

00:41:19.580 | And this is the part where you feed it

00:41:21.060 | into the language model.

00:41:22.700 | There's some response abstraction that can abstract away

00:41:25.260 | over like long context to actually still give you a response

00:41:28.100 | even if the context overflows the context window.

00:41:30.420 | And then there's kind of these like higher level

00:41:32.340 | like reasoning primitives that I'm gonna define broadly.

00:41:35.820 | And I'm just gonna call them in some general bucket

00:41:38.220 | of like agents,

00:41:39.340 | even though everybody has different definitions of agents.

00:41:41.820 | And agents-

00:41:42.700 | - But you're the first to data agents,

00:41:44.300 | which I was very excited about.

00:41:45.300 | - Yeah, we kind of like coined that term.

00:41:47.060 | And the way we thought about it was,

00:41:49.020 | we wanted to think about how to use agents

00:41:51.060 | for like data workflows basically.

00:41:53.140 | And so what are the reasoning primitives

00:41:54.700 | that you wanna do?

00:41:55.580 | So the most simple reasoning primitive you can do

00:41:57.260 | is some sort of routing module.

00:41:58.540 | Like you can just, it's a classifier.

00:42:00.500 | Like given a query,

00:42:01.620 | just make some automated decision

00:42:02.980 | on what choice to pick, right?

00:42:04.940 | You could use LLMs.

00:42:05.780 | You don't have to use LLMs.

00:42:06.740 | You could just train a classifier basically.

00:42:08.660 | That's something that we might actually explore.

00:42:10.820 | And then the next piece is,

00:42:12.780 | okay, what are some higher level things?

00:42:14.620 | You can have the LLM like define like a query plan, right?

00:42:17.820 | To actually execute over the data.

00:42:19.940 | You can do some sort of while loop, right?

00:42:21.740 | That's basically what an agent loop is,

00:42:23.260 | which is like React, tree of thoughts,

00:42:26.540 | like chain of thought,

00:42:27.380 | like the open AI function calling like while loop

00:42:30.140 | to try to like take a question

00:42:31.460 | and try to break it down into some series of steps

00:42:34.340 | to actually try to execute to get back a response.

00:42:36.660 | And so there's a range in complexity

00:42:38.340 | from like simple reasoning primitives to more advanced ones.

00:42:40.620 | And I think that's the way we kind of think about it

00:42:42.620 | is like which ones should we implement

00:42:44.980 | and how do they work well?

00:42:45.980 | Like, do they work well over like the types of like data

00:42:48.500 | tasks that we give them?

00:42:49.460 | - How do you think about optimizing each piece?

00:42:51.940 | So take embedding models as one piece of it.

00:42:55.580 | You offer fine tuning embedding models.

00:42:58.620 | And I saw it was like fine tuning

00:43:00.100 | gives you like 5, 10% increase.

00:43:02.180 | What's kind of like the Delta left on the embedding side?

00:43:05.900 | Do you think we can get models that are like a lot better?

00:43:08.220 | Do you think like that's one piece

00:43:09.620 | where people should really not spend too much time?

00:43:13.300 | - I mean, I think they should.

00:43:14.860 | I just think it's not the only parameter

00:43:17.020 | 'cause I think in the end,

00:43:18.340 | if you think about everything that goes into retrieval,

00:43:21.900 | the chunking algorithm,

00:43:23.180 | how you define like metadata, right?

00:43:26.300 | We'll bias your embedding representations.

00:43:28.180 | Then there's the actual embedding model itself,

00:43:30.020 | which is something that you can try optimizing.

00:43:31.900 | And then there's like the retrieval algorithm.

00:43:33.420 | Are you gonna just do top K?

00:43:34.500 | Are you gonna do like hybrid search?

00:43:35.620 | Are you gonna do auto retrieval?

00:43:36.660 | Like there's a bunch of parameters.

00:43:37.900 | And so I do think it's something everybody should try.

00:43:40.900 | I think by default, we use like OpenAI's embedding model.

00:43:44.740 | A lot of people these days use like sentence transformers

00:43:46.940 | because it's just like free open source

00:43:48.780 | and you can actually optimize, directly optimize it.

00:43:51.340 | This is an active area of exploration.

00:43:54.540 | I do think one of our goals is

00:43:56.420 | it should ideally be relatively free for every developer

00:44:00.580 | to just run some fine tuning process over their data

00:44:03.060 | to squeeze out some more points and performance.

00:44:04.860 | And if it's that relatively free and there's no downsides,

00:44:06.980 | everybody should basically do it.

00:44:08.900 | There's just some complexities

00:44:10.460 | in terms of optimizing your embedding model,

00:44:12.220 | especially in a production grade data pipeline.

00:44:14.260 | If you actually fine tune the embedding model

00:44:17.220 | and the embedding space changes,

00:44:18.380 | you're gonna have to re-index all your documents.

00:44:20.300 | And for a lot of people, that's not feasible.

00:44:22.300 | And so I think like Joe from Vespa on our webinars,

00:44:25.460 | there's this idea that depending on kind of like,

00:44:29.060 | if you're just using like document and query embeddings,

00:44:32.220 | you could keep the document embeddings frozen

00:44:34.700 | and just train a linear transform on the query

00:44:36.660 | or any sort of transform on the query, right?

00:44:38.780 | So therefore it's just a query side transformation

00:44:40.900 | instead of actually having to re-index

00:44:42.540 | all the document embeddings.

00:44:44.300 | The other piece is- - Wow, that's pretty smart.

00:44:46.100 | - Yeah, yeah, so I think we weren't able

00:44:48.940 | to get like huge performance gains there,

00:44:50.340 | but it does like improve performance a little bit.

00:44:52.260 | And that's something that basically,

00:44:53.900 | everybody should be able to kick off.

00:44:55.020 | You can actually do that on Llama Index too.

00:44:56.660 | - OpenAI has a cookbook on adding bias

00:44:59.060 | to the embeddings too, right?

00:45:00.940 | - Yeah, yeah, I think so.

00:45:02.540 | Yeah, there's just like different parameters

00:45:03.940 | that you can try adding

00:45:05.380 | to try to like optimize the retrieval process.

00:45:07.580 | And the idea is just like, okay, by default,

00:45:10.620 | you have all this text,

00:45:11.660 | it kind of lives in some latent space, right?

00:45:15.380 | - Shut out, shut out latent space.

00:45:17.220 | You should take a drink every time.

00:45:18.460 | - Yeah, but it lives in some latent space.

00:45:22.500 | But like depending on the specific types of questions

00:45:25.420 | that the user might wanna ask,

00:45:26.860 | the latent space might not be optimized, right?

00:45:29.180 | For actual, like to actually retrieve

00:45:32.980 | the relevant piece of context that the user wanna ask.

00:45:34.740 | So can you shift the embedding points a little bit, right?

00:45:37.220 | And how do we do that basically?

00:45:38.420 | That's really a key question here.

00:45:39.900 | So optimizing the embedding model,

00:45:41.820 | even changing the way you like chunk things,

00:45:43.500 | these all shift the embeddings.

00:45:44.620 | - So the retrieval is interesting.

00:45:46.340 | I got a bunch of startup pitches that are like,

00:45:48.580 | like rag is cool, but like there's a lot of stuff

00:45:52.180 | in terms of ranking that could be better.

00:45:54.300 | There's a lot of stuff in terms of sunsetting data

00:45:57.980 | once it starts to become stale, that could be better.

00:46:00.620 | Are you gonna move into that part too?

00:46:03.740 | So like you have SEC Insights as one of kind of like

00:46:06.260 | your demos and that's like a great example of,

00:46:08.860 | hey, I don't wanna embed all the historical documents

00:46:12.020 | because a lot of them are outdated

00:46:13.500 | and I don't want them to be in the context.

00:46:15.820 | What's that problem space like?

00:46:17.260 | How much of it are you gonna also help with

00:46:19.980 | and versus how much you expect others to take care of?

00:46:23.220 | - Yeah, I'm happy to talk about SEC Insights in just a bit.

00:46:25.660 | I think more broadly about the like overall retrieval space,

00:46:28.260 | we're very interested in it because a lot of these

00:46:29.940 | are very practical problems that people have asked us.

00:46:31.940 | So the idea of outdated data,

00:46:33.300 | I think how do you like deprecate or time wait data

00:46:36.180 | and do that in a reliable manner, I guess,

00:46:38.580 | so you don't just like kind of set some parameter

00:46:40.460 | and all of a sudden that affects

00:46:41.620 | all your retrieval algorithms is pretty important

00:46:43.740 | because people have started bringing that up.

00:46:45.500 | Like I have a bunch of duplicate documents,

00:46:46.940 | things get out of date, how do I like sunset documents?

00:46:49.220 | And then ranking, right?

00:46:50.700 | Yeah, so I think this space is not new.

00:46:53.580 | I think like rather than inventing

00:46:56.180 | like new retriever techniques for the sake of like

00:46:58.380 | just inventing better ranking,

00:47:00.540 | we wanna take existing ranking techniques

00:47:02.700 | and kind of like package it in a way

00:47:04.180 | that's like intuitive and easy for people to understand.

00:47:06.900 | That said, I think there are interesting

00:47:09.980 | and new retrieval techniques that are kind of in place

00:47:13.660 | that can be done with when you tie it

00:47:16.180 | into some downstream rack system.

00:47:18.220 | I mean, like the reason for this is just like,

00:47:20.140 | if you think about how like the idea of like chunking text,

00:47:24.540 | right, like that really, that just really wasn't a thing

00:47:28.780 | or at least for this specific purpose of like,

00:47:31.500 | like the reason chunking is a thing in rag right now

00:47:33.660 | is because like you wanna fit

00:47:34.900 | within the context of an LLM, right?

00:47:37.020 | Like why do you wanna chunk a document?

00:47:38.220 | That just was less of a thing, I think back then.

00:47:40.340 | If you wanted to like transform a document,

00:47:42.900 | it was more for like structured data extraction

00:47:44.380 | or something in the past.

00:47:45.540 | And so there's kind of like certain new concepts

00:47:47.540 | that you gotta play with that you can use to invent

00:47:50.740 | kind of more interesting retrieval techniques.

00:47:52.740 | Another example here is actually LLM based reasoning,

00:47:55.700 | like LLM based chain of thought reasoning.

00:47:57.940 | You can take a question,

00:47:59.020 | break it down into smaller components

00:48:00.700 | and use that to actually send to your retrieval system.

00:48:03.740 | And that gives you better results

00:48:04.900 | than kind of like sending the full question

00:48:06.500 | to a retrieval system.

00:48:07.740 | That also wasn't really a thing back then,

00:48:09.500 | but then you can kind of figure out an interesting way

00:48:11.140 | of like blending old and the new, right,

00:48:13.060 | with LLMs and data.

00:48:14.380 | - Yeah.

00:48:16.740 | There's a lot of ideas that you come across.

00:48:19.980 | Do you have a store of them?

00:48:22.580 | - So, okay, I think that the, yeah,

00:48:25.780 | I think sometimes I get like inspiration.

00:48:27.460 | There's like some problem statement

00:48:28.620 | and I'm just like, oh, let's hack this out.

00:48:29.900 | - Following you is very hard

00:48:30.740 | because it's just a lot of homework.

00:48:32.540 | - So I think I've started to like step on the brakes

00:48:37.540 | just a little bit.

00:48:38.460 | - No, no, no, keep going, keep going.

00:48:39.540 | - No, no, no.

00:48:40.380 | Well, the reason is just like, okay,

00:48:41.460 | if I just have, invent like a hundred

00:48:42.940 | more retrieval techniques, like sure,

00:48:44.940 | but like how do people know which one is good

00:48:46.780 | and which one's like bad, right?

00:48:47.860 | And so-

00:48:48.700 | - Have a librarian, right?

00:48:49.540 | Like it's gonna catalog it and go-

00:48:51.180 | - You're gonna need some like benchmarks.

00:48:52.580 | And so I think that's probably the focus

00:48:54.380 | for the next few weeks is actually like properly

00:48:56.780 | kind of like having an understanding of like,

00:48:58.380 | oh, you know, when should you do this?

00:48:59.660 | Or like, does this actually work well?

00:49:01.260 | - Yeah, some kind of like maybe like a flow chart,

00:49:03.820 | decision tree type of thing.

00:49:05.180 | - Yeah, exactly.

00:49:06.020 | - When this, do that, you know, something like that

00:49:07.340 | that would be really helpful for me.

00:49:08.740 | Thank you.

00:49:09.580 | (both laughing)

00:49:11.740 | Do you want to talk about SCC Insights?

00:49:13.620 | - Sure, yeah.

00:49:15.580 | - You had a question.

00:49:17.020 | - Yeah, yeah, just, I mean, that's kind of like a good-

00:49:19.780 | - It seems like your most successful side project.

00:49:22.460 | - Yeah, okay.

00:49:23.460 | So what is SCC Insights for our listeners?

00:49:26.940 | Our SCC Insights is a full stack LLM chatbot application

00:49:31.660 | that does analysis over your SCC 10K and 10Q filings,

00:49:36.660 | I think.

00:49:37.500 | And so the goal for building this project

00:49:40.340 | is really twofold.

00:49:41.700 | The reason we started building this was one,

00:49:44.180 | it was a great way to dog food

00:49:45.780 | the production readiness for our library.

00:49:47.820 | We actually ended up like adding a bunch of stuff

00:49:50.180 | and fixing a ton of bugs because of this.

00:49:51.900 | And I think it was great because like, you know,

00:49:53.900 | thinking about how we handle like callbacks, streaming,

00:49:57.820 | actually generating like reliable sub-responses

00:50:00.180 | and bubbling up sources citations.

00:50:01.900 | These are all things that like, you know,

00:50:03.740 | if you're just building the library in isolation,

00:50:05.420 | you don't really think about it.

00:50:06.260 | But if you're trying to tie this

00:50:07.140 | into a downstream application,

00:50:08.220 | like it really starts mattering.

00:50:09.580 | - Is this for your error messages?

00:50:10.940 | What do you mean?

00:50:11.780 | You talk about bubbling up stuff.

00:50:12.620 | For observability. - So like sources.

00:50:13.860 | Like if you go into SCC Insights and you type something,

00:50:16.180 | you can actually see the highlights in the right side.

00:50:18.180 | And so like, yeah, that was something

00:50:20.340 | that like took a little bit of like understanding

00:50:22.740 | to figure out how to build well.

00:50:23.820 | And so it was great for dogfooding improvement

00:50:25.620 | of the library itself.

00:50:26.700 | And then as we're building the app,

00:50:28.260 | the second thing was we're starting to talk to users

00:50:30.180 | and just like trying to showcase

00:50:32.140 | like kind of bigger companies,

00:50:33.820 | like the potential of Llamandex as a framework.

00:50:36.740 | Because these days, obviously building a chatbot, right,

00:50:39.340 | with Streamlit or something,

00:50:40.260 | it'll take you like 30 minutes or an hour.

00:50:41.740 | Like there's plenty of templates out there

00:50:43.060 | on Llamandex, ClientTrain,

00:50:44.100 | like you can just build a chatbot.

00:50:45.580 | But how do you build something that kind of like satisfies

00:50:48.020 | some of this like criteria of surfacing like citations,

00:50:51.580 | being transparent, seeing like having a good UX,

00:50:54.220 | and then also being able to handle

00:50:55.420 | different types of questions, right?

00:50:56.580 | Like more complex questions

00:50:57.780 | that compare different documents.

00:50:59.460 | That's something that I think people

00:51:00.580 | are still trying to explore.

00:51:01.420 | And so what we did was like,

00:51:03.100 | we showed both like, well, first like organizations

00:51:07.220 | and possibilities of like what you can do

00:51:08.780 | when you actually build something like this.

00:51:10.420 | And then after like, you know,

00:51:11.740 | we kind of like stealth launched this for fun,

00:51:14.180 | just as a separate project,

00:51:15.500 | just to see if we could get feedback from users

00:51:17.180 | who are using this world to see like, you know,

00:51:18.620 | how we can improve stuff.

00:51:19.940 | And then we thought like, ah, you know,

00:51:22.660 | we built this, right?

00:51:23.780 | Obviously, we're not gonna sell like a financial app,

00:51:26.260 | like that's not really in our wheelhouse,

00:51:28.660 | but we're just gonna open source the entire thing.

00:51:30.100 | And so that now is basically just like a really nice,

00:51:33.060 | like full stack app template you can use

00:51:34.780 | and customize on your own, right?

00:51:35.940 | To build your own chatbot,

00:51:37.220 | whether it is a really financial documents

00:51:38.900 | or over like other types of documents.

00:51:40.620 | And it provides like a nice template

00:51:41.980 | for basically anybody to kind of like go in

00:51:43.500 | and get started.

00:51:45.180 | There's certain components though,

00:51:46.540 | that like aren't released yet that we're going to,

00:51:49.060 | in the next few weeks.

00:51:51.740 | Like one is just like kind of more detailed guides

00:51:54.220 | on like different modular components within it.

00:51:56.300 | So if you're like a full stack developer,

00:51:57.940 | you can go in and actually take the pieces that you want

00:52:00.220 | and actually kind of build your own custom flows.

00:52:02.260 | The second piece is like,

00:52:03.660 | take there's like certain components in there

00:52:05.500 | that might not be directly related to the LLM app

00:52:07.620 | that would be nice to just like have people use.

00:52:10.220 | An example is the PDF viewer,

00:52:12.020 | like the PDF viewer with like citations.

00:52:14.540 | I think we're just gonna give that, right?

00:52:15.820 | So, you know, you could be using any library you want,

00:52:17.900 | but then you can just, you know,

00:52:19.060 | just drop in a PDF viewer, right?

00:52:20.620 | So that it's just like a fun little module

00:52:22.460 | that you can view.

00:52:23.300 | - Nice, nice.

00:52:24.300 | Yeah, that's a really good community service right there.

00:52:27.300 | Well, so I want to talk a little bit

00:52:28.980 | about your cloud offering.

00:52:31.020 | 'Cause you mentioned, I forget the name

00:52:32.660 | that you had for it, enterprise something.

00:52:35.300 | - Well, one, we haven't come up with a name.

00:52:37.060 | We're kind of calling it LLM index platform,

00:52:41.060 | platform LLM index enterprise.

00:52:42.780 | I'm open to suggestions here.

00:52:45.180 | So I think the high level of what I can probably say

00:52:50.180 | is just like, yeah, I think we're looking at ways

00:52:53.380 | of like actively kind of complimenting

00:52:55.420 | the developer experience, like building LLM index.

00:52:57.940 | You know, we've always been very focused

00:53:00.740 | on stuff around like plugging in your data

00:53:03.580 | into the language model.

00:53:04.900 | And so can we build tools that help like augment

00:53:07.220 | that experience beyond the open source library, right?

00:53:10.020 | And so I think what we're gonna do

00:53:11.500 | is like make a build an experience

00:53:13.340 | where it's very seamless to transition

00:53:14.900 | from the open source library with like a one line toggle.

00:53:18.740 | You can basically get this like complimentary service

00:53:20.980 | and then figure out a way to like monetize in a bit.

00:53:23.420 | I think our revenue focus this year

00:53:25.500 | is kind of is less emphasized.

00:53:28.460 | Like it's more just about like,

00:53:29.540 | can we build some managed offering

00:53:30.780 | that like provides complimentary value

00:53:32.260 | to what the open source library provides?

00:53:34.460 | - Yeah, I think it's the classic thing

00:53:37.140 | about all open source is you want to start building

00:53:40.260 | the most popular open source projects

00:53:41.580 | in your category to own that category.

00:53:44.420 | You're gonna make it very easy to host.

00:53:46.180 | Therefore, then you have to,

00:53:47.220 | you've just built your biggest competitor, which is you.

00:53:50.020 | Yeah, it'll be fun.

00:53:52.140 | - I think it'll be like complimentary

00:53:53.500 | 'cause I think it'll be like, you know,

00:53:55.300 | use the open source library and then you have a toggle

00:53:57.780 | and all of a sudden, you know, you can see this

00:54:00.220 | basically like a pipeline-ish thing pop up

00:54:03.580 | and then it'll be able to kind of like, you'll have a UI,

00:54:07.380 | there'll be some enterprise guarantees

00:54:09.380 | and the end goal would be to help you build

00:54:11.020 | like a production rag app more easily.

00:54:12.660 | - Yeah, great, awesome.

00:54:14.900 | Should we go on to like ecosystem and other stuff?

00:54:17.460 | - Yeah. - Go ahead.

00:54:19.300 | - Data loaders, there's a lot of them.

00:54:21.700 | What are maybe some of the most popular,

00:54:24.540 | maybe under, not underrated, but like underexpected,

00:54:27.940 | you know, and how has the open source side of it helped

00:54:31.620 | with like getting a lot more connectors?

00:54:33.460 | You only have six people on the team today,

00:54:35.140 | so you couldn't have done it all yourself.

00:54:37.020 | - Oh, for sure.

00:54:37.980 | Yeah, I think the nice thing about like Blahma Hub itself

00:54:40.820 | is just, it's supposed to be a community-driven hub.

00:54:43.020 | And so actually the bulk of the peers

00:54:44.860 | are completely community contributed.

00:54:46.700 | And so we haven't written that many

00:54:49.340 | like first party connectors actually for this.

00:54:51.180 | It's more just like kind of encouraging people

00:54:53.180 | to contribute to the community.

00:54:56.100 | In terms of the most popular tools or the data loaders,

00:54:59.900 | I think we have Google Analytics on this

00:55:01.500 | and I forgot the specifics.

00:55:02.540 | It's some mix of like the PDF loaders.

00:55:05.180 | We have like 10 of them,

00:55:06.020 | but there's some subset of them that are popular.

00:55:07.820 | And then there's Google, like I think Gmail and like G-Drive.

00:55:12.260 | And then I think maybe it's like one of Slack or Notion.

00:55:15.260 | One thing I will say though,

00:55:17.820 | and I think like Swix might probably knows this better

00:55:20.260 | than I do, given that you were used to working at Airbyte,

00:55:22.300 | is like, it's very hard to build like,

00:55:24.580 | especially for a full-on service like Notion, Slack

00:55:27.020 | or like Salesforce to build like a really,

00:55:29.260 | really high quality loader that really extracts

00:55:31.300 | all the information that people want, right?

00:55:33.220 | And so I think the thing is when people start out,

00:55:37.700 | like they will probably use these loaders

00:55:39.820 | and it's a great tool to get started.

00:55:41.140 | And for a lot of people it's like good enough

00:55:42.820 | and they submit PRs if they want more additional features.

00:55:45.260 | If like you get to a point where you actually wanna call

00:55:47.700 | like an API that hasn't been supported yet,

00:55:49.820 | or, you know, you want to kind of load in stuff

00:55:53.660 | that like in metadata or something

00:55:55.260 | that hasn't been directly baked

00:55:56.620 | into the logic of the loader itself,

00:55:58.660 | people start adding up like writing their own custom loaders.

00:56:00.900 | And that is a thing that we're seeing.

00:56:02.300 | And that's something that we're okay with, right?

00:56:03.980 | 'Cause like a lot of this is more just like community driven

00:56:06.380 | and if you wanna submit a PR

00:56:07.620 | to improve the existing one, you can,

00:56:08.740 | otherwise you can create your own custom ones.

00:56:10.300 | - Yeah.

00:56:11.140 | And all that is custom loaders all supported

00:56:13.060 | within Llama Index or do you pair it with something else?

00:56:15.580 | - Oh, it's just like, I mean,

00:56:17.300 | you just define your own subclass.

00:56:18.380 | I think that's it.

00:56:19.220 | Yeah, yeah.

00:56:20.060 | - 'Cause typically in the data ecosystem with Erbite,

00:56:23.580 | you know, Erbite has his own strategies with custom loaders,

00:56:26.100 | but also you could write your own with like DAGster

00:56:28.460 | or like Prefects or one of those tools.

00:56:30.700 | - Yeah, yeah, exactly.

00:56:31.780 | So I think for us it's more,

00:56:33.180 | we just have a very flexible like document abstraction

00:56:35.140 | that you can fill in with any content that you want.

00:56:37.140 | - Okay.

00:56:37.980 | Are people really dumping all their Gmail into these things?

00:56:40.940 | You said Gmail is number two.

00:56:44.100 | - Yeah, it's like one of Google, some Google product.

00:56:47.460 | I think it's Gmail. - Oh, it's not Gmail.

00:56:48.860 | - I think it might be.

00:56:49.900 | Yeah. - Oh, wow.

00:56:50.860 | - I'm not sure actually.

00:56:52.620 | - I mean, that's the most private data source.

00:56:56.740 | - That's true.

00:56:57.580 | - So I'm surprised that people don't meet you.

00:56:59.980 | I mean, I'm sure some people are,

00:57:01.500 | but like I'm sure, I'm surprised it's popular.

00:57:04.100 | - Yeah.

00:57:05.020 | Let me revisit the Google Analytics.

00:57:06.300 | - Okay. - I wanna try

00:57:07.140 | and give you the accurate response, yeah.

00:57:09.060 | - Yeah.

00:57:10.460 | Well, and then, so the LLM engine,

00:57:13.900 | I assume OpenAI is gonna be a majority.

00:57:16.780 | Is it an overwhelming majority?

00:57:19.620 | What's the market share between like OpenAI,

00:57:22.020 | Cohere, Anthropic, you know, whatever you're seeing.

00:57:24.460 | OpenSource too.

00:57:25.300 | - OpenAI has a majority,

00:57:26.140 | but then like there's Anthropic

00:57:27.580 | and there's also OpenSource.

00:57:29.060 | I think there is a lot of people trying out like Llama 2

00:57:32.060 | and some variant of like a top OpenSource model.

00:57:34.900 | - Side note, any confusion there?

00:57:36.300 | Llama 2 versus Llama?

00:57:38.020 | - Yeah, I think whenever I go to these talks,

00:57:40.540 | I always open it up with like,

00:57:41.820 | we started before. - We are not.

00:57:42.660 | - Yeah, exactly.

00:57:43.500 | We started before Meta, right?

00:57:44.660 | I wanna point that out.

00:57:46.220 | But no, props to them.

00:57:47.580 | We try to use it for like branding.

00:57:49.060 | We just add two Llamas

00:57:50.060 | when we have like a Llama 2 integration

00:57:51.460 | instead of one Llama.

00:57:52.300 | Anyways.

00:57:53.140 | Yeah, so I think a lot of people are trying out

00:57:57.140 | the popular OpenSource models.

00:57:58.580 | And we have, these days we have like,

00:58:01.420 | there's a lot of toolkits and OpenSource projects

00:58:04.460 | that allow you to self-host and deploy Llama 2.

00:58:07.540 | - Yes. - Right.

00:58:08.380 | And like, Llama is just a very recent example,

00:58:10.540 | I think that we had an integration with.

00:58:12.380 | And so we just, by virtue of having more of these services,

00:58:14.940 | I think more and more people are trying it out.

00:58:16.620 | - Yeah.

00:58:17.460 | Do you think there's potential there?

00:58:18.820 | Is like, is that gonna be an increasing trend?

00:58:21.900 | - OpenSource? - Yeah.

00:58:22.860 | - Yeah, definitely.

00:58:23.700 | I think in general, people hate monopolies.

00:58:25.740 | And so like there's a,

00:58:27.500 | whenever like OpenAI has something really cool

00:58:30.020 | or like any company has something really cool, even Meta,

00:58:33.220 | like there's just gonna be a huge competitive pressure

00:58:35.300 | from other people to do something

00:58:36.500 | that's more open and better.

00:58:38.060 | And so I do think just market pressures

00:58:39.780 | will improve like OpenSource adoption.

00:58:42.660 | - Last thing I'll say about this,

00:58:43.740 | which is just really like, it gets clicks.

00:58:46.980 | People like, are like, psychologically want that.

00:58:50.340 | But then at the end of the day,

00:58:51.180 | they want, they fall for brand name

00:58:52.580 | and popular and performance benchmarks, you know?

00:58:56.500 | And at the end of the day, OpenAI still wins on that.

00:58:59.740 | - I think that's true.

00:59:00.820 | But I just think like,

00:59:02.300 | unless you were like an active employee at OpenAI, right?

00:59:04.660 | Like all these research labs are putting out like ML,

00:59:07.540 | like PhDs or kind of like other companies too,

00:59:10.500 | they're investing a lot of dollars.

00:59:11.900 | There's gonna be a lot of like competitive pressures

00:59:13.500 | to develop like better models.

00:59:14.700 | So is it gonna be like all fully open source

00:59:17.220 | with like a permissive license?

00:59:18.260 | Like, I'm not completely sure,

00:59:19.500 | but like there's just a lot of just incentive

00:59:21.460 | for people to develop their stuff here.

00:59:23.340 | - Have you looked at like rag specific models,

00:59:25.420 | like contextual?

00:59:26.460 | - No, is it public or?

00:59:29.180 | - No, they literally just, so Dewey Kila,

00:59:32.940 | I think is his name, you probably came across him.

00:59:35.900 | He wrote the rag paper at Meta

00:59:37.820 | and just started contextual AI

00:59:40.900 | to create a rag specific model.

00:59:42.540 | I don't know what that means.

00:59:44.540 | I was hoping that you do, 'cause it's your business.

00:59:47.060 | - If I had inside information.

00:59:50.580 | I mean, you know, to be honest,

00:59:51.980 | I think this kind of relates to my previous point

00:59:54.580 | on like rag and fine tuning.

00:59:56.020 | Like a rag specific model is a model architecture

00:59:58.340 | that's designed for better rag.

01:00:00.020 | And it's less the software engineering principle

01:00:01.940 | of like, how can I take existing stuff

01:00:03.860 | and just plug and play different components into it?

01:00:05.660 | And there's a beauty in that

01:00:07.340 | from ease of use and modularity.

01:00:08.940 | But like when you wanna end to end optimize the thing,

01:00:12.060 | you might want a more specific model.

01:00:13.900 | I just, yeah, I don't know.

01:00:15.900 | I think building your own models is honestly pretty hard.

01:00:20.220 | And I think the issue is if you also build your own models,

01:00:22.740 | like you're also just gonna have to keep up

01:00:24.140 | with like the rate of L and advances.

01:00:25.660 | Like basically the question is when GPT-5 and six

01:00:29.420 | and whatever, like anthropic cloud three comes out,

01:00:31.860 | like how can you prove that you're actually better

01:00:34.900 | than a software developers

01:00:36.460 | cobbling together their own components

01:00:37.860 | on top of a base model, right?

01:00:39.420 | Even if it's just like conceptually,

01:00:40.780 | this is better than maybe like GPT-3 or GPT-4.

01:00:43.820 | - Yeah, yeah.

01:00:45.340 | Base model game is expensive.

01:00:46.820 | - Yeah.

01:00:47.820 | - What about vector stores?

01:00:49.340 | I know Spook says we're in a Chroma sweatshirt.

01:00:51.900 | - Yeah, because this is a swag game.

01:00:53.900 | - I have the mug from Chroma, it's been great.

01:00:57.300 | - What do you think, what do you think there?

01:00:59.420 | Like there's a lot of them.

01:01:00.580 | Are they pretty interchangeable

01:01:02.380 | for like your users use case?

01:01:04.820 | Is HNSW all we need?

01:01:07.300 | Is there room for improvements there?

01:01:09.220 | - Is MPRA all we need?

01:01:10.540 | - Yeah, yeah.

01:01:11.380 | - I think, yeah, we try to remain unopinionated

01:01:14.500 | about storage providers.

01:01:15.460 | So it's not like, we don't try to like play favorites.

01:01:17.380 | So we have like a bunch of integrations, obviously.

01:01:19.020 | And the way we try to do is we just try to find

01:01:20.980 | like some standard interfaces,

01:01:22.180 | but obviously like different vector stores

01:01:23.700 | will support kind of like slightly additional things

01:01:26.060 | like metadata filters and those things.

01:01:27.940 | And the goal is to have our users basically leave it up

01:01:30.860 | to them to try to figure out like what makes sense

01:01:32.540 | for their use case.

01:01:33.620 | In terms of like the algorithm itself,

01:01:35.660 | I don't think the Delta

01:01:37.660 | on like improving the vector store,

01:01:39.580 | like embedding lookup algorithm is that high.

01:01:42.020 | I think the stuff has been mostly solved

01:01:44.300 | or at least there's just a lot of other stuff you can do

01:01:46.740 | to try to improve the performance.

01:01:48.900 | No, I mean like everything else that we just talked about,

01:01:50.540 | like in terms of like accuracy, right?

01:01:52.020 | To improve RAG, like everything that we talked about,

01:01:53.700 | like clunking, like metadata, like.

01:01:56.140 | - Yeah, well, I mean, I was just thinking like,

01:01:58.020 | maybe for me, the interesting question is,

01:02:00.620 | there are like eight, it's a kind of game of thrones.

01:02:02.580 | There's like eight, the war of eight databases right now.

01:02:05.460 | - Oh, oh, I see, I see.

01:02:06.820 | - How do they stand out

01:02:07.860 | and how did they become very good partners

01:02:09.220 | with Lava Index?

01:02:10.060 | - Oh, I mean, I think we're, yeah,

01:02:13.060 | we're pretty good partners with most of them.

01:02:15.060 | Let's see.

01:02:16.380 | - Well, like, so if you're a vector database founder,

01:02:19.140 | like what do you work on?

01:02:21.380 | - That's a good question.

01:02:23.060 | I think one thing I'm very interested in is,

01:02:25.860 | and this is something I think I've started to see

01:02:27.940 | a general trend towards,

01:02:29.060 | is combining structured data querying

01:02:31.740 | with unstructured data querying.

01:02:33.420 | And I think that will probably just expand

01:02:37.020 | the query sophistication of these vector stores

01:02:39.420 | and basically make it so that users don't have to think

01:02:41.580 | about whether they--

01:02:42.420 | - Would you call this like hybrid querying?

01:02:44.380 | Is that what Weaviate's doing?

01:02:46.540 | - Yeah, I mean, I think like,

01:02:47.660 | if you think about metadata filters,

01:02:48.900 | that's basically a structured filter.

01:02:50.420 | It's like a select star or select where, right?

01:02:54.060 | Something equals something.

01:02:55.180 | And then you combine that with semantic search.

01:02:57.060 | I know, I think like LanceDB or something

01:02:59.140 | was like trying to do some like joint interface.

01:03:02.420 | The reason is like most data is semi-structured.

01:03:05.260 | There's some structured annotations

01:03:06.500 | and there's some like unstructured texts.

01:03:07.900 | And so like somehow combining all the expressivity

01:03:12.260 | of like SQL with like the flexibility of semantic search

01:03:14.820 | is something that I think is gonna be really important.

01:03:17.300 | And we have some basic hacks right now

01:03:18.860 | that allow you to jointly query both a SQL database,

01:03:22.540 | like a separate SQL database and a vector store

01:03:24.300 | to like combine the information.

01:03:25.940 | That's obviously gonna be less efficient

01:03:27.220 | than if you just combined it into one system, yeah.

01:03:29.420 | And so I think like PG vector, like, you know,

01:03:31.380 | that type of stuff, I think it's starting to get there.

01:03:33.180 | But like in general,

01:03:34.020 | like how do you have an expressive query language

01:03:35.660 | to actually do like structured querying

01:03:37.620 | along with like all the capabilities of semantic search?

01:03:40.260 | - So your current favorite is just put it into Postgres?

01:03:44.220 | - No, no, no, we don't play--

01:03:45.980 | - Postgres language, the query language.

01:03:49.300 | - I actually don't know what the best language

01:03:52.780 | would be for this.

01:03:53.620 | 'Cause I think it will be something

01:03:55.180 | that like the model hasn't been fine-tuned over.

01:03:57.340 | And so you might wanna train the model over this,

01:04:00.100 | but some way of like expressing structured data filters.

01:04:04.500 | And this could be include time too, right?

01:04:06.580 | It doesn't have to just be like a where clause

01:04:08.940 | with this idea of like a semantic search.

01:04:11.420 | - Yeah, yeah.

01:04:12.260 | And we talked about graph representations.

01:04:14.700 | - Yeah, oh yeah, that's another thing too.

01:04:16.220 | And there's like, yeah,

01:04:17.340 | so that's actually something I didn't even bring up yet.

01:04:19.860 | Like there's this interesting idea of like,

01:04:21.300 | can you actually have the language model,

01:04:23.020 | like explore like relationships within the data too, right?

01:04:25.860 | And somehow combine that information with stuff

01:04:28.020 | that's like more structured within the DB.

01:04:30.860 | - Awesome.

01:04:31.700 | - What else is left in the stack?

01:04:34.620 | - Oh, evals.

01:04:35.620 | - Yeah.

01:04:36.460 | - What are your current strong beliefs

01:04:39.180 | about how to evaluate RAG?

01:04:40.860 | - I think I have thoughts.

01:04:41.820 | I think we're trying to curate this

01:04:42.980 | into some like more opinionated principles

01:04:45.860 | because there are some like open questions here.

01:04:47.540 | I think one question I had to think about

01:04:48.900 | is whether you should do like evals

01:04:50.860 | like component by component first

01:04:52.340 | or is yours do the end-to-end thing?

01:04:54.580 | I think you should,

01:04:55.700 | you might actually just want to do the end-to-end thing first

01:04:57.620 | just to do a sanity check of whether or not like this,

01:05:00.340 | given a query and the final response,

01:05:01.820 | whether or not it even makes sense.

01:05:03.220 | Like you eyeball it, right?

01:05:04.220 | And then you only try to do some basic evals.

01:05:06.340 | And then once you like diagnose what the issue is,

01:05:08.700 | then you go into the kind of like specific area

01:05:11.340 | to find some more solid benchmarks

01:05:13.420 | and try to like improve stuff.

01:05:14.940 | So what is end-to-end evals?

01:05:17.020 | Like it's, you have a query,

01:05:19.180 | it goes in through a retrieval system,

01:05:21.820 | you get back something, you synthesize response,

01:05:23.540 | and that's your final thing.

01:05:24.420 | And you evaluate the quality of the final response.

01:05:27.060 | And these days there's plenty of projects,

01:05:30.300 | like startups, like companies, research,

01:05:33.020 | doing stuff around like GPT-4, right?

01:05:35.180 | As like a human judge to basically kind of like

01:05:37.140 | synthetically generate a data set.

01:05:37.980 | - Do you think those will do well?

01:05:39.380 | - I mean, I think- - It's too easy.

01:05:41.300 | - Well, I think, oh, you're talking about like the startups?

01:05:44.860 | - Yeah.

01:05:45.700 | - I don't know.

01:05:46.540 | I don't know from the startup side.

01:05:47.380 | I just know from a technical side,

01:05:48.220 | I think people are gonna do more of it.

01:05:50.580 | The main issue right now is just, it's really unreliable.

01:05:53.420 | Like it's just, like there's like variance in the response

01:05:56.780 | when you wanna be- - Yeah, then they won't do

01:05:57.820 | more of it.

01:05:58.660 | I mean, 'cause it's bad.

01:05:59.500 | - No, but these models will get better

01:06:00.820 | and you'll probably fine tune a model to be a better judge.

01:06:03.260 | I think that's probably what's gonna happen.

01:06:04.500 | So I'm like reasonably bullish on this

01:06:07.260 | because I don't think there's really a good alternative

01:06:09.740 | beyond you just human annotating a bunch of data sets

01:06:12.500 | and then trying to like just manually go through

01:06:14.700 | and curating, like evaluating eval metrics.

01:06:17.460 | And so this is just gonna be a more scalable solution.

01:06:19.860 | In terms of the startups, yeah, I mean,

01:06:21.140 | I think there's a bunch of companies doing this.

01:06:22.660 | In the end, it probably comes down to some aspect

01:06:24.340 | of like UX speed and then whether you can

01:06:27.620 | like fine tune a model.

01:06:29.140 | And then, so that's end-to-end evals.

01:06:31.860 | And then I think like what we found is for RAG,

01:06:34.340 | a lot of times like what ends up affecting

01:06:37.500 | this like end response is retrieval.

01:06:39.420 | You're just not able to retrieve the right response.

01:06:41.300 | I think having proper retrieval benchmarks,

01:06:43.260 | especially if you wanna do production RAG

01:06:44.820 | is actually quite important.

01:06:46.260 | I think, what does having good retrieval metrics tell you?

01:06:49.540 | It tells you that at least like the retrieval is good.

01:06:52.180 | It doesn't necessarily guarantee

01:06:53.260 | the end generation is good,

01:06:54.740 | but at least it gives you some sort of like sanity track,

01:06:58.260 | right, so you can like fix one component

01:06:59.780 | while optimizing the rest.

01:07:00.980 | What retrieval like evaluation is pretty standard

01:07:04.340 | and it's been around for a while.

01:07:05.860 | It's just like an IR problem basically.

01:07:07.940 | You have some like input query,

01:07:10.460 | you get back some retrieved set of context

01:07:12.500 | and then there's some ground truth in that ranked set.

01:07:15.420 | And then you try to measure it based on ranking metrics.

01:07:17.580 | So the closer that ground truth is to the top,

01:07:20.420 | the more you reward the evals.

01:07:22.380 | And then the closer it is to the bottom

01:07:23.900 | or if it's not in the retrieved side at all,

01:07:25.580 | then you penalize the evals.

01:07:27.140 | And so that's just like a classic ranking problem.

01:07:29.580 | Most people starting out

01:07:30.900 | probably don't know how to do this.

01:07:32.740 | Right now, we just launched

01:07:33.780 | some like basic retrieval evaluation modules

01:07:37.220 | to help users do this.

01:07:38.620 | One is just like curating this data set in the first place.

01:07:41.140 | And one thing that we're very interested in

01:07:43.300 | is this idea of like synthetic data set generation

01:07:45.260 | for evals.

01:07:46.100 | So how can you, given some context,

01:07:47.820 | generate a set of questions with Drupal 2.4

01:07:49.820 | and then all of a sudden you have like question

01:07:51.300 | and then context pairs and that becomes your ground truth.

01:07:53.980 | - Yeah.

01:07:55.020 | Are data agent evals the same thing

01:07:56.700 | or is there a separate set of stuff for agents

01:07:59.940 | that you think is relevant here?

01:08:01.700 | - Data agents add like another layer of complexity

01:08:03.900 | 'cause then it's just like you have just more loops

01:08:06.220 | in the system.

01:08:07.060 | Like you can evaluate like each chain of thought loop itself

01:08:10.740 | like every LLM call to see whether or not the input

01:08:14.300 | to that specific step in the chain of thought process

01:08:16.580 | actually works or is correct.

01:08:20.420 | Or you could evaluate like the final response

01:08:22.220 | to see if that's correct.

01:08:23.220 | This gets even more complicated

01:08:24.460 | when you do like multi-agent stuff

01:08:25.820 | because now you have like some communication

01:08:27.420 | between like different agents.

01:08:28.700 | Like you have a top level orchestration agent

01:08:30.500 | passing it on to some low level stuff.

01:08:33.740 | I'm probably less familiar

01:08:35.460 | with kind of like agent eval frameworks.

01:08:36.980 | I know they're starting to become a thing.

01:08:39.620 | I know I was talking to like June

01:08:42.020 | from the Journal of Agents paper,

01:08:43.660 | which is pretty unrelated to what we're doing now,

01:08:45.780 | but it's very interesting where it's like,

01:08:47.260 | so you can kind of evaluate like overall agent simulations

01:08:50.460 | by just like kind of understanding

01:08:52.020 | whether or not they like modeled

01:08:53.660 | this distribution of human behavior,

01:08:55.180 | but that's like a very macro principle, right?

01:08:57.220 | And that's very much to evaluate stuff

01:08:59.220 | to kind of like model the distribution of things.

01:09:02.700 | And I think that works well

01:09:03.980 | when you're trying to like generate something

01:09:05.620 | for like creative purposes,

01:09:07.300 | but for stuff where you really want the agent

01:09:09.100 | to like achieve a certain task,

01:09:10.460 | it really is like whether or not

01:09:11.700 | it achieved the task or not, right?

01:09:13.060 | 'Cause then it's not like,

01:09:14.500 | oh, does it generally mimic human behavior?

01:09:16.380 | It's like, no, like did you like send this email or not?

01:09:18.540 | Right, like, 'cause otherwise like this thing didn't work.

01:09:21.260 | Yeah.

01:09:22.100 | - Makes sense.

01:09:22.940 | Awesome.

01:09:23.900 | Yeah, let's jump into Lightning Round.

01:09:26.340 | So we have two question, acceleration, exploration,

01:09:29.340 | and then one final takeaway.

01:09:31.500 | The acceleration question is,

01:09:32.740 | what's something that already happened in AI

01:09:35.060 | that you thought would take much longer to get here?

01:09:37.780 | - I think just the ability of LLMs

01:09:39.540 | to generate believable outputs,

01:09:41.300 | and both for texts and also for images.

01:09:44.900 | And I think just the whole reason

01:09:47.140 | I started hacking around with LLMs,

01:09:48.380 | honestly, I felt like I got into it pretty late.

01:09:50.060 | I should've gone into it like early 2022

01:09:51.940 | because Ubuntu 3 had been out for a while.

01:09:53.580 | Like just the fact that there was this engine

01:09:56.900 | that was capable of like reasoning

01:09:58.140 | and no one was really like tapping into it.

01:10:00.420 | And then the fact that, you know,

01:10:01.900 | I used to work in image generation for a while.

01:10:03.540 | Like I did GANs and stuff back in the day,

01:10:05.620 | and that was like pretty hard to train.

01:10:07.500 | You would generate these like 32 by 32 images,

01:10:10.420 | and then now taking a look at some of the stuff

01:10:12.180 | by like Dolly and, you know, mid-journey and those things.

01:10:14.780 | So it's just, it's very good.

01:10:16.540 | Yeah.

01:10:17.380 | - Exploration.

01:10:18.860 | What do you think is the most interesting

01:10:20.180 | unsolved question in AI?

01:10:21.660 | - Yeah, I'd probably work on some aspect

01:10:24.020 | of like personalization of memory.

01:10:28.340 | I think a lot of people have thoughts about that,

01:10:30.020 | but like for what it's worth,

01:10:31.260 | I don't think the final state will be right.

01:10:32.740 | I think it will be some like fancy algorithm

01:10:35.540 | or architecture where you like bake it

01:10:37.220 | into like the architecture of the model itself.

01:10:39.580 | Like if you have like a personalized assistant

01:10:41.540 | that you can talk to,

01:10:43.820 | that will like learn behaviors over time, right?

01:10:45.860 | And kind of like learn stuff

01:10:47.260 | through like conversation history,

01:10:48.660 | what exactly is the right architecture there?

01:10:50.340 | I do think that will be part of like-

01:10:52.780 | - Continuous fine tuning?

01:10:54.020 | - Yeah, like some aspect of that, right?

01:10:55.700 | Like these are like,

01:10:56.540 | I don't actually know the specific technique,

01:10:57.940 | but I don't think it's just gonna be something

01:10:59.460 | where you have like a fixed vector store

01:11:00.700 | and that thing will be like the thing

01:11:02.220 | that restores all your memories.

01:11:03.780 | - Yeah, it's interesting because I feel

01:11:07.060 | like using model weights for memory,

01:11:11.340 | it's just such an unreliable storage device.

01:11:14.260 | - I know, but like, I just think from like the AGI,

01:11:18.580 | like, you know, just modeling

01:11:20.500 | like the human brain perspective,

01:11:21.660 | I think that there is something nice

01:11:23.220 | about just like being able to optimize that system, right?

01:11:26.380 | And to optimize a system, you need parameters

01:11:28.380 | and then that's where you just get

01:11:29.220 | into the neural map piece.

01:11:30.660 | - Cool, cool, and yeah, take away,

01:11:33.780 | you got the audience ear,

01:11:35.740 | what's something you want everyone to think about

01:11:38.220 | or yeah, take away from this conversation

01:11:41.020 | and your thinking.

01:11:42.300 | - I think there were a few key things.

01:11:44.460 | So we talked about two of them already,

01:11:46.180 | which was SEC insights,

01:11:47.460 | which if you guys haven't checked it out,

01:11:48.740 | I've definitely encouraged you to do so,

01:11:49.940 | because it's not just like a random like SEC app,

01:11:52.500 | it's like a full stack thing that we open source, right?

01:11:54.700 | And so if you guys wanna track it out,

01:11:55.980 | I would definitely do that.

01:11:57.580 | It provides a template for you to build

01:11:59.020 | kind of like production grade rag apps

01:12:00.700 | and we're gonna open source like

01:12:02.260 | and modularize more components of that soon.

01:12:04.180 | - Into a workshop.

01:12:05.340 | - Yeah, and the second piece is we are thinking a lot

01:12:08.540 | about like retrieval and evals.

01:12:10.380 | I think right now we're kind of exploring integrations

01:12:12.900 | with like a few different partners

01:12:14.180 | and so hopefully some of that will be released soon.

01:12:16.820 | And so just like, how do you basically have an experience

01:12:20.420 | where you just like write long index code,

01:12:23.140 | all of a sudden you can easily run like retrievals,

01:12:25.660 | evals and like traces, all that stuff and like a service.

01:12:28.860 | And so I think we're working with like

01:12:29.980 | a few providers on that.

01:12:31.460 | And then the other piece,

01:12:32.540 | which we did talk about already is this idea of like,

01:12:34.940 | yeah, building like rag from scratch.

01:12:36.620 | I mean, I think everybody should do it.

01:12:37.980 | I think like I would check out the guide

01:12:40.940 | if you guys haven't already, I think it's in our docs,

01:12:42.820 | but instead of just using, you know,

01:12:45.180 | either the kind of like the retriever query engine

01:12:48.860 | and Lomindex or like the conversational QA train

01:12:51.180 | and Lang train, I would take a look at

01:12:53.780 | how do you actually chunk parse data

01:12:55.860 | and do like top can batting retrieval.

01:12:57.700 | 'Cause I really think that by doing that process,

01:12:59.780 | it helps you understand the decisions,

01:13:01.740 | the prompts, the language models to use.

01:13:04.420 | - That's it.

01:13:05.260 | - Yeah. - Thank you so much.

01:13:06.100 | - Thank you, Jerry.

01:13:06.920 | - Yeah, thank you.

01:13:07.760 | (upbeat music)

01:13:10.340 | (upbeat music)

01:13:12.920 | (upbeat music)

01:13:15.500 | (upbeat music)

01:13:18.080 | (upbeat music)

01:13:20.660 | (upbeat music)

01:13:23.240 | [BLANK_AUDIO]

RAG is a hack - with Jerry Liu of LlamaIndex

Chapters