
Eugene Yan on RecSys with Generative Retrieval (RQ-VAE)


Chapters

0:00 Introduction
3:52 Recommendation Systems with Generative Retrieval
8:48 Traditional vs. Generative Retrieval
10:49 Semantic IDs and Model Uniqueness
14:15 Semantic ID Generation and RQ-VAE
17:37 Residual Quantization (RQ)
23:26 RQ-VAE Loss Function
34:55 Codebook Initialization and Challenges
36:32 RQ-VAE Training Metrics and Observations
48:19 Generative Recommendations with SASRec
55:25 Fine-tuning a Language Model for Generative Retrieval
84:08 Model Capabilities and Future Directions

Whisper Transcript

00:00:00.000 | i guess not let me also share this on the discord how do we invite folks there's a lot of people
00:00:09.340 | registered for this one but i can share the room then let me let me do it got it okay yeah
00:00:16.180 | i have a question you mentioned in the description of the luma please read the paper
00:00:38.240 | if you have time try to grok section 3.1 with your favorite ai teacher who's your favorite ai teacher
00:00:44.480 | oh claude opus opus interesting i never use claude for papers oh really i have it because i just have
00:00:54.700 | a max subscription that's why uh i don't know i never i think i should
00:01:01.900 | try it as well okay i think we will
00:01:17.720 | just wait one more minute and then a few minutes actually like i feel like we have a lot of people
00:01:23.680 | registered yeah unless you have a long session let me just kick off i i do have a lot okay okay i want to share
00:01:33.900 | um but we can wait one more minute one or two more minutes i'm going to start with a demo and we'll go
00:01:39.540 | into the paper i'm going to go through code and then at the end we have a live demo where i'll take in requests from
00:01:46.240 | folks to um just you know just provide the input prompt and we'll see what happens this model is uh
00:01:55.680 | currently only trained on one epoch so we'll see wow uh michael is new to paper club so basically every
00:02:04.320 | wednesday we do a different paper um we love volunteers if you want to volunteer to share something
00:02:10.160 | um otherwise it's mostly me or eugene or someone um this paper club is kind of whatever's latest or new
00:02:17.600 | whatever gets shared uh someone will cover it and then some people are experts in domains like eugene
00:02:23.120 | is the god of recsys so he's sharing the latest state of the art of recsys stuff um this week yeah that's
00:02:30.480 | that's what he's going to talk about next week we have uh the paper how much do language models
00:02:36.000 | memorize um the week after someone has added vision to gpt-oss so i think we do that one it's
00:02:46.400 | like rlvr uh there's also a nous research technical report we can get one of them to present
00:02:53.680 | but honestly it wasn't great so i think we skip it but yeah if anyone also wants to volunteer papers
00:03:00.960 | or you know if you want someone else to share them we have active discord just
00:03:05.840 | share and then someone will do it rj presents a lot too yep okay so i'm gonna just get started um i won't
00:03:16.000 | be able to look at the chat oh so let me know if the the font is too small uh and because i'm just
00:03:24.240 | doing this on my own laptop so let me know if the font is too small okay cool so what do i want to talk
00:03:31.200 | about today wait am i sharing the wrong screen you're sharing your whole screen okay uh i need to
00:03:38.960 | move zoom to this desktop okay perfect and now i want to share this tab can you see this
00:03:49.520 | perfect okay so what is this recommender systems with generative retrieval let's start from the back what is it
00:03:59.200 | that we want to achieve here's an example of what we want to achieve whereby given this this is literally
00:04:07.360 | the input string recommend playstation products similar to this thing so what is this thing this
00:04:13.040 | this is the semantic id you see that it's in harmony format we have a semantic id start
00:04:18.000 | we have a semantic id end and then we have four levels of semantic ids one two three four how to read
00:04:26.720 | this it's hard but we have a mapping table so over here recommend playstation products similar to this
00:04:33.040 | semantic id i have no idea what this semantic id is but i have a mapping table so you can see this semantic id is
00:04:39.680 | actually logitech gaming keyboard so when i ask for playstation products it's able to say playstation
00:04:48.480 | uh gaming headset for playstation this is uh i don't think this is right this is for pc so that's a little bit
00:04:55.440 | off there and this is for pc ps5 ps4 so not that's not too bad now let's try another something else a bit a
00:05:04.080 | little bit different recommend zelda products similar to this thing which is new super mario bros
00:05:11.520 | is able to respond zelda okay this is off teenage mutant ninja turtles and nintendo wii so i guess
00:05:18.240 | maybe we don't have a lot of zelda products in the catalog let's try something a little bit different
00:05:22.880 | super mario bros meets assassin's creed what happens uh we get batman quick question what is uh what is
00:05:31.040 | this sid what does semantic id mean what are these tokens like at a high level so semantic ids are a
00:05:38.800 | representation of the item um so here's how we try to represent the item instead of a random hash which
00:05:46.560 | is what the paper says uh we will just represent an item so imagine we try to merge something cute
00:05:52.240 | super mario bros with something about assassination well we get batman so this is an early example
00:06:02.160 | maybe let's just try you get lego batman the lego part makes it cute yeah it makes sense right cute
00:06:09.760 | let's try resident evil with donkey kong okay maybe this is a little bit off um but i don't know dog
00:06:16.720 | let's just say dog i have no idea whether this is in the training data how this would look dog and super
00:06:22.240 | mario well it's not so you can see it defaults to zelda because there's a very strong relationship
00:06:27.120 | between nintendo super mario and zelda actually sega i actually wonder if this is in the data set
00:06:33.200 | luigi dragon ball brutal legend so essentially we have recommendations we can learn the recommendations
00:06:40.480 | but now that we can use natural language to shape the recommendations by cross learning recommendation
00:06:46.480 | data with the natural language data so now that is the quick demo hopefully you are interested enough
00:06:53.760 | to pay attention to this to this um this is i'm gonna go fairly fast and i'm going to go through
00:07:01.440 | a lot we're going to go through paper we're going to go through my notes and we are going to go through
00:07:06.320 | um code so i hope you give me your attention oh vivo has a hand raised and then a quick question on
00:07:13.120 | that example so right now you're crossing stuff that should be in training data right so stuff like
00:07:18.560 | uh zelda batman assassin's creed resident evil uh this should also theoretically just work with natural
00:07:25.680 | language right so if i was to ask something like single player and something like a single player
00:07:31.280 | game or like a multiplayer game like zelda right because zelda is multiplayer or you know yeah multiplayer
00:07:38.160 | game like zelda and i would hope i get something that's like a nintendo game so like maybe like um
00:07:43.920 | mario kart is similar to zelda but multiplayer right let's try this this is the amazing spider-man
00:07:49.920 | we know this because the output is this is this is the actual output which
00:07:55.440 | is in sids and this log over here is the parsed version so this is recommend single player game similar to spider-man
00:08:06.400 | okay batman batman batman multiplayer i i don't know how this will work uh and this is a very janky
00:08:14.640 | young model well it's still batman well maybe doesn't doesn't work very well some some batman
00:08:20.160 | games are multiplayer but i mean i think the point is yours is like a toy example but yeah i just wanted
00:08:24.960 | to clarify theoretically that should work right yep theoretically it should work if we have enough training
00:08:30.240 | data that we augment well enough now i haven't done a lot of data annotation here uh but now let's jump
00:08:35.920 | into the paper so this is the paper um the blue highlights essentially mean implementation details
00:08:42.320 | the yellow highlights are interesting things green is good red is bad so most recommended systems they do
00:08:50.240 | retrieval by doing this thing which is uh we embed queries and items in the same vector space and then we
00:08:58.640 | do approximate nearest neighbors so essentially all of this is always standard approximate embed everything
00:09:03.920 | and do approximate nearest neighbors but what this paper is suggesting is that now we can actually
00:09:11.360 | just decode we can get a model to return the most similar products so now this is of course very expensive
00:09:21.120 | compared to um just approximate nearest neighbors which is extremely cheap but this gives us a few special
00:09:28.160 | properties whereby now we can tweak how we want the output to look like as you can see with just
00:09:35.280 | natural text and we essentially it allows us to kind of filter uh our recommendations and shape our
00:09:41.440 | recommendations so now imagine that uh if i were to chat imagine a chatbot capability right um like this
00:09:49.360 | okay let's say we have zelda and assuming it's better trained super mario bros and i want more zelda products
00:09:56.240 | similar to super mario bros uh we get one of these well only one but we'll see essentially just more
00:10:03.360 | training data needs just needs more training data and augmentation so over here now we get three zelda
00:10:09.200 | products of course the temperature here is fairly high uh essentially i'm just trying to test how this
00:10:13.440 | looks like and you can see by by suggesting the model understands the word zelda the model understands this
00:10:20.000 | product and the model understands products similar to this product but in the form of zelda
00:10:27.120 | uh yeah is this meant to be like used in real-time low latency small model situation or like
00:10:40.960 | honestly from an external perspective it just doesn't seem that like it doesn't seem super clear what's
00:10:47.920 | special about this right because if you just ask any model like yes gemini or chat gpt or anything you
00:10:53.680 | know uh recommend me multiple multiplayer games similar to zelda you should know this right yeah the
00:11:01.520 | what i think is unique about this model so imagine if you just ask a regular lm recommend me
00:11:08.480 | multiplayer games similar to zelda it will recommend you games based on its memory and based on what is learned
00:11:12.800 | on the internet so what is unique here is i'm trying to join world knowledge learn on the internet and
00:11:22.400 | customer behavioral knowledge which is proprietary and seldom available on the internet so i'm trying to
00:11:30.000 | merge those together so usually when you use world knowledge there's a very strong popularity bias right
00:11:37.520 | but when you try to merge it with actual behavioral knowledge that's progressing in real time and you
00:11:42.800 | can imagine mapping it to things like this you can get that kind of data so that's the that's what i
00:11:49.680 | mean by this is a bilingual model that speaks english and customer behavior in terms of semantic ids
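For readers following along, here is a rough sketch of what a prompt from the demo might look like. The exact special tokens and template are not shown in the talk, so the token names and the specific codes below are illustrative placeholders, not the actual vocabulary.

```python
# Illustrative only: <sid_start>, <sid_*> and <sid_end> are placeholder token names,
# and the four codes are made up; the real demo uses its own special tokens.
prompt = (
    "Recommend PlayStation products similar to "
    "<sid_start><sid_201><sid_310><sid_655><sid_912><sid_end>"
)
# The model decodes another run of four SID tokens, which a lookup table maps
# back to a product title, e.g. {(201, 310, 655, 912): "Logitech gaming keyboard"}.
```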
00:11:56.400 | very interesting and uh like more clear-cut example would be like uh uh lm would be out of date because
00:12:05.040 | there's knowledge cut off but if you have like a new song or something you know exactly yeah something
00:12:10.320 | like this exactly oh henry has a hand raised yeah uh so what if the world knowledge and the local
00:12:16.240 | knowledge conflict what will happen then i don't know um yeah i don't know i i guess it really depends
00:12:23.120 | on so that's world knowledge right and then when you maybe fine-tune this on local knowledge it depends on
00:12:28.240 | how much strength you want the local knowledge to have to uh update the weights okay okay so wait oh go
00:12:37.280 | ahead from basic training principles the local knowledge should always come first right because
00:12:43.600 | let's say like you have a base model that thinks you know if someone likes zelda they like mario kart but
00:12:50.240 | if you've trained it such that people that like zelda prefer some other category of game your your model
00:12:57.840 | is trained to you know predict one after the other so the the later training often over you know has more
00:13:05.840 | impact yeah i mean so the reason why i asked is because there's a lot of abundance of evidence in
00:13:10.320 | the like base model so by just having one example does that completely override the the information it
00:13:17.520 | has learned or with so much evidence um i don't know it's a good test i guess
00:13:24.400 | yeah honestly i don't know so that that's where this is what i'm trying to explore while we're
00:13:29.120 | already 15 minutes in um barely touch anything i want to cover but i'm going to try to speed through
00:13:35.840 | this a little bit faster and i'll stop for questions maybe at a 45 minute mark so now most systems
00:13:41.600 | nowadays use this retrieve and rank strategy essentially um here's the approximate nearest neighbors index
00:13:47.760 | i was talking about over here given a request we embed it you get approximate nearest neighbors index
00:13:53.040 | and as you add features and then you try to do some ranking on top of it this could be as simple as a
00:13:57.280 | logistic regression or decision tree or two tower network or whatever or a sasrec we actually look
00:14:02.240 | at a sasrec which is essentially a one or two layer autoregressive decoder model that is trained on user
00:14:07.760 | sequences we'll look at that later um so how they are trying to do this is they have a semantic
00:14:16.960 | representation of items called a semantic id so semantic id so usually product items or any item
00:14:24.320 | you represent is usually a random hash like some random hash of numbers this semantic id and we have
00:14:30.080 | seen is a sequence of four levels of tokens so now these four levels of tokens they actually encode
00:14:36.240 | the each item's content information and and we'll see how we do that so what they do is they use a
00:14:42.160 | pre-trained text encoder to generate content embeddings and this is to encode text you can
00:14:47.360 | imagine using a clip to encode images using uh other encoders to encode audio etc then they do a
00:14:55.840 | quantization scheme so what is this quantization scheme i think this is uh very important and that's
00:15:00.640 | why i want to take extra time to go through section uh 3.1 of this so here are the benefits that they they
00:15:10.160 | expound right like training a transformer on semantically meaningful data allows knowledge
00:15:13.680 | sharing across items so essentially now we don't need random ids now when we understand one item in
00:15:19.920 | one format we can now understand it uh a new item so now for a new item if we have no customer behavioral
00:15:26.720 | data so long as we have a way to get the semantic id for it and because the semantic id for it the
00:15:33.280 | assumption is that the levels of the semantic id the first level the second level the third level if it
00:15:38.560 | we'll find a similar item we can start to recommend that item as well all right so essentially
00:15:43.440 | generalized to newly added items in a corpus and the other thing is the scale of item corpus usually
00:15:48.880 | you know on e-commerce or whatever the scale of items is like millions and maybe even billions
00:15:53.680 | so when you train such a model uh your embedding table becomes very big right in order of millions
00:15:58.240 | or billions but if you use semantic ids and a semantic id is just represented by a combination of tokens
00:16:05.840 | uh we can now use very few embeddings to represent a lot of items uh and and i'll show
00:16:13.120 | you i'll show you what i mean by this um the first thing i want to talk about is how we generate
00:16:20.720 | semantic ids um so one of the previous ideas was using a vector quantization recommender so it generates
00:16:30.960 | codes that are like semantic ids uh what they use is an rqvae a residual-quantized variational autoencoder
00:16:43.040 | i think i think this is quite important like how to train a good rqvae that leads to semantically
00:16:49.120 | meaningful semantic ids that your downstream model can then learn from uh i think it's quite challenging and
00:16:55.120 | i haven't found very good uh literature understanding of how to do this well so this is their proposed
00:17:02.720 | framework first we're given an item given an item's content be text audio image or video we encode it to
00:17:12.000 | first we get the vectors then we quantize it to semantic code words that's step number one second
00:17:20.480 | now once we have these tokens the semantic tokens for an item we can now use it to train a transformer
00:17:27.200 | model or language model which is the demo i just showed you so i'm going to take a little bit of time
00:17:35.920 | to try to help to try to help us all understand what residual quantization means
00:17:42.320 | so essentially given an embedding okay given an embedding that comes in we consider this the residue
00:17:48.560 | we think of everything here within this within this gray box as residuals an embedding comes in we
00:17:56.160 | consider the first residue we try to find which codebook token it is most similar to
00:18:04.800 | and we assign it to it right so now we've assigned it to token number seven and uh it's the most similar
00:18:11.760 | so then what we take is we take the initial residue that comes in the initial embedding that comes in
00:18:16.960 | this blue box we subtract away the embedding of the codebook token seven and now we get one right
00:18:24.240 | this this remaining thing here this is the remaining residue then we repeat it at codebook level two
00:18:32.000 | at codebook level two we try to find the most similar codebook token embedding the most similar
00:18:38.720 | vector we try to find the most similar vector we map it to it and over here we map it to
00:18:41.840 | uh to this token one right so now we take the previous residue we subtract the new
00:18:50.000 | codebook level two vector and now we get the remaining residue is at this level at the residue two
00:18:59.600 | right and then we repeat this right we repeat this along we keep repeating this until uh and along the
00:19:07.520 | way we keep trying to minimize the residue so essentially based on this we can now assign
00:19:12.080 | given an embedding space we can now assign it to integers or tokens right so now this outstanding the
00:19:22.080 | remaining the semantic code for this is seven one four so now if we take seven plus one plus four
00:19:28.880 | the output of this will be the quantized representation essentially what has come
00:19:33.440 | in which is blue we do this residual quantization and when we take the respective tokens they are assigned
00:19:38.880 | to it if we sum it all up it should be the same as whatever has come in and then based on this this
00:19:48.080 | quantized representation with a decoder we can get the original embedding
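As a reference for the walkthrough above, here is a minimal sketch of greedy residual quantization, assuming a single embedding and a list of per-level codebook matrices; the function and variable names are mine, not taken from the paper's or the speaker's code.

```python
import torch

def rq_encode(x, codebooks):
    """Greedy residual quantization. x: (d,) embedding; codebooks: list of (K, d) tensors."""
    residual = x
    codes = []
    quantized = torch.zeros_like(x)
    for codebook in codebooks:
        # pick the codebook vector closest to the current residual
        dists = ((codebook - residual) ** 2).sum(dim=-1)   # (K,)
        idx = int(dists.argmin())
        codes.append(idx)
        quantized = quantized + codebook[idx]   # running sum approximates the input
        residual = residual - codebook[idx]     # what is left for the next level
    return codes, quantized                     # e.g. codes == [7, 1, 4]

# `quantized` (the sum of the chosen vectors) is what gets passed to the decoder
# to reconstruct the original embedding.
```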
00:19:55.200 | any questions here did i lose everyone where does i mean i read the paper too and this part totally lost
00:20:01.760 | me where does that initial code book come from uh you initialize it randomly okay as you start oh
00:20:10.320 | yeah or there are smarter ways to do this which i'll talk about so you've got the random initialization
00:20:16.240 | and then in this process where you're learning the representation are you like iteratively updating the
00:20:21.120 | code book and updating okay that makes a lot more time yes so this rqvae needs to be learned um and
00:20:31.280 | when you learn this rqvae there are a lot of metrics to look at which we'll see and it's really hard to
00:20:36.080 | understand which metrics indicate a good rqvae um okay so that's the semantic id generation uh quick
00:20:43.600 | question uh and finally when we try to calculate the loss of the contested version and original version
00:20:50.800 | at 714 do we actually do addition or there's some other operation actually when i imagine that after
00:21:00.160 | training the code book it will be deviated from the original initialization so it won't be in the same
00:21:07.680 | scale right addition may not make sense i actually do addition so just addition yeah i just do addition
00:21:14.960 | okay and i think the the image suggests as addition as well i think they mentioned in the paper that they
00:21:24.160 | do uh three different code books for that reason so that the each code book has sort of like a different
00:21:31.440 | scale so that they are additive exactly um because the norm of the residuals decreases with
00:21:38.000 | increasing levels right so so that was what i explained via the image i'm going to try to explain
00:21:46.320 | it again via the text in the paper because i i think this is really important so whatever comes in
00:21:53.440 | is the latent representation right and we consider this the initial residue which is r0 now this is this
00:22:01.440 | r0 is quantized essentially converted to an integer by mapping into the nearest embedding from that level
00:22:07.840 | of code book so r0 so now to get the next residual right we just take r0 minus the code book embedding
00:22:18.160 | right over here to to get the next residue we just take okay then similarly we recursively repeat this
00:22:25.120 | to find the m code words that represent the semantic id so this is done from coarse to fine now there's this
00:22:34.560 | section over here which i think is also quite important uh that i really want to try to help us all
00:22:39.680 | understand essentially now we know how we get the quantization the input and output everything
00:22:46.400 | now this z you can see over here which is really just a summation uh which is really just a summation
00:22:55.040 | of all the codebook vectors this is now passed into the decoder right which tries to recreate
00:23:02.560 | the input embedding so now this imagine your code book could be like 32 dimensions and your input could be
00:23:11.280 | like 1124 dimensions this decoder now needs to map it from the input the the code book representation
00:23:19.760 | through the original input and has to try to learn that okay so then the next thing is how do we what is
00:23:30.960 | the loss for this so this is quite interesting to me this is the one of the first few times i've actually
00:23:35.600 | trained a rqvae and i spent quite a bit of time trying to understand this um i'll briefly talk
00:23:42.000 | through it very quickly and then we're going to the notes to try to see how to understand this so the loss
00:23:47.360 | function is the reconstruction loss plus the rqvae loss right and the reconstruction loss this is
00:23:54.560 | straightforward but rqvae loss uh takes a bit of time to understand so now let's go into this
00:24:02.640 | so the first term right is the reconstruction loss which is how well we can reconstruct the original
00:24:08.720 | output this is quite straightforward the original input is x over here the reconstructed output is x
00:24:15.120 | hat over here and this is really just squared error so how well can we reconstruct it now that question
00:24:22.080 | that's quite straightforward but the quantization loss this one is a little bit more complex
00:24:28.400 | you can see that there are two terms here with sg and sg means stop gradient stop gradient on the
00:24:34.080 | residual minus the codebook embedding and then there's a beta term which is how you want to balance
00:24:42.320 | it and then the residual minus the stop gradient on the embedding so you can see the left and the right
00:24:48.080 | is essentially the same except the stop gradient is on different terms so the first term this updates the
00:24:56.560 | codebook codebook vectors which is this uh which is e which is this embedding here to be closer to the
00:25:04.000 | residual right we stop the gradient so that we treat the residual as the fixed target
00:25:10.080 | and now what this does is it pulls the embedding closer to the target
00:25:19.840 | yeah and then so what this does is it updates the codebook vectors to better represent the
00:25:24.480 | residuals now the second one is that it updates the encoder to return residuals closer to the
00:25:34.160 | codebook embedding so you can see we stop the gradient on the codebook embedding and we treat
00:25:40.160 | the embedding as fixed so now the embedding is fixed so now your encoder has to try to
00:25:45.840 | do the encoding so that it maps better to the codebook embedding so essentially the first term
00:25:55.200 | teaches the codebook the second term teaches the encoder
00:26:06.880 | i think it's something like gans i'm not very familiar with gans uh but so yeah i i'm not sure i
00:26:14.640 | can if i can comment on that uh ted i think vector quantization is like multi-level rounding yes i i think
00:26:21.440 | the intuition is right this is very similar um and then there's a beta which is a weighting factor right
00:26:27.440 | so you can see in the paper they use a weighting factor of 0.25 essentially what this means is that
00:26:34.480 | we want to train the encoder less than we want to train the codebook and again how do you choose this
00:26:44.720 | beta it it's it's really hard i don't know in this case i think having high fidelity of the codebook is
00:26:50.080 | important because that's what we will use to train our downstream model we're not training the rqvae for the
00:26:55.680 | sake of training an rqvae we're training the rqvae for the sake of having
00:27:00.400 | good semantic ids that we can use in a downstream use case any questions here
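A sketch of the loss as described in this section: the reconstruction loss plus, at each level, a codebook term with stop-gradient on the residual and a beta-weighted commitment term with stop-gradient on the codebook vector. This is one reading of Section 3.1; the helper's signature and names are assumptions, not the paper's code.

```python
import torch.nn.functional as F

def rqvae_loss(x, x_hat, residuals, chosen_codes, beta=0.25):
    """x: original embedding, x_hat: decoder output.
    residuals[i] is the residual entering level i; chosen_codes[i] is the codebook
    vector selected at level i. beta weights the commitment term (0.25 in the paper)."""
    recon = F.mse_loss(x_hat, x)                           # reconstruction loss
    quant = 0.0
    for r, e in zip(residuals, chosen_codes):
        quant = quant + F.mse_loss(e, r.detach())          # pulls the codebook toward the residual
        quant = quant + beta * F.mse_loss(r, e.detach())   # pulls the encoder toward the codebook
    return recon + quant
```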
00:27:07.840 | did i lose everyone i was like i'm i was just i've just been building a variational auto encoder so
00:27:18.640 | i'm a little curious like when in all my reading there's always like an elbo or elbo-like loss
00:27:26.560 | the and i don't see one there did you did you understand why they didn't use that or is it just
00:27:34.080 | not mentioned or they didn't use that it's just not mentioned i guess it's because um when you look
00:27:39.920 | at where the rqvae came from it didn't really come from this it came from representing audio and also images
00:27:49.600 | they cited these two papers right the residual vector quantizer so uh ted has a has a hand raised
00:27:56.960 | yeah so i don't deeply understand it but i think i can answer rj's question at a superficial level
00:28:05.840 | which is that the elbo is used to approximate when you are training your regular continuous vae okay and so you
00:28:15.920 | have some assumptions around like whatever normal distribution and things like that that what you're
00:28:20.880 | trying to do is you're trying to do maximum likelihood but you can't do it directly so you use
00:28:25.360 | this elbo as a proxy in order to try to get to the maximum likelihood okay what the what the vqvae paper
00:28:34.240 | says is that we're using a whatever multinomial distribution instead of this continuous distribution
00:28:40.400 | and and that's the reason why uh because of that distribution assumption we no longer have this
00:28:46.240 | elbo thing i've never fully wrapped my head around it but it's fundamentally because this discrete
00:28:51.440 | distribution is very different from the continuous distribution that we use with the elbo i i so somebody
00:28:57.440 | can go deeper than that yeah that's as much as i got okay no that's actually super helpful and thank you for
00:29:03.760 | the i'm gonna have to read those two citations as well thank thanks a lot guys that's helpful thank you
00:29:10.160 | welcome am i going too fast right now or is it fine a bit too fast
00:29:17.120 | maybe a high level overview again for those that aren't following the technical deep dives um what is
00:29:27.280 | okay the high level overview is given some embedding of our item we want to summarize that embedding the
00:29:37.280 | embedding could be hundreds or a thousand dimensions we want to summarize it into tokens
00:29:42.960 | which is integers and that's what an rqvae does
00:29:47.280 | so again to i think again i think in this paper the rqvae is quite essential i think the rest of it this
00:29:56.320 | this was the one that i had the hardest uh the largest challenge to try to implement the rest
00:30:01.760 | are quite straightforward so i'll spend i want to spend a bit more time here can i ask um question
00:30:06.960 | go ahead yeah uh that sum it's the the vector sum you're summing you're summing the embeddings
00:30:13.040 | yeah okay yeah you just sum it up and you'll get the output is it fair to say the novel contribution
00:30:20.560 | here is intermediate token summarization versus textual yeah it could be essentially it's a way to convert
00:30:25.520 | embeddings to tokens right firstly you quantize the embeddings and you think about it it's really tokens
00:30:33.120 | what's a good choice of a number of levels it's hard um i don't have the answer to that but in the paper
00:30:40.720 | they use three levels and each level has um 256 code words so that's 256 times 256 times 256 which can represent
00:30:52.720 | about 16.8 million unique combinations
00:30:56.160 | so now let's go into the code of course i won't be going to all the code i just want to go through
00:31:05.840 | the part to help folks understand uh a bit more
00:31:09.760 | i have this oh see this okay perfect so you can see over here the code book loss essentially that's
00:31:22.080 | how we represent the losses right the code book loss is just the the input we detach it the stop gradient
00:31:32.000 | and the quantized output which is just the uh the combined embeddings
00:31:39.600 | right the combined codebook embeddings this is the codebook loss commitment oh go ahead
00:31:44.400 | and the commitment loss is really we just take the quantization and we detach it
00:31:52.400 | and we do a mean squared error on the input right that's essentially all there is to this loss
00:31:58.560 | right so that's how we implement it quite straightforward then um this is how we do the quantization
00:32:08.480 | so you can see we firstly initialize uh some array to just collect all these for every vector quantization
00:32:20.560 | level we take the residual and the residual is really the first time at the very first input the residue
00:32:27.120 | is just the input embedding right we take the residual we pass it through the vector quantization
00:32:33.680 | layer and we get the output then we get the residual for the next level by just taking the residual and we
00:32:43.920 | subtract subtract subtract the the embeddings right that's the residual now our new output
00:32:52.080 | is we just keep summing up all this so our quantized output is just uh we just keep summing up
00:33:02.320 | so essentially at the end we can represent this initial residual by the sum of the vector quantization layers
00:33:12.720 | and that's that's that recursive thing that's all there is to it okay maybe i'm oversimplifying it but yeah
00:33:22.160 | then when we encode it to semantic ids essentially this x is the embedding right
00:33:29.280 | uh we convert this embedding so for every vector oh yes go for it
00:33:34.960 | is someone unmuted one of the callers is unmuted
00:33:43.280 | okay maybe thank you for muting them so the input is really the embedding right so for every vector
00:33:52.960 | quantization quantization quantization layer and the embedding is the initial residual
00:33:57.360 | again we just pass it in we get the in we get the output and this output we essentially what we want
00:34:05.520 | is the index so the index and then we get the residual and so we get the residual minus the codebook embedding
00:34:13.200 | and we just go through it again so at the end of this we get all the indexes for the semantic id
00:34:20.160 | and now that is our those are you consider them as our tokens right the sid tokens that we essentially
00:34:27.040 | saw here right um this over here this maps to the first index 201 the second index third index fourth index
00:34:38.080 | so essentially my rqvae has three levels and the fourth level is just for preventing collisions
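A sketch of the collision-breaking fourth token as described here: when two items land on the same three codes, a disambiguating suffix is appended. This mirrors the speaker's description of his own implementation; the paper's exact tie-breaking scheme may differ.

```python
from collections import defaultdict

def add_collision_token(item_codes):
    """item_codes: dict of item_id -> (c1, c2, c3) from the 3-level RQ-VAE.
    Appends a 4th token so that items sharing the same three codes stay distinct."""
    seen = defaultdict(int)
    semantic_ids = {}
    for item_id, codes in item_codes.items():
        suffix = seen[codes]          # 0 for the first item with these codes, 1 for the next, ...
        seen[codes] += 1
        semantic_ids[item_id] = codes + (suffix,)
    return semantic_ids
```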
00:34:48.880 | um i guess the last thing uh initialize okay the last thing the question that people folks had and
00:34:54.640 | this is what the paper does as well which is how do you first initialize these code books
00:34:59.200 | um there's a very interesting idea which is that we just take the first batch
00:35:04.000 | and for each level we perform k-means
00:35:12.400 | uh and we try to fit it right and then after that we just return the cluster centers we we do k-means with the
00:35:21.360 | number of means equal to your codebook size so my codebook size
00:35:29.600 | is 256 so i run k-means with 256 means so now i have all these 256 means
00:35:38.160 | i just initialize the codebook uh with these 256 means and that's how i do a smart initialization
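A minimal sketch of that k-means initialization, assuming the first batch has at least as many encoder outputs as the codebook size; scikit-learn's KMeans stands in for whatever clustering routine is actually used, and the per-level residual subtraction follows the description above.

```python
import torch
from sklearn.cluster import KMeans

def kmeans_init_codebooks(first_batch, num_levels=3, codebook_size=256):
    """Initialize each level's codebook with k-means centroids of that level's residuals.
    first_batch: (N, d) tensor of encoder outputs; N should be >= codebook_size."""
    residual = first_batch.detach().cpu()
    codebooks = []
    for _ in range(num_levels):
        km = KMeans(n_clusters=codebook_size, n_init=4).fit(residual.numpy())
        centers = torch.tensor(km.cluster_centers_, dtype=residual.dtype)
        codebooks.append(centers)
        # subtract each point's assigned centroid to get the next level's residuals
        residual = residual - centers[torch.as_tensor(km.labels_)]
    return codebooks
```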
00:35:47.040 | what does the fourth level in the rqvae do my rqvae only has three levels the fourth level is something i
00:35:53.920 | artificially add when there is a collision in the rqvae how to determine codebook size again that's a great
00:36:00.160 | question i actually don't know uh how to choose a codebook size it's really hard because we're not
00:36:04.480 | again we're not training the rqvae just for the sake of training rqva we're training rqvae for the
00:36:09.520 | downstream usage
00:36:13.680 | i think that's a great example uh of the image that was shared so now let's
00:36:23.120 | look at some trained rqvaes to try to understand how this works so i've trained a couple of rqvaes not all
00:36:33.680 | of them work very well but i want to talk you through this um so essentially if you recall
00:36:40.160 | their commitment with their beta was 0.25 right so all of these the beta is 0.25 unless i say otherwise
00:36:46.560 | so let's first understand what the impact on learning rate is um initially i have two learning rates the
00:36:55.520 | green learning rate is lower than the brown learning rate so you can see of course the brown one with a
00:37:01.600 | higher learning rate uh we get oh first let me take let me take some time to explain what the different
00:37:07.760 | losses means um okay so this is the validation loss essentially the the loss on the validation set uh this
00:37:23.200 | loss total is really just the training loss now if you recall the loss of the rqvae is two has two things
00:37:31.440 | in it that's the code book loss and then there's the commitment loss so now this and this is the
00:37:38.400 | reconstruction loss you can think of it as if you try to reconstruct the embedding how well does it do now
00:37:44.400 | so now this is the um vector quantization loss i do i didn't actually calculate the i think this is i
00:37:52.800 | can't remember this is the i think this is the code book loss right and the reconstruction loss is the uh
00:38:00.560 | um the commitment loss which is the reconstruction of the output loss okay so you can see and this is the
00:38:07.440 | residual norm which is how much residue is there left at the end of all my code book levels uh ideally this
00:38:15.520 | this demonstrates if you have very low residual norm it means that um your code books are
00:38:21.280 | explaining away most of the rest most of the input residual so a lower one is bad so of course all the
00:38:31.920 | losses the lower is better residual norm the lower is better now this one this is a different metric
00:38:38.000 | this is the unique ids proportion so at uh at some period i give a big batch of data uh maybe 16k or 32k
00:38:48.800 | give a big batch of data i do the i put it through semantic ids uh the quantization and i calculate
00:38:54.640 | the number of unique ids so you can see that as we train the number of unique ids gets less and less
00:39:00.080 | essentially the model is better at reconstructing the output but it uses fewer and fewer unique ids
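A sketch of how the two metrics discussed here, the unique-ID proportion and codebook usage, could be computed on a large held-out batch. `encode_to_ids` and `codebook_size` are assumed attribute names for illustration, not the speaker's actual API.

```python
import torch

@torch.no_grad()
def semantic_id_metrics(model, batch):
    """batch: (N, d) item embeddings. Returns the proportion of unique semantic IDs
    and the fraction of codes used at each level. Assumes `model.encode_to_ids`
    returns an (N, num_levels) integer tensor and `model.codebook_size` exists."""
    ids = model.encode_to_ids(batch)                              # (N, L) code indices
    unique_rows = {tuple(row.tolist()) for row in ids}
    unique_id_proportion = len(unique_rows) / ids.shape[0]
    usage_per_level = [ids[:, level].unique().numel() / model.codebook_size
                       for level in range(ids.shape[1])]
    return unique_id_proportion, usage_per_level
```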
00:39:08.160 | so what is a good model i think um what is a good model is quite hard to try to understand
00:39:18.800 | but here's what one person responded uh and the response kinds of make kind of make sense
00:39:24.640 | so that's what i've been working on so look at validation reconstruction loss again how well is your
00:39:31.520 | is your rqvae good at reconstructing the input the quantization error again this is your codebook error
00:39:37.120 | and codebook usage i think of this as unique ids as well as codebook usage we also see some graphs of codebook usage
00:39:46.880 | um okay so now you can see that with a higher learning with a higher learning rate which is
00:39:53.440 | this one this learning rate is higher than the green we have lower losses overall but
00:40:00.080 | our codebook usage is also lower so i think that's that's one thing to note again i don't know which
00:40:06.880 | of which of these are better is it lower loss better or codebook usage better and then i just
00:40:10.880 | took a step and just pick something um now over here this uh i think maybe this is just easier to
00:40:19.440 | compare these are the various weights right if you remember this is the commitment weight this is a
00:40:23.760 | commitment weight of 0.5 and this is a commitment one essentially the commitment weight is how much
00:40:29.520 | weight that you want to have on your um on your um on your on your codebook right on your uh on your
00:40:40.400 | commitment loss on the on the validation on the reconstruction error so you can and this this is
00:40:47.680 | commitment weight of 0.25 which is the default so you can see with a higher commitment weight uh our
00:40:52.960 | reconstruction error is actually very similar right but when we push the commitment weight up to 1.0
00:41:00.960 | you can see something something really different here uh our loss is a lot higher
00:41:09.040 | you can see our loss is just very high but the unique ids is also higher which is actually better
00:41:15.600 | and the residual norm is lower which is a good thing that means we explain most of the variance
00:41:22.480 | so in terms of how to figure out how to pick a good commitment weight i think it's very tricky
00:41:27.200 | uh i haven't figured it out essentially here when i double the commitment weight i do get a lower
00:41:32.480 | validation loss and my id proportion is similar but my residue norm is higher but in the end i just stick
00:41:39.280 | to the default which is 0.25 and over here this is another run i did where i clean the data essentially
00:41:45.680 | i excluded uh the product data and all this is like open open source amazon reviews
00:41:52.320 | data i clean the data i exclude those that are unusually short remove html tags and you can see
00:41:57.600 | by cleaning the data you can get a much lower validation loss much lower reconstruction loss
00:42:03.040 | residue norm is similar but your unique id proportion is a lot higher
00:42:10.960 | um does that make sense am i going too fast yes okay with the reduced number of unique ids am i just going
00:42:16.560 | too deep into this but i also do want to share a lot um okay maybe i'm going too deep
00:42:26.720 | into this but i'll just take one last thing to show you how a good codebook looks like um so this is the
00:42:36.960 | codebooks so these are the codebooks right and this is this is the distribution of codebook usage
00:42:41.200 | this is a this is a not a very good distribution of codebook usage you can see all the codebook usages in
00:42:50.400 | um token number one right and you can see there's very large spikes so this is when your rqva is not
00:42:57.040 | trained well you don't have a very good distribution and but and then you can see over here um
00:43:05.600 | where's my numbers for you can see over here the proportion of unique ids is only 26 that means all your
00:43:21.600 | data you've only collapsed it to a quarter of it essentially you've thrown away three quarters of
00:43:27.200 | your data but what happens when you have a good uh good rqvae here's how it looks like the proportion
00:43:35.280 | of unique ids is 89 um but this is the codebook usage you can see a codebook usage is very well sort of
00:43:44.400 | somewhat well distributed and all the quotes are well learned and all the quotes are fairly well
00:43:51.680 | utilized so this is what it means by having a good codebook usage
00:43:58.560 | so what have you done so that you can uh have a good uh code distribution codebook
00:44:27.120 | um the main thing that i have done is so a few things they use k-means clustering initialization
00:44:37.600 | so i do that k-means clustering initialization um the other thing i do is
00:44:48.960 | codebook reset so essentially any time at every epoch if there are any codes that are unused i reset it
00:44:57.200 | so essentially i'm just forcing i i force the i reset unused codes for each rqvae layer
00:45:02.960 | so i'm forcing the model to keep learning on those unused codes seems to work decently um yeah
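A sketch of that codebook-reset idea: at the end of each epoch, any code that was never selected is re-seeded. The talk only says unused codes are reset per RQ-VAE layer, so re-seeding them from recent residuals is an assumption here, not the speaker's stated choice.

```python
import torch

@torch.no_grad()
def reset_unused_codes(codebook, code_counts, recent_residuals):
    """codebook: (K, d) parameter tensor for one RQ-VAE level; code_counts: (K,) how
    often each code was selected this epoch; recent_residuals: (N, d) residuals from
    a recent batch to sample new code vectors from (assumed reset target)."""
    unused = (code_counts == 0).nonzero(as_tuple=True)[0]
    if unused.numel() == 0:
        return
    picks = torch.randint(0, recent_residuals.shape[0], (unused.numel(),))
    codebook.data[unused] = recent_residuals[picks]
```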
00:45:15.520 | so the next thing now we will very quickly go through the recommendations part of it because i
00:45:20.560 | think it's fairly straightforward um so oh michael has a hand raise hello
00:45:28.560 | yes hi you good yes yeah i'm just yeah i'm from south india and i am right now entering into this ai
00:45:39.200 | field so right now what do you suggest and how to enter into this field as per the new concepts that
00:45:46.320 | are right now in your point of view yeah i think a great way to learn this is i'm gonna share an invite
00:45:56.240 | link uh a great way to learn this is really just to hang out in the discord channel
00:46:02.640 | um so i'm gonna copy the link i'm gonna paste it in the chat i think the really just us is just
00:46:09.680 | learning via osmosis right yeah thanks yeah i have learned no
00:46:17.600 | code development and all these things for almost four months but right after that i have
00:46:25.760 | run into some issues because of the coding so yeah thank you michael what i would do
00:46:32.080 | is to really just uh ask on the discord channel and the thing is the the reason why i'm saying this
00:46:38.240 | is there's only 15 minutes left and i still have quite a bit i want to share and go through yeah
00:46:42.640 | anyone has any questions related to the paper
00:46:48.320 | frankie okay can you hear me yes yeah a quick question maybe it's not relevant but i noticed
00:46:54.480 | that you did some stop gradient uh so it's stopping back propagation is that is that because you're trying
00:47:00.240 | to freeze something when you're doing training i don't quite understand it's it's just um firstly it's
00:47:07.600 | part of the rqvae formulation where they do stop gradient right and so i'm just following it and the reason
00:47:14.720 | is if you don't do stop gradient if you don't do stop gradient this is the loss function right it just
00:47:21.040 | simplifies to this right which is residual minus codebook embedding so now this leads to a degenerate
00:47:26.480 | solution right you can imagine that your encoder will just encode everything to zeros and your codebook
00:47:33.600 | will just map everything to zeros and essentially you just use one single point where they both cheat
00:47:38.640 | okay but then you have to allow training at some point right because you're training the codebook
00:47:44.400 | so how do you how does that how does that codebook get modulated then if you do stop gradient so we
00:47:50.240 | do stop gradient on the residual for the first half of the rqvae loss equation and then we do stop
00:47:56.800 | gradient on the codebook for the second half and then there's a weighting we weight it by
00:48:01.600 | yeah i see it thank you so much yeah okay so i know we have some questions left but i'm going to take
00:48:06.320 | five to ten minutes to try to go through the recommendations part of it uh which is really where it all comes
00:48:12.640 | together and we also have results for that so i'm going to ignore the questions for a while so now
00:48:18.160 | generative uh recommendations right what we do is we essentially construct item sequences for every
00:48:23.520 | user we sort them in terms of all the items they've interacted with then given a sequence of items
00:48:30.320 | the recommender's task is to predict the next item very much the same as language modeling given a
00:48:37.040 | sentence predict the next word in the sentence right essentially that's what sasrec is what a lot of our
00:48:42.960 | recommendation models are right now so uh so for a regular recommendation given a sequence of items you
00:48:51.520 | just predict the next item for a semantic id based recommendation given a sequence of semantic ids you
00:48:58.720 | predict the next semantic id which is now not a single item id but a sequence of four semantic id tokens
00:49:05.040 | so now the task is more challenging you have to try to predict the four semantic ids
00:49:09.360 | all right so you can see that they they use some data sets here um and they have the rqvae which we
00:49:15.840 | spoke about there's an encoder a residual quantizer and the decoder i think the encoder and decoder is
00:49:20.720 | essentially the same thing i mean not the same thing it's just mirror image of each other different
00:49:25.440 | ways of course as well and residual quantizer so you can see they have a sequence to sequence
00:49:31.280 | model implementation um and this is their codebook right uh 1024 tokens um
00:49:37.920 | it's only three levels and the fourth level is for breaking collisions they include user specific
00:49:44.720 | tokens for personalization in my implementation in my implementation i didn't do that now let's look
00:49:49.680 | at some sas rec code for the recommendations and we'll see how how similar it is to um
00:49:55.360 | how similar it is to language modeling
00:49:59.760 | so over here this is the sasrec code and if you see it's actually very similar right you see
00:50:08.080 | modules like causal self-attention which is predicting the next token you can see the familiar things attention
00:50:15.040 | number of heads etc and then you know qkv and then you have our mlp uh which is really just uh relus
00:50:22.160 | and then the transformer block this is a pre-layer norm layer norm self-attention layer norm and mlp
00:50:29.040 | that's it that is our recommendation system model so now the sas rec is item embedding for every single
00:50:37.280 | item we need to have a single item embedding position embedding and dropout
00:50:41.840 | and then the blocks are really just the number of hidden units you want and just transformer blocks
00:50:47.520 | and of course a final layer norm um for the forward pass what we do is we get all the item embeddings
00:50:57.440 | we get all the positions and we combine it so now we have the hidden states right it's actually the
00:51:05.760 | what has happened in the past this is what has happened in the past now for the prediction we take
00:51:11.040 | the hidden states which is the past five six now imagine this is language modeling we take the past
00:51:18.800 | five to six tokens we forward it we get the next language token candidate embeddings you can see now
00:51:25.440 | we get the candidate embeddings and what we do in recommendations oops what we do in recommendations
00:51:36.960 | is we score them and the score is really just dot product essentially given what has happened in the
00:51:43.680 | past uh we encode it we get a final hidden layer given all the candidate embeddings we do a dot product
00:51:50.400 | and that's the score so essentially given all the sentences given all the potential next tokens
00:51:57.680 | get the dot product to get the next best token and that's the same for recommendations uh over here
00:52:03.520 | um and the training step i don't think i'll go through this um i won't go through this
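A sketch of the scoring step just described: the last hidden state of the sequence model is dot-producted against the candidate item embeddings to rank next items. Shapes and names are illustrative, not the speaker's exact code.

```python
import torch

@torch.no_grad()
def score_next_items(hidden_states, candidate_embeddings):
    """hidden_states: (B, T, d) transformer outputs over a user's item sequence.
    candidate_embeddings: (num_items, d) item embedding table."""
    last_hidden = hidden_states[:, -1, :]                  # (B, d): summary of the sequence so far
    scores = last_hidden @ candidate_embeddings.T          # (B, num_items) dot-product scores
    return scores.argsort(dim=-1, descending=True)         # ranked candidate item ids
```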
00:52:12.720 | now for sasrec with semantic ids it's the same right now this is the forward pass uh no this is not the
00:52:18.880 | forward pass for sasrec with semantic ids this is the same predict next item so now you saw that in
00:52:26.560 | the previously we were just predicting only one single next item now in this case we have to predict
00:52:32.160 | four tokens to represent the next item so when we do training we apply teacher forcing whereby
00:52:41.920 | first we try to predict the first token if that is correct that's fantastic we use that that first
00:52:46.720 | token to predict the next token but if it's wrong we replace it with the actual correct token
00:52:51.120 | so that's what we do when we do training but when we do evaluation we don't do this
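A sketch of that teacher-forcing loop for the four SID tokens of the next item; `model.next_token` is an assumed helper that returns the model's argmax next token. Feeding the ground-truth token forward is equivalent to "keep it if correct, otherwise replace it with the correct token".

```python
import torch

def predict_item_with_teacher_forcing(model, context, target_tokens):
    """context: 1-D tensor of tokens so far; target_tokens: the four ground-truth SID tokens."""
    tokens_in = context
    predictions = []
    for target in target_tokens:
        pred = model.next_token(tokens_in)                 # model's guess for this level
        predictions.append(pred)
        # keep the prediction if correct, otherwise substitute the correct token --
        # which amounts to always feeding the ground truth forward during training
        tokens_in = torch.cat([tokens_in, target.view(1)])
    return predictions
```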
00:52:57.680 | uh when we do evaluation
00:53:01.040 | you can see uh
00:53:07.840 | okay i'm definitely losing people here so i won't go too deep into this but when we do
00:53:12.080 | evaluation we actually don't do that correction the model has to use its own predictions so now let's look at some results of
00:53:16.880 | how this runs essentially when i was doing this all that i really cared about was do my semantic ids
00:53:22.640 | make sense so let's first look at this uh this is a this is a sasrec model you can see the the ndcg is
00:53:31.040 | uh unusually high right so firstly the purple line is the raw sasrec that just predicts item ids
00:53:39.920 | given the past item ids predicts the next item id uh let's just look at hit rate right you can see the
00:53:44.960 | hit rate is 99 percent uh i i made it artificially easy uh because i just wanted to get a sense of whether
00:53:50.480 | there was any signal or not you can see here it's 99 percent uh we exclude the most infrequent items there's no
00:53:56.160 | cold start problem over here and uh the negatives are very easy to
00:54:01.680 | get so you can see this is how the regular sasrec does with ndcg of 0.76 now let's now look at our
00:54:08.320 | semantic ids now again to be to be very clear right semantic ids the combination of semantic ids that are
00:54:15.440 | possible is 16.8 million but in our data set we only have 67 000 potential data points so the fact
00:54:23.440 | that from those 16.8 million combinations it's able to even predict correct items that's a huge thing
00:54:31.040 | so over here we can see that after we train it the hit rate is 81 percent that is huge essentially it
00:54:37.840 | means that the model this new recommendation model instead of predicting the next item it has to predict
00:54:44.400 | four next items the the four tokens that make up the next item and only if that is an exact match we
00:54:51.680 | consider it as a hit right for hit rate and recall and it's able to do that and you can see that the ndcg is
00:54:57.520 | able to do that uh the ndcg is also decent so now it doesn't it doesn't quite outperform uh the regular
00:55:03.600 | sasrec because this is uh it would outperform it if we had cold start items or very sparse items i
00:55:10.000 | think it could beat or even match it but essentially this was all just a test to try to see if we can actually
00:55:16.240 | train such a model now now that this test has passed we know that our semantic ids are actually working
00:55:23.440 | well now we can fine-tune a language model right um this is so this is fine-tuning a language model
00:55:33.200 | you can see that the this is the learning rate so initially this is my original data you can see
00:55:40.080 | uh maybe i'll just hide this first this is this is training on my original data and you can see my
00:55:44.160 | original data this is the validation my original data was very huge uh was there was a lot of data
00:55:49.360 | because i didn't really clean it i just used everything i was just not sure whether you could
00:55:52.720 | learn or and i found that it could learn so let's look at an example over here this so you can see over
00:56:02.720 | here these are the initial prompts right and it's able to say things like let's look at test number three
00:56:09.360 | if product a has this id if product b has this id what can you tell me about their relationship
00:56:13.920 | and all it sees is ids but it's able to say they're related through puzzles
00:56:19.120 | um and then list three products similar to this product right i don't know what this product is um but
00:56:28.240 | i i do know that when i look at it in the title form this is what i expected it was able to respond
00:56:33.200 | in valid semantic ids so what this means is that hey you know this thing can learn uh on my original data
00:56:40.240 | so then after i clean up my original data so now this is my new data set you can see a new data set
00:56:44.720 | it's half the size of original data when you can see when we have this sharp drop during training loss
00:56:50.400 | essentially that means i'm it's the end of an epoch it's a new epoch but evaluation loss just keeps going
00:56:56.080 | down as well and the final outcome we have is a model like this this is a checkpoint from the first epoch
00:57:04.160 | whereby it's able to um it's able to speak in both english and semantic ids
00:57:15.360 | um okay that was all i had uh any questions sorry it took so long
00:57:20.720 | you sure you have a question hi eugene uh so you said that um when you notice some code points are
00:57:33.520 | being unused you reset them um after you reset did you notice them getting used do you have some metric to
00:57:39.840 | track whether the resetting was actually helpful in that case yes i i was able to
00:57:47.280 | try that uh sorry i'm just responding to numbers yes i do have that um uh in my let's again let's look at
00:57:59.040 | this when i was training the rqv i had a lot of metrics uh i only went through just a few of them
00:58:09.040 | um but this is it this is the metric um that is important um first oh wait are you oh wait i'm
00:58:17.360 | stopping you stop sorry okay here we go um first let me reset the axis
00:58:24.560 | okay so the metric that is important is this codebook usage so i do log codebook usage
00:58:38.240 | um at every epoch so you can see at step 100 or at epoch 100 codebook usage was only about 24 percent
00:58:44.400 | but as you keep training it codebook usage essentially goes up to 100 percent
00:58:48.480 | so that's the metric i i love nice thanks you're welcome uh frankie oh yeah um can you go over like
00:58:57.840 | again uh for the semantic id part what is the training data because i'm having a hard time
00:59:04.400 | understanding like what what do you actually what is the llm learning that so can you can explain that
00:59:10.560 | again uh for training the rqvae no no no no for the semantic id part uh the last example because you
00:59:19.360 | basically you're basically inputting say a cement id and say what are the hardest two things related to
00:59:25.280 | each other and i'm having a hard time figuring out like what i what is the um was it actually learning
00:59:31.600 | here so this is the training data so essentially the training data is a sequence of items so you can
00:59:38.720 | see this is the raw item right uh and for each item we map it to this sequence of semantic ids
00:59:46.880 | this is now we're trying to order does the order in this uh list matter or not yes the order matters
00:59:53.840 | is essentially the sequence of what people transact okay and the order in the
00:59:59.920 | semantic id matters as well for example if we were to swap the positions of sid105 and sid341 it wouldn't
01:00:06.960 | make sense that would be an invalid semantic id and the reason is because of the codebook
01:00:12.560 | vocab the first number should be between 0 and 255 the second number should be between 256 and 511 and so on and so forth
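A small sketch of the offset scheme described above, assuming four levels of 256 codes each, so a flattened semantic ID is valid only when its i-th token falls inside the i-th level's range; function names are illustrative.

```python
CODEBOOK_SIZE = 256
NUM_LEVELS = 4

def to_flat_tokens(level_codes):
    """Map per-level codes (each 0..255) into the flattened vocabulary:
    level 0 -> 0..255, level 1 -> 256..511, level 2 -> 512..767, and so on."""
    return [level * CODEBOOK_SIZE + code for level, code in enumerate(level_codes)]

def is_valid_semantic_id(tokens):
    """A semantic ID is valid only if each token sits inside its level's range."""
    return len(tokens) == NUM_LEVELS and all(
        level * CODEBOOK_SIZE <= t < (level + 1) * CODEBOOK_SIZE
        for level, t in enumerate(tokens)
    )

print(to_flat_tokens([12, 7, 200, 3]))            # [12, 263, 712, 771]
print(is_valid_semantic_id([12, 263, 712, 771]))  # True
print(is_valid_semantic_id([263, 12, 712, 771]))  # False: first two levels swapped
```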
01:00:19.280 | yeah so that makes sense i mean but i'm talking about like within the group of four so
01:00:25.200 | if you permute those uh does it matter so it does matter as well yeah it definitely
01:00:33.520 | must be in this order if not it will be an invalid semantic id i understand because you have to
01:00:39.280 | go with your levels of your coding right so you have four tokens that i understand but i'm saying
01:00:44.480 | that you have multiple items here right yes so if you permute those does it matter uh i don't know
01:00:51.920 | i think it depends it may or may not so essentially if i were to buy a phone and then you recommend me a
01:01:00.000 | phone case and then pair of headphones and a screen protector that makes a lot of sense but if i were to
01:01:07.440 | buy a screen protector and you recommend me a phone that doesn't make sense
01:01:10.640 | right in terms of recommendations and user behavior okay okay so the
01:01:19.440 | sequence encodes that behavior that's what you're saying yes that's correct okay
01:01:31.600 | any other questions yeah eugene uh i had done a lot of work on vector quantization and the there are some
01:01:43.680 | other uh techniques to handle like under utilization like splitting and one important question is the
01:01:53.520 | distance metric like i noticed that you're using euclidean distance to measure the difference between
01:02:01.600 | the embeddings and the code words but the euclidean distance is not good when there is a difference in
01:02:09.840 | value range across the features so have you tried to do it i'm sorry i'm in a noisy environment please go ahead
01:02:23.520 | okay so have you tried different metrics like mahalanobis distance
01:02:27.440 | um i have not
01:02:34.080 | essentially my implementation uh reflects what um what was written in the paper so and i i'm not sure if i'm
01:02:47.840 | really using euclidean distance uh i don't remember where is it
01:02:54.880 | yeah but it's it's worth a try but i i just use the implementation written in the paper and it works
01:03:04.160 | it's in the main uh equation where you have the stop gradient because that term is the
01:03:12.000 | euclidean distance and it is by default what the k-means algorithm uses but yeah you can change that
01:03:18.080 | so that's what i'm using yeah i just replicated that
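A minimal sketch of the quantization step being discussed — nearest code word by squared Euclidean distance plus the straight-through stop-gradient trick — for a single level; the names are illustrative, not the speaker's code.

```python
import torch

def quantize(residual: torch.Tensor, codebook: torch.Tensor):
    """residual: (batch, dim) encoder outputs or residuals; codebook: (num_codes, dim).
    Returns straight-through quantized vectors and the chosen code indices."""
    # squared euclidean distance between each residual and every code word
    dists = torch.cdist(residual, codebook, p=2) ** 2     # (batch, num_codes)
    indices = dists.argmin(dim=-1)                        # nearest code word per row
    quantized = codebook[indices]                         # (batch, dim)
    # straight-through estimator: forward uses the code word, backward passes
    # gradients to the residual as if quantization were the identity
    quantized = residual + (quantized - residual).detach()
    return quantized, indices
```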
01:03:22.480 | any other question oh pastor you have a question yeah so i have a question related to to that so in
01:03:34.160 | the paper says that you use k-means to initialize the codebook and then
01:03:44.320 | like you run it on the first training batch and then you use the centroids
01:03:51.920 | uh as initialization is that what you used and to follow up on that it also says that there is an
01:04:01.040 | option to use like k means clustering hierarchically but it lost semantic meaning so i didn't really
01:04:08.400 | understand that part if you can briefly explain like and and if you use the centroids as initialization
01:04:14.880 | thanks uh so yes that's what i used um i'm sorry what was your second question yeah so also the paper
01:04:24.960 | said that the other option is to use k-means hierarchically but it loses semantic meaning
01:04:34.640 | between the clusters yes yes yes so yeah do you know why or how that plays or what's the difference
01:04:42.720 | between using the the you know the the what is suggesting the paper uh that's a great question
01:04:49.520 | right so are you referring to this uh they use k-means clustering hierarchically essentially these
01:04:55.760 | are different alternatives for quantization right um over here we quantize using an rq-vae they also
01:05:02.800 | talk about locality sensitive hashing then the other one is they use k-means clustering first level k-means
01:05:08.080 | then second level k-means and third level k-means um i don't know why it loses semantic meaning um honestly
01:05:16.320 | so i won't be able to address that i haven't actually tried this yeah and i don't have a strong
01:05:21.680 | intuition on why it loses semantic meaning okay cool thanks thanks yeah i can uh try to answer that
01:05:30.560 | like when you are using a single flat id and you're using proper uh k-means clustering the entities
01:05:41.360 | are unique and they represent something but if you try to split them into different
01:05:46.320 | levels like you group on the first level and then you do grouping on the second level it's as if
01:05:54.400 | you're trying to cluster numbers uh if you take a three-digit number nine seven three and cluster it
01:06:04.160 | with eight seven one or eight two one then they are close together but if you cluster on the
01:06:10.880 | hundreds digit first then on the tens and then the ones you know you lose the meaning and
01:06:20.240 | these numbers do not become close to each other
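A tiny worked example of the analogy above: grouping three-digit numbers digit by digit from the most significant digit splits numerically close values that straddle a leading-digit boundary, which is the sense in which a purely hierarchical grouping can lose the original notion of closeness. Purely illustrative.

```python
def digitwise_bucket(n: int):
    """Bucket a three-digit number by hundreds digit, then tens, then ones."""
    return (n // 100, (n // 10) % 10, n % 10)

print(digitwise_bucket(199))  # (1, 9, 9)
print(digitwise_bucket(201))  # (2, 0, 1) -- only 2 apart from 199, yet split at the top level
print(digitwise_bucket(100))  # (1, 0, 0) -- 99 apart from 199, yet shares its top-level bucket
```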
01:06:26.000 | awesome makes sense thank you thank you okay we're very much over time thank you vibu for kindly
01:06:34.000 | hosting us for so long uh i can stay for any other questions feel free to take more of eugene's time
01:06:42.640 | uh for next week we already have a paper i'll post it in the chat right now
01:06:49.920 | um but yeah thank you thank you eugene for presenting and same thing if anyone wants to volunteer a paper
01:06:56.560 | to present the following week um you know now is your shot you'll you'll definitely learn a lot
01:07:03.600 | going through the paper being ready to teach it to someone otherwise the paper is how do language models
01:07:09.440 | memorize and and also to be clear i want to make it very clear right that i did not prepare for
01:07:14.640 | this over the course of a week so if you're preparing for this i've been working on this for
01:07:19.920 | several weeks ever since like 4th of july holiday right and essentially this is just i've been working
01:07:26.320 | on it i'm excited to share about with people and that's why i'm sharing most of the time we usually
01:07:30.080 | just go through a paper so if you want to volunteer it's okay to just have the paper that's it
01:07:35.200 | if you want to see the other side of uh you know the other extreme unlike eugene who prepares for
01:07:40.720 | months i read the paper the day before or the morning of hey don't call me out
01:07:47.600 | yes but people have slides that's the thing i will never have slides no no i i used to do slides i
01:07:52.640 | actually think it's uh slightly detrimental i don't think it's as beneficial to make slides for papers
01:07:58.000 | anymore papers are good ways to read uh to understand information so now it's just uh walk through a
01:08:04.160 | highlighted paper just read it understand it and yeah okay take eugene's time asking about semantic ids
01:08:15.440 | yeah and we can we can also go through more of the paper if you want i know i haven't
01:08:19.280 | gone through as much of it as i should but actually the answer is there and uh the
01:08:27.760 | replication results are very similar oh you have a hand raised hey eugene uh so while you're
01:08:32.400 | implementing this right uh were there anything was there anything from the paper that you could not
01:08:37.040 | replicate which you had tried uh i think it took me a lot of time to try to train a good rq-vae um
01:08:52.480 | and that's why i spent so much time on it so essentially to train an rq-vae you see that i
01:08:59.360 | have so many experiments right i did crazy things stop gradient to decoder um gradient clipping
01:09:07.600 | and a lot of it failed um i even tried like there was this thing called um exponential moving average
01:09:14.720 | to learn the rq-vae codebooks like it has great unique ids right um but it's just
01:09:22.960 | not able to learn you can see that the loss is very bad um and that was because i was new to this so
01:09:30.400 | you can see the loss uh actually if we zoom in on only those after 2000 you can see that
01:09:38.080 | we see zotero oh shoot you only see zotero oh thank you i should share my desktop thank you for
01:09:44.640 | so you can see i tried some crazy things like this exponential moving average approach
01:09:50.240 | to update my codebooks so you can see um the purple line is the exponential moving
01:09:55.360 | average you can see the loss is very high um and it just doesn't work as well
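One common formulation of exponential-moving-average codebook updates, in the spirit of VQ-VAE-style EMA training; a generic sketch under those assumptions, not necessarily the exact variant tried here.

```python
import torch

@torch.no_grad()
def ema_codebook_update(codebook, cluster_size, embed_sum, residual, indices,
                        decay: float = 0.99, eps: float = 1e-5):
    """Update code words as running means of the vectors assigned to them,
    instead of learning them by gradient descent.

    codebook:     (num_codes, dim) current code words, updated in place
    cluster_size: (num_codes,)     EMA count of assignments per code
    embed_sum:    (num_codes, dim) EMA sum of assigned encoder outputs
    residual:     (batch, dim)     vectors quantized this step
    indices:      (batch,)         chosen code index per vector
    """
    num_codes = codebook.shape[0]
    onehot = torch.nn.functional.one_hot(indices, num_codes).type_as(residual)
    cluster_size.mul_(decay).add_(onehot.sum(0), alpha=1 - decay)
    embed_sum.mul_(decay).add_(onehot.t() @ residual, alpha=1 - decay)
    # laplace smoothing avoids dividing by zero for rarely used codes
    n = cluster_size.sum()
    smoothed = (cluster_size + eps) / (n + num_codes * eps) * n
    codebook.copy_(embed_sum / smoothed.unsqueeze(1))
```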
01:10:02.000 | like gradient clipping the rotation trick uh figuring out the right batch size what the right weights are what
01:10:06.960 | the clean data is so this took me several experiments and that was after even deleting the experiments
01:10:13.360 | that just don't make sense to save whereas the recommendation part i just needed two experiments
01:10:19.040 | you can see yes even though the recommendation loss uh the hit rate and ndcg is not close to the
01:10:26.080 | sasrec one the intent was never that the intent was can a sasrec actually learn on this
01:10:32.880 | sequence of four tokens and very quickly you find that yes it can learn and then that's when i said okay now
01:10:40.000 | let's not just train a sasrec from scratch let's fine-tune a qwen model and
01:10:48.960 | this was not part of the paper this was just uh for my own learning you can see the paper uses all these
01:10:54.640 | uh recommendation models but yeah that's it and uh because i'm also planning to train an rq-vae are there
01:11:04.720 | some uh good places i can start with i remember uh there should be some official repo for
01:11:10.240 | that on github if i'm not wrong i remember seeing that a while back uh well i wish there
01:11:17.840 | was but i definitely didn't come across that um but what i would recommend essentially this is my
01:11:27.680 | recommendation uh it's very great for learning especially now with things like claude code
01:11:33.520 | to try to code this out from scratch uh you really learn the nitty-gritty things right about how
01:11:39.360 | the code is actually represented and yeah i think colette was right i was using euclidean
01:11:48.000 | distance uh and this is the distance i was computing so but yeah i don't know maybe there's a better
01:11:53.600 | distance i certainly haven't gone as far down that road i just took the default implementation and
01:11:59.760 | just followed it makes sense thanks for answering the questions eugene you're welcome eugene quick
01:12:07.440 | question on your highlighting what does red and green mean um so over here so green means good right like
01:12:20.000 | these are the benefits like training allows knowledge sharing unlike atomic item ids and then red is the
01:12:25.840 | downsides right essentially we are considering a new technique what are the downsides so with a regular sasrec
01:12:31.440 | right if i had to train a regular sasrec and i had a billion products my embedding table would be a
01:12:36.160 | billion rows but with two five six to the power of four with just a four level codebook two five six
01:12:45.120 | codes at four levels i can represent four billion products and that's just one oh two four embeddings
01:12:52.800 | right so instead of a billion embeddings uh yeah a billion row embedding table i just need a
01:13:00.000 | thousand row embedding table and that saves you a lot of uh compute i see thank you
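The arithmetic here, written out with the numbers mentioned; purely illustrative.

```python
codebook_size = 256   # code words per level
num_levels = 4

representable_items = codebook_size ** num_levels   # 256**4 = 4,294,967,296 (~4 billion)
embedding_rows_needed = codebook_size * num_levels  # 256 * 4 = 1,024 rows

print(representable_items, embedding_rows_needed)
# versus an atomic-id embedding table with one row per product,
# which for a billion products needs about a billion rows
```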
01:13:08.320 | you're welcome kishore has a hand raised oh i assume kishore has asked his question frankie you have a hand raised
01:13:13.120 | uh yeah so i think it's similar to what you just spoke about i had a question about if you do the two
01:13:18.640 | tower one you have two encoders and they're trying to like create this embedding that approximates whatever
01:13:23.760 | query and item so it creates an embedding and i can't so i think you tried to explain a little bit just
01:13:30.800 | now but i try to understand why is that embedding worse than this one that's created by you know
01:13:37.600 | whatever your quantization steps right so the embedding is not worse i think the embedding is probably
01:13:43.680 | just as good or even better but there will be a lot of those embeddings so imagine if we had a billion
01:13:50.720 | products we will need a billion embeddings one for each product right but with semantic ids
01:13:57.680 | if i have um if i have a four level codebook because each codebook can represent two five six and
01:14:04.640 | assume each codebook can represent two five six can you see my raycast yes so imagine each codebook
01:14:10.480 | can represent two five six right and then all the combinations of this
01:14:16.160 | codebook is actually four billion now i don't need a billion products i just need
01:14:23.840 | two five six times four i don't need a billion embeddings i just need two five six times four
01:14:28.000 | embeddings so in some sense i kind of think of it as a better encoding of what it is right so you're
01:14:36.240 | saying that i can more economically compress more economically compress yeah i think economically
01:14:41.120 | compress is the right term so and and the reason is because at least in my so in the paper
01:14:49.920 | in the paper they showed that um using their generative retrieval they were able to achieve
01:14:56.000 | better results that's not something that i was able to replicate um and also because their data
01:15:01.840 | sets are actually kind of small the dataset size is actually very small right you can see that
01:15:09.040 | their datasets have like very few items um i deliberately chose a dataset that was much bigger
01:15:16.640 | and maybe with a bigger this i again i don't know how this would scale with a much bigger data set my
01:15:21.120 | data is at least 4x bigger than their largest dataset and i actually want to scale this to million item
01:15:26.480 | datasets right uh so that's one thing i haven't been able to replicate but again you can see
01:15:32.800 | here that um this idea is not stupid the model even a regular sasrec is able to learn
01:15:42.480 | on this sequence of token ids
01:15:43.920 | yeah any other questions uh sorry the hand raising function doesn't seem to work for me so i just asked
01:15:57.120 | directly okay so i'm curious for the sasrec model and the qwen model that you fine-tuned have
01:16:05.440 | you compared the performance of those in terms of uh next semantic id recommendation
01:16:13.760 | so obviously the sasrec model is way better um the qwen model is not as good um but the purpose of
01:16:24.080 | the qwen model is not to be better than the sasrec model right the purpose of the qwen model is to
01:16:32.080 | train to have this new capability whereby users can
01:16:40.160 | shape their recommendations essentially you can imagine this like chat right let's say okay
01:16:50.800 | it's the convergence of chat it's the convergence of recommendation systems search and chat
01:16:59.520 | that is the intent in the sense that it has a more general interface than the sasrec that
01:17:09.920 | mostly just recommends semantic ids in this case i think if you were to build a chat model
01:17:16.400 | then this qwen model could be helpful uh yeah yeah so flo has a great question um what is this model's
01:17:25.920 | param size so um let's look at it
01:17:31.680 | do i have logs please tell me i have logs yes i do have logs fantastic
01:17:39.280 | eugene you're not sharing your screen again yeah i'm sharing my screen now uh okay so you can see
01:17:43.440 | this is the regular sasrec model um you can see we have these users and this is the total parameter
01:17:49.840 | size it's only a one million parameter model right and then for the larger
01:17:58.880 | sasrec model which is the one with the tokens um
01:18:08.880 | oh it's a seven million parameter model yeah so whereas qwen is like 8b
01:18:17.120 | so yeah it's a lot smaller but you know for recommendation systems models where you're just
01:18:22.880 | predicting the next token you don't need something so big there's no benefit from additional
01:18:27.600 | depth in your transformer model
01:18:36.960 | okay okay i guess the questions are drying up okay so that's it then thank you for staying
01:18:41.760 | oh this is me yes so the semantic id uh is it uh only four levels four digits
01:18:54.800 | in my use case i only use four digits but you can imagine this to be anything you can make it eight levels
01:19:02.640 | at every codebook level you can have more code words in my case every codebook
01:19:08.400 | level has 256 so you can have it as wide and as deep as you want okay and again how to decide that
01:19:17.520 | is very very difficult uh if you know of a way please share it with me i don't know of a way
01:19:25.360 | that i can assess if my sasrec is good or not no i mean i don't know of a way where i can assess if my
01:19:31.680 | rq-vae is good or not the value of my model is actually the value of what it does downstream
01:19:43.600 | um so that is tricky so i have some rq-vae run analysis so essentially with all the different
01:19:51.760 | runs like what the different unique ids are so essentially eventually what i did is i prioritized
01:19:57.200 | validation loss and the proportion of unique ids um and also not too many tokens because that makes
01:20:03.520 | real-time generation slow uh in the chat format if your semantic id is like 32 tokens
01:20:10.480 | long then in order to return a product you need to generate 32 tokens which is not very efficient so i just chose
01:20:15.600 | to have it four tokens long
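A sketch of the selection heuristic just described — prefer RQ-VAE runs with low validation loss and a high proportion of unique semantic IDs, and penalize longer IDs since every extra token adds generation latency; the weighting and the example numbers are made up for illustration.

```python
def score_run(val_loss: float, unique_id_fraction: float, num_tokens: int) -> float:
    """Higher is better; weights are illustrative, not tuned."""
    return -val_loss + 2.0 * unique_id_fraction - 0.05 * num_tokens

runs = [  # hypothetical run summaries, not actual results
    {"name": "ema-codebook", "val_loss": 0.90, "unique_id_fraction": 0.99, "num_tokens": 4},
    {"name": "rotation-trick", "val_loss": 0.31, "unique_id_fraction": 0.97, "num_tokens": 4},
    {"name": "eight-levels", "val_loss": 0.28, "unique_id_fraction": 0.99, "num_tokens": 8},
]
print(max(runs, key=lambda r: score_run(r["val_loss"], r["unique_id_fraction"], r["num_tokens"])))
```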
01:20:25.120 | uh yeah i think there are other ways you know to measure that you mentioned two of them um but you can also look at the perplexity or codebook token efficiency
01:20:37.360 | and uh what does that mean like uh how many tokens are needed to achieve uh good performance in your
01:20:45.440 | opinion but then what is the measurement of good performance yes uh there are some
01:20:53.600 | practical uh characteristics uh like scalability um can you uh add more data and uh you see it's able to
01:21:06.800 | to handle that um and uh how like inference latency how long will it uh take to uh to try um you know
01:21:20.560 | to see what the recommendations are yeah inference latency i think is a function of the codebook depth
01:21:28.320 | right so with more codebooks more code levels required you need more tokens which means more
01:21:35.440 | inference latency yeah okay sorry someone else uh it was not a question i remember seeing in a paper where
01:21:50.240 | um they evaluated uh the code words which they learned i think they did something like uh a visualization
01:21:56.960 | where they could get the taxonomy and they showed that uh each taxonomy made sense where each uh the code word
01:22:04.880 | and then below that all the sub code words were i guess something more like categories and subcategories
01:22:10.720 | i don't know if it was the same paper we showed that maybe it's the same paper it's the same paper yeah
01:22:15.680 | they they talk about it um whereby yeah it's over here they what they try to what this visualization shows
01:22:21.920 | is that our semantic ids have meaning right so it tries to show that okay you know this is how
01:22:29.440 | like how do you interpret it etc um but for me i don't really need them to have meaning it's okay to have
01:22:35.760 | a meaning that i don't understand so long as it performs well on my downstream task of blending language
01:22:43.440 | and recommendations together so you can see over here uh this is like the amazing spider-man right
01:22:48.320 | mario like products i guess it equates it to nintendo like products donkey kong zelda donkey kong um yeah
01:22:58.160 | so i don't let me try a playstation
01:23:01.040 | playstation products for spider-man what comes out last of us playstation 3 last of us metal gear solid
01:23:09.120 | so essentially i think this is what is interesting to me uh evaluating at that level would
01:23:16.160 | kind of act like a unit test for the rq-vae right because evaluation at this level in the downstream
01:23:21.600 | it could also be that maybe your recommendation model is bad maybe but exactly right exactly yeah so but
01:23:28.640 | the thing is this thing this visualization they had over here is very very difficult to do you have to
01:23:34.480 | to look at it you have to try to find you have to try to find the metadata to map to it and then try
01:23:39.360 | to look at it right but it's very challenging like how do you measure what is a good distribution
01:23:43.760 | i guess you can measure kl divergence or you can measure uh you have good measures of distribution
01:23:49.200 | but it's it's really challenging do you think we could initialize the code word so some categories
01:23:55.600 | you have internally for your own products and use that maybe i don't know i think i honestly i don't
01:24:02.080 | know i think it might be hard yeah vibu you have a question i'm curious in your model if you tried um
01:24:10.480 | instead of passing in semantic ids and uh like natural language query if you pass in just a natural
01:24:19.280 | language question does it predict your semantic ids properly so like if you know you have a query for
01:24:25.600 | zelda um you know if you just ask it like what is a game with a princess and a guy link and this and
01:24:32.000 | that uh can you get it to start outputting semantic ids right as opposed like and let's say like you're
01:24:38.000 | really prompted right like normally you would expect the llm it's still based on llm backbone it would
01:24:42.640 | predict uh the it would just use normal text right like it would say the word the legend of zelda but if
01:24:48.640 | you know if you few-shot prompt it and you tell it like you know you must answer with a semantic id so
01:24:54.160 | right now it's using i just finished the amazing spider-man recommend me another product that's
01:25:00.320 | wow this is this is insane to me it's like
01:25:05.280 | this is really i don't know where the limits of this i think it's not very smart so clearly let's just
01:25:12.880 | say firstly it's not very smart the recommendations are not very good but the fact that you can do this
01:25:17.840 | and it can return you semantic ids that are in your catalog
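A sketch of the kind of few-shot prompt being demoed here — a plain-English instruction plus an example answer written as semantic-ID tokens, nudging the fine-tuned model to reply with catalog IDs instead of titles; the prompt format, the <sid_...> token names, and the generate call are assumptions, not the actual setup.

```python
# hypothetical few-shot prompt; <sid_...> stands in for flattened semantic-id tokens
prompt = """Answer every request with a single 4-token semantic ID from the catalog.

User: I just finished The Amazing Spider-Man. Recommend me another product.
Assistant: <sid_17><sid_301><sid_540><sid_899>

User: What is a game with a princess and a guy named Link?
Assistant:"""

# inputs = tokenizer(prompt, return_tensors="pt")
# output = model.generate(**inputs, max_new_tokens=4)   # one 4-token semantic id
```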
01:25:20.960 | my god i mean i think there's also like there's the other side of just
01:25:30.160 | naive overfitting right if i do this training if i just have like regular sft and start adding tokens
01:25:38.080 | the model will use them right the value exactly exactly proper embedding understanding what is
01:25:43.680 | the new deal can see mass effect i don't know okay i mean that shows how old i am i just finished um
01:25:51.120 | elden ring oh elden ring okay
01:25:56.160 | i don't know i honestly i'm not able to say so firstly this is a gaming dataset i don't know if
01:26:06.640 | this is a good game but at least it's like same level of maturity it's not like i think
01:26:12.160 | the fun little evals you can do here to test if these are learned is like you have expectation
01:26:20.000 | correlation right so like if you finish a souls game will it recommend you the next one right
01:26:25.760 | like before elden ring or if you do like civilization four will it recommend civilization five right
01:26:33.280 | i don't even know if civilization four is in the data
01:26:36.000 | well i actually don't know it sounds sounds bad to me but yeah but it's it's interesting nonetheless
01:26:44.240 | and then i'm curious like can you just use it as a chat model like if you just say like you know
01:26:48.160 | what's the weather in new york and july does it output games
01:27:00.160 | wow uh so firstly hallucination but second it maintains english language and and thirdly the uh the pavement's
01:27:11.280 | it's melting oh yeah i don't know 75 celsius they're so good all right okay now here's a crazy test i like sci-fi
01:27:23.280 | and action games
01:27:33.280 | i am not able to evaluate this because i don't know about this game but fantasy and let's say a cute
01:27:49.600 | sounds bad ask it about like a niche game that not many people have played you know
01:27:55.280 | and then we see if people have played the game
01:27:58.240 | yeah i don't know i like animal crossing again this data set is very old so
01:28:08.160 | but you can see it's like now you can talk to it and it can return you things that are in the catalog
01:28:16.320 | so ted you have a hand raised i don't think this is important but since you're just playing around
01:28:21.920 | one of the things when i do these kinds of things is i like to take the recommendation and then just
01:28:26.960 | flip it to see whether it sends me back to the
01:28:30.160 | the same right so sometimes you get like a pair that just go back and forth or you get a ring
01:28:35.840 | but if it doesn't if it's just kind of doing a random walk that's usually a bad sign does that make sense
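A sketch of the flip test just described — keep feeding the model's recommendation back in and flag short cycles (an item that recommends itself, or a pair that ping-pongs) versus a random walk; `recommend` is a stand-in for whatever maps one semantic ID to the next recommended one.

```python
def follow_chain(recommend, start_id, max_steps: int = 10):
    """Follow recommendations from start_id and report the first cycle, if any."""
    seen = {start_id: 0}
    chain = [start_id]
    current = start_id
    for step in range(1, max_steps + 1):
        current = recommend(current)          # hypothetical: next recommended semantic id
        chain.append(current)
        if current in seen:
            cycle = step - seen[current]
            return chain, f"cycle of length {cycle} (1 = black hole, 2 = ping-pong pair)"
        seen[current] = step
    return chain, "no cycle within horizon (behaves like a random walk)"
```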
01:28:41.200 | yeah it makes sense let's try this super mario bros recommend me another game
01:28:45.120 | okay user has done super mario bros
01:28:50.720 | let's do this
01:28:53.840 | ghost squad
01:29:02.000 | returns you exactly the same thing
01:29:03.280 | ghost squad is an attractor yeah it's like a black hole
01:29:10.080 | i think i like this time but but you can try this
01:29:15.440 | ghost squad to splinter cell and just see if you wow
01:29:18.800 | is ghost squad part of splinter cell
01:29:26.800 | yeah it could be very interesting i actually have no idea how this model is learning
01:29:33.360 | i feel like it would be fun to do one of these and interpret it with
01:29:42.560 | more like familiar categories and so you can see splinter cell to rainbow six yeah and then the next thing is
01:29:51.200 | clearly the obvious next thing to do here is
01:29:53.440 | mech interp once you have a model that you have access to the weights of but all
01:30:01.360 | right i'll create some binary probes for you or some linear probes for your model
01:30:05.280 | um so i yeah i hope to release this soon like at least the model weights um and then for example it
01:30:13.360 | will be done but it's just a fun idea that i've been brewing for a while thank you everyone
01:30:19.200 | we're half an hour past half an hour past this is very very interesting eugene uh as i said if
01:30:26.800 | there is a chance that we can discuss i have a lot of ideas for the vector quantization and
01:30:34.000 | also some measures for that feel free to just put it in the llm paper club chat and
01:30:42.480 | and just pick and just tag me and then we can just chat there okay thank you everyone this was
01:30:48.000 | thank you thank you uh stop the video uh okay the recording should stop on its own right
01:30:53.520 | it does yep okay i'll stop okay take care
01:30:57.120 | thank you see you guys next week