
Eugene Yan on RecSys with Generative Retrieval (RQ-VAE)


Chapters

0:00 Introduction
3:52 Recommendation Systems with Generative Retrieval
8:48 Traditional vs. Generative Retrieval
10:49 Semantic IDs and Model Uniqueness
14:15 Semantic ID Generation and RQ-VAE
17:37 Residual Quantization (RQ)
23:26 RQ-VAE Loss Function
34:55 Codebook Initialization and Challenges
36:32 RQ-VAE Training Metrics and Observations
48:19 Generative Recommendations with SASRec
55:25 Fine-tuning a Language Model for Generative Retrieval
84:08 Model Capabilities and Future Directions

Transcript

i guess not let me also share this on the discord how do we invite folks there's a lot of people registered for this one but i can share the room then let me let me do it got it okay yeah you can have a question you mentioned in the description of the luma please read the paper if you have time try to grok section 3.1 with your favorite ai teacher who's your favorite ai teacher oh clod opus opus interesting i never use clod for papers oh really i have it because i just have a max description uh subscription that's why i hate i don't know i never i think i should try to be described as well but i use and i just used to be found okay i think we will just wait one more minute and then a few minutes actually like i feel like we have a lot of people registered yeah unless you have a long session let me just kick off i i do have a lot okay okay i want to share um but we can wait one more minute one or two more minutes i'm going to start with a demo and we'll go into the paper i'm going quote and then at the end we have live demo where i'll take in requests from folks to um just you know just provide the input prompt and we'll see what happens this model is uh currently only trained on one epoch so we'll see wow uh michael is new to paper club so basically every wednesday we do a different paper um we love volunteers if you want to volunteer to share something um otherwise it's mostly me or eugene or someone um this paper club is kind of whatever's latest or new whatever gets shared uh someone will cover it and then some people are experts in domains like eugene is the god of rexus so he's sharing the latest state of the art of rexus stuff um this week yeah that's that's what he's going to talk about next week we have uh the paper how much do language models memorize um the week after someone has added vision towards gpt oss so i think we do that one it's like a rl vr uh there's also a news research technical report we can get one of them to present but honestly it wasn't great so i think we skip it but yeah if anyone also wants to volunteer papers or you know if you want someone else to share them we have active discord just share and then someone will do rj presents a lot too yep okay so i'm gonna just get started um i won't be able to look at the chat oh so let me know if the the font is too small uh and because i'm just doing this on my own laptop so let me know if the font is too small okay cool so what do i want to talk about today wait am i sharing the wrong screen wrong you're sharing your whole things okay uh i need to move zoom to this desktop paint perfect and now i want to share this test can you see this perfect okay so what is this recommended systems with generator retrieval let's start from the back what is it that we want to achieve here's an example of what we want to achieve whereby given this this is literally the input string recommend playstation products similar to this thing so what is this thing this this is the semantic id you see that it's in harmony format we have a semantic id start we have a semantic id end and then we have four levels of semantic ids one two three four how to read this it's hard but we have a mapping table so over here recommend playstation products similar to this semantic id i have no idea what this semantic id is but i have a mapping table so you can see this semantic id is actually logitech gaming keyboard so when i ask for playstation products it's able to say playstation uh gaming headset for playstation this is uh i don't think this is all 
this for pc so that's a little bit off there and this is for pc ps5 ps4 so not that's not too bad now let's try another something else a bit a little bit different recommend zelda products similar to this thing which is new super mario bros is able to respond zelda okay this is off teenage mutant ninja turtles and nintendo wii so i guess maybe we don't have a lot of zelda products in the catalog let's try something a little bit different super mario bros meets assassin's creed what happens uh we get bad man quick question what is uh what is this sid to what does semantic id mean what are these tokens like at a high level so semantic ids are a representation of the item um so here's how we try to represent the item instead of a random hash which is what the paper says uh we will just represent an item so imagine we try to merge something cute super mario bros with something about assassination well we get batman so this is an early example maybe let's just try you get lego batman the lego part makes it cute yeah it makes sense right cute let's try residence evil with donkey kong okay maybe this is a little bit off um but i don't know dog let's just say dog i have no idea whether this is in the training data how this would look dog and super mario well it's not so you can see it defaults to zelda because there's a very strong relationship between nintendo super mario and zelda actually sega i actually wonder if this is in the data set luigi dragon ball brutal legend so essentially we have recommendations we can learn the recommendations but now that we can use natural language to shape the recommendations by cross learning recommendation data with the natural language data so now that is the quick demo hopefully you are interested enough to pay attention to this to this um this is i'm gonna go fairly fast and i'm going to go through a lot we're going to go through paper we're going to go through my notes and we are going to go through um code so i hope you give me your attention oh vivo has a hand raised and then a quick question on that example so right now you're crossing stuff that should be in training data right so stuff like uh zelda batman assassin's creed resident evil uh this should also theoretically just work with natural language right so if i was to ask something like single player and something like a single player game or like a multiplayer game like zelda right because zelda is multiplayer or you know yeah multiplayer game like zelda and i would hope i get something that's like a nintendo game so like maybe like um mario kart is similar to zelda but multiplayer right let's try this this is the amazing spider-man we know this because there's a the output is the past of the this is this is the actual output which is in sid this this log over here is the past so this is recommend single player game similar to spider-man okay batman batman batman multiplayer i i don't know how this will work uh and this is a very janky young model well it's still batman well maybe doesn't doesn't work very well some some batman games are multiplayer but i mean i think the point is yours is like a toy example but yeah i just wanted to clarify theoretically that should work right yep theoretically it should work if we have enough training data that we augment well enough now i haven't done a lot of data annotation here uh but now let's jump into the paper so this is the paper um the blue highlights essentially mean implementation details the yellow highlights are interesting things green is good 
red is bad so most recommended systems they do retrieval by doing this thing which is uh we embed queries and items in the same vector space and then we do approximate nearest neighbors so essentially all of this is always standard approximate embed everything and do approximate nearest neighbors but what this purpose uh suggesting is that now we can actually just decode we can get a model to return the most similar products so now this is of course very expensive compared to um just approximate nearest neighbors which is extremely cheap but this gives us a few special properties whereby now we can tweak how we want the output to look like as you can see with just natural text and we essentially it allows us to kind of filter uh our recommendations and shape our recommendations so now imagine that uh if i were to chat imagine a chatbot capability right um like this okay let's say we have zelda and assuming it's better trained super mario bros and i want more zelda products similar to super mario bros uh we get one of these well only one but we'll see essentially just more training data needs just needs more training data and augmentation so over here now we get three zelda products of course the temperature here is fairly high uh essentially i'm just trying to test how this looks like and you can see by by suggesting the model understands the word zelda the model understands this product and the model understands products similar to this product but in the form of zelda uh yeah is this meant to be like used in real-time low latency small model situation or like honestly from an external perspective it just doesn't seem that like it doesn't seem super clear what's special about this right because if you just ask any model like yes gemini or chat gpt or anything you know uh recommend me multiple multiplayer games similar to zelda you should know this right yeah the what i think is unique about this model so imagine if you just ask a regular lm recommend me all to play a games unique to zelda it will recommend you games based on its memory and based on what is learned on the internet so what is unique here is i'm trying to join world knowledge learn on the internet and customer behavioral knowledge which is proprietary and seldom available on the internet so i'm trying to merge those together so usually when you use world knowledge there's a very strong popularity bias right but when you try to merge it with actual behavioral knowledge that's progressing in real time and you can imagine mapping it to things like this you can get that kind of data so that's the that's what i mean by this is a bilingual model that speaks english and customer behavior in terms of semantic ids very interesting and uh like more clear-cut example would be like uh uh lm would be out of date because there's knowledge cut off but if you have like a new song or something you know exactly yeah something like this is a soft course oh another henry's yeah uh so what is the world knowledge and the local knowledge conflict so what will happen then i don't know um yeah i don't know i i guess it really depends on so that's world knowledge right and then when you maybe fine-tune this on local knowledge it depends on how much strength you want to know local knowledge to uh update the weights okay okay so while it's oh go ahead from basic training principles the local knowledge should always come first right because let's say like you have a base model that thinks you know if someone likes zelda they like mario kart but if 
you've trained it such that people that like zelda prefer some other category of game your your model is trained to you know predict one after the other so the the later training often over you know has more impact yeah i mean so the reason why i asked is because there's a lot of abundance of evidence in the like base model so by just having one example does that completely override the the information it has learned or with so much evidence um i don't know it's a good test i guess yeah honestly i don't know so that that's where this is what i'm trying to explore while we're already 15 minutes in um barely touch anything i want to cover but i'm going to try to speed through this a little bit faster and i'll stop for questions maybe at a 45 minute mark so now most systems nowadays use this retrieve and rank strategy essentially um here's the approximate nearest index i was talking about over here given a request we embed it you get approximate nearest neighbors index and as you add features and then you try to do some ranking on top of it this could be as simple as a logistic regression or decision tree or two tower network or whatever a sas rack we actually look at a sas rack which is essentially a one or two layer auto decoder model that is trained on user sequences we'll look at that later um so how they are trying to do this is they have a semantic representation of items called a semantic id so semantic id so usually product items or any item you represent is usually a random hash like some random hash of numbers this semantic id and we have seen is a sequence of four levels of tokens so now these four levels of tokens they actually encode the each item's content information and and we'll see how we do that so what they do is they use a pre-trained text encoder to generate content embeddings and this is to encode text you can imagine using a clip to encode images using uh other encoders to encode audio etc then they do a quantization scheme so what is this quantization scheme i think this is uh very important and that's why i want to take extra time to go through section uh 3.1 of this so here are the benefits that they they expound right like training and transform on semantically meaningful data allows knowledge sharing across items so essentially now we don't need random ids now when we understand one item in one format we can now understand it uh a new item so now for a new item if we have no customer behavioral data so long as we have a way to get the semantic id for it and because the semantic id for it the assumption is that the levels of the semantic id the first level the second level the third level if it we'll find a similar item we can start to recommend that item as well all right so essentially generalized to newly added items in a corpus and the other thing is the scale of item corpus usually you know on e-commerce or whatever the scale of items is like millions and maybe even billions so when you train such a model uh your embedding table becomes very big right in order of millions of billions but if you use semantic ids and if a semantic id is just represented by a combination of tokens uh we can now have a use very few embeddings to represent a lot of tables uh and and i'll show you i'll show you what i mean by this um the first thing i want to talk about is how we generate semantic ids um so one of the previous ideas was using a vector quantization recommender so it generates codes they're like semantic ids uh what they use is an rqvae residual quantize uh variation 
al autoencoder (RQ-VAE). I think this is quite important: how to train a good RQ-VAE that produces semantically meaningful semantic IDs your downstream model can then learn from. I think it's quite challenging, and I haven't found very good literature on how to do it well. So this is their proposed framework. First, given an item's content, be it text, audio, image, or video, we encode it to get vectors, then we quantize those into semantic codewords; that's step number one. Second, once we have these semantic tokens for an item, we can use them to train a transformer or language model, which is the demo I just showed you. I'm going to take a little bit of time to help us all understand what residual quantization means. Essentially, an embedding comes in and we consider it the first residual; think of everything within this gray box as residuals. We take that first residual, find which codebook token it is most similar to, and assign it to that token; say it's most similar to token number seven. Then we take the initial embedding that came in (this blue box), subtract away the embedding of codebook token seven, and what remains is the next residual. Then we repeat this at codebook level two: we find the most similar codebook vector and map to it, which here is token one. We take the previous residual, subtract the level-two codebook vector, and what's left is residual two. We keep repeating this, and along the way we keep trying to minimize the residual. So, given an embedding, we can now assign it to integers, or tokens: the semantic code here is seven, one, four. And if we sum the codebook vectors for seven, one, and four, the output is the quantized representation; after residual quantization, summing the vectors of the assigned tokens should give back roughly whatever came in. Then, from this quantized representation, a decoder can recover the original embedding. Any questions here? Did I lose everyone?
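To make the walkthrough above concrete, here is a minimal sketch of residual quantization in PyTorch, assuming the codebooks are given as a list of (num_codes, dim) tensors and using Euclidean nearest-neighbor lookup. Function and variable names are illustrative, not taken from the speaker's implementation.

```python
import torch

def residual_quantize(x, codebooks):
    """Assign an embedding to one code per level, as in the walkthrough above.

    x: (dim,) item embedding; codebooks: list of (num_codes, dim) tensors.
    Returns the per-level code indices (the semantic ID) and the summed
    quantized reconstruction.
    """
    residual = x.clone()                       # level-0 residual is the embedding itself
    quantized = torch.zeros_like(x)
    indices = []
    for codebook in codebooks:
        # nearest codebook vector by Euclidean distance
        dists = torch.cdist(residual.unsqueeze(0), codebook).squeeze(0)
        idx = int(torch.argmin(dists))
        indices.append(idx)
        quantized = quantized + codebook[idx]  # the levels are additive
        residual = residual - codebook[idx]    # leftover is passed to the next level
    return indices, quantized

# toy usage: 3 levels x 256 codes over 64-dim embeddings
codebooks = [torch.randn(256, 64) for _ in range(3)]
semantic_id, z = residual_quantize(torch.randn(64), codebooks)
print(semantic_id)   # e.g. [7, 1, 4]; z is what the decoder reconstructs from
```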
A question from the audience: I read the paper too and this part totally lost me; where does that initial codebook come from? You initialize it randomly at the start, or there are smarter ways to do this, which I'll talk about. Follow-up: so you've got the random initialization, and then as you learn the representation you're iteratively updating the codebook as well? Yes, that's right; this RQ-VAE needs to be learned, and when you train it there are a lot of metrics to look at, which we'll see, and it's really hard to tell which metrics indicate a good RQ-VAE. Okay, so that's semantic ID generation. Another quick question: when we compute the loss between the quantized version and the original version at 7-1-4, do we literally do addition, or is there some other operation? I'd imagine that after training, the codebook will have deviated from its initialization, so the levels won't be on the same scale and addition may not make sense. I actually do just addition, and I think the figure suggests addition as well. Someone adds that the paper mentions they use three different codebooks for exactly that reason, so each codebook has a different scale and they are additive. Exactly, because the norm of the residuals decreases with increasing levels. So that was what I explained via the image; now I'll explain it again via the text in the paper, because I think this is really important. Whatever comes in is the latent representation, and we consider it the initial residual, r0. This r0 is quantized, essentially converted to an integer, by mapping it to the nearest embedding in that level's codebook. To get the next residual, we just take r0 minus that codebook embedding, and we recursively repeat this to find the m codewords that make up the semantic ID, going from coarse to fine. Now there's a section here that I think is also quite important and that I really want us all to understand. We know how the quantization works end to end; this z, which is really just the summation of all the selected codebook vectors, is now passed into the decoder, which tries to recreate the input embedding. Imagine your codebook vectors are 32 dimensions and your input is 1,024 dimensions; the decoder has to learn to map from the codebook representation back to the original input. The next thing is: what is the loss? This is quite interesting to me; it's one of the first times I've actually trained an RQ-VAE, and I spent quite a bit of time trying to understand it. I'll talk through it briefly and then we'll go to the notes. The overall loss is the reconstruction loss plus the RQ-VAE (quantization) loss. The reconstruction loss is straightforward: the original input is x, the reconstructed output is x-hat, and it's a squared error measuring how well we can reconstruct the input. The quantization loss is a little more complex. There are two terms; sg means stop-gradient. The first is the stop-gradient of the residual minus the codebook embedding; the second, scaled by a beta term that balances the two, is the residual minus the stop-gradient of the codebook embedding. The left and right terms are essentially the same, except the stop-gradient is applied to different factors. The first term updates the codebook vectors, the embedding e, to be closer to the residual: we stop the gradient so the residual is treated as a fixed target, and this pulls the embedding toward that target.
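Written out, the loss described above looks roughly like this (a reconstruction from the talk and the paper's Section 3.1, not a quote): sg[.] is the stop-gradient operator, r_d the level-d residual, e_{c_d} the chosen codebook vector at level d, and beta the commitment weight (0.25 in the paper).

```latex
\mathcal{L}(x) \;=\; \underbrace{\lVert x - \hat{x} \rVert^{2}}_{\text{reconstruction}}
\;+\; \sum_{d=0}^{m-1} \Big( \lVert \operatorname{sg}[r_d] - e_{c_d} \rVert^{2}
\;+\; \beta \,\lVert r_d - \operatorname{sg}[e_{c_d}] \rVert^{2} \Big)
```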
what this does is updates the codebook vectors to represent better residuos now the second one is that it updates the encoder to return residuos closer to the codebook embedding what so you can see where we stop the gradient on the codebook embedding and we treat the embedding as fixed so now the embedding is fixed so now your encoder has to try to do the encoding so that it maps better to the codebook embedding so essentially the first term teaches the codebook the second term teaches the encoder i think it's something like gans i'm not very familiar with gans uh but so yeah i i'm not sure i can if i can comment on that uh ted i think vector quantization is a multiple rounding us yes i i think the intuition is right this is very similar um and then there's a beta which is a weighting factor right so you can see in the paper they use a weighting factor of 0.25 essentially what this means is that we want to train the encoder less than we want to train the codebook and again how do you choose this beta it it's it's really hard i don't know in this case i think having high fidelity of the codebook is important because that's what we will use to train our downstream model we're not training rqv for the sake of training in rqva we're training rq over here we're trying to rqva for the sake of having good semantic ids that we can use in a downstream use case any questions here did i lose everyone i was like i'm i was just i've just been building a variational auto encoder so i'm a little curious like when in all my reading there's always like a elbow or elbow like loss the and i don't see one there did you did you understand why they didn't use that or is it just not mentioned or they didn't use that it's just not mentioned i guess it's because um when you look at our cube and the rqvas didn't really come from this it came from representing audio and also images decided these two papers right residual vector quantizer so uh ted has a has a hand raised yeah so i don't deeply understand it but i think i can answer rj's question at a superficial level which is that the elbow is used to approximate when you are training your regular continuous vae okay and so you have some assumptions around like whatever normal distribution and things like that that what you're trying to do is you're trying to do maximum likelihood but you can't do it directly so you use this elbow as a proxy in order to try to get to the maximum likelihood okay what the what the vqvae paper says is that we're using a whatever multinomial distribution instead of this continuous distribution and and that's the reason why uh because of that distribution assumption we no longer have this elbow thing i've never fully wrapped my head around it but it's fundamentally because this discrete distribution is very different from the continuous distribution that we use with elbow i i so somebody can go deeper than that yeah that's as much as i got okay no that's actually super helpful and thank you for the i'm gonna have to read those two citations as well thank thanks a lot guys that's helpful thank you welcome am i going too fast right now or is it fine a bit too fast maybe a high level overview again for those that aren't following the technical deep dives um what is okay the high level overview is given some embedding of our item we want to summarize that embedding the embedding could be a hundred and a thousand dimensions we want to summarize it into tokens which is integers and that's what an rqva does so again to i think again i think in 
this paper the rqva is quite essential i think the rest of it this this was the one that i had the hardest uh the largest challenge to try to implement the rest are quite straightforward so i'll spend i want to spend a bit more time here can i ask um question go ahead yeah uh that sum it's the the vector sum you're summing you're summing the embeddings yeah okay yeah you just sum it up and you'll get the output is it fair to say the novel contribution here is intermediate token summarization versus textual yeah it could be essentially it's a way to convert embeddings to tokens right firstly you quantize the embeddings and you think about it it's really tokens what's a good choice of a number of levels it's hard um i don't have the answer to that but in the paper they use three levels and each level has um 256 code words so that essentially can represent 16 million unique combinations essentially 16 million unique combinations so now let's go into the code of course i won't be going to all the code i just want to go through the part to help folks understand uh a bit more i have this oh see this okay perfect so you can see over here the code book loss essentially that's how we represent the losses right the code book loss is just the the input we detach it the stop gradient and the quantized which is uh the quantized which is just the uh the combine the combined embeddings right the combined cobalt embeddings this is the code book loss commitment and go ahead and the commitment loss is really we just take the quantization and we detach it and we do a mean squared error on the input right that's essentially all there is to this loss right so that's how we implement it quite straightforward then um this is how we do the quantization so you can see we firstly initialize uh some array to just collect all these for every vector quantization level we take the residual and the residual is really the first time at the very first input the residue is just the input embedding right we take the residual we pass it through the vector quantization layer and we get the output then we get the residual for the next level by just taking the residual and we subtract subtract subtract the the embeddings right that's the residual now our new output is we just keep summing up all this so our quantized output is just uh we just keep summing up so essentially at the end we can represent this initial residual by the sum of the vector quantization layers and that's that's that recursive thing that's all there is to it okay maybe i'm oversimplifying it but yeah then when we encode it to semantic ids essentially this x is the embedding right uh we convert this embedding so for every vector oh yes go for it is someone unmuted when the color is unmuted okay maybe thank you for muting them so the input is really the embedding right so for every vector quantization quantization quantization layer and the embedding is the initial residual again we just pass it in we get the in we get the output and this output we essentially what we want is the index so the index and then we get the residual and so we get the residual minus the codebook embedding and we just go through it again so at the end of this we get all the indexes for the semantic id and now that is our those are you consider them as our tokens right the sid tokens that we essentially saw here right um this over here this maps to the first index 201 the second index third index fourth index so essentially my rqvae has three levels and the fourth level is just to 
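As a rough sketch of the two loss terms just walked through; names are illustrative and the speaker's actual code may differ in details such as reduction or straight-through gradients.

```python
import torch.nn.functional as F

def rqvae_level_loss(residual, quantized, beta=0.25):
    """The two loss terms walked through above, per codebook level (sketch)."""
    # codebook loss: the residual is detached (stop-gradient), so gradients
    # only move the codebook vectors toward the residual
    codebook_loss = F.mse_loss(quantized, residual.detach())
    # commitment loss: the quantized output is detached, so gradients only
    # move the encoder's residual toward the chosen codebook vectors
    commitment_loss = F.mse_loss(residual, quantized.detach())
    return codebook_loss + beta * commitment_loss
```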
for preventing collisions um i guess the last thing uh initialize okay the last thing the question that people folks had and this is what the paper does as well which is how do you first initialize these code books um there's a very interesting idea which is that we just take the first batch and for each level we perform k-means uh and we try to fit it right and then after that we just return the clusters we we do k-means with the means in terms of your kv size so uh so not kv size uh in terms of your cookbook size so my cookbook size is two five six so i want k-means with two five six means so now i have all these two five six means i just initialize it uh with these two five six means and that's how i do a smart initialization what does the fourth level in the rqvae my rqvae only has three levels the fourth level is something i artificially add when there is a collision in rqvae how to determine code book size again that's a great question i actually don't know uh how to develop a cookbook size it's really hard because we're not again we're not training the rqvae just for the sake of training rqva we're training rqvae for the downstream usage oh i think that's a that's a great example uh that of the vbo of the image that the bush had so now let's look at some trained rqvaes to try to understand how this works so i've trained a couple of rqvaes not all of them work very well but i want to talk you through this um so essentially if you recall their commitment with their beta was 0.25 right so all of these the beta is 0.25 unless i say otherwise so let's first understand what the impact on learning rate is um initially i have two learning rates the green learning rate is lower than the brown learning rate so you can see of course the brown one with a higher learning rate uh we get oh first let me take let me take some time to explain what the different losses means um okay so this is the validation loss essentially the the loss on the validation set uh this loss total is really just the training loss now if you recall the loss of the rqvae is two has two things in it that's the code book loss and then there's the commitment loss so now this and this is the reconstruction loss you can think of it as if you try to reconstruct the embedding how well does it do now so now this is the um vector quantization loss i do i didn't actually calculate the i think this is i can't remember this is the i think this is the code book loss right and the reconstruction loss is the uh um the commitment loss which is the reconstruction of the output loss okay so you can see and this is the residual norm which is how much residue is there left at the end of all my code book levels uh ideally this this demonstrates if you have very low residual norm it means that um your code books are explaining away most of the rest most of the input residual so a lower one is bad so of course all the losses the lower is better residual norm the lower is better now this one this is a different metric this is the unique ids proportion so at uh at some period i give a big batch of data uh maybe 16k or 32k give a big batch of data i do the i put it through semantic ids uh the quantization and i calculate the number of unique ids so you can see that as we train the number of unique ids gets less and less essentially the model is better at reconstructing the output but it uses fewer and fewer unique ids so what is a good model i think um what is a good model is quite hard to try to understand but here's what one person responded uh 
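A sketch of the k-means initialization described earlier, assuming scikit-learn's KMeans; fitting each subsequent level on the leftover residuals is my reading of the description rather than something stated explicitly.

```python
import torch
from sklearn.cluster import KMeans

def kmeans_init_codebooks(first_batch, num_levels=3, codebook_size=256):
    """Initialize each level's codebook from k-means centroids of the first batch.

    first_batch: (N, dim) tensor of encoder outputs, with N >> codebook_size.
    """
    codebooks, residual = [], first_batch.clone()
    for _ in range(num_levels):
        km = KMeans(n_clusters=codebook_size, n_init=10).fit(residual.numpy())
        centers = torch.tensor(km.cluster_centers_, dtype=first_batch.dtype)
        codebooks.append(centers)
        # subtract each point's assigned centroid so the next level is fit on residuals
        residual = residual - centers[torch.as_tensor(km.labels_)]
    return codebooks
```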
and the response kinds of make kind of make sense so that's what i've been working on so look at validation reconstruction loss again how well is your is your rqvae good at reconstructing the input the quantization error again this is your codebook error and codebook usage i think of this as unique ids as well as codebook usage we also see some graphs of codebook usage um okay so now you can see that with a higher learning with a higher learning rate which is this one this learning rate is higher than the degree we have lower losses overall but our codebook usage is also lower so i think that's that's one thing to note again i don't know which of which of these are better is it lower loss better or codebook usage better and then i just took a step and just pick something um now over here this uh i think maybe this is just easier to compare these are the various weights right if you remember this is the commitment weight this is a commitment weight of 0.5 and this is a commitment one essentially the commitment weight is how much weight that you want to have on your um on your um on your on your codebook right on your uh on your commitment loss on the on the validation on the reconstruction error so you can and this this is commitment weight of 0.25 which is the default so you can see with a higher commitment weight uh our reconstruction error is actually very similar right but when we push the commitment weight up to 1.0 you can see something something really off course here uh our loss is a lot um just x-ray funcily you can see our loss is just very high but the unique ids is also higher which is actually better and the residual norm is lower which is a good thing that means we explain most of the variance so in terms of how to figure out how to pick a good commitment weight i think it's very tricky uh i haven't figured it out essentially here when i double the commitment weight i get told i have lower validation loss and my id proportion is similar but my residue norm is higher but in the end i just stick to the default which is 0.25 and over here this is another run i did where i clean the data essentially i excluded uh the product data and all this is like open open source amazon reviews data i clean the data i exclude those that are unusually short remove html tags and you can see by cleaning the data you can get a much lower validation loss much lower reconstruction loss residue norm is similar but your unique id proportion is a lot higher um does that make sense am i going yes the reader with the reduced number of unique ids am i just going too deep into this but i also do want to share a lot um okay we just take okay maybe i'm going too deep into this but i'll just take one last thing to show you how a good codebook looks like um so this is the codebooks so these are the codebooks right and this is this is the distribution of codebook usage this is a this is a not a very good distribution of codebook usage you can see all the codebook usages in um token number one right and you can see there's very large spikes so this is when your rqva is not trained well you don't have a very good distribution and but and then you can see over here um where's my numbers for you can see over here the proportion of unique ids is only 26 that means all your data you've only collapsed it to a quarter of it essentially you've thrown away three quarters of your data but what happens when you have a good uh good rqvae here's how it looks like the proportion of unique ids is 89 um but this is the codebook 
usage you can see a codebook usage is very well sort of somewhat well distributed and all the quotes are well learned and all the quotes are fairly well utilized so this is what it means by having a good codebook usage so what have you done so that you can uh have a good uh code distribution codebook um the main thing that i have done is so a few things they use k-means clustering initialization so i do that k-means clustering initialization um the other thing i do is codebook reset so essentially any time at every epoch if there are any codes that are unused i reset it so essentially i'm just forcing i i force the i reset unused codes for each rqvae layer so i'm forcing the model to keep learning on those unused codes seems to work decently um yeah so the next thing now we will very quickly go through the recommendations part of it because i think it's fairly straightforward um so oh michael has a hand raise hello yes hi you good yes yeah i'm just yeah i'm a south indian and i am right now entering into this air field so right now what do you suggest and how to enter into this field as per the new concepts that are right now in your point of view yeah i think a great way to learn this is i'm gonna share an invite link uh a great way to learn this is really just to hang out in the discord channel um so i'm gonna copy the link i'm gonna paste it in the chat i think the really just us is just learning via osmosis right yeah what are the folks yeah yeah due to this yeah boom i have learned the no code development and i then and all the all these things for almost four months but right after i have sort of issue i found her big right because like coding so yeah thank you michael what i would do is to really just uh ask on the discord channel and the thing is the the reason why i'm saying this is there's only 15 minutes left and i still have quite a bit i want to share and go through yeah anyone has any questions related to the paper frankie okay can you hear me yes yeah a quick question maybe it's not relevant but i noticed that you did some stop gradient uh so it's stopping back propagation is that is that because you're trying to freeze something when you're doing training i don't quite understand it's it's just um firstly it's part of the rqvae formulation where they do stop gradient right and so i'm just following it and the reason is if you don't do stop gradient if you don't do stop gradient this is the loss function right it just simplifies to this right which is residual minus codebook embedding so now this leads to a degenerate solution right you can imagine that your encoder will just encode everything to 000 and your codebook will just encode everything to 000 and essentially you just use one single thing where they both cheat okay but then you have to allow training at some point right because you're training the codebook so how do you how does that how does that codebook get modulated then if you do stop gradient so we do stop reading on the residue for this for the first half of the rqvae loss equation and then we do stop gradient on the codebook for the second half of the and then it's a it's a weightage we waited by yeah i see it thank you so much yeah okay so i know we have some questions left but i'm going to take five to ten minutes to try to go through the recommendations part of it uh which is really where it all comes together and we also have results for that so i'm going to ignore the questions for a while so now generative uh recommendations right what do you do is we 
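A minimal sketch of the codebook-reset trick mentioned a moment ago: at the end of each epoch, any code that was never selected gets re-seeded, here from a pool of recent encoder outputs. The exact reset strategy in the speaker's code isn't specified, so this is only one plausible variant.

```python
import torch

def reset_unused_codes(codebook, usage_counts, recent_outputs):
    """Re-seed any code that went unused this epoch from recent encoder outputs.

    codebook: (num_codes, dim) parameter; usage_counts: (num_codes,) selection
    counts for the epoch; recent_outputs: (N, dim) pool to sample new codes from.
    """
    dead = (usage_counts == 0).nonzero(as_tuple=True)[0]
    if len(dead) > 0:
        pick = torch.randint(0, recent_outputs.shape[0], (len(dead),))
        with torch.no_grad():
            codebook[dead] = recent_outputs[pick]   # force dead codes back into play
    return len(dead)
```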
essentially reconstruct item sequences for every user we sort them in terms of all the items they've interacted with then given a sequence of items the recommender's task is to predict the next item very much the same as language modeling given a sentence predict the next word in the sentence right essentially that's what sasrack is what a lot of our recommendation model is right now so uh so for a regular recommendation given a sequence of items you just predict the next item for a semantic id based on recommendation given a sequence of semantic id you predict the next semantic id which is not which is now not a single item but a sequence of four semantic ids so now the task is more challenging you have to try to predict the four semantic ids all right so you can see that they they use some data sets here um and they have the rqva which we spoke about there's an encoder a residual quantizer and the decoder i think the encoder and decoder is essentially the same thing i mean not the same thing it's just mirror image of each other different ways of course as well and residual quantizer so you can see they have a sequence to sequence more than implementation um and this is their code book right uh one or two four tokens um it's only three levels and the fourth level is for breaking hashes they include user specific tokens for personalization in my implementation in my implementation i didn't do that now let's look at some sas rec code for the recommendations and we'll see how how similar it is to um how similar it is to language modeling so over here this is the sas red code and if you see it's actually very similar right you see modules like causal self-attention which is predicting the next token you can see the familiar things attention number of heads etc and then you know qkv and then you have our mlp uh which is really just uh reluce and then the transformer block this is a pre-layer norm layer norm self-attention layer norm and mlp that's it that is our recommendation system model so now the sas rec is item embedding for every single item we need to have a single item embedding position embedding and dropout and then the blocks are really just the number of hidden units you want and just transformer blocks and of course a final layer norm um for the forward pass what we do is we get all the item embeddings we get all the positions and we combine it so now we have the hidden states right it's actually the what has happened in the past this is what has happened in the past now for the prediction we take the hidden states which is the past five six now imagine this is language modeling we take the past five to six tokens we forward it we get the next language token candidate embeddings you can see now we get the candidate embeddings and what we do in recommendations oops what we do in recommendations is we score them and the score is really just dot product essentially given what has happened in the past uh we encode it we get a final hidden layer given all the candidate embeddings we do a dot product and that's the score so essentially given all the sentences given all the potential next tokens get the dot product to get the next best token and that's the same for recommendations uh over here um and the training step i don't think i'll go through this um i won't go through this now for sasright semantic id it's the same right now this is the forward pass uh no this is not the forward pass for a sasright semantic id this is the same predict next item so now you saw that in the previously 
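To tie the SASRec walkthrough above to code, here is a sketch of the scoring step: the final hidden state summarizes the interaction history and candidates are ranked by dot product. Shapes and names are illustrative.

```python
import torch

def score_candidates(hidden_states, candidate_emb):
    """Rank next-item candidates as described above: dot product between the
    final hidden state (a summary of the history) and each item embedding.

    hidden_states: (batch, seq_len, dim); candidate_emb: (num_items, dim)
    """
    last_hidden = hidden_states[:, -1, :]        # (batch, dim)
    return last_hidden @ candidate_emb.T         # (batch, num_items) scores

# toy usage: top-5 recommendations for two users over a 1,000-item catalog
scores = score_candidates(torch.randn(2, 6, 64), torch.randn(1000, 64))
print(scores.topk(5, dim=-1).indices)
```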
we were just predicting only one single next item now in this case we have to predict four tokens to represent the next item so when we do training we apply teacher forcing whereby first we try to predict the first token if that is correct that's fantastic we use that that first token to predict the next token but if it's wrong we replace it with the actual correct token so that's what we do when we do training but when we do evaluation we don't do this uh when we do evaluation you can see uh okay i'm definitely losing people here so i i i won't i won't go too deep into this but when we do evaluation we actually don't try to correct the evaluation so now let's look at some results of how this runs essentially when i was doing this all that i really cared about was do my semantic ids make sense so let's first look at this uh this is a this is a sasright model you can see the the ndcg is uh unusually high right so firstly the purple line is the raw sasrack that just predicts item ids given the past item ids predicts the next item id uh let's just look at hit rate right you can see the hit rate is 99 uh i i made it artificially easy uh because i just wanted to get a sense of whether there was any signal or not you can see here is 99 uh we exclude the most infrequent items there's no cold start problem over here and uh the the false negatives are the false positives are very easy to get so you can see this is how the regular suspect does with ndcg of 0.76 now let's now look at our semantic ids now again to be to be very clear right semantic ids the combination of semantic ids that are possible is 16.8 million but in our data set we only have 67 000 potential data points so the fact that if from that 16.8 combinations is able to even predict correct data points that's a huge thing so over here we can see that after we train it the hit rate is 81 percent that is huge essentially it means that the model this new recommendation model instead of predicting the next item it has to predict four next items the the four tokens that make up the next item and only if that is an exact match we consider as a hit a hit rate right recall and it's able to do that and you can see that the ndcg is able to do that uh on ndcg is also decent so now it doesn't it doesn't quite outperform uh the regular sas right because this is uh it would outperform it if we had coastal items all the very dirty items i think it could beat or even match but essentially this was all just a test to try to see if we can actually train such a model now now that this test has passed we know that our semantic ids are actually working well now we can fine-tune a language model right um this is so this is fine-tuning a language model you can see that the this is the learning rate so initially this is my original data you can see uh maybe i'll just hide this first this is this is training on my original data and you can see my original data this is the validation my original data was very huge uh was there was a lot of data because i didn't really clean it i just used everything i was just not sure whether you could learn or and i found that it could learn so let's look at an example over here this so you can see over here this is the initial props right and it's able to say things like let's look at test number three if product a has this id if product b has this id what can you tell me about their relationship and all he sees is ids but he's able to say they're related through puzzles um and then list three products similar to this 
product right i don't know what this product is um but i i do know that when i look at it in the title form this is what i expected it was able to respond invalid semantic ids so what this means is that hey you know this thing can learn uh on my original data so then after i clean up my original data so now this is my new data set you can see a new data set it's half the size of original data when you can see when we have this sharp drop during training loss essentially that means i'm it's the end of an epoch it's a new epoch but evaluation loss just keeps going down as well and the final outcome we have is a model like this this is a checkpoint from the first epoch whereby it's able to um it's able to speak in both english and semantic ids um okay that was all i had uh any questions sorry it took so long you sure you have a question hi eugene uh so you said that um when you notice some code points are being unused you reset them um after you reset did you notice them getting used to have some metric to track the reset one actually was if the resetting was actually helpful in that case yes i i was able to try that uh sorry i'm just responding to numbers yes i do have that um uh in my let's again let's look at this when i was training the rqv i had a lot of metrics uh i only went through just a few of them um but this is it this is the metric um that is important um first oh wait are you oh wait i'm stopping you stop sorry okay here we go um first let me reset the axis okay so the metric that is important is this codebook usage so i do log codebook usage um at every epoch so you can see at step 100 or at epoch 100 codebook usage was only 24 20 24 but as you keep training it codebook usage essentially goes up to 100 so that's the metric i i love nice thanks you're welcome uh frankie oh yeah um can you go over like again uh for the cement that i semantic id part what is the training data because i'm having a hard time understanding like what what do you actually what is the llm learning that so can you can explain that again uh for training the rqvae no no no no for the semantic id part uh the last example because you basically you're basically inputting say a cement id and say what are the hardest two things related to each other and i'm having a hard time figuring out like what i what is the um was it actually learning here so this is the training data so essentially the training data is a sequence of items so you can see this is the raw item right uh and for each item we map it to this sequence of semantic ids this is now we're trying to order does the order in this uh list matter or not yes the order matters is essentially the sequence of what people transact okay and if there's the order in the semantic id matters as well for example if we were to swap the positions of sid105 and sid341 it wouldn't make sense that would be an invalid semantic id and and the reason is because the code the first code book vocab the first number should be between 0 to 255 the second number should be between 256 to 511 and so on and so forth yeah so that makes sense i mean but i'm talking about like in the group of four so if you permute those uh does that does it matter so it does matter does matter as well yeah it definitely it must be in this order if not it will be an invalid semantic id i understand because you have you have to go with your levels right of your coding right so you have four tokens so that understand and but i'm saying that you have multiple items here right yes so if you permute those 
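The level offsets just described (first code in 0 to 255, second in 256 to 511, and so on) are what make order inside a semantic ID matter. A tiny sketch of one way to lay such tokens out:

```python
def to_flat_tokens(level_indices, codebook_size=256):
    """Offset each level's code into its own token range: level 0 -> 0..255,
    level 1 -> 256..511, level 2 -> 512..767, and so on. Swapping positions
    inside a semantic ID therefore produces out-of-range (invalid) tokens."""
    return [level * codebook_size + idx for level, idx in enumerate(level_indices)]

print(to_flat_tokens([7, 1, 4]))   # [7, 257, 516]
```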
does it matter uh i don't know i think it depends it may or may not so essentially if i were to buy a phone and then you recommend me a phone case and then pair of headphones and a screen protector that makes a lot of sense but if i were to buy a screen protector and you recommend me a phone that doesn't make sense right in terms of recommendations and user behavior okay okay so that's encodings that's the sequence encodes that behavior that's what you're saying yes that's correct okay any other questions yeah eugene uh i had done a lot of work on vector quantization and the there are some other uh techniques to handle like under utilization like splitting and one important question is the distance metric like i noticed that you're using euclidean distance to measure the difference between the embeddings and the code words but the euclidean distance is not good when there is a difference in value or rate for the features so have you tried to do it i'm sorry i'm in the noisy environment please go ahead okay so have you tried to do it okay so have you tried different metrics like mahalanova's distance um i have not essentially my implementation uh reflects what um what was written in the paper so and i i'm not sure if i'm really using euclidean distance uh i don't remember where is it yeah but it's it's worth a try but i i just use the implementation written in the paper and it works it's in the main uh like uh equation where you have the stop gradient because that uh that sign is the ingredient distance and it is by default what k means algorithm uses but yeah you can change that so then that's what i'm using yeah so i just replicated that any other question oh pastor you have a question yeah so i have a question related to to that so in the paper says that you that you use that you use k means k means to to initialize the codebook and then like you used it like you use it for the free first training batch and then you use the centroids uh as initialization is is that what you use and uh follow of that like it also says that there is an option to use like k means clustering hierarchically but it lost semantic meaning so i didn't really understand that part if you can briefly explain like and and if you use the centroids as initialization thanks uh so yes that's what i used um i'm sorry what was your second question yeah so so also the paper the paper said that the other option is used k means hierarchically but it lost semantic meaning between the clusters yes yes yes so yeah do you know why or how that plays or what's the difference between using the the you know the the what is suggesting the paper uh that's a great question right so are you referring to this uh they use k-means clustering hierarchically essentially these are different alternatives or quantization right um over here we quantize using an rqvae they they also talk about locality sensitive hashing then the other one is they use k-means clustering first k-means then second level k-means and third k-means um i don't know why it loses semantic meaning um honestly so i i i i won't be able to address that i haven't actually tried this yeah and i don't have a strong intuition on why it loses semantic meaning okay cool thanks thanks yeah i can uh try to answer that like when you are uh having the same id and you're using the proper uh k-mean clustering the entities are unique and they represent something but if you try to split them into different levels like the first order and you group them and then you do grouping of the second 
layer as if like you're trying to cluster numbers uh if you keep a three-digit number nine seven three uh and cluster it with eight seven uh eight two one then they are close together but if you cluster on the uh the hundred digits first then at the uh tens and then uh the ones you know you lose the meaning and the these numbers do not become close to each other awesome makes sense thank you thank you okay well very much over time thank you vibu for kindly hosting us for so long uh i can stay for any other questions feel free to take eugene's time is your experience uh for next week we will volunteer for a paper i'll post it in the chat right now um but yeah thank you thank you eugene for presenting and same thing if anyone wants to volunteer a paper to present the following week um you know now is your shot you'll you'll definitely learn a lot going through the paper being ready to teach it to someone otherwise the paper is how do language models memorize and and also to be clear i want to make it very clear right that i did not prepare for this over the course of a week so if you're preparing for this i've been working on this for several weeks ever since like 4th of july holiday right and essentially this is just i've been working on it i'm excited to share about with people and that's why i'm sharing most of the time we usually just go through a paper so if you want to volunteer it's okay to just have the paper that's it if you want to see the other side of uh you know the other extreme unlike eugene who prepares for months i read the paper in the day before the day or the day and i morning off hey don't call me out yes but people have slides that's the thing i will never have slides no no i i used to do slides i actually think it's uh slightly detrimental i don't think it's as beneficial to make slides other papers anymore papers are good ways to read uh to understand information so now it's just uh walk through a highlighted paper just read it understand it and yeah okay take eugene's time asking about semantic ids yeah and we can we can also go through more of the paper if you want i know i haven't gone through as much of it as i should but it's actually the the answer is there and uh the the results replication is very similar to show you have a hand raised hey eugene uh so while you're implementing this right uh were there anything was there anything from the paper that you could not replicate which you had tried uh i think it took me a lot of time to try to train a good rqvae um then that's why i i spent so much time on it so essentially to train an rqvae you see that i have so many experiments right gradient i i did crazy things stop gradient to decoder um gradient clipping and a lot of it failed um i even tried like there was this thing called um exponential moving average to learn the rqvae code books like it has great uh unique id unique ids right um but it's just not able to learn you can see that the loss is very bad um and and that was because i was new to this so you can see that the loss uh actually if we if we zoom in on only those after 2000 you can see that we see zotero oh shoot you only see zotero oh thank you i should share my desktop thank you for so you can see i i tried some crazy things like this exponential exponential moving average approach to update my code books so you can see um this the purple line is the exponential moving average you can see the loss is very high um and it just doesn't work as well uh like gradient clipping the rotation trick uh figure 
The same goes for gradient clipping, the rotation trick, and figuring out the right batch size, the right loss weights, and the right clean data. That took me several experiments, and that is after deleting the experiments that just did not make sense to keep. Whereas for the recommendation part I only needed two experiments. Even though the recommendation metrics, the hit rate and NDCG, are not close to the regular SASRec ones, that was never the intent. The intent was: can a SASRec actually learn on this sequence of four tokens? Very quickly you find that yes, it can, and at that point, instead of training another SASRec from scratch, I decided to fine-tune a Qwen model. That was not part of the paper; it was just for my own learning. The paper uses all these recommendation models, but that's it.

Because I'm also planning to train an RQ-VAE, is there a good place to start? I remember there should be an official repo for it on GitHub; I think I saw that a while back. Well, I wish there were, but I definitely did not come across one. What I would recommend, and it is great for learning, especially now with tools like Claude Code, is to code this out from scratch. You really learn the nitty-gritty of how the codebook is actually represented. And yes, I think Colette was right: I was using Euclidean distance, and that is the distance I was computing. Maybe there is a better distance; I certainly have not gone that far down the road. I just took the default implementation and followed it. Makes sense, thanks for answering the questions, Eugene. You're welcome.

Eugene, quick question on your highlighting: what do red and green mean? Green means good: these are the benefits, like how training allows knowledge sharing compared to atomic item IDs. Red is the downsides: when we consider a new technique, what does it cost us? With a regular SASRec, if I had a billion products, my embedding table would have a billion rows. But with a four-level codebook of 256 codewords per level, I can represent about four billion products, and that is just 1,024 embeddings. So instead of a billion-row embedding table I only need a roughly thousand-row one, and that saves a lot of compute. I see, thank you. You're welcome. Kishore has a hand raised; I assume Kishore has already asked his question. Frankie, you have a hand raised.

Yeah, I think it is similar to what you just spoke about. In the two-tower setup you have two encoders trying to create embeddings that approximate the query and the item. I think you tried to explain this a little just now, but I am trying to understand why that embedding is worse than the one created by your quantization steps. The embedding is not worse; it is probably just as good or even better, but there would be a lot of those embeddings. Imagine we had a billion products: we would need a billion embeddings, one for each product. With semantic IDs and a four-level codebook, where each codebook can represent 256 codewords, the picture changes.
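As a quick back-of-envelope check of the numbers Eugene gives here: four levels of 256 codewords address about 4.3 billion distinct semantic IDs while the embedding table only needs 1,024 rows. The 64-dimensional embedding size in the sketch below is an assumption for illustration; the talk does not state one.

```python
# Back-of-envelope check of the embedding-table saving: 4 codebook levels of
# 256 codewords each, versus one atomic embedding row per product. The 64-dim
# embedding size is an assumed value for illustration only.
levels, codewords, dim = 4, 256, 64
n_products = 1_000_000_000

atomic_rows = n_products              # one row per product
semantic_rows = levels * codewords    # 4 * 256 = 1,024 rows in total
addressable = codewords ** levels     # 256**4 = 4,294,967,296 possible IDs

print(f"atomic table params:   {atomic_rows * dim:,}")    # 64,000,000,000
print(f"semantic table params: {semantic_rows * dim:,}")  # 65,536
print(f"addressable items:     {addressable:,}")
```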
Can you see my Raycast? Yes. So imagine each codebook can represent 256 values; then the combination of all the permutations of this codebook is actually about four billion, and I do not need a billion embeddings, I just need 256 times 4 embeddings. In some sense I think of it as a better encoding. So you're saying you can compress more economically? Yes, I think "compress more economically" is the right term. The caveat is that in the paper they showed that, using their generative retrieval, they were able to achieve better results, and that is not something I was able to replicate. Also, their datasets are actually quite small; you can see their datasets have very few items. I deliberately chose a dataset that was much bigger, at least 4x bigger than their largest dataset, and I actually want to scale this to million-item catalogs. So that is one thing I have not been able to replicate. But again, you can see the idea is not stupid: even a regular SASRec is able to learn on this sequence of token IDs. Any other questions?

Sorry, the hand-raising function seems finicky for me, so I will just ask directly. I am curious: for the SASRec models and the Qwen model that you fine-tuned, have you compared their performance on next-semantic-ID recommendation? Obviously the SASRec model is way better; the Qwen model is not as good. But the purpose of the Qwen model is not to beat the SASRec model; the purpose is to train a new capability whereby users can shape their recommendations. You can imagine it like chat: it is the convergence of recommendation systems, search, and chat. That is the intent, a more general interface than the SASRec, which mostly just takes semantic IDs. If you were going to build a chat model, then this Qwen model could be helpful.

Flo has a great question: what is the model parameter size? Let's look. Do I have logs? Please tell me I have logs. Yes, I do have logs, fantastic. Oh, you're not sharing your screen again. Right, I'm showing my screen now. Okay, so this is the regular SASRec model; you can see the users, and this is the total parameter count: it is only a one-million-parameter model. And the larger SASRec model, the one with the semantic ID tokens, is a seven-million-parameter model. Whereas Qwen is like 8B, so these are a lot smaller. But for recommendation models where you are just predicting the next token, you do not need something so big; there is no benefit from additional depth in your transformer.

Okay, I guess the questions are drying up, so that's it then. Thank you for staying. Oh, one more, this is me. For the semantic ID, is it only four levels, four digits? In my use case I only use four digits, but you can imagine it being anything: you can make it eight levels, and at every codebook level you can have more codewords.
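Since the SASRec versus Qwen comparison above is framed in terms of hit rate and NDCG, here is a minimal sketch of how those next-item metrics are usually computed. The function names and the single-relevant-item assumption are mine, not from the paper or Eugene's code.

```python
# Minimal sketch: hit rate@K and NDCG@K for next-item prediction with a single
# held-out target item per user (a common evaluation setup; names are assumed).
import numpy as np

def hit_rate_at_k(ranked_items, target, k=10):
    """1.0 if the held-out target appears in the top-k ranked list, else 0.0."""
    return float(target in ranked_items[:k])

def ndcg_at_k(ranked_items, target, k=10):
    """Binary-relevance NDCG: 1/log2(rank+1) if the target is ranked within k."""
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item == target:
            return 1.0 / np.log2(rank + 1)
    return 0.0

# Target ranked 3rd: hit@10 = 1.0, NDCG@10 = 1/log2(4) = 0.5
print(hit_rate_at_k(["a", "b", "c", "d"], "c"), ndcg_at_k(["a", "b", "c", "d"], "c"))
```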
In my case every codebook level has 256 codewords, so you can make it as wide and as deep as you want. And again, how to decide that is very, very difficult, because, and if you know of a way please share it with me, I do not know of a way to assess whether my RQ-VAE is good or not on its own. The value of the RQ-VAE is really the value of what it does downstream, and that is tricky. I do have some analysis across RQ-VAE runs, essentially comparing things like the proportion of unique IDs for each. In the end I prioritized validation loss and the proportion of unique IDs, and also not having too many tokens, because that hurts real-time generation in the chat format: if your semantic ID is 32 tokens long, then in order to return a product you need to generate 32 tokens, which is not very efficient. So I chose to keep it four tokens long.

I think there are other ways to measure that; you mentioned two of them, but you can also look at perplexity, or codebook token efficiency. What does that mean? How many tokens are needed to achieve good performance. But then what is the measurement of good performance? There are some practical characteristics, like scalability: can you add more data and see that it is able to handle it? And inference latency: how long does it take to produce the recommendations? Yeah, inference latency is a function of codebook depth: with more codebook levels you need more tokens, which means more latency.

Okay, sorry, someone else? It was not a question; I remember a paper where they evaluated the codewords they learned. I think they did a visualization where they could recover a taxonomy, and they showed that each level made sense: a codeword behaved like a category and the sub-codewords below it were like subcategories. I don't know if it was the same paper. It is the same paper; they talk about it over here. What this visualization tries to show is that the semantic IDs have meaning, how you might interpret them, and so on. But for me, I don't really need them to have meaning; it is okay for them to have a meaning I don't understand, so long as they perform well on my downstream task of blending language and recommendations together. You can see over here: this is The Amazing Spider-Man, and it recommends Mario-like products, I guess Nintendo-like products, Donkey Kong, Zelda. Let me try PlayStation: "PlayStation products for Spider-Man", and out come The Last of Us for PlayStation 3 and Metal Gear Solid, PlayStation titles. Essentially, I think this is what is interesting to me. Evaluating at that level would act like a unit test for the RQ-VAE, because when you evaluate only downstream, it could also be that your recommendation model is bad. Exactly, exactly right.
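Here is a minimal sketch of the "proportion of unique semantic IDs" check Eugene mentions above, i.e. what fraction of items end up with a collision-free four-level code. The data format, a list of 4-tuples of codeword indices, is an assumption.

```python
# Minimal sketch: fraction of items whose 4-level semantic ID is not shared
# with any other item. Input format (one 4-tuple of codeword indices per item)
# is assumed for illustration.
from collections import Counter

def unique_id_proportion(semantic_ids):
    counts = Counter(semantic_ids)
    unique_items = sum(1 for sid in semantic_ids if counts[sid] == 1)
    return unique_items / len(semantic_ids)

ids = [(1, 2, 3, 4), (1, 2, 3, 4), (5, 6, 7, 8)]
print(unique_id_proportion(ids))  # 1 of 3 items is collision-free -> 0.333...
```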
But the thing is, the visualization they had over here is very, very difficult to do: you have to find the metadata to map to each codeword and then go look at it, and it is very challenging. How do you measure what a good distribution is? I guess you can measure KL divergence; there are good measures of distributions, but it is still challenging. Do you think we could initialize the codewords from categories you already have internally for your own products and use those? Maybe. I honestly don't know; I think it might be hard. Vibhu, you have a question.

I'm curious: in your model, instead of passing in semantic IDs plus a natural-language query, if you pass in just a natural-language question, does it predict the semantic IDs properly? Say you have a query for Zelda: if you just ask it "what is a game with a princess and a guy named Link", can you get it to start outputting semantic IDs? Normally you would expect the LLM, since this is still an LLM backbone, to answer in normal text; it would say the words "The Legend of Zelda". But what if you few-shot prompt it and tell it it must answer with a semantic ID? Right now it is using "I just finished The Amazing Spider-Man, recommend a product", and it recommends a product. Wow, this is insane to me; I don't know where the limits of this are. Let's just say, firstly, it is not very smart and the recommendations are not very good, but the fact that you can do this and it returns semantic IDs that are in your catalog, my god. I mean, there is also the other side of this, just naive overfitting: if I do this training, just regular SFT with added tokens, the model will use them; the question is the value. Exactly, exactly, whether it reflects a proper embedding-level understanding. Let's see the next one. I can see Mass Effect, I don't know; okay, I guess that shows how old I am. "I just finished Elden Ring." Oh, Elden Ring, okay. Honestly I am not able to evaluate this; firstly, it is a gaming title, I don't know if it is a good game, but at least it is the same level of maturity. I think the fun little evals you can do here, to test whether these are really learned, are expectation correlations: if you finish a Souls game, will it recommend you the next one, say from before Elden Ring? Or if you play Civilization IV, will it recommend Civilization V? I don't even know if Civilization IV is in the data. Well, I actually don't know; this one sounds bad to me, but it is interesting nonetheless. And then I'm curious: can you just use it as a chat model? If you say "what's the weather in New York in July", does it output games? Wow. So firstly, it hallucinates; secondly, it keeps answering in English; and thirdly, "the pavement is melting", 75 Celsius, these are so good. All right, now here's a crazy test: "I like sci-fi and action games." I am not able to evaluate this because I don't know this game. Let's say fantasy and something cute. That one sounds bad. Ask it about a niche game that not many people have played, and then we'll see whether it knows it. "I like Animal Crossing." Again, this dataset is very old. But you can see that now you can talk to it and it returns things that are in the catalog.
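One concrete way to act on the KL-divergence idea from earlier in this exchange is to compare each level's codeword usage against a uniform target. A minimal sketch follows; the uniform reference and the smoothing epsilon are my assumptions, not something stated in the talk.

```python
# Minimal sketch: KL divergence between observed codeword usage at one level
# and a uniform distribution. 0 means perfectly balanced usage; large values
# mean the level collapses onto a few codewords.
import numpy as np

def codebook_usage_kl(codes, num_codewords, eps=1e-9):
    """codes: 1-D array of codeword indices chosen at one level across a dataset."""
    counts = np.bincount(codes, minlength=num_codewords).astype(float)
    p = (counts + eps) / (counts.sum() + eps * num_codewords)  # observed usage
    q = np.full(num_codewords, 1.0 / num_codewords)            # uniform target
    return float(np.sum(p * np.log(p / q)))                    # KL(p || q), in nats

print(codebook_usage_kl(np.array([0, 0, 0, 1]), num_codewords=4))  # skewed usage
```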
Ted, you have a hand raised. It's not that important, but since you're just playing around: one of the things I like to do with these models is take the recommendation and then flip it, to see whether it sends me back to the same item. Sometimes you get a pair that just goes back and forth, or you get a ring; if instead it is just doing a random walk, that is usually a bad sign. Does that make sense? Yeah, it makes sense, let's try it. "Super Mario Bros, recommend me another game." Okay, the user has played Super Mario Bros, let's do this: Ghost Squad. And flipping it returns exactly the same thing; Ghost Squad is an attractor, it's like a black hole. I think I like this one. But you can try Ghost Squad to Splinter Cell and just see. Wow, is Ghost Squad part of Splinter Cell? It could be, very interesting. I actually have no idea how this model is learning; I feel like it would be fun to do one of these and interpret it with more familiar categories. And you can see Splinter Cell goes to Rainbow Six. Then clearly the obvious next thing to do here is mech interp, once you have a model whose weights you have access to. All right, I'll create some linear probes for your model. Yeah, I hope to release this soon, at least the model weights, and then that could be done. It's just a fun idea that I've been brewing for a while.

Thank you, everyone; we're half an hour past time. This was very, very interesting, Eugene. As I said, if there is a chance, we can discuss; I have a lot of ideas for the vector quantization and also some measures for it. Thank you. Feel free to just put it in the LLM paper club chat, tag me, and we can chat there. Okay, thank you, everyone. Thank you, thank you. Let me stop the video. The recording stops on its own, right? It does. Yep, okay, I'll stop. Okay, take care, thank you, see you guys next week.
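As a footnote on the "flip test" Ted describes above (feed a recommendation back in and watch for pairs, rings, or a random walk), here is a minimal sketch of how you might automate it. The `recommend` callable and the toy catalog are stand-ins, not part of Eugene's model.

```python
# Minimal sketch of the flip test: follow top-1 recommendations item -> item
# and report whether the chain closes into a short cycle (a pair or ring) or
# keeps wandering like a random walk. `recommend` is any callable returning
# the top-1 recommended item for a seed item (a stand-in for the real model).
def follow_recommendations(recommend, seed_item, max_steps=10):
    chain, seen = [seed_item], {seed_item: 0}
    for step in range(1, max_steps + 1):
        nxt = recommend(chain[-1])
        if nxt in seen:                      # chain closed into a cycle
            return chain + [nxt], step - seen[nxt]
        seen[nxt] = step
        chain.append(nxt)
    return chain, None                       # no cycle found: random-walk-like

# Toy stand-in model: Super Mario Bros <-> Ghost Squad form a 2-cycle
toy = {"super mario bros": "ghost squad", "ghost squad": "super mario bros"}
print(follow_recommendations(toy.get, "super mario bros"))
# (['super mario bros', 'ghost squad', 'super mario bros'], 2)
```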