BERTopic Explained

Today it's estimated that up to 90% of the world's data is unstructured Meaning that it's built specifically for human consumption rather than for machines and that's Great for us, but it's also kind of difficult because when you're trying to organize all of that data There's a lot of data it is quite simply impossible To get people to do that.

It's too slow and it's too expensive fortunately There are more and more techniques that allow us to actually understand Texts or unstructured texts. We're now able to search based on the meaning of text identify the sentiment of text extract human entities and a lot more Transformer models are behind a lot of this now these transform models are the sort of thing that make sci-fi from a few years ago look almost obsolete because being able to Communicate in a human like way with a computer was regulated just as sci-fi, but now these models Can actually do a lot of the stuff that was deemed impossible Not that long ago in machine learning this act of organizing or clustering data is Generally referred to as topic modeling is the automatic clustering of data into particular topics What we're going to do in this video is take high-level view of that topic The library how we would use about topics default parameters to perform topic modeling on some data Then we'll go into the details and so what is better topic actually doing so we're going to have a look at The transformer embedding model a u-map for dimensionality reduction.

We're going to take a look at HDB scan for clustering and now we're going to look at CTF IDF for basically Extracting topics from your clusters each of those are Really interesting and once you understand how they work you can use that understanding to improve the performance of your Burt topic Topic modeling once you've been through all of that.

We're going to then Go back to Burt topic and see how our understanding of these different components Allows us to improve our topic modeling with Burt topic, so We're going to cover a lot But let's start with that high-level overview of the Burt topic library so we're going to start with this reddit topics data set and In here, we have the subreddit that the data has been pulled from We have the titled thread and we have the actual text within this self text feature We have a few different subreddits in here.

We have investing Python language technology and pytorch so we have four subreddits and why I want to do is use just the text and Automatically cluster each of these threads into a Cluster that represents that particular subreddit. So investing Python pytorch and language technology So in here, we have very similar datasets actually not the same.

We're going to use that one later But this one's a little bit shorter and it just includes data from the Python subreddit. We have 933 rows in there and Just going through this one as a quick example of how we might use the Burt topic library using the Basic parameters so we can remove Rows where soft text is NN or very short and this is if you're using a data frame Otherwise we would use this Okay So you can come up here and you just pip install the datasets library for that And then we come down here and we get to Burt topic now Burt topic you would install Burt topic so at the top here And All we're doing here.

So I've added this count vectorizer which is essentially just going to remove soft words in one of the latest steps of Burt topic and all you do is First just convert your data to a list if you need to actually this is for a panda's a frame So this is what we'd be running with the hugging face datasets object and All we do is and These arguments so we make sure we have language set to English so using a English language model For embedding the text and then we just fit transform.

Okay, so that's like fit and predict For this text here. Okay, and we can see here. We've just All we've done here is embedded the text to embeddings then we so those embeddings are quite a very high dimensional and It will depend on the model that we're using but you're you're thinking like hundreds of dimensions and then we reduce the dimensionality of those embeddings and Cluster those and then there is the finding of topics Based on the text in within each one those clusters So from that you we get those two list topics and props and if we take a look so we have This is just printing center print out five of them, but the size is too big.

So it's fine I'm not gonna look through all of them. So we have this predicted topic - so this is just a plus so we don't know what the actual cluster is and We come down and it's it was it I think Python for sure. Yeah, Python Now we have another one a negative one in that topic Is used where?

that data Doesn't seem to fit into any of the clusters that birth topic has identified. So it's an outlier Now in our case we can say okay, this is probably Python But that topic with the default parameters hasn't identified us. So but we're going to learn how to improve those default parameters so we'll figure that out and Then all we can do is see the top words found within each topic so basically the topic Using this get topic info You see that we we do actually have a lot of topics here.

Now. We're trying to break the Python subreddit down into Multiple topics here. So I'm not really sure how would be best to do that We can definitely see some of these make sense. So here you have the base on a assume image processing with Python learning Python Web development with Python and a few other things as well.

They're Curious about this one It was like some sort of game. I don't know but anyway, so we'll go down we have 16 of these of these topics and then we can also go ahead and visualize those so This is quite useful when trying to see how each topic might relate to another or how similar they are And Then if we come here, you can see how each of those clusters actually kind of grouped together in this hierarchy And we can also visualize the words within each one those topics as well But you know, this is just really simple but topic at the moment.

We've just used default parameters We haven't really done anything What I want to do is actually go into the details of that topic So let's start having a look at each of the components that I mentioned before within the bird topic pipeline So as I said bird topic consists of a few different components So we start off with our without text now our text can come in a lot of different formats but typically we're going to go with something like paragraph or sentence size chunks of text and We typically call those paragraph or sentence size chunks of text documents so our our documents or they come down here and pass into a NLP typically a sentence transformer model, so we have we'll just draw that as like a box here and that sentence transformer is going to output for each document a single vector now this Vector or this embedding is very high dimensional.

So it is I'm gonna Write as a list. It's just gonna be really Really big and we're going to have loads of them as we have loads of documents but within that vector we essentially have a numerical representation of the meaning behind its respective document So it's like we're translating human meaning into Machine meaning for each of these documents.

That's pretty much how we can see it, which is already I think Really cool. So once we have those vectors The problem with these is that they're very high dimensional. So sentence transformers typical dimension sizes 768 dimensions you have some sort of like half or also some mini models Which will output something like three eight four?

Dimensions now you have some models that output many more like open a eyes Embedding models I think can go up to 200 or no. No, they can go way higher They can go to I think it's just over 10k. So It's pretty huge now What we need to do is compress the information within those vectors into a smaller space So for that we use something called you map Now you map is Well, there's a lot to talk about when it comes to you map and we're only going to cover a very high level in this video But even that way we're going to leave it until later to ready.

So Dig into it to an extent But what that will do is allow us to compress those very high dimensional vectors into a smaller vector space so what we would typically do is go from 768 or any of those to something like 3d or 2d okay, so we can have something like 2 or 3d vectors now and what's really useful is that we can actually visualize that so we can have a look at our output from you map and Try and figure out.

Okay, is this producing? data that we can cluster or not because it's very Not always very easy, but it's easier to see or easier to assess whether there are Clusters in the data or some sort of structure in our data When we can actually see it So that's really useful and then after that we're going to pass those into another set so HDB scan now HDB scan again like you map.

There's a lot to talk about. We're going to cover it a high level but what that will do is actually identify clusters in our data and it's From what I have seen so far. It's very good at it And again, it's super interesting both you map and HDB scan in my opinion a very interesting techniques and then after that move on to a another Technique or component which is called C T F IDF now a few many of you May recognize the TF IDF part.

The C part is a modification of TF IDF TF IDF is almost like a way of identifying relevant Documents remember documents are those chunks of text based on a particular query. What we do here is identify particular words that are relevant to a particular set of documents e.g. our clusters or our topics and then it outputs those so what we do is we'd end up with like four clusters for example, which is what we're going to be Using so we'd end up with these four clusters and we'd be able to see okay.

This one over here is talking about money Stocks and and so on So that would be its topic and that would hopefully be the investing subreddit and then over here We would have for example Python and we would see things like well Python programming code and so on as those As those terms match up to that topic the best now At very high level that is Burt topic but of course There's not much we can do in terms of improving our code If this is all we know about that topic, we really want to see it in practice so what we're going to do is actually go through each of these steps one by one and See what we can do to improve that component for our particular data set and At least have a few ideas on how we might go about Tuning each of these components for other data sets as well Okay, so we'll start with the transformer embedding step.

So What we're going to do first is Actually get our data set. So I'm using this time. We're using the full data set the reddit topics data set I'm specifying this revision because in the future I'll probably add more and data to the state set adding this in will make sure that you are using the current version e.g.

the version I use of That data set rather than any new versions So now we have just in the 4000 rows in there again, I'm just going to remove any short Items in there and that leaves with just over 3,000 Unshuffled data Although you don't really need to Good practice anyway And then yeah, so if you find this is taking a long time to run on your computer Just reduce this number Okay, and if you do use this Then definitely shuffle before hand, but otherwise you probably don't need to so if it's taking a long time to run Just reduce this to like 1,000 just so you're Working through this with less data Again also you can find this Notebook and all the other notebooks.

I'm going to work through in the description below So the first step as I said is going to be that embedding set so What we're doing here is actually pulling in one. There is a mini sentence transform models I mentioned we can see here that it creates a word embeddings or sentence embeddings of 384 tokens, which is is nice rather than using the larger models.

It just means it's going to run quicker Particularly for just not using any Special machines, I wish I'm not I'm on my laptop right now So we come down and we can create the embeddings now as Mentioned we probably don't want to do all this at once unless you have some sort of supercomputer So I'm running through those in batches of 16, so I'm taking 16 of those documents.

I mentioned putting those into our sentence transform model and outputting 16 of these sentence embeddings and Then adding all those to this embeds Array here You can see that. Okay, so we encode the batch which we pull from there. It's just 16 at time we get our beddings to that batch and then we just add them to the embeds array and That's actually order is to the sentence transform or sentence embeddings, but I'm not going to go into detail on that because there are Many articles and videos on that that we have covered already, so we're gonna keep that very light and We move on to you map which we haven't covered before so we'll go into a little more detail on that now the you map dimensionality reduction step is very important because at the moment we have these 384 dimensional vectors and That's a lot of dimensions and We don't really need all of those dimensions to Fit the information that we need For just clustering our data or clustering that the meaning of those documents Instead we can actually try and compress as much meaning as possible Into a smaller vector space and with you map.

We're going to try and do that to two or three dimensions Now another reason we use this is one we can visualize our clusters which is very helpful if we're trying to automatically cluster things and - it makes in the next step of clustering much more efficient So there are several reasons that we do this That's a few of them So to better understand what you map is actually doing and why it might be better than other methods Which we'll talk about it's probably best that we Start with some data that we can visualize.

So the word embeddings or sentence embeddings that we just created 384 dimensional we can't visualize them at the moment So what I'm going to do is actually switch across to another data set and start with three dimensions and what we'll do is reduce that to two dimensions and we'll see how that works or we'll see the Behavior of those reduction techniques.

I'll just you map. What are you the ones as well? So For that I'm going to use this world cities geo Dataset in there what we have is we have cities their countries regions and continent Now what we want to focus on is a continent now. We also have these other features.

So like latitude and longitude and Using the latitude and longitude what I've done is correct these X Y and C coordinates. So we now have a coordinate system In order to place each of these cities on a Hypothetical globe and We have just over 9,000 of those so let's have a look at those Okay, so I'm just plotting these with plot league and you can see to go down I'm not going to go through it and What we can see is What looks pretty much like the outline of the world?

Okay, so it's not perfect like for some reason the United States of America just isn't in this data set So we have the North American continent and it only contains countries in Central America And then includes Canada right at the top there. So There's some missing chunks and I believe Russia is also missing obviously again quite a big place and I'm sure there are many other places missing as well.

But All that say it's not a perfect data set But we have enough in there to kind of see what we can do with reducing this 3d space into 2d space now Essentially what we're doing is we're trying to recreate a world map now. It's definitely not going to Replicate a world map or not any that any sensible person would use but we should be able to see some features of the different continents the countries and so on so what we're going to do with this world map is Reduce it into two-dimensional space using not just you map but also to other very popular dimensionality reduction techniques PCA or principal component analysis and TC now we can see when we try and do this with PCA we kind of like a circular globe like thing and We get these clusters that kind of overlap which it is not ideal.

I don't think a Clustering over them is going to perform particularly well here so but we what we do have is the distances between Those clusters or those continents has been preserved relatively. Well, at least the ones that were further apart from each other and That is because PCA is very good at preserving large distances But it's not so good at maintaining the local structure of your data and in our case, we kind of want to go and Maintain to an extent both the local and the global structure of our data as much as possible So what if we take a look at TC knee now?

TC knee is almost the opposite rather than focusing on the global structured data it focuses on local structure because it essentially works by creating a graph where each Sort of neighbor node is linked and it tries to maintain those local links between neighbors and this is very good at representing the local structure of our data and Through the local structure.

It can almost indirectly infer to an extent the Global structure, but it's still not that great it and we get this sort of messy overlapping of some clusters and generally with our Dataset at least the clusters and just not really tied together very well now another issue with TC knees I it's very random.

So every time you're in it, you're going to get a very You can get a very wildly different result, which is again, not very ideal. So what we use instead is UMAP Now UMAP kind of gives us the best of both worlds it performs well with both local and global structure and We can apply it incredibly easy Using the UMAP library which we can just pick install and Then in a few lines of code, we can actually apply UMAP to our data But there's much more to UMAP than just a few lines of code.

There are many parameters that we can Tune in order for UMAP to work with our data in a particular way or just fit to just the topology of our particular data set, so There's a lot to cover But I think by far the key parameter That can make the biggest impact is the n neighbors parameter now Do you map algorithm it does a lot of things but one of the key things it does is it identifies?

the density of particular areas in our data by searching through for each point in our data set searching through and finding the K nearest neighbors and that is what this n neighbors parameter is setting it's saying okay, how many of nearest neighbors are we going to have a look at and As you can imagine if you set an n neighbors parameter to be very small It is only really going to pick up on the local structure of your data Whereas if you set it to be very large, it's going to focus more on the global structure so what we can do is try and find something in the middle where we're focusing on both the global and local structure and This can work incredibly well.

So if we take a look at our Reduced data set we can see straight away that the continents are in the right places They've kind of been switched over so we have North is in the South and South is in North and In Oceania, we have a few islands that seem to so New Zealand is over here rather than over here Where I suppose it should be But otherwise, it seems to have done relatively well Now a lot of these so we have Japan over here.

We have the Maldives and Philippines a lot of these kind of bits that go off of the main clusters are Actually islands which makes sense. They are not as close To the continents or anything so it kind of makes sense that they are not within the same cluster and We can even see to an extent as the shape of some countries So over here as soon as I saw this sort of yellow bit sticking out on the left of Europe I thought that looks a lot like the Spanish Peninsula and If we go over there, we actually see that it is all these cities over here or Spain and then we also have Europe Portugal as well next to Spain over here we have France and Yeah, you have this of the local structure of countries Does seem to be preserved quite well?

So that seems to work really well already now there was just one other parameter that I added in order to get this sort of graph and That was the minute this parameter. Now what this does is Limits how closely you map can put two points together. So by Default this is I believe 0.1 and What I did was increase this up to 0.5 because we're 0.1 A lot of these links tend to just be really tightly string together and you almost get like strings all of countries but increasing that it kind of pushed the The structure out a little bit to be less stringy and more kind of like continents So that's another really cool parameter that you can so fine-tune and hopefully get some good results with now, let's take a look at the code that we apply to our reddit topics data, so come down here and What I'm doing is we set or what I was doing here is taking a look at the different nearest neighbors Values and seeing what seems to work.

Well And in the end what I settled on is first Compressing our data into a 3d space rather than 2d which makes sense. It means we have more dimensionalities to maintain more information Set nearest neighbors or n neighbors. Sorry to three number of components to three. That's just see the 3d aspect You set up to choose you want to 2d and then I also added in min dis but rather than increasing min dis from 0.1 actually decreases so the points can be closer together and What we got out from that is this and you can see here we have these four clusters and We have color over here, it's my bad, but essentially what we have is I think we're so red is Language technology purple is investing Blue is Pytorch and green is Python so we have all of these of Language technology Pytorch and Python or kind of together because they're more similar than investing which is really far away But we can clearly see that These look like pretty good clusters.

There's some overlap and that's to be expected there is there are a lot of topics or threads that could be in both Pytorch and Python and Language technology all at once. I said there is of course some overlap But for the most part we can see some pretty clear clusters there So that tells us okay, that's great.

Now we can move on to the next step which is clustering with HDB scan now What we've seen here we have we have like color-coded clusters But in reality, we're assuming that we don't actually have the subreddit titles already Okay, we want to be trying to extract those clusters without knowing what the clusters already are So we're gonna do that All we start with is actually a pip install of HDB scan super easy and then we can import HDB scan and We fit to our three-dimensional reddit topics data And then we can use this condensed tree plot to View how our strip how our clusters are being built But all we see at the moment on there is all of these red lines.

Now. These are not actually lines are circles Identifying the clusters that have been extracted Now obviously that it seems like there's a lot of clusters that have been extracted. We don't need we don't want to do that so Let's what we can do is make the criteria For selecting a cluster more strict.

So at the moment the Number of points needed to create a cluster I believe by default is something like five which is very small when we have just over 3,000 vectors that we want to cluster into just four topics So we can Increase the minimum cluster size and if we go with something like 80 to start with we get this so now we can actually see the condensed tree plot and Basically It's moving through the the lambda value here, which is just the parameters that HDB scan is using and we can imagine this as a line starting from the bottom and going up and as the line moves up we are Identifying the the strongest almost clusters within our data now at the moment, we're just returning to and Realistically, we want to try and return like four clusters and to me It probably seems like we'd want this one on the left But then we'd want this one this one and this one prior clusters or pulling in this big green blob instead Because it's being viewed algorithmically is a stronger cluster So maybe we can try and reduce Maybe our min cluster size.

We said to be too high we can reduce that Just reducing by 20 and then we pull in this really long one on the left over here So that also doesn't look very good. So clearly the issue is not the min cluster size and Maybe we need to change something else So what we can do is we'll keep mingle size at 80 and Will instead reduce the min samples which by default is set to the same as a min cluster size now min samples is basically how dense the core of A cluster needs to be in order to become a core of a cluster So we reduce that and if we reduce that then we get the four clusters That to me looks like we probably want So we have these four I imagine these three over here which are connected for longer up here are probably Python Pytorch and language technology and then over here we probably have investing So let's have a look at what we have Okay, cool, so you see that it looks like we have similar clusters what we got before right When we were pulling the colors in from the data and if we have look here, yes green investing red language technology purple Python and Blue is Pytorch.

And then we also have these orange Points now, these are the outliers. I think Susan belonging in any of these clusters and If we have a look at these Come over to here. We see things like daily Daily general discussion and advice thread, okay So to an extent it kind of makes sense that like this isn't really Necessarily about investing.

It's just like the general discussion and then up here We have some maybe it would be better if these were included within the language technology cluster And then over here down the bottom we have similar to the investing bit where we had the the description thread we have another description for a thread here, but from Python, so It's not perfect.

There are some outliers that we would like to be in there, but Quite a few of those outliers Kind of makes sense that they're not in there as a description threads or something else So it's kind of it's pretty cool to see that it it can actually do that.

So well Otherwise, I think that clustering looks pretty good So let's move on to the next step of CTF IDF Now we'll go through CTF IDF very quickly but all this is really doing is it's looking at the frequency of words within a particular class or within a particular topic that we've built and saying okay are those words very common so words like the a be very common and They would be scored lowly Thanks to the IDF part of the function, which is the inverse second frequency.

Look now common these words are But if we have a rare word something like Python and it just keeps popping up in one of these topics all the time Then there's a good idea that maybe Python the word Python is pretty relevant to that particular topic now I'm not gonna go not really gonna go through the code on this.

Just make it really quick All I'm going to do is show you We go through here. I'm actually tokenizing using NLTK now with Better topic. We don't really need to worry so much about this. It's going to be handled for us So that's not so important but we can see we create these tokens which are the words and it's These tokens or terms and that to the CTF IDF algorithm is going to be looking at in terms of frequency and identifying the most relevant ones So for example here we have Pytorch Model gradients, they're probably pretty relevant to the the Pytorch or Python or even natural language topic But less so for investing So come down here We have this is a number of tokens within each topic including our you know, not relevant or outlier topic And if we Scroll down these are our TF IDF scores for basically for every single possible word or token Within each of our topics.

So you see we have five rows here and then within each of those rows we have a sparsely populated matrix where each Entry in that matrix represents one token or one term by the word the although we remove these so that wouldn't be in there so like the word Python or PyTorch or investing or Money or something along those lines Then I'm just going to return the top and most common words per class.

We come down here and we see Okay, cool, so this one is our irrelevant topic and Then up here. I mean we can see straight away from these what? Hopefully what topics they are as we have market stock. It's probably investing Up here we have project Python code. I'm pretty sure it's Python loss X model PyTorch Pretty confident that's PyTorch.

And then here we have NLP and text data model. It's got to be a language technology so high level That's no more in detail, but still kind of high level. That's how it works Now, how do we apply what we've just gone through to the BERT topic library Fortunately, it's incredibly easy So again, I'm starting another notebook here.

This is number six Same again download data remove anything that's particularly long. This one's you'll be really short by the way, so No problem And then what we do is as before we initialize our UMAP and HDB scan Components, but we're just initializing. I'm not fitting or anything right now.

I've initialized these with the same Same parameters. I just add the the minimum spanning tree here for the HBC model which is basically just going to build like a linked tree through the through the data and I found that it improves the performance so added that in there and another thing that I've added in is this prediction data equals true now if we When you're using HDB scan with That topic you need to include that Otherwise you will get This attribute error that I mentioned here.

So to be Aaron no prediction data is that generated? You will get that error if you don't include prediction data equals truth for your HDB scan component Then other things that we need is our sentence transform the embedding model and in the TF IDF or CTF IDF set I didn't really go through it But I removed soft words like the a and so on So again as we you saw right to start I'm going to include the count vectorizer to remove those soft words And I added a few more as well.

So I took the nltk soft words Yeah, and then I also added these other words that kept popping up all the time. I'm kind of polluting the topics and Yeah Initialize the model and then we can Initialize Bert topic as we did before but then we add all of these different components Into the definition So it's literally what we've just been through we do the same thing we Initialize all those components, but we don't run them in their respective libraries.

We actually just pass them to That topic library, which makes it I think a lot nicer to work with because you just throw everything you initialize everything and Then just throw thing into that topic and then that topic will run through it all for you, which is really cool So as before we do the fit transform which is like a fit and predict we'll get topics and probabilities from that And we come down here and we have our topic so topic zero it looks like investing number one pie torch to language technology and three Python So that's really cool.

It has the topics And I can also see the hierarchy of those topics as well. So we have like Python pie torch and Language technology kind of grouped together earlier than the investing topic so I think that is Incredibly cool how easily we can organize data using Meaning rather than using keywords or some sort of complex rule based method With bird topic.

We literally just feed in our data obviously it helps to have an understanding of UMAP HTTP scan and the transformer embedding models But using all of what we've just learned we're able to Do this which is is very impressive just automatically cluster all of our Data in a meaningful way.

Okay, so that's it for this video. I Hope that topic and all these other Components of bird topic are as interesting to you as they are to me But I'll leave it there for now. So thank you very much for watching and I will see you in the next one.

Bye

BERTopic Explained

Chapters

Transcript