BERTopic Explained
Chapters
0:00 Intro
1:40 In this video
2:58 BERTopic Getting Started
8:48 BERTopic Components
15:21 Transformer Embedding
18:33 Dimensionality Reduction
25:07 UMAP
31:48 Clustering
37:22 c-TF-IDF
40:49 Custom BERTopic
44:04 Final Thoughts
00:00:00.000 |
Today it's estimated that up to 90% of the world's data is unstructured 00:00:08.900 |
specifically for human consumption rather than for machines and 00:00:16.380 |
Great for us, but it's also kind of difficult because when you're trying to organize all of that data 00:00:29.940 |
Getting people to do that manually is too slow and it's too expensive 00:00:36.700 |
There are more and more techniques that allow us to actually understand 00:00:40.900 |
Texts or unstructured texts. We're now able to search based on the meaning of text 00:00:54.980 |
Transformer models are behind a lot of this. Now, these transformer models are the sort of thing that make 00:01:02.260 |
sci-fi from a few years ago look almost obsolete because 00:01:10.340 |
communicating in a human-like way with a computer was relegated to sci-fi, but now these models 00:01:18.380 |
Can actually do a lot of the stuff that was deemed impossible 00:01:32.340 |
Generally referred to as topic modeling is the automatic clustering of data into particular topics 00:01:40.300 |
What we're going to do in this video is take a high-level view of BERTopic 00:01:44.580 |
the library, and how we would use BERTopic's default parameters to perform topic modeling on some data 00:01:51.520 |
Then we'll go into the details of what BERTopic is actually doing, so we're going to have a look at 00:02:00.000 |
the transformer embedding model, UMAP for dimensionality reduction, we're going to take a look at 00:02:07.440 |
HDBSCAN for clustering, and then we're going to look at c-TF-IDF 00:02:22.000 |
Really interesting and once you understand how they work you can use that understanding to improve the performance of 00:02:33.560 |
Topic modeling once you've been through all of that. We're going to then 00:02:36.920 |
go back to BERTopic and see how our understanding of these different components 00:02:44.440 |
allows us to improve our topic modeling with BERTopic, so 00:02:52.640 |
let's start with that high-level overview of the BERTopic library 00:02:58.160 |
so we're going to start with this reddit topics data set and 00:03:02.600 |
In here, we have the subreddit that the data has been pulled from 00:03:09.080 |
We have the title of the thread and we have the actual text within this selftext feature 00:03:14.440 |
We have a few different subreddits in here. We have 00:03:20.760 |
investing, Python, language technology, and PyTorch, so we have four subreddits, and 00:03:31.840 |
what we want to do is automatically cluster each of these threads into a 00:03:39.960 |
cluster that represents that particular subreddit, so investing, Python, PyTorch, and language technology 00:03:46.520 |
So in here, we have very similar datasets actually not the same. We're going to use that one later 00:03:51.760 |
But this one's a little bit shorter and it just includes data from the Python subreddit. We have 00:04:00.160 |
We're just going through this one as a quick example of how we might use the BERTopic library 00:04:14.080 |
We remove rows where selftext is NaN or very short, and this is if you're using a DataFrame 00:04:25.720 |
So you can come up here and you just pip install the datasets library 00:04:33.560 |
And then we come down here and we get to BERTopic; now for BERTopic you would pip install bertopic, so at the top here 00:04:44.360 |
all we're doing here, so I've added this CountVectorizer 00:04:47.920 |
which is essentially just going to remove stop words in one of the later steps of the pipeline 00:04:57.160 |
First just convert your data to a list if you need to; actually this is for a pandas DataFrame 00:05:03.480 |
So this is what we'd be running with the hugging face datasets 00:05:13.080 |
For these arguments we make sure we have language set to English, so we're using an English language model 00:05:19.240 |
for embedding the text, and then we just fit_transform; okay, so that's like fit and predict 00:05:27.760 |
for this text here 00:05:31.240 |
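To make that concrete, here is a minimal sketch of the default usage just described; `docs` is assumed to be a plain Python list of the Reddit thread texts.

```python
# Minimal sketch of default BERTopic usage, assuming `docs` is a list of strings
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# the CountVectorizer handles stop-word removal in the topic representation step
vectorizer = CountVectorizer(stop_words="english")

topic_model = BERTopic(language="english", vectorizer_model=vectorizer)
topics, probs = topic_model.fit_transform(docs)  # like fit + predict
```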
Under the hood, all BERTopic has done here is embed the text into embeddings 00:05:39.400 |
so those embeddings are very high dimensional, and 00:05:42.240 |
it will depend on the model that we're using, but you're thinking hundreds of dimensions, and 00:05:51.460 |
then we reduce the dimensionality of those embeddings and 00:05:56.720 |
Cluster those and then there is the finding of topics 00:06:01.420 |
Based on the text in within each one those clusters 00:06:05.240 |
So from that we get those two lists, topics and probs, and if we take a look, so we have 00:06:13.100 |
this is just printing out five of them, because the full list is too big, so it's fine 00:06:19.040 |
I'm not gonna look through all of them. So we have this predicted topic, so this is just a cluster 00:06:24.800 |
so we don't know what the actual topic is, and 00:06:27.380 |
we come down and it was, I think, Python. For sure, yeah, Python 00:06:34.980 |
Now we have another one with negative one as the topic 00:06:45.760 |
That means it doesn't seem to fit into any of the clusters that BERTopic has identified, so it's an outlier 00:06:53.880 |
Now in our case we can say okay, this is probably Python 00:06:58.280 |
but BERTopic with the default parameters hasn't identified it as such, so we're going to learn how to improve those default parameters 00:07:10.840 |
Then all we can do is see the top words found within each topic so basically the topic 00:07:22.600 |
You see that we do actually have a lot of topics here. Now, we're trying to break it into 00:07:33.120 |
multiple topics here, so I'm not really sure how it would be best to do that 00:07:39.320 |
We can definitely see some of these make sense. So here you have the 00:07:44.480 |
based on, I assume, image processing with Python 00:07:51.520 |
web development with Python, and a few other things as well; there's 00:07:58.440 |
what looks like some sort of game, I don't know, but anyway, so we'll go down 00:08:03.240 |
We have 16 of these topics, and then we can also go ahead and visualize those, so 00:08:11.240 |
This is quite useful when trying to see how each topic might relate to another or how similar they are 00:08:20.280 |
Then if we come here, you can see how each of those clusters actually kind of grouped together in this hierarchy 00:08:27.880 |
And we can also visualize the words within each one of those topics as well 00:08:34.340 |
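For reference, these are BERTopic's built-in visualization methods being used here; a short sketch assuming the `topic_model` from above (each returns a Plotly figure).

```python
# BERTopic's built-in visualizations
fig_topics = topic_model.visualize_topics()     # intertopic distance map
fig_hier = topic_model.visualize_hierarchy()    # how topics group together
fig_words = topic_model.visualize_barchart()    # top words within each topic

fig_topics.show()
```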
But you know, this is just really simple BERTopic at the moment; we've just used default parameters 00:08:43.720 |
What I want to do is actually go into the details of BERTopic 00:08:48.000 |
So let's start having a look at each of the components that I mentioned before 00:09:10.560 |
So we start off with our text. Now, our text can come in a lot of different formats 00:09:18.380 |
but typically we're going to go with something like paragraph or sentence size chunks of text and 00:09:23.880 |
We typically call those paragraph or sentence size chunks of text documents 00:09:30.960 |
So our documents come down here and get passed into a 00:09:40.960 |
transformer model, so we'll just draw that as a box here, and 00:09:49.680 |
it's going to output a single vector for each document; now this 00:09:55.220 |
vector, or this embedding, is very high dimensional 00:10:05.120 |
It's really big, and we're going to have loads of them, as we have loads of documents 00:10:17.440 |
Each one is a representation of the meaning behind its respective document 00:10:25.840 |
It's like machine-readable meaning for each of these documents; that's pretty much how we can see it 00:10:37.880 |
The problem with these is that they're very high dimensional. For sentence transformers, a typical dimension size is 00:10:44.820 |
768 dimensions; you also have some smaller, mini models 00:10:51.840 |
which will output something like 384 00:10:54.800 |
dimensions. Now, you have some models that output many more, like OpenAI's 00:11:00.840 |
embedding models, which I think can go way higher 00:11:06.040 |
They can go to, I think, just over 10k dimensions. So 00:11:14.680 |
What we need to do is compress the information within those vectors into a smaller space 00:11:29.960 |
Well, there's a lot to talk about when it comes to UMAP, and we're only going to cover it at a very high level in this video 00:11:37.120 |
but even that we're going to leave until a little later, so 00:11:44.920 |
But what that will do is allow us to compress those very high dimensional vectors into a smaller vector space 00:11:57.160 |
768 or any of those to something like 3d or 2d 00:12:02.500 |
okay, so we can have something like 2 or 3d vectors now and 00:12:08.320 |
what's really useful is that we can actually visualize that, so we can have a look at our output from UMAP and 00:12:17.720 |
see whether there's data that we can cluster or not, because it's 00:12:23.960 |
not always very easy, but it's easier to see, or easier to assess, whether there are 00:12:30.520 |
clusters in the data or some sort of structure in our data 00:12:36.400 |
So that's really useful, and then after that we're going to pass those reduced vectors into 00:12:49.360 |
HDBSCAN. Again, like UMAP, there's a lot to talk about; we're going to cover it at a high level 00:13:00.040 |
From what I have seen so far, it's very good at this kind of clustering 00:13:03.840 |
And again, it's super interesting; both UMAP and HDBSCAN are, in my opinion, very interesting techniques 00:13:27.000 |
You may recognize the TF-IDF part; the c part is a class-based modification of it 00:13:42.960 |
TF-IDF is normally used to identify relevant documents, remember documents are those chunks of text, based on a particular query. What we do here instead is 00:13:56.880 |
apply it to groups of documents, e.g. our clusters or our topics, and then it outputs the top terms for those. So we'd end up with, like 00:14:05.400 |
four clusters for example, which is what we're going to be 00:14:10.200 |
Using so we'd end up with these four clusters and we'd be able to see okay. This one over here is talking about 00:14:23.320 |
So that would be its topic and that would hopefully be the investing subreddit and then over here 00:14:30.560 |
We would have for example Python and we would see things like well 00:14:39.360 |
As those terms match up to that topic the best 00:14:50.120 |
There's not much we can do in terms of improving our code 00:14:54.280 |
if this is all we know about BERTopic; we really want to see it in practice 00:14:58.880 |
so what we're going to do is actually go through each of these steps one by one and 00:15:06.840 |
improve that component for our particular data set and 00:15:11.800 |
At least have a few ideas on how we might go about 00:15:16.040 |
Tuning each of these components for other data sets as well 00:15:20.280 |
Okay, so we'll start with the transformer embedding step. So 00:15:28.840 |
First we actually get our dataset. This time we're using the full dataset, the Reddit topics dataset 00:15:35.080 |
I'm specifying this revision because in the future I'll probably add more data, and 00:15:41.400 |
adding this in will make sure that you are using the current version, i.e. the version I use in 00:15:51.720 |
this video. So now we have just under 4,000 rows in there; again, I'm just going to remove any short 00:15:59.740 |
items in there, and that leaves us with just over 3,000 00:16:10.880 |
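As a rough sketch of that loading-and-filtering step (the dataset ID, split, column name, and length threshold here are assumptions, not confirmed in the video):

```python
# Hypothetical sketch of loading the Reddit topics data with Hugging Face datasets
from datasets import load_dataset

data = load_dataset(
    "jamescalam/reddit-topics",  # assumed dataset ID
    split="train",
    # revision="<tag>",  # pin a revision so later additions don't change your results
)
# drop rows where the selftext is missing or very short (threshold is illustrative)
data = data.filter(lambda row: row["selftext"] is not None and len(row["selftext"]) > 30)
docs = list(data["selftext"])
```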
And then yeah, so if you find this is taking a long time to run on your computer 00:16:22.760 |
then definitely shuffle and take a smaller subset beforehand, but otherwise you probably don't need to 00:16:42.120 |
You'll find this notebook, and all the other notebooks I'm going to work through, in the description below 00:16:48.200 |
So the first step as I said is going to be that embedding 00:16:55.680 |
What we're doing here is actually pulling in one of the MiniLM sentence transformer models 00:17:00.000 |
I mentioned. We can see here that it creates word embeddings or sentence embeddings of 00:17:05.980 |
384 dimensions, which is nice; rather than using the larger models, it just means it's going to run quicker 00:17:15.880 |
without any special machine, which I don't have; I'm on my laptop right now 00:17:31.480 |
As mentioned, we probably don't want to encode all of this at once unless you have some sort of supercomputer 00:17:37.820 |
So I'm running through those in batches of 16, so I'm taking 16 of those documents I mentioned 00:17:46.240 |
putting those into our sentence transformer model, and 00:18:01.120 |
you can see that here. Okay, so we encode the batch, which we pull from there, just 16 at a time 00:18:10.460 |
we get the embeddings for that batch, and then we just add them to the embeds array 00:18:14.600 |
There's actually more to sentence transformers and sentence embeddings, but I'm not going to go into detail on that here 00:18:33.480 |
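As a sketch of that batching loop (assuming the 384-dimensional MiniLM model is something like `all-MiniLM-L6-v2`, and `docs` is the list of thread texts):

```python
# Batched embedding with sentence-transformers; the model name is an assumption
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # outputs 384-d sentence embeddings

batch_size = 16
embeds = []
for i in range(0, len(docs), batch_size):
    batch = docs[i:i + batch_size]        # take 16 documents at a time
    embeds.append(model.encode(batch))    # encode the batch and collect the vectors
embeds = np.concatenate(embeds)           # shape: (n_docs, 384)
```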
We move on to UMAP, which we haven't covered before, so we'll go into a little more detail on that 00:18:48.520 |
at the moment we have these 384 dimensional vectors and 00:19:05.840 |
that's more than we really need for just clustering our data, or clustering the meaning of those documents 00:19:13.340 |
Instead we can actually try and compress as much meaning as possible 00:19:18.480 |
into a smaller vector space, and with UMAP we're going to try and do that, down to two or three dimensions 00:19:24.560 |
Now, another reason we use this is, one, we can visualize our clusters 00:19:30.800 |
which is very helpful if we're trying to automatically cluster things, and 00:19:36.520 |
two, it makes the next step of clustering much more efficient 00:19:48.360 |
So to better understand what you map is actually doing and why it might be better than other methods 00:19:59.040 |
Start with some data that we can visualize. So the word embeddings or sentence embeddings that we just created 00:20:05.160 |
384 dimensional we can't visualize them at the moment 00:20:09.080 |
So what I'm going to do is actually switch across to another data set and start with three dimensions 00:20:14.440 |
and what we'll do is reduce that to two dimensions and look at the 00:20:22.360 |
behavior of those reduction techniques, not just UMAP but the other ones as well. So 00:20:29.520 |
For that I'm going to use this world cities geo 00:20:33.200 |
dataset. In there, what we have is cities, their countries, regions, and continents 00:20:42.400 |
Now, what we want to focus on is the continent. We also have these other features, like latitude and longitude, and 00:20:49.560 |
using the latitude and longitude, what I've done is create these X, Y, and Z coordinates, so we now have a coordinate system 00:21:06.000 |
We have just over 9,000 of those so let's have a look at those 00:21:12.880 |
Okay, so I'm just plotting these with Plotly, and you can see, if we go down, I'm not going to go through it, that we get 00:21:22.800 |
what looks pretty much like the outline of the world 00:21:26.520 |
Okay, so it's not perfect like for some reason the United States of America just isn't in this data set 00:21:35.280 |
So we have the North American continent and it only contains countries in Central America 00:21:41.840 |
And then includes Canada right at the top there. So 00:21:58.520 |
I'm sure there are many other places missing as well. But 00:22:06.920 |
But we have enough in there to kind of see what we can do with reducing this 00:22:15.360 |
Essentially what we're doing is trying to recreate a world map. Now, it's definitely not going to 00:22:24.240 |
replicate a world map perfectly, or not any that any sensible person would use, but we should be able to make out 00:22:33.400 |
the different continents, the countries, and so on. So what we're going to do with this world map is 00:22:40.720 |
reduce it into two-dimensional space using not just UMAP but also two other very popular 00:22:47.560 |
dimensionality reduction techniques: PCA, or principal component analysis, and 00:22:52.640 |
t-SNE. Now we can see, when we try and do this with PCA 00:23:04.280 |
we get these clusters that kind of overlap, which is not ideal; I don't think a 00:23:10.760 |
clustering algorithm is going to perform particularly well over them here 00:23:23.320 |
The global layout of those clusters, or those continents, has been preserved relatively well, at least the ones that were further apart from each other, and 00:23:31.920 |
that is because PCA is very good at preserving large distances 00:23:40.000 |
but not so good at maintaining the local structure of your data, and ideally we want to 00:23:47.440 |
maintain to an extent both the local and the global structure of our data as much as possible 00:24:00.360 |
t-SNE is almost the opposite: rather than focusing on the 00:24:05.640 |
global structure of the data, it focuses on local structure, because it essentially works by creating a 00:24:15.680 |
sort of graph where neighboring points are linked, and it tries to maintain those links 00:24:23.880 |
this is very good at representing the local structure of our data and 00:24:29.480 |
Through the local structure. It can almost indirectly infer to an extent the 00:24:35.120 |
global structure, but it's still not that great at it, and we get this sort of messy 00:24:48.720 |
result; at least the clusters are just not really tied together very well. Now, another issue with t-SNE is 00:24:56.000 |
it's very random, so every time you run it, you're going to get a very 00:25:00.640 |
you can get a wildly different result, which is, again, not ideal. So what we use instead is UMAP 00:25:09.240 |
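For reference, a quick sketch of that comparison with scikit-learn; `xyz` is assumed to be the (n_cities, 3) array of coordinates built from the latitude and longitude.

```python
# Reducing the 3-D city coordinates to 2-D with PCA and t-SNE for comparison
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

pca_2d = PCA(n_components=2).fit_transform(xyz)    # tends to preserve global distances
tsne_2d = TSNE(n_components=2).fit_transform(xyz)  # tends to preserve local neighborhoods
```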
Now UMAP kind of gives us the best of both worlds it 00:25:14.280 |
performs well with both local and global structure and 00:25:27.660 |
Then in a few lines of code, we can actually apply UMAP to our data 00:25:33.360 |
But there's much more to UMAP than just a few lines of code; there are many parameters that we can tune to fit 00:25:50.160 |
the topology of our particular dataset, so 00:26:02.480 |
the one that can make the biggest impact is the n_neighbors parameter. Now 00:26:08.840 |
the UMAP algorithm does a lot of things, but one of the key things it does is 00:26:23.880 |
search through, for each point in our dataset, and find the 00:26:30.580 |
k nearest neighbors, and that is what this n_neighbors parameter is setting 00:26:35.580 |
it's saying okay, how many of nearest neighbors are we going to have a look at and 00:26:39.280 |
As you can imagine if you set an n neighbors parameter to be very small 00:26:45.960 |
It is only really going to pick up on the local structure of your data 00:26:52.000 |
Whereas if you set it to be very large, it's going to focus more on the global structure 00:26:56.240 |
so what we can do is try and find something in the middle, where we're focusing on both the local and global structure 00:27:05.400 |
This can work incredibly well. So if we take a look at our 00:27:14.600 |
reduced dataset, we can see straight away that the continents are roughly in the right places 00:27:26.640 |
well, north is in the south and south is in the north, and 00:27:29.440 |
in Oceania we have a few islands that seem to be misplaced, so New Zealand is over here rather than over here 00:27:39.880 |
But otherwise, it seems to have done relatively well 00:27:44.640 |
Now a lot of these so we have Japan over here. We have the Maldives and 00:27:51.160 |
Philippines a lot of these kind of bits that go off of the main clusters are 00:27:57.600 |
Actually islands which makes sense. They are not as close 00:28:05.600 |
so it kind of makes sense that they are not within the same cluster and 00:28:18.760 |
So over here as soon as I saw this sort of yellow bit sticking out on the left of Europe 00:28:25.280 |
I thought that looks a lot like the Spanish Peninsula and 00:28:29.600 |
If we go over there, we actually see that all these cities over here are Spain, and then we also have the rest of Europe 00:28:58.120 |
there was just one other parameter that I added in order to get this sort of graph and 00:29:03.960 |
That was the min_dist parameter. Now, what this does is 00:29:10.720 |
limit how closely UMAP can put two points together. So by default it's 0.1, and 00:29:21.840 |
what I did was increase this up to 0.5, because at 0.1 00:29:26.640 |
a lot of these points tend to just be really tightly strung together and you almost get these stringy shapes 00:29:40.280 |
Increasing it spreads the structure out a little bit to be less stringy and more kind of like continents 00:29:45.280 |
So that's another really cool parameter that you can 00:29:48.960 |
fine-tune and hopefully get some good results with. Now, let's take a look at the code that we apply to our 00:30:00.400 |
actual Reddit data. What I was doing here is taking a look at different n_neighbors values 00:30:17.920 |
compressing our data into a 3D space rather than 2D, which makes sense; it means we have more dimensions to fit the information into 00:30:26.120 |
We set n_neighbors, and we set the number of components to three; that's just so we can see the 3D aspect 00:30:34.800 |
You'd set it to two if you want 2D, and then I also added in min_dist 00:30:39.040 |
but rather than increasing min_dist from 0.1, I actually decreased it so the points can be closer together, and 00:30:45.240 |
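A rough sketch of that UMAP call (the exact parameter values here are illustrative, not necessarily the ones used in the notebook; `embeds` is the array of 384-dimensional sentence embeddings):

```python
# UMAP reduction of the sentence embeddings; parameter values are illustrative
import umap

reducer = umap.UMAP(
    n_neighbors=15,   # balance local vs. global structure
    n_components=3,   # 3 for a 3-D plot, 2 for a 2-D plot
    min_dist=0.05,    # smaller than the 0.1 default lets points sit closer together
    random_state=42,
)
u = reducer.fit_transform(embeds)  # shape: (n_docs, 3)
```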
What we got out from that is this and you can see here we have these four clusters and 00:30:53.160 |
We have the colors over here, my bad, but essentially what we have is, I think, red is 00:31:14.660 |
language technology; PyTorch and Python are kind of together because they're more similar to each other than to investing, which is really far away 00:31:24.960 |
These look like pretty good clusters. There's some overlap and that's to be expected 00:31:31.000 |
There are a lot of topics or threads that could be in both PyTorch and Python and 00:31:37.160 |
language technology all at once, so as I said, there is of course some overlap 00:31:42.000 |
But for the most part we can see some pretty clear clusters there 00:31:46.920 |
So that tells us, okay, that's great, now we can move on to the next step, which is clustering with HDBSCAN. Now 00:31:56.360 |
what we've seen here, we have these color-coded clusters 00:32:00.880 |
but in reality we're assuming that we don't actually have the subreddit labels already 00:32:06.280 |
Okay, we want to be trying to extract those clusters without knowing what the clusters already are 00:32:15.800 |
All we start with is actually a pip install of hdbscan 00:32:38.800 |
We can then view, using the condensed tree plot, how our clusters are being built 00:32:42.020 |
but all we see at the moment on there is all of these red lines. Now, these are not actually lines, they're circles 00:32:50.200 |
identifying the clusters that have been extracted 00:32:52.720 |
Now, obviously it seems like there are a lot of clusters being extracted; we don't want that 00:33:05.840 |
so we need to make the criteria for selecting a cluster more strict. So at the moment the min_cluster_size 00:33:15.240 |
I believe by default is something like five which is very small when we have just over 00:33:20.520 |
3,000 vectors that we want to cluster into just four topics 00:33:30.920 |
Increase the minimum cluster size and if we go with something like 80 to start with 00:33:36.080 |
we get this so now we can actually see the condensed tree plot and 00:33:43.360 |
It's moving through the lambda value here, which is just a parameter 00:33:49.880 |
that HDBSCAN is using, and we can imagine this as a line starting from the bottom and going up, and as the line moves it's 00:34:01.760 |
identifying the strongest clusters within our data 00:34:06.600 |
Now, at the moment we're just returning two clusters, and 00:34:11.920 |
Realistically, we want to try and return like four clusters and to me 00:34:16.680 |
It probably seems like we'd want this one on the left 00:34:20.080 |
but then we'd want this one, this one, and this one as our clusters, but it's pulling in this big green blob instead 00:34:27.280 |
because it's being viewed, algorithmically, as a stronger cluster 00:34:35.040 |
Maybe our min_cluster_size was set too high; we can reduce that 00:34:42.000 |
just reducing it by 20, and then we pull in this really long one on the left over here 00:34:47.280 |
So that also doesn't look very good, so clearly the issue is not the min_cluster_size, and 00:34:58.120 |
so what we can do is keep min_cluster_size at 80 and 00:35:05.280 |
we'll instead reduce min_samples, which by default is set to the same as min_cluster_size. Now, min_samples is 00:35:16.640 |
roughly how many neighboring points a point needs in order to become a core point of a cluster 00:35:20.880 |
So we reduce that, and if we reduce that, then we get the four clusters 00:35:30.480 |
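A sketch of that HDBSCAN step with the parameters being discussed (the min_samples value here is illustrative, since the exact number isn't stated; `u` is the UMAP-reduced array):

```python
# Clustering the UMAP-reduced embeddings with HDBSCAN
import hdbscan

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=80,  # require reasonably large clusters
    min_samples=40,       # lower than min_cluster_size, so selection is less conservative
)
clusterer.fit(u)

labels = clusterer.labels_                             # cluster per document, -1 = outlier
clusterer.condensed_tree_.plot(select_clusters=True)   # the condensed tree plot shown here
```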
So we have these four I imagine these three over here which are connected for longer up here are probably 00:35:37.320 |
Python Pytorch and language technology and then over here we probably have investing 00:35:46.600 |
Okay, cool, so you see that it looks like we have similar clusters to what we got before, right 00:35:56.880 |
when we were pulling the colors in from the data, and if we have a look here, yes: green is investing, red is language technology 00:36:07.040 |
blue is PyTorch, and then we also have these orange 00:36:10.240 |
points. Now, these are the outliers that it doesn't see as belonging in any of these clusters, and 00:36:25.360 |
one of them is a daily general discussion and advice thread, okay 00:36:28.860 |
So to an extent it kind of makes sense that like this isn't really 00:36:34.480 |
Necessarily about investing. It's just like the general discussion and then up here 00:36:40.000 |
We have some maybe it would be better if these were included within the language technology cluster 00:36:46.540 |
And then over here down the bottom we have similar to the investing bit where we had the the description thread 00:36:52.680 |
we have another description for a thread here, but from Python, so 00:36:56.360 |
It's not perfect; there are some outliers that we would like to be in there, but it 00:37:04.680 |
kind of makes sense that they're not in there, as they're description threads or something else 00:37:10.000 |
So it's pretty cool to see that it can actually do that so well 00:37:17.560 |
Otherwise, I think that clustering looks pretty good 00:37:32.120 |
Next is c-TF-IDF, but all this is really doing is looking at the 00:37:36.040 |
frequency of words within a particular class or within a particular topic that we've built and 00:37:43.040 |
saying, okay, are those words very common; so words like 'the' and 'a' will be very common, and 00:37:51.840 |
thanks to the IDF part of the function, which is the inverse document frequency, looking at how common these words are overall, they get down-weighted 00:37:57.880 |
But if we have a rare word, something like Python, and it just keeps popping up in one of these topics all the time, then 00:38:08.720 |
the word Python is pretty relevant to that particular topic. Now 00:38:13.120 |
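To make that concrete, here is a minimal sketch of the class-based TF-IDF idea, not BERTopic's exact implementation; `docs_df` is assumed to be a DataFrame with a "text" column and a "topic" column taken from the cluster labels.

```python
# Minimal class-based TF-IDF (c-TF-IDF) sketch, assuming a DataFrame `docs_df`
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# join all documents within each topic into one big "class document"
docs_per_topic = docs_df.groupby("topic")["text"].apply(" ".join)

vectorizer = CountVectorizer(stop_words="english")
tf = vectorizer.fit_transform(docs_per_topic).toarray()  # term counts per class

# class-based IDF: terms that are rare overall but frequent in one class score highest
avg_words_per_class = tf.sum(axis=1).mean()
idf = np.log(1 + avg_words_per_class / tf.sum(axis=0))
ctfidf = tf * idf

# top five terms for each topic
terms = vectorizer.get_feature_names_out()
top_terms = [terms[np.argsort(row)[-5:][::-1]] for row in ctfidf]
```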
I'm not really going to go through the code on this; I'll just make it really quick 00:38:27.200 |
With BERTopic we don't really need to worry so much about this; it's going to be handled for us 00:38:32.800 |
So that's not so important, but we can see we create these tokens, which are the words, and it's 00:38:38.800 |
these tokens or terms that the c-TF-IDF algorithm is going to be looking at in terms of frequency, and identifying that words like 00:38:51.280 |
model and gradients are probably pretty relevant to the PyTorch or Python or even natural language topics 00:39:05.880 |
Here we have the number of tokens within each topic, including our, you know, not-relevant or outlier topic 00:39:19.600 |
If we scroll down, these are our TF-IDF scores for basically every single possible word or token 00:39:28.160 |
Within each of our topics. So you see we have five rows here and then within each of those rows 00:39:34.000 |
we have a sparsely populated matrix where each 00:39:37.480 |
entry in that matrix represents one token or one term, like the word 'the', although we removed those so that wouldn't be in there 00:39:57.000 |
Then I'm just going to return the top n most common words per class. We come down here and we see 00:40:03.320 |
Okay, cool, so this one is our irrelevant topic and 00:40:09.040 |
Then up here, I mean, we can see straight away from these 00:40:13.280 |
hopefully what topics they are; we have market, stock, so it's probably investing 00:40:21.480 |
Up here we have project, Python, code, so I'm pretty sure that's Python; then with loss 00:40:31.160 |
I'm pretty confident that's PyTorch, and then here we have NLP and text, data, model; that's got to be language technology 00:40:42.240 |
So that's it in a bit more detail, but still kind of high level; that's how it works 00:40:49.280 |
Now, how do we apply what we've just gone through to the BERT topic library 00:40:57.800 |
So again, I'm starting another notebook here. This is number six 00:41:02.160 |
Same again: download the data, remove anything that's particularly short. This notebook will be really short, by the way, so 00:41:12.440 |
And then what we do is, as before, we initialize our UMAP and HDBSCAN 00:41:18.840 |
components, but we're just initializing; I'm not fitting or anything right now. I've initialized these with the same 00:41:26.080 |
parameters; I just add the minimum spanning tree here for the HDBSCAN model 00:41:33.800 |
which is basically just going to build a linked tree through the data, and I found that it improves the performance, so I 00:41:40.640 |
added that in there. Another thing that I've added in is this prediction_data equals True; now, if we don't include it we get 00:41:55.160 |
this AttributeError that I mentioned here, the 'no prediction data was generated' one 00:42:00.160 |
You will get that error if you don't include prediction_data equals True for your HDBSCAN component 00:42:09.720 |
The other things that we need are our sentence transformer embedding model and 00:42:16.920 |
the TF-IDF, or c-TF-IDF, step. I didn't really go through it 00:42:20.680 |
but I removed stop words like 'the', 'a', and so on 00:42:24.320 |
So again, as you saw right at the start, I'm going to include the CountVectorizer to remove those stop words 00:42:31.080 |
and I added a few more as well. So I took the NLTK stop words 00:42:35.280 |
and then I also added these other words that kept popping up all the time and kind of polluting the topics, and 00:42:49.360 |
then we initialize BERTopic as we did before, but then we add all of these different components 00:42:59.040 |
So it's literally what we've just been through; we do the same thing, we 00:43:05.560 |
initialize all those components, but we don't run them with their respective libraries; we actually just pass them to BERTopic, which 00:43:15.040 |
I think is a lot nicer to work with, because you just initialize everything and 00:43:21.240 |
then just throw everything into BERTopic, and BERTopic will run through it all for you, which is really cool 00:43:26.880 |
So as before we do the fit_transform, which is like a fit and predict, and we'll get topics and probabilities from that 00:43:36.960 |
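Putting the whole custom pipeline together, a sketch along the lines of what's being described (the specific parameter values and stop-word list are illustrative, not the exact ones in the notebook):

```python
# Passing custom components into BERTopic; values are illustrative
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
import umap
import hdbscan

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
umap_model = umap.UMAP(n_neighbors=15, n_components=5, min_dist=0.0, random_state=42)
hdbscan_model = hdbscan.HDBSCAN(
    min_cluster_size=80,
    min_samples=40,
    gen_min_span_tree=True,   # build the minimum spanning tree mentioned above
    prediction_data=True,     # avoids the "no prediction data" AttributeError
)
vectorizer_model = CountVectorizer(stop_words="english")  # extend with NLTK/custom words as needed

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
)
topics, probs = topic_model.fit_transform(docs)
topic_model.get_topic_info()  # summary table of the discovered topics
```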
And we come down here and we have our topics; so topic zero, it looks like investing, number one 00:43:43.480 |
PyTorch, two language technology, and three Python 00:43:52.440 |
And we can also see the hierarchy of those topics as well, so we have Python, PyTorch, and 00:43:58.000 |
language technology kind of grouped together earlier than the investing topic 00:44:08.800 |
It's incredibly cool how easily we can organize data based on its 00:44:15.520 |
meaning, rather than using keywords or some sort of complex set of rules 00:44:23.040 |
With BERTopic, we literally just feed in our data 00:44:26.920 |
Obviously it helps to have an understanding of UMAP, HDBSCAN, and the other components, but BERTopic can 00:44:41.800 |
do this, which is very impressive, just automatically organizing 00:44:49.280 |
data in a meaningful way. Okay, so that's it for this video. I 00:44:58.120 |
hope the components of BERTopic are as interesting to you as they are to me 00:45:04.080 |
But I'll leave it there for now, so thank you very much for watching.