
BERTopic Explained


Chapters

0:00 Intro
1:40 In this video
2:58 BERTopic Getting Started
8:48 BERTopic Components
15:21 Transformer Embedding
18:33 Dimensionality Reduction
25:07 UMAP
31:48 Clustering
37:22 c-TF-IDF
40:49 Custom BERTopic
44:04 Final Thoughts

Whisper Transcript | Transcript Only Page

00:00:00.000 | Today it's estimated that up to 90% of the world's data is unstructured
00:00:06.220 | Meaning that it's built
00:00:08.900 | specifically for human consumption rather than for machines and
00:00:14.260 | that's
00:00:16.380 | Great for us, but it's also kind of difficult because when you're trying to organize all of that data
00:00:22.740 | There's a lot of data
00:00:25.260 | it is
00:00:27.660 | quite simply impossible
00:00:29.940 | To get people to do that. It's too slow and it's too expensive
00:00:34.220 | fortunately
00:00:36.700 | There are more and more techniques that allow us to actually understand
00:00:40.900 | Texts or unstructured texts. We're now able to search based on the meaning of text
00:00:47.860 | identify the sentiment of text
00:00:50.140 | extract named entities, and a lot more.
00:00:54.980 | Transformer models are behind a lot of this. Now, these transformer models are the sort of thing that make
00:01:02.260 | sci-fi from a few years ago look almost obsolete because
00:01:07.580 | being able to
00:01:10.340 | communicate in a human-like way with a computer was relegated to just sci-fi, but now these models
00:01:18.380 | Can actually do a lot of the stuff that was deemed impossible
00:01:22.980 | Not that long ago in machine learning
00:01:26.700 | this act of
00:01:29.500 | organizing or clustering data is
00:01:32.340 | generally referred to as topic modeling, which is the automatic clustering of data into particular topics.
00:01:40.300 | What we're going to do in this video is take a high-level view of the BERTopic
00:01:44.580 | library, how we would use BERTopic's default parameters to perform topic modeling on some data.
00:01:51.520 | Then we'll go into the details, as in what is BERTopic actually doing. So we're going to have a look at
00:02:00.000 | the transformer embedding model, UMAP for dimensionality reduction. We're going to take a look at
00:02:07.440 | HDBSCAN for clustering, and then we're going to look at c-TF-IDF
00:02:14.320 | for basically
00:02:16.320 | Extracting topics from your clusters
00:02:19.040 | each of those are
00:02:22.000 | Really interesting and once you understand how they work you can use that understanding to improve the performance of
00:02:31.880 | BERTopic
00:02:33.560 | topic modeling. Once you've been through all of that, we're going to then
00:02:36.920 | go back to BERTopic and see how our understanding of these different components
00:02:44.440 | allows us to improve our topic modeling with BERTopic. So
00:02:50.160 | We're going to cover a lot
00:02:52.640 | But let's start with that high-level overview of the BERTopic library.
00:02:58.160 | so we're going to start with this reddit topics data set and
00:03:02.600 | In here, we have the subreddit that the data has been pulled from
00:03:09.080 | We have the title of the thread and we have the actual text within this selftext feature.
00:03:14.440 | We have a few different subreddits in here. We have
00:03:18.780 | investing
00:03:20.760 | Python, language technology, and PyTorch, so we have four subreddits, and
00:03:26.600 | what I want to do is
00:03:29.360 | use just the text and
00:03:31.840 | automatically cluster each of these threads into a
00:03:39.960 | cluster that represents that particular subreddit: so investing, Python, PyTorch, and language technology.
00:03:46.520 | So in here, we have a very similar dataset, actually not the same one. We're going to use that one later,
00:03:51.760 | But this one's a little bit shorter and it just includes data from the Python subreddit. We have
00:03:57.320 | 933 rows in there and
00:04:00.160 | just going through this one as a quick example of how we might use the BERTopic library
00:04:08.840 | using the
00:04:10.760 | Basic parameters so we can remove
00:04:14.080 | rows where selftext is NaN or very short, and this is if you're using a DataFrame;
00:04:21.420 | Otherwise we would use this
00:04:25.720 | So you can come up here and you just pip install the datasets library
00:04:30.200 | for that
00:04:33.560 | And then we come down here and we get to BERTopic. Now, for BERTopic, you would install BERTopic, so at the top here.
00:04:44.360 | All we're doing here, so I've added this CountVectorizer,
00:04:47.920 | which is essentially just going to remove stop words in one of the later steps of
00:04:52.640 | BERTopic, and
00:04:55.160 | all you do is
00:04:57.160 | first just convert your data to a list if you need to. Actually, this is for a pandas DataFrame,
00:05:03.480 | so this is what we'd be running with the Hugging Face datasets
00:05:07.200 | object, and
00:05:09.960 | all we do is add
00:05:13.080 | these arguments: we make sure we have language set to English, so we're using an English-language model
00:05:19.240 | for embedding the text, and then we just fit_transform. Okay, so that's like fit and predict
00:05:27.760 | for this text here. Okay, and we can see here, we've just
00:05:31.240 | All we've done here is embedded the text to embeddings
00:05:37.020 | then we
00:05:39.400 | so those embeddings are quite high-dimensional, and
00:05:42.240 | it will depend on the model that we're using, but you're thinking like hundreds of dimensions, and
00:05:51.460 | then we reduce the dimensionality of those embeddings and
00:05:56.720 | cluster those, and then there is the finding of topics
00:06:01.420 | based on the text within each one of those clusters.
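
Pieced together, the default usage described above looks roughly like this (a minimal sketch; the selftext column name follows the dataset described earlier, and the stop-word handling is the CountVectorizer mentioned above):

    from bertopic import BERTopic
    from sklearn.feature_extraction.text import CountVectorizer

    # removes stop words in the later c-TF-IDF step
    vectorizer_model = CountVectorizer(stop_words="english")

    # data is assumed to be a Hugging Face Dataset with a "selftext" text field
    docs = list(data["selftext"])

    topic_model = BERTopic(language="english", vectorizer_model=vectorizer_model)
    topics, probs = topic_model.fit_transform(docs)  # like fit + predict: a topic id (or -1) per document
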
00:06:05.240 | So from that we get those two lists, topics and probs, and if we take a look, so we have,
00:06:13.100 | this is just printing; it's meant to print out five of them, but the size is too big, so it's fine.
00:06:19.040 | I'm not gonna look through all of them. So we have this predicted topic; this is just a cluster,
00:06:24.800 | so we don't know what the actual cluster represents, and
00:06:27.380 | we come down and it was, I think, Python for sure. Yeah, Python.
00:06:34.980 | Now we have another one, a negative one. In BERTopic, a topic of -1
00:06:39.700 | is used where
00:06:42.840 | the data
00:06:45.760 | doesn't seem to fit into any of the clusters that BERTopic has identified. So it's an outlier.
00:06:53.880 | Now in our case we can say okay, this is probably Python
00:06:58.280 | But BERTopic with the default parameters hasn't identified this, but we're going to learn how to improve those default parameters,
00:07:06.460 | so we'll figure that out and
00:07:10.840 | Then what we can do is see the top words found within each topic, so basically the topics,
00:07:20.600 | using this get_topic_info.
00:07:22.600 | You see that we do actually have a lot of topics here. Now, we're trying to break
00:07:28.760 | the Python
00:07:31.120 | subreddit down into
00:07:33.120 | multiple topics here, so I'm not really sure how it would be best to do that.
00:07:39.320 | We can definitely see some of these make sense. So here you have the
00:07:44.480 | based on these I assume: image processing with Python,
00:07:49.520 | learning Python,
00:07:51.520 | web development with Python, and a few other things as well.
00:07:55.960 | I'm curious about this one;
00:07:58.440 | it was like some sort of game, I don't know. But anyway, so we'll go down;
00:08:03.240 | we have 16 of these topics, and then we can also go ahead and visualize those. So
00:08:11.240 | This is quite useful when trying to see how each topic might relate to another or how similar they are
00:08:20.280 | Then if we come here, you can see how each of those clusters actually kind of grouped together in this hierarchy
00:08:27.880 | And we can also visualize the words within each one of those topics as well.
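
Those inspection and visualization steps correspond to BERTopic methods along these lines (a sketch, assuming the topic_model fitted above):

    topic_model.get_topic_info()       # one row per topic, with sizes and top words
    topic_model.get_topic(0)           # top terms and scores for a single topic
    topic_model.visualize_topics()     # intertopic distance map (how topics relate)
    topic_model.visualize_hierarchy()  # how clusters group together hierarchically
    topic_model.visualize_barchart()   # top words within each topic
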
00:08:34.340 | But you know, this is just really simple BERTopic at the moment. We've just used default parameters;
00:08:41.280 | we haven't really done anything.
00:08:43.720 | What I want to do is actually go into the details of BERTopic.
00:08:48.000 | So let's start having a look at each of the components that I mentioned before
00:08:55.140 | within the BERTopic pipeline.
00:08:58.600 | So as I said, BERTopic
00:09:01.800 | consists of a few different components
00:09:10.560 | So we start off with our text. Now, our text can come in a lot of different formats,
00:09:18.380 | but typically we're going to go with something like paragraph- or sentence-size chunks of text, and
00:09:23.880 | we typically call those paragraph- or sentence-size chunks of text documents.
00:09:30.960 | So our documents come down here and are passed into
00:09:39.320 | typically a sentence
00:09:40.960 | transformer model, so we'll just draw that as like a box here, and
00:09:46.600 | that sentence transformer is
00:09:49.680 | going to output, for each document, a single vector. Now this
00:09:55.220 | vector, or this embedding, is very high-dimensional. So I'm gonna
00:10:01.480 | write it as a list; it's just gonna be really,
00:10:05.120 | really big, and we're going to have loads of them, as we have loads of documents.
00:10:09.100 | but within that
00:10:12.160 | vector we essentially have a
00:10:15.480 | numerical
00:10:17.440 | representation of the meaning behind its respective document
00:10:21.240 | So it's like we're translating
00:10:23.520 | human meaning into
00:10:25.840 | Machine meaning for each of these documents. That's pretty much how we can see it, which is already I think
00:10:34.800 | Really cool. So once we have those vectors
00:10:37.880 | The problem with these is that they're very high-dimensional. So for sentence transformers, a typical dimension size is
00:10:44.820 | 768 dimensions; you have some sort of half-size or mini models
00:10:51.840 | which will output something like 384
00:10:54.800 | dimensions. Now, you have some models that output many more, like OpenAI's
00:11:00.840 | embedding models, which I think can go up to 200... no, no, they can go way higher;
00:11:06.040 | they can go to, I think, just over 10k. So
00:11:12.680 | pretty huge now
00:11:14.680 | What we need to do is compress the information within those vectors into a smaller space
00:11:20.600 | So for that we use something called UMAP.
00:11:23.200 | Now UMAP is...
00:11:29.960 | well, there's a lot to talk about when it comes to UMAP, and we're only going to cover it at a very high level in this video,
00:11:37.120 | but even then we're going to leave it until later to really
00:11:41.680 | dig into it to an extent.
00:11:44.920 | But what that will do is allow us to compress those very high dimensional vectors into a smaller vector space
00:11:53.120 | so what we would typically do is go from
00:11:57.160 | 768 or any of those to something like 3d or 2d
00:12:02.500 | okay, so we can have something like 2 or 3d vectors now and
00:12:08.320 | what's really useful is that we can actually visualize that, so we can have a look at our output from UMAP and
00:12:14.440 | Try and figure out. Okay, is this producing?
00:12:17.720 | data that we can cluster or not because it's very
00:12:23.960 | Not always very easy, but it's easier to see or easier to assess whether there are
00:12:30.520 | Clusters in the data or some sort of structure in our data
00:12:33.800 | When we can actually see it
00:12:36.400 | So that's really useful and then after that we're going to pass those
00:12:42.100 | into another step, so
00:12:44.880 | HDBSCAN.
00:12:49.360 | HDBSCAN, again like UMAP, there's a lot to talk about; we're going to cover it at a high level,
00:12:54.120 | but what that will do is actually
00:12:56.360 | identify clusters in our data, and it's,
00:13:00.040 | from what I have seen so far, very good at it.
00:13:03.840 | And again, it's super interesting; both UMAP and HDBSCAN are, in my opinion, very interesting techniques.
00:13:12.240 | and then after that
00:13:13.960 | move on to
00:13:15.840 | another
00:13:17.400 | technique or component, which is called c-TF-
00:13:23.440 | IDF. Now, many of you
00:13:27.000 | may recognize the TF-IDF part; the c part is a modification of
00:13:33.840 | TF-IDF. TF-IDF is almost like a
00:13:37.600 | way of identifying
00:13:40.720 | relevant
00:13:42.960 | documents (remember, documents are those chunks of text) based on a particular query. What we do here is
00:13:49.280 | identify
00:13:51.640 | particular words
00:13:53.360 | that are relevant to a particular set of
00:13:56.880 | documents e.g. our clusters or our topics and then it outputs those so what we do is we'd end up with like
00:14:05.400 | four clusters for example, which is what we're going to be
00:14:10.200 | Using so we'd end up with these four clusters and we'd be able to see okay. This one over here is talking about
00:14:17.960 | money
00:14:20.760 | stocks, and so on.
00:14:23.320 | So that would be its topic and that would hopefully be the investing subreddit and then over here
00:14:30.560 | We would have for example Python and we would see things like well
00:14:33.800 | Python
00:14:35.680 | programming, code, and so on, as those
00:14:39.360 | terms match up to that topic the best.
00:14:44.800 | At a very high level, that is BERTopic,
00:14:47.720 | but of course
00:14:50.120 | there's not much we can do in terms of improving our code
00:14:54.280 | if this is all we know about BERTopic. We really want to see it in practice,
00:14:58.880 | so what we're going to do is actually go through each of these steps one by one and
00:15:03.160 | See what we can do to
00:15:06.840 | improve that component for our particular data set and
00:15:11.800 | At least have a few ideas on how we might go about
00:15:16.040 | Tuning each of these components for other data sets as well
00:15:20.280 | Okay, so we'll start with the transformer embedding step. So
00:15:26.040 | What we're going to do first is
00:15:28.840 | actually get our dataset. So this time we're using the full dataset, the reddit topics dataset.
00:15:35.080 | I'm specifying this revision because in the future I'll probably add more
00:15:38.960 | data to the dataset;
00:15:41.400 | adding this in will make sure that you are using the current version, e.g. the version I use, of
00:15:48.680 | that dataset rather than any new versions.
00:15:51.720 | So now we have just under 4,000 rows in there. Again, I'm just going to remove any short
00:15:59.740 | items in there, and that leaves us with just over 3,000.
00:16:04.840 | Then I shuffle the data,
00:16:06.560 | although you don't really need to;
00:16:08.560 | good practice anyway.
00:16:10.880 | And then yeah, so if you find this is taking a long time to run on your computer
00:16:17.600 | Just reduce this number
00:16:20.200 | Okay, and if you do use this
00:16:22.760 | then definitely shuffle beforehand, but otherwise you probably don't need to. So if it's taking a long time to run,
00:16:29.280 | Just reduce this to like 1,000
00:16:31.280 | just so you're
00:16:33.960 | Working through this with less data
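
As a rough sketch, that loading and filtering step looks something like the following; the dataset ID, revision string, and length cutoff here are placeholders rather than the exact values used in the video:

    from datasets import load_dataset

    # placeholders for the actual dataset ID and pinned revision used in the video
    data = load_dataset("<user>/reddit-topics", split="train", revision="<pinned-revision>")

    # drop rows where selftext is missing or very short, then shuffle
    data = data.filter(lambda x: x["selftext"] is not None and len(x["selftext"]) > 30)
    data = data.shuffle(seed=42)
    docs = list(data["selftext"])
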
00:16:36.000 | Again also you can
00:16:39.800 | find this
00:16:42.120 | Notebook and all the other notebooks. I'm going to work through in the description below
00:16:48.200 | So the first step as I said is going to be that embedding
00:16:53.180 | step. So
00:16:55.680 | what we're doing here is actually pulling in one of the mini sentence transformer models
00:17:00.000 | I mentioned. We can see here that it creates word embeddings or sentence embeddings of
00:17:05.980 | 384 dimensions, which is nice; rather than using the larger models, it just means it's going to run quicker,
00:17:13.080 | particularly as I'm not using any
00:17:15.880 | special machines; I'm on my laptop right now.
00:17:21.080 | So we come down and we can
00:17:29.480 | create the embeddings now as
00:17:31.480 | Mentioned we probably don't want to do all this at once unless you have some sort of supercomputer
00:17:37.820 | So I'm running through those in batches of 16, so I'm taking 16 of those documents. I mentioned
00:17:49.920 | putting those into our sentence transformer model and
00:17:49.920 | outputting 16
00:17:52.680 | of these sentence embeddings and
00:17:56.440 | Then adding all those to this embeds
00:17:58.800 | Array here
00:18:01.120 | You can see that here: okay, so we encode the batch, which we pull from there, just 16 at a time,
00:18:08.200 | we get our
00:18:10.460 | embeddings for that batch, and then we just add them to the embeds array.
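
In code, that batched embedding loop is roughly as follows (a sketch; the exact MiniLM model name is an assumption):

    from sentence_transformers import SentenceTransformer
    import numpy as np

    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed 384-dim MiniLM model

    embeds = []
    batch_size = 16
    for i in range(0, len(docs), batch_size):
        batch = docs[i:i + batch_size]       # 16 documents at a time
        embeds.extend(model.encode(batch))   # 16 sentence embeddings, 384 dims each
    embeds = np.array(embeds)                # shape: (n_docs, 384)
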
00:18:14.600 | That's really all there is to the sentence transformer or sentence embedding step, but I'm not going to go into detail on that because there are
00:18:23.600 | Many articles and videos on that
00:18:26.880 | that we have covered already, so
00:18:30.320 | we're gonna keep that very light and
00:18:33.480 | we move on to UMAP, which we haven't covered before, so we'll go into a little more detail on that.
00:18:39.360 | Now the UMAP
00:18:42.280 | dimensionality reduction step is
00:18:45.320 | very important because
00:18:48.520 | at the moment we have these 384 dimensional vectors and
00:18:53.560 | That's a lot of dimensions and
00:18:57.360 | We don't really need
00:19:00.340 | all of those dimensions to
00:19:03.200 | Fit the information that we need
00:19:05.840 | for just clustering our data, or clustering the meaning of those documents.
00:19:13.340 | Instead we can actually try and compress as much meaning as possible
00:19:18.480 | into a smaller vector space, and with UMAP we're going to try and do that, down to two or three dimensions.
00:19:24.560 | Now, another reason we use this is, one, we can visualize our clusters,
00:19:30.800 | which is very helpful if we're trying to automatically cluster things, and
00:19:36.520 | two, it makes the next step of clustering much more efficient.
00:19:42.500 | So there are several reasons that we do this
00:19:46.520 | That's a few of them
00:19:48.360 | So to better understand what you map is actually doing and why it might be better than other methods
00:19:55.760 | Which we'll talk about
00:19:57.040 | it's probably best that we
00:19:59.040 | Start with some data that we can visualize. So the word embeddings or sentence embeddings that we just created
00:20:05.160 | are 384-dimensional; we can't visualize them at the moment.
00:20:09.080 | So what I'm going to do is actually switch across to another data set and start with three dimensions
00:20:14.440 | and what we'll do is reduce that to two dimensions and
00:20:16.800 | we'll see
00:20:19.440 | how that works or we'll see the
00:20:22.360 | behavior of those reduction techniques, not just UMAP but the other ones as well. So
00:20:29.520 | For that I'm going to use this world cities geo
00:20:33.200 | dataset. In there, what we have is cities, their countries, regions, and continents.
00:20:42.400 | Now, what we want to focus on is the continent. We also have these other features, like latitude and longitude, and
00:20:49.560 | using the latitude and longitude, what I've done is create these X, Y, and Z coordinates. So we now have a coordinate system
00:20:57.080 | In order to place each of these cities on a
00:21:02.480 | Hypothetical globe and
00:21:06.000 | We have just over 9,000 of those so let's have a look at those
00:21:12.880 | Okay, so I'm just plotting these with plot league and you can see to go down I'm not going to go through it and
00:21:19.160 | What we can see is
00:21:22.800 | What looks pretty much like the outline of the world?
00:21:26.520 | Okay, so it's not perfect like for some reason the United States of America just isn't in this data set
00:21:35.280 | So we have the North American continent and it only contains countries in Central America
00:21:41.840 | And then includes Canada right at the top there. So
00:21:45.280 | There's some missing chunks and I believe
00:21:49.840 | Russia is also missing
00:21:53.880 | obviously again quite a big place and
00:21:58.520 | I'm sure there are many other places missing as well. But
00:22:03.360 | All that to say, it's not a perfect dataset,
00:22:06.920 | But we have enough in there to kind of see what we can do with reducing this
00:22:11.680 | 3d space into 2d space now
00:22:15.360 | Essentially what we're doing is we're trying to recreate a world map now. It's definitely
00:22:21.040 | not going to
00:22:24.240 | replicate a world map, not one that any sensible person would use, but
00:22:30.560 | we should be able to see some features of
00:22:33.400 | the different continents the countries and so on so what we're going to do with this world map is
00:22:40.720 | reduce it into two-dimensional space using not just UMAP, but also two other very popular
00:22:47.560 | dimensionality reduction techniques: PCA, or principal component analysis, and
00:22:52.640 | t-SNE. Now, we can see when we try and do this with
00:22:58.800 | PCA, we get kind of like a circular,
00:23:02.280 | globe-like thing, and
00:23:04.280 | we get these clusters that kind of overlap, which is not ideal. I don't think
00:23:10.760 | clustering over them is going to perform particularly well here,
00:23:17.520 | but what we do have is that the
00:23:21.880 | distances between
00:23:23.320 | those clusters, or those continents, have been preserved relatively well, at least the ones that were further apart from each other, and
00:23:31.920 | That is because PCA is very good at preserving large distances
00:23:36.600 | But it's not so good at
00:23:40.000 | maintaining the local structure of your data and
00:23:44.120 | in our case, we kind of want to go and
00:23:47.440 | Maintain to an extent both the local and the global structure of our data as much as possible
00:23:55.280 | So what if we take a look at t-SNE now?
00:24:00.360 | t-SNE is almost the opposite: rather than focusing on the
00:24:05.640 | global structure of the data, it focuses on local structure, because it essentially works by creating a
00:24:12.920 | graph where each
00:24:15.680 | Sort of neighbor node is linked and it tries to maintain those
00:24:19.880 | local
00:24:21.880 | links between neighbors and
00:24:23.880 | this is very good at representing the local structure of our data and
00:24:29.480 | Through the local structure. It can almost indirectly infer to an extent the
00:24:35.120 | global structure, but it's still not that great, and we get this sort of messy
00:24:42.320 | overlapping of some clusters and
00:24:45.400 | generally
00:24:47.400 | with our
00:24:48.720 | dataset at least, the clusters are just not really tied together very well. Now, another issue with t-SNE is
00:24:56.000 | it's very random, so every time you run it,
00:25:00.640 | you can get a wildly different result, which is, again, not very ideal. So what we use instead is UMAP.
00:25:09.240 | Now UMAP kind of gives us the best of both worlds it
00:25:14.280 | performs well with both local and global structure and
00:25:17.680 | we can apply it incredibly easily
00:25:22.880 | using the UMAP library,
00:25:24.880 | which we can just pip install, and
00:25:27.660 | Then in a few lines of code, we can actually apply UMAP to our data
00:25:33.360 | But there's much more to UMAP than just a few lines of code. There are many parameters that we can
00:25:41.320 | Tune in order for UMAP to work with our data
00:25:45.800 | in a particular way or just fit to
00:25:50.160 | just the topology of our particular data set, so
00:25:55.000 | There's a lot to cover
00:25:58.200 | But I think by far the key parameter
00:26:02.480 | that can make the biggest impact is the n_neighbors parameter. Now,
00:26:08.840 | the UMAP algorithm does a lot of things, but one of the key things it does is identify the
00:26:18.200 | density of
00:26:20.000 | particular areas in our data by
00:26:23.880 | searching through, for each point in our dataset, and finding the
00:26:30.580 | k nearest neighbors, and that is what this n_neighbors parameter is setting:
00:26:35.580 | it's saying, okay, how many nearest neighbors are we going to have a look at. And
00:26:39.280 | as you can imagine, if you set the n_neighbors parameter to be very small,
00:26:45.960 | It is only really going to pick up on the local structure of your data
00:26:52.000 | Whereas if you set it to be very large, it's going to focus more on the global structure
00:26:56.240 | so what we can do is try and find something in the middle where we're focusing on both the
00:27:01.920 | global and local structure and
00:27:05.400 | This can work incredibly well. So if we take a look at our
00:27:14.600 | Reduced data set we can see straight away that the continents are in the right places
00:27:21.160 | They've kind of been switched over
00:27:24.040 | so we have
00:27:26.640 | north is in the south and south is in the north, and
00:27:29.440 | in Oceania, we have a few islands that seem to be off; so New Zealand is over here rather than over here,
00:27:37.880 | Where I suppose it should be
00:27:39.880 | But otherwise, it seems to have done relatively well
00:27:44.640 | Now a lot of these so we have Japan over here. We have the Maldives and
00:27:51.160 | Philippines a lot of these kind of bits that go off of the main clusters are
00:27:57.600 | Actually islands which makes sense. They are not as close
00:28:02.200 | To the continents or anything
00:28:05.600 | so it kind of makes sense that they are not within the same cluster and
00:28:10.160 | We can even see
00:28:12.960 | to an extent as the
00:28:15.240 | shape of some countries
00:28:18.760 | So over here as soon as I saw this sort of yellow bit sticking out on the left of Europe
00:28:25.280 | I thought that looks a lot like the Spanish Peninsula and
00:28:29.600 | if we go over there, we actually see that all these cities over here are Spain, and then we also have
00:28:37.400 | Portugal as well,
00:28:39.760 | next to Spain; over here we have France, and
00:28:42.840 | yeah, you can see that the local
00:28:46.320 | structure of countries
00:28:49.120 | does seem to be preserved quite well.
00:28:51.400 | So that seems to work really well already
00:28:58.120 | there was just one other parameter that I added in order to get this sort of graph and
00:29:03.960 | That was the min_dist parameter. Now, what this does is
00:29:10.720 | limit how closely UMAP can put two points together. So by
00:29:17.380 | default this is, I believe, 0.1, and
00:29:21.840 | what I did was increase this up to 0.5, because with 0.1
00:29:26.640 | a lot of these points tend to just be really tightly strung together and you almost get like
00:29:32.500 | strings of countries;
00:29:35.240 | but increasing that kind of pushed the
00:29:40.280 | structure out a little bit, to be less stringy and more like continents.
00:29:45.280 | So that's another really cool parameter that you can
00:29:48.960 | sort of fine-tune and hopefully get some good results with.
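
For the world-cities example above, the reduction amounts to something like this (a sketch; the exact n_neighbors value isn't stated in the video, so treat it as an assumption):

    import umap

    # xyz is assumed to be the (n_cities, 3) array of globe coordinates
    reducer = umap.UMAP(
        n_neighbors=50,   # assumption: a middling value, balancing local and global structure
        n_components=2,   # down to a 2-D "map"
        min_dist=0.5,     # raised from the 0.1 default so points aren't strung too tightly
        random_state=42,
    )
    coords_2d = reducer.fit_transform(xyz)
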
00:29:56.360 | Now, let's take a look at the code that we apply to our reddit topics data. So come down here, and
00:30:00.400 | what I was doing here is taking a look at different nearest neighbors
00:30:09.200 | values and seeing what seems to work well.
00:30:11.960 | And in the end what I settled on is
00:30:15.000 | first
00:30:17.920 | compressing our data into a 3D space rather than 2D, which makes sense; it means we have more dimensions to
00:30:23.480 | maintain more information
00:30:26.120 | I set nearest neighbors, or n_neighbors sorry, to three, and the number of components to three; that's just the 3D aspect
00:30:34.800 | (you'd set it to two if you want 2D). And then I also added in min_dist,
00:30:39.040 | but rather than increasing min_dist from 0.1, I actually decreased it, so the points can be closer together.
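
A sketch of that reduction on the Reddit embeddings; n_neighbors and n_components follow the values mentioned above, while the exact min_dist is an assumption:

    import umap

    fit = umap.UMAP(
        n_neighbors=3,    # as discussed above; a small value leaning towards local structure
        n_components=3,   # reduce to 3-D rather than 2-D
        min_dist=0.05,    # assumption: decreased from the 0.1 default so points can sit closer
        random_state=42,
    )
    umap_embeds = fit.fit_transform(embeds)   # shape: (n_docs, 3)
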
00:30:45.240 | What we got out from that is this and you can see here we have these four clusters and
00:30:53.160 | We have colors over here (no legend, my bad), but essentially what we have is, I think: red is
00:31:04.040 | language technology, purple is investing,
00:31:07.360 | blue is PyTorch, and green is Python.
00:31:11.860 | So we have all of these:
00:31:14.660 | language technology, PyTorch, and Python are kind of together, because they're more similar to each other than investing, which is really far away.
00:31:22.320 | But we can clearly see that
00:31:24.960 | These look like pretty good clusters. There's some overlap and that's to be expected
00:31:31.000 | there are a lot of topics or threads that could be in both PyTorch and Python and
00:31:37.160 | language technology all at once. As I said, there is of course some overlap,
00:31:42.000 | But for the most part we can see some pretty clear clusters there
00:31:46.920 | So that tells us, okay, that's great. Now we can move on to the next step, which is clustering with HDBSCAN. Now,
00:31:56.360 | what we've seen here is that we have like color-coded clusters,
00:32:00.880 | but in reality, we're assuming that we don't actually have the subreddit titles already.
00:32:06.280 | Okay, we want to be trying to extract those clusters without knowing what the clusters already are
00:32:12.100 | So we're gonna do that
00:32:15.800 | What we start with is actually a pip install of HDBSCAN,
00:32:21.720 | super easy, and then we can
00:32:24.200 | import HDBSCAN, and
00:32:26.840 | We fit to our three-dimensional
00:32:29.280 | reddit topics data
00:32:32.160 | And then we can use this
00:32:35.160 | condensed tree plot to
00:32:38.800 | view how our clusters are being built.
00:32:42.020 | But all we see at the moment on there is all of these red lines. Now, these are not actually lines, they're circles
00:32:50.200 | Identifying the clusters that have been extracted
00:32:52.720 | Now, obviously it seems like there's a lot of clusters that have been extracted; we don't want that.
00:33:01.680 | So what we can do is make the criteria
00:33:05.840 | For selecting a cluster more strict. So at the moment the
00:33:11.280 | Number of points needed to create a cluster
00:33:15.240 | I believe by default is something like five which is very small when we have just over
00:33:20.520 | 3,000 vectors that we want to cluster into just four topics
00:33:26.240 | So we
00:33:30.920 | Increase the minimum cluster size and if we go with something like 80 to start with
00:33:36.080 | we get this so now we can actually see the condensed tree plot and
00:33:40.720 | Basically
00:33:43.360 | it's moving through the lambda value here, which is just a parameter
00:33:49.880 | that HDBSCAN is using, and we can imagine this as a line starting from the bottom and going up, and as the line
00:33:57.720 | moves up
00:33:59.760 | we are
00:34:01.760 | identifying the strongest clusters within our data.
00:34:06.600 | Now, at the moment, we're just returning two, and
00:34:11.920 | realistically we want to try and return like four clusters, and to me
00:34:16.680 | it probably seems like we'd want this one on the left,
00:34:20.080 | but then we'd want this one, this one, and this one as our clusters; instead it's pulling in this big green blob,
00:34:27.280 | because it's being viewed algorithmically as a stronger cluster.
00:34:31.280 | So maybe we can try and reduce
00:34:35.040 | our min cluster size; maybe we set it too high, and we can reduce that.
00:34:42.000 | Just reducing it by 20, we then pull in this really long one on the left over here.
00:34:47.280 | So that also doesn't look very good. So clearly the issue is not the min cluster size and
00:34:53.520 | Maybe we need to change something else
00:34:58.120 | So what we can do is we'll keep min cluster size at 80, and
00:35:05.280 | we'll instead reduce the min samples, which by default is set to the same as the min cluster size. Now, min samples is
00:35:12.400 | basically how dense the core of
00:35:16.640 | a cluster needs to be in order to become the core of a cluster.
00:35:20.880 | So we reduce that, and if we reduce that, then we get the four clusters
00:35:26.300 | that, to me, look like the ones we probably want.
00:35:30.480 | So we have these four; I imagine these three over here, which are connected for longer up here, are probably
00:35:37.320 | Python, PyTorch, and language technology, and then over here we probably have investing.
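
That tuning translates to roughly the following sketch; min_samples is an assumption, since the video only says it was reduced below min_cluster_size:

    import hdbscan

    clusterer = hdbscan.HDBSCAN(
        min_cluster_size=80,   # require reasonably large clusters
        min_samples=40,        # assumption: lower than min_cluster_size, relaxing core density
    )
    clusterer.fit(umap_embeds)   # the 3-D UMAP output from earlier

    clusterer.condensed_tree_.plot(select_clusters=True)  # view which clusters get selected
    labels = clusterer.labels_                            # cluster per document, -1 = outlier
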
00:35:43.920 | So let's have a look at what we have
00:35:46.600 | Okay, cool. So you see that it looks like we have similar clusters to what we got before, right,
00:35:56.880 | when we were pulling the colors in from the data. And if we have a look here: yes, green is investing, red is language technology,
00:36:04.460 | purple is Python, and
00:36:07.040 | blue is PyTorch. And then we also have these orange
00:36:10.240 | points. Now, these are the outliers that HDBSCAN doesn't see as belonging in any of these clusters, and
00:36:17.120 | If we have a look at these
00:36:20.040 | Come over to here. We see things like daily
00:36:25.360 | Daily general discussion and advice thread, okay
00:36:28.860 | So to an extent it kind of makes sense that like this isn't really
00:36:34.480 | Necessarily about investing. It's just like the general discussion and then up here
00:36:40.000 | We have some maybe it would be better if these were included within the language technology cluster
00:36:46.540 | And then over here down the bottom, similar to the investing bit where we had the description thread,
00:36:52.680 | we have another description for a thread here, but from Python, so
00:36:56.360 | It's not perfect. There are some outliers that we would like to be in there, but
00:37:02.200 | Quite a few of those outliers
00:37:04.680 | kind of make sense not being in there, as they're description threads or something else.
00:37:10.000 | So it's pretty cool to see that it can actually do that so well.
00:37:17.560 | Otherwise, I think that clustering looks pretty good
00:37:22.600 | So let's move on to the next step, c-TF-IDF.
00:37:27.600 | Now, we'll go through
00:37:30.320 | c-TF-IDF very quickly,
00:37:32.120 | but all this is really doing is looking at the
00:37:36.040 | frequency of words within a particular class, or within a particular topic that we've built, and
00:37:43.040 | saying, okay, are those words very common? So words like 'the' and 'a' would be very common, and
00:37:49.280 | they would be scored lowly
00:37:51.840 | thanks to the IDF part of the function, which is the inverse document frequency; it looks at how common these words are.
00:37:57.880 | But if we have a rare word, something like Python, and it just keeps popping up in one of these topics all the time,
00:38:05.200 | Then there's a good idea that maybe
00:38:08.720 | Python, the word 'Python', is pretty relevant to that particular topic. Now,
00:38:13.120 | I'm not really gonna go through the code on this; just really quickly,
00:38:17.200 | All I'm going to do is show you
00:38:20.200 | We go through here; I'm actually
00:38:23.080 | tokenizing using NLTK. Now, with
00:38:27.200 | BERTopic, we don't really need to worry so much about this; it's going to be handled for us,
00:38:32.800 | so that's not so important. But we can see we create these tokens, which are the words, and it's
00:38:38.800 | these tokens or terms that the c-TF-IDF algorithm is going to be looking at in terms of frequency, identifying
00:38:47.240 | the most relevant ones.
00:38:49.240 | So for example here we have PyTorch,
00:38:51.280 | model, gradients: they're probably pretty relevant to the PyTorch or Python or even natural language
00:38:58.840 | topic
00:39:00.680 | But less so for investing
00:39:02.840 | So come down here
00:39:05.880 | We have... this is the number of tokens within each topic, including our, you know, not-relevant or outlier topic.
00:39:13.080 | And if we
00:39:19.600 | scroll down, these are our TF-IDF scores, basically for every single possible word or token
00:39:28.160 | Within each of our topics. So you see we have five rows here and then within each of those rows
00:39:34.000 | we have a sparsely populated matrix where each
00:39:37.480 | entry in that matrix represents one token or one term, like the word 'the' (although we removed those, so that wouldn't be in there),
00:39:46.000 | so like the word Python or PyTorch or
00:39:50.000 | investing or
00:39:53.040 | Money or something along those lines
00:39:57.000 | Then I'm just going to return the top n most relevant words per class. We come down here and we see:
00:40:03.320 | Okay, cool, so this one is our irrelevant topic and
00:40:09.040 | Then up here, I mean, we can see straight away from these
00:40:13.280 | hopefully what topics they are. We have market, stock: that's probably investing;
00:40:21.480 | up here we have project, Python, code, I'm pretty sure that's Python; loss,
00:40:27.800 | x, model, PyTorch;
00:40:31.160 | pretty confident that's PyTorch. And then here we have NLP and text, data, model; it's got to be language technology.
00:40:40.720 | So, high level:
00:40:42.240 | that was more in detail, but still kind of high level. That's how it works.
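
As a minimal sketch of the class-based TF-IDF idea just described (not BERTopic's exact implementation), assuming the docs and cluster labels from the earlier steps:

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer

    # join all documents belonging to the same cluster into one "class document"
    classes = sorted(set(labels))
    class_docs = [" ".join(d for d, l in zip(docs, labels) if l == c) for c in classes]

    vectorizer = CountVectorizer(stop_words="english")
    tf = vectorizer.fit_transform(class_docs).toarray()   # term counts per class

    avg_words = tf.sum() / len(classes)                   # average number of words per class
    idf = np.log(1 + avg_words / tf.sum(axis=0))          # terms rare across classes score higher
    ctfidf = tf * idf

    # top five terms per class
    terms = vectorizer.get_feature_names_out()
    for c, row in zip(classes, ctfidf):
        print(c, [terms[i] for i in row.argsort()[::-1][:5]])
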
00:40:49.280 | Now, how do we apply what we've just gone through to the BERTopic library?
00:40:54.720 | Fortunately, it's incredibly easy
00:40:57.800 | So again, I'm starting another notebook here. This is number six
00:41:02.160 | Same again: download the data, remove anything that's particularly long. This one will be really short, by the way, so no
00:41:10.960 | problem.
00:41:12.440 | And then what we do is, as before, we initialize our UMAP and HDBSCAN
00:41:18.840 | components, but we're just initializing, not fitting or anything right now. I've initialized these with the
00:41:26.080 | same parameters; I just added the minimum spanning tree here for the HDBSCAN model,
00:41:33.800 | which is basically just going to build like a linked tree through the data, and I found that it improves the performance, so I
00:41:40.640 | added that in there. Another thing that I've added in is this prediction_data=True. Now,
00:41:46.760 | when you're using HDBSCAN with
00:41:49.680 | BERTopic you need to include that,
00:41:52.400 | otherwise you will get
00:41:55.160 | this AttributeError that I mention here, about the prediction data not being generated.
00:42:00.160 | You will get that error if you don't include prediction_data=True for your HDBSCAN component.
00:42:09.720 | Other things that we need are our sentence transformer embedding model, and
00:42:15.320 | in the
00:42:16.920 | TF-IDF or c-TF-IDF step (I didn't really go through it,
00:42:20.680 | but I removed stop words like 'the', 'a', and so on).
00:42:24.320 | So again, as you saw right at the start, I'm going to include the CountVectorizer to remove those stop words.
00:42:31.080 | And I added a few more as well. So I took the NLTK stop words,
00:42:35.280 | yeah, and then I also added these other words that kept popping up all the time and kind of polluting the topics, and
00:42:46.360 | initialized the model. And then we can
00:42:49.360 | initialize BERTopic as we did before, but then we add all of these different components
00:42:55.560 | into the definition.
00:42:59.040 | So it's literally what we've just been through; we do the same thing: we
00:43:05.560 | initialize all those components, but we don't run them in their respective libraries; we actually just pass them to
00:43:13.040 | the BERTopic library, which makes it,
00:43:15.040 | I think a lot nicer to work with because you just throw everything you initialize everything and
00:43:21.240 | then just throw everything into BERTopic, and BERTopic will run through it all for you, which is really cool.
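
Put together, passing those pre-configured components into BERTopic looks roughly like this (a sketch; values are the ones discussed earlier, or assumptions where not stated):

    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer
    from sklearn.feature_extraction.text import CountVectorizer
    import umap
    import hdbscan

    embedding_model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed model name
    umap_model = umap.UMAP(n_neighbors=3, n_components=3, min_dist=0.05, random_state=42)
    hdbscan_model = hdbscan.HDBSCAN(
        min_cluster_size=80,
        min_samples=40,              # assumption
        gen_min_span_tree=True,      # the minimum spanning tree mentioned above
        prediction_data=True,        # required when HDBSCAN is used inside BERTopic
    )
    vectorizer_model = CountVectorizer(stop_words="english")    # plus any extra stop words

    topic_model = BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        vectorizer_model=vectorizer_model,
    )
    topics, probs = topic_model.fit_transform(docs)
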
00:43:26.880 | So as before, we do the fit_transform, which is like a fit and predict; we'll get topics and probabilities from that.
00:43:36.960 | And we come down here and we have our topics: so topic zero looks like investing, number one
00:43:43.480 | PyTorch, two language technology, and three Python.
00:43:48.360 | So that's really cool. It has the topics
00:43:52.440 | And we can also see the hierarchy of those topics as well. So we have like Python, PyTorch, and
00:43:58.000 | Language technology kind of grouped together earlier than the investing topic
00:44:06.800 | I think that is
00:44:08.800 | Incredibly cool how easily we can organize data
00:44:13.240 | using
00:44:15.520 | Meaning rather than using keywords or some sort of complex
00:44:20.200 | rule based method
00:44:23.040 | With BERTopic, we literally just feed in our data;
00:44:26.920 | obviously it helps to have an understanding of UMAP, HDBSCAN, and
00:44:31.840 | the transformer embedding models
00:44:35.600 | But using all of what we've just learned
00:44:38.800 | we're able to
00:44:41.800 | do this, which is very impressive: just automatically
00:44:46.760 | cluster all of our
00:44:49.280 | Data in a meaningful way. Okay, so that's it for this video. I
00:44:55.000 | hope BERTopic and all these other
00:44:58.120 | components of BERTopic are as interesting to you as they are to me.
00:45:04.080 | But I'll leave it there for now. So thank you very much for watching and
00:45:09.840 | I will see you in the next one. Bye