
BERTopic Explained


Chapters

0:00 Intro
1:40 In this video
2:58 BERTopic Getting Started
8:48 BERTopic Components
15:21 Transformer Embedding
18:33 Dimensionality Reduction
25:07 UMAP
31:48 Clustering
37:22 c-TF-IDF
40:49 Custom BERTopic
44:04 Final Thoughts

Whisper Transcript | Transcript Only Page

00:00:00.000 | Today it's estimated that up to 90% of the world's data is unstructured
00:00:06.220 | Meaning that it's built
00:00:08.900 | specifically for human consumption rather than for machines and
00:00:14.260 | that's
00:00:16.380 | Great for us, but it's also kind of difficult because when you're trying to organize all of that data
00:00:22.740 | There's a lot of data
00:00:25.260 | it is
00:00:27.660 | quite simply impossible
00:00:29.940 | To get people to do that. It's too slow and it's too expensive
00:00:34.220 | fortunately
00:00:36.700 | There are more and more techniques that allow us to actually understand
00:00:40.900 | Texts or unstructured texts. We're now able to search based on the meaning of text
00:00:47.860 | identify the sentiment of text
00:00:50.140 | extract named entities, and a lot more.
00:00:54.980 | Transformer models are behind a lot of this. Now, these transformer models are the sort of thing that make
00:01:02.260 | sci-fi from a few years ago look almost obsolete because
00:01:07.580 | being able to
00:01:10.340 | communicate in a human-like way with a computer was relegated to just sci-fi, but now these models
00:01:18.380 | Can actually do a lot of the stuff that was deemed impossible
00:01:22.980 | Not that long ago in machine learning
00:01:26.700 | this act of
00:01:29.500 | organizing or clustering data is
00:01:32.340 | generally referred to as topic modeling, which is the automatic clustering of data into particular topics.
00:01:40.300 | What we're going to do in this video is take a high-level view of the BERTopic
00:01:44.580 | library, how we would use BERTopic's default parameters to perform topic modeling on some data.
00:01:51.520 | Then we'll go into the details, as in what is BERTopic actually doing. So we're going to have a look at
00:02:00.000 | the transformer embedding model, UMAP for dimensionality reduction. We're going to take a look at
00:02:07.440 | HDBSCAN for clustering, and then we're going to look at c-TF-IDF
00:02:14.320 | for basically
00:02:16.320 | Extracting topics from your clusters
00:02:19.040 | each of those are
00:02:22.000 | Really interesting and once you understand how they work you can use that understanding to improve the performance of
00:02:31.880 | BERTopic
00:02:33.560 | topic modeling. Once you've been through all of that, we're going to then
00:02:36.920 | go back to BERTopic and see how our understanding of these different components
00:02:44.440 | allows us to improve our topic modeling with BERTopic. So
00:02:50.160 | We're going to cover a lot
00:02:52.640 | But let's start with that high-level overview of the BERTopic library.
00:02:58.160 | so we're going to start with this reddit topics data set and
00:03:02.600 | In here, we have the subreddit that the data has been pulled from
00:03:09.080 | We have the title of the thread and we have the actual text within this selftext feature.
00:03:14.440 | We have a few different subreddits in here. We have
00:03:18.780 | investing
00:03:20.760 | Python, language technology, and PyTorch, so we have four subreddits, and
00:03:26.600 | what I want to do is
00:03:29.360 | use just the text and
00:03:31.840 | automatically cluster each of these threads into a
00:03:39.960 | cluster that represents that particular subreddit: so investing, Python, PyTorch, and language technology.
00:03:46.520 | So in here, we have a very similar dataset, actually not the same one. We're going to use that one later,
00:03:51.760 | But this one's a little bit shorter and it just includes data from the Python subreddit. We have
00:03:57.320 | 933 rows in there and
00:04:00.160 | just going through this one as a quick example of how we might use the BERTopic library
00:04:08.840 | using the
00:04:10.760 | Basic parameters so we can remove
00:04:14.080 | rows where selftext is NaN or very short, and this is if you're using a DataFrame;
00:04:21.420 | Otherwise we would use this
00:04:25.720 | So you can come up here and you just pip install the datasets library
00:04:30.200 | for that
00:04:33.560 | And then we come down here and we get to BERTopic. Now, for BERTopic, you would install BERTopic, so at the top here.
00:04:44.360 | All we're doing here, so I've added this CountVectorizer,
00:04:47.920 | which is essentially just going to remove stop words in one of the later steps of
00:04:52.640 | BERTopic, and
00:04:55.160 | all you do is
00:04:57.160 | first just convert your data to a list if you need to. Actually, this is for a pandas DataFrame,
00:05:03.480 | so this is what we'd be running with the Hugging Face datasets
00:05:07.200 | object, and
00:05:09.960 | all we do is add
00:05:13.080 | these arguments: we make sure we have language set to English, so we're using an English-language model
00:05:19.240 | for embedding the text, and then we just fit_transform. Okay, so that's like fit and predict
00:05:27.760 | for this text here. Okay, and we can see here, we've just
00:05:31.240 | All we've done here is embedded the text to embeddings
00:05:37.020 | then we
00:05:39.400 | so those embeddings are quite high-dimensional, and
00:05:42.240 | it will depend on the model that we're using, but you're thinking like hundreds of dimensions, and
00:05:51.460 | then we reduce the dimensionality of those embeddings and
00:05:56.720 | cluster those, and then there is the finding of topics
00:06:01.420 | based on the text within each one of those clusters.
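
Pieced together, the default usage described above looks roughly like this (a minimal sketch; the selftext column name follows the dataset described earlier, and the stop-word handling is the CountVectorizer mentioned above):

    from bertopic import BERTopic
    from sklearn.feature_extraction.text import CountVectorizer

    # removes stop words in the later c-TF-IDF step
    vectorizer_model = CountVectorizer(stop_words="english")

    # data is assumed to be a Hugging Face Dataset with a "selftext" text field
    docs = list(data["selftext"])

    topic_model = BERTopic(language="english", vectorizer_model=vectorizer_model)
    topics, probs = topic_model.fit_transform(docs)  # like fit + predict: a topic id (or -1) per document
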
00:06:05.240 | So from that we get those two lists, topics and probs, and if we take a look, so we have,
00:06:13.100 | this is just printing; it's meant to print out five of them, but the size is too big, so it's fine.
00:06:19.040 | I'm not gonna look through all of them. So we have this predicted topic; this is just a cluster,
00:06:24.800 | so we don't know what the actual cluster represents, and
00:06:27.380 | we come down and it was, I think, Python for sure. Yeah, Python.
00:06:34.980 | Now we have another one, a negative one. In BERTopic, a topic of -1
00:06:39.700 | is used where
00:06:42.840 | the data
00:06:45.760 | doesn't seem to fit into any of the clusters that BERTopic has identified. So it's an outlier.
00:06:53.880 | Now in our case we can say okay, this is probably Python
00:06:58.280 | But BERTopic with the default parameters hasn't identified this, but we're going to learn how to improve those default parameters,
00:07:06.460 | so we'll figure that out and
00:07:10.840 | Then what we can do is see the top words found within each topic, so basically the topics,
00:07:20.600 | using this get_topic_info.
00:07:22.600 | You see that we do actually have a lot of topics here. Now, we're trying to break
00:07:28.760 | the Python
00:07:31.120 | subreddit down into
00:07:33.120 | multiple topics here, so I'm not really sure how it would be best to do that.
00:07:39.320 | We can definitely see some of these make sense. So here you have the
00:07:44.480 | based on these I assume: image processing with Python,
00:07:49.520 | learning Python,
00:07:51.520 | web development with Python, and a few other things as well.
00:07:55.960 | I'm curious about this one;
00:07:58.440 | it was like some sort of game, I don't know. But anyway, so we'll go down;
00:08:03.240 | we have 16 of these topics, and then we can also go ahead and visualize those. So
00:08:11.240 | This is quite useful when trying to see how each topic might relate to another or how similar they are
00:08:20.280 | Then if we come here, you can see how each of those clusters actually kind of grouped together in this hierarchy
00:08:27.880 | And we can also visualize the words within each one of those topics as well.
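
Those inspection and visualization steps correspond to BERTopic methods along these lines (a sketch, assuming the topic_model fitted above):

    topic_model.get_topic_info()       # one row per topic, with sizes and top words
    topic_model.get_topic(0)           # top terms and scores for a single topic
    topic_model.visualize_topics()     # intertopic distance map (how topics relate)
    topic_model.visualize_hierarchy()  # how clusters group together hierarchically
    topic_model.visualize_barchart()   # top words within each topic
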
00:08:34.340 | But you know, this is just really simple BERTopic at the moment. We've just used default parameters;
00:08:41.280 | we haven't really done anything.
00:08:43.720 | What I want to do is actually go into the details of BERTopic.
00:08:48.000 | So let's start having a look at each of the components that I mentioned before
00:08:55.140 | within the BERTopic pipeline.
00:08:58.600 | So as I said, BERTopic
00:09:01.800 | consists of a few different components
00:09:10.560 | So we start off with our text. Now, our text can come in a lot of different formats,
00:09:18.380 | but typically we're going to go with something like paragraph- or sentence-size chunks of text, and
00:09:23.880 | we typically call those paragraph- or sentence-size chunks of text documents.
00:09:30.960 | So our documents come down here and are passed into
00:09:39.320 | typically a sentence
00:09:40.960 | transformer model, so we'll just draw that as like a box here, and
00:09:46.600 | that sentence transformer is
00:09:49.680 | going to output, for each document, a single vector. Now this
00:09:55.220 | vector, or this embedding, is very high-dimensional. So I'm gonna
00:10:01.480 | write it as a list; it's just gonna be really,
00:10:05.120 | really big, and we're going to have loads of them, as we have loads of documents.
00:10:09.100 | but within that
00:10:12.160 | vector we essentially have a
00:10:15.480 | numerical
00:10:17.440 | representation of the meaning behind its respective document
00:10:21.240 | So it's like we're translating
00:10:23.520 | human meaning into
00:10:25.840 | Machine meaning for each of these documents. That's pretty much how we can see it, which is already I think
00:10:34.800 | Really cool. So once we have those vectors
00:10:37.880 | The problem with these is that they're very high-dimensional. So for sentence transformers, a typical dimension size is
00:10:44.820 | 768 dimensions; you have some sort of half-size or mini models
00:10:51.840 | which will output something like 384
00:10:54.800 | dimensions. Now, you have some models that output many more, like OpenAI's
00:11:00.840 | embedding models, which I think can go up to 200... no, no, they can go way higher;
00:11:06.040 | they can go to, I think, just over 10k. So
00:11:12.680 | pretty huge now
00:11:14.680 | What we need to do is compress the information within those vectors into a smaller space
00:11:20.600 | So for that we use something called UMAP.
00:11:23.200 | Now UMAP is...
00:11:29.960 | well, there's a lot to talk about when it comes to UMAP, and we're only going to cover it at a very high level in this video,
00:11:37.120 | but even then we're going to leave it until later to really
00:11:41.680 | dig into it to an extent.
00:11:44.920 | But what that will do is allow us to compress those very high dimensional vectors into a smaller vector space
00:11:53.120 | so what we would typically do is go from
00:11:57.160 | 768 or any of those to something like 3d or 2d
00:12:02.500 | okay, so we can have something like 2 or 3d vectors now and
00:12:08.320 | what's really useful is that we can actually visualize that, so we can have a look at our output from UMAP and
00:12:14.440 | Try and figure out. Okay, is this producing?
00:12:17.720 | data that we can cluster or not because it's very
00:12:23.960 | Not always very easy, but it's easier to see or easier to assess whether there are
00:12:30.520 | Clusters in the data or some sort of structure in our data
00:12:33.800 | When we can actually see it
00:12:36.400 | So that's really useful and then after that we're going to pass those
00:12:42.100 | into another step, so
00:12:44.880 | HDBSCAN.
00:12:49.360 | HDBSCAN, again like UMAP, there's a lot to talk about; we're going to cover it at a high level,
00:12:54.120 | but what that will do is actually
00:12:56.360 | identify clusters in our data, and it's,
00:13:00.040 | from what I have seen so far, very good at it.
00:13:03.840 | And again, it's super interesting; both UMAP and HDBSCAN are, in my opinion, very interesting techniques.
00:13:12.240 | and then after that
00:13:13.960 | move on to
00:13:15.840 | another
00:13:17.400 | technique or component, which is called c-TF-
00:13:23.440 | IDF. Now, many of you
00:13:27.000 | may recognize the TF-IDF part; the c part is a modification of
00:13:33.840 | TF-IDF. TF-IDF is almost like a
00:13:37.600 | way of identifying
00:13:40.720 | relevant
00:13:42.960 | documents (remember, documents are those chunks of text) based on a particular query. What we do here is
00:13:49.280 | identify
00:13:51.640 | particular words
00:13:53.360 | that are relevant to a particular set of
00:13:56.880 | documents e.g. our clusters or our topics and then it outputs those so what we do is we'd end up with like
00:14:05.400 | four clusters for example, which is what we're going to be
00:14:10.200 | Using so we'd end up with these four clusters and we'd be able to see okay. This one over here is talking about
00:14:17.960 | money
00:14:20.760 | stocks, and so on.
00:14:23.320 | So that would be its topic and that would hopefully be the investing subreddit and then over here
00:14:30.560 | We would have for example Python and we would see things like well
00:14:33.800 | Python
00:14:35.680 | programming, code, and so on, as those
00:14:39.360 | terms match up to that topic the best.
00:14:44.800 | At a very high level, that is BERTopic,
00:14:47.720 | but of course
00:14:50.120 | there's not much we can do in terms of improving our code
00:14:54.280 | if this is all we know about BERTopic. We really want to see it in practice,
00:14:58.880 | so what we're going to do is actually go through each of these steps one by one and
00:15:03.160 | See what we can do to
00:15:06.840 | improve that component for our particular data set and
00:15:11.800 | At least have a few ideas on how we might go about
00:15:16.040 | Tuning each of these components for other data sets as well
00:15:20.280 | Okay, so we'll start with the transformer embedding step. So
00:15:26.040 | What we're going to do first is
00:15:28.840 | actually get our dataset. So this time we're using the full dataset, the reddit topics dataset.
00:15:35.080 | I'm specifying this revision because in the future I'll probably add more
00:15:38.960 | data to the dataset;
00:15:41.400 | adding this in will make sure that you are using the current version, e.g. the version I use, of
00:15:48.680 | that dataset rather than any new versions.
00:15:51.720 | So now we have just under 4,000 rows in there. Again, I'm just going to remove any short
00:15:59.740 | items in there, and that leaves us with just over 3,000.
00:16:04.840 | Then I shuffle the data,
00:16:06.560 | although you don't really need to;
00:16:08.560 | good practice anyway.
00:16:10.880 | And then yeah, so if you find this is taking a long time to run on your computer
00:16:17.600 | Just reduce this number
00:16:20.200 | Okay, and if you do use this
00:16:22.760 | then definitely shuffle beforehand, but otherwise you probably don't need to. So if it's taking a long time to run,
00:16:29.280 | Just reduce this to like 1,000
00:16:31.280 | just so you're
00:16:33.960 | Working through this with less data
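
As a rough sketch, that loading and filtering step looks something like the following; the dataset ID, revision string, and length cutoff here are placeholders rather than the exact values used in the video:

    from datasets import load_dataset

    # placeholders for the actual dataset ID and pinned revision used in the video
    data = load_dataset("<user>/reddit-topics", split="train", revision="<pinned-revision>")

    # drop rows where selftext is missing or very short, then shuffle
    data = data.filter(lambda x: x["selftext"] is not None and len(x["selftext"]) > 30)
    data = data.shuffle(seed=42)
    docs = list(data["selftext"])
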
00:16:36.000 | Again also you can
00:16:39.800 | find this
00:16:42.120 | Notebook and all the other notebooks. I'm going to work through in the description below
00:16:48.200 | So the first step as I said is going to be that embedding
00:16:53.180 | step. So
00:16:55.680 | what we're doing here is actually pulling in one of the mini sentence transformer models
00:17:00.000 | I mentioned. We can see here that it creates word embeddings or sentence embeddings of
00:17:05.980 | 384 dimensions, which is nice; rather than using the larger models, it just means it's going to run quicker,
00:17:13.080 | particularly as I'm not using any
00:17:15.880 | special machines; I'm on my laptop right now.
00:17:21.080 | So we come down and we can
00:17:29.480 | create the embeddings now as
00:17:31.480 | Mentioned we probably don't want to do all this at once unless you have some sort of supercomputer
00:17:37.820 | So I'm running through those in batches of 16, so I'm taking 16 of those documents. I mentioned
00:17:49.920 | putting those into our sentence transformer model and
00:17:49.920 | outputting 16
00:17:52.680 | of these sentence embeddings and
00:17:56.440 | Then adding all those to this embeds
00:17:58.800 | Array here
00:18:01.120 | You can see that here: okay, so we encode the batch, which we pull from there, just 16 at a time,
00:18:08.200 | we get our
00:18:10.460 | embeddings for that batch, and then we just add them to the embeds array.
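
In code, that batched embedding loop is roughly as follows (a sketch; the exact MiniLM model name is an assumption):

    from sentence_transformers import SentenceTransformer
    import numpy as np

    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed 384-dim MiniLM model

    embeds = []
    batch_size = 16
    for i in range(0, len(docs), batch_size):
        batch = docs[i:i + batch_size]       # 16 documents at a time
        embeds.extend(model.encode(batch))   # 16 sentence embeddings, 384 dims each
    embeds = np.array(embeds)                # shape: (n_docs, 384)
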
00:18:14.600 | That's really all there is to the sentence transformer or sentence embedding step, but I'm not going to go into detail on that because there are
00:18:23.600 | Many articles and videos on that
00:18:26.880 | that we have covered already, so
00:18:30.320 | we're gonna keep that very light and
00:18:33.480 | we move on to UMAP, which we haven't covered before, so we'll go into a little more detail on that.
00:18:39.360 | Now the UMAP
00:18:42.280 | dimensionality reduction step is
00:18:45.320 | very important because
00:18:48.520 | at the moment we have these 384 dimensional vectors and
00:18:53.560 | That's a lot of dimensions and
00:18:57.360 | We don't really need
00:19:00.340 | all of those dimensions to
00:19:03.200 | Fit the information that we need
00:19:05.840 | for just clustering our data, or clustering the meaning of those documents.
00:19:13.340 | Instead we can actually try and compress as much meaning as possible
00:19:18.480 | into a smaller vector space, and with UMAP we're going to try and do that, down to two or three dimensions.
00:19:24.560 | Now, another reason we use this is, one, we can visualize our clusters,
00:19:30.800 | which is very helpful if we're trying to automatically cluster things, and
00:19:36.520 | two, it makes the next step of clustering much more efficient.
00:19:42.500 | So there are several reasons that we do this
00:19:46.520 | That's a few of them
00:19:48.360 | So to better understand what you map is actually doing and why it might be better than other methods
00:19:55.760 | Which we'll talk about
00:19:57.040 | it's probably best that we
00:19:59.040 | Start with some data that we can visualize. So the word embeddings or sentence embeddings that we just created
00:20:05.160 | are 384-dimensional; we can't visualize them at the moment.
00:20:09.080 | So what I'm going to do is actually switch across to another data set and start with three dimensions
00:20:14.440 | and what we'll do is reduce that to two dimensions and
00:20:16.800 | we'll see
00:20:19.440 | how that works or we'll see the
00:20:22.360 | behavior of those reduction techniques, not just UMAP but the other ones as well. So
00:20:29.520 | For that I'm going to use this world cities geo
00:20:33.200 | dataset. In there, what we have is cities, their countries, regions, and continents.
00:20:42.400 | Now, what we want to focus on is the continent. We also have these other features, like latitude and longitude, and
00:20:49.560 | using the latitude and longitude, what I've done is create these X, Y, and Z coordinates. So we now have a coordinate system
00:20:57.080 | In order to place each of these cities on a
00:21:02.480 | Hypothetical globe and
00:21:06.000 | We have just over 9,000 of those so let's have a look at those
00:21:12.880 | Okay, so I'm just plotting these with plot league and you can see to go down I'm not going to go through it and
00:21:19.160 | What we can see is
00:21:22.800 | What looks pretty much like the outline of the world?
00:21:26.520 | Okay, so it's not perfect like for some reason the United States of America just isn't in this data set
00:21:35.280 | So we have the North American continent and it only contains countries in Central America
00:21:41.840 | And then includes Canada right at the top there. So
00:21:45.280 | There's some missing chunks and I believe
00:21:49.840 | Russia is also missing
00:21:53.880 | obviously again quite a big place and
00:21:58.520 | I'm sure there are many other places missing as well. But
00:22:03.360 | All that to say, it's not a perfect dataset,
00:22:06.920 | But we have enough in there to kind of see what we can do with reducing this
00:22:11.680 | 3d space into 2d space now
00:22:15.360 | Essentially what we're doing is we're trying to recreate a world map now. It's definitely
00:22:21.040 | not going to
00:22:24.240 | replicate a world map, not one that any sensible person would use, but
00:22:30.560 | we should be able to see some features of
00:22:33.400 | the different continents the countries and so on so what we're going to do with this world map is
00:22:40.720 | reduce it into two-dimensional space using not just UMAP, but also two other very popular
00:22:47.560 | dimensionality reduction techniques: PCA, or principal component analysis, and
00:22:52.640 | t-SNE. Now, we can see when we try and do this with
00:22:58.800 | PCA, we get kind of like a circular,
00:23:02.280 | globe-like thing, and
00:23:04.280 | we get these clusters that kind of overlap, which is not ideal. I don't think
00:23:10.760 | clustering over them is going to perform particularly well here,
00:23:17.520 | but what we do have is that the
00:23:21.880 | distances between
00:23:23.320 | those clusters, or those continents, have been preserved relatively well, at least the ones that were further apart from each other, and
00:23:31.920 | That is because PCA is very good at preserving large distances
00:23:36.600 | But it's not so good at
00:23:40.000 | maintaining the local structure of your data and
00:23:44.120 | in our case, we kind of want to go and
00:23:47.440 | Maintain to an extent both the local and the global structure of our data as much as possible
00:23:55.280 | So what if we take a look at t-SNE now?
00:24:00.360 | t-SNE is almost the opposite: rather than focusing on the
00:24:05.640 | global structure of the data, it focuses on local structure, because it essentially works by creating a
00:24:12.920 | graph where each
00:24:15.680 | Sort of neighbor node is linked and it tries to maintain those
00:24:19.880 | local
00:24:21.880 | links between neighbors and
00:24:23.880 | this is very good at representing the local structure of our data and
00:24:29.480 | Through the local structure. It can almost indirectly infer to an extent the
00:24:35.120 | global structure, but it's still not that great, and we get this sort of messy
00:24:42.320 | overlapping of some clusters and
00:24:45.400 | generally
00:24:47.400 | with our
00:24:48.720 | dataset at least, the clusters are just not really tied together very well. Now, another issue with t-SNE is
00:24:56.000 | it's very random, so every time you run it,
00:25:00.640 | you can get a wildly different result, which is, again, not very ideal. So what we use instead is UMAP.
00:25:09.240 | Now UMAP kind of gives us the best of both worlds it
00:25:14.280 | performs well with both local and global structure and
00:25:17.680 | we can apply it incredibly easily
00:25:22.880 | using the UMAP library,
00:25:24.880 | which we can just pip install, and
00:25:27.660 | Then in a few lines of code, we can actually apply UMAP to our data
00:25:33.360 | But there's much more to UMAP than just a few lines of code. There are many parameters that we can
00:25:41.320 | Tune in order for UMAP to work with our data
00:25:45.800 | in a particular way or just fit to
00:25:50.160 | just the topology of our particular data set, so
00:25:55.000 | There's a lot to cover
00:25:58.200 | But I think by far the key parameter
00:26:02.480 | that can make the biggest impact is the n_neighbors parameter. Now,
00:26:08.840 | the UMAP algorithm does a lot of things, but one of the key things it does is identify the
00:26:18.200 | density of
00:26:20.000 | particular areas in our data by
00:26:23.880 | searching through, for each point in our dataset, and finding the
00:26:30.580 | k nearest neighbors, and that is what this n_neighbors parameter is setting:
00:26:35.580 | it's saying, okay, how many nearest neighbors are we going to have a look at. And
00:26:39.280 | as you can imagine, if you set the n_neighbors parameter to be very small,
00:26:45.960 | It is only really going to pick up on the local structure of your data
00:26:52.000 | Whereas if you set it to be very large, it's going to focus more on the global structure
00:26:56.240 | so what we can do is try and find something in the middle where we're focusing on both the
00:27:01.920 | global and local structure and
00:27:05.400 | This can work incredibly well. So if we take a look at our
00:27:14.600 | Reduced data set we can see straight away that the continents are in the right places
00:27:21.160 | They've kind of been switched over
00:27:24.040 | so we have
00:27:26.640 | north is in the south and south is in the north, and
00:27:29.440 | in Oceania, we have a few islands that seem to be off; so New Zealand is over here rather than over here,
00:27:37.880 | Where I suppose it should be
00:27:39.880 | But otherwise, it seems to have done relatively well
00:27:44.640 | Now a lot of these so we have Japan over here. We have the Maldives and
00:27:51.160 | Philippines a lot of these kind of bits that go off of the main clusters are
00:27:57.600 | Actually islands which makes sense. They are not as close
00:28:02.200 | To the continents or anything
00:28:05.600 | so it kind of makes sense that they are not within the same cluster and
00:28:10.160 | We can even see
00:28:12.960 | to an extent as the
00:28:15.240 | shape of some countries
00:28:18.760 | So over here as soon as I saw this sort of yellow bit sticking out on the left of Europe
00:28:25.280 | I thought that looks a lot like the Spanish Peninsula and
00:28:29.600 | if we go over there, we actually see that all these cities over here are Spain, and then we also have
00:28:37.400 | Portugal as well,
00:28:39.760 | next to Spain; over here we have France, and
00:28:42.840 | yeah, you can see that the local
00:28:46.320 | structure of countries
00:28:49.120 | does seem to be preserved quite well.
00:28:51.400 | So that seems to work really well already
00:28:58.120 | there was just one other parameter that I added in order to get this sort of graph and
00:29:03.960 | That was the min_dist parameter. Now, what this does is
00:29:10.720 | limit how closely UMAP can put two points together. So by
00:29:17.380 | default this is, I believe, 0.1, and
00:29:21.840 | what I did was increase this up to 0.5, because with 0.1
00:29:26.640 | a lot of these points tend to just be really tightly strung together and you almost get like
00:29:32.500 | strings of countries;
00:29:35.240 | but increasing that kind of pushed the
00:29:40.280 | structure out a little bit, to be less stringy and more like continents.
00:29:45.280 | So that's another really cool parameter that you can
00:29:48.960 | sort of fine-tune and hopefully get some good results with.
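
For the world-cities example above, the reduction amounts to something like this (a sketch; the exact n_neighbors value isn't stated in the video, so treat it as an assumption):

    import umap

    # xyz is assumed to be the (n_cities, 3) array of globe coordinates
    reducer = umap.UMAP(
        n_neighbors=50,   # assumption: a middling value, balancing local and global structure
        n_components=2,   # down to a 2-D "map"
        min_dist=0.5,     # raised from the 0.1 default so points aren't strung too tightly
        random_state=42,
    )
    coords_2d = reducer.fit_transform(xyz)
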
00:29:56.360 | Now, let's take a look at the code that we apply to our reddit topics data. So come down here, and
00:30:00.400 | what I was doing here is taking a look at different nearest neighbors
00:30:09.200 | values and seeing what seems to work well.
00:30:11.960 | And in the end what I settled on is
00:30:15.000 | first
00:30:17.920 | compressing our data into a 3D space rather than 2D, which makes sense; it means we have more dimensions to
00:30:23.480 | maintain more information
00:30:26.120 | I set nearest neighbors, or n_neighbors sorry, to three, and the number of components to three; that's just the 3D aspect
00:30:34.800 | (you'd set it to two if you want 2D). And then I also added in min_dist,
00:30:39.040 | but rather than increasing min_dist from 0.1, I actually decreased it, so the points can be closer together.
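
A sketch of that reduction on the Reddit embeddings; n_neighbors and n_components follow the values mentioned above, while the exact min_dist is an assumption:

    import umap

    fit = umap.UMAP(
        n_neighbors=3,    # as discussed above; a small value leaning towards local structure
        n_components=3,   # reduce to 3-D rather than 2-D
        min_dist=0.05,    # assumption: decreased from the 0.1 default so points can sit closer
        random_state=42,
    )
    umap_embeds = fit.fit_transform(embeds)   # shape: (n_docs, 3)
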
00:30:45.240 | What we got out from that is this and you can see here we have these four clusters and
00:30:53.160 | We have colors over here (no legend, my bad), but essentially what we have is, I think: red is
00:31:04.040 | language technology, purple is investing,
00:31:07.360 | blue is PyTorch, and green is Python.
00:31:11.860 | So we have all of these:
00:31:14.660 | language technology, PyTorch, and Python are kind of together, because they're more similar to each other than investing, which is really far away.
00:31:22.320 | But we can clearly see that
00:31:24.960 | These look like pretty good clusters. There's some overlap and that's to be expected
00:31:31.000 | there are a lot of topics or threads that could be in both PyTorch and Python and
00:31:37.160 | language technology all at once. As I said, there is of course some overlap,
00:31:42.000 | But for the most part we can see some pretty clear clusters there
00:31:46.920 | So that tells us, okay, that's great. Now we can move on to the next step, which is clustering with HDBSCAN. Now,
00:31:56.360 | what we've seen here is that we have like color-coded clusters,
00:32:00.880 | but in reality, we're assuming that we don't actually have the subreddit titles already.
00:32:06.280 | Okay, we want to be trying to extract those clusters without knowing what the clusters already are
00:32:12.100 | So we're gonna do that
00:32:15.800 | What we start with is actually a pip install of HDBSCAN,
00:32:21.720 | super easy, and then we can
00:32:24.200 | import HDBSCAN, and
00:32:26.840 | We fit to our three-dimensional
00:32:29.280 | reddit topics data
00:32:32.160 | And then we can use this
00:32:35.160 | condensed tree plot to
00:32:38.800 | view how our clusters are being built.
00:32:42.020 | But all we see at the moment on there is all of these red lines. Now, these are not actually lines, they're circles
00:32:50.200 | Identifying the clusters that have been extracted
00:32:52.720 | Now, obviously it seems like there's a lot of clusters that have been extracted; we don't want that.
00:33:01.680 | So what we can do is make the criteria
00:33:05.840 | For selecting a cluster more strict. So at the moment the
00:33:11.280 | Number of points needed to create a cluster
00:33:15.240 | I believe by default is something like five which is very small when we have just over
00:33:20.520 | 3,000 vectors that we want to cluster into just four topics
00:33:26.240 | So we
00:33:30.920 | Increase the minimum cluster size and if we go with something like 80 to start with
00:33:36.080 | we get this so now we can actually see the condensed tree plot and
00:33:40.720 | Basically
00:33:43.360 | it's moving through the lambda value here, which is just a parameter
00:33:49.880 | that HDBSCAN is using, and we can imagine this as a line starting from the bottom and going up, and as the line
00:33:57.720 | moves up
00:33:59.760 | we are
00:34:01.760 | identifying the strongest clusters within our data.
00:34:06.600 | Now, at the moment, we're just returning two, and
00:34:11.920 | realistically we want to try and return like four clusters, and to me
00:34:16.680 | it probably seems like we'd want this one on the left,
00:34:20.080 | but then we'd want this one, this one, and this one as our clusters; instead it's pulling in this big green blob,
00:34:27.280 | because it's being viewed algorithmically as a stronger cluster.
00:34:31.280 | So maybe we can try and reduce
00:34:35.040 | our min cluster size; maybe we set it too high, and we can reduce that.
00:34:42.000 | Just reducing it by 20, we then pull in this really long one on the left over here.
00:34:47.280 | So that also doesn't look very good. So clearly the issue is not the min cluster size and
00:34:53.520 | Maybe we need to change something else
00:34:58.120 | So what we can do is we'll keep min cluster size at 80, and
00:35:05.280 | we'll instead reduce the min samples, which by default is set to the same as the min cluster size. Now, min samples is
00:35:12.400 | basically how dense the core of
00:35:16.640 | a cluster needs to be in order to become the core of a cluster.
00:35:20.880 | So we reduce that, and if we reduce that, then we get the four clusters
00:35:26.300 | that, to me, look like the ones we probably want.
00:35:30.480 | So we have these four; I imagine these three over here, which are connected for longer up here, are probably
00:35:37.320 | Python, PyTorch, and language technology, and then over here we probably have investing.
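
That tuning translates to roughly the following sketch; min_samples is an assumption, since the video only says it was reduced below min_cluster_size:

    import hdbscan

    clusterer = hdbscan.HDBSCAN(
        min_cluster_size=80,   # require reasonably large clusters
        min_samples=40,        # assumption: lower than min_cluster_size, relaxing core density
    )
    clusterer.fit(umap_embeds)   # the 3-D UMAP output from earlier

    clusterer.condensed_tree_.plot(select_clusters=True)  # view which clusters get selected
    labels = clusterer.labels_                            # cluster per document, -1 = outlier
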
00:35:43.920 | So let's have a look at what we have
00:35:46.600 | Okay, cool. So you see that it looks like we have similar clusters to what we got before, right,
00:35:56.880 | when we were pulling the colors in from the data. And if we have a look here: yes, green is investing, red is language technology,
00:36:04.460 | purple is Python, and
00:36:07.040 | blue is PyTorch. And then we also have these orange
00:36:10.240 | points. Now, these are the outliers that HDBSCAN doesn't see as belonging in any of these clusters, and
00:36:17.120 | If we have a look at these
00:36:20.040 | Come over to here. We see things like daily
00:36:25.360 | Daily general discussion and advice thread, okay
00:36:28.860 | So to an extent it kind of makes sense that like this isn't really
00:36:34.480 | Necessarily about investing. It's just like the general discussion and then up here
00:36:40.000 | We have some maybe it would be better if these were included within the language technology cluster
00:36:46.540 | And then over here down the bottom, similar to the investing bit where we had the description thread,
00:36:52.680 | we have another description for a thread here, but from Python, so
00:36:56.360 | It's not perfect. There are some outliers that we would like to be in there, but
00:37:02.200 | Quite a few of those outliers
00:37:04.680 | kind of make sense not being in there, as they're description threads or something else.
00:37:10.000 | So it's pretty cool to see that it can actually do that so well.
00:37:17.560 | Otherwise, I think that clustering looks pretty good
00:37:22.600 | So let's move on to the next step, c-TF-IDF.
00:37:27.600 | Now, we'll go through
00:37:30.320 | c-TF-IDF very quickly,
00:37:32.120 | but all this is really doing is looking at the
00:37:36.040 | frequency of words within a particular class, or within a particular topic that we've built, and
00:37:43.040 | saying, okay, are those words very common? So words like 'the' and 'a' would be very common, and
00:37:49.280 | they would be scored lowly
00:37:51.840 | thanks to the IDF part of the function, which is the inverse document frequency; it looks at how common these words are.
00:37:57.880 | But if we have a rare word, something like Python, and it just keeps popping up in one of these topics all the time,
00:38:05.200 | Then there's a good idea that maybe
00:38:08.720 | Python, the word 'Python', is pretty relevant to that particular topic. Now,
00:38:13.120 | I'm not really gonna go through the code on this; just really quickly,
00:38:17.200 | All I'm going to do is show you
00:38:20.200 | We go through here; I'm actually
00:38:23.080 | tokenizing using NLTK. Now, with
00:38:27.200 | BERTopic, we don't really need to worry so much about this; it's going to be handled for us,
00:38:32.800 | so that's not so important. But we can see we create these tokens, which are the words, and it's
00:38:38.800 | these tokens or terms that the c-TF-IDF algorithm is going to be looking at in terms of frequency, identifying
00:38:47.240 | the most relevant ones.
00:38:49.240 | So for example here we have PyTorch,
00:38:51.280 | model, gradients: they're probably pretty relevant to the PyTorch or Python or even natural language
00:38:58.840 | topic
00:39:00.680 | But less so for investing
00:39:02.840 | So come down here
00:39:05.880 | We have... this is the number of tokens within each topic, including our, you know, not-relevant or outlier topic.
00:39:13.080 | And if we
00:39:19.600 | scroll down, these are our TF-IDF scores, basically for every single possible word or token
00:39:28.160 | Within each of our topics. So you see we have five rows here and then within each of those rows
00:39:34.000 | we have a sparsely populated matrix where each
00:39:37.480 | entry in that matrix represents one token or one term, like the word 'the' (although we removed those, so that wouldn't be in there),
00:39:46.000 | so like the word Python or PyTorch or
00:39:50.000 | investing or
00:39:53.040 | Money or something along those lines
00:39:57.000 | Then I'm just going to return the top n most relevant words per class. We come down here and we see:
00:40:03.320 | Okay, cool, so this one is our irrelevant topic and
00:40:09.040 | Then up here, I mean, we can see straight away from these
00:40:13.280 | hopefully what topics they are. We have market, stock: that's probably investing;
00:40:21.480 | up here we have project, Python, code, I'm pretty sure that's Python; loss,
00:40:27.800 | x, model, PyTorch;
00:40:31.160 | pretty confident that's PyTorch. And then here we have NLP and text, data, model; it's got to be language technology.
00:40:40.720 | So, high level:
00:40:42.240 | that was more in detail, but still kind of high level. That's how it works.
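
As a minimal sketch of the class-based TF-IDF idea just described (not BERTopic's exact implementation), assuming the docs and cluster labels from the earlier steps:

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer

    # join all documents belonging to the same cluster into one "class document"
    classes = sorted(set(labels))
    class_docs = [" ".join(d for d, l in zip(docs, labels) if l == c) for c in classes]

    vectorizer = CountVectorizer(stop_words="english")
    tf = vectorizer.fit_transform(class_docs).toarray()   # term counts per class

    avg_words = tf.sum() / len(classes)                   # average number of words per class
    idf = np.log(1 + avg_words / tf.sum(axis=0))          # terms rare across classes score higher
    ctfidf = tf * idf

    # top five terms per class
    terms = vectorizer.get_feature_names_out()
    for c, row in zip(classes, ctfidf):
        print(c, [terms[i] for i in row.argsort()[::-1][:5]])
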
00:40:49.280 | Now, how do we apply what we've just gone through to the BERTopic library?
00:40:54.720 | Fortunately, it's incredibly easy
00:40:57.800 | So again, I'm starting another notebook here. This is number six
00:41:02.160 | Same again: download the data, remove anything that's particularly long. This one will be really short, by the way, so no
00:41:10.960 | problem.
00:41:12.440 | And then what we do is, as before, we initialize our UMAP and HDBSCAN
00:41:18.840 | components, but we're just initializing, not fitting or anything right now. I've initialized these with the
00:41:26.080 | same parameters; I just added the minimum spanning tree here for the HDBSCAN model,
00:41:33.800 | which is basically just going to build like a linked tree through the data, and I found that it improves the performance, so I
00:41:40.640 | added that in there. Another thing that I've added in is this prediction_data=True. Now,
00:41:46.760 | when you're using HDBSCAN with
00:41:49.680 | BERTopic you need to include that,
00:41:52.400 | otherwise you will get
00:41:55.160 | this AttributeError that I mention here, about the prediction data not being generated.
00:42:00.160 | You will get that error if you don't include prediction_data=True for your HDBSCAN component.
00:42:09.720 | Other things that we need are our sentence transformer embedding model, and
00:42:15.320 | in the
00:42:16.920 | TF-IDF or c-TF-IDF step (I didn't really go through it,
00:42:20.680 | but I removed stop words like 'the', 'a', and so on).
00:42:24.320 | So again, as you saw right at the start, I'm going to include the CountVectorizer to remove those stop words.
00:42:31.080 | And I added a few more as well. So I took the NLTK stop words,
00:42:35.280 | yeah, and then I also added these other words that kept popping up all the time and kind of polluting the topics, and
00:42:46.360 | initialized the model. And then we can
00:42:49.360 | initialize BERTopic as we did before, but then we add all of these different components
00:42:55.560 | into the definition.
00:42:59.040 | So it's literally what we've just been through; we do the same thing: we
00:43:05.560 | initialize all those components, but we don't run them in their respective libraries; we actually just pass them to
00:43:13.040 | the BERTopic library, which makes it,
00:43:15.040 | I think a lot nicer to work with because you just throw everything you initialize everything and
00:43:21.240 | then just throw everything into BERTopic, and BERTopic will run through it all for you, which is really cool.
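
Put together, passing those pre-configured components into BERTopic looks roughly like this (a sketch; values are the ones discussed earlier, or assumptions where not stated):

    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer
    from sklearn.feature_extraction.text import CountVectorizer
    import umap
    import hdbscan

    embedding_model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed model name
    umap_model = umap.UMAP(n_neighbors=3, n_components=3, min_dist=0.05, random_state=42)
    hdbscan_model = hdbscan.HDBSCAN(
        min_cluster_size=80,
        min_samples=40,              # assumption
        gen_min_span_tree=True,      # the minimum spanning tree mentioned above
        prediction_data=True,        # required when HDBSCAN is used inside BERTopic
    )
    vectorizer_model = CountVectorizer(stop_words="english")    # plus any extra stop words

    topic_model = BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        vectorizer_model=vectorizer_model,
    )
    topics, probs = topic_model.fit_transform(docs)
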
00:43:26.880 | So as before, we do the fit_transform, which is like a fit and predict; we'll get topics and probabilities from that.
00:43:36.960 | And we come down here and we have our topics: so topic zero looks like investing, number one
00:43:43.480 | PyTorch, two language technology, and three Python.
00:43:48.360 | So that's really cool. It has the topics
00:43:52.440 | And we can also see the hierarchy of those topics as well. So we have like Python, PyTorch, and
00:43:58.000 | Language technology kind of grouped together earlier than the investing topic
00:44:06.800 | I think that is
00:44:08.800 | Incredibly cool how easily we can organize data
00:44:13.240 | using
00:44:15.520 | Meaning rather than using keywords or some sort of complex
00:44:20.200 | rule based method
00:44:23.040 | With BERTopic, we literally just feed in our data;
00:44:26.920 | obviously it helps to have an understanding of UMAP, HDBSCAN, and
00:44:31.840 | the transformer embedding models
00:44:35.600 | But using all of what we've just learned
00:44:38.800 | we're able to
00:44:41.800 | do this, which is very impressive: just automatically
00:44:46.760 | cluster all of our
00:44:49.280 | Data in a meaningful way. Okay, so that's it for this video. I
00:44:55.000 | hope BERTopic and all these other
00:44:58.120 | components of BERTopic are as interesting to you as they are to me.
00:45:04.080 | But I'll leave it there for now. So thank you very much for watching and
00:45:09.840 | I will see you in the next one. Bye