
Advanced Sentiment Analysis with NLP Transformers + Vector Search


Chapters

0:00 Intro
0:31 What we will build
3:01 Code links and prerequisites
4:16 Dataset download and preprocessing
5:49 Using RoBERTa sentiment analysis model
8:15 Retriever model for building dense vectors
9:39 Create Pinecone vector index
11:40 Sentiment scores, vectors, and indexing
17:35 Sentiment analysis / opinion mining
20:43 Sentiment analysis with specific date range
21:44 Sentiment analysis on specific info
23:58 Final notes

Whisper Transcript

00:00:00.000 | Today we're going to learn how to do sentiment mining using vector search and
00:00:06.140 | transformer models. At the core of this we have sentiment analysis. Now sentiment
00:00:11.040 | analysis is an NLP technique where we essentially want to extract the emotion
00:00:18.720 | or the sentiment behind some text. Naturally this makes it a really
00:00:24.000 | interesting and useful tool when it comes to analyzing a ton of language
00:00:30.280 | text data. And here we're gonna have a look at how to apply sentiment analysis
00:00:37.720 | and more specifically sentiment mining to the hotel industry so that we can
00:00:43.140 | understand customer perception through hotel reviews. With this we could
00:00:48.640 | identify the perfect hotel for whatever it is we want. Or from the other side, from
00:00:54.680 | the hotel management side, we can analyze customer reviews and maybe identify
00:01:01.240 | areas that are pretty strong and other areas that could do a bit of work. So the
00:01:06.040 | idea here is we're gonna take a ton of these customer reviews for different
00:01:10.080 | hotels and apply this technique to them. But of course we don't have to restrict
00:01:16.120 | this to reviews, although we can do product reviews for example. But we
00:01:21.400 | can actually apply this to any text where we want to extract sentiment and
00:01:25.840 | maybe you want to analyze different ideas or features within that data set.
00:01:32.760 | So the hotel example maybe you want to have a look at whether people think our
00:01:37.560 | room sizes are good or whether they like the breakfast and we could do that with
00:01:42.800 | this sentiment mining. So to do that we are essentially going to take all of
00:01:49.080 | these reviews we're going to embed them using something called a sentence
00:01:54.280 | transformer model. We're going to index those within a vector database and we're
00:01:58.880 | going to search for things like "are customers satisfied with room sizes for
00:02:05.040 | hotels in London" and that will return a set of reviews that are relevant to our
00:02:13.080 | particular query and then assess the general sentiment around them. Now we can
00:02:18.620 | also pair this with metadata which is just attached bits of information to
00:02:24.160 | each vector or piece or record within our vector database and we could use
00:02:29.680 | that to maybe visualize the sentiment across different hotels. Or we could even
00:02:34.200 | take that a little further and use something called metadata filtering to
00:02:37.680 | for example look at reviews over time and we could use that to look at the
00:02:43.360 | sentiment of different hotels or different aspects with each of those
00:02:48.240 | hotels over time and see if things are either improving or getting worse and so
00:02:53.520 | on. But we'll look at all of that in a lot more detail towards the end of the
00:02:57.280 | video. For now what we're going to do is have a look at the actual notebook that
00:03:01.760 | we are going to be going through. So if you'd like to follow along with me you
00:03:06.360 | can actually get the notebook either in the description of this video or you can
00:03:12.280 | go over to here pinecone.io/docs/examples/sentiment-mining and up at the top
00:03:19.040 | here you can either get the notebook on github or you can click over here to
00:03:24.680 | open it in Colab which is what I'm going to be running through. Now credit for
00:03:29.080 | this idea and notebook goes to Ashraq again so it's a really cool example and
00:03:36.280 | it's really useful. So we'll start going through it now. So the first thing we
00:03:41.120 | need to do is install any dependencies that we have. So we have a few here
00:03:46.240 | mainly sentence transformers, the Pinecone client, datasets which we can pull
00:03:51.480 | everything from and then for visualizations seaborn and matplotlib.
00:03:55.640 | So we run that and one other thing we should also check if you are running
00:04:00.480 | this in Colab is in your runtime go over to change runtime type and just make
00:04:06.800 | sure that you have the hardware accelerator on GPU here. So if you do
00:04:12.120 | need to switch you just have to restart and reinstall those prerequisites there.
00:04:16.720 | So once that is ready you can go to load and prepare dataset. So we're going to be
00:04:22.040 | using this dataset that Ashraq is hosting on Hugging Face. So Ashraq
00:04:27.640 | hotel reviews. We're going to take the training split and we're just going to
00:04:30.840 | convert it into a pandas data frame. So run this and once that has completed we
00:04:37.200 | come down to here and some of these reviews can obviously be much longer
00:04:44.280 | than others. So what we are going to do is just limit it to the first 800
00:04:50.400 | characters of each review because later on when we're creating our embeddings
00:04:55.200 | the models we will be using do have a maximum sequence length. So anything
00:05:01.120 | beyond that will get truncated anyway. So we're just going to do that now rather
00:05:06.200 | than later on. This will also help us when we later want to store the text from
00:05:13.240 | these reviews in the Pinecone vector database, because there are size limits to that.
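
The truncation step described above can be sketched in plain Python. This is a minimal illustration rather than the notebook's actual code: the walkthrough works on a pandas data frame, whereas this sketch uses a plain list, and the names `MAX_CHARS` and `truncate_reviews` are hypothetical.

```python
# Hypothetical helper: keep only the first 800 characters of each review,
# the limit mentioned in the walkthrough.
MAX_CHARS = 800


def truncate_reviews(reviews, max_chars=MAX_CHARS):
    """Return a new list where every review is cut to at most max_chars characters."""
    return [review[:max_chars] for review in reviews]


# Made-up sample data: one very long review and one short one.
reviews = ["Great location. " * 100, "Room was small."]
truncated = truncate_reviews(reviews)

# The long review is cut to 800 characters, the short one is unchanged.
print([len(review) for review in truncated])
```

Slicing never fails on short strings, so reviews under the limit pass through untouched.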
00:05:18.840 | So by cutting down the length here we're going to make sure that we definitely do
00:05:25.080 | not go over that. So here we can see we have a single hotel here there are several in
00:05:30.400 | this dataset and we just have the reviews. So you can see it's just natural
00:05:36.480 | language here there's nothing specific there are no like star ratings or
00:05:42.200 | anything here. So we really have to rely on the natural language of these reviews
00:05:46.360 | and sentiment analysis to actually understand those. So we're going to
00:05:50.800 | initialize our sentiment analysis model. We're going to be using RoBERTa here and
00:05:55.240 | it has been fine-tuned specifically for this. So what we're going to do is we're
00:06:00.800 | going to set our device to GPU if it is available. It may not be but if you are
00:06:06.960 | using Colab and you set the runtime up here you should be able to get that. So
00:06:11.480 | this here is just using the CUDA zero device. Okay and then from Transformers
00:06:17.560 | we're going to import a few things. We've got a pipeline which is going to be our
00:06:21.120 | sentiment analysis pipeline later on. We have a tokenizer and a model for
00:06:26.440 | sequence classification which is, again, our sentiment analysis model. Okay
00:06:32.320 | so we're going to be using this RoBERTa-based sentiment analysis model.
00:06:36.720 | We have three labels here so that's positive, negative and neutral. We're
00:06:42.600 | going to load our tokenizer from HuggingFace and then we're going to load
00:06:46.520 | all of that into this pipeline here. So this sentiment analysis pipeline on the
00:06:51.360 | device that we've chosen already and that's everything. So that will just
00:06:57.600 | package everything up into a really easy-to-use pipeline where we can just
00:07:01.280 | feed in some text and it's going to output us a little dictionary telling us
00:07:05.680 | the sentiment of that text. Now we will need to map these so the output that we
00:07:13.720 | get will say either label 0, 1 or 2. It isn't obvious what that means so
00:07:19.560 | we're just going to map those to their actual meanings which is negative,
00:07:22.800 | neutral or positive. Then we take a look at this one. So room is small for a
00:07:28.360 | superior room and poorly lit. No view, need better lighting. It's obviously not a
00:07:33.360 | very good review. They're not particularly happy with the room. So let's run this
00:07:39.520 | and we're going to see what we get. So I think I just need to run this again. Okay
00:07:46.960 | and then come down here run again and you see that we get this label 0 and so
00:07:54.960 | that's the predicted sentiment and we get this score so the confidence. So if we
00:08:00.040 | come up here we can see label 0 is negative right? Label 0 negative. Cool so
00:08:09.560 | 77% confidence in this being negative and I think that's fair enough. Now
00:08:16.400 | what we're going to do is initialize what's called a retriever model. So the
00:08:19.560 | retriever model is going to handle the construction of our vector embeddings. So
00:08:25.200 | we're going to take all of these reviews we're going to convert them into what
00:08:29.640 | are called dense vector embeddings and these are essentially a vector
00:08:36.080 | representation of the meaning behind each one of these reviews and what that
00:08:42.360 | does is allows us to search for something based on its meaning. So the
00:08:47.840 | earlier example I used which was something like hotel room sizes for
00:08:52.280 | London hotels. We can search for that and return reviews that are relevant to that
00:08:59.520 | particular query. That's not based on words being matched but the actual
00:09:03.760 | meaning behind what we have said. So we're going to initialize a model for
00:09:11.800 | that. It's a sentence transformer, and it's a pretty small one. This MiniLM is like an
00:09:16.800 | efficient smaller model but it still works very well. So we will download that
00:09:23.080 | and we can see it creates these vector embeddings which are 384 dimensions in
00:09:29.840 | size. So we're essentially taking our text and encoding all that meaning into
00:09:35.520 | a 384 dimensional vector embedding. Okay and then what we need to do is
00:09:41.400 | initialize a pinecone index. So this is going to be our vector database where we're
00:09:45.600 | going to store all of these vector embeddings. What we want to do here is we
00:09:51.120 | first need to get an API key. Okay so we would go over to app.pinecone.io
00:09:59.320 | When you come into this you probably just have a single project. You
00:10:05.280 | may need to sign up and that's free so don't worry about that you sign up and
00:10:09.600 | you'll probably have a single project here. That will be your name followed by
00:10:13.840 | default project so like this one for me. I'll click on that and this will probably
00:10:19.800 | be empty which is fine because we're going to create an index but what we do
00:10:23.760 | need is we need to go to the API keys here. We need to go to copy key value and
00:10:28.840 | then you just copy that value into here. Okay and that is your personal API key.
00:10:36.600 | Now for me I have stored that in a variable called API key. So I will run
00:10:44.360 | that. That will connect and that initializes our connection to our Pinecone
00:10:49.800 | project. Now what we can do is initialize an index. So for me I've actually already
00:10:57.320 | created this in advance so this here will basically check if it already
00:11:02.920 | exists. If it doesn't already exist it will create the index with these
00:11:07.360 | parameters here. So dimensionality is at 384 that's the vector embedding
00:11:12.640 | dimensionality of the model that we saw earlier and also the metric here is
00:11:17.960 | essentially how we compare each of the vectors within our vector index. So that
00:11:23.600 | will depend on the retrieval model that you are using. In this case with the
00:11:29.200 | retrieval model we're using we need to use the cosine metric and once that's
00:11:33.400 | created or just identified as already existing we would connect to that index
00:11:39.600 | like so. From here what we want to do is actually move on to generating our
00:11:44.080 | embeddings. So what we are going to do here is take all of our reviews we're
00:11:50.120 | going to run them through our sentiment analysis model and get their sentiment
00:11:57.440 | scores whether they are positive, negative, neutral and we're going to do
00:12:02.200 | that using this function here. So get sentiment and we'll just pass the reviews
00:12:07.560 | through there and we can see okay we take these first three reviews, run that
00:12:15.560 | and we'll get the first three sentiments for those first three reviews. So we have
00:12:21.160 | the very first item is negative so if I let's have a look at what we are
00:12:27.160 | actually looking at there. So review
00:12:32.280 | first three so actually you can't see that there. Do this okay so this is the
00:12:41.360 | first one yeah it looks pretty negative straight away. Number one so this is the
00:12:46.940 | neutral one with a confidence score of 77. Yes just the location view doesn't
00:12:53.480 | really make any sense and number two positive and yeah I mean it's it's
00:12:59.760 | generally positive it's not not super over-the-top but they're saying it's
00:13:03.880 | pretty central I think and also it's very helpful. So that sounds pretty good
00:13:10.080 | and there's a confidence of 89.7% there. So we're basically going to do that for
00:13:16.320 | all of our data but there are a few other things that I want to be able to
00:13:20.760 | do here. So I want to be able to use metadata filtering to search through a
00:13:25.680 | particular date range and to do that we need to get the timestamps from our
00:13:31.120 | dataset. Now if we have a look at our dataset we can see that we have this
00:13:38.880 | review date but in order to be able to use those dates in Pinecone for
00:13:45.760 | filtering like above a certain date or below a certain date we actually need to
00:13:49.560 | convert them into a number. So to do that we use the DateUtil parser and if
00:13:58.320 | we just take one example so we're going to get the time some for review date 0
00:14:03.320 | so the first one if we have a look okay so we're going to take that and we can
00:14:15.120 | get the timestamp like so. Okay that would just allow us to compare all the
00:14:20.080 | dates using greater than, less than operators and so on. Now from here we
00:14:27.840 | have everything we need so we can create our or we can get our sentiments, we can get
00:14:35.200 | our dates and what we now do is actually create our embeddings, pull all that
00:14:40.000 | together and then insert them into our Pinecone vector database. So we're going
00:14:46.480 | to do this in batches of 64 there's two reasons for that, one to avoid putting
00:14:52.720 | too much data into our sentence transformer at once which will basically
00:14:57.080 | cause an out-of-memory error and also to avoid sending too much data in one go
00:15:03.040 | through the API calls to Pinecone. So we extract the batch from our data frame, we
00:15:11.920 | encode so this creates the embeddings for that batch, we then get the timestamp
00:15:18.320 | using the values that we saw before so that getTimeStamp function. We want to
00:15:24.640 | get the sentiment labels and scores here so this positive, negative, neutral and the
00:15:30.480 | actual confidence score. We get those and then we use this here to convert all
00:15:35.320 | that into a metadata dictionary. So if I just show you very quickly what that
00:15:40.240 | might look like, it will basically be something like label is equal to
00:15:45.800 | positive for example. We would also have the date or timestamp, that'll be whatever
00:15:55.240 | number there. We'll have a score and very important for us to be able to read the
00:16:05.560 | review after is we'll also have the review. So that is essentially what we're
00:16:12.360 | building within this meta variable here but it will actually be within a list
00:16:16.760 | and we'll have several of these dictionaries one after the other and the
00:16:22.600 | final thing that we need to do is just create some unique IDs. Now these are
00:16:25.880 | just numbers, there's nothing special there. Obviously if you have unique IDs that you
00:16:30.500 | need to use for your reviews or your data set, use those actual unique IDs
00:16:36.160 | rather than just a count like we're doing here. Then from there we add all
00:16:40.640 | those into what we call the upsert list. So the upsert list is just a list of tuples.
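
The batching loop described above can be sketched as follows. This is a hedged illustration, not the notebook's code: the embeddings are placeholder zero vectors standing in for the retriever model's output, the sentiment values are dummies, and the helper names and metadata keys (`review`, `label`, `score`, `timestamp`) are assumptions. Each record is an (ID, vector, metadata) tuple, and the real Pinecone upsert call is only marked with a comment.

```python
BATCH_SIZE = 64   # batch size used in the walkthrough
DIMENSION = 384   # embedding size of the retriever model in the walkthrough


def batched(items, batch_size=BATCH_SIZE):
    """Yield (start_index, batch) pairs covering items in order."""
    for start in range(0, len(items), batch_size):
        yield start, items[start:start + batch_size]


def build_upsert_batch(start, embeddings, metadata):
    """Pair each vector with a count-based string ID and its metadata dictionary."""
    ids = [str(start + offset) for offset in range(len(embeddings))]
    return list(zip(ids, embeddings, metadata))


# Made-up reviews standing in for the real dataset.
reviews = [f"review {i}" for i in range(150)]

all_records = []
batch_sizes = []
for start, batch in batched(reviews):
    # Placeholder vectors: in the notebook these come from the sentence transformer.
    embeddings = [[0.0] * DIMENSION for _ in batch]
    # Dummy sentiment values: in the notebook these come from the sentiment pipeline.
    metadata = [
        {"review": text, "label": "neutral", "score": 0.5, "timestamp": 0.0}
        for text in batch
    ]
    records = build_upsert_batch(start, embeddings, metadata)
    batch_sizes.append(len(records))
    # This is the point where each batch would be upserted into the Pinecone index.
    all_records.extend(records)

print(batch_sizes)  # sizes of the batches that would be sent
```

Using the running count as the ID keeps IDs unique across batches, which matters because upserting a record with an existing ID overwrites it.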
00:16:45.920 | So we have the list here of the tuples and those tuples are the record ID and
00:16:52.400 | the record dense vector and the record metadata and then we upsert all of those
00:16:58.400 | into Pinecone. So let's go ahead and run that. Now as you can see this will take a
00:17:06.080 | little bit of time so on here showing about 25 minutes which is not too long
00:17:12.040 | but I don't want to wait that long and I've already done it, so I am going to
00:17:17.440 | come here I'm gonna stop this and I'm just gonna take this to show you that we
00:17:21.800 | have everything in there already. Okay so describe index stats and we should see
00:17:28.080 | okay total vector count we have 93.7 thousand items in there. Okay cool so now
00:17:36.680 | what we can do is move on to our opinion mining or sentiment mining where we are
00:17:42.060 | essentially going to ask some questions and we're going to take a look and
00:17:45.840 | analyze the sentiment for the hotel reviews based on what it is we're searching
00:17:51.400 | for. So the first thing I want to search for which is what I mentioned right at the start is
00:17:56.200 | room sizes. So are customers happy with room sizes for London hotels. So we're
00:18:04.160 | going to return 500 reviews that are relevant to this query. We're going to
00:18:10.520 | include the metadata so that makes sure that we include the metadata that we
00:18:16.080 | index so the review text which is it's pretty useful for us to understand what
00:18:20.640 | we're returning here. We also want to see the sentiment scores which is pretty
00:18:26.960 | useful for actually analyzing what we're seeing here and the actual sentiment
00:18:32.080 | labels. So we can do that now, and that will return a little dictionary that looks like
00:18:40.160 | this. Okay so we can see here we have a ton of these actually so let's just have
00:18:49.240 | a look at one. So we have the ID so 37796 like I said it's just a count so it's
00:18:56.440 | roughly the 37,800th item that we indexed. We have the hotel name that it's coming from so
00:19:05.760 | we also included that before. We have labels so this is a negative response and this
00:19:11.960 | is the actual taste of review. Did not like the first room offered was not in
00:19:16.320 | hotel but over the road so yeah it doesn't seem that happy about it. It's
00:19:22.320 | happy with the manager but reception staffed it's not very happy with so a
00:19:27.680 | bit of a mixed review but generally I think it seems kind of negative. Then down
00:19:32.280 | here we have our timestamp which is what we use in order to do the metadata
00:19:37.840 | filtering that you'll see later. Okay great so let's come down and what we can
00:19:43.080 | do we can run this. This is just showing a few of the reviews we already looked at
00:19:49.280 | them so let's not go through that again.
00:19:53.240 | Okay and what we can do is we're just going to count the sentiment of
00:19:59.400 | everything returned. So we have negative, neutral, positive you can see here we
00:20:04.140 | have the labels here. So we're just going to count out of everything we return
00:20:08.600 | what is the overall sentiment that we're seeing. So we run this or I think I'm
00:20:17.160 | skipping ahead a little bit. I need to run this. Okay so you see you get negative,
00:20:21.280 | neutral, positive, way more positive than anything else there. So surprisingly
00:20:26.960 | people are actually happy with the room sizes for London hotels but generally
00:20:31.800 | speaking we don't want to just know this for all London hotels or maybe we want
00:20:36.120 | to be a little more specific in what we're searching for. So let's go through
00:20:40.080 | and see how we can do that. So first thing we want to look at is the metadata
00:20:45.320 | filtering for fine-tuning our search to a specific date range. So what we're going to do
00:20:50.560 | is get a start time and end time. So we're going to do 2015 December so
00:20:57.240 | Christmas up to the end of the year in 2015 as well and that will just generate
00:21:04.240 | those integer numbers that you saw before or they might be floats actually.
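
Turning a date range into numeric timestamps can be sketched like this. The walkthrough uses the dateutil parser; this sketch swaps in the standard library's `datetime` with an explicit UTC timezone so the numbers are the same on every machine. The exact days (25 and 31 December 2015) and the name `to_timestamp` are assumptions based on "Christmas up to the end of the year".

```python
from datetime import datetime, timezone


def to_timestamp(year, month, day):
    """Return the Unix timestamp (seconds, as a float) for midnight UTC on the given date."""
    return datetime(year, month, day, tzinfo=timezone.utc).timestamp()


# Assumed range: Christmas Day 2015 to New Year's Eve 2015.
start_time = to_timestamp(2015, 12, 25)
end_time = to_timestamp(2015, 12, 31)

# Both values are floats, so they can be compared with >= and <=.
print(start_time, end_time)
```

Because the timestamps are plain numbers, a later date always maps to a larger value, which is what makes range filtering on them possible.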
00:21:09.320 | So the timestamps. Yep and we're basically going to say to Pinecone we
00:21:14.920 | only want to search within the space where we have records that
00:21:19.040 | have a timestamp that is larger than this value and less than this value. So to do
00:21:25.180 | that we're going to add a filter timestamp greater than or equal to the
00:21:29.000 | start time and less than or equal to the end time. Run that and we're just going
00:21:34.160 | to plot our bar chart here and we see that during this time people are still
00:21:38.680 | generally quite positive although there's definitely a lot more negatives
00:21:42.680 | within that time frame. So another thing we can do is actually apply the same
00:21:47.000 | thing to different hotels. So we'll apply that and what we're going to do is look
00:21:52.400 | into five different areas. We're going to look at the room sizes, the cleanliness, staff,
00:21:56.880 | food and air conditioning. So what we do is we'll have a query for each one of
00:22:03.000 | these. Are customers happy with room sizes? Are customers satisfied with cleanliness of rooms?
00:22:07.920 | Essentially we're just taking each one of these parameters and we're reformatting it
00:22:12.120 | into more of a natural language question. So we run this, come down and we're going
00:22:20.240 | to go through each of the hotels that we pulled from up here. So Strand Palace
00:22:26.440 | Hotel, Britannia, Grand Royale and so on and so on and we're just going to
00:22:31.040 | iterate through those. We're going to filter for each one of those hotel names
00:22:35.020 | here so that we're returning results specific to each one of those hotels.
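
A metadata filter of the kind described above can be sketched as a plain dictionary. The `$eq`, `$gte` and `$lte` operators follow Pinecone's metadata filter syntax, where several top-level conditions are combined with a logical AND. The field names `hotel_name` and `timestamp`, the helper `build_filter`, and the sample values are assumptions for illustration, and the query call itself is left out.

```python
def build_filter(hotel_name=None, start_time=None, end_time=None):
    """Build a metadata filter dictionary; conditions that are None are left out."""
    conditions = {}
    if hotel_name is not None:
        conditions["hotel_name"] = {"$eq": hotel_name}

    time_range = {}
    if start_time is not None:
        time_range["$gte"] = start_time  # on or after the start
    if end_time is not None:
        time_range["$lte"] = end_time    # on or before the end
    if time_range:
        conditions["timestamp"] = time_range

    return conditions


# Filter on one hotel only.
hotel_filter = build_filter(hotel_name="Strand Palace Hotel")

# Filter on one hotel within a date range (made-up timestamps for illustration).
ranged_filter = build_filter(
    hotel_name="Strand Palace Hotel",
    start_time=1451001600.0,
    end_time=1451520000.0,
)
print(hotel_filter)
print(ranged_filter)
```

The resulting dictionary would be passed as the filter argument when querying the index, so only records matching every condition are searched.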
00:22:39.380 | We're going to make sure that we here we need to encode that query as we did
00:22:43.600 | before to create our query embedding. We're going to count the sentiment as we
00:22:47.560 | did before and then we're going to store all of this into a set of results here.
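
Counting the sentiment labels of the returned reviews can be sketched as below. It assumes each match is a dictionary with a `metadata` entry holding a `label` of negative, neutral or positive, which mirrors the result shape described earlier. The function name `count_sentiment` and the sample matches are hypothetical.

```python
from collections import Counter

LABELS = ("negative", "neutral", "positive")


def count_sentiment(matches):
    """Count how many matches carry each sentiment label, always reporting all three labels."""
    counts = Counter(match["metadata"]["label"] for match in matches)
    return {label: counts.get(label, 0) for label in LABELS}


# Made-up matches standing in for a query result.
matches = [
    {"id": "1", "metadata": {"label": "positive", "review": "Spacious room."}},
    {"id": "2", "metadata": {"label": "negative", "review": "Room was tiny."}},
    {"id": "3", "metadata": {"label": "positive", "review": "Plenty of space."}},
]

sentiment = count_sentiment(matches)
print(sentiment)
```

Returning zero for labels that never appear keeps the counts aligned across hotels and areas, which makes the later bar charts easier to compare.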
00:22:52.680 | Okay so let's go and do this. Area by the way is just what we set up here. So room
00:22:59.720 | size, cleanliness and so on. So we'll go through each one of those and get our
00:23:04.880 | results. Okay so that will run for a moment. Okay so 10 seconds to run there
00:23:11.400 | and then what we can do is take a look. So we have all these results here and
00:23:16.400 | then we can go ahead and visualize that to make a little more sense of it and
00:23:23.400 | there we go. So we have all the different parameters here. Room size, cleanliness,
00:23:29.840 | staff, food, AC for each one of those hotels. So generally speaking looks like
00:23:37.160 | everyone just loves Intercontinental London the O2. Grand Royale London Hyde
00:23:42.280 | Park seems pretty popular as well and we should never stay here especially if
00:23:47.840 | you want AC. So you can go through and read each of these. These are just kind
00:23:52.760 | of like an analysis that we've pulled from each one of these results but
00:23:58.640 | generally speaking you see really easily how useful sentiment mining can be. Just
00:24:04.560 | analyzing the customer sentiment around in this case hotels but also products or
00:24:11.040 | services or cafes you know all these different things and probably a lot of
00:24:16.040 | other things that have absolutely nothing to do with customer reviews and
00:24:20.440 | we can of course also do this in pretty much near real time. So we can run our
00:24:27.360 | sentiment analysis model on reviews coming in, upsert them into Pinecone and
00:24:31.720 | that comes with an almost instant refresh. So we can then update these visuals and
00:24:38.520 | our analysis over time in near real time which is pretty cool. So that's it for
00:24:46.360 | this video, this example. I hope this has been interesting and useful. Thank you
00:24:53.160 | very much for watching and I will see you again in the next one. Bye.