Advanced Sentiment Analysis with NLP Transformers + Vector Search
Chapters
0:00 Intro
0:31 What we will build
3:01 Code links and prerequisites
4:16 Dataset download and preprocessing
5:49 Using RoBERTa sentiment analysis model
8:15 Retriever model for building dense vectors
9:39 Create Pinecone vector index
11:40 Sentiment scores, vectors, and indexing
17:35 Sentiment analysis / opinion mining
20:43 Sentiment analysis with specific date range
21:44 Sentiment analysis on specific info
23:58 Final notes
00:00:00.000 |
Today we're going to learn how to do sentiment mining using vector search and 00:00:06.140 |
transform models. At the core of this we have sentiment analysis. Now sentiment 00:00:11.040 |
analysis is an NLP technique where we essentially want to extract the emotion 00:00:18.720 |
or the sentiment behind some text. Naturally this makes it a really 00:00:24.000 |
interesting and useful tool when it comes to analyzing a ton of language 00:00:30.280 |
text data. And here we're gonna have a look at how to apply sentiment analysis 00:00:37.720 |
and more specifically sentiment mining to the hotel industry so that we can 00:00:43.140 |
understand customer perception through hotel reviews. With this we could 00:00:48.640 |
identify the perfect hotel for whatever it is we want. Or from the other side, from 00:00:54.680 |
the hotel management side, we can analyze customer reviews and maybe identify 00:01:01.240 |
areas that are pretty strong and other areas that could use a bit of work. So the 00:01:06.040 |
idea here is we're gonna take a ton of these customer reviews for different 00:01:10.080 |
hotels and apply this technique to them. But of course we don't have to restrict 00:01:16.120 |
this to reviews, although we can; we can do product reviews, for example. But we 00:01:21.400 |
can actually apply this to any text where we want to extract sentiment and 00:01:25.840 |
maybe you want to analyze different ideas or features within that data set. 00:01:32.760 |
So the hotel example maybe you want to have a look at whether people think our 00:01:37.560 |
room sizes are good or whether they like the breakfast and we could do that with 00:01:42.800 |
this sentiment mining. So to do that we are essentially going to take all of 00:01:49.080 |
these reviews we're going to embed them using something called a sentence 00:01:54.280 |
transformer model. We're going to index those within a vector database and we're 00:01:58.880 |
going to search for things like "are customers satisfied with room sizes for 00:02:05.040 |
hotels in London" and that will return a set of reviews that are relevant to our 00:02:13.080 |
particular query and then assess the general sentiment around them. Now we can 00:02:18.620 |
also pair this with metadata which is just attached bits of information to 00:02:24.160 |
each vector, or record, within our vector database and we could use 00:02:29.680 |
that to maybe visualize the sentiment across different hotels. Or we could even 00:02:34.200 |
take that a little further and use something called metadata filtering to 00:02:37.680 |
for example, look at reviews over time, and we could use that to look at the 00:02:43.360 |
sentiment of different hotels or different aspects with each of those 00:02:48.240 |
hotels over time and see if things are either improving or getting worse and so 00:02:53.520 |
on. But we'll look at all of that in a lot more detail towards the end of the 00:02:57.280 |
video. For now what we're going to do is have a look at the actual notebook that 00:03:01.760 |
we are going to be going through. So if you'd like to follow along with me you 00:03:06.360 |
can actually get the notebook either in the description of this video or you can 00:03:12.280 |
go over to here pinecone.io/docs/examples/sentiment-mining and up at the top 00:03:19.040 |
here you can either get the notebook on github or you can click over here to 00:03:24.680 |
open it in Colab which is what I'm going to be running through. Now credit for 00:03:29.080 |
this idea and notebook goes to ashraq again, so it's a really cool example and 00:03:36.280 |
it's really useful. So we'll start going through it now. So the first thing we 00:03:41.120 |
need to do is install any dependencies that we have. So we have a few here 00:03:46.240 |
mainly sentence-transformers, pinecone-client, datasets which we can pull 00:03:51.480 |
everything from and then for visualizations seaborn and matplotlib. 00:03:55.640 |
So we run that and one other thing we should also check if you are running 00:04:00.480 |
this in Colab is in your runtime go over to change runtime type and just make 00:04:06.800 |
sure that you have the hardware accelerator on GPU here. So if you do 00:04:12.120 |
need to switch you just have to restart and reinstall those prerequisites there. 00:04:16.720 |
So once that is ready you can go to load and prepare dataset. So we're going to be 00:04:22.040 |
using this dataset that ashraq is hosting on HuggingFace. So ashraq 00:04:27.640 |
hotel reviews. We're going to take the training split and we're just going to 00:04:30.840 |
convert it into a pandas data frame. So run this and once that has completed we 00:04:37.200 |
come down to here and some of these reviews can obviously be much longer 00:04:44.280 |
than others. So what we are going to do is just limit it to the first 800 00:04:50.400 |
characters of each review because later on when we're creating our embeddings 00:04:55.200 |
the models we will be using do have a maximum sequence length. So anything 00:05:01.120 |
beyond that will get truncated anyway. So we're just going to do that now rather 00:05:06.200 |
than later on. This will also help us when we later want to store the text from 00:05:13.240 |
these reviews in the Pinecone Vector database there are size limits to that. 00:05:18.840 |
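As a quick sketch, that truncation step looks roughly like this with pandas (the column name `review` is an assumption, so check it against the actual dataset):

```python
import pandas as pd

# toy stand-in for the hotel reviews DataFrame; the real column name may differ
df = pd.DataFrame({"review": ["Great stay, lovely staff.", "x" * 1200]})

# keep only the first 800 characters of each review, as described above
df["review"] = df["review"].str.slice(0, 800)

print(df["review"].str.len().tolist())  # → [25, 800]
```

Short reviews pass through untouched; only reviews beyond 800 characters get cut down.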
So by cutting down the length here we're going to make sure that we definitely do 00:05:25.080 |
not go over that. So here we can see we have a single hotel here there are several in 00:05:30.400 |
this dataset and we just have the reviews. So you can see it's just natural 00:05:36.480 |
language here; there's nothing specific, there are no star ratings or 00:05:42.200 |
anything here. So we really have to rely on the natural language of these reviews 00:05:46.360 |
and sentiment analysis to actually understand those. So we're going to 00:05:50.800 |
initialize our sentiment analysis model. We're going to be using RoBERTa here and 00:05:55.240 |
it has been fine-tuned specifically for this. So what we're going to do is we're 00:06:00.800 |
going to set our device to GPU if it is available. It may not be but if you are 00:06:06.960 |
using Colab and you set the runtime up here you should be able to get that. So 00:06:11.480 |
this here is just using the CUDA zero device. Okay and then from Transformers 00:06:17.560 |
we're going to import a few things. We've got a pipeline which is going to be our 00:06:21.120 |
sentiment analysis pipeline later on. We have a tokenizer and a model for 00:06:26.440 |
sequence classification which is our again our sentiment analysis model. Okay 00:06:32.320 |
so we're going to be using this RoBERTa-based sentiment analysis model. 00:06:36.720 |
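A minimal sketch of this setup: the pipeline call itself is left as a comment because it downloads the model, and the checkpoint name follows the commonly used cardiffnlp RoBERTa sentiment model, so treat it as an assumption. The label mapping that comes up shortly is the runnable part:

```python
# from transformers import pipeline
# sentiment_pipe = pipeline(
#     "sentiment-analysis",
#     model="cardiffnlp/twitter-roberta-base-sentiment",  # checkpoint name assumed
#     device=0,  # GPU; use device=-1 for CPU
# )

# the pipeline outputs LABEL_0/1/2; map them to readable sentiments
label_map = {"LABEL_0": "negative", "LABEL_1": "neutral", "LABEL_2": "positive"}

# stand-in for one pipeline output, like the small-room review below
result = {"label": "LABEL_0", "score": 0.77}
sentiment = label_map[result["label"]]
print(sentiment, result["score"])  # → negative 0.77
```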
We have three labels here so that's positive, negative and neutral. We're 00:06:42.600 |
going to load our tokenizer from HuggingFace and then we're going to load 00:06:46.520 |
all of that into this pipeline here. So this sentiment analysis pipeline on the 00:06:51.360 |
device that we've chosen already and that's everything. So that will just 00:06:57.600 |
package everything up into like really easy to use pipeline where we can just 00:07:01.280 |
feed in some text and it's going to output us a little dictionary telling us 00:07:05.680 |
the sentiment of that text. Now we will need to map these so the output that we 00:07:13.720 |
get will say either label 0, 1 or 2. That isn't obviously clear what that means so 00:07:19.560 |
we're just going to map those to their actual meanings which is negative, 00:07:22.800 |
neutral or positive. Then we take a look at this one. So room is small for a 00:07:28.360 |
superior room and poorly lit. No view, need better lighting. It's obviously not a 00:07:33.360 |
very good review. They're not particularly happy with the room. So let's run this 00:07:39.520 |
and we're going to see what we get. So I think I just need to run this again. Okay 00:07:46.960 |
and then come down here run again and you see that we get this label 0 and so 00:07:54.960 |
that's a predicted sentiment and we get this score so the confidence. So if we 00:08:00.040 |
come up here we can see label 0 is negative right? Label 0 negative. Cool so 00:08:09.560 |
77% confidence in this being negative and I think that's fair enough. Now 00:08:16.400 |
what we're going to do is initialize what's called a retriever model. So the 00:08:19.560 |
retriever model is going to handle the construction of our vector embeddings. So 00:08:25.200 |
we're going to take all of these reviews we're going to convert them into what 00:08:29.640 |
are called dense vector embeddings and these are essentially a vector 00:08:36.080 |
representation of the meaning behind each one of these reviews and what that 00:08:42.360 |
does is allows us to search for something based on its meaning. So the 00:08:47.840 |
earlier example I used which was something like hotel room sizes for 00:08:52.280 |
London hotels. We can search for that and return reviews that are relevant to that 00:08:59.520 |
particular query. That's not based on words being matched but the actual 00:09:03.760 |
meaning behind what we have said. So we're going to initialize a model for 00:09:11.800 |
that. It's a sentence transformer, a pretty small one; this MiniLM is like an 00:09:16.800 |
efficient smaller model but it still works very well. So we will download that 00:09:23.080 |
and we can see it creates these vector embeddings which are 384 dimensions in 00:09:29.840 |
size. So we're essentially taking our text and encoding all that meaning into 00:09:35.520 |
a 384 dimensional vector embedding. Okay and then what we need to do is 00:09:41.400 |
initialize a pinecone index. So this is going to be our vector database where we're 00:09:45.600 |
going to store all of these vector embeddings. What we want to do here is we 00:09:51.120 |
first need to get an API key. Okay so we would go over to app.pinecone.io 00:09:59.320 |
When you come into this you probably just have a single project. You 00:10:05.280 |
may need to sign up and that's free so don't worry about that you sign up and 00:10:09.600 |
you'll probably have a single project here. That will be your name followed by 00:10:13.840 |
default project so like this one for me. I'll click on that and this will probably 00:10:19.800 |
be empty which is fine because we're going to create an index but what we do 00:10:23.760 |
need is we need to go to the API keys here. We need to go to copy key value and 00:10:28.840 |
then you just copy that value into here. Okay and that is your personal API key. 00:10:36.600 |
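The index setup that comes next follows a create-if-missing pattern. Here's a stub of just that logic, with the actual client calls left as comments (the index name is made up for illustration, and the exact client API depends on your pinecone-client version, e.g. older versions used `pinecone.init` and `pinecone.create_index`):

```python
# stand-in for pinecone.list_indexes(); the real call needs pinecone.init(api_key=...)
existing_indexes = ["some-other-index"]

index_name = "hotel-reviews"  # name assumed for illustration

if index_name not in existing_indexes:
    # real call: pinecone.create_index(index_name, dimension=384, metric="cosine")
    existing_indexes.append(index_name)

# real call: index = pinecone.Index(index_name)
print(existing_indexes)  # → ['some-other-index', 'hotel-reviews']
```

The dimension (384) has to match the retriever model's embedding size, and the metric (cosine) has to match what the retriever was trained for.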
Now for me I have stored that in a variable called API key. So I will run 00:10:44.360 |
that. That will connect and that initializes our connection to our pinecone 00:10:49.800 |
project. Now what we can do is initialize an index. So for me I've actually already 00:10:57.320 |
created this in advance so this here will basically check if it already 00:11:02.920 |
exists. If it doesn't already exist it will create the index with these 00:11:07.360 |
parameters here. So dimensionality is at 384 that's the vector embedding 00:11:12.640 |
dimensionality of the model that we saw earlier and also the metric here is 00:11:17.960 |
essentially how we compare each of the vectors within our vector index. So that 00:11:23.600 |
will depend on the retrieval model that you are using. In this case with the 00:11:29.200 |
retrieval model we're using we need to use the cosine metric and once that's 00:11:33.400 |
created or just identified as already existing we would connect to that index 00:11:39.600 |
like so. From here what we want to do is actually move on to generating our 00:11:44.080 |
embeddings. So what we are going to do here is take all of our reviews we're 00:11:50.120 |
going to run them through our sentiment analysis model and get their sentiment 00:11:57.440 |
scores whether they are positive, negative, neutral and we're going to do 00:12:02.200 |
that using this function here. So get sentiment and we'll just pass the reviews 00:12:07.560 |
through there and we can see okay we take these first three reviews, run that 00:12:15.560 |
and we'll get the first three sentiments for those first three reviews. So we have 00:12:21.160 |
the very first item is negative, so let's have a look at what our 00:12:32.280 |
first three are. Actually you can't see that there, so do this. Okay, so this is the 00:12:41.360 |
first one, and yeah, it looks pretty negative straight away. Number one, so this is the 00:12:46.940 |
neutral one with a confidence score of 77. Yes, "just the location view" doesn't 00:12:53.480 |
really make any sense. And number two, positive, and yeah, I mean it's 00:12:59.760 |
generally positive, not super over-the-top, but they're saying it's 00:13:03.880 |
pretty central I think and also it's very helpful. So that sounds pretty good 00:13:10.080 |
and there's a confidence of 89.7% there. So we're basically going to do that for 00:13:16.320 |
all of our data but there are a few other things that I want to be able to 00:13:20.760 |
do here. So I want to be able to use metadata filtering to search through a 00:13:25.680 |
particular date range and to do that we need to get the timestamps from our 00:13:31.120 |
dataset. Now if we have a look at our dataset we can see that we have this 00:13:38.880 |
review date but in order to be able to use those dates in Pinecone for 00:13:45.760 |
filtering like above a certain date or below a certain date we actually need to 00:13:49.560 |
convert them into a number. So to do that we use the DateUtil parser and if 00:13:58.320 |
we just take one example so we're going to get the timestamp for review date 0 00:14:03.320 |
so the first one if we have a look okay so we're going to take that and we can 00:14:15.120 |
get the timestamp like so. Okay that would just allow us to compare all the 00:14:20.080 |
dates using greater than, less than operators and so on. Now from here we 00:14:27.840 |
have everything we need so we can get our sentiments, we can get 00:14:35.200 |
our dates and what we now do is actually create our embeddings, pull all that 00:14:40.000 |
together and then insert them into our Pinecone vector database. So we're going 00:14:46.480 |
to do this in batches of 64. There are two reasons for that: one, to avoid putting 00:14:52.720 |
too much data into our sentence transformer at once which will basically 00:14:57.080 |
cause an out-of-memory error and also to avoid sending too much data in one go 00:15:03.040 |
through the API calls to Pinecone. So we extract the batch from our data frame, we 00:15:11.920 |
encode so this creates the embeddings for that batch, we then get the timestamp 00:15:18.320 |
using the values that we saw before so that getTimeStamp function. We want to 00:15:24.640 |
get the sentiment labels and scores here so this positive, negative, neutral and the 00:15:30.480 |
actual confidence score. We get those and then we use this here to convert all 00:15:35.320 |
that into a metadata dictionary. So if I just show you very quickly what that 00:15:40.240 |
might look like, it will basically be something like label is equal to 00:15:45.800 |
positive for example. We would also have the date or timestamp, that'll be whatever 00:15:55.240 |
number there. We'll have a score and very important for us to be able to read the 00:16:05.560 |
review after is we'll also have the review. So that is essentially what we're 00:16:12.360 |
building within this meta variable here but it will actually be within a list 00:16:16.760 |
and we'll have several of these dictionaries one after the other and the 00:16:22.600 |
final thing that we need to do is just create some unique IDs. Now these are 00:16:25.880 |
just numbers, there's nothing special there. Obviously if you have unique IDs that you 00:16:30.500 |
need to use for your reviews or your data set, use those actual unique IDs 00:16:36.160 |
rather than just a count like we're doing here. Then from there we add all 00:16:40.640 |
those into what we call the upsert list. So the upsert list is just a list of tuples. 00:16:45.920 |
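A toy sketch of that batching loop, with the retriever and Pinecone calls stubbed out as comments (the vector values and metadata contents here are placeholders, not real embeddings or scores):

```python
batch_size = 64
reviews = [f"review number {i}" for i in range(150)]  # toy data

all_upserts = []
for start in range(0, len(reviews), batch_size):
    batch = reviews[start:start + batch_size]
    # real step: embeds = retriever.encode(batch).tolist()
    upserts = []
    for offset, text in enumerate(batch):
        _id = str(start + offset)                  # simple counter as the record ID
        vector = [0.0] * 384                       # placeholder for the 384-d embedding
        meta = {"review": text, "label": "positive",
                "score": 0.9, "timestamp": 0.0}    # placeholder metadata
        upserts.append((_id, vector, meta))        # (id, vector, metadata) tuple
    # real step: index.upsert(vectors=upserts)
    all_upserts.extend(upserts)

print(len(all_upserts))  # → 150
```

150 reviews at a batch size of 64 gives three batches (64, 64, 22), each of which would be upserted in its own API call.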
So we have the list here of the tuples and those tuples are the record ID and 00:16:52.400 |
the record dense vector and the record metadata and then we upsert all of those 00:16:58.400 |
into Pinecone. So let's go ahead and run that. Now as you can see this will take a 00:17:06.080 |
little bit of time so on here showing about 25 minutes which is not too long 00:17:12.040 |
but I don't want to wait that long and I've already done it, so I am going to 00:17:17.440 |
come here, I'm gonna stop this, and I'm just gonna run this to show you that we 00:17:21.800 |
have everything in there already. Okay so describe index stats and we should see 00:17:28.080 |
okay total vector count we have 93.7 thousand items in there. Okay cool so now 00:17:36.680 |
what we can do is move on to our opinion mining or sentiment mining where we are 00:17:42.060 |
essentially going to ask some questions and we're going to take a look and 00:17:45.840 |
analyze the sentiment for the hotel reviews based on what it is we're searching 00:17:51.400 |
for. So the first thing I want to search for which is what I mentioned right at the start is 00:17:56.200 |
room sizes. So are customers happy with room sizes for London hotels. So we're 00:18:04.160 |
going to return 500 reviews that are relevant to this query. We're going to 00:18:10.520 |
include the metadata so that makes sure that we include the metadata that we 00:18:16.080 |
index so the review text which is it's pretty useful for us to understand what 00:18:20.640 |
we're returning here. We also want to see the sentiment scores which is pretty 00:18:26.960 |
useful for actually analyzing what we're seeing here and the actual sentiment 00:18:32.080 |
labels. So we can do that now that will return a little dictionary looks like 00:18:40.160 |
this. Okay so we can see here we have a ton of these actually so let's just have 00:18:49.240 |
a look at one. So we have the ID so 37796 like I said it's just a count so it's 00:18:56.440 |
the 37,796th item that we indexed. We have the hotel name that it's coming from, 00:19:05.760 |
which we also included before. We have labels, so this is a negative response, and this 00:19:11.960 |
is the actual text of the review: did not like the first room offered, was not in 00:19:16.320 |
hotel but over the road, so yeah, they don't seem that happy about it. They're 00:19:22.320 |
happy with the manager, but the reception staff they're not very happy with, so a 00:19:27.680 |
bit of a mixed review but generally I think it seems kind of negative. Then down 00:19:32.280 |
here we have our timestamp which is what we use in order to do the metadata 00:19:37.840 |
filtering that you'll see later. Okay great so let's come down and what we can 00:19:43.080 |
do we can run this. This is just showing a few of the reviews we already looked at 00:19:53.240 |
Okay and what we can do is we're just going to count the sentiment of 00:19:59.400 |
everything returned. So we have negative, neutral, positive you can see here we 00:20:04.140 |
have the labels here. So we're just going to count out of everything we return 00:20:08.600 |
what is the overall sentiment that we're seeing. So we run this or I think I'm 00:20:17.160 |
skipping ahead a little bit. I need to run this. Okay so you see you get negative, 00:20:21.280 |
neutral, positive, way more positive than anything else there. So surprisingly 00:20:26.960 |
people are actually happy with the room sizes for London hotels but generally 00:20:31.800 |
speaking we don't want to just know this for all London hotels; maybe we want 00:20:36.120 |
to be a little more specific in what we're searching for. So let's go through 00:20:40.080 |
and see how we can do that. So first thing we want to look at is the metadata 00:20:45.320 |
filtering for fine-tuning our search to a specific date range. So what we're going to do 00:20:50.560 |
is get a start time and end time. So we're going to do 2015 December so 00:20:57.240 |
Christmas up to the end of the year in 2015 as well and that will just generate 00:21:04.240 |
those integer numbers that you saw before or there might be floats actually. 00:21:09.320 |
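The video uses dateutil's parser for this; here's a stdlib-only sketch of the same idea, along with the shape of the metadata filter it feeds into (`$gte`/`$lte` are Pinecone's range operators):

```python
from datetime import datetime, timezone

# December 2015, converted to numeric POSIX timestamps
start_time = datetime(2015, 12, 1, tzinfo=timezone.utc).timestamp()
end_time = datetime(2015, 12, 31, tzinfo=timezone.utc).timestamp()

# the metadata filter passed as index.query(..., filter=date_filter)
date_filter = {"timestamp": {"$gte": start_time, "$lte": end_time}}

print(start_time < end_time)  # → True
```

Because the dates are plain numbers, Pinecone can apply greater-than/less-than comparisons to them during the search.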
So the timestamps. Yep and we're basically going to say to Pinecone we 00:21:14.920 |
only want to search within the space where we have records that 00:21:19.040 |
have a timestamp that is larger than this value and less than this value. So to do 00:21:25.180 |
that we're going to add a filter: timestamp greater than or equal to the 00:21:29.000 |
start time and less than or equal to the end time. Run that and we're just going 00:21:34.160 |
to plot our bar chart here and we see that during this time people are still 00:21:38.680 |
generally quite positive, although there's definitely a lot more negatives 00:21:42.680 |
within that time frame. So another thing we can do is actually apply the same 00:21:47.000 |
thing to different hotels. So we'll apply that and what we're going to do is look 00:21:52.400 |
into five different areas. We're going to look at the room sizes, the cleanliness, staff, 00:21:56.880 |
food and air conditioning. So what we do is we'll have a query for each one of 00:22:03.000 |
these. Are customers happy with room sizes? Are customers satisfied with cleanliness of rooms? 00:22:07.920 |
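Turning each area into a natural-language query can be sketched like this (the exact question phrasings are assumed, not the notebook's verbatim wording):

```python
areas = ["room sizes", "cleanliness", "staff", "food", "air conditioning"]

# reformat each area into a question for the retriever to embed
queries = {area: f"are customers happy with the {area}?" for area in areas}

print(queries["room sizes"])  # → are customers happy with the room sizes?
```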
Essentially we're just taking each one of these parameters and we're reformatting it 00:22:12.120 |
into more of a natural language question. So we run this, come down and we're going 00:22:20.240 |
to go through each of the hotels that we pulled from up here. So Strand Palace 00:22:26.440 |
Hotel, Britannia, Grand Royale and so on and so on and we're just going to 00:22:31.040 |
iterate through those. We're going to filter for each one of those hotel names 00:22:35.020 |
here so that we're returning results specific to each one of those hotels. 00:22:39.380 |
We need to make sure that we encode the query here as we did 00:22:43.600 |
before to create our query embedding. We're going to count the sentiment as we 00:22:47.560 |
did before and then we're going to store all of this into a set of results here. 00:22:52.680 |
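The count-the-sentiment step inside that loop boils down to tallying the labels over the returned matches; a sketch with stubbed query results standing in for what Pinecone returns:

```python
from collections import Counter

# stand-in for the matches returned by index.query(..., include_metadata=True)
matches = [
    {"metadata": {"label": "positive"}},
    {"metadata": {"label": "positive"}},
    {"metadata": {"label": "negative"}},
    {"metadata": {"label": "neutral"}},
]

# tally how many of the returned reviews carry each sentiment label
counts = Counter(m["metadata"]["label"] for m in matches)
print(counts["positive"], counts["negative"], counts["neutral"])  # → 2 1 1
```

Doing this once per (hotel, area) pair is what produces the grid of results that gets visualized next.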
Okay so let's go and do this. Area by the way is just what we set up here. So room 00:22:59.720 |
size, cleanliness and so on. So we'll go through each one of those and get our 00:23:04.880 |
results. Okay so that will run for a moment. Okay so 10 seconds to run there 00:23:11.400 |
and then what we can do is take a look. So we have all these results here and 00:23:16.400 |
then we can go ahead and visualize that to make a little more sense of it and 00:23:23.400 |
there we go. So we have all the different parameters here. Room size, cleanliness, 00:23:29.840 |
staff, food, AC for each one of those hotels. So generally speaking looks like 00:23:37.160 |
everyone just loves Intercontinental London the O2. Grand Royale London Hyde 00:23:42.280 |
Park seems pretty popular as well, and we should never stay here, especially if 00:23:47.840 |
you want AC. So you can go through and read each of these. These are just kind 00:23:52.760 |
of like an analysis that we've pulled from each one of these results but 00:23:58.640 |
generally speaking you can see really easily how useful sentiment mining can be. Just 00:24:04.560 |
analyzing the customer sentiment around in this case hotels but also products or 00:24:11.040 |
services or cafes you know all these different things and probably a lot of 00:24:16.040 |
other things that have absolutely nothing to do with customer reviews and 00:24:20.440 |
we can of course also do this in pretty much near real time. So we can run our 00:24:27.360 |
sentiment analysis model on reviews coming in, upsert them into Pinecone and 00:24:31.720 |
that comes with an almost instant refresh. So we can then update these visuals and 00:24:38.520 |
our analysis over time in near real time which is pretty cool. So that's it for 00:24:46.360 |
this video, this example. I hope this has been interesting and useful. Thank you 00:24:53.160 |
very much for watching and I will see you again in the next one. Bye.