How to Index Q&A Data With Haystack and Elasticsearch
Chapters
0:0
0:53 install it on Windows using the MSI installer
11:13 index all of these documents into our Elasticsearch
12:29 count the number of entries
00:00:00.000 |
Okay, so in this video what we're going to do is actually index 00:00:07.200 |
all of our paragraphs from Meditations by Marcus Aurelius 00:00:11.600 |
and to do this we are going to be using the Elasticsearch document store. 00:00:16.960 |
So of course if we're using Elasticsearch we first need to actually 00:00:20.560 |
download and install it so I'm just going to take you through 00:00:29.280 |
And all we need to do is head on over to this website up here 00:00:35.520 |
elastic.co, and you can see the address just there. Now I'm going 00:00:43.040 |
to follow the instructions for Windows but of course if you're on Linux or Mac 00:00:47.520 |
just follow through it's very similar either way. 00:00:57.920 |
using the MSI installer. So just scroll down here and we can see 00:01:03.440 |
we can download the package from this link so download 00:01:14.960 |
once you see this window pop up we just go through with all of the default 00:01:19.440 |
settings. So install as a service and continue 00:01:23.840 |
through obviously if you do need to change anything change it 00:01:28.320 |
but for me there's nothing here that I want to modify. 00:01:32.000 |
Notice here we have the HTTP port and we're using 00:01:35.520 |
9200; we'll be using that later. We just continue through here with the default 00:01:40.800 |
settings and then we click install and we just 00:01:46.560 |
Okay so now that we've installed Elasticsearch we can 00:01:51.200 |
go ahead and actually check that it's running. 00:01:54.560 |
So to do that we're going to import Python requests 00:02:00.000 |
and whenever we interact with Elasticsearch it's either going to be 00:02:04.720 |
through Haystack or it will be through the requests library, and we'll just 00:02:10.000 |
interact with the Elasticsearch API. So to check the health of our cluster 00:02:19.920 |
so essentially check that it's actually up and running 00:02:23.360 |
all we need to do is send a GET request to localhost, and if you remember 00:02:30.800 |
earlier, it was port 9200. Of course, if the port 00:02:35.600 |
on yours was different, modify it; this is just the default value. 00:02:40.240 |
After this we need to reach out to the _cluster endpoint 00:02:44.560 |
and we are checking the health, and then we'll just 00:02:47.680 |
format that as a JSON. So what you should see here 00:02:51.920 |
is we have our cluster name, which is elasticsearch. It 00:02:55.520 |
may have a different name if you modified it, but by default it's elasticsearch. 00:02:59.680 |
The status is yellow, which here basically just means we only have one node up and 00:03:05.760 |
running. You can have multiple nodes in Elasticsearch, 00:03:08.880 |
and for your cluster health to be green it will expect the 00:03:16.480 |
shards of your indices to have backup (replica) shards on different nodes, and 00:03:21.920 |
obviously we can't do that if we only have one node. But it's completely fine 00:03:25.040 |
for us because we're just in development. If you're in production, 00:03:28.080 |
yes, you probably want to have those backup shards. 00:03:32.720 |
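As a rough sketch of the health check just described: the snippet below builds the `_cluster/health` URL for the default port 9200 and decodes the JSON reply. It uses the standard library's `urllib` instead of `requests` only so that it runs without extra packages, and the function names are illustrative, not taken from the video.

```python
import json
import urllib.request


def cluster_health_url(host="localhost", port=9200):
    """Build the URL of Elasticsearch's cluster health endpoint."""
    return f"http://{host}:{port}/_cluster/health"


def get_cluster_health(host="localhost", port=9200, timeout=5):
    """Send a GET request to the health endpoint and decode the JSON body."""
    url = cluster_health_url(host, port)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def explain_status(status):
    """Translate the status colour into a short description."""
    meanings = {
        "green": "all primary and replica shards are allocated",
        "yellow": "all primary shards are allocated, but some replicas are not",
        "red": "at least one primary shard is not allocated",
    }
    return meanings.get(status, "unknown status")
```

With Elasticsearch running locally, `get_cluster_health()` should return a dictionary with keys such as `cluster_name` and `status`, and `explain_status(health["status"])` spells out what the colour means.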
if none of that made any sense don't worry about it we really don't need to 00:03:39.840 |
Now what we can also do is check if we have any indices. 00:03:48.960 |
Now if I take a look at mine, I will already have some indices set up, 00:03:55.600 |
which I've just set up prior to recording this. 00:04:09.760 |
This time we want to call the cat API, which is what we would call 00:04:17.600 |
whenever we want to see data in a human-readable table format 00:04:21.760 |
rather than JSON, and what we're checking here are the indices, 00:04:28.880 |
and we'll just add text onto there so we can actually see that. 00:04:32.800 |
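A minimal sketch of that cat API call, assuming the default host and port: the `_cat/indices` endpoint returns a plain-text table rather than JSON, so the body is decoded as text. Again `urllib` stands in for `requests` to keep the example dependency-free, and the helper names are made up for illustration.

```python
import urllib.request


def cat_indices_url(host="localhost", port=9200, header=True):
    """Build the URL for the cat indices API; ?v asks for a header row."""
    url = f"http://{host}:{port}/_cat/indices"
    return (url + "?v") if header else url


def get_indices_table(host="localhost", port=9200, timeout=5):
    """Return the plain-text table listing every index in the cluster."""
    url = cat_indices_url(host, port)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling `print(get_indices_table())` against a running node prints one row per index, which is easier to read than the raw string with its newline characters.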
This is quite messy, so if we just print it instead it will 00:04:36.800 |
look a bit cleaner. Okay, so you can see I have these. 00:04:44.000 |
You won't have either of those, so don't 00:04:47.040 |
worry about that. Now what we are going to do is create a 00:04:52.320 |
new index, which will be called aurelius, and that 00:05:00.720 |
Now to actually implement that we will be going through Haystack, 00:05:14.720 |
and what we want to do is from haystack dot document store 00:05:28.880 |
ElasticsearchDocumentStore. So this is our document store class, 00:05:35.520 |
and of course this is not aware of our Elasticsearch 00:05:39.200 |
instance; we need to initialize that, so we'll store it in a 00:05:49.440 |
and all we write is ElasticsearchDocumentStore. 00:05:53.840 |
Now we need to initialize it with the parameters so it knows 00:05:57.200 |
where to connect to our Elasticsearch instance: 00:06:07.120 |
localhost. Now if you have a username and password set, which you don't by 00:06:13.360 |
default, you will need to enter them in here. I don't have any set, so 00:06:25.200 |
and then we also need to specify our index. At the moment we don't have an 00:06:29.040 |
aurelius index, and that's fine because this will initialize it for us. 00:06:37.760 |
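Here is a hedged sketch of that initialization. The parameters (`host`, `username`, `password`, `index`) are the ones mentioned above; the import path is an assumption based on the older Haystack release this video appears to use, and newer releases expose the class from a different module. The two helper functions are illustrative names, not part of Haystack.

```python
def document_store_settings(index="aurelius", host="localhost",
                            username="", password=""):
    """Collect the connection parameters described in the video."""
    return {
        "host": host,
        "username": username,  # empty by default: no credentials are set
        "password": password,
        "index": index,        # created on first use if it does not exist
    }


def connect_document_store(**overrides):
    """Create the document store that talks to the local Elasticsearch node."""
    # Assumed import path for older Haystack releases; adjust it to match
    # the Haystack version you have installed.
    from haystack.document_store.elasticsearch import ElasticsearchDocumentStore

    return ElasticsearchDocumentStore(**document_store_settings(**overrides))
```

Usage would be something like `doc_store = connect_document_store()`, where `doc_store` is just a placeholder variable name.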
Now if we go down here we can see what it actually did: 00:06:45.120 |
it sent a PUT request to localhost:9200/aurelius, 00:06:52.960 |
so that's how you create a new index. 00:06:56.720 |
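To make that PUT request concrete, here is a small sketch that prepares the same kind of request by hand. It is a bare request with no body; Haystack's own request most likely also carries mapping settings, which this sketch leaves out. Note that Elasticsearch index names must be lowercase. The function names are illustrative.

```python
import urllib.request


def create_index_request(index, host="localhost", port=9200):
    """Prepare (but do not send) the PUT request that creates an index."""
    url = f"http://{host}:{port}/{index}"
    return urllib.request.Request(url, method="PUT")


def create_index(index, host="localhost", port=9200, timeout=5):
    """Send the PUT request and return the HTTP status code.

    urlopen raises urllib.error.HTTPError if Elasticsearch rejects the
    request, for example when the index already exists.
    """
    request = create_index_request(index, host, port)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status
```

Since the document store already created the index for us, there is no need to run `create_index("aurelius")` here; it only shows what happens under the hood.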
What we want to do first is import our data. 00:07:02.880 |
We have the data here, which I got from this website; 00:07:13.600 |
you can find it on GitHub. I'll keep a link in the description so you can just go and 00:07:19.760 |
copy that if you need to. Now I haven't really done much 00:07:29.760 |
that data, so we do that with open, and from here that data file is 00:07:37.360 |
located two folders up in a data folder; it's called meditations.txt. 00:07:57.680 |
Then if we just have a quick look at the first 100 characters there, we 00:08:05.760 |
see that we have this newline character, and that 00:08:24.160 |
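A minimal sketch of the loading step, assuming the paragraphs in the file are separated by single newline characters as suggested above. The function name is illustrative, and the relative path in the usage note is inferred from "two folders up in a data folder".

```python
from pathlib import Path


def load_paragraphs(path):
    """Read the text file and split it into one string per line."""
    text = Path(path).read_text(encoding="utf-8")
    return text.split("\n")
```

Something like `data = load_paragraphs("../../data/meditations.txt")` followed by `data[0][:100]` and `len(data)` would then reproduce the quick checks done in the video.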
Then if we check the length of that, we see that we have 00:08:27.920 |
508 separate paragraphs in there. So what we now want to do 00:08:36.160 |
is modify this data so that it's in the correct format 00:08:41.680 |
for Haystack and Elasticsearch. That format looks like this: it 00:08:52.160 |
dictionary looks like this: we have the text, and inside here we would 00:09:06.800 |
optional field called meta, and meta contains a dictionary, and in 00:09:13.920 |
here we can put whatever we want. For us, I don't think at the moment 00:09:19.280 |
there's really that much to put into here other than 00:09:27.680 |
maybe the source, which is probably a better word to use here. 00:09:35.440 |
Now later on we will probably add a few other books as well, and then the source 00:09:40.640 |
will be different, and when we return that item from our 00:09:45.520 |
retriever and our reader we'll at least be able to see which book 00:09:49.120 |
it came from. It would also be pretty cool to 00:09:53.360 |
maybe include a page number or something, but 00:09:56.400 |
at the moment with this there are no page numbers included. 00:10:05.120 |
So that's the format that we need, and it's going to be a list of these. 00:10:10.480 |
To do that we'll just use a list comprehension, 00:10:16.000 |
so we're going to write this, and let's just copy this. 00:10:21.200 |
I think, yeah, that should be fine; we'll copy this 00:10:25.440 |
and just indent that, and in here we have our paragraph, 00:10:33.440 |
and the source is meditations for all of them, and then we just write 00:10:40.160 |
for paragraph in data. Okay, so yeah, that should work, and if we just 00:10:51.120 |
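The list comprehension described here might look roughly like the sketch below. The `text` and `meta` keys and the `source` value come from the description above; the function name is illustrative. Newer Haystack releases use a `content` key instead of `text`, so treat the key name as tied to the older release used in this video.

```python
def to_documents(paragraphs, source="meditations"):
    """Wrap each paragraph in the dictionary format the document store expects."""
    return [
        {"text": paragraph, "meta": {"source": source}}
        for paragraph in paragraphs
    ]
```

Applied to the 508 paragraphs loaded earlier, this gives a list of 508 dictionaries, each tagged with the book it came from.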
Okay, so that's what we want: we have text, 00:10:55.360 |
we have the paragraph, and then in here we have this meta with the source. 00:11:01.680 |
That looks pretty good, and we'll just double check 00:11:06.240 |
the length again; it should be 508. Okay, perfect. Now what we need to do 00:11:13.280 |
is index all of these documents into our Elasticsearch instance, 00:11:20.480 |
and to do that it's super easy. All we do is 00:11:23.520 |
call docstore, because we're doing this through Haystack now, 00:11:27.200 |
and we do write_documents, and we just pass in our data_json, 00:11:34.800 |
and that should work. Okay, cool, so we can see here what it's done 00:11:42.080 |
is it's sent a POST request to the bulk API, and it sent two of them, 00:11:49.520 |
I assume because it can only send so many documents at once. 00:12:00.720 |
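One plausible explanation for the two requests, offered as an assumption rather than something the video confirms: the Elasticsearch Python client's bulk helper sends documents in chunks, with a default chunk size of 500. The sketch below only illustrates that batching arithmetic; it does not talk to Elasticsearch.

```python
def chunk(items, size=500):
    """Yield successive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# 508 stand-in documents, split the way a 500-item chunk size would split them.
documents = list(range(508))
batch_sizes = [len(batch) for batch in chunk(documents)]
print(batch_sizes)  # [500, 8]
```

Under that assumption, 508 documents end up as one request of 500 and a second request of 8, which matches the two POST requests seen in the log.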
Now we want to check that we actually have 508 documents in our Elasticsearch instance. 00:12:08.240 |
To do that we're going to revert back to requests, 00:12:22.800 |
9200, and here we need to specify the index that we want to count the 00:12:30.400 |
number of entries in, and then all we do is add _count onto the 00:12:34.560 |
end there. This will return a JSON object, so we 00:12:38.640 |
do this so that we can see it, and sure enough we 00:12:51.280 |
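A short sketch of that count request, assuming the default host and port and the lowercase index name `aurelius`. The `_count` endpoint returns a JSON object whose `count` field holds the number of documents. As before, `urllib` is used in place of `requests` only to avoid a third-party dependency, and the function names are illustrative.

```python
import json
import urllib.request


def count_url(index, host="localhost", port=9200):
    """Build the URL of the _count endpoint for one index."""
    return f"http://{host}:{port}/{index}/_count"


def count_documents(index, host="localhost", port=9200, timeout=5):
    """Return the number of documents stored in the given index."""
    url = count_url(index, host, port)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)["count"]
```

With the documents indexed as above, `count_documents("aurelius")` should come back as 508.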
So up here we had meditations; we've now got that, 00:12:58.640 |
and we've also set up the first part of our stack over here, so Elasticsearch 00:13:06.400 |
now has meditations in there, so we can cross that off. Now the next step 00:13:14.080 |
is setting up our retriever, which we'll cover in the 00:13:17.280 |
next video. So that's everything for this video. I hope you enjoyed it and I will see