
How to Index Q&A Data With Haystack and Elasticsearch


Chapters

0:53 Install it on Windows using the MSI installer
11:13 Index all of these documents into our Elasticsearch
12:29 Count the number of entries

Whisper Transcript

00:00:00.000 | Okay so in this video what we're going to do is actually
00:00:03.600 | index our data so at the moment we just have
00:00:07.200 | all of our paragraphs from Meditations by Marcus Aurelius
00:00:11.600 | and to do this we are going to be using the Elasticsearch document store.
00:00:16.960 | So of course if we're using Elasticsearch we first need to actually
00:00:20.560 | download and install it so I'm just going to take you through
00:00:23.920 | those steps now.
00:00:29.280 | And all we need to do is head on over to this website up here
00:00:35.520 | at elastic.co, and you can see the address just there. Now I'm going
00:00:43.040 | to follow the instructions for Windows but of course if you're on Linux or Mac
00:00:47.520 | just follow through it's very similar either way.
00:00:51.840 | So here we're going to install it on Windows
00:00:57.920 | using the MSI installer. So just scroll down here and we can see
00:01:03.440 | we can download the package from this link so download
00:01:07.200 | that and once you download it just open it
00:01:10.720 | and we'll see this window pop up. So
00:01:14.960 | once you see this window pop up we just go through with all of the default
00:01:19.440 | settings. So install as a service and continue
00:01:23.840 | through obviously if you do need to change anything change it
00:01:28.320 | but for me there's nothing here that I want to modify.
00:01:32.000 | Notice here we have the HTTP port and we're using
00:01:35.520 | 9200, we'll be using that later. We just continue through here, default
00:01:40.800 | settings and then we click install and we just
00:01:43.840 | let that install.
00:01:46.560 | Okay so now that we've installed Elasticsearch we can
00:01:51.200 | go ahead and actually check that it's running.
00:01:54.560 | So to do that we're going to import the Python requests library,
00:02:00.000 | and whenever we interact with Elasticsearch it's either going to be
00:02:04.720 | through Haystack or it will be through the requests library, and we'll just
00:02:10.000 | interact with the Elasticsearch API. So to check the health of our cluster
00:02:19.920 | so essentially check that it's actually up and running
00:02:23.360 | all we need to do is send a GET request to localhost, and if you remember
00:02:30.800 | earlier it was port 9200, of course if the port
00:02:35.600 | on yours was different modify it, this is just the default value
00:02:40.240 | and after this we need to reach out to the cluster endpoint
00:02:44.560 | and we are checking the health and then we'll just
00:02:47.680 | format that as a JSON.
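A minimal sketch of that health check with the requests library, assuming the default local instance on port 9200:

```python
import requests

# ask the Elasticsearch cluster health endpoint whether the node is up
res = requests.get('http://localhost:9200/_cluster/health')
res.json()
```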
00:02:51.920 | So what you should see here is we have our cluster, which is called elasticsearch,
00:02:55.520 | it may have a different name if you modified it, but by default it's elasticsearch,
00:02:59.680 | and the status is yellow, which basically just means we have one node up and
00:03:05.760 | running you can have multiple nodes in Elasticsearch
00:03:08.880 | and for your cluster health to be green it expects the
00:03:16.480 | shards of your indices to have backup (replica) shards across different nodes, and
00:03:21.920 | obviously we can't do that if we only have one node, but it's completely fine
00:03:25.040 | for us because we're just in development, if you're in production
00:03:28.080 | yes, you probably want to have those replica shards
00:03:32.720 | if none of that made any sense don't worry about it we really don't need to
00:03:35.840 | know any of that for what we're doing here
00:03:39.840 | now what we can also do is we can check if we have any indices
00:03:46.560 | already
00:03:48.960 | now if I take a look at mine I already have some indices,
00:03:55.600 | which I set up just prior to recording this
00:04:00.960 | and to check that we go to localhost again
00:04:09.760 | and this time we want to call the cat API, which is what we call
00:04:17.600 | whenever we want to see data in a human-readable table format
00:04:21.760 | rather than JSON, and what we're checking here are the
00:04:26.400 | indices,
00:04:28.880 | and we'll just add .text onto there so we can actually see that
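A sketch of that indices check, again assuming the default localhost:9200 instance:

```python
import requests

# the cat API returns plain-text tables rather than JSON
res = requests.get('http://localhost:9200/_cat/indices')
res.text         # raw string, with the newlines still escaped
print(res.text)  # printing renders the table with proper line breaks
```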
00:04:32.800 | and this is quite messy, so if we just print it instead
00:04:36.800 | it'll look a bit cleaner. Okay so you can see I have these
00:04:40.320 | two indices, you won't have
00:04:44.000 | either of those, so don't
00:04:47.040 | worry about that. Now what we are going to do is create a
00:04:52.320 | new index which will be called Aurelius and that
00:04:56.000 | is where we will put our documents
00:05:00.720 | now to actually implement that we will be going through the Haystack
00:05:06.480 | library, which you can pip install with
00:05:10.560 | pip install farm-haystack,
00:05:14.720 | and what we want to do is from haystack.document_store.elasticsearch
00:05:22.000 | import
00:05:28.880 | ElasticsearchDocumentStore, so this is our document store class
00:05:35.520 | and of course this is not aware of our Elasticsearch
00:05:39.200 | instance, we need to initialize that, so we'll store it in a
00:05:46.240 | variable called doc_store
00:05:49.440 | and all we write is ElasticsearchDocumentStore
00:05:53.840 | now we need to initialize it with the parameters so it knows
00:05:57.200 | where to connect to our elastic search instance
00:06:00.640 | so to do that we write host and this is
00:06:07.120 | localhost, now if you have a username and password set, which you don't by
00:06:13.360 | default you will need to enter them in here I don't have any set so
00:06:19.200 | no worries
00:06:25.200 | and then we also need to specify our index and at the moment we don't have an
00:06:29.040 | Aurelius index and that's fine because this will initialize it for us
00:06:33.520 | so we'll just call it Aurelius
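A minimal sketch of that setup. The import path varies between farm-haystack versions (newer releases use haystack.document_stores), and Elasticsearch index names are lowercase, so the index is written here as 'aurelius':

```python
# pip install farm-haystack
from haystack.document_store.elasticsearch import ElasticsearchDocumentStore

# connect to the local Elasticsearch instance and create/use the 'aurelius' index
doc_store = ElasticsearchDocumentStore(
    host='localhost',
    username='',   # no credentials set on a default local install
    password='',
    index='aurelius'
)
```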
00:06:37.760 | now if we go down here we can see what it actually did so
00:06:45.120 | it sent a PUT request to here, localhost:9200/aurelius,
00:06:52.960 | so that's how you create a new index. After that
00:06:56.720 | what we want to do is first import our data so
00:07:02.880 | we have the data here which I got from this website
00:07:09.920 | and processed with this script, which you can
00:07:13.600 | find on GitHub, I'll keep a link in the description so you can just go and
00:07:19.760 | copy that if you need to. Now I haven't really done much
00:07:23.680 | pre-processing, it's pretty straightforward
00:07:26.640 | and all you need to do here is actually open
00:07:29.760 | that data so we do that with open and from here that data file is
00:07:37.360 | located two folders up in a data folder it's called meditations.txt
00:07:46.560 | I'm going to be reading that
00:07:49.680 | and all we do is data = f.read()
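A sketch of that step. The relative path just reflects the "two folders up" layout described above, so adjust it to wherever your copy of the file lives:

```python
# read the plain-text copy of Meditations into a single string
with open('../../data/meditations.txt', 'r') as f:
    data = f.read()

data[:100]  # peek at the first 100 characters
```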
00:07:57.680 | and then if we just have a quick look at the first 100 characters there we
00:08:05.760 | see that we have this newline character, and that
00:08:09.520 | signifies a new paragraph in the text, so
00:08:15.200 | what we want to do here
00:08:18.320 | is split the data by new line
00:08:24.160 | and then if we check the length of that we see that we have
00:08:27.920 | 508 separate paragraphs in there.
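A sketch of that split, carrying on from the snippet above:

```python
# one paragraph per line in the source file, so split on newlines
data = data.split('\n')
len(data)  # 508 paragraphs
```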
00:08:36.160 | So what we now want to do is modify this data so that it's in the correct format
00:08:41.680 | for Haystack and Elasticsearch, so that format looks like this, so it
00:08:48.880 | expects a list of dictionaries where each
00:08:52.160 | dictionary looks like this, with a text field, and inside here we would
00:08:58.800 | have our paragraph so each one
00:09:03.440 | of these items here and then there's another
00:09:06.800 | optional field called meta and meta contains a dictionary and in
00:09:13.920 | here we can put whatever we want so for us I don't think at the moment
00:09:19.280 | there's really that much to put into here other than
00:09:23.920 | where it came from, so the book, or
00:09:27.680 | maybe the source is probably a better word to use here,
00:09:32.480 | and all of these are coming from meditations
00:09:35.440 | now later on we will probably add a few other books as well and then the source
00:09:40.640 | will be different and when we return that item from our
00:09:45.520 | retriever and our reader we'll at least be able to see which book
00:09:49.120 | it came from. It would also be pretty cool to
00:09:53.360 | maybe include a page number or something, but
00:09:56.400 | at the moment with this there are no page numbers included, so
00:10:00.880 | we're not doing that at the moment.
00:10:05.120 | So that's the format that we need, and it's going to be a list of these.
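To make that concrete, each entry would look something like this (the source label is just the one chosen above):

```python
{
    'text': '<one paragraph from Meditations>',
    'meta': {'source': 'meditations'}
}
```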
00:10:10.480 | so to do that we'll just do some list comprehension
00:10:16.000 | so we're going to write this and let's just copy this
00:10:21.200 | I think yeah that should be fine we'll copy this
00:10:25.440 | and just indent that, and in here we have our paragraph,
00:10:33.440 | and the source is meditations for all of them, and then we just write
00:10:36.720 | for paragraph
00:10:40.160 | in data. Okay so yeah that should work, and if we just
00:10:46.240 | check what we have here
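A sketch of that list comprehension (data_json is just an assumed variable name here):

```python
# wrap every paragraph in the dictionary format Haystack expects
data_json = [
    {'text': paragraph, 'meta': {'source': 'meditations'}}
    for paragraph in data
]

data_json[0]    # inspect the first entry
len(data_json)  # should still be 508
```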
00:10:51.120 | okay so that's what we want, so we have text,
00:10:55.360 | we have the paragraph and then in here we have this meta with the source
00:10:58.560 | which is always meditations at the moment so
00:11:01.680 | that looks pretty good and we'll just double check
00:11:06.240 | the length again it should be 508 okay perfect now what we need to do
00:11:13.280 | is index all of these documents into our Elasticsearch instance,
00:11:20.480 | and to do that it's super easy, all we do is
00:11:23.520 | call doc_store, because we're doing this through Haystack now,
00:11:27.200 | and we call write_documents and we just pass in our data_json.
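A minimal sketch of that call, continuing from the document store set up earlier:

```python
# push all 508 paragraph dictionaries into the 'aurelius' index
doc_store.write_documents(data_json)
```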
00:11:34.800 | And that should work. Okay cool, so we can see here what it's done,
00:11:42.080 | it's sent a POST request to the bulk API, and sent two of them
00:11:49.520 | I assume because it can only send so many documents at once, so that's
00:11:56.240 | pretty cool, and now what I want to check is
00:12:00.720 | that we actually have 508 documents in our Elasticsearch instance
00:12:08.240 | so to do that we're going to revert back to requests
00:12:11.920 | so we do requests.get again go to our
00:12:20.000 | localhost
00:12:22.800 | 9200, and here we need to specify the index that we want to count the
00:12:30.400 | number of entries in, and then all we do is add _count onto the
00:12:34.560 | end there, and this will return a JSON object, so we
00:12:38.640 | call .json() on it so that we can see it.
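A sketch of that count request, again assuming the local default port and the 'aurelius' index name used above:

```python
# ask Elasticsearch how many documents the aurelius index now holds
res = requests.get('http://localhost:9200/aurelius/_count')
res.json()  # expect {'count': 508, ...}
```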
00:12:42.080 | And sure enough, we have 508 items in that document store.
00:12:47.360 | so if we head on back to our original plan
00:12:51.280 | so up here we had meditations we've now got that
00:12:58.640 | and we've also set up the first part of our stack over here, so Elasticsearch
00:13:06.400 | now has Meditations in there, so we can cross that off. Now the next step
00:13:14.080 | is setting up our retriever, which we'll cover in the
00:13:17.280 | next video. So that's everything for this video, I hope you enjoyed it and I will see
00:13:25.040 | you again in the next one.