Unicode Normalization for NLP in Python
Chapters
0:00 Intro
0:40 Diacritics
3:40 Decomposition
5:32 Conversion
11:48 Normal Form
Okay, so we're going to take a look at Unicode normalization. Unicode normalization is something that we use when we have those weird font variants that people always use on the internet. If you've ever seen people using those odd characters, I think they use them to express some form of individuality or to catch your attention. And then we also have another issue where we have weird glyphs in text, and this one is more reasonable because it's actually a part of language: you have the accents above the E's and so on in Italian or Spanish. Those little glyphs, taken all together, are called diacritics. And whenever we come across diacritics or that weird text, we can get issues when we're building models.
The issue with the weird text is obvious: someone has written "hello world" in normal text, and we're comparing it to a "hello world" written in weird text with circles around every letter. We can't actually compare them like for like, because our models, or our code in general, are not going to be able to match those two different Unicode character sequences. The issue with diacritics is that those characters often have a hidden property, in that we have one Unicode character, which is the capital C with cedilla, but then we have an identical-looking pair of characters, which is, for example, the Latin capital C immediately followed by something called a combining cedilla character. Together they look exactly like the other Unicode character, which is quite difficult to deal with. So we have these two problems, and we use Unicode normalization to deal with them when we're building NLP models.
Unicode defines two types of equivalence for cases where equivalent characters are not literally equal. The first of those is compatibility equivalence. That covers things like font variants, different line-break sequences, circled variants, superscripts, subscripts, fractions, and a few other things as well. With those, we treat the "hello world" with circles and the plain "hello world" as one and the same, because that's how we read it and that's how it's supposed to be interpreted. That is what compatibility equivalence is for, and we'll see how to handle it pretty soon.
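As a quick illustration, circled letters carry a compatibility decomposition back to the plain letters. Here's a minimal sketch using Python's unicodedata module and the NFKD form we'll cover shortly (the circled string is my own example):

```python
import unicodedata

circled = "\u24D7\u24D4\u24DB\u24DB\u24DE"  # circled Latin letters

# NFKD applies compatibility decomposition, stripping the circled styling:
print(unicodedata.normalize("NFKD", circled))  # hello
```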
And then we also have canonical equivalence, which is the thing with the accents and the glyphs I mentioned before. There are a few different cases of that, but the two that I think are most relevant are the combined characters, so we have that C with cedilla character, where the capital C plus the combining cedilla character merge together, and then we also have conjoined Korean characters, which I think are pretty common as well. Canonical equivalence is much more to do with characters where we can't really see that they are different, but they are in fact different. Compatibility equivalence is more to do with characters that were purposely made different, where in reality the meaning is the same.
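To make the Korean (Hangul) case concrete, here's a small sketch (가 is just an arbitrary example syllable):

```python
import unicodedata

syllable = "\uAC00"  # a precomposed Hangul syllable

# Canonical decomposition splits the syllable into its conjoining jamo:
decomposed = unicodedata.normalize("NFD", syllable)
print([hex(ord(ch)) for ch in decomposed])  # ['0x1100', '0x1161']

# Canonical composition merges the jamo back into a single code point:
print(unicodedata.normalize("NFC", decomposed) == syllable)  # True
```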
So we have two different directions for how we can transform these characters. We have decomposition, which is breaking down Unicode characters into smaller or more "normal" parts, and then we have composition, which is taking multiple Unicode characters and merging them into a single character.
If we take a look here, this is our C with cedilla, and we can see what it looks like: it has this C with a little cedilla at the bottom. Then on the other side we have these two characters, and if we take a look we can see, okay, this is the C plus cedilla, so these are two separate Unicode characters. And yet they look exactly the same again, and obviously that's where our problem is. So what we can do is decompose them into their different parts. The C plus cedilla pair is already separated, so when we decompose it we just get the same thing back. The single character, when we decompose it, basically gives us those two different parts, which are the Latin capital C and the combining cedilla character. We can then perform canonical composition to put those back together and merge them into the capital C with cedilla. That's essentially how decomposition and composition work. It's slightly different for compatibility decomposition, but we'll talk about that quite soon.
When we take the fact that we have these two different directions, composition and decomposition, and combine them with our two types of equivalence, compatibility and canonical, we get the different normal forms. The first is NFD, canonical decomposition, which is what I showed you here, where we're decomposing those characters into their individual parts.
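For reference, here's a sketch of all four normal forms applied to the same character with Python's unicodedata module:

```python
import unicodedata

s = "\u00C7"  # LATIN CAPITAL LETTER C WITH CEDILLA

# D = decomposition, C = composition; K = compatibility, no K = canonical.
for form in ("NFD", "NFC", "NFKD", "NFKC"):
    result = unicodedata.normalize(form, s)
    print(form, [hex(ord(ch)) for ch in result])

# NFD  ['0x43', '0x327']  C + combining cedilla
# NFC  ['0xc7']           composed back into one code point
# NFKD ['0x43', '0x327']  same as NFD here, this character has no compatibility variant
# NFKC ['0xc7']           same as NFC here
```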
And if we just take a look at how to actually do this in Python: we'll place our first character here, and this is our C with cedilla character. The other one is where it's kind of both together, so I'm just going to call it C plus cedilla. That starts with the Latin capital C, \u0043, which I'll print out so we can see it before we put the cedilla on the end. Obviously these look the same, but if we compare them, we get False.
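A minimal sketch of that setup (the variable names are my own):

```python
c_with_cedilla = "\u00C7"        # C with cedilla as a single code point
c_plus_cedilla = "\u0043\u0327"  # Latin capital C + combining cedilla

print(c_with_cedilla, c_plus_cedilla)    # both render identically
print(c_with_cedilla == c_plus_cedilla)  # False, the code points differ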
To deal with that, this is where we need to use our canonical decomposition. To do all of this we're going to need to import Python's unicodedata module. In this case we're using NFD, which is canonical decomposition, and passing our C with cedilla, because we want to break it down into its two different parts. On the other side we have our C plus cedilla, which is already two characters. And now we see that they match: what we've done is convert a single character into the two separate characters.
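Roughly, that step looks like this:

```python
import unicodedata

c_with_cedilla = "\u00C7"
c_plus_cedilla = "\u0043\u0327"

# NFD decomposes the single character into C + combining cedilla,
# so the two sides now match:
print(unicodedata.normalize("NFD", c_with_cedilla) == c_plus_cedilla)  # True
```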
So with canonical decomposition we broke those apart. On the other side we have canonical composition, where we build them back up into one. But if we apply it to the composed character, we're not going to get the right answer: they won't match, because we're composing the single character back into itself, so it's just that same single character again, compared against the two separate characters. We actually need to switch which side we apply the function to. And then we'll see that now we get True, because what we've done is merge the two separate characters into the single composed character.
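Something like this, with the normalization switched to the decomposed side:

```python
import unicodedata

c_with_cedilla = "\u00C7"
c_plus_cedilla = "\u0043\u0327"

# Composing the already-composed side changes nothing, so no match:
print(unicodedata.normalize("NFC", c_with_cedilla) == c_plus_cedilla)  # False

# Composing the two-character side merges it back into one code point:
print(unicodedata.normalize("NFC", c_plus_cedilla) == c_with_cedilla)  # True
```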
That's how we normalize for canonical equivalence, which is essentially the case where we can't actually see the difference. On the other side we have the case where people are using the weird text. In our abbreviations, the two remaining forms have a K in them, and that K means compatibility: where there isn't a K, we're using canonical equivalence; where there is a K, we're using compatibility equivalence.
The first of those is normal form KD (NFKD), which is compatibility decomposition. This breaks down the fancy or alternative characters into their smaller parts, if they do have smaller parts. So, for example, fractions: if we have the 1/2 fraction, that will get broken down into the numbers 1 and 2 plus a fraction-slash character, which we can actually see down here. And we also have our fancy characters: where we have this fancy capital H, we decompose it into just a normal Latin capital letter H.
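The fraction case, as a quick sketch:

```python
import unicodedata

half = "\u00BD"  # VULGAR FRACTION ONE HALF

# NFKD breaks the fraction into 1 + FRACTION SLASH + 2:
decomposed = unicodedata.normalize("NFKD", half)
print(decomposed)                           # 1, fraction slash, 2
print([hex(ord(ch)) for ch in decomposed])  # ['0x31', '0x2044', '0x32']
```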
So we're just going to switch what we're actually using: I'm going to swap out the C with cedilla for this fancy H. In fact, we can leave the rest as it is, because we can at least see what we're doing now. We want to compare that to just a normal letter H, so what we need to do is normalize the fancy H and decompose it into the plain capital H character. We're going to use our normalize function again, with K for compatibility equivalence and D because we're decomposing. And now you can see that we are getting True. If we just print out the result of this function, we can see, okay, great, it's just taking that fancy H and converting it into something normal.
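In code, roughly (I'm assuming U+210C, the black-letter capital H, as the fancy H; the exact character used in the video may differ):

```python
import unicodedata

fancy_h = "\u210C"  # assumed stand-in for the "fancy" capital H

print(unicodedata.normalize("NFKD", fancy_h) == "H")  # True
print(unicodedata.normalize("NFKD", fancy_h))         # H
```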
And that leads us on to our final normal form, which is normal form KC (NFKC). First there's a compatibility decomposition, which is what we've just done, and then there's a second step, which is canonical composition, where we build those different parts back up. This allows us to normalize all variants of a given character into a single shared form.
To see why we need that, we can add the combining cedilla to our fancy H to make a new character. So we just put that straight in, then come up here and get our combining cedilla, put that in, and if we put those together we get this weird character. Now, if we want to compare that to another character, the H with cedilla, which is a single Unicode character, we're going to have some issues, because that one is just one character. With NFKD we can give it a go. So we'll add this in, and okay, we'll get False, and that's because NFKD is breaking our sequence down into two different parts: an H and this combining cedilla. If I just remove the comparison and print it out, you can see, okay, they look the same, but they're not the same, because we have those two characters again. This is where we need canonical composition to bring them together into a single character.
Initially we have our compatibility decomposition, and if we go across we have this final step, which is canonical composition; together they give us the NFKC result. And if we add this, we can see, okay, now we're getting what we need.
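A sketch of that final comparison (assuming U+1E28, LATIN CAPITAL LETTER H WITH CEDILLA, as the single-character target):

```python
import unicodedata

fancy_h_cedilla = "\u210C\u0327"  # assumed fancy H + combining cedilla
h_with_cedilla = "\u1E28"         # H with cedilla as a single code point

# NFKD alone leaves two code points, H + combining cedilla, so no match:
print(unicodedata.normalize("NFKD", fancy_h_cedilla) == h_with_cedilla)  # False

# NFKC decomposes and then composes, giving the single composed character:
print(unicodedata.normalize("NFKC", fancy_h_cedilla) == h_with_cedilla)  # True
```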
In reality, I think for most cases, or almost all that I can think of anyway, NFKC is the form to use, because it's going to provide you with the cleanest, simplest, most normalized dataset. So when going forward with your language models, this is definitely the form that I would go with. Now, of course, you can mix it up and use the different ones, but if this is quite confusing, I would definitely recommend taking these Unicode characters, playing around with them a little bit, applying these normal-form functions to them, and just seeing what happens; I think it'll probably click quite quickly. So thank you for watching, and I'll see you again in the next one.