
Unicode Normalization for NLP in Python


Chapters

0:00 Intro
0:40 Diacritics
3:40 Decomposition
5:32 Conversion
11:48 Normal Form

Whisper Transcript

00:00:00.000 | Okay, so we're going to take a look at Unicode normalization.
00:00:04.520 | Unicode normalization is something that we use when we have those weird
00:00:10.100 | font variants that people always use on the internet.
00:00:13.140 | So if you've ever seen people using those odd characters, I think they use them to express some form of
00:00:20.900 | individuality or to catch your attention. And then we also have
00:00:26.460 | another issue where we have weird glyphs in text, and this is more reasonable because it's actually a part of
00:00:34.380 | language. So you have the accents above the E's and so on in Italian or Spanish, and
00:00:40.220 | those little glyphs, all together, are called diacritics.
00:00:44.760 | And whenever we come across diacritics or that weird text, we can get issues when we're building models.
00:00:52.900 | The issue with the weird text is obvious: someone has got "hello world" in normal text
00:01:00.380 | and we're comparing it to "hello world" in some weird text with circles around every letter.
00:01:05.280 | We can't actually compare them like-for-like, because our
00:01:08.940 | models, or code in general, are not going to be able to match those two different Unicode
00:01:15.840 | character sets. And the issue with diacritics is that those characters often have a hidden property, in
00:01:23.320 | that we have one Unicode character, which is the capital C with cedilla, but then we have an
00:01:30.540 | identical-looking pair of characters, which is the Latin capital C
00:01:36.980 | immediately followed by something called a combining cedilla character, and
00:01:42.680 | together they look exactly like the other Unicode character.
00:01:47.700 | And this is
00:01:50.920 | quite difficult to deal with. So we have these two problems, and
00:01:56.200 | we use Unicode normalization to actually deal with them when we're building NLP models.
00:02:02.080 | So, as I kind of said, there are two forms of
00:02:05.320 | equivalence for characters that look equivalent but are not really equivalent.
00:02:09.560 | The first of those is compatibility equivalence. That covers stuff like font variants,
00:02:14.720 | different line break sequences, circled variants, superscripts, subscripts, fractions, and a few other things as well.
00:02:22.200 | Now, we want a model to see both
00:02:26.600 | "hello world" with the circled letters and just plain "hello world" as
00:02:31.720 | one, because that's how we read it and that's how it's supposed to be interpreted, and
00:02:37.840 | that is what compatibility equivalence is for.
00:02:42.160 | We'll look at how we actually deal with that
00:02:46.380 | pretty soon. And then we also have canonical equivalence, which is the
00:02:52.040 | thing with the accents and the glyphs I mentioned before.
00:02:55.200 | So there are a few different cases of that, but the two that I think are most relevant
00:03:02.440 | are the combined characters, so we have that C with cedilla character, and then we also have the capital C
00:03:09.560 | plus the combining cedilla character merged together, and
00:03:14.280 | then we also have conjoined Korean characters, which I think are pretty common as well.
00:03:21.360 | Canonical equivalence is much more to do with characters where we can't really see that they are different, but they are in fact different,
00:03:31.080 | whereas
00:03:32.400 | compatibility equivalence is more to do with characters that were purposely made different while in reality the meaning is the same.
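
As a quick illustration of the two kinds of not-quite-equivalence (a minimal sketch; these are the same characters the video uses later):

```python
# Canonical equivalence: visually identical, but different code points.
print("\u00C7" == "C\u0327")  # Ç vs C + combining cedilla -> False

# Compatibility equivalence: deliberately different glyph, same meaning.
print("\u210C" == "H")        # ℌ (black-letter capital H) vs H -> False
```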
00:03:39.800 | So we have two different directions for how we can transform
00:03:45.800 | our text between these two different forms.
00:03:49.400 | We have decomposition, which is breaking down
00:03:53.720 | Unicode characters into smaller or more normal parts, and then we have composition, which is
00:04:01.540 | taking multiple Unicode characters and merging them into a single
00:04:06.640 | accepted Unicode character.
00:04:09.460 | So I've got this example here,
00:04:12.140 | so this is U+
00:04:14.980 | 00C7.
00:04:17.900 | If we take a look here, this is our C with cedilla, and
00:04:23.860 | we see here, this is what it looks like: it has this C and it's got a little cedilla at the bottom.
00:04:30.180 | Then on the other side we have these two characters here, and if we just take a look
00:04:35.540 | here we can see, okay, this is the C plus cedilla. So these are two separate Unicode characters.
00:04:41.820 | Then we see, okay, they actually look exactly the same again, and obviously that's where our problem is.
00:04:46.500 | So what we can do is decompose them into
00:04:51.460 | their different parts. Now, these are already separated, so when we decompose them, we just get the same thing again,
00:04:58.140 | whereas for our C with cedilla character,
00:05:01.940 | we decompose that and we basically get these two different parts, which is the Latin capital C and
00:05:08.100 | the combining cedilla character, and then we can perform canonical composition to put those
00:05:14.860 | both together and merge them back into the capital C with cedilla.
00:05:21.740 | And that's essentially how decomposition and composition work. Also, it's slightly different for
00:05:27.460 | compatibility decomposition, but we'll talk about that quite soon.
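
If you want to inspect that mapping yourself, Python's unicodedata library exposes the decomposition recorded in the Unicode character database (a minimal sketch):

```python
import unicodedata

# Ç's canonical decomposition is recorded as
# U+0043 (Latin capital C) followed by U+0327 (combining cedilla).
print(unicodedata.decomposition("\u00C7"))  # '0043 0327'

# A plain C has no further decomposition.
print(unicodedata.decomposition("C"))       # '' (empty string)
```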
00:05:32.100 | When we take the fact that we have these two different directions, composition and decomposition, and
00:05:38.100 | we have our two types of equivalence, which are compatibility and
00:05:44.780 | canonical equivalence,
00:05:48.540 | we get these four forms.
00:05:51.420 | So we have normal form D (NFD), which is canonical
00:05:56.100 | decomposition, which is what I showed you here, where we're decomposing those characters into their individual parts.
00:06:03.860 | And if we just take a look at how to actually do this in Python,
00:06:09.220 | so we'll take
00:06:11.900 | this Unicode here,
00:06:18.420 | we'll just place it here, and this is our C with cedilla character.
00:06:26.940 | So if we just print that out,
00:06:36.660 | we see we have that character.
00:06:38.740 | Now, the other one is where it's kind of both together,
00:06:43.020 | so I'm just going to call it C plus cedilla, and
00:06:45.860 | that is the Latin capital C, which is
00:06:53.380 | 0043, which, if I just print this out so we can see it before we put the cedilla on the end,
00:07:00.740 | we just have a C. And then for the cedilla
00:07:03.660 | we just put 03
00:07:07.500 | 27, and we get that.
00:07:11.700 | Obviously these look the same, but if we compare them
00:07:14.740 | we'll see that they are not the same.
00:07:19.760 | Okay, we get False.
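
Roughly what that setup looks like in code (a sketch; the variable names are my own):

```python
c_with_cedilla = "\u00C7"        # Ç as a single code point
c_plus_cedilla = "\u0043\u0327"  # C followed by a combining cedilla

print(c_with_cedilla)                    # Ç
print(c_plus_cedilla)                    # Ç (looks identical)
print(c_with_cedilla == c_plus_cedilla)  # False
```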
00:07:23.940 | So to deal with that, this is where we need to use our canonical decomposition,
00:07:30.300 | or NFD, that we can see here.
00:07:33.620 | So to do all of this we're going to need to import
00:07:37.860 | the unicodedata
00:07:40.660 | library,
00:07:42.660 | and then we use unicodedata's
00:07:44.900 | normalize function.
00:07:48.780 | In this case we're using NFD, which is canonical decomposition,
00:07:54.980 | and then what we want to do is
00:07:58.500 | pass in our C with cedilla, because we want to break this down into the two different parts.
00:08:05.500 | So that's the one that we need to
00:08:07.500 | transform.
00:08:09.940 | And on the other side we're gonna have our C plus cedilla, which is our two characters, and we see,
00:08:16.180 | as soon as this changes to
00:08:19.620 | normalize, that we have True.
00:08:22.180 | So now what we've done is convert a single character into the two separate characters here, and
00:08:28.580 | that is because we've used normal form D to decompose it, breaking them apart.
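
A sketch of that NFD step, continuing with the same variables as above:

```python
import unicodedata

c_with_cedilla = "\u00C7"        # Ç
c_plus_cedilla = "\u0043\u0327"  # C + combining cedilla

# NFD: canonical decomposition breaks Ç into C + combining cedilla,
# so it now matches the already-decomposed pair.
print(unicodedata.normalize("NFD", c_with_cedilla) == c_plus_cedilla)  # True
```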
00:08:33.180 | Now, on the other side, we have canonical
00:08:39.300 | composition, where we build them back up into one, and
00:08:42.500 | to do that we use NFC.
00:08:46.420 | Obviously, if we try it with this
00:08:48.820 | we're not going to get the right answer, because we're not gonna find that they match: we're composing this
00:08:54.940 | back into itself, so it's just gonna be this again
00:08:58.940 | against this, which are separate. So we actually need to switch which side we have this function on.
00:09:06.260 | So if I just remove this
00:09:09.980 | and copy this across,
00:09:13.340 | we'll see that now we get True, because what we've done is
00:09:19.740 | convert these
00:09:22.100 | into this.
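
And the NFC direction, composing the pair back into the single character (a sketch):

```python
import unicodedata

c_with_cedilla = "\u00C7"        # Ç
c_plus_cedilla = "\u0043\u0327"  # C + combining cedilla

# NFC: canonical composition merges the pair back into the single Ç.
print(unicodedata.normalize("NFC", c_plus_cedilla) == c_with_cedilla)  # True
```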
00:09:23.820 | That's how we normalize for canonical equivalence, which is essentially where we can't actually see the difference. On the other side,
00:09:31.620 | we have where people are using the weird text. So in our abbreviations, we have these two with the K:
00:09:39.020 | that K means compatibility. Where there isn't a K,
00:09:43.100 | we're using canonical equivalence; where there is a K, we're using compatibility equivalence.
00:09:48.780 | Now, the first of those is normal form KD, which is compatibility
00:09:53.460 | decomposition. This breaks down the fancy or alternative characters
00:09:59.740 | into their smaller parts, if they do have smaller parts.
00:10:03.700 | So, for example, fractions: if we have the 1/2 fraction, that will get broken down
00:10:09.220 | into the numbers 1 and 2 and also a fraction slash character,
00:10:15.700 | which we can actually see down here. And we also have our fancy characters,
00:10:21.120 | so where we have this fancy capital H, we decompose it into just a normal Latin capital letter H.
00:10:27.660 | And that's how compatibility
00:10:30.580 | decomposition works.
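
A sketch of that fraction example:

```python
import unicodedata

half = "\u00BD"  # ½ as a single code point

# NFKD breaks it into '1', a fraction slash (U+2044), and '2'.
decomposed = unicodedata.normalize("NFKD", half)
print(decomposed)                          # 1⁄2
print([hex(ord(c)) for c in decomposed])   # ['0x31', '0x2044', '0x32']
```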
00:10:32.900 | To apply that,
00:10:34.900 | we want to use
00:10:37.100 | NFKD.
00:10:39.260 | So if we just take what we have here,
00:10:43.020 | and we're just gonna switch what we're actually using,
00:10:45.980 | so I'm going to switch out the C with cedilla for this fancy H.
00:10:51.740 | So, your fancy H.
00:10:54.140 | In fact, we can just leave it like that, because we can at least see what we're doing now.
00:10:58.740 | So we're gonna put that here, and
00:11:01.260 | we want to compare that to just a normal letter H.
00:11:05.540 | Obviously this is False; it doesn't match.
00:11:09.300 | What we need to do is normalize this and decompose it into the capital H character.
00:11:14.660 | So let's take this,
00:11:18.380 | and we're going to use our normalize function again,
00:11:24.260 | but this time
00:11:27.180 | we want to use
00:11:29.180 | compatibility equivalence, so K, and we're decomposing it, using D.
00:11:33.340 | Now you can see that we are getting True. So if we just print out the result of this function,
00:11:41.340 | we can see, okay, great: it's just taking that H and converting it into something normal.
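
A sketch of that NFKD call on the fancy H:

```python
import unicodedata

fancy_h = "\u210C"  # ℌ, black-letter capital H
print(fancy_h == "H")                                 # False
print(unicodedata.normalize("NFKD", fancy_h))         # H
print(unicodedata.normalize("NFKD", fancy_h) == "H")  # True
```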
00:11:47.520 | And then that leads us on to our final normal form, which is normal form KC.
00:11:54.340 | So normal form KC consists of two steps.
00:11:58.100 | We have the compatibility
00:12:00.660 | decomposition, which is what we've just done, and
00:12:05.620 | then there's a second step, which is a canonical composition. So we're building those different parts back up
00:12:12.020 | canonically, and
00:12:13.660 | this allows us to normalize all variants of a given character into a single shared form.
00:12:20.180 | So, for example, with our fancy H
00:12:23.100 | we can add the combining cedilla to that in order to just make this
00:12:31.260 | horrible monstrosity of a character.
00:12:36.220 | We would write that out as:
00:12:38.220 | we have H here,
00:12:41.100 | so we just put that straight in, and then we can just come up here and get our
00:12:46.860 | cedilla Unicode and
00:12:49.380 | put that in, and if we put those together we get this weird character.
00:12:53.860 | Now, if we wanted to compare that to another character, which is the H with cedilla,
00:13:00.820 | which is a single Unicode character, we're gonna have some issues, because that is just one character.
00:13:07.260 | So if we use
00:13:09.500 | NFKD we can give it a go. So we'll add this in.
00:13:15.140 | Let's try and compare it to this.
00:13:18.780 | Okay, we'll get False, and that's because this is breaking this down into two different parts: an H and
00:13:30.180 | this combining cedilla. So if I just remove this and print it out, you see, okay,
00:13:34.700 | they look the same, but they're not the same, because we have those two characters again.
00:13:38.020 | So this is where we need canonical composition, to bring those together into a single character.
00:13:43.900 | So that looks like this.
00:13:47.420 | Initially, we have our compatibility decomposition;
00:13:51.500 | if we go across, we have this final step, which is a canonical composition, and this is the NFKC
00:13:59.780 | normal form, so normal form KC.
00:14:03.500 | To apply that, all we need to do is
00:14:07.260 | obviously just change this to KC, and
00:14:10.940 | okay, we run that, we seem to get the same
00:14:15.660 | result, but then if we add this, we can see, okay, now we're getting what we need.
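
A sketch of the whole comparison, assuming U+1E28 (Ḩ, Latin capital H with cedilla) as the precomposed character:

```python
import unicodedata

fancy_h_cedilla = "\u210C\u0327"  # ℌ + combining cedilla
h_with_cedilla = "\u1E28"         # Ḩ as a single code point

# NFKD decomposes but never recomposes, so the comparison still fails.
print(unicodedata.normalize("NFKD", fancy_h_cedilla) == h_with_cedilla)  # False

# NFKC: compatibility decomposition followed by canonical composition.
print(unicodedata.normalize("NFKC", fancy_h_cedilla) == h_with_cedilla)  # True
```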
00:14:24.420 | In reality, I think for most cases, or almost all that I can think of anyway,
00:14:30.300 | you're gonna use this
00:14:32.740 | NFKC to normalize your text,
00:14:35.180 | because this is going to provide you with the cleanest, simplest dataset that is the most normalized.
00:14:41.700 | So when going forward with your language models, this is definitely the form that I would go with.
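
In practice, that can be as simple as one call (a sketch; the helper name and the fancy-text example are my own):

```python
import unicodedata

def clean(text: str) -> str:
    # Normalize font variants, fractions, and combined diacritics
    # into a single canonical, composed form.
    return unicodedata.normalize("NFKC", text)

print(clean("\u210B\u212F\u2113\u2113\u2134"))  # ℋℯℓℓℴ -> 'Hello'
```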
00:14:50.660 | Now, of course, you can mix it up and use different ones, but I would definitely recommend, if this is quite confusing or
00:14:57.460 | hard to get a grasp of, just
00:15:01.180 | taking these Unicode characters, playing around with them a little bit, applying these normal form
00:15:07.860 | functions to them, and just seeing what happens. I think it'll probably click quite quickly.
00:15:13.900 | So that's it for this video. I
00:15:18.220 | hope it's been useful and you've enjoyed it.
00:15:20.820 | So thank you for watching, and I'll see you again in the next one.