Unicode Normalization for NLP in Python
Chapters
0:00 Intro
0:40 Diacritics
3:40 Decomposition
5:32 Conversion
11:48 Normal Form
Okay, so we're going to take a look at Unicode normalization. Unicode normalization is something that we use when we have those weird font variants that people always use on the internet. If you've ever seen people using those odd characters, I think they use them to express some form of individuality or to catch your attention. And then we also have another issue where we have weird glyphs in text, and this one is more reasonable because it's actually a part of language: you have the accents above the E's and so on in Italian or Spanish. Those little glyphs, taken all together, are called diacritics. And whenever we come across diacritics or that weird text, we can get issues when we're building models.
The issue with the weird text is obvious: someone has written "hello world" in normal text, and we're comparing it to a "hello world" written in weird text with circles around every letter. We can't actually compare them like for like, because our models, or our code in general, are not going to be able to match those two different Unicode character sequences. The issue with diacritics is that those characters often have a hidden property, in that we have one Unicode character, which is the capital C with cedilla, but then we have an identical-looking pair of characters, which is, for example, the Latin capital C immediately followed by something called a combining cedilla character. Together they look exactly like the other Unicode character, which is quite difficult to deal with. So we have these two problems, and we use Unicode normalization to deal with them when we're building NLP models.
Unicode defines two types of equivalence for cases where equivalent characters are not literally equal. The first of those is compatibility equivalence. That covers things like font variants, different line-break sequences, circled variants, superscripts, subscripts, fractions, and a few other things as well. With those, we treat the "hello world" with circles and the plain "hello world" as one and the same, because that's how we read it and that's how it's supposed to be interpreted. That is what compatibility equivalence is for, and we'll see how to handle it pretty soon.
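As a quick illustration, circled letters carry a compatibility decomposition back to the plain letters. Here's a minimal sketch using Python's unicodedata module and the NFKD form we'll cover shortly (the circled string is my own example):

```python
import unicodedata

circled = "\u24D7\u24D4\u24DB\u24DB\u24DE"  # circled Latin letters

# NFKD applies compatibility decomposition, stripping the circled styling:
print(unicodedata.normalize("NFKD", circled))  # hello
```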
And then we also have canonical equivalence, which is the thing with the accents and the glyphs I mentioned before. There are a few different cases of that, but the two that I think are most relevant are the combined characters, so we have that C with cedilla character, where the capital C plus the combining cedilla character merge together, and then we also have conjoined Korean characters, which I think are pretty common as well. Canonical equivalence is much more to do with characters where we can't really see that they are different, but they are in fact different. Compatibility equivalence is more to do with characters that were purposely made different, where in reality the meaning is the same.
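To make the Korean (Hangul) case concrete, here's a small sketch (가 is just an arbitrary example syllable):

```python
import unicodedata

syllable = "\uAC00"  # a precomposed Hangul syllable

# Canonical decomposition splits the syllable into its conjoining jamo:
decomposed = unicodedata.normalize("NFD", syllable)
print([hex(ord(ch)) for ch in decomposed])  # ['0x1100', '0x1161']

# Canonical composition merges the jamo back into a single code point:
print(unicodedata.normalize("NFC", decomposed) == syllable)  # True
```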
So we have two different directions for how we can transform these characters. We have decomposition, which is breaking down Unicode characters into smaller or more "normal" parts, and then we have composition, which is taking multiple Unicode characters and merging them into a single character.
If we take a look here, this is our C with cedilla, and we can see what it looks like: it has this C with a little cedilla at the bottom. Then on the other side we have these two characters, and if we take a look we can see, okay, this is the C plus cedilla, so these are two separate Unicode characters. And yet they look exactly the same again, and obviously that's where our problem is. So what we can do is decompose them into their different parts. The C plus cedilla pair is already separated, so when we decompose it we just get the same thing back. The single character, when we decompose it, basically gives us those two different parts, which are the Latin capital C and the combining cedilla character. We can then perform canonical composition to put those back together and merge them into the capital C with cedilla. That's essentially how decomposition and composition work. It's slightly different for compatibility decomposition, but we'll talk about that quite soon.
When we take the fact that we have these two different directions, composition and decomposition, and combine them with our two types of equivalence, compatibility and canonical, we get the different normal forms. The first is NFD, canonical decomposition, which is what I showed you here, where we're decomposing those characters into their individual parts.
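For reference, here's a sketch of all four normal forms applied to the same character with Python's unicodedata module:

```python
import unicodedata

s = "\u00C7"  # LATIN CAPITAL LETTER C WITH CEDILLA

# D = decomposition, C = composition; K = compatibility, no K = canonical.
for form in ("NFD", "NFC", "NFKD", "NFKC"):
    result = unicodedata.normalize(form, s)
    print(form, [hex(ord(ch)) for ch in result])

# NFD  ['0x43', '0x327']  C + combining cedilla
# NFC  ['0xc7']           composed back into one code point
# NFKD ['0x43', '0x327']  same as NFD here, this character has no compatibility variant
# NFKC ['0xc7']           same as NFC here
```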
And if we just take a look at how to actually do this in Python: we'll place our first character here, and this is our C with cedilla character. The other one is where it's kind of both together, so I'm just going to call it C plus cedilla. That starts with the Latin capital C, \u0043, which I'll print out so we can see it before we put the cedilla on the end. Obviously these look the same, but if we compare them, we get False.
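A minimal sketch of that setup (the variable names are my own):

```python
c_with_cedilla = "\u00C7"        # C with cedilla as a single code point
c_plus_cedilla = "\u0043\u0327"  # Latin capital C + combining cedilla

print(c_with_cedilla, c_plus_cedilla)    # both render identically
print(c_with_cedilla == c_plus_cedilla)  # False, the code points differ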
To deal with that, this is where we need to use our canonical decomposition. To do all of this we're going to need to import Python's unicodedata module. In this case we're using NFD, which is canonical decomposition, and passing our C with cedilla, because we want to break it down into its two different parts. On the other side we have our C plus cedilla, which is already two characters. And now we see that they match: what we've done is convert a single character into the two separate characters.
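Roughly, that step looks like this:

```python
import unicodedata

c_with_cedilla = "\u00C7"
c_plus_cedilla = "\u0043\u0327"

# NFD decomposes the single character into C + combining cedilla,
# so the two sides now match:
print(unicodedata.normalize("NFD", c_with_cedilla) == c_plus_cedilla)  # True
```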
So with canonical decomposition we broke those apart. On the other side we have canonical composition, where we build them back up into one. But if we apply it to the composed character, we're not going to get the right answer: they won't match, because we're composing the single character back into itself, so it's just that same single character again, compared against the two separate characters. We actually need to switch which side we apply the function to. And then we'll see that now we get True, because what we've done is merge the two separate characters into the single composed character.
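Something like this, with the normalization switched to the decomposed side:

```python
import unicodedata

c_with_cedilla = "\u00C7"
c_plus_cedilla = "\u0043\u0327"

# Composing the already-composed side changes nothing, so no match:
print(unicodedata.normalize("NFC", c_with_cedilla) == c_plus_cedilla)  # False

# Composing the two-character side merges it back into one code point:
print(unicodedata.normalize("NFC", c_plus_cedilla) == c_with_cedilla)  # True
```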
That's how we normalize for canonical equivalence, which is essentially the case where we can't actually see the difference. On the other side we have the case where people are using the weird text. In our abbreviations, the two remaining forms have a K in them, and that K means compatibility: where there isn't a K, we're using canonical equivalence; where there is a K, we're using compatibility equivalence.
The first of those is normal form KD (NFKD), which is compatibility decomposition. This breaks down the fancy or alternative characters into their smaller parts, if they do have smaller parts. So, for example, fractions: if we have the 1/2 fraction, that will get broken down into the numbers 1 and 2 plus a fraction-slash character, which we can actually see down here. And we also have our fancy characters: where we have this fancy capital H, we decompose it into just a normal Latin capital letter H.
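The fraction case, as a quick sketch:

```python
import unicodedata

half = "\u00BD"  # VULGAR FRACTION ONE HALF

# NFKD breaks the fraction into 1 + FRACTION SLASH + 2:
decomposed = unicodedata.normalize("NFKD", half)
print(decomposed)                           # 1, fraction slash, 2
print([hex(ord(ch)) for ch in decomposed])  # ['0x31', '0x2044', '0x32']
```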
So we're just going to switch what we're actually using: I'm going to swap out the C with cedilla for this fancy H. In fact, we can leave the rest as it is, because we can at least see what we're doing now. We want to compare that to just a normal letter H, so what we need to do is normalize the fancy H and decompose it into the plain capital H character. We're going to use our normalize function again, with K for compatibility equivalence and D because we're decomposing. And now you can see that we are getting True. If we just print out the result of this function, we can see, okay, great, it's just taking that fancy H and converting it into something normal.
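In code, roughly (I'm assuming U+210C, the black-letter capital H, as the fancy H; the exact character used in the video may differ):

```python
import unicodedata

fancy_h = "\u210C"  # assumed stand-in for the "fancy" capital H

print(unicodedata.normalize("NFKD", fancy_h) == "H")  # True
print(unicodedata.normalize("NFKD", fancy_h))         # H
```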
And that leads us on to our final normal form, which is normal form KC (NFKC). First there's a compatibility decomposition, which is what we've just done, and then there's a second step, which is canonical composition, where we build those different parts back up. This allows us to normalize all variants of a given character into a single shared form.
To see why we need that, we can add the combining cedilla to our fancy H to make a new character. So we just put that straight in, then come up here and get our combining cedilla, put that in, and if we put those together we get this weird character. Now, if we want to compare that to another character, the H with cedilla, which is a single Unicode character, we're going to have some issues, because that one is just one character. With NFKD we can give it a go. So we'll add this in, and okay, we'll get False, and that's because NFKD is breaking our sequence down into two different parts: an H and this combining cedilla. If I just remove the comparison and print it out, you can see, okay, they look the same, but they're not the same, because we have those two characters again. This is where we need canonical composition to bring them together into a single character.
Initially we have our compatibility decomposition, and if we go across we have this final step, which is canonical composition; together they give us the NFKC result. And if we add this, we can see, okay, now we're getting what we need.
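A sketch of that final comparison (assuming U+1E28, LATIN CAPITAL LETTER H WITH CEDILLA, as the single-character target):

```python
import unicodedata

fancy_h_cedilla = "\u210C\u0327"  # assumed fancy H + combining cedilla
h_with_cedilla = "\u1E28"         # H with cedilla as a single code point

# NFKD alone leaves two code points, H + combining cedilla, so no match:
print(unicodedata.normalize("NFKD", fancy_h_cedilla) == h_with_cedilla)  # False

# NFKC decomposes and then composes, giving the single composed character:
print(unicodedata.normalize("NFKC", fancy_h_cedilla) == h_with_cedilla)  # True
```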
In reality, I think for most cases, or almost all that I can think of anyway, NFKC is the form to use, because it's going to provide you with the cleanest, simplest, most normalized dataset. So when going forward with your language models, this is definitely the form that I would go with. Now, of course, you can mix it up and use the different ones, but if this is quite confusing, I would definitely recommend taking these Unicode characters, playing around with them a little bit, applying these normal-form functions to them, and just seeing what happens; I think it'll probably click quite quickly. So thank you for watching, and I'll see you again in the next one.