Unicode Normalization for NLP in Python

Okay, so we're going to take a look at Unicode normalization. Unicode normalization is something we use when we have those weird font variants that people always use on the internet. If you've ever seen people using those odd characters, I think they use them to express some form of individuality or to catch your attention. Then we have another issue, where we have unusual glyphs in text, and this is more reasonable because it's actually a part of language: the little marks attached to letters.

So you have the accents above the E's and so on in Italian or Spanish. Those little marks, all together, are called diacritics. Whenever we come across diacritics or that weird text, we can get issues when we're building models.

The issue with the weird text is this: someone has "hello world" in normal text, and we're comparing it to "hello world" in weird text with circles around every letter. We can't compare them like for like, because our models, or code in general, are not going to treat those two different sets of Unicode characters as the same.

The issue with diacritics is that those characters have a hidden property. We have one Unicode character, the capital C with cedilla, but then we have a visually identical sequence of characters: the Latin capital C immediately followed by something called a combining cedilla character. Together they look exactly like the single Unicode character, and this is quite difficult to deal with.

So we have these two problems, and we use Unicode normalization to deal with them when we're building NLP models. As I said, there are two forms of equivalence between characters that are not strictly identical. The first of those is compatibility equivalence.

That's where we have things like font variants, different line break sequences, circled variants, superscripts, subscripts, fractions, and a few other things as well. We want a model to see both "hello world" with the circles and plain "hello world" as one and the same, because that's how we read it and that's how it's supposed to be interpreted. That is what compatibility equivalence is for, and we'll look at how we deal with it pretty soon.

Then we also have canonical equivalence, which is the thing with the accents and the glyphs I mentioned before. There are a few different reasons for that, but there are two that I think are most relevant. The first is the combined characters.

So we have that C with cedilla character, and we also have the capital C plus the combining cedilla character merged together. The second is the conjoined Korean characters, which I think are pretty common as well.

Canonical equivalence is much more to do with characters where we can't really see that they are different, but they are in fact different, whereas compatibility equivalence is more to do with characters that were purposely made to look different when in reality the meaning is the same.

We have two different directions for how we can transform our text between these forms. We have decomposition, which is breaking down Unicode characters into smaller or more basic parts, and then we have composition, which is taking multiple Unicode characters and merging them into a single accepted Unicode character.

So I've got this example here, \u00C7. If we take a look, this is our C with cedilla.

This is what it looks like: it has the C, and it's got a little cedilla at the bottom. On the other side we have these two characters, and if we take a look, we can see this is the C plus the cedilla, so these are two separate Unicode characters. And then we see that they actually look exactly the same again.

Obviously that's where our problem is. What we can do is decompose them into their different parts. The two-character version is already separated, so when we decompose it we just get the same thing again. Our C with cedilla character, on the other hand, decomposes into two parts: the Latin capital C and the combining cedilla character. Then we can perform canonical composition to put both of those together and merge them back into the capital C with cedilla. That's essentially how decomposition and composition work.
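The two representations can be inspected directly. Here's a minimal sketch using Python's standard `unicodedata` module; the variable names are only for illustration and are not taken from the video.

```python
import unicodedata

# One precomposed code point: LATIN CAPITAL LETTER C WITH CEDILLA
c_with_cedilla = "\u00C7"
# Two code points: LATIN CAPITAL LETTER C followed by COMBINING CEDILLA
c_plus_cedilla = "\u0043\u0327"

# Print each string, its length in code points, and the name of every code point.
for label, text in [("precomposed", c_with_cedilla), ("combining", c_plus_cedilla)]:
    names = [unicodedata.name(ch) for ch in text]
    print(label, text, len(text), names)

# The Unicode character database records how the precomposed form breaks apart,
# as a space-separated list of hexadecimal code points.
print(unicodedata.decomposition(c_with_cedilla))  # 0043 0327
```

Both strings should render as the same glyph, but one has length 1 and the other length 2, which is why a plain comparison fails.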

It's slightly different for compatibility decomposition, but we'll talk about that quite soon. When we take the fact that we have these two directions, composition and decomposition, and our two types of equivalence, compatibility and canonical, we get four normal forms: NFD, NFC, NFKD, and NFKC.

The first is form D, or NFD, which is canonical decomposition. That's what I showed you here, where we're decomposing those characters into their individual parts.

Let's take a look at how to do this in Python. We'll take this Unicode escape, \u00C7, and place it in a string; this is our C with cedilla character. If we print that out, we see the character. The other one is where it's both parts together, so I'm just going to call it C plus cedilla. That is the Latin capital C, which is \u0043, and if I print this out before we put the cedilla on the end, we just have a C. Then for the cedilla we put \u0327, and we get that. Obviously these look the same, but if we compare them we'll see that they are not the same: we get False.

To deal with that, this is where we need our canonical decomposition, or NFD. To do this we need to import the unicodedata library, and then we use unicodedata.normalize, in this case with "NFD", which is canonical decomposition. What we want to do is pass in our C with cedilla, because we want to break it down into its two parts, so that's the one we need to transform. On the other side we have our C plus cedilla, which is already two characters. Once we change the comparison to use the normalized value, we get True. What we've done is convert the single character into the two separate characters, and that's because we've used normal form D to decompose it and pull the parts apart.

On the other side we have canonical composition, where we build them back up into one, and to do that we use NFC. Obviously if we try it as written, we're not going to get the right answer. They won't match, because we're composing the single character back into itself.

So it's just going to be the single character again, compared against the two separate characters. We need to switch which side the function is on. If I remove it from here and copy it across to the other side, we see that we now get True, because what we've done is convert the two characters into the single one.

That's how we normalize for canonical equivalence, which is essentially where we can't see the difference. On the other side, we have the case where people are using the weird text.
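The NFD and NFC comparisons described above can be sketched as follows. This is a reconstruction from the narration rather than the exact code on screen, so the variable names are assumptions; `unicodedata.normalize` is the standard library function.

```python
import unicodedata

c_with_cedilla = "\u00C7"        # single precomposed character
c_plus_cedilla = "\u0043\u0327"  # 'C' followed by a combining cedilla

# Without normalization the two strings differ, even though they look identical.
print(c_with_cedilla == c_plus_cedilla)  # False

# NFD: decompose the single character so it matches the two-character form.
nfd = unicodedata.normalize("NFD", c_with_cedilla)
print(nfd == c_plus_cedilla)  # True

# NFC applied to the already-composed character changes nothing, so no match.
print(unicodedata.normalize("NFC", c_with_cedilla) == c_plus_cedilla)  # False

# NFC applied to the two-character form composes it into the single character.
nfc = unicodedata.normalize("NFC", c_plus_cedilla)
print(nfc == c_with_cedilla)  # True
```

In practice it is simpler to apply the same normal form to both sides of a comparison, so it does not matter which representation each string started in.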

In our abbreviations, we have two with a K, and that K means compatibility. Where there isn't a K, we're using canonical equivalence; where there is a K, we're using compatibility equivalence.

The first of those is normal form KD, which is compatibility decomposition. This breaks down the fancy or alternative characters into their smaller parts, if they have smaller parts. For example, the ½ fraction gets broken down into the digits 1 and 2 plus a fraction slash character, which you can see down here. We also have our fancy characters, so where we have this fancy capital H, we decompose it into just a normal Latin capital letter H. That's how compatibility decomposition works, and to apply it we use NFKD.

Let's take what we have here and switch what we're using. I'm going to switch out the C with cedilla for this fancy H. So there's our fancy H. In fact, we can just leave it like that, because we can at least see what we're doing now. We put that here, and we want to compare it to a normal letter H. Obviously we get False; it doesn't match. What we need to do is normalize it and decompose it into the capital H character. So we use our normalize function again, but this time we want compatibility equivalence, so we add the K, and we're decomposing, so we use the D. Now you can see that we are getting True.
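Here's a small sketch of compatibility decomposition. The transcript doesn't say which code point the "fancy H" is, so U+210B (SCRIPT CAPITAL H) is used here as an assumed stand-in; the fraction is U+00BD.

```python
import unicodedata

fancy_h = "\u210B"  # SCRIPT CAPITAL H (assumed stand-in for the video's fancy H)
half = "\u00BD"     # VULGAR FRACTION ONE HALF

# The fancy H is a different code point from a plain 'H'.
print(fancy_h == "H")  # False

# NFKD maps it to the plain Latin capital letter H.
print(unicodedata.normalize("NFKD", fancy_h) == "H")  # True

# Canonical decomposition (no K) leaves it alone: the two are only
# compatibility-equivalent, not canonically equivalent.
print(unicodedata.normalize("NFD", fancy_h) == fancy_h)  # True

# The fraction breaks down into digit, fraction slash, digit.
decomposed_half = unicodedata.normalize("NFKD", half)
print([unicodedata.name(ch) for ch in decomposed_half])
# ['DIGIT ONE', 'FRACTION SLASH', 'DIGIT TWO']
```

Note that the fraction slash (U+2044) is not the ordinary "/" character, so a further replacement step would be needed if plain ASCII is the goal.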

If we print out the result of this function, you can see it's just taking that H and converting it into something normal. That leads us on to our final normal form, which is normal form KC. Normal form KC consists of two steps. We have the compatibility decomposition, which is what we've just done, and then there's a second step, which is a canonical composition.

So we're building those different parts back up canonically, and this allows us to normalize all variants of a given character into a single shared form. For example, with our fancy H, we can add the combining cedilla to make this horrible monstrosity of a character. To write that out, we have the H here, so we put that straight in, and then we can come up here, get our cedilla code point, and put that in. If we put those together we get this weird character. Now suppose we want to compare that to another character, the H with cedilla, which is a single Unicode character.

We're going to have some issues, because that one is just a single character. If we use NFKD we can give it a go. We'll add that in and try comparing it to the single character. We get False, and that's because NFKD breaks this down into two different parts: an H and the combining cedilla.
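To see why NFKD alone falls short here, a quick sketch helps. As before, U+210B is an assumed stand-in for the fancy H; U+1E28 is the precomposed H with cedilla.

```python
import unicodedata

fancy_h_cedilla = "\u210B\u0327"  # script H (assumed) + COMBINING CEDILLA
h_with_cedilla = "\u1E28"         # LATIN CAPITAL LETTER H WITH CEDILLA

# NFKD turns the fancy H into a plain H but leaves the cedilla as a separate mark.
nfkd = unicodedata.normalize("NFKD", fancy_h_cedilla)
print(nfkd == h_with_cedilla)  # False

# Listing the code points shows the result is still two characters.
print([f"U+{ord(ch):04X}" for ch in nfkd])  # ['U+0048', 'U+0327']
```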

If I remove the comparison and print the result, you can see they look the same, but they're not the same, because we have those two characters again. This is where we need canonical composition to bring them together into a single character. So initially we have our compatibility decomposition, and then if we go across we have the final step, which is a canonical composition. This is the NFKC normal form, normal form KC.

To apply that, all we need to do is change this to "NFKC". We run that and it looks like we get the same result, but when we add the comparison back in, we can see we're now getting what we need.

In reality, for most cases, or almost all that I can think of anyway, you're going to use NFKC to normalize your text, because it gives you the cleanest, simplest, most normalized data set. So when going forward with your language models, this is the form I would go with. Of course, you can mix it up and use different ones. If this is confusing or hard to get a grasp of, I'd recommend taking these Unicode characters, playing around with them a little, applying these normal form functions to them, and seeing what happens. I think it'll click quite quickly.

So that's it for this video.
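As a recap of the NFKC step, here's a minimal sketch. The `normalize_text` helper is hypothetical (it is not from the video), U+210B is again an assumed stand-in for the fancy H, and the circled letters are built from the circled small letter block starting at U+24D0.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Hypothetical helper: compatibility decomposition, then canonical composition."""
    return unicodedata.normalize("NFKC", text)


fancy_h_cedilla = "\u210B\u0327"  # script H (assumed) + COMBINING CEDILLA
h_with_cedilla = "\u1E28"         # LATIN CAPITAL LETTER H WITH CEDILLA

# NFKC decomposes the fancy H, then composes H + cedilla into one character.
print(normalize_text(fancy_h_cedilla) == h_with_cedilla)  # True

# Build "hello world" out of circled small letters (U+24D0 is circled 'a'),
# leaving the space untouched.
circled = "".join(
    chr(0x24D0 + ord(ch) - ord("a")) if "a" <= ch <= "z" else ch
    for ch in "hello world"
)
print(circled == "hello world")        # False
print(normalize_text(circled))         # hello world
```

Applying the same helper to every string before tokenization or comparison is one straightforward way to use it in a preprocessing pipeline.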

I hope it's been useful and that you've enjoyed it. Thank you for watching, and I'll see you again in the next one.
