
Novice to Advanced RegEx in Less-than 30 Minutes + Python


Chapters

0:00
13:09 character sets
15:00 boundaries
17:41 assertions
21:00 modifiers

Transcript

Hi and welcome to the video. Today we're going to look at regex which is short for regular expressions. This is essentially the de facto standard for parsing text. So what we're going to do in this video is run through the basics of regex first. So stuff like how to define digits, whitespace, how to use quantifiers to tell us how many digits or whitespace or any other character we want to include, how to use capture groups and character classes and also how we can use boundary definitions.

So how we define the start of a line, the end of a line or boundaries of a word. And we should move through these quite quickly because they're not very difficult, they're pretty straightforward. And then we can move on to what I think is the more exciting interesting stuff which is a little more advanced.

So these are things like lookahead or lookbehind assertions, modifiers and conditionals, which are essentially like if-else statements but for your regex code, which is pretty interesting. So we're going to work through all of that in this video, we're going to code through a few examples in Python, and we're also going to use regex101, which is like an interactive debugger or regex building tool that we can use.

So it should be pretty interesting and let's jump straight into it. Okay so we're going to start in regex101 and we're just going to have a look through a few of the metacharacters. So let's say we have this string, I'm just going to make it up. We have a few letters in here, some numbers and two other characters as well.

Now we can obviously directly match these by actually writing out the exact characters but generally we're not going to want to do that if we're using regex. So this is where we start to use metacharacters. So we'll just go through the most common ones, we have digits this will match any number so anything between 0 and 9.

And we can invert that with an uppercase D and then that will match anything that is not a digit. We have W which will match any word character and then we can also reverse that again by capitalizing it. So all of these metacharacters we can usually uppercase them and it will reverse what they're doing.

We do white space with S, in this case we don't actually have any white space so let's add some in. And it will also match new lines as well. So you can't see this highlighted but if you look up here you can see two matches, then if I add that new line in it goes up to three.

And then to do anything but white space we just uppercase that again. And then this one is a little bit of a special one, this one matches any character except new lines. So this is not matching our new line here. We obviously can't uppercase this one but I suppose the opposite would actually just be a new line character like this.

So let's switch over to Python and see how we would do this. So we import re which is the regex module and we'll go through these a little bit in more depth later but for now I'm just going to do re.findall. And in our first argument here we put the pattern that we are going to use to search.

So in this case it would be backslash n, although we're not going to use that, we're going to use backslash d for any digits. And then let's just pull this one in. Okay and we return 1 0 0 0. Okay so these four characters here which is exactly what we would get here.

Okay so that's cool. Now in the case of this full stop what if we would like to actually match a full stop and just a full stop. To do that we actually escape the meta character using a backslash and just like that. Okay so that's it for meta characters.
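The metacharacters covered so far can be sketched in Python roughly as follows. The test strings are hypothetical stand-ins, since the string typed in the video is not shown in the transcript.

```python
import re

# Hypothetical stand-in for the test string used in the video.
text = "abc1000#!"

digits = re.findall(r"\d", text)          # any digit 0-9
non_digits = re.findall(r"\D", text)      # anything that is not a digit
word_chars = re.findall(r"\w", text)      # letters, digits and underscore
whitespace = re.findall(r"\s", "a b\nc")  # spaces and newlines both count
any_char = re.findall(r".", "a\nb.")      # any character except a newline
literal_dot = re.findall(r"\.", "a\nb.")  # escaped: only a literal full stop

print(digits)       # ['1', '0', '0', '0']
print(whitespace)   # [' ', '\n']
print(literal_dot)  # ['.']
```

Raw strings (the r prefix) are used so that Python does not interpret the backslashes before the regex engine sees them.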

Let's move on to quantifiers. So quantifiers essentially allow us to match a specific number of characters. So as of yet we've only been matching one at a time. So here we are matching four characters but we're only matching one character at a time. Four times one, two, three, four.

Whereas quantifiers allow us to write our pattern and then add this quantifier to specify how many times to actually match that. So the first of those is the one or more quantifier. So this will match that pattern one or more times, just a plus sign. We also have zero or more quantifier.

So this is matching that pattern zero times or more times. Okay and something that we can also add in here. As you'll see this is matching it as many times as possible but maybe we actually want to limit the number of times I'm matching something. And this is a difference between what is called greedy and lazy quantifiers.

So at the moment we have a greedy quantifier. So it's saying one or more times and it's going all the way up to four. Okay which is as many characters as it can fit into its pattern. But if you want it to not do that and instead be lazy and simply pick up as few characters as possible that match the criteria, we can just add a question mark onto the end.

And then we're back to matching just one because it's one or more and we are limiting it to the minimum number of matches there. So keep that in mind and we'll just quickly go over again towards the end just a little bit. We also have the once or none.
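As a rough illustration of greedy versus lazy matching, here is a minimal sketch. The input string is an assumption, not the one from the video.

```python
import re

text = "abc1000#!"  # hypothetical test string

greedy = re.findall(r"\d+", text)   # one or more, as many as possible
lazy = re.findall(r"\d+?", text)    # one or more, as few as possible

# Zero or more still succeeds when there are no digits at all,
# whereas one or more does not.
zero_or_more = re.match(r"\d*", "abc")
one_or_more = re.match(r"\d+", "abc")

print(greedy)                      # ['1000']
print(lazy)                        # ['1', '0', '0', '0']
print(repr(zero_or_more.group()))  # ''
print(one_or_more)                 # None
```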

So let's write a new test string here. So here we have a few words and we'd like to match all of the words. So what we can do is this. And here we're kind of matching all the words but there's this one in the middle where we have a hyphen in the middle.

And this is something that will happen quite often. And ideally we also want to put good hearted as a single word. So we could do this. But then we're only matching that single good hearted part. So instead we add a once or none quantifier. Okay so now we're matching that word as well.

Now if we also want to match the A, because you can see here it's not matching since we're expecting at least two word characters (we have W here and W here), we can just add a zero or more quantifier onto the end there. And now we're getting all of our words together including the hyphenated words.
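The steps for building up the hyphenated-word pattern might look like this in Python. The sentence is a made-up example, and the exact patterns typed in the video are an assumption reconstructed from the narration.

```python
import re

text = "he is a good-hearted man"  # hypothetical sentence

plain = re.findall(r"\w+", text)            # splits the hyphenated word in two
hyphen_only = re.findall(r"\w+-\w+", text)  # hyphen required
optional = re.findall(r"\w+-?\w+", text)    # hyphen optional, but needs 2+ word characters
all_words = re.findall(r"\w+-?\w*", text)   # trailing part may be empty, so "a" matches too

print(plain)        # ['he', 'is', 'a', 'good', 'hearted', 'man']
print(hyphen_only)  # ['good-hearted']
print(optional)     # ['he', 'is', 'good-hearted', 'man']
print(all_words)    # ['he', 'is', 'a', 'good-hearted', 'man']
```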

We can also specify a specific quantity which we do like this. So here we are getting three word characters at a time. You can see here we're not specifying three characters that make up a word. We're just saying three characters. So here we're getting multiple matches for single words which is fine because we haven't specified that.

We'll go over how to define word boundaries later. And we can also turn this into a range. So let's go three to five. Okay so now we're matching a minimum three characters and a maximum five. Now you might have guessed this but we can actually just remove one of these numbers to get up to five.

Now if this doesn't work for you just make sure that you are using the Python flavor because for other languages this might change and if you're on PCRE for example this won't work. So change back to Python. And we can also do three or more as well. Now I think this is a good example with our lazy quantifier.

So here we're matching between either three up to five characters at a time. If we had a lazy quantifier it's always going to limit that as much as possible. So we're going to go down to three. Okay so you can see here that it's limiting how many characters it's including in there.
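The counted quantifiers described above can be sketched as below. The input text is a hypothetical example chosen to make the differences visible.

```python
import re

text = "regular expressions"  # hypothetical input

exactly_three = re.findall(r"\w{3}", text)     # three word characters at a time
three_to_five = re.findall(r"\w{3,5}", text)   # greedy: takes five where it can
lazy_range = re.findall(r"\w{3,5}?", text)     # lazy: stops at three
three_or_more = re.findall(r"\w{3,}", text)    # no upper limit

# Omitting the lower bound means "up to five" in Python's flavor.
fits = re.fullmatch(r"\w{,5}", "regex")
too_long = re.fullmatch(r"\w{,5}", "regular")

print(exactly_three)  # ['reg', 'ula', 'exp', 'res', 'sio']
print(three_to_five)  # ['regul', 'expre', 'ssion']
print(lazy_range)     # ['reg', 'ula', 'exp', 'res', 'sio']
print(three_or_more)  # ['regular', 'expressions']
```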

It's getting lazy rather than greedy. Okay so let's write out a new example here. Okay so a few unexpected words are to be expected. So this is a good example of where we can use capture groups. So anything contained within round brackets will create a capture group. So a capture group is simply a fancy way of saying treat everything within these brackets as a single unit.

So you can see here I only put these dots in as a filler but it actually matches because those dots mean anything. So three anythings in a row and this is matching basically anything. So we have all these matches here and it's treating those as a unit. So it's doing three anythings and matching that and then moving on to next three anythings.

Now what we can do is we want to match unexpected and expected. So we want to match the word with or without its negative prefix. So we can add expected here. But here we're only getting expected we're not getting the un from unexpected. So we just add this. So now we're getting unexpected but we're not getting expected because we have specified okay we want un here.

We want this capture group. So all we need to do is actually make this optional by adding a zero or one quantifier. And there we go we are now capturing unexpected and expected. So let's have a quick go at this in Python see what it looks like. Okay and we run this and now we find that we are only seeing un which is probably not what we're expecting.

The reason this happens is that find all tries to match capture groups which is exactly what we have here. So what we can do is modify this capture group to make it a non-capturing group while still maintaining this behavior of zero or one. So all we do to do that is add a question mark inside followed by a colon and then here we are capturing everything again.
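A minimal sketch of the findall behaviour just described, using the sentence from the example. The exact pattern typed in the video is assumed from the narration.

```python
import re

text = "a few unexpected words are to be expected"

# With a capturing group, findall returns the group's contents,
# and an empty string where the optional group did not take part.
capturing = re.findall(r"(un)?expected", text)

# A non-capturing group keeps the grouping but returns the whole match.
non_capturing = re.findall(r"(?:un)?expected", text)

print(capturing)      # ['un', '']
print(non_capturing)  # ['unexpected', 'expected']
```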

So that's just a little bit of a strange behavior to watch out for. Now we can also add an or logic to our capture groups. So maybe we want to capture anything where we are saying expected with a negative prefix and that can either be not expected or unexpected.

And we want both of these to match. Now to do this we actually just add a pipe into our capturing group and then we add not like so. And now we are matching both non-expected and unexpected. Okay so that's it for capture groups and let's move on to character sets.
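The or logic inside a group might be sketched like this. The transcript is ambiguous about the second prefix, so the "non-" spelling and the test string here are assumptions for illustration only.

```python
import re

# Hypothetical text; the "non-" prefix is an assumed spelling.
text = "unexpected, non-expected, expected"

# The pipe means "either side"; the group is optional and non-capturing.
matches = re.findall(r"(?:un|non-)?expected", text)

print(matches)  # ['unexpected', 'non-expected', 'expected']
```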

So the syntax for character sets is kind of similar to the syntax for capture groups in that we use brackets but this time they're square brackets instead. And you can see these kind of like a list. So anything we put within here will be treated as a character to match.

But unlike capture groups it's not treating them all as a unit. So if we put un in here it's actually just matching u or n. And we put unexpected and it's not going to match unexpected as a unit, it's just going to match each one of those characters within those square brackets.

So let's return to our earlier example. So earlier on we were matching all of the digits in our string. So what we could do is write out all of the digits like this and we get the exact same effect. Obviously this is quite long so what we can do instead is write this with a dash in the middle.

And this is any character within the range of 0 to 9. We can also add letters to this. So a to z for example. And you might also think okay we can also add these hyphens in right? But obviously we are using these hyphens to define our ranges. So in order to add a hyphen in here we need to use a backslash to escape it.
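Character sets and ranges can be sketched as follows. The hyphenated string is a hypothetical stand-in for the one used in the video.

```python
import re

text = "abc-100-xyz"  # hypothetical test string

# A set matches any single character it lists, not the sequence as a unit.
u_or_n = re.findall(r"[un]", "unexpected")

digits_listed = re.findall(r"[0123456789]", text)  # every digit written out
digits_range = re.findall(r"[0-9]", text)          # the same thing as a range

# Letters, digits and an escaped hyphen, repeated with a quantifier.
whole = re.findall(r"[a-z0-9\-]+", text)

print(u_or_n)        # ['u', 'n']
print(digits_range)  # ['1', '0', '0']
print(whole)         # ['abc-100-xyz']
```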

And now we are matching the full string. And if we want to match the full string as a whole of course we just add our quantifier. Now let's move on to boundaries. So I'm just going to write out a new string for this. Okay so here I want to show you the startString and endString boundaries.

So startString is using this character. So here if we put ifit it's only going to match ifit at the start here. It's not going to match ifit here as well. And if we remove the character it does. Okay so we add this character to specify that we only want to search from the very start of our string.

Now the equal and opposite of the startString character is the endString character. And that is a dollar symbol. So let's rewrite this. Okay and we want to look for example. And here with the dollar symbol we only match the final example rather than both of them. So you can see there.
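The start and end anchors can be illustrated with a short sketch. The start-of-string character is the caret (^); the sentence below is made up for the example.

```python
import re

text = "example one, then another example"  # hypothetical

# Record where each match starts so the effect of the anchors is visible.
anywhere = [m.start() for m in re.finditer(r"example", text)]
at_start = [m.start() for m in re.finditer(r"^example", text)]
at_end = [m.start() for m in re.finditer(r"example$", text)]

print(anywhere)  # both occurrences
print(at_start)  # only the occurrence at index 0
print(at_end)    # only the final occurrence
```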

I'm going to go back to one of the earlier examples again now. Okay so here I also want to show you the word boundary. So the best way to identify word boundary is not by using for example white space. Because yes that does work in a lot of cases but it doesn't work if we have a comma, full stop, hyphen or anything like this.

So what we can do instead is use backslash b. And this identifies every single word boundary within our text as you can see from the pink lines. So then we can use that to capture any of our words. And now quite easily we've captured every single word and we're pulling them out in a more efficient way than if we had tried to write you know s or if we'd have gone with a grouping like this and added all these different things.
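A small sketch of the word boundary in Python, assuming a made-up sentence with mixed punctuation:

```python
import re

text = "one, two. three-four five"  # hypothetical sentence

# \b sits between a word character and a non-word character,
# so commas, full stops and hyphens all end a word.
words = re.findall(r"\b\w+\b", text)

# Combined with a counted quantifier, this picks out whole
# three-letter words rather than three-character slices.
three_letter_words = re.findall(r"\b\w{3}\b", text)

print(words)               # ['one', 'two', 'three', 'four', 'five']
print(three_letter_words)  # ['one', 'two']
```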

All we need to do is add a word boundary. Okay so now we'll move on to some of the what I think are the more interesting and definitely a bit more advanced methods in Regex. So the first of those is the look ahead and look behind assertions. So if we have this string here we have two hello worlds.

One of them is preceded by a one and a colon. The other preceded by two and a colon. Now what if we want to match hello world but we only want to match hello world if it is preceded by a one and a colon. But we don't want to include that one and a colon within our pattern because if we want to do that we would just write this.

But this will return the entire string. So I'll show you over here. Okay so we're returning the full string. What if we only actually want to pull out this hello world? Well we could go for hello world but then we're returning both. We don't want to do that we only want the first one.

So to do this we use a look behind assertion. So this means that we are looking behind our pattern which means anything preceding it and we are asserting that there is this other pattern there. So we do this with this pattern here. Okay and anything we place in between this equal sign and this closing bracket is included within our assertion pattern.

Okay so now we are matching just this first hello world. So if we go and take this and put it into our code here we will return just the first hello world. Now on the other hand maybe we want to match something that comes after our pattern and to do that we use a look ahead assertion which as you've probably guessed is basically exactly the same but on the other side.

It does use a slightly different syntax but other than that there's really no difference. So in our case we're going to search for this comma. So in between the equal sign and the closing bracket here that's where we put our pattern and here again we're matching this hello world.

I'm going to put this in python and of course we'll just get the exact same thing. Now on the other hand maybe we don't want to assert that something is in front or behind our pattern we actually want to assert that something is not there. So what we can do is we can make this a negative look ahead by replacing this equal sign with an exclamation mark.

Now we are looking for the hello world that is not followed by a comma which is obviously this one and again with the look behind we just modify that as well. So whereas the look behind looked like this we just again remove the equal sign and replace it with an exclamation mark and then we get the second hello world.
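The four assertions might be sketched as below. The layout of the test string (a comma after the first hello world, none after the second) is an assumption based on the narration.

```python
import re

# Assumed layout of the example string from the video.
text = "1: hello world, 2: hello world"
first = text.index("hello world")
second = text.rindex("hello world")

def starts(pattern):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

lookbehind = starts(r"(?<=1: )hello world")      # preceded by "1: "
lookahead = starts(r"hello world(?=,)")          # followed by a comma
neg_lookahead = starts(r"hello world(?!,)")      # not followed by a comma
neg_lookbehind = starts(r"(?<!1: )hello world")  # not preceded by "1: "

print(lookbehind == [first], lookahead == [first])           # True True
print(neg_lookahead == [second], neg_lookbehind == [second])  # True True
```

In every case the match itself is just hello world; the asserted text is checked but never included in the result.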

Okay so that's it for the assertions let's move on to modifiers. So we can actually see we have a few modifiers here and these are essentially ways of modifying the behavior of our entire regular expression. Now obviously python doesn't have this little set regex options here so what we do instead is we can either do an inline modifier like this one and let me just give you a good example quickly.

So we'll just write out a string that includes a newline character in the middle. Okay and we're going to match character and then anything following that character. Now if you remember this anything meta character does not actually match anything, it matches anything except newlines. So in this case it is not matching here because we have a newline.

Now if we open this we can see that we have this single line dot matches newline. So we add that and now it's changed the behavior of our regex and the anything meta character also matches newline characters. Now let's remove it from here and we can add it inline like this.

So here we're just adding the s within this global modifying function and we can also add other modifiers as well if you want so you just all you need to do is add the letter that represents that global modifier and add it within those brackets. Now if we take that over to python and we add this in so here is our inline modifier yep it works that's great if we get rid of this it doesn't match anymore and that's what we would expect.

Now in python we can also add modifiers within the function itself so what we do is we add re dot and then the capital of the modifying flag and there we go it matches again. So there's a few different ways that you can do it in python. Now let's move over to conditionals this is the last one we're going to work through and these are probably a little more complex in my opinion to actually read and understand.
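Before conditionals, the two ways of applying the single-line modifier in Python can be sketched as follows. The string and pattern are hypothetical; re.S is the short alias for re.DOTALL.

```python
import re

text = "abc\ndef"  # hypothetical string with a newline in the middle

# By default the dot stops at a newline, so "c" followed by
# at least one more character cannot match here.
default = re.search(r"c.+", text)

# Inline modifier at the start of the pattern.
inline = re.search(r"(?s)c.+", text)

# The same modifier passed as a flag to the function.
flagged = re.search(r"c.+", text, flags=re.S)

print(default)                # None
print(repr(inline.group()))   # 'c\ndef'
print(repr(flagged.group()))  # 'c\ndef'
```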

So I'm going to use a few different examples here I'm just going to quickly make this up okay so we have these three lines each of them has something in common. Now what I want to do is enter a condition within a capture group which is either true or not.

Now I don't really want to specify that I need this condition within my regex because if it isn't there I want to search for something else and to do that we add this group here and this is basically our if else group. So I'm just going to put I here for now I'll explain that in a moment and this is our if the condition is true then we also want to search for this if the condition is not true then just search for this instead.
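Python's re module supports this if-else construct with the syntax (?(1)yes|no), where 1 is the index of the capture group being tested. Here is a minimal sketch, assuming hello as the condition, space world as the if branch and bye as the else branch; the three input lines are made up.

```python
import re

# Group 1 is the optional condition. If it matched, " world" must
# follow; if it did not, "bye" must appear instead.
pattern = re.compile(r"(hello)?(?(1) world|bye)")

def first_match(line):
    """Return the matched text, or None when nothing matches."""
    m = pattern.search(line)
    return m.group(0) if m else None

print(first_match("hello world"))  # hello world
print(first_match("bye now"))      # bye
print(first_match("hello there"))  # None
```

The last line fails both ways: with hello captured there is no space world after it, and with the group skipped there is no bye anywhere in the line.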

So our condition here is within a capturing group, and this token here refers to the index of our capture group. In our case we only have one capture group, and you can see here it even highlights "first capturing group", so this actually needs to be one. That essentially links our condition within this capturing group to this if-else statement. And we can see that now, okay, this condition is not true because we haven't written a condition anywhere. So it's going into the if clause and it's saying, okay, if it's true, which it isn't, search for this. But it isn't true, so we are actually only searching for bye. So it is producing a match because all we need to match is bye.

But in reality we actually do want to search for a condition, which is going to be hello. And here now we are matching two things. We're matching this because hello is true here, and if hello is true our if-else statement says, okay, now we need to match space world, which we do right here. And on this line, just like before, we're finding, okay, hello is not true, it doesn't say hello here, so this is not what we want to look for. We actually want to look for this, which is our else statement, and we do in fact have bye here, so that again does match.

So that's everything on the regular expressions, and I just want to quickly go through what the difference is between re.match, re.search and re.findall in Python. So let's remove these. I'm just going to write a string very quickly. Okay, we're just going to use this as our example.

Now if we do re.match, you remember before we had this start-of-string character. re.match essentially is like putting this character in front of whatever you type within it. What I mean by that is if we do re.match hello we will get a match. We also need to put .group after we use re.match or re.search, something to be aware of. Okay, fine, we would expect that because we're searching through and, yep, there's a hello here, there's a hello here, of course it's going to match a hello. That's fine.

But if we put world we don't match anything. So it's a NoneType, which means nothing has been matched, and the reason for that is that match automatically adds the start-of-string token in there. So if we put world at the start here this would work, but obviously before we didn't have it there so it didn't work. So that's what re.match does. It also just returns you one match, unlike findall, where you remember we're getting a list of matches.

Now re.search doesn't specify that we only need to look at the first part of the string. Instead re.search looks through the whole thing. So if we search world we actually do get a response because we're not specifying it needs to be here. Okay, so that's great, but you'll also notice that we are only returning one thing here, and that will not change if we add hello. We're still just returning one item. So what re.search does is it comes through here, it searches the whole string, but it only returns the first instance. So it gets to here, it says okay, I found hello, and then it returns that match. It doesn't go any further and find anything else.

And that's where findall is a little bit different. With findall we can't use group here, we just print out x, and it does go through and find everything.

So that's it for this video. I hope it's been useful. We've been through a lot of regex, so don't worry if it kind of blows your mind a bit if you're new to this. It is quite a lot. But nonetheless regex is super important and I would definitely recommend getting familiar with it if this is new to you. It's an incredibly useful skill no matter what you are specializing in. As long as you code, you're probably going to use regex. So that's the video on regex. I hope you enjoyed, and as always thank you for watching. Bye.
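The differences between re.match, re.search and re.findall described in the closing section can be sketched as follows. The string is a hypothetical stand-in for the one typed in the video.

```python
import re

text = "hello world, hello again"  # hypothetical example string

# re.match only looks at the very start of the string.
match_hello = re.match(r"hello", text)
match_world = re.match(r"world", text)   # None: "world" is not at the start

# re.search scans the whole string but stops at the first hit.
search_world = re.search(r"world", text)
search_hello = re.search(r"hello", text)

# re.findall returns every match as a plain list of strings.
all_hello = re.findall(r"hello", text)

print(match_hello.group())   # hello
print(match_world)           # None
print(search_world.group())  # world
print(search_hello.start())  # 0
print(all_hello)             # ['hello', 'hello']
```

Because re.match and re.search return None when nothing matches, calling .group() on the result without checking it first raises an AttributeError.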