Novice to Advanced RegEx in Less-than 30 Minutes + Python
Chapters
0:00
13:09 character sets
15:00 boundaries
17:41 assertions
21:00 modifiers
00:00:00.000 |
Hi and welcome to the video. Today we're going to look at regex which is short for regular 00:00:06.240 |
expressions. This is essentially the de facto standard for parsing text. So what we're going 00:00:13.600 |
to do in this video is run through the basics of regex first. So stuff like how to define digits, 00:00:23.120 |
whitespace, how to use quantifiers to tell us how many digits or whitespace or any other character 00:00:31.040 |
we want to include, how to use capture groups and character classes and also how we can use 00:00:38.480 |
boundary definitions. So how we define the start of a line, the end of a line or boundaries of a 00:00:44.720 |
word. And we should move through these quite quickly because they're not very difficult, 00:00:50.160 |
they're pretty straightforward. And then we can move on to what I think is the more exciting 00:00:55.120 |
interesting stuff which is a little more advanced. So these are things like look ahead or look behind 00:01:02.880 |
assertions, modifiers and conditionals which are essentially like if-else statements but for your 00:01:11.440 |
regex code which is pretty interesting. So we're going to work through all of that in this video, 00:01:16.800 |
we're going to code through a few examples in Python, and we're also going to use regex101 which 00:01:20.800 |
is like an interactive debugger or regex building tool that we can use. So it should be pretty 00:01:28.560 |
interesting and let's jump straight into it. Okay so we're going to start in regex101 and we're 00:01:34.640 |
just going to have a look through a few of the metacharacters. So let's say we have this string, 00:01:42.480 |
I'm just going to make it up. We have a few letters in here, some numbers and two other 00:01:52.080 |
characters as well. Now we can obviously directly match these by actually writing out the exact 00:02:01.120 |
characters but generally we're not going to want to do that if we're using regex. 00:02:09.040 |
So this is where we start to use metacharacters. So we'll just go through the most common ones, 00:02:15.760 |
we have digits, backslash d; this will match any single digit, so anything between 0 and 9. 00:02:21.520 |
And we can invert that with an uppercase D, and then that will match anything that is not a digit. 00:02:28.880 |
We have W which will match any word character and then we can also reverse that again by 00:02:36.880 |
capitalizing it. So all of these metacharacters we can usually uppercase them and it will reverse 00:02:43.040 |
what they're doing. We do whitespace with S; in this case we don't actually have any whitespace 00:02:50.400 |
so let's add some in. And it will also match newlines as well. So you can't see this highlighted 00:02:58.880 |
but if you look up here you can see two matches, then if I add that newline in it goes up to three. 00:03:05.600 |
And then to do anything but whitespace we just uppercase that again. And then this one is a 00:03:11.840 |
little bit of a special one: the dot matches any character except for newlines. So this is not 00:03:18.000 |
matching our new line here. We obviously can't uppercase this one but I suppose the opposite 00:03:25.920 |
would actually just be a new line character like this. So let's switch over to Python and see how 00:03:33.840 |
we would do this. So we import re which is the regex module and we'll go through these in a little 00:03:40.000 |
more depth later but for now I'm just going to do re.findall. And in our first argument here 00:03:51.280 |
we put the pattern that we are going to use to search. So in this case it would be backslash n, 00:04:00.320 |
although we're not going to use that, we're going to use backslash d for any digits. 00:04:07.680 |
Okay and we return 1 0 0 0. Okay so these four characters here which is exactly what we would 00:04:20.240 |
get here. Okay so that's cool. Now in the case of this full stop what if we would like to actually 00:04:29.040 |
match a full stop and just a full stop. To do that we actually escape the meta character using 00:04:36.400 |
a backslash and just like that. Okay so that's it for meta characters. Let's move on to quantifiers. 00:04:44.000 |
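The metacharacters covered so far can be tried out in Python roughly as follows. This is only a sketch: the test string `text` is a made-up stand-in for the one typed in the video (a few letters, the digits 1 0 0 0, and two other characters), so the exact outputs in the video may differ.

```python
import re

# Hypothetical test string: a few letters, some digits, two other characters.
text = "abc1000.?"

digits = re.findall(r"\d", text)          # any single digit
non_digits = re.findall(r"\D", text)      # anything that is not a digit
word_chars = re.findall(r"\w", text)      # letters, digits and underscore
whitespace = re.findall(r"\s", "a b\n")   # spaces, tabs, newlines
any_char = re.findall(r".", "a\nb")       # any character except a newline
literal_dot = re.findall(r"\.", text)     # escaped dot: only a literal full stop

print(digits)       # ['1', '0', '0', '0']
print(literal_dot)  # ['.']
```

Raw strings (the `r` prefix) are used so that Python itself does not interpret the backslashes before the regex engine sees them.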
So quantifiers essentially allow us to match a specific number of characters. So as of yet we've 00:04:52.480 |
only been matching one at a time. So here we are matching four characters but we're only matching 00:04:58.560 |
one character at a time. Four times one, two, three, four. Whereas quantifiers allow us to 00:05:04.800 |
write our pattern and then add this quantifier to specify how many times to actually match that. 00:05:10.080 |
So the first of those is the one or more quantifier. So this will match that pattern 00:05:16.800 |
one or more times, just a plus sign. We also have zero or more quantifier. So this is matching 00:05:25.680 |
that pattern zero times or more times. Okay and something that we can also add in here. 00:05:33.120 |
As you'll see this is matching it as many times as possible but maybe we actually want to limit 00:05:40.640 |
the number of times I'm matching something. And this is a difference between what is called greedy 00:05:46.640 |
and lazy quantifiers. So at the moment we have a greedy quantifier. So it's saying one or more 00:05:54.000 |
times and it's going all the way up to four. Okay which is as many characters as it can fit into 00:05:59.840 |
its pattern. But if you want it to not do that and instead be lazy and simply pick up as few 00:06:08.240 |
characters as possible that match the criteria, we can just add a question mark onto the end. 00:06:13.520 |
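As a rough illustration of greedy versus lazy quantifiers, the sketch below runs the one-or-more and zero-or-more quantifiers over a made-up string (the string in the video is not shown in the transcript, so `text` is an assumption).

```python
import re

# Hypothetical test string containing the four digits 1 0 0 0.
text = "abc1000.?"

one_at_a_time = re.findall(r"\d", text)   # no quantifier: one digit per match
greedy = re.findall(r"\d+", text)         # one or more, as many as possible
lazy = re.findall(r"\d+?", text)          # one or more, as few as possible

# Zero or more also produces empty matches wherever there is no digit,
# so the empty strings are filtered out here.
zero_or_more = [m for m in re.findall(r"\d*", text) if m]

print(greedy)  # ['1000']
print(lazy)    # ['1', '0', '0', '0']
```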
And then we're back to matching just one because it's one or more and we are limiting it to the 00:06:19.040 |
minimum number of matches there. So keep that in mind and we'll quickly go over it again 00:06:24.880 |
a little later. We also have the once-or-none quantifier. So let's write a new test string 00:06:33.360 |
here. So here we have a few words and we'd like to match all of the words. So what we can do 00:06:51.600 |
is this. And here we're kind of matching all the words but there's this one in the middle where we 00:06:58.000 |
have a hyphen in the middle. And this is something that will happen quite often. And ideally we also 00:07:06.000 |
want to match good-hearted as a single word. So we could do this. But then we're only matching that 00:07:19.680 |
single good-hearted part. So instead we add a once-or-none quantifier. Okay so now we're matching 00:07:28.720 |
that word as well. Now if we also want to match the A, because you can see here it's not matching 00:07:36.000 |
because we're expecting at least two word characters because we have W here and W here, 00:07:44.480 |
we can just add a quantifier that allows zero matches onto the end there. And now we're getting all of our 00:07:50.000 |
words together including the hyphenated words. We can also specify a specific quantity which we do 00:07:59.280 |
like this. So here we are getting three word characters at a time. You can see here we're 00:08:07.040 |
not specifying three characters that make up a word. We're just saying three characters. So 00:08:14.080 |
here we're getting multiple matches for single words which is fine because we haven't specified 00:08:19.200 |
that. We'll go over how to define word boundaries later. And we can also turn this into a range. 00:08:27.040 |
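The word-matching steps above could look something like this in Python. The sentence is invented for illustration (the transcript only tells us it contains "good-hearted" and a single-letter word), and the exact patterns typed in the video are not shown, so treat these as one plausible reading.

```python
import re

# Hypothetical sentence with a hyphenated word and a one-letter word.
text = "she is a good-hearted person"

plain_words = re.findall(r"\w+", text)           # splits good-hearted in two
hyphen_required = re.findall(r"\w+-\w+", text)   # only the hyphenated word
hyphen_optional = re.findall(r"\w+-?\w+", text)  # misses "a": needs two word characters
all_words = re.findall(r"\w+-?\w*", text)        # second part may be empty
three_chars = re.findall(r"\w{3}", text)         # exactly three word characters at a time

print(hyphen_optional)  # ['she', 'is', 'good-hearted', 'person']
print(all_words)        # ['she', 'is', 'a', 'good-hearted', 'person']
print(three_chars)      # ['she', 'goo', 'hea', 'rte', 'per', 'son']
```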
So let's go three to five. Okay so now we're matching a minimum of three characters and a 00:08:37.440 |
maximum of five. Now you might have guessed this but we can actually just remove the first of these 00:08:42.320 |
numbers to get up to five. Now if this doesn't work for you just make sure that you 00:08:47.600 |
are using the Python flavor because for other languages this might change and if you're on 00:08:54.160 |
PCRE for example this won't work. So change back to Python. 00:09:00.320 |
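A small sketch of the range quantifiers in Python's `re` module, again using an invented sentence. Because an open lower bound such as `{,5}` can also match the empty string, it is shown with `re.fullmatch` on single words rather than with `re.findall`.

```python
import re

# Hypothetical sentence, reused purely for illustration.
text = "she is a good-hearted person"

three_to_five = re.findall(r"\w{3,5}", text)  # between three and five word characters

# {,5} means "from zero up to five" in Python's flavor.
short_ok = re.fullmatch(r"\w{,5}", "abc") is not None
too_long = re.fullmatch(r"\w{,5}", "abcdef") is not None

print(three_to_five)  # ['she', 'good', 'heart', 'perso']
print(short_ok)       # True
print(too_long)       # False
```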
And we can also do three or more as well. Now I think this is a good example with our 00:09:11.440 |
lazy quantifier. So here we're matching anywhere from three up to five characters at a time. 00:09:19.280 |
If we add a lazy quantifier it's always going to limit that as much as possible. So we're 00:09:23.680 |
going to go down to three. Okay so you can see here that it's limiting how many characters 00:09:29.840 |
it's including in there. It's getting lazy rather than greedy. Okay so let's write out a 00:09:36.400 |
new example here. Okay so a few unexpected words are to be expected. So this is a good example of 00:09:47.280 |
where we can use capture groups. So anything contained within round brackets will create a 00:09:56.400 |
capture group. So a capture group is simply a fancy way of saying treat everything within these 00:10:01.840 |
brackets as a single unit. So you can see here I only put these dots in as a filler but it actually 00:10:08.400 |
matches because those dots mean anything. So three anythings in a row and this is matching 00:10:16.000 |
basically anything. So we have all these matches here and it's treating those as a unit. So it's 00:10:25.120 |
doing three anythings and matching that and then moving on to next three anythings. Now what we 00:10:31.920 |
can do is we want to match unexpected and expected. So we want to match the word with or without its 00:10:39.760 |
negative prefix. So we can add expected here. But here we're only getting expected we're not 00:10:49.120 |
getting the un from unexpected. So we just add this. So now we're getting unexpected but we're 00:10:56.800 |
not getting expected because we have specified okay we want un here. We want this capture group. 00:11:04.640 |
So all we need to do is actually make this optional by adding a zero or one quantifier. 00:11:12.400 |
And there we go we are now capturing unexpected and expected. 00:11:16.480 |
So let's have a quick go at this in Python see what it looks like. 00:11:21.440 |
Okay and we run this and now we find that we are only seeing un which is probably not what 00:11:36.320 |
we're expecting. The reason this happens is that findall returns the contents of capture groups 00:11:43.120 |
whenever the pattern contains them, which is exactly what we have here. So what we can do is modify this capture group to 00:11:50.640 |
make it a non-capturing group while still maintaining this behavior of zero or one. 00:11:57.840 |
So all we do to do that is add a question mark inside followed by a colon and then here we are 00:12:06.000 |
capturing everything again. So that's just a little bit of a strange behavior to watch out for. 00:12:12.880 |
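The `findall` behavior described here can be reproduced with the example sentence from the video. The patterns below are a reconstruction of what is described (an optional `un` group in front of `expected`), not a copy of what was typed on screen.

```python
import re

text = "a few unexpected words are to be expected"

# With a capturing group, findall returns the group's contents,
# which is an empty string when the optional group did not take part.
capturing = re.findall(r"(un)?expected", text)

# With a non-capturing group, findall returns the whole match.
non_capturing = re.findall(r"(?:un)?expected", text)

print(capturing)      # ['un', '']
print(non_capturing)  # ['unexpected', 'expected']
```

If the group itself is still needed, `re.finditer` is another option, since each match object exposes both the full match (`group(0)`) and the individual groups.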
Now we can also add or logic to our capture groups. 00:12:19.120 |
So maybe we want to capture anything where we are saying 00:12:25.440 |
expected with a negative prefix and that can either be not expected or unexpected. 00:12:34.240 |
Now to do this we actually just add a pipe into our capturing group and then we add not like so. 00:12:51.680 |
And now we are matching both not expected and unexpected. 00:12:59.120 |
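A minimal sketch of the or logic inside a group, assuming the two prefixes are `un` and `not ` (with a trailing space). The test string is made up, and a non-capturing group is used so that `findall` returns the whole match.

```python
import re

# Hypothetical string with both negative forms and the plain word.
text = "not expected, unexpected, expected"

# The pipe means "either the left alternative or the right one".
negated = re.findall(r"(?:un|not )expected", text)

# The same pattern with a capturing group returns only the prefix.
prefixes = re.findall(r"(un|not )expected", text)

print(negated)   # ['not expected', 'unexpected']
print(prefixes)  # ['not ', 'un']
```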
Okay so that's it for capture groups and let's move on to character sets. 00:13:04.240 |
So the syntax for character sets is kind of similar to the syntax for capture groups in that 00:13:12.640 |
we use brackets but this time they're square brackets instead. And you can see these as kind 00:13:18.640 |
of like a list. So anything we put within here will be treated as a character to match. 00:13:26.320 |
But unlike capture groups it's not treating them all 00:13:33.840 |
as a unit. So if we put un in here it's actually just matching either u or n. And we put unexpected 00:13:42.640 |
and it's not going to match unexpected as a unit, it's just going to match each one of those characters 00:13:48.480 |
within those square brackets. So let's return to our earlier example. 00:13:56.640 |
So earlier on we were matching all of the digits in our string. 00:14:03.920 |
So what we could do is write out all of the digits like this and we get the exact same effect. 00:14:14.800 |
Obviously this is quite long so what we can do instead is write this with a dash in the middle. 00:14:22.800 |
And this is any character within the range of 0 to 9. We can also add letters to this. 00:14:30.240 |
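To make the character set behavior concrete, here is a short sketch. The digit string is a stand-in for the earlier example in the video, which is not reproduced in the transcript.

```python
import re

# Hypothetical stand-in for the earlier example string.
text = "abc1000.?"

# A set matches one character at a time, never the contents as a unit.
u_or_n = re.findall(r"[un]", "unexpected")

listed = re.findall(r"[0123456789]", text)  # every digit written out
ranged = re.findall(r"[0-9]", text)         # the same set written as a range
mixed = re.findall(r"[0-9a-z]", text)       # digits plus lowercase letters

print(u_or_n)            # ['u', 'n']
print(listed == ranged)  # True
print(mixed)             # ['a', 'b', 'c', '1', '0', '0', '0']
```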
So a to z for example. And you might also think okay we can also add these hyphens in right? 00:14:39.440 |
But obviously we are using these hyphens to define our ranges. So in order to add a hyphen 00:14:46.240 |
in here we need to use a backslash to escape it. And now we are matching the full string. And 00:14:52.240 |
if we want to match the full string as a whole of course we just add our quantifier. Now let's 00:14:57.520 |
move on to boundaries. So I'm just going to write out a new string for this. 00:15:10.320 |
Okay so here I want to show you the start-of-string and end-of-string boundaries. So start-of-string is 00:15:21.680 |
using this caret character. So here if we put ifit it's only going to match ifit at the start here. 00:15:29.280 |
It's not going to match ifit here as well. And if we remove the character it does. Okay so 00:15:35.040 |
we add this character to specify that we only want to search from the very start of our string. 00:15:43.040 |
Now the equal and opposite of the start-of-string character is the end-of-string character. 00:15:51.440 |
And that is a dollar symbol. So let's rewrite this. 00:15:55.520 |
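The two string anchors can be sketched like this. The sentence is invented; it simply contains the word `example` both at the very start and at the very end so that each anchor has something to exclude.

```python
import re

# Hypothetical string with "example" at both ends.
text = "example at the start, and at the end another example"

everywhere = [m.start() for m in re.finditer(r"example", text)]
at_start = [m.start() for m in re.finditer(r"^example", text)]  # caret: start of string
at_end = [m.start() for m in re.finditer(r"example$", text)]    # dollar: end of string

print(len(everywhere))  # 2
print(at_start)         # [0]
print(at_end == [len(text) - len("example")])  # True
```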
Okay and we want to look for example. And here with the dollar symbol we only match the final 00:16:13.440 |
example rather than both of them. So you can see there. I'm going to go back to one of the earlier 00:16:19.840 |
examples again now. Okay so here I also want to show you the word boundary. So the best way to 00:16:38.160 |
identify a word boundary is not by using for example whitespace. Because yes that does work in a lot 00:16:45.440 |
of cases but it doesn't work if we have a comma, full stop, hyphen or anything like this. So what 00:16:53.760 |
we can do instead is use backslash b. And this identifies every single word boundary within our 00:17:01.680 |
text as you can see from the pink lines. So then we can use that to capture any of our words. And 00:17:08.400 |
now quite easily we've captured every single word and we're pulling them out in a more efficient 00:17:13.520 |
way than if we had tried to write, you know, backslash s, or if we'd have gone with a grouping like this and 00:17:21.120 |
added all these different things. All we need to do is add a word boundary. Okay so now we'll move 00:17:28.640 |
on to some of the what I think are the more interesting and definitely a bit more advanced 00:17:33.760 |
methods in Regex. So the first of those is the look ahead and look behind assertions. 00:17:46.160 |
Here we have two hello worlds. One of them is preceded by a one and a colon. The other is 00:17:55.280 |
preceded by a two and a colon. Now what if we want to match hello world but we only want to match 00:18:04.000 |
hello world if it is preceded by a one and a colon? But we don't want to include that one and 00:18:10.080 |
a colon within our match, because if we wanted to do that we would just write this. But this will 00:18:15.600 |
return the entire string. So I'll show you over here. Okay so we're returning the full string. 00:18:30.960 |
What if we only actually want to pull out this hello world? Well we could go for hello world 00:18:37.680 |
but then we're returning both. We don't want to do that we only want the first one. So to do this 00:18:43.680 |
we use a look behind assertion. So this means that we are looking behind our pattern which means 00:18:52.160 |
anything preceding it and we are asserting that there is this other pattern there. So we do this 00:18:59.760 |
with this pattern here. Okay and anything we place in between this equal sign and this 00:19:08.960 |
closing bracket is included within our assertion pattern. 00:19:12.480 |
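A sketch of the lookbehind in Python. The exact test string is not shown in the transcript, so the one below is an assumption that fits the description: two hello worlds, the first preceded by `1: ` and followed by a comma.

```python
import re

# Hypothetical string matching the description in the video.
text = "1: hello world, 2: hello world"

with_prefix = re.findall(r"1: hello world", text)  # prefix is part of the match
both = re.findall(r"hello world", text)            # matches both occurrences

# Lookbehind: assert that "1: " comes before, without including it.
behind = [(m.start(), m.group()) for m in re.finditer(r"(?<=1: )hello world", text)]

print(with_prefix)  # ['1: hello world']
print(len(both))    # 2
print(behind)       # [(3, 'hello world')]
```

Note that Python's `re` module requires the lookbehind pattern to have a fixed width, which a literal such as `1: ` satisfies.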
Okay so now we are matching just this first hello world. So if we go and take this and put it into 00:19:22.000 |
our code here we will return just the first hello world. Now on the other hand maybe we want to 00:19:32.720 |
match something that comes after our pattern and to do that we use a look ahead assertion which 00:19:38.640 |
as you've probably guessed is basically exactly the same but on the other side. 00:19:43.200 |
It does use a slightly different syntax but other than that there's really no difference. So in our 00:19:50.720 |
case we're going to search for this comma. So in between the equal sign and the closing bracket 00:19:57.520 |
here that's where we put our pattern and here again we're matching this hello world. 00:20:03.680 |
I'm going to put this in python and of course we'll just get the exact same thing. 00:20:08.880 |
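The lookahead version might look like this in Python, using the same assumed test string as a stand-in for the one in the video.

```python
import re

# Hypothetical string: only the first hello world is followed by a comma.
text = "1: hello world, 2: hello world"

# Lookahead: assert that a comma follows, without including it in the match.
ahead = [(m.start(), m.group()) for m in re.finditer(r"hello world(?=,)", text)]

print(ahead)  # [(3, 'hello world')]
```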
Now on the other hand maybe we don't want to assert that something is in front or behind 00:20:16.240 |
our pattern we actually want to assert that something is not there. So what we can do 00:20:21.360 |
is we can make this a negative look ahead by replacing this equal sign with an exclamation 00:20:27.520 |
mark. Now we are looking for the hello world that is not followed by a comma which is obviously this 00:20:34.240 |
one and again with the look behind we just modify that as well. So whereas the look behind looked 00:20:45.120 |
like this we just again remove the equal sign and replace it with an exclamation mark 00:20:51.840 |
and then we get the second hello world. Okay so that's it for the assertions let's move on to 00:20:58.720 |
modifiers. So we can actually see we have a few modifiers here and these are essentially ways of 00:21:06.800 |
modifying the behavior of our entire regular expression. Now obviously python doesn't have 00:21:13.200 |
this little set regex options here so what we do instead is we can either do an inline modifier 00:21:22.080 |
like this one and let me just give you a good example quickly. 00:21:27.360 |
So we'll just write out a string that includes a newline character in the middle. 00:21:39.360 |
Okay and we're going to match character and then anything following that character. Now if you 00:21:46.000 |
remember this anything meta character does not actually match anything, it matches anything 00:21:51.120 |
except for newlines. So in this case it is not matching here because this is a newline. 00:21:58.000 |
Now if we open this we can see that we have this single line dot matches newline. So we add that 00:22:04.160 |
and now it's changed the behavior of our regex and the anything meta character 00:22:10.640 |
also matches newline characters. Now let's remove it from here and we can add it inline like this. 00:22:18.960 |
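In Python the inline form of the single-line modifier can be sketched as follows. The string is a made-up example with a newline directly after the word `character`; the pattern in the video may differ.

```python
import re

# Hypothetical string with a newline right after "character".
text = "a newline character\nin the middle"

# Without the modifier the dot refuses to match the newline.
without_modifier = re.findall(r"character.", text)

# (?s) at the start of the pattern makes the dot match newlines as well.
with_inline = re.findall(r"(?s)character.", text)

print(without_modifier)  # []
print(with_inline)       # ['character\n']
```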
So here we're just adding the s within this inline modifier group and we can also add 00:22:28.320 |
other modifiers as well if you want, so all you need to do is add the letter that represents 00:22:33.840 |
that modifier and add it within those brackets. Now if we take that over to python 00:22:43.040 |
and we add this in so here is our inline modifier yep it works that's great if we get rid of this 00:22:59.760 |
it doesn't match anymore and that's what we would expect. Now in python we can also add modifiers 00:23:07.280 |
within the function itself so what we do is we add re dot and then the uppercase letter of the 00:23:14.720 |
modifier flag, and there we go it matches again. So there's a few different ways that you can do 00:23:21.600 |
it in python. Now let's move over to conditionals this is the last one we're going to work through 00:23:28.240 |
and these are probably a little more complex in my opinion to actually read and understand. 00:23:34.400 |
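As a reference for the walkthrough that follows, here is a minimal sketch of a conditional in Python's `re` syntax, `(?(1)yes|no)`. The three test lines and the words used (`hello`, ` world`, `bye`) are assumptions based on how the example is described, not the exact text from the video.

```python
import re

# Group 1 is the condition. If it took part in the match, " world" must
# follow; otherwise the pattern falls back to matching "bye".
pattern = re.compile(r"(hello)?(?(1) world|bye)")

# Hypothetical test lines.
lines = ["hello world", "goodbye", "hello there"]

results = []
for line in lines:
    match = pattern.search(line)
    # group(0) is the whole match; None means nothing matched.
    results.append(match.group(0) if match else None)

print(results)  # ['hello world', 'bye', None]
```

`re.search` with `group(0)` is used here rather than `re.findall`, because `findall` would return only the contents of the capture group.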
So I'm going to use a few different examples here. 00:23:44.800 |
Okay so we have these three lines; each of them has something in common. Now what I want to do 00:23:54.400 |
is enter a condition within a capture group which is either true or not. Now I don't really want to 00:24:04.320 |
specify that I need this condition within my regex because if it isn't there I want to search for 00:24:10.000 |
something else, and to do that we add this group here, and this is basically our if else group. 00:24:19.760 |
So I'm just going to put I here for now, I'll explain that in a moment, and this is our if: if the 00:24:25.120 |
condition is true then we also want to search for this; if the condition is not true then just search 00:24:32.880 |
for this instead. So our condition here is within a capturing group and this token here refers to 00:24:44.240 |
the index of our capture group. So in our case we only have one capture group, and you can see here 00:24:50.960 |
it even highlights first capturing group, so this actually needs to be one, and then that essentially 00:24:58.480 |
links our condition within this capturing group to this if else statement, and we can see that now, 00:25:06.400 |
okay this condition is not true because we haven't written condition anywhere so it's going into the 00:25:11.600 |
if clause and it's saying okay if it's true, which it isn't, search for this, right, but it isn't true 00:25:18.720 |
so we are actually only searching for bye, so it is producing a match because all we need to match 00:25:24.400 |
is bye. But in reality we actually do want to search for a condition, which is going to be hello, and 00:25:33.120 |
here now we are matching two things. We're matching this because hello is true here and if hello is 00:25:40.960 |
true our if else statement says okay now we need to match space world which we do right here. And 00:25:49.200 |
on this line just like before we're finding okay hello is not true, it doesn't say hello here, so this 00:25:57.120 |
is not what we want to look for, we actually want to look for this, which is our else statement, and 00:26:02.240 |
we do in fact have bye here so that again does match. So I mean that's everything on the regular 00:26:11.520 |
expressions and I just want to quickly go through what the difference is between re.match, re.search 00:26:17.680 |
and re.findall in python. So let's remove these. So I'm just going to write a string very quickly. 00:26:29.840 |
Okay we're just going to use this as our example. Now if we do re.match, you remember before we had 00:26:38.000 |
this start-of-string character; re.match essentially is like putting this character in front of 00:26:44.400 |
whatever you type within it. So what I mean by that is if we do re.match hello 00:26:54.880 |
we will get a match. So as well we need to put dot group after we use re.match or re.search, 00:27:02.400 |
something to be aware of. Okay and yeah okay fine we would expect that because we're searching 00:27:07.280 |
through and yep there's a hello here, there's a hello here, of course it's going to match a hello, 00:27:10.480 |
that's fine. But if we put world we don't match anything. Okay so it's a NoneType which means 00:27:18.560 |
nothing has been matched, and the reason for that is that match behaves as if the start-of-string 00:27:25.680 |
token were in there. So if we put world at the start here this would work. Okay but obviously before 00:27:36.640 |
we didn't have it there so it didn't work. So that's what re.match does. It also just returns 00:27:44.000 |
you one match, unlike findall where you remember we're getting a list of matches. Now re.search doesn't 00:27:51.680 |
specify that we only need to look at the first part of the string; instead re.search looks 00:27:58.640 |
through the whole thing. So if we search world we actually do get a response because we're not 00:28:07.760 |
specifying it needs to be here. Okay so that's great but you'll also notice that we are only 00:28:12.880 |
returning one thing here and that will not change if we add hello. Okay we're still just returning 00:28:22.080 |
one item. So what re.search does is it comes through here, it searches the whole string, but 00:28:28.560 |
it only returns the first instance. So it gets to here, it says okay I found hello, and then it 00:28:34.000 |
returns that match. It doesn't go any further and find anything else, and that's where findall 00:28:41.200 |
is a little bit different. So with findall we can't use group here, we just print out x, and it does go through 00:28:51.040 |
and find everything. So that's it for this video. I hope it's been useful. We've been, you know, 00:28:58.640 |
through a lot of regex, so don't worry if it kind of blows your mind a bit. If you're new to this 00:29:04.480 |
it is quite a lot, but nonetheless regex is super important. I would definitely recommend 00:29:11.760 |
getting familiar with it if this is new to you. It's an incredibly useful skill no matter 00:29:17.360 |
what you are specializing in; as long as you code you're probably going to use regex. So that's the 00:29:22.720 |
video on regex. I hope you enjoyed and as always thank you for watching. Bye.
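As a recap of the difference between `re.match`, `re.search` and `re.findall` described in the final section, here is a short sketch. The example string is an assumption; the transcript only tells us it contains `hello` twice and that `world` is not at the start.

```python
import re

# Hypothetical example string.
text = "hello world, hello again"

# re.match only looks at the very start of the string.
match_hello = re.match(r"hello", text).group()  # 'hello'
match_world = re.match(r"world", text)          # None: "world" is not at the start

# re.search scans the whole string but returns only the first hit.
search_world = re.search(r"world", text).group()  # 'world'
search_hello = re.search(r"hello", text).group()  # 'hello' (the first one only)

# re.findall returns a list of every hit, so there is no .group() call.
all_hello = re.findall(r"hello", text)  # ['hello', 'hello']

print(match_world)  # None
print(all_hello)    # ['hello', 'hello']
```

Because `re.match` and `re.search` return `None` when nothing matches, calling `.group()` directly on the result raises an `AttributeError` in that case, so real code usually checks the result first.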