
Novice to Advanced RegEx in Less-than 30 Minutes + Python


Chapters

0:0
13:9 character sets
15:0 boundaries
17:41 assertions
21:0 modifiers


00:00:00.000 | Hi and welcome to the video. Today we're going to look at regex which is short for regular
00:00:06.240 | expressions. This is essentially the de facto standard for parsing text. So what we're going
00:00:13.600 | to do in this video is run through the basics of regex first. So stuff like how to define digits,
00:00:23.120 | whitespace, how to use quantifiers to tell us how many digits or whitespace or any other character
00:00:31.040 | we want to include, how to use capture groups and character classes and also how we can use
00:00:38.480 | boundary definitions. So how we define the start of a line, the end of a line or boundaries of a
00:00:44.720 | word. And we should move through these quite quickly because they're not very difficult,
00:00:50.160 | they're pretty straightforward. And then we can move on to what I think is the more exciting
00:00:55.120 | interesting stuff which is a little more advanced. So these are things like look ahead or look behind
00:01:02.880 | assertions, modifiers and conditionals which are essentially like if-else statements but for your
00:01:11.440 | regex code which is pretty interesting. So we're going to work through all of that in this video,
00:01:16.800 | we're going to code through a few examples in Python, we're also going to use regex101 which
00:01:20.800 | is like an interactive debugger or regex building tool that we can use. So it should be pretty
00:01:28.560 | interesting and let's jump straight into it. Okay so we're going to start in regex101 and we're
00:01:34.640 | just going to have a look through a few of the metacharacters. So let's say we have this string,
00:01:42.480 | I'm just going to make it up. We have a few letters in here, some numbers and two other
00:01:52.080 | characters as well. Now we can obviously directly match these by actually writing out the exact
00:02:01.120 | characters but generally we're not going to want to do that if we're using regex.
00:02:09.040 | So this is where we start to use metacharacters. So we'll just go through the most common ones,
00:02:15.760 | we have \d for digits, which will match any digit, so anything between 0 and 9.
00:02:21.520 | And we can invert that with an uppercase \D, and then that will match anything that is not a digit.
00:02:28.880 | We have \w which will match any word character and then we can also reverse that again by
00:02:36.880 | capitalizing it. So for all of these metacharacters we can usually uppercase them and it will reverse
00:02:43.040 | what they're doing. We do whitespace with \s; in this case we don't actually have any whitespace
00:02:50.400 | so let's add some in. And it will also match newlines as well. So you can't see this highlighted
00:02:58.880 | but if you look up here you can see two matches, then if I add that newline in it goes up to three.
00:03:05.600 | And then to do anything but whitespace we just uppercase that again. And then this one is a
00:03:11.840 | little bit of a special one: the dot matches any character except for newlines. So this is not
00:03:18.000 | matching our new line here. We obviously can't uppercase this one but I suppose the opposite
00:03:25.920 | would actually just be a new line character like this. So let's switch over to Python and see how
00:03:33.840 | we would do this. So we import re which is the regex module and we'll go through these a little
00:03:40.000 | bit in more depth later but for now I'm just going to do re.findall. And in our first argument here
00:03:51.280 | we put the pattern that we are going to use to search. So in this case it would be backslash n,
00:04:00.320 | although we're not going to use that, we're going to use backslash d for any digits.
00:04:04.560 | And then let's just pull this one in.
00:04:07.680 | Okay and we return 1 0 0 0. Okay so these four characters here which is exactly what we would
00:04:20.240 | get here. Okay so that's cool. Now in the case of this full stop what if we would like to actually
00:04:29.040 | match a full stop and just a full stop. To do that we actually escape the metacharacter using
00:04:36.400 | a backslash, just like that. Okay so that's it for metacharacters. Let's move on to quantifiers.
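The metacharacters covered so far can be sketched in Python as follows. The test string is made up for illustration (the one typed in the video is not in the transcript); it is chosen so that `\d` returns the same four digits mentioned above.

```python
import re

# Hypothetical test string: letters, digits, whitespace, a full stop and a newline.
text = "abc 1000.\nxyz"

digits = re.findall(r"\d", text)       # any digit, 0 to 9
non_digits = re.findall(r"\D", text)   # anything that is not a digit
word_chars = re.findall(r"\w", text)   # letters, digits and underscore
whitespace = re.findall(r"\s", text)   # spaces, tabs and newlines
any_char = re.findall(r".", text)      # any character except a newline
full_stops = re.findall(r"\.", text)   # escaped dot: a literal full stop only

print(digits)
print(whitespace)
print(full_stops)
```

Using raw strings (the `r"..."` prefix) keeps Python from interpreting the backslashes before the regex engine sees them.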
00:04:44.000 | So quantifiers essentially allow us to match a specific number of characters. So as of yet we've
00:04:52.480 | only been matching one at a time. So here we are matching four characters but we're only matching
00:04:58.560 | one character at a time. Four times one, two, three, four. Whereas quantifiers allow us to
00:05:04.800 | write our pattern and then add this quantifier to specify how many times to actually match that.
00:05:10.080 | So the first of those is the one or more quantifier. So this will match that pattern
00:05:16.800 | one or more times, just a plus sign. We also have zero or more quantifier. So this is matching
00:05:25.680 | that pattern zero times or more times. Okay and something that we can also add in here.
00:05:33.120 | As you'll see this is matching it as many times as possible but maybe we actually want to limit
00:05:40.640 | the number of times I'm matching something. And this is a difference between what is called greedy
00:05:46.640 | and lazy quantifiers. So at the moment we have a greedy quantifier. So it's saying one or more
00:05:54.000 | times and it's going all the way up to four. Okay which is as many characters as it can fit into
00:05:59.840 | its pattern. But if you want it to not do that and instead be lazy and simply pick up as few
00:06:08.240 | characters as possible that match the criteria, we can just add a question mark onto the end.
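A minimal sketch of the greedy and lazy behaviour described here, using an assumed test string rather than the one from the video:

```python
import re

# Hypothetical test string containing a run of four digits.
text = "abc 1000"

greedy = re.findall(r"\d+", text)        # one or more, as many as possible
lazy = re.findall(r"\d+?", text)         # one or more, as few as possible
zero_or_more = re.findall(r"\d*", text)  # zero or more: also produces empty matches

print(greedy)
print(lazy)
print(zero_or_more)
```

The `*` quantifier is allowed to match nothing at all, which is why `findall` reports empty strings at positions where no digit is present.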
00:06:13.520 | And then we're back to matching just one because it's one or more and we are limiting it to the
00:06:19.040 | minimum number of matches there. So keep that in mind and we'll quickly go over it again
00:06:24.880 | towards the end. We also have the once-or-none quantifier. So let's write a new test string
00:06:33.360 | here. So here we have a few words and we'd like to match all of the words. So what we can do
00:06:51.600 | is this. And here we're kind of matching all the words but there's this one in the middle where we
00:06:58.000 | have a hyphen in the middle. And this is something that will happen quite often. And ideally we also
00:07:06.000 | want to put good hearted as a single word. So we could do this. But then we're only matching that
00:07:19.680 | single good hearted part. So instead we add a once or none quantifier. Okay so now we're matching
00:07:28.720 | that word as well. Now if we also want to match the A because you can see here it's not matching
00:07:36.000 | because we're expecting at least two word characters because we have W here and W here.
00:07:44.480 | We can just add a zero or more quantifier onto the end there. And now we're getting all of our
00:07:50.000 | words together including the hyphenated words. We can also specify a specific quantity which we do
00:07:59.280 | like this. So here we are getting three word characters at a time. You can see here we're
00:08:07.040 | not specifying three characters that make up a word. We're just saying three characters. So
00:08:14.080 | here we're getting multiple matches for single words which is fine because we haven't specified
00:08:19.200 | that. We'll go over how to define word boundaries later. And we can also turn this into a range.
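The word-matching steps above can be sketched like this. The sentence and the exact patterns are assumptions reconstructed from the description (a single-letter word, a hyphenated word, and ordinary words); the patterns typed in the video are not in the transcript.

```python
import re

# Hypothetical sentence with a one-letter word and a hyphenated word.
text = "a good-hearted person works hard"

words = re.findall(r"\w+", text)              # splits "good-hearted" in two
hyphen_only = re.findall(r"\w+-\w+", text)    # requires a hyphen
optional_hyphen = re.findall(r"\w+-?\w+", text)   # hyphen is optional, but needs 2+ word chars
with_single = re.findall(r"\w+-?\w*", text)   # trailing part may be empty, so "a" matches
three_chars = re.findall(r"\w{3}", text)      # exactly three word characters at a time

print(with_single)
print(three_chars)
```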
00:08:27.040 | So let's go three to five. Okay so now we're matching a minimum of three characters and a
00:08:37.440 | maximum of five. Now you might have guessed this but we can actually just remove one of these
00:08:42.320 | numbers to get up to five. Now if this doesn't work for you just make sure that you
00:08:47.600 | are using the Python flavor because for other languages this might change and if you're on
00:08:54.160 | PCRE for example this won't work. So change back to Python.
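A sketch of the range quantifiers in Python, including the open-ended and lazy variants that the next few lines of the transcript go on to describe. The sentence is an assumed example.

```python
import re

# Hypothetical sentence, not the one used in the video.
text = "a good-hearted person works hard"

three_to_five = re.findall(r"\w{3,5}", text)   # between three and five, greedy
# {,5} means zero to five, so empty matches are filtered out here.
up_to_five = [m for m in re.findall(r"\w{,5}", text) if m]
three_or_more = re.findall(r"\w{3,}", text)    # three or more
lazy_range = re.findall(r"\w{3,5}?", text)     # lazy: stops at the minimum of three

print(three_to_five)
print(up_to_five)
print(three_or_more)
print(lazy_range)
```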
00:09:00.320 | And we can also do three or more as well. Now I think this is a good example with our
00:09:11.440 | lazy quantifier. So here we're matching anywhere from three up to five characters at a time.
00:09:19.280 | If we add a lazy quantifier it's always going to limit that as much as possible. So we're
00:09:23.680 | going to go down to three. Okay so you can see here that it's limiting how many characters
00:09:29.840 | it's including in there. It's getting lazy rather than greedy. Okay so let's write out a
00:09:36.400 | new example here. Okay so a few unexpected words are to be expected. So this is a good example of
00:09:47.280 | where we can use capture groups. So anything contained within round brackets will create a
00:09:56.400 | capture group. So a capture group is simply a fancy way of saying treat everything within these
00:10:01.840 | brackets as a single unit. So you can see here I only put these dots in as a filler but it actually
00:10:08.400 | matches because those dots mean anything. So three anythings in a row and this is matching
00:10:16.000 | basically anything. So we have all these matches here and it's treating those as a unit. So it's
00:10:25.120 | doing three anythings and matching that and then moving on to next three anythings. Now what we
00:10:31.920 | can do is we want to match unexpected and expected. So we want to match the word with or without its
00:10:39.760 | negative prefix. So we can add expected here. But here we're only getting expected we're not
00:10:49.120 | getting the un from unexpected. So we just add this. So now we're getting unexpected but we're
00:10:56.800 | not getting expected because we have specified okay we want un here. We want this capture group.
00:11:04.640 | So all we need to do is actually make this optional by adding a zero or one quantifier.
00:11:12.400 | And there we go we are now capturing unexpected and expected.
00:11:16.480 | So let's have a quick go at this in Python see what it looks like.
00:11:21.440 | Okay and we run this and now we find that we are only seeing un which is probably not what
00:11:36.320 | we're expecting. The reason this happens is that findall returns the capture groups when the pattern contains any,
00:11:43.120 | which is exactly what we have here. So what we can do is modify this capture group to
00:11:50.640 | make it a non-capturing group while still maintaining this behavior of zero or one.
00:11:57.840 | So all we do to do that is add a question mark inside followed by a colon and then here we are
00:12:06.000 | capturing everything again. So that's just a little bit of a strange behavior to watch out for.
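The `findall` behaviour described above can be reproduced with the example sentence from this section. This is a minimal sketch; the exact pattern typed in the video is assumed from the description.

```python
import re

text = "A few unexpected words are to be expected."

# With a capturing group, findall returns the group's content, not the whole match.
capturing = re.findall(r"(un)?expected", text)

# With a non-capturing group (?:...), findall returns the whole match again.
non_capturing = re.findall(r"(?:un)?expected", text)

print(capturing)
print(non_capturing)
```

The empty string in the first result is the second match, where the optional group did not participate.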
00:12:12.880 | Now we can also add OR logic to our capture groups.
00:12:19.120 | So maybe we want to capture anything where we are saying
00:12:25.440 | expected with a negative prefix and that can either be not expected or unexpected.
00:12:32.720 | And we want both of these to match.
00:12:34.240 | Now to do this we actually just add a pipe into our capturing group and then we add not like so.
00:12:51.680 | And now we are matching both not expected and unexpected.
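A sketch of the OR logic inside a group. The sentence is made up, and the trailing space inside the `not ` alternative is an assumption about how the pattern was written in the video.

```python
import re

# Hypothetical sentence containing all three variants.
text = "It was not expected, it was unexpected, it was expected."

# Either "un" or "not " must come before "expected"; the group is non-capturing
# so that findall returns whole matches.
negated = re.findall(r"(?:un|not )expected", text)

print(negated)
```

The bare "expected" at the end is not returned, because the group is no longer optional.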
00:12:59.120 | Okay so that's it for capture groups and let's move on to character sets.
00:13:04.240 | So the syntax for character sets is kind of similar to the syntax for capture groups in that
00:13:12.640 | we use brackets but this time they're square brackets instead. And you can see these as kind
00:13:18.640 | of like a list. So anything we put within here will be treated as a character to match.
00:13:26.320 | But unlike capture groups it's not treating them all
00:13:33.840 | as a unit. So if we put un in here it's actually just matching either u or n. And if we put unexpected
00:13:42.640 | it's not going to match unexpected as a unit, it's just going to match each one of those characters
00:13:48.480 | within those square brackets. So let's return to our earlier example.
00:13:56.640 | So earlier on we were matching all of the digits in our string.
00:14:03.920 | So what we could do is write out all of the digits like this and we get the exact same effect.
00:14:14.800 | Obviously this is quite long so what we can do instead is write this with a dash in the middle.
00:14:22.800 | And this is any character within the range of 0 to 9. We can also add letters to this.
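Character sets and ranges can be sketched as follows. The test string is an assumption, and the last pattern uses the escaped hyphen that the transcript explains just below.

```python
import re

# Hypothetical string with letters, digits and hyphens.
text = "abc-1000-xyz"

listed = re.findall(r"[0123456789]", text)    # every digit written out
ranged = re.findall(r"[0-9]", text)           # the same thing as a range
alnum = re.findall(r"[0-9a-z]", text)         # digits and lowercase letters
whole = re.findall(r"[0-9a-z\-]+", text)      # escaped hyphen plus a quantifier

# A set matches single characters, not the sequence as a unit.
u_or_n = re.findall(r"[un]", "unexpected")

print(listed == ranged)
print(whole)
print(u_or_n)
```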
00:14:30.240 | So a to z for example. And you might also think okay we can also add these hyphens in right?
00:14:39.440 | But obviously we are using these hyphens to define our ranges. So in order to add a hyphen
00:14:46.240 | in here we need to use a backslash to escape it. And now we are matching the full string. And
00:14:52.240 | if we want to match the full string as a whole of course we just add our quantifier. Now let's
00:14:57.520 | move on to boundaries. So I'm just going to write out a new string for this.
00:15:10.320 | Okay so here I want to show you the start-of-string and end-of-string boundaries. So start-of-string is
00:15:21.680 | using the caret character. So here if we put ifit it's only going to match ifit at the start here.
00:15:29.280 | It's not going to match ifit here as well. And if we remove the caret it does. Okay so
00:15:35.040 | we add the caret to specify that we only want to match at the very start of our string.
00:15:43.040 | Now the equal and opposite of the start-of-string character is the end-of-string character.
00:15:51.440 | And that is a dollar symbol. So let's rewrite this.
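A minimal sketch of the two anchors, using an assumed string that starts and ends with the same word:

```python
import re

# Hypothetical string: "example" appears at the very start and the very end.
text = "example text with another example"

unanchored = re.findall(r"example", text)   # both occurrences
at_start = re.search(r"^example", text)     # caret: only at the start of the string
at_end = re.search(r"example$", text)       # dollar: only at the end of the string

print(len(unanchored))
print(at_start.start())
print(at_end.start())
```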
00:15:55.520 | Okay and we want to look for example. And here with the dollar symbol we only match the final
00:16:13.440 | example rather than both of them. So you can see there. I'm going to go back to one of the earlier
00:16:19.840 | examples again now. Okay so here I also want to show you the word boundary. So the best way to
00:16:38.160 | identify word boundary is not by using for example white space. Because yes that does work in a lot
00:16:45.440 | of cases but it doesn't work if we have a comma, full stop, hyphen or anything like this. So what
00:16:53.760 | we can do instead is use backslash b. And this identifies every single word boundary within our
00:17:01.680 | text as you can see from the pink lines. So then we can use that to capture any of our words. And
00:17:08.400 | now quite easily we've captured every single word and we're pulling them out in a more efficient
00:17:13.520 | way than if we had tried to write, you know, \s or if we'd have gone with a grouping like this and
00:17:21.120 | added all these different things. All we need to do is add a word boundary. Okay so now we'll move
00:17:28.640 | on to some of the what I think are the more interesting and definitely a bit more advanced
00:17:33.760 | methods in Regex. So the first of those is the look ahead and look behind assertions.
00:17:41.680 | So if we have this string here
00:17:46.160 | we have two hello worlds. One of them is preceded by a one and a colon. The other
00:17:55.280 | preceded by two and a colon. Now what if we want to match hello world but we only want to match
00:18:04.000 | hello world if it is preceded by a one and a colon. But we don't want to include that one and
00:18:10.080 | a colon within our pattern because if we want to do that we would just write this. But this will
00:18:15.600 | return the entire string. So I'll show you over here. Okay so we're returning the full string.
00:18:30.960 | What if we only actually want to pull out this hello world? Well we could go for hello world
00:18:37.680 | but then we're returning both. We don't want to do that we only want the first one. So to do this
00:18:43.680 | we use a look behind assertion. So this means that we are looking behind our pattern which means
00:18:52.160 | anything preceding it and we are asserting that there is this other pattern there. So we do this
00:18:59.760 | with this pattern here. Okay and anything we place in between this equal sign and this
00:19:08.960 | closing bracket is included within our assertion pattern.
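A sketch of the lookbehind assertion in Python. The test string is reconstructed from the description (two "hello world"s, one preceded by "1: " and one by "2: ") and is an assumption, as is the exact spacing.

```python
import re

# Assumed test string based on the description above.
text = "1: hello world, 2: hello world"

with_prefix = re.findall(r"1: hello world", text)   # prefix is part of the match
plain = re.findall(r"hello world", text)            # matches both occurrences

# Lookbehind: assert "1: " comes before the match without including it.
behind = re.search(r"(?<=1: )hello world", text)

print(with_prefix)
print(len(plain))
print(behind.group(), behind.start())
```

Note that Python's `re` module requires the pattern inside a lookbehind to have a fixed width.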
00:19:12.480 | Okay so now we are matching just this first hello world. So if we go and take this and put it into
00:19:22.000 | our code here we will return just the first hello world. Now on the other hand maybe we want to
00:19:32.720 | match something that comes after our pattern and to do that we use a look ahead assertion which
00:19:38.640 | as you've probably guessed is basically exactly the same but on the other side.
00:19:43.200 | It does use a slightly different syntax but other than that there's really no difference. So in our
00:19:50.720 | case we're going to search for this comma. So in between the equal sign and the closing bracket
00:19:57.520 | here that's where we put our pattern and here again we're matching this hello world.
00:20:03.680 | I'm going to put this in python and of course we'll just get the exact same thing.
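A sketch of the lookahead in Python, alongside the negative forms that the transcript covers next. The test string is the same assumed one; match start positions are collected to show which "hello world" is found.

```python
import re

# Assumed test string: the first "hello world" is followed by a comma.
text = "1: hello world, 2: hello world"

def starts(pattern):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

ahead = starts(r"hello world(?=,)")        # followed by a comma
not_ahead = starts(r"hello world(?!,)")    # NOT followed by a comma
not_behind = starts(r"(?<!1: )hello world")  # NOT preceded by "1: "

print(ahead, not_ahead, not_behind)
```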
00:20:08.880 | Now on the other hand maybe we don't want to assert that something is in front or behind
00:20:16.240 | our pattern we actually want to assert that something is not there. So what we can do
00:20:21.360 | is we can make this a negative look ahead by replacing this equal sign with an exclamation
00:20:27.520 | mark. Now we are looking for the hello world that is not followed by a comma which is obviously this
00:20:34.240 | one and again with the look behind we just modify that as well. So whereas the look behind looked
00:20:45.120 | like this we just again remove the equal sign and replace it with an exclamation mark
00:20:51.840 | and then we get the second hello world. Okay so that's it for the assertions let's move on to
00:20:58.720 | modifiers. So we can actually see we have a few modifiers here and these are essentially ways of
00:21:06.800 | modifying the behavior of our entire regular expression. Now obviously python doesn't have
00:21:13.200 | this little set regex options here so what we do instead is we can either do an inline modifier
00:21:22.080 | like this one and let me just give you a good example quickly.
00:21:27.360 | So we'll just write out a string that includes a newline character in the middle.
00:21:39.360 | Okay and we're going to match character and then anything following that character. Now if you
00:21:46.000 | remember this anything meta character does not actually match anything it matches anything
00:21:51.120 | except from newlines. So in this case it is not matching here because we are expecting newline.
00:21:58.000 | Now if we open this we can see that we have this single line dot matches newline. So we add that
00:22:04.160 | and now it's changed the behavior of our regex and the anything meta character
00:22:10.640 | also matches newline characters. Now let's remove it from here and we can add it inline like this.
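A minimal sketch of the single-line modifier in Python, shown both inline and as the equivalent flag argument. The string and the word being matched are assumptions for illustration.

```python
import re

# Hypothetical string with a newline character in the middle.
text = "first line\nsecond line"

# Without the modifier, "." stops at the newline, so "line" followed by
# at least one more character never matches here.
default = re.findall(r"line.+", text)

# Inline modifier: (?s) at the start makes "." match newlines too.
inline = re.findall(r"(?s)line.+", text)

# The same behaviour through the flags argument.
flagged = re.findall(r"line.+", text, re.DOTALL)

print(default)
print(inline)
print(inline == flagged)
```

`re.S` is a shorter alias for `re.DOTALL`, which lines up with the `s` letter used by the inline form.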
00:22:18.960 | So here we're just adding the s within this global modifying function and we can also add
00:22:28.320 | other modifiers as well if you want so you just all you need to do is add the letter that represents
00:22:33.840 | that global modifier and add it within those brackets. Now if we take that over to python
00:22:43.040 | and we add this in, so here is our inline modifier. Yep, it works, that's great. If we get rid of this
00:22:59.760 | it doesn't match anymore and that's what we would expect. Now in Python we can also add modifiers
00:23:07.280 | within the function call itself, so what we do is we add re. and then the capital letter of the
00:23:14.720 | modifier flag, and there we go, it matches again. So there's a few different ways that you can do
00:23:21.600 | it in Python. Now let's move over to conditionals. This is the last one we're going to work through
00:23:28.240 | and these are probably a little more complex, in my opinion, to actually read and understand.
00:23:34.400 | So I'm going to use a few different examples here.
00:23:39.040 | I'm just going to quickly make this up.
00:23:44.800 | Okay, so we have these three lines; each of them has something in common. Now what I want to do
00:23:54.400 | is enter a condition within a capture group which is either true or not. Now I don't really want to
00:24:04.320 | specify that I need this condition within my regex, because if it isn't there I want to search for
00:24:10.000 | something else, and to do that we add this group here, and this is basically our if-else group.
00:24:19.760 | So I'm just going to put I here for now, I'll explain that in a moment, and this is our if: if the
00:24:25.120 | condition is true then we also want to search for this; if the condition is not true then just search
00:24:32.880 | for this instead. So our condition here is within a capturing group and this token here refers to
00:24:44.240 | the index of our capture group. So in our case we only have one capture group, and you can see here
00:24:50.960 | it even highlights first capturing group, so this actually needs to be one, and then that essentially
00:24:58.480 | links our condition within this capturing group to this if-else statement. And we can see that now,
00:25:06.400 | okay, this condition is not true because we haven't written condition anywhere, so it's going into the
00:25:11.600 | if clause and it's saying okay, if it's true, which it isn't, search for this. Right, but it isn't true,
00:25:18.720 | so we are actually only searching for bye. So it is producing a match because all we need to match
00:25:24.400 | is bye. But in reality we actually do want to search for a condition, which is going to be hello, and
00:25:33.120 | here now we are matching two things. We're matching this because hello is true here, and if hello is
00:25:40.960 | true our if-else statement says okay, now we need to match space world, which we do right here. And
00:25:49.200 | on this line, just like before, we're finding okay, hello is not true, it doesn't say hello here, so this
00:25:57.120 | is not what we want to look for; we actually want to look for this, which is our else statement, and
00:26:02.240 | we do in fact have bye here, so that again does match. So, I mean, that's everything on the regular
00:26:11.520 | expressions, and I just want to quickly go through what the difference is between re.match, re.search
00:26:17.680 | and re.findall in Python. So let's remove these. So I'm just going to write a string very quickly.
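Before the comparison of the three functions, the conditional from the previous section can be sketched in Python. The three test lines and the exact pattern are assumptions reconstructed from the description: an optional first capture group `(hello)`, then `(?(1) world|bye)`, meaning "if group 1 matched, require ` world`, otherwise require `bye`".

```python
import re

# Assumed pattern: group 1 is the condition; (?(1)yes|no) is the if-else.
pattern = r"(hello)?(?(1) world|bye)"

# Hypothetical test lines, not the ones typed in the video.
lines = ["hello world", "bye now", "hello there"]

results = []
for line in lines:
    m = re.search(pattern, line)
    # Store the matched text, or None when neither branch can match.
    results.append(m.group() if m else None)

print(results)
```

In the third line, "hello" is present but not followed by " world", and the else branch finds no "bye", so there is no match at all.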
00:26:29.840 | Okay, we're just going to use this as our example. Now if we do re.match, you remember before we had
00:26:38.000 | this start-of-string character; re.match essentially is like putting this character in front of
00:26:44.400 | whatever you type within it. So what I mean by that is if we do re.match hello
00:26:54.880 | we will get a match. So we also need to put .group after we use re.match or re.search,
00:27:02.400 | something to be aware of. Okay, and yeah, okay, fine, we would expect that because we're searching
00:27:07.280 | through and yep, there's a hello here, there's a hello here, of course it's going to match a hello,
00:27:10.480 | that's fine. But if we put world we don't match anything. Okay, so it's a NoneType, which means
00:27:18.560 | nothing has been matched, and the reason for that is that match effectively adds the start-of-string
00:27:25.680 | token in there. So if we put world at the start here this would work. Okay, but obviously before
00:27:36.640 | we didn't have it there so it didn't work. So that's what re.match does. It also just returns
00:27:44.000 | you one match, unlike findall where, you remember, we were getting a list of matches. Now re.search doesn't
00:27:51.680 | specify that we only need to look at the first part of the string; instead re.search looks
00:27:58.640 | through the whole thing. So if we search world we actually do get a response because we're not
00:28:07.760 | specifying it needs to be here. Okay, so that's great, but you'll also notice that we are only
00:28:12.880 | returning one thing here and that will not change if we add hello. Okay, we're still just returning
00:28:22.080 | one item. So what re.search does is it comes through here, it searches the whole string, but
00:28:28.560 | it only returns the first instance. So it gets to here, it says okay, I found hello, and then it
00:28:34.000 | returns that match; it doesn't go any further and find anything else. And that's where findall
00:28:41.200 | is a little bit different. So with findall we can't use group here, we just print out x; it does go through
00:28:51.040 | and find everything. So that's it for this video. I hope it's been useful. We've been, you know,
00:28:58.640 | through a lot of regex, so don't worry if it blows your mind a bit if you're new to this;
00:29:04.480 | it is quite a lot. But nonetheless regex is super important. I would definitely recommend
00:29:11.760 | getting familiar with it if this is new to you. It's an incredibly useful skill no matter
00:29:17.360 | what you are specializing in; as long as you code you're probably going to use regex. So that's the
00:29:22.720 | video on regex. I hope you enjoyed, and as always thank you for watching. Bye.